If the runtime journal is opened, we will anyway write journal entries
to the runtime journal, even if the persistent journal is writable.
Hence, we need to flush the runtime journal file later.
If an initrd has an empty or uninitialized /etc/machine-id file,
then PID1 write a valid machine ID. So, the logic is important only on
soft-reboot. Let's mention that explicitly.
Follow-up for 16718dcf78.
This effectively reverts ba540e9f1c.
https://github.com/systemd/systemd/pull/32915#discussion_r1608258136
> In many cases we allow --root=/ as a mechanism for forcing an "offline" mode,
> while still operating on the root dir. if we do the getenv_for_pid() thing
> below I'd claim this is very much an "online" operation, and hence --root=/
> should really disable that.
Before:
/etc/kernel/install.conf:6: Unknown key name 'asdf' in section '(null)', ignoring.
After:
/etc/kernel/install.conf:6: Unknown key 'asdf', ignoring.
Also make the message a bit better.
If both literal and signed PCR bindings are not used then we won't
determine a PCR bank to use, and hence we shouldnt attempt to serialize
it either.
Hence, if the bank is zero, skip serialization.
(And while we are at it, also skip serialization of the primary
algorithm if not set, purely to make things systematic).
[This effectively results in little change, as previously we'd then
seralize a json "null", while now we simply won't genreate the field]
We so far derived the PCR bank to use from the PCR values specified fr
literal PCR binding. However, when that's not used then we left the bank
uninitialized – which will break if signed PCR binds are used (where we
need to pick a bank too after all).
Hence, let's explicitly pick a bank to use if literal PCR values are not
used, to make things just work.
Fixes: #32946
In varlink.c we generally do not make failing callback functions fatal,
since that should be up to the app. Hence, in case of varlinkctl (where
we want failures to be fatal), make sure to propagate the error back
explicitly.
Before this change a failing call to "varlinkctl --more call …" would result in
a zero exit code. With this it will correctly exit with a non-zero exit
code.
When running in LXC with AppArmor we'll most likely get an error when creating
a network namespace due to a kernel regression in < v6.2 affecting AppArmor,
resulting in denials. Like other tests, avoid failing in case of permission
issues and handle it gracefully.
As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:
https://github.com/torvalds/linux/blob/master/fs/locks.c
F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.
Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.
This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.
This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.
The integration tests in the Debian CI fail due to this issue:
https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
Currently, when service_set_main_pidref() is called
without specifying start_timestamp, exec_status_start()
always uses dual_timestamp_now(). This is not ideal,
though, as when the main pid changes halfway due to
e.g. sd_notify + MAINPID=, it's definitely spurious.
Coverity gets confused since the iterator change, so add an
assert to indicate that this is allocated if n_old_groups is > 0
CID#1545922
Follow-up for 125cca1b51
Currently, if proc_mounted() != 0, some functions
propagate -ENOENT while others return -EBADF.
Let's make things consistent, by introducing
a static inline helper responsible for finding out
the appropriate errno.
Follow-up for ade0789fab
The change in behavior was partly intentional, as I think
if both --wait and --pty are used, manually disconnecting
from PTY forwarder should not result in systemd-run exiting
with "Finished with ..." log. But we should check for
--wait here.
Closes#32953
Nowadays the tmpfs where the final shutdown phase
is initiated has got its own name.
Plus, "Returning to initrd" sounds spurious anyway,
as we're not returning to the initial root tmpfs
as seen by the kernel.
Fixes https://github.com/systemd/systemd/issues/28514.
Quoting https://github.com/systemd/systemd/issues/28514#issuecomment-1831781486:
> Whenever PAM is enabled for a service, we set up the PAM session and then
> fork off a process whose only job is to eventually close the PAM session when
> the service dies. That services we run with service privileges, both to
> minimize attack surface and because we want to use PR_SET_DEATHSIG to be get
> a notification via signal whenever the main process dies. But that only works
> if we have the same credentials as that main process.
>
> Now, if pam_systemd runs inside the PAM stack (which it normally does) it's
> session close hook will ask logind to synchronously end the session via a bus
> call. Currently that call is not accessible to unprivileged clients. And
> that's the part we need to relax: allow users to end their own sessions.
The check is implemented in a way that allows the kill if the sender is in
the target session.
I found 'sudo systemctl --user -M "zbyszek@" is-system-running' to
be a convenient reproducer.
Before:
May 16 16:25:26 x1c systemd[1]: run-u24754.service: Deactivated successfully.
May 16 16:25:26 x1c dbus-broker[1489]: A security policy denied :1.24757 to send method call /org/freedesktop/login1:org.freedesktop.login1.Manager.ReleaseSession to org.freedesktop.login1.
May 16 16:25:26 x1c (sd-pam)[3036470]: pam_systemd(login:session): Failed to release session: Access denied
May 16 16:25:26 x1c systemd[1]: Stopping session-114.scope...
May 16 16:25:26 x1c systemd[1]: session-114.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd[1]: Stopped session-114.scope.
May 16 16:25:26 x1c systemd[1]: session-c151.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd-logind[1513]: Session c151 logged out. Waiting for processes to exit.
May 16 16:25:26 x1c systemd-logind[1513]: Removed session c151.
After:
May 16 17:02:15 x1c systemd[1]: run-u24770.service: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopping session-115.scope...
May 16 17:02:15 x1c systemd[1]: session-c153.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: session-115.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopped session-115.scope.
May 16 17:02:15 x1c systemd-logind[1513]: Session c153 logged out. Waiting for processes to exit.
May 16 17:02:15 x1c systemd-logind[1513]: Removed session c153.
Edit: this seems to also fix https://github.com/systemd/systemd/issues/8598.
It seems that with the call to ReleaseSession, we wait for the pam session
close hooks to finish. I inserted a 'sleep(10)' after the call to ReleaseSession
in pam_systemd, and things block on that, nothing is killed prematurely.
We have the following timestamp status:
$ systemctl show systemd-fsck-root.service | grep InactiveExitTimestamp
InactiveExitTimestamp=Thu 2023-11-02 12:27:24 CET
InactiveExitTimestampMonotonic=15143158
$ systemctl show | grep UserspaceTimestamp
UserspaceTimestamp=Thu 2023-11-02 12:27:25 CET
UserspaceTimestampMonotonic=15804273
i.e. UserspaceTimestamp is before InactiveExit of systemd-fsck-root.service.
This is fine, but on display, we'd subtract those values and print a huge
negative value bogusly:
$ build/systemd-analyze critical-chain systemd-remount-fs.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
systemd-remount-fs.service +137ms
└─systemd-fsck-root.service @584542y 2w 2d 20h 1min 48.890s +45ms
└─systemd-journald.socket
└─system.slice
└─-.slice
In fact, list_dependencies_print() already had a branch where the check that
'times->activating > boot->userspace_time', but it didn't cover all cases. So
make it cover both branches, and also change to '>=', since it's fine if
something happened with the same timestamp.
With the patch:
$ build/systemd-analyze critical-chain systemd-remount-fs.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
systemd-remount-fs.service +42ms
└─systemd-fsck-root.service
└─systemd-journald.socket
└─system.slice
└─-.slice
Fixes https://github.com/systemd/systemd/issues/17191.
When running inside an LXC container the 'su' process will not be part of
any unit or slice.
manager_get_user_by_pid() which was used until v255 (included) does not fail
if it cannot find a unit/slice, but simply returns 'not found'. Do the same
in manager_get_session_by_pidref().
This was not detected as Semaphore CI does not reboot the testbed before
the logind test, so the session is started by the old logind from the base
distro, instead of the one being tested.
Follow-up for 8494f562c8
Follow-up for 5099a50d43
Fixes https://github.com/systemd/systemd/issues/32929
If machine ID is previously stored at /run/machine-id, then let's reuse
it. This is important on switching root and /etc/machine-id was previously
a mount point.
Fixes#32908.
Fixes a bug introduced by 1ddb263d21.
Note, this requires the previous two commits, and cannot backport without them.
Note, before the previous commit, the use-after-free could be triggered
only by Rename() DBus method, and could not by RenameImage(), as we did not
cache Image object when RenameImage() method is called. And machinectl
always uses RenameImage(). Hence, the issue could be triggered only when
Rename() DBus method is explicitly called by e.g. busctl.
With the previous commit, the Image object passed to the function is
always cached. Hence, the issue could be triggered even with machinectl
command, and this fix is important.
Previously, Image objects were only cached when reading properties or
methods in the org.freedesktop.machine1.Image interface are called.
This makes that, when a method in the main interface (org.freedesktop.machine1)
for an image is called, also acquire the Image object from the cache,
and if not cached, create Image object and put into the cache, like we
do for org.freedesktop.machine1.Image.
Otherwise, if some properties of an image are updated by methods in the main
interface, e.g. MarkImageReadOnly(), the changes do not applied to the cached
Image object, and subsequent read of proerties through the interface for the
image, e.g. ReadOnly property, may provide outdated values.
Follow-up for 1ddb263d21.
Fixes#32888.
Otherwise, ReadOnly DBus property in org.freedesktop.machine1.Image or
org.freedesktop.portable1.Image will not be updated by MarkReadOnly DBus
method.
The rationale is similar to 40e1f4ea74.
Currently, we only pass TTYPath=/dev/pts/... to
the transient service spawned by systemd-run.
This is a bit problematic though, when ExecStartPre=
or ExecStopPost= is used. Since when these control
processes get to run, the main process is not yet
started/has already exited, hence the slave suffers
from the same vhangup problem as the mentioned commit.
By passing the slave fd in, the service manager will
hold the fd open as long as the service is alive.
Fixes#32916
"norecovery" was deprecated for btrfs in
74ef00185e
and removed in
a1912f7121.
Let's drop our assumption that btrfs supports "norecovery" and first query for the
new name of the option followed by querying for the old name.
Fixes a bug introduced by 8d01e44c1f.
If a .netdev file for a wireguard interface requests to configure
routes for the interface, the routes were removed during configuring
another interface.
Fixes#32859.
For manager test runs, the generator output paths are located in
/tmp, which means that if we mount a private /tmp for generators,
we lose all the generated units (actually the generators will just
fail because the directories don't exist, but if they did exist,
we'd still lose all the units).
Let's avoid the problem by skipping the private /tmp for manager
test runs. This also avoids any possible privilege issues with
mounting a private /tmp that might happen in this scenario.
Currently, during soft-reboot, some services may survive,
but their associated credential mounts are dropped.
Let's instead preserve them, as discussed.
Follow-up for 3c0a1b1e70
Before this commit, DefaultDependencies=no is set in
mount_add_extras(). However, when generating mount units
from /proc/self/mountinfo, we don't have a unit in memory
yet, and mount_setup_new_unit() doesn't call into
mount_add_extras().
Fixes#32838
Locking of the tty device and then /dev/console was added to synchronize
vconsole-setup with other writers to the console. But it turns out that often
the locking doesn't work and we carved out various cases where we ignore
failure:
- lack of permissions (in the user manager)
- missing device node
It turns out that there's at least one more failure mode: we get -EIO when the
console is (mis-)configured to point to an invalid device. E.g. in
rhbug#2273069 the reporter has a VM in Proxmox without a virtual console
configured and has 'console=tty console=ttyS0' on the kernel cmdline. I
couldn't reproduce this under libvirt, but failure with EIO has been reported
by at least four users in #30501.
Note that in systemd-vconsole-setup we report this is a hard failure, while
in the manager, we only do a debug line. So it's possible that the failure
also occured there, causing the rest of the setup of the tty to be skipped
without further notice.
Ignore the locking failure, since there's just too many ways it can fail. If we
proceed without a lock, we're back to the situation before we started locking,
which wasn't too bad. OTOH, skipping setup of the console is problematic for
users, and it seems better to try to do the setup without locking.
Fixes https://github.com/systemd/systemd/issues/30501,
https://bugzilla.redhat.com/show_bug.cgi?id=2273069.
Follow-up for 6b34871f5d.
The general idea is that the list of credentials to load can and will specify
credentials which actually aren't provided, so a warning is too much. Let's
downgrade this to "info". If it turns out to be too noisy, we can downgrade
further in the future.
icmp6-util-linux.c sounds like a specialized implementation of the functions in
icmp6-util.c. But it's just a set of stub versions used in tests. Rename the
file to make this more obvious.
Coverity was complaining that we use the received packet size as a loop bound
without checking. This is indeed a bit iffy, because depending on how the host
is configured, the packet could be rather large. Let's refuse anything more
than the standard size early to prevent suspicious activity.
Resolves coverity CID#1534892, CID#1543949.
Currently, on soft-reboot, /run/credentials/@system is unmounted
because it has DefaultDependencies=yes and as such will have
Conflicts=umount.target and Before=umount.target. Let's make sure
credential mounts survive soft-reboot by implying DefaultDependencies=no
for credential mounts.
The test-event test seems to be taking quite a bit more time than
the other 'simple tests', which usually complete in < 1s. In case
of a slower or loaded machine the default 30s timeout is not enough.
RFC 4862 Section 5.5.3, bullet e, sub-bullet 3 applies to existing
addresses, i.e. when address_get() returns success. If the address is
new (i.e. address_get() fails), then we should not be adding 2 hours to
the lifetime_valid_usec. Instead, use the valid_lifetime from the RA's
Prefix Information Option.
This change allows v6LC.3.2.2 to pass. Also verified all v6LC3.2.* tests
pass. This covers all the v6LC tests from Group2: Router Advertisement
Processing and Address Lifetime.
Fixes#32652.
Previously, RA option fields were being ignored when the Router Lifetime
value was zero. Remove this logic to be compliant with RFC4861.
Extract from: https://www.ietf.org/rfc/rfc4861.html#section-4.2, p.21,
first paragraph:
The Router Lifetime applies only to
the router's usefulness as a default router; it
does not apply to information contained in other
message fields or options.
This affected IPv6 Conformance test:
v6LC.2.2.15: Router Advertisement Processing, Reachable Time.
Fixes#31842.
Co-authored-by: Matt Muggeridge <Matt.Muggeridge@hpe.com>
If we destroy both an event loop and a curl contect object at the same
time, then we get into this weird situation where curl wants us to
reconfigure a timout event source right before destruction, which
sd-event will refuse however, since it is already being shutdown.
Hence, catch that and simply don't bother adjusting the timeout, since
we cannot get back from there anyway.
A fixed name is too rigid, let's give users the ability to define
custom drop-in names which at the same time also allows defining
multiple dropins per unit.
We use ~ as the separator because:
- ':' is not allowed in credential names
- '=' is used to separate credential from value in mkosi's --credential
argument.
- '-' is commonly used in filenames
- '@' already has meaning as the unit template specifier which might be
confusing when adding dropins for template units
dangerous_hardlinks() -> hardlinks_protected(),
and the meaning of the function is now in line
with fs.protected_hardlinks value.
Plus, We ship 50-default.conf where the sysctl
is enabled. Mention it in the comment.
Previously, even if a DNS server is in the acquired prefix, the route to the
server might have gateway address.
This makes the prefix route, which is always configured, is also handled
as same as static routes, and do not use any gateway if the prefix route
is the most suitable route to access the destination.
The same change is also applied to route to NTP servers and semi-static
routes.
Fixes a regression introduced by 0ce86f5eeb.
Fixes#32715.
Fixes a regression introduced by e44f06065b.
After the offending commit, if a boot ID suffixed with an offset is
specified to --boot=, the boot ID was ignored.
This fixes the issue.
To fix the issue, this merges journal_find_boot_by_id() and
journal_find_boot_by_offset().
Otherwise, if several matches already set, then the first seek to head
or tail may move the cursor to an invalid place, hence they provide
wrong ID(s). Also, reading journal after calling these function may
provide unexpected data.
Currently, the caller does not install any matches before calling the
functions, and does not read any journal entry after journal_get_boots()
succeeds or journal_find_boot_by_offset() succeeds with 0. Hence, this
should not change any behavior. Just for safety.
- rename to parse_id_descriptor(), to make it usable for other kind of
ID later.
- add missing assertions,
- prefix arguments for storing results with 'ret_',
- drop unnecessary 'else'.
Follow-up for f380473edf
This cleans up the code a bit. Also, before this commit,
if MemoryAvailable is set but show_memory_available
is false, and we have nothing else to output, empty
parenthesis is shown. This can be easily reproduced
on -.slice:
> systemctl status -- -.slice
> ...
> Memory: 1.8G ()
> ...
This commit adds the new varlink interface io.systemd.Machine at
/run/systemd/machine/io.systemd.Machine with a single method Register
It supports all combinations of RegisterMachine[WithSSH,WithNetwork] all
under the same method.
Use 'recommended' priority for the default compression library, to
indicate that it should be prioritized over the other ones, as it
will be used to compress journals/core files.
Also use 'recommended' for kmod, as systems will likely fail to boot
if it's missing from the initrd.
Use 'suggested' for everything else.
There is one dlopen'ed TPM library that has the name generated
at runtime (depending on the driver), so that cannot be added, as it
needs to be known at build time.
Also when we support multiple ABI versions list them all, as for the
same reason we cannot know which one will be used at build time.
$ dlopen-notes.py build/libsystemd.so.0.39.0 build/src/shared/libsystemd-shared-256.so
libarchive.so.13 suggested
libbpf.so.0 suggested
libbpf.so.1 suggested
libcryptsetup.so.12 suggested
libdw.so.1 suggested
libelf.so.1 suggested
libfido2.so.1 suggested
libgcrypt.so.20 suggested
libidn2.so.0 suggested
libip4tc.so.2 suggested
libkmod.so.2 recommended
liblz4.so.1 suggested
liblzma.so.5 suggested
libp11-kit.so.0 suggested
libpcre2-8.so.0 suggested
libpwquality.so.1 suggested
libqrencode.so.3 suggested
libqrencode.so.4 suggested
libtss2-esys.so.0 suggested
libtss2-mu.so.0 suggested
libtss2-rc.so.0 suggested
libzstd.so.1 recommended
Co-authored-by: Luca Boccassi <bluca@debian.org>
This allows code to declare "weak" dlopen() style deps via an ELF
section following the just added specification.
The idea is that any user of dlopen() will place ELF_NOTE_DLOPEN(…)
somewhere close which will synthesize the note.
Tools such as rpm/dpkg package builders as well as initrd generators
(such as dracut) can then automatically pick up these weak deps of
suggested dependencies for their purposes.
Co-authored-by: Luca Boccassi <bluca@debian.org>
If the file was removed by some other program, we should just go
to the next one without failing. item_do() is only used for recursive
globs instead of fixed paths so skipping on missing files makes sense
(unlike if the path was fixed where we should probably fail).
Fixes#32691 (hopefully)
Also adds three properties:
- VsockCid: the VSOCK CID of the VM
- SshAddress: the address of the VM in a format SSH can connect to
- SshPrivateKeyPath: the path to the SSH private key to use to connect
to the VM.
GetMachineSSHInfo is essentially a convenience method to query both the
SshAddress and SshPrivateKeyPath properties at once.
Currently test-aux-scope.service can get killed by the test before
it's had a chance to setup its signal handler. Make it Type=notify
to fix the race.
Fixes#32670 (hopefully)
Firstly, if we encounter an error when iterating over the directory, gather
the error but continue. This is unlikely to happen, but if it happens, then
it doesn't seem very useful to break the preset processing at a random
point. If we can't process a unit — too bad, but since we already might
have processed some units earlier, we might as well try to process the
remaining ones.
Secondly, add missing error codes for units that are in a bad state to the
exclusion list. Those, we report them in the changes list, but consider the
whole operation a success. (-ETXTBSY and -ENOLINK were missing.)
Thirdly, add a message generator for -ENOLINK.
Fixes https://github.com/systemd/systemd/issues/21224.
In the majority of cases, this is caused by
sleep_supported() returning error. Hence it's
very likely that it would fail again, so
the fallback is not really useful. Instead,
honor the --force option for these verbs.
Currently, SLEEP_NOT_ENOUGH_SWAP_SPACE (ENOSPC) is returned
on all sorts of error conditions. But one important case
that's worth differentiating from that is when the resume device
is manually specified yet missing.
Closes#32644
Follow-up for 34c3d57474
O_RDONLY is dropped when O_DIRECTORY is specified, since
it's unnecessary and even arguably confusing here, as
the dir is modified.