Currently, we don't explicitly call unit_freezer_thaw(),
but rely on the destructor to thaw the frozen unit on
return. This has several problems though, one of them
being that we ignore the return value of ThawUnit(),
which is something we really shouldn't do here,
since such failure can easily leave the whole system
in unusable state. Moreover, the logging is kinda messy,
e.g. homed might log "Everything completed" yet immediately
followed by "Failed to thaw unit". Instead, we should log
consistently and at higher level, to make things more
debuggable.
Therefore, let's step away from the practice. Plus,
make UnitFreezer object heap-allocated, to match
with existing unit_freezer_new() and allow us to
use NULL to denote that the freezer is disabled.
The test fails when running under sanitizers due to missing sanitizer
libraries. For now, let's skip the test until we can make the necessary
changes to run it under sanitizers.
We can only do a debug log from the library, so let's add that. Callers
probably want to provide some hint when that happens, but it's very unlikely so
not worth coding in every caller.
And drop the now-unnecessary handling in unit_dequeue_rewatch_pids().
fd_cloexec_many promised to report if work was done, but that code was
not effective, because it always reported true if any fds were open.
But no callers care about the return value, so let's just drop this.
This makes output a bit shorter and nicer. For us, shorter output is generally
better.
Also, drop unnecessary UINT64_C macros. The left operand is always uint64_t,
and C upcasting rules mean that it doesn't matter if the right operand is
narrower or signed, the operation is always done on the wider unsigned type.
Opening pidfds for non thread group leaders only works from 6.9 onwards with PIDFD_THREAD. On
older kernels or without PIDFD_THREAD pidfd_open() fails with EINVAL. Since we might read non
thread group leader IDs from cgroup.threads, we introduce and set CGROUP_NO_PIDFD to avoid
trying open pidfd's for them and instead use the pid as is.
If we are looking at a TPM1.2 event log the first log record will not be
the "EfiSpecIdEvent" but something else. Let's improve the log messages
about this, and say explicitly that this is likely not a TPM2.0 event
log.
If the ceck for the ACPI TPM2 table did not work we currently check if
the EFI TPM table exists to check if the firmware supports TPM2.
Specifically we check if
/sys/kernel/security/tpm0/binary_bios_measurements exists. But that's
not enough, since that also exists on TPM1.2 systems. Hence, let's also
check /sys/class/tpm/tpm0/tpm_version_major which should exist under
similar conditions and tells us the kernel's idea of the TPM version in
use.
I originally intended to read the signature of the
/sys/kernel/security/tpm0/binary_bios_measurements contents for this,
but this is not ideal since that file has tight access mode, and our TPM
availability check would thus not work anymore if invoked unpriv.
Follow-up for 4b33911581Fixes: #33077
Otherwise we'll close the device disarming it as side-effect of
watchdog_free_device(), which is not intended. Hence, let's close the fd
first explicitly leaving it armed.
Fixes: #33075
When we open a watchdog fresh we have never pinged it, hence reset the
ping timestamp explicitly, so that it is not only reset the first time
we open the device, but all times.
libbpf returns error codes from the kernel unmodified, and we don't understand
them so non-fatal ones are handled as hard errors.
Add a translation helper, and start by translating 524 to EOPNOTSUPP, which is
returned when nsresourced tries to use LSM BPF hooks that are not
implemented on a given arch (in this case, arm64 is misssing trampolines).
Fixes https://github.com/systemd/systemd/issues/32170
Mention in the warning message for a failed open on a to be removed file
why systemd-tmpfiles tried to open it.
Also open the file with the O_NOCTTY flag, since it should never become
the controlling terminal.
if machined exits while a machine is still running, we'll issue the
UnrefUnit() call on the unit. This quite likely will fail if during
shutdown the bus connection is already down. But that's no reason to
warn at all, since the ref count will implicitly be dropped if our side
disappears from the bus. Hence, downgrade to LOG_DEBUG in case of
connection problems.
On distros like SUSE where ssh config dropins in /usr are supported, there's no
need for a symlink in /etc/ssh/ssh_config.d/ that points to the dropin
installed somewhere in /usr (that is not reachable by ssh).
With 430cc5d3ab,
the value of GENHD_FL_NO_PART, previously named as GENHD_FL_NO_PART_SCAN,
is changed from 0x0200 to 0x0004. So, we need to check both flags.
Follow-up for 14631951ce
Before this commit, if WorkingDirectory= is empty or literally "-",
'simplified' is not populated, resulting in the ASSERT_PTR
in unit_write_settingf() below getting triggered.
Also, do not accept "-", so that the parser is consistent
with load-fragment.c
Fixes#33015
If the runtime journal is opened, we will anyway write journal entries
to the runtime journal, even if the persistent journal is writable.
Hence, we need to flush the runtime journal file later.
If an initrd has an empty or uninitialized /etc/machine-id file,
then PID1 write a valid machine ID. So, the logic is important only on
soft-reboot. Let's mention that explicitly.
Follow-up for 16718dcf78.
This effectively reverts ba540e9f1c.
https://github.com/systemd/systemd/pull/32915#discussion_r1608258136
> In many cases we allow --root=/ as a mechanism for forcing an "offline" mode,
> while still operating on the root dir. if we do the getenv_for_pid() thing
> below I'd claim this is very much an "online" operation, and hence --root=/
> should really disable that.
Before:
/etc/kernel/install.conf:6: Unknown key name 'asdf' in section '(null)', ignoring.
After:
/etc/kernel/install.conf:6: Unknown key 'asdf', ignoring.
Also make the message a bit better.
If both literal and signed PCR bindings are not used then we won't
determine a PCR bank to use, and hence we shouldnt attempt to serialize
it either.
Hence, if the bank is zero, skip serialization.
(And while we are at it, also skip serialization of the primary
algorithm if not set, purely to make things systematic).
[This effectively results in little change, as previously we'd then
seralize a json "null", while now we simply won't genreate the field]
We so far derived the PCR bank to use from the PCR values specified fr
literal PCR binding. However, when that's not used then we left the bank
uninitialized – which will break if signed PCR binds are used (where we
need to pick a bank too after all).
Hence, let's explicitly pick a bank to use if literal PCR values are not
used, to make things just work.
Fixes: #32946
In varlink.c we generally do not make failing callback functions fatal,
since that should be up to the app. Hence, in case of varlinkctl (where
we want failures to be fatal), make sure to propagate the error back
explicitly.
Before this change a failing call to "varlinkctl --more call …" would result in
a zero exit code. With this it will correctly exit with a non-zero exit
code.
When running in LXC with AppArmor we'll most likely get an error when creating
a network namespace due to a kernel regression in < v6.2 affecting AppArmor,
resulting in denials. Like other tests, avoid failing in case of permission
issues and handle it gracefully.
As per the documentation, EACCES is only returned when F_SETLK is
used, and only on some platforms, which doesn't seem to include
Linux:
https://github.com/torvalds/linux/blob/master/fs/locks.c
F_OFD_SETLK is documented to only return EAGAIN, and F_SETLKW/F_OFD_SETLKW
are blocking operations so this logic doesn't apply to them in the
first place.
Hence, only automatically convert EACCES into EAGAIN for F_SETLK
operations, and propagate the original error in the other cases.
This is important because in some cases we catch permission errors
and gracefully fallback, which is not possible if the original error
is lost.
This is an issue in practice because, due to a kernel bug present
before v6.2, AppArmor denies locking on file descriptors to LXC
containers. We support all currently maintained LTS kernels,
including v6.1, where despite a lot of effort and attempts over almost
a year, the bugfix still hasn't been backported, as it is complex and
requires large changes to AppArmor.
On affected kernels, all services running with PrivateNetwork=yes
fail and do not recover, instead of the normal behaviour of gracefully
downgrading to PrivateNetwork=no.
The integration tests in the Debian CI fail due to this issue:
https://ci.debian.net/packages/s/systemd/testing/arm64/46828037/
Currently, when service_set_main_pidref() is called
without specifying start_timestamp, exec_status_start()
always uses dual_timestamp_now(). This is not ideal,
though, as when the main pid changes halfway due to
e.g. sd_notify + MAINPID=, it's definitely spurious.
Coverity gets confused since the iterator change, so add an
assert to indicate that this is allocated if n_old_groups is > 0
CID#1545922
Follow-up for 125cca1b51
Currently, if proc_mounted() != 0, some functions
propagate -ENOENT while others return -EBADF.
Let's make things consistent, by introducing
a static inline helper responsible for finding out
the appropriate errno.
Follow-up for ade0789fab
The change in behavior was partly intentional, as I think
if both --wait and --pty are used, manually disconnecting
from PTY forwarder should not result in systemd-run exiting
with "Finished with ..." log. But we should check for
--wait here.
Closes#32953
Nowadays the tmpfs where the final shutdown phase
is initiated has got its own name.
Plus, "Returning to initrd" sounds spurious anyway,
as we're not returning to the initial root tmpfs
as seen by the kernel.
Fixes https://github.com/systemd/systemd/issues/28514.
Quoting https://github.com/systemd/systemd/issues/28514#issuecomment-1831781486:
> Whenever PAM is enabled for a service, we set up the PAM session and then
> fork off a process whose only job is to eventually close the PAM session when
> the service dies. That services we run with service privileges, both to
> minimize attack surface and because we want to use PR_SET_DEATHSIG to be get
> a notification via signal whenever the main process dies. But that only works
> if we have the same credentials as that main process.
>
> Now, if pam_systemd runs inside the PAM stack (which it normally does) it's
> session close hook will ask logind to synchronously end the session via a bus
> call. Currently that call is not accessible to unprivileged clients. And
> that's the part we need to relax: allow users to end their own sessions.
The check is implemented in a way that allows the kill if the sender is in
the target session.
I found 'sudo systemctl --user -M "zbyszek@" is-system-running' to
be a convenient reproducer.
Before:
May 16 16:25:26 x1c systemd[1]: run-u24754.service: Deactivated successfully.
May 16 16:25:26 x1c dbus-broker[1489]: A security policy denied :1.24757 to send method call /org/freedesktop/login1:org.freedesktop.login1.Manager.ReleaseSession to org.freedesktop.login1.
May 16 16:25:26 x1c (sd-pam)[3036470]: pam_systemd(login:session): Failed to release session: Access denied
May 16 16:25:26 x1c systemd[1]: Stopping session-114.scope...
May 16 16:25:26 x1c systemd[1]: session-114.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd[1]: Stopped session-114.scope.
May 16 16:25:26 x1c systemd[1]: session-c151.scope: Deactivated successfully.
May 16 16:25:26 x1c systemd-logind[1513]: Session c151 logged out. Waiting for processes to exit.
May 16 16:25:26 x1c systemd-logind[1513]: Removed session c151.
After:
May 16 17:02:15 x1c systemd[1]: run-u24770.service: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopping session-115.scope...
May 16 17:02:15 x1c systemd[1]: session-c153.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: session-115.scope: Deactivated successfully.
May 16 17:02:15 x1c systemd[1]: Stopped session-115.scope.
May 16 17:02:15 x1c systemd-logind[1513]: Session c153 logged out. Waiting for processes to exit.
May 16 17:02:15 x1c systemd-logind[1513]: Removed session c153.
Edit: this seems to also fix https://github.com/systemd/systemd/issues/8598.
It seems that with the call to ReleaseSession, we wait for the pam session
close hooks to finish. I inserted a 'sleep(10)' after the call to ReleaseSession
in pam_systemd, and things block on that, nothing is killed prematurely.
We have the following timestamp status:
$ systemctl show systemd-fsck-root.service | grep InactiveExitTimestamp
InactiveExitTimestamp=Thu 2023-11-02 12:27:24 CET
InactiveExitTimestampMonotonic=15143158
$ systemctl show | grep UserspaceTimestamp
UserspaceTimestamp=Thu 2023-11-02 12:27:25 CET
UserspaceTimestampMonotonic=15804273
i.e. UserspaceTimestamp is before InactiveExit of systemd-fsck-root.service.
This is fine, but on display, we'd subtract those values and print a huge
negative value bogusly:
$ build/systemd-analyze critical-chain systemd-remount-fs.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
systemd-remount-fs.service +137ms
└─systemd-fsck-root.service @584542y 2w 2d 20h 1min 48.890s +45ms
└─systemd-journald.socket
└─system.slice
└─-.slice
In fact, list_dependencies_print() already had a branch where the check that
'times->activating > boot->userspace_time', but it didn't cover all cases. So
make it cover both branches, and also change to '>=', since it's fine if
something happened with the same timestamp.
With the patch:
$ build/systemd-analyze critical-chain systemd-remount-fs.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
systemd-remount-fs.service +42ms
└─systemd-fsck-root.service
└─systemd-journald.socket
└─system.slice
└─-.slice
Fixes https://github.com/systemd/systemd/issues/17191.
When running inside an LXC container the 'su' process will not be part of
any unit or slice.
manager_get_user_by_pid() which was used until v255 (included) does not fail
if it cannot find a unit/slice, but simply returns 'not found'. Do the same
in manager_get_session_by_pidref().
This was not detected as Semaphore CI does not reboot the testbed before
the logind test, so the session is started by the old logind from the base
distro, instead of the one being tested.
Follow-up for 8494f562c8
Follow-up for 5099a50d43
Fixes https://github.com/systemd/systemd/issues/32929
If machine ID is previously stored at /run/machine-id, then let's reuse
it. This is important on switching root and /etc/machine-id was previously
a mount point.
Fixes#32908.
Fixes a bug introduced by 1ddb263d21.
Note, this requires the previous two commits, and cannot backport without them.
Note, before the previous commit, the use-after-free could be triggered
only by Rename() DBus method, and could not by RenameImage(), as we did not
cache Image object when RenameImage() method is called. And machinectl
always uses RenameImage(). Hence, the issue could be triggered only when
Rename() DBus method is explicitly called by e.g. busctl.
With the previous commit, the Image object passed to the function is
always cached. Hence, the issue could be triggered even with machinectl
command, and this fix is important.
Previously, Image objects were only cached when reading properties or
methods in the org.freedesktop.machine1.Image interface are called.
This makes that, when a method in the main interface (org.freedesktop.machine1)
for an image is called, also acquire the Image object from the cache,
and if not cached, create Image object and put into the cache, like we
do for org.freedesktop.machine1.Image.
Otherwise, if some properties of an image are updated by methods in the main
interface, e.g. MarkImageReadOnly(), the changes do not applied to the cached
Image object, and subsequent read of proerties through the interface for the
image, e.g. ReadOnly property, may provide outdated values.
Follow-up for 1ddb263d21.
Fixes#32888.
Otherwise, ReadOnly DBus property in org.freedesktop.machine1.Image or
org.freedesktop.portable1.Image will not be updated by MarkReadOnly DBus
method.
The rationale is similar to 40e1f4ea74.
Currently, we only pass TTYPath=/dev/pts/... to
the transient service spawned by systemd-run.
This is a bit problematic though, when ExecStartPre=
or ExecStopPost= is used. Since when these control
processes get to run, the main process is not yet
started/has already exited, hence the slave suffers
from the same vhangup problem as the mentioned commit.
By passing the slave fd in, the service manager will
hold the fd open as long as the service is alive.
Fixes#32916
"norecovery" was deprecated for btrfs in
74ef00185e
and removed in
a1912f7121.
Let's drop our assumption that btrfs supports "norecovery" and first query for the
new name of the option followed by querying for the old name.
Fixes a bug introduced by 8d01e44c1f.
If a .netdev file for a wireguard interface requests to configure
routes for the interface, the routes were removed during configuring
another interface.
Fixes#32859.
For manager test runs, the generator output paths are located in
/tmp, which means that if we mount a private /tmp for generators,
we lose all the generated units (actually the generators will just
fail because the directories don't exist, but if they did exist,
we'd still lose all the units).
Let's avoid the problem by skipping the private /tmp for manager
test runs. This also avoids any possible privilege issues with
mounting a private /tmp that might happen in this scenario.