This adds support for systematically destroying connections in
pam_sm_session_open() even on failure, so that under no circumstances
unserved dbus connection are around while the invoking process waits for
the session to end. Previously we'd only do this on success, now do it
in all cases.
This matters since so far we suggested people hook pam_systemd into
their pam stacks prefixed with "-", so that login proceeds even if
pam_systemd fails. This however means that in an error case our
cached connection doesn't get disconnected even if the session then is
invoked. This fixes that.
Let's systematically avoid sharing cached busses between processes (i.e.
from parent and child after fork()), by including the PID in the field
name.
With that we're never tempted to use a bus object the parent created in
the child.
(Note this is about *use*, not about *destruction*. Destruction needs to
be checked by other means.)
Let's make use of the new DelegateSubgroup= feature and delegate the
/supervisor/ subcgroup already to nspawn, so that moving the supervisor
process there is unnecessary.
If we create a subcroup (regardless if the '.control' subgroup we
always created or one configured via DelegateSubgroup=) it's inside of
the delegated territory of the cgroup tree, hence it should be owned
fully by the unit's users. Hence do so.
We don't need to apply the journal/oomd xattrs to the subcgroups we add,
since those daemons already look for the xattrs up the tree anyway.
Hence remove this.
This is in particular relevant as it means later changes to the xattr
don#t need to be replicated on the subcgroup either.
This implements a minimal subset of #24961, but in a lot more
restrictive way: we only allow one level of subcgroup (as that's enough
to address the no-processes in inner cgroups rule), and does not change
anything about threaded cgroup logic or similar, or make any of this new
behaviour mandatory.
All this does is this: all non-control processes we invoke for a unit
we'll invoke in a subgroup by the specified name.
We'll later port all our current services that use cgroup delegation
over to this, i.e. user@.service, systemd-nspawn@.service and
systemd-udevd.service.
Let's clean up validation/escaping of cgroup names. i.e. split out code
that tests if name needs escaping. Return proper error codes, and extend
test a bit.
According to setfacl(1), "the character X stands for
the execute permission if the file is a directory
or already has execute permission for some user."
After this commit, parse_acl() would return 3 acl
objects. The newly-added acl_exec object contains
entries that are subject to conditionalized execute
bit mangling. In tmpfiles, we would iterate the acl_exec
object, check the permission of the target files,
and remove the execute bit if necessary.
Here's an example entry:
A /tmp/test - - - - u:test:rwX
Closes#25114
When encoding partition policy flags we allow parts of the flags to be
"unspecified" (i.e. entirely zeros), which when actually checking the
policy we'll automatically consider equivalent to "any" (i.e. entirely
ones). This "extension" of the flags was so far done as part of
partition_policy_normalized_flags(). Let's split this logic out into a
new function partition_policy_flags_extend() that simply sets all bits
in a specific part of the flags field if they were entirely zeroes so
far.
When comparing policy objects for equivalence we so far used
partition_policy_normalized_flags() to compare the per-designator flags,
which thus meant that "underspecified" flags, and fully specified ones
that are set to "any" were considered equivalent. Which is great.
However, we forgot to do that for the fallback policy flags, the flags
that apply to all partitions for which no explicit policy flags are
specified.
Let's use the new partition_policy_flags_extend() call to compare them
in extended form, so that there two we can hide the difference between
"underspecified" and "any" flags.
This is for the case when we fail to deserialize job ID.
In job_install_deserialized(), we also check the job type, and that is
for the case when we failed to deserialize the job.
Let's gracefully handle the failure in deserializing the job ID.
This is paranoia, and just for safety. Should not change any behavior.
When we fail to deserialize job ID, or the current_job_id is overflowed,
we may have jobs with the same ID.
This is paranoia, and just for safety.
Note, we already use hashmap_remove_value() in job_uninstall().
We translate 'all' to UNIT64_MAX, which has a lot more 'f's. Use the
helper macro, since a decimal uint64_t will always be >> than a hex
representation.
root@image:~# systemd-run -t --property CoredumpFilter=all ls /tmp
Running as unit: run-u13.service
Press ^] three times within 1s to disconnect TTY.
*** stack smashing detected ***: terminated
[137256.320511] systemd[1]: run-u13.service: Main process exited, code=dumped, status=6/ABRT
[137256.320850] systemd[1]: run-u13.service: Failed with result 'core-dump'.
If journal_file_data_payload() returns -EBADMSG or -EADDRNOTAVAIL,
we skip the entry and go to the next entry, but we never modify
the number of items that we pass to journal_file_append_entry_internal()
if that happens, which means we could try to append garbage to the
journal file.
Let's keep track of the number of fields we've appended to avoid this
problem.
When doing a conversion and the specified 'xc->xvariant' has no match, select
the x11 layout entry with a matching layout and an empty xvariant if such entry
exists. It's still better than no conversion at all.
"credential provided widget" would be better spelled as "credential-provided widget".
But let's adjust the message to name the bad credential explicitly: this
makes it easier to fix for the user.
Realistically, the only thing that the caller can do is ignore failures related
to missing credentials. If the caller requires some credentials to be present,
they should just check which output variables are not NULL. One of the callers
was already doing that, and the other wanted to, but missed -ENOENT. By
suppressing -ENOENT and -ENXIO, both callers are simplified.
Fixes a warning at boot:
systemd-vconsole-setup[221]: Failed to import credentials, ignoring: No such file or directory
This will make things a bit longer for now, but more powerful as we can
reuse the userns fd between calls to remount_idmap() if we need to
adjust multiple mounts.
No change in behaviour, just some minor refactoring.
I guess it was only a question of time until we need to add the final
frontier of notification functions: one that combines the features of
all the others:
1. specifiying a source PID
2. taking a list of fds to send along
3. accepting a format string for the status string
Hence, let's add it.
When pam_end() is called after a fork, and it cleans up caches, it sets
PAM_DATA_SILENT in error_status. FDs will be shared with the parent, so
we do not want to attempt to close them from a child process, or we'll
hit assertions. Complain loudly and skip.
it's a bit confusing that on 32bit systems we'd risk session IDs
overruns like this. Let's expose the same behaviour everywhere and stick
to 64bit ids.
Since we format the ids as strings anyway this doesn't really change
anything performance-wise, it just pushes out collisions by overrun to
basically never happen.
sd-event objects use hashmaps, which use module-global state, so it is not safe
to pass a sd-event object created by a module instance to another module instance
(e.g.: when two libraries static linking sd-event are pulled in a single process).
Initialize a random per-module origin id and store it in the object, and compare
it when entering a public API, and error out if they don't match, together with
the PID.
sd-journal objects use hashmaps, which use module-global state, so it is not safe
to pass a sd-journal object created by a module instance to another module instance
(e.g.: when two libraries static linking sd-journal are pulled in a single process).
Initialize a random per-module origin id and store it in the object, and compare
it when entering a public API, and error out if they don't match, together with
the PID.
sd-bus objects use hashmaps, which use module-global state, so it is not safe
to pass a sd-bus object created by a module instance to another module instance
(e.g.: when two libraries static linking sd-bus are pulled in a single process).
Initialize a random per-module origin id and store it in the object, and compare
it when entering a public API, and error out if they don't match, together with
the PID.
crypsetup-fido2 always depended on both libfido2 and libcryptsetup, but
0a8e026e82 forgot to make the then
implicit dependency on libcryptsetup explicit when moving it from
cryptsetup/ to shared/. This breaks builds when libfido2 is autodetected
but the system is missing libcryptsetup.
Introduce an explicit check for HAVE_LIBCRYPTSETUP such that
cryptsetup-fido2 is only built when both libraries are available.
Fixes#27374.
If we try to close after a fork, the FDs will have been cloned
too and we'll assert. This can happen for example in PAM modules.
Avoid the macro and define ref/unref by hand to do the same check.
This doesn't really matter too much as both are static functions. But
it's confusing as hell both when debugging and reading code, given that
homed actually uses mount-util.c
Hence, let's just rename one of the two, to minimize confusion.
No actual change in behaviour.
(and sooner or later we might want to export mount-util.c's version of
the function, since it's generically useful)
When doing x11->console conversions, find_converted_keymap() searches
automatically for a candidate in the converted keymap directory for a given x11
layout.
However doing console->x11 conversions, this automatic search is not done hence
simple conversion in this direction can't be achieved without populating
kbd-model-map with entries for converted keymaps.
For example, let's consider "at" layout which is not part of kbd-model-map. The
"at" x11 layout has a generated keymap
"/usr/share/kbd/keymaps/xkb/at.map.gz". If we configure "at" for the x11
layout, localed is able to automatically find the "at" converted vc layout and
the conversion just works :
$ localectl set-x11-keymap at
$ localectl
System Locale: LANG=en_US.UTF-8
VC Keymap: at
X11 Layout: at
However in the opposite direction, ie when setting the vc keymap to "at", no
conversion is done and the x11 layout is not defined:
$ localectl set-keymap at
$ localectl
System Locale: LANG=en_US.UTF-8
VC Keymap: at
X11 Layout: (unset)
This patch fixes this limitation as the implemenation is relatively simple and
it removes the need to populate kbd-model-map with (many) entries for converted
keymaps. However the patch doesn't remove the existing entries in kbd-model-map
which became unneeded after this change to be on the safe side.
Note: by default the automatically generated x11 keyboard configs use keyboard
model "microsoftpro" which should be equivalent to "pc105" model but with the
internet/media key mapping added.
When we're checking if /etc/resolv.conf exists so we can bind mount
on top of it, we care about whether the symlink itself exists if
/etc/resolv.conf exists and not the file it points to, so add
CHASE_NOFOLLOW to make sure we check existence of the symlink and
not the file it points to.
sd-bus connection is cached by the two pam modules globally, but this
can lead to issues due to hashmaps (used by sd-bus) using a global
static variable for the shared hash key, which is different per module
as both modules are loaded in the same process.
This happens because the sd-bus object is create in one module, but
used in the other, so global state does not match.
Use a different pam cache identifier for the sd-bus pointer, so that
each module uses a different sd-bus connection as a workaround.
Fixes https://github.com/systemd/systemd/issues/27216
Fixes https://github.com/systemd/systemd/issues/17266
acquire_home() takes a reference to a sd-bus object, which the open_session
hook cleans on success. But only when handling a user actually owned by homed,
it did not clean it up when skipping because it is being invoked on a system
user.
We need to be careful with sd-bus here as pam_sm_open_session is the last
hook before forking, and we want to clean up sd-bus before that happens, or
we'll have a broken reference (FDs are cloexec) in the child process, which
will then assert when attempting to close them, or leak the bus connection
which causes dbus to complain loudly:
dbus-daemon[62]: [system] Connection has not authenticated soon enough, closing it (auth_timeout=30000ms, elapsed: 30020ms)
This makes syntax be the same for commands which are started by the manager and
those which are spawned directly (when --scope is used).
Before:
$ systemd-run -q -t echo '$TERM'
xterm-256color
$ systemd-run -q --scope echo '$TERM'
$TERM
Now:
$ systemd-run -q --scope echo '$TERM'
xterm-256color
Previous behaviour can be restored via --expand-environment=no:
$ systemd-run -q --scope --expand-environment=no echo '$TERM'
$TERM
Fixes#22948.
At some level, this is a compat break. Fortunately --scope is not very widely
used, so I think we can get away with this. Having different syntax depending
on whether --scope was used or not was bad UX.
A NEWS entry will be required.
This uses StartExecEx to get the equivalent of ExecStart=:. StartExecEx was
added in b3d593673c, so this will not work with
older systemds.
A hint is emitted if we get an error indicating lack of support. PID1 returns
SD_BUS_ERROR_PROPERTY_READ_ONLY, but I'm checking for
SD_BUS_ERROR_UNKNOWN_PROPERTY too for safety.
Just refactoring, in preparation for future changes.
(Though I think it'd be reasonable to do anyway, those functions were
awfully long.)
'git diff' displays this badly. The middle part of start_transient_service()
is moved to make_transient_service_unit(), and the middle part of
start_transient_trigger() is moved to make_transient_trigger_unit().
start_transient_service() would return two ints: one normally and one via
*retval. We can just return one int and propagate it directly, because we
use DEFINE_MAIN_FUNCTION_WITH_POSITIVE_FAILURE().
The property name is called ExecStartEx, but we have to write it as ExecStart=
in the unit file. :(
Bug introduced in b3d593673c when ex-properties
were initially added.
In addition, we cannot escape $ as $$, because when ":" is used, we wouldn't
unescape $$ back to $.
Unfortunately we can't escape $ when ':' is used to prohibit variable expansion:
ExecStart=:echo $$
is not the same as
ExecStart=:echo $
This just adds the functionality and the unittests, without using it anywhere
for real yet.
Our escaping of '$' is '$$', not '\$'. We would write unit files that
were not valid:
$ systemd-run --user bash -c 'echo $$; sleep 1000'
Running as unit: run-r1c7c45b5b69f487c86ae205e12100808.service
$ systemctl cat --user run-r1c7c45b5b69f487c86ae205e12100808
# /run/user/1000/systemd/transient/run-r1c7c45b5b69f487c86ae205e12100808.service
...
ExecStart="/usr/bin/bash" "-c" "echo \$\$\; sleep 1000"
$ systemd-analyze verify /run/user/1000/systemd/transient/run-r1c7c45b5b69f487c86ae205e12100808.service
/run/user/1000/systemd/transient/run-r1c7c45b5b69f487c86ae205e12100808.service:7:
Ignoring unknown escape sequences: "echo \$\$\; sleep 1000"
Similarly, ';' cannot be escaped as '\;'. Only a handful of characters
listed in "Supported escapes" is allowed.
Escaping of "'" can be done, but it's not useful because we use double quotes
around the string anyway whenever we do escaping.
unit_write_setting() is called all over the place. In a great majority of
places we write either fixed strings or something that we generate ourselves,
so no escaping or quoting is needed. (And it's not allowed, e.g.
'Type="oneshot"' would not work.) But if we forgot to add escaping or quoting
for a free-style string, it would probably allow writing a unit file that would
be read completely wrong. I looked over various places where
unit_write_setting() is called, and I couldn't find any place where
quoting/escaping was forgotten. But trying to figure out the full
ramifications of this change is not easy.
__builtin_popcount() is a bit of a mouthful, so let's provide a helper.
Using _Generic has the advantage that if a type other then the ones on
the list is given, compilation will fail. This is nice, because if by any
change we pass a wider type, it is rejected immediately instead of being
truncated.
log.h is also needed. It is included transitively, but let's include it
directly.
macro.h is *not* needed.
Normal users do not have permissions to access /proc/1/root, so
'systemd-detect-virt -r' fails, but the output, even at debug level
is cryptic:
$ SYSTEMD_LOG_LEVEL=debug build/systemd-detect-virt -r
Failed to check for chroot() environment: Permission denied
Let's make this a bit easier to figure out:
$ SYSTEMD_LOG_LEVEL=debug build/systemd-detect-virt -r
Cannot stat /proc/1/root: Permission denied
Failed to check for chroot() environment: Permission denied
I looked over other users of files_same(), and I think in general the message
at debug level is OK for them too.
Meta's resource control demo project[0] includes a benchmark tool that can
be used to calculate the best iocost solutions for a given SSD.
[0]: https://github.com/facebookexperimental/resctl-demo
A project[1] has now been started to create a publicly available database
of results that can be used to apply them automatically.
[1]: https://github.com/iocost-benchmark/iocost-benchmarks
This change adds a new tool that gets triggered by a udev rule for any
block device and queries the hwdb for known solutions. The format for
the hwdb file that is currently generated by the github action looks like
this:
# This file was auto-generated on Tue, 23 Aug 2022 13:03:57 +0000.
# From the following commit:
# ca82acfe93
#
# Match key format:
# block:<devpath>:name:<model name>:
# 12 points, MOF=[1.346,1.346], aMOF=[1.249,1.249]
block:*:name:HFS256GD9TNG-62A0A:fwver:*:
IOCOST_SOLUTIONS=isolation isolated-bandwidth bandwidth naive
IOCOST_MODEL_ISOLATION=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_ISOLATION=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_ISOLATED_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_ISOLATED_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_NAIVE=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_NAIVE=rpct=99.00 rlat=8807 wpct=99.00 wlat=59023 min=75.00 max=100.00
The IOCOST_SOLUTIONS key lists the solutions available for that device
in the preferred order for higher isolation, which is a reasonable
default for most client systems. This can be overriden to choose better
defaults for custom use cases, like the various data center workloads.
The tool can also be used to query the known solutions for a specific
device or to apply a non-default solution (say, isolation or bandwidth).
Co-authored-by: Santosh Mahto <santosh.mahto@collabora.com>
getty-generator enables serial-getty@.service for virtualizer consoles
that it can find in /sys/class/tty. To make sure this works for
virtio consoles, let's make sure we load the module is loaded early
so that the /sys/class/tty/hvc0 exists before we run getty-generator.
This is the function version of STARTSWITH_SET(). We also move
STARTSWITH_SET() to string-util.h as it fits more there than in
strv.h and reimplement it using startswith_strv().
Let's avoid confusing developers and users when log messages suddenly
stop getting logged to kmsg because of ratelimiting by logging an
additional message if we start ratelimiting log messages to kmsg.
When trying to mount a partition that is encrypted without the
encryption first having been set up we want to return a
recognizable error (EUNATCH). This was broken by
80ce8580f5 which added an allowlist check
for permissible file systems first. Let's reverse the check order, so
that we get EUNATCH again, as before. (And leave EIDRM as error for the
failed allowlist check).
An overflow here (i.e. the counter reaching 2^32 within a ratelimit time
window) is not so unlikely. Let's handle this somewhat sanely
and simply stop counting, while remaining in the "limit is hit" state until
the time window has passed.
See added code comment for a longer explanation. TLDR: Linux maintains
distinct block device caches for partition and "whole" block devices,
and a simply BLKFLSBUF should make the worst confusions this causes go
away.
DHCP static leases are looked up by the client identifier as send by
the client, while configured based on MAC. As RFC 2131 states the client
identifier is an opaque key and must not be interpreted by the server
this means that DHCP clients can (/will) also use a client identifier
which is not a MAC address. One of these clients actually is
systemd-networkd which uses an RFC 4361 by default to generate the
client identifier. For these kind of DHCP clients static leases thus
don't work because of this mismatch between configuring a MAC address
but the server matching based on client identifier. This adds a fallback
to try to look up a configured static lease based on the "chaddr" of the
DHCP message as this will always contain the MAC address of the client.
Fixes#21368
If the device unit is not the head of the list saved in
Manager.devices_by_sysfs, then it is not necessary to replace the
existing hashmap entry. This should not change any behavior, just
refactoring.
The function path_prefix_root_cwd() was introduced for prefixing the
result from chaseat() with root, but
- it is named slightly generic,
- the logic is different from what chase() does.
This makes the name more explanative and specific for the result of the
chaseat(), and make the logic consistent with chase().
Fixes https://github.com/systemd/systemd/pull/27199#issuecomment-1511387731.
Follow-up for #27199.
Now, dir_fd_is_root() is heavily used in chaseat(), which is used at
various places. If the kernel is too old and /proc is not mounted, then
there is no way to get the mount ID of a directory. In that case, let's
silently skip the mount ID check.
Fixes https://github.com/systemd/systemd/pull/27299#issuecomment-1511403680.
Usually, we pass the file descriptor of the root directory to chaseat()
when `--root=` is not specified. Previously, even in such case, the
result was relative, and we need to prefix the path with "/" when we
want to pass the path to other functions that do not support dir_fd, or
log or show the path. That's inconvenient.
E.g. in logs on jammy-ppc64el in https://github.com/systemd/systemd/pull/27294:
Apr 16 17:42:50 H systemd-gpt-auto-generator[300]: Failed to dissect partition table of block device /dev/sda: No message of desired type
Apr 16 17:42:50 H (sd-execu[295]: /usr/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
ee0e6e476e made this particular condition not an
error. But for other errnos we want to print a better message too.
dissect_loop_device_and_warn() already does this, but it always prints the
error at error level. We want to suppress some of the errors, so let's make the
print helper public and do the error suppression in the caller.
If getty-generator runs in the initrd, the corresponding tty might not
have been instantiated yet in /dev, which means a serial getty is not
spawned on it. Instead, let's instantiate the serial-getty when the
device appears so that it always gets instantiated.
This makes the bpf LSM check generic, so that we can use it elsewhere.
it also drops the caching inside it, given that bpf-lsm code in PID1
will cache it a second time a stack frame further up when it checks for
various other bpf functionality.
Running systemd with IP accounting enabled generates many bpf maps (two
per unit for accounting, another two if IPAddressAllow/Deny are used).
Systemd itself knows which maps belong to what unit and commands like
`systemctl status <unit>` can be used to query what service has which
map, but monitoring these values all the time costs 4 dbus requests
(calling the .IP{E,I}gress{Bytes,Packets} method for each unit) and
makes services like the prometheus systemd_exporter[1] somewhat slow
when doing that for every units, while less precise information could
quickly be obtained by looking directly at the maps.
Unfortunately, bpf map names are rather limited:
- only 15 characters in length (16, but last byte must be 0)
- only allows isalnum(), _ and . characters
If it wasn't for the length limit we could use the normal unit escape
functions but I've opted to just make any forbidden character into
underscores for maximum brievty -- the map prefix is also rather short:
This isn't meant as a precise mapping, but as a hint for admins who want
to look at these.
(Note there is no problem if multiple maps have the same name)
Link: https://github.com/povilasv/systemd_exporter [1]
Let's be more careful with generating error codes for (expected) error
causes.
This does not introduce new error conditions, it just changes what we
return under specific cases, to make things nicely recognizable in each
case. Most importantly this detects if fdinfo reports a pid of "-1" for
pidfds with processes that are already reaped (and thus have no PID
anymore)
None of our current users care about these error codes, but let's get
this right for the future.
Commit f90eea7d18
virt: Improve detection of EC2 metal instances
Added support for detecting EC2 metal instances via the product
name in DMI by testing for the ".metal" suffix.
Unfortunately this doesn't cover all cases, as there are going to be
instance types where ".metal" is not a suffix (ie, .metal-16xl,
.metal-32xl, ...)
This modifies the logic to also allow those new forms.
Signed-off-by: Benjamin Herrenschmidt <benh@amazon.com>
The cmd(3) man page says about CMSG_DATA():
> The pointer returned cannot be assumed to be suitably aligned for
> accessing arbitrary payload data types. Applications should not cast
> it to a pointer type matching the payload, but should instead use
> memcpy(3) to copy data to or from a suitably declared object.
Hence, if we want to use unaligned data in cmsg, we need to copy it
before use. That's typically important for reading timestamps in
RISCV32, as the time_t is 64bit and size_t is 32bit on the system.
This removes remaining hardcoded occurences of `/sbin/fsck`, and instead
uses `find_executable` to find `fsck`.
We also use `fsck_exists_for_fstype` to check for the `fsck.*`
executable, which also checks in `$PATH`, so it's fair to assume fsck
itself is also available.
The ignore directive specifies to not do anything with the given
unit and leave existing configuration intact. This allows distributions
to gradually adopt preset files by shipping a ignore * preset file.
strstrafter() is like strstr() but returns a pointer to the first
character *after* the found substring, not on the substring itself.
Quite often this is what we actually want.
Inspired by #27267 I think it makes sense to add a helper for this,
to avoid the potentially fragile manual pointer increment afterwards.
The overflow check was hosed in two ways: overflows in C are undefined,
hence gcc was free to just optimize the whole thing away. We need to
catch overflows before we run into them, not after.
It checked for an overflow against size_t, but the field we need to
write this in is unsigned. i.e. typically 32bit rather than 64bit. Hence
check for the right maximum.
(The whole check is paranoia anyway, the kernel really shouldn't return
values that would induce an overflow, but you never know, the syscall
turned out to be problematic in so many other ways, hence let's stick to
this.)
The concept of a "mount" is a local one, hence there's no point in going
to the network to retrieve mnt_id or STATX_ATTR_MOUNT_ROOT. Hence set
AT_STATX_DONT_SYNC so that the call will not go to the network ever, and
risk deadlocking on that.
Just some extra safety.
When CHASE_MKDIR_0755 is specified without CHASE_NONEXISTENT and
CHASE_PARENT, then chase() succeeds only when the file specified by
the path already exists, and in that case, chase() does not create
any parent directories, and CHASE_MKDIR_0755 is meaningless.
Let's mention that CHASE_MKDIR_0755 needs to be specified with
CHASE_NONEXISTENT or CHASE_PARENT, and adds a assertion about that.
Enabling these options when not running as root requires a user
namespace, so implicitly enable PrivateUsers=.
This has a side effect as it changes which users are visible to the unit.
However until now these options did not work at all for user units, and
in practice just a handful of user units in Fedora, Debian and Ubuntu
mistakenly used them (and they have been all fixed since).
This fixes the long-standing confusing issue that the user and system
units take the same options but the behaviour is wildly (and sometimes
silently) different depending on which is which, with user units
requiring manually specifiying PrivateUsers= in order for sandboxing
options to actually work and not be silently ignored.
In scope_set_state(), the timer event source may be disabled depending
on the state. Currently, it will be disabled when the state is
SCOPE_RUNNING. This has the effect of new RuntimeMaxSec values being
ignored on coldplug.
Note that this issue is not currently present when scopes are started
because when scope_start() is called, scope_arm_timer() is called after
scope_set_state().
Confexts should not contain code, so mount confexts with noexec.
We cannot mount invidial extensions as noexec, as the overlay ignores
it and bypasses it, we need to use the flag on the whole overlay for
it to be effective.
But given there are legacy scripts still shipped in /etc, allow to
override it with --noexec=false.
When a unit is upheld and fails, and there are no state changes in
the upholder, it will not be retried, which is against what the
documentation suggests.
Requeue when the job finishes. Same for the other two queues.