Normal users do not have permissions to access /proc/1/root, so
'systemd-detect-virt -r' fails, but the output, even at debug level,
is cryptic:
$ SYSTEMD_LOG_LEVEL=debug build/systemd-detect-virt -r
Failed to check for chroot() environment: Permission denied
Let's make this a bit easier to figure out:
$ SYSTEMD_LOG_LEVEL=debug build/systemd-detect-virt -r
Cannot stat /proc/1/root: Permission denied
Failed to check for chroot() environment: Permission denied
I looked over other users of files_same(), and I think in general the message
at debug level is OK for them too.
Meta's resource control demo project[0] includes a benchmark tool that can
be used to calculate the best iocost solutions for a given SSD.
[0]: https://github.com/facebookexperimental/resctl-demo
A project[1] has now been started to create a publicly available database
of results that can be used to apply them automatically.
[1]: https://github.com/iocost-benchmark/iocost-benchmarks
This change adds a new tool that gets triggered by a udev rule for any
block device and queries the hwdb for known solutions. The format for
the hwdb file that is currently generated by the github action looks like
this:
# This file was auto-generated on Tue, 23 Aug 2022 13:03:57 +0000.
# From the following commit:
# ca82acfe93
#
# Match key format:
# block:<devpath>:name:<model name>:
# 12 points, MOF=[1.346,1.346], aMOF=[1.249,1.249]
block:*:name:HFS256GD9TNG-62A0A:fwver:*:
IOCOST_SOLUTIONS=isolation isolated-bandwidth bandwidth naive
IOCOST_MODEL_ISOLATION=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_ISOLATION=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_ISOLATED_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_ISOLATED_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_BANDWIDTH=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_BANDWIDTH=rpct=0.00 rlat=8807 wpct=0.00 wlat=59023 min=100.00 max=100.00
IOCOST_MODEL_NAIVE=rbps=1091439492 rseqiops=52286 rrandiops=63784 wbps=192329466 wseqiops=12309 wrandiops=16119
IOCOST_QOS_NAIVE=rpct=99.00 rlat=8807 wpct=99.00 wlat=59023 min=75.00 max=100.00
The IOCOST_SOLUTIONS key lists the solutions available for that device
in the preferred order for higher isolation, which is a reasonable
default for most client systems. This can be overridden to choose better
defaults for custom use cases, like the various data center workloads.
The tool can also be used to query the known solutions for a specific
device or to apply a non-default solution (say, isolation or bandwidth).
Co-authored-by: Santosh Mahto <santosh.mahto@collabora.com>
getty-generator enables serial-getty@.service for virtualizer consoles
that it can find in /sys/class/tty. To make sure this works for
virtio consoles, let's make sure the module is loaded early
so that /sys/class/tty/hvc0 exists before we run getty-generator.
This is the function version of STARTSWITH_SET(). We also move
STARTSWITH_SET() to string-util.h as it fits more there than in
strv.h and reimplement it using startswith_strv().
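A minimal sketch of what such a function version could look like, assuming it returns the remainder of the string after the first matching prefix. The names startswith_sketch() and startswith_strv_sketch() are hypothetical stand-ins, not the actual helpers:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the remainder of 's' after 'prefix', or NULL if 's' does not start with it. */
static const char *startswith_sketch(const char *s, const char *prefix) {
        size_t l;

        assert(s);
        assert(prefix);

        l = strlen(prefix);
        if (strncmp(s, prefix, l) != 0)
                return NULL;

        return s + l;
}

/* Tries each entry of the NULL-terminated array 'prefixes' in order and
 * returns the remainder after the first prefix that matches. */
static const char *startswith_strv_sketch(const char *s, const char * const *prefixes) {
        assert(s);

        if (!prefixes)
                return NULL;

        for (const char * const *p = prefixes; *p; p++) {
                const char *e = startswith_sketch(s, *p);
                if (e)
                        return e;
        }

        return NULL;
}
```

A macro taking a literal list of prefixes can then build a temporary array and call the function, which keeps the matching logic in one place.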
Let's avoid confusing developers and users when log messages suddenly
stop getting logged to kmsg because of ratelimiting by logging an
additional message if we start ratelimiting log messages to kmsg.
When trying to mount a partition that is encrypted without the
encryption first having been set up we want to return a
recognizable error (EUNATCH). This was broken by
80ce8580f5 which added an allowlist check
for permissible file systems first. Let's reverse the check order, so
that we get EUNATCH again, as before. (And leave EIDRM as error for the
failed allowlist check).
An overflow here (i.e. the counter reaching 2^32 within a ratelimit time
window) is not so unlikely. Let's handle this somewhat sanely
and simply stop counting, while remaining in the "limit is hit" state until
the time window has passed.
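The saturating counter can be sketched roughly as follows. This is an illustration only: RateLimitSketch and ratelimit_below_sketch() are hypothetical names, and the handling of the time window is left out entirely:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

typedef struct RateLimitSketch {
        unsigned burst; /* number of events allowed per time window */
        unsigned num;   /* events seen in the current time window */
} RateLimitSketch;

/* Returns true if the event is still within the limit. */
static bool ratelimit_below_sketch(RateLimitSketch *rl) {
        assert(rl);

        if (rl->num < rl->burst) {
                rl->num++;
                return true;
        }

        /* Limit is hit: keep counting suppressed events, but saturate at
         * UINT_MAX instead of wrapping around to 0, which would wrongly
         * re-open the limit within the same window. */
        if (rl->num < UINT_MAX)
                rl->num++;

        return false;
}
```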
See added code comment for a longer explanation. TLDR: Linux maintains
distinct block device caches for partition and "whole" block devices,
and a simple BLKFLSBUF should make the worst confusions this causes go
away.
DHCP static leases are looked up by the client identifier as sent by
the client, while configured based on MAC. As RFC 2131 states the client
identifier is an opaque key and must not be interpreted by the server
this means that DHCP clients can (/will) also use a client identifier
which is not a MAC address. One of these clients actually is
systemd-networkd, which by default generates its client identifier
according to RFC 4361. For these kinds of DHCP clients static leases thus
don't work because of this mismatch between configuring a MAC address
but the server matching based on client identifier. This adds a fallback
to try to look up a configured static lease based on the "chaddr" of the
DHCP message as this will always contain the MAC address of the client.
Fixes #21368
If the device unit is not the head of the list saved in
Manager.devices_by_sysfs, then it is not necessary to replace the
existing hashmap entry. This should not change any behavior, just
refactoring.
The function path_prefix_root_cwd() was introduced for prefixing the
result from chaseat() with root, but
- its name is rather generic,
- the logic is different from what chase() does.
This makes the name more descriptive and specific to the result of
chaseat(), and makes the logic consistent with chase().
Fixes https://github.com/systemd/systemd/pull/27199#issuecomment-1511387731.
Follow-up for #27199.
Now, dir_fd_is_root() is heavily used in chaseat(), which is used at
various places. If the kernel is too old and /proc is not mounted, then
there is no way to get the mount ID of a directory. In that case, let's
silently skip the mount ID check.
Fixes https://github.com/systemd/systemd/pull/27299#issuecomment-1511403680.
Usually, we pass the file descriptor of the root directory to chaseat()
when `--root=` is not specified. Previously, even in such case, the
result was relative, and we need to prefix the path with "/" when we
want to pass the path to other functions that do not support dir_fd, or
log or show the path. That's inconvenient.
E.g. in logs on jammy-ppc64el in https://github.com/systemd/systemd/pull/27294:
Apr 16 17:42:50 H systemd-gpt-auto-generator[300]: Failed to dissect partition table of block device /dev/sda: No message of desired type
Apr 16 17:42:50 H (sd-execu[295]: /usr/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
ee0e6e476e made this particular condition not an
error. But for other errnos we want to print a better message too.
dissect_loop_device_and_warn() already does this, but it always prints the
error at error level. We want to suppress some of the errors, so let's make the
print helper public and do the error suppression in the caller.
If getty-generator runs in the initrd, the corresponding tty might not
have been instantiated yet in /dev, which means a serial getty is not
spawned on it. Instead, let's instantiate the serial-getty when the
device appears so that it always gets instantiated.
This makes the bpf LSM check generic, so that we can use it elsewhere.
It also drops the caching inside it, given that bpf-lsm code in PID1
will cache it a second time a stack frame further up when it checks for
various other bpf functionality.
Running systemd with IP accounting enabled generates many bpf maps (two
per unit for accounting, another two if IPAddressAllow/Deny are used).
Systemd itself knows which maps belong to what unit and commands like
`systemctl status <unit>` can be used to query what service has which
map, but monitoring these values all the time costs 4 dbus requests
(calling the .IP{E,I}gress{Bytes,Packets} method for each unit) and
makes services like the prometheus systemd_exporter[1] somewhat slow
when doing that for every unit, while less precise information could
quickly be obtained by looking directly at the maps.
Unfortunately, bpf map names are rather limited:
- only 15 characters in length (16, but last byte must be 0)
- only allows isalnum(), _ and . characters
If it wasn't for the length limit we could use the normal unit escape
functions but I've opted to just make any forbidden character into
underscores for maximum brevity -- the map prefix is also rather short:
This isn't meant as a precise mapping, but as a hint for admins who want
to look at these.
(Note there is no problem if multiple maps have the same name)
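The character replacement and truncation described above could look roughly like this. The function name and the BPF_NAME_MAX_SKETCH constant are hypothetical, and the sketch leaves out the map prefix, which is not spelled out here:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define BPF_NAME_MAX_SKETCH 16 /* 15 characters plus the terminating NUL byte */

/* Derives a bpf map name from a unit name: every character other than
 * alphanumerics, '_' and '.' becomes '_', and the result is truncated
 * to 15 characters. */
static void bpf_map_name_sketch(const char *unit, char ret[BPF_NAME_MAX_SKETCH]) {
        size_t i;

        assert(unit);
        assert(ret);

        for (i = 0; i < BPF_NAME_MAX_SKETCH - 1 && unit[i]; i++) {
                unsigned char c = (unsigned char) unit[i];

                ret[i] = isalnum(c) || c == '_' || c == '.' ? (char) c : '_';
        }

        ret[i] = '\0';
}
```

For a unit called foo-bar@1.service this yields foo_bar_1.servi, which is lossy but still recognizable as a hint.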
Link: https://github.com/povilasv/systemd_exporter [1]
Let's be more careful with generating error codes for (expected) error
causes.
This does not introduce new error conditions, it just changes what we
return under specific cases, to make things nicely recognizable in each
case. Most importantly this detects if fdinfo reports a pid of "-1" for
pidfds with processes that are already reaped (and thus have no PID
anymore)
None of our current users care about these error codes, but let's get
this right for the future.
Commit f90eea7d18
virt: Improve detection of EC2 metal instances
Added support for detecting EC2 metal instances via the product
name in DMI by testing for the ".metal" suffix.
Unfortunately this doesn't cover all cases, as there are going to be
instance types where ".metal" is not a suffix (ie, .metal-16xl,
.metal-32xl, ...)
This modifies the logic to also allow those new forms.
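One way to accept both forms is sketched below. This is an assumption about how such a check could be written, not necessarily what the change does, and product_name_is_metal_sketch() is a hypothetical name:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if the DMI product name contains ".metal" either as a
 * suffix or followed by a "-<size>" part such as ".metal-16xl". */
static bool product_name_is_metal_sketch(const char *product_name) {
        const char *e;

        assert(product_name);

        e = strstr(product_name, ".metal");
        if (!e)
                return false;

        e += strlen(".metal");
        return *e == '\0' || *e == '-';
}
```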
Signed-off-by: Benjamin Herrenschmidt <benh@amazon.com>
The cmsg(3) man page says about CMSG_DATA():
> The pointer returned cannot be assumed to be suitably aligned for
> accessing arbitrary payload data types. Applications should not cast
> it to a pointer type matching the payload, but should instead use
> memcpy(3) to copy data to or from a suitably declared object.
Hence, if we want to use unaligned data in cmsg, we need to copy it
before use. That's typically important for reading timestamps in
RISCV32, as the time_t is 64bit and size_t is 32bit on the system.
This removes remaining hardcoded occurrences of `/sbin/fsck`, and instead
uses `find_executable` to find `fsck`.
We also use `fsck_exists_for_fstype` to check for the `fsck.*`
executable, which also checks in `$PATH`, so it's fair to assume fsck
itself is also available.
The ignore directive specifies to not do anything with the given
unit and leave existing configuration intact. This allows distributions
to gradually adopt preset files by shipping an ignore * preset file.
strstrafter() is like strstr() but returns a pointer to the first
character *after* the found substring, not on the substring itself.
Quite often this is what we actually want.
Inspired by #27267 I think it makes sense to add a helper for this,
to avoid the potentially fragile manual pointer increment afterwards.
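A minimal sketch of such a helper, assuming a NULL haystack is tolerated and simply yields NULL. The name strstrafter_sketch() marks it as an illustration rather than the actual implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Like strstr(), but returns a pointer to the first character after the
 * found substring, or NULL if 'needle' does not occur in 'haystack'. */
static const char *strstrafter_sketch(const char *haystack, const char *needle) {
        const char *p;

        assert(needle);

        if (!haystack)
                return NULL;

        p = strstr(haystack, needle);
        if (!p)
                return NULL;

        return p + strlen(needle);
}
```

With this, a caller no longer has to repeat the needle length by hand when skipping past the match.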
The overflow check was hosed in two ways: overflows in C are undefined,
hence gcc was free to just optimize the whole thing away. We need to
catch overflows before we run into them, not after.
It checked for an overflow against size_t, but the field we need to
write this in is unsigned, i.e. typically 32bit rather than 64bit. Hence
check for the right maximum.
(The whole check is paranoia anyway, the kernel really shouldn't return
values that would induce an overflow, but you never know, the syscall
turned out to be problematic in so many other ways, hence let's stick to
this.)
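The general pattern of checking before the addition, and against the maximum of the destination type, can be sketched like this. The function and parameter names are hypothetical and not taken from the code being fixed:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Computes header + payload for a field of type 'unsigned'. Returns false
 * instead of overflowing; the comparison happens before the addition. */
static bool add_size_checked_sketch(size_t header, size_t payload, unsigned *ret) {
        assert(ret);

        if (header > UINT_MAX)
                return false;

        /* Check against UINT_MAX, the limit of the destination field,
         * rather than against SIZE_MAX. */
        if (payload > UINT_MAX - header)
                return false;

        *ret = (unsigned) (header + payload);
        return true;
}
```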
The concept of a "mount" is a local one, hence there's no point in going
to the network to retrieve mnt_id or STATX_ATTR_MOUNT_ROOT. Hence set
AT_STATX_DONT_SYNC so that the call will not go to the network ever, and
risk deadlocking on that.
Just some extra safety.
When CHASE_MKDIR_0755 is specified without CHASE_NONEXISTENT and
CHASE_PARENT, then chase() succeeds only when the file specified by
the path already exists, and in that case, chase() does not create
any parent directories, and CHASE_MKDIR_0755 is meaningless.
Let's mention that CHASE_MKDIR_0755 needs to be specified with
CHASE_NONEXISTENT or CHASE_PARENT, and add an assertion about that.
Enabling these options when not running as root requires a user
namespace, so implicitly enable PrivateUsers=.
This has a side effect as it changes which users are visible to the unit.
However until now these options did not work at all for user units, and
in practice just a handful of user units in Fedora, Debian and Ubuntu
mistakenly used them (and they have been all fixed since).
This fixes the long-standing confusing issue that the user and system
units take the same options but the behaviour is wildly (and sometimes
silently) different depending on which is which, with user units
requiring manually specifying PrivateUsers= in order for sandboxing
options to actually work and not be silently ignored.
In scope_set_state(), the timer event source may be disabled depending
on the state. Currently, it will be disabled when the state is
SCOPE_RUNNING. This has the effect of new RuntimeMaxSec values being
ignored on coldplug.
Note that this issue is not currently present when scopes are started
because when scope_start() is called, scope_arm_timer() is called after
scope_set_state().
Confexts should not contain code, so mount confexts with noexec.
We cannot mount individual extensions as noexec, as the overlay ignores
and bypasses it; we need to use the flag on the whole overlay for
it to be effective.
But given there are legacy scripts still shipped in /etc, allow to
override it with --noexec=false.
When a unit is upheld and fails, and there are no state changes in
the upholder, it will not be retried, which is against what the
documentation suggests.
Requeue when the job finishes. Same for the other two queues.
Repart aligns the start and end of the usable space to the first multiple
of grainsz (at least 4096 bytes). However the first usable LBA of a GPT
partition is at sector 34 (512-byte sectors) which is not a multiple of 4096.
The backup GPT label at the end also takes up 33 sectors, meaning the last
usable LBA is at 34 sectors from the end, unlikely to be a 4096 multiple as
well.
This meant that the very first and last sectors were never discarded. However
more problematically if an existing partition started before the first
usable grainsz multiple its start didn't get taken into account as a valid
starting point and got its data discarded.
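The arithmetic can be checked with a small sketch. The helper names are hypothetical, and overflow of the intermediate sum is ignored for brevity:

```c
#include <assert.h>
#include <stdint.h>

/* Rounds 'v' up to the next multiple of 'grain'. */
static uint64_t round_up_size_sketch(uint64_t v, uint64_t grain) {
        assert(grain > 0);
        return (v + grain - 1) / grain * grain;
}

/* Rounds 'v' down to the previous multiple of 'grain'. */
static uint64_t round_down_size_sketch(uint64_t v, uint64_t grain) {
        assert(grain > 0);
        return v / grain * grain;
}
```

With 512-byte sectors and a 4096-byte grain, the first usable LBA 34 sits at byte offset 17408, and round_up_size_sketch(34 * 512, 4096) gives 20480, i.e. sector 40. A partition that starts at sector 34 therefore begins before the first grain-aligned offset.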
Signed-off-by: Sjoerd Simons <sjoerd@collabora.com>
Apparently CMSG_DATA() alignment is very much undefined. Which is quite
an ABI fuck-up, but we need to deal with this. CMSG_TYPED_DATA() already
checks alignment of the specified pointer. Let's also check matching
alignment of the underlying structures, which we already can do at
compile-time.
See: #27241
(This does not fix #27241, but should catch such errors already at
compile-time instead of runtime)
Just to match service_release_stdio_fd() and service_release_fd_store()
in the name, since they do similar things.
This follows the concept that we "release" resources, and this is all
generically wrapped in "service_release_resources()".
We already clear the various fds we keep from the release_resources()
handler, let's also destroy the runtime dir from there if this
preservation mode is selected.
This makes a minor semantic change: previously we'd keep a runtime
directory around if RuntimeDirectoryPreserve=restart is selected and at
least one JOB_START job was around. With this logic we'll keep it around
a tiny bit longer: as long as any job for the unit is around.
The file descriptors we keep in the fdstore might be basically anything,
let's clean it up with our asynchronous closing feature, to not
deadlock on close().
(Let's also do the same for stdin/stdout/stderr fds, since they might
point to network services these days.)
Now that we have a potentially pinned fdstore let's add a concept for
cleaning it explicitly on user request. Let's expose this via
"systemctl clean", i.e. the same way as user directories are cleaned.
Oftentimes it is useful to allow the per-service fd store to survive
longer than for a restart. This is useful in various scenarios:
1. An fd to some security relevant object needs to be stashed somewhere,
that should not be cleaned automatically, because the security
enforcement would be dropped then.
2. A user namespace fd should be allocated on first invocation and be
kept around until the user logs out (i.e. systemd --user ends), á la
#16328 (This does not implement what #16318 asks for, but should
solve the use-case discussed there.)
3. There's interest in allowing a concept of "userspace reboots" where the
kernel stays running, and userspace is swapped out (i.e. all services
exit, and the rootfs transitioned into a new version of it) while
keeping some select resources pinned, very similar to how we
implement a switch root. Thus it is useful to allow services to exit,
while leaving their fds around till the very end.
This is exposed through a new FileDescriptorStorePreserve= setting that
is closely modelled after RuntimeDirectoryPreserve= (in fact it reused
the same internal type), since we want similar behaviour in the end, and
quite often they probably want to be used together.
Let's normalize how we release service resources, i.e. the three types
of fds we maintain for each service:
1. the fdstore
2. the socket fd for per-connection socket activated services
3. stdin/stdout/stderr
The generic service_release_resources() hook now calls into
service_release_fd_store() + service_close_socket_fd() +
service_release_stdio_fd() one after the other, releasing them all for
the generic "release_resources" infra of the unit lifecycle.
We do no longer close the socket fd from service_set_state(), moving
this exclusively into service_release_resources(), so that all fds are
closed the same way.
The per-unit-type release_resources() hook (most prominent use: to
release a service unit's fdstore once a unit is entirely dead and has no
more jobs) was so far invoked as part of unit_check_gc(), whose
primary purpose is to determine if a unit should be GC'ed. This was
always a bit ugly, as release_resources() changes state of the unit,
while unit_check_gc() is otherwise (and was before release_resources()
was added) a "passive" function that just checks for a couple of
conditions.
unit_check_gc() is called at various places, including when we wonder if
we should add a unit to the gc queue, and then again when we take it out
of the gc queue to determine whether to really gc it now. The fact that
these checks have side effects so far wasn't too problematic, as the
state changes (primarily: that services would empty their fdstores) were
relatively limited in scope.
A later patch in this series is supposed to extend the service state
engine with a separate state distinct from SERVICE_DEAD that is very
much like it but indicates that the service still has active resources
(specifically the fdstore). For cases like that the releasing of the
fdstore would result in state changes (as we'd then return to a classic
SERVICE_DEAD state). And this is where the fact that the
release_resources() is called as side-effect becomes problematic: it
would mean that unit state changes would instantly propagate to state
changes elsewhere, though we usually want this to be done through the
run queue for coalescing and avoidance of recursion.
Hence, let's clean this up: let's move the release_resources() logic
into a queue of its own, and then enqueue items into it from the general
state change notification handler in unit_notify().
Since da6053d0a7 this is a size_t, not an
unsigned. The difference doesn't matter on LE archs, but it matters on
BE (i.e. s390x), since we'll return entirely nonsensical data.
Let's fix that.
Follow-up-for: da6053d0a7
An embarrassing bug introduced in 2018... That made me scratch my head
for way too long, as it made #27135 fail on s390x while it passed
everywhere else.
The verity fec_* parameters allow using Forward Error Correction to
recover from corruption if hash verification fails.
This adds the options fec_device, fec_offset and fec_roots (sixth
argument) which are the equivalent of the options --fec-device,
--fec-offset and --fec-roots in the veritysetup world.
- fec-device=FILE
- fec-offset=BYTES
- fec-roots=UINT64
See `veritysetup(8)` for more details.
The verity parameter no_superblock allows formatting/opening a hash device
without the superblock. However, the superblock data must be set to open
the data-device.
This adds the option superblock (sixth argument) and all the underlying
options which are implied to set the superblock manually if the hash device
has no superblock:
- superblock=BOOL
- format=NUMBER (hash version type, 0 for original ChromeOS, 1 for
modern)
- data-block-size=BYTES (max page-size, multiple of 512)
- hash-block-size=BYTES (max page-size, multiple of 512)
- data-blocks=BLOCKS (size of data-device in blocks)
- salt=HEXSTR (salt used at format, max 256 bytes)
- uuid=UUID
- hash=STR (algorithm name for dm-verity used at format, default is
sha256)
See `veritysetup(8)` for more details.
The verity parameter hash_area_offset allows locating the superblock in
the hash device. It can be used to have a single device which contains
both data and hashes.
This adds the option hash-offset=BYTES (sixth argument) which is the
equivalent of the option --hash-offset in the veritysetup world.
See `veritysetup(8)` for more details.
Correct what appears to be a copy/paste error in config_parse_exec_coredump_filter that is preventing the coredump_filter setting from working correctly.
The Upholds= promise is that as long as unit A is up and Upholds=B,
B will be activated if failed or inactive. But there is a hard-coded,
non-configurable rate limit for this, so add a timed retry after the
ratelimit has expired.
Apply to BindsTo= and StopWhenUnneeded= as well.
valgrind systemctl is-enabled --root=/ -l default.target >/dev/null
==746041== Memcheck, a memory error detector
==746041== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==746041== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==746041== Command: systemctl is-enabled --root=/ -l default.target
==746041==
==746041==
==746041== HEAP SUMMARY:
==746041== in use at exit: 8,251 bytes in 4 blocks
==746041== total heap usage: 3,440 allocs, 3,436 frees, 1,163,346 bytes allocated
==746041==
==746041== LEAK SUMMARY:
==746041== definitely lost: 24 bytes in 1 blocks
==746041== indirectly lost: 35 bytes in 1 blocks
==746041== possibly lost: 0 bytes in 0 blocks
==746041== still reachable: 8,192 bytes in 2 blocks
==746041== suppressed: 0 bytes in 0 blocks
==746041== Rerun with --leak-check=full to see details of leaked memory
==746041==
==746041== For lists of detected and suppressed errors, rerun with: -s
==746041== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
This also fixes a memory leak in the old code.
valgrind systemctl -t socket --root=/ list-unit-files >/dev/null
==2601899== Memcheck, a memory error detector
==2601899== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==2601899== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==2601899== Command: systemctl -t socket --root=/ list-unit-files
==2601899==
==2601899==
==2601899== HEAP SUMMARY:
==2601899== in use at exit: 39,984 bytes in 994 blocks
==2601899== total heap usage: 344,414 allocs, 343,420 frees, 2,001,612,404 bytes allocated
==2601899==
==2601899== LEAK SUMMARY:
==2601899== definitely lost: 7,952 bytes in 497 blocks
==2601899== indirectly lost: 32,032 bytes in 497 blocks
==2601899== possibly lost: 0 bytes in 0 blocks
==2601899== still reachable: 0 bytes in 0 blocks
==2601899== suppressed: 0 bytes in 0 blocks
==2601899== Rerun with --leak-check=full to see details of leaked memory
==2601899==
==2601899== For lists of detected and suppressed errors, rerun with: -s
==2601899== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Let's honour the flag if it is set, just to be safe.
(This only handles the case for the writing side: whenever the client
code hands us a json object with the flag set we'll honour it till
it's out of reach for us. This does *not* handle the reading side, which
is left for a later patch once needed. We probably should add a
per-connection flag that simply globally enables the sensitive logic for
all messages coming in on a specific varlink connection.)
Let's add infrastructure to implement fd passing in varlink, when used
over AF_UNIX.
This will optionally associate one or more fds with a message sent via
varlink and deliver it to the server.
Some minor refactoring. This adds a helper call whose only job is to
unref the JSON object of the currently processed incoming message.
This doesn't make too much sense on its own, given this just replaces
one line by another. However, in a later patch when we'll add fd passing
we'll extend the function to also destroy associated fds, and then it
will start to make more sense.
So far, if we do a synchronous varlink call from the client side via
varlink_call(), we'll
move the returned json object from "v->current" into "v->reply", and
keep it referenced there until the next call. We then return a pointer
to it. This ensures that the json object remains valid between two
varlink_call() invocations.
But the thing is, we don't need a separate field for that, we can just
leave the data in "v->current". This means VARLINK_IDLE_CLIENT state
will be permitted with and without v->current initialized. Initially,
after connection setup it will be set to NULL, but after the first
varlink_call() it will be set to the most recent response, pinning it
into memory.
When running in a container, we can propagate the exit status of
pid1 as usual via the process exit status. This is not possible
when running in a VM. Instead, let's send EXIT_STATUS=%i via the
notify socket if one is configured. The user running the VM can then
pick up the exit status from the notify socket after the VM has shut
down.
systemd-nspawn now optionally supports a colon-separated pair of
host interface name and container interface name for --network-macvlan, --network-ipvlan and --network-interface options.
Also supported in .nspawn configuration files (i.e. Interface=, MACVLAN=, IPVLAN= parameters).
The man page is updated for the network interface naming.
Neither of the callers of bus_deserialize_and_dump_unit_file_changes()
touches the changes array, so let's simplify things and keep it internal
to the function.
On x86 EFI follows the windows ABI, which expects 8-byte aligned long
long. The x86 sysv ELF ABI expects them to be 8-byte aligned when used
alone, but 4-byte aligned when they appear inside of structs:
struct S {
int i;
long long ll;
};
// _Static_assert(sizeof(struct S) == 12, "x86 sysv ABI");
_Static_assert(sizeof(struct S) == 16, "EFI/MS ABI");
To get the behavior we need when building with sysv ELF ABI we need to
pass '-malign-double' to the compiler as done by EDK2.
This in turn will make ubsan unhappy as the stack may not be properly
aligned on entry, so we have to tell the compiler explicitly to re-align
the stack on entry to efi_main.
This fixes loading EFI drivers on x86 that were previously always
rejected as the EFI_LOADED_IMAGE_PROTOCOL had a wrong memory layout.
See also: https://github.com/rhboot/shim/pull/516
There were a few remaining cases where we used arg_root instead of
the root directory file descriptor. Let's port those over to use the
root directory file descriptor as well.
To make it consistent with other env vars, e.g. $SYSTEMD_ESP_PATH or
$SYSTEMD_XBOOTLDR_PATH.
This is useful when the root is specified by a file descriptor, instead
of a path.
For consistency with other functions.
Unfortunately, va_start() requires that the previous argument is a
pointer, hence the order of the arguments in the internal function
cannot be changed.
The variable 'r' is usually used for storing the return value of a function
call. Let's introduce another boolean to store the current loop status.
No functional change, just refactoring.
In that branch, 'root' is a non-root and absolute path.
Hence, delete_trailing_chars() does not make the path empty.
And, if the path contains redundant slashes at the end, that will be
dropped by path_simplify().
This is a followup to
413e8650b7
> tree-wide: Use "unmet" for condition checks, not "failed"
Since I noticed when running `systemctl status` on a recent
systemd still seeing
`Condition: start condition failed`
To recap the original rationale here for "unmet" is that it's
normal for some units to be conditional, so the term "failure"
here is too strong.
Unlikely, but even if find_esp() or friends are called with unnormalized or
relative 'root', let's make the result path normalized and absolute.
Note, before 63105f33ed, these functions
returned an absolute and normalized path. But the commit made the result
path simply concatenated with root.
Follow-up for 63105f33ed.
When no extension is specified, the image class does not need to be
specified. Let's use _IMAGE_CLASS_INVALID as an indicator that no
extension is specified.
When the `systemd-network-generator` is included in the initrd and runs from
there first, the next time it runs after switching to real root it
thinks there is a duplicate entry on the kernel command line.
This patch rewrites the unit file if the content has changed, instead of
displaying an error message.
When path_find_first_component() returns the last component, the iterator
must be an empty string. The fact is heavily used in chaseat(). Let's
explicitly test it.
Previously, struct stat may not be correctly synced with the currently
opened fd, e.g. when a path contains symlink which points to an absolute
path.
This also renames variables for struct stat, to make them consistent with
the corresponding fd.
All tags are managed under /run/udev/tags, and the directories there are
named with tags. Hence, each tag must be a valid filename.
This also moves all validity checks to the sd-device side, and
makes failure caused by setting invalid tags non-critical.
With this change, an empty string cannot be assigned to TAG=, hence the
test cases are adjusted.
After manually editing /etc/locale.gen, calling localectl set-locale
sometimes fails. When it fails, the systemd journal shows:
systemd-localed: free() / invalid pointer.
It turned out that it only fails if some of the uncommented lines in
/etc/locale.gen have leading spaces, as in:
* C.UTF-8 <= OK
* en_US.UTF-8 <= OK
* fr_FR.UTF-8 <= NOK
After parsing a line from /etc/locale.gen, we use strstrip() to obtain
the "trimmed" line (without leading or trailing spaces).
However, we store the result of strstrip() in the original pointer
containing the untrimmed line. This pointer is later passed to free
(this is done automatically using _cleanup_free_).
This is a problem because if any leading space is present, the pointer
will essentially be shifted from its original value. This will result in
an invalid free upon cleanup.
The same issue is present in the locale_gen_locale_supported function.
Fixed by storing the result of strstrip() in a different pointer.
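The fixed pattern can be sketched as follows. Both helpers are hypothetical simplifications: strstrip_sketch() stands in for strstrip(), and the explicit free() stands in for the automatic cleanup:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for strstrip(): trims in place and returns a
 * pointer into 's', which may differ from 's' itself. */
static char *strstrip_sketch(char *s) {
        char *e;

        assert(s);

        while (isspace((unsigned char) *s))
                s++;

        e = s + strlen(s);
        while (e > s && isspace((unsigned char) e[-1]))
                e--;
        *e = '\0';

        return s;
}

/* Returns a newly allocated trimmed copy of 'line', or NULL on OOM. */
static char *trimmed_copy_sketch(const char *line) {
        char *buf, *trimmed, *ret;
        size_t n;

        assert(line);

        n = strlen(line) + 1;
        buf = malloc(n);
        if (!buf)
                return NULL;
        memcpy(buf, line, n);

        /* Store the result in a second pointer, so that 'buf' still holds
         * the address malloc() returned when it is passed to free(). */
        trimmed = strstrip_sketch(buf);

        n = strlen(trimmed) + 1;
        ret = malloc(n);
        if (ret)
                memcpy(ret, trimmed, n);

        free(buf);
        return ret;
}
```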
Before this commit, if `original_path` is given,
it will always be used to overwrite `path`.
After this commit, it's controlled by the newly-added
switch `overwrite_with_origin`.
With certain fstabs we may propagate ENXIO from the $SYSTEMD_SYSFS_CHECK
check all the way up, making fstab-generator exit with a non-zero EC and
without any helpful message, which is really confusing.
The confext concept is an extension of the existing sysext concept and
allows to extend the host's filesystem or a unit's filesystem with signed
images that add new files to the /etc/ directory using OverlayFS.
The release file that accompanies the confext images needs to be
host compatible to be able to be merged into the host /etc/ directory.
This commit checks for version compatibility between the image file and
the host file.
Adds a new image type called IMAGE_CONFEXT which is similar to IMAGE_SYSEXT but works
for the /etc/ directory instead of /usr/ and /opt/. This commit also adds the ability to
parse the release file that is present with the confext image in /etc/confext-release.d/
directory.
If we don't find a single useful partition table, refuse dissection.
(Except in systemd-dissect, when we are supposed to show DDI
information, in that case allow this to run and show general DDI
information, i.e. size, UUID and name at least)
This allows unprivileged validation of DDIs. Only superficial structure,
i.e. not mounting or so. This becomes particularly handy in the
integration tests, and to validate image policies.
This is to dissect_image_file() what dissect_loop_device_and_warn() is
to dissect_loop_device(), i.e. it dissects the image file and logs an
error string if that fails instead of just returning an error.
Fixes:
- Comment style
- Alignment style
- cleanup macro usage
- incorrect error message[1]
1. Thanks to tempusfugit991@gmail.com for pointing out the error
message mistake.
Signed-off-by: William Roberts <william.c.roberts@intel.com>
None of the existing test files fit very well. test-unit-serialize is
pretty close, but it does special cgroup setup, which we don't need in
this case. I hope we can add more tests in the future for this basic
functionality, so I'm adding a brand new file named after the source file
it's testing.
Move the tests that link to libcore into a separate subgroup.
They are special and it makes sense to keep them together. While
at it, make the list alphabetical.
Also, merge the list additions into one. No idea why it was like that.
If a root directory is specified, and e.g. /var under the root directory
is a symlink to the host's /var, then we wrongly read host's machine ID,
even if O_NOFOLLOW is set.
Let's chase the path with CHASE_NOFOLLOW to refuse such a case.
Also, refuse null ID, otherwise we may set up the machine ID with NULL.
Previously, when the NULL (all zero) machine ID is configured in the
container, nspawn refused to execute.
Now id128_get_machine() is used, so NULL machine ID is refused with
-ENOMEDIUM, and we fall back to the specified UUID or a randomly generated one.