Commit graph

187 commits

Author SHA1 Message Date
Mike Yuan 9c025022d9
core/service: store BUSERROR= & VARLINKERROR= received through notification
Closes #6073
2024-06-20 19:03:44 +02:00
Zbigniew Jędrzejewski-Szmek 75ced6d5ee various: update links to usr-merge 2024-05-28 14:48:56 +02:00
Zbigniew Jędrzejewski-Szmek f81af0b082 man: update links to "New Control Group Interfaces" 2024-05-28 14:46:44 +02:00
Lennart Poettering 3c1d1ca146 manager: switch service unit type over to using new handoff timestamping logic
Also: rename Handover → Handoff. I think it makes it clearer that this
is not really about handing over any resources, but that the executor is
out off the game from that point on.
2024-04-25 13:40:41 +02:00
Luca Boccassi c75c8a38b8 man: document service types that record ExecMainHandoverTimestamp
Follow-up for 93cb78aee2
2024-04-24 07:55:37 +02:00
Mike Yuan 844863c61e
core/manager: add unmerged-bin taint 2024-04-24 08:43:08 +08:00
Mike Yuan ea81442892
core/manager: rearrange taint tags 2024-04-24 08:40:25 +08:00
Mike Yuan 2b28dfe6e6
core/manager: drop obsolete cgroup taint string
Wwe can't boot on systems without cgroup anyway
(even cgroup v1 will be gone pretty soon).
2024-04-24 08:39:29 +08:00
Luca Boccassi 93cb78aee2 core: add ExecMainHandoverTimestamp property recording time-of-execve
Enable the exec_fd logic for Type=notify* services too, and change it
to send a timestamp instead of a '1' byte. Record the timestamp in a
new ExecMainHandoverTimestamp property so that users can track accurately
when control is handed over from systemd to the service payload, so
that latency and startup performance can be trivially and accurately
tracked and attributed.
2024-04-22 15:16:05 +02:00
Luca Boccassi ef5f7f9437 systemctl: add --clean= values to documentation and shell completion 2024-04-18 14:07:07 +02:00
Luca Boccassi b3f548615f core: rename SoftRebootStartTimestamp -> ShutdownStartTimestamp and generalize
Follow-up for 54f86b86ba
2024-04-17 18:19:27 +01:00
Luca Boccassi 95a289bfe7 man: mention initial value of SoftRebootsCount
Follow-up for 66f35161f6
2024-04-16 00:26:04 +01:00
Luca Boccassi 66f35161f6 core: add counter for soft-reboot iterations
Allow to query via D-Bus how many times the current booted system has
been soft rebooted
2024-03-27 01:27:35 +00:00
Luca Boccassi 54f86b86ba core: add SoftRebootStartTimestamp
Will be useful to calculate how long it took to shut down the system before starting
in the new root
2024-03-27 01:25:49 +00:00
Jakub Sitnicki 97df75d7bd socket: pass socket FDs to all ExecXYZ= commands but ExecStartPre=
Today listen file descriptors created by socket unit don't get passed to
commands in Exec{Start,Stop}{Pre,Post}= socket options.

This prevents ExecXYZ= commands from accessing the created socket FDs to do
any kind of system setup which involves the socket but is not covered by
existing socket unit options.

One concrete example is to insert a socket FD into a BPF map capable of
holding socket references, such as BPF sockmap/sockhash [1] or
reuseport_sockarray [2]. Or, similarly, send the file descriptor with
SCM_RIGHTS to another process, which has access to a BPF map for storing
sockets.

To unblock this use case, pass ListenXYZ= file descriptors to ExecXYZ=
commands as listen FDs [4]. As an exception, ExecStartPre= command does not
inherit any file descriptors because it gets invoked before the listen FDs
are created.

This new behavior can potentially break existing configurations. Commands
invoked from ExecXYZ= might not expect to inherit file descriptors through
sd_listen_fds protocol.

To prevent breakage, add a new socket unit parameter,
PassFileDescriptorsToExec=, to control whether ExecXYZ= programs inherit
listen FDs.

[1] https://docs.kernel.org/bpf/map_sockmap.html
[2] https://lore.kernel.org/r/20180808075917.3009181-1-kafai@fb.com
[3] https://man.archlinux.org/man/socket.7#SO_INCOMING_CPU
[4] https://www.freedesktop.org/software/systemd/man/latest/sd_listen_fds.html
2024-03-27 01:41:26 +08:00
Mike Yuan 1ea275f119 core/cgroup: introduce MemoryZSwapWriteback setting
Added in
501a06fe8e
2024-03-13 23:36:25 +00:00
Frantisek Sumsal 43b238f1c1 man: suffix signals with ()
Since signals can take arguments, let's suffix them with () as we
already do with functions. To make sure we remain consistent, make the
`update-dbus-docs.py` script check & fix any occurrences where this is
not the case.

Resolves: #31002
2024-01-23 16:27:50 +01:00
Luca Boccassi d156e66f82 man: add more suggestions on how to use StartUnit and JobRemoved
This is not immediately clear for users, so spell out the preferred pattern
clearly in the D-Bus documentation.
2024-01-18 17:22:12 +00:00
Lennart Poettering 658dc909dc man: fix references to systemd.exec(5)
For some reason the section for the systemd.exec man page was added
incorrectly and then copypasted everywhere else incorrectly too. Let's
fix that.
2024-01-11 12:19:44 +00:00
Yu Watanabe 82a1597778
Merge pull request #28797 from Werkov/eff_limits
Add MemoryMaxEffective=, MemoryHighEffective= and TasksMaxEff…  …ective= properties
2024-01-04 05:38:06 +09:00
Michal Sekletar 84c01612de core/manager: add dbus API to create auxiliary scope from running service
This commit introduces new D-Bus API, StartAuxiliaryScope(). It may be
used by services as part of the restart procedure. Service sends an
array of PID file descriptors corresponding to processes that are part
of the service and must continue running also after service restarts,
i.e. they haven't finished the job why they were spawned in the first
place (e.g. long running video transcoding job). Systemd creates new
scope unit for these processes and migrates them into it. Cgroup
properties of scope are copied from the service so it retains same
cgroup settings and limits as service had.
2024-01-03 13:50:41 +01:00
Michal Koutný 4fb0d2dc14 cgroup: Add EffectiveMemoryMax=, EffectiveMemoryHigh= and EffectiveTasksMax= properties
Users become perplexed when they run their workload in a unit with no
explicit limits configured (moreover, listing the limit property would
even show it's infinity) but they experience unexpected resource
limitation.

The memory and pid limits come as the most visible, therefore add new
unit read-only properties:
- EffectiveMemoryMax=,
- EffectiveMemoryHigh=,
- EffectiveTasksMax=.

These properties represent the most stringent limit systemd is aware of
for the given unit -- and that is typically(*) the effective value.

Implement the properties by simply traversing all parents in the
leaf-slice tree and picking the minimum value. Note that effective
limits are thus defined even for units that don't enable explicit
accounting (because of the hierarchy).

(*) The evasive case is when systemd runs in a cgroupns and cannot
reason about outer setup. Complete solution would need kernel support.
2024-01-03 13:37:08 +01:00
David Tardon eea10b26f7 man: use same version in public and system ident. 2023-12-25 15:51:47 +01:00
Andrew Sayers ff47602f5e Fix a typo in the org.freedesktop.systemd1 man page 2023-12-15 07:39:05 +09:00
Luca Boccassi 9e615fa3aa core: add WantsMountsFor=
This is the equivalent of RequiresMountsFor=, but adds Wants= instead
of Requires=. It will be useful for example for the autogenerated
systemd-cryptsetup units.

Fixes https://github.com/systemd/systemd/issues/11646
2023-11-29 11:04:59 +00:00
Yu Watanabe 58cde42f65 core: rename MemoryZswapCurrent -> MemoryZSwapCurrent
Follow-up for 26caa66867.
2023-11-13 13:54:56 +01:00
Florian Schmaus 26caa66867 cgroup: add support for memory.zswap.current 2023-11-12 21:10:40 +01:00
Florian Schmaus 37533c9432 cgroup: add support for memory.swap.current
In systemctl-show we only show current swap if ever swapped or non-zero. This
reduces the noise on swapless systems, that would otherwise always show a swap
value that never has the chance to become non-zero. It further reduces the
noise for services that never swapped.
2023-11-11 12:16:29 +01:00
Florian Schmaus aac3384e56 cgroup: add support for memory.swap.peak 2023-11-11 12:14:07 +01:00
Florian Schmaus 6c71db763c cgroup: add support for memory.peak
Linux's Control Group v2 interfaces exposes memory.peak, which contains the
"max memory usage recorded for the cgroup and its descendants since the
creation of the cgroup."

This commit adds a new property "MemoryPeak" for units and makes "systemctl
show" display this value if it is available.

Fixes #29878.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
2023-11-06 18:08:33 +01:00
Lennart Poettering cde8cc946b
Merge pull request #29272 from enr0n/coredump-container
coredump: support forwarding coredumps to containers
2023-10-16 16:13:16 +02:00
Luca Boccassi 7c83d42ef8 mount-util: use mount beneath to replace previous namespace mount
Instead of mounting over, do an atomic swap using mount beneath, if
available. This way assets can be mounted again and again (e.g.:
updates) without leaking mounts.
2023-10-16 14:33:47 +01:00
Nick Rosbrook cfc015f09e man: document CoredumpReceive= setting 2023-10-13 15:28:50 -04:00
Mike Yuan 854eca4a95
core/execute: always set $USER and introduce SetLoginEnvironment=
Before this commit, $USER, $HOME, $LOGNAME and $SHELL are only
set when User= is set for the unit. For system service, this
results in different behaviors depending on whether User=root is set.

$USER always makes sense on its own, so let's set it unconditionally.
Ideally $HOME should be set too, but it causes trouble when e.g. getty
passes '-p' to login(1), which then doesn't override $HOME. $LOGNAME and
$SHELL are more like "login environments", and are generally not
suitable for system services. Therefore, a new option SetLoginEnvironment=
is also added to control the latter three variables.

Fixes #23438

Replaces #8227
2023-10-10 00:00:26 +08:00
Luca Boccassi 559214cbbd pid1: add SurviveFinalKillSignal= to skip units on final sigterm/sigkill spree
Add a new boolean for units, SurviveFinalKillSignal=yes/no. Units that
set it will not have their process receive the final sigterm/sigkill in
the shutdown phase.

This is implemented by checking if a process is part of a cgroup marked
with a user.survive_final_kill_signal xattr (or a trusted xattr if we
can't set a user one, which were added only in kernel v5.7 and are not
supported in CentOS 8).
2023-09-28 13:48:14 +01:00
Mike Yuan 6bd8340d11
man/org.freedesktop.systemd1: add version info for NFTSet
Follow-up for dc7d69b3c1
2023-09-28 03:04:28 +08:00
Topi Miettinen dc7d69b3c1 core: firewall integration of cgroups with NFTSet=
New directive `NFTSet=` provides a method for integrating dynamic cgroup IDs
into firewall rules with NFT sets. The benefit of using this setting is to be
able to use control group as a selector in firewall rules easily and this in
turn allows more fine grained filtering. Also, NFT rules for cgroup matching
use numeric cgroup IDs, which change every time a service is restarted, making
them hard to use in systemd environment.

This option expects a whitespace separated list of NFT set definitions. Each
definition consists of a colon-separated tuple of source type (only "cgroup"),
NFT address family (one of "arp", "bridge", "inet", "ip", "ip6", or "netdev"),
table name and set name. The names of tables and sets must conform to lexical
restrictions of NFT table names. The type of the element used in the NFT filter
must be "cgroupsv2". When a control group for a unit is realized, the cgroup ID
will be appended to the NFT sets and it will be be removed when the control
group is removed.  systemd only inserts elements to (or removes from) the sets,
so the related NFT rules, tables and sets must be prepared elsewhere in
advance.  Failures to manage the sets will be ignored.

If the firewall rules are reinstalled so that the contents of NFT sets are
destroyed, command systemctl daemon-reload can be used to refill the sets.

Example:

```
table inet filter {
...
        set timesyncd {
                type cgroupsv2
        }

        chain ntp_output {
                socket cgroupv2 != @timesyncd counter drop
                accept
        }
...
}
```

/etc/systemd/system/systemd-timesyncd.service.d/override.conf
```
[Service]
NFTSet=cgroup:inet:filter:timesyncd
```

```
$ sudo nft list set inet filter timesyncd
table inet filter {
        set timesyncd {
                type cgroupsv2
                elements = { "system.slice/systemd-timesyncd.service" }
        }
}
```
2023-09-27 18:10:11 +00:00
Luca Boccassi 4c9a288154 man: document SystemState's possible values 2023-09-25 22:55:54 +01:00
Abderrahim Kitouni d9d2d16aea man: add version information for dbus interfaces
These only go back to version 250 which is the first version to provide the
export-dbus-interfaces build target.
2023-09-19 14:33:34 +01:00
Lennart Poettering 2bec84e7a5 core: add new "PollLimit" settings to .socket units
This adds a new "PollLimit" pair of settings to .socket units, very
similar to existing "TriggerLimit" logic. The differences are:

* PollLimit focusses on the polling on the sockets, and pauses that
  temporarily if a ratelimit on that is reached. TriggerLimit otoh
  focusses on the triggering effect of socket units, and stops
  triggering once the ratelimit is hit.

* While the trigger limit being hit is an action that causes the socket
  unit to fail the polling limit being reached will just temporarily
  disable polling on the socket fd, and it is resumed once the ratelimit
  interval is over.

* When a socket unit operates on multiple socket fds (e,g, ListenStream=
  on both some ipv6 and an ipv4 address or so). Then the PollLimit will
  be specific to each fd, while the trigger limit is specific to the
  whole unit.

Implementation-wise this is mostly a wrapper around sd-event's
sd_event_source_set_ratelimit(), which exposes the desired behaviour
directly.

Usecase for all of this: socket services which when overloaded with
connections should just slow down reception of it, but not fail
persistently.
2023-09-18 18:55:19 +02:00
Michal Koutný 055665d596 dbus: Document org.freedesktop.systemd1.Service.MemoryAvailable property
The value is an optimistic estimate, make it clear in the docs.
2023-09-09 10:42:38 +02:00
Abderrahim Kitouni ec07c3c80b man: add version info
This tries to add information about when each option was added. It goes
back to version 183.

The version info is included from a separate file to allow generating it,
which would allow more control on the formatting of the final output.
2023-08-29 14:07:24 +01:00
Luca Boccassi b0d3095fd6 Drop split-usr and unmerged-usr support
As previously announced, execute order 66:

https://lists.freedesktop.org/archives/systemd-devel/2022-September/048352.html

The meson options split-usr, rootlibdir and rootprefix become no-ops
that print a warning if they are set to anything other than the
default values. We can remove them in a future release.
2023-07-28 19:34:03 +01:00
Luca Boccassi 3835b9aa4b Revert "core: add IgnoreOnSoftReboot= unit option"
The feature is not ready, postpone it

This reverts commit b80fc61e89.
2023-07-22 23:27:27 +01:00
Luca Boccassi b80fc61e89 core: add IgnoreOnSoftReboot= unit option
As it says on the tin, configures the unit to survive a soft reboot.
Currently all the following options have to be set by hand:

Conflicts=reboot.target kexec.target poweroff.target halt.target
Before=reboot.target kexec.target poweroff.target halt.target
After=sysinit.target basic.target
DefaultDependencies=no
IgnoreOnIsolate=yes

This is not very user friendly. If new default dependencies are added,
or new shutdown/reboot types, they also have to be added manually.

The new option is much simpler, easy to find, and does the right thing
by default.
2023-07-21 18:05:41 +02:00
Luca Boccassi b2deaaf01b
Merge pull request #27584 from rphibel/add-restartquick-option
service: add new RestartMode option
2023-07-06 20:37:31 +01:00
Richard Phibel e568fea9fc service: add new RestartMode option
When this option is set to direct, the service restarts without entering a failed
state. Dependent units are not notified of transitory failure.

This is useful for the following use case:

We have a target with Requires=my-service, After=my-service.
my-service.service is a oneshot service and has Restart=on-failure in
its definition.

my-service.service can get stuck for various reasons and time out, in
which case it is restarted. Currently, when it fails the first time, the
target fails, even though my-service is restarted.

The behavior we're looking for is that until my-service is not restarted
anymore, the target stays pending waiting for my-service.service to
start successfully or fail without being restarted anymore.
2023-07-06 14:33:52 +02:00
Daniel P. Berrangé 1257274ad8 dbus: add 'ConfidentialVirtualization' property to manager object
This property reports whether the system is running inside a confidential
virtual machine.

Related: https://github.com/systemd/systemd/issues/27604
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2023-07-06 12:20:04 +01:00
Daan De Meyer 9c0c670125 core: Add RootEphemeral= setting
This setting allows services to run in an ephemeral copy of the root
directory or root image. To make sure the ephemeral copies are always
cleaned up, we add a tmpfiles snippet to unconditionally clean up
/var/lib/systemd/ephemeral. To prevent in use ephemeral copies from
being cleaned up by tmpfiles, we use the newly added COPY_LOCK_BSD
and BTRFS_SNAPSHOT_LOCK_BSD flags to take a BSD lock on the ephemeral
copies which instruct tmpfiles to not touch those ephemeral copies as
long as the BSD lock is held.
2023-06-21 12:48:46 +02:00
licunlong a068eeac6f core/dbus-manager: also show DefaultIOAccounting and DefaultIPAccounting
fix: https://github.com/systemd/systemd/issues/28045
2023-06-19 09:57:11 +02:00