Merge pull request #26393 from poettering/mempress

watch and act on memory pressure in most of our long-running services, including PID 1
Luca Boccassi 2023-03-01 12:28:12 +00:00 committed by GitHub
commit adee01643d
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
53 changed files with 1132 additions and 30 deletions

5
TODO

@ -159,6 +159,11 @@ Features:
invokes systemd-mount and exits. This is then useful to use in
ENV{SYSTEMD_WANTS} in udev rules, and a bit prettier than using RUN+=
* udevd: extend memory pressure logic: also kill any idle worker processes
* SIGRTMIN+18 and memory pressure handling should still be added to: hostnamed,
localed, oomd, timedated.
* sd-journal puts a limit on parallel journal files to view at once. journald
should probably honour that same limit (JOURNAL_FILES_MAX) when vacuuming to
ensure we never generate more files than we can actually view.

240
docs/MEMORY_PRESSURE.md Normal file

@ -0,0 +1,240 @@
---
title: Memory Pressure Handling
category: Interfaces
layout: default
SPDX-License-Identifier: LGPL-2.1-or-later
---
# Memory Pressure Handling in systemd
When the system is under memory pressure (i.e. some component of the OS
requires memory allocation but there is only very little or none available),
it can attempt various things to make more memory available again ("reclaim"):
* The kernel can flush out memory pages backed by files on disk, under the
knowledge that it can reread them from disk when needed again. Candidate
pages are the many memory mapped executable files and shared libraries on
disk, among others.
* The kernel can flush out memory pages not backed by files on disk
("anonymous" memory, i.e. memory allocated via `malloc()` and similar calls,
or `tmpfs` file system contents) if there's swap to write it to.
* Userspace can proactively release memory it allocated but doesn't immediately
require back to the kernel. This includes allocation caches, and other forms
of caches that are not required for normal operation to continue.
The latter is what we want to focus on in this document: how to ensure
userspace processes can detect mounting memory pressure early and release memory
back to the kernel as it happens, relieving the memory pressure before it
becomes too critical.
The effects of memory pressure during runtime generally are growing latencies
during operation: when a program requires memory but the system is busy writing
out memory to (relatively slow) disks in order to make some available, this
generally surfaces in scheduling latencies, and applications and services will
slow down until memory pressure is relieved. Hence, to ensure stable service
latencies it is essential to release unneeded memory back to the kernel early
on.
On Linux the [Pressure Stall Information
(PSI)](https://docs.kernel.org/accounting/psi.html) Linux kernel interface is
the primary way to determine whether the system or a part of it is under memory
pressure. PSI provides a way for userspace to acquire a `poll()`-able file
descriptor that gets notifications whenever memory pressure latencies for the
system or for a control group grow beyond some level.
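
As a rough illustration of the raw kernel interface (independent of the service protocol described further down), the following sketch registers a trigger on a PSI file and waits for a single notification. The trigger format (`some <stall µs> <window µs>`) follows the kernel PSI documentation; the helper names `psi_format_trigger()` and `psi_wait_once()` and the 150ms/1s values in the usage note are assumptions made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: formats a PSI trigger such as "some 150000 1000000"
 * into buf. Returns the string length, or -1 if buf is too small. */
int psi_format_trigger(char *buf, size_t size, const char *kind,
                       unsigned long stall_us, unsigned long window_us) {
        assert(buf);
        assert(kind);

        int n = snprintf(buf, size, "%s %lu %lu", kind, stall_us, window_us);
        return (n < 0 || (size_t) n >= size) ? -1 : n;
}

/* Hypothetical helper: registers the trigger on a PSI file and blocks until
 * it fires once. Returns 0 on an event, a negative errno-style value on
 * failure. */
int psi_wait_once(const char *path, const char *trigger) {
        assert(path);
        assert(trigger);

        int fd = open(path, O_RDWR | O_NONBLOCK);
        if (fd < 0)
                return -errno;

        /* The example in the kernel documentation writes the trailing NUL
         * byte as well, hence strlen() + 1. */
        if (write(fd, trigger, strlen(trigger) + 1) < 0) {
                int r = -errno;
                close(fd);
                return r;
        }

        struct pollfd p = { .fd = fd, .events = POLLPRI };
        int r = poll(&p, 1, -1) < 0 ? -errno :
                (p.revents & POLLERR) ? -EIO : 0;

        close(fd);
        return r;
}
```

A caller might format a trigger with `psi_format_trigger(buf, sizeof buf, "some", 150000, 1000000)` and pass it to `psi_wait_once("/proc/pressure/memory", buf)`. Opening that file for writing typically requires privileges, which is one reason the per-service delegation described below is useful.
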
`systemd` itself makes use of PSI, and helps applications to do so
too. Specifically:
* Most of systemd's long running components watch for PSI memory pressure
events, and release allocation caches and other resources once seen.
* systemd's service manager provides a protocol for asking services to listen
to PSI events and configure the appropriate pressure thresholds.
* systemd's `sd-event` event loop API provides a high-level call
`sd_event_add_memory_pressure()` which allows programs using it to
efficiently hook into the PSI memory pressure protocol provided by the
service manager, with very few lines of code.
## Memory Pressure Service Protocol
If memory pressure handling for a specific service is enabled via
`MemoryPressureWatch=` the memory pressure service protocol is used to tell the
service code about this. Specifically two environment variables are set by the
service manager, and typically consumed by the service:
* The `$MEMORY_PRESSURE_WATCH` environment variable will contain an absolute
path in the file system to the file to watch for memory pressure events. This
will usually point to a PSI file such as the `memory.pressure` file of the
service's cgroup. In order to make debugging easier, and allow later
extension it is recommended for applications to also allow this path to refer
to an `AF_UNIX` stream socket in the file system or a FIFO inode in the file
system. Regardless of which of the three types of inodes this absolute path
refers to, all three are `poll()`-able for memory pressure events. The
variable can also be set to the literal string `/dev/null`. If so the service
code should take this as an indication that memory pressure monitoring is not
desired and should be turned off.
* The `$MEMORY_PRESSURE_WRITE` environment variable is optional. If set by the
service manager it contains Base64 encoded data (that may contain arbitrary
binary values, including NUL bytes) that should be written into the path
provided via `$MEMORY_PRESSURE_WATCH` right after opening it. Typically, if
talking directly to a PSI kernel file this will contain information about the
threshold settings configurable in the service manager.
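
Since `$MEMORY_PRESSURE_WRITE` may decode to arbitrary binary data, a service that implements the protocol by hand needs a Base64 decoder that reports the decoded length explicitly. The following is a minimal sketch of one, assuming standard padded Base64 without embedded whitespace; `b64_value()` and `b64_decode()` are names invented for this example and are not part of any systemd API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Maps one Base64 character to its 6-bit value, or -1 if it is not part of
 * the standard alphabet. */
int b64_value(char c) {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1;
}

/* Decodes padded Base64 text. On success returns a malloc()ed buffer and
 * stores its length in *ret_size; returns NULL on malformed input or OOM.
 * The result may contain NUL bytes, hence the explicit size. */
unsigned char *b64_decode(const char *s, size_t *ret_size) {
        assert(s);
        assert(ret_size);

        size_t len = strlen(s);
        if (len % 4 != 0)
                return NULL;

        unsigned char *out = malloc(len / 4 * 3 + 1);
        if (!out)
                return NULL;

        size_t n = 0;
        for (size_t i = 0; i < len; i += 4) {
                int v[4], pad = 0;

                for (int j = 0; j < 4; j++) {
                        char c = s[i + j];

                        /* '=' is only valid in the last two positions of the
                         * final group, and nothing may follow it. */
                        if (c == '=' && i + 4 == len && j >= 2) {
                                v[j] = 0;
                                pad++;
                        } else if (pad > 0 || (v[j] = b64_value(c)) < 0) {
                                free(out);
                                return NULL;
                        }
                }

                unsigned long w = ((unsigned long) v[0] << 18) |
                                  ((unsigned long) v[1] << 12) |
                                  ((unsigned long) v[2] << 6) |
                                  (unsigned long) v[3];

                out[n++] = (w >> 16) & 0xFF;
                if (pad < 2)
                        out[n++] = (w >> 8) & 0xFF;
                if (pad < 1)
                        out[n++] = w & 0xFF;
        }

        *ret_size = n;
        return out;
}
```

The decoded buffer would then be passed to `write()` together with the returned size, never to string functions such as `strlen()`.
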
When a service initializes it hence should look for
`$MEMORY_PRESSURE_WATCH`. If set, it should try to open the specified path. If
it detects the path to refer to a regular file it should assume it refers to a
PSI kernel file. If so, it should write the data from `$MEMORY_PRESSURE_WRITE`
into the file descriptor (after Base64-decoding it, and only if the variable is
set) and then watch for `POLLPRI` events on it. If it detects that the path refers
to a FIFO inode, it should open it, write the `$MEMORY_PRESSURE_WRITE` data
into it (as above) and then watch for `POLLIN` events on it. Whenever `POLLIN`
is seen it should read and discard any data queued in the FIFO. If the path
refers to an `AF_UNIX` socket in the file system, the application should
`connect()` a stream socket to it, write `$MEMORY_PRESSURE_WRITE` into it (as
above) and watch for `POLLIN`, discarding any data it might receive.
To summarize:
* If `$MEMORY_PRESSURE_WATCH` points to a regular file: open and watch for
`POLLPRI`, never read from the file descriptor.
* If `$MEMORY_PRESSURE_WATCH` points to a FIFO: open and watch for `POLLIN`,
read/discard any incoming data.
* If `$MEMORY_PRESSURE_WATCH` points to an `AF_UNIX` socket: connect and watch
for `POLLIN`, read/discard any incoming data.
* If `$MEMORY_PRESSURE_WATCH` contains the literal string `/dev/null`, turn off
memory pressure handling.
(And in each case, immediately after opening/connecting to the path, write the
decoded `$MEMORY_PRESSURE_WRITE` data into it.)
Whenever a `POLLPRI`/`POLLIN` event is seen the service is under memory
pressure. It should use this as a hint to release suitable redundant resources,
for example:
* glibc's memory allocation cache, via
  [`malloc_trim()`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html). Similarly,
  allocation caches implemented in the service itself.
* Any other local caches, such as DNS caches, or web caches (in particular if
  the service is a web browser).
* Terminate any idle worker threads or processes.
* Run a garbage collection (GC) cycle, if the programming language supports that.
* Terminate the process if idle, and if it can be automatically started when
needed next.
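
To make the first two items concrete, here is a minimal sketch of a release routine for a service that keeps its own cache of reusable buffers. The `BufferCache` type and both functions are hypothetical and stand in for whatever caches a real service maintains; `malloc_trim()` is specific to glibc.

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

#define CACHE_MAX 16

/* Hypothetical service-local cache of buffers kept around for reuse */
typedef struct BufferCache {
        void *items[CACHE_MAX];
        size_t n_items;
} BufferCache;

/* Keeps a buffer for later reuse, or frees it right away if the cache is full. */
void buffer_cache_put(BufferCache *c, void *p) {
        assert(c);

        if (c->n_items < CACHE_MAX)
                c->items[c->n_items++] = p;
        else
                free(p);
}

/* To be called when a memory pressure event is seen. Drops the service's own
 * cache first and then asks glibc to hand unused heap memory back to the
 * kernel. Returns the number of cached buffers that were released. */
size_t release_on_memory_pressure(BufferCache *c) {
        assert(c);

        size_t n = c->n_items;

        while (c->n_items > 0)
                free(c->items[--c->n_items]);

        (void) malloc_trim(0);
        return n;
}
```

Dropping the application-level cache before calling `malloc_trim()` matters in this sketch: memory that is still referenced from the cache cannot be returned to the kernel by the allocator.
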
Which actions precisely to take depends on the service in question. Note that
the notifications are delivered when memory allocation latency has already
degraded beyond some point. Hence, when deciding which resources to keep and
which ones to discard, keep in mind that the latency of recovering discarded
resources at a later point is typically acceptable, given that latencies are
*already* affected negatively.
In case the path supplied via `$MEMORY_PRESSURE_WATCH` points to a PSI kernel
API file, or to an `AF_UNIX` socket, opening it multiple times is safe and reliable,
and should deliver notifications to each of the opened file descriptors. This
is specifically useful for services that consist of multiple processes, and
where each of them shall be able to release resources on memory pressure.
The `POLLPRI`/`POLLIN` conditions will be triggered every time memory pressure
is detected, but not continuously. It is thus safe to keep `poll()`-ing on the
same file descriptor continuously, and to execute resource release operations
whenever the file descriptor triggers, without having to expect overloading the
process.
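
The per-wakeup handling can be sketched as follows, assuming the file descriptor was switched to non-blocking mode after opening. The function names are invented for this example. For a PSI kernel file the descriptor is never read; for a FIFO or socket any queued data is read and thrown away so that `POLLIN` does not stay asserted.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Switches a file descriptor to non-blocking mode. Returns 0 or -errno. */
int set_nonblocking(int fd) {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -errno;

        return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -errno : 0;
}

/* Reads and discards everything currently queued on a non-blocking fd.
 * Returns the number of bytes discarded, or a negative errno-style value. */
long drain_fd(int fd) {
        char buf[256];
        long total = 0;

        assert(fd >= 0);

        for (;;) {
                ssize_t n = read(fd, buf, sizeof buf);
                if (n > 0) {
                        total += n;
                        continue;
                }
                if (n == 0) /* EOF, the peer closed its end */
                        return total;
                if (errno == EINTR)
                        continue;
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return total; /* nothing left to discard */
                return -errno;
        }
}

/* Handles one poll() wake-up. is_psi says whether fd refers to a PSI kernel
 * file. Returns true if the caller should release resources now. */
bool handle_pressure_wakeup(int fd, bool is_psi, short revents) {
        if (is_psi)
                return (revents & POLLPRI) != 0;

        if (!(revents & POLLIN))
                return false;

        (void) drain_fd(fd);
        return true;
}
```
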
(Currently, the protocol defined here only allows configuration of a single
"degree" of memory pressure; there's no distinction made on how strong the
pressure is. In the future, if it becomes apparent that there's a clear need to
extend this we might eventually add different degrees, most likely by adding
additional environment variables such as `$MEMORY_PRESSURE_WRITE_LOW` and
`$MEMORY_PRESSURE_WRITE_HIGH` or similar, which may contain different settings
for lower or higher memory pressure thresholds.)
## Service Manager Settings
The service manager provides two per-service settings that control the memory
pressure handling:
* The
[`MemoryPressureWatch=`](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryPressureWatch=)
setting controls whether to enable the memory pressure protocol for the
service in question.
* The `MemoryPressureThresholdSec=` setting configures the threshold at which
  memory pressure is signalled to the service. It takes a time value
  (usually in the millisecond range) that defines a threshold per 1s time
  window: if memory allocation latencies grow beyond this threshold,
  notifications are generated towards the service, requesting it to release
  resources.
The `/etc/systemd/system.conf` file provides two settings that may be used to
select the default values for the above settings. If the threshold is neither
configured via the per-service nor via the default system-wide option, it
defaults to 100ms.
When memory pressure monitoring is enabled for a service via
`MemoryPressureWatch=` this primarily does three things:
* It enables cgroup memory accounting for the service (this is a requirement
  for per-cgroup PSI).
* It sets the aforementioned two environment variables for processes invoked
for the service, based on the control group of the service and provided
settings.
* The `memory.pressure` PSI control group file associated with the service's
cgroup is delegated to the service (i.e. permissions are relaxed so that
unprivileged service payload code can open the file for writing).
## Memory Pressure Events in `sd-event`
The
[`sd-event`](https://www.freedesktop.org/software/systemd/man/sd-event.html)
event loop library provides two API calls that encapsulate the
functionality described above:
* The
[`sd_event_add_memory_pressure()`](https://www.freedesktop.org/software/systemd/man/sd_event_add_memory_pressure.html)
call implements the service-side of the memory pressure protocol and
integrates it with an `sd-event` event loop. It reads the two environment
  variables, connects/opens the specified file, writes the specified data
to it and then watches for events.
* The `sd_event_trim_memory()` call may be called to trim the calling
  process's memory. It's a wrapper around glibc's `malloc_trim()`, but first
  releases allocation caches maintained by libsystemd internally. If `NULL` is
  passed as the callback function to `sd_event_add_memory_pressure()`, this
  function is used as the default implementation.
Making use of this, in order to hook up a service using `sd-event` with
automatic memory pressure handling, it's typically sufficient to add a line
such as:
```c
(void) sd_event_add_memory_pressure(event, NULL, NULL, NULL);
```
right after allocating the event loop object `event`.
## Other APIs
Other programming environments might have native APIs to watch memory
pressure/low memory events. Most notable is probably GLib's
[GMemoryMonitor](https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html). It
currently uses the per-system Linux PSI interface as backend, but it operates
differently than the above: memory pressure events are picked up by a system
service, which then propagates this through D-Bus to the applications. This is
typically less than ideal, since this means each notification event has to
travel through three processes before being handled, and this creates
additional latencies at a time when the system is already experiencing adverse
latencies. Moreover, it focusses on system-wide PSI events, even though
service-local ones are generally the better approach.


@ -529,6 +529,10 @@ node /org/freedesktop/systemd1 {
readonly t DefaultLimitRTTIMESoft = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t DefaultTasksMax = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t DefaultMemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s DefaultMemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly t TimerSlackNSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -782,6 +786,10 @@ node /org/freedesktop/systemd1 {
<!--property DefaultTasksMax is not documented!-->
<!--property DefaultMemoryPressureThresholdUSec is not documented!-->
<!--property DefaultMemoryPressureWatch is not documented!-->
<!--property TimerSlackNSec is not documented!-->
<!--property DefaultOOMPolicy is not documented!-->
@ -1208,6 +1216,10 @@ node /org/freedesktop/systemd1 {
<variablelist class="dbus-property" generated="True" extra-ref="DefaultTasksMax"/>
<variablelist class="dbus-property" generated="True" extra-ref="DefaultMemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="DefaultMemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="TimerSlackNSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="DefaultOOMPolicy"/>
@ -2803,6 +2815,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2eservice {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly as Environment = ['...', ...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -3395,6 +3411,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2eservice {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--property EnvironmentFiles is not documented!-->
<!--property PassEnvironment is not documented!-->
@ -3995,6 +4015,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2eservice {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="Environment"/>
<variablelist class="dbus-property" generated="True" extra-ref="EnvironmentFiles"/>
@ -4747,6 +4771,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2esocket {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly as Environment = ['...', ...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -5359,6 +5387,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2esocket {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--property EnvironmentFiles is not documented!-->
<!--property PassEnvironment is not documented!-->
@ -5949,6 +5981,10 @@ node /org/freedesktop/systemd1/unit/avahi_2ddaemon_2esocket {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="Environment"/>
<variablelist class="dbus-property" generated="True" extra-ref="EnvironmentFiles"/>
@ -6590,6 +6626,10 @@ node /org/freedesktop/systemd1/unit/home_2emount {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly as Environment = ['...', ...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -7130,6 +7170,10 @@ node /org/freedesktop/systemd1/unit/home_2emount {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--property EnvironmentFiles is not documented!-->
<!--property PassEnvironment is not documented!-->
@ -7638,6 +7682,10 @@ node /org/freedesktop/systemd1/unit/home_2emount {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="Environment"/>
<variablelist class="dbus-property" generated="True" extra-ref="EnvironmentFiles"/>
@ -8406,6 +8454,10 @@ node /org/freedesktop/systemd1/unit/dev_2dsda3_2eswap {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly as Environment = ['...', ...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -8932,6 +8984,10 @@ node /org/freedesktop/systemd1/unit/dev_2dsda3_2eswap {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--property EnvironmentFiles is not documented!-->
<!--property PassEnvironment is not documented!-->
@ -9426,6 +9482,10 @@ node /org/freedesktop/systemd1/unit/dev_2dsda3_2eswap {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="Environment"/>
<variablelist class="dbus-property" generated="True" extra-ref="EnvironmentFiles"/>
@ -10053,6 +10113,10 @@ node /org/freedesktop/systemd1/unit/system_2eslice {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
};
interface org.freedesktop.DBus.Peer { ... };
interface org.freedesktop.DBus.Introspectable { ... };
@ -10219,6 +10283,10 @@ node /org/freedesktop/systemd1/unit/system_2eslice {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--Autogenerated cross-references for systemd.directives, do not edit-->
<variablelist class="dbus-interface" generated="True" extra-ref="org.freedesktop.systemd1.Unit"/>
@ -10391,6 +10459,10 @@ node /org/freedesktop/systemd1/unit/system_2eslice {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<!--End of Autogenerated section-->
<refsect2>
@ -10586,6 +10658,10 @@ node /org/freedesktop/systemd1/unit/session_2d1_2escope {
readonly a(iiqq) SocketBindDeny = [...];
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly (bas) RestrictNetworkInterfaces = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly s MemoryPressureWatch = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("false")
readonly t MemoryPressureThresholdUSec = ...;
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
readonly s KillMode = '...';
@org.freedesktop.DBus.Property.EmitsChangedSignal("const")
@ -10772,6 +10848,10 @@ node /org/freedesktop/systemd1/unit/session_2d1_2escope {
<!--property RestrictNetworkInterfaces is not documented!-->
<!--property MemoryPressureWatch is not documented!-->
<!--property MemoryPressureThresholdUSec is not documented!-->
<!--property KillMode is not documented!-->
<!--property KillSignal is not documented!-->
@ -10974,6 +11054,10 @@ node /org/freedesktop/systemd1/unit/session_2d1_2escope {
<variablelist class="dbus-property" generated="True" extra-ref="RestrictNetworkInterfaces"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureWatch"/>
<variablelist class="dbus-property" generated="True" extra-ref="MemoryPressureThresholdUSec"/>
<variablelist class="dbus-property" generated="True" extra-ref="KillMode"/>
<variablelist class="dbus-property" generated="True" extra-ref="KillSignal"/>


@ -160,6 +160,9 @@
accessible for invocation at any time (see above). This function will log a structured log message at
<constant>LOG_DEBUG</constant> level (with message ID f9b0be465ad540d0850ad32172d57c21) about the memory
pressure operation.</para>
<para>For further details see <ulink url="https://systemd.io/MEMORY_PRESSURE">Memory Pressure Handling in
systemd</ulink>.</para>
</refsect1>
<refsect1>


@ -556,6 +556,18 @@
to configure the rate limit window, and <varname>ReloadLimitBurst=</varname> takes a positive integer to
configure the maximum allowed number of reloads within the configured time window.</para></listitem>
</varlistentry>
<varlistentry>
<term><varname>DefaultMemoryPressureWatch=</varname></term>
<term><varname>DefaultMemoryPressureThresholdSec=</varname></term>
<listitem><para>Configures the default settings for the per-unit
<varname>MemoryPressureWatch=</varname> and <varname>MemoryPressureThresholdSec=</varname>
settings. See
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
for details. Defaults to <literal>auto</literal> and <literal>100ms</literal>, respectively. This
also sets the memory pressure monitoring threshold for the service manager itself.</para></listitem>
</varlistentry>
</variablelist>
</refsect1>


@ -3779,6 +3779,16 @@ StandardInputData=V2XigLJyZSBubyBzdHJhbmdlcnMgdG8gbG92ZQpZb3Uga25vdyB0aGUgcnVsZX
</para></listitem>
</varlistentry>
<varlistentry>
<term><varname>$MEMORY_PRESSURE_WATCH</varname></term>
<term><varname>$MEMORY_PRESSURE_WRITE</varname></term>
<listitem><para>If memory pressure monitoring is enabled for this service unit, the path to watch
and the data to write into it. See <ulink url="https://systemd.io/MEMORY_PRESSURE">Memory Pressure
Handling</ulink> for details about these variables and the service protocol data they
convey.</para></listitem>
</varlistentry>
</variablelist>
<para>For system services, when <varname>PAMName=</varname> is enabled and <command>pam_systemd</command> is part


@ -1169,6 +1169,53 @@ DeviceAllow=/dev/loop-control
</para>
</listitem>
</varlistentry>
<varlistentry>
<term><varname>MemoryPressureWatch=</varname></term>
<listitem><para>Controls memory pressure monitoring for invoked processes. Takes one of
<literal>off</literal>, <literal>on</literal>, <literal>auto</literal> or <literal>skip</literal>. If
<literal>off</literal> tells the service not to watch for memory pressure events, by setting the
<varname>$MEMORY_PRESSURE_WATCH</varname> environment variable to the literal string
<filename>/dev/null</filename>. If <literal>on</literal> tells the service to watch for memory
pressure events. This enables memory accounting for the service, and ensures the
<filename>memory.pressure</filename> cgroup attribute file is accessible for read and write to the
service's user. It then sets the <varname>$MEMORY_PRESSURE_WATCH</varname> environment variable for
processes invoked by the unit to the file system path to this file. The threshold information
configured with <varname>MemoryPressureThresholdSec=</varname> is encoded in the
<varname>$MEMORY_PRESSURE_WRITE</varname> environment variable. If the <literal>auto</literal> value
is set the protocol is enabled if memory accounting is anyway enabled for the unit, and disabled
otherwise. If set to <literal>skip</literal> the logic is neither enabled, nor disabled and the two
environment variables are not set.</para>
<para>Note that services are free to use the two environment variables, but it's unproblematic if
they ignore them. Memory pressure handling must be implemented individually in each service, and
usually means different things for different software. For further details on memory pressure
handling see <ulink url="https://systemd.io/MEMORY_PRESSURE">Memory Pressure Handling in
systemd</ulink>.</para>
<para>Services implemented using
<citerefentry><refentrytitle>sd-event</refentrytitle><manvolnum>3</manvolnum></citerefentry> may use
<citerefentry><refentrytitle>sd_event_add_memory_pressure</refentrytitle><manvolnum>3</manvolnum></citerefentry>
to watch for and handle memory pressure events.</para>
<para>If not explicitly set, defaults to the <varname>DefaultMemoryPressureWatch=</varname> setting in
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>.</para></listitem>
</varlistentry>
<varlistentry>
<term><varname>MemoryPressureThresholdSec=</varname></term>
<listitem><para>Sets the memory pressure threshold time for the memory pressure monitor as configured via
<varname>MemoryPressureWatch=</varname>. Specifies the maximum allocation latency before a memory
pressure event is signalled to the service, per 1s window. If not specified defaults to the
<varname>DefaultMemoryPressureThresholdSec=</varname> setting in
<citerefentry><refentrytitle>systemd-system.conf</refentrytitle><manvolnum>5</manvolnum></citerefentry>
(which in turn defaults to 100ms). The specified value expects a time unit such as
<literal>ms</literal> or <literal>µs</literal>, see
<citerefentry><refentrytitle>systemd.time</refentrytitle><manvolnum>7</manvolnum></citerefentry> for
details on the permitted syntax.</para></listitem>
</varlistentry>
</variablelist>
</refsect1>


@ -175,6 +175,9 @@ void cgroup_context_init(CGroupContext *c) {
.moom_swap = MANAGED_OOM_AUTO,
.moom_mem_pressure = MANAGED_OOM_AUTO,
.moom_preference = MANAGED_OOM_PREFERENCE_NONE,
.memory_pressure_watch = _CGROUP_PRESSURE_WATCH_INVALID,
.memory_pressure_threshold_usec = USEC_INFINITY,
};
}
@ -517,7 +520,8 @@ void cgroup_context_dump(Unit *u, FILE* f, const char *prefix) {
"%sManagedOOMSwap: %s\n"
"%sManagedOOMMemoryPressure: %s\n"
"%sManagedOOMMemoryPressureLimit: " PERMYRIAD_AS_PERCENT_FORMAT_STR "\n"
"%sManagedOOMPreference: %s\n",
"%sManagedOOMPreference: %s\n"
"%sMemoryPressureWatch: %s\n",
prefix, yes_no(c->cpu_accounting),
prefix, yes_no(c->io_accounting),
prefix, yes_no(c->blockio_accounting),
@ -559,7 +563,12 @@ void cgroup_context_dump(Unit *u, FILE* f, const char *prefix) {
prefix, managed_oom_mode_to_string(c->moom_swap),
prefix, managed_oom_mode_to_string(c->moom_mem_pressure),
prefix, PERMYRIAD_AS_PERCENT_FORMAT_VAL(UINT32_SCALE_TO_PERMYRIAD(c->moom_mem_pressure_limit)),
prefix, managed_oom_preference_to_string(c->moom_preference));
prefix, managed_oom_preference_to_string(c->moom_preference),
prefix, cgroup_pressure_watch_to_string(c->memory_pressure_watch));
if (c->memory_pressure_threshold_usec != USEC_INFINITY)
fprintf(f, "%sMemoryPressureThresholdSec: %s\n",
prefix, FORMAT_TIMESPAN(c->memory_pressure_threshold_usec, 1));
if (c->delegate) {
_cleanup_free_ char *t = NULL;
@ -2362,6 +2371,13 @@ static int unit_update_cgroup(
cgroup_context_apply(u, target_mask, state);
cgroup_xattr_apply(u);
/* For most units we expect that memory monitoring is set up before the unit is started and we won't
* touch it after. For PID 1 this is different though, because we couldn't possibly do that given
* that PID 1 runs before init.scope is even set up. Hence, whenever init.scope is realized, let's
* try to open the memory pressure interface anew. */
if (unit_has_name(u, SPECIAL_INIT_SCOPE))
(void) manager_setup_memory_pressure_event_source(u->manager);
return 0;
}
@ -4369,3 +4385,12 @@ static const char* const freezer_action_table[_FREEZER_ACTION_MAX] = {
};
DEFINE_STRING_TABLE_LOOKUP(freezer_action, FreezerAction);
static const char* const cgroup_pressure_watch_table[_CGROUP_PRESSURE_WATCH_MAX] = {
[CGROUP_PRESSURE_WATCH_OFF] = "off",
[CGROUP_PRESSURE_WATCH_AUTO] = "auto",
[CGROUP_PRESSURE_WATCH_ON] = "on",
[CGROUP_PRESSURE_WATCH_SKIP] = "skip",
};
DEFINE_STRING_TABLE_LOOKUP_WITH_BOOLEAN(cgroup_pressure_watch, CGroupPressureWatch, CGROUP_PRESSURE_WATCH_ON);


@ -110,6 +110,15 @@ struct CGroupSocketBindItem {
uint16_t port_min;
};
typedef enum CGroupPressureWatch {
CGROUP_PRESSURE_WATCH_OFF, /* → tells the service payload explicitly not to watch for memory pressure */
CGROUP_PRESSURE_WATCH_AUTO, /* → on if memory accounting is on anyway for the unit, otherwise off */
CGROUP_PRESSURE_WATCH_ON,
CGROUP_PRESSURE_WATCH_SKIP, /* → doesn't set up memory pressure watch, but also doesn't explicitly tell payload to avoid it */
_CGROUP_PRESSURE_WATCH_MAX,
_CGROUP_PRESSURE_WATCH_INVALID = -EINVAL,
} CGroupPressureWatch;
struct CGroupContext {
bool cpu_accounting;
bool io_accounting;
@ -207,6 +216,12 @@ struct CGroupContext {
ManagedOOMMode moom_mem_pressure;
uint32_t moom_mem_pressure_limit; /* Normalized to 2^32-1 == 100% */
ManagedOOMPreference moom_preference;
/* Memory pressure logic */
CGroupPressureWatch memory_pressure_watch;
usec_t memory_pressure_threshold_usec;
/* NB: For now we don't make the period configurable, nor the type, nor do we allow multiple
* triggers, nor triggers for non-memory pressure. We might add that later. */
};
/* Used when querying IP accounting data */
@ -248,6 +263,13 @@ void cgroup_context_free_blockio_device_bandwidth(CGroupContext *c, CGroupBlockI
void cgroup_context_remove_bpf_foreign_program(CGroupContext *c, CGroupBPFForeignProgram *p);
void cgroup_context_remove_socket_bind(CGroupSocketBindItem **head);
static inline bool cgroup_context_want_memory_pressure(const CGroupContext *c) {
assert(c);
return c->memory_pressure_watch == CGROUP_PRESSURE_WATCH_ON ||
(c->memory_pressure_watch == CGROUP_PRESSURE_WATCH_AUTO && c->memory_accounting);
}
int cgroup_add_device_allow(CGroupContext *c, const char *dev, const char *mode);
int cgroup_add_bpf_foreign_program(CGroupContext *c, uint32_t attach_type, const char *path);
@ -351,3 +373,6 @@ int unit_cgroup_freezer_action(Unit *u, FreezerAction action);
const char* freezer_action_to_string(FreezerAction a) _const_;
FreezerAction freezer_action_from_string(const char *s) _pure_;
const char* cgroup_pressure_watch_to_string(CGroupPressureWatch a) _const_;
CGroupPressureWatch cgroup_pressure_watch_from_string(const char *s) _pure_;
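Reviewer note: the four enum values and `cgroup_context_want_memory_pressure()` above lead to three distinct outcomes for the `MEMORY_PRESSURE_WATCH` variable later in this change: a real `memory.pressure` path, the explicit `/dev/null` marker, or no variable at all. A minimal sketch of that mapping, assuming cgroup v2 and PSI support are present and that adjusting file ownership succeeded. The names `Watch`, `memory_pressure_watch_value()` and `is_explicit_off()` are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum Watch { WATCH_OFF, WATCH_AUTO, WATCH_ON, WATCH_SKIP } Watch;

/* Same condition as cgroup_context_want_memory_pressure() in the hunk above */
static bool want_memory_pressure(Watch w, bool memory_accounting) {
        return w == WATCH_ON || (w == WATCH_AUTO && memory_accounting);
}

/* Returns what MEMORY_PRESSURE_WATCH would be set to, or NULL if it stays unset */
static const char *memory_pressure_watch_value(Watch w, bool memory_accounting, const char *pressure_path) {
        assert(pressure_path);

        if (want_memory_pressure(w, memory_accounting))
                return pressure_path;   /* payload should watch this file */
        if (w == WATCH_OFF)
                return "/dev/null";     /* payload is told explicitly not to watch */
        return NULL;                    /* "skip", or "auto" without accounting: say nothing */
}

/* The payload side can recognize the explicit opt-out by comparing against /dev/null */
static bool is_explicit_off(const char *value) {
        return value && strcmp(value, "/dev/null") == 0;
}
```

The point of keeping "skip" separate from "off" is visible in the last two branches: "off" actively communicates a decision to the payload, while "skip" leaves the environment untouched.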


@ -24,6 +24,7 @@
#include "socket-util.h"
BUS_DEFINE_PROPERTY_GET(bus_property_get_tasks_max, "t", TasksMax, tasks_max_resolve);
BUS_DEFINE_PROPERTY_GET_ENUM(bus_property_get_cgroup_pressure_watch, cgroup_pressure_watch, CGroupPressureWatch);
static BUS_DEFINE_PROPERTY_GET_ENUM(property_get_cgroup_device_policy, cgroup_device_policy, CGroupDevicePolicy);
static BUS_DEFINE_PROPERTY_GET_ENUM(property_get_managed_oom_mode, managed_oom_mode, ManagedOOMMode);
@ -494,6 +495,8 @@ const sd_bus_vtable bus_cgroup_vtable[] = {
SD_BUS_PROPERTY("SocketBindAllow", "a(iiqq)", property_get_socket_bind, offsetof(CGroupContext, socket_bind_allow), 0),
SD_BUS_PROPERTY("SocketBindDeny", "a(iiqq)", property_get_socket_bind, offsetof(CGroupContext, socket_bind_deny), 0),
SD_BUS_PROPERTY("RestrictNetworkInterfaces", "(bas)", property_get_restrict_network_interfaces, 0, 0),
SD_BUS_PROPERTY("MemoryPressureWatch", "s", bus_property_get_cgroup_pressure_watch, offsetof(CGroupContext, memory_pressure_watch), 0),
SD_BUS_PROPERTY("MemoryPressureThresholdUSec", "t", bus_property_get_usec, offsetof(CGroupContext, memory_pressure_threshold_usec), 0),
SD_BUS_VTABLE_END
};
@ -743,6 +746,47 @@ static int bus_cgroup_set_transient_property(
}
}
return 1;
} else if (streq(name, "MemoryPressureWatch")) {
CGroupPressureWatch p;
const char *t;
r = sd_bus_message_read(message, "s", &t);
if (r < 0)
return r;
if (isempty(t))
p = _CGROUP_PRESSURE_WATCH_INVALID;
else {
p = cgroup_pressure_watch_from_string(t);
if (p < 0)
return p;
}
if (!UNIT_WRITE_FLAGS_NOOP(flags)) {
c->memory_pressure_watch = p;
unit_write_settingf(u, flags, name, "MemoryPressureWatch=%s", strempty(cgroup_pressure_watch_to_string(p)));
}
return 1;
} else if (streq(name, "MemoryPressureThresholdUSec")) {
uint64_t t;
r = sd_bus_message_read(message, "t", &t);
if (r < 0)
return r;
if (!UNIT_WRITE_FLAGS_NOOP(flags)) {
c->memory_pressure_threshold_usec = t;
if (t == UINT64_MAX)
unit_write_setting(u, flags, name, "MemoryPressureThresholdUSec=");
else
unit_write_settingf(u, flags, name, "MemoryPressureThresholdUSec=%" PRIu64, t);
}
return 1;
}
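Reviewer note: in the transient property handler above, `UINT64_MAX` acts as the "unset" sentinel for `MemoryPressureThresholdUSec`, and is written back as an empty assignment rather than as a number. A small sketch of just that formatting rule, using plain `snprintf()` instead of `unit_write_setting()`/`unit_write_settingf()`. The helper name `format_threshold_setting()` is hypothetical:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes the setting line for the given threshold into buf.
 * Returns 0 on success, -1 if the buffer is too small. */
static int format_threshold_setting(uint64_t usec, char *buf, size_t size) {
        int n;

        assert(buf);

        if (usec == UINT64_MAX)
                /* Sentinel: emit an empty assignment, which resets the setting */
                n = snprintf(buf, size, "MemoryPressureThresholdUSec=");
        else
                n = snprintf(buf, size, "MemoryPressureThresholdUSec=%" PRIu64, usec);

        return n >= 0 && (size_t) n < size ? 0 : -1;
}
```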


@ -10,5 +10,6 @@
extern const sd_bus_vtable bus_cgroup_vtable[];
int bus_property_get_tasks_max(sd_bus *bus, const char *path, const char *interface, const char *property, sd_bus_message *reply, void *userdata, sd_bus_error *ret_error);
int bus_property_get_cgroup_pressure_watch(sd_bus *bus, const char *path, const char *interface, const char *property, sd_bus_message *reply, void *userdata, sd_bus_error *ret_error);
int bus_cgroup_set_property(Unit *u, CGroupContext *c, const char *name, sd_bus_message *message, UnitWriteFlags flags, sd_bus_error *error);


@ -2943,6 +2943,8 @@ const sd_bus_vtable bus_manager_vtable[] = {
SD_BUS_PROPERTY("DefaultLimitRTTIME", "t", bus_property_get_rlimit, offsetof(Manager, rlimit[RLIMIT_RTTIME]), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultLimitRTTIMESoft", "t", bus_property_get_rlimit, offsetof(Manager, rlimit[RLIMIT_RTTIME]), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultTasksMax", "t", bus_property_get_tasks_max, offsetof(Manager, default_tasks_max), 0),
SD_BUS_PROPERTY("DefaultMemoryPressureThresholdUSec", "t", bus_property_get_usec, offsetof(Manager, default_memory_pressure_threshold_usec), 0),
SD_BUS_PROPERTY("DefaultMemoryPressureWatch", "s", bus_property_get_cgroup_pressure_watch, offsetof(Manager, default_memory_pressure_watch), 0),
SD_BUS_PROPERTY("TimerSlackNSec", "t", property_get_timer_slack_nsec, 0, SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultOOMPolicy", "s", bus_property_get_oom_policy, offsetof(Manager, default_oom_policy), SD_BUS_VTABLE_PROPERTY_CONST),
SD_BUS_PROPERTY("DefaultOOMScoreAdjust", "i", property_get_oom_score_adjust, 0, SD_BUS_VTABLE_PROPERTY_CONST),


@ -80,6 +80,7 @@
#include "parse-util.h"
#include "path-util.h"
#include "process-util.h"
#include "psi-util.h"
#include "random-util.h"
#include "recurse-dir.h"
#include "rlimit-util.h"
@ -1808,6 +1809,7 @@ static int build_environment(
const Unit *u,
const ExecContext *c,
const ExecParameters *p,
const CGroupContext *cgroup_context,
size_t n_fds,
char **fdnames,
const char *home,
@ -1815,6 +1817,7 @@ static int build_environment(
const char *shell,
dev_t journal_stream_dev,
ino_t journal_stream_ino,
const char *memory_pressure_path,
char ***ret) {
_cleanup_strv_free_ char **our_env = NULL;
@ -1826,7 +1829,7 @@ static int build_environment(
assert(p);
assert(ret);
#define N_ENV_VARS 17
#define N_ENV_VARS 19
our_env = new0(char*, N_ENV_VARS + _EXEC_DIRECTORY_TYPE_MAX);
if (!our_env)
return -ENOMEM;
@ -1990,8 +1993,35 @@ static int build_environment(
our_env[n_env++] = x;
our_env[n_env++] = NULL;
assert(n_env <= N_ENV_VARS + _EXEC_DIRECTORY_TYPE_MAX);
if (memory_pressure_path) {
x = strjoin("MEMORY_PRESSURE_WATCH=", memory_pressure_path);
if (!x)
return -ENOMEM;
our_env[n_env++] = x;
if (cgroup_context && !path_equal(memory_pressure_path, "/dev/null")) {
_cleanup_free_ char *b = NULL, *e = NULL;
if (asprintf(&b, "%s " USEC_FMT " " USEC_FMT,
MEMORY_PRESSURE_DEFAULT_TYPE,
cgroup_context->memory_pressure_threshold_usec == USEC_INFINITY ? MEMORY_PRESSURE_DEFAULT_THRESHOLD_USEC :
CLAMP(cgroup_context->memory_pressure_threshold_usec, 1U, MEMORY_PRESSURE_DEFAULT_WINDOW_USEC),
MEMORY_PRESSURE_DEFAULT_WINDOW_USEC) < 0)
return -ENOMEM;
if (base64mem(b, strlen(b) + 1, &e) < 0)
return -ENOMEM;
x = strjoin("MEMORY_PRESSURE_WRITE=", e);
if (!x)
return -ENOMEM;
our_env[n_env++] = x;
}
}
assert(n_env < N_ENV_VARS + _EXEC_DIRECTORY_TYPE_MAX);
#undef N_ENV_VARS
*ret = TAKE_PTR(our_env);
@ -4246,6 +4276,7 @@ static int exec_child(
const ExecParameters *params,
ExecRuntime *runtime,
DynamicCreds *dcreds,
const CGroupContext *cgroup_context,
int socket_fd,
const int named_iofds[static 3],
int *params_fds,
@ -4259,7 +4290,7 @@ static int exec_child(
int r, ngids = 0, exec_fd;
_cleanup_free_ gid_t *supplementary_gids = NULL;
const char *username = NULL, *groupname = NULL;
_cleanup_free_ char *home_buffer = NULL;
_cleanup_free_ char *home_buffer = NULL, *memory_pressure_path = NULL;
const char *home = NULL, *shell = NULL;
char **final_argv = NULL;
dev_t journal_stream_dev = 0;
@ -4672,15 +4703,41 @@ static int exec_child(
}
}
/* If delegation is enabled we'll pass ownership of the cgroup to the user of the new process. On cgroup v1
* this is only about systemd's own hierarchy, i.e. not the controller hierarchies, simply because that's not
* safe. On cgroup v2 there's only one hierarchy anyway, and delegation is safe there, hence in that case only
* touch a single hierarchy too. */
if (params->cgroup_path && context->user && (params->flags & EXEC_CGROUP_DELEGATE)) {
r = cg_set_access(SYSTEMD_CGROUP_CONTROLLER, params->cgroup_path, uid, gid);
if (r < 0) {
*exit_status = EXIT_CGROUP;
return log_unit_error_errno(unit, r, "Failed to adjust control group access: %m");
if (params->cgroup_path) {
/* If delegation is enabled we'll pass ownership of the cgroup to the user of the new process. On cgroup v1
* this is only about systemd's own hierarchy, i.e. not the controller hierarchies, simply because that's not
* safe. On cgroup v2 there's only one hierarchy anyway, and delegation is safe there, hence in that case only
* touch a single hierarchy too. */
if (params->flags & EXEC_CGROUP_DELEGATE) {
r = cg_set_access(SYSTEMD_CGROUP_CONTROLLER, params->cgroup_path, uid, gid);
if (r < 0) {
*exit_status = EXIT_CGROUP;
return log_unit_error_errno(unit, r, "Failed to adjust control group access: %m");
}
}
if (cgroup_context && cg_unified() > 0 && is_pressure_supported() > 0) {
if (cgroup_context_want_memory_pressure(cgroup_context)) {
r = cg_get_path("memory", params->cgroup_path, "memory.pressure", &memory_pressure_path);
if (r < 0) {
*exit_status = EXIT_MEMORY;
return log_oom();
}
r = chmod_and_chown(memory_pressure_path, 0644, uid, gid);
if (r < 0) {
log_unit_full_errno(unit, r == -ENOENT || ERRNO_IS_PRIVILEGE(r) ? LOG_DEBUG : LOG_WARNING, r,
"Failed to adjust ownership of '%s', ignoring: %m", memory_pressure_path);
memory_pressure_path = mfree(memory_pressure_path);
}
} else if (cgroup_context->memory_pressure_watch == CGROUP_PRESSURE_WATCH_OFF) {
memory_pressure_path = strdup("/dev/null"); /* /dev/null is explicit indicator for turning off memory pressure watch */
if (!memory_pressure_path) {
*exit_status = EXIT_MEMORY;
return log_oom();
}
}
}
}
@ -4704,6 +4761,7 @@ static int exec_child(
unit,
context,
params,
cgroup_context,
n_fds,
fdnames,
home,
@ -4711,6 +4769,7 @@ static int exec_child(
shell,
journal_stream_dev,
journal_stream_ino,
memory_pressure_path,
&our_env);
if (r < 0) {
*exit_status = EXIT_MEMORY;
@ -5358,6 +5417,7 @@ int exec_spawn(Unit *unit,
const ExecParameters *params,
ExecRuntime *runtime,
DynamicCreds *dcreds,
const CGroupContext *cgroup_context,
pid_t *ret) {
int socket_fd, r, named_iofds[3] = { -1, -1, -1 }, *fds = NULL;
@ -5445,6 +5505,7 @@ int exec_spawn(Unit *unit,
params,
runtime,
dcreds,
cgroup_context,
socket_fd,
named_iofds,
fds,
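Reviewer note: `build_environment()` above assembles `MEMORY_PRESSURE_WRITE` by formatting a trigger string of the form "type threshold window", clamping the threshold into the range from 1 to the window length, and base64-encoding `strlen() + 1` bytes so that the trailing NUL is part of the payload. A self-contained sketch of those three steps follows. The concrete values "some", 200000 and 2000000 for the three `MEMORY_PRESSURE_DEFAULT_*` macros are assumptions (the diff only references them by name), and `base64_encode()` is a plain stand-in for `base64mem()`:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed values, see the note above */
#define PRESSURE_TYPE "some"
#define PRESSURE_DEFAULT_THRESHOLD_USEC UINT64_C(200000)
#define PRESSURE_WINDOW_USEC UINT64_C(2000000)
#define USEC_UNSET UINT64_MAX /* stands in for USEC_INFINITY */

/* Unset selects the default; anything else is clamped to 1..window */
static uint64_t effective_threshold(uint64_t configured) {
        if (configured == USEC_UNSET)
                return PRESSURE_DEFAULT_THRESHOLD_USEC;
        if (configured < 1)
                return 1;
        if (configured > PRESSURE_WINDOW_USEC)
                return PRESSURE_WINDOW_USEC;
        return configured;
}

/* Standard base64 with padding; returns a malloc()ed string or NULL */
static char *base64_encode(const void *data, size_t len) {
        static const char alphabet[] =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        const unsigned char *in = data;
        char *out, *p;

        assert(data);

        p = out = malloc(4 * ((len + 2) / 3) + 1);
        if (!out)
                return NULL;

        for (size_t i = 0; i < len; i += 3) {
                unsigned v = (unsigned) in[i] << 16;

                if (i + 1 < len)
                        v |= (unsigned) in[i + 1] << 8;
                if (i + 2 < len)
                        v |= (unsigned) in[i + 2];

                *p++ = alphabet[(v >> 18) & 63];
                *p++ = alphabet[(v >> 12) & 63];
                *p++ = i + 1 < len ? alphabet[(v >> 6) & 63] : '=';
                *p++ = i + 2 < len ? alphabet[v & 63] : '=';
        }

        *p = '\0';
        return out;
}

/* Returns the value part of MEMORY_PRESSURE_WRITE=, or NULL on failure */
static char *memory_pressure_write_value(uint64_t configured_threshold) {
        char trigger[64];
        int n;

        n = snprintf(trigger, sizeof trigger, "%s %" PRIu64 " %" PRIu64,
                     PRESSURE_TYPE,
                     effective_threshold(configured_threshold),
                     PRESSURE_WINDOW_USEC);
        if (n < 0 || (size_t) n >= sizeof trigger)
                return NULL;

        /* As in the diff: encode the terminating NUL byte too */
        return base64_encode(trigger, strlen(trigger) + 1);
}
```

With the assumed defaults, an unset threshold produces the trigger string "some 200000 2000000", which encodes to `c29tZSAyMDAwMDAgMjAwMDAwMAA=`. A payload would decode this value and write the resulting bytes to the file named by `MEMORY_PRESSURE_WATCH`.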


@ -441,6 +441,7 @@ int exec_spawn(Unit *unit,
const ExecParameters *exec_params,
ExecRuntime *runtime,
DynamicCreds *dynamic_creds,
const CGroupContext *cgroup_context,
pid_t *ret);
void exec_command_done_array(ExecCommand *c, size_t n);


@ -146,6 +146,7 @@ DEFINE_CONFIG_PARSE_ENUM(config_parse_service_timeout_failure_mode, service_time
DEFINE_CONFIG_PARSE_ENUM(config_parse_socket_bind, socket_address_bind_ipv6_only_or_bool, SocketAddressBindIPv6Only, "Failed to parse bind IPv6 only value");
DEFINE_CONFIG_PARSE_ENUM(config_parse_oom_policy, oom_policy, OOMPolicy, "Failed to parse OOM policy");
DEFINE_CONFIG_PARSE_ENUM(config_parse_managed_oom_preference, managed_oom_preference, ManagedOOMPreference, "Failed to parse ManagedOOMPreference=");
DEFINE_CONFIG_PARSE_ENUM(config_parse_cgroup_pressure_watch, cgroup_pressure_watch, CGroupPressureWatch, "Failed to parse CGroupPressureWatch=");
DEFINE_CONFIG_PARSE_ENUM_WITH_DEFAULT(config_parse_ip_tos, ip_tos, int, -1, "Failed to parse IP TOS value");
DEFINE_CONFIG_PARSE_PTR(config_parse_blockio_weight, cg_blkio_weight_parse, uint64_t, "Invalid block IO weight");
DEFINE_CONFIG_PARSE_PTR(config_parse_cg_weight, cg_weight_parse, uint64_t, "Invalid weight");


@ -152,6 +152,7 @@ CONFIG_PARSER_PROTOTYPE(config_parse_watchdog_sec);
CONFIG_PARSER_PROTOTYPE(config_parse_tty_size);
CONFIG_PARSER_PROTOTYPE(config_parse_log_filter_patterns);
CONFIG_PARSER_PROTOTYPE(config_parse_open_file);
CONFIG_PARSER_PROTOTYPE(config_parse_cgroup_pressure_watch);
/* gperf prototypes */
const struct ConfigPerfItem* load_fragment_gperf_lookup(const char *key, GPERF_LEN_TYPE length);


@ -75,6 +75,7 @@
#include "pretty-print.h"
#include "proc-cmdline.h"
#include "process-util.h"
#include "psi-util.h"
#include "random-util.h"
#include "rlimit-util.h"
#if HAVE_SECCOMP
@ -162,6 +163,8 @@ static bool arg_default_blockio_accounting;
static bool arg_default_memory_accounting;
static bool arg_default_tasks_accounting;
static TasksMax arg_default_tasks_max;
static usec_t arg_default_memory_pressure_threshold_usec;
static CGroupPressureWatch arg_default_memory_pressure_watch;
static sd_id128_t arg_machine_id;
static EmergencyAction arg_cad_burst_action;
static OOMPolicy arg_default_oom_policy;
@ -686,6 +689,8 @@ static int parse_config_file(void) {
{ "Manager", "DefaultMemoryAccounting", config_parse_bool, 0, &arg_default_memory_accounting },
{ "Manager", "DefaultTasksAccounting", config_parse_bool, 0, &arg_default_tasks_accounting },
{ "Manager", "DefaultTasksMax", config_parse_tasks_max, 0, &arg_default_tasks_max },
{ "Manager", "DefaultMemoryPressureThresholdSec", config_parse_sec, 0, &arg_default_memory_pressure_threshold_usec },
{ "Manager", "DefaultMemoryPressureWatch", config_parse_cgroup_pressure_watch, 0, &arg_default_memory_pressure_watch },
{ "Manager", "CtrlAltDelBurstAction", config_parse_emergency_action, arg_system, &arg_cad_burst_action },
{ "Manager", "DefaultOOMPolicy", config_parse_oom_policy, 0, &arg_default_oom_policy },
{ "Manager", "DefaultOOMScoreAdjust", config_parse_oom_score_adjust, 0, NULL },
@ -767,6 +772,8 @@ static void set_manager_defaults(Manager *m) {
m->default_memory_accounting = arg_default_memory_accounting;
m->default_tasks_accounting = arg_default_tasks_accounting;
m->default_tasks_max = arg_default_tasks_max;
m->default_memory_pressure_watch = arg_default_memory_pressure_watch;
m->default_memory_pressure_threshold_usec = arg_default_memory_pressure_threshold_usec;
m->default_oom_policy = arg_default_oom_policy;
m->default_oom_score_adjust_set = arg_default_oom_score_adjust_set;
m->default_oom_score_adjust = arg_default_oom_score_adjust;
@ -2474,6 +2481,8 @@ static void reset_arguments(void) {
arg_default_memory_accounting = MEMORY_ACCOUNTING_DEFAULT;
arg_default_tasks_accounting = true;
arg_default_tasks_max = DEFAULT_TASKS_MAX;
arg_default_memory_pressure_threshold_usec = MEMORY_PRESSURE_DEFAULT_THRESHOLD_USEC;
arg_default_memory_pressure_watch = CGROUP_PRESSURE_WATCH_AUTO;
arg_machine_id = (sd_id128_t) {};
arg_cad_burst_action = EMERGENCY_ACTION_REBOOT_FORCE;
arg_default_oom_policy = OOM_STOP;


@ -31,6 +31,7 @@
#include "bus-util.h"
#include "clean-ipc.h"
#include "clock-util.h"
#include "common-signal.h"
#include "constants.h"
#include "core-varlink.h"
#include "creds-util.h"
@ -69,6 +70,7 @@
#include "path-lookup.h"
#include "path-util.h"
#include "process-util.h"
#include "psi-util.h"
#include "ratelimit.h"
#include "rlimit-util.h"
#include "rm-rf.h"
@ -567,7 +569,11 @@ static int manager_setup_signals(Manager *m) {
SIGRTMIN+15, /* systemd: Immediate reboot */
SIGRTMIN+16, /* systemd: Immediate kexec */
/* ... space for more immediate system state changes ... */
/* ... space for one more immediate system state change ... */
SIGRTMIN+18, /* systemd: control command */
/* ... space ... */
SIGRTMIN+20, /* systemd: enable status messages */
SIGRTMIN+21, /* systemd: disable status messages */
@ -638,6 +644,8 @@ static char** sanitize_environment(char **l) {
"LOG_NAMESPACE",
"MAINPID",
"MANAGERPID",
"MEMORY_PRESSURE_WATCH",
"MEMORY_PRESSURE_WRITE",
"MONITOR_EXIT_CODE",
"MONITOR_EXIT_STATUS",
"MONITOR_INVOCATION_ID",
@ -787,6 +795,31 @@ static int manager_setup_sigchld_event_source(Manager *m) {
return 0;
}
int manager_setup_memory_pressure_event_source(Manager *m) {
int r;
assert(m);
m->memory_pressure_event_source = sd_event_source_disable_unref(m->memory_pressure_event_source);
r = sd_event_add_memory_pressure(m->event, &m->memory_pressure_event_source, NULL, NULL);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || (r == -EHOSTDOWN) ? LOG_DEBUG : LOG_NOTICE, r,
"Failed to establish memory pressure event source, ignoring: %m");
else if (m->default_memory_pressure_threshold_usec != USEC_INFINITY) {
/* If there's a default memory pressure threshold set, also apply it to the service manager itself */
r = sd_event_source_set_memory_pressure_period(
m->memory_pressure_event_source,
m->default_memory_pressure_threshold_usec,
MEMORY_PRESSURE_DEFAULT_WINDOW_USEC);
if (r < 0)
log_warning_errno(r, "Failed to adjust memory pressure threshold, ignoring: %m");
}
return 0;
}
static int manager_find_credentials_dirs(Manager *m) {
const char *e;
int r;
@ -877,6 +910,9 @@ int manager_new(LookupScope scope, ManagerTestRunFlags test_run_flags, Manager *
.test_run_flags = test_run_flags,
.default_oom_policy = OOM_STOP,
.default_memory_pressure_watch = CGROUP_PRESSURE_WATCH_AUTO,
.default_memory_pressure_threshold_usec = USEC_INFINITY,
};
#if ENABLE_EFI
@ -967,6 +1003,10 @@ int manager_new(LookupScope scope, ManagerTestRunFlags test_run_flags, Manager *
if (r < 0)
return r;
r = manager_setup_memory_pressure_event_source(m);
if (r < 0)
return r;
#if HAVE_LIBBPF
if (MANAGER_IS_SYSTEM(m) && lsm_bpf_supported(/* initialize = */ true)) {
r = lsm_bpf_setup(m);
@ -1541,6 +1581,7 @@ Manager* manager_free(Manager *m) {
sd_event_source_unref(m->jobs_in_progress_event_source);
sd_event_source_unref(m->run_queue_event_source);
sd_event_source_unref(m->user_lookup_event_source);
sd_event_source_unref(m->memory_pressure_event_source);
safe_close(m->signal_fd);
safe_close(m->notify_fd);
@ -2892,6 +2933,47 @@ static int manager_dispatch_signal_fd(sd_event_source *source, int fd, uint32_t
switch (sfsi.ssi_signo - SIGRTMIN) {
case 18: {
bool generic = false;
if (sfsi.ssi_code != SI_QUEUE)
generic = true;
else {
/* Override a few select commands by our own PID1-specific logic */
switch (sfsi.ssi_int) {
case _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE..._COMMON_SIGNAL_COMMAND_LOG_LEVEL_END:
manager_override_log_level(m, sfsi.ssi_int - _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE);
break;
case COMMON_SIGNAL_COMMAND_CONSOLE:
manager_override_log_target(m, LOG_TARGET_CONSOLE);
break;
case COMMON_SIGNAL_COMMAND_JOURNAL:
manager_override_log_target(m, LOG_TARGET_JOURNAL);
break;
case COMMON_SIGNAL_COMMAND_KMSG:
manager_override_log_target(m, LOG_TARGET_KMSG);
break;
case COMMON_SIGNAL_COMMAND_NULL:
manager_override_log_target(m, LOG_TARGET_NULL);
break;
default:
generic = true;
}
}
if (generic)
return sigrtmin18_handler(source, &sfsi, NULL);
break;
}
case 20:
manager_override_show_status(m, SHOW_STATUS_YES, "signal");
break;
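Reviewer note: the `case 18` branch above treats `SIGRTMIN+18` as a command channel. A signal that arrives with `ssi_code == SI_QUEUE` carries an integer command in its payload, and anything else is handed to the generic handler. A sketch of both ends of that channel using plain POSIX calls (`sigqueue()` and `sigwaitinfo()`) rather than signalfd and sd-event. The command value `EXAMPLE_COMMAND` is a hypothetical placeholder, since the real `COMMON_SIGNAL_COMMAND_*` values are not shown in this diff:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical command value, for illustration only */
#define EXAMPLE_COMMAND 0x300

/* Block the signal so it stays pending until we explicitly wait for it,
 * similar in spirit to the sigprocmask_many() calls elsewhere in this diff */
static int block_control_signal(void) {
        sigset_t ss;

        sigemptyset(&ss);
        sigaddset(&ss, SIGRTMIN + 18);
        return sigprocmask(SIG_BLOCK, &ss, NULL);
}

/* Sender side: queue the signal together with an integer command */
static int queue_command(pid_t pid, int command) {
        union sigval v = { .sival_int = command };

        return sigqueue(pid, SIGRTMIN + 18, v);
}

/* Receiver side: returns 1 and sets *ret_command if the signal carried a
 * queued command, 0 if it was sent without payload (e.g. via kill()),
 * -1 on error */
static int wait_command(int *ret_command) {
        sigset_t ss;
        siginfo_t si;

        assert(ret_command);

        sigemptyset(&ss);
        sigaddset(&ss, SIGRTMIN + 18);

        if (sigwaitinfo(&ss, &si) < 0)
                return -1;

        if (si.si_code != SI_QUEUE)
                return 0;

        *ret_command = si.si_value.sival_int;
        return 1;
}
```

A process can exercise this against itself by calling `block_control_signal()`, then `queue_command(getpid(), EXAMPLE_COMMAND)`, then `wait_command()`.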


@ -377,6 +377,9 @@ struct Manager {
int default_oom_score_adjust;
bool default_oom_score_adjust_set;
CGroupPressureWatch default_memory_pressure_watch;
usec_t default_memory_pressure_threshold_usec;
int original_log_level;
LogTarget original_log_target;
bool log_level_overridden;
@ -464,6 +467,8 @@ struct Manager {
/* Allow users to configure a rate limit for Reload() operations */
RateLimit reload_ratelimit;
sd_event_source *memory_pressure_event_source;
};
static inline usec_t manager_default_timeout_abort_usec(Manager *m) {
@ -517,6 +522,8 @@ void manager_unwatch_pid(Manager *m, pid_t pid);
unsigned manager_dispatch_load_queue(Manager *m);
int manager_setup_memory_pressure_event_source(Manager *m);
int manager_default_environment(Manager *m);
int manager_transient_environment_add(Manager *m, char **plus);
int manager_client_environment_modify(Manager *m, char **minus, char **plus);


@ -922,6 +922,7 @@ static int mount_spawn(Mount *m, ExecCommand *c, pid_t *_pid) {
&exec_params,
m->exec_runtime,
&m->dynamic_creds,
&m->cgroup_context,
&pid);
if (r < 0)
return r;


@ -1709,6 +1709,7 @@ static int service_spawn_internal(
&exec_params,
s->exec_runtime,
&s->dynamic_creds,
&s->cgroup_context,
&pid);
if (r < 0)
return r;


@ -1948,6 +1948,7 @@ static int socket_spawn(Socket *s, ExecCommand *c, pid_t *_pid) {
&exec_params,
s->exec_runtime,
&s->dynamic_creds,
&s->cgroup_context,
&pid);
if (r < 0)
return r;


@ -690,6 +690,7 @@ static int swap_spawn(Swap *s, ExecCommand *c, pid_t *_pid) {
&exec_params,
s->exec_runtime,
&s->dynamic_creds,
&s->cgroup_context,
&pid);
if (r < 0)
goto fail;


@ -184,6 +184,9 @@ static void unit_init(Unit *u) {
if (u->type != UNIT_SLICE)
cc->tasks_max = u->manager->default_tasks_max;
cc->memory_pressure_watch = u->manager->default_memory_pressure_watch;
cc->memory_pressure_threshold_usec = u->manager->default_memory_pressure_threshold_usec;
}
ec = unit_get_exec_context(u);


@ -18,6 +18,7 @@
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "clean-ipc.h"
#include "common-signal.h"
#include "conf-files.h"
#include "device-util.h"
#include "dirent-util.h"
@ -225,6 +226,15 @@ int manager_new(Manager **ret) {
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || (r == -EHOSTDOWN) ? LOG_DEBUG : LOG_WARNING, r,
"Failed to allocate memory pressure watch, ignoring: %m");
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
(void) sd_event_set_watchdog(m->event, true);
m->homes_by_uid = hashmap_new(&homes_by_uid_hash_ops);


@ -29,7 +29,7 @@ static int run(int argc, char *argv[]) {
umask(0022);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)


@ -10,6 +10,7 @@
#include "bus-get-properties.h"
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "common-signal.h"
#include "constants.h"
#include "env-util.h"
#include "fd-util.h"
@ -636,7 +637,23 @@ static int manager_new(Manager **ret) {
if (r < 0)
return r;
sd_event_set_watchdog(m->event, true);
(void) sd_event_set_watchdog(m->event, true);
r = sd_event_add_signal(m->event, NULL, SIGINT, NULL, NULL);
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGTERM, NULL, NULL);
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
r = sd_bus_default_system(&m->bus);
if (r < 0)
@ -1389,7 +1406,7 @@ static int run(int argc, char *argv[]) {
umask(0022);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)


@ -636,6 +636,10 @@ static void client_context_try_shrink_to(Server *s, size_t limit) {
}
}
void client_context_flush_regular(Server *s) {
client_context_try_shrink_to(s, 0);
}
void client_context_flush_all(Server *s) {
assert(s);
@ -644,7 +648,7 @@ void client_context_flush_all(Server *s) {
s->my_context = client_context_release(s, s->my_context);
s->pid1_context = client_context_release(s, s->pid1_context);
client_context_try_shrink_to(s, 0);
client_context_flush_regular(s);
assert(prioq_size(s->client_contexts_lru) == 0);
assert(hashmap_size(s->client_contexts) == 0);


@ -89,6 +89,7 @@ void client_context_maybe_refresh(
void client_context_acquire_default(Server *s);
void client_context_flush_all(Server *s);
void client_context_flush_regular(Server *s);
static inline size_t client_context_extra_fields_n_iovec(const ClientContext *c) {
return c ? c->extra_fields_n_iovec : 0;


@ -1707,7 +1707,7 @@ static int server_setup_signals(Server *s) {
assert(s);
assert_se(sigprocmask_many(SIG_SETMASK, NULL, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, SIGRTMIN+1, -1) >= 0);
assert_se(sigprocmask_many(SIG_SETMASK, NULL, SIGINT, SIGTERM, SIGUSR1, SIGUSR2, SIGRTMIN+1, SIGRTMIN+18, -1) >= 0);
r = sd_event_add_signal(s->event, &s->sigusr1_event_source, SIGUSR1, dispatch_sigusr1, s);
if (r < 0)
@ -1747,6 +1747,10 @@ static int server_setup_signals(Server *s) {
if (r < 0)
return r;
r = sd_event_add_signal(s->event, NULL, SIGRTMIN+18, sigrtmin18_handler, &s->sigrtmin18_info);
if (r < 0)
return r;
return 0;
}
@ -2420,6 +2424,42 @@ static int server_set_namespace(Server *s, const char *namespace) {
return 1;
}
static int server_memory_pressure(sd_event_source *es, void *userdata) {
Server *s = ASSERT_PTR(userdata);
log_info("Under memory pressure, flushing caches.");
/* Flush the cached info we might have about client processes */
client_context_flush_regular(s);
/* Let's also close all user files (but keep the system/runtime one open) */
for (;;) {
ManagedJournalFile *first = ordered_hashmap_steal_first(s->user_journals);
if (!first)
break;
(void) managed_journal_file_close(first);
}
sd_event_trim_memory();
return 0;
}
static int server_setup_memory_pressure(Server *s) {
int r;
assert(s);
r = sd_event_add_memory_pressure(s->event, NULL, server_memory_pressure, s);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || (r == -EHOSTDOWN) ? LOG_DEBUG : LOG_NOTICE, r,
"Failed to install memory pressure event source, ignoring: %m");
return 0;
}
int server_init(Server *s, const char *namespace) {
const char *native_socket, *syslog_socket, *stdout_socket, *varlink_socket, *e;
_cleanup_fdset_free_ FDSet *fds = NULL;
@ -2470,6 +2510,9 @@ int server_init(Server *s, const char *namespace) {
.interval = DEFAULT_KMSG_OWN_INTERVAL,
.burst = DEFAULT_KMSG_OWN_BURST,
},
.sigrtmin18_info.memory_pressure_handler = server_memory_pressure,
.sigrtmin18_info.memory_pressure_userdata = s,
};
r = server_set_namespace(s, namespace);
@ -2652,6 +2695,10 @@ int server_init(Server *s, const char *namespace) {
if (r < 0)
return r;
r = server_setup_memory_pressure(s);
if (r < 0)
return r;
s->ratelimit = journal_ratelimit_new();
if (!s->ratelimit)
return log_oom();
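Reviewer note: `server_memory_pressure()` above follows a pattern that the resolved handler later in this change repeats: drop everything that can be regenerated, then ask the allocator to return freed memory to the kernel. A library-free sketch of that pattern follows. The `Cache` type and its helpers are hypothetical, and glibc's `malloc_trim()` stands in for `sd_event_trim_memory()`, whose internals are not shown in this diff:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#ifdef __GLIBC__
#include <malloc.h>
#endif

typedef struct CacheEntry {
        struct CacheEntry *next;
        char *payload;
} CacheEntry;

typedef struct Cache {
        CacheEntry *head;
        size_t n_entries;
} Cache;

/* Adds one entry with a payload of the given (non-zero) size */
static int cache_add(Cache *c, size_t payload_size) {
        CacheEntry *e;

        assert(c);

        e = calloc(1, sizeof *e);
        if (!e)
                return -1;

        e->payload = malloc(payload_size);
        if (!e->payload) {
                free(e);
                return -1;
        }

        e->next = c->head;
        c->head = e;
        c->n_entries++;
        return 0;
}

/* Detaches and returns the first entry, or NULL if the cache is empty */
static CacheEntry *cache_steal_first(Cache *c) {
        CacheEntry *e;

        assert(c);

        e = c->head;
        if (!e)
                return NULL;

        c->head = e->next;
        c->n_entries--;
        return e;
}

/* Pressure handler: drop all cached entries, then trim the heap.
 * Returns the number of entries that were released. */
static size_t on_memory_pressure(Cache *c) {
        size_t n = 0;

        for (;;) {
                CacheEntry *e = cache_steal_first(c);
                if (!e)
                        break;

                free(e->payload);
                free(e);
                n++;
        }

#ifdef __GLIBC__
        /* Freeing alone may leave memory in the allocator; hand it back */
        malloc_trim(0);
#endif
        return n;
}
```

The steal-first loop mirrors the `ordered_hashmap_steal_first()` loop in the hunk: each entry is detached from the container before it is destroyed, so the container is never left pointing at freed memory.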


@ -8,6 +8,7 @@
typedef struct Server Server;
#include "common-signal.h"
#include "conf-parser.h"
#include "hashmap.h"
#include "journald-context.h"
@ -95,6 +96,7 @@ struct Server {
sd_event_source *notify_event_source;
sd_event_source *watchdog_event_source;
sd_event_source *idle_event_source;
struct sigrtmin18_info sigrtmin18_info;
ManagedJournalFile *runtime_journal;
ManagedJournalFile *system_journal;


@ -14,6 +14,7 @@
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "cgroup-util.h"
#include "common-signal.h"
#include "constants.h"
#include "daemon-util.h"
#include "device-util.h"
@ -85,6 +86,14 @@ static int manager_new(Manager **ret) {
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
(void) sd_event_set_watchdog(m->event, true);
manager_reset_config(m);
@ -1196,7 +1205,7 @@ static int run(int argc, char *argv[]) {
(void) mkdir_label("/run/systemd/users", 0755);
(void) mkdir_label("/run/systemd/sessions", 0755);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGHUP, SIGTERM, SIGINT, SIGCHLD, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGHUP, SIGTERM, SIGINT, SIGCHLD, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)


@ -12,6 +12,7 @@
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "cgroup-util.h"
#include "common-signal.h"
#include "daemon-util.h"
#include "dirent-util.h"
#include "discover-image.h"
@ -61,6 +62,15 @@ static int manager_new(Manager **ret) {
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || r == -EHOSTDOWN ? LOG_DEBUG : LOG_NOTICE, r,
"Unable to create memory pressure event source, ignoring: %m");
(void) sd_event_set_watchdog(m->event, true);
*ret = TAKE_PTR(m);
@ -339,7 +349,7 @@ static int run(int argc, char *argv[]) {
* make sure this check stays in. */
(void) mkdir_label("/run/systemd/machines", 0755);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)


@ -16,6 +16,7 @@
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "bus-util.h"
#include "common-signal.h"
#include "conf-parser.h"
#include "constants.h"
#include "daemon-util.h"
@ -521,6 +522,11 @@ int manager_setup(Manager *m) {
(void) sd_event_add_signal(m->event, NULL, SIGINT | SD_EVENT_SIGNAL_PROCMASK, signal_terminate_callback, m);
(void) sd_event_add_signal(m->event, NULL, SIGUSR2 | SD_EVENT_SIGNAL_PROCMASK, signal_restart_callback, m);
(void) sd_event_add_signal(m->event, NULL, SIGHUP | SD_EVENT_SIGNAL_PROCMASK, signal_reload_callback, m);
(void) sd_event_add_signal(m->event, NULL, (SIGRTMIN+18) | SD_EVENT_SIGNAL_PROCMASK, sigrtmin18_handler, NULL);
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
r = sd_event_add_post(m->event, NULL, manager_dirty_handler, m);
if (r < 0)


@ -35,6 +35,7 @@
#include "capability-util.h"
#include "cgroup-util.h"
#include "chase-symlinks.h"
#include "common-signal.h"
#include "copy.h"
#include "cpu-set-util.h"
#include "creds-util.h"
@ -5162,6 +5163,12 @@ static int run_container(
(void) sd_event_add_signal(event, NULL, SIGTERM, NULL, NULL);
}
(void) sd_event_add_signal(event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
r = sd_event_add_memory_pressure(event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
/* Exit when the child exits */
(void) sd_event_add_signal(event, NULL, SIGCHLD, on_sigchld, PID_TO_PTR(*pid));
@ -5803,7 +5810,7 @@ static int run(int argc, char *argv[]) {
log_info("Spawning container %s on %s.\nPress Ctrl-] three times within 1s to kill container.",
arg_machine, arg_image ?: arg_directory);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGWINCH, SIGTERM, SIGINT, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGWINCH, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) < 0) {
r = log_error_errno(errno, "Failed to become subreaper: %m");


@ -8,6 +8,7 @@
#include "alloc-util.h"
#include "bus-log-control-api.h"
#include "bus-polkit.h"
#include "common-signal.h"
#include "constants.h"
#include "daemon-util.h"
#include "main-func.h"
@ -43,6 +44,14 @@ static int manager_new(Manager **ret) {
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
(void) sd_event_set_watchdog(m->event, true);
*ret = TAKE_PTR(m);
@ -143,7 +152,7 @@ static int run(int argc, char *argv[]) {
if (argc != 1)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "This program takes no arguments.");
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)


@ -543,6 +543,30 @@ static int manager_sigrtmin1(sd_event_source *s, const struct signalfd_siginfo *
return 0;
}
static int manager_memory_pressure(sd_event_source *s, void *userdata) {
Manager *m = ASSERT_PTR(userdata);
log_info("Under memory pressure, flushing caches.");
manager_flush_caches(m, LOG_INFO);
sd_event_trim_memory();
return 0;
}
static int manager_memory_pressure_listen(Manager *m) {
int r;
assert(m);
r = sd_event_add_memory_pressure(m->event, NULL, manager_memory_pressure, m);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || (r == -EHOSTDOWN) ? LOG_DEBUG : LOG_NOTICE, r,
"Failed to install memory pressure event source, ignoring: %m");
return 0;
}
int manager_new(Manager **ret) {
_cleanup_(manager_freep) Manager *m = NULL;
int r;
@ -572,6 +596,9 @@ int manager_new(Manager **ret) {
.need_builtin_fallbacks = true,
.etc_hosts_last = USEC_INFINITY,
.read_etc_hosts = true,
.sigrtmin18_info.memory_pressure_handler = manager_memory_pressure,
.sigrtmin18_info.memory_pressure_userdata = m,
};
r = dns_trust_anchor_load(&m->trust_anchor);
@ -621,6 +648,10 @@ int manager_new(Manager **ret) {
if (r < 0)
return r;
r = manager_memory_pressure_listen(m);
if (r < 0)
return r;
r = manager_connect_bus(m);
if (r < 0)
return r;
@ -628,6 +659,7 @@ int manager_new(Manager **ret) {
(void) sd_event_add_signal(m->event, &m->sigusr1_event_source, SIGUSR1, manager_sigusr1, m);
(void) sd_event_add_signal(m->event, &m->sigusr2_event_source, SIGUSR2, manager_sigusr2, m);
(void) sd_event_add_signal(m->event, &m->sigrtmin1_event_source, SIGRTMIN+1, manager_sigrtmin1, m);
(void) sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, &m->sigrtmin18_info);
manager_cleanup_saved_user(m);

View file

@ -7,6 +7,7 @@
#include "sd-netlink.h"
#include "sd-network.h"
#include "common-signal.h"
#include "hashmap.h"
#include "list.h"
#include "ordered-set.h"
@ -156,6 +157,8 @@ struct Manager {
LIST_HEAD(SocketGraveyard, socket_graveyard);
SocketGraveyard *socket_graveyard_oldest;
size_t n_socket_graveyard;
struct sigrtmin18_info sigrtmin18_info;
};
/* Manager */

View file

@ -67,7 +67,7 @@ static int run(int argc, char *argv[]) {
return log_error_errno(r, "Failed to drop privileges: %m");
}
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, SIGUSR1, SIGUSR2, SIGRTMIN+1, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, SIGUSR1, SIGUSR2, SIGRTMIN+1, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)

View file

@ -460,7 +460,8 @@ static int bus_append_cgroup_property(sd_bus_message *m, const char *field, cons
"Slice",
"ManagedOOMSwap",
"ManagedOOMMemoryPressure",
"ManagedOOMPreference"))
"ManagedOOMPreference",
"MemoryPressureWatch"))
return bus_append_string(m, field, eq);
if (STR_IN_SET(field, "ManagedOOMMemoryPressureLimit")) {
@ -913,6 +914,9 @@ static int bus_append_cgroup_property(sd_bus_message *m, const char *field, cons
return 1;
}
if (streq(field, "MemoryPressureThresholdSec"))
return bus_append_parse_sec_rename(m, field, eq);
return 0;
}

View file

@ -0,0 +1,94 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include "common-signal.h"
#include "fd-util.h"
#include "fileio.h"
#include "process-util.h"
#include "signal-util.h"
int sigrtmin18_handler(sd_event_source *s, const struct signalfd_siginfo *si, void *userdata) {
struct sigrtmin18_info *info = userdata;
_cleanup_free_ char *comm = NULL;
int r;
assert(s);
assert(si);
(void) get_process_comm(si->ssi_pid, &comm);
if (si->ssi_code != SI_QUEUE) {
log_notice("Received control signal %s from process " PID_FMT " (%s) without command value, ignoring.",
signal_to_string(si->ssi_signo),
(pid_t) si->ssi_pid,
strna(comm));
return 0;
}
log_debug("Received control signal %s from process " PID_FMT " (%s) with command 0x%08x.",
signal_to_string(si->ssi_signo),
(pid_t) si->ssi_pid,
strna(comm),
(unsigned) si->ssi_int);
switch (si->ssi_int) {
case _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE..._COMMON_SIGNAL_COMMAND_LOG_LEVEL_END:
log_set_max_level(si->ssi_int - _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE);
break;
case COMMON_SIGNAL_COMMAND_CONSOLE:
log_set_target_and_open(LOG_TARGET_CONSOLE);
break;
case COMMON_SIGNAL_COMMAND_JOURNAL:
log_set_target_and_open(LOG_TARGET_JOURNAL);
break;
case COMMON_SIGNAL_COMMAND_KMSG:
log_set_target_and_open(LOG_TARGET_KMSG);
break;
case COMMON_SIGNAL_COMMAND_NULL:
log_set_target_and_open(LOG_TARGET_NULL);
break;
case COMMON_SIGNAL_COMMAND_MEMORY_PRESSURE:
if (info && info->memory_pressure_handler)
return info->memory_pressure_handler(s, info->memory_pressure_userdata);
sd_event_trim_memory();
break;
case COMMON_SIGNAL_COMMAND_MALLOC_INFO: {
_cleanup_free_ char *data = NULL;
_cleanup_fclose_ FILE *f = NULL;
size_t sz;
f = open_memstream_unlocked(&data, &sz);
if (!f) {
log_oom();
break;
}
if (malloc_info(0, f) < 0) {
log_error_errno(errno, "Failed to invoke malloc_info(): %m");
break;
}
fputc(0, f);
r = fflush_and_check(f);
if (r < 0) {
log_error_errno(r, "Failed to flush malloc_info() buffer: %m");
break;
}
log_dump(LOG_INFO, data);
break;
}
default:
log_notice("Received control signal %s with unknown command 0x%08x, ignoring.",
signal_to_string(si->ssi_signo), (unsigned) si->ssi_int);
break;
}
return 0;
}
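The dispatch in `sigrtmin18_handler()` above can be read as a mapping from the queued integer to an action. The following is a minimal sketch of that mapping only, assuming the numeric values from the enum in `common-signal.h` in this diff. The `CMD_*` names, `describe_command()` and `command_to_log_level()` are hypothetical helpers for illustration and are not part of systemd.

```c
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Values mirror the enum in common-signal.h from this diff; the CMD_* names
 * themselves are hypothetical and only used in this sketch. */
enum {
        CMD_LOG_LEVEL_BASE  = 0x100,
        CMD_LOG_LEVEL_END   = CMD_LOG_LEVEL_BASE + LOG_DEBUG,
        CMD_CONSOLE         = 0x200,
        CMD_JOURNAL,
        CMD_KMSG,
        CMD_NULL,
        CMD_MEMORY_PRESSURE = 0x300,
        CMD_MALLOC_INFO,
};

/* Hypothetical helper: return the syslog level encoded in a log-level
 * command, or -1 if the value is not a log-level command. */
static int command_to_log_level(int value) {
        if (value < CMD_LOG_LEVEL_BASE || value > CMD_LOG_LEVEL_END)
                return -1;
        return value - CMD_LOG_LEVEL_BASE;
}

/* Hypothetical helper: name the branch of the switch a value would take. */
static const char *describe_command(int value) {
        if (command_to_log_level(value) >= 0)
                return "set-max-log-level";

        switch (value) {
        case CMD_CONSOLE:         return "log-target-console";
        case CMD_JOURNAL:         return "log-target-journal";
        case CMD_KMSG:            return "log-target-kmsg";
        case CMD_NULL:            return "log-target-null";
        case CMD_MEMORY_PRESSURE: return "trim-memory";
        case CMD_MALLOC_INFO:     return "malloc-info";
        default:                  return "unknown";
        }
}
```

For instance, `describe_command(0x300)` yields `"trim-memory"`, which corresponds to the `-q 768` value used in the header comment, and `command_to_log_level(0x107)` yields `LOG_DEBUG`.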

View file

@ -0,0 +1,58 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <syslog.h>
#include <sd-event.h>
/* All our long-running services should implement a SIGRTMIN+18 handler that can be used to trigger certain
* actions that affect service runtime. The specific action is indicated via the "value integer" you can pass
* along realtime signals. This is mostly intended for debugging purposes and is entirely asynchronous in
* nature. Currently available operations:
*
* Change maximum log level
* Change log target
* Invoke memory trimming, like under memory pressure
* Write glibc malloc() allocation info to logs
*
* How to use this? Via a command like the following:
*
* /usr/bin/kill -s RTMIN+18 -q 768 1
*
* (This will tell PID 1 to trim its memory use.)
*
* or:
*
* systemctl kill --kill-value=0x300 -s RTMIN+18 systemd-journald
*
* (This will tell journald to trim its memory use.)
*/
enum {
_COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE = 0x100,
COMMON_SIGNAL_COMMAND_LOG_EMERG = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_EMERG,
COMMON_SIGNAL_COMMAND_LOG_ALERT = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_ALERT,
COMMON_SIGNAL_COMMAND_LOG_CRIT = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_CRIT,
COMMON_SIGNAL_COMMAND_LOG_ERR = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_ERR,
COMMON_SIGNAL_COMMAND_LOG_WARNING = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_WARNING,
COMMON_SIGNAL_COMMAND_LOG_NOTICE = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_NOTICE,
COMMON_SIGNAL_COMMAND_LOG_INFO = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_INFO,
COMMON_SIGNAL_COMMAND_LOG_DEBUG = _COMMON_SIGNAL_COMMAND_LOG_LEVEL_BASE + LOG_DEBUG,
_COMMON_SIGNAL_COMMAND_LOG_LEVEL_END = COMMON_SIGNAL_COMMAND_LOG_DEBUG,
COMMON_SIGNAL_COMMAND_CONSOLE = 0x200,
COMMON_SIGNAL_COMMAND_JOURNAL,
COMMON_SIGNAL_COMMAND_KMSG,
COMMON_SIGNAL_COMMAND_NULL,
COMMON_SIGNAL_COMMAND_MEMORY_PRESSURE = 0x300,
COMMON_SIGNAL_COMMAND_MALLOC_INFO,
};
struct sigrtmin18_info {
sd_event_handler_t memory_pressure_handler;
void *memory_pressure_userdata;
};
int sigrtmin18_handler(sd_event_source *s, const struct signalfd_siginfo *si, void *userdata);
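The `kill -q` invocation from the comment above has a direct C equivalent in `sigqueue()`. The sketch below is an illustration under stated assumptions, not code from this commit: `send_command()` and `roundtrip_command()` are hypothetical helpers. The second one blocks `SIGRTMIN+18`, queues a command to the calling process and reads it back through a `signalfd`, which is roughly the view a handler like `sigrtmin18_handler()` gets of `ssi_code` and `ssi_int`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: queue SIGRTMIN+18 with a command value to a process,
 * similar in effect to "kill -s RTMIN+18 -q <value> <pid>". */
static int send_command(pid_t pid, int command) {
        union sigval v = { .sival_int = command };

        return sigqueue(pid, SIGRTMIN+18, v);
}

/* Hypothetical self-test: send a command to ourselves and read it back.
 * Returns the received command value, or -1 on failure or if the signal
 * did not carry a queued value (ssi_code != SI_QUEUE). */
static int roundtrip_command(int command) {
        struct signalfd_siginfo si;
        sigset_t mask;
        int fd, result = -1;

        /* The signal must be blocked so that it stays pending for the signalfd. */
        sigemptyset(&mask);
        sigaddset(&mask, SIGRTMIN+18);
        if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
                return -1;

        fd = signalfd(-1, &mask, SFD_CLOEXEC);
        if (fd < 0)
                return -1;

        if (send_command(getpid(), command) == 0 &&
            read(fd, &si, sizeof si) == (ssize_t) sizeof si &&
            si.ssi_code == SI_QUEUE)
                result = si.ssi_int;

        close(fd);
        return result;
}
```

A plain `kill()` without a value would arrive with a different `ssi_code`, which is the case the handler logs as "without command value" and ignores.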

View file

@ -35,6 +35,7 @@ shared_sources = files(
'chown-recursive.c',
'clean-ipc.c',
'clock-util.c',
'common-signal.c',
'compare-operator.c',
'condition.c',
'conf-parser.c',

View file

@ -15,6 +15,7 @@
#include "alloc-util.h"
#include "bus-polkit.h"
#include "common-signal.h"
#include "dns-domain.h"
#include "event-util.h"
#include "fd-util.h"
@ -1129,6 +1130,11 @@ int manager_new(Manager **ret) {
(void) sd_event_add_signal(m->event, NULL, SIGTERM, NULL, NULL);
(void) sd_event_add_signal(m->event, NULL, SIGINT, NULL, NULL);
(void) sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
(void) sd_event_set_watchdog(m->event, true);

View file

@ -174,7 +174,7 @@ static int run(int argc, char *argv[]) {
return log_error_errno(r, "Failed to drop privileges: %m");
}
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)

View file

@ -31,6 +31,7 @@
#include "blockdev-util.h"
#include "cgroup-setup.h"
#include "cgroup-util.h"
#include "common-signal.h"
#include "cpu-set-util.h"
#include "daemon-util.h"
#include "dev-setup.h"
@ -112,6 +113,9 @@ typedef struct Manager {
sd_event_source *kill_workers_event;
sd_event_source *memory_pressure_event_source;
sd_event_source *sigrtmin18_event_source;
usec_t last_usec;
bool udev_node_needs_cleanup;
@ -264,6 +268,9 @@ static Manager* manager_free(Manager *manager) {
safe_close(manager->inotify_fd);
safe_close_pair(manager->worker_watch);
sd_event_source_unref(manager->memory_pressure_event_source);
sd_event_source_unref(manager->sigrtmin18_event_source);
free(manager->cgroup);
return mfree(manager);
}
@ -1918,7 +1925,7 @@ static int main_loop(Manager *manager) {
udev_watch_restore(manager->inotify_fd);
/* block and listen to all signals on signalfd */
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, SIGHUP, SIGCHLD, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGTERM, SIGINT, SIGHUP, SIGCHLD, SIGRTMIN+18, -1) >= 0);
r = sd_event_default(&manager->event);
if (r < 0)
@ -1976,6 +1983,16 @@ static int main_loop(Manager *manager) {
if (r < 0)
return log_error_errno(r, "Failed to create post event source: %m");
/* Eventually, we probably want to do more here on memory pressure, for example, kill idle workers immediately */
r = sd_event_add_memory_pressure(manager->event, &manager->memory_pressure_event_source, NULL, NULL);
if (r < 0)
log_full_errno(ERRNO_IS_NOT_SUPPORTED(r) || ERRNO_IS_PRIVILEGE(r) || (r == -EHOSTDOWN) ? LOG_DEBUG : LOG_WARNING, r,
"Failed to allocate memory pressure watch, ignoring: %m");
r = sd_event_add_signal(manager->event, &manager->sigrtmin18_event_source, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return log_error_errno(r, "Failed to allocate SIGRTMIN+18 event source: %m");
manager->last_usec = now(CLOCK_MONOTONIC);
udev_builtin_init();

View file

@ -4,6 +4,7 @@
#include "sd-daemon.h"
#include "common-signal.h"
#include "fd-util.h"
#include "fs-util.h"
#include "mkdir.h"
@ -102,6 +103,14 @@ int manager_new(Manager **ret) {
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, SIGRTMIN+18, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
(void) sd_event_set_watchdog(m->event, true);
m->workers_fixed = set_new(NULL);

View file

@ -37,7 +37,7 @@ static int run(int argc, char *argv[]) {
if (setenv("SYSTEMD_BYPASS_USERDB", "io.systemd.NameServiceSwitch:io.systemd.Multiplexer:io.systemd.DropIn", 1) < 0)
return log_error_errno(errno, "Failed to set $SYSTEMD_BYPASS_USERDB: %m");
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGUSR2, -1) >= 0);
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD, SIGTERM, SIGINT, SIGUSR2, SIGRTMIN+18, -1) >= 0);
r = manager_new(&m);
if (r < 0)

View file

@ -0,0 +1 @@
../TEST-01-BASIC/Makefile

16
test/TEST-79-MEMPRESS/test.sh Executable file
View file

@ -0,0 +1,16 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: LGPL-2.1-or-later
set -e
TEST_DESCRIPTION="Test Memory Pressure handling"
# Ignore gcov complaints caused by DynamicUser=true
IGNORE_MISSING_COVERAGE=yes
# shellcheck source=test/test-functions
. "$TEST_BASE_DIR/test-functions"
test_append_files() {
image_install base64
}
do_test "$@"

View file

@ -0,0 +1,8 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
[Unit]
Description=TEST-79-MEMPRESS
[Service]
Type=oneshot
ExecStart=/usr/lib/systemd/tests/testdata/units/%N.sh
MemoryAccounting=1

63
test/units/testsuite-79.sh Executable file
View file

@ -0,0 +1,63 @@
#!/usr/bin/env bash
# SPDX-License-Identifier: LGPL-2.1-or-later
set -ex
set -o pipefail
# We don't just test whether the file exists, but also try to read from it:
# if CONFIG_PSI_DEFAULT_DISABLED is set in the kernel, the file exists and can
# be opened, but any read() fails with EOPNOTSUPP, which we want to detect.
if ! cat /proc/pressure/memory >/dev/null ; then
echo "Kernel has no usable PSI support (too old, or PSI disabled), skipping test." >&2
echo OK >/testok
exit 0
fi
systemd-analyze log-level debug
CGROUP=/sys/fs/cgroup/"$(systemctl show testsuite-79.service -P ControlGroup)"
test -d "$CGROUP"
if ! test -f "$CGROUP"/memory.pressure ; then
echo "No memory accounting/PSI delegated via cgroup, can't test." >&2
echo OK >/testok
exit 0
fi
UNIT="test-mempress-$RANDOM.service"
SCRIPT="/run/bin/mempress-$RANDOM.sh"
mkdir -p "/run/bin"
cat >"$SCRIPT" <<'EOF'
#!/bin/bash
set -ex
export
id
test -n "$MEMORY_PRESSURE_WATCH"
test "$MEMORY_PRESSURE_WATCH" != /dev/null
test -w "$MEMORY_PRESSURE_WATCH"
ls -al "$MEMORY_PRESSURE_WATCH"
EXPECTED="$(echo -n -e "some 123000 1000000\x00" | base64)"
test "$EXPECTED" = "$MEMORY_PRESSURE_WRITE"
EOF
chmod +x "$SCRIPT"
systemd-run -u "$UNIT" -p Type=exec -p DynamicUser=1 -p MemoryPressureWatch=on -p MemoryPressureThresholdSec=123ms --wait "$SCRIPT"
rm "$SCRIPT"
rmdir /run/bin ||:
systemd-analyze log-level info
echo OK >/testok
exit 0
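The check on `$MEMORY_PRESSURE_WRITE` in the test script compares against the Base64 encoding of the string `some 123000 1000000` followed by a NUL byte, where 123000 is `MemoryPressureThresholdSec=123ms` expressed in microseconds and 1000000 is the window the test expects. The following sketch reproduces that expected value in C. Both `base64_encode()` and `expected_pressure_write()` are hypothetical helpers written for this illustration, not systemd functions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: Base64-encode n bytes into a newly allocated,
 * NUL-terminated string. Returns NULL on allocation failure. */
static char *base64_encode(const unsigned char *in, size_t n) {
        static const char tbl[] =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
        char *out = malloc(4 * ((n + 2) / 3) + 1), *p = out;

        if (!out)
                return NULL;

        for (size_t i = 0; i < n; i += 3) {
                /* Pack up to three input bytes into a 24-bit group. */
                unsigned v = (unsigned) in[i] << 16;
                if (i + 1 < n)
                        v |= (unsigned) in[i + 1] << 8;
                if (i + 2 < n)
                        v |= (unsigned) in[i + 2];

                *p++ = tbl[(v >> 18) & 63];
                *p++ = tbl[(v >> 12) & 63];
                *p++ = i + 1 < n ? tbl[(v >> 6) & 63] : '=';
                *p++ = i + 2 < n ? tbl[v & 63] : '=';
        }

        *p = 0;
        return out;
}

/* Hypothetical helper: build the value the test expects in
 * $MEMORY_PRESSURE_WRITE for a given threshold and window (both in usec). */
static char *expected_pressure_write(unsigned long long threshold_usec,
                                     unsigned long long window_usec) {
        char buf[64];
        int n = snprintf(buf, sizeof buf, "some %llu %llu", threshold_usec, window_usec);

        if (n < 0 || (size_t) n >= sizeof buf)
                return NULL;

        /* n + 1: the test's echo appends "\x00", so the trailing NUL byte is
         * part of the encoded payload. */
        return base64_encode((const unsigned char *) buf, (size_t) n + 1);
}
```

Calling `expected_pressure_write(123000, 1000000)` returns `c29tZSAxMjMwMDAgMTAwMDAwMAA=`, the same string the script computes with `echo -n -e ... | base64`.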

View file

@ -26,3 +26,4 @@ TasksMax=infinity
TimeoutStopSec={{ DEFAULT_USER_TIMEOUT_SEC*4//3 }}s
KeyringMode=inherit
OOMScoreAdjust=100
MemoryPressureWatch=skip