core: allow using seccomp without no_new_privs when unprivileged

Until now, using any form of seccomp while being unprivileged (User=)
resulted in systemd enabling no_new_privs.

There's no need to do this because:

* We trust the filters we apply
* If User= is set and a process wants to apply a new seccomp filter, it
will need to set no_new_privs itself

An example of an application that might want seccomp + !no_new_privs is a
program that wants to run as an unprivileged user but uses file
capabilities to start a web server on a privileged port while
benefiting from a restrictive seccomp profile.

We now keep the privileges needed to do seccomp before calling
enforce_user() and drop them after the seccomp filters are applied.

If the syscall filter doesn't allow the needed syscalls to drop the
privileges, we keep the previous behavior by enabling no_new_privs.
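
The fallback decision described above can be sketched in isolation as
follows. This is only an illustration, not the code added by this commit:
filter_allows_drop_privileges() is a hypothetical helper, and it takes a
plain array of syscall names where systemd walks a hashmap of syscall
numbers. It assumes, as the diff does, that capget, capset and prctl are
the syscalls needed to drop the retained privileges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the unit's syscall filter: an array of syscall
 * names plus a flag saying whether the list is an allow-list or a deny-list. */
static bool filter_allows_drop_privileges(const char *const *names, size_t n, bool allow_list) {
        bool has_capget = false, has_capset = false, has_prctl = false;

        assert(names || n == 0);

        /* No syscall filter at all: nothing can block dropping privileges. */
        if (n == 0)
                return true;

        for (size_t i = 0; i < n; i++) {
                if (!names[i])
                        continue; /* skip entries that could not be resolved to a name */

                if (strcmp(names[i], "capget") == 0)
                        has_capget = true;
                else if (strcmp(names[i], "capset") == 0)
                        has_capset = true;
                else if (strcmp(names[i], "prctl") == 0)
                        has_prctl = true;
        }

        /* Allow-list: all three syscalls must be listed.
         * Deny-list: none of the three may be listed. */
        if (allow_list)
                return has_capget && has_capset && has_prctl;

        return !(has_capget || has_capset || has_prctl);
}
```

With an allow-list of { "read", "write" } the helper returns false, which
corresponds to the case where no_new_privs is still enabled. Adding capget,
capset and prctl to that list makes it return true.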
This commit is contained in:
Iago López Galeiras 2023-11-07 11:06:56 +01:00
parent b3e199cec8
commit 24832d10b6
4 changed files with 181 additions and 101 deletions


@ -823,21 +823,10 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting>
<listitem><para>Takes a boolean argument. If true, ensures that the service process and all its
children can never gain new privileges through <function>execve()</function> (e.g. via setuid or
setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that
a process and its children can never elevate privileges again. Defaults to false, but certain
settings override this and ignore the value of this setting. This is the case when
<varname>DynamicUser=</varname>, <varname>LockPersonality=</varname>,
<varname>MemoryDenyWriteExecute=</varname>, <varname>PrivateDevices=</varname>,
<varname>ProtectClock=</varname>, <varname>ProtectHostname=</varname>,
<varname>ProtectKernelLogs=</varname>, <varname>ProtectKernelModules=</varname>,
<varname>ProtectKernelTunables=</varname>, <varname>RestrictAddressFamilies=</varname>,
<varname>RestrictNamespaces=</varname>, <varname>RestrictRealtime=</varname>,
<varname>RestrictSUIDSGID=</varname>, <varname>SystemCallArchitectures=</varname>,
<varname>SystemCallFilter=</varname>, or <varname>SystemCallLog=</varname> are specified. Note that
even if this setting is overridden by them, <command>systemctl show</command> shows the original
value of this setting. In case the service will be run in a new mount namespace anyway and SELinux is
disabled, all file systems are mounted with <constant>MS_NOSUID</constant> flag. Also see
the kernel document
<ulink url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
a process and its children can never elevate privileges again. Defaults to false. In case the service
will be run in a new mount namespace anyway and SELinux is disabled, all file systems are mounted with
<constant>MS_NOSUID</constant> flag. Also see <ulink
url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
</para>
<para>Note that this setting only has an effect on the unit's processes themselves (or any processes
@ -1779,9 +1768,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
<citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry> of
<filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>. For this setting the
same restrictions regarding mount propagation and privileges apply as for
<varname>ReadOnlyPaths=</varname> and related calls, see above. If turned on and if running in user
mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
<varname>ReadOnlyPaths=</varname> and related calls, see above.</para>
<para>Note that the implementation of this setting might be impossible (for example if mount
namespaces are not available), and the unit should be written in a way that does not solely rely on
@ -1973,10 +1960,6 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
the system into the service, it is hence not suitable for services that need to take notice of system
hostname changes dynamically.</para>
<para>If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
capability (e.g. services for which <varname>User=</varname> is set),
<varname>NoNewPrivileges=yes</varname> is implied.</para>
<xi:include href="system-or-user-ns.xml" xpointer="singular"/>
<xi:include href="version-info.xml" xpointer="v242"/></listitem>
@ -1994,9 +1977,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
Effectively, <filename>/dev/rtc0</filename>, <filename>/dev/rtc1</filename>, etc. are made read-only
to the service. See
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
for the details about <varname>DeviceAllow=</varname>. If this setting is on, but the unit doesn't
have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for which
<varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
for the details about <varname>DeviceAllow=</varname>.</para>
<para>It is recommended to turn this on for most services that do not need modify the clock or check
its state.</para>
@ -2018,13 +1999,10 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
<citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> mechanism. Few
services need to write to these at runtime; it is hence recommended to turn this on for most services. For this
setting the same restrictions regarding mount propagation and privileges apply as for
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. If this
setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability
(e.g. services for which <varname>User=</varname> is set),
<varname>NoNewPrivileges=yes</varname> is implied. Note that this option does not prevent
indirect changes to kernel tunables effected by IPC calls to other processes. However,
<varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system objects
inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
Note that this option does not prevent indirect changes to kernel tunables effected by IPC calls to
other processes. However, <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system
objects inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
<varname>MountAPIVFS=yes</varname> is implied.</para>
<xi:include href="system-or-user-ns.xml" xpointer="singular"/>
@ -2046,9 +2024,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
both privileged and unprivileged. To disable module auto-load feature please see
<citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry>
<constant>kernel.modules_disabled</constant> mechanism and
<filename>/proc/sys/kernel/modules_disabled</filename> documentation. If this setting is on,
but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for
which <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
<filename>/proc/sys/kernel/modules_disabled</filename> documentation.</para>
<xi:include href="system-or-user-ns.xml" xpointer="singular"/>
@ -2067,9 +2043,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
<citerefentry project='man-pages'><refentrytitle>syslog</refentrytitle><manvolnum>3</manvolnum></citerefentry>
for userspace logging). The kernel exposes its log buffer to userspace via <filename>/dev/kmsg</filename> and
<filename>/proc/kmsg</filename>. If enabled, these are made inaccessible to all the processes in the unit.
If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
capability (e.g. services for which <varname>User=</varname> is set),
<varname>NoNewPrivileges=yes</varname> is implied.</para>
</para>
<xi:include href="system-or-user-ns.xml" xpointer="singular"/>
@ -2113,12 +2087,9 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
including x86-64). Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
restrictions of this option. Specifically, it is recommended to combine this option with
<varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. By default, no
restrictions apply, all address families are accessible to processes. If assigned the empty string,
any previous address family restriction changes are undone. This setting does not affect commands
prefixed with <literal>+</literal>.</para>
<varname>SystemCallArchitectures=native</varname> or similar. By default, no restrictions apply, all
address families are accessible to processes. If assigned the empty string, any previous address family
restriction changes are undone. This setting does not affect commands prefixed with <literal>+</literal>.</para>
<para>Use this option to limit exposure of processes to remote access, in particular via exotic and sensitive
network protocols, such as <constant>AF_PACKET</constant>. Note that in most cases, the local
@ -2251,9 +2222,7 @@ RestrictFileSystems=ext4</programlisting>
creation and switching of the specified types of namespaces (or all of them, if true) access to the
<function>setns()</function> system call with a zero flags parameter is prohibited. This setting is only
supported on x86, x86-64, mips, mips-le, mips64, mips64-le, mips64-n32, mips64-le-n32, ppc64, ppc64-le, s390
and s390x, and enforces no restrictions on other architectures. If running in user mode, or in system mode, but
without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
<varname>NoNewPrivileges=yes</varname> is implied.</para>
and s390x, and enforces no restrictions on other architectures.</para>
<para>Example: if a unit has the following,
<programlisting>RestrictNamespaces=cgroup ipc
@ -2274,9 +2243,7 @@ RestrictNamespaces=~cgroup net</programlisting>
project='man-pages'><refentrytitle>personality</refentrytitle><manvolnum>2</manvolnum></citerefentry> system
call so that the kernel execution domain may not be changed from the default or the personality selected with
<varname>Personality=</varname> directive. This may be useful to improve security, because odd personality
emulations may be poorly tested and source of vulnerabilities. If running in user mode, or in system mode, but
without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
<varname>NoNewPrivileges=yes</varname> is implied.</para>
emulations may be poorly tested and source of vulnerabilities.</para>
<xi:include href="version-info.xml" xpointer="v235"/></listitem>
</varlistentry>
@ -2308,9 +2275,7 @@ RestrictNamespaces=~cgroup net</programlisting>
available on x86. Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
restrictions of this option. Specifically, it is recommended to combine this option with
<varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
<varname>SystemCallArchitectures=native</varname> or similar.</para>
<xi:include href="version-info.xml" xpointer="v231"/></listitem>
</varlistentry>
@ -2322,9 +2287,7 @@ RestrictNamespaces=~cgroup net</programlisting>
the unit are refused. This restricts access to realtime task scheduling policies such as
<constant>SCHED_FIFO</constant>, <constant>SCHED_RR</constant> or <constant>SCHED_DEADLINE</constant>. See
<citerefentry project='man-pages'><refentrytitle>sched</refentrytitle><manvolnum>7</manvolnum></citerefentry>
for details about these scheduling policies. If running in user mode, or in system mode, but without the
<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
<varname>NoNewPrivileges=yes</varname> is implied. Realtime scheduling policies may be used to monopolize CPU
for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU
time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service
situations on the system. It is hence recommended to restrict access to realtime scheduling to the few programs
that actually require them. Defaults to off.</para>
@ -2338,10 +2301,8 @@ RestrictNamespaces=~cgroup net</programlisting>
<listitem><para>Takes a boolean argument. If set, any attempts to set the set-user-ID (SUID) or
set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see
<citerefentry
project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>). If
running in user mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant>
capability (e.g. setting <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is
implied. As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>).
As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
identity of other users, it is recommended to restrict creation of SUID/SGID files to the few
programs that actually require them. Note that this restricts marking of any type of file system
object with these bits, including both regular files and directories (where the SGID is a different
@ -2457,15 +2418,12 @@ RestrictNamespaces=~cgroup net</programlisting>
full list). This value will be returned when a deny-listed system call is triggered, instead of
terminating the processes immediately. Special setting <literal>kill</literal> can be used to
explicitly specify killing. This value takes precedence over the one given in
<varname>SystemCallErrorNumber=</varname>, see below. If running in user mode, or in system mode,
but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
<varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. This feature
makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful
for enforcing a minimal sandboxing environment. Note that the <function>execve()</function>,
<function>exit()</function>, <function>exit_group()</function>, <function>getrlimit()</function>,
<function>rt_sigreturn()</function>, <function>sigreturn()</function> system calls and the system calls
for querying time and sleeping are implicitly allow-listed and do not need to be listed
explicitly. This option may be specified more than once, in which case the filter masks are
<varname>SystemCallErrorNumber=</varname>, see below. This feature makes use of the Secure Computing Mode 2
interfaces of the kernel ('seccomp filtering') and is useful for enforcing a minimal sandboxing environment.
Note that the <function>execve()</function>, <function>exit()</function>, <function>exit_group()</function>,
<function>getrlimit()</function>, <function>rt_sigreturn()</function>, <function>sigreturn()</function>
system calls and the system calls for querying time and sleeping are implicitly allow-listed and do not
need to be listed explicitly. This option may be specified more than once, in which case the filter masks are
merged. If the empty string is assigned, the filter is reset, all prior assignments will have no
effect. This does not affect commands prefixed with <literal>+</literal>.</para>
@ -2692,10 +2650,7 @@ SystemCallErrorNumber=EPERM</programlisting>
as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and
the special identifier <constant>native</constant>. The special identifier <constant>native</constant>
implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
manager is compiled for). If running in user mode, or in system mode, but without the
<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
<varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no
filtering is applied.</para>
manager is compiled for). By default, this option is set to the empty list, i.e. no filtering is applied.</para>
<para>If this setting is used, processes of this unit will only be permitted to call native system calls, and
system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
@ -2723,13 +2678,11 @@ SystemCallErrorNumber=EPERM</programlisting>
<listitem><para>Takes a space-separated list of system call names. If this setting is used, all
system calls executed by the unit processes for the listed ones will be logged. If the first
character of the list is <literal>~</literal>, the effect is inverted: all system calls except the
listed system calls will be logged. If running in user mode, or in system mode, but without the
<constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
<varname>NoNewPrivileges=yes</varname> is implied. This feature makes use of the Secure Computing
Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for auditing or setting up a
minimal sandboxing environment. This option may be specified more than once, in which case the filter
masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will
have no effect. This does not affect commands prefixed with <literal>+</literal>.</para>
listed system calls will be logged. This feature makes use of the Secure Computing Mode 2 interfaces
of the kernel ('seccomp filtering') and is useful for auditing or setting up a minimal sandboxing
environment. This option may be specified more than once, in which case the filter masks are merged.
If the empty string is assigned, the filter is reset, all prior assignments will have no effect.
This does not affect commands prefixed with <literal>+</literal>.</para>
<xi:include href="version-info.xml" xpointer="v247"/></listitem>
</varlistentry>


@ -367,16 +367,16 @@ int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
return 0;
}
int drop_capability(cap_value_t cv) {
static int change_capability(cap_value_t cv, cap_flag_value_t flag) {
_cleanup_cap_free_ cap_t tmp_cap = NULL;
tmp_cap = cap_get_proc();
if (!tmp_cap)
return -errno;
if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, CAP_CLEAR) < 0) ||
(cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, CAP_CLEAR) < 0) ||
(cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, flag) < 0) ||
(cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, flag) < 0) ||
(cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, flag) < 0))
return -errno;
if (cap_set_proc(tmp_cap) < 0)
@ -385,6 +385,14 @@ int drop_capability(cap_value_t cv) {
return 0;
}
int drop_capability(cap_value_t cv) {
return change_capability(cv, CAP_CLEAR);
}
int keep_capability(cap_value_t cv) {
return change_capability(cv, CAP_SET);
}
bool ambient_capabilities_supported(void) {
static int cache = -1;


@ -31,6 +31,7 @@ int capability_update_inherited_set(cap_t caps, uint64_t ambient_set);
int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);
int drop_capability(cap_value_t cv);
int keep_capability(cap_value_t cv);
DEFINE_TRIVIAL_CLEANUP_FUNC_FULL(cap_t, cap_free, NULL);
#define _cleanup_cap_free_ _cleanup_(cap_freep)


@ -1378,15 +1378,7 @@ static bool context_has_syscall_logs(const ExecContext *c) {
!hashmap_isempty(c->syscall_log);
}
static bool context_has_no_new_privileges(const ExecContext *c) {
assert(c);
if (c->no_new_privileges)
return true;
if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
return false;
static bool context_has_seccomp(const ExecContext *c) {
/* We need NNP if we have any form of seccomp and are unprivileged */
return c->lock_personality ||
c->memory_deny_write_execute ||
@ -1405,8 +1397,49 @@ static bool context_has_no_new_privileges(const ExecContext *c) {
context_has_syscall_logs(c);
}
static bool context_has_no_new_privileges(const ExecContext *c) {
assert(c);
if (c->no_new_privileges)
return true;
if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
return false;
return context_has_seccomp(c);
}
#if HAVE_SECCOMP
static bool seccomp_allows_drop_privileges(const ExecContext *c) {
void *id, *val;
bool has_capget = false, has_capset = false, has_prctl = false;
assert(c);
/* No syscall filter, we are allowed to drop privileges */
if (hashmap_isempty(c->syscall_filter))
return true;
HASHMAP_FOREACH_KEY(val, id, c->syscall_filter) {
_cleanup_free_ char *name = NULL;
name = seccomp_syscall_resolve_num_arch(SCMP_ARCH_NATIVE, PTR_TO_INT(id) - 1);
if (streq(name, "capget"))
has_capget = true;
else if (streq(name, "capset"))
has_capset = true;
else if (streq(name, "prctl"))
has_prctl = true;
}
if (c->syscall_allow_list)
return has_capget && has_capset && has_prctl;
else
return !(has_capget || has_capset || has_prctl);
}
static bool skip_seccomp_unavailable(const ExecContext *c, const ExecParameters *p, const char* msg) {
if (is_seccomp_available())
@ -3911,6 +3944,7 @@ int exec_invoke(
needs_setuid, /* Do we need to do the actual setresuid()/setresgid() calls? */
needs_mount_namespace, /* Do we need to set up a mount namespace for this kernel? */
needs_ambient_hack; /* Do we need to apply the ambient capabilities hack? */
bool keep_seccomp_privileges = false;
#if HAVE_SELINUX
_cleanup_free_ char *mac_selinux_context_net = NULL;
bool use_selinux = false;
@ -3920,6 +3954,9 @@ int exec_invoke(
#endif
#if HAVE_APPARMOR
bool use_apparmor = false;
#endif
#if HAVE_SECCOMP
uint64_t saved_bset = 0;
#endif
uid_t saved_uid = getuid();
gid_t saved_gid = getgid();
@ -4817,6 +4854,28 @@ int exec_invoke(
(UINT64_C(1) << CAP_SETUID) |
(UINT64_C(1) << CAP_SETGID);
#if HAVE_SECCOMP
/* If the service has any form of a seccomp filter and it allows dropping privileges, we'll
* keep the needed privileges to apply it even if we're not root. */
if (needs_setuid &&
uid_is_valid(uid) &&
context_has_seccomp(context) &&
seccomp_allows_drop_privileges(context)) {
keep_seccomp_privileges = true;
if (prctl(PR_SET_KEEPCAPS, 1) < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, errno, "Failed to enable keep capabilities flag: %m");
}
/* Save the current bounding set so we can restore it after applying the seccomp
* filter */
saved_bset = bset;
bset |= (UINT64_C(1) << CAP_SYS_ADMIN) |
(UINT64_C(1) << CAP_SETPCAP);
}
#endif
if (!cap_test_all(bset)) {
r = capability_bounding_set_drop(bset, /* right_now= */ false);
if (r < 0) {
@ -4858,6 +4917,26 @@ int exec_invoke(
return log_exec_error_errno(context, params, r, "Failed to change UID to " UID_FMT ": %m", uid);
}
if (keep_seccomp_privileges) {
r = drop_capability(CAP_SETUID);
if (r < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETUID: %m");
}
r = keep_capability(CAP_SYS_ADMIN);
if (r < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, r, "Failed to keep CAP_SYS_ADMIN: %m");
}
r = keep_capability(CAP_SETPCAP);
if (r < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, r, "Failed to keep CAP_SETPCAP: %m");
}
}
if (!needs_ambient_hack && capability_ambient_set != 0) {
/* Raise the ambient capabilities after user change. */
@ -5027,14 +5106,6 @@ int exec_invoke(
*exit_status = EXIT_SECCOMP;
return log_exec_error_errno(context, params, r, "Failed to apply system call log filters: %m");
}
/* This really should remain the last step before the execve(), to make sure our own code is unaffected
* by the filter as little as possible. */
r = apply_syscall_filter(context, params, needs_ambient_hack);
if (r < 0) {
*exit_status = EXIT_SECCOMP;
return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
}
#endif
#if HAVE_LIBBPF
@ -5045,6 +5116,53 @@ int exec_invoke(
}
#endif
#if HAVE_SECCOMP
/* This really should remain as close to the execve() as possible, to make sure our own code is unaffected
* by the filter as little as possible. */
r = apply_syscall_filter(context, params, needs_ambient_hack);
if (r < 0) {
*exit_status = EXIT_SECCOMP;
return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
}
if (keep_seccomp_privileges) {
/* Restore the capability bounding set with what's expected from the service + the
* ambient capabilities hack */
if (!cap_test_all(saved_bset)) {
r = capability_bounding_set_drop(saved_bset, /* right_now= */ false);
if (r < 0) {
*exit_status = EXIT_CAPABILITIES;
return log_exec_error_errno(context, params, r, "Failed to drop bset capabilities: %m");
}
}
/* Only drop CAP_SYS_ADMIN if it's not in the bounding set, otherwise we'll break
* applications that use it. */
if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SYS_ADMIN))) {
r = drop_capability(CAP_SYS_ADMIN);
if (r < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, r, "Failed to drop CAP_SYS_ADMIN: %m");
}
}
/* Only drop CAP_SETPCAP if it's not in the bounding set, otherwise we'll break
* applications that use it. */
if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SETPCAP))) {
r = drop_capability(CAP_SETPCAP);
if (r < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETPCAP: %m");
}
}
if (prctl(PR_SET_KEEPCAPS, 0) < 0) {
*exit_status = EXIT_USER;
return log_exec_error_errno(context, params, errno, "Failed to drop keep capabilities flag: %m");
}
}
#endif
}
if (!strv_isempty(context->unset_environment)) {