core: allow using seccomp without no_new_privs when unprivileged

Until now, using any form of seccomp while being unprivileged (User=)
resulted in systemd enabling no_new_privs.

There's no need for doing this because:
* We trust the filters we apply
* If User= is set and a process wants to apply a new seccomp filter, it
  will need to set no_new_privs itself

An example of an application that might want seccomp + !no_new_privs is a
program that wants to run as an unprivileged user but uses file
capabilities to start a web server on a privileged port while
benefitting from a restrictive seccomp profile.

We now keep the privileges needed to do seccomp before calling
enforce_user() and drop them after the seccomp filters are applied.

If the syscall filter doesn't allow the syscalls needed to drop the
privileges, we keep the previous behavior by enabling no_new_privs.
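
The fallback rule in the last paragraph can be sketched in isolation. The
snippet below is a simplified, standalone model of that check, not the code
from this commit: the function name and the plain string-array input are
assumptions made for illustration, whereas the actual change walks the unit's
syscall filter hashmap. The idea is that capget(), capset() and prctl() must
still be callable after the filter is installed, otherwise the retained
capabilities could not be dropped again.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (illustration only, not systemd's API): returns true
 * if a syscall filter made of `n` syscall names still permits capget(),
 * capset() and prctl(), i.e. the calls needed to drop privileges after the
 * filter has been applied. */
static bool filter_allows_drop_privileges(const char *const *names, size_t n, bool is_allow_list) {
        bool has_capget = false, has_capset = false, has_prctl = false;

        /* No filter at all: nothing can block the later privilege drop. */
        if (n == 0)
                return true;

        for (size_t i = 0; i < n; i++) {
                if (!names[i]) /* entry that could not be resolved to a name */
                        continue;

                if (strcmp(names[i], "capget") == 0)
                        has_capget = true;
                else if (strcmp(names[i], "capset") == 0)
                        has_capset = true;
                else if (strcmp(names[i], "prctl") == 0)
                        has_prctl = true;
        }

        /* Allow-list: all three calls must be listed.
         * Deny-list: none of the three may be listed. */
        if (is_allow_list)
                return has_capget && has_capset && has_prctl;

        return !(has_capget || has_capset || has_prctl);
}
```

With an allow-list of { "read", "write" } this model returns false, which
corresponds to the case where no_new_privs stays enabled as before.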
2024-07-21 18:24:38 +00:00 · 2023-11-07 11:06:56 +01:00 · 2023-11-07 11:06:56 +01:00 · 24832d10b6
parent b3e199cec8
commit 24832d10b6
4 changed files with 181 additions and 101 deletions
--- a/man/systemd.exec.xml
+++ b/man/systemd.exec.xml
@@ -823,21 +823,10 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting>
        <listitem><para>Takes a boolean argument. If true, ensures that the service process and all its
        children can never gain new privileges through <function>execve()</function> (e.g. via setuid or
        setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that
-        a process and its children can never elevate privileges again. Defaults to false, but certain
-        settings override this and ignore the value of this setting. This is the case when
-        <varname>DynamicUser=</varname>, <varname>LockPersonality=</varname>,
-        <varname>MemoryDenyWriteExecute=</varname>, <varname>PrivateDevices=</varname>,
-        <varname>ProtectClock=</varname>, <varname>ProtectHostname=</varname>,
-        <varname>ProtectKernelLogs=</varname>, <varname>ProtectKernelModules=</varname>,
-        <varname>ProtectKernelTunables=</varname>, <varname>RestrictAddressFamilies=</varname>,
-        <varname>RestrictNamespaces=</varname>, <varname>RestrictRealtime=</varname>,
-        <varname>RestrictSUIDSGID=</varname>, <varname>SystemCallArchitectures=</varname>,
-        <varname>SystemCallFilter=</varname>, or <varname>SystemCallLog=</varname> are specified. Note that
-        even if this setting is overridden by them, <command>systemctl show</command> shows the original
-        value of this setting. In case the service will be run in a new mount namespace anyway and SELinux is
-        disabled, all file systems are mounted with <constant>MS_NOSUID</constant> flag. Also see
-        the kernel document
-        <ulink url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
+        a process and its children can never elevate privileges again. Defaults to false. In case the service
+        will be run in a new mount namespace anyway and SELinux is disabled, all file systems are mounted with
+        <constant>MS_NOSUID</constant> flag. Also see <ulink
+        url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>.
        </para>

        <para>Note that this setting only has an effect on the unit's processes themselves (or any processes
@@ -1779,9 +1768,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        <citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry> of
        <filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>. For this setting the
        same restrictions regarding mount propagation and privileges apply as for
-        <varname>ReadOnlyPaths=</varname> and related calls, see above. If turned on and if running in user
-        mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-        <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        <varname>ReadOnlyPaths=</varname> and related calls, see above.</para>

        <para>Note that the implementation of this setting might be impossible (for example if mount
        namespaces are not available), and the unit should be written in a way that does not solely rely on
@@ -1973,10 +1960,6 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        the system into the service, it is hence not suitable for services that need to take notice of system
        hostname changes dynamically.</para>

-        <para>If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
-        capability (e.g. services for which <varname>User=</varname> is set),
-        <varname>NoNewPrivileges=yes</varname> is implied.</para>
-
        <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

        <xi:include href="version-info.xml" xpointer="v242"/></listitem>
@@ -1994,9 +1977,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        Effectively, <filename>/dev/rtc0</filename>, <filename>/dev/rtc1</filename>, etc. are made read-only
        to the service. See
        <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>
-        for the details about <varname>DeviceAllow=</varname>. If this setting is on, but the unit doesn't
-        have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for which
-        <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        for the details about <varname>DeviceAllow=</varname>.</para>

        <para>It is recommended to turn this on for most services that do not need modify the clock or check
        its state.</para>
@@ -2018,13 +1999,10 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> mechanism. Few
        services need to write to these at runtime; it is hence recommended to turn this on for most services. For this
        setting the same restrictions regarding mount propagation and privileges apply as for
-        <varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. If this
-        setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability
-        (e.g. services for which <varname>User=</varname> is set),
-        <varname>NoNewPrivileges=yes</varname> is implied. Note that this option does not prevent
-        indirect changes to kernel tunables effected by IPC calls to other processes. However,
-        <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system objects
-        inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
+        <varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
+        Note that this option does not prevent indirect changes to kernel tunables effected by IPC calls to
+        other processes. However, <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system
+        objects inaccessible. If <varname>ProtectKernelTunables=</varname> is set,
        <varname>MountAPIVFS=yes</varname> is implied.</para>

        <xi:include href="system-or-user-ns.xml" xpointer="singular"/>
@@ -2046,9 +2024,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        both privileged and unprivileged. To disable module auto-load feature please see
        <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry>
        <constant>kernel.modules_disabled</constant> mechanism and
-        <filename>/proc/sys/kernel/modules_disabled</filename> documentation. If this setting is on,
-        but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for
-        which <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        <filename>/proc/sys/kernel/modules_disabled</filename> documentation.</para>

        <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

@@ -2067,9 +2043,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        <citerefentry project='man-pages'><refentrytitle>syslog</refentrytitle><manvolnum>3</manvolnum></citerefentry>
        for userspace logging). The kernel exposes its log buffer to userspace via <filename>/dev/kmsg</filename> and
        <filename>/proc/kmsg</filename>. If enabled, these are made inaccessible to all the processes in the unit.
-        If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant>
-        capability (e.g. services for which <varname>User=</varname> is set),
-        <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        </para>

        <xi:include href="system-or-user-ns.xml" xpointer="singular"/>

@@ -2113,12 +2087,9 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting>
        including x86-64). Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
        recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
        restrictions of this option. Specifically, it is recommended to combine this option with
-        <varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
-        mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-        <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. By default, no
-        restrictions apply, all address families are accessible to processes. If assigned the empty string,
-        any previous address family restriction changes are undone. This setting does not affect commands
-        prefixed with <literal>+</literal>.</para>
+        <varname>SystemCallArchitectures=native</varname> or similar. By default, no restrictions apply, all
+        address families are accessible to processes. If assigned the empty string, any previous address family
+        restriction changes are undone. This setting does not affect commands prefixed with <literal>+</literal>.</para>

        <para>Use this option to limit exposure of processes to remote access, in particular via exotic and sensitive
        network protocols, such as <constant>AF_PACKET</constant>. Note that in most cases, the local
@@ -2251,9 +2222,7 @@ RestrictFileSystems=ext4</programlisting>
        creation and switching of the specified types of namespaces (or all of them, if true) access to the
        <function>setns()</function> system call with a zero flags parameter is prohibited. This setting is only
        supported on x86, x86-64, mips, mips-le, mips64, mips64-le, mips64-n32, mips64-le-n32, ppc64, ppc64-le, s390
-        and s390x, and enforces no restrictions on other architectures. If running in user mode, or in system mode, but
-        without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-        <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        and s390x, and enforces no restrictions on other architectures.</para>

        <para>Example: if a unit has the following,
        <programlisting>RestrictNamespaces=cgroup ipc
@@ -2274,9 +2243,7 @@ RestrictNamespaces=~cgroup net</programlisting>
        project='man-pages'><refentrytitle>personality</refentrytitle><manvolnum>2</manvolnum></citerefentry> system
        call so that the kernel execution domain may not be changed from the default or the personality selected with
        <varname>Personality=</varname> directive. This may be useful to improve security, because odd personality
-        emulations may be poorly tested and source of vulnerabilities. If running in user mode, or in system mode, but
-        without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-        <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        emulations may be poorly tested and source of vulnerabilities.</para>

        <xi:include href="version-info.xml" xpointer="v235"/></listitem>
      </varlistentry>
@@ -2308,9 +2275,7 @@ RestrictNamespaces=~cgroup net</programlisting>
        available on x86. Note that on systems supporting multiple ABIs (such as x86/x86-64) it is
        recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the
        restrictions of this option. Specifically, it is recommended to combine this option with
-        <varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system
-        mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-        <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para>
+        <varname>SystemCallArchitectures=native</varname> or similar.</para>

        <xi:include href="version-info.xml" xpointer="v231"/></listitem>
      </varlistentry>
@@ -2322,9 +2287,7 @@ RestrictNamespaces=~cgroup net</programlisting>
        the unit are refused. This restricts access to realtime task scheduling policies such as
        <constant>SCHED_FIFO</constant>, <constant>SCHED_RR</constant> or <constant>SCHED_DEADLINE</constant>. See
        <citerefentry project='man-pages'><refentrytitle>sched</refentrytitle><manvolnum>7</manvolnum></citerefentry>
-        for details about these scheduling policies. If running in user mode, or in system mode, but without the
-        <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-        <varname>NoNewPrivileges=yes</varname> is implied. Realtime scheduling policies may be used to monopolize CPU
+        for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU
        time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service
        situations on the system. It is hence recommended to restrict access to realtime scheduling to the few programs
        that actually require them. Defaults to off.</para>
@@ -2338,10 +2301,8 @@ RestrictNamespaces=~cgroup net</programlisting>
        <listitem><para>Takes a boolean argument. If set, any attempts to set the set-user-ID (SUID) or
        set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see
        <citerefentry
-        project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>). If
-        running in user mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant>
-        capability (e.g. setting <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is
-        implied. As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
+        project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>).
+        As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the
        identity of other users, it is recommended to restrict creation of SUID/SGID files to the few
        programs that actually require them. Note that this restricts marking of any type of file system
        object with these bits, including both regular files and directories (where the SGID is a different
@@ -2457,15 +2418,12 @@ RestrictNamespaces=~cgroup net</programlisting>
        full list). This value will be returned when a deny-listed system call is triggered, instead of
        terminating the processes immediately. Special setting <literal>kill</literal> can be used to
        explicitly specify killing. This value takes precedence over the one given in
-        <varname>SystemCallErrorNumber=</varname>, see below. If running in user mode, or in system mode,
-        but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting
-        <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. This feature
-        makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful
-        for enforcing a minimal sandboxing environment. Note that the <function>execve()</function>,
-        <function>exit()</function>, <function>exit_group()</function>, <function>getrlimit()</function>,
-        <function>rt_sigreturn()</function>, <function>sigreturn()</function> system calls and the system calls
-        for querying time and sleeping are implicitly allow-listed and do not need to be listed
-        explicitly. This option may be specified more than once, in which case the filter masks are
+        <varname>SystemCallErrorNumber=</varname>, see below. This feature makes use of the Secure Computing Mode 2
+        interfaces of the kernel ('seccomp filtering') and is useful for enforcing a minimal sandboxing environment.
+        Note that the <function>execve()</function>, <function>exit()</function>, <function>exit_group()</function>,
+        <function>getrlimit()</function>, <function>rt_sigreturn()</function>, <function>sigreturn()</function>
+        system calls and the system calls for querying time and sleeping are implicitly allow-listed and do not
+        need to be listed explicitly. This option may be specified more than once, in which case the filter masks are
        merged. If the empty string is assigned, the filter is reset, all prior assignments will have no
        effect. This does not affect commands prefixed with <literal>+</literal>.</para>

@@ -2692,10 +2650,7 @@ SystemCallErrorNumber=EPERM</programlisting>
        as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and
        the special identifier <constant>native</constant>. The special identifier <constant>native</constant>
        implicitly maps to the native architecture of the system (or more precisely: to the architecture the system
-        manager is compiled for). If running in user mode, or in system mode, but without the
-        <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-        <varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no
-        filtering is applied.</para>
+        manager is compiled for). By default, this option is set to the empty list, i.e. no filtering is applied.</para>

        <para>If this setting is used, processes of this unit will only be permitted to call native system calls, and
        system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated
@@ -2723,13 +2678,11 @@ SystemCallErrorNumber=EPERM</programlisting>
        <listitem><para>Takes a space-separated list of system call names. If this setting is used, all
        system calls executed by the unit processes for the listed ones will be logged. If the first
        character of the list is <literal>~</literal>, the effect is inverted: all system calls except the
-        listed system calls will be logged. If running in user mode, or in system mode, but without the
-        <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>),
-        <varname>NoNewPrivileges=yes</varname> is implied. This feature makes use of the Secure Computing
-        Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for auditing or setting up a
-        minimal sandboxing environment. This option may be specified more than once, in which case the filter
-        masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will
-        have no effect. This does not affect commands prefixed with <literal>+</literal>.</para>
+        listed system calls will be logged. This feature makes use of the Secure Computing Mode 2 interfaces
+        of the kernel ('seccomp filtering') and is useful for auditing or setting up a minimal sandboxing
+        environment. This option may be specified more than once, in which case the filter masks are merged.
+        If the empty string is assigned, the filter is reset, all prior assignments will have no effect.
+        This does not affect commands prefixed with <literal>+</literal>.</para>

        <xi:include href="version-info.xml" xpointer="v247"/></listitem>
      </varlistentry>
--- a/src/basic/capability-util.c
+++ b/src/basic/capability-util.c
@@ -367,16 +367,16 @@ int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities) {
        return 0;
 }

-int drop_capability(cap_value_t cv) {
+static int change_capability(cap_value_t cv, cap_flag_value_t flag) {
        _cleanup_cap_free_ cap_t tmp_cap = NULL;

        tmp_cap = cap_get_proc();
        if (!tmp_cap)
                return -errno;

-        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, CAP_CLEAR) < 0) ||
-            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, CAP_CLEAR) < 0))
+        if ((cap_set_flag(tmp_cap, CAP_INHERITABLE, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_PERMITTED, 1, &cv, flag) < 0) ||
+            (cap_set_flag(tmp_cap, CAP_EFFECTIVE, 1, &cv, flag) < 0))
                return -errno;

        if (cap_set_proc(tmp_cap) < 0)
@@ -385,6 +385,14 @@ int drop_capability(cap_value_t cv) {
        return 0;
 }

+int drop_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_CLEAR);
+}
+
+int keep_capability(cap_value_t cv) {
+        return change_capability(cv, CAP_SET);
+}
+
 bool ambient_capabilities_supported(void) {
        static int cache = -1;

--- a/src/basic/capability-util.h
+++ b/src/basic/capability-util.h
@@ -31,6 +31,7 @@ int capability_update_inherited_set(cap_t caps, uint64_t ambient_set);
 int drop_privileges(uid_t uid, gid_t gid, uint64_t keep_capabilities);

 int drop_capability(cap_value_t cv);
+int keep_capability(cap_value_t cv);

 DEFINE_TRIVIAL_CLEANUP_FUNC_FULL(cap_t, cap_free, NULL);
 #define _cleanup_cap_free_ _cleanup_(cap_freep)
--- a/src/core/exec-invoke.c
+++ b/src/core/exec-invoke.c
@@ -1378,15 +1378,7 @@ static bool context_has_syscall_logs(const ExecContext *c) {
                !hashmap_isempty(c->syscall_log);
 }

-static bool context_has_no_new_privileges(const ExecContext *c) {
-        assert(c);
-
-        if (c->no_new_privileges)
-                return true;
-
-        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
-                return false;
-
+static bool context_has_seccomp(const ExecContext *c) {
        /* We need NNP if we have any form of seccomp and are unprivileged */
        return c->lock_personality ||
                c->memory_deny_write_execute ||
@@ -1405,8 +1397,49 @@ static bool context_has_no_new_privileges(const ExecContext *c) {
                context_has_syscall_logs(c);
 }

+static bool context_has_no_new_privileges(const ExecContext *c) {
+        assert(c);
+
+        if (c->no_new_privileges)
+                return true;
+
+        if (have_effective_cap(CAP_SYS_ADMIN) > 0) /* if we are privileged, we don't need NNP */
+                return false;
+
+        return context_has_seccomp(c);
+}
+
 #if HAVE_SECCOMP

+static bool seccomp_allows_drop_privileges(const ExecContext *c) {
+        void *id, *val;
+        bool has_capget = false, has_capset = false, has_prctl = false;
+
+        assert(c);
+
+        /* No syscall filter, we are allowed to drop privileges */
+        if (hashmap_isempty(c->syscall_filter))
+                return true;
+
+        HASHMAP_FOREACH_KEY(val, id, c->syscall_filter) {
+                _cleanup_free_ char *name = NULL;
+
+                name = seccomp_syscall_resolve_num_arch(SCMP_ARCH_NATIVE, PTR_TO_INT(id) - 1);
+
+                if (streq(name, "capget"))
+                        has_capget = true;
+                else if (streq(name, "capset"))
+                        has_capset = true;
+                else if (streq(name, "prctl"))
+                        has_prctl = true;
+        }
+
+        if (c->syscall_allow_list)
+                return has_capget && has_capset && has_prctl;
+        else
+                return !(has_capget || has_capset || has_prctl);
+}
+
 static bool skip_seccomp_unavailable(const ExecContext *c, const ExecParameters *p, const char* msg) {

        if (is_seccomp_available())
@@ -3911,6 +3944,7 @@ int exec_invoke(
                needs_setuid,           /* Do we need to do the actual setresuid()/setresgid() calls? */
                needs_mount_namespace,  /* Do we need to set up a mount namespace for this kernel? */
                needs_ambient_hack;     /* Do we need to apply the ambient capabilities hack? */
+        bool keep_seccomp_privileges = false;
 #if HAVE_SELINUX
        _cleanup_free_ char *mac_selinux_context_net = NULL;
        bool use_selinux = false;
@@ -3920,6 +3954,9 @@ int exec_invoke(
 #endif
 #if HAVE_APPARMOR
        bool use_apparmor = false;
+#endif
+#if HAVE_SECCOMP
+        uint64_t saved_bset = 0;
 #endif
        uid_t saved_uid = getuid();
        gid_t saved_gid = getgid();
@@ -4817,6 +4854,28 @@ int exec_invoke(
                                (UINT64_C(1) << CAP_SETUID) |
                                (UINT64_C(1) << CAP_SETGID);

+#if HAVE_SECCOMP
+                /* If the service has any form of a seccomp filter and it allows dropping privileges, we'll
+                 * keep the needed privileges to apply it even if we're not root. */
+                if (needs_setuid &&
+                    uid_is_valid(uid) &&
+                    context_has_seccomp(context) &&
+                    seccomp_allows_drop_privileges(context)) {
+                        keep_seccomp_privileges = true;
+
+                        if (prctl(PR_SET_KEEPCAPS, 1) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to enable keep capabilities flag: %m");
+                        }
+
+                        /* Save the current bounding set so we can restore it after applying the seccomp
+                         * filter */
+                        saved_bset = bset;
+                        bset |= (UINT64_C(1) << CAP_SYS_ADMIN) |
+                                (UINT64_C(1) << CAP_SETPCAP);
+                }
+#endif
+
                if (!cap_test_all(bset)) {
                        r = capability_bounding_set_drop(bset, /* right_now= */ false);
                        if (r < 0) {
@@ -4858,6 +4917,26 @@ int exec_invoke(
                                return log_exec_error_errno(context, params, r, "Failed to change UID to " UID_FMT ": %m", uid);
                        }

+                        if (keep_seccomp_privileges) {
+                                r = drop_capability(CAP_SETUID);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETUID: %m");
+                                }
+
+                                r = keep_capability(CAP_SYS_ADMIN);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to keep CAP_SYS_ADMIN: %m");
+                                }
+
+                                r = keep_capability(CAP_SETPCAP);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to keep CAP_SETPCAP: %m");
+                                }
+                        }
+
                        if (!needs_ambient_hack && capability_ambient_set != 0) {

                                /* Raise the ambient capabilities after user change. */
@@ -5027,14 +5106,6 @@ int exec_invoke(
                        *exit_status = EXIT_SECCOMP;
                        return log_exec_error_errno(context, params, r, "Failed to apply system call log filters: %m");
                }
-
-                /* This really should remain the last step before the execve(), to make sure our own code is unaffected
-                 * by the filter as little as possible. */
-                r = apply_syscall_filter(context, params, needs_ambient_hack);
-                if (r < 0) {
-                        *exit_status = EXIT_SECCOMP;
-                        return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
-                }
 #endif

 #if HAVE_LIBBPF
@@ -5045,6 +5116,53 @@ int exec_invoke(
                }
 #endif

+#if HAVE_SECCOMP
+                /* This really should remain as close to the execve() as possible, to make sure our own code is unaffected
+                 * by the filter as little as possible. */
+                r = apply_syscall_filter(context, params, needs_ambient_hack);
+                if (r < 0) {
+                        *exit_status = EXIT_SECCOMP;
+                        return log_exec_error_errno(context, params, r, "Failed to apply system call filters: %m");
+                }
+
+                if (keep_seccomp_privileges) {
+                        /* Restore the capability bounding set with what's expected from the service + the
+                         * ambient capabilities hack */
+                        if (!cap_test_all(saved_bset)) {
+                                r = capability_bounding_set_drop(saved_bset, /* right_now= */ false);
+                                if (r < 0) {
+                                        *exit_status = EXIT_CAPABILITIES;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop bset capabilities: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SYS_ADMIN if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SYS_ADMIN))) {
+                                r = drop_capability(CAP_SYS_ADMIN);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SYS_ADMIN: %m");
+                                }
+                        }
+
+                        /* Only drop CAP_SETPCAP if it's not in the bounding set, otherwise we'll break
+                         * applications that use it. */
+                        if (!FLAGS_SET(saved_bset, (UINT64_C(1) << CAP_SETPCAP))) {
+                                r = drop_capability(CAP_SETPCAP);
+                                if (r < 0) {
+                                        *exit_status = EXIT_USER;
+                                        return log_exec_error_errno(context, params, r, "Failed to drop CAP_SETPCAP: %m");
+                                }
+                        }
+
+                        if (prctl(PR_SET_KEEPCAPS, 0) < 0) {
+                                *exit_status = EXIT_USER;
+                                return log_exec_error_errno(context, params, errno, "Failed to drop keep capabilities flag: %m");
+                        }
+                }
+#endif
+
        }

        if (!strv_isempty(context->unset_environment)) {