nsresourced: add new daemon for granting clients user namespaces and assigning resources to them

This adds a small, socket-activated Varlink daemon that can delegate UID
ranges for user namespaces to clients asking for it.

The primary call is AllocateUserRange() where the user passes in an
uninitialized userns fd, which is then set up.

There are other calls that allow assigning a mount fd to a userns
allocated that way, to set up permissions for a cgroup subtree, and to
allocate a veth for such a user namespace.

Since the UID assignments are supposed to be transient, i.e. not
permanent, care is taken to ensure that users cannot create inodes owned
by these UIDs, so that persistency cannot be acquired. This is
implemented via a BPF-LSM module that ensures that any member of a
userns allocated that way cannot create files unless the mount it
operates on is owned by the userns itself, or is explicitly
allowlisted.

BPF LSM program with contributions from Alexei Starovoitov.
This commit is contained in:
Lennart Poettering 2023-04-20 19:07:33 +02:00
parent 593428680c
commit 8aee931e7a
27 changed files with 4292 additions and 2 deletions
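
The policy described in the commit message can be modelled in plain userspace C. The following is a minimal sketch, not the BPF program itself: the types `userns_model` and `mount_model`, the `has_rules` flag and the function `model_validate()` are hypothetical stand-ins introduced only for illustration. It assumes the same three steps as the BPF-LSM code in this commit: allow if the mount is owned by the caller's user namespace or one of its descendants, allow if no rules are installed for the caller's user namespace, otherwise allow only allowlisted mount IDs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Same nesting limit the BPF program assumes for user namespaces */
#define USERNS_DEPTH_MAX 32U

/* Hypothetical stand-in for a kernel user namespace */
struct userns_model {
        const struct userns_model *parent; /* NULL for the initial namespace */
        unsigned inode;                    /* nsfs inode number, informational only */
        bool has_rules;                    /* true if an allowlist was installed for it */
        const int *allowed_mnt_ids;        /* allowlisted mount IDs, may be NULL if empty */
        size_t n_allowed;
};

/* Hypothetical stand-in for a mount: owning user namespace plus mount ID */
struct mount_model {
        const struct userns_model *owner;
        int mnt_id;
};

/* Returns 0 to allow inode creation or chown on the mount, -EPERM to refuse. */
static int model_validate(const struct userns_model *task_ns, const struct mount_model *mnt) {
        const struct userns_model *p;

        assert(task_ns);
        assert(mnt);
        assert(mnt->owner);

        /* Step 1: is the mount owned by the caller's userns or by a descendant of it? */
        p = mnt->owner;
        for (unsigned i = 0; i < USERNS_DEPTH_MAX; i++) {
                if (p == task_ns)
                        return 0;
                p = p->parent;
                if (!p)
                        break;
        }
        if (p) /* deeper than the nesting limit: refuse to be safe */
                return -EPERM;

        /* Step 2: namespaces without installed rules are not restricted */
        if (!task_ns->has_rules)
                return 0;

        /* Step 3: foreign mount, consult the allowlist */
        for (size_t i = 0; i < task_ns->n_allowed; i++)
                if (task_ns->allowed_mnt_ids[i] == mnt->mnt_id)
                        return 0;

        return -EPERM;
}
```

In this model a namespace with `has_rules` set and an empty allowlist may only write to mounts it owns itself (directly or via a child namespace), which corresponds to the "cannot create files unless the mount is owned by the userns itself, or is explicitly allowlisted" wording above.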


@ -1007,6 +1007,10 @@ manpages = [
['systemd-networkd.service', '8', ['systemd-networkd'], 'ENABLE_NETWORKD'],
['systemd-notify', '1', [], ''],
['systemd-nspawn', '1', [], ''],
['systemd-nsresourced.service',
'8',
['systemd-nsresourced'],
'ENABLE_NSRESOURCED'],
['systemd-oomd.service', '8', ['systemd-oomd'], 'ENABLE_OOMD'],
['systemd-path', '1', [], ''],
['systemd-pcrlock',


@ -0,0 +1,81 @@
<?xml version='1.0'?> <!--*-nxml-*-->
<!DOCTYPE refentry PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<!-- SPDX-License-Identifier: LGPL-2.1-or-later -->
<refentry id="systemd-nsresourced.service" conditional='ENABLE_NSRESOURCED'>
<refentryinfo>
<title>systemd-nsresourced.service</title>
<productname>systemd</productname>
</refentryinfo>
<refmeta>
<refentrytitle>systemd-nsresourced.service</refentrytitle>
<manvolnum>8</manvolnum>
</refmeta>
<refnamediv>
<refname>systemd-nsresourced.service</refname>
<refname>systemd-nsresourced</refname>
<refpurpose>User Namespace Resource Delegation Service</refpurpose>
</refnamediv>
<refsynopsisdiv>
<para><filename>systemd-nsresourced.service</filename></para>
<para><filename>/usr/lib/systemd/systemd-nsresourced</filename></para>
</refsynopsisdiv>
<refsect1>
<title>Description</title>
<para><command>systemd-nsresourced</command> is a system service that permits transient delegation of a
UID/GID range to a user namespace (see <citerefentry
project='man-pages'><refentrytitle>user_namespaces</refentrytitle><manvolnum>7</manvolnum></citerefentry>)
allocated by a client, via a Varlink IPC API.</para>
<para>Unprivileged clients may allocate a user namespace, and then request a UID/GID range to be assigned
to it via this service. The user namespace may then be used to run containers and other sandboxes, and/or
apply it to an id-mapped mount.</para>
<para>Allocations of UIDs/GIDs this way are transient: when a user namespace goes away, its UID/GID range
is returned to the pool of available ranges. In order to ensure that clients cannot gain persistency in
their transient UID/GID range a BPF-LSM based policy is enforced that ensures that user namespaces set up
this way can only write to file systems they allocate themselves or that are explicitly allowlisted via
<command>systemd-nsresourced</command>.</para>
<para><command>systemd-nsresourced</command> automatically ensures that any registered UID ranges show up
in the system's NSS database via the <ulink url="https://systemd.io/USER_GROUP_API">User/Group Record
Lookup API via Varlink</ulink>.</para>
<para>Currently, only UID/GID ranges consisting of either exactly 1 or exactly 65536 UIDs/GIDs can be
registered with this service. Moreover, UIDs and GIDs are always allocated together, and
symmetrically.</para>
<para>The service provides API calls to allowlist mounts (referenced via their mount file descriptors as
per Linux <function>fsmount()</function> API), to pass ownership of a cgroup subtree to the user
namespace and to delegate a virtual Ethernet device pair to the user namespace. When used in combination
this is sufficient to implement fully unprivileged container environments, as implemented by
<citerefentry><refentrytitle>systemd-nspawn</refentrytitle><manvolnum>1</manvolnum></citerefentry>, fully
unprivileged <varname>RootImage=</varname> (see
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry>) or
fully unprivileged disk image tools such as
<citerefentry><refentrytitle>systemd-dissect</refentrytitle><manvolnum>1</manvolnum></citerefentry>.</para>
<para>This service provides one <ulink url="https://varlink.org/">Varlink</ulink> service:
<constant>io.systemd.NamespaceResource</constant> allows registering user namespaces, and assigning mounts,
cgroups and network interfaces to them.</para>
</refsect1>
<refsect1>
<title>See Also</title>
<para>
<citerefentry><refentrytitle>systemd</refentrytitle><manvolnum>1</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd-mountfsd.service</refentrytitle><manvolnum>8</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd-nspawn</refentrytitle><manvolnum>1</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd.exec</refentrytitle><manvolnum>5</manvolnum></citerefentry>,
<citerefentry><refentrytitle>systemd-dissect</refentrytitle><manvolnum>1</manvolnum></citerefentry>,
<citerefentry project='man-pages'><refentrytitle>user_namespaces</refentrytitle><manvolnum>7</manvolnum></citerefentry>
</para>
</refsect1>
</refentry>


@ -272,6 +272,7 @@ conf.set_quoted('SYSTEMD_TEST_DATA', testdata_dir)
conf.set_quoted('SYSTEMD_TTY_ASK_PASSWORD_AGENT_BINARY_PATH', bindir / 'systemd-tty-ask-password-agent')
conf.set_quoted('SYSTEMD_UPDATE_HELPER_PATH', libexecdir / 'systemd-update-helper')
conf.set_quoted('SYSTEMD_USERWORK_PATH', libexecdir / 'systemd-userwork')
conf.set_quoted('SYSTEMD_NSRESOURCEWORK_PATH', libexecdir / 'systemd-nsresourcework')
conf.set_quoted('SYSTEMD_VERITYSETUP_PATH', libexecdir / 'systemd-veritysetup')
conf.set_quoted('SYSTEM_CONFIG_UNIT_DIR', pkgsysconfdir / 'system')
conf.set_quoted('SYSTEM_DATA_UNIT_DIR', systemunitdir)
@ -1619,6 +1620,7 @@ foreach term : ['analyze',
'machined',
'networkd',
'nscd',
'nsresourced',
'nss-myhostname',
'nss-systemd',
'oomd',
@ -2279,6 +2281,7 @@ subdir('src/mount')
subdir('src/network')
subdir('src/notify')
subdir('src/nspawn')
subdir('src/nsresourced')
subdir('src/nss-myhostname')
subdir('src/nss-mymachines')
subdir('src/nss-resolve')


@ -140,6 +140,8 @@ option('remote', type : 'feature', deprecated : { 'true' : 'enabled', 'false' :
description : 'support for "journal over the network"')
option('create-log-dirs', type : 'boolean',
description : 'create /var/log/journal{,/remote}')
option('nsresourced', type : 'boolean',
description : 'install the systemd-nsresourced stack')
option('nss-myhostname', type : 'boolean',
description : 'install nss-myhostname module')
option('nss-mymachines', type : 'feature', deprecated : { 'true' : 'enabled', 'false' : 'disabled' },


@ -0,0 +1,31 @@
# SPDX-License-Identifier: MIT-0
#
# This config file is installed as part of systemd.
# It may be freely copied and edited (following the MIT No Attribution license).
#
# To make local modifications, one of the following methods may be used:
# 1. add a drop-in file that extends this file by creating the
# /etc/systemd/network/80-namespace-ns.network.d/ directory and creating a
# new .conf file there.
# 2. copy this file into /etc/systemd/network or one of the other paths checked
# by systemd-networkd and edit it there.
# This file should not be edited in place, because it'll be overwritten on upgrades.
# This network file matches the host-side of the virtual Ethernet link
# created by systemd-nsresourced's network support. See systemd-nsresourced.service(8) for
# details.
[Match]
Kind=veth
Name=ns-*
[Network]
# Default to using a /28 prefix, giving up to 13 addresses per namespace
Address=0.0.0.0/28
LinkLocalAddressing=yes
DHCPServer=yes
IPMasquerade=both
LLDP=yes
EmitLLDP=customer-bridge
IPv6AcceptRA=no
IPv6SendRA=yes
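
One way to arrive at the figure of 13 addresses mentioned in the comment above: a /28 IPv4 prefix spans 16 addresses, of which the network address, the broadcast address and the address taken by the host side of the veth link are not available to the namespace. The following sketch spells out that arithmetic. The helper names `ipv4_prefix_size()` and `namespace_addresses()` are hypothetical, and the "subtract three" rule is an assumption about how the comment's number is derived, not something taken from systemd-networkd.

```c
#include <assert.h>
#include <stdint.h>

/* Total number of IPv4 addresses covered by a prefix of the given length */
static uint64_t ipv4_prefix_size(unsigned prefix_len) {
        assert(prefix_len <= 32);
        return (uint64_t) 1 << (32 - prefix_len);
}

/* Addresses left for the namespace side, assuming the network address, the
 * broadcast address and one host-side address are reserved. Prefixes too
 * small to hold all three reservations yield 0 in this simplified model. */
static uint64_t namespace_addresses(unsigned prefix_len) {
        uint64_t total = ipv4_prefix_size(prefix_len);

        if (total < 4)
                return 0;

        return total - 3;
}
```

Calling `namespace_addresses(28)` gives 16 - 3 = 13 under these assumptions.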


@ -11,6 +11,7 @@ if conf.get('ENABLE_NETWORKD') == 1
'80-container-ve.link',
'80-container-vz.network',
'80-container-vz.link',
'80-namespace-ns.network',
'80-vm-vt.network',
'80-vm-vt.link',
'80-wifi-adhoc.network',


@ -24,8 +24,9 @@ enable systemd-homed-activate.service
enable systemd-homed-firstboot.service
enable systemd-journald-audit.socket
enable systemd-network-generator.service
enable systemd-networkd.service
enable systemd-networkd-wait-online.service
enable systemd-networkd.service
enable systemd-nsresourced.socket
enable systemd-pstore.service
enable systemd-resolved.service
enable systemd-sysext.service


@ -0,0 +1,25 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
if conf.get('HAVE_VMLINUX_H') != 1
subdir_done()
endif
userns_restrict_bpf_o_unstripped = custom_target(
'userns-restrict.bpf.unstripped.o',
input : 'userns-restrict.bpf.c',
output : 'userns-restrict.bpf.unstripped.o',
command : bpf_o_unstripped_cmd,
depends : vmlinux_h_dependency)
userns_restrict_bpf_o = custom_target(
'userns-restrict.bpf.o',
input : userns_restrict_bpf_o_unstripped,
output : 'userns-restrict.bpf.o',
command : bpf_o_cmd)
userns_restrict_skel_h = custom_target(
'userns-restrict.skel.h',
input : userns_restrict_bpf_o,
output : 'userns-restrict.skel.h',
command : skel_h_cmd,
capture : true)


@ -0,0 +1,17 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
/* The SPDX header above is actually correct in claiming this was
* LGPL-2.1-or-later, because it is. Since the kernel doesn't consider that
* compatible with GPL we will claim this to be GPL however, which should be
* fine given that LGPL-2.1-or-later downgrades to GPL if needed.
*/
#include "bpf-dlopen.h"
/* libbpf is used via dlopen(), so rename symbols */
#define bpf_object__attach_skeleton sym_bpf_object__attach_skeleton
#define bpf_object__destroy_skeleton sym_bpf_object__destroy_skeleton
#define bpf_object__load_skeleton sym_bpf_object__load_skeleton
#define bpf_object__open_skeleton sym_bpf_object__open_skeleton
#include "bpf/userns_restrict/userns-restrict.skel.h"


@ -0,0 +1,174 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
/* The SPDX header above is actually correct in claiming this was
* LGPL-2.1-or-later, because it is. Since the kernel doesn't consider that
* compatible with GPL we will claim this to be GPL however, which should be
* fine given that LGPL-2.1-or-later downgrades to GPL if needed.
*/
/* If offsetof() is implemented via __builtin_offsetof() then it doesn't work on current compilers, since the
* built-ins do not understand CO-RE. Let's undefine any such macros here, to force bpf_helpers.h to define
* its own definitions for this. (In new versions it will do so automatically, but at least in libbpf 1.1.0
* it does not.) */
#undef offsetof
#undef container_of
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>
#include <errno.h>
void *bpf_rdonly_cast(void *, __u32) __ksym;
/* BPF module that implements an allowlist of mounts (identified by mount ID) for user namespaces (identified
* by their inode number in nsfs) that restricts creation of inodes (which would inherit the caller's UID/GID)
* or changing of ownership (similar).
*
* This hooks into the various path-based LSM entrypoints that control inode creation as well as chown(), and
* then looks up the calling process' user namespace in a global map of namespaces, which points us to
* another map that is simply a list of allowed mnt_ids. */
// FIXME: ACL adjustments are currently not blocked. There's no path-based LSM hook available in the kernel
// for setting xattrs or ACLs, hence we cannot easily block them, even though we want that. We can get away
// with ignoring this for now, as ACLs never define ownership, but purely access: i.e. ACLs never allow
// taking possession of an object, but only control access to it. Thus, things like suid access modes should
// not be reachable through it. It still sucks though that a user can persistently add an ACL entry to a file
// with their transient UIDs/GIDs.
/* kernel currently enforces a maximum user namespace nesting depth of 32, see create_user_ns() in the kernel sources */
#define USER_NAMESPACE_DEPTH_MAX 32U
struct mnt_id_map {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 1); /* placeholder, configured otherwise by nsresourced */
__type(key, int);
__type(value, int);
};
struct {
__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
__uint(max_entries, 1); /* placeholder, configured otherwise by nsresourced */
__type(key, unsigned); /* userns inode */
__array(values, struct mnt_id_map);
} userns_mnt_id_hash SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_RINGBUF);
__uint(max_entries, 4096);
} userns_ringbuf SEC(".maps");
static inline struct mount *real_mount(struct vfsmount *mnt) {
return container_of(mnt, struct mount, mnt);
}
static int validate_inode_on_mount(struct inode *inode, struct vfsmount *v) {
struct user_namespace *mount_userns, *task_userns, *p;
unsigned task_userns_inode;
struct task_struct *task;
void *mnt_id_map;
struct mount *m;
int mnt_id;
/* Get user namespace from vfsmount */
m = bpf_rdonly_cast(real_mount(v), bpf_core_type_id_kernel(struct mount));
mount_userns = m->mnt_ns->user_ns;
/* Get user namespace from task */
task = (struct task_struct*) bpf_get_current_task_btf();
task_userns = task->cred->user_ns;
/* Is the file on a mount that belongs to our own user namespace or a child of it? If so, say
* yes immediately. */
p = mount_userns;
for (unsigned i = 0; i < USER_NAMESPACE_DEPTH_MAX; i++) {
if (p == task_userns)
return 0; /* our task's user namespace (or a child thereof) owns this superblock: allow! */
p = p->parent;
if (!p)
break;
}
/* Hmm, something is fishy if there's more than 32 levels of namespaces involved. Let's better be
* safe than sorry, and refuse. */
if (p)
return -EPERM;
/* This is a mount foreign to our task's user namespace, let's consult our allow list */
task_userns_inode = task_userns->ns.inum;
mnt_id_map = bpf_map_lookup_elem(&userns_mnt_id_hash, &task_userns_inode);
if (!mnt_id_map) /* No rules installed for this userns? Then say yes, too! */
return 0;
mnt_id = m->mnt_id;
/* Otherwise, say yes if the mount ID is allowlisted */
if (bpf_map_lookup_elem(mnt_id_map, &mnt_id))
return 0;
return -EPERM;
}
static int validate_path(const struct path *path, int ret) {
struct inode *inode;
struct vfsmount *v;
if (ret != 0) /* propagate earlier error */
return ret;
inode = path->dentry->d_inode;
v = path->mnt;
return validate_inode_on_mount(inode, v);
}
SEC("lsm/path_chown")
int BPF_PROG(userns_restrict_path_chown, struct path *path, void* uid, void *gid, int ret) {
return validate_path(path, ret);
}
SEC("lsm/path_mkdir")
int BPF_PROG(userns_restrict_path_mkdir, struct path *dir, struct dentry *dentry, umode_t mode, int ret) {
return validate_path(dir, ret);
}
SEC("lsm/path_mknod")
int BPF_PROG(userns_restrict_path_mknod, const struct path *dir, struct dentry *dentry, umode_t mode, unsigned int dev, int ret) {
return validate_path(dir, ret);
}
SEC("lsm/path_symlink")
int BPF_PROG(userns_restrict_path_symlink, const struct path *dir, struct dentry *dentry, const char *old_name, int ret) {
return validate_path(dir, ret);
}
SEC("lsm/path_link")
int BPF_PROG(userns_restrict_path_link, struct dentry *old_dentry, const struct path *new_dir, struct dentry *new_dentry, int ret) {
return validate_path(new_dir, ret);
}
SEC("kprobe/free_user_ns")
void BPF_KPROBE(userns_restrict_free_user_ns, struct work_struct *work) {
struct user_namespace *userns;
unsigned inode;
void *mnt_id_map;
/* Inform userspace that a user namespace just went away. I wish there was a nicer way to hook into
* user namespaces being deleted than using kprobes, but couldn't find any. */
userns = bpf_rdonly_cast(container_of(work, struct user_namespace, work),
bpf_core_type_id_kernel(struct user_namespace));
inode = userns->ns.inum;
mnt_id_map = bpf_map_lookup_elem(&userns_mnt_id_hash, &inode);
if (!mnt_id_map) /* No rules installed for this userns? Then send no notification. */
return;
bpf_ringbuf_output(&userns_ringbuf, &inode, sizeof(inode), 0);
}
static const char _license[] SEC("license") = "GPL";


@ -0,0 +1,48 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
subdir('bpf/userns_restrict')
systemd_nsresourcework_sources = files(
'nsresourcework.c',
'userns-restrict.c',
'userns-registry.c',
)
systemd_nsresourced_sources = files(
'nsresourced-manager.c',
'nsresourced.c',
'userns-restrict.c',
'userns-registry.c',
)
userns_restrict_include = include_directories('.')
if conf.get('HAVE_VMLINUX_H') == 1
systemd_nsresourcework_sources += userns_restrict_skel_h
systemd_nsresourced_sources += userns_restrict_skel_h
executables += [
test_template + {
'sources' : files('test-userns-restrict.c', 'userns-restrict.c') + userns_restrict_skel_h,
'conditions' : ['ENABLE_NSRESOURCED', 'HAVE_VMLINUX_H'],
'include_directories' : [ includes, userns_restrict_include ],
},
]
endif
executables += [
libexec_template + {
'name' : 'systemd-nsresourcework',
'conditions' : ['ENABLE_NSRESOURCED'],
'sources' : systemd_nsresourcework_sources,
'dependencies' : threads,
'include_directories' : [ includes, userns_restrict_include ],
},
libexec_template + {
'name' : 'systemd-nsresourced',
'conditions' : ['ENABLE_NSRESOURCED'],
'sources' : systemd_nsresourced_sources,
'dependencies' : threads,
'include_directories' : [ includes, userns_restrict_include ],
},
]


@ -0,0 +1,647 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <sys/mount.h>
#include <sys/wait.h>
#include "sd-daemon.h"
#include "bpf-dlopen.h"
#include "build-path.h"
#include "common-signal.h"
#include "env-util.h"
#include "fd-util.h"
#include "fs-util.h"
#include "mkdir.h"
#include "nsresourced-manager.h"
#include "parse-util.h"
#include "process-util.h"
#include "recurse-dir.h"
#include "set.h"
#include "signal-util.h"
#include "socket-util.h"
#include "stat-util.h"
#include "stdio-util.h"
#include "strv.h"
#include "umask-util.h"
#include "unaligned.h"
#include "user-util.h"
#include "userns-registry.h"
#include "userns-restrict.h"
#define LISTEN_TIMEOUT_USEC (25 * USEC_PER_SEC)
static int start_workers(Manager *m, bool explicit_request);
static int on_worker_exit(sd_event_source *s, const siginfo_t *si, void *userdata) {
Manager *m = ASSERT_PTR(userdata);
assert(s);
assert_se(!set_remove(m->workers_dynamic, s) != !set_remove(m->workers_fixed, s));
sd_event_source_disable_unref(s);
if (si->si_code == CLD_EXITED) {
if (si->si_status == EXIT_SUCCESS)
log_debug("Worker " PID_FMT " exited successfully.", si->si_pid);
else
log_warning("Worker " PID_FMT " died with a failure exit status %i, ignoring.", si->si_pid, si->si_status);
} else if (si->si_code == CLD_KILLED)
log_warning("Worker " PID_FMT " was killed by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
else if (si->si_code == CLD_DUMPED)
log_warning("Worker " PID_FMT " dumped core by signal %s, ignoring.", si->si_pid, signal_to_string(si->si_status));
else
log_warning("Got unexpected exit code via SIGCHLD, ignoring.");
(void) start_workers(m, /* explicit_request= */ false); /* Fill up workers again if we fell below the low watermark */
return 0;
}
static int on_sigusr2(sd_event_source *s, const struct signalfd_siginfo *si, void *userdata) {
Manager *m = ASSERT_PTR(userdata);
assert(s);
(void) start_workers(m, /* explicit_request=*/ true); /* Workers told us there's more work, let's add one more worker as long as we are below the high watermark */
return 0;
}
static int on_deferred_start_worker(sd_event_source *s, uint64_t usec, void *userdata) {
Manager *m = ASSERT_PTR(userdata);
assert(s);
m->deferred_start_worker_event_source = sd_event_source_unref(m->deferred_start_worker_event_source);
(void) start_workers(m, /* explicit_request=*/ false);
return 0;
}
DEFINE_PRIVATE_HASH_OPS_WITH_KEY_DESTRUCTOR(
event_source_hash_ops,
sd_event_source,
(void (*)(const sd_event_source*, struct siphash*)) trivial_hash_func,
(int (*)(const sd_event_source*, const sd_event_source*)) trivial_compare_func,
sd_event_source_disable_unref);
int manager_new(Manager **ret) {
_cleanup_(manager_freep) Manager *m = NULL;
int r;
m = new(Manager, 1);
if (!m)
return -ENOMEM;
*m = (Manager) {
.listen_fd = -EBADF,
.worker_ratelimit = {
.interval = 2 * USEC_PER_SEC,
.burst = 250,
},
.registry_fd = -EBADF,
};
r = sd_event_new(&m->event);
if (r < 0)
return r;
r = sd_event_set_signal_exit(m->event, true);
if (r < 0)
return r;
r = sd_event_add_signal(m->event, NULL, (SIGRTMIN+18)|SD_EVENT_SIGNAL_PROCMASK, sigrtmin18_handler, NULL);
if (r < 0)
return r;
r = sd_event_add_memory_pressure(m->event, NULL, NULL, NULL);
if (r < 0)
log_debug_errno(r, "Failed to allocate memory pressure event source, ignoring: %m");
r = sd_event_set_watchdog(m->event, true);
if (r < 0)
log_debug_errno(r, "Failed to enable watchdog handling, ignoring: %m");
r = sd_event_add_signal(m->event, NULL, SIGUSR2|SD_EVENT_SIGNAL_PROCMASK, on_sigusr2, m);
if (r < 0)
return r;
*ret = TAKE_PTR(m);
return 0;
}
Manager* manager_free(Manager *m) {
if (!m)
return NULL;
set_free(m->workers_fixed);
set_free(m->workers_dynamic);
m->deferred_start_worker_event_source = sd_event_source_unref(m->deferred_start_worker_event_source);
safe_close(m->listen_fd);
#if HAVE_VMLINUX_H
sd_event_source_disable_unref(m->userns_restrict_bpf_ring_buffer_event_source);
if (m->userns_restrict_bpf_ring_buffer)
sym_ring_buffer__free(m->userns_restrict_bpf_ring_buffer);
userns_restrict_bpf_free(m->userns_restrict_bpf);
#endif
safe_close(m->registry_fd);
sd_event_unref(m->event);
return mfree(m);
}
static size_t manager_current_workers(Manager *m) {
assert(m);
return set_size(m->workers_fixed) + set_size(m->workers_dynamic);
}
static int start_one_worker(Manager *m) {
_cleanup_(sd_event_source_disable_unrefp) sd_event_source *source = NULL;
bool fixed;
pid_t pid;
int r;
assert(m);
fixed = set_size(m->workers_fixed) < NSRESOURCE_WORKERS_MIN;
r = safe_fork_full(
"(sd-worker)",
/* stdio_fds= */ NULL,
&m->listen_fd, 1,
FORK_RESET_SIGNALS|FORK_DEATHSIG_SIGTERM|FORK_REOPEN_LOG|FORK_LOG|FORK_CLOSE_ALL_FDS,
&pid);
if (r < 0)
return log_error_errno(r, "Failed to fork new worker child: %m");
if (r == 0) {
char pids[DECIMAL_STR_MAX(pid_t)];
/* Child */
if (m->listen_fd == 3) {
r = fd_cloexec(3, false);
if (r < 0) {
log_error_errno(r, "Failed to turn off O_CLOEXEC for fd 3: %m");
_exit(EXIT_FAILURE);
}
} else {
if (dup2(m->listen_fd, 3) < 0) { /* dup2() creates with O_CLOEXEC off */
log_error_errno(errno, "Failed to move listen fd to 3: %m");
_exit(EXIT_FAILURE);
}
safe_close(m->listen_fd);
}
xsprintf(pids, PID_FMT, pid);
if (setenv("LISTEN_PID", pids, 1) < 0) {
log_error_errno(errno, "Failed to set $LISTEN_PID: %m");
_exit(EXIT_FAILURE);
}
if (setenv("LISTEN_FDS", "1", 1) < 0) {
log_error_errno(errno, "Failed to set $LISTEN_FDS: %m");
_exit(EXIT_FAILURE);
}
if (setenv("NSRESOURCE_FIXED_WORKER", one_zero(fixed), 1) < 0) {
log_error_errno(errno, "Failed to set $NSRESOURCE_FIXED_WORKER: %m");
_exit(EXIT_FAILURE);
}
#if HAVE_VMLINUX_H
bool supported = m->userns_restrict_bpf;
#else
bool supported = false;
#endif
/* Tell the workers whether to enable the userns API */
if (setenv("NSRESOURCE_API", one_zero(supported), 1) < 0) {
log_error_errno(errno, "Failed to set $NSRESOURCE_API: %m");
_exit(EXIT_FAILURE);
}
r = setenv_systemd_log_level();
if (r < 0) {
log_error_errno(r, "Failed to set $SYSTEMD_LOG_LEVEL: %m");
_exit(EXIT_FAILURE);
}
r = invoke_callout_binary(SYSTEMD_NSRESOURCEWORK_PATH, STRV_MAKE("systemd-nsresourcework", "xxxxxxxxxxxxxxxx")); /* With some extra space rename_process() can make use of */
log_error_errno(r, "Failed to start worker process: %m");
_exit(EXIT_FAILURE);
}
r = sd_event_add_child(m->event, &source, pid, WEXITED, on_worker_exit, m);
if (r < 0)
return log_error_errno(r, "Failed to watch child " PID_FMT ": %m", pid);
r = set_ensure_put(
fixed ? &m->workers_fixed : &m->workers_dynamic,
&event_source_hash_ops,
source);
if (r < 0)
return log_error_errno(r, "Failed to add child process to set: %m");
TAKE_PTR(source);
return 0;
}
static int start_workers(Manager *m, bool explicit_request) {
int r;
assert(m);
for (;;) {
size_t n;
n = manager_current_workers(m);
if (n >= NSRESOURCE_WORKERS_MIN && (!explicit_request || n >= NSRESOURCE_WORKERS_MAX))
break;
if (!ratelimit_below(&m->worker_ratelimit)) {
/* If we keep starting workers too often but none sticks, let's fail the whole
* daemon, something is wrong */
if (n == 0) {
sd_event_exit(m->event, EXIT_FAILURE);
return log_error_errno(SYNTHETIC_ERRNO(EUCLEAN), "Worker threads requested too frequently, but worker count is zero, something is wrong.");
}
/* Otherwise, let's stop spawning more for a while. */
log_warning("Worker threads requested too frequently, not starting new ones for a while.");
if (!m->deferred_start_worker_event_source) {
r = sd_event_add_time(
m->event,
&m->deferred_start_worker_event_source,
CLOCK_MONOTONIC,
ratelimit_end(&m->worker_ratelimit),
/* accuracy_usec= */ 0,
on_deferred_start_worker,
m);
if (r < 0)
return log_error_errno(r, "Failed to allocate deferred start worker event source: %m");
}
break;
}
r = start_one_worker(m);
if (r < 0)
return r;
explicit_request = false;
}
return 0;
}
static void manager_release_userns_bpf(Manager *m, uint64_t inode) {
#if HAVE_VMLINUX_H
int r;
assert(m);
if (inode == 0)
return;
assert(m->userns_restrict_bpf);
r = userns_restrict_reset_by_inode(m->userns_restrict_bpf, inode);
if (r < 0)
return (void) log_warning_errno(r, "Failed to remove namespace inode from BPF map, ignoring: %m");
#endif
}
static void manager_release_userns_fds(Manager *m, uint64_t inode) {
int r;
assert(m);
assert(inode != 0);
r = sd_notifyf(/* unset_environment= */ false,
"FDSTOREREMOVE=1\n"
"FDNAME=userns-%" PRIu64 "\n", inode);
if (r < 0)
log_warning_errno(r, "Failed to send fd store removal message, ignoring: %m");
}
static void manager_release_userns_by_inode(Manager *m, uint64_t inode) {
_cleanup_(userns_info_freep) UserNamespaceInfo *userns_info = NULL;
_cleanup_close_ int lock_fd = -EBADF;
int r;
assert(m);
assert(inode != 0);
lock_fd = userns_registry_lock(m->registry_fd);
if (lock_fd < 0)
return (void) log_error_errno(lock_fd, "Failed to lock registry: %m");
r = userns_registry_load_by_userns_inode(m->registry_fd, inode, &userns_info);
if (r < 0)
log_full_errno(r == -ENOENT ? LOG_DEBUG : LOG_WARNING, r,
"Failed to find userns for inode %" PRIu64 ", ignoring: %m", inode);
if (userns_info && uid_is_valid(userns_info->start))
log_debug("Removing user namespace mapping %" PRIu64 " for UID " UID_FMT ".", inode, userns_info->start);
else
log_debug("Removing user namespace mapping %" PRIu64 ".", inode);
/* Remove the BPF rules */
manager_release_userns_bpf(m, inode);
/* Remove the resources from the fdstore */
manager_release_userns_fds(m, inode);
/* And finally remove the resources file from disk */
if (userns_info) {
/* Remove the cgroups of this userns */
r = userns_info_remove_cgroups(userns_info);
if (r < 0)
log_warning_errno(r, "Failed to remove cgroups of user namespace: %m");
r = userns_registry_remove(m->registry_fd, userns_info);
if (r < 0)
log_warning_errno(r, "Failed to remove user namespace '%s', ignoring: %m", userns_info->name);
}
}
static int manager_scan_registry(Manager *m, Set **registry_inodes) {
_cleanup_free_ DirectoryEntries *de = NULL;
int r;
assert(m);
assert(registry_inodes);
assert(m->registry_fd >= 0);
r = readdir_all(m->registry_fd, RECURSE_DIR_IGNORE_DOT, &de);
if (r < 0)
return log_error_errno(r, "Failed to enumerate registry: %m");
for (size_t i = 0; i < de->n_entries; i++) {
struct dirent *dentry = de->entries[i];
_cleanup_free_ char *u = NULL;
const char *e, *p;
uint64_t inode;
p = startswith(dentry->d_name, "i");
if (!p)
continue;
e = endswith(p, ".userns");
if (!e)
continue;
u = strndup(p, e - p);
if (!u)
return log_oom();
r = safe_atou64(u, &inode);
if (r < 0) {
log_warning_errno(r, "Failed to parse userns inode number from '%s', skipping: %m", dentry->d_name);
continue;
}
if (inode > UINT32_MAX) { /* namespace inode numbers are 32bit only right now */
log_warning("userns inode number outside of 32bit range, skipping.");
continue;
}
if (set_ensure_put(registry_inodes, NULL, UINT32_TO_PTR(inode)) < 0)
return log_oom();
log_debug("Found user namespace %" PRIu64 " in registry directory", inode);
}
return 0;
}
static int manager_make_listen_socket(Manager *m) {
static const union sockaddr_union sockaddr = {
.un.sun_family = AF_UNIX,
.un.sun_path = "/run/systemd/io.systemd.NamespaceResource",
};
int r;
assert(m);
if (m->listen_fd >= 0)
return 0;
m->listen_fd = socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0);
if (m->listen_fd < 0)
return log_error_errno(errno, "Failed to allocate socket: %m");
(void) sockaddr_un_unlink(&sockaddr.un);
WITH_UMASK(0000)
if (bind(m->listen_fd, &sockaddr.sa, SOCKADDR_UN_LEN(sockaddr.un)) < 0)
return log_error_errno(errno, "Failed to bind socket: %m");
r = mkdir_p("/run/systemd/userdb", 0755);
if (r < 0)
return log_error_errno(r, "Failed to create /run/systemd/userdb: %m");
r = symlink_idempotent("../io.systemd.NamespaceResource", "/run/systemd/userdb/io.systemd.NamespaceResource", /* make_relative= */ false);
if (r < 0)
return log_error_errno(r, "Failed to symlink userdb socket: %m");
if (listen(m->listen_fd, SOMAXCONN) < 0)
return log_error_errno(errno, "Failed to listen on socket: %m");
return 1;
}
static int manager_scan_listen_fds(Manager *m, Set **fdstore_inodes) {
_cleanup_strv_free_ char **names = NULL;
int n, r;
assert(m);
assert(fdstore_inodes);
n = sd_listen_fds_with_names(/* unset_environment= */ true, &names);
if (n < 0)
return log_error_errno(n, "Failed to determine number of passed file descriptors: %m");
for (int i = 0; i < n; i++) {
_cleanup_close_ int fd = SD_LISTEN_FDS_START + i; /* Take possession */
const char *e;
/* If this is a BPF allowlist related fd, just close it, but remember which userns inode this covers */
e = startswith(names[i], "userns-");
if (e) {
uint64_t inode;
r = safe_atou64(e, &inode);
if (r < 0) {
log_warning_errno(r, "Failed to parse userns inode number from fd name '%s', ignoring: %m", e);
continue;
}
if (inode > UINT32_MAX) {
log_warning("Inode number outside of 32bit range, ignoring");
continue;
}
if (set_ensure_put(fdstore_inodes, NULL, UINT32_TO_PTR(inode)) < 0)
return log_oom();
continue;
}
/* We don't check the name for the stream socket, for compatibility with older versions */
r = sd_is_socket(fd, AF_UNIX, SOCK_STREAM, 1);
if (r < 0)
return log_error_errno(r, "Failed to detect if passed file descriptor is a socket: %m");
if (r > 0) {
if (m->listen_fd >= 0)
return log_error_errno(SYNTHETIC_ERRNO(ENOTUNIQ), "Passed more than one AF_UNIX/SOCK_STREAM socket, refusing.");
m->listen_fd = TAKE_FD(fd);
continue;
}
log_warning("Closing passed file descriptor %i (%s) we don't recognize.", fd, names[i]);
}
return 0;
}
#if HAVE_VMLINUX_H
static int ringbuf_event(void *userdata, void *data, size_t size) {
Manager *m = ASSERT_PTR(userdata);
size_t n;
if ((size % sizeof(unsigned int)) != 0) /* Not multiples of "unsigned int"? */
return -EIO;
n = size / sizeof(unsigned int);
for (size_t i = 0; i < n; i++) {
const void *d;
uint64_t inode;
d = (const uint8_t*) data + i * sizeof(unsigned int);
inode = unaligned_read_ne32(d);
log_debug("Got BPF ring buffer notification that user namespace %" PRIu64 " is now dead.", inode);
manager_release_userns_by_inode(m, inode);
}
return 0;
}
static int on_ringbuf_io(sd_event_source *s, int fd, uint32_t revents, void *userdata) {
Manager *m = ASSERT_PTR(userdata);
int r;
r = sym_ring_buffer__poll(m->userns_restrict_bpf_ring_buffer, 0);
if (r < 0)
return log_error_errno(r, "Got failure reading from BPF ring buffer: %m");
return 0;
}
static int manager_setup_bpf(Manager *m) {
int rb_fd = -EBADF, poll_fd = -EBADF, r;
assert(m);
assert(!m->userns_restrict_bpf);
assert(!m->userns_restrict_bpf_ring_buffer);
assert(!m->userns_restrict_bpf_ring_buffer_event_source);
r = userns_restrict_install(/* pin= */ true, &m->userns_restrict_bpf);
if (r < 0) {
log_notice_errno(r, "Proceeding with user namespace interfaces disabled.");
return 0;
}
rb_fd = sym_bpf_map__fd(m->userns_restrict_bpf->maps.userns_ringbuf);
if (rb_fd < 0)
return log_error_errno(rb_fd, "Failed to get fd of ring buffer: %m");
m->userns_restrict_bpf_ring_buffer = sym_ring_buffer__new(rb_fd, ringbuf_event, m, NULL);
if (!m->userns_restrict_bpf_ring_buffer)
return log_error_errno(errno, "Failed to allocate BPF ring buffer object: %m");
poll_fd = sym_ring_buffer__epoll_fd(m->userns_restrict_bpf_ring_buffer);
if (poll_fd < 0)
return log_error_errno(poll_fd, "Failed to get poll fd of ring buffer: %m");
r = sd_event_add_io(
m->event,
&m->userns_restrict_bpf_ring_buffer_event_source,
poll_fd,
EPOLLIN,
on_ringbuf_io,
m);
if (r < 0)
return log_error_errno(r, "Failed to allocate event source for BPF ring buffer: %m");
return 0;
}
#else
static int manager_setup_bpf(Manager *m) {
log_notice("Not setting up BPF subsystem, as functionality has been disabled at compile time.");
return 0;
}
#endif
int manager_startup(Manager *m) {
_cleanup_(set_freep) Set *fdstore_inodes = NULL, *registry_inodes = NULL;
void *p;
int r;
assert(m);
assert(m->registry_fd < 0);
assert(m->listen_fd < 0);
m->registry_fd = userns_registry_open_fd();
if (m->registry_fd < 0)
return log_error_errno(m->registry_fd, "Failed to open registry directory: %m");
r = manager_setup_bpf(m);
if (r < 0)
return r;
r = manager_scan_listen_fds(m, &fdstore_inodes);
if (r < 0)
return r;
r = manager_scan_registry(m, &registry_inodes);
if (r < 0)
return r;
/* If there are resources tied to user namespace inodes not found in the registry, then release them */
SET_FOREACH(p, fdstore_inodes) {
uint64_t inode;
if (set_contains(registry_inodes, p))
continue;
inode = PTR_TO_UINT32(p);
log_debug("Found stale fd store entry for user namespace %" PRIu64 ", removing.", inode);
manager_release_userns_by_inode(m, inode);
}
r = manager_make_listen_socket(m);
if (r < 0)
return r;
/* Let's make sure every accept() call on this socket times out after 25s. This allows workers to be
* GC'ed on idle */
if (setsockopt(m->listen_fd, SOL_SOCKET, SO_RCVTIMEO, TIMEVAL_STORE(LISTEN_TIMEOUT_USEC), sizeof(struct timeval)) < 0)
return log_error_errno(errno, "Failed to set SO_RCVTIMEO: %m");
r = start_workers(m, /* explicit_request= */ false);
if (r < 0)
return r;
return 0;
}
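
The listen-fd scan above recognizes fd store entries by a "userns-" name prefix, parses the remainder as a user namespace inode number, and ignores anything outside the 32-bit range. A minimal standalone sketch of that parsing step follows. It is not code from this commit: `parse_userns_fd_name()` is a hypothetical helper, and it uses plain `strtoull()` where the daemon uses systemd's `safe_atou64()`, so the two may differ in which edge cases they accept.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the commit): extracts the user namespace
 * inode number from an fd store name of the form "userns-<decimal>".
 * Returns 0 on success, -EINVAL if the name does not match the pattern,
 * -ERANGE if the number does not fit into 32 bits. */
int parse_userns_fd_name(const char *name, uint32_t *ret) {
        static const char prefix[] = "userns-";
        unsigned long long v;
        char *end = NULL;
        const char *e;

        assert(name);
        assert(ret);

        if (strncmp(name, prefix, sizeof(prefix) - 1) != 0)
                return -EINVAL;

        e = name + sizeof(prefix) - 1;
        if (*e < '0' || *e > '9') /* strtoull() would accept whitespace and signs, refuse them here */
                return -EINVAL;

        errno = 0;
        v = strtoull(e, &end, 10);
        if (errno != 0)
                return -errno;
        if (*end != '\0') /* trailing garbage */
                return -EINVAL;
        if (v > UINT32_MAX) /* same 32-bit limit the daemon enforces */
                return -ERANGE;

        *ret = (uint32_t) v;
        return 0;
}
```

A caller would treat any negative return as "ignore this fd and continue", which matches how the scan above logs a warning and moves on instead of failing startup.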

View file

@ -0,0 +1,40 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "sd-bus.h"
#include "sd-event.h"
typedef struct Manager Manager;
#include "hashmap.h"
#include "ratelimit.h"
#define NSRESOURCE_WORKERS_MIN 5
#define NSRESOURCE_WORKERS_MAX 4096
struct Manager {
sd_event *event;
Set *workers_fixed; /* Workers 0…NSRESOURCE_WORKERS_MIN */
Set *workers_dynamic; /* Workers NSRESOURCE_WORKERS_MIN+1…NSRESOURCE_WORKERS_MAX */
int listen_fd;
RateLimit worker_ratelimit;
sd_event_source *deferred_start_worker_event_source;
#if HAVE_VMLINUX_H
struct userns_restrict_bpf *userns_restrict_bpf;
struct ring_buffer *userns_restrict_bpf_ring_buffer;
sd_event_source *userns_restrict_bpf_ring_buffer_event_source;
#endif
int registry_fd;
};
int manager_new(Manager **ret);
Manager* manager_free(Manager *m);
DEFINE_TRIVIAL_CLEANUP_FUNC(Manager*, manager_free);
int manager_startup(Manager *m);

View file

@ -0,0 +1,46 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <sys/stat.h>
#include <sys/types.h>
#include "daemon-util.h"
#include "nsresourced-manager.h"
#include "log.h"
#include "main-func.h"
#include "signal-util.h"
static int run(int argc, char *argv[]) {
_cleanup_(manager_freep) Manager *m = NULL;
int r;
log_setup();
umask(0022);
if (argc != 1)
return log_error_errno(SYNTHETIC_ERRNO(EINVAL), "This program takes no arguments.");
if (setenv("SYSTEMD_BYPASS_USERDB", "io.systemd.NamespaceResource", 1) < 0)
return log_error_errno(errno, "Failed to set $SYSTEMD_BYPASS_USERDB: %m");
assert_se(sigprocmask_many(SIG_BLOCK, NULL, SIGCHLD) >= 0);
r = manager_new(&m);
if (r < 0)
return log_error_errno(r, "Could not create manager: %m");
r = manager_startup(m);
if (r < 0)
return log_error_errno(r, "Failed to start up daemon: %m");
_unused_ _cleanup_(notify_on_cleanup) const char *notify_stop = NULL;
notify_stop = notify_start(NOTIFY_READY, NOTIFY_STOPPING);
r = sd_event_loop(m->event);
if (r < 0)
return log_error_errno(r, "Event loop failed: %m");
return 0;
}
DEFINE_MAIN_FUNCTION(run);

File diff suppressed because it is too large

View file

@ -0,0 +1,182 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include <sys/eventfd.h>
#include "fd-util.h"
#include "main-func.h"
#include "missing_mount.h"
#include "missing_syscall.h"
#include "namespace-util.h"
#include "process-util.h"
#include "rm-rf.h"
#include "tmpfile-util.h"
#include "userns-restrict.h"
static int make_tmpfs_fsmount(void) {
_cleanup_close_ int fsfd = -EBADF, mntfd = -EBADF;
fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
assert_se(fsfd >= 0);
assert_se(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) >= 0);
mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
assert_se(mntfd >= 0);
return TAKE_FD(mntfd);
}
static void test_works_reg(int parent_fd, const char *fname) {
_cleanup_close_ int fd = -EBADF;
fd = openat(parent_fd, fname, O_RDWR|O_CREAT|O_CLOEXEC, 0666);
assert_se(fd >= 0);
}
static void test_fails_reg(int parent_fd, const char *fname) {
errno = 0;
assert_se(openat(parent_fd, fname, O_RDWR|O_CREAT|O_CLOEXEC, 0666) < 0);
assert_se(errno == EPERM);
}
static void test_works_dir(int parent_fd, const char *fname) {
assert_se(mkdirat(parent_fd, fname, 0666) >= 0);
}
static void test_fails_dir(int parent_fd, const char *fname) {
errno = 0;
assert_se(mkdirat(parent_fd, fname, 0666) < 0);
assert_se(errno == EPERM);
}
static int run(int argc, char *argv[]) {
_cleanup_(userns_restrict_bpf_freep) struct userns_restrict_bpf *obj = NULL;
_cleanup_close_ int userns_fd = -EBADF, host_fd1 = -EBADF, host_tmpfs = -EBADF, afd = -EBADF, bfd = -EBADF;
_cleanup_(rm_rf_physical_and_freep) char *t = NULL;
_cleanup_(sigkill_waitp) pid_t pid = 0;
int r;
log_set_max_level(LOG_DEBUG);
log_open();
r = userns_restrict_install(/* pin= */ false, &obj);
if (ERRNO_IS_NOT_SUPPORTED(r)) {
log_notice("Skipping test, LSM-BPF logic not supported.");
return EXIT_TEST_SKIP;
}
if (ERRNO_IS_PRIVILEGE(r)) {
log_notice("Skipping test, lacking privileges.");
return EXIT_TEST_SKIP;
}
if (r < 0)
return r;
assert_se(mkdtemp_malloc(NULL, &t) >= 0);
host_fd1 = open(t, O_DIRECTORY|O_CLOEXEC);
assert_se(host_fd1 >= 0);
host_tmpfs = make_tmpfs_fsmount();
assert_se(host_tmpfs >= 0);
userns_fd = userns_acquire("0 0 1", "0 0 1");
if (userns_fd < 0)
return log_error_errno(userns_fd, "Failed to make user namespace: %m");
r = userns_restrict_put_by_fd(
obj,
userns_fd,
/* replace= */ true,
/* mount_fds= */ NULL,
/* n_mount_fds= */ 0);
if (r < 0)
return log_error_errno(r, "Failed to restrict user namespace: %m");
afd = eventfd(0, EFD_CLOEXEC);
bfd = eventfd(0, EFD_CLOEXEC);
assert_se(afd >= 0 && bfd >= 0);
r = safe_fork("(test)", FORK_DEATHSIG_SIGKILL, &pid);
assert_se(r >= 0);
if (r == 0) {
_cleanup_close_ int private_tmpfs = -EBADF;
assert_se(setns(userns_fd, CLONE_NEWUSER) >= 0);
assert_se(unshare(CLONE_NEWNS) >= 0);
/* Allocate tmpfs locally */
private_tmpfs = make_tmpfs_fsmount();
/* These two host mounts should be inaccessible */
test_fails_reg(host_fd1, "test");
test_fails_reg(host_tmpfs, "xxx");
test_fails_dir(host_fd1, "test2");
test_fails_dir(host_tmpfs, "xxx2");
/* But this mount created locally should be fine */
test_works_reg(private_tmpfs, "yyy");
test_works_dir(private_tmpfs, "yyy2");
/* Let's sync with the parent, so that it allowlists more stuff for us */
assert_se(eventfd_write(afd, 1) >= 0);
uint64_t x;
assert_se(eventfd_read(bfd, &x) >= 0);
/* And now we should also have access to the host tmpfs */
test_works_reg(host_tmpfs, "zzz");
test_works_reg(private_tmpfs, "aaa");
test_works_dir(host_tmpfs, "zzz2");
test_works_dir(private_tmpfs, "aaa2");
/* But this one should still fail */
test_fails_reg(host_fd1, "bbb");
test_fails_dir(host_fd1, "bbb2");
/* Sync again, to get more stuff allowlisted */
assert_se(eventfd_write(afd, 1) >= 0);
assert_se(eventfd_read(bfd, &x) >= 0);
/* Everything should now be allowed */
test_works_reg(host_tmpfs, "ccc");
test_works_reg(host_fd1, "ddd");
test_works_reg(private_tmpfs, "eee");
test_works_dir(host_tmpfs, "ccc2");
test_works_dir(host_fd1, "ddd2");
test_works_dir(private_tmpfs, "eee2");
_exit(EXIT_SUCCESS);
}
uint64_t x;
assert_se(eventfd_read(afd, &x) >= 0);
r = userns_restrict_put_by_fd(
obj,
userns_fd,
/* replace= */ false,
&host_tmpfs,
1);
if (r < 0)
return log_error_errno(r, "Failed to loosen user namespace: %m");
assert_se(eventfd_write(bfd, 1) >= 0);
assert_se(eventfd_read(afd, &x) >= 0);
r = userns_restrict_put_by_fd(
obj,
userns_fd,
/* replace= */ false,
&host_fd1,
1);
if (r < 0)
return log_error_errno(r, "Failed to loosen user namespace: %m");
assert_se(eventfd_write(bfd, 1) >= 0);
assert_se(wait_for_terminate_and_check("(test)", pid, WAIT_LOG) >= 0);
return 0;
}
DEFINE_MAIN_FUNCTION(run);
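
The test above keeps parent and child in lockstep with two eventfds: the child writes to one to announce a checkpoint and blocks on the other until the parent has adjusted the allowlist. The sketch below isolates just that handshake, with no BPF involved. `eventfd_handshake_demo()` and its `n_checkpoints` parameter are hypothetical names chosen for illustration. It assumes Linux with glibc's `eventfd_read()`/`eventfd_write()` wrappers.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo (not part of the commit): forks a child that announces
 * n_checkpoints checkpoints on "afd" and waits on "bfd" after each one.
 * Returns the number of checkpoints the parent acknowledged, or -1 on failure. */
int eventfd_handshake_demo(int n_checkpoints) {
        int afd = -1, bfd = -1, status = 0, acked = 0, ret = -1;
        uint64_t x;
        pid_t pid;

        assert(n_checkpoints >= 0);

        afd = eventfd(0, EFD_CLOEXEC);
        bfd = eventfd(0, EFD_CLOEXEC);
        if (afd < 0 || bfd < 0)
                goto finish;

        pid = fork();
        if (pid < 0)
                goto finish;
        if (pid == 0) {
                /* Child: signal "I reached the checkpoint", then block until the parent says go */
                for (int i = 0; i < n_checkpoints; i++)
                        if (eventfd_write(afd, 1) < 0 || eventfd_read(bfd, &x) < 0)
                                _exit(EXIT_FAILURE);
                _exit(EXIT_SUCCESS);
        }

        /* Parent: wait for each checkpoint, do the work (omitted here), then release the child */
        for (int i = 0; i < n_checkpoints; i++) {
                if (eventfd_read(afd, &x) < 0 || eventfd_write(bfd, 1) < 0) {
                        (void) kill(pid, SIGKILL); /* don't leave the child blocked forever */
                        break;
                }
                acked++;
        }

        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS)
                ret = acked;

finish:
        if (afd >= 0)
                (void) close(afd);
        if (bfd >= 0)
                (void) close(bfd);
        return ret;
}
```

Two separate eventfds are used, one per direction, so neither side can consume its own wakeup. With a single shared counter the writer could read back the value it had just written.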

View file

@ -0,0 +1,646 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include "chase.h"
#include "fd-util.h"
#include "fileio.h"
#include "format-util.h"
#include "fs-util.h"
#include "json.h"
#include "missing_magic.h"
#include "path-util.h"
#include "recurse-dir.h"
#include "rm-rf.h"
#include "user-util.h"
#include "userns-registry.h"
int userns_registry_open_fd(void) {
int fd;
fd = chase_and_open(
"/run/systemd/nsresource/registry",
/* root= */ NULL,
CHASE_MKDIR_0755,
O_CLOEXEC|O_DIRECTORY|O_CREAT,
/* ret_path= */ NULL);
if (fd < 0)
return log_debug_errno(fd, "Failed to open registry dir: %m");
return fd;
}
int userns_registry_lock(int dir_fd) {
_cleanup_close_ int registry_fd = -EBADF, lock_fd = -EBADF;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
lock_fd = xopenat_lock_full(dir_fd, "lock", O_CREAT|O_RDWR|O_CLOEXEC, /* xopen_flags= */ 0, 0600, LOCK_BSD, LOCK_EX);
if (lock_fd < 0)
return log_debug_errno(lock_fd, "Failed to open nsresource registry lock file: %m");
return TAKE_FD(lock_fd);
}
UserNamespaceInfo* userns_info_new(void) {
UserNamespaceInfo *info = new(UserNamespaceInfo, 1);
if (!info)
return NULL;
*info = (UserNamespaceInfo) {
.owner = UID_INVALID,
.start = UID_INVALID,
.target = UID_INVALID,
};
return info;
}
UserNamespaceInfo *userns_info_free(UserNamespaceInfo *userns) {
if (!userns)
return NULL;
free(userns->cgroups);
free(userns->name);
return mfree(userns);
}
static int dispatch_cgroups_array(const char *name, JsonVariant *variant, JsonDispatchFlags flags, void *userdata) {
UserNamespaceInfo *info = ASSERT_PTR(userdata);
_cleanup_free_ uint64_t *cgroups = NULL;
size_t n_cgroups = 0;
if (json_variant_is_null(variant)) {
info->cgroups = mfree(info->cgroups);
info->n_cgroups = 0;
return 0;
}
if (!json_variant_is_array(variant))
return json_log(variant, flags, SYNTHETIC_ERRNO(EINVAL), "JSON field '%s' is not an array.", strna(name));
cgroups = new(uint64_t, json_variant_elements(variant));
if (!cgroups)
return json_log_oom(variant, flags);
JsonVariant *e;
JSON_VARIANT_ARRAY_FOREACH(e, variant) {
bool found = false;
if (!json_variant_is_unsigned(e))
return json_log(e, flags, SYNTHETIC_ERRNO(EINVAL), "JSON array element is not a number.");
FOREACH_ARRAY(cg, cgroups, n_cgroups)
if (*cg == json_variant_unsigned(e)) {
found = true;
break;
}
if (found) /* suppress duplicate */
continue;
cgroups[n_cgroups++] = json_variant_unsigned(e);
}
assert(n_cgroups <= json_variant_elements(variant));
free_and_replace(info->cgroups, cgroups);
info->n_cgroups = n_cgroups;
return 0;
}
static int userns_registry_load(int dir_fd, const char *fn, UserNamespaceInfo **ret) {
static const JsonDispatch dispatch_table[] = {
{ "owner", JSON_VARIANT_UNSIGNED, json_dispatch_uid_gid, offsetof(UserNamespaceInfo, owner), JSON_MANDATORY },
{ "name", JSON_VARIANT_STRING, json_dispatch_string, offsetof(UserNamespaceInfo, name), JSON_MANDATORY },
{ "userns", JSON_VARIANT_UNSIGNED, json_dispatch_uint64, offsetof(UserNamespaceInfo, userns_inode), JSON_MANDATORY },
{ "start", JSON_VARIANT_UNSIGNED, json_dispatch_uid_gid, offsetof(UserNamespaceInfo, start), 0 },
{ "size", JSON_VARIANT_UNSIGNED, json_dispatch_uint32, offsetof(UserNamespaceInfo, size), 0 },
{ "target", JSON_VARIANT_UNSIGNED, json_dispatch_uid_gid, offsetof(UserNamespaceInfo, target), 0 },
{ "cgroups", JSON_VARIANT_ARRAY, dispatch_cgroups_array, 0, 0 },
{}
};
_cleanup_(userns_info_freep) UserNamespaceInfo *userns_info = NULL;
_cleanup_(json_variant_unrefp) JsonVariant *v = NULL;
_cleanup_close_ int registry_fd = -EBADF;
int r;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
r = json_parse_file_at(NULL, dir_fd, fn, 0, &v, NULL, NULL);
if (r < 0)
return r;
userns_info = userns_info_new();
if (!userns_info)
return -ENOMEM;
r = json_dispatch(v, dispatch_table, 0, userns_info);
if (r < 0)
return r;
if (userns_info->userns_inode == 0)
return -EBADMSG;
if (userns_info->start == 0)
return -EBADMSG;
if (userns_info->size == 0) {
if (uid_is_valid(userns_info->start) || uid_is_valid(userns_info->target))
return -EBADMSG;
} else {
if (!uid_is_valid(userns_info->start) || !uid_is_valid(userns_info->target))
return -EBADMSG;
if (userns_info->size > UINT32_MAX - userns_info->start ||
userns_info->size > UINT32_MAX - userns_info->target)
return -EBADMSG;
}
if (ret)
*ret = TAKE_PTR(userns_info);
return 0;
}
int userns_registry_uid_exists(int dir_fd, uid_t start) {
_cleanup_free_ char *fn = NULL;
assert(dir_fd >= 0);
if (!uid_is_valid(start))
return -ENOENT;
if (start == 0)
return true;
if (asprintf(&fn, "u" UID_FMT ".userns", start) < 0)
return -ENOMEM;
if (faccessat(dir_fd, fn, F_OK, AT_SYMLINK_NOFOLLOW) < 0)
return errno == ENOENT ? false : -errno;
return true;
}
int userns_registry_name_exists(int dir_fd, const char *name) {
_cleanup_free_ char *fn = NULL;
assert(dir_fd >= 0);
if (!userns_name_is_valid(name))
return -EINVAL;
fn = strjoin("n", name, ".userns");
if (!fn)
return -ENOMEM;
if (faccessat(dir_fd, fn, F_OK, AT_SYMLINK_NOFOLLOW) < 0)
return errno == ENOENT ? false : -errno;
return true;
}
int userns_registry_inode_exists(int dir_fd, uint64_t inode) {
_cleanup_free_ char *fn = NULL;
assert(dir_fd >= 0);
if (inode == 0)
return -EINVAL;
if (asprintf(&fn, "i%" PRIu64 ".userns", inode) < 0)
return -ENOMEM;
if (faccessat(dir_fd, fn, F_OK, AT_SYMLINK_NOFOLLOW) < 0)
return errno == ENOENT ? false : -errno;
return true;
}
int userns_registry_load_by_start_uid(int dir_fd, uid_t start, UserNamespaceInfo **ret) {
_cleanup_(userns_info_freep) UserNamespaceInfo *userns_info = NULL;
_cleanup_close_ int registry_fd = -EBADF;
_cleanup_free_ char *fn = NULL;
int r;
if (!uid_is_valid(start))
return -ENOENT;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
if (asprintf(&fn, "u" UID_FMT ".userns", start) < 0)
return -ENOMEM;
r = userns_registry_load(dir_fd, fn, &userns_info);
if (r < 0)
return r;
if (userns_info->start != start)
return -EBADMSG;
if (ret)
*ret = TAKE_PTR(userns_info);
return 0;
}
int userns_registry_load_by_userns_inode(int dir_fd, uint64_t inode, UserNamespaceInfo **ret) {
_cleanup_(userns_info_freep) UserNamespaceInfo *userns_info = NULL;
_cleanup_close_ int registry_fd = -EBADF;
_cleanup_free_ char *fn = NULL;
int r;
if (inode == 0)
return -ENOENT;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
if (asprintf(&fn, "i%" PRIu64 ".userns", inode) < 0)
return -ENOMEM;
r = userns_registry_load(dir_fd, fn, &userns_info);
if (r < 0)
return r;
if (userns_info->userns_inode != inode)
return -EBADMSG;
if (ret)
*ret = TAKE_PTR(userns_info);
return 0;
}
int userns_registry_load_by_name(int dir_fd, const char *name, UserNamespaceInfo **ret) {
_cleanup_(userns_info_freep) UserNamespaceInfo *userns_info = NULL;
_cleanup_close_ int registry_fd = -EBADF;
_cleanup_free_ char *fn = NULL;
int r;
assert(name);
if (!userns_name_is_valid(name)) /* Invalid names never exist */
return -ENOENT;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
fn = strjoin("n", name, ".userns");
if (!fn)
return -ENOMEM;
r = userns_registry_load(dir_fd, fn, &userns_info);
if (r < 0)
return r;
if (!streq_ptr(userns_info->name, name))
return -EBADMSG;
if (ret)
*ret = TAKE_PTR(userns_info);
return 0;
}
int userns_registry_store(int dir_fd, UserNamespaceInfo *info) {
_cleanup_close_ int registry_fd = -EBADF;
int r;
assert(info);
if (!uid_is_valid(info->owner) ||
!info->name ||
info->userns_inode == 0)
return -EINVAL;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
_cleanup_(json_variant_unrefp) JsonVariant *cgroup_array = NULL;
FOREACH_ARRAY(cg, info->cgroups, info->n_cgroups) {
r = json_variant_append_arrayb(
&cgroup_array,
JSON_BUILD_UNSIGNED(*cg));
if (r < 0)
return r;
}
_cleanup_(json_variant_unrefp) JsonVariant *def = NULL;
r = json_build(&def, JSON_BUILD_OBJECT(
JSON_BUILD_PAIR("owner", JSON_BUILD_UNSIGNED(info->owner)),
JSON_BUILD_PAIR("name", JSON_BUILD_STRING(info->name)),
JSON_BUILD_PAIR("userns", JSON_BUILD_UNSIGNED(info->userns_inode)),
JSON_BUILD_PAIR_CONDITION(uid_is_valid(info->start), "start", JSON_BUILD_UNSIGNED(info->start)),
JSON_BUILD_PAIR_CONDITION(uid_is_valid(info->start), "size", JSON_BUILD_UNSIGNED(info->size)),
JSON_BUILD_PAIR_CONDITION(uid_is_valid(info->start), "target", JSON_BUILD_UNSIGNED(info->target)),
JSON_BUILD_PAIR_CONDITION(cgroup_array, "cgroups", JSON_BUILD_VARIANT(cgroup_array))));
if (r < 0)
return r;
_cleanup_free_ char *def_buf = NULL;
r = json_variant_format(def, 0, &def_buf);
if (r < 0)
return log_debug_errno(r, "Failed to format userns JSON object: %m");
_cleanup_free_ char *reg_fn = NULL, *link1_fn = NULL, *link2_fn = NULL, *owner_fn = NULL, *uid_fn = NULL;
if (asprintf(&reg_fn, "i%" PRIu64 ".userns", info->userns_inode) < 0)
return log_oom_debug();
r = write_string_file_at(dir_fd, reg_fn, def_buf, WRITE_STRING_FILE_CREATE|WRITE_STRING_FILE_ATOMIC);
if (r < 0)
return log_debug_errno(r, "Failed to write userns data to '%s' in registry: %m", reg_fn);
link1_fn = strjoin("n", info->name, ".userns");
if (!link1_fn) {
r = log_oom_debug();
goto fail;
}
r = linkat_replace(dir_fd, reg_fn, dir_fd, link1_fn);
if (r < 0) {
log_debug_errno(r, "Failed to link userns data to '%s' in registry: %m", link1_fn);
goto fail;
}
if (uid_is_valid(info->start)) {
if (asprintf(&link2_fn, "u" UID_FMT ".userns", info->start) < 0) {
r = log_oom_debug();
goto fail;
}
r = linkat_replace(dir_fd, reg_fn, dir_fd, link2_fn);
if (r < 0) {
log_debug_errno(r, "Failed to link userns data to '%s' in registry: %m", link2_fn);
goto fail;
}
}
if (asprintf(&uid_fn, "o" UID_FMT ".owns", info->owner) < 0) {
r = log_oom_debug();
goto fail;
}
if (mkdirat(dir_fd, uid_fn, 0755) < 0 && errno != EEXIST) {
r = log_debug_errno(errno, "Failed to create per-UID subdir '%s' of registry: %m", uid_fn);
goto fail;
}
if (asprintf(&owner_fn, "%s/i%" PRIu64 ".userns", uid_fn, info->userns_inode) < 0) {
r = log_oom_debug();
goto fail;
}
r = linkat_replace(dir_fd, reg_fn, dir_fd, owner_fn);
if (r < 0) {
log_debug_errno(r, "Failed to link userns data to '%s' in registry: %m", owner_fn);
goto fail;
}
return 0;
fail:
if (reg_fn)
(void) unlinkat(dir_fd, reg_fn, /* flags= */ 0);
if (link1_fn)
(void) unlinkat(dir_fd, link1_fn, /* flags= */ 0);
if (link2_fn)
(void) unlinkat(dir_fd, link2_fn, /* flags= */ 0);
if (owner_fn)
(void) unlinkat(dir_fd, owner_fn, /* flags= */ 0);
if (uid_fn)
(void) unlinkat(dir_fd, uid_fn, AT_REMOVEDIR);
return r;
}
int userns_registry_remove(int dir_fd, UserNamespaceInfo *info) {
_cleanup_close_ int registry_fd = -EBADF;
int ret = 0, r;
assert(info);
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
_cleanup_free_ char *reg_fn = NULL;
if (asprintf(&reg_fn, "i%" PRIu64 ".userns", info->userns_inode) < 0)
return log_oom_debug();
ret = RET_NERRNO(unlinkat(dir_fd, reg_fn, 0));
_cleanup_free_ char *link1_fn = NULL;
link1_fn = strjoin("n", info->name, ".userns");
if (!link1_fn)
return log_oom_debug();
RET_GATHER(ret, RET_NERRNO(unlinkat(dir_fd, link1_fn, 0)));
if (uid_is_valid(info->start)) {
_cleanup_free_ char *link2_fn = NULL;
if (asprintf(&link2_fn, "u" UID_FMT ".userns", info->start) < 0)
return log_oom_debug();
RET_GATHER(ret, RET_NERRNO(unlinkat(dir_fd, link2_fn, 0)));
}
_cleanup_free_ char *uid_fn = NULL;
if (asprintf(&uid_fn, "o" UID_FMT ".owns", info->owner) < 0)
return log_oom_debug();
_cleanup_free_ char *owner_fn = NULL;
if (asprintf(&owner_fn, "%s/i%" PRIu64 ".userns", uid_fn, info->userns_inode) < 0)
return log_oom_debug();
RET_GATHER(ret, RET_NERRNO(unlinkat(dir_fd, owner_fn, 0)));
r = RET_NERRNO(unlinkat(dir_fd, uid_fn, AT_REMOVEDIR));
if (r != -ENOTEMPTY)
RET_GATHER(ret, r);
return ret;
}
bool userns_info_has_cgroup(UserNamespaceInfo *userns, uint64_t cgroup_id) {
assert(userns);
FOREACH_ARRAY(i, userns->cgroups, userns->n_cgroups)
if (*i == cgroup_id)
return true;
return false;
}
int userns_info_add_cgroup(UserNamespaceInfo *userns, uint64_t cgroup_id) {
if (userns_info_has_cgroup(userns, cgroup_id))
return 0;
if (!GREEDY_REALLOC(userns->cgroups, userns->n_cgroups+1))
return -ENOMEM;
userns->cgroups[userns->n_cgroups++] = cgroup_id;
return 1;
}
static int userns_destroy_cgroup(uint64_t cgroup_id) {
_cleanup_close_ int cgroup_fd = -EBADF, parent_fd = -EBADF;
int r;
cgroup_fd = cg_cgroupid_open(/* cgroupfsfd= */ -EBADF, cgroup_id);
if (cgroup_fd == -ESTALE) {
log_debug_errno(cgroup_fd, "Control group %" PRIu64 " already gone, ignoring: %m", cgroup_id);
return 0;
}
if (cgroup_fd < 0)
return log_debug_errno(cgroup_fd, "Failed to open cgroup %" PRIu64 ", ignoring: %m", cgroup_id);
_cleanup_free_ char *path = NULL;
r = fd_get_path(cgroup_fd, &path);
if (r < 0)
return log_debug_errno(r, "Failed to get path of cgroup %" PRIu64 ", ignoring: %m", cgroup_id);
const char *e = path_startswith(path, "/sys/fs/cgroup/");
if (!e)
return log_debug_errno(SYNTHETIC_ERRNO(EPERM), "Got cgroup path that doesn't start with /sys/fs/cgroup/, refusing: %s", path);
if (isempty(e))
return log_debug_errno(SYNTHETIC_ERRNO(EPERM), "Got root cgroup path, which can't be right, refusing.");
log_debug("Path of cgroup %" PRIu64 " is: %s", cgroup_id, path);
_cleanup_free_ char *fname = NULL;
r = path_extract_filename(path, &fname);
if (r < 0)
return log_debug_errno(r, "Failed to extract name of cgroup %" PRIu64 ", ignoring: %m", cgroup_id);
parent_fd = openat(cgroup_fd, "..", O_CLOEXEC|O_DIRECTORY);
if (parent_fd < 0)
return log_debug_errno(errno, "Failed to open parent cgroup of %" PRIu64 ", ignoring: %m", cgroup_id);
/* Safety check, never leave cgroupfs */
r = fd_is_fs_type(parent_fd, CGROUP2_SUPER_MAGIC);
if (r < 0)
return log_debug_errno(r, "Failed to determine if parent directory of cgroup %" PRIu64 " is still a cgroup, ignoring: %m", cgroup_id);
if (!r)
return log_debug_errno(SYNTHETIC_ERRNO(EPERM), "Parent directory of cgroup %" PRIu64 " is not a cgroup, refusing.", cgroup_id);
cgroup_fd = safe_close(cgroup_fd);
r = rm_rf_child(parent_fd, fname, REMOVE_ONLY_DIRECTORIES|REMOVE_PHYSICAL|REMOVE_CHMOD);
if (r < 0)
log_debug_errno(r, "Failed to remove delegated cgroup %" PRIu64 ", ignoring: %m", cgroup_id);
return 0;
}
int userns_info_remove_cgroups(UserNamespaceInfo *userns) {
int ret = 0;
assert(userns);
FOREACH_ARRAY(c, userns->cgroups, userns->n_cgroups)
RET_GATHER(ret, userns_destroy_cgroup(*c));
userns->cgroups = mfree(userns->cgroups);
userns->n_cgroups = 0;
return ret;
}
bool userns_name_is_valid(const char *name) {
/* Checks if the specified string is suitable as user namespace name. */
if (strlen(name) > NAME_MAX) /* before we use alloca(), let's check for size */
return false;
const char *f = strjoina("n", name, ".userns"); /* Make sure we can name our lookup hard link with this name */
if (!filename_is_valid(f))
return false;
const char *u = strjoina("ns-", name, "-65535"); /* Make sure we can turn this into valid user names */
if (!valid_user_group_name(u, 0))
return false;
return true;
}
int userns_registry_per_uid(int dir_fd, uid_t owner) {
_cleanup_close_ int registry_fd = -EBADF;
int n = 0, r;
if (dir_fd < 0) {
registry_fd = userns_registry_open_fd();
if (registry_fd < 0)
return registry_fd;
dir_fd = registry_fd;
}
_cleanup_free_ char *uid_fn = NULL;
if (asprintf(&uid_fn, "o" UID_FMT ".owns", owner) < 0)
return log_oom_debug();
_cleanup_free_ DirectoryEntries *de = NULL;
r = readdir_all_at(dir_fd, uid_fn, RECURSE_DIR_IGNORE_DOT|RECURSE_DIR_ENSURE_TYPE, &de);
if (r == -ENOENT)
return 0;
if (r < 0)
return log_debug_errno(r, "Failed to enumerate contents of '%s' sub-directory: %m", uid_fn);
FOREACH_ARRAY(i, de->entries, de->n_entries) {
struct dirent *e = *i;
if (e->d_type != DT_REG)
continue;
if (!startswith(e->d_name, "i") || !endswith(e->d_name, ".userns"))
continue;
n++;
if (n == INT_MAX) /* overflow safety check, just in case */
break;
}
return n;
}
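
The registry code above stores one JSON record per user namespace and makes it reachable under several file names inside the registry directory: "i<inode>.userns" for the record itself, plus links named "n<name>.userns", "u<start UID>.userns" and "o<owner UID>.owns/i<inode>.userns". The following sketch spells out that naming scheme with plain `snprintf()`. The `demo_fn_*()` helpers are hypothetical and exist only for illustration. The actual code builds these names inline with `asprintf()` and `strjoin()`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Returns 0 if snprintf() wrote the whole string, -1 on error or truncation */
static int check_fit(int n, size_t size) {
        return n < 0 || (size_t) n >= size ? -1 : 0;
}

/* Main record, keyed by user namespace inode: "i<inode>.userns" */
int demo_fn_by_inode(char *buf, size_t size, uint64_t inode) {
        assert(buf);
        return check_fit(snprintf(buf, size, "i%" PRIu64 ".userns", inode), size);
}

/* Link keyed by user namespace name: "n<name>.userns" */
int demo_fn_by_name(char *buf, size_t size, const char *name) {
        assert(buf);
        assert(name);
        if (strlen(name) == 0)
                return -1;
        return check_fit(snprintf(buf, size, "n%s.userns", name), size);
}

/* Link keyed by the first UID of the delegated range: "u<uid>.userns" */
int demo_fn_by_start_uid(char *buf, size_t size, uid_t start) {
        assert(buf);
        return check_fit(snprintf(buf, size, "u%" PRIu32 ".userns", (uint32_t) start), size);
}

/* Link inside the per-owner subdirectory: "o<uid>.owns/i<inode>.userns" */
int demo_fn_by_owner(char *buf, size_t size, uid_t owner, uint64_t inode) {
        assert(buf);
        return check_fit(snprintf(buf, size, "o%" PRIu32 ".owns/i%" PRIu64 ".userns",
                                  (uint32_t) owner, inode), size);
}
```

Because every lookup key maps to a fixed file name, the existence checks above reduce to a single `faccessat()` call, and counting the namespaces of one owner reduces to listing one subdirectory.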

View file

@ -0,0 +1,42 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#define USER_NAMESPACE_CGROUPS_DELEGATE_MAX 16
typedef struct UserNamespaceInfo {
uid_t owner;
char *name;
uint64_t userns_inode;
uid_t start;
uint32_t size;
uid_t target;
uint64_t *cgroups;
size_t n_cgroups;
} UserNamespaceInfo;
UserNamespaceInfo* userns_info_new(void);
UserNamespaceInfo* userns_info_free(UserNamespaceInfo *userns);
DEFINE_TRIVIAL_CLEANUP_FUNC(UserNamespaceInfo*, userns_info_free);
bool userns_info_has_cgroup(UserNamespaceInfo *userns, uint64_t cgroup_id);
int userns_info_add_cgroup(UserNamespaceInfo *userns, uint64_t cgroup_id);
int userns_info_remove_cgroups(UserNamespaceInfo *userns);
bool userns_name_is_valid(const char *name);
int userns_registry_open_fd(void);
int userns_registry_lock(int dir_fd);
int userns_registry_load_by_start_uid(int dir_fd, uid_t start, UserNamespaceInfo **ret);
int userns_registry_load_by_userns_inode(int dir_fd, uint64_t userns, UserNamespaceInfo **ret);
int userns_registry_load_by_name(int dir_fd, const char *name, UserNamespaceInfo **ret);
int userns_registry_store(int dir_fd, UserNamespaceInfo *info);
int userns_registry_remove(int dir_fd, UserNamespaceInfo *info);
int userns_registry_inode_exists(int dir_fd, uint64_t inode);
int userns_registry_name_exists(int dir_fd, const char *name);
int userns_registry_uid_exists(int dir_fd, uid_t start);
int userns_registry_per_uid(int dir_fd, uid_t owner);
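
The `start`, `size` and `target` fields of `UserNamespaceInfo` describe the delegated UID range, and the loader in userns-registry.c rejects records where they are inconsistent. The sketch below restates those checks as one predicate. `DemoRange`, `demo_uid_is_valid()` and `demo_range_is_consistent()` are hypothetical names. The validity test (rejecting 0xFFFFFFFF and 0xFFFF) is my understanding of what systemd's `uid_is_valid()` does and should be read as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the range-related fields of UserNamespaceInfo */
typedef struct DemoRange {
        uint32_t start;  /* first UID on the host */
        uint32_t size;   /* number of UIDs, 0 means "no range assigned" */
        uint32_t target; /* first UID inside the user namespace */
} DemoRange;

/* Assumption: mirrors uid_is_valid(), which treats (uid_t) -1 and 65535 as invalid */
bool demo_uid_is_valid(uint32_t uid) {
        return uid != UINT32_MAX && uid != UINT16_MAX;
}

bool demo_range_is_consistent(const DemoRange *r) {
        assert(r);

        /* A delegated range never starts at the root user */
        if (r->start == 0)
                return false;

        /* Without a range, neither start nor target may be set */
        if (r->size == 0)
                return !demo_uid_is_valid(r->start) && !demo_uid_is_valid(r->target);

        /* With a range, both must be set ... */
        if (!demo_uid_is_valid(r->start) || !demo_uid_is_valid(r->target))
                return false;

        /* ... and start + size as well as target + size must not overflow 32 bits.
         * Written as a subtraction so that the check itself cannot wrap around. */
        return r->size <= UINT32_MAX - r->start &&
               r->size <= UINT32_MAX - r->target;
}
```

For example, a range with `start = 524288`, `size = 65536`, `target = 0` passes, while the same size starting at `UINT32_MAX - 10` is rejected because the end of the range would wrap.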

View file

@ -0,0 +1,346 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include "userns-restrict.h"
#if HAVE_VMLINUX_H
#include <sched.h>
#include "bpf-dlopen.h"
#include "bpf-link.h"
#include "fd-util.h"
#include "fs-util.h"
#include "lsm-util.h"
#include "missing_mount.h"
#include "mkdir.h"
#include "mount-util.h"
#include "mountpoint-util.h"
#include "namespace-util.h"
#include "path-util.h"
#define USERNS_MAX (16U*1024U)
#define MOUNTS_MAX 4096U
#define PROGRAM_LINK_PREFIX "/sys/fs/bpf/systemd/userns-restrict/programs"
#define MAP_LINK_PREFIX "/sys/fs/bpf/systemd/userns-restrict/maps"
struct userns_restrict_bpf *userns_restrict_bpf_free(struct userns_restrict_bpf *obj) {
(void) userns_restrict_bpf__destroy(obj); /* this call is fine with NULL */
return NULL;
}
static int make_inner_hash_map(void) {
int fd;
fd = compat_bpf_map_create(
BPF_MAP_TYPE_HASH,
NULL,
sizeof(int),
sizeof(uint32_t),
MOUNTS_MAX,
NULL);
if (fd < 0)
return log_debug_errno(errno, "Failed to allocate inner BPF map: %m");
return fd;
}
int userns_restrict_install(
bool pin,
struct userns_restrict_bpf **ret) {
_cleanup_(userns_restrict_bpf_freep) struct userns_restrict_bpf *obj = NULL;
_cleanup_close_ int dummy_mnt_id_hash_fd = -EBADF;
int r;
r = lsm_supported("bpf");
if (r < 0)
return r;
if (r == 0)
return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "bpf-lsm not supported, can't lock down user namespace.");
r = dlopen_bpf();
if (r < 0)
return r;
/* bpf_object__next_map() is not available in libbpf pre-0.7.0, and we want to use it. */
if (!sym_bpf_object__next_map)
return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "libbpf too old for locking down user namespace.");
obj = userns_restrict_bpf__open();
if (!obj)
return log_error_errno(errno, "Failed to open userns_restrict BPF object: %m");
if (pin) {
struct bpf_map *map;
/* libbpf will only create one level of dirs. Let's create the rest */
(void) mkdir_p(MAP_LINK_PREFIX, 0755);
(void) mkdir_p(PROGRAM_LINK_PREFIX, 0755);
map = sym_bpf_object__next_map(obj->obj, NULL);
while (map) {
_cleanup_free_ char *fn = NULL;
fn = path_join(MAP_LINK_PREFIX, sym_bpf_map__name(map));
if (!fn)
return log_oom();
r = sym_bpf_map__set_pin_path(map, fn);
if (r < 0)
return log_error_errno(r, "Failed to set pin path to '%s': %m", fn);
map = sym_bpf_object__next_map(obj->obj, map);
}
}
r = sym_bpf_map__set_max_entries(obj->maps.userns_mnt_id_hash, USERNS_MAX);
if (r < 0)
return log_error_errno(r, "Failed to size userns/mnt_id hash table: %m");
r = sym_bpf_map__set_max_entries(obj->maps.userns_ringbuf, USERNS_MAX * sizeof(unsigned int));
if (r < 0)
return log_error_errno(r, "Failed to size userns ring buffer: %m");
/* Dummy map to satisfy the verifier */
dummy_mnt_id_hash_fd = make_inner_hash_map();
if (dummy_mnt_id_hash_fd < 0)
return dummy_mnt_id_hash_fd;
r = sym_bpf_map__set_inner_map_fd(obj->maps.userns_mnt_id_hash, dummy_mnt_id_hash_fd);
if (r < 0)
return log_error_errno(r, "Failed to set inner BPF map: %m");
r = userns_restrict_bpf__load(obj);
if (r < 0)
return log_error_errno(r, "Failed to load BPF object: %m");
for (int i = 0; i < obj->skeleton->prog_cnt; i++) {
_cleanup_(bpf_link_freep) struct bpf_link *link = NULL;
struct bpf_prog_skeleton *ps = obj->skeleton->progs + i;
_cleanup_free_ char *fn = NULL;
bool linked = false;
const char *e;
e = startswith(ps->name, "userns_restrict_");
assert(e);
if (pin) {
fn = path_join(PROGRAM_LINK_PREFIX, e);
if (!fn)
return log_oom();
link = sym_bpf_link__open(fn);
r = sym_libbpf_get_error(link);
if (r < 0) {
if (r != -ENOENT)
return log_error_errno(r, "Unable to open pinned program link: %m");
link = NULL;
} else {
linked = true;
log_info("userns-restrict BPF-LSM program %s already attached.", ps->name);
}
}
if (!link) {
link = sym_bpf_program__attach(*ps->prog);
r = sym_libbpf_get_error(link);
if (r < 0)
return log_error_errno(r, "Failed to attach LSM BPF program: %m");
log_info("userns-restrict BPF-LSM program %s now attached.", ps->name);
}
if (pin && !linked) {
assert(fn);
r = sym_bpf_link__pin(link, fn);
if (r < 0)
return log_error_errno(r, "Failed to pin LSM attachment: %m");
}
*ps->link = TAKE_PTR(link);
}
if (pin) {
r = sym_bpf_object__pin_maps(obj->obj, NULL);
if (r < 0)
return log_error_errno(r, "Failed to pin BPF maps: %m");
}
if (ret)
*ret = TAKE_PTR(obj);
return 0;
}
int userns_restrict_put_by_inode(
                struct userns_restrict_bpf *obj,
                uint64_t userns_inode,
                bool replace,
                const int mount_fds[],
                size_t n_mount_fds) {

        _cleanup_close_ int inner_map_fd = -EBADF;
        _cleanup_free_ int *mnt_ids = NULL;
        uint64_t ino = userns_inode;
        int r, outer_map_fd;

        assert(obj);
        assert(userns_inode != 0);
        assert(n_mount_fds == 0 || mount_fds);

        /* The BPF map type BPF_MAP_TYPE_HASH_OF_MAPS only supports 32bit keys, and user namespace inode
         * numbers are 32bit too, even though ino_t is 64bit these days. Should we ever run into a 64bit
         * inode let's refuse early, we can't support this with the current BPF code for now. */
        if (userns_inode > UINT32_MAX)
                return -EINVAL;

        mnt_ids = new(int, n_mount_fds);
        if (!mnt_ids)
                return -ENOMEM;

        for (size_t i = 0; i < n_mount_fds; i++) {
                r = path_get_mnt_id_at(mount_fds[i], "", mnt_ids + i);
                if (r < 0)
                        return log_debug_errno(r, "Failed to get mount ID: %m");
        }

        outer_map_fd = sym_bpf_map__fd(obj->maps.userns_mnt_id_hash);
        if (outer_map_fd < 0)
                return log_debug_errno(outer_map_fd, "Failed to get outer BPF map fd: %m");

        if (replace) {
                /* Add if missing, replace if already exists */
                inner_map_fd = make_inner_hash_map();
                if (inner_map_fd < 0)
                        return inner_map_fd;

                r = sym_bpf_map_update_elem(outer_map_fd, &ino, &inner_map_fd, BPF_ANY);
                if (r < 0)
                        return log_debug_errno(errno, "Failed to replace map in inode hash: %m");
        } else {
                /* Let's add an entry for this userns inode if missing. If it exists just extend the existing map. We
                 * might race against each other, hence we try a couple of times */
                for (size_t n_try = 10;; n_try--) {
                        uint32_t innermap_id;

                        if (n_try == 0)
                                return log_debug_errno(SYNTHETIC_ERRNO(EEXIST),
                                                       "Still can't create inode entry in BPF map after 10 tries.");

                        r = sym_bpf_map_lookup_elem(outer_map_fd, &ino, &innermap_id);
                        if (r >= 0) {
                                inner_map_fd = sym_bpf_map_get_fd_by_id(innermap_id);
                                if (inner_map_fd < 0)
                                        return log_debug_errno(inner_map_fd, "Failed to get file descriptor for inner map: %m");

                                break;
                        }
                        if (errno != ENOENT)
                                return log_debug_errno(errno, "Failed to look up inode hash entry: %m");

                        /* No entry for this user namespace yet. Let's create one */
                        inner_map_fd = make_inner_hash_map();
                        if (inner_map_fd < 0)
                                return inner_map_fd;

                        r = sym_bpf_map_update_elem(outer_map_fd, &ino, &inner_map_fd, BPF_NOEXIST);
                        if (r >= 0)
                                break;
                        if (errno != EEXIST)
                                return log_debug_errno(errno, "Failed to add mount ID list to inode hash: %m");
                }
        }

        FOREACH_ARRAY(mntid, mnt_ids, n_mount_fds) {
                uint32_t dummy_value = 1;

                r = sym_bpf_map_update_elem(inner_map_fd, mntid, &dummy_value, BPF_ANY);
                if (r < 0)
                        return log_debug_errno(errno, "Failed to add mount ID to map: %m");

                log_debug("Allowing mount %i on userns inode %" PRIu64, *mntid, ino);
        }

        return 0;
}
int userns_restrict_put_by_fd(
                struct userns_restrict_bpf *obj,
                int userns_fd,
                bool replace,
                const int mount_fds[],
                size_t n_mount_fds) {

        struct stat st;
        int r;

        assert(obj);
        assert(userns_fd >= 0);
        assert(n_mount_fds == 0 || mount_fds);

        r = fd_is_ns(userns_fd, CLONE_NEWUSER);
        if (r < 0)
                return log_debug_errno(r, "Failed to determine if file descriptor is user namespace: %m");
        if (r == 0)
                return log_debug_errno(SYNTHETIC_ERRNO(EBADF), "User namespace fd is not actually a user namespace fd.");

        if (fstat(userns_fd, &st) < 0)
                return log_debug_errno(errno, "Failed to fstat() user namespace: %m");

        return userns_restrict_put_by_inode(
                        obj,
                        st.st_ino,
                        replace,
                        mount_fds,
                        n_mount_fds);
}
int userns_restrict_reset_by_inode(
                struct userns_restrict_bpf *obj,
                uint64_t ino) {

        int r, outer_map_fd;
        unsigned u;

        assert(obj);
        assert(ino != 0);

        if (ino > UINT32_MAX) /* inodes larger than 32bit are definitely not included in our map, exit early */
                return 0;

        outer_map_fd = sym_bpf_map__fd(obj->maps.userns_mnt_id_hash);
        if (outer_map_fd < 0)
                return log_debug_errno(outer_map_fd, "Failed to get outer BPF map fd: %m");

        u = (uint32_t) ino;

        r = sym_bpf_map_delete_elem(outer_map_fd, &u);
        if (r < 0) /* Report the actual error from errno, not the (valid) map fd */
                return log_debug_errno(errno, "Failed to remove entry for inode %" PRIu64 " from outer map: %m", ino);

        return 0;
}

#else
int userns_restrict_install(bool pin, struct userns_restrict_bpf **ret) {
        return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "User Namespace Restriction BPF support disabled.");
}

struct userns_restrict_bpf *userns_restrict_bpf_free(struct userns_restrict_bpf *obj) {
        return NULL;
}

int userns_restrict_put_by_fd(struct userns_restrict_bpf *obj, int userns_fd, bool replace, const int mount_fds[], size_t n_mount_fds) {
        return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "User Namespace Restriction BPF support disabled.");
}

int userns_restrict_put_by_inode(struct userns_restrict_bpf *obj, uint64_t userns_inode, bool replace, const int mount_fds[], size_t n_mount_fds) {
        return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "User Namespace Restriction BPF support disabled.");
}

int userns_restrict_reset_by_inode(struct userns_restrict_bpf *obj, uint64_t userns_inode) {
        return log_error_errno(SYNTHETIC_ERRNO(EOPNOTSUPP), "User Namespace Restriction BPF support disabled.");
}
#endif
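
Note on the add-if-missing path in userns_restrict_put_by_inode(): it is an optimistic "look up, else insert with BPF_NOEXIST, retry on EEXIST" loop. The pattern can be tried without a kernel or libbpf using the minimal sketch below. Everything named toy_* is hypothetical and merely stands in for the outer BPF map. Only the control flow, the bound of 10 tries and the 32-bit key check mirror the code above. Unlike the real function, which asserts that the inode is non-zero, the sketch returns -EINVAL for it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the outer BPF map: a small table keyed by a 32-bit inode number.
 * Hypothetical, for illustration only. The real code talks to the kernel via libbpf. */
#define TOY_SLOTS 8

struct toy_map {
        uint32_t keys[TOY_SLOTS];
        int values[TOY_SLOTS];
        size_t n;
};

/* Returns 0 and stores the value if the key is present, -ENOENT otherwise. */
static int toy_lookup(const struct toy_map *m, uint32_t key, int *ret) {
        for (size_t i = 0; i < m->n; i++)
                if (m->keys[i] == key) {
                        *ret = m->values[i];
                        return 0;
                }

        return -ENOENT;
}

/* Insert-only semantics, comparable to BPF_NOEXIST: fails with -EEXIST if the key is present. */
static int toy_insert_noexist(struct toy_map *m, uint32_t key, int value) {
        int dummy;

        if (toy_lookup(m, key, &dummy) >= 0)
                return -EEXIST;
        if (m->n >= TOY_SLOTS)
                return -ENOSPC;

        m->keys[m->n] = key;
        m->values[m->n] = value;
        m->n++;
        return 0;
}

/* Returns 0 if an existing entry was found, 1 if 'fresh' was inserted, negative errno on failure. */
static int toy_get_or_create(struct toy_map *m, uint64_t ino, int fresh, int *ret) {
        int r;

        assert(m);
        assert(ret);

        if (ino == 0 || ino > UINT32_MAX) /* keys are 32-bit only */
                return -EINVAL;

        for (unsigned n_try = 10; n_try > 0; n_try--) {
                r = toy_lookup(m, (uint32_t) ino, ret);
                if (r >= 0)
                        return 0;
                if (r != -ENOENT)
                        return r;

                r = toy_insert_noexist(m, (uint32_t) ino, fresh);
                if (r >= 0) {
                        *ret = fresh;
                        return 1;
                }
                if (r != -EEXIST)
                        return r;

                /* Lost a race against a concurrent writer: loop and look up the winner's entry. */
        }

        return -EEXIST;
}
```

In a single-threaded run the EEXIST branch is never taken. It matters only when several processes update the same pinned map concurrently, which is the case the retry loop above guards against.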

@@ -0,0 +1,22 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include <stdbool.h>
#include "macro.h"
#if HAVE_VMLINUX_H
#include "bpf/userns_restrict/userns-restrict-skel.h"
#else
struct userns_restrict_bpf;
#endif
int userns_restrict_install(bool pin, struct userns_restrict_bpf **ret);
struct userns_restrict_bpf *userns_restrict_bpf_free(struct userns_restrict_bpf *obj);
int userns_restrict_put_by_fd(struct userns_restrict_bpf *obj, int userns_fd, bool replace, const int mount_fds[], size_t n_mount_fds);
int userns_restrict_put_by_inode(struct userns_restrict_bpf *obj, uint64_t userns_inode, bool replace, const int mount_fds[], size_t n_mount_fds);
int userns_restrict_reset_by_inode(struct userns_restrict_bpf *obj, uint64_t userns_inode);
DEFINE_TRIVIAL_CLEANUP_FUNC(struct userns_restrict_bpf*, userns_restrict_bpf_free);

@@ -180,6 +180,7 @@ shared_sources = files(
'varlink-io.systemd.Hostname.c',
'varlink-io.systemd.Journal.c',
'varlink-io.systemd.ManagedOOM.c',
'varlink-io.systemd.NamespaceResource.c',
'varlink-io.systemd.Network.c',
'varlink-io.systemd.PCRExtend.c',
'varlink-io.systemd.PCRLock.c',

@@ -0,0 +1,62 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#include "varlink-io.systemd.NamespaceResource.h"

static VARLINK_DEFINE_METHOD(
                AllocateUserRange,
                VARLINK_DEFINE_INPUT(name, VARLINK_STRING, 0),
                VARLINK_DEFINE_INPUT(size, VARLINK_INT, 0),
                VARLINK_DEFINE_INPUT(target, VARLINK_INT, VARLINK_NULLABLE),
                VARLINK_DEFINE_INPUT(userNamespaceFileDescriptor, VARLINK_INT, 0));

static VARLINK_DEFINE_METHOD(
                RegisterUserNamespace,
                VARLINK_DEFINE_INPUT(name, VARLINK_STRING, 0),
                VARLINK_DEFINE_INPUT(userNamespaceFileDescriptor, VARLINK_INT, 0));

static VARLINK_DEFINE_METHOD(
                AddMountToUserNamespace,
                VARLINK_DEFINE_INPUT(userNamespaceFileDescriptor, VARLINK_INT, 0),
                VARLINK_DEFINE_INPUT(mountFileDescriptor, VARLINK_INT, 0));

static VARLINK_DEFINE_METHOD(
                AddControlGroupToUserNamespace,
                VARLINK_DEFINE_INPUT(userNamespaceFileDescriptor, VARLINK_INT, 0),
                VARLINK_DEFINE_INPUT(controlGroupFileDescriptor, VARLINK_INT, 0));

static VARLINK_DEFINE_METHOD(
                AddNetworkToUserNamespace,
                VARLINK_DEFINE_INPUT(userNamespaceFileDescriptor, VARLINK_INT, 0),
                VARLINK_DEFINE_INPUT(networkNamespaceFileDescriptor, VARLINK_INT, 0),
                VARLINK_DEFINE_INPUT(namespaceInterfaceName, VARLINK_STRING, VARLINK_NULLABLE),
                VARLINK_DEFINE_INPUT(mode, VARLINK_STRING, 0),
                VARLINK_DEFINE_OUTPUT(hostInterfaceName, VARLINK_STRING, 0),
                VARLINK_DEFINE_OUTPUT(namespaceInterfaceName, VARLINK_STRING, 0));

static VARLINK_DEFINE_ERROR(UserNamespaceInterfaceNotSupported);
static VARLINK_DEFINE_ERROR(NameExists);
static VARLINK_DEFINE_ERROR(UserNamespaceExists);
static VARLINK_DEFINE_ERROR(DynamicRangeUnavailable);
static VARLINK_DEFINE_ERROR(NoDynamicRange);
static VARLINK_DEFINE_ERROR(UserNamespaceNotRegistered);
static VARLINK_DEFINE_ERROR(UserNamespaceWithoutUserRange);
static VARLINK_DEFINE_ERROR(TooManyControlGroups);
static VARLINK_DEFINE_ERROR(ControlGroupAlreadyAdded);

VARLINK_DEFINE_INTERFACE(
                io_systemd_NamespaceResource,
                "io.systemd.NamespaceResource",
                &vl_method_AllocateUserRange,
                &vl_method_RegisterUserNamespace,
                &vl_method_AddMountToUserNamespace,
                &vl_method_AddControlGroupToUserNamespace,
                &vl_method_AddNetworkToUserNamespace,
                &vl_error_UserNamespaceInterfaceNotSupported,
                &vl_error_NameExists,
                &vl_error_UserNamespaceExists,
                &vl_error_DynamicRangeUnavailable,
                &vl_error_NoDynamicRange,
                &vl_error_UserNamespaceNotRegistered,
                &vl_error_UserNamespaceWithoutUserRange,
                &vl_error_TooManyControlGroups,
                &vl_error_ControlGroupAlreadyAdded);
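
Note on the wire format: the sketch below shows roughly what an AllocateUserRange request built from the fields defined above could look like. The method and parameter names come from the interface definition. The helper name format_allocate_user_range() and the example values are hypothetical. The sketch also assumes that the file descriptor itself travels out-of-band (SCM_RIGHTS) and that the JSON field carries its index among the passed descriptors, which this part of the diff does not show. The name is not JSON-escaped here, so only plain identifiers are safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Formats a Varlink call for io.systemd.NamespaceResource.AllocateUserRange into buf.
 * Returns the message length (excluding the terminating NUL), or -1 if buf is too small.
 * Hypothetical helper for illustration; 'name' must not contain characters needing JSON escaping. */
static int format_allocate_user_range(
                char *buf,
                size_t size,
                const char *name,
                unsigned range_size,
                int userns_fd_index) {

        int n;

        assert(buf);
        assert(name);

        n = snprintf(buf, size,
                     "{\"method\":\"io.systemd.NamespaceResource.AllocateUserRange\","
                     "\"parameters\":{\"name\":\"%s\",\"size\":%u,\"userNamespaceFileDescriptor\":%i}}",
                     name, range_size, userns_fd_index);
        if (n < 0 || (size_t) n >= size)
                return -1;

        return n;
}
```

Varlink messages are terminated by a NUL byte, so a caller would write n + 1 bytes of buf to the socket. The optional target field is left out, which the VARLINK_NULLABLE flag above permits.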

@@ -0,0 +1,6 @@
/* SPDX-License-Identifier: LGPL-2.1-or-later */
#pragma once
#include "varlink-idl.h"
extern const VarlinkInterface vl_interface_io_systemd_NamespaceResource;

@@ -12,11 +12,12 @@
#include "varlink-io.systemd.Credentials.h"
#include "varlink-io.systemd.Journal.h"
#include "varlink-io.systemd.ManagedOOM.h"
#include "varlink-io.systemd.NamespaceResource.h"
#include "varlink-io.systemd.Network.h"
#include "varlink-io.systemd.PCRExtend.h"
#include "varlink-io.systemd.PCRLock.h"
#include "varlink-io.systemd.Resolve.Monitor.h"
#include "varlink-io.systemd.Resolve.h"
#include "varlink-io.systemd.Resolve.Monitor.h"
#include "varlink-io.systemd.UserDatabase.h"
#include "varlink-io.systemd.oom.h"
#include "varlink-io.systemd.service.h"
@@ -129,6 +130,8 @@ TEST(parse_format) {
print_separator();
test_parse_format_one(&vl_interface_io_systemd_UserDatabase);
print_separator();
test_parse_format_one(&vl_interface_io_systemd_NamespaceResource);
print_separator();
test_parse_format_one(&vl_interface_io_systemd_Journal);
print_separator();
test_parse_format_one(&vl_interface_io_systemd_Resolve);

@@ -732,6 +732,14 @@ units = [
'file' : 'systemd-userdbd.socket',
'conditions' : ['ENABLE_USERDB'],
},
{
'file' : 'systemd-nsresourced.service.in',
'conditions' : ['ENABLE_NSRESOURCED'],
},
{
'file' : 'systemd-nsresourced.socket',
'conditions' : ['ENABLE_NSRESOURCED'],
},
{
'file' : 'systemd-vconsole-setup.service.in',
'conditions' : ['ENABLE_VCONSOLE'],

@@ -0,0 +1,47 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
#
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=Namespace Resource Manager
Documentation=man:systemd-nsresourced.service(8)
Requires=systemd-nsresourced.socket
After=systemd-nsresourced.socket
Conflicts=shutdown.target
Before=sysinit.target shutdown.target
DefaultDependencies=no
[Service]
CapabilityBoundingSet=CAP_DAC_READ_SEARCH CAP_SYS_RESOURCE CAP_BPF CAP_PERFMON CAP_SETGID CAP_SETUID CAP_SYS_ADMIN CAP_CHOWN CAP_FOWNER
ExecStart={{LIBEXECDIR}}/systemd-nsresourced
IPAddressDeny=any
LimitNOFILE={{HIGH_RLIMIT_NOFILE}}
LockPersonality=yes
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
PrivateDevices=yes
ProtectProc=invisible
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectKernelModules=yes
ProtectSystem=strict
RestrictAddressFamilies=AF_UNIX AF_NETLINK
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallErrorNumber=EPERM
SystemCallFilter=@system-service bpf perf_event_open open_by_handle_at
Type=notify
NotifyAccess=all
FileDescriptorStoreMax=4096
{{SERVICE_WATCHDOG}}
[Install]
Also=systemd-nsresourced.socket

@@ -0,0 +1,23 @@
# SPDX-License-Identifier: LGPL-2.1-or-later
#
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=Namespace Resource Manager Socket
Documentation=man:systemd-nsresourced.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Before=sockets.target shutdown.target
[Socket]
ListenStream=/run/systemd/io.systemd.NamespaceResource
Symlinks=/run/systemd/userdb/io.systemd.NamespaceResource
SocketMode=0666
[Install]
WantedBy=sockets.target
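
Note on reaching the service: the socket unit listens on an AF_UNIX stream socket with mode 0666, so unprivileged clients can connect to it directly. A minimal sketch of the client side follows. The helper name connect_unix_stream() is hypothetical, and the code uses only POSIX/Linux socket calls rather than systemd's own Varlink client code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Connects to an AF_UNIX stream socket at 'path'.
 * Returns a connected file descriptor, or a negative errno-style value on failure.
 * Hypothetical helper for illustration. */
static int connect_unix_stream(const char *path) {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        int fd;

        assert(path);

        if (strlen(path) >= sizeof(sa.sun_path))
                return -ENAMETOOLONG;
        strcpy(sa.sun_path, path);

        fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -errno;

        if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
                int e = errno;

                close(fd);
                return -e;
        }

        return fd;
}
```

A client would call connect_unix_stream("/run/systemd/io.systemd.NamespaceResource"), the ListenStream= path above. Because the unit is socket-activated, the first connection is what starts the daemon.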