cgroup: Changes for v6.11

- Added Michal Koutný as a maintainer.
 
 - Counters in pids.events were behaving inconsistently. pids.events made
   properly hierarchical and pids.events.local added.
 
 - misc.peak and misc.events.local added.
 
 - cpuset remote partition creation and cpuset.cpus.exclusive handling
   improved.
 
 - Code cleanups, non-critical fixes, doc updates.
 
 - for-6.10-fixes is merged in to receive two non-critical fixes that didn't
   trigger a pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZpSsdw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGSEMAQDQ5VfcRz+rW20ez5IAgyN3EKIwSbW6pY6jojgj
 bJtJSQD/TzA8DoRxcCvTdHcZcwJ2e2zBVcuM8NkZHfSCNiPrrgs=
 =5f3I
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - Added Michal Koutný as a maintainer

 - Counters in pids.events were behaving inconsistently. pids.events
   made properly hierarchical and pids.events.local added

 - misc.peak and misc.events.local added

 - cpuset remote partition creation and cpuset.cpus.exclusive handling
   improved

 - Code cleanups, non-critical fixes, doc updates

 - for-6.10-fixes is merged in to receive two non-critical fixes that
   didn't trigger a pull request

* tag 'cgroup-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (23 commits)
  cgroup: Add Michal Koutný as a maintainer
  cgroup/misc: Introduce misc.events.local
  cgroup/rstat: add force idle show helper
  cgroup: Protect css->cgroup write under css_set_lock
  cgroup/misc: Introduce misc.peak
  cgroup_misc: add kernel-doc comments for enum misc_res_type
  cgroup/cpuset: Prevent UAF in proc_cpuset_show()
  selftest/cgroup: Update test_cpuset_prs.sh to match changes
  cgroup/cpuset: Make cpuset.cpus.exclusive independent of cpuset.cpus
  cgroup/cpuset: Delay setting of CS_CPU_EXCLUSIVE until valid partition
  selftest/cgroup: Fix test_cpuset_prs.sh problems reported by test robot
  cgroup/cpuset: Fix remote root partition creation problem
  cgroup: avoid the unnecessary list_add(dying_tasks) in cgroup_exit()
  cgroup/cpuset: Optimize isolated partition only generate_sched_domains() calls
  cgroup/cpuset: Reduce the lock protecting CS_SCHED_LOAD_BALANCE
  kernel/cgroup: cleanup cgroup_base_files when fail to add cgroup_psi_files
  selftests: cgroup: Add basic tests for pids controller
  selftests: cgroup: Lexicographic order in Makefile
  cgroup/pids: Add pids.events.local
  cgroup/pids: Make event counters hierarchical
  ...
Commit 895b9b1207 by Linus Torvalds, 2024-07-15 16:41:32 -07:00
14 changed files with 679 additions and 159 deletions


@ -36,7 +36,8 @@ superset of parent/child/pids.current.
The pids.events file contains event counters:
- max: Number of times fork failed because limit was hit.
- max: Number of times fork failed in the cgroup because limit was hit in
self or ancestors.
Example
-------


@ -239,6 +239,13 @@ cgroup v2 currently supports the following mount options.
will not be tracked by the memory controller (even if cgroup
v2 is remounted later on).
pids_localevents
The option restores v1-like behavior of pids.events:max, that is, only
local (inside cgroup proper) fork failures are counted. Without this
option, pids.events:max represents any pids.max enforcement across the
cgroup's subtree.
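A minimal sketch of enabling it at runtime, assuming the v2 hierarchy is
mounted at /sys/fs/cgroup and that the system permits remounting with extra
options::

    # switch pids.events:max to v1-like, local-only accounting
    $ mount -o remount,pids_localevents none /sys/fs/cgroup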
Organizing Processes and Threads
--------------------------------
@ -2205,12 +2212,18 @@ PID Interface Files
descendants has ever reached.
pids.events
A read-only flat-keyed file which exists on non-root cgroups. The
following entries are defined. Unless specified otherwise, a value
change in this file generates a file modified event.
A read-only flat-keyed file which exists on non-root cgroups. Unless
specified otherwise, a value change in this file generates a file
modified event. The following entries are defined.
max
Number of times fork failed because limit was hit.
The number of times the cgroup's total number of processes hit the pids.max
limit (see also pids_localevents).
pids.events.local
Similar to pids.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
generated on this file reflects only the local events.
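For illustration (cgroup names and counts assumed, mirroring the new
test_pids_events selftest below): with pids.max set to 2 on a parent and a
fork failing in its child, the enforcing parent reports the event both
hierarchically and locally, while the child, whose own limit was never hit,
reports none::

    $ cat pids_parent/pids.events
    max 1
    $ cat pids_parent/pids.events.local
    max 1
    $ cat pids_parent/pids_child/pids.events
    max 0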
Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
@ -2346,8 +2359,12 @@ Cpuset Interface Files
is always a subset of it.
Users can manually set it to a value that is different from
"cpuset.cpus". The only constraint in setting it is that the
list of CPUs must be exclusive with respect to its sibling.
"cpuset.cpus". One constraint in setting it is that the list of
CPUs must be exclusive with respect to "cpuset.cpus.exclusive"
of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup
isn't set, its "cpuset.cpus" value, if set, cannot be a subset
of it to leave at least one CPU available when the exclusive
CPUs are taken away.
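An illustrative sketch of that sibling rule (cgroup names assumed; the CPU
numbers follow the cases added to test_cpuset_prs.sh below)::

    # Sibling A has cpuset.cpus 0-3, sibling B has cpuset.cpus 4-5
    $ echo 5 > A/cpuset.cpus.exclusive    # accepted: B still has CPU 4 left
    $ echo 3-5 > A/cpuset.cpus.exclusive  # rejected: B's 4-5 would be a subset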
For a parent cgroup, any one of its exclusive CPUs can only
be distributed to at most one of its child cgroups. Having an
@ -2363,8 +2380,8 @@ Cpuset Interface Files
cpuset-enabled cgroups.
This file shows the effective set of exclusive CPUs that
can be used to create a partition root. The content of this
file will always be a subset of "cpuset.cpus" and its parent's
can be used to create a partition root. The content
of this file will always be a subset of its parent's
"cpuset.cpus.exclusive.effective" if its parent is not the root
cgroup. It will also be a subset of "cpuset.cpus.exclusive"
if it is set. If "cpuset.cpus.exclusive" is not set, it is
@ -2625,6 +2642,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
res_a 3
res_b 0
misc.peak
A read-only flat-keyed file shown in all cgroups. It shows the
historical maximum usage of the resources in the cgroup and its
children.::
$ cat misc.peak
res_a 10
res_b 8
misc.max
A read-write flat-keyed file shown in the non root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::
@ -2654,6 +2680,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
The number of times the cgroup's resource usage was
about to go over the max boundary.
misc.events.local
Similar to misc.events but the fields in the file are local to the
cgroup i.e. not hierarchical. The file modified event generated on
this file reflects only the local events.
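A hedged illustration (cgroup layout and counts assumed, resource name res_a
as in the examples above): when a child hits its own res_a limit, the event
shows up hierarchically in the parent's misc.events but only in the child's
misc.events.local::

    $ cat parent/child/misc.events.local
    res_a.max 1
    $ cat parent/misc.events
    res_a.max 1
    $ cat parent/misc.events.local
    res_a.max 0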
Migration and Ownership
~~~~~~~~~~~~~~~~~~~~~~~


@ -5528,6 +5528,7 @@ CONTROL GROUP (CGROUP)
M: Tejun Heo <tj@kernel.org>
M: Zefan Li <lizefan.x@bytedance.com>
M: Johannes Weiner <hannes@cmpxchg.org>
M: Michal Koutný <mkoutny@suse.com>
L: cgroups@vger.kernel.org
S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git


@ -119,7 +119,12 @@ enum {
/*
* Enable hugetlb accounting for the memory controller.
*/
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING = (1 << 19),
/*
* Enable legacy local pids.events.
*/
CGRP_ROOT_PIDS_LOCAL_EVENTS = (1 << 20),
};
/* cftype->flags */


@ -9,15 +9,16 @@
#define _MISC_CGROUP_H_
/**
* Types of misc cgroup entries supported by the host.
* enum misc_res_type - Types of misc cgroup entries supported by the host.
*/
enum misc_res_type {
#ifdef CONFIG_KVM_AMD_SEV
/* AMD SEV ASIDs resource */
/** @MISC_CG_RES_SEV: AMD SEV ASIDs resource */
MISC_CG_RES_SEV,
/* AMD SEV-ES ASIDs resource */
/** @MISC_CG_RES_SEV_ES: AMD SEV-ES ASIDs resource */
MISC_CG_RES_SEV_ES,
#endif
/** @MISC_CG_RES_TYPES: count of enum misc_res_type constants */
MISC_CG_RES_TYPES
};
@ -30,13 +31,16 @@ struct misc_cg;
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
* @watermark: Historical maximum usage of the resource.
* @usage: Current usage of the resource.
* @events: Number of times the resource limit was exceeded.
*/
struct misc_res {
u64 max;
atomic64_t watermark;
atomic64_t usage;
atomic64_t events;
atomic64_t events_local;
};
/**
@ -50,6 +54,8 @@ struct misc_cg {
/* misc.events */
struct cgroup_file events_file;
/* misc.events.local */
struct cgroup_file events_local_file;
struct misc_res res[MISC_CG_RES_TYPES];
};


@ -1744,8 +1744,11 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
if (cgroup_psi_enabled()) {
ret = cgroup_addrm_files(css, cgrp,
cgroup_psi_files, true);
if (ret < 0)
if (ret < 0) {
cgroup_addrm_files(css, cgrp,
cgroup_base_files, false);
return ret;
}
}
} else {
ret = cgroup_addrm_files(css, cgrp,
@ -1839,9 +1842,9 @@ int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask)
RCU_INIT_POINTER(scgrp->subsys[ssid], NULL);
rcu_assign_pointer(dcgrp->subsys[ssid], css);
ss->root = dst_root;
css->cgroup = dcgrp;
spin_lock_irq(&css_set_lock);
css->cgroup = dcgrp;
WARN_ON(!list_empty(&dcgrp->e_csets[ss->id]));
list_for_each_entry_safe(cset, cset_pos, &scgrp->e_csets[ss->id],
e_cset_node[ss->id]) {
@ -1922,6 +1925,7 @@ enum cgroup2_param {
Opt_memory_localevents,
Opt_memory_recursiveprot,
Opt_memory_hugetlb_accounting,
Opt_pids_localevents,
nr__cgroup2_params
};
@ -1931,6 +1935,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("memory_localevents", Opt_memory_localevents),
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
fsparam_flag("pids_localevents", Opt_pids_localevents),
{}
};
@ -1960,6 +1965,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
case Opt_memory_hugetlb_accounting:
ctx->flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
return 0;
case Opt_pids_localevents:
ctx->flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
return 0;
}
return -EINVAL;
}
@ -1989,6 +1997,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING;
if (root_flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
cgrp_dfl_root.flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_PIDS_LOCAL_EVENTS;
}
}
@ -2004,6 +2017,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
seq_puts(seq, ",memory_recursiveprot");
if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING)
seq_puts(seq, ",memory_hugetlb_accounting");
if (cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
seq_puts(seq, ",pids_localevents");
return 0;
}
@ -6686,8 +6701,10 @@ void cgroup_exit(struct task_struct *tsk)
WARN_ON_ONCE(list_empty(&tsk->cg_list));
cset = task_css_set(tsk);
css_set_move_task(tsk, cset, NULL, false);
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
cset->nr_tasks--;
/* matches the signal->live check in css_task_iter_advance() */
if (thread_group_leader(tsk) && atomic_read(&tsk->signal->live))
list_add_tail(&tsk->cg_list, &cset->dying_tasks);
if (dl_task(tsk))
dec_dl_tasks_cs(tsk);
@ -6714,10 +6731,12 @@ void cgroup_release(struct task_struct *task)
ss->release(task);
} while_each_subsys_mask();
spin_lock_irq(&css_set_lock);
css_set_skip_task_iters(task_css_set(task), task);
list_del_init(&task->cg_list);
spin_unlock_irq(&css_set_lock);
if (!list_empty(&task->cg_list)) {
spin_lock_irq(&css_set_lock);
css_set_skip_task_iters(task_css_set(task), task);
list_del_init(&task->cg_list);
spin_unlock_irq(&css_set_lock);
}
}
void cgroup_free(struct task_struct *task)
@ -7062,7 +7081,8 @@ static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr,
"favordynmods\n"
"memory_localevents\n"
"memory_recursiveprot\n"
"memory_hugetlb_accounting\n");
"memory_hugetlb_accounting\n"
"pids_localevents\n");
}
static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features);
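A quick, non-authoritative way to check whether a running kernel carries this
change is the features file that features_show() above feeds (output trimmed):

    $ grep pids_localevents /sys/kernel/cgroup/features
    pids_localevents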


@ -21,6 +21,7 @@
* License. See the file COPYING in the main directory of the Linux
* distribution for more details.
*/
#include "cgroup-internal.h"
#include <linux/cpu.h>
#include <linux/cpumask.h>
@ -87,7 +88,7 @@ static const char * const perr_strings[] = {
[PERR_NOTEXCL] = "Cpu list in cpuset.cpus not exclusive",
[PERR_NOCPUS] = "Parent unable to distribute cpu downstream",
[PERR_HOTPLUG] = "No cpu available due to hotplug",
[PERR_CPUSEMPTY] = "cpuset.cpus is empty",
[PERR_CPUSEMPTY] = "cpuset.cpus and cpuset.cpus.exclusive are empty",
[PERR_HKEEPING] = "partition config conflicts with housekeeping setup",
};
@ -127,19 +128,28 @@ struct cpuset {
/*
* Exclusive CPUs dedicated to current cgroup (default hierarchy only)
*
* This exclusive CPUs must be a subset of cpus_allowed. A parent
* cgroup can only grant exclusive CPUs to one of its children.
* The effective_cpus of a valid partition root comes solely from its
* effective_xcpus and some of the effective_xcpus may be distributed
* to sub-partitions below & hence excluded from its effective_cpus.
* For a valid partition root, its effective_cpus have no relationship
* with cpus_allowed unless its exclusive_cpus isn't set.
*
* When the cgroup becomes a valid partition root, effective_xcpus
* defaults to cpus_allowed if not set. The effective_cpus of a valid
* partition root comes solely from its effective_xcpus and some of the
* effective_xcpus may be distributed to sub-partitions below & hence
* excluded from its effective_cpus.
* This value will only be set if either exclusive_cpus is set or
* when this cpuset becomes a local partition root.
*/
cpumask_var_t effective_xcpus;
/*
* Exclusive CPUs as requested by the user (default hierarchy only)
*
* Its value is independent of cpus_allowed and designates the set of
* CPUs that can be granted to the current cpuset or its children when
* it becomes a valid partition root. The effective set of exclusive
* CPUs granted (effective_xcpus) depends on whether those exclusive
* CPUs are passed down by its ancestors and not yet taken up by
* another sibling partition root along the way.
*
* If its value isn't set, it defaults to cpus_allowed.
*/
cpumask_var_t exclusive_cpus;
@ -169,7 +179,7 @@ struct cpuset {
/* for custom sched domain */
int relax_domain_level;
/* number of valid sub-partitions */
/* number of valid local child partitions */
int nr_subparts;
/* partition root state */
@ -230,6 +240,17 @@ static struct list_head remote_children;
* 2 - partition root without load balancing (isolated)
* -1 - invalid partition root
* -2 - invalid isolated partition root
*
* There are 2 types of partitions - local or remote. Local partitions are
* those whose parents are partition roots themselves. Setting
* cpuset.cpus.exclusive is optional when setting up local partitions.
* Remote partitions are those whose parents are not partition roots. Passing
* down exclusive CPUs by setting cpuset.cpus.exclusive along the ancestor
* nodes is mandatory when creating a remote partition.
*
* For simplicity, a local partition can be created under a local or remote
* partition but a remote partition cannot have any partition root in its
* ancestor chain except the cgroup root.
*/
#define PRS_MEMBER 0
#define PRS_ROOT 1
@ -434,7 +455,7 @@ static struct cpuset top_cpuset = {
* by other task, we use alloc_lock in the task_struct fields to protect
* them.
*
* The cpuset_common_file_read() handlers only hold callback_lock across
* The cpuset_common_seq_show() handlers only hold callback_lock across
* small pieces of code, such as when reading out possibly multi-word
* cpumasks and nodemasks.
*
@ -709,6 +730,19 @@ static inline void free_cpuset(struct cpuset *cs)
kfree(cs);
}
/* Return user specified exclusive CPUs */
static inline struct cpumask *user_xcpus(struct cpuset *cs)
{
return cpumask_empty(cs->exclusive_cpus) ? cs->cpus_allowed
: cs->exclusive_cpus;
}
static inline bool xcpus_empty(struct cpuset *cs)
{
return cpumask_empty(cs->cpus_allowed) &&
cpumask_empty(cs->exclusive_cpus);
}
static inline struct cpumask *fetch_xcpus(struct cpuset *cs)
{
return !cpumask_empty(cs->exclusive_cpus) ? cs->exclusive_cpus :
@ -825,17 +859,41 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
/*
* If either I or some sibling (!= me) is exclusive, we can't
* overlap
* overlap. exclusive_cpus cannot overlap with each other if set.
*/
ret = -EINVAL;
cpuset_for_each_child(c, css, par) {
if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
c != cur) {
bool txset, cxset; /* Are exclusive_cpus set? */
if (c == cur)
continue;
txset = !cpumask_empty(trial->exclusive_cpus);
cxset = !cpumask_empty(c->exclusive_cpus);
if (is_cpu_exclusive(trial) || is_cpu_exclusive(c) ||
(txset && cxset)) {
if (!cpusets_are_exclusive(trial, c))
goto out;
} else if (txset || cxset) {
struct cpumask *xcpus, *acpus;
/*
* When just one of the exclusive_cpus's is set,
* cpus_allowed of the other cpuset, if set, cannot be
* a subset of it or none of those CPUs will be
* available if these exclusive CPUs are activated.
*/
if (txset) {
xcpus = trial->exclusive_cpus;
acpus = c->cpus_allowed;
} else {
xcpus = c->exclusive_cpus;
acpus = trial->cpus_allowed;
}
if (!cpumask_empty(acpus) && cpumask_subset(acpus, xcpus))
goto out;
}
if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
c != cur &&
nodes_intersects(trial->mems_allowed, c->mems_allowed))
goto out;
}
@ -957,13 +1015,15 @@ static int generate_sched_domains(cpumask_var_t **domains,
int nslot; /* next empty doms[] struct cpumask slot */
struct cgroup_subsys_state *pos_css;
bool root_load_balance = is_sched_load_balance(&top_cpuset);
bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys);
doms = NULL;
dattr = NULL;
csa = NULL;
/* Special case for the 99% of systems with one, full, sched domain */
if (root_load_balance && !top_cpuset.nr_subparts) {
if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
single_root_domain:
ndoms = 1;
doms = alloc_sched_domains(ndoms);
if (!doms)
@ -991,16 +1051,18 @@ static int generate_sched_domains(cpumask_var_t **domains,
cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
if (cp == &top_cpuset)
continue;
if (cgrpv2)
goto v2;
/*
* v1:
* Continue traversing beyond @cp iff @cp has some CPUs and
* isn't load balancing. The former is obvious. The
* latter: All child cpusets contain a subset of the
* parent's cpus, so just skip them, and then we call
* update_domain_attr_tree() to calc relax_domain_level of
* the corresponding sched domain.
*
* If root is load-balancing, we can skip @cp if it
* is a subset of the root's effective_cpus.
*/
if (!cpumask_empty(cp->cpus_allowed) &&
!(is_sched_load_balance(cp) &&
@ -1008,20 +1070,39 @@ static int generate_sched_domains(cpumask_var_t **domains,
housekeeping_cpumask(HK_TYPE_DOMAIN))))
continue;
if (root_load_balance &&
cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
continue;
if (is_sched_load_balance(cp) &&
!cpumask_empty(cp->effective_cpus))
csa[csn++] = cp;
/* skip @cp's subtree if not a partition root */
if (!is_partition_valid(cp))
/* skip @cp's subtree */
pos_css = css_rightmost_descendant(pos_css);
continue;
v2:
/*
* Only valid partition roots that are not isolated and with
* non-empty effective_cpus will be saved into csn[].
*/
if ((cp->partition_root_state == PRS_ROOT) &&
!cpumask_empty(cp->effective_cpus))
csa[csn++] = cp;
/*
* Skip @cp's subtree if not a partition root and has no
* exclusive CPUs to be granted to child cpusets.
*/
if (!is_partition_valid(cp) && cpumask_empty(cp->exclusive_cpus))
pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
/*
* If there are only isolated partitions underneath the cgroup root,
* we can optimize out unneeded sched domains scanning.
*/
if (root_load_balance && (csn == 1))
goto single_root_domain;
for (i = 0; i < csn; i++)
csa[i]->pn = i;
ndoms = csn;
@ -1064,6 +1145,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
GFP_KERNEL);
/*
* Cgroup v2 doesn't support domain attributes, just set all of them
* to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
* subset of HK_TYPE_DOMAIN housekeeping CPUs.
*/
if (cgrpv2) {
for (i = 0; i < ndoms; i++) {
cpumask_copy(doms[i], csa[i]->effective_cpus);
if (dattr)
dattr[i] = SD_ATTR_INIT;
}
goto done;
}
for (nslot = 0, i = 0; i < csn; i++) {
struct cpuset *a = csa[i];
struct cpumask *dp;
@ -1223,7 +1318,7 @@ static void rebuild_sched_domains_locked(void)
* root should be only a subset of the active CPUs. Since a CPU in any
* partition root could be offlined, all must be checked.
*/
if (top_cpuset.nr_subparts) {
if (!cpumask_empty(subpartitions_cpus)) {
rcu_read_lock();
cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
if (!is_partition_valid(cs)) {
@ -1338,7 +1433,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
*/
static int update_partition_exclusive(struct cpuset *cs, int new_prs)
{
bool exclusive = (new_prs > 0);
bool exclusive = (new_prs > PRS_MEMBER);
if (exclusive && !is_cpu_exclusive(cs)) {
if (update_flag(CS_CPU_EXCLUSIVE, cs, 1))
@ -1532,7 +1627,7 @@ EXPORT_SYMBOL_GPL(cpuset_cpu_is_isolated);
* Return: true if xcpus is not empty, false otherwise.
*
* Starting with exclusive_cpus (cpus_allowed if exclusive_cpus is not set),
* it must be a subset of cpus_allowed and parent's effective_xcpus.
* it must be a subset of parent's effective_xcpus.
*/
static bool compute_effective_exclusive_cpumask(struct cpuset *cs,
struct cpumask *xcpus)
@ -1542,12 +1637,7 @@ static bool compute_effective_exclusive_cpumask(struct cpuset *cs,
if (!xcpus)
xcpus = cs->effective_xcpus;
if (!cpumask_empty(cs->exclusive_cpus))
cpumask_and(xcpus, cs->exclusive_cpus, cs->cpus_allowed);
else
cpumask_copy(xcpus, cs->cpus_allowed);
return cpumask_and(xcpus, xcpus, parent->effective_xcpus);
return cpumask_and(xcpus, user_xcpus(cs), parent->effective_xcpus);
}
static inline bool is_remote_partition(struct cpuset *cs)
@ -1826,8 +1916,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
*/
adding = deleting = false;
old_prs = new_prs = cs->partition_root_state;
xcpus = !cpumask_empty(cs->exclusive_cpus)
? cs->effective_xcpus : cs->cpus_allowed;
xcpus = user_xcpus(cs);
if (cmd == partcmd_invalidate) {
if (is_prs_invalid(old_prs))
@ -1855,7 +1944,7 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
return is_partition_invalid(parent)
? PERR_INVPARENT : PERR_NOTPART;
}
if (!newmask && cpumask_empty(cs->cpus_allowed))
if (!newmask && xcpus_empty(cs))
return PERR_CPUSEMPTY;
nocpu = tasks_nocpu_error(parent, cs, xcpus);
@ -2583,8 +2672,6 @@ static int update_exclusive_cpumask(struct cpuset *cs, struct cpuset *trialcs,
retval = cpulist_parse(buf, trialcs->exclusive_cpus);
if (retval < 0)
return retval;
if (!is_cpu_exclusive(cs))
set_bit(CS_CPU_EXCLUSIVE, &trialcs->flags);
}
/* Nothing to do if the CPUs didn't change */
@ -3071,9 +3158,9 @@ static int update_prstate(struct cpuset *cs, int new_prs)
? partcmd_enable : partcmd_enablei;
/*
* cpus_allowed cannot be empty.
* cpus_allowed and exclusive_cpus cannot be both empty.
*/
if (cpumask_empty(cs->cpus_allowed)) {
if (xcpus_empty(cs)) {
err = PERR_CPUSEMPTY;
goto out;
}
@ -4009,8 +4096,6 @@ cpuset_css_alloc(struct cgroup_subsys_state *parent_css)
}
__set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
nodes_clear(cs->mems_allowed);
nodes_clear(cs->effective_mems);
fmeter_init(&cs->fmeter);
cs->relax_domain_level = -1;
INIT_LIST_HEAD(&cs->remote_sibling);
@ -4040,6 +4125,12 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
set_bit(CS_SPREAD_PAGE, &cs->flags);
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
/*
* For v2, clear CS_SCHED_LOAD_BALANCE if parent is isolated
*/
if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
!is_sched_load_balance(parent))
clear_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
cpuset_inc();
@ -4050,14 +4141,6 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
cs->use_parent_ecpus = true;
parent->child_ecpus_count++;
}
/*
* For v2, clear CS_SCHED_LOAD_BALANCE if parent is isolated
*/
if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
!is_sched_load_balance(parent))
clear_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
spin_unlock_irq(&callback_lock);
if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
@ -4571,7 +4654,7 @@ static void cpuset_handle_hotplug(void)
* In the rare case that hotplug removes all the cpus in
* subpartitions_cpus, we assumed that cpus are updated.
*/
if (!cpus_updated && top_cpuset.nr_subparts)
if (!cpus_updated && !cpumask_empty(subpartitions_cpus))
cpus_updated = true;
/* For v1, synchronize cpus_allowed to cpu_active_mask */
@ -5051,10 +5134,14 @@ int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
if (!buf)
goto out;
css = task_get_css(tsk, cpuset_cgrp_id);
retval = cgroup_path_ns(css->cgroup, buf, PATH_MAX,
current->nsproxy->cgroup_ns);
css_put(css);
rcu_read_lock();
spin_lock_irq(&css_set_lock);
css = task_css(tsk, cpuset_cgrp_id);
retval = cgroup_path_ns_locked(css->cgroup, buf, PATH_MAX,
current->nsproxy->cgroup_ns);
spin_unlock_irq(&css_set_lock);
rcu_read_unlock();
if (retval == -E2BIG)
retval = -ENAMETOOLONG;
if (retval < 0)


@ -121,6 +121,30 @@ static void misc_cg_cancel_charge(enum misc_res_type type, struct misc_cg *cg,
misc_res_name[type]);
}
static void misc_cg_update_watermark(struct misc_res *res, u64 new_usage)
{
u64 old;
while (true) {
old = atomic64_read(&res->watermark);
if (new_usage <= old)
break;
if (atomic64_cmpxchg(&res->watermark, old, new_usage) == old)
break;
}
}
static void misc_cg_event(enum misc_res_type type, struct misc_cg *cg)
{
atomic64_inc(&cg->res[type].events_local);
cgroup_file_notify(&cg->events_local_file);
for (; parent_misc(cg); cg = parent_misc(cg)) {
atomic64_inc(&cg->res[type].events);
cgroup_file_notify(&cg->events_file);
}
}
/**
* misc_cg_try_charge() - Try charging the misc cgroup.
* @type: Misc res type to charge.
@ -159,14 +183,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
ret = -EBUSY;
goto err_charge;
}
misc_cg_update_watermark(res, new_usage);
}
return 0;
err_charge:
for (j = i; j; j = parent_misc(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
misc_cg_event(type, i);
for (j = cg; j != i; j = parent_misc(j))
misc_cg_cancel_charge(type, j, amount);
@ -307,6 +329,29 @@ static int misc_cg_current_show(struct seq_file *sf, void *v)
return 0;
}
/**
* misc_cg_peak_show() - Show the peak usage of the misc cgroup.
* @sf: Interface file
* @v: Arguments passed
*
* Context: Any context.
* Return: 0 to denote successful print.
*/
static int misc_cg_peak_show(struct seq_file *sf, void *v)
{
int i;
u64 watermark;
struct misc_cg *cg = css_misc(seq_css(sf));
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
watermark = atomic64_read(&cg->res[i].watermark);
if (READ_ONCE(misc_res_capacity[i]) || watermark)
seq_printf(sf, "%s %llu\n", misc_res_name[i], watermark);
}
return 0;
}
/**
* misc_cg_capacity_show() - Show the total capacity of misc res on the host.
* @sf: Interface file
@ -331,20 +376,33 @@ static int misc_cg_capacity_show(struct seq_file *sf, void *v)
return 0;
}
static int misc_events_show(struct seq_file *sf, void *v)
static int __misc_events_show(struct seq_file *sf, bool local)
{
struct misc_cg *cg = css_misc(seq_css(sf));
u64 events;
int i;
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
events = atomic64_read(&cg->res[i].events);
if (local)
events = atomic64_read(&cg->res[i].events_local);
else
events = atomic64_read(&cg->res[i].events);
if (READ_ONCE(misc_res_capacity[i]) || events)
seq_printf(sf, "%s.max %llu\n", misc_res_name[i], events);
}
return 0;
}
static int misc_events_show(struct seq_file *sf, void *v)
{
return __misc_events_show(sf, false);
}
static int misc_events_local_show(struct seq_file *sf, void *v)
{
return __misc_events_show(sf, true);
}
/* Misc cgroup interface files */
static struct cftype misc_cg_files[] = {
{
@ -357,6 +415,10 @@ static struct cftype misc_cg_files[] = {
.name = "current",
.seq_show = misc_cg_current_show,
},
{
.name = "peak",
.seq_show = misc_cg_peak_show,
},
{
.name = "capacity",
.seq_show = misc_cg_capacity_show,
@ -368,6 +430,12 @@ static struct cftype misc_cg_files[] = {
.file_offset = offsetof(struct misc_cg, events_file),
.seq_show = misc_events_show,
},
{
.name = "events.local",
.flags = CFTYPE_NOT_ON_ROOT,
.file_offset = offsetof(struct misc_cg, events_local_file),
.seq_show = misc_events_local_show,
},
{}
};


@ -38,6 +38,14 @@
#define PIDS_MAX (PID_MAX_LIMIT + 1ULL)
#define PIDS_MAX_STR "max"
enum pidcg_event {
/* Fork failed in subtree because this pids_cgroup limit was hit. */
PIDCG_MAX,
/* Fork failed in this pids_cgroup because ancestor limit was hit. */
PIDCG_FORKFAIL,
NR_PIDCG_EVENTS,
};
struct pids_cgroup {
struct cgroup_subsys_state css;
@ -49,11 +57,12 @@ struct pids_cgroup {
atomic64_t limit;
int64_t watermark;
/* Handle for "pids.events" */
/* Handles for pids.events[.local] */
struct cgroup_file events_file;
struct cgroup_file events_local_file;
/* Number of times fork failed because limit was hit. */
atomic64_t events_limit;
atomic64_t events[NR_PIDCG_EVENTS];
atomic64_t events_local[NR_PIDCG_EVENTS];
};
static struct pids_cgroup *css_pids(struct cgroup_subsys_state *css)
@ -148,12 +157,13 @@ static void pids_charge(struct pids_cgroup *pids, int num)
* pids_try_charge - hierarchically try to charge the pid count
* @pids: the pid cgroup state
* @num: the number of pids to charge
* @fail: storage of pid cgroup causing the fail
*
* This function follows the set limit. It will fail if the charge would cause
* the new value to exceed the hierarchical limit. Returns 0 if the charge
* succeeded, otherwise -EAGAIN.
*/
static int pids_try_charge(struct pids_cgroup *pids, int num)
static int pids_try_charge(struct pids_cgroup *pids, int num, struct pids_cgroup **fail)
{
struct pids_cgroup *p, *q;
@ -166,9 +176,10 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
* p->limit is %PIDS_MAX then we know that this test will never
* fail.
*/
if (new > limit)
if (new > limit) {
*fail = p;
goto revert;
}
/*
* Not technically accurate if we go over limit somewhere up
* the hierarchy, but that's tolerable for the watermark.
@ -229,6 +240,36 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
}
}
static void pids_event(struct pids_cgroup *pids_forking,
struct pids_cgroup *pids_over_limit)
{
struct pids_cgroup *p = pids_forking;
bool limit = false;
/* Only log the first time limit is hit. */
if (atomic64_inc_return(&p->events_local[PIDCG_FORKFAIL]) == 1) {
pr_info("cgroup: fork rejected by pids controller in ");
pr_cont_cgroup_path(p->css.cgroup);
pr_cont("\n");
}
cgroup_file_notify(&p->events_local_file);
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
return;
for (; parent_pids(p); p = parent_pids(p)) {
if (p == pids_over_limit) {
limit = true;
atomic64_inc(&p->events_local[PIDCG_MAX]);
cgroup_file_notify(&p->events_local_file);
}
if (limit)
atomic64_inc(&p->events[PIDCG_MAX]);
cgroup_file_notify(&p->events_file);
}
}
/*
* task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies
* on cgroup_threadgroup_change_begin() held by the copy_process().
@ -236,7 +277,7 @@ static void pids_cancel_attach(struct cgroup_taskset *tset)
static int pids_can_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
struct pids_cgroup *pids, *pids_over_limit;
int err;
if (cset)
@ -244,16 +285,10 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset)
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
err = pids_try_charge(pids, 1);
if (err) {
/* Only log the first time events_limit is incremented. */
if (atomic64_inc_return(&pids->events_limit) == 1) {
pr_info("cgroup: fork rejected by pids controller in ");
pr_cont_cgroup_path(css->cgroup);
pr_cont("\n");
}
cgroup_file_notify(&pids->events_file);
}
err = pids_try_charge(pids, 1, &pids_over_limit);
if (err)
pids_event(pids, pids_over_limit);
return err;
}
@ -337,11 +372,32 @@ static s64 pids_peak_read(struct cgroup_subsys_state *css,
return READ_ONCE(pids->watermark);
}
static int pids_events_show(struct seq_file *sf, void *v)
static int __pids_events_show(struct seq_file *sf, bool local)
{
struct pids_cgroup *pids = css_pids(seq_css(sf));
enum pidcg_event pe = PIDCG_MAX;
atomic64_t *events;
seq_printf(sf, "max %lld\n", (s64)atomic64_read(&pids->events_limit));
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS) {
pe = PIDCG_FORKFAIL;
local = true;
}
events = local ? pids->events_local : pids->events;
seq_printf(sf, "max %lld\n", (s64)atomic64_read(&events[pe]));
return 0;
}
static int pids_events_show(struct seq_file *sf, void *v)
{
__pids_events_show(sf, false);
return 0;
}
static int pids_events_local_show(struct seq_file *sf, void *v)
{
__pids_events_show(sf, true);
return 0;
}
@ -368,9 +424,42 @@ static struct cftype pids_files[] = {
.file_offset = offsetof(struct pids_cgroup, events_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "events.local",
.seq_show = pids_events_local_show,
.file_offset = offsetof(struct pids_cgroup, events_local_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
{ } /* terminate */
};
static struct cftype pids_files_legacy[] = {
{
.name = "max",
.write = pids_max_write,
.seq_show = pids_max_show,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "current",
.read_s64 = pids_current_read,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = pids_peak_read,
},
{
.name = "events",
.seq_show = pids_events_show,
.file_offset = offsetof(struct pids_cgroup, events_file),
.flags = CFTYPE_NOT_ON_ROOT,
},
{ } /* terminate */
};
struct cgroup_subsys pids_cgrp_subsys = {
.css_alloc = pids_css_alloc,
.css_free = pids_css_free,
@ -379,7 +468,7 @@ struct cgroup_subsys pids_cgrp_subsys = {
.can_fork = pids_can_fork,
.cancel_fork = pids_cancel_fork,
.release = pids_release,
.legacy_cftypes = pids_files,
.legacy_cftypes = pids_files_legacy,
.dfl_cftypes = pids_files,
.threaded = true,
};


@ -594,49 +594,46 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
}
}
static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
{
#ifdef CONFIG_SCHED_CORE
u64 forceidle_time = bstat->forceidle_sum;
do_div(forceidle_time, NSEC_PER_USEC);
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
#endif
}
void cgroup_base_stat_cputime_show(struct seq_file *seq)
{
struct cgroup *cgrp = seq_css(seq)->cgroup;
u64 usage, utime, stime;
struct cgroup_base_stat bstat;
#ifdef CONFIG_SCHED_CORE
u64 forceidle_time;
#endif
if (cgroup_parent(cgrp)) {
cgroup_rstat_flush_hold(cgrp);
usage = cgrp->bstat.cputime.sum_exec_runtime;
cputime_adjust(&cgrp->bstat.cputime, &cgrp->prev_cputime,
&utime, &stime);
#ifdef CONFIG_SCHED_CORE
forceidle_time = cgrp->bstat.forceidle_sum;
#endif
cgroup_rstat_flush_release(cgrp);
} else {
root_cgroup_cputime(&bstat);
usage = bstat.cputime.sum_exec_runtime;
utime = bstat.cputime.utime;
stime = bstat.cputime.stime;
#ifdef CONFIG_SCHED_CORE
forceidle_time = bstat.forceidle_sum;
#endif
/* cgrp->bstat of root is not actually used, reuse it */
root_cgroup_cputime(&cgrp->bstat);
usage = cgrp->bstat.cputime.sum_exec_runtime;
utime = cgrp->bstat.cputime.utime;
stime = cgrp->bstat.cputime.stime;
}
do_div(usage, NSEC_PER_USEC);
do_div(utime, NSEC_PER_USEC);
do_div(stime, NSEC_PER_USEC);
#ifdef CONFIG_SCHED_CORE
do_div(forceidle_time, NSEC_PER_USEC);
#endif
seq_printf(seq, "usage_usec %llu\n"
"user_usec %llu\n"
"system_usec %llu\n",
usage, utime, stime);
#ifdef CONFIG_SCHED_CORE
seq_printf(seq, "core_sched.force_idle_usec %llu\n", forceidle_time);
#endif
cgroup_force_idle_show(seq, &cgrp->bstat);
}
/* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */


@ -1,11 +1,12 @@
# SPDX-License-Identifier: GPL-2.0-only
test_memcontrol
test_core
test_freezer
test_kmem
test_kill
test_cpu
test_cpuset
test_zswap
test_freezer
test_hugetlb_memcg
test_kill
test_kmem
test_memcontrol
test_pids
test_zswap
wait_inotify


@ -6,26 +6,29 @@ all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh
TEST_PROGS := test_stress.sh test_cpuset_prs.sh test_cpuset_v1_hp.sh
TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
TEST_GEN_PROGS += test_freezer
TEST_GEN_PROGS += test_kill
# Keep the lists lexicographically sorted
TEST_GEN_PROGS = test_core
TEST_GEN_PROGS += test_cpu
TEST_GEN_PROGS += test_cpuset
TEST_GEN_PROGS += test_zswap
TEST_GEN_PROGS += test_freezer
TEST_GEN_PROGS += test_hugetlb_memcg
TEST_GEN_PROGS += test_kill
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_memcontrol
TEST_GEN_PROGS += test_pids
TEST_GEN_PROGS += test_zswap
LOCAL_HDRS += $(selfdir)/clone3/clone3_selftests.h $(selfdir)/pidfd/pidfd.h
include ../lib.mk
$(OUTPUT)/test_memcontrol: cgroup_util.c
$(OUTPUT)/test_kmem: cgroup_util.c
$(OUTPUT)/test_core: cgroup_util.c
$(OUTPUT)/test_freezer: cgroup_util.c
$(OUTPUT)/test_kill: cgroup_util.c
$(OUTPUT)/test_cpu: cgroup_util.c
$(OUTPUT)/test_cpuset: cgroup_util.c
$(OUTPUT)/test_zswap: cgroup_util.c
$(OUTPUT)/test_freezer: cgroup_util.c
$(OUTPUT)/test_hugetlb_memcg: cgroup_util.c
$(OUTPUT)/test_kill: cgroup_util.c
$(OUTPUT)/test_kmem: cgroup_util.c
$(OUTPUT)/test_memcontrol: cgroup_util.c
$(OUTPUT)/test_pids: cgroup_util.c
$(OUTPUT)/test_zswap: cgroup_util.c
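One way to build the reordered list, including the new test_pids, is the
standard kselftest target (paths relative to the kernel source tree):

    $ make -C tools/testing/selftests TARGETS=cgroup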


@ -28,6 +28,14 @@ CPULIST=$(cat $CGROUP2/cpuset.cpus.effective)
NR_CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
[[ $NR_CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
# Check to see if /dev/console exists and is writable
if [[ -c /dev/console && -w /dev/console ]]
then
CONSOLE=/dev/console
else
CONSOLE=/dev/null
fi
# Set verbose flag and delay factor
PROG=$1
VERBOSE=0
@ -103,8 +111,8 @@ console_msg()
{
MSG=$1
echo "$MSG"
echo "" > /dev/console
echo "$MSG" > /dev/console
echo "" > $CONSOLE
echo "$MSG" > $CONSOLE
pause 0.01
}
@ -161,6 +169,14 @@ test_add_proc()
# T = put a task into cgroup
# O<c>=<v> = Write <v> to CPU online file of <c>
#
# ECPUs - effective CPUs of cpusets
# Pstate - partition root state
# ISOLCPUS - isolated CPUs (<icpus>[,<icpus2>])
#
# Note that if there are 2 fields in ISOLCPUS, the first one is for
# sched-debug matching which includes offline CPUs and single-CPU partitions
# while the second one is for matching cpuset.cpus.isolated.
#
SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
TEST_MATRIX=(
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
@ -220,23 +236,29 @@ TEST_MATRIX=(
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3:P2 . . 0 A1:0-1,A2:2-3,A3:2-3 A1:P0,A2:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X3:P2 . . 0 A1:0-2,A2:3,A3:3 A1:P0,A2:P2 3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2 . 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-2,A2:1-2,A3:3 A1:P0,A3:P2 3"
" C0-3:S+ C1-3:S+ C2-3 . X2-3 X2-3 X2-3:P2:C3 . 0 A1:0-1,A2:1,A3:2-3 A1:P0,A3:P2 2-3"
" C0-3:S+ C1-3:S+ C2-3 C2-3 . . . P2 0 A1:0-3,A2:1-3,A3:2-3,B1:2-3 A1:P0,A3:P0,B1:P-2"
" C0-3:S+ C1-3:S+ C2-3 C4-5 . . . P2 0 B1:4-5 B1:P2 4-5"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2 0 A3:2-3,B1:4 A3:P2,B1:P2 2-4"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2:C1-3 P2 0 A3:2-3,B1:4 A3:P2,B1:P2 2-4"
" C0-3:S+ C1-3:S+ C2-3 C4 X1-3 X1-3:P2 P2 . 0 A2:1,A3:2-3 A2:P2,A3:P2 1-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3 X2-3:P2 P2:C4-5 0 A3:2-3,B1:4-5 A3:P2,B1:P2 2-5"
" C4:X0-3:S+ X1-3:S+ X2-3 . . P2 . . 0 A1:4,A2:1-3,A3:1-3 A2:P2 1-3"
" C4:X0-3:S+ X1-3:S+ X2-3 . . . P2 . 0 A1:4,A2:4,A3:2-3 A3:P2 2-3"
# Nested remote/local partition tests
" C0-3:S+ C1-3:S+ C2-3 C4-5 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4-5 \
A1:P0,A2:P1,A3:P2,B1:P1 2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,2-3"
" C0-3:S+ C1-3:S+ C2-3 C4 X2-3 X2-3:P1 . P1 0 A1:0-1,A2:2-3,A3:2-3,B1:4 \
A1:P0,A2:P1,A3:P0,B1:P1"
" C0-3:S+ C1-3:S+ C3 C4 X2-3 X2-3:P1 P2 P1 0 A1:0-1,A2:2,A3:3,B1:4 \
A1:P0,A2:P1,A3:P2,B1:P1 2-4,3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X4:P1 . 0 A1:0-1,A2:2-3,A3:4 \
A1:P0,A2:P2,A3:P1 2-4,2-3"
" C0-4:S+ C1-4:S+ C2-4 . X2-4 X2-4:P2 X3-4:P1 . 0 A1:0-1,A2:2,A3:3-4 \
A1:P0,A2:P2,A3:P1 2"
" C0-4:X2-4:S+ C1-4:X2-4:S+:P2 C2-4:X4:P1 \
. . X5 . . 0 A1:0-4,A2:1-4,A3:2-4 \
A1:P0,A2:P-2,A3:P-1"
@ -262,8 +284,8 @@ TEST_MATRIX=(
. . X2-3 P2 . . 0 A1:0-2,A2:3,XA2:3 A2:P2 3"
# Invalid to valid local partition direct transition tests
" C1-3:S+:P2 C2-3:X1:P2 . . . . . . 0 A1:1-3,XA1:1-3,A2:2-3:XA2: A1:P2,A2:P-2 1-3"
" C1-3:S+:P2 C2-3:X1:P2 . . . X3:P2 . . 0 A1:1-2,XA1:1-3,A2:3:XA2:3 A1:P2,A2:P2 1-3"
" C1-3:S+:P2 X4:P2 . . . . . . 0 A1:1-3,XA1:1-3,A2:1-3:XA2: A1:P2,A2:P-2 1-3"
" C1-3:S+:P2 X4:P2 . . . X3:P2 . . 0 A1:1-2,XA1:1-3,A2:3:XA2:3 A1:P2,A2:P2 1-3"
" C0-3:P2 . . C4-6 C0-4 . . . 0 A1:0-4,B1:4-6 A1:P-2,B1:P0"
" C0-3:P2 . . C4-6 C0-4:C0-3 . . . 0 A1:0-3,B1:4-6 A1:P2,B1:P0 0-3"
" C0-3:P2 . . C3-5:C4-5 . . . . 0 A1:0-3,B1:4-5 A1:P2,B1:P0 0-3"
@ -274,21 +296,18 @@ TEST_MATRIX=(
" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
. . X4 . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
" C0-3:X1-3:S+:P2 C1-3:X2-3:S+:P2 C2-3:X3:P2 \
. . C4 . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
. . C4:X . . 0 A1:1-3,A2:1-3,A3:2-3,XA2:,XA3: A1:P2,A2:P-2,A3:P-2 1-3"
# Local partition CPU change tests
" C0-5:S+:P2 C4-5:S+:P1 . . . C3-5 . . 0 A1:0-2,A2:3-5 A1:P2,A2:P1 0-2"
" C0-5:S+:P2 C4-5:S+:P1 . . C1-5 . . . 0 A1:1-3,A2:4-5 A1:P2,A2:P1 1-3"
# cpus_allowed/exclusive_cpus update tests
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \
. X:C4 . P2 . 0 A1:4,A2:4,XA2:,XA3:,A3:4 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. X1 . P2 . 0 A1:0-3,A2:1-3,XA1:1,XA2:,XA3:,A3:2-3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. . C3 P2 . 0 A1:0-2,A2:0-2,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P2 3"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3 \
. . X3 P2 . 0 A1:0-2,A2:1-2,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P2 3"
@ -296,10 +315,7 @@ TEST_MATRIX=(
. . X3 . . 0 A1:0-3,A2:1-3,XA2:3,XA3:3,A3:2-3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. . C3 . . 0 A1:0-3,A2:3,XA2:3,XA3:3,A3:3 \
A1:P0,A3:P-2"
" C0-3:X2-3:S+ C1-3:X2-3:S+ C2-3:X2-3:P2 \
. C4 . . . 0 A1:4,A2:4,A3:4,XA1:,XA2:,XA3 \
. X4 . . . 0 A1:0-3,A2:1-3,A3:2-3,XA1:4,XA2:,XA3 \
A1:P0,A3:P-2"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
@ -346,6 +362,9 @@ TEST_MATRIX=(
" C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P-1"
" C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P0,B1:P-1"
# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
" C0-3 . . C4-5 X5 . . . 0 A1:0-3,B1:4-5"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
# Failure cases:
@ -355,6 +374,9 @@ TEST_MATRIX=(
# Changes to cpuset.cpus.exclusive that violate exclusivity rule is rejected
" C0-3 . . C4-5 X0-3 . . X3-5 1 A1:0-3,B1:4-5"
# cpuset.cpus cannot be a subset of sibling cpuset.cpus.exclusive
" C0-3 . . C4-5 X3-5 . . . 1 A1:0-3,B1:4-5"
)
#
@ -556,14 +578,15 @@ check_cgroup_states()
do
set -- $(echo $CHK | sed -e "s/:/ /g")
CGRP=$1
CGRP_DIR=$CGRP
STATE=$2
FILE=
EVAL=$(expr substr $STATE 2 2)
[[ $CGRP = A2 ]] && CGRP=A1/A2
[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
[[ $CGRP = A2 ]] && CGRP_DIR=A1/A2
[[ $CGRP = A3 ]] && CGRP_DIR=A1/A2/A3
case $STATE in
P*) FILE=$CGRP/cpuset.cpus.partition
P*) FILE=$CGRP_DIR/cpuset.cpus.partition
;;
*) echo "Unknown state: $STATE!"
exit 1
@ -587,6 +610,16 @@ check_cgroup_states()
;;
esac
[[ $EVAL != $VAL ]] && return 1
#
# For root partition, dump sched-domains info to console if
# verbose mode set for manual comparison with sched debug info.
#
[[ $VAL -eq 1 && $VERBOSE -gt 0 ]] && {
DOMS=$(cat $CGRP_DIR/cpuset.cpus.effective)
[[ -n "$DOMS" ]] &&
echo " [$CGRP] sched-domain: $DOMS" > $CONSOLE
}
done
return 0
}
@ -694,9 +727,9 @@ null_isolcpus_check()
[[ $VERBOSE -gt 0 ]] || return 0
# Retry a few times before printing error
RETRY=0
while [[ $RETRY -lt 5 ]]
while [[ $RETRY -lt 8 ]]
do
pause 0.01
pause 0.02
check_isolcpus "."
[[ $? -eq 0 ]] && return 0
((RETRY++))
@ -726,7 +759,7 @@ run_state_test()
while [[ $I -lt $CNT ]]
do
echo "Running test $I ..." > /dev/console
echo "Running test $I ..." > $CONSOLE
[[ $VERBOSE -gt 1 ]] && {
echo ""
eval echo \${$TEST[$I]}
@ -783,7 +816,7 @@ run_state_test()
while [[ $NEWLIST != $CPULIST && $RETRY -lt 8 ]]
do
# Wait a bit longer & recheck a few times
pause 0.01
pause 0.02
((RETRY++))
NEWLIST=$(cat cpuset.cpus.effective)
done


@ -0,0 +1,178 @@
// SPDX-License-Identifier: GPL-2.0
#define _GNU_SOURCE
#include <errno.h>
#include <linux/limits.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include "../kselftest.h"
#include "cgroup_util.h"
static int run_success(const char *cgroup, void *arg)
{
return 0;
}
static int run_pause(const char *cgroup, void *arg)
{
return pause();
}
/*
* This test checks that pids.max prevents forking new children above the
* specified limit in the cgroup.
*/
static int test_pids_max(const char *root)
{
int ret = KSFT_FAIL;
char *cg_pids;
int pid;
cg_pids = cg_name(root, "pids_test");
if (!cg_pids)
goto cleanup;
if (cg_create(cg_pids))
goto cleanup;
if (cg_read_strcmp(cg_pids, "pids.max", "max\n"))
goto cleanup;
if (cg_write(cg_pids, "pids.max", "2"))
goto cleanup;
if (cg_enter_current(cg_pids))
goto cleanup;
pid = cg_run_nowait(cg_pids, run_pause, NULL);
if (pid < 0)
goto cleanup;
if (cg_run_nowait(cg_pids, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
if (kill(pid, SIGINT))
goto cleanup;
ret = KSFT_PASS;
cleanup:
cg_enter_current(root);
cg_destroy(cg_pids);
free(cg_pids);
return ret;
}
/*
* This test checks that pids.events are counted in the cgroup associated with the enforced pids.max
*/
static int test_pids_events(const char *root)
{
int ret = KSFT_FAIL;
char *cg_parent = NULL, *cg_child = NULL;
int pid;
cg_parent = cg_name(root, "pids_parent");
cg_child = cg_name(cg_parent, "pids_child");
if (!cg_parent || !cg_child)
goto cleanup;
if (cg_create(cg_parent))
goto cleanup;
if (cg_write(cg_parent, "cgroup.subtree_control", "+pids"))
goto cleanup;
if (cg_create(cg_child))
goto cleanup;
if (cg_write(cg_parent, "pids.max", "2"))
goto cleanup;
if (cg_read_strcmp(cg_child, "pids.max", "max\n"))
goto cleanup;
if (cg_enter_current(cg_child))
goto cleanup;
pid = cg_run_nowait(cg_child, run_pause, NULL);
if (pid < 0)
goto cleanup;
if (cg_run_nowait(cg_child, run_success, NULL) != -1 || errno != EAGAIN)
goto cleanup;
if (kill(pid, SIGINT))
goto cleanup;
if (cg_read_key_long(cg_child, "pids.events", "max ") != 0)
goto cleanup;
if (cg_read_key_long(cg_parent, "pids.events", "max ") != 1)
goto cleanup;
ret = KSFT_PASS;
cleanup:
cg_enter_current(root);
if (cg_child)
cg_destroy(cg_child);
if (cg_parent)
cg_destroy(cg_parent);
free(cg_child);
free(cg_parent);
return ret;
}
#define T(x) { x, #x }
struct pids_test {
int (*fn)(const char *root);
const char *name;
} tests[] = {
T(test_pids_max),
T(test_pids_events),
};
#undef T
int main(int argc, char **argv)
{
char root[PATH_MAX];
ksft_print_header();
ksft_set_plan(ARRAY_SIZE(tests));
if (cg_find_unified_root(root, sizeof(root), NULL))
ksft_exit_skip("cgroup v2 isn't mounted\n");
/*
* Check that pids controller is available:
* pids is listed in cgroup.controllers
*/
if (cg_read_strstr(root, "cgroup.controllers", "pids"))
ksft_exit_skip("pids controller isn't available\n");
if (cg_read_strstr(root, "cgroup.subtree_control", "pids"))
if (cg_write(root, "cgroup.subtree_control", "+pids"))
ksft_exit_skip("Failed to set pids controller\n");
for (int i = 0; i < ARRAY_SIZE(tests); i++) {
switch (tests[i].fn(root)) {
case KSFT_PASS:
ksft_test_result_pass("%s\n", tests[i].name);
break;
case KSFT_SKIP:
ksft_test_result_skip("%s\n", tests[i].name);
break;
default:
ksft_test_result_fail("%s\n", tests[i].name);
break;
}
}
ksft_finished();
}
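If built as above, the new binary can also be run directly; the expected shape
of the output follows the kselftest TAP helpers used in main() (exact lines and
summary output assumed, root typically required):

    $ cd tools/testing/selftests/cgroup
    $ sudo ./test_pids
    TAP version 13
    1..2
    ok 1 test_pids_max
    ok 2 test_pids_events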