linux/Documentation/cgroups/freezer-subsystem.txt
Tejun Heo ef9fe980c6 cgroup_freezer: implement proper hierarchy support
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved.  They all operated
separately.

This patch implements proper hierarchy support.  If a cgroup is
frozen, all its descendants are frozen.  A cgroup is thawed iff it and
all its ancestors are THAWED.  freezer.self_freezing shows the current
freezing state for the cgroup itself.  freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.

freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state.  update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.

Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations.  Finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation and cgroup_freezer is
rather simple and can serve a good one.

As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.

Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too.  This behavior change is
intended and has been warned via .broken_hierarchy.

v2: Michal spotted a bug in freezer_change_state() - descendants were
    inheriting from the wrong ancestor.  Fixed.

v3: Documentation/cgroups/freezer-subsystem.txt updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
2012-11-09 10:52:30 -08:00

124 lines
4.8 KiB
Plaintext

The cgroup freezer is useful to batch job management system which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.
The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:
$ echo $$
16644
$ bash
$ echo $$
16690
From a second, unrelated bash shell:
$ kill -SIGSTOP 16690
$ kill -SIGCONT 16690
<at this point 16690 exits and causes 16644 to exit too>
This happens because bash can observe both signals and choose how it
responds to them.
Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.
In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.
The cgroup freezer is hierarchical. Freezing a cgroup freezes all
tasks beloning to the cgroup and all its descendant cgroups. Each
cgroup has its own state (self-state) and the state inherited from the
parent (parent-state). Iff both states are THAWED, the cgroup is
THAWED.
The following cgroupfs files are created by cgroup freezer.
* freezer.state: Read-write.
When read, returns the effective state of the cgroup - "THAWED",
"FREEZING" or "FROZEN". This is the combined self and parent-states.
If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
FREEZING cgroup transitions into FROZEN state when all tasks
belonging to the cgroup and its descendants become frozen. Note that
a cgroup reverts to FREEZING from FROZEN after a new task is added
to the cgroup or one of its descendant cgroups until the new task is
frozen.
When written, sets the self-state of the cgroup. Two values are
allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
if not already freezing, enters FREEZING state along with all its
descendant cgroups.
If THAWED is written, the self-state of the cgroup is changed to
THAWED. Note that the effective state may not change to THAWED if
the parent-state is still freezing. If a cgroup's effective state
becomes THAWED, all its descendants which are freezing because of
the cgroup also leave the freezing state.
* freezer.self_freezing: Read only.
Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
This value is 1 iff the last write to freezer.state was "FROZEN".
* freezer.parent_freezing: Read only.
Shows the parent-state. 0 if none of the cgroup's ancestors is
frozen; otherwise, 1.
The root cgroup is non-freezable and the above interface files don't
exist.
* Examples of usage :
# mkdir /sys/fs/cgroup/freezer
# mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
# mkdir /sys/fs/cgroup/freezer/0
# echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
to get status of the freezer subsystem :
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
to freeze all tasks in the container :
# echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
FREEZING
# cat /sys/fs/cgroup/freezer/0/freezer.state
FROZEN
to unfreeze all tasks in the container :
# echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
# cat /sys/fs/cgroup/freezer/0/freezer.state
THAWED
This is the basic mechanism which should do the right thing for user space task
in a simple scenario.