linux/kernel
Eric W. Biederman a606488513 pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
Serge Hallyn <serge.hallyn@ubuntu.com> writes:

> Since commit af4b8a83ad it's been
> possible to get into a situation where a pidns reaper is
> <defunct>, reparented to host pid 1, but never reaped.  How to
> reproduce this is documented at
>
> https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526
> (and see
> https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1168526/comments/13)
> In short, run repeated starts of a container whose init is
>
> Process.exit(0);
>
> sysrq-t when such a task is playing zombie shows:
>
> [  131.132978] init            x ffff88011fc14580     0  2084   2039 0x00000000
> [  131.132978]  ffff880116e89ea8 0000000000000002 ffff880116e89fd8 0000000000014580
> [  131.132978]  ffff880116e89fd8 0000000000014580 ffff8801172a0000 ffff8801172a0000
> [  131.132978]  ffff8801172a0630 ffff88011729fff0 ffff880116e14650 ffff88011729fff0
> [  131.132978] Call Trace:
> [  131.132978]  [<ffffffff816f6159>] schedule+0x29/0x70
> [  131.132978]  [<ffffffff81064591>] do_exit+0x6e1/0xa40
> [  131.132978]  [<ffffffff81071eae>] ? signal_wake_up_state+0x1e/0x30
> [  131.132978]  [<ffffffff8106496f>] do_group_exit+0x3f/0xa0
> [  131.132978]  [<ffffffff810649e4>] SyS_exit_group+0x14/0x20
> [  131.132978]  [<ffffffff8170102f>] tracesys+0xe1/0xe6
>
> Further debugging showed that every time this happened, zap_pid_ns_processes()
> started with nr_hashed being 3, while we were expecting it to drop to 2.
> Any time it didn't happen, nr_hashed was 1 or 2.  So the reaper was
> waiting for nr_hashed to become 2, but free_pid() only wakes the reaper
> if nr_hashed hits 1.

The issue is that when the task group leader of an init process exits
before other tasks of the init process when the init process finally
exits it will be a secondary task sleeping in zap_pid_ns_processes and
waiting to wake up when the number of hashed pids drops to two.  This
case waits forever as free_pid only sends a wake up when the number of
hashed pids drops to 1.

To correct this the simple strategy of sending a possibly unncessary
wake up when the number of hashed pids drops to 2 is adopted.

Sending one extraneous wake up is relatively harmless, at worst we
waste a little cpu time in the rare case when a pid namespace
appropaches exiting.

We can detect the case when the pid namespace drops to just two pids
hashed race free in free_pid.

Dereferencing pid_ns->child_reaper with the pidmap_lock held is safe
without out the tasklist_lock because it is guaranteed that the
detach_pid will be called on the child_reaper before it is freed and
detach_pid calls __change_pid which calls free_pid which takes the
pidmap_lock.  __change_pid only calls free_pid if this is the
last use of the pid.  For a thread that is not the thread group leader
the threads pid will only ever have one user because a threads pid
is not allowed to be the pid of a process, of a process group or
a session.  For a thread that is a thread group leader all of
the other threads of that process will be reaped before it is allowed
for the thread group leader to be reaped ensuring there will only
be one user of the threads pid as a process pid.  Furthermore
because the thread is the init process of a pid namespace all of the
other processes in the pid namespace will have also been already freed
leading to the fact that the pid will not be used as a session pid or
a process group pid for any other running process.

CC: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
Reported-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-08-30 17:30:37 -07:00
..
cpu idle: Enable interrupts in the weak arch_cpu_idle() implementation 2013-06-14 23:01:05 +02:00
debug kgdb/sysrq: fix inconstistent help message of sysrq key 2013-04-30 17:04:10 -07:00
events perf: Fix perf_lock_task_context() vs RCU 2013-07-12 11:11:09 +02:00
gcov kernel/gcov: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:39:33 -08:00
irq Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-07-13 15:37:30 -07:00
power Merge branch 'akpm' (updates from Andrew Morton) 2013-07-03 17:12:13 -07:00
sched sched: Fix HRTICK 2013-07-12 13:52:58 +02:00
time tick: broadcast: Check broadcast mode on CPU hotplug 2013-07-12 12:35:40 +02:00
trace The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
.gitignore kernel/hz.bc: ignore. 2013-04-22 07:09:06 -07:00
acct.c fs: Fix hang with BSD accounting on frozen filesystem 2013-05-04 14:57:58 -04:00
async.c async: rename and redefine async_func_ptr 2013-03-12 13:59:14 -07:00
audit.c audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE 2013-06-12 16:29:45 -07:00
audit.h audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record 2013-07-09 10:33:19 -07:00
audit_tree.c kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules() 2013-06-12 16:29:46 -07:00
audit_watch.c audit: catch possible NULL audit buffers 2013-01-11 14:54:55 -08:00
auditfilter.c audit: Fix decimal constant description 2013-07-09 10:33:19 -07:00
auditsc.c audit: fix mq_open and mq_unlink to add the MQ root as a hidden parent audit_names record 2013-07-09 10:33:19 -07:00
backtracetest.c
bounds.c
capability.c Add file_ns_capable() helper function for open-time capability checking 2013-04-14 10:06:31 -07:00
cgroup.c cgroup: we can use simple_lookup() now 2013-07-14 17:50:23 +04:00
cgroup_freezer.c cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() 2012-11-19 08:13:38 -08:00
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
configs.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
context_tracking.c Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-06-20 08:18:35 -10:00
cpu.c CPU hotplug: provide a generic helper to disable/enable CPU hotplug 2013-06-12 16:29:44 -07:00
cpu_pm.c
cpuset.c Merge branch 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-07-02 20:04:25 -07:00
crash_dump.c
cred.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2012-12-18 10:55:28 -08:00
delayacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
dma.c
elfcore.c
exec_domain.c
exit.c ptrace: revert "Prepare to fix racy accesses on task breakpoints" 2013-07-09 10:33:26 -07:00
extable.c extable: Flip the sorting message 2013-04-15 13:25:16 +02:00
fork.c mm: remove free_area_cache 2013-07-10 18:11:34 -07:00
freezer.c freezer: skip waking up tasks with PF_FREEZER_SKIP set 2013-05-12 14:16:22 +02:00
futex.c futex: Use freezable blocking call 2013-06-25 23:11:19 +02:00
futex_compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-02-23 18:50:11 -08:00
groups.c
hrtimer.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-07-06 14:09:38 -07:00
hung_task.c
irq_work.c Merge branch 'nohz/printk-v8' into irq/core 2013-02-05 00:48:46 +01:00
itimer.c
jump_label.c
kallsyms.c kernel: kallsyms: memory override issue, need check destination buffer length 2013-04-15 15:17:26 +09:30
kcmp.c kcmp: include linux/ptrace.h 2012-12-20 17:40:19 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH" 2013-05-28 08:50:00 +02:00
Kconfig.preempt
kexec.c kexec: Use min() and min_t() to simplify logic 2013-04-30 17:04:07 -07:00
kmod.c usermodehelper: kill the sub_info->path[0] check 2013-07-03 16:08:02 -07:00
kprobes.c kprobes: handle empty/invalid input to debugfs "enabled" file 2013-07-03 16:07:46 -07:00
ksysfs.c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-12-11 18:10:49 -08:00
kthread.c kthread: implement probe_kthread_data() 2013-04-30 17:04:02 -07:00
latencytop.c
lglock.c
lockdep.c lockdep: remove task argument from debug_check_no_locks_held 2013-05-12 14:16:21 +02:00
lockdep_internals.h
lockdep_proc.c lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name() 2012-10-24 12:39:09 +02:00
lockdep_states.h
Makefile reboot: move shutdown/reboot related functions to kernel/reboot.c 2013-07-09 10:33:29 -07:00
modsign_certificate.S CONFIG_SYMBOL_PREFIX: cleanup. 2013-03-15 15:09:43 +10:30
modsign_pubkey.c keys: use keyring_alloc() to create module signing keyring 2012-12-20 17:40:21 -08:00
module-internal.h MODSIGN: Move the magic string to the end of a module and eliminate the search 2012-10-19 17:30:40 -07:00
module.c Nothing interesting. Except the most embarrassing bugfix ever. But let's 2013-07-10 14:51:41 -07:00
module_signing.c MODSIGN: Don't use enum-type bitfields in module signature info block 2012-12-05 11:27:24 +10:30
mutex-debug.c
mutex-debug.h
mutex.c mutex: Move ww_mutex definitions to ww_mutex.h 2013-07-12 12:07:46 +02:00
mutex.h
notifier.c
nsproxy.c kernel/nsproxy.c: Improving a snippet of code. 2013-08-26 17:45:56 -07:00
padata.c padata: use __this_cpu_read per-cpu helper 2012-12-06 17:16:23 +08:00
panic.c The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
params.c There is no /sys/parameters 2013-07-02 15:38:19 +09:30
pid.c pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup 2013-08-30 17:30:37 -07:00
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
posix-cpu-timers.c posix_timers: fix racy timer delta caching on task exit 2013-07-03 16:54:42 +02:00
posix-timers.c posix-timers: Remove unused variable 2013-04-18 12:51:19 +02:00
printk.c Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-07-10 18:14:24 -07:00
profile.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
ptrace.c ptrace: PTRACE_DETACH should do flush_ptrace_hw_breakpoint(child) 2013-07-09 10:33:26 -07:00
range.c range: Do not add new blank slot with add_range_with_merge 2013-06-18 11:32:10 -05:00
rcu.h rcu: Provide RCU CPU stall warnings for tiny RCU 2013-01-28 22:06:21 -08:00
rcupdate.c Merge branches 'cbnum.2013.06.10a', 'doc.2013.06.10a', 'fixes.2013.06.10a', 'srcu.2013.06.10a' and 'tiny.2013.06.10a' into HEAD 2013-06-10 13:46:44 -07:00
rcutiny.c rcu: Shrink TINY_RCU by reworking CPU-stall ifdefs 2013-06-10 13:45:53 -07:00
rcutiny_plugin.h rcu: Shrink TINY_RCU by reworking CPU-stall ifdefs 2013-06-10 13:45:53 -07:00
rcutorture.c rcu: Remove srcu_read_lock_raw() and srcu_read_unlock_raw(). 2013-06-10 13:45:25 -07:00
rcutree.c drivers: avoid parsing names as kthread_run() format strings 2013-07-03 16:07:41 -07:00
rcutree.h rcu: Drive quiescent-state-forcing delay from HZ 2013-06-10 13:44:56 -07:00
rcutree_plugin.h Merge branches 'cbnum.2013.06.10a', 'doc.2013.06.10a', 'fixes.2013.06.10a', 'srcu.2013.06.10a' and 'tiny.2013.06.10a' into HEAD 2013-06-10 13:46:44 -07:00
rcutree_trace.c rcutrace: single_open() leaks 2013-05-05 00:16:35 -04:00
reboot.c reboot: move arch/x86 reboot= handling to generic kernel 2013-07-09 10:33:29 -07:00
relay.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
res_counter.c res_counter: return amount of charges after res_counter_uncharge() 2012-12-18 15:02:12 -08:00
resource.c kernel/resource.c: remove the unneeded assignment in function __find_resource 2013-07-03 16:08:06 -07:00
rtmutex-debug.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
rtmutex-debug.h
rtmutex-tester.c locking/rtmutex/tester: Set correct permissions on sysfs files 2013-04-10 14:48:37 +02:00
rtmutex.c rtmutex: Document rt_mutex_adjust_prio_chain() 2013-05-28 09:23:52 +02:00
rtmutex.h
rtmutex_common.h
rwsem.c Revert "rw_semaphore: remove up/down_read_non_owner" 2013-03-23 15:53:52 -07:00
seccomp.c seccomp: allow BPF_XOR based ALU instructions. 2013-03-26 11:07:19 +11:00
semaphore.c semaphore: use `bool' type for semaphore_waiter's up 2013-04-30 17:04:08 -07:00
signal.c sigtimedwait: use freezable blocking call 2013-05-12 14:16:23 +02:00
smp.c kernel/smp.c: cleanups 2013-04-30 17:04:03 -07:00
smpboot.c kthread: Prevent unpark race which puts threads on the wrong cpu 2013-04-12 14:18:43 +02:00
smpboot.h
softirq.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-07-02 16:14:35 -07:00
spinlock.c
srcu.c srcu: use ACCESS_ONCE() to access sp->completed in srcu_read_lock() 2013-02-07 15:19:36 -08:00
stacktrace.c
stop_machine.c stop_machine: Mark per cpu stopper enabled early 2013-02-26 22:25:17 +01:00
sys.c reboot: move shutdown/reboot related functions to kernel/reboot.c 2013-07-09 10:33:29 -07:00
sys_ni.c unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
sysctl.c Merge branch 'linus' into timers/urgent 2013-07-12 12:34:42 +02:00
sysctl_binary.c kernel: remove unnecessary head file 2013-06-26 18:01:46 +09:00
task_work.c
taskstats.c taskstats: cgroupstats_user_cmd() may leak on error 2012-10-06 03:05:31 +09:00
test_kprobes.c kernel/: rename random32() to prandom_u32() 2013-04-29 18:28:42 -07:00
time.c sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
timeconst.bc kernel: Replace timeconst.pl with a bc script 2013-02-16 23:17:25 +01:00
timer.c timer: Fix jiffies wrap behavior of round_jiffies_common() 2013-06-28 17:10:11 +02:00
tracepoint.c Tracing updates for Linux 3.10 2013-04-29 13:55:38 -07:00
tsacct.c cputime: Use accessors to read task cputime stats 2013-01-27 19:23:31 +01:00
uid16.c make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect 2013-03-03 22:58:33 -05:00
up.c
user-return-notifier.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
user.c userns: Better restrictions on when proc and sysfs can be mounted 2013-08-26 19:17:03 -07:00
user_namespace.c userns: Better restrictions on when proc and sysfs can be mounted 2013-08-26 19:17:03 -07:00
utsname.c proc: Split the namespace stuff out into linux/proc_ns.h 2013-05-01 17:29:39 -04:00
utsname_sysctl.c kernel/utsname_sysctl.c: put get/get_uts() into CONFIG_PROC_SYSCTL code block 2013-02-27 19:10:22 -08:00
wait.c Add wait_on_atomic_t() and wake_up_atomic_t() 2013-05-15 13:50:38 +01:00
watchdog.c watchdog: Boot-disable by default on full dynticks 2013-06-20 15:46:32 +02:00
workqueue.c Merge branch 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2013-07-02 19:53:30 -07:00
workqueue_internal.h sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00