linux/kernel/sched
Johannes Weiner d583d360a6 psi: Fix psi state corruption when schedule() races with cgroup move
4117cebf1a ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
2021-05-06 15:33:26 +02:00
..
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h
clock.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
completion.c completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() 2020-03-23 18:40:25 +01:00
core.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
cpuacct.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
cpudeadline.c sched,rt: Use the full cpumask for balancing 2020-11-10 18:39:00 +01:00
cpudeadline.h
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-12 17:59:43 +01:00
cpufreq_schedutil.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
cpupri.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
deadline.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
debug.c sched/debug: Fix cgroup_path[] serialization 2021-04-21 13:55:42 +02:00
fair.c kthread: Fix PF_KTHREAD vs to_kthread() race 2021-04-21 13:55:42 +02:00
features.h sched: Warn on long periods of pending need_resched 2021-04-21 13:55:41 +02:00
idle.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
isolation.c isolcpus: Affine unbound kernel threads to housekeeping cpus 2020-06-15 14:10:03 +02:00
loadavg.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
Makefile kcsan: Improve various small stylistic details 2019-11-20 10:47:23 +01:00
membarrier.c sched/membarrier: fix missing local execution of ipi_sync_rq_state() 2021-03-06 12:40:21 +01:00
pelt.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
pelt.h sched: Fix various typos 2021-03-22 00:11:52 +01:00
psi.c psi: Fix psi state corruption when schedule() races with cgroup move 2021-05-06 15:33:26 +02:00
rt.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched: Warn on long periods of pending need_resched 2021-04-21 13:55:41 +02:00
smp.h sched/headers: Split out open-coded prototypes into kernel/sched/smp.h 2020-05-28 11:03:20 +02:00
stats.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
stats.h psi: Optimize task switch inside shared cgroups 2021-03-06 12:40:23 +01:00
stop_task.c sched: Remove select_task_rq()'s sd_flag parameter 2020-11-10 18:39:06 +01:00
swait.c sched/swait: Prepare usage in completions 2020-03-21 16:00:23 +01:00
topology.c sched/debug: Rename the sched_debug parameter to sched_verbose 2021-04-17 13:22:44 +02:00
wait.c sched/wait: Add add_wait_queue_priority() 2020-11-15 09:49:09 -05:00
wait_bit.c sched/wait: fix ___wait_var_event(exclusive) 2019-12-17 13:32:50 +01:00