linux/kernel
Johannes Weiner 1b69ac6b40 psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating.  However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again.  This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)

Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep.  To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself.  If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity.  But that
should not be a problem.  The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation.  If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb414681d5 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
..
bpf bpf: fix inner map masking to prevent oob under speculation 2019-01-18 15:19:56 -08:00
cgroup Merge branch 'for-4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2018-12-29 10:57:20 -08:00
configs kvm_config: add CONFIG_VIRTIO_MENU 2018-10-24 20:55:56 -04:00
debug kdb: use bool for binary state indicators 2018-12-30 08:31:52 +00:00
dma swiotlb: clear io_tlb_start and io_tlb_end in swiotlb_exit 2019-01-16 09:59:17 -05:00
events Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
gcov
irq genirq/irqdesc: Fix double increment in alloc_descs() 2019-01-18 00:43:09 +01:00
livepatch livepatch: Replace synchronize_sched() with synchronize_rcu() 2018-12-01 12:38:50 -08:00
locking locking/rwsem: Fix (possible) missed wakeup 2019-01-21 11:15:39 +01:00
power mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
printk Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
rcu rcutorture: Don't do busted forward-progress testing 2018-12-01 12:45:42 -08:00
sched psi: fix aggregation idle shut-off 2019-02-01 15:46:23 -08:00
time posix-cpu-timers: Unbreak timer rearming 2019-01-15 16:34:37 +01:00
trace tracing/kprobes: Fix NULL pointer dereference in trace_kprobe_create() 2019-01-15 11:33:45 -05:00
.gitignore
acct.c
async.c
audit.c audit: remove duplicated include from audit.c 2018-12-14 12:09:30 -05:00
audit.h audit: use current whenever possible 2018-11-26 18:41:21 -05:00
audit_fsnotify.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
audit_tree.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
audit_watch.c audit: minimize our use of audit_log_format() 2018-11-26 18:40:00 -05:00
auditfilter.c
auditsc.c audit: use current whenever possible 2018-11-26 18:41:21 -05:00
backtracetest.c
bounds.c kbuild: fix kernel/bounds.c 'W=1' warning 2018-10-31 08:54:14 -07:00
capability.c
compat.c make 'user_access_begin()' do 'access_ok()' 2019-01-04 12:56:09 -08:00
configs.c
context_tracking.c
cpu.c x86/speculation: Rework SMT state change 2018-11-28 11:57:07 +01:00
cpu_pm.c
crash_core.c kernel/crash_core.c: print timestamp using time64_t 2018-08-22 10:52:47 -07:00
crash_dump.c
cred.c cred: export get_task_cred(). 2018-12-19 13:52:44 -05:00
delayacct.c delayacct: track delays from thrashing cache pages 2018-10-26 16:26:32 -07:00
dma.c
elfcore.c
exec_domain.c
exit.c kernel/exit.c: release ptraced tasks before zap_pid_ns_processes 2019-02-01 15:46:23 -08:00
extable.c
fail_function.c kernel/fail_function.c: remove meaningless null pointer check before debugfs_remove_recursive 2018-10-31 08:54:12 -07:00
fork.c Merge branch 'akpm' (patches from Andrew) 2019-01-08 18:58:29 -08:00
freezer.c PM / reboot: Eliminate race between reboot and suspend 2018-08-06 12:35:20 +02:00
futex.c futex: Fix (possible) missed wakeup 2019-01-21 11:15:38 +01:00
groups.c
hung_task.c kernel/hung_task.c: break RCU locks based on jiffies 2019-01-04 13:13:45 -08:00
iomem.c
irq_work.c
jump_label.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
kallsyms.c kallsyms: reduce size a little on 64-bit 2018-09-10 22:54:33 +09:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt kconfig: warn no new line at end of file 2018-12-15 17:44:35 +09:00
kcov.c kernel/kcov.c: mark write_comp_data() as notrace 2019-01-04 13:13:47 -08:00
kexec.c kexec: add call to LSM hook in original kexec_load syscall 2018-07-16 12:31:57 -07:00
kexec_core.c mm: convert totalram_pages and totalhigh_pages variables to atomic 2018-12-28 12:11:47 -08:00
kexec_file.c kexec_file: kexec_walk_memblock() only walks a dedicated region at kdump 2018-12-06 14:38:50 +00:00
kexec_internal.h
kmod.c
kprobes.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-12-26 14:45:18 -08:00
ksysfs.c
kthread.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
latencytop.c
Makefile kbuild: change filechk to surround the given command with { } 2019-01-06 09:46:51 +09:00
memremap.c mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm 2018-12-28 12:11:52 -08:00
module-internal.h
module.c jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00
module_signing.c modsign: use all trusted keys to verify module signature 2018-11-07 14:41:41 +01:00
notifier.c
nsproxy.c
padata.c padata: clean an indentation issue, remove extraneous space 2018-11-16 14:11:04 +08:00
panic.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
params.c
pid.c Fix failure path in alloc_pid() 2018-12-28 12:42:30 -08:00
pid_namespace.c signal: Use group_send_sig_info to kill all processes in a pid namespace 2018-09-16 16:08:25 +02:00
profile.c mm: remove include/linux/bootmem.h 2018-10-31 08:54:16 -07:00
ptrace.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
range.c
reboot.c kernel/reboot.c: export pm_power_off_prepare 2018-09-11 16:13:24 +01:00
relay.c
resource.c kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable 2018-12-28 12:11:49 -08:00
rseq.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
seccomp.c seccomp: fix UAF in user-trap code 2019-01-15 09:43:12 -08:00
signal.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
smp.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
smpboot.c
smpboot.h
softirq.c Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-10-25 11:43:47 -07:00
stackleak.c stackleak: Mark stackleak_track_stack() as notrace 2018-12-05 19:31:44 -08:00
stacktrace.c
stop_machine.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-08-13 11:25:07 -07:00
sys.c kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore 2019-01-14 10:38:03 +12:00
sys_ni.c y2038: socket: Add compat_sys_recvmmsg_time64 2018-12-18 16:13:04 +01:00
sysctl.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
sysctl_binary.c kernel/sysctl: add panic_print into sysctl 2019-01-04 13:13:47 -08:00
task_work.c
taskstats.c
test_kprobes.c
torture.c torture: Remove unnecessary "ret" variables 2018-12-01 12:45:35 -08:00
tracepoint.c tracing: Replace synchronize_sched() and call_rcu_sched() 2018-11-27 09:21:41 -08:00
tsacct.c
ucount.c
uid16.c
uid16.h
umh.c umh: add exit routine for UMH process 2019-01-11 18:05:40 -08:00
up.c smp,cpumask: introduce on_each_cpu_cond_mask 2018-10-09 16:51:11 +02:00
user-return-notifier.c
user.c userns: use irqsave variant of refcount_dec_and_lock() 2018-08-22 10:52:47 -07:00
user_namespace.c userns: also map extents in the reverse map to kernel IDs 2018-11-07 23:51:16 -06:00
utsname.c
utsname_sysctl.c sys: don't hold uts_sem while accessing userspace memory 2018-08-11 02:05:53 -05:00
watchdog.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
watchdog_hld.c watchdog: Mark watchdog touch functions as notrace 2018-08-30 12:56:40 +02:00
workqueue.c psi: fix aggregation idle shut-off 2019-02-01 15:46:23 -08:00
workqueue_internal.h psi: fix aggregation idle shut-off 2019-02-01 15:46:23 -08:00