linux/kernel/sched
Aaron Lu 1528c661c2 sched/fair: Ratelimit update to tg->load_avg
When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed there are times
update_cfs_group() and update_load_avg() shows noticeable overhead on
a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same taskgroup running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sappire Rapids, during a 5s window, the wakeup number is 14millions and
migration number is 11millions and with each migration, the task's load
will transfer from src cfs_rq to target cfs_rq and each change involves
an update to tg->load_avg. Since the workload can trigger as many wakeups
and migrations, the access(both read and write) to tg->load_avg can be
unbound. As a result, the two mentioned functions showed noticeable
overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
during a 5s window, wakeup number is 21millions and migration number is
14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because PELT window is roughly 1ms and it
delivered good results for the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.

  ==============================
  postgres_sysbench on SPR:
  25%
  base:   42382±19.8%
  patch:  50174±9.5%  (noise)

  50%
  base:   67626±1.3%
  patch:  67365±3.1%  (noise)

  75%
  base:   100216±1.2%
  patch:  112470±0.1% +12.2%

  100%
  base:    93671±0.4%
  patch:  113563±0.2% +21.2%

  ==============================
  hackbench on ICL:
  group=1
  base:    114912±5.2%
  patch:   117857±2.5%  (noise)

  group=4
  base:    359902±1.6%
  patch:   361685±2.7%  (noise)

  group=8
  base:    461070±0.8%
  patch:   491713±0.3% +6.6%

  group=16
  base:    309032±5.0%
  patch:   378337±1.3% +22.4%

  =============================
  hackbench on SPR:
  group=1
  base:    100768±2.9%
  patch:   103134±2.9%  (noise)

  group=4
  base:    413830±12.5%
  patch:   378660±16.6% (noise)

  group=8
  base:    436124±0.6%
  patch:   490787±3.2% +12.5%

  group=16
  base:    457730±3.2%
  patch:   680452±1.3% +48.8%

  ============================
  netperf/udp_rr on ICL
  25%
  base:    114413±0.1%
  patch:   115111±0.0% +0.6%

  50%
  base:    86803±0.5%
  patch:   86611±0.0%  (noise)

  75%
  base:    35959±5.3%
  patch:   49801±0.6% +38.5%

  100%
  base:    61951±6.4%
  patch:   70224±0.8% +13.4%

  ===========================
  netperf/udp_rr on SPR
  25%
  base:   104954±1.3%
  patch:  107312±2.8%  (noise)

  50%
  base:    55394±4.6%
  patch:   54940±7.4%  (noise)

  75%
  base:    13779±3.1%
  patch:   36105±1.1% +162%

  100%
  base:     9703±3.7%
  patch:   28011±0.2% +189%

  ==============================================
  netperf/tcp_stream on ICL (all in noise range)
  25%
  base:    43092±0.1%
  patch:   42891±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

  ===============================================
  netperf/tcp_stream on SPR (all in noise range)
  25%
  base:    34491±0.3%
  patch:   34886±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
2023-09-18 08:14:45 +02:00
..
autogroup.c
autogroup.h
build_policy.c
build_utility.c
clock.c Locking changes for v6.5: 2023-06-27 14:14:30 -07:00
completion.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
core.c freezer,sched: Use saved_state to reduce some spurious wakeups 2023-09-18 08:14:36 +02:00
core_sched.c
cpuacct.c
cpudeadline.c
cpudeadline.h
cpufreq.c
cpufreq_schedutil.c sched/fair, cpufreq: Introduce 'runnable boosting' 2023-06-05 21:13:44 +02:00
cpupri.c
cpupri.h
cputime.c cputime: remove cputime_to_nsecs fallback 2022-12-27 12:52:17 +01:00
deadline.c cgroup: Changes for v6.5 2023-06-27 16:54:21 -07:00
debug.c sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice 2023-07-19 09:43:59 +02:00
fair.c sched/fair: Ratelimit update to tg->load_avg 2023-09-18 08:14:45 +02:00
features.h sched/eevdf: Curb wakeup-preemption 2023-08-17 17:07:07 +02:00
idle.c sched/idle: Mark arch_cpu_idle_dead() __noreturn 2023-03-08 08:44:28 -08:00
isolation.c
loadavg.c
Makefile
membarrier.c sched/membarrier: Introduce MEMBARRIER_CMD_GET_REGISTRATIONS 2023-01-07 11:29:29 +01:00
pelt.c
pelt.h
psi.c Linux 6.5-rc2 2023-07-19 09:43:25 +02:00
rt.c sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset 2023-08-14 17:01:23 +02:00
sched-pelt.h
sched.h sched/fair: Ratelimit update to tg->load_avg 2023-09-18 08:14:45 +02:00
smp.h sched, smp: Trace smp callback causing an IPI 2023-03-24 11:01:29 +01:00
stats.c
stats.h sched/psi: Use task->psi_flags to clear in CPU migration 2022-10-30 10:12:15 +01:00
stop_task.c
swait.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
topology.c sched/topology: Fix sched_numa_find_nth_cpu() comment 2023-09-15 13:48:11 +02:00
wait.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
wait_bit.c