Commit graph

4059 commits

Author SHA1 Message Date
Vincent Guittot 7736ae5572 sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Hongyan Xia <hongyan.xia2@arm.com>
Reviewed-by: Tang Yizhou <yizhou.tang@shopee.com>
Reviewed-by: Yanteng Si <siyanteng@loongson.cn> [for the Chinese translation]
Reviewed-by: Alex Shi <alexs@kernel.org>
Link: https://lore.kernel.org/r/20231201161652.1241695-2-vincent.guittot@linaro.org
2023-12-23 15:59:56 +01:00
Vincent Guittot b3edde44e5 cpufreq/schedutil: Use a fixed reference frequency
cpuinfo.max_freq can change at runtime because of boost as an example. This
implies that the value could be different than the one that has been
used when computing the capacity of a CPU.

The new arch_scale_freq_ref() returns a fixed and coherent reference
frequency that can be used when computing a frequency based on utilization.

Use this arch_scale_freq_ref() when available and fallback to
policy otherwise.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lore.kernel.org/r/20231211104855.558096-4-vincent.guittot@linaro.org
2023-12-23 15:52:35 +01:00
Arnd Bergmann b1c3efe079 sched: fair: move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED,
and empty one that is only referenced when FAIR_GROUP_SCHED is disabled
but CGROUP_SCHED is still enabled.  If both are turned off, the functions
are still defined but the misisng prototype causes a W=1 warning:

kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'

Move the alternatives into the header as static inline functions with the
correct combination of #ifdef checks to avoid the warning without adding
even more complexity.

[A different patch with the same description got applied by accident
 and was later reverted, but the original patch is still missing]

Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nicolas@fjasle.eu>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Tudor Ambarus <tudor.ambarus@linaro.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-10 17:21:43 -08:00
Yiwei Lin 5068d84054 sched/fair: Update min_vruntime for reweight_entity() correctly
Since reweight_entity() may have chance to change the weight of
cfs_rq->curr entity, we should also update_min_vruntime() if
this is the case

Fixes: eab03c23c2 ("sched/eevdf: Fix vruntime adjustment on reweight")
Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Abel Wu <wuyun.abel@bytedance.com>
Link: https://lore.kernel.org/r/20231117080106.12890-1-s921975628@gmail.com
2023-11-29 15:43:52 +01:00
Vincent Guittot f12560779f sched/cpufreq: Rework iowait boost
Use the max value that has already been computed inside sugov_get_util()
to cap the iowait boost and remove dependency with uclamp_rq_util_with()
which is not used anymore.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-3-vincent.guittot@linaro.org
2023-11-23 11:32:02 +01:00
Vincent Guittot 9c0b4bb7f6 sched/cpufreq: Rework schedutil governor performance estimation
The current method to take into account uclamp hints when estimating the
target frequency can end in a situation where the selected target
frequency is finally higher than uclamp hints, whereas there are no real
needs. Such cases mainly happen because we are currently mixing the
traditional scheduler utilization signal with the uclamp performance
hints. By adding these 2 metrics, we loose an important information when
it comes to select the target frequency, and we have to make some
assumptions which can't fit all cases.

Rework the interface between the scheduler and schedutil governor in order
to propagate all information down to the cpufreq governor.

effective_cpu_util() interface changes and now returns the actual
utilization of the CPU with 2 optional inputs:

- The minimum performance for this CPU; typically the capacity to handle
  the deadline task and the interrupt pressure. But also uclamp_min
  request when available.

- The maximum targeting performance for this CPU which reflects the
  maximum level that we would like to not exceed. By default it will be
  the CPU capacity but can be reduced because of some performance hints
  set with uclamp. The value can be lower than actual utilization and/or
  min performance level.

A new sugov_effective_cpu_perf() interface is also available to compute
the final performance level that is targeted for the CPU, after applying
some cpufreq headroom and taking into account all inputs.

With these 2 functions, schedutil is now able to decide when it must go
above uclamp hints. It now also has a generic way to get the min
performance level.

The dependency between energy model and cpufreq governor and its headroom
policy doesn't exist anymore.

eenv_pd_max_util() asks schedutil for the targeted performance after
applying the impact of the waking task.

[ mingo: Refined the changelog & C comments. ]

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20231122133904.446032-2-vincent.guittot@linaro.org
2023-11-23 11:32:01 +01:00
Vincent Guittot 50181c0cff sched/pelt: Avoid underestimation of task utilization
Lukasz Luba reported that a thread's util_est can significantly decrease as
a result of sharing the CPU with other threads.

The use case can be easily reproduced with a periodic task TA that runs 1ms
and sleeps 100us. When the task is alone on the CPU, its max utilization and
its util_est is around 888. If another similar task starts to run on the
same CPU, TA will have to share the CPU runtime and its maximum utilization
will decrease around half the CPU capacity (512) then TA's util_est will
follow this new maximum trend which is only the result of sharing the CPU
with others tasks.

Such situation can be detected with runnable_avg wich is close or
equal to util_avg when TA is alone, but increases above util_avg when TA
shares the CPU with other threads and wait on the runqueue.

[ We prefer an util_est that overestimate rather than under estimate
  because in 1st case we will not provide enough performance to the
  task which will remain under-provisioned, whereas in the other case we
  will create some idle time which will enable to reduce contention and
  as a result reduces the util_est so the overestimate will be transient
  whereas the underestimate will remain. ]

[ mingo: Refined the changelog, added comments from the LKML discussion. ]

Reported-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@mail.gmail.com/#t
Link: https://lore.kernel.org/r/20231122140119.472110-1-vincent.guittot@linaro.org
Cc: Hongyan Xia <hongyan.xia2@arm.com>
2023-11-23 11:24:28 +01:00
Frederic Weisbecker 194600008d sched/timers: Explain why idle task schedules out on remote timer enqueue
Trying to avoid that didn't bring much value after testing, add comment
about this.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-3-frederic@kernel.org
2023-11-15 09:57:52 +01:00
Frederic Weisbecker dd5403869a sched/cpuidle: Comment about timers requirements VS idle handler
Add missing explanation concerning IRQs re-enablement constraints in
the cpuidle path against timers.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231114193840.4041-2-frederic@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra 63ba8422f8 sched/deadline: Introduce deadline servers
Low priority tasks (e.g., SCHED_OTHER) can suffer starvation if tasks
with higher priority (e.g., SCHED_FIFO) monopolize CPU(s).

RT Throttling has been introduced a while ago as a (mostly debug)
countermeasure one can utilize to reserve some CPU time for low priority
tasks (usually background type of work, e.g. workqueues, timers, etc.).
It however has its own problems (see documentation) and the undesired
effect of unconditionally throttling FIFO tasks even when no lower
priority activity needs to run (there are mechanisms to fix this issue
as well, but, again, with their own problems).

Introduce deadline servers to service low priority tasks needs under
starvation conditions. Deadline servers are built extending SCHED_DEADLINE
implementation to allow 2-level scheduling (a sched_deadline entity
becomes a container for lower priority scheduling entities).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/4968601859d920335cf85822eb573a5f179f04b8.1699095159.git.bristot@kernel.org
2023-11-15 09:57:51 +01:00
Peter Zijlstra 2f7a0f5894 sched/deadline: Move bandwidth accounting into {en,de}queue_dl_entity
In preparation of introducing !task sched_dl_entity; move the
bandwidth accounting into {en.de}queue_dl_entity().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/a86dccbbe44e021b8771627e1dae01a69b73466d.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra 9e07d45c52 sched/deadline: Collect sched_dl_entity initialization
Create a single function that initializes a sched_dl_entity.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lkml.kernel.org/r/51acc695eecf0a1a2f78f9a044e11ffd9b316bcf.1699095159.git.bristot@kernel.org
2023-11-15 09:57:50 +01:00
Peter Zijlstra c708a4dc5a sched: Unify more update_curr*()
Now that trace_sched_stat_runtime() no longer takes a vruntime
argument, the task specific bits are identical between
update_curr_common() and update_curr().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra 5fe6ec8f6a sched: Remove vruntime from trace_sched_stat_runtime()
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.

Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-11-15 09:57:49 +01:00
Peter Zijlstra 5d69eca542 sched: Unify runtime accounting across classes
All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime.

Collapse all that.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/54d148a144f26d9559698c4dd82d8859038a7380.1699095159.git.bristot@kernel.org
2023-11-15 09:57:48 +01:00
Abel Wu ee4373dc90 sched/eevdf: O(1) fastpath for task selection
Since the RB-tree is now sorted by deadline, let's first try the
leftmost entity which has the earliest virtual deadline. I've done
some benchmarks to see its effectiveness.

All the benchmarks are done inside a normal cpu cgroup in a clean
environment with cpu turbo disabled, on a dual-CPU Intel Xeon(R)
Platinum 8260 with 2 NUMA nodes each of which has 24C/48T.

  hackbench: process/thread + pipe/socket + 1/2/4/8 groups
  netperf:   TCP/UDP + STREAM/RR + 24/48/72/96/192 threads
  tbench:    loopback 24/48/72/96/192 threads
  schbench:  1/2/4/8 mthreads

  direct:    cfs_rq has only one entity
  parity:    RUN_TO_PARITY
  fast:      O(1) fastpath
  slow:	     heap search

    (%)		direct	parity	fast	slow
  hackbench	92.95	2.02	4.91	0.12
  netperf	68.08	6.60	24.18	1.14
  tbench	67.55	11.22	20.61	0.62
  schbench	69.91	2.65	25.73	1.71

The above results indicate that this fastpath really makes task
selection more efficient.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-4-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Abel Wu 2227a957e1 sched/eevdf: Sort the rbtree by virtual deadline
Sort the task timeline by virtual deadline and keep the min_vruntime
in the augmented tree, so we can avoid doubling the worst case cost
and make full use of the cached leftmost node to enable O(1) fastpath
picking in next patch.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231115033647.80785-3-wuyun.abel@bytedance.com
2023-11-15 09:57:47 +01:00
Raghavendra K T 84db47ca71 sched/numa: Fix mm numa_scan_seq based unconditional scan
Since commit fc137c0dda ("sched/numa: enhance vma scanning logic")

NUMA Balancing allows updating PTEs to trap NUMA hinting faults if the
task had previously accessed VMA. However unconditional scan of VMAs are
allowed during initial phase of VMA creation until process's
mm numa_scan_seq reaches 2 even though current task had not accessed VMA.

Rationale:
 - Without initial scan subsequent PTE update may never happen.
 - Give fair opportunity to all the VMAs to be scanned and subsequently
understand the access pattern of all the VMAs.

But it has a corner case where, if a VMA is created after some time,
process's mm numa_scan_seq could be already greater than 2.

For e.g., values of mm numa_scan_seq when VMAs are created by running
mmtest autonuma benchmark briefly looks like:
start_seq=0 : 459
start_seq=2 : 138
start_seq=3 : 144
start_seq=4 : 8
start_seq=8 : 1
start_seq=9 : 1
This results in no unconditional PTE updates for those VMAs created after
some time.

Fix:
 - Note down the initial value of mm numa_scan_seq in per VMA start_seq.
 - Allow unconditional scan till start_seq + 2.

Result:
SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
base kernel: upstream 6.6-rc6 with Mels patches [1] applied.

kernbench
==========		base                  patched %gain
Amean    elsp-128      165.09 ( 0.00%)      164.78 *   0.19%*

Duration User       41404.28    41375.08
Duration System      9862.22     9768.48
Duration Elapsed      519.87      518.72

Ops NUMA PTE updates           1041416.00      831536.00
Ops NUMA hint faults            263296.00      220966.00
Ops NUMA pages migrated         258021.00      212769.00
Ops AutoNUMA cost                 1328.67        1114.69

autonumabench

NUMA01_THREADLOCAL
==================
Amean  elsp-NUMA01_THREADLOCAL   81.79 (0.00%)  67.74 *  17.18%*

Duration User       54832.73    47379.67
Duration System        75.00      185.75
Duration Elapsed      576.72      476.09

Ops NUMA PTE updates                  394429.00    11121044.00
Ops NUMA hint faults                    1001.00     8906404.00
Ops NUMA pages migrated                  288.00     2998694.00
Ops AutoNUMA cost                          7.77       44666.84

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/2ea7cbce80ac7c62e90cbfb9653a7972f902439f.1697816692.git.raghavendra.kt@amd.com
2023-11-15 09:57:46 +01:00
Paul E. McKenney d6111cf45c sched: Use WRITE_ONCE() for p->on_rq
Since RCU-tasks uses READ_ONCE(p->on_rq), ensure the write-side
matches with WRITE_ONCE().

Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/e4896e0b-eacc-45a2-a7a8-de2280a51ecc@paulmck-laptop
2023-11-15 09:57:45 +01:00
Keisuke Nishimura 6d7e4782bc sched/fair: Fix the decision for load balance
should_we_balance is called for the decision to do load-balancing.
When sched ticks invoke this function, only one CPU should return
true. However, in the current code, two CPUs can return true. The
following situation, where b means busy and i means idle, is an
example, because CPU 0 and CPU 2 return true.

        [0, 1] [2, 3]
         b  b   i  b

This fix checks if there exists an idle CPU with busy sibling(s)
after looking for a CPU on an idle core. If some idle CPUs with busy
siblings are found, just the first one should do load-balancing.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231031133821.1570861-1-keisuke.nishimura@inria.fr
2023-11-14 22:27:01 +01:00
Abel Wu eab03c23c2 sched/eevdf: Fix vruntime adjustment on reweight
vruntime of the (on_rq && !0-lag) entity needs to be adjusted when
it gets re-weighted, and the calculations can be simplified based
on the fact that re-weight won't change the w-average of all the
entities. Please check the proofs in comments.

But adjusting vruntime can also cause position change in RB-tree
hence require re-queue to fix up which might be costly. This might
be avoided by deferring adjustment to the time the entity actually
leaves tree (dequeue/pick), but that will negatively affect task
selection and probably not good enough either.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231107090510.71322-2-wuyun.abel@bytedance.com
2023-11-14 22:27:00 +01:00
Linus Torvalds 8f6f76a6a2 As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs.
 
 The lengthier patch series are
 
 - "kdump: use generic functions to simplify crashkernel reservation in
   arch", from Baoquan He.  This is mainly cleanups and consolidation of
   the "crashkernel=" kernel parameter handling.
 
 - After much discussion, David Laight's "minmax: Relax type checks in
   min() and max()" is here.  Hopefully reduces some typecasting and the
   use of min_t() and max_t().
 
 - A group of patches from Oleg Nesterov which clean up and slightly fix
   our handling of reads from /proc/PID/task/...  and which remove
   task_struct.therad_group.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA
 jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4
 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0=
 =On4u
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:
 "As usual, lots of singleton and doubleton patches all over the tree
  and there's little I can say which isn't in the individual changelogs.

  The lengthier patch series are

   - 'kdump: use generic functions to simplify crashkernel reservation
     in arch', from Baoquan He. This is mainly cleanups and
     consolidation of the 'crashkernel=' kernel parameter handling

   - After much discussion, David Laight's 'minmax: Relax type checks in
     min() and max()' is here. Hopefully reduces some typecasting and
     the use of min_t() and max_t()

   - A group of patches from Oleg Nesterov which clean up and slightly
     fix our handling of reads from /proc/PID/task/... and which remove
     task_struct.thread_group"

* tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
  scripts/gdb/vmalloc: disable on no-MMU
  scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
  .mailmap: add address mapping for Tomeu Vizoso
  mailmap: update email address for Claudiu Beznea
  tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
  .mailmap: map Benjamin Poirier's address
  scripts/gdb: add lx_current support for riscv
  ocfs2: fix a spelling typo in comment
  proc: test ProtectionKey in proc-empty-vm test
  proc: fix proc-empty-vm test with vsyscall
  fs/proc/base.c: remove unneeded semicolon
  do_io_accounting: use sig->stats_lock
  do_io_accounting: use __for_each_thread()
  ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
  ocfs2: fix a typo in a comment
  scripts/show_delta: add __main__ judgement before main code
  treewide: mark stuff as __ro_after_init
  fs: ocfs2: check status values
  proc: test /proc/${pid}/statm
  compiler.h: move __is_constexpr() to compiler.h
  ...
2023-11-02 20:53:31 -10:00
Linus Torvalds ecae0bd517 Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
 
 - Kemeng Shi has contributed some compation maintenance work in the
   series "Fixes and cleanups to compaction".
 
 - Joel Fernandes has a patchset ("Optimize mremap during mutual
   alignment within PMD") which fixes an obscure issue with mremap()'s
   pagetable handling during a subsequent exec(), based upon an
   implementation which Linus suggested.
 
 - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
   following patch series:
 
 	mm/damon: misc fixups for documents, comments and its tracepoint
 	mm/damon: add a tracepoint for damos apply target regions
 	mm/damon: provide pseudo-moving sum based access rate
 	mm/damon: implement DAMOS apply intervals
 	mm/damon/core-test: Fix memory leaks in core-test
 	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
 
 - In the series "Do not try to access unaccepted memory" Adrian Hunter
   provides some fixups for the recently-added "unaccepted memory' feature.
   To increase the feature's checking coverage.  "Plug a few gaps where
   RAM is exposed without checking if it is unaccepted memory".
 
 - In the series "cleanups for lockless slab shrink" Qi Zheng has done
   some maintenance work which is preparation for the lockless slab
   shrinking code.
 
 - Qi Zheng has redone the earlier (and reverted) attempt to make slab
   shrinking lockless in the series "use refcount+RCU method to implement
   lockless slab shrink".
 
 - David Hildenbrand contributes some maintenance work for the rmap code
   in the series "Anon rmap cleanups".
 
 - Kefeng Wang does more folio conversions and some maintenance work in
   the migration code.  Series "mm: migrate: more folio conversion and
   unification".
 
 - Matthew Wilcox has fixed an issue in the buffer_head code which was
   causing long stalls under some heavy memory/IO loads.  Some cleanups
   were added on the way.  Series "Add and use bdev_getblk()".
 
 - In the series "Use nth_page() in place of direct struct page
   manipulation" Zi Yan has fixed a potential issue with the direct
   manipulation of hugetlb page frames.
 
 - In the series "mm: hugetlb: Skip initialization of gigantic tail
   struct pages if freed by HVO" has improved our handling of gigantic
   pages in the hugetlb vmmemmep optimizaton code.  This provides
   significant boot time improvements when significant amounts of gigantic
   pages are in use.
 
 - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
   rationalization and folio conversions in the hugetlb code.
 
 - Yin Fengwei has improved mlock()'s handling of large folios in the
   series "support large folio for mlock"
 
 - In the series "Expose swapcache stat for memcg v1" Liu Shixin has
   added statistics for memcg v1 users which are available (and useful)
   under memcg v2.
 
 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
   prctl so that userspace may direct the kernel to not automatically
   propagate the denial to child processes.  The series is named "MDWE
   without inheritance".
 
 - Kefeng Wang has provided the series "mm: convert numa balancing
   functions to use a folio" which does what it says.
 
 - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
   makes is possible for a process to propagate KSM treatment across
   exec().
 
 - Huang Ying has enhanced memory tiering's calculation of memory
   distances.  This is used to permit the dax/kmem driver to use "high
   bandwidth memory" in addition to Optane Data Center Persistent Memory
   Modules (DCPMM).  The series is named "memory tiering: calculate
   abstract distance based on ACPI HMAT"
 
 - In the series "Smart scanning mode for KSM" Stefan Roesch has
   optimized KSM by teaching it to retain and use some historical
   information from previous scans.
 
 - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
   series "mm: memcg: fix tracking of pending stats updates values".
 
 - In the series "Implement IOCTL to get and optionally clear info about
   PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
   us to atomically read-then-clear page softdirty state.  This is mainly
   used by CRIU.
 
 - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
   - a bunch of relatively minor maintenance tweaks to this code.
 
 - Matthew Wilcox has increased the use of the VMA lock over file-backed
   page faults in the series "Handle more faults under the VMA lock".  Some
   rationalizations of the fault path became possible as a result.
 
 - In the series "mm/rmap: convert page_move_anon_rmap() to
   folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
   and folio conversions.
 
 - In the series "various improvements to the GUP interface" Lorenzo
   Stoakes has simplified and improved the GUP interface with an eye to
   providing groundwork for future improvements.
 
 - Andrey Konovalov has sent along the series "kasan: assorted fixes and
   improvements" which does those things.
 
 - Some page allocator maintenance work from Kemeng Shi in the series
   "Two minor cleanups to break_down_buddy_pages".
 
 - In thes series "New selftest for mm" Breno Leitao has developed
   another MM self test which tickles a race we had between madvise() and
   page faults.
 
 - In the series "Add folio_end_read" Matthew Wilcox provides cleanups
   and an optimization to the core pagecache code.
 
 - Nhat Pham has added memcg accounting for hugetlb memory in the series
   "hugetlb memcg accounting".
 
 - Cleanups and rationalizations to the pagemap code from Lorenzo
   Stoakes, in the series "Abstract vma_merge() and split_vma()".
 
 - Audra Mitchell has fixed issues in the procfs page_owner code's new
   timestamping feature which was causing some misbehaviours.  In the
   series "Fix page_owner's use of free timestamps".
 
 - Lorenzo Stoakes has fixed the handling of new mappings of sealed files
   in the series "permit write-sealed memfd read-only shared mappings".
 
 - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
   series "Batch hugetlb vmemmap modification operations".
 
 - Some buffer_head folio conversions and cleanups from Matthew Wilcox in
   the series "Finish the create_empty_buffers() transition".
 
 - As a page allocator performance optimization Huang Ying has added
   automatic tuning to the allocator's per-cpu-pages feature, in the series
   "mm: PCP high auto-tuning".
 
 - Roman Gushchin has contributed the patchset "mm: improve performance
   of accounted kernel memory allocations" which improves their performance
   by ~30% as measured by a micro-benchmark.
 
 - folio conversions from Kefeng Wang in the series "mm: convert page
   cpupid functions to folios".
 
 - Some kmemleak fixups in Liu Shixin's series "Some bugfix about
   kmemleak".
 
 - Qi Zheng has improved our handling of memoryless nodes by keeping them
   off the allocation fallback list.  This is done in the series "handle
   memoryless nodes more appropriately".
 
 - khugepaged conversions from Vishal Moola in the series "Some
   khugepaged folio conversions".
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
 jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
 FgeUPAD1oasg6CP+INZvCj34waNxwAc=
 =E+Y4
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
 "Many singleton patches against the MM code. The patch series which are
  included in this merge do the following:

   - Kemeng Shi has contributed some compation maintenance work in the
     series 'Fixes and cleanups to compaction'

   - Joel Fernandes has a patchset ('Optimize mremap during mutual
     alignment within PMD') which fixes an obscure issue with mremap()'s
     pagetable handling during a subsequent exec(), based upon an
     implementation which Linus suggested

   - More DAMON/DAMOS maintenance and feature work from SeongJae Park i
     the following patch series:

	mm/damon: misc fixups for documents, comments and its tracepoint
	mm/damon: add a tracepoint for damos apply target regions
	mm/damon: provide pseudo-moving sum based access rate
	mm/damon: implement DAMOS apply intervals
	mm/damon/core-test: Fix memory leaks in core-test
	mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval

   - In the series 'Do not try to access unaccepted memory' Adrian
     Hunter provides some fixups for the recently-added 'unaccepted
     memory' feature. To increase the feature's checking coverage. 'Plug
     a few gaps where RAM is exposed without checking if it is
     unaccepted memory'

   - In the series 'cleanups for lockless slab shrink' Qi Zheng has done
     some maintenance work which is preparation for the lockless slab
     shrinking code

   - Qi Zheng has redone the earlier (and reverted) attempt to make slab
     shrinking lockless in the series 'use refcount+RCU method to
     implement lockless slab shrink'

   - David Hildenbrand contributes some maintenance work for the rmap
     code in the series 'Anon rmap cleanups'

   - Kefeng Wang does more folio conversions and some maintenance work
     in the migration code. Series 'mm: migrate: more folio conversion
     and unification'

   - Matthew Wilcox has fixed an issue in the buffer_head code which was
     causing long stalls under some heavy memory/IO loads. Some cleanups
     were added on the way. Series 'Add and use bdev_getblk()'

   - In the series 'Use nth_page() in place of direct struct page
     manipulation' Zi Yan has fixed a potential issue with the direct
     manipulation of hugetlb page frames

   - In the series 'mm: hugetlb: Skip initialization of gigantic tail
     struct pages if freed by HVO' has improved our handling of gigantic
     pages in the hugetlb vmmemmep optimizaton code. This provides
     significant boot time improvements when significant amounts of
     gigantic pages are in use

   - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
     rationalization and folio conversions in the hugetlb code

   - Yin Fengwei has improved mlock()'s handling of large folios in the
     series 'support large folio for mlock'

   - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
     added statistics for memcg v1 users which are available (and
     useful) under memcg v2

   - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
     prctl so that userspace may direct the kernel to not automatically
     propagate the denial to child processes. The series is named 'MDWE
     without inheritance'

   - Kefeng Wang has provided the series 'mm: convert numa balancing
     functions to use a folio' which does what it says

   - In the series 'mm/ksm: add fork-exec support for prctl' Stefan
     Roesch makes is possible for a process to propagate KSM treatment
     across exec()

   - Huang Ying has enhanced memory tiering's calculation of memory
     distances. This is used to permit the dax/kmem driver to use 'high
     bandwidth memory' in addition to Optane Data Center Persistent
     Memory Modules (DCPMM). The series is named 'memory tiering:
     calculate abstract distance based on ACPI HMAT'

   - In the series 'Smart scanning mode for KSM' Stefan Roesch has
     optimized KSM by teaching it to retain and use some historical
     information from previous scans

   - Yosry Ahmed has fixed some inconsistencies in memcg statistics in
     the series 'mm: memcg: fix tracking of pending stats updates
     values'

   - In the series 'Implement IOCTL to get and optionally clear info
     about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
     which permits us to atomically read-then-clear page softdirty
     state. This is mainly used by CRIU

   - Hugh Dickins contributed the series 'shmem,tmpfs: general
     maintenance', a bunch of relatively minor maintenance tweaks to
     this code

   - Matthew Wilcox has increased the use of the VMA lock over
     file-backed page faults in the series 'Handle more faults under the
     VMA lock'. Some rationalizations of the fault path became possible
     as a result

   - In the series 'mm/rmap: convert page_move_anon_rmap() to
     folio_move_anon_rmap()' David Hildenbrand has implemented some
     cleanups and folio conversions

   - In the series 'various improvements to the GUP interface' Lorenzo
     Stoakes has simplified and improved the GUP interface with an eye
     to providing groundwork for future improvements

   - Andrey Konovalov has sent along the series 'kasan: assorted fixes
     and improvements' which does those things

   - Some page allocator maintenance work from Kemeng Shi in the series
     'Two minor cleanups to break_down_buddy_pages'

   - In thes series 'New selftest for mm' Breno Leitao has developed
     another MM self test which tickles a race we had between madvise()
     and page faults

   - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
     and an optimization to the core pagecache code

   - Nhat Pham has added memcg accounting for hugetlb memory in the
     series 'hugetlb memcg accounting'

   - Cleanups and rationalizations to the pagemap code from Lorenzo
     Stoakes, in the series 'Abstract vma_merge() and split_vma()'

   - Audra Mitchell has fixed issues in the procfs page_owner code's new
     timestamping feature which was causing some misbehaviours. In the
     series 'Fix page_owner's use of free timestamps'

   - Lorenzo Stoakes has fixed the handling of new mappings of sealed
     files in the series 'permit write-sealed memfd read-only shared
     mappings'

   - Mike Kravetz has optimized the hugetlb vmemmap optimization in the
     series 'Batch hugetlb vmemmap modification operations'

   - Some buffer_head folio conversions and cleanups from Matthew Wilcox
     in the series 'Finish the create_empty_buffers() transition'

   - As a page allocator performance optimization Huang Ying has added
     automatic tuning to the allocator's per-cpu-pages feature, in the
     series 'mm: PCP high auto-tuning'

   - Roman Gushchin has contributed the patchset 'mm: improve
     performance of accounted kernel memory allocations' which improves
     their performance by ~30% as measured by a micro-benchmark

   - folio conversions from Kefeng Wang in the series 'mm: convert page
     cpupid functions to folios'

   - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
     kmemleak'

   - Qi Zheng has improved our handling of memoryless nodes by keeping
     them off the allocation fallback list. This is done in the series
     'handle memoryless nodes more appropriately'

   - khugepaged conversions from Vishal Moola in the series 'Some
     khugepaged folio conversions'"

[ bcachefs conflicts with the dynamically allocated shrinkers have been
  resolved as per Stephen Rothwell in

     https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/

  with help from Qi Zheng.

  The clone3 test filtering conflict was half-arsed by yours truly ]

* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
  mm/damon/sysfs: update monitoring target regions for online input commit
  mm/damon/sysfs: remove requested targets when online-commit inputs
  selftests: add a sanity check for zswap
  Documentation: maple_tree: fix word spelling error
  mm/vmalloc: fix the unchecked dereference warning in vread_iter()
  zswap: export compression failure stats
  Documentation: ubsan: drop "the" from article title
  mempolicy: migration attempt to match interleave nodes
  mempolicy: mmap_lock is not needed while migrating folios
  mempolicy: alloc_pages_mpol() for NUMA policy without vma
  mm: add page_rmappable_folio() wrapper
  mempolicy: remove confusing MPOL_MF_LAZY dead code
  mempolicy: mpol_shared_policy_init() without pseudo-vma
  mempolicy trivia: use pgoff_t in shared mempolicy tree
  mempolicy trivia: slightly more consistent naming
  mempolicy trivia: delete those ancient pr_debug()s
  mempolicy: fix migrate_pages(2) syscall return nr_failed
  kernfs: drop shared NUMA mempolicy hooks
  hugetlbfs: drop shared NUMA mempolicy pretence
  mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
  ...
2023-11-02 19:38:47 -10:00
Linus Torvalds 1e0c505e13 asm-generic updates for v6.7
The ia64 architecture gets its well-earned retirement as planned,
 now that there is one last (mostly) working release that will
 be maintained as an LTS kernel.
 
 The architecture specific system call tables are updated for
 the added map_shadow_stack() syscall and to remove references
 to the long-gone sys_lookup_dcookie() syscall.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ
 Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ
 tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq
 XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms
 R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ
 kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv
 shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4
 LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9
 ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD
 eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A
 BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W
 vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI=
 =H1vH
 -----END PGP SIGNATURE-----

Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic

Pull ia64 removal and asm-generic updates from Arnd Bergmann:

 - The ia64 architecture gets its well-earned retirement as planned,
   now that there is one last (mostly) working release that will be
   maintained as an LTS kernel.

 - The architecture specific system call tables are updated for the
   added map_shadow_stack() syscall and to remove references to the
   long-gone sys_lookup_dcookie() syscall.

* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
  hexagon: Remove unusable symbols from the ptrace.h uapi
  asm-generic: Fix spelling of architecture
  arch: Reserve map_shadow_stack() syscall number for all architectures
  syscalls: Cleanup references to sys_lookup_dcookie()
  Documentation: Drop or replace remaining mentions of IA64
  lib/raid6: Drop IA64 support
  Documentation: Drop IA64 from feature descriptions
  kernel: Drop IA64 support from sig_fault handlers
  arch: Remove Itanium (IA-64) architecture
2023-11-01 15:28:33 -10:00
Linus Torvalds ad1871ad8d Power management updates for 6.7-rc1
- Add support for several Qualcomm SoC versions and other similar
    changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
    Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
    Gerhold and Varadarajan Narayanan).
 
  - Clean up the tegra cpufreq driver (Sumit Gupta).
 
  - Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
    Herring).
 
  - Add support for TI's am62p5 Soc (Bryan Brattlof).
 
  - Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
    Fainelli).
 
  - Update Kconfig to mention i.MX7 as well (Alexander Stein).
 
  - Revise global turbo disable check in intel_pstate (Srinivas
    Pandruvada).
 
  - Carry out initialization of sg_cpu in the schedutil cpufreq governor
    in one loop (Liao Chang).
 
  - Simplify the condition for storing 'down_threshold' in the
    conservative cpufreq governor (Liao Chang).
 
  - Use fine-grained mutex in the userspace cpufreq governor (Liao
    Chang).
 
  - Move is_managed indicator in the userspace cpufreq governor into a
    per-policy structure (Liao Chang).
 
  - Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).
 
  - Fix buffer overflow detection in trans_stats() (Christian Marangi).
 
  - Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to support
    specific devices like UFS which handle multiple clocks through OPP
    (Operating Performance Point) framework (Manivannan Sadhasivam).
 
  - Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
    event driver:
    * Generalize rockchip-dfi.c to support new RK3568/RK3588 using
      different DDR type (Sascha Hauer).
    * Convert DT binding document format to yaml (Sascha Hauer).
    * Add perf support for DFI (a unit suitable for measuring DDR
      utilization) to rockchip-dfi.c to extend DFI usage (Sascha Hauer).
 
  - Add locking to the OPP handling code in the Mediatek CCI devfreq
    driver, because the voltage of shared OPP might be changed by
    multiple drivers (Mark Tseng, Dan Carpenter).
 
  - Use device_get_match_data() in the Samsung Exynos PPMU devfreq-event
    driver (Rob Herring).
 
  - Extend support for the opp-level beyond required-opps (Ulf Hansson).
 
  - Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru).
 
  - dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm Krait
    SoCs and document named opp-microvolt property (Bjorn Andersson,
    Dmitry Baryshkov and Christian Marangi).
 
  - Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan Chancellor).
 
  - General cleanup of OPP code (Viresh Kumar).
 
  - Use __get_safe_page() rather than touching the list in hibernation
    snapshot code (Brian Geffon).
 
  - Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav).
 
  - Clean up sync_read handling in snapshot_write_next() (Brian Geffon).
 
  - Fix kerneldoc comments for swsusp_check() and swsusp_close() to
    better match code (Christoph Hellwig).
 
  - Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
    capping driver to pr_debug() (Ville Syrjälä).
 
  - Change the minimum python version for the intel_pstate_tracer utility
    from 2.7 to 3.6 (Doug Smythies).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmU6bqYSHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxzLcP/Avv9PgVRVqZlJ1Rs2fqIcyOY+j5qrvx
 xRiO3TBwdAzRy49ItnQY4W/CHk/skGY4vFhiluZE+OTUlx1fmPKeQLFpel+1+PvW
 vLezQ9v18sH7d2Kd6gJO5k9xsyu5ZMHEkwiejA/tmS2vTs5ne4wB7ONTJObYx5iB
 9Hrg6jLnk7MmolQqvQB6vmpej1eeWmuu7AXlg2OsXqYsCEnhS5iGBq86E35LvlKA
 Pnef/B2ZP9RaFg2dVapSZwubn0FkUtd29ifhtGC7Fw5LM8WCRc/KHAWZwMe4dcMf
 38uKux28xIEalGZm9zMhKO8gHGdfF/v1C46/hBvgjavwVJF3AUNXnsfc+v5SerDp
 tXx1xghGyM/blbHUdTfzZc4l5TyqsjhkBMSCMEQcj9QYjsCY0pTZmwLz8F0BAv4D
 0FukGf5jK987RBGvaHY90UCE+NvokOyJDckuSHQffrAZWghnhSgbZxMD5oiIjRYR
 BioM5wQsL+wOxWdUGAOVhK6wKj32kf2XjBqWdEBk70qcpbvEmc0N8t1BSd+TzzoK
 qM2hnyo+yxvv98wi/cglcJeZ1mbL+s1agTh7jFTkC23ap/GrZEw0EB5xdj4NbzOk
 hO1OXas8J1LA1GFwL0WoLDyY0gvGDYFWkh0yeu0SUgxTVwKapyG03OMPQATN5M/y
 cp+PK3ibS8Mb
 =h8my
 -----END PGP SIGNATURE-----

Merge tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These add new hardware support (new Qualcomm SoC versions in cpufreq,
  RK3568/RK3588 in devfreq), extend the OPP (operating performance
  points) framework, improve cpufreq governors, fix issues and clean up
  code (most of the changes are in cpufreq and devfreq).

  Specifics:

   - Add support for several Qualcomm SoC versions and other similar
     changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
     Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
     Gerhold and Varadarajan Narayanan)

   - Clean up the tegra cpufreq driver (Sumit Gupta)

   - Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
     Herring)

   - Add support for TI's am62p5 Soc (Bryan Brattlof)

   - Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
     Fainelli)

   - Update Kconfig to mention i.MX7 as well (Alexander Stein)

   - Revise global turbo disable check in intel_pstate (Srinivas
     Pandruvada)

   - Carry out initialization of sg_cpu in the schedutil cpufreq
     governor in one loop (Liao Chang)

   - Simplify the condition for storing 'down_threshold' in the
     conservative cpufreq governor (Liao Chang)

   - Use fine-grained mutex in the userspace cpufreq governor (Liao
     Chang)

   - Move is_managed indicator in the userspace cpufreq governor into a
     per-policy structure (Liao Chang)

   - Rebuild sched-domains when removing cpufreq driver (Pierre Gondois)

   - Fix buffer overflow detection in trans_stats() (Christian Marangi)

   - Switch to dev_pm_opp_find_freq_(ceil/floor)_indexed() APIs to
     support specific devices like UFS which handle multiple clocks
     through OPP (Operating Performance Point) framework (Manivannan
     Sadhasivam)

   - Add perf support to the Rockchip DFI (DDR Monitor Module) devfreq-
     event driver:
      * Generalize rockchip-dfi.c to support new RK3568/RK3588 using
        different DDR type (Sascha Hauer).
      * Convert DT binding document format to yaml (Sascha Hauer).
      * Add perf support for DFI (a unit suitable for measuring DDR
        utilization) to rockchip-dfi.c to extend DFI usage (Sascha
        Hauer)

   - Add locking to the OPP handling code in the Mediatek CCI devfreq
     driver, because the voltage of shared OPP might be changed by
     multiple drivers (Mark Tseng, Dan Carpenter)

   - Use device_get_match_data() in the Samsung Exynos PPMU
     devfreq-event driver (Rob Herring)

   - Extend support for the opp-level beyond required-opps (Ulf Hansson)

   - Add dev_pm_opp_find_level_floor() (Krishna chaitanya chundru)

   - dt-bindings: Allow opp-peak-kBpsfor kryo CPUs, support Qualcomm
     Krait SoCs and document named opp-microvolt property (Bjorn
     Andersson, Dmitry Baryshkov and Christian Marangi)

   - Fix -Wunsequenced warning _of_add_opp_table_v1() (Nathan
     Chancellor)

   - General cleanup of OPP code (Viresh Kumar)

   - Use __get_safe_page() rather than touching the list in hibernation
     snapshot code (Brian Geffon)

   - Fix symbol export for _SIMPLE_ variants of _PM_OPS() (Raag Jadav)

   - Clean up sync_read handling in snapshot_write_next() (Brian Geffon)

   - Fix kerneldoc comments for swsusp_check() and swsusp_close() to
     better match code (Christoph Hellwig)

   - Downgrade BIOS locked limits pr_warn() in the Intel RAPL power
     capping driver to pr_debug() (Ville Syrjälä)

   - Change the minimum python version for the intel_pstate_tracer
     utility from 2.7 to 3.6 (Doug Smythies)"

* tag 'pm-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (82 commits)
  dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
  cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
  cpufreq: qcom-nvmem: add support for IPQ8064
  cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
  cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
  cpufreq: qcom-nvmem: add support for IPQ6018
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
  cpufreq: qcom-nvmem: Add MSM8909
  cpufreq: qcom-nvmem: Simplify driver data allocation
  powercap: intel_rapl: Downgrade BIOS locked limits pr_warn() to pr_debug()
  cpufreq: stats: Fix buffer overflow detection in trans_stats()
  dt-bindings: devfreq: event: rockchip,dfi: Add rk3588 support
  dt-bindings: devfreq: event: rockchip,dfi: Add rk3568 support
  dt-bindings: devfreq: event: convert Rockchip DFI binding to yaml
  PM / devfreq: rockchip-dfi: add support for RK3588
  PM / devfreq: rockchip-dfi: account for multiple DDRMON_CTRL registers
  PM / devfreq: rockchip-dfi: make register stride SoC specific
  PM / devfreq: rockchip-dfi: Add perf support
  PM / devfreq: rockchip-dfi: give variable a better name
  ...
2023-10-31 15:38:12 -10:00
Linus Torvalds 63ce50fff9 Scheduler changes for v6.7 are:
- Fair scheduler (SCHED_OTHER) improvements:
 
     - Remove the old and now unused SIS_PROP code & option
     - Scan cluster before LLC in the wake-up path
     - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
 
  - NUMA scheduling improvements:
 
     - Improve the VMA access-PID code to better skip/scan VMAs
     - Extend tracing to cover VMA-skipping decisions
     - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
     - Generalize numa_map_to_online_node()
 
  - Energy scheduling improvements:
 
     - Remove the EM_MAX_COMPLEXITY limit
     - Add tracepoints to track energy computation
     - Make the behavior of the 'sched_energy_aware' sysctl more consistent
     - Consolidate and clean up access to a CPU's max compute capacity
     - Fix uclamp code corner cases
 
  - RT scheduling improvements:
 
     - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
     - Drive the ->rto_mask with rt_rq->pushable_tasks updates
 
  - Scheduler scalability improvements:
 
     - Rate-limit updates to tg->load_avg
     - On x86 disable IBRS when CPU is offline to improve single-threaded performance
     - Micro-optimize in_task() and in_interrupt()
     - Micro-optimize the PSI code
     - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
 
  - Core scheduler infrastructure improvements:
 
     - Use saved_state to reduce some spurious freezer wakeups
     - Bring in a handful of fast-headers improvements to scheduler headers
     - Make the scheduler UAPI headers more widely usable by user-space
     - Simplify the control flow of scheduler syscalls by using lock guards
     - Fix sched_setaffinity() vs. CPU hotplug race
 
  - Scheduler debuggability improvements:
     - Disallow writing invalid values to sched_rt_period_us
     - Fix a race in the rq-clock debugging code triggering warnings
     - Fix a warning in the bandwidth distribution code
     - Micro-optimize in_atomic_preempt_off() checks
     - Enforce that the tasklist_lock is held in for_each_thread()
     - Print the TGID in sched_show_task()
     - Remove the /proc/sys/kernel/sched_child_runs_first sysctl
 
  - Misc cleanups & fixes
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/
 KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc
 vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g
 +Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX
 N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP
 huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt
 Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ
 39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne
 nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd
 66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI
 n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2
 GtOsn1nlvuk=
 =GDqb
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Fair scheduler (SCHED_OTHER) improvements:
   - Remove the old and now unused SIS_PROP code & option
   - Scan cluster before LLC in the wake-up path
   - Use candidate prev/recent_used CPU if scanning failed for cluster
     wakeup

  NUMA scheduling improvements:
   - Improve the VMA access-PID code to better skip/scan VMAs
   - Extend tracing to cover VMA-skipping decisions
   - Improve/fix the recently introduced sched_numa_find_nth_cpu() code
   - Generalize numa_map_to_online_node()

  Energy scheduling improvements:
   - Remove the EM_MAX_COMPLEXITY limit
   - Add tracepoints to track energy computation
   - Make the behavior of the 'sched_energy_aware' sysctl more
     consistent
   - Consolidate and clean up access to a CPU's max compute capacity
   - Fix uclamp code corner cases

  RT scheduling improvements:
   - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
   - Drive the ->rto_mask with rt_rq->pushable_tasks updates

  Scheduler scalability improvements:
   - Rate-limit updates to tg->load_avg
   - On x86 disable IBRS when CPU is offline to improve single-threaded
     performance
   - Micro-optimize in_task() and in_interrupt()
   - Micro-optimize the PSI code
   - Avoid updating PSI triggers and ->rtpoll_total when there are no
     state changes

  Core scheduler infrastructure improvements:
   - Use saved_state to reduce some spurious freezer wakeups
   - Bring in a handful of fast-headers improvements to scheduler
     headers
   - Make the scheduler UAPI headers more widely usable by user-space
   - Simplify the control flow of scheduler syscalls by using lock
     guards
   - Fix sched_setaffinity() vs. CPU hotplug race

  Scheduler debuggability improvements:
   - Disallow writing invalid values to sched_rt_period_us
   - Fix a race in the rq-clock debugging code triggering warnings
   - Fix a warning in the bandwidth distribution code
   - Micro-optimize in_atomic_preempt_off() checks
   - Enforce that the tasklist_lock is held in for_each_thread()
   - Print the TGID in sched_show_task()
   - Remove the /proc/sys/kernel/sched_child_runs_first sysctl

  ... and misc cleanups & fixes"

* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
  sched/fair: Remove SIS_PROP
  sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
  sched/fair: Scan cluster before scanning LLC in wake-up path
  sched: Add cpus_share_resources API
  sched/core: Fix RQCF_ACT_SKIP leak
  sched/fair: Remove unused 'curr' argument from pick_next_entity()
  sched/nohz: Update comments about NEWILB_KICK
  sched/fair: Remove duplicate #include
  sched/psi: Update poll => rtpoll in relevant comments
  sched: Make PELT acronym definition searchable
  sched: Fix stop_one_cpu_nowait() vs hotplug
  sched/psi: Bail out early from irq time accounting
  sched/topology: Rename 'DIE' domain to 'PKG'
  sched/psi: Delete the 'update_total' function parameter from update_triggers()
  sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
  sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
  sched/numa: Complete scanning of inactive VMAs when there is no alternative
  sched/numa: Complete scanning of partial VMAs regardless of PID activity
  sched/numa: Move up the access pid reset logic
  sched/numa: Trace decisions related to skipping VMAs
  ...
2023-10-30 13:12:15 -10:00
Linus Torvalds 3cf3fabccb Locking changes in this cycle are:
- Futex improvements:
 
     - Add the 'futex2' syscall ABI, which is an attempt to get away from the
       multiplex syscall and adds a little room for extentions, while lifting
       some limitations.
 
     - Fix futex PI recursive rt_mutex waiter state bug
 
     - Fix inter-process shared futexes on no-MMU systems
 
     - Use folios instead of pages
 
  - Micro-optimizations of locking primitives:
 
     - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
       architectures, to improve lockref code generation.
 
     - Improve the x86-32 lockref_get_not_zero() main loop by adding
       build-time CMPXCHG8B support detection for the relevant lockref code,
       and by better interfacing the CMPXCHG8B assembly code with the compiler.
 
     - Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg()
       code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg().
 
     - Micro-optimize rcuref_put_slowpath()
 
  - Locking debuggability improvements:
 
     - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well
 
     - Enforce atomicity of sched_submit_work(), which is de-facto atomic but
       was un-enforced previously.
 
     - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics
 
     - Fix ww_mutex self-tests
 
     - Clean up const-propagation in <linux/seqlock.h> and simplify
       the API-instantiation macros a bit.
 
  - RT locking improvements:
 
     - Provide the rt_mutex_*_schedule() primitives/helpers and use them
       in the rtmutex code to avoid recursion vs. rtlock on the PI state.
 
     - Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock()
       and rwbase_read_lock().
 
  - Plus misc fixes & cleanups
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU877IRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1g9jw/+N7rxQ78dmFCYh4UWnLCYvuKP0/ivHErG
 493JcB8MupuA2tfJHIkDdr4aM2mNq2E61w69/WlZAQWWD6pdOhwgF5Xf5eoEcJm0
 vsAhWBGLxihXdtevPuMAx0dEpg3AMp2wc6i5PkN831KdPUgCNsrKq9Bfnfef7/G8
 MQTSHjmtba6jxleyxfEa4tE2xe5PJX825nRfkX2e1cf+stkYua+uJFxVxUfxFWGE
 4pBy70D9OC7MsJ44WWOA1gwkVtMMiBTmRPNjlP8Gz2GQ0f3ERHRwYk3jDHOPHZI6
 0GNt7pE3IMXQn2UuDtfkvv9IFTd+U5qD+APnWIn2ntWXqzGLFqOlmovMrobVn7El
 olYDCyweWPG71m1Qblsb1VK2QjRPQVJ9NAEg8RlDHIu2ThxHbMysDVGPVOYnPFq4
 S8QFpmldzbNoPU4rDJyT1fAmoUIrusBHkl+Us3yGfC74iM+fHnDEvaSoMZbzEdY1
 x/Nocj9XgKEgfXdYzrCWFmZ9xXqHkO25/wDL6yKqBdQtvaEalXuHTT6mQcYxrUPm
 Xx1BPan2Jg7p4u2oOFcVtKewUtRH9KBx8qytr5S+JK4PJbrBsixMnr84HLd/3X2V
 ykYkO+367T5MTYv4TnJDE5vdurzUqekKSCFPY3skPujPJfdLj1vsPzYf9iMkCLdo
 hU2f/R+Wpdk=
 =36Ff
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Info Molnar:
 "Futex improvements:

   - Add the 'futex2' syscall ABI, which is an attempt to get away from
     the multiplex syscall and adds a little room for extentions, while
     lifting some limitations.

   - Fix futex PI recursive rt_mutex waiter state bug

   - Fix inter-process shared futexes on no-MMU systems

   - Use folios instead of pages

  Micro-optimizations of locking primitives:

   - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock
     architectures, to improve lockref code generation

   - Improve the x86-32 lockref_get_not_zero() main loop by adding
     build-time CMPXCHG8B support detection for the relevant lockref
     code, and by better interfacing the CMPXCHG8B assembly code with
     the compiler

   - Introduce arch_sync_try_cmpxchg() on x86 to improve
     sync_try_cmpxchg() code generation. Convert some sync_cmpxchg()
     users to sync_try_cmpxchg().

   - Micro-optimize rcuref_put_slowpath()

  Locking debuggability improvements:

   - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well

   - Enforce atomicity of sched_submit_work(), which is de-facto atomic
     but was un-enforced previously.

   - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check
     semantics

   - Fix ww_mutex self-tests

   - Clean up const-propagation in <linux/seqlock.h> and simplify the
     API-instantiation macros a bit

  RT locking improvements:

   - Provide the rt_mutex_*_schedule() primitives/helpers and use them
     in the rtmutex code to avoid recursion vs. rtlock on the PI state.

   - Add nested blocking lockdep asserts to rt_mutex_lock(),
     rtlock_lock() and rwbase_read_lock()

  .. plus misc fixes & cleanups"

* tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
  futex: Don't include process MM in futex key on no-MMU
  locking/seqlock: Fix grammar in comment
  alpha: Fix up new futex syscall numbers
  locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts
  locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning
  locking/seqlock: Change __seqprop() to return the function pointer
  locking/seqlock: Simplify SEQCOUNT_LOCKNAME()
  locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath()
  locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg()
  locking/atomic/x86: Introduce arch_sync_try_cmpxchg()
  locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback
  locking/seqlock: Fix typo in comment
  futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic()
  locking/local, arch: Rewrite local_add_unless() as a static inline function
  locking/debug: Fix debugfs API return value checks to use IS_ERR()
  locking/ww_mutex/test: Make sure we bail out instead of livelock
  locking/ww_mutex/test: Fix potential workqueue corruption
  locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup
  futex: Add sys_futex_requeue()
  futex: Add flags2 argument to futex_requeue()
  ...
2023-10-30 12:38:48 -10:00
Rafael J. Wysocki 78b1f56a6f Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.7-rc1:

 - Add support for several Qualcomm SoC versions and other similar
   changes (Christian Marangi, Dmitry Baryshkov, Luca Weiss, Neil
   Armstrong, Richard Acayan, Robert Marko, Rohit Agarwal, Stephan
   Gerhold and Varadarajan Narayanan).

 - Clean up the tegra cpufreq driver (Sumit Gupta).

 - Use of_property_read_reg() to parse "reg" in pmac32 driver (Rob
   Herring).

 - Add support for TI's am62p5 Soc (Bryan Brattlof).

 - Make ARM_BRCMSTB_AVS_CPUFREQ depends on !ARM_SCMI_CPUFREQ (Florian
   Fainelli).

 - Update Kconfig to mention i.MX7 as well (Alexander Stein).

 - Revise global turbo disable check in intel_pstate (Srinivas
   Pandruvada).

 - Carry out initialization of sg_cpu in the schedutil cpufreq governor
   in one loop (Liao Chang).

 - Simplify the condition for storing 'down_threshold' in the
   conservative cpufreq governor (Liao Chang).

 - Use fine-grained mutex in the userspace cpufreq governor (Liao
   Chang).

 - Move is_managed indicator in the userspace cpufreq governor into a
   per-policy structure (Liao Chang).

 - Rebuild sched-domains when removing cpufreq driver (Pierre Gondois).

 - Fix buffer overflow detection in trans_stats() (Christian Marangi).

* pm-cpufreq: (32 commits)
  dt-bindings: cpufreq: qcom-hw: document SM8650 CPUFREQ Hardware
  cpufreq: arm: Kconfig: Add i.MX7 to supported SoC for ARM_IMX_CPUFREQ_DT
  cpufreq: qcom-nvmem: add support for IPQ8064
  cpufreq: qcom-nvmem: also accept operating-points-v2-krait-cpu
  cpufreq: qcom-nvmem: drop pvs_ver for format a fuses
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: Document krait-cpu
  cpufreq: qcom-nvmem: add support for IPQ6018
  dt-bindings: cpufreq: qcom-cpufreq-nvmem: document IPQ6018
  cpufreq: qcom-nvmem: Add MSM8909
  cpufreq: qcom-nvmem: Simplify driver data allocation
  cpufreq: stats: Fix buffer overflow detection in trans_stats()
  dt-bindings: cpufreq: cpufreq-qcom-hw: Add SDX75 compatible
  cpufreq: ARM_BRCMSTB_AVS_CPUFREQ cannot be used with ARM_SCMI_CPUFREQ
  cpufreq: ti-cpufreq: Add opp support for am62p5 SoCs
  cpufreq: dt-platdev: add am62p5 to blocklist
  cpufreq: tegra194: remove redundant AND with cpu_online_mask
  cpufreq: tegra194: use refclk delta based loop instead of udelay
  cpufreq: tegra194: save CPU data to avoid repeated SMP calls
  cpufreq: Rebuild sched-domains when removing cpufreq driver
  cpufreq: userspace: Move is_managed indicator into per-policy structure
  ...
2023-10-26 15:14:38 +02:00
Kefeng Wang 1b143cc77f sched/fair: use folio_xchg_last_cpupid() in should_numa_migrate_memory()
Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().

Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25 16:47:12 -07:00
Kefeng Wang 0b201c3624 sched/fair: use folio_xchg_access_time() in numa_hint_fault_latency()
Convert to use folio_xchg_access_time() in numa_hint_fault_latency().

Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-25 16:47:12 -07:00
Peter Zijlstra 984ffb6a43 sched/fair: Remove SIS_PROP
SIS_UTIL seems to work well, lets remove the old thing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
2023-10-24 10:38:44 +02:00
Yicong Yang 22165f61d0 sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
Chen Yu reports a hackbench regression of cluster wakeup when
hackbench threads equal to the CPU number [1]. Analysis shows
it's because we wake up more on the target CPU even if the
prev_cpu is a good wakeup candidate and leads to the decrease
of the CPU utilization.

Generally if the task's prev_cpu is idle we'll wake up the task
on it without scanning. On cluster machines we'll try to wake up
the task in the same cluster of the target for better cache
affinity, so if the prev_cpu is idle but not sharing the same
cluster with the target we'll still try to find an idle CPU within
the cluster. This will improve the performance at low loads on
cluster machines. But in the issue above, if the prev_cpu is idle
but not in the cluster with the target CPU, we'll try to scan an
idle one in the cluster. But since the system is busy, we're
likely to fail the scanning and use target instead, even if
the prev_cpu is idle. Then leads to the regression.

This patch solves this in 2 steps:
o record the prev_cpu/recent_used_cpu if they're good wakeup
  candidates but not sharing the cluster with the target.
o on scanning failure use the prev_cpu/recent_used_cpu if
  they're recorded as idle

[1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/

Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20231019033323.54147-4-yangyicong@huawei.com
2023-10-24 10:38:43 +02:00
Barry Song 8881e1639f sched/fair: Scan cluster before scanning LLC in wake-up path
For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle cpu within the cluster of the
target CPU before scanning the whole LLC to gain lower latency. This
will be implemented in 2 steps in select_idle_sibling():
1. When the prev_cpu/recent_used_cpu are good wakeup candidates, use them
   if they're sharing cluster with the target CPU. Otherwise trying to
   scan for an idle CPU in the target's cluster.
2. Scanning the cluster prior to the LLC of the target CPU for an
   idle CPU to wakeup.

Testing has been done on Kunpeng920 by pinning tasks to one numa and two
numa. On Kunpeng920, Each numa has 8 clusters and each cluster has 4 CPUs.

With this patch, We noticed enhancement on tbench and netperf within one
numa or cross two numa on top of tip-sched-core commit
9b46f1abc6d4 ("sched/debug: Print 'tgid' in sched_show_task()")

tbench results (node 0):
            baseline                     patched
  1:        327.2833        372.4623 (   13.80%)
  4:       1320.5933       1479.8833 (   12.06%)
  8:       2638.4867       2921.5267 (   10.73%)
 16:       5282.7133       5891.5633 (   11.53%)
 32:       9810.6733       9877.3400 (    0.68%)
 64:       7408.9367       7447.9900 (    0.53%)
128:       6203.2600       6191.6500 (   -0.19%)
tbench results (node 0-1):
            baseline                     patched
  1:        332.0433        372.7223 (   12.25%)
  4:       1325.4667       1477.6733 (   11.48%)
  8:       2622.9433       2897.9967 (   10.49%)
 16:       5218.6100       5878.2967 (   12.64%)
 32:      10211.7000      11494.4000 (   12.56%)
 64:      13313.7333      16740.0333 (   25.74%)
128:      13959.1000      14533.9000 (    4.12%)

netperf results TCP_RR (node 0):
            baseline                     patched
  1:      76546.5033      90649.9867 (   18.42%)
  4:      77292.4450      90932.7175 (   17.65%)
  8:      77367.7254      90882.3467 (   17.47%)
 16:      78519.9048      90938.8344 (   15.82%)
 32:      72169.5035      72851.6730 (    0.95%)
 64:      25911.2457      25882.2315 (   -0.11%)
128:      10752.6572      10768.6038 (    0.15%)

netperf results TCP_RR (node 0-1):
            baseline                     patched
  1:      76857.6667      90892.2767 (   18.26%)
  4:      78236.6475      90767.3017 (   16.02%)
  8:      77929.6096      90684.1633 (   16.37%)
 16:      77438.5873      90502.5787 (   16.87%)
 32:      74205.6635      88301.5612 (   19.00%)
 64:      69827.8535      71787.6706 (    2.81%)
128:      25281.4366      25771.3023 (    1.94%)

netperf results UDP_RR (node 0):
            baseline                     patched
  1:      96869.8400     110800.8467 (   14.38%)
  4:      97744.9750     109680.5425 (   12.21%)
  8:      98783.9863     110409.9637 (   11.77%)
 16:      99575.0235     110636.2435 (   11.11%)
 32:      95044.7250      97622.8887 (    2.71%)
 64:      32925.2146      32644.4991 (   -0.85%)
128:      12859.2343      12824.0051 (   -0.27%)

netperf results UDP_RR (node 0-1):
            baseline                     patched
  1:      97202.4733     110190.1200 (   13.36%)
  4:      95954.0558     106245.7258 (   10.73%)
  8:      96277.1958     105206.5304 (    9.27%)
 16:      97692.7810     107927.2125 (   10.48%)
 32:      79999.6702     103550.2999 (   29.44%)
 64:      80592.7413      87284.0856 (    8.30%)
128:      27701.5770      29914.5820 (    7.99%)

Note neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT branch
in the code has not been tested but it supposed to work.

Chen Yu also noticed this will improve the performance of tbench and
netperf on a 24 CPUs Jacobsville machine, there are 4 CPUs in one
cluster sharing L2 Cache.

[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-3-yangyicong@huawei.com
2023-10-24 10:38:43 +02:00
Barry Song b95303e0ae sched: Add cpus_share_resources API
Add cpus_share_resources() API. This is the preparation for the
optimization of select_idle_cpu() on platforms with cluster scheduler
level.

On a machine with clusters cpus_share_resources() will test whether
two cpus are within the same cluster. On a non-cluster machine it
will behaves the same as cpus_share_cache(). So we use "resources"
here for cache resources.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
2023-10-24 10:38:42 +02:00
Hao Jia 5ebde09d91 sched/core: Fix RQCF_ACT_SKIP leak
Igor Raits and Bagas Sanjaya report a RQCF_ACT_SKIP leak warning.

This warning may be triggered in the following situations:

    CPU0                                      CPU1

__schedule()
  *rq->clock_update_flags <<= 1;*   unregister_fair_sched_group()
  pick_next_task_fair+0x4a/0x410      destroy_cfs_bandwidth()
    newidle_balance+0x115/0x3e0       for_each_possible_cpu(i) *i=0*
      rq_unpin_lock(this_rq, rf)      __cfsb_csd_unthrottle()
      raw_spin_rq_unlock(this_rq)
                                      rq_lock(*CPU0_rq*, &rf)
                                      rq_clock_start_loop_update()
                                      rq->clock_update_flags & RQCF_ACT_SKIP <--
      raw_spin_rq_lock(this_rq)

The purpose of RQCF_ACT_SKIP is to skip the update rq clock,
but the update is very early in __schedule(), but we clear
RQCF_*_SKIP very late, causing it to span that gap above
and triggering this warning.

In __schedule() we can clear the RQCF_*_SKIP flag immediately
after update_rq_clock() to avoid this RQCF_ACT_SKIP leak warning.
And set rq->clock_update_flags to RQCF_UPDATED to avoid
rq->clock_update_flags < RQCF_ACT_SKIP warning that may be triggered later.

Fixes: ebb83d84e4 ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
Closes: https://lore.kernel.org/all/20230913082424.73252-1-jiahao.os@bytedance.com
Reported-by: Igor Raits <igor.raits@gmail.com>
Reported-by: Bagas Sanjaya <bagasdotme@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/a5dd536d-041a-2ce9-f4b7-64d8d85c86dc@gmail.com
2023-10-24 10:38:42 +02:00
Ingo Molnar 4e5b65a22b Linux 6.6-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym
 foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf
 QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW
 UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n
 oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja
 rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf
 xZAbzbA=
 =NoQv
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc7' into sched/core, to pick up fixes

Pick up recent sched/urgent fixes merged upstream.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-23 11:32:25 +02:00
Yiwei Lin 4c456c9ad3 sched/fair: Remove unused 'curr' argument from pick_next_entity()
The 'curr' argument of pick_next_entity() has become unused after
the EEVDF changes.

[ mingo: Updated the changelog. ]

Signed-off-by: Yiwei Lin <s921975628@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020055617.42064-1-s921975628@gmail.com
2023-10-20 15:55:04 +02:00
Joel Fernandes (Google) fb064e5ae1 sched/nohz: Update comments about NEWILB_KICK
How ILB is triggered without IPIs is cryptic. Out of mercy for future
code readers, document it in code comments.

The comments are derived from a discussion with Vincent in a past
review.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231020014031.919742-2-joel@joelfernandes.org
2023-10-20 09:56:21 +02:00
Alexey Dobriyan 68279f9c9f treewide: mark stuff as __ro_after_init
__read_mostly predates __ro_after_init. Many variables which are marked
__read_mostly should have been __ro_after_init from day 1.

Also, mark some stuff as "const" and "__init" while I'm at it.

[akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:43:23 -07:00
Matthew Wilcox (Oracle) 37acade0ce sched: remove wait bookmarks
There are no users of wait bookmarks left, so simplify the wait
code by removing them.

Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Bin Lai <sclaibin@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18 14:34:18 -07:00
Jiapeng Chong 1b7ef2d94f sched/fair: Remove duplicate #include
./kernel/sched/fair.c: linux/sched/cond_resched.h is included more than once.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231018062759.44375-1-jiapeng.chong@linux.alibaba.com

Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6907
2023-10-18 10:32:31 +02:00
Peter Zijlstra d2929762cc sched/eevdf: Fix heap corruption more
Because someone is a flaming idiot... and forgot we have current as
se->on_rq but not actually in the tree itself, and walking rb_parent()
on an entry not in the tree is 'funky' and KASAN complains.

Fixes: 8dafa9d0eb ("sched/eevdf: Fix min_deadline heap integrity")
Reported-by: 0599jiangyc@gmail.com
Reported-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218020
Link: https://lkml.kernel.org/r/CAJwJo6ZGXO07%3DQvW4fgQfbsDzQPs9xj5sAQ1zp%3DmAyPMNbHYww%40mail.gmail.com
2023-10-18 10:22:13 +02:00
Kefeng Wang 8c9ae56dc7 sched/numa, mm: make numa migrate functions to take a folio
The cpupid (or access time) is stored in the head page for THP, so it is
safely to make should_numa_migrate_memory() and numa_hint_fault_latency()
to take a folio.  This is in preparation for large folio numa balancing.

Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-16 15:44:38 -07:00
Fan Yu 7b3d8df549 sched/psi: Update poll => rtpoll in relevant comments
The PSI trigger code is now making a distinction between privileged and
unprivileged triggers, after the following commit:

 65457b74aa ("sched/psi: Rename existing poll members in preparation")

But some comments have not been modified along with the code, so they
need to be updated.

This will help readers better understand the code.

Signed-off-by: Fan Yu <fan.yu9@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310161920399921184@zte.com.cn
2023-10-16 13:42:49 +02:00
Mathieu Desnoyers 1b8a955dd3 sched: Make PELT acronym definition searchable
The PELT acronym definition can be found right at the top of
kernel/sched/pelt.c (of course), but it cannot be found through use of

grep -r PELT kernel/sched/

Add the acronym "(PELT)" after "Per Entity Load Tracking" at the top of
the source file.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20231012125824.1260774-1-mathieu.desnoyers@efficios.com
2023-10-13 09:56:30 +02:00
Peter Zijlstra f0498d2a54 sched: Fix stop_one_cpu_nowait() vs hotplug
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU
hotplug stress-test -- notably affine_move_task() remains stuck in
wait_for_completion(), leading to a hung-task detector warning.

Specifically, it was reported that stop_one_cpu_nowait(.fn =
migration_cpu_stop) returns false -- this stopper is responsible for
the matching complete().

The race scenario is:

	CPU0					CPU1

					// doing _cpu_down()

  __set_cpus_allowed_ptr()
    task_rq_lock();
					takedown_cpu()
					  stop_machine_cpuslocked(take_cpu_down..)

					<PREEMPT: cpu_stopper_thread()
					  MULTI_STOP_PREPARE
					  ...
    __set_cpus_allowed_ptr_locked()
      affine_move_task()
        task_rq_unlock();

  <PREEMPT: cpu_stopper_thread()\>
    ack_state()
					  MULTI_STOP_RUN
					    take_cpu_down()
					      __cpu_disable();
					      stop_machine_park();
						stopper->enabled = false;
					 />
   />
	stop_one_cpu_nowait(.fn = migration_cpu_stop);
          if (stopper->enabled) // false!!!

That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the
stopper thread gets a chance to preempt and allows the cpu-down for
the target CPU to complete.

OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to
issue a wakeup, it must not be ran under the scheduler locks.

Solve this apparent contradiction by keeping preemption disabled over
the unlock + queue_stopper combination:

	preempt_disable();
	task_rq_unlock(...);
	if (!stop_pending)
	  stop_one_cpu_nowait(...)
	preempt_enable();

This respects the lock ordering contraints while still avoiding the
above race. That is, if we find the CPU is online under rq-lock, the
targeted stop_one_cpu_nowait() must succeed.

Apply this pattern to all similar stop_one_cpu_nowait() invocations.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
2023-10-13 09:56:29 +02:00
Haifeng Xu 0c2924079f sched/psi: Bail out early from irq time accounting
We could bail out early when psi was disabled.

Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <zhouchengming@bytedance.com>
Link: https://lore.kernel.org/r/20230926115722.467833-1-haifeng.xu@shopee.com
2023-10-13 09:56:29 +02:00
Peter Zijlstra f577cd57bf sched/topology: Rename 'DIE' domain to 'PKG'
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain
for the package mask. :-)

Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the
name less ambiguous.

[ Shrikanth Hegde: rename on s390 as well. ]
[ Valentin Schneider: also rename it in the comments. ]
[ mingo: port to recent kernels & find all remaining occurances. ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
2023-10-12 09:38:16 +02:00
Yang Yang 3657680f38 sched/psi: Delete the 'update_total' function parameter from update_triggers()
The 'update_total' parameter of update_triggers() is always true after the
previous commit:

  80cc1d1d5e ("sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes")

If the 'changed_states & group->rtpoll_states' condition is true,
'new_stall' in update_triggers() will be true, and then 'update_total'
should also be true.

So update_total is redundant - remove it.

[ mingo: Changelog updates ]

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101645437859599@zte.com.cn
2023-10-11 23:08:09 +02:00
Yang Yang 80cc1d1d5e sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
When psimon wakes up and there are no state changes for ->rtpoll_states,
it's unnecessary to update triggers and ->rtpoll_total because the pressures
being monitored by the user have not changed.

This will help to slightly reduce unnecessary computations of PSI.

[ mingo: Changelog updates ]

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/202310101641075436843@zte.com.cn
2023-10-11 23:07:50 +02:00
Colin Ian King b19fdb16fb sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
There is a comment that refers to cpu_load, however, this cpu_load was
removed with:

  55627e3cd2 ("sched/core: Remove rq->cpu_load[]")

... back in 2019. The comment does not make sense with respect to this
removed array, so remove the comment.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010155744.1381065-1-colin.i.king@gmail.com
2023-10-10 23:42:20 +02:00
Mel Gorman f169c62ff7 sched/numa: Complete scanning of inactive VMAs when there is no alternative
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=============================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive

	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

                                                 6.6.0-rc2              6.6.0-rc2
                                       sched-numabtrace-v1 sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                     6.6.0-rc2   6.6.0-rc2
                  sched-numabtrace-v1 sched-numabselective-v1 /
  Duration User       91201.80    63506.64
  Duration System      2015.53     1819.78
  Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

                                        6.6.0-rc2      6.6.0-rc2
                              sched-numabtrace-v1 sched-numabselective-v1
  Ops NUMA base-page range updates       64272.00    26374386.00
  Ops NUMA PTE updates                   36624.00       55538.00
  Ops NUMA PMD updates                      54.00       51404.00
  Ops NUMA hint faults                   15504.00       75786.00
  Ops NUMA hint local faults %           14860.00       56763.00
  Ops NUMA hint local percent               95.85          74.90
  Ops NUMA pages migrated                 1629.00     6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
2023-10-10 23:42:15 +02:00
Mel Gorman b7a5b537c5 sched/numa: Complete scanning of partial VMAs regardless of PID activity
NUMA Balancing skips VMAs when the current task has not trapped a NUMA
fault within the VMA. If the VMA is skipped then mm->numa_scan_offset
advances and a task that is trapping faults within the VMA may never
fully update PTEs within the VMA.

Force tasks to update PTEs for partially scanned PTEs. The VMA will
be tagged for NUMA hints by some task but this removes some of the
benefit of tracking PID activity within a VMA. A follow-on patch
will mitigate this problem.

The test cases and machines evaluated did not trigger the corner case so
the performance results are neutral with only small changes within the
noise from normal test-to-test variance. However, the next patch makes
the corner case easier to trigger.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-6-mgorman@techsingularity.net
2023-10-10 23:41:47 +02:00
Raghavendra K T 2e2675db19 sched/numa: Move up the access pid reset logic
Recent NUMA hinting faulting activity is reset approximately every
VMA_PID_RESET_PERIOD milliseconds. However, if the current task has not
accessed a VMA then the reset check is missed and the reset is potentially
deferred forever. Check if the PID activity information should be reset
before checking if the current task recently trapped a NUMA hinting fault.

[ mgorman@techsingularity.net: Rewrite changelog ]

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-5-mgorman@techsingularity.net
2023-10-10 11:10:01 +02:00
Mel Gorman ed2da8b725 sched/numa: Trace decisions related to skipping VMAs
NUMA balancing skips or scans VMAs for a variety of reasons. In preparation
for completing scans of VMAs regardless of PID access, trace the reasons
why a VMA was skipped. In a later patch, the tracing will be used to track
if a VMA was forcibly scanned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-4-mgorman@techsingularity.net
2023-10-10 11:10:00 +02:00
Mel Gorman f3a6c97940 sched/numa: Rename vma_numab_state::access_pids[] => ::pids_active[], ::next_pid_reset => ::pids_active_reset
The access_pids[] field name is somewhat ambiguous as no PIDs are accessed.
Similarly, it's not clear that next_pid_reset is related to access_pids[].
Rename the fields to more accurately reflect their purpose.

[ mingo: Rename in the comments too. ]

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231010083143.19593-3-mgorman@techsingularity.net
2023-10-10 11:10:00 +02:00
Ingo Molnar fdb8b7a1af Linux 6.6-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmUjFeceHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGNCAH/RDI8G44DCV9Ps5U
 rl/FMf6iLUxU6fCS3Wwe8vtppLjPP7Y16AH5HKMumoDIqTfh9ZAUVKhZfT+PTgz3
 /oFXcGzZQLTcdbtH7XK2/zk7N/RI25/rDiCDd1uIJVCNii+hsBKS6Ihc4wXadxaR
 0z3lwoEKp2egeaeqmJWMzJLdjRrYhLs33+SEciVYqTiIvlWsM5QBm/sMvES7V57s
 TXrs5/y7yXtDBZ2PgYNCBRLyBazjqB28x07aQoePOAs6nFXl5N/wWPW/4wirWFHT
 s9LYZlmVo+O+RHWj10ASm/2l+ihgn959ZfRj1VekK2AWU1x/VzSPcuCXKvsrUoa+
 xEjL+vM=
 =efE3
 -----END PGP SIGNATURE-----

Merge tag 'v6.6-rc5' into locking/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09 18:09:23 +02:00
Ingo Molnar f2273f4e19 sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h
Move it out of the .c file into the shared scheduler-internal header file,
to gain type-checking.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
2023-10-09 17:33:10 +02:00
Shrikanth Hegde 8f833c82cd sched/topology: Change behaviour of the 'sched_energy_aware' sysctl, based on the platform
The 'sched_energy_aware' sysctl is available for the admin to disable/enable
energy aware scheduling(EAS). EAS is enabled only if few conditions are
met by the platform. They are, asymmetric CPU capacity, no SMT,
schedutil CPUfreq governor, frequency invariant load tracking etc.
A platform may boot without EAS capability, but could gain such
capability at runtime. For example, changing/registering the cpufreq
governor to schedutil.

At present, though platform doesn't support EAS, this sysctl returns 1
and it ends up calling build_perf_domains on write to 1 and
NOP when writing to 0. That is confusing and un-necessary.

Desired behavior would be to have this sysctl to enable/disable the EAS
on supported platform. On non-supported platform write to the sysctl
would return not supported error and read of the sysctl would return
empty. So sched_energy_aware returns empty - EAS is not possible at this moment
This will include EAS capable platforms which have at least one EAS
condition false during startup, e.g. not using the schedutil cpufreq governor
sched_energy_aware returns 0 - EAS is supported but disabled by admin.
sched_energy_aware returns 1 - EAS is supported and enabled.

User can find out the reason why EAS is not possible by checking
info messages. sched_is_eas_possible returns true if the platform
can do EAS at this moment.

Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Pierre Gondois <pierre.gondois@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20231009060037.170765-3-sshegde@linux.vnet.ibm.com
2023-10-09 17:24:44 +02:00
Yang Yang e03dc9fa06 sched/psi: Change update_triggers() to a 'void' function
Update_triggers() always returns now + group->rtpoll_min_period, and the
return value is only used by psi_rtpoll_work(), so change update_triggers()
to a void function, let group->rtpoll_next_update = now +
group->rtpoll_min_period directly.

This will avoid unnecessary function return value passing & simplifies
the function.

[ mingo: Updated changelog ]

Suggested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/202310092024289721617@zte.com.cn
2023-10-09 14:54:50 +02:00
Pierre Gondois 5b77261c55 sched/topology: Remove the EM_MAX_COMPLEXITY limit
The Energy Aware Scheduler (EAS) estimates the energy consumption
of placing a task on different CPUs. The goal is to minimize this
energy consumption. Estimating the energy of different task placements
is increasingly complex with the size of the platform.

To avoid having a slow wake-up path, EAS is only enabled if this
complexity is low enough.

The current complexity limit was set in:

  b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")

... based on the first implementation of EAS, which was re-computing
the power of the whole platform for each task placement scenario, see:

  390031e4c3 ("sched/fair: Introduce an energy estimation helper function")

... but the complexity of EAS was reduced in:

  eb92692b25 ("sched/fair: Speed-up energy-aware wake-ups")

... and find_energy_efficient_cpu() (feec) algorithm was updated in:

  3e8c6c9aac ("sched/fair: Remove task_util from effective utilization in feec()")

find_energy_efficient_cpu() (feec) is now doing:

	feec()
	\_ for_each_pd(pd) [0]
	  // get max_spare_cap_cpu and compute_prev_delta
	  \_ for_each_cpu(pd) [1]

	  \_ eenv_pd_busy_time(pd) [2]
		\_ for_each_cpu(pd)

	  // compute_energy(pd) without the task
	  \_ eenv_pd_max_util(pd, -1) [3.0]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, -1)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on prev_cpu
	  \_ eenv_pd_max_util(pd, prev_cpu) [3.1]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, prev_cpu)
	    \_ for_each_ps(pd)

	  // compute_energy(pd) with the task on max_spare_cap_cpu
	  \_ eenv_pd_max_util(pd, max_spare_cap_cpu) [3.2]
	    \_ for_each_cpu(pd)
	  \_ em_cpu_energy(pd, max_spare_cap_cpu)
	    \_ for_each_ps(pd)

	[3.1] happens only once since prev_cpu is unique. With the same
	      definitions for nr_pd, nr_cpus and nr_ps, the complexity is of:

		nr_pd * (2 * [nr_cpus in pd] + 2 * ([nr_cpus in pd] + [nr_ps in pd]))
		+ ([nr_cpus in pd] + [nr_ps in pd])

		 [0]  * (     [1] + [2]      +       [3.0] + [3.2]                  )
		+ [3.1]

		= nr_pd * (4 * [nr_cpus in pd] + 2 * [nr_ps in pd])
		+ [nr_cpus in prev pd] + nr_ps

The complexity limit was set to 2048 in:

  b68a4c0dba ("sched/topology: Disable EAS on inappropriate platforms")

... to make "EAS usable up to 16 CPUs with per-CPU DVFS and less than 8
performance states each". For the same platform, the complexity would
actually be of:

  16 * (4 + 2 * 7) + 1 + 7 = 296

Since the EAS complexity was greatly reduced since the limit was
introduced, bigger platforms can handle EAS.

For instance, a platform with 112 CPUs with 7 performance states
each would not reach it:

  112 * (4 + 2 * 7) + 1 + 7 = 2024

To reflect this improvement in the underlying EAS code, remove
the EAS complexity check.

Note that a limit on the number of CPUs still holds against
EM_MAX_NUM_CPUS to avoid overflows during the energy estimation.

[ mingo: Updates to the changelog. ]

Signed-off-by: Pierre Gondois <Pierre.Gondois@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009060037.170765-2-sshegde@linux.vnet.ibm.com
2023-10-09 13:07:27 +02:00
Vincent Guittot 7bc263840b sched/topology: Consolidate and clean up access to a CPU's max compute capacity
Remove the rq::cpu_capacity_orig field and use arch_scale_cpu_capacity()
instead.

The scheduler uses 3 methods to get access to a CPU's max compute capacity:

 - arch_scale_cpu_capacity(cpu) which is the default way to get a CPU's capacity.

 - cpu_capacity_orig field which is periodically updated with
   arch_scale_cpu_capacity().

 - capacity_orig_of(cpu) which encapsulates rq->cpu_capacity_orig.

There is no real need to save the value returned by arch_scale_cpu_capacity()
in struct rq. arch_scale_cpu_capacity() returns:

 - either a per_cpu variable.

 - or a const value for systems which have only one capacity.

Remove rq::cpu_capacity_orig and use arch_scale_cpu_capacity() everywhere.

No functional changes.

Some performance tests on Arm64:

  - small SMP device (hikey): no noticeable changes
  - HMP device (RB5):         hackbench shows minor improvement (1-2%)
  - large smp (thx2):         hackbench and tbench shows minor improvement (1%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20231009103621.374412-2-vincent.guittot@linaro.org
2023-10-09 12:59:48 +02:00
Yajun Deng 089768dfeb sched/rt: Change the type of 'sysctl_sched_rt_period' from 'unsigned int' to 'int'
Doing this matches the natural type of 'int' based calculus
in sched_rt_handler(), and also enables the adding in of a
correct upper bounds check on the sysctl interface.

[ mingo: Rewrote the changelog. ]

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231008021538.3063250-1-yajun.deng@linux.dev
2023-10-09 12:44:56 +02:00
Ingo Molnar f4bb570511 sched/nohz: Remove unnecessarily complex error handling pattern from find_new_ilb()
find_new_ilb() returns nr_cpu_ids on failure - which is the usual
cpumask bitops return pattern, but is weird & unnecessary in this
context: not only is it a global variable, it it is a +1 out of
bounds CPU index and also has different signedness ...

Its only user, kick_ilb(), then checks the return against nr_cpu_ids
to decide to return. There's no other use.

So instead of this, use a standard -1 return on failure to find an
idle CPU, as the argument is signed already.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20231006102518.2452758-4-mingo@kernel.org
2023-10-09 12:21:23 +02:00
Ingo Molnar b6dd698483 sched/nohz: Use consistent variable names in find_new_ilb() and kick_ilb()
Use 'ilb_cpu' consistently in both functions.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20231006102518.2452758-3-mingo@kernel.org
2023-10-09 12:21:23 +02:00
Ingo Molnar 7ef7145a2b sched/nohz: Update idle load-balancing (ILB) comments
- Fix incorrect/misleading comments,

 - clarify some others,

 - fix typos & grammar,

 - and use more consistent style throughout.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/20231006102518.2452758-2-mingo@kernel.org
2023-10-09 12:21:23 +02:00
Benjamin Segall b01db23d59 sched/eevdf: Fix pick_eevdf()
The old pick_eevdf() could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but
it turned out that that min_deadline wasn't actually eligible. In that
case we need to go back and search through any left branches we
skipped looking for the actual best _eligible_ min_deadline.

This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.

I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/xm261qego72d.fsf_-_@google.com
2023-10-09 09:48:33 +02:00
Peter Zijlstra 8dafa9d0eb sched/eevdf: Fix min_deadline heap integrity
Marek and Biju reported instances of:

  "EEVDF scheduling fail, picking leftmost"

which Mike correlated with cgroup scheduling and the min_deadline heap
getting corrupted; some trace output confirms:

> And yeah, min_deadline is hosed somehow:
>
>    validate_cfs_rq: --- /
>    __print_se: ffff88845cf48080 w: 1024 ve: -58857638 lag: 870381 vd: -55861854 vmd: -66302085 E (11372/tr)
>    __print_se:   ffff88810d165800 w: 25 ve: -80323686 lag: 22336429 vd: -41496434 vmd: -66302085 E (-1//autogroup-31)
>    __print_se:   ffff888108379000 w: 25 ve: 0 lag: -57987257 vd: 114632828 vmd: 114632828 N (-1//autogroup-33)
>    validate_cfs_rq: min_deadline: -55861854 avg_vruntime: -62278313462 / 1074 = -57987256

Turns out that reweight_entity(), which tries really hard to be fast,
does not do the normal dequeue+update+enqueue pattern but *does* scale
the deadline.

However, it then fails to propagate the updated deadline value up the
heap.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lkml.kernel.org/r/20231006192445.GE743@noisy.programming.kicks-ass.net
2023-10-09 09:48:32 +02:00
Yajun Deng bc87127a45 sched/debug: Print 'tgid' in sched_show_task()
Multiple blocked tasks are printed when the system hangs. They may have
the same parent pid, but belong to different task groups.

Printing tgid lets users better know whether these tasks are from the same
task group or not.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230720080516.1515297-1-yajun.deng@linux.dev
2023-10-07 11:33:28 +02:00
Ingo Molnar ea41bb514f sched/core: Update stale comment in try_to_wake_up()
The following commit:

  9b3c4ab304 ("sched,rcu: Rework try_invoke_on_locked_down_task()")

... renamed try_invoke_on_locked_down_task() to task_call_func(),
but forgot to update the comment in try_to_wake_up().

But it turns out that the smp_rmb() doesn't live in task_call_func()
either, it was moved to __task_needs_rq_lock() in:

  91dabf33ae ("sched: Fix race in task_call_func()")

Fix that now.

Also fix the s/smb/smp typo while at it.

Reported-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230731085759.11443-1-zhangqiao22@huawei.com
2023-10-07 11:33:28 +02:00
Ingo Molnar 8db30574db Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-07 11:32:24 +02:00
Pierre Gondois e7a1b32e43 cpufreq: Rebuild sched-domains when removing cpufreq driver
The Energy Aware Scheduler (EAS) relies on the schedutil governor.
When moving to/from the schedutil governor, sched domains must be
rebuilt to allow re-evaluating the enablement conditions of EAS.
This is done through sched_cpufreq_governor_change().

Having a cpufreq governor assumes a cpufreq driver is running.
Inserting/removing a cpufreq driver should trigger a re-evaluation
of EAS enablement conditions, avoiding to see EAS enabled when
removing a running cpufreq driver.

Rebuild the sched domains in schedutil's sugov_init()/sugov_exit(),
allowing to check EAS's enablement condition whenever schedutil
governor is initialized/exited from.
Move relevant code up in schedutil.c to avoid a split and conditional
function declaration.
Rename sched_cpufreq_governor_change() to sugov_eas_rebuild_sd().

Signed-off-by: Pierre Gondois <pierre.gondois@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-10-06 22:05:56 +02:00
Liao Chang 16a03c71bb cpufreq: schedutil: Merge initialization code of sg_cpu in single loop
The initialization code of the per-cpu sg_cpu struct is currently split
into two for-loop blocks. This can be simplified by merging the two
blocks into a single loop. This will make the code more maintainable.

Signed-off-by: Liao Chang <liaochang1@huawei.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2023-10-06 21:51:40 +02:00
Xuewen Yan 9e0bc36ab0 cpufreq: schedutil: Update next_freq when cpufreq_limits change
When cpufreq's policy is 'single', there is a scenario that will
cause sg_policy's next_freq to be unable to update.

When the CPU's util is always max, the cpufreq will be max,
and then if we change the policy's scaling_max_freq to be a
lower freq, indeed, the sg_policy's next_freq need change to
be the lower freq, however, because the cpu_is_busy, the next_freq
would keep the max_freq.

For example:

The cpu7 is a single CPU:

  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # while true;do done& [1] 4737
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # taskset -p 80 4737
  pid 4737's current affinity mask: ff
  pid 4737's new affinity mask: 80
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_cur_freq
  2301000
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # echo 2171000 > scaling_max_freq
  unisoc:/sys/devices/system/cpu/cpufreq/policy7 # cat scaling_max_freq
  2171000

At this time, the sg_policy's next_freq would stay at 2301000, which
is wrong.

To fix this, add a check for the ->need_freq_update flag.

[ mingo: Clarified the changelog. ]

Co-developed-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Guohua Yan <guohua.yan@unisoc.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230719130527.8074-1-xuewen.yan@unisoc.com
2023-10-05 22:09:50 +02:00
Yu Liao d4d6596b43 sched/headers: Remove duplicate header inclusions
<linux/psi.h> and "autogroup.h" are included twice, remove the duplicate header
inclusion.

Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230802021501.2511569-1-liaoyu15@huawei.com
2023-10-03 21:27:55 +02:00
Peter Zijlstra 650cad561c sched/eevdf: Fix avg_vruntime()
The expectation is that placing a task at avg_vruntime() makes it
eligible. Turns out there is a corner case where this is not the case.

Specifically, avg_vruntime() relies on the fact that integer division
is a flooring function (eg. it discards the remainder). By this
property the value returned is slightly left of the true average.

However! when the average is a negative (relative to min_vruntime) the
effect is flipped and it becomes a ceil, with the result that the
returned value is just right of the average and thus not eligible.

Fixes: af4cf40470 ("sched/fair: Add cfs_rq::avg_vruntime")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-10-03 12:32:29 +02:00
Peter Zijlstra 2f2fc17bab sched/eevdf: Also update slice on placement
Tasks that never consume their full slice would not update their slice value.
This means that tasks that are spawned before the sysctl scaling keep their
original (UP) slice length.

Fixes: 147f3efaa2 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230915124822.847197830@noisy.programming.kicks-ass.net
2023-10-03 12:32:29 +02:00
Cyril Hrubis 079be8fc63 sched/rt: Disallow writing invalid values to sched_rt_period_us
The validation of the value written to sched_rt_period_us was broken
because:

  - the sysclt_sched_rt_period is declared as unsigned int
  - parsed by proc_do_intvec()
  - the range is asserted after the value parsed by proc_do_intvec()

Because of this negative values written to the file were written into a
unsigned integer that were later on interpreted as large positive
integers which did passed the check:

  if (sysclt_sched_rt_period <= 0)
	return EINVAL;

This commit fixes the parsing by setting explicit range for both
perid_us and runtime_us into the sched_rt_sysctls table and processes
the values with proc_dointvec_minmax() instead.

Alternatively if we wanted to use full range of unsigned int for the
period value we would have to split the proc_handler and use
proc_douintvec() for it however even the
Documentation/scheduller/sched-rt-group.rst describes the range as 1 to
INT_MAX.

As far as I can tell the only problem this causes is that the sysctl
file allows writing negative values which when read back may confuse
userspace.

There is also a LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
2023-10-02 15:15:56 +02:00
Qais Yousef 15874a3d27 sched/debug: Add new tracepoint to track compute energy computation
It was useful to track feec() placement decision and debug the spare
capacity and optimization issues vs uclamp_max.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-4-qyousef@layalina.io
2023-09-29 10:29:18 +02:00
Qais Yousef 23c9519def sched/uclamp: Ignore (util == 0) optimization in feec() when p_util_max = 0
find_energy_efficient_cpu() bails out early if effective util of the
task is 0 as the delta at this point will be zero and there's nothing
for EAS to do. When uclamp is being used, this could lead to wrong
decisions when uclamp_max is set to 0. In this case the task is capped
to performance point 0, but it is actually running and consuming energy
and we can benefit from EAS energy calculations.

Rework the condition so that it bails out when both util and uclamp_min
are 0.

We can do that without needing to use uclamp_task_util(); remove it.

Fixes: d81304bc61 ("sched/uclamp: Cater for uclamp in find_energy_efficient_cpu()'s early exit condition")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-3-qyousef@layalina.io
2023-09-29 10:29:14 +02:00
Qais Yousef 6b00a40147 sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0
When uclamp_max is being used, the util of the task could be higher than
the spare capacity of the CPU, but due to uclamp_max value we force-fit
it there.

The way the condition for checking for max_spare_cap in
find_energy_efficient_cpu() was constructed; it ignored any CPU that has
its spare_cap less than or _equal_ to max_spare_cap. Since we initialize
max_spare_cap to 0; this lead to never setting max_spare_cap_cpu and
hence ending up never performing compute_energy() for this cluster and
missing an opportunity for a better energy efficient placement to honour
uclamp_max setting.

	max_spare_cap = 0;
	cpu_cap = capacity_of(cpu) - cpu_util(p);  // 0 if cpu_util(p) is high

	...

	util_fits_cpu(...);		// will return true if uclamp_max forces it to fit

	...

	// this logic will fail to update max_spare_cap_cpu if cpu_cap is 0
	if (cpu_cap > max_spare_cap) {
		max_spare_cap = cpu_cap;
		max_spare_cap_cpu = cpu;
	}

prev_spare_cap suffers from a similar problem.

Fix the logic by converting the variables into long and treating -1
value as 'not populated' instead of 0 which is a viable and correct
spare capacity value. We need to be careful signed comparison is used
when comparing with cpu_cap in one of the conditions.

Fixes: 1d42509e47 ("sched/fair: Make EAS wakeup placement consider uclamp restrictions")
Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230916232955.2099394-2-qyousef@layalina.io
2023-09-29 10:29:14 +02:00
Valentin Schneider 5fe7765997 sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded
dl_rq->dl_nr_migratory is increased whenever a DL entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_dl_tasks tree, dl_rq->dl_nr_migratory
includes a dl_rq's current task. This means a dl_rq can have a migratable
current, N non-migratable queued tasks, and be flagged as overloaded and have
its CPU set in the dlo_mask, despite having an empty pushable_tasks tree.

Make an dl_rq's overload logic be driven by {enqueue,dequeue}_pushable_dl_task(),
in other words make DL RQs only be flagged as overloaded if they have at
least one runnable-but-not-current migratable task.

 o push_dl_task() is unaffected, as it is a no-op if there are no pushable
   tasks.

 o pull_dl_task() now no longer scans runqueues whose sole migratable task is
   their current one, which it can't do anything about anyway.
   It may also now pull tasks to a DL RQ with dl_nr_running > 1 if only its
   current task is migratable.

Since dl_rq->dl_nr_migratory becomes unused, remove it.

RT had the exact same mechanism (rt_rq->rt_nr_migratory) which was dropped
in favour of relying on rt_rq->pushable_tasks, see:

  612f769edd ("sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask")

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230928150251.463109-1-vschneid@redhat.com
2023-09-29 10:20:21 +02:00
Joel Fernandes (Google) fc09027786 sched/rt: Fix live lock between select_fallback_rq() and RT push
During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.

On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running.  During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.

It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.

Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.

With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
2023-09-28 22:58:13 +02:00
Valentin Schneider 612f769edd sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask
Sebastian noted that the rto_push_work IRQ work can be queued for a CPU
that has an empty pushable_tasks list, which means nothing useful will be
done in the IPI other than queue the work for the next CPU on the rto_mask.

rto_push_irq_work_func() only operates on tasks in the pushable_tasks list,
but the conditions for that irq_work to be queued (and for a CPU to be
added to the rto_mask) rely on rq_rt->nr_migratory instead.

nr_migratory is increased whenever an RT task entity is enqueued and it has
nr_cpus_allowed > 1. Unlike the pushable_tasks list, nr_migratory includes a
rt_rq's current task. This means a rt_rq can have a migratible current, N
non-migratible queued tasks, and be flagged as overloaded / have its CPU
set in the rto_mask, despite having an empty pushable_tasks list.

Make an rt_rq's overload logic be driven by {enqueue,dequeue}_pushable_task().
Since rt_rq->{rt_nr_migratory,rt_nr_total} become unused, remove them.

Note that the case where the current task is pushed away to make way for a
migration-disabled task remains unchanged: the migration-disabled task has
to be in the pushable_tasks list in the first place, which means it has
nr_cpus_allowed > 1.

Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20230811112044.3302588-1-vschneid@redhat.com
2023-09-25 10:25:29 +02:00
Wang Jinchao 3eafe22599 sched/core: Refactor the task_flags check for worker sleeping in sched_submit_work()
Simplify the conditional logic for checking worker flags
by splitting the original compound `if` statement into
separate `if` and `else if` clauses.

This modification not only retains the previous functionality,
but also reduces a single `if` check, improving code clarity
and potentially enhancing performance.

Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/ZOIMvURE99ZRAYEj@fedora
2023-09-24 12:15:06 +02:00
Josh Don 2f8c62296b sched/fair: Fix warning in bandwidth distribution
We've observed the following warning being hit in
distribute_cfs_runtime():

	SCHED_WARN_ON(cfs_rq->runtime_remaining > 0)

We have the following race:

 - CPU 0: running bandwidth distribution (distribute_cfs_runtime).
   Inspects the local cfs_rq and makes its runtime_remaining positive.
   However, we defer unthrottling the local cfs_rq until after
   considering all remote cfs_rq's.

 - CPU 1: starts running bandwidth distribution from the slack timer. When
   it finds the cfs_rq for CPU 0 on the throttled list, it observers the
   that the cfs_rq is throttled, yet is not on the CSD list, and has a
   positive runtime_remaining, thus triggering the warning in
   distribute_cfs_runtime.

To fix this, we can rework the local unthrottling logic to put the local
cfs_rq on a local list, so that any future bandwidth distributions will
realize that the cfs_rq is about to be unthrottled.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230922230535.296350-2-joshdon@google.com
2023-09-24 12:08:29 +02:00
Josh Don 30797bce8e sched/fair: Make cfs_rq->throttled_csd_list available on !SMP
This makes the following patch cleaner by avoiding extra CONFIG_SMP
conditionals on the availability of rq->throttled_csd_list.

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230922230535.296350-1-joshdon@google.com
2023-09-24 12:08:28 +02:00
Liming Wu dc461c48de sched/debug: Avoid checking in_atomic_preempt_off() twice in schedule_debug()
in_atomic_preempt_off() already gets called in schedule_debug() once,
which is the only caller of __schedule_bug().

Skip the second call within __schedule_bug(), it should always be true
at this point.

[ mingo: Clarified the changelog. ]

Signed-off-by: Liming Wu <liming.wu@jaguarmicro.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230825023501.1848-1-liming.wu@jaguarmicro.com
2023-09-22 11:50:49 +02:00
Sebastian Andrzej Siewior 622f0a1d54 sched/debug: Update stale reference to sched_debug.c
Since commit:

   8a99b6833c ("sched: Move SCHED_DEBUG sysctl to debugfs")

The sched_debug interface moved from /proc to debugfs. The comment
mentions still the outdated proc interfaces.

Update the comment, point to the current location of the interface.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230920130025.412071-3-bigeasy@linutronix.de
2023-09-21 08:30:19 +02:00
Sebastian Andrzej Siewior 17e7170645 sched/debug: Remove the /proc/sys/kernel/sched_child_runs_first sysctl
The /proc/sys/kernel/sched_child_runs_first knob is no longer connected since:

   5e963f2bd4 ("sched/fair: Commit to EEVDF")

Remove it.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230920130025.412071-2-bigeasy@linutronix.de
2023-09-21 08:30:18 +02:00
Peter Zijlstra 6b596e62ed sched: Provide rt_mutex specific scheduler helpers
With PREEMPT_RT there is a rt_mutex recursion problem where
sched_submit_work() can use an rtlock (aka spinlock_t). More
specifically what happens is:

  mutex_lock() /* really rt_mutex */
    ...
      __rt_mutex_slowlock_locked()
	task_blocks_on_rt_mutex()
          // enqueue current task as waiter
          // do PI chain walk
        rt_mutex_slowlock_block()
          schedule()
            sched_submit_work()
              ...
              spin_lock() /* really rtlock */
                ...
                  __rt_mutex_slowlock_locked()
                    task_blocks_on_rt_mutex()
                      // enqueue current task as waiter *AGAIN*
                      // *CONFUSION*

Fix this by making rt_mutex do the sched_submit_work() early, before
it enqueues itself as a waiter -- before it even knows *if* it will
wait.

[[ basically Thomas' patch but with different naming and a few asserts
   added ]]

Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-5-bigeasy@linutronix.de
2023-09-20 09:31:12 +02:00
Thomas Gleixner de1474b46d sched: Extract __schedule_loop()
There are currently two implementations of this basic __schedule()
loop, and there is soon to be a third.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-4-bigeasy@linutronix.de
2023-09-20 09:31:12 +02:00
Peter Zijlstra 28bc55f654 sched: Constrain locks in sched_submit_work()
Even though sched_submit_work() is ran from preemptible context,
it is discouraged to have it use blocking locks due to the recursion
potential.

Enforce this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230908162254.999499-2-bigeasy@linutronix.de
2023-09-20 09:31:11 +02:00
Liam R. Howlett cff9b2332a kernel/sched: Modify initial boot task idle setup
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle().  Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set.  Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed.  Recent changes
have caused just this scenario and IRQs have been enabled early.

This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.

Fix this issue by setting the PF_IDLE flag later in the boot sequence.

Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so.  The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52.

Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
2023-09-19 10:48:04 +02:00
Ingo Molnar e23edc86b0 sched/fair: Rename check_preempt_curr() to wakeup_preempt()
The name is a bit opaque - make it clear that this is about wakeup
preemption.

Also rename the ->check_preempt_curr() methods similarly.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-09-19 10:40:10 +02:00
Ingo Molnar 82845683ca sched/fair: Rename check_preempt_wakeup() to check_preempt_wakeup_fair()
Other scheduling classes already postfix their similar methods
with the class name.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2023-09-19 10:32:15 +02:00
GUO Zihua 7ad0354d18 sched/headers: Remove duplicated includes in kernel/sched/sched.h
Remove duplicated includes of linux/cgroup.h and linux/psi.h. Both of
these includes are included regardless of the config and they are all
protected by ifndef, so no point including them again.

Signed-off-by: GUO Zihua <guozihua@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230818015633.18370-1-guozihua@huawei.com
2023-09-18 09:41:37 +02:00
Aaron Lu 1528c661c2 sched/fair: Ratelimit update to tg->load_avg
When using sysbench to benchmark Postgres in a single docker instance
with sysbench's nr_threads set to nr_cpu, it is observed there are times
update_cfs_group() and update_load_avg() shows noticeable overhead on
a 2sockets/112core/224cpu Intel Sapphire Rapids(SPR):

    13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
    10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

Annotate shows the cycles are mostly spent on accessing tg->load_avg
with update_load_avg() being the write side and update_cfs_group() being
the read side. tg->load_avg is per task group and when different tasks
of the same taskgroup running on different CPUs frequently access
tg->load_avg, it can be heavily contended.

E.g. when running postgres_sysbench on a 2sockets/112cores/224cpus Intel
Sappire Rapids, during a 5s window, the wakeup number is 14millions and
migration number is 11millions and with each migration, the task's load
will transfer from src cfs_rq to target cfs_rq and each change involves
an update to tg->load_avg. Since the workload can trigger as many wakeups
and migrations, the access(both read and write) to tg->load_avg can be
unbound. As a result, the two mentioned functions showed noticeable
overhead. With netperf/nr_client=nr_cpu/UDP_RR, the problem is worse:
during a 5s window, wakeup number is 21millions and migration number is
14millions; update_cfs_group() costs ~25% and update_load_avg() costs ~16%.

Reduce the overhead by limiting updates to tg->load_avg to at most once
per ms. The update frequency is a tradeoff between tracking accuracy and
overhead. 1ms is chosen because PELT window is roughly 1ms and it
delivered good results for the tests that I've done. After this change,
the cost of accessing tg->load_avg is greatly reduced and performance
improved. Detailed test results below.

  ==============================
  postgres_sysbench on SPR:
  25%
  base:   42382±19.8%
  patch:  50174±9.5%  (noise)

  50%
  base:   67626±1.3%
  patch:  67365±3.1%  (noise)

  75%
  base:   100216±1.2%
  patch:  112470±0.1% +12.2%

  100%
  base:    93671±0.4%
  patch:  113563±0.2% +21.2%

  ==============================
  hackbench on ICL:
  group=1
  base:    114912±5.2%
  patch:   117857±2.5%  (noise)

  group=4
  base:    359902±1.6%
  patch:   361685±2.7%  (noise)

  group=8
  base:    461070±0.8%
  patch:   491713±0.3% +6.6%

  group=16
  base:    309032±5.0%
  patch:   378337±1.3% +22.4%

  =============================
  hackbench on SPR:
  group=1
  base:    100768±2.9%
  patch:   103134±2.9%  (noise)

  group=4
  base:    413830±12.5%
  patch:   378660±16.6% (noise)

  group=8
  base:    436124±0.6%
  patch:   490787±3.2% +12.5%

  group=16
  base:    457730±3.2%
  patch:   680452±1.3% +48.8%

  ============================
  netperf/udp_rr on ICL
  25%
  base:    114413±0.1%
  patch:   115111±0.0% +0.6%

  50%
  base:    86803±0.5%
  patch:   86611±0.0%  (noise)

  75%
  base:    35959±5.3%
  patch:   49801±0.6% +38.5%

  100%
  base:    61951±6.4%
  patch:   70224±0.8% +13.4%

  ===========================
  netperf/udp_rr on SPR
  25%
  base:   104954±1.3%
  patch:  107312±2.8%  (noise)

  50%
  base:    55394±4.6%
  patch:   54940±7.4%  (noise)

  75%
  base:    13779±3.1%
  patch:   36105±1.1% +162%

  100%
  base:     9703±3.7%
  patch:   28011±0.2% +189%

  ==============================================
  netperf/tcp_stream on ICL (all in noise range)
  25%
  base:    43092±0.1%
  patch:   42891±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

  ===============================================
  netperf/tcp_stream on SPR (all in noise range)
  25%
  base:    34491±0.3%
  patch:   34886±0.5%

  50%
  base:    19278±14.9%
  patch:   22369±7.2%

  75%
  base:    16822±3.0%
  patch:   17086±2.3%

  100%
  base:    18216±0.6%
  patch:   18078±2.9%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: David Vernet <void@manifault.com>
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Tested-by: Swapnil Sapkal <Swapnil.Sapkal@amd.com>
Link: https://lkml.kernel.org/r/20230912065808.2530-2-aaron.lu@intel.com
2023-09-18 08:14:45 +02:00
Elliot Berman 8f0eed4a78 freezer,sched: Use saved_state to reduce some spurious wakeups
After commit f5d39b0208 ("freezer,sched: Rewrite core freezer logic"),
tasks that transition directly from TASK_FREEZABLE to TASK_FROZEN  are
always woken up on the thaw path. Prior to that commit, tasks could ask
freezer to consider them "frozen enough" via freezer_do_not_count(). The
commit replaced freezer_do_not_count() with a TASK_FREEZABLE state which
allows freezer to immediately mark the task as TASK_FROZEN without
waking up the task.  This is efficient for the suspend path, but on the
thaw path, the task is always woken up even if the task didn't need to
wake up and goes back to its TASK_(UN)INTERRUPTIBLE state. Although
these tasks are capable of handling of the wakeup, we can observe a
power/perf impact from the extra wakeup.

We observed on Android many tasks wait in the TASK_FREEZABLE state
(particularly due to many of them being binder clients). We observed
nearly 4x the number of tasks and a corresponding linear increase in
latency and power consumption when thawing the system. The latency
increased from ~15ms to ~50ms.

Avoid the spurious wakeups by saving the state of TASK_FREEZABLE tasks.
If the task was running before entering TASK_FROZEN state
(__refrigerator()) or if the task received a wake up for the saved
state, then the task is woken on thaw. saved_state from PREEMPT_RT locks
can be re-used because freezer would not stomp on the rtlock wait flow:
TASK_RTLOCK_WAIT isn't considered freezable.

Reported-by: Prakash Viswalingam <quic_prakashv@quicinc.com>
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-18 08:14:36 +02:00
Elliot Berman fbaa6a181a sched/core: Remove ifdeffery for saved_state
In preparation for freezer to also use saved_state, remove the
CONFIG_PREEMPT_RT compilation guard around saved_state.

On the arm64 platform I tested which did not have CONFIG_PREEMPT_RT,
there was no statistically significant deviation by applying this patch.

Test methodology:

perf bench sched message -g 40 -l 40

Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-18 08:13:57 +02:00
Uros Bizjak 4ff34ad3d3 sched/core: Use do-while instead of for loop in set_nr_if_polling()
Use equivalent do-while loop instead of infinite for loop.

There are no asm code changes.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230228161426.4508-1-ubizjak@gmail.com
2023-09-15 17:18:02 +02:00
Chengming Zhou c0490bc9bb sched/fair: Fix cfs_rq_is_decayed() on !SMP
We don't need to maintain per-queue leaf_cfs_rq_list on !SMP, since
it's used for cfs_rq load tracking & balancing on SMP.

But sched debug interface uses it to print per-cfs_rq stats.

This patch fixes the !SMP version of cfs_rq_is_decayed(), so the
per-queue leaf_cfs_rq_list is also maintained correctly on !SMP,
to fix the warning in assert_list_leaf_cfs_rq().

Fixes: 0a00a35464 ("sched/fair: Delete useless condition in tg_unthrottle_up()")
Reported-by: Leo Yu-Chi Liang <ycliang@andestech.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Leo Yu-Chi Liang <ycliang@andestech.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Closes: https://lore.kernel.org/all/ZN87UsqkWcFLDxea@swlinux02/
Link: https://lore.kernel.org/r/20230913132031.2242151-1-chengming.zhou@linux.dev
2023-09-15 14:24:00 +02:00
Yury Norov 6d08ad2166 sched/topology: Fix sched_numa_find_nth_cpu() comment
Reword sched_numa_find_nth_cpu() comment and make it kernel-doc compatible.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-7-yury.norov@gmail.com
2023-09-15 13:48:11 +02:00
Yury Norov 9ecea9ae4d sched/topology: Handle NUMA_NO_NODE in sched_numa_find_nth_cpu()
sched_numa_find_nth_cpu() doesn't handle NUMA_NO_NODE properly, and
may crash kernel if passed with it. On the other hand, the only user
of sched_numa_find_nth_cpu() has to check NUMA_NO_NODE case explicitly.

It would be easier for users if this logic will get moved into
sched_numa_find_nth_cpu().

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-6-yury.norov@gmail.com
2023-09-15 13:48:11 +02:00
Yury Norov 617f2c38cb sched/topology: Fix sched_numa_find_nth_cpu() in CPU-less case
When the node provided by user is CPU-less, corresponding record in
sched_domains_numa_masks is not set. Trying to dereference it in the
following code leads to kernel crash.

To avoid it, start searching from the nearest node with CPUs.

Fixes: cd7f55359c ("sched: add sched_numa_find_nth_cpu()")
Reported-by: Yicong Yang <yangyicong@hisilicon.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-4-yury.norov@gmail.com

Closes: https://lore.kernel.org/lkml/CAAH8bW8C5humYnfpW3y5ypwx0E-09A3QxFE1JFzR66v+mO4XfA@mail.gmail.com/T/
Closes: https://lore.kernel.org/lkml/ZMHSNQfv39HN068m@yury-ThinkPad/T/#mf6431cb0b7f6f05193c41adeee444bc95bf2b1c4
2023-09-15 13:48:10 +02:00
Yury Norov d1db9fb432 sched/fair: Fix open-coded numa_nearest_node()
task_numa_placement() searches for a nearest node to migrate by calling
for_each_node_state(). Now that we have numa_nearest_node(), switch to
using it.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20230819141239.287290-3-yury.norov@gmail.com
2023-09-15 13:48:10 +02:00
Tim Chen 450e749707 sched/fair: Fix SMT4 group_smt_balance handling
For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.

Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.

Fixes: fee1759e4f ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
2023-09-13 15:03:06 +02:00
Peter Zijlstra 0e34600ac9 sched: Misc cleanups
Random remaining guard use...

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:48 +02:00
Peter Zijlstra 6fb4546061 sched: Simplify tg_set_cfs_bandwidth()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:42 +02:00
Peter Zijlstra fa614b4feb sched: Simplify sched_move_task()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:38 +02:00
Peter Zijlstra af7c5763f5 sched: Simplify sched_rr_get_interval()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:33 +02:00
Peter Zijlstra 7a50f76674 sched: Simplify yield_to()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:28 +02:00
Peter Zijlstra 92c2ec5bc1 sched: Simplify sched_{set,get}affinity()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:24 +02:00
Peter Zijlstra febe162d4d sched: Simplify syscalls
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:19 +02:00
Peter Zijlstra 94b548a15e sched: Simplify set_user_nice()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-09-13 15:01:14 +02:00
Ard Biesheuvel cf8e865810 arch: Remove Itanium (IA-64) architecture
The Itanium architecture is obsolete, and an informal survey [0] reveals
that any residual use of Itanium hardware in production is mostly HP-UX
or OpenVMS based. The use of Linux on Itanium appears to be limited to
enthusiasts that occasionally boot a fresh Linux kernel to see whether
things are still working as intended, and perhaps to churn out some
distro packages that are rarely used in practice.

None of the original companies behind Itanium still produce or support
any hardware or software for the architecture, and it is listed as
'Orphaned' in the MAINTAINERS file, as apparently, none of the engineers
that contributed on behalf of those companies (nor anyone else, for that
matter) have been willing to support or maintain the architecture
upstream or even be responsible for applying the odd fix. The Intel
firmware team removed all IA-64 support from the Tianocore/EDK2
reference implementation of EFI in 2018. (Itanium is the original
architecture for which EFI was developed, and the way Linux supports it
deviates significantly from other architectures.) Some distros, such as
Debian and Gentoo, still maintain [unofficial] ia64 ports, but many have
dropped support years ago.

While the argument is being made [1] that there is a 'for the common
good' angle to being able to build and run existing projects such as the
Grid Community Toolkit [2] on Itanium for interoperability testing, the
fact remains that none of those projects are known to be deployed on
Linux/ia64, and very few people actually have access to such a system in
the first place. Even if there were ways imaginable in which Linux/ia64
could be put to good use today, what matters is whether anyone is
actually doing that, and this does not appear to be the case.

There are no emulators widely available, and so boot testing Itanium is
generally infeasible for ordinary contributors. GCC still supports IA-64
but its compile farm [3] no longer has any IA-64 machines. GLIBC would
like to get rid of IA-64 [4] too because it would permit some overdue
code cleanups. In summary, the benefits to the ecosystem of having IA-64
be part of it are mostly theoretical, whereas the maintenance overhead
of keeping it supported is real.

So let's rip off the band aid, and remove the IA-64 arch code entirely.
This follows the timeline proposed by the Debian/ia64 maintainer [5],
which removes support in a controlled manner, leaving IA-64 in a known
good state in the most recent LTS release. Other projects will follow
once the kernel support is removed.

[0] https://lore.kernel.org/all/CAMj1kXFCMh_578jniKpUtx_j8ByHnt=s7S+yQ+vGbKt9ud7+kQ@mail.gmail.com/
[1] https://lore.kernel.org/all/0075883c-7c51-00f5-2c2d-5119c1820410@web.de/
[2] https://gridcf.org/gct-docs/latest/index.html
[3] https://cfarm.tetaneutral.net/machines/list/
[4] https://lore.kernel.org/all/87bkiilpc4.fsf@mid.deneb.enyo.de/
[5] https://lore.kernel.org/all/ff58a3e76e5102c94bb5946d99187b358def688a.camel@physik.fu-berlin.de/

Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2023-09-11 08:13:17 +00:00
Shrikanth Hegde f8858d9606 sched/fair: Optimize should_we_balance() for large SMT systems
should_we_balance() is called in load_balance() to find out if the CPU that
is trying to do the load balance is the right one or not.

With commit:

  b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance")

the code tries to find an idle core to do the load balancing
and falls back on an idle sibling CPU if there is no idle core.

However, on larger SMT systems, it could be needlessly iterating to find a
idle by scanning all the CPUs in an non-idle core. If the core is not idle,
and first SMT sibling which is idle has been found, then its not needed to
check other SMT siblings for idleness

Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE.
balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and
same process would be repeated for CPU4 and CPU6 but this is unnecessary.
Since calling is_core_idle loops through all CPU's in the SMT mask, effect
is multiplied by weight of smt_mask. For example,when say 1 CPU is busy,
we would skip loop for 2 CPU's and skip iterating over 8CPU's. That
effect would be more in DIE/NUMA domain where there are more cores.

Testing and performance evaluation
==================================

The test has been done on this system which has 12 cores, i.e 24 small
cores with SMT=4:

  lscpu
  Architecture:            ppc64le
    Byte Order:            Little Endian
  CPU(s):                  96
    On-line CPU(s) list:   0-95
  Model name:              POWER10 (architected), altivec supported
    Thread(s) per core:    8

Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For
base tip/sched/core the time taken is collected by making the
should_we_balance() noinline. time is in nanoseconds. The values are
collected by running the funclatency tracer for 60 seconds. values are
average of 3 such runs. This represents the expected reduced time with
patch.

tip/sched/core was at commit:

  2f88c8e802 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")

Results:

	------------------------------------------------------------------------------
	workload			   tip/sched/core	with_patch(%gain)
	------------------------------------------------------------------------------
	idle system				 809.3		 695.0(16.45)
	stress ng – 12 threads -l 100		1013.5		 893.1(13.49)
	stress ng – 24 threads -l 100		1073.5		 980.0(9.54)
	stress ng – 48 threads -l 100		 683.0		 641.0(6.55)
	stress ng – 96 threads -l 100		2421.0		2300(5.26)
	stress ng – 96 threads -l 15		 375.5		 377.5(-0.53)
	stress ng – 96 threads -l 25		 635.5		 637.5(-0.31)
	stress ng – 96 threads -l 35		 934.0		 891.0(4.83)

Ran schbench(old), hackbench and stress_ng  to evaluate the workload
performance between tip/sched/core and with patch.
No modification to tip/sched/core

TL;DR:

Good improvement is seen with schbench. when hackbench and stress_ng
runs for longer good improvement is seen.

	------------------------------------------------------------------------------
	schbench(old)		            tip		+patch(%gain)
	10 iterations			sched/core
	------------------------------------------------------------------------------
	1 Threads
	50.0th:		      		    8.00       9.00(-12.50)
	75.0th:   			    9.60       9.00(6.25)
	90.0th:   			   11.80      10.20(13.56)
	95.0th:   			   12.60      10.40(17.46)
	99.0th:   			   13.60      11.90(12.50)
	99.5th:   			   14.10      12.60(10.64)
	99.9th:   			   15.90      14.60(8.18)
	2 Threads
	50.0th:   			    9.90       9.20(7.07)
	75.0th:   			   12.60      10.10(19.84)
	90.0th:   			   15.50      12.00(22.58)
	95.0th:   			   17.70      14.00(20.90)
	99.0th:   			   21.20      16.90(20.28)
	99.5th:   			   22.60      17.50(22.57)
	99.9th:   			   30.40      19.40(36.18)
	4 Threads
	50.0th:   			   12.50      10.60(15.20)
	75.0th:   			   15.30      12.00(21.57)
	90.0th:   			   18.60      14.10(24.19)
	95.0th:   			   21.30      16.20(23.94)
	99.0th:   			   26.00      20.70(20.38)
	99.5th:   			   27.60      22.50(18.48)
	99.9th:   			   33.90      31.40(7.37)
	8 Threads
	50.0th:   			   16.30      14.30(12.27)
	75.0th:   			   20.20      17.40(13.86)
	90.0th:   			   24.50      21.90(10.61)
	95.0th:   			   27.30      24.70(9.52)
	99.0th:   			   35.00      31.20(10.86)
	99.5th:   			   46.40      33.30(28.23)
	99.9th:   			   89.30      57.50(35.61)
	16 Threads
	50.0th:   			   22.70      20.70(8.81)
	75.0th:   			   30.10      27.40(8.97)
	90.0th:   			   36.00      32.80(8.89)
	95.0th:   			   39.60      36.40(8.08)
	99.0th:   			   49.20      44.10(10.37)
	99.5th:   			   64.90      50.50(22.19)
	99.9th:   			  143.50     100.60(29.90)
	32 Threads
	50.0th:   			   34.60      35.50(-2.60)
	75.0th:   			   48.20      50.50(-4.77)
	90.0th:   			   59.20      62.40(-5.41)
	95.0th:   			   65.20      69.00(-5.83)
	99.0th:   			   80.40      83.80(-4.23)
	99.5th:   			  102.10      98.90(3.13)
	99.9th:   			  727.10     506.80(30.30)

schbench does improve in general. There is some run to run variation with
schbench. Did a validation run to confirm that trend is similar.

	------------------------------------------------------------------------------
	hackbench				tip	   +patch(%gain)
	20 iterations, 50000 loops	     sched/core
	------------------------------------------------------------------------------
	Process 10 groups                :      11.74      11.70(0.34)
	Process 20 groups                :      22.73      22.69(0.18)
	Process 30 groups                :      33.39      33.40(-0.03)
	Process 40 groups                :      43.73      43.61(0.27)
	Process 50 groups                :      53.82      54.35(-0.98)
	Process 60 groups                :      64.16      65.29(-1.76)
	thread 10 Time                   :      12.81      12.79(0.16)
	thread 20 Time                   :      24.63      24.47(0.65)
	Process(Pipe) 10 Time            :       6.40       6.34(0.94)
	Process(Pipe) 20 Time            :      10.62      10.63(-0.09)
	Process(Pipe) 30 Time            :      15.09      14.84(1.66)
	Process(Pipe) 40 Time            :      19.42      19.01(2.11)
	Process(Pipe) 50 Time            :      24.04      23.34(2.91)
	Process(Pipe) 60 Time            :      28.94      27.51(4.94)
	thread(Pipe) 10 Time             :       6.96       6.87(1.29)
	thread(Pipe) 20 Time             :      11.74      11.73(0.09)

hackbench shows slight improvement with pipe. Slight degradation in process.

	------------------------------------------------------------------------------
	stress_ng				tip        +patch(%gain)
	10 iterations 100000 cpu_ops	     sched/core
	------------------------------------------------------------------------------

	--cpu=96 -util=100 Time taken  	 :       5.30,       5.01(5.47)
	--cpu=48 -util=100 Time taken    :       7.94,       6.73(15.24)
	--cpu=24 -util=100 Time taken    :      11.67,       8.75(25.02)
	--cpu=12 -util=100 Time taken    :      15.71,      15.02(4.39)
	--cpu=96 -util=10 Time taken     :      22.71,      22.19(2.29)
	--cpu=96 -util=20 Time taken     :      12.14,      12.37(-1.89)
	--cpu=96 -util=30 Time taken     :       8.76,       8.86(-1.14)
	--cpu=96 -util=40 Time taken     :       7.13,       7.14(-0.14)
	--cpu=96 -util=50 Time taken     :       6.10,       6.13(-0.49)
	--cpu=96 -util=60 Time taken     :       5.42,       5.41(0.18)
	--cpu=96 -util=70 Time taken     :       4.94,       4.94(0.00)
	--cpu=96 -util=80 Time taken     :       4.56,       4.53(0.66)
	--cpu=96 -util=90 Time taken     :       4.27,       4.26(0.23)

Good improvement seen with 24 CPUs. In this case only one CPU is busy,
and no core is idle. Decent improvement with 100% utilization case. no
difference in other utilization.

Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
2023-09-02 12:56:04 +02:00
Hao Jia c958ca2013 sched/fair: Make update_entity_lag() static
The function update_entity_lag() is only used inside the kernel/sched/fair.c file.
Make it static.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230829030325.69128-1-jiahao.os@bytedance.com
2023-08-29 21:05:28 +02:00
Linus Torvalds 97efd28334 Misc x86 cleanups.
The following commit deserves special mention:
 
    22dc02f81c Revert "sched/fair: Move unused stub functions to header"
 
 This is in x86/cleanups, because the revert is a re-application of a
 number of cleanups that got removed inadvertedly.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz
 45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W
 hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP
 1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7
 2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79
 hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD
 uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3
 9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT
 2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP
 auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1
 CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B
 vfaPs1Dz46Q=
 =kiAA
 -----END PGP SIGNATURE-----

Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 cleanups from Ingo Molnar:
 "The following commit deserves special mention:

   22dc02f81c Revert "sched/fair: Move unused stub functions to header"

  This is in x86/cleanups, because the revert is a re-application of a
  number of cleanups that got removed inadvertedly"

[ This also effectively undoes the amd_check_microcode() microcode
  declaration change I had done in my microcode loader merge in commit
  42a7f6e3ff ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]").

  I picked the declaration change by Arnd from this branch instead,
  which put it in <asm/processor.h> instead of <asm/microcode.h> like I
  had done in my merge resolution   - Linus ]

* tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy()
  x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy()
  x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy()
  x86/qspinlock-paravirt: Fix missing-prototype warning
  x86/paravirt: Silence unused native_pv_lock_init() function warning
  x86/alternative: Add a __alt_reloc_selftest() prototype
  x86/purgatory: Include header for warn() declaration
  x86/asm: Avoid unneeded __div64_32 function definition
  Revert "sched/fair: Move unused stub functions to header"
  x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP
  x86/cpu: Fix amd_check_microcode() declaration
2023-08-28 17:05:58 -07:00
Linus Torvalds 3ca9a836ff Scheduler changes for v6.6:
- The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler.
 
   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only.
   It completely reworks the base scheduler: placement, preemption,
   picking -- everything.
 
   LWN.net, as usual, has a terrific writeup about EEVDF:
 
      https://lwn.net/Articles/925371/
 
   Preemption (both tick and wakeup) is driven by testing against
   a fresh pick. Because the tree is now effectively an interval
   tree, and the selection is no longer the 'leftmost' task,
   over-scheduling is less of a problem. A lot of the CFS
   heuristics are removed or replaced by more natural latency-space
   parameters & constructs.
 
   In terms of expected performance regressions: we'll and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases,
   but in some cases it won't, especially in adversarial loads that
   got lucky with the previous code, such as some variants of hackbench.
   We are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations
   of that process.
 
 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again).
 
 - Improve & fix bandwidth-scheduling on nohz systems.
 
 - Improve bandwidth-throttling.
 
 - Use lock guards to simplify and de-goto-ify control flow.
 
 - Misc improvements, cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDOgRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iS4g//b9yewVW9OPxetKoN8zIJA0TjFYuuOVHK
 BlCJi5dbzXeCTrtENI65BRA7kPbTQ3AjwLRQ2BallAZ4dJceK0RhlZJvcrMNsm4e
 Adcpoch/FbqPKCrtAJQY04Ln1B244n/KyVifYett9220dMgTFQGJJYxrTc2G2+Kp
 F44vdUHzRczIE+KeOgBild1CwfKv5Zn5xgaXgtuoPLZtWBE0C1fSSzbK/PTINcUx
 bS4NVxK0CpOqSiNjnugV8KsYb71/0U6IgShBVjfHsrlBYigOH2NbVTH5xyjF8f83
 WxiGstlhxj+N6Kv4L6FOJIAr2BIggH82j3FaPACmv4c8pzEoBBbvlAJkfinLEgbn
 Povg3OF2t6uZ8NoHjeu3WxOjBsphbpkFz7H5nno1ibXSIR/JyUH5MdBPSx93QITB
 QoUKQpr/L8zWauWDOEzSaJjEsZbl8rkcIVq5Bk0bR3qn2xkZsIeVte+vCEu3+tBc
 b4JOZjq7AuPDqPnsBLvuyiFZ7zwsAfm+pOD5UF3/zbLjPn1N/7wTNQZ29zjc04jl
 SifpCZGgF1KlG8m8wNTlSfVvq0ksppCzJt+C6VFuejZ191IGpirQHn4Vp0sluMhC
 WRzXhb7v37Bq5JY10GMfeKb/jAiRs68kozhzqVPsBSAPS6I6jJssONgedq+LbQdC
 tFsmE9n09do=
 =XtCD
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler

   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only. It
   completely reworks the base scheduler: placement, preemption, picking
   -- everything

   LWN.net, as usual, has a terrific writeup about EEVDF:

      https://lwn.net/Articles/925371/

   Preemption (both tick and wakeup) is driven by testing against a
   fresh pick. Because the tree is now effectively an interval tree, and
   the selection is no longer the 'leftmost' task, over-scheduling is
   less of a problem. A lot of the CFS heuristics are removed or
   replaced by more natural latency-space parameters & constructs

   In terms of expected performance regressions: we will and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases, but
   in some cases it won't, especially in adversarial loads that got
   lucky with the previous code, such as some variants of hackbench. We
   are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations of
   that process

 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again)

 - Improve & fix bandwidth-scheduling on nohz systems

 - Improve bandwidth-throttling

 - Use lock guards to simplify and de-goto-ify control flow

 - Misc improvements, cleanups and fixes

* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
  sched/eevdf: Curb wakeup-preemption
  sched: Simplify sched_core_cpu_{starting,deactivate}()
  sched: Simplify try_steal_cookie()
  sched: Simplify sched_tick_remote()
  sched: Simplify sched_exec()
  sched: Simplify ttwu()
  sched: Simplify wake_up_if_idle()
  sched: Simplify: migrate_swap_stop()
  sched: Simplify sysctl_sched_uclamp_handler()
  sched: Simplify get_nohz_timer_target()
  sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
  sched/rt: Fix sysctl_sched_rr_timeslice intial value
  sched/fair: Block nohz tick_stop when cfs bandwidth in use
  sched, cgroup: Restore meaning to hierarchical_quota
  MAINTAINERS: Add Peter explicitly to the psi section
  sched/psi: Select KERNFS as needed
  sched/topology: Align group flags when removing degenerate domain
  sched/fair: remove util_est boosting
  sched/fair: Propagate enqueue flags into place_entity()
  ...
2023-08-28 16:43:39 -07:00
Peter Zijlstra 63304558ba sched/eevdf: Curb wakeup-preemption
Mike and others noticed that EEVDF does like to over-schedule quite a
bit -- which does hurt performance of a number of benchmarks /
workloads.

In particular, what seems to cause over-scheduling is that when lag is
of the same order (or larger) than the request / slice then placement
will not only cause the task to be placed left of current, but also
with a smaller deadline than current, which causes immediate
preemption.

[ notably, lag bounds are relative to HZ ]

Mike suggested we stick to picking 'current' for as long as it's
eligible to run, giving it uninterrupted runtime until it reaches
parity with the pack.

Augment Mike's suggestion by only allowing it to exhaust it's initial
request.

One random data point:

echo NO_RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

	3,723,554        context-switches      ( +-  0.56% )
	9.5136 +- 0.0394 seconds time elapsed  ( +-  0.41% )

echo RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

	2,556,535        context-switches      ( +-  0.51% )
	9.2427 +- 0.0302 seconds time elapsed  ( +-  0.33% )

Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
2023-08-17 17:07:07 +02:00
Peter Zijlstra 7170509cad sched: Simplify sched_core_cpu_{starting,deactivate}()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.371787909@infradead.org
2023-08-14 17:01:27 +02:00
Peter Zijlstra b4e1fa1e14 sched: Simplify try_steal_cookie()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.304154828@infradead.org
2023-08-14 17:01:27 +02:00
Peter Zijlstra 6dafc713e3 sched: Simplify sched_tick_remote()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.236247952@infradead.org
2023-08-14 17:01:26 +02:00
Peter Zijlstra 4bdada79f3 sched: Simplify sched_exec()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.168490417@infradead.org
2023-08-14 17:01:26 +02:00
Peter Zijlstra 857d315f12 sched: Simplify ttwu()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.101069260@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 4eb054f92b sched: Simplify wake_up_if_idle()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 5bb76f1ddf sched: Simplify: migrate_swap_stop()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org
2023-08-14 17:01:25 +02:00
Peter Zijlstra 0f92cdf36f sched: Simplify sysctl_sched_uclamp_handler()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.896559109@infradead.org
2023-08-14 17:01:24 +02:00
Peter Zijlstra 7537b90c00 sched: Simplify get_nohz_timer_target()
Use guards to reduce gotos and simplify control flow.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org
2023-08-14 17:01:24 +02:00
Cyril Hrubis c1fc6484e1 sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
The sched_rr_timeslice can be reset to default by writing value that is
<= 0. However after reading from this file we always got the last value
written, which is not useful at all.

$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1

Fix this by setting the variable that holds the sysctl file value to the
jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written.

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
2023-08-14 17:01:23 +02:00
Cyril Hrubis c7fcb99877 sched/rt: Fix sysctl_sched_rr_timeslice intial value
There is a 10% rounding error in the intial value of the
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.

This was found with LTP test sched_rr_get_interval01:

sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90

What this test does is to compare the return value from the
sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and
fails if they do not match.

The problem it found is the intial sysctl file value which was computed as:

static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;

which works fine as long as MSEC_PER_SEC is multiple of HZ, however it
introduces 10% rounding error for CONFIG_HZ_300:

(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)

(1000 / 300) * (100 * 300 / 1000)

3 * 30 = 90

This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:

(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ

(1000 * (100 * 300 / 1000)) / 300

(1000 * 30) / 300 = 100

Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
2023-08-14 17:01:23 +02:00
Ingo Molnar b41bbb33cf Merge branch 'sched/eevdf' into sched/core
Pick up the EEVDF work into the main branch - it's looking good so far.

 Conflicts:
	kernel/sched/features.h

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-08-10 09:05:43 +02:00
Phil Auld 88c56cfeae sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together.  Tasks
can easily run well past their quotas before a remote tick does
accounting.  This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.

Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.

Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.

Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.

Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Phil Auld c98c18270b sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min.  It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.

In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.

Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
2023-08-02 16:19:26 +02:00
Peter Zijlstra 22dc02f81c Revert "sched/fair: Move unused stub functions to header"
Revert commit 7aa55f2a59 ("sched/fair: Move unused stub functions to
header"), for while it has the right Changelog, the actual patch
content a revert of the previous 4 patches:

  f7df852ad6 ("sched: Make task_vruntime_update() prototype visible")
  c0bdfd72fb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
  378be384e0 ("sched: Add schedule_user() declaration")
  d55ebae3f3 ("sched: Hide unused sched_update_scaling()")

So in effect this is a revert of a revert and re-applies those
patches.

Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2023-07-31 11:47:08 +02:00
Chen Yu 4efcc8bc7e sched/topology: Align group flags when removing degenerate domain
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.

The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6b ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]

Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
2023-07-26 12:28:51 +02:00
Vincent Guittot c2e164ac33 sched/fair: remove util_est boosting
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.

cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.

In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util

Fixes: 7d0583cf9e ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
2023-07-26 12:28:50 +02:00
Peter Zijlstra d07f09a1f9 sched/fair: Propagate enqueue flags into place_entity()
This allows place_entity() to consider ENQUEUE_WAKEUP and
ENQUEUE_MIGRATED.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra e4ec3318a1 sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
2023-07-19 09:43:59 +02:00
Peter Zijlstra 5e963f2bd4 sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e8f331bcc2 sched/smp: Use lag to simplify cross-runqueue placement
Using lag is both more correct and simpler when moving between
runqueues.

Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 76cae9dbe1 sched/fair: Commit to lag based placement
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.

Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.

One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 147f3efaa2 sched/fair: Implement an EEVDF-like scheduling policy
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.

Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.

Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra 86bfbb7ce4 sched/fair: Add lag based placement
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.

Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra e0c2ff903c sched/fair: Remove sched_feat(START_DEBIT)
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
2023-07-19 09:43:58 +02:00
Peter Zijlstra af4cf40470 sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
2023-07-19 09:43:58 +02:00
Ingo Molnar 752182b24b Linux 6.5-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmS0at0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG6m8H/RZCd2DWeM94CgK5
 DIxNLxu90PrEcrOnqeHFJtQoSiUQTHeseh9E4BH0JdPDxtlU89VwYRAevseWiVkp
 JyyJPB40UR4i2DO0P1+oWBBsGEG+bo8lZ1M+uxU5k6lgC0fAi96/O48mwwmI0Mtm
 P6BkWd3IkSXc7l4AGIrKa5P+pYEnm0Z6YAZILfdMVBcLXZWv1OAwEEkZNdUuhE3d
 5DxlQ8MLzlQIbe+LasiwdL9r606acFaJG9Hewl9x5DBlsObZ3d2rX4vEt1NEVh89
 shV29xm2AjCpLh4kstJFdTDCkDw/4H7TcFB/NMxXyzEXp3Bx8YXq+mjVd2mpq1FI
 hHtCsOA=
 =KiaU
 -----END PGP SIGNATURE-----

Merge tag 'v6.5-rc2' into sched/core, to pick up fixes

Sync with upstream fixes before applying EEVDF.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-07-19 09:43:25 +02:00
Andrei Vagin 6f63904c8f sched: add a few helpers to wake up tasks on the current cpu
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.

These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.

One example of such workloads is seccomp user notifies. This mechanism allows
the  supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.

On-CPU context switches are much faster than regular ones.

Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00
Peter Oskolkov ab83f455f0 sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17 16:08:08 -07:00