Scheduler changes for v6.6:

- The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler.
 
   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only.
   It completely reworks the base scheduler: placement, preemption,
   picking -- everything.
 
   LWN.net, as usual, has a terrific writeup about EEVDF:
 
      https://lwn.net/Articles/925371/
 
   Preemption (both tick and wakeup) is driven by testing against
   a fresh pick. Because the tree is now effectively an interval
   tree, and the selection is no longer the 'leftmost' task,
   over-scheduling is less of a problem. A lot of the CFS
   heuristics are removed or replaced by more natural latency-space
   parameters & constructs.
 
   In terms of expected performance regressions: we will and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases,
   but in some cases it won't, especially in adversarial loads that
   got lucky with the previous code, such as some variants of hackbench.
   We are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations
   of that process.
 
 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again).
 
 - Improve & fix bandwidth-scheduling on nohz systems.
 
 - Improve bandwidth-throttling.
 
 - Use lock guards to simplify and de-goto-ify control flow.
 
 - Misc improvements, cleanups and fixes.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>

Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:

 - The biggest change is introduction of a new iteration of the
   SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
   Deadline First") scheduler

   EEVDF too is a virtual-time scheduler, with two parameters (weight
   and relative deadline), compared to CFS that had weight only. It
   completely reworks the base scheduler: placement, preemption, picking
   -- everything

   LWN.net, as usual, has a terrific writeup about EEVDF:

      https://lwn.net/Articles/925371/

   Preemption (both tick and wakeup) is driven by testing against a
   fresh pick. Because the tree is now effectively an interval tree, and
   the selection is no longer the 'leftmost' task, over-scheduling is
   less of a problem. A lot of the CFS heuristics are removed or
   replaced by more natural latency-space parameters & constructs

   In terms of expected performance regressions: we will and can fix
   everything where a 'good' workload misbehaves with the new scheduler,
   but EEVDF inevitably changes workload scheduling in a binary fashion,
   hopefully for the better in the overwhelming majority of cases, but
   in some cases it won't, especially in adversarial loads that got
   lucky with the previous code, such as some variants of hackbench. We
   are trying hard to err on the side of fixing all performance
   regressions, but we expect some inevitable post-release iterations of
   that process

 - Improve load-balancing on hybrid x86 systems: enable cluster
   scheduling (again)

 - Improve & fix bandwidth-scheduling on nohz systems

 - Improve bandwidth-throttling

 - Use lock guards to simplify and de-goto-ify control flow

 - Misc improvements, cleanups and fixes
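
The pick rule named in the first item can be illustrated with a small sketch. It is not the kernel implementation: it assumes a flat array instead of the augmented rbtree, and toy_entity, toy_deadline() and toy_pick_eevdf() are hypothetical names. It follows the usual description of EEVDF, in which an entity is eligible when its vruntime is not ahead of the weight-averaged vruntime, and the eligible entity with the earliest virtual deadline is picked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entity; not the kernel's struct sched_entity. */
struct toy_entity {
	uint64_t weight;	/* relative CPU share */
	int64_t vruntime;	/* virtual runtime consumed so far */
	int64_t deadline;	/* virtual deadline */
};

/* Virtual deadline: vruntime plus the requested slice scaled by 1/weight. */
static int64_t toy_deadline(const struct toy_entity *se, int64_t slice)
{
	assert(se->weight > 0);
	return se->vruntime + slice / (int64_t)se->weight;
}

/* Return the eligible entity with the earliest deadline, or NULL if n == 0. */
static const struct toy_entity *
toy_pick_eevdf(const struct toy_entity *q, size_t n)
{
	const struct toy_entity *best = NULL;
	int64_t wsum = 0, total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		wsum += (int64_t)q[i].weight * q[i].vruntime;
		total += (int64_t)q[i].weight;
	}

	for (i = 0; i < n; i++) {
		/* Eligible iff vruntime <= weighted average (no division). */
		if (q[i].vruntime * total > wsum)
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}
```

With three equal-weight entities at vruntime 0, 10 and 20 and deadlines 30, 12 and 21, the average vruntime is 10, so the first two are eligible and the second one wins on deadline even though it is not the one with the smallest vruntime.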

* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
  sched/eevdf: Curb wakeup-preemption
  sched: Simplify sched_core_cpu_{starting,deactivate}()
  sched: Simplify try_steal_cookie()
  sched: Simplify sched_tick_remote()
  sched: Simplify sched_exec()
  sched: Simplify ttwu()
  sched: Simplify wake_up_if_idle()
  sched: Simplify: migrate_swap_stop()
  sched: Simplify sysctl_sched_uclamp_handler()
  sched: Simplify get_nohz_timer_target()
  sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
  sched/rt: Fix sysctl_sched_rr_timeslice intial value
  sched/fair: Block nohz tick_stop when cfs bandwidth in use
  sched, cgroup: Restore meaning to hierarchical_quota
  MAINTAINERS: Add Peter explicitly to the psi section
  sched/psi: Select KERNFS as needed
  sched/topology: Align group flags when removing degenerate domain
  sched/fair: remove util_est boosting
  sched/fair: Propagate enqueue flags into place_entity()
  ...
This commit is contained in:
Linus Torvalds 2023-08-28 16:43:39 -07:00
commit 3ca9a836ff
19 changed files with 1217 additions and 910 deletions

@@ -94,7 +94,7 @@ other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
 way the previous scheduler had, and has no heuristics whatsoever. There is
 only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
-/sys/kernel/debug/sched/min_granularity_ns
+/sys/kernel/debug/sched/base_slice_ns
 which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
 "server" (i.e., good batching) workloads. It defaults to a setting suitable

@@ -17057,6 +17057,7 @@ F: drivers/net/ppp/pptp.c
 PRESSURE STALL INFORMATION (PSI)
 M: Johannes Weiner <hannes@cmpxchg.org>
 M: Suren Baghdasaryan <surenb@google.com>
+R: Peter Ziljstra <peterz@infradead.org>
 S: Maintained
 F: include/linux/psi*
 F: kernel/sched/psi.c

@@ -624,14 +624,9 @@ static void __init build_sched_topology(void)
 	};
 #endif
 #ifdef CONFIG_SCHED_CLUSTER
-	/*
-	 * For now, skip the cluster domain on Hybrid.
-	 */
-	if (!cpu_feature_enabled(X86_FEATURE_HYBRID_CPU)) {
-		x86_topology[i++] = (struct sched_domain_topology_level){
-			cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
-		};
-	}
+	x86_topology[i++] = (struct sched_domain_topology_level){
+		cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS)
+	};
 #endif
 #ifdef CONFIG_SCHED_MC
 	x86_topology[i++] = (struct sched_domain_topology_level){

@@ -661,6 +661,8 @@ struct cgroup_subsys {
 	void (*css_rstat_flush)(struct cgroup_subsys_state *css, int cpu);
 	int (*css_extra_stat_show)(struct seq_file *seq,
 				   struct cgroup_subsys_state *css);
+	int (*css_local_stat_show)(struct seq_file *seq,
+				   struct cgroup_subsys_state *css);
 	int (*can_attach)(struct cgroup_taskset *tset);
 	void (*cancel_attach)(struct cgroup_taskset *tset);

@ -60,6 +60,32 @@ rb_insert_augmented_cached(struct rb_node *node,
rb_insert_augmented(node, &root->rb_root, augment);
}
static __always_inline struct rb_node *
rb_add_augmented_cached(struct rb_node *node, struct rb_root_cached *tree,
bool (*less)(struct rb_node *, const struct rb_node *),
const struct rb_augment_callbacks *augment)
{
struct rb_node **link = &tree->rb_root.rb_node;
struct rb_node *parent = NULL;
bool leftmost = true;
while (*link) {
parent = *link;
if (less(node, parent)) {
link = &parent->rb_left;
} else {
link = &parent->rb_right;
leftmost = false;
}
}
rb_link_node(node, parent, link);
augment->propagate(parent, NULL); /* suboptimal */
rb_insert_augmented_cached(node, tree, leftmost, augment);
return leftmost ? node : NULL;
}
/*
* Template for declaring augmented rbtree callbacks (generic case)
*
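
The insertion walk added in this hunk can be sketched in isolation. The version below is a simplified illustration, not the kernel helper: it uses a plain unbalanced binary tree with no rebalancing and no augment callbacks, and toy_node, toy_root_cached and toy_add_cached() are hypothetical names. What it keeps is the part that matters for the caller: the new node is the leftmost one only if the descent never went right, and in that case the node is returned, otherwise NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_node {
	struct toy_node *left, *right;
	int key;
};

struct toy_root_cached {
	struct toy_node *root;
	struct toy_node *leftmost;	/* cached smallest node */
};

static bool toy_less(const struct toy_node *a, const struct toy_node *b)
{
	return a->key < b->key;
}

/* Insert @node; return it if it became the leftmost node, else NULL. */
static struct toy_node *
toy_add_cached(struct toy_node *node, struct toy_root_cached *tree,
	       bool (*less)(const struct toy_node *, const struct toy_node *))
{
	struct toy_node **link = &tree->root;
	bool leftmost = true;

	assert(node && tree && less);
	node->left = node->right = NULL;

	while (*link) {
		struct toy_node *parent = *link;

		if (less(node, parent)) {
			link = &parent->left;
		} else {
			link = &parent->right;
			leftmost = false;	/* went right at least once */
		}
	}

	*link = node;
	if (leftmost)
		tree->leftmost = node;

	return leftmost ? node : NULL;
}
```

Returning the node only when it is the new leftmost lets a caller react to a changed minimum without a second lookup.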

@ -75,14 +75,14 @@ struct user_event_mm;
* Task state bitmask. NOTE! These bits are also
* encoded in fs/proc/array.c: get_task_state().
*
* We have two separate sets of flags: task->state
* We have two separate sets of flags: task->__state
* is about runnability, while task->exit_state are
* about the task exiting. Confusing, but this way
* modifying one set can't modify the other one by
* mistake.
*/
/* Used in tsk->state: */
/* Used in tsk->__state: */
#define TASK_RUNNING 0x00000000
#define TASK_INTERRUPTIBLE 0x00000001
#define TASK_UNINTERRUPTIBLE 0x00000002
@ -92,7 +92,7 @@ struct user_event_mm;
#define EXIT_DEAD 0x00000010
#define EXIT_ZOMBIE 0x00000020
#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
/* Used in tsk->__state again: */
#define TASK_PARKED 0x00000040
#define TASK_DEAD 0x00000080
#define TASK_WAKEKILL 0x00000100
@ -173,7 +173,7 @@ struct user_event_mm;
#endif
/*
* set_current_state() includes a barrier so that the write of current->state
* set_current_state() includes a barrier so that the write of current->__state
* is correctly serialised wrt the caller's subsequent test of whether to
* actually sleep:
*
@ -196,9 +196,9 @@ struct user_event_mm;
* wake_up_state(p, TASK_UNINTERRUPTIBLE);
*
* where wake_up_state()/try_to_wake_up() executes a full memory barrier before
* accessing p->state.
* accessing p->__state.
*
* Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
* Wakeup will do: if (@state & p->__state) p->__state = TASK_RUNNING, that is,
* once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
* TASK_RUNNING store which can collide with __set_current_state(TASK_RUNNING).
*
@ -549,13 +549,18 @@ struct sched_entity {
/* For load-balancing: */
struct load_weight load;
struct rb_node run_node;
u64 deadline;
u64 min_deadline;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
u64 vruntime;
s64 vlag;
u64 slice;
u64 nr_migrations;
@ -2433,9 +2438,11 @@ extern void sched_core_free(struct task_struct *tsk);
extern void sched_core_fork(struct task_struct *p);
extern int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
unsigned long uaddr);
extern int sched_core_idle_cpu(int cpu);
#else
static inline void sched_core_free(struct task_struct *tsk) { }
static inline void sched_core_fork(struct task_struct *p) { }
static inline int sched_core_idle_cpu(int cpu) { return idle_cpu(cpu); }
#endif
extern void sched_set_stop_task(int cpu, struct task_struct *stop);

@ -118,11 +118,47 @@ static inline struct task_struct *get_task_struct(struct task_struct *t)
}
extern void __put_task_struct(struct task_struct *t);
extern void __put_task_struct_rcu_cb(struct rcu_head *rhp);
static inline void put_task_struct(struct task_struct *t)
{
if (refcount_dec_and_test(&t->usage))
if (!refcount_dec_and_test(&t->usage))
return;
/*
* In !RT, it is always safe to call __put_task_struct().
* Under RT, we can only call it in preemptible context.
*/
if (!IS_ENABLED(CONFIG_PREEMPT_RT) || preemptible()) {
static DEFINE_WAIT_OVERRIDE_MAP(put_task_map, LD_WAIT_SLEEP);
lock_map_acquire_try(&put_task_map);
__put_task_struct(t);
lock_map_release(&put_task_map);
return;
}
/*
* under PREEMPT_RT, we can't call put_task_struct
* in atomic context because it will indirectly
* acquire sleeping locks.
*
* call_rcu() will schedule delayed_put_task_struct_rcu()
* to be called in process context.
*
* __put_task_struct() is called when
* refcount_dec_and_test(&t->usage) succeeds.
*
* This means that it can't "conflict" with
* put_task_struct_rcu_user() which abuses ->rcu the same
* way; rcu_users has a reference so task->usage can't be
* zero after rcu_users 1 -> 0 transition.
*
* delayed_free_task() also uses ->rcu, but it is only called
* when it fails to fork a process. Therefore, there is no
* way it can conflict with put_task_struct().
*/
call_rcu(&t->rcu, __put_task_struct_rcu_cb);
}
DEFINE_FREE(put_task, struct task_struct *, if (_T) put_task_struct(_T))
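
The control flow of the new put_task_struct() can be sketched in plain C. This is a single-threaded illustration under stated assumptions: toy_obj, toy_put() and toy_run_deferred() are hypothetical, a plain int stands in for refcount_t, the may_sleep argument stands in for the "!PREEMPT_RT or preemptible()" test, and a linked list drained later stands in for call_rcu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_obj {
	int refs;			/* plain int: single-threaded sketch only */
	bool freed;
	struct toy_obj *deferred_next;
};

static struct toy_obj *toy_deferred;	/* stand-in for a deferred-callback list */

static void toy_free(struct toy_obj *o)
{
	o->freed = true;
}

/* Drop one reference; free now if allowed, otherwise defer the free. */
static void toy_put(struct toy_obj *o, bool may_sleep)
{
	assert(o->refs > 0);

	if (--o->refs != 0)
		return;			/* not the last reference */

	if (may_sleep) {
		toy_free(o);		/* safe to release immediately */
		return;
	}

	/* Cannot sleep here: queue the object and release it later. */
	o->deferred_next = toy_deferred;
	toy_deferred = o;
}

/* Runs later, from a context where sleeping is allowed. */
static void toy_run_deferred(void)
{
	while (toy_deferred) {
		struct toy_obj *o = toy_deferred;

		toy_deferred = o->deferred_next;
		toy_free(o);
	}
}
```

The early return on a non-final reference mirrors the inverted refcount_dec_and_test() check in the hunk, which keeps the two release paths at the same indentation level.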

@@ -629,6 +629,7 @@ config TASK_IO_ACCOUNTING
 config PSI
 	bool "Pressure stall information tracking"
+	select KERNFS
 	help
 	  Collect metrics that indicate how overcommitted the CPU, memory,
 	  and IO capacity are in the system.

@ -3685,6 +3685,36 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
return ret;
}
static int __maybe_unused cgroup_local_stat_show(struct seq_file *seq,
struct cgroup *cgrp, int ssid)
{
struct cgroup_subsys *ss = cgroup_subsys[ssid];
struct cgroup_subsys_state *css;
int ret;
if (!ss->css_local_stat_show)
return 0;
css = cgroup_tryget_css(cgrp, ss);
if (!css)
return 0;
ret = ss->css_local_stat_show(seq, css);
css_put(css);
return ret;
}
static int cpu_local_stat_show(struct seq_file *seq, void *v)
{
struct cgroup __maybe_unused *cgrp = seq_css(seq)->cgroup;
int ret = 0;
#ifdef CONFIG_CGROUP_SCHED
ret = cgroup_local_stat_show(seq, cgrp, cpu_cgrp_id);
#endif
return ret;
}
#ifdef CONFIG_PSI
static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
@ -5235,6 +5265,10 @@ static struct cftype cgroup_base_files[] = {
.name = "cpu.stat",
.seq_show = cpu_stat_show,
},
{
.name = "cpu.stat.local",
.seq_show = cpu_local_stat_show,
},
{ } /* terminate */
};

@ -985,6 +985,14 @@ void __put_task_struct(struct task_struct *tsk)
}
EXPORT_SYMBOL_GPL(__put_task_struct);
void __put_task_struct_rcu_cb(struct rcu_head *rhp)
{
struct task_struct *task = container_of(rhp, struct task_struct, rcu);
__put_task_struct(task);
}
EXPORT_SYMBOL_GPL(__put_task_struct_rcu_cb);
void __init __weak arch_task_cache_init(void) { }
/*

@@ -1097,25 +1097,22 @@ int get_nohz_timer_target(void)
 	hk_mask = housekeeping_cpumask(HK_TYPE_TIMER);
-	rcu_read_lock();
+	guard(rcu)();
 	for_each_domain(cpu, sd) {
 		for_each_cpu_and(i, sched_domain_span(sd), hk_mask) {
 			if (cpu == i)
 				continue;
-			if (!idle_cpu(i)) {
-				cpu = i;
-				goto unlock;
-			}
+			if (!idle_cpu(i))
+				return i;
 		}
 	}
 	if (default_cpu == -1)
 		default_cpu = housekeeping_any_cpu(HK_TYPE_TIMER);
-	cpu = default_cpu;
-unlock:
-	rcu_read_unlock();
-	return cpu;
+	return default_cpu;
 }
 /*
@ -1194,6 +1191,20 @@ static void nohz_csd_func(void *info)
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
static inline bool __need_bw_check(struct rq *rq, struct task_struct *p)
{
if (rq->nr_running != 1)
return false;
if (p->sched_class != &fair_sched_class)
return false;
if (!task_on_rq_queued(p))
return false;
return true;
}
bool sched_can_stop_tick(struct rq *rq)
{
int fifo_nr_running;
@ -1229,6 +1240,18 @@ bool sched_can_stop_tick(struct rq *rq)
if (rq->nr_running > 1)
return false;
/*
* If there is one task and it has CFS runtime bandwidth constraints
* and it's on the cpu now we don't want to stop the tick.
* This check prevents clearing the bit if a newly enqueued task here is
* dequeued by migrating while the constrained task continues to run.
* E.g. going from 2->1 without going through pick_next_task().
*/
if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) {
if (cfs_task_bw_constrained(rq->curr))
return false;
}
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
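
The added tick-stop condition reduces to a small predicate. The sketch below is an illustration only: toy_rq is a hypothetical flattened view of the runqueue state the hunk looks at, the hz_bw_enabled argument stands in for sched_feat(HZ_BW), and the earlier checks in sched_can_stop_tick() that are not part of this hunk are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of a runqueue; not the kernel's struct rq. */
struct toy_rq {
	unsigned int nr_running;
	bool curr_is_fair;		/* current task is in the fair class */
	bool curr_queued;		/* current task is still on the runqueue */
	bool curr_bw_constrained;	/* a bandwidth quota applies to it */
};

/* Only a lone, queued fair task needs the bandwidth check. */
static bool toy_need_bw_check(const struct toy_rq *rq)
{
	return rq->nr_running == 1 && rq->curr_is_fair && rq->curr_queued;
}

static bool toy_can_stop_tick(const struct toy_rq *rq, bool hz_bw_enabled)
{
	assert(rq != NULL);

	if (rq->nr_running > 1)
		return false;

	/* A bandwidth-limited task still needs the tick to enforce its quota. */
	if (hz_bw_enabled && toy_need_bw_check(rq) && rq->curr_bw_constrained)
		return false;

	return true;
}
```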
@ -1804,7 +1827,8 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
int old_min, old_max, old_min_rt;
int result;
mutex_lock(&uclamp_mutex);
guard(mutex)(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;
old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
@ -1813,7 +1837,7 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
if (result)
goto undo;
if (!write)
goto done;
return 0;
if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max ||
sysctl_sched_uclamp_util_max > SCHED_CAPACITY_SCALE ||
@ -1849,16 +1873,12 @@ static int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
* Otherwise, keep it simple and do just a lazy update at each next
* task enqueue time.
*/
goto done;
return 0;
undo:
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
done:
mutex_unlock(&uclamp_mutex);
return result;
}
#endif
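
The guard() conversions in this file rely on scope-based cleanup. A minimal user-space sketch of the idea follows, assuming GCC or Clang (the cleanup attribute is a compiler extension). toy_lock, toy_guard() and toy_update() are hypothetical names, and the toy lock asserts instead of blocking.

```c
#include <assert.h>

/* Hypothetical toy lock; a real lock would block instead of asserting. */
struct toy_lock {
	int held;
	int acquisitions;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* The cleanup handler receives a pointer to the guarded variable. */
static void toy_guard_exit(struct toy_lock **lp)
{
	toy_lock_release(*lp);
}

#define TOY_CAT_(a, b) a##b
#define TOY_CAT(a, b) TOY_CAT_(a, b)

/* Take the lock now; drop it when the enclosing scope is left. */
#define toy_guard(l)							\
	struct toy_lock *TOY_CAT(toy_guard_, __LINE__)			\
		__attribute__((cleanup(toy_guard_exit), unused)) =	\
		(toy_lock_acquire(l), (l))

/* Every return path releases the lock without an explicit unlock label. */
static int toy_update(struct toy_lock *l, int *value, int new_value, int max)
{
	toy_guard(l);

	if (new_value > max)
		return -1;	/* lock dropped automatically */

	*value = new_value;
	return 0;		/* and here as well */
}
```

Because the release is tied to the variable's scope, the "goto done" and "goto unlock" exits seen in the removed lines can become plain return statements.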
@@ -3413,7 +3433,6 @@ static int migrate_swap_stop(void *data)
 {
 	struct migration_swap_arg *arg = data;
 	struct rq *src_rq, *dst_rq;
-	int ret = -EAGAIN;
 	if (!cpu_active(arg->src_cpu) || !cpu_active(arg->dst_cpu))
 		return -EAGAIN;
@@ -3421,33 +3440,25 @@ static int migrate_swap_stop(void *data)
 	src_rq = cpu_rq(arg->src_cpu);
 	dst_rq = cpu_rq(arg->dst_cpu);
-	double_raw_lock(&arg->src_task->pi_lock,
-			&arg->dst_task->pi_lock);
-	double_rq_lock(src_rq, dst_rq);
+	guard(double_raw_spinlock)(&arg->src_task->pi_lock, &arg->dst_task->pi_lock);
+	guard(double_rq_lock)(src_rq, dst_rq);
 	if (task_cpu(arg->dst_task) != arg->dst_cpu)
-		goto unlock;
+		return -EAGAIN;
 	if (task_cpu(arg->src_task) != arg->src_cpu)
-		goto unlock;
+		return -EAGAIN;
 	if (!cpumask_test_cpu(arg->dst_cpu, arg->src_task->cpus_ptr))
-		goto unlock;
+		return -EAGAIN;
 	if (!cpumask_test_cpu(arg->src_cpu, arg->dst_task->cpus_ptr))
-		goto unlock;
+		return -EAGAIN;
 	__migrate_swap_task(arg->src_task, arg->dst_cpu);
 	__migrate_swap_task(arg->dst_task, arg->src_cpu);
-	ret = 0;
-unlock:
-	double_rq_unlock(src_rq, dst_rq);
-	raw_spin_unlock(&arg->dst_task->pi_lock);
-	raw_spin_unlock(&arg->src_task->pi_lock);
-	return ret;
+	return 0;
 }
 /*
@ -3722,14 +3733,14 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
struct sched_domain *sd;
__schedstat_inc(p->stats.nr_wakeups_remote);
rcu_read_lock();
guard(rcu)();
for_each_domain(rq->cpu, sd) {
if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
__schedstat_inc(sd->ttwu_wake_remote);
break;
}
}
rcu_read_unlock();
}
if (wake_flags & WF_MIGRATED)
@@ -3928,21 +3939,13 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
 void wake_up_if_idle(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	struct rq_flags rf;
-	rcu_read_lock();
-	if (!is_idle_task(rcu_dereference(rq->curr)))
-		goto out;
-	rq_lock_irqsave(rq, &rf);
-	if (is_idle_task(rq->curr))
-		resched_curr(rq);
-	/* Else CPU is not idle, do nothing here: */
-	rq_unlock_irqrestore(rq, &rf);
-out:
-	rcu_read_unlock();
+	guard(rcu)();
+	if (is_idle_task(rcu_dereference(rq->curr))) {
+		guard(rq_lock_irqsave)(rq);
+		if (is_idle_task(rq->curr))
+			resched_curr(rq);
+	}
 }
 bool cpus_share_cache(int this_cpu, int that_cpu)
@ -4195,10 +4198,9 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
*/
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
guard(preempt)();
int cpu, success = 0;
preempt_disable();
if (p == current) {
/*
* We're waking current, this means 'p->on_rq' and 'task_cpu(p)
@ -4225,129 +4227,127 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* reordered with p->state check below. This pairs with smp_store_mb()
* in set_current_state() that the waiting thread does.
*/
raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();
if (!ttwu_state_match(p, state, &success))
goto unlock;
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
smp_mb__after_spinlock();
if (!ttwu_state_match(p, state, &success))
break;
trace_sched_waking(p);
trace_sched_waking(p);
/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
* be possible to, falsely, observe p->on_rq == 0 and get stuck
* in smp_cond_load_acquire() below.
*
* sched_ttwu_pending() try_to_wake_up()
* STORE p->on_rq = 1 LOAD p->state
* UNLOCK rq->lock
*
* __schedule() (switch to task 'p')
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* UNLOCK rq->lock
*
* [task p]
* STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* A similar smb_rmb() lives in try_invoke_on_locked_down_task().
*/
smp_rmb();
if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
goto unlock;
/*
* Ensure we load p->on_rq _after_ p->state, otherwise it would
* be possible to, falsely, observe p->on_rq == 0 and get stuck
* in smp_cond_load_acquire() below.
*
* sched_ttwu_pending() try_to_wake_up()
* STORE p->on_rq = 1 LOAD p->state
* UNLOCK rq->lock
*
* __schedule() (switch to task 'p')
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* UNLOCK rq->lock
*
* [task p]
* STORE p->state = UNINTERRUPTIBLE LOAD p->on_rq
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* A similar smb_rmb() lives in try_invoke_on_locked_down_task().
*/
smp_rmb();
if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
break;
#ifdef CONFIG_SMP
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
*
* One must be running (->on_cpu == 1) in order to remove oneself
* from the runqueue.
*
* __schedule() (switch to task 'p') try_to_wake_up()
* STORE p->on_cpu = 1 LOAD p->on_rq
* UNLOCK rq->lock
*
* __schedule() (put 'p' to sleep)
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* STORE p->on_rq = 0 LOAD p->on_cpu
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* Form a control-dep-acquire with p->on_rq == 0 above, to ensure
* schedule()'s deactivate_task() has 'happened' and p will no longer
* care about it's own p->state. See the comment in __schedule().
*/
smp_acquire__after_ctrl_dep();
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
*
* One must be running (->on_cpu == 1) in order to remove oneself
* from the runqueue.
*
* __schedule() (switch to task 'p') try_to_wake_up()
* STORE p->on_cpu = 1 LOAD p->on_rq
* UNLOCK rq->lock
*
* __schedule() (put 'p' to sleep)
* LOCK rq->lock smp_rmb();
* smp_mb__after_spinlock();
* STORE p->on_rq = 0 LOAD p->on_cpu
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
*
* Form a control-dep-acquire with p->on_rq == 0 above, to ensure
* schedule()'s deactivate_task() has 'happened' and p will no longer
* care about it's own p->state. See the comment in __schedule().
*/
smp_acquire__after_ctrl_dep();
/*
* We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
* == 0), which means we need to do an enqueue, change p->state to
* TASK_WAKING such that we can unlock p->pi_lock before doing the
* enqueue, such as ttwu_queue_wakelist().
*/
WRITE_ONCE(p->__state, TASK_WAKING);
/*
* We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
* == 0), which means we need to do an enqueue, change p->state to
* TASK_WAKING such that we can unlock p->pi_lock before doing the
* enqueue, such as ttwu_queue_wakelist().
*/
WRITE_ONCE(p->__state, TASK_WAKING);
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, considering queueing p on the remote CPUs wake_list
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
*
* Ensure we load task_cpu(p) after p->on_cpu:
*
* set_task_cpu(p, cpu);
* STORE p->cpu = @cpu
* __schedule() (switch to task 'p')
* LOCK rq->lock
* smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
* STORE p->on_cpu = 1 LOAD p->cpu
*
* to ensure we observe the correct CPU on which the task is currently
* scheduling.
*/
if (smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
goto unlock;
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, considering queueing p on the remote CPUs wake_list
* which potentially sends an IPI instead of spinning on p->on_cpu to
* let the waker make forward progress. This is safe because IRQs are
* disabled and the IPI will deliver after on_cpu is cleared.
*
* Ensure we load task_cpu(p) after p->on_cpu:
*
* set_task_cpu(p, cpu);
* STORE p->cpu = @cpu
* __schedule() (switch to task 'p')
* LOCK rq->lock
* smp_mb__after_spin_lock() smp_cond_load_acquire(&p->on_cpu)
* STORE p->on_cpu = 1 LOAD p->cpu
*
* to ensure we observe the correct CPU on which the task is currently
* scheduling.
*/
if (smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags))
break;
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, wait until it's done referencing the task.
*
* Pairs with the smp_store_release() in finish_task().
*
* This ensures that tasks getting woken will be fully ordered against
* their previous state and preserve Program Order.
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, wait until it's done referencing the task.
*
* Pairs with the smp_store_release() in finish_task().
*
* This ensures that tasks getting woken will be fully ordered against
* their previous state and preserve Program Order.
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
atomic_dec(&task_rq(p)->nr_iowait);
cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
atomic_dec(&task_rq(p)->nr_iowait);
}
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
}
#else
cpu = task_cpu(p);
cpu = task_cpu(p);
#endif /* CONFIG_SMP */
ttwu_queue(p, cpu, wake_flags);
unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
ttwu_queue(p, cpu, wake_flags);
}
out:
if (success)
ttwu_stat(p, task_cpu(p), wake_flags);
preempt_enable();
return success;
}
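
In the try_to_wake_up() conversion above, "goto unlock" becomes "break" inside scoped_guard. That works when the scoped form is built as a one-iteration for loop whose loop variable carries the cleanup handler. The sketch below illustrates that construction under stated assumptions: it needs GCC or Clang for the cleanup attribute, and toy_lock, toy_scoped_guard() and toy_wake() are hypothetical names, not the kernel macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy lock; a real lock would block instead of asserting. */
struct toy_lock {
	int held;
};

static struct toy_lock *toy_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	return l;
}

/* Cleanup handler: called with a pointer to the guarded variable. */
static void toy_release(struct toy_lock **lp)
{
	(*lp)->held = 0;
}

/*
 * One-iteration for loop: the lock is taken in the init clause and
 * released when the loop variable goes out of scope, whether the body
 * finishes normally or leaves early through break or return.
 */
#define toy_scoped_guard(l)						\
	for (struct toy_lock *toy_scope					\
		__attribute__((cleanup(toy_release))) = toy_acquire(l),	\
	     *toy_done = NULL; !toy_done; toy_done = toy_scope)

static int toy_wake(struct toy_lock *l, int state_matches, int *queued)
{
	int success = 0;

	toy_scoped_guard(l) {
		if (!state_matches)
			break;		/* leaves the scope, lock is dropped */
		success = 1;
		*queued += 1;
	}

	/* The lock is no longer held on either path. */
	assert(!l->held);
	return success;
}
```

Code that must run after the lock is dropped, such as the statistics update at the end of the function above, simply follows the guarded block.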
@ -4500,6 +4500,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
p->se.vlag = 0;
p->se.slice = sysctl_sched_base_slice;
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
@ -5495,23 +5497,20 @@ unsigned int nr_iowait(void)
void sched_exec(void)
{
struct task_struct *p = current;
unsigned long flags;
struct migration_arg arg;
int dest_cpu;
raw_spin_lock_irqsave(&p->pi_lock, flags);
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
if (dest_cpu == smp_processor_id())
goto unlock;
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), WF_EXEC);
if (dest_cpu == smp_processor_id())
return;
if (likely(cpu_active(dest_cpu))) {
struct migration_arg arg = { p, dest_cpu };
if (unlikely(!cpu_active(dest_cpu)))
return;
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
return;
arg = (struct migration_arg){ p, dest_cpu };
}
unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
stop_one_cpu(task_cpu(p), migration_cpu_stop, &arg);
}
#endif
@ -5721,9 +5720,6 @@ static void sched_tick_remote(struct work_struct *work)
struct tick_work *twork = container_of(dwork, struct tick_work, work);
int cpu = twork->cpu;
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr;
struct rq_flags rf;
u64 delta;
int os;
/*
@ -5733,30 +5729,26 @@ static void sched_tick_remote(struct work_struct *work)
* statistics and checks timeslices in a time-independent way, regardless
* of when exactly it is running.
*/
if (!tick_nohz_tick_stopped_cpu(cpu))
goto out_requeue;
if (tick_nohz_tick_stopped_cpu(cpu)) {
guard(rq_lock_irq)(rq);
struct task_struct *curr = rq->curr;
rq_lock_irq(rq, &rf);
curr = rq->curr;
if (cpu_is_offline(cpu))
goto out_unlock;
if (cpu_online(cpu)) {
update_rq_clock(rq);
update_rq_clock(rq);
if (!is_idle_task(curr)) {
/*
* Make sure the next tick runs within a
* reasonable amount of time.
*/
u64 delta = rq_clock_task(rq) - curr->se.exec_start;
WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
}
curr->sched_class->task_tick(rq, curr, 0);
if (!is_idle_task(curr)) {
/*
* Make sure the next tick runs within a reasonable
* amount of time.
*/
delta = rq_clock_task(rq) - curr->se.exec_start;
WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
calc_load_nohz_remote(rq);
}
}
curr->sched_class->task_tick(rq, curr, 0);
calc_load_nohz_remote(rq);
out_unlock:
rq_unlock_irq(rq, &rf);
out_requeue:
/*
* Run the remote tick once per second (1Hz). This arbitrary
@ -6305,19 +6297,19 @@ static bool try_steal_cookie(int this, int that)
unsigned long cookie;
bool success = false;
local_irq_disable();
double_rq_lock(dst, src);
guard(irq)();
guard(double_rq_lock)(dst, src);
cookie = dst->core->core_cookie;
if (!cookie)
goto unlock;
return false;
if (dst->curr != dst->idle)
goto unlock;
return false;
p = sched_core_find(src, cookie);
if (!p)
goto unlock;
return false;
do {
if (p == src->core_pick || p == src->curr)
@ -6329,9 +6321,10 @@ static bool try_steal_cookie(int this, int that)
if (p->core_occupation > dst->idle->core_occupation)
goto next;
/*
* sched_core_find() and sched_core_next() will ensure that task @p
* is not throttled now, we also need to check whether the runqueue
* of the destination CPU is being throttled.
* sched_core_find() and sched_core_next() will ensure
* that task @p is not throttled now, we also need to
* check whether the runqueue of the destination CPU is
* being throttled.
*/
if (sched_task_is_throttled(p, this))
goto next;
@ -6349,10 +6342,6 @@ static bool try_steal_cookie(int this, int that)
p = sched_core_next(p, cookie);
} while (p);
unlock:
double_rq_unlock(dst, src);
local_irq_enable();
return success;
}
@ -6410,20 +6399,24 @@ static void queue_core_balance(struct rq *rq)
queue_balance_callback(rq, &per_cpu(core_balance_head, rq->cpu), sched_core_balance);
}
DEFINE_LOCK_GUARD_1(core_lock, int,
sched_core_lock(*_T->lock, &_T->flags),
sched_core_unlock(*_T->lock, &_T->flags),
unsigned long flags)
static void sched_core_cpu_starting(unsigned int cpu)
{
const struct cpumask *smt_mask = cpu_smt_mask(cpu);
struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
unsigned long flags;
int t;
sched_core_lock(cpu, &flags);
guard(core_lock)(&cpu);
WARN_ON_ONCE(rq->core != rq);
/* if we're the first, we'll be our own leader */
if (cpumask_weight(smt_mask) == 1)
goto unlock;
return;
/* find the leader */
for_each_cpu(t, smt_mask) {
@ -6437,7 +6430,7 @@ static void sched_core_cpu_starting(unsigned int cpu)
}
if (WARN_ON_ONCE(!core_rq)) /* whoopsie */
goto unlock;
return;
/* install and validate core_rq */
for_each_cpu(t, smt_mask) {
@ -6448,29 +6441,25 @@ static void sched_core_cpu_starting(unsigned int cpu)
WARN_ON_ONCE(rq->core != core_rq);
}
unlock:
sched_core_unlock(cpu, &flags);
}
static void sched_core_cpu_deactivate(unsigned int cpu)
{
const struct cpumask *smt_mask = cpu_smt_mask(cpu);
struct rq *rq = cpu_rq(cpu), *core_rq = NULL;
unsigned long flags;
int t;
sched_core_lock(cpu, &flags);
guard(core_lock)(&cpu);
/* if we're the last man standing, nothing to do */
if (cpumask_weight(smt_mask) == 1) {
WARN_ON_ONCE(rq->core != rq);
goto unlock;
return;
}
/* if we're not the leader, nothing to do */
if (rq->core != rq)
goto unlock;
return;
/* find a new leader */
for_each_cpu(t, smt_mask) {
@ -6481,7 +6470,7 @@ static void sched_core_cpu_deactivate(unsigned int cpu)
}
if (WARN_ON_ONCE(!core_rq)) /* impossible */
goto unlock;
return;
/* copy the shared state to the new leader */
core_rq->core_task_seq = rq->core_task_seq;
@ -6503,9 +6492,6 @@ static void sched_core_cpu_deactivate(unsigned int cpu)
rq = cpu_rq(t);
rq->core = core_rq;
}
unlock:
sched_core_unlock(cpu, &flags);
}
static inline void sched_core_cpu_dying(unsigned int cpu)
@ -7382,6 +7368,19 @@ struct task_struct *idle_task(int cpu)
return cpu_rq(cpu)->idle;
}
#ifdef CONFIG_SCHED_CORE
int sched_core_idle_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
if (sched_core_enabled(rq) && rq->curr == rq->idle)
return 1;
return idle_cpu(cpu);
}
#endif
#ifdef CONFIG_SMP
/*
* This function computes an effective utilization for the given CPU, to be
@@ -9939,7 +9938,7 @@ void __init sched_init(void)
ptr += nr_cpu_ids * sizeof(void **);
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
init_cfs_bandwidth(&root_task_group.cfs_bandwidth);
init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -11073,11 +11072,16 @@ static int tg_cfs_schedulable_down(struct task_group *tg, void *data)
/*
* Ensure max(child_quota) <= parent_quota. On cgroup2,
* always take the min. On cgroup1, only inherit when no
* limit is set:
* always take the non-RUNTIME_INF min. On cgroup1, only
* inherit when no limit is set. In both cases this is used
* by the scheduler to determine if a given CFS task has a
* bandwidth constraint at some higher level.
*/
if (cgroup_subsys_on_dfl(cpu_cgrp_subsys)) {
quota = min(quota, parent_quota);
if (quota == RUNTIME_INF)
quota = parent_quota;
else if (parent_quota != RUNTIME_INF)
quota = min(quota, parent_quota);
} else {
if (quota == RUNTIME_INF)
quota = parent_quota;
@@ -11138,6 +11142,27 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
return 0;
}
static u64 throttled_time_self(struct task_group *tg)
{
int i;
u64 total = 0;
for_each_possible_cpu(i) {
total += READ_ONCE(tg->cfs_rq[i]->throttled_clock_self_time);
}
return total;
}
static int cpu_cfs_local_stat_show(struct seq_file *sf, void *v)
{
struct task_group *tg = css_tg(seq_css(sf));
seq_printf(sf, "throttled_time %llu\n", throttled_time_self(tg));
return 0;
}
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -11214,6 +11239,10 @@ static struct cftype cpu_legacy_files[] = {
.name = "stat",
.seq_show = cpu_cfs_stat_show,
},
{
.name = "stat.local",
.seq_show = cpu_cfs_local_stat_show,
},
#endif
#ifdef CONFIG_RT_GROUP_SCHED
{
@@ -11270,6 +11299,24 @@ static int cpu_extra_stat_show(struct seq_file *sf,
return 0;
}
static int cpu_local_stat_show(struct seq_file *sf,
struct cgroup_subsys_state *css)
{
#ifdef CONFIG_CFS_BANDWIDTH
{
struct task_group *tg = css_tg(css);
u64 throttled_self_usec;
throttled_self_usec = throttled_time_self(tg);
do_div(throttled_self_usec, NSEC_PER_USEC);
seq_printf(sf, "throttled_usec %llu\n",
throttled_self_usec);
}
#endif
return 0;
}
#ifdef CONFIG_FAIR_GROUP_SCHED
static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css,
struct cftype *cft)
@@ -11448,6 +11495,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.css_released = cpu_cgroup_css_released,
.css_free = cpu_cgroup_css_free,
.css_extra_stat_show = cpu_extra_stat_show,
.css_local_stat_show = cpu_local_stat_show,
#ifdef CONFIG_RT_GROUP_SCHED
.can_attach = cpu_cgroup_can_attach,
#endif

@@ -347,10 +347,7 @@ static __init int sched_init_debug(void)
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif
debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);
debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
@@ -427,6 +424,7 @@ static void register_sd(struct sched_domain *sd, struct dentry *parent)
#undef SDM
debugfs_create_file("flags", 0444, parent, &sd->flags, &sd_flags_fops);
debugfs_create_file("groups_flags", 0444, parent, &sd->groups->flags, &sd_flags_fops);
}
void update_sched_domain_debugfs(void)
@@ -581,9 +579,13 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
else
SEQ_printf(m, " %c", task_state_to_char(p));
SEQ_printf(m, " %15s %5d %9Ld.%06ld %9Ld %5d ",
SEQ_printf(m, "%15s %5d %9Ld.%06ld %c %9Ld.%06ld %9Ld.%06ld %9Ld.%06ld %9Ld %5d ",
p->comm, task_pid_nr(p),
SPLIT_NS(p->se.vruntime),
entity_eligible(cfs_rq_of(&p->se), &p->se) ? 'E' : 'N',
SPLIT_NS(p->se.deadline),
SPLIT_NS(p->se.slice),
SPLIT_NS(p->se.sum_exec_runtime),
(long long)(p->nvcsw + p->nivcsw),
p->prio);
@@ -626,10 +628,9 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
{
s64 MIN_vruntime = -1, min_vruntime, max_vruntime = -1,
spread, rq0_min_vruntime, spread0;
s64 left_vruntime = -1, min_vruntime, right_vruntime = -1, spread;
struct sched_entity *last, *first;
struct rq *rq = cpu_rq(cpu);
struct sched_entity *last;
unsigned long flags;
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -643,26 +644,25 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
SPLIT_NS(cfs_rq->exec_clock));
raw_spin_rq_lock_irqsave(rq, flags);
if (rb_first_cached(&cfs_rq->tasks_timeline))
MIN_vruntime = (__pick_first_entity(cfs_rq))->vruntime;
first = __pick_first_entity(cfs_rq);
if (first)
left_vruntime = first->vruntime;
last = __pick_last_entity(cfs_rq);
if (last)
max_vruntime = last->vruntime;
right_vruntime = last->vruntime;
min_vruntime = cfs_rq->min_vruntime;
rq0_min_vruntime = cpu_rq(0)->cfs.min_vruntime;
raw_spin_rq_unlock_irqrestore(rq, flags);
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "MIN_vruntime",
SPLIT_NS(MIN_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "left_vruntime",
SPLIT_NS(left_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "min_vruntime",
SPLIT_NS(min_vruntime));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "max_vruntime",
SPLIT_NS(max_vruntime));
spread = max_vruntime - MIN_vruntime;
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread",
SPLIT_NS(spread));
spread0 = min_vruntime - rq0_min_vruntime;
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread0",
SPLIT_NS(spread0));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "avg_vruntime",
SPLIT_NS(avg_vruntime(cfs_rq)));
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "right_vruntime",
SPLIT_NS(right_vruntime));
spread = right_vruntime - left_vruntime;
SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "spread", SPLIT_NS(spread));
SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over",
cfs_rq->nr_spread_over);
SEQ_printf(m, " .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
@@ -863,10 +863,7 @@ static void sched_debug_header(struct seq_file *m)
SEQ_printf(m, " .%-40s: %Ld\n", #x, (long long)(x))
#define PN(x) \
SEQ_printf(m, " .%-40s: %Ld.%06ld\n", #x, SPLIT_NS(x))
PN(sysctl_sched_latency);
PN(sysctl_sched_min_granularity);
PN(sysctl_sched_idle_min_granularity);
PN(sysctl_sched_wakeup_granularity);
PN(sysctl_sched_base_slice);
P(sysctl_sched_child_runs_first);
P(sysctl_sched_features);
#undef PN

File diff suppressed because it is too large

@@ -1,16 +1,12 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* Only give sleepers 50% of their service deficit. This allows
* them to run sooner, but does not allow tons of sleepers to
* rip the spread apart.
*/
SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true)
/*
* Place new tasks ahead so that they do not starve already running
* tasks
* Using the avg_vruntime, do the right thing and preserve lag across
* sleep+wake cycles. EEVDF placement strategy #1, #2 if disabled.
*/
SCHED_FEAT(START_DEBIT, true)
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
SCHED_FEAT(RUN_TO_PARITY, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
@@ -19,13 +15,6 @@ SCHED_FEAT(START_DEBIT, true)
*/
SCHED_FEAT(NEXT_BUDDY, false)
/*
* Prefer to schedule the task that ran last (when we did
* wake-preempt) as that likely will touch the same data, increases
* cache locality.
*/
SCHED_FEAT(LAST_BUDDY, true)
/*
* Consider buddies to be cache hot, decreases the likeliness of a
* cache buddy being migrated away, increases cache locality.
@@ -99,5 +88,4 @@ SCHED_FEAT(UTIL_EST_FASTUP, true)
SCHED_FEAT(LATENCY_WARN, false)
SCHED_FEAT(ALT_PERIOD, true)
SCHED_FEAT(BASE_SLICE, true)
SCHED_FEAT(HZ_BW, true)

@@ -140,7 +140,7 @@
static int psi_bug __read_mostly;
DEFINE_STATIC_KEY_FALSE(psi_disabled);
DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled);
static DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled);
#ifdef CONFIG_PSI_DEFAULT_DISABLED
static bool psi_enable;

@@ -25,7 +25,7 @@ unsigned int sysctl_sched_rt_period = 1000000;
int sysctl_sched_rt_runtime = 950000;
#ifdef CONFIG_SYSCTL
static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC * RR_TIMESLICE) / HZ;
static int sched_rt_handler(struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos);
static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
@@ -3062,6 +3062,9 @@ static int sched_rr_handler(struct ctl_table *table, int write, void *buffer,
sched_rr_timeslice =
sysctl_sched_rr_timeslice <= 0 ? RR_TIMESLICE :
msecs_to_jiffies(sysctl_sched_rr_timeslice);
if (sysctl_sched_rr_timeslice <= 0)
sysctl_sched_rr_timeslice = jiffies_to_msecs(RR_TIMESLICE);
}
mutex_unlock(&mutex);

@@ -454,11 +454,12 @@ extern void unregister_fair_sched_group(struct task_group *tg);
extern void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
struct sched_entity *se, int cpu,
struct sched_entity *parent);
extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent);
extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
extern bool cfs_task_bw_constrained(struct task_struct *p);
extern void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
struct sched_rt_entity *rt_se, int cpu,
@@ -494,6 +495,7 @@ static inline void set_task_rq_fair(struct sched_entity *se,
#else /* CONFIG_CGROUP_SCHED */
struct cfs_bandwidth { };
static inline bool cfs_task_bw_constrained(struct task_struct *p) { return false; }
#endif /* CONFIG_CGROUP_SCHED */
@@ -548,6 +550,9 @@ struct cfs_rq {
unsigned int idle_nr_running; /* SCHED_IDLE */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
s64 avg_vruntime;
u64 avg_load;
u64 exec_clock;
u64 min_vruntime;
#ifdef CONFIG_SCHED_CORE
@@ -567,8 +572,6 @@ struct cfs_rq {
*/
struct sched_entity *curr;
struct sched_entity *next;
struct sched_entity *last;
struct sched_entity *skip;
#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
@@ -636,6 +639,8 @@ struct cfs_rq {
u64 throttled_clock;
u64 throttled_clock_pelt;
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
int throttled;
int throttle_count;
struct list_head throttled_list;
@@ -1700,6 +1705,21 @@ rq_unlock(struct rq *rq, struct rq_flags *rf)
raw_spin_rq_unlock(rq);
}
DEFINE_LOCK_GUARD_1(rq_lock, struct rq,
rq_lock(_T->lock, &_T->rf),
rq_unlock(_T->lock, &_T->rf),
struct rq_flags rf)
DEFINE_LOCK_GUARD_1(rq_lock_irq, struct rq,
rq_lock_irq(_T->lock, &_T->rf),
rq_unlock_irq(_T->lock, &_T->rf),
struct rq_flags rf)
DEFINE_LOCK_GUARD_1(rq_lock_irqsave, struct rq,
rq_lock_irqsave(_T->lock, &_T->rf),
rq_unlock_irqrestore(_T->lock, &_T->rf),
struct rq_flags rf)
static inline struct rq *
this_rq_lock_irq(struct rq_flags *rf)
__acquires(rq->lock)
@@ -1882,6 +1902,7 @@ struct sched_group {
atomic_t ref;
unsigned int group_weight;
unsigned int cores;
struct sched_group_capacity *sgc;
int asym_prefer_cpu; /* CPU of highest priority in group */
int flags;
@@ -2196,6 +2217,7 @@ extern const u32 sched_prio_to_wmult[40];
#else
#define ENQUEUE_MIGRATED 0x00
#endif
#define ENQUEUE_INITIAL 0x80
#define RETRY_TASK ((void *)-1UL)
@@ -2500,11 +2522,9 @@ extern void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags);
extern const_debug unsigned int sysctl_sched_nr_migrate;
extern const_debug unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_base_slice;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_idle_min_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
extern int sysctl_resched_latency_warn_ms;
extern int sysctl_resched_latency_warn_once;
@@ -2610,6 +2630,12 @@ static inline void double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2)
static inline void double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2) {}
#endif
#define DEFINE_LOCK_GUARD_2(name, type, _lock, _unlock, ...) \
__DEFINE_UNLOCK_GUARD(name, type, _unlock, type *lock2; __VA_ARGS__) \
static inline class_##name##_t class_##name##_constructor(type *lock, type *lock2) \
{ class_##name##_t _t = { .lock = lock, .lock2 = lock2 }, *_T = &_t; \
_lock; return _t; }
#ifdef CONFIG_SMP
static inline bool rq_order_less(struct rq *rq1, struct rq *rq2)
@@ -2739,6 +2765,16 @@ static inline void double_raw_lock(raw_spinlock_t *l1, raw_spinlock_t *l2)
raw_spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
}
static inline void double_raw_unlock(raw_spinlock_t *l1, raw_spinlock_t *l2)
{
raw_spin_unlock(l1);
raw_spin_unlock(l2);
}
DEFINE_LOCK_GUARD_2(double_raw_spinlock, raw_spinlock_t,
double_raw_lock(_T->lock, _T->lock2),
double_raw_unlock(_T->lock, _T->lock2))
/*
* double_rq_unlock - safely unlock two runqueues
*
@@ -2796,6 +2832,10 @@ static inline void double_rq_unlock(struct rq *rq1, struct rq *rq2)
#endif
DEFINE_LOCK_GUARD_2(double_rq_lock, struct rq,
double_rq_lock(_T->lock, _T->lock2),
double_rq_unlock(_T->lock, _T->lock2))
extern struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq);
extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
@@ -3483,4 +3523,7 @@ static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
static inline void init_sched_mm_cid(struct task_struct *t) { }
#endif
extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
#endif /* _KERNEL_SCHED_SCHED_H */

@@ -722,8 +722,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (parent->parent) {
parent->parent->child = tmp;
if (tmp->flags & SD_SHARE_CPUCAPACITY)
parent->parent->groups->flags |= SD_SHARE_CPUCAPACITY;
parent->parent->groups->flags = tmp->flags;
}
/*
@@ -1275,14 +1274,24 @@ build_sched_groups(struct sched_domain *sd, int cpu)
static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
{
struct sched_group *sg = sd->groups;
struct cpumask *mask = sched_domains_tmpmask2;
WARN_ON(!sg);
do {
int cpu, max_cpu = -1;
int cpu, cores = 0, max_cpu = -1;
sg->group_weight = cpumask_weight(sched_group_span(sg));
cpumask_copy(mask, sched_group_span(sg));
for_each_cpu(cpu, mask) {
cores++;
#ifdef CONFIG_SCHED_SMT
cpumask_andnot(mask, mask, cpu_smt_mask(cpu));
#endif
}
sg->cores = cores;
if (!(sd->flags & SD_ASYM_PACKING))
goto next;

@@ -612,7 +612,7 @@ static inline void tick_irq_exit(void)
int cpu = smp_processor_id();
/* Make sure that timer wheel updates are propagated */
if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
if ((sched_core_idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) {
if (!in_hardirq())
tick_nohz_irq_exit();
}