A large set of updates and features for timers and timekeeping:

- The hierarchical timer pull model
 
     When timer wheel timers are armed they are placed into the timer wheel
     of a CPU which is likely to be busy at the time of expiry. This is done
     to avoid wakeups on potentially idle CPUs.
 
     This is wrong in several aspects:
 
      1) The heuristics to select the target CPU are wrong by
         definition as the chance to get the prediction right is close
         to zero.
 
      2) Due to #1 it is possible that timers are accumulated on a
         single target CPU
 
      3) The required computation in the enqueue path is just overhead for
         dubious value especially under the consideration that the vast
         majority of timer wheel timers are either canceled or rearmed
         before they expire.
 
     The timer pull model avoids the above by removing the target
     computation on enqueue and queueing timers always on the CPU on which
     they get armed.
 
     This is achieved by having separate wheels for CPU pinned timers and
     global timers which do not care about where they expire.
 
     As long as a CPU is busy it handles both the pinned and the global
     timers which are queued on the CPU local timer wheels.
 
     When a CPU goes idle it evaluates its own timer wheels:
 
       - If the first expiring timer is a pinned timer, then the global
         timers can be ignored as the CPU will wake up before they expire.
 
       - If the first expiring timer is a global timer, then the expiry time
         is propagated into the timer pull hierarchy and the CPU makes sure
         to wake up for the first pinned timer.
 
     The timer pull hierarchy organizes CPUs in groups of eight at the
     lowest level and at the next levels groups of eight groups up to the
     point where no further aggregation of groups is required, i.e. the
     number of levels is log8(NR_CPUS). The magic number of eight has been
     established by experimentation, but can be adjusted if needed.
 
     In each group one busy CPU acts as the migrator. It's only one CPU to
     avoid lock contention on remote timer wheels.
 
     The migrator CPU checks in its own timer wheel handling whether there
     are other CPUs in the group which have gone idle and have global timers
     to expire. If there are global timers to expire, the migrator locks the
     remote CPU timer wheel and handles the expiry.
 
     Depending on the group level in the hierarchy this handling can require
     to walk the hierarchy downwards to the CPU level.
 
     Special care is taken when the last CPU goes idle. At this point the
     CPU is the systemwide migrator at the top of the hierarchy and it
     therefore cannot delegate to the hierarchy. It needs to arm its own
     timer device to expire either at the first expiring timer in the
     hierarchy or at the first CPU local timer, whichever expires first.
 
     This completely removes the overhead from the enqueue path, which is
     e.g. for networking a true hotpath and trades it for a slightly more
     complex idle path.
 
     This has been in development for a couple of years and the final series
     has been extensively tested by various teams from silicon vendors and
     run through extensive CI.
 
     There have been slight performance improvements observed on network
     centric workloads and an Intel team confirmed that this allows them to
     power down a die completely on a multi-die socket for the first time in
     a mostly idle scenario.
 
     There is only one outstanding ~1.5% regression on a specific overloaded
     netperf test which is currently being investigated, but the rest is either
     positive or neutral performance wise and positive on the power
     management side.
 
   - Fixes for the timekeeping interpolation code for cross-timestamps:
 
     Cross-timestamps are used for PTP to get snapshots from hardware timers
     and interpolate them back to clock MONOTONIC. The changes address a
     few corner cases in the interpolation code which got the math and logic
     wrong.
 
   - Simplification of the clocksource watchdog retry logic to automatically
     adjust to handle larger systems correctly instead of having more
     incomprehensible command line parameters.
 
   - Treewide consolidation of the VDSO data structures.
 
   - The usual small improvements and cleanups all over the place.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmXuAN0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVKXEADIR45rjR1Xtz32js7B53Y65O4WNoOQ
 6/ycWcswuGzg/h4QUpPSJ6gOGVmKSWwZi4n0P/VadCiXGSPPm0aUKsoRUt9DZsPY
 mtj2wjCSXKXiyhTl9OtrZME86ZAIGO1dQXa/sOHsiP5PCjgQkD0b5CYi1+B6eHDt
 1/Uo2Tb9g8VAPppq20V5Uo93GrPf642oyi3FCFrR1M112Uuak5DmqHJYiDpreNcG
 D5SgI+ykSiaUaVyHifvqijoJk0rYXkqEC6evl02477lJ/X0vVo2/M8XPS95BxHST
 s5Iruo4rP+qeAy8QvhZpoPX59fO0m/AgA7cf77XXAtOpVdLH+bs4ILsEbouAIOtv
 lsmRkcYt+TpvrZFHPAxks+6g3afuROiDtxD5sXXpVWxvofi8FwWqubdlqdsbw9MP
 ZCTNyzNyKL47QeDwBfSynYUL1RSyqsphtIwk4oeQklH9rwMAnW21hi30z15hQ0pQ
 FOVkmcwi79JNvl/G+jRkDzw7r8/zcHshWdSjyUM04CDjjnCDjQOFWSIjEPwbQjjz
 S4HXpJKJW963dBgs9Z84/Ctw1GwoBk1qedDWDJE1257Qvmo/Wpe/7GddWcazOGnN
 RRFMzGPbOqBDbjtErOKGU+iCisgNEvz2XK+TI16uRjWde7DxZpiTVYgNDrZ+/Pyh
 rQ23UBms6ZRR+A==
 =iQlu
 -----END PGP SIGNATURE-----

Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer updates from Thomas Gleixner:
 "A large set of updates and features for timers and timekeeping:

   - The hierarchical timer pull model

     When timer wheel timers are armed they are placed into the timer
     wheel of a CPU which is likely to be busy at the time of expiry.
     This is done to avoid wakeups on potentially idle CPUs.

     This is wrong in several aspects:

       1) The heuristics to select the target CPU are wrong by
          definition as the chance to get the prediction right is
          close to zero.

       2) Due to #1 it is possible that timers are accumulated on
          a single target CPU

       3) The required computation in the enqueue path is just overhead
          for dubious value especially under the consideration that the
          vast majority of timer wheel timers are either canceled or
          rearmed before they expire.

     The timer pull model avoids the above by removing the target
     computation on enqueue and queueing timers always on the CPU on
     which they get armed.

     This is achieved by having separate wheels for CPU pinned timers
     and global timers which do not care about where they expire.

     As long as a CPU is busy it handles both the pinned and the global
     timers which are queued on the CPU local timer wheels.

     When a CPU goes idle it evaluates its own timer wheels:

       - If the first expiring timer is a pinned timer, then the global
         timers can be ignored as the CPU will wake up before they
         expire.

       - If the first expiring timer is a global timer, then the expiry
         time is propagated into the timer pull hierarchy and the CPU
         makes sure to wake up for the first pinned timer.

     The timer pull hierarchy organizes CPUs in groups of eight at the
     lowest level and at the next levels groups of eight groups up to
     the point where no further aggregation of groups is required, i.e.
     the number of levels is log8(NR_CPUS). The magic number of eight
     has been established by experimentation, but can be adjusted if
     needed.

     In each group one busy CPU acts as the migrator. It's only one CPU
     to avoid lock contention on remote timer wheels.

     The migrator CPU checks in its own timer wheel handling whether
     there are other CPUs in the group which have gone idle and have
     global timers to expire. If there are global timers to expire, the
     migrator locks the remote CPU timer wheel and handles the expiry.

     Depending on the group level in the hierarchy this handling can
     require to walk the hierarchy downwards to the CPU level.

     Special care is taken when the last CPU goes idle. At this point
     the CPU is the systemwide migrator at the top of the hierarchy and
     it therefore cannot delegate to the hierarchy. It needs to arm its
     own timer device to expire either at the first expiring timer in
     the hierarchy or at the first CPU local timer, whichever expires
     first.

     This completely removes the overhead from the enqueue path, which
     is e.g. for networking a true hotpath and trades it for a slightly
     more complex idle path.

     This has been in development for a couple of years and the final
     series has been extensively tested by various teams from silicon
     vendors and run through extensive CI.

     There have been slight performance improvements observed on network
     centric workloads and an Intel team confirmed that this allows them
     to power down a die completely on a multi-die socket for the first
     time in a mostly idle scenario.

     There is only one outstanding ~1.5% regression on a specific
     overloaded netperf test which is currently being investigated, but the
     rest is either positive or neutral performance wise and positive on
     the power management side.

   - Fixes for the timekeeping interpolation code for cross-timestamps:

     Cross-timestamps are used for PTP to get snapshots from hardware
     timers and interpolate them back to clock MONOTONIC. The changes
     address a few corner cases in the interpolation code which got the
     math and logic wrong.

   - Simplification of the clocksource watchdog retry logic to
     automatically adjust to handle larger systems correctly instead of
     having more incomprehensible command line parameters.

   - Treewide consolidation of the VDSO data structures.

   - The usual small improvements and cleanups all over the place"
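
The arithmetic behind the timer pull model described above can be sketched in
plain user-space C. This is only an illustration under stated assumptions, not
the code from the series: the names GROUP_SIZE, pull_hierarchy_levels(),
evaluate_idle() and last_cpu_wakeup() are hypothetical, the level count is
rounded up with a minimum of one level, and expiry times are plain integers.

```c
#include <stdbool.h>
#include <stdint.h>

#define GROUP_SIZE 8	/* children per group, the "magic number" from the text */

/*
 * Number of levels needed until a single top-level group covers all CPUs,
 * i.e. log8(nr_cpus) rounded up (assumption: at least one level).
 */
static unsigned int pull_hierarchy_levels(unsigned int nr_cpus)
{
	unsigned int levels = 1;
	uint64_t covered = GROUP_SIZE;

	while (covered < nr_cpus) {
		covered *= GROUP_SIZE;
		levels++;
	}
	return levels;
}

/* Hypothetical result of a CPU evaluating its own wheels when going idle. */
struct idle_plan {
	uint64_t local_wakeup;		/* expiry the CPU arms for itself */
	bool propagate_global;		/* hand the global expiry to the hierarchy? */
};

static struct idle_plan evaluate_idle(uint64_t next_pinned, uint64_t next_global)
{
	struct idle_plan plan = {
		/* The CPU always wakes up for its first pinned timer. */
		.local_wakeup = next_pinned,
		/* A global timer only matters if it expires before that wakeup. */
		.propagate_global = next_global < next_pinned,
	};
	return plan;
}

/* The last CPU going idle cannot delegate: it takes the earlier expiry itself. */
static uint64_t last_cpu_wakeup(uint64_t first_local, uint64_t first_in_hierarchy)
{
	return first_local < first_in_hierarchy ? first_local : first_in_hierarchy;
}
```

With these assumptions, 64 CPUs need two levels and 65 CPUs need three, which
matches the "groups of eight groups" description.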

* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
  timer/migration: Fix quick check reporting late expiry
  tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
  vdso/datapage: Quick fix - use asm/page-def.h for ARM64
  timers: Assert no next dyntick timer look-up while CPU is offline
  tick: Assume timekeeping is correctly handed over upon last offline idle call
  tick: Shut down low-res tick from dying CPU
  tick: Split nohz and highres features from nohz_mode
  tick: Move individual bit features to debuggable mask accesses
  tick: Move got_idle_tick away from common flags
  tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
  tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
  tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
  tick: Start centralizing tick related CPU hotplug operations
  tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
  tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
  tick: Use IS_ENABLED() whenever possible
  tick/sched: Remove useless oneshot ifdeffery
  tick/nohz: Remove duplicate between lowres and highres handlers
  tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
  hrtimer: Select housekeeping CPU during migration
  ...
Linus Torvalds 2024-03-11 14:38:26 -07:00
commit d08c407f71
43 changed files with 3210 additions and 574 deletions

@ -680,12 +680,6 @@
loops can be debugged more effectively on production
systems.
clocksource.max_cswd_read_retries= [KNL]
Number of clocksource_watchdog() retries due to
external delays before the clock will be marked
unstable. Defaults to two retries, that is,
three attempts to read the clock under test.
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
marked with CLOCK_SOURCE_VERIFY_PERCPU that

@ -17503,6 +17503,7 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
F: fs/timerfd.c
F: include/linux/time_namespace.h
F: include/linux/timer*
F: include/trace/events/timer*
F: kernel/time/*timer*
F: kernel/time/namespace.c

@ -4,7 +4,6 @@
#include <asm/auxvec.h>
#include <asm/hwcap.h>
#include <asm/vdso_datapage.h>
/*
* ELF register definitions..

@ -1,26 +0,0 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
* Adapted from arm64 version.
*
* Copyright (C) 2012 ARM Limited
*/
#ifndef __ASM_VDSO_DATAPAGE_H
#define __ASM_VDSO_DATAPAGE_H
#ifdef __KERNEL__
#ifndef __ASSEMBLY__
#include <vdso/datapage.h>
#include <asm/page.h>
union vdso_data_store {
struct vdso_data data[CS_BASES];
u8 page[PAGE_SIZE];
};
#endif /* !__ASSEMBLY__ */
#endif /* __KERNEL__ */
#endif /* __ASM_VDSO_DATAPAGE_H */
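
The union removed here is the pattern that the consolidation turns into one
shared definition. Overlaying the data array with a PAGE_SIZE byte array makes
the object span a full page, regardless of how small struct vdso_data is. The
user-space sketch below illustrates that property only; DEMO_PAGE_SIZE,
DEMO_CS_BASES and the two-field struct are assumptions standing in for the
kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	4096	/* assumption: 4 KiB pages */
#define DEMO_CS_BASES	2	/* assumption: stand-in for CS_BASES */

/* Hypothetical stand-in; the real layout comes from <vdso/datapage.h>. */
struct vdso_data {
	uint32_t seq;
	uint64_t cycle_last;
};

/* Same shape as the union in the hunk above. */
union vdso_data_store {
	struct vdso_data data[DEMO_CS_BASES];
	uint8_t page[DEMO_PAGE_SIZE];
};

/* The byte array pads the union to exactly one page. */
_Static_assert(sizeof(union vdso_data_store) == DEMO_PAGE_SIZE,
	       "store must span exactly one page");

static union vdso_data_store vdso_data_store;

/* data is an array, so it decays to a pointer without the '&' operator. */
static struct vdso_data *vdso_data = vdso_data_store.data;
```

This also shows why several hunks change `&vdso_data_store.data` to
`vdso_data_store.data`: once the member is an array rather than a single
struct, the array name already yields the pointer.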

@ -21,10 +21,12 @@
#include <asm/mpu.h>
#include <asm/procinfo.h>
#include <asm/suspend.h>
#include <asm/vdso_datapage.h>
#include <asm/hardware/cache-l2x0.h>
#include <linux/kbuild.h>
#include <linux/arm-smccc.h>
#include <vdso/datapage.h>
#include "signal.h"
/*

@ -21,7 +21,6 @@
#include <asm/cacheflush.h>
#include <asm/page.h>
#include <asm/vdso.h>
#include <asm/vdso_datapage.h>
#include <clocksource/arm_arch_timer.h>
#include <vdso/helpers.h>
#include <vdso/vsyscall.h>
@ -35,9 +34,6 @@ extern char vdso_start[], vdso_end[];
/* Total number of pages needed for the data and text portions of the VDSO. */
unsigned int vdso_total_pages __ro_after_init;
/*
* The VDSO data page.
*/
static union vdso_data_store vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = vdso_data_store.data;

@ -69,10 +69,7 @@ static struct vdso_abi_info vdso_info[] __ro_after_init = {
/*
* The vDSO data page.
*/
static union {
struct vdso_data data[CS_BASES];
u8 page[PAGE_SIZE];
} vdso_data_store __page_aligned_data;
static union vdso_data_store vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = vdso_data_store.data;
static int vdso_mremap(const struct vm_special_mapping *sm,

@ -5,11 +5,6 @@
#include <linux/types.h>
#ifndef GENERIC_TIME_VSYSCALL
struct vdso_data {
};
#endif
/*
* The VDSO symbols are mapped into Linux so we can just use regular symbol
* addressing to get their offsets in userspace. The symbols are mapped at an

@ -8,25 +8,15 @@
#include <linux/slab.h>
#include <asm/page.h>
#ifdef GENERIC_TIME_VSYSCALL
#include <vdso/datapage.h>
#else
#include <asm/vdso.h>
#endif
extern char vdso_start[], vdso_end[];
static unsigned int vdso_pages;
static struct page **vdso_pagelist;
/*
* The vDSO data page.
*/
static union {
struct vdso_data data;
u8 page[PAGE_SIZE];
} vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = &vdso_data_store.data;
static union vdso_data_store vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = vdso_data_store.data;
static int __init vdso_init(void)
{

@ -21,15 +21,13 @@
#include <asm/vdso.h>
#include <vdso/helpers.h>
#include <vdso/vsyscall.h>
#include <vdso/datapage.h>
#include <generated/vdso-offsets.h>
extern char vdso_start[], vdso_end[];
/* Kernel-provided data used by the VDSO. */
static union {
u8 page[PAGE_SIZE];
struct vdso_data data[CS_BASES];
} generic_vdso_data __page_aligned_data;
static union vdso_data_store generic_vdso_data __page_aligned_data;
static union {
u8 page[LOONGARCH_VDSO_DATA_SIZE];

@ -50,9 +50,4 @@ extern struct mips_vdso_image vdso_image_o32;
extern struct mips_vdso_image vdso_image_n32;
#endif
union mips_vdso_data {
struct vdso_data data[CS_BASES];
u8 page[PAGE_SIZE];
};
#endif /* __ASM_VDSO_H */

@ -24,7 +24,7 @@
#include <vdso/vsyscall.h>
/* Kernel-provided data used by the VDSO. */
static union mips_vdso_data mips_vdso_data __page_aligned_data;
static union vdso_data_store mips_vdso_data __page_aligned_data;
struct vdso_data *vdso_data = mips_vdso_data.data;
/*

@ -30,14 +30,8 @@ enum rv_vdso_map {
#define VVAR_SIZE (VVAR_NR_PAGES << PAGE_SHIFT)
/*
* The vDSO data page.
*/
static union {
struct vdso_data data;
u8 page[PAGE_SIZE];
} vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = &vdso_data_store.data;
static union vdso_data_store vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = vdso_data_store.data;
struct __vdso_info {
const char *name;

@ -3,7 +3,6 @@
#define __S390_ASM_VDSO_DATA_H
#include <linux/types.h>
#include <vdso/datapage.h>
struct arch_vdso_data {
__s64 tod_steering_delta;

@ -25,10 +25,7 @@ extern char vdso32_start[], vdso32_end[];
static struct vm_special_mapping vvar_mapping;
static union {
struct vdso_data data[CS_BASES];
u8 page[PAGE_SIZE];
} vdso_data_store __page_aligned_data;
static union vdso_data_store vdso_data_store __page_aligned_data;
struct vdso_data *vdso_data = vdso_data_store.data;

@ -291,7 +291,19 @@ static inline void timer_probe(void) {}
#define TIMER_ACPI_DECLARE(name, table_id, fn) \
ACPI_DECLARE_PROBE_ENTRY(timer, name, table_id, 0, NULL, 0, fn)
extern ulong max_cswd_read_retries;
static inline unsigned int clocksource_get_max_watchdog_retry(void)
{
/*
* When system is in the boot phase or under heavy workload, there
* can be random big latencies during the clocksource/watchdog
* read, so allow retries to filter the noise latency. As the
* latency's frequency and maximum value goes up with the number of
* CPUs, scale the number of retries with the number of online
* CPUs.
*/
return (ilog2(num_online_cpus()) / 2) + 1;
}
void clocksource_verify_percpu(struct clocksource *cs);
#endif /* _LINUX_CLOCKSOURCE_H */
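
To see how the new retry count scales, the formula from
clocksource_get_max_watchdog_retry() can be reproduced in user space. This is a
sketch only: demo_ilog2() is a hypothetical stand-in for the kernel's ilog2()
(valid for n >= 1), and the CPU count is passed in as a parameter instead of
being read from num_online_cpus().

```c
/* Integer log2, rounded down; stand-in for the kernel's ilog2(), n >= 1. */
static unsigned int demo_ilog2(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Same expression as in the hunk above, with the CPU count as an argument. */
static unsigned int demo_max_watchdog_retry(unsigned int online_cpus)
{
	return (demo_ilog2(online_cpus) / 2) + 1;
}
```

Under this formula a system with 4 to 15 online CPUs gets two retries, which is
the default documented for the removed clocksource.max_cswd_read_retries
parameter, while 64 CPUs get four and 1024 CPUs get six.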

@ -184,6 +184,7 @@ enum cpuhp_state {
CPUHP_AP_ARM64_ISNDEP_STARTING,
CPUHP_AP_SMPCFD_DYING,
CPUHP_AP_HRTIMERS_DYING,
CPUHP_AP_TICK_DYING,
CPUHP_AP_X86_TBOOT_DYING,
CPUHP_AP_ARM_CACHE_B15_RAC_DYING,
CPUHP_AP_ONLINE,
@ -231,6 +232,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
CPUHP_AP_PERF_POWERPC_HV_GPCI_ONLINE,
CPUHP_AP_PERF_CSKY_ONLINE,
CPUHP_AP_TMIGR_ONLINE,
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RANDOM_ONLINE,

@ -18,12 +18,8 @@
#include <linux/list.h>
#include <linux/percpu-defs.h>
#include <linux/rbtree.h>
#include <linux/seqlock.h>
#include <linux/timer.h>
struct hrtimer_clock_base;
struct hrtimer_cpu_base;
/*
* Mode arguments of xxx_hrtimer functions:
*
@ -98,107 +94,6 @@ struct hrtimer_sleeper {
struct task_struct *task;
};
#ifdef CONFIG_64BIT
# define __hrtimer_clock_base_align ____cacheline_aligned
#else
# define __hrtimer_clock_base_align
#endif
/**
* struct hrtimer_clock_base - the timer base for a specific clock
* @cpu_base: per cpu clock base
* @index: clock type index for per_cpu support when moving a
* timer to a base on another cpu.
* @clockid: clock id for per_cpu support
* @seq: seqcount around __run_hrtimer
* @running: pointer to the currently running hrtimer
* @active: red black tree root node for the active timers
* @get_time: function to retrieve the current time of the clock
* @offset: offset of this clock to the monotonic base
*/
struct hrtimer_clock_base {
struct hrtimer_cpu_base *cpu_base;
unsigned int index;
clockid_t clockid;
seqcount_raw_spinlock_t seq;
struct hrtimer *running;
struct timerqueue_head active;
ktime_t (*get_time)(void);
ktime_t offset;
} __hrtimer_clock_base_align;
enum hrtimer_base_type {
HRTIMER_BASE_MONOTONIC,
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_BOOTTIME,
HRTIMER_BASE_TAI,
HRTIMER_BASE_MONOTONIC_SOFT,
HRTIMER_BASE_REALTIME_SOFT,
HRTIMER_BASE_BOOTTIME_SOFT,
HRTIMER_BASE_TAI_SOFT,
HRTIMER_MAX_CLOCK_BASES,
};
/**
* struct hrtimer_cpu_base - the per cpu clock bases
* @lock: lock protecting the base and associated clock bases
* and timers
* @cpu: cpu number
* @active_bases: Bitfield to mark bases with active timers
* @clock_was_set_seq: Sequence counter of clock was set events
* @hres_active: State of high resolution mode
* @in_hrtirq: hrtimer_interrupt() is currently executing
* @hang_detected: The last hrtimer interrupt detected a hang
* @softirq_activated: displays, if the softirq is raised - update of softirq
* related settings is not required then.
* @nr_events: Total number of hrtimer interrupt events
* @nr_retries: Total number of hrtimer interrupt retries
* @nr_hangs: Total number of hrtimer interrupt hangs
* @max_hang_time: Maximum time spent in hrtimer_interrupt
* @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
* expired
* @online: CPU is online from an hrtimers point of view
* @timer_waiters: A hrtimer_cancel() invocation waits for the timer
* callback to finish.
* @expires_next: absolute time of the next event, is required for remote
* hrtimer enqueue; it is the total first expiry time (hard
* and soft hrtimer are taken into account)
* @next_timer: Pointer to the first expiring timer
* @softirq_expires_next: Time to check, if soft queues needs also to be expired
* @softirq_next_timer: Pointer to the first expiring softirq based timer
* @clock_base: array of clock bases for this cpu
*
* Note: next_timer is just an optimization for __remove_hrtimer().
* Do not dereference the pointer because it is not reliable on
* cross cpu removals.
*/
struct hrtimer_cpu_base {
raw_spinlock_t lock;
unsigned int cpu;
unsigned int active_bases;
unsigned int clock_was_set_seq;
unsigned int hres_active : 1,
in_hrtirq : 1,
hang_detected : 1,
softirq_activated : 1,
online : 1;
#ifdef CONFIG_HIGH_RES_TIMERS
unsigned int nr_events;
unsigned short nr_retries;
unsigned short nr_hangs;
unsigned int max_hang_time;
#endif
#ifdef CONFIG_PREEMPT_RT
spinlock_t softirq_expiry_lock;
atomic_t timer_waiters;
#endif
ktime_t expires_next;
struct hrtimer *next_timer;
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
} ____cacheline_aligned;
static inline void hrtimer_set_expires(struct hrtimer *timer, ktime_t time)
{
timer->node.expires = time;
@ -447,20 +342,12 @@ extern u64
hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
/**
* hrtimer_forward_now - forward the timer expiry so it expires after now
* hrtimer_forward_now() - forward the timer expiry so it expires after now
* @timer: hrtimer to forward
* @interval: the interval to forward
*
* Forward the timer expiry so it will expire after the current time
* of the hrtimer clock base. Returns the number of overruns.
*
* Can be safely called from the callback function of @timer. If
* called from other contexts @timer must neither be enqueued nor
* running the callback and the caller needs to take care of
* serialization.
*
* Note: This only updates the timer expiry value and does not requeue
* the timer.
* It is a variant of hrtimer_forward(). The timer will expire after the current
* time of the hrtimer clock base. See hrtimer_forward() for details.
*/
static inline u64 hrtimer_forward_now(struct hrtimer *timer,
ktime_t interval)
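
The kernel-doc above says the expiry is moved to a point after the current time
and that the number of overruns is returned. A simplified user-space model of
that arithmetic is sketched below. It is not the kernel implementation: struct
demo_timer and demo_forward() are hypothetical, times are plain nanosecond
integers, the interval is assumed to be positive, and the checks the kernel
performs on timer state and resolution are left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a timer: only the absolute expiry in ns. */
struct demo_timer {
	int64_t expires;
};

/*
 * Move t->expires forward in whole intervals until it lies after 'now'.
 * Returns how many intervals were added (the overruns), or 0 if the
 * timer has not expired yet. Like the real helper, this only updates
 * the expiry value; nothing is requeued.
 */
static uint64_t demo_forward(struct demo_timer *t, int64_t now, int64_t interval)
{
	int64_t delta = now - t->expires;
	uint64_t overruns;

	if (delta < 0)
		return 0;	/* expiry is still in the future */

	/* Intervals that fully elapsed, plus one to get past 'now'. */
	overruns = (uint64_t)(delta / interval) + 1;
	t->expires += interval * (int64_t)overruns;

	return overruns;
}
```

For example, a timer with expiry 100 and interval 100 that is forwarded at time
350 ends up with expiry 400 and reports three overruns.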

@ -3,6 +3,8 @@
#define _LINUX_HRTIMER_DEFS_H
#include <linux/ktime.h>
#include <linux/timerqueue.h>
#include <linux/seqlock.h>
#ifdef CONFIG_HIGH_RES_TIMERS
@ -24,4 +26,106 @@
#endif
#ifdef CONFIG_64BIT
# define __hrtimer_clock_base_align ____cacheline_aligned
#else
# define __hrtimer_clock_base_align
#endif
/**
* struct hrtimer_clock_base - the timer base for a specific clock
* @cpu_base: per cpu clock base
* @index: clock type index for per_cpu support when moving a
* timer to a base on another cpu.
* @clockid: clock id for per_cpu support
* @seq: seqcount around __run_hrtimer
* @running: pointer to the currently running hrtimer
* @active: red black tree root node for the active timers
* @get_time: function to retrieve the current time of the clock
* @offset: offset of this clock to the monotonic base
*/
struct hrtimer_clock_base {
struct hrtimer_cpu_base *cpu_base;
unsigned int index;
clockid_t clockid;
seqcount_raw_spinlock_t seq;
struct hrtimer *running;
struct timerqueue_head active;
ktime_t (*get_time)(void);
ktime_t offset;
} __hrtimer_clock_base_align;
enum hrtimer_base_type {
HRTIMER_BASE_MONOTONIC,
HRTIMER_BASE_REALTIME,
HRTIMER_BASE_BOOTTIME,
HRTIMER_BASE_TAI,
HRTIMER_BASE_MONOTONIC_SOFT,
HRTIMER_BASE_REALTIME_SOFT,
HRTIMER_BASE_BOOTTIME_SOFT,
HRTIMER_BASE_TAI_SOFT,
HRTIMER_MAX_CLOCK_BASES,
};
/**
* struct hrtimer_cpu_base - the per cpu clock bases
* @lock: lock protecting the base and associated clock bases
* and timers
* @cpu: cpu number
* @active_bases: Bitfield to mark bases with active timers
* @clock_was_set_seq: Sequence counter of clock was set events
* @hres_active: State of high resolution mode
* @in_hrtirq: hrtimer_interrupt() is currently executing
* @hang_detected: The last hrtimer interrupt detected a hang
* @softirq_activated: displays, if the softirq is raised - update of softirq
* related settings is not required then.
* @nr_events: Total number of hrtimer interrupt events
* @nr_retries: Total number of hrtimer interrupt retries
* @nr_hangs: Total number of hrtimer interrupt hangs
* @max_hang_time: Maximum time spent in hrtimer_interrupt
* @softirq_expiry_lock: Lock which is taken while softirq based hrtimer are
* expired
* @online: CPU is online from an hrtimers point of view
* @timer_waiters: A hrtimer_cancel() invocation waits for the timer
* callback to finish.
* @expires_next: absolute time of the next event, is required for remote
* hrtimer enqueue; it is the total first expiry time (hard
* and soft hrtimer are taken into account)
* @next_timer: Pointer to the first expiring timer
* @softirq_expires_next: Time to check, if soft queues needs also to be expired
* @softirq_next_timer: Pointer to the first expiring softirq based timer
* @clock_base: array of clock bases for this cpu
*
* Note: next_timer is just an optimization for __remove_hrtimer().
* Do not dereference the pointer because it is not reliable on
* cross cpu removals.
*/
struct hrtimer_cpu_base {
raw_spinlock_t lock;
unsigned int cpu;
unsigned int active_bases;
unsigned int clock_was_set_seq;
unsigned int hres_active : 1,
in_hrtirq : 1,
hang_detected : 1,
softirq_activated : 1,
online : 1;
#ifdef CONFIG_HIGH_RES_TIMERS
unsigned int nr_events;
unsigned short nr_retries;
unsigned short nr_hangs;
unsigned int max_hang_time;
#endif
#ifdef CONFIG_PREEMPT_RT
spinlock_t softirq_expiry_lock;
atomic_t timer_waiters;
#endif
ktime_t expires_next;
struct hrtimer *next_timer;
ktime_t softirq_expires_next;
struct hrtimer *softirq_next_timer;
struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
} ____cacheline_aligned;
#endif

@ -102,12 +102,15 @@ static inline u64 get_jiffies_64(void)
}
#endif
/*
* These inlines deal with timer wrapping correctly. You are
* strongly encouraged to use them:
* 1. Because people otherwise forget
* 2. Because if the timer wrap changes in future you won't have to
* alter your driver code.
/**
* DOC: General information about time_* inlines
*
* These inlines deal with timer wrapping correctly. You are strongly encouraged
* to use them:
*
* #. Because people otherwise forget
* #. Because if the timer wrap changes in future you won't have to alter your
* driver code.
*/
/**
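
Why the comment above recommends the time_* inlines becomes clear with a
counter that is about to wrap. The sketch below uses the same signed-difference
idea as the kernel's time_after(), written as plain functions without the
kernel's type checks; the demo_ names are hypothetical and the result relies on
the usual two's complement conversion from unsigned long to long.

```c
#include <limits.h>

/* True if a is after b, even when the counter wrapped in between. */
static inline int demo_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* True if a is before b; simply the mirrored comparison. */
static inline int demo_time_before(unsigned long a, unsigned long b)
{
	return demo_time_after(b, a);
}
```

A plain `a > b` comparison gives the wrong answer once the counter has wrapped,
for example when b is ULONG_MAX - 5 and a is ten ticks later, whereas the
signed difference still orders the two values correctly as long as they are
less than half the counter range apart.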

@ -19,16 +19,22 @@ extern void __init tick_init(void);
extern void tick_suspend_local(void);
/* Should be core only, but XEN resume magic and ARM BL switcher require it */
extern void tick_resume_local(void);
extern void tick_handover_do_timer(void);
extern void tick_cleanup_dead_cpu(int cpu);
#else /* CONFIG_GENERIC_CLOCKEVENTS */
static inline void tick_init(void) { }
static inline void tick_suspend_local(void) { }
static inline void tick_resume_local(void) { }
static inline void tick_handover_do_timer(void) { }
static inline void tick_cleanup_dead_cpu(int cpu) { }
#endif /* !CONFIG_GENERIC_CLOCKEVENTS */
#if defined(CONFIG_GENERIC_CLOCKEVENTS) && defined(CONFIG_HOTPLUG_CPU)
extern int tick_cpu_dying(unsigned int cpu);
extern void tick_assert_timekeeping_handover(void);
#else
#define tick_cpu_dying NULL
static inline void tick_assert_timekeeping_handover(void) { }
#endif
#if defined(CONFIG_GENERIC_CLOCKEVENTS) && defined(CONFIG_SUSPEND)
extern void tick_freeze(void);
extern void tick_unfreeze(void);
@ -69,12 +75,6 @@ extern void tick_broadcast_control(enum tick_broadcast_mode mode);
static inline void tick_broadcast_control(enum tick_broadcast_mode mode) { }
#endif /* BROADCAST */
#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_HOTPLUG_CPU)
extern void tick_offline_cpu(unsigned int cpu);
#else
static inline void tick_offline_cpu(unsigned int cpu) { }
#endif
#ifdef CONFIG_GENERIC_CLOCKEVENTS
extern int tick_broadcast_oneshot_control(enum tick_broadcast_state state);
#else

@ -36,16 +36,10 @@
* workqueue locking issues. It's not meant for executing random crap
* with interrupts disabled. Abuse is monitored!
*
* @TIMER_PINNED: A pinned timer will not be affected by any timer
* placement heuristics (like, NOHZ) and will always expire on the CPU
* on which the timer was enqueued.
*
* Note: Because enqueuing of timers can migrate the timer from one
* CPU to another, pinned timers are not guaranteed to stay on the
* initialy selected CPU. They move to the CPU on which the enqueue
* function is invoked via mod_timer() or add_timer(). If the timer
* should be placed on a particular CPU, then add_timer_on() has to be
* used.
* @TIMER_PINNED: A pinned timer will always expire on the CPU on which the
* timer was enqueued. When a particular CPU is required, add_timer_on()
* has to be used. Enqueue via mod_timer() and add_timer() is always done
* on the local CPU.
*/
#define TIMER_CPUMASK 0x0003FFFF
#define TIMER_MIGRATING 0x00040000
@ -165,6 +159,8 @@ extern int timer_reduce(struct timer_list *timer, unsigned long expires);
#define NEXT_TIMER_MAX_DELTA ((1UL << 30) - 1)
extern void add_timer(struct timer_list *timer);
extern void add_timer_local(struct timer_list *timer);
extern void add_timer_global(struct timer_list *timer);
extern int try_to_del_timer_sync(struct timer_list *timer);
extern int timer_delete_sync(struct timer_list *timer);
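
The two mask values visible in this hunk indicate that the low 18 bits of the
timer flags word hold a CPU number, with TIMER_MIGRATING as the next bit above
them. The sketch below shows how such a packed field can be read and updated.
Only the two constants are taken from the header; demo_timer_cpu() and
demo_timer_set_cpu() are hypothetical helpers and not kernel API.

```c
#include <stdint.h>

#define TIMER_CPUMASK	0x0003FFFF	/* value from the header above */
#define TIMER_MIGRATING	0x00040000	/* value from the header above */

/* Extract the CPU number stored in the low bits of the flags word. */
static inline unsigned int demo_timer_cpu(uint32_t flags)
{
	return flags & TIMER_CPUMASK;
}

/* Replace the CPU number while leaving all other flag bits untouched. */
static inline uint32_t demo_timer_set_cpu(uint32_t flags, unsigned int cpu)
{
	return (flags & ~(uint32_t)TIMER_CPUMASK) | (cpu & TIMER_CPUMASK);
}
```

Keeping the CPU in the flags word is what lets a timer remember the CPU whose
wheel it is queued on; with the changes described here, add_timer() and
mod_timer() always enqueue on the local CPU and add_timer_on() remains the way
to pick a particular one.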

@ -0,0 +1,298 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM timer_migration
#if !defined(_TRACE_TIMER_MIGRATION_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_TIMER_MIGRATION_H
#include <linux/tracepoint.h>
/* Group events */
TRACE_EVENT(tmigr_group_set,
TP_PROTO(struct tmigr_group *group),
TP_ARGS(group),
TP_STRUCT__entry(
__field( void *, group )
__field( unsigned int, lvl )
__field( unsigned int, numa_node )
),
TP_fast_assign(
__entry->group = group;
__entry->lvl = group->level;
__entry->numa_node = group->numa_node;
),
TP_printk("group=%p lvl=%d numa=%d",
__entry->group, __entry->lvl, __entry->numa_node)
);
TRACE_EVENT(tmigr_connect_child_parent,
TP_PROTO(struct tmigr_group *child),
TP_ARGS(child),
TP_STRUCT__entry(
__field( void *, child )
__field( void *, parent )
__field( unsigned int, lvl )
__field( unsigned int, numa_node )
__field( unsigned int, num_children )
__field( u32, childmask )
),
TP_fast_assign(
__entry->child = child;
__entry->parent = child->parent;
__entry->lvl = child->parent->level;
__entry->numa_node = child->parent->numa_node;
__entry->num_children = child->parent->num_children;
__entry->childmask = child->childmask;
),
TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
__entry->child, __entry->childmask, __entry->parent,
__entry->lvl, __entry->numa_node, __entry->num_children)
);
TRACE_EVENT(tmigr_connect_cpu_parent,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc),
TP_STRUCT__entry(
__field( void *, parent )
__field( unsigned int, cpu )
__field( unsigned int, lvl )
__field( unsigned int, numa_node )
__field( unsigned int, num_children )
__field( u32, childmask )
),
TP_fast_assign(
__entry->parent = tmc->tmgroup;
__entry->cpu = tmc->cpuevt.cpu;
__entry->lvl = tmc->tmgroup->level;
__entry->numa_node = tmc->tmgroup->numa_node;
__entry->num_children = tmc->tmgroup->num_children;
__entry->childmask = tmc->childmask;
),
TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
__entry->cpu, __entry->childmask, __entry->parent,
__entry->lvl, __entry->numa_node, __entry->num_children)
);
DECLARE_EVENT_CLASS(tmigr_group_and_cpu,
TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
TP_ARGS(group, state, childmask),
TP_STRUCT__entry(
__field( void *, group )
__field( void *, parent )
__field( unsigned int, lvl )
__field( unsigned int, numa_node )
__field( u32, childmask )
__field( u8, active )
__field( u8, migrator )
),
TP_fast_assign(
__entry->group = group;
__entry->parent = group->parent;
__entry->lvl = group->level;
__entry->numa_node = group->numa_node;
__entry->childmask = childmask;
__entry->active = state.active;
__entry->migrator = state.migrator;
),
TP_printk("group=%p lvl=%d numa=%d active=%0x migrator=%0x "
"parent=%p childmask=%0x",
__entry->group, __entry->lvl, __entry->numa_node,
__entry->active, __entry->migrator,
__entry->parent, __entry->childmask)
);
DEFINE_EVENT(tmigr_group_and_cpu, tmigr_group_set_cpu_inactive,
TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
TP_ARGS(group, state, childmask)
);
DEFINE_EVENT(tmigr_group_and_cpu, tmigr_group_set_cpu_active,
TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
TP_ARGS(group, state, childmask)
);
/* CPU events */
DECLARE_EVENT_CLASS(tmigr_cpugroup,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc),
TP_STRUCT__entry(
__field( u64, wakeup )
__field( void *, parent )
__field( unsigned int, cpu )
),
TP_fast_assign(
__entry->wakeup = tmc->wakeup;
__entry->parent = tmc->tmgroup;
__entry->cpu = tmc->cpuevt.cpu;
),
TP_printk("cpu=%d parent=%p wakeup=%llu", __entry->cpu, __entry->parent, __entry->wakeup)
);
DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_new_timer,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc)
);
DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_active,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc)
);
DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_online,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc)
);
DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_offline,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc)
);
DEFINE_EVENT(tmigr_cpugroup, tmigr_handle_remote_cpu,
TP_PROTO(struct tmigr_cpu *tmc),
TP_ARGS(tmc)
);
DECLARE_EVENT_CLASS(tmigr_idle,
TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
TP_ARGS(tmc, nextevt),
TP_STRUCT__entry(
__field( u64, nextevt)
__field( u64, wakeup)
__field( void *, parent)
__field( unsigned int, cpu)
),
TP_fast_assign(
__entry->nextevt = nextevt;
__entry->wakeup = tmc->wakeup;
__entry->parent = tmc->tmgroup;
__entry->cpu = tmc->cpuevt.cpu;
),
TP_printk("cpu=%d parent=%p nextevt=%llu wakeup=%llu",
__entry->cpu, __entry->parent, __entry->nextevt, __entry->wakeup)
);
DEFINE_EVENT(tmigr_idle, tmigr_cpu_idle,
TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
TP_ARGS(tmc, nextevt)
);
DEFINE_EVENT(tmigr_idle, tmigr_cpu_new_timer_idle,
TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
TP_ARGS(tmc, nextevt)
);
TRACE_EVENT(tmigr_update_events,
TP_PROTO(struct tmigr_group *child, struct tmigr_group *group,
union tmigr_state childstate, union tmigr_state groupstate,
u64 nextevt),
TP_ARGS(child, group, childstate, groupstate, nextevt),
TP_STRUCT__entry(
__field( void *, child )
__field( void *, group )
__field( u64, nextevt )
__field( u64, group_next_expiry )
__field( u64, child_evt_expiry )
__field( unsigned int, group_lvl )
__field( unsigned int, child_evtcpu )
__field( u8, child_active )
__field( u8, group_active )
),
TP_fast_assign(
__entry->child = child;
__entry->group = group;
__entry->nextevt = nextevt;
__entry->group_next_expiry = group->next_expiry;
__entry->child_evt_expiry = child ? child->groupevt.nextevt.expires : 0;
__entry->group_lvl = group->level;
__entry->child_evtcpu = child ? child->groupevt.cpu : 0;
__entry->child_active = childstate.active;
__entry->group_active = groupstate.active;
),
TP_printk("child=%p group=%p group_lvl=%d child_active=%0x group_active=%0x "
"nextevt=%llu next_expiry=%llu child_evt_expiry=%llu child_evtcpu=%d",
__entry->child, __entry->group, __entry->group_lvl, __entry->child_active,
__entry->group_active,
__entry->nextevt, __entry->group_next_expiry, __entry->child_evt_expiry,
__entry->child_evtcpu)
);
TRACE_EVENT(tmigr_handle_remote,
TP_PROTO(struct tmigr_group *group),
TP_ARGS(group),
TP_STRUCT__entry(
__field( void * , group )
__field( unsigned int , lvl )
),
TP_fast_assign(
__entry->group = group;
__entry->lvl = group->level;
),
TP_printk("group=%p lvl=%d",
__entry->group, __entry->lvl)
);
#endif /* _TRACE_TIMER_MIGRATION_H */
/* This part must be outside protection */
#include <trace/define_trace.h>


@ -19,6 +19,12 @@
#include <vdso/time32.h>
#include <vdso/time64.h>
#ifdef CONFIG_ARM64
#include <asm/page-def.h>
#else
#include <asm/page.h>
#endif
#ifdef CONFIG_ARCH_HAS_VDSO_DATA
#include <asm/vdso/data.h>
#else
@ -121,6 +127,14 @@ struct vdso_data {
extern struct vdso_data _vdso_data[CS_BASES] __attribute__((visibility("hidden")));
extern struct vdso_data _timens_data[CS_BASES] __attribute__((visibility("hidden")));
/**
* union vdso_data_store - Generic vDSO data page
*/
union vdso_data_store {
struct vdso_data data[CS_BASES];
u8 page[PAGE_SIZE];
};
/*
* The generic vDSO implementation requires that gettimeofday.h
* provides:
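The `union vdso_data_store` added above overlays the per-clocksource data array with a page-sized byte array, so the object always occupies a whole page. A minimal sketch of the same idea in plain C follows; `DEMO_PAGE_SIZE`, `DEMO_CS_BASES` and `struct demo_vdso_data` are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values; the kernel takes these from PAGE_SIZE and CS_BASES. */
#define DEMO_PAGE_SIZE 4096u
#define DEMO_CS_BASES  2

/* Hypothetical, much reduced stand-in for struct vdso_data. */
struct demo_vdso_data {
	uint32_t seq;
	uint64_t cycle_last;
	uint32_t mult;
	uint32_t shift;
};

/*
 * The byte array pads the union to one full page, while the data[]
 * member gives typed access to the payload at the start of that page.
 */
union demo_data_store {
	struct demo_vdso_data data[DEMO_CS_BASES];
	uint8_t page[DEMO_PAGE_SIZE];
};

/* Compile-time checks: the payload fits and the store is exactly one page. */
_Static_assert(sizeof(struct demo_vdso_data) * DEMO_CS_BASES <= DEMO_PAGE_SIZE,
	       "payload must fit into one page");
_Static_assert(sizeof(union demo_data_store) == DEMO_PAGE_SIZE,
	       "data store must occupy exactly one page");
```

The padding only works as long as the payload stays smaller than the page, which is why the sketch checks both conditions at compile time.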


@ -30,9 +30,9 @@ static __always_inline u32 vdso_read_retry(const struct vdso_data *vd,
static __always_inline void vdso_write_begin(struct vdso_data *vd)
{
/*
* WRITE_ONCE it is required otherwise the compiler can validly tear
* WRITE_ONCE() is required otherwise the compiler can validly tear
* updates to vd[x].seq and it is possible that the value seen by the
* reader it is inconsistent.
* reader is inconsistent.
*/
WRITE_ONCE(vd[CS_HRES_COARSE].seq, vd[CS_HRES_COARSE].seq + 1);
WRITE_ONCE(vd[CS_RAW].seq, vd[CS_RAW].seq + 1);
@ -43,9 +43,9 @@ static __always_inline void vdso_write_end(struct vdso_data *vd)
{
smp_wmb();
/*
* WRITE_ONCE it is required otherwise the compiler can validly tear
* WRITE_ONCE() is required otherwise the compiler can validly tear
* updates to vd[x].seq and it is possible that the value seen by the
* reader it is inconsistent.
* reader is inconsistent.
*/
WRITE_ONCE(vd[CS_HRES_COARSE].seq, vd[CS_HRES_COARSE].seq + 1);
WRITE_ONCE(vd[CS_RAW].seq, vd[CS_RAW].seq + 1);
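The comments fixed above describe the writer side of a sequence counter: the counter is bumped before and after the update, and a reader retries when the counter changed while it was reading. The following is a minimal userspace sketch of that pattern using C11 atomics. It assumes a single writer; `struct demo_clock` and all `demo_*` functions are hypothetical names and do not reproduce the kernel's vDSO helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one clock base protected by a sequence count. */
struct demo_clock {
	atomic_uint seq;	/* odd while an update is in progress */
	_Atomic uint64_t sec;
	_Atomic uint64_t nsec;
};

void demo_write_begin(struct demo_clock *c)
{
	unsigned int seq = atomic_load_explicit(&c->seq, memory_order_relaxed);

	/* Make the count odd, then order it before the data stores. */
	atomic_store_explicit(&c->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

void demo_write_end(struct demo_clock *c)
{
	unsigned int seq = atomic_load_explicit(&c->seq, memory_order_relaxed);

	/* Order the data stores before making the count even again. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&c->seq, seq + 1, memory_order_relaxed);
}

void demo_update(struct demo_clock *c, uint64_t sec, uint64_t nsec)
{
	demo_write_begin(c);
	atomic_store_explicit(&c->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&c->nsec, nsec, memory_order_relaxed);
	demo_write_end(c);
}

unsigned int demo_read_begin(struct demo_clock *c)
{
	unsigned int seq;

	/* Wait until no update is in progress (count is even). */
	while ((seq = atomic_load_explicit(&c->seq, memory_order_acquire)) & 1)
		;
	return seq;
}

int demo_read_retry(struct demo_clock *c, unsigned int start)
{
	/* Order the data loads before re-reading the count. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&c->seq, memory_order_relaxed) != start;
}

void demo_read(struct demo_clock *c, uint64_t *sec, uint64_t *nsec)
{
	unsigned int seq;

	do {
		seq = demo_read_begin(c);
		*sec = atomic_load_explicit(&c->sec, memory_order_relaxed);
		*nsec = atomic_load_explicit(&c->nsec, memory_order_relaxed);
	} while (demo_read_retry(c, seq));
}
```

The atomic stores in the sketch play the role of `WRITE_ONCE()` in the kernel code: they prevent the compiler from tearing the counter update, which would otherwise let a reader observe an inconsistent value.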


@ -1323,10 +1323,6 @@ static int take_cpu_down(void *_param)
*/
cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
/* Give up timekeeping duties */
tick_handover_do_timer();
/* Remove CPU from timer broadcasting */
tick_offline_cpu(cpu);
/* Park the stopper thread */
stop_machine_park(cpu);
return 0;
@ -1402,6 +1398,7 @@ void cpuhp_report_idle_dead(void)
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
BUG_ON(st->state != CPUHP_AP_OFFLINE);
tick_assert_timekeeping_handover();
rcutree_report_cpu_dead();
st->state = CPUHP_AP_IDLE_DEAD;
/*
@ -2204,7 +2201,11 @@ static struct cpuhp_step cpuhp_hp_states[] = {
.startup.single = NULL,
.teardown.single = hrtimers_cpu_dying,
},
[CPUHP_AP_TICK_DYING] = {
.name = "tick:dying",
.startup.single = NULL,
.teardown.single = tick_cpu_dying,
},
/* Entry state on starting. Interrupts enabled from here on. Transient
* state for synchronization */
[CPUHP_AP_ONLINE] = {


@ -291,7 +291,6 @@ static void do_idle(void)
local_irq_disable();
if (cpu_is_offline(cpu)) {
tick_nohz_idle_stop_tick();
cpuhp_report_idle_dead();
arch_cpu_idle_dead();
}


@ -17,6 +17,9 @@ endif
obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o
obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o
obj-$(CONFIG_LEGACY_TIMER_TICK) += tick-legacy.o
ifeq ($(CONFIG_SMP),y)
obj-$(CONFIG_NO_HZ_COMMON) += timer_migration.o
endif
obj-$(CONFIG_HAVE_GENERIC_VDSO) += vsyscall.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o


@ -659,7 +659,7 @@ void tick_cleanup_dead_cpu(int cpu)
#endif
#ifdef CONFIG_SYSFS
static struct bus_type clockevents_subsys = {
static const struct bus_type clockevents_subsys = {
.name = "clockevents",
.dev_name = "clockevent",
};


@ -104,8 +104,8 @@ static void wdtest_ktime_clocksource_reset(void)
static int wdtest_func(void *arg)
{
unsigned long j1, j2;
int i, max_retries;
char *s;
int i;
schedule_timeout_uninterruptible(holdoff * HZ);
@ -139,18 +139,19 @@ static int wdtest_func(void *arg)
WARN_ON_ONCE(time_before(j2, j1 + NSEC_PER_USEC));
/* Verify tsc-like stability with various numbers of errors injected. */
for (i = 0; i <= max_cswd_read_retries + 1; i++) {
if (i <= 1 && i < max_cswd_read_retries)
max_retries = clocksource_get_max_watchdog_retry();
for (i = 0; i <= max_retries + 1; i++) {
if (i <= 1 && i < max_retries)
s = "";
else if (i <= max_cswd_read_retries)
else if (i <= max_retries)
s = ", expect message";
else
s = ", expect clock skew";
pr_info("--- Watchdog with %dx error injection, %lu retries%s.\n", i, max_cswd_read_retries, s);
pr_info("--- Watchdog with %dx error injection, %d retries%s.\n", i, max_retries, s);
WRITE_ONCE(wdtest_ktime_read_ndelays, i);
schedule_timeout_uninterruptible(2 * HZ);
WARN_ON_ONCE(READ_ONCE(wdtest_ktime_read_ndelays));
WARN_ON_ONCE((i <= max_cswd_read_retries) !=
WARN_ON_ONCE((i <= max_retries) !=
!(clocksource_wdtest_ktime.flags & CLOCK_SOURCE_UNSTABLE));
wdtest_ktime_clocksource_reset();
}


@ -210,9 +210,6 @@ void clocksource_mark_unstable(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}
ulong max_cswd_read_retries = 2;
module_param(max_cswd_read_retries, ulong, 0644);
EXPORT_SYMBOL_GPL(max_cswd_read_retries);
static int verify_n_cpus = 8;
module_param(verify_n_cpus, int, 0644);
@ -224,11 +221,12 @@ enum wd_read_status {
static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow, u64 *wdnow)
{
unsigned int nretries;
unsigned int nretries, max_retries;
u64 wd_end, wd_end2, wd_delta;
int64_t wd_delay, wd_seq_delay;
for (nretries = 0; nretries <= max_cswd_read_retries; nretries++) {
max_retries = clocksource_get_max_watchdog_retry();
for (nretries = 0; nretries <= max_retries; nretries++) {
local_irq_disable();
*wdnow = watchdog->read(watchdog);
*csnow = cs->read(cs);
@ -240,7 +238,7 @@ static enum wd_read_status cs_watchdog_read(struct clocksource *cs, u64 *csnow,
wd_delay = clocksource_cyc2ns(wd_delta, watchdog->mult,
watchdog->shift);
if (wd_delay <= WATCHDOG_MAX_SKEW) {
if (nretries > 1 || nretries >= max_cswd_read_retries) {
if (nretries > 1 || nretries >= max_retries) {
pr_warn("timekeeping watchdog on CPU%d: %s retried %d times before success\n",
smp_processor_id(), watchdog->name, nretries);
}
@ -1468,7 +1466,7 @@ static struct attribute *clocksource_attrs[] = {
};
ATTRIBUTE_GROUPS(clocksource);
static struct bus_type clocksource_subsys = {
static const struct bus_type clocksource_subsys = {
.name = "clocksource",
.dev_name = "clocksource",
};
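The `cs_watchdog_read()` hunk above reads the retry limit once and then loops from zero up to and including that limit, accepting the first read whose watchdog delay stays within the allowed skew. A simplified sketch of that bounded retry loop follows. The fake clock, the `demo_*` names and the status enum are hypothetical, and the sketch omits the additional checks the kernel function performs.

```c
#include <assert.h>
#include <stdint.h>

enum demo_read_status { DEMO_READ_SUCCESS, DEMO_READ_UNSTABLE };

/* Hypothetical fake clock: the first 'slow_reads' clocksource reads are delayed. */
struct demo_fake_clock {
	uint64_t now;
	unsigned int slow_reads;
};

/* Watchdog read: returns the current time and advances it by one unit. */
uint64_t demo_wd_read(struct demo_fake_clock *fc)
{
	return fc->now++;
}

/* Clocksource read: advances time by 1000 units while delays are injected. */
uint64_t demo_cs_read(struct demo_fake_clock *fc)
{
	if (fc->slow_reads) {
		fc->slow_reads--;
		fc->now += 1000;
	} else {
		fc->now += 1;
	}
	return fc->now;
}

/*
 * Bracket the clocksource read with two watchdog reads and retry while
 * the bracket is wider than max_skew, at most max_retries extra times.
 * *nretries reports how many retries were used.
 */
enum demo_read_status demo_watchdog_read(struct demo_fake_clock *fc,
					 uint64_t max_skew,
					 unsigned int max_retries,
					 uint64_t *csnow,
					 unsigned int *nretries)
{
	for (*nretries = 0; *nretries <= max_retries; (*nretries)++) {
		uint64_t wd_start = demo_wd_read(fc);
		uint64_t wd_end;

		*csnow = demo_cs_read(fc);
		wd_end = demo_wd_read(fc);

		if (wd_end - wd_start <= max_skew)
			return DEMO_READ_SUCCESS;
	}
	return DEMO_READ_UNSTABLE;
}
```

Reading the limit into a local variable before the loop, as the patch does with `max_retries`, keeps the loop bound stable even if the underlying setting changes while the loop runs.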


@ -38,6 +38,7 @@
#include <linux/sched/deadline.h>
#include <linux/sched/nohz.h>
#include <linux/sched/debug.h>
#include <linux/sched/isolation.h>
#include <linux/timer.h>
#include <linux/freezer.h>
#include <linux/compat.h>
@ -746,7 +747,7 @@ static void hrtimer_switch_to_hres(void)
base->hres_active = 1;
hrtimer_resolution = HIGH_RES_NSEC;
tick_setup_sched_timer();
tick_setup_sched_timer(true);
/* "Retrigger" the interrupt to get things going */
retrigger_next_event(NULL);
}
@ -1021,21 +1022,23 @@ void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
}
/**
* hrtimer_forward - forward the timer expiry
* hrtimer_forward() - forward the timer expiry
* @timer: hrtimer to forward
* @now: forward past this time
* @interval: the interval to forward
*
* Forward the timer expiry so it will expire in the future.
* Returns the number of overruns.
*
* Can be safely called from the callback function of @timer. If
* called from other contexts @timer must neither be enqueued nor
* running the callback and the caller needs to take care of
* serialization.
* .. note::
* This only updates the timer expiry value and does not requeue the timer.
*
* Note: This only updates the timer expiry value and does not requeue
* the timer.
* There is also a variant of the function hrtimer_forward_now().
*
* Context: Can be safely called from the callback function of @timer. If called
* from other contexts @timer must neither be enqueued nor running the
* callback and the caller needs to take care of serialization.
*
* Return: The number of overruns is returned.
*/
u64 hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval)
{
@ -2223,10 +2226,8 @@ static void migrate_hrtimer_list(struct hrtimer_clock_base *old_base,
int hrtimers_cpu_dying(unsigned int dying_cpu)
{
int i, ncpu = cpumask_any_and(cpu_active_mask, housekeeping_cpumask(HK_TYPE_TIMER));
struct hrtimer_cpu_base *old_base, *new_base;
int i, ncpu = cpumask_first(cpu_active_mask);
tick_cancel_sched_timer(dying_cpu);
old_base = this_cpu_ptr(&hrtimer_bases);
new_base = &per_cpu(hrtimer_bases, ncpu);


@ -111,15 +111,13 @@ void tick_handle_periodic(struct clock_event_device *dev)
tick_periodic(cpu);
#if defined(CONFIG_HIGH_RES_TIMERS) || defined(CONFIG_NO_HZ_COMMON)
/*
* The cpu might have transitioned to HIGHRES or NOHZ mode via
* update_process_times() -> run_local_timers() ->
* hrtimer_run_queues().
*/
if (dev->event_handler != tick_handle_periodic)
if (IS_ENABLED(CONFIG_TICK_ONESHOT) && dev->event_handler != tick_handle_periodic)
return;
#endif
if (!clockevent_state_oneshot(dev))
return;
@ -398,16 +396,31 @@ int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
EXPORT_SYMBOL_GPL(tick_broadcast_oneshot_control);
#ifdef CONFIG_HOTPLUG_CPU
/*
* Transfer the do_timer job away from a dying cpu.
*
* Called with interrupts disabled. No locking required. If
* tick_do_timer_cpu is owned by this cpu, nothing can change it.
*/
void tick_handover_do_timer(void)
void tick_assert_timekeeping_handover(void)
{
if (tick_do_timer_cpu == smp_processor_id())
WARN_ON_ONCE(tick_do_timer_cpu == smp_processor_id());
}
/*
* Stop the tick and transfer the timekeeping job away from a dying cpu.
*/
int tick_cpu_dying(unsigned int dying_cpu)
{
/*
* If the current CPU is the timekeeper, it's the only one that
* can safely hand over its duty. Also all online CPUs are in
* stop machine, guaranteed not to be idle, therefore it's safe
* to pick any online successor.
*/
if (tick_do_timer_cpu == dying_cpu)
tick_do_timer_cpu = cpumask_first(cpu_online_mask);
/* Make sure the CPU won't try to retake the timekeeping duty */
tick_sched_timer_dying(dying_cpu);
/* Remove CPU from timer broadcasting */
tick_offline_cpu(dying_cpu);
return 0;
}
/*


@ -8,6 +8,11 @@
#include "timekeeping.h"
#include "tick-sched.h"
struct timer_events {
u64 local;
u64 global;
};
#ifdef CONFIG_GENERIC_CLOCKEVENTS
# define TICK_DO_TIMER_NONE -1
@ -137,8 +142,10 @@ static inline bool tick_broadcast_oneshot_available(void) { return tick_oneshot_
#endif /* !(BROADCAST && ONESHOT) */
#if defined(CONFIG_GENERIC_CLOCKEVENTS_BROADCAST) && defined(CONFIG_HOTPLUG_CPU)
extern void tick_offline_cpu(unsigned int cpu);
extern void tick_broadcast_offline(unsigned int cpu);
#else
static inline void tick_offline_cpu(unsigned int cpu) { }
static inline void tick_broadcast_offline(unsigned int cpu) { }
#endif
@ -152,8 +159,16 @@ static inline void tick_nohz_init(void) { }
#ifdef CONFIG_NO_HZ_COMMON
extern unsigned long tick_nohz_active;
extern void timers_update_nohz(void);
extern u64 get_jiffies_update(unsigned long *basej);
# ifdef CONFIG_SMP
extern struct static_key_false timers_migration_enabled;
extern void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
struct timer_events *tevt,
unsigned int cpu);
extern void timer_lock_remote_bases(unsigned int cpu);
extern void timer_unlock_remote_bases(unsigned int cpu);
extern bool timer_base_is_idle(void);
extern void timer_expire_remote(unsigned int cpu);
# endif
#else /* CONFIG_NO_HZ_COMMON */
static inline void timers_update_nohz(void) { }
@ -163,6 +178,7 @@ static inline void timers_update_nohz(void) { }
DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle);
void timer_clear_idle(void);
#define CLOCK_SET_WALL \


@ -43,7 +43,6 @@ struct tick_sched *tick_get_tick_sched(int cpu)
return &per_cpu(tick_cpu_sched, cpu);
}
#if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
/*
* The time when the last jiffy update happened. Write access must hold
* jiffies_lock and jiffies_seq. tick_nohz_next_event() needs to get a
@ -181,13 +180,32 @@ static ktime_t tick_init_jiffy_update(void)
return period;
}
static inline int tick_sched_flag_test(struct tick_sched *ts,
unsigned long flag)
{
return !!(ts->flags & flag);
}
static inline void tick_sched_flag_set(struct tick_sched *ts,
unsigned long flag)
{
lockdep_assert_irqs_disabled();
ts->flags |= flag;
}
static inline void tick_sched_flag_clear(struct tick_sched *ts,
unsigned long flag)
{
lockdep_assert_irqs_disabled();
ts->flags &= ~flag;
}
#define MAX_STALLED_JIFFIES 5
static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
{
int cpu = smp_processor_id();
#ifdef CONFIG_NO_HZ_COMMON
/*
* Check if the do_timer duty was dropped. We don't care about
* concurrency: This happens only when the CPU in charge went
@ -198,13 +216,13 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
* If nohz_full is enabled, this should not happen because the
* 'tick_do_timer_cpu' CPU never relinquishes.
*/
if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) {
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) {
#ifdef CONFIG_NO_HZ_FULL
WARN_ON_ONCE(tick_nohz_full_running);
#endif
tick_do_timer_cpu = cpu;
}
#endif
/* Check if jiffies need an update */
if (tick_do_timer_cpu == cpu)
@ -225,13 +243,12 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
}
}
if (ts->inidle)
if (tick_sched_flag_test(ts, TS_FLAG_INIDLE))
ts->got_idle_tick = 1;
}
static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
{
#ifdef CONFIG_NO_HZ_COMMON
/*
* When we are idle and the tick is stopped, we have to touch
* the watchdog as we might not schedule for a really long
@ -240,7 +257,8 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
* idle" jiffy stamp so the idle accounting adjustment we do
* when we go busy again does not account too many ticks.
*/
if (ts->tick_stopped) {
if (IS_ENABLED(CONFIG_NO_HZ_COMMON) &&
tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
touch_softlockup_watchdog_sched();
if (is_idle_task(current))
ts->idle_jiffies++;
@ -251,11 +269,52 @@ static void tick_sched_handle(struct tick_sched *ts, struct pt_regs *regs)
*/
ts->next_tick = 0;
}
#endif
update_process_times(user_mode(regs));
profile_tick(CPU_PROFILING);
}
#endif
/*
* We rearm the timer until we get disabled by the idle code.
* Called with interrupts disabled.
*/
static enum hrtimer_restart tick_nohz_handler(struct hrtimer *timer)
{
struct tick_sched *ts = container_of(timer, struct tick_sched, sched_timer);
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();
tick_sched_do_timer(ts, now);
/*
* Do not call when we are not in IRQ context and have
* no valid 'regs' pointer
*/
if (regs)
tick_sched_handle(ts, regs);
else
ts->next_tick = 0;
/*
* In dynticks mode, tick reprogram is deferred:
* - to the idle task if in dynticks-idle
* - to IRQ exit if in full-dynticks.
*/
if (unlikely(tick_sched_flag_test(ts, TS_FLAG_STOPPED)))
return HRTIMER_NORESTART;
hrtimer_forward(timer, now, TICK_NSEC);
return HRTIMER_RESTART;
}
static void tick_sched_timer_cancel(struct tick_sched *ts)
{
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES))
hrtimer_cancel(&ts->sched_timer);
else if (tick_sched_flag_test(ts, TS_FLAG_NOHZ))
tick_program_event(KTIME_MAX, 1);
}
#ifdef CONFIG_NO_HZ_FULL
cpumask_var_t tick_nohz_full_mask;
@ -529,7 +588,7 @@ void __tick_nohz_task_switch(void)
ts = this_cpu_ptr(&tick_cpu_sched);
if (ts->tick_stopped) {
if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
if (atomic_read(&current->tick_dep_mask) ||
atomic_read(&current->signal->tick_dep_mask))
tick_nohz_full_kick();
@ -601,7 +660,7 @@ void __init tick_nohz_init(void)
pr_info("NO_HZ: Full dynticks CPUs: %*pbl.\n",
cpumask_pr_args(tick_nohz_full_mask));
}
#endif
#endif /* #ifdef CONFIG_NO_HZ_FULL */
/*
* NOHZ - aka dynamic tick functionality
@ -626,14 +685,14 @@ bool tick_nohz_tick_stopped(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
return ts->tick_stopped;
return tick_sched_flag_test(ts, TS_FLAG_STOPPED);
}
bool tick_nohz_tick_stopped_cpu(int cpu)
{
struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
return ts->tick_stopped;
return tick_sched_flag_test(ts, TS_FLAG_STOPPED);
}
/**
@ -663,7 +722,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
ktime_t delta;
if (WARN_ON_ONCE(!ts->idle_active))
if (WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE)))
return;
delta = ktime_sub(now, ts->idle_entrytime);
@ -675,7 +734,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
ts->idle_entrytime = now;
ts->idle_active = 0;
tick_sched_flag_clear(ts, TS_FLAG_IDLE_ACTIVE);
write_seqcount_end(&ts->idle_sleeptime_seq);
sched_clock_idle_wakeup_event();
@ -685,7 +744,7 @@ static void tick_nohz_start_idle(struct tick_sched *ts)
{
write_seqcount_begin(&ts->idle_sleeptime_seq);
ts->idle_entrytime = ktime_get();
ts->idle_active = 1;
tick_sched_flag_set(ts, TS_FLAG_IDLE_ACTIVE);
write_seqcount_end(&ts->idle_sleeptime_seq);
sched_clock_idle_sleep_event();
@ -707,7 +766,7 @@ static u64 get_cpu_sleep_time_us(struct tick_sched *ts, ktime_t *sleeptime,
do {
seq = read_seqcount_begin(&ts->idle_sleeptime_seq);
if (ts->idle_active && compute_delta) {
if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE) && compute_delta) {
ktime_t delta = ktime_sub(now, ts->idle_entrytime);
idle = ktime_add(*sleeptime, delta);
@ -780,7 +839,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
/* Forward the time to expire in the future */
hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
hrtimer_start_expires(&ts->sched_timer,
HRTIMER_MODE_ABS_PINNED_HARD);
} else {
@ -799,18 +858,40 @@ static inline bool local_timer_softirq_pending(void)
return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
}
static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
/*
* Read jiffies and the time when jiffies were updated last
*/
u64 get_jiffies_update(unsigned long *basej)
{
u64 basemono, next_tick, delta, expires;
unsigned long basejiff;
unsigned int seq;
u64 basemono;
/* Read jiffies and the time when jiffies were updated last */
do {
seq = read_seqcount_begin(&jiffies_seq);
basemono = last_jiffies_update;
basejiff = jiffies;
} while (read_seqcount_retry(&jiffies_seq, seq));
*basej = basejiff;
return basemono;
}
/**
* tick_nohz_next_event() - return the clock monotonic based next event
* @ts: pointer to tick_sched struct
* @cpu: CPU number
*
* Return:
* *%0 - When the next event is a maximum of TICK_NSEC in the future
* and the tick is not stopped yet
* *%next_event - Next event based on clock monotonic
*/
static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
u64 basemono, next_tick, delta, expires;
unsigned long basejiff;
basemono = get_jiffies_update(&basejiff);
ts->last_jiffies = basejiff;
ts->timer_expires_base = basemono;
@ -849,16 +930,11 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
*/
delta = next_tick - basemono;
if (delta <= (u64)TICK_NSEC) {
/*
* Tell the timer code that the base is not idle, i.e. undo
* the effect of get_next_timer_interrupt():
*/
timer_clear_idle();
/*
* We've not stopped the tick yet, and there's a timer in the
* next period, so no point in stopping it either, bail.
*/
if (!ts->tick_stopped) {
if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
ts->timer_expires = 0;
goto out;
}
@ -871,7 +947,8 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
*/
delta = timekeeping_max_deferment();
if (cpu != tick_do_timer_cpu &&
(tick_do_timer_cpu != TICK_DO_TIMER_NONE || !ts->do_timer_last))
(tick_do_timer_cpu != TICK_DO_TIMER_NONE ||
!tick_sched_flag_test(ts, TS_FLAG_DO_TIMER_LAST)))
delta = KTIME_MAX;
/* Calculate the next expiry time */
@ -889,12 +966,38 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
{
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
unsigned long basejiff = ts->last_jiffies;
u64 basemono = ts->timer_expires_base;
u64 expires = ts->timer_expires;
bool timer_idle = tick_sched_flag_test(ts, TS_FLAG_STOPPED);
u64 expires;
/* Make sure we won't be trying to stop it twice in a row. */
ts->timer_expires_base = 0;
/*
* Now the tick should be stopped definitely - so the timer base needs
* to be marked idle as well to not miss a newly queued timer.
*/
expires = timer_base_try_to_set_idle(basejiff, basemono, &timer_idle);
if (expires > ts->timer_expires) {
/*
* This path could only happen when the first timer was removed
* between calculating the possible sleep length and now (when
* high resolution mode is not active, timer could also be a
* hrtimer).
*
* We have to stick to the original calculated expiry value to
* not stop the tick for too long with a shallow C-state (which
* was programmed by cpuidle because of an early next expiration
* value).
*/
expires = ts->timer_expires;
}
/* If the timer base is not idle, retain the not yet stopped tick. */
if (!timer_idle)
return;
/*
* If this CPU is the one which updates jiffies, then give up
* the assignment and let it be taken by the CPU which runs
@ -905,13 +1008,13 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
*/
if (cpu == tick_do_timer_cpu) {
tick_do_timer_cpu = TICK_DO_TIMER_NONE;
ts->do_timer_last = 1;
tick_sched_flag_set(ts, TS_FLAG_DO_TIMER_LAST);
} else if (tick_do_timer_cpu != TICK_DO_TIMER_NONE) {
ts->do_timer_last = 0;
tick_sched_flag_clear(ts, TS_FLAG_DO_TIMER_LAST);
}
/* Skip reprogram of event if it's not changed */
if (ts->tick_stopped && (expires == ts->next_tick)) {
if (tick_sched_flag_test(ts, TS_FLAG_STOPPED) && (expires == ts->next_tick)) {
/* Sanity check: make sure clockevent is actually programmed */
if (expires == KTIME_MAX || ts->next_tick == hrtimer_get_expires(&ts->sched_timer))
return;
@ -929,12 +1032,12 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
* call we save the current tick time, so we can restart the
* scheduler tick in tick_nohz_restart_sched_tick().
*/
if (!ts->tick_stopped) {
if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
calc_load_nohz_start();
quiet_vmstat();
ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
tick_sched_flag_set(ts, TS_FLAG_STOPPED);
trace_tick_stop(1, TICK_DEP_MASK_NONE);
}
@ -945,14 +1048,11 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
* the tick timer.
*/
if (unlikely(expires == KTIME_MAX)) {
if (ts->nohz_mode == NOHZ_MODE_HIGHRES)
hrtimer_cancel(&ts->sched_timer);
else
tick_program_event(KTIME_MAX, 1);
tick_sched_timer_cancel(ts);
return;
}
if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
if (tick_sched_flag_test(ts, TS_FLAG_HIGHRES)) {
hrtimer_start(&ts->sched_timer, expires,
HRTIMER_MODE_ABS_PINNED_HARD);
} else {
@ -967,7 +1067,7 @@ static void tick_nohz_retain_tick(struct tick_sched *ts)
}
#ifdef CONFIG_NO_HZ_FULL
static void tick_nohz_stop_sched_tick(struct tick_sched *ts, int cpu)
static void tick_nohz_full_stop_tick(struct tick_sched *ts, int cpu)
{
if (tick_nohz_next_event(ts, cpu))
tick_nohz_stop_tick(ts, cpu);
@ -991,7 +1091,7 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
touch_softlockup_watchdog_sched();
/* Cancel the scheduled timer and restore the tick: */
ts->tick_stopped = 0;
tick_sched_flag_clear(ts, TS_FLAG_STOPPED);
tick_nohz_restart(ts, now);
}
@ -1002,8 +1102,8 @@ static void __tick_nohz_full_update_tick(struct tick_sched *ts,
int cpu = smp_processor_id();
if (can_stop_full_tick(cpu, ts))
tick_nohz_stop_sched_tick(ts, cpu);
else if (ts->tick_stopped)
tick_nohz_full_stop_tick(ts, cpu);
else if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
tick_nohz_restart_sched_tick(ts, now);
#endif
}
@ -1013,7 +1113,7 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
if (!tick_nohz_full_cpu(smp_processor_id()))
return;
if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
if (!tick_sched_flag_test(ts, TS_FLAG_NOHZ))
return;
__tick_nohz_full_update_tick(ts, ktime_get());
@ -1060,25 +1160,9 @@ static bool report_idle_softirq(void)
static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
/*
* If this CPU is offline and it is the one which updates
* jiffies, then give up the assignment and let it be taken by
* the CPU which runs the tick timer next. If we don't drop
* this here, the jiffies might be stale and do_timer() never
* gets invoked.
*/
if (unlikely(!cpu_online(cpu))) {
if (cpu == tick_do_timer_cpu)
tick_do_timer_cpu = TICK_DO_TIMER_NONE;
/*
* Make sure the CPU doesn't get fooled by obsolete tick
* deadline if it comes back online later.
*/
ts->next_tick = 0;
return false;
}
WARN_ON_ONCE(cpu_is_offline(cpu));
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
if (unlikely(!tick_sched_flag_test(ts, TS_FLAG_NOHZ)))
return false;
if (need_resched())
@ -1128,14 +1212,14 @@ void tick_nohz_idle_stop_tick(void)
ts->idle_calls++;
if (expires > 0LL) {
int was_stopped = ts->tick_stopped;
int was_stopped = tick_sched_flag_test(ts, TS_FLAG_STOPPED);
tick_nohz_stop_tick(ts, cpu);
ts->idle_sleeps++;
ts->idle_expires = expires;
if (!was_stopped && ts->tick_stopped) {
if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
ts->idle_jiffies = ts->last_jiffies;
nohz_balance_enter_idle(cpu);
}
@ -1147,11 +1231,6 @@ void tick_nohz_idle_stop_tick(void)
void tick_nohz_idle_retain_tick(void)
{
tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
/*
* Undo the effect of get_next_timer_interrupt() called from
* tick_nohz_next_event().
*/
timer_clear_idle();
}
/**
@ -1171,7 +1250,7 @@ void tick_nohz_idle_enter(void)
WARN_ON_ONCE(ts->timer_expires_base);
ts->inidle = 1;
tick_sched_flag_set(ts, TS_FLAG_INIDLE);
tick_nohz_start_idle(ts);
local_irq_enable();
@ -1200,7 +1279,7 @@ void tick_nohz_irq_exit(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
if (ts->inidle)
if (tick_sched_flag_test(ts, TS_FLAG_INIDLE))
tick_nohz_start_idle(ts);
else
tick_nohz_full_update_tick(ts);
@ -1254,7 +1333,7 @@ ktime_t tick_nohz_get_sleep_length(ktime_t *delta_next)
ktime_t now = ts->idle_entrytime;
ktime_t next_event;
WARN_ON_ONCE(!ts->inidle);
WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_INIDLE));
*delta_next = ktime_sub(dev->next_event, now);
@ -1326,7 +1405,7 @@ void tick_nohz_idle_restart_tick(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
if (ts->tick_stopped) {
if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
ktime_t now = ktime_get();
tick_nohz_restart_sched_tick(ts, now);
tick_nohz_account_idle_time(ts, now);
@ -1367,12 +1446,12 @@ void tick_nohz_idle_exit(void)
local_irq_disable();
WARN_ON_ONCE(!ts->inidle);
WARN_ON_ONCE(!tick_sched_flag_test(ts, TS_FLAG_INIDLE));
WARN_ON_ONCE(ts->timer_expires_base);
ts->inidle = 0;
idle_active = ts->idle_active;
tick_stopped = ts->tick_stopped;
tick_sched_flag_clear(ts, TS_FLAG_INIDLE);
idle_active = tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE);
tick_stopped = tick_sched_flag_test(ts, TS_FLAG_STOPPED);
if (idle_active || tick_stopped)
now = ktime_get();
@ -1391,38 +1470,22 @@ void tick_nohz_idle_exit(void)
* at the clockevent level. hrtimer can't be used instead, because its
* infrastructure actually relies on the tick itself as a backend in
* low-resolution mode (see hrtimer_run_queues()).
*
* This low-resolution handler still makes use of some hrtimer APIs meanwhile
* for convenience with expiration calculation and forwarding.
*/
static void tick_nohz_lowres_handler(struct clock_event_device *dev)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();
dev->next_event = KTIME_MAX;
tick_sched_do_timer(ts, now);
tick_sched_handle(ts, regs);
/*
* In dynticks mode, tick reprogram is deferred:
* - to the idle task if in dynticks-idle
* - to IRQ exit if in full-dynticks.
*/
if (likely(!ts->tick_stopped)) {
hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
if (likely(tick_nohz_handler(&ts->sched_timer) == HRTIMER_RESTART))
tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
}
}
static inline void tick_nohz_activate(struct tick_sched *ts, int mode)
static inline void tick_nohz_activate(struct tick_sched *ts)
{
if (!tick_nohz_enabled)
return;
ts->nohz_mode = mode;
tick_sched_flag_set(ts, TS_FLAG_NOHZ);
/* One update is enough */
if (!test_and_set_bit(0, &tick_nohz_active))
timers_update_nohz();
@ -1433,9 +1496,6 @@ static inline void tick_nohz_activate(struct tick_sched *ts, int mode)
*/
static void tick_nohz_switch_to_nohz(void)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
ktime_t next;
if (!tick_nohz_enabled)
return;
@ -1444,16 +1504,9 @@ static void tick_nohz_switch_to_nohz(void)
/*
* Recycle the hrtimer in 'ts', so we can share the
* hrtimer_forward_now() function with the highres code.
* highres code.
*/
hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
/* Get the next period */
next = tick_init_jiffy_update();
hrtimer_set_expires(&ts->sched_timer, next);
hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
tick_nohz_activate(ts, NOHZ_MODE_LOWRES);
tick_setup_sched_timer(false);
}
static inline void tick_nohz_irq_enter(void)
@ -1461,10 +1514,10 @@ static inline void tick_nohz_irq_enter(void)
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
ktime_t now;
if (!ts->idle_active && !ts->tick_stopped)
if (!tick_sched_flag_test(ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE))
return;
now = ktime_get();
if (ts->idle_active)
if (tick_sched_flag_test(ts, TS_FLAG_IDLE_ACTIVE))
tick_nohz_stop_idle(ts, now);
/*
* If all CPUs are idle we may need to update a stale jiffies value.
@ -1473,7 +1526,7 @@ static inline void tick_nohz_irq_enter(void)
* rare case (typically stop machine). So we must make sure we have a
* last resort.
*/
if (ts->tick_stopped)
if (tick_sched_flag_test(ts, TS_FLAG_STOPPED))
tick_nohz_update_jiffies(now);
}
@ -1481,7 +1534,7 @@ static inline void tick_nohz_irq_enter(void)
static inline void tick_nohz_switch_to_nohz(void) { }
static inline void tick_nohz_irq_enter(void) { }
static inline void tick_nohz_activate(struct tick_sched *ts, int mode) { }
static inline void tick_nohz_activate(struct tick_sched *ts) { }
#endif /* CONFIG_NO_HZ_COMMON */
@ -1494,45 +1547,6 @@ void tick_irq_enter(void)
tick_nohz_irq_enter();
}
/*
* High resolution timer specific code
*/
#ifdef CONFIG_HIGH_RES_TIMERS
/*
* We rearm the timer until we get disabled by the idle code.
* Called with interrupts disabled.
*/
static enum hrtimer_restart tick_nohz_highres_handler(struct hrtimer *timer)
{
struct tick_sched *ts =
container_of(timer, struct tick_sched, sched_timer);
struct pt_regs *regs = get_irq_regs();
ktime_t now = ktime_get();
tick_sched_do_timer(ts, now);
/*
* Do not call when we are not in IRQ context and have
* no valid 'regs' pointer
*/
if (regs)
tick_sched_handle(ts, regs);
else
ts->next_tick = 0;
/*
* In dynticks mode, tick reprogram is deferred:
* - to the idle task if in dynticks-idle
* - to IRQ exit if in full-dynticks.
*/
if (unlikely(ts->tick_stopped))
return HRTIMER_NORESTART;
hrtimer_forward(timer, now, TICK_NSEC);
return HRTIMER_RESTART;
}
static int sched_skew_tick;
static int __init skew_tick(char *str)
@ -1545,15 +1559,19 @@ early_param("skew_tick", skew_tick);
/**
* tick_setup_sched_timer - setup the tick emulation timer
* @mode: tick_nohz_mode to setup for
*/
void tick_setup_sched_timer(void)
void tick_setup_sched_timer(bool hrtimer)
{
struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
ktime_t now = ktime_get();
/* Emulate tick processing via per-CPU hrtimers: */
hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
ts->sched_timer.function = tick_nohz_highres_handler;
if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer) {
tick_sched_flag_set(ts, TS_FLAG_HIGHRES);
ts->sched_timer.function = tick_nohz_handler;
}
/* Get the next period (per-CPU) */
hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
@ -1566,23 +1584,35 @@ void tick_setup_sched_timer(void)
hrtimer_add_expires_ns(&ts->sched_timer, offset);
}
hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED_HARD);
tick_nohz_activate(ts, NOHZ_MODE_HIGHRES);
hrtimer_forward_now(&ts->sched_timer, TICK_NSEC);
if (IS_ENABLED(CONFIG_HIGH_RES_TIMERS) && hrtimer)
hrtimer_start_expires(&ts->sched_timer, HRTIMER_MODE_ABS_PINNED_HARD);
else
tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
tick_nohz_activate(ts);
}
#endif /* HIGH_RES_TIMERS */
#if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
void tick_cancel_sched_timer(int cpu)
/*
* Shut down the tick and make sure the CPU won't try to retake the timekeeping
* duty before disabling IRQs in idle for the last time.
*/
void tick_sched_timer_dying(int cpu)
{
struct tick_device *td = &per_cpu(tick_cpu_device, cpu);
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
struct clock_event_device *dev = td->evtdev;
ktime_t idle_sleeptime, iowait_sleeptime;
unsigned long idle_calls, idle_sleeps;
# ifdef CONFIG_HIGH_RES_TIMERS
if (ts->sched_timer.base)
hrtimer_cancel(&ts->sched_timer);
# endif
/* This must happen before hrtimers are migrated! */
tick_sched_timer_cancel(ts);
/*
* If the clockevents doesn't support CLOCK_EVT_STATE_ONESHOT_STOPPED,
* make sure not to call low-res tick handler.
*/
if (tick_sched_flag_test(ts, TS_FLAG_NOHZ))
dev->event_handler = clockevents_handle_noop;
idle_sleeptime = ts->idle_sleeptime;
iowait_sleeptime = ts->iowait_sleeptime;
@ -1594,7 +1624,6 @@ void tick_cancel_sched_timer(int cpu)
ts->idle_calls = idle_calls;
ts->idle_sleeps = idle_sleeps;
}
#endif
/*
* Async notification about clocksource changes
@ -1632,7 +1661,7 @@ int tick_check_oneshot_change(int allow_nohz)
if (!test_and_clear_bit(0, &ts->check_clocks))
return 0;
if (ts->nohz_mode != NOHZ_MODE_INACTIVE)
if (tick_sched_flag_test(ts, TS_FLAG_NOHZ))
return 0;
if (!timekeeping_valid_for_hres() || !tick_is_oneshot_available())


@ -14,20 +14,26 @@ struct tick_device {
enum tick_device_mode mode;
};
enum tick_nohz_mode {
NOHZ_MODE_INACTIVE,
NOHZ_MODE_LOWRES,
NOHZ_MODE_HIGHRES,
};
/* The CPU is in the tick idle mode */
#define TS_FLAG_INIDLE BIT(0)
/* The idle tick has been stopped */
#define TS_FLAG_STOPPED BIT(1)
/*
* Indicator that the CPU is actively in the tick idle mode;
* it is reset during irq handling phases.
*/
#define TS_FLAG_IDLE_ACTIVE BIT(2)
/* CPU was the last one doing do_timer before going idle */
#define TS_FLAG_DO_TIMER_LAST BIT(3)
/* NO_HZ is enabled */
#define TS_FLAG_NOHZ BIT(4)
/* High resolution tick mode */
#define TS_FLAG_HIGHRES BIT(5)
/**
* struct tick_sched - sched tick emulation and no idle tick control/stats
*
* @inidle: Indicator that the CPU is in the tick idle mode
* @tick_stopped: Indicator that the idle tick has been stopped
* @idle_active: Indicator that the CPU is actively in the tick idle mode;
* it is reset during irq handling phases.
* @do_timer_last: CPU was the last one doing do_timer before going idle
* @flags: State flags gathering the TS_FLAG_* features
* @got_idle_tick: Tick timer function has run with @inidle set
* @stalled_jiffies: Number of stalled jiffies detected across ticks
* @last_tick_jiffies: Value of jiffies seen on last tick
@ -57,11 +63,7 @@ enum tick_nohz_mode {
*/
struct tick_sched {
/* Common flags */
unsigned int inidle : 1;
unsigned int tick_stopped : 1;
unsigned int idle_active : 1;
unsigned int do_timer_last : 1;
unsigned int got_idle_tick : 1;
unsigned long flags;
/* Tick handling: jiffies stall check */
unsigned int stalled_jiffies;
@ -73,13 +75,13 @@ struct tick_sched {
ktime_t next_tick;
unsigned long idle_jiffies;
ktime_t idle_waketime;
unsigned int got_idle_tick;
/* Idle entry */
seqcount_t idle_sleeptime_seq;
ktime_t idle_entrytime;
/* Tick stop */
enum tick_nohz_mode nohz_mode;
unsigned long last_jiffies;
u64 timer_expires_base;
u64 timer_expires;
@ -102,11 +104,11 @@ struct tick_sched {
extern struct tick_sched *tick_get_tick_sched(int cpu);
extern void tick_setup_sched_timer(void);
#if defined CONFIG_NO_HZ_COMMON || defined CONFIG_HIGH_RES_TIMERS
extern void tick_cancel_sched_timer(int cpu);
extern void tick_setup_sched_timer(bool hrtimer);
#if defined CONFIG_TICK_ONESHOT
extern void tick_sched_timer_dying(int cpu);
#else
static inline void tick_cancel_sched_timer(int cpu) { }
static inline void tick_sched_timer_dying(int cpu) { }
#endif
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST
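
The header changes above replace the one-bit fields of struct tick_sched with a single flags word and TS_FLAG_* masks. The sketch below shows how such a flags word can be tested, set and cleared. It is an illustration only: struct ts_model and the ts_flag_*() helpers are hypothetical stand-ins, and the "any requested bit is set" semantics of the test helper are an assumption inferred from the call site that passes TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as in the header above, with BIT(n) spelled out. */
#define TS_FLAG_INIDLE		(1UL << 0)
#define TS_FLAG_STOPPED		(1UL << 1)
#define TS_FLAG_IDLE_ACTIVE	(1UL << 2)

/* Hypothetical stand-in for struct tick_sched: only the flags word. */
struct ts_model {
	unsigned long flags;
};

/* Assumed semantics: true if at least one of the requested bits is set. */
static inline bool ts_flag_test(const struct ts_model *ts, unsigned long flag)
{
	return (ts->flags & flag) != 0;
}

static inline void ts_flag_set(struct ts_model *ts, unsigned long flag)
{
	ts->flags |= flag;
}

static inline void ts_flag_clear(struct ts_model *ts, unsigned long flag)
{
	ts->flags &= ~flag;
}

/* Mirrors the order of operations shown for tick_nohz_idle_exit(). */
void ts_model_selfcheck(void)
{
	struct ts_model ts = { .flags = TS_FLAG_INIDLE | TS_FLAG_STOPPED };
	bool idle_active, tick_stopped;

	ts_flag_clear(&ts, TS_FLAG_INIDLE);
	idle_active = ts_flag_test(&ts, TS_FLAG_IDLE_ACTIVE);
	tick_stopped = ts_flag_test(&ts, TS_FLAG_STOPPED);

	assert(!ts_flag_test(&ts, TS_FLAG_INIDLE));
	assert(!idle_active);
	assert(tick_stopped);
	/* A combined mask replaces the old "!a && !b" bitfield check. */
	assert(ts_flag_test(&ts, TS_FLAG_STOPPED | TS_FLAG_IDLE_ACTIVE));
}
```

With this shape, a check such as `!ts->idle_active && !ts->tick_stopped` becomes a single negated test against the OR of both masks, which is the pattern used in tick_nohz_irq_enter() above.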


@ -1180,13 +1180,15 @@ static int adjust_historical_crosststamp(struct system_time_snapshot *history,
}
/*
* cycle_between - true if test occurs chronologically between before and after
* timestamp_in_interval - true if ts is chronologically in [start, end]
*
* True if ts occurs chronologically at or after start, and before or at end.
*/
static bool cycle_between(u64 before, u64 test, u64 after)
static bool timestamp_in_interval(u64 start, u64 end, u64 ts)
{
if (test > before && test < after)
if (ts >= start && ts <= end)
return true;
if (test < before && before > after)
if (start > end && (ts >= start || ts <= end))
return true;
return false;
}
@ -1247,7 +1249,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
*/
now = tk_clock_read(&tk->tkr_mono);
interval_start = tk->tkr_mono.cycle_last;
if (!cycle_between(interval_start, cycles, now)) {
if (!timestamp_in_interval(interval_start, now, cycles)) {
clock_was_set_seq = tk->clock_was_set_seq;
cs_was_changed_seq = tk->cs_was_changed_seq;
cycles = interval_start;
@ -1260,10 +1262,8 @@ int get_device_system_crosststamp(int (*get_time_fn)
tk_core.timekeeper.offs_real);
base_raw = tk->tkr_raw.base;
nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono,
system_counterval.cycles);
nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw,
system_counterval.cycles);
nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, cycles);
nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, cycles);
} while (read_seqcount_retry(&tk_core.seq, seq));
xtstamp->sys_realtime = ktime_add_ns(base_real, nsec_real);
@ -1278,13 +1278,13 @@ int get_device_system_crosststamp(int (*get_time_fn)
bool discontinuity;
/*
* Check that the counter value occurs after the provided
* Check that the counter value is not before the provided
* history reference and that the history doesn't cross a
* clocksource change
*/
if (!history_begin ||
!cycle_between(history_begin->cycles,
system_counterval.cycles, cycles) ||
!timestamp_in_interval(history_begin->cycles,
cycles, system_counterval.cycles) ||
history_begin->cs_was_changed_seq != cs_was_changed_seq)
return -EINVAL;
partial_history_cycles = cycles - system_counterval.cycles;
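
The timestamp_in_interval() helper introduced above differs from the old cycle_between() in two ways: the interval endpoints now count as inside, and an interval whose start is numerically larger than its end is treated as wrapping around the counter. The following user-space sketch copies the helper's logic with uint64_t standing in for u64 and exercises both cases. The self-check function is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same logic as the helper in the diff, with uint64_t in place of u64. */
static inline bool timestamp_in_interval(uint64_t start, uint64_t end,
					 uint64_t ts)
{
	if (ts >= start && ts <= end)
		return true;
	/* start > end means the interval wraps around the counter maximum */
	if (start > end && (ts >= start || ts <= end))
		return true;
	return false;
}

/* Hypothetical self-check covering the plain and the wrapped interval. */
void timestamp_in_interval_selfcheck(void)
{
	/* Plain interval: both endpoints are included. */
	assert(timestamp_in_interval(100, 200, 100));
	assert(timestamp_in_interval(100, 200, 200));
	assert(!timestamp_in_interval(100, 200, 201));

	/* Wrapped interval [UINT64_MAX - 5, 10]. */
	assert(timestamp_in_interval(UINT64_MAX - 5, 10, UINT64_MAX));
	assert(timestamp_in_interval(UINT64_MAX - 5, 10, 3));
	assert(!timestamp_in_interval(UINT64_MAX - 5, 10, 50));
}
```

The endpoint cases are the ones the old strict comparisons (`test > before && test < after`) rejected.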


@ -53,6 +53,7 @@
#include <asm/io.h>
#include "tick-internal.h"
#include "timer_migration.h"
#define CREATE_TRACE_POINTS
#include <trace/events/timer.h>
@ -187,15 +188,66 @@ EXPORT_SYMBOL(jiffies_64);
#define WHEEL_SIZE (LVL_SIZE * LVL_DEPTH)
#ifdef CONFIG_NO_HZ_COMMON
# define NR_BASES 2
# define BASE_STD 0
# define BASE_DEF 1
/*
* If multiple bases need to be locked, use the base ordering for lock
* nesting, i.e. lowest number first.
*/
# define NR_BASES 3
# define BASE_LOCAL 0
# define BASE_GLOBAL 1
# define BASE_DEF 2
#else
# define NR_BASES 1
# define BASE_STD 0
# define BASE_LOCAL 0
# define BASE_GLOBAL 0
# define BASE_DEF 0
#endif
/**
* struct timer_base - Per CPU timer base (number of base depends on config)
* @lock: Lock protecting the timer_base
* @running_timer: When expiring timers, the lock is dropped. To make
* sure not to race against deleting/modifying a
* currently running timer, the pointer is set to the
* timer, which expires at the moment. If no timer is
* running, the pointer is NULL.
* @expiry_lock: PREEMPT_RT only: Lock is taken in softirq around
* timer expiry callback execution and when trying to
* delete a running timer and it wasn't successful in
* the first glance. It prevents priority inversion
* when callback was preempted on a remote CPU and a
* caller tries to delete the running timer. It also
* prevents a livelock, when the task which tries to
* delete a timer preempted the softirq thread which
* is running the timer callback function.
* @timer_waiters: PREEMPT_RT only: Tells, if there is a waiter
* waiting for the end of the timer callback function
* execution.
* @clk: clock of the timer base; is updated before enqueue
* of a timer; during expiry, it is 1 offset ahead of
* jiffies to avoid endless requeuing to current
* jiffies
* @next_expiry: expiry value of the first timer; it is updated when
* finding the next timer and during enqueue; the
* value is not valid, when next_expiry_recalc is set
* @cpu: Number of CPU the timer base belongs to
* @next_expiry_recalc: States, whether a recalculation of next_expiry is
* required. Value is set true, when a timer was
* deleted.
* @is_idle: Is set, when timer_base is idle. It is triggered by NOHZ
* code. This state is only used in standard
* base. Deferrable timers, which are enqueued remotely,
* never wake up an idle CPU, so there is no need to support
* it for the deferrable base.
* @timers_pending: Is set, when a timer is pending in the base. It is only
* reliable when next_expiry_recalc is not set.
* @pending_map: bitmap of the timer wheel; each bit reflects a
* bucket of the wheel. When a bit is set, at least a
* single timer is enqueued in the related bucket.
* @vectors: Array of lists; Each array member reflects a bucket
* of the timer wheel. The list contains all timers
* which are enqueued into a specific bucket.
*/
struct timer_base {
raw_spinlock_t lock;
struct timer_list *running_timer;
@ -583,11 +635,16 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
/*
* We might have to IPI the remote CPU if the base is idle and the
* timer is not deferrable. If the other CPU is on the way to idle
* then it can't set base->is_idle as we hold the base lock:
* timer is pinned. If it is a non pinned timer, it is only queued
* on the remote CPU, when timer was running during queueing. Then
* everything is handled by remote CPU anyway. If the other CPU is
* on the way to idle then it can't set base->is_idle as we hold
* the base lock:
*/
if (base->is_idle)
if (base->is_idle) {
WARN_ON_ONCE(!(timer->flags & TIMER_PINNED));
wake_up_nohz_cpu(base->cpu);
}
}
/*
@ -899,7 +956,10 @@ static int detach_if_pending(struct timer_list *timer, struct timer_base *base,
static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
{
struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_STD], cpu);
int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
struct timer_base *base;
base = per_cpu_ptr(&timer_bases[index], cpu);
/*
* If the timer is deferrable and NO_HZ_COMMON is set then we need
@ -912,7 +972,10 @@ static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
static inline struct timer_base *get_timer_this_cpu_base(u32 tflags)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
struct timer_base *base;
base = this_cpu_ptr(&timer_bases[index]);
/*
* If the timer is deferrable and NO_HZ_COMMON is set then we need
@ -928,17 +991,6 @@ static inline struct timer_base *get_timer_base(u32 tflags)
return get_timer_cpu_base(tflags, tflags & TIMER_CPUMASK);
}
static inline struct timer_base *
get_target_base(struct timer_base *base, unsigned tflags)
{
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
if (static_branch_likely(&timers_migration_enabled) &&
!(tflags & TIMER_PINNED))
return get_timer_cpu_base(tflags, get_nohz_timer_target());
#endif
return get_timer_this_cpu_base(tflags);
}
static inline void __forward_timer_base(struct timer_base *base,
unsigned long basej)
{
@ -1093,7 +1145,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option
if (!ret && (options & MOD_TIMER_PENDING_ONLY))
goto out_unlock;
new_base = get_target_base(base, timer->flags);
new_base = get_timer_this_cpu_base(timer->flags);
if (base != new_base) {
/*
@ -1245,12 +1297,49 @@ void add_timer(struct timer_list *timer)
}
EXPORT_SYMBOL(add_timer);
/**
* add_timer_local() - Start a timer on the local CPU
* @timer: The timer to be started
*
* Same as add_timer() except that the timer flag TIMER_PINNED is set.
*
* See add_timer() for further details.
*/
void add_timer_local(struct timer_list *timer)
{
if (WARN_ON_ONCE(timer_pending(timer)))
return;
timer->flags |= TIMER_PINNED;
__mod_timer(timer, timer->expires, MOD_TIMER_NOTPENDING);
}
EXPORT_SYMBOL(add_timer_local);
/**
* add_timer_global() - Start a timer without TIMER_PINNED flag set
* @timer: The timer to be started
*
* Same as add_timer() except that the timer flag TIMER_PINNED is unset.
*
* See add_timer() for further details.
*/
void add_timer_global(struct timer_list *timer)
{
if (WARN_ON_ONCE(timer_pending(timer)))
return;
timer->flags &= ~TIMER_PINNED;
__mod_timer(timer, timer->expires, MOD_TIMER_NOTPENDING);
}
EXPORT_SYMBOL(add_timer_global);
/**
* add_timer_on - Start a timer on a particular CPU
* @timer: The timer to be started
* @cpu: The CPU to start it on
*
* Same as add_timer() except that it starts the timer on the given CPU.
* Same as add_timer() except that it starts the timer on the given CPU and
* the TIMER_PINNED flag is set. When timer shouldn't be a pinned timer in
* the next round, add_timer_global() should be used instead as it unsets
* the TIMER_PINNED flag.
*
* See add_timer() for further details.
*/
@ -1264,6 +1353,9 @@ void add_timer_on(struct timer_list *timer, int cpu)
if (WARN_ON_ONCE(timer_pending(timer)))
return;
/* Make sure timer flags have TIMER_PINNED flag set */
timer->flags |= TIMER_PINNED;
new_base = get_timer_cpu_base(timer->flags, cpu);
/*
@ -1911,71 +2003,350 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
}
/**
* get_next_timer_interrupt - return the time (clock mono) of the next timer
* @basej: base time jiffies
* @basem: base time clock monotonic
*
* Returns the tick aligned clock monotonic time of the next pending
* timer or KTIME_MAX if no timer is pending.
*/
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
static unsigned long next_timer_interrupt(struct timer_base *base,
unsigned long basej)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
u64 expires = KTIME_MAX;
bool was_idle;
/*
* Pretend that there is no timer pending if the cpu is offline.
* Possible pending timers will be migrated later to an active cpu.
*/
if (cpu_is_offline(smp_processor_id()))
return expires;
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
next_expiry_recalc(base);
/*
* Move next_expiry for the empty base into the future to prevent an
* unnecessary raise of the timer softirq when the next_expiry value
* will be reached even if there is no timer pending.
*
* This update is also required to make timer_base::next_expiry values
* easily comparable to find out which base holds the first pending timer.
*/
if (!base->timers_pending)
base->next_expiry = basej + NEXT_TIMER_MAX_DELTA;
return base->next_expiry;
}
static unsigned long fetch_next_timer_interrupt(unsigned long basej, u64 basem,
struct timer_base *base_local,
struct timer_base *base_global,
struct timer_events *tevt)
{
unsigned long nextevt, nextevt_local, nextevt_global;
bool local_first;
nextevt_local = next_timer_interrupt(base_local, basej);
nextevt_global = next_timer_interrupt(base_global, basej);
local_first = time_before_eq(nextevt_local, nextevt_global);
nextevt = local_first ? nextevt_local : nextevt_global;
/*
* If the @nextevt is at max. one tick away, use @nextevt and store
* it in the local expiry value. The next global event is irrelevant in
* this case and can be left as KTIME_MAX.
*/
if (time_before_eq(nextevt, basej + 1)) {
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
tevt->local = basem + (u64)(nextevt - basej) * TICK_NSEC;
/*
* This is required for the remote check only but it doesn't
* hurt, when it is done for both call sites:
*
* * The remote callers will only take care of the global timers
* as local timers will be handled by CPU itself. When not
* updating tevt->global with the already missed first global
* timer, it is possible that it will be missed completely.
*
* * The local callers will ignore the tevt->global anyway, when
* nextevt is max. one tick away.
*/
if (!local_first)
tevt->global = tevt->local;
return nextevt;
}
/*
* Update tevt.* values:
*
* If the local queue expires first, then the global event can be
* ignored. If the global queue is empty, nothing to do either.
*/
if (!local_first && base_global->timers_pending)
tevt->global = basem + (u64)(nextevt_global - basej) * TICK_NSEC;
if (base_local->timers_pending)
tevt->local = basem + (u64)(nextevt_local - basej) * TICK_NSEC;
return nextevt;
}
# ifdef CONFIG_SMP
/**
* fetch_next_timer_interrupt_remote() - Store next timers into @tevt
* @basej: base time jiffies
* @basem: base time clock monotonic
* @tevt: Pointer to the storage for the expiry values
* @cpu: Remote CPU
*
* Stores the next pending local and global timer expiry values in the
* struct pointed to by @tevt. If a queue is empty the corresponding
* field is set to KTIME_MAX. If local event expires before global
* event, global event is set to KTIME_MAX as well.
*
* Caller needs to make sure timer base locks are held (use
* timer_lock_remote_bases() for this purpose).
*/
void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
struct timer_events *tevt,
unsigned int cpu)
{
struct timer_base *base_local, *base_global;
/* Preset local / global events */
tevt->local = tevt->global = KTIME_MAX;
base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
lockdep_assert_held(&base_local->lock);
lockdep_assert_held(&base_global->lock);
fetch_next_timer_interrupt(basej, basem, base_local, base_global, tevt);
}
/**
* timer_unlock_remote_bases - unlock timer bases of cpu
* @cpu: Remote CPU
*
* Unlocks the remote timer bases.
*/
void timer_unlock_remote_bases(unsigned int cpu)
__releases(timer_bases[BASE_LOCAL]->lock)
__releases(timer_bases[BASE_GLOBAL]->lock)
{
struct timer_base *base_local, *base_global;
base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
raw_spin_unlock(&base_global->lock);
raw_spin_unlock(&base_local->lock);
}
/**
* timer_lock_remote_bases - lock timer bases of cpu
* @cpu: Remote CPU
*
* Locks the remote timer bases.
*/
void timer_lock_remote_bases(unsigned int cpu)
__acquires(timer_bases[BASE_LOCAL]->lock)
__acquires(timer_bases[BASE_GLOBAL]->lock)
{
struct timer_base *base_local, *base_global;
base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
lockdep_assert_irqs_disabled();
raw_spin_lock(&base_local->lock);
raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
}
/**
* timer_base_is_idle() - Return whether timer base is set idle
*
* Returns value of local timer base is_idle value.
*/
bool timer_base_is_idle(void)
{
return __this_cpu_read(timer_bases[BASE_LOCAL].is_idle);
}
static void __run_timer_base(struct timer_base *base);
/**
* timer_expire_remote() - expire global timers of cpu
* @cpu: Remote CPU
*
* Expire timers of global base of remote CPU.
*/
void timer_expire_remote(unsigned int cpu)
{
struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
__run_timer_base(base);
}
static void timer_use_tmigr(unsigned long basej, u64 basem,
unsigned long *nextevt, bool *tick_stop_path,
bool timer_base_idle, struct timer_events *tevt)
{
u64 next_tmigr;
if (timer_base_idle)
next_tmigr = tmigr_cpu_new_timer(tevt->global);
else if (tick_stop_path)
next_tmigr = tmigr_cpu_deactivate(tevt->global);
else
next_tmigr = tmigr_quick_check(tevt->global);
/*
* If the CPU is the last going idle in timer migration hierarchy, make
* sure the CPU will wake up in time to handle remote timers.
* next_tmigr == KTIME_MAX if other CPUs are still active.
*/
if (next_tmigr < tevt->local) {
u64 tmp;
/* If we missed a tick already, force 0 delta */
if (next_tmigr < basem)
next_tmigr = basem;
tmp = div_u64(next_tmigr - basem, TICK_NSEC);
*nextevt = basej + (unsigned long)tmp;
tevt->local = next_tmigr;
}
}
# else
static void timer_use_tmigr(unsigned long basej, u64 basem,
unsigned long *nextevt, bool *tick_stop_path,
bool timer_base_idle, struct timer_events *tevt)
{
/*
* Make sure first event is written into tevt->local to not miss a
* timer on !SMP systems.
*/
tevt->local = min_t(u64, tevt->local, tevt->global);
}
# endif /* CONFIG_SMP */
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
bool *idle)
{
struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
struct timer_base *base_local, *base_global;
unsigned long nextevt;
bool idle_is_possible;
/*
* When the CPU is offline, the tick is cancelled and nothing is supposed
* to try to stop it.
*/
if (WARN_ON_ONCE(cpu_is_offline(smp_processor_id()))) {
if (idle)
*idle = true;
return tevt.local;
}
base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
raw_spin_lock(&base_local->lock);
raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
base_global, &tevt);
/*
* If the next event is only one jiffy ahead there is no need to call
* timer migration hierarchy related functions. The value for the next
* global timer in @tevt struct equals then KTIME_MAX. This is also
* true, when the timer base is idle.
*
* The proper timer migration hierarchy function depends on the callsite
* and whether timer base is idle or not. @nextevt will be updated when
* this CPU needs to handle the first timer migration hierarchy
* event. See timer_use_tmigr() for detailed information.
*/
idle_is_possible = time_after(nextevt, basej + 1);
if (idle_is_possible)
timer_use_tmigr(basej, basem, &nextevt, idle,
base_local->is_idle, &tevt);
/*
* We have a fresh next event. Check whether we can forward the
* base.
*/
__forward_timer_base(base, basej);
if (base->timers_pending) {
nextevt = base->next_expiry;
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
} else {
/*
* Move next_expiry for the empty base into the future to
* prevent a unnecessary raise of the timer softirq when the
* next_expiry value will be reached even if there is no timer
* pending.
*/
base->next_expiry = nextevt;
}
__forward_timer_base(base_local, basej);
__forward_timer_base(base_global, basej);
/*
* Base is idle if the next event is more than a tick away.
*
* If the base is marked idle then any timer add operation must forward
* the base clk itself to keep granularity small. This idle logic is
* only maintained for the BASE_STD base, deferrable timers may still
* see large granularity skew (by design).
* Set base->is_idle only when caller is timer_base_try_to_set_idle()
*/
was_idle = base->is_idle;
base->is_idle = time_after(nextevt, basej + 1);
if (was_idle != base->is_idle)
trace_timer_base_idle(base->is_idle, base->cpu);
if (idle) {
/*
* Bases are idle if the next event is more than a tick
* away. Caution: @nextevt could have changed by enqueueing a
* global timer into timer migration hierarchy. Therefore a new
* check is required here.
*
* If the base is marked idle then any timer add operation must
* forward the base clk itself to keep granularity small. This
* idle logic is only maintained for the BASE_LOCAL and
* BASE_GLOBAL base, deferrable timers may still see large
* granularity skew (by design).
*/
if (!base_local->is_idle && time_after(nextevt, basej + 1)) {
base_local->is_idle = true;
trace_timer_base_idle(true, base_local->cpu);
}
*idle = base_local->is_idle;
raw_spin_unlock(&base->lock);
/*
* When timer base is not set idle, undo the effect of
* tmigr_cpu_deactivate() to prevent inconsistent states - active
* timer base but inactive timer migration hierarchy.
*
* When timer base was already marked idle, nothing will be
* changed here.
*/
if (!base_local->is_idle && idle_is_possible)
tmigr_cpu_activate();
}
return cmp_next_hrtimer_event(basem, expires);
raw_spin_unlock(&base_global->lock);
raw_spin_unlock(&base_local->lock);
return cmp_next_hrtimer_event(basem, tevt.local);
}
/**
* get_next_timer_interrupt() - return the time (clock mono) of the next timer
* @basej: base time jiffies
* @basem: base time clock monotonic
*
* Returns the tick aligned clock monotonic time of the next pending timer or
* KTIME_MAX if no timer is pending. If timer of global base was queued into
* timer migration hierarchy, first global timer is not taken into account. If
* it was the last CPU of timer migration hierarchy going idle, first global
* event is taken into account.
*/
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
{
return __get_next_timer_interrupt(basej, basem, NULL);
}
/**
* timer_base_try_to_set_idle() - Try to set the idle state of the timer bases
* @basej: base time jiffies
* @basem: base time clock monotonic
* @idle: pointer to store the value of timer_base->is_idle on return;
* *idle contains the information whether tick was already stopped
*
* Returns the tick aligned clock monotonic time of the next pending timer or
* KTIME_MAX if no timer is pending. When tick was already stopped KTIME_MAX is
* returned as well.
*/
u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
{
if (*idle)
return KTIME_MAX;
return __get_next_timer_interrupt(basej, basem, idle);
}
/**
@ -1985,18 +2356,18 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
void timer_clear_idle(void)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
/*
* We do this unlocked. The worst outcome is a remote enqueue sending
* a pointless IPI, but taking the lock would just make the window for
* sending the IPI a few instructions smaller for the cost of taking
* the lock in the exit from idle path.
* We do this unlocked. The worst outcome is a remote pinned timer
* enqueue sending a pointless IPI, but taking the lock would just
* make the window for sending the IPI a few instructions smaller
* for the cost of taking the lock in the exit from idle
* path. Required for BASE_LOCAL only.
*/
if (base->is_idle) {
base->is_idle = false;
trace_timer_base_idle(false, smp_processor_id());
}
__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
trace_timer_base_idle(false, smp_processor_id());
/* Activate without holding the timer_base->lock */
tmigr_cpu_activate();
}
#endif
@ -2009,11 +2380,10 @@ static inline void __run_timers(struct timer_base *base)
struct hlist_head heads[LVL_DEPTH];
int levels;
if (time_before(jiffies, base->next_expiry))
return;
lockdep_assert_held(&base->lock);
timer_base_lock_expiry(base);
raw_spin_lock_irq(&base->lock);
if (base->running_timer)
return;
while (time_after_eq(jiffies, base->clk) &&
time_after_eq(jiffies, base->next_expiry)) {
@ -2037,20 +2407,40 @@ static inline void __run_timers(struct timer_base *base)
while (levels--)
expire_timers(base, heads + levels);
}
}
static void __run_timer_base(struct timer_base *base)
{
if (time_before(jiffies, base->next_expiry))
return;
timer_base_lock_expiry(base);
raw_spin_lock_irq(&base->lock);
__run_timers(base);
raw_spin_unlock_irq(&base->lock);
timer_base_unlock_expiry(base);
}
static void run_timer_base(int index)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[index]);
__run_timer_base(base);
}
/*
* This function runs timers and the timer-tq in bottom half context.
*/
static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
run_timer_base(BASE_LOCAL);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
run_timer_base(BASE_GLOBAL);
run_timer_base(BASE_DEF);
__run_timers(base);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
if (is_timers_nohz_active())
tmigr_handle_remote();
}
}
/*
@@ -2058,19 +2448,18 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
*/
static void run_local_timers(void)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
hrtimer_run_queues();
/* Raise the softirq only if required. */
if (time_before(jiffies, base->next_expiry)) {
if (!IS_ENABLED(CONFIG_NO_HZ_COMMON))
return;
/* CPU is awake, so check the deferrable base. */
base++;
if (time_before(jiffies, base->next_expiry))
for (int i = 0; i < NR_BASES; i++, base++) {
/* Raise the softirq only if required. */
if (time_after_eq(jiffies, base->next_expiry) ||
(i == BASE_DEF && tmigr_requires_handle_remote())) {
raise_softirq(TIMER_SOFTIRQ);
return;
}
}
raise_softirq(TIMER_SOFTIRQ);
}
/*
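
The reworked run_local_timers() above walks all timer bases and raises the softirq as soon as one of them has an expired timer, or when remote handling is pending once the deferrable base is reached. A minimal userspace sketch of that decision follows. The names sketch_time_after_eq(), sketch_needs_softirq() and SKETCH_NR_BASES are hypothetical, and the base order (local, global, deferrable) is an assumption; the comparison mirrors the wraparound-safe idiom behind time_after_eq().

```c
#include <stdbool.h>

/* Assumption: three bases in the order BASE_LOCAL, BASE_GLOBAL, BASE_DEF. */
#define SKETCH_NR_BASES 3

/* Wraparound-safe "a >= b" for a free-running tick counter such as jiffies. */
static bool sketch_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Hypothetical helper: returns true when the timer softirq should be raised.
 * @now:            current tick count
 * @next_expiry:    next expiry per base
 * @remote_pending: stand-in for tmigr_requires_handle_remote()
 */
static bool sketch_needs_softirq(unsigned long now,
				 const unsigned long next_expiry[SKETCH_NR_BASES],
				 bool remote_pending)
{
	for (int i = 0; i < SKETCH_NR_BASES; i++) {
		/* An expired timer in any base is sufficient. */
		if (sketch_time_after_eq(now, next_expiry[i]))
			return true;
		/* Remote work is only checked at the last (deferrable) slot. */
		if (i == SKETCH_NR_BASES - 1 && remote_pending)
			return true;
	}
	return false;
}
```

The signed cast keeps the comparison correct across counter wraparound, as long as the two values are less than half the counter range apart.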

@@ -147,11 +147,15 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
# define P_ns(x) \
SEQ_printf(m, " .%-15s: %Lu nsecs\n", #x, \
(unsigned long long)(ktime_to_ns(ts->x)))
# define P_flag(x, f) \
SEQ_printf(m, " .%-15s: %d\n", #x, !!(ts->flags & (f)))
{
struct tick_sched *ts = tick_get_tick_sched(cpu);
P(nohz_mode);
P_flag(nohz, TS_FLAG_NOHZ);
P_flag(highres, TS_FLAG_HIGHRES);
P_ns(last_tick);
P(tick_stopped);
P_flag(tick_stopped, TS_FLAG_STOPPED);
P(idle_jiffies);
P(idle_calls);
P(idle_sleeps);
@@ -256,7 +260,7 @@ static void timer_list_show_tickdevices_header(struct seq_file *m)
static inline void timer_list_header(struct seq_file *m, u64 now)
{
SEQ_printf(m, "Timer List Version: v0.9\n");
SEQ_printf(m, "Timer List Version: v0.10\n");
SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);
SEQ_printf(m, "\n");

File diff suppressed because it is too large

@@ -0,0 +1,140 @@
/* SPDX-License-Identifier: GPL-2.0-only */
#ifndef _KERNEL_TIME_MIGRATION_H
#define _KERNEL_TIME_MIGRATION_H
/* Per group capacity. Must be a power of 2! */
#define TMIGR_CHILDREN_PER_GROUP 8
/**
* struct tmigr_event - a timer event associated to a CPU
* @nextevt: The node to enqueue an event in the parent group queue
* @cpu: The CPU to which this event belongs
* @ignore: Hint whether the event could be ignored; it is set when
* CPU or group is active;
*/
struct tmigr_event {
struct timerqueue_node nextevt;
unsigned int cpu;
bool ignore;
};
/**
* struct tmigr_group - timer migration hierarchy group
* @lock: Lock protecting the event information and group hierarchy
* information during setup
* @parent: Pointer to the parent group
* @groupevt: Next event of the group which is only used when the
* group is !active. The group event is then queued into
* the parent timer queue.
* Ignore bit of @groupevt is set when the group is active.
* @next_expiry: Base monotonic expiry time of the next event of the
* group; It is used for the racy lockless check whether a
* remote expiry is required; it is always reliable
* @events: Timer queue for child events queued in the group
* @migr_state: State of the group (see union tmigr_state)
* @level: Hierarchy level of the group; Required during setup
* @numa_node: Required for setup only to make sure CPU and low level
* group information is NUMA local. It is set to NUMA node
* as long as the group level is per NUMA node (level <
* tmigr_crossnode_level); otherwise it is set to
* NUMA_NO_NODE
* @num_children: Counter of group children to make sure the group is only
* filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
* only
* @childmask: childmask of the group in the parent group; is set
* during setup and will never change; can be read
* lockless
* @list: List head that is added to the per level
* tmigr_level_list; is required during setup when a
* new group needs to be connected to the existing
* hierarchy groups
*/
struct tmigr_group {
raw_spinlock_t lock;
struct tmigr_group *parent;
struct tmigr_event groupevt;
u64 next_expiry;
struct timerqueue_head events;
atomic_t migr_state;
unsigned int level;
int numa_node;
unsigned int num_children;
u8 childmask;
struct list_head list;
};
/**
* struct tmigr_cpu - timer migration per CPU group
* @lock: Lock protecting the tmigr_cpu group information
* @online: Indicates whether the CPU is online; In deactivate path
* it is required to know whether the migrator in the top
* level group is to be set offline, while a timer is
* pending. Then another online CPU needs to be notified to
* take over the migrator role. Furthermore the information
* is required in CPU hotplug path as the CPU is able to go
* idle before the timer migration hierarchy hotplug AP is
* reached. During this phase, the CPU has to handle the
* global timers on its own and must not act as a migrator.
* @idle: Indicates whether the CPU is idle in the timer migration
* hierarchy
* @remote: Is set when timers of the CPU are expired remotely
* @tmgroup: Pointer to the parent group
* @childmask: childmask of tmigr_cpu in the parent group
* @wakeup: Stores the first timer when the timer migration
* hierarchy is completely idle and remote expiry was done;
* is returned to timer code in the idle path and is only
* used in idle path.
* @cpuevt: CPU event which could be enqueued into the parent group
*/
struct tmigr_cpu {
raw_spinlock_t lock;
bool online;
bool idle;
bool remote;
struct tmigr_group *tmgroup;
u8 childmask;
u64 wakeup;
struct tmigr_event cpuevt;
};
/**
* union tmigr_state - state of tmigr_group
* @state: Combined version of the state - only used for atomic
* read/cmpxchg function
* @struct: Split version of the state - only use the struct members to
* update information to stay independent of endianness
*/
union tmigr_state {
u32 state;
/**
* struct - split state of tmigr_group
* @active: Contains each childmask bit of the active children
* @migrator: Contains childmask of the child which is migrator
* @seq: Sequence counter needs to be increased when an update
* to the tmigr_state is done. It prevents a race when
* updates in the child groups are propagated in changed
* order. Detailed information about the scenario is
* given in the documentation at the begin of
* timer_migration.c.
*/
struct {
u8 active;
u8 migrator;
u16 seq;
} __packed;
};
#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
extern void tmigr_handle_remote(void);
extern bool tmigr_requires_handle_remote(void);
extern void tmigr_cpu_activate(void);
extern u64 tmigr_cpu_deactivate(u64 nextevt);
extern u64 tmigr_cpu_new_timer(u64 nextevt);
extern u64 tmigr_quick_check(u64 nextevt);
#else
static inline void tmigr_handle_remote(void) { }
static inline bool tmigr_requires_handle_remote(void) { return false; }
static inline void tmigr_cpu_activate(void) { }
#endif
#endif
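
The union tmigr_state documentation above says the combined u32 is only for atomic read/cmpxchg, while updates go through the struct members so that they stay independent of endianness. A minimal userspace sketch of that pattern with C11 atomics follows. It is not the kernel implementation: union sketch_state and sketch_activate() are hypothetical names, and the policy "the first active child becomes migrator" is an assumption made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for union tmigr_state: three fields packed in 32 bits. */
union sketch_state {
	uint32_t state;			/* combined view, for atomic ops only */
	struct {
		uint8_t  active;	/* childmask bits of active children */
		uint8_t  migrator;	/* childmask of the migrator child */
		uint16_t seq;		/* bumped on every update */
	} __attribute__((packed));
};

/*
 * Hypothetical helper: mark @childmask active in @group_state and take the
 * migrator role if nobody holds it. Returns true if this child is the
 * migrator after the update.
 */
static bool sketch_activate(_Atomic uint32_t *group_state, uint8_t childmask)
{
	union sketch_state cur, next;

	cur.state = atomic_load(group_state);
	do {
		/* Modify a private copy through the struct members only. */
		next = cur;
		next.active |= childmask;
		if (!next.migrator)
			next.migrator = childmask;
		next.seq++;
		/* On failure cur.state is refreshed and the loop retries. */
	} while (!atomic_compare_exchange_weak(group_state, &cur.state,
					       next.state));

	return next.migrator == childmask;
}
```

Because all three fields live in one 32-bit word, a single compare-and-exchange publishes a consistent active/migrator pair together with the incremented sequence counter.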

@@ -2564,7 +2564,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
add_timer_on(timer, cpu);
} else {
if (likely(cpu == WORK_CPU_UNBOUND))
add_timer(timer);
add_timer_global(timer);
else
add_timer_on(timer, cpu);
}

@@ -567,7 +567,7 @@ then
torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 tsc=watchdog"
torture_set "clocksourcewd-1" tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 45s --configs TREE03 --kconfig "CONFIG_TEST_CLOCKSOURCE_WATCHDOG=y" --trust-make
torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 clocksource.max_cswd_read_retries=1 tsc=watchdog"
torture_bootargs="rcupdate.rcu_cpu_stall_suppress_at_boot=1 torture.disable_onoff_at_boot rcupdate.rcu_task_stall_timeout=30000 tsc=watchdog"
torture_set "clocksourcewd-2" tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 45s --configs TREE03 --kconfig "CONFIG_TEST_CLOCKSOURCE_WATCHDOG=y" --trust-make
# In case our work is already done...