linux

mirror of https://github.com/torvalds/linux synced 2024-10-06 19:34:19 +00:00

Author	SHA1	Message	Date
Frederic Weisbecker	19b344a91f	timers: Assert no next dyntick timer look-up while CPU is offline The next timer (re-)evaluation, with the purpose of entering/updating the dyntick mode, can happen from 3 sites and none of them are relevant while the CPU is offline: 1) The idle loop: a) From the quick check helping the cpuidle governor to heuristically predict the best C-state. b) While stopping the tick. But if the CPU is offline, the tick has been cancelled and there is consequently no need to further stop the tick. 2) Remote expiry: when a CPU remotely expires global timers on behalf of another CPU, the latter target's next timer is re-evaluated afterwards. However remote expîry doesn't happen on offline CPUs. 3) IRQ exit: on nohz_full mode, the tick is (re-)evaluated on IRQ exit. But full dynticks is disabled on offline CPUs. Therefore it is safe to assume that no next dyntick timer lookup can be performed on offline CPUs. Assert this expectation to report any surprise. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-17-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	500f8f9bce	tick: Assume timekeeping is correctly handed over upon last offline idle call The timekeeping duty is handed over from the outgoing CPU on stop machine, then the oneshot tick is stopped right after. Therefore it's guaranteed that the current CPU isn't the timekeeper upon its last call to idle. Besides, calling tick_nohz_idle_stop_tick() while the dying CPU goes into idle suggests that the tick is going to be stopped while it is actually stopped already from the appropriate CPU hotplug state. Remove the confusing call and the obsolete case handling and convert it to a sanity check that verifies the above assumption. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-16-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	3f69d04e14	tick: Shut down low-res tick from dying CPU The timekeeping duty is handed over from the outgoing CPU within stop machine. This works well if CONFIG_NO_HZ_COMMON=n or the tick is in high-res mode. However in low-res dynticks mode, the tick isn't cancelled until the clockevent is shut down, which can happen later. The tick may therefore fire again once IRQs are re-enabled on stop machine and until IRQs are disabled for good upon the last call to idle. That's so many opportunities for a timekeeper to go idle and the outgoing CPU to take over that duty. This is why tick_nohz_idle_stop_tick() is called one last time on idle if the CPU is seen offline: so that the timekeeping duty is handed over again in case the CPU has re-taken the duty. This means there are two timekeeping handovers on CPU down hotplug with different undocumented constraints and purposes: 1) A handover on stop machine for !dynticks \|\| highres. All online CPUs are guaranteed to be non-idle and the timekeeping duty can be safely handed-over. The hrtimer tick is cancelled so it is guaranteed that in dynticks mode the outgoing CPU won't take again the duty. 2) A handover on last idle call for dynticks && lowres. Setting the duty to TICK_DO_TIMER_NONE makes sure that a CPU will take over the timekeeping. Prepare for consolidating the handover to a single place (the first one) with shutting down the low-res tick as well from tick_cancel_sched_timer() as well. This will simplify the handover and unify the tick cancellation between high-res and low-res. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-15-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	7988e5ae2b	tick: Split nohz and highres features from nohz_mode The nohz mode field tells about low resolution nohz mode or high resolution nohz mode but it doesn't tell about high resolution non-nohz mode. In order to retrieve the latter state, tick_cancel_sched_timer() must fiddle with struct hrtimer's internals to guess if the tick has been initialized in high resolution. Move instead the nohz mode field information into the tick flags and provide two new bits: one to know if the tick is in nohz mode and another one to know if the tick is in high resolution. The combination of those two flags provides all the needed informations to determine which of the three tick modes is running. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-14-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	a478ffb2ae	tick: Move individual bit features to debuggable mask accesses The individual bitfields of struct tick_sched must be modified from IRQs disabled places, otherwise local modifications can race due to them sharing the same memory storage. The recent move of the "got_idle_tick" bitfield to its own storage shows that the use of these bitfields, as pretty as they look, can be as much error prone. In order to avoid future issues of the like and make sure that those bitfields are safely accessed, move those flags to an explicit mask along with a mutator function performing the basic IRQs disabled sanity check. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-13-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	3ce74f1a85	tick: Move got_idle_tick away from common flags tick_nohz_idle_got_tick() is called by cpuidle_reflect() within the idle loop with interrupts enabled. This function modifies the struct tick_sched's bitfield "got_idle_tick". However this bitfield is stored within the same mask as other bitfields that can be modified from interrupts. Fortunately so far it looks like the only race that can happen is while writing ->got_idle_tick to 0, an interrupt fires and writes the ->idle_active field to 0. It's then possible that the interrupted write to ->got_idle_tick writes back the old value of ->idle_active back to 1. However if that happens, the worst possible outcome is that the time spent between that interrupt and the upcoming call to tick_nohz_idle_exit() is accounted as idle, which is negligible quantity. Still all the bitfield writes within this struct tick_sched's shadow mask should be IRQ-safe. Therefore move this bitfield out to its own storage to avoid further suprises. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-12-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	d9b1865c86	tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode The full-nohz update function checks if the nohz mode is active before proceeding. It considers one exception though: if the tick is already stopped even though the nohz mode is inactive, it still moves on in order to update/restart the tick if needed. However in order for the tick to be stopped, the nohz_mode has to be either NOHZ_MODE_LOWRES or NOHZ_MODE_HIGHRES. Therefore it doesn't make sense to test if the tick is stopped before verifying NOHZ_MODE_INACTIVE mode. Remove the needless related condition. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-11-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	ef8969bb55	tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING The broadcast shutdown code is executed through a random explicit call within stop machine from the outgoing CPU. However the tick broadcast is a midware between the tick callback and the clocksource, therefore it makes more sense to shut it down after the tick callback and before the clocksource drivers. Move it instead to the common tick shutdown CPU hotplug state where related operations can be ordered from highest to lowest level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-10-frederic@kernel.org	2024-02-26 11:37:32 +01:00
Frederic Weisbecker	f04e51220a	tick: Move tick cancellation up to CPUHP_AP_TICK_DYING The tick hrtimer is cancelled right before hrtimers are migrated. This is done from the hrtimer subsystem even though it shouldn't know about its actual users. Move instead the tick hrtimer cancellation to the relevant CPU hotplug state that aims at centralizing high level tick shutdown operations so that the related flow is easy to follow. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-9-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3ad6eb0683	tick: Start centralizing tick related CPU hotplug operations During the CPU offlining process, the various timer tick features are shut down from scattered places, sometimes from teardown callbacks on stop machine, sometimes through explicit calls, sometimes from the control CPU after the CPU died. The reason why these shutdown operations are spread around is not always clear and it makes the tick lifecycle hard to follow. The tick should be shut down in order from highest to lowest level: On stop machine from the dying CPU (high-level): 1) Hand-over the timekeeping duty (tick_handover_do_timer()) 2) Cancel the tick implementation called by the clockevent callback (tick_cancel_sched_timer()) 3) Shutdown broadcasting (tick_offline_cpu() / tick_broadcast_offline()) On stop machine from the dying CPU (low-level): 4) Shutdown clockevents drivers (CPUHP_AP_*_TIMER_STARTING states) From the control CPU after the CPU died (low-level): 5) Shutdown/unregister/cleanup clockevents for the dead CPU (tick_cleanup_dead_cpu()) Instead the current order is 2, 4 (both from CPU hotplug states), then 1 and 3 through direct calls. This layout and order don't make much sense. The operations 1, 2, 3 should be gathered together and in order. Sort this situation with creating a new TICK shut-down CPU hotplug state and start with introducing the timekeeping duty hand-over there. The state must precede hrtimers migration because the tick hrtimer will be stopped from it in a further patch. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-8-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	60313c21c3	tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick() The tick sched structure is already cleared from tick_cancel_sched_timer(), so there is no need to clear that field again. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-7-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3650f49bfb	tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick() tick_nohz_stop_sched_tick() is only about NOHZ_full and not about dynticks-idle. Reflect that in the function name to avoid confusion. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-6-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	27dc08096c	tick: Use IS_ENABLED() whenever possible Avoid ifdeferry if it can be converted to IS_ENABLED() whenever possible Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-5-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Frederic Weisbecker	3aedb7fcd8	tick/sched: Remove useless oneshot ifdeffery tick-sched.c is only built when CONFIG_TICK_ONESHOT=y, which is selected only if CONFIG_NO_HZ_COMMON=y or CONFIG_HIGH_RES_TIMERS=y. Therefore the related ifdeferry in this file is needless and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-4-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Peng Liu	37263ba0c4	tick/nohz: Remove duplicate between lowres and highres handlers tick_nohz_lowres_handler() does the same work as tick_nohz_highres_handler() plus the clockevent device reprogramming, so make the former reuse the latter and rename it accordingly. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-3-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Peng Liu	ffb7e01c4e	tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer() The ts->sched_timer initialization work of tick_nohz_switch_to_nohz() is almost the same as that of tick_setup_sched_timer(), so adjust the latter to get it reused by tick_nohz_switch_to_nohz(). This also makes the low resolution mode sched_timer benefit from the tick skew boot option. Signed-off-by: Peng Liu <liupeng17@lenovo.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240225225508.11587-2-frederic@kernel.org	2024-02-26 11:37:31 +01:00
Costa Shulyupin	56c2cb1012	hrtimer: Select housekeeping CPU during migration During CPU-down hotplug, hrtimers may migrate to isolated CPUs, compromising CPU isolation. Address this issue by masking valid CPUs for hrtimers using housekeeping_cpumask(HK_TYPE_TIMER). Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240222200856.569036-1-costa.shul@redhat.com	2024-02-22 22:18:21 +01:00
Anna-Maria Behnsen	b2cf7507e1	timers: Always queue timers on the local CPU The timer pull model is in place so we can remove the heuristics which try to guess the best target CPU at enqueue/modification time. All non pinned timers are queued on the local CPU in the separate storage and eventually pulled at expiry time to a remote CPU. Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-21-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	36e40df35d	timer_migration: Add tracepoints The timer pull logic needs proper debugging aids. Add tracepoints so the hierarchical idle machinery can be diagnosed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240222103403.31923-1-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	7ee9887703	timers: Implement the hierarchical pull model Placing timers at enqueue time on a target CPU based on dubious heuristics does not make any sense: 1) Most timer wheel timers are canceled or rearmed before they expire. 2) The heuristics to predict which CPU will be busy when the timer expires are wrong by definition. So placing the timers at enqueue wastes precious cycles. The proper solution to this problem is to always queue the timers on the local CPU and allow the non pinned timers to be pulled onto a busy CPU at expiry time. Therefore split the timer storage into local pinned and global timers: Local pinned timers are always expired on the CPU on which they have been queued. Global timers can be expired on any CPU. As long as a CPU is busy it expires both local and global timers. When a CPU goes idle it arms for the first expiring local timer. If the first expiring pinned (local) timer is before the first expiring movable timer, then no action is required because the CPU will wake up before the first movable timer expires. If the first expiring movable timer is before the first expiring pinned (local) timer, then this timer is queued into an idle timerqueue and eventually expired by another active CPU. To avoid global locking the timerqueues are implemented as a hierarchy. The lowest level of the hierarchy holds the CPUs. The CPUs are associated to groups of 8, which are separated per node. If more than one CPU group exist, then a second level in the hierarchy collects the groups. Depending on the size of the system more than 2 levels are required. Each group has a "migrator" which checks the timerqueue during the tick for remote expirable timers. If the last CPU in a group goes idle it reports the first expiring event in the group up to the next group(s) in the hierarchy. If the last CPU goes idle it arms its timer for the first system wide expiring timer to ensure that no timer event is missed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240222103710.32582-1-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	57e95a5c41	timers: Introduce function to check timer base is_idle flag To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have a function that returns the value of the is_idle flag of the timer base to keep the hierarchy states during online in sync with timer base state. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-18-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Richard Cochran (linutronix GmbH)	4c532939aa	tick/sched: Split out jiffies update helper function The logic to get the time of the last jiffies update will be needed by the timer pull model as well. Move the code into a global function in anticipation of the new caller. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-17-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Anna-Maria Behnsen	89f01e10c9	timers: Check if timers base is handled already Due to the conversion of the NOHZ timer placement to a pull at expiry time model, the per CPU timer bases with non pinned timers are no longer handled only by the local CPU. In case a remote CPU already expires the non pinned timers base of the local CPU, nothing more needs to be done by the local CPU. A check at the begin of the expire timers routine is required, because timer base lock is dropped before executing the timer callback function. This is a preparatory work, but has no functional impact right now. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-16-anna-maria@linutronix.de	2024-02-22 17:52:32 +01:00
Richard Cochran (linutronix GmbH)	90f5df66c8	timers: Restructure internal locking Move the locking out from __run_timers() to the call sites, so the protected section can be extended at the call site. Preparatory work for changing the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-15-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	f73d9257ff	timers: Add get next timer interrupt functionality for remote CPUs To prepare for the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have functionality available getting the next timer interrupt on a remote CPU. Locking of the timer bases and getting the information for the next timer interrupt functionality is split into separate functions. This is required to be compliant with lock ordering when the new model is in place. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-14-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	70b4cf84f3	timers: Split out "get next timer interrupt" functionality The functionality for getting the next timer interrupt in get_next_timer_interrupt() is split into a separate function fetch_next_timer_interrupt() to be usable by other call sites. This is preparatory work for the conversion of the NOHZ timer placement to a pull at expiry time model. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-13-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	21927fc89e	timers: Retrieve next expiry of pinned/non-pinned timers separately For the conversion of the NOHZ timer placement to a pull at expiry time model it's required to have separate expiry times for the pinned and the non-pinned (movable) timers. Therefore struct timer_events is introduced. No functional change Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-12-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	83a665dc99	timers: Keep the pinned timers separate from the others Separate the storage space for pinned timers. Deferrable timers (doesn't matter if pinned or non pinned) are still enqueued into their own base. This is preparatory work for changing the NOHZ timer placement from a push at enqueue time to a pull at expiry time model. Originally-by: Richard Cochran (linutronix GmbH) <richardcochran@gmail.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-11-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	9f6a3c602c	timers: Split next timer interrupt logic Split the logic for getting next timer interrupt (no matter of recalculated or already stored in base->next_expiry) into a separate function named next_timer_interrupt(). Make it available to local call sites only. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-10-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	af68cb3fc7	timers: Simplify code in run_local_timers() The logic for raising a softirq the way it is implemented right now, is readable for two timer bases. When increasing the number of timer bases, code gets harder to read. With the introduction of the timer migration hierarchy, there will be three timer bases. Therefore restructure the code to use a loop. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-9-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	aae55e9fb8	timers: Make sure TIMER_PINNED flag is set in add_timer_on() When adding a timer to the timer wheel using add_timer_on(), it is an implicitly pinned timer. With the timer pull at expiry time model in place, the TIMER_PINNED flag is required to make sure timers end up in proper base. Set the TIMER_PINNED flag unconditionally when add_timer_on() is executed. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-8-anna-maria@linutronix.de	2024-02-22 17:52:31 +01:00
Anna-Maria Behnsen	8e7e247f64	timers: Introduce add_timer() variants which modify timer flags A timer might be used as a pinned timer (using add_timer_on()) and later on as non-pinned timer using add_timer(). When the "NOHZ timer pull at expiry model" is in place, the TIMER_PINNED flag is required to be used whenever a timer needs to expire on a dedicated CPU. Otherwise the flag must not be set if expiration on a dedicated CPU is not required. add_timer_on()'s behavior will be changed during the preparation patches for the "NOHZ timer pull at expiry model" to unconditionally set the TIMER_PINNED flag. To be able to clear/ set the flag when queueing a timer, two variants of add_timer() are introduced. This is a preparatory step and has no functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-6-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	73129cf4b6	timers: Optimization for timer_base_try_to_set_idle() When tick is stopped also the timer base is_idle flag is set. When reentering timer_base_try_to_set_idle() with the tick stopped, there is no need to check whether the timer base needs to be set idle again. When a timer was enqueued in the meantime, this is already handled by the tick_nohz_next_event() call which was executed before tick_nohz_stop_tick(). Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-5-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	e2e1d724e9	timers: Move marking timer bases idle into tick_nohz_stop_tick() The timer base is marked idle when get_next_timer_interrupt() is executed. But the decision whether the tick will be stopped and whether the system is able to go idle is done later. When the timer bases is marked idle and a new first timer is enqueued remote an IPI is raised. Even if it is not required because the tick is not stopped and the timer base is evaluated again at the next tick. To prevent this, the timer base is marked idle in tick_nohz_stop_tick() and get_next_timer_interrupt() is streamlined by only looking for the next timer interrupt. All other work is postponed to timer_base_try_to_set_idle() which is called by tick_nohz_stop_tick(). timer_base_try_to_set_idle() never resets timer_base::is_idle state. This is done when the tick is restarted via tick_nohz_restart_sched_tick(). With this, tick_sched::tick_stopped and timer_base::is_idle are always in sync. So there is no longer the need to execute timer_clear_idle() in tick_nohz_idle_retain_tick(). This was required before, as tick_nohz_next_event() set timer_base::is_idle even if the tick would not be stopped. So timer_clear_idle() is only executed, when timer base is idle. So the check whether timer base is idle, is now no longer required as well. While at it fix some nearby whitespace damage as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-4-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	39ed699fb6	timers: Split out get next timer interrupt Split out get_next_timer_interrupt() to be able to extend it and make it reusable for other call sites. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-3-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Anna-Maria Behnsen	bebed6649e	timers: Restructure get_next_timer_interrupt() get_next_timer_interrupt() contains two parts for the next timer interrupt calculation. Those two parts are separated by forwarding the base clock. But the second part does not depend on the forwarded base clock. Therefore restructure get_next_timer_interrupt() to keep things together which belong together. No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240221090548.36600-2-anna-maria@linutronix.de	2024-02-22 17:52:30 +01:00
Feng Tang	2ed08e4bc5	clocksource: Scale the watchdog read retries automatically On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled during boot time on about one out of 120 boot attempts: clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns, wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable tsc: Marking TSC unstable due to clocksource watchdog TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152) clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896. clocksource: Switched to clocksource hpet The reason is that for platform with a large number of CPUs, there are sporadic big or huge read latencies while reading the watchog/clocksource during boot or when system is under stress work load, and the frequency and maximum value of the latency goes up with the number of online CPUs. The cCurrent code already has logic to detect and filter such high latency case by reading the watchdog twice and checking the two deltas. Due to the randomness of the latency, there is a low probabilty that the first delta (latency) is big, but the second delta is small and looks valid. The watchdog code retries the readouts by default twice, which is not necessarily sufficient for systems with a large number of CPUs. There is a command line parameter 'max_cswd_read_retries' which allows to increase the number of retries, but that's not user friendly as it needs to be tweaked per system. As the number of required retries is proportional to the number of online CPUs, this parameter can be calculated at runtime. Scale and enlarge the number of retries according to the number of online CPUs and remove the command line parameter completely. [ tglx: Massaged change log and comments ] Signed-off-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jin Wang <jin1.wang@intel.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com	2024-02-21 12:00:42 +01:00
David Gow	e0a1284b29	time/kunit: Use correct format specifier 'days' is a s64 (from div_s64), and so should use a %lld specifier. This was found by extending KUnit's assertion macros to use gcc's __printf attribute. Fixes: `2760105516` ("time: Improve performance of time64_to_tm()") Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240221092728.1281499-5-davidgow@google.com	2024-02-21 12:00:42 +01:00
Ingo Molnar	94bf12af35	Merge tag 'v6.8-rc5' into timers/core, to resolve conflict There's a conflict between this recent upstream fix: `dad6a09f31` ("hrtimer: Report offline hrtimer enqueue") and a pending commit in the timers tree: `1a4729ecaf` ("hrtimers: Move hrtimer base related definitions into hrtimer_defs.h") Resolve it by applying the upstream fix to the new <linux/hrtimer_defs.h> header. Conflict: include/linux/hrtimer.h Semantic conflict: include/linux/hrtimer_defs.h Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-02-19 22:27:57 +01:00
Peter Hilber	14274d0bd3	timekeeping: Fix cross-timestamp interpolation for non-x86 So far, get_device_system_crosststamp() unconditionally passes system_counterval.cycles to timekeeping_cycles_to_ns(). But when interpolating system time (do_interp == true), system_counterval.cycles is before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns() expectations. On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw and xtstamp->sys_realtime are then set to the last update time, as implicitly expected by adjust_historical_crosststamp(). On other architectures, the resulting nonsense xtstamp->sys_monoraw and xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in adjust_historical_crosststamp(). Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from the last update time when interpolating, by using the local variable "cycles". The local variable already has the right value when interpolating, unlike system_counterval.cycles. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Peter Hilber	87a4113088	timekeeping: Fix cross-timestamp interpolation corner case decision The cycle_between() helper checks if parameter test is in the open interval (before, after). Colloquially speaking, this also applies to the counter wrap-around special case before > after. get_device_system_crosststamp() currently uses cycle_between() at the first call site to decide whether to interpolate for older counter readings. get_device_system_crosststamp() has the following problem with cycle_between() testing against an open interval: Assume that, by chance, cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for brevity). Then, cycle_between() at the first call site, with effective argument values cycle_between(cycle_last, cycles, now), returns false, enabling interpolation. During interpolation, get_device_system_crosststamp() will then call cycle_between() at the second call site (if a history_begin was supplied). The effective argument values are cycle_between(history_begin->cycles, cycles, cycles), since system_counterval.cycles == interval_start == cycles, per the assumption. Due to the test against the open interval, cycle_between() returns false again. This causes get_device_system_crosststamp() to return -EINVAL. This failure should be avoided, since get_device_system_crosststamp() works both when cycles follows cycle_last (no interpolation), and when cycles precedes cycle_last (interpolation). For the case cycles == cycle_last, interpolation is actually unneeded. Fix this by changing cycle_between() into timestamp_in_interval(), which now checks against the closed interval, rather than the open interval. This changes the get_device_system_crosststamp() behavior for three corner cases: 1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last, fixing the problem described above. 2. At the first timestamp_in_interval() call site, cycles == now no longer causes failure. 3. At the second timestamp_in_interval() call site, history_begin->cycles == system_counterval.cycles no longer causes failure. adjust_historical_crosststamp() also works for this corner case, where partial_history_cycles == total_history_cycles. These behavioral changes should not cause any problems. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Peter Hilber	84dccadd3e	timekeeping: Fix cross-timestamp interpolation on counter wrap cycle_between() decides whether get_device_system_crosststamp() will interpolate for older counter readings. cycle_between() yields wrong results for a counter wrap-around where after < before < test, and for the case after < test < before. Fix the comparison logic. Fixes: `2c756feb18` ("time: Add history to cross timestamp interface supporting slower devices") Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20231218073849.35294-2-peter.hilber@opensynergy.com	2024-02-19 12:18:51 +01:00
Anna-Maria Behnsen	892abd3571	timers: Add struct member description for timer_base timer_base struct lacks description of struct members. Important struct member information is sprinkled in comments or in code all over the place. Collect information and write struct description to keep track of most important information in a single place. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-5-anna-maria@linutronix.de	2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen	f365d05506	tick/sched: Add function description for tick_nohz_next_event() The return value of tick_nohz_next_event() is not obvious at the first glance. Add a kernel-doc compatible function description which also covers return values. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-4-anna-maria@linutronix.de	2024-02-19 09:38:00 +01:00
Anna-Maria Behnsen	ca2768bbf5	hrtimers: Update formatting of documentation Documentation of functions lacks the annotations which are used by kernel-doc and *.rst to make appearance in rendered documents more user-friendly. Use those annotations to improve user-friendliness. While at it prevent duplication of comments and use a reference instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240123164702.55612-3-anna-maria@linutronix.de	2024-02-19 09:37:59 +01:00
Ricardo B. Marliere	49f1ff50d4	clockevents: Make clockevents_subsys const Now that the driver core can properly handle constant struct bus_type, move the clockevents_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Ricardo B. Marliere	2bc7fc24f9	clocksource: Make clocksource_subsys const Now that the driver core can properly handle constant struct bus_type, move the clocksource_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Frederic Weisbecker	dad6a09f31	hrtimer: Report offline hrtimer enqueue The hrtimers migration on CPU-down hotplug process has been moved earlier, before the CPU actually goes to die. This leaves a small window of opportunity to queue an hrtimer in a blind spot, leaving it ignored. For example a practical case has been reported with RCU waking up a SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that way a sched/rt timer to the local offline CPU. Make sure such situations never go unnoticed and warn when that happens. Fixes: `5c0930ccaa` ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com	2024-02-06 10:56:35 +01:00
Tim Chen	9a574ea906	tick/sched: Preserve number of idle sleeps across CPU hotplug events Commit `71fee48f` ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug") preserved total idle sleep time and iowait sleeptime across CPU hotplug events. Similar reasoning applies to the number of idle calls and idle sleeps to get the proper average of sleep time per idle invocation. Preserve those fields too. Fixes: `71fee48f` ("tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug") Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240122233534.3094238-1-tim.c.chen@linux.intel.com	2024-01-25 09:52:40 +01:00
Jiri Wiesner	6446495535	clocksource: Skip watchdog check for large watchdog intervals There have been reports of the watchdog marking clocksources unstable on machines with 8 NUMA nodes: clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large: clocksource: 'hpet' wd_nsec: 14523447520 clocksource: 'tsc' cs_nsec: 14524115132 The measured clocksource skew - the absolute difference between cs_nsec and wd_nsec - was 668 microseconds: cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612 The kernel used 200 microseconds for the uncertainty_margin of both the clocksource and watchdog, resulting in a threshold of 400 microseconds (the md variable). Both the cs_nsec and the wd_nsec value indicate that the readout interval was circa 14.5 seconds. The observed behaviour is that watchdog checks failed for large readout intervals on 8 NUMA node machines. This indicates that the size of the skew was directly proportinal to the length of the readout interval on those machines. The measured clocksource skew, 668 microseconds, was evaluated against a threshold (the md variable) that is suited for readout intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second. The intention of `2e27e793e2` ("clocksource: Reduce clocksource-skew threshold") was to tighten the threshold for evaluating skew and set the lower bound for the uncertainty_margin of clocksources to twice WATCHDOG_MAX_SKEW. Later in `c37e85c135` ("clocksource: Loosen clocksource watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to 125 microseconds to fit the limit of NTP, which is able to use a clocksource that suffers from up to 500 microseconds of skew per second. Both the TSC and the HPET use default uncertainty_margin. When the readout interval gets stretched the default uncertainty_margin is no longer a suitable lower bound for evaluating skew - it imposes a limit that is far stricter than the skew with which NTP can deal. The root causes of the skew being directly proportinal to the length of the readout interval are: * the inaccuracy of the shift/mult pairs of clocksources and the watchdog * the conversion to nanoseconds is imprecise for large readout intervals Prevent this by skipping the current watchdog check if the readout interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin (of the TSC and HPET) corresponds to a limit on clocksource skew of 250 ppm (microseconds of skew per second). To keep the limit imposed by NTP (500 microseconds of skew per second) for all possible readout intervals, the margins would have to be scaled so that the threshold value is proportional to the length of the actual readout interval. As for why the readout interval may get stretched: Since the watchdog is executed in softirq context the expiration of the watchdog timer can get severely delayed on account of a ksoftirqd thread not getting to run in a timely manner. Surely, a system with such belated softirq execution is not working well and the scheduling issue should be looked into but the clocksource watchdog should be able to deal with it accordingly. Fixes: `2e27e793e2` ("clocksource: Reduce clocksource-skew threshold") Suggested-by: Feng Tang <feng.tang@intel.com> Signed-off-by: Jiri Wiesner <jwiesner@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Feng Tang <feng.tang@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240122172350.GA740@incl	2024-01-25 09:13:16 +01:00

1 2 3 4 5 ...

2398 commits