Commit graph

35098 commits

Author SHA1 Message Date
Roman Gushchin 844f157f6c bpf: Eliminate rlimit-based memory accounting for devmap maps
Do not use rlimit-based memory accounting for devmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-23-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 087b0d39fe bpf: Eliminate rlimit-based memory accounting for cgroup storage maps
Do not use rlimit-based memory accounting for cgroup storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-22-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 711cabaf14 bpf: Eliminate rlimit-based memory accounting for cpumap maps
Do not use rlimit-based memory accounting for cpumap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-21-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin f043733f31 bpf: Eliminate rlimit-based memory accounting for bpf_struct_ops maps
Do not use rlimit-based memory accounting for bpf_struct_ops maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-20-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin 1bc5975613 bpf: Eliminate rlimit-based memory accounting for arraymap maps
Do not use rlimit-based memory accounting for arraymap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-19-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin e9aae8beba bpf: Memcg-based memory accounting for bpf local storage maps
Account memory used by bpf local storage maps:
per-socket, per-inode and per-task storages.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-16-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin be4035c734 bpf: Memcg-based memory accounting for bpf ringbuffer
Enable the memcg-based memory accounting for the memory used by
the bpf ringbuffer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-15-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 353e7af4bf bpf: Memcg-based memory accounting for lpm_trie maps
Include lpm trie and lpm trie node objects into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-14-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 881456811a bpf: Refine memcg-based memory accounting for hashtab maps
Include percpu objects and the size of map metadata into the
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-13-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 1440290adf bpf: Refine memcg-based memory accounting for devmap maps
Include map metadata and the node size (struct bpf_dtab_netdev)
into the accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-12-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 3a61c7c58b bpf: Memcg-based memory accounting for cgroup storage maps
Account memory used by cgroup storage maps including metadata
structures.

Account the percpu memory for the percpu flavor of cgroup storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-11-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin e88cc05b61 bpf: Refine memcg-based memory accounting for cpumap maps
Include metadata and percpu data into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-10-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 6d192c7938 bpf: Refine memcg-based memory accounting for arraymap maps
Include percpu arrays and auxiliary data into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-9-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin d5299b67dd bpf: Memcg-based memory accounting for bpf maps
This patch enables memcg-based memory accounting for memory allocated
by __bpf_map_area_alloc(), which is used by many types of bpf maps for
large initial memory allocations.

Please note, that __bpf_map_area_alloc() should not be used outside of
map creation paths without setting the active memory cgroup to the
map's memory cgroup.

Following patches in the series will refine the accounting for
some of the map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-8-guro@fb.com
2020-12-02 18:32:45 -08:00
Roman Gushchin 48edc1f78a bpf: Prepare for memcg-based memory accounting for bpf maps
Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and commit b87d8cefe4
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To make the ownership model simple and consistent, when the map
is created, the memory cgroup of the current process is recorded.
All subsequent allocations related to the bpf map are charged to
the same memory cgroup. It includes allocations made by any processes
(even if they do belong to a different cgroup) and from interrupts.

This commit introduces 3 new helpers, which will be used by following
commits to enable the accounting of bpf maps memory:
  - bpf_map_kmalloc_node()
  - bpf_map_kzalloc()
  - bpf_map_alloc_percpu()

They are wrapping popular memory allocation functions. They set
the active memory cgroup to the map's memory cgroup and add
__GFP_ACCOUNT to the passed gfp flags. Then they call into
the corresponding memory allocation function and restore
the original active memory cgroup.

These helpers are supposed to use everywhere except the map creation
path. During the map creation when the map structure is allocated by
itself, it cannot be passed to those helpers. In those cases default
memory allocation function will be used with the __GFP_ACCOUNT flag.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-7-guro@fb.com
2020-12-02 18:32:34 -08:00
Roman Gushchin ddf8503c7c bpf: Memcg-based memory accounting for bpf progs
Include memory used by bpf programs into the memcg-based accounting.
This includes the memory used by programs itself, auxiliary data,
statistics and bpf line info. A memory cgroup containing the
process which loads the program is getting charged.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-6-guro@fb.com
2020-12-02 18:28:06 -08:00
Roman Gushchin bcfe06bf26 mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6.

Currently a non-slab kernel page which has been charged to a memory cgroup
can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
flag is defined as a page type (like buddy, offline, etc), so it takes a
bit from a page->mapped counter.  Pages with a type set can't be mapped to
userspace.

But in general the kmemcg flag has nothing to do with mapping to
userspace.  It only means that the page has been accounted by the page
allocator, so it has to be properly uncharged on release.

Some bpf maps are mapping the vmalloc-based memory to userspace, and their
memory can't be accounted because of this implementation detail.

This patchset removes this limitation by moving the PageKmemcg flag into
one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
adds several checks and removes a couple of obsolete functions.  As the
result the code became more robust with fewer open-coded bit tricks.

This patch (of 4):

Currently there are many open-coded reads of the page->mem_cgroup pointer,
as well as a couple of read helpers, which are barely used.

It creates an obstacle on a way to reuse some bits of the pointer for
storing additional bits of information.  In fact, we already do this for
slab pages, where the last bit indicates that a pointer has an attached
vector of objcg pointers instead of a regular memcg pointer.

This commits uses 2 existing helpers and introduces a new helper to
converts all read sides to calls of these helpers:
  struct mem_cgroup *page_memcg(struct page *page);
  struct mem_cgroup *page_memcg_rcu(struct page *page);
  struct mem_cgroup *page_memcg_check(struct page *page);

page_memcg_check() is intended to be used in cases when the page can be a
slab page and have a memcg pointer pointing at objcg vector.  It does
check the lowest bit, and if set, returns NULL.  page_memcg() contains a
VM_BUG_ON_PAGE() check for the page not being a slab page.

To make sure nobody uses a direct access, struct page's
mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
2020-12-02 18:28:05 -08:00
Frederic Weisbecker d14ce74f1f irq: Call tick_irq_enter() inside HARDIRQ_OFFSET
Now that account_hardirq_enter() is called after HARDIRQ_OFFSET has
been incremented, there is nothing left that prevents us from also
moving tick_irq_enter() after HARDIRQ_OFFSET is incremented.

The desired outcome is to remove the nasty hack that prevents softirqs
from being raised through ksoftirqd instead of the hardirq bottom half.
Also tick_irq_enter() then becomes appropriately covered by lockdep.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-6-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker d3759e7184 irqtime: Move irqtime entry accounting after irq offset incrementation
IRQ time entry is currently accounted before HARDIRQ_OFFSET or
SOFTIRQ_OFFSET are incremented. This is convenient to decide to which
index the cputime to account is dispatched.

Unfortunately it prevents tick_irq_enter() from being called under
HARDIRQ_OFFSET because tick_irq_enter() has to be called before the IRQ
entry accounting due to the necessary clock catch up. As a result we
don't benefit from appropriate lockdep coverage on tick_irq_enter().

To prepare for fixing this, move the IRQ entry cputime accounting after
the preempt offset is incremented. This requires the cputime dispatch
code to handle the extra offset.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201202115732.27827-5-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker 8a6a5920d3 sched/vtime: Consolidate IRQ time accounting
The 3 architectures implementing CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
all have their own version of irq time accounting that dispatch the
cputime to the appropriate index: hardirq, softirq, system, idle,
guest... from an all-in-one function.

Instead of having these ad-hoc versions, move the cputime destination
dispatch decision to the core code and leave only the actual per-index
cputime accounting to the architecture.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-4-frederic@kernel.org
2020-12-02 20:20:05 +01:00
Frederic Weisbecker 2b91ec9f55 s390/vtime: Use the generic IRQ entry accounting
s390 has its own version of IRQ entry accounting because it doesn't
account the idle time the same way the other architectures do. Only
the actual idle sleep time is accounted as idle time, the rest of the
idle task execution is accounted as system time.

Make the generic IRQ entry accounting aware of architectures that have
their own way of accounting idle time and convert s390 to use it.

This prepares s390 to get involved in further consolidations of IRQ
time accounting.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-3-frederic@kernel.org
2020-12-02 20:20:04 +01:00
Frederic Weisbecker 7197688b20 sched/cputime: Remove symbol exports from IRQ time accounting
account_irq_enter_time() and account_irq_exit_time() are not called
from modules. EXPORT_SYMBOL_GPL() can be safely removed from the IRQ
cputime accounting functions called from there.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201202115732.27827-2-frederic@kernel.org
2020-12-02 20:20:04 +01:00
Sven Schnelle c6156e1da6 entry: Add syscall_exit_to_user_mode_work()
This is the same as syscall_exit_to_user_mode() but without calling
exit_to_user_mode(). This can be used if there is an architectural reason
to avoid the combo function, e.g. restarting a syscall without returning to
userspace. Before returning to user space the caller has to invoke
exit_to_user_mode().

[ tglx: Amended comments ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-6-svens@linux.ibm.com
2020-12-02 15:07:58 +01:00
Sven Schnelle 310de1a678 entry: Add exit_to_user_mode() wrapper
Called from architecture specific code when syscall_exit_to_user_mode() is
not suitable. It simply calls __exit_to_user_mode().

This way __exit_to_user_mode() can still be inlined because it is declared
static __always_inline.

[ tglx: Amended comments and moved it to a different place in the header ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-5-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle 96e2fbccd0 entry_Add_enter_from_user_mode_wrapper
To be called from architecture specific code if the combo interfaces are
not suitable. It simply calls __enter_from_user_mode(). This way
__enter_from_user_mode will still be inlined because it is declared static
__always_inline.

[ tglx: Amend comments and move it to a different location in the header ]

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-4-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle bb793562f0 entry: Rename exit_to_user_mode()
In order to make this function publicly available rename it so it can still
be inlined. An additional exit_to_user_mode() function will be added with
a later commit.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-3-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Sven Schnelle 6666bb714f entry: Rename enter_from_user_mode()
In order to make this function publicly available rename it so it can still
be inlined. An additional enter_from_user_mode() function will be added with
a later commit.

Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201201142755.31931-2-svens@linux.ibm.com
2020-12-02 15:07:57 +01:00
Gabriel Krisman Bertazi 11894468e3 entry: Support Syscall User Dispatch on common syscall entry
Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place.  In addition, either the syscall is
dispatched back to userspace, in which case there is no resource for to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, the on_syscall_dispatch() examines context to skip the
tracepoint, audit and other work.

[ tglx: Add a comment on the exit side ]

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-5-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Gabriel Krisman Bertazi 1446e1df9e kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com
2020-12-02 15:07:56 +01:00
Linus Torvalds ef6900acc8 Tracing fixes for 5.10-rc6
- Use correct timestamp variable for ring buffer write stamp update
  - Fix up before stamp and write stamp when crossing ring buffer sub
    buffers
  - Keep a zero delta in ring buffer in slow path if cmpxchg fails
  - Fix trace_printk static buffer for archs that care
  - Fix ftrace record accounting for ftrace ops with trampolines
  - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
  - Remove WARN_ON in hwlat tracer that triggers on something that is OK
  - Make "my_tramp" trampoline in ftrace direct sample code global
  - Fixes in the bootconfig tool for better alignment management
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCX8ZzghQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qg0JAQCII1bDQyF3APLlNFRqfHf3bTo7Zl5z
 WaUd1Cd7JkY+WAD/eF1dWjN0JRtfU+oRlk6UZ4oNmp8WMJvQ7oV26ub2egE=
 =lts8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Use correct timestamp variable for ring buffer write stamp update

 - Fix up before stamp and write stamp when crossing ring buffer sub
   buffers

 - Keep a zero delta in ring buffer in slow path if cmpxchg fails

 - Fix trace_printk static buffer for archs that care

 - Fix ftrace record accounting for ftrace ops with trampolines

 - Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency

 - Remove WARN_ON in hwlat tracer that triggers on something that is OK

 - Make "my_tramp" trampoline in ftrace direct sample code global

 - Fixes in the bootconfig tool for better alignment management

* tag 'trace-v5.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Always check to put back before stamp when crossing pages
  ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
  ftrace: Fix updating FTRACE_FL_TRAMP
  tracing: Fix alignment of static buffer
  tracing: Remove WARN_ON in start_thread()
  samples/ftrace: Mark my_tramp[12]? global
  ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
  ring-buffer: Update write stamp with the correct ts
  docs: bootconfig: Update file format on initrd image
  tools/bootconfig: Align the bootconfig applied initrd image size to 4
  tools/bootconfig: Fix to check the write failure correctly
  tools/bootconfig: Fix errno reference after printf()
2020-12-01 15:30:18 -08:00
Sami Tolvanen a2abe7cbd8 scs: switch to vmapped shadow stacks
The kernel currently uses kmem_cache to allocate shadow call stacks,
which means an overflows may not be immediately detected and can
potentially result in another task's shadow stack to be overwritten.

This change switches SCS to use virtually mapped shadow stacks for
tasks, which increases shadow stack size to a full page and provides
more robust overflow detection, similarly to VMAP_STACK.

Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20201130233442.2562064-2-samitolvanen@google.com
Signed-off-by: Will Deacon <will@kernel.org>
2020-12-01 10:30:28 +00:00
Steven Rostedt (VMware) 68e10d5ff5 ring-buffer: Always check to put back before stamp when crossing pages
The current ring buffer logic checks to see if the updating of the event
buffer was interrupted, and if it is, it will try to fix up the before stamp
with the write stamp to make them equal again. This logic is flawed, because
if it is not interrupted, the two are guaranteed to be different, as the
current event just updated the before stamp before allocation. This
guarantees that the next event (this one or another interrupting one) will
think it interrupted the time updates of a previous event and inject an
absolute time stamp to compensate.

The correct logic is to always update the timestamps when traversing to a
new sub buffer.

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 23:21:51 -05:00
Naveen N. Rao 49a962c075 ftrace: Fix DYNAMIC_FTRACE_WITH_DIRECT_CALLS dependency
DYNAMIC_FTRACE_WITH_DIRECT_CALLS should depend on
DYNAMIC_FTRACE_WITH_REGS since we need ftrace_regs_caller().

Link: https://lkml.kernel.org/r/fc4b257ea8689a36f086d2389a9ed989496ca63a.1606412433.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: 763e34e74b ("ftrace: Add register_ftrace_direct()")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:08 -05:00
Naveen N. Rao 4c75b0ff4e ftrace: Fix updating FTRACE_FL_TRAMP
On powerpc, kprobe-direct.tc triggered FTRACE_WARN_ON() in
ftrace_get_addr_new() followed by the below message:
  Bad trampoline accounting at: 000000004222522f (wake_up_process+0xc/0x20) (f0000001)

The set of steps leading to this involved:
- modprobe ftrace-direct-too
- enable_probe
- modprobe ftrace-direct
- rmmod ftrace-direct <-- trigger

The problem turned out to be that we were not updating flags in the
ftrace record properly. From the above message about the trampoline
accounting being bad, it can be seen that the ftrace record still has
FTRACE_FL_TRAMP set though ftrace-direct module is going away. This
happens because we are checking if any ftrace_ops has the
FTRACE_FL_TRAMP flag set _before_ updating the filter hash.

The fix for this is to look for any _other_ ftrace_ops that also needs
FTRACE_FL_TRAMP.

Link: https://lkml.kernel.org/r/56c113aa9c3e10c19144a36d9684c7882bf09af5.1606412433.git.naveen.n.rao@linux.vnet.ibm.com

Cc: stable@vger.kernel.org
Fixes: a124692b69 ("ftrace: Enable trampoline when rec count returns back to one")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Minchan Kim 8fa655a3a0 tracing: Fix alignment of static buffer
With 5.9 kernel on ARM64, I found ftrace_dump output was broken but
it had no problem with normal output "cat /sys/kernel/debug/tracing/trace".

With investigation, it seems coping the data into temporal buffer seems to
break the align binary printf expects if the static buffer is not aligned
with 4-byte. IIUC, get_arg in bstr_printf expects that args has already
right align to be decoded and seq_buf_bprintf says ``the arguments are saved
in a 32bit word array that is defined by the format string constraints``.
So if we don't keep the align under copy to temporal buffer, the output
will be broken by shifting some bytes.

This patch fixes it.

Link: https://lkml.kernel.org/r/20201125225654.1618966-1-minchan@kernel.org

Cc: <stable@vger.kernel.org>
Fixes: 8e99cf91b9 ("tracing: Do not allocate buffer in trace_find_next_entry() in atomic")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Vasily Averin 310e3a4b5a tracing: Remove WARN_ON in start_thread()
This patch reverts commit 978defee11 ("tracing: Do a WARN_ON()
 if start_thread() in hwlat is called when thread exists")

.start hook can be legally called several times if according
tracer is stopped

screen window 1
[root@localhost ~]# echo 1 > /sys/kernel/tracing/events/kmem/kfree/enable
[root@localhost ~]# echo 1 > /sys/kernel/tracing/options/pause-on-trace
[root@localhost ~]# less -F /sys/kernel/tracing/trace

screen window 2
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo hwlat >  /sys/kernel/debug/tracing/current_tracer
[root@localhost ~]# echo 1 > /sys/kernel/debug/tracing/tracing_on
[root@localhost ~]# cat /sys/kernel/debug/tracing/tracing_on
0
[root@localhost ~]# echo 2 > /sys/kernel/debug/tracing/tracing_on

triggers warning in dmesg:
WARNING: CPU: 3 PID: 1403 at kernel/trace/trace_hwlat.c:371 hwlat_tracer_start+0xc9/0xd0

Link: https://lkml.kernel.org/r/bd4d3e70-400d-9c82-7b73-a2d695e86b58@virtuozzo.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 978defee11 ("tracing: Do a WARN_ON() if start_thread() in hwlat is called when thread exists")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 21:43:07 -05:00
Andrea Righi 8785f51a17 ring-buffer: Set the right timestamp in the slow path of __rb_reserve_next()
In the slow path of __rb_reserve_next() a nested event(s) can happen
between evaluating the timestamp delta of the current event and updating
write_stamp via local_cmpxchg(); in this case the delta is not valid
anymore and it should be set to 0 (same timestamp as the interrupting
event), since the event that we are currently processing is not the last
event in the buffer.

Link: https://lkml.kernel.org/r/X8IVJcp1gRE+FJCJ@xps-13-7390

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lwn.net/Articles/831207
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:22:05 -05:00
Steven Rostedt (VMware) 55ea4cf403 ring-buffer: Update write stamp with the correct ts
The write stamp, used to calculate deltas between events, was updated with
the stale "ts" value in the "info" structure, and not with the updated "ts"
variable. This caused the deltas between events to be inaccurate, and when
crossing into a new sub buffer, had time go backwards.

Link: https://lkml.kernel.org/r/20201124223917.795844-1-elavila@google.com

Cc: stable@vger.kernel.org
Fixes: a389d86f7f ("ring-buffer: Have nested events still record running time stamp")
Reported-by: "J. Avila" <elavila@google.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Tested-by: Will McVicker <willmcvicker@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-11-30 15:21:31 -05:00
Marc Zyngier 4615fbc378 genirq/irqdomain: Don't try to free an interrupt that has no mapping
When an interrupt allocation fails for N interrupts, it is pretty
common for the error handling code to free the same number of interrupts,
no matter how many interrupts have actually been allocated.

This may result in the domain freeing code to be unexpectedly called
for interrupts that have no mapping in that domain. Things end pretty
badly.

Instead, add some checks to irq_domain_free_irqs_hierarchy() to make sure
that thiss does not follow the hierarchy if no mapping exists for a given
interrupt.

Fixes: 6a6544e520 ("genirq/irqdomain: Remove auto-recursive hierarchy support")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201129135551.396777-1-maz@kernel.org
2020-11-30 14:50:21 +01:00
Laurent Vivier bb4c6910c8 genirq/irqdomain: Add an irq_create_mapping_affinity() function
There is currently no way to convey the affinity of an interrupt
via irq_create_mapping(), which creates issues for devices that
expect that affinity to be managed by the kernel.

In order to sort this out, rename irq_create_mapping() to
irq_create_mapping_affinity() with an additional affinity parameter that
can be passed down to irq_domain_alloc_descs().

irq_create_mapping() is re-implemented as a wrapper around
irq_create_mapping_affinity().

No functional change.

Fixes: e75eafb9b0 ("genirq/msi: Switch to new irq spreading infrastructure")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201126082852.1178497-2-lvivier@redhat.com
2020-11-30 12:21:31 +01:00
Linus Torvalds f91a3aa6bc Yet two more places which invoke tracing from RCU disabled regions in the
idle path. Similar to the entry path the low level idle functions have to
 be non-instrumentable.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/DpAUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXSLD/9klc0YimnEnROW6Q5Svb2IcyIutmXF
 bOIY1bYYoKILOBj3wyvDUhmdMuq5zh7H9yG11hO8MaVVWVQcLcOMLdHTYm9dcdmF
 xQk33+xqjuhRShB+nEmC9ayYtWogtH6W6uZ6WDtF9ZltMKU85n5ddGJ/Fvo+HoCb
 NbOdHGJdJ3/3ZCeHnxOnxM+5/GwjkBuccTV/tXmb3yXrfU9DBySyQ4/UchcpF43w
 LcEb0kiQbpZsBTByKJOQV8+RR654S0sILlvRwVXpmj94vrgGwhlVk1/9rz7tkOhF
 ksoo1mTVu75LMt22G/hXxE63787yRvFdHjapf0+kCOAuhl992NK+xlGDH8o9DXcu
 9y73D4bI0HnDFs20w6vs20iLvxECJiYHJqlgR5ZwFUToceaNgtiYr8kzuD7Zbae1
 KG2E7BuNSwHWMtf97fGn44GZknPEOaKdDn4Wv6/bvKHxLm77qe11RKF70Stcz2AI
 am13KmQzzsHGF5qNWwpElRUxSdxfJMR66RnOdTQULGrRedaZTFol/y2pnVzTSe3k
 SZnlpL5kE7y92UYDogPb5wWA7b+YkJN0OdSkRFy1FH26ZG8E4M7ZJ2tql5Sw7pGM
 lsTjXpAUphnK5rz7QcYE8KAZWj//fIAcElIrvdklVcBnS3IqjfksYW27B64133vx
 cT1B/lA1PHXj6Q==
 =raED
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two more places which invoke tracing from RCU disabled regions in the
  idle path.

  Similar to the entry path the low level idle functions have to be
  non-instrumentable"

* tag 'locking-urgent-2020-11-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  intel_idle: Fix intel_idle() vs tracing
  sched/idle: Fix arch_cpu_idle() vs tracing
2020-11-29 11:19:26 -08:00
Jakub Kicinski 5c39f26e67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.

Conflicts:
	drivers/net/can/usb/gs_usb.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 18:25:27 -08:00
Linus Torvalds 43d6ecd97c Urgent printk fix for 5.10
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAl/A+VEACgkQUqAMR0iA
 lPIwGw/9F/E2ZdX+Vgi3ZiR/5GdfVZeIW+QwhKXBQc8Jr9+p2JJ+UOPeeazKQA5l
 bFt6GR67yjqtFS5gO76EPCQ6/Uu3cPA+A3HQRQZuE6p0zM+mrMXc/upLMy5DKi4Z
 f4zkW8dYWSBpAWPvM9bb0gIKO9wVV6Aj1IyyZLfEghX/KrJPx0zutioO4ScYxhA9
 YVITmnUQ6YzHEVE8CwWGV4lArC50ILGdIqNlZrkjuG3CuGTdyB2OY60P8XCy8bzn
 W3WgRGI/bvfHwCPh8oYKm/5nM9JAVdhbEpoFQj8cMPKoH5DeSGNWfYXkali2gqhL
 1Y2SntTcR7zclMcN0/gIn9ViVsma/eayAyawSYgQjmAdl6H/vv9B7x9ZswmK/b38
 JzOzHwP+H3lXVg2yN4EbH3uDMTMjqflYuC7QiZ/HNa43KURXhoritw2hBRczhazp
 mdyRQf4iv8NoYSthggD6LolCs+ay5NZpCeB3YXgnlpxiYFGCE+ykSz41AGdTyYTl
 jOWVtK1VawFD0/FgpgF8XK7/gOXWeYb+4WeBYgGKgCJdneiB5eJt8eWT7zmpAPpG
 FECexdAd4TAjD+EEbidiFWpMjJcY2TnOJp76O3/Wlo1QLbEgRHOklM/Rrq0zGg5b
 vm3w0kobGZfpIJuzSOAHyErX0jGEVTq6yUi381jSQpf4bTttIpc=
 =7IkU
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fixes from Petr Mladek:

 - do not lose trailing newline in pr_cont() calls

 - two trivial fixes for a dead store and a config description

* tag 'printk-for-5.10-rc6-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: finalize records with trailing newlines
  printk: remove unneeded dead-store assignment
  init/Kconfig: Fix CPU number in LOG_CPU_MAX_BUF_SHIFT description
2020-11-27 10:38:36 -08:00
Petr Mladek 739e7116b1 Merge branch 'for-5.10-pr_cont-fixup' into for-linus 2020-11-27 13:41:23 +01:00
John Ogness 4ad9921af4 printk: finalize records with trailing newlines
Any record with a trailing newline (LOG_NEWLINE flag) cannot
be continued because the newline has been stripped and will
not be visible if the message is appended. This was already
handled correctly when committing in log_output() but was
not handled correctly when committing in log_store().

Fixes: f5f022e53b ("printk: reimplement log_cont using record extension")
Link: https://lore.kernel.org/r/20201126114836.14750-1-john.ogness@linutronix.de
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2020-11-27 11:58:54 +01:00
Ingo Molnar a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Nicholas Piggin 8ff00399b1 kernel/cpu: add arch override for clear_tasks_mm_cpumask() mm handling
powerpc/64s keeps a counter in the mm which counts bits set in
mm_cpumask as well as other things. This means it can't use generic code
to clear bits out of the mask and doesn't adjust the arch specific
counter.

Add an arch override that allows powerpc/64s to use
clear_tasks_mm_cpumask().

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201126102530.691335-4-npiggin@gmail.com
2020-11-27 00:10:39 +11:00
Peter Zijlstra 20c7775aec Merge remote-tracking branch 'origin/master' into perf/core
Further perf/core patches will depend on:

  d3f7b1bb20 ("mm/gup: fix gup_fast with dynamic page table folding")

which is already in Linus' tree.
2020-11-26 13:16:55 +01:00
KP Singh 27672f0d28 bpf: Add a BPF helper for getting the IMA hash of an inode
Provide a wrapper function to get the IMA hash of an inode. This helper
is useful in fingerprinting files (e.g executables on execution) and
using these fingerprints in detections like an executable unlinking
itself.

Since the ima_inode_hash can sleep, it's only allowed for sleepable
LSM hooks.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201124151210.1081188-3-kpsingh@chromium.org
2020-11-26 00:04:04 +01:00
Andy Shevchenko b87e745945 resource: provide meaningful MODULE_LICENSE() in test suite
modpost complains that module has no licence provided.
Provide it via meaningful MODULE_LICENSE().

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-25 18:52:35 +01:00
Wedson Almeida Filho 59e2e27d22 bpf: Refactor check_cfg to use a structured loop.
The current implementation uses a number of gotos to implement a loop
and different paths within the loop, which makes the code less readable
than it would be with an explicit while-loop. This patch also replaces a
chain of if/if-elses keyed on the same expression with a switch
statement.

No change in behaviour is intended.

Signed-off-by: Wedson Almeida Filho <wedsonaf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201121015509.3594191-1-wedsonaf@google.com
2020-11-24 20:29:26 -08:00
Andrii Nakryiko 607c543f93 bpf: Sanitize BTF data pointer after module is loaded
Given .BTF section is not allocatable, it will get trimmed after module is
loaded. BPF system handles that properly by creating an independent copy of
data. But prevent any accidental misused by resetting the pointer to BTF data.

Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Suggested-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jessica Yu <jeyu@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/bpf/20201121070829.2612884-2-andrii@kernel.org
2020-11-25 00:05:21 +01:00
Peter Zijlstra 2914b0ba61 irq_work: Optimize irq_work_single()
Trade one atomic op for a full memory barrier.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Peter Zijlstra 545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Peter Zijlstra 7a9f50a058 irq_work: Cleanup
Get rid of the __call_single_node union and clean up the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:48 +01:00
Mel Gorman 23e6082a52 sched: Limit the amount of NUMA imbalance that can exist at fork time
At fork time currently, a local node can be allowed to fill completely
and allow the periodic load balancer to fix the problem. This can be
problematic in cases where a task creates lots of threads that idle until
woken as part of a worker poll causing a memory bandwidth problem.

However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.

This patch limits how much a local node can fill before spilling over
to another node and it will not be a universal win. Specifically,
very short-lived workloads that fit within a NUMA node would prefer
the memory bandwidth.

As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.

pft timings
                                 5.10.0-rc2             5.10.0-rc2
                          imbalancefloat-v2          forkspread-v2
Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*

This is showing the time to fault regions belonging to threads. The target
machine has 80 logical CPUs and two nodes. Note the ~30% gain when the
machine is approximately the point where one node becomes fully utilised.
The slower results are borderline noise.

Kernel building shows similar benefits around the same balance point.
Generally performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point where fork stops
spreading a task so some workloads may benefit from different balance
points but it would be a risky tuning parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
2020-11-24 16:47:48 +01:00
Mel Gorman 7d2b5dd0bc sched/numa: Allow a floating imbalance between NUMA nodes
Currently, an imbalance is only allowed when a destination node
is almost completely idle. This solved one basic class of problems
and was the cautious approach.

This patch revisits the possibility that NUMA nodes can be imbalanced
until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
superficial -- it's half the cores when HT is enabled.  At higher
utilisations, balancing should continue as normal and keep things even
until scheduler domains are fully busy or over utilised.

Note that this is not expected to be a universal win. Any benchmark
that prefers spreading as wide as possible with limited communication
will favour the old behaviour as there is more memory bandwidth.
Workloads that communicate heavily in pairs such as netperf or tbench
benefit. For the tests I ran, the vast majority of workloads saw
a benefit so it seems to be a worthwhile trade-off.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
2020-11-24 16:47:47 +01:00
Mel Gorman 5c339005f8 sched: Avoid unnecessary calculation of load imbalance at clone time
In find_idlest_group(), the load imbalance is only relevant when the group
is either overloaded or fully busy but it is calculated unconditionally.
This patch moves the imbalance calculation to the context it is required.
Technically, it is a micro-optimisation but really the benefit is avoiding
confusing one type of imbalance with another depending on the group_type
in the next patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net
2020-11-24 16:47:47 +01:00
Mel Gorman abeae76a47 sched/numa: Rename nr_running and break out the magic number
This is simply a preparation patch to make the following patches easier
to read. No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net
2020-11-24 16:47:47 +01:00
Peter Zijlstra 58c644ba51 sched/idle: Fix arch_cpu_idle() vs tracing
We call arch_cpu_idle() with RCU disabled, but then use
local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

Switch all arch_cpu_idle() implementations to use
raw_local_irq_{en,dis}able() and carefully manage the
lockdep,rcu,tracing state like we do in entry.

(XXX: we really should change arch_cpu_idle() to not return with
interrupts enabled)

Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org
2020-11-24 16:47:35 +01:00
Thomas Gleixner 5fbda3ecd1 sched: highmem: Store local kmaps in task struct
Instead of storing the map per CPU provide and use per task storage. That
prepares for local kmaps which are preemptible.

The context switch code is preparatory and not yet in use because
kmap_atomic() runs with preemption disabled. Will be made usable in the
next step.

The context switch logic is safe even when an interrupt happens after
clearing or before restoring the kmaps. The kmap index in task struct is
not modified so any nesting kmap in an interrupt will use unused indices
and on return the counter is the same as before.

Also add an assert into the return to user space code. Going back to user
space with an active kmap local is a nono.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.372935758@linutronix.de
2020-11-24 14:42:09 +01:00
Thomas Gleixner 13c8da5db4 Merge branch 'sched/core' into core/mm
Pull the migrate disable mechanics which is a prerequisite for preemptible
kmap_local().
2020-11-24 11:26:11 +01:00
Thomas Gleixner 74d862b682 sched: Make migrate_disable/enable() independent of RT
Now that the scheduler can deal with migrate disable properly, there is no
real compelling reason to make it only available for RT.

There are quite some code pathes which needlessly disable preemption in
order to prevent migration and some constructs like kmap_atomic() enforce
it implicitly.

Making it available independent of RT allows to provide a preemptible
variant of kmap_atomic() and makes the code more consistent in general.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Grudgingly-Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20201118204007.269943012@linutronix.de
2020-11-24 11:25:44 +01:00
Alex Shi ab150c3f80 PM / suspend: fix kernel-doc markup
Add parameter explanation to fix kernel-doc marks:

kernel/power/suspend.c:233: warning: Function parameter or member
'state' not described in 'suspend_valid_only_mem'
kernel/power/suspend.c:344: warning: Function parameter or member
'state' not described in 'suspend_prepare'

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
[ rjw: Change the proposed parameter descriptions. ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-23 18:24:53 +01:00
Peter Collingbourne 6ac05e832a signal: define the SA_EXPOSE_TAGBITS bit in sa_flags
Architectures that support address tagging, such as arm64, may want to
expose fault address tag bits to the signal handler to help diagnose
memory errors. However, these bits have not been previously set,
and their presence may confuse unaware user applications. Therefore,
introduce a SA_EXPOSE_TAGBITS flag bit in sa_flags that a signal
handler may use to explicitly request that the bits are set.

The generic signal handler APIs expect to receive tagged addresses.
Architectures may specify how to untag addresses in the case where
SA_EXPOSE_TAGBITS is clear by defining the arch_untagged_si_addr
function.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I16dd0ed2081f091fce97be0190cb8caa874c26cb
Link: https://lkml.kernel.org/r/13cf24d00ebdd8e1f55caf1821c7c29d54100191.1605904350.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:06 -06:00
Peter Collingbourne a54f0dfda7 signal: define the SA_UNSUPPORTED bit in sa_flags
Define a sa_flags bit, SA_UNSUPPORTED, which will never be supported
in the uapi. The purpose of this flag bit is to allow userspace to
distinguish an old kernel that does not clear unknown sa_flags bits
from a kernel that supports every flag bit.

In other words, if userspace does something like:

  act.sa_flags |= SA_UNSUPPORTED;
  sigaction(SIGSEGV, &act, 0);
  sigaction(SIGSEGV, 0, &oldact);

and finds that SA_UNSUPPORTED remains set in oldact.sa_flags, it means
that the kernel cannot be trusted to have cleared unknown flag bits
from sa_flags, so no assumptions about flag bit support can be made.

Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ic2501ad150a3a79c1cf27fb8c99be342e9dffbcb
Link: https://lkml.kernel.org/r/bda7ddff8895a9bc4ffc5f3cf3d4d37a32118077.1605582887.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:06 -06:00
Peter Collingbourne 23acdc76f1 signal: clear non-uapi flag bits when passing/returning sa_flags
Previously we were not clearing non-uapi flag bits in
sigaction.sa_flags when storing the userspace-provided sa_flags or
when returning them via oldact. Start doing so.

This allows userspace to detect missing support for flag bits and
allows the kernel to use non-uapi bits internally, as we are already
doing in arch/x86 for two flag bits. Now that this change is in
place, we no longer need the code in arch/x86 that was hiding these
bits from userspace, so remove it.

This is technically a userspace-visible behavior change for sigaction, as
the unknown bits returned via oldact.sa_flags are no longer set. However,
we are free to define the behavior for unknown bits exactly because
their behavior is currently undefined, so for now we can define the
meaning of each of them to be "clear the bit in oldact.sa_flags unless
the bit becomes known in the future". Furthermore, this behavior is
consistent with OpenBSD [1], illumos [2] and XNU [3] (FreeBSD [4] and
NetBSD [5] fail the syscall if unknown bits are set). So there is some
precedent for this behavior in other kernels, and in particular in XNU,
which is probably the most popular kernel among those that I looked at,
which means that this change is less likely to be a compatibility issue.

Link: [1] f634a6a4b5/sys/kern/kern_sig.c (L278)
Link: [2] 76f19f5fdc/usr/src/uts/common/syscall/sigaction.c (L86)
Link: [3] a449c6a3b8/bsd/kern/kern_sig.c (L480)
Link: [4] eded70c370/sys/kern/kern_sig.c (L699)
Link: [5] 3365779bec/sys/kern/sys_sig.c (L473)
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Link: https://linux-review.googlesource.com/id/I35aab6f5be932505d90f3b3450c083b4db1eca86
Link: https://lkml.kernel.org/r/878dbcb5f47bc9b11881c81f745c0bef5c23f97f.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-11-23 10:31:05 -06:00
Thomas Gleixner ae9ef58996 softirq: Move related code into one section
To prepare for adding a RT aware variant of softirq serialization and
processing move related code into one section so the necessary #ifdeffery
is reduced to one.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20201113141733.974214480@linutronix.de
2020-11-23 10:31:06 +01:00
Linus Torvalds f4b936f5d6 A couple of scheduler fixes:
- Make the conditional update of the overutilized state work correctly by
    caching the relevant flags state before overwriting them and checking
    them afterwards.
 
  - Fix a data race in the wakeup path which caused loadavg on ARM64
    platforms to become a random number generator.
 
  - Fix the ordering of the iowaiter accounting operations so it can't be
    decremented before it is incremented.
 
  - Fix a bug in the deadline scheduler vs. priority inheritance when a
    non-deadline task A has inherited the parameters of a deadline task B
    and then blocks on a non-deadline task C.
 
    The second inheritance step used the static deadline parameters of task
    A, which are usually 0, instead of further propagating task B's
    parameters. The zero initialized parameters trigger a bug in the
    deadline scheduler.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+6edsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaJCEAC7VGr9IlWRzCI/173tKAXkLRrGXHVb
 yOYc/YjLMCTcERNxqpf8uIURd/ATSHU/RMwfFcB558NedKZ/QKZDoKmLqeCXnVeM
 e20tXv/fmpqRS7lgtmbBfhQ8mSDhst960oD1mHifdEwEBCCm7mLEaipTuTWjnZ0x
 rOz70Hir1mSjsP0E7ZorsxCr1yExbrt+jZfKCe9D2kUSvlWHf1ipzAYNlqb/DsfG
 n81G7q9LYV8NUhX3lt8oSZDq0K44aO6G6fEaP4EkfwsIAOh37yPHwuEuqDZCBmXw
 rQ17XUU3jQ2MtubPvVEKG/6Z+hAUyOsAKynpq/RhzueXQm/9Ns6+qHX/xY8yh39y
 S5qPd5DLRlac8f7cFwz2zPxP5E+xTJLONgRkuN1XlitMJZBxru9AzDNa0/6on8TM
 OtvbvVR+bPUfHiHULk4fTz7fLcbgYgxbCgfGoFsVlfskOxnzgEG8WfuI2Up2rRJ0
 nr1MCER+5fprciqPPs+18rVEFiC4mQSrV01cnwrNbpW8pqibZSomMilQ0oQvcTGL
 VDEHkaDTa5YbR92Szq4rYbr7Sf0ihFU0EZUNVQnu7SujdVFxTdHb1yr8UYcYp09b
 LqGFhr1FHBNYKbw3rEPx2R/FGuCii21oQkhz94ujDo1Np8EGVZYwFGh+iwbsa2Xn
 K1u0HzqLTfTkMw==
 =HiGq
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A couple of scheduler fixes:

   - Make the conditional update of the overutilized state work
     correctly by caching the relevant flags state before overwriting
     them and checking them afterwards.

   - Fix a data race in the wakeup path which caused loadavg on ARM64
     platforms to become a random number generator.

   - Fix the ordering of the iowaiter accounting operations so it can't
     be decremented before it is incremented.

   - Fix a bug in the deadline scheduler vs. priority inheritance when a
     non-deadline task A has inherited the parameters of a deadline task
     B and then blocks on a non-deadline task C.

     The second inheritance step used the static deadline parameters of
     task A, which are usually 0, instead of further propagating task
     B's parameters. The zero initialized parameters trigger a bug in
     the deadline scheduler"

* tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix priority inheritance with multiple scheduling classes
  sched: Fix rq->nr_iowait ordering
  sched: Fix data-race in wakeup
  sched/fair: Fix overutilized update in enqueue_task_fair()
2020-11-22 13:26:07 -08:00
Linus Torvalds 855cf1ee47 A single fix for lockdep which makes the recursion protection cover graph
lock/unlock.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+6dQ4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoXLiD/91TaQTrgVht6NxF5CmvEpoYPzGg/TO
 MeU7GmgeXv7ciIIDaPONsCLOGjGx5dSxk1QJ5OH+GhJ/lIXbtvbVUgbs7Ezjkl4d
 SyjRHNv3wYV7VJySEgqxo+Rd6fCQN7rOf/J3ZbzPHtMl40qxSpz1FeXTRft7cn62
 pkunRKM+f/tE/Ue96tiHCQVbwRxG3LEPIuXtLJQqtPvgMzb9zhC9mZpUcr5ZJyVD
 BfX69iIs1W1XKql6yH/KuBUY9d3qRzzmhko6a2mMSL3+VX0j4o46LMlP4HRndKTT
 5fEsUC3UOItqn7X4f3HHeQhckEAKXNGEMR30th84CligiEanQuIIHNNKazsnp87A
 GT/XA8RO9+OAd5TRP4CEaFDUR0a0OOZjiw3vW6E77BObi31Se9/EO3Z7XpJC2BSl
 vTsukplOmeilyspQJT7inVgZrtUlF0lpVV5YbMLjtbn5kzLu/75GZN3JagwCmflN
 GKbix/8tnH0KrbizCE8ujUZl+aNjk2TM64mAzOXMlmg+yZ47af31hFs846tCMhv5
 JHXkiMaXpWx3PapgGHM+4PJ1WDGTxm6mUvgAcyCVWgT4gxVXsppnkn0RLdRTGQ50
 C9movzOn1K2fEe7Pp40arG0tm18l9Ukp7mqN1CL+givRKNb61Owybe6qh2qk0IWT
 DAcgdC9h2rdTqQ==
 =dsm4
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Thomas Gleixner:
 "A single fix for lockdep which makes the recursion protection cover
  graph lock/unlock"

* tag 'locking-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Put graph lock/unlock under lock_recursion protection
2020-11-22 13:19:53 -08:00
Linus Torvalds ea0ab64306 seccomp fixes for v5.10-rc5
- Fix typos in seccomp selftests on powerpc and sh (Kees Cook)
 
 - Fix PF_SUPERPRIV audit marking in seccomp and ptrace (Mickaël Salaün)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl+4FQgACgkQiXL039xt
 wCYERA/8Cwb8p8PWbtOq6uUKsZ4kJQuMZee3crP+LM0B1a427hOCSuezoDtOY1wO
 N7IEOr3ipHxoxZ1Zf2Ln4lGyktxlwzA8pqlGZyqR5zFIl+0HArbXdAfcRMcaeqUp
 eSV30CNBD5XfGLc2R5s1qHnrshVBJFebCpgMfSCOQQWMpZ51nnaoFN8N8iSEx6PN
 kYHC1C1WX5g3vtKo29xS2Y7KCThMOXvcNI7eFpVD0C4ZwEr8lywbTTzBBhXUIGBX
 6NoNOV7kVxIjNLQ8x17F1OacrC6h4ZzNTl4MEYnMZ/Mw0NVB3MvoHQohwW+Y98Rf
 97sPQPZjYeJ6xURolRsWX+kvXC7PyLYvfldsQi00QDfdc6RGu0pnsG4UuivsldlY
 OhswE9Q/KKHmzXiHnZBmcw4NcSyhZiL3LYB1VZl3jDobeOhVKyHw72vo8Zrhhz8A
 ksCDg3vNvOo/x2iH9GSUG4Fjk8coXRif8P6lH5Btw6V+x9ZlFiaW5WbSbP0G3PzJ
 zS5nPu8PE6Sm70XlRn0BbRmIjV9AhEZqNNZoOsndrbR86klH6WolyCB4ifj2MKuR
 ZwbeDblUrYyRne/Ll9XGQVDSFv8J5phxtDQM0phiGK0jOsvXqbl8RvlckCKqBwm0
 7VgtEumU5vJTx01avrXw86Sj7B2IR4M1nTgpwWJ2EVs9U98Emew=
 =1Mse
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fixes from Kees Cook:
 "This gets the seccomp selftests running again on powerpc and sh, and
  fixes an audit reporting oversight noticed in both seccomp and ptrace.

   - Fix typos in seccomp selftests on powerpc and sh (Kees Cook)

   - Fix PF_SUPERPRIV audit marking in seccomp and ptrace (Mickaël
     Salaün)"

* tag 'seccomp-v5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  selftests/seccomp: sh: Fix register names
  selftests/seccomp: powerpc: Fix typo in macro variable name
  seccomp: Set PF_SUPERPRIV when checking capability
  ptrace: Set PF_SUPERPRIV when checking capability
2020-11-21 10:24:05 -08:00
Song Liu 91b2db27d3 bpf: Simplify task_file_seq_get_next()
Simplify task_file_seq_get_next() by removing two in/out arguments: task
and fstruct. Use info->task and info->files instead.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
2020-11-20 20:36:34 +01:00
Eric Biggers a24d22b225 crypto: sha - split sha.h into sha1.h and sha2.h
Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
and <crypto/sha3.h> contains declarations for SHA-3.

This organization is inconsistent, but more importantly SHA-1 is no
longer considered to be cryptographically secure.  So to the extent
possible, SHA-1 shouldn't be grouped together with any of the other SHA
versions, and usage of it should be phased out.

Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
<crypto/sha2.h>, and make everyone explicitly specify whether they want
the declarations for SHA-1, SHA-2, or both.

This avoids making the SHA-1 declarations visible to files that don't
want anything to do with SHA-1.  It also prepares for potentially moving
sha1.h into a new insecure/ or dangerous/ directory.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-11-20 14:45:33 +11:00
Paul E. McKenney c4638ff064 Merge branch 'kcsan.2020.11.06a' into HEAD
kcsan.2020.11.06a:  Kernel concurrency sanitizer (KCSAN) updates.
2020-11-19 19:38:25 -08:00
Paul E. McKenney 7fc91fc845 Merge branches 'cpuinfo.2020.11.06a', 'doc.2020.11.06a', 'fixes.2020.11.19b', 'lockdep.2020.11.02a', 'tasks.2020.11.06a' and 'torture.2020.11.06a' into HEAD
cpuinfo.2020.11.06a: Speedups for /proc/cpuinfo.
doc.2020.11.06a: Documentation updates.
fixes.2020.11.19b: Miscellaneous fixes.
lockdep.2020.11.02a: Lockdep-RCU updates to avoid "unused variable".
tasks.2020.11.06a: Tasks-RCU updates.
torture.2020.11.06a': Torture-test updates.
2020-11-19 19:37:47 -08:00
Paul E. McKenney 50edb98853 srcu: Take early exit on memory-allocation failure
It turns out that init_srcu_struct() can be invoked from usermode tasks,
and that fatal signals received by these tasks can cause memory-allocation
failures.  These failures are not handled well by init_srcu_struct(),
so much so that NULL pointer dereferences can result.  This commit
therefore causes init_srcu_struct() to take an early exit upon detection
of memory-allocation failure.

Link: https://lore.kernel.org/lkml/20200908144306.33355-1-aik@ozlabs.ru/
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Uladzislau Rezki (Sony) 56292e8609 rcu/tree: Defer kvfree_rcu() allocation to a clean context
The current memmory-allocation interface causes the following difficulties
for kvfree_rcu():

a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will
   complain about violation of the nesting rules, as in "BUG: Invalid
   wait context".  This Kconfig option checks for proper raw_spinlock
   vs. spinlock nesting, in particular, it is not legal to acquire a
   spinlock_t while holding a raw_spinlock_t.

   This is a problem because kfree_rcu() uses raw_spinlock_t whereas the
   "page allocator" internally deals with spinlock_t to access to its
   zones. The code also can be broken from higher level of view:
   <snip>
       raw_spin_lock(&some_lock);
       kfree_rcu(some_pointer, some_field_offset);
   <snip>

b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into
   sleeplock.  This means that invoking the page allocator from atomic
   contexts results in "BUG: scheduling while atomic".

c) Please note that call_rcu() is already invoked from raw atomic context,
   so it is only reasonable to expaect that kfree_rcu() and kvfree_rcu()
   will also be called from atomic raw context.

This commit therefore defers page allocation to a clean context using the
combination of an hrtimer and a workqueue.  The hrtimer stage is required
in order to avoid deadlocks with the scheduler.  This deferred allocation
is required only when kvfree_rcu()'s per-CPU page cache is empty.

Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
Fixes: 3042f83f19 ("rcu: Support reclaim for head-less object")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Paul E. McKenney bfb3aa735f rcu: Do not report strict GPs for outgoing CPUs
An outgoing CPU is marked offline in a stop-machine handler and most
of that CPU's services stop at that point, including IRQ work queues.
However, that CPU must take another pass through the scheduler and through
a number of CPU-hotplug notifiers, many of which contain RCU readers.
In the past, these readers were not a problem because the outgoing CPU
has interrupts disabled, so that rcu_read_unlock_special() would not
be invoked, and thus RCU would never attempt to queue IRQ work on the
outgoing CPU.

This changed with the advent of the CONFIG_RCU_STRICT_GRACE_PERIOD
Kconfig option, in which rcu_read_unlock_special() is invoked upon exit
from almost all RCU read-side critical sections.  Worse yet, because
interrupts are disabled, rcu_read_unlock_special() cannot immediately
report a quiescent state and will therefore attempt to defer this
reporting, for example, by queueing IRQ work.  Which fails with a splat
because the CPU is already marked as being offline.

But it turns out that there is no need to report this quiescent state
because rcu_report_dead() will do this job shortly after the outgoing
CPU makes its final dive into the idle loop.  This commit therefore
makes rcu_read_unlock_special() refrain from queuing IRQ work onto
outgoing CPUs.

Fixes: 44bad5b3cc ("rcu: Do full report for .need_qs for strict GPs")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jann Horn <jannh@google.com>
2020-11-19 19:37:17 -08:00
Zhouyi Zhou 354c3f0e22 rcu: Fix a typo in rcu_blocking_is_gp() header comment
This commit fixes a typo in the rcu_blocking_is_gp() function's header
comment.

Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Paul E. McKenney 4d60b475f8 rcu: Prevent lockdep-RCU splats on lock acquisition/release
The rcu_cpu_starting() and rcu_report_dead() functions transition the
current CPU between online and offline state from an RCU perspective.
Unfortunately, this means that the rcu_cpu_starting() function's lock
acquisition and the rcu_report_dead() function's lock releases happen
while the CPU is offline from an RCU perspective, which can result
in lockdep-RCU splats about using RCU from an offline CPU.  And this
situation can also result in too-short grace periods, especially in
guest OSes that are subject to vCPU preemption.

This commit therefore uses sequence-count-like synchronization to forgive
use of RCU while RCU thinks a CPU is offline across the full extent of
the rcu_cpu_starting() and rcu_report_dead() function's lock acquisitions
and releases.

One approach would have been to use the actual sequence-count primitives
provided by the Linux kernel.  Unfortunately, the resulting code looks
completely broken and wrong, and is likely to result in patches that
break RCU in an attempt to address this appearance of broken wrongness.
Plus there is no net savings in lines of code, given the additional
explicit memory barriers required.

Therefore, this sequence count is instead implemented by a new ->ofl_seq
field in the rcu_node structure.  If this counter's value is an odd
number, RCU forgives RCU read-side critical sections on other CPUs covered
by the same rcu_node structure, even if those CPUs are offline from
an RCU perspective.  In addition, if a given leaf rcu_node structure's
->ofl_seq counter value is an odd number, rcu_gp_init() delays starting
the grace period until that counter value changes.

[ paulmck: Apply Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Joel Fernandes (Google) bd56e0a4a2 rcu/tree: nocb: Avoid raising softirq for offloaded ready-to-execute CBs
Testing showed that rcu_pending() can return 1 when offloaded callbacks
are ready to execute.  This invokes RCU core processing, for example,
by raising RCU_SOFTIRQ, eventually resulting in a call to rcu_core().
However, rcu_core() explicitly avoids in any way manipulating offloaded
callbacks, which are instead handled by the rcuog and rcuoc kthreads,
which work independently of rcu_core().

One exception to this independence is that rcu_core() invokes
do_nocb_deferred_wakeup(), however, rcu_pending() also checks
rcu_nocb_need_deferred_wakeup() in order to correctly handle this case,
invoking rcu_core() when needed.

This commit therefore avoids needlessly invoking RCU core processing
by checking rcu_segcblist_ready_cbs() only on non-offloaded CPUs.
This reduces overhead, for example, by reducing softirq activity.

This change passed 30 minute tests of TREE01 through TREE09 each.

On TREE08, there is at most 150us from the time that rcu_pending() chose
not to invoke RCU core processing to the time when the ready callbacks
were invoked by the rcuoc kthread.  This provides further evidence that
there is no need to invoke rcu_core() for offloaded callbacks that are
ready to invoke.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Peter Zijlstra d2098b4440 rcu,ftrace: Fix ftrace recursion
Kim reported that perf-ftrace made his box unhappy. It turns out that
commit:

  ff5c4f5cad ("rcu/tree: Mark the idle relevant functions noinstr")

removed one too many notrace qualifiers, probably due to there not being
a helpful comment.

This commit therefore reinstates the notrace and adds a comment to avoid
losing it again.

[ paulmck: Apply Steven Rostedt's feedback on the comment. ]
Fixes: ff5c4f5cad ("rcu/tree: Mark the idle relevant functions noinstr")
Reported-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Joe Perches 7c47ee5aa0 rcu/tree: Make struct kernel_param_ops definitions const
These should be const, so make it so.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:17 -08:00
Joel Fernandes (Google) 9f866dac94 rcu/tree: Add a warning if CPU being onlined did not report QS already
Currently, rcu_cpu_starting() checks to see if the RCU core expects a
quiescent state from the incoming CPU.  However, the current interaction
between RCU quiescent-state reporting and CPU-hotplug operations should
mean that the incoming CPU never needs to report a quiescent state.
First, the outgoing CPU reports a quiescent state if needed.  Second,
the race where the CPU is leaving just as RCU is initializing a new
grace period is handled by an explicit check for this condition.  Third,
the CPU's leaf rcu_node structure's ->lock serializes these checks.

This means that if rcu_cpu_starting() ever feels the need to report
a quiescent state, then there is a bug somewhere in the CPU hotplug
code or the RCU grace-period handling code.  This commit therefore
adds a WARN_ON_ONCE() to bring that bug to everyone's attention.

Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Neeraj Upadhyay a3941517fc rcu: Clarify nocb kthreads naming in RCU_NOCB_CPU config
This commit clarifies that the "p" and the "s" in the in the RCU_NOCB_CPU
config-option description refer to the "x" in the "rcuox/N" kthread name.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
[ paulmck: While in the area, update description and advice. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Neeraj Upadhyay ed73860cec rcu: Fix single-CPU check in rcu_blocking_is_gp()
Currently, for CONFIG_PREEMPTION=n kernels, rcu_blocking_is_gp() uses
num_online_cpus() to determine whether there is only one CPU online.  When
there is only a single CPU online, the simple fact that synchronize_rcu()
could be legally called implies that a full grace period has elapsed.
Therefore, in the single-CPU case, synchronize_rcu() simply returns
immediately.  Unfortunately, num_online_cpus() is unreliable while a
CPU-hotplug operation is transitioning to or from single-CPU operation
because:

1.	num_online_cpus() uses atomic_read(&__num_online_cpus) to
	locklessly sample the number of online CPUs.  The hotplug locks
	are not held, which means that an incoming CPU can concurrently
	update this count.  This in turn means that an RCU read-side
	critical section on the incoming CPU might observe updates
	prior to the grace period, but also that this critical section
	might extend beyond the end of the optimized synchronize_rcu().
	This breaks RCU's fundamental guarantee.

2.	In addition, num_online_cpus() does no ordering, thus providing
	another way that RCU's fundamental guarantee can be broken by
	the current code.

3.	The most probable failure mode happens on outgoing CPUs.
	The outgoing CPU updates the count of online CPUs in the
	CPUHP_TEARDOWN_CPU stop-machine handler, which is fine in
	and of itself due to preemption being disabled at the call
	to num_online_cpus().  Unfortunately, after that stop-machine
	handler returns, the CPU takes one last trip through the
	scheduler (which has RCU readers) and, after the resulting
	context switch, one final dive into the idle loop.  During this
	time, RCU needs to keep track of two CPUs, but num_online_cpus()
	will say that there is only one, which in turn means that the
	surviving CPU will incorrectly ignore the outgoing CPU's RCU
	read-side critical sections.

This problem is illustrated by the following litmus test in which P0()
corresponds to synchronize_rcu() and P1() corresponds to the incoming CPU.
The herd7 tool confirms that the "exists" clause can be satisfied,
thus demonstrating that this breakage can happen according to the Linux
kernel memory model.

   {
     int x = 0;
     atomic_t numonline = ATOMIC_INIT(1);
   }

   P0(int *x, atomic_t *numonline)
   {
     int r0;
     WRITE_ONCE(*x, 1);
     r0 = atomic_read(numonline);
     if (r0 == 1) {
       smp_mb();
     } else {
       synchronize_rcu();
     }
     WRITE_ONCE(*x, 2);
   }

   P1(int *x, atomic_t *numonline)
   {
     int r0; int r1;

     atomic_inc(numonline);
     smp_mb();
     rcu_read_lock();
     r0 = READ_ONCE(*x);
     smp_rmb();
     r1 = READ_ONCE(*x);
     rcu_read_unlock();
   }

   locations [x;numonline;]

   exists (1:r0=0 /\ 1:r1=2)

It is important to note that these problems arise only when the system
is transitioning to or from single-CPU operation.

One solution would be to hold the CPU-hotplug locks while sampling
num_online_cpus(), which was in fact the intent of the (redundant)
preempt_disable() and preempt_enable() surrounding this call to
num_online_cpus().  Actually blocking CPU hotplug would not only result
in excessive overhead, but would also unnecessarily impede CPU-hotplug
operations.

This commit therefore follows long-standing RCU tradition by maintaining
a separate RCU-specific set of CPU-hotplug books.

This separate set of books is implemented by a new ->n_online_cpus field
in the rcu_state structure that maintains RCU's count of the online CPUs.
This count is incremented early in the CPU-online process, so that
the critical transition away from single-CPU operation will occur when
there is only a single CPU.  Similarly for the critical transition to
single-CPU operation, the counter is decremented late in the CPU-offline
process, again while there is only a single CPU.  Because there is only
ever a single CPU when the ->n_online_cpus field undergoes the critical
1->2 and 2->1 transitions, full memory ordering and mutual exclusion is
provided implicitly and, better yet, for free.

In the case where the CPU is coming online, nothing will happen until
the current CPU helps it come online.  Therefore, the new CPU will see
all accesses prior to the optimized grace period, which means that RCU
does not need to further delay this new CPU.  In the case where the CPU
is going offline, the outgoing CPU is totally out of the picture before
the optimized grace period starts, which means that this outgoing CPU
cannot see any of the accesses following that grace period.  Again,
RCU needs no further interaction with the outgoing CPU.

This does mean that synchronize_rcu() will unnecessarily do a few grace
periods the hard way just before the second CPU comes online and just
after the second-to-last CPU goes offline, but it is not worth optimizing
this uncommon case.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Frederic Weisbecker e3771c850d rcu: Implement rcu_segcblist_is_offloaded() config dependent
This commit simplifies the use of the rcu_segcblist_is_offloaded() API so
that its callers no longer need to check the RCU_NOCB_CPU Kconfig option.
Note that rcu_segcblist_is_offloaded() is defined in the header file,
which means that the generated code should be just as efficient as before.

Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
chao dfe564045c rcu: Panic after fixed number of stalls
Some stalls are transient, so that system fully recovers.  This commit
therefore allows users to configure the number of stalls that must happen
in order to trigger kernel panic.

Signed-off-by: chao <chao@eero.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:37:16 -08:00
Peter Zijlstra 6dbce04d84 rcu: Allow rcu_irq_enter_check_tick() from NMI
Eugenio managed to tickle #PF from NMI context which resulted in
hitting a WARN in RCU through irqentry_enter() ->
__rcu_irq_enter_check_tick().

However, this situation is perfectly sane and does not warrant an
WARN. The #PF will (necessarily) be atomic and not require messing
with the tick state, so early return is correct.  This commit
therefore removes the WARN.

Fixes: aaf2bc50df ("rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()")
Reported-by: "Eugenio Pérez" <eupm90@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-11-19 19:34:17 -08:00
Jakub Kicinski 56495a2442 Merge https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 19:08:46 -08:00
Linus Torvalds 4d02da974e Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211),
can and bpf (including the strncpy_from_user fix).
 
 Current release - regressions:
 
  - mac80211: fix memory leak of filtered powersave frames
 
  - mac80211: free sta in sta_info_insert_finish() on errors to avoid
              sleeping in atomic context
 
  - netlabel: fix an uninitialized variable warning added in -rc4
 
 Previous release - regressions:
 
  - vsock: forward all packets to the host when no H2G is registered,
            un-breaking AWS Nitro Enclaves
 
  - net: Exempt multicast addresses from five-second neighbor lifetime
         requirement, decreasing the chances neighbor tables fill up
 
  - net/tls: fix corrupted data in recvmsg
 
  - qed: fix ILT configuration of SRC block
 
  - can: m_can: process interrupt only when not runtime suspended
 
 Previous release - always broken:
 
  - page_frag: Recover from memory pressure by not recycling pages
               allocating from the reserves
 
  - strncpy_from_user: Mask out bytes after NUL terminator
 
  - ip_tunnels: Set tunnel option flag only when tunnel metadata is
                present, always setting it confuses Open vSwitch
 
  - bpf, sockmap:
    - Fix partial copy_page_to_iter so progress can still be made
    - Fix socket memory accounting and obeying SO_RCVBUF
 
  - net: Have netpoll bring-up DSA management interface
 
  - net: bridge: add missing counters to ndo_get_stats64 callback
 
  - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt
 
  - enetc: Workaround MDIO register access HW bug
 
  - net/ncsi: move netlink family registration to a subsystem init,
              instead of tying it to driver probe
 
  - net: ftgmac100: unregister NC-SI when removing driver to avoid crash
 
  - lan743x: prevent interrupt storm on open
 
  - lan743x: fix freeing skbs in the wrong context
 
  - net/mlx5e: Fix socket refcount leak on kTLS RX resync
 
  - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097
 
  - fix 21 unset return codes and other mistakes on error paths,
    mostly detected by the Hulk Robot
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+226AACgkQMUZtbf5S
 IruE1w/+JX3CqJwGIqyzyhwVshNaKxmX9gAOMJzkckjEohn8932zPaNq7kbmNYqt
 5QsJoou3cXjFeoIEAkQA5fqR4stTZpZMnLO+7JnxxQ0vb2YBN+tIGQRNCnmd1Q0h
 u9gb5+5AdORdlmk3E7oC8v50dzQRfboJXLEEZTo2uGJwUgLlEAiqTSV2w4YDHMhL
 JtgtWA/fraL0CUc2WMoxuimg9NirbRuMijsU6+d/yExzznDpdoho/qsxL+Odu1NF
 hSdaKirA8B8ml0pOd/b4mj+fm4+lKyXZBfSyLx4Ki1TqluEMLzDp7gQPRnU6yyJm
 AOu4zsKxx6qitOX2qLQCNlEpkQp6LA2N2Zb1orliUV3Bsq2DJRhU35FgLcghtdRP
 GTRSdKHr2BvMScOZ7dQo8l4TqVc3e/khSZDRGdvpsM275Dt0JyS/l7yAWxunPqMb
 +/483/s75OuBRO57ULLJ/hR02TG37g/Jv5sI0sG/7oDpGfnulinQX+fxy9izyTEM
 KYl0mAPSqhb6RcjE0YXWG0rhJN6FSvc/lwPQHjq8wPSkwEdD/FTb6/eYqbXDi1ld
 UTYhFpkh1PQrwct14eSScMeJqTsNKvG0VV39/uZLZCzcqa3yOY5+oTzwaCFlMsy3
 a5yGGxqoh7/FTM8t1ml21is9uZ31LAQEnNTMPv69pZPwAv5G5yE=
 =SRwI
 -----END PGP SIGNATURE-----

Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.10-rc5, including fixes from the WiFi
  (mac80211), can and bpf (including the strncpy_from_user fix).

  Current release - regressions:

   - mac80211: fix memory leak of filtered powersave frames

   - mac80211: free sta in sta_info_insert_finish() on errors to avoid
     sleeping in atomic context

   - netlabel: fix an uninitialized variable warning added in -rc4

  Previous release - regressions:

   - vsock: forward all packets to the host when no H2G is registered,
     un-breaking AWS Nitro Enclaves

   - net: Exempt multicast addresses from five-second neighbor lifetime
     requirement, decreasing the chances neighbor tables fill up

   - net/tls: fix corrupted data in recvmsg

   - qed: fix ILT configuration of SRC block

   - can: m_can: process interrupt only when not runtime suspended

  Previous release - always broken:

   - page_frag: Recover from memory pressure by not recycling pages
     allocating from the reserves

   - strncpy_from_user: Mask out bytes after NUL terminator

   - ip_tunnels: Set tunnel option flag only when tunnel metadata is
     present, always setting it confuses Open vSwitch

   - bpf, sockmap:
      - Fix partial copy_page_to_iter so progress can still be made
      - Fix socket memory accounting and obeying SO_RCVBUF

   - net: Have netpoll bring-up DSA management interface

   - net: bridge: add missing counters to ndo_get_stats64 callback

   - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt

   - enetc: Workaround MDIO register access HW bug

   - net/ncsi: move netlink family registration to a subsystem init,
     instead of tying it to driver probe

   - net: ftgmac100: unregister NC-SI when removing driver to avoid
     crash

   - lan743x:
      - prevent interrupt storm on open
      - fix freeing skbs in the wrong context

   - net/mlx5e: Fix socket refcount leak on kTLS RX resync

   - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097

   - fix 21 unset return codes and other mistakes on error paths, mostly
     detected by the Hulk Robot"

* tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits)
  fail_function: Remove a redundant mutex unlock
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
  lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
  net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid()
  net/smc: fix matching of existing link groups
  ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module
  libbpf: Fix VERSIONED_SYM_COUNT number parsing
  net/mlx4_core: Fix init_hca fields offset
  atm: nicstar: Unmap DMA on send error
  page_frag: Recover from memory pressure
  net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset
  mlxsw: core: Use variable timeout for EMAD retries
  mlxsw: Fix firmware flashing
  net: Have netpoll bring-up DSA management interface
  atl1e: fix error return code in atl1e_probe()
  atl1c: fix error return code in atl1c_probe()
  ah6: fix error return code in ah6_input()
  net: usb: qmi_wwan: Set DTR quirk for MR400
  can: m_can: process interrupt only when not runtime suspended
  can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery
  ...
2020-11-19 13:33:16 -08:00
Jakub Kicinski e6ea60bac1 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
1) libbpf should not attempt to load unused subprogs, from Andrii.

2) Make strncpy_from_user() mask out bytes after NUL terminator, from Daniel.

3) Relax return code check for subprograms in the BPF verifier, from Dmitrii.

4) Fix several sockmap issues, from John.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  fail_function: Remove a redundant mutex unlock
  selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL
  lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
  libbpf: Fix VERSIONED_SYM_COUNT number parsing
  bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list
  bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self
  bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self
  bpf, sockmap: Use truesize with sk_rmem_schedule()
  bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect
  bpf, sockmap: Fix partial copy_page_to_iter so progress can still be made
  selftests/bpf: Fix error return code in run_getsockopt_test()
  bpf: Relax return code check for subprograms
  tools, bpftool: Add missing close before bpftool net attach exit
  MAINTAINERS/bpf: Update Andrii's entry.
  selftests/bpf: Fix unused attribute usage in subprogs_unused test
  bpf: Fix unsigned 'datasec_id' compared with zero in check_pseudo_btf_id
  bpf: Fix passing zero to PTR_ERR() in bpf_btf_printf_prepare
  libbpf: Don't attempt to load unused subprog as an entry-point BPF program
====================

Link: https://lore.kernel.org/r/20201119200721.288-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 12:26:10 -08:00
Luo Meng 2801a5da5b fail_function: Remove a redundant mutex unlock
Fix a mutex_unlock() issue where before copy_from_user() is
not called mutex_locked.

Fixes: 4b1a29a7f5 ("error-injection: Support fault injection framework")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/160570737118.263807.8358435412898356284.stgit@devnote2
2020-11-19 11:58:16 -08:00
Daniel Xu 6fa6d28051 lib/strncpy_from_user.c: Mask out bytes after NUL terminator.
do_strncpy_from_user() may copy some extra bytes after the NUL
terminator into the destination buffer. This usually does not matter for
normal string operations. However, when BPF programs key BPF maps with
strings, this matters a lot.

A BPF program may read strings from user memory by calling the
bpf_probe_read_user_str() helper which eventually calls
do_strncpy_from_user(). The program can then key a map with the
destination buffer. BPF map keys are fixed-width and string-agnostic,
meaning that map keys are treated as a set of bytes.

The issue is when do_strncpy_from_user() overcopies bytes after the NUL
terminator, it can result in seemingly identical strings occupying
multiple slots in a BPF map. This behavior is subtle and totally
unexpected by the user.

This commit masks out the bytes following the NUL while preserving
long-sized stride in the fast path.

Fixes: 6ae08ae3de ("bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers")
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/21efc982b3e9f2f7b0379eed642294caaa0c27a7.1605642949.git.dxu@dxuuu.xyz
2020-11-19 11:56:16 -08:00
Ionela Voinescu fa50e2b452 sched/topology: Condition EAS enablement on FIE support
In order to make accurate predictions across CPUs and for all performance
states, Energy Aware Scheduling (EAS) needs frequency-invariant load
tracking signals.

EAS task placement aims to minimize energy consumption, and does so in
part by limiting the search space to only CPUs with the highest spare
capacity (CPU capacity - CPU utilization) in their performance domain.
Those candidates are the placement choices that will keep frequency at
its lowest possible and therefore save the most energy.

But without frequency invariance, a CPU's utilization is relative to the
CPU's current performance level, and not relative to its maximum
performance level, which determines its capacity. As a result, it will
fail to correctly indicate any potential spare capacity obtained by an
increase in a CPU's performance level. Therefore, a non-invariant
utilization signal would render the EAS task placement logic invalid.

Now that we properly report support for the Frequency Invariance Engine
(FIE) through arch_scale_freq_invariant() for arm and arm64 systems,
while also ensuring a re-evaluation of the EAS use conditions for
possible invariance status change, we can assert this is the case when
initializing EAS. Warn and bail out otherwise.

Suggested-by: Quentin Perret <qperret@google.com>
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201027180713.7642-4-ionela.voinescu@arm.com
2020-11-19 11:25:47 +01:00
Ionela Voinescu 31f6a8c0a4 sched/topology,schedutil: Wrap sched domains rebuild
Add the rebuild_sched_domains_energy() function to wrap the functionality
that rebuilds the scheduling domains if any of the Energy Aware Scheduling
(EAS) initialisation conditions change. This functionality is used when
schedutil is added or removed or when EAS is enabled or disabled
through the sched_energy_aware sysctl.

Therefore, create a single function that is used in both these cases and
that can be later reused.

Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Quentin Perret <qperret@google.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lkml.kernel.org/r/20201027180713.7642-2-ionela.voinescu@arm.com
2020-11-19 11:25:47 +01:00
Dietmar Eggemann 480a6ca2dc sched/uclamp: Allow to reset a task uclamp constraint value
In case the user wants to stop controlling a uclamp constraint value
for a task, use the magic value -1 in sched_util_{min,max} with the
appropriate sched_flags (SCHED_FLAG_UTIL_CLAMP_{MIN,MAX}) to indicate
the reset.

The advantage over the 'additional flag' approach (i.e. introducing
SCHED_FLAG_UTIL_CLAMP_RESET) is that no additional flag has to be
exported via uapi. This avoids the need to document how this new flag
has be used in conjunction with the existing uclamp related flags.

The following subtle issue is fixed as well. When a uclamp constraint
value is set on a !user_defined uclamp_se it is currently first reset
and then set.
Fix this by AND'ing !user_defined with !SCHED_FLAG_UTIL_CLAMP which
stands for the 'sched class change' case.
The related condition 'if (uc_se->user_defined)' moved from
__setscheduler_uclamp() into uclamp_reset().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yun Hsiang <hsiang023167@gmail.com>
Link: https://lkml.kernel.org/r/20201113113454.25868-1-dietmar.eggemann@arm.com
2020-11-19 11:25:47 +01:00
Tal Zussman b19a888c1e sched/core: Fix typos in comments
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201113005156.GA8408@charmander
2020-11-19 11:25:46 +01:00
Valentin Schneider b5b217346d sched/topology: Warn when NUMA diameter > 2
NUMA topologies where the shortest path between some two nodes requires
three or more hops (i.e. diameter > 2) end up being misrepresented in the
scheduler topology structures.

This is currently detected when booting a kernel with CONFIG_SCHED_DEBUG=y
+ sched_debug on the cmdline, although this will only yield a warning about
sched_group spans not matching sched_domain spans:

  ERROR: groups don't span domain->span

Add an explicit warning for that case, triggered regardless of
CONFIG_SCHED_DEBUG, and decorate it with an appropriate comment.

The topology described in the comment can be booted up on QEMU by appending
the following to your usual QEMU incantation:

    -smp cores=4 \
    -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
    -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
    -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
    -numa dist,src=0,dst=3,val=40, -numa dist,src=1,dst=2,val=20, \
    -numa dist,src=1,dst=3,val=30, -numa dist,src=2,dst=3,val=20

A somewhat more realistic topology (6-node mesh) with the same affliction
can be conjured with:

    -smp cores=6 \
    -numa node,cpus=0,nodeid=0 -numa node,cpus=1,nodeid=1, \
    -numa node,cpus=2,nodeid=2, -numa node,cpus=3,nodeid=3, \
    -numa node,cpus=4,nodeid=4, -numa node,cpus=5,nodeid=5, \
    -numa dist,src=0,dst=1,val=20, -numa dist,src=0,dst=2,val=30, \
    -numa dist,src=0,dst=3,val=40, -numa dist,src=0,dst=4,val=30, \
    -numa dist,src=0,dst=5,val=20, \
    -numa dist,src=1,dst=2,val=20, -numa dist,src=1,dst=3,val=30, \
    -numa dist,src=1,dst=4,val=20, -numa dist,src=1,dst=5,val=30, \
    -numa dist,src=2,dst=3,val=20, -numa dist,src=2,dst=4,val=30, \
    -numa dist,src=2,dst=5,val=40, \
    -numa dist,src=3,dst=4,val=20, -numa dist,src=3,dst=5,val=30, \
    -numa dist,src=4,dst=5,val=20

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/lkml/jhjtux5edo2.mognet@arm.com
2020-11-19 11:25:46 +01:00
Daniel Jordan 406100f3da cpuset: fix race between hotplug work and later CPU offline
One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

  BUG: unable to handle page fault for address: 0000607f820054db
  CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
  Workqueue: events cpuset_hotplug_workfn
  RIP: build_sched_domains
  Call Trace:
   partition_sched_domains_locked
   rebuild_sched_domains_locked
   cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only.  This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

  cd $UNIFIED_ROOT
  mkdir cg1
  echo 4-7 > cg1/cpuset.cpus
  echo root > cg1/cpuset.cpus.partition

  # with smt/control reading 'on',
  echo off > /sys/devices/system/cpu/smt/control

RIP maps to

  sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init().  sd_id is calculated earlier in the same function:

  cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
  sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N.  cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check.  Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb9 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112171711.639541-1-daniel.m.jordan@oracle.com
2020-11-19 11:25:45 +01:00
Peter Zijlstra 1293771e43 sched: Fix migration_cpu_stop() WARN
Oleksandr reported hitting the WARN in the 'task_rq(p) != rq' branch
of migration_cpu_stop(). Valentin noted that using cpu_of(rq) in that
case is just plain wrong to begin with, since per the earlier branch
that isn't the actual CPU of the task.

Replace both instances of is_cpu_allowed() by a direct p->cpus_mask
test using task_cpu().

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Debugged-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-11-19 11:25:45 +01:00
Valentin Schneider d707faa64d sched/core: Add missing completion for affine_move_task() waiters
Qian reported that some fuzzer issuing sched_setaffinity() ends up stuck on
a wait_for_completion(). The problematic pattern seems to be:

  affine_move_task()
      // task_running() case
      stop_one_cpu();
      wait_for_completion(&pending->done);

Combined with, on the stopper side:

  migration_cpu_stop()
    // Task moved between unlocks and scheduling the stopper
    task_rq(p) != rq &&
    // task_running() case
    dest_cpu >= 0

    => no complete_all()

This can happen with both PREEMPT and !PREEMPT, although !PREEMPT should
be more likely to see this given the targeted task has a much bigger window
to block and be woken up elsewhere before the stopper runs.

Make migration_cpu_stop() always look at pending affinity requests; signal
their completion if the stopper hits a rq mismatch but the task is
still within its allowed mask. When Migrate-Disable isn't involved, this
matches the previous set_cpus_allowed_ptr() vs migration_cpu_stop()
behaviour.

Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/lkml/8b62fd1ad1b18def27f18e2ee2df3ff5b36d0762.camel@redhat.com
2020-11-19 11:25:45 +01:00
Frederic Weisbecker 6775de4984 context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
schedule_user() was traditionally used by the entry code's tail to
preempt userspace after the call to user_enter(). Indeed the call to
user_enter() used to be performed upon syscall exit slow path which was
right before the last opportunity to schedule() while resuming to
userspace. The context tracking state had to be saved on the task stack
and set back to CONTEXT_KERNEL temporarily in order to safely switch to
another task.

Only a few archs use it now (namely sparc64 and powerpc64) and those
implementing HAVE_CONTEXT_TRACKING_OFFSTACK definetly can't rely on it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-5-frederic@kernel.org
2020-11-19 11:25:42 +01:00
Frederic Weisbecker 9f68b5b74c sched: Detect call to schedule from critical entry code
Detect calls to schedule() between user_enter() and user_exit(). Those
are symptoms of early entry code that either forgot to protect a call
to schedule() inside exception_enter()/exception_exit() or, in the case
of HAVE_CONTEXT_TRACKING_OFFSTACK, enabled interrupts or preemption in
a wrong spot.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-4-frederic@kernel.org
2020-11-19 11:25:42 +01:00
Hui Su aabe19b827
nsproxy: use put_nsproxy() in switch_task_namespaces()
We already have a dedicated helper that handles reference count
checking so stop open-coding the reference count check in
switch_task_namespaces() and use the dedicated put_nsproxy() helper
instead.
Take the change to fix a whitespace issue too.

Signed-off-by: Hui Su <sh_def@163.com>
[christian.brauner@ubuntu.com: expand commit message]
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20201115180054.GA371317@rlk
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-11-19 10:57:02 +01:00
Thomas Gleixner b996544916 tick: Get rid of tick_period
The variable tick_period is initialized to NSEC_PER_TICK / HZ during boot
and never updated again.

If NSEC_PER_TICK is not an integer multiple of HZ this computation is less
accurate than TICK_NSEC which has proper rounding in place.

Aside of the inaccuracy there is no reason for having this variable at
all. It's just a pointless indirection and all usage sites can just use the
TICK_NSEC constant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.766643526@linutronix.de
2020-11-19 10:48:29 +01:00
Yunfeng Ye 896b969e67 tick/sched: Release seqcount before invoking calc_load_global()
calc_load_global() does not need the sequence count protection.

[ tglx: Split it up properly and added comments ]

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.660902274@linutronix.de
2020-11-19 10:48:29 +01:00
Thomas Gleixner 7a35bf2a6a tick/sched: Optimize tick_do_update_jiffies64() further
Now that it's clear that there is always one tick to account, simplify the
calculations some more.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.565663056@linutronix.de
2020-11-19 10:48:29 +01:00
Yunfeng Ye 94ad2e3ced tick/sched: Reduce seqcount held scope in tick_do_update_jiffies64()
If jiffies are up to date already (caller lost the race against another
CPU) there is no point to change the sequence count. Doing that just forces
other CPUs into the seqcount retry loop in tick_nohz_next_event() for
nothing.

Just bail out early.

[ tglx: Rewrote most of it ]

Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.462195901@linutronix.de
2020-11-19 10:48:29 +01:00
Thomas Gleixner 372acbbaa8 tick/sched: Use tick_next_period for lockless quick check
No point in doing calculations.

   tick_next_period = last_jiffies_update + tick_period

Just check whether now is before tick_next_period to figure out whether
jiffies need an update.

Add a comment why the intentional data race in the quick check is safe or
not so safe in a 32bit corner case and why we don't worry about it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.337366695@linutronix.de
2020-11-19 10:48:29 +01:00
Thomas Gleixner c398960cd8 tick: Document protections for tick related data
The protection rules for tick_next_period and last_jiffies_update are blury
at best. Clarify this.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.197713794@linutronix.de
2020-11-19 10:48:28 +01:00
Thomas Gleixner f73f64d568 tick/broadcast: Serialize access to tick_next_period
tick_broadcast_setup_oneshot() accesses tick_next_period twice without any
serialization. This is wrong in two aspects:

  - Reading it twice might make the broadcast data inconsistent if the
    variable is updated concurrently.

  - On 32bit systems the access might see an partial update

Protect it with jiffies_lock. That's safe as none of the callchains leading
up to this function can create a lock ordering violation:

timer interrupt
  run_local_timers()
    hrtimer_run_queues()
      hrtimer_switch_to_hres()
        tick_init_highres()
	  tick_switch_to_oneshot()
	    tick_broadcast_switch_to_oneshot()
or
     tick_check_oneshot_change()
       tick_nohz_switch_to_nohz()
         tick_switch_to_oneshot()
           tick_broadcast_switch_to_oneshot()

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201117132006.061341507@linutronix.de
2020-11-19 10:48:28 +01:00
Dmitrii Banshchikov d055126180 bpf: Add bpf_ktime_get_coarse_ns helper
The helper uses CLOCK_MONOTONIC_COARSE source of time that is less
accurate but more performant.

We have a BPF CGROUP_SKB firewall that supports event logging through
bpf_perf_event_output(). Each event has a timestamp and currently we use
bpf_ktime_get_ns() for it. Use of bpf_ktime_get_coarse_ns() saves ~15-20
ns in time required for event logging.

bpf_ktime_get_ns():
EgressLogByRemoteEndpoint                              113.82ns    8.79M

bpf_ktime_get_coarse_ns():
EgressLogByRemoteEndpoint                               95.40ns   10.48M

Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117184549.257280-1-me@ubique.spb.ru
2020-11-18 23:25:32 +01:00
Hui Su 5c62634fc6 namespace: make timens_on_fork() return nothing
timens_on_fork() always return 0, and maybe not
need to judge the return value in copy_namespaces().

So make timens_on_fork() return nothing and do not
judge its return val in copy_namespaces().

Signed-off-by: Hui Su <sh_def@163.com>
Link: https://lore.kernel.org/r/20201117161750.GA45121@rlk
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-11-18 11:06:47 +01:00
KP Singh 3f6719c7b6 bpf: Add bpf_bprm_opts_set helper
The helper allows modification of certain bits on the linux_binprm
struct starting with the secureexec bit which can be updated using the
BPF_F_BPRM_SECUREEXEC flag.

secureexec can be set by the LSM for privilege gaining executions to set
the AT_SECURE auxv for glibc.  When set, the dynamic linker disables the
use of certain environment variables (like LD_PRELOAD).

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201117232929.2156341-1-kpsingh@chromium.org
2020-11-18 01:36:27 +01:00
Mickaël Salaün fb14528e44 seccomp: Set PF_SUPERPRIV when checking capability
Replace the use of security_capable(current_cred(), ...) with
ns_capable_noaudit() which set PF_SUPERPRIV.

Since commit 98f368e9e2 ("kernel: Add noaudit variant of
ns_capable()"), a new ns_capable_noaudit() helper is available.  Let's
use it!

Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Will Drewry <wad@chromium.org>
Cc: stable@vger.kernel.org
Fixes: e2cfabdfd0 ("seccomp: add system call filtering using BPF")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net
2020-11-17 12:53:22 -08:00
Mickaël Salaün cf23705244 ptrace: Set PF_SUPERPRIV when checking capability
Commit 69f594a389 ("ptrace: do not audit capability check when outputing
/proc/pid/stat") replaced the use of ns_capable() with
has_ns_capability{,_noaudit}() which doesn't set PF_SUPERPRIV.

Commit 6b3ad6649a ("ptrace: reintroduce usage of subjective credentials in
ptrace_has_cap()") replaced has_ns_capability{,_noaudit}() with
security_capable(), which doesn't set PF_SUPERPRIV neither.

Since commit 98f368e9e2 ("kernel: Add noaudit variant of ns_capable()"), a
new ns_capable_noaudit() helper is available.  Let's use it!

As a result, the signature of ptrace_has_cap() is restored to its original one.

Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: stable@vger.kernel.org
Fixes: 6b3ad6649a ("ptrace: reintroduce usage of subjective credentials in ptrace_has_cap()")
Fixes: 69f594a389 ("ptrace: do not audit capability check when outputing /proc/pid/stat")
Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201030123849.770769-2-mic@digikod.net
2020-11-17 12:53:22 -08:00
Linus Torvalds 9dacf44c38 Merge branch 'urgent-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
 "A single commit that fixes a bug that was introduced a couple of merge
  windows ago, but which rather more recently converged to an
  agreed-upon fix. The bug is that interrupts can be incorrectly enabled
  while holding an irq-disabled spinlock. This can of course result in
  self-deadlocks.

  The bug is a bit difficult to trigger. It requires that a preempted
  task be blocking a preemptible-RCU grace period long enough to trigger
  an RCU CPU stall warning. In addition, an interrupt must occur at just
  the right time, and that interrupt's handler must acquire that same
  irq-disabled spinlock. Still, a deadlock is a deadlock.

  Furthermore, we do now have a fix, and that fix survives kernel test
  robot, -next, and rcutorture testing. It has also been verified by
  Sebastian as fixing the bug. Therefore..."

* 'urgent-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled
2020-11-17 10:31:56 -08:00
Andy Shevchenko 5df38ca6af resource: Add test cases for new resource API
Add test cases for newly added resource APIs.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-17 18:06:28 +01:00
Andy Shevchenko 66f4fa32eb resource: Simplify region_intersects() by reducing conditionals
Now we have for 'other' and 'type' variables

other	type	return
  0	  0	REGION_DISJOINT
  0	  x	REGION_INTERSECTS
  x	  0	REGION_DISJOINT
  x	  x	REGION_MIXED

Obviously it's easier to check 'type' for 0 first instead of
currently checked 'other'.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Tested-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-11-17 18:06:28 +01:00
Boqun Feng 43be4388e9 lockdep: Put graph lock/unlock under lock_recursion protection
A warning was hit when running xfstests/generic/068 in a Hyper-V guest:

[...] ------------[ cut here ]------------
[...] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
[...] WARNING: CPU: 2 PID: 1350 at kernel/locking/lockdep.c:5280 check_flags.part.0+0x165/0x170
[...] ...
[...] Workqueue: events pwq_unbound_release_workfn
[...] RIP: 0010:check_flags.part.0+0x165/0x170
[...] ...
[...] Call Trace:
[...]  lock_is_held_type+0x72/0x150
[...]  ? lock_acquire+0x16e/0x4a0
[...]  rcu_read_lock_sched_held+0x3f/0x80
[...]  __send_ipi_one+0x14d/0x1b0
[...]  hv_send_ipi+0x12/0x30
[...]  __pv_queued_spin_unlock_slowpath+0xd1/0x110
[...]  __raw_callee_save___pv_queued_spin_unlock_slowpath+0x11/0x20
[...]  .slowpath+0x9/0xe
[...]  lockdep_unregister_key+0x128/0x180
[...]  pwq_unbound_release_workfn+0xbb/0xf0
[...]  process_one_work+0x227/0x5c0
[...]  worker_thread+0x55/0x3c0
[...]  ? process_one_work+0x5c0/0x5c0
[...]  kthread+0x153/0x170
[...]  ? __kthread_bind_mask+0x60/0x60
[...]  ret_from_fork+0x1f/0x30

The cause of the problem is we have call chain lockdep_unregister_key()
-> <irq disabled by raw_local_irq_save()> lockdep_unlock() ->
arch_spin_unlock() -> __pv_queued_spin_unlock_slowpath() -> pv_kick() ->
__send_ipi_one() -> trace_hyperv_send_ipi_one().

Although this particular warning is triggered because Hyper-V has a
trace point in ipi sending, but in general arch_spin_unlock() may call
another function having a trace point in it, so put the arch_spin_lock()
and arch_spin_unlock() after lock_recursion protection to fix this
problem and avoid similiar problems.

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201113110512.1056501-1-boqun.feng@gmail.com
2020-11-17 13:15:35 +01:00
Juri Lelli 2279f540ea sched/deadline: Fix priority inheritance with multiple scheduling classes
Glenn reported that "an application [he developed produces] a BUG in
deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
PTHREAD_PRIO_INHERIT mutexes.  I believe the bug is triggered when a CFS
task that was boosted by a SCHED_DEADLINE task boosts another CFS task
(nested priority inheritance).

 ------------[ cut here ]------------
 kernel BUG at kernel/sched/deadline.c:1462!
 invalid opcode: 0000 [#1] PREEMPT SMP
 CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
 Hardware name: ...
 RIP: 0010:enqueue_task_dl+0x335/0x910
 Code: ...
 RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
 FS:  00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  ? intel_pstate_update_util_hwp+0x13/0x170
  rt_mutex_setprio+0x1cc/0x4b0
  task_blocks_on_rt_mutex+0x225/0x260
  rt_spin_lock_slowlock_locked+0xab/0x2d0
  rt_spin_lock_slowlock+0x50/0x80
  hrtimer_grab_expiry_lock+0x20/0x30
  hrtimer_cancel+0x13/0x30
  do_nanosleep+0xa0/0x150
  hrtimer_nanosleep+0xe1/0x230
  ? __hrtimer_init_sleeper+0x60/0x60
  __x64_sys_nanosleep+0x8d/0xa0
  do_syscall_64+0x4a/0x100
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fa58b52330d
 ...
 ---[ end trace 0000000000000002 ]—

He also provided a simple reproducer creating the situation below:

 So the execution order of locking steps are the following
 (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
 are mutexes that are enabled * with priority inheritance.)

 Time moves forward as this timeline goes down:

 N1              N2               D1
 |               |                |
 |               |                |
 Lock(M1)        |                |
 |               |                |
 |             Lock(M2)           |
 |               |                |
 |               |              Lock(M2)
 |               |                |
 |             Lock(M1)           |
 |             (!!bug triggered!) |

Daniel reported a similar situation as well, by just letting ksoftirqd
run with DEADLINE (and eventually block on a mutex).

Problem is that boosted entities (Priority Inheritance) use static
DEADLINE parameters of the top priority waiter. However, there might be
cases where top waiter could be a non-DEADLINE entity that is currently
boosted by a DEADLINE entity from a different lock chain (i.e., nested
priority chains involving entities of non-DEADLINE classes). In this
case, top waiter static DEADLINE parameters could be null (initialized
to 0 at fork()) and replenish_dl_entity() would hit a BUG().

Fix this by keeping track of the original donor and using its parameters
when a task is boosted.

Reported-by: Glenn Elliott <glenn@aurora.tech>
Reported-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
2020-11-17 13:15:28 +01:00
Peter Zijlstra ec618b84f6 sched: Fix rq->nr_iowait ordering
schedule()				ttwu()
    deactivate_task();			  if (p->on_rq && ...) // false
					    atomic_dec(&task_rq(p)->nr_iowait);
    if (prev->in_iowait)
      atomic_inc(&rq->nr_iowait);

Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.

Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.

Fixes: c6e7bd7afa ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net
2020-11-17 13:15:28 +01:00
Quentin Perret 8e1ac4299a sched/fair: Fix overutilized update in enqueue_task_fair()
enqueue_task_fair() attempts to skip the overutilized update for new
tasks as their util_avg is not accurate yet. However, the flag we check
to do so is overwritten earlier on in the function, which makes the
condition pretty much a nop.

Fix this by saving the flag early on.

Fixes: 2802bf3cd9 ("sched/fair: Add over-utilization/tipping point indicator")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com
2020-11-17 13:15:27 +01:00
Gabriel Krisman Bertazi 2991552447 entry: Drop usage of TIF flags in the generic syscall code
Now that the flags migration in the common syscall entry code is complete
and the code relies exclusively on thread_info::syscall_work, clean up the
accesses to TI flags in that path.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-10-krisman@collabora.com
2020-11-16 21:53:16 +01:00
Gabriel Krisman Bertazi 785dc4eb7f audit: Migrate to use SYSCALL_WORK flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SYSCALL_AUDIT, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-9-krisman@collabora.com
2020-11-16 21:53:16 +01:00
Gabriel Krisman Bertazi 64eb35f701 ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SYSCALL_EMU, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-8-krisman@collabora.com
2020-11-16 21:53:16 +01:00
Gabriel Krisman Bertazi 64c19ba29b ptrace: Migrate to use SYSCALL_TRACE flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SYSCALL_TRACE, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-7-krisman@collabora.com
2020-11-16 21:53:16 +01:00
Gabriel Krisman Bertazi 524666cb5d tracepoints: Migrate to use SYSCALL_WORK flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SYSCALL_TRACEPOINT, use it in the generic entry code
and convert the code which uses the TIF specific helper functions to use
the new *_syscall_work() helpers which either resolve to the new mode for
users of the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-6-krisman@collabora.com
2020-11-16 21:53:15 +01:00
Gabriel Krisman Bertazi 23d67a5485 seccomp: Migrate to use SYSCALL_WORK flag
On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com
2020-11-16 21:53:15 +01:00
Gabriel Krisman Bertazi b86678cf0f entry: Wire up syscall_work in common entry code
Prepare the common entry code to use the SYSCALL_WORK flags. They will
be defined in subsequent patches for each type of syscall
work. SYSCALL_WORK_ENTRY/EXIT are defined for the transition, as they
will replace the TIF_ equivalent defines.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-4-krisman@collabora.com
2020-11-16 21:53:15 +01:00
Francis Laniel 872f690341 treewide: rename nla_strlcpy to nla_strscpy.
Calls to nla_strlcpy are now replaced by calls to nla_strscpy which is the new
name of this function.

Signed-off-by: Francis Laniel <laniel_francis@privacyrequired.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-16 08:08:54 -08:00
Mauro Carvalho Chehab 8c67d247dc genirq: Fix kernel-doc markups
Some identifiers have different names between their prototypes
and the kernel-doc markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/13a44f4f0c3135e14b16ae8fcce4af1eab27cb5f.1605521731.git.mchehab+huawei@kernel.org
2020-11-16 15:20:54 +01:00
Mauro Carvalho Chehab 66981c37b3 hrtimer: Fix kernel-doc markups
The hrtimer_get_remaining() markup is documenting, instead,
__hrtimer_get_remaining(), as it is placed at the C file.

In order to properly document it, a kernel-doc markup is needed together
with the function prototype. So, add a new one, while preserving the
existing one, just fixing the function name.

The hrtimer_is_queued prototype has a typo: it is using
'=' instead of '-' to split: identifier - description
as required by kernel-doc markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/9dc87808c2fd07b7e050bafcd033c5ef05808fea.1605521731.git.mchehab+huawei@kernel.org
2020-11-16 15:20:01 +01:00
Thomas Gleixner cc947f2b9c timers: Make run_local_timers() static
No users outside of the timer code. Move the caller below this function to
avoid a pointless forward declaration.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-11-16 15:20:01 +01:00
Ira Weiny 78a56e0494 entry: Fix spelling/typo errors in irq entry code
s/reguired/required/
s/Interupts/Interrupts/
s/quiescient/quiescent/
s/assemenbly/assembly/

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201104230157.3378023-1-ira.weiny@intel.com
2020-11-15 23:54:00 +01:00
Alex Shi 6e5a91901c timekeeping: Address parameter documentation issues for various functions
The kernel-doc parser complains:

 kernel/time/timekeeping.c:1543: warning: Function parameter or member
 'ts' not described in 'read_persistent_clock64'

 kernel/time/timekeeping.c:764: warning: Function parameter or member
 'tk' not described in 'timekeeping_forward_now'

 kernel/time/timekeeping.c:1331: warning: Function parameter or member
 'ts' not described in 'timekeeping_inject_offset'

 kernel/time/timekeeping.c:1331: warning: Excess function parameter 'tv'
 description in 'timekeeping_inject_offset'

Add the missing parameter documentations and rename the 'tv' parameter of
timekeeping_inject_offset() to 'ts' so it matches the implemention.

[ tglx: Reworded a few docs and massaged changelog ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-5-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:24 +01:00
Alex Shi 29efc4612a timekeeping: Fix parameter docs of read_persistent_wall_and_boot_offset()
Address the following kernel-doc markup warnings:

 kernel/time/timekeeping.c:1563: warning: Function parameter or member
 'wall_time' not described in 'read_persistent_wall_and_boot_offset'
 kernel/time/timekeeping.c:1563: warning: Function parameter or member
 'boot_offset' not described in 'read_persistent_wall_and_boot_offset'

The parameters are described but miss the leading '@' and the colon after
the parameter names.

[ tglx: Massaged changelog ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-6-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:24 +01:00
Alex Shi f27f7c3f10 timekeeping: Add missing parameter docs for pvclock_gtod_[un]register_notifier()
The kernel-doc parser complains about:
 kernel/time/timekeeping.c:651: warning: Function parameter or member
 'nb' not described in 'pvclock_gtod_register_notifier'
 kernel/time/timekeeping.c:670: warning: Function parameter or member
 'nb' not described in 'pvclock_gtod_unregister_notifier'

Add the missing parameter explanations.

[ tglx: Massaged changelog ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-3-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:24 +01:00
Thomas Gleixner c1ce406e80 timekeeping: Fix up function documentation for the NMI safe accessors
Alex reported the following warning:

 kernel/time/timekeeping.c:464: warning: Function parameter or member
 'tkf' not described in '__ktime_get_fast_ns'

which is not entirely correct because the documented function is
ktime_get_mono_fast_ns() which does not have a parameter, but the
kernel-doc parser looks at the function declaration which follows the
comment and complains about the missing parameter documentation.

Aside of that the documentation for the rest of the NMI safe accessors is
either incomplete or missing.

  - Move the function documentation to the right place
  - Fixup the references and inconsistencies
  - Add the missing documentation for ktime_get_raw_fast_ns()

Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2020-11-15 23:47:24 +01:00
Alex Shi e025b03113 timekeeping: Add missing parameter documentation for update_fast_timekeeper()
Address the following warning:

 kernel/time/timekeeping.c:415: warning: Function parameter or member
 'tkf' not described in 'update_fast_timekeeper'

[ tglx: Remove the bogus ktime_get_mono_fast_ns() part ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-2-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:24 +01:00
Alex Shi 199d280c88 timekeeping: Remove static functions from kernel-doc markup
Various static functions in the timekeeping code have function comments
which pretend to be kernel-doc, but are incomplete and trigger parser
warnings.

As these functions are local to the timekeeping core code there is no need
to expose them via kernel-doc.

Remove the double star kernel-doc marker and remove excess newlines.

[ tglx: Massaged changelog and removed excess newlines ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-4-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:23 +01:00
Alex Shi a0f5a65fa5 time: Add missing colons for parameter documentation of time64_to_tm()
Address these kernel-doc warnings:

 kernel/time/timeconv.c:79: warning: Function parameter or member
 'totalsecs' not described in 'time64_to_tm'
 kernel/time/timeconv.c:79: warning: Function parameter or member
 'offset' not described in 'time64_to_tm'
 kernel/time/timeconv.c:79: warning: Function parameter or member
 'result' not described in 'time64_to_tm'

The parameters are described but lack colons after the parameter name.

[ tglx: Massaged changelog ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/1605252275-63652-1-git-send-email-alex.shi@linux.alibaba.com
2020-11-15 23:47:23 +01:00
Sebastian Andrzej Siewior c725dafc95 timers: Don't block on ->expiry_lock for TIMER_IRQSAFE timers
PREEMPT_RT does not spin and wait until a running timer completes its
callback but instead it blocks on a sleeping lock to prevent a livelock in
the case that the task waiting for the callback completion preempted the
callback.

This cannot be done for timers flagged with TIMER_IRQSAFE. These timers can
be canceled from an interrupt disabled context even on RT kernels.

The expiry callback of such timers is invoked with interrupts disabled so
there is no need to use the expiry lock mechanism because obviously the
callback cannot be preempted even on RT kernels.

Do not use the timer_base::expiry_lock mechanism when waiting for a running
callback to complete if the timer is flagged with TIMER_IRQSAFE.

Also add a lockdep assertion for RT kernels to validate that the expiry
lock mechanism is always invoked in preemptible context.

Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201103190937.hga67rqhvknki3tp@linutronix.de
2020-11-15 20:59:26 +01:00
Helge Deller da88f9b311 timer_list: Use printk format instead of open-coded symbol lookup
Use the "%ps" printk format string to resolve symbol names.

This works on all platforms, including ia64, ppc64 and parisc64 on which
one needs to dereference pointers to function descriptors instead of
function pointers.

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201104163401.GA3984@ls3530.fritz.box
2020-11-15 20:47:14 +01:00
Linus Torvalds 64b609d6a6 A set of fixes for perf:
- A set of commits which reduce the stack usage of various perf event
    handling functions which allocated large data structs on stack causing
    stack overflows in the worst case.
 
  - Use the proper mechanism for detecting soft interrupts in the recursion
    protection.
 
  - Make the resursion protection simpler and more robust.
 
  - Simplify the scheduling of event groups to make the code more robust and
    prepare for fixing the issues vs. scheduling of exclusive event groups.
 
  - Prevent event multiplexing and rotation for exclusive event groups
 
  - Correct the perf event attribute exclusive semantics to take pinned
    events, e.g. the PMU watchdog, into account
 
  - Make the anythread filtering conditional for Intel's generic PMU
    counters as it is not longer guaranteed to be supported on newer
    CPUs. Check the corresponding CPUID leaf to make sure.
 
  - Fixup a duplicate initialization in an array which was probably cause by
    the usual copy & paste - forgot to edit mishap.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xIi0THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofixD/4+4gc8DhOmAkMrN0Z9tiW8ebgMKmb9
 wZRkMr5Osi0GzLJOPZ6SdY6jd0A3rMN/sW6P1DT6pDtcty4bKFoW5VZBuUDIAhel
 BC4C93L3y1En/GEZu1GTy3LvsBwLBQTOoY4goDjbdAbk60S/0RTHOGyQsRsOQFe6
 fVs3iXozAFuaR6I6N3dlxuJAE51zvr8MyBWaUoByNDB//1+lLNW+JfClaAOG1oXx
 qZIg/niatBVGzSGgKNRUyh3g8G1HJtabsA/NZ4PH8ZHuYABfmj4lmmUPR77ICLfV
 wMITEBG7eaktB8EqM9hvaoOZLA5kpXHO2JbCFSs4c4x11mlC8g7QMV3poCw33YoN
 a5TmT1A3muri1riy1/Ee9lXACOq7/tf2+Xfn9o6dvDdBwd6s5pzlhLGR8gILp2lF
 2bcg3IwYvHT/Kiurb/WGNpbCqQIPJpcUcfs3tNBCCtKegahUQNnGjxN3NVo9RCit
 zfL6xIJ8eZiYnsxXx4NKm744AukWiql3aRNgRkOdBP5WC68xt6VLcxG1YZKUoDhy
 jRSOCD/DuPSMSvAAgN7S8OWlPsKWBxVxxWYV+K8FpwhgzbQ3WbS3UDiYkhgjeOxu
 OlM692oWpllKvQWlvYthr2Be6oPCRRi1vvADNNbTKzgHk5i61bwympsGl1EZx3Pz
 2ROp7NJFRESnqw==
 =FzCf
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Thomas Gleixner:
 "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
      event handling functions which allocated large data structs on
      stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
      recursion protection

    - Make the resursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
      robust and prepare for fixing the issues vs. scheduling of
      exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
      pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
      counters as it is not longer guaranteed to be supported on newer
      CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
      caused by the usual 'copy & paste - forgot to edit' mishap"

* tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel/uncore: Fix Add BW copypasta
  perf/x86/intel: Make anythread filter support conditional
  perf: Tweak perf_event_attr::exclusive semantics
  perf: Fix event multiplexing for exclusive groups
  perf: Simplify group_sched_in()
  perf: Simplify group_sched_out()
  perf/x86: Make dummy_iregs static
  perf/arch: Remove perf_sample_data::regs_user_copy
  perf: Optimize get_recursion_context()
  perf: Fix get_recursion_context()
  perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
  perf: Reduce stack usage of perf_output_begin()
2020-11-15 09:46:36 -08:00
Linus Torvalds d0a37fd57f A set of scheduler fixes:
- Address a load balancer regression by making the load balancer use the
    same logic as the wakeup path to spread tasks in the LLC domain.
 
  - Prefer the CPU on which a task run last over the local CPU in the fast
    wakeup path for asymmetric CPU capacity systems to align with the
    symmetric case. This ensures more locality and prevents massive
    migration overhead on those asymetric systems
 
  - Fix a memory corruption bug in the scheduler debug code caused by
    handing a modified buffer pointer to kfree().
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xJIoTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYofyGD/9rUnLlC1h7jEufVa4yPG94DcEqiXT7
 8B/zNRKnOmqQePCYUm+DS8njSFqpF9VjR+5zpos3bgYqwn7DyfV+hpxbbgS9NDh/
 qRg5gxhTrR4uMyZN62Fex5JS4bP8mKO7oc0usgV2Ytsg3e4H+9DqYhuaA5GrJAxC
 J3d1Hv/YBW2Uo+RZpB20aaJr0srN7bswTtPMxeeqo8q3Qh4pFcI+rmA4WphVAgHF
 jQWaNP4YVTgNjqxy7nBp7zFHlSdRbLohldZFtueYmRo1mjmkyQ34Cg7etfBvN1Uf
 iVYZLaInr0YPr0qR4FrQ3yI8ln/HESxshs0ARzMReYVT71mV//o5wftE18uCULQB
 rRu9vYz+LBVhkdgx118jJdNJqyqk6Ca6h9ZLqyBKuckj9a39289bwWiS6D/6W51p
 gurq58YTb2lRzyCnOVEULXehYRJkDI8EToiWppRVm9gy43OFPNox7n6TvNLW6BLS
 I8msTVdqDYXXj4U1o4Mf9K5LBKlda+ARuBu87r7kH1BJLxXHnOHcEkmeN8O9k7eu
 jdWfeDzDDjBjt/TU+X4f4RNjudUZrSPQrrESE5+XhfM4CwqcPXa2M/dGtPekW/ED
 9IqxPvwkau+0Ym6gkuanfnmda+JVR/nLvZV0uFuUGd+2xMcRemZbZE6hTUiYvYPY
 CAHpOhmeakbr6w==
 =wFcU
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Thomas Gleixner:
 "A set of scheduler fixes:

   - Address a load balancer regression by making the load balancer use
     the same logic as the wakeup path to spread tasks in the LLC domain

   - Prefer the CPU on which a task run last over the local CPU in the
     fast wakeup path for asymmetric CPU capacity systems to align with
     the symmetric case. This ensures more locality and prevents massive
     migration overhead on those asymetric systems

   - Fix a memory corruption bug in the scheduler debug code caused by
     handing a modified buffer pointer to kfree()"

* tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix memory corruption caused by multiple small reads of flags
  sched/fair: Prefer prev cpu in asymmetric wakeup path
  sched/fair: Ensure tasks spreading in LLC during LB
2020-11-15 09:39:35 -08:00
Linus Torvalds 259c2fbef8 Two fixes for the locking subsystem:
- Prevent an unconditional interrupt enable in a futex helper function
     which can be called from contexts which expect interrupts to stay
     disabled across the call.
 
   - Don't modify lockdep chain keys in the validation process as that
     causes chain inconsistency.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl+xF7cTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoR9dD/0X7EmbGdTIL9ZgEQfch8fWY79uHqQ7
 EXgVQb26KuWDzRutgk0qeSixQTYfhzoT2nPRGNpAQtyYYir0p2fXG2kstJEQZQzq
 toK5VsL11NJKVhlO/1y5+RcufsgfjTqOpqygbMm1gz+7ejfe7kfJZUAEMwbWzZRn
 a8HZ/VTQb/Z/F1xv+PuACkCp79ezzosL4hiN5QG0FEyiX47Pf8HuXTHKAD3SJgnc
 6ZvCkDkCHOv3jGmQ68sXBQ2m/ciYnDs1D8J/SD9zmggLFs8+R0LKCNxI46HfgRDV
 3oqx7OivazDvBmNXlSCFQQG+saIgRlWuVPV8PVTD3Ihmx25DzXhreSyxjyRhl3uL
 WN7C3ztk6lJv0B/BbtpFceobXjE7IN71CjDIFCwii20dn5UTrgLaJwnW1YrX6qqb
 +iz3cJs4bNLD1brGAlx8lqZxZ2omyXcPNQOi+vSkdTy8C/OYmypC1xusWSBpBtkp
 1+V1uoYFtJWsBzKfmbXAPSSiw9ppIVe/w3J/LrCcFv3CAEaDlCkN0klOF3/ULsga
 d+hMEUKagIqRXNeBdoEfY78LzSfIqkovy3rLNnWATu8fS2IdoRiwhliRxMCU9Ceh
 4WU1F4QAqu42cd0zYVWnVIfDVP3Qx/ENl8/Jd0m7Z8jmmjhOQo5XO7o5PxP8uP3U
 NvnoT5d5D2KysA==
 =XuF7
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fixes from Thomas Gleixner:
 "Two fixes for the locking subsystem:

   - Prevent an unconditional interrupt enable in a futex helper
     function which can be called from contexts which expect interrupts
     to stay disabled across the call

   - Don't modify lockdep chain keys in the validation process as that
     causes chain inconsistency"

* tag 'locking-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Avoid to modify chain keys in validate_chain()
  futex: Don't enable IRQs unconditionally in put_pi_state()
2020-11-15 09:25:43 -08:00
Thomas Gleixner e906a546bd genirq/irqdomain: Make irq_domain_disassociate() static
No users outside of the core code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87a6vja7mb.fsf@nanos.tec.linutronix.de
2020-11-15 12:01:11 +01:00
Thomas Gleixner f296dcd629 genirq: Remove GENERIC_IRQ_LEGACY_ALLOC_HWIRQ
Commit bb9d812643 ("arch: remove tile port") removed the last user of
this cruft two years ago...

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87eekvac06.fsf@nanos.tec.linutronix.de
2020-11-14 22:39:00 +01:00