linux/mm
Vlastimil Babka 666716fd26 mm, slub: stop freeing kmem_cache_node structures on node offline
Patch series "mm, slab, slub: remove cpu and memory hotplug locks".

Some related work caused me to look at how we use get/put_online_mems()
and get/put_online_cpus() during kmem cache
creation/destruction/shrinking, and realize that it should be actually
safe to remove all of that with rather small effort (as e.g.  Michal Hocko
suspected in some of the past discussions already).  This has the benefit
of avoiding rather heavy locks that have caused locking order issues already
in the past.  So this is the result: Patches 2 and 3 remove memory hotplug
and cpu hotplug locking, respectively.  Patch 1 is due to the realization that
in fact some races exist despite the locks (even if not removed), but the
most sane solution is not to introduce more of them, but rather accept
some wasted memory in scenarios that should be rare anyway (full memory
hot remove), as we do the same in other contexts already.

This patch (of 3):

Commit e4f8e513c3 ("mm/slub: fix a deadlock in show_slab_objects()") has
fixed a problematic locking order by removing the memory hotplug lock
get/put_online_mems() from show_slab_objects().  During the discussion, it
was argued [1] that this is OK, because existing slabs on the node would
prevent a hotremove from proceeding.

That's true, but per-node kmem_cache_node structures are not necessarily
allocated on the same node and may exist even without actual slab pages on
the same node.  Any path that uses get_node() directly or via
for_each_kmem_cache_node() (such as show_slab_objects()) can race with
freeing of kmem_cache_node even with the !NULL check, resulting in
use-after-free.
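
The window described above can be illustrated with a small userspace model.
This is only a sketch, not the SLUB code: the struct layouts, MAX_NODES,
old_offline(), reader_races_with_offline() and the "freed" flag are
hypothetical stand-ins, and the flag replaces a real free so the example stays
well-defined.  It shows a reader passing the !NULL check and still ending up
with a pointer to an object that the offline path has released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 2	/* hypothetical; the kernel sizes this differently */

/* Simplified stand-ins for the kernel structures, not the real layouts. */
struct kmem_cache_node {
	unsigned long nr_partial;
	bool freed;	/* model only: marks what a real free would release */
};

struct kmem_cache {
	struct kmem_cache_node *node[MAX_NODES];
};

/* Lockless lookup in the spirit of get_node(): no lock, no refcount. */
static struct kmem_cache_node *get_node(struct kmem_cache *s, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return s->node[nid];
}

/* Model of the old offline behaviour: unpublish the pointer, then free. */
static void old_offline(struct kmem_cache *s, int nid)
{
	struct kmem_cache_node *n = get_node(s, nid);

	s->node[nid] = NULL;
	if (n)
		n->freed = true;	/* a real free would happen here */
}

/*
 * Replays one bad interleaving sequentially.  Returns true if the reader
 * is left holding a pointer to a freed object despite the !NULL check.
 */
static bool reader_races_with_offline(void)
{
	static struct kmem_cache_node n1 = { .nr_partial = 3 };
	struct kmem_cache s = { .node = { NULL, &n1 } };
	struct kmem_cache_node *n;

	n = get_node(&s, 1);	/* reader: lookup */
	if (!n)			/* reader: !NULL check passes */
		return false;
	old_offline(&s, 1);	/* hotplug path runs in between */
	return n->freed;	/* reader would now touch freed memory */
}
```

The !NULL check only validates the pointer at the moment it is read; nothing
keeps the object alive afterwards unless the reader holds a lock that the
freeing path also takes.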

To that end, commit e4f8e513c3 argues in a comment that:

 * We don't really need mem_hotplug_lock (to hold off
 * slab_mem_going_offline_callback) here because slab's memory hot
 * unplug code doesn't destroy the kmem_cache->node[] data.

While it's true that slab_mem_going_offline_callback() doesn't free the
kmem_cache_node, the later callback slab_mem_offline_callback() actually
does, so the race and use-after-free exist.  Not just for
show_slab_objects() after commit e4f8e513c3, but also many other places
that are not under slab_mutex.  And adding slab_mutex locking or other
synchronization to SLUB paths such as get_any_partial() would be bad for
performance and error-prone.

The easiest solution is therefore to make the abovementioned comment true
and stop freeing the kmem_cache_node structures, accepting some wasted
memory in the full memory node removal scenario.  Analogously, we also
don't free hotremoved pgdat as mentioned in [1], nor the similar per-node
structures in SLAB.  Importantly this approach will not block the
hotremove, as generally such nodes should be movable in order for the
hotremove to succeed in the first place, and thus the GFP_KERNEL allocated
kmem_cache_node will come from elsewhere.
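
A minimal userspace sketch of the "stop freeing" idea follows.  It is not the
patch itself: node_online(), node_offline(), MAX_NODES and the struct layouts
are hypothetical names for illustration.  Reusing an already existing
structure when the node comes online again is an assumption about what
follows naturally from never freeing it; the text above does not spell it
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 2	/* hypothetical; the kernel sizes this differently */

/* Simplified stand-ins for the kernel structures, not the real layouts. */
struct kmem_cache_node {
	unsigned long nr_partial;
};

struct kmem_cache {
	struct kmem_cache_node *node[MAX_NODES];
};

/*
 * Hypothetical online hook: allocate the per-node structure only if an
 * earlier online/offline cycle has not already left one behind.
 * Returns 0 on success, -1 on allocation failure.
 */
static int node_online(struct kmem_cache *s, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (s->node[nid])
		return 0;	/* reuse the structure kept across offline */
	s->node[nid] = calloc(1, sizeof(*s->node[nid]));
	return s->node[nid] ? 0 : -1;
}

/*
 * Hypothetical offline hook: deliberately leaves s->node[nid] allocated,
 * so lockless readers can never observe a freed structure.
 */
static void node_offline(struct kmem_cache *s, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	(void)s;	/* nothing to unpublish or free */
}
```

The cost is one kmem_cache_node per cache for each node that has been fully
removed, which is the wasted memory accepted above in exchange for not adding
synchronization to hot paths.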

[1] https://lore.kernel.org/linux-mm/20190924151147.GB23050@dhcp22.suse.cz/

Link: https://lkml.kernel.org/r/20210113131634.3671-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20210113131634.3671-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Qian Cai <cai@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-24 13:38:27 -08:00
kasan kasan: fix stack traces dependency for HW_TAGS 2021-02-09 17:26:44 -08:00
backing-dev.c mm:backing-dev: use sysfs_emit in macro defining functions 2020-12-15 12:13:47 -08:00
balloon_compaction.c
cleancache.c
cma.c mm: cma: improve pr_debug log in cma_release() 2020-12-15 12:13:46 -08:00
cma.h mm: cma: use CMA_MAX_NAME to define the length of cma name array 2020-09-01 09:19:43 +02:00
cma_debug.c debugfs: make sure we can remove u32_array files cleanly 2020-07-10 13:54:00 -07:00
compaction.c mm, compaction: move high_pfn to the for loop scope 2021-02-05 11:03:47 -08:00
debug.c mm: memcontrol: Use helpers to read page's memcg data 2020-12-02 18:28:05 -08:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: avoid doing memory allocation with pgtable_t mapped. 2020-10-16 11:11:14 -07:00
dmapool.c mm/dmapool.c: replace hard coded function name with __func__ 2020-10-13 18:38:32 -07:00
early_ioremap.c mm/early_ioremap.c: use %pa to print resource_size_t variables 2020-01-31 10:30:38 -08:00
fadvise.c mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED 2020-10-13 18:38:29 -07:00
failslab.c
filemap.c arm64 updates for 5.12 2021-02-21 13:08:42 -08:00
frontswap.c mm/frontswap: mark various intentional data races 2020-08-14 19:56:56 -07:00
gup.c Merge branch 'akpm' (patches from Andrew) 2020-12-15 12:53:37 -08:00
gup_test.c mm/gup_test.c: mark gup_test_init as __init function 2020-12-15 12:13:38 -08:00
gup_test.h selftests/vm: gup_test: introduce the dump_pages() sub-test 2020-12-15 12:13:38 -08:00
highmem.c mm/highmem: prepare for overriding set_pte_at() 2021-01-24 10:34:52 -08:00
hmm.c mm: do page fault accounting in handle_mm_fault 2020-08-12 10:58:02 -07:00
huge_memory.c mm: thp: fix MADV_REMOVE deadlock on shmem THP 2021-02-05 11:03:47 -08:00
hugetlb.c These changes fix MM (soft-)dirty bit management in the procfs code & clean up the API. 2021-02-21 12:19:56 -08:00
hugetlb_cgroup.c hugetlb_cgroup: fix offline of hugetlb cgroup with reservations 2020-12-06 10:19:07 -08:00
hwpoison-inject.c mm,hwpoison-inject: don't pin for hwpoison_filter 2020-10-16 11:11:16 -07:00
init-mm.c mm/gup: prevent gup_fast from racing with COW during fork 2020-12-15 12:13:39 -08:00
internal.h mm, page_alloc: disable pcplists during memory offline 2020-12-15 12:13:43 -08:00
interval_tree.c
ioremap.c mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
Kconfig media: videobuf2: Move frame_vector into media subsystem 2021-01-12 14:15:31 +01:00
Kconfig.debug mm, page_poison: remove CONFIG_PAGE_POISONING_ZERO 2020-12-15 12:13:46 -08:00
khugepaged.c mm: Avoid modifying vmf.address in __collapse_huge_page_swapin() 2021-01-21 12:50:18 +00:00
kmemleak.c mm/kmemleak: rely on rcu for task stack scanning 2020-10-13 18:38:27 -07:00
ksm.c mm: cleanup kstrto*() usage 2020-12-15 12:13:47 -08:00
list_lru.c mm: list_lru: set shrinker map bit when child nr_items is not zero 2020-12-06 10:19:07 -08:00
maccess.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
madvise.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
Makefile media: videobuf2: Move frame_vector into media subsystem 2021-01-12 14:15:31 +01:00
mapping_dirty_helpers.c mm/mapping_dirty_helpers: enhance the kernel-doc markups 2020-12-15 12:13:41 -08:00
memblock.c memblock: remove return value of memblock_free_all() 2021-02-22 13:01:23 -08:00
memcontrol.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
memfd.c
memory-failure.c mm: fix page reference leak in soft_offline_page() 2021-01-24 10:34:52 -08:00
memory.c Fixes around VM_FPNMAP and follow_pfn 2021-02-22 17:45:02 -08:00
memory_hotplug.c mm: memmap defer init doesn't work as expected 2020-12-29 15:36:49 -08:00
mempolicy.c mm: migrate: initialize err in do_migrate_pages 2021-01-12 18:12:54 -08:00
mempool.c kasan, mm: rename kasan_poison_kfree 2020-12-22 12:55:09 -08:00
memremap.c mm/mremap_pages: fix static key devmap_managed_key updates 2020-11-02 12:14:18 -08:00
memtest.c
migrate.c mm: migrate: do not migrate HugeTLB page whose refcount is one 2021-02-05 11:03:47 -08:00
mincore.c inode: make init and permission helpers idmapped mount aware 2021-01-24 14:27:16 +01:00
mlock.c mm/lru: introduce relock_page_lruvec() 2020-12-15 14:48:04 -08:00
mm_init.c mm: fix fall-through warnings for Clang 2020-12-15 12:13:47 -08:00
mmap.c tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() 2021-01-29 20:02:29 +01:00
mmap_lock.c mm: mmap_lock: add tracepoints around lock acquisition 2020-12-15 12:13:41 -08:00
mmu_gather.c tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() 2021-01-29 20:02:29 +01:00
mmu_notifier.c mm: track mmu notifiers in fs_reclaim_acquire/release 2020-12-15 12:13:41 -08:00
mmzone.c mm/lru: replace pgdat lru_lock with lruvec lock 2020-12-15 14:48:04 -08:00
mprotect.c mm: Add 'mprotect' hook to struct vm_operations_struct 2020-11-17 14:36:14 +01:00
mremap.c This pull request contains the following changes for UML: 2021-02-21 13:53:00 -08:00
msync.c mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
nommu.c mm/nommu: Fix return type of filemap_map_pages() 2021-01-28 14:10:31 +00:00
oom_kill.c tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu() 2021-01-29 20:02:29 +01:00
page-writeback.c mm: make wait_on_page_writeback() wait for multiple pending writebacks 2021-01-05 11:33:00 -08:00
page_alloc.c mm: page_frag: Introduce page_frag_alloc_align() 2021-02-06 11:57:28 -08:00
page_counter.c mm/page_counter: use page_counter_read in page_counter_set_max 2020-12-15 12:13:40 -08:00
page_ext.c mm: fix some spelling mistakes in comments 2020-12-15 22:46:19 -08:00
page_idle.c mm: page_idle_get_page() does not need lru_lock 2020-12-15 14:48:03 -08:00
page_io.c mm: remove get_swap_bio 2021-01-27 09:51:49 -07:00
page_isolation.c mm/page_isolation: do not isolate the max order page 2020-12-15 12:13:45 -08:00
page_owner.c mm/page_owner: record timestamp and pid 2020-12-15 12:13:38 -08:00
page_poison.c kasan, mm: reset tags when accessing metadata 2020-12-22 12:55:08 -08:00
page_reporting.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
page_reporting.h mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
page_vma_mapped.c mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pte 2020-12-15 12:13:41 -08:00
pagewalk.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
percpu-internal.h mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-km.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-stats.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu-vm.c mm: memcg/percpu: account percpu memory to memory cgroups 2020-08-12 10:57:55 -07:00
percpu.c percpu: fix clang modpost section mismatch 2021-02-14 18:15:15 +00:00
pgalloc-track.h mm: move p?d_alloc_track to separate header file 2020-08-07 11:33:26 -07:00
pgtable-generic.c mm: introduce include/linux/pgtable.h 2020-06-09 09:39:13 -07:00
process_vm_access.c mm/process_vm_access.c: include compat.h 2021-01-12 18:12:54 -08:00
ptdump.c kasan, arm64: expand CONFIG_KASAN checks 2020-12-22 12:55:08 -08:00
readahead.c mm: use limited read-ahead to satisfy read 2020-10-17 13:49:08 -06:00
rmap.c mm/lru: revise the comments of lru_lock 2020-12-15 14:48:04 -08:00
rodata_test.c mm/rodata_test.c: fix missing function declaration 2020-08-21 09:52:53 -07:00
shmem.c idmapped-mounts-v5.12 2021-02-23 13:39:45 -08:00
shuffle.c mm: rename page_order() to buddy_order() 2020-10-16 11:11:19 -07:00
shuffle.h mm/shuffle: remove dynamic reconfiguration 2020-08-07 11:33:29 -07:00
slab.c mm/slab: minor coding style tweaks 2021-02-24 13:38:27 -08:00
slab.h mm/sl?b.c: remove ctor argument from kmem_cache_flags 2021-02-24 13:38:27 -08:00
slab_common.c mm/sl?b.c: remove ctor argument from kmem_cache_flags 2021-02-24 13:38:27 -08:00
slob.c mm, tracing: record slab name for kmem_cache_free() 2021-02-24 13:38:26 -08:00
slub.c mm, slub: stop freeing kmem_cache_node structures on node offline 2021-02-24 13:38:27 -08:00
sparse-vmemmap.c mm/sparse: only sub-section aligned range would be populated 2020-08-07 11:33:27 -07:00
sparse.c mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG 2020-10-16 11:11:18 -07:00
swap.c mm/lru: introduce relock_page_lruvec() 2020-12-15 14:48:04 -08:00
swap_cgroup.c mm: memcontrol: make swap tracking an integral part of memory control 2020-06-03 20:09:48 -07:00
swap_slots.c mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache() 2020-10-13 18:38:30 -07:00
swap_state.c mm: use sysfs_emit for struct kobject * uses 2020-12-15 12:13:47 -08:00
swapfile.c arm64 updates for 5.12 2021-02-21 13:08:42 -08:00
truncate.c mm: fix kernel-doc markups 2020-12-15 12:13:47 -08:00
usercopy.c mm/usercopy.c: delete duplicated word 2020-08-12 10:57:58 -07:00
userfaultfd.c mm/vmscan: protect the workingset on anonymous LRU 2020-08-12 10:57:55 -07:00
util.c mm: Make mem_dump_obj() handle vmalloc() memory 2021-01-22 15:24:04 -08:00
vmacache.c kernel: better document the use_mm/unuse_mm API contract 2020-06-10 19:14:18 -07:00
vmalloc.c Merge branch 'for-mingo-rcu' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu 2021-02-12 12:56:55 +01:00
vmpressure.c mm: vmpressure: use mem_cgroup_is_root API 2020-04-02 09:35:31 -07:00
vmscan.c mm: don't put pinned pages into the swap cache 2021-01-17 12:08:04 -08:00
vmstat.c arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL 2020-12-15 12:13:42 -08:00
workingset.c Merge branch 'akpm' (patches from Andrew) 2020-12-15 14:55:10 -08:00
z3fold.c z3fold: remove preempt disabled sections for RT 2020-12-15 12:13:45 -08:00
zbud.c mm/zbud: remove redundant initialization 2020-10-13 18:38:34 -07:00
zpool.c mm/zpool.c: delete duplicated word and fix grammar 2020-08-12 10:57:58 -07:00
zsmalloc.c mm/zsmalloc.c: rework the list_add code in insert_zspage() 2020-12-15 12:13:46 -08:00
zswap.c mm/zswap: move to use crypto_acomp API for hardware acceleration 2020-12-15 12:13:46 -08:00