linux/mm
Michal Hocko 72f0184c8a mm, memcg: remove hotplug locking from try_charge
The following lockdep splat has been noticed during LTP testing

  ======================================================
  WARNING: possible circular locking dependency detected
  4.13.0-rc3-next-20170807 #12 Not tainted
  ------------------------------------------------------
  a.out/4771 is trying to acquire lock:
   (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff812b4668>] drain_all_stock.part.35+0x18/0x140

  but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&mm->mmap_sem){++++++}:
         lock_acquire+0xc9/0x230
         __might_fault+0x70/0xa0
         _copy_to_user+0x23/0x70
         filldir+0xa7/0x110
         xfs_dir2_sf_getdents.isra.10+0x20c/0x2c0 [xfs]
         xfs_readdir+0x1fa/0x2c0 [xfs]
         xfs_file_readdir+0x30/0x40 [xfs]
         iterate_dir+0x17a/0x1a0
         SyS_getdents+0xb0/0x160
         entry_SYSCALL_64_fastpath+0x1f/0xbe

  -> #2 (&type->i_mutex_dir_key#3){++++++}:
         lock_acquire+0xc9/0x230
         down_read+0x51/0xb0
         lookup_slow+0xde/0x210
         walk_component+0x160/0x250
         link_path_walk+0x1a6/0x610
         path_openat+0xe4/0xd50
         do_filp_open+0x91/0x100
         file_open_name+0xf5/0x130
         filp_open+0x33/0x50
         kernel_read_file_from_path+0x39/0x80
         _request_firmware+0x39f/0x880
         request_firmware_direct+0x37/0x50
         request_microcode_fw+0x64/0xe0
         reload_store+0xf7/0x180
         dev_attr_store+0x18/0x30
         sysfs_kf_write+0x44/0x60
         kernfs_fop_write+0x113/0x1a0
         __vfs_write+0x37/0x170
         vfs_write+0xc7/0x1c0
         SyS_write+0x58/0xc0
         do_syscall_64+0x6c/0x1f0
         return_from_SYSCALL_64+0x0/0x7a

  -> #1 (microcode_mutex){+.+.+.}:
         lock_acquire+0xc9/0x230
         __mutex_lock+0x88/0x960
         mutex_lock_nested+0x1b/0x20
         microcode_init+0xbb/0x208
         do_one_initcall+0x51/0x1a9
         kernel_init_freeable+0x208/0x2a7
         kernel_init+0xe/0x104
         ret_from_fork+0x2a/0x40

  -> #0 (cpu_hotplug_lock.rw_sem){++++++}:
         __lock_acquire+0x153c/0x1550
         lock_acquire+0xc9/0x230
         cpus_read_lock+0x4b/0x90
         drain_all_stock.part.35+0x18/0x140
         try_charge+0x3ab/0x6e0
         mem_cgroup_try_charge+0x7f/0x2c0
         shmem_getpage_gfp+0x25f/0x1050
         shmem_fault+0x96/0x200
         __do_fault+0x1e/0xa0
         __handle_mm_fault+0x9c3/0xe00
         handle_mm_fault+0x16e/0x380
         __do_page_fault+0x24a/0x530
         do_page_fault+0x30/0x80
         page_fault+0x28/0x30

  other info that might help us debug this:

  Chain exists of:
    cpu_hotplug_lock.rw_sem --> &type->i_mutex_dir_key#3 --> &mm->mmap_sem

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&mm->mmap_sem);
                                 lock(&type->i_mutex_dir_key#3);
                                 lock(&mm->mmap_sem);
    lock(cpu_hotplug_lock.rw_sem);

   *** DEADLOCK ***

  2 locks held by a.out/4771:
   #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8106eb35>] __do_page_fault+0x175/0x530
   #1:  (percpu_charge_mutex){+.+...}, at: [<ffffffff812b4c97>] try_charge+0x397/0x6e0

The problem is very similar to the one fixed by commit a459eeb7b8
("mm, page_alloc: do not depend on cpu hotplug locks inside the
allocator").  We are taking hotplug locks while we can be sitting on top
of basically arbitrary locks.  This just calls for problems.

We can get rid of {get,put}_online_cpus, fortunately.  We do not have to
be worried about races with memory hotplug because drain_local_stock,
which is called from both the WQ draining and the memory hotplug
contexts, is always operating on the local cpu stock with IRQs disabled.

The only thing to be careful about is that the target memcg doesn't
vanish while we are still in drain_all_stock so take a reference on it.

Link: http://lkml.kernel.org/r/20170913090023.28322-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Tested-by: Artem Savkov <asavkov@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:24 -07:00
..
kasan
backing-dev.c mm/backing-dev.c: fix an error handling path in 'cgwb_create()' 2017-09-11 14:16:44 -06:00
balloon_compaction.c mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY 2017-09-08 18:26:46 -07:00
bootmem.c
cleancache.c
cma.c
cma.h
cma_debug.c
compaction.c
debug.c
debug_page_ref.c
dmapool.c
early_ioremap.c
fadvise.c mm: fadvise: avoid fadvise for fs without backing device 2017-09-08 18:26:47 -07:00
failslab.c
filemap.c fs: Fix page cache inconsistency when mixing buffered and AIO DIO 2017-09-25 08:56:05 -06:00
frame_vector.c
frontswap.c
gup.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
highmem.c
hmm.c mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
huge_memory.c mm: soft-dirty: keep soft-dirty bits over thp migration 2017-09-08 18:26:45 -07:00
hugetlb.c
hugetlb_cgroup.c
hwpoison-inject.c
init-mm.c
internal.h
interval_tree.c lib/interval_tree: fast overlap detection 2017-09-08 18:26:49 -07:00
Kconfig mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
Kconfig.debug
khugepaged.c
kmemcheck.c
kmemleak-test.c
kmemleak.c
ksm.c ksm: fix unlocked iteration over vmas in cmp_and_merge_page() 2017-10-03 17:54:23 -07:00
list_lru.c
maccess.c
madvise.c mm, hugetlb, soft_offline: save compound page order before page migration 2017-10-03 17:54:24 -07:00
Makefile mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
memblock.c
memcontrol.c mm, memcg: remove hotplug locking from try_charge 2017-10-03 17:54:24 -07:00
memory-failure.c
memory.c lib/interval_tree: fast overlap detection 2017-09-08 18:26:49 -07:00
memory_hotplug.c mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory 2017-09-08 18:26:46 -07:00
mempolicy.c mm/mempolicy.c: remove BUG_ON() checks for VMA inside mpol_misplaced() 2017-09-08 18:26:47 -07:00
mempool.c
memtest.c
migrate.c mm/hmm: avoid bloating arch that do not make use of HMM 2017-09-08 18:26:46 -07:00
mincore.c
mlock.c mm/mlock.c: use page_zone() instead of page_zone_id() 2017-09-08 18:26:47 -07:00
mm_init.c
mmap.c lib/interval_tree: fast overlap detection 2017-09-08 18:26:49 -07:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory 2017-09-08 18:26:46 -07:00
mremap.c mm: thp: check pmd migration entry in common path 2017-09-08 18:26:45 -07:00
msync.c
nobootmem.c
nommu.c Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-09-14 18:13:32 -07:00
oom_kill.c mm, oom_reaper: skip mm structs with mmu notifiers 2017-10-03 17:54:24 -07:00
page-writeback.c
page_alloc.c mm/page_alloc.c: apply gfp_allowed_mask before the first allocation attempt 2017-09-08 18:26:47 -07:00
page_counter.c
page_ext.c
page_idle.c
page_io.c
page_isolation.c
page_owner.c mm, page_owner: skip unnecessary stack_trace entries 2017-09-13 18:53:16 -07:00
page_poison.c
page_vma_mapped.c mm/migrate: support un-addressable ZONE_DEVICE page in migration 2017-09-08 18:26:46 -07:00
pagewalk.c
percpu-internal.h
percpu-km.c
percpu-stats.c percpu: fix starting offset for chunk statistics traversal 2017-09-27 14:45:57 -07:00
percpu-vm.c
percpu.c percpu: fix iteration to prevent skipping over block 2017-09-28 07:39:27 -07:00
pgtable-generic.c mm: thp: enable thp migration in generic path 2017-09-08 18:26:45 -07:00
process_vm_access.c
quicklist.c
readahead.c
rmap.c lib/interval_tree: fast overlap detection 2017-09-08 18:26:49 -07:00
rodata_test.c
shmem.c mm: treewide: remove GFP_TEMPORARY allocation flag 2017-09-13 18:53:16 -07:00
slab.c
slab.h
slab_common.c
slob.c
slub.c mm: treewide: remove GFP_TEMPORARY allocation flag 2017-09-13 18:53:16 -07:00
sparse-vmemmap.c
sparse.c mm/sparse.c: fix typo in online_mem_sections 2017-09-08 18:26:47 -07:00
swap.c mm/device-public-memory: device memory cache coherent with CPU 2017-09-08 18:26:46 -07:00
swap_cgroup.c
swap_slots.c
swap_state.c
swapfile.c mm/swapfile.c: fix swapon frontswap_map memory leak on error 2017-09-08 18:26:47 -07:00
truncate.c
usercopy.c
userfaultfd.c
util.c
vmacache.c
vmalloc.c
vmpressure.c
vmscan.c
vmstat.c mm: consider the number in local CPUs when reading NUMA stats 2017-09-08 18:26:47 -07:00
workingset.c
z3fold.c z3fold: fix potential race in z3fold_reclaim_page 2017-10-03 17:54:24 -07:00
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc.c: change stat type parameter to int 2017-09-08 18:26:47 -07:00
zswap.c