linux/mm
Ross Zwisler 965d004af5 dax: fix deadlock with DAX 4k holes
Currently in DAX if we have three read faults on the same hole address we
can end up with the following:

Thread 0		Thread 1		Thread 2
--------		--------		--------
dax_iomap_fault
 grab_mapping_entry
  lock_slot
   <locks empty DAX entry>

  			dax_iomap_fault
			 grab_mapping_entry
			  get_unlocked_mapping_entry
			   <sleeps on empty DAX entry>

						dax_iomap_fault
						 grab_mapping_entry
						  get_unlocked_mapping_entry
						   <sleeps on empty DAX entry>
  dax_load_hole
   find_or_create_page
   ...
    page_cache_tree_insert
     dax_wake_mapping_entry_waiter
      <wakes one sleeper>
     __radix_tree_replace
      <swaps empty DAX entry with 4k zero page>

			<wakes>
			get_page
			lock_page
			...
			put_locked_mapping_entry
			unlock_page
			put_page

						<sleeps forever on the DAX
						 wait queue>

The crux of the problem is that once we insert a 4k zero page, all
locking from then on is done in terms of that 4k zero page and any
additional threads sleeping on the empty DAX entry will never be woken.

Fix this by waking all sleepers when we replace the DAX radix tree entry
with a 4k zero page.  This will allow all sleeping threads to
successfully transition from locking based on the DAX empty entry to
locking on the 4k zero page.

With the test case reported by Xiong this happens very regularly in my
test setup, with some runs resulting in 9+ threads in this deadlocked
state.  With this fix I've been able to run that same test dozens of
times in a loop without issue.

Fixes: ac401cc782 ("dax: New fault locking")
Link: http://lkml.kernel.org/r/1483479365-13607-1-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Xiong Zhou <xzhou@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>	[4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:54 -08:00
..
kasan Power management material for v4.10-rc1 2016-12-13 10:41:53 -08:00
backing-dev.c writeback: track if we're sleeping on progress in balance_dirty_pages() 2016-11-08 08:28:55 -07:00
balloon_compaction.c
bootmem.c mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping 2016-10-11 15:06:33 -07:00
cleancache.c
cma.c mm/cma.c: check the max limit for cma allocation 2016-11-11 08:12:37 -08:00
cma.h
cma_debug.c
compaction.c mm, compaction: allow compaction for GFP_NOFS requests 2016-12-14 16:04:07 -08:00
debug.c mm, debug: print raw struct page data in __dump_page() 2016-12-12 18:55:08 -08:00
debug_page_ref.c
dmapool.c
early_ioremap.c
fadvise.c mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED 2016-12-20 09:48:46 -08:00
failslab.c
filemap.c dax: fix deadlock with DAX 4k holes 2017-01-10 18:31:54 -08:00
frame_vector.c mm: replace get_vaddr_frames() write/force parameters with gup_flags 2016-10-19 08:11:24 -07:00
frontswap.c
gup.c mm: unexport __get_user_pages_unlocked() 2016-12-14 16:04:09 -08:00
highmem.c
huge_memory.c mm: join struct fault_env and vm_fault 2016-12-14 16:04:09 -08:00
hugetlb.c mm: add tlb_remove_check_page_size_change to track page size change 2016-12-12 18:55:07 -08:00
hugetlb_cgroup.c
hwpoison-inject.c
init-mm.c mm: Add a user_ns owner to mm_struct and fix ptrace permission checks 2016-11-22 11:49:48 -06:00
internal.h mm: add PageWaiters indicating tasks are waiting for a page bit 2016-12-25 11:54:48 -08:00
interval_tree.c
Kconfig mm: THP page cache support for ppc64 2016-12-12 18:55:08 -08:00
Kconfig.debug
khugepaged.c radix-tree: improve multiorder iterators 2016-12-14 16:04:10 -08:00
kmemcheck.c
kmemleak-test.c
kmemleak.c kmemleak: fix reference to Documentation 2016-12-12 18:55:07 -08:00
ksm.c mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node() 2016-10-07 18:46:29 -07:00
list_lru.c mm/list_lru.c: avoid error-path NULL pointer deref 2016-10-27 18:43:42 -07:00
maccess.c
madvise.c mm: add tlb_remove_check_page_size_change to track page size change 2016-12-12 18:55:07 -08:00
Makefile Disable the __builtin_return_address() warning globally after all 2016-10-12 10:23:41 -07:00
memblock.c mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping 2016-10-11 15:06:33 -07:00
memcontrol.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
memory-failure.c mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked 2016-12-25 11:54:48 -08:00
memory.c mm: stop leaking PageTables 2017-01-07 17:49:33 -08:00
memory_hotplug.c mm: remove x86-only restriction of movable_node 2016-12-12 18:55:07 -08:00
mempolicy.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mempool.c
memtest.c
migrate.c mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked 2016-12-25 11:54:48 -08:00
mincore.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mlock.c thp: fix corner case of munlock() of PTE-mapped THPs 2016-11-30 16:32:52 -08:00
mm_init.c
mmap.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c
mprotect.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mremap.c mremap: move_ptes: check pte dirty after its removal 2016-11-29 08:20:24 -08:00
msync.c
nobootmem.c mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping 2016-10-11 15:06:33 -07:00
nommu.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
oom_kill.c oom: print nodemask in the oom report 2016-10-07 18:46:29 -07:00
page-writeback.c radix-tree: delete radix_tree_range_tag_if_tagged() 2016-12-14 16:04:10 -08:00
page_alloc.c mm: add support for releasing multiple instances of a page 2016-12-14 16:04:08 -08:00
page_counter.c
page_ext.c mm/page_ext: support extra space allocation by page_ext user 2016-10-07 18:46:27 -07:00
page_idle.c
page_io.c writeback: add wbc_to_write_flags() 2016-11-02 10:24:03 -06:00
page_isolation.c mm/page_isolation: fix typo: "paes" -> "pages" 2016-10-07 18:46:29 -07:00
page_owner.c mm/page_owner: don't define fields on struct page_ext by hard-coding 2016-10-07 18:46:27 -07:00
page_poison.c
pagewalk.c
percpu-km.c
percpu-vm.c
percpu.c Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2016-12-13 12:34:47 -08:00
pgtable-generic.c
process_vm_access.c mm: unexport __get_user_pages_unlocked() 2016-12-14 16:04:09 -08:00
quicklist.c
readahead.c mm: don't cap request size based on read-ahead setting 2016-12-12 18:55:08 -08:00
rmap.c mm, rmap: handle anon_vma_prepare() common case inline 2016-12-12 18:55:08 -08:00
shmem.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
slab.c Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2016-12-13 12:59:57 -08:00
slab.h mm, slab: maintain total slab count instead of active count 2016-12-12 18:55:07 -08:00
slab_common.c mm/slab_common.c: check kmem_create_cache flags are common 2016-12-12 18:55:06 -08:00
slob.c slub: move synchronize_sched out of slab_mutex on shrink 2016-12-12 18:55:06 -08:00
slub.c slub: avoid false-postive warning 2016-12-12 18:55:06 -08:00
sparse-vmemmap.c
sparse.c
swap.c mm: add PageWaiters indicating tasks are waiting for a page bit 2016-12-25 11:54:48 -08:00
swap_cgroup.c
swap_state.c mm, swap: use offset of swap entry as key of swap cache 2016-10-07 18:46:28 -07:00
swapfile.c mm: add three more cond_resched() in swapoff 2016-12-12 18:55:08 -08:00
truncate.c mm: Invalidate DAX radix tree entries only if appropriate 2016-12-26 20:29:24 -08:00
usercopy.c
userfaultfd.c
util.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
vmacache.c mm: unrig VMA cache hit ratio 2016-10-07 18:46:27 -07:00
vmalloc.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
vmpressure.c
vmscan.c Merge branch 'akpm' (patches from Andrew) 2016-12-12 20:50:02 -08:00
vmstat.c mm/vmstat: Convert to hotplug state machine 2016-12-02 00:52:35 +01:00
workingset.c mm: workingset: fix use-after-free in shadow node shrinker 2017-01-07 18:22:40 -08:00
z3fold.c
zbud.c
zpool.c
zsmalloc.c mm/zsmalloc: Convert to hotplug state machine 2016-12-02 00:52:36 +01:00
zswap.c mm/zswap: Convert pool to hotplug state machine 2016-12-02 00:52:36 +01:00