From 401507d67d5c2854f5a88b3f93f64fc6f267bca5 Mon Sep 17 00:00:00 2001
From: Wang Nan
Date: Wed, 29 Oct 2014 14:50:18 -0700
Subject: [PATCH 01/21] cgroup/kmemleak: add kmemleak_free() for cgroup deallocations.

Commit ff7ee93f4715 ("cgroup/kmemleak: Annotate alloc_page() for cgroup
allocations") introduces kmemleak_alloc() for alloc_page_cgroup(), but the
corresponding kmemleak_free() is missing, which causes kmemleak to be
wrongly disabled after memory offlining.  The log is pasted at the end of
this commit message.

This patch adds kmemleak_free() to free_page_cgroup().  During page
offlining, it removes the corresponding entries from the kmemleak rbtree.
After that, the freed memory can be allocated again by other subsystems
without killing kmemleak.

  bash # for x in 1 2 3 4; do echo offline > /sys/devices/system/memory/memory$x/state ; sleep 1; done ; dmesg | grep leak

  Offlined Pages 32768
  kmemleak: Cannot insert 0xffff880016969000 into the object search tree (overlaps existing)
  CPU: 0 PID: 412 Comm: sleep Not tainted 3.17.0-rc5+ #86
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x46/0x58
    create_object+0x266/0x2c0
    kmemleak_alloc+0x26/0x50
    kmem_cache_alloc+0xd3/0x160
    __sigqueue_alloc+0x49/0xd0
    __send_signal+0xcb/0x410
    send_signal+0x45/0x90
    __group_send_sig_info+0x13/0x20
    do_notify_parent+0x1bb/0x260
    do_exit+0x767/0xa40
    do_group_exit+0x44/0xa0
    SyS_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

  kmemleak: Kernel memory leak detector disabled
  kmemleak: Object 0xffff880016900000 (size 524288):
  kmemleak:   comm "swapper/0", pid 0, jiffies 4294667296
  kmemleak:   min_count = 0
  kmemleak:   count = 0
  kmemleak:   flags = 0x1
  kmemleak:   checksum = 0
  kmemleak:   backtrace:
    log_early+0x63/0x77
    kmemleak_alloc+0x4b/0x50
    init_section_page_cgroup+0x7f/0xf5
    page_cgroup_init+0xc5/0xd0
    start_kernel+0x333/0x408
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0xf5/0xfc

Fixes: ff7ee93f4715 (cgroup/kmemleak: Annotate alloc_page() for cgroup allocations)
Signed-off-by: Wang
Nan
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Steven Rostedt
Cc: [3.2+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/page_cgroup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 3708264d2833..5331c2bd85a2 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -171,6 +171,7 @@ static void free_page_cgroup(void *addr)
 			sizeof(struct page_cgroup) * PAGES_PER_SECTION;

 		BUG_ON(PageReserved(page));
+		kmemleak_free(addr);
 		free_pages_exact(addr, table_size);
 	}
 }

From 6ea41c0c0aa37d87ef5dd0d14535d2e1e195cd83 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim
Date: Wed, 29 Oct 2014 14:50:20 -0700
Subject: [PATCH 02/21] mm/compaction.c: avoid premature range skip in isolate_migratepages_range

Commit edc2ca612496 ("mm, compaction: move pageblock checks up from
isolate_migratepages_range()") commonizes the isolate_migratepages
variants and makes them use isolate_migratepages_block().

isolate_migratepages_block() can stop once enough pages are isolated, but
there is no code in isolate_migratepages_range() to handle this case.  As
a result, even if isolate_migratepages_block() returns prematurely without
checking all pages in the range, isolate_migratepages_block() is called
repeatedly on the following pageblocks and some pages in the previous
range are never checked.  CMA then fails frequently because of this.

To fix this problem, this patch lets isolate_migratepages_range() know
when enough pages have been isolated and stops the isolation in that case.

Note that isolate_migratepages() has no such problem, because it always
stops the isolation after just one call of isolate_migratepages_block().
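For illustration only (this is not part of the patch), the skipping
behaviour described above can be modelled in userspace.  The sketch below
assumes made-up names and values: CLUSTER_MAX, BLOCK_PAGES, struct control,
isolate_block() and walk_range() are hypothetical stand-ins for
COMPACT_CLUSTER_MAX, pageblock_nr_pages, struct compact_control,
isolate_migratepages_block() and isolate_migratepages_range(), and every
page is assumed to be isolatable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel constants; values are illustrative only. */
#define CLUSTER_MAX 32UL   /* plays the role of COMPACT_CLUSTER_MAX */
#define BLOCK_PAGES 128UL  /* plays the role of pageblock_nr_pages */

/* Minimal model of struct compact_control: only the isolated-page count. */
struct control {
	unsigned long nr_isolated;
};

/*
 * Model of isolate_migratepages_block(): scan [low, end) and stop as soon
 * as CLUSTER_MAX pages are held.  Returns the first pfn NOT scanned.
 */
static unsigned long isolate_block(struct control *cc,
				   unsigned long low, unsigned long end)
{
	unsigned long pfn = low;

	while (pfn < end && cc->nr_isolated < CLUSTER_MAX) {
		cc->nr_isolated++;	/* pretend every page can be isolated */
		pfn++;
	}
	return pfn;
}

/*
 * Model of the range walker.  With stop_when_full == 0 it behaves like the
 * unpatched loop and moves on to the next block even though the callee
 * stopped early.  Returns the pfn reported back to the caller and stores
 * the number of pfns that were silently skipped in *skipped.
 */
unsigned long walk_range(unsigned long start, unsigned long end,
			 int stop_when_full, unsigned long *skipped)
{
	struct control cc = { 0 };
	unsigned long pfn = start, block_end;

	assert(skipped != NULL && start < end);
	*skipped = 0;

	for (; pfn < end; pfn = block_end) {
		block_end = (pfn / BLOCK_PAGES + 1) * BLOCK_PAGES;
		if (block_end > end)
			block_end = end;

		pfn = isolate_block(&cc, pfn, block_end);

		/* The check the patch adds: stop once the cluster is full. */
		if (stop_when_full && cc.nr_isolated == CLUSTER_MAX)
			break;

		*skipped += block_end - pfn;
	}
	return pfn;
}
```

In this model, walking the range [0, 256) without the check reports pfn 256
while 224 pfns were never scanned; with the check the walk stops at pfn 32,
so the caller can resume from there without losing any pages.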
Signed-off-by: Joonsoo Kim
Acked-by: Vlastimil Babka
Cc: David Rientjes
Cc: Minchan Kim
Cc: Michal Nazarewicz
Cc: Naoya Horiguchi
Cc: Christoph Lameter
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index edba18aed173..ec74cf0123ef 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -784,6 +784,9 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 			cc->nr_migratepages = 0;
 			break;
 		}
+
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
 	}
 	acct_isolated(cc->zone, cc);

From 6424babfd68dd8a83d9c60a5242d27038856599f Mon Sep 17 00:00:00 2001
From: Jerry Hoemann
Date: Wed, 29 Oct 2014 14:50:22 -0700
Subject: [PATCH 03/21] fsnotify: next_i is freed during fsnotify_unmount_inodes.

During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
		spin_unlock(&inode->i_lock);
		continue;
	}

As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.

Multiple crash dumps showed:

The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point back
at itself.  As this is not the value of list upon entry to the function,
the kernel never exits the loop.

To help narrow down the problem, the call to list_del_init in
inode_sb_list_del was changed to list_del.  This poisons the pointers in
the i_sb_list and causes the kernel to panic if it traverses a freed
inode.

Subsequent stress testing panicked in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.
We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.

The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget.  However, the code doesn't do the
__iget call on next_i

	if i_count == 0 or
	if i_state & (I_FREEING | I_WILL_FREE)

The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list.  This makes the handling of next_i more closely match the
handling of the variable "inode."

The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.

During list_for_each_entry_safe, next_i is becoming free causing the loop
to never terminate.  Advance next_i in those cases where __iget is not
done.

Signed-off-by: Jerry Hoemann
Cc: Jeff Kirsher
Cc: Ken Helias
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/notify/inode_mark.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 9ce062218de9..e8497144b323 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -288,20 +288,25 @@ void fsnotify_unmount_inodes(struct list_head *list)
 		spin_unlock(&inode->i_lock);

 		/* In case the dropping of a reference would nuke next_i.
*/
-		if ((&next_i->i_sb_list != list) &&
-		    atomic_read(&next_i->i_count)) {
+		while (&next_i->i_sb_list != list) {
 			spin_lock(&next_i->i_lock);
-			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE))) {
+			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
+						atomic_read(&next_i->i_count)) {
 				__iget(next_i);
 				need_iput = next_i;
+				spin_unlock(&next_i->i_lock);
+				break;
 			}
 			spin_unlock(&next_i->i_lock);
+			next_i = list_entry(next_i->i_sb_list.next,
+						struct inode, i_sb_list);
 		}

 		/*
-		 * We can safely drop inode_sb_list_lock here because we hold
-		 * references on both inode and next_i.  Also no new inodes
-		 * will be added since the umount has begun.
+		 * We can safely drop inode_sb_list_lock here because either
+		 * we actually hold references on both inode and next_i or
+		 * end of list.  Also no new inodes will be added since the
+		 * umount has begun.
 		 */
 		spin_unlock(&inode_sb_list_lock);

From f601de204465048bdf0d5537f630729622ebc3a6 Mon Sep 17 00:00:00 2001
From: Riku Voipio
Date: Wed, 29 Oct 2014 14:50:24 -0700
Subject: [PATCH 04/21] gcov: add ARM64 to GCOV_PROFILE_ALL

Following up the arm testing of gcov, turns out gcov on ARM64 works fine
as well.  Only change needed is adding ARM64 to Kconfig depends.

Tested with qemu and mach-virt

Signed-off-by: Riku Voipio
Acked-by: Peter Oberparleiter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 kernel/gcov/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/gcov/Kconfig b/kernel/gcov/Kconfig
index cf66c5c8458e..3b7408759bdf 100644
--- a/kernel/gcov/Kconfig
+++ b/kernel/gcov/Kconfig
@@ -35,7 +35,7 @@ config GCOV_KERNEL
 config GCOV_PROFILE_ALL
 	bool "Profile entire Kernel"
 	depends on GCOV_KERNEL
-	depends on SUPERH || S390 || X86 || PPC || MICROBLAZE || ARM
+	depends on SUPERH || S390 || X86 || PPC || MICROBLAZE || ARM || ARM64
 	default n
 	---help---
 	This options activates profiling for the entire kernel.

From 5ddacbe92b806cd5b4f8f154e8e46ac267fff55c Mon Sep 17 00:00:00 2001
From: Yu Zhao
Date: Wed, 29 Oct 2014 14:50:26 -0700
Subject: [PATCH 05/21] mm: free compound page with correct order

A compound page should be freed by put_page() or free_pages() with the
correct order.  Not doing so leaks the tail pages.

The compound order can be obtained by compound_order() or use
HPAGE_PMD_ORDER in our case.  Some people would argue the latter is faster
but I prefer the former which is more general.

This bug was observed not just on our servers (the worst case we saw is
11G leaked on a 48G machine) but also on our workstations running Ubuntu
based distro.

  $ cat /proc/vmstat | grep thp_zero_page_alloc
  thp_zero_page_alloc 55
  thp_zero_page_alloc_failed 0

This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.

Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
Signed-off-by: Yu Zhao
Acked-by: Kirill A. Shutemov
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: David Rientjes
Cc: Bob Liu
Cc: [3.8+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/huge_memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 74c78aa8bc2f..780d12c000e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -200,7 +200,7 @@ static struct page *get_huge_zero_page(void)
 	preempt_disable();
 	if (cmpxchg(&huge_zero_page, NULL, zero_page)) {
 		preempt_enable();
-		__free_page(zero_page);
+		__free_pages(zero_page, compound_order(zero_page));
 		goto retry;
 	}

@@ -232,7 +232,7 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
 	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
 		struct page *zero_page = xchg(&huge_zero_page, NULL);
 		BUG_ON(zero_page == NULL);
-		__free_page(zero_page);
+		__free_pages(zero_page, compound_order(zero_page));
 		return HPAGE_PMD_NR;
 	}

From 47f29df7db78ee4fcdb104cf36918d987ddd0278 Mon Sep 17 00:00:00 2001
From: Marek Szyprowski
Date: Wed, 29 Oct 2014 14:50:29
-0700
Subject: [PATCH 06/21] drivers: of: add return value to of_reserved_mem_device_init()

A driver calling of_reserved_mem_device_init() might be interested in
whether the initialization succeeded, so add support for returning an
error code.

This fixes a build warning caused by commit 7bfa5ab6fa1b ("drivers:
dma-coherent: add initialization from device tree"), which has been merged
without this change and without fixing the function return value.

Fixes: 7bfa5ab6fa1b1 ("drivers: dma-coherent: add initialization from device tree")
Signed-off-by: Marek Szyprowski
Acked-by: Arnd Bergmann
Cc: Michal Nazarewicz
Cc: Grant Likely
Cc: Laura Abbott
Cc: Josh Cartwright
Cc: Joonsoo Kim
Cc: Kyungmin Park
Cc: Russell King
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 drivers/base/dma-contiguous.c   |  3 ++-
 drivers/of/of_reserved_mem.c    | 14 +++++++++-----
 include/linux/of_reserved_mem.h |  9 ++++++---
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 473ff4892401..950fff9ce453 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -223,9 +223,10 @@ bool dma_release_from_contiguous(struct device *dev, struct page *pages,
 #undef pr_fmt
 #define pr_fmt(fmt) fmt

-static void rmem_cma_device_init(struct reserved_mem *rmem, struct device *dev)
+static int rmem_cma_device_init(struct reserved_mem *rmem, struct device *dev)
 {
 	dev_set_cma_area(dev, rmem->priv);
+	return 0;
 }

 static void rmem_cma_device_release(struct reserved_mem *rmem,
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 59fb12e84e6b..dc566b38645f 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -243,23 +243,27 @@ static inline struct reserved_mem *__find_rmem(struct device_node *node)
  * This function assign memory region pointed by "memory-region" device tree
  * property to the given device.
  */
-void of_reserved_mem_device_init(struct device *dev)
+int of_reserved_mem_device_init(struct device *dev)
 {
 	struct reserved_mem *rmem;
 	struct device_node *np;
+	int ret;

 	np = of_parse_phandle(dev->of_node, "memory-region", 0);
 	if (!np)
-		return;
+		return -ENODEV;

 	rmem = __find_rmem(np);
 	of_node_put(np);

 	if (!rmem || !rmem->ops || !rmem->ops->device_init)
-		return;
+		return -EINVAL;

-	rmem->ops->device_init(rmem, dev);
-	dev_info(dev, "assigned reserved memory node %s\n", rmem->name);
+	ret = rmem->ops->device_init(rmem, dev);
+	if (ret == 0)
+		dev_info(dev, "assigned reserved memory node %s\n", rmem->name);
+
+	return ret;
 }

 /**
diff --git a/include/linux/of_reserved_mem.h b/include/linux/of_reserved_mem.h
index 5b5efae09135..ad2f67054372 100644
--- a/include/linux/of_reserved_mem.h
+++ b/include/linux/of_reserved_mem.h
@@ -16,7 +16,7 @@ struct reserved_mem {
 };

 struct reserved_mem_ops {
-	void (*device_init)(struct reserved_mem *rmem,
+	int (*device_init)(struct reserved_mem *rmem,
 			    struct device *dev);
 	void (*device_release)(struct reserved_mem *rmem,
 			       struct device *dev);
@@ -28,14 +28,17 @@ typedef int (*reservedmem_of_init_fn)(struct reserved_mem *rmem);
 	_OF_DECLARE(reservedmem, name, compat, init, reservedmem_of_init_fn)

 #ifdef CONFIG_OF_RESERVED_MEM
-void of_reserved_mem_device_init(struct device *dev);
+int of_reserved_mem_device_init(struct device *dev);
 void of_reserved_mem_device_release(struct device *dev);

 void fdt_init_reserved_mem(void);
 void fdt_reserved_mem_save_node(unsigned long node, const char *uname,
 			       phys_addr_t base, phys_addr_t size);
 #else
-static inline void of_reserved_mem_device_init(struct device *dev) { }
+static inline int of_reserved_mem_device_init(struct device *dev)
+{
+	return -ENOSYS;
+}
 static inline void of_reserved_mem_device_release(struct device *pdev) { }

 static inline void fdt_init_reserved_mem(void) { }

From 6d50e60cd2edb5a57154db5a6f64eef5aa59b751 Mon Sep 17 00:00:00 2001
From: David Rientjes
Date: Wed, 29 Oct 2014
14:50:31 -0700
Subject: [PATCH 07/21] mm, thp: fix collapsing of hugepages on madvise

If an anonymous mapping is not allowed to fault thp memory and then
madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never collapse
this memory into thp memory.

This occurs because the madvise(2) handler for thp, hugepage_madvise(),
clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
until the final action of madvise_behavior().  This causes
khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when the
vma had previously had VM_NOHUGEPAGE set.

Fix this by passing the correct vma flags to the khugepaged mm slot
handler.  There's no chance khugepaged can run on this vma until after
madvise_behavior() returns since we hold mm->mmap_sem.

It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags in
hugepage_madvise(), but I didn't want to introduce special case behavior
into madvise_behavior().  I think it's best to just let it always set
vma->vm_flags itself.

Signed-off-by: David Rientjes
Reported-by: Suleiman Souhlal
Cc: "Kirill A.
Shutemov"
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/khugepaged.h | 17 ++++++++++-------
 mm/huge_memory.c           | 11 ++++++-----
 mm/mmap.c                  |  8 ++++----
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 6b394f0b5148..eeb307985715 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -6,7 +6,8 @@
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
-extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);
+extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+				      unsigned long vm_flags);

 #define khugepaged_enabled() \
 	(transparent_hugepage_flags & \
@@ -35,13 +36,13 @@ static inline void khugepaged_exit(struct mm_struct *mm)
 		__khugepaged_exit(mm);
 }

-static inline int khugepaged_enter(struct vm_area_struct *vma)
+static inline int khugepaged_enter(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
 		if ((khugepaged_always() ||
-		     (khugepaged_req_madv() &&
-		      vma->vm_flags & VM_HUGEPAGE)) &&
-		    !(vma->vm_flags & VM_NOHUGEPAGE))
+		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
+		    !(vm_flags & VM_NOHUGEPAGE))
 			if (__khugepaged_enter(vma->vm_mm))
 				return -ENOMEM;
 	return 0;
@@ -54,11 +55,13 @@ static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 static inline void khugepaged_exit(struct mm_struct *mm)
 {
 }
-static inline int khugepaged_enter(struct vm_area_struct *vma)
+static inline int khugepaged_enter(struct vm_area_struct *vma,
+				   unsigned long vm_flags)
 {
 	return 0;
 }
-static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+					     unsigned long vm_flags)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index
780d12c000e9..de984159cf0b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -803,7 +803,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_FALLBACK;
 	if (unlikely(anon_vma_prepare(vma)))
 		return VM_FAULT_OOM;
-	if (unlikely(khugepaged_enter(vma)))
+	if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
 		return VM_FAULT_OOM;
 	if (!(flags & FAULT_FLAG_WRITE) &&
 			transparent_hugepage_use_zero_page()) {
@@ -1970,7 +1970,7 @@ int hugepage_madvise(struct vm_area_struct *vma,
 		 * register it here without waiting a page fault that
 		 * may not happen any time soon.
 		 */
-		if (unlikely(khugepaged_enter_vma_merge(vma)))
+		if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
 			return -ENOMEM;
 		break;
 	case MADV_NOHUGEPAGE:
@@ -2071,7 +2071,8 @@ int __khugepaged_enter(struct mm_struct *mm)
 	return 0;
 }

-int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
+			       unsigned long vm_flags)
 {
 	unsigned long hstart, hend;
 	if (!vma->anon_vma)
@@ -2083,11 +2084,11 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
 	if (vma->vm_ops)
 		/* khugepaged not yet working on file or special mappings */
 		return 0;
-	VM_BUG_ON_VMA(vma->vm_flags & VM_NO_THP, vma);
+	VM_BUG_ON_VMA(vm_flags & VM_NO_THP, vma);
 	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
 	hend = vma->vm_end & HPAGE_PMD_MASK;
 	if (hstart < hend)
-		return khugepaged_enter(vma);
+		return khugepaged_enter(vma, vm_flags);
 	return 0;
 }

diff --git a/mm/mmap.c b/mm/mmap.c
index 7f855206e7fb..87e82b38453c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1080,7 +1080,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 				end, prev->vm_pgoff, NULL);
 		if (err)
 			return NULL;
-		khugepaged_enter_vma_merge(prev);
+		khugepaged_enter_vma_merge(prev, vm_flags);
 		return prev;
 	}

@@ -1099,7 +1099,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 				next->vm_pgoff - pglen, NULL);
 		if (err)
 			return NULL;
-
khugepaged_enter_vma_merge(area);
+		khugepaged_enter_vma_merge(area, vm_flags);
 		return area;
 	}

@@ -2208,7 +2208,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		}
 	}
 	vma_unlock_anon_vma(vma);
-	khugepaged_enter_vma_merge(vma);
+	khugepaged_enter_vma_merge(vma, vma->vm_flags);
 	validate_mm(vma->vm_mm);
 	return error;
 }
@@ -2277,7 +2277,7 @@ int expand_downwards(struct vm_area_struct *vma,
 		}
 	}
 	vma_unlock_anon_vma(vma);
-	khugepaged_enter_vma_merge(vma);
+	khugepaged_enter_vma_merge(vma, vma->vm_flags);
 	validate_mm(vma->vm_mm);
 	return error;
 }

From c8d523a4b053d1678adb92976b5ef84d9bc481e8 Mon Sep 17 00:00:00 2001
From: Stanimir Varbanov
Date: Wed, 29 Oct 2014 14:50:33 -0700
Subject: [PATCH 08/21] drivers/rtc/rtc-pm8xxx.c: rework to support pm8941 rtc

Adds support for the RTC device inside the PM8941 PMIC.  The RTC in this
PMIC has two register spaces.  Thus the rtc-pm8xxx driver is slightly
reworked to reflect these differences.

The register set for each PMIC chip is selected based on the DT compatible
string.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: simplify and fix locking in pm8xxx_rtc_set_time()]
Signed-off-by: Stanimir Varbanov
Cc: Alessandro Zummo
Cc: Stephen Boyd
Cc: Josh Cartwright
Cc: Stanimir Varbanov
Cc: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 drivers/rtc/Kconfig      |   2 +-
 drivers/rtc/rtc-pm8xxx.c | 222 +++++++++++++++++++++++----------------
 2 files changed, 135 insertions(+), 89 deletions(-)

diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
index 94ae1798d48a..6dd12ddbabc6 100644
--- a/drivers/rtc/Kconfig
+++ b/drivers/rtc/Kconfig
@@ -1320,7 +1320,7 @@ config RTC_DRV_LPC32XX

 config RTC_DRV_PM8XXX
 	tristate "Qualcomm PMIC8XXX RTC"
-	depends on MFD_PM8XXX
+	depends on MFD_PM8XXX || MFD_SPMI_PMIC
 	help
 	  If you say yes here you get support for the
 	  Qualcomm PMIC8XXX RTC.
diff --git a/drivers/rtc/rtc-pm8xxx.c b/drivers/rtc/rtc-pm8xxx.c
index 197699f358c7..5adcf111fc14 100644
--- a/drivers/rtc/rtc-pm8xxx.c
+++ b/drivers/rtc/rtc-pm8xxx.c
@@ -27,21 +27,36 @@

 /* RTC_CTRL register bit fields */
 #define PM8xxx_RTC_ENABLE BIT(7)
-#define PM8xxx_RTC_ALARM_ENABLE BIT(1)
 #define PM8xxx_RTC_ALARM_CLEAR BIT(0)

 #define NUM_8_BIT_RTC_REGS 0x4

+/**
+ * struct pm8xxx_rtc_regs - describe RTC registers per PMIC versions
+ * @ctrl: base address of control register
+ * @write: base address of write register
+ * @read: base address of read register
+ * @alarm_ctrl: base address of alarm control register
+ * @alarm_ctrl2: base address of alarm control2 register
+ * @alarm_rw: base address of alarm read-write register
+ * @alarm_en: alarm enable mask
+ */
+struct pm8xxx_rtc_regs {
+	unsigned int ctrl;
+	unsigned int write;
+	unsigned int read;
+	unsigned int alarm_ctrl;
+	unsigned int alarm_ctrl2;
+	unsigned int alarm_rw;
+	unsigned int alarm_en;
+};
+
 /**
  * struct pm8xxx_rtc - rtc driver internal structure
  * @rtc: rtc device for this driver.
  * @regmap: regmap used to access RTC registers
  * @allow_set_time: indicates whether writing to the RTC is allowed
  * @rtc_alarm_irq: rtc alarm irq number.
- * @rtc_base: address of rtc control register.
- * @rtc_read_base: base address of read registers.
- * @rtc_write_base: base address of write registers.
- * @alarm_rw_base: base address of alarm registers.
  * @ctrl_reg: rtc control register.
  * @rtc_dev: device structure.
  * @ctrl_reg_lock: spinlock protecting access to ctrl_reg.
@@ -51,11 +66,7 @@ struct pm8xxx_rtc {
 	struct regmap *regmap;
 	bool allow_set_time;
 	int rtc_alarm_irq;
-	int rtc_base;
-	int rtc_read_base;
-	int rtc_write_base;
-	int alarm_rw_base;
-	u8 ctrl_reg;
+	const struct pm8xxx_rtc_regs *regs;
 	struct device *rtc_dev;
 	spinlock_t ctrl_reg_lock;
 };
@@ -71,8 +82,10 @@ static int pm8xxx_rtc_set_time(struct device *dev, struct rtc_time *tm)
 {
 	int rc, i;
 	unsigned long secs, irq_flags;
-	u8 value[NUM_8_BIT_RTC_REGS], alarm_enabled = 0, ctrl_reg;
+	u8 value[NUM_8_BIT_RTC_REGS], alarm_enabled = 0;
+	unsigned int ctrl_reg;
 	struct pm8xxx_rtc *rtc_dd = dev_get_drvdata(dev);
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;

 	if (!rtc_dd->allow_set_time)
 		return -EACCES;
@@ -87,30 +100,30 @@ static int pm8xxx_rtc_set_time(struct device *dev, struct rtc_time *tm)
 	dev_dbg(dev, "Seconds value to be written to RTC = %lu\n", secs);

 	spin_lock_irqsave(&rtc_dd->ctrl_reg_lock, irq_flags);
-	ctrl_reg = rtc_dd->ctrl_reg;

-	if (ctrl_reg & PM8xxx_RTC_ALARM_ENABLE) {
+	rc = regmap_read(rtc_dd->regmap, regs->ctrl, &ctrl_reg);
+	if (rc)
+		goto rtc_rw_fail;
+
+	if (ctrl_reg & regs->alarm_en) {
 		alarm_enabled = 1;
-		ctrl_reg &= ~PM8xxx_RTC_ALARM_ENABLE;
-		rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg);
+		ctrl_reg &= ~regs->alarm_en;
+		rc = regmap_write(rtc_dd->regmap, regs->ctrl, ctrl_reg);
 		if (rc) {
 			dev_err(dev, "Write to RTC control register failed\n");
 			goto rtc_rw_fail;
 		}
-		rtc_dd->ctrl_reg = ctrl_reg;
-	} else {
-		spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);
 	}

 	/* Write 0 to Byte[0] */
-	rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_write_base, 0);
+	rc = regmap_write(rtc_dd->regmap, regs->write, 0);
 	if (rc) {
 		dev_err(dev, "Write to RTC write data register failed\n");
 		goto rtc_rw_fail;
 	}

 	/* Write Byte[1], Byte[2], Byte[3] */
-	rc = regmap_bulk_write(rtc_dd->regmap, rtc_dd->rtc_write_base + 1,
+	rc = regmap_bulk_write(rtc_dd->regmap, regs->write + 1,
 			       &value[1], sizeof(value) - 1);
 	if (rc) {
 		dev_err(dev, "Write to RTC write
data register failed\n");
@@ -118,25 +131,23 @@ static int pm8xxx_rtc_set_time(struct device *dev, struct rtc_time *tm)
 	}

 	/* Write Byte[0] */
-	rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_write_base, value[0]);
+	rc = regmap_write(rtc_dd->regmap, regs->write, value[0]);
 	if (rc) {
 		dev_err(dev, "Write to RTC write data register failed\n");
 		goto rtc_rw_fail;
 	}

 	if (alarm_enabled) {
-		ctrl_reg |= PM8xxx_RTC_ALARM_ENABLE;
-		rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg);
+		ctrl_reg |= regs->alarm_en;
+		rc = regmap_write(rtc_dd->regmap, regs->ctrl, ctrl_reg);
 		if (rc) {
 			dev_err(dev, "Write to RTC control register failed\n");
 			goto rtc_rw_fail;
 		}
-		rtc_dd->ctrl_reg = ctrl_reg;
 	}

 rtc_rw_fail:
-	if (alarm_enabled)
-		spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);
+	spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);

 	return rc;
 }
@@ -148,9 +159,9 @@ static int pm8xxx_rtc_read_time(struct device *dev, struct rtc_time *tm)
 	unsigned long secs;
 	unsigned int reg;
 	struct pm8xxx_rtc *rtc_dd = dev_get_drvdata(dev);
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;

-	rc = regmap_bulk_read(rtc_dd->regmap, rtc_dd->rtc_read_base,
-			      value, sizeof(value));
+	rc = regmap_bulk_read(rtc_dd->regmap, regs->read, value, sizeof(value));
 	if (rc) {
 		dev_err(dev, "RTC read data register failed\n");
 		return rc;
@@ -160,14 +171,14 @@ static int pm8xxx_rtc_read_time(struct device *dev, struct rtc_time *tm)
 	 * Read the LSB again and check if there has been a carry over.
 	 * If there is, redo the read operation.
 	 */
-	rc = regmap_read(rtc_dd->regmap, rtc_dd->rtc_read_base, &reg);
+	rc = regmap_read(rtc_dd->regmap, regs->read, &reg);
 	if (rc < 0) {
 		dev_err(dev, "RTC read data register failed\n");
 		return rc;
 	}

 	if (unlikely(reg < value[0])) {
-		rc = regmap_bulk_read(rtc_dd->regmap, rtc_dd->rtc_read_base,
+		rc = regmap_bulk_read(rtc_dd->regmap, regs->read,
 				      value, sizeof(value));
 		if (rc) {
 			dev_err(dev, "RTC read data register failed\n");
@@ -195,9 +206,11 @@ static int pm8xxx_rtc_read_time(struct device *dev, struct rtc_time *tm)
 static int pm8xxx_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alarm)
 {
 	int rc, i;
-	u8 value[NUM_8_BIT_RTC_REGS], ctrl_reg;
+	u8 value[NUM_8_BIT_RTC_REGS];
+	unsigned int ctrl_reg;
 	unsigned long secs, irq_flags;
 	struct pm8xxx_rtc *rtc_dd = dev_get_drvdata(dev);
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;

 	rtc_tm_to_time(&alarm->time, &secs);

@@ -208,28 +221,28 @@ static int pm8xxx_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alarm)

 	spin_lock_irqsave(&rtc_dd->ctrl_reg_lock, irq_flags);

-	rc = regmap_bulk_write(rtc_dd->regmap, rtc_dd->alarm_rw_base, value,
+	rc = regmap_bulk_write(rtc_dd->regmap, regs->alarm_rw, value,
 			       sizeof(value));
 	if (rc) {
 		dev_err(dev, "Write to RTC ALARM register failed\n");
 		goto rtc_rw_fail;
 	}

-	ctrl_reg = rtc_dd->ctrl_reg;
+	rc = regmap_read(rtc_dd->regmap, regs->alarm_ctrl, &ctrl_reg);
+	if (rc)
+		goto rtc_rw_fail;

 	if (alarm->enabled)
-		ctrl_reg |= PM8xxx_RTC_ALARM_ENABLE;
+		ctrl_reg |= regs->alarm_en;
 	else
-		ctrl_reg &= ~PM8xxx_RTC_ALARM_ENABLE;
+		ctrl_reg &= ~regs->alarm_en;

-	rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg);
+	rc = regmap_write(rtc_dd->regmap, regs->alarm_ctrl, ctrl_reg);
 	if (rc) {
-		dev_err(dev, "Write to RTC control register failed\n");
+		dev_err(dev, "Write to RTC alarm control register failed\n");
 		goto rtc_rw_fail;
 	}

-	rtc_dd->ctrl_reg = ctrl_reg;
-
 	dev_dbg(dev, "Alarm Set for h:r:s=%d:%d:%d, d/m/y=%d/%d/%d\n",
 		alarm->time.tm_hour, alarm->time.tm_min,
 		alarm->time.tm_sec,
alarm->time.tm_mday,
@@ -245,8 +258,9 @@ static int pm8xxx_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alarm)
 	u8 value[NUM_8_BIT_RTC_REGS];
 	unsigned long secs;
 	struct pm8xxx_rtc *rtc_dd = dev_get_drvdata(dev);
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;

-	rc = regmap_bulk_read(rtc_dd->regmap, rtc_dd->alarm_rw_base, value,
+	rc = regmap_bulk_read(rtc_dd->regmap, regs->alarm_rw, value,
 			      sizeof(value));
 	if (rc) {
 		dev_err(dev, "RTC alarm time read failed\n");
@@ -276,25 +290,26 @@ static int pm8xxx_rtc_alarm_irq_enable(struct device *dev, unsigned int enable)
 	int rc;
 	unsigned long irq_flags;
 	struct pm8xxx_rtc *rtc_dd = dev_get_drvdata(dev);
-	u8 ctrl_reg;
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;
+	unsigned int ctrl_reg;

 	spin_lock_irqsave(&rtc_dd->ctrl_reg_lock, irq_flags);

-	ctrl_reg = rtc_dd->ctrl_reg;
+	rc = regmap_read(rtc_dd->regmap, regs->alarm_ctrl, &ctrl_reg);
+	if (rc)
+		goto rtc_rw_fail;

 	if (enable)
-		ctrl_reg |= PM8xxx_RTC_ALARM_ENABLE;
+		ctrl_reg |= regs->alarm_en;
 	else
-		ctrl_reg &= ~PM8xxx_RTC_ALARM_ENABLE;
+		ctrl_reg &= ~regs->alarm_en;

-	rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg);
+	rc = regmap_write(rtc_dd->regmap, regs->alarm_ctrl, ctrl_reg);
 	if (rc) {
 		dev_err(dev, "Write to RTC control register failed\n");
 		goto rtc_rw_fail;
 	}

-	rtc_dd->ctrl_reg = ctrl_reg;
-
 rtc_rw_fail:
 	spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);
 	return rc;
@@ -311,6 +326,7 @@ static const struct rtc_class_ops pm8xxx_rtc_ops = {
 static irqreturn_t pm8xxx_alarm_trigger(int irq, void *dev_id)
 {
 	struct pm8xxx_rtc *rtc_dd = dev_id;
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;
 	unsigned int ctrl_reg;
 	int rc;
 	unsigned long irq_flags;
@@ -320,48 +336,100 @@ static irqreturn_t pm8xxx_alarm_trigger(int irq, void *dev_id)
 	spin_lock_irqsave(&rtc_dd->ctrl_reg_lock, irq_flags);

 	/* Clear the alarm enable bit */
-	ctrl_reg = rtc_dd->ctrl_reg;
-	ctrl_reg &= ~PM8xxx_RTC_ALARM_ENABLE;
+	rc = regmap_read(rtc_dd->regmap,
regs->alarm_ctrl, &ctrl_reg);
+	if (rc) {
+		spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);
+		goto rtc_alarm_handled;
+	}

-	rc = regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg);
+	ctrl_reg &= ~regs->alarm_en;
+
+	rc = regmap_write(rtc_dd->regmap, regs->alarm_ctrl, ctrl_reg);
 	if (rc) {
 		spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);
 		dev_err(rtc_dd->rtc_dev,
-			"Write to RTC control register failed\n");
+			"Write to alarm control register failed\n");
 		goto rtc_alarm_handled;
 	}

-	rtc_dd->ctrl_reg = ctrl_reg;
 	spin_unlock_irqrestore(&rtc_dd->ctrl_reg_lock, irq_flags);

 	/* Clear RTC alarm register */
-	rc = regmap_read(rtc_dd->regmap,
-			 rtc_dd->rtc_base + PM8XXX_ALARM_CTRL_OFFSET,
-			 &ctrl_reg);
+	rc = regmap_read(rtc_dd->regmap, regs->alarm_ctrl2, &ctrl_reg);
 	if (rc) {
 		dev_err(rtc_dd->rtc_dev,
-			"RTC Alarm control register read failed\n");
+			"RTC Alarm control2 register read failed\n");
 		goto rtc_alarm_handled;
 	}

-	ctrl_reg &= ~PM8xxx_RTC_ALARM_CLEAR;
-	rc = regmap_write(rtc_dd->regmap,
-			  rtc_dd->rtc_base + PM8XXX_ALARM_CTRL_OFFSET,
-			  ctrl_reg);
+	ctrl_reg |= PM8xxx_RTC_ALARM_CLEAR;
+	rc = regmap_write(rtc_dd->regmap, regs->alarm_ctrl2, ctrl_reg);
 	if (rc)
 		dev_err(rtc_dd->rtc_dev,
-			"Write to RTC Alarm control register failed\n");
+			"Write to RTC Alarm control2 register failed\n");

 rtc_alarm_handled:
 	return IRQ_HANDLED;
 }

+static int pm8xxx_rtc_enable(struct pm8xxx_rtc *rtc_dd)
+{
+	const struct pm8xxx_rtc_regs *regs = rtc_dd->regs;
+	unsigned int ctrl_reg;
+	int rc;
+
+	/* Check if the RTC is on, else turn it on */
+	rc = regmap_read(rtc_dd->regmap, regs->ctrl, &ctrl_reg);
+	if (rc)
+		return rc;
+
+	if (!(ctrl_reg & PM8xxx_RTC_ENABLE)) {
+		ctrl_reg |= PM8xxx_RTC_ENABLE;
+		rc = regmap_write(rtc_dd->regmap, regs->ctrl, ctrl_reg);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
+static const struct pm8xxx_rtc_regs pm8921_regs = {
+	.ctrl = 0x11d,
+	.write = 0x11f,
+	.read = 0x123,
+	.alarm_rw = 0x127,
+	.alarm_ctrl = 0x11d,
+	.alarm_ctrl2 = 0x11e,
+
.alarm_en = BIT(1), +}; + +static const struct pm8xxx_rtc_regs pm8058_regs = { + .ctrl = 0x1e8, + .write = 0x1ea, + .read = 0x1ee, + .alarm_rw = 0x1f2, + .alarm_ctrl = 0x1e8, + .alarm_ctrl2 = 0x1e9, + .alarm_en = BIT(1), +}; + +static const struct pm8xxx_rtc_regs pm8941_regs = { + .ctrl = 0x6046, + .write = 0x6040, + .read = 0x6048, + .alarm_rw = 0x6140, + .alarm_ctrl = 0x6146, + .alarm_ctrl2 = 0x6148, + .alarm_en = BIT(7), +}; + /* * Hardcoded RTC bases until IORESOURCE_REG mapping is figured out */ static const struct of_device_id pm8xxx_id_table[] = { - { .compatible = "qcom,pm8921-rtc", .data = (void *) 0x11D }, - { .compatible = "qcom,pm8058-rtc", .data = (void *) 0x1E8 }, + { .compatible = "qcom,pm8921-rtc", .data = &pm8921_regs }, + { .compatible = "qcom,pm8058-rtc", .data = &pm8058_regs }, + { .compatible = "qcom,pm8941-rtc", .data = &pm8941_regs }, { }, }; MODULE_DEVICE_TABLE(of, pm8xxx_id_table); @@ -369,7 +437,6 @@ MODULE_DEVICE_TABLE(of, pm8xxx_id_table); static int pm8xxx_rtc_probe(struct platform_device *pdev) { int rc; - unsigned int ctrl_reg; struct pm8xxx_rtc *rtc_dd; const struct of_device_id *match; @@ -399,33 +466,12 @@ static int pm8xxx_rtc_probe(struct platform_device *pdev) rtc_dd->allow_set_time = of_property_read_bool(pdev->dev.of_node, "allow-set-time"); - rtc_dd->rtc_base = (long) match->data; - - /* Setup RTC register addresses */ - rtc_dd->rtc_write_base = rtc_dd->rtc_base + PM8XXX_RTC_WRITE_OFFSET; - rtc_dd->rtc_read_base = rtc_dd->rtc_base + PM8XXX_RTC_READ_OFFSET; - rtc_dd->alarm_rw_base = rtc_dd->rtc_base + PM8XXX_ALARM_RW_OFFSET; - + rtc_dd->regs = match->data; rtc_dd->rtc_dev = &pdev->dev; - /* Check if the RTC is on, else turn it on */ - rc = regmap_read(rtc_dd->regmap, rtc_dd->rtc_base, &ctrl_reg); - if (rc) { - dev_err(&pdev->dev, "RTC control register read failed!\n"); + rc = pm8xxx_rtc_enable(rtc_dd); + if (rc) return rc; - } - - if (!(ctrl_reg & PM8xxx_RTC_ENABLE)) { - ctrl_reg |= PM8xxx_RTC_ENABLE; - rc = 
regmap_write(rtc_dd->regmap, rtc_dd->rtc_base, ctrl_reg); - if (rc) { - dev_err(&pdev->dev, - "Write to RTC control register failed\n"); - return rc; - } - } - - rtc_dd->ctrl_reg = ctrl_reg; platform_set_drvdata(pdev, rtc_dd); From 0baf2a4dbf75abb7c186fd6c8d55d27aaa354a29 Mon Sep 17 00:00:00 2001 From: Martin Schwidefsky Date: Wed, 29 Oct 2014 14:50:35 -0700 Subject: [PATCH 09/21] kernel/kmod: fix use-after-free of the sub_info structure Found this in the message log on a s390 system: BUG kmalloc-192 (Not tainted): Poison overwritten Disabling lock debugging due to kernel taint INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648 __slab_alloc.isra.47.constprop.56+0x5f6/0x658 kmem_cache_alloc_trace+0x106/0x408 call_usermodehelper_setup+0x70/0x128 call_usermodehelper+0x62/0x90 cgroup_release_agent+0x178/0x1c0 process_one_work+0x36e/0x680 worker_thread+0x2f0/0x4f8 kthread+0x10a/0x120 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648 __slab_free+0x94/0x560 kfree+0x364/0x3e0 call_usermodehelper_exec+0x110/0x1b8 cgroup_release_agent+0x178/0x1c0 process_one_work+0x36e/0x680 worker_thread+0x2f0/0x4f8 kthread+0x10a/0x120 kernel_thread_starter+0x6/0xc kernel_thread_starter+0x0/0xc There is a use-after-free bug on the subprocess_info structure allocated by the user mode helper. In case do_execve() returns with an error ____call_usermodehelper() stores the error code to sub_info->retval, but sub_info can already have been freed. Regarding UMH_NO_WAIT, the sub_info structure can be freed by __call_usermodehelper() before the worker thread returns from do_execve(), allowing memory corruption when do_execve() failed after exec_mmap() is called. Regarding UMH_WAIT_EXEC, the call to umh_complete() allows call_usermodehelper_exec() to continue which then frees sub_info. 
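The race described above comes down to ownership of the structure: the result has to be stored first, and only then may ownership be handed off, with whoever finds no waiter doing the free. What follows is a minimal single-threaded user-space sketch of that rule, not the kernel code. C11 atomic_exchange() stands in for the kernel's xchg(), and every name in it (completion_sketch, info_sketch, info_alloc, info_finish, demo_with_waiter) is hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct completion and struct subprocess_info. */
struct completion_sketch {
	bool done;
};

struct info_sketch {
	_Atomic(struct completion_sketch *) complete;
	int retval;
};

/* Allocate an info; pass NULL as waiter to model the "nobody waits" case. */
static inline struct info_sketch *info_alloc(struct completion_sketch *waiter)
{
	struct info_sketch *info = malloc(sizeof(*info));

	assert(info);
	atomic_init(&info->complete, waiter);
	info->retval = 0;
	return info;
}

/*
 * Store the result first, then hand off ownership.
 * Returns true when this call freed info because nobody was waiting.
 */
static inline bool info_finish(struct info_sketch *info, int retval)
{
	struct completion_sketch *waiter;

	info->retval = retval;		/* last store to info */
	waiter = atomic_exchange(&info->complete, NULL);
	if (waiter) {
		waiter->done = true;	/* the waiter reads retval, then frees */
		return false;
	}
	free(info);			/* no waiter: we own info */
	return true;
}

/* Waiter case: info must still be readable after info_finish(). */
static inline int demo_with_waiter(int retval)
{
	struct completion_sketch done = { false };
	struct info_sketch *info = info_alloc(&done);
	bool freed = info_finish(info, retval);
	int seen;

	assert(!freed && done.done);
	seen = info->retval;
	free(info);
	return seen;
}
```

The point of the ordering is that info_finish() never touches info after the exchange unless it has just learned that it is the owner.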
To fix this race the code needs to make sure that the call to call_usermodehelper_freeinfo() is always done after the last store to sub_info->retval. Signed-off-by: Martin Schwidefsky Reviewed-by: Oleg Nesterov Cc: Tetsuo Handa Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kmod.c | 76 +++++++++++++++++++++++++-------------------------- 1 file changed, 37 insertions(+), 39 deletions(-) diff --git a/kernel/kmod.c b/kernel/kmod.c index 8637e041a247..80f7a6d00519 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -196,12 +196,34 @@ int __request_module(bool wait, const char *fmt, ...) EXPORT_SYMBOL(__request_module); #endif /* CONFIG_MODULES */ +static void call_usermodehelper_freeinfo(struct subprocess_info *info) +{ + if (info->cleanup) + (*info->cleanup)(info); + kfree(info); +} + +static void umh_complete(struct subprocess_info *sub_info) +{ + struct completion *comp = xchg(&sub_info->complete, NULL); + /* + * See call_usermodehelper_exec(). If xchg() returns NULL + * we own sub_info, the UMH_KILLABLE caller has gone away + * or the caller used UMH_NO_WAIT. 
+ */ + if (comp) + complete(comp); + else + call_usermodehelper_freeinfo(sub_info); +} + /* * This is the task which runs the usermode application */ static int ____call_usermodehelper(void *data) { struct subprocess_info *sub_info = data; + int wait = sub_info->wait & ~UMH_KILLABLE; struct cred *new; int retval; @@ -221,7 +243,7 @@ static int ____call_usermodehelper(void *data) retval = -ENOMEM; new = prepare_kernel_cred(current); if (!new) - goto fail; + goto out; spin_lock(&umh_sysctl_lock); new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset); @@ -233,7 +255,7 @@ static int ____call_usermodehelper(void *data) retval = sub_info->init(sub_info, new); if (retval) { abort_creds(new); - goto fail; + goto out; } } @@ -242,12 +264,13 @@ static int ____call_usermodehelper(void *data) retval = do_execve(getname_kernel(sub_info->path), (const char __user *const __user *)sub_info->argv, (const char __user *const __user *)sub_info->envp); +out: + sub_info->retval = retval; + /* wait_for_helper() will call umh_complete if UHM_WAIT_PROC. */ + if (wait != UMH_WAIT_PROC) + umh_complete(sub_info); if (!retval) return 0; - - /* Exec failed? */ -fail: - sub_info->retval = retval; do_exit(0); } @@ -258,26 +281,6 @@ static int call_helper(void *data) return ____call_usermodehelper(data); } -static void call_usermodehelper_freeinfo(struct subprocess_info *info) -{ - if (info->cleanup) - (*info->cleanup)(info); - kfree(info); -} - -static void umh_complete(struct subprocess_info *sub_info) -{ - struct completion *comp = xchg(&sub_info->complete, NULL); - /* - * See call_usermodehelper_exec(). If xchg() returns NULL - * we own sub_info, the UMH_KILLABLE caller has gone away. - */ - if (comp) - complete(comp); - else - call_usermodehelper_freeinfo(sub_info); -} - /* Keventd can't block, but this (a child) can. 
*/ static int wait_for_helper(void *data) { @@ -336,18 +339,8 @@ static void __call_usermodehelper(struct work_struct *work) kmod_thread_locker = NULL; } - switch (wait) { - case UMH_NO_WAIT: - call_usermodehelper_freeinfo(sub_info); - break; - - case UMH_WAIT_PROC: - if (pid > 0) - break; - /* FALLTHROUGH */ - case UMH_WAIT_EXEC: - if (pid < 0) - sub_info->retval = pid; + if (pid < 0) { + sub_info->retval = pid; umh_complete(sub_info); } } @@ -588,7 +581,12 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait) goto out; } - sub_info->complete = &done; + /* + * Set the completion pointer only if there is a waiter. + * This makes it possible to use umh_complete to free + * the data structure in case of UMH_NO_WAIT. + */ + sub_info->complete = (wait == UMH_NO_WAIT) ? NULL : &done; sub_info->wait = wait; queue_work(khelper_wq, &sub_info->work); From eaf3a659086e1d1d85dc8fbce4007e3c9076e0b3 Mon Sep 17 00:00:00 2001 From: Marek Szyprowski Date: Wed, 29 Oct 2014 14:50:38 -0700 Subject: [PATCH 10/21] drivers/rtc/rtc-s3c.c: fix initialization failure without rtc source clock Fix unconditional initialization failure on non-exynos3250 SoCs. Commit df9e26d093d3 ("rtc: s3c: add support for RTC of Exynos3250 SoC") introduced rtc source clock support, but also made initialization fail on SoCs that don't need such a clock.
Signed-off-by: Marek Szyprowski Reviewed-by: Chanwoo Choi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/rtc/rtc-s3c.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/drivers/rtc/rtc-s3c.c b/drivers/rtc/rtc-s3c.c index a6b1252c9941..806072238c00 100644 --- a/drivers/rtc/rtc-s3c.c +++ b/drivers/rtc/rtc-s3c.c @@ -535,13 +535,15 @@ static int s3c_rtc_probe(struct platform_device *pdev) } clk_prepare_enable(info->rtc_clk); - info->rtc_src_clk = devm_clk_get(&pdev->dev, "rtc_src"); - if (IS_ERR(info->rtc_src_clk)) { - dev_err(&pdev->dev, "failed to find rtc source clock\n"); - return PTR_ERR(info->rtc_src_clk); + if (info->data->needs_src_clk) { + info->rtc_src_clk = devm_clk_get(&pdev->dev, "rtc_src"); + if (IS_ERR(info->rtc_src_clk)) { + dev_err(&pdev->dev, + "failed to find rtc source clock\n"); + return PTR_ERR(info->rtc_src_clk); + } + clk_prepare_enable(info->rtc_src_clk); } - clk_prepare_enable(info->rtc_src_clk); - /* check to see if everything is setup correctly */ if (info->data->enable) From 35dca71c1fad13616d9ea336c05730071793b63a Mon Sep 17 00:00:00 2001 From: Yasuaki Ishimatsu Date: Wed, 29 Oct 2014 14:50:40 -0700 Subject: [PATCH 11/21] memory-hotplug: clear pgdat which is allocated by bootmem in try_offline_node() When hot adding the same memory after hot removal, the following messages are shown: WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426() ... 
Call Trace: dump_stack+0x46/0x58 warn_slowpath_common+0x81/0xa0 warn_slowpath_null+0x1a/0x20 free_area_init_node+0x3fe/0x426 hotadd_new_pgdat+0x90/0x110 add_memory+0xd4/0x200 acpi_memory_device_add+0x1aa/0x289 acpi_bus_attach+0xfd/0x204 acpi_bus_attach+0x178/0x204 acpi_bus_scan+0x6a/0x90 acpi_device_hotplug+0xe8/0x418 acpi_hotplug_work_fn+0x1f/0x2b process_one_work+0x14e/0x3f0 worker_thread+0x11b/0x510 kthread+0xe1/0x100 ret_from_fork+0x7c/0xb0 The detailed explanation is as follows: When hot removing memory, pgdat is set to 0 in try_offline_node(). But if the pgdat is allocated by the bootmem allocator, the clearing step is skipped. And when hot adding the same memory, the uninitialized pgdat is reused. But free_area_init_node() checks whether pgdat is set to zero. As a result, free_area_init_node() hits WARN_ON(). This patch clears pgdat which is allocated by the bootmem allocator in try_offline_node(). Signed-off-by: Yasuaki Ishimatsu Cc: Zhang Zhen Cc: Wang Nan Cc: Tang Chen Reviewed-by: Toshi Kani Cc: Dave Hansen Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 5 ----- 1 file changed, 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 29d8693d0c61..252e1dbbed86 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1912,7 +1912,6 @@ void try_offline_node(int nid) unsigned long start_pfn = pgdat->node_start_pfn; unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages; unsigned long pfn; - struct page *pgdat_page = virt_to_page(pgdat); int i; for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) { @@ -1941,10 +1940,6 @@ void try_offline_node(int nid) node_set_offline(nid); unregister_one_node(nid); - if (!PageSlab(pgdat_page) && !PageCompound(pgdat_page)) - /* node data is allocated from boot memory */ - return; - /* free waittable in each zone */ for (i = 0; i < MAX_NR_ZONES; i++) { struct zone *zone = pgdat->node_zones + i; From 5a6e7599d3f8000496068b12276492311efad5ea
Mon Sep 17 00:00:00 2001 From: Pavel Machek Date: Wed, 29 Oct 2014 14:50:42 -0700 Subject: [PATCH 12/21] drivers/rtc/rtc-bq32k.c: fix register value Fix register value in bq32000 trickle charging. Mike reported that I'm using wrong value in one trickle-charging case, and after checking docs, I must admit he's right. Signed-off-by: Pavel Machek Reported-by: Mike Bremford Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/rtc/rtc-bq32k.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/rtc/rtc-bq32k.c b/drivers/rtc/rtc-bq32k.c index 314129e66d6e..92679df6d6e2 100644 --- a/drivers/rtc/rtc-bq32k.c +++ b/drivers/rtc/rtc-bq32k.c @@ -160,7 +160,7 @@ static int trickle_charger_of_init(struct device *dev, struct device_node *node) dev_err(dev, "bq32k: diode and resistor mismatch\n"); return -EINVAL; } - reg = 0x25; + reg = 0x45; break; default: From ea5d05b34aca25c066e0699512d0ffbd8ee6ac3e Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Wed, 29 Oct 2014 14:50:44 -0700 Subject: [PATCH 13/21] lib/bitmap.c: fix undefined shift in __bitmap_shift_{left|right}() If __bitmap_shift_left() or __bitmap_shift_right() are asked to shift by a multiple of BITS_PER_LONG, they will try to shift a long value by BITS_PER_LONG bits which is undefined. Change the functions to avoid the undefined shift. 
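The undefined shift described above can be illustrated in plain C outside the kernel. The sketch below is a minimal user-space illustration under stated assumptions, not the kernel code: WORD_BITS stands in for BITS_PER_LONG, and combine_right() and combine_left() are hypothetical helpers that merge two adjacent bitmap words. When rem is 0 (a shift by a whole number of words), the term shifted by WORD_BITS - rem would be a shift by the full word width, which C leaves undefined, so it is only evaluated when rem is non-zero.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the kernel's BITS_PER_LONG. */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Build one destination word of a right shift from two adjacent
 * source words, where rem is the bit offset within a word.
 */
static inline unsigned long combine_right(unsigned long upper,
					  unsigned long lower,
					  unsigned int rem)
{
	unsigned long word;

	assert(rem < WORD_BITS);

	word = lower >> rem;
	/* rem == 0 would mean upper << WORD_BITS, which is undefined. */
	if (rem)
		word |= upper << (WORD_BITS - rem);
	return word;
}

/* Same idea for a left shift: the lower word supplies the carried-in bits. */
static inline unsigned long combine_left(unsigned long upper,
					 unsigned long lower,
					 unsigned int rem)
{
	unsigned long word;

	assert(rem < WORD_BITS);

	word = upper << rem;
	/* rem == 0 would mean lower >> WORD_BITS, which is undefined. */
	if (rem)
		word |= lower >> (WORD_BITS - rem);
	return word;
}
```

With rem == 0 both helpers return the in-place word unchanged, which is the result a whole-word shift should produce.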
Coverity id: 1192175 Coverity id: 1192174 Signed-off-by: Jan Kara Cc: Rasmus Villemoes Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/bitmap.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index cd250a2e14cb..b499ab6ada29 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -131,7 +131,9 @@ void __bitmap_shift_right(unsigned long *dst, lower = src[off + k]; if (left && off + k == lim - 1) lower &= mask; - dst[k] = upper << (BITS_PER_LONG - rem) | lower >> rem; + dst[k] = lower >> rem; + if (rem) + dst[k] |= upper << (BITS_PER_LONG - rem); if (left && k == lim - 1) dst[k] &= mask; } @@ -172,7 +174,9 @@ void __bitmap_shift_left(unsigned long *dst, upper = src[k]; if (left && k == lim - 1) upper &= (1UL << left) - 1; - dst[k + off] = lower >> (BITS_PER_LONG - rem) | upper << rem; + dst[k + off] = upper << rem; + if (rem) + dst[k + off] |= lower >> (BITS_PER_LONG - rem); if (left && k + off == lim - 1) dst[k + off] &= (1UL << left) - 1; } From 3a3c02ecf7f2852f122d6d16fb9b3d9cb0c6f201 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 29 Oct 2014 14:50:46 -0700 Subject: [PATCH 14/21] mm: page-writeback: inline account_page_dirtied() into single caller A follow-up patch would have changed the call signature. To save the trouble, just fold it instead. 
Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: [3.17.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 1 - mm/page-writeback.c | 23 ++++------------------- 2 files changed, 4 insertions(+), 20 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 27eb1bfbe704..b46461116cd2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1235,7 +1235,6 @@ int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); void account_page_dirtied(struct page *page, struct address_space *mapping); -void account_page_writeback(struct page *page); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); int clear_page_dirty_for_io(struct page *page); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ff24c9d83112..ff6a5b07211e 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2115,23 +2115,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) } EXPORT_SYMBOL(account_page_dirtied); -/* - * Helper function for set_page_writeback family. - * - * The caller must hold mem_cgroup_begin/end_update_page_stat() lock - * while calling this function. - * See test_set_page_writeback for example. - * - * NOTE: Unlike account_page_dirtied this does not rely on being atomic - * wrt interrupts. - */ -void account_page_writeback(struct page *page) -{ - mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); - inc_zone_page_state(page, NR_WRITEBACK); -} -EXPORT_SYMBOL(account_page_writeback); - /* * For address_spaces which do not use buffers. Just tag the page as dirty in * its radix tree. 
@@ -2410,8 +2393,10 @@ int __test_set_page_writeback(struct page *page, bool keep_write) } else { ret = TestSetPageWriteback(page); } - if (!ret) - account_page_writeback(page); + if (!ret) { + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); + inc_zone_page_state(page, NR_WRITEBACK); + } mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags); return ret; From d7365e783edb858279be1d03f61bc8d5d3383d90 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 29 Oct 2014 14:50:48 -0700 Subject: [PATCH 15/21] mm: memcontrol: fix missed end-writeback page accounting Commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") changed page migration to uncharge the old page right away. The page is locked, unmapped, truncated, and off the LRU, but it could race with writeback ending, which then doesn't unaccount the page properly: test_clear_page_writeback() migration wait_on_page_writeback() TestClearPageWriteback() mem_cgroup_migrate() clear PCG_USED mem_cgroup_update_page_stat() if (PageCgroupUsed(pc)) decrease memcg pages under writeback release pc->mem_cgroup->move_lock The per-page statistics interface is heavily optimized to avoid a function call and a lookup_page_cgroup() in the file unmap fast path, which means it doesn't verify whether a page is still charged before clearing PageWriteback() and it has to do it in the stat update later. Rework it so that it looks up the page's memcg once at the beginning of the transaction and then uses it throughout. The charge will be verified before clearing PageWriteback() and migration can't uncharge the page as long as that is still set. The RCU lock will protect the memcg past uncharge. As far as losing the optimization goes, the following test results are from a microbenchmark that maps, faults, and unmaps a 4GB sparse file three times in a nested fashion, so that there are two negative passes that don't account but still go through the new transaction overhead. 
There is no actual difference: old: 33.195102545 seconds time elapsed ( +- 0.01% ) new: 33.199231369 seconds time elapsed ( +- 0.03% ) The time spent in page_remove_rmap()'s callees still adds up to the same, but the time spent in the function itself seems reduced: # Children Self Command Shared Object Symbol old: 0.12% 0.11% filemapstress [kernel.kallsyms] [k] page_remove_rmap new: 0.12% 0.08% filemapstress [kernel.kallsyms] [k] page_remove_rmap Signed-off-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Cc: [3.17.x] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 56 ++++++-------------- mm/memcontrol.c | 105 ++++++++++++++++++++----------------- mm/page-writeback.c | 22 ++++---- mm/rmap.c | 20 +++---- 4 files changed, 95 insertions(+), 108 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 19df5d857411..6b75640ef5ab 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -139,48 +139,23 @@ static inline bool mem_cgroup_disabled(void) return false; } -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, - unsigned long *flags); +struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, bool *locked, + unsigned long *flags); +void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked, + unsigned long flags); +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, + enum mem_cgroup_stat_index idx, int val); -extern atomic_t memcg_moving; - -static inline void mem_cgroup_begin_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) -{ - if (mem_cgroup_disabled()) - return; - rcu_read_lock(); - *locked = false; - if (atomic_read(&memcg_moving)) - __mem_cgroup_begin_update_page_stat(page, locked, flags); -} - -void __mem_cgroup_end_update_page_stat(struct page *page, - unsigned long *flags); -static inline void mem_cgroup_end_update_page_stat(struct page *page, - bool *locked, unsigned long 
*flags) -{ - if (mem_cgroup_disabled()) - return; - if (*locked) - __mem_cgroup_end_update_page_stat(page, flags); - rcu_read_unlock(); -} - -void mem_cgroup_update_page_stat(struct page *page, - enum mem_cgroup_stat_index idx, - int val); - -static inline void mem_cgroup_inc_page_stat(struct page *page, +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { - mem_cgroup_update_page_stat(page, idx, 1); + mem_cgroup_update_page_stat(memcg, idx, 1); } -static inline void mem_cgroup_dec_page_stat(struct page *page, +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { - mem_cgroup_update_page_stat(page, idx, -1); + mem_cgroup_update_page_stat(memcg, idx, -1); } unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, @@ -315,13 +290,14 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p) { } -static inline void mem_cgroup_begin_update_page_stat(struct page *page, +static inline struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, bool *locked, unsigned long *flags) { + return NULL; } -static inline void mem_cgroup_end_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) +static inline void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, + bool locked, unsigned long flags) { } @@ -343,12 +319,12 @@ static inline bool mem_cgroup_oom_synchronize(bool wait) return false; } -static inline void mem_cgroup_inc_page_stat(struct page *page, +static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { } -static inline void mem_cgroup_dec_page_stat(struct page *page, +static inline void mem_cgroup_dec_page_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx) { } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 23976fd885fd..d6ac0e33e150 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1536,12 +1536,8 @@ int mem_cgroup_swappiness(struct 
mem_cgroup *memcg) * start move here. */ -/* for quick checking without looking up memcg */ -atomic_t memcg_moving __read_mostly; - static void mem_cgroup_start_move(struct mem_cgroup *memcg) { - atomic_inc(&memcg_moving); atomic_inc(&memcg->moving_account); synchronize_rcu(); } @@ -1552,10 +1548,8 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg) * Now, mem_cgroup_clear_mc() may call this function with NULL. * We check NULL in callee rather than caller. */ - if (memcg) { - atomic_dec(&memcg_moving); + if (memcg) atomic_dec(&memcg->moving_account); - } } /* @@ -2204,41 +2198,52 @@ bool mem_cgroup_oom_synchronize(bool handle) return true; } -/* - * Used to update mapped file or writeback or other statistics. +/** + * mem_cgroup_begin_page_stat - begin a page state statistics transaction + * @page: page that is going to change accounted state + * @locked: &memcg->move_lock slowpath was taken + * @flags: IRQ-state flags for &memcg->move_lock * - * Notes: Race condition + * This function must mark the beginning of an accounted page state + * change to prevent double accounting when the page is concurrently + * being moved to another memcg: * - * Charging occurs during page instantiation, while the page is - * unmapped and locked in page migration, or while the page table is - * locked in THP migration. No race is possible. + * memcg = mem_cgroup_begin_page_stat(page, &locked, &flags); + * if (TestClearPageState(page)) + * mem_cgroup_update_page_stat(memcg, state, -1); + * mem_cgroup_end_page_stat(memcg, locked, flags); * - * Uncharge happens to pages with zero references, no race possible. + * The RCU lock is held throughout the transaction. The fast path can + * get away without acquiring the memcg->move_lock (@locked is false) + * because page moving starts with an RCU grace period. * - * Charge moving between groups is protected by checking mm->moving - * account and taking the move_lock in the slowpath. 
+ * The RCU lock also protects the memcg from being freed when the page + * state that is going to change is the only thing preventing the page + * from being uncharged. E.g. end-writeback clearing PageWriteback(), + * which allows migration to go ahead and uncharge the page before the + * account transaction might be complete. */ - -void __mem_cgroup_begin_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) +struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page, + bool *locked, + unsigned long *flags) { struct mem_cgroup *memcg; struct page_cgroup *pc; + rcu_read_lock(); + + if (mem_cgroup_disabled()) + return NULL; + pc = lookup_page_cgroup(page); again: memcg = pc->mem_cgroup; if (unlikely(!memcg || !PageCgroupUsed(pc))) - return; - /* - * If this memory cgroup is not under account moving, we don't - * need to take move_lock_mem_cgroup(). Because we already hold - * rcu_read_lock(), any calls to move_account will be delayed until - * rcu_read_unlock(). 
- */ - VM_BUG_ON(!rcu_read_lock_held()); + return NULL; + + *locked = false; if (atomic_read(&memcg->moving_account) <= 0) - return; + return memcg; move_lock_mem_cgroup(memcg, flags); if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { @@ -2246,36 +2251,40 @@ void __mem_cgroup_begin_update_page_stat(struct page *page, goto again; } *locked = true; + + return memcg; } -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags) +/** + * mem_cgroup_end_page_stat - finish a page state statistics transaction + * @memcg: the memcg that was accounted against + * @locked: value received from mem_cgroup_begin_page_stat() + * @flags: value received from mem_cgroup_begin_page_stat() + */ +void mem_cgroup_end_page_stat(struct mem_cgroup *memcg, bool locked, + unsigned long flags) { - struct page_cgroup *pc = lookup_page_cgroup(page); + if (memcg && locked) + move_unlock_mem_cgroup(memcg, &flags); - /* - * It's guaranteed that pc->mem_cgroup never changes while - * lock is held because a routine modifies pc->mem_cgroup - * should take move_lock_mem_cgroup(). - */ - move_unlock_mem_cgroup(pc->mem_cgroup, flags); + rcu_read_unlock(); } -void mem_cgroup_update_page_stat(struct page *page, +/** + * mem_cgroup_update_page_stat - update page state statistics + * @memcg: memcg to account against + * @idx: page state item to account + * @val: number of pages (positive or negative) + * + * See mem_cgroup_begin_page_stat() for locking requirements. 
+ */ +void mem_cgroup_update_page_stat(struct mem_cgroup *memcg, enum mem_cgroup_stat_index idx, int val) { - struct mem_cgroup *memcg; - struct page_cgroup *pc = lookup_page_cgroup(page); - unsigned long uninitialized_var(flags); - - if (mem_cgroup_disabled()) - return; - VM_BUG_ON(!rcu_read_lock_held()); - memcg = pc->mem_cgroup; - if (unlikely(!memcg || !PageCgroupUsed(pc))) - return; - this_cpu_add(memcg->stat->count[idx], val); + if (memcg) + this_cpu_add(memcg->stat->count[idx], val); } /* diff --git a/mm/page-writeback.c b/mm/page-writeback.c index ff6a5b07211e..19ceae87522d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2327,11 +2327,12 @@ EXPORT_SYMBOL(clear_page_dirty_for_io); int test_clear_page_writeback(struct page *page) { struct address_space *mapping = page_mapping(page); - int ret; - bool locked; unsigned long memcg_flags; + struct mem_cgroup *memcg; + bool locked; + int ret; - mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags); + memcg = mem_cgroup_begin_page_stat(page, &locked, &memcg_flags); if (mapping) { struct backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; @@ -2352,22 +2353,23 @@ int test_clear_page_writeback(struct page *page) ret = TestClearPageWriteback(page); } if (ret) { - mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); dec_zone_page_state(page, NR_WRITEBACK); inc_zone_page_state(page, NR_WRITTEN); } - mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags); + mem_cgroup_end_page_stat(memcg, locked, memcg_flags); return ret; } int __test_set_page_writeback(struct page *page, bool keep_write) { struct address_space *mapping = page_mapping(page); - int ret; - bool locked; unsigned long memcg_flags; + struct mem_cgroup *memcg; + bool locked; + int ret; - mem_cgroup_begin_update_page_stat(page, &locked, &memcg_flags); + memcg = mem_cgroup_begin_page_stat(page, &locked, &memcg_flags); if (mapping) { struct 
backing_dev_info *bdi = mapping->backing_dev_info; unsigned long flags; @@ -2394,10 +2396,10 @@ int __test_set_page_writeback(struct page *page, bool keep_write) ret = TestSetPageWriteback(page); } if (!ret) { - mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); + mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); inc_zone_page_state(page, NR_WRITEBACK); } - mem_cgroup_end_update_page_stat(page, &locked, &memcg_flags); + mem_cgroup_end_page_stat(memcg, locked, memcg_flags); return ret; } diff --git a/mm/rmap.c b/mm/rmap.c index 116a5053415b..f574046f77d4 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1042,15 +1042,16 @@ void page_add_new_anon_rmap(struct page *page, */ void page_add_file_rmap(struct page *page) { - bool locked; + struct mem_cgroup *memcg; unsigned long flags; + bool locked; - mem_cgroup_begin_update_page_stat(page, &locked, &flags); + memcg = mem_cgroup_begin_page_stat(page, &locked, &flags); if (atomic_inc_and_test(&page->_mapcount)) { __inc_zone_page_state(page, NR_FILE_MAPPED); - mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED); + mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); } - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_end_page_stat(memcg, locked, flags); } /** @@ -1061,9 +1062,10 @@ void page_add_file_rmap(struct page *page) */ void page_remove_rmap(struct page *page) { + struct mem_cgroup *uninitialized_var(memcg); bool anon = PageAnon(page); - bool locked; unsigned long flags; + bool locked; /* * The anon case has no mem_cgroup page_stat to update; but may @@ -1071,7 +1073,7 @@ void page_remove_rmap(struct page *page) * we hold the lock against page_stat move: so avoid it on anon. */ if (!anon) - mem_cgroup_begin_update_page_stat(page, &locked, &flags); + memcg = mem_cgroup_begin_page_stat(page, &locked, &flags); /* page still mapped by someone else? 
*/ if (!atomic_add_negative(-1, &page->_mapcount)) @@ -1096,8 +1098,7 @@ void page_remove_rmap(struct page *page) -hpage_nr_pages(page)); } else { __dec_zone_page_state(page, NR_FILE_MAPPED); - mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED); - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); } if (unlikely(PageMlocked(page))) clear_page_mlock(page); @@ -1110,10 +1111,9 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ - return; out: if (!anon) - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_end_page_stat(memcg, locked, flags); } /* From 8186eb6a799e4e32f984b55858d8e393938be0c1 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 29 Oct 2014 14:50:51 -0700 Subject: [PATCH 16/21] mm: rmap: split out page_remove_file_rmap() page_remove_rmap() has too many branches on PageAnon() and is hard to follow. Move the file part into a separate function. Signed-off-by: Johannes Weiner Reviewed-by: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 78 ++++++++++++++++++++++++++++++++----------------------- 1 file changed, 46 insertions(+), 32 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index f574046f77d4..19886fb2f13a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1054,6 +1054,36 @@ void page_add_file_rmap(struct page *page) mem_cgroup_end_page_stat(memcg, locked, flags); } +static void page_remove_file_rmap(struct page *page) +{ + struct mem_cgroup *memcg; + unsigned long flags; + bool locked; + + memcg = mem_cgroup_begin_page_stat(page, &locked, &flags); + + /* page still mapped by someone else? */ + if (!atomic_add_negative(-1, &page->_mapcount)) + goto out; + + /* Hugepages are not counted in NR_FILE_MAPPED for now. 
*/ + if (unlikely(PageHuge(page))) + goto out; + + /* + * We use the irq-unsafe __{inc|mod}_zone_page_stat because + * these counters are not modified in interrupt context, and + * pte lock(a spinlock) is held, which implies preemption disabled. + */ + __dec_zone_page_state(page, NR_FILE_MAPPED); + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); + + if (unlikely(PageMlocked(page))) + clear_page_mlock(page); +out: + mem_cgroup_end_page_stat(memcg, locked, flags); +} + /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from @@ -1062,46 +1092,33 @@ void page_add_file_rmap(struct page *page) */ void page_remove_rmap(struct page *page) { - struct mem_cgroup *uninitialized_var(memcg); - bool anon = PageAnon(page); - unsigned long flags; - bool locked; - - /* - * The anon case has no mem_cgroup page_stat to update; but may - * uncharge_page() below, where the lock ordering can deadlock if - * we hold the lock against page_stat move: so avoid it on anon. - */ - if (!anon) - memcg = mem_cgroup_begin_page_stat(page, &locked, &flags); + if (!PageAnon(page)) { + page_remove_file_rmap(page); + return; + } /* page still mapped by someone else? */ if (!atomic_add_negative(-1, &page->_mapcount)) - goto out; + return; + + /* Hugepages are not counted in NR_ANON_PAGES for now. */ + if (unlikely(PageHuge(page))) + return; /* - * Hugepages are not counted in NR_ANON_PAGES nor NR_FILE_MAPPED - * and not charged by memcg for now. - * * We use the irq-unsafe __{inc|mod}_zone_page_stat because * these counters are not modified in interrupt context, and - * these counters are not modified in interrupt context, and * pte lock(a spinlock) is held, which implies preemption disabled. 
*/ - if (unlikely(PageHuge(page))) - goto out; - if (anon) { - if (PageTransHuge(page)) - __dec_zone_page_state(page, - NR_ANON_TRANSPARENT_HUGEPAGES); - __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, - -hpage_nr_pages(page)); - } else { - __dec_zone_page_state(page, NR_FILE_MAPPED); - mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_FILE_MAPPED); - } + if (PageTransHuge(page)) + __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES); + + __mod_zone_page_state(page_zone(page), NR_ANON_PAGES, + -hpage_nr_pages(page)); + if (unlikely(PageMlocked(page))) clear_page_mlock(page); + /* * It would be tidy to reset the PageAnon mapping here, * but that might overwrite a racing page_add_anon_rmap @@ -1111,9 +1128,6 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ -out: - if (!anon) - mem_cgroup_end_page_stat(memcg, locked, flags); } /* From d3556babd7facb8fbc596bada0d67139e3b22330 Mon Sep 17 00:00:00 2001 From: Richard Weinberger Date: Wed, 29 Oct 2014 14:50:53 -0700 Subject: [PATCH 17/21] ocfs2: fix d_splice_alias() return code checking d_splice_alias() can return a valid dentry, NULL or an ERR_PTR. Currently the code does not check for ERR_PTR and will cause an oops in ocfs2_dentry_attach_lock(). Fix this by using IS_ERR_OR_NULL().
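The three possible return values can be told apart as in the following user-space sketch. It is only an illustration: the `sketch_*` helpers are simplified stand-ins written for this example (they are not the kernel's `ERR_PTR()`/`IS_ERR_OR_NULL()` themselves), and `sketch_pick_dentry()` is a hypothetical function that mirrors the fixed branch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: like the kernel, reserve the top 4095 addresses for errors. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (stand-in for ERR_PTR()). */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if the pointer value lies in the reserved error range. */
static inline int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* True for NULL as well as for encoded errors. */
static inline int sketch_is_err_or_null(const void *ptr)
{
	return ptr == NULL || sketch_is_err(ptr);
}

/* Hypothetical helper: only a real dentry replaces the original one. */
static const void *sketch_pick_dentry(const void *dentry, const void *ret)
{
	if (!sketch_is_err_or_null(ret))
		dentry = ret;
	return dentry;
}

/* Walk through the three cases named in the commit message. */
void sketch_dentry_demo(void)
{
	int orig = 1, alias = 2;

	/* NULL: keep the original dentry. */
	assert(sketch_pick_dentry(&orig, NULL) == &orig);
	/* ERR_PTR: a plain "if (ret)" would have taken this as a dentry. */
	assert(sketch_pick_dentry(&orig, sketch_err_ptr(-ENOMEM)) == &orig);
	/* Valid dentry: use the alias. */
	assert(sketch_pick_dentry(&orig, &alias) == &alias);
}
```

With the old `if (ret)` test, the second case would have stored the encoded error in `dentry` and passed it on to the lock-attach call.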
Signed-off-by: Richard Weinberger Cc: Mark Fasheh Cc: Joel Becker Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/ocfs2/namei.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 8add6f1030d7..b931e04e3388 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -158,7 +158,7 @@ static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry, * NOTE: This dentry already has ->d_op set from * ocfs2_get_parent() and ocfs2_get_dentry() */ - if (ret) + if (!IS_ERR_OR_NULL(ret)) dentry = ret; status = ocfs2_dentry_attach_lock(dentry, inode, From 8aba7e0a2c02355f9a7dec629635cb7093fe0508 Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Wed, 29 Oct 2014 14:50:55 -0700 Subject: [PATCH 18/21] mm/slab_common: don't check for duplicate cache names
The SLUB cache merges caches with the same size and alignment and there was a long-standing bug with this behavior:
- create the cache named "foo"
- create the cache named "bar" (which is merged with "foo")
- delete the cache named "foo" (but it stays allocated because "bar" uses it)
- create the cache named "foo" again - it fails because the name "foo" is already used
That bug was fixed in commit 694617474e33 ("slab_common: fix the check for duplicate slab names") by not warning on duplicate cache names when the SLUB subsystem is used. Recently, cache merging was implemented with the SLAB subsystem too, in 12220dea07f1 ("mm/slab: support slab merge"). Therefore we need to stop checking for duplicate names even for the SLAB subsystem. This patch fixes the bug by removing the check.
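The failing sequence can be replayed with a toy model. This is a sketch under loud assumptions: `toy_cache`, `toy_cache_create()` and friends are hypothetical names invented here, merging is reduced to "same size", and nothing below is the real slab code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CACHES 8

/* Toy merged cache: one entry that several creators may share. */
struct toy_cache {
	const char *name;	/* name given by the first creator only */
	size_t size;
	int refcount;		/* 0 means the slot is free */
};

static struct toy_cache toy_caches[TOY_MAX_CACHES];

/* Forget all caches so each scenario starts clean. */
void toy_reset(void)
{
	memset(toy_caches, 0, sizeof(toy_caches));
}

/* Create or merge a cache; check_names turns the duplicate-name check on. */
struct toy_cache *toy_cache_create(const char *name, size_t size,
				   int check_names)
{
	struct toy_cache *free_slot = NULL;
	size_t i;

	if (check_names) {
		for (i = 0; i < TOY_MAX_CACHES; i++)
			if (toy_caches[i].refcount > 0 &&
			    !strcmp(toy_caches[i].name, name))
				return NULL;	/* "Cache name already exists." */
	}

	for (i = 0; i < TOY_MAX_CACHES; i++) {
		if (toy_caches[i].refcount > 0 && toy_caches[i].size == size) {
			toy_caches[i].refcount++;	/* merge with existing */
			return &toy_caches[i];
		}
		if (toy_caches[i].refcount == 0 && !free_slot)
			free_slot = &toy_caches[i];
	}

	if (!free_slot)
		return NULL;
	free_slot->name = name;
	free_slot->size = size;
	free_slot->refcount = 1;
	return free_slot;
}

/* Drop one user; the entry survives while others still hold it. */
void toy_cache_destroy(struct toy_cache *c)
{
	if (c->refcount > 0)
		c->refcount--;
}

/* Replay the four steps; returns 1 if "foo" can be created again. */
int toy_recreate_foo_succeeds(int check_names)
{
	struct toy_cache *foo, *bar;

	toy_reset();
	foo = toy_cache_create("foo", 64, check_names);
	bar = toy_cache_create("bar", 64, check_names);
	assert(foo != NULL && bar == foo);	/* "bar" merged into "foo" */
	toy_cache_destroy(foo);			/* "bar" keeps it allocated */
	return toy_cache_create("foo", 64, check_names) != NULL;
}
```

In this model `toy_recreate_foo_succeeds(1)` returns 0, because the surviving entry still carries the name "foo", while `toy_recreate_foo_succeeds(0)` returns 1.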
Signed-off-by: Mikulas Patocka Acked-by: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/slab_common.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 3a6e0cfdf03a..406944207b61 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -93,16 +93,6 @@ static int kmem_cache_sanity_check(const char *name, size_t size) s->object_size); continue; } - -#if !defined(CONFIG_SLUB) - if (!strcmp(s->name, name)) { - pr_err("%s (%s): Cache name already exists.\n", - __func__, name); - dump_stack(); - s = NULL; - return -EINVAL; - } -#endif } WARN_ON(strchr(name, ' ')); /* It confuses parsers */ From 5a99e95b8d1cd47f6feddcdca6c71d22060df8a2 Mon Sep 17 00:00:00 2001 From: Weijie Yang Date: Wed, 29 Oct 2014 14:50:57 -0700 Subject: [PATCH 19/21] zram: avoid NULL pointer access in concurrent situation
There is a rare NULL pointer bug in mem_used_total_show() and mem_used_max_store() in a concurrent situation like this: zram is not initialized, process A is a mem_used_total reader which runs periodically, while process B tries to init zram.
	process A				process B
	access meta, get a NULL value
						init zram, done
	init_done() is true
	access meta->mem_pool, get a NULL pointer BUG
This patch fixes the issue by reading zram->meta only after init_done() has been checked under init_lock.
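The interleaving can be replayed deterministically in a single thread. The sketch below is an assumption-heavy toy: the `toy_*` types and functions are hypothetical, the lock is represented only by comments, and the old reader is split into two steps so that the init can be placed between them.

```c
#include <assert.h>
#include <stddef.h>

struct toy_meta {
	unsigned long pages;
};

struct toy_zram {
	struct toy_meta *meta;	/* NULL until the device is initialised */
	int init_done;
};

/* Old pattern, step 1: meta is read before the lock is taken. */
struct toy_meta *toy_old_reader_snapshot(const struct toy_zram *z)
{
	return z->meta;
}

/* Old pattern, step 2: returns -1 where the real code would hit NULL. */
long toy_old_reader_finish(const struct toy_zram *z,
			   const struct toy_meta *snap)
{
	long val = 0;

	/* the read lock would be taken here */
	if (z->init_done) {
		if (!snap)
			return -1;	/* stale NULL snapshot */
		val = (long)snap->pages;
	}
	/* the read lock would be released here */
	return val;
}

/* Fixed pattern: meta is loaded only after the init check. */
long toy_new_reader(const struct toy_zram *z)
{
	long val = 0;

	/* the read lock would be taken here */
	if (z->init_done) {
		const struct toy_meta *meta = z->meta;

		val = (long)meta->pages;
	}
	/* the read lock would be released here */
	return val;
}

/* Process B: publish meta and mark the device initialised. */
void toy_init(struct toy_zram *z, struct toy_meta *m)
{
	z->meta = m;
	z->init_done = 1;
}

/* Replay the timeline from the commit message. */
void toy_zram_demo(void)
{
	struct toy_meta m = { 42 };
	struct toy_zram z = { NULL, 0 };
	struct toy_meta *stale;

	stale = toy_old_reader_snapshot(&z);	/* A: gets NULL */
	toy_init(&z, &m);			/* B: init zram, done */
	assert(toy_old_reader_finish(&z, stale) == -1);
	assert(toy_new_reader(&z) == 42);
}
```

The point of the fixed pattern is that the pointer load and the init check happen on the same side of the lock, so the reader can never pair a pre-init pointer with a post-init flag.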
Signed-off-by: Weijie Yang Acked-by: Minchan Kim Acked-by: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 0e63e8aa8279..2ad0b5bce44b 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -99,11 +99,12 @@ static ssize_t mem_used_total_show(struct device *dev, { u64 val = 0; struct zram *zram = dev_to_zram(dev); - struct zram_meta *meta = zram->meta; down_read(&zram->init_lock); - if (init_done(zram)) + if (init_done(zram)) { + struct zram_meta *meta = zram->meta; val = zs_get_total_pages(meta->mem_pool); + } up_read(&zram->init_lock); return scnprintf(buf, PAGE_SIZE, "%llu\n", val << PAGE_SHIFT); @@ -173,16 +174,17 @@ static ssize_t mem_used_max_store(struct device *dev, int err; unsigned long val; struct zram *zram = dev_to_zram(dev); - struct zram_meta *meta = zram->meta; err = kstrtoul(buf, 10, &val); if (err || val != 0) return -EINVAL; down_read(&zram->init_lock); - if (init_done(zram)) + if (init_done(zram)) { + struct zram_meta *meta = zram->meta; atomic_long_set(&zram->stats.max_used_pages, zs_get_total_pages(meta->mem_pool)); + } up_read(&zram->init_lock); return len; From 5417421b270229bfce0795ccc99a4b481e4954ca Mon Sep 17 00:00:00 2001 From: Andriy Skulysh Date: Wed, 29 Oct 2014 14:50:59 -0700 Subject: [PATCH 20/21] sh: fix sh770x SCIF memory regions Resources scif1_resources & scif2_resources overlap. Actual SCIF region size is 0x10. 
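The overlap can be checked with a little arithmetic. The sketch below assumes the usual convention that a memory resource defined by (start, size) ends at start + size - 1 inclusive; `toy_res`, `toy_res_mem()` and `toy_res_overlap()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy resource with an inclusive end address. */
struct toy_res {
	uint32_t start;
	uint32_t end;
};

/* Build a region from a start address and a size in bytes. */
static struct toy_res toy_res_mem(uint32_t start, uint32_t size)
{
	struct toy_res r = { start, start + size - 1 };

	return r;
}

/* Two inclusive ranges overlap if each starts before the other ends. */
static int toy_res_overlap(struct toy_res a, struct toy_res b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* Compare the old and new sizes for the scif1 and scif2 base addresses. */
void toy_scif_demo(void)
{
	/* Size 0x100: 0xa4000140..0xa400023f covers 0xa4000150. */
	assert(toy_res_overlap(toy_res_mem(0xa4000150u, 0x100),
			       toy_res_mem(0xa4000140u, 0x100)));
	/* Size 0x10: 0xa4000140..0xa400014f and 0xa4000150..0xa400015f. */
	assert(!toy_res_overlap(toy_res_mem(0xa4000150u, 0x10),
				toy_res_mem(0xa4000140u, 0x10)));
}
```

The two bases are exactly 0x10 apart, so 0x10 is the largest size at which the two regions stay disjoint.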
This is a regression from commit d850acf975be ("sh: Declare SCIF register base and IRQ as resources"). Signed-off-by: Andriy Skulysh Acked-by: Laurent Pinchart Cc: Geert Uytterhoeven Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/sh/kernel/cpu/sh3/setup-sh770x.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/sh/kernel/cpu/sh3/setup-sh770x.c b/arch/sh/kernel/cpu/sh3/setup-sh770x.c index 9139d14b9c53..538c10db3537 100644 --- a/arch/sh/kernel/cpu/sh3/setup-sh770x.c +++ b/arch/sh/kernel/cpu/sh3/setup-sh770x.c @@ -118,7 +118,7 @@ static struct plat_sci_port scif0_platform_data = { }; static struct resource scif0_resources[] = { - DEFINE_RES_MEM(0xfffffe80, 0x100), + DEFINE_RES_MEM(0xfffffe80, 0x10), DEFINE_RES_IRQ(evt2irq(0x4e0)), }; @@ -143,7 +143,7 @@ static struct plat_sci_port scif1_platform_data = { }; static struct resource scif1_resources[] = { - DEFINE_RES_MEM(0xa4000150, 0x100), + DEFINE_RES_MEM(0xa4000150, 0x10), DEFINE_RES_IRQ(evt2irq(0x900)), }; @@ -169,7 +169,7 @@ static struct plat_sci_port scif2_platform_data = { }; static struct resource scif2_resources[] = { - DEFINE_RES_MEM(0xa4000140, 0x100), + DEFINE_RES_MEM(0xa4000140, 0x10), DEFINE_RES_IRQ(evt2irq(0x880)), }; From 4d88e6f7d5ffc84e6094a47925870f4a130555c2 Mon Sep 17 00:00:00 2001 From: Konstantin Khlebnikov Date: Wed, 29 Oct 2014 14:51:02 -0700 Subject: [PATCH 21/21] mm/balloon_compaction: fix deflation when compaction is disabled If CONFIG_BALLOON_COMPACTION=n, balloon_page_insert() does not link pages with the balloon and does not set the PagePrivate flag. As a result, balloon_page_dequeue() cannot get any pages because it thinks that all of them are isolated. Without balloon compaction nobody can isolate ballooned pages, so it is safe to compile this check out in that configuration. Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management").
Signed-off-by: Konstantin Khlebnikov Reported-by: Matt Mullins Cc: [3.17] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/balloon_compaction.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c index b3cbe19f71b5..fcad8322ef36 100644 --- a/mm/balloon_compaction.c +++ b/mm/balloon_compaction.c @@ -68,11 +68,13 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info) * to be released by the balloon driver. */ if (trylock_page(page)) { +#ifdef CONFIG_BALLOON_COMPACTION if (!PagePrivate(page)) { /* raced with isolation */ unlock_page(page); continue; } +#endif spin_lock_irqsave(&b_dev_info->pages_lock, flags); balloon_page_delete(page); __count_vm_event(BALLOON_DEFLATE);