Commit graph

257 commits

Author SHA1 Message Date
Matthew Wilcox (Oracle) 84fbbe2189 mm/rmap: Constify the rmap_walk_control argument
The rmap walking functions do not modify the rmap_walk_control, and
page_idle_clear_pte_refs() takes advantage of that to move construction
of the rmap_walk_control to compile time.  This lets us remove an
unclean cast.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21 13:01:35 -04:00
Matthew Wilcox (Oracle) 2f031c6f04 mm/rmap: Convert rmap_walk() to take a folio
This ripples all the way through to every calling and called function
from rmap.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21 13:01:35 -04:00
Matthew Wilcox (Oracle) e05b34539d mm: Turn page_anon_vma() into folio_anon_vma()
Move the prototype from mm.h to mm/internal.h and convert all callers
to pass a folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21 13:01:35 -04:00
Matthew Wilcox (Oracle) eed05e54d2 mm: Add DEFINE_PAGE_VMA_WALK and DEFINE_FOLIO_VMA_WALK
Instead of declaring a struct page_vma_mapped_walk directly,
use these helpers to allow us to transition to a PFN approach in the
following patches.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-03-21 12:59:02 -04:00
Hugh Dickins cea86fe246 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points.  Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-02-17 11:56:48 -05:00
Nanyong Sun e1c63e110f mm: ksm: fix use-after-free kasan report in ksm_might_need_to_copy
When under the stress of swapping in/out with KSM enabled, there is a
low probability that kasan reports the BUG of use-after-free in
ksm_might_need_to_copy() when do swap in.  The freed object is the
anon_vma got from page_anon_vma(page).

It is because a swapcache page associated with one anon_vma now needed
for another anon_vma, but the page's original vma was unmapped and the
anon_vma was freed.  In this case the if condition below always return
false and then alloc a new page to copy.  Swapin process then use the
new page and can continue to run well, so this is harmless actually.

      } else if (anon_vma->root == vma->anon_vma->root &&
                 page->index == linear_page_index(vma, address)) {

This patch exchange the order of above two judgment statement to avoid
the kasan warning.  Let cpu run "page->index == linear_page_index(vma,
address)" firstly and return false basically to skip the read of
anon_vma->root which may trigger the kasan use-after-free warning:

    ==================================================================
    BUG: KASAN: use-after-free in ksm_might_need_to_copy+0x12e/0x5b0
    Read of size 8 at addr ffff88be9977dbd0 by task khugepaged/694

     CPU: 8 PID: 694 Comm: khugepaged Kdump: loaded Tainted: G OE - 4.18.0.x86_64
     Hardware name: 1288H V5/BC11SPSC0, BIOS 7.93 01/14/2021
    Call Trace:
     dump_stack+0xf1/0x19b
     print_address_description+0x70/0x360
     kasan_report+0x1b2/0x330
     ksm_might_need_to_copy+0x12e/0x5b0
     do_swap_page+0x452/0xe70
     __collapse_huge_page_swapin+0x24b/0x720
     khugepaged_scan_pmd+0xcae/0x1ff0
     khugepaged+0x8ee/0xd70
     kthread+0x1a2/0x1d0
     ret_from_fork+0x1f/0x40

    Allocated by task 2306153:
     kasan_kmalloc+0xa0/0xd0
     kmem_cache_alloc+0xc0/0x1c0
     anon_vma_clone+0xf7/0x380
     anon_vma_fork+0xc0/0x390
     copy_process+0x447b/0x4810
     _do_fork+0x118/0x620
     do_syscall_64+0x112/0x360
     entry_SYSCALL_64_after_hwframe+0x65/0xca

    Freed by task 2306242:
     __kasan_slab_free+0x130/0x180
     kmem_cache_free+0x78/0x1d0
     unlink_anon_vmas+0x19c/0x4a0
     free_pgtables+0x137/0x1b0
     exit_mmap+0x133/0x320
     mmput+0x15e/0x390
     do_exit+0x8c5/0x1210
     do_group_exit+0xb5/0x1b0
     __x64_sys_exit_group+0x21/0x30
     do_syscall_64+0x112/0x360
     entry_SYSCALL_64_after_hwframe+0x65/0xca

    The buggy address belongs to the object at ffff88be9977dba0
     which belongs to the cache anon_vma_chain of size 64
    The buggy address is located 48 bytes inside of
     64-byte region [ffff88be9977dba0, ffff88be9977dbe0)
    The buggy address belongs to the page:
    page:ffffea00fa65df40 count:1 mapcount:0 mapping:ffff888107717800 index:0x0
    flags: 0x17ffffc0000100(slab)
    ==================================================================

Link: https://lkml.kernel.org/r/20211202102940.1069634-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:31 +02:00
Arnd Bergmann 36090def7b mm: move tlb_flush_pending inline helpers to mm_inline.h
linux/mm_types.h should only define structure definitions, to make it
cheap to include elsewhere.  The atomic_t helper function definitions
are particularly large, so it's better to move the helpers using those
into the existing linux/mm_inline.h and only include that where needed.

As a follow-up, we may want to go through all the indirect includes in
mm_types.h and reduce them as much as possible.

Link: https://lkml.kernel.org/r/20211207125710.2503446-2-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Colin Cross <ccross@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:27 +02:00
Matthew Wilcox (Oracle) 19138349ed mm/migrate: Add folio_migrate_flags()
Turn migrate_page_states() into a wrapper around folio_migrate_flags().
Also convert two functions only called from folio_migrate_flags() to
be folio-based.  ksm_migrate_page() becomes folio_migrate_ksm() and
copy_page_owner() becomes folio_copy_owner().  folio_migrate_flags()
alone shrinks by two thirds -- 1967 bytes down to 642 bytes.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-10-18 07:49:39 -04:00
Matthew Wilcox (Oracle) 8f425e4ed0 mm/memcg: Convert mem_cgroup_charge() to take a folio
Convert all callers of mem_cgroup_charge() to call page_folio() on the
page they're currently passing in.  Many of them will be converted to
use folios themselves soon.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
2021-09-27 09:27:31 -04:00
Nick Desaulniers adac17e3f6 mm/ksm: remove old GCC 4.9+ check
The minimum supported version of GCC has been raised to GCC 5.1.

Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-13 10:18:28 -07:00
Zhansaya Bagdauletkyzy 584ff0dfb0 mm: KSM: fix data type
ksm_stable_node_chains_prune_millisecs is declared as int, but in
stable__node_chains_prune_millisecs_store(), it can store values up to
UINT_MAX.  Change its type to unsigned int.

Link: https://lkml.kernel.org/r/20210806111351.GA71845@asus
Signed-off-by: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:18 -07:00
Liam Howlett ff69fb8100 mm/ksm: use vma_lookup() in find_mergeable_vma()
Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Link: https://lkml.kernel.org/r/20210521174745.2219620-19-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-06-29 10:53:52 -07:00
Hugh Dickins 628622904b ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"
This reverts commit 3e96b6a2e9.  General
Protection Fault in rmap_walk_ksm() under memory pressure:
remove_rmap_item_from_tree() needs to take page lock, of course.

Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2105092253500.1127@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-14 19:41:32 -07:00
Ingo Molnar f0953a1bba mm: fix typos in comments
Fix ~94 single-word typos in locking code comments, plus a few
very obvious grammar mistakes.

Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:35 -07:00
David Hildenbrand bbcd53c960 drivers/char: remove /dev/kmem for good
Patch series "drivers/char: remove /dev/kmem for good".

Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
memory ballooning, I started questioning the existence of /dev/kmem.

Comparing it with the /proc/kcore implementation, it does not seem to be
able to deal with things like

a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
  -> kern_addr_valid(). virt_addr_valid() is not sufficient.

b) Special cases like gart aperture memory that is not to be touched
  -> mem_pfn_is_ram()

Unless I am missing something, it's at least broken in some cases and might
fault/crash the machine.

Looks like its existence has been questioned before in 2005 and 2010 [1],
after ~11 additional years, it might make sense to revive the discussion.

CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
mistake?).  All distributions disable it: in Ubuntu it has been disabled
for more than 10 years, in Debian since 2.6.31, in Fedora at least
starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
15sp2, and OpenSUSE has it disabled as well.

1) /dev/kmem was popular for rootkits [2] before it got disabled
   basically everywhere. Ubuntu documents [3] "There is no modern user of
   /dev/kmem any more beyond attackers using it to load kernel rootkits.".
   RHEL documents in a BZ [5] "it served no practical purpose other than to
   serve as a potential security problem or to enable binary module drivers
   to access structures/functions they shouldn't be touching"

2) /proc/kcore is a decent interface to have a controlled way to read
   kernel memory for debugging puposes. (will need some extensions to
   deal with memory offlining/unplug, memory ballooning, and poisoned
   pages, though)

3) It might be useful for corner case debugging [1]. KDB/KGDB might be a
   better fit, especially, to write random memory; harder to shoot
   yourself into the foot.

4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
   to be incompatible with 64bit [1]. For educational purposes,
   /proc/kcore might be used to monitor value updates -- or older
   kernels can be used.

5) It's broken on arm64, and therefore, completely disabled there.

Looks like it's essentially unused and has been replaced by better
suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
just remove it.

[1] https://lwn.net/Articles/147901/
[2] https://www.linuxjournal.com/article/10505
[3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
[4] https://sourceforge.net/projects/kme/
[5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Alexander A. Klimov" <grandmaster@al2klimov.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: Andrey Zhizhikin <andrey.zhizhikin@leica-geosystems.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Corentin Labbe <clabbe@baylibre.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Gregory Clement <gregory.clement@bootlin.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: huang ying <huang.ying.caritas@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: James Troup <james.troup@canonical.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <kasong@redhat.com>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Cc: Liviu Dudau <liviu.dudau@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: openrisc@lists.librecores.org
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Pavel Machek (CIP)" <pavel@denx.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
Cc: sparclinux@vger.kernel.org
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Theodore Dubois <tblodt@icloud.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: William Cohen <wcohen@redhat.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-07 00:26:34 -07:00
Chengyang Fan 420be4edef mm/ksm: remove unused parameter from remove_trailing_rmap_items()
Since commit 6514d511db ("ksm: singly-linked rmap_list") was merged,
remove_trailing_rmap_items() doesn't use the 'mm_slot' parameter.  So
remove it, and update caller accordingly.

Link: https://lkml.kernel.org/r/20210330121320.1693474-1-cy.fan@huawei.com
Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:25 -07:00
Miaohe Lin c89a384e25 ksm: fix potential missing rmap_item for stable_node
When removing rmap_item from stable tree, STABLE_FLAG of rmap_item is
cleared with head reserved.  So the following scenario might happen: For
ksm page with rmap_item1:

cmp_and_merge_page
  stable_node->head = &migrate_nodes;
  remove_rmap_item_from_tree, but head still equal to stable_node;
  try_to_merge_with_ksm_page failed;
  return;

For the same ksm page with rmap_item2, stable node migration succeed this
time.  The stable_node->head does not equal to migrate_nodes now.  For ksm
page with rmap_item1 again:

cmp_and_merge_page
 stable_node->head != &migrate_nodes && rmap_item->head == stable_node
 return;

We would miss the rmap_item for stable_node and might result in failed
rmap_walk_ksm().  Fix this by set rmap_item->head to NULL when rmap_item
is removed from stable tree.

Link: https://lkml.kernel.org/r/20210330140228.45635-5-linmiaohe@huawei.com
Fixes: 4146d2d673 ("ksm: make !merge_across_nodes migration safe")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:25 -07:00
Miaohe Lin cd7fae2602 ksm: remove dedicated macro KSM_FLAG_MASK
The macro KSM_FLAG_MASK is used in rmap_walk_ksm() only.  So we can
replace ~KSM_FLAG_MASK with PAGE_MASK to remove this dedicated macro and
make code more consistent because PAGE_MASK is used elsewhere in this
file.

Link: https://lkml.kernel.org/r/20210330140228.45635-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:25 -07:00
Miaohe Lin 3e96b6a2e9 ksm: use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()
It's unnecessary to lock the page when get ksm page if we're going to
remove the rmap item as page migration is irrelevant in this case.  Use
GET_KSM_PAGE_NOLOCK instead to save some page lock cycles.

Link: https://lkml.kernel.org/r/20210330140228.45635-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Miaohe Lin a08e1e11c9 ksm: remove redundant VM_BUG_ON_PAGE() on stable_tree_search()
Patch series "Cleanup and fixup for ksm".

This series contains cleanups to remove unnecessary VM_BUG_ON_PAGE and
dedicated macro KSM_FLAG_MASK.  Also this fixes potential missing
rmap_item for stable_node which would result in failed rmap_walk_ksm().
More details can be found in the respective changelogs.

This patch (of 4):

The same VM_BUG_ON_PAGE() check is already done in the callee.  Remove
these extra caller one to simplify code slightly.

Link: https://lkml.kernel.org/r/20210330140228.45635-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210330140228.45635-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-05 11:27:24 -07:00
Alexey Dobriyan dfefd226b0 mm: cleanup kstrto*() usage
Range checks can folded into proper conversion function.  kstrto*() exist
for all arithmetic types.

Link: https://lkml.kernel.org/r/20201122123759.GC92364@localhost.localdomain
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:47 -08:00
Joe Perches ae7a927d27 mm: use sysfs_emit for struct kobject * uses
Patch series "mm: Convert sysfs sprintf family to sysfs_emit", v2.

Use the new sysfs_emit family and not the sprintf family.

This patch (of 5):

Use the sysfs_emit function instead of the sprintf family.

Done with cocci script as in commit 3c6bff3cf9 ("RDMA: Convert sysfs
kobject * show functions to use sysfs_emit()")

Link: https://lkml.kernel.org/r/cover.1605376435.git.joe@perches.com
Link: https://lkml.kernel.org/r/9c249215bad6df616ba0410ad980042694970c1b.1605376435.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-15 12:13:47 -08:00
Mauro Carvalho Chehab 9303c9d5e9 docs: get rid of :c:type explicit declarations for structs
The :c:type:`foo` only works properly with structs before
Sphinx 3.x.

On Sphinx 3.x, structs should now be declared using the
.. c:struct, and referenced via :c:struct tag.

As we now have the automarkup.py macro, that automatically
convert:
	struct foo

into cross-references, let's get rid of that, solving
several warnings when building docs with Sphinx 3.x.

Reviewed-by: André Almeida <andrealmeid@collabora.com> # blk-mq.rst
Reviewed-by: Takashi Iwai <tiwai@suse.de> # sound
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
2020-10-15 07:49:40 +02:00
Hugh Dickins 62fdb1632b ksm: reinstate memcg charge on copied pages
Patch series "mm: fixes to past from future testing".

Here's a set of independent fixes against 5.9-rc2: prompted by
testing Alex Shi's "warning on !memcg" and lru_lock series, but
I think fit for 5.9 - though maybe only the first for stable.

This patch (of 5):

In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte()
were removed, on the understanding that swap cache is now already charged
at those points; but a case was missed, when ksm_might_need_to_copy() has
decided it must allocate a substitute page: such pages were never charged.
Fix it inside ksm_might_need_to_copy().

This was discovered by Alex Shi's prospective commit "mm/memcg: warning on
!memcg after readahead page charged".

But there is a another surprise: this also fixes some rarer uncharged
PageAnon cases, when KSM is configured in, but has never been activated.
ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check
sometimes catches a case which would need to have been copied if KSM were
turned on.  Or that's my optimistic interpretation (of my own old code),
but it leaves some doubt as to whether everything is working as intended
there - might it hint at rare anon ptes which rmap cannot find?  A
question not easily answered: put in the fix for missed memcg charges.

Cc; Matthew Wilcox <willy@infradead.org>

Fixes: 4c6355b25e ("mm: memcontrol: charge swapin pages on instantiation")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>	[5.8]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301343270.5954@eggly.anvils
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301358020.5954@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-19 13:13:38 -07:00
Linus Torvalds b25d1dc947 Merge branch 'simplify-do_wp_page'
Merge emailed patches from Peter Xu:
 "This is a small series that I picked up from Linus's suggestion to
  simplify cow handling (and also make it more strict) by checking
  against page refcounts rather than mapcounts.

  This makes uffd-wp work again (verified by running upmapsort)"

Note: this is horrendously bad timing, and making this kind of
fundamental vm change after -rc3 is not at all how things should work.
The saving grace is that it really is a a nice simplification:

 8 files changed, 29 insertions(+), 120 deletions(-)

The reason for the bad timing is that it turns out that commit
17839856fd ("gup: document and work around 'COW can break either way'
issue" broke not just UFFD functionality (as Peter noticed), but Mikulas
Patocka also reports that it caused issues for strace when running in a
DAX environment with ext4 on a persistent memory setup.

And we can't just revert that commit without re-introducing the original
issue that is a potential security hole, so making COW stricter (and in
the process much simpler) is a step to then undoing the forced COW that
broke other uses.

Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

* emailed patches from Peter Xu <peterx@redhat.com>:
  mm: Add PGREUSE counter
  mm/gup: Remove enfornced COW mechanism
  mm/ksm: Remove reuse_ksm_page()
  mm: do_wp_page() simplification
2020-09-04 09:31:54 -07:00
Peter Xu 1a0cf26323 mm/ksm: Remove reuse_ksm_page()
Remove the function as the last reference has gone away with the do_wp_page()
changes.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-04 09:25:20 -07:00
Shawn Anastasio 12564485ed Revert "powerpc/64s: Remove PROT_SAO support"
This reverts commit 5c9fa16e8a.

Since PROT_SAO can still be useful for certain classes of software,
reintroduce it. Concerns about guest migration for LPARs using SAO
will be addressed next.

Signed-off-by: Shawn Anastasio <shawn@anastas.io>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200821185558.35561-2-shawn@anastas.io
2020-08-24 14:12:53 +10:00
Peter Xu bce617edec mm: do page fault accounting in handle_mm_fault
Patch series "mm: Page fault accounting cleanups", v5.

This is v5 of the pf accounting cleanup series.  It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b98270 ("mm: allow
VM_FAULT_RETRY for multiple times"):

  https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

What this series did:

  - Correct page fault accounting: we do accounting for a page fault
    (no matter whether it's from #PF handling, or gup, or anything else)
    only with the one that completed the fault.  For example, page fault
    retries should not be counted in page fault counters.  Same to the
    perf events.

  - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an adhoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g.  errornous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there're still quite some archs that have not enabled
    this perf event.

    Since this series will touch merely all the archs, we unify this
    perf event to always follow case (1), which is the one that makes most
    sense.  And since we moved the accounting into handle_mm_fault, the
    other two MAJ/MIN perf events are well taken care of naturally.

  - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR).  More information in patch 1.

  - Always account the page fault onto the one that triggered the page
    fault.  This does not matter much for #PF handlings, but mostly for
    gup.  More information on this in patch 25.

Patchset layout:

Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more

This patch (of 25):

This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault().  This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events.  To
do this, the pt_regs pointer is passed into handle_mm_fault().

PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.

So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12 10:58:02 -07:00
Linus Torvalds 25d8d4eeca powerpc updates for 5.9
- Add support for (optionally) using queued spinlocks & rwlocks.
 
  - Support for a new faster system call ABI using the scv instruction on Power9
    or later.
 
  - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on
    Power10 and future processors, leaving us with no way to implement the
    functionality it requests. This risks breaking userspace, though we believe
    it is unused in practice.
 
  - A bug fix for, and then the removal of, our custom stack expansion checking.
    We now allow stack expansion up to the rlimit, like other architectures.
 
  - Remove the remnants of our (previously disabled) topology update code, which
    tried to react to NUMA layout changes on virtualised systems, but was prone
    to crashes and other problems.
 
  - Add PMU support for Power10 CPUs.
 
  - A change to our signal trampoline so that we don't unbalance the link stack
    (branch return predictor) in the signal delivery path.
 
  - Lots of other cleanups, refactorings, smaller features and so on as usual.
 
 Thanks to:
   Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy,
   Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton
   Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill
   Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy,
   Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A.
   Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar,
   Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini,
   Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe,
   Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li
   RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal
   Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan
   Chancellor, Nathan Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver
   O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe
   Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
   Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
   Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju,
   Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung
   Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong,
   YueHaibing.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8tOxATHG1wZUBlbGxl
 cm1hbi5pZC5hdQAKCRBR6+o8yOGlgDQfEAClXHWf6hnxB84bEu39D51NkVotL1IG
 BRWFvyix+xHuUkHIouBPAAMl6ngY5X6wkYd+Z+CY9zHNtdSDoVlJE30YXdMQA/dE
 L/rYxR1884yGR/uU/3wusboO68ReXwcKQPmKOymUfh0zH7ujyJsSWLpXFK1YDC5d
 2TVVTi0Q+P5ucMHDh0L+AHirIxZvtZSp43+J7xLtywsj+XAxJWCTGo5WCJbdgbCA
 Qbv3aOkVyUa3EgsbdM/STPpv82ebqT+PHxeSIO4Jw6ZODtKRH0R5YsWCApuY9eZ+
 ebY9RLmgv9ZAhJqB2fv9A5NDcMoGpZNmjM7HrWpXwULKQpkBGHCzJ9FcSdHVMOx8
 nbVMFjt4uzLwV1w8lFYslQ2tNH/uH2o9BlryV1RLpiiKokDAJO/NOsWN9y0u/I4J
 EmAM5DSX2LgVvvas96IlGK8KX4xkOkf8FLX/H5UDvvAfloH8J4CZXk/CWCab/nqY
 KEHPnMmYvQZ1w9SzyZg9sO/1p6Bl1Gmm75Jv2F1lBiRW/42VcGBI/qLsJ4lC59Fc
 KbwufYNYYG38wbxDLW1HAPJhRonxIcaZj3EEqk7aTiLZ55nNbu8e2k32CpNXTGqt
 npOhzJHimcq7L6+878ZW+xpbZwogIEUdRSsmwb6aT8za3ShnYwSA2Q3LYxh9xyGH
 j3GifvPq6Efp3Q==
 =QMY1
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:

 - Add support for (optionally) using queued spinlocks & rwlocks.

 - Support for a new faster system call ABI using the scv instruction on
   Power9 or later.

 - Drop support for the PROT_SAO mmap/mprotect flag as it will be
   unsupported on Power10 and future processors, leaving us with no way
   to implement the functionality it requests. This risks breaking
   userspace, though we believe it is unused in practice.

 - A bug fix for, and then the removal of, our custom stack expansion
   checking. We now allow stack expansion up to the rlimit, like other
   architectures.

 - Remove the remnants of our (previously disabled) topology update
   code, which tried to react to NUMA layout changes on virtualised
   systems, but was prone to crashes and other problems.

 - Add PMU support for Power10 CPUs.

 - A change to our signal trampoline so that we don't unbalance the link
   stack (branch return predictor) in the signal delivery path.

 - Lots of other cleanups, refactorings, smaller features and so on as
   usual.

Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
Wei Yongjun, Wen Xiong, YueHaibing.

* tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
  selftests/powerpc: Fix pkey syscall redefinitions
  powerpc: Fix circular dependency between percpu.h and mmu.h
  powerpc/powernv/sriov: Fix use of uninitialised variable
  selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
  powerpc/40x: Fix assembler warning about r0
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  cpuidle: pseries: Fixup exit latency for CEDE(0)
  cpuidle: pseries: Add function to parse extended CEDE records
  cpuidle: pseries: Set the latency-hint before entering CEDE
  selftests/powerpc: Fix online CPU selection
  powerpc/perf: Consolidate perf_callchain_user_[64|32]()
  powerpc/pseries/hotplug-cpu: Remove double free in error path
  powerpc/pseries/mobility: Add pr_debug() for device tree changes
  powerpc/pseries/mobility: Set pr_fmt()
  powerpc/cacheinfo: Warn if cache object chain becomes unordered
  powerpc/cacheinfo: Improve diagnostics about malformed cache lists
  powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
  powerpc/cacheinfo: Set pr_fmt()
  powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
  ...
2020-08-07 10:33:50 -07:00
Nicholas Piggin 5c9fa16e8a powerpc/64s: Remove PROT_SAO support
ISA v3.1 does not support the SAO storage control attribute required to
implement PROT_SAO. PROT_SAO was used by specialised system software
(Lx86) that has been discontinued for about 7 years, and is not thought
to be used elsewhere, so removal should not cause problems.

We rather remove it than keep support for older processors, because
live migrating guest partitions to newer processors may not be possible
if SAO is in use (or worse allowed with silent races).

- PROT_SAO stays in the uapi header so code using it would still build.
- arch_validate_prot() is removed, the generic version rejects PROT_SAO
  so applications would get a failure at mmap() time.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Drop KVM change for the time being]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703011958.1166620-3-npiggin@gmail.com
2020-07-22 00:01:25 +10:00
Kees Cook 3f649ab728 treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable"). If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
	xargs perl -pi -e \
		's/\buninitialized_var\(([^\)]+)\)/\1/g;
		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-16 12:35:15 -07:00
Michel Lespinasse c1e8d7c6a7 mmap locking API: convert mmap_sem comments
Convert comments that reference mmap_sem to reference mmap_lock instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse 3e4e28c5a8 mmap locking API: convert mmap_sem API comments
Convert comments that reference old mmap_sem APIs to reference
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Michel Lespinasse d8ed45c5dc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-09 09:39:14 -07:00
Ethon Paul 457aef949d mm: ksm: fix a typo in comment "alreaady"->"already"
There is a typo in comment, fix it.

Signed-off-by: Ethon Paul <ethp@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200410162427.13927-1-ethp@qq.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-04 19:06:23 -07:00
Muchun Song 56df70a63e mm/ksm: fix NULL pointer dereference when KSM zero page is enabled
find_mergeable_vma() can return NULL.  In this case, it leads to a crash
when we access vm_mm(its offset is 0x40) later in write_protect_page.
And this case did happen on our server.  The following call trace is
captured in kernel 4.19 with the following patch applied and KSM zero
page enabled on our server.

  commit e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")

So add a vma check to fix it.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
  Oops: 0000 [#1] SMP NOPTI
  CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9
  RIP: try_to_merge_one_page+0xc7/0x760
  Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4
        60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8 <49>
        8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48
  RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246
  RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000
  RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000
  RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577
  R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000
  R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40
  FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
    ksm_scan_thread+0x115e/0x1960
    kthread+0xf5/0x130
    ret_from_fork+0x1f/0x30

[songmuchun@bytedance.com: if the vma is out of date, just exit]
  Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
[akpm@linux-foundation.org: add the conventional braces, replace /** with /*]
Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
Co-developed-by: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Markus Elfring <Markus.Elfring@web.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-21 11:11:55 -07:00
Joe Perches e4a9bc5896 mm: use fallthrough;
Convert the various /* fallthrough */ comments to the pseudo-keyword
fallthrough;

Done via script:
https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Link: http://lkml.kernel.org/r/f62fea5d10eb0ccfc05d87c242a620c261219b66.camel@perches.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:41 -07:00
Li Chen 7a9547fd4e mm/ksm.c: update get_user_pages() argument in comment
This updates get_user_pages()'s argument in ksm_test_exit()'s comment

Signed-off-by: Li Chen <chenli@uniontech.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/30ac2417-f1c7-f337-0beb-df561295298c@uniontech.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07 10:43:38 -07:00
Linus Torvalds aedc0650f9 * PPC secure guest support
* small x86 cleanup
 * fix for an x86-specific out-of-bounds write on a ioctl (not guest triggerable,
   data not attacker-controlled)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJd551cAAoJEL/70l94x66D+JkH/R3eEOyvckPmYmzd0lnV8mQ/
 7e0n2G/aD+iLZkcCbUnMaImdmSJmoEEJCPjgPk/5nJ3zUi5b/ABWyidEM5uf19Hl
 rzKBg0DR7BiQptPnZv2JMwEVKu3JOTchMykqu9xXChQlICocZ0xjdOA6nQ19p0Lv
 FulDw5MUaWrXevIzCBskQ38zJejRQA6CpD1lQkHn7LKS9p3p+BsAOd/Ouy87RfWG
 b3ktECNbXyO6KStrrhgm+z8pviWY+kqYklyBlDOOwxWif0x8WvNDpQLoVo+ZuhLU
 Me8YJ1BN75vFlxzh6ZK5exBUnm9E3fGVKIaaF+dpuds2x+j4HnYl+lZCm89MdqY=
 =Q4v7
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:

 - PPC secure guest support

 - small x86 cleanup

 - fix for an x86-specific out-of-bounds write on a ioctl (not guest
   triggerable, data not attacker-controlled)

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvm: vmx: Stop wasting a page for guest_msrs
  KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
  Documentation: kvm: Fix mention to number of ioctls classes
  powerpc: Ultravisor: Add PPC_UV config option
  KVM: PPC: Book3S HV: Support reset of secure guest
  KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM
  KVM: PPC: Book3S HV: Radix changes for secure guest
  KVM: PPC: Book3S HV: Shared pages support for secure guests
  KVM: PPC: Book3S HV: Support for running secure guests
  mm: ksm: Export ksm_madvise()
  KVM x86: Move kvm cpuid support out of svm
2019-12-04 11:08:30 -08:00
Bharata B Rao 33cf170715 mm: ksm: Export ksm_madvise()
On PEF-enabled POWER platforms that support running of secure guests,
secure pages of the guest are represented by device private pages
in the host. Such pages needn't participate in KSM merging. This is
achieved by using ksm_madvise() call which need to be exported
since KVM PPC can be a kernel module.

Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-11-28 16:23:03 +11:00
Andrey Ryabinin 9a63236f1a mm/ksm.c: don't WARN if page is still mapped in remove_stable_node()
It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
remove_stable_node() when it races with __mmput() and squeezes in
between ksm_exit() and exit_mmap().

  WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150

  Call Trace:
   remove_all_stable_nodes+0x12b/0x330
   run_store+0x4ef/0x7b0
   kernfs_fop_write+0x200/0x420
   vfs_write+0x154/0x450
   ksys_write+0xf9/0x1d0
   do_syscall_64+0x99/0x510
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Remove the warning as there is nothing scary going on.

Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
Fixes: cbf86cfe04 ("ksm: remove old stable nodes more thoroughly")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-22 09:11:18 -08:00
Song Liu 010c164a5f mm: move memcmp_pages() and pages_identical()
Patch series "THP aware uprobe", v13.

This patchset makes uprobe aware of THPs.

Currently, when uprobe is attached to text on THP, the page is split by
FOLL_SPLIT.  As a result, uprobe eliminates the performance benefit of
THP.

This set makes uprobe THP-aware.  Instead of FOLL_SPLIT, we introduces
FOLL_SPLIT_PMD, which only split PMD for uprobe.

After all uprobes within the THP are removed, the PTE-mapped pages are
regrouped as huge PMD.

This set (plus a few THP patches) is also available at

   https://github.com/liu-song-6/linux/tree/uprobe-thp

This patch (of 6):

Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
can use them in other files.

Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <matthew.wilcox@oracle.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Thomas Gleixner 7a338472f2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 48 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:52 +02:00
Jérôme Glisse 7269f99993 mm/mmu_notifier: use correct mmu_notifier events for each invalidation
This updates each existing invalidation to use the correct mmu notifier
event that represent what is happening to the CPU page table.  See the
patch which introduced the events to see the rational behind this.

Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Jérôme Glisse 6f4f13e8d9 mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table update can happens for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compression, reclaim, migration,
...).

Users of mmu notifier API track changes to the CPU page table and take
specific action for them.  While current API only provide range of virtual
address affected by the change, not why the changes is happening.

This patchset do the initial mechanical convertion of all the places that
calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
event as well as the vma if it is know (most invalidation happens against
a given vma).  Passing down the vma allows the users of mmu notifier to
inspect the new vma page protection.

The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
should assume that every for the range is going away when that event
happens.  A latter patch do convert mm call path to use a more appropriate
events for each call.

This is done as 2 patches so that no call site is forgotten especialy
as it uses this following coccinelle patch:

%<----------------------------------------------------------------------
@@
identifier I1, I2, I3, I4;
@@
static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
+enum mmu_notifier_event event,
+unsigned flags,
+struct vm_area_struct *vma,
struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }

@@
@@
-#define mmu_notifier_range_init(range, mm, start, end)
+#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)

@@
expression E1, E3, E4;
identifier I1;
@@
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, I1,
I1->vm_mm, E3, E4)
...>

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(..., struct vm_area_struct *VMA, ...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN, VMA;
@@
FN(...) {
struct vm_area_struct *VMA;
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, VMA,
E2, E3, E4)
...> }

@@
expression E1, E2, E3, E4;
identifier FN;
@@
FN(...) {
<...
mmu_notifier_range_init(E1,
+MMU_NOTIFY_UNMAP, 0, NULL,
E2, E3, E4)
...> }
---------------------------------------------------------------------->%

Applied with:
spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
spatch --sp-file mmu-notifier.spatch --dir mm --in-place

Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:49 -07:00
Yang Shi 2cee57d1b0 mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e.  KSM page rmap
walk.  Basically it is not a big deal since commit 2c653d0ee2 ("ksm:
introduce ksm_max_page_sharing per page deduplication limit"), since
max_page_sharing limits the number of shared KSM pages.

But it still sounds not worth waiting for the lock, the page can be
skip, then try to merge it in the next scan to avoid potential stall if
its content is still intact.

Introduce trylock mode to get_ksm_page() to not block on page lock, like
what try_to_merge_one_page() does.  And, define three possible
operations (nolock, lock and trylock) as enum type to avoid stacking up
bools and make the code more readable.

Return -EBUSY if trylock fails, since NULL means not find suitable KSM
page, which is a valid case.

With the default max_page_sharing setting (256), there is almost no
observed change comparing lock vs trylock.

However, with ksm02 of LTP, the reduced ksmd full scan time can be
observed, which has set max_page_sharing to 786432.  With lock version,
ksmd may tak 10s - 11s to run two full scans, with trylock version ksmd
may take 8s - 11s to run two full scans.  And, the number of
pages_sharing and pages_to_scan keep same.  Basically, this change has
no harm.

[hughd@google.com: fix BUG_ON()]
  Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Suggested-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:19 -08:00
Kirill Tkhai 52d1e606ee mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have
for ordinary anonymous pages.  If there is a write fault in a page,
which is mapped to an only pte, and it is not related to swap cache; the
page may be reused without copying its content.

[ Note that we do not consider PageSwapCache() pages at least for now,
  since we don't want to complicate __get_ksm_page(), which has nice
  optimization based on this (for the migration case). Currenly it is
  spinning on PageSwapCache() pages, waiting for when they have
  unfreezed counters (i.e., for the migration finish). But we don't want
  to make it also spinning on swap cache pages, which we try to reuse,
  since there is not a very high probability to reuse them. So, for now
  we do not consider PageSwapCache() pages at all. ]

So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
page_stable_node(), to skip a page, which KSM is currently trying to
link to stable tree.  Then we do page_ref_freeze() to prohibit KSM to
merge one more page into the page, we are reusing.  After that, nobody
can refer to the reusing page: KSM skips !PageSwapCache() pages with
zero refcount; and the protection against of all other participants is
the same as for reused ordinary anon pages pte lock, page lock and
mmap_sem.

[akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:15 -08:00
Anshuman Khandual 98fa15f34c mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

All these places for replacement were found by running the following
grep patterns on the entire kernel code.  Please let me know if this
might have missed some instances.  This might also have replaced some
false positives.  I will appreciate suggestions, inputs and review.

1. git grep "nid == -1"
2. git grep "node == -1"
3. git grep "nid = -1"
4. git grep "node = -1"

This patch (of 2):

At present there are multiple places where invalid node number is
encoded as -1.  Even though implicitly understood it is always better to
have macros in there.  Replace these open encodings for an invalid node
number with the global macro NUMA_NO_NODE.  This helps remove NUMA
related assumptions like 'invalid node' from various places redirecting
them to a common definition.

Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Hans Verkuil <hverkuil@xs4all.nl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 21:07:14 -08:00
Kirill Tkhai fcf9a0ef8d ksm: react on changing "sleep_millisecs" parameter faster
ksm thread unconditionally sleeps in ksm_scan_thread() after each
iteration:

	schedule_timeout_interruptible(
		msecs_to_jiffies(ksm_thread_sleep_millisecs))

The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.

In case of user writes a big value by a mistake, and the thread enters
into schedule_timeout_interruptible(), it's not possible to cancel the
sleep by writing a new smaler value; the thread is just sleeping till
timeout expires.

The patch fixes the problem by waking the thread each time after the value
is updated.

This also may be useful for debug purposes; and also for userspace
daemons, which change sleep_millisecs value in dependence of system load.

Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:51 -08:00
Jérôme Glisse ac46d4f3c4 mm/mmu_notifier: use structure for invalidate_range_start/end calls v2
To avoid having to change many call sites everytime we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end cakks.  No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
From: Jérôme Glisse <jglisse@redhat.com>
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:50 -08:00
Timofey Titovets 59e1a2f4bf ksm: replace jhash2 with xxhash
Replace jhash2 with xxhash.

Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c   hash() 12081 MB/s
ksm: xxh64    hash()  8770 MB/s
ksm: xxh32    hash()  4529 MB/s
ksm: jhash2   hash()  1569 MB/s

Sioh Lee did some testing:

crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns

As jhash2 always will be slower (for data size like PAGE_SIZE).  Don't use
it in ksm at all.

Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.

Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-12-28 12:11:46 -08:00
Nick Desaulniers 815f0ddb34 include/linux/compiler*.h: make compiler-*.h mutually exclusive
Commit cafa0010cd ("Raise the minimum required gcc version to 4.6")
recently exposed a brittle part of the build for supporting non-gcc
compilers.

Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
__GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
added compiler specific checks for __clang__ or __INTEL_COMPILER.

This is brittle, as they happened to get compatibility by posing as a
certain version of GCC.  This broke when upgrading the minimal version
of GCC required to build the kernel, to a version above what ICC and
Clang claim to be.

Rather than always including compiler-gcc.h then undefining or
redefining macros in compiler-intel.h or compiler-clang.h, let's
separate out the compiler specific macro definitions into mutually
exclusive headers, do more proper compiler detection, and keep shared
definitions in compiler_types.h.

Fixes: cafa0010cd ("Raise the minimum required gcc version to 4.6")
Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Suggested-by: Eli Friedman <efriedma@codeaurora.org>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 17:31:34 -07:00
Jiang Biao 1c4c3b99c0 mm: fix page_freeze_refs and page_unfreeze_refs in comments
page_freeze_refs/page_unfreeze_refs have already been relplaced by
page_ref_freeze/page_ref_unfreeze , but they are not modified in the
comments.

Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Souptick Joarder 50a7ca3c6f mm: convert return type of handle_mm_fault() caller to vm_fault_t
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

In this patch all the caller of handle_mm_fault() are changed to return
vm_fault_t type.

Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: James Hogan <jhogan@kernel.org>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:28 -07:00
Dave Jiang e1fb4a0864 dax: remove VM_MIXEDMAP for fsdax and device dax
This patch is reworked from an earlier patch that Dan has posted:
https://patchwork.kernel.org/patch/10131727/

VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that
the memory page it is dealing with is not typical memory from the linear
map.  The get_user_pages_fast() path, since it does not resolve the vma,
is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
use that as a VM_MIXEDMAP replacement in some locations.  In the cases
where there is no pte to consult we fallback to using vma_is_dax() to
detect the VM_MIXEDMAP special case.

Now that we have explicit driver pfn_t-flag opt-in/opt-out for
get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
also means we no longer need to worry about safely manipulating vm_flags
in a future where we support dynamically changing the dax mode of a
file.

DAX should also now be supported with madvise_behavior(), vma_merge(),
and copy_page_range().

This patch has been tested against ndctl unit test.  It has also been
tested against xfstests commit: 625515d using fake pmem created by
memmap and no additional issues have been observed.

Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:27 -07:00
Jia He 1105a2fc02 mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm()
In our armv8a server(QDF2400), I noticed lots of WARN_ON caused by
PAGE_SIZE unaligned for rmap_item->address under memory pressure
tests(start 20 guests and run memhog in the host).

  WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
  CPU: 4 PID: 4641 Comm: memhog Tainted: G        W 4.17.0-rc3+ #8
  Call trace:
   kvm_age_hva_handler+0xc0/0xc8
   handle_hva_to_gpa+0xa8/0xe0
   kvm_age_hva+0x4c/0xe8
   kvm_mmu_notifier_clear_flush_young+0x54/0x98
   __mmu_notifier_clear_flush_young+0x6c/0xa0
   page_referenced_one+0x154/0x1d8
   rmap_walk_ksm+0x12c/0x1d0
   rmap_walk+0x94/0xa0
   page_referenced+0x194/0x1b0
   shrink_page_list+0x674/0xc28
   shrink_inactive_list+0x26c/0x5b8
   shrink_node_memcg+0x35c/0x620
   shrink_node+0x100/0x430
   do_try_to_free_pages+0xe0/0x3a8
   try_to_free_pages+0xe4/0x230
   __alloc_pages_nodemask+0x564/0xdc0
   alloc_pages_vma+0x90/0x228
   do_anonymous_page+0xc8/0x4d0
   __handle_mm_fault+0x4a0/0x508
   handle_mm_fault+0xf8/0x1b0
   do_page_fault+0x218/0x4b8
   do_translation_fault+0x90/0xa0
   do_mem_abort+0x68/0xf0
   el0_da+0x24/0x28

In rmap_walk_ksm, the rmap_item->address might still have the
STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be
PAGE_SIZE aligned.  Thus it will cause exceptions in handle_hva_to_gpa
on arm64.

This patch fixes it by ignoring (not removing) the low bits of address
when doing rmap_walk_ksm.

IMO, it should be backported to stable tree.  the storm of WARN_ONs is
very easy for me to reproduce.  More than that, I watched a panic (not
reproducible) as follows:

  page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
  flags: 0x1fffc00000000000()
  raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
  raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
  page dumped because: nonzero _refcount
  CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
  Hardware name: <snip for confidential issues>
  Call trace:
    dump_backtrace+0x0/0x22c
    show_stack+0x24/0x2c
    dump_stack+0x8c/0xb0
    bad_page+0xf4/0x154
    free_pages_check_bad+0x90/0x9c
    free_pcppages_bulk+0x464/0x518
    free_hot_cold_page+0x22c/0x300
    __put_page+0x54/0x60
    unmap_stage2_range+0x170/0x2b4
    kvm_unmap_hva_handler+0x30/0x40
    handle_hva_to_gpa+0xb0/0xec
    kvm_unmap_hva_range+0x5c/0xd0

I even injected a fault on purpose in kvm_unmap_hva_range by seting
size=size-0x200, the call trace is similar as above.  So I thought the
panic is similarly caused by the root cause of WARN_ON.

Andrea said:

: It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
: zap those bits and hide the misalignment caused by the low metadata
: bits being erroneously left set in the address, but the arm code
: notices when that's the last page in the memslot and the hva_end is
: getting aligned and the size is below one page.
:
: I think the problem triggers in the addr += PAGE_SIZE of
: unmap_stage2_ptes that never matches end because end is aligned but
: addr is not.
:
: 	} while (pte++, addr += PAGE_SIZE, addr != end);
:
: x86 again only works on hva_start/hva_end after converting it to
: gfn_start/end and that being in pfn units the bits are zapped before
: they risk to cause trouble.

Jia He said:

: I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without
: this patch, the WARN_ON is very easy for reproducing.  After this patch, I
: have run the same benchmarch for a whole day without any WARN_ONs

Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He <jia.he@hxt-semitech.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Jia He <hejianet@gmail.com>
Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-15 07:55:23 +09:00
Mike Rapoport 88484826bc mm/ksm: move [set_]page_stable_node from ksm.h to ksm.c
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c
and there is no point to keep them in the include/linux/ksm.h

[akpm@linux-foundation.org: fix SYSFS=n build]
Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-06-07 17:34:36 -07:00
Mike Rapoport 5a2ca3efe6 mm/ksm: docs: extend overview comment and make it "DOC:"
The existing comment provides a good overview of KSM implementation. Let's
update it to reflect recent additions of "chain" and "dup" variants of the
stable tree nodes and mark it as "DOC:" for inclusion into the KSM
documentation.

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-04-27 17:19:24 -06:00
Jonathan Corbet 24844fd339 Merge branch 'mm-rst' into docs-next
Mike Rapoport says:

  These patches convert files in Documentation/vm to ReST format, add an
  initial index and link it to the top level documentation.

  There are no contents changes in the documentation, except few spelling
  fixes. The relatively large diffstat stems from the indentation and
  paragraph wrapping changes.

  I've tried to keep the formatting as consistent as possible, but I could
  miss some places that needed markup and add some markup where it was not
  necessary.

[jc: significant conflicts in vm/hmm.rst]
2018-04-16 14:25:08 -06:00
Mike Rapoport ad56b738c5 docs/vm: rename documentation files to .rst
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-04-16 14:18:15 -06:00
Claudio Imbrenda a38c015f31 mm/ksm.c: fix inconsistent accounting of zero pages
When using KSM with use_zero_pages, we replace anonymous pages
containing only zeroes with actual zero pages, which are not anonymous.
We need to do proper accounting of the mm counters, otherwise we will
get wrong values in /proc and a BUG message in dmesg when tearing down
the mm.

Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:31 -07:00
Claudio Imbrenda 77da2ba064 mm/ksm: fix interaction with THP
This patch fixes a corner case for KSM.  When two pages belong or
belonged to the same transparent hugepage, and they should be merged,
KSM fails to split the page, and therefore no merging happens.

This bug can be reproduced by:
* making sure ksm is running (in case disabling ksmtuned)
* enabling transparent hugepages
* allocating a THP-aligned 1-THP-sized buffer
  e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
* filling it with the same values
  e.g. memset(p, 42, 1<<21)
* performing madvise to make it mergeable
  e.g. madvise(p, 1<<21, MADV_MERGEABLE)
* waiting for KSM to perform a few scans

The expected outcome is that the all the pages get merged (1 shared and
the rest sharing); the actual outcome is that no pages get merged (1
unshared and the rest volatile)

The reason of this behaviour is that we increase the reference count
once for both pages we want to merge, but if they belong to the same
hugepage (or compound page), the reference counter used in both cases is
the one of the head of the compound page.  This means that
split_huge_page will find a value of the reference counter too high and
will fail.

This patch solves this problem by testing if the two pages to merge
belong to the same hugepage when attempting to merge them.  If so, the
hugepage is split safely.  This means that the hugepage is not split if
not necessary.

Link: http://lkml.kernel.org/r/1521548069-24758-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Co-authored-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:27 -07:00
Colin Ian King c01f0b54ef mm/ksm.c: make stable_node_dup() static
stable_node_dup() is local to the source and does not need to be in
global scope, so make it static.

Cleans up sparse warning:

  mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-05 21:36:24 -07:00
Khalid Aziz 74a0496748 sparc64: Add support for ADI (Application Data Integrity)
ADI is a new feature supported on SPARC M7 and newer processors to allow
hardware to catch rogue accesses to memory. ADI is supported for data
fetches only and not instruction fetches. An app can enable ADI on its
data pages, set version tags on them and use versioned addresses to
access the data pages. Upper bits of the address contain the version
tag. On M7 processors, upper four bits (bits 63-60) contain the version
tag. If a rogue app attempts to access ADI enabled data pages, its
access is blocked and processor generates an exception. Please see
Documentation/sparc/adi.txt for further details.

This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
MCD (Memory Corruption Detection) on selected memory ranges, enable
TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
version tags on page swap out/in or migration. ADI is not enabled by
default for any task. A task must explicitly enable ADI on a memory
range and set version tag for ADI to be effective for the task.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-18 07:38:48 -07:00
Mike Rapoport b7701a5f2e mm: docs: fixup punctuation
so that kernel-doc will properly recognize the parameter and function
descriptions.

Link: http://lkml.kernel.org/r/1516700871-22279-2-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06 18:32:48 -08:00
Paul E. McKenney 08df477434 mm/ksm: Remove now-redundant smp_read_barrier_depends()
Because READ_ONCE() now implies smp_read_barrier_depends(), the
smp_read_barrier_depends() in get_ksm_page() is now redundant.
This commit removes it and updates the comments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: <linux-mm@kvack.org>
2017-12-04 10:52:56 -08:00
Jérôme Glisse 0f10851ea4 mm/mmu_notifier: avoid double notification when it is useless
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users.  Everyone else is unaffected
by it.

When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range).  But that notification is not necessary
in all cases.

This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end.  It also adds documentation in all
those cases explaining why.

Below is a more in depth analysis of why this is fine to do this:

For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
device use thing like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space).  There is only 2 cases
when you need to notify those secondary TLB while holding page table
lock when clearing a pte/pmd:

  A) page backing address is free before mmu_notifier_invalidate_range_end
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious you do not want to take the risk for the device to write
to a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence
to happen:
  - take page table lock
  - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value then you can break memory model like C11 or C++11 for
the device.

Consider the following scenario (device use a feature similar to ATS/
PASID):

Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
assume they are write protected for COW (other case of B apply too).

[Time N] -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here because at time N+2 the clear page table entry was not pair with a
notification to invalidate the secondary TLB, the device see the new value
for addrB before seing the new value for addrA.  This break total memory
ordering for the device.

When changing a pte to write protect or to point to a new write protected
page with same content (KSM) it is ok to delay invalidate_range callback
to mmu_notifier_invalidate_range_end() outside the page table lock.  This
is true even if the thread doing page table update is preempted right
after releasing page table lock before calling
mmu_notifier_invalidate_range_end

Thanks to Andrea for thinking of a problematic scenario for COW.

[jglisse@redhat.com: v2]
  Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Kirill Tkhai 4b22927f0c ksm: fix unlocked iteration over vmas in cmp_and_merge_page()
In this place mm is unlocked, so vmas or list may change.  Down read
mmap_sem to protect them from modifications.

Link: http://lkml.kernel.org/r/150512788393.10691.8868381099691121308.stgit@localhost.localdomain
Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:23 -07:00
Arvind Yadav f907c26a91 mm/ksm.c: constify attribute_group structures
attribute_group are not supposed to change at runtime.  All functions
working with attribute_group provided by <linux/sysfs.h> work with const
attribute_group.  So mark the non-const structs as const.

Link: http://lkml.kernel.org/r/1501157167-3706-2-git-send-email-arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
Minchan Kim b3a81d0841 mm: fix KSM data corruption
Nadav reported KSM can corrupt the user data by the TLB batching
race[1].  That means data user written can be lost.

Quote from Nadav Amit:
 "For this race we need 4 CPUs:

  CPU0: Caches a writable and dirty PTE entry, and uses the stale value
  for write later.

  CPU1: Runs madvise_free on the range that includes the PTE. It would
  clear the dirty-bit. It batches TLB flushes.

  CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
  We care about the fact that it clears the PTE write-bit, and of
  course, batches TLB flushes.

  CPU3: Runs KSM. Our purpose is to pass the following test in
  write_protect_page():

	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

  Since it will avoid TLB flush. And we want to do it while the PTE is
  stale. Later, and before replacing the page, we would be able to
  change the page.

  Note that all the operations the CPU1-3 perform canhappen in parallel
  since they only acquire mmap_sem for read.

  We start with two identical pages. Everything below regards the same
  page/PTE.

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  Write the same
  value on page

  [cache PTE as
   dirty in TLB]

              MADV_FREE
              pte_mkclean()

                          4 > clear_refs
                          pte_wrprotect()

                                      write_protect_page()
                                      [ success, no flush ]

                                      pages_indentical()
                                      [ ok ]

  Write to page
  different value

  [Ok, using stale
   PTE]

                                      replace_page()

  Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
  CPU0 already wrote on the page, but KSM ignored this write, and it got
  lost"

In above scenario, MADV_FREE is fixed by changing TLB batching API
including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
part.

This patch changes soft-dirty uses TLB batching API instead of
flush_tlb_mm and KSM checks pending TLB flush by using
mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
there are other parallel threads pending TLB flush.

[1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com

Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
Reported-by: Nadav Amit <namit@vmware.com>
Tested-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:07 -07:00
Andrea Arcangeli 80b18dfa53 ksm: optimize refile of stable_node_dup at the head of the chain
If a candidate stable_node_dup has been found and it can accept further
merges it can be refiled to the head of the list to speedup next
searches without altering which dup is found and how the dups accumulate
in the chain.

We already refiled it back to the head in the prune_stale_stable_nodes
case, but we didn't refile it if not pruning (which is more common).
And we also refiled it when it was already at the head which is
unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
more than one dup in the chain, it doesn't mean it's not already at the
head of the chain).

The stable_node_chain list is single threaded and there's no SMP locking
contention so it should be faster to refile it to the head of the list
also if prune_stale_stable_nodes is false.

Profiling shows the refile happens 1.9% of the time when a dup is found
with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
the refile never happens of course as there's never space for one more
merge) which is reasonably low.  At higher max_page_sharing values it
should be much less frequent.

This is just an optimization.

Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Andrea Arcangeli 8dc5ffcd5a ksm: swap the two output parameters of chain/chain_prune
Some static checker complains if chain/chain_prune returns a potentially
stale pointer.

There are two output parameters to chain/chain_prune, one is tree_page
the other is stable_node_dup.  Like in get_ksm_page the caller has to
check tree_page is NULL before touching the stable_node.  Similarly in
chain/chain_prune the caller has to check tree_page before touching the
stable_node_dup returned or the original stable_node passed as
parameter.

Because the tree_page is never returned as a stale pointer, it may be
more intuitive to return tree_page and to pass stable_node_dup for
reference instead of the reverse.

This patch purely swaps the two output parameters of chain/chain_prune
as a cleanup for the static checker and to mimic the get_ksm_page
behavior more closely.  There's no change to the caller at all except
the swap, it's purely a cleanup and it is a noop from the caller point
of view.

Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Tested-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Andrea Arcangeli 0ba1d0f7c4 ksm: cleanup stable_node chain collapse case
Patch series "KSMscale cleanup/optimizations".

There are no fixes here it's just minor cleanups and optimizations.

1/3 removes makes the "fix" for the stale stable_node fall in the
    standard case without introducing new cases.  Setting stable_node to
    NULL was marginally safer, but stale pointer is still wiped from the
    caller, this looks cleaner.

2/3 should fix the false positive from Dan's static checker.

3/3 is a microoptimization to apply the the refile of future merge
    candidate dups at the head of the chain in all cases and to skip it in
    one case where we did it and but it was a noop (to avoid checking if
    it was already at the head but now we've to check it anyway so it got
    optimized away).

This patch (of 3):

When the stable_node chain is collapsed we can as well set the caller
stable_node to match the returned stable_node_dup in chain_prune().

This way the collapse case becomes indistinguishable from the regular
stable_node case and we can remove two branches from the KSM page
migration handling slow paths.

While it was all correct this looks cleaner (and faster) as the caller has
to deal with fewer special cases.

Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Andrea Arcangeli b4fecc67cc ksm: fix use after free with merge_across_nodes = 0
If merge_across_nodes was manually set to 0 (not the default value) by
the admin or a tuned profile on NUMA systems triggering cross-NODE page
migrations, a stable_node use after free could materialize.

If the chain is collapsed stable_node would point to the old chain that
was already freed.  stable_node_dup would be the stable_node dup now
converted to a regular stable_node and indexed in the rbtree in
replacement of the freed stable_node chain (not anymore a dup).

This special case where the chain is collapsed in the NUMA replacement
path, is now detected by setting stable_node to NULL by the chain_prune
callee if it decides to collapse the chain.  This tells the NUMA
replacement code that even if stable_node and stable_node_dup are
different, this is not a chain if stable_node is NULL, as the
stable_node_dup was converted to a regular stable_node and the chain was
collapsed.

It is generally safer for the callee to force the caller stable_node to
NULL the moment it become stale so any other mistake like this would
result in an instant Oops easier to debug than an use after free.

Otherwise the replace logic would act like if stable_node was a valid
chain, when in fact it was freed.  Notably
stable_node_chain_add_dup(page_node, stable_node) would run on a stable
stable_node.

Andrey Ryabinin found the source of the use after free in chain_prune().

Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Evgheni Dereveanchin <ederevea@redhat.com>
Tested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Andrea Arcangeli 2c653d0ee2 ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the
rmap_items associated to each stable_node can grow infinitely large.

During the rmap walk each entry can take up to ~10usec to process
because of IPIs for the TLB flushing (both for the primary MMU and the
secondary MMUs with the MMU notifier).  With only 16GB of address space
shared in the same KSM page, that would amount to dozens of seconds of
kernel runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to order of a few msec.  Just doing the
cond_resched() during the rmap walks is not enough, the list size must
have a limit too, otherwise the caller could get blocked in (schedule
friendly) kernel computations for seconds, unexpectedly.

There's room for optimization to significantly reduce the IPI delivery
cost during the page_referenced(), but at least for page_migration in
the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
it may be inevitable to send lots of IPIs if each rmap_item->mm is
active on a different CPU and there are lots of CPUs.  Even if we ignore
the IPI delivery cost, we've still to walk the whole KSM rmap list, so
we can't allow millions or billions (ulimited) number of entries in the
KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the
stable rbtree.  So there are three types of stable_nodes: the regular
ones (identical as before, living in the first flat dimension of the
stable rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if
each "dup" will be pointed by a different KSM page copy of that content.
This way the stable rbtree lookup computational complexity is unaffected
if compared to an unlimited max_sharing_limit.  It is still enforced
that there cannot be KSM page content duplicates in the stable rbtree
itself.

Adding the second dimension to the stable rbtree only after the
max_page_sharing limit hits, provides for a zero memory footprint
increase on 64bit archs.  The memory overhead of the per-KSM page
stable_tree and per virtual mapping rmap_item is unchanged.  Only after
the max_page_sharing limit hits, we need to allocate a stable_tree
"chain" and rb_replace() the "regular" stable_node with the newly
allocated stable_node "chain".  After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking hlist_dup in
the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
is converted to a stable_node "dup" living in the second dimension of
the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
elsewhere in any stable_node->head/node to avoid a clashes with the
stable_node->node.rb_parent_color pointer, and different from
&migrate_nodes.  So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.

The STABLE_NODE_DUP is picked as a random negative value in
stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
is aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kind of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait
for their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty.  This is because the "chains" are never
pointed by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.

[akpm@linux-foundation.org: fix non-NUMA build]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Petr Holasek <pholasek@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Evgheni Dereveanchin <ederevea@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Gavin Guo <gavin.guo@canonical.com>
Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Andrea Arcangeli a7306c3436 ksm: prevent crash after write_protect_page fails
"err" needs to be left set to -EFAULT if split_huge_page succeeds.
Otherwise if "err" gets clobbered with zero and write_protect_page
fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
and then try_to_merge_with_ksm_page() will continue thinking kpage is a
PageKsm when in fact it's still an anonymous page.  Eventually it'll
crash in page_add_anon_rmap.

This has been reproduced on Fedora25 kernel but I can reproduce with
upstream too.

The bug was introduced in commit f765f54059 ("ksm: prepare to new THP
semantics") introduced in v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
      page_add_anon_rmap+0x18/0x20
      try_to_merge_with_ksm_page+0x50b/0x780
      ksm_scan_thread+0x1211/0x1410
      ? prepare_to_wait_event+0x100/0x100
      ? try_to_merge_with_ksm_page+0x780/0x780
      kthread+0xd9/0xf0
      ? kthread_park+0x60/0x60
      ret_from_fork+0x25/0x30

Fixes: f765f54059 ("ksm: prepare to new THP semantics")
Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Federico Simoncelli <fsimonce@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 15:07:37 -07:00
Minchan Kim e4b8222271 mm: make rmap_one boolean function
rmap_one's return value controls whether rmap_work should contine to
scan other ptes or not so it's target for changing to boolean.  Return
true if the scan should be continued.  Otherwise, return false to stop
the scanning.

This patch makes rmap_one's return value to boolean.

Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:10 -07:00
Minchan Kim 1df631ae19 mm: make rmap_walk() return void
There is no user of the return value from rmap_walk() and friends so
this patch makes them void-returning functions.

Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:10 -07:00
Ingo Molnar f7ccbae45c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/coredump.h>
We are going to split <linux/sched/coredump.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/coredump.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Ingo Molnar 6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Vegard Nossum f1f1007644 mm: add new mmgrab() helper
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
  git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)

Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:48 -08:00
Aneesh Kumar K.V 595cd8f256 mm/ksm: handle protnone saved writes when making page write protect
Without this KSM will consider the page write protected, but a numa
fault can later mark the page writable.  This can result in memory
corruption.

Link: http://lkml.kernel.org/r/1487498625-10891-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:56 -08:00
Kirill A. Shutemov 36eaff3364 mm, ksm: convert write_protect_page() to use page_vma_mapped_walk()
For consistency, it worth converting all page_check_address() to
page_vma_mapped_walk(), so we could drop the former.

Link: http://lkml.kernel.org/r/20170129173858.45174-9-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Claudio Imbrenda e86c59b1b1 mm/ksm: improve deduplication of zero pages with colouring
Some architectures have a set of zero pages (coloured zero pages)
instead of only one zero page, in order to improve the cache
performance.  In those cases, the kernel samepage merger (KSM) would
merge all the allocated pages that happen to be filled with zeroes to
the same deduplicated page, thus losing all the advantages of coloured
zero pages.

This behaviour is noticeable when a process accesses large arrays of
allocated pages containing zeroes.  A test I conducted on s390 shows
that there is a speed penalty when KSM merges such pages, compared to
not merging them or using actual zero pages from the start without
breaking the COW.

This patch fixes this behaviour.  When coloured zero pages are present,
the checksum of a zero page is calculated during initialisation, and
compared with the checksum of the current canditate during merging.  In
case of a match, the normal merging routine is used to merge the page
with the correct coloured zero page, which ensures the candidate page is
checked to be equal to the target zero page.

A sysfs entry is also added to toggle this behaviour, since it can
potentially introduce performance regressions, especially on
architectures without coloured zero pages.  The default value is
disabled, for backwards compatibility.

With this patch, the performance with KSM is the same as with non
COW-broken actual zero pages, which is also the same as without KSM.

[akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea]
[imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication]
  Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:53 -08:00
zhong jiang 6213055f2c mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()
According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
in rare cases cause a hung task warning.

At present, if alloc_stable_node() allocation fails, two break_cows may
want to allocate a couple of pages, and the issue will come up when free
memory is under pressure.

We fix it by adding __GFP_HIGH to GFP, to grant access to memory
reserves, increasing the likelihood of allocation success.

[akpm@linux-foundation.org: tweak comment]
Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
zhong jiang 5b398e416e mm,ksm: fix endless looping in allocating memory when ksm enable
I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.

Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-28 16:19:01 -07:00
Kirill A. Shutemov dcddffd41d mm: do not pass mm_struct into handle_mm_fault
We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim bda807d444 mm: migrate: support non-lru movable page migration
We have allowed migration for only LRU pages until now and it was enough
to make high-order pages.  But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,.  enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable.  For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully.  On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation.  If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page.  The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage.  Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0.  If driver cannot migrate the page at the
moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure".  On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page.  In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM.  Exactly speaking, PG_movable is
not a real flag of struct page.  Rather than, VM reuses page->mapping's
lower bits to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly.  Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.  As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable).  But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated.  Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
non-lru movable page, it can skip it.  Driver doesn't need to manipulate
the flag because VM will set/clear it automatically.  Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field.  PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
  Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Zhou Chengming 7496fea9a6 ksm: fix conflict between mmput and scan_get_next_rmap_item
A concurrency issue about KSM in the function scan_get_next_rmap_item.

task A (ksmd):				|task B (the mm's task):
					|
mm = slot->mm;				|
down_read(&mm->mmap_sem);		|
					|
...					|
					|
spin_lock(&ksm_mmlist_lock);		|
					|
ksm_scan.mm_slot go to the next slot;	|
					|
spin_unlock(&ksm_mmlist_lock);		|
					|mmput() ->
					|	ksm_exit():
					|
					|spin_lock(&ksm_mmlist_lock);
					|if (mm_slot && ksm_scan.mm_slot != mm_slot) {
					|	if (!mm_slot->rmap_list) {
					|		easy_to_free = 1;
					|		...
					|
					|if (easy_to_free) {
					|	mmdrop(mm);
					|	...
					|
					|So this mm_struct may be freed in the mmput().
					|
up_read(&mm->mmap_sem);			|

As we can see above, the ksmd thread may access a mm_struct that already
been freed to the kmem_cache.  Suppose a fork will get this mm_struct from
the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will
cause mmap_sem.count to become -1.

As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has
the same SMP race condition, so fix it too.  My prev fix in function
scan_get_next_rmap_item will introduce a different SMP race condition, so
just invert the up_read/spin_unlock order as Andrea Arcangeli said.

Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ding Tianhong <dingtianhong@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Dave Hansen 1b2ee1266e mm/core: Do not enforce PKEY permissions on remote mm access
We try to enforce protection keys in software the same way that we
do in hardware.  (See long example below).

But, we only want to do this when accessing our *own* process's
memory.  If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
tried to PTRACE_POKE a target process which just happened to have
some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
debugger access to that memory.  PKRU is fundamentally a
thread-local structure and we do not want to enforce it on access
to _another_ thread's data.

This gets especially tricky when we have workqueues or other
delayed-work mechanisms that might run in a random process's context.
We can check that we only enforce pkeys when operating on our *own* mm,
but delayed work gets performed when a random user context is active.
We might end up with a situation where a delayed-work gup fails when
running randomly under its "own" task but succeeds when running under
another process.  We want to avoid that.

To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
fault flag: FAULT_FLAG_REMOTE.  They indicate that we are
walking an mm which is not guranteed to be the same as
current->mm and should not be subject to protection key
enforcement.

Thanks to Jerome Glisse for pointing out this scenario.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Dominik Vogt <vogt@linux.vnet.ibm.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Low <jason.low2@hp.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: iommu@lists.linux-foundation.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-18 19:46:28 +01:00
Dave Hansen d4edcf0d56 mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm
We will soon modify the vanilla get_user_pages() so it can no
longer be used on mm/tasks other than 'current/current->mm',
which is by far the most common way it is called.  For now,
we allow the old-style calls, but warn when they are used.
(implemented in previous patch)

This patch switches all callers of:

	get_user_pages()
	get_user_pages_unlocked()
	get_user_pages_locked()

to stop passing tsk/mm so they will no longer see the warnings.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: jack@suse.cz
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16 10:11:12 +01:00
Minchan Kim 337ed7eb5f mm/ksm.c: mark stable page dirty
The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but KSM
uses clean write-protected ptes to reference the stable ksm page.  So be
sure to mark that page dirty, so it's never mistakenly discarded.

[hughd@google.com: adjusted comments]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov f765f54059 ksm: prepare to new THP semantics
We don't need special code to stabilize THP.  If you've got reference to
any subpage of THP it will not be split under you.

New split_huge_page() also accepts tail pages: no need in special code
to get reference to head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov d281ee6145 rmap: add argument to charge compound page
We're going to allow mapping of individual 4k pages of THP compound
page.  It means we cannot rely on PageTransHuge() check to decide if
map/unmap small page or THP.

The patch adds new argument to rmap functions to indicate whether we
want to operate on whole compound page or only the small page.

[n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 48c935ad88 page-flags: define PG_locked behavior on compound pages
lock_page() must operate on the whole compound page.  It doesn't make
much sense to lock part of compound page.  Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked().  They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
appear there.  VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Geliang Tang 036404183e mm/ksm.c: use list_for_each_entry_safe
Use list_for_each_entry_safe() instead of list_for_each_safe() to
simplify the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Andrea Arcangeli c8f95ed1a9 ksm: unstable_tree_search_insert error checking cleanup
get_mergeable_page() can only return NULL (also in case of errors) or the
pinned mergeable page.  It can't return an error different than NULL.
This optimizes away the unnecessary error check.

Add a return after the "out:" label in the callee to make it more
readable.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Andrea Arcangeli 85c6e8dd23 ksm: use find_mergeable_vma in try_to_merge_with_ksm_page
Doing the VM_MERGEABLE check after the page == kpage check won't provide
any meaningful benefit.  The !vma->anon_vma check of find_mergeable_vma is
the only superfluous bit in using find_mergeable_vma because the !PageAnon
check of try_to_merge_one_page() implicitly checks for that, but it still
looks cleaner to share the same find_mergeable_vma().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Andrea Arcangeli 98666f8a25 ksm: use the helper method to do the hlist_empty check
This just uses the helper function to cleanup the assumption on the
hlist_node internals.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Andrea Arcangeli f2e5ff85ed ksm: don't fail stable tree lookups if walking over stale stable_nodes
The stable_nodes can become stale at any time if the underlying pages gets
freed.  The stable_node gets collected and removed from the stable rbtree
if that is detected during the rbtree lookups.

Don't fail the lookup if running into stale stable_nodes, just restart the
lookup after collecting the stale stable_nodes.  Otherwise the CPU spent
in the preparation stage is wasted and the lookup must be repeated at the
next loop potentially failing a second time in a second stale stable_node.

If we don't prune aggressively we delay the merging of the unstable node
candidates and at the same time we delay the freeing of the stale
stable_nodes.  Keeping stale stable_nodes around wastes memory and it
can't provide any benefit.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Petr Holasek <pholasek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00