From 426e5c429d16e4cd5ded46e21ff8e939bf8abd0f Mon Sep 17 00:00:00 2001
From: Muchun Song
Date: Wed, 30 Jun 2021 18:47:00 -0700
Subject: [PATCH 001/190] mm: memory_hotplug: factor out bootmem core functions to bootmem_info.c

Patch series "Free some vmemmap pages of HugeTLB page", v23.

This patch series frees some vmemmap pages (struct page structures)
associated with each HugeTLB page when preallocated, to save memory.

In order to reduce the difficulty of the first round of code review, this
version disables PMD/huge page mapping of vmemmap when the feature is
enabled. This effectively eliminates a bunch of complex page table
manipulation code. Once this patch series is solid, we can add the vmemmap
page table manipulation code in the future.

The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.

HugeTLB pages consist of multiple base page size pages and are supported by
many architectures. See hugetlbpage.rst in the Documentation directory for
more details. On the x86 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB
page consists of 512 base pages and a 1GB HugeTLB page consists of 4096 base
pages. For each base page, there is a corresponding page struct.

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
provides this upper limit. The only 'useful' information in the remaining
page structs is the compound_head field, and this field is the same for all
tail pages.

By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

When the system boots up, every 2MB HugeTLB page has 512 struct page structs
whose size is 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    2MB    |                     +-----------+                +-----------+
 |           |                     |     5     | -------------> |     5     |
 |           |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB. The only use of the remaining
pages of page structs (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
6 pages to the buddy allocator. A small sketch of this arithmetic follows;
after it, here is how things look after remapping.
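As an editorial aside (not part of the original cover letter), the saving can
be checked with a minimal C sketch, assuming the x86-64 values quoted above
(4KB base pages, 64-byte struct page) and the 2 vmemmap pages the series keeps
per HugeTLB page:

#include <stdio.h>

#define PAGE_SIZE        4096UL  /* x86-64 base page size */
#define STRUCT_PAGE_SIZE 64UL    /* sizeof(struct page) on x86-64 */
#define RESERVED_VMEMMAP 2UL     /* head page struct page + first tail page kept */

static void report(const char *name, unsigned long hugepage_size)
{
	/* How many base pages make up the HugeTLB page. */
	unsigned long nr_base_pages = hugepage_size / PAGE_SIZE;
	/* How many page frames are needed to hold their struct pages. */
	unsigned long vmemmap_pages = nr_base_pages * STRUCT_PAGE_SIZE / PAGE_SIZE;
	/* How many of those frames can be handed back to the buddy allocator. */
	unsigned long freeable = vmemmap_pages - RESERVED_VMEMMAP;

	printf("%s: %lu base pages, %lu vmemmap pages, %lu freeable\n",
	       name, nr_base_pages, vmemmap_pages, freeable);
}

int main(void)
{
	report("2MB HugeTLB", 2UL << 20);  /* 512 base pages, 8 vmemmap pages, 6 freeable */
	report("1GB HugeTLB", 1UL << 30);  /* 262144 base pages, 4096 vmemmap pages, 4094 freeable */
	return 0;
}

Compiled with a plain C compiler, this prints 6 freeable pages per 2MB
HugeTLB page and 4094 per 1GB page, matching the figures quoted in this
cover letter.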
    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    2MB    |                     +-----------+                       | | |
 |           |                     |     5     | ----------------------+ | |
 |           |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 6 pages
for vmemmap pages and restore the previous mapping relationship.

Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is
similar to the 2MB case, and we can use the same approach to free its vmemmap
pages. For a 1GB HugeTLB page, we can save 4094 pages. This is a very
substantial gain. On our server, we run some SPDK/QEMU applications which use
1024GB of HugeTLB pages. With this feature enabled, we can save ~16GB
(1GB hugepages) / ~12GB (2MB hugepages) of memory.

Because the vmemmap page tables are reconstructed on the freeing/allocating
path, this adds some overhead. Here is an analysis of that overhead.

1) Allocating 10240 2MB HugeTLB pages.

   a) With this patch series applied:

   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.166s
   user     0m0.000s
   sys      0m0.166s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
     kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)           5476 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [16K, 32K)          4760 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
   [32K, 64K)             4 |                                                    |

   b) Without this patch series:

   # time echo 10240 > /proc/sys/vm/nr_hugepages

   real     0m0.067s
   user     0m0.000s
   sys      0m0.067s

   # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; }
     kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)           10147 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)             93 |                                                    |

   Summary: this feature makes allocation about ~2x slower than before.

2) Freeing 10240 2MB HugeTLB pages.

   a) With this patch series applied:

   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.213s
   user     0m0.000s
   sys      0m0.213s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
     kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [8K, 16K)              6 |                                                    |
   [16K, 32K)         10227 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [32K, 64K)             7 |                                                    |

   b) Without this patch series:

   # time echo 0 > /proc/sys/vm/nr_hugepages

   real     0m0.081s
   user     0m0.000s
   sys      0m0.081s

   # bpftrace -e 'kprobe:free_pool_huge_page { @start[tid] = nsecs; }
     kretprobe:free_pool_huge_page /@start[tid]/ { @latency = hist(nsecs -
     @start[tid]); delete(@start[tid]); }'
   Attaching 2 probes...

   @latency:
   [4K, 8K)            6805 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
   [8K, 16K)           3427 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
   [16K, 32K)             8 |                                                    |

   Summary: the freeing path (__free_hugepage) is about ~2-3x slower than
   before.

Although the overhead has increased, it is not significant. As Mike said,
"However, remember that the majority of use cases create HugeTLB pages at
or shortly after boot time and add them to the pool.
So, additional overhead is at pool creation time. There is no change to
'normal run time' operations of getting a page from or returning a page to
the pool (think page fault/unmap)."

In addition to the memory gains, and despite the overhead, this series also
helps page (un)pinning. The following data was obtained by Joao Martins (many
thanks for his effort): page (un)pinners see an improvement, presumably
because there are fewer memmap pages and thus the tail/head pages stay in
cache more often.

Out of the box, Joao saw (when comparing linux-next against linux-next + this
series) with gup_test and pinning a 16G HugeTLB file (with 1G pages):

	get_user_pages(): ~32k -> ~9k
	unpin_user_pages(): ~75k -> ~70k

Usually any tight loop fetching compound_head(), or reading tail page data
(e.g. compound_head), benefits a lot. There were some unpinning
inefficiencies Joao was fixing[2]; with those fixes added in, the improvement
is even larger:

	unpin_user_pages(): ~27k -> ~3.8k

[1] https://lore.kernel.org/linux-mm/20210409205254.242291-1-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210204202500.26474-1-joao.m.martins@oracle.com/

This patch (of 9):

Move the common bootmem info registration API into a separate
bootmem_info.c. A later patch will use {get,put}_page_bootmem() to initialize
the page structs for the vmemmap pages, or to free the vmemmap pages to the
buddy allocator, so move them out of CONFIG_MEMORY_HOTPLUG_SPARSE. This is
just code movement without any functional change.

Link: https://lkml.kernel.org/r/20210510030027.56044-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210510030027.56044-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song
Acked-by: Mike Kravetz
Reviewed-by: Oscar Salvador
Reviewed-by: David Hildenbrand
Reviewed-by: Miaohe Lin
Tested-by: Chen Huang
Tested-by: Bodeddula Balasubramaniam
Cc: Jonathan Corbet
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: x86@kernel.org
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Alexander Viro
Cc: Paul E.
McKenney Cc: Pawan Gupta Cc: Randy Dunlap Cc: Oliver Neukum Cc: Anshuman Khandual Cc: Joerg Roedel Cc: Mina Almasry Cc: David Rientjes Cc: Matthew Wilcox Cc: Michal Hocko Cc: Barry Song Cc: HORIGUCHI NAOYA Cc: Joao Martins Cc: Xiongchun Duan Cc: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/sparc/mm/init_64.c | 1 + arch/x86/mm/init_64.c | 3 +- include/linux/bootmem_info.h | 40 +++++++++++ include/linux/memory_hotplug.h | 27 ------- mm/Makefile | 1 + mm/bootmem_info.c | 127 +++++++++++++++++++++++++++++++++ mm/memory_hotplug.c | 116 ------------------------------ mm/sparse.c | 1 + 8 files changed, 172 insertions(+), 144 deletions(-) create mode 100644 include/linux/bootmem_info.h create mode 100644 mm/bootmem_info.c diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 06e938d03f3b..1b23639e2fcd 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index e527d829e1ed..3aaf1d30c777 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -1623,7 +1624,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, return err; } -#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HAVE_BOOTMEM_INFO_NODE) +#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE void register_page_bootmem_memmap(unsigned long section_nr, struct page *start_page, unsigned long nr_pages) { diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h new file mode 100644 index 000000000000..4ed6dee1adc9 --- /dev/null +++ b/include/linux/bootmem_info.h @@ -0,0 +1,40 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __LINUX_BOOTMEM_INFO_H +#define __LINUX_BOOTMEM_INFO_H + +#include + +/* + * Types for free bootmem stored in page->lru.next. These have to be in + * some random range in unsigned long space for debugging purposes. + */ +enum { + MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12, + SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE, + MIX_SECTION_INFO, + NODE_INFO, + MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO, +}; + +#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE +void __init register_page_bootmem_info_node(struct pglist_data *pgdat); + +void get_page_bootmem(unsigned long info, struct page *page, + unsigned long type); +void put_page_bootmem(struct page *page); +#else +static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) +{ +} + +static inline void put_page_bootmem(struct page *page) +{ +} + +static inline void get_page_bootmem(unsigned long info, struct page *page, + unsigned long type) +{ +} +#endif + +#endif /* __LINUX_BOOTMEM_INFO_H */ diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 28f32fd00fe9..a7fd2c3ccb77 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -18,18 +18,6 @@ struct vmem_altmap; #ifdef CONFIG_MEMORY_HOTPLUG struct page *pfn_to_online_page(unsigned long pfn); -/* - * Types for free bootmem stored in page->lru.next. These have to be in - * some random range in unsigned long space for debugging purposes. - */ -enum { - MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12, - SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE, - MIX_SECTION_INFO, - NODE_INFO, - MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO, -}; - /* Types for control the zone type of onlined and offlined memory */ enum { /* Offline the memory. 
*/ @@ -222,17 +210,6 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat) #endif /* CONFIG_NUMA */ #endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */ -#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE -extern void __init register_page_bootmem_info_node(struct pglist_data *pgdat); -#else -static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) -{ -} -#endif -extern void put_page_bootmem(struct page *page); -extern void get_page_bootmem(unsigned long ingo, struct page *page, - unsigned long type); - void get_online_mems(void); void put_online_mems(void); @@ -260,10 +237,6 @@ static inline void zone_span_writelock(struct zone *zone) {} static inline void zone_span_writeunlock(struct zone *zone) {} static inline void zone_seqlock_init(struct zone *zone) {} -static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) -{ -} - static inline int try_online_node(int nid) { return 0; diff --git a/mm/Makefile b/mm/Makefile index bf71e295e9f6..61ce71ddf7bb 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -125,3 +125,4 @@ obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o +obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c new file mode 100644 index 000000000000..5b152dba7344 --- /dev/null +++ b/mm/bootmem_info.c @@ -0,0 +1,127 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Bootmem core functions. + * + * Copyright (c) 2020, Bytedance. + * + * Author: Muchun Song + * + */ +#include +#include +#include +#include +#include + +void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) +{ + page->freelist = (void *)type; + SetPagePrivate(page); + set_page_private(page, info); + page_ref_inc(page); +} + +void put_page_bootmem(struct page *page) +{ + unsigned long type; + + type = (unsigned long) page->freelist; + BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || + type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); + + if (page_ref_dec_return(page) == 1) { + page->freelist = NULL; + ClearPagePrivate(page); + set_page_private(page, 0); + INIT_LIST_HEAD(&page->lru); + free_reserved_page(page); + } +} + +#ifndef CONFIG_SPARSEMEM_VMEMMAP +static void register_page_bootmem_info_section(unsigned long start_pfn) +{ + unsigned long mapsize, section_nr, i; + struct mem_section *ms; + struct page *page, *memmap; + struct mem_section_usage *usage; + + section_nr = pfn_to_section_nr(start_pfn); + ms = __nr_to_section(section_nr); + + /* Get section's memmap address */ + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); + + /* + * Get page for the memmap's phys address + * XXX: need more consideration for sparse_vmemmap... 
+ */ + page = virt_to_page(memmap); + mapsize = sizeof(struct page) * PAGES_PER_SECTION; + mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT; + + /* remember memmap's page */ + for (i = 0; i < mapsize; i++, page++) + get_page_bootmem(section_nr, page, SECTION_INFO); + + usage = ms->usage; + page = virt_to_page(usage); + + mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; + + for (i = 0; i < mapsize; i++, page++) + get_page_bootmem(section_nr, page, MIX_SECTION_INFO); + +} +#else /* CONFIG_SPARSEMEM_VMEMMAP */ +static void register_page_bootmem_info_section(unsigned long start_pfn) +{ + unsigned long mapsize, section_nr, i; + struct mem_section *ms; + struct page *page, *memmap; + struct mem_section_usage *usage; + + section_nr = pfn_to_section_nr(start_pfn); + ms = __nr_to_section(section_nr); + + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); + + register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION); + + usage = ms->usage; + page = virt_to_page(usage); + + mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; + + for (i = 0; i < mapsize; i++, page++) + get_page_bootmem(section_nr, page, MIX_SECTION_INFO); +} +#endif /* !CONFIG_SPARSEMEM_VMEMMAP */ + +void __init register_page_bootmem_info_node(struct pglist_data *pgdat) +{ + unsigned long i, pfn, end_pfn, nr_pages; + int node = pgdat->node_id; + struct page *page; + + nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT; + page = virt_to_page(pgdat); + + for (i = 0; i < nr_pages; i++, page++) + get_page_bootmem(node, page, NODE_INFO); + + pfn = pgdat->node_start_pfn; + end_pfn = pgdat_end_pfn(pgdat); + + /* register section info */ + for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { + /* + * Some platforms can assign the same pfn to multiple nodes - on + * node0 as well as nodeN. To avoid registering a pfn against + * multiple nodes we check that this pfn does not already + * reside in some other nodes. + */ + if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node)) + register_page_bootmem_info_section(pfn); + } +} diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 974a565797d8..ae7a07b02049 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -154,122 +154,6 @@ static void release_memory_resource(struct resource *res) } #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE -void get_page_bootmem(unsigned long info, struct page *page, - unsigned long type) -{ - page->freelist = (void *)type; - SetPagePrivate(page); - set_page_private(page, info); - page_ref_inc(page); -} - -void put_page_bootmem(struct page *page) -{ - unsigned long type; - - type = (unsigned long) page->freelist; - BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE || - type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE); - - if (page_ref_dec_return(page) == 1) { - page->freelist = NULL; - ClearPagePrivate(page); - set_page_private(page, 0); - INIT_LIST_HEAD(&page->lru); - free_reserved_page(page); - } -} - -#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE -#ifndef CONFIG_SPARSEMEM_VMEMMAP -static void register_page_bootmem_info_section(unsigned long start_pfn) -{ - unsigned long mapsize, section_nr, i; - struct mem_section *ms; - struct page *page, *memmap; - struct mem_section_usage *usage; - - section_nr = pfn_to_section_nr(start_pfn); - ms = __nr_to_section(section_nr); - - /* Get section's memmap address */ - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - - /* - * Get page for the memmap's phys address - * XXX: need more consideration for sparse_vmemmap... 
- */ - page = virt_to_page(memmap); - mapsize = sizeof(struct page) * PAGES_PER_SECTION; - mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT; - - /* remember memmap's page */ - for (i = 0; i < mapsize; i++, page++) - get_page_bootmem(section_nr, page, SECTION_INFO); - - usage = ms->usage; - page = virt_to_page(usage); - - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; - - for (i = 0; i < mapsize; i++, page++) - get_page_bootmem(section_nr, page, MIX_SECTION_INFO); - -} -#else /* CONFIG_SPARSEMEM_VMEMMAP */ -static void register_page_bootmem_info_section(unsigned long start_pfn) -{ - unsigned long mapsize, section_nr, i; - struct mem_section *ms; - struct page *page, *memmap; - struct mem_section_usage *usage; - - section_nr = pfn_to_section_nr(start_pfn); - ms = __nr_to_section(section_nr); - - memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr); - - register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION); - - usage = ms->usage; - page = virt_to_page(usage); - - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT; - - for (i = 0; i < mapsize; i++, page++) - get_page_bootmem(section_nr, page, MIX_SECTION_INFO); -} -#endif /* !CONFIG_SPARSEMEM_VMEMMAP */ - -void __init register_page_bootmem_info_node(struct pglist_data *pgdat) -{ - unsigned long i, pfn, end_pfn, nr_pages; - int node = pgdat->node_id; - struct page *page; - - nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT; - page = virt_to_page(pgdat); - - for (i = 0; i < nr_pages; i++, page++) - get_page_bootmem(node, page, NODE_INFO); - - pfn = pgdat->node_start_pfn; - end_pfn = pgdat_end_pfn(pgdat); - - /* register section info */ - for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) { - /* - * Some platforms can assign the same pfn to multiple nodes - on - * node0 as well as nodeN. To avoid registering a pfn against - * multiple nodes we check that this pfn does not already - * reside in some other nodes. - */ - if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node)) - register_page_bootmem_info_section(pfn); - } -} -#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */ - static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, const char *reason) { diff --git a/mm/sparse.c b/mm/sparse.c index 7272f7a1449d..6326cdf36c4f 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "internal.h" #include From 6be24bed9da367c29b04e6fba8c9f27db39aa665 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:04 -0700 Subject: [PATCH 002/190] mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP The option HUGETLB_PAGE_FREE_VMEMMAP allows for the freeing of some vmemmap pages associated with pre-allocated HugeTLB pages. For example, on X86_64 6 vmemmap pages of size 4KB each can be saved for each 2MB HugeTLB page. 4094 vmemmap pages of size 4KB each can be saved for each 1GB HugeTLB page. When a HugeTLB page is allocated or freed, the vmemmap array representing the range associated with the page will need to be remapped. When a page is allocated, vmemmap pages are freed after remapping. When a page is freed, previously discarded vmemmap pages must be allocated before remapping. The config option is introduced early so that supporting code can be written to depend on the option. The initial version of the code only provides support for x86-64. If config HAVE_BOOTMEM_INFO_NODE is enabled, the freeing vmemmap page code denpend on it to free vmemmap pages. Otherwise, just use free_reserved_page() to free vmemmmap pages. 
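As a rough illustration of that fallback (an editorial sketch, not part of
the patch; the mock_* names and the have_bootmem_info_node flag are stand-ins
for the real kernel helpers and the Kconfig symbol), the choice between the
bootmem-info path and free_reserved_page() looks like this in user-space C:

#include <stdio.h>

/* Stand-in for the Kconfig symbol CONFIG_HAVE_BOOTMEM_INFO_NODE. */
static const int have_bootmem_info_node = 1;

/* Minimal stand-in for struct page; no real state is needed for this sketch. */
struct mock_page { int dummy; };

/* Models put_page_bootmem(): drop the bootmem info, then release the frame. */
static void mock_put_page_bootmem(struct mock_page *page)
{
	(void)page;
	printf("bootmem info released, page returned to the buddy allocator\n");
}

/* Models free_reserved_page(): clear the reserved flag and free the frame. */
static void mock_free_reserved_page(struct mock_page *page)
{
	(void)page;
	printf("reserved page freed directly to the buddy allocator\n");
}

/*
 * Models the policy described above: when bootmem info is available the
 * vmemmap-freeing code relies on it, otherwise it falls back to
 * free_reserved_page().
 */
static void mock_free_bootmem_page(struct mock_page *page)
{
	if (have_bootmem_info_node)
		mock_put_page_bootmem(page);
	else
		mock_free_reserved_page(page);
}

int main(void)
{
	struct mock_page page;

	mock_free_bootmem_page(&page);
	return 0;
}

Flipping have_bootmem_info_node to 0 exercises the free_reserved_page()
fallback described above.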
The routine register_page_bootmem_info() is used to register bootmem info. Therefore, make sure register_page_bootmem_info is enabled if HUGETLB_PAGE_FREE_VMEMMAP is defined. Link: https://lkml.kernel.org/r/20210510030027.56044-3-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Oscar Salvador Acked-by: Mike Kravetz Reviewed-by: Miaohe Lin Tested-by: Chen Huang Tested-by: Bodeddula Balasubramaniam Reviewed-by: Balbir Singh Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Barry Song Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/mm/init_64.c | 2 +- fs/Kconfig | 5 +++++ 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 3aaf1d30c777..65ea58527176 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1270,7 +1270,7 @@ static struct kcore_list kcore_vsyscall; static void __init register_page_bootmem_info(void) { -#ifdef CONFIG_NUMA +#if defined(CONFIG_NUMA) || defined(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) int i; for_each_online_node(i) diff --git a/fs/Kconfig b/fs/Kconfig index 141a856c50e7..58a53455d1fe 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -240,6 +240,11 @@ config HUGETLBFS config HUGETLB_PAGE def_bool HUGETLBFS +config HUGETLB_PAGE_FREE_VMEMMAP + def_bool HUGETLB_PAGE + depends on X86_64 + depends on SPARSEMEM_VMEMMAP + config MEMFD_CREATE def_bool TMPFS || HUGETLBFS From cd39d4e9e71c5437b67c819c3d53032145bf2879 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:09 -0700 Subject: [PATCH 003/190] mm: hugetlb: gather discrete indexes of tail page For HugeTLB page, there are more metadata to save in the struct page. But the head struct page cannot meet our needs, so we have to abuse other tail struct page to store the metadata. In order to avoid conflicts caused by subsequent use of more tail struct pages, we can gather these discrete indexes of tail struct page. In this case, it will be easier to add a new tail page index later. Link: https://lkml.kernel.org/r/20210510030027.56044-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Oscar Salvador Reviewed-by: Miaohe Lin Tested-by: Chen Huang Tested-by: Bodeddula Balasubramaniam Acked-by: Michal Hocko Reviewed-by: Mike Kravetz Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. 
McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 21 +++++++++++++++++++-- include/linux/hugetlb_cgroup.h | 19 +++++++++++-------- 2 files changed, 30 insertions(+), 10 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3c0117656745..0c8c96481259 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -29,6 +29,23 @@ typedef struct { unsigned long pd; } hugepd_t; #include #include +/* + * For HugeTLB page, there are more metadata to save in the struct page. But + * the head struct page cannot meet our needs, so we have to abuse other tail + * struct page to store the metadata. In order to avoid conflicts caused by + * subsequent use of more tail struct pages, we gather these discrete indexes + * of tail struct page here. + */ +enum { + SUBPAGE_INDEX_SUBPOOL = 1, /* reuse page->private */ +#ifdef CONFIG_CGROUP_HUGETLB + SUBPAGE_INDEX_CGROUP, /* reuse page->private */ + SUBPAGE_INDEX_CGROUP_RSVD, /* reuse page->private */ + __MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD, +#endif + __NR_USED_SUBPAGE, +}; + struct hugepage_subpool { spinlock_t lock; long count; @@ -635,13 +652,13 @@ extern unsigned int default_hstate_idx; */ static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage) { - return (struct hugepage_subpool *)(hpage+1)->private; + return (void *)page_private(hpage + SUBPAGE_INDEX_SUBPOOL); } static inline void hugetlb_set_page_subpool(struct page *hpage, struct hugepage_subpool *subpool) { - set_page_private(hpage+1, (unsigned long)subpool); + set_page_private(hpage + SUBPAGE_INDEX_SUBPOOL, (unsigned long)subpool); } static inline struct hstate *hstate_file(struct file *f) diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h index 0bff345c4bc6..0b8d1fdda3a1 100644 --- a/include/linux/hugetlb_cgroup.h +++ b/include/linux/hugetlb_cgroup.h @@ -21,15 +21,16 @@ struct hugetlb_cgroup; struct resv_map; struct file_region; +#ifdef CONFIG_CGROUP_HUGETLB /* * Minimum page order trackable by hugetlb cgroup. * At least 4 pages are necessary for all the tracking information. - * The second tail page (hpage[2]) is the fault usage cgroup. - * The third tail page (hpage[3]) is the reservation usage cgroup. + * The second tail page (hpage[SUBPAGE_INDEX_CGROUP]) is the fault + * usage cgroup. The third tail page (hpage[SUBPAGE_INDEX_CGROUP_RSVD]) + * is the reservation usage cgroup. 
*/ -#define HUGETLB_CGROUP_MIN_ORDER 2 +#define HUGETLB_CGROUP_MIN_ORDER order_base_2(__MAX_CGROUP_SUBPAGE_INDEX + 1) -#ifdef CONFIG_CGROUP_HUGETLB enum hugetlb_memory_event { HUGETLB_MAX, HUGETLB_NR_MEMORY_EVENTS, @@ -66,9 +67,9 @@ __hugetlb_cgroup_from_page(struct page *page, bool rsvd) if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return NULL; if (rsvd) - return (struct hugetlb_cgroup *)page[3].private; + return (void *)page_private(page + SUBPAGE_INDEX_CGROUP_RSVD); else - return (struct hugetlb_cgroup *)page[2].private; + return (void *)page_private(page + SUBPAGE_INDEX_CGROUP); } static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page) @@ -90,9 +91,11 @@ static inline int __set_hugetlb_cgroup(struct page *page, if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER) return -1; if (rsvd) - page[3].private = (unsigned long)h_cg; + set_page_private(page + SUBPAGE_INDEX_CGROUP_RSVD, + (unsigned long)h_cg); else - page[2].private = (unsigned long)h_cg; + set_page_private(page + SUBPAGE_INDEX_CGROUP, + (unsigned long)h_cg); return 0; } From f41f2ed43ca5258d70d53290d1951a21621f95c8 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:13 -0700 Subject: [PATCH 004/190] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page Every HugeTLB has more than one struct page structure. We __know__ that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to store metadata associated with each HugeTLB. There are a lot of struct page structures associated with each HugeTLB page. For tail pages, the value of compound_head is the same. So we can reuse first page of tail page structures. We map the virtual addresses of the remaining pages of tail page structures to the first tail page struct, and then free these page frames. Therefore, we need to reserve two pages as vmemmap areas. When we allocate a HugeTLB page from the buddy, we can free some vmemmap pages associated with each HugeTLB page. It is more appropriate to do it in the prep_new_huge_page(). The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages associated with a HugeTLB page can be freed, returns zero for now, which means the feature is disabled. We will enable it once all the infrastructure is there. [willy@infradead.org: fix documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Oscar Salvador Tested-by: Chen Huang Tested-by: Bodeddula Balasubramaniam Acked-by: Michal Hocko Reviewed-by: Mike Kravetz Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. 
McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/bootmem_info.h | 28 ++++- include/linux/mm.h | 3 + mm/Makefile | 1 + mm/hugetlb.c | 22 ++-- mm/hugetlb_vmemmap.c | 218 +++++++++++++++++++++++++++++++++++ mm/hugetlb_vmemmap.h | 20 ++++ mm/sparse-vmemmap.c | 194 +++++++++++++++++++++++++++++++ 7 files changed, 473 insertions(+), 13 deletions(-) create mode 100644 mm/hugetlb_vmemmap.c create mode 100644 mm/hugetlb_vmemmap.h diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h index 4ed6dee1adc9..2bc8b1f69c93 100644 --- a/include/linux/bootmem_info.h +++ b/include/linux/bootmem_info.h @@ -2,7 +2,7 @@ #ifndef __LINUX_BOOTMEM_INFO_H #define __LINUX_BOOTMEM_INFO_H -#include +#include /* * Types for free bootmem stored in page->lru.next. These have to be in @@ -22,6 +22,27 @@ void __init register_page_bootmem_info_node(struct pglist_data *pgdat); void get_page_bootmem(unsigned long info, struct page *page, unsigned long type); void put_page_bootmem(struct page *page); + +/* + * Any memory allocated via the memblock allocator and not via the + * buddy will be marked reserved already in the memmap. For those + * pages, we can call this function to free it to buddy allocator. + */ +static inline void free_bootmem_page(struct page *page) +{ + unsigned long magic = (unsigned long)page->freelist; + + /* + * The reserve_bootmem_region sets the reserved flag on bootmem + * pages. + */ + VM_BUG_ON_PAGE(page_ref_count(page) != 2, page); + + if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) + put_page_bootmem(page); + else + VM_BUG_ON_PAGE(1, page); +} #else static inline void register_page_bootmem_info_node(struct pglist_data *pgdat) { @@ -35,6 +56,11 @@ static inline void get_page_bootmem(unsigned long info, struct page *page, unsigned long type) { } + +static inline void free_bootmem_page(struct page *page) +{ + free_reserved_page(page); +} #endif #endif /* __LINUX_BOOTMEM_INFO_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 07922ee1477e..3437aa7c6c91 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3076,6 +3076,9 @@ static inline void print_vma_addr(char *prefix, unsigned long rip) } #endif +void vmemmap_remap_free(unsigned long start, unsigned long end, + unsigned long reuse); + void *sparse_buffer_alloc(unsigned long size); struct page * __populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap); diff --git a/mm/Makefile b/mm/Makefile index 61ce71ddf7bb..74b47c354682 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -75,6 +75,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o obj-$(CONFIG_ZSWAP) += zswap.o obj-$(CONFIG_HAS_DMA) += dmapool.o obj-$(CONFIG_HUGETLBFS) += hugetlb.o +obj-$(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP) += hugetlb_vmemmap.o obj-$(CONFIG_NUMA) += mempolicy.o obj-$(CONFIG_SPARSEMEM) += sparse.o obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 103f1187043f..5f5493f0f003 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -41,6 +41,7 @@ #include #include #include "internal.h" +#include "hugetlb_vmemmap.h" int hugetlb_max_hstate __read_mostly; unsigned int default_hstate_idx; @@ -1493,8 +1494,9 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid) h->nr_huge_pages_node[nid]++; } -static void __prep_new_huge_page(struct page *page) +static void __prep_new_huge_page(struct hstate *h, struct page *page) { + 
free_huge_page_vmemmap(h, page); INIT_LIST_HEAD(&page->lru); set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); hugetlb_set_page_subpool(page, NULL); @@ -1504,7 +1506,7 @@ static void __prep_new_huge_page(struct page *page) static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) { - __prep_new_huge_page(page); + __prep_new_huge_page(h, page); spin_lock_irq(&hugetlb_lock); __prep_account_new_huge_page(h, nid); spin_unlock_irq(&hugetlb_lock); @@ -2351,14 +2353,15 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, /* * Before dissolving the page, we need to allocate a new one for the - * pool to remain stable. Using alloc_buddy_huge_page() allows us to - * not having to deal with prep_new_huge_page() and avoids dealing of any - * counters. This simplifies and let us do the whole thing under the - * lock. + * pool to remain stable. Here, we allocate the page and 'prep' it + * by doing everything but actually updating counters and adding to + * the pool. This simplifies and let us do most of the processing + * under the lock. */ new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL); if (!new_page) return -ENOMEM; + __prep_new_huge_page(h, new_page); retry: spin_lock_irq(&hugetlb_lock); @@ -2397,14 +2400,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, remove_hugetlb_page(h, old_page, false); /* - * new_page needs to be initialized with the standard hugetlb - * state. This is normally done by prep_new_huge_page() but - * that takes hugetlb_lock which is already held so we need to - * open code it here. * Reference count trick is needed because allocator gives us * referenced page but the pool requires pages with 0 refcount. */ - __prep_new_huge_page(new_page); __prep_account_new_huge_page(h, nid); page_ref_dec(new_page); enqueue_huge_page(h, new_page); @@ -2420,7 +2418,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, free_new: spin_unlock_irq(&hugetlb_lock); - __free_pages(new_page, huge_page_order(h)); + update_and_free_page(h, new_page); return ret; } diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c new file mode 100644 index 000000000000..e45a138a7f85 --- /dev/null +++ b/mm/hugetlb_vmemmap.c @@ -0,0 +1,218 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Free some vmemmap pages of HugeTLB + * + * Copyright (c) 2020, Bytedance. All rights reserved. + * + * Author: Muchun Song + * + * The struct page structures (page structs) are used to describe a physical + * page frame. By default, there is a one-to-one mapping from a page frame to + * it's corresponding page struct. + * + * HugeTLB pages consist of multiple base page size pages and is supported by + * many architectures. See hugetlbpage.rst in the Documentation directory for + * more details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB + * are currently supported. Since the base page size on x86 is 4KB, a 2MB + * HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of + * 4096 base pages. For each base page, there is a corresponding page struct. + * + * Within the HugeTLB subsystem, only the first 4 page structs are used to + * contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides + * this upper limit. The only 'useful' information in the remaining page structs + * is the compound_head field, and this field is the same for all tail pages. 
+ * + * By removing redundant page structs for HugeTLB pages, memory can be returned + * to the buddy allocator for other uses. + * + * Different architectures support different HugeTLB pages. For example, the + * following table is the HugeTLB page size supported by x86 and arm64 + * architectures. Because arm64 supports 4k, 16k, and 64k base pages and + * supports contiguous entries, so it supports many kinds of sizes of HugeTLB + * page. + * + * +--------------+-----------+-----------------------------------------------+ + * | Architecture | Page Size | HugeTLB Page Size | + * +--------------+-----------+-----------+-----------+-----------+-----------+ + * | x86-64 | 4KB | 2MB | 1GB | | | + * +--------------+-----------+-----------+-----------+-----------+-----------+ + * | | 4KB | 64KB | 2MB | 32MB | 1GB | + * | +-----------+-----------+-----------+-----------+-----------+ + * | arm64 | 16KB | 2MB | 32MB | 1GB | | + * | +-----------+-----------+-----------+-----------+-----------+ + * | | 64KB | 2MB | 512MB | 16GB | | + * +--------------+-----------+-----------+-----------+-----------+-----------+ + * + * When the system boot up, every HugeTLB page has more than one struct page + * structs which size is (unit: pages): + * + * struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE + * + * Where HugeTLB_Size is the size of the HugeTLB page. We know that the size + * of the HugeTLB page is always n times PAGE_SIZE. So we can get the following + * relationship. + * + * HugeTLB_Size = n * PAGE_SIZE + * + * Then, + * + * struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE + * = n * sizeof(struct page) / PAGE_SIZE + * + * We can use huge mapping at the pud/pmd level for the HugeTLB page. + * + * For the HugeTLB page of the pmd level mapping, then + * + * struct_size = n * sizeof(struct page) / PAGE_SIZE + * = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE + * = sizeof(struct page) / sizeof(pte_t) + * = 64 / 8 + * = 8 (pages) + * + * Where n is how many pte entries which one page can contains. So the value of + * n is (PAGE_SIZE / sizeof(pte_t)). + * + * This optimization only supports 64-bit system, so the value of sizeof(pte_t) + * is 8. And this optimization also applicable only when the size of struct page + * is a power of two. In most cases, the size of struct page is 64 bytes (e.g. + * x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the + * size of struct page structs of it is 8 page frames which size depends on the + * size of the base page. + * + * For the HugeTLB page of the pud level mapping, then + * + * struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd) + * = PAGE_SIZE / 8 * 8 (pages) + * = PAGE_SIZE (pages) + * + * Where the struct_size(pmd) is the size of the struct page structs of a + * HugeTLB page of the pmd level mapping. + * + * E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB + * HugeTLB page consists in 4096. + * + * Next, we take the pmd level mapping of the HugeTLB page as an example to + * show the internal implementation of this optimization. There are 8 pages + * struct page structs associated with a HugeTLB page which is pmd mapped. + * + * Here is how things look before optimization. 
+ * + * HugeTLB struct pages(8 pages) page frame(8 pages) + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ + * | | | 0 | -------------> | 0 | + * | | +-----------+ +-----------+ + * | | | 1 | -------------> | 1 | + * | | +-----------+ +-----------+ + * | | | 2 | -------------> | 2 | + * | | +-----------+ +-----------+ + * | | | 3 | -------------> | 3 | + * | | +-----------+ +-----------+ + * | | | 4 | -------------> | 4 | + * | PMD | +-----------+ +-----------+ + * | level | | 5 | -------------> | 5 | + * | mapping | +-----------+ +-----------+ + * | | | 6 | -------------> | 6 | + * | | +-----------+ +-----------+ + * | | | 7 | -------------> | 7 | + * | | +-----------+ +-----------+ + * | | + * | | + * | | + * +-----------+ + * + * The value of page->compound_head is the same for all tail pages. The first + * page of page structs (page 0) associated with the HugeTLB page contains the 4 + * page structs necessary to describe the HugeTLB. The only use of the remaining + * pages of page structs (page 1 to page 7) is to point to page->compound_head. + * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs + * will be used for each HugeTLB page. This will allow us to free the remaining + * 6 pages to the buddy allocator. + * + * Here is how things look after remapping. + * + * HugeTLB struct pages(8 pages) page frame(8 pages) + * +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+ + * | | | 0 | -------------> | 0 | + * | | +-----------+ +-----------+ + * | | | 1 | -------------> | 1 | + * | | +-----------+ +-----------+ + * | | | 2 | ----------------^ ^ ^ ^ ^ ^ + * | | +-----------+ | | | | | + * | | | 3 | ------------------+ | | | | + * | | +-----------+ | | | | + * | | | 4 | --------------------+ | | | + * | PMD | +-----------+ | | | + * | level | | 5 | ----------------------+ | | + * | mapping | +-----------+ | | + * | | | 6 | ------------------------+ | + * | | +-----------+ | + * | | | 7 | --------------------------+ + * | | +-----------+ + * | | + * | | + * | | + * +-----------+ + * + * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for + * vmemmap pages and restore the previous mapping relationship. + * + * For the HugeTLB page of the pud level mapping. It is similar to the former. + * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages. + * + * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures + * (e.g. aarch64) provides a contiguous bit in the translation table entries + * that hints to the MMU to indicate that it is one of a contiguous set of + * entries that can be cached in a single TLB entry. + * + * The contiguous bit is used to increase the mapping size at the pmd and pte + * (last) level. So this type of HugeTLB page can be optimized only when its + * size of the struct page structs is greater than 2 pages. + */ +#include "hugetlb_vmemmap.h" + +/* + * There are a lot of struct page structures associated with each HugeTLB page. + * For tail pages, the value of compound_head is the same. So we can reuse first + * page of tail page structures. We map the virtual addresses of the remaining + * pages of tail page structures to the first tail page struct, and then free + * these page frames. Therefore, we need to reserve two pages as vmemmap areas. 
+ */ +#define RESERVE_VMEMMAP_NR 2U +#define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) + +/* + * How many vmemmap pages associated with a HugeTLB page that can be freed + * to the buddy allocator. + * + * Todo: Returns zero for now, which means the feature is disabled. We will + * enable it once all the infrastructure is there. + */ +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) +{ + return 0; +} + +static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h) +{ + return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT; +} + +void free_huge_page_vmemmap(struct hstate *h, struct page *head) +{ + unsigned long vmemmap_addr = (unsigned long)head; + unsigned long vmemmap_end, vmemmap_reuse; + + if (!free_vmemmap_pages_per_hpage(h)) + return; + + vmemmap_addr += RESERVE_VMEMMAP_SIZE; + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h); + vmemmap_reuse = vmemmap_addr - PAGE_SIZE; + + /* + * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end) + * to the page which @vmemmap_reuse is mapped to, then free the pages + * which the range [@vmemmap_addr, @vmemmap_end] is mapped to. + */ + vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse); +} diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h new file mode 100644 index 000000000000..6923f03534d5 --- /dev/null +++ b/mm/hugetlb_vmemmap.h @@ -0,0 +1,20 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Free some vmemmap pages of HugeTLB + * + * Copyright (c) 2020, Bytedance. All rights reserved. + * + * Author: Muchun Song + */ +#ifndef _LINUX_HUGETLB_VMEMMAP_H +#define _LINUX_HUGETLB_VMEMMAP_H +#include + +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP +void free_huge_page_vmemmap(struct hstate *h, struct page *head); +#else +static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head) +{ +} +#endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */ +#endif /* _LINUX_HUGETLB_VMEMMAP_H */ diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 16183d85a7d5..3ec5488c815c 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -27,8 +27,202 @@ #include #include #include +#include +#include + #include #include +#include + +/** + * struct vmemmap_remap_walk - walk vmemmap page table + * + * @remap_pte: called for each lowest-level entry (PTE). + * @reuse_page: the page which is reused for the tail vmemmap pages. + * @reuse_addr: the virtual address of the @reuse_page page. + * @vmemmap_pages: the list head of the vmemmap pages that can be freed. + */ +struct vmemmap_remap_walk { + void (*remap_pte)(pte_t *pte, unsigned long addr, + struct vmemmap_remap_walk *walk); + struct page *reuse_page; + unsigned long reuse_addr; + struct list_head *vmemmap_pages; +}; + +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) +{ + pte_t *pte = pte_offset_kernel(pmd, addr); + + /* + * The reuse_page is found 'first' in table walk before we start + * remapping (which is calling @walk->remap_pte). + */ + if (!walk->reuse_page) { + walk->reuse_page = pte_page(*pte); + /* + * Because the reuse address is part of the range that we are + * walking, skip the reuse address range. 
+ */ + addr += PAGE_SIZE; + pte++; + } + + for (; addr != end; addr += PAGE_SIZE, pte++) + walk->remap_pte(pte, addr, walk); +} + +static void vmemmap_pmd_range(pud_t *pud, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) +{ + pmd_t *pmd; + unsigned long next; + + pmd = pmd_offset(pud, addr); + do { + BUG_ON(pmd_leaf(*pmd)); + + next = pmd_addr_end(addr, end); + vmemmap_pte_range(pmd, addr, next, walk); + } while (pmd++, addr = next, addr != end); +} + +static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) +{ + pud_t *pud; + unsigned long next; + + pud = pud_offset(p4d, addr); + do { + next = pud_addr_end(addr, end); + vmemmap_pmd_range(pud, addr, next, walk); + } while (pud++, addr = next, addr != end); +} + +static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) +{ + p4d_t *p4d; + unsigned long next; + + p4d = p4d_offset(pgd, addr); + do { + next = p4d_addr_end(addr, end); + vmemmap_pud_range(p4d, addr, next, walk); + } while (p4d++, addr = next, addr != end); +} + +static void vmemmap_remap_range(unsigned long start, unsigned long end, + struct vmemmap_remap_walk *walk) +{ + unsigned long addr = start; + unsigned long next; + pgd_t *pgd; + + VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); + VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); + + pgd = pgd_offset_k(addr); + do { + next = pgd_addr_end(addr, end); + vmemmap_p4d_range(pgd, addr, next, walk); + } while (pgd++, addr = next, addr != end); + + /* + * We only change the mapping of the vmemmap virtual address range + * [@start + PAGE_SIZE, end), so we only need to flush the TLB which + * belongs to the range. + */ + flush_tlb_kernel_range(start + PAGE_SIZE, end); +} + +/* + * Free a vmemmap page. A vmemmap page can be allocated from the memblock + * allocator or buddy allocator. If the PG_reserved flag is set, it means + * that it allocated from the memblock allocator, just free it via the + * free_bootmem_page(). Otherwise, use __free_page(). + */ +static inline void free_vmemmap_page(struct page *page) +{ + if (PageReserved(page)) + free_bootmem_page(page); + else + __free_page(page); +} + +/* Free a list of the vmemmap pages */ +static void free_vmemmap_page_list(struct list_head *list) +{ + struct page *page, *next; + + list_for_each_entry_safe(page, next, list, lru) { + list_del(&page->lru); + free_vmemmap_page(page); + } +} + +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, + struct vmemmap_remap_walk *walk) +{ + /* + * Remap the tail pages as read-only to catch illegal write operation + * to the tail pages. + */ + pgprot_t pgprot = PAGE_KERNEL_RO; + pte_t entry = mk_pte(walk->reuse_page, pgprot); + struct page *page = pte_page(*pte); + + list_add(&page->lru, walk->vmemmap_pages); + set_pte_at(&init_mm, addr, pte, entry); +} + +/** + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end) + * to the page which @reuse is mapped to, then free vmemmap + * which the range are mapped to. + * @start: start address of the vmemmap virtual address range that we want + * to remap. + * @end: end address of the vmemmap virtual address range that we want to + * remap. + * @reuse: reuse address. + * + * Note: This function depends on vmemmap being base page mapped. Please make + * sure that we disable PMD mapping of vmemmap pages when calling this function. 
+ */ +void vmemmap_remap_free(unsigned long start, unsigned long end, + unsigned long reuse) +{ + LIST_HEAD(vmemmap_pages); + struct vmemmap_remap_walk walk = { + .remap_pte = vmemmap_remap_pte, + .reuse_addr = reuse, + .vmemmap_pages = &vmemmap_pages, + }; + + /* + * In order to make remapping routine most efficient for the huge pages, + * the routine of vmemmap page table walking has the following rules + * (see more details from the vmemmap_pte_range()): + * + * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) + * should be continuous. + * - The @reuse address is part of the range [@reuse, @end) that we are + * walking which is passed to vmemmap_remap_range(). + * - The @reuse address is the first in the complete range. + * + * So we need to make sure that @start and @reuse meet the above rules. + */ + BUG_ON(start - reuse != PAGE_SIZE); + + vmemmap_remap_range(reuse, end, &walk); + free_vmemmap_page_list(&vmemmap_pages); +} /* * Allocate a block of memory to be used to back the virtual memory map From b65d4adbc0f0d4619f61ee9d8126bc5005b78802 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:17 -0700 Subject: [PATCH 005/190] mm: hugetlb: defer freeing of HugeTLB pages In the subsequent patch, we should allocate the vmemmap pages when freeing a HugeTLB page. But update_and_free_page() can be called under any context, so we cannot use GFP_KERNEL to allocate vmemmap pages. However, we can defer the actual freeing in a kworker to prevent from using GFP_ATOMIC to allocate the vmemmap pages. The __update_and_free_page() is where the call to allocate vmemmmap pages will be inserted. Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Mike Kravetz Reviewed-by: Oscar Salvador Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Bodeddula Balasubramaniam Cc: Borislav Petkov Cc: Chen Huang Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 83 ++++++++++++++++++++++++++++++++++++++++---- mm/hugetlb_vmemmap.c | 12 ------- mm/hugetlb_vmemmap.h | 17 +++++++++ 3 files changed, 93 insertions(+), 19 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f5493f0f003..e7eb1ab8c78a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1376,7 +1376,7 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page, h->nr_huge_pages_node[nid]--; } -static void update_and_free_page(struct hstate *h, struct page *page) +static void __update_and_free_page(struct hstate *h, struct page *page) { int i; struct page *subpage = page; @@ -1399,12 +1399,79 @@ static void update_and_free_page(struct hstate *h, struct page *page) } } +/* + * As update_and_free_page() can be called under any context, so we cannot + * use GFP_KERNEL to allocate vmemmap pages. However, we can defer the + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to allocate + * the vmemmap pages. + * + * free_hpage_workfn() locklessly retrieves the linked list of pages to be + * freed and frees them one-by-one. 
As the page->mapping pointer is going + * to be cleared in free_hpage_workfn() anyway, it is reused as the llist_node + * structure of a lockless linked list of huge pages to be freed. + */ +static LLIST_HEAD(hpage_freelist); + +static void free_hpage_workfn(struct work_struct *work) +{ + struct llist_node *node; + + node = llist_del_all(&hpage_freelist); + + while (node) { + struct page *page; + struct hstate *h; + + page = container_of((struct address_space **)node, + struct page, mapping); + node = node->next; + page->mapping = NULL; + /* + * The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate() + * is going to trigger because a previous call to + * remove_hugetlb_page() will set_compound_page_dtor(page, + * NULL_COMPOUND_DTOR), so do not use page_hstate() directly. + */ + h = size_to_hstate(page_size(page)); + + __update_and_free_page(h, page); + + cond_resched(); + } +} +static DECLARE_WORK(free_hpage_work, free_hpage_workfn); + +static inline void flush_free_hpage_work(struct hstate *h) +{ + if (free_vmemmap_pages_per_hpage(h)) + flush_work(&free_hpage_work); +} + +static void update_and_free_page(struct hstate *h, struct page *page, + bool atomic) +{ + if (!free_vmemmap_pages_per_hpage(h) || !atomic) { + __update_and_free_page(h, page); + return; + } + + /* + * Defer freeing to avoid using GFP_ATOMIC to allocate vmemmap pages. + * + * Only call schedule_work() if hpage_freelist is previously + * empty. Otherwise, schedule_work() had been called but the workfn + * hasn't retrieved the list yet. + */ + if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist)) + schedule_work(&free_hpage_work); +} + static void update_and_free_pages_bulk(struct hstate *h, struct list_head *list) { struct page *page, *t_page; list_for_each_entry_safe(page, t_page, list, lru) { - update_and_free_page(h, page); + update_and_free_page(h, page, false); cond_resched(); } } @@ -1471,12 +1538,12 @@ void free_huge_page(struct page *page) if (HPageTemporary(page)) { remove_hugetlb_page(h, page, false); spin_unlock_irqrestore(&hugetlb_lock, flags); - update_and_free_page(h, page); + update_and_free_page(h, page, true); } else if (h->surplus_huge_pages_node[nid]) { /* remove the page from active list */ remove_hugetlb_page(h, page, true); spin_unlock_irqrestore(&hugetlb_lock, flags); - update_and_free_page(h, page); + update_and_free_page(h, page, true); } else { arch_clear_hugepage_flags(page); enqueue_huge_page(h, page); @@ -1795,7 +1862,7 @@ int dissolve_free_huge_page(struct page *page) remove_hugetlb_page(h, head, false); h->max_huge_pages--; spin_unlock_irq(&hugetlb_lock); - update_and_free_page(h, head); + update_and_free_page(h, head, false); return 0; } out: @@ -2411,14 +2478,14 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, * Pages have been replaced, we can safely free the old one. */ spin_unlock_irq(&hugetlb_lock); - update_and_free_page(h, old_page); + update_and_free_page(h, old_page, false); } return ret; free_new: spin_unlock_irq(&hugetlb_lock); - update_and_free_page(h, new_page); + update_and_free_page(h, new_page, false); return ret; } @@ -2832,6 +2899,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, * pages in hstate via the proc/sysfs interfaces. 
*/ mutex_lock(&h->resize_lock); + flush_free_hpage_work(h); spin_lock_irq(&hugetlb_lock); /* @@ -2941,6 +3009,7 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, /* free the pages after dropping lock */ spin_unlock_irq(&hugetlb_lock); update_and_free_pages_bulk(h, &page_list); + flush_free_hpage_work(h); spin_lock_irq(&hugetlb_lock); while (count < persistent_huge_pages(h)) { diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index e45a138a7f85..cb28c5b6c9ff 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -180,18 +180,6 @@ #define RESERVE_VMEMMAP_NR 2U #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) -/* - * How many vmemmap pages associated with a HugeTLB page that can be freed - * to the buddy allocator. - * - * Todo: Returns zero for now, which means the feature is disabled. We will - * enable it once all the infrastructure is there. - */ -static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) -{ - return 0; -} - static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h) { return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT; diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h index 6923f03534d5..01f8637adbe0 100644 --- a/mm/hugetlb_vmemmap.h +++ b/mm/hugetlb_vmemmap.h @@ -12,9 +12,26 @@ #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP void free_huge_page_vmemmap(struct hstate *h, struct page *head); + +/* + * How many vmemmap pages associated with a HugeTLB page that can be freed + * to the buddy allocator. + * + * Todo: Returns zero for now, which means the feature is disabled. We will + * enable it once all the infrastructure is there. + */ +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) +{ + return 0; +} #else static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head) { } + +static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) +{ + return 0; +} #endif /* CONFIG_HUGETLB_PAGE_FREE_VMEMMAP */ #endif /* _LINUX_HUGETLB_VMEMMAP_H */ From ad2fa3717b74994a22519dbe045757135db00dbb Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:21 -0700 Subject: [PATCH 006/190] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page When we free a HugeTLB page to the buddy allocator, we need to allocate the vmemmap pages associated with it. However, we may not be able to allocate the vmemmap pages when the system is under memory pressure. In this case, we just refuse to free the HugeTLB page. This changes behavior in some corner cases as listed below: 1) Failing to free a huge page triggered by the user (decrease nr_pages). User needs to try again later. 2) Failing to free a surplus huge page when freed by the application. Try again later when freeing a huge page next time. 3) Failing to dissolve a free huge page on ZONE_MOVABLE via offline_pages(). This can happen when we have plenty of ZONE_MOVABLE memory, but not enough kernel memory to allocate vmemmmap pages. We may even be able to migrate huge page contents, but will not be able to dissolve the source huge page. This will prevent an offline operation and is unfortunate as memory offlining is expected to succeed on movable zones. Users that depend on memory hotplug to succeed for movable zones should carefully consider whether the memory savings gained from this feature are worth the risk of possibly not being able to offline memory in certain situations. 
4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via alloc_contig_range() - once we have that handling in place. Mainly affects CMA and virtio-mem. Similar to 3). virito-mem will handle migration errors gracefully. CMA might be able to fallback on other free areas within the CMA region. Vmemmap pages are allocated from the page freeing context. In order for those allocations to be not disruptive (e.g. trigger oom killer) __GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because a non sleeping allocation would be too fragile and it could fail too easily under memory pressure. GFP_ATOMIC or other modes to access memory reserves is not used because we want to prevent consuming reserves under heavy hugetlb freeing. [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page] Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning] Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com Signed-off-by: Muchun Song Signed-off-by: Mike Kravetz Signed-off-by: Matthew Wilcox (Oracle) Reviewed-by: Mike Kravetz Reviewed-by: Oscar Salvador Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Bodeddula Balasubramaniam Cc: Borislav Petkov Cc: Chen Huang Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/hugetlbpage.rst | 8 ++ .../admin-guide/mm/memory-hotplug.rst | 13 +++ include/linux/hugetlb.h | 3 + include/linux/mm.h | 2 + mm/hugetlb.c | 98 ++++++++++++++++--- mm/hugetlb_vmemmap.c | 34 +++++++ mm/hugetlb_vmemmap.h | 6 ++ mm/migrate.c | 5 +- mm/sparse-vmemmap.c | 75 +++++++++++++- 9 files changed, 227 insertions(+), 17 deletions(-) diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index f7b1c7462991..6988895d09a8 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -60,6 +60,10 @@ HugePages_Surp the pool above the value in ``/proc/sys/vm/nr_hugepages``. The maximum number of surplus huge pages is controlled by ``/proc/sys/vm/nr_overcommit_hugepages``. + Note: When the feature of freeing unused vmemmap pages associated + with each hugetlb page is enabled, the number of surplus huge pages + may be temporarily larger than the maximum number of surplus huge + pages when the system is under memory pressure. Hugepagesize is the default hugepage size (in Kb). Hugetlb @@ -80,6 +84,10 @@ returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages by increasing or decreasing the value of ``nr_hugepages``. +Note: When the feature of freeing unused vmemmap pages associated with each +hugetlb page is enabled, we can fail to free the huge pages triggered by +the user when ths system is under memory pressure. Please try again later. + Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. 
Huge pages cannot be swapped out under memory pressure. diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 05d51d2d8beb..c6bae2d77160 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -357,6 +357,19 @@ creates ZONE_MOVABLE as following. Unfortunately, there is no information to show which memory block belongs to ZONE_MOVABLE. This is TBD. + Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE + and the feature of freeing unused vmemmap pages associated with each hugetlb + page is enabled. + + This can happen when we have plenty of ZONE_MOVABLE memory, but not enough + kernel memory to allocate vmemmmap pages. We may even be able to migrate + huge page contents, but will not be able to dissolve the source huge page. + This will prevent an offline operation and is unfortunate as memory offlining + is expected to succeed on movable zones. Users that depend on memory hotplug + to succeed for movable zones should carefully consider whether the memory + savings gained from this feature are worth the risk of possibly not being + able to offline memory in certain situations. + .. note:: Techniques that rely on long-term pinnings of memory (especially, RDMA and vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 0c8c96481259..3578d9d708fe 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -532,12 +532,14 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, * modifications require hugetlb_lock. * HPG_freed - Set when page is on the free lists. * Synchronization: hugetlb_lock held for examination and modification. + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed. 
*/ enum hugetlb_page_flags { HPG_restore_reserve = 0, HPG_migratable, HPG_temporary, HPG_freed, + HPG_vmemmap_optimized, __NR_HPAGEFLAGS, }; @@ -583,6 +585,7 @@ HPAGEFLAG(RestoreReserve, restore_reserve) HPAGEFLAG(Migratable, migratable) HPAGEFLAG(Temporary, temporary) HPAGEFLAG(Freed, freed) +HPAGEFLAG(VmemmapOptimized, vmemmap_optimized) #ifdef CONFIG_HUGETLB_PAGE diff --git a/include/linux/mm.h b/include/linux/mm.h index 3437aa7c6c91..706bee98d965 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3078,6 +3078,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip) void vmemmap_remap_free(unsigned long start, unsigned long end, unsigned long reuse); +int vmemmap_remap_alloc(unsigned long start, unsigned long end, + unsigned long reuse, gfp_t gfp_mask); void *sparse_buffer_alloc(unsigned long size); struct page * __populate_section_memmap(unsigned long pfn, diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e7eb1ab8c78a..778db5de6232 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1376,6 +1376,39 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page, h->nr_huge_pages_node[nid]--; } +static void add_hugetlb_page(struct hstate *h, struct page *page, + bool adjust_surplus) +{ + int zeroed; + int nid = page_to_nid(page); + + VM_BUG_ON_PAGE(!HPageVmemmapOptimized(page), page); + + lockdep_assert_held(&hugetlb_lock); + + INIT_LIST_HEAD(&page->lru); + h->nr_huge_pages++; + h->nr_huge_pages_node[nid]++; + + if (adjust_surplus) { + h->surplus_huge_pages++; + h->surplus_huge_pages_node[nid]++; + } + + set_compound_page_dtor(page, HUGETLB_PAGE_DTOR); + set_page_private(page, 0); + SetHPageVmemmapOptimized(page); + + /* + * This page is now managed by the hugetlb allocator and has + * no users -- drop the last reference. + */ + zeroed = put_page_testzero(page); + VM_BUG_ON_PAGE(!zeroed, page); + arch_clear_hugepage_flags(page); + enqueue_huge_page(h, page); +} + static void __update_and_free_page(struct hstate *h, struct page *page) { int i; @@ -1384,6 +1417,18 @@ static void __update_and_free_page(struct hstate *h, struct page *page) if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported()) return; + if (alloc_huge_page_vmemmap(h, page)) { + spin_lock_irq(&hugetlb_lock); + /* + * If we cannot allocate vmemmap pages, just refuse to free the + * page and put the page back on the hugetlb free list and treat + * as a surplus page. + */ + add_hugetlb_page(h, page, true); + spin_unlock_irq(&hugetlb_lock); + return; + } + for (i = 0; i < pages_per_huge_page(h); i++, subpage = mem_map_next(subpage, page, i)) { subpage->flags &= ~(1 << PG_locked | 1 << PG_error | @@ -1450,7 +1495,7 @@ static inline void flush_free_hpage_work(struct hstate *h) static void update_and_free_page(struct hstate *h, struct page *page, bool atomic) { - if (!free_vmemmap_pages_per_hpage(h) || !atomic) { + if (!HPageVmemmapOptimized(page) || !atomic) { __update_and_free_page(h, page); return; } @@ -1806,10 +1851,14 @@ static struct page *remove_pool_huge_page(struct hstate *h, * nothing for in-use hugepages and non-hugepages. * This function returns values like below: * - * -EBUSY: failed to dissolved free hugepages or the hugepage is in-use - * (allocated or reserved.) 
- * 0: successfully dissolved free hugepages or the page is not a - * hugepage (considered as already dissolved) + * -ENOMEM: failed to allocate vmemmap pages to free the freed hugepages + * when the system is under memory pressure and the feature of + * freeing unused vmemmap pages associated with each hugetlb page + * is enabled. + * -EBUSY: failed to dissolved free hugepages or the hugepage is in-use + * (allocated or reserved.) + * 0: successfully dissolved free hugepages or the page is not a + * hugepage (considered as already dissolved) */ int dissolve_free_huge_page(struct page *page) { @@ -1851,19 +1900,38 @@ int dissolve_free_huge_page(struct page *page) goto retry; } - /* - * Move PageHWPoison flag from head page to the raw error page, - * which makes any subpages rather than the error page reusable. - */ - if (PageHWPoison(head) && page != head) { - SetPageHWPoison(page); - ClearPageHWPoison(head); - } remove_hugetlb_page(h, head, false); h->max_huge_pages--; spin_unlock_irq(&hugetlb_lock); - update_and_free_page(h, head, false); - return 0; + + /* + * Normally update_and_free_page will allocate required vmemmmap + * before freeing the page. update_and_free_page will fail to + * free the page if it can not allocate required vmemmap. We + * need to adjust max_huge_pages if the page is not freed. + * Attempt to allocate vmemmmap here so that we can take + * appropriate action on failure. + */ + rc = alloc_huge_page_vmemmap(h, head); + if (!rc) { + /* + * Move PageHWPoison flag from head page to the raw + * error page, which makes any subpages rather than + * the error page reusable. + */ + if (PageHWPoison(head) && page != head) { + SetPageHWPoison(page); + ClearPageHWPoison(head); + } + update_and_free_page(h, head, false); + } else { + spin_lock_irq(&hugetlb_lock); + add_hugetlb_page(h, head, false); + h->max_huge_pages++; + spin_unlock_irq(&hugetlb_lock); + } + + return rc; } out: spin_unlock_irq(&hugetlb_lock); diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index cb28c5b6c9ff..a897c7778246 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -185,6 +185,38 @@ static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h) return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT; } +/* + * Previously discarded vmemmap pages will be allocated and remapping + * after this function returns zero. + */ +int alloc_huge_page_vmemmap(struct hstate *h, struct page *head) +{ + int ret; + unsigned long vmemmap_addr = (unsigned long)head; + unsigned long vmemmap_end, vmemmap_reuse; + + if (!HPageVmemmapOptimized(head)) + return 0; + + vmemmap_addr += RESERVE_VMEMMAP_SIZE; + vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h); + vmemmap_reuse = vmemmap_addr - PAGE_SIZE; + /* + * The pages which the vmemmap virtual address range [@vmemmap_addr, + * @vmemmap_end) are mapped to are freed to the buddy allocator, and + * the range is mapped to the page which @vmemmap_reuse is mapped to. + * When a HugeTLB page is freed to the buddy allocator, previously + * discarded vmemmap pages must be allocated and remapping. 
+ */ + ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse, + GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE); + + if (!ret) + ClearHPageVmemmapOptimized(head); + + return ret; +} + void free_huge_page_vmemmap(struct hstate *h, struct page *head) { unsigned long vmemmap_addr = (unsigned long)head; @@ -203,4 +235,6 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head) * which the range [@vmemmap_addr, @vmemmap_end] is mapped to. */ vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse); + + SetHPageVmemmapOptimized(head); } diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h index 01f8637adbe0..a37771b0b82a 100644 --- a/mm/hugetlb_vmemmap.h +++ b/mm/hugetlb_vmemmap.h @@ -11,6 +11,7 @@ #include #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP +int alloc_huge_page_vmemmap(struct hstate *h, struct page *head); void free_huge_page_vmemmap(struct hstate *h, struct page *head); /* @@ -25,6 +26,11 @@ static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) return 0; } #else +static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head) +{ + return 0; +} + static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head) { } diff --git a/mm/migrate.c b/mm/migrate.c index 380ca57b9031..cc4d6af41683 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -626,7 +626,10 @@ void migrate_page_states(struct page *newpage, struct page *page) if (PageSwapCache(page)) ClearPageSwapCache(page); ClearPagePrivate(page); - set_page_private(page, 0); + + /* page->private contains hugetlb specific flags */ + if (!PageHuge(page)) + set_page_private(page, 0); /* * If any waiters have accumulated on the new page then diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 3ec5488c815c..a3aa275e2668 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -40,7 +40,8 @@ * @remap_pte: called for each lowest-level entry (PTE). * @reuse_page: the page which is reused for the tail vmemmap pages. * @reuse_addr: the virtual address of the @reuse_page page. - * @vmemmap_pages: the list head of the vmemmap pages that can be freed. + * @vmemmap_pages: the list head of the vmemmap pages that can be freed + * or is mapped from. */ struct vmemmap_remap_walk { void (*remap_pte)(pte_t *pte, unsigned long addr, @@ -224,6 +225,78 @@ void vmemmap_remap_free(unsigned long start, unsigned long end, free_vmemmap_page_list(&vmemmap_pages); } +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, + struct vmemmap_remap_walk *walk) +{ + pgprot_t pgprot = PAGE_KERNEL; + struct page *page; + void *to; + + BUG_ON(pte_page(*pte) != walk->reuse_page); + + page = list_first_entry(walk->vmemmap_pages, struct page, lru); + list_del(&page->lru); + to = page_to_virt(page); + copy_page(to, (void *)walk->reuse_addr); + + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); +} + +static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, + gfp_t gfp_mask, struct list_head *list) +{ + unsigned long nr_pages = (end - start) >> PAGE_SHIFT; + int nid = page_to_nid((struct page *)start); + struct page *page, *next; + + while (nr_pages--) { + page = alloc_pages_node(nid, gfp_mask, 0); + if (!page) + goto out; + list_add_tail(&page->lru, list); + } + + return 0; +out: + list_for_each_entry_safe(page, next, list, lru) + __free_pages(page, 0); + return -ENOMEM; +} + +/** + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end) + * to the page which is from the @vmemmap_pages + * respectively. 
+ * @start: start address of the vmemmap virtual address range that we want + * to remap. + * @end: end address of the vmemmap virtual address range that we want to + * remap. + * @reuse: reuse address. + * @gfp_mask: GFP flag for allocating vmemmap pages. + */ +int vmemmap_remap_alloc(unsigned long start, unsigned long end, + unsigned long reuse, gfp_t gfp_mask) +{ + LIST_HEAD(vmemmap_pages); + struct vmemmap_remap_walk walk = { + .remap_pte = vmemmap_restore_pte, + .reuse_addr = reuse, + .vmemmap_pages = &vmemmap_pages, + }; + + /* See the comment in the vmemmap_remap_free(). */ + BUG_ON(start - reuse != PAGE_SIZE); + + might_sleep_if(gfpflags_allow_blocking(gfp_mask)); + + if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages)) + return -ENOMEM; + + vmemmap_remap_range(reuse, end, &walk); + + return 0; +} + /* * Allocate a block of memory to be used to back the virtual memory map * or to back the page tables that are used to create the mapping. From e9fdff87e893ec5b7c32836675db80cf691b2a8b Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:25 -0700 Subject: [PATCH 007/190] mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap Add a kernel parameter hugetlb_free_vmemmap to enable the feature of freeing unused vmemmap pages associated with each hugetlb page on boot. We disable PMD mapping of vmemmap pages for x86-64 arch when this feature is enabled. Because vmemmap_remap_free() depends on vmemmap being base page mapped. Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Oscar Salvador Reviewed-by: Barry Song Reviewed-by: Miaohe Lin Tested-by: Chen Huang Tested-by: Bodeddula Balasubramaniam Reviewed-by: Mike Kravetz Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .../admin-guide/kernel-parameters.txt | 17 +++++++++++++ Documentation/admin-guide/mm/hugetlbpage.rst | 3 +++ arch/x86/mm/init_64.c | 8 +++++-- include/linux/hugetlb.h | 19 +++++++++++++++ mm/hugetlb_vmemmap.c | 24 +++++++++++++++++++ 5 files changed, 69 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 566c4b9af3cd..b787a9953f2e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1567,6 +1567,23 @@ Documentation/admin-guide/mm/hugetlbpage.rst. Format: size[KMG] + hugetlb_free_vmemmap= + [KNL] Reguires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP + enabled. + Allows heavy hugetlb users to free up some more + memory (6 * PAGE_SIZE for each 2MB hugetlb page). + This feauture is not free though. Large page + tables are not used to back vmemmap pages which + can lead to a performance degradation for some + workloads. Also there will be memory allocation + required when hugetlb pages are freed from the + pool which can lead to corner cases under heavy + memory pressure. + Format: { on | off (default) } + + on: enable the feature + off: disable the feature + hung_task_panic= [KNL] Should the hung task detector generate panics. 
Format: 0 | 1 diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 6988895d09a8..8abaeb144e44 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -153,6 +153,9 @@ default_hugepagesz will all result in 256 2M huge pages being allocated. Valid default huge page size is architecture dependent. +hugetlb_free_vmemmap + When CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is set, this enables freeing + unused vmemmap pages associated with each HugeTLB page. When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages`` indicates the current number of pre-allocated huge pages of the default size. diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 65ea58527176..9d9d18d0c2a1 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -1609,7 +1610,8 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); - if (end - start < PAGES_PER_SECTION * sizeof(struct page)) + if ((is_hugetlb_free_vmemmap_enabled() && !altmap) || + end - start < PAGES_PER_SECTION * sizeof(struct page)) err = vmemmap_populate_basepages(start, end, node, NULL); else if (boot_cpu_has(X86_FEATURE_PSE)) err = vmemmap_populate_hugepages(start, end, node, altmap); @@ -1637,6 +1639,8 @@ void register_page_bootmem_memmap(unsigned long section_nr, pmd_t *pmd; unsigned int nr_pmd_pages; struct page *page; + bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) || + is_hugetlb_free_vmemmap_enabled(); for (; addr < end; addr = next) { pte_t *pte = NULL; @@ -1662,7 +1666,7 @@ void register_page_bootmem_memmap(unsigned long section_nr, } get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO); - if (!boot_cpu_has(X86_FEATURE_PSE)) { + if (base_mapping) { next = (addr + PAGE_SIZE) & PAGE_MASK; pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 3578d9d708fe..9ad99848f9f0 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -892,6 +892,20 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, } #endif +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP +extern bool hugetlb_free_vmemmap_enabled; + +static inline bool is_hugetlb_free_vmemmap_enabled(void) +{ + return hugetlb_free_vmemmap_enabled; +} +#else +static inline bool is_hugetlb_free_vmemmap_enabled(void) +{ + return false; +} +#endif + #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; @@ -1046,6 +1060,11 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr pte_t *ptep, pte_t pte, unsigned long sz) { } + +static inline bool is_hugetlb_free_vmemmap_enabled(void) +{ + return false; +} #endif /* CONFIG_HUGETLB_PAGE */ static inline spinlock_t *huge_pte_lock(struct hstate *h, diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index a897c7778246..3070e1465b1b 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -168,6 +168,8 @@ * (last) level. So this type of HugeTLB page can be optimized only when its * size of the struct page structs is greater than 2 pages. 
*/ +#define pr_fmt(fmt) "HugeTLB: " fmt + #include "hugetlb_vmemmap.h" /* @@ -180,6 +182,28 @@ #define RESERVE_VMEMMAP_NR 2U #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) +bool hugetlb_free_vmemmap_enabled; + +static int __init early_hugetlb_free_vmemmap_param(char *buf) +{ + /* We cannot optimize if a "struct page" crosses page boundaries. */ + if ((!is_power_of_2(sizeof(struct page)))) { + pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n"); + return 0; + } + + if (!buf) + return -EINVAL; + + if (!strcmp(buf, "on")) + hugetlb_free_vmemmap_enabled = true; + else if (strcmp(buf, "off")) + return -EINVAL; + + return 0; +} +early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param); + static inline unsigned long free_vmemmap_pages_size_per_hpage(struct hstate *h) { return (unsigned long)free_vmemmap_pages_per_hpage(h) << PAGE_SHIFT; From 4bab4964a59f277915285787c828b810151de7a1 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:29 -0700 Subject: [PATCH 008/190] mm: memory_hotplug: disable memmap_on_memory when hugetlb_free_vmemmap enabled The parameter of memory_hotplug.memmap_on_memory is not compatible with hugetlb_free_vmemmap. So disable it when hugetlb_free_vmemmap is enabled. [akpm@linux-foundation.org: remove unneeded include, per Oscar] Link: https://lkml.kernel.org/r/20210510030027.56044-9-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Mike Kravetz Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Bodeddula Balasubramaniam Cc: Borislav Petkov Cc: Chen Huang Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Miaohe Lin Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Oscar Salvador Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 8 ++++++++ mm/memory_hotplug.c | 1 + 2 files changed, 9 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index b787a9953f2e..597572ceb8ba 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1584,6 +1584,10 @@ on: enable the feature off: disable the feature + This is not compatible with memory_hotplug.memmap_on_memory. + If both parameters are enabled, hugetlb_free_vmemmap takes + precedence over memory_hotplug.memmap_on_memory. + hung_task_panic= [KNL] Should the hung task detector generate panics. Format: 0 | 1 @@ -2850,6 +2854,10 @@ Note that even when enabled, there are a few cases where the feature is not effective. + This is not compatible with hugetlb_free_vmemmap. If + both parameters are enabled, hugetlb_free_vmemmap takes + precedence over memory_hotplug.memmap_on_memory. + memtest= [KNL,X86,ARM,PPC,RISCV] Enable memtest Format: default : 0 diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ae7a07b02049..29d1d742b565 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1056,6 +1056,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size) * populate a single PMD. 
*/ return memmap_on_memory && + !is_hugetlb_free_vmemmap_enabled() && IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && size == memory_block_size_bytes() && IS_ALIGNED(vmemmap_size, PMD_SIZE) && From 774905878fc9b0b9a5ee4a889b97f773a077aeee Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:47:33 -0700 Subject: [PATCH 009/190] mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate All the infrastructure is ready, so we introduce nr_free_vmemmap_pages field in the hstate to indicate how many vmemmap pages associated with a HugeTLB page that can be freed to buddy allocator. And initialize it in the hugetlb_vmemmap_init(). This patch is actual enablement of the feature. There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP, so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Mike Kravetz Reviewed-by: Oscar Salvador Reviewed-by: Miaohe Lin Tested-by: Chen Huang Tested-by: Bodeddula Balasubramaniam Cc: Alexander Viro Cc: Andy Lutomirski Cc: Anshuman Khandual Cc: Balbir Singh Cc: Barry Song Cc: Borislav Petkov Cc: Dave Hansen Cc: David Hildenbrand Cc: David Rientjes Cc: HORIGUCHI NAOYA Cc: "H. Peter Anvin" Cc: Ingo Molnar Cc: Joao Martins Cc: Joerg Roedel Cc: Jonathan Corbet Cc: Matthew Wilcox Cc: Michal Hocko Cc: Mina Almasry Cc: Oliver Neukum Cc: Paul E. McKenney Cc: Pawan Gupta Cc: Peter Zijlstra Cc: Randy Dunlap Cc: Thomas Gleixner Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 3 +++ mm/hugetlb.c | 1 + mm/hugetlb_vmemmap.c | 33 +++++++++++++++++++++++++++++++++ mm/hugetlb_vmemmap.h | 10 ++++++---- 4 files changed, 43 insertions(+), 4 deletions(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 9ad99848f9f0..8c1920844236 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -608,6 +608,9 @@ struct hstate { unsigned int nr_huge_pages_node[MAX_NUMNODES]; unsigned int free_huge_pages_node[MAX_NUMNODES]; unsigned int surplus_huge_pages_node[MAX_NUMNODES]; +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP + unsigned int nr_free_vmemmap_pages; +#endif #ifdef CONFIG_CGROUP_HUGETLB /* cgroup control files */ struct cftype cgroup_files_dfl[7]; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 778db5de6232..dd9c90c082fc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3585,6 +3585,7 @@ void __init hugetlb_add_hstate(unsigned int order) h->next_nid_to_free = first_memory_node; snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB", huge_page_size(h)/1024); + hugetlb_vmemmap_init(h); parsed_hstate = h; } diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 3070e1465b1b..f9f9bb212319 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -262,3 +262,36 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head) SetHPageVmemmapOptimized(head); } + +void __init hugetlb_vmemmap_init(struct hstate *h) +{ + unsigned int nr_pages = pages_per_huge_page(h); + unsigned int vmemmap_pages; + + /* + * There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct + * page structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP, + * so add a BUILD_BUG_ON to catch invalid usage of the tail struct page. 
+ */ + BUILD_BUG_ON(__NR_USED_SUBPAGE >= + RESERVE_VMEMMAP_SIZE / sizeof(struct page)); + + if (!hugetlb_free_vmemmap_enabled) + return; + + vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT; + /* + * The head page and the first tail page are not to be freed to buddy + * allocator, the other pages will map to the first tail page, so they + * can be freed. + * + * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true + * on some architectures (e.g. aarch64). See Documentation/arm64/ + * hugetlbpage.rst for more details. + */ + if (likely(vmemmap_pages > RESERVE_VMEMMAP_NR)) + h->nr_free_vmemmap_pages = vmemmap_pages - RESERVE_VMEMMAP_NR; + + pr_info("can free %d vmemmap pages for %s\n", h->nr_free_vmemmap_pages, + h->name); +} diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h index a37771b0b82a..cb2bef8f9e73 100644 --- a/mm/hugetlb_vmemmap.h +++ b/mm/hugetlb_vmemmap.h @@ -13,17 +13,15 @@ #ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP int alloc_huge_page_vmemmap(struct hstate *h, struct page *head); void free_huge_page_vmemmap(struct hstate *h, struct page *head); +void hugetlb_vmemmap_init(struct hstate *h); /* * How many vmemmap pages associated with a HugeTLB page that can be freed * to the buddy allocator. - * - * Todo: Returns zero for now, which means the feature is disabled. We will - * enable it once all the infrastructure is there. */ static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) { - return 0; + return h->nr_free_vmemmap_pages; } #else static inline int alloc_huge_page_vmemmap(struct hstate *h, struct page *head) @@ -35,6 +33,10 @@ static inline void free_huge_page_vmemmap(struct hstate *h, struct page *head) { } +static inline void hugetlb_vmemmap_init(struct hstate *h) +{ +} + static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h) { return 0; From 5fe77be6bf14bf6c471be58c68edc9e0f97b72fb Mon Sep 17 00:00:00 2001 From: Shixin Liu Date: Wed, 30 Jun 2021 18:47:37 -0700 Subject: [PATCH 010/190] mm/debug_vm_pgtable: move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE The functions {pmd/pud}_set_huge and {pmd/pud}_clear_huge are not dependent on THP. Hence move {pmd/pud}_huge_tests out of CONFIG_TRANSPARENT_HUGEPAGE. Link: https://lkml.kernel.org/r/20210419071820.750217-1-liushixin2@huawei.com Signed-off-by: Shixin Liu Reviewed-by: Anshuman Khandual Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/debug_vm_pgtable.c | 113 +++++++++++++++++++----------------------- 1 file changed, 50 insertions(+), 63 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 92bfc37300df..760bb319f731 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -248,29 +248,6 @@ static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pmd_leaf(pmd)); } -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP -static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) -{ - pmd_t pmd; - - if (!arch_vmap_pmd_supported(prot)) - return; - - pr_debug("Validating PMD huge\n"); - /* - * X86 defined pmd_set_huge() verifies that the given - * PMD is not a populated non-leaf entry. 
- */ - WRITE_ONCE(*pmdp, __pmd(0)); - WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); - WARN_ON(!pmd_clear_huge(pmdp)); - pmd = READ_ONCE(*pmdp); - WARN_ON(!pmd_none(pmd)); -} -#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */ -static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } -#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ - static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { pmd_t pmd; @@ -395,8 +372,56 @@ static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) pud = pud_mkhuge(pud); WARN_ON(!pud_leaf(pud)); } +#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ +static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } +static void __init pud_advanced_tests(struct mm_struct *mm, + struct vm_area_struct *vma, pud_t *pudp, + unsigned long pfn, unsigned long vaddr, + pgprot_t prot) +{ +} +static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ +static void __init pmd_basic_tests(unsigned long pfn, int idx) { } +static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } +static void __init pmd_advanced_tests(struct mm_struct *mm, + struct vm_area_struct *vma, pmd_t *pmdp, + unsigned long pfn, unsigned long vaddr, + pgprot_t prot, pgtable_t pgtable) +{ +} +static void __init pud_advanced_tests(struct mm_struct *mm, + struct vm_area_struct *vma, pud_t *pudp, + unsigned long pfn, unsigned long vaddr, + pgprot_t prot) +{ +} +static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } +static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } +static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) +{ + pmd_t pmd; + + if (!arch_vmap_pmd_supported(prot)) + return; + + pr_debug("Validating PMD huge\n"); + /* + * X86 defined pmd_set_huge() verifies that the given + * PMD is not a populated non-leaf entry. 
+ */ + WRITE_ONCE(*pmdp, __pmd(0)); + WARN_ON(!pmd_set_huge(pmdp, __pfn_to_phys(pfn), prot)); + WARN_ON(!pmd_clear_huge(pmdp)); + pmd = READ_ONCE(*pmdp); + WARN_ON(!pmd_none(pmd)); +} + static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { pud_t pud; @@ -416,47 +441,9 @@ static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) WARN_ON(!pud_none(pud)); } #else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ +static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) { } static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) { } -#endif /* !CONFIG_HAVE_ARCH_HUGE_VMAP */ - -#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } -static void __init pud_advanced_tests(struct mm_struct *mm, - struct vm_area_struct *vma, pud_t *pudp, - unsigned long pfn, unsigned long vaddr, - pgprot_t prot) -{ -} -static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } -static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) -{ -} -#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ -static void __init pmd_basic_tests(unsigned long pfn, int idx) { } -static void __init pud_basic_tests(struct mm_struct *mm, unsigned long pfn, int idx) { } -static void __init pmd_advanced_tests(struct mm_struct *mm, - struct vm_area_struct *vma, pmd_t *pmdp, - unsigned long pfn, unsigned long vaddr, - pgprot_t prot, pgtable_t pgtable) -{ -} -static void __init pud_advanced_tests(struct mm_struct *mm, - struct vm_area_struct *vma, pud_t *pudp, - unsigned long pfn, unsigned long vaddr, - pgprot_t prot) -{ -} -static void __init pmd_leaf_tests(unsigned long pfn, pgprot_t prot) { } -static void __init pud_leaf_tests(unsigned long pfn, pgprot_t prot) { } -static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot) -{ -} -static void __init pud_huge_tests(pud_t *pudp, unsigned long pfn, pgprot_t prot) -{ -} -static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) { } -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ static void __init p4d_basic_tests(unsigned long pfn, pgprot_t prot) { From b593b90dc9768d4873b8b7c60be2c69d8f5c180e Mon Sep 17 00:00:00 2001 From: Shixin Liu Date: Wed, 30 Jun 2021 18:47:40 -0700 Subject: [PATCH 011/190] mm/debug_vm_pgtable: remove redundant pfn_{pmd/pte}() and fix one comment mistake Remove redundant pfn_{pmd/pte}() in {pmd/pte}_advanced_tests() and adjust pfn_pud() in pud_advanced_tests() to make it similar with other two functions. In addition, the branch condition should be CONFIG_TRANSPARENT_HUGEPAGE instead of CONFIG_ARCH_HAS_PTE_DEVMAP. Link: https://lkml.kernel.org/r/20210419071820.750217-2-liushixin2@huawei.com Signed-off-by: Shixin Liu Reviewed-by: Anshuman Khandual Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/debug_vm_pgtable.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 760bb319f731..f7b23565a04f 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -91,7 +91,7 @@ static void __init pte_advanced_tests(struct mm_struct *mm, unsigned long pfn, unsigned long vaddr, pgprot_t prot) { - pte_t pte = pfn_pte(pfn, prot); + pte_t pte; /* * Architectures optimize set_pte_at by avoiding TLB flush. 
@@ -778,12 +778,12 @@ static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) WARN_ON(!pmd_swp_soft_dirty(pmd_swp_mksoft_dirty(pmd))); WARN_ON(pmd_swp_soft_dirty(pmd_swp_clear_soft_dirty(pmd))); } -#else /* !CONFIG_ARCH_HAS_PTE_DEVMAP */ +#else /* !CONFIG_TRANSPARENT_HUGEPAGE */ static void __init pmd_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } static void __init pmd_swap_soft_dirty_tests(unsigned long pfn, pgprot_t prot) { } -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */ +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static void __init pte_swap_tests(unsigned long pfn, pgprot_t prot) { From b2bd53f18bb7f7cfc91b3bb527d7809376700a8e Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:47:43 -0700 Subject: [PATCH 012/190] mm/huge_memory.c: remove dedicated macro HPAGE_CACHE_INDEX_MASK Patch series "Cleanup and fixup for huge_memory:, v3. This series contains cleanups to remove dedicated macro and remove unnecessary tlb_remove_page_size() for huge zero pmd. Also this adds missing read-only THP checking for transparent_hugepage_enabled() and avoids discarding hugepage if other processes are mapping it. More details can be found in the respective changelogs. Thi patch (of 5): Rewrite the pgoff checking logic to remove macro HPAGE_CACHE_INDEX_MASK which is only used here to simplify the code. Link: https://lkml.kernel.org/r/20210511134857.1581273-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210511134857.1581273-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Yang Shi Reviewed-by: Anshuman Khandual Reviewed-by: David Hildenbrand Cc: Zi Yan Cc: William Kucharski Cc: Matthew Wilcox Cc: "Aneesh Kumar K . V" Cc: Ralph Campbell Cc: Song Liu Cc: Kirill A. Shutemov Cc: Rik van Riel Cc: Johannes Weiner Cc: Minchan Kim Cc: Hugh Dickins Cc: Alexey Dobriyan Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2a8ebe6c222e..8a5f49abcfa2 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -152,15 +152,13 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) bool transparent_hugepage_enabled(struct vm_area_struct *vma); -#define HPAGE_CACHE_INDEX_MASK (HPAGE_PMD_NR - 1) - static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, unsigned long haddr) { /* Don't have to check pgoff for anonymous vma */ if (!vma_is_anonymous(vma)) { - if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) != - (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK)) + if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, + HPAGE_PMD_NR)) return false; } From dfe5c51c6029af0a6c302a0d5dcde3cc4e298a47 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:47:46 -0700 Subject: [PATCH 013/190] mm/huge_memory.c: use page->deferred_list Now that we can represent the location of ->deferred_list instead of ->mapping + ->index, make use of it to improve readability. Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Yang Shi Reviewed-by: David Hildenbrand Cc: Alexey Dobriyan Cc: "Aneesh Kumar K . V" Cc: Anshuman Khandual Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox Cc: Minchan Kim Cc: Ralph Campbell Cc: Rik van Riel Cc: Song Liu Cc: William Kucharski Cc: Zi Yan Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6d2a0119fc58..a76b3835251d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2870,7 +2870,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, spin_lock_irqsave(&ds_queue->split_queue_lock, flags); /* Take pin on all head pages to avoid freeing them under us */ list_for_each_safe(pos, next, &ds_queue->split_queue) { - page = list_entry((void *)pos, struct page, mapping); + page = list_entry((void *)pos, struct page, deferred_list); page = compound_head(page); if (get_page_unless_zero(page)) { list_move(page_deferred_list(page), &list); @@ -2885,7 +2885,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink, spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags); list_for_each_safe(pos, next, &list) { - page = list_entry((void *)pos, struct page, mapping); + page = list_entry((void *)pos, struct page, deferred_list); if (!trylock_page(page)) goto next; /* split_huge_page() removes page from list on success */ From e6be37b2e7bddfe0c76585ee7c7eee5acc8efeab Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:47:50 -0700 Subject: [PATCH 014/190] mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled() Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS"), read-only THP file mapping is supported. But it forgot to add checking for it in transparent_hugepage_enabled(). To fix it, we add checking for read-only THP file mapping and also introduce helper transhuge_vma_enabled() to check whether thp is enabled for specified vma to reduce duplicated code. We rename transparent_hugepage_enabled to transparent_hugepage_active to make the code easier to follow as suggested by David Hildenbrand. [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable] Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Miaohe Lin Reviewed-by: Yang Shi Cc: Alexey Dobriyan Cc: "Aneesh Kumar K . V" Cc: Anshuman Khandual Cc: David Hildenbrand Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox Cc: Minchan Kim Cc: Ralph Campbell Cc: Rik van Riel Cc: Song Liu Cc: William Kucharski Cc: Zi Yan Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/task_mmu.c | 2 +- include/linux/huge_mm.h | 57 +++++++++++++++++++++++++---------------- mm/huge_memory.c | 11 +++++++- mm/khugepaged.c | 4 +-- mm/shmem.c | 3 +-- 5 files changed, 48 insertions(+), 29 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 66965ad88d8b..9feda7792882 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -832,7 +832,7 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); seq_printf(m, "THPeligible: %d\n", - transparent_hugepage_enabled(vma)); + transparent_hugepage_active(vma)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8a5f49abcfa2..b4e1ebaae825 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -115,9 +115,34 @@ extern struct kobj_attribute shmem_enabled_attr; extern unsigned long transparent_hugepage_flags; +static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long haddr) +{ + /* Don't have to check pgoff for anonymous vma */ + if (!vma_is_anonymous(vma)) { + if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, + HPAGE_PMD_NR)) + return false; + } + + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) + return false; + return true; +} + +static inline bool transhuge_vma_enabled(struct vm_area_struct *vma, + unsigned long vm_flags) +{ + /* Explicitly disabled through madvise. */ + if ((vm_flags & VM_NOHUGEPAGE) || + test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + return false; + return true; +} + /* * to be used on vmas which are known to support THP. 
- * Use transparent_hugepage_enabled otherwise + * Use transparent_hugepage_active otherwise */ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) { @@ -128,15 +153,12 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX)) return false; - if (vma->vm_flags & VM_NOHUGEPAGE) + if (!transhuge_vma_enabled(vma, vma->vm_flags)) return false; if (vma_is_temporary_stack(vma)) return false; - if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) - return false; - if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG)) return true; @@ -150,22 +172,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) return false; } -bool transparent_hugepage_enabled(struct vm_area_struct *vma); - -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long haddr) -{ - /* Don't have to check pgoff for anonymous vma */ - if (!vma_is_anonymous(vma)) { - if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR)) - return false; - } - - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) - return false; - return true; -} +bool transparent_hugepage_active(struct vm_area_struct *vma); #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ @@ -352,7 +359,7 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma) return false; } -static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma) +static inline bool transparent_hugepage_active(struct vm_area_struct *vma) { return false; } @@ -363,6 +370,12 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, return false; } +static inline bool transhuge_vma_enabled(struct vm_area_struct *vma, + unsigned long vm_flags) +{ + return false; +} + static inline void prep_transhuge_page(struct page *page) {} static inline bool is_transparent_hugepage(struct page *page) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a76b3835251d..8529ba6af151 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -64,7 +64,14 @@ static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; -bool transparent_hugepage_enabled(struct vm_area_struct *vma) +static inline bool file_thp_enabled(struct vm_area_struct *vma) +{ + return transhuge_vma_enabled(vma, vma->vm_flags) && vma->vm_file && + !inode_is_open_for_write(vma->vm_file->f_inode) && + (vma->vm_flags & VM_EXEC); +} + +bool transparent_hugepage_active(struct vm_area_struct *vma) { /* The addr is used to check if the vma size fits */ unsigned long addr = (vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE; @@ -75,6 +82,8 @@ bool transparent_hugepage_enabled(struct vm_area_struct *vma) return __transparent_hugepage_enabled(vma); if (vma_is_shmem(vma)) return shmem_huge_enabled(vma); + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS)) + return file_thp_enabled(vma); return false; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 6c0185fdd815..d97b20fad6e8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -442,9 +442,7 @@ static inline int khugepaged_test_exit(struct mm_struct *mm) static bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags) { - /* Explicitly disabled through madvise. 
*/ - if ((vm_flags & VM_NOHUGEPAGE) || - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + if (!transhuge_vma_enabled(vma, vm_flags)) return false; /* Enabled via shmem mount options or sysfs settings. */ diff --git a/mm/shmem.c b/mm/shmem.c index e72931b9246c..1155279a6276 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -4040,8 +4040,7 @@ bool shmem_huge_enabled(struct vm_area_struct *vma) loff_t i_size; pgoff_t off; - if ((vma->vm_flags & VM_NOHUGEPAGE) || - test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) + if (!transhuge_vma_enabled(vma, vma->vm_flags)) return false; if (shmem_huge == SHMEM_HUGE_FORCE) return true; From 9132a468aafdaed5efd8dd5506b29f55a738782e Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:47:53 -0700 Subject: [PATCH 015/190] mm/huge_memory.c: remove unnecessary tlb_remove_page_size() for huge zero pmd Commit aa88b68c3b1d ("thp: keep huge zero page pinned until tlb flush") introduced tlb_remove_page() for huge zero page to keep it pinned until flush is complete and prevents the page from being split under us. But huge zero page is kept pinned until all relevant mm_users reach zero since the commit 6fcb52a56ff6 ("thp: reduce usage of huge zero page's atomic counter"). So tlb_remove_page_size() for huge zero pmd is unnecessary now. Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com Reviewed-by: Yang Shi Acked-by: David Hildenbrand Signed-off-by: Miaohe Lin Cc: Alexey Dobriyan Cc: "Aneesh Kumar K . V" Cc: Anshuman Khandual Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kirill A. Shutemov Cc: Matthew Wilcox Cc: Minchan Kim Cc: Ralph Campbell Cc: Rik van Riel Cc: Song Liu Cc: William Kucharski Cc: Zi Yan Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8529ba6af151..7972baa2ae69 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1686,12 +1686,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); - if (is_huge_zero_pmd(orig_pmd)) - tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); } else if (is_huge_zero_pmd(orig_pmd)) { zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); - tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); } else { struct page *page = NULL; int flush_needed = 1; From babbbdd08af98a59089334eb3effbed5a7a0cf7f Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:47:57 -0700 Subject: [PATCH 016/190] mm/huge_memory.c: don't discard hugepage if other processes are mapping it If other processes are mapping any other subpages of the hugepage, i.e. in pte-mapped thp case, page_mapcount() will return 1 incorrectly. Then we would discard the page while other processes are still mapping it. Fix it by using total_mapcount() which can tell whether other processes are still mapping it. Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called") Reviewed-by: Yang Shi Signed-off-by: Miaohe Lin Cc: Alexey Dobriyan Cc: "Aneesh Kumar K . V" Cc: Anshuman Khandual Cc: David Hildenbrand Cc: Hugh Dickins Cc: Johannes Weiner Cc: Kirill A. 
Shutemov Cc: Matthew Wilcox Cc: Minchan Kim Cc: Ralph Campbell Cc: Rik van Riel Cc: Song Liu Cc: William Kucharski Cc: Zi Yan Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7972baa2ae69..7ccea3e3e216 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1613,7 +1613,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, * If other processes are mapping this page, we couldn't discard * the page unless they all do MADV_FREE so let's skip the page. */ - if (page_mapcount(page) != 1) + if (total_mapcount(page) != 1) goto out; if (!trylock_page(page)) From 79c1c594f49a88fba9744cb5c85978c6b1b365ec Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Wed, 30 Jun 2021 18:48:00 -0700 Subject: [PATCH 017/190] mm/hugetlb: change parameters of arch_make_huge_pte() Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2. This series implements huge VMAP and VMALLOC on powerpc 8xx. Powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires use of hugepd tables. To allow this, the architecture provides two functions: - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what page size to use. A stub returning PAGE_SIZE is provided when the architecture doesn't provide this function. - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range() what page shift to use for a given area size. A stub returning PAGE_SHIFT is provided when the architecture doesn't provide this function. This patch (of 5): At the time being, arch_make_huge_pte() has the following prototype: pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, struct page *page, int writable); vma is used to get the pages shift or size. vma is also used on Sparc to get vm_flags. page is not used. writable is not used. In order to use this function without a vma, replace vma by shift and flags. Also remove the used parameters. 
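For reference, the resulting call pattern can be summarized as below (a condensed sketch drawn from the diff that appears further below, not an addition to the patch itself):

    /* New prototype: callers pass the page shift and vm_flags directly. */
    pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags);

    /* Typical caller-side conversion, as done in make_huge_pte(): */
    unsigned int shift = huge_page_shift(hstate_vma(vma));

    entry = arch_make_huge_pte(entry, shift, vma->vm_flags);
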
Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Acked-by: Mike Kravetz Cc: Nicholas Piggin Cc: Mike Kravetz Cc: Mike Rapoport Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Uladzislau Rezki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/asm/hugetlb.h | 3 +-- arch/arm64/mm/hugetlbpage.c | 5 ++--- arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h | 5 ++--- arch/sparc/include/asm/pgtable_64.h | 3 +-- arch/sparc/mm/hugetlbpage.c | 6 ++---- include/linux/hugetlb.h | 4 ++-- mm/hugetlb.c | 6 ++++-- mm/migrate.c | 4 +++- 8 files changed, 17 insertions(+), 19 deletions(-) diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index 5abf91e3494c..1242f71937f8 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -23,8 +23,7 @@ static inline void arch_clear_hugepage_flags(struct page *page) } #define arch_clear_hugepage_flags arch_clear_hugepage_flags -extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writable); +pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags); #define arch_make_huge_pte arch_make_huge_pte #define __HAVE_ARCH_HUGE_SET_HUGE_PTE_AT extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 58987a98e179..23505fc35324 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -339,10 +339,9 @@ pte_t *huge_pte_offset(struct mm_struct *mm, return NULL; } -pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writable) +pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) { - size_t pagesize = huge_page_size(hstate_vma(vma)); + size_t pagesize = 1UL << shift; if (pagesize == CONT_PTE_SIZE) { entry = pte_mkcont(entry); diff --git a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h index 39be9aea86db..64b6c608eca4 100644 --- a/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/hugetlb-8xx.h @@ -66,10 +66,9 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm, } #ifdef CONFIG_PPC_4K_PAGES -static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writable) +static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) { - size_t size = huge_page_size(hstate_vma(vma)); + size_t size = 1UL << shift; if (size == SZ_16K) return __pte(pte_val(entry) & ~_PAGE_HUGE); diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 550d3904de65..2cd80a0a9795 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -377,8 +377,7 @@ static inline pgprot_t pgprot_noncached(pgprot_t prot) #define pgprot_noncached pgprot_noncached #if defined(CONFIG_HUGETLB_PAGE) || defined(CONFIG_TRANSPARENT_HUGEPAGE) -extern pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writable); +pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags); #define arch_make_huge_pte arch_make_huge_pte static inline unsigned long __pte_default_huge_mask(void) { diff --git a/arch/sparc/mm/hugetlbpage.c 
b/arch/sparc/mm/hugetlbpage.c index 04d8790f6c32..0f49fada2093 100644 --- a/arch/sparc/mm/hugetlbpage.c +++ b/arch/sparc/mm/hugetlbpage.c @@ -177,10 +177,8 @@ static pte_t hugepage_shift_to_tte(pte_t entry, unsigned int shift) return sun4u_hugepage_shift_to_tte(entry, shift); } -pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writeable) +pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags) { - unsigned int shift = huge_page_shift(hstate_vma(vma)); pte_t pte; pte = hugepage_shift_to_tte(entry, shift); @@ -188,7 +186,7 @@ pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, #ifdef CONFIG_SPARC64 /* If this vma has ADI enabled on it, turn on TTE.mcd */ - if (vma->vm_flags & VM_SPARC_ADI) + if (flags & VM_SPARC_ADI) return pte_mkmcd(pte); else return pte_mknotmcd(pte); diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 8c1920844236..cfde3bec2261 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -741,8 +741,8 @@ static inline void arch_clear_hugepage_flags(struct page *page) { } #endif #ifndef arch_make_huge_pte -static inline pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma, - struct page *page, int writable) +static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, + vm_flags_t flags) { return entry; } diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dd9c90c082fc..88f2178ad7c9 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4060,6 +4060,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, int writable) { pte_t entry; + unsigned int shift = huge_page_shift(hstate_vma(vma)); if (writable) { entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page, @@ -4070,7 +4071,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page, } entry = pte_mkyoung(entry); entry = pte_mkhuge(entry); - entry = arch_make_huge_pte(entry, vma, page, writable); + entry = arch_make_huge_pte(entry, shift, vma->vm_flags); return entry; } @@ -5468,10 +5469,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, } if (!huge_pte_none(pte)) { pte_t old_pte; + unsigned int shift = huge_page_shift(hstate_vma(vma)); old_pte = huge_ptep_modify_prot_start(vma, address, ptep); pte = pte_mkhuge(huge_pte_modify(old_pte, newprot)); - pte = arch_make_huge_pte(pte, vma, NULL, 0); + pte = arch_make_huge_pte(pte, shift, vma->vm_flags); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++; } diff --git a/mm/migrate.c b/mm/migrate.c index cc4d6af41683..75a15f0a2698 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -226,8 +226,10 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, #ifdef CONFIG_HUGETLB_PAGE if (PageHuge(new)) { + unsigned int shift = huge_page_shift(hstate_vma(vma)); + pte = pte_mkhuge(pte); - pte = arch_make_huge_pte(pte, vma, new, 0); + pte = arch_make_huge_pte(pte, shift, vma->vm_flags); set_huge_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte); if (PageAnon(new)) hugepage_add_anon_rmap(new, vma, pvmw.address); From c742199a014de23ee92055c2473d91fe5561ffdf Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Wed, 30 Jun 2021 18:48:03 -0700 Subject: [PATCH 018/190] mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge For architectures with no PMD and/or no PUD, add stubs similar to what we have for architectures without P4D. 
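For illustration, a sketch of the stub pattern this patch adds (compare the include/linux/pgtable.h hunk below): when a page table level is folded, generic callers still compile and simply see that no huge mapping was installed or cleared.

	#ifdef __PAGETABLE_PUD_FOLDED
	/* PUD level folded away: huge PUD mappings can never exist */
	static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
	{
		return 0;	/* nothing was mapped */
	}
	static inline int pud_clear_huge(pud_t *pud)
	{
		return 0;	/* nothing to clear */
	}
	#endif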
[christophe.leroy@csgroup.eu: arm64: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/73ec95f40cafbbb69bdfb43a7f53876fd845b0ce.1620990479.git.christophe.leroy@csgroup.eu [christophe.leroy@csgroup.eu: x86: define only {pud/pmd}_{set/clear}_huge when useful] Link: https://lkml.kernel.org/r/7fbf1b6bc3e15c07c24fa45278d57064f14c896b.1620930415.git.christophe.leroy@csgroup.eu Link: https://lkml.kernel.org/r/5ac5976419350e8e048d463a64cae449eb3ba4b0.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Uladzislau Rezki Cc: Naresh Kamboju Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/mmu.c | 20 ++++++++++++-------- arch/x86/mm/pgtable.c | 34 +++++++++++++++++++--------------- include/linux/pgtable.h | 26 +++++++++++++++++++++++++- 3 files changed, 56 insertions(+), 24 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 89b66ef43a0f..add67deb4910 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1338,6 +1338,7 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot) return dt_virt; } +#if CONFIG_PGTABLE_LEVELS > 3 int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) { pud_t new_pud = pfn_pud(__phys_to_pfn(phys), mk_pud_sect_prot(prot)); @@ -1352,6 +1353,16 @@ int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) return 1; } +int pud_clear_huge(pud_t *pudp) +{ + if (!pud_sect(READ_ONCE(*pudp))) + return 0; + pud_clear(pudp); + return 1; +} +#endif + +#if CONFIG_PGTABLE_LEVELS > 2 int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot) { pmd_t new_pmd = pfn_pmd(__phys_to_pfn(phys), mk_pmd_sect_prot(prot)); @@ -1366,14 +1377,6 @@ int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot) return 1; } -int pud_clear_huge(pud_t *pudp) -{ - if (!pud_sect(READ_ONCE(*pudp))) - return 0; - pud_clear(pudp); - return 1; -} - int pmd_clear_huge(pmd_t *pmdp) { if (!pmd_sect(READ_ONCE(*pmdp))) @@ -1381,6 +1384,7 @@ int pmd_clear_huge(pmd_t *pmdp) pmd_clear(pmdp); return 1; } +#endif int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr) { diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index d27cf69e811d..1303ff6ef7be 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -682,6 +682,7 @@ int p4d_clear_huge(p4d_t *p4d) } #endif +#if CONFIG_PGTABLE_LEVELS > 3 /** * pud_set_huge - setup kernel PUD mapping * @@ -720,6 +721,23 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) return 1; } +/** + * pud_clear_huge - clear kernel PUD mapping when it is set + * + * Returns 1 on success and 0 on failure (no PUD map is found). + */ +int pud_clear_huge(pud_t *pud) +{ + if (pud_large(*pud)) { + pud_clear(pud); + return 1; + } + + return 0; +} +#endif + +#if CONFIG_PGTABLE_LEVELS > 2 /** * pmd_set_huge - setup kernel PMD mapping * @@ -750,21 +768,6 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot) return 1; } -/** - * pud_clear_huge - clear kernel PUD mapping when it is set - * - * Returns 1 on success and 0 on failure (no PUD map is found). 
- */ -int pud_clear_huge(pud_t *pud) -{ - if (pud_large(*pud)) { - pud_clear(pud); - return 1; - } - - return 0; -} - /** * pmd_clear_huge - clear kernel PMD mapping when it is set * @@ -779,6 +782,7 @@ int pmd_clear_huge(pmd_t *pmd) return 0; } +#endif #ifdef CONFIG_X86_64 /** diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index c32600c9e1ad..2b0d02291178 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1379,10 +1379,34 @@ static inline int p4d_clear_huge(p4d_t *p4d) } #endif /* !__PAGETABLE_P4D_FOLDED */ +#ifndef __PAGETABLE_PUD_FOLDED int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot); -int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); int pud_clear_huge(pud_t *pud); +#else +static inline int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) +{ + return 0; +} +static inline int pud_clear_huge(pud_t *pud) +{ + return 0; +} +#endif /* !__PAGETABLE_PUD_FOLDED */ + +#ifndef __PAGETABLE_PMD_FOLDED +int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot); int pmd_clear_huge(pmd_t *pmd); +#else +static inline int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot) +{ + return 0; +} +static inline int pmd_clear_huge(pmd_t *pmd) +{ + return 0; +} +#endif /* !__PAGETABLE_PMD_FOLDED */ + int p4d_free_pud_page(p4d_t *p4d, unsigned long addr); int pud_free_pmd_page(pud_t *pud, unsigned long addr); int pmd_free_pte_page(pmd_t *pmd, unsigned long addr); From f7ee1f13d606c1b1be3bdaf1609f3991bc06da87 Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Wed, 30 Jun 2021 18:48:06 -0700 Subject: [PATCH 019/190] mm/vmalloc: enable mapping of huge pages at pte level in vmap On some architectures like powerpc, there are huge pages that are mapped at pte level. Enable it in vmap. For that, architectures can provide arch_vmap_pte_range_map_size() that returns the size of pages to map at pte level. 
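For illustration only, a hypothetical architecture with a 64K PTE-level huge size could implement the hook roughly as follows (the powerpc 8xx version later in this series has the same shape). The returned size must fit in the remaining range, respect max_page_shift and be aligned both virtually and physically:

	static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr,
			unsigned long end, u64 pfn, unsigned int max_page_shift)
	{
		if (end - addr >= SZ_64K && (1UL << max_page_shift) >= SZ_64K &&
		    IS_ALIGNED(addr, SZ_64K) && IS_ALIGNED(PFN_PHYS(pfn), SZ_64K))
			return SZ_64K;		/* map a 64K page at this PTE */
		return PAGE_SIZE;		/* default behaviour */
	}
	#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size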
Link: https://lkml.kernel.org/r/fb3ccc73377832ac6708181ec419128a2f98ce36.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Uladzislau Rezki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmalloc.h | 8 ++++++++ mm/vmalloc.c | 21 ++++++++++++++++++--- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index bfaaf0b6fa76..54ec0736a656 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -104,6 +104,14 @@ static inline bool arch_vmap_pmd_supported(pgprot_t prot) } #endif +#ifndef arch_vmap_pte_range_map_size +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end, + u64 pfn, unsigned int max_page_shift) +{ + return PAGE_SIZE; +} +#endif + /* * Highlevel APIs for driver use */ diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b2ec7f751bd0..fe0af8db71e0 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include @@ -83,10 +84,11 @@ static void free_work(struct work_struct *w) /*** Page table manipulation functions ***/ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) + unsigned int max_page_shift, pgtbl_mod_mask *mask) { pte_t *pte; u64 pfn; + unsigned long size = PAGE_SIZE; pfn = phys_addr >> PAGE_SHIFT; pte = pte_alloc_kernel_track(pmd, addr, mask); @@ -94,9 +96,22 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return -ENOMEM; do { BUG_ON(!pte_none(*pte)); + +#ifdef CONFIG_HUGETLB_PAGE + size = arch_vmap_pte_range_map_size(addr, end, pfn, max_page_shift); + if (size != PAGE_SIZE) { + pte_t entry = pfn_pte(pfn, prot); + + entry = pte_mkhuge(entry); + entry = arch_make_huge_pte(entry, ilog2(size), 0); + set_huge_pte_at(&init_mm, addr, pte, entry); + pfn += PFN_DOWN(size); + continue; + } +#endif set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); pfn++; - } while (pte++, addr += PAGE_SIZE, addr != end); + } while (pte += PFN_DOWN(size), addr += size, addr != end); *mask |= PGTBL_PTE_MODIFIED; return 0; } @@ -145,7 +160,7 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, continue; } - if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, max_page_shift, mask)) return -ENOMEM; } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); return 0; From 3382bbee0464bf31e63853c6ec2a83ead77a01cc Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Wed, 30 Jun 2021 18:48:09 -0700 Subject: [PATCH 020/190] mm/vmalloc: enable mapping of huge pages at pte level in vmalloc On some architectures like powerpc, there are huge pages that are mapped at pte level. Enable it in vmalloc. For that, architectures can provide arch_vmap_pte_supported_shift() that returns the shift for pages to map at pte level. 
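For illustration only, continuing the hypothetical 64K example from the previous patch, the matching hook reports which PTE-level shift an allocation of a given size can use; the stub added below keeps returning PAGE_SHIFT:

	static inline int arch_vmap_pte_supported_shift(unsigned long size)
	{
		return size >= SZ_64K ? 16 : PAGE_SHIFT;	/* 16 == ilog2(SZ_64K) */
	}
	#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift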
Link: https://lkml.kernel.org/r/2c717e3b1fba1894d890feb7669f83025bfa314d.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Uladzislau Rezki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/vmalloc.h | 7 +++++++ mm/vmalloc.c | 13 +++++++------ 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 54ec0736a656..1dabd6f22486 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -112,6 +112,13 @@ static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, uns } #endif +#ifndef arch_vmap_pte_supported_shift +static inline int arch_vmap_pte_supported_shift(unsigned long size) +{ + return PAGE_SHIFT; +} +#endif + /* * Highlevel APIs for driver use */ diff --git a/mm/vmalloc.c b/mm/vmalloc.c index fe0af8db71e0..71dd29fe618b 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2927,8 +2927,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, return NULL; } - if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP) && - arch_vmap_pmd_supported(prot)) { + if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) { unsigned long size_per_node; /* @@ -2941,11 +2940,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, size_per_node = size; if (node == NUMA_NO_NODE) size_per_node /= num_online_nodes(); - if (size_per_node >= PMD_SIZE) { + if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE) shift = PMD_SHIFT; - align = max(real_align, 1UL << shift); - size = ALIGN(real_size, 1UL << shift); - } + else + shift = arch_vmap_pte_supported_shift(size_per_node); + + align = max(real_align, 1UL << shift); + size = ALIGN(real_size, 1UL << shift); } again: From a6a8f7c4aa7eb50304b5c4e68eccd24313f3a785 Mon Sep 17 00:00:00 2001 From: Christophe Leroy Date: Wed, 30 Jun 2021 18:48:12 -0700 Subject: [PATCH 021/190] powerpc/8xx: add support for huge pages on VMAP and VMALLOC powerpc 8xx has 4 page sizes: - 4k - 16k - 512k - 8M At the time being, vmalloc and vmap only support huge pages which are leaf at PMD level. Here the PMD level is 4M, it doesn't correspond to any supported page size. For now, implement use of 16k and 512k pages which is done at PTE level. Support of 8M pages will be implemented later, it requires vmalloc to support hugepd tables. 
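As a worked illustration of the hooks added below, allocation sizes map to PTE-level shifts on the 8xx as follows:

	/*
	 * arch_vmap_pte_supported_shift(SZ_1M)  -> 19  (512K pages at PTE level)
	 * arch_vmap_pte_supported_shift(SZ_64K) -> 14  (16K pages)
	 * arch_vmap_pte_supported_shift(SZ_8K)  -> PAGE_SHIFT (base page size)
	 */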
Link: https://lkml.kernel.org/r/8b972f1c03fb6bd59953035f0a3e4d26659de4f8.1620795204.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy Cc: Benjamin Herrenschmidt Cc: Michael Ellerman Cc: Mike Kravetz Cc: Mike Rapoport Cc: Nicholas Piggin Cc: Paul Mackerras Cc: Uladzislau Rezki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/nohash/32/mmu-8xx.h | 43 ++++++++++++++++++++ 2 files changed, 44 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 14b132cf95e2..98206f3ebae0 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -187,7 +187,7 @@ config PPC select GENERIC_VDSO_TIME_NS select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP - select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU + select HAVE_ARCH_HUGE_VMAP if PPC_RADIX_MMU || PPC_8xx select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_JUMP_LABEL_RELATIVE select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 diff --git a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h index 6e4faa0a9b35..997cec973406 100644 --- a/arch/powerpc/include/asm/nohash/32/mmu-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/mmu-8xx.h @@ -178,6 +178,7 @@ #ifndef __ASSEMBLY__ #include +#include void mmu_pin_tlb(unsigned long top, bool readonly); @@ -225,6 +226,48 @@ static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize) BUG(); } +static inline bool arch_vmap_try_size(unsigned long addr, unsigned long end, u64 pfn, + unsigned int max_page_shift, unsigned long size) +{ + if (end - addr < size) + return false; + + if ((1UL << max_page_shift) < size) + return false; + + if (!IS_ALIGNED(addr, size)) + return false; + + if (!IS_ALIGNED(PFN_PHYS(pfn), size)) + return false; + + return true; +} + +static inline unsigned long arch_vmap_pte_range_map_size(unsigned long addr, unsigned long end, + u64 pfn, unsigned int max_page_shift) +{ + if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_512K)) + return SZ_512K; + if (PAGE_SIZE == SZ_16K) + return SZ_16K; + if (arch_vmap_try_size(addr, end, pfn, max_page_shift, SZ_16K)) + return SZ_16K; + return PAGE_SIZE; +} +#define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size + +static inline int arch_vmap_pte_supported_shift(unsigned long size) +{ + if (size >= SZ_512K) + return 19; + else if (size >= SZ_16K) + return 14; + else + return PAGE_SHIFT; +} +#define arch_vmap_pte_supported_shift arch_vmap_pte_supported_shift + /* patch sites */ extern s32 patch__itlbmiss_exit_1, patch__dtlbmiss_exit_1; extern s32 patch__itlbmiss_perf, patch__dtlbmiss_perf; From 22f3c951865be13dd32ba042b50bea3f6f93e115 Mon Sep 17 00:00:00 2001 From: Nanyong Sun Date: Wed, 30 Jun 2021 18:48:16 -0700 Subject: [PATCH 022/190] khugepaged: selftests: remove debug_cow The debug_cow attribute had been removed since commit 4958e4d86ecb01 ("mm: thp: remove debug_cow switch"), so remove it in selftest code too, otherwise the khugepaged test will fail. Link: https://lkml.kernel.org/r/20210430051117.400189-1-sunnanyong@huawei.com Fixes: 4958e4d86ecb01 ("mm: thp: remove debug_cow switch") Signed-off-by: Nanyong Sun Cc: Yang Shi Cc: Zi Yan Cc: Kirill A. 
Shutemov Cc: Shuah Khan Cc: Kefeng Wang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/khugepaged.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c index 8b75821302a7..155120b67a16 100644 --- a/tools/testing/selftests/vm/khugepaged.c +++ b/tools/testing/selftests/vm/khugepaged.c @@ -86,7 +86,6 @@ struct settings { enum thp_enabled thp_enabled; enum thp_defrag thp_defrag; enum shmem_enabled shmem_enabled; - bool debug_cow; bool use_zero_page; struct khugepaged_settings khugepaged; }; @@ -95,7 +94,6 @@ static struct settings default_settings = { .thp_enabled = THP_MADVISE, .thp_defrag = THP_DEFRAG_ALWAYS, .shmem_enabled = SHMEM_NEVER, - .debug_cow = 0, .use_zero_page = 0, .khugepaged = { .defrag = 1, @@ -268,7 +266,6 @@ static void write_settings(struct settings *settings) write_string("defrag", thp_defrag_strings[settings->thp_defrag]); write_string("shmem_enabled", shmem_enabled_strings[settings->shmem_enabled]); - write_num("debug_cow", settings->debug_cow); write_num("use_zero_page", settings->use_zero_page); write_num("khugepaged/defrag", khugepaged->defrag); @@ -304,7 +301,6 @@ static void save_settings(void) .thp_defrag = read_string("defrag", thp_defrag_strings), .shmem_enabled = read_string("shmem_enabled", shmem_enabled_strings), - .debug_cow = read_num("debug_cow"), .use_zero_page = read_num("use_zero_page"), }; saved_settings.khugepaged = (struct khugepaged_settings) { From 8cc5fcbb5be814c115085549b700e473685b11e9 Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Wed, 30 Jun 2021 18:48:19 -0700 Subject: [PATCH 023/190] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY On UFFDIO_COPY, if we fail to copy the page contents while holding the hugetlb_fault_mutex, we will drop the mutex and return to the caller after allocating a page that consumed a reservation. In this case there may be a fault that double consumes the reservation. To handle this, we free the allocated page, fix the reservations, and allocate a temporary hugetlb page and return that to the caller. When the caller does the copy outside of the lock, we again check the cache, and allocate a page consuming the reservation, and copy over the contents. Test: Hacked the code locally such that resv_huge_pages underflows produce a warning and the copy_huge_page_from_user() always fails, then: ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success Both tests succeed and produce no warnings. After the test runs number of free/resv hugepages is correct. 
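A rough flow sketch (illustrative only) of the reworked error handling in hugetlb_mcopy_atomic_pte(); the actual hunks follow below:

	/*
	 * 1. page = alloc_huge_page(dst_vma, dst_addr, 0);  // may consume a reservation
	 * 2. copy_huge_page_from_user() fails while holding hugetlb_fault_mutex:
	 *	restore_reserve_on_error(h, dst_vma, dst_addr, page);
	 *	put_page(page);				// undo the reservation use
	 *	*pagep = alloc_huge_page_vma(h, dst_vma, dst_addr); // temporary page
	 *	return -ENOENT;				// caller copies outside the lock
	 * 3. On retry (*pagep != NULL): check the page cache, allocate a page that
	 *    consumes the reservation, then copy_huge_page(page, *pagep).
	 */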
[yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared'] Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com [almasrymina@google.com: fix allocation error check and copy func name] Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com Signed-off-by: Mina Almasry Signed-off-by: YueHaibing Cc: Axel Rasmussen Cc: Peter Xu Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 4 ++++ mm/hugetlb.c | 48 +++++++++++++++++++++++++++++++-------- mm/migrate.c | 2 +- mm/userfaultfd.c | 50 +---------------------------------------- 4 files changed, 45 insertions(+), 59 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 4bb4e519e3f5..7b7b73977278 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -51,6 +51,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page); extern int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, int extra_count); +extern void copy_huge_page(struct page *dst, struct page *src); #else static inline void putback_movable_pages(struct list_head *l) {} @@ -77,6 +78,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping, return -ENOSYS; } +static inline void copy_huge_page(struct page *dst, struct page *src) +{ +} #endif /* CONFIG_MIGRATION */ #ifdef CONFIG_COMPACTION diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 88f2178ad7c9..b14f4d1749b2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include @@ -5076,20 +5077,17 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep) { bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); - struct address_space *mapping; - pgoff_t idx; + struct hstate *h = hstate_vma(dst_vma); + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); unsigned long size; int vm_shared = dst_vma->vm_flags & VM_SHARED; - struct hstate *h = hstate_vma(dst_vma); pte_t _dst_pte; spinlock_t *ptl; - int ret; + int ret = -ENOMEM; struct page *page; int writable; - mapping = dst_vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, dst_vma, dst_addr); - if (is_continue) { ret = -EFAULT; page = find_lock_page(mapping, idx); @@ -5118,12 +5116,44 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, /* fallback to copy_from_user outside mmap_lock */ if (unlikely(ret)) { ret = -ENOENT; + /* Free the allocated page which may have + * consumed a reservation. + */ + restore_reserve_on_error(h, dst_vma, dst_addr, page); + put_page(page); + + /* Allocate a temporary page to hold the copied + * contents. + */ + page = alloc_huge_page_vma(h, dst_vma, dst_addr); + if (!page) { + ret = -ENOMEM; + goto out; + } *pagep = page; - /* don't free the page */ + /* Set the outparam pagep and return to the caller to + * copy the contents outside the lock. Don't free the + * page. 
+ */ goto out; } } else { - page = *pagep; + if (vm_shared && + hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + put_page(*pagep); + ret = -EEXIST; + *pagep = NULL; + goto out; + } + + page = alloc_huge_page(dst_vma, dst_addr, 0); + if (IS_ERR(page)) { + ret = -ENOMEM; + *pagep = NULL; + goto out; + } + copy_huge_page(page, *pagep); + put_page(*pagep); *pagep = NULL; } diff --git a/mm/migrate.c b/mm/migrate.c index 75a15f0a2698..8fc766e52e52 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -553,7 +553,7 @@ static void __copy_gigantic_page(struct page *dst, struct page *src, } } -static void copy_huge_page(struct page *dst, struct page *src) +void copy_huge_page(struct page *dst, struct page *src) { int i; int nr_pages; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 63a73e164d55..da5535d2d990 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -209,7 +209,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, unsigned long len, enum mcopy_atomic_mode mode) { - int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; int vm_shared = dst_vma->vm_flags & VM_SHARED; ssize_t err; pte_t *dst_pte; @@ -308,7 +307,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, mutex_unlock(&hugetlb_fault_mutex_table[hash]); i_mmap_unlock_read(mapping); - vm_alloc_shared = vm_shared; cond_resched(); @@ -346,54 +344,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, out_unlock: mmap_read_unlock(dst_mm); out: - if (page) { - /* - * We encountered an error and are about to free a newly - * allocated huge page. - * - * Reservation handling is very subtle, and is different for - * private and shared mappings. See the routine - * restore_reserve_on_error for details. Unfortunately, we - * can not call restore_reserve_on_error now as it would - * require holding mmap_lock. - * - * If a reservation for the page existed in the reservation - * map of a private mapping, the map was modified to indicate - * the reservation was consumed when the page was allocated. - * We clear the HPageRestoreReserve flag now so that the global - * reserve count will not be incremented in free_huge_page. - * The reservation map will still indicate the reservation - * was consumed and possibly prevent later page allocation. - * This is better than leaking a global reservation. If no - * reservation existed, it is still safe to clear - * HPageRestoreReserve as no adjustments to reservation counts - * were made during allocation. - * - * The reservation map for shared mappings indicates which - * pages have reservations. When a huge page is allocated - * for an address with a reservation, no change is made to - * the reserve map. In this case HPageRestoreReserve will be - * set to indicate that the global reservation count should be - * incremented when the page is freed. This is the desired - * behavior. However, when a huge page is allocated for an - * address without a reservation a reservation entry is added - * to the reservation map, and HPageRestoreReserve will not be - * set. When the page is freed, the global reserve count will - * NOT be incremented and it will appear as though we have - * leaked reserved page. In this case, set HPageRestoreReserve - * so that the global reserve count will be incremented to - * match the reservation map entry which was created. - * - * Note that vm_alloc_shared is based on the flags of the vma - * for which the page was originally allocated. dst_vma could - * be different or NULL on error. 
- */ - if (vm_alloc_shared) - SetHPageRestoreReserve(page); - else - ClearHPageRestoreReserve(page); + if (page) put_page(page); - } BUG_ON(copied < 0); BUG_ON(err > 0); BUG_ON(!copied && !err); From 3bc2b6a725963bb1b441356873da890e397c1a3f Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:48:22 -0700 Subject: [PATCH 024/190] mm: sparsemem: split the huge PMD mapping of vmemmap pages Patch series "Split huge PMD mapping of vmemmap pages", v4. In order to reduce the difficulty of code review in series[1]. We disable huge PMD mapping of vmemmap pages when that feature is enabled. In this series, we do not disable huge PMD mapping of vmemmap pages anymore. We will split huge PMD mapping when needed. When HugeTLB pages are freed from the pool we do not attempt coalasce and move back to a PMD mapping because it is much more complex. [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/ This patch (of 3): In [1], PMD mappings of vmemmap pages were disabled if the the feature hugetlb_free_vmemmap was enabled. This was done to simplify the initial implementation of vmmemap freeing for hugetlb pages. Now, remove this simplification by allowing PMD mapping and switching to PTE mappings as needed for allocated hugetlb pages. When a hugetlb page is allocated, the vmemmap page tables are walked to free vmemmap pages. During this walk, split huge PMD mappings to PTE mappings as required. In the unlikely case PTE pages can not be allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb page. When HugeTLB pages are freed from the pool, we do not attempt to coalesce and move back to a PMD mapping because it is much more complex. [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Mike Kravetz Cc: Oscar Salvador Cc: Michal Hocko Cc: David Hildenbrand Cc: Chen Huang Cc: Jonathan Corbet Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 4 +- mm/hugetlb_vmemmap.c | 5 +- mm/sparse-vmemmap.c | 167 ++++++++++++++++++++++++++++++++----------- 3 files changed, 131 insertions(+), 45 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 706bee98d965..aa875dacd9c3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3076,8 +3076,8 @@ static inline void print_vma_addr(char *prefix, unsigned long rip) } #endif -void vmemmap_remap_free(unsigned long start, unsigned long end, - unsigned long reuse); +int vmemmap_remap_free(unsigned long start, unsigned long end, + unsigned long reuse); int vmemmap_remap_alloc(unsigned long start, unsigned long end, unsigned long reuse, gfp_t gfp_mask); diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index f9f9bb212319..06802056f296 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -258,9 +258,8 @@ void free_huge_page_vmemmap(struct hstate *h, struct page *head) * to the page which @vmemmap_reuse is mapped to, then free the pages * which the range [@vmemmap_addr, @vmemmap_end] is mapped to. 
*/ - vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse); - - SetHPageVmemmapOptimized(head); + if (!vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse)) + SetHPageVmemmapOptimized(head); } void __init hugetlb_vmemmap_init(struct hstate *h) diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index a3aa275e2668..bdce883f9286 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -38,6 +38,7 @@ * struct vmemmap_remap_walk - walk vmemmap page table * * @remap_pte: called for each lowest-level entry (PTE). + * @nr_walked: the number of walked pte. * @reuse_page: the page which is reused for the tail vmemmap pages. * @reuse_addr: the virtual address of the @reuse_page page. * @vmemmap_pages: the list head of the vmemmap pages that can be freed @@ -46,11 +47,44 @@ struct vmemmap_remap_walk { void (*remap_pte)(pte_t *pte, unsigned long addr, struct vmemmap_remap_walk *walk); + unsigned long nr_walked; struct page *reuse_page; unsigned long reuse_addr; struct list_head *vmemmap_pages; }; +static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start, + struct vmemmap_remap_walk *walk) +{ + pmd_t __pmd; + int i; + unsigned long addr = start; + struct page *page = pmd_page(*pmd); + pte_t *pgtable = pte_alloc_one_kernel(&init_mm); + + if (!pgtable) + return -ENOMEM; + + pmd_populate_kernel(&init_mm, &__pmd, pgtable); + + for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) { + pte_t entry, *pte; + pgprot_t pgprot = PAGE_KERNEL; + + entry = mk_pte(page + i, pgprot); + pte = pte_offset_kernel(&__pmd, addr); + set_pte_at(&init_mm, addr, pte, entry); + } + + /* Make pte visible before pmd. See comment in __pte_alloc(). */ + smp_wmb(); + pmd_populate_kernel(&init_mm, pmd, pgtable); + + flush_tlb_kernel_range(start, start + PMD_SIZE); + + return 0; +} + static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct vmemmap_remap_walk *walk) @@ -69,58 +103,80 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr, */ addr += PAGE_SIZE; pte++; + walk->nr_walked++; } - for (; addr != end; addr += PAGE_SIZE, pte++) + for (; addr != end; addr += PAGE_SIZE, pte++) { walk->remap_pte(pte, addr, walk); + walk->nr_walked++; + } } -static void vmemmap_pmd_range(pud_t *pud, unsigned long addr, - unsigned long end, - struct vmemmap_remap_walk *walk) +static int vmemmap_pmd_range(pud_t *pud, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) { pmd_t *pmd; unsigned long next; pmd = pmd_offset(pud, addr); do { - BUG_ON(pmd_leaf(*pmd)); + if (pmd_leaf(*pmd)) { + int ret; + ret = split_vmemmap_huge_pmd(pmd, addr & PMD_MASK, walk); + if (ret) + return ret; + } next = pmd_addr_end(addr, end); vmemmap_pte_range(pmd, addr, next, walk); } while (pmd++, addr = next, addr != end); + + return 0; } -static void vmemmap_pud_range(p4d_t *p4d, unsigned long addr, - unsigned long end, - struct vmemmap_remap_walk *walk) +static int vmemmap_pud_range(p4d_t *p4d, unsigned long addr, + unsigned long end, + struct vmemmap_remap_walk *walk) { pud_t *pud; unsigned long next; pud = pud_offset(p4d, addr); do { + int ret; + next = pud_addr_end(addr, end); - vmemmap_pmd_range(pud, addr, next, walk); + ret = vmemmap_pmd_range(pud, addr, next, walk); + if (ret) + return ret; } while (pud++, addr = next, addr != end); + + return 0; } -static void vmemmap_p4d_range(pgd_t *pgd, unsigned long addr, - unsigned long end, - struct vmemmap_remap_walk *walk) +static int vmemmap_p4d_range(pgd_t *pgd, unsigned long addr, + unsigned long end, + struct 
vmemmap_remap_walk *walk) { p4d_t *p4d; unsigned long next; p4d = p4d_offset(pgd, addr); do { + int ret; + next = p4d_addr_end(addr, end); - vmemmap_pud_range(p4d, addr, next, walk); + ret = vmemmap_pud_range(p4d, addr, next, walk); + if (ret) + return ret; } while (p4d++, addr = next, addr != end); + + return 0; } -static void vmemmap_remap_range(unsigned long start, unsigned long end, - struct vmemmap_remap_walk *walk) +static int vmemmap_remap_range(unsigned long start, unsigned long end, + struct vmemmap_remap_walk *walk) { unsigned long addr = start; unsigned long next; @@ -131,8 +187,12 @@ static void vmemmap_remap_range(unsigned long start, unsigned long end, pgd = pgd_offset_k(addr); do { + int ret; + next = pgd_addr_end(addr, end); - vmemmap_p4d_range(pgd, addr, next, walk); + ret = vmemmap_p4d_range(pgd, addr, next, walk); + if (ret) + return ret; } while (pgd++, addr = next, addr != end); /* @@ -141,6 +201,8 @@ static void vmemmap_remap_range(unsigned long start, unsigned long end, * belongs to the range. */ flush_tlb_kernel_range(start + PAGE_SIZE, end); + + return 0; } /* @@ -179,10 +241,27 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, pte_t entry = mk_pte(walk->reuse_page, pgprot); struct page *page = pte_page(*pte); - list_add(&page->lru, walk->vmemmap_pages); + list_add_tail(&page->lru, walk->vmemmap_pages); set_pte_at(&init_mm, addr, pte, entry); } +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, + struct vmemmap_remap_walk *walk) +{ + pgprot_t pgprot = PAGE_KERNEL; + struct page *page; + void *to; + + BUG_ON(pte_page(*pte) != walk->reuse_page); + + page = list_first_entry(walk->vmemmap_pages, struct page, lru); + list_del(&page->lru); + to = page_to_virt(page); + copy_page(to, (void *)walk->reuse_addr); + + set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); +} + /** * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end) * to the page which @reuse is mapped to, then free vmemmap @@ -193,12 +272,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, * remap. * @reuse: reuse address. * - * Note: This function depends on vmemmap being base page mapped. Please make - * sure that we disable PMD mapping of vmemmap pages when calling this function. + * Return: %0 on success, negative error code otherwise. */ -void vmemmap_remap_free(unsigned long start, unsigned long end, - unsigned long reuse) +int vmemmap_remap_free(unsigned long start, unsigned long end, + unsigned long reuse) { + int ret; LIST_HEAD(vmemmap_pages); struct vmemmap_remap_walk walk = { .remap_pte = vmemmap_remap_pte, @@ -221,25 +300,31 @@ void vmemmap_remap_free(unsigned long start, unsigned long end, */ BUG_ON(start - reuse != PAGE_SIZE); - vmemmap_remap_range(reuse, end, &walk); + mmap_write_lock(&init_mm); + ret = vmemmap_remap_range(reuse, end, &walk); + mmap_write_downgrade(&init_mm); + + if (ret && walk.nr_walked) { + end = reuse + walk.nr_walked * PAGE_SIZE; + /* + * vmemmap_pages contains pages from the previous + * vmemmap_remap_range call which failed. These + * are pages which were removed from the vmemmap. + * They will be restored in the following call. 
+ */ + walk = (struct vmemmap_remap_walk) { + .remap_pte = vmemmap_restore_pte, + .reuse_addr = reuse, + .vmemmap_pages = &vmemmap_pages, + }; + + vmemmap_remap_range(reuse, end, &walk); + } + mmap_read_unlock(&init_mm); + free_vmemmap_page_list(&vmemmap_pages); -} -static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, - struct vmemmap_remap_walk *walk) -{ - pgprot_t pgprot = PAGE_KERNEL; - struct page *page; - void *to; - - BUG_ON(pte_page(*pte) != walk->reuse_page); - - page = list_first_entry(walk->vmemmap_pages, struct page, lru); - list_del(&page->lru); - to = page_to_virt(page); - copy_page(to, (void *)walk->reuse_addr); - - set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)); + return ret; } static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, @@ -273,6 +358,8 @@ static int alloc_vmemmap_page_list(unsigned long start, unsigned long end, * remap. * @reuse: reuse address. * @gfp_mask: GFP flag for allocating vmemmap pages. + * + * Return: %0 on success, negative error code otherwise. */ int vmemmap_remap_alloc(unsigned long start, unsigned long end, unsigned long reuse, gfp_t gfp_mask) @@ -287,12 +374,12 @@ int vmemmap_remap_alloc(unsigned long start, unsigned long end, /* See the comment in the vmemmap_remap_free(). */ BUG_ON(start - reuse != PAGE_SIZE); - might_sleep_if(gfpflags_allow_blocking(gfp_mask)); - if (alloc_vmemmap_page_list(start, end, gfp_mask, &vmemmap_pages)) return -ENOMEM; + mmap_read_lock(&init_mm); vmemmap_remap_range(reuse, end, &walk); + mmap_read_unlock(&init_mm); return 0; } From 2d7a21715f25122779e2bed17db8c57aa01e922f Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:48:25 -0700 Subject: [PATCH 025/190] mm: sparsemem: use huge PMD mapping for vmemmap pages The preparation of splitting huge PMD mapping of vmemmap pages is ready, so switch the mapping from PTE to PMD. Link: https://lkml.kernel.org/r/20210616094915.34432-3-songmuchun@bytedance.com Signed-off-by: Muchun Song Reviewed-by: Mike Kravetz Cc: Chen Huang Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Michal Hocko Cc: Oscar Salvador Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- .../admin-guide/kernel-parameters.txt | 7 ------ arch/x86/mm/init_64.c | 8 ++---- include/linux/hugetlb.h | 25 +++++-------------- mm/memory_hotplug.c | 2 +- 4 files changed, 9 insertions(+), 33 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 597572ceb8ba..7a7da02b3bc6 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1572,13 +1572,6 @@ enabled. Allows heavy hugetlb users to free up some more memory (6 * PAGE_SIZE for each 2MB hugetlb page). - This feauture is not free though. Large page - tables are not used to back vmemmap pages which - can lead to a performance degradation for some - workloads. Also there will be memory allocation - required when hugetlb pages are freed from the - pool which can lead to corner cases under heavy - memory pressure. 
Format: { on | off (default) } on: enable the feature diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 9d9d18d0c2a1..65ea58527176 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -34,7 +34,6 @@ #include #include #include -#include #include #include @@ -1610,8 +1609,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); - if ((is_hugetlb_free_vmemmap_enabled() && !altmap) || - end - start < PAGES_PER_SECTION * sizeof(struct page)) + if (end - start < PAGES_PER_SECTION * sizeof(struct page)) err = vmemmap_populate_basepages(start, end, node, NULL); else if (boot_cpu_has(X86_FEATURE_PSE)) err = vmemmap_populate_hugepages(start, end, node, altmap); @@ -1639,8 +1637,6 @@ void register_page_bootmem_memmap(unsigned long section_nr, pmd_t *pmd; unsigned int nr_pmd_pages; struct page *page; - bool base_mapping = !boot_cpu_has(X86_FEATURE_PSE) || - is_hugetlb_free_vmemmap_enabled(); for (; addr < end; addr = next) { pte_t *pte = NULL; @@ -1666,7 +1662,7 @@ void register_page_bootmem_memmap(unsigned long section_nr, } get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO); - if (base_mapping) { + if (!boot_cpu_has(X86_FEATURE_PSE)) { next = (addr + PAGE_SIZE) & PAGE_MASK; pmd = pmd_offset(pud, addr); if (pmd_none(*pmd)) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index cfde3bec2261..f11ba701e199 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -895,20 +895,6 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, } #endif -#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP -extern bool hugetlb_free_vmemmap_enabled; - -static inline bool is_hugetlb_free_vmemmap_enabled(void) -{ - return hugetlb_free_vmemmap_enabled; -} -#else -static inline bool is_hugetlb_free_vmemmap_enabled(void) -{ - return false; -} -#endif - #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; @@ -1063,13 +1049,14 @@ static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr pte_t *ptep, pte_t pte, unsigned long sz) { } - -static inline bool is_hugetlb_free_vmemmap_enabled(void) -{ - return false; -} #endif /* CONFIG_HUGETLB_PAGE */ +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP +extern bool hugetlb_free_vmemmap_enabled; +#else +#define hugetlb_free_vmemmap_enabled false +#endif + static inline spinlock_t *huge_pte_lock(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 29d1d742b565..492560b61999 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1056,7 +1056,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size) * populate a single PMD. */ return memmap_on_memory && - !is_hugetlb_free_vmemmap_enabled() && + !hugetlb_free_vmemmap_enabled && IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) && size == memory_block_size_bytes() && IS_ALIGNED(vmemmap_size, PMD_SIZE) && From e6d41f12df0efcaa6e30b575d40f2529024cfce9 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:48:28 -0700 Subject: [PATCH 026/190] mm: hugetlb: introduce CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap pages associated with each HugeTLB page is default off. Now the vmemmap is PMD mapped. So there is no side effect when this feature is enabled with no HugeTLB pages in the system. Someone may want to enable this feature in the compiler time instead of using boot command line. 
So add a config to make it default on when someone do not want to enable it via command line. Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Cc: Chen Huang Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Oscar Salvador Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/kernel-parameters.txt | 3 +++ fs/Kconfig | 10 ++++++++++ mm/hugetlb_vmemmap.c | 6 ++++-- 3 files changed, 17 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7a7da02b3bc6..c4e34420a485 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1577,6 +1577,9 @@ on: enable the feature off: disable the feature + Built with CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON=y, + the default is on. + This is not compatible with memory_hotplug.memmap_on_memory. If both parameters are enabled, hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. diff --git a/fs/Kconfig b/fs/Kconfig index 58a53455d1fe..a7749c126b8e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -245,6 +245,16 @@ config HUGETLB_PAGE_FREE_VMEMMAP depends on X86_64 depends on SPARSEMEM_VMEMMAP +config HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON + bool "Default freeing vmemmap pages of HugeTLB to on" + default n + depends on HUGETLB_PAGE_FREE_VMEMMAP + help + When using HUGETLB_PAGE_FREE_VMEMMAP, the freeing unused vmemmap + pages associated with each HugeTLB page is default off. Say Y here + to enable freeing vmemmap pages of HugeTLB by default. It can then + be disabled on the command line via hugetlb_free_vmemmap=off. + config MEMFD_CREATE def_bool TMPFS || HUGETLBFS diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 06802056f296..c540c21e26f5 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -182,7 +182,7 @@ #define RESERVE_VMEMMAP_NR 2U #define RESERVE_VMEMMAP_SIZE (RESERVE_VMEMMAP_NR << PAGE_SHIFT) -bool hugetlb_free_vmemmap_enabled; +bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON); static int __init early_hugetlb_free_vmemmap_param(char *buf) { @@ -197,7 +197,9 @@ static int __init early_hugetlb_free_vmemmap_param(char *buf) if (!strcmp(buf, "on")) hugetlb_free_vmemmap_enabled = true; - else if (strcmp(buf, "off")) + else if (!strcmp(buf, "off")) + hugetlb_free_vmemmap_enabled = false; + else return -EINVAL; return 0; From 48b8d744ea841b8adf8d07bfe7a2d55f22e4d179 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Wed, 30 Jun 2021 18:48:31 -0700 Subject: [PATCH 027/190] hugetlb: remove prep_compound_huge_page cleanup Patch series "Fix prep_compound_gigantic_page ref count adjustment". These patches address the possible race between prep_compound_gigantic_page and __page_cache_add_speculative as described by Jann Horn in [1]. The first patch simply removes the unnecessary/obsolete helper routine prep_compound_huge_page to make the actual fix a little simpler. The second patch is the actual fix and has a detailed explanation in the commit message. This potential issue has existed for almost 10 years and I am unaware of anyone actually hitting the race. I did not cc stable, but would be happy to squash the patches and send to stable if anyone thinks that is a good idea. 
[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ This patch (of 2): I could not think of a reliable way to recreate the issue for testing. Rather, I 'simulated errors' to exercise all the error paths. The routine prep_compound_huge_page is a simple wrapper to call either prep_compound_gigantic_page or prep_compound_page. However, it is only called from gather_bootmem_prealloc which only processes gigantic pages. Eliminate the routine and call prep_compound_gigantic_page directly. Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz Cc: Andrea Arcangeli Cc: Jan Kara Cc: Jann Horn Cc: John Hubbard Cc: "Kirill A . Shutemov" Cc: Matthew Wilcox Cc: Michal Hocko Cc: Youquan Song Cc: Muchun Song Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 29 ++++++++++------------------- 1 file changed, 10 insertions(+), 19 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b14f4d1749b2..8048763e98a7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1320,8 +1320,6 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask, return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask); } -static void prep_new_huge_page(struct hstate *h, struct page *page, int nid); -static void prep_compound_gigantic_page(struct page *page, unsigned int order); #else /* !CONFIG_CONTIG_ALLOC */ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nodemask) @@ -2759,16 +2757,10 @@ int __alloc_bootmem_huge_page(struct hstate *h) return 1; } -static void __init prep_compound_huge_page(struct page *page, - unsigned int order) -{ - if (unlikely(order > (MAX_ORDER - 1))) - prep_compound_gigantic_page(page, order); - else - prep_compound_page(page, order); -} - -/* Put bootmem huge pages into the standard lists after mem_map is up */ +/* + * Put bootmem huge pages into the standard lists after mem_map is up. + * Note: This only applies to gigantic (order > MAX_ORDER) pages. + */ static void __init gather_bootmem_prealloc(void) { struct huge_bootmem_page *m; @@ -2777,20 +2769,19 @@ static void __init gather_bootmem_prealloc(void) struct page *page = virt_to_page(m); struct hstate *h = m->hstate; + VM_BUG_ON(!hstate_is_gigantic(h)); WARN_ON(page_count(page) != 1); - prep_compound_huge_page(page, huge_page_order(h)); + prep_compound_gigantic_page(page, huge_page_order(h)); WARN_ON(PageReserved(page)); prep_new_huge_page(h, page, page_to_nid(page)); put_page(page); /* free it into the hugepage allocator */ /* - * If we had gigantic hugepages allocated at boot time, we need - * to restore the 'stolen' pages to totalram_pages in order to - * fix confusing memory reports from free(1) and another - * side-effects, like CommitLimit going negative. + * We need to restore the 'stolen' pages to totalram_pages + * in order to fix confusing memory reports from free(1) and + * other side-effects, like CommitLimit going negative. 
*/ - if (hstate_is_gigantic(h)) - adjust_managed_page_count(page, pages_per_huge_page(h)); + adjust_managed_page_count(page, pages_per_huge_page(h)); cond_resched(); } } From 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Wed, 30 Jun 2021 18:48:34 -0700 Subject: [PATCH 028/190] hugetlb: address ref count racing in prep_compound_gigantic_page In [1], Jann Horn points out a possible race between prep_compound_gigantic_page and __page_cache_add_speculative. The root cause of the possible race is prep_compound_gigantic_page uncondittionally setting the ref count of pages to zero. It does this because prep_compound_gigantic_page is handed a 'group' of pages from an allocator and needs to convert that group of pages to a compound page. The ref count of each page in this 'group' is one as set by the allocator. However, the ref count of compound page tail pages must be zero. The potential race comes about when ref counted pages are returned from the allocator. When this happens, other mm code could also take a reference on the page. __page_cache_add_speculative is one such example. Therefore, prep_compound_gigantic_page can not just set the ref count of pages to zero as it does today. Doing so would lose the reference taken by any other code. This would lead to BUGs in code checking ref counts and could possibly even lead to memory corruption. There are two possible ways to address this issue. 1) Make all allocators of gigantic groups of pages be able to return a properly constructed compound page. 2) Make prep_compound_gigantic_page be more careful when constructing a compound page. This patch takes approach 2. In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero if it is one. If the cmpxchg fails, call synchronize_rcu() in the hope that the extra ref count will be driopped during a rcu grace period. This is not a performance critical code path and the wait should be accceptable. If the ref count is still inflated after the grace period, then undo any modifications made and return an error. Currently prep_compound_gigantic_page is type void and does not return errors. Modify the two callers to check for and handle error returns. On error, the caller must free the 'group' of pages as they can not be used to form a gigantic page. After freeing pages, the runtime caller (alloc_fresh_huge_page) will retry the allocation once. Boot time allocations can not be retried. The routine prep_compound_page also unconditionally sets the ref count of compound page tail pages to zero. However, in this case the buddy allocator is constructing a compound page from freshly allocated pages. The ref count on those freshly allocated pages is already zero, so the set_page_count(p, 0) is unnecessary and could lead to confusion. Just remove it. [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero") Signed-off-by: Mike Kravetz Reported-by: Jann Horn Cc: Youquan Song Cc: Andrea Arcangeli Cc: Jan Kara Cc: John Hubbard Cc: "Kirill A . 
Shutemov" Cc: Matthew Wilcox Cc: Michal Hocko Cc: Muchun Song Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 72 +++++++++++++++++++++++++++++++++++++++++++------ mm/page_alloc.c | 1 - 2 files changed, 64 insertions(+), 9 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8048763e98a7..89ba5147206e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1623,9 +1623,9 @@ static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) spin_unlock_irq(&hugetlb_lock); } -static void prep_compound_gigantic_page(struct page *page, unsigned int order) +static bool prep_compound_gigantic_page(struct page *page, unsigned int order) { - int i; + int i, j; int nr_pages = 1 << order; struct page *p = page + 1; @@ -1647,11 +1647,48 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order) * after get_user_pages(). */ __ClearPageReserved(p); + /* + * Subtle and very unlikely + * + * Gigantic 'page allocators' such as memblock or cma will + * return a set of pages with each page ref counted. We need + * to turn this set of pages into a compound page with tail + * page ref counts set to zero. Code such as speculative page + * cache adding could take a ref on a 'to be' tail page. + * We need to respect any increased ref count, and only set + * the ref count to zero if count is currently 1. If count + * is not 1, we call synchronize_rcu in the hope that a rcu + * grace period will cause ref count to drop and then retry. + * If count is still inflated on retry we return an error and + * must discard the pages. + */ + if (!page_ref_freeze(p, 1)) { + pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n"); + synchronize_rcu(); + if (!page_ref_freeze(p, 1)) + goto out_error; + } set_page_count(p, 0); set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); atomic_set(compound_pincount_ptr(page), 0); + return true; + +out_error: + /* undo tail page modifications made above */ + p = page + 1; + for (j = 1; j < i; j++, p = mem_map_next(p, page, j)) { + clear_compound_head(p); + set_page_refcounted(p); + } + /* need to clear PG_reserved on remaining tail pages */ + for (; j < nr_pages; j++, p = mem_map_next(p, page, j)) + __ClearPageReserved(p); + set_compound_order(page, 0); + page[1].compound_nr = 0; + __ClearPageHead(page); + return false; } /* @@ -1771,7 +1808,9 @@ static struct page *alloc_fresh_huge_page(struct hstate *h, nodemask_t *node_alloc_noretry) { struct page *page; + bool retry = false; +retry: if (hstate_is_gigantic(h)) page = alloc_gigantic_page(h, gfp_mask, nid, nmask); else @@ -1780,8 +1819,21 @@ static struct page *alloc_fresh_huge_page(struct hstate *h, if (!page) return NULL; - if (hstate_is_gigantic(h)) - prep_compound_gigantic_page(page, huge_page_order(h)); + if (hstate_is_gigantic(h)) { + if (!prep_compound_gigantic_page(page, huge_page_order(h))) { + /* + * Rare failure to convert pages to compound page. + * Free pages and try again - ONCE! 
+ */ + free_gigantic_page(page, huge_page_order(h)); + if (!retry) { + retry = true; + goto retry; + } + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); + return NULL; + } + } prep_new_huge_page(h, page, page_to_nid(page)); return page; @@ -2771,10 +2823,14 @@ static void __init gather_bootmem_prealloc(void) VM_BUG_ON(!hstate_is_gigantic(h)); WARN_ON(page_count(page) != 1); - prep_compound_gigantic_page(page, huge_page_order(h)); - WARN_ON(PageReserved(page)); - prep_new_huge_page(h, page, page_to_nid(page)); - put_page(page); /* free it into the hugepage allocator */ + if (prep_compound_gigantic_page(page, huge_page_order(h))) { + WARN_ON(PageReserved(page)); + prep_new_huge_page(h, page, page_to_nid(page)); + put_page(page); /* add to the hugepage allocator */ + } else { + free_gigantic_page(page, huge_page_order(h)); + pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n"); + } /* * We need to restore the 'stolen' pages to totalram_pages diff --git a/mm/page_alloc.c b/mm/page_alloc.c index db00ee8d79d2..eeff64843718 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -754,7 +754,6 @@ void prep_compound_page(struct page *page, unsigned int order) __SetPageHead(page); for (i = 1; i < nr_pages; i++) { struct page *p = page + i; - set_page_count(p, 0); p->mapping = TAIL_MAPPING; set_compound_head(p, page); } From 510d25c92ec4ace4199a94f2f0cc9b8208c0de57 Mon Sep 17 00:00:00 2001 From: Naoya Horiguchi Date: Wed, 30 Jun 2021 18:48:38 -0700 Subject: [PATCH 029/190] mm/hwpoison: disable pcp for page_handle_poison() Recent changes by patch "mm/page_alloc: allow high-order pages to be stored on the per-cpu lists" makes kernels determine whether to use pcp by pcp_allowed_order(), which breaks soft-offline for hugetlb pages. Soft-offline dissolves a migration source page, then removes it from buddy free list, so it's assumed that any subpage of the soft-offlined hugepage are recognized as a buddy page just after returning from dissolve_free_huge_page(). pcp_allowed_order() returns true for hugetlb, so this assumption is no longer true. So disable pcp during dissolve_free_huge_page() and take_page_off_buddy() to prevent soft-offlined hugepages from linking to pcp lists. Soft-offline should not be common events so the impact on performance should be minimal. And I think that the optimization of Mel's patch could benefit to hugetlb so zone_pcp_disable() is called only in hwpoison context. 
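To make the shape of the fix easier to see before the (rather dense) diff below, here is a minimal sketch of the helper this patch adds to mm/memory-failure.c. It is illustrative only: the return-value handling is simplified relative to the real hunk, and it assumes the declarations already visible in that file (zone_pcp_disable()/zone_pcp_enable(), dissolve_free_huge_page(), take_page_off_buddy(), page_zone()).

/*
 * Illustrative sketch (simplified): keep the zone's per-cpu lists
 * drained and disabled while the hugepage is dissolved and the raw
 * page is pulled off the buddy free list, so none of the subpages
 * can be diverted onto a pcp list in between.
 */
static bool __page_handle_poison(struct page *page)
{
	bool ret;

	zone_pcp_disable(page_zone(page));
	ret = !dissolve_free_huge_page(page);	/* 0 means the hugepage was dissolved */
	if (ret)
		ret = take_page_off_buddy(page);	/* true once the page is off the free list */
	zone_pcp_enable(page_zone(page));

	return ret;
}

The pcp disable/enable pair around the existing dissolve/take-off sequence is the only behavioural change; callers then test this single helper instead of open-coding the two calls.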
Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com Signed-off-by: Naoya Horiguchi Acked-by: Mel Gorman Cc: Mike Kravetz Cc: David Hildenbrand Cc: Oscar Salvador Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index e5a1531f7f4e..9d2d31ffe8a4 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -66,6 +66,19 @@ int sysctl_memory_failure_recovery __read_mostly = 1; atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0); +static bool __page_handle_poison(struct page *page) +{ + bool ret; + + zone_pcp_disable(page_zone(page)); + ret = dissolve_free_huge_page(page); + if (!ret) + ret = take_page_off_buddy(page); + zone_pcp_enable(page_zone(page)); + + return ret; +} + static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release) { if (hugepage_or_freepage) { @@ -73,7 +86,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo * Doing this check for free pages is also fine since dissolve_free_huge_page * returns 0 for non-hugetlb pages as well. */ - if (dissolve_free_huge_page(page) || !take_page_off_buddy(page)) + if (!__page_handle_poison(page)) /* * We could fail to take off the target page from buddy * for example due to racy page allocation, but that's @@ -985,7 +998,7 @@ static int me_huge_page(struct page *p, unsigned long pfn) */ if (PageAnon(hpage)) put_page(hpage); - if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + if (__page_handle_poison(p)) { page_ref_inc(p); res = MF_RECOVERED; } @@ -1446,7 +1459,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags) } unlock_page(head); res = MF_FAILED; - if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) { + if (__page_handle_poison(p)) { page_ref_inc(p); res = MF_RECOVERED; } From d2c6c06fff5098850b2b3b360758c9cc6102053f Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:41 -0700 Subject: [PATCH 030/190] userfaultfd/selftests: use user mode only Patch series "userfaultfd/selftests: A few cleanups", v2. I wanted to cleanup userfaultfd.c fault handling for a long time. If it's not cleaned, when the new code grows the file it'll also grow the size that needs to be cleaned... This is my attempt to cleanup the userfaultfd selftest on fault handling, to use an err() macro instead of either fprintf() or perror() then another exit() call. The huge cleanup is done in the last patch. The first 4 patches are some other standalone cleanups for the same file, so I put them together. This patch (of 5): Userfaultfd selftest does not need to handle kernel initiated fault. Set user mode so it can be run even if unprivileged_userfaultfd=0 (which is the default). Link: https://lkml.kernel.org/r/20210412232753.1012412-2-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Axel Rasmussen Cc: Andrea Arcangeli Cc: Mike Rapoport Cc: Alexander Viro Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index f5ab5e0312e7..ce23db8eec26 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -831,7 +831,7 @@ static int userfaultfd_open_ext(uint64_t *features) { struct uffdio_api uffdio_api; - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK); + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "userfaultfd syscall not available in this kernel\n"); From ba4f8c355ef96ed521788d6707344f350bf78078 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:44 -0700 Subject: [PATCH 031/190] userfaultfd/selftests: remove the time() check on delayed uffd There seems to have no guarantee that time() will return the same for the two calls even if there's no delay, e.g. when a fault is accidentally crossing the changing of a second. Meanwhile, this message is also not helping that much since delay could happen with a lot of reasons, e.g., schedule latency of resolving thread. It may not mean an issue with uffd. Neither do I saw this error triggered either in the past runs. Even if it triggers, it'll be drown in all the rest of test logs. Remove it. Link: https://lkml.kernel.org/r/20210412232753.1012412-3-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Axel Rasmussen Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 8 -------- 1 file changed, 8 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index ce23db8eec26..5cae66e27171 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -395,7 +395,6 @@ static void *locking_thread(void *arg) unsigned long long count; char randstate[64]; unsigned int seed; - time_t start; if (bounces & BOUNCE_RANDOM) { seed = (unsigned int) time(NULL) - bounces; @@ -432,7 +431,6 @@ static void *locking_thread(void *arg) page_nr += 1; page_nr %= nr_pages; - start = time(NULL); if (bounces & BOUNCE_VERIFY) { count = *area_count(area_dst, page_nr); if (!count) { @@ -495,12 +493,6 @@ static void *locking_thread(void *arg) count++; *area_count(area_dst, page_nr) = count_verify[page_nr] = count; pthread_mutex_unlock(area_mutex(area_dst, page_nr)); - - if (time(NULL) - start > 1) - fprintf(stderr, - "userfault too slow %ld " - "possible false positive with overcommit\n", - time(NULL) - start); } return NULL; From 4e08e18a785f9e901ca64062b9227c68d1b40ea3 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:48 -0700 Subject: [PATCH 032/190] userfaultfd/selftests: dropping VERIFY check in locking_thread It tries to check against all zeros and looped for quite a few times. However after that we'll verify the same page with count_verify, while count_verify can never be zero. 
So it means if it's a zero page we'll detect it anyways with below code. There's yet another place we conditionally check the fault flag - just do it unconditionally. Link: https://lkml.kernel.org/r/20210412232753.1012412-4-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Axel Rasmussen Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 55 +----------------------- 1 file changed, 1 insertion(+), 54 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 5cae66e27171..387b9360ae64 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -430,58 +430,6 @@ static void *locking_thread(void *arg) } else page_nr += 1; page_nr %= nr_pages; - - if (bounces & BOUNCE_VERIFY) { - count = *area_count(area_dst, page_nr); - if (!count) { - fprintf(stderr, - "page_nr %lu wrong count %Lu %Lu\n", - page_nr, count, - count_verify[page_nr]); - exit(1); - } - - - /* - * We can't use bcmp (or memcmp) because that - * returns 0 erroneously if the memory is - * changing under it (even if the end of the - * page is never changing and always - * different). - */ -#if 1 - if (!my_bcmp(area_dst + page_nr * page_size, zeropage, - page_size)) { - fprintf(stderr, - "my_bcmp page_nr %lu wrong count %Lu %Lu\n", - page_nr, count, count_verify[page_nr]); - exit(1); - } -#else - unsigned long loops; - - loops = 0; - /* uncomment the below line to test with mutex */ - /* pthread_mutex_lock(area_mutex(area_dst, page_nr)); */ - while (!bcmp(area_dst + page_nr * page_size, zeropage, - page_size)) { - loops += 1; - if (loops > 10) - break; - } - /* uncomment below line to test with mutex */ - /* pthread_mutex_unlock(area_mutex(area_dst, page_nr)); */ - if (loops) { - fprintf(stderr, - "page_nr %lu all zero thread %lu %p %lu\n", - page_nr, cpu, area_dst + page_nr * page_size, - loops); - if (loops > 10) - exit(1); - } -#endif - } - pthread_mutex_lock(area_mutex(area_dst, page_nr)); count = *area_count(area_dst, page_nr); if (count != count_verify[page_nr]) { @@ -613,8 +561,7 @@ static void uffd_handle_page_fault(struct uffd_msg *msg, stats->minor_faults++; } else { /* Missing page faults */ - if (bounces & BOUNCE_VERIFY && - msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) { + if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) { fprintf(stderr, "unexpected write fault\n"); exit(1); } From de3ca8e4a56dda0f0dfb05d4fddab985cde5159a Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:52 -0700 Subject: [PATCH 033/190] userfaultfd/selftests: only dump counts if mode enabled WP and MINOR modes are conditionally enabled on specific memory types. This patch avoids dumping tons of zeros for those cases when the modes are not supported at all. Link: https://lkml.kernel.org/r/20210412232753.1012412-5-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Axel Rasmussen Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 30 ++++++++++++++++-------- 1 file changed, 20 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 387b9360ae64..da2374bda5a3 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -171,16 +171,26 @@ static void uffd_stats_report(struct uffd_stats *stats, int n_cpus) minor_total += stats[i].minor_faults; } - printf("userfaults: %llu missing (", miss_total); - for (i = 0; i < n_cpus; i++) - printf("%lu+", stats[i].missing_faults); - printf("\b), %llu wp (", wp_total); - for (i = 0; i < n_cpus; i++) - printf("%lu+", stats[i].wp_faults); - printf("\b), %llu minor (", minor_total); - for (i = 0; i < n_cpus; i++) - printf("%lu+", stats[i].minor_faults); - printf("\b)\n"); + printf("userfaults: "); + if (miss_total) { + printf("%llu missing (", miss_total); + for (i = 0; i < n_cpus; i++) + printf("%lu+", stats[i].missing_faults); + printf("\b) "); + } + if (wp_total) { + printf("%llu wp (", wp_total); + for (i = 0; i < n_cpus; i++) + printf("%lu+", stats[i].wp_faults); + printf("\b) "); + } + if (minor_total) { + printf("%llu minor (", minor_total); + for (i = 0; i < n_cpus; i++) + printf("%lu+", stats[i].minor_faults); + printf("\b)"); + } + printf("\n"); } static int anon_release_pages(char *rel_area) From 42e584eede17b21b03896961e0df45ece4d01e79 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:55 -0700 Subject: [PATCH 034/190] userfaultfd/selftests: unify error handling Introduce err()/_err() and replace all the different ways to fail the program, mostly "fprintf" and "perror" with tons of exit() calls. Always stop the test program at any failure. Link: https://lkml.kernel.org/r/20210412232753.1012412-6-peterx@redhat.com Signed-off-by: Peter Xu Reviewed-by: Axel Rasmussen Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 556 ++++++++--------------- 1 file changed, 187 insertions(+), 369 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index da2374bda5a3..6339aeaeeff8 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -140,11 +140,18 @@ static void usage(void) exit(1); } -#define uffd_error(code, fmt, ...) \ - do { \ - fprintf(stderr, fmt, ##__VA_ARGS__); \ - fprintf(stderr, ": %" PRId64 "\n", (int64_t)(code)); \ - exit(1); \ +#define _err(fmt, ...) \ + do { \ + int ret = errno; \ + fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__); \ + fprintf(stderr, " (errno=%d, line=%d)\n", \ + ret, __LINE__); \ + } while (0) + +#define err(fmt, ...) 
\ + do { \ + _err(fmt, ##__VA_ARGS__); \ + exit(1); \ } while (0) static void uffd_stats_reset(struct uffd_stats *uffd_stats, @@ -193,44 +200,28 @@ static void uffd_stats_report(struct uffd_stats *stats, int n_cpus) printf("\n"); } -static int anon_release_pages(char *rel_area) +static void anon_release_pages(char *rel_area) { - int ret = 0; - - if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) { - perror("madvise"); - ret = 1; - } - - return ret; + if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED)) + err("madvise(MADV_DONTNEED) failed"); } static void anon_allocate_area(void **alloc_area) { - if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) { - fprintf(stderr, "out of memory\n"); - *alloc_area = NULL; - } + if (posix_memalign(alloc_area, page_size, nr_pages * page_size)) + err("posix_memalign() failed"); } static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset) { } -/* HugeTLB memory */ -static int hugetlb_release_pages(char *rel_area) +static void hugetlb_release_pages(char *rel_area) { - int ret = 0; - if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, - rel_area == huge_fd_off0 ? 0 : - nr_pages * page_size, - nr_pages * page_size)) { - perror("fallocate"); - ret = 1; - } - - return ret; + rel_area == huge_fd_off0 ? 0 : nr_pages * page_size, + nr_pages * page_size)) + err("fallocate() failed"); } static void hugetlb_allocate_area(void **alloc_area) @@ -243,20 +234,16 @@ static void hugetlb_allocate_area(void **alloc_area) MAP_HUGETLB, huge_fd, *alloc_area == area_src ? 0 : nr_pages * page_size); - if (*alloc_area == MAP_FAILED) { - perror("mmap of hugetlbfs file failed"); - goto fail; - } + if (*alloc_area == MAP_FAILED) + err("mmap of hugetlbfs file failed"); if (map_shared) { area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_HUGETLB, huge_fd, *alloc_area == area_src ? 
0 : nr_pages * page_size); - if (area_alias == MAP_FAILED) { - perror("mmap of hugetlb file alias failed"); - goto fail_munmap; - } + if (area_alias == MAP_FAILED) + err("mmap of hugetlb file alias failed"); } if (*alloc_area == area_src) { @@ -267,16 +254,6 @@ static void hugetlb_allocate_area(void **alloc_area) } if (area_alias) *alloc_area_alias = area_alias; - - return; - -fail_munmap: - if (munmap(*alloc_area, nr_pages * page_size) < 0) { - perror("hugetlb munmap"); - exit(1); - } -fail: - *alloc_area = NULL; } static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset) @@ -292,33 +269,24 @@ static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset *start = (unsigned long) area_dst_alias + offset; } -/* Shared memory */ -static int shmem_release_pages(char *rel_area) +static void shmem_release_pages(char *rel_area) { - int ret = 0; - - if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) { - perror("madvise"); - ret = 1; - } - - return ret; + if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE)) + err("madvise(MADV_REMOVE) failed"); } static void shmem_allocate_area(void **alloc_area) { *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_SHARED, -1, 0); - if (*alloc_area == MAP_FAILED) { - fprintf(stderr, "shared memory mmap failed\n"); - *alloc_area = NULL; - } + if (*alloc_area == MAP_FAILED) + err("mmap of memfd failed"); } struct uffd_test_ops { unsigned long expected_ioctls; void (*allocate_area)(void **alloc_area); - int (*release_pages)(char *rel_area); + void (*release_pages)(char *rel_area); void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset); }; @@ -373,11 +341,8 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp) /* Undo write-protect, do wakeup after that */ prms.mode = wp ? 
UFFDIO_WRITEPROTECT_MODE_WP : 0; - if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) { - fprintf(stderr, "clear WP failed for address 0x%" PRIx64 "\n", - (uint64_t)start); - exit(1); - } + if (ioctl(ufd, UFFDIO_WRITEPROTECT, &prms)) + err("clear WP failed: address=0x%"PRIx64, (uint64_t)start); } static void continue_range(int ufd, __u64 start, __u64 len) @@ -388,12 +353,9 @@ static void continue_range(int ufd, __u64 start, __u64 len) req.range.len = len; req.mode = 0; - if (ioctl(ufd, UFFDIO_CONTINUE, &req)) { - fprintf(stderr, - "UFFDIO_CONTINUE failed for address 0x%" PRIx64 "\n", - (uint64_t)start); - exit(1); - } + if (ioctl(ufd, UFFDIO_CONTINUE, &req)) + err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, + (uint64_t)start); } static void *locking_thread(void *arg) @@ -412,10 +374,8 @@ static void *locking_thread(void *arg) seed += cpu; bzero(&rand, sizeof(rand)); bzero(&randstate, sizeof(randstate)); - if (initstate_r(seed, randstate, sizeof(randstate), &rand)) { - fprintf(stderr, "srandom_r error\n"); - exit(1); - } + if (initstate_r(seed, randstate, sizeof(randstate), &rand)) + err("initstate_r failed"); } else { page_nr = -bounces; if (!(bounces & BOUNCE_RACINGFAULTS)) @@ -424,16 +384,12 @@ static void *locking_thread(void *arg) while (!finished) { if (bounces & BOUNCE_RANDOM) { - if (random_r(&rand, &rand_nr)) { - fprintf(stderr, "random_r 1 error\n"); - exit(1); - } + if (random_r(&rand, &rand_nr)) + err("random_r failed"); page_nr = rand_nr; if (sizeof(page_nr) > sizeof(rand_nr)) { - if (random_r(&rand, &rand_nr)) { - fprintf(stderr, "random_r 2 error\n"); - exit(1); - } + if (random_r(&rand, &rand_nr)) + err("random_r failed"); page_nr |= (((unsigned long) rand_nr) << 16) << 16; } @@ -442,12 +398,9 @@ static void *locking_thread(void *arg) page_nr %= nr_pages; pthread_mutex_lock(area_mutex(area_dst, page_nr)); count = *area_count(area_dst, page_nr); - if (count != count_verify[page_nr]) { - fprintf(stderr, - "page_nr %lu memory corruption %Lu %Lu\n", - page_nr, count, - count_verify[page_nr]); exit(1); - } + if (count != count_verify[page_nr]) + err("page_nr %lu memory corruption %llu %llu", + page_nr, count, count_verify[page_nr]); count++; *area_count(area_dst, page_nr) = count_verify[page_nr] = count; pthread_mutex_unlock(area_mutex(area_dst, page_nr)); @@ -464,22 +417,21 @@ static void retry_copy_page(int ufd, struct uffdio_copy *uffdio_copy, offset); if (ioctl(ufd, UFFDIO_COPY, uffdio_copy)) { /* real retval in ufdio_copy.copy */ - if (uffdio_copy->copy != -EEXIST) { - uffd_error(uffdio_copy->copy, - "UFFDIO_COPY retry error"); - } - } else - uffd_error(uffdio_copy->copy, "UFFDIO_COPY retry unexpected"); + if (uffdio_copy->copy != -EEXIST) + err("UFFDIO_COPY retry error: %"PRId64, + (int64_t)uffdio_copy->copy); + } else { + err("UFFDIO_COPY retry unexpected: %"PRId64, + (int64_t)uffdio_copy->copy); + } } static int __copy_page(int ufd, unsigned long offset, bool retry) { struct uffdio_copy uffdio_copy; - if (offset >= nr_pages * page_size) { - fprintf(stderr, "unexpected offset %lu\n", offset); - exit(1); - } + if (offset >= nr_pages * page_size) + err("unexpected offset %lu\n", offset); uffdio_copy.dst = (unsigned long) area_dst + offset; uffdio_copy.src = (unsigned long) area_src + offset; uffdio_copy.len = page_size; @@ -491,9 +443,10 @@ static int __copy_page(int ufd, unsigned long offset, bool retry) if (ioctl(ufd, UFFDIO_COPY, &uffdio_copy)) { /* real retval in ufdio_copy.copy */ if (uffdio_copy.copy != -EEXIST) - uffd_error(uffdio_copy.copy, "UFFDIO_COPY error"); + 
err("UFFDIO_COPY error: %"PRId64, + (int64_t)uffdio_copy.copy); } else if (uffdio_copy.copy != page_size) { - uffd_error(uffdio_copy.copy, "UFFDIO_COPY unexpected copy"); + err("UFFDIO_COPY error: %"PRId64, (int64_t)uffdio_copy.copy); } else { if (test_uffdio_copy_eexist && retry) { test_uffdio_copy_eexist = false; @@ -522,11 +475,10 @@ static int uffd_read_msg(int ufd, struct uffd_msg *msg) if (ret < 0) { if (errno == EAGAIN) return 1; - perror("blocking read error"); + err("blocking read error"); } else { - fprintf(stderr, "short read\n"); + err("short read"); } - exit(1); } return 0; @@ -537,10 +489,8 @@ static void uffd_handle_page_fault(struct uffd_msg *msg, { unsigned long offset; - if (msg->event != UFFD_EVENT_PAGEFAULT) { - fprintf(stderr, "unexpected msg event %u\n", msg->event); - exit(1); - } + if (msg->event != UFFD_EVENT_PAGEFAULT) + err("unexpected msg event %u", msg->event); if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP) { /* Write protect page faults */ @@ -571,10 +521,8 @@ static void uffd_handle_page_fault(struct uffd_msg *msg, stats->minor_faults++; } else { /* Missing page faults */ - if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) { - fprintf(stderr, "unexpected write fault\n"); - exit(1); - } + if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE) + err("unexpected write fault"); offset = (char *)(unsigned long)msg->arg.pagefault.address - area_dst; offset &= ~(page_size-1); @@ -601,32 +549,20 @@ static void *uffd_poll_thread(void *arg) for (;;) { ret = poll(pollfd, 2, -1); - if (!ret) { - fprintf(stderr, "poll error %d\n", ret); - exit(1); - } - if (ret < 0) { - perror("poll"); - exit(1); - } + if (ret <= 0) + err("poll error: %d", ret); if (pollfd[1].revents & POLLIN) { - if (read(pollfd[1].fd, &tmp_chr, 1) != 1) { - fprintf(stderr, "read pipefd error\n"); - exit(1); - } + if (read(pollfd[1].fd, &tmp_chr, 1) != 1) + err("read pipefd error"); break; } - if (!(pollfd[0].revents & POLLIN)) { - fprintf(stderr, "pollfd[0].revents %d\n", - pollfd[0].revents); - exit(1); - } + if (!(pollfd[0].revents & POLLIN)) + err("pollfd[0].revents %d", pollfd[0].revents); if (uffd_read_msg(uffd, &msg)) continue; switch (msg.event) { default: - fprintf(stderr, "unexpected msg event %u\n", - msg.event); exit(1); + err("unexpected msg event %u\n", msg.event); break; case UFFD_EVENT_PAGEFAULT: uffd_handle_page_fault(&msg, stats); @@ -640,10 +576,8 @@ static void *uffd_poll_thread(void *arg) uffd_reg.range.start = msg.arg.remove.start; uffd_reg.range.len = msg.arg.remove.end - msg.arg.remove.start; - if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) { - fprintf(stderr, "remove failure\n"); - exit(1); - } + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_reg.range)) + err("remove failure"); break; case UFFD_EVENT_REMAP: area_dst = (char *)(unsigned long)msg.arg.remap.to; @@ -746,9 +680,7 @@ static int stress(struct uffd_stats *uffd_stats) * UFFDIO_COPY without writing zero pages into area_dst * because the background threads already completed). 
*/ - if (uffd_test_ops->release_pages(area_src)) - return 1; - + uffd_test_ops->release_pages(area_src); finished = 1; for (cpu = 0; cpu < nr_cpus; cpu++) @@ -758,10 +690,8 @@ static int stress(struct uffd_stats *uffd_stats) for (cpu = 0; cpu < nr_cpus; cpu++) { char c; if (bounces & BOUNCE_POLL) { - if (write(pipefd[cpu*2+1], &c, 1) != 1) { - fprintf(stderr, "pipefd write error\n"); - return 1; - } + if (write(pipefd[cpu*2+1], &c, 1) != 1) + err("pipefd write error"); if (pthread_join(uffd_threads[cpu], (void *)&uffd_stats[cpu])) return 1; @@ -861,10 +791,8 @@ static int faulting_process(int signal_test) memset(&act, 0, sizeof(act)); act.sa_sigaction = sighndl; act.sa_flags = SA_SIGINFO; - if (sigaction(SIGBUS, &act, 0)) { - perror("sigaction"); - return 1; - } + if (sigaction(SIGBUS, &act, 0)) + err("sigaction"); lastnr = (unsigned long)-1; } @@ -874,10 +802,8 @@ static int faulting_process(int signal_test) if (signal_test) { if (sigsetjmp(*sigbuf, 1) != 0) { - if (steps == 1 && nr == lastnr) { - fprintf(stderr, "Signal repeated\n"); - return 1; - } + if (steps == 1 && nr == lastnr) + err("Signal repeated"); lastnr = nr; if (signal_test == 1) { @@ -902,12 +828,9 @@ static int faulting_process(int signal_test) } count = *area_count(area_dst, nr); - if (count != count_verify[nr]) { - fprintf(stderr, - "nr %lu memory corruption %Lu %Lu\n", - nr, count, - count_verify[nr]); - } + if (count != count_verify[nr]) + err("nr %lu memory corruption %llu %llu\n", + nr, count, count_verify[nr]); /* * Trigger write protection if there is by writing * the same value back. @@ -923,18 +846,14 @@ static int faulting_process(int signal_test) area_dst = mremap(area_dst, nr_pages * page_size, nr_pages * page_size, MREMAP_MAYMOVE | MREMAP_FIXED, area_src); - if (area_dst == MAP_FAILED) { - perror("mremap"); - exit(1); - } + if (area_dst == MAP_FAILED) + err("mremap"); for (; nr < nr_pages; nr++) { count = *area_count(area_dst, nr); if (count != count_verify[nr]) { - fprintf(stderr, - "nr %lu memory corruption %Lu %Lu\n", - nr, count, - count_verify[nr]); exit(1); + err("nr %lu memory corruption %llu %llu\n", + nr, count, count_verify[nr]); } /* * Trigger write protection if there is by writing @@ -943,15 +862,11 @@ static int faulting_process(int signal_test) *area_count(area_dst, nr) = count; } - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); - for (nr = 0; nr < nr_pages; nr++) { - if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) { - fprintf(stderr, "nr %lu is not zero\n", nr); - exit(1); - } - } + for (nr = 0; nr < nr_pages; nr++) + if (my_bcmp(area_dst + nr * page_size, zeropage, page_size)) + err("nr %lu is not zero", nr); return 0; } @@ -964,13 +879,12 @@ static void retry_uffdio_zeropage(int ufd, uffdio_zeropage->range.len, offset); if (ioctl(ufd, UFFDIO_ZEROPAGE, uffdio_zeropage)) { - if (uffdio_zeropage->zeropage != -EEXIST) { - uffd_error(uffdio_zeropage->zeropage, - "UFFDIO_ZEROPAGE retry error"); - } + if (uffdio_zeropage->zeropage != -EEXIST) + err("UFFDIO_ZEROPAGE error: %"PRId64, + (int64_t)uffdio_zeropage->zeropage); } else { - uffd_error(uffdio_zeropage->zeropage, - "UFFDIO_ZEROPAGE retry unexpected"); + err("UFFDIO_ZEROPAGE error: %"PRId64, + (int64_t)uffdio_zeropage->zeropage); } } @@ -983,10 +897,8 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) has_zeropage = uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE); - if (offset >= nr_pages * page_size) { - fprintf(stderr, "unexpected offset %lu\n", 
offset); - exit(1); - } + if (offset >= nr_pages * page_size) + err("unexpected offset %lu", offset); uffdio_zeropage.range.start = (unsigned long) area_dst + offset; uffdio_zeropage.range.len = page_size; uffdio_zeropage.mode = 0; @@ -994,14 +906,13 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) res = uffdio_zeropage.zeropage; if (ret) { /* real retval in ufdio_zeropage.zeropage */ - if (has_zeropage) { - uffd_error(res, "UFFDIO_ZEROPAGE %s", - res == -EEXIST ? "-EEXIST" : "error"); - } else if (res != -EINVAL) - uffd_error(res, "UFFDIO_ZEROPAGE not -EINVAL"); + if (has_zeropage) + err("UFFDIO_ZEROPAGE error: %"PRId64, (int64_t)res); + else if (res != -EINVAL) + err("UFFDIO_ZEROPAGE not -EINVAL"); } else if (has_zeropage) { if (res != page_size) { - uffd_error(res, "UFFDIO_ZEROPAGE unexpected"); + err("UFFDIO_ZEROPAGE unexpected size"); } else { if (test_uffdio_zeropage_eexist && retry) { test_uffdio_zeropage_eexist = false; @@ -1011,7 +922,7 @@ static int __uffdio_zeropage(int ufd, unsigned long offset, bool retry) return 1; } } else - uffd_error(res, "UFFDIO_ZEROPAGE succeeded"); + err("UFFDIO_ZEROPAGE succeeded"); return 0; } @@ -1030,8 +941,7 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout); - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); if (userfaultfd_open(0)) return 1; @@ -1040,25 +950,16 @@ static int userfaultfd_zeropage_test(void) uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; if (test_uffdio_wp) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure\n"); - exit(1); - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure"); expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != - expected_ioctls) { - fprintf(stderr, - "unexpected missing ioctl for anon memory\n"); - exit(1); - } + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) + err("unexpected missing ioctl for anon memory"); - if (uffdio_zeropage(uffd, 0)) { - if (my_bcmp(area_dst, zeropage, page_size)) { - fprintf(stderr, "zeropage is not zero\n"); - exit(1); - } - } + if (uffdio_zeropage(uffd, 0)) + if (my_bcmp(area_dst, zeropage, page_size)) + err("zeropage is not zero"); close(uffd); printf("done.\n"); @@ -1078,8 +979,7 @@ static int userfaultfd_events_test(void) printf("testing events (fork, remap, remove): "); fflush(stdout); - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | UFFD_FEATURE_EVENT_REMOVE; @@ -1092,41 +992,28 @@ static int userfaultfd_events_test(void) uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; if (test_uffdio_wp) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure\n"); - exit(1); - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure"); expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { - fprintf(stderr, "unexpected missing ioctl for anon memory\n"); - exit(1); - } + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) + err("unexpected missing ioctl for anon memory"); - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { - perror("uffd_poll_thread create"); - 
exit(1); - } + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) + err("uffd_poll_thread create"); pid = fork(); - if (pid < 0) { - perror("fork"); - exit(1); - } + if (pid < 0) + err("fork"); if (!pid) exit(faulting_process(0)); waitpid(pid, &err, 0); - if (err) { - fprintf(stderr, "faulting process failed\n"); - exit(1); - } - - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { - perror("pipe write"); - exit(1); - } + if (err) + err("faulting process failed"); + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) + err("pipe write"); if (pthread_join(uffd_mon, NULL)) return 1; @@ -1151,8 +1038,7 @@ static int userfaultfd_sig_test(void) printf("testing signal delivery: "); fflush(stdout); - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; if (userfaultfd_open(features)) @@ -1164,57 +1050,41 @@ static int userfaultfd_sig_test(void) uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; if (test_uffdio_wp) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure\n"); - exit(1); - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure"); expected_ioctls = uffd_test_ops->expected_ioctls; - if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { - fprintf(stderr, "unexpected missing ioctl for anon memory\n"); - exit(1); - } + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) + err("unexpected missing ioctl for anon memory"); - if (faulting_process(1)) { - fprintf(stderr, "faulting process failed\n"); - exit(1); - } + if (faulting_process(1)) + err("faulting process failed"); - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { - perror("uffd_poll_thread create"); - exit(1); - } + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) + err("uffd_poll_thread create"); pid = fork(); - if (pid < 0) { - perror("fork"); - exit(1); - } + if (pid < 0) + err("fork"); if (!pid) exit(faulting_process(2)); waitpid(pid, &err, 0); - if (err) { - fprintf(stderr, "faulting process failed\n"); - exit(1); - } - - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { - perror("pipe write"); - exit(1); - } + if (err) + err("faulting process failed"); + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) + err("pipe write"); if (pthread_join(uffd_mon, (void **)&userfaults)) return 1; printf("done.\n"); if (userfaults) - fprintf(stderr, "Signal test failed, userfaults: %ld\n", - userfaults); + err("Signal test failed, userfaults: %ld", userfaults); close(uffd); + return userfaults != 0; } @@ -1236,8 +1106,7 @@ static int userfaultfd_minor_test(void) printf("testing minor faults: "); fflush(stdout); - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); if (userfaultfd_open_ext(&features)) return 1; @@ -1251,17 +1120,13 @@ static int userfaultfd_minor_test(void) uffdio_register.range.start = (unsigned long)area_dst_alias; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MINOR; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure\n"); - exit(1); - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure"); expected_ioctls = uffd_test_ops->expected_ioctls; expected_ioctls |= 1 << _UFFDIO_CONTINUE; - 
if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) { - fprintf(stderr, "unexpected missing ioctl(s)\n"); - exit(1); - } + if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls) + err("unexpected missing ioctl(s)"); /* * After registering with UFFD, populate the non-UFFD-registered side of @@ -1272,10 +1137,8 @@ static int userfaultfd_minor_test(void) page_size); } - if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) { - perror("uffd_poll_thread create"); - exit(1); - } + if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats)) + err("uffd_poll_thread create"); /* * Read each of the pages back using the UFFD-registered mapping. We @@ -1284,26 +1147,19 @@ static int userfaultfd_minor_test(void) * page's contents, and then issuing a CONTINUE ioctl. */ - if (posix_memalign(&expected_page, page_size, page_size)) { - fprintf(stderr, "out of memory\n"); - return 1; - } + if (posix_memalign(&expected_page, page_size, page_size)) + err("out of memory"); for (p = 0; p < nr_pages; ++p) { expected_byte = ~((uint8_t)(p % ((uint8_t)-1))); memset(expected_page, expected_byte, page_size); if (my_bcmp(expected_page, area_dst_alias + (p * page_size), - page_size)) { - fprintf(stderr, - "unexpected page contents after minor fault\n"); - exit(1); - } + page_size)) + err("unexpected page contents after minor fault"); } - if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) { - perror("pipe write"); - exit(1); - } + if (write(pipefd[1], &c, sizeof(c)) != sizeof(c)) + err("pipe write"); if (pthread_join(uffd_mon, NULL)) return 1; @@ -1321,7 +1177,6 @@ static int userfaultfd_stress(void) unsigned long nr; struct uffdio_register uffdio_register; unsigned long cpu; - int err; struct uffd_stats uffd_stats[nr_cpus]; uffd_test_ops->allocate_area((void **)&area_src); @@ -1366,10 +1221,8 @@ static int userfaultfd_stress(void) } } - if (posix_memalign(&area, page_size, page_size)) { - fprintf(stderr, "out of memory\n"); - return 1; - } + if (posix_memalign(&area, page_size, page_size)) + err("out of memory"); zeropage = area; bzero(zeropage, page_size); @@ -1378,7 +1231,6 @@ static int userfaultfd_stress(void) pthread_attr_init(&attr); pthread_attr_setstacksize(&attr, 16*1024*1024); - err = 0; while (bounces--) { unsigned long expected_ioctls; @@ -1407,25 +1259,18 @@ static int userfaultfd_stress(void) uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; if (test_uffdio_wp) uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure\n"); - return 1; - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure"); expected_ioctls = uffd_test_ops->expected_ioctls; if ((uffdio_register.ioctls & expected_ioctls) != - expected_ioctls) { - fprintf(stderr, - "unexpected missing ioctl for anon memory\n"); - return 1; - } + expected_ioctls) + err("unexpected missing ioctl for anon memory"); if (area_dst_alias) { uffdio_register.range.start = (unsigned long) area_dst_alias; - if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) { - fprintf(stderr, "register failure alias\n"); - return 1; - } + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failure alias"); } /* @@ -1452,8 +1297,7 @@ static int userfaultfd_stress(void) * MADV_DONTNEED only after the UFFDIO_REGISTER, so it's * required to MADV_DONTNEED here. 
*/ - if (uffd_test_ops->release_pages(area_dst)) - return 1; + uffd_test_ops->release_pages(area_dst); uffd_stats_reset(uffd_stats, nr_cpus); @@ -1467,33 +1311,22 @@ static int userfaultfd_stress(void) nr_pages * page_size, false); /* unregister */ - if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) { - fprintf(stderr, "unregister failure\n"); - return 1; - } + if (ioctl(uffd, UFFDIO_UNREGISTER, &uffdio_register.range)) + err("unregister failure"); if (area_dst_alias) { uffdio_register.range.start = (unsigned long) area_dst; if (ioctl(uffd, UFFDIO_UNREGISTER, - &uffdio_register.range)) { - fprintf(stderr, "unregister failure alias\n"); - return 1; - } + &uffdio_register.range)) + err("unregister failure alias"); } /* verification */ - if (bounces & BOUNCE_VERIFY) { - for (nr = 0; nr < nr_pages; nr++) { - if (*area_count(area_dst, nr) != count_verify[nr]) { - fprintf(stderr, - "error area_count %Lu %Lu %lu\n", - *area_count(area_src, nr), - count_verify[nr], - nr); - err = 1; - bounces = 0; - } - } - } + if (bounces & BOUNCE_VERIFY) + for (nr = 0; nr < nr_pages; nr++) + if (*area_count(area_dst, nr) != count_verify[nr]) + err("error area_count %llu %llu %lu\n", + *area_count(area_src, nr), + count_verify[nr], nr); /* prepare next bounce */ tmp_area = area_src; @@ -1507,9 +1340,6 @@ static int userfaultfd_stress(void) uffd_stats_report(uffd_stats, nr_cpus); } - if (err) - return err; - close(uffd); return userfaultfd_zeropage_test() || userfaultfd_sig_test() || userfaultfd_events_test() || userfaultfd_minor_test(); @@ -1560,7 +1390,7 @@ static void set_test_type(const char *type) test_type = TEST_SHMEM; uffd_test_ops = &shmem_uffd_test_ops; } else { - fprintf(stderr, "Unknown test type: %s\n", type); exit(1); + err("Unknown test type: %s", type); } if (test_type == TEST_HUGETLB) @@ -1568,15 +1398,11 @@ static void set_test_type(const char *type) else page_size = sysconf(_SC_PAGE_SIZE); - if (!page_size) { - fprintf(stderr, "Unable to determine page size\n"); - exit(2); - } + if (!page_size) + err("Unable to determine page size"); if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) * 2 - > page_size) { - fprintf(stderr, "Impossible to run this test\n"); - exit(2); - } + > page_size) + err("Impossible to run this test"); } static void sigalrm(int sig) @@ -1593,10 +1419,8 @@ int main(int argc, char **argv) if (argc < 4) usage(); - if (signal(SIGALRM, sigalrm) == SIG_ERR) { - fprintf(stderr, "failed to arm SIGALRM"); - exit(1); - } + if (signal(SIGALRM, sigalrm) == SIG_ERR) + err("failed to arm SIGALRM"); alarm(ALARM_INTERVAL_SECS); set_test_type(argv[1]); @@ -1605,13 +1429,13 @@ int main(int argc, char **argv) nr_pages_per_cpu = atol(argv[2]) * 1024*1024 / page_size / nr_cpus; if (!nr_pages_per_cpu) { - fprintf(stderr, "invalid MiB\n"); + _err("invalid MiB"); usage(); } bounces = atoi(argv[3]); if (bounces <= 0) { - fprintf(stderr, "invalid bounces\n"); + _err("invalid bounces"); usage(); } nr_pages = nr_pages_per_cpu * nr_cpus; @@ -1620,16 +1444,10 @@ int main(int argc, char **argv) if (argc < 5) usage(); huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755); - if (huge_fd < 0) { - fprintf(stderr, "Open of %s failed", argv[3]); - perror("open"); - exit(1); - } - if (ftruncate(huge_fd, 0)) { - fprintf(stderr, "ftruncate %s to size 0 failed", argv[3]); - perror("ftruncate"); - exit(1); - } + if (huge_fd < 0) + err("Open of %s failed", argv[4]); + if (ftruncate(huge_fd, 0)) + err("ftruncate %s to size 0 failed", argv[4]); } printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n", 
nr_pages, nr_pages_per_cpu); From 5fc7a5f6fd04bc18f309d9f979b32ef7d1d0a997 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:48:59 -0700 Subject: [PATCH 035/190] mm/thp: simplify copying of huge zero page pmd when fork Patch series "mm/uffd: Misc fix for uffd-wp and one more test". This series tries to fix some corner case bugs for uffd-wp on either thp or fork(). Then it introduced a new test with pagemap/pageout. Patch layout: Patch 1: cleanup for THP, it'll slightly simplify the follow up patches Patch 2-4: misc fixes for uffd-wp here and there; please refer to each patch Patch 5: add pagemap support for uffd-wp Patch 6: add pagemap/pageout test for uffd-wp The last test introduced can also verify some of the fixes in previous patches, as the test will fail without the fixes. However it's not easy to verify all the changes in patch 2-4, but hopefully they can still be properly reviewed. Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch 5 will be incomplete as it's missing e.g. hugetlbfs part or the special swap pte detection. However that's not needed in this series, and since that series is still during review, this series does not depend on that one (the last test only runs with anonymous memory, not file-backed). So this series can be merged even before that series. This patch (of 6): Huge zero page is handled in a special path in copy_huge_pmd(), however it should share most codes with a normal thp page. Trying to share more code with it by removing the special path. The only leftover so far is the huge zero page refcounting (mm_get_huge_zero_page()), because that's separately done with a global counter. This prepares for a future patch to modify the huge pmd to be installed, so that we don't need to duplicate it explicitly into huge zero page case too. Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com Signed-off-by: Peter Xu Cc: Kirill A. Shutemov Cc: Mike Kravetz , peterx@redhat.com Cc: Mike Rapoport Cc: Axel Rasmussen Cc: Andrea Arcangeli Cc: Hugh Dickins Cc: Jerome Glisse Cc: Alexander Viro Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Joe Perches Cc: Lokesh Gidra Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7ccea3e3e216..b72d919ab13a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1088,17 +1088,13 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, * a page table. */ if (is_huge_zero_pmd(pmd)) { - struct page *zero_page; /* * get_huge_zero_page() will never allocate a new page here, * since we already have a zero page to copy. It just takes a * reference. 
*/ - zero_page = mm_get_huge_zero_page(dst_mm); - set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd, - zero_page); - ret = 0; - goto out_unlock; + mm_get_huge_zero_page(dst_mm); + goto out_zero_page; } src_page = pmd_page(pmd); @@ -1122,6 +1118,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, get_page(src_page); page_dup_rmap(src_page, true); add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); +out_zero_page: mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); From 8f34f1eac3820fc2722e5159acceb22545b30b0d Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:49:02 -0700 Subject: [PATCH 036/190] mm/userfaultfd: fix uffd-wp special cases for fork() We tried to do something similar in b569a1760782 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") previously, but it's not doing it all right.. A few fixes around the code path: 1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather than the new vma. That's overlooked in b569a1760782, so it won't work as expected. Thanks to the recent rework on fork code (7a4830c380f3a8b3), we can easily get the new vma now, so switch the checks to that. 2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the huge pmd is a migration huge pmd. When it happens, instead of using pmd_uffd_wp(), we should use pmd_swp_uffd_wp(). The fix is simply to handle them separately. 3. Forget to carry over uffd-wp bit for a write migration huge pmd entry. This also happens in copy_huge_pmd(), where we converted a write huge migration entry into a read one. 4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes. 5. In copy_present_page() when COW is enforced when fork(), we also need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new vma, and when the pte to be copied has uffd-wp bit set. Remove the comment in copy_present_pte() about this. It won't help a huge lot to only comment there, but comment everywhere would be an overkill. Let's assume the commit messages would help. [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit] Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork") Signed-off-by: Peter Xu Cc: Jerome Glisse Cc: Mike Rapoport Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 2 +- include/linux/swapops.h | 2 ++ mm/huge_memory.c | 27 ++++++++++++++------------- mm/memory.c | 25 +++++++++++++------------ 4 files changed, 30 insertions(+), 26 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b4e1ebaae825..939f21b69ead 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -10,7 +10,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, - struct vm_area_struct *vma); + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 5907205c712c..708fbeb21dd3 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -265,6 +265,8 @@ static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd) if (pmd_swp_soft_dirty(pmd)) pmd = pmd_swp_clear_soft_dirty(pmd); + if (pmd_swp_uffd_wp(pmd)) + pmd = pmd_swp_clear_uffd_wp(pmd); arch_entry = __pmd_to_swp_entry(pmd); return swp_entry(__swp_type(arch_entry), __swp_offset(arch_entry)); } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b72d919ab13a..40a90ff18180 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1026,7 +1026,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, - struct vm_area_struct *vma) + struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) { spinlock_t *dst_ptl, *src_ptl; struct page *src_page; @@ -1035,7 +1035,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, int ret = -ENOMEM; /* Skip if can be re-fill on fault */ - if (!vma_is_anonymous(vma)) + if (!vma_is_anonymous(dst_vma)) return 0; pgtable = pte_alloc_one(dst_mm); @@ -1049,14 +1049,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, ret = -EAGAIN; pmd = *src_pmd; - /* - * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA - * does not have the VM_UFFD_WP, which means that the uffd - * fork event is not enabled. - */ - if (!(vma->vm_flags & VM_UFFD_WP)) - pmd = pmd_clear_uffd_wp(pmd); - #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION if (unlikely(is_swap_pmd(pmd))) { swp_entry_t entry = pmd_to_swp_entry(pmd); @@ -1067,11 +1059,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*src_pmd)) pmd = pmd_swp_mksoft_dirty(pmd); + if (pmd_swp_uffd_wp(*src_pmd)) + pmd = pmd_swp_mkuffd_wp(pmd); set_pmd_at(src_mm, addr, src_pmd, pmd); } add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); + if (!userfaultfd_wp(dst_vma)) + pmd = pmd_swp_clear_uffd_wp(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); ret = 0; goto out_unlock; @@ -1107,11 +1103,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, * best effort that the pinned pages won't be replaced by another * random page during the coming copy-on-write. 
*/ - if (unlikely(page_needs_cow_for_dma(vma, src_page))) { + if (unlikely(page_needs_cow_for_dma(src_vma, src_page))) { pte_free(dst_mm, pgtable); spin_unlock(src_ptl); spin_unlock(dst_ptl); - __split_huge_pmd(vma, src_pmd, addr, false, NULL); + __split_huge_pmd(src_vma, src_pmd, addr, false, NULL); return -EAGAIN; } @@ -1121,8 +1117,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, out_zero_page: mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); - pmdp_set_wrprotect(src_mm, addr, src_pmd); + if (!userfaultfd_wp(dst_vma)) + pmd = pmd_clear_uffd_wp(pmd); pmd = pmd_mkold(pmd_wrprotect(pmd)); set_pmd_at(dst_mm, addr, dst_pmd, pmd); @@ -1835,6 +1832,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, newpmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd = pmd_swp_mksoft_dirty(newpmd); + if (pmd_swp_uffd_wp(*pmd)) + newpmd = pmd_swp_mkuffd_wp(newpmd); set_pmd_at(mm, addr, pmd, newpmd); } goto unlock; @@ -3245,6 +3244,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) pmde = pmd_mksoft_dirty(pmde); if (is_write_migration_entry(entry)) pmde = maybe_pmd_mkwrite(pmde, vma); + if (pmd_swp_uffd_wp(*pvmw->pmd)) + pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE); if (PageAnon(new)) diff --git a/mm/memory.c b/mm/memory.c index 48c4576df898..07a71a016c18 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -707,10 +707,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, static unsigned long copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, - pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma, - unsigned long addr, int *rss) + pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma, + struct vm_area_struct *src_vma, unsigned long addr, int *rss) { - unsigned long vm_flags = vma->vm_flags; + unsigned long vm_flags = dst_vma->vm_flags; pte_t pte = *src_pte; struct page *page; swp_entry_t entry = pte_to_swp_entry(pte); @@ -779,6 +779,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, set_pte_at(src_mm, addr, src_pte, pte); } } + if (!userfaultfd_wp(dst_vma)) + pte = pte_swp_clear_uffd_wp(pte); set_pte_at(dst_mm, addr, dst_pte, pte); return 0; } @@ -844,6 +846,9 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma /* All done, just insert the new page copy in the child */ pte = mk_pte(new_page, dst_vma->vm_page_prot); pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma); + if (userfaultfd_pte_wp(dst_vma, *src_pte)) + /* Uffd-wp needs to be delivered to dest pte as well */ + pte = pte_wrprotect(pte_mkuffd_wp(pte)); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } @@ -893,12 +898,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte = pte_mkclean(pte); pte = pte_mkold(pte); - /* - * Make sure the _PAGE_UFFD_WP bit is cleared if the new VMA - * does not have the VM_UFFD_WP, which means that the uffd - * fork event is not enabled. 
- */ - if (!(vm_flags & VM_UFFD_WP)) + if (!userfaultfd_wp(dst_vma)) pte = pte_clear_uffd_wp(pte); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); @@ -973,7 +973,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, if (unlikely(!pte_present(*src_pte))) { entry.val = copy_nonpresent_pte(dst_mm, src_mm, dst_pte, src_pte, - src_vma, addr, rss); + dst_vma, src_vma, + addr, rss); if (entry.val) break; progress += 8; @@ -1050,8 +1051,8 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, || pmd_devmap(*src_pmd)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma); - err = copy_huge_pmd(dst_mm, src_mm, - dst_pmd, src_pmd, addr, src_vma); + err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd, + addr, dst_vma, src_vma); if (err == -ENOMEM) return -ENOMEM; if (!err) From 00b151f21f390f1e0b294720a3660506abaf49cd Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:49:06 -0700 Subject: [PATCH 037/190] mm/userfaultfd: fail uffd-wp registration if not supported We should fail uffd-wp registration immediately if the arch does not even have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined. That'll block also relevant ioctls on e.g. UFFDIO_WRITEPROTECT because that'll check against VM_UFFD_WP, which can only be applied with a success registration. Remove the WP feature bit too for those archs when handling UFFDIO_API ioctl. Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 14f92285d04f..5dd78238cc15 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1304,8 +1304,12 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx, vm_flags = 0; if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING) vm_flags |= VM_UFFD_MISSING; - if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP + goto out; +#endif vm_flags |= VM_UFFD_WP; + } if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) { #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR goto out; @@ -1942,6 +1946,9 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, uffdio_api.features = UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS; +#endif +#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP + uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; #endif uffdio_api.ioctls = UFFD_API_IOCTLS; ret = -EFAULT; From fb8e37f35a2fe1f983ac21850e856e2c7498d469 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:49:09 -0700 Subject: [PATCH 038/190] mm/pagemap: export uffd-wp protection information Export the PTE/PMD status of uffd-wp to pagemap too. Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/pagemap.rst | 2 ++ fs/proc/task_mmu.c | 9 +++++++++ 2 files changed, 11 insertions(+) diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index 340a5aee9b80..fb578fbbb76c 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,6 +21,8 @@ There are four components to pagemap: * Bit 55 pte is soft-dirty (see :ref:`Documentation/admin-guide/mm/soft-dirty.rst `) * Bit 56 page exclusively mapped (since 4.2) + * Bit 57 pte is uffd-wp write-protected (since 5.13) (see + :ref:`Documentation/admin-guide/mm/userfaultfd.rst `) * Bits 57-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 9feda7792882..95c8f1e8fea6 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1302,6 +1302,7 @@ struct pagemapread { #define PM_PFRAME_MASK GENMASK_ULL(PM_PFRAME_BITS - 1, 0) #define PM_SOFT_DIRTY BIT_ULL(55) #define PM_MMAP_EXCLUSIVE BIT_ULL(56) +#define PM_UFFD_WP BIT_ULL(57) #define PM_FILE BIT_ULL(61) #define PM_SWAP BIT_ULL(62) #define PM_PRESENT BIT_ULL(63) @@ -1375,10 +1376,14 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, page = vm_normal_page(vma, addr, pte); if (pte_soft_dirty(pte)) flags |= PM_SOFT_DIRTY; + if (pte_uffd_wp(pte)) + flags |= PM_UFFD_WP; } else if (is_swap_pte(pte)) { swp_entry_t entry; if (pte_swp_soft_dirty(pte)) flags |= PM_SOFT_DIRTY; + if (pte_swp_uffd_wp(pte)) + flags |= PM_UFFD_WP; entry = pte_to_swp_entry(pte); if (pm->show_pfn) frame = swp_type(entry) | @@ -1426,6 +1431,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, flags |= PM_PRESENT; if (pmd_soft_dirty(pmd)) flags |= PM_SOFT_DIRTY; + if (pmd_uffd_wp(pmd)) + flags |= PM_UFFD_WP; if (pm->show_pfn) frame = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); @@ -1444,6 +1451,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, flags |= PM_SWAP; if (pmd_swp_soft_dirty(pmd)) flags |= PM_SOFT_DIRTY; + if (pmd_swp_uffd_wp(pmd)) + flags |= PM_UFFD_WP; VM_BUG_ON(!is_pmd_migration_entry(pmd)); page = migration_entry_to_page(entry); } From eb3b2e0039837546b460d8c747b86b2632a975a1 Mon Sep 17 00:00:00 2001 From: Peter Xu Date: Wed, 30 Jun 2021 18:49:13 -0700 Subject: [PATCH 039/190] userfaultfd/selftests: add pagemap uffd-wp test Add one anonymous specific test to start using pagemap. With pagemap support, we can directly read the uffd-wp bit from pgtable without triggering any fault, so it's easier to do sanity checks in unit tests. Meanwhile this test also leverages the newly introduced MADV_PAGEOUT madvise function to test swap ptes with uffd-wp bit set, and across fork()s. Link: https://lkml.kernel.org/r/20210428225030.9708-7-peterx@redhat.com Signed-off-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Axel Rasmussen Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 154 +++++++++++++++++++++++ 1 file changed, 154 insertions(+) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 6339aeaeeff8..93eae095b61e 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -1170,6 +1170,144 @@ static int userfaultfd_minor_test(void) return stats.missing_faults != 0 || stats.minor_faults != nr_pages; } +#define BIT_ULL(nr) (1ULL << (nr)) +#define PM_SOFT_DIRTY BIT_ULL(55) +#define PM_MMAP_EXCLUSIVE BIT_ULL(56) +#define PM_UFFD_WP BIT_ULL(57) +#define PM_FILE BIT_ULL(61) +#define PM_SWAP BIT_ULL(62) +#define PM_PRESENT BIT_ULL(63) + +static int pagemap_open(void) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + + if (fd < 0) + err("open pagemap"); + + return fd; +} + +static uint64_t pagemap_read_vaddr(int fd, void *vaddr) +{ + uint64_t value; + int ret; + + ret = pread(fd, &value, sizeof(uint64_t), + ((uint64_t)vaddr >> 12) * sizeof(uint64_t)); + if (ret != sizeof(uint64_t)) + err("pread() on pagemap failed"); + + return value; +} + +/* This macro let __LINE__ works in err() */ +#define pagemap_check_wp(value, wp) do { \ + if (!!(value & PM_UFFD_WP) != wp) \ + err("pagemap uffd-wp bit error: 0x%"PRIx64, value); \ + } while (0) + +static int pagemap_test_fork(bool present) +{ + pid_t child = fork(); + uint64_t value; + int fd, result; + + if (!child) { + /* Open the pagemap fd of the child itself */ + fd = pagemap_open(); + value = pagemap_read_vaddr(fd, area_dst); + /* + * After fork() uffd-wp bit should be gone as long as we're + * without UFFD_FEATURE_EVENT_FORK + */ + pagemap_check_wp(value, false); + /* Succeed */ + exit(0); + } + waitpid(child, &result, 0); + return result; +} + +static void userfaultfd_pagemap_test(unsigned int test_pgsize) +{ + struct uffdio_register uffdio_register; + int pagemap_fd; + uint64_t value; + + /* Pagemap tests uffd-wp only */ + if (!test_uffdio_wp) + return; + + /* Not enough memory to test this page size */ + if (test_pgsize > nr_pages * page_size) + return; + + printf("testing uffd-wp with pagemap (pgsize=%u): ", test_pgsize); + /* Flush so it doesn't flush twice in parent/child later */ + fflush(stdout); + + uffd_test_ops->release_pages(area_dst); + + if (test_pgsize > page_size) { + /* This is a thp test */ + if (madvise(area_dst, nr_pages * page_size, MADV_HUGEPAGE)) + err("madvise(MADV_HUGEPAGE) failed"); + } else if (test_pgsize == page_size) { + /* This is normal page test; force no thp */ + if (madvise(area_dst, nr_pages * page_size, MADV_NOHUGEPAGE)) + err("madvise(MADV_NOHUGEPAGE) failed"); + } + + if (userfaultfd_open(0)) + err("userfaultfd_open"); + + uffdio_register.range.start = (unsigned long) area_dst; + uffdio_register.range.len = nr_pages * page_size; + uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; + if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register)) + err("register failed"); + + pagemap_fd = pagemap_open(); + + /* Touch the page */ + *area_dst = 1; + wp_range(uffd, (uint64_t)area_dst, test_pgsize, true); + value = pagemap_read_vaddr(pagemap_fd, area_dst); + pagemap_check_wp(value, true); + /* Make sure uffd-wp bit dropped when fork */ + if (pagemap_test_fork(true)) + err("Detected stall uffd-wp bit in child"); + + /* Exclusive required or 
PAGEOUT won't work */ + if (!(value & PM_MMAP_EXCLUSIVE)) + err("multiple mapping detected: 0x%"PRIx64, value); + + if (madvise(area_dst, test_pgsize, MADV_PAGEOUT)) + err("madvise(MADV_PAGEOUT) failed"); + + /* Uffd-wp should persist even swapped out */ + value = pagemap_read_vaddr(pagemap_fd, area_dst); + pagemap_check_wp(value, true); + /* Make sure uffd-wp bit dropped when fork */ + if (pagemap_test_fork(false)) + err("Detected stall uffd-wp bit in child"); + + /* Unprotect; this tests swap pte modifications */ + wp_range(uffd, (uint64_t)area_dst, page_size, false); + value = pagemap_read_vaddr(pagemap_fd, area_dst); + pagemap_check_wp(value, false); + + /* Fault in the page from disk */ + *area_dst = 2; + value = pagemap_read_vaddr(pagemap_fd, area_dst); + pagemap_check_wp(value, false); + + close(pagemap_fd); + close(uffd); + printf("done\n"); +} + static int userfaultfd_stress(void) { void *area; @@ -1341,6 +1479,22 @@ static int userfaultfd_stress(void) } close(uffd); + + if (test_type == TEST_ANON) { + /* + * shmem/hugetlb won't be able to run since they have different + * behavior on fork() (file-backed memory normally drops ptes + * directly when fork), meanwhile the pagemap test will verify + * pgtable entry of fork()ed child. + */ + userfaultfd_pagemap_test(page_size); + /* + * Hard-code for x86_64 for now for 2M THP, as x86_64 is + * currently the only one that supports uffd-wp + */ + userfaultfd_pagemap_test(page_size * 512); + } + return userfaultfd_zeropage_test() || userfaultfd_sig_test() || userfaultfd_events_test() || userfaultfd_minor_test(); } From 3460f6e5c1ed94c2ab7c1ccc032a5bebd88deaa7 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:17 -0700 Subject: [PATCH 040/190] userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte Patch series "userfaultfd: add minor fault handling for shmem", v6. Overview ======== See the series which added minor faults for hugetlbfs [3] for a detailed overview of minor fault handling in general. This series adds the same support for shmem-backed areas. This series is structured as follows: - Commits 1 and 2 are cleanups. - Commits 3 and 4 implement the new feature (minor fault handling for shmem). - Commit 5 advertises that the feature is now available since at this point it's fully implemented. - Commit 6 is a final cleanup, modifying an existing code path to re-use a new helper we've introduced. - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature. Use Case ======== In some cases it is useful to have VM memory backed by tmpfs instead of hugetlbfs. So, this feature will be used to support the same VM live migration use case described in my original series. Additionally, Android folks (Lokesh Gidra ) hope to optimize the Android Runtime garbage collector using this feature: "The plan is to use userfaultfd for concurrently compacting the heap. With this feature, the heap can be shared-mapped at another location where the GC-thread(s) could continue the compaction operation without the need to invoke userfault ioctl(UFFDIO_COPY) each time. OTOH, if and when Java threads get faults on the heap, UFFDIO_CONTINUE can be used to resume execution. Furthermore, this feature enables updating references in the 'non-moving' portion of the heap efficiently. Without this feature, uneccessary page copying (ioctl(UFFDIO_COPY)) would be required." 
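For concreteness, the resolve step described in that use case could look roughly like the sketch below from userspace. This is not code from this series: it assumes a shmem area dst that is already registered with UFFDIO_REGISTER_MODE_MINOR, an unregistered alias mapping dst_alias of the same memfd range, and a single-page fault; the helper name, the 0xa5 fill pattern, and the error handling are illustrative only.

#include <errno.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Resolve one minor fault reported at fault_addr inside the registered
 * mapping "dst": populate the page through the unregistered alias mapping
 * of the same memfd, then ask the kernel to install PTEs for the
 * now-present page cache page with UFFDIO_CONTINUE.
 */
static void resolve_minor_fault(int uffd, char *dst, char *dst_alias,
				unsigned long fault_addr, size_t pgsize)
{
	unsigned long start = fault_addr & ~(pgsize - 1);
	struct uffdio_continue cont;

	/* Writing via the alias does not fault; it fills the shared page. */
	memset(dst_alias + (start - (unsigned long)dst), 0xa5, pgsize);

	cont.range.start = start;
	cont.range.len = pgsize;
	cont.mode = 0;

	/* EEXIST means another thread already resolved this page. */
	if (ioctl(uffd, UFFDIO_CONTINUE, &cont) && errno != EEXIST)
		perror("UFFDIO_CONTINUE");
}

The point of the sketch is the split of responsibilities: the contents are produced through the alias mapping, and UFFDIO_CONTINUE only installs PTEs, so no page copy is needed.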
[1] https://lore.kernel.org/patchwork/cover/1388144/ [2] https://lore.kernel.org/patchwork/patch/1408161/ [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t This patch (of 9): Previously, we did a dance where we had one calling path in userfaultfd.c (mfill_atomic_pte), but then we split it into two in shmem_fs.h (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single shared function in shmem.c (shmem_mfill_atomic_pte). This is all a bit overly complex. Just call the single combined shmem function directly, allowing us to clean up various branches, boilerplate, etc. While we're touching this function, two other small cleanup changes: - offset is equivalent to pgoff, so we can get rid of offset entirely. - Split two VM_BUG_ON cases into two statements. This means the line number reported when the BUG is hit specifies exactly which condition was true. Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Reviewed-by: Peter Xu Acked-by: Hugh Dickins Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/shmem_fs.h | 19 +++++++-------- mm/shmem.c | 52 +++++++++++++--------------------------- mm/userfaultfd.c | 10 +++----- 3 files changed, 27 insertions(+), 54 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index d82b6f396588..a69ea4d97fdd 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -122,21 +122,18 @@ static inline bool shmem_file(struct file *file) extern bool shmem_charge(struct inode *inode, long pages); extern void shmem_uncharge(struct inode *inode, long pages); +#ifdef CONFIG_USERFAULTFD #ifdef CONFIG_SHMEM -extern int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, +extern int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, unsigned long dst_addr, unsigned long src_addr, + bool zeropage, struct page **pagep); -extern int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr); -#else -#define shmem_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, dst_addr, \ - src_addr, pagep) ({ BUG(); 0; }) -#define shmem_mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, \ - dst_addr) ({ BUG(); 0; }) -#endif +#else /* !CONFIG_SHMEM */ +#define shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, \ + src_addr, zeropage, pagep) ({ BUG(); 0; }) +#endif /* CONFIG_SHMEM */ +#endif /* CONFIG_USERFAULTFD */ #endif diff --git a/mm/shmem.c b/mm/shmem.c index 1155279a6276..9b66a2b2680c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2352,13 +2352,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode return inode; } -static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - bool zeropage, - struct page **pagep) +#ifdef CONFIG_USERFAULTFD +int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + unsigned 
long src_addr, + bool zeropage, + struct page **pagep) { struct inode *inode = file_inode(dst_vma->vm_file); struct shmem_inode_info *info = SHMEM_I(inode); @@ -2370,7 +2371,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, struct page *page; pte_t _dst_pte, *dst_pte; int ret; - pgoff_t offset, max_off; + pgoff_t max_off; ret = -ENOMEM; if (!shmem_inode_acct_block(inode, 1)) { @@ -2391,7 +2392,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, if (!page) goto out_unacct_blocks; - if (!zeropage) { /* mcopy_atomic */ + if (!zeropage) { /* COPY */ page_kaddr = kmap_atomic(page); ret = copy_from_user(page_kaddr, (const void __user *)src_addr, @@ -2405,7 +2406,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, /* don't free the page */ return -ENOENT; } - } else { /* mfill_zeropage_atomic */ + } else { /* ZEROPAGE */ clear_highpage(page); } } else { @@ -2413,15 +2414,15 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, *pagep = NULL; } - VM_BUG_ON(PageLocked(page) || PageSwapBacked(page)); + VM_BUG_ON(PageLocked(page)); + VM_BUG_ON(PageSwapBacked(page)); __SetPageLocked(page); __SetPageSwapBacked(page); __SetPageUptodate(page); ret = -EFAULT; - offset = linear_page_index(dst_vma, dst_addr); max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(offset >= max_off)) + if (unlikely(pgoff >= max_off)) goto out_release; ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, @@ -2447,7 +2448,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, ret = -EFAULT; max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(offset >= max_off)) + if (unlikely(pgoff >= max_off)) goto out_release_unlock; ret = -EEXIST; @@ -2484,28 +2485,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, shmem_inode_unacct_blocks(inode, 1); goto out; } - -int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - struct page **pagep) -{ - return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, false, pagep); -} - -int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, - pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) -{ - struct page *page = NULL; - - return shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, 0, true, &page); -} +#endif /* CONFIG_USERFAULTFD */ #ifdef CONFIG_TMPFS static const struct inode_operations shmem_symlink_inode_operations; diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index da5535d2d990..57b228b9b6a4 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -392,13 +392,9 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, dst_vma, dst_addr); } else { VM_WARN_ON_ONCE(wp_copy); - if (!zeropage) - err = shmem_mcopy_atomic_pte(dst_mm, dst_pmd, - dst_vma, dst_addr, - src_addr, page); - else - err = shmem_mfill_zeropage_pte(dst_mm, dst_pmd, - dst_vma, dst_addr); + err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, + dst_addr, src_addr, zeropage, + page); } return err; From c949b097ef2e332fa90708127c972b823fb58ec1 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:20 -0700 Subject: [PATCH 041/190] userfaultfd/shmem: support minor fault registration for shmem This patch allows shmem-backed VMAs to be registered for minor faults. Minor faults are appropriately relayed to userspace in the fault path, for VMAs with the relevant flag. 
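As a rough userspace sketch of the registration path this enables (not code from this series; the feature bit is only advertised by a later patch, so a caller has to check what UFFDIO_API reports back, and the function name here is illustrative):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Register a shmem-backed mapping [addr, addr + len) for minor faults. */
static int register_minor(void *addr, size_t len)
{
	struct uffdio_api api = { .api = UFFD_API,
				  .features = UFFD_FEATURE_MINOR_SHMEM };
	struct uffdio_register reg;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return -1;

	/* The kernel reports the features it actually supports. */
	if (ioctl(uffd, UFFDIO_API, &api) ||
	    !(api.features & UFFD_FEATURE_MINOR_SHMEM)) {
		close(uffd);
		return -1;
	}

	reg.range.start = (unsigned long)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MINOR;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		close(uffd);
		return -1;
	}

	return uffd;	/* read struct uffd_msg events from this fd */
}

Fault events then arrive as struct uffd_msg reads on the returned fd, with UFFD_PAGEFAULT_FLAG_MINOR set in msg.arg.pagefault.flags.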
This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed minor faults, though, so userspace doesn't yet have a way to resolve such faults. Because of this, we also don't yet advertise this as a supported feature. That will be done in a separate commit when the feature is fully implemented. Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: Peter Xu Acked-by: Hugh Dickins Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 3 +-- mm/memory.c | 8 +++++--- mm/shmem.c | 12 +++++++++++- 3 files changed, 17 insertions(+), 6 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 5dd78238cc15..82ef253d66b6 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1267,8 +1267,7 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma, } if (vm_flags & VM_UFFD_MINOR) { - /* FIXME: Add minor fault interception for shmem. */ - if (!is_vm_hugetlb_page(vma)) + if (!(is_vm_hugetlb_page(vma) || vma_is_shmem(vma))) return false; } diff --git a/mm/memory.c b/mm/memory.c index 07a71a016c18..b535a1d0317d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4026,9 +4026,11 @@ static vm_fault_t do_read_fault(struct vm_fault *vmf) * something). */ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) { - ret = do_fault_around(vmf); - if (ret) - return ret; + if (likely(!userfaultfd_minor(vmf->vma))) { + ret = do_fault_around(vmf); + if (ret) + return ret; + } } ret = __do_fault(vmf); diff --git a/mm/shmem.c b/mm/shmem.c index 9b66a2b2680c..0d186b91a2f9 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1797,7 +1797,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index, * vm. If we swap it in we mark it dirty since we also free the swap * entry since a page cannot live in both the swap and page cache. * - * vmf and fault_type are only supplied by shmem_fault: + * vma, vmf, and fault_type are only supplied by shmem_fault: * otherwise they are NULL. */ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, @@ -1832,6 +1832,16 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, page = pagecache_get_page(mapping, index, FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0); + + if (page && vma && userfaultfd_minor(vma)) { + if (!xa_is_value(page)) { + unlock_page(page); + put_page(page); + } + *fault_type = handle_userfault(vmf, VM_UFFD_MINOR); + return 0; + } + if (xa_is_value(page)) { error = shmem_swapin_page(inode, index, &page, sgp, gfp, vma, fault_type); From 153132571f0204dc5844faf6b0f8096c6c29d277 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:24 -0700 Subject: [PATCH 042/190] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem With this change, userspace can resolve a minor fault within a shmem-backed area with a UFFDIO_CONTINUE ioctl. The semantics for this match those for hugetlbfs - we look up the existing page in the page cache, and install a PTE for it. This commit introduces a new helper: mfill_atomic_install_pte. Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in shmem.c? The existing userfault implementation only relies on shmem.c for VM_SHARED VMAs. 
However, minor fault handling / CONTINUE work just fine for !VM_SHARED VMAs as well. We'd prefer to handle CONTINUE for shmem in one place, regardless of shared/private (to reduce code duplication). Why add a new mfill_atomic_install_pte helper? A problem we have with continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are *close* to what we want, but not exactly. We do want to setup the PTEs in a CONTINUE operation, but we don't want to e.g. allocate a new page, charge it (e.g. to the shmem inode), manipulate various flags, etc. Also we have the problem stated above: shmem_mfill_atomic_pte() and mcopy_atomic_pte() both handle one-half of the problem (shared / private) continue cares about. So, introduce mcontinue_atomic_pte(), to handle all of the shmem continue cases. Introduce the helper so it doesn't duplicate code with mcopy_atomic_pte(). In a future commit, shmem_mfill_atomic_pte() will also be modified to use this new helper. However, since this is a bigger refactor, it seems most clear to do it as a separate change. Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: Hugh Dickins Acked-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/userfaultfd.c | 172 ++++++++++++++++++++++++++++++++++------------- 1 file changed, 127 insertions(+), 45 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 57b228b9b6a4..218ec2af615f 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -48,6 +48,83 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, return dst_vma; } +/* + * Install PTEs, to map dst_addr (within dst_vma) to page. + * + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed), + * whether or not dst_vma is VM_SHARED. It also handles the more general + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file + * backed, or not). + * + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by + * shmem_mcopy_atomic_pte instead. 
+ */ +static int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy) +{ + int ret; + pte_t _dst_pte, *dst_pte; + bool writable = dst_vma->vm_flags & VM_WRITE; + bool vm_shared = dst_vma->vm_flags & VM_SHARED; + bool page_in_cache = page->mapping; + spinlock_t *ptl; + struct inode *inode; + pgoff_t offset, max_off; + + _dst_pte = mk_pte(page, dst_vma->vm_page_prot); + if (page_in_cache && !vm_shared) + writable = false; + if (writable || !page_in_cache) + _dst_pte = pte_mkdirty(_dst_pte); + if (writable) { + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else + _dst_pte = pte_mkwrite(_dst_pte); + } + + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); + + if (vma_is_shmem(dst_vma)) { + /* serialize against truncate with the page table lock */ + inode = dst_vma->vm_file->f_inode; + offset = linear_page_index(dst_vma, dst_addr); + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); + ret = -EFAULT; + if (unlikely(offset >= max_off)) + goto out_unlock; + } + + ret = -EEXIST; + if (!pte_none(*dst_pte)) + goto out_unlock; + + if (page_in_cache) + page_add_file_rmap(page, false); + else + page_add_new_anon_rmap(page, dst_vma, dst_addr, false); + + /* + * Must happen after rmap, as mm_counter() checks mapping (via + * PageAnon()), which is set by __page_set_anon_rmap(). + */ + inc_mm_counter(dst_mm, mm_counter(page)); + + if (newly_allocated) + lru_cache_add_inactive_or_unevictable(page, dst_vma); + + set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); + + /* No need to invalidate - it was non-present before */ + update_mmu_cache(dst_vma, dst_addr, dst_pte); + ret = 0; +out_unlock: + pte_unmap_unlock(dst_pte, ptl); + return ret; +} + static int mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, struct vm_area_struct *dst_vma, @@ -56,13 +133,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep, bool wp_copy) { - pte_t _dst_pte, *dst_pte; - spinlock_t *ptl; void *page_kaddr; int ret; struct page *page; - pgoff_t offset, max_off; - struct inode *inode; if (!*pagep) { ret = -ENOMEM; @@ -99,43 +172,12 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL)) goto out_release; - _dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot)); - if (dst_vma->vm_flags & VM_WRITE) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } - - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); - if (dst_vma->vm_file) { - /* the shmem MAP_PRIVATE case requires checking the i_size */ - inode = dst_vma->vm_file->f_inode; - offset = linear_page_index(dst_vma, dst_addr); - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - ret = -EFAULT; - if (unlikely(offset >= max_off)) - goto out_release_uncharge_unlock; - } - ret = -EEXIST; - if (!pte_none(*dst_pte)) - goto out_release_uncharge_unlock; - - inc_mm_counter(dst_mm, MM_ANONPAGES); - page_add_new_anon_rmap(page, dst_vma, dst_addr, false); - lru_cache_add_inactive_or_unevictable(page, dst_vma); - - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(dst_vma, dst_addr, dst_pte); - - pte_unmap_unlock(dst_pte, ptl); - ret = 0; + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + page, true, wp_copy); + if (ret) + goto out_release; out: return ret; -out_release_uncharge_unlock: - 
pte_unmap_unlock(dst_pte, ptl); out_release: put_page(page); goto out; @@ -176,6 +218,41 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm, return ret; } +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ +static int mcontinue_atomic_pte(struct mm_struct *dst_mm, + pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, + bool wp_copy) +{ + struct inode *inode = file_inode(dst_vma->vm_file); + pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); + struct page *page; + int ret; + + ret = shmem_getpage(inode, pgoff, &page, SGP_READ); + if (ret) + goto out; + if (!page) { + ret = -EFAULT; + goto out; + } + + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + page, false, wp_copy); + if (ret) + goto out_release; + + unlock_page(page); + ret = 0; +out: + return ret; +out_release: + unlock_page(page); + put_page(page); + goto out; +} + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -367,11 +444,16 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, unsigned long dst_addr, unsigned long src_addr, struct page **page, - bool zeropage, + enum mcopy_atomic_mode mode, bool wp_copy) { ssize_t err; + if (mode == MCOPY_ATOMIC_CONTINUE) { + return mcontinue_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + wp_copy); + } + /* * The normal page fault path for a shmem will invoke the * fault, fill the hole in the file and COW it right away. The @@ -383,7 +465,7 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, * and not in the radix tree. */ if (!(dst_vma->vm_flags & VM_SHARED)) { - if (!zeropage) + if (mode == MCOPY_ATOMIC_NORMAL) err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, src_addr, page, wp_copy); @@ -393,7 +475,8 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, } else { VM_WARN_ON_ONCE(wp_copy); err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, - dst_addr, src_addr, zeropage, + dst_addr, src_addr, + mode != MCOPY_ATOMIC_NORMAL, page); } @@ -415,7 +498,6 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, long copied; struct page *page; bool wp_copy; - bool zeropage = (mcopy_mode == MCOPY_ATOMIC_ZEROPAGE); /* * Sanitize the command parameters: @@ -478,7 +560,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) goto out_unlock; - if (mcopy_mode == MCOPY_ATOMIC_CONTINUE) + if (!vma_is_shmem(dst_vma) && mcopy_mode == MCOPY_ATOMIC_CONTINUE) goto out_unlock; /* @@ -526,7 +608,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm, BUG_ON(pmd_trans_huge(*dst_pmd)); err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr, - src_addr, &page, zeropage, wp_copy); + src_addr, &page, mcopy_mode, wp_copy); cond_resched(); if (unlikely(err == -ENOENT)) { From 964ab0040ff9598783bf37776b5e31b27b50e293 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:27 -0700 Subject: [PATCH 043/190] userfaultfd/shmem: advertise shmem minor fault support Now that the feature is fully implemented (the faulting path hooks exist so userspace is notified, and the ioctl to resolve such faults is available), advertise this as a supported feature. Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Acked-by: Hugh Dickins Acked-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . 
David Alan Gilbert" Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/admin-guide/mm/userfaultfd.rst | 3 ++- fs/userfaultfd.c | 3 ++- include/uapi/linux/userfaultfd.h | 7 ++++++- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 3aa38e8b8361..6528036093e1 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -77,7 +77,8 @@ events, except page fault notifications, may be generated: - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory - areas. + areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating + support for shmem virtual memory areas. The userland application should set the feature flags it intends to use when invoking the ``UFFDIO_API`` ioctl, to request that those features be diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 82ef253d66b6..19ebae443ade 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -1944,7 +1944,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx, /* report all available features and ioctls to userland */ uffdio_api.features = UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR - uffdio_api.features &= ~UFFD_FEATURE_MINOR_HUGETLBFS; + uffdio_api.features &= + ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); #endif #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP uffdio_api.features &= ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h index 650480f41f1d..05b31d60acf6 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -31,7 +31,8 @@ UFFD_FEATURE_MISSING_SHMEM | \ UFFD_FEATURE_SIGBUS | \ UFFD_FEATURE_THREAD_ID | \ - UFFD_FEATURE_MINOR_HUGETLBFS) + UFFD_FEATURE_MINOR_HUGETLBFS | \ + UFFD_FEATURE_MINOR_SHMEM) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -185,6 +186,9 @@ struct uffdio_api { * UFFD_FEATURE_MINOR_HUGETLBFS indicates that minor faults * can be intercepted (via REGISTER_MODE_MINOR) for * hugetlbfs-backed pages. + * + * UFFD_FEATURE_MINOR_SHMEM indicates the same support as + * UFFD_FEATURE_MINOR_HUGETLBFS, but for shmem-backed pages instead. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -196,6 +200,7 @@ struct uffdio_api { #define UFFD_FEATURE_SIGBUS (1<<7) #define UFFD_FEATURE_THREAD_ID (1<<8) #define UFFD_FEATURE_MINOR_HUGETLBFS (1<<9) +#define UFFD_FEATURE_MINOR_SHMEM (1<<10) __u64 features; __u64 ioctls; From 7d64ae3ab648a967b7ba5cc3e89281d76742c34e Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:31 -0700 Subject: [PATCH 044/190] userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() In a previous commit, we added the mfill_atomic_install_pte() helper. This helper does the job of setting up PTEs for an existing page, to map it into a given VMA. It deals with both the anon and shmem cases, as well as the shared and private cases. In other words, shmem_mfill_atomic_pte() duplicates a case it already handles. 
So, expose it, and let shmem_mfill_atomic_pte() use it directly, to reduce code duplication. This requires that we refactor shmem_mfill_atomic_pte() a bit: Instead of doing accounting (shmem_recalc_inode() et al) part-way through the PTE setup, do it afterward. This frees up mfill_atomic_install_pte() from having to care about this accounting, and means we don't need to e.g. shmem_uncharge() in the error path. A side effect is this switches shmem_mfill_atomic_pte() to use lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add(). This wrapper does some extra accounting in an exceptional case, if appropriate, so it's actually the more correct thing to use. Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Reviewed-by: Peter Xu Acked-by: Hugh Dickins Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/userfaultfd_k.h | 5 +++ mm/shmem.c | 58 ++++++++--------------------------- mm/userfaultfd.c | 17 ++++------ 3 files changed, 23 insertions(+), 57 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 794d1538b8ba..331d2ccf0bcc 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -53,6 +53,11 @@ enum mcopy_atomic_mode { MCOPY_ATOMIC_CONTINUE, }; +extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy); + extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start, unsigned long src_start, unsigned long len, bool *mmap_changing, __u64 mode); diff --git a/mm/shmem.c b/mm/shmem.c index 0d186b91a2f9..bd2ffd861ddb 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2376,14 +2376,11 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, struct address_space *mapping = inode->i_mapping; gfp_t gfp = mapping_gfp_mask(mapping); pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); - spinlock_t *ptl; void *page_kaddr; struct page *page; - pte_t _dst_pte, *dst_pte; int ret; pgoff_t max_off; - ret = -ENOMEM; if (!shmem_inode_acct_block(inode, 1)) { /* * We may have got a page, returned -ENOENT triggering a retry, @@ -2394,10 +2391,11 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, put_page(*pagep); *pagep = NULL; } - goto out; + return -ENOMEM; } if (!*pagep) { + ret = -ENOMEM; page = shmem_alloc_page(gfp, info, pgoff); if (!page) goto out_unacct_blocks; @@ -2412,9 +2410,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, /* fallback to copy_from_user outside mmap_lock */ if (unlikely(ret)) { *pagep = page; - shmem_inode_unacct_blocks(inode, 1); + ret = -ENOENT; /* don't free the page */ - return -ENOENT; + goto out_unacct_blocks; } } else { /* ZEROPAGE */ clear_highpage(page); @@ -2440,32 +2438,10 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, if (ret) goto out_release; - _dst_pte = mk_pte(page, dst_vma->vm_page_prot); - if (dst_vma->vm_flags & VM_WRITE) - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); - else { - /* - * We don't set the pte dirty if the vma has no - * VM_WRITE permission, so mark the page dirty or it - * could be freed from under us. 
We could do it - * unconditionally before unlock_page(), but doing it - * only if VM_WRITE is not set is faster. - */ - set_page_dirty(page); - } - - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); - - ret = -EFAULT; - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(pgoff >= max_off)) - goto out_release_unlock; - - ret = -EEXIST; - if (!pte_none(*dst_pte)) - goto out_release_unlock; - - lru_cache_add(page); + ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr, + page, true, false); + if (ret) + goto out_delete_from_cache; spin_lock_irq(&info->lock); info->alloced++; @@ -2473,27 +2449,17 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, shmem_recalc_inode(inode); spin_unlock_irq(&info->lock); - inc_mm_counter(dst_mm, mm_counter_file(page)); - page_add_file_rmap(page, false); - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); - - /* No need to invalidate - it was non-present before */ - update_mmu_cache(dst_vma, dst_addr, dst_pte); - pte_unmap_unlock(dst_pte, ptl); + SetPageDirty(page); unlock_page(page); - ret = 0; -out: - return ret; -out_release_unlock: - pte_unmap_unlock(dst_pte, ptl); - ClearPageDirty(page); + return 0; +out_delete_from_cache: delete_from_page_cache(page); out_release: unlock_page(page); put_page(page); out_unacct_blocks: shmem_inode_unacct_blocks(inode, 1); - goto out; + return ret; } #endif /* CONFIG_USERFAULTFD */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 218ec2af615f..0e2132834bc7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -51,18 +51,13 @@ struct vm_area_struct *find_dst_vma(struct mm_struct *dst_mm, /* * Install PTEs, to map dst_addr (within dst_vma) to page. * - * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed), - * whether or not dst_vma is VM_SHARED. It also handles the more general - * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file - * backed, or not). - * - * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by - * shmem_mcopy_atomic_pte instead. + * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both shmem + * and anon, and for both shared and private VMAs. */ -static int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, bool wp_copy) +int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + bool newly_allocated, bool wp_copy) { int ret; pte_t _dst_pte, *dst_pte; From fa2c2b58189b28ee7bd830b4cb71abfe5060fff2 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:34 -0700 Subject: [PATCH 045/190] userfaultfd/selftests: use memfd_create for shmem test type This is a preparatory commit. In the future, we want to be able to setup alias mappings for area_src and area_dst in the shmem test, like we do in the hugetlb_shared test. With a VMA obtained via mmap(MAP_ANONYMOUS | MAP_SHARED), it isn't clear how to do this. So, mmap() with an fd, so we can create alias mappings. Use memfd_create instead of actually passing in a tmpfs path like hugetlb does, since it's more convenient / simpler to run, and works just as well. Future commits will: 1. Setup the alias mappings. 2. Extend our tests to actually take advantage of this, to test new userfaultfd behavior being introduced in this series. 
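To make the motivation concrete, here is a minimal sketch of the fd-based setup (not the selftest code itself; the helper name and sizes are simplified, and memfd_create() needs _GNU_SOURCE with a recent glibc or a raw syscall otherwise):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same memfd range twice: writes through "alias" become visible
 * through "area" without faulting "area" itself.
 */
static int make_aliased_area(size_t len, char **area, char **alias)
{
	int fd = memfd_create("uffd-shmem-area", 0);

	if (fd < 0 || ftruncate(fd, len))
		return -1;

	*area  = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	*alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (*area == MAP_FAILED || *alias == MAP_FAILED)
		return -1;

	return fd;
}

Two MAP_SHARED mappings of the same memfd offset reference the same page cache pages, which is exactly what the alias-mapping changes below rely on and what MAP_ANONYMOUS | MAP_SHARED cannot easily provide.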
Also, a small fix in the area we're changing: when the hugetlb setup fails in main(), pass in the right argv[] so we actually print out the hugetlb file path. Link: https://lkml.kernel.org/r/20210503180737.2487560-8-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Reviewed-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 93eae095b61e..7495834180ce 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -85,6 +85,7 @@ static bool test_uffdio_wp = false; static bool test_uffdio_minor = false; static bool map_shared; +static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; @@ -277,8 +278,11 @@ static void shmem_release_pages(char *rel_area) static void shmem_allocate_area(void **alloc_area) { + unsigned long offset = + alloc_area == (void **)&area_src ? 0 : nr_pages * page_size; + *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, - MAP_ANONYMOUS | MAP_SHARED, -1, 0); + MAP_SHARED, shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed"); } @@ -1602,6 +1606,16 @@ int main(int argc, char **argv) err("Open of %s failed", argv[4]); if (ftruncate(huge_fd, 0)) err("ftruncate %s to size 0 failed", argv[4]); + } else if (test_type == TEST_SHMEM) { + shm_fd = memfd_create(argv[0], 0); + if (shm_fd < 0) + err("memfd_create"); + if (ftruncate(shm_fd, nr_pages * page_size * 2)) + err("ftruncate"); + if (fallocate(shm_fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, + nr_pages * page_size * 2)) + err("fallocate"); } printf("nr_pages: %lu, nr_pages_per_cpu: %lu\n", nr_pages, nr_pages_per_cpu); From 5bb23edb18373b20ff740e56d7c97ea60fb51491 Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:38 -0700 Subject: [PATCH 046/190] userfaultfd/selftests: create alias mappings in the shmem test Previously, we just allocated two shm areas: area_src and area_dst. With this commit, change this so we also allocate area_src_alias, and area_dst_alias. area_*_alias and area_* (respectively) point to the same underlying physical pages, but are different VMAs. In a future commit in this series, we'll leverage this setup to exercise minor fault handling support for shmem, just like we do in the hugetlb_shared test. Link: https://lkml.kernel.org/r/20210503180737.2487560-9-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Reviewed-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 22 +++++++++++++++++++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 7495834180ce..1f8e2bee8f8d 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -278,13 +278,29 @@ static void shmem_release_pages(char *rel_area) static void shmem_allocate_area(void **alloc_area) { - unsigned long offset = - alloc_area == (void **)&area_src ? 0 : nr_pages * page_size; + void *area_alias = NULL; + bool is_src = alloc_area == (void **)&area_src; + unsigned long offset = is_src ? 0 : nr_pages * page_size; *alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, MAP_SHARED, shm_fd, offset); if (*alloc_area == MAP_FAILED) err("mmap of memfd failed"); + + area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE, + MAP_SHARED, shm_fd, offset); + if (area_alias == MAP_FAILED) + err("mmap of memfd alias failed"); + + if (is_src) + area_src_alias = area_alias; + else + area_dst_alias = area_alias; +} + +static void shmem_alias_mapping(__u64 *start, size_t len, unsigned long offset) +{ + *start = (unsigned long)area_dst_alias + offset; } struct uffd_test_ops { @@ -314,7 +330,7 @@ static struct uffd_test_ops shmem_uffd_test_ops = { .expected_ioctls = SHMEM_EXPECTED_IOCTLS, .allocate_area = shmem_allocate_area, .release_pages = shmem_release_pages, - .alias_mapping = noop_alias_mapping, + .alias_mapping = shmem_alias_mapping, }; static struct uffd_test_ops hugetlb_uffd_test_ops = { From 8ba6e8640844213e27c22f5eae915710f7b7998d Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:41 -0700 Subject: [PATCH 047/190] userfaultfd/selftests: reinitialize test context in each test Currently, the context (fds, mmap-ed areas, etc.) are global. Each test mutates this state in some way, in some cases really "clobbering it" (e.g., the events test mremap-ing area_dst over the top of area_src, or the minor faults tests overwriting the count_verify values in the test areas). We run the tests in a particular order, each test is careful to make the right assumptions about its starting state, etc. But, this is fragile. It's better for a test's success or failure to not depend on what some other prior test case did to the global state. To that end, clear and reinitialize the test context at the start of each test case, so whatever prior test cases did doesn't affect future tests. This is particularly relevant to this series because the events test's mremap of area_dst screws up assumptions the minor fault test was relying on. This wasn't a problem for hugetlb, as we don't mremap in that case. [peterx@redhat.com: fix conflict between this patch and the uffd pagemap series] Link: https://lkml.kernel.org/r/YKQqKrl+/cQ1utrb@t490s Link: https://lkml.kernel.org/r/20210503180737.2487560-10-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Signed-off-by: Peter Xu Reviewed-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. 
Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 222 ++++++++++++----------- 1 file changed, 117 insertions(+), 105 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 1f8e2bee8f8d..f78816130c7f 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -89,7 +89,8 @@ static int shm_fd; static int huge_fd; static char *huge_fd_off0; static unsigned long long *count_verify; -static int uffd, uffd_flags, finished, *pipefd; +static int uffd = -1; +static int uffd_flags, finished, *pipefd; static char *area_src, *area_src_alias, *area_dst, *area_dst_alias; static char *zeropage; pthread_attr_t attr; @@ -342,6 +343,111 @@ static struct uffd_test_ops hugetlb_uffd_test_ops = { static struct uffd_test_ops *uffd_test_ops; +static void userfaultfd_open(uint64_t *features) +{ + struct uffdio_api uffdio_api; + + uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); + if (uffd < 0) + err("userfaultfd syscall not available in this kernel"); + uffd_flags = fcntl(uffd, F_GETFD, NULL); + + uffdio_api.api = UFFD_API; + uffdio_api.features = *features; + if (ioctl(uffd, UFFDIO_API, &uffdio_api)) + err("UFFDIO_API failed.\nPlease make sure to " + "run with either root or ptrace capability."); + if (uffdio_api.api != UFFD_API) + err("UFFDIO_API error: %" PRIu64, (uint64_t)uffdio_api.api); + + *features = uffdio_api.features; +} + +static inline void munmap_area(void **area) +{ + if (*area) + if (munmap(*area, nr_pages * page_size)) + err("munmap"); + + *area = NULL; +} + +static void uffd_test_ctx_clear(void) +{ + size_t i; + + if (pipefd) { + for (i = 0; i < nr_cpus * 2; ++i) { + if (close(pipefd[i])) + err("close pipefd"); + } + free(pipefd); + pipefd = NULL; + } + + if (count_verify) { + free(count_verify); + count_verify = NULL; + } + + if (uffd != -1) { + if (close(uffd)) + err("close uffd"); + uffd = -1; + } + + huge_fd_off0 = NULL; + munmap_area((void **)&area_src); + munmap_area((void **)&area_src_alias); + munmap_area((void **)&area_dst); + munmap_area((void **)&area_dst_alias); +} + +static void uffd_test_ctx_init_ext(uint64_t *features) +{ + unsigned long nr, cpu; + + uffd_test_ctx_clear(); + + uffd_test_ops->allocate_area((void **)&area_src); + uffd_test_ops->allocate_area((void **)&area_dst); + + uffd_test_ops->release_pages(area_src); + uffd_test_ops->release_pages(area_dst); + + userfaultfd_open(features); + + count_verify = malloc(nr_pages * sizeof(unsigned long long)); + if (!count_verify) + err("count_verify"); + + for (nr = 0; nr < nr_pages; nr++) { + *area_mutex(area_src, nr) = + (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER; + count_verify[nr] = *area_count(area_src, nr) = 1; + /* + * In the transition between 255 to 256, powerpc will + * read out of order in my_bcmp and see both bytes as + * zero, so leave a placeholder below always non-zero + * after the count, to avoid my_bcmp to trigger false + * positives. 
+ */ + *(area_count(area_src, nr) + 1) = 1; + } + + pipefd = malloc(sizeof(int) * nr_cpus * 2); + if (!pipefd) + err("pipefd"); + for (cpu = 0; cpu < nr_cpus; cpu++) + if (pipe2(&pipefd[cpu * 2], O_CLOEXEC | O_NONBLOCK)) + err("pipe"); +} + +static inline void uffd_test_ctx_init(uint64_t features) +{ + uffd_test_ctx_init_ext(&features); +} + static int my_bcmp(char *str1, char *str2, size_t n) { unsigned long i; @@ -726,40 +832,6 @@ static int stress(struct uffd_stats *uffd_stats) return 0; } -static int userfaultfd_open_ext(uint64_t *features) -{ - struct uffdio_api uffdio_api; - - uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); - if (uffd < 0) { - fprintf(stderr, - "userfaultfd syscall not available in this kernel\n"); - return 1; - } - uffd_flags = fcntl(uffd, F_GETFD, NULL); - - uffdio_api.api = UFFD_API; - uffdio_api.features = *features; - if (ioctl(uffd, UFFDIO_API, &uffdio_api)) { - fprintf(stderr, "UFFDIO_API failed.\nPlease make sure to " - "run with either root or ptrace capability.\n"); - return 1; - } - if (uffdio_api.api != UFFD_API) { - fprintf(stderr, "UFFDIO_API error: %" PRIu64 "\n", - (uint64_t)uffdio_api.api); - return 1; - } - - *features = uffdio_api.features; - return 0; -} - -static int userfaultfd_open(uint64_t features) -{ - return userfaultfd_open_ext(&features); -} - sigjmp_buf jbuf, *sigbuf; static void sighndl(int sig, siginfo_t *siginfo, void *ptr) @@ -868,6 +940,8 @@ static int faulting_process(int signal_test) MREMAP_MAYMOVE | MREMAP_FIXED, area_src); if (area_dst == MAP_FAILED) err("mremap"); + /* Reset area_src since we just clobbered it */ + area_src = NULL; for (; nr < nr_pages; nr++) { count = *area_count(area_dst, nr); @@ -961,10 +1035,8 @@ static int userfaultfd_zeropage_test(void) printf("testing UFFDIO_ZEROPAGE: "); fflush(stdout); - uffd_test_ops->release_pages(area_dst); + uffd_test_ctx_init(0); - if (userfaultfd_open(0)) - return 1; uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING; @@ -981,7 +1053,6 @@ static int userfaultfd_zeropage_test(void) if (my_bcmp(area_dst, zeropage, page_size)) err("zeropage is not zero"); - close(uffd); printf("done.\n"); return 0; } @@ -999,12 +1070,10 @@ static int userfaultfd_events_test(void) printf("testing events (fork, remap, remove): "); fflush(stdout); - uffd_test_ops->release_pages(area_dst); - features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP | UFFD_FEATURE_EVENT_REMOVE; - if (userfaultfd_open(features)) - return 1; + uffd_test_ctx_init(features); + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); uffdio_register.range.start = (unsigned long) area_dst; @@ -1037,8 +1106,6 @@ static int userfaultfd_events_test(void) if (pthread_join(uffd_mon, NULL)) return 1; - close(uffd); - uffd_stats_report(&stats, 1); return stats.missing_faults != nr_pages; @@ -1058,11 +1125,9 @@ static int userfaultfd_sig_test(void) printf("testing signal delivery: "); fflush(stdout); - uffd_test_ops->release_pages(area_dst); - features = UFFD_FEATURE_EVENT_FORK|UFFD_FEATURE_SIGBUS; - if (userfaultfd_open(features)) - return 1; + uffd_test_ctx_init(features); + fcntl(uffd, F_SETFL, uffd_flags | O_NONBLOCK); uffdio_register.range.start = (unsigned long) area_dst; @@ -1103,7 +1168,6 @@ static int userfaultfd_sig_test(void) printf("done.\n"); if (userfaults) err("Signal test failed, userfaults: %ld", userfaults); - close(uffd); return userfaults != 0; } @@ -1126,10 +1190,7 @@ static int 
userfaultfd_minor_test(void) printf("testing minor faults: "); fflush(stdout); - uffd_test_ops->release_pages(area_dst); - - if (userfaultfd_open_ext(&features)) - return 1; + uffd_test_ctx_init_ext(&features); /* If kernel reports the feature isn't supported, skip the test. */ if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) { printf("skipping test due to lack of feature support\n"); @@ -1183,8 +1244,6 @@ static int userfaultfd_minor_test(void) if (pthread_join(uffd_mon, NULL)) return 1; - close(uffd); - uffd_stats_report(&stats, 1); return stats.missing_faults != 0 || stats.minor_faults != nr_pages; @@ -1267,7 +1326,7 @@ static void userfaultfd_pagemap_test(unsigned int test_pgsize) /* Flush so it doesn't flush twice in parent/child later */ fflush(stdout); - uffd_test_ops->release_pages(area_dst); + uffd_test_ctx_init(0); if (test_pgsize > page_size) { /* This is a thp test */ @@ -1279,9 +1338,6 @@ static void userfaultfd_pagemap_test(unsigned int test_pgsize) err("madvise(MADV_NOHUGEPAGE) failed"); } - if (userfaultfd_open(0)) - err("userfaultfd_open"); - uffdio_register.range.start = (unsigned long) area_dst; uffdio_register.range.len = nr_pages * page_size; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; @@ -1324,7 +1380,6 @@ static void userfaultfd_pagemap_test(unsigned int test_pgsize) pagemap_check_wp(value, false); close(pagemap_fd); - close(uffd); printf("done\n"); } @@ -1334,50 +1389,9 @@ static int userfaultfd_stress(void) char *tmp_area; unsigned long nr; struct uffdio_register uffdio_register; - unsigned long cpu; struct uffd_stats uffd_stats[nr_cpus]; - uffd_test_ops->allocate_area((void **)&area_src); - if (!area_src) - return 1; - uffd_test_ops->allocate_area((void **)&area_dst); - if (!area_dst) - return 1; - - if (userfaultfd_open(0)) - return 1; - - count_verify = malloc(nr_pages * sizeof(unsigned long long)); - if (!count_verify) { - perror("count_verify"); - return 1; - } - - for (nr = 0; nr < nr_pages; nr++) { - *area_mutex(area_src, nr) = (pthread_mutex_t) - PTHREAD_MUTEX_INITIALIZER; - count_verify[nr] = *area_count(area_src, nr) = 1; - /* - * In the transition between 255 to 256, powerpc will - * read out of order in my_bcmp and see both bytes as - * zero, so leave a placeholder below always non-zero - * after the count, to avoid my_bcmp to trigger false - * positives. - */ - *(area_count(area_src, nr) + 1) = 1; - } - - pipefd = malloc(sizeof(int) * nr_cpus * 2); - if (!pipefd) { - perror("pipefd"); - return 1; - } - for (cpu = 0; cpu < nr_cpus; cpu++) { - if (pipe2(&pipefd[cpu*2], O_CLOEXEC | O_NONBLOCK)) { - perror("pipe"); - return 1; - } - } + uffd_test_ctx_init(0); if (posix_memalign(&area, page_size, page_size)) err("out of memory"); @@ -1498,8 +1512,6 @@ static int userfaultfd_stress(void) uffd_stats_report(uffd_stats, nr_cpus); } - close(uffd); - if (test_type == TEST_ANON) { /* * shmem/hugetlb won't be able to run since they have different From 4a8f021ba0a220a95d4251ea3f199ef693f1249b Mon Sep 17 00:00:00 2001 From: Axel Rasmussen Date: Wed, 30 Jun 2021 18:49:45 -0700 Subject: [PATCH 048/190] userfaultfd/selftests: exercise minor fault handling shmem support Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the test slightly to pass in / check for the right feature flags. Link: https://lkml.kernel.org/r/20210503180737.2487560-11-axelrasmussen@google.com Signed-off-by: Axel Rasmussen Reviewed-by: Peter Xu Cc: Alexander Viro Cc: Andrea Arcangeli Cc: Brian Geffon Cc: "Dr . 
David Alan Gilbert" Cc: Hugh Dickins Cc: Jerome Glisse Cc: Joe Perches Cc: Kirill A. Shutemov Cc: Lokesh Gidra Cc: Mike Kravetz Cc: Mike Rapoport Cc: Mina Almasry Cc: Oliver Upton Cc: Shaohua Li Cc: Shuah Khan Cc: Stephen Rothwell Cc: Wang Qing Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/userfaultfd.c | 29 ++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index f78816130c7f..e363bdaff59d 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -474,6 +474,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp) static void continue_range(int ufd, __u64 start, __u64 len) { struct uffdio_continue req; + int ret; req.range.start = start; req.range.len = len; @@ -482,6 +483,17 @@ static void continue_range(int ufd, __u64 start, __u64 len) if (ioctl(ufd, UFFDIO_CONTINUE, &req)) err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, (uint64_t)start); + + /* + * Error handling within the kernel for continue is subtly different + * from copy or zeropage, so it may be a source of bugs. Trigger an + * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG. + */ + req.mapped = 0; + ret = ioctl(ufd, UFFDIO_CONTINUE, &req); + if (ret >= 0 || req.mapped != -EEXIST) + err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64, + ret, (int64_t) req.mapped); } static void *locking_thread(void *arg) @@ -1182,7 +1194,7 @@ static int userfaultfd_minor_test(void) void *expected_page; char c; struct uffd_stats stats = { 0 }; - uint64_t features = UFFD_FEATURE_MINOR_HUGETLBFS; + uint64_t req_features, features_out; if (!test_uffdio_minor) return 0; @@ -1190,9 +1202,17 @@ static int userfaultfd_minor_test(void) printf("testing minor faults: "); fflush(stdout); - uffd_test_ctx_init_ext(&features); - /* If kernel reports the feature isn't supported, skip the test. */ - if (!(features & UFFD_FEATURE_MINOR_HUGETLBFS)) { + if (test_type == TEST_HUGETLB) + req_features = UFFD_FEATURE_MINOR_HUGETLBFS; + else if (test_type == TEST_SHMEM) + req_features = UFFD_FEATURE_MINOR_SHMEM; + else + return 1; + + features_out = req_features; + uffd_test_ctx_init_ext(&features_out); + /* If kernel reports required features aren't supported, skip test. */ + if ((features_out & req_features) != req_features) { printf("skipping test due to lack of feature support\n"); fflush(stdout); return 0; @@ -1575,6 +1595,7 @@ static void set_test_type(const char *type) map_shared = true; test_type = TEST_SHMEM; uffd_test_ops = &shmem_uffd_test_ops; + test_uffdio_minor = true; } else { err("Unknown test type: %s", type); } From 2d2b8d2b67713da5de333a8849342503a9f21c60 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 30 Jun 2021 18:49:48 -0700 Subject: [PATCH 049/190] mm/vmscan.c: fix potential deadlock in reclaim_pages() Theoretically without the protect from memalloc_noreclaim_save() and memalloc_noreclaim_restore(), reclaim_pages() can go into the block I/O layer recursively and deadlock. Querying 'reclaim_pages' in our kernel crash databases didn't yield any results. So the deadlock seems unlikely to happen. A possible explanation is that the only user of reclaim_pages(), i.e., MADV_PAGEOUT, is usually called before memory pressure builds up, e.g., on Android and Chrome OS. 
Under such a condition, allocations in the block I/O layer can be fulfilled without diverting to direct reclaim and therefore the recursion is avoided. Link: https://lkml.kernel.org/r/20210622074642.785473-1-yuzhao@google.com Link: https://lkml.kernel.org/r/20210614194727.2684053-1-yuzhao@google.com Signed-off-by: Yu Zhao Cc: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index d7c3cb8688dd..7b52ab166aae 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1701,6 +1701,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, unsigned int nr_reclaimed; struct page *page, *next; LIST_HEAD(clean_pages); + unsigned int noreclaim_flag; list_for_each_entry_safe(page, next, page_list, lru) { if (!PageHuge(page) && page_is_file_lru(page) && @@ -1711,8 +1712,17 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, } } + /* + * We should be safe here since we are only dealing with file pages and + * we are not kswapd and therefore cannot write dirty file pages. But + * call memalloc_noreclaim_save() anyway, just in case these conditions + * change in the future. + */ + noreclaim_flag = memalloc_noreclaim_save(); nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc, &stat, true); + memalloc_noreclaim_restore(noreclaim_flag); + list_splice(&clean_pages, page_list); mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -(long)nr_reclaimed); @@ -2306,6 +2316,7 @@ unsigned long reclaim_pages(struct list_head *page_list) LIST_HEAD(node_page_list); struct reclaim_stat dummy_stat; struct page *page; + unsigned int noreclaim_flag; struct scan_control sc = { .gfp_mask = GFP_KERNEL, .priority = DEF_PRIORITY, @@ -2314,6 +2325,8 @@ unsigned long reclaim_pages(struct list_head *page_list) .may_swap = 1, }; + noreclaim_flag = memalloc_noreclaim_save(); + while (!list_empty(page_list)) { page = lru_to_page(page_list); if (nid == NUMA_NO_NODE) { @@ -2350,6 +2363,8 @@ unsigned long reclaim_pages(struct list_head *page_list) } } + memalloc_noreclaim_restore(noreclaim_flag); + return nr_reclaimed; } From 764c04a9cbe6f66334ed9a8a154e7d1b4b535da9 Mon Sep 17 00:00:00 2001 From: Yu Zhao Date: Wed, 30 Jun 2021 18:49:51 -0700 Subject: [PATCH 050/190] include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low mm_vmscan_inactive_list_is_low has no users after commit b91ac374346b ("mm: vmscan: enforce inactive:active ratio at the reclaim root"). Remove it. 
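As background for why "no users" is enough to justify the removal: a TRACE_EVENT() definition only emits anything if some .c file calls the generated trace_<name>() helper, and the commit cited above removed the last such call. Reconstructed purely for illustration (the local variable names are ours; the argument list follows the TP_PROTO removed below), a call site looked roughly like this:

	/* Illustrative only; the real caller was removed by b91ac374346b. */
	trace_mm_vmscan_inactive_list_is_low(pgdat->node_id, sc->reclaim_idx,
					     total_inactive, inactive,
					     total_active, active,
					     inactive_ratio, file);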
Link: https://lkml.kernel.org/r/20210614194554.2683395-1-yuzhao@google.com Signed-off-by: Yu Zhao Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/trace/events/vmscan.h | 41 ----------------------------------- 1 file changed, 41 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 00d1180527d8..88faf2400ec2 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -423,47 +423,6 @@ TRACE_EVENT(mm_vmscan_lru_shrink_active, show_reclaim_flags(__entry->reclaim_flags)) ); -TRACE_EVENT(mm_vmscan_inactive_list_is_low, - - TP_PROTO(int nid, int reclaim_idx, - unsigned long total_inactive, unsigned long inactive, - unsigned long total_active, unsigned long active, - unsigned long ratio, int file), - - TP_ARGS(nid, reclaim_idx, total_inactive, inactive, total_active, active, ratio, file), - - TP_STRUCT__entry( - __field(int, nid) - __field(int, reclaim_idx) - __field(unsigned long, total_inactive) - __field(unsigned long, inactive) - __field(unsigned long, total_active) - __field(unsigned long, active) - __field(unsigned long, ratio) - __field(int, reclaim_flags) - ), - - TP_fast_assign( - __entry->nid = nid; - __entry->reclaim_idx = reclaim_idx; - __entry->total_inactive = total_inactive; - __entry->inactive = inactive; - __entry->total_active = total_active; - __entry->active = active; - __entry->ratio = ratio; - __entry->reclaim_flags = trace_reclaim_flags(file) & - RECLAIM_WB_LRU; - ), - - TP_printk("nid=%d reclaim_idx=%d total_inactive=%ld inactive=%ld total_active=%ld active=%ld ratio=%ld flags=%s", - __entry->nid, - __entry->reclaim_idx, - __entry->total_inactive, __entry->inactive, - __entry->total_active, __entry->active, - __entry->ratio, - show_reclaim_flags(__entry->reclaim_flags)) -); - TRACE_EVENT(mm_vmscan_node_reclaim_begin, TP_PROTO(int nid, int order, gfp_t gfp_flags), From 3ebc57f40316049139ab9ca3d19e52449106ee9f Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:49:54 -0700 Subject: [PATCH 051/190] mm: workingset: define macro WORKINGSET_SHIFT The magic number 1 is used in several places in workingset.c. Define a macro WORKINGSET_SHIFT for it to improve code readability. Link: https://lkml.kernel.org/r/20210624122307.1759342-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/workingset.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/mm/workingset.c b/mm/workingset.c index 4f7a306ce75a..5ba3e42446fa 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -168,8 +168,10 @@ * refault distance will immediately activate the refaulting page. 
*/ +#define WORKINGSET_SHIFT 1 #define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ - 1 + NODES_SHIFT + MEM_CGROUP_ID_SHIFT) + WORKINGSET_SHIFT + NODES_SHIFT + \ + MEM_CGROUP_ID_SHIFT) #define EVICTION_MASK (~0UL >> EVICTION_SHIFT) /* @@ -189,7 +191,7 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction, eviction &= EVICTION_MASK; eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid; eviction = (eviction << NODES_SHIFT) | pgdat->node_id; - eviction = (eviction << 1) | workingset; + eviction = (eviction << WORKINGSET_SHIFT) | workingset; return xa_mk_value(eviction); } @@ -201,8 +203,8 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat, int memcgid, nid; bool workingset; - workingset = entry & 1; - entry >>= 1; + workingset = entry & ((1UL << WORKINGSET_SHIFT) - 1); + entry >>= WORKINGSET_SHIFT; nid = entry & ((1UL << NODES_SHIFT) - 1); entry >>= NODES_SHIFT; memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1); From 781eb2cdd26f3748be57da9bed98bbe5b0dd99fb Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Wed, 30 Jun 2021 18:49:57 -0700 Subject: [PATCH 052/190] mm/kconfig: move HOLES_IN_ZONE into mm commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and VIRTUAL_MEM_MAP") drop VIRTUAL_MEM_MAP, so there is no need HOLES_IN_ZONE on ia64. Also move HOLES_IN_ZONE into mm/Kconfig, select it if architecture needs this feature. Link: https://lkml.kernel.org/r/20210417075946.181402-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: Catalin Marinas [arm64] Cc: Will Deacon Cc: Thomas Bogendoerfer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/Kconfig | 4 +--- arch/ia64/Kconfig | 3 --- arch/mips/Kconfig | 3 --- mm/Kconfig | 3 +++ 4 files changed, 4 insertions(+), 9 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index d01a1545ab8f..937bdf0c3859 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -201,6 +201,7 @@ config ARM64 select HAVE_KPROBES select HAVE_KRETPROBES select HAVE_GENERIC_VDSO + select HOLES_IN_ZONE select IOMMU_DMA if IOMMU_SUPPORT select IRQ_DOMAIN select IRQ_FORCED_THREADING @@ -1052,9 +1053,6 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK def_bool y depends on NUMA -config HOLES_IN_ZONE - def_bool y - source "kernel/Kconfig.hz" config ARCH_SPARSEMEM_ENABLE diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index da22a35e6f03..f4ff6bd5fc13 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -308,9 +308,6 @@ config NODES_SHIFT MAX_NUMNODES will be 2^(This value). If in doubt, use the default. -config HOLES_IN_ZONE - bool - config HAVE_ARCH_NODEDATA_EXTENSION def_bool y depends on NUMA diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index 4704a16c2e44..df02e36a4a8d 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -1233,9 +1233,6 @@ config HAVE_PLAT_MEMCPY config ISA_DMA_API bool -config HOLES_IN_ZONE - bool - config SYS_SUPPORTS_RELOCATABLE bool help diff --git a/mm/Kconfig b/mm/Kconfig index ded98fb859ab..055639e52459 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -96,6 +96,9 @@ config HAVE_FAST_GUP depends on MMU bool +config HOLES_IN_ZONE + bool + # Don't discard allocated memory used to track "memory" and "reserved" memblocks # after early boot, so it can still be used to test for validity of memory. # Also, memblocks are updated with memory hot(un)plug. 
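For readers wondering what HOLES_IN_ZONE actually guards: architectures that cannot guarantee hole-free MAX_ORDER blocks select it so that PFN walkers re-validate every PFN they touch. The following is roughly how the option is consumed in mm code of this era (from include/linux/mmzone.h, quoted from memory and therefore only illustrative); when an architecture does not select the option, the check compiles away to a constant:

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)
#endif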
From 8d719afcb34434ebfa7911338d8c777eca8452b0 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 30 Jun 2021 18:50:00 -0700 Subject: [PATCH 053/190] docs: proc.rst: meminfo: briefly describe gaps in memory accounting Add a paragraph that explains that it may happen that the counters in /proc/meminfo do not add up to the overall memory usage. Link: https://lkml.kernel.org/r/20210421061127.1182723-1-rppt@kernel.org Signed-off-by: Mike Rapoport Acked-by: Michal Hocko Reviewed-by: Matthew Wilcox (Oracle) Acked-by: Vlastimil Babka Cc: Jonathan Corbet Cc: Alexey Dobriyan Cc: Eric Dumazet Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.rst | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 81bfe3c800cc..dd4e3e18c22b 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -933,8 +933,15 @@ meminfo ~~~~~~~ Provides information about distribution and utilization of memory. This -varies by architecture and compile options. The following is from a -16GB PIII, which has highmem enabled. You may not have all of these fields. +varies by architecture and compile options. Some of the counters reported +here overlap. The memory reported by the non overlapping counters may not +add up to the overall memory usage and the difference for some workloads +can be substantial. In many cases there are other means to find out +additional memory using subsystem specific interfaces, for instance +/proc/net/sockstat for TCP memory allocations. + +The following is from a 16GB PIII, which has highmem enabled. +You may not have all of these fields. :: From 3c36b419b111e28a657e6534aae07964a98a5ca9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:03 -0700 Subject: [PATCH 054/190] fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3. Looking for places where the kernel might unconditionally read PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore needs some more love to not touch some other pages we really don't want to read -- i.e., hwpoisoned ones. Examples for PageOffline() pages are pages inflated in a balloon, memory unplugged via virtio-mem, and partially-present sections in memory added by the Hyper-V balloon. When reading pages inflated in a balloon, we essentially produce unnecessary load in the hypervisor; holes in partially present sections in case of Hyper-V are not accessible and already were a problem for /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages. In the future, virtio-mem might disallow reading unplugged memory -- marked as PageOffline() -- in some environments, resulting in undefined behavior when accessed; therefore, I'm trying to identify and rework all these (corner) cases. With this series, there is really only access via /dev/mem, /proc/vmcore and kdb left after I ripped out /dev/kmem. kdb is an advanced corner-case use case -- we won't care for now if someone explicitly tries to do nasty things by reading from/writing to physical addresses we better not touch. /dev/mem is a use case we won't support for virtio-mem, at least for now, so we'll simply disallow mapping any virtio-mem memory via /dev/mem next. 
/proc/vmcore is really only a problem when dumping the old kernel via something that's not makedumpfile (read: basically never); however, we'll try sanitizing that as well in the second kernel in the future. Tested via kcore_dump: https://github.com/schlafwandler/kcore_dump This patch (of 6): Commit db779ef67ffe ("proc/kcore: Remove unused kclist_add_remap()") removed the last user of KCORE_REMAP. Commit 595dd46ebfc1 ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when dumping vsyscall user page") removed the last user of KCORE_OTHER. Let's drop both types. While at it, also drop vaddr in "struct kcore_list", used by KCORE_REMAP only. Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Alexey Dobriyan Cc: "Matthew Wilcox (Oracle)" Cc: Oscar Salvador Cc: Michal Hocko Cc: Roman Gushchin Cc: Alex Shi Cc: Steven Price Cc: Mike Kravetz Cc: Aili Yao Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Wei Liu Cc: Naoya Horiguchi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/kcore.c | 7 ++----- include/linux/kcore.h | 3 --- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 4d2e64e9016c..09f77d3c6e15 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -380,11 +380,8 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) phdr->p_type = PT_LOAD; phdr->p_flags = PF_R | PF_W | PF_X; phdr->p_offset = kc_vaddr_to_offset(m->addr) + data_offset; - if (m->type == KCORE_REMAP) - phdr->p_vaddr = (size_t)m->vaddr; - else - phdr->p_vaddr = (size_t)m->addr; - if (m->type == KCORE_RAM || m->type == KCORE_REMAP) + phdr->p_vaddr = (size_t)m->addr; + if (m->type == KCORE_RAM) phdr->p_paddr = __pa(m->addr); else if (m->type == KCORE_TEXT) phdr->p_paddr = __pa_symbol(m->addr); diff --git a/include/linux/kcore.h b/include/linux/kcore.h index da676cdbd727..86c0f1d18998 100644 --- a/include/linux/kcore.h +++ b/include/linux/kcore.h @@ -11,14 +11,11 @@ enum kcore_type { KCORE_RAM, KCORE_VMEMMAP, KCORE_USER, - KCORE_OTHER, - KCORE_REMAP, }; struct kcore_list { struct list_head list; unsigned long addr; - unsigned long vaddr; size_t size; int type; }; From 2711032c64a9c151a6469d53fdc7f9f4df7f6e45 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:07 -0700 Subject: [PATCH 055/190] fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM Let's restructure the code using a switch-case, checking pfn_is_ram() only when we are dealing with KCORE_RAM. Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport Cc: Aili Yao Cc: Alexey Dobriyan Cc: Alex Shi Cc: Haiyang Zhang Cc: Jason Wang Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: "Matthew Wilcox (Oracle)" Cc: "Michael S.
Tsirkin" Cc: Michal Hocko Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Oscar Salvador Cc: Roman Gushchin Cc: Stephen Hemminger Cc: Steven Price Cc: Wei Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/kcore.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 09f77d3c6e15..ed6fbb3bd50c 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -483,25 +483,36 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) goto out; } m = NULL; /* skip the list anchor */ - } else if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { - if (clear_user(buffer, tsz)) { - ret = -EFAULT; - goto out; - } - } else if (m->type == KCORE_VMALLOC) { + goto skip; + } + + switch (m->type) { + case KCORE_VMALLOC: vread(buf, (char *)start, tsz); /* we have to zero-fill user buffer even if no read */ if (copy_to_user(buffer, buf, tsz)) { ret = -EFAULT; goto out; } - } else if (m->type == KCORE_USER) { + break; + case KCORE_USER: /* User page is handled prior to normal kernel page: */ if (copy_to_user(buffer, (char *)start, tsz)) { ret = -EFAULT; goto out; } - } else { + break; + case KCORE_RAM: + if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { + if (clear_user(buffer, tsz)) { + ret = -EFAULT; + goto out; + } + break; + } + fallthrough; + case KCORE_VMEMMAP: + case KCORE_TEXT: if (kern_addr_valid(start)) { /* * Using bounce buffer to bypass the @@ -525,7 +536,15 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) goto out; } } + break; + default: + pr_warn_once("Unhandled KCORE type: %d\n", m->type); + if (clear_user(buffer, tsz)) { + ret = -EFAULT; + goto out; + } } +skip: buflen -= tsz; *fpos += tsz; buffer += tsz; From 0daa322b8ff94d8ee4081c2c6868a1aaf1309642 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:10 -0700 Subject: [PATCH 056/190] fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages Let's avoid reading: 1) Offline memory sections: the content of offline memory sections is stale as the memory is effectively unused by the kernel. On s390x with standby memory, offline memory sections (belonging to offline storage increments) are not accessible. With virtio-mem and the hyper-v balloon, we can have unavailable memory chunks that should not be accessed inside offline memory sections. Last but not least, offline memory sections might contain hwpoisoned pages which we can no longer identify because the memmap is stale. 2) PG_offline pages: logically offline pages that are documented as "The content of these pages is effectively stale. Such pages should not be touched (read/write/dump/save) except by their owner.". Examples include pages inflated in a balloon or unavailable memory ranges inside hotplugged memory sections with virtio-mem or the hyper-v balloon. 3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal. As documented: "Accessing is not safe since it may cause another machine check. Don't touch!" Introduce is_page_hwpoison(), adding a comment that it is inherently racy but the best we can really do. Reading /proc/kcore now performs similar checks as when reading /proc/vmcore for kdump via makedumpfile: problematic pages are excluded. It's also similar to the hibernation code; however, we don't skip hwpoisoned pages when processing pages in kernel/power/snapshot.c:saveable_page() yet.
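Condensed, the KCORE_RAM check this patch adds to read_kcore() (full diff below) boils down to the following sketch; the wrapper name is ours for illustration, while the calls are the ones used by the patch:

static bool pfn_readable_for_kcore(unsigned long pfn)
{
	struct page *page = pfn_to_online_page(pfn);

	/* offline section, logically offline page, hwpoison, or excluded range */
	if (!page || PageOffline(page) || is_page_hwpoison(page) ||
	    !pfn_is_ram(pfn))
		return false;
	return true;
}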
Note 1: we can race against memory offlining code, especially memory going offline and getting unplugged: however, we will properly tear down the identity mapping and handle faults gracefully when accessing this memory from kcore code. Note 2: we can race against drivers setting PageOffline() and turning memory inaccessible in the hypervisor. We'll handle this in a follow-up patch. Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport Reviewed-by: Oscar Salvador Cc: Aili Yao Cc: Alexey Dobriyan Cc: Alex Shi Cc: Haiyang Zhang Cc: Jason Wang Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: "Matthew Wilcox (Oracle)" Cc: "Michael S. Tsirkin" Cc: Michal Hocko Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Cc: Stephen Hemminger Cc: Steven Price Cc: Wei Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/kcore.c | 14 +++++++++++++- include/linux/page-flags.h | 12 ++++++++++++ 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index ed6fbb3bd50c..92ff1e4436cb 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -465,6 +465,9 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) m = NULL; while (buflen) { + struct page *page; + unsigned long pfn; + /* * If this is the first iteration or the address is not within * the previous entry, search for a matching entry. @@ -503,7 +506,16 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) } break; case KCORE_RAM: - if (!pfn_is_ram(__pa(start) >> PAGE_SHIFT)) { + pfn = __pa(start) >> PAGE_SHIFT; + page = pfn_to_online_page(pfn); + + /* + * Don't read offline sections, logically offline pages + * (e.g., inflated in a balloon), hwpoisoned pages, + * and explicitly excluded physical ranges. + */ + if (!page || PageOffline(page) || + is_page_hwpoison(page) || !pfn_is_ram(pfn)) { if (clear_user(buffer, tsz)) { ret = -EFAULT; goto out; diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d8e26243db25..613295588848 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -694,6 +694,18 @@ PAGEFLAG_FALSE(DoubleMap) TESTSCFLAG_FALSE(DoubleMap) #endif +/* + * Check if a page is currently marked HWPoisoned. Note that this check is + * best effort only and inherently racy: there is no way to synchronize with + * failing hardware. + */ +static inline bool is_page_hwpoison(struct page *page) +{ + if (PageHWPoison(page)) + return true; + return PageHuge(page) && PageHWPoison(compound_head(page)); +} + /* * For pages that are never mapped to userspace (and aren't PageSlab), * page_type may be used. Because it is initialised to -1, we invert the From 82840451936f0301781ece80322230fd8edfc648 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:14 -0700 Subject: [PATCH 057/190] mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() A driver might set a page logically offline -- PageOffline() -- and turn the page inaccessible in the hypervisor; after that, access to page content can be fatal. One example is virtio-mem; while unplugged memory -- marked as PageOffline() can currently be read in the hypervisor, this will no longer be the case in the future; for example, when having a virtio-mem device backed by huge pages in the hypervisor. 
Some special PFN walkers -- i.e., /proc/kcore -- read content of random pages after checking PageOffline(); however, these PFN walkers can race with drivers that set PageOffline(). Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing. page_offline_freeze()/page_offline_thaw() allows for a subsystem to synchronize with such drivers, achieving that a page cannot be set PageOffline() while frozen. page_offline_begin()/page_offline_end() is used by drivers that care about such races when setting a page PageOffline(). For simplicity, use a rwsem for now; neither drivers nor users are performance sensitive. Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Michal Hocko Reviewed-by: Mike Rapoport Reviewed-by: Oscar Salvador Cc: Aili Yao Cc: Alexey Dobriyan Cc: Alex Shi Cc: Haiyang Zhang Cc: Jason Wang Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: "Matthew Wilcox (Oracle)" Cc: "Michael S. Tsirkin" Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Cc: Stephen Hemminger Cc: Steven Price Cc: Wei Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/page-flags.h | 10 ++++++++++ mm/util.c | 40 ++++++++++++++++++++++++++++++++++++++ 2 files changed, 50 insertions(+) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 613295588848..3e7e616067fc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -769,9 +769,19 @@ PAGE_TYPE_OPS(Buddy, buddy) * relies on this feature is aware that re-onlining the memory block will * require to re-set the pages PageOffline() and not giving them to the * buddy via online_page_callback_t. + * + * There are drivers that mark a page PageOffline() and expect there won't be + * any further access to page content. PFN walkers that read content of random + * pages should check PageOffline() and synchronize with such drivers using + * page_offline_freeze()/page_offline_thaw(). */ PAGE_TYPE_OPS(Offline, offline) +extern void page_offline_freeze(void); +extern void page_offline_thaw(void); +extern void page_offline_begin(void); +extern void page_offline_end(void); + /* * Marks pages in use as page tables. */ diff --git a/mm/util.c b/mm/util.c index a8bf17f18a81..a034525e7ba2 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1010,3 +1010,43 @@ void mem_dump_obj(void *object) } EXPORT_SYMBOL_GPL(mem_dump_obj); #endif + +/* + * A driver might set a page logically offline -- PageOffline() -- and + * turn the page inaccessible in the hypervisor; after that, access to page + * content can be fatal. + * + * Some special PFN walkers -- i.e., /proc/kcore -- read content of random + * pages after checking PageOffline(); however, these PFN walkers can race + * with drivers that set PageOffline(). + * + * page_offline_freeze()/page_offline_thaw() allows for a subsystem to + * synchronize with such drivers, achieving that a page cannot be set + * PageOffline() while frozen. + * + * page_offline_begin()/page_offline_end() is used by drivers that care about + * such races when setting a page PageOffline(). 
+ */ +static DECLARE_RWSEM(page_offline_rwsem); + +void page_offline_freeze(void) +{ + down_read(&page_offline_rwsem); +} + +void page_offline_thaw(void) +{ + up_read(&page_offline_rwsem); +} + +void page_offline_begin(void) +{ + down_write(&page_offline_rwsem); +} +EXPORT_SYMBOL(page_offline_begin); + +void page_offline_end(void) +{ + up_write(&page_offline_rwsem); +} +EXPORT_SYMBOL(page_offline_end); From 6cc26d77613a970ed9b5ca66f230b29edf7c917e Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:18 -0700 Subject: [PATCH 058/190] virtio-mem: use page_offline_(start|end) when setting PageOffline() Let's properly use page_offline_(start|end) to synchronize setting PageOffline(), so we won't have valid page access to unplugged memory regions from /proc/kcore. Existing balloon implementations usually allow reading inflated memory; doing so might result in unnecessary overhead in the hypervisor, which is currently the case with virtio-mem. For future virtio-mem use cases, it will be different when using shmem, huge pages, !anonymous private mappings, ... as backing storage for a VM. virtio-mem unplugged memory must no longer be accessed and access might result in undefined behavior. There will be a virtio spec extension to document this change, including a new feature flag indicating the changed behavior. We really don't want to race against PFN walkers reading random page content. Link: https://lkml.kernel.org/r/20210526093041.8800-6-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Michael S. Tsirkin Acked-by: Mike Rapoport Reviewed-by: Oscar Salvador Cc: Aili Yao Cc: Alexey Dobriyan Cc: Alex Shi Cc: Haiyang Zhang Cc: Jason Wang Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: "Matthew Wilcox (Oracle)" Cc: Michal Hocko Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Cc: Stephen Hemminger Cc: Steven Price Cc: Wei Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/virtio/virtio_mem.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 10ec60d81e84..dc2a2e2b2ff8 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1065,6 +1065,7 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb, static void virtio_mem_set_fake_offline(unsigned long pfn, unsigned long nr_pages, bool onlined) { + page_offline_begin(); for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); @@ -1075,6 +1076,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn, ClearPageReserved(page); } } + page_offline_end(); } /* From c6d9eee2a68619b5ba1c25e406a9403f33b56902 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:50:21 -0700 Subject: [PATCH 059/190] fs/proc/kcore: use page_offline_(freeze|thaw) Let's properly synchronize with drivers that set PageOffline(). Unfreeze/thaw every now and then, so drivers that want to set PageOffline() can make progress. Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport Reviewed-by: Oscar Salvador Cc: Aili Yao Cc: Alexey Dobriyan Cc: Alex Shi Cc: Haiyang Zhang Cc: Jason Wang Cc: Jiri Bohac Cc: "K. Y. Srinivasan" Cc: "Matthew Wilcox (Oracle)" Cc: "Michael S. 
Tsirkin" Cc: Michal Hocko Cc: Mike Kravetz Cc: Naoya Horiguchi Cc: Roman Gushchin Cc: Stephen Hemminger Cc: Steven Price Cc: Wei Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/kcore.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c index 92ff1e4436cb..982e694aae77 100644 --- a/fs/proc/kcore.c +++ b/fs/proc/kcore.c @@ -313,6 +313,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) { char *buf = file->private_data; size_t phdrs_offset, notes_offset, data_offset; + size_t page_offline_frozen = 1; size_t phdrs_len, notes_len; struct kcore_list *m; size_t tsz; @@ -322,6 +323,11 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) int ret = 0; down_read(&kclist_lock); + /* + * Don't race against drivers that set PageOffline() and expect no + * further page access. + */ + page_offline_freeze(); get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset); phdrs_offset = sizeof(struct elfhdr); @@ -480,6 +486,12 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) } } + if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) { + page_offline_thaw(); + cond_resched(); + page_offline_freeze(); + } + if (&m->list == &kclist_head) { if (clear_user(buffer, tsz)) { ret = -EFAULT; goto out; @@ -565,6 +577,7 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos) } out: + page_offline_thaw(); up_read(&kclist_lock); if (ret) return ret; From e3c0db4fec46b46a0c22b46bb55392b36ec940fc Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:24 -0700 Subject: [PATCH 060/190] mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS Patch series "Cleanup and fixup for z3fold". This series contains cleanups to remove an unused function, redefine a macro to improve readability, and so on. It also fixes several bugs in z3fold, such as a memory leak in z3fold_destroy_pool(). More details can be found in the respective changelogs. This patch (of 6): To improve code readability, we could define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS. No functional change intended. Link: https://lkml.kernel.org/r/20210619093151.1492174-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210619093151.1492174-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 7fe7adaaad01..0d0b81637f84 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -62,7 +62,7 @@ #define ZHDR_SIZE_ALIGNED round_up(sizeof(struct z3fold_header), CHUNK_SIZE) #define ZHDR_CHUNKS (ZHDR_SIZE_ALIGNED >> CHUNK_SHIFT) #define TOTAL_CHUNKS (PAGE_SIZE >> CHUNK_SHIFT) -#define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) +#define NCHUNKS (TOTAL_CHUNKS - ZHDR_CHUNKS) #define BUDDY_MASK (0x3) #define BUDDY_SHIFT 2 From 014284a0815f6b9a6e10c8d575d37a5357ce033d Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:27 -0700 Subject: [PATCH 061/190] mm/z3fold: avoid possible underflow in z3fold_alloc() It is not enough to just make sure the z3fold header is not larger than the page size. When the z3fold header is equal to PAGE_SIZE, we would underflow when checking the alloc size against PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE in z3fold_alloc(). Make sure there is remaining space for its buddy to fix this theoretical issue.
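To see the underflow concretely: all operands in the z3fold_alloc() size check named above are unsigned, so the check is effectively the sketch below. If ZHDR_SIZE_ALIGNED were allowed to equal PAGE_SIZE, the right-hand side would wrap around to a huge value and no allocation size would ever be rejected; the tightened BUILD_BUG_ON() in the diff below keeps at least one CHUNK_SIZE spare, so the subtraction cannot wrap:

	/*
	 * Sketch of the check described above; size, PAGE_SIZE,
	 * ZHDR_SIZE_ALIGNED and CHUNK_SIZE are all unsigned quantities.
	 */
	if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE)
		return -ENOSPC;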
Link: https://lkml.kernel.org/r/20210619093151.1492174-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 0d0b81637f84..64ddf864d5ee 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -1803,8 +1803,11 @@ static int __init init_z3fold(void) { int ret; - /* Make sure the z3fold header is not larger than the page size */ - BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE); + /* + * Make sure the z3fold header is not larger than the page size and + * there has remaining spaces for its buddy. + */ + BUILD_BUG_ON(ZHDR_SIZE_ALIGNED > PAGE_SIZE - CHUNK_SIZE); ret = z3fold_mount(); if (ret) return ret; From e891f60e28c3e90e2589a7d2147ae192dca11245 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:30 -0700 Subject: [PATCH 062/190] mm/z3fold: remove magic number in z3fold_create_pool() It's meaningless to pass a magic number 2 to __alloc_percpu() as there is a minimum alignment size of PCPU_MIN_ALLOC_SIZE (> 2) in it. Also there is no special alignment requirement for unbuddied. So we could replace this magic number with nature alignment, i.e. __alignof__(struct list_head), to improve readability. Link: https://lkml.kernel.org/r/20210619093151.1492174-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 64ddf864d5ee..1b790d02f6f1 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -998,7 +998,8 @@ static struct z3fold_pool *z3fold_create_pool(const char *name, gfp_t gfp, goto out_c; spin_lock_init(&pool->lock); spin_lock_init(&pool->stale_lock); - pool->unbuddied = __alloc_percpu(sizeof(struct list_head)*NCHUNKS, 2); + pool->unbuddied = __alloc_percpu(sizeof(struct list_head) * NCHUNKS, + __alignof__(struct list_head)); if (!pool->unbuddied) goto out_pool; for_each_possible_cpu(cpu) { From 767cc6c5568afa50ef6abbd4efb61beee56f9cc8 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:33 -0700 Subject: [PATCH 063/190] mm/z3fold: remove unused function handle_to_z3fold_header() handle_to_z3fold_header() is unused now. So we can remove it. As a result, get_z3fold_header() becomes the only caller of __get_z3fold_header() and the argument lock is always true. Therefore we could further fold the __get_z3fold_header() into get_z3fold_header() with lock = true. 
Link: https://lkml.kernel.org/r/20210619093151.1492174-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index 1b790d02f6f1..9a42172a1bcd 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -253,9 +253,8 @@ static inline void z3fold_page_unlock(struct z3fold_header *zhdr) spin_unlock(&zhdr->page_lock); } - -static inline struct z3fold_header *__get_z3fold_header(unsigned long handle, - bool lock) +/* return locked z3fold page if it's not headless */ +static inline struct z3fold_header *get_z3fold_header(unsigned long handle) { struct z3fold_buddy_slots *slots; struct z3fold_header *zhdr; @@ -269,13 +268,12 @@ static inline struct z3fold_header *__get_z3fold_header(unsigned long handle, read_lock(&slots->lock); addr = *(unsigned long *)handle; zhdr = (struct z3fold_header *)(addr & PAGE_MASK); - if (lock) - locked = z3fold_page_trylock(zhdr); + locked = z3fold_page_trylock(zhdr); read_unlock(&slots->lock); if (locked) break; cpu_relax(); - } while (lock); + } while (true); } else { zhdr = (struct z3fold_header *)(handle & PAGE_MASK); } @@ -283,18 +281,6 @@ static inline struct z3fold_header *__get_z3fold_header(unsigned long handle, return zhdr; } -/* Returns the z3fold page where a given handle is stored */ -static inline struct z3fold_header *handle_to_z3fold_header(unsigned long h) -{ - return __get_z3fold_header(h, false); -} - -/* return locked z3fold page if it's not headless */ -static inline struct z3fold_header *get_z3fold_header(unsigned long h) -{ - return __get_z3fold_header(h, true); -} - static inline void put_z3fold_header(struct z3fold_header *zhdr) { struct page *page = virt_to_page(zhdr); From dac0d1cfda56472378d330b1b76b9973557a7b1d Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:36 -0700 Subject: [PATCH 064/190] mm/z3fold: fix potential memory leak in z3fold_destroy_pool() There is a memory leak in z3fold_destroy_pool() as it forgets to free_percpu pool->unbuddied. Call free_percpu for pool->unbuddied to fix this issue. Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com Fixes: d30561c56f41 ("z3fold: use per-cpu unbuddied lists") Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/z3fold.c b/mm/z3fold.c index 9a42172a1bcd..f34130ff093d 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -1046,6 +1046,7 @@ static void z3fold_destroy_pool(struct z3fold_pool *pool) destroy_workqueue(pool->compact_wq); destroy_workqueue(pool->release_wq); z3fold_unregister_migration(pool); + free_percpu(pool->unbuddied); kfree(pool); } From 28473d91ff7f686d58047ff55f2fa98ab59114a4 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:39 -0700 Subject: [PATCH 065/190] mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page We should use release_z3fold_page_locked() to release z3fold page when it's locked, although it looks harmless to use release_z3fold_page() now. 
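The reason the locked variant matters is the kref_put() contract: the release callback runs only when the last reference is dropped, and it then runs in whatever lock context the final kref_put() caller holds -- here the z3fold page lock. Condensed from the hunk below (comments ours):

	if (kref_put(&zhdr->refcount, release_z3fold_page_locked))
		/* last reference: the locked callback unlocks and disposes of the page */
		atomic64_dec(&pool->pages_nr);
	else
		/* page still referenced elsewhere: we must unlock it ourselves */
		z3fold_page_unlock(zhdr);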
Link: https://lkml.kernel.org/r/20210619093151.1492174-7-linmiaohe@huawei.com Fixes: dcf5aedb24f8 ("z3fold: stricter locking and more careful reclaim") Signed-off-by: Miaohe Lin Reviewed-by: Vitaly Wool Cc: Hillf Danton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/z3fold.c b/mm/z3fold.c index f34130ff093d..a1bc021434a1 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -1370,7 +1370,7 @@ static int z3fold_reclaim_page(struct z3fold_pool *pool, unsigned int retries) if (zhdr->foreign_handles || test_and_set_bit(PAGE_CLAIMED, &page->private)) { if (kref_put(&zhdr->refcount, - release_z3fold_page)) + release_z3fold_page_locked)) atomic64_dec(&pool->pages_nr); else z3fold_page_unlock(zhdr); From f356aeacf7bbf32131de10d3e400b25b62e3eaaa Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:42 -0700 Subject: [PATCH 066/190] mm/zbud: reuse unbuddied[0] as buddied in zbud_pool Patch series "Cleanups for zbud", v2. This series contains just cleanups to save some possible memory in zbud_pool and avoid exporting any unneeded zbud API. More details can be found in the respective changelogs This patch (of 2): Since commit 9d8c5b5284e4 ("mm: zbud: fix condition check on allocation size"), zbud_pool.unbuddied[0] is always unused. We can reuse it as buddied field to save some possible memory. Link: https://lkml.kernel.org/r/20210608114515.206992-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210608114515.206992-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Seth Jennings Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zbud.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/mm/zbud.c b/mm/zbud.c index 7ec5f27a68b0..c48e60bae4a1 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -93,8 +93,14 @@ */ struct zbud_pool { spinlock_t lock; - struct list_head unbuddied[NCHUNKS]; - struct list_head buddied; + union { + /* + * Reuse unbuddied[0] as buddied on the ground that + * unbuddied[0] is unused. + */ + struct list_head buddied; + struct list_head unbuddied[NCHUNKS]; + }; struct list_head lru; u64 pages_nr; const struct zbud_ops *ops; From 2a03085ce88792bac2e25319fc2874a885e7e102 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:50:45 -0700 Subject: [PATCH 067/190] mm/zbud: don't export any zbud API The zbud doesn't need to export any API and it is meant to be used via zpool API since the commit 12d79d64bfd3 ("mm/zpool: update zswap to use zpool"). So we can remove the unneeded zbud.h and move down zpool API to avoid any forward declaration. 
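After this change the only way to reach zbud is through the zpool facade, the way zswap does. A minimal usage sketch with signatures as they exist in kernels of this generation (the zbud_zpool_demo() wrapper and the empty ops are ours; error handling trimmed):

#include <linux/gfp.h>
#include <linux/zpool.h>

static const struct zpool_ops demo_ops = { .evict = NULL };

static int zbud_zpool_demo(void)
{
	struct zpool *pool;
	unsigned long handle;
	int ret;

	/* ask the zpool layer for a zbud-backed pool */
	pool = zpool_create_pool("zbud", "demo", GFP_KERNEL, &demo_ops);
	if (!pool)
		return -ENOMEM;
	ret = zpool_malloc(pool, 100, GFP_KERNEL, &handle);	/* 100-byte object */
	if (!ret)
		zpool_free(pool, handle);
	zpool_destroy_pool(pool);
	return ret;
}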
[linmiaohe@huawei.com: fix unused function warnings when CONFIG_ZPOOL is disabled] Link: https://lkml.kernel.org/r/20210619025508.1239386-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210608114515.206992-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Dan Streetman Cc: Seth Jennings Cc: Nathan Chancellor Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- MAINTAINERS | 1 - include/linux/zbud.h | 23 -- mm/Kconfig | 1 + mm/zbud.c | 795 +++++++++++++++++++++---------------------- 4 files changed, 396 insertions(+), 424 deletions(-) delete mode 100644 include/linux/zbud.h diff --git a/MAINTAINERS b/MAINTAINERS index 0cce91cd5624..6657f01ddbd5 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -20172,7 +20172,6 @@ M: Seth Jennings M: Dan Streetman L: linux-mm@kvack.org S: Maintained -F: include/linux/zbud.h F: mm/zbud.c ZD1211RW WIRELESS DRIVER diff --git a/include/linux/zbud.h b/include/linux/zbud.h deleted file mode 100644 index b1eaf6e31735..000000000000 --- a/include/linux/zbud.h +++ /dev/null @@ -1,23 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ZBUD_H_ -#define _ZBUD_H_ - -#include - -struct zbud_pool; - -struct zbud_ops { - int (*evict)(struct zbud_pool *pool, unsigned long handle); -}; - -struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops); -void zbud_destroy_pool(struct zbud_pool *pool); -int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, - unsigned long *handle); -void zbud_free(struct zbud_pool *pool, unsigned long handle); -int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries); -void *zbud_map(struct zbud_pool *pool, unsigned long handle); -void zbud_unmap(struct zbud_pool *pool, unsigned long handle); -u64 zbud_get_pool_size(struct zbud_pool *pool); - -#endif /* _ZBUD_H_ */ diff --git a/mm/Kconfig b/mm/Kconfig index 055639e52459..a0b2c37f851b 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -674,6 +674,7 @@ config ZPOOL config ZBUD tristate "Low (Up to 2x) density storage for compressed pages" + depends on ZPOOL help A special purpose allocator for storing compressed pages. 
It is designed to store up to two compressed pages per physical diff --git a/mm/zbud.c b/mm/zbud.c index c48e60bae4a1..48e80a3f7976 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -51,7 +51,6 @@ #include #include #include -#include #include /***************** @@ -73,6 +72,12 @@ #define ZHDR_SIZE_ALIGNED CHUNK_SIZE #define NCHUNKS ((PAGE_SIZE - ZHDR_SIZE_ALIGNED) >> CHUNK_SHIFT) +struct zbud_pool; + +struct zbud_ops { + int (*evict)(struct zbud_pool *pool, unsigned long handle); +}; + /** * struct zbud_pool - stores metadata for each zbud pool * @lock: protects all pool fields and first|last_chunk fields of any @@ -104,10 +109,8 @@ struct zbud_pool { struct list_head lru; u64 pages_nr; const struct zbud_ops *ops; -#ifdef CONFIG_ZPOOL struct zpool *zpool; const struct zpool_ops *zpool_ops; -#endif }; /* @@ -126,12 +129,399 @@ struct zbud_header { bool under_reclaim; }; +/***************** + * Helpers +*****************/ +/* Just to make the code easier to read */ +enum buddy { + FIRST, + LAST +}; + +/* Converts an allocation size in bytes to size in zbud chunks */ +static int size_to_chunks(size_t size) +{ + return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT; +} + +#define for_each_unbuddied_list(_iter, _begin) \ + for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) + +/* Initializes the zbud header of a newly allocated zbud page */ +static struct zbud_header *init_zbud_page(struct page *page) +{ + struct zbud_header *zhdr = page_address(page); + zhdr->first_chunks = 0; + zhdr->last_chunks = 0; + INIT_LIST_HEAD(&zhdr->buddy); + INIT_LIST_HEAD(&zhdr->lru); + zhdr->under_reclaim = false; + return zhdr; +} + +/* Resets the struct page fields and frees the page */ +static void free_zbud_page(struct zbud_header *zhdr) +{ + __free_page(virt_to_page(zhdr)); +} + +/* + * Encodes the handle of a particular buddy within a zbud page + * Pool lock should be held as this function accesses first|last_chunks + */ +static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud) +{ + unsigned long handle; + + /* + * For now, the encoded handle is actually just the pointer to the data + * but this might not always be the case. A little information hiding. + * Add CHUNK_SIZE to the handle if it is the first allocation to jump + * over the zbud header in the first chunk. + */ + handle = (unsigned long)zhdr; + if (bud == FIRST) + /* skip over zbud header */ + handle += ZHDR_SIZE_ALIGNED; + else /* bud == LAST */ + handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); + return handle; +} + +/* Returns the zbud page where a given handle is stored */ +static struct zbud_header *handle_to_zbud_header(unsigned long handle) +{ + return (struct zbud_header *)(handle & PAGE_MASK); +} + +/* Returns the number of free chunks in a zbud page */ +static int num_free_chunks(struct zbud_header *zhdr) +{ + /* + * Rather than branch for different situations, just use the fact that + * free buddies have a length of zero to simplify everything. + */ + return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; +} + +/***************** + * API Functions +*****************/ +/** + * zbud_create_pool() - create a new zbud pool + * @gfp: gfp flags when allocating the zbud pool structure + * @ops: user-defined operations for the zbud pool + * + * Return: pointer to the new zbud pool or NULL if the metadata allocation + * failed. 
+ */ +static struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops) +{ + struct zbud_pool *pool; + int i; + + pool = kzalloc(sizeof(struct zbud_pool), gfp); + if (!pool) + return NULL; + spin_lock_init(&pool->lock); + for_each_unbuddied_list(i, 0) + INIT_LIST_HEAD(&pool->unbuddied[i]); + INIT_LIST_HEAD(&pool->buddied); + INIT_LIST_HEAD(&pool->lru); + pool->pages_nr = 0; + pool->ops = ops; + return pool; +} + +/** + * zbud_destroy_pool() - destroys an existing zbud pool + * @pool: the zbud pool to be destroyed + * + * The pool should be emptied before this function is called. + */ +static void zbud_destroy_pool(struct zbud_pool *pool) +{ + kfree(pool); +} + +/** + * zbud_alloc() - allocates a region of a given size + * @pool: zbud pool from which to allocate + * @size: size in bytes of the desired allocation + * @gfp: gfp flags used if the pool needs to grow + * @handle: handle of the new allocation + * + * This function will attempt to find a free region in the pool large enough to + * satisfy the allocation request. A search of the unbuddied lists is + * performed first. If no suitable free region is found, then a new page is + * allocated and added to the pool to satisfy the request. + * + * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used + * as zbud pool pages. + * + * Return: 0 if success and handle is set, otherwise -EINVAL if the size or + * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate + * a new page. + */ +static int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, + unsigned long *handle) +{ + int chunks, i, freechunks; + struct zbud_header *zhdr = NULL; + enum buddy bud; + struct page *page; + + if (!size || (gfp & __GFP_HIGHMEM)) + return -EINVAL; + if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) + return -ENOSPC; + chunks = size_to_chunks(size); + spin_lock(&pool->lock); + + /* First, try to find an unbuddied zbud page. */ + for_each_unbuddied_list(i, chunks) { + if (!list_empty(&pool->unbuddied[i])) { + zhdr = list_first_entry(&pool->unbuddied[i], + struct zbud_header, buddy); + list_del(&zhdr->buddy); + if (zhdr->first_chunks == 0) + bud = FIRST; + else + bud = LAST; + goto found; + } + } + + /* Couldn't find unbuddied zbud page, create new one */ + spin_unlock(&pool->lock); + page = alloc_page(gfp); + if (!page) + return -ENOMEM; + spin_lock(&pool->lock); + pool->pages_nr++; + zhdr = init_zbud_page(page); + bud = FIRST; + +found: + if (bud == FIRST) + zhdr->first_chunks = chunks; + else + zhdr->last_chunks = chunks; + + if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) { + /* Add to unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } else { + /* Add to buddied list */ + list_add(&zhdr->buddy, &pool->buddied); + } + + /* Add/move zbud page to beginning of LRU */ + if (!list_empty(&zhdr->lru)) + list_del(&zhdr->lru); + list_add(&zhdr->lru, &pool->lru); + + *handle = encode_handle(zhdr, bud); + spin_unlock(&pool->lock); + + return 0; +} + +/** + * zbud_free() - frees the allocation associated with the given handle + * @pool: pool in which the allocation resided + * @handle: handle associated with the allocation returned by zbud_alloc() + * + * In the case that the zbud page in which the allocation resides is under + * reclaim, as indicated by the PG_reclaim flag being set, this function + * only sets the first|last_chunks to 0. The page is actually freed + * once both buddies are evicted (see zbud_reclaim_page() below). 
+ */ +static void zbud_free(struct zbud_pool *pool, unsigned long handle) +{ + struct zbud_header *zhdr; + int freechunks; + + spin_lock(&pool->lock); + zhdr = handle_to_zbud_header(handle); + + /* If first buddy, handle will be page aligned */ + if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK) + zhdr->last_chunks = 0; + else + zhdr->first_chunks = 0; + + if (zhdr->under_reclaim) { + /* zbud page is under reclaim, reclaim will free */ + spin_unlock(&pool->lock); + return; + } + + /* Remove from existing buddy list */ + list_del(&zhdr->buddy); + + if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { + /* zbud page is empty, free */ + list_del(&zhdr->lru); + free_zbud_page(zhdr); + pool->pages_nr--; + } else { + /* Add to unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } + + spin_unlock(&pool->lock); +} + +/** + * zbud_reclaim_page() - evicts allocations from a pool page and frees it + * @pool: pool from which a page will attempt to be evicted + * @retries: number of pages on the LRU list for which eviction will + * be attempted before failing + * + * zbud reclaim is different from normal system reclaim in that the reclaim is + * done from the bottom, up. This is because only the bottom layer, zbud, has + * information on how the allocations are organized within each zbud page. This + * has the potential to create interesting locking situations between zbud and + * the user, however. + * + * To avoid these, this is how zbud_reclaim_page() should be called: + * + * The user detects a page should be reclaimed and calls zbud_reclaim_page(). + * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call + * the user-defined eviction handler with the pool and handle as arguments. + * + * If the handle can not be evicted, the eviction handler should return + * non-zero. zbud_reclaim_page() will add the zbud page back to the + * appropriate list and try the next zbud page on the LRU up to + * a user defined number of retries. + * + * If the handle is successfully evicted, the eviction handler should + * return 0 _and_ should have called zbud_free() on the handle. zbud_free() + * contains logic to delay freeing the page if the page is under reclaim, + * as indicated by the setting of the PG_reclaim flag on the underlying page. + * + * If all buddies in the zbud page are successfully evicted, then the + * zbud page can be freed. + * + * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are + * no pages to evict or an eviction handler is not registered, -EAGAIN if + * the retry limit was hit. 
+ */ +static int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries) +{ + int i, ret, freechunks; + struct zbud_header *zhdr; + unsigned long first_handle = 0, last_handle = 0; + + spin_lock(&pool->lock); + if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) || + retries == 0) { + spin_unlock(&pool->lock); + return -EINVAL; + } + for (i = 0; i < retries; i++) { + zhdr = list_last_entry(&pool->lru, struct zbud_header, lru); + list_del(&zhdr->lru); + list_del(&zhdr->buddy); + /* Protect zbud page against free */ + zhdr->under_reclaim = true; + /* + * We need encode the handles before unlocking, since we can + * race with free that will set (first|last)_chunks to 0 + */ + first_handle = 0; + last_handle = 0; + if (zhdr->first_chunks) + first_handle = encode_handle(zhdr, FIRST); + if (zhdr->last_chunks) + last_handle = encode_handle(zhdr, LAST); + spin_unlock(&pool->lock); + + /* Issue the eviction callback(s) */ + if (first_handle) { + ret = pool->ops->evict(pool, first_handle); + if (ret) + goto next; + } + if (last_handle) { + ret = pool->ops->evict(pool, last_handle); + if (ret) + goto next; + } +next: + spin_lock(&pool->lock); + zhdr->under_reclaim = false; + if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { + /* + * Both buddies are now free, free the zbud page and + * return success. + */ + free_zbud_page(zhdr); + pool->pages_nr--; + spin_unlock(&pool->lock); + return 0; + } else if (zhdr->first_chunks == 0 || + zhdr->last_chunks == 0) { + /* add to unbuddied list */ + freechunks = num_free_chunks(zhdr); + list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); + } else { + /* add to buddied list */ + list_add(&zhdr->buddy, &pool->buddied); + } + + /* add to beginning of LRU */ + list_add(&zhdr->lru, &pool->lru); + } + spin_unlock(&pool->lock); + return -EAGAIN; +} + +/** + * zbud_map() - maps the allocation associated with the given handle + * @pool: pool in which the allocation resides + * @handle: handle associated with the allocation to be mapped + * + * While trivial for zbud, the mapping functions for others allocators + * implementing this allocation API could have more complex information encoded + * in the handle and could create temporary mappings to make the data + * accessible to the user. + * + * Returns: a pointer to the mapped allocation + */ +static void *zbud_map(struct zbud_pool *pool, unsigned long handle) +{ + return (void *)(handle); +} + +/** + * zbud_unmap() - maps the allocation associated with the given handle + * @pool: pool in which the allocation resides + * @handle: handle associated with the allocation to be unmapped + */ +static void zbud_unmap(struct zbud_pool *pool, unsigned long handle) +{ +} + +/** + * zbud_get_pool_size() - gets the zbud pool size in pages + * @pool: pool whose size is being queried + * + * Returns: size in pages of the given pool. The pool lock need not be + * taken to access pages_nr. 
+ */ +static u64 zbud_get_pool_size(struct zbud_pool *pool) +{ + return pool->pages_nr; +} + /***************** * zpool ****************/ -#ifdef CONFIG_ZPOOL - static int zbud_zpool_evict(struct zbud_pool *pool, unsigned long handle) { if (pool->zpool && pool->zpool_ops && pool->zpool_ops->evict) @@ -222,396 +612,6 @@ static struct zpool_driver zbud_zpool_driver = { }; MODULE_ALIAS("zpool-zbud"); -#endif /* CONFIG_ZPOOL */ - -/***************** - * Helpers -*****************/ -/* Just to make the code easier to read */ -enum buddy { - FIRST, - LAST -}; - -/* Converts an allocation size in bytes to size in zbud chunks */ -static int size_to_chunks(size_t size) -{ - return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT; -} - -#define for_each_unbuddied_list(_iter, _begin) \ - for ((_iter) = (_begin); (_iter) < NCHUNKS; (_iter)++) - -/* Initializes the zbud header of a newly allocated zbud page */ -static struct zbud_header *init_zbud_page(struct page *page) -{ - struct zbud_header *zhdr = page_address(page); - zhdr->first_chunks = 0; - zhdr->last_chunks = 0; - INIT_LIST_HEAD(&zhdr->buddy); - INIT_LIST_HEAD(&zhdr->lru); - zhdr->under_reclaim = false; - return zhdr; -} - -/* Resets the struct page fields and frees the page */ -static void free_zbud_page(struct zbud_header *zhdr) -{ - __free_page(virt_to_page(zhdr)); -} - -/* - * Encodes the handle of a particular buddy within a zbud page - * Pool lock should be held as this function accesses first|last_chunks - */ -static unsigned long encode_handle(struct zbud_header *zhdr, enum buddy bud) -{ - unsigned long handle; - - /* - * For now, the encoded handle is actually just the pointer to the data - * but this might not always be the case. A little information hiding. - * Add CHUNK_SIZE to the handle if it is the first allocation to jump - * over the zbud header in the first chunk. - */ - handle = (unsigned long)zhdr; - if (bud == FIRST) - /* skip over zbud header */ - handle += ZHDR_SIZE_ALIGNED; - else /* bud == LAST */ - handle += PAGE_SIZE - (zhdr->last_chunks << CHUNK_SHIFT); - return handle; -} - -/* Returns the zbud page where a given handle is stored */ -static struct zbud_header *handle_to_zbud_header(unsigned long handle) -{ - return (struct zbud_header *)(handle & PAGE_MASK); -} - -/* Returns the number of free chunks in a zbud page */ -static int num_free_chunks(struct zbud_header *zhdr) -{ - /* - * Rather than branch for different situations, just use the fact that - * free buddies have a length of zero to simplify everything. - */ - return NCHUNKS - zhdr->first_chunks - zhdr->last_chunks; -} - -/***************** - * API Functions -*****************/ -/** - * zbud_create_pool() - create a new zbud pool - * @gfp: gfp flags when allocating the zbud pool structure - * @ops: user-defined operations for the zbud pool - * - * Return: pointer to the new zbud pool or NULL if the metadata allocation - * failed. - */ -struct zbud_pool *zbud_create_pool(gfp_t gfp, const struct zbud_ops *ops) -{ - struct zbud_pool *pool; - int i; - - pool = kzalloc(sizeof(struct zbud_pool), gfp); - if (!pool) - return NULL; - spin_lock_init(&pool->lock); - for_each_unbuddied_list(i, 0) - INIT_LIST_HEAD(&pool->unbuddied[i]); - INIT_LIST_HEAD(&pool->buddied); - INIT_LIST_HEAD(&pool->lru); - pool->pages_nr = 0; - pool->ops = ops; - return pool; -} - -/** - * zbud_destroy_pool() - destroys an existing zbud pool - * @pool: the zbud pool to be destroyed - * - * The pool should be emptied before this function is called. 
- */ -void zbud_destroy_pool(struct zbud_pool *pool) -{ - kfree(pool); -} - -/** - * zbud_alloc() - allocates a region of a given size - * @pool: zbud pool from which to allocate - * @size: size in bytes of the desired allocation - * @gfp: gfp flags used if the pool needs to grow - * @handle: handle of the new allocation - * - * This function will attempt to find a free region in the pool large enough to - * satisfy the allocation request. A search of the unbuddied lists is - * performed first. If no suitable free region is found, then a new page is - * allocated and added to the pool to satisfy the request. - * - * gfp should not set __GFP_HIGHMEM as highmem pages cannot be used - * as zbud pool pages. - * - * Return: 0 if success and handle is set, otherwise -EINVAL if the size or - * gfp arguments are invalid or -ENOMEM if the pool was unable to allocate - * a new page. - */ -int zbud_alloc(struct zbud_pool *pool, size_t size, gfp_t gfp, - unsigned long *handle) -{ - int chunks, i, freechunks; - struct zbud_header *zhdr = NULL; - enum buddy bud; - struct page *page; - - if (!size || (gfp & __GFP_HIGHMEM)) - return -EINVAL; - if (size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE) - return -ENOSPC; - chunks = size_to_chunks(size); - spin_lock(&pool->lock); - - /* First, try to find an unbuddied zbud page. */ - for_each_unbuddied_list(i, chunks) { - if (!list_empty(&pool->unbuddied[i])) { - zhdr = list_first_entry(&pool->unbuddied[i], - struct zbud_header, buddy); - list_del(&zhdr->buddy); - if (zhdr->first_chunks == 0) - bud = FIRST; - else - bud = LAST; - goto found; - } - } - - /* Couldn't find unbuddied zbud page, create new one */ - spin_unlock(&pool->lock); - page = alloc_page(gfp); - if (!page) - return -ENOMEM; - spin_lock(&pool->lock); - pool->pages_nr++; - zhdr = init_zbud_page(page); - bud = FIRST; - -found: - if (bud == FIRST) - zhdr->first_chunks = chunks; - else - zhdr->last_chunks = chunks; - - if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0) { - /* Add to unbuddied list */ - freechunks = num_free_chunks(zhdr); - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); - } else { - /* Add to buddied list */ - list_add(&zhdr->buddy, &pool->buddied); - } - - /* Add/move zbud page to beginning of LRU */ - if (!list_empty(&zhdr->lru)) - list_del(&zhdr->lru); - list_add(&zhdr->lru, &pool->lru); - - *handle = encode_handle(zhdr, bud); - spin_unlock(&pool->lock); - - return 0; -} - -/** - * zbud_free() - frees the allocation associated with the given handle - * @pool: pool in which the allocation resided - * @handle: handle associated with the allocation returned by zbud_alloc() - * - * In the case that the zbud page in which the allocation resides is under - * reclaim, as indicated by the PG_reclaim flag being set, this function - * only sets the first|last_chunks to 0. The page is actually freed - * once both buddies are evicted (see zbud_reclaim_page() below). 
- */ -void zbud_free(struct zbud_pool *pool, unsigned long handle) -{ - struct zbud_header *zhdr; - int freechunks; - - spin_lock(&pool->lock); - zhdr = handle_to_zbud_header(handle); - - /* If first buddy, handle will be page aligned */ - if ((handle - ZHDR_SIZE_ALIGNED) & ~PAGE_MASK) - zhdr->last_chunks = 0; - else - zhdr->first_chunks = 0; - - if (zhdr->under_reclaim) { - /* zbud page is under reclaim, reclaim will free */ - spin_unlock(&pool->lock); - return; - } - - /* Remove from existing buddy list */ - list_del(&zhdr->buddy); - - if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { - /* zbud page is empty, free */ - list_del(&zhdr->lru); - free_zbud_page(zhdr); - pool->pages_nr--; - } else { - /* Add to unbuddied list */ - freechunks = num_free_chunks(zhdr); - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); - } - - spin_unlock(&pool->lock); -} - -/** - * zbud_reclaim_page() - evicts allocations from a pool page and frees it - * @pool: pool from which a page will attempt to be evicted - * @retries: number of pages on the LRU list for which eviction will - * be attempted before failing - * - * zbud reclaim is different from normal system reclaim in that the reclaim is - * done from the bottom, up. This is because only the bottom layer, zbud, has - * information on how the allocations are organized within each zbud page. This - * has the potential to create interesting locking situations between zbud and - * the user, however. - * - * To avoid these, this is how zbud_reclaim_page() should be called: - * - * The user detects a page should be reclaimed and calls zbud_reclaim_page(). - * zbud_reclaim_page() will remove a zbud page from the pool LRU list and call - * the user-defined eviction handler with the pool and handle as arguments. - * - * If the handle can not be evicted, the eviction handler should return - * non-zero. zbud_reclaim_page() will add the zbud page back to the - * appropriate list and try the next zbud page on the LRU up to - * a user defined number of retries. - * - * If the handle is successfully evicted, the eviction handler should - * return 0 _and_ should have called zbud_free() on the handle. zbud_free() - * contains logic to delay freeing the page if the page is under reclaim, - * as indicated by the setting of the PG_reclaim flag on the underlying page. - * - * If all buddies in the zbud page are successfully evicted, then the - * zbud page can be freed. - * - * Returns: 0 if page is successfully freed, otherwise -EINVAL if there are - * no pages to evict or an eviction handler is not registered, -EAGAIN if - * the retry limit was hit. 
- */ -int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries) -{ - int i, ret, freechunks; - struct zbud_header *zhdr; - unsigned long first_handle = 0, last_handle = 0; - - spin_lock(&pool->lock); - if (!pool->ops || !pool->ops->evict || list_empty(&pool->lru) || - retries == 0) { - spin_unlock(&pool->lock); - return -EINVAL; - } - for (i = 0; i < retries; i++) { - zhdr = list_last_entry(&pool->lru, struct zbud_header, lru); - list_del(&zhdr->lru); - list_del(&zhdr->buddy); - /* Protect zbud page against free */ - zhdr->under_reclaim = true; - /* - * We need encode the handles before unlocking, since we can - * race with free that will set (first|last)_chunks to 0 - */ - first_handle = 0; - last_handle = 0; - if (zhdr->first_chunks) - first_handle = encode_handle(zhdr, FIRST); - if (zhdr->last_chunks) - last_handle = encode_handle(zhdr, LAST); - spin_unlock(&pool->lock); - - /* Issue the eviction callback(s) */ - if (first_handle) { - ret = pool->ops->evict(pool, first_handle); - if (ret) - goto next; - } - if (last_handle) { - ret = pool->ops->evict(pool, last_handle); - if (ret) - goto next; - } -next: - spin_lock(&pool->lock); - zhdr->under_reclaim = false; - if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) { - /* - * Both buddies are now free, free the zbud page and - * return success. - */ - free_zbud_page(zhdr); - pool->pages_nr--; - spin_unlock(&pool->lock); - return 0; - } else if (zhdr->first_chunks == 0 || - zhdr->last_chunks == 0) { - /* add to unbuddied list */ - freechunks = num_free_chunks(zhdr); - list_add(&zhdr->buddy, &pool->unbuddied[freechunks]); - } else { - /* add to buddied list */ - list_add(&zhdr->buddy, &pool->buddied); - } - - /* add to beginning of LRU */ - list_add(&zhdr->lru, &pool->lru); - } - spin_unlock(&pool->lock); - return -EAGAIN; -} - -/** - * zbud_map() - maps the allocation associated with the given handle - * @pool: pool in which the allocation resides - * @handle: handle associated with the allocation to be mapped - * - * While trivial for zbud, the mapping functions for others allocators - * implementing this allocation API could have more complex information encoded - * in the handle and could create temporary mappings to make the data - * accessible to the user. - * - * Returns: a pointer to the mapped allocation - */ -void *zbud_map(struct zbud_pool *pool, unsigned long handle) -{ - return (void *)(handle); -} - -/** - * zbud_unmap() - maps the allocation associated with the given handle - * @pool: pool in which the allocation resides - * @handle: handle associated with the allocation to be unmapped - */ -void zbud_unmap(struct zbud_pool *pool, unsigned long handle) -{ -} - -/** - * zbud_get_pool_size() - gets the zbud pool size in pages - * @pool: pool whose size is being queried - * - * Returns: size in pages of the given pool. The pool lock need not be - * taken to access pages_nr. 
- */ -u64 zbud_get_pool_size(struct zbud_pool *pool) -{ - return pool->pages_nr; -} static int __init init_zbud(void) { @@ -619,19 +619,14 @@ static int __init init_zbud(void) BUILD_BUG_ON(sizeof(struct zbud_header) > ZHDR_SIZE_ALIGNED); pr_info("loaded\n"); -#ifdef CONFIG_ZPOOL zpool_register_driver(&zbud_zpool_driver); -#endif return 0; } static void __exit exit_zbud(void) { -#ifdef CONFIG_ZPOOL zpool_unregister_driver(&zbud_zpool_driver); -#endif - pr_info("unloaded\n"); } From 17adb230d6a6e39f9ba39440ee8441291795dff4 Mon Sep 17 00:00:00 2001 From: YueHaibing Date: Wed, 30 Jun 2021 18:50:48 -0700 Subject: [PATCH 068/190] mm/compaction: use DEVICE_ATTR_WO macro Use DEVICE_ATTR_WO helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Link: https://lkml.kernel.org/r/20210523064521.32912-1-yuehaibing@huawei.com Signed-off-by: YueHaibing Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 7d41b58fb17c..b7fb991dee1b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2722,9 +2722,9 @@ int sysctl_compaction_handler(struct ctl_table *table, int write, } #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) -static ssize_t sysfs_compact_node(struct device *dev, - struct device_attribute *attr, - const char *buf, size_t count) +static ssize_t compact_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) { int nid = dev->id; @@ -2737,7 +2737,7 @@ static ssize_t sysfs_compact_node(struct device *dev, return count; } -static DEVICE_ATTR(compact, 0200, NULL, sysfs_compact_node); +static DEVICE_ATTR_WO(compact); int compaction_register_node(struct node *node) { From d2155fe54ddb6e289b4f7854df5a7d828d6efbb5 Mon Sep 17 00:00:00 2001 From: Liu Xiang Date: Wed, 30 Jun 2021 18:50:51 -0700 Subject: [PATCH 069/190] mm: compaction: remove duplicate !list_empty(&sublist) check The list_splice_tail(&sublist, freelist) also do !list_empty(&sublist) check, so remove the duplicate call. Link: https://lkml.kernel.org/r/20210609095409.19920-1-liu.xiang@zlingsmart.com Signed-off-by: Liu Xiang Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index b7fb991dee1b..f27ea22f297a 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1297,8 +1297,7 @@ move_freelist_head(struct list_head *freelist, struct page *freepage) if (!list_is_last(freelist, &freepage->lru)) { list_cut_before(&sublist, freelist, &freepage->lru); - if (!list_empty(&sublist)) - list_splice_tail(&sublist, freelist); + list_splice_tail(&sublist, freelist); } } @@ -1315,8 +1314,7 @@ move_freelist_tail(struct list_head *freelist, struct page *freepage) if (!list_is_first(freelist, &freepage->lru)) { list_cut_position(&sublist, freelist, &freepage->lru); - if (!list_empty(&sublist)) - list_splice_tail(&sublist, freelist); + list_splice_tail(&sublist, freelist); } } From b55ca5264b0c0092f238e2f4f33319ba6e9901ab Mon Sep 17 00:00:00 2001 From: Wonhyuk Yang Date: Wed, 30 Jun 2021 18:50:53 -0700 Subject: [PATCH 070/190] mm/compaction: fix 'limit' in fast_isolate_freepages Because of 'min(1, ...)', fast_isolate_freepages set 'limit' to 0 or 1. This takes away the opportunities of find candinate pages. 
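For illustration only (the scan budget value below is hypothetical, not taken from the patch), the difference between the two clamps is:

	unsigned int budget = 64;			/* pretend freelist_scan_limit(cc) returned 64 */
	unsigned int old_limit = min(1U, budget >> 1);	/* == 1: the scan limit can never exceed one entry */
	unsigned int new_limit = max(1U, budget >> 1);	/* == 32: the budget is kept, with 1 only as a floor */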
So, by making enough scans available, this change increases the
probability of finding a suitable free page. Tested on thpscale, the
results are as follows.

                                  5.12.0                 5.12.0
                                 vanilla                patched
Amean     fault-both-1        598.15 (   0.00%)      592.56 (   0.93%)
Amean     fault-both-3       1494.47 (   0.00%)     1514.35 (  -1.33%)
Amean     fault-both-5       2519.48 (   0.00%)     2471.76 (   1.89%)
Amean     fault-both-7       3173.85 (   0.00%)     3079.19 (   2.98%)
Amean     fault-both-12      8063.83 (   0.00%)     7858.29 (   2.55%)
Amean     fault-both-18      8781.20 (   0.00%)     7827.70 *  10.86%*
Amean     fault-both-24     12576.44 (   0.00%)    12250.20 (   2.59%)
Amean     fault-both-30     18503.27 (   0.00%)    17528.11 *   5.27%*
Amean     fault-both-32     16133.69 (   0.00%)    13874.24 *  14.00%*

                                      5.12.0        5.12.0
                                     vanilla       patched
Ops Compaction migrate scanned    6547133.00    5963901.00
Ops Compaction free scanned      32452453.00   26609101.00

                       5.12        5.12
                    vanilla     patched
Duration User         27.99       28.84
Duration System      244.08      236.76
Duration Elapsed      78.27       78.38

Link: https://lkml.kernel.org/r/20210626082443.22547-1-vvghjk1234@gmail.com
Fixes: 5a811889de10f ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Wonhyuk Yang
Acked-by: Mel Gorman
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/compaction.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f27ea22f297a..4796c197295f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1378,7 +1378,7 @@ static int next_search_order(struct compact_control *cc, int order)
 static unsigned long
 fast_isolate_freepages(struct compact_control *cc)
 {
-	unsigned int limit = min(1U, freelist_scan_limit(cc) >> 1);
+	unsigned int limit = max(1U, freelist_scan_limit(cc) >> 1);
 	unsigned int nr_scanned = 0;
 	unsigned long low_pfn, min_pfn, highest = 0;
 	unsigned long nr_isolated = 0;
@@ -1490,11 +1490,11 @@ fast_isolate_freepages(struct compact_control *cc)
 		spin_unlock_irqrestore(&cc->zone->lock, flags);

 		/*
-		 * Smaller scan on next order so the total scan ig related
+		 * Smaller scan on next order so the total scan is related
 		 * to freelist_scan_limit.
 		 */
 		if (order_scanned >= limit)
-			limit = min(1U, limit >> 1);
+			limit = max(1U, limit >> 1);
 	}

 	if (!page) {

From b26e517a058bd40c790a1d9868c896842f2e4155 Mon Sep 17 00:00:00 2001
From: Feng Tang
Date: Wed, 30 Jun 2021 18:50:56 -0700
Subject: [PATCH 071/190] mm/mempolicy: cleanup nodemask intersection check for oom

Patch series "mm/mempolicy: some fix and semantics cleanup", v4.

Current memory policy code has some confusing and ambiguous parts about
the MPOL_LOCAL policy, as it is handled as a faked MPOL_PREFERRED one,
and there are many places having to distinguish them. Also the nodemask
intersection check needs cleanup to be more explicit for OOM use, and to
handle MPOL_INTERLEAVE correctly. This patchset cleans these up and
unifies the parameter sanity check for mbind() and set_mempolicy().

This patch (of 3):

mempolicy_nodemask_intersects() seems to be a general purpose mempolicy
function. In fact it is partially tailored for the OOM purpose instead.
The OOM proper is the only existing user, so rename the function to make
that purpose explicit.

While at it, drop the MPOL_INTERLEAVE case, as those allocations never
have a nodemask defined (see alloc_page_interleave), so this is dead
code, and confusing code at that, because MPOL_INTERLEAVE is a hint
rather than a hard requirement and thus shouldn't be considered during
the OOM. A condensed sketch of the check this reduces to is shown below.
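The sketch below is simplified from the actual hunk further down ('tsk' and 'mask' are the function arguments there); it is meant only to show the shape of the reduced check:

	/*
	 * Only MPOL_BIND is a hard binding; any other policy may fall back
	 * to (and therefore own memory on) other nodes, which keeps the
	 * task eligible for a nodemask-constrained OOM kill.
	 */
	bool ret = true;

	task_lock(tsk);
	if (tsk->mempolicy && tsk->mempolicy->mode == MPOL_BIND)
		ret = nodes_intersects(tsk->mempolicy->v.nodes, *mask);
	task_unlock(tsk);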
The final code can be reduced to a check for MPOL_BIND which is the only memory policy that is a hard requirement and thus relevant to a constrained OOM logic. [mhocko@suse.com: changelog edits] Link: https://lkml.kernel.org/r/1622560492-1294-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622560492-1294-2-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-1-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-2-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang Suggested-by: Michal Hocko Acked-by: Michal Hocko Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Ben Widawsky Cc: Dan Williams Cc: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc: Mel Gorman Cc: Mike Kravetz Cc: Randy Dunlap Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mempolicy.h | 2 +- mm/mempolicy.c | 34 +++++++++------------------------- mm/oom_kill.c | 2 +- 3 files changed, 11 insertions(+), 27 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5f1c74df264d..8773c55c7744 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -150,7 +150,7 @@ extern int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol, nodemask_t **nodemask); extern bool init_nodemask_of_mempolicy(nodemask_t *mask); -extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, +extern bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask); extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b5d95bf1025d..bd213b900e71 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2094,16 +2094,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) #endif /* - * mempolicy_nodemask_intersects + * mempolicy_in_oom_domain * - * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default - * policy. Otherwise, check for intersection between mask and the policy - * nodemask for 'bind' or 'interleave' policy. For 'preferred' or 'local' - * policy, always return true since it may allocate elsewhere on fallback. + * If tsk's mempolicy is "bind", check for intersection between mask and + * the policy nodemask. Otherwise, return true for all other policies + * including "interleave", as a tsk with "interleave" policy may have + * memory allocated from all nodes in system. * * Takes task_lock(tsk) to prevent freeing of its mempolicy. */ -bool mempolicy_nodemask_intersects(struct task_struct *tsk, +bool mempolicy_in_oom_domain(struct task_struct *tsk, const nodemask_t *mask) { struct mempolicy *mempolicy; @@ -2111,29 +2111,13 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk, if (!mask) return ret; + task_lock(tsk); mempolicy = tsk->mempolicy; - if (!mempolicy) - goto out; - - switch (mempolicy->mode) { - case MPOL_PREFERRED: - /* - * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to - * allocate from, they may fallback to other nodes when oom. - * Thus, it's possible for tsk to have allocated memory from - * nodes in mask. 
- */ - break; - case MPOL_BIND: - case MPOL_INTERLEAVE: + if (mempolicy && mempolicy->mode == MPOL_BIND) ret = nodes_intersects(mempolicy->v.nodes, *mask); - break; - default: - BUG(); - } -out: task_unlock(tsk); + return ret; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index eefd3f5fde46..fcc29e9a3064 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -104,7 +104,7 @@ static bool oom_cpuset_eligible(struct task_struct *start, * mempolicy intersects current, otherwise it may be * needlessly killed. */ - ret = mempolicy_nodemask_intersects(tsk, mask); + ret = mempolicy_in_oom_domain(tsk, mask); } else { /* * This is not a mempolicy constrained oom, so only From 7858d7bca7fbbbbd5b940d2ec371b2d060b21b84 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 30 Jun 2021 18:51:00 -0700 Subject: [PATCH 072/190] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy MPOL_LOCAL policy has been setup as a real policy, but it is still handled like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit set, and there are many places having to judge the real 'prefer' or the 'local' policy, which are quite confusing. In current code, there are 4 cases that MPOL_LOCAL are used: 1. user specifies 'local' policy 2. user specifies 'prefer' policy, but with empty nodemask 3. system 'default' policy is used 4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES flag set, and when it is 'rebind' to a nodemask which doesn't contains the 'preferred' node, it will perform as 'local' policy So make 'local' a real policy instead of a fake 'prefer' one, and kill MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading. For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal Hocko pointed out: : I do believe that rebinding preferred policy is just bogus and it should : be dropped altogether on the ground that a preference is a mere hint from : userspace where to start the allocation. Unless I am missing something : cpusets will be always authoritative for the final placement. The : preferred node just acts as a starting point and it should be really : preserved when cpusets changes. Otherwise we have a very subtle behavior : corner cases. So dump all the tricky transformation between 'prefer' and 'local', and just record the new nodemask of rebinding. [feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko] Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com [feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal] Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang Suggested-by: Michal Hocko Acked-by: Michal Hocko Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Ben Widawsky Cc: Dan Williams Cc: Dave Hansen Cc: David Rientjes Cc: Huang Ying Cc: Mel Gorman Cc: Michal Hocko Cc: Mike Kravetz Cc: Randy Dunlap Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/uapi/linux/mempolicy.h | 1 - mm/mempolicy.c | 138 ++++++++++++++------------------- 2 files changed, 57 insertions(+), 82 deletions(-) diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index 4832fd0b5642..19a00bc7fe86 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -60,7 +60,6 @@ enum { * are never OR'ed into the mode in mempolicy API arguments. 
*/ #define MPOL_F_SHARED (1 << 0) /* identify shared policies */ -#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */ #define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */ #define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index bd213b900e71..22addae6c4ee 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -121,8 +121,7 @@ enum zone_type policy_zone = 0; */ static struct mempolicy default_policy = { .refcnt = ATOMIC_INIT(1), /* never free it */ - .mode = MPOL_PREFERRED, - .flags = MPOL_F_LOCAL, + .mode = MPOL_LOCAL, }; static struct mempolicy preferred_node_policy[MAX_NUMNODES]; @@ -200,12 +199,9 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes) static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes) { - if (!nodes) - pol->flags |= MPOL_F_LOCAL; /* local allocation */ - else if (nodes_empty(*nodes)) - return -EINVAL; /* no allowed nodes */ - else - pol->v.preferred_node = first_node(*nodes); + if (nodes_empty(*nodes)) + return -EINVAL; + pol->v.preferred_node = first_node(*nodes); return 0; } @@ -220,8 +216,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes) /* * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if * any, for the new policy. mpol_new() has already validated the nodes - * parameter with respect to the policy mode and flags. But, we need to - * handle an empty nodemask with MPOL_PREFERRED here. + * parameter with respect to the policy mode and flags. * * Must be called holding task's alloc_lock to protect task's mems_allowed * and mempolicy. May also be called holding the mmap_lock for write. @@ -231,33 +226,31 @@ static int mpol_set_nodemask(struct mempolicy *pol, { int ret; - /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */ - if (pol == NULL) + /* + * Default (pol==NULL) resp. local memory policies are not a + * subject of any remapping. They also do not need any special + * constructor. 
+ */ + if (!pol || pol->mode == MPOL_LOCAL) return 0; + /* Check N_MEMORY */ nodes_and(nsc->mask1, cpuset_current_mems_allowed, node_states[N_MEMORY]); VM_BUG_ON(!nodes); - if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes)) - nodes = NULL; /* explicit local allocation */ - else { - if (pol->flags & MPOL_F_RELATIVE_NODES) - mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1); - else - nodes_and(nsc->mask2, *nodes, nsc->mask1); - if (mpol_store_user_nodemask(pol)) - pol->w.user_nodemask = *nodes; - else - pol->w.cpuset_mems_allowed = - cpuset_current_mems_allowed; - } - - if (nodes) - ret = mpol_ops[pol->mode].create(pol, &nsc->mask2); + if (pol->flags & MPOL_F_RELATIVE_NODES) + mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1); else - ret = mpol_ops[pol->mode].create(pol, NULL); + nodes_and(nsc->mask2, *nodes, nsc->mask1); + + if (mpol_store_user_nodemask(pol)) + pol->w.user_nodemask = *nodes; + else + pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed; + + ret = mpol_ops[pol->mode].create(pol, &nsc->mask2); return ret; } @@ -290,13 +283,14 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, if (((flags & MPOL_F_STATIC_NODES) || (flags & MPOL_F_RELATIVE_NODES))) return ERR_PTR(-EINVAL); + + mode = MPOL_LOCAL; } } else if (mode == MPOL_LOCAL) { if (!nodes_empty(*nodes) || (flags & MPOL_F_STATIC_NODES) || (flags & MPOL_F_RELATIVE_NODES)) return ERR_PTR(-EINVAL); - mode = MPOL_PREFERRED; } else if (nodes_empty(*nodes)) return ERR_PTR(-EINVAL); policy = kmem_cache_alloc(policy_cache, GFP_KERNEL); @@ -344,25 +338,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) static void mpol_rebind_preferred(struct mempolicy *pol, const nodemask_t *nodes) { - nodemask_t tmp; - - if (pol->flags & MPOL_F_STATIC_NODES) { - int node = first_node(pol->w.user_nodemask); - - if (node_isset(node, *nodes)) { - pol->v.preferred_node = node; - pol->flags &= ~MPOL_F_LOCAL; - } else - pol->flags |= MPOL_F_LOCAL; - } else if (pol->flags & MPOL_F_RELATIVE_NODES) { - mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes); - pol->v.preferred_node = first_node(tmp); - } else if (!(pol->flags & MPOL_F_LOCAL)) { - pol->v.preferred_node = node_remap(pol->v.preferred_node, - pol->w.cpuset_mems_allowed, - *nodes); - pol->w.cpuset_mems_allowed = *nodes; - } + pol->w.cpuset_mems_allowed = *nodes; } /* @@ -376,7 +352,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask) { if (!pol) return; - if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) && + if (!mpol_store_user_nodemask(pol) && nodes_equal(pol->w.cpuset_mems_allowed, *newmask)) return; @@ -427,6 +403,9 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_bind, .rebind = mpol_rebind_nodemask, }, + [MPOL_LOCAL] = { + .rebind = mpol_rebind_default, + }, }; static int migrate_page_add(struct page *page, struct list_head *pagelist, @@ -919,10 +898,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) case MPOL_INTERLEAVE: *nodes = p->v.nodes; break; + case MPOL_LOCAL: + /* return empty node mask for local allocation */ + break; + case MPOL_PREFERRED: - if (!(p->flags & MPOL_F_LOCAL)) - node_set(p->v.preferred_node, *nodes); - /* else return empty node mask for local allocation */ + node_set(p->v.preferred_node, *nodes); break; default: BUG(); @@ -1894,9 +1875,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) /* Return the node id preferred by the given mempolicy, or the 
given id */ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) { - if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) + if (policy->mode == MPOL_PREFERRED) { nd = policy->v.preferred_node; - else { + } else { /* * __GFP_THISNODE shouldn't even be used with the bind policy * because we might easily break the expectation to stay on the @@ -1933,14 +1914,11 @@ unsigned int mempolicy_slab_node(void) return node; policy = current->mempolicy; - if (!policy || policy->flags & MPOL_F_LOCAL) + if (!policy) return node; switch (policy->mode) { case MPOL_PREFERRED: - /* - * handled MPOL_F_LOCAL above - */ return policy->v.preferred_node; case MPOL_INTERLEAVE: @@ -1960,6 +1938,8 @@ unsigned int mempolicy_slab_node(void) &policy->v.nodes); return z->zone ? zone_to_nid(z->zone) : node; } + case MPOL_LOCAL: + return node; default: BUG(); @@ -2072,16 +2052,18 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) mempolicy = current->mempolicy; switch (mempolicy->mode) { case MPOL_PREFERRED: - if (mempolicy->flags & MPOL_F_LOCAL) - nid = numa_node_id(); - else - nid = mempolicy->v.preferred_node; + nid = mempolicy->v.preferred_node; init_nodemask_of_node(mask, nid); break; case MPOL_BIND: case MPOL_INTERLEAVE: - *mask = mempolicy->v.nodes; + *mask = mempolicy->v.nodes; + break; + + case MPOL_LOCAL: + nid = numa_node_id(); + init_nodemask_of_node(mask, nid); break; default: @@ -2188,7 +2170,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, * If the policy is interleave, or does not allow the current * node in its nodemask, we allocate the standard way. */ - if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL)) + if (pol->mode == MPOL_PREFERRED) hpage_node = pol->v.preferred_node; nmask = policy_nodemask(gfp, pol); @@ -2324,10 +2306,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: return !!nodes_equal(a->v.nodes, b->v.nodes); case MPOL_PREFERRED: - /* a's ->flags is the same as b's */ - if (a->flags & MPOL_F_LOCAL) - return true; return a->v.preferred_node == b->v.preferred_node; + case MPOL_LOCAL: + return true; default: BUG(); return false; @@ -2465,10 +2446,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long break; case MPOL_PREFERRED: - if (pol->flags & MPOL_F_LOCAL) - polnid = numa_node_id(); - else - polnid = pol->v.preferred_node; + polnid = pol->v.preferred_node; + break; + + case MPOL_LOCAL: + polnid = numa_node_id(); break; case MPOL_BIND: @@ -2835,9 +2817,6 @@ void numa_default_policy(void) * Parse and format mempolicy from/to strings */ -/* - * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag. 
- */ static const char * const policy_modes[] = { [MPOL_DEFAULT] = "default", @@ -2915,7 +2894,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) */ if (nodelist) goto out; - mode = MPOL_PREFERRED; break; case MPOL_DEFAULT: /* @@ -2959,7 +2937,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) else if (nodelist) new->v.preferred_node = first_node(nodes); else - new->flags |= MPOL_F_LOCAL; + new->mode = MPOL_LOCAL; /* * Save nodes for contextualization: this will be used to "clone" @@ -3005,12 +2983,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) switch (mode) { case MPOL_DEFAULT: + case MPOL_LOCAL: break; case MPOL_PREFERRED: - if (flags & MPOL_F_LOCAL) - mode = MPOL_LOCAL; - else - node_set(pol->v.preferred_node, nodes); + node_set(pol->v.preferred_node, nodes); break; case MPOL_BIND: case MPOL_INTERLEAVE: From 95837924587c60425f941dc8cbfba61cb964fcb5 Mon Sep 17 00:00:00 2001 From: Feng Tang Date: Wed, 30 Jun 2021 18:51:03 -0700 Subject: [PATCH 073/190] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy Currently the kernel_mbind() and kernel_set_mempolicy() do almost the same operation for parameter sanity check. Add a helper function to unify the code to reduce the redundancy, and make it easier for changing the sanity check code in future. [thanks to David Rientjes for suggesting using helper function instead of macro]. [feng.tang@intel.com: add comment] Link: https://lkml.kernel.org/r/1622560492-1294-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-4-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang Acked-by: Michal Hocko Cc: David Rientjes Cc: Dave Hansen Cc: Ben Widawsky Cc: Andrea Arcangeli Cc: Mel Gorman Cc: Mike Kravetz Cc: Randy Dunlap Cc: Vlastimil Babka Cc: Andi Kleen Cc: Dan Williams Cc: Huang Ying Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mempolicy.c | 50 +++++++++++++++++++++++++++++++------------------- 1 file changed, 31 insertions(+), 19 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 22addae6c4ee..89721501b22c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1441,26 +1441,38 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode, return copy_to_user(mask, nodes_addr(*nodes), copy) ? 
-EFAULT : 0; } +/* Basic parameter sanity check used by both mbind() and set_mempolicy() */ +static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) +{ + *flags = *mode & MPOL_MODE_FLAGS; + *mode &= ~MPOL_MODE_FLAGS; + if ((unsigned int)(*mode) >= MPOL_MAX) + return -EINVAL; + if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) + return -EINVAL; + + return 0; +} + static long kernel_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode, unsigned int flags) { - nodemask_t nodes; - int err; unsigned short mode_flags; + nodemask_t nodes; + int lmode = mode; + int err; start = untagged_addr(start); - mode_flags = mode & MPOL_MODE_FLAGS; - mode &= ~MPOL_MODE_FLAGS; - if (mode >= MPOL_MAX) - return -EINVAL; - if ((mode_flags & MPOL_F_STATIC_NODES) && - (mode_flags & MPOL_F_RELATIVE_NODES)) - return -EINVAL; + err = sanitize_mpol_flags(&lmode, &mode_flags); + if (err) + return err; + err = get_nodes(&nodes, nmask, maxnode); if (err) return err; - return do_mbind(start, len, mode, mode_flags, &nodes, flags); + + return do_mbind(start, len, lmode, mode_flags, &nodes, flags); } SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len, @@ -1474,20 +1486,20 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len, static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask, unsigned long maxnode) { - int err; + unsigned short mode_flags; nodemask_t nodes; - unsigned short flags; + int lmode = mode; + int err; + + err = sanitize_mpol_flags(&lmode, &mode_flags); + if (err) + return err; - flags = mode & MPOL_MODE_FLAGS; - mode &= ~MPOL_MODE_FLAGS; - if ((unsigned int)mode >= MPOL_MAX) - return -EINVAL; - if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES)) - return -EINVAL; err = get_nodes(&nodes, nmask, maxnode); if (err) return err; - return do_set_mempolicy(mode, flags, &nodes); + + return do_set_mempolicy(lmode, mode_flags, &nodes); } SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask, From e5947d23edd897ffe068564e91fd186adb95ee6d Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:07 -0700 Subject: [PATCH 074/190] mm: mempolicy: don't have to split pmd for huge zero page When trying to migrate pages to obey mempolicy, the huge zero page is split by inserting base zero pfn to all PTEs, then the page table walk fallback to PTE level and just skips zero page. Skipping zero page for mempolicy has been the behavior of kernel since v2.6.16 due to commit f4598c8b3678 ("[PATCH] migration: make sure there is no attempt to migrate reserved pages."). So it seems pointless to split huge zero page, it could be just skipped like base zero page. Set ACTION_CONTINUE to prevent the walk_page_range() split the pmd for this case. Link: https://lkml.kernel.org/r/20210609172146.3594-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210604203513.240709-1-shy828301@gmail.com Signed-off-by: Yang Shi Reviewed-by: Zi Yan Acked-by: Michal Hocko Cc: Naoya Horiguchi Cc: Kirill A. 
Shutemov
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/mempolicy.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 89721501b22c..8739bfd967e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -437,7 +437,8 @@ static inline bool queue_pages_required(struct page *page,

 /*
  * queue_pages_pmd() has four possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. huge zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * 2 - THP was split.
@@ -461,8 +462,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 	page = pmd_page(*pmd);
 	if (is_huge_zero_page(page)) {
 		spin_unlock(ptl);
-		__split_huge_pmd(walk->vma, pmd, addr, false, NULL);
-		ret = 2;
+		walk->action = ACTION_CONTINUE;
 		goto out;
 	}
 	if (!queue_pages_required(page, qp))
@@ -489,7 +489,8 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
  * and move them to the pagelist if they do.
  *
  * queue_pages_pte_range() has three possible return values:
- * 0 - pages are placed on the right node or queued successfully.
+ * 0 - pages are placed on the right node or queued successfully, or
+ *     special page is met, i.e. zero page.
  * 1 - there is unmovable page, and MPOL_MF_MOVE* & MPOL_MF_STRICT were
  *     specified.
  * -EIO - only MPOL_MF_STRICT was specified and an existing page was already

From 269fbe72cded0afce0090103e90d2ae8ef8ac5b5 Mon Sep 17 00:00:00 2001
From: Ben Widawsky
Date: Wed, 30 Jun 2021 18:51:10 -0700
Subject: [PATCH 075/190] mm/mempolicy: use unified 'nodes' for bind/interleave/prefer policies

Current structure 'mempolicy' uses a union to store the node info for
bind/interleave/prefer policies.

	union {
		short preferred_node;	/* preferred */
		nodemask_t nodes;	/* interleave/bind */
		/* undefined for default */
	} v;

Since a preferred node can also be represented by a nodemask_t with only
one bit set, unify these policies by using one nodemask_t 'nodes', which
removes a union, simplifies the code and makes it easier to support the
node info of new policies in the future. A short sketch of the new
representation is shown below.
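Sketch of the unified representation (field names as in the patch; 'preferred_nid' and 'nid' are just illustrative variables):

	/* Storing a preferred node now sets a single bit in the mask... */
	nodes_clear(pol->nodes);
	node_set(preferred_nid, pol->nodes);	/* was: pol->v.preferred_node = preferred_nid */

	/* ...and readers pick it back out with first_node(). */
	nid = first_node(pol->nodes);		/* was: nid = pol->v.preferred_node */

Bind and interleave already stored a nodemask, so all three policies can now be handled through the same 'nodes' field.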
Link: https://lore.kernel.org/r/20200630212517.308045-7-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1623399825-75651-1-git-send-email-feng.tang@intel.com Co-developed-by: Feng Tang Signed-off-by: Ben Widawsky Signed-off-by: Feng Tang Cc: Michal Hocko Cc: David Rientjes Cc: Dave Hansen Cc: Andrea Arcangeli Cc: Mel Gorman Cc: Mike Kravetz Cc: Vlastimil Babka Cc: Andi Kleen Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mempolicy.h | 7 +-- mm/mempolicy.c | 96 ++++++++++++++++++--------------------- 2 files changed, 46 insertions(+), 57 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 8773c55c7744..0aaf91b496e2 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -46,11 +46,8 @@ struct mempolicy { atomic_t refcnt; unsigned short mode; /* See MPOL_* above */ unsigned short flags; /* See set_mempolicy() MPOL_F_* above */ - union { - short preferred_node; /* preferred */ - nodemask_t nodes; /* interleave/bind */ - /* undefined for default */ - } v; + nodemask_t nodes; /* interleave/bind/perfer */ + union { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8739bfd967e5..e32360e90274 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -193,7 +193,7 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes) { if (nodes_empty(*nodes)) return -EINVAL; - pol->v.nodes = *nodes; + pol->nodes = *nodes; return 0; } @@ -201,7 +201,9 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes) { if (nodes_empty(*nodes)) return -EINVAL; - pol->v.preferred_node = first_node(*nodes); + + nodes_clear(pol->nodes); + node_set(first_node(*nodes), pol->nodes); return 0; } @@ -209,7 +211,7 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes) { if (nodes_empty(*nodes)) return -EINVAL; - pol->v.nodes = *nodes; + pol->nodes = *nodes; return 0; } @@ -324,7 +326,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) else if (pol->flags & MPOL_F_RELATIVE_NODES) mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes); else { - nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed, + nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed, *nodes); pol->w.cpuset_mems_allowed = *nodes; } @@ -332,7 +334,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) if (nodes_empty(tmp)) tmp = *nodes; - pol->v.nodes = tmp; + pol->nodes = tmp; } static void mpol_rebind_preferred(struct mempolicy *pol, @@ -897,15 +899,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) switch (p->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: - *nodes = p->v.nodes; + case MPOL_PREFERRED: + *nodes = p->nodes; break; case MPOL_LOCAL: /* return empty node mask for local allocation */ break; - - case MPOL_PREFERRED: - node_set(p->v.preferred_node, *nodes); - break; default: BUG(); } @@ -989,7 +988,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, *policy = err; } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { - *policy = next_node_in(current->il_prev, pol->v.nodes); + *policy = next_node_in(current->il_prev, pol->nodes); } else { err = -EINVAL; goto out; @@ -1857,14 +1856,14 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) BUG_ON(dynamic_policy_zone == ZONE_MOVABLE); /* - * if policy->v.nodes has movable memory only, + 
* if policy->nodes has movable memory only, * we apply policy when gfp_zone(gfp) = ZONE_MOVABLE only. * - * policy->v.nodes is intersect with node_states[N_MEMORY]. + * policy->nodes is intersect with node_states[N_MEMORY]. * so if the following test fails, it implies - * policy->v.nodes has movable memory only. + * policy->nodes has movable memory only. */ - if (!nodes_intersects(policy->v.nodes, node_states[N_HIGH_MEMORY])) + if (!nodes_intersects(policy->nodes, node_states[N_HIGH_MEMORY])) dynamic_policy_zone = ZONE_MOVABLE; return zone >= dynamic_policy_zone; @@ -1879,8 +1878,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) /* Lower zones don't get a nodemask applied for MPOL_BIND */ if (unlikely(policy->mode == MPOL_BIND) && apply_policy_zone(policy, gfp_zone(gfp)) && - cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)) - return &policy->v.nodes; + cpuset_nodemask_valid_mems_allowed(&policy->nodes)) + return &policy->nodes; return NULL; } @@ -1889,7 +1888,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) { if (policy->mode == MPOL_PREFERRED) { - nd = policy->v.preferred_node; + nd = first_node(policy->nodes); } else { /* * __GFP_THISNODE shouldn't even be used with the bind policy @@ -1908,7 +1907,7 @@ static unsigned interleave_nodes(struct mempolicy *policy) unsigned next; struct task_struct *me = current; - next = next_node_in(me->il_prev, policy->v.nodes); + next = next_node_in(me->il_prev, policy->nodes); if (next < MAX_NUMNODES) me->il_prev = next; return next; @@ -1932,7 +1931,7 @@ unsigned int mempolicy_slab_node(void) switch (policy->mode) { case MPOL_PREFERRED: - return policy->v.preferred_node; + return first_node(policy->nodes); case MPOL_INTERLEAVE: return interleave_nodes(policy); @@ -1948,7 +1947,7 @@ unsigned int mempolicy_slab_node(void) enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL); zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK]; z = first_zones_zonelist(zonelist, highest_zoneidx, - &policy->v.nodes); + &policy->nodes); return z->zone ? zone_to_nid(z->zone) : node; } case MPOL_LOCAL: @@ -1961,12 +1960,12 @@ unsigned int mempolicy_slab_node(void) /* * Do static interleaving for a VMA with known offset @n. Returns the n'th - * node in pol->v.nodes (starting from n=0), wrapping around if n exceeds the + * node in pol->nodes (starting from n=0), wrapping around if n exceeds the * number of present nodes. 
*/ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) { - unsigned nnodes = nodes_weight(pol->v.nodes); + unsigned nnodes = nodes_weight(pol->nodes); unsigned target; int i; int nid; @@ -1974,9 +1973,9 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) if (!nnodes) return numa_node_id(); target = (unsigned int)n % nnodes; - nid = first_node(pol->v.nodes); + nid = first_node(pol->nodes); for (i = 0; i < target; i++) - nid = next_node(nid, pol->v.nodes); + nid = next_node(nid, pol->nodes); return nid; } @@ -2032,7 +2031,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, } else { nid = policy_node(gfp_flags, *mpol, numa_node_id()); if ((*mpol)->mode == MPOL_BIND) - *nodemask = &(*mpol)->v.nodes; + *nodemask = &(*mpol)->nodes; } return nid; } @@ -2056,7 +2055,6 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, bool init_nodemask_of_mempolicy(nodemask_t *mask) { struct mempolicy *mempolicy; - int nid; if (!(mask && current->mempolicy)) return false; @@ -2065,18 +2063,13 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) mempolicy = current->mempolicy; switch (mempolicy->mode) { case MPOL_PREFERRED: - nid = mempolicy->v.preferred_node; - init_nodemask_of_node(mask, nid); - break; - case MPOL_BIND: case MPOL_INTERLEAVE: - *mask = mempolicy->v.nodes; + *mask = mempolicy->nodes; break; case MPOL_LOCAL: - nid = numa_node_id(); - init_nodemask_of_node(mask, nid); + init_nodemask_of_node(mask, numa_node_id()); break; default: @@ -2110,7 +2103,7 @@ bool mempolicy_in_oom_domain(struct task_struct *tsk, task_lock(tsk); mempolicy = tsk->mempolicy; if (mempolicy && mempolicy->mode == MPOL_BIND) - ret = nodes_intersects(mempolicy->v.nodes, *mask); + ret = nodes_intersects(mempolicy->nodes, *mask); task_unlock(tsk); return ret; @@ -2184,7 +2177,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, * node in its nodemask, we allocate the standard way. */ if (pol->mode == MPOL_PREFERRED) - hpage_node = pol->v.preferred_node; + hpage_node = first_node(pol->nodes); nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { @@ -2317,9 +2310,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) switch (a->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: - return !!nodes_equal(a->v.nodes, b->v.nodes); case MPOL_PREFERRED: - return a->v.preferred_node == b->v.preferred_node; + return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; default: @@ -2459,7 +2451,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long break; case MPOL_PREFERRED: - polnid = pol->v.preferred_node; + polnid = first_node(pol->nodes); break; case MPOL_LOCAL: @@ -2469,7 +2461,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long case MPOL_BIND: /* Optimize placement among multiple nodes via NUMA balancing */ if (pol->flags & MPOL_F_MORON) { - if (node_isset(thisnid, pol->v.nodes)) + if (node_isset(thisnid, pol->nodes)) break; goto out; } @@ -2480,12 +2472,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long * else select nearest allowed node, if any. * If no allowed nodes, use current [!misplaced]. 
*/ - if (node_isset(curnid, pol->v.nodes)) + if (node_isset(curnid, pol->nodes)) goto out; z = first_zones_zonelist( node_zonelist(numa_node_id(), GFP_HIGHUSER), gfp_zone(GFP_HIGHUSER), - &pol->v.nodes); + &pol->nodes); polnid = zone_to_nid(z->zone); break; @@ -2688,7 +2680,7 @@ int mpol_set_shared_policy(struct shared_policy *info, vma->vm_pgoff, sz, npol ? npol->mode : -1, npol ? npol->flags : -1, - npol ? nodes_addr(npol->v.nodes)[0] : NUMA_NO_NODE); + npol ? nodes_addr(npol->nodes)[0] : NUMA_NO_NODE); if (npol) { new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol); @@ -2786,7 +2778,7 @@ void __init numa_policy_init(void) .refcnt = ATOMIC_INIT(1), .mode = MPOL_PREFERRED, .flags = MPOL_F_MOF | MPOL_F_MORON, - .v = { .preferred_node = nid, }, + .nodes = nodemask_of_node(nid), }; } @@ -2945,12 +2937,14 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) * Save nodes for mpol_to_str() to show the tmpfs mount options * for /proc/mounts, /proc/pid/mounts and /proc/pid/mountinfo. */ - if (mode != MPOL_PREFERRED) - new->v.nodes = nodes; - else if (nodelist) - new->v.preferred_node = first_node(nodes); - else + if (mode != MPOL_PREFERRED) { + new->nodes = nodes; + } else if (nodelist) { + nodes_clear(new->nodes); + node_set(first_node(nodes), new->nodes); + } else { new->mode = MPOL_LOCAL; + } /* * Save nodes for contextualization: this will be used to "clone" @@ -2999,11 +2993,9 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_LOCAL: break; case MPOL_PREFERRED: - node_set(pol->v.preferred_node, nodes); - break; case MPOL_BIND: case MPOL_INTERLEAVE: - nodes = pol->v.nodes; + nodes = pol->nodes; break; default: WARN_ON_ONCE(1); From 51c656aef629bae94f2b07fcee7eabe280b905ea Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 30 Jun 2021 18:51:13 -0700 Subject: [PATCH 076/190] include/linux/mmzone.h: add documentation for pfn_valid() Patch series "arm64: drop pfn_valid_within() and simplify pfn_valid()", v4. These patches aim to remove CONFIG_HOLES_IN_ZONE and essentially hardwire pfn_valid_within() to 1. The idea is to mark NOMAP pages as reserved in the memory map and restore the intended semantics of pfn_valid() to designate availability of struct page for a pfn. With this the core mm will be able to cope with the fact that it cannot use NOMAP pages and the holes created by NOMAP ranges within MAX_ORDER blocks will be treated correctly even without the need for pfn_valid_within. This patch (of 4): Add comment describing the semantics of pfn_valid() that clarifies that pfn_valid() only checks for availability of a memory map entry (i.e. struct page) for a PFN rather than availability of usable memory backing that PFN. The most "generic" version of pfn_valid() used by the configurations with SPARSEMEM enabled resides in include/linux/mmzone.h so this is the most suitable place for documentation about semantics of pfn_valid(). 
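A minimal caller-side sketch of the documented semantics (illustrative only, not part of the patch):

	if (pfn_valid(pfn)) {
		/* Safe: a struct page exists for this pfn... */
		struct page *page = pfn_to_page(pfn);

		/*
		 * ...but that alone does not mean usable RAM: the entry may
		 * describe a hole or an unusable frame (for example one
		 * marked PageReserved()), so callers still need their own
		 * usability checks.
		 */
	}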
Link: https://lkml.kernel.org/r/20210511100550.28178-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-2-rppt@kernel.org Signed-off-by: Mike Rapoport Suggested-by: Anshuman Khandual Reviewed-by: Anshuman Khandual Acked-by: Ard Biesheuvel Reviewed-by: Kefeng Wang Cc: Catalin Marinas Cc: David Hildenbrand Cc: Marc Zyngier Cc: Mark Rutland Cc: Mike Rapoport Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 265a32e1ff74..7da43337ad23 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1445,6 +1445,17 @@ static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn) #endif #ifndef CONFIG_HAVE_ARCH_PFN_VALID +/** + * pfn_valid - check if there is a valid memory map entry for a PFN + * @pfn: the page frame number to check + * + * Check if there is a valid memory map entry aka struct page for the @pfn. + * Note, that availability of the memory map entry does not imply that + * there is actual usable memory at that @pfn. The struct page may + * represent a hole or an unusable page frame. + * + * Return: 1 for PFNs that have memory map entries and 0 otherwise + */ static inline int pfn_valid(unsigned long pfn) { struct mem_section *ms; From 9092d4f7a1f846bcc72e9aace4ed64ed3fc4aa32 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 30 Jun 2021 18:51:16 -0700 Subject: [PATCH 077/190] memblock: update initialization of reserved pages The struct pages representing a reserved memory region are initialized using reserve_bootmem_range() function. This function is called for each reserved region just before the memory is freed from memblock to the buddy page allocator. The struct pages for MEMBLOCK_NOMAP regions are kept with the default values set by the memory map initialization which makes it necessary to have a special treatment for such pages in pfn_valid() and pfn_valid_within(). Split out initialization of the reserved pages to a function with a meaningful name and treat the MEMBLOCK_NOMAP regions the same way as the reserved regions and mark struct pages for the NOMAP regions as PageReserved. 
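The practical effect for a generic pfn walker can be sketched as follows (illustrative; this is also what lets the later patches drop pfn_valid_within() on arm64):

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page;

		if (!pfn_valid(pfn))
			continue;
		page = pfn_to_page(pfn);
		if (PageReserved(page))
			continue;	/* now covers bootmem reservations and NOMAP ranges alike */

		/* ... operate on ordinary pages ... */
	}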
Link: https://lkml.kernel.org/r/20210511100550.28178-3-rppt@kernel.org Signed-off-by: Mike Rapoport Reviewed-by: David Hildenbrand Reviewed-by: Anshuman Khandual Acked-by: Ard Biesheuvel Reviewed-by: Kefeng Wang Cc: Catalin Marinas Cc: Marc Zyngier Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memblock.h | 4 +++- mm/memblock.c | 28 ++++++++++++++++++++++++++-- 2 files changed, 29 insertions(+), 3 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 552309342c38..cbf46f56d105 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -30,7 +30,9 @@ extern unsigned long long max_possible_pfn; * @MEMBLOCK_NONE: no special request * @MEMBLOCK_HOTPLUG: hotpluggable region * @MEMBLOCK_MIRROR: mirrored region - * @MEMBLOCK_NOMAP: don't add to kernel direct mapping + * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as + * reserved in the memory map; refer to memblock_mark_nomap() description + * for further details */ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ diff --git a/mm/memblock.c b/mm/memblock.c index 123feef5259d..3e4acbf03ab7 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -906,6 +906,11 @@ int __init_memblock memblock_mark_mirror(phys_addr_t base, phys_addr_t size) * @base: the base phys addr of the region * @size: the size of the region * + * The memory regions marked with %MEMBLOCK_NOMAP will not be added to the + * direct mapping of the physical memory. These regions will still be + * covered by the memory map. The struct page representing NOMAP memory + * frames in the memory map will be PageReserved() + * * Return: 0 on success, -errno on failure. */ int __init_memblock memblock_mark_nomap(phys_addr_t base, phys_addr_t size) @@ -2002,6 +2007,26 @@ static unsigned long __init __free_memory_core(phys_addr_t start, return end_pfn - start_pfn; } +static void __init memmap_init_reserved_pages(void) +{ + struct memblock_region *region; + phys_addr_t start, end; + u64 i; + + /* initialize struct pages for the reserved regions */ + for_each_reserved_mem_range(i, &start, &end) + reserve_bootmem_region(start, end); + + /* and also treat struct pages for the NOMAP regions as PageReserved */ + for_each_mem_region(region) { + if (memblock_is_nomap(region)) { + start = region->base; + end = start + region->size; + reserve_bootmem_region(start, end); + } + } +} + static unsigned long __init free_low_memory_core_early(void) { unsigned long count = 0; @@ -2010,8 +2035,7 @@ static unsigned long __init free_low_memory_core_early(void) memblock_clear_hotplug(0, -1); - for_each_reserved_mem_range(i, &start, &end) - reserve_bootmem_region(start, end); + memmap_init_reserved_pages(); /* * We need to use NUMA_NO_NODE instead of NODE_DATA(0)->node_id From 873ba463914cf484371cba06959d320f9d3121ca Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 30 Jun 2021 18:51:19 -0700 Subject: [PATCH 078/190] arm64: decouple check whether pfn is in linear map from pfn_valid() The intended semantics of pfn_valid() is to verify whether there is a struct page for the pfn in question and nothing else. Yet, on arm64 it is used to distinguish memory areas that are mapped in the linear map vs those that require ioremap() to access them. Introduce a dedicated pfn_is_map_memory() wrapper for memblock_is_map_memory() to perform such check and use it where appropriate. Using a wrapper allows to avoid cyclic include dependencies. 
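Sketch of the resulting split in responsibilities (mirroring the call sites changed below):

	/* Two different questions, now asked with two different helpers: */
	pfn_valid(pfn);		/* is there a memory map entry (struct page) for this pfn? */
	pfn_is_map_memory(pfn);	/* is this pfn normal RAM present in the linear map (not NOMAP)? */

	/* Example: ioremap() must refuse normal RAM, which is the second question. */
	if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr))))
		return NULL;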
While here also update style of pfn_valid() so that both pfn_valid() and pfn_is_map_memory() declarations will be consistent. Link: https://lkml.kernel.org/r/20210511100550.28178-4-rppt@kernel.org Signed-off-by: Mike Rapoport Acked-by: David Hildenbrand Acked-by: Ard Biesheuvel Reviewed-by: Kefeng Wang Cc: Anshuman Khandual Cc: Catalin Marinas Cc: Marc Zyngier Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/include/asm/memory.h | 2 +- arch/arm64/include/asm/page.h | 3 ++- arch/arm64/kvm/mmu.c | 2 +- arch/arm64/mm/init.c | 12 ++++++++++++ arch/arm64/mm/ioremap.c | 4 ++-- arch/arm64/mm/mmu.c | 2 +- 6 files changed, 19 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index 87b90dc27a43..9027b7e16c4c 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -369,7 +369,7 @@ static inline void *phys_to_virt(phys_addr_t x) #define virt_addr_valid(addr) ({ \ __typeof__(addr) __addr = __tag_reset(addr); \ - __is_lm_address(__addr) && pfn_valid(virt_to_pfn(__addr)); \ + __is_lm_address(__addr) && pfn_is_map_memory(virt_to_pfn(__addr)); \ }) void dump_mem_limit(void); diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 012cffc574e8..75ddfe671393 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -37,7 +37,8 @@ void copy_highpage(struct page *to, struct page *from); typedef struct page *pgtable_t; -extern int pfn_valid(unsigned long); +int pfn_valid(unsigned long pfn); +int pfn_is_map_memory(unsigned long pfn); #include diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 74b3c1a3ff5a..5e0d40f9fb86 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -85,7 +85,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm) static bool kvm_is_device_pfn(unsigned long pfn) { - return !pfn_valid(pfn); + return !pfn_is_map_memory(pfn); } static void *stage2_memcache_zalloc_page(void *arg) diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index e55409caaee3..148e752a70f7 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -256,6 +256,18 @@ int pfn_valid(unsigned long pfn) } EXPORT_SYMBOL(pfn_valid); +int pfn_is_map_memory(unsigned long pfn) +{ + phys_addr_t addr = PFN_PHYS(pfn); + + /* avoid false positives for bogus PFNs, see comment in pfn_valid() */ + if (PHYS_PFN(addr) != pfn) + return 0; + + return memblock_is_map_memory(addr); +} +EXPORT_SYMBOL(pfn_is_map_memory); + static phys_addr_t memory_limit = PHYS_ADDR_MAX; /* diff --git a/arch/arm64/mm/ioremap.c b/arch/arm64/mm/ioremap.c index b5e83c46b23e..b7c81dacabf0 100644 --- a/arch/arm64/mm/ioremap.c +++ b/arch/arm64/mm/ioremap.c @@ -43,7 +43,7 @@ static void __iomem *__ioremap_caller(phys_addr_t phys_addr, size_t size, /* * Don't allow RAM to be mapped. */ - if (WARN_ON(pfn_valid(__phys_to_pfn(phys_addr)))) + if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr)))) return NULL; area = get_vm_area_caller(size, VM_IOREMAP, caller); @@ -84,7 +84,7 @@ EXPORT_SYMBOL(iounmap); void __iomem *ioremap_cache(phys_addr_t phys_addr, size_t size) { /* For normal memory we already have a cacheable mapping. 
*/ - if (pfn_valid(__phys_to_pfn(phys_addr))) + if (pfn_is_map_memory(__phys_to_pfn(phys_addr))) return (void __iomem *)__phys_to_virt(phys_addr); return __ioremap_caller(phys_addr, size, __pgprot(PROT_NORMAL), diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index add67deb4910..7553a7eab3db 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -82,7 +82,7 @@ void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd) pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn, unsigned long size, pgprot_t vma_prot) { - if (!pfn_valid(pfn)) + if (!pfn_is_map_memory(pfn)) return pgprot_noncached(vma_prot); else if (file->f_flags & O_SYNC) return pgprot_writecombine(vma_prot); From a7d9f306ba7052056edf9ccae596aeb400226af8 Mon Sep 17 00:00:00 2001 From: Mike Rapoport Date: Wed, 30 Jun 2021 18:51:22 -0700 Subject: [PATCH 079/190] arm64: drop pfn_valid_within() and simplify pfn_valid() The arm64's version of pfn_valid() differs from the generic because of two reasons: * Parts of the memory map are freed during boot. This makes it necessary to verify that there is actual physical memory that corresponds to a pfn which is done by querying memblock. * There are NOMAP memory regions. These regions are not mapped in the linear map and until the previous commit the struct pages representing these areas had default values. As the consequence of absence of the special treatment of NOMAP regions in the memory map it was necessary to use memblock_is_map_memory() in pfn_valid() and to have pfn_valid_within() aliased to pfn_valid() so that generic mm functionality would not treat a NOMAP page as a normal page. Since the NOMAP regions are now marked as PageReserved(), pfn walkers and the rest of core mm will treat them as unusable memory and thus pfn_valid_within() is no longer required at all and can be disabled on arm64. pfn_valid() can be slightly simplified by replacing memblock_is_map_memory() with memblock_is_memory(). [rppt@kernel.org: fix merge fix] Link: https://lkml.kernel.org/r/YJtoQhidtIJOhYsV@kernel.org Link: https://lkml.kernel.org/r/20210511100550.28178-5-rppt@kernel.org Signed-off-by: Mike Rapoport Acked-by: David Hildenbrand Acked-by: Ard Biesheuvel Reviewed-by: Kefeng Wang Cc: Anshuman Khandual Cc: Catalin Marinas Cc: Marc Zyngier Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/Kconfig | 1 - arch/arm64/mm/init.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 937bdf0c3859..59c9255ab9d0 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -201,7 +201,6 @@ config ARM64 select HAVE_KPROBES select HAVE_KRETPROBES select HAVE_GENERIC_VDSO - select HOLES_IN_ZONE select IOMMU_DMA if IOMMU_SUPPORT select IRQ_DOMAIN select IRQ_FORCED_THREADING diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 148e752a70f7..725aa84f2faa 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -252,7 +252,7 @@ int pfn_valid(unsigned long pfn) if (!early_section(ms)) return pfn_section_valid(ms, pfn); - return memblock_is_map_memory(addr); + return memblock_is_memory(addr); } EXPORT_SYMBOL(pfn_valid); From 16c9afc776608324ca71c0bc354987bab532f51d Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 30 Jun 2021 18:51:26 -0700 Subject: [PATCH 080/190] arm64/mm: drop HAVE_ARCH_PFN_VALID CONFIG_SPARSEMEM_VMEMMAP is now the only available memory model on arm64 platforms and free_unused_memmap() would just return without creating any holes in the memmap mapping. 
There is no need for any special handling in pfn_valid() and HAVE_ARCH_PFN_VALID can just be dropped. This also moves the pfn upper bits sanity check into generic pfn_valid(). Link: https://lkml.kernel.org/r/1621947349-25421-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Acked-by: David Hildenbrand Acked-by: Mike Rapoport Cc: Catalin Marinas Cc: Will Deacon Cc: David Hildenbrand Cc: Mike Rapoport Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/Kconfig | 1 - arch/arm64/include/asm/page.h | 1 - arch/arm64/mm/init.c | 37 ----------------------------------- include/linux/mmzone.h | 9 +++++++++ 4 files changed, 9 insertions(+), 39 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 59c9255ab9d0..84f0216d3c5c 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -154,7 +154,6 @@ config ARM64 select HAVE_ARCH_KGDB select HAVE_ARCH_MMAP_RND_BITS select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT - select HAVE_ARCH_PFN_VALID select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET select HAVE_ARCH_SECCOMP_FILTER diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h index 75ddfe671393..fcbef3eec4b2 100644 --- a/arch/arm64/include/asm/page.h +++ b/arch/arm64/include/asm/page.h @@ -37,7 +37,6 @@ void copy_highpage(struct page *to, struct page *from); typedef struct page *pgtable_t; -int pfn_valid(unsigned long pfn); int pfn_is_map_memory(unsigned long pfn); #include diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 725aa84f2faa..49019ea0c8a8 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -219,43 +219,6 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max) free_area_init(max_zone_pfns); } -int pfn_valid(unsigned long pfn) -{ - phys_addr_t addr = PFN_PHYS(pfn); - struct mem_section *ms; - - /* - * Ensure the upper PAGE_SHIFT bits are clear in the - * pfn. Else it might lead to false positives when - * some of the upper bits are set, but the lower bits - * match a valid pfn. - */ - if (PHYS_PFN(addr) != pfn) - return 0; - - if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) - return 0; - - ms = __pfn_to_section(pfn); - if (!valid_section(ms)) - return 0; - - /* - * ZONE_DEVICE memory does not have the memblock entries. - * memblock_is_map_memory() check for ZONE_DEVICE based - * addresses will always fail. Even the normal hotplugged - * memory will never have MEMBLOCK_NOMAP flag set in their - * memblock entries. Skip memblock search for all non early - * memory sections covering all of hotplug memory including - * both normal and ZONE_DEVICE based. - */ - if (!early_section(ms)) - return pfn_section_valid(ms, pfn); - - return memblock_is_memory(addr); -} -EXPORT_SYMBOL(pfn_valid); - int pfn_is_map_memory(unsigned long pfn) { phys_addr_t addr = PFN_PHYS(pfn); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 7da43337ad23..7bc7e41b6c31 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1460,6 +1460,15 @@ static inline int pfn_valid(unsigned long pfn) { struct mem_section *ms; + /* + * Ensure the upper PAGE_SHIFT bits are clear in the + * pfn. Else it might lead to false positives when + * some of the upper bits are set, but the lower bits + * match a valid pfn. 
+ */ + if (PHYS_PFN(PFN_PHYS(pfn)) != pfn) + return 0; + if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS) return 0; ms = __nr_to_section(pfn_to_section_nr(pfn)); From 6acfb5ba150cf75005ce85e0e25d79ef2fec287c Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Wed, 30 Jun 2021 18:51:29 -0700 Subject: [PATCH 081/190] mm: migrate: fix missing update page_private to hugetlb_page_subpool Commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") converted page.private to hold hugetlb specific page flags, so we should use hugetlb_page_subpool() to get the subpool pointer instead of page_private(). The stale check 'could' prevent the migration of hugetlb pages. page_private(hpage) is now used for hugetlb page specific flags. At migration time, the only flag which could be set is HPageVmemmapOptimized. This flag will only be set if the new vmemmap reduction feature is enabled. In addition, !page_mapping() implies an anonymous mapping. So, this will prevent migration of hugetlb pages in anonymous mappings if the vmemmap reduction feature is enabled. In addition, that if statement checked for the rare race condition of a page being migrated while in the process of being freed. Since that check is now wrong, we could leak hugetlb subpool usage counts. The commit forgot to update the check in the page migration routine, so fix it. [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy] Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Muchun Song Reported-by: Anshuman Khandual Reviewed-by: Mike Kravetz Acked-by: Michal Hocko Tested-by: Anshuman Khandual [arm64] Cc: Oscar Salvador Cc: David Hildenbrand Cc: Matthew Wilcox Cc: Xiongchun Duan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/hugetlb.h | 5 +++++ mm/migrate.c | 2 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index f11ba701e199..a58e11f2db15 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -898,6 +898,11 @@ static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, #else /* CONFIG_HUGETLB_PAGE */ struct hstate {}; +static inline struct hugepage_subpool *hugetlb_page_subpool(struct page *hpage) +{ + return NULL; +} + static inline int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) { diff --git a/mm/migrate.c b/mm/migrate.c index 8fc766e52e52..9bd6bc12aa72 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1293,7 +1293,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, * page_mapping() set, hugetlbfs specific move page routine will not * be called and we could leak usage counts for subpools. */ - if (page_private(hpage) && !page_mapping(hpage)) { + if (hugetlb_page_subpool(hpage) && !page_mapping(hpage)) { rc = -EBUSY; goto out_unlock; } From eb6ecbed0aa27360712d0674bf132843a9567344 Mon Sep 17 00:00:00 2001 From: Collin Fijalkovich Date: Wed, 30 Jun 2021 18:51:32 -0700 Subject: [PATCH 082/190] mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs Transparent huge pages are supported for read-only non-shmem files, but are only used for vmas with VM_DENYWRITE. This condition ensures that file THPs are protected from writes while an application is running (ETXTBSY).
Any existing file THPs are then dropped from the page cache when a file is opened for write in do_dentry_open(). Since sys_mmap ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas produced by execve(). Systems that make heavy use of shared libraries (e.g. Android) are unable to apply VM_DENYWRITE through the dynamic linker, preventing them from benefiting from the resultant reduced contention on the TLB. This patch reduces the constraint on file THPs allowing use with any executable mapping from a file not opened for write (see inode_is_open_for_write()). It also introduces additional conditions to ensure that files opened for write will never be backed by file THPs. Restricting the use of THPs to executable mappings eliminates the risk that a read-only file later opened for write would encounter significant latencies due to page cache truncation. The ld linker flag '-z max-page-size=(hugepage size)' can be used to produce executables with the necessary layout. The dynamic linker must map these file's segments at a hugepage size aligned vma for the mapping to be backed with THPs. Comparison of the performance characteristics of 4KB and 2MB-backed libraries follows; the Android dex2oat tool was used to AOT compile an example application on a single ARM core. 4KB Pages: ========== count event_name # count / runtime 598,995,035,942 cpu-cycles # 1.800861 GHz 81,195,620,851 raw-stall-frontend # 244.112 M/sec 347,754,466,597 iTLB-loads # 1.046 G/sec 2,970,248,900 iTLB-load-misses # 0.854122% miss rate Total test time: 332.854998 seconds. 2MB Pages: ========== count event_name # count / runtime 592,872,663,047 cpu-cycles # 1.800358 GHz 76,485,624,143 raw-stall-frontend # 232.261 M/sec 350,478,413,710 iTLB-loads # 1.064 G/sec 803,233,322 iTLB-load-misses # 0.229182% miss rate Total test time: 329.826087 seconds A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use: /apex/com.android.art/lib64/libart.so FilePmdMapped: 4096 kB /apex/com.android.art/lib64/libart-compiler.so FilePmdMapped: 2048 kB Link: https://lkml.kernel.org/r/20210406000930.3455850-1-cfijalkovich@google.com Signed-off-by: Collin Fijalkovich Acked-by: Hugh Dickins Reviewed-by: William Kucharski Acked-by: Song Liu Cc: Suren Baghdasaryan Cc: Hridya Valsaraju Cc: Kalesh Singh Cc: Tim Murray Cc: Matthew Wilcox Cc: Alexander Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/open.c | 13 +++++++++++-- mm/khugepaged.c | 16 +++++++++++++++- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/fs/open.c b/fs/open.c index e53af13b5835..f76e960d10ea 100644 --- a/fs/open.c +++ b/fs/open.c @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f, * XXX: Huge page cache doesn't support writing yet. Drop all page * cache for this file before processing writes. */ - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) - truncate_pagecache(inode, 0); + if (f->f_mode & FMODE_WRITE) { + /* + * Paired with smp_mb() in collapse_file() to ensure nr_thps + * is up to date and the update to i_writecount by + * get_write_access() is visible. Ensures subsequent insertion + * of THPs into the page cache will fail. + */ + smp_mb(); + if (filemap_nr_thps(inode->i_mapping)) + truncate_pagecache(inode, 0); + } return 0; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index d97b20fad6e8..b0412be08fa2 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -457,7 +457,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, /* Read-only file mappings need to be aligned for THP to work. 
*/ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && - (vm_flags & VM_DENYWRITE)) { + !inode_is_open_for_write(vma->vm_file->f_inode) && + (vm_flags & VM_EXEC)) { return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, HPAGE_PMD_NR); } @@ -1862,6 +1863,19 @@ static void collapse_file(struct mm_struct *mm, else { __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); filemap_nr_thps_inc(mapping); + /* + * Paired with smp_mb() in do_dentry_open() to ensure + * i_writecount is up to date and the update to nr_thps is + * visible. Ensures the page cache will be truncated if the + * file is opened writable. + */ + smp_mb(); + if (inode_is_open_for_write(mapping->host)) { + result = SCAN_FAIL; + __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); + filemap_nr_thps_dec(mapping); + goto xa_locked; + } } if (nr_none) { From 5db4f15c4fd7ae74dd40c6f84bf56dfcf13d10cf Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:35 -0700 Subject: [PATCH 083/190] mm: memory: add orig_pmd to struct vm_fault Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3. When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. It is definitely a maintenance burden to keep two THP migration implementations for different code paths and it is more error prone. Using the generic THP migration implementation allows us to remove the duplicate code and some hacks needed by the old ad hoc implementation. A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both THP and NUMA balancing. Most of them support THP migration, except for S390. Zi Yan tried to add THP migration support for S390 before but it was not accepted due to the design of S390 PMD. For the discussion, please see: https://lkml.org/lkml/2018/4/27/953. Per the discussion with Gerald Schaefer in v1 it is acceptable to skip huge PMD for S390 for now. I saw there were some hacks about gup from git history, but I didn't figure out if they have been removed or not since I just found FOLL_NUMA code in the current gup implementation and it seems useful. Patch #1 ~ #2 are preparation patches. Patch #3 is the real meat. Patch #4 ~ #6 keep consistent counters and behaviors with before. Patch #7 skips changing huge PMD to prot_none if THP migration is not supported. Test ---- Did some tests to measure the latency of do_huge_pmd_numa_page. The test VM has 80 vcpus and 64G memory. The test would create 2 processes to consume 128G memory together which would incur memory pressure to cause THP splits. And it also creates 80 processes to hog cpu, and the memory consumer processes are bound to different nodes periodically in order to increase NUMA faults. The below test script is used: echo 3 > /proc/sys/vm/drop_caches # Run stress-ng for 24 hours ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h & PID=$!
./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h & # Wait for vm stressors forked sleep 5 PID_1=`pgrep -P $PID | awk 'NR == 1'` PID_2=`pgrep -P $PID | awk 'NR == 2'` JOB1=`pgrep -P $PID_1` JOB2=`pgrep -P $PID_2` # Bind load jobs to different nodes periodically to force generate # cross node memory access while [ -d "/proc/$PID" ] do taskset -apc 8 $JOB1 taskset -apc 8 $JOB2 sleep 300 taskset -apc 58 $JOB1 taskset -apc 58 $JOB2 sleep 300 done With the above test the histogram of latency of do_huge_pmd_numa_page is as shown below. Since the number of do_huge_pmd_numa_page varies drastically for each run (should be due to scheduler), so I converted the raw number to percentage. patched base @us[stress-ng]: [0] 3.57% 0.16% [1] 55.68% 18.36% [2, 4) 10.46% 40.44% [4, 8) 7.26% 17.82% [8, 16) 21.12% 13.41% [16, 32) 1.06% 4.27% [32, 64) 0.56% 4.07% [64, 128) 0.16% 0.35% [128, 256) < 0.1% < 0.1% [256, 512) < 0.1% < 0.1% [512, 1K) < 0.1% < 0.1% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) < 0.1% < 0.1% [16K, 32K) < 0.1% < 0.1% [32K, 64K) < 0.1% < 0.1% Per the result, patched kernel is even slightly better than the base kernel. I think this is because the lock contention against THP split is less than base kernel due to the refactor. To exclude the affect from THP split, I also did test w/o memory pressure. No obvious regression is spotted. The below is the test result *w/o* memory pressure. patched base @us[stress-ng]: [0] 7.97% 18.4% [1] 69.63% 58.24% [2, 4) 4.18% 2.63% [4, 8) 0.22% 0.17% [8, 16) 1.03% 0.92% [16, 32) 0.14% < 0.1% [32, 64) < 0.1% < 0.1% [64, 128) < 0.1% < 0.1% [128, 256) < 0.1% < 0.1% [256, 512) 0.45% 1.19% [512, 1K) 15.45% 17.27% [1K, 2K) < 0.1% < 0.1% [2K, 4K) < 0.1% < 0.1% [4K, 8K) < 0.1% < 0.1% [8K, 16K) 0.86% 0.88% [16K, 32K) < 0.1% 0.15% [32K, 64K) < 0.1% < 0.1% [64K, 128K) < 0.1% < 0.1% [128K, 256K) < 0.1% < 0.1% The series also survived a series of tests that exercise NUMA balancing migrations by Mel. This patch (of 7): Add orig_pmd to struct vm_fault so the "orig_pmd" parameter used by huge page fault could be removed, just like its PTE counterpart does. Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com Signed-off-by: Yang Shi Acked-by: Mel Gorman Cc: Kirill A. 
Shutemov Cc: Zi Yan Cc: Huang Ying Cc: Michal Hocko Cc: Hugh Dickins Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Christian Borntraeger Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/huge_mm.h | 9 ++++----- include/linux/mm.h | 7 ++++++- mm/huge_memory.c | 9 ++++++--- mm/memory.c | 26 +++++++++++++------------- 4 files changed, 29 insertions(+), 22 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 939f21b69ead..f123e15d966e 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -11,7 +11,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf); int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr, struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma); -void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd); +void huge_pmd_set_accessed(struct vm_fault *vmf); int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm, pud_t *dst_pud, pud_t *src_pud, unsigned long addr, struct vm_area_struct *vma); @@ -24,7 +24,7 @@ static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) } #endif -vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd); +vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf); struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd, unsigned int flags); @@ -288,7 +288,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr, struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap); -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t orig_pmd); +vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); extern struct page *huge_zero_page; extern unsigned long huge_zero_pfn; @@ -441,8 +441,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, return NULL; } -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, - pmd_t orig_pmd) +static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { return 0; } diff --git a/include/linux/mm.h b/include/linux/mm.h index aa875dacd9c3..3cbd2d6d248e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -550,7 +550,12 @@ struct vm_fault { pud_t *pud; /* Pointer to pud entry matching * the 'address' */ - pte_t orig_pte; /* Value of PTE at the time of fault */ + union { + pte_t orig_pte; /* Value of PTE at the time of fault */ + pmd_t orig_pmd; /* Value of PMD at the time of fault, + * used by PMD fault only. 
+ */ + }; struct page *cow_page; /* Page handler may use for COW fault */ struct page *page; /* ->fault handlers should return a diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 40a90ff18180..f3a3138d33e9 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1257,11 +1257,12 @@ void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud) } #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ -void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) +void huge_pmd_set_accessed(struct vm_fault *vmf) { pmd_t entry; unsigned long haddr; bool write = vmf->flags & FAULT_FLAG_WRITE; + pmd_t orig_pmd = vmf->orig_pmd; vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd); if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) @@ -1278,11 +1279,12 @@ void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd) spin_unlock(vmf->ptl); } -vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd) +vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; + pmd_t orig_pmd = vmf->orig_pmd; vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd); VM_BUG_ON_VMA(!vma->anon_vma, vma); @@ -1418,9 +1420,10 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, } /* NUMA hinting page fault entry point for trans huge pmds */ -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) +vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; + pmd_t pmd = vmf->orig_pmd; struct anon_vma *anon_vma = NULL; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; diff --git a/mm/memory.c b/mm/memory.c index b535a1d0317d..5029d793dc3e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4298,12 +4298,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) } /* `inline' is required to avoid gcc 4.1.2 build error */ -static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd) +static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { if (vma_is_anonymous(vmf->vma)) { - if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd)) + if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) return handle_userfault(vmf, VM_UFFD_WP); - return do_huge_pmd_wp_page(vmf, orig_pmd); + return do_huge_pmd_wp_page(vmf); } if (vmf->vma->vm_ops->huge_fault) { vm_fault_t ret = vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD); @@ -4530,26 +4530,26 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { - pmd_t orig_pmd = *vmf.pmd; + vmf.orig_pmd = *vmf.pmd; barrier(); - if (unlikely(is_swap_pmd(orig_pmd))) { + if (unlikely(is_swap_pmd(vmf.orig_pmd))) { VM_BUG_ON(thp_migration_supported() && - !is_pmd_migration_entry(orig_pmd)); - if (is_pmd_migration_entry(orig_pmd)) + !is_pmd_migration_entry(vmf.orig_pmd)); + if (is_pmd_migration_entry(vmf.orig_pmd)) pmd_migration_entry_wait(mm, vmf.pmd); return 0; } - if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { - if (pmd_protnone(orig_pmd) && vma_is_accessible(vma)) - return do_huge_pmd_numa_page(&vmf, orig_pmd); + if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) { + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) + return do_huge_pmd_numa_page(&vmf); - if (dirty && !pmd_write(orig_pmd)) { - ret = wp_huge_pmd(&vmf, orig_pmd); + if (dirty && !pmd_write(vmf.orig_pmd)) { + ret = wp_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; } else { - huge_pmd_set_accessed(&vmf, orig_pmd); + 
huge_pmd_set_accessed(&vmf); return 0; } } From f4c0d8367ea492cdfc7f6d14763c02f472731592 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:39 -0700 Subject: [PATCH 084/190] mm: memory: make numa_migrate_prep() non-static The numa_migrate_prep() will be used by huge NUMA fault as well in the following patch, make it non-static. Link: https://lkml.kernel.org/r/20210518200801.7413-3-shy828301@gmail.com Signed-off-by: Yang Shi Acked-by: Mel Gorman Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 3 +++ mm/memory.c | 5 ++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 6ec2cea9926b..a4942f9867a3 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -672,4 +672,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end, void vunmap_range_noflush(unsigned long start, unsigned long end); +int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid, int *flags); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory.c b/mm/memory.c index 5029d793dc3e..64eda960b75e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4175,9 +4175,8 @@ static vm_fault_t do_fault(struct vm_fault *vmf) return ret; } -static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, - unsigned long addr, int page_nid, - int *flags) +int numa_migrate_prep(struct page *page, struct vm_area_struct *vma, + unsigned long addr, int page_nid, int *flags) { get_page(page); From c5b5a3dd2c1fa61049b7789ce596faff4d659a61 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:42 -0700 Subject: [PATCH 085/190] mm: thp: refactor NUMA fault handling When the THP NUMA fault support was added THP migration was not supported yet. So the ad hoc THP migration was implemented in NUMA fault handling. Since v4.14 THP migration has been supported so it doesn't make too much sense to still keep another THP migration implementation rather than using the generic migration code. This patch reworks the NUMA fault handling to use generic migration implementation to migrate misplaced page. There is no functional change. After the refactor the flow of NUMA fault handling looks just like its PTE counterpart: Acquire ptl Prepare for migration (elevate page refcount) Release ptl Isolate page from lru and elevate page refcount Migrate the misplaced THP If migration fails just restore the old normal PMD. In the old code anon_vma lock was needed to serialize THP migration against THP split, but since then the THP code has been reworked a lot, it seems anon_vma lock is not required anymore to avoid the race. The page refcount elevation when holding ptl should prevent from THP split. Use migrate_misplaced_page() for both base page and THP NUMA hinting fault and remove all the dead and duplicate code. [dan.carpenter@oracle.com: fix a double unlock bug] Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com Signed-off-by: Yang Shi Signed-off-by: Dan Carpenter Acked-by: Mel Gorman Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/migrate.h | 23 ------ mm/huge_memory.c | 145 ++++++++++---------------------- mm/internal.h | 18 ---- mm/migrate.c | 177 ++++++++-------------------------------- 4 files changed, 77 insertions(+), 286 deletions(-) diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 7b7b73977278..9b7b7cd3bae9 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -99,14 +99,9 @@ static inline void __ClearPageMovable(struct page *page) #endif #ifdef CONFIG_NUMA_BALANCING -extern bool pmd_trans_migrating(pmd_t pmd); extern int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node); #else -static inline bool pmd_trans_migrating(pmd_t pmd) -{ - return false; -} static inline int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int node) { @@ -114,24 +109,6 @@ static inline int migrate_misplaced_page(struct page *page, } #endif /* CONFIG_NUMA_BALANCING */ -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) -extern int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node); -#else -static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node) -{ - return -EAGAIN; -} -#endif /* CONFIG_NUMA_BALANCING && CONFIG_TRANSPARENT_HUGEPAGE*/ - - #ifdef CONFIG_MIGRATION /* diff --git a/mm/huge_memory.c b/mm/huge_memory.c index f3a3138d33e9..de7a8c2a0c80 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1423,94 +1423,22 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - pmd_t pmd = vmf->orig_pmd; - struct anon_vma *anon_vma = NULL; + pmd_t oldpmd = vmf->orig_pmd; + pmd_t pmd; struct page *page; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - int page_nid = NUMA_NO_NODE, this_nid = numa_node_id(); + int page_nid = NUMA_NO_NODE; int target_nid, last_cpupid = -1; - bool page_locked; bool migrated = false; - bool was_writable; + bool was_writable = pmd_savedwrite(oldpmd); int flags = 0; vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); - if (unlikely(!pmd_same(pmd, *vmf->pmd))) - goto out_unlock; - - /* - * If there are potential migrations, wait for completion and retry - * without disrupting NUMA hinting information. Do not relock and - * check_same as the page may no longer be mapped. 
- */ - if (unlikely(pmd_trans_migrating(*vmf->pmd))) { - page = pmd_page(*vmf->pmd); - if (!get_page_unless_zero(page)) - goto out_unlock; + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { spin_unlock(vmf->ptl); - put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); goto out; } - page = pmd_page(pmd); - BUG_ON(is_huge_zero_page(page)); - page_nid = page_to_nid(page); - last_cpupid = page_cpupid_last(page); - count_vm_numa_event(NUMA_HINT_FAULTS); - if (page_nid == this_nid) { - count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL); - flags |= TNF_FAULT_LOCAL; - } - - /* See similar comment in do_numa_page for explanation */ - if (!pmd_savedwrite(pmd)) - flags |= TNF_NO_GROUP; - - /* - * Acquire the page lock to serialise THP migrations but avoid dropping - * page_table_lock if at all possible - */ - page_locked = trylock_page(page); - target_nid = mpol_misplaced(page, vma, haddr); - /* Migration could have started since the pmd_trans_migrating check */ - if (!page_locked) { - page_nid = NUMA_NO_NODE; - if (!get_page_unless_zero(page)) - goto out_unlock; - spin_unlock(vmf->ptl); - put_and_wait_on_page_locked(page, TASK_UNINTERRUPTIBLE); - goto out; - } else if (target_nid == NUMA_NO_NODE) { - /* There are no parallel migrations and page is in the right - * node. Clear the numa hinting info in this pmd. - */ - goto clear_pmdnuma; - } - - /* - * Page is misplaced. Page lock serialises migrations. Acquire anon_vma - * to serialises splits - */ - get_page(page); - spin_unlock(vmf->ptl); - anon_vma = page_lock_anon_vma_read(page); - - /* Confirm the PMD did not change while page_table_lock was released */ - spin_lock(vmf->ptl); - if (unlikely(!pmd_same(pmd, *vmf->pmd))) { - unlock_page(page); - put_page(page); - page_nid = NUMA_NO_NODE; - goto out_unlock; - } - - /* Bail if we fail to protect against THP splits for any reason */ - if (unlikely(!anon_vma)) { - put_page(page); - page_nid = NUMA_NO_NODE; - goto clear_pmdnuma; - } - /* * Since we took the NUMA fault, we must have observed the !accessible * bit. Make sure all other CPUs agree with that, to avoid them @@ -1537,43 +1465,58 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) haddr + HPAGE_PMD_SIZE); } - /* - * Migrate the THP to the requested node, returns with page unlocked - * and access rights restored. 
- */ + pmd = pmd_modify(oldpmd, vma->vm_page_prot); + page = vm_normal_page_pmd(vma, haddr, pmd); + if (!page) + goto out_map; + + /* See similar comment in do_numa_page for explanation */ + if (!was_writable) + flags |= TNF_NO_GROUP; + + page_nid = page_to_nid(page); + last_cpupid = page_cpupid_last(page); + target_nid = numa_migrate_prep(page, vma, haddr, page_nid, + &flags); + + if (target_nid == NUMA_NO_NODE) { + put_page(page); + goto out_map; + } + spin_unlock(vmf->ptl); - migrated = migrate_misplaced_transhuge_page(vma->vm_mm, vma, - vmf->pmd, pmd, vmf->address, page, target_nid); + migrated = migrate_misplaced_page(page, vma, target_nid); if (migrated) { flags |= TNF_MIGRATED; page_nid = target_nid; - } else + } else { flags |= TNF_MIGRATE_FAIL; - - goto out; -clear_pmdnuma: - BUG_ON(!PageLocked(page)); - was_writable = pmd_savedwrite(pmd); - pmd = pmd_modify(pmd, vma->vm_page_prot); - pmd = pmd_mkyoung(pmd); - if (was_writable) - pmd = pmd_mkwrite(pmd); - set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); - update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); - unlock_page(page); -out_unlock: - spin_unlock(vmf->ptl); + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) { + spin_unlock(vmf->ptl); + goto out; + } + goto out_map; + } out: - if (anon_vma) - page_unlock_anon_vma_read(anon_vma); - if (page_nid != NUMA_NO_NODE) task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags); return 0; + +out_map: + /* Restore the PMD */ + pmd = pmd_modify(oldpmd, vma->vm_page_prot); + pmd = pmd_mkyoung(pmd); + if (was_writable) + pmd = pmd_mkwrite(pmd); + set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd); + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); + spin_unlock(vmf->ptl); + goto out; } /* diff --git a/mm/internal.h b/mm/internal.h index a4942f9867a3..d565eb94d6ec 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -369,23 +369,6 @@ extern unsigned int munlock_vma_page(struct page *page); */ extern void clear_page_mlock(struct page *page); -/* - * mlock_migrate_page - called only from migrate_misplaced_transhuge_page() - * (because that does not go through the full procedure of migration ptes): - * to migrate the Mlocked page flag; update statistics. 
- */ -static inline void mlock_migrate_page(struct page *newpage, struct page *page) -{ - if (TestClearPageMlocked(page)) { - int nr_pages = thp_nr_pages(page); - - /* Holding pmd lock, no change in irq context: __mod is safe */ - __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); - SetPageMlocked(newpage); - __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages); - } -} - extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma); /* @@ -461,7 +444,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf, #else /* !CONFIG_MMU */ static inline void clear_page_mlock(struct page *page) { } static inline void mlock_vma_page(struct page *page) { } -static inline void mlock_migrate_page(struct page *new, struct page *old) { } static inline void vunmap_range_noflush(unsigned long start, unsigned long end) { } diff --git a/mm/migrate.c b/mm/migrate.c index 9bd6bc12aa72..b7e330900b86 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2048,6 +2048,23 @@ static struct page *alloc_misplaced_dst_page(struct page *page, return newpage; } +static struct page *alloc_misplaced_dst_page_thp(struct page *page, + unsigned long data) +{ + int nid = (int) data; + struct page *newpage; + + newpage = alloc_pages_node(nid, (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), + HPAGE_PMD_ORDER); + if (!newpage) + goto out; + + prep_transhuge_page(newpage); + +out: + return newpage; +} + static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) { int page_lru; @@ -2086,12 +2103,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) return 1; } -bool pmd_trans_migrating(pmd_t pmd) -{ - struct page *page = pmd_page(pmd); - return PageLocked(page); -} - /* * Attempt to migrate a misplaced page to the specified destination * node. Caller is expected to have an elevated reference count on @@ -2104,6 +2115,20 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, int isolated; int nr_remaining; LIST_HEAD(migratepages); + new_page_t *new; + bool compound; + + /* + * PTE mapped THP or HugeTLB page can't reach here so the page could + * be either base page or THP. And it must be head page if it is + * THP. + */ + compound = PageTransHuge(page); + + if (compound) + new = alloc_misplaced_dst_page_thp; + else + new = alloc_misplaced_dst_page; /* * Don't migrate file pages that are mapped in multiple processes @@ -2125,9 +2150,8 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, goto out; list_add(&page->lru, &migratepages); - nr_remaining = migrate_pages(&migratepages, alloc_misplaced_dst_page, - NULL, node, MIGRATE_ASYNC, - MR_NUMA_MISPLACED); + nr_remaining = migrate_pages(&migratepages, *new, NULL, node, + MIGRATE_ASYNC, MR_NUMA_MISPLACED); if (nr_remaining) { if (!list_empty(&migratepages)) { list_del(&page->lru); @@ -2146,141 +2170,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, return 0; } #endif /* CONFIG_NUMA_BALANCING */ - -#if defined(CONFIG_NUMA_BALANCING) && defined(CONFIG_TRANSPARENT_HUGEPAGE) -/* - * Migrates a THP to a given target node. page must be locked and is unlocked - * before returning. 
- */ -int migrate_misplaced_transhuge_page(struct mm_struct *mm, - struct vm_area_struct *vma, - pmd_t *pmd, pmd_t entry, - unsigned long address, - struct page *page, int node) -{ - spinlock_t *ptl; - pg_data_t *pgdat = NODE_DATA(node); - int isolated = 0; - struct page *new_page = NULL; - int page_lru = page_is_file_lru(page); - unsigned long start = address & HPAGE_PMD_MASK; - - new_page = alloc_pages_node(node, - (GFP_TRANSHUGE_LIGHT | __GFP_THISNODE), - HPAGE_PMD_ORDER); - if (!new_page) - goto out_fail; - prep_transhuge_page(new_page); - - isolated = numamigrate_isolate_page(pgdat, page); - if (!isolated) { - put_page(new_page); - goto out_fail; - } - - /* Prepare a page as a migration target */ - __SetPageLocked(new_page); - if (PageSwapBacked(page)) - __SetPageSwapBacked(new_page); - - /* anon mapping, we can simply copy page->mapping to the new page: */ - new_page->mapping = page->mapping; - new_page->index = page->index; - /* flush the cache before copying using the kernel virtual address */ - flush_cache_range(vma, start, start + HPAGE_PMD_SIZE); - migrate_page_copy(new_page, page); - WARN_ON(PageLRU(new_page)); - - /* Recheck the target PMD */ - ptl = pmd_lock(mm, pmd); - if (unlikely(!pmd_same(*pmd, entry) || !page_ref_freeze(page, 2))) { - spin_unlock(ptl); - - /* Reverse changes made by migrate_page_copy() */ - if (TestClearPageActive(new_page)) - SetPageActive(page); - if (TestClearPageUnevictable(new_page)) - SetPageUnevictable(page); - - unlock_page(new_page); - put_page(new_page); /* Free it */ - - /* Retake the callers reference and putback on LRU */ - get_page(page); - putback_lru_page(page); - mod_node_page_state(page_pgdat(page), - NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR); - - goto out_unlock; - } - - entry = mk_huge_pmd(new_page, vma->vm_page_prot); - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - - /* - * Overwrite the old entry under pagetable lock and establish - * the new PTE. Any parallel GUP will either observe the old - * page blocking on the page lock, block on the page table - * lock or observe the new page. The SetPageUptodate on the - * new page and page_add_new_anon_rmap guarantee the copy is - * visible before the pagetable update. - */ - page_add_anon_rmap(new_page, vma, start, true); - /* - * At this point the pmd is numa/protnone (i.e. non present) and the TLB - * has already been flushed globally. So no TLB can be currently - * caching this non present pmd mapping. There's no need to clear the - * pmd before doing set_pmd_at(), nor to flush the TLB after - * set_pmd_at(). Clearing the pmd here would introduce a race - * condition against MADV_DONTNEED, because MADV_DONTNEED only holds the - * mmap_lock for reading. If the pmd is set to NULL at any given time, - * MADV_DONTNEED won't wait on the pmd lock and it'll skip clearing this - * pmd. - */ - set_pmd_at(mm, start, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); - - page_ref_unfreeze(page, 2); - mlock_migrate_page(new_page, page); - page_remove_rmap(page, true); - set_page_owner_migrate_reason(new_page, MR_NUMA_MISPLACED); - - spin_unlock(ptl); - - /* Take an "isolate" reference and put new page on the LRU. 
*/ - get_page(new_page); - putback_lru_page(new_page); - - unlock_page(new_page); - unlock_page(page); - put_page(page); /* Drop the rmap reference */ - put_page(page); /* Drop the LRU isolation reference */ - - count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR); - count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR); - - mod_node_page_state(page_pgdat(page), - NR_ISOLATED_ANON + page_lru, - -HPAGE_PMD_NR); - return isolated; - -out_fail: - count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR); - ptl = pmd_lock(mm, pmd); - if (pmd_same(*pmd, entry)) { - entry = pmd_modify(entry, vma->vm_page_prot); - set_pmd_at(mm, start, pmd, entry); - update_mmu_cache_pmd(vma, address, &entry); - } - spin_unlock(ptl); - -out_unlock: - unlock_page(page); - put_page(page); - return 0; -} -#endif /* CONFIG_NUMA_BALANCING */ - #endif /* CONFIG_NUMA */ #ifdef CONFIG_DEVICE_PRIVATE From c5fc5c3ae0c849c713c4291addb5fce699ad0972 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:45 -0700 Subject: [PATCH 086/190] mm: migrate: account THP NUMA migration counters correctly Now both base page and THP NUMA migration is done via migrate_misplaced_page(), keep the counters correctly for THP. Link: https://lkml.kernel.org/r/20210518200801.7413-5-shy828301@gmail.com Signed-off-by: Yang Shi Acked-by: Mel Gorman Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index b7e330900b86..e191314f0f19 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2117,6 +2117,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, LIST_HEAD(migratepages); new_page_t *new; bool compound; + unsigned int nr_pages = thp_nr_pages(page); /* * PTE mapped THP or HugeTLB page can't reach here so the page could @@ -2155,13 +2156,13 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma, if (nr_remaining) { if (!list_empty(&migratepages)) { list_del(&page->lru); - dec_node_page_state(page, NR_ISOLATED_ANON + - page_is_file_lru(page)); + mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + + page_is_file_lru(page), -nr_pages); putback_lru_page(page); } isolated = 0; } else - count_vm_numa_event(NUMA_PAGE_MIGRATE); + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_pages); BUG_ON(!list_empty(&migratepages)); return isolated; From b0b515bfb3f4f3dc208862989e38ee5268a1003f Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:48 -0700 Subject: [PATCH 087/190] mm: migrate: don't split THP for misplaced NUMA page The old behavior didn't split THP if migration is failed due to lack of memory on the target node. But the THP migration does split THP, so keep the old behavior for misplaced NUMA page migration. Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com Signed-off-by: Yang Shi Acked-by: Mel Gorman Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. 
Shutemov Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index e191314f0f19..6337b0fdc098 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1423,6 +1423,7 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, int swapwrite = current->flags & PF_SWAPWRITE; int rc, nr_subpages; LIST_HEAD(ret_pages); + bool nosplit = (reason == MR_NUMA_MISPLACED); trace_mm_migrate_pages_start(mode, reason); @@ -1494,8 +1495,9 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, /* * When memory is low, don't bother to try to migrate * other pages, just exit. + * THP NUMA faulting doesn't split THP to retry. */ - if (is_thp) { + if (is_thp && !nosplit) { if (!try_split_thp(page, &page2, from)) { nr_thp_split++; goto retry; From 662aeea7536d84d7e1d01739694e4748ba294ce0 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:51 -0700 Subject: [PATCH 088/190] mm: migrate: check mapcount for THP instead of refcount The generic migration path will check the refcount, so there is no need to check it here. But the old code actually prevented migrating shared THPs (mapped by multiple processes), so bail out early if the mapcount is > 1 to keep that behavior. Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com Signed-off-by: Yang Shi Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Mel Gorman Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index 6337b0fdc098..8810c1421f5d 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2073,6 +2073,10 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page); + /* Do not migrate THP mapped by multiple processes */ + if (PageTransHuge(page) && total_mapcount(page) > 1) + return 0; + /* Avoid migrating to a node that is nearly full */ if (!migrate_balanced_pgdat(pgdat, compound_nr(page))) return 0; @@ -2080,18 +2084,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page) if (isolate_lru_page(page)) return 0; - /* - * migrate_misplaced_transhuge_page() skips page migration's usual - * check on page_count(), so we must do it here, now that the page - * has been isolated: a GUP pin, or any other pin, prevents migration. - * The expected page count is 3: 1 for page's mapcount and 1 for the - * caller's pin and 1 for the reference taken by isolate_lru_page(). - */ - if (PageTransHuge(page) && page_count(page) != 3) { - putback_lru_page(page); - return 0; - } - page_lru = page_is_file_lru(page); mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru, thp_nr_pages(page)); From e346e6688c4aa18588f2c6a75b572d8ca7a65f5f Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:51:55 -0700 Subject: [PATCH 089/190] mm: thp: skip make PMD PROT_NONE if THP migration is not supported A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both NUMA balancing and THP. But S390 doesn't support THP migration, so NUMA balancing actually can't migrate any misplaced pages. Skip making the PMD PROT_NONE in that case; otherwise CPU cycles may be wasted by pointless NUMA hinting faults on S390.
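For context, the gate relied on here is the existing thp_migration_supported() helper in include/linux/huge_mm.h, which (roughly, as of this series) reduces to a Kconfig check:

static inline bool thp_migration_supported(void)
{
	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}

So on architectures that do not select ARCH_ENABLE_THP_MIGRATION, such as s390, change_huge_pmd() can return early for prot_numa instead of installing a PROT_NONE PMD that could never lead to a migration.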
Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com Signed-off-by: Yang Shi Acked-by: Mel Gorman Cc: Christian Borntraeger Cc: Gerald Schaefer Cc: Heiko Carstens Cc: Huang Ying Cc: Hugh Dickins Cc: Kirill A. Shutemov Cc: Michal Hocko Cc: Vasily Gorbik Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index de7a8c2a0c80..dc0a0c82a5ac 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1742,6 +1742,7 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, * Returns * - 0 if PMD could not be locked * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary + * or if prot_numa but THP migration is not supported * - HPAGE_PMD_NR if protections changed and TLB flush necessary */ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, @@ -1756,6 +1757,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; + if (prot_numa && !thp_migration_supported()) + return 1; + ptl = __pmd_trans_huge_lock(pmd, vma); if (!ptl) return 0; From cebc774fdc9cb39b959968fbfd7aabe7a8a5154c Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 30 Jun 2021 18:51:58 -0700 Subject: [PATCH 090/190] mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2 ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than two page table levels including PMD (also per Documentation/vm/split_page_table_lock.rst). Make this dependency explicit on remaining platforms i.e x86 and s390 where ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed. Link: https://lkml.kernel.org/r/1622013501-20409-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Acked-by: Gerald Schaefer # s390 Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Thomas Gleixner Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 707afbcd81c2..4098637a52fc 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -62,7 +62,7 @@ config S390 select ARCH_BINFMT_ELF_STATE select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM select ARCH_ENABLE_MEMORY_HOTREMOVE - select ARCH_ENABLE_SPLIT_PMD_PTLOCK + select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DEBUG_WX select ARCH_HAS_DEVMEM_IS_ALLOWED diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5d523ff70fe7..8c5d86655180 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -63,7 +63,7 @@ config X86 select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM) select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG - select ARCH_ENABLE_SPLIT_PMD_PTLOCK if X86_64 || X86_PAE + select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE) select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_CACHE_LINE_SIZE From 1fb08ac63beedf58e2ae9f229ea1f9474949a185 Mon Sep 17 00:00:00 2001 From: Yang Shi Date: Wed, 30 Jun 2021 18:52:01 -0700 Subject: [PATCH 091/190] mm: rmap: make try_to_unmap() void function Currently try_to_unmap() return bool value by checking page_mapcount(), however this may return false 
positive since page_mapcount() doesn't check all subpages of compound page. The total_mapcount() could be used instead, but its cost is higher since it traverses all subpages. Actually the most callers of try_to_unmap() don't care about the return value at all. So just need check if page is still mapped by page_mapped() when necessary. And page_mapped() does bail out early when it finds mapped subpage. Link: https://lkml.kernel.org/r/bb27e3fe-6036-b637-5086-272befbfe3da@google.com Suggested-by: Hugh Dickins Signed-off-by: Yang Shi Acked-by: Minchan Kim Reviewed-by: Shakeel Butt Acked-by: Kirill A. Shutemov Signed-off-by: Hugh Dickins Acked-by: Naoya Horiguchi Cc: Alistair Popple Cc: Jan Kara Cc: Jue Wang Cc: "Matthew Wilcox (Oracle)" Cc: Miaohe Lin Cc: Oscar Salvador Cc: Peter Xu Cc: Ralph Campbell Cc: Wang Yugui Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 2 +- mm/memory-failure.c | 15 +++++++-------- mm/rmap.c | 15 ++++----------- mm/vmscan.c | 3 ++- 4 files changed, 14 insertions(+), 21 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 8d04e7deedc6..ed31a559e857 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -195,7 +195,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); -bool try_to_unmap(struct page *, enum ttu_flags flags); +void try_to_unmap(struct page *, enum ttu_flags flags); /* Avoid racy checks */ #define PVMW_SYNC (1 << 0) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 9d2d31ffe8a4..419d92b3225d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1269,7 +1269,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, enum ttu_flags ttu = TTU_IGNORE_MLOCK; struct address_space *mapping; LIST_HEAD(tokill); - bool unmap_success = true; + bool unmap_success; int kill = 1, forcekill; struct page *hpage = *hpagep; bool mlocked = PageMlocked(hpage); @@ -1332,7 +1332,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED); if (!PageHuge(hpage)) { - unmap_success = try_to_unmap(hpage, ttu); + try_to_unmap(hpage, ttu); } else { if (!PageAnon(hpage)) { /* @@ -1344,17 +1344,16 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, */ mapping = hugetlb_page_mapping_lock_write(hpage); if (mapping) { - unmap_success = try_to_unmap(hpage, - ttu|TTU_RMAP_LOCKED); + try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED); i_mmap_unlock_write(mapping); - } else { + } else pr_info("Memory failure: %#lx: could not lock mapping for mapped huge page\n", pfn); - unmap_success = false; - } } else { - unmap_success = try_to_unmap(hpage, ttu); + try_to_unmap(hpage, ttu); } } + + unmap_success = !page_mapped(hpage); if (!unmap_success) pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n", pfn, page_mapcount(hpage)); diff --git a/mm/rmap.c b/mm/rmap.c index e05c300048e6..f9fd5bc54f0a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1405,7 +1405,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* * When racing against e.g. zap_pte_range() on another cpu, * in between its ptep_get_and_clear_full() and page_remove_rmap(), - * try_to_unmap() may return false when it is about to become true, + * try_to_unmap() may return before page_mapped() has become false, * if page table locking is skipped: use TTU_SYNC to wait for that. 
*/ if (flags & TTU_SYNC) @@ -1756,9 +1756,10 @@ static int page_not_mapped(struct page *page) * Tries to remove all the page table entries which are mapping this * page, used in the pageout path. Caller must hold the page lock. * - * If unmap is successful, return true. Otherwise, false. + * It is the caller's responsibility to check if the page is still + * mapped when needed (use TTU_SYNC to prevent accounting races). */ -bool try_to_unmap(struct page *page, enum ttu_flags flags) +void try_to_unmap(struct page *page, enum ttu_flags flags) { struct rmap_walk_control rwc = { .rmap_one = try_to_unmap_one, @@ -1783,14 +1784,6 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags) rmap_walk_locked(page, &rwc); else rmap_walk(page, &rwc); - - /* - * When racing against e.g. zap_pte_range() on another cpu, - * in between its ptep_get_and_clear_full() and page_remove_rmap(), - * try_to_unmap() may return false when it is about to become true, - * if page table locking is skipped: use TTU_SYNC to wait for that. - */ - return !page_mapcount(page); } /** diff --git a/mm/vmscan.c b/mm/vmscan.c index 7b52ab166aae..e1d75e6f9ff4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1499,7 +1499,8 @@ static unsigned int shrink_page_list(struct list_head *page_list, if (unlikely(PageTransHuge(page))) flags |= TTU_SPLIT_HUGE_PMD; - if (!try_to_unmap(page, flags)) { + try_to_unmap(page, flags); + if (page_mapped(page)) { stat->nr_unmap_fail += nr_pages; if (!was_swapbacked && PageSwapBacked(page)) stat->nr_lazyfree_fail += nr_pages; From ab02c252c8609c73ff2897c7e961b631e8bd409c Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 30 Jun 2021 18:52:04 -0700 Subject: [PATCH 092/190] mm/thp: remap_page() is only needed on anonymous THP THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and migration entries are only inserted when TTU_MIGRATION (unused here) or TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to search for migration entries to remove when !PageAnon. Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()") Signed-off-by: Hugh Dickins Reviewed-by: Yang Shi Acked-by: Kirill A. 
Shutemov Cc: Alistair Popple Cc: Jan Kara Cc: Jue Wang Cc: "Matthew Wilcox (Oracle)" Cc: Miaohe Lin Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Oscar Salvador Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Cc: Wang Yugui Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index dc0a0c82a5ac..503c8e1aecc6 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2307,6 +2307,7 @@ static void unmap_page(struct page *page) VM_BUG_ON_PAGE(!PageHead(page), page); + /* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */ if (PageAnon(page)) ttu_flags |= TTU_SPLIT_FREEZE; @@ -2318,6 +2319,10 @@ static void unmap_page(struct page *page) static void remap_page(struct page *page, unsigned int nr) { int i; + + /* If TTU_SPLIT_FREEZE is ever extended to file, remove this check */ + if (!PageAnon(page)) + return; if (PageTransHuge(page)) { remove_migration_ptes(page, page, true); } else { From 36af67370e33db2ec48693dd20d6b3cd049e07af Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Wed, 30 Jun 2021 18:52:08 -0700 Subject: [PATCH 093/190] mm: hwpoison_user_mappings() try_to_unmap() with TTU_SYNC TTU_SYNC prevents an unlikely race, when try_to_unmap() returns shortly before the page is accounted as unmapped. It is unlikely to coincide with hwpoisoning, but now that we have the flag, hwpoison_user_mappings() would do well to use it. Link: https://lkml.kernel.org/r/329c28ed-95df-9a2c-8893-b444d8a6d340@google.com Signed-off-by: Hugh Dickins Acked-by: Kirill A. Shutemov Acked-by: Naoya Horiguchi Cc: Alistair Popple Cc: Jan Kara Cc: Jue Wang Cc: "Matthew Wilcox (Oracle)" Cc: Miaohe Lin Cc: Minchan Kim Cc: Oscar Salvador Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Cc: Wang Yugui Cc: Yang Shi Cc: Zi Yan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory-failure.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 419d92b3225d..ee02d8b06839 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1266,7 +1266,7 @@ static int get_hwpoison_page(struct page *p, unsigned long flags) static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, int flags, struct page **hpagep) { - enum ttu_flags ttu = TTU_IGNORE_MLOCK; + enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_SYNC; struct address_space *mapping; LIST_HEAD(tokill); bool unmap_success; From 1212e00c93a8016dfd70d209f428f8e0edd5856f Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 30 Jun 2021 18:52:11 -0700 Subject: [PATCH 094/190] mm/thp: fix strncpy warning Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify complain as it thinks the string might be longer than the buffer, and if it is, we will end up with a "string" that is missing a NUL terminator. It's trivial to show that 'tok' points to a NUL-terminated string which is less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy() and avoid the warning. 
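To make the safety argument concrete, here is a schematic userspace demonstration of the same invariant (not part of the patch; the buffer names and the 255 bound are illustrative): strsep() only ever writes a NUL terminator inside the buffer it is handed, so every token it returns is a NUL-terminated substring of that already bounded buffer:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

#define MAX_INPUT_BUF_SZ 255	/* illustrative bound; the kernel defines its own */

int main(void)
{
	char input[MAX_INPUT_BUF_SZ] = "/mnt/test,0x1000,0x2000";	/* NUL-terminated by construction */
	char file_path[MAX_INPUT_BUF_SZ];
	char *buf = input;
	char *tok = strsep(&buf, ",");	/* tok points into input[], still NUL-terminated */

	if (tok)
		strcpy(file_path, tok);	/* strlen(tok) < MAX_INPUT_BUF_SZ, so this cannot overflow */
	printf("%s\n", file_path);
	return 0;
}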
Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) Cc: Mike Kravetz Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 503c8e1aecc6..d513b0cd1161 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3101,7 +3101,7 @@ static ssize_t split_huge_pages_write(struct file *file, const char __user *buf, tok = strsep(&buf, ","); if (tok) { - strncpy(file_path, tok, MAX_INPUT_BUF_SZ); + strcpy(file_path, tok); } else { ret = -EINVAL; goto out; From 176056fd740ecaa9873facfc257f8396804754ce Mon Sep 17 00:00:00 2001 From: Chen Li Date: Wed, 30 Jun 2021 18:52:14 -0700 Subject: [PATCH 095/190] nommu: remove __GFP_HIGHMEM in vmalloc/vzalloc mm/nommu.c: void *__vmalloc(unsigned long size, gfp_t gfp_mask) { /* * You can't specify __GFP_HIGHMEM with kmalloc() since kmalloc() * returns only a logical address. */ return kmalloc(size, (gfp_mask | __GFP_COMP) & ~__GFP_HIGHMEM); } nommu's __vmalloc just uses kmalloc internally and elimitates __GFP_HIGHMEM, so it makes no sense to add __GFP_HIGHMEM for nommu's vmalloc/vzalloc. [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/875z00rnp8.wl-chenli@uniontech.com Signed-off-by: Chen Li Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Cc: Greg Ungerer Cc: Geert Uytterhoeven Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/nommu.c b/mm/nommu.c index affda71641ca..0997ca38c2bd 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -223,7 +223,7 @@ long vread(char *buf, char *addr, unsigned long count) */ void *vmalloc(unsigned long size) { - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM); + return __vmalloc(size, GFP_KERNEL); } EXPORT_SYMBOL(vmalloc); @@ -241,7 +241,7 @@ EXPORT_SYMBOL(vmalloc); */ void *vzalloc(unsigned long size) { - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO); + return __vmalloc(size, GFP_KERNEL | __GFP_ZERO); } EXPORT_SYMBOL(vzalloc); From db1d9152c91acf2fef2eb16718a0aafee60dde30 Mon Sep 17 00:00:00 2001 From: Liam Howlett Date: Wed, 30 Jun 2021 18:52:17 -0700 Subject: [PATCH 096/190] mm/nommu: unexport do_munmap() do_munmap() does not take the mmap_write_lock(). vm_munmap() should be used instead. Link: https://lkml.kernel.org/r/20210604194002.648037-1-Liam.Howlett@Oracle.com Signed-off-by: Liam R. Howlett Reviewed-by: Andrew Morton Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/nommu.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/nommu.c b/mm/nommu.c index 0997ca38c2bd..3a93d4054810 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1501,7 +1501,6 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list delete_vma(mm, vma); return 0; } -EXPORT_SYMBOL(do_munmap); int vm_munmap(unsigned long addr, size_t len) { From 63703f37aa09e2c12c0ff25afbf5c460b21bfe4c Mon Sep 17 00:00:00 2001 From: Kefeng Wang Date: Wed, 30 Jun 2021 18:52:20 -0700 Subject: [PATCH 097/190] mm: generalize ZONE_[DMA|DMA32] ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that subscribe to them. Instead, just make them generic options which can be selected on applicable platforms. 
Also only x86/arm64 architectures could enable both ZONE_DMA and ZONE_DMA32 if EXPERT, add ARCH_HAS_ZONE_DMA_SET to make dma zone configurable and visible on the two architectures. Link: https://lkml.kernel.org/r/20210528074557.17768-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang Acked-by: Catalin Marinas [arm64] Acked-by: Geert Uytterhoeven [m68k] Acked-by: Mike Rapoport Acked-by: Palmer Dabbelt [RISC-V] Acked-by: Michal Simek [microblaze] Acked-by: Michael Ellerman [powerpc] Cc: Catalin Marinas Cc: Will Deacon Cc: Geert Uytterhoeven Cc: Thomas Bogendoerfer Cc: "David S. Miller" Cc: Ingo Molnar Cc: Borislav Petkov Cc: Richard Henderson Cc: Russell King Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/Kconfig | 5 +---- arch/arm/Kconfig | 3 --- arch/arm64/Kconfig | 9 +-------- arch/ia64/Kconfig | 4 +--- arch/m68k/Kconfig | 5 +---- arch/microblaze/Kconfig | 4 +--- arch/mips/Kconfig | 7 ------- arch/powerpc/Kconfig | 4 ---- arch/powerpc/platforms/Kconfig.cputype | 1 + arch/riscv/Kconfig | 5 +---- arch/s390/Kconfig | 4 +--- arch/sparc/Kconfig | 5 +---- arch/x86/Kconfig | 15 ++------------- mm/Kconfig | 12 ++++++++++++ 14 files changed, 23 insertions(+), 60 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 8954216b9956..77d3280dc678 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -40,6 +40,7 @@ config ALPHA select MMU_GATHER_NO_RANGE select SET_FS select SPARSEMEM_EXTREME if SPARSEMEM + select ZONE_DMA help The Alpha is a 64-bit general-purpose processor designed and marketed by the Digital Equipment Corporation of blessed memory, @@ -65,10 +66,6 @@ config GENERIC_CALIBRATE_DELAY bool default y -config ZONE_DMA - bool - default y - config GENERIC_ISA_DMA bool default y diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 24804f11302d..000c3f80b58e 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -218,9 +218,6 @@ config GENERIC_CALIBRATE_DELAY config ARCH_MAY_HAVE_PC_FDC bool -config ZONE_DMA - bool - config ARCH_SUPPORTS_UPROBES def_bool y diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 84f0216d3c5c..39b378ee56ce 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -42,6 +42,7 @@ config ARM64 select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST + select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_ELF_PROT select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK if !PREEMPTION @@ -306,14 +307,6 @@ config GENERIC_CSUM config GENERIC_CALIBRATE_DELAY def_bool y -config ZONE_DMA - bool "Support DMA zone" if EXPERT - default y - -config ZONE_DMA32 - bool "Support DMA32 zone" if EXPERT - default y - config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE def_bool y diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig index f4ff6bd5fc13..cf425c2c63af 100644 --- a/arch/ia64/Kconfig +++ b/arch/ia64/Kconfig @@ -60,6 +60,7 @@ config IA64 select NUMA if !FLATMEM select PCI_MSI_ARCH_FALLBACKS if PCI_MSI select SET_FS + select ZONE_DMA32 default y help The Itanium Processor Family is Intel's 64-bit successor to @@ -72,9 +73,6 @@ config 64BIT select ATA_NONSTANDARD if ATA default y -config ZONE_DMA32 - def_bool y - config MMU bool default y diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig index 372e4e69c43a..05a729c6ad7f 100644 --- a/arch/m68k/Kconfig +++ b/arch/m68k/Kconfig @@ -34,6 +34,7 @@ config M68K select SET_FS select UACCESS_MEMCPY if !MMU select VIRT_TO_BUS + select ZONE_DMA config CPU_BIG_ENDIAN def_bool y @@ -62,10 
+63,6 @@ config TIME_LOW_RES config NO_IOPORT_MAP def_bool y -config ZONE_DMA - bool - default y - config HZ int default 1000 if CLEOPATRA diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig index 0660f47012bc..14a67a42fcae 100644 --- a/arch/microblaze/Kconfig +++ b/arch/microblaze/Kconfig @@ -43,6 +43,7 @@ config MICROBLAZE select MMU_GATHER_NO_RANGE select SPARSE_IRQ select SET_FS + select ZONE_DMA # Endianness selection choice @@ -60,9 +61,6 @@ config CPU_LITTLE_ENDIAN endchoice -config ZONE_DMA - def_bool y - config ARCH_HAS_ILOG2_U32 def_bool n diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index df02e36a4a8d..d69f4b17d0a6 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -3274,13 +3274,6 @@ config I8253 select CLKSRC_I8253 select CLKEVT_I8253 select MIPS_EXTERNAL_TIMER - -config ZONE_DMA - bool - -config ZONE_DMA32 - bool - endmenu config TRAD_SIGNALS diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 98206f3ebae0..df46324d5090 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -403,10 +403,6 @@ config PPC_ADV_DEBUG_DAC_RANGE config PPC_DAWR bool -config ZONE_DMA - bool - default y if PPC_BOOK3E_64 - config PGTABLE_LEVELS int default 2 if !PPC64 diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype index f998e655b570..7d271de8fcbd 100644 --- a/arch/powerpc/platforms/Kconfig.cputype +++ b/arch/powerpc/platforms/Kconfig.cputype @@ -111,6 +111,7 @@ config PPC_BOOK3E_64 select PPC_FPU # Make it a choice ? select PPC_SMP_MUXED_IPI select PPC_DOORBELL + select ZONE_DMA endchoice diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index 15f9490a7aad..469a70bd8da6 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -104,6 +104,7 @@ config RISCV select SYSCTL_EXCEPTION_TRACE select THREAD_INFO_IN_TASK select UACCESS_MEMCPY if !MMU + select ZONE_DMA32 if 64BIT config ARCH_MMAP_RND_BITS_MIN default 18 if 64BIT @@ -133,10 +134,6 @@ config MMU Select if you want MMU-based virtualised addressing space support by paged memory management. If unsure, say 'Y'. 
-config ZONE_DMA32 - bool - default y if 64BIT - config VA_BITS int default 32 if 32BIT diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 4098637a52fc..9e14acacb884 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -2,9 +2,6 @@ config MMU def_bool y -config ZONE_DMA - def_bool y - config CPU_BIG_ENDIAN def_bool y @@ -210,6 +207,7 @@ config S390 select THREAD_INFO_IN_TASK select TTY select VIRT_CPU_ACCOUNTING + select ZONE_DMA # Note: keep the above list sorted alphabetically config SCHED_OMIT_FRAME_POINTER diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig index c72f52c704cd..c5fa7932b550 100644 --- a/arch/sparc/Kconfig +++ b/arch/sparc/Kconfig @@ -59,6 +59,7 @@ config SPARC32 select CLZ_TAB select HAVE_UID16 select OLD_SIGACTION + select ZONE_DMA config SPARC64 def_bool 64BIT @@ -141,10 +142,6 @@ config HIGHMEM default y if SPARC32 select KMAP_LOCAL -config ZONE_DMA - bool - default y if SPARC32 - config GENERIC_ISA_DMA bool default y if SPARC32 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 8c5d86655180..5f875112f086 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -33,6 +33,7 @@ config X86_64 select NEED_DMA_MAP_STATE select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT + select ZONE_DMA32 config FORCE_DYNAMIC_FTRACE def_bool y @@ -93,6 +94,7 @@ config X86 select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAS_DEBUG_WX + select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT @@ -343,9 +345,6 @@ config ARCH_SUSPEND_POSSIBLE config ARCH_WANT_GENERAL_HUGETLB def_bool y -config ZONE_DMA32 - def_bool y if X86_64 - config AUDIT_ARCH def_bool y if X86_64 @@ -393,16 +392,6 @@ config CC_HAS_SANE_STACKPROTECTOR menu "Processor type and features" -config ZONE_DMA - bool "DMA memory allocation support" if EXPERT - default y - help - DMA memory allocation support allows devices with less than 32-bit - addressing to allocate within the first 16MB of address space. - Disable if no such devices will be used. - - If unsure, say Y. - config SMP bool "Symmetric multi-processing support" help diff --git a/mm/Kconfig b/mm/Kconfig index a0b2c37f851b..a02498c0e13d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -761,6 +761,18 @@ config ARCH_HAS_CACHE_LINE_SIZE config ARCH_HAS_PTE_DEVMAP bool +config ARCH_HAS_ZONE_DMA_SET + bool + +config ZONE_DMA + bool "Support DMA zone" if ARCH_HAS_ZONE_DMA_SET + default y if ARM64 || X86 + +config ZONE_DMA32 + bool "Support DMA32 zone" if ARCH_HAS_ZONE_DMA_SET + depends on !X86_32 + default y if ARM64 + config ZONE_DEVICE bool "Device memory (pmem, HMM, etc...) hotplug support" depends on MEMORY_HOTPLUG From a78f1ccd37fbcda706745220b5db76902b325900 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:52:23 -0700 Subject: [PATCH 098/190] mm: make variable names for populate_vma_page_range() consistent Patch series "mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables", v2. Excessive details on MADV_POPULATE_(READ|WRITE) can be found in patch #2. This patch (of 5): Let's make the variable names in the function declaration match the variable names used in the definition. 
Link: https://lkml.kernel.org/r/20210419135443.12822-1-david@redhat.com Link: https://lkml.kernel.org/r/20210419135443.12822-2-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador Cc: Michal Hocko Cc: Jason Gunthorpe Cc: Peter Xu Cc: Andrea Arcangeli Cc: Arnd Bergmann Cc: Chris Zankel Cc: Dave Hansen Cc: Helge Deller Cc: Hugh Dickins Cc: Ivan Kokshaysky Cc: "James E.J. Bottomley" Cc: Jann Horn Cc: Kirill A. Shutemov Cc: Matthew Wilcox (Oracle) Cc: Matt Turner Cc: Max Filippov Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Minchan Kim Cc: Ram Pai Cc: Richard Henderson Cc: Rik van Riel Cc: Rolf Eike Beer Cc: Shuah Khan Cc: Thomas Bogendoerfer Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/internal.h b/mm/internal.h index d565eb94d6ec..d9c2b33cd2ab 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -344,7 +344,7 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma); #ifdef CONFIG_MMU extern long populate_vma_page_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, int *nonblocking); + unsigned long start, unsigned long end, int *locked); extern void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); static inline void munlock_vma_pages_all(struct vm_area_struct *vma) From 4ca9b3859dac14bbef0c27d00667bb5b10917adb Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:52:28 -0700 Subject: [PATCH 099/190] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables I. Background: Sparse Memory Mappings When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/ discard memory inside such a sparse memory region. Example users are hypervisors (especially implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs). While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory. Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only. fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy as it does not prefault page tables. II. 
On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., reading+writing one byte to not
overwrite existing data) all individual pages.

However, that approach

1) Can result in wear on storage backing, because we end up reading/writing
   each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case of
   hugetlbfs/shmem/file-backed memory. For example, this is problematic in
   hypervisors like QEMU where SIGBUS handlers might already be used by other
   subsystems concurrently to, e.g., handle hardware errors. "Simply" doing
   preallocation concurrently from other threads is not that easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because

1. It would change the semantics: "Expect access in the near future" and
   "might be a good idea to read some pages" vs. "Definitely
   populate/preallocate all memory and definitely fail on errors".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
   don't want populate/prealloc semantics. They treat this rather as a hint
   to give a little performance boost without too much overhead - and don't
   expect that a lot of memory might get consumed or a lot of time might be
   spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:

1. MADV_POPULATE_READ can be used to prefault page tables just like manually
   reading each individual page. This will not break any COW mappings. The
   shared zero page might get mapped and no backend storage might get
   preallocated -- allocation might be deferred to write-fault time.
   Especially shared file mappings require an explicit fallocate() upfront to
   actually preallocate backend memory (blocks in the file system) in case
   the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
   (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault
   page tables just like manually writing (or reading+writing) each
   individual page. This will break any COW mappings -- e.g., the shared
   zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
   (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
   mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions
   (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is
   encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might
   have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when
   encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot
   protect from the OOM (Out Of Memory) handler killing the process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time. While not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.
MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back. As discussed above, shared file mappings might
require an explicit fallocate() upfront to achieve
preallocation+prepopulation.

Although sparse memory mappings are the primary use case, this will also be
useful for other preallocate/prefault use cases where MAP_POPULATE is not
desired or the semantics of MAP_POPULATE are not sufficient: as one example,
QEMU users can trigger preallocation/prefaulting of guest RAM after the
mapping was created -- and don't want errors to be silently suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1];
however, the main motivation back then was performance improvements -- which
should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times. The results correspond
to the shortest execution time. In general, the performance benefit for huge
pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest. Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable --
which results in short population times. The fastest way to allocate backend
storage (here: swap or huge pages) and prefault page tables is
POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest, however, it doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ
is faster than simple reads.

Without an fd, the fastest way to allocate backend storage and prefault page
tables is POPULATE_WRITE. With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception is actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage and
prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with
actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly
mark all pages as dirty.
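Before the detailed numbers, a minimal userspace sketch of the interface
described in section IV above (not part of this patch; the fallback #defines
reuse the values introduced by this series in case of older libc headers,
and EHWPOISON is assumed to be exposed by <errno.h>):

#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ	22	/* value introduced by this series */
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE	23	/* value introduced by this series */
#endif

/* Illustrative only: preallocate+prefault a region writable, failing gracefully. */
int prefault_writable(void *addr, size_t len)
{
	if (!madvise(addr, len, MADV_POPULATE_WRITE))
		return 0;
	if (errno == EINVAL)		/* e.g. VM_PFNMAP/VM_IO or missing PROT_WRITE */
		fprintf(stderr, "mapping cannot be prefaulted\n");
	else if (errno == EHWPOISON)	/* HW poisoned page in the range */
		fprintf(stderr, "hit a hardware-poisoned page\n");
	else if (errno == ENOMEM)	/* hole in the range or out of backend memory */
		fprintf(stderr, "could not populate backend memory\n");
	return -1;
}

MADV_POPULATE_READ would be used the same way when only prefaulting page
tables (without marking pages dirty) is wanted.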
v.3: Detailed results ================================================== 2 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 0.119 ms Anon 4 KiB : Write : 0.222 ms Anon 4 KiB : Read/Write : 0.380 ms Anon 4 KiB : POPULATE_READ : 0.060 ms Anon 4 KiB : POPULATE_WRITE : 0.158 ms Memfd 4 KiB : Read : 0.034 ms Memfd 4 KiB : Write : 0.310 ms Memfd 4 KiB : Read/Write : 0.362 ms Memfd 4 KiB : POPULATE_READ : 0.039 ms Memfd 4 KiB : POPULATE_WRITE : 0.229 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.033 ms tmpfs : Write : 0.313 ms tmpfs : Read/Write : 0.406 ms tmpfs : POPULATE_READ : 0.039 ms tmpfs : POPULATE_WRITE : 0.285 ms file : Read : 0.033 ms file : Write : 0.351 ms file : Read/Write : 0.408 ms file : POPULATE_READ : 0.039 ms file : POPULATE_WRITE : 0.290 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_PRIVATE: ************************************************** Anon 4 KiB : Read : 237.940 ms Anon 4 KiB : Write : 708.409 ms Anon 4 KiB : Read/Write : 1054.041 ms Anon 4 KiB : POPULATE_READ : 124.310 ms Anon 4 KiB : POPULATE_WRITE : 572.582 ms Memfd 4 KiB : Read : 136.928 ms Memfd 4 KiB : Write : 963.898 ms Memfd 4 KiB : Read/Write : 1106.561 ms Memfd 4 KiB : POPULATE_READ : 78.450 ms Memfd 4 KiB : POPULATE_WRITE : 805.881 ms Memfd 2 MiB : Read : 357.116 ms Memfd 2 MiB : Write : 357.210 ms Memfd 2 MiB : Read/Write : 357.606 ms Memfd 2 MiB : POPULATE_READ : 356.094 ms Memfd 2 MiB : POPULATE_WRITE : 356.937 ms tmpfs : Read : 137.536 ms tmpfs : Write : 954.362 ms tmpfs : Read/Write : 1105.954 ms tmpfs : POPULATE_READ : 80.289 ms tmpfs : POPULATE_WRITE : 822.826 ms file : Read : 137.874 ms file : Write : 987.025 ms file : Read/Write : 1107.439 ms file : POPULATE_READ : 80.413 ms file : POPULATE_WRITE : 857.622 ms hugetlbfs : Read : 355.607 ms hugetlbfs : Write : 355.729 ms hugetlbfs : Read/Write : 356.127 ms hugetlbfs : POPULATE_READ : 354.585 ms hugetlbfs : POPULATE_WRITE : 355.138 ms ************************************************** 2 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 0.394 ms Anon 4 KiB : Write : 0.348 ms Anon 4 KiB : Read/Write : 0.400 ms Anon 4 KiB : POPULATE_READ : 0.326 ms Anon 4 KiB : POPULATE_WRITE : 0.273 ms Anon 2 MiB : Read : 0.030 ms Anon 2 MiB : Write : 0.030 ms Anon 2 MiB : Read/Write : 0.030 ms Anon 2 MiB : POPULATE_READ : 0.030 ms Anon 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 4 KiB : Read : 0.412 ms Memfd 4 KiB : Write : 0.372 ms Memfd 4 KiB : Read/Write : 0.419 ms Memfd 4 KiB : POPULATE_READ : 0.343 ms Memfd 4 KiB : POPULATE_WRITE : 0.288 ms Memfd 4 KiB : FALLOCATE : 0.137 ms Memfd 4 KiB : FALLOCATE+Read : 0.446 ms Memfd 4 KiB : FALLOCATE+Write : 0.330 ms Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms Memfd 2 MiB : Read : 0.030 ms Memfd 2 MiB : Write : 0.030 ms Memfd 2 MiB : Read/Write : 0.030 ms Memfd 2 MiB : POPULATE_READ : 0.030 ms Memfd 2 MiB : POPULATE_WRITE : 0.030 ms Memfd 2 MiB : FALLOCATE : 0.030 ms Memfd 2 MiB : FALLOCATE+Read : 0.031 ms Memfd 2 MiB : FALLOCATE+Write : 0.031 ms Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ 
: 0.030 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms tmpfs : Read : 0.416 ms tmpfs : Write : 0.369 ms tmpfs : Read/Write : 0.425 ms tmpfs : POPULATE_READ : 0.346 ms tmpfs : POPULATE_WRITE : 0.295 ms tmpfs : FALLOCATE : 0.139 ms tmpfs : FALLOCATE+Read : 0.447 ms tmpfs : FALLOCATE+Write : 0.333 ms tmpfs : FALLOCATE+Read/Write : 0.454 ms tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms file : Read : 0.191 ms file : Write : 0.511 ms file : Read/Write : 0.524 ms file : POPULATE_READ : 0.196 ms file : POPULATE_WRITE : 0.434 ms file : FALLOCATE : 0.004 ms file : FALLOCATE+Read : 0.197 ms file : FALLOCATE+Write : 0.554 ms file : FALLOCATE+Read/Write : 0.480 ms file : FALLOCATE+POPULATE_READ : 0.201 ms file : FALLOCATE+POPULATE_WRITE : 0.381 ms hugetlbfs : Read : 0.030 ms hugetlbfs : Write : 0.030 ms hugetlbfs : Read/Write : 0.030 ms hugetlbfs : POPULATE_READ : 0.030 ms hugetlbfs : POPULATE_WRITE : 0.030 ms hugetlbfs : FALLOCATE : 0.030 ms hugetlbfs : FALLOCATE+Read : 0.031 ms hugetlbfs : FALLOCATE+Write : 0.031 ms hugetlbfs : FALLOCATE+Read/Write : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms ************************************************** 4096 MiB MAP_SHARED: ************************************************** Anon 4 KiB : Read : 1053.090 ms Anon 4 KiB : Write : 913.642 ms Anon 4 KiB : Read/Write : 1060.350 ms Anon 4 KiB : POPULATE_READ : 893.691 ms Anon 4 KiB : POPULATE_WRITE : 782.885 ms Anon 2 MiB : Read : 358.553 ms Anon 2 MiB : Write : 358.419 ms Anon 2 MiB : Read/Write : 357.992 ms Anon 2 MiB : POPULATE_READ : 357.533 ms Anon 2 MiB : POPULATE_WRITE : 357.808 ms Memfd 4 KiB : Read : 1078.144 ms Memfd 4 KiB : Write : 942.036 ms Memfd 4 KiB : Read/Write : 1100.391 ms Memfd 4 KiB : POPULATE_READ : 925.829 ms Memfd 4 KiB : POPULATE_WRITE : 804.394 ms Memfd 4 KiB : FALLOCATE : 304.632 ms Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms Memfd 4 KiB : FALLOCATE+Write : 933.186 ms Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms Memfd 2 MiB : Read : 358.131 ms Memfd 2 MiB : Write : 358.099 ms Memfd 2 MiB : Read/Write : 358.250 ms Memfd 2 MiB : POPULATE_READ : 357.563 ms Memfd 2 MiB : POPULATE_WRITE : 357.334 ms Memfd 2 MiB : FALLOCATE : 356.735 ms Memfd 2 MiB : FALLOCATE+Read : 358.152 ms Memfd 2 MiB : FALLOCATE+Write : 358.331 ms Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms tmpfs : Read : 1087.265 ms tmpfs : Write : 950.840 ms tmpfs : Read/Write : 1107.567 ms tmpfs : POPULATE_READ : 922.605 ms tmpfs : POPULATE_WRITE : 810.094 ms tmpfs : FALLOCATE : 306.320 ms tmpfs : FALLOCATE+Read : 1169.796 ms tmpfs : FALLOCATE+Write : 933.730 ms tmpfs : FALLOCATE+Read/Write : 1191.610 ms tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms file : Read : 654.101 ms file : Write : 1259.142 ms file : Read/Write : 1289.509 ms file : POPULATE_READ : 661.642 ms file : POPULATE_WRITE : 1106.816 ms file : FALLOCATE : 1.864 ms file : FALLOCATE+Read : 656.328 ms file : FALLOCATE+Write : 1153.300 ms file : FALLOCATE+Read/Write : 1180.613 ms file : FALLOCATE+POPULATE_READ : 668.347 ms file : FALLOCATE+POPULATE_WRITE : 996.143 ms hugetlbfs : Read : 357.245 ms hugetlbfs : Write : 357.413 ms hugetlbfs : Read/Write : 357.120 ms hugetlbfs : POPULATE_READ : 356.321 ms hugetlbfs : 
POPULATE_WRITE : 356.693 ms hugetlbfs : FALLOCATE : 355.927 ms hugetlbfs : FALLOCATE+Read : 357.074 ms hugetlbfs : FALLOCATE+Write : 357.120 ms hugetlbfs : FALLOCATE+Read/Write : 356.983 ms hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms ************************************************** [1] https://lkml.org/lkml/2013/6/27/698 [akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand Cc: Arnd Bergmann Cc: Michal Hocko Cc: Oscar Salvador Cc: Matthew Wilcox (Oracle) Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Jann Horn Cc: Jason Gunthorpe Cc: Dave Hansen Cc: Hugh Dickins Cc: Rik van Riel Cc: Michael S. Tsirkin Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Thomas Bogendoerfer Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Chris Zankel Cc: Max Filippov Cc: Mike Kravetz Cc: Peter Xu Cc: Rolf Eike Beer Cc: Ram Pai Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/uapi/asm/mman.h | 3 ++ arch/mips/include/uapi/asm/mman.h | 3 ++ arch/parisc/include/uapi/asm/mman.h | 3 ++ arch/xtensa/include/uapi/asm/mman.h | 3 ++ include/uapi/asm-generic/mman-common.h | 3 ++ mm/gup.c | 58 ++++++++++++++++++++++ mm/internal.h | 3 ++ mm/madvise.c | 66 ++++++++++++++++++++++++++ 8 files changed, 142 insertions(+) diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h index a18ec7f63888..56b4ee5a6c9e 100644 --- a/arch/alpha/include/uapi/asm/mman.h +++ b/arch/alpha/include/uapi/asm/mman.h @@ -71,6 +71,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */ +#define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ +#define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h index 57dc2ac4f8bd..40b210c65a5a 100644 --- a/arch/mips/include/uapi/asm/mman.h +++ b/arch/mips/include/uapi/asm/mman.h @@ -98,6 +98,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */ +#define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ +#define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h index ab78cba446ed..9e3c010c0f61 100644 --- a/arch/parisc/include/uapi/asm/mman.h +++ b/arch/parisc/include/uapi/asm/mman.h @@ -52,6 +52,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */ +#define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ +#define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ + #define MADV_MERGEABLE 65 /* KSM may merge identical pages */ #define MADV_UNMERGEABLE 66 /* KSM may not merge identical pages */ diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h index e5e643752947..b3a22095371b 100644 --- a/arch/xtensa/include/uapi/asm/mman.h +++ b/arch/xtensa/include/uapi/asm/mman.h @@ -106,6 +106,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */ +#define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ +#define 
MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index f94f65d429be..1567a3294c3d 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -72,6 +72,9 @@ #define MADV_COLD 20 /* deactivate these pages */ #define MADV_PAGEOUT 21 /* reclaim these pages */ +#define MADV_POPULATE_READ 22 /* populate (prefault) page tables readable */ +#define MADV_POPULATE_WRITE 23 /* populate (prefault) page tables writable */ + /* compatibility flags */ #define MAP_FILE 0 diff --git a/mm/gup.c b/mm/gup.c index 8651309f8ec3..728d996767cb 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1500,6 +1500,64 @@ long populate_vma_page_range(struct vm_area_struct *vma, NULL, NULL, locked); } +/* + * faultin_vma_page_range() - populate (prefault) page tables inside the + * given VMA range readable/writable + * + * This takes care of mlocking the pages, too, if VM_LOCKED is set. + * + * @vma: target vma + * @start: start address + * @end: end address + * @write: whether to prefault readable or writable + * @locked: whether the mmap_lock is still held + * + * Returns either number of processed pages in the vma, or a negative error + * code on error (see __get_user_pages()). + * + * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and + * covered by the VMA. + * + * If @locked is NULL, it may be held for read or write and will be unperturbed. + * + * If @locked is non-NULL, it must held for read only and may be released. If + * it's released, *@locked will be set to 0. + */ +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, + unsigned long end, bool write, int *locked) +{ + struct mm_struct *mm = vma->vm_mm; + unsigned long nr_pages = (end - start) / PAGE_SIZE; + int gup_flags; + + VM_BUG_ON(!PAGE_ALIGNED(start)); + VM_BUG_ON(!PAGE_ALIGNED(end)); + VM_BUG_ON_VMA(start < vma->vm_start, vma); + VM_BUG_ON_VMA(end > vma->vm_end, vma); + mmap_assert_locked(mm); + + /* + * FOLL_TOUCH: Mark page accessed and thereby young; will also mark + * the page dirty with FOLL_WRITE -- which doesn't make a + * difference with !FOLL_FORCE, because the page is writable + * in the page table. + * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit + * a poisoned page. + * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT. + * !FOLL_FORCE: Require proper access permissions. + */ + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON; + if (write) + gup_flags |= FOLL_WRITE; + + /* + * See check_vma_flags(): Will return -EFAULT on incompatible mappings + * or with insufficient permissions. + */ + return __get_user_pages(mm, start, nr_pages, gup_flags, + NULL, NULL, locked); +} + /* * __mm_populate - populate and/or mlock pages within a range of address space. 
* diff --git a/mm/internal.h b/mm/internal.h index d9c2b33cd2ab..f38bc5d75ab7 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -345,6 +345,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma); #ifdef CONFIG_MMU extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *locked); +extern long faultin_vma_page_range(struct vm_area_struct *vma, + unsigned long start, unsigned long end, + bool write, int *locked); extern void munlock_vma_pages_range(struct vm_area_struct *vma, unsigned long start, unsigned long end); static inline void munlock_vma_pages_all(struct vm_area_struct *vma) diff --git a/mm/madvise.c b/mm/madvise.c index 63e489e5bfdb..6d3d348b17f4 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -53,6 +53,8 @@ static int madvise_need_mmap_write(int behavior) case MADV_COLD: case MADV_PAGEOUT: case MADV_FREE: + case MADV_POPULATE_READ: + case MADV_POPULATE_WRITE: return 0; default: /* be safe, default to 1. list exceptions explicitly */ @@ -822,6 +824,61 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, return -EINVAL; } +static long madvise_populate(struct vm_area_struct *vma, + struct vm_area_struct **prev, + unsigned long start, unsigned long end, + int behavior) +{ + const bool write = behavior == MADV_POPULATE_WRITE; + struct mm_struct *mm = vma->vm_mm; + unsigned long tmp_end; + int locked = 1; + long pages; + + *prev = vma; + + while (start < end) { + /* + * We might have temporarily dropped the lock. For example, + * our VMA might have been split. + */ + if (!vma || start >= vma->vm_end) { + vma = find_vma(mm, start); + if (!vma || start < vma->vm_start) + return -ENOMEM; + } + + tmp_end = min_t(unsigned long, end, vma->vm_end); + /* Populate (prefault) page tables readable/writable. */ + pages = faultin_vma_page_range(vma, start, tmp_end, write, + &locked); + if (!locked) { + mmap_read_lock(mm); + locked = 1; + *prev = NULL; + vma = NULL; + } + if (pages < 0) { + switch (pages) { + case -EINTR: + return -EINTR; + case -EFAULT: /* Incompatible mappings / permissions. */ + return -EINVAL; + case -EHWPOISON: + return -EHWPOISON; + default: + pr_warn_once("%s: unhandled return value: %ld\n", + __func__, pages); + fallthrough; + case -ENOMEM: + return -ENOMEM; + } + } + start += pages * PAGE_SIZE; + } + return 0; +} + /* * Application wants to free up the pages and associated backing store. * This is effectively punching a hole into the middle of a file. @@ -935,6 +992,9 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, case MADV_FREE: case MADV_DONTNEED: return madvise_dontneed_free(vma, prev, start, end, behavior); + case MADV_POPULATE_READ: + case MADV_POPULATE_WRITE: + return madvise_populate(vma, prev, start, end, behavior); default: return madvise_behavior(vma, prev, start, end, behavior); } @@ -955,6 +1015,8 @@ madvise_behavior_valid(int behavior) case MADV_FREE: case MADV_COLD: case MADV_PAGEOUT: + case MADV_POPULATE_READ: + case MADV_POPULATE_WRITE: #ifdef CONFIG_KSM case MADV_MERGEABLE: case MADV_UNMERGEABLE: @@ -1042,6 +1104,10 @@ process_madvise_behavior_valid(int behavior) * easily if memory pressure happens. * MADV_PAGEOUT - the application is not expected to use this memory soon, * page out the pages in this range immediately. 
+ * MADV_POPULATE_READ - populate (prefault) page tables readable by + * triggering read faults if required + * MADV_POPULATE_WRITE - populate (prefault) page tables writable by + * triggering write faults if required * * return values: * zero - success From 5d334317a9ac5ab42d18a1268773d4d557df8c3e Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:52:32 -0700 Subject: [PATCH 100/190] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT MEMORY MANAGEMENT seems to be a good fit. Link: https://lkml.kernel.org/r/20210419135443.12822-4-david@redhat.com Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport Cc: Michal Hocko Cc: Oscar Salvador Cc: Jason Gunthorpe Cc: Peter Xu Cc: Shuah Khan Cc: Andrea Arcangeli Cc: Arnd Bergmann Cc: Chris Zankel Cc: Dave Hansen Cc: Helge Deller Cc: Hugh Dickins Cc: Ivan Kokshaysky Cc: "James E.J. Bottomley" Cc: Jann Horn Cc: Kirill A. Shutemov Cc: Matthew Wilcox (Oracle) Cc: Matt Turner Cc: Max Filippov Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Minchan Kim Cc: Ram Pai Cc: Richard Henderson Cc: Rik van Riel Cc: Rolf Eike Beer Cc: Thomas Bogendoerfer Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 6657f01ddbd5..37f5a8ca413f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -11828,6 +11828,7 @@ F: include/linux/mmzone.h F: include/linux/pagewalk.h F: include/linux/vmalloc.h F: mm/ +F: tools/testing/selftests/vm/ MEMORY TECHNOLOGY DEVICES (MTD) M: Miquel Raynal From 2abdd8b8a29e10aa8d600d2d377690560eb5db3f Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:52:36 -0700 Subject: [PATCH 101/190] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore We missed adding two binaries to gitignore. Link: https://lkml.kernel.org/r/20210419135443.12822-5-david@redhat.com Signed-off-by: David Hildenbrand Cc: Michal Hocko Cc: Oscar Salvador Cc: Jason Gunthorpe Cc: Peter Xu Cc: Ram Pai Cc: Shuah Khan Cc: Andrea Arcangeli Cc: Arnd Bergmann Cc: Chris Zankel Cc: Dave Hansen Cc: Helge Deller Cc: Hugh Dickins Cc: Ivan Kokshaysky Cc: "James E.J. Bottomley" Cc: Jann Horn Cc: Kirill A. Shutemov Cc: Matthew Wilcox (Oracle) Cc: Matt Turner Cc: Max Filippov Cc: Michael S. Tsirkin Cc: Mike Kravetz Cc: Minchan Kim Cc: Richard Henderson Cc: Rik van Riel Cc: Rolf Eike Beer Cc: Thomas Bogendoerfer Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/.gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 1f651e85ed60..740e8db9dfb1 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -12,6 +12,8 @@ mremap_test on-fault-limit transhuge-stress protection_keys +protection_keys_32 +protection_keys_64 userfaultfd mlock-intersect-test mlock-random-test From e5bfac53e31087525ba5a629124b3100393b4d3e Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Wed, 30 Jun 2021 18:52:39 -0700 Subject: [PATCH 102/190] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) Let's add a simple test for MADV_POPULATE_READ and MADV_POPULATE_WRITE, verifying some error handling, that population works, and that softdirty tracking works as expected. For now, limit the test to private anonymous memory. 
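As background for decoding the raw pagemap masks used by the test (this is
context, not part of the patch): per Documentation/admin-guide/mm/pagemap.rst,
bit 63 of a pagemap entry means "page present", bit 62 "page swapped" and
bit 55 "pte is soft-dirty", which is what the 0xc000000000000000ull and
0x0080000000000000ull constants below encode; schematically:

#include <stdint.h>
#include <stdbool.h>

/* Background sketch: the pagemap bits the test's raw masks encode. */
static inline bool entry_is_populated(uint64_t entry)
{
	return entry & ((1ull << 63) | (1ull << 62));	/* present or swapped */
}

static inline bool entry_is_softdirty(uint64_t entry)
{
	return entry & (1ull << 55);			/* pte soft-dirty */
}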
Link: https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com Signed-off-by: David Hildenbrand Cc: Arnd Bergmann Cc: Michal Hocko Cc: Oscar Salvador Cc: Matthew Wilcox (Oracle) Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Jann Horn Cc: Jason Gunthorpe Cc: Dave Hansen Cc: Hugh Dickins Cc: Rik van Riel Cc: Michael S. Tsirkin Cc: Kirill A. Shutemov Cc: Vlastimil Babka Cc: Richard Henderson Cc: Ivan Kokshaysky Cc: Matt Turner Cc: Thomas Bogendoerfer Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: Chris Zankel Cc: Max Filippov Cc: Mike Kravetz Cc: Peter Xu Cc: Rolf Eike Beer Cc: Shuah Khan Cc: Ram Pai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/.gitignore | 1 + tools/testing/selftests/vm/Makefile | 1 + tools/testing/selftests/vm/madv_populate.c | 342 +++++++++++++++++++++ tools/testing/selftests/vm/run_vmtests.sh | 16 + 4 files changed, 360 insertions(+) create mode 100644 tools/testing/selftests/vm/madv_populate.c diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore index 740e8db9dfb1..d683a49d07d5 100644 --- a/tools/testing/selftests/vm/.gitignore +++ b/tools/testing/selftests/vm/.gitignore @@ -14,6 +14,7 @@ transhuge-stress protection_keys protection_keys_32 protection_keys_64 +madv_populate userfaultfd mlock-intersect-test mlock-random-test diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 73e1cc96d7c2..8be00319beca 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -31,6 +31,7 @@ TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-shm TEST_GEN_FILES += khugepaged +TEST_GEN_FILES += madv_populate TEST_GEN_FILES += map_fixed_noreplace TEST_GEN_FILES += map_hugetlb TEST_GEN_FILES += map_populate diff --git a/tools/testing/selftests/vm/madv_populate.c b/tools/testing/selftests/vm/madv_populate.c new file mode 100644 index 000000000000..b959e4ebdad4 --- /dev/null +++ b/tools/testing/selftests/vm/madv_populate.c @@ -0,0 +1,342 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * MADV_POPULATE_READ and MADV_POPULATE_WRITE tests + * + * Copyright 2021, Red Hat, Inc. + * + * Author(s): David Hildenbrand + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#if defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) + +/* + * For now, we're using 2 MiB of private anonymous memory for all tests. + */ +#define SIZE (2 * 1024 * 1024) + +static size_t pagesize; + +static uint64_t pagemap_get_entry(int fd, char *start) +{ + const unsigned long pfn = (unsigned long)start / pagesize; + uint64_t entry; + int ret; + + ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry)); + if (ret != sizeof(entry)) + ksft_exit_fail_msg("reading pagemap failed\n"); + return entry; +} + +static bool pagemap_is_populated(int fd, char *start) +{ + uint64_t entry = pagemap_get_entry(fd, start); + + /* Present or swapped. 
*/ + return entry & 0xc000000000000000ull; +} + +static bool pagemap_is_softdirty(int fd, char *start) +{ + uint64_t entry = pagemap_get_entry(fd, start); + + return entry & 0x0080000000000000ull; +} + +static void sense_support(void) +{ + char *addr; + int ret; + + addr = mmap(0, pagesize, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (!addr) + ksft_exit_fail_msg("mmap failed\n"); + + ret = madvise(addr, pagesize, MADV_POPULATE_READ); + if (ret) + ksft_exit_skip("MADV_POPULATE_READ is not available\n"); + + ret = madvise(addr, pagesize, MADV_POPULATE_WRITE); + if (ret) + ksft_exit_skip("MADV_POPULATE_WRITE is not available\n"); + + munmap(addr, pagesize); +} + +static void test_prot_read(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_READ); + ksft_test_result(!ret, "MADV_POPULATE_READ with PROT_READ\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); + ksft_test_result(ret == -1 && errno == EINVAL, + "MADV_POPULATE_WRITE with PROT_READ\n"); + + munmap(addr, SIZE); +} + +static void test_prot_write(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_READ); + ksft_test_result(ret == -1 && errno == EINVAL, + "MADV_POPULATE_READ with PROT_WRITE\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); + ksft_test_result(!ret, "MADV_POPULATE_WRITE with PROT_WRITE\n"); + + munmap(addr, SIZE); +} + +static void test_holes(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + ret = munmap(addr + pagesize, pagesize); + if (ret) + ksft_exit_fail_msg("munmap failed\n"); + + /* Hole in the middle */ + ret = madvise(addr, SIZE, MADV_POPULATE_READ); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_READ with holes in the middle\n"); + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_WRITE with holes in the middle\n"); + + /* Hole at end */ + ret = madvise(addr, 2 * pagesize, MADV_POPULATE_READ); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_READ with holes at the end\n"); + ret = madvise(addr, 2 * pagesize, MADV_POPULATE_WRITE); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_WRITE with holes at the end\n"); + + /* Hole at beginning */ + ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_READ); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_READ with holes at the beginning\n"); + ret = madvise(addr + pagesize, pagesize, MADV_POPULATE_WRITE); + ksft_test_result(ret == -1 && errno == ENOMEM, + "MADV_POPULATE_WRITE with holes at the beginning\n"); + + munmap(addr, SIZE); +} + +static bool range_is_populated(char *start, ssize_t size) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + bool ret = true; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + for (; size > 0 && ret; size -= pagesize, start += pagesize) + if (!pagemap_is_populated(fd, start)) + ret = false; + close(fd); + return ret; +} + +static bool range_is_not_populated(char 
*start, ssize_t size) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + bool ret = true; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + for (; size > 0 && ret; size -= pagesize, start += pagesize) + if (pagemap_is_populated(fd, start)) + ret = false; + close(fd); + return ret; +} + +static void test_populate_read(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + ksft_test_result(range_is_not_populated(addr, SIZE), + "range initially not populated\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_READ); + ksft_test_result(!ret, "MADV_POPULATE_READ\n"); + ksft_test_result(range_is_populated(addr, SIZE), + "range is populated\n"); + + munmap(addr, SIZE); +} + +static void test_populate_write(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + ksft_test_result(range_is_not_populated(addr, SIZE), + "range initially not populated\n"); + + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); + ksft_test_result(!ret, "MADV_POPULATE_WRITE\n"); + ksft_test_result(range_is_populated(addr, SIZE), + "range is populated\n"); + + munmap(addr, SIZE); +} + +static bool range_is_softdirty(char *start, ssize_t size) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + bool ret = true; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + for (; size > 0 && ret; size -= pagesize, start += pagesize) + if (!pagemap_is_softdirty(fd, start)) + ret = false; + close(fd); + return ret; +} + +static bool range_is_not_softdirty(char *start, ssize_t size) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + bool ret = true; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + for (; size > 0 && ret; size -= pagesize, start += pagesize) + if (pagemap_is_softdirty(fd, start)) + ret = false; + close(fd); + return ret; +} + +static void clear_softdirty(void) +{ + int fd = open("/proc/self/clear_refs", O_WRONLY); + const char *ctrl = "4"; + int ret; + + if (fd < 0) + ksft_exit_fail_msg("opening clear_refs failed\n"); + ret = write(fd, ctrl, strlen(ctrl)); + if (ret != strlen(ctrl)) + ksft_exit_fail_msg("writing clear_refs failed\n"); + close(fd); +} + +static void test_softdirty(void) +{ + char *addr; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + addr = mmap(0, SIZE, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, 0, 0); + if (addr == MAP_FAILED) + ksft_exit_fail_msg("mmap failed\n"); + + /* Clear any softdirty bits. */ + clear_softdirty(); + ksft_test_result(range_is_not_softdirty(addr, SIZE), + "range is not softdirty\n"); + + /* Populating READ should set softdirty. */ + ret = madvise(addr, SIZE, MADV_POPULATE_READ); + ksft_test_result(!ret, "MADV_POPULATE_READ\n"); + ksft_test_result(range_is_not_softdirty(addr, SIZE), + "range is not softdirty\n"); + + /* Populating WRITE should set softdirty. 
*/ + ret = madvise(addr, SIZE, MADV_POPULATE_WRITE); + ksft_test_result(!ret, "MADV_POPULATE_WRITE\n"); + ksft_test_result(range_is_softdirty(addr, SIZE), + "range is softdirty\n"); + + munmap(addr, SIZE); +} + +int main(int argc, char **argv) +{ + int err; + + pagesize = getpagesize(); + + ksft_print_header(); + ksft_set_plan(21); + + sense_support(); + test_prot_read(); + test_prot_write(); + test_holes(); + test_populate_read(); + test_populate_write(); + test_softdirty(); + + err = ksft_get_fail_cnt(); + if (err) + ksft_exit_fail_msg("%d out of %d tests failed\n", + err, ksft_test_num()); + return ksft_exit_pass(); +} + +#else /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */ + +#warning "missing MADV_POPULATE_READ or MADV_POPULATE_WRITE definition" + +int main(int argc, char **argv) +{ + ksft_print_header(); + ksft_exit_skip("MADV_POPULATE_READ or MADV_POPULATE_WRITE not defined\n"); +} + +#endif /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */ diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh index e953f3cd9664..955782d138ab 100755 --- a/tools/testing/selftests/vm/run_vmtests.sh +++ b/tools/testing/selftests/vm/run_vmtests.sh @@ -346,4 +346,20 @@ else exitcode=1 fi +echo "--------------------------------------------------------" +echo "running MADV_POPULATE_READ and MADV_POPULATE_WRITE tests" +echo "--------------------------------------------------------" +./madv_populate +ret_val=$? + +if [ $ret_val -eq 0 ]; then + echo "[PASS]" +elif [ $ret_val -eq $ksft_skip ]; then + echo "[SKIP]" + exitcode=$ksft_skip +else + echo "[FAIL]" + exitcode=1 +fi + exit $exitcode From 786dee864804f8e851cf0f258df2ccbb4ee03d80 Mon Sep 17 00:00:00 2001 From: Liam Mark Date: Wed, 30 Jun 2021 18:52:43 -0700 Subject: [PATCH 103/190] mm/memory_hotplug: rate limit page migration warnings When offlining memory the system can attempt to migrate a lot of pages, if there are problems with migration this can flood the logs. Printing all the data hogs the CPU and cause some RT threads to run for a long time, which may have some bad consequences. Rate limit the page migration warnings in order to avoid this. 
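For context on the helpers used in the hunks below (a schematic sketch with
illustrative names, not part of the patch): DEFINE_RATELIMIT_STATE() with the
default interval/burst allows roughly 10 messages per 5-second window, and
__ratelimit() returns true only while that budget lasts:

#include <linux/ratelimit.h>
#include <linux/printk.h>

/* Illustrative pattern only; names do not appear in the patch. */
static DEFINE_RATELIMIT_STATE(isolate_rs, DEFAULT_RATELIMIT_INTERVAL,
			      DEFAULT_RATELIMIT_BURST);

static void report_isolation_failure(unsigned long pfn)
{
	/* Prints at most a burst of messages per interval, then drops them. */
	if (__ratelimit(&isolate_rs))
		pr_warn("failed to isolate pfn %lx\n", pfn);
}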
Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark Signed-off-by: Georgi Djakov Cc: David Hildenbrand Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 492560b61999..1a9568d90ea7 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1406,6 +1406,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) struct page *page, *head; int ret = 0; LIST_HEAD(source); + static DEFINE_RATELIMIT_STATE(migrate_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); for (pfn = start_pfn; pfn < end_pfn; pfn++) { if (!pfn_valid(pfn)) @@ -1452,8 +1454,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) page_is_file_lru(page)); } else { - pr_warn("failed to isolate pfn %lx\n", pfn); - dump_page(page, "isolation failed"); + if (__ratelimit(&migrate_rs)) { + pr_warn("failed to isolate pfn %lx\n", pfn); + dump_page(page, "isolation failed"); + } } put_page(page); } @@ -1482,9 +1486,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG); if (ret) { list_for_each_entry(page, &source, lru) { - pr_warn("migrating pfn %lx failed ret:%d ", - page_to_pfn(page), ret); - dump_page(page, "migration failure"); + if (__ratelimit(&migrate_rs)) { + pr_warn("migrating pfn %lx failed ret:%d\n", + page_to_pfn(page), ret); + dump_page(page, "migration failure"); + } } putback_movable_pages(&source); } From 27cacaad16c549ce5dd30ae84100b7e680536822 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Wed, 30 Jun 2021 18:52:46 -0700 Subject: [PATCH 104/190] mm,memory_hotplug: drop unneeded locking Currently, memory-hotplug code takes zone's span_writelock and pgdat's resize_lock when resizing the node/zone's spanned pages via {move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing node and zone's present pages via adjust_present_page_count(). These locks are also taken during the initialization of the system at boot time, where it protects parallel struct page initialization, but they should not really be needed in memory-hotplug where all operations are a) synchronized on device level and b) serialized by the mem_hotplug_lock lock. 
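For context, the serialization being relied on is the bracketing that every hotplug entry point already performs. A rough sketch under that assumption (the function name below is made up; only mem_hotplug_begin()/mem_hotplug_done() are real APIs):

	#include <linux/memory_hotplug.h>

	static void example_hotplug_operation(void)
	{
		mem_hotplug_begin();	/* writer side of mem_hotplug_lock */
		/*
		 * Zone/node span and present-page updates happen in here; with
		 * every hotplug path funnelled through this lock, the extra
		 * zone_span_writelock()/pgdat_resize_lock() pairs add nothing.
		 */
		mem_hotplug_done();
	}
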
[akpm@linux-foundation.org: remove now-unused locals] Link: https://lkml.kernel.org/r/20210531093958.15021-1-osalvador@suse.de Signed-off-by: Oscar Salvador Acked-by: David Hildenbrand Acked-by: Michal Hocko Cc: Anshuman Khandual Cc: Vlastimil Babka Cc: Pavel Tatashin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 1a9568d90ea7..553c52751249 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -329,7 +329,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, unsigned long pfn; int nid = zone_to_nid(zone); - zone_span_writelock(zone); if (zone->zone_start_pfn == start_pfn) { /* * If the section is smallest section in the zone, it need @@ -362,7 +361,6 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, zone->spanned_pages = 0; } } - zone_span_writeunlock(zone); } static void update_pgdat_span(struct pglist_data *pgdat) @@ -399,7 +397,7 @@ void __ref remove_pfn_range_from_zone(struct zone *zone, { const unsigned long end_pfn = start_pfn + nr_pages; struct pglist_data *pgdat = zone->zone_pgdat; - unsigned long pfn, cur_nr_pages, flags; + unsigned long pfn, cur_nr_pages; /* Poison struct pages because they are now uninitialized again. */ for (pfn = start_pfn; pfn < end_pfn; pfn += cur_nr_pages) { @@ -424,10 +422,8 @@ void __ref remove_pfn_range_from_zone(struct zone *zone, clear_zone_contiguous(zone); - pgdat_resize_lock(zone->zone_pgdat, &flags); shrink_zone_span(zone, start_pfn, start_pfn + nr_pages); update_pgdat_span(pgdat); - pgdat_resize_unlock(zone->zone_pgdat, &flags); set_zone_contiguous(zone); } @@ -634,19 +630,13 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn, { struct pglist_data *pgdat = zone->zone_pgdat; int nid = pgdat->node_id; - unsigned long flags; clear_zone_contiguous(zone); - /* TODO Huh pgdat is irqsave while zone is not. It used to be like that before */ - pgdat_resize_lock(pgdat, &flags); - zone_span_writelock(zone); if (zone_is_empty(zone)) init_currently_empty_zone(zone, start_pfn, nr_pages); resize_zone_range(zone, start_pfn, nr_pages); - zone_span_writeunlock(zone); resize_pgdat_range(pgdat, start_pfn, nr_pages); - pgdat_resize_unlock(pgdat, &flags); /* * Subsection population requires care in pfn_to_online_page(). @@ -736,12 +726,8 @@ struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn, */ void adjust_present_page_count(struct zone *zone, long nr_pages) { - unsigned long flags; - zone->present_pages += nr_pages; - pgdat_resize_lock(zone->zone_pgdat, &flags); zone->zone_pgdat->node_present_pages += nr_pages; - pgdat_resize_unlock(zone->zone_pgdat, &flags); } int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, From 2c1e9a2c668b4606e9c27fe420ddf83d113928c8 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:52:49 -0700 Subject: [PATCH 105/190] mm/zswap.c: remove unused function zswap_debugfs_exit() Patch series "Cleanup and fixup for zswap". This series contains cleanups to remove unused function and avoid unnecessary copy-in at map time. Also this fixes two bugs in the function zswap_writeback_entry(). More details can be found in the respective changelogs. This patch (of 3): zswap_debugfs_exit() is unused, remove it. 
Link: https://lkml.kernel.org/r/20210522092242.3233191-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210522092242.3233191-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Seth Jennings Cc: Dan Streetman Cc: Vitaly Wool Cc: Sebastian Andrzej Siewior Cc: Nathan Chancellor Cc: Colin Ian King Cc: Tian Tao Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zswap.c | 7 ------- 1 file changed, 7 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index 20763267a219..e93459319fdb 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1427,18 +1427,11 @@ static int __init zswap_debugfs_init(void) return 0; } - -static void __exit zswap_debugfs_exit(void) -{ - debugfs_remove_recursive(zswap_debugfs_root); -} #else static int __init zswap_debugfs_init(void) { return 0; } - -static void __exit zswap_debugfs_exit(void) { } #endif /********************************* From ae34af1f11d0a6ae849b7605d15df9798dab7b46 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:52:52 -0700 Subject: [PATCH 106/190] mm/zswap.c: avoid unnecessary copy-in at map time The buf mapped via zpool_map_handle() is only used to store compressed page buffer and there is no information to extract from it. So we could use ZPOOL_MM_WO instead to avoid unnecessary copy-in at map time. Link: https://lkml.kernel.org/r/20210522092242.3233191-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Colin Ian King Cc: Dan Streetman Cc: Nathan Chancellor Cc: Sebastian Andrzej Siewior Cc: Seth Jennings Cc: Tian Tao Cc: Vitaly Wool Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zswap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zswap.c b/mm/zswap.c index e93459319fdb..c27ce9f2cdf8 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1203,7 +1203,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset, zswap_reject_alloc_fail++; goto put_dstmem; } - buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_RW); + buf = zpool_map_handle(entry->pool->zpool, handle, ZPOOL_MM_WO); memcpy(buf, &zhdr, hlen); memcpy(buf + hlen, dst, dlen); zpool_unmap_handle(entry->pool->zpool, handle); From 46b76f2e09dc35f70aca2f4349eb0d158f53fe93 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:52:55 -0700 Subject: [PATCH 107/190] mm/zswap.c: fix two bugs in zswap_writeback_entry() In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST case, we forgot to call zpool_unmap_handle() when zpool can't sleep. And we might sleep in zswap_get_swap_cache_page() while zpool can't sleep. To fix all of these, zpool_unmap_handle() should be done before zswap_get_swap_cache_page() when zpool can't sleep. 
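The ordering constraint is the interesting part: when the backing zpool cannot stay mapped across a sleep, the compressed data has to be copied out and the handle unmapped before anything that might sleep runs. A hedged sketch of that pattern with made-up names ('tmp' assumed preallocated; the real change is in the diff below):

	#include <linux/zpool.h>
	#include <linux/string.h>

	static void copy_then_sleep(struct zpool *pool, unsigned long handle,
				    void *tmp, size_t len)
	{
		void *src = zpool_map_handle(pool, handle, ZPOOL_MM_RO);

		if (!zpool_can_sleep_mapped(pool)) {
			memcpy(tmp, src, len);		  /* copy while mapped */
			src = tmp;
			zpool_unmap_handle(pool, handle); /* drop atomic mapping */
		}

		/*
		 * Sleeping work (e.g. a swap cache page allocation) is safe
		 * from this point on, and 'src' is valid either way.
		 */

		if (zpool_can_sleep_mapped(pool))
			zpool_unmap_handle(pool, handle);
	}
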
Link: https://lkml.kernel.org/r/20210522092242.3233191-4-linmiaohe@huawei.com Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped") Signed-off-by: Miaohe Lin Cc: Colin Ian King Cc: Dan Streetman Cc: Nathan Chancellor Cc: Sebastian Andrzej Siewior Cc: Seth Jennings Cc: Tian Tao Cc: Vitaly Wool Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zswap.c | 17 +++++++---------- 1 file changed, 7 insertions(+), 10 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index c27ce9f2cdf8..7944e3e57e78 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -967,6 +967,13 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle) spin_unlock(&tree->lock); BUG_ON(offset != entry->offset); + src = (u8 *)zhdr + sizeof(struct zswap_header); + if (!zpool_can_sleep_mapped(pool)) { + memcpy(tmp, src, entry->length); + src = tmp; + zpool_unmap_handle(pool, handle); + } + /* try to allocate swap cache page */ switch (zswap_get_swap_cache_page(swpentry, &page)) { case ZSWAP_SWAPCACHE_FAIL: /* no memory or invalidate happened */ @@ -982,17 +989,7 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle) case ZSWAP_SWAPCACHE_NEW: /* page is locked */ /* decompress */ acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx); - dlen = PAGE_SIZE; - src = (u8 *)zhdr + sizeof(struct zswap_header); - - if (!zpool_can_sleep_mapped(pool)) { - - memcpy(tmp, src, entry->length); - src = tmp; - - zpool_unmap_handle(pool, handle); - } mutex_lock(acomp_ctx->mutex); sg_init_one(&input, src, entry->length); From ce8475b6a4e547fcea60410a8385d80988e12c7e Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:53:01 -0700 Subject: [PATCH 108/190] mm/zsmalloc.c: remove confusing code in obj_free() Patch series "Cleanup for zsmalloc". This series contains cleanups to remove confusing code in obj_free(), combine two atomic ops and improve readability for async_free_zspage(). More details can be found in the respective changelogs. This patch (of 2): OBJ_ALLOCATED_TAG is only set for handle to indicate allocated object. It's irrelevant with obj. So remove this misleading code to improve readability. Link: https://lkml.kernel.org/r/20210624123930.1769093-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210624123930.1769093-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Minchan Kim Cc: Nitin Gupta Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 1 - 1 file changed, 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 19b563bc6c48..9198b16d4f17 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -1471,7 +1471,6 @@ static void obj_free(struct size_class *class, unsigned long obj) unsigned int f_objidx; void *vaddr; - obj &= ~OBJ_ALLOCATED_TAG; obj_to_location(obj, &f_page, &f_objidx); f_offset = (class->size * f_objidx) & ~PAGE_MASK; zspage = get_zspage(f_page); From 338483372626f9b89ed91ec0b422562ef53b0b12 Mon Sep 17 00:00:00 2001 From: Miaohe Lin Date: Wed, 30 Jun 2021 18:53:04 -0700 Subject: [PATCH 109/190] mm/zsmalloc.c: improve readability for async_free_zspage() The class is extracted from pool->size_class[class_idx] again before calling __free_zspage(). It looks like class will change after we fetch the class lock. But this is misleading as class will stay unchanged. 
Link: https://lkml.kernel.org/r/20210624123930.1769093-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin Cc: Minchan Kim Cc: Nitin Gupta Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zsmalloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 9198b16d4f17..68e8831068f4 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -2162,7 +2162,7 @@ static void async_free_zspage(struct work_struct *work) VM_BUG_ON(fullness != ZS_EMPTY); class = pool->size_class[class_idx]; spin_lock(&class->lock); - __free_zspage(pool, pool->size_class[class_idx], zspage); + __free_zspage(pool, class, zspage); spin_unlock(&class->lock); } }; From dd794835432c1fbdec5c34ab348ddb641ca2a42d Mon Sep 17 00:00:00 2001 From: Yue Hu Date: Wed, 30 Jun 2021 18:53:07 -0700 Subject: [PATCH 110/190] zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK backing_dev is never used when not enable CONFIG_ZRAM_WRITEBACK and it's introduced from writeback feature. So it's needless also affect readability in that case. Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com Signed-off-by: Yue Hu Reviewed-by: Sergey Senozhatsky Acked-by: Minchan Kim Cc: Sergey Senozhatsky Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/block/zram/zram_drv.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h index 419a7e8281ee..6e73dc3c2769 100644 --- a/drivers/block/zram/zram_drv.h +++ b/drivers/block/zram/zram_drv.h @@ -113,8 +113,8 @@ struct zram { * zram is claimed so open request will be failed */ bool claim; /* Protected by bdev->bd_mutex */ - struct file *backing_dev; #ifdef CONFIG_ZRAM_WRITEBACK + struct file *backing_dev; spinlock_t wb_limit_lock; bool wb_limit_enable; u64 bd_wb_limit; From c4ffefd16daba0f29fa7d9534de20949b673eca0 Mon Sep 17 00:00:00 2001 From: Hyeonggon Yoo <42.hyeyoo@gmail.com> Date: Wed, 30 Jun 2021 18:53:10 -0700 Subject: [PATCH 111/190] mm: fix typos and grammar error in comments We moves tha -> We move that in mm/swap.c statments -> statements in include/linux/mm.h Link: https://lkml.kernel.org/r/20210509063444.GA24745@hyeyoo Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mm.h | 2 +- mm/swap.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3cbd2d6d248e..714ad9b26ed2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -155,7 +155,7 @@ extern int mmap_rnd_compat_bits __read_mostly; /* This function must be updated when the size of struct page grows above 80 * or reduces below 56. The idea that compiler optimizes out switch() * statement, and only leaves move/store instructions. Also the compiler can - * combine write statments if they are both assignments and can be reordered, + * combine write statements if they are both assignments and can be reordered, * this can result in several of the writes here being dropped. */ #define mm_zero_struct_page(pp) __mm_zero_struct_page(pp) diff --git a/mm/swap.c b/mm/swap.c index 6c11db780467..19600430e536 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -554,7 +554,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) } else { /* * The page's writeback ends up during pagevec - * We moves tha page into tail of inactive. + * We move that page into tail of inactive. 
*/ add_page_to_lru_list_tail(page, lruvec); __count_vm_events(PGROTATED, nr_pages); From fac7757e1fb05b75c8e22d4f8fe2f6c9c4d7edca Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 30 Jun 2021 18:53:13 -0700 Subject: [PATCH 112/190] mm: define default value for FIRST_USER_ADDRESS Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the same code all over. Instead just define a generic default value (i.e 0UL) for FIRST_USER_ADDRESS and let the platforms override when required. This makes it much cleaner with reduced code. The default FIRST_USER_ADDRESS here would be skipped in when the given platform overrides its value via . Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Acked-by: Geert Uytterhoeven [m68k] Acked-by: Guo Ren [csky] Acked-by: Stafford Horne [openrisc] Acked-by: Catalin Marinas [arm64] Acked-by: Mike Rapoport Acked-by: Palmer Dabbelt [RISC-V] Cc: Richard Henderson Cc: Vineet Gupta Cc: Catalin Marinas Cc: Will Deacon Cc: Guo Ren Cc: Brian Cain Cc: Geert Uytterhoeven Cc: Michal Simek Cc: Thomas Bogendoerfer Cc: Ley Foon Tan Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "James E.J. Bottomley" Cc: Michael Ellerman Cc: Christophe Leroy Cc: Paul Walmsley Cc: Heiko Carstens Cc: Yoshinori Sato Cc: "David S. Miller" Cc: Jeff Dike Cc: Thomas Gleixner Cc: Chris Zankel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/pgtable.h | 1 - arch/arc/include/asm/pgtable.h | 6 ------ arch/arm64/include/asm/pgtable.h | 2 -- arch/csky/include/asm/pgtable.h | 1 - arch/hexagon/include/asm/pgtable.h | 3 --- arch/ia64/include/asm/pgtable.h | 1 - arch/m68k/include/asm/pgtable_mm.h | 1 - arch/microblaze/include/asm/pgtable.h | 2 -- arch/mips/include/asm/pgtable-32.h | 1 - arch/mips/include/asm/pgtable-64.h | 1 - arch/nios2/include/asm/pgtable.h | 2 -- arch/openrisc/include/asm/pgtable.h | 1 - arch/parisc/include/asm/pgtable.h | 2 -- arch/powerpc/include/asm/book3s/pgtable.h | 1 - arch/powerpc/include/asm/nohash/32/pgtable.h | 1 - arch/powerpc/include/asm/nohash/64/pgtable.h | 2 -- arch/riscv/include/asm/pgtable.h | 2 -- arch/s390/include/asm/pgtable.h | 2 -- arch/sh/include/asm/pgtable.h | 2 -- arch/sparc/include/asm/pgtable_32.h | 1 - arch/sparc/include/asm/pgtable_64.h | 3 --- arch/um/include/asm/pgtable-2level.h | 1 - arch/um/include/asm/pgtable-3level.h | 1 - arch/x86/include/asm/pgtable_types.h | 2 -- arch/xtensa/include/asm/pgtable.h | 1 - include/linux/pgtable.h | 9 +++++++++ 26 files changed, 9 insertions(+), 43 deletions(-) diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h index e1757b7cfe3d..ff690846465e 100644 --- a/arch/alpha/include/asm/pgtable.h +++ b/arch/alpha/include/asm/pgtable.h @@ -46,7 +46,6 @@ struct vm_area_struct; #define PTRS_PER_PMD (1UL << (PAGE_SHIFT-3)) #define PTRS_PER_PGD (1UL << (PAGE_SHIFT-3)) #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL /* Number of pointers that fit on a page: this will go away. */ #define PTRS_PER_PAGE (1UL << (PAGE_SHIFT-3)) diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h index 5878846f00cf..577682e8cc7f 100644 --- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -222,12 +222,6 @@ */ #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) -/* - * No special requirements for lowest virtual address we permit any user space - * mapping to be mapped at. 
- */ -#define FIRST_USER_ADDRESS 0UL - /**************************************************************** * Bucket load of VM Helpers diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 0b10204e72fc..25f5c04b43ce 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -26,8 +26,6 @@ #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT)) -#define FIRST_USER_ADDRESS 0UL - #ifndef __ASSEMBLY__ #include diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h index 0d60367b6bfa..151607ed5158 100644 --- a/arch/csky/include/asm/pgtable.h +++ b/arch/csky/include/asm/pgtable.h @@ -14,7 +14,6 @@ #define PGDIR_MASK (~(PGDIR_SIZE-1)) #define USER_PTRS_PER_PGD (PAGE_OFFSET/PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL /* * C-SKY is two-level paging structure: diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h index dbb22b80b8c4..e4979508cddf 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -155,9 +155,6 @@ extern unsigned long _dflt_cache_att; extern pgd_t swapper_pg_dir[PTRS_PER_PGD]; /* located in head.S */ -/* Seems to be zero even in architectures where the zero page is firewalled? */ -#define FIRST_USER_ADDRESS 0UL - /* HUGETLB not working currently */ #ifdef CONFIG_HUGETLB_PAGE #define pte_mkhuge(pte) __pte((pte_val(pte) & ~0x3) | HVM_HUGEPAGE_SIZE) diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h index d765fd948fae..3f5dbbd8b9d8 100644 --- a/arch/ia64/include/asm/pgtable.h +++ b/arch/ia64/include/asm/pgtable.h @@ -128,7 +128,6 @@ #define PTRS_PER_PGD_SHIFT PTRS_PER_PTD_SHIFT #define PTRS_PER_PGD (1UL << PTRS_PER_PGD_SHIFT) #define USER_PTRS_PER_PGD (5*PTRS_PER_PGD/8) /* regions 0-4 are user regions */ -#define FIRST_USER_ADDRESS 0UL /* * All the normal masks have the "page accessed" bits on, as any time diff --git a/arch/m68k/include/asm/pgtable_mm.h b/arch/m68k/include/asm/pgtable_mm.h index aca22c2c1ee2..143ba7de9bda 100644 --- a/arch/m68k/include/asm/pgtable_mm.h +++ b/arch/m68k/include/asm/pgtable_mm.h @@ -72,7 +72,6 @@ #define PTRS_PER_PGD 128 #endif #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL /* Virtual address region for use by kernel_map() */ #ifdef CONFIG_SUN3 diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h index 9ae8d2c17dd5..71cd547655d9 100644 --- a/arch/microblaze/include/asm/pgtable.h +++ b/arch/microblaze/include/asm/pgtable.h @@ -25,8 +25,6 @@ extern int mem_init_done; #include #include -#define FIRST_USER_ADDRESS 0UL - extern unsigned long va_to_phys(unsigned long address); extern pte_t *va_to_pte(unsigned long address); diff --git a/arch/mips/include/asm/pgtable-32.h b/arch/mips/include/asm/pgtable-32.h index 6c0532d7b211..95df9c293d8d 100644 --- a/arch/mips/include/asm/pgtable-32.h +++ b/arch/mips/include/asm/pgtable-32.h @@ -93,7 +93,6 @@ extern int add_temporary_entry(unsigned long entrylo0, unsigned long entrylo1, #endif #define USER_PTRS_PER_PGD (0x80000000UL/PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL #define VMALLOC_START MAP_BASE diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index 1e7d6ce9d8d6..046465906c82 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -137,7 +137,6 @@ #define PTRS_PER_PTE ((PAGE_SIZE << PTE_ORDER) / sizeof(pte_t)) #define USER_PTRS_PER_PGD ((TASK_SIZE64 / 
PGDIR_SIZE)?(TASK_SIZE64 / PGDIR_SIZE):1) -#define FIRST_USER_ADDRESS 0UL /* * TLB refill handlers also map the vmalloc area into xuseg. Avoid diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index 2600d76c310c..4a995fa628ee 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -24,8 +24,6 @@ #include #include -#define FIRST_USER_ADDRESS 0UL - #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE #define VMALLOC_END (CONFIG_NIOS2_KERNEL_REGION_BASE - 1) diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h index 9425bedab4fc..4ac591c9ca33 100644 --- a/arch/openrisc/include/asm/pgtable.h +++ b/arch/openrisc/include/asm/pgtable.h @@ -73,7 +73,6 @@ extern void paging_init(void); */ #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL /* * Kernels own virtual memory area. diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h index 39017210dbf0..7f33c29764cc 100644 --- a/arch/parisc/include/asm/pgtable.h +++ b/arch/parisc/include/asm/pgtable.h @@ -171,8 +171,6 @@ static inline void purge_tlb_entries(struct mm_struct *mm, unsigned long addr) * pgd entries used up by user/kernel: */ -#define FIRST_USER_ADDRESS 0UL - /* NB: The tlb miss handlers make certain assumptions about the order */ /* of the following bits, so be careful (One example, bits 25-31 */ /* are moved together in one instruction). */ diff --git a/arch/powerpc/include/asm/book3s/pgtable.h b/arch/powerpc/include/asm/book3s/pgtable.h index 0e1263455d73..ad130e15a126 100644 --- a/arch/powerpc/include/asm/book3s/pgtable.h +++ b/arch/powerpc/include/asm/book3s/pgtable.h @@ -8,7 +8,6 @@ #include #endif -#define FIRST_USER_ADDRESS 0UL #ifndef __ASSEMBLY__ /* Insert a PTE, top-level function is out of line. It uses an inline * low level function in the respective pgtable-* files diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h b/arch/powerpc/include/asm/nohash/32/pgtable.h index 96522f7f0618..f06ae00f2a65 100644 --- a/arch/powerpc/include/asm/nohash/32/pgtable.h +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h @@ -54,7 +54,6 @@ extern int icache_44x_need_flush; #define PGD_MASKED_BITS 0 #define USER_PTRS_PER_PGD (TASK_SIZE / PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL #define pte_ERROR(e) \ pr_err("%s:%d: bad pte %llx.\n", __FILE__, __LINE__, \ diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h index 57cd3892bfe0..53fbfdfac93d 100644 --- a/arch/powerpc/include/asm/nohash/64/pgtable.h +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h @@ -12,8 +12,6 @@ #include #include -#define FIRST_USER_ADDRESS 0UL - /* * Size of EA range mapped by our pagetables. */ diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 380cd3a7e548..62f3fe7368f3 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -536,8 +536,6 @@ void setup_bootmem(void); void paging_init(void); void misc_mem_init(void); -#define FIRST_USER_ADDRESS 0 - /* * ZERO_PAGE is a global shared page that is always zero, * used for zero-mapped memory areas, etc. diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index b38f7b781564..7c86bf03fc8f 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -65,8 +65,6 @@ extern unsigned long zero_page_mask; /* TODO: s390 cannot support io_remap_pfn_range... 
*/ -#define FIRST_USER_ADDRESS 0UL - #define pte_ERROR(e) \ printk("%s:%d: bad pte %p.\n", __FILE__, __LINE__, (void *) pte_val(e)) #define pmd_ERROR(e) \ diff --git a/arch/sh/include/asm/pgtable.h b/arch/sh/include/asm/pgtable.h index 27751e9470df..d7ddb1ec86a0 100644 --- a/arch/sh/include/asm/pgtable.h +++ b/arch/sh/include/asm/pgtable.h @@ -59,8 +59,6 @@ static inline unsigned long long neff_sign_extend(unsigned long val) /* Entries per level */ #define PTRS_PER_PTE (PAGE_SIZE / (1 << PTE_MAGNITUDE)) -#define FIRST_USER_ADDRESS 0UL - #define PHYS_ADDR_MASK29 0x1fffffff #define PHYS_ADDR_MASK32 0xffffffff diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index a5cf79c149fe..0888bda245f5 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -48,7 +48,6 @@ unsigned long __init bootmem_init(unsigned long *pages_avail); #define PTRS_PER_PMD 64 #define PTRS_PER_PGD 256 #define USER_PTRS_PER_PGD PAGE_OFFSET / PGDIR_SIZE -#define FIRST_USER_ADDRESS 0UL #define PTE_SIZE (PTRS_PER_PTE*4) #define PAGE_NONE SRMMU_PAGE_NONE diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index 2cd80a0a9795..a400e0f23046 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -95,9 +95,6 @@ bool kern_addr_valid(unsigned long addr); #define PTRS_PER_PUD (1UL << PUD_BITS) #define PTRS_PER_PGD (1UL << PGDIR_BITS) -/* Kernel has a separate 44bit address space. */ -#define FIRST_USER_ADDRESS 0UL - #define pmd_ERROR(e) \ pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n", \ __FILE__, __LINE__, &(e), pmd_val(e), __builtin_return_address(0)) diff --git a/arch/um/include/asm/pgtable-2level.h b/arch/um/include/asm/pgtable-2level.h index 32106d31e4ab..8256ecc5b919 100644 --- a/arch/um/include/asm/pgtable-2level.h +++ b/arch/um/include/asm/pgtable-2level.h @@ -23,7 +23,6 @@ #define PTRS_PER_PTE 1024 #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE) #define PTRS_PER_PGD 1024 -#define FIRST_USER_ADDRESS 0UL #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%08lx).\n", __FILE__, __LINE__, &(e), \ diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h index 7e6a4180db9d..9289a86643a9 100644 --- a/arch/um/include/asm/pgtable-3level.h +++ b/arch/um/include/asm/pgtable-3level.h @@ -41,7 +41,6 @@ #endif #define USER_PTRS_PER_PGD ((TASK_SIZE + (PGDIR_SIZE - 1)) / PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL #define pte_ERROR(e) \ printk("%s:%d: bad pte %p(%016lx).\n", __FILE__, __LINE__, &(e), \ diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index f24d7ef8fffa..40497a9020c6 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -7,8 +7,6 @@ #include -#define FIRST_USER_ADDRESS 0UL - #define _PAGE_BIT_PRESENT 0 /* is present */ #define _PAGE_BIT_RW 1 /* writeable */ #define _PAGE_BIT_USER 2 /* userspace addressable */ diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h index d7fc45c920c2..bd5aeb795567 100644 --- a/arch/xtensa/include/asm/pgtable.h +++ b/arch/xtensa/include/asm/pgtable.h @@ -59,7 +59,6 @@ #define PTRS_PER_PGD 1024 #define PGD_ORDER 0 #define USER_PTRS_PER_PGD (TASK_SIZE/PGDIR_SIZE) -#define FIRST_USER_ADDRESS 0UL #define FIRST_USER_PGD_NR (FIRST_USER_ADDRESS >> PGDIR_SHIFT) #ifdef CONFIG_MMU diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 2b0d02291178..69700e3e615f 100644 --- 
a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -28,6 +28,15 @@ #define USER_PGTABLES_CEILING 0UL #endif +/* + * This defines the first usable user address. Platforms + * can override its value with custom FIRST_USER_ADDRESS + * defined in their respective . + */ +#ifndef FIRST_USER_ADDRESS +#define FIRST_USER_ADDRESS 0UL +#endif + /* * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD] * From 041711ce7cdf023f53d76f64d82b75210248e18d Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 30 Jun 2021 18:53:17 -0700 Subject: [PATCH 113/190] mm: fix spelling mistakes Fix some spelling mistakes in comments: each having differents usage ==> each has a different usage statments ==> statements adresses ==> addresses aggresive ==> aggressive datas ==> data posion ==> poison higer ==> higher precisly ==> precisely wont ==> won't We moves tha ==> We move the endianess ==> endianness Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei Reviewed-by: Souptick Joarder Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memremap.h | 2 +- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 2 +- mm/memory-failure.c | 2 +- mm/memory_hotplug.c | 4 ++-- mm/page_alloc.c | 2 +- mm/swapfile.c | 2 +- 7 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 45a79da89c5f..c0e9d35889e8 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -26,7 +26,7 @@ struct vmem_altmap { }; /* - * Specialize ZONE_DEVICE memory into multiple types each having differents + * Specialize ZONE_DEVICE memory into multiple types each has a different * usage. * * MEMORY_DEVICE_PRIVATE: diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index b66d0225414e..748617780924 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -397,7 +397,7 @@ struct mm_struct { unsigned long mmap_base; /* base of mmap area */ unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */ #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES - /* Base adresses for compatible mmap() */ + /* Base addresses for compatible mmap() */ unsigned long mmap_compat_base; unsigned long mmap_compat_legacy_base; #endif diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 7bc7e41b6c31..0ed2c23ed3fb 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -114,7 +114,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype) struct pglist_data; /* - * Add a wild amount of padding here to ensure datas fall into separate + * Add a wild amount of padding here to ensure data fall into separate * cachelines. There are very few zone structures in the machine, so space * consumption is not a concern here. */ diff --git a/mm/memory-failure.c b/mm/memory-failure.c index ee02d8b06839..eefd823deb67 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1340,7 +1340,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, * could potentially call huge_pmd_unshare. Because of * this, take semaphore in write mode here and set * TTU_RMAP_LOCKED to indicate we have taken the lock - * at this higer level. + * at this higher level. 
*/ mapping = hugetlb_page_mapping_lock_write(hpage); if (mapping) { diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 553c52751249..79f6ce92b6b6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -783,7 +783,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z /* * {on,off}lining is constrained to full memory sections (or more - * precisly to memory blocks from the user space POV). + * precisely to memory blocks from the user space POV). * memmap_on_memory is an exception because it reserves initial part * of the physical memory space for vmemmaps. That space is pageblock * aligned. @@ -1580,7 +1580,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages) /* * {on,off}lining is constrained to full memory sections (or more - * precisly to memory blocks from the user space POV). + * precisely to memory blocks from the user space POV). * memmap_on_memory is an exception because it reserves initial part * of the physical memory space for vmemmaps. That space is pageblock * aligned. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eeff64843718..6700cfcfab46 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3180,7 +3180,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus) int cpu; /* - * Allocate in the BSS so we wont require allocation in + * Allocate in the BSS so we won't require allocation in * direct reclaim path for CONFIG_CPUMASK_OFFSTACK=y */ static cpumask_t cpus_with_pcps; diff --git a/mm/swapfile.c b/mm/swapfile.c index e898c879a434..1e07d1c776f2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2967,7 +2967,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p, return 0; } - /* swap partition endianess hack... */ + /* swap partition endianness hack... */ if (swab32(swap_header->info.version) == 1) { swab32s(&swap_header->info.version); swab32s(&swap_header->info.last_page); From f611fab71005af2d726033697e8abda0ee0994e8 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:19 -0700 Subject: [PATCH 114/190] mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages Patch series "Clean W=1 build warnings for mm/". This is a janitorial only. During development of a tool to catch build warnings early to avoid tripping the Intel lkp-robot, I noticed that mm/ is not clean for W=1. This is generally harmless but there is no harm in cleaning it up. It disrupts git blame a little but on relatively obvious lines that are unlikely to be git blame targets. This patch (of 13): make W=1 generates the following warning for vmscan.c mm/vmscan.c:1814: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst It is not a kerneldoc comment and isolate_lru_pages() is a static function. While the detailed comment is nice, it does not need to be exposed via kernel-doc. 
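For reference, the only thing that matters to the tooling is the opening marker: comments starting with '/**' are parsed by scripts/kernel-doc and checked in W=1 builds, while '/*' comments are ignored. A tiny made-up example of the two forms:

	/**
	 * exported_api() - parsed as kernel-doc, so the layout rules apply
	 * @arg: every parameter must be documented
	 */

	/*
	 * static_helper() - ordinary comment; as detailed as you like,
	 * without being exposed through kernel-doc.
	 */
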
Link: https://lkml.kernel.org/r/20210520084809.8576-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20210520084809.8576-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Michal Hocko Cc: David Hildenbrand Cc: Dan Streetman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e1d75e6f9ff4..4620df62f0ff 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1821,7 +1821,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec, } -/** +/* * Isolating page from the lruvec to fill in @dst list by nr_to_scan times. * * lruvec->lru_lock is heavily contended. Some of the functions that From 5da96bdd93ed732685fb511d9889d3f6c5717fad Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:23 -0700 Subject: [PATCH 115/190] mm/vmalloc: include header for prototype of set_iounmap_nonlazy make W=1 generates the following warning for mm/vmalloc.c mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes] void set_iounmap_nonlazy(void) ^~~~~~~~~~~~~~~~~~~ This is an arch-generic function only used by x86. On other arches, it's dead code. Include the header with the definition and make it x86-64 specific. Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 71dd29fe618b..d5cd52805149 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -1607,6 +1608,7 @@ static DEFINE_MUTEX(vmap_purge_lock); /* for per-CPU blocks */ static void purge_fragmented_blocks_allcpus(void); +#ifdef CONFIG_X86_64 /* * called before a call to iounmap() if the caller wants vm_area_struct's * immediately freed. @@ -1615,6 +1617,7 @@ void set_iounmap_nonlazy(void) { atomic_long_set(&vmap_lazy_nr, lazy_max_pages()+1); } +#endif /* CONFIG_X86_64 */ /* * Purges all lazily-freed vmap areas. From f7173090033c70886d925995e9dfdfb76dbb2441 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:25 -0700 Subject: [PATCH 116/190] mm/page_alloc: make should_fail_alloc_page() static make W=1 generates the following warning for mm/page_alloc.c mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes] noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) ^~~~~~~~~~~~~~~~~~~~~~ This function is deliberately split out for BPF to allow errors to be injected. The function is not used anywhere else so it is local to the file. Make it static which should still allow error injection to be used similar to how block/blk-core.c:should_fail_bio() works. 
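For context, error injection only requires the function to be on the error-injection whitelist and visible to kprobes; it does not require external linkage, which is why should_fail_bio() can also be static. One way this is typically wired up, sketched with a made-up function name (not a claim about what this particular patch adds):

	#include <linux/error-injection.h>
	#include <linux/types.h>

	static noinline bool should_fail_foo(unsigned int order)
	{
		return false;	/* return value can be overridden via error injection */
	}
	ALLOW_ERROR_INJECTION(should_fail_foo, TRUE);

The noinline keeps the function as a distinct attach point even after it becomes static.
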
Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6700cfcfab46..2f71d92c45b4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3819,7 +3819,7 @@ static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) #endif /* CONFIG_FAIL_PAGE_ALLOC */ -noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { return __should_fail_alloc_page(gfp_mask, order); } From b417941f3ab1a276255e3ae52ff261dc2e196de7 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:29 -0700 Subject: [PATCH 117/190] mm/mapping_dirty_helpers: remove double Note in kerneldoc make W=1 generates the following warning for mm/mapping_dirty_helpers.c mm/mapping_dirty_helpers.c:325: warning: duplicate section name 'Note' The helper function is very specific to one driver -- vmwgfx. While the two notes are separate, all of it needs to be taken into account when using the helper so make it one note. Link: https://lkml.kernel.org/r/20210520084809.8576-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mapping_dirty_helpers.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c index b890854ec761..ea734f248fce 100644 --- a/mm/mapping_dirty_helpers.c +++ b/mm/mapping_dirty_helpers.c @@ -317,7 +317,7 @@ EXPORT_SYMBOL_GPL(wp_shared_mapping_range); * pfn_mkwrite(). And then after a TLB flush following the write-protection * pick up all dirty bits. * - * Note: This function currently skips transhuge page-table entries, since + * This function currently skips transhuge page-table entries, since * it's intended for dirty-tracking on the PTE level. It will warn on * encountering transhuge dirty entries, though, and can easily be extended * to handle them as well. From 05395718b2fe48eb4970184c3a9f89f6b5e7440f Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:32 -0700 Subject: [PATCH 118/190] mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection make W=1 generates the following warning for mem_cgroup_calculate_protection mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") changed the function definition but not the associated kerneldoc comment. 
Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Chris Down Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 4ee243ce6135..f63399bff018 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6639,7 +6639,7 @@ static unsigned long effective_protection(unsigned long usage, } /** - * mem_cgroup_protected - check if memory consumption is in the normal range + * mem_cgroup_calculate_protection - check if memory consumption is in the normal range * @root: the top ancestor of the sub-tree being checked * @memcg: the memory cgroup to check * From ba2d26660d0e13b3465917022aca78d49e259b59 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:35 -0700 Subject: [PATCH 119/190] mm/memory_hotplug: fix kerneldoc comment for __try_online_node make W=1 generates the following warning for try_online_node mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") renamed the function but did not update the associated kerneldoc. The function is static and somewhat specialised in nature so it's not clear it warrants being a kerneldoc by moving the comment to try_online_node. Hence, leave the comment of the internal helper in place but leave it out of kerneldoc and correct the function name in the comment. Link: https://lkml.kernel.org/r/20210520084809.8576-8-mgorman@techsingularity.net Fixes: Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Mel Gorman Reviewed-by: David Hildenbrand Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 79f6ce92b6b6..970e74a0575f 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -942,8 +942,8 @@ static void rollback_node_hotadd(int nid) } -/** - * try_online_node - online a node if offlined +/* + * __try_online_node - online a node if offlined * @nid: the node ID * @set_node_online: Whether we want to online the node * called by cpu_up() to online a node without onlined memory. From 5640c9ca7ed2e54628938f9d505c969b48e3fa67 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:38 -0700 Subject: [PATCH 120/190] mm/memory_hotplug: fix kerneldoc comment for __remove_memory make W=1 generates the following warning for __remove_memory mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") introduced the kerneldoc comment and function but the kerneldoc name and function name did not match. 
Link: https://lkml.kernel.org/r/20210520084809.8576-9-mgorman@techsingularity.net Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") Signed-off-by: Mel Gorman Reviewed-by: David Hildenbrand Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 970e74a0575f..8cb75b26ea4f 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1908,7 +1908,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) } /** - * remove_memory + * __remove_memory - Remove memory if every memory block is offline * @nid: the node ID * @start: physical address of the region to remove * @size: size of the region to remove From a29a7506600d9511dc872a82a139dcfb71c49640 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:41 -0700 Subject: [PATCH 121/190] mm/zbud: add kerneldoc fields for zbud_pool make W=1 generates the following warning for zbud_pool mm/zbud.c:105: warning: Function parameter or member 'zpool' not described in 'zbud_pool' mm/zbud.c:105: warning: Function parameter or member 'zpool_ops' not described in 'zbud_pool' Commit 479305fd7172 ("zpool: remove zpool_evict()") removed the zpool_evict helper and added the associated zpool and operations structure in struct zbud_pool but did not add documentation for the fields. Add rudimentary documentation. Link: https://lkml.kernel.org/r/20210520084809.8576-10-mgorman@techsingularity.net Fixes: 479305fd7172 ("zpool: remove zpool_evict()") Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/zbud.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/zbud.c b/mm/zbud.c index 48e80a3f7976..6348932430b8 100644 --- a/mm/zbud.c +++ b/mm/zbud.c @@ -92,6 +92,8 @@ struct zbud_ops { * @pages_nr: number of zbud pages in the pool. * @ops: pointer to a structure of user defined operations specified at * pool creation time. + * @zpool: zpool driver + * @zpool_ops: zpool operations structure with an evict callback * * This structure is allocated at pool creation time and maintains metadata * pertaining to a particular zbud pool. From 30522175d222c98f7976e34f6daf076e9f8cc723 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:44 -0700 Subject: [PATCH 122/190] mm/z3fold: add kerneldoc fields for z3fold_pool make W=1 generates the following warning for z3fold_pool mm/z3fold.c:171: warning: Function parameter or member 'zpool' not described in 'z3fold_pool' mm/z3fold.c:171: warning: Function parameter or member 'zpool_ops' not described in 'z3fold_pool' Commit 9a001fc19ccc ("z3fold: the 3-fold allocator for compressed pages") simply did not document the fields at the time. Add rudimentary documentation. 
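For reference, the kernel-doc convention being satisfied in these two patches is one '@member:' line per struct field. A tiny illustrative example (the struct name is made up; only the zpool types are real):

	#include <linux/zpool.h>

	/**
	 * struct foo_pool - example pool descriptor (illustrative only)
	 * @zpool:     zpool driver backing this pool
	 * @zpool_ops: zpool operations structure with an evict callback
	 */
	struct foo_pool {
		struct zpool *zpool;
		const struct zpool_ops *zpool_ops;
	};
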
Link: https://lkml.kernel.org/r/20210520084809.8576-11-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/z3fold.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/z3fold.c b/mm/z3fold.c index a1bc021434a1..b3c0577b8095 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -144,6 +144,8 @@ struct z3fold_header { * @c_handle: cache for z3fold_buddy_slots allocation * @ops: pointer to a structure of user defined operations specified at * pool creation time. + * @zpool: zpool driver + * @zpool_ops: zpool operations structure with an evict callback * @compact_wq: workqueue for page layout background optimization * @release_wq: workqueue for safe page release * @work: work_struct for safe page release From 2bb6a033fb4078f1c528ee575f551064ed738d6f Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:47 -0700 Subject: [PATCH 123/190] mm/swap: make swap_address_space an inline function make W=1 generates the following warning in page_mapping() for allnoconfig mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable] swp_entry_t entry; ^~~~~ swap_address is a #define on !CONFIG_SWAP configurations. Make the helper an inline function to suppress the warning, add type checking and to apply any side-effects in the parameter list. Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 49b1dd2c100b..ac9bd84c905e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -537,7 +537,11 @@ static inline void put_swap_device(struct swap_info_struct *si) { } -#define swap_address_space(entry) (NULL) +static inline struct address_space *swap_address_space(swp_entry_t entry) +{ + return NULL; +} + #define get_nr_swap_pages() 0L #define total_swap_pages 0L #define total_swapcache_pages() 0UL From d01079f3d0c0a9e306ffbdb2694c5281bd9e065e Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:50 -0700 Subject: [PATCH 124/190] mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations make W=1 generates the following warning in mmap_lock.c for allnoconfig mm/mmap_lock.c:213:6: warning: no previous prototype for `__mmap_lock_do_trace_start_locking' [-Wmissing-prototypes] void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write) ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/mmap_lock.c:219:6: warning: no previous prototype for `__mmap_lock_do_trace_acquire_returned' [-Wmissing-prototypes] void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ mm/mmap_lock.c:226:6: warning: no previous prototype for `__mmap_lock_do_trace_released' [-Wmissing-prototypes] void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write) On !CONFIG_TRACING configurations, the code is dead so put it behind an #ifdef. 
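The shape of the change is the usual compile-time guard: the tracepoint helpers, and everything only they use, move inside one conditional block so that allnoconfig builds never see them. A sketch taken from the same helpers (the full movement is in the diff below):

	#ifdef CONFIG_TRACING
	/* helpers referenced only from the mmap_lock tracepoints */
	void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
	{
		TRACE_MMAP_LOCK_EVENT(released, mm, write);
	}
	EXPORT_SYMBOL(__mmap_lock_do_trace_released);
	#endif /* CONFIG_TRACING */
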
[cuibixuan@huawei.com: fix warning when CONFIG_TRACING is not defined] Link: https://lkml.kernel.org/r/20210531033426.74031-1-cuibixuan@huawei.com Link: https://lkml.kernel.org/r/20210520084809.8576-13-mgorman@techsingularity.net Signed-off-by: Mel Gorman Signed-off-by: Bixuan Cui Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap_lock.c | 59 +++++++++++++++++++++++++++----------------------- 1 file changed, 32 insertions(+), 27 deletions(-) diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c index 2ae3f33b85b1..f5852a058ce0 100644 --- a/mm/mmap_lock.c +++ b/mm/mmap_lock.c @@ -153,6 +153,37 @@ static inline void put_memcg_path_buf(void) rcu_read_unlock(); } +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ + do { \ + const char *memcg_path; \ + preempt_disable(); \ + memcg_path = get_mm_memcg_path(mm); \ + trace_mmap_lock_##type(mm, \ + memcg_path != NULL ? memcg_path : "", \ + ##__VA_ARGS__); \ + if (likely(memcg_path != NULL)) \ + put_memcg_path_buf(); \ + preempt_enable(); \ + } while (0) + +#else /* !CONFIG_MEMCG */ + +int trace_mmap_lock_reg(void) +{ + return 0; +} + +void trace_mmap_lock_unreg(void) +{ +} + +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__) + +#endif /* CONFIG_MEMCG */ + +#ifdef CONFIG_TRACING +#ifdef CONFIG_MEMCG /* * Write the given mm_struct's memcg path to a percpu buffer, and return a * pointer to it. If the path cannot be determined, or no buffer was available @@ -187,33 +218,6 @@ static const char *get_mm_memcg_path(struct mm_struct *mm) return buf; } -#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ - do { \ - const char *memcg_path; \ - local_lock(&memcg_paths.lock); \ - memcg_path = get_mm_memcg_path(mm); \ - trace_mmap_lock_##type(mm, \ - memcg_path != NULL ? memcg_path : "", \ - ##__VA_ARGS__); \ - if (likely(memcg_path != NULL)) \ - put_memcg_path_buf(); \ - local_unlock(&memcg_paths.lock); \ - } while (0) - -#else /* !CONFIG_MEMCG */ - -int trace_mmap_lock_reg(void) -{ - return 0; -} - -void trace_mmap_lock_unreg(void) -{ -} - -#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \ - trace_mmap_lock_##type(mm, "", ##__VA_ARGS__) - #endif /* CONFIG_MEMCG */ /* @@ -239,3 +243,4 @@ void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write) TRACE_MMAP_LOCK_EVENT(released, mm, write); } EXPORT_SYMBOL(__mmap_lock_do_trace_released); +#endif /* CONFIG_TRACING */ From ffd8f251f1a61e592aa3146d2c3cfb6a992e80f2 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:53 -0700 Subject: [PATCH 125/190] mm/page_alloc: move prototype for find_suitable_fallback make W=1 generates the following warning in mmap_lock.c for allnoconfig mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes] int find_suitable_fallback(struct free_area *area, unsigned int order, ^~~~~~~~~~~~~~~~~~~~~~ find_suitable_fallback is only shared outside of page_alloc.c for CONFIG_COMPACTION but to suppress the warning, move the protype outside of CONFIG_COMPACTION. It is not worth the effort at this time to find a clever way of allowing compaction.c to share the code or avoid the use entirely as the function is called on relatively slow paths. 
Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index f38bc5d75ab7..2d7c9a2e0118 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -274,11 +274,10 @@ isolate_freepages_range(struct compact_control *cc, int isolate_migratepages_range(struct compact_control *cc, unsigned long low_pfn, unsigned long end_pfn); +#endif int find_suitable_fallback(struct free_area *area, unsigned int order, int migratetype, bool only_stealable, bool *can_steal); -#endif - /* * This function returns the order of a free page in the buddy system. In * general, page_zone(page)->lock must be held by the caller to prevent the From 351de44fde5afc3b0b23294ebf404e78065c2745 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 30 Jun 2021 18:53:56 -0700 Subject: [PATCH 126/190] mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM make W=1 generates the following warning in mm/workingset.c for allnoconfig mm/workingset.c: In function `unpack_shadow': mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable] int memcgid, nid; ^~~ On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing nid. Make the helper an inline function to suppress the warning, add type checking and to apply any side-effects in the parameter list. Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Yang Shi Acked-by: Vlastimil Babka Cc: Dan Streetman Cc: David Hildenbrand Cc: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 0ed2c23ed3fb..fcb535560028 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1064,7 +1064,10 @@ extern char numa_zonelist_order[]; #ifndef CONFIG_NUMA extern struct pglist_data contig_page_data; -#define NODE_DATA(nid) (&contig_page_data) +static inline struct pglist_data *NODE_DATA(int nid) +{ + return &contig_page_data; +} #define NODE_MEM_MAP(nid) mem_map #else /* CONFIG_NUMA */ From 1c2f7d14d84f767a797558609eb034511e02f41e Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 30 Jun 2021 18:53:59 -0700 Subject: [PATCH 127/190] mm/thp: define default pmd_pgtable() Currently most platforms define pmd_pgtable() as pmd_page() duplicating the same code all over. Instead just define a default value i.e pmd_page() for pmd_pgtable() and let platforms override when required via . All the existing platform that override pmd_pgtable() have been moved into their respective header in order to precede before the new generic definition. This makes it much cleaner with reduced code. Link: https://lkml.kernel.org/r/1623646133-20306-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Acked-by: Geert Uytterhoeven Acked-by: Mike Rapoport Cc: Nick Hu Cc: Richard Henderson Cc: Vineet Gupta Cc: Catalin Marinas Cc: Will Deacon Cc: Guo Ren Cc: Brian Cain Cc: Geert Uytterhoeven Cc: Michal Simek Cc: Thomas Bogendoerfer Cc: Ley Foon Tan Cc: Jonas Bonn Cc: Stefan Kristiansson Cc: Stafford Horne Cc: "James E.J. 
Bottomley" Cc: Michael Ellerman Cc: Christophe Leroy Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Heiko Carstens Cc: Yoshinori Sato Cc: "David S. Miller" Cc: Jeff Dike Cc: Thomas Gleixner Cc: Chris Zankel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/alpha/include/asm/pgalloc.h | 1 - arch/arc/include/asm/pgalloc.h | 2 -- arch/arc/include/asm/pgtable.h | 2 ++ arch/arm/include/asm/pgalloc.h | 1 - arch/arm64/include/asm/pgalloc.h | 1 - arch/csky/include/asm/pgalloc.h | 2 -- arch/hexagon/include/asm/pgtable.h | 1 - arch/ia64/include/asm/pgalloc.h | 1 - arch/m68k/include/asm/mcf_pgalloc.h | 2 -- arch/m68k/include/asm/mcf_pgtable.h | 2 ++ arch/m68k/include/asm/motorola_pgalloc.h | 1 - arch/m68k/include/asm/motorola_pgtable.h | 2 ++ arch/m68k/include/asm/sun3_pgalloc.h | 1 - arch/microblaze/include/asm/pgalloc.h | 2 -- arch/mips/include/asm/pgalloc.h | 1 - arch/nds32/include/asm/pgalloc.h | 5 ----- arch/nios2/include/asm/pgalloc.h | 1 - arch/openrisc/include/asm/pgalloc.h | 2 -- arch/parisc/include/asm/pgalloc.h | 1 - arch/powerpc/include/asm/pgalloc.h | 5 ----- arch/powerpc/include/asm/pgtable.h | 6 ++++++ arch/riscv/include/asm/pgalloc.h | 2 -- arch/s390/include/asm/pgalloc.h | 3 --- arch/s390/include/asm/pgtable.h | 3 +++ arch/sh/include/asm/pgalloc.h | 1 - arch/sparc/include/asm/pgalloc_32.h | 1 - arch/sparc/include/asm/pgalloc_64.h | 1 - arch/sparc/include/asm/pgtable_32.h | 2 ++ arch/sparc/include/asm/pgtable_64.h | 2 ++ arch/um/include/asm/pgalloc.h | 1 - arch/x86/include/asm/pgalloc.h | 2 -- arch/xtensa/include/asm/pgalloc.h | 2 -- include/linux/pgtable.h | 9 +++++++++ 33 files changed, 28 insertions(+), 43 deletions(-) diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h index 9c6a24fe493d..68be7adbfe58 100644 --- a/arch/alpha/include/asm/pgalloc.h +++ b/arch/alpha/include/asm/pgalloc.h @@ -18,7 +18,6 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmd, pgtable_t pte) { pmd_set(pmd, (pte_t *)(page_to_pa(pte) + PAGE_OFFSET)); } -#define pmd_pgtable(pmd) pmd_page(pmd) static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h index 6147db925248..a32ca3104ced 100644 --- a/arch/arc/include/asm/pgalloc.h +++ b/arch/arc/include/asm/pgalloc.h @@ -129,6 +129,4 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptep) #define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) -#define pmd_pgtable(pmd) ((pgtable_t) pmd_page_vaddr(pmd)) - #endif /* _ASM_ARC_PGALLOC_H */ diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h index 577682e8cc7f..320cc0ae8a08 100644 --- a/arch/arc/include/asm/pgtable.h +++ b/arch/arc/include/asm/pgtable.h @@ -350,6 +350,8 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, #define kern_addr_valid(addr) (1) +#define pmd_pgtable(pmd) ((pgtable_t) pmd_page_vaddr(pmd)) + /* * remap a physical page `pfn' of size `size' with page protection `prot' * into virtual address `from' diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h index fdee1f04f4f3..a17f01235c29 100644 --- a/arch/arm/include/asm/pgalloc.h +++ b/arch/arm/include/asm/pgalloc.h @@ -143,7 +143,6 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep) __pmd_populate(pmdp, page_to_phys(ptep), prot); } -#define pmd_pgtable(pmd) pmd_page(pmd) #endif /* CONFIG_MMU */ diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h index 31fbab3d6f99..8433a2058eb1 
100644 --- a/arch/arm64/include/asm/pgalloc.h +++ b/arch/arm64/include/asm/pgalloc.h @@ -86,6 +86,5 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmdp, pgtable_t ptep) VM_BUG_ON(mm == &init_mm); __pmd_populate(pmdp, page_to_phys(ptep), PMD_TYPE_TABLE | PMD_TABLE_PXN); } -#define pmd_pgtable(pmd) pmd_page(pmd) #endif diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h index cd211aabbefd..bbbd0698b397 100644 --- a/arch/csky/include/asm/pgalloc.h +++ b/arch/csky/include/asm/pgalloc.h @@ -22,8 +22,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, set_pmd(pmd, __pmd(__pa(page_address(pte)))); } -#define pmd_pgtable(pmd) pmd_page(pmd) - extern void pgd_init(unsigned long *p); static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h index e4979508cddf..18cd6ea9ab23 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -239,7 +239,6 @@ static inline int pmd_bad(pmd_t pmd) * pmd_page - converts a PMD entry to a page pointer */ #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)) -#define pmd_pgtable(pmd) pmd_page(pmd) /** * pte_none - check if pte is mapped diff --git a/arch/ia64/include/asm/pgalloc.h b/arch/ia64/include/asm/pgalloc.h index 9601cfe83c94..0fb2b6291d58 100644 --- a/arch/ia64/include/asm/pgalloc.h +++ b/arch/ia64/include/asm/pgalloc.h @@ -52,7 +52,6 @@ pmd_populate(struct mm_struct *mm, pmd_t * pmd_entry, pgtable_t pte) { pmd_val(*pmd_entry) = page_to_phys(pte); } -#define pmd_pgtable(pmd) pmd_page(pmd) static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t * pmd_entry, pte_t * pte) diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h index bc1228e00518..5c2c0a864524 100644 --- a/arch/m68k/include/asm/mcf_pgalloc.h +++ b/arch/m68k/include/asm/mcf_pgalloc.h @@ -32,8 +32,6 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address) #define pmd_populate_kernel pmd_populate -#define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT) - static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable, unsigned long address) { diff --git a/arch/m68k/include/asm/mcf_pgtable.h b/arch/m68k/include/asm/mcf_pgtable.h index 8d4ec05996c5..6f2b87d7a50d 100644 --- a/arch/m68k/include/asm/mcf_pgtable.h +++ b/arch/m68k/include/asm/mcf_pgtable.h @@ -150,6 +150,8 @@ #ifndef __ASSEMBLY__ +#define pmd_pgtable(pmd) pfn_to_virt(pmd_val(pmd) >> PAGE_SHIFT) + /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. 
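The override-plus-fallback arrangement this patch converges on can be sketched in isolation as follows (a toy program with invented names, not the kernel headers): an architecture header, included first, may define the helper itself; the generic header then supplies the pmd_page()-style default only if nothing was defined earlier.

/* toy.c - gcc -o toy toy.c && ./toy */
#include <stdio.h>

/* "architecture header" section, included first: an arch that needs its own
 * helper simply defines it.  Comment this line out to fall back to the
 * generic default below.
 */
#define toy_pmd_pgtable(pmd)	((pmd) + 1)

/* "generic header" section (the include/linux/pgtable.h hunk later in this
 * patch works the same way): provide a default only when no override exists.
 */
#ifndef toy_pmd_pgtable
#define toy_pmd_pgtable(pmd)	((pmd) * 2)
#endif

int main(void)
{
	printf("%d\n", toy_pmd_pgtable(10));	/* 11 with the override, 20 without */
	return 0;
}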
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h index b4fc3b4f6bb3..74a817d9387f 100644 --- a/arch/m68k/include/asm/motorola_pgalloc.h +++ b/arch/m68k/include/asm/motorola_pgalloc.h @@ -88,7 +88,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, pgtable_t page { pmd_set(pmd, page); } -#define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd)) static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) { diff --git a/arch/m68k/include/asm/motorola_pgtable.h b/arch/m68k/include/asm/motorola_pgtable.h index 8076467eff4b..a2908164ee6f 100644 --- a/arch/m68k/include/asm/motorola_pgtable.h +++ b/arch/m68k/include/asm/motorola_pgtable.h @@ -105,6 +105,8 @@ extern unsigned long mm_cachebits; #define __S110 PAGE_SHARED_C #define __S111 PAGE_SHARED_C +#define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd)) + /* * Conversion functions: convert a page and protection to a page entry, * and a page entry and page directory to the page they refer to. diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h index 000f64869b91..198036aff519 100644 --- a/arch/m68k/include/asm/sun3_pgalloc.h +++ b/arch/m68k/include/asm/sun3_pgalloc.h @@ -32,7 +32,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, pgtable_t page { pmd_val(*pmd) = __pa((unsigned long)page_address(page)); } -#define pmd_pgtable(pmd) pmd_page(pmd) /* * allocating and freeing a pmd is trivial: the 1-entry pmd is diff --git a/arch/microblaze/include/asm/pgalloc.h b/arch/microblaze/include/asm/pgalloc.h index d56b9f670ad1..6c33b05f730f 100644 --- a/arch/microblaze/include/asm/pgalloc.h +++ b/arch/microblaze/include/asm/pgalloc.h @@ -28,8 +28,6 @@ static inline pgd_t *get_pgd(void) #define pgd_alloc(mm) get_pgd() -#define pmd_pgtable(pmd) pmd_page(pmd) - extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm); #define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, (pte)) diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h index 8b18424b3120..dd53d0f79cb3 100644 --- a/arch/mips/include/asm/pgalloc.h +++ b/arch/mips/include/asm/pgalloc.h @@ -28,7 +28,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, { set_pmd(pmd, __pmd((unsigned long)page_address(pte))); } -#define pmd_pgtable(pmd) pmd_page(pmd) /* * Initialize a new pmd table with invalid pointers. diff --git a/arch/nds32/include/asm/pgalloc.h b/arch/nds32/include/asm/pgalloc.h index 85c117347c86..a08e1ebca70e 100644 --- a/arch/nds32/include/asm/pgalloc.h +++ b/arch/nds32/include/asm/pgalloc.h @@ -12,11 +12,6 @@ #define __HAVE_ARCH_PTE_ALLOC_ONE #include /* for pte_{alloc,free}_one */ -/* - * Since we have only two-level page tables, these are trivial - */ -#define pmd_pgtable(pmd) pmd_page(pmd) - extern pgd_t *pgd_alloc(struct mm_struct *mm); extern void pgd_free(struct mm_struct *mm, pgd_t * pgd); diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h index e6600d2a5ae0..3c4ae74d5798 100644 --- a/arch/nios2/include/asm/pgalloc.h +++ b/arch/nios2/include/asm/pgalloc.h @@ -25,7 +25,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, { set_pmd(pmd, __pmd((unsigned long)page_address(pte))); } -#define pmd_pgtable(pmd) pmd_page(pmd) /* * Initialize a new pmd table with invalid pointers. 
diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h index 88820299ecc4..b7b2b8d16fad 100644 --- a/arch/openrisc/include/asm/pgalloc.h +++ b/arch/openrisc/include/asm/pgalloc.h @@ -72,6 +72,4 @@ do { \ tlb_remove_page((tlb), (pte)); \ } while (0) -#define pmd_pgtable(pmd) pmd_page(pmd) - #endif diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h index dda557085311..6a7e98e71f1d 100644 --- a/arch/parisc/include/asm/pgalloc.h +++ b/arch/parisc/include/asm/pgalloc.h @@ -69,6 +69,5 @@ pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte) #define pmd_populate(mm, pmd, pte_page) \ pmd_populate_kernel(mm, pmd, page_address(pte_page)) -#define pmd_pgtable(pmd) pmd_page(pmd) #endif diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h index 6dd78a2dc03a..3360cad78ace 100644 --- a/arch/powerpc/include/asm/pgalloc.h +++ b/arch/powerpc/include/asm/pgalloc.h @@ -70,9 +70,4 @@ extern struct kmem_cache *pgtable_cache[]; #include #endif -static inline pgtable_t pmd_pgtable(pmd_t pmd) -{ - return (pgtable_t)pmd_page_vaddr(pmd); -} - #endif /* _ASM_POWERPC_PGALLOC_H */ diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index c6a676714f04..5969743719bc 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -152,6 +152,12 @@ static inline bool p4d_is_leaf(p4d_t p4d) } #endif +#define pmd_pgtable pmd_pgtable +static inline pgtable_t pmd_pgtable(pmd_t pmd) +{ + return (pgtable_t)pmd_page_vaddr(pmd); +} + #ifdef CONFIG_PPC64 #define is_ioremap_addr is_ioremap_addr static inline bool is_ioremap_addr(const void *x) diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h index 23b1544e0ca5..0af6933a7100 100644 --- a/arch/riscv/include/asm/pgalloc.h +++ b/arch/riscv/include/asm/pgalloc.h @@ -38,8 +38,6 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) } #endif /* __PAGETABLE_PMD_FOLDED */ -#define pmd_pgtable(pmd) pmd_page(pmd) - static inline pgd_t *pgd_alloc(struct mm_struct *mm) { pgd_t *pgd; diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h index 6b187cd72251..f14a555eff74 100644 --- a/arch/s390/include/asm/pgalloc.h +++ b/arch/s390/include/asm/pgalloc.h @@ -134,9 +134,6 @@ static inline void pmd_populate(struct mm_struct *mm, #define pmd_populate_kernel(mm, pmd, pte) pmd_populate(mm, pmd, pte) -#define pmd_pgtable(pmd) \ - ((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE)) - /* * page table entry allocation/free routines. 
*/ diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h index 7c86bf03fc8f..1f8f5da53262 100644 --- a/arch/s390/include/asm/pgtable.h +++ b/arch/s390/include/asm/pgtable.h @@ -1709,4 +1709,7 @@ extern void s390_reset_cmma(struct mm_struct *mm); #define HAVE_ARCH_UNMAPPED_AREA #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN +#define pmd_pgtable(pmd) \ + ((pgtable_t)__va(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE)) + #endif /* _S390_PAGE_H */ diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h index 0e6b0be25e33..a9e98233c4d4 100644 --- a/arch/sh/include/asm/pgalloc.h +++ b/arch/sh/include/asm/pgalloc.h @@ -30,7 +30,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, { set_pmd(pmd, __pmd((unsigned long)page_address(pte))); } -#define pmd_pgtable(pmd) pmd_page(pmd) #define __pte_free_tlb(tlb,pte,addr) \ do { \ diff --git a/arch/sparc/include/asm/pgalloc_32.h b/arch/sparc/include/asm/pgalloc_32.h index 9d353e6dc5a9..4f73e87b22a3 100644 --- a/arch/sparc/include/asm/pgalloc_32.h +++ b/arch/sparc/include/asm/pgalloc_32.h @@ -51,7 +51,6 @@ static inline void free_pmd_fast(pmd_t * pmd) #define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) #define pmd_populate(mm, pmd, pte) pmd_set(pmd, pte) -#define pmd_pgtable(pmd) (pgtable_t)__pmd_page(pmd) void pmd_set(pmd_t *pmdp, pte_t *ptep); #define pmd_populate_kernel pmd_populate diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h index a8dafc550985..7b5561d17ab1 100644 --- a/arch/sparc/include/asm/pgalloc_64.h +++ b/arch/sparc/include/asm/pgalloc_64.h @@ -67,7 +67,6 @@ void pte_free(struct mm_struct *mm, pgtable_t ptepage); #define pmd_populate_kernel(MM, PMD, PTE) pmd_set(MM, PMD, PTE) #define pmd_populate(MM, PMD, PTE) pmd_set(MM, PMD, PTE) -#define pmd_pgtable(PMD) ((pte_t *)pmd_page_vaddr(PMD)) void pgtable_free(void *table, bool is_page); diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index 0888bda245f5..ebaf374b55ab 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -432,4 +432,6 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma, /* We provide our own get_unmapped_area to cope with VA holes for userland */ #define HAVE_ARCH_UNMAPPED_AREA +#define pmd_pgtable(pmd) ((pgtable_t)__pmd_page(pmd)) + #endif /* !(_SPARC_PGTABLE_H) */ diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h index a400e0f23046..e0ee48ec3903 100644 --- a/arch/sparc/include/asm/pgtable_64.h +++ b/arch/sparc/include/asm/pgtable_64.h @@ -1117,6 +1117,8 @@ extern unsigned long cmdline_memory_size; asmlinkage void do_sparc64_fault(struct pt_regs *regs); +#define pmd_pgtable(PMD) ((pte_t *)pmd_page_vaddr(PMD)) + #ifdef CONFIG_HUGETLB_PAGE #define pud_leaf_size pud_leaf_size diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h index 2bbf28cf3aa9..8ec7cd46dd96 100644 --- a/arch/um/include/asm/pgalloc.h +++ b/arch/um/include/asm/pgalloc.h @@ -19,7 +19,6 @@ set_pmd(pmd, __pmd(_PAGE_TABLE + \ ((unsigned long long)page_to_pfn(pte) << \ (unsigned long long) PAGE_SHIFT))) -#define pmd_pgtable(pmd) pmd_page(pmd) /* * Allocate and free page tables. 
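One detail worth noting from the powerpc hunk above: when the override is a real function rather than a macro, the usual kernel idiom is to add a self-referential #define so that the generic header's #ifndef test can still see that an override exists. A stand-alone sketch with invented names (not the kernel headers):

/* toy.c - gcc -o toy toy.c && ./toy */
#include <stdio.h>

/* "arch" section: a function-style override plus the marker define that
 * makes it discoverable by the preprocessor.
 */
#define toy_helper toy_helper
static inline int toy_helper(int x)
{
	return x + 1;			/* arch-specific behaviour */
}

/* "generic" section: skipped here because the marker above is defined */
#ifndef toy_helper
#define toy_helper(x)	((x) * 2)
#endif

int main(void)
{
	printf("%d\n", toy_helper(10));	/* prints 11: the override is in effect */
	return 0;
}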
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h index 62ad61d6fefc..c7ec5bb88334 100644 --- a/arch/x86/include/asm/pgalloc.h +++ b/arch/x86/include/asm/pgalloc.h @@ -84,8 +84,6 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE)); } -#define pmd_pgtable(pmd) pmd_page(pmd) - #if CONFIG_PGTABLE_LEVELS > 2 extern void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd); diff --git a/arch/xtensa/include/asm/pgalloc.h b/arch/xtensa/include/asm/pgalloc.h index d3a22da4d2c9..eeb2de3a89e5 100644 --- a/arch/xtensa/include/asm/pgalloc.h +++ b/arch/xtensa/include/asm/pgalloc.h @@ -25,7 +25,6 @@ (pmd_val(*(pmdp)) = ((unsigned long)ptep)) #define pmd_populate(mm, pmdp, page) \ (pmd_val(*(pmdp)) = ((unsigned long)page_to_virt(page))) -#define pmd_pgtable(pmd) pmd_page(pmd) static inline pgd_t* pgd_alloc(struct mm_struct *mm) @@ -63,7 +62,6 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm) return page; } -#define pmd_pgtable(pmd) pmd_page(pmd) #endif /* CONFIG_MMU */ #endif /* _XTENSA_PGALLOC_H */ diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 69700e3e615f..e82660f7b9e4 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -37,6 +37,15 @@ #define FIRST_USER_ADDRESS 0UL #endif +/* + * This defines the generic helper for accessing PMD page + * table page. Although platforms can still override this + * via their respective . + */ +#ifndef pmd_pgtable +#define pmd_pgtable(pmd) pmd_page(pmd) +#endif + /* * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD] * From ff06e45d3aace3f93d23956c1e655224f363ebe2 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Wed, 30 Jun 2021 18:54:03 -0700 Subject: [PATCH 128/190] kfence: unconditionally use unbound work queue Unconditionally use unbound work queue, and not just if wq_power_efficient is true. Because if the system is idle, KFENCE may wait, and by being run on the unbound work queue, we permit the scheduler to make better scheduling decisions and not require pinning KFENCE to the same CPU upon waking up. Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work") Signed-off-by: Marco Elver Reported-by: Hillf Danton Reviewed-by: Alexander Potapenko Cc: Dmitry Vyukov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/kfence/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 4d21ac44d5d3..d7666ace9d2e 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -636,7 +636,7 @@ static void toggle_allocation_gate(struct work_struct *work) /* Disable static key and reset timer. 
*/ static_branch_disable(&kfence_allocation_key); #endif - queue_delayed_work(system_power_efficient_wq, &kfence_timer, + queue_delayed_work(system_unbound_wq, &kfence_timer, msecs_to_jiffies(kfence_sample_interval)); } static DECLARE_DELAYED_WORK(kfence_timer, toggle_allocation_gate); @@ -666,7 +666,7 @@ void __init kfence_init(void) } WRITE_ONCE(kfence_enabled, true); - queue_delayed_work(system_power_efficient_wq, &kfence_timer, 0); + queue_delayed_work(system_unbound_wq, &kfence_timer, 0); pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE, CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool, (void *)(__kfence_pool + KFENCE_POOL_SIZE)); From af5cdaf82238fb3637a0d0fff4670e5be71c611c Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:06 -0700 Subject: [PATCH 129/190] mm: remove special swap entry functions Patch series "Add support for SVM atomics in Nouveau", v11. Introduction ============ Some devices have features such as atomic PTE bits that can be used to implement atomic access to system memory. To support atomic operations to a shared virtual memory page such a device needs access to that page which is exclusive of the CPU. This series introduces a mechanism to temporarily unmap pages granting exclusive access to a device. These changes are required to support OpenCL atomic operations in Nouveau to shared virtual memory (SVM) regions allocated with the CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the OpenCL SVM feature is available at https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/ OpenCL_API.html#_shared_virtual_memory . Implementation ============== Exclusive device access is implemented by adding a new swap entry type (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main difference is that on fault the original entry is immediately restored by the fault handler instead of waiting. Restoring the entry triggers calls to MMU notifers which allows a device driver to revoke the atomic access permission from the GPU prior to the CPU finalising the entry. Patches ======= Patches 1 & 2 refactor existing migration and device private entry functions. Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated functionality into separate functions - try_to_migrate_one() and try_to_munlock_one(). Patch 5 renames some existing code but does not introduce functionality. Patch 6 is a small clean-up to swap entry handling in copy_pte_range(). Patch 7 contains the bulk of the implementation for device exclusive memory. Patch 8 contains some additions to the HMM selftests to ensure everything works as expected. Patch 9 is a cleanup for the Nouveau SVM implementation. Patch 10 contains the implementation of atomic access for the Nouveau driver. Testing ======= This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program which checks that GPU atomic accesses to system memory are atomic. Without this series the test fails as there is no way of write-protecting the page mapping which results in the device clobbering CPU writes. For reference the test is available at https://ozlabs.org/~apopple/opencl_svm_atomics/ Further testing has been performed by adding support for testing exclusive access to the hmm-tests kselftests. This patch (of 10): Remove multiple similar inline functions for dealing with different types of special swap entries. Both migration and device private swap entries use the swap offset to store a pfn. 
Instead of multiple inline functions to obtain a struct page for each swap entry type use a common function pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn() functions as this results is shorter code that is easier to understand. Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Ralph Campbell Reviewed-by: Christoph Hellwig Cc: "Matthew Wilcox (Oracle)" Cc: Hugh Dickins Cc: Peter Xu Cc: Shakeel Butt Cc: Ben Skeggs Cc: Jason Gunthorpe Cc: John Hubbard Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/s390/mm/pgtable.c | 2 +- fs/proc/task_mmu.c | 23 +++++--------- include/linux/swap.h | 4 +-- include/linux/swapops.h | 69 ++++++++++++++--------------------------- mm/hmm.c | 5 ++- mm/huge_memory.c | 6 ++-- mm/memcontrol.c | 2 +- mm/memory.c | 10 +++--- mm/migrate.c | 6 ++-- mm/page_vma_mapped.c | 6 ++-- 10 files changed, 51 insertions(+), 82 deletions(-) diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c index 18205f851c24..eec3a9d7176e 100644 --- a/arch/s390/mm/pgtable.c +++ b/arch/s390/mm/pgtable.c @@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry) if (!non_swap_entry(entry)) dec_mm_counter(mm, MM_SWAPENTS); else if (is_migration_entry(entry)) { - struct page *page = migration_entry_to_page(entry); + struct page *page = pfn_swap_entry_to_page(entry); dec_mm_counter(mm, mm_counter(page)); } diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 95c8f1e8fea6..eb97468dfe4c 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, } else { mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT; } - } else if (is_migration_entry(swpent)) - page = migration_entry_to_page(swpent); - else if (is_device_private_entry(swpent)) - page = device_private_entry_to_page(swpent); + } else if (is_pfn_swap_entry(swpent)) + page = pfn_swap_entry_to_page(swpent); } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap && pte_none(*pte))) { page = xa_load(&vma->vm_file->f_mapping->i_pages, @@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, swp_entry_t entry = pmd_to_swp_entry(*pmd); if (is_migration_entry(entry)) - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); } if (IS_ERR_OR_NULL(page)) return; @@ -694,10 +692,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask, } else if (is_swap_pte(*pte)) { swp_entry_t swpent = pte_to_swp_entry(*pte); - if (is_migration_entry(swpent)) - page = migration_entry_to_page(swpent); - else if (is_device_private_entry(swpent)) - page = device_private_entry_to_page(swpent); + if (is_pfn_swap_entry(swpent)) + page = pfn_swap_entry_to_page(swpent); } if (page) { int mapcount = page_mapcount(page); @@ -1389,11 +1385,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm, frame = swp_type(entry) | (swp_offset(entry) << MAX_SWAPFILES_SHIFT); flags |= PM_SWAP; - if (is_migration_entry(entry)) - page = migration_entry_to_page(entry); - - if (is_device_private_entry(entry)) - page = device_private_entry_to_page(entry); + if (is_pfn_swap_entry(entry)) + page = pfn_swap_entry_to_page(entry); } if (page && !PageAnon(page)) @@ -1454,7 +1447,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end, if (pmd_swp_uffd_wp(pmd)) flags |= PM_UFFD_WP; 
VM_BUG_ON(!is_pmd_migration_entry(pmd)); - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); } #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index ac9bd84c905e..df7cbb6b3d3e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -564,8 +564,8 @@ static inline void show_swap_cache_info(void) { } -#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) -#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));}) +/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */ +#define free_swap_and_cache(e) is_pfn_swap_entry(e) static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) { diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 708fbeb21dd3..c24c79812bc1 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -128,16 +128,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_DEVICE_WRITE); } - -static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry) -{ - return swp_offset(entry); -} - -static inline struct page *device_private_entry_to_page(swp_entry_t entry) -{ - return pfn_to_page(swp_offset(entry)); -} #else /* CONFIG_DEVICE_PRIVATE */ static inline swp_entry_t make_device_private_entry(struct page *page, bool write) { @@ -157,16 +147,6 @@ static inline bool is_write_device_private_entry(swp_entry_t entry) { return false; } - -static inline unsigned long device_private_entry_to_pfn(swp_entry_t entry) -{ - return 0; -} - -static inline struct page *device_private_entry_to_page(swp_entry_t entry) -{ - return NULL; -} #endif /* CONFIG_DEVICE_PRIVATE */ #ifdef CONFIG_MIGRATION @@ -189,22 +169,6 @@ static inline int is_write_migration_entry(swp_entry_t entry) return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); } -static inline unsigned long migration_entry_to_pfn(swp_entry_t entry) -{ - return swp_offset(entry); -} - -static inline struct page *migration_entry_to_page(swp_entry_t entry) -{ - struct page *p = pfn_to_page(swp_offset(entry)); - /* - * Any use of migration entries may only occur while the - * corresponding page is locked - */ - BUG_ON(!PageLocked(compound_head(p))); - return p; -} - static inline void make_migration_entry_read(swp_entry_t *entry) { *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry)); @@ -224,16 +188,6 @@ static inline int is_migration_entry(swp_entry_t swp) return 0; } -static inline unsigned long migration_entry_to_pfn(swp_entry_t entry) -{ - return 0; -} - -static inline struct page *migration_entry_to_page(swp_entry_t entry) -{ - return NULL; -} - static inline void make_migration_entry_read(swp_entry_t *entryp) { } static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl) { } @@ -248,6 +202,29 @@ static inline int is_write_migration_entry(swp_entry_t entry) #endif +static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) +{ + struct page *p = pfn_to_page(swp_offset(entry)); + + /* + * Any use of migration entries may only occur while the + * corresponding page is locked + */ + BUG_ON(is_migration_entry(entry) && !PageLocked(p)); + + return p; +} + +/* + * A pfn swap entry is a special type of swap entry that always has a pfn stored + * in the swap offset. They are used to represent unaddressable device memory + * and to restrict access to a page undergoing migration. 
+ */ +static inline bool is_pfn_swap_entry(swp_entry_t entry) +{ + return is_migration_entry(entry) || is_device_private_entry(entry); +} + struct page_vma_mapped_walk; #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION diff --git a/mm/hmm.c b/mm/hmm.c index 943cb2ba4442..3b2dda71d0ed 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct hmm_range *range, swp_entry_t entry) { return is_device_private_entry(entry) && - device_private_entry_to_page(entry)->pgmap->owner == + pfn_swap_entry_to_page(entry)->pgmap->owner == range->dev_private_owner; } @@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, cpu_flags = HMM_PFN_VALID; if (is_write_device_private_entry(entry)) cpu_flags |= HMM_PFN_WRITE; - *hmm_pfn = device_private_entry_to_pfn(entry) | - cpu_flags; + *hmm_pfn = swp_offset(entry) | cpu_flags; return 0; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d513b0cd1161..327b8d9d8d2f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1643,7 +1643,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, VM_BUG_ON(!is_pmd_migration_entry(orig_pmd)); entry = pmd_to_swp_entry(orig_pmd); - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); flush_needed = 0; } else WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!"); @@ -2012,7 +2012,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry; entry = pmd_to_swp_entry(old_pmd); - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); } else { page = pmd_page(old_pmd); if (!PageDirty(page) && pmd_dirty(old_pmd)) @@ -2066,7 +2066,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry; entry = pmd_to_swp_entry(old_pmd); - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); write = is_write_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f63399bff018..b826dad6fa36 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5532,7 +5532,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma, * as special swap entry in the CPU page table. 
*/ if (is_device_private_entry(ent)) { - page = device_private_entry_to_page(ent); + page = pfn_swap_entry_to_page(ent); /* * MEMORY_DEVICE_PRIVATE means ZONE_DEVICE page and which have * a refcount of 1 when free (unlike normal page) diff --git a/mm/memory.c b/mm/memory.c index 64eda960b75e..6723931085c7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -729,7 +729,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, } rss[MM_SWAPENTS]++; } else if (is_migration_entry(entry)) { - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); rss[mm_counter(page)]++; @@ -748,7 +748,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, set_pte_at(src_mm, addr, src_pte, pte); } } else if (is_device_private_entry(entry)) { - page = device_private_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); /* * Update rss count even for unaddressable pages, as @@ -1280,7 +1280,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, entry = pte_to_swp_entry(ptent); if (is_device_private_entry(entry)) { - struct page *page = device_private_entry_to_page(entry); + struct page *page = pfn_swap_entry_to_page(entry); if (unlikely(details && details->check_mapping)) { /* @@ -1309,7 +1309,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, else if (is_migration_entry(entry)) { struct page *page; - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); rss[mm_counter(page)]--; } if (unlikely(!free_swap_and_cache(entry))) @@ -3372,7 +3372,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address); } else if (is_device_private_entry(entry)) { - vmf->page = device_private_entry_to_page(entry); + vmf->page = pfn_swap_entry_to_page(entry); ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); } else if (is_hwpoison_entry(entry)) { ret = VM_FAULT_HWPOISON; diff --git a/mm/migrate.c b/mm/migrate.c index 8810c1421f5d..b4abb87249e1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -296,7 +296,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, if (!is_migration_entry(entry)) goto out; - page = migration_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); page = compound_head(page); /* @@ -337,7 +337,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd) ptl = pmd_lock(mm, pmd); if (!is_pmd_migration_entry(*pmd)) goto unlock; - page = migration_entry_to_page(pmd_to_swp_entry(*pmd)); + page = pfn_swap_entry_to_page(pmd_to_swp_entry(*pmd)); if (!get_page_unless_zero(page)) goto unlock; spin_unlock(ptl); @@ -2289,7 +2289,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (!is_device_private_entry(entry)) goto next; - page = device_private_entry_to_page(entry); + page = pfn_swap_entry_to_page(entry); if (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) || page->pgmap->owner != migrate->pgmap_owner) diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index a4435311754b..4df6093e759b 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) if (!is_migration_entry(entry)) return false; - pfn = migration_entry_to_pfn(entry); + pfn = swp_offset(entry); } else if (is_swap_pte(*pvmw->pte)) { swp_entry_t entry; @@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) if (!is_device_private_entry(entry)) return false; - pfn = device_private_entry_to_pfn(entry); + pfn = swp_offset(entry); } else { if (!pte_present(*pvmw->pte)) 
return false; @@ -233,7 +233,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) return not_found(pvmw); entry = pmd_to_swp_entry(pmde); if (!is_migration_entry(entry) || - migration_entry_to_page(entry) != page) + pfn_swap_entry_to_page(entry) != page) return not_found(pvmw); return true; } From 4dd845b5a3e57ad07f26ef808707b064696fe34b Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:09 -0700 Subject: [PATCH 130/190] mm/swapops: rework swap entry manipulation code Both migration and device private pages use special swap entries that are manipluated by a range of inline functions. The arguments to these are somewhat inconsistent so rework them to remove flag type arguments and to make the arguments similar for both read and write entry creation. Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe Reviewed-by: Ralph Campbell Cc: Ben Skeggs Cc: Hugh Dickins Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swapops.h | 56 ++++++++++++++++++++++------------------- mm/debug_vm_pgtable.c | 12 ++++----- mm/hmm.c | 2 +- mm/huge_memory.c | 26 +++++++++++++------ mm/hugetlb.c | 10 +++++--- mm/memory.c | 10 +++++--- mm/migrate.c | 26 ++++++++++++++----- mm/mprotect.c | 10 +++++--- mm/rmap.c | 10 +++++--- 9 files changed, 100 insertions(+), 62 deletions(-) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index c24c79812bc1..04d76357aa0c 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -107,10 +107,14 @@ static inline void *swp_to_radix_entry(swp_entry_t entry) } #if IS_ENABLED(CONFIG_DEVICE_PRIVATE) -static inline swp_entry_t make_device_private_entry(struct page *page, bool write) +static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) { - return swp_entry(write ? 
SWP_DEVICE_WRITE : SWP_DEVICE_READ, - page_to_pfn(page)); + return swp_entry(SWP_DEVICE_READ, offset); +} + +static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset) +{ + return swp_entry(SWP_DEVICE_WRITE, offset); } static inline bool is_device_private_entry(swp_entry_t entry) @@ -119,23 +123,19 @@ static inline bool is_device_private_entry(swp_entry_t entry) return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE; } -static inline void make_device_private_entry_read(swp_entry_t *entry) -{ - *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry)); -} - -static inline bool is_write_device_private_entry(swp_entry_t entry) +static inline bool is_writable_device_private_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_DEVICE_WRITE); } #else /* CONFIG_DEVICE_PRIVATE */ -static inline swp_entry_t make_device_private_entry(struct page *page, bool write) +static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) { return swp_entry(0, 0); } -static inline void make_device_private_entry_read(swp_entry_t *entry) +static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset) { + return swp_entry(0, 0); } static inline bool is_device_private_entry(swp_entry_t entry) @@ -143,35 +143,32 @@ static inline bool is_device_private_entry(swp_entry_t entry) return false; } -static inline bool is_write_device_private_entry(swp_entry_t entry) +static inline bool is_writable_device_private_entry(swp_entry_t entry) { return false; } #endif /* CONFIG_DEVICE_PRIVATE */ #ifdef CONFIG_MIGRATION -static inline swp_entry_t make_migration_entry(struct page *page, int write) -{ - BUG_ON(!PageLocked(compound_head(page))); - - return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ, - page_to_pfn(page)); -} - static inline int is_migration_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_MIGRATION_READ || swp_type(entry) == SWP_MIGRATION_WRITE); } -static inline int is_write_migration_entry(swp_entry_t entry) +static inline int is_writable_migration_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE); } -static inline void make_migration_entry_read(swp_entry_t *entry) +static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { - *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry)); + return swp_entry(SWP_MIGRATION_READ, offset); +} + +static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) +{ + return swp_entry(SWP_MIGRATION_WRITE, offset); } extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, @@ -181,21 +178,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, extern void migration_entry_wait_huge(struct vm_area_struct *vma, struct mm_struct *mm, pte_t *pte); #else +static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + +static inline swp_entry_t make_writable_migration_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} -#define make_migration_entry(page, write) swp_entry(0, 0) static inline int is_migration_entry(swp_entry_t swp) { return 0; } -static inline void make_migration_entry_read(swp_entry_t *entryp) { } static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl) { } static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { } static inline void migration_entry_wait_huge(struct vm_area_struct *vma, struct mm_struct *mm, pte_t *pte) { } -static inline int 
is_write_migration_entry(swp_entry_t entry) +static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; } diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index f7b23565a04f..1c922691aa61 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -843,17 +843,17 @@ static void __init swap_migration_tests(void) * locked, otherwise it stumbles upon a BUG_ON(). */ __SetPageLocked(page); - swp = make_migration_entry(page, 1); + swp = make_writable_migration_entry(page_to_pfn(page)); WARN_ON(!is_migration_entry(swp)); - WARN_ON(!is_write_migration_entry(swp)); + WARN_ON(!is_writable_migration_entry(swp)); - make_migration_entry_read(&swp); + swp = make_readable_migration_entry(swp_offset(swp)); WARN_ON(!is_migration_entry(swp)); - WARN_ON(is_write_migration_entry(swp)); + WARN_ON(is_writable_migration_entry(swp)); - swp = make_migration_entry(page, 0); + swp = make_readable_migration_entry(page_to_pfn(page)); WARN_ON(!is_migration_entry(swp)); - WARN_ON(is_write_migration_entry(swp)); + WARN_ON(is_writable_migration_entry(swp)); __ClearPageLocked(page); __free_page(page); } diff --git a/mm/hmm.c b/mm/hmm.c index 3b2dda71d0ed..11df3ca30b82 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -255,7 +255,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, */ if (hmm_is_device_private_entry(range, entry)) { cpu_flags = HMM_PFN_VALID; - if (is_write_device_private_entry(entry)) + if (is_writable_device_private_entry(entry)) cpu_flags |= HMM_PFN_WRITE; *hmm_pfn = swp_offset(entry) | cpu_flags; return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 327b8d9d8d2f..1c81aa11d618 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1054,8 +1054,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, swp_entry_t entry = pmd_to_swp_entry(pmd); VM_BUG_ON(!is_pmd_migration_entry(pmd)); - if (is_write_migration_entry(entry)) { - make_migration_entry_read(&entry); + if (is_writable_migration_entry(entry)) { + entry = make_readable_migration_entry( + swp_offset(entry)); pmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*src_pmd)) pmd = pmd_swp_mksoft_dirty(pmd); @@ -1772,13 +1773,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry = pmd_to_swp_entry(*pmd); VM_BUG_ON(!is_pmd_migration_entry(*pmd)); - if (is_write_migration_entry(entry)) { + if (is_writable_migration_entry(entry)) { pmd_t newpmd; /* * A protection check is difficult so * just be safe and disable write */ - make_migration_entry_read(&entry); + entry = make_readable_migration_entry( + swp_offset(entry)); newpmd = swp_entry_to_pmd(entry); if (pmd_swp_soft_dirty(*pmd)) newpmd = pmd_swp_mksoft_dirty(newpmd); @@ -2067,7 +2069,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, entry = pmd_to_swp_entry(old_pmd); page = pfn_swap_entry_to_page(entry); - write = is_write_migration_entry(entry); + write = is_writable_migration_entry(entry); young = false; soft_dirty = pmd_swp_soft_dirty(old_pmd); uffd_wp = pmd_swp_uffd_wp(old_pmd); @@ -2099,7 +2101,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, */ if (freeze || pmd_migration) { swp_entry_t swp_entry; - swp_entry = make_migration_entry(page + i, write); + if (write) + swp_entry = make_writable_migration_entry( + page_to_pfn(page + i)); + else + swp_entry = make_readable_migration_entry( + page_to_pfn(page + i)); entry = swp_entry_to_pte(swp_entry); if (soft_dirty) entry = pte_swp_mksoft_dirty(entry); @@ -3171,7 +3178,10 @@ void 
set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw, pmdval = pmdp_invalidate(vma, address, pvmw->pmd); if (pmd_dirty(pmdval)) set_page_dirty(page); - entry = make_migration_entry(page, pmd_write(pmdval)); + if (pmd_write(pmdval)) + entry = make_writable_migration_entry(page_to_pfn(page)); + else + entry = make_readable_migration_entry(page_to_pfn(page)); pmdswp = swp_entry_to_pmd(entry); if (pmd_soft_dirty(pmdval)) pmdswp = pmd_swp_mksoft_dirty(pmdswp); @@ -3197,7 +3207,7 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new) pmde = pmd_mkold(mk_huge_pmd(new, vma->vm_page_prot)); if (pmd_swp_soft_dirty(*pvmw->pmd)) pmde = pmd_mksoft_dirty(pmde); - if (is_write_migration_entry(entry)) + if (is_writable_migration_entry(entry)) pmde = maybe_pmd_mkwrite(pmde, vma); if (pmd_swp_uffd_wp(*pvmw->pmd)) pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde)); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 89ba5147206e..924553aa8f78 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4242,12 +4242,13 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, is_hugetlb_entry_hwpoisoned(entry))) { swp_entry_t swp_entry = pte_to_swp_entry(entry); - if (is_write_migration_entry(swp_entry) && cow) { + if (is_writable_migration_entry(swp_entry) && cow) { /* * COW mappings require pages in both * parent and child to be set to read. */ - make_migration_entry_read(&swp_entry); + swp_entry = make_readable_migration_entry( + swp_offset(swp_entry)); entry = swp_entry_to_pte(swp_entry); set_huge_swap_pte_at(src, addr, src_pte, entry, sz); @@ -5532,10 +5533,11 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, if (unlikely(is_hugetlb_entry_migration(pte))) { swp_entry_t entry = pte_to_swp_entry(pte); - if (is_write_migration_entry(entry)) { + if (is_writable_migration_entry(entry)) { pte_t newpte; - make_migration_entry_read(&entry); + entry = make_readable_migration_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); set_huge_swap_pte_at(mm, address, ptep, newpte, huge_page_size(h)); diff --git a/mm/memory.c b/mm/memory.c index 6723931085c7..e30488e9202f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -733,13 +733,14 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, rss[mm_counter(page)]++; - if (is_write_migration_entry(entry) && + if (is_writable_migration_entry(entry) && is_cow_mapping(vm_flags)) { /* * COW mappings require pages in both * parent and child to be set to read. */ - make_migration_entry_read(&entry); + entry = make_readable_migration_entry( + swp_offset(entry)); pte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(*src_pte)) pte = pte_swp_mksoft_dirty(pte); @@ -770,9 +771,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, * when a device driver is involved (you cannot easily * save and restore device driver state). 
*/ - if (is_write_device_private_entry(entry) && + if (is_writable_device_private_entry(entry) && is_cow_mapping(vm_flags)) { - make_device_private_entry_read(&entry); + entry = make_readable_device_private_entry( + swp_offset(entry)); pte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(*src_pte)) pte = pte_swp_mkuffd_wp(pte); diff --git a/mm/migrate.c b/mm/migrate.c index b4abb87249e1..de47bdfbd2f8 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -210,13 +210,18 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma, * Recheck VMA as permissions can change since migration started */ entry = pte_to_swp_entry(*pvmw.pte); - if (is_write_migration_entry(entry)) + if (is_writable_migration_entry(entry)) pte = maybe_mkwrite(pte, vma); else if (pte_swp_uffd_wp(*pvmw.pte)) pte = pte_mkuffd_wp(pte); if (unlikely(is_device_private_page(new))) { - entry = make_device_private_entry(new, pte_write(pte)); + if (pte_write(pte)) + entry = make_writable_device_private_entry( + page_to_pfn(new)); + else + entry = make_readable_device_private_entry( + page_to_pfn(new)); pte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(*pvmw.pte)) pte = pte_swp_mksoft_dirty(pte); @@ -2297,7 +2302,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, mpfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE; - if (is_write_device_private_entry(entry)) + if (is_writable_device_private_entry(entry)) mpfn |= MIGRATE_PFN_WRITE; } else { if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) @@ -2343,8 +2348,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, ptep_get_and_clear(mm, addr, ptep); /* Setup special migration page table entry */ - entry = make_migration_entry(page, mpfn & - MIGRATE_PFN_WRITE); + if (mpfn & MIGRATE_PFN_WRITE) + entry = make_writable_migration_entry( + page_to_pfn(page)); + else + entry = make_readable_migration_entry( + page_to_pfn(page)); swp_pte = swp_entry_to_pte(entry); if (pte_present(pte)) { if (pte_soft_dirty(pte)) @@ -2817,7 +2826,12 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, if (is_device_private_page(page)) { swp_entry_t swp_entry; - swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE); + if (vma->vm_flags & VM_WRITE) + swp_entry = make_writable_device_private_entry( + page_to_pfn(page)); + else + swp_entry = make_readable_device_private_entry( + page_to_pfn(page)); entry = swp_entry_to_pte(swp_entry); } else { /* diff --git a/mm/mprotect.c b/mm/mprotect.c index e7a443157988..ee5961888e70 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -143,23 +143,25 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, swp_entry_t entry = pte_to_swp_entry(oldpte); pte_t newpte; - if (is_write_migration_entry(entry)) { + if (is_writable_migration_entry(entry)) { /* * A protection check is difficult so * just be safe and disable write */ - make_migration_entry_read(&entry); + entry = make_readable_migration_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); if (pte_swp_soft_dirty(oldpte)) newpte = pte_swp_mksoft_dirty(newpte); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); - } else if (is_write_device_private_entry(entry)) { + } else if (is_writable_device_private_entry(entry)) { /* * We do not preserve soft-dirtiness. See * copy_one_pte() for explanation. 
*/ - make_device_private_entry_read(&entry); + entry = make_readable_device_private_entry( + swp_offset(entry)); newpte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); diff --git a/mm/rmap.c b/mm/rmap.c index f9fd5bc54f0a..b9986c8db524 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1533,7 +1533,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * pte. do_swap_page() will wait until the migration * pte is removed and then restart fault handling. */ - entry = make_migration_entry(page, 0); + entry = make_readable_migration_entry(page_to_pfn(page)); swp_pte = swp_entry_to_pte(entry); /* @@ -1629,8 +1629,12 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * pte. do_swap_page() will wait until the migration * pte is removed and then restart fault handling. */ - entry = make_migration_entry(subpage, - pte_write(pteval)); + if (pte_write(pteval)) + entry = make_writable_migration_entry( + page_to_pfn(subpage)); + else + entry = make_readable_migration_entry( + page_to_pfn(subpage)); swp_pte = swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte = pte_swp_mksoft_dirty(swp_pte); From cd62734ca60dbb2ab5bb19c8d837dd9990955310 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:12 -0700 Subject: [PATCH 131/190] mm/rmap: split try_to_munlock from try_to_unmap The behaviour of try_to_unmap_one() is difficult to follow because it performs different operations based on a fairly large set of flags used in different combinations. TTU_MUNLOCK is one such flag. However it is exclusively used by try_to_munlock() which specifies no other flags. Therefore rather than overload try_to_unmap_one() with unrelated behaviour split this out into it's own function and remove the flag. Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Ralph Campbell Reviewed-by: Christoph Hellwig Cc: Ben Skeggs Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.rst | 33 ++++++-------- include/linux/rmap.h | 3 +- mm/mlock.c | 12 ++--- mm/rmap.c | 68 ++++++++++++++++++++-------- 4 files changed, 70 insertions(+), 46 deletions(-) diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index 0e1490524f53..eae3af17f2d9 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -389,14 +389,14 @@ mlocked, munlock_vma_page() updates that zone statistics for the number of mlocked pages. Note, however, that at this point we haven't checked whether the page is mapped by other VM_LOCKED VMAs. -We can't call try_to_munlock(), the function that walks the reverse map to +We can't call page_mlock(), the function that walks the reverse map to check for other VM_LOCKED VMAs, without first isolating the page from the LRU. -try_to_munlock() is a variant of try_to_unmap() and thus requires that the page +page_mlock() is a variant of try_to_unmap() and thus requires that the page not be on an LRU list [more on these below]. However, the call to -isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So, +isolate_lru_page() could fail, in which case we can't call page_mlock(). So, we go ahead and clear PG_mlocked up front, as this might be the only chance we -have. 
If we can successfully isolate the page, we go ahead and -try_to_munlock(), which will restore the PG_mlocked flag and update the zone +have. If we can successfully isolate the page, we go ahead and call +page_mlock(), which will restore the PG_mlocked flag and update the zone page statistics if it finds another VMA holding the page mlocked. If we fail to isolate the page, we'll have left a potentially mlocked page on the LRU. This is fine, because we'll catch it later if and if vmscan tries to reclaim @@ -545,31 +545,24 @@ munlock or munmap system calls, mm teardown (munlock_vma_pages_all), reclaim, holepunching, and truncation of file pages and their anonymous COWed pages. -try_to_munlock() Reverse Map Scan +page_mlock() Reverse Map Scan --------------------------------- -.. warning:: - [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the - page_referenced() reverse map walker. - When munlock_vma_page() [see section :ref:`munlock()/munlockall() System Call Handling ` above] tries to munlock a page, it needs to determine whether or not the page is mapped by any VM_LOCKED VMA without actually attempting to unmap all PTEs from the page. For this purpose, the unevictable/mlock infrastructure -introduced a variant of try_to_unmap() called try_to_munlock(). +introduced a variant of try_to_unmap() called page_mlock(). -try_to_munlock() calls the same functions as try_to_unmap() for anonymous and -mapped file and KSM pages with a flag argument specifying unlock versus unmap -processing. Again, these functions walk the respective reverse maps looking -for VM_LOCKED VMAs. When such a VMA is found, as in the try_to_unmap() case, -the functions mlock the page via mlock_vma_page() and return SWAP_MLOCK. This -undoes the pre-clearing of the page's PG_mlocked done by munlock_vma_page. +page_mlock() walks the respective reverse maps looking for VM_LOCKED VMAs. When +such a VMA is found the page is mlocked via mlock_vma_page(). This undoes the +pre-clearing of the page's PG_mlocked done by munlock_vma_page. -Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's +Note that page_mlock()'s reverse map walk must visit every VMA in a page's reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA. However, the scan can terminate when it encounters a VM_LOCKED VMA. -Although try_to_munlock() might be called a great many times when munlocking a +Although page_mlock() might be called a great many times when munlocking a large region or tearing down a large address space that has been mlocked via mlockall(), overall this is a fairly rare event. @@ -602,7 +595,7 @@ inactive lists to the appropriate node's unevictable list. shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd after shrink_active_list() had moved them to the inactive list, or pages mapped into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to -recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, +recheck via page_mlock(). shrink_inactive_list() won't notice the latter, but will pass on to shrink_page_list(). 
shrink_page_list() again culls obviously unevictable pages that it could diff --git a/include/linux/rmap.h b/include/linux/rmap.h index ed31a559e857..69190efbd842 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -87,7 +87,6 @@ struct anon_vma_chain { enum ttu_flags { TTU_MIGRATION = 0x1, /* migration mode */ - TTU_MUNLOCK = 0x2, /* munlock mode */ TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ @@ -240,7 +239,7 @@ int page_mkclean(struct page *); * called in munlock()/munmap() path to check for other vmas holding * the page mlocked. */ -void try_to_munlock(struct page *); +void page_mlock(struct page *page); void remove_migration_ptes(struct page *old, struct page *new, bool locked); diff --git a/mm/mlock.c b/mm/mlock.c index df590fda5688..4ab757ab6fe8 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -108,7 +108,7 @@ void mlock_vma_page(struct page *page) /* * Finish munlock after successful page isolation * - * Page must be locked. This is a wrapper for try_to_munlock() + * Page must be locked. This is a wrapper for page_mlock() * and putback_lru_page() with munlock accounting. */ static void __munlock_isolated_page(struct page *page) @@ -118,7 +118,7 @@ static void __munlock_isolated_page(struct page *page) * and we don't need to check all the other vmas. */ if (page_mapcount(page) > 1) - try_to_munlock(page); + page_mlock(page); /* Did try_to_unlock() succeed or punt? */ if (!PageMlocked(page)) @@ -158,7 +158,7 @@ static void __munlock_isolation_failed(struct page *page) * munlock()ed or munmap()ed, we want to check whether other vmas hold the * page locked so that we can leave it on the unevictable lru list and not * bother vmscan with it. However, to walk the page's rmap list in - * try_to_munlock() we must isolate the page from the LRU. If some other + * page_mlock() we must isolate the page from the LRU. If some other * task has removed the page from the LRU, we won't be able to do that. * So we clear the PageMlocked as we might not get another chance. If we * can't isolate the page, we leave it for putback_lru_page() and vmscan @@ -168,7 +168,7 @@ unsigned int munlock_vma_page(struct page *page) { int nr_pages; - /* For try_to_munlock() and to serialize with page migration */ + /* For page_mlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); VM_BUG_ON_PAGE(PageTail(page), page); @@ -205,7 +205,7 @@ static int __mlock_posix_error_return(long retval) * * The fast path is available only for evictable pages with single mapping. * Then we can bypass the per-cpu pvec and get better performance. - * when mapcount > 1 we need try_to_munlock() which can fail. + * when mapcount > 1 we need page_mlock() which can fail. * when !page_evictable(), we need the full redo logic of putback_lru_page to * avoid leaving evictable page in unevictable list. * @@ -414,7 +414,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec, * * We don't save and restore VM_LOCKED here because pages are * still on lru. In unmap path, pages might be scanned by reclaim - * and re-mlocked by try_to_{munlock|unmap} before we unmap and + * and re-mlocked by page_mlock/try_to_unmap before we unmap and * free them. This will result in freeing mlocked pages. 
*/ void munlock_vma_pages_range(struct vm_area_struct *vma, diff --git a/mm/rmap.c b/mm/rmap.c index b9986c8db524..73c0ff1c73ab 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1411,10 +1411,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; - /* munlock has nothing to gain from examining un-locked vmas */ - if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED)) - return true; - if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) && is_zone_device_page(page) && !is_device_private_page(page)) return true; @@ -1476,8 +1472,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, page_vma_mapped_walk_done(&pvmw); break; } - if (flags & TTU_MUNLOCK) - continue; } /* Unexpected PMD-mapped THP? */ @@ -1790,20 +1784,58 @@ void try_to_unmap(struct page *page, enum ttu_flags flags) rmap_walk(page, &rwc); } -/** - * try_to_munlock - try to munlock a page - * @page: the page to be munlocked - * - * Called from munlock code. Checks all of the VMAs mapping the page - * to make sure nobody else has this page mlocked. The page will be - * returned with PG_mlocked cleared if no other vmas have it mlocked. +/* + * Walks the vma's mapping a page and mlocks the page if any locked vma's are + * found. Once one is found the page is locked and the scan can be terminated. */ +static bool page_mlock_one(struct page *page, struct vm_area_struct *vma, + unsigned long address, void *unused) +{ + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = address, + }; -void try_to_munlock(struct page *page) + /* An un-locked vma doesn't have any pages to lock, continue the scan */ + if (!(vma->vm_flags & VM_LOCKED)) + return true; + + while (page_vma_mapped_walk(&pvmw)) { + /* + * Need to recheck under the ptl to serialise with + * __munlock_pagevec_fill() after VM_LOCKED is cleared in + * munlock_vma_pages_range(). + */ + if (vma->vm_flags & VM_LOCKED) { + /* PTE-mapped THP are never mlocked */ + if (!PageTransCompound(page)) + mlock_vma_page(page); + page_vma_mapped_walk_done(&pvmw); + } + + /* + * no need to continue scanning other vma's if the page has + * been locked. + */ + return false; + } + + return true; +} + +/** + * page_mlock - try to mlock a page + * @page: the page to be mlocked + * + * Called from munlock code. Checks all of the VMAs mapping the page and mlocks + * the page if any are found. The page will be returned with PG_mlocked cleared + * if it is not mapped by any locked vmas. + */ +void page_mlock(struct page *page) { struct rmap_walk_control rwc = { - .rmap_one = try_to_unmap_one, - .arg = (void *)TTU_MUNLOCK, + .rmap_one = page_mlock_one, .done = page_not_mapped, .anon_lock = page_lock_anon_vma_read, @@ -1855,7 +1887,7 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the anon_vma struct it points to. * - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma + * When called from page_mlock(), the mmap_lock of the mm containing the vma * where the page was found will be held for write. So, we won't recheck * vm_flags for that VMA. That should be OK, because that vma shouldn't be * LOCKED. @@ -1908,7 +1940,7 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc, * Find all the mappings of a page using the mapping pointer and the vma chains * contained in the address_space struct it points to. 
* - * When called from try_to_munlock(), the mmap_lock of the mm containing the vma + * When called from page_mlock(), the mmap_lock of the mm containing the vma * where the page was found will be held for write. So, we won't recheck * vm_flags for that VMA. That should be OK, because that vma shouldn't be * LOCKED. From a98a2f0c8ce1b2138cb8e3ae410444dedcc14809 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:16 -0700 Subject: [PATCH 132/190] mm/rmap: split migration into its own function Migration is currently implemented as a mode of operation for try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE. However it does not have much in common with the rest of the unmap functionality of try_to_unmap_one() and thus splitting it into a separate function reduces the complexity of try_to_unmap_one() making it more readable. Several simplifications can also be made in try_to_migrate_one() based on the following observations: - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK. - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON. - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH. TTU_SPLIT_FREEZE is a special case of migration used when splitting an anonymous page. This is most easily dealt with by calling the correct function from unmap_page() in mm/huge_memory.c - either try_to_migrate() for PageAnon or try_to_unmap(). Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Christoph Hellwig Reviewed-by: Ralph Campbell Cc: Ben Skeggs Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/rmap.h | 4 +- mm/huge_memory.c | 16 +- mm/migrate.c | 9 +- mm/rmap.c | 367 ++++++++++++++++++++++++++++++++----------- 4 files changed, 289 insertions(+), 107 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 69190efbd842..b0ea9d98302f 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -86,8 +86,6 @@ struct anon_vma_chain { }; enum ttu_flags { - TTU_MIGRATION = 0x1, /* migration mode */ - TTU_SPLIT_HUGE_PMD = 0x4, /* split huge PMD if any */ TTU_IGNORE_MLOCK = 0x8, /* ignore mlock */ TTU_SYNC = 0x10, /* avoid racy checks with PVMW_SYNC */ @@ -97,7 +95,6 @@ enum ttu_flags { * do a final flush if necessary */ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: * caller holds it */ - TTU_SPLIT_FREEZE = 0x100, /* freeze pte under splitting thp */ }; #ifdef CONFIG_MMU @@ -194,6 +191,7 @@ static inline void page_dup_rmap(struct page *page, bool compound) int page_referenced(struct page *, int is_locked, struct mem_cgroup *memcg, unsigned long *vm_flags); +void try_to_migrate(struct page *page, enum ttu_flags flags); void try_to_unmap(struct page *, enum ttu_flags flags); /* Avoid racy checks */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1c81aa11d618..8b731d53e9f4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2309,16 +2309,20 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma, static void unmap_page(struct page *page) { - enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC | - TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD; + enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | + TTU_SYNC; VM_BUG_ON_PAGE(!PageHead(page), page); - /* If TTU_SPLIT_FREEZE is ever extended to file, update remap_page() */ + /* + * Anon pages need 
migration entries to preserve them, but file + * pages can simply be left unmapped, then faulted back on demand. + * If that is ever changed (perhaps for mlock), update remap_page(). + */ if (PageAnon(page)) - ttu_flags |= TTU_SPLIT_FREEZE; - - try_to_unmap(page, ttu_flags); + try_to_migrate(page, ttu_flags); + else + try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK); VM_WARN_ON_ONCE_PAGE(page_mapped(page), page); } diff --git a/mm/migrate.c b/mm/migrate.c index de47bdfbd2f8..a69f54bd92b9 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1109,7 +1109,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, /* Establish migration ptes */ VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma, page); - try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK); + try_to_migrate(page, 0); page_was_mapped = 1; } @@ -1311,7 +1311,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, if (page_mapped(hpage)) { bool mapping_locked = false; - enum ttu_flags ttu = TTU_MIGRATION|TTU_IGNORE_MLOCK; + enum ttu_flags ttu = 0; if (!PageAnon(hpage)) { /* @@ -1328,7 +1328,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, ttu |= TTU_RMAP_LOCKED; } - try_to_unmap(hpage, ttu); + try_to_migrate(hpage, ttu); page_was_mapped = 1; if (mapping_locked) @@ -2602,7 +2602,6 @@ static void migrate_vma_prepare(struct migrate_vma *migrate) */ static void migrate_vma_unmap(struct migrate_vma *migrate) { - int flags = TTU_MIGRATION | TTU_IGNORE_MLOCK; const unsigned long npages = migrate->npages; const unsigned long start = migrate->start; unsigned long addr, i, restore = 0; @@ -2614,7 +2613,7 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) continue; if (page_mapped(page)) { - try_to_unmap(page, flags); + try_to_migrate(page, 0); if (page_mapped(page)) goto restore; } diff --git a/mm/rmap.c b/mm/rmap.c index 73c0ff1c73ab..b922c7074cdd 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1411,14 +1411,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; - if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) && - is_zone_device_page(page) && !is_device_private_page(page)) - return true; - - if (flags & TTU_SPLIT_HUGE_PMD) { - split_huge_pmd_address(vma, address, - flags & TTU_SPLIT_FREEZE, page); - } + if (flags & TTU_SPLIT_HUGE_PMD) + split_huge_pmd_address(vma, address, false, page); /* * For THP, we have to assume the worse case ie pmd for invalidation. @@ -1443,16 +1437,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, mmu_notifier_invalidate_range_start(&range); while (page_vma_mapped_walk(&pvmw)) { -#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION - /* PMD-mapped THP migration entry */ - if (!pvmw.pte && (flags & TTU_MIGRATION)) { - VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page); - - set_pmd_migration_entry(&pvmw, page); - continue; - } -#endif - /* * If the page is mlock()d, we cannot swap it out. * If it's recently referenced (perhaps page_referenced @@ -1514,46 +1498,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, } } - if (IS_ENABLED(CONFIG_MIGRATION) && - (flags & TTU_MIGRATION) && - is_zone_device_page(page)) { - swp_entry_t entry; - pte_t swp_pte; - - pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte); - - /* - * Store the pfn of the page in a special migration - * pte. do_swap_page() will wait until the migration - * pte is removed and then restart fault handling. 
- */ - entry = make_readable_migration_entry(page_to_pfn(page)); - swp_pte = swp_entry_to_pte(entry); - - /* - * pteval maps a zone device page and is therefore - * a swap pte. - */ - if (pte_swp_soft_dirty(pteval)) - swp_pte = pte_swp_mksoft_dirty(swp_pte); - if (pte_swp_uffd_wp(pteval)) - swp_pte = pte_swp_mkuffd_wp(swp_pte); - set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); - /* - * No need to invalidate here it will synchronize on - * against the special swap migration pte. - * - * The assignment to subpage above was computed from a - * swap PTE which results in an invalid pointer. - * Since only PAGE_SIZE pages can currently be - * migrated, just set it to page. This will need to be - * changed when hugepage migrations to device private - * memory are supported. - */ - subpage = page; - goto discard; - } - /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); if (should_defer_flush(mm, flags)) { @@ -1606,39 +1550,6 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* We have to invalidate as we cleared the pte */ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE); - } else if (IS_ENABLED(CONFIG_MIGRATION) && - (flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) { - swp_entry_t entry; - pte_t swp_pte; - - if (arch_unmap_one(mm, vma, address, pteval) < 0) { - set_pte_at(mm, address, pvmw.pte, pteval); - ret = false; - page_vma_mapped_walk_done(&pvmw); - break; - } - - /* - * Store the pfn of the page in a special migration - * pte. do_swap_page() will wait until the migration - * pte is removed and then restart fault handling. - */ - if (pte_write(pteval)) - entry = make_writable_migration_entry( - page_to_pfn(subpage)); - else - entry = make_readable_migration_entry( - page_to_pfn(subpage)); - swp_pte = swp_entry_to_pte(entry); - if (pte_soft_dirty(pteval)) - swp_pte = pte_swp_mksoft_dirty(swp_pte); - if (pte_uffd_wp(pteval)) - swp_pte = pte_swp_mkuffd_wp(swp_pte); - set_pte_at(mm, address, pvmw.pte, swp_pte); - /* - * No need to invalidate here it will synchronize on - * against the special swap migration pte. - */ } else if (PageAnon(page)) { swp_entry_t entry = { .val = page_private(subpage) }; pte_t swp_pte; @@ -1766,6 +1677,277 @@ void try_to_unmap(struct page *page, enum ttu_flags flags) .anon_lock = page_lock_anon_vma_read, }; + if (flags & TTU_RMAP_LOCKED) + rmap_walk_locked(page, &rwc); + else + rmap_walk(page, &rwc); +} + +/* + * @arg: enum ttu_flags will be passed to this argument. + * + * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs + * containing migration entries. This and TTU_RMAP_LOCKED are the only supported + * flags. + */ +static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma, + unsigned long address, void *arg) +{ + struct mm_struct *mm = vma->vm_mm; + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = address, + }; + pte_t pteval; + struct page *subpage; + bool ret = true; + struct mmu_notifier_range range; + enum ttu_flags flags = (enum ttu_flags)(long)arg; + + if (is_zone_device_page(page) && !is_device_private_page(page)) + return true; + + /* + * When racing against e.g. zap_pte_range() on another cpu, + * in between its ptep_get_and_clear_full() and page_remove_rmap(), + * try_to_migrate() may return before page_mapped() has become false, + * if page table locking is skipped: use TTU_SYNC to wait for that. 
+ */ + if (flags & TTU_SYNC) + pvmw.flags = PVMW_SYNC; + + /* + * unmap_page() in mm/huge_memory.c is the only user of migration with + * TTU_SPLIT_HUGE_PMD and it wants to freeze. + */ + if (flags & TTU_SPLIT_HUGE_PMD) + split_huge_pmd_address(vma, address, true, page); + + /* + * For THP, we have to assume the worse case ie pmd for invalidation. + * For hugetlb, it could be much worse if we need to do pud + * invalidation in the case of pmd sharing. + * + * Note that the page can not be free in this function as call of + * try_to_unmap() must hold a reference on the page. + */ + range.end = PageKsm(page) ? + address + PAGE_SIZE : vma_address_end(page, vma); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm, + address, range.end); + if (PageHuge(page)) { + /* + * If sharing is possible, start and end will be adjusted + * accordingly. + */ + adjust_range_if_pmd_sharing_possible(vma, &range.start, + &range.end); + } + mmu_notifier_invalidate_range_start(&range); + + while (page_vma_mapped_walk(&pvmw)) { +#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION + /* PMD-mapped THP migration entry */ + if (!pvmw.pte) { + VM_BUG_ON_PAGE(PageHuge(page) || + !PageTransCompound(page), page); + + set_pmd_migration_entry(&pvmw, page); + continue; + } +#endif + + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_PAGE(!pvmw.pte, page); + + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); + address = pvmw.address; + + if (PageHuge(page) && !PageAnon(page)) { + /* + * To call huge_pmd_unshare, i_mmap_rwsem must be + * held in write mode. Caller needs to explicitly + * do this outside rmap routines. + */ + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { + /* + * huge_pmd_unshare unmapped an entire PMD + * page. There is no way of knowing exactly + * which PMDs may be cached for this mm, so + * we must flush them all. start/end were + * already adjusted above to cover this range. + */ + flush_cache_range(vma, range.start, range.end); + flush_tlb_range(vma, range.start, range.end); + mmu_notifier_invalidate_range(mm, range.start, + range.end); + + /* + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. + */ + page_vma_mapped_walk_done(&pvmw); + break; + } + } + + /* Nuke the page table entry. */ + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); + pteval = ptep_clear_flush(vma, address, pvmw.pte); + + /* Move the dirty bit to the page. Now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); + + /* Update high watermark before we lower rss */ + update_hiwater_rss(mm); + + if (is_zone_device_page(page)) { + swp_entry_t entry; + pte_t swp_pte; + + /* + * Store the pfn of the page in a special migration + * pte. do_swap_page() will wait until the migration + * pte is removed and then restart fault handling. + */ + entry = make_readable_migration_entry( + page_to_pfn(page)); + swp_pte = swp_entry_to_pte(entry); + + /* + * pteval maps a zone device page and is therefore + * a swap pte. + */ + if (pte_swp_soft_dirty(pteval)) + swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_swp_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); + set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. 
+ * + * The assignment to subpage above was computed from a + * swap PTE which results in an invalid pointer. + * Since only PAGE_SIZE pages can currently be + * migrated, just set it to page. This will need to be + * changed when hugepage migrations to device private + * memory are supported. + */ + subpage = page; + } else if (PageHWPoison(page)) { + pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); + if (PageHuge(page)) { + hugetlb_count_sub(compound_nr(page), mm); + set_huge_swap_pte_at(mm, address, + pvmw.pte, pteval, + vma_mmu_pagesize(vma)); + } else { + dec_mm_counter(mm, mm_counter(page)); + set_pte_at(mm, address, pvmw.pte, pteval); + } + + } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) { + /* + * The guest indicated that the page content is of no + * interest anymore. Simply discard the pte, vmscan + * will take care of the rest. + * A future reference will then fault in a new zero + * page. When userfaultfd is active, we must not drop + * this page though, as its main user (postcopy + * migration) will not expect userfaults on already + * copied pages. + */ + dec_mm_counter(mm, mm_counter(page)); + /* We have to invalidate as we cleared the pte */ + mmu_notifier_invalidate_range(mm, address, + address + PAGE_SIZE); + } else { + swp_entry_t entry; + pte_t swp_pte; + + if (arch_unmap_one(mm, vma, address, pteval) < 0) { + set_pte_at(mm, address, pvmw.pte, pteval); + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } + + /* + * Store the pfn of the page in a special migration + * pte. do_swap_page() will wait until the migration + * pte is removed and then restart fault handling. + */ + if (pte_write(pteval)) + entry = make_writable_migration_entry( + page_to_pfn(subpage)); + else + entry = make_readable_migration_entry( + page_to_pfn(subpage)); + + swp_pte = swp_entry_to_pte(entry); + if (pte_soft_dirty(pteval)) + swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); + set_pte_at(mm, address, pvmw.pte, swp_pte); + /* + * No need to invalidate here it will synchronize on + * against the special swap migration pte. + */ + } + + /* + * No need to call mmu_notifier_invalidate_range() it has be + * done above for all cases requiring it to happen under page + * table lock before mmu_notifier_invalidate_range_end() + * + * See Documentation/vm/mmu_notifier.rst + */ + page_remove_rmap(subpage, PageHuge(page)); + put_page(page); + } + + mmu_notifier_invalidate_range_end(&range); + + return ret; +} + +/** + * try_to_migrate - try to replace all page table mappings with swap entries + * @page: the page to replace page table entries for + * @flags: action and flags + * + * Tries to remove all the page table entries which are mapping this page and + * replace them with special swap entries. Caller must hold the page lock. + * + * If is successful, return true. Otherwise, false. + */ +void try_to_migrate(struct page *page, enum ttu_flags flags) +{ + struct rmap_walk_control rwc = { + .rmap_one = try_to_migrate_one, + .arg = (void *)flags, + .done = page_not_mapped, + .anon_lock = page_lock_anon_vma_read, + }; + + /* + * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and + * TTU_SPLIT_HUGE_PMD and TTU_SYNC flags. + */ + if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD | + TTU_SYNC))) + return; + /* * During exec, a temporary VMA is setup and later moved. 
* The VMA is moved under the anon_vma lock but not the @@ -1774,8 +1956,7 @@ void try_to_unmap(struct page *page, enum ttu_flags flags) * locking requirements of exec(), migration skips * temporary VMAs until after exec() completes. */ - if ((flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE)) - && !PageKsm(page) && PageAnon(page)) + if (!PageKsm(page) && PageAnon(page)) rwc.invalid_vma = invalid_migration_vma; if (flags & TTU_RMAP_LOCKED) From 6b49bf6ddbb0d7992c816846acfa5fd1cf751c36 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:19 -0700 Subject: [PATCH 133/190] mm: rename migrate_pgmap_owner MMU notifier ranges have a migrate_pgmap_owner field which is used by drivers to store a pointer. This is subsequently used by the driver callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types can also benefit from this filtering, so rename the 'migrate_pgmap_owner' field to 'owner' and create a new notifier initialisation function to initialise this field. Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com Signed-off-by: Alistair Popple Suggested-by: Peter Xu Reviewed-by: Peter Xu Cc: Ben Skeggs Cc: Christoph Hellwig Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Ralph Campbell Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hmm.rst | 2 +- drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +- include/linux/mmu_notifier.h | 20 ++++++++++---------- lib/test_hmm.c | 2 +- mm/migrate.c | 10 +++++----- 5 files changed, 18 insertions(+), 18 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 09e28507f5b2..3df79307a797 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -332,7 +332,7 @@ between device driver specific code and shared common code: walks to fill in the ``args->src`` array with PFNs to be migrated. The ``invalidate_range_start()`` callback is passed a ``struct mmu_notifier_range`` with the ``event`` field set to - ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to + ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is allows the device driver to skip the invalidation callback and only invalidate device private MMU mappings that are actually migrating. diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 1c3f890377d2..e963f7d81953 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn, * the invalidation is handled as part of the migration process. */ if (update->event == MMU_NOTIFY_MIGRATE && - update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev) + update->owner == svmm->vmm->cli->drm->dev) goto out; if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) { diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 1a6a9eb6d3fa..8e428eb813b8 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -41,7 +41,7 @@ struct mmu_interval_notifier; * * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal * a device driver to possibly ignore the invalidation if the - * migrate_pgmap_owner field matches the driver's device private pgmap owner. + * owner field matches the driver's device private pgmap owner. 
*/ enum mmu_notifier_event { MMU_NOTIFY_UNMAP = 0, @@ -269,7 +269,7 @@ struct mmu_notifier_range { unsigned long end; unsigned flags; enum mmu_notifier_event event; - void *migrate_pgmap_owner; + void *owner; }; static inline int mm_has_notifiers(struct mm_struct *mm) @@ -521,14 +521,14 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range, range->flags = flags; } -static inline void mmu_notifier_range_init_migrate( - struct mmu_notifier_range *range, unsigned int flags, +static inline void mmu_notifier_range_init_owner( + struct mmu_notifier_range *range, + enum mmu_notifier_event event, unsigned int flags, struct vm_area_struct *vma, struct mm_struct *mm, - unsigned long start, unsigned long end, void *pgmap) + unsigned long start, unsigned long end, void *owner) { - mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm, - start, end); - range->migrate_pgmap_owner = pgmap; + mmu_notifier_range_init(range, event, flags, vma, mm, start, end); + range->owner = owner; } #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ @@ -655,8 +655,8 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range, #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end) \ _mmu_notifier_range_init(range, start, end) -#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \ - pgmap) \ +#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \ + end, owner) \ _mmu_notifier_range_init(range, start, end) static inline bool diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 15f2e2db77bc..fc7a20bc9b42 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni, * the invalidation is handled as part of the migration process. */ if (range->event == MMU_NOTIFY_MIGRATE && - range->migrate_pgmap_owner == dmirror->mdevice) + range->owner == dmirror->mdevice) return true; if (mmu_notifier_range_blockable(range)) diff --git a/mm/migrate.c b/mm/migrate.c index a69f54bd92b9..23cbd9de030b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2416,8 +2416,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate) * that the registered device driver can skip invalidating device * private page mappings that won't be migrated. */ - mmu_notifier_range_init_migrate(&range, 0, migrate->vma, - migrate->vma->vm_mm, migrate->start, migrate->end, + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0, + migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end, migrate->pgmap_owner); mmu_notifier_invalidate_range_start(&range); @@ -2927,9 +2927,9 @@ void migrate_vma_pages(struct migrate_vma *migrate) if (!notified) { notified = true; - mmu_notifier_range_init_migrate(&range, 0, - migrate->vma, migrate->vma->vm_mm, - addr, migrate->end, + mmu_notifier_range_init_owner(&range, + MMU_NOTIFY_MIGRATE, 0, migrate->vma, + migrate->vma->vm_mm, addr, migrate->end, migrate->pgmap_owner); mmu_notifier_invalidate_range_start(&range); } From 9a5cc85c407402ae66128d31f0422a3a7ffa5c5c Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:22 -0700 Subject: [PATCH 134/190] mm/memory.c: allow different return codes for copy_nonpresent_pte() Currently if copy_nonpresent_pte() returns a non-zero value it is assumed to be a swap entry which requires further processing outside the loop in copy_pte_range() after dropping locks. 
This prevents other values being returned to signal conditions such as failure which a subsequent change requires. Instead make copy_nonpresent_pte() return an error code if further processing is required and read the value for the swap entry in the main loop under the ptl. Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Peter Xu Cc: Ben Skeggs Cc: Christoph Hellwig Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Ralph Campbell Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e30488e9202f..75feddbf0190 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -717,7 +717,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, if (likely(!non_swap_entry(entry))) { if (swap_duplicate(entry) < 0) - return entry.val; + return -EIO; /* make sure dst_mm is on swapoff's mmlist. */ if (unlikely(list_empty(&dst_mm->mmlist))) { @@ -973,12 +973,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, continue; } if (unlikely(!pte_present(*src_pte))) { - entry.val = copy_nonpresent_pte(dst_mm, src_mm, - dst_pte, src_pte, - dst_vma, src_vma, - addr, rss); - if (entry.val) + ret = copy_nonpresent_pte(dst_mm, src_mm, + dst_pte, src_pte, + dst_vma, src_vma, + addr, rss); + if (ret == -EIO) { + entry = pte_to_swp_entry(*src_pte); break; + } progress += 8; continue; } @@ -1011,20 +1013,24 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, pte_unmap_unlock(orig_dst_pte, dst_ptl); cond_resched(); - if (entry.val) { + if (ret == -EIO) { + VM_WARN_ON_ONCE(!entry.val); if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) { ret = -ENOMEM; goto out; } entry.val = 0; - } else if (ret) { - WARN_ON_ONCE(ret != -EAGAIN); + } else if (ret == -EAGAIN) { prealloc = page_copy_prealloc(src_mm, src_vma, addr); if (!prealloc) return -ENOMEM; - /* We've captured and resolved the error. Reset, try again. */ - ret = 0; + } else if (ret) { + VM_WARN_ON_ONCE(1); } + + /* We've captured and resolved the error. Reset, try again. */ + ret = 0; + if (addr != end) goto again; out: From b756a3b5e7ead8f6f4b03cea8ac22478ce04c8a8 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:25 -0700 Subject: [PATCH 135/190] mm: device exclusive memory access Some devices require exclusive write access to shared virtual memory (SVM) ranges to perform atomic operations on that memory. This requires CPU page tables to be updated to deny access whilst atomic operations are occurring. In order to do this introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive access by a device all page table mappings for the particular range are replaced with device exclusive swap entries. This causes any CPU access to the page to result in a fault. Faults are resovled by replacing the faulting entry with the original mapping. This results in MMU notifiers being called which a driver uses to update access permissions such as revoking atomic access. After notifiers have been called the device will no longer have exclusive access to the region. Walking of the page tables to find the target pages is handled by get_user_pages() rather than a direct page table walk. 
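As an illustration of the intended calling convention, a hypothetical driver-side sketch follows. make_device_exclusive_range() is the helper added further below in this patch, the my_device_* naming is invented for the example, and dmirror_exclusive() in lib/test_hmm.c (added later in this series) is the in-tree reference user.

    /*
     * Hypothetical driver-side use of make_device_exclusive_range() (sketch
     * only). mmap_read_lock() is held around the call because the helper is
     * built on get_user_pages_remote().
     */
    static long my_device_map_range_atomic(struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end, void *owner)
    {
            struct page *pages[16];
            long npages, i;

            if (((end - start) >> PAGE_SHIFT) > ARRAY_SIZE(pages))
                    return -EINVAL;

            mmap_read_lock(mm);
            npages = make_device_exclusive_range(mm, start, end, pages, owner);
            mmap_read_unlock(mm);
            if (npages < 0)
                    return npages;

            for (i = 0; i < npages; i++) {
                    if (!pages[i])
                            continue;       /* could not be made exclusive */

                    /*
                     * Program the device page tables here, under the same
                     * driver lock that the MMU_NOTIFY_EXCLUSIVE notifier
                     * callback takes, so a concurrent revocation is seen.
                     */

                    /* Dropping the lock and reference re-enables CPU faults. */
                    unlock_page(pages[i]);
                    put_page(pages[i]);
            }
            return npages;
    }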
A direct page table walk similar to what migrate_vma_collect()/unmap() does could also have been utilised. However this resulted in more code similar in functionality to what get_user_pages() provides as page faulting is required to make the PTEs present and to break COW. [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()] Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com Signed-off-by: Alistair Popple Signed-off-by: Dan Carpenter Reviewed-by: Christoph Hellwig Cc: Ben Skeggs Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/hmm.rst | 17 ++++ include/linux/mmu_notifier.h | 6 ++ include/linux/rmap.h | 4 + include/linux/swap.h | 9 +- include/linux/swapops.h | 44 ++++++++- mm/hmm.c | 5 + mm/memory.c | 127 +++++++++++++++++++++++- mm/mprotect.c | 8 ++ mm/page_vma_mapped.c | 9 +- mm/rmap.c | 186 +++++++++++++++++++++++++++++++++++ 10 files changed, 405 insertions(+), 10 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 3df79307a797..a14c2938e7af 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -405,6 +405,23 @@ between device driver specific code and shared common code: The lock can now be released. +Exclusive access memory +======================= + +Some devices have features such as atomic PTE bits that can be used to implement +atomic access to system memory. To support atomic operations to a shared virtual +memory page such a device needs access to that page which is exclusive of any +userspace access from the CPU. The ``make_device_exclusive_range()`` function +can be used to make a memory range inaccessible from userspace. + +This replaces all mappings for pages in the given range with special swap +entries. Any attempt to access the swap entry results in a fault which is +resovled by replacing the entry with the original mapping. A driver gets +notified that the mapping has been changed by MMU notifiers, after which point +it will no longer have exclusive access to the page. Exclusive access is +guranteed to last until the driver drops the page lock and page reference, at +which point any CPU faults on the page may proceed as described. + Memory cgroup (memcg) and rss accounting ======================================== diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 8e428eb813b8..6692da8d121d 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -42,6 +42,11 @@ struct mmu_interval_notifier; * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal * a device driver to possibly ignore the invalidation if the * owner field matches the driver's device private pgmap owner. + * + * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no + * longer have exclusive access to the page. When sent during creation of an + * exclusive range the owner will be initialised to the value provided by the + * caller of make_device_exclusive_range(), otherwise the owner will be NULL. 
*/ enum mmu_notifier_event { MMU_NOTIFY_UNMAP = 0, @@ -51,6 +56,7 @@ enum mmu_notifier_event { MMU_NOTIFY_SOFT_DIRTY, MMU_NOTIFY_RELEASE, MMU_NOTIFY_MIGRATE, + MMU_NOTIFY_EXCLUSIVE, }; #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index b0ea9d98302f..83fb86133fe1 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -194,6 +194,10 @@ int page_referenced(struct page *, int is_locked, void try_to_migrate(struct page *page, enum ttu_flags flags); void try_to_unmap(struct page *, enum ttu_flags flags); +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, + unsigned long end, struct page **pages, + void *arg); + /* Avoid racy checks */ #define PVMW_SYNC (1 << 0) /* Look for migarion entries rather than present PTEs */ diff --git a/include/linux/swap.h b/include/linux/swap.h index df7cbb6b3d3e..6f5a43251593 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -62,12 +62,17 @@ static inline int current_is_kswapd(void) * migrate part of a process memory to device memory. * * When a page is migrated from CPU to device, we set the CPU page table entry - * to a special SWP_DEVICE_* entry. + * to a special SWP_DEVICE_{READ|WRITE} entry. + * + * When a page is mapped by the device for exclusive access we set the CPU page + * table entries to special SWP_DEVICE_EXCLUSIVE_* entries. */ #ifdef CONFIG_DEVICE_PRIVATE -#define SWP_DEVICE_NUM 2 +#define SWP_DEVICE_NUM 4 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM) #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1) +#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2) +#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3) #else #define SWP_DEVICE_NUM 0 #endif diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 04d76357aa0c..d356ab4047f7 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -127,6 +127,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry) { return unlikely(swp_type(entry) == SWP_DEVICE_WRITE); } + +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) +{ + return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset); +} + +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) +{ + return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset); +} + +static inline bool is_device_exclusive_entry(swp_entry_t entry) +{ + return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ || + swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE; +} + +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) +{ + return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE); +} #else /* CONFIG_DEVICE_PRIVATE */ static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset) { @@ -147,6 +168,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry) { return false; } + +static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + +static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset) +{ + return swp_entry(0, 0); +} + +static inline bool is_device_exclusive_entry(swp_entry_t entry) +{ + return false; +} + +static inline bool is_writable_device_exclusive_entry(swp_entry_t entry) +{ + return false; +} #endif /* CONFIG_DEVICE_PRIVATE */ #ifdef CONFIG_MIGRATION @@ -226,7 +267,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry) 
*/ static inline bool is_pfn_swap_entry(swp_entry_t entry) { - return is_migration_entry(entry) || is_device_private_entry(entry); + return is_migration_entry(entry) || is_device_private_entry(entry) || + is_device_exclusive_entry(entry); } struct page_vma_mapped_walk; diff --git a/mm/hmm.c b/mm/hmm.c index 11df3ca30b82..fad6be2bf072 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -26,6 +26,8 @@ #include #include +#include "internal.h" + struct hmm_vma_walk { struct hmm_range *range; unsigned long last; @@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, if (!non_swap_entry(entry)) goto fault; + if (is_device_exclusive_entry(entry)) + goto fault; + if (is_migration_entry(entry)) { pte_unmap(ptep); hmm_vma_walk->last = addr; diff --git a/mm/memory.c b/mm/memory.c index 75feddbf0190..747a01d495f2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -699,6 +699,68 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, } #endif +static void restore_exclusive_pte(struct vm_area_struct *vma, + struct page *page, unsigned long address, + pte_t *ptep) +{ + pte_t pte; + swp_entry_t entry; + + pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot))); + if (pte_swp_soft_dirty(*ptep)) + pte = pte_mksoft_dirty(pte); + + entry = pte_to_swp_entry(*ptep); + if (pte_swp_uffd_wp(*ptep)) + pte = pte_mkuffd_wp(pte); + else if (is_writable_device_exclusive_entry(entry)) + pte = maybe_mkwrite(pte_mkdirty(pte), vma); + + set_pte_at(vma->vm_mm, address, ptep, pte); + + /* + * No need to take a page reference as one was already + * created when the swap entry was made. + */ + if (PageAnon(page)) + page_add_anon_rmap(page, vma, address, false); + else + /* + * Currently device exclusive access only supports anonymous + * memory so the entry shouldn't point to a filebacked page. + */ + WARN_ON_ONCE(!PageAnon(page)); + + if (vma->vm_flags & VM_LOCKED) + mlock_vma_page(page); + + /* + * No need to invalidate - it was non-present before. However + * secondary CPUs may have mappings that need invalidating. + */ + update_mmu_cache(vma, address, ptep); +} + +/* + * Tries to restore an exclusive pte if the page lock can be acquired without + * sleeping. + */ +static int +try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma, + unsigned long addr) +{ + swp_entry_t entry = pte_to_swp_entry(*src_pte); + struct page *page = pfn_swap_entry_to_page(entry); + + if (trylock_page(page)) { + restore_exclusive_pte(vma, page, addr, src_pte); + unlock_page(page); + return 0; + } + + return -EBUSY; +} + /* * copy one vm_area from one task to the other. Assumes the page tables * already present in the new task to be cleared in the whole range @@ -780,6 +842,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } + } else if (is_device_exclusive_entry(entry)) { + /* + * Make device exclusive entries present by restoring the + * original entry then copying as for a present pte. Device + * exclusive entries currently only support private writable + * (ie. COW) mappings. 
+ */ + VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags)); + if (try_restore_exclusive_pte(src_pte, src_vma, addr)) + return -EBUSY; + return -ENOENT; } if (!userfaultfd_wp(dst_vma)) pte = pte_swp_clear_uffd_wp(pte); @@ -980,9 +1053,18 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, if (ret == -EIO) { entry = pte_to_swp_entry(*src_pte); break; + } else if (ret == -EBUSY) { + break; + } else if (!ret) { + progress += 8; + continue; } - progress += 8; - continue; + + /* + * Device exclusive entry restored, continue by copying + * the now present pte. + */ + WARN_ON_ONCE(ret != -ENOENT); } /* copy_present_pte() will clear `*prealloc' if consumed */ ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, @@ -1020,6 +1102,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, goto out; } entry.val = 0; + } else if (ret == -EBUSY) { + goto out; } else if (ret == -EAGAIN) { prealloc = page_copy_prealloc(src_mm, src_vma, addr); if (!prealloc) @@ -1287,7 +1371,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, } entry = pte_to_swp_entry(ptent); - if (is_device_private_entry(entry)) { + if (is_device_private_entry(entry) || + is_device_exclusive_entry(entry)) { struct page *page = pfn_swap_entry_to_page(entry); if (unlikely(details && details->check_mapping)) { @@ -1303,7 +1388,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); rss[mm_counter(page)]--; - page_remove_rmap(page, false); + + if (is_device_private_entry(entry)) + page_remove_rmap(page, false); + put_page(page); continue; } @@ -3351,6 +3439,34 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +/* + * Restore a potential device exclusive pte to a working pte entry + */ +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) +{ + struct page *page = vmf->page; + struct vm_area_struct *vma = vmf->vma; + struct mmu_notifier_range range; + + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) + return VM_FAULT_RETRY; + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma, + vma->vm_mm, vmf->address & PAGE_MASK, + (vmf->address & PAGE_MASK) + PAGE_SIZE, NULL); + mmu_notifier_invalidate_range_start(&range); + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, + &vmf->ptl); + if (likely(pte_same(*vmf->pte, vmf->orig_pte))) + restore_exclusive_pte(vma, page, vmf->address, vmf->pte); + + pte_unmap_unlock(vmf->pte, vmf->ptl); + unlock_page(page); + + mmu_notifier_invalidate_range_end(&range); + return 0; +} + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. 
@@ -3379,6 +3495,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address); + } else if (is_device_exclusive_entry(entry)) { + vmf->page = pfn_swap_entry_to_page(entry); + ret = remove_device_exclusive_entry(vmf); } else if (is_device_private_entry(entry)) { vmf->page = pfn_swap_entry_to_page(entry); ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); diff --git a/mm/mprotect.c b/mm/mprotect.c index ee5961888e70..883e2cc85cad 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, newpte = swp_entry_to_pte(entry); if (pte_swp_uffd_wp(oldpte)) newpte = pte_swp_mkuffd_wp(newpte); + } else if (is_writable_device_exclusive_entry(entry)) { + entry = make_readable_device_exclusive_entry( + swp_offset(entry)); + newpte = swp_entry_to_pte(entry); + if (pte_swp_soft_dirty(oldpte)) + newpte = pte_swp_mksoft_dirty(newpte); + if (pte_swp_uffd_wp(oldpte)) + newpte = pte_swp_mkuffd_wp(newpte); } else { newpte = oldpte; } diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 4df6093e759b..f7b331081791 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw) /* Handle un-addressable ZONE_DEVICE memory */ entry = pte_to_swp_entry(*pvmw->pte); - if (!is_device_private_entry(entry)) + if (!is_device_private_entry(entry) && + !is_device_exclusive_entry(entry)) return false; } else if (!pte_present(*pvmw->pte)) return false; @@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) return false; entry = pte_to_swp_entry(*pvmw->pte); - if (!is_migration_entry(entry)) + if (!is_migration_entry(entry) && + !is_device_exclusive_entry(entry)) return false; pfn = swp_offset(entry); @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) /* Handle un-addressable ZONE_DEVICE memory */ entry = pte_to_swp_entry(*pvmw->pte); - if (!is_device_private_entry(entry)) + if (!is_device_private_entry(entry) && + !is_device_exclusive_entry(entry)) return false; pfn = swp_offset(entry); diff --git a/mm/rmap.c b/mm/rmap.c index b922c7074cdd..37c24672125c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2028,6 +2028,192 @@ void page_mlock(struct page *page) rmap_walk(page, &rwc); } +#ifdef CONFIG_DEVICE_PRIVATE +struct make_exclusive_args { + struct mm_struct *mm; + unsigned long address; + void *owner; + bool valid; +}; + +static bool page_make_device_exclusive_one(struct page *page, + struct vm_area_struct *vma, unsigned long address, void *priv) +{ + struct mm_struct *mm = vma->vm_mm; + struct page_vma_mapped_walk pvmw = { + .page = page, + .vma = vma, + .address = address, + }; + struct make_exclusive_args *args = priv; + pte_t pteval; + struct page *subpage; + bool ret = true; + struct mmu_notifier_range range; + swp_entry_t entry; + pte_t swp_pte; + + mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma, + vma->vm_mm, address, min(vma->vm_end, + address + page_size(page)), args->owner); + mmu_notifier_invalidate_range_start(&range); + + while (page_vma_mapped_walk(&pvmw)) { + /* Unexpected PMD-mapped THP? */ + VM_BUG_ON_PAGE(!pvmw.pte, page); + + if (!pte_present(*pvmw.pte)) { + ret = false; + page_vma_mapped_walk_done(&pvmw); + break; + } + + subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte); + address = pvmw.address; + + /* Nuke the page table entry. 
*/ + flush_cache_page(vma, address, pte_pfn(*pvmw.pte)); + pteval = ptep_clear_flush(vma, address, pvmw.pte); + + /* Move the dirty bit to the page. Now the pte is gone. */ + if (pte_dirty(pteval)) + set_page_dirty(page); + + /* + * Check that our target page is still mapped at the expected + * address. + */ + if (args->mm == mm && args->address == address && + pte_write(pteval)) + args->valid = true; + + /* + * Store the pfn of the page in a special migration + * pte. do_swap_page() will wait until the migration + * pte is removed and then restart fault handling. + */ + if (pte_write(pteval)) + entry = make_writable_device_exclusive_entry( + page_to_pfn(subpage)); + else + entry = make_readable_device_exclusive_entry( + page_to_pfn(subpage)); + swp_pte = swp_entry_to_pte(entry); + if (pte_soft_dirty(pteval)) + swp_pte = pte_swp_mksoft_dirty(swp_pte); + if (pte_uffd_wp(pteval)) + swp_pte = pte_swp_mkuffd_wp(swp_pte); + + set_pte_at(mm, address, pvmw.pte, swp_pte); + + /* + * There is a reference on the page for the swap entry which has + * been removed, so shouldn't take another. + */ + page_remove_rmap(subpage, false); + } + + mmu_notifier_invalidate_range_end(&range); + + return ret; +} + +/** + * page_make_device_exclusive - mark the page exclusively owned by a device + * @page: the page to replace page table entries for + * @mm: the mm_struct where the page is expected to be mapped + * @address: address where the page is expected to be mapped + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks + * + * Tries to remove all the page table entries which are mapping this page and + * replace them with special device exclusive swap entries to grant a device + * exclusive access to the page. Caller must hold the page lock. + * + * Returns false if the page is still mapped, or if it could not be unmapped + * from the expected address. Otherwise returns true (success). + */ +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm, + unsigned long address, void *owner) +{ + struct make_exclusive_args args = { + .mm = mm, + .address = address, + .owner = owner, + .valid = false, + }; + struct rmap_walk_control rwc = { + .rmap_one = page_make_device_exclusive_one, + .done = page_not_mapped, + .anon_lock = page_lock_anon_vma_read, + .arg = &args, + }; + + /* + * Restrict to anonymous pages for now to avoid potential writeback + * issues. Also tail pages shouldn't be passed to rmap_walk so skip + * those. + */ + if (!PageAnon(page) || PageTail(page)) + return false; + + rmap_walk(page, &rwc); + + return args.valid && !page_mapcount(page); +} + +/** + * make_device_exclusive_range() - Mark a range for exclusive use by a device + * @mm: mm_struct of assoicated target process + * @start: start of the region to mark for exclusive device access + * @end: end address of region + * @pages: returns the pages which were successfully marked for exclusive access + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering + * + * Returns: number of pages found in the range by GUP. A page is marked for + * exclusive access only if the page pointer is non-NULL. + * + * This function finds ptes mapping page(s) to the given address range, locks + * them and replaces mappings with special swap entries preventing userspace CPU + * access. On fault these entries are replaced with the original mapping after + * calling MMU notifiers. 
+ * + * A driver using this to program access from a device must use a mmu notifier + * critical section to hold a device specific lock during programming. Once + * programming is complete it should drop the page lock and reference after + * which point CPU access to the page will revoke the exclusive access. + */ +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start, + unsigned long end, struct page **pages, + void *owner) +{ + long npages = (end - start) >> PAGE_SHIFT; + long i; + + npages = get_user_pages_remote(mm, start, npages, + FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD, + pages, NULL, NULL); + if (npages < 0) + return npages; + + for (i = 0; i < npages; i++, start += PAGE_SIZE) { + if (!trylock_page(pages[i])) { + put_page(pages[i]); + pages[i] = NULL; + continue; + } + + if (!page_make_device_exclusive(pages[i], mm, start, owner)) { + unlock_page(pages[i]); + put_page(pages[i]); + pages[i] = NULL; + } + } + + return npages; +} +EXPORT_SYMBOL_GPL(make_device_exclusive_range); +#endif + void __put_anon_vma(struct anon_vma *anon_vma) { struct anon_vma *root = anon_vma->root; From b659baea75469f0c5bd26f18461dfcdc1bbbac82 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:28 -0700 Subject: [PATCH 136/190] mm: selftests for exclusive device memory Adds some selftests for exclusive device memory. Link: https://lkml.kernel.org/r/20210616105937.23201-9-apopple@nvidia.com Signed-off-by: Alistair Popple Acked-by: Jason Gunthorpe Tested-by: Ralph Campbell Reviewed-by: Ralph Campbell Cc: Ben Skeggs Cc: Christoph Hellwig Cc: Hugh Dickins Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/test_hmm.c | 125 +++++++++++++++++++ lib/test_hmm_uapi.h | 2 + tools/testing/selftests/vm/hmm-tests.c | 158 +++++++++++++++++++++++++ 3 files changed, 285 insertions(+) diff --git a/lib/test_hmm.c b/lib/test_hmm.c index fc7a20bc9b42..8c55c4723692 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -25,6 +25,7 @@ #include #include #include +#include #include "test_hmm_uapi.h" @@ -46,6 +47,7 @@ struct dmirror_bounce { unsigned long cpages; }; +#define DPT_XA_TAG_ATOMIC 1UL #define DPT_XA_TAG_WRITE 3UL /* @@ -619,6 +621,54 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, } } +static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start, + unsigned long end) +{ + unsigned long pfn; + + for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) { + void *entry; + struct page *page; + + entry = xa_load(&dmirror->pt, pfn); + page = xa_untag_pointer(entry); + if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC) + return -EPERM; + } + + return 0; +} + +static int dmirror_atomic_map(unsigned long start, unsigned long end, + struct page **pages, struct dmirror *dmirror) +{ + unsigned long pfn, mapped = 0; + int i; + + /* Map the migrated pages into the device's page tables. 
*/ + mutex_lock(&dmirror->mutex); + + for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) { + void *entry; + + if (!pages[i]) + continue; + + entry = pages[i]; + entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC); + entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC); + if (xa_is_err(entry)) { + mutex_unlock(&dmirror->mutex); + return xa_err(entry); + } + + mapped++; + } + + mutex_unlock(&dmirror->mutex); + return mapped; +} + static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, struct dmirror *dmirror) { @@ -661,6 +711,72 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, return 0; } +static int dmirror_exclusive(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + unsigned long start, end, addr; + unsigned long size = cmd->npages << PAGE_SHIFT; + struct mm_struct *mm = dmirror->notifier.mm; + struct page *pages[64]; + struct dmirror_bounce bounce; + unsigned long next; + int ret; + + start = cmd->addr; + end = start + size; + if (end < start) + return -EINVAL; + + /* Since the mm is for the mirrored process, get a reference first. */ + if (!mmget_not_zero(mm)) + return -EINVAL; + + mmap_read_lock(mm); + for (addr = start; addr < end; addr = next) { + unsigned long mapped; + int i; + + if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT)) + next = end; + else + next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT); + + ret = make_device_exclusive_range(mm, addr, next, pages, NULL); + mapped = dmirror_atomic_map(addr, next, pages, dmirror); + for (i = 0; i < ret; i++) { + if (pages[i]) { + unlock_page(pages[i]); + put_page(pages[i]); + } + } + + if (addr + (mapped << PAGE_SHIFT) < next) { + mmap_read_unlock(mm); + mmput(mm); + return -EBUSY; + } + } + mmap_read_unlock(mm); + mmput(mm); + + /* Return the migrated data for verification. */ + ret = dmirror_bounce_init(&bounce, start, size); + if (ret) + return ret; + mutex_lock(&dmirror->mutex); + ret = dmirror_do_read(dmirror, start, end, &bounce); + mutex_unlock(&dmirror->mutex); + if (ret == 0) { + if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr, + bounce.size)) + ret = -EFAULT; + } + + cmd->cpages = bounce.cpages; + dmirror_bounce_fini(&bounce); + return ret; +} + static int dmirror_migrate(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd) { @@ -948,6 +1064,15 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp, ret = dmirror_migrate(dmirror, &cmd); break; + case HMM_DMIRROR_EXCLUSIVE: + ret = dmirror_exclusive(dmirror, &cmd); + break; + + case HMM_DMIRROR_CHECK_EXCLUSIVE: + ret = dmirror_check_atomic(dmirror, cmd.addr, + cmd.addr + (cmd.npages << PAGE_SHIFT)); + break; + case HMM_DMIRROR_SNAPSHOT: ret = dmirror_snapshot(dmirror, &cmd); break; diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h index 670b4ef2a5b6..f14dea5dcd06 100644 --- a/lib/test_hmm_uapi.h +++ b/lib/test_hmm_uapi.h @@ -33,6 +33,8 @@ struct hmm_dmirror_cmd { #define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd) #define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd) #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x04, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_CHECK_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd) /* * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT. 
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 5d1ac691b9f4..864f126ffd78 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -1485,4 +1485,162 @@ TEST_F(hmm2, double_map) hmm_buffer_free(buffer); } +/* + * Basic check of exclusive faulting. + */ +TEST_F(hmm, exclusive) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Map memory exclusively for device access. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + /* Fault pages back to system memory and check them. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i]++, i); + + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i+1); + + /* Check atomic access revoked */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_CHECK_EXCLUSIVE, buffer, npages); + ASSERT_EQ(ret, 0); + + hmm_buffer_free(buffer); +} + +TEST_F(hmm, exclusive_mprotect) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Map memory exclusively for device access. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + ret = mprotect(buffer->ptr, size, PROT_READ); + ASSERT_EQ(ret, 0); + + /* Simulate a device writing system memory. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages); + ASSERT_EQ(ret, -EPERM); + + hmm_buffer_free(buffer); +} + +/* + * Check copy-on-write works. 
+ */ +TEST_F(hmm, exclusive_cow) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Map memory exclusively for device access. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_EXCLUSIVE, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + fork(); + + /* Fault pages back to system memory and check them. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i]++, i); + + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i+1); + + hmm_buffer_free(buffer); +} + TEST_HARNESS_MAIN From f81c69a2a144afefa277db4917a76bcaecfa2f2e Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:32 -0700 Subject: [PATCH 137/190] nouveau/svm: refactor nouveau_range_fault Call mmu_interval_notifier_insert() as part of nouveau_range_fault(). This doesn't introduce any functional change but makes it easier for a subsequent patch to alter the behaviour of nouveau_range_fault() to support GPU atomic operations. Link: https://lkml.kernel.org/r/20210616105937.23201-10-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Ben Skeggs Cc: Christoph Hellwig Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ++++++++++++++++----------- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index e963f7d81953..0a0c92539b5c 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, unsigned long hmm_pfns[1]; struct hmm_range range = { .notifier = ¬ifier->notifier, - .start = notifier->notifier.interval_tree.start, - .end = notifier->notifier.interval_tree.last + 1, .default_flags = hmm_flags, .hmm_pfns = hmm_pfns, .dev_private_owner = drm->dev, }; - struct mm_struct *mm = notifier->notifier.mm; + struct mm_struct *mm = svmm->notifier.mm; int ret; + ret = mmu_interval_notifier_insert(¬ifier->notifier, mm, + args->p.addr, args->p.size, + &nouveau_svm_mni_ops); + if (ret) + return ret; + + range.start = notifier->notifier.interval_tree.start; + range.end = notifier->notifier.interval_tree.last + 1; + while (true) { - if (time_after(jiffies, timeout)) - return -EBUSY; + if (time_after(jiffies, timeout)) { + ret = -EBUSY; + goto out; + } range.notifier_seq = mmu_interval_read_begin(range.notifier); mmap_read_lock(mm); @@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, if (ret) { if (ret == -EBUSY) continue; - return ret; + goto out; } mutex_lock(&svmm->mutex); @@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, 
svmm->vmm->vmm.object.client->super = false; mutex_unlock(&svmm->mutex); +out: + mmu_interval_notifier_remove(¬ifier->notifier); + return ret; } @@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify) } notifier.svmm = svmm; - ret = mmu_interval_notifier_insert(¬ifier.notifier, mm, - args.i.p.addr, args.i.p.size, - &nouveau_svm_mni_ops); - if (!ret) { - ret = nouveau_range_fault(svmm, svm->drm, &args.i, - sizeof(args), hmm_flags, ¬ifier); - mmu_interval_notifier_remove(¬ifier.notifier); - } + ret = nouveau_range_fault(svmm, svm->drm, &args.i, + sizeof(args), hmm_flags, ¬ifier); mmput(mm); limit = args.i.p.addr + args.i.p.size; From 8f187163eb890d6d2a53f7efea2b6963fe9526e2 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 30 Jun 2021 18:54:35 -0700 Subject: [PATCH 138/190] nouveau/svm: implement atomic SVM access Some NVIDIA GPUs do not support direct atomic access to system memory via PCIe. Instead this must be emulated by granting the GPU exclusive access to the memory. This is achieved by replacing CPU page table entries with special swap entries that fault on userspace access. The driver then grants the GPU permission to update the page undergoing atomic access via the GPU page tables. When CPU access to the page is required a CPU fault is raised which calls into the device driver via MMU notifiers to revoke the atomic access. The original page table entries are then restored allowing CPU access to proceed. Link: https://lkml.kernel.org/r/20210616105937.23201-11-apopple@nvidia.com Signed-off-by: Alistair Popple Reviewed-by: Ben Skeggs Cc: Christoph Hellwig Cc: Hugh Dickins Cc: Jason Gunthorpe Cc: John Hubbard Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/gpu/drm/nouveau/include/nvif/if000c.h | 1 + drivers/gpu/drm/nouveau/nouveau_svm.c | 126 ++++++++++++++++-- drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h | 1 + .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c | 6 + 4 files changed, 123 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h index d6dd40f21eed..9c7ff56831c5 100644 --- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h +++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h @@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 { #define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL #define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL #define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL +#define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL #define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL #define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL #define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 0a0c92539b5c..82b583f5fca8 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include #include #include +#include struct nouveau_svm { struct nouveau_drm *drm; @@ -67,6 +68,11 @@ struct nouveau_svm { } buffer[1]; }; +#define FAULT_ACCESS_READ 0 +#define FAULT_ACCESS_WRITE 1 +#define FAULT_ACCESS_ATOMIC 2 +#define FAULT_ACCESS_PREFETCH 3 + #define SVM_DBG(s,f,a...) NV_DEBUG((s)->drm, "svm: "f"\n", ##a) #define SVM_ERR(s,f,a...) 
NV_WARN((s)->drm, "svm: "f"\n", ##a) @@ -411,6 +417,24 @@ nouveau_svm_fault_cancel_fault(struct nouveau_svm *svm, fault->client); } +static int +nouveau_svm_fault_priority(u8 fault) +{ + switch (fault) { + case FAULT_ACCESS_PREFETCH: + return 0; + case FAULT_ACCESS_READ: + return 1; + case FAULT_ACCESS_WRITE: + return 2; + case FAULT_ACCESS_ATOMIC: + return 3; + default: + WARN_ON_ONCE(1); + return -1; + } +} + static int nouveau_svm_fault_cmp(const void *a, const void *b) { @@ -421,9 +445,8 @@ nouveau_svm_fault_cmp(const void *a, const void *b) return ret; if ((ret = (s64)fa->addr - fb->addr)) return ret; - /*XXX: atomic? */ - return (fa->access == 0 || fa->access == 3) - - (fb->access == 0 || fb->access == 3); + return nouveau_svm_fault_priority(fa->access) - + nouveau_svm_fault_priority(fb->access); } static void @@ -487,6 +510,10 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni, struct svm_notifier *sn = container_of(mni, struct svm_notifier, notifier); + if (range->event == MMU_NOTIFY_EXCLUSIVE && + range->owner == sn->svmm->vmm->cli->drm->dev) + return true; + /* * serializes the update to mni->invalidate_seq done by caller and * prevents invalidation of the PTE from progressing while HW is being @@ -555,6 +582,71 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm, args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W; } +static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm, + struct nouveau_drm *drm, + struct nouveau_pfnmap_args *args, u32 size, + struct svm_notifier *notifier) +{ + unsigned long timeout = + jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); + struct mm_struct *mm = svmm->notifier.mm; + struct page *page; + unsigned long start = args->p.addr; + unsigned long notifier_seq; + int ret = 0; + + ret = mmu_interval_notifier_insert(¬ifier->notifier, mm, + args->p.addr, args->p.size, + &nouveau_svm_mni_ops); + if (ret) + return ret; + + while (true) { + if (time_after(jiffies, timeout)) { + ret = -EBUSY; + goto out; + } + + notifier_seq = mmu_interval_read_begin(¬ifier->notifier); + mmap_read_lock(mm); + ret = make_device_exclusive_range(mm, start, start + PAGE_SIZE, + &page, drm->dev); + mmap_read_unlock(mm); + if (ret <= 0 || !page) { + ret = -EINVAL; + goto out; + } + + mutex_lock(&svmm->mutex); + if (!mmu_interval_read_retry(¬ifier->notifier, + notifier_seq)) + break; + mutex_unlock(&svmm->mutex); + } + + /* Map the page on the GPU. */ + args->p.page = 12; + args->p.size = PAGE_SIZE; + args->p.addr = start; + args->p.phys[0] = page_to_phys(page) | + NVIF_VMM_PFNMAP_V0_V | + NVIF_VMM_PFNMAP_V0_W | + NVIF_VMM_PFNMAP_V0_A | + NVIF_VMM_PFNMAP_V0_HOST; + + svmm->vmm->vmm.object.client->super = true; + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); + svmm->vmm->vmm.object.client->super = false; + mutex_unlock(&svmm->mutex); + + unlock_page(page); + put_page(page); + +out: + mmu_interval_notifier_remove(¬ifier->notifier); + return ret; +} + static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, struct nouveau_pfnmap_args *args, u32 size, @@ -637,7 +729,7 @@ nouveau_svm_fault(struct nvif_notify *notify) unsigned long hmm_flags; u64 inst, start, limit; int fi, fn; - int replay = 0, ret; + int replay = 0, atomic = 0, ret; /* Parse available fault buffer entries into a cache, and update * the GET pointer so HW can reuse the entries. @@ -718,12 +810,14 @@ nouveau_svm_fault(struct nvif_notify *notify) /* * Determine required permissions based on GPU fault * access flags. - * XXX: atomic? 
*/ switch (buffer->fault[fi]->access) { case 0: /* READ. */ hmm_flags = HMM_PFN_REQ_FAULT; break; + case 2: /* ATOMIC. */ + atomic = true; + break; case 3: /* PREFETCH. */ hmm_flags = 0; break; @@ -739,8 +833,14 @@ nouveau_svm_fault(struct nvif_notify *notify) } notifier.svmm = svmm; - ret = nouveau_range_fault(svmm, svm->drm, &args.i, - sizeof(args), hmm_flags, ¬ifier); + if (atomic) + ret = nouveau_atomic_range_fault(svmm, svm->drm, + &args.i, sizeof(args), + ¬ifier); + else + ret = nouveau_range_fault(svmm, svm->drm, &args.i, + sizeof(args), hmm_flags, + ¬ifier); mmput(mm); limit = args.i.p.addr + args.i.p.size; @@ -756,11 +856,15 @@ nouveau_svm_fault(struct nvif_notify *notify) */ if (buffer->fault[fn]->svmm != svmm || buffer->fault[fn]->addr >= limit || - (buffer->fault[fi]->access == 0 /* READ. */ && + (buffer->fault[fi]->access == FAULT_ACCESS_READ && !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) || - (buffer->fault[fi]->access != 0 /* READ. */ && - buffer->fault[fi]->access != 3 /* PREFETCH. */ && - !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W))) + (buffer->fault[fi]->access != FAULT_ACCESS_READ && + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH && + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) || + (buffer->fault[fi]->access != FAULT_ACCESS_READ && + buffer->fault[fi]->access != FAULT_ACCESS_WRITE && + buffer->fault[fi]->access != FAULT_ACCESS_PREFETCH && + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A))) break; } diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h index a2b179568970..f6188aa9171c 100644 --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h @@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *); #define NVKM_VMM_PFN_APER 0x00000000000000f0ULL #define NVKM_VMM_PFN_HOST 0x0000000000000000ULL #define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL +#define NVKM_VMM_PFN_A 0x0000000000000004ULL #define NVKM_VMM_PFN_W 0x0000000000000002ULL #define NVKM_VMM_PFN_V 0x0000000000000001ULL #define NVKM_VMM_PFN_NONE 0x0000000000000000ULL diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c index 236db5570771..f02abd9cb4dd 100644 --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c @@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt, if (!(*map->pfn & NVKM_VMM_PFN_W)) data |= BIT_ULL(6); /* RO. */ + if (!(*map->pfn & NVKM_VMM_PFN_A)) + data |= BIT_ULL(7); /* Atomic disable. */ + if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT; addr = dma_map_page(dev, pfn_to_page(addr), 0, @@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt, if (!(*map->pfn & NVKM_VMM_PFN_W)) data |= BIT_ULL(6); /* RO. */ + if (!(*map->pfn & NVKM_VMM_PFN_A)) + data |= BIT_ULL(7); /* Atomic disable. */ + if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT; addr = dma_map_page(dev, pfn_to_page(addr), 0, From d238692b4b9f2c36e35af4c6e6f6da36184aeb3e Mon Sep 17 00:00:00 2001 From: Marcelo Henrique Cerri Date: Wed, 30 Jun 2021 18:54:38 -0700 Subject: [PATCH 139/190] proc: Avoid mixing integer types in mem_rw() Use size_t when capping the count argument received by mem_rw(). Since count is size_t, using min_t(int, ...) can lead to a negative value that will later be passed to access_remote_vm(), which can cause unexpected behavior. 
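For illustration only, a minimal userspace sketch of the problem (not part of the patch; it re-declares a simplified min_t() so it builds outside the kernel, and the count value is an arbitrary example): with an int-typed cap a large size_t count converts to a negative value, while the size_t-typed cap stays at PAGE_SIZE.

/*
 * Simplified demo, not kernel code: min_t() is a stand-in for the
 * kernel macro, PAGE_SIZE is hard-coded, count is an example value.
 */
#include <stdio.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	size_t count = 0x80000000UL;			/* > INT_MAX */
	int bad = min_t(int, count, PAGE_SIZE);		/* typically INT_MIN on common ABIs */
	size_t good = min_t(size_t, count, PAGE_SIZE);	/* capped at PAGE_SIZE */

	printf("min_t(int, ...)    = %d\n", bad);	/* negative */
	printf("min_t(size_t, ...) = %zu\n", good);	/* 4096 */
	return 0;
}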
Since we are capping the value to at maximum PAGE_SIZE, the conversion from size_t to int when passing it to access_remote_vm() as "len" shouldn't be a problem. Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com Reviewed-by: David Disseldorp Signed-off-by: Thadeu Lima de Souza Cascardo Signed-off-by: Marcelo Henrique Cerri Cc: Alexey Dobriyan Cc: Souza Cascardo Cc: Christian Brauner Cc: Michel Lespinasse Cc: Helge Deller Cc: Oleg Nesterov Cc: Lorenzo Stoakes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/base.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 9cbd915025ad..a0a2fc1c9da2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -854,7 +854,7 @@ static ssize_t mem_rw(struct file *file, char __user *buf, flags = FOLL_FORCE | (write ? FOLL_WRITE : 0); while (count > 0) { - int this_len = min_t(int, count, PAGE_SIZE); + size_t this_len = min_t(size_t, count, PAGE_SIZE); if (write && copy_from_user(page, buf, this_len)) { copied = -EFAULT; From 7bc3fa0172a423afb34e6df7a3998e5f23b1a94a Mon Sep 17 00:00:00 2001 From: Kalesh Singh Date: Wed, 30 Jun 2021 18:54:44 -0700 Subject: [PATCH 140/190] procfs: allow reading fdinfo with PTRACE_MODE_READ MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Android captures per-process system memory state when certain low memory events (e.g a foreground app kill) occur, to identify potential memory hoggers. In order to measure how much memory a process actually consumes, it is necessary to include the DMA buffer sizes for that process in the memory accounting. Since the handle to DMA buffers are raw FDs, it is important to be able to identify which processes have FD references to a DMA buffer. Currently, DMA buffer FDs can be accounted using /proc//fd/* and /proc//fdinfo -- both are only readable by the process owner, as follows: 1. Do a readlink on each FD. 2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD. 3. stat the file to get the dmabuf inode number. 4. Read/ proc//fdinfo/, to get the DMA buffer size. Accessing other processes' fdinfo requires root privileges. This limits the use of the interface to debugging environments and is not suitable for production builds. Granting root privileges even to a system process increases the attack surface and is highly undesirable. Since fdinfo doesn't permit reading process memory and manipulating process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED. Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com Signed-off-by: Kalesh Singh Suggested-by: Jann Horn Acked-by: Christian König Cc: Alexander Viro Cc: Alexey Dobriyan Cc: Alexey Gladkov Cc: Andrei Vagin Cc: Bernd Edlinger Cc: Christian Brauner Cc: Eric W. 
Biederman Cc: Helge Deller Cc: Hridya Valsaraju Cc: James Morris Cc: Jeff Vander Stoep Cc: Jonathan Corbet Cc: Kees Cook Cc: Matthew Wilcox Cc: Mauro Carvalho Chehab Cc: Michal Hocko Cc: Michel Lespinasse Cc: Minchan Kim Cc: Randy Dunlap Cc: Suren Baghdasaryan Cc: Szabolcs Nagy Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/base.c | 4 ++-- fs/proc/fd.c | 15 ++++++++++++++- 2 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index a0a2fc1c9da2..e5b5f7709d48 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3172,7 +3172,7 @@ static const struct pid_entry tgid_base_stuff[] = { DIR("task", S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations), DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations), - DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdinfo", S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), @@ -3517,7 +3517,7 @@ static const struct inode_operations proc_tid_comm_inode_operations = { */ static const struct pid_entry tid_base_stuff[] = { DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations), - DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations), + DIR("fdinfo", S_IRUGO|S_IXUGO, proc_fdinfo_inode_operations, proc_fdinfo_operations), DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations), #ifdef CONFIG_NET DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations), diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 07fc4fad2602..6a80b40fd2fe 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -72,6 +73,18 @@ static int seq_show(struct seq_file *m, void *v) static int seq_fdinfo_open(struct inode *inode, struct file *file) { + bool allowed = false; + struct task_struct *task = get_proc_task(inode); + + if (!task) + return -ESRCH; + + allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); + put_task_struct(task); + + if (!allowed) + return -EACCES; + return single_open(file, seq_show, inode); } @@ -308,7 +321,7 @@ static struct dentry *proc_fdinfo_instantiate(struct dentry *dentry, struct proc_inode *ei; struct inode *inode; - inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUSR); + inode = proc_pid_make_inode(dentry->d_sb, task, S_IFREG | S_IRUGO); if (!inode) return ERR_PTR(-ENOENT); From 3845f256a8b527127bfbd4ced21e93d9e89aa6d7 Mon Sep 17 00:00:00 2001 From: Kalesh Singh Date: Wed, 30 Jun 2021 18:54:49 -0700 Subject: [PATCH 141/190] procfs/dmabuf: add inode number to /proc/*/fdinfo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit And 'ino' field to /proc//fdinfo/ and /proc//task//fdinfo/. The inode numbers can be used to uniquely identify DMA buffers in user space and avoids a dependency on /proc//fd/* when accounting per-process DMA buffer sizes. 
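As an illustration of the intended userspace flow (a hypothetical sketch, not code from this series; the path comes from argv and the field layout is the one shown in the proc.rst hunk below), a memory monitor can read a single fdinfo file and pull out the 'ino' and dmabuf 'size' fields:

/*
 * Hypothetical example, not part of this series: parse one
 * /proc/<pid>/fdinfo/<fd> file and print the fields used for
 * per-process DMA-buf accounting ('ino' plus the exporter's 'size').
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	char line[256];
	unsigned long ino = 0, size = 0;
	FILE *f;

	if (argc < 2 || !(f = fopen(argv[1], "r"))) {
		fprintf(stderr, "usage: %s /proc/<pid>/fdinfo/<fd>\n", argv[0]);
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "ino: %lu", &ino);   /* decimal, as printed by seq_show() */
		sscanf(line, "size: %lu", &size); /* only present for dmabuf fds */
	}
	fclose(f);

	printf("ino=%lu dmabuf_size=%lu\n", ino, size);
	return 0;
}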
Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh Acked-by: Randy Dunlap Acked-by: Christian König Cc: Jann Horn Cc: Jeff Vander Stoep Cc: Kees Cook Cc: Suren Baghdasaryan Cc: Minchan Kim Cc: Hridya Valsaraju Cc: Matthew Wilcox Cc: Alexander Viro Cc: Kalesh Singh Cc: Alexey Dobriyan Cc: Jonathan Corbet Cc: Mauro Carvalho Chehab Cc: Michal Hocko Cc: Alexey Gladkov Cc: Szabolcs Nagy Cc: Eric W. Biederman Cc: Christian Brauner Cc: Michel Lespinasse Cc: Bernd Edlinger Cc: Andrei Vagin Cc: Helge Deller Cc: James Morris Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.rst | 37 +++++++++++++++++++++++++----- fs/proc/fd.c | 5 ++-- 2 files changed, 34 insertions(+), 8 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index dd4e3e18c22b..042c418f4090 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1920,18 +1920,20 @@ if precise results are needed. 3.8 /proc//fdinfo/ - Information about opened file --------------------------------------------------------------- This file provides information associated with an opened file. The regular -files have at least three fields -- 'pos', 'flags' and 'mnt_id'. The 'pos' -represents the current offset of the opened file in decimal form [see lseek(2) -for details], 'flags' denotes the octal O_xxx mask the file has been -created with [see open(2) for details] and 'mnt_id' represents mount ID of -the file system containing the opened file [see 3.5 /proc//mountinfo -for details]. +files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'. +The 'pos' represents the current offset of the opened file in decimal +form [see lseek(2) for details], 'flags' denotes the octal O_xxx mask the +file has been created with [see open(2) for details] and 'mnt_id' represents +mount ID of the file system containing the opened file [see 3.5 +/proc//mountinfo for details]. 'ino' represents the inode number of +the file. A typical output is:: pos: 0 flags: 0100002 mnt_id: 19 + ino: 63107 All locks associated with a file descriptor are shown in its fdinfo too:: @@ -1948,6 +1950,7 @@ Eventfd files pos: 0 flags: 04002 mnt_id: 9 + ino: 63107 eventfd-count: 5a where 'eventfd-count' is hex value of a counter. @@ -1960,6 +1963,7 @@ Signalfd files pos: 0 flags: 04002 mnt_id: 9 + ino: 63107 sigmask: 0000000000000200 where 'sigmask' is hex value of the signal mask associated @@ -1973,6 +1977,7 @@ Epoll files pos: 0 flags: 02 mnt_id: 9 + ino: 63107 tfd: 5 events: 1d data: ffffffffffffffff pos:0 ino:61af sdev:7 where 'tfd' is a target file descriptor number in decimal form, @@ -1989,6 +1994,8 @@ For inotify files the format is the following:: pos: 0 flags: 02000000 + mnt_id: 9 + ino: 63107 inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d where 'wd' is a watch descriptor in decimal form, i.e. a target file @@ -2011,6 +2018,7 @@ For fanotify files the format is:: pos: 0 flags: 02 mnt_id: 9 + ino: 63107 fanotify flags:10 event-flags:0 fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 @@ -2035,6 +2043,7 @@ Timerfd files pos: 0 flags: 02 mnt_id: 9 + ino: 63107 clockid: 0 ticks: 0 settime flags: 01 @@ -2049,6 +2058,22 @@ details]. 'it_value' is remaining time until the timer expiration. 
with TIMER_ABSTIME option which will be shown in 'settime flags', but 'it_value' still exhibits timer's remaining time. +DMA Buffer files +~~~~~~~~~~~~~~~~ + +:: + + pos: 0 + flags: 04002 + mnt_id: 9 + ino: 63107 + size: 32768 + count: 2 + exp_name: system-heap + +where 'size' is the size of the DMA buffer in bytes. 'count' is the file count of +the DMA buffer file. 'exp_name' is the name of the DMA buffer exporter. + 3.9 /proc//map_files - Information about memory mapped files --------------------------------------------------------------------- This directory contains symbolic links which represent memory mapped files diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 6a80b40fd2fe..172c86270b31 100644 --- a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -54,9 +54,10 @@ static int seq_show(struct seq_file *m, void *v) if (ret) return ret; - seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n", + seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n", (long long)file->f_pos, f_flags, - real_mount(file->f_path.mnt)->mnt_id); + real_mount(file->f_path.mnt)->mnt_id, + file_inode(file)->i_ino); /* show_fd_locks() never deferences files so a stale value is safe */ show_fd_locks(m, file, files); From 9a52c5f3c8957872b2750314b56c64d9600542a9 Mon Sep 17 00:00:00 2001 From: Jiapeng Chong Date: Wed, 30 Jun 2021 18:54:53 -0700 Subject: [PATCH 142/190] sysctl: remove redundant assignment to first Variable first is set to '0', but this value is never read as it is not used later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: kernel/sysctl.c:1562:4: warning: Value stored to 'first' is never read [clang-analyzer-deadcode.DeadStores]. Link: https://lkml.kernel.org/r/1620469990-22182-1-git-send-email-jiapeng.chong@linux.alibaba.com Signed-off-by: Jiapeng Chong Reported-by: Abaci Robot Acked-by: Luis Chamberlain Cc: Kees Cook Cc: Iurii Zaikin Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Andrii Nakryiko Cc: Martin KaFai Lau Cc: Song Liu Cc: Yonghong Song Cc: John Fastabend Cc: KP Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/sysctl.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 69d925f1e5da..bb37739f5731 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1494,7 +1494,6 @@ int proc_do_large_bitmap(struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { int err = 0; - bool first = 1; size_t left = *lenp; unsigned long bitmap_len = table->maxlen; unsigned long *bitmap = *(unsigned long **) table->data; @@ -1579,12 +1578,12 @@ int proc_do_large_bitmap(struct ctl_table *table, int write, } bitmap_set(tmp_bitmap, val_a, val_b - val_a + 1); - first = 0; proc_skip_char(&p, &left, '\n'); } left += skipped; } else { unsigned long bit_a, bit_b = 0; + bool first = 1; while (left) { bit_a = find_next_bit(bitmap, bitmap_len, bit_b); From 070c46505a265d54eba7f713760fa6ed984f2921 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:54:56 -0700 Subject: [PATCH 143/190] drm: include only needed headers in ascii85.h The ascii85.h is user of exactly two headers, i.e. math.h and types.h. There is no need to carry on entire kernel.h. 
Link: https://lkml.kernel.org/r/20210611185915.44181-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Reviewed-by: Jani Nikula Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/ascii85.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/ascii85.h b/include/linux/ascii85.h index 4cc40201273e..83ad775ad0aa 100644 --- a/include/linux/ascii85.h +++ b/include/linux/ascii85.h @@ -8,7 +8,8 @@ #ifndef _ASCII85_H_ #define _ASCII85_H_ -#include +#include +#include #define ASCII85_BUFSZ 6 From f39650de687e35766572ac89dbcd16a5911e2f0a Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:54:59 -0700 Subject: [PATCH 144/190] kernel.h: split out panic and oops helpers kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out panic and oops helpers. There are several purposes of doing this: - dropping dependency in bug.h - dropping a loop by moving out panic_notifier.h - unload kernel.h from something which has its own domain At the same time convert users tree-wide to use new headers, although for the time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [akpm@linux-foundation.org: thread_info.h needs limits.h] [andriy.shevchenko@linux.intel.com: ia64 fix] Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Reviewed-by: Bjorn Andersson Co-developed-by: Andrew Morton Acked-by: Mike Rapoport Acked-by: Corey Minyard Acked-by: Christian Brauner Acked-by: Arnd Bergmann Acked-by: Kees Cook Acked-by: Wei Liu Acked-by: Rasmus Villemoes Signed-off-by: Andrew Morton Acked-by: Sebastian Reichel Acked-by: Luis Chamberlain Acked-by: Stephen Boyd Acked-by: Thomas Bogendoerfer Acked-by: Helge Deller # parisc Signed-off-by: Linus Torvalds --- arch/alpha/kernel/setup.c | 2 +- arch/arm64/kernel/setup.c | 1 + arch/ia64/include/asm/pal.h | 1 + arch/mips/kernel/relocate.c | 1 + arch/mips/sgi-ip22/ip22-reset.c | 1 + arch/mips/sgi-ip32/ip32-reset.c | 1 + arch/parisc/kernel/pdc_chassis.c | 1 + arch/powerpc/kernel/setup-common.c | 1 + arch/s390/kernel/ipl.c | 1 + arch/sparc/kernel/sstate.c | 1 + arch/um/drivers/mconsole_kern.c | 1 + arch/um/kernel/um_arch.c | 1 + arch/x86/include/asm/desc.h | 1 + arch/x86/kernel/cpu/mshyperv.c | 1 + arch/x86/kernel/setup.c | 1 + arch/x86/purgatory/purgatory.c | 2 + arch/x86/xen/enlighten.c | 1 + arch/xtensa/platforms/iss/setup.c | 1 + drivers/bus/brcmstb_gisb.c | 1 + drivers/char/ipmi/ipmi_msghandler.c | 1 + drivers/clk/analogbits/wrpll-cln28hpc.c | 4 + drivers/edac/altera_edac.c | 1 + drivers/firmware/google/gsmi.c | 1 + drivers/hv/vmbus_drv.c | 1 + .../hwtracing/coresight/coresight-cpu-debug.c | 1 + drivers/leds/trigger/ledtrig-activity.c | 1 + drivers/leds/trigger/ledtrig-heartbeat.c | 1 + drivers/leds/trigger/ledtrig-panic.c | 1 + drivers/misc/bcm-vk/bcm_vk_dev.c | 1 + drivers/misc/ibmasm/heartbeat.c | 1 + drivers/misc/pvpanic/pvpanic.c | 1 + drivers/net/ipa/ipa_smp2p.c | 1 + drivers/parisc/power.c | 1 + drivers/power/reset/ltc2952-poweroff.c | 1 + drivers/remoteproc/remoteproc_core.c | 1 + drivers/s390/char/con3215.c | 1 + drivers/s390/char/con3270.c | 1 + drivers/s390/char/sclp.c | 1 + drivers/s390/char/sclp_con.c | 1 + drivers/s390/char/sclp_vt220.c | 1 + drivers/s390/char/zcore.c | 1 + 
drivers/soc/bcm/brcmstb/pm/pm-arm.c | 1 + drivers/staging/olpc_dcon/olpc_dcon.c | 1 + drivers/video/fbdev/hyperv_fb.c | 1 + include/asm-generic/bug.h | 3 +- include/linux/kernel.h | 84 +--------------- include/linux/panic.h | 98 +++++++++++++++++++ include/linux/panic_notifier.h | 12 +++ include/linux/thread_info.h | 1 + kernel/hung_task.c | 1 + kernel/kexec_core.c | 1 + kernel/panic.c | 1 + kernel/rcu/tree.c | 2 + kernel/sysctl.c | 1 + kernel/trace/trace.c | 1 + 55 files changed, 169 insertions(+), 85 deletions(-) create mode 100644 include/linux/panic.h create mode 100644 include/linux/panic_notifier.h diff --git a/arch/alpha/kernel/setup.c b/arch/alpha/kernel/setup.c index 5f6858e9dc28..7d56c217b235 100644 --- a/arch/alpha/kernel/setup.c +++ b/arch/alpha/kernel/setup.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include @@ -46,7 +47,6 @@ #include #include -extern struct atomic_notifier_head panic_notifier_list; static int alpha_panic_event(struct notifier_block *, unsigned long, void *); static struct notifier_block alpha_panic_block = { alpha_panic_event, diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index 61845c0821d9..787bc0f601b3 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/ia64/include/asm/pal.h b/arch/ia64/include/asm/pal.h index 5c51fceedaf9..e6b652f9e45e 100644 --- a/arch/ia64/include/asm/pal.h +++ b/arch/ia64/include/asm/pal.h @@ -99,6 +99,7 @@ #include #include +#include /* * Data types needed to pass information into PAL procedures and diff --git a/arch/mips/kernel/relocate.c b/arch/mips/kernel/relocate.c index 499a5357c09f..56b51de2dc51 100644 --- a/arch/mips/kernel/relocate.c +++ b/arch/mips/kernel/relocate.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/mips/sgi-ip22/ip22-reset.c b/arch/mips/sgi-ip22/ip22-reset.c index c374f3ceec38..9028dbbb45dd 100644 --- a/arch/mips/sgi-ip22/ip22-reset.c +++ b/arch/mips/sgi-ip22/ip22-reset.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include diff --git a/arch/mips/sgi-ip32/ip32-reset.c b/arch/mips/sgi-ip32/ip32-reset.c index 20d8637340be..18d1c115cd53 100644 --- a/arch/mips/sgi-ip32/ip32-reset.c +++ b/arch/mips/sgi-ip32/ip32-reset.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/parisc/kernel/pdc_chassis.c b/arch/parisc/kernel/pdc_chassis.c index 75ae88d13909..da154406d368 100644 --- a/arch/parisc/kernel/pdc_chassis.c +++ b/arch/parisc/kernel/pdc_chassis.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 74a98fff2c2f..046fe21b5c3b 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -9,6 +9,7 @@ #undef DEBUG #include +#include #include #include #include diff --git a/arch/s390/kernel/ipl.c b/arch/s390/kernel/ipl.c index dba04fbc37a2..36f870dc944f 100644 --- a/arch/s390/kernel/ipl.c +++ b/arch/s390/kernel/ipl.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/sparc/kernel/sstate.c b/arch/sparc/kernel/sstate.c index ac8677c3841e..3bcc4ddc6911 100644 --- a/arch/sparc/kernel/sstate.c +++ b/arch/sparc/kernel/sstate.c @@ -6,6 +6,7 @@ #include #include +#include #include #include diff --git a/arch/um/drivers/mconsole_kern.c 
b/arch/um/drivers/mconsole_kern.c index 6d00af25ec6b..328b16f99b30 100644 --- a/arch/um/drivers/mconsole_kern.c +++ b/arch/um/drivers/mconsole_kern.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c index 74e07e748a9b..9512253947d5 100644 --- a/arch/um/kernel/um_arch.c +++ b/arch/um/kernel/um_arch.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h index 476082a83d1c..ceb12683b6d1 100644 --- a/arch/x86/include/asm/desc.h +++ b/arch/x86/include/asm/desc.h @@ -9,6 +9,7 @@ #include #include +#include #include #include diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c index 22f13343b5da..9e5c6f2b044d 100644 --- a/arch/x86/kernel/cpu/mshyperv.c +++ b/arch/x86/kernel/cpu/mshyperv.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 1e720626069a..d21ba6fa39d9 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c index f03b64d9cb51..7558139920f8 100644 --- a/arch/x86/purgatory/purgatory.c +++ b/arch/x86/purgatory/purgatory.c @@ -9,6 +9,8 @@ */ #include +#include +#include #include #include diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c index aa9f50fccc5d..c79bd0af2e8c 100644 --- a/arch/x86/xen/enlighten.c +++ b/arch/x86/xen/enlighten.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include diff --git a/arch/xtensa/platforms/iss/setup.c b/arch/xtensa/platforms/iss/setup.c index ed519aee0ec8..d3433e1bb94e 100644 --- a/arch/xtensa/platforms/iss/setup.c +++ b/arch/xtensa/platforms/iss/setup.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include diff --git a/drivers/bus/brcmstb_gisb.c b/drivers/bus/brcmstb_gisb.c index 7355fa2cb439..6551286a60cc 100644 --- a/drivers/bus/brcmstb_gisb.c +++ b/drivers/bus/brcmstb_gisb.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/char/ipmi/ipmi_msghandler.c b/drivers/char/ipmi/ipmi_msghandler.c index 8a0e97b33cae..e96cb5c4f97a 100644 --- a/drivers/char/ipmi/ipmi_msghandler.c +++ b/drivers/char/ipmi/ipmi_msghandler.c @@ -16,6 +16,7 @@ #include #include +#include #include #include #include diff --git a/drivers/clk/analogbits/wrpll-cln28hpc.c b/drivers/clk/analogbits/wrpll-cln28hpc.c index 776ead319ae9..7c64ea52a8d5 100644 --- a/drivers/clk/analogbits/wrpll-cln28hpc.c +++ b/drivers/clk/analogbits/wrpll-cln28hpc.c @@ -23,8 +23,12 @@ #include #include +#include #include #include +#include +#include + #include /* MIN_INPUT_FREQ: minimum input clock frequency, in Hz (Fref_min) */ diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c index 5f7fd79ec82f..61c21bd880a4 100644 --- a/drivers/edac/altera_edac.c +++ b/drivers/edac/altera_edac.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/firmware/google/gsmi.c b/drivers/firmware/google/gsmi.c index bb6e77ee3898..adaa492c3d2d 100644 --- a/drivers/firmware/google/gsmi.c +++ b/drivers/firmware/google/gsmi.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index 
92cb3f7d21d9..57bbbaa4e8f7 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -25,6 +25,7 @@ #include #include +#include #include #include #include diff --git a/drivers/hwtracing/coresight/coresight-cpu-debug.c b/drivers/hwtracing/coresight/coresight-cpu-debug.c index 2dcf13de751f..9731d3a96073 100644 --- a/drivers/hwtracing/coresight/coresight-cpu-debug.c +++ b/drivers/hwtracing/coresight/coresight-cpu-debug.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/leds/trigger/ledtrig-activity.c b/drivers/leds/trigger/ledtrig-activity.c index 14ba7faaed9e..30bc9df03636 100644 --- a/drivers/leds/trigger/ledtrig-activity.c +++ b/drivers/leds/trigger/ledtrig-activity.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/leds/trigger/ledtrig-heartbeat.c b/drivers/leds/trigger/ledtrig-heartbeat.c index 36b6709afe9f..7fe0a05574d2 100644 --- a/drivers/leds/trigger/ledtrig-heartbeat.c +++ b/drivers/leds/trigger/ledtrig-heartbeat.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/leds/trigger/ledtrig-panic.c b/drivers/leds/trigger/ledtrig-panic.c index 5751cd032f9d..64abf2e91608 100644 --- a/drivers/leds/trigger/ledtrig-panic.c +++ b/drivers/leds/trigger/ledtrig-panic.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include "../leds.h" diff --git a/drivers/misc/bcm-vk/bcm_vk_dev.c b/drivers/misc/bcm-vk/bcm_vk_dev.c index 6bfea3210389..ad639ee85b2a 100644 --- a/drivers/misc/bcm-vk/bcm_vk_dev.c +++ b/drivers/misc/bcm-vk/bcm_vk_dev.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/misc/ibmasm/heartbeat.c b/drivers/misc/ibmasm/heartbeat.c index 4f5f3bdc814d..59c9a0d95659 100644 --- a/drivers/misc/ibmasm/heartbeat.c +++ b/drivers/misc/ibmasm/heartbeat.c @@ -9,6 +9,7 @@ */ #include +#include #include "ibmasm.h" #include "dot_command.h" #include "lowlevel.h" diff --git a/drivers/misc/pvpanic/pvpanic.c b/drivers/misc/pvpanic/pvpanic.c index 65f70a4da8c0..793ea0c01193 100644 --- a/drivers/misc/pvpanic/pvpanic.c +++ b/drivers/misc/pvpanic/pvpanic.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c index a5f7a79a1923..34b68dc43886 100644 --- a/drivers/net/ipa/ipa_smp2p.c +++ b/drivers/net/ipa/ipa_smp2p.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c index ebaf6867b457..456776bd8ee6 100644 --- a/drivers/parisc/power.c +++ b/drivers/parisc/power.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/power/reset/ltc2952-poweroff.c b/drivers/power/reset/ltc2952-poweroff.c index d1495af30081..8688c8ba8894 100644 --- a/drivers/power/reset/ltc2952-poweroff.c +++ b/drivers/power/reset/ltc2952-poweroff.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c index 626a6b90fba2..76dd8e2b1e7e 100644 --- a/drivers/remoteproc/remoteproc_core.c +++ b/drivers/remoteproc/remoteproc_core.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/s390/char/con3215.c b/drivers/s390/char/con3215.c index 1fd5bca9fa20..02523f4e29f4 100644 --- a/drivers/s390/char/con3215.c +++ 
b/drivers/s390/char/con3215.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include /* ASYNC_* flags */ #include diff --git a/drivers/s390/char/con3270.c b/drivers/s390/char/con3270.c index e21962c0fd94..87cdbace1453 100644 --- a/drivers/s390/char/con3270.c +++ b/drivers/s390/char/con3270.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/s390/char/sclp.c b/drivers/s390/char/sclp.c index 986bbbc23d0a..6627820a5eb9 100644 --- a/drivers/s390/char/sclp.c +++ b/drivers/s390/char/sclp.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/s390/char/sclp_con.c b/drivers/s390/char/sclp_con.c index 9b852a47ccc1..cc01a7b8595d 100644 --- a/drivers/s390/char/sclp_con.c +++ b/drivers/s390/char/sclp_con.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/s390/char/sclp_vt220.c b/drivers/s390/char/sclp_vt220.c index 7f4445b0f819..5b8a7b090a97 100644 --- a/drivers/s390/char/sclp_vt220.c +++ b/drivers/s390/char/sclp_vt220.c @@ -9,6 +9,7 @@ #include #include +#include #include #include #include diff --git a/drivers/s390/char/zcore.c b/drivers/s390/char/zcore.c index bd3c724bf695..b5b0848da93b 100644 --- a/drivers/s390/char/zcore.c +++ b/drivers/s390/char/zcore.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include diff --git a/drivers/soc/bcm/brcmstb/pm/pm-arm.c b/drivers/soc/bcm/brcmstb/pm/pm-arm.c index a673fdffe216..3cbb165d6e30 100644 --- a/drivers/soc/bcm/brcmstb/pm/pm-arm.c +++ b/drivers/soc/bcm/brcmstb/pm/pm-arm.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/staging/olpc_dcon/olpc_dcon.c b/drivers/staging/olpc_dcon/olpc_dcon.c index 6d8e9a481786..7284cb4ac395 100644 --- a/drivers/staging/olpc_dcon/olpc_dcon.c +++ b/drivers/staging/olpc_dcon/olpc_dcon.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include diff --git a/drivers/video/fbdev/hyperv_fb.c b/drivers/video/fbdev/hyperv_fb.c index a7e6eea2c4a1..23999df52739 100644 --- a/drivers/video/fbdev/hyperv_fb.c +++ b/drivers/video/fbdev/hyperv_fb.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h index b402494883b6..f152b9bb916f 100644 --- a/include/asm-generic/bug.h +++ b/include/asm-generic/bug.h @@ -17,7 +17,8 @@ #endif #ifndef __ASSEMBLY__ -#include +#include +#include #ifdef CONFIG_BUG diff --git a/include/linux/kernel.h b/include/linux/kernel.h index bf950621febf..baea2eb763d0 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -72,7 +73,6 @@ #define lower_32_bits(n) ((u32)((n) & 0xffffffff)) struct completion; -struct pt_regs; struct user; #ifdef CONFIG_PREEMPT_VOLUNTARY @@ -177,14 +177,6 @@ void __might_fault(const char *file, int line); static inline void might_fault(void) { } #endif -extern struct atomic_notifier_head panic_notifier_list; -extern long (*panic_blink)(int state); -__printf(1, 2) -void panic(const char *fmt, ...) 
__noreturn __cold; -void nmi_panic(struct pt_regs *regs, const char *msg); -extern void oops_enter(void); -extern void oops_exit(void); -extern bool oops_may_print(void); void do_exit(long error_code) __noreturn; void complete_and_exit(struct completion *, long) __noreturn; @@ -372,52 +364,8 @@ extern int __kernel_text_address(unsigned long addr); extern int kernel_text_address(unsigned long addr); extern int func_ptr_is_kernel_text(void *ptr); -#ifdef CONFIG_SMP -extern unsigned int sysctl_oops_all_cpu_backtrace; -#else -#define sysctl_oops_all_cpu_backtrace 0 -#endif /* CONFIG_SMP */ - extern void bust_spinlocks(int yes); -extern int panic_timeout; -extern unsigned long panic_print; -extern int panic_on_oops; -extern int panic_on_unrecovered_nmi; -extern int panic_on_io_nmi; -extern int panic_on_warn; -extern unsigned long panic_on_taint; -extern bool panic_on_taint_nousertaint; -extern int sysctl_panic_on_rcu_stall; -extern int sysctl_max_rcu_stall_to_panic; -extern int sysctl_panic_on_stackoverflow; -extern bool crash_kexec_post_notifiers; - -/* - * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It - * holds a CPU number which is executing panic() currently. A value of - * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec(). - */ -extern atomic_t panic_cpu; -#define PANIC_CPU_INVALID -1 - -/* - * Only to be used by arch init code. If the user over-wrote the default - * CONFIG_PANIC_TIMEOUT, honor it. - */ -static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout) -{ - if (panic_timeout == arch_default_timeout) - panic_timeout = timeout; -} -extern const char *print_tainted(void); -enum lockdep_ok { - LOCKDEP_STILL_OK, - LOCKDEP_NOW_UNRELIABLE -}; -extern void add_taint(unsigned flag, enum lockdep_ok); -extern int test_taint(unsigned flag); -extern unsigned long get_taint(void); extern int root_mountflags; extern bool early_boot_irqs_disabled; @@ -436,36 +384,6 @@ extern enum system_states { SYSTEM_SUSPEND, } system_state; -/* This cannot be an enum because some may be used in assembly source. */ -#define TAINT_PROPRIETARY_MODULE 0 -#define TAINT_FORCED_MODULE 1 -#define TAINT_CPU_OUT_OF_SPEC 2 -#define TAINT_FORCED_RMMOD 3 -#define TAINT_MACHINE_CHECK 4 -#define TAINT_BAD_PAGE 5 -#define TAINT_USER 6 -#define TAINT_DIE 7 -#define TAINT_OVERRIDDEN_ACPI_TABLE 8 -#define TAINT_WARN 9 -#define TAINT_CRAP 10 -#define TAINT_FIRMWARE_WORKAROUND 11 -#define TAINT_OOT_MODULE 12 -#define TAINT_UNSIGNED_MODULE 13 -#define TAINT_SOFTLOCKUP 14 -#define TAINT_LIVEPATCH 15 -#define TAINT_AUX 16 -#define TAINT_RANDSTRUCT 17 -#define TAINT_FLAGS_COUNT 18 -#define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1) - -struct taint_flag { - char c_true; /* character printed when tainted */ - char c_false; /* character printed when not tainted */ - bool module; /* also show as a per-module taint flag */ -}; - -extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT]; - extern const char hex_asc[]; #define hex_asc_lo(x) hex_asc[((x) & 0x0f)] #define hex_asc_hi(x) hex_asc[((x) & 0xf0) >> 4] diff --git a/include/linux/panic.h b/include/linux/panic.h new file mode 100644 index 000000000000..f5844908a089 --- /dev/null +++ b/include/linux/panic.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PANIC_H +#define _LINUX_PANIC_H + +#include +#include + +struct pt_regs; + +extern long (*panic_blink)(int state); +__printf(1, 2) +void panic(const char *fmt, ...) 
__noreturn __cold; +void nmi_panic(struct pt_regs *regs, const char *msg); +extern void oops_enter(void); +extern void oops_exit(void); +extern bool oops_may_print(void); + +#ifdef CONFIG_SMP +extern unsigned int sysctl_oops_all_cpu_backtrace; +#else +#define sysctl_oops_all_cpu_backtrace 0 +#endif /* CONFIG_SMP */ + +extern int panic_timeout; +extern unsigned long panic_print; +extern int panic_on_oops; +extern int panic_on_unrecovered_nmi; +extern int panic_on_io_nmi; +extern int panic_on_warn; + +extern unsigned long panic_on_taint; +extern bool panic_on_taint_nousertaint; + +extern int sysctl_panic_on_rcu_stall; +extern int sysctl_max_rcu_stall_to_panic; +extern int sysctl_panic_on_stackoverflow; + +extern bool crash_kexec_post_notifiers; + +/* + * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It + * holds a CPU number which is executing panic() currently. A value of + * PANIC_CPU_INVALID means no CPU has entered panic() or crash_kexec(). + */ +extern atomic_t panic_cpu; +#define PANIC_CPU_INVALID -1 + +/* + * Only to be used by arch init code. If the user over-wrote the default + * CONFIG_PANIC_TIMEOUT, honor it. + */ +static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout) +{ + if (panic_timeout == arch_default_timeout) + panic_timeout = timeout; +} + +/* This cannot be an enum because some may be used in assembly source. */ +#define TAINT_PROPRIETARY_MODULE 0 +#define TAINT_FORCED_MODULE 1 +#define TAINT_CPU_OUT_OF_SPEC 2 +#define TAINT_FORCED_RMMOD 3 +#define TAINT_MACHINE_CHECK 4 +#define TAINT_BAD_PAGE 5 +#define TAINT_USER 6 +#define TAINT_DIE 7 +#define TAINT_OVERRIDDEN_ACPI_TABLE 8 +#define TAINT_WARN 9 +#define TAINT_CRAP 10 +#define TAINT_FIRMWARE_WORKAROUND 11 +#define TAINT_OOT_MODULE 12 +#define TAINT_UNSIGNED_MODULE 13 +#define TAINT_SOFTLOCKUP 14 +#define TAINT_LIVEPATCH 15 +#define TAINT_AUX 16 +#define TAINT_RANDSTRUCT 17 +#define TAINT_FLAGS_COUNT 18 +#define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1) + +struct taint_flag { + char c_true; /* character printed when tainted */ + char c_false; /* character printed when not tainted */ + bool module; /* also show as a per-module taint flag */ +}; + +extern const struct taint_flag taint_flags[TAINT_FLAGS_COUNT]; + +enum lockdep_ok { + LOCKDEP_STILL_OK, + LOCKDEP_NOW_UNRELIABLE, +}; + +extern const char *print_tainted(void); +extern void add_taint(unsigned flag, enum lockdep_ok); +extern int test_taint(unsigned flag); +extern unsigned long get_taint(void); + +#endif /* _LINUX_PANIC_H */ diff --git a/include/linux/panic_notifier.h b/include/linux/panic_notifier.h new file mode 100644 index 000000000000..41e32483d7a7 --- /dev/null +++ b/include/linux/panic_notifier.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_PANIC_NOTIFIERS_H +#define _LINUX_PANIC_NOTIFIERS_H + +#include +#include + +extern struct atomic_notifier_head panic_notifier_list; + +extern bool crash_kexec_post_notifiers; + +#endif /* _LINUX_PANIC_NOTIFIERS_H */ diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h index 157762db9d4b..0999f6317978 100644 --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -9,6 +9,7 @@ #define _LINUX_THREAD_INFO_H #include +#include #include #include #include diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 396ebaebea3f..ebc641664eb0 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/kexec_core.c 
b/kernel/kexec_core.c index f099baee3578..4b34a9aa32bc 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/panic.c b/kernel/panic.c index 332736a72a58..edad89660a2b 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8e78b2430c16..f12056beb916 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -32,6 +32,8 @@ #include #include #include +#include +#include #include #include #include diff --git a/kernel/sysctl.c b/kernel/sysctl.c index bb37739f5731..70e286c83d6b 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #include #include diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index d23a09d3eb37..3c1384bc5c5a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -39,6 +39,7 @@ #include #include #include +#include #include #include #include From 92aeda50d4a96b7a30fc87960497d5e15b7428f7 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 30 Jun 2021 18:55:02 -0700 Subject: [PATCH 145/190] lib: decompress_bunzip2: remove an unneeded semicolon The semicolon immediately following '}' is unneeded. Link: https://lkml.kernel.org/r/20210508094926.2889-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/decompress_bunzip2.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/decompress_bunzip2.c b/lib/decompress_bunzip2.c index c72c865032fa..0d619cc129dd 100644 --- a/lib/decompress_bunzip2.c +++ b/lib/decompress_bunzip2.c @@ -385,7 +385,7 @@ static int INIT get_next_block(struct bunzip_data *bd) bd->inbufBits = (bd->inbufBits << 8)|bd->inbuf[bd->inbufPos++]; bd->inbufBitCount += 8; - }; + } bd->inbufBitCount -= hufGroup->maxLen; j = (bd->inbufBits >> bd->inbufBitCount)& ((1 << hufGroup->maxLen)-1); From 994b69703e86ed0ab2228fc606761a3b08d48af3 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:05 -0700 Subject: [PATCH 146/190] lib/string_helpers: switch to use BIT() macro Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3. Get rid of ugly *_escape_mem_ascii() API since it's not flexible and has the only single user. Provide better approach based on usage of the string_escape_mem() with appropriate flags. Test cases has been expanded accordingly to cover new functionality. This patch (of 15): Switch to use BIT() macro for flag definitions. No changes implied. Link: https://lkml.kernel.org/r/20210504180819.73127-1-andriy.shevchenko@linux.intel.com Link: https://lkml.kernel.org/r/20210504180819.73127-2-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: "J. 
Bruce Fields" Cc: Chuck Lever Cc: Alexander Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index fa06dcdc481e..bf01e24edd89 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -2,6 +2,7 @@ #ifndef _LINUX_STRING_HELPERS_H_ #define _LINUX_STRING_HELPERS_H_ +#include #include #include @@ -18,10 +19,10 @@ enum string_size_units { void string_get_size(u64 size, u64 blk_size, enum string_size_units units, char *buf, int len); -#define UNESCAPE_SPACE 0x01 -#define UNESCAPE_OCTAL 0x02 -#define UNESCAPE_HEX 0x04 -#define UNESCAPE_SPECIAL 0x08 +#define UNESCAPE_SPACE BIT(0) +#define UNESCAPE_OCTAL BIT(1) +#define UNESCAPE_HEX BIT(2) +#define UNESCAPE_SPECIAL BIT(3) #define UNESCAPE_ANY \ (UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL) @@ -42,15 +43,15 @@ static inline int string_unescape_any_inplace(char *buf) return string_unescape_any(buf, buf, 0); } -#define ESCAPE_SPACE 0x01 -#define ESCAPE_SPECIAL 0x02 -#define ESCAPE_NULL 0x04 -#define ESCAPE_OCTAL 0x08 +#define ESCAPE_SPACE BIT(0) +#define ESCAPE_SPECIAL BIT(1) +#define ESCAPE_NULL BIT(2) +#define ESCAPE_OCTAL BIT(3) #define ESCAPE_ANY \ (ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_SPECIAL | ESCAPE_NULL) -#define ESCAPE_NP 0x10 +#define ESCAPE_NP BIT(4) #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) -#define ESCAPE_HEX 0x20 +#define ESCAPE_HEX BIT(5) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); From 62519b882d7485bae4c0a7e1e0adb576610400a9 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:08 -0700 Subject: [PATCH 147/190] lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop Refactor code to have better readability by moving ESCAPE_NP handling inside 'else' branch in the loop. No functional change intended. Link: https://lkml.kernel.org/r/20210504180819.73127-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/string_helpers.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/lib/string_helpers.c b/lib/string_helpers.c index 7f2d5fbaf243..b10a18b4663b 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -452,10 +452,10 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * The process of escaping byte buffer includes several parts. They are applied * in the following sequence. * - * 1. The character is matched to the printable class, if asked, and in - * case of match it passes through to the output. - * 2. The character is not matched to the one from @only string and thus + * 1. The character is not matched to the one from @only string and thus * must go as-is to the output. + * 2. The character is matched to the printable class, if asked, and in + * case of match it passes through to the output. * 3. The character is checked if it falls into the class given by @flags. * %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any * character. 
Note that they actually can't go together, otherwise @@ -506,19 +506,22 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, /* * Apply rules in the following sequence: - * - the character is printable, when @flags has - * %ESCAPE_NP bit set * - the @only string is supplied and does not contain a * character under question + * - the character is printable, when @flags has + * %ESCAPE_NP bit set * - the character doesn't fall into a class of symbols * defined by given @flags * In these cases we just pass through a character to the * output buffer. */ - if ((flags & ESCAPE_NP && isprint(c)) || - (is_dict && !strchr(only, c))) { + if (is_dict && !strchr(only, c)) { /* do nothing */ } else { + if (isprint(c) && + flags & ESCAPE_NP && escape_passthrough(c, &p, end)) + continue; + if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) continue; From 7e5969aeb7f1e7d6f68d5501a6c040605272763e Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:11 -0700 Subject: [PATCH 148/190] lib/string_helpers: drop indentation level in string_escape_mem() The only one conditional is left on the upper level, move the rest to the same level and drop indentation level. No functional changes. Link: https://lkml.kernel.org/r/20210504180819.73127-4-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/string_helpers.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/lib/string_helpers.c b/lib/string_helpers.c index b10a18b4663b..e3ef9f86cc34 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -515,29 +515,29 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, * In these cases we just pass through a character to the * output buffer. */ - if (is_dict && !strchr(only, c)) { - /* do nothing */ - } else { - if (isprint(c) && - flags & ESCAPE_NP && escape_passthrough(c, &p, end)) - continue; + if (is_dict && !strchr(only, c) && + escape_passthrough(c, &p, end)) + continue; - if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) - continue; + if (isprint(c) && + flags & ESCAPE_NP && escape_passthrough(c, &p, end)) + continue; - if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end)) - continue; + if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) + continue; - if (flags & ESCAPE_NULL && escape_null(c, &p, end)) - continue; + if (flags & ESCAPE_SPECIAL && escape_special(c, &p, end)) + continue; - /* ESCAPE_OCTAL and ESCAPE_HEX always go last */ - if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end)) - continue; + if (flags & ESCAPE_NULL && escape_null(c, &p, end)) + continue; - if (flags & ESCAPE_HEX && escape_hex(c, &p, end)) - continue; - } + /* ESCAPE_OCTAL and ESCAPE_HEX always go last */ + if (flags & ESCAPE_OCTAL && escape_octal(c, &p, end)) + continue; + + if (flags & ESCAPE_HEX && escape_hex(c, &p, end)) + continue; escape_passthrough(c, &p, end); } From a0809783355cfe1cc1b2fa7f881c3a79df0b2a27 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:14 -0700 Subject: [PATCH 149/190] lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII Some users may want to have an ASCII based filter, provided by isascii() function. Here is the addition of a such. Link: https://lkml.kernel.org/r/20210504180819.73127-5-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. 
Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 1 + lib/string_helpers.c | 21 +++++++++++++++++---- 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index bf01e24edd89..d6cf6fe10f74 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -52,6 +52,7 @@ static inline int string_unescape_any_inplace(char *buf) #define ESCAPE_NP BIT(4) #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) #define ESCAPE_HEX BIT(5) +#define ESCAPE_NA BIT(6) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); diff --git a/lib/string_helpers.c b/lib/string_helpers.c index e3ef9f86cc34..a963404b8c16 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -454,8 +454,8 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * * 1. The character is not matched to the one from @only string and thus * must go as-is to the output. - * 2. The character is matched to the printable class, if asked, and in - * case of match it passes through to the output. + * 2. The character is matched to the printable or ASCII class, if asked, + * and in case of match it passes through to the output. * 3. The character is checked if it falls into the class given by @flags. * %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any * character. Note that they actually can't go together, otherwise @@ -463,7 +463,7 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * * Caller must provide valid source and destination pointers. Be aware that * destination buffer will not be NULL-terminated, thus caller have to append - * it if needs. The supported flags are:: + * it if needs. The supported flags are:: * * %ESCAPE_SPACE: (special white space, not space itself) * '\f' - form feed @@ -482,11 +482,18 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * %ESCAPE_ANY: * all previous together * %ESCAPE_NP: - * escape only non-printable characters (checked by isprint) + * escape only non-printable characters, checked by isprint() * %ESCAPE_ANY_NP: * all previous together * %ESCAPE_HEX: * '\xHH' - byte with hexadecimal value HH (2 digits) + * %ESCAPE_NA: + * escape only non-ascii characters, checked by isascii() + * + * One notable caveat, the %ESCAPE_NP and %ESCAPE_NA have higher priority + * than the rest of the flags (%ESCAPE_NP is higher than %ESCAPE_NA). + * It doesn't make much sense to use either of them without %ESCAPE_OCTAL + * or %ESCAPE_HEX, because they cover most of the other character classes. 
* * Return: * The total size of the escaped output that would be generated for @@ -510,6 +517,8 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, * character under question * - the character is printable, when @flags has * %ESCAPE_NP bit set + * - the character is ASCII, when @flags has + * %ESCAPE_NA bit set * - the character doesn't fall into a class of symbols * defined by given @flags * In these cases we just pass through a character to the @@ -523,6 +532,10 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, flags & ESCAPE_NP && escape_passthrough(c, &p, end)) continue; + if (isascii(c) && + flags & ESCAPE_NA && escape_passthrough(c, &p, end)) + continue; + if (flags & ESCAPE_SPACE && escape_space(c, &p, end)) continue; From 0362c27fb373ea04eace9e7a70e61036ab81f09f Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:17 -0700 Subject: [PATCH 150/190] lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable Some users may want to have an ASCII based filter for printable only characters, provided by conjunction of isascii() and isprint() functions. Here is the addition of a such. Link: https://lkml.kernel.org/r/20210504180819.73127-6-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 1 + lib/string_helpers.c | 20 ++++++++++++++++---- 2 files changed, 17 insertions(+), 4 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index d6cf6fe10f74..811c6a627620 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -53,6 +53,7 @@ static inline int string_unescape_any_inplace(char *buf) #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) #define ESCAPE_HEX BIT(5) #define ESCAPE_NA BIT(6) +#define ESCAPE_NAP BIT(7) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); diff --git a/lib/string_helpers.c b/lib/string_helpers.c index a963404b8c16..ceca5fbbd92c 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -454,9 +454,11 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * * 1. The character is not matched to the one from @only string and thus * must go as-is to the output. - * 2. The character is matched to the printable or ASCII class, if asked, + * 2. The character is matched to the printable and ASCII classes, if asked, * and in case of match it passes through to the output. - * 3. The character is checked if it falls into the class given by @flags. + * 3. The character is matched to the printable or ASCII class, if asked, + * and in case of match it passes through to the output. + * 4. The character is checked if it falls into the class given by @flags. * %ESCAPE_OCTAL and %ESCAPE_HEX are going last since they cover any * character. Note that they actually can't go together, otherwise * %ESCAPE_HEX will be ignored. @@ -489,11 +491,15 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * '\xHH' - byte with hexadecimal value HH (2 digits) * %ESCAPE_NA: * escape only non-ascii characters, checked by isascii() + * %ESCAPE_NAP: + * escape only non-printable or non-ascii characters * - * One notable caveat, the %ESCAPE_NP and %ESCAPE_NA have higher priority - * than the rest of the flags (%ESCAPE_NP is higher than %ESCAPE_NA). 
+ * One notable caveat, the %ESCAPE_NAP, %ESCAPE_NP and %ESCAPE_NA have the + * higher priority than the rest of the flags (%ESCAPE_NAP is the highest). * It doesn't make much sense to use either of them without %ESCAPE_OCTAL * or %ESCAPE_HEX, because they cover most of the other character classes. + * %ESCAPE_NAP can utilize %ESCAPE_SPACE or %ESCAPE_SPECIAL in addition to + * the above. * * Return: * The total size of the escaped output that would be generated for @@ -515,6 +521,8 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, * Apply rules in the following sequence: * - the @only string is supplied and does not contain a * character under question + * - the character is printable and ASCII, when @flags has + * %ESCAPE_NAP bit set * - the character is printable, when @flags has * %ESCAPE_NP bit set * - the character is ASCII, when @flags has @@ -528,6 +536,10 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, escape_passthrough(c, &p, end)) continue; + if (isascii(c) && isprint(c) && + flags & ESCAPE_NAP && escape_passthrough(c, &p, end)) + continue; + if (isprint(c) && flags & ESCAPE_NP && escape_passthrough(c, &p, end)) continue; From aec0d0966f20d131cc4ff6927b02d448a478a6d4 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:20 -0700 Subject: [PATCH 151/190] lib/string_helpers: allow to append additional characters to be escaped Introduce a new flag to append additional characters, passed in 'only' parameter, to be escaped if they fall in the corresponding class. Link: https://lkml.kernel.org/r/20210504180819.73127-7-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 1 + lib/string_helpers.c | 19 +++++++++++++++---- 2 files changed, 16 insertions(+), 4 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index 811c6a627620..f8728ed4d563 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -54,6 +54,7 @@ static inline int string_unescape_any_inplace(char *buf) #define ESCAPE_HEX BIT(5) #define ESCAPE_NA BIT(6) #define ESCAPE_NAP BIT(7) +#define ESCAPE_APPEND BIT(8) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); diff --git a/lib/string_helpers.c b/lib/string_helpers.c index ceca5fbbd92c..c15aea7a82e9 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -493,6 +493,11 @@ static bool escape_hex(unsigned char c, char **dst, char *end) * escape only non-ascii characters, checked by isascii() * %ESCAPE_NAP: * escape only non-printable or non-ascii characters + * %ESCAPE_APPEND: + * append characters from @only to be escaped by the given classes + * + * %ESCAPE_APPEND would help to pass additional characters to the escaped, when + * one of %ESCAPE_NP, %ESCAPE_NA, or %ESCAPE_NAP is provided. * * One notable caveat, the %ESCAPE_NAP, %ESCAPE_NP and %ESCAPE_NA have the * higher priority than the rest of the flags (%ESCAPE_NAP is the highest). 
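For illustration, a minimal sketch of what %ESCAPE_APPEND adds (not part of the patch); the wrapper name and the ','/';' dictionary are made up:

#include <linux/string_helpers.h>

/*
 * Minimal sketch, not part of the patch: with ESCAPE_APPEND the bytes listed
 * in @only are escaped in addition to the class picked by the other flags.
 * Here printable characters pass through because of ESCAPE_NP, except ','
 * and ';' which get octal-escaped, as does anything non-printable.
 */
static int example_escape_separators(const char *src, size_t len,
				     char *dst, size_t dst_sz)
{
	return string_escape_mem(src, len, dst, dst_sz,
				 ESCAPE_OCTAL | ESCAPE_NP | ESCAPE_APPEND,
				 ",;");
}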
@@ -513,9 +518,11 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, char *p = dst; char *end = p + osz; bool is_dict = only && *only; + bool is_append = flags & ESCAPE_APPEND; while (isz--) { unsigned char c = *src++; + bool in_dict = is_dict && strchr(only, c); /* * Apply rules in the following sequence: @@ -531,20 +538,24 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, * defined by given @flags * In these cases we just pass through a character to the * output buffer. + * + * When %ESCAPE_APPEND is passed, the characters from @only + * have been excluded from the %ESCAPE_NAP, %ESCAPE_NP, and + * %ESCAPE_NA cases. */ - if (is_dict && !strchr(only, c) && + if (!(is_append || in_dict) && is_dict && escape_passthrough(c, &p, end)) continue; - if (isascii(c) && isprint(c) && + if (!(is_append && in_dict) && isascii(c) && isprint(c) && flags & ESCAPE_NAP && escape_passthrough(c, &p, end)) continue; - if (isprint(c) && + if (!(is_append && in_dict) && isprint(c) && flags & ESCAPE_NP && escape_passthrough(c, &p, end)) continue; - if (isascii(c) && + if (!(is_append && in_dict) && isascii(c) && flags & ESCAPE_NA && escape_passthrough(c, &p, end)) continue; From 229563b196ed3ce36036a18b6bdfe4cce9dcbbd4 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:23 -0700 Subject: [PATCH 152/190] lib/test-string_helpers: print flags in hexadecimal format Since flags are bitmapped, it's better to print them in hexadecimal format. Link: https://lkml.kernel.org/r/20210504180819.73127-8-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/test-string_helpers.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c index 10360d4ea273..079fb12d59cc 100644 --- a/lib/test-string_helpers.c +++ b/lib/test-string_helpers.c @@ -19,7 +19,7 @@ static __init bool test_string_check_buf(const char *name, unsigned int flags, if (q_real == q_test && !memcmp(out_test, out_real, q_test)) return true; - pr_warn("Test '%s' failed: flags = %u\n", name, flags); + pr_warn("Test '%s' failed: flags = %#x\n", name, flags); print_hex_dump(KERN_WARNING, "Input: ", DUMP_PREFIX_NONE, 16, 1, in, p, true); @@ -290,7 +290,7 @@ test_string_escape_overflow(const char *in, int p, unsigned int flags, const cha q_real = string_escape_mem(in, p, NULL, 0, flags, esc); if (q_real != q_test) - pr_warn("Test '%s' failed: flags = %u, osz = 0, expected %d, got %d\n", + pr_warn("Test '%s' failed: flags = %#x, osz = 0, expected %d, got %d\n", name, flags, q_test, q_real); } From 69325698df55c609da96ebbd592e59d88c4d335d Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:26 -0700 Subject: [PATCH 153/190] lib/test-string_helpers: get rid of trailing comma in terminators Terminators by definition shouldn't accept anything behind. Make them robust by removing trailing commas. Link: https://lkml.kernel.org/r/20210504180819.73127-9-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. 
Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/test-string_helpers.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c index 079fb12d59cc..3e2def9ccfac 100644 --- a/lib/test-string_helpers.c +++ b/lib/test-string_helpers.c @@ -136,7 +136,7 @@ static const struct test_string_2 escape0[] __initconst = {{ .flags = ESCAPE_SPACE | ESCAPE_HEX, },{ /* terminator */ - }}, + }} },{ .in = "\\h\\\"\a\e\\", .s1 = {{ @@ -150,7 +150,7 @@ static const struct test_string_2 escape0[] __initconst = {{ .flags = ESCAPE_SPECIAL | ESCAPE_HEX, },{ /* terminator */ - }}, + }} },{ .in = "\eb \\C\007\"\x90\r]", .s1 = {{ @@ -201,7 +201,7 @@ static const struct test_string_2 escape0[] __initconst = {{ .flags = ESCAPE_NP | ESCAPE_HEX, },{ /* terminator */ - }}, + }} },{ /* terminator */ }}; @@ -217,7 +217,7 @@ static const struct test_string_2 escape1[] __initconst = {{ .flags = ESCAPE_HEX, },{ /* terminator */ - }}, + }} },{ .in = "\\h\\\"\a\e\\", .s1 = {{ @@ -225,7 +225,7 @@ static const struct test_string_2 escape1[] __initconst = {{ .flags = ESCAPE_OCTAL, },{ /* terminator */ - }}, + }} },{ .in = "\eb \\C\007\"\x90\r]", .s1 = {{ @@ -233,7 +233,7 @@ static const struct test_string_2 escape1[] __initconst = {{ .flags = ESCAPE_OCTAL, },{ /* terminator */ - }}, + }} },{ /* terminator */ }}; From 259fa5d7d825122c30ad4122c6a1cc937eb74c2d Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:29 -0700 Subject: [PATCH 154/190] lib/test-string_helpers: add test cases for new features We have got new flags and hence new features of string_escape_mem(). Add test cases for that. Link: https://lkml.kernel.org/r/20210504180819.73127-10-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. 
Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/string_helpers.h | 4 + lib/test-string_helpers.c | 141 +++++++++++++++++++++++++++++++-- 2 files changed, 137 insertions(+), 8 deletions(-) diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index f8728ed4d563..9b0eca2badf2 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -26,6 +26,8 @@ void string_get_size(u64 size, u64 blk_size, enum string_size_units units, #define UNESCAPE_ANY \ (UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL) +#define UNESCAPE_ALL_MASK GENMASK(3, 0) + int string_unescape(char *src, char *dst, size_t size, unsigned int flags); static inline int string_unescape_inplace(char *buf, unsigned int flags) @@ -56,6 +58,8 @@ static inline int string_unescape_any_inplace(char *buf) #define ESCAPE_NAP BIT(7) #define ESCAPE_APPEND BIT(8) +#define ESCAPE_ALL_MASK GENMASK(8, 0) + int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); diff --git a/lib/test-string_helpers.c b/lib/test-string_helpers.c index 3e2def9ccfac..2185d71704f0 100644 --- a/lib/test-string_helpers.c +++ b/lib/test-string_helpers.c @@ -202,11 +202,25 @@ static const struct test_string_2 escape0[] __initconst = {{ },{ /* terminator */ }} +},{ + .in = "\007 \eb\"\x90\xCF\r", + .s1 = {{ + .out = "\007 \eb\"\\220\\317\r", + .flags = ESCAPE_OCTAL | ESCAPE_NA, + },{ + .out = "\007 \eb\"\\x90\\xcf\r", + .flags = ESCAPE_HEX | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_NA, + },{ + /* terminator */ + }} },{ /* terminator */ }}; -#define TEST_STRING_2_DICT_1 "b\\ \t\r" +#define TEST_STRING_2_DICT_1 "b\\ \t\r\xCF" static const struct test_string_2 escape1[] __initconst = {{ .in = "\f\\ \n\r\t\v", .s1 = {{ @@ -215,14 +229,38 @@ static const struct test_string_2 escape1[] __initconst = {{ },{ .out = "\f\\x5c\\x20\n\\x0d\\x09\v", .flags = ESCAPE_HEX, + },{ + .out = "\f\\134\\040\n\\015\\011\v", + .flags = ESCAPE_ANY | ESCAPE_APPEND, + },{ + .out = "\\014\\134\\040\\012\\015\\011\\013", + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP, + },{ + .out = "\\x0c\\x5c\\x20\\x0a\\x0d\\x09\\x0b", + .flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NAP, + },{ + .out = "\f\\134\\040\n\\015\\011\v", + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA, + },{ + .out = "\f\\x5c\\x20\n\\x0d\\x09\v", + .flags = ESCAPE_HEX | ESCAPE_APPEND | ESCAPE_NA, },{ /* terminator */ }} },{ - .in = "\\h\\\"\a\e\\", + .in = "\\h\\\"\a\xCF\e\\", .s1 = {{ - .out = "\\134h\\134\"\a\e\\134", + .out = "\\134h\\134\"\a\\317\e\\134", .flags = ESCAPE_OCTAL, + },{ + .out = "\\134h\\134\"\a\\317\e\\134", + .flags = ESCAPE_ANY | ESCAPE_APPEND, + },{ + .out = "\\134h\\134\"\\007\\317\\033\\134", + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NAP, + },{ + .out = "\\134h\\134\"\a\\317\e\\134", + .flags = ESCAPE_OCTAL | ESCAPE_APPEND | ESCAPE_NA, },{ /* terminator */ }} @@ -234,6 +272,88 @@ static const struct test_string_2 escape1[] __initconst = {{ },{ /* terminator */ }} +},{ + .in = "\007 \eb\"\x90\xCF\r", + .s1 = {{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_SPACE | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_SPECIAL | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\317\r", + .flags = ESCAPE_OCTAL | ESCAPE_NA, + },{ + .out = "\007 
\eb\"\x90\\317\r", + .flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\317\r", + .flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\317\r", + .flags = ESCAPE_ANY | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\xcf\r", + .flags = ESCAPE_HEX | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\xcf\r", + .flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\xcf\r", + .flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA, + },{ + .out = "\007 \eb\"\x90\\xcf\r", + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NA, + },{ + /* terminator */ + }} +},{ + .in = "\007 \eb\"\x90\xCF\r", + .s1 = {{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\xCF\\r", + .flags = ESCAPE_SPACE | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\xCF\r", + .flags = ESCAPE_SPECIAL | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\xCF\\r", + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\317\\015", + .flags = ESCAPE_OCTAL | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\317\\r", + .flags = ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\317\\015", + .flags = ESCAPE_SPECIAL | ESCAPE_OCTAL | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\317\r", + .flags = ESCAPE_ANY | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\xcf\\x0d", + .flags = ESCAPE_HEX | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\xcf\\r", + .flags = ESCAPE_SPACE | ESCAPE_HEX | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\xcf\\x0d", + .flags = ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP, + },{ + .out = "\007 \eb\"\x90\\xcf\\r", + .flags = ESCAPE_SPACE | ESCAPE_SPECIAL | ESCAPE_HEX | ESCAPE_NAP, + },{ + /* terminator */ + }} },{ /* terminator */ }}; @@ -315,8 +435,13 @@ static __init void test_string_escape(const char *name, /* NULL injection */ if (flags & ESCAPE_NULL) { in[p++] = '\0'; - out_test[q_test++] = '\\'; - out_test[q_test++] = '0'; + /* '\0' passes isascii() test */ + if (flags & ESCAPE_NA && !(flags & ESCAPE_APPEND && esc)) { + out_test[q_test++] = '\0'; + } else { + out_test[q_test++] = '\\'; + out_test[q_test++] = '0'; + } } /* Don't try strings that have no output */ @@ -459,17 +584,17 @@ static int __init test_string_helpers_init(void) unsigned int i; pr_info("Running tests...\n"); - for (i = 0; i < UNESCAPE_ANY + 1; i++) + for (i = 0; i < UNESCAPE_ALL_MASK + 1; i++) test_string_unescape("unescape", i, false); test_string_unescape("unescape inplace", get_random_int() % (UNESCAPE_ANY + 1), true); /* Without dictionary */ - for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++) + for (i = 0; i < ESCAPE_ALL_MASK + 1; i++) test_string_escape("escape 0", escape0, i, TEST_STRING_2_DICT_0); /* With dictionary */ - for (i = 0; i < (ESCAPE_ANY_NP | ESCAPE_HEX) + 1; i++) + for (i = 0; i < ESCAPE_ALL_MASK + 1; i++) test_string_escape("escape 1", escape1, i, TEST_STRING_2_DICT_1); /* Test string_get_size() */ From be613b4025fa3894f3985283d5f2929161fae300 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:32 -0700 Subject: [PATCH 155/190] MAINTAINERS: add myself as designated reviewer for generic string library Add myself as designated reviewer for generic string library. Link: https://lkml.kernel.org/r/20210504180819.73127-11-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. 
Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- MAINTAINERS | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 37f5a8ca413f..eae74c5a06b3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7648,6 +7648,14 @@ L: linux-input@vger.kernel.org S: Maintained F: drivers/input/touchscreen/resistive-adc-touch.c +GENERIC STRING LIBRARY +R: Andy Shevchenko +S: Maintained +F: lib/string.c +F: lib/string_helpers.c +F: lib/test_string.c +F: lib/test-string_helpers.c + GENERIC UIO DRIVER FOR PCI DEVICES M: "Michael S. Tsirkin" L: kvm@vger.kernel.org From 1d31aa172a4e6728918a06ee7f1d6bcb7507172c Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:34 -0700 Subject: [PATCH 156/190] seq_file: introduce seq_escape_mem() Introduce seq_escape_mem() to allow users to pass additional parameters to string_escape_mem(). Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com Suggested-by: Al Viro Signed-off-by: Andy Shevchenko Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/seq_file.c | 25 +++++++++++++++++++++++++ include/linux/seq_file.h | 2 ++ 2 files changed, 27 insertions(+) diff --git a/fs/seq_file.c b/fs/seq_file.c index 5059248f2d64..532cac2eae0f 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -355,6 +355,31 @@ int seq_release(struct inode *inode, struct file *file) } EXPORT_SYMBOL(seq_release); +/** + * seq_escape_mem - print data into buffer, escaping some characters + * @m: target buffer + * @src: source buffer + * @len: size of source buffer + * @flags: flags to pass to string_escape_mem() + * @esc: set of characters that need escaping + * + * Puts data into buffer, replacing each occurrence of character from + * given class (defined by @flags and @esc) with printable escaped sequence. + * + * Use seq_has_overflowed() to check for errors. + */ +void seq_escape_mem(struct seq_file *m, const char *src, size_t len, + unsigned int flags, const char *esc) +{ + char *buf; + size_t size = seq_get_buf(m, &buf); + int ret; + + ret = string_escape_mem(src, len, buf, size, flags, esc); + seq_commit(m, ret < size ? ret : -1); +} +EXPORT_SYMBOL(seq_escape_mem); + /** * seq_escape - print string into buffer, escaping some characters * @m: target buffer diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 723b1fa1177e..6de442182784 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -126,6 +126,8 @@ void seq_put_decimal_ll(struct seq_file *m, const char *delimiter, long long num void seq_put_hex_ll(struct seq_file *m, const char *delimiter, unsigned long long v, unsigned int width); +void seq_escape_mem(struct seq_file *m, const char *src, size_t len, + unsigned int flags, const char *esc); void seq_escape(struct seq_file *m, const char *s, const char *esc); void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz); From e7ed4a3b922b04d2042cd2e19d1096fa457b6c11 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:37 -0700 Subject: [PATCH 157/190] seq_file: add seq_escape_str() as replica of string_escape_str() In some cases we want to escape characters from NULL-terminated strings. Add seq_escape_str() as replica of string_escape_str() for that. Link: https://lkml.kernel.org/r/20210504180819.73127-13-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. 
Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/seq_file.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 6de442182784..63f021cb1b12 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -128,6 +128,13 @@ void seq_put_hex_ll(struct seq_file *m, const char *delimiter, void seq_escape_mem(struct seq_file *m, const char *src, size_t len, unsigned int flags, const char *esc); + +static inline void seq_escape_str(struct seq_file *m, const char *src, + unsigned int flags, const char *esc) +{ + seq_escape_mem(m, src, strlen(src), flags, esc); +} + void seq_escape(struct seq_file *m, const char *s, const char *esc); void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz); From fc3de02eae89a1eb4a964b7b0a05bfb717904700 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:40 -0700 Subject: [PATCH 158/190] seq_file: convert seq_escape() to use seq_escape_str() Convert seq_escape() to use seq_escape_str() rather than open coding it. Note, for now we leave it as an exported symbol due to some old code that can't tolerate ctype.h being (indirectly) included. Link: https://lkml.kernel.org/r/20210504180819.73127-14-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/seq_file.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 532cac2eae0f..08f54029c2b1 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -392,12 +392,7 @@ EXPORT_SYMBOL(seq_escape_mem); */ void seq_escape(struct seq_file *m, const char *s, const char *esc) { - char *buf; - size_t size = seq_get_buf(m, &buf); - int ret; - - ret = string_escape_str(s, buf, size, ESCAPE_OCTAL, esc); - seq_commit(m, ret < size ? ret : -1); + seq_escape_str(m, s, ESCAPE_OCTAL, esc); } EXPORT_SYMBOL(seq_escape); From c0546391c20f01ca98c6fa42c8cd9e247599550a Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:43 -0700 Subject: [PATCH 159/190] nfsd: avoid non-flexible API in seq_quote_mem() The seq_escape_mem_ascii() is completely non-flexible and shouldn't be used. Replace it with properly called seq_escape_mem(). Link: https://lkml.kernel.org/r/20210504180819.73127-15-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/nfsd/nfs4state.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index b517a8794400..cd5eac2ba054 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -2351,7 +2351,7 @@ static struct nfs4_client *get_nfsdfs_clp(struct inode *inode) static void seq_quote_mem(struct seq_file *m, char *data, int len) { seq_printf(m, "\""); - seq_escape_mem_ascii(m, data, len); + seq_escape_mem(m, data, len, ESCAPE_HEX | ESCAPE_NAP | ESCAPE_APPEND, "\"\\"); seq_printf(m, "\""); } From cc72181a65990193f54284417efa01d4580014e6 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:55:46 -0700 Subject: [PATCH 160/190] seq_file: drop unused *_escape_mem_ascii() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit There are no more users of the seq_escape_mem_ascii() followed by string_escape_mem_ascii(). Remove them for good. 
Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Cc: Alexander Viro Cc: Chuck Lever Cc: "J. Bruce Fields" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/seq_file.c | 11 ----------- include/linux/seq_file.h | 1 - include/linux/string_helpers.h | 3 --- lib/string_helpers.c | 19 ------------------- 4 files changed, 34 deletions(-) diff --git a/fs/seq_file.c b/fs/seq_file.c index 08f54029c2b1..b117b212ef28 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -396,17 +396,6 @@ void seq_escape(struct seq_file *m, const char *s, const char *esc) } EXPORT_SYMBOL(seq_escape); -void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz) -{ - char *buf; - size_t size = seq_get_buf(m, &buf); - int ret; - - ret = string_escape_mem_ascii(src, isz, buf, size); - seq_commit(m, ret < size ? ret : -1); -} -EXPORT_SYMBOL(seq_escape_mem_ascii); - void seq_vprintf(struct seq_file *m, const char *f, va_list args) { int len; diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 63f021cb1b12..dd99569595fd 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -136,7 +136,6 @@ static inline void seq_escape_str(struct seq_file *m, const char *src, } void seq_escape(struct seq_file *m, const char *s, const char *esc); -void seq_escape_mem_ascii(struct seq_file *m, const char *src, size_t isz); void seq_hex_dump(struct seq_file *m, const char *prefix_str, int prefix_type, int rowsize, int groupsize, const void *buf, size_t len, diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h index 9b0eca2badf2..68189c4a2eb1 100644 --- a/include/linux/string_helpers.h +++ b/include/linux/string_helpers.h @@ -63,9 +63,6 @@ static inline int string_unescape_any_inplace(char *buf) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); -int string_escape_mem_ascii(const char *src, size_t isz, char *dst, - size_t osz); - static inline int string_escape_mem_any_np(const char *src, size_t isz, char *dst, size_t osz, const char *only) { diff --git a/lib/string_helpers.c b/lib/string_helpers.c index c15aea7a82e9..5a35c7e16e96 100644 --- a/lib/string_helpers.c +++ b/lib/string_helpers.c @@ -582,25 +582,6 @@ int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, } EXPORT_SYMBOL(string_escape_mem); -int string_escape_mem_ascii(const char *src, size_t isz, char *dst, - size_t osz) -{ - char *p = dst; - char *end = p + osz; - - while (isz--) { - unsigned char c = *src++; - - if (!isprint(c) || !isascii(c) || c == '"' || c == '\\') - escape_hex(c, &p, end); - else - escape_passthrough(c, &p, end); - } - - return p - dst; -} -EXPORT_SYMBOL(string_escape_mem_ascii); - /* * Return an allocated string that has been escaped of special characters * and double quotes, making it safe to log in quotes. From 65a0d3c14685663ba111038a35db70f559e39336 Mon Sep 17 00:00:00 2001 From: Trent Piepho Date: Wed, 30 Jun 2021 18:55:49 -0700 Subject: [PATCH 161/190] lib/math/rational.c: fix divide by zero If the input is out of the range of the allowed values, either larger than the largest value or closer to zero than the smallest non-zero allowed value, then a division by zero would occur. In the case of input too large, the division by zero will occur on the first iteration. The best result (largest allowed value) will be found by always choosing the semi-convergent and excluding the denominator based limit when finding it. 
In the case of the input too small, the division by zero will occur on the second iteration. The numerator based semi-convergent should not be calculated to avoid the division by zero. But the semi-convergent vs previous convergent test is still needed, which effectively chooses between 0 (the previous convergent) vs the smallest allowed fraction (best semi-convergent) as the result. Link: https://lkml.kernel.org/r/20210525144250.214670-1-tpiepho@gmail.com Fixes: 323dd2c3ed0 ("lib/math/rational.c: fix possible incorrect result from rational fractions helper") Signed-off-by: Trent Piepho Reported-by: Yiyuan Guo Reviewed-by: Andy Shevchenko Cc: Oskar Schirmer Cc: Daniel Latypov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/math/rational.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/lib/math/rational.c b/lib/math/rational.c index 9781d521963d..c0ab51d8fbb9 100644 --- a/lib/math/rational.c +++ b/lib/math/rational.c @@ -12,6 +12,7 @@ #include #include #include +#include /* * calculate best rational approximation for a given fraction @@ -78,13 +79,18 @@ void rational_best_approximation( * found below as 't'. */ if ((n2 > max_numerator) || (d2 > max_denominator)) { - unsigned long t = min((max_numerator - n0) / n1, - (max_denominator - d0) / d1); + unsigned long t = ULONG_MAX; - /* This tests if the semi-convergent is closer - * than the previous convergent. + if (d1) + t = (max_denominator - d0) / d1; + if (n1) + t = min(t, (max_numerator - n0) / n1); + + /* This tests if the semi-convergent is closer than the previous + * convergent. If d1 is zero there is no previous convergent as this + * is the 1st iteration, so always choose the semi-convergent. */ - if (2u * t > a || (2u * t == a && d0 * dp > d1 * d)) { + if (!d1 || 2u * t > a || (2u * t == a && d0 * dp > d1 * d)) { n1 = n0 + t * n1; d1 = d0 + t * d1; } From b6c75c4afceb8bc065a4ebb5c6c381452bf96f53 Mon Sep 17 00:00:00 2001 From: Trent Piepho Date: Wed, 30 Jun 2021 18:55:52 -0700 Subject: [PATCH 162/190] lib/math/rational: add Kunit test cases Adds a number of test cases that cover a range of possible code paths. [akpm@linux-foundation.org: remove non-ascii characters, fix whitespace] [colin.king@canonical.com: fix spelling mistake "demominator" -> "denominator"] Link: https://lkml.kernel.org/r/20210526085049.6393-1-colin.king@canonical.com Link: https://lkml.kernel.org/r/20210525144250.214670-2-tpiepho@gmail.com Signed-off-by: Trent Piepho Signed-off-by: Colin Ian King Reviewed-by: Andy Shevchenko Cc: Daniel Latypov Cc: Oskar Schirmer Cc: Yiyuan Guo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 12 +++++++++ lib/math/Makefile | 1 + lib/math/rational-test.c | 56 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 69 insertions(+) create mode 100644 lib/math/rational-test.c diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index deca67d28abb..ea9a4096007d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -2444,6 +2444,18 @@ config SLUB_KUNIT_TEST If unsure, say N. +config RATIONAL_KUNIT_TEST + tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS + depends on KUNIT + select RATIONAL + default KUNIT_ALL_TESTS + help + This builds the rational math unit test. + For more information on KUnit and unit tests in general please refer + to the KUnit documentation in Documentation/dev-tools/kunit/. + + If unsure, say N. 
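For illustration, a minimal sketch of the two out-of-range cases the fix above handles (not part of the patch); the function name is made up, and the expected results follow the KUnit parameter table added below:

#include <linux/rational.h>

/*
 * Minimal sketch, not part of the patch: out-of-range inputs now clamp
 * instead of dividing by zero.
 */
static void example_rational_clamping(void)
{
	unsigned long n, d;

	/* 1/30 is closer to zero than 1/10 can represent -> clamps to 0/1 */
	rational_best_approximation(1, 30, 100, 10, &n, &d);

	/* 1230/10 exceeds the bounds -> clamps to the largest value, 100/1 */
	rational_best_approximation(1230, 10, 100, 20, &n, &d);
}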
+ config TEST_UDELAY tristate "udelay test driver" help diff --git a/lib/math/Makefile b/lib/math/Makefile index 7456edb864fc..bfac26ddfc22 100644 --- a/lib/math/Makefile +++ b/lib/math/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_PRIME_NUMBERS) += prime_numbers.o obj-$(CONFIG_RATIONAL) += rational.o obj-$(CONFIG_TEST_DIV64) += test_div64.o +obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o diff --git a/lib/math/rational-test.c b/lib/math/rational-test.c new file mode 100644 index 000000000000..01611ddff420 --- /dev/null +++ b/lib/math/rational-test.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include + +#include + +struct rational_test_param { + unsigned long num, den; + unsigned long max_num, max_den; + unsigned long exp_num, exp_den; + + const char *name; +}; + +static const struct rational_test_param test_parameters[] = { + { 1230, 10, 100, 20, 100, 1, "Exceeds bounds, semi-convergent term > 1/2 last term" }, + { 34567,100, 120, 20, 120, 1, "Exceeds bounds, semi-convergent term < 1/2 last term" }, + { 1, 30, 100, 10, 0, 1, "Closest to zero" }, + { 1, 19, 100, 10, 1, 10, "Closest to smallest non-zero" }, + { 27,32, 16, 16, 11, 13, "Use convergent" }, + { 1155, 7735, 255, 255, 33, 221, "Exact answer" }, + { 87, 32, 70, 32, 68, 25, "Semiconvergent, numerator limit" }, + { 14533, 4626, 15000, 2400, 7433, 2366, "Semiconvergent, denominator limit" }, +}; + +static void get_desc(const struct rational_test_param *param, char *desc) +{ + strscpy(desc, param->name, KUNIT_PARAM_DESC_SIZE); +} + +/* Creates function rational_gen_params */ +KUNIT_ARRAY_PARAM(rational, test_parameters, get_desc); + +static void rational_test(struct kunit *test) +{ + const struct rational_test_param *param = (const struct rational_test_param *)test->param_value; + unsigned long n = 0, d = 0; + + rational_best_approximation(param->num, param->den, param->max_num, param->max_den, &n, &d); + KUNIT_EXPECT_EQ(test, n, param->exp_num); + KUNIT_EXPECT_EQ(test, d, param->exp_den); +} + +static struct kunit_case rational_test_cases[] = { + KUNIT_CASE_PARAM(rational_test, rational_gen_params), + {} +}; + +static struct kunit_suite rational_test_suite = { + .name = "rational", + .test_cases = rational_test_cases, +}; + +kunit_test_suites(&rational_test_suite); + +MODULE_LICENSE("GPL v2"); From 05911c5d964956442d17fe21db239de5a1dace4a Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 30 Jun 2021 18:55:55 -0700 Subject: [PATCH 163/190] lib/decompressors: fix spelling mistakes Fix some spelling mistakes in comments: sentinal ==> sentinel compresed ==> compressed dependeny ==> dependency immediatelly ==> immediately dervied ==> derived splitted ==> split nore ==> not independed ==> independent asumed ==> assumed Link: https://lkml.kernel.org/r/20210604085656.12257-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei Cc: Jiri Kosina Cc: Thomas Bogendoerfer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/decompress_bunzip2.c | 4 ++-- lib/decompress_unxz.c | 2 +- lib/decompress_unzstd.c | 4 ++-- lib/xz/xz_dec_bcj.c | 2 +- lib/xz/xz_dec_lzma2.c | 8 ++++---- lib/zlib_inflate/inffast.c | 2 +- lib/zstd/huf.h | 2 +- 7 files changed, 12 insertions(+), 12 deletions(-) diff --git a/lib/decompress_bunzip2.c b/lib/decompress_bunzip2.c index 0d619cc129dd..3518e7394eca 100644 --- a/lib/decompress_bunzip2.c +++ b/lib/decompress_bunzip2.c @@ -80,7 +80,7 @@ /* This is what we know about each Huffman coding group */ struct group_data { - /* We have an extra slot at the end of limit[] for a sentinal value. 
*/ + /* We have an extra slot at the end of limit[] for a sentinel value. */ int limit[MAX_HUFCODE_BITS+1]; int base[MAX_HUFCODE_BITS]; int permute[MAX_SYMBOLS]; @@ -337,7 +337,7 @@ static int INIT get_next_block(struct bunzip_data *bd) pp <<= 1; base[i+1] = pp-(t += temp[i]); } - limit[maxLen+1] = INT_MAX; /* Sentinal value for + limit[maxLen+1] = INT_MAX; /* Sentinel value for * reading next sym. */ limit[maxLen] = pp+temp[maxLen]-1; base[minLen] = 0; diff --git a/lib/decompress_unxz.c b/lib/decompress_unxz.c index 25d59a95bd66..a2f38e23004a 100644 --- a/lib/decompress_unxz.c +++ b/lib/decompress_unxz.c @@ -23,7 +23,7 @@ * uncompressible. Thus, we must look for worst-case expansion when the * compressor is encoding uncompressible data. * - * The structure of the .xz file in case of a compresed kernel is as follows. + * The structure of the .xz file in case of a compressed kernel is as follows. * Sizes (as bytes) of the fields are in parenthesis. * * Stream Header (12) diff --git a/lib/decompress_unzstd.c b/lib/decompress_unzstd.c index 790abc472f5b..6b629ab31c1e 100644 --- a/lib/decompress_unzstd.c +++ b/lib/decompress_unzstd.c @@ -16,7 +16,7 @@ * uncompressible. Thus, we must look for worst-case expansion when the * compressor is encoding uncompressible data. * - * The structure of the .zst file in case of a compresed kernel is as follows. + * The structure of the .zst file in case of a compressed kernel is as follows. * Maximum sizes (as bytes) of the fields are in parenthesis. * * Frame Header: (18) @@ -56,7 +56,7 @@ /* * Preboot environments #include "path/to/decompress_unzstd.c". * All of the source files we depend on must be #included. - * zstd's only source dependeny is xxhash, which has no source + * zstd's only source dependency is xxhash, which has no source * dependencies. * * When UNZSTD_PREBOOT is defined we declare __decompress(), which is diff --git a/lib/xz/xz_dec_bcj.c b/lib/xz/xz_dec_bcj.c index 72ddac6ef2ec..ef449e97d1a1 100644 --- a/lib/xz/xz_dec_bcj.c +++ b/lib/xz/xz_dec_bcj.c @@ -422,7 +422,7 @@ XZ_EXTERN enum xz_ret xz_dec_bcj_run(struct xz_dec_bcj *s, /* * Flush pending already filtered data to the output buffer. Return - * immediatelly if we couldn't flush everything, or if the next + * immediately if we couldn't flush everything, or if the next * filter in the chain had already returned XZ_STREAM_END. */ if (s->temp.filtered > 0) { diff --git a/lib/xz/xz_dec_lzma2.c b/lib/xz/xz_dec_lzma2.c index ca2603abee08..7a6781e3f47b 100644 --- a/lib/xz/xz_dec_lzma2.c +++ b/lib/xz/xz_dec_lzma2.c @@ -147,8 +147,8 @@ struct lzma_dec { /* * LZMA properties or related bit masks (number of literal - * context bits, a mask dervied from the number of literal - * position bits, and a mask dervied from the number + * context bits, a mask derived from the number of literal + * position bits, and a mask derived from the number * position bits) */ uint32_t lc; @@ -484,7 +484,7 @@ static __always_inline void rc_normalize(struct rc_dec *rc) } /* - * Decode one bit. In some versions, this function has been splitted in three + * Decode one bit. In some versions, this function has been split in three * functions so that the compiler is supposed to be able to more easily avoid * an extra branch. In this particular version of the LZMA decoder, this * doesn't seem to be a good idea (tested with GCC 3.3.6, 3.4.6, and 4.3.3 @@ -761,7 +761,7 @@ static bool lzma_main(struct xz_dec_lzma2 *s) } /* - * Reset the LZMA decoder and range decoder state. 
Dictionary is nore reset + * Reset the LZMA decoder and range decoder state. Dictionary is not reset * here, because LZMA state may be reset without resetting the dictionary. */ static void lzma_reset(struct xz_dec_lzma2 *s) diff --git a/lib/zlib_inflate/inffast.c b/lib/zlib_inflate/inffast.c index ed1f3df27260..f19c4fbe1be7 100644 --- a/lib/zlib_inflate/inffast.c +++ b/lib/zlib_inflate/inffast.c @@ -15,7 +15,7 @@ union uu { unsigned char b[2]; }; -/* Endian independed version */ +/* Endian independent version */ static inline unsigned short get_unaligned16(const unsigned short *p) { diff --git a/lib/zstd/huf.h b/lib/zstd/huf.h index 2143da28d952..923218d12e28 100644 --- a/lib/zstd/huf.h +++ b/lib/zstd/huf.h @@ -134,7 +134,7 @@ typedef enum { HUF_repeat_none, /**< Cannot use the previous table */ HUF_repeat_check, /**< Can use the previous table but it must be checked. Note : The previous table must have been constructed by HUF_compress{1, 4}X_repeat */ - HUF_repeat_valid /**< Can use the previous table and it is asumed to be valid */ + HUF_repeat_valid /**< Can use the previous table and it is assumed to be valid */ } HUF_repeat; /** HUF_compress4X_repeat() : * Same as HUF_compress4X_wksp(), but considers using hufTable if *repeat != HUF_repeat_none. From 478485f6c0e5936b62c0c9393a865bfb00f037a5 Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 30 Jun 2021 18:55:58 -0700 Subject: [PATCH 164/190] lib/mpi: fix spelling mistakes Fix some spelling mistakes in comments: flaged ==> flagged bufer ==> buffer multipler ==> multiplier MULTIPLER ==> MULTIPLIER leaset ==> least chnage ==> change Link: https://lkml.kernel.org/r/20210604074401.12198-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei Cc: Herbert Xu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mpi.h | 4 ++-- lib/mpi/longlong.h | 4 ++-- lib/mpi/mpicoder.c | 6 +++--- lib/mpi/mpiutil.c | 2 +- 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/include/linux/mpi.h b/include/linux/mpi.h index 3e5358f4de2f..eb0d1c1db208 100644 --- a/include/linux/mpi.h +++ b/include/linux/mpi.h @@ -200,7 +200,7 @@ struct mpi_ec_ctx { unsigned int nbits; /* Number of bits. */ /* Domain parameters. Note that they may not all be set and if set - * the MPIs may be flaged as constant. + * the MPIs may be flagged as constant. */ MPI p; /* Prime specifying the field GF(p). */ MPI a; /* First coefficient of the Weierstrass equation. */ @@ -267,7 +267,7 @@ int mpi_ec_curve_point(MPI_POINT point, struct mpi_ec_ctx *ctx); /** * mpi_get_size() - returns max size required to store the number * - * @a: A multi precision integer for which we want to allocate a bufer + * @a: A multi precision integer for which we want to allocate a buffer * * Return: size required to store the number */ diff --git a/lib/mpi/longlong.h b/lib/mpi/longlong.h index afbd99987cf8..b6fa1d08fb55 100644 --- a/lib/mpi/longlong.h +++ b/lib/mpi/longlong.h @@ -48,8 +48,8 @@ /* Define auxiliary asm macros. * - * 1) umul_ppmm(high_prod, low_prod, multipler, multiplicand) multiplies two - * UWtype integers MULTIPLER and MULTIPLICAND, and generates a two UWtype + * 1) umul_ppmm(high_prod, low_prod, multiplier, multiplicand) multiplies two + * UWtype integers MULTIPLIER and MULTIPLICAND, and generates a two UWtype * word product in HIGH_PROD and LOW_PROD. 
* * 2) __umulsidi3(a,b) multiplies two UWtype integers A and B, and returns a diff --git a/lib/mpi/mpicoder.c b/lib/mpi/mpicoder.c index 7ea225b2204f..39c4c6731094 100644 --- a/lib/mpi/mpicoder.c +++ b/lib/mpi/mpicoder.c @@ -234,11 +234,11 @@ static int count_lzeros(MPI a) } /** - * mpi_read_buffer() - read MPI to a bufer provided by user (msb first) + * mpi_read_buffer() - read MPI to a buffer provided by user (msb first) * * @a: a multi precision integer - * @buf: bufer to which the output will be written to. Needs to be at - * leaset mpi_get_size(a) long. + * @buf: buffer to which the output will be written to. Needs to be at + * least mpi_get_size(a) long. * @buf_len: size of the buf. * @nbytes: receives the actual length of the data written on success and * the data to-be-written on -EOVERFLOW in case buf_len was too diff --git a/lib/mpi/mpiutil.c b/lib/mpi/mpiutil.c index 3c63710c20c6..9a75ca3f7edf 100644 --- a/lib/mpi/mpiutil.c +++ b/lib/mpi/mpiutil.c @@ -80,7 +80,7 @@ EXPORT_SYMBOL_GPL(mpi_const); /**************** * Note: It was a bad idea to use the number of limbs to allocate * because on a alpha the limbs are large but we normally need - * integers of n bits - So we should chnage this to bits (or bytes). + * integers of n bits - So we should change this to bits (or bytes). * * But mpi_alloc is used in a lot of places :-) */ From 1a58be6277e4324c853babfd35890c2d5e171e8f Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Wed, 30 Jun 2021 18:56:01 -0700 Subject: [PATCH 165/190] lib: memscan() fixlet Generic version doesn't trucate second argument to char. Older brother memchr() does as do s390, sparc and i386 assembly versions. Fortunately, no code passes c >= 256. Link: https://lkml.kernel.org/r/YLv4cCf0t5UPdyK+@localhost.localdomain Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/string.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/string.c b/lib/string.c index 7548eb715ddb..77bd0b1d3296 100644 --- a/lib/string.c +++ b/lib/string.c @@ -977,7 +977,7 @@ void *memscan(void *addr, int c, size_t size) unsigned char *p = addr; while (size) { - if (*p == c) + if (*p == (unsigned char)c) return (void *)p; p++; size--; From ad65dcef3a87c24d6c6156eae5e7b47311d6e3cf Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Wed, 30 Jun 2021 18:56:04 -0700 Subject: [PATCH 166/190] lib: uninline simple_strtoull() Gcc inlines simple_strtoull() too agressively. Given that all 4 signatures match, everything very efficiently calls or tailcalls into simple_strtoull(): ffffffff81da0240 : ffffffff81da0240: 80 3f 2d cmp BYTE PTR [rdi],0x2d ffffffff81da0243: 74 05 je ffffffff81da024a ffffffff81da0245: e9 76 ff ff ff jmp simple_strtoull ffffffff81da024a: 48 83 c7 01 add rdi,0x1 ffffffff81da024e: e8 6d ff ff ff call simple_strtoull ffffffff81da0253: 48 f7 d8 neg rax ffffffff81da0256: c3 ret Space savings (on F34-ish .config) add/remove: 0/0 grow/shrink: 1/3 up/down: 52/-313 (-261) Function old new delta vsscanf 2167 2219 +52 simple_strtoul 72 2 -70 simple_strtoll 143 23 -120 simple_strtol 143 20 -123 Link: https://lkml.kernel.org/r/YMO2zoOQk2eF34tn@localhost.localdomain Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/vsprintf.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/vsprintf.c b/lib/vsprintf.c index cc281f5895f9..30e1bc22105c 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -61,6 +61,7 @@ * * This function has caveats. Please use kstrtoull instead. 
*/ +noinline unsigned long long simple_strtoull(const char *cp, char **endp, unsigned int base) { unsigned long long result; From ce71efd03916ea8fe45e9ef6bd6abe4c20734a57 Mon Sep 17 00:00:00 2001 From: Matteo Croce Date: Wed, 30 Jun 2021 18:56:07 -0700 Subject: [PATCH 167/190] lib/test_string.c: allow module removal The test_string module can't be removed because it lacks an exit hook. Since there is no reason for it to be permanent, add an empty one to allow module removal. Link: https://lkml.kernel.org/r/20210616234503.28678-1-mcroce@linux.microsoft.com Signed-off-by: Matteo Croce Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/test_string.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/lib/test_string.c b/lib/test_string.c index 7b31f4a505bf..9dfd6f52de92 100644 --- a/lib/test_string.c +++ b/lib/test_string.c @@ -179,6 +179,10 @@ static __init int strnchr_selftest(void) return 0; } +static __exit void string_selftest_remove(void) +{ +} + static __init int string_selftest_init(void) { int test, subtest; @@ -216,4 +220,5 @@ static __init int string_selftest_init(void) } module_init(string_selftest_init); +module_exit(string_selftest_remove); MODULE_LICENSE("GPL v2"); From 4c52729377eab025b238caeed48994a39c3b73f2 Mon Sep 17 00:00:00 2001 From: Andy Shevchenko Date: Wed, 30 Jun 2021 18:56:10 -0700 Subject: [PATCH 168/190] kernel.h: split out kstrtox() and simple_strtox() to a separate header kernel.h is being used as a dump for all kinds of stuff for a long time. Here is the attempt to start cleaning it up by splitting out kstrtox() and simple_strtox() helpers. At the same time convert users in header and lib folders to use new header. Though for time being include new header back to kernel.h to avoid twisted indirected includes for existing users. [andy.shevchenko@gmail.com: fix documentation references] Link: https://lkml.kernel.org/r/20210615220003.377901-1-andy.shevchenko@gmail.com Link: https://lkml.kernel.org/r/20210611185815.44103-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko Acked-by: Jonathan Cameron Cc: Francis Laniel Cc: Randy Dunlap Cc: Kars Mulder Cc: Trond Myklebust Cc: Anna Schumaker Cc: "J. Bruce Fields" Cc: Chuck Lever Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/core-api/kernel-api.rst | 7 +- include/linux/kernel.h | 143 +----------------------- include/linux/kstrtox.h | 155 ++++++++++++++++++++++++++ include/linux/string.h | 7 -- include/linux/sunrpc/cache.h | 1 + lib/kstrtox.c | 5 +- lib/parser.c | 1 + 7 files changed, 163 insertions(+), 156 deletions(-) create mode 100644 include/linux/kstrtox.h diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst index 741aa37dc181..2a7444e3a4c2 100644 --- a/Documentation/core-api/kernel-api.rst +++ b/Documentation/core-api/kernel-api.rst @@ -24,11 +24,8 @@ String Conversions .. kernel-doc:: lib/vsprintf.c :export: -.. kernel-doc:: include/linux/kernel.h - :functions: kstrtol - -.. kernel-doc:: include/linux/kernel.h - :functions: kstrtoul +.. kernel-doc:: include/linux/kstrtox.h + :functions: kstrtol kstrtoul .. 
kernel-doc:: lib/kstrtox.c :export: diff --git a/include/linux/kernel.h b/include/linux/kernel.h index baea2eb763d0..7bb0a5cb7d57 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -10,6 +10,7 @@ #include #include #include +#include #include #include #include @@ -180,148 +181,6 @@ static inline void might_fault(void) { } void do_exit(long error_code) __noreturn; void complete_and_exit(struct completion *, long) __noreturn; -/* Internal, do not use. */ -int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res); -int __must_check _kstrtol(const char *s, unsigned int base, long *res); - -int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res); -int __must_check kstrtoll(const char *s, unsigned int base, long long *res); - -/** - * kstrtoul - convert a string to an unsigned long - * @s: The start of the string. The string must be null-terminated, and may also - * include a single newline before its terminating null. The first character - * may also be a plus sign, but not a minus sign. - * @base: The number base to use. The maximum supported base is 16. If base is - * given as 0, then the base of the string is automatically detected with the - * conventional semantics - If it begins with 0x the number will be parsed as a - * hexadecimal (case insensitive), if it otherwise begins with 0, it will be - * parsed as an octal number. Otherwise it will be parsed as a decimal. - * @res: Where to write the result of the conversion on success. - * - * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Preferred over simple_strtoul(). Return code must be checked. -*/ -static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) -{ - /* - * We want to shortcut function call, but - * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0. - */ - if (sizeof(unsigned long) == sizeof(unsigned long long) && - __alignof__(unsigned long) == __alignof__(unsigned long long)) - return kstrtoull(s, base, (unsigned long long *)res); - else - return _kstrtoul(s, base, res); -} - -/** - * kstrtol - convert a string to a long - * @s: The start of the string. The string must be null-terminated, and may also - * include a single newline before its terminating null. The first character - * may also be a plus sign or a minus sign. - * @base: The number base to use. The maximum supported base is 16. If base is - * given as 0, then the base of the string is automatically detected with the - * conventional semantics - If it begins with 0x the number will be parsed as a - * hexadecimal (case insensitive), if it otherwise begins with 0, it will be - * parsed as an octal number. Otherwise it will be parsed as a decimal. - * @res: Where to write the result of the conversion on success. - * - * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Preferred over simple_strtol(). Return code must be checked. - */ -static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) -{ - /* - * We want to shortcut function call, but - * __builtin_types_compatible_p(long, long long) = 0. 
- */ - if (sizeof(long) == sizeof(long long) && - __alignof__(long) == __alignof__(long long)) - return kstrtoll(s, base, (long long *)res); - else - return _kstrtol(s, base, res); -} - -int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res); -int __must_check kstrtoint(const char *s, unsigned int base, int *res); - -static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res) -{ - return kstrtoull(s, base, res); -} - -static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res) -{ - return kstrtoll(s, base, res); -} - -static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res) -{ - return kstrtouint(s, base, res); -} - -static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res) -{ - return kstrtoint(s, base, res); -} - -int __must_check kstrtou16(const char *s, unsigned int base, u16 *res); -int __must_check kstrtos16(const char *s, unsigned int base, s16 *res); -int __must_check kstrtou8(const char *s, unsigned int base, u8 *res); -int __must_check kstrtos8(const char *s, unsigned int base, s8 *res); -int __must_check kstrtobool(const char *s, bool *res); - -int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res); -int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res); -int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res); -int __must_check kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res); -int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res); -int __must_check kstrtoint_from_user(const char __user *s, size_t count, unsigned int base, int *res); -int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res); -int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res); -int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res); -int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res); -int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res); - -static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res) -{ - return kstrtoull_from_user(s, count, base, res); -} - -static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res) -{ - return kstrtoll_from_user(s, count, base, res); -} - -static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res) -{ - return kstrtouint_from_user(s, count, base, res); -} - -static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res) -{ - return kstrtoint_from_user(s, count, base, res); -} - -/* - * Use kstrto instead. - * - * NOTE: simple_strto does not check for the range overflow and, - * depending on the input, may give interesting results. - * - * Use these functions if and only if you cannot use kstrto, because - * the conversion ends on the first non-digit character, which may be far - * beyond the supported range. It might be useful to parse the strings like - * 10x50 or 12:21 without altering original string or temporary buffer in use. 
- * Keep in mind above caveat. - */ - -extern unsigned long simple_strtoul(const char *,char **,unsigned int); -extern long simple_strtol(const char *,char **,unsigned int); -extern unsigned long long simple_strtoull(const char *,char **,unsigned int); -extern long long simple_strtoll(const char *,char **,unsigned int); - extern int num_to_str(char *buf, int size, unsigned long long num, unsigned int width); diff --git a/include/linux/kstrtox.h b/include/linux/kstrtox.h new file mode 100644 index 000000000000..529974e22ea7 --- /dev/null +++ b/include/linux/kstrtox.h @@ -0,0 +1,155 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_KSTRTOX_H +#define _LINUX_KSTRTOX_H + +#include +#include + +/* Internal, do not use. */ +int __must_check _kstrtoul(const char *s, unsigned int base, unsigned long *res); +int __must_check _kstrtol(const char *s, unsigned int base, long *res); + +int __must_check kstrtoull(const char *s, unsigned int base, unsigned long long *res); +int __must_check kstrtoll(const char *s, unsigned int base, long long *res); + +/** + * kstrtoul - convert a string to an unsigned long + * @s: The start of the string. The string must be null-terminated, and may also + * include a single newline before its terminating null. The first character + * may also be a plus sign, but not a minus sign. + * @base: The number base to use. The maximum supported base is 16. If base is + * given as 0, then the base of the string is automatically detected with the + * conventional semantics - If it begins with 0x the number will be parsed as a + * hexadecimal (case insensitive), if it otherwise begins with 0, it will be + * parsed as an octal number. Otherwise it will be parsed as a decimal. + * @res: Where to write the result of the conversion on success. + * + * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. + * Preferred over simple_strtoul(). Return code must be checked. +*/ +static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) +{ + /* + * We want to shortcut function call, but + * __builtin_types_compatible_p(unsigned long, unsigned long long) = 0. + */ + if (sizeof(unsigned long) == sizeof(unsigned long long) && + __alignof__(unsigned long) == __alignof__(unsigned long long)) + return kstrtoull(s, base, (unsigned long long *)res); + else + return _kstrtoul(s, base, res); +} + +/** + * kstrtol - convert a string to a long + * @s: The start of the string. The string must be null-terminated, and may also + * include a single newline before its terminating null. The first character + * may also be a plus sign or a minus sign. + * @base: The number base to use. The maximum supported base is 16. If base is + * given as 0, then the base of the string is automatically detected with the + * conventional semantics - If it begins with 0x the number will be parsed as a + * hexadecimal (case insensitive), if it otherwise begins with 0, it will be + * parsed as an octal number. Otherwise it will be parsed as a decimal. + * @res: Where to write the result of the conversion on success. + * + * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. + * Preferred over simple_strtol(). Return code must be checked. + */ +static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) +{ + /* + * We want to shortcut function call, but + * __builtin_types_compatible_p(long, long long) = 0. 
+ */ + if (sizeof(long) == sizeof(long long) && + __alignof__(long) == __alignof__(long long)) + return kstrtoll(s, base, (long long *)res); + else + return _kstrtol(s, base, res); +} + +int __must_check kstrtouint(const char *s, unsigned int base, unsigned int *res); +int __must_check kstrtoint(const char *s, unsigned int base, int *res); + +static inline int __must_check kstrtou64(const char *s, unsigned int base, u64 *res) +{ + return kstrtoull(s, base, res); +} + +static inline int __must_check kstrtos64(const char *s, unsigned int base, s64 *res) +{ + return kstrtoll(s, base, res); +} + +static inline int __must_check kstrtou32(const char *s, unsigned int base, u32 *res) +{ + return kstrtouint(s, base, res); +} + +static inline int __must_check kstrtos32(const char *s, unsigned int base, s32 *res) +{ + return kstrtoint(s, base, res); +} + +int __must_check kstrtou16(const char *s, unsigned int base, u16 *res); +int __must_check kstrtos16(const char *s, unsigned int base, s16 *res); +int __must_check kstrtou8(const char *s, unsigned int base, u8 *res); +int __must_check kstrtos8(const char *s, unsigned int base, s8 *res); +int __must_check kstrtobool(const char *s, bool *res); + +int __must_check kstrtoull_from_user(const char __user *s, size_t count, unsigned int base, unsigned long long *res); +int __must_check kstrtoll_from_user(const char __user *s, size_t count, unsigned int base, long long *res); +int __must_check kstrtoul_from_user(const char __user *s, size_t count, unsigned int base, unsigned long *res); +int __must_check kstrtol_from_user(const char __user *s, size_t count, unsigned int base, long *res); +int __must_check kstrtouint_from_user(const char __user *s, size_t count, unsigned int base, unsigned int *res); +int __must_check kstrtoint_from_user(const char __user *s, size_t count, unsigned int base, int *res); +int __must_check kstrtou16_from_user(const char __user *s, size_t count, unsigned int base, u16 *res); +int __must_check kstrtos16_from_user(const char __user *s, size_t count, unsigned int base, s16 *res); +int __must_check kstrtou8_from_user(const char __user *s, size_t count, unsigned int base, u8 *res); +int __must_check kstrtos8_from_user(const char __user *s, size_t count, unsigned int base, s8 *res); +int __must_check kstrtobool_from_user(const char __user *s, size_t count, bool *res); + +static inline int __must_check kstrtou64_from_user(const char __user *s, size_t count, unsigned int base, u64 *res) +{ + return kstrtoull_from_user(s, count, base, res); +} + +static inline int __must_check kstrtos64_from_user(const char __user *s, size_t count, unsigned int base, s64 *res) +{ + return kstrtoll_from_user(s, count, base, res); +} + +static inline int __must_check kstrtou32_from_user(const char __user *s, size_t count, unsigned int base, u32 *res) +{ + return kstrtouint_from_user(s, count, base, res); +} + +static inline int __must_check kstrtos32_from_user(const char __user *s, size_t count, unsigned int base, s32 *res) +{ + return kstrtoint_from_user(s, count, base, res); +} + +/* + * Use kstrto instead. + * + * NOTE: simple_strto does not check for the range overflow and, + * depending on the input, may give interesting results. + * + * Use these functions if and only if you cannot use kstrto, because + * the conversion ends on the first non-digit character, which may be far + * beyond the supported range. It might be useful to parse the strings like + * 10x50 or 12:21 without altering original string or temporary buffer in use. 
+ * Keep in mind above caveat. + */ + +extern unsigned long simple_strtoul(const char *,char **,unsigned int); +extern long simple_strtol(const char *,char **,unsigned int); +extern unsigned long long simple_strtoull(const char *,char **,unsigned int); +extern long long simple_strtoll(const char *,char **,unsigned int); + +static inline int strtobool(const char *s, bool *res) +{ + return kstrtobool(s, res); +} + +#endif /* _LINUX_KSTRTOX_H */ diff --git a/include/linux/string.h b/include/linux/string.h index 9521d8cab18e..b48d2d28e0b1 100644 --- a/include/linux/string.h +++ b/include/linux/string.h @@ -2,7 +2,6 @@ #ifndef _LINUX_STRING_H_ #define _LINUX_STRING_H_ - #include /* for inline */ #include /* for size_t */ #include /* for NULL */ @@ -184,12 +183,6 @@ extern char **argv_split(gfp_t gfp, const char *str, int *argcp); extern void argv_free(char **argv); extern bool sysfs_streq(const char *s1, const char *s2); -extern int kstrtobool(const char *s, bool *res); -static inline int strtobool(const char *s, bool *res) -{ - return kstrtobool(s, res); -} - int match_string(const char * const *array, size_t n, const char *string); int __sysfs_match_string(const char * const *array, size_t n, const char *s); diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h index d0965e2997b0..b134b2b3371c 100644 --- a/include/linux/sunrpc/cache.h +++ b/include/linux/sunrpc/cache.h @@ -14,6 +14,7 @@ #include #include #include +#include #include /* diff --git a/lib/kstrtox.c b/lib/kstrtox.c index a118b0b1e9b2..a0cc8dcf4a35 100644 --- a/lib/kstrtox.c +++ b/lib/kstrtox.c @@ -14,11 +14,12 @@ */ #include #include -#include -#include #include +#include +#include #include #include + #include "kstrtox.h" const char *_parse_integer_fixup_radix(const char *s, unsigned int *base) diff --git a/lib/parser.c b/lib/parser.c index f1a6d90b8c34..bcb23484100e 100644 --- a/lib/parser.c +++ b/lib/parser.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include From 7fde9d6e839db604569ad5de5fbe7dd3cd8e2136 Mon Sep 17 00:00:00 2001 From: Rajat Asthana Date: Wed, 30 Jun 2021 18:56:13 -0700 Subject: [PATCH 169/190] lz4_decompress: declare LZ4_decompress_safe_withPrefix64k static Declare LZ4_decompress_safe_withPrefix64k as static to fix sparse warning: > warning: symbol 'LZ4_decompress_safe_withPrefix64k' was not declared. > Should it be static? Link: https://lkml.kernel.org/r/20210511154345.610569-1-thisisrast7@gmail.com Signed-off-by: Rajat Asthana Reviewed-by: Nick Terrell Cc: Gao Xiang Cc: Joe Perches Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/lz4/lz4_decompress.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/lz4/lz4_decompress.c b/lib/lz4/lz4_decompress.c index 8a7724a6ce2f..926f4823d5ea 100644 --- a/lib/lz4/lz4_decompress.c +++ b/lib/lz4/lz4_decompress.c @@ -481,7 +481,7 @@ int LZ4_decompress_fast(const char *source, char *dest, int originalSize) /* ===== Instantiate a few more decoding cases, used more than once. ===== */ -int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest, +static int LZ4_decompress_safe_withPrefix64k(const char *source, char *dest, int compressedSize, int maxOutputSize) { return LZ4_decompress_generic(source, dest, From 2c484419efc09e7234c667aa72698cb79ba8d8ed Mon Sep 17 00:00:00 2001 From: Dimitri John Ledkov Date: Wed, 30 Jun 2021 18:56:16 -0700 Subject: [PATCH 170/190] lib/decompress_unlz4.c: correctly handle zero-padding around initrds. lz4 compatible decompressor is simple. 
The format is underspecified and relies on EOF notification to determine when to stop. Initramfs buffer format[1] explicitly states that it can have arbitrary number of zero padding. Thus when operating without a fill function, be extra careful to ensure that sizes less than 4, or apperantly empty chunksizes are treated as EOF. To test this I have created two cpio initrds, first a normal one, main.cpio. And second one with just a single /test-file with content "second" second.cpio. Then i compressed both of them with gzip, and with lz4 -l. Then I created a padding of 4 bytes (dd if=/dev/zero of=pad4 bs=1 count=4). To create four testcase initrds: 1) main.cpio.gzip + extra.cpio.gzip = pad0.gzip 2) main.cpio.lz4 + extra.cpio.lz4 = pad0.lz4 3) main.cpio.gzip + pad4 + extra.cpio.gzip = pad4.gzip 4) main.cpio.lz4 + pad4 + extra.cpio.lz4 = pad4.lz4 The pad4 test-cases replicate the initrd load by grub, as it pads and aligns every initrd it loads. All of the above boot, however /test-file was not accessible in the initrd for the testcase #4, as decoding in lz4 decompressor failed. Also an error message printed which usually is harmless. Whith a patched kernel, all of the above testcases now pass, and /test-file is accessible. This fixes lz4 initrd decompress warning on every boot with grub. And more importantly this fixes inability to load multiple lz4 compressed initrds with grub. This patch has been shipping in Ubuntu kernels since January 2021. [1] ./Documentation/driver-api/early-userspace/buffer-format.rst BugLink: https://bugs.launchpad.net/bugs/1835660 Link: https://lore.kernel.org/lkml/20210114200256.196589-1-xnox@ubuntu.com/ # v0 Link: https://lkml.kernel.org/r/20210513104831.432975-1-dimitri.ledkov@canonical.com Signed-off-by: Dimitri John Ledkov Cc: Kyungsik Lee Cc: Yinghai Lu Cc: Bongkyu Kim Cc: Kees Cook Cc: Sven Schmidt <4sschmid@informatik.uni-hamburg.de> Cc: Rajat Asthana Cc: Nick Terrell Cc: Gao Xiang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/decompress_unlz4.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/lib/decompress_unlz4.c b/lib/decompress_unlz4.c index c0cfcfd486be..e6327391b6b6 100644 --- a/lib/decompress_unlz4.c +++ b/lib/decompress_unlz4.c @@ -112,6 +112,9 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, error("data corrupted"); goto exit_2; } + } else if (size < 4) { + /* empty or end-of-file */ + goto exit_3; } chunksize = get_unaligned_le32(inp); @@ -125,6 +128,10 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, continue; } + if (!fill && chunksize == 0) { + /* empty or end-of-file */ + goto exit_3; + } if (posp) *posp += 4; @@ -184,6 +191,7 @@ STATIC inline int INIT unlz4(u8 *input, long in_len, } } +exit_3: ret = 0; exit_2: if (!input) From f9363b31d769245cb7ec8a660460800d4b466911 Mon Sep 17 00:00:00 2001 From: Guenter Roeck Date: Wed, 30 Jun 2021 18:56:19 -0700 Subject: [PATCH 171/190] checkpatch: scripts/spdxcheck.py now requires python3 Since commit d0259c42abff ("spdxcheck.py: Use Python 3"), spdxcheck.py explicitly expects to run as python3 script. If "python" still points to python v2.7 and the script is executed with "python scripts/spdxcheck.py", the following error may be seen even if git-python is installed for python3. Traceback (most recent call last): File "scripts/spdxcheck.py", line 10, in import git ImportError: No module named git To fix the problem, check for the existence of python3, check if the script is executable and not just for its existence, and execute it directly. 
Link: https://lkml.kernel.org/r/20210505211720.447111-1-linux@roeck-us.net Signed-off-by: Guenter Roeck Cc: Joe Perches Cc: Bert Vermeulen Cc: Dwaipayan Ray Cc: Lukas Bulwahn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 23697a6b1eaa..d65334588eb4 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -1084,10 +1084,10 @@ sub is_maintained_obsolete { sub is_SPDX_License_valid { my ($license) = @_; - return 1 if (!$tree || which("python") eq "" || !(-e "$root/scripts/spdxcheck.py") || !(-e "$gitroot")); + return 1 if (!$tree || which("python3") eq "" || !(-x "$root/scripts/spdxcheck.py") || !(-e "$gitroot")); my $root_path = abs_path($root); - my $status = `cd "$root_path"; echo "$license" | python scripts/spdxcheck.py -`; + my $status = `cd "$root_path"; echo "$license" | scripts/spdxcheck.py -`; return 0 if ($status ne ""); return 1; } From 690786511b32baba073f729844779172d2ed72b6 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Wed, 30 Jun 2021 18:56:22 -0700 Subject: [PATCH 172/190] checkpatch: improve the indented label test checkpatch identifies a label only when a terminating colon immediately follows an identifier. Bitfield definitions can appear to be labels so ignore any spaces between the identifier terminating colon and any digit that may be used to define a bitfield length. Miscellanea: o Improve the initial checkpatch comment o Use the more typical '&&' instead of 'and' o Require the initial label character to be a non-digit (Can't use $Ident here because $Ident allows ## concatenation) o Use $sline instead of $line to ignore comments o Use '$sline !~ /.../' instead of '!($line =~ /.../)' Link: https://lkml.kernel.org/r/b54d673e7cde7de5de0c9ba4dd57dd0858580ca4.camel@perches.com Signed-off-by: Joe Perches Cc: Greg Kroah-Hartman Cc: Manikishan Ghantasala Cc: Alex Elder Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index d65334588eb4..dad87c368632 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -5361,9 +5361,13 @@ sub process { } } -#goto labels aren't indented, allow a single space however - if ($line=~/^.\s+[A-Za-z\d_]+:(?![0-9]+)/ and - !($line=~/^. [A-Za-z\d_]+:/) and !($line=~/^.\s+default:/)) { +# check that goto labels aren't indented (allow a single space indentation) +# and ignore bitfield definitions like foo:1 +# Strictly, labels can have whitespace after the identifier and before the : +# but this is not allowed here as many ?: uses would appear to be labels + if ($sline =~ /^.\s+[A-Za-z_][A-Za-z\d_]*:(?!\s*\d+)/ && + $sline !~ /^. [A-Za-z\d_][A-Za-z\d_]*:/ && + $sline !~ /^.\s+default:/) { if (WARN("INDENTED_LABEL", "labels should not be indented\n" . $herecurr) && $fix) { From 46b85bf96714267ab7855683b40103c9282aaf4e Mon Sep 17 00:00:00 2001 From: Guenter Roeck Date: Wed, 30 Jun 2021 18:56:25 -0700 Subject: [PATCH 173/190] checkpatch: do not complain about positive return values starting with EPOLL checkpatch complains about positive return values of poll functions. Example: WARNING: return of an errno should typically be negative (ie: return -EPOLLIN) + return EPOLLIN; Poll functions return positive values. 
The defines for the return values of poll functions all start with EPOLL, resulting in a number of false positives. An often used workaround is to assign poll function return values to variables and returning that variable, but that is a less than perfect solution. There is no error definition which starts with EPOLL, so it is safe to omit the warning for return values starting with EPOLL. Link: https://lkml.kernel.org/r/20210622004334.638680-1-linux@roeck-us.net Signed-off-by: Guenter Roeck Acked-by: Joe Perches Cc: Ricardo Ribalda Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/checkpatch.pl | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index dad87c368632..461d4221e4a4 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -5462,7 +5462,7 @@ sub process { # Return of what appears to be an errno should normally be negative if ($sline =~ /\breturn(?:\s*\(+\s*|\s+)(E[A-Z]+)(?:\s*\)+\s*|\s*)[;:,]/) { my $name = $1; - if ($name ne 'EOF' && $name ne 'ERROR') { + if ($name ne 'EOF' && $name ne 'ERROR' && $name !~ /^EPOLL/) { WARN("USE_NEGATIVE_ERRNO", "return of an errno should typically be negative (ie: return -$1)\n" . $herecurr); } From 86d1919a4fb0d9c115dd1d3b969f5d1650e45408 Mon Sep 17 00:00:00 2001 From: Andrew Halaney Date: Wed, 30 Jun 2021 18:56:28 -0700 Subject: [PATCH 174/190] init: print out unknown kernel parameters It is easy to foobar setting a kernel parameter on the command line without realizing it, there's not much output that you can use to assess what the kernel did with that parameter by default. Make it a little more explicit which parameters on the command line _looked_ like a valid parameter for the kernel, but did not match anything and ultimately got tossed to init. This is very similar to the unknown parameter message received when loading a module. This assumes the parameters are processed in a normal fashion, some parameters (dyndbg= for example) don't register their parameter with the rest of the kernel's parameters, and therefore always show up in this list (and are also given to init - like the rest of this list). Another example is BOOT_IMAGE= is highlighted as an offender, which it technically is, but is passed by LILO and GRUB so most systems will see that complaint. 
An example output where "foobared" and "unrecognized" are intentionally invalid parameters: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.12-dirty debug log_buf_len=4M foobared unrecognized=foo Unknown command line parameters: foobared BOOT_IMAGE=/boot/vmlinuz-5.12-dirty unrecognized=foo Link: https://lkml.kernel.org/r/20210511211009.42259-1-ahalaney@redhat.com Signed-off-by: Andrew Halaney Suggested-by: Steven Rostedt Suggested-by: Borislav Petkov Acked-by: Borislav Petkov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- init/main.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/init/main.c b/init/main.c index e9c42a183e33..d4f7af4cf560 100644 --- a/init/main.c +++ b/init/main.c @@ -872,6 +872,47 @@ void __init __weak arch_call_rest_init(void) rest_init(); } +static void __init print_unknown_bootoptions(void) +{ + char *unknown_options; + char *end; + const char *const *p; + size_t len; + + if (panic_later || (!argv_init[1] && !envp_init[2])) + return; + + /* + * Determine how many options we have to print out, plus a space + * before each + */ + len = 1; /* null terminator */ + for (p = &argv_init[1]; *p; p++) { + len++; + len += strlen(*p); + } + for (p = &envp_init[2]; *p; p++) { + len++; + len += strlen(*p); + } + + unknown_options = memblock_alloc(len, SMP_CACHE_BYTES); + if (!unknown_options) { + pr_err("%s: Failed to allocate %zu bytes\n", + __func__, len); + return; + } + end = unknown_options; + + for (p = &argv_init[1]; *p; p++) + end += sprintf(end, " %s", *p); + for (p = &envp_init[2]; *p; p++) + end += sprintf(end, " %s", *p); + + pr_notice("Unknown command line parameters:%s\n", unknown_options); + memblock_free(__pa(unknown_options), len); +} + asmlinkage __visible void __init __no_sanitize_address start_kernel(void) { char *command_line; @@ -913,6 +954,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void) static_command_line, __start___param, __stop___param - __start___param, -1, -1, NULL, &unknown_bootoption); + print_unknown_bootoptions(); if (!IS_ERR_OR_NULL(after_dashes)) parse_args("Setting init args", after_dashes, NULL, 0, -1, -1, NULL, set_init_arg); From 66ce75144d4b33e376f187df3dec495fe47d2ad0 Mon Sep 17 00:00:00 2001 From: Barry Song Date: Wed, 30 Jun 2021 18:56:31 -0700 Subject: [PATCH 175/190] kprobes: remove duplicated strong free_insn_page in x86 and s390 free_insn_page() in x86 and s390 is same with the common weak function in kernel/kprobes.c. Plus, the comment "Recover page to RW mode before releasing it" in x86 seems insensible to be there since resetting mapping is done by common code in vfree() of module_memfree(). So drop these two duplicated strong functions and related comment, then mark the common one in kernel/kprobes.c strong. Link: https://lkml.kernel.org/r/20210608065736.32656-1-song.bao.hua@hisilicon.com Signed-off-by: Barry Song Acked-by: Masami Hiramatsu Acked-by: Heiko Carstens Reviewed-by: Christoph Hellwig Cc: "H. Peter Anvin" Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Vasily Gorbik Cc: Christian Borntraeger Cc: "Naveen N. Rao" Cc: Anil S Keshavamurthy Cc: David S. 
Miller Cc: Qi Liu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/s390/kernel/kprobes.c | 5 ----- arch/x86/kernel/kprobes/core.c | 6 ------ include/linux/kprobes.h | 1 - kernel/kprobes.c | 2 +- 4 files changed, 1 insertion(+), 13 deletions(-) diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c index aae24dc75df6..60cfbd24229b 100644 --- a/arch/s390/kernel/kprobes.c +++ b/arch/s390/kernel/kprobes.c @@ -44,11 +44,6 @@ void *alloc_insn_page(void) return page; } -void free_insn_page(void *page) -{ - module_memfree(page); -} - static void *alloc_s390_insn_page(void) { if (xchg(&insn_page_in_use, 1) == 1) diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index d3d65545cb8b..3bce67d3a03c 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -422,12 +422,6 @@ void *alloc_insn_page(void) return page; } -/* Recover page to RW mode before releasing it */ -void free_insn_page(void *page) -{ - module_memfree(page); -} - /* Kprobe x86 instruction emulation - only regs->ip or IF flag modifiers */ static void kprobe_emulate_ifmodifiers(struct kprobe *p, struct pt_regs *regs) diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index 1883a4a9f16a..c98a35a75f40 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -407,7 +407,6 @@ int enable_kprobe(struct kprobe *kp); void dump_kprobe(struct kprobe *kp); void *alloc_insn_page(void); -void free_insn_page(void *page); int kprobe_get_kallsym(unsigned int symnum, unsigned long *value, char *type, char *sym); diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 745f08fdd7a6..e0c4c9d57299 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -106,7 +106,7 @@ void __weak *alloc_insn_page(void) return module_alloc(PAGE_SIZE); } -void __weak free_insn_page(void *page) +static void free_insn_page(void *page) { module_memfree(page); } From f4048e5aa148b13da84132cc23b6503b626e2576 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Wed, 30 Jun 2021 18:56:34 -0700 Subject: [PATCH 176/190] nilfs2: remove redundant continue statement in a while-loop The continue statement at the end of the while-loop is redundant, remove it. Addresses-Coverity: ("Continue has no effect") Link: https://lkml.kernel.org/r/20210621100519.10257-1-colin.king@canonical.com Link: https://lkml.kernel.org/r/1624557664-17159-1-git-send-email-konishi.ryusuke@gmail.com Signed-off-by: Colin Ian King Signed-off-by: Ryusuke Konishi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/nilfs2/btree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c index f42ab57201e7..ab9ec073330f 100644 --- a/fs/nilfs2/btree.c +++ b/fs/nilfs2/btree.c @@ -738,7 +738,6 @@ static int nilfs_btree_lookup_contig(const struct nilfs_bmap *btree, if (ptr2 != ptr + cnt || ++cnt == maxblocks) goto end; index++; - continue; } if (level == maxlevel) break; From 7dcae11f4c5862be62443dabe94e10a07b5639fc Mon Sep 17 00:00:00 2001 From: Zhen Lei Date: Wed, 30 Jun 2021 18:56:37 -0700 Subject: [PATCH 177/190] hfsplus: remove unnecessary oom message Fixes scripts/checkpatch.pl warning: WARNING: Possible unnecessary 'out of memory' message Remove it can help us save a bit of memory. 
Link: https://lkml.kernel.org/r/20210617084944.1279-1-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/hfsplus/xattr.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/hfsplus/xattr.c b/fs/hfsplus/xattr.c index 4d169c5a2673..e2855ceefd39 100644 --- a/fs/hfsplus/xattr.c +++ b/fs/hfsplus/xattr.c @@ -204,7 +204,6 @@ static int hfsplus_create_attributes_file(struct super_block *sb) buf = kzalloc(node_size, GFP_NOFS); if (!buf) { - pr_err("failed to allocate memory for header node\n"); err = -ENOMEM; goto end_attr_file_creation; } From c3eb84092b326a353725edcc8274a3782f1d1524 Mon Sep 17 00:00:00 2001 From: Chung-Chiang Cheng Date: Wed, 30 Jun 2021 18:56:40 -0700 Subject: [PATCH 178/190] hfsplus: report create_date to kstat.btime The create_date field of inode in hfsplus is corresponding to kstat.btime and could be reported in statx. Link: https://lkml.kernel.org/r/20210416172147.8736-1-cccheng@synology.com Signed-off-by: Chung-Chiang Cheng Reviewed-by: Viacheslav Dubeyko Cc: Christian Brauner Cc: James Morris Cc: Jens Axboe Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/hfsplus/inode.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 70e8374ddac4..6fef67c2a9f0 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -281,6 +281,11 @@ int hfsplus_getattr(struct user_namespace *mnt_userns, const struct path *path, struct inode *inode = d_inode(path->dentry); struct hfsplus_inode_info *hip = HFSPLUS_I(inode); + if (request_mask & STATX_BTIME) { + stat->result_mask |= STATX_BTIME; + stat->btime = hfsp_mt2ut(hip->create_date); + } + if (inode->i_flags & S_APPEND) stat->attributes |= STATX_ATTR_APPEND; if (inode->i_flags & S_IMMUTABLE) From 97c885d585c53d3f1ad4545b0ee10f0bdfaa1a4d Mon Sep 17 00:00:00 2001 From: Al Viro Date: Wed, 30 Jun 2021 18:56:43 -0700 Subject: [PATCH 179/190] x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned Currently we handle SS_AUTODISARM as soon as we have stored the altstack settings into sigframe - that's the point when we have set the things up for eventual sigreturn to restore the old settings. And if we manage to set the sigframe up (we are not done with that yet), everything's fine. However, in case of failure we end up with sigframe-to-be abandoned and SIGSEGV force-delivered. And in that case we end up with inconsistent rules - late failures have altstack reset, early ones do not. It's trivial to get consistent behaviour - just handle SS_AUTODISARM once we have set the sigframe up and are committed to entering the handler, i.e. in signal_delivered(). 
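A minimal userspace sketch of the SS_AUTODISARM semantics discussed above (illustrative only, not part of this patch; it assumes Linux 4.7+ and a libc that defines SS_AUTODISARM). Inside the handler the alternate stack should read back as SS_DISABLE, and the saved settings are restored on sigreturn:

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdio.h>

  static volatile int flags_in_handler;

  static void handler(int sig, siginfo_t *info, void *ctx)
  {
          stack_t ss;

          (void)sig; (void)info; (void)ctx;
          sigaltstack(NULL, &ss);          /* query current settings */
          flags_in_handler = ss.ss_flags;  /* expect SS_DISABLE: auto-disarmed */
  }

  int main(void)
  {
          static char stack[64 * 1024];
          stack_t ss = {
                  .ss_sp = stack,
                  .ss_size = sizeof(stack),
                  .ss_flags = SS_AUTODISARM,      /* re-armed when the handler returns */
          };
          struct sigaction sa = {
                  .sa_sigaction = handler,
                  .sa_flags = SA_SIGINFO | SA_ONSTACK,
          };

          if (sigaltstack(&ss, NULL) || sigaction(SIGUSR1, &sa, NULL))
                  return 1;
          raise(SIGUSR1);
          printf("ss_flags seen in handler: %#x (SS_DISABLE is %#x)\n",
                 flags_in_handler, SS_DISABLE);
          return 0;
  }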
Link: https://lore.kernel.org/lkml/20200404170604.GN23230@ZenIV.linux.org.uk/ Link: https://github.com/ClangBuiltLinux/linux/issues/876 Link: https://lkml.kernel.org/r/20210422230846.1756380-1-ndesaulniers@google.com Signed-off-by: Al Viro Signed-off-by: Nick Desaulniers Acked-by: Oleg Nesterov Tested-by: Nathan Chancellor Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compat.h | 2 -- include/linux/signal.h | 2 -- kernel/signal.c | 14 ++++---------- 3 files changed, 4 insertions(+), 14 deletions(-) diff --git a/include/linux/compat.h b/include/linux/compat.h index 8855b1b702b2..c270124e4402 100644 --- a/include/linux/compat.h +++ b/include/linux/compat.h @@ -532,8 +532,6 @@ int __compat_save_altstack(compat_stack_t __user *, unsigned long); &__uss->ss_sp, label); \ unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \ unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \ - if (t->sas_ss_flags & SS_AUTODISARM) \ - sas_ss_reset(t); \ } while (0); /* diff --git a/include/linux/signal.h b/include/linux/signal.h index 5160fd45e5ca..3454c7ff0778 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -462,8 +462,6 @@ int __save_altstack(stack_t __user *, unsigned long); unsafe_put_user((void __user *)t->sas_ss_sp, &__uss->ss_sp, label); \ unsafe_put_user(t->sas_ss_flags, &__uss->ss_flags, label); \ unsafe_put_user(t->sas_ss_size, &__uss->ss_size, label); \ - if (t->sas_ss_flags & SS_AUTODISARM) \ - sas_ss_reset(t); \ } while (0); #ifdef CONFIG_PROC_FS diff --git a/kernel/signal.c b/kernel/signal.c index 30a0bee5ff9b..6b4df88cc630 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2829,6 +2829,8 @@ static void signal_delivered(struct ksignal *ksig, int stepping) if (!(ksig->ka.sa.sa_flags & SA_NODEFER)) sigaddset(&blocked, ksig->sig); set_current_blocked(&blocked); + if (current->sas_ss_flags & SS_AUTODISARM) + sas_ss_reset(current); tracehook_signal_handler(stepping); } @@ -4147,11 +4149,7 @@ int __save_altstack(stack_t __user *uss, unsigned long sp) int err = __put_user((void __user *)t->sas_ss_sp, &uss->ss_sp) | __put_user(t->sas_ss_flags, &uss->ss_flags) | __put_user(t->sas_ss_size, &uss->ss_size); - if (err) - return err; - if (t->sas_ss_flags & SS_AUTODISARM) - sas_ss_reset(t); - return 0; + return err; } #ifdef CONFIG_COMPAT @@ -4206,11 +4204,7 @@ int __compat_save_altstack(compat_stack_t __user *uss, unsigned long sp) &uss->ss_sp) | __put_user(t->sas_ss_flags, &uss->ss_flags) | __put_user(t->sas_ss_size, &uss->ss_size); - if (err) - return err; - if (t->sas_ss_flags & SS_AUTODISARM) - sas_ss_reset(t); - return 0; + return err; } #endif From bae7702a17e9a29d90a997c266296b44d7b087f0 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Wed, 30 Jun 2021 18:56:46 -0700 Subject: [PATCH 180/190] exec: remove checks in __register_bimfmt() Delete NULL check, all callers pass valid pointer. Delete ->load_binary check -- failure to provide hook in a custom module will be very noticeable at the very first execve call. Link: https://lkml.kernel.org/r/YK1Gy1qXaLAR+tPl@localhost.localdomain Signed-off-by: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/exec.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 18594f11c31f..379afcf5d106 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -84,9 +84,6 @@ static DEFINE_RWLOCK(binfmt_lock); void __register_binfmt(struct linux_binfmt * fmt, int insert) { - BUG_ON(!fmt); - if (WARN_ON(!fmt->load_binary)) - return; write_lock(&binfmt_lock); insert ? 
list_add(&fmt->lh, &formats) : list_add_tail(&fmt->lh, &formats); From 540540d06e9d9b3769b46d88def90f7e7c002322 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Wed, 30 Jun 2021 18:56:49 -0700 Subject: [PATCH 181/190] kcov: add __no_sanitize_coverage to fix noinstr for all architectures Until now no compiler supported an attribute to disable coverage instrumentation as used by KCOV. To work around this limitation on x86, noinstr functions have their coverage instrumentation turned into nops by objtool. However, this solution doesn't scale automatically to other architectures, such as arm64, which are migrating to use the generic entry code. Clang [1] and GCC [2] have added support for the attribute recently. [1] https://github.com/llvm/llvm-project/commit/280333021e9550d80f5c1152a34e33e81df1e178 [2] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=cec4d4a6782c9bd8d071839c50a239c49caca689 The changes will appear in Clang 13 and GCC 12. Add __no_sanitize_coverage for both compilers, and add it to noinstr. Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if the feature is enabled, and therefore we do not require an additional defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is always true) to avoid adding redundant attributes to functions if KCOV is off. That being said, compilers that support the attribute will not generate errors/warnings if the attribute is redundantly used; however, where possible let's avoid it as it reduces preprocessed code size and associated compile-time overheads. [elver@google.com: Implement __has_feature(coverage_sanitizer) in Clang] Link: https://lkml.kernel.org/r/20210527162655.3246381-1-elver@google.com [elver@google.com: add comment explaining __has_feature() in Clang] Link: https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com Link: https://lkml.kernel.org/r/20210525175819.699786-1-elver@google.com Signed-off-by: Marco Elver Acked-by: Peter Zijlstra (Intel) Reviewed-by: Miguel Ojeda Reviewed-by: Nathan Chancellor Cc: Nick Desaulniers Cc: Kees Cook Cc: Will Deacon Cc: Ard Biesheuvel Cc: Luc Van Oostenryck Cc: Arvind Sankar Cc: Masahiro Yamada Cc: Sami Tolvanen Cc: Arnd Bergmann Cc: Dmitry Vyukov Cc: Mark Rutland Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compiler-clang.h | 17 +++++++++++++++++ include/linux/compiler-gcc.h | 6 ++++++ include/linux/compiler_types.h | 2 +- 3 files changed, 24 insertions(+), 1 deletion(-) diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index adbe76b203e2..49b0ac8b6fd3 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -13,6 +13,12 @@ /* all clang versions usable with the kernel support KASAN ABI version 5 */ #define KASAN_ABI_VERSION 5 +/* + * Note: Checking __has_feature(*_sanitizer) is only true if the feature is + * enabled. Therefore it is not required to additionally check defined(CONFIG_*) + * to avoid adding redundant attributes in other configurations. + */ + #if __has_feature(address_sanitizer) || __has_feature(hwaddress_sanitizer) /* Emulate GCC's __SANITIZE_ADDRESS__ flag */ #define __SANITIZE_ADDRESS__ @@ -45,6 +51,17 @@ #define __no_sanitize_undefined #endif +/* + * Support for __has_feature(coverage_sanitizer) was added in Clang 13 together + * with no_sanitize("coverage"). Prior versions of Clang support coverage + * instrumentation, but cannot be queried for support by the preprocessor. 
+ */ +#if __has_feature(coverage_sanitizer) +#define __no_sanitize_coverage __attribute__((no_sanitize("coverage"))) +#else +#define __no_sanitize_coverage +#endif + /* * Not all versions of clang implement the type-generic versions * of the builtin overflow checkers. Fortunately, clang implements diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index 5d97ef738a57..cb9217fc60af 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -122,6 +122,12 @@ #define __no_sanitize_undefined #endif +#if defined(CONFIG_KCOV) && __has_attribute(__no_sanitize_coverage__) +#define __no_sanitize_coverage __attribute__((no_sanitize_coverage)) +#else +#define __no_sanitize_coverage +#endif + #if GCC_VERSION >= 50100 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 #endif diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index d29bda7f6ebd..cc2bee7f0977 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -210,7 +210,7 @@ struct ftrace_likely_data { /* Section for code which can't be instrumented at all */ #define noinstr \ noinline notrace __attribute((__section__(".noinstr.text"))) \ - __no_kcsan __no_sanitize_address + __no_kcsan __no_sanitize_address __no_sanitize_coverage #endif /* __KERNEL__ */ From f36ef407628835a7d7fb3d235b1f1aac7022d9a3 Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 30 Jun 2021 18:56:53 -0700 Subject: [PATCH 182/190] selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random Patch series "selftests/vm/pkeys: Bug fixes and a new test". There has been a lot of activity on the x86 front around the XSAVE architecture which is used to context-switch processor state (among other things). In addition, AMD has recently joined the protection keys club by adding processor support for PKU. The AMD implementation helped uncover a kernel bug around the PKRU "init state", which actually applied to Intel's implementation but was just harder to hit. This series adds a test which is expected to help find this class of bug both on AMD and Intel. All the work around pkeys on x86 also uncovered a few bugs in the selftest. This patch (of 4): The "random" pkey allocation code currently does the good old: srand((unsigned int)time(NULL)); *But*, it unfortunately does this on every random pkey allocation. There may be thousands of these a second. time() has a one second resolution. So, each time alloc_random_pkey() is called, the PRNG is *RESET* to time(). This is nasty. Normally, if you do: srand(); foo = rand(); bar = rand(); You'll be quite guaranteed that 'foo' and 'bar' are different. But, if you do: srand(1); foo = rand(); srand(1); bar = rand(); You are quite guaranteed that 'foo' and 'bar' are the *SAME*. The recent "fix" effectively forced the test case to use the same "random" pkey for the whole test, unless the test run crossed a second boundary. Only run srand() once at program startup. This explains some very odd and persistent test failures I've been seeing. Link: https://lkml.kernel.org/r/20210611164153.91B76FB8@viggo.jf.intel.com Link: https://lkml.kernel.org/r/20210611164155.192D00FF@viggo.jf.intel.com Fixes: 6e373263ce07 ("selftests/vm/pkeys: fix alloc_random_pkey() to make it really random") Signed-off-by: Dave Hansen Signed-off-by: Thomas Gleixner Tested-by: Aneesh Kumar K.V Cc: Ram Pai Cc: Sandipan Das Cc: Florian Weimer Cc: "Desnes A. 
Nunes do Rosario" Cc: Ingo Molnar Cc: Thiago Jung Bauermann Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Suchanek Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/protection_keys.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c index fdbb602ecf32..9ee0ae5d3e06 100644 --- a/tools/testing/selftests/vm/protection_keys.c +++ b/tools/testing/selftests/vm/protection_keys.c @@ -561,7 +561,6 @@ int alloc_random_pkey(void) int nr_alloced = 0; int random_index; memset(alloced_pkeys, 0, sizeof(alloced_pkeys)); - srand((unsigned int)time(NULL)); /* allocate every possible key and make a note of which ones we got */ max_nr_pkey_allocs = NR_PKEYS; @@ -1552,6 +1551,8 @@ int main(void) int nr_iterations = 22; int pkeys_supported = is_pkeys_supported(); + srand((unsigned int)time(NULL)); + setup_handlers(); printf("has pkeys: %d\n", pkeys_supported); From bf68294a2ec39ed7fec6a5b45d52034e6983157a Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 30 Jun 2021 18:56:56 -0700 Subject: [PATCH 183/190] selftests/vm/pkeys: handle negative sys_pkey_alloc() return code The alloc_pkey() sefltest function wraps the sys_pkey_alloc() system call. On success, it updates its "shadow" register value because sys_pkey_alloc() updates the real register. But, the success check is wrong. pkey_alloc() considers any non-zero return code to indicate success where the pkey register will be modified. This fails to take negative return codes into account. Consider only a positive return value as a successful call. Link: https://lkml.kernel.org/r/20210611164157.87AB4246@viggo.jf.intel.com Fixes: 5f23f6d082a9 ("x86/pkeys: Add self-tests") Reported-by: Thomas Gleixner Signed-off-by: Dave Hansen Tested-by: Aneesh Kumar K.V Cc: Ram Pai Cc: Sandipan Das Cc: Florian Weimer Cc: "Desnes A. Nunes do Rosario" Cc: Ingo Molnar Cc: Thiago Jung Bauermann Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Suchanek Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/protection_keys.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c index 9ee0ae5d3e06..356d62fca27f 100644 --- a/tools/testing/selftests/vm/protection_keys.c +++ b/tools/testing/selftests/vm/protection_keys.c @@ -510,7 +510,7 @@ int alloc_pkey(void) " shadow: 0x%016llx\n", __func__, __LINE__, ret, __read_pkey_reg(), shadow_pkey_reg); - if (ret) { + if (ret > 0) { /* clear both the bits: */ shadow_pkey_reg = set_pkey_bits(shadow_pkey_reg, ret, ~PKEY_MASK); From 6039ca254979694c5362dfebadd105e286c397bb Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 30 Jun 2021 18:56:59 -0700 Subject: [PATCH 184/190] selftests/vm/pkeys: refill shadow register after implicit kernel write The pkey test code keeps a "shadow" of the pkey register around. This ensures that any bugs which might write to the register can be caught more quickly. Generally, userspace has a good idea when the kernel is going to write to the register. For instance, alloc_pkey() is passed a permission mask. The caller of alloc_pkey() can update the shadow based on the return value and the mask. But, the kernel can also modify the pkey register in a more sneaky way. For mprotect(PROT_EXEC) mappings, the kernel will allocate a pkey and write the pkey register to create an execute-only mapping. 
The kernel never tells userspace what key it uses for this. This can cause the test to fail with messages like: protection_keys_64.2: pkey-helpers.h:132: _read_pkey_reg: Assertion `pkey_reg == shadow_pkey_reg' failed. because the shadow was not updated with the new kernel-set value. Forcibly update the shadow value immediately after an mprotect(). Link: https://lkml.kernel.org/r/20210611164200.EF76AB73@viggo.jf.intel.com Fixes: 6af17cf89e99 ("x86/pkeys/selftests: Add PROT_EXEC test") Signed-off-by: Dave Hansen Signed-off-by: Thomas Gleixner Tested-by: Aneesh Kumar K.V Cc: Ram Pai Cc: Sandipan Das Cc: Florian Weimer Cc: "Desnes A. Nunes do Rosario" Cc: Ingo Molnar Cc: Thiago Jung Bauermann Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Suchanek Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/protection_keys.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c index 356d62fca27f..87eecd5ba577 100644 --- a/tools/testing/selftests/vm/protection_keys.c +++ b/tools/testing/selftests/vm/protection_keys.c @@ -1448,6 +1448,13 @@ void test_implicit_mprotect_exec_only_memory(int *ptr, u16 pkey) ret = mprotect(p1, PAGE_SIZE, PROT_EXEC); pkey_assert(!ret); + /* + * Reset the shadow, assuming that the above mprotect() + * correctly changed PKRU, but to an unknown value since + * the actual alllocated pkey is unknown. + */ + shadow_pkey_reg = __read_pkey_reg(); + dprintf2("pkey_reg: %016llx\n", read_pkey_reg()); /* Make sure this is an *instruction* fault */ From d892454b6814f07da676dae5e686cf221d34a1af Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Wed, 30 Jun 2021 18:57:03 -0700 Subject: [PATCH 185/190] selftests/vm/pkeys: exercise x86 XSAVE init state On x86, there is a set of instructions used to save and restore register state collectively known as the XSAVE architecture. There are about a dozen different features managed with XSAVE. The protection keys register, PKRU, is one of those features. The hardware optimizes XSAVE by tracking when the state has not changed from its initial (init) state. In this case, it can avoid the cost of writing state to memory (it would usually just be a bunch of 0's). When the pkey register is 0x0 the hardware optionally choose to track the register as being in the init state (optimize away the writes). AMD CPUs do this more aggressively compared to Intel. On x86, PKRU is rarely in its (very permissive) init state. Instead, the value defaults to something very restrictive. It is not surprising that bugs have popped up in the rare cases when PKRU reaches its init state. Add a protection key selftest which gets the protection keys register into its init state in a way that should work on Intel and AMD. Then, do a bunch of pkey register reads to watch for inadvertent changes. This adds "-mxsave" to CFLAGS for all the x86 vm selftests in order to allow use of the XSAVE instruction __builtin functions. This will make the builtins available on all of the vm selftests, but is expected to be harmless. Link: https://lkml.kernel.org/r/20210611164202.1849B712@viggo.jf.intel.com Signed-off-by: Dave Hansen Signed-off-by: Thomas Gleixner Tested-by: Aneesh Kumar K.V Cc: Ram Pai Cc: Sandipan Das Cc: Florian Weimer Cc: "Desnes A. 
Nunes do Rosario" Cc: Ingo Molnar Cc: Thiago Jung Bauermann Cc: Michael Ellerman Cc: Michal Hocko Cc: Michal Suchanek Cc: Shuah Khan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- tools/testing/selftests/vm/Makefile | 4 +- tools/testing/selftests/vm/pkey-x86.h | 1 + tools/testing/selftests/vm/protection_keys.c | 73 ++++++++++++++++++++ 3 files changed, 76 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 8be00319beca..812bc03e3142 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -101,7 +101,7 @@ $(1) $(1)_64: $(OUTPUT)/$(1)_64 endef ifeq ($(CAN_BUILD_I386),1) -$(BINARIES_32): CFLAGS += -m32 +$(BINARIES_32): CFLAGS += -m32 -mxsave $(BINARIES_32): LDLIBS += -lrt -ldl -lm $(BINARIES_32): $(OUTPUT)/%_32: %.c $(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@ @@ -109,7 +109,7 @@ $(foreach t,$(TARGETS),$(eval $(call gen-target-rule-32,$(t)))) endif ifeq ($(CAN_BUILD_X86_64),1) -$(BINARIES_64): CFLAGS += -m64 +$(BINARIES_64): CFLAGS += -m64 -mxsave $(BINARIES_64): LDLIBS += -lrt -ldl $(BINARIES_64): $(OUTPUT)/%_64: %.c $(CC) $(CFLAGS) $(EXTRA_CFLAGS) $(notdir $^) $(LDLIBS) -o $@ diff --git a/tools/testing/selftests/vm/pkey-x86.h b/tools/testing/selftests/vm/pkey-x86.h index 3be20f5d5275..e4a4ce2b826d 100644 --- a/tools/testing/selftests/vm/pkey-x86.h +++ b/tools/testing/selftests/vm/pkey-x86.h @@ -126,6 +126,7 @@ static inline u32 pkey_bit_position(int pkey) #define XSTATE_PKEY_BIT (9) #define XSTATE_PKEY 0x200 +#define XSTATE_BV_OFFSET 512 int pkey_reg_xstate_offset(void) { diff --git a/tools/testing/selftests/vm/protection_keys.c b/tools/testing/selftests/vm/protection_keys.c index 87eecd5ba577..2d0ae88665db 100644 --- a/tools/testing/selftests/vm/protection_keys.c +++ b/tools/testing/selftests/vm/protection_keys.c @@ -1277,6 +1277,78 @@ void test_pkey_alloc_exhaust(int *ptr, u16 pkey) } } +void arch_force_pkey_reg_init(void) +{ +#if defined(__i386__) || defined(__x86_64__) /* arch */ + u64 *buf; + + /* + * All keys should be allocated and set to allow reads and + * writes, so the register should be all 0. If not, just + * skip the test. + */ + if (read_pkey_reg()) + return; + + /* + * Just allocate an absurd about of memory rather than + * doing the XSAVE size enumeration dance. + */ + buf = mmap(NULL, 1*MB, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); + + /* These __builtins require compiling with -mxsave */ + + /* XSAVE to build a valid buffer: */ + __builtin_ia32_xsave(buf, XSTATE_PKEY); + /* Clear XSTATE_BV[PKRU]: */ + buf[XSTATE_BV_OFFSET/sizeof(u64)] &= ~XSTATE_PKEY; + /* XRSTOR will likely get PKRU back to the init state: */ + __builtin_ia32_xrstor(buf, XSTATE_PKEY); + + munmap(buf, 1*MB); +#endif +} + + +/* + * This is mostly useless on ppc for now. But it will not + * hurt anything and should give some better coverage as + * a long-running test that continually checks the pkey + * register. + */ +void test_pkey_init_state(int *ptr, u16 pkey) +{ + int err; + int allocated_pkeys[NR_PKEYS] = {0}; + int nr_allocated_pkeys = 0; + int i; + + for (i = 0; i < NR_PKEYS; i++) { + int new_pkey = alloc_pkey(); + + if (new_pkey < 0) + continue; + allocated_pkeys[nr_allocated_pkeys++] = new_pkey; + } + + dprintf3("%s()::%d\n", __func__, __LINE__); + + arch_force_pkey_reg_init(); + + /* + * Loop for a bit, hoping to get exercise the kernel + * context switch code. 
+ */ + for (i = 0; i < 1000000; i++) + read_pkey_reg(); + + for (i = 0; i < nr_allocated_pkeys; i++) { + err = sys_pkey_free(allocated_pkeys[i]); + pkey_assert(!err); + read_pkey_reg(); /* for shadow checking */ + } +} + /* * pkey 0 is special. It is allocated by default, so you do not * have to call pkey_alloc() to use it first. Make sure that it @@ -1508,6 +1580,7 @@ void (*pkey_tests[])(int *ptr, u16 pkey) = { test_implicit_mprotect_exec_only_memory, test_mprotect_with_pkey_0, test_ptrace_of_child, + test_pkey_init_state, test_pkey_syscalls_on_non_allocated_pkey, test_pkey_syscalls_bad_args, test_pkey_alloc_exhaust, From 3b52348345b2cfe038d317de52bcdef788c6520d Mon Sep 17 00:00:00 2001 From: Yu Kuai Date: Wed, 30 Jun 2021 18:57:06 -0700 Subject: [PATCH 186/190] lib/decompressors: remove set but not used variabled 'level' Fixes gcc '-Wunused-but-set-variable' warning: lib/decompress_unlzo.c:46:5: warning: variable `level' set but not used [-Wunused-but-set-variable] It is never used and so can be removed. [akpm@linux-foundation.org: warning: value computed is not used] Link: https://lkml.kernel.org/r/20210514062050.3532344-1-yukuai3@huawei.com Fixes: 7dd65feb6c60 ("lib: add support for LZO-compressed kernels") Signed-off-by: Yu Kuai Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/decompress_unlzo.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/lib/decompress_unlzo.c b/lib/decompress_unlzo.c index 1f439a622076..64c1358500ce 100644 --- a/lib/decompress_unlzo.c +++ b/lib/decompress_unlzo.c @@ -43,7 +43,6 @@ STATIC inline long INIT parse_header(u8 *input, long *skip, long in_len) int l; u8 *parse = input; u8 *end = input + in_len; - u8 level = 0; u16 version; /* @@ -65,7 +64,7 @@ STATIC inline long INIT parse_header(u8 *input, long *skip, long in_len) version = get_unaligned_be16(parse); parse += 7; if (version >= 0x0940) - level = *parse++; + parse++; if (get_unaligned_be32(parse) & HEADER_HAS_FILTER) parse += 8; /* flags + filter info */ else From fc37a3b8b4388e73e8e3525556d9f1feeb232bb9 Mon Sep 17 00:00:00 2001 From: Vasily Averin Date: Wed, 30 Jun 2021 18:57:09 -0700 Subject: [PATCH 187/190] ipc sem: use kvmalloc for sem_undo allocation Patch series "ipc: allocations cleanup", v2. Some ipc objects use the wrong allocation functions: small objects can use kmalloc(), and vice versa, potentially large objects can use kmalloc(). This patch (of 2): Size of sem_undo can exceed one page and with the maximum possible nsems = 32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to avoid user-triggered disruptive actions like OOM killer in case of high-order memory shortage. User triggerable high order allocations are quite a problem on heavily fragmented systems. They can be a DoS vector. 
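For scale, a quick back-of-the-envelope check of the worst case mentioned above (illustrative userspace calculation, not part of the patch; the ~64-byte struct sem_undo header is an assumption):

  #include <stdio.h>

  int main(void)
  {
          const unsigned long nsems = 32000;      /* maximum semaphores per set */
          const unsigned long hdr = 64;           /* rough sizeof(struct sem_undo), assumed */
          const unsigned long page = 4096;        /* typical PAGE_SIZE */
          unsigned long bytes = hdr + nsems * sizeof(short);

          printf("worst-case sem_undo: %lu bytes (%lu pages)\n",
                 bytes, (bytes + page - 1) / page);  /* ~64 KiB, a high-order kmalloc */
          return 0;
  }

With 4K pages that is 16 physically contiguous pages per undo structure, exactly the kind of user-triggerable high-order request kvmalloc() is meant to absorb.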
Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com Signed-off-by: Vasily Averin Acked-by: Michal Hocko Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Alexey Dobriyan Cc: Davidlohr Bueso Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner Cc: Manfred Spraul Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/sem.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index bf534c74293e..3a58188733d8 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -1154,7 +1154,7 @@ static void freeary(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp) un->semid = -1; list_del_rcu(&un->list_proc); spin_unlock(&un->ulp->lock); - kfree_rcu(un, rcu); + kvfree_rcu(un, rcu); } /* Wake up all pending processes and let them fail with EIDRM. */ @@ -1937,7 +1937,8 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid) rcu_read_unlock(); /* step 2: allocate new undo structure */ - new = kzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL); + new = kvzalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, + GFP_KERNEL); if (!new) { ipc_rcu_putref(&sma->sem_perm, sem_rcu_free); return ERR_PTR(-ENOMEM); @@ -1949,7 +1950,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid) if (!ipc_valid_object(&sma->sem_perm)) { sem_unlock(sma, -1); rcu_read_unlock(); - kfree(new); + kvfree(new); un = ERR_PTR(-EIDRM); goto out; } @@ -1960,7 +1961,7 @@ static struct sem_undo *find_alloc_undo(struct ipc_namespace *ns, int semid) */ un = lookup_undo(ulp, semid); if (un) { - kfree(new); + kvfree(new); goto success; } /* step 5: initialize & link new undo structure */ @@ -2420,7 +2421,7 @@ void exit_sem(struct task_struct *tsk) rcu_read_unlock(); wake_up_q(&wake_q); - kfree_rcu(un, rcu); + kvfree_rcu(un, rcu); } kfree(ulp); } From bc8136a543aa839a848b49af5e101ac6de5f6b27 Mon Sep 17 00:00:00 2001 From: Vasily Averin Date: Wed, 30 Jun 2021 18:57:12 -0700 Subject: [PATCH 188/190] ipc: use kmalloc for msg_queue and shmid_kernel msg_queue and shmid_kernel are quite small objects, so there is no need to use kvmalloc() for them. mhocko@: "Both of them are 256B on most 64b systems." Previously these objects were allocated via ipc_alloc()/ipc_rcu_alloc(), a common function for several ipc objects that had a kvmalloc() call inside. Later this function went away and was finally replaced by a direct kvmalloc() call, so now we can use the more suitable kmalloc()/kfree() for them.
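As a hedged summary of the sizing rule this two-patch series applies, here is a hypothetical helper that is not added anywhere by the patches; the page size is an assumption of the sketch.

    #include <stdbool.h>
    #include <stddef.h>

    #define ASSUMED_PAGE_SIZE 4096UL        /* typical 4 KiB page, assumed for this sketch */

    /*
     * Objects whose worst case can exceed a page (sem_undo, up to ~64 KiB)
     * justify kvmalloc() and its vmalloc() fallback; fixed-size objects of a
     * few hundred bytes (msg_queue, shmid_kernel) do not, so plain
     * kmalloc()/kfree() is the simpler and cheaper choice.
     */
    static inline bool prefer_kvmalloc(size_t worst_case_size)
    {
            return worst_case_size > ASSUMED_PAGE_SIZE;
    }

The free side has to match the allocator: kvfree() copes with either backing store, which the sem_undo patch relies on, while this patch can return to plain kfree() because the objects are now known to be slab-backed.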
Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com Reported-by: Alexey Dobriyan Signed-off-by: Vasily Averin Acked-by: Michal Hocko Reviewed-by: Shakeel Butt Acked-by: Roman Gushchin Cc: Johannes Weiner Cc: Vladimir Davydov Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Davidlohr Bueso Cc: Manfred Spraul Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/msg.c | 6 +++--- ipc/shm.c | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/ipc/msg.c b/ipc/msg.c index 6e6c8e0c9380..6810276d6bb9 100644 --- a/ipc/msg.c +++ b/ipc/msg.c @@ -130,7 +130,7 @@ static void msg_rcu_free(struct rcu_head *head) struct msg_queue *msq = container_of(p, struct msg_queue, q_perm); security_msg_queue_free(&msq->q_perm); - kvfree(msq); + kfree(msq); } /** @@ -147,7 +147,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params) key_t key = params->key; int msgflg = params->flg; - msq = kvmalloc(sizeof(*msq), GFP_KERNEL); + msq = kmalloc(sizeof(*msq), GFP_KERNEL); if (unlikely(!msq)) return -ENOMEM; @@ -157,7 +157,7 @@ static int newque(struct ipc_namespace *ns, struct ipc_params *params) msq->q_perm.security = NULL; retval = security_msg_queue_alloc(&msq->q_perm); if (retval) { - kvfree(msq); + kfree(msq); return retval; } diff --git a/ipc/shm.c b/ipc/shm.c index febd88daba8c..a66b2664558b 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -222,7 +222,7 @@ static void shm_rcu_free(struct rcu_head *head) struct shmid_kernel *shp = container_of(ptr, struct shmid_kernel, shm_perm); security_shm_free(&shp->shm_perm); - kvfree(shp); + kfree(shp); } static inline void shm_rmid(struct ipc_namespace *ns, struct shmid_kernel *s) @@ -619,7 +619,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) ns->shm_tot + numpages > ns->shm_ctlall) return -ENOSPC; - shp = kvmalloc(sizeof(*shp), GFP_KERNEL); + shp = kmalloc(sizeof(*shp), GFP_KERNEL); if (unlikely(!shp)) return -ENOMEM; @@ -630,7 +630,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) shp->shm_perm.security = NULL; error = security_shm_alloc(&shp->shm_perm); if (error) { - kvfree(shp); + kfree(shp); return error; } From 17d056e0bdaab3d3f1fbec1ac154addcc4183aed Mon Sep 17 00:00:00 2001 From: Manfred Spraul Date: Wed, 30 Jun 2021 18:57:15 -0700 Subject: [PATCH 189/190] ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock The patch solves three weaknesses in ipc/sem.c: 1) The initial read of use_global_lock in sem_lock() is an intentional race. KCSAN detects these accesses and prints a warning. 2) The code assumes that plain C read/writes are not mangled by the CPU or the compiler. 3) The comment in sysvipc_sem_proc_show() was hard to understand: The rest of the comments in ipc/sem.c speak about sem_perm.lock, and suddenly this function speaks about ipc_lock_object(). To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used in code that owns sma->sem_perm.lock. The comment is updated to solve 3). [manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock] Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul Reviewed-by: Paul E.
McKenney Reviewed-by: Davidlohr Bueso Cc: <1vier1@web.de> Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/sem.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index 3a58188733d8..971e75d28364 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -217,6 +217,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it); * this smp_load_acquire(), this is guaranteed because the smp_load_acquire() * is inside a spin_lock() and after a write from 0 to non-zero a * spin_lock()+spin_unlock() is done. + * To prevent the compiler/cpu temporarily writing 0 to use_global_lock, + * READ_ONCE()/WRITE_ONCE() is used. * * 2) queue.status: (SEM_BARRIER_2) * Initialization is done while holding sem_lock(), so no further barrier is @@ -342,10 +344,10 @@ static void complexmode_enter(struct sem_array *sma) * Nothing to do, just reset the * counter until we return to simple mode. */ - sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS; + WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS); return; } - sma->use_global_lock = USE_GLOBAL_LOCK_HYSTERESIS; + WRITE_ONCE(sma->use_global_lock, USE_GLOBAL_LOCK_HYSTERESIS); for (i = 0; i < sma->sem_nsems; i++) { sem = &sma->sems[i]; @@ -371,7 +373,8 @@ static void complexmode_tryleave(struct sem_array *sma) /* See SEM_BARRIER_1 for purpose/pairing */ smp_store_release(&sma->use_global_lock, 0); } else { - sma->use_global_lock--; + WRITE_ONCE(sma->use_global_lock, + sma->use_global_lock-1); } } @@ -412,7 +415,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, * Initial check for use_global_lock. Just an optimization, * no locking, no memory barrier. */ - if (!sma->use_global_lock) { + if (!READ_ONCE(sma->use_global_lock)) { /* * It appears that no complex operation is around. * Acquire the per-semaphore lock. @@ -2436,7 +2439,7 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it) /* * The proc interface isn't aware of sem_lock(), it calls - * ipc_lock_object() directly (in sysvipc_find_ipc). + * ipc_lock_object(), i.e. spin_lock(&sma->sem_perm.lock). + * (in sysvipc_find_ipc) * In order to stay compatible with sem_lock(), we must * enter / leave complex_mode. */ From b869d5be0acf0e125e69adcffdca04000dc5b17c Mon Sep 17 00:00:00 2001 From: Manfred Spraul Date: Wed, 30 Jun 2021 18:57:18 -0700 Subject: [PATCH 190/190] ipc/util.c: use binary search for max_idx If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO, MSG_INFO or SHM_INFO, then the return value is the highest used index in the kernel's internal array recording information about all SysV objects of the requested type for the current namespace. (This information can be used with repeated ..._STAT or ..._STAT_ANY operations to obtain information about all SysV objects on the system.) There is a cache for this value. But when the cache needs to be updated, then the highest used index is determined by looping over all possible values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the index values do not need to be consecutive. With a large value written to /proc/sys/kernel/msg_next_id, followed by msgget() and msgctl(,IPC_RMID) in a loop, I have observed a performance increase of around factor 13000. As there is no get_last() function for idr structures, implement a "get_last()" using a binary search. As far as I see, ipc is the only user that needs get_last(), thus implement it in ipc/util.c and not in a central location.
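The shape of that binary search, as a self-contained userspace model; has_idx_at_or_above() is an invented stand-in for "idr_get_next() found an entry at or above this index", and the bool array is purely illustrative — the real ipc_search_maxidx() is in the diff below.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for "idr_get_next() finds an entry with index >= idx". */
    static bool has_idx_at_or_above(const bool *used, int n, int idx)
    {
            for (int i = idx; i < n; i++)
                    if (used[i])
                            return true;
            return false;
    }

    /* Build the highest used index bit by bit, from the top bit down. */
    static int search_maxidx(const bool *used, int n, int limit)
    {
            int retval = 0;

            for (int i = 31 - __builtin_clz(limit + 1); i >= 0; i--) {
                    int candidate = retval | (1 << i);

                    /* index 0 is valid, so probe candidate-1 rather than candidate */
                    if (has_idx_at_or_above(used, n, candidate - 1))
                            retval |= 1 << i;
            }
            return retval - 1;      /* -1 when no index is in use at all */
    }

    int main(void)
    {
            bool used[32] = { [3] = true, [17] = true };

            printf("max idx = %d\n", search_maxidx(used, 32, 31));  /* prints 17 */
            return 0;
    }

Each probe asks one question of the index set, so a handful of lookups (one per bit of the limit) replace the old linear walk that could touch millions of slots.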
[akpm@linux-foundation.org: tweak comment, fix typo] Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul Acked-by: Davidlohr Bueso Cc: <1vier1@web.de> Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/util.c | 44 +++++++++++++++++++++++++++++++++++++++----- ipc/util.h | 3 +++ 2 files changed, 42 insertions(+), 5 deletions(-) diff --git a/ipc/util.c b/ipc/util.c index cfa0045e748d..0027e47626b7 100644 --- a/ipc/util.c +++ b/ipc/util.c @@ -64,6 +64,7 @@ #include #include #include +#include #include @@ -450,6 +451,41 @@ static void ipc_kht_remove(struct ipc_ids *ids, struct kern_ipc_perm *ipcp) ipc_kht_params); } +/** + * ipc_search_maxidx - search for the highest assigned index + * @ids: ipc identifier set + * @limit: known upper limit for highest assigned index + * + * The function determines the highest assigned index in @ids. It is intended + * to be called when ids->max_idx needs to be updated. + * Updating ids->max_idx is necessary when the current highest index ipc + * object is deleted. + * If no ipc object is allocated, then -1 is returned. + * + * ipc_ids.rwsem needs to be held by the caller. + */ +static int ipc_search_maxidx(struct ipc_ids *ids, int limit) +{ + int tmpidx; + int i; + int retval; + + i = ilog2(limit+1); + + retval = 0; + for (; i >= 0; i--) { + tmpidx = retval | (1<<i); + /* + * "0" is a possible index value, thus search using + * e.g. 15,7,3,1,0 instead of 16,8,4,2,1. + */ + tmpidx = tmpidx-1; + if (idr_get_next(&ids->ipcs_idr, &tmpidx)) + retval |= (1<<i); + } + + return retval - 1; +} ipcp->deleted = true; if (unlikely(idx == ids->max_idx)) { - do { - idx--; - if (idx == -1) - break; - } while (!idr_find(&ids->ipcs_idr, idx)); + idx = ids->max_idx-1; + if (idx >= 0) + idx = ipc_search_maxidx(ids, idx); ids->max_idx = idx; } } diff --git a/ipc/util.h b/ipc/util.h index 5766c61aed0e..2dd7ce0416d8 100644 --- a/ipc/util.h +++ b/ipc/util.h @@ -145,6 +145,9 @@ int ipcperms(struct ipc_namespace *ns, struct kern_ipc_perm *ipcp, short flg); * ipc_get_maxidx - get the highest assigned index * @ids: ipc identifier set * + * The function returns the highest assigned index for @ids. The function + * doesn't scan the idr tree, it uses a cached value. + * + * Called with ipc_ids.rwsem held for reading. */ static inline int ipc_get_maxidx(struct ipc_ids *ids)