linux

mirror of https://github.com/torvalds/linux synced 2024-09-21 11:38:48 +00:00

Dan Williams e900a918b0 mm: shuffle initial free memory to improve memory-side-cache utilization

Patch series "mm: Randomize free memory", v10.

This patch (of 3):

Randomization of the page allocator improves the average utilization of a direct-mapped memory-side-cache. Memory side caching is a platform capability that Linux has been previously exposed to in HPC (high-performance computing) environments on specialty platforms. In that instance it was a smaller pool of high-bandwidth-memory relative to higher-capacity / lower-bandwidth DRAM. Now, this capability is going to be found on general purpose server platforms where DRAM is a cache in front of higher latency persistent memory [1].

Robert offered an explanation of the state of the art of Linux interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but also reduces the chance that the buffers will. This will make performance more consistent, albeit slower than "optimal" (which is near impossible to attain in a general-purpose kernel). That's better than forcing users to deploy remedies like:

        "To eliminate this gradual degradation, we have added a Stream measurement to the Node Health Check that follows each job; nodes are rebooted whenever their measured memory bandwidth falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit `cc9aec03e5` ("x86/numa_emulation: Introduce uniform split capability"). With this numa_emulation capability, memory can be split into cache sized ("near-memory" sized) numa nodes.
A bind operation to such a node, and disabling workloads on other nodes, enables full cache performance. However, once the workload exceeds the cache size then cache conflicts are unavoidable. While HPC environments might be able to tolerate time-scheduling of cache sized workloads, for general purpose server platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a workload at boot with an un-contended cache only to see that performance degrade over time, even below the average cache performance due to excessive conflicts. Randomization clips the peaks and fills in the valleys of cache utilization to yield steady average performance.

Here are some performance impact details of the patches:

1/ An Intel internal synthetic memory bandwidth measurement tool saw a 3X speedup in a contrived case that tries to force cache conflicts. The contrived case used the numa_emulation capability to force an instance of the benchmark to be run in two of the near-memory sized numa nodes. If both instances were placed on the same emulated node they would fit and cause zero conflicts, while on separate emulated nodes without randomization they underutilized the cache and conflicted unnecessarily due to the in-order allocation per node.

2/ A well known Java server application benchmark was run with a heap size that exceeded cache size by 3X. The cache conflict rate was 8% for the first run and degraded to 21% after page allocator aging. With randomization enabled the rate levelled out at 11%.

3/ A MongoDB workload did not observe measurable difference in cache-conflict rates, but the overall throughput dropped by 7% with randomization in one case.

4/ Mel Gorman ran his suite of performance workloads with randomization enabled on platforms without a memory-side-cache and saw a mix of some improvements and some losses [3].
While there is potentially significant improvement for applications that depend on low latency access across a wide working-set, the performance impact may be negligible to negative for other workloads. For this reason the shuffle capability defaults to off unless a direct-mapped memory-side-cache is detected. Even then, the page_alloc.shuffle=0 parameter can be specified to disable the randomization on those systems.

Outside of memory-side-cache utilization concerns there is a potential security benefit from randomization. Some data exfiltration and return-oriented-programming attacks rely on the ability to infer the location of sensitive data objects. The kernel page allocator, especially early in system boot, has predictable first-in-first-out behavior for physical pages. Pages are freed in physical address order when first onlined.

Quoting Kees:

    "While we already have a base-address randomization (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and memory layouts would certainly be using the predictability of allocation ordering (i.e. for attacks where the base address isn't important: only the relative positions between allocated memory). This is common in lots of heap-style attacks. They try to gain control over ordering by spraying allocations, etc. I'd really like to see this because it gives us something similar to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

While SLAB_FREELIST_RANDOM reduces the predictability of some local slab caches, it leaves the vast bulk of memory to be predictably allocated in order. However, it should be noted that the concrete security benefits are hard to quantify, and no known CVE is mitigated by this randomization.

Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform a Fisher-Yates shuffle of the page allocator 'free_area' lists when they are initially populated with free memory at boot and at hotplug time.
Do this based on either the presence of a page_alloc.shuffle=Y command line parameter, or autodetection of a memory-side-cache (to be added in a follow-on patch).

The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e. 10 (4MB); this trades off randomization granularity for time spent shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page allocator while still showing memory-side cache behavior improvements, and with the expectation that the security implications of finer granularity randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM.

The performance impact of the shuffling appears to be in the noise compared to other memory initialization work.

This initial randomization can be undone over time, so a follow-on patch is introduced to inject entropy on page free decisions. It is reasonable to ask whether the page free entropy alone is sufficient; it is not, due to the in-order initial freeing of pages. At the start of that process, putting page1 in front of or behind page0 still keeps them close together, and page2 is still near page1 and has a high chance of being adjacent. As more pages are added ordering diversity improves, but there is still high page locality for the low address pages, and this leads to no significant impact on the cache conflict rate.
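
The shuffle described above can be illustrated with a small userspace sketch. This is not the kernel's shuffle_zone() and does not touch 'free_area' lists. It is a textbook Fisher-Yates shuffle over an array of block start addresses, plus the arithmetic behind the "order 10 is 4MB" figure. The names SKETCH_PAGE_SIZE, block_bytes(), rand_below() and shuffle_blocks() are hypothetical, and the 4 KiB base page size is an assumption that the commit message does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumption for illustration only: 4 KiB base pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Bytes covered by one free block of the given order (2^order pages). */
static unsigned long block_bytes(unsigned int order)
{
	return (1UL << order) * SKETCH_PAGE_SIZE;
}

/* Uniform index in [0, bound), using rejection sampling to avoid modulo bias. */
static size_t rand_below(size_t bound)
{
	unsigned long range = (unsigned long)RAND_MAX + 1UL;
	unsigned long limit;
	unsigned long r;

	assert(bound > 0 && bound <= range);
	limit = range - (range % bound);
	do {
		r = (unsigned long)rand();
	} while (r >= limit);

	return (size_t)(r % bound);
}

/*
 * Textbook Fisher-Yates: walk from the last slot down and swap each slot
 * with a uniformly chosen slot at or below it. Every permutation of the
 * n entries is equally likely, given a uniform random source.
 */
static void shuffle_blocks(unsigned long *blocks, size_t n)
{
	for (size_t i = n; i > 1; i--) {
		size_t j = rand_below(i);
		unsigned long tmp = blocks[i - 1];

		blocks[i - 1] = blocks[j];
		blocks[j] = tmp;
	}
}
```

Filling an array with start addresses spaced block_bytes(10) apart and passing it to shuffle_blocks() returns the same addresses in a randomized order, which is the property the commit relies on: in-order physical blocks no longer sit next to each other on the list. rand() is used only to keep the sketch portable; it is not a suitable entropy source for anything security related.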

[1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
[3]: https://lkml.org/lkml/2018/10/12/309

[dan.j.williams@intel.com: fix shuffle enable]
Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
[cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>		2019-05-14 19:52:48 -07:00
..
kasan	arm64 updates for 5.2	2019-05-06 17:54:22 -07:00
backing-dev.c	writeback: synchronize sync(2) against cgroup writeback membership switches	2019-01-22 14:39:38 -07:00
balloon_compaction.c
cleancache.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
cma.c	mm/cma.c: fix crash on CMA allocation if bitmap allocation fails	2019-05-14 09:47:47 -07:00
cma.h
cma_debug.c	mm/cma_debug.c: fix the break condition in cma_maxchunk_get()	2019-05-14 09:47:45 -07:00
compaction.c	mm/compaction.c: fix an undefined behaviour	2019-05-14 09:47:46 -07:00
debug.c	mm: update references to page _refcount	2019-05-14 19:52:47 -07:00
debug_page_ref.c
dmapool.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
early_ioremap.c
fadvise.c	vfs: implement readahead(2) using POSIX_FADV_WILLNEED	2018-08-30 20:01:32 +02:00
failslab.c	mm: no need to check return value of debugfs_create functions	2019-03-05 21:07:17 -08:00
filemap.c	mm: delete find_get_entries_tag	2019-05-14 09:47:51 -07:00
frame_vector.c
frontswap.c	mm: use octal not symbolic permissions	2018-06-15 07:55:25 +09:00
gup.c	mm: introduce put_user_page*(), placeholder versions	2019-05-14 09:47:47 -07:00
gup_benchmark.c	mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM	2019-05-14 09:47:45 -07:00
highmem.c	mm: convert totalram_pages and totalhigh_pages variables to atomic	2018-12-28 12:11:47 -08:00
hmm.c	mm/mmu_notifier: convert user range->blockable to helper function	2019-05-14 09:47:49 -07:00
huge_memory.c	mm/huge_memory.c: make __thp_get_unmapped_area static	2019-05-14 09:47:51 -07:00
hugetlb.c	hugetlbfs: always use address space in inode for resv_map pointer	2019-05-14 09:47:50 -07:00
hugetlb_cgroup.c	mm: rename page_counter's count/limit into usage/max	2018-06-07 17:34:35 -07:00
hwpoison-inject.c	mm/memory_failure: Remove unused trapno from memory_failure	2018-01-23 12:17:42 -06:00
init-mm.c	mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids	2018-07-17 09:35:30 +02:00
internal.h	mm, compaction: capture a page under direct compaction	2019-03-05 21:07:17 -08:00
interval_tree.c	mm/interval_tree.c: use vma_pages() helper	2018-01-31 17:18:37 -08:00
Kconfig	mm/Kconfig: update "Memory Model" help text	2019-05-14 09:47:51 -07:00
Kconfig.debug	mm: remove redundant 'default n' from Kconfig-s	2019-05-14 09:47:50 -07:00
khugepaged.c	mm/mmu_notifier: use correct mmu_notifier events for each invalidation	2019-05-14 09:47:49 -07:00
kmemleak-test.c
kmemleak.c	Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2019-05-06 13:11:48 -07:00
ksm.c	mm/mmu_notifier: use correct mmu_notifier events for each invalidation	2019-05-14 09:47:49 -07:00
list_lru.c	numa: make "nr_node_ids" unsigned int	2019-03-05 21:07:19 -08:00
maccess.c	Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"	2019-02-25 09:10:51 -08:00
madvise.c	mm/mmu_notifier: use correct mmu_notifier events for each invalidation	2019-05-14 09:47:49 -07:00
Makefile	mm: shuffle initial free memory to improve memory-side-cache utilization	2019-05-14 19:52:48 -07:00
memblock.c	mm: memblock: make keeping memblock memory opt-in rather than opt-out	2019-05-14 09:47:50 -07:00
memcontrol.c	mm: memcontrol: quarantine the mem_cgroup_[node_]nr_lru_pages() API	2019-05-14 09:47:47 -07:00
memfd.c	mm: page cache: store only head pages in i_pages	2019-05-14 09:47:45 -07:00
memory-failure.c	mm: hwpoison: fix thp split handing in soft_offline_in_use_page()	2019-03-05 21:07:13 -08:00
memory.c	mm: introduce new vm_map_pages() and vm_map_pages_zero() API	2019-05-14 09:47:50 -07:00
memory_hotplug.c	mm: shuffle initial free memory to improve memory-side-cache utilization	2019-05-14 19:52:48 -07:00
mempolicy.c	mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified	2019-03-29 10:01:37 -07:00
mempool.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
memtest.c
migrate.c	mm/mmu_notifier: use correct mmu_notifier events for each invalidation	2019-05-14 09:47:49 -07:00
mincore.c	Revert "Change mincore() to count "mapped" pages rather than "cached" pages"	2019-01-24 09:04:37 +13:00
mlock.c	mm: remove zone_lru_lock() function, access ->lru_lock directly	2019-03-05 21:07:21 -08:00
mm_init.c	mm: convert totalram_pages and totalhigh_pages variables to atomic	2018-12-28 12:11:47 -08:00
mmap.c	coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping	2019-04-19 09:46:05 -07:00
mmu_context.c
mmu_gather.c	asm-generic/tlb: Remove tlb_table_flush()	2019-04-03 10:33:02 +02:00
mmu_notifier.c	mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper	2019-05-14 09:47:49 -07:00
mmzone.c
mprotect.c	mm/mprotect.c: fix compilation warning because of unused 'mm' variable	2019-05-14 09:47:51 -07:00
mremap.c	mm/mmu_notifier: contextual information for event triggering invalidation	2019-05-14 09:47:49 -07:00
msync.c
nommu.c	mm: introduce new vm_map_pages() and vm_map_pages_zero() API	2019-05-14 09:47:50 -07:00
oom_kill.c	mm/mmu_notifier: contextual information for event triggering invalidation	2019-05-14 09:47:49 -07:00
page-writeback.c	mm/page-writeback: introduce tracepoint for wait_on_page_writeback()	2019-05-14 09:47:51 -07:00
page_alloc.c	mm: shuffle initial free memory to improve memory-side-cache utilization	2019-05-14 19:52:48 -07:00
page_counter.c	memcg: introduce memory.min	2018-06-07 17:34:36 -07:00
page_ext.c	memblock: drop memblock_alloc_*_nopanic() variants	2019-03-12 10:04:02 -07:00
page_idle.c	mm: remove zone_lru_lock() function, access ->lru_lock directly	2019-03-05 21:07:21 -08:00
page_io.c	mm/page_io.c: fix polled swap page in	2019-01-04 13:13:48 -08:00
page_isolation.c	mm/page_isolation.c: remove redundant pfn_valid_within() in __first_valid_page()	2019-05-14 09:47:46 -07:00
page_owner.c	mm/page_owner: Simplify stack trace handling	2019-04-29 12:37:50 +02:00
page_poison.c	page_poison: play nicely with KASAN	2019-03-05 21:07:13 -08:00
page_vma_mapped.c	mm/rmap: map_pte() was not handling private ZONE_DEVICE page properly	2018-10-31 08:54:11 -07:00
pagewalk.c	mm: kernel-doc: add missing parameter descriptions	2018-04-05 21:36:27 -07:00
percpu-internal.h	percpu: convert chunk hints to be based on pcpu_block_md	2019-03-13 12:25:31 -07:00
percpu-km.c	percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE	2019-03-13 12:25:31 -07:00
percpu-stats.c	percpu: convert chunk hints to be based on pcpu_block_md	2019-03-13 12:25:31 -07:00
percpu-vm.c	percpu: allow select gfp to be passed to underlying allocators	2018-02-18 05:33:01 -08:00
percpu.c	Merge branch 'for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu	2019-05-13 15:34:03 -07:00
pgtable-generic.c	x86/mm: Page size aware flush_tlb_mm_range()	2018-10-09 16:51:11 +02:00
process_vm_access.c	mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors	2018-02-06 18:32:48 -08:00
quicklist.c
readahead.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
rmap.c	mm/rmap.c: use the pra.mapcount to do the check	2019-05-14 09:47:49 -07:00
rodata_test.c
shmem.c	mm: page cache: store only head pages in i_pages	2019-05-14 09:47:45 -07:00
shuffle.c	mm: shuffle initial free memory to improve memory-side-cache utilization	2019-05-14 19:52:48 -07:00
shuffle.h	mm: shuffle initial free memory to improve memory-side-cache utilization	2019-05-14 19:52:48 -07:00
slab.c	mm/slab.c: fix an infinite loop in leaks_show()	2019-05-14 09:47:45 -07:00
slab.h	mm: add support for kmem caches in DMA32 zone	2019-03-29 10:01:37 -07:00
slab_common.c	mm: add support for kmem caches in DMA32 zone	2019-03-29 10:01:37 -07:00
slob.c	slob: use slab_list instead of lru	2019-05-14 09:47:44 -07:00
slub.c	mm/slub.c: update the comment about slab frozen	2019-05-14 09:47:45 -07:00
sparse-vmemmap.c	mm: remove include/linux/bootmem.h	2018-10-31 08:54:16 -07:00
sparse.c	mm/sparse.c: clean up obsolete code comment	2019-05-14 09:47:48 -07:00
swap.c	mm/swap.c: __pagevec_lru_add_fn: typo fix	2019-05-14 09:47:48 -07:00
swap_cgroup.c
swap_slots.c	mm, swap, get_swap_pages: use entry_size instead of cluster in parameter	2018-08-22 10:52:44 -07:00
swap_state.c	mm: page cache: store only head pages in i_pages	2019-05-14 09:47:45 -07:00
swapfile.c	mm: swapoff: shmem_unuse() stop eviction without igrab()	2019-04-19 09:46:04 -07:00
truncate.c	docs/core-api/mm: fix return value descriptions in mm/	2019-03-05 21:07:20 -08:00
usercopy.c	mm/usercopy.c: no check page span for stack objects	2019-01-08 17:15:11 -08:00
userfaultfd.c	hugetlb: use same fault hash key for shared and private mappings	2019-05-14 09:47:48 -07:00
util.c	mm: fix false-positive OVERCOMMIT_GUESS failures	2019-05-14 09:47:50 -07:00
vmacache.c	mm: get rid of vmacache_flush_all() entirely	2018-09-13 15:18:04 -10:00
vmalloc.c	mm/vmalloc.c: convert vmap_lazy_nr to atomic_long_t	2019-05-14 19:52:48 -07:00
vmpressure.c	mm/vmpressure.c: convert to use match_string() helper	2018-06-07 17:34:36 -07:00
vmscan.c	mm/vmscan.c: don't disable irq again when count pgrefill for memcg	2019-05-14 09:47:51 -07:00
vmstat.c	mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n	2019-04-19 09:46:04 -07:00
workingset.c	mm: memcontrol: push down mem_cgroup_node_nr_lru_pages()	2019-05-14 09:47:46 -07:00
z3fold.c	mm/z3fold.c: support page migration	2019-05-14 09:47:50 -07:00
zbud.c	mm: docs: fix parameter names mismatch	2018-02-06 18:32:48 -08:00
zpool.c	mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc	2018-02-21 15:35:43 -08:00
zsmalloc.c	mm/zsmalloc.c: fix fall-through annotation	2018-10-26 16:26:35 -07:00
zswap.c	mm: convert totalram_pages and totalhigh_pages variables to atomic	2018-12-28 12:11:47 -08:00