linux/arch/x86/mm
Andy Lutomirski b956575bed x86/mm: Flush more aggressively in lazy TLB mode
Since commit:

  94b1b03b51 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b51 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14 09:21:24 +02:00
..
kmemcheck x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
amdtopology.c x86/boot/e820: Move asm/e820.h to asm/e820/api.h 2017-01-28 09:31:13 +01:00
debug_pagetables.c x86/mm/ptdump: Make (debugfs)/kernel_page_tables read-only 2015-12-04 12:55:01 +01:00
dump_pagetables.c x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y 2017-07-25 11:22:09 +02:00
extable.c x86/fpu: Reinitialize FPU registers if restoring FPU state fails 2017-09-25 09:26:38 +02:00
fault.c x86/mm: Fix fault error path using unsafe vma pointer 2017-09-25 09:36:15 +02:00
highmem_32.c x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
hugetlbpage.c x86/mm: Prepare to expose larger address space to userspace 2017-07-21 10:05:18 +02:00
ident_map.c x86/mm, kexec: Allow kexec to be used with SME 2017-07-18 11:38:04 +02:00
init.c x86/mm/64: Initialize CR4.PCIDE early 2017-09-13 09:54:43 +02:00
init_32.c mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory 2017-07-06 16:24:32 -07:00
init_64.c mm/memory_hotplug: introduce add_pages 2017-09-08 18:26:46 -07:00
iomap_32.c x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
ioremap.c x86/mm: Use proper encryption attributes with /dev/mem 2017-07-18 11:38:05 +02:00
kasan_init_64.c x86/mm: Insure that boot memory areas are mapped properly 2017-07-18 11:38:01 +02:00
kaslr.c x86/mm: Add support for 5-level paging for KASLR 2017-06-13 08:56:58 +02:00
kmmio.c x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
Makefile x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.c 2017-10-11 17:22:59 +02:00
mem_encrypt.c x86/mm: Disable branch profiling in mem_encrypt.c 2017-09-29 19:37:51 +02:00
mem_encrypt_boot.S x86/mm: Fix SME encryption stack ptr handling 2017-08-29 10:57:16 +02:00
mm_internal.h x86: Enable PAT to use cache mode translation tables 2014-11-16 11:04:26 +01:00
mmap.c Merge branch 'linus' into x86/mm to pick up fixes and to fix conflicts 2017-08-26 09:19:13 +02:00
mmio-mod.c x86/boot/e820: Move asm/e820.h to asm/e820/api.h 2017-01-28 09:31:13 +01:00
mpx.c x86/mpx: Do not allow MPX if we have mappings above 47-bit 2017-07-21 10:05:18 +02:00
numa.c Merge branch 'x86/boot' into x86/mm, to avoid conflict 2017-04-11 08:56:05 +02:00
numa_32.c x86/mm/32: Set the '__vmalloc_start_set' flag in initmem_init() 2017-05-09 08:12:27 +02:00
numa_64.c x86, mm: kill numa_free_all_bootmem() 2012-11-17 11:59:47 -08:00
numa_emulation.c x86/numa_emulation: Recalculate numa_nodes_parsed from emulated nodes 2017-07-18 11:16:49 +02:00
numa_internal.h x86-32, mm: Rip out x86_32 NUMA remapping code 2013-01-31 14:12:30 -08:00
pageattr-test.c x86/mm/pat: Make mm/pageattr[-test].c explicitly non-modular 2015-08-25 09:48:38 +02:00
pageattr.c x86, drm, fbdev: Do not specify encrypted memory for video mappings 2017-07-18 11:38:04 +02:00
pat.c x86/mm: Use proper encryption attributes with /dev/mem 2017-07-18 11:38:05 +02:00
pat_internal.h x86/mm/pat: Convert to pr_*() usage 2015-05-27 14:40:59 +02:00
pat_rbtree.c x86/mm/pat: Use rb_entry() 2017-02-04 17:18:00 +01:00
pf_in.c x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
pf_in.h
pgtable.c x86/paravirt: Remove no longer used paravirt functions 2017-09-13 10:55:15 +02:00
pgtable_32.c Merge branch 'x86/boot' into x86/mm, to avoid conflict 2017-04-11 08:56:05 +02:00
physaddr.c x86/mm: Audit and remove any unnecessary uses of module.h 2016-07-14 13:04:20 +02:00
physaddr.h
pkeys.c x86/fpu: Rename fpu::fpstate_active to fpu::initialized 2017-09-26 09:43:36 +02:00
setup_nx.c Revert "x86/mm/32: Set NX in __supported_pte_mask before enabling paging" 2016-04-26 19:52:57 +02:00
srat.c x86/boot/e820: Move asm/e820.h to asm/e820/api.h 2017-01-28 09:31:13 +01:00
testmmiotrace.c Annotate hardware config module parameters in arch/x86/mm/ 2017-04-04 16:54:21 +01:00
tlb.c x86/mm: Flush more aggressively in lazy TLB mode 2017-10-14 09:21:24 +02:00