Yazen Ghannam 6f15e617cc RAS: Introduce a FRU memory poison manager
Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations are acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in mm through page poisoning, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, the same run time error
handling may occur again after the next reboot, terminating jobs because
of memory that was already known to be bad. This could be prevented if
the information from the previous boot were not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver saves memory error information to the persistent
storage during run time. The information is then restored after reset,
and known bad memory will be retired before the hardware is used.
A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should
leverage industry-standard components as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.
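
As an illustration only, the two components can be pictured with a small
user-space sketch in C. It is not the code added by this commit: every name
below (poison_record, poison_store, MAX_ENTRIES, log_bad_addr,
restore_and_retire) is hypothetical, the record layout is made up, and a
static buffer stands in for real persistent storage. The retire callback
stands in for the page poisoning done in mm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 8	/* hypothetical limit on bad locations per FRU */

/* Component 1: record format, a running log of bad locations for one FRU. */
struct poison_record {
	uint64_t fru_id;
	uint32_t nr_entries;
	uint64_t addr[MAX_ENTRIES];
};

/* Component 2: persistent storage interface; the backend is swappable. */
struct poison_store {
	int (*read)(struct poison_record *rec);
	int (*write)(const struct poison_record *rec);
};

/* Stand-in backend: a static buffer that "survives" a simulated reset. */
static struct poison_record nv_copy;
static int nv_valid;

static int nv_read(struct poison_record *rec)
{
	if (!nv_valid)
		return -1;
	*rec = nv_copy;
	return 0;
}

static int nv_write(const struct poison_record *rec)
{
	nv_copy = *rec;
	nv_valid = 1;
	return 0;
}

static const struct poison_store nv_store = {
	.read = nv_read,
	.write = nv_write,
};

/*
 * Run time: append a newly found bad address and save the record.
 * Returns 0 if saved, 1 if the address was already logged, -1 on error.
 */
static int log_bad_addr(const struct poison_store *st,
			struct poison_record *rec, uint64_t addr)
{
	assert(st && rec);

	for (uint32_t i = 0; i < rec->nr_entries; i++)
		if (rec->addr[i] == addr)
			return 1;

	if (rec->nr_entries == MAX_ENTRIES)
		return -1;

	rec->addr[rec->nr_entries++] = addr;
	return st->write(rec);
}

/*
 * After reset: restore the record and retire every known bad address
 * before the memory is used. Returns the number of addresses retired.
 */
static int restore_and_retire(const struct poison_store *st,
			      struct poison_record *rec,
			      void (*retire)(uint64_t addr))
{
	memset(rec, 0, sizeof(*rec));
	if (st->read(rec))
		return 0;

	for (uint32_t i = 0; i < rec->nr_entries; i++)
		retire(rec->addr[i]);

	return (int)rec->nr_entries;
}

/* Demo: log errors in one "boot", then restore them in the next. */
static int retired_count;

static void count_retire(uint64_t addr)
{
	(void)addr;
	retired_count++;
}

int demo_reboot_cycle(void)
{
	struct poison_record rec = { .fru_id = 1 };
	struct poison_record fresh;

	nv_valid = 0;
	retired_count = 0;

	log_bad_addr(&nv_store, &rec, 0x1000);
	log_bad_addr(&nv_store, &rec, 0x2000);
	log_bad_addr(&nv_store, &rec, 0x1000);	/* duplicate, not logged again */

	/* Simulated reset: 'rec' is discarded, only nv_copy remains. */
	return restore_and_retire(&nv_store, &fresh, count_retire);
}
```

Keeping the storage backend behind two function pointers is one way to let
the record handling stay the same if the storage mechanism changes.
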

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

  [ bp: Massage commit message and code, squash 30-ish more fixes from
    Yazen and me. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: <naveenkrishna.chatradhi@amd.com>
Signed-off-by: <naveenkrishna.chatradhi@amd.com>
Co-developed-by: <muralidhara.mk@amd.com>
Signed-off-by: <muralidhara.mk@amd.com>
Tested-by: <sathyapriya.k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
2024-02-20 18:56:15 +01:00
arch powerpc fixes for 6.8 #2 2024-01-21 11:04:29 -08:00
block for-6.8/block-2024-01-18 2024-01-18 18:22:40 -08:00
certs
crypto
Documentation Documentation: Move RAS section to admin-guide 2024-02-14 17:10:06 +01:00
drivers RAS: Introduce a FRU memory poison manager 2024-02-20 18:56:15 +01:00
fs More bcachefs updates for 6.7-rc1 2024-01-21 14:01:12 -08:00
include RAS/AMD/ATL: Add MI300 row retirement support 2024-02-14 17:10:06 +01:00
init
io_uring for-6.8/io_uring-2024-01-18 2024-01-18 18:17:57 -08:00
ipc
kernel Updates for time and clocksources: 2024-01-21 11:14:40 -08:00
lib RISC-V Patches for the 6.8 Merge Window, Part 4 2024-01-20 11:06:04 -08:00
LICENSES
mm vfs-6.8.netfs 2024-01-19 09:10:23 -08:00
net Assorted CephFS fixes and cleanups with nothing standing out. 2024-01-19 09:58:55 -08:00
rust
samples RISC-V Patches for the 6.8 Merge Window, Part 4 2024-01-20 11:06:04 -08:00
scripts Coccinelle change for v6.8 2024-01-20 14:20:34 -08:00
security + Features 2024-01-19 10:53:55 -08:00
sound sound fixes for 6.8-rc1 2024-01-19 12:30:29 -08:00
tools RISC-V Patches for the 6.8 Merge Window, Part 4 2024-01-20 11:06:04 -08:00
usr Kbuild updates for v6.8 2024-01-18 17:57:07 -08:00
virt
.clang-format
.cocciconfig
.editorconfig
.get_maintainer.ignore
.gitattributes
.gitignore
.mailmap
.rustfmt.toml
COPYING
CREDITS Including fixes from bpf and netfilter. 2024-01-18 17:33:50 -08:00
Kbuild
Kconfig
MAINTAINERS RAS: Introduce a FRU memory poison manager 2024-02-20 18:56:15 +01:00
Makefile Linux 6.8-rc1 2024-01-21 14:11:32 -08:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems that may result from upgrading your kernel.