Go to file
Kent Overstreet 2b69987be5 sched: Add task_struct->faults_disabled_mapping
There has been a long standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to.

For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.

However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads caches disk block mappings (!) and attaches them
to the pagecache, and bcachefs attaches disk reservations to pagecache
pages.

This issue has been hard to fix, because
 - we need to add a lock (henceforth called pagecache_add_lock), which
   would be held for the duration of the direct IO
 - page faults add pages to the page cache, thus need to take the same
   lock
 - dio -> gup -> page fault thus can deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical this adds a field to task_struct.

Then the full method is:
 - in the dio path, when we first take the pagecache_add_lock, note the
   mapping in the current task_struct
 - in the page fault handler, if faults_disabled_mapping is set, we
   check if it's the same mapping as the one we're taking a page fault
   for, and if so return an error.

   Then we check lock ordering: if there's a lock ordering violation and
   trylock fails, we'll have to cycle the locks and return an error that
   tells the DIO path to retry: faults_disabled_mapping is also used for
   signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive with
each other - we'll need a new type of lock to pair with this approach.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Jan Kara <jack@suse.cz>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Andreas Grünbacher <andreas.gruenbacher@gmail.com>
2023-09-11 23:59:46 -04:00
arch Fix preemption delays in the SGX code, remove unnecessarily UAPI-exported code, 2023-09-10 10:39:31 -07:00
block block: fix pin count management when merging same-page segments 2023-09-06 07:32:27 -06:00
certs certs: Reference revocation list for all keyrings 2023-08-17 20:12:41 +00:00
crypto This update includes the following changes: 2023-08-29 11:23:29 -07:00
Documentation drm ci for 6.6-rc1 2023-09-10 11:55:26 -07:00
drivers drm ci for 6.6-rc1 2023-09-10 11:55:26 -07:00
fs six smb3 client fixes, one fix for nls Kconfig, one minor spnego registry update 2023-09-09 19:56:23 -07:00
include sched: Add task_struct->faults_disabled_mapping 2023-09-11 23:59:46 -04:00
init sched: Add task_struct->faults_disabled_mapping 2023-09-11 23:59:46 -04:00
io_uring Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()" 2023-09-07 09:41:49 -06:00
ipc Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
kernel RISC-V Patches for the 6.6 Merge Window, Part 2 (try 2) 2023-09-09 14:25:11 -07:00
lib iov_iter: Kunit tests for page extraction 2023-09-09 15:11:49 -07:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
mm LoongArch changes for v6.6 2023-09-08 12:16:52 -07:00
net Including fixes from netfilter and bpf. 2023-09-07 18:33:07 -07:00
rust Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
samples VFIO updates for v6.6-rc1 2023-08-30 20:36:01 -07:00
scripts Fix preemption delays in the SGX code, remove unnecessarily UAPI-exported code, 2023-09-10 10:39:31 -07:00
security Landlock updates for v6.6-rc1 2023-09-08 12:06:51 -07:00
sound sound fixes for 6.6-rc1 2023-09-08 13:07:50 -07:00
tools perf tools changes for v6.6: 2023-09-09 20:06:17 -07:00
usr initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP 2023-06-06 17:54:49 +09:00
virt ARM: 2023-09-07 13:52:20 -07:00
.clang-format iommu: Add for_each_group_device() 2023-05-23 08:15:51 +02:00
.cocciconfig
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore kbuild: rpm-pkg: rename binkernel.spec to kernel.spec 2023-07-25 00:59:33 +09:00
.mailmap for-linus-2023083101 2023-09-01 12:31:44 -07:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS USB: Remove Wireless USB and UWB documentation 2023-08-09 14:17:32 +02:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS drm ci for 6.6-rc1 2023-09-10 11:55:26 -07:00
Makefile Linux 6.6-rc1 2023-09-10 16:28:41 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.