Commit graph

1250784 commits

Author SHA1 Message Date
Kent Overstreet 11def1888f bcachefs: Factor out check_subvol_dirent()
Going to be adding more code here for checking subvol structure.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet ce3e9283de bcachefs: Kill some -EINVALs
Repurposing standard error codes in bcachefs code is banned in new code,
and we need to get rid of the remaining ones - private error codes give
us much better error messages.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet 82fdc1dc98 bcachefs: bump max_active on btree_interior_update_worker
WQ_UNBOUND with max_active 1 means ordered workqueue, but we don't
actually need or want ordered semantics - and probably want a higher
concurrency limit anyways.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet 69c8e6ce02 bcachefs: move fsck_write_inode() to inode.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet 29223b5a55 bcachefs: Initialize super_block->s_uuid
Need to fix this oversight for the new FS_IOC_(GET|SET)UUID ioctls.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet f8f8fb443b bcachefs: Switch to uuid_to_fsid()
switch the statfs code from something horrible and open coded to the
more standard uuid_to_fsid()

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:09 -04:00
Kent Overstreet 7f76b08aca bcachefs: Subvolumes may now be renamed
Files within a subvolume cannot be renamed into another subvolume, but
subvolumes themselves were intended to be.

This implements subvolume renaming - we need to ensure that there's only
a single dirent that points to a subvolume key (not multiple versions in
different snapshots), and we need to ensure that dirent.d_parent_subol
and inode.bi_parent_subvol are updated.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 5f43b0134e bcachefs: btree node prefetching in check_topology
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.

lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet fc634d8e46 bcachefs: btree_and_journal_iter.trans
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 916abefd43 bcachefs: better journal pipelining
Recently a severe performance regression was discovered, which bisected
to

  a6548c8b5e bcachefs: Avoid flushing the journal in the discard path

It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.

The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.

This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 38789c2508 bcachefs: closure per journal buf
Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 5165400275 bcachefs: bio per journal buf
Prep work for having multiple journal writes in flight.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 52f7d75e7d bcachefs: jset_entry_datetime
This gives us a way to record the date and time every journal entry was
written - useful for debugging.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 3d3d23b341 bcachefs: improve journal entry read fsck error messages
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet a555bcf4fa bcachefs: convert journal replay ptrs to darray
Eliminates some error paths - no longer have a hardcoded
BCH_REPLICAS_MAX limit.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 5b6271b509 bcachefs: Cleanup bch2_dirent_lookup_trans()
Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __
from the name - this is a normal interface.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 23f2522315 bcachefs: bch2_hash_set_snapshot() -> bch2_hash_set_in_snapshot()
Minor renaming for clarity, bit of refactoring.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 6b83aee8a4 bcachefs: Workqueues should be WQ_HIGHPRI
Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 3f305e0498 bcachefs: Improve bch2_dirent_to_text()
For DT_SUBVOL, we now print both parent and child subvol IDs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 7b05ecbafc bcachefs: fixup for building in userspace
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet e6fab655e6 bcachefs: Avoid taking journal lock unnecessarily
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.

This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet bdec47f57f bcachefs: Journal writes should be REQ_SYNC|REQ_META
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet a4e9233911 bcachefs: Avoid setting j->write_work unnecessarily
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 656f05d8bd bcachefs: Split out journal workqueue
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Kent Overstreet 4f70176cb9 bcachefs: Kill unnecessary wakeups in journal reclaim
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:08 -04:00
Guoyu Ou 0be5b38bce bcachefs: skip invisible entries in empty subvolume checking
When we are checking whether a subvolume is empty in the specified snapshot,
entries that do not belong to this subvolume should be skipped.

This fixes the following case:

    $ bcachefs subvolume create ./sub
    $ cd sub
    $ bcachefs subvolume create ./sub2
    $ bcachefs subvolume snapshot . ./snap
    $ ls -a snap
    . ..
    $ rmdir snap
    rmdir: failed to remove 'snap': Directory not empty

As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore subvols
in the subvol we are checking, because inode.bi_subvol is only set on
subvolume roots, and we can't go through every inode in the subvolume and
change bi_subvol when taking a snapshot. It makes the check less strict, but
that's ok, the rest of fsck will still catch it.

Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:34:07 -04:00
Kent Overstreet 067f244c9e bcachefs: fix split brain message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:30:56 -04:00
Kent Overstreet fadc6067f2 bcachefs: Set path->uptodate when no node at level
We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:30:56 -04:00
Kent Overstreet 94817db956 bcachefs: Correctly validate k->u64s in btree node read path
validate_bset_keys() never properly validated k->u64s; it checked if it
was 0, but not if it was smaller than keys for the given packed format;
this fixes that small oversight.

This patch was backported, so it's adding quite a few error enums so
that they don't get renumbered and we don't have confusing gaps.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:21:04 -04:00
Kent Overstreet b3eba6a4a7 bcachefs: Fix degraded mode fsck
We don't know where the superblock and journal lives on offline devices;
that means if a device is offline fsck can't check those buckets.

Previously, fsck would incorrectly clear bucket data types for those
buckets on offline devices; now we just use the previous state.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:18:45 -04:00
Kent Overstreet ba89083e9f bcachefs: Fix journal replay with unreadable btree roots
When a btree root is unreadable, we still might be able to get some data
back by replaying what's in the journal. Previously though, we got
confused when journal replay would attempt to replay a key for a level
that didn't exist.

This adds bch2_btree_increase_depth(), so that journal replay can handle
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:18:13 -04:00
Kent Overstreet 52f3a72fa7 bcachefs: fix check_inode_deleted_list()
check_inode_deleted_list() returns true if the inode is on the deleted
list; check_inode() was checking the return code incorrectly.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:17:00 -04:00
Kent Overstreet 2f300f09c7 bcachefs: no_splitbrain_check option
This adds an option to disable kicking out devices when splitbrain is
detected - it seems there's some issues with splitbrain detection and
we're kicking out devices erronously.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:12:54 -04:00
Kent Overstreet 88005d5dfb bcachefs: extent_entry_next_safe()
We need to be able to iterate over extent ptrs that may be corrupted in
order to print them - this fixes a bug where we'd pop an assert in
bch2_bkey_durability_safe().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:12:13 -04:00
Kent Overstreet 6fa30fe7f7 bcachefs: journal_seq_blacklist_add() now handles entries being added out of order
bch2_journal_seq_blacklist_add() was bugged when the new entry
overlapped with multiple existing entries, and it also assumed new
entries are being added in increasing order.

This is true on any sane filesystem, but when trying to recover from
very badly mangled filesystems we might end up with the journal sequence
number rewinding vs. what the blacklist list knows about - easiest to
just handle that here.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:09:59 -04:00
Li Zetao f8cdf65b51 bcachefs: Fix null-ptr-deref in bch2_fs_alloc()
There is a null-ptr-deref issue reported by kasan:

  KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
  Call Trace:
    <TASK>
    bch2_fs_alloc+0x1092/0x2170 [bcachefs]
    bch2_fs_open+0x683/0xe10 [bcachefs]
    ...

When initializing the name of bch_fs, it needs to dynamically alloc memory
to meet the length of the name. However, when name allocation failed, it
will cause a null-ptr-deref access exception in subsequent string copy.

Fix this issue by checking if name allocation is successful.

Fixes: 401ec4db63 ("bcachefs: Printbuf rework")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-03-10 15:09:59 -04:00
Linus Torvalds d206a76d7d Linux 6.8-rc6 2024-02-25 15:46:06 -08:00
Linus Torvalds e231dbd452 bcachefs fixes for 6.8-rc6
Some more mostly boring fixes, but some not
 
 User reported ones:
  - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty performance
    bug; user reported an unter initially taking 2 seconds and then ~2
    minutes
 
  - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
    from the trickier fix to kill __GFP_NOFAIL in readahead, where we
    can't return errors (and have to silently truncate the read
    ourselves).
 
    bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
    filesystems because our folio state is just barely too big, 2MB
    hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.
 
    additionally, the flags argument was just buggy, we weren't supplying
    GFP_KERNEL previously (!).
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmXbqqMACgkQE6szbY3K
 bnYjnhAApY0vT6eVIYrZ7JGR6tw++xw02xRkcNW4zFE8INAvxQor5TXMEKkJs9Ui
 owh8WZjydXe0FJPE+pROcHMfxkkup4yP2SafgzR8DGERBwZbV9x7hvUbdG90EngY
 V/MevV+vr6UaV7133sY70K8BqUA/yAlCmmtOQVFgGRprEtEPS4Ur3vYR5+IzA0N7
 OhNXu6LxzkYbrNp9qroCN2UEVgRDJ/Mtda6uHfIUrqOQMUhiq2og9kvzJXzIrW9l
 URxm4eFQtJe0Yz09Ppypve+FutJIbtuDEYbcMJNT9Ig7BosD5vDjy9nhp8A5Q1Uk
 oDWBbCJhDdSYSVC/EQY8bv0AaCkyCa7vshSoKq0fDCFJ8k+nQ1YMF5wNhfgJhtU9
 Tl2Qytphp9/dxkvpIsR/5iNhLply9xTka1Wkp3G+3QJk0c17Dftpvz0/WhKI0P2B
 d6y4mz/hfCtWoSQOJbJl3fM/ZVpjH54VHDmb7sGyb5f+bTUkX6OUoJ4os8MNKGcS
 GdpEoWt/IAQj69c7w8aama5TXJ4kYe0XtXwbHTRE4j1PIQJA5SPvVt+32spRtb6i
 1gIa94uWKYMuG2U0XGxookHfZZZaMQkl79oXJOYRiC589YVyZC1Lp5iqr027jHEQ
 1HacrWPekPfmrhchyIzpH1mHOgaS+FKoD7eKrkvj0QSxpwfwpbI=
 =KNWR
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Some more mostly boring fixes, but some not

  User reported ones:

   - the BTREE_ITER_FILTER_SNAPSHOTS one fixes a really nasty
     performance bug; user reported an untar initially taking two
     seconds and then ~2 minutes

   - kill a __GFP_NOFAIL in the buffered read path; this was a leftover
     from the trickier fix to kill __GFP_NOFAIL in readahead, where we
     can't return errors (and have to silently truncate the read
     ourselves).

     bcachefs can't use GFP_NOFAIL for folio state unlike iomap based
     filesystems because our folio state is just barely too big, 2MB
     hugepages cause us to exceed the 2 page threshhold for GFP_NOFAIL.

     additionally, the flags argument was just buggy, we weren't
     supplying GFP_KERNEL previously (!)"

* tag 'bcachefs-2024-02-25' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: fix bch2_save_backtrace()
  bcachefs: Fix check_snapshot() memcpy
  bcachefs: Fix bch2_journal_flush_device_pins()
  bcachefs: fix iov_iter count underflow on sub-block dio read
  bcachefs: Fix BTREE_ITER_FILTER_SNAPSHOTS on inodes btree
  bcachefs: Kill __GFP_NOFAIL in buffered read path
  bcachefs: fix backpointer_to_text() when dev does not exist
2024-02-25 15:31:57 -08:00
Kent Overstreet 5197728f81 bcachefs: fix bch2_save_backtrace()
Missed a call in the previous fix.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-02-25 15:45:36 -05:00
Linus Torvalds 70ff1fe626 Two documentation build fixes:
- The XFS online fsck documentation uses incredibly deeply nested
   subsection and list nesting; that broke the PDF docs build.  Tweak a
   parameter to tell LaTeX to allow the deeper nesting.
 
 - Fix a 6.8 PDF-build regression
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAmXbi5QPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YZSMH/RIZh48S/Jh5mhjzqnKhGf1sFn6lSk8sFY3I
 uJqML/LPo6GYzX8WvYKlfyP9+UvrLiDcQF0Er6MeIhK6mhKE1Lp7w1YvRgeXcgFR
 H9DtxA4fJSGWlAaMqZBwsXjF2EFwjyxHtHUeNyaJ+YocHfrT6L9Cp9uBEvdT3Iye
 F191VpjWLrFD0DJEh64CcmNd3rggN5jeD/n24dbNOmnem1cak2brIIUeltdkUmQG
 48Hr27xqYF1QyVckfoRtnT/C3AyaCKbxRbTxeAjwUDjU+7nCsHf1MKltiFAZHnFs
 7ZLsOboLhmR+y9xiZUg7OlpRaVj1C+7JSYC+WSaNjwRkkIfJUu4=
 =MEzm
 -----END PGP SIGNATURE-----

Merge tag 'docs-6.8-fixes3' of git://git.lwn.net/linux

Pull two documentation build fixes from Jonathan Corbet:

 - The XFS online fsck documentation uses incredibly deeply nested
   subsection and list nesting; that broke the PDF docs build. Tweak a
   parameter to tell LaTeX to allow the deeper nesting.

 - Fix a 6.8 PDF-build regression

* tag 'docs-6.8-fixes3' of git://git.lwn.net/linux:
  docs: translations: use attribute to store current language
  docs: Instruct LaTeX to cope with deeper nesting
2024-02-25 10:58:12 -08:00
Linus Torvalds c46ac50ebe USB fixes for 6.8-rc6
Here are some small USB fixes for 6.8-rc6 to resolve some reported
 problems.  These include:
   - regression fixes with typec tpcm code as reported by many
   - cdnsp and cdns3 driver fixes
   - usb role setting code bugfixes
   - build fix for uhci driver
   - ncm gadget driver bugfix
   - MAINTAINERS entry update
 
 All of these have been in linux-next all week with no reported issues
 and there is at least one fix in here that is in Thorsten's regression
 list that is being tracked.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZdtGEA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymzsgCg2IsWqIR72XUGsa5rrbRnskOP/G4An24BmUb6
 t34d0VjiHagZTFlfRx6g
 =eOL1
 -----END PGP SIGNATURE-----

Merge tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB fixes for 6.8-rc6 to resolve some reported
  problems. These include:

   - regression fixes with typec tpcm code as reported by many

   - cdnsp and cdns3 driver fixes

   - usb role setting code bugfixes

   - build fix for uhci driver

   - ncm gadget driver bugfix

   - MAINTAINERS entry update

  All of these have been in linux-next all week with no reported issues
  and there is at least one fix in here that is in Thorsten's regression
  list that is being tracked"

* tag 'usb-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: tpcm: Fix issues with power being removed during reset
  MAINTAINERS: Drop myself as maintainer of TYPEC port controller drivers
  usb: gadget: ncm: Avoid dropping datagrams of properly parsed NTBs
  Revert "usb: typec: tcpm: reset counter when enter into unattached state after try role"
  usb: gadget: omap_udc: fix USB gadget regression on Palm TE
  usb: dwc3: gadget: Don't disconnect if not started
  usb: cdns3: fix memory double free when handle zero packet
  usb: cdns3: fixed memory use after free at cdns3_gadget_ep_disable()
  usb: roles: don't get/set_role() when usb_role_switch is unregistered
  usb: roles: fix NULL pointer issue when put module's reference
  usb: cdnsp: fixed issue with incorrect detecting CDNSP family controllers
  usb: cdnsp: blocked some cdns3 specific code
  usb: uhci-grlib: Explicitly include linux/platform_device.h
2024-02-25 10:41:57 -08:00
Linus Torvalds 1e592e9536 TTY/Serial driver fixes for 6.8-rc6
Here are 3 small serial/tty driver fixes for 6.8-rc6 that resolve the
 following reported errors:
   - riscv hvc console driver fix that was reported by many
   - amba-pl011 serial driver fix for RS485 mode
   - stm32 serial driver fix for RS485 mode
 
 All of these have been in linux-next all week with no reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZdtGnA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymrqwCfSIsUj9GLazXJTTTgMz1I94HXLrQAnjq9QOtg
 EFt6xmUGcF4zFhnfSLal
 =/k5+
 -----END PGP SIGNATURE-----

Merge tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
 "Here are three small serial/tty driver fixes for 6.8-rc6 that resolve
  the following reported errors:

   - riscv hvc console driver fix that was reported by many

   - amba-pl011 serial driver fix for RS485 mode

   - stm32 serial driver fix for RS485 mode

  All of these have been in linux-next all week with no reported
  problems"

* tag 'tty-6.8-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: amba-pl011: Fix DMA transmission in RS485 mode
  serial: stm32: do not always set SER_RS485_RX_DURING_TX if RS485 is enabled
  tty: hvc: Don't enable the RISC-V SBI console by default
2024-02-25 10:35:41 -08:00
Linus Torvalds 1eee4ef38c - Make sure clearing CPU buffers using VERW happens at the latest possible
point in the return-to-userspace path, otherwise memory accesses after
   the VERW execution could cause data to land in CPU buffers again
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXbG7IACgkQEsHwGGHe
 VUoEEg//d1qt/PEWCC23wMO6gLMl4J/e4ZQAuGOKGed/jUmOaQKpHJmpDMRc0li5
 llRYDdfE0ikmtQT3t9vQDs3xbWfT5bLMsijliRimb193FaS1HGlHMMS1nxhfjyfv
 MecbWfkwzX2JnrxJpsbfue+7kks3HyIXYsXV7kSFiHavk4F3GFQXYLO11pKbNQwN
 9UfjJDeVsrcWPGCHhoPKF5NHUnQKIA8ZC6g8yBq894AtdWOhFY7ePKBZefUWQQ1n
 myc5GJ3dKFICMCZvkMABtHYCmHU/W3y/6tPtnrz3kT8GdCIAHG+K9VRUfY1ml94H
 x327GoM3sEzHLsPizKy00/Uao+j6FOtv631LoDLsO2MF3sHoTZDaSgg5y2D/ZC7t
 IZdK3mUGtdINRhGiWWpdxyaMfkQ62cdZk8FkeYkRAewYS6WYSdMX3cPqFNy4Ss5u
 r3reMOD3JcxAatcqhHMXjARMfY+N08gQBpxBul3ejgH8t8aY7xJx6Vggty5kBlHZ
 7urV9jIRxSXfbBmOcYu6HP1ucFLWNSUQCBn7Imrh+5zbE1XVv7NaAWvT4Nmgb0/X
 57fHoYYSVwaJ0k3zWWM7QYEdcuJ7IZnVgTCQYx26Ec2AOQRxE9ose+awTLYtTbp1
 T+XaOlItHKMRzx9K46D7xHwmC5qiokFki3exp5vfGZxGyT3+t/c=
 =n5us
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Make sure clearing CPU buffers using VERW happens at the latest
   possible point in the return-to-userspace path, otherwise memory
   accesses after the VERW execution could cause data to land in CPU
   buffers again

* tag 'x86_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  KVM/VMX: Move VERW closer to VMentry for MDS mitigation
  KVM/VMX: Use BT+JNC, i.e. EFLAGS.CF to select VMRESUME vs. VMLAUNCH
  x86/bugs: Use ALTERNATIVE() instead of mds_user_clear static key
  x86/entry_32: Add VERW just before userspace transition
  x86/entry_64: Add VERW just before userspace transition
  x86/bugs: Add asm helpers for executing VERW
2024-02-25 10:22:21 -08:00
Linus Torvalds 8c46ed3740 - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
from silently failing to set it up
 
 - Do not call bus_get_dev_root() for the mbigen irqchip as it always
   returns NULL - use NULL directly
 
 - Fix hardware interrupt number truncation when assigning MSI interrupts
 
 - Correct sending end-of-interrupt messages to disabled interrupts lines on
   RISC-V PLIC
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmXbGgcACgkQEsHwGGHe
 VUqrig//ay2UcLEi8CwxobHXIpuUq+pMt1pLhdDtyehTKx+T44GCwMFXGML8H27A
 8CszKTEJsRxXuUP1iXquECfYqYqGmOZHcIMCX0vDodezRriJXq3m549zdoVY6LIy
 m7x5mN4rfc8xaK/krSz0IKgCn7TZ7Nugw8zHE9PEJ7hj/exIA6EH2f1p0dbDc2z8
 PRWsexi39mVLEstOl7yf5+hys6RN07a+9+PFJrEWCC0bO5We9Z+m7gnpu3zUrwcO
 LlDAU6UwWhVc+xFipW9SFYEhCqtprdfUftf1OW2BLe1TM7pHxdvA3OwlT5ZxxN90
 h4wmQ084v08hcn8YpUkaK5fWEtT+1isD3/8dVUMSRQQ4jcjLiEAVdVOvKOmJNEeJ
 +MYqAktCoyay9ZYCrpZRRIVYfC4/FLMEPExPwILFM6nfMMVEBfkDqFyjGyw3uls7
 QT7eHSo121kQsZc5V/8SuU30f64w/vJbtaZIYOjSR9+hQMQ+i+8cMuV0RSvmDIa2
 1vGdhOLvG5Rk7Wy9xd9ITXbeq+z9KD/tH8BadyfARosvQTrg6aMhhqQRHWJtS6Pf
 Vg50yS6D8ETzNNZSPFABrjBKiqQmEo3ILlUpMbR8jGBaLggqZhTs8eBRj3+XIXp8
 UxgB2b47qKmZI9eaFzne9scnOGmQSRuZ7x8IrwworFGzChSzwpU=
 =cMur
 -----END PGP SIGNATURE-----

Merge tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Make sure GICv4 always gets initialized to prevent a kexec-ed kernel
   from silently failing to set it up

 - Do not call bus_get_dev_root() for the mbigen irqchip as it always
   returns NULL - use NULL directly

 - Fix hardware interrupt number truncation when assigning MSI
   interrupts

 - Correct sending end-of-interrupt messages to disabled interrupts
   lines on RISC-V PLIC

* tag 'irq_urgent_for_v6.8_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Do not assume vPE tables are preallocated
  irqchip/mbigen: Don't use bus_get_dev_root() to find the parent
  PCI/MSI: Prevent MSI hardware interrupt number truncation
  irqchip/sifive-plic: Enable interrupt if needed before EOI
2024-02-25 10:14:12 -08:00
Linus Torvalds 4ca0d9894f Change since last update:
- Fix page refcount leak when looking up specific inodes
    introduced by metabuf reworking.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmXbQekRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qpkyg//cjKnuzI2gHL3ff1o0jiTMD11Y/EmwzQS
 C+JgE8WDMjUlVQpbwYnFWo/6se8qvk/fwTXPcRc45piF8YNZVchwsagxO626Kab7
 0xTX7ZUgjuth6ahrkAJZkJNMgO4mYf928uYB6EVWaTb4iF0iT6glEOsSVAR/cssL
 wYKbtgv3OP9t6fQcN/XL31hRkqr2MPk0y5Q27KyT4zo4lrx3xyah7Ndo3aEK/RcM
 +6FUwqRiDsgDF/Ga65ylDvEp9eA03OFNHBn4DrORe3B9KV75NkmSJf/8QVEceNV/
 9D072Hvt7/iyOq53AxWH3Jvp7aro/i0rvAHPbXZX4RVyqcJxaLYCQyrBFvQL0/Ie
 B+793Iua8zbkQCbZ85LpTGrxAb5WydlSJp10AuHTr2MO+wS8bBqf96Jp9x1MuS9D
 vqq7jbjwuZfnFUjpzu49GF6htG+WRVgY5TDzU6IAr3izXuqVxmz+zw5zExbGMmKm
 2S5lb+q68DOBP4YaelAQHh97k0pYDW3eQ6GhDD2FA4P/DOr8p3vszZFBeaqaL70v
 WS8z1wzOgEaleTRN4iRFlMCTvRLjOtDEaUNquohcHqJLe7w/DF7gzCU+0//8dVjD
 cZieFX8uZDtzzcjrlYWeTo8oHd0q2WYixbCN8P92/YXUFIVpfdl85ugyr9KvkZxd
 jKzpAy2LbSw=
 =/Bfj
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fix from Gao Xiang:

 - Fix page refcount leak when looking up specific inodes
   introduced by metabuf reworking

* tag 'erofs-for-6.8-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix refcount on the metabuf used for inode lookup
2024-02-25 09:53:13 -08:00
Linus Torvalds 66a97c2ec9 We still have some races in filesystem methods when exposed to RCU
pathwalk.  This series is a result of code audit (the second round
 of it) and it should deal with most of that stuff.  Exceptions: ntfs3
 ->d_hash()/->d_compare() and ceph_d_revalidate().  Up to maintainers (a
 note for NTFS folks - when documentation says that a method may not block,
 it *does* imply that blocking allocations are to be avoided.  Really).
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdroDAAKCRBZ7Krx/gZQ
 60dKAQCzp6rYr3ye4nylho9Rzu8LEpH04TuNf3C6JuyUaNHxHwEAvNLatZsyFnmV
 Ksp2Rg/IlKPNtQgYJ8xPxv9DFmNe8gI=
 =47Un
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull RCU pathwalk fixes from Al Viro:
 "We still have some races in filesystem methods when exposed to RCU
  pathwalk. This series is a result of code audit (the second round of
  it) and it should deal with most of that stuff.

  Still pending: ntfs3 ->d_hash()/->d_compare() and ceph_d_revalidate().
  Up to maintainers (a note for NTFS folks - when documentation says
  that a method may not block, it *does* imply that blocking allocations
  are to be avoided. Really)"

[ More explanations for people who aren't familiar with the vagaries of
  RCU path walking: most of it is hidden from filesystems, but if a
  filesystem actively participates in the low-level path walking it
  needs to make sure the fields involved in that walk are RCU-safe.

  That "actively participate in low-level path walking" includes things
  like having its own ->d_hash()/->d_compare() routines, or by having
  its own directory permission function that doesn't just use the common
  helpers.  Having a ->d_revalidate() function will also have this issue.

  Note that instead of making everything RCU safe you can also choose to
  abort the RCU pathwalk if your operation cannot be done safely under
  RCU, but that obviously comes with a performance penalty. One common
  pattern is to allow the simple cases under RCU, and abort only if you
  need to do something more complicated.

  So not everything needs to be RCU-safe, and things like the inode etc
  that the VFS itself maintains obviously already are. But these fixes
  tend to be about properly RCU-delaying things like ->s_fs_info that
  are maintained by the filesystem and that got potentially released too
  early.   - Linus ]

* tag 'pull-fixes.pathwalk-rcu-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4_get_link(): fix breakage in RCU mode
  cifs_get_link(): bail out in unsafe case
  fuse: fix UAF in rcu pathwalks
  procfs: make freeing proc_fs_info rcu-delayed
  procfs: move dropping pde and pid from ->evict_inode() to ->free_inode()
  nfs: fix UAF on pathwalk running into umount
  nfs: make nfs_set_verifier() safe for use in RCU pathwalk
  afs: fix __afs_break_callback() / afs_drop_open_mmap() race
  hfsplus: switch to rcu-delayed unloading of nls and freeing ->s_fs_info
  exfat: move freeing sbi, upcase table and dropping nls into rcu-delayed helper
  affs: free affs_sb_info with kfree_rcu()
  rcu pathwalk: prevent bogus hard errors from may_lookup()
  fs/super.c: don't drop ->s_user_ns until we free struct super_block itself
2024-02-25 09:29:05 -08:00
Linus Torvalds 9b24349279 A couple of fixes - revert of regression from this cycle
and a fix for erofs failure exit breakage (had been there since
 way back).
 
 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZdrkZAAKCRBZ7Krx/gZQ
 67D8AP0eM68yZvbThA/Hb5iElDh3Aogt1AW/QAu9/alkDVHr+wD+PKqhamC8WXGk
 b1QZ5AOHQFwzkzdF4738fdbujquBWQE=
 =Ra0D
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs fixes from Al Viro:
 "A couple of fixes - revert of regression from this cycle and a fix for
  erofs failure exit breakage (had been there since way back)"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  erofs: fix handling kern_mount() failure
  Revert "get rid of DCACHE_GENOCIDE"
2024-02-25 09:17:15 -08:00
Al Viro 9fa8e282c2 ext4_get_link(): fix breakage in RCU mode
1) errors from ext4_getblk() should not be propagated to caller
unless we are really sure that we would've gotten the same error
in non-RCU pathwalk.
2) we leak buffer_heads if ext4_getblk() is successful, but bh is
not uptodate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro 0511fdb4a3 cifs_get_link(): bail out in unsafe case
->d_revalidate() bails out there, anyway.  It's not enough
to prevent getting into ->get_link() in RCU mode, but that
could happen only in a very contrieved setup.  Not worth
trying to do anything fancy here unless ->d_revalidate()
stops kicking out of RCU mode at least in some cases.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00
Al Viro 053fc4f755 fuse: fix UAF in rcu pathwalks
->permission(), ->get_link() and ->inode_get_acl() might dereference
->s_fs_info (and, in case of ->permission(), ->s_fs_info->fc->user_ns
as well) when called from rcu pathwalk.

Freeing ->s_fs_info->fc is rcu-delayed; we need to make freeing ->s_fs_info
and dropping ->user_ns rcu-delayed too.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-02-25 02:10:32 -05:00