Commit graph

77144 commits

Author SHA1 Message Date
Qu Wenruo 381b9b4c9c btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap
Previously we use "unsigned long *" for those two bitmaps.

But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".

This saves us 8 bytes in total for scrub_parity.

To be extra safe, add an ASSERT() making sure calclulated @nsectors is
always smaller than BITS_PER_LONG.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Qu Wenruo c67c68eb57 btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap
Previsouly we use "unsigned long *" for those two bitmaps.

But since we only support fixed stripe length (64KiB, already checked in
tree-checker), "unsigned long *" is really a waste of memory, while we
can just use "unsigned long".

This saves us 8 bytes in total for btrfs_raid_bio.

To be extra safe, add an ASSERT() making sure calculated
@stripe_nsectors is always smaller than BITS_PER_LONG.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Nikolay Borisov 099aa97213 btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance
This eliminates 2 labels and makes the code generally more streamlined.
Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
to be freed under the 'out' label. This also fixes a memory leak since
bargs wasn't correctly freed in one of the condition which are now moved
in btrfs_try_lock_balance.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Nikolay Borisov 7fb10ed89e btrfs: introduce btrfs_try_lock_balance
This function contains the factored out locking sequence of
btrfs_ioctl_balance. Having this piece of code separate helps to
simplify btrfs_ioctl_balance which has too complicated.  This will be
used in the next patch to streamline the logic in btrfs_ioctl_balance.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig 1e87770cb3 btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio
Use the new btrfs_bio_for_each_sector iterator to simplify
btrfs_check_read_dio_bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Qu Wenruo 261d812b04 btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks
Add a helper that works similar to __bio_for_each_segment, but instead of
iterating over PAGE_SIZE chunks it iterates over each sector.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: split from a larger patch, and iterate over the offset instead of
      the offset bits]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add parameter comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig a89ce08ce6 btrfs: factor out a btrfs_csum_ptr helper
Add a helper to find the csum for a byte offset into the csum buffer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig 97861cd166 btrfs: refactor end_bio_extent_readpage code flow
Untangle the goto and move the code it jumps to so it goes in the order
of the most likely states first.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:34 +02:00
Christoph Hellwig a5aa7ab6e7 btrfs: factor out a helper to end a single sector buffer I/O
Add a helper to end I/O on a single sector, which will come in handy
with the new read repair code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Qu Wenruo fd5a6f63cb btrfs: remove duplicated parameters from submit_data_read_repair()
The function submit_data_read_repair() is only called for buffered data
read path, thus those members can be calculated using bvec directly:

- start
  start = page_offset(bvec->bv_page) + bvec->bv_offset;

- end
  end = start + bvec->bv_len - 1;

- page
  page = bvec->bv_page;

- pgoff
  pgoff = bvec->bv_offset;

Thus we can safely replace those 4 parameters with just one bio_vec.

Also remove the unused return value.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: also remove the return value]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Qu Wenruo ae643a74eb btrfs: introduce a data checksum checking helper
Although we have several data csum verification code, we never have a
function really just to verify checksum for one sector.

Function check_data_csum() do extra work for error reporting, thus it
requires a lot of extra things like file offset, bio_offset etc.

Function btrfs_verify_data_csum() is even worse, it will utilize page
checked flag, which means it can not be utilized for direct IO pages.

Here we introduce a new helper, btrfs_check_sector_csum(), which really
only accept a sector in page, and expected checksum pointer.

We use this function to implement check_data_csum(), and export it for
incoming patch.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[hch: keep passing the csum array as an arguments, as the callers want
      to print it, rename per request]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Qu Wenruo b036f47996 btrfs: quit early if the fs has no RAID56 support for raid56 related checks
The following functions do special handling for RAID56 chunks:

- btrfs_is_parity_mirror()
  Check if the range is in RAID56 chunks.

- btrfs_full_stripe_len()
  Either return sectorsize for non-RAID56 profiles or full stripe length
  for RAID56 chunks.

But if a filesystem without any RAID56 chunks, it will not have RAID56
incompat flags, and we can skip the chunk tree looking up completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Fanjun Kong 1280d2d165 btrfs: use PAGE_ALIGNED instead of IS_ALIGNED
The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
use it instead of IS_ALIGNED and passing PAGE_SIZE directly.

Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Pankaj Raghav 31f3726980 btrfs: zoned: fix comment description for sb_write_pointer logic
Fix the comment to represent the actual logic used for sb_write_pointer

- Empty[0] && In use[1] should be an invalid state instead of returning
  zone 0 wp
- Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
- In use[0] && Empty[1] should be returning zone 0 wp instead of being an
  invalid state
- In use[0] && Full[1] should be returning zone 0 wp instead of returning
  zone 1 wp
- Full[0] && Empty[1] should be returning zone 1 wp instead of returning
  zone 0 wp
- Full[0] && In use[1] should be returning zone 1 wp instead of returning
  zone 0 wp

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
David Sterba 143823cf4d btrfs: fix typos in comments
Codespell has found a few typos.

Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25 17:44:33 +02:00
Linus Torvalds a5235996e1 io_uring-5.19-2022-07-21
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLaEsMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprg0EADcHfSh4bM+37B1rHV3urn+Cmpei69/QUPo
 XP9hLHFeNsokZC30JFaEN2oNiHic5oGOO9aoHSrWifnXGKScaN1ZmveGE5ctwA6B
 n4mLKVjKNUAWzAG2MWFHCykxW2gXaLSfrMQNl/FIGrjnUwEVjS7b9BsORhSfeM86
 AMl2he38Btr9n1PUw63RHf0tC/p0/nBg/dta95Bwu1cjhC/7gv3wBe739PKslo4c
 oYimvLPobDkX60LcMXBsZbu9pwT7UR+WKcgOYtgqX9/Dq1G3KtO/mN+cOKJiTwAz
 yStCQepnDqYHS9ltvBeh2a6BEtl09YFGMK8WkAVUPo21+AdmzZtAxAZtsuK9erSL
 w9i3Z524rLR+PNjo0oW4QGIaLLPrQt152Q+KhCGtD69qdxf/BecO6jOiNMauWZjp
 LWJW7I9bNQI8rxiSYIA4mgS8oS/8GXae2N+UYsjaMSXtJY1GMtO+ldI05y01rcJp
 8/0MIqC3uhR5uFZ6VFLVGnzZuy0Tsw4KfN/Z/JMzTM5iWoCZvNXHgNTAnNbzLP6u
 M3yUGlO3qEhg9UerqNSRi+rOyVNScXL6joP8nKjy+fSxOtS9uYegBKlL4fjrDAQr
 fYuYnq+r9dk/Kqm6uQSuowZj8+nEwEEPbYErTLQT1GRO9A+/rIJ5a34FzKeRYWPU
 bJAuWCu+BQ==
 =lo96
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Fix for a bad kfree() introduced in this cycle, and a quick fix for
  disabling buffer recycling for IORING_OP_READV.

  The latter will get reworked for 5.20, but it gets the job done for
  5.19"

* tag 'io_uring-5.19-2022-07-21' of git://git.kernel.dk/linux-block:
  io_uring: do not recycle buffer in READV
  io_uring: fix free of unallocated buffer list
2022-07-22 12:47:09 -07:00
Dylan Yudaken 934447a603 io_uring: do not recycle buffer in READV
READV cannot recycle buffers as it would lose some of the data required to
reimport that buffer.

Reported-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Fixes: b66e65f414 ("io_uring: never call io_buffer_select() for a buffer re-select")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220721131325.624788-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:31:31 -06:00
Dylan Yudaken ec8516f3b7 io_uring: fix free of unallocated buffer list
in the error path of io_register_pbuf_ring, only free bl if it was
allocated.

Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com>
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/all/CANX2M5bXKw1NaHdHNVqssUUaBCs8aBpmzRNVEYEvV0n44P7ioA@mail.gmail.com/
Link: https://lore.kernel.org/all/CANX2M5YiZBXU3L6iwnaLs-HHJXRvrxM8mhPDiMDF9Y9sAvOHUA@mail.gmail.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-21 08:29:01 -06:00
Linus Torvalds 972a278fe6 for-5.19-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLRpPgACgkQxWXV+ddt
 WDtu/BAAnfx7CXKIfWKpz6FZEio9Qb3mUHVOglyKzqR0qB72OdrC1dQMvEWPJc6h
 N65di6+8tTNmRIlaFBMU0MDHODR2aDRpDtlR9eUzUuidTc4iOp1fi31uBwl31r7b
 k8mCZBc/IAdfH13lBtcfkb2HGid7rik5ZC6Kx/glMcqh647QkSMAleupUsIYHKsK
 IgcUWuN3wFIUK2WVgsja7+ljlwIHBHKRp9yrEYw+ef/B0NCNKvOnrIOPJzO7nxMP
 1FbqJ6F7u7HjoMFcMwn5rbV/BoIwSSvXyKRqOW+EhGeQR/imVmkH9jXJ7wXdblSz
 IvSqaZ0DaWWSvivdMpwbr8Z0Cu4iIYhVY6PSA0hukR63qB5GwKKJ6j1L0zoYoz8C
 IDWJPW03FNRIu5ZOduvUQ3qG7jcJQZ3WPCCfrDST1cO2xHT/7f65Tjz4k0hvp4za
 edITetC1mEv310CHeGsJaLxGYPNrRe38VZYPxgJ7yFpteGYjh0ZwsuyUHb4MH1no
 JWwgElNW+m1BatdWSUBYk6xhqod1s2LOFPNqo7jNlv8I27hPViqCbBA2i9FlkXf+
 FwL5kWyJXs69gfjUIj59381Z0U1VdA1tvU8GP2m2+JvIDS6ooAcZj7yEQ69mCZxi
 2RFJIU0NFbnc/5j2ARSzOTGs9glDD0yffgXJM+cK+TWsQ3AC31I=
 =/47A
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs reverts from David Sterba:
 "Due to a recent report [1] we need to revert the radix tree to xarray
  conversion patches.

  There's a problem with sleeping under spinlock, when xa_insert could
  allocate memory under pressure. We use GFP_NOFS so this is a real
  problem that we unfortunately did not discover during review.

  I'm sorry to do such change at rc6 time but the revert is IMO the
  safer option, there are patches to use mutex instead of the spin locks
  but that would need more testing. The revert branch has been tested on
  a few setups, all seem ok.

  The conversion to xarray will be revisited in the future"

Link: https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/ [1]

* tag 'for-5.19-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Revert "btrfs: turn delayed_nodes_tree into an XArray"
  Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
  Revert "btrfs: turn fs_info member buffer_radix into XArray"
  Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
2022-07-16 13:48:55 -07:00
Linus Torvalds 1ce9d792e8 A folio locking fixup that Xiubo and David cooperated on, marked for
stable.  Most of it is in netfs but I picked it up into ceph tree on
 agreement with David.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmLRle4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziwNrB/wLIT7pDkZl2h1LclJS1WfgzgPkaOVq
 sN8RO+QH3zIx5av/b3BH/R9Ilp2M4QjWr7f5y3emVZPxV9KQ2lrUj30XKecfIO4+
 nGU3YunO+rfaUTyySJb06VFfhLpOjxjWGFEjgAO+exiWz4zl2h8dOXqYBTE/cStT
 +721WZKYR25UK7c7kp/LgRC9QhjqH1MDm7wvPOAg6CR7mw2OiwjYD7o8Ou+zvGfp
 6GimxbWouJNT+/xW2T3wIJsmQuwZbw4L4tsLSfhKTk57ooKtR1cdm0h/N7LM1bQa
 fijU36LdGJGqKKF+kVJV73sNuPIZGY+KVS+ApiuOJ/LMDXxoeuiYtewT
 =P3hf
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A folio locking fixup that Xiubo and David cooperated on, marked for
  stable. Most of it is in netfs but I picked it up into ceph tree on
  agreement with David"

* tag 'ceph-for-5.19-rc7' of https://github.com/ceph/ceph-client:
  netfs: do not unlock and put the folio twice
2022-07-15 10:27:28 -07:00
David Sterba 088aea3b97 Revert "btrfs: turn delayed_nodes_tree into an XArray"
This reverts commit 253bf57555.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:15:19 +02:00
David Sterba 5b8418b843 Revert "btrfs: turn name_cache radix tree into XArray in send_ctx"
This reverts commit 4076942021.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:58 +02:00
David Sterba 01cd390903 Revert "btrfs: turn fs_info member buffer_radix into XArray"
This reverts commit 8ee922689d.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:33 +02:00
David Sterba fc7cbcd489 Revert "btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray"
This reverts commit 48b36a602a.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-15 19:14:28 +02:00
Linus Torvalds b926f2adb0 Revert "vf/remap: return the amount of bytes actually deduplicated"
This reverts commit 4a57a84000.

Dave Chinner reports:
 "As I suspected would occur, this change causes test failures. e.g
  generic/517 in fstests fails with:

  generic/517 1s ... - output mismatch [..]
  -deduped 131172/131172 bytes at offset 65536
  +deduped 131072/131172 bytes at offset 65536"

  can you please revert this commit for the 5.19 series to give us more
  time to investigate and consider the impact of the the API change on
  userspace applications before we commit to changing the API"

That changed return value seems to reflect reality, but with the fstest
change, let's revert for now.

Requested-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/all/20220714223238.GH3600936@dread.disaster.area/
Cc: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-14 15:35:24 -07:00
Linus Torvalds f41d5df5f1 three smb3 client fixes: 2 related to multichannel, one for working around a negotiate protocol bug in some Samba servers
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmLPa1YACgkQiiy9cAdy
 T1EaFQv+JB99S1F4ANk7uc+diFd/j+b1pIpR2nuNq6BM0Nn1sofbNZup/yFovoXN
 L+nm/XyWUYhtx5Rl5HPgfBaQwMq4uZPN8o/5vxKW1Kb/uxBV+VdSw9LhiNAWZpPB
 dGP7pGVu05TNnlsNBJjLTjAtnB7kL2r+XSMU+UPBDx2yTECJzhZbiedcg1qysh3t
 GqC03rsh8Gi5gnqUI4XZDh5iC2OAnlKeoQIXck0JbF6z3/kOjHsqUyuVE8ToDSry
 RNXxoy+hB7dsO1eehwsKBMg4mry5ArOv4aTpAa6nH3vF94FV2YG1tVroKIQRkw3r
 dcnLRcaBj8E3voNy1xK7qbSIdj6DlAbBeHS4cjAFHX+VCmecn6drolZDjIMfTZv8
 XSweHaP94Bm043866+vcFqqNNB9B2DgFiJgyZwOQC7sdTOmN49r44o0zE68mWn40
 c6n7RxayvobU9hU7ylqID6prKsI1zibUYXEItJaAHWUuqFbISd4iihFzJky+npvR
 Tep17Tts
 =FxnJ
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three smb3 client fixes:

   - two multichannel fixes: fix a potential deadlock freeing a channel,
     and fix a race condition on failed creation of a new channel

   - mount failure fix: work around a server bug in some common older
     Samba servers by avoiding padding at the end of the negotiate
     protocol request"

* tag '5.19-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: workaround negprot bug in some Samba servers
  cifs: remove unnecessary locking of chan_lock while freeing session
  cifs: fix race condition with delayed threads
2022-07-14 12:35:15 -07:00
Linus Torvalds a24a6c05ff Notable regression fixes:
- Enable SETATTR(time_create) to fix regression with Mac OS clients
 - Fix a lockd crasher and broken NLM UNLCK behavior
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLQSzcACgkQM2qzM29m
 f5d5Sg/9Hln5dSnIpgJKyhIy+ic5QB29VL9RWZ7JtVCUsWsaVObkwobmne7UWAms
 X9eehGG2gYTsLcXH9oIGeLRrjbWGYbdg3B6tGMFeQZgI6NeSvBS9AeDzyG+rRTyL
 2czvODorTSfYokbsHNE3kWh0Tohkvd/4wV1y5GUcLYKUj6ZO21xsSbEj7381nriJ
 9nkudqgJTfCINxqRiIRBT+iCNfjvAc8EhfxP+ThcmcjJH9WeWi9kR0ZF050jloO/
 JH2RZRNN+crNdg77HlXyc/jcr2G/wv/IR5G40LqeLgrJbvPAKRGAltzeyErWY7eB
 6jsWf7/vil6+PfuUaPskt9h60hGe3DeBzraK4gFcEKci6YSqbE7h1GrUtM6/2E0a
 5HM2mYAZfYBZyT2HitZCoFXOvGssM6543QXW2noMo/6eAWFPfdzjlN55FLNssqkq
 8fPMLoiBElOvVLCYvi91CpnqSsmR4WW44dwY/d+H7DG4QnCurPGMH2OFHX3epk5J
 kJzhT63GuemgwrxZICT8KQ6aVWHreHRO2LTKv9yjhEohsqkCf3eV/xlWKzcRFXLj
 HQIBfQf1UIOonxOkdGUG1w0Es/Wz7btQehc/CwaMi74VUjZUp5qSMjCUWkckyALc
 3k1ojcODEyR/5rOepQF4kRzOzSbm7TETwhF7YG0TRPugIGri/sQ=
 =T8gK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Enable SETATTR(time_create) to fix regression with Mac OS clients

   - Fix a lockd crasher and broken NLM UNLCK behavior"

* tag 'nfsd-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  lockd: fix nlm_close_files
  lockd: set fl_owner when unlocking files
  NFSD: Decode NFSv4 birth time attribute
2022-07-14 12:29:43 -07:00
Xiubo Li fac47b43c7 netfs: do not unlock and put the folio twice
check_write_begin() will unlock and put the folio when return
non-zero.  So we should avoid unlocking and putting it twice in
netfs layer.

Change the way ->check_write_begin() works in the following two ways:

 (1) Pass it a pointer to the folio pointer, allowing it to unlock and put
     the folio prior to doing the stuff it wants to do, provided it clears
     the folio pointer.

 (2) Change the return values such that 0 with folio pointer set means
     continue, 0 with folio pointer cleared means re-get and all error
     codes indicating an error (no special treatment for -EAGAIN).

[ bagasdotme: use Sphinx code text syntax for *foliop pointer ]

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56423
Link: https://lore.kernel.org/r/cf169f43-8ee7-8697-25da-0204d1b4343e@redhat.com
Co-developed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-07-14 10:10:12 +02:00
Steve French 32f319183c smb3: workaround negprot bug in some Samba servers
Mount can now fail to older Samba servers due to a server
bug handling padding at the end of the last negotiate
context (negotiate contexts typically are rounded up to 8
bytes by adding padding if needed). This server bug can
be avoided by switching the order of negotiate contexts,
placing a negotiate context at the end that does not
require padding (prior to the recent netname context fix
this was the case on the client).

Fixes: 73130a7b1a ("smb3: fix empty netname context on secondary channels")
Reported-by: Julian Sikorski <belegdol@gmail.com>
Tested-by: Julian Sikorski <belegdol+github@gmail.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-13 19:59:47 -05:00
Ansgar Lößer 4a57a84000 vf/remap: return the amount of bytes actually deduplicated
When using the FIDEDUPRANGE ioctl, in case of success the requested size
is returned. In some cases this might not be the actual amount of bytes
deduplicated.

This change modifies vfs_dedupe_file_range() to report the actual amount
of bytes deduplicated, instead of the requested amount.

Link: https://lore.kernel.org/linux-fsdevel/5548ef63-62f9-4f46-5793-03165ceccacc@tu-darmstadt.de/
Reported-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de>
Reported-by: Max Schlecht <max.schlecht@informatik.hu-berlin.de>
Reported-by: Björn Scheuermann <scheuermann@kom.tu-darmstadt.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Darrick J Wong <djwong@kernel.org>
Signed-off-by: Ansgar Lößer <ansgar.loesser@kom.tu-darmstadt.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-13 12:08:14 -07:00
Dave Chinner 5750676b64 fs/remap: constrain dedupe of EOF blocks
If dedupe of an EOF block is not constrainted to match against only
other EOF blocks with the same EOF offset into the block, it can
match against any other block that has the same matching initial
bytes in it, even if the bytes beyond EOF in the source file do
not match.

Fix this by constraining the EOF block matching to only match
against other EOF blocks that have identical EOF offsets and data.
This allows "whole file dedupe" to continue to work without allowing
eof blocks to randomly match against partial full blocks with the
same data.

Reported-by: Ansgar Lößer <ansgar.loesser@tu-darmstadt.de>
Fixes: 1383a7ed67 ("vfs: check file ranges before cloning files")
Link: https://lore.kernel.org/linux-fsdevel/a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de/
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-13 10:28:16 -07:00
Linus Torvalds 72a8e05d4f overlayfs fixes for 5.19-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCYs11GgAKCRDh3BK/laaZ
 PAD3APsHu08aHid5O/zPnD/90BNqAo3ruvu2WhI5wa8Dacd5SwEAgoSlH2Tx3iy9
 4zWK4zZX98qAGyI+ij5aejc0TvONqAE=
 =4KjV
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fix from Miklos Szeredi:
 "Add a temporary fix for posix acls on idmapped mounts introduced in
  this cycle. A proper fix will be added in the next cycle"

* tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: turn off SB_POSIXACL with idmapped layers temporarily
2022-07-12 08:59:35 -07:00
Shyam Prasad N 2883f4b5a0 cifs: remove unnecessary locking of chan_lock while freeing session
In cifs_put_smb_ses, when we're freeing the last ref count to
the session, we need to free up each channel. At this point,
it is unnecessary to take chan_lock, since we have the last
reference to the ses.

Picking up this lock also introduced a deadlock because it calls
cifs_put_tcp_ses, which locks cifs_tcp_ses_lock.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-12 10:35:48 -05:00
Shyam Prasad N 50bd7d5a64 cifs: fix race condition with delayed threads
On failure to create a new channel, first cancel the
delayed threads, which could try to search for this
channel, and not find it.

The other option was to put the tcp session for the
channel first, before decrementing chan_count. But
that would leave a reference to the tcp session, when
it has been freed already.

So going with the former option and cancelling the
delayed works first, before rolling back the channel.

Fixes: aa45dadd34 ("cifs: change iface_list from array to sorted linked list")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-07-12 10:35:13 -05:00
Linus Torvalds 5a29232d87 for-5.19-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmLMiQQACgkQxWXV+ddt
 WDvBQg/+I1ebfW2DFY8kBwy7c1qKZWIhNx1VVk2AegIXvrW/Tos7wp5O6fi7p/jL
 d6k8zO/zFLlfiI4Ckmz3gt7cxaMTNXxr6+GQpNNm1b92Wdcy1a+3gquzcehT9Q10
 ZB4ecPWzEDXgORvdBYG2eD2Z8PrsF0Wu88XRDiiJOBQLjZ+k2sVp8QvJlOllLDoC
 m7rPoq98jC6VpZwFJ+fGk2jC7y4+1QXrOuQMy7LRTe59Thp6wUFDDPtkKfr5scDC
 UxkctlUdInD7A6DVvPzwaBFNoT8UeEByGHcMd3KjjrTdmqSWW6k8FiF4ckZwA3zJ
 oPdJVzdC5a2W7t6BHw+t7VNmkKd+swnr2sVSGQ8eIzF7z3/JSqyYVwziOD1YzAdU
 QUmawWm4/SFvsbO8aoLrEKNbUiTgQwVbKzJh4Dhu9VJ43jeCwCX7pa/uZI4evgyG
 T0tuwm58bWCk4y1o1fcFYgf4JcVgK23F2vKckUFZeHoV3Q8R0DnPCCGTqs1qT5vY
 irZ9AIawmaR09JptMjjsAEjDA9qb16Ut/J6/anukyCgL610EyYZG7zb1WH1cUD1o
 zNXY6O/iKyNdiXj7V1fTMiG/M8hGDcFu4pOpBk3hFjHEXX9BefoVC0J5YzvCecPz
 isqboD5Lt1I4mrzac1X+serMYfVbFH6+tsEPBQBZf/o/a0u43jI=
 =Cxvn
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A more fixes that seem to me to be important enough to get merged
  before release:

   - in zoned mode, fix leak of a structure when reading zone info, this
     happens on normal path so this can be significant

   - in zoned mode, revert an optimization added in 5.19-rc1 to finish a
     zone when the capacity is full, but this is not reliable in all
     cases

   - try to avoid short reads for compressed data or inline files when
     it's a NOWAIT read, applications should handle that but there are
     two, qemu and mariadb, that are affected"

* tag 'for-5.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: drop optimization of zone finish
  btrfs: zoned: fix a leaked bioc in read_zone_info
  btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
2022-07-11 14:41:44 -07:00
Jeff Layton 1197eb5906 lockd: fix nlm_close_files
This loop condition tries a bit too hard to be clever. Just test for
the two indices we care about explicitly.

Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 15:49:56 -04:00
Linus Torvalds 8e59a6a7a4 Mainly MM fixes. About half for issues which were introduced after 5.18
and the remainder for longer-term issues.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYsxt9wAKCRDdBJ7gKXxA
 jnjWAQD6ts4tgsX+hQ5lrZjWRvYIxH/I4jbtxyMyhc+iKarotAD+NILVgrzIvr0v
 ijlA4LLtmdhN1UWdSomUm3bZVn6n+QA=
 =1375
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Mainly MM fixes. About half for issues which were introduced after
  5.18 and the remainder for longer-term issues"

* tag 'mm-hotfixes-stable-2022-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: split huge PUD on wp_huge_pud fallback
  nilfs2: fix incorrect masking of permission flags for symlinks
  mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()
  riscv/mm: fix build error while PAGE_TABLE_CHECK enabled without MMU
  Documentation: highmem: use literal block for code example in highmem.h comment
  mm: sparsemem: fix missing higher order allocation splitting
  mm/damon: use set_huge_pte_at() to make huge pte old
  sh: convert nommu io{re,un}map() to static inline functions
  mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pages
2022-07-11 12:49:56 -07:00
Jeff Layton aec158242b lockd: set fl_owner when unlocking files
Unlocking a POSIX lock on an inode with vfs_lock_file only works if
the owner matches. Ensure we set it in the request.

Cc: J. Bruce Fields <bfields@fieldses.org>
Fixes: 7f024fcd5c ("Keep read and write fds with each nlm_file")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 15:49:56 -04:00
Chuck Lever 5b2f3e0777 NFSD: Decode NFSv4 birth time attribute
NFSD has advertised support for the NFSv4 time_create attribute
since commit e377a3e698 ("nfsd: Add support for the birth time
attribute").

Igor Mammedov reports that Mac OS clients attempt to set the NFSv4
birth time attribute via OPEN(CREATE) and SETATTR if the server
indicates that it supports it, but since the above commit was
merged, those attempts now fail.

Table 5 in RFC 8881 lists the time_create attribute as one that can
be both set and retrieved, but the above commit did not add server
support for clients to provide a time_create attribute. IMO that's
a bug in our implementation of the NFSv4 protocol, which this commit
addresses.

Whether NFSD silently ignores the new birth time or actually sets it
is another matter. I haven't found another filesystem service in the
Linux kernel that enables users or clients to modify a file's birth
time attribute.

This commit reflects my (perhaps incorrect) understanding of whether
Linux users can set a file's birth time. NFSD will now recognize a
time_create attribute but it ignores its value. It clears the
time_create bit in the returned attribute bitmask to indicate that
the value was not used.

Reported-by: Igor Mammedov <imammedo@redhat.com>
Fixes: e377a3e698 ("nfsd: Add support for the birth time attribute")
Tested-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-07-11 13:52:22 -04:00
Oleg Nesterov d5b36a4dbd fix race between exit_itimers() and /proc/pid/timers
As Chris explains, the comment above exit_itimers() is not correct,
we can race with proc_timers_seq_ops. Change exit_itimers() to clear
signal->posix_timers with ->siglock held.

Cc: <stable@vger.kernel.org>
Reported-by: chris@accessvector.net
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-11 09:52:59 -07:00
Linus Torvalds d9919d43cb o_uring-5.19-2022-07-09
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLJpHQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmHAD/9r8+hKrvYwoBVoi0r8SD/hDkOJMJAnJOBb
 IBzpDRGcZhdLqGvMkhkdk1s83TTsKysStYSsr0jTlNMtevHNUx5jalYY0x7sIJBt
 S5WEwJ8aKbCs4G7pRU/T/wtRtYFz+JmlKT3VNGjkLWVrWn77CGEWtAHCSgzIgScn
 7QzZiQIVheDr9RsqU8xre1m21+rQWvOssip82UfuTpiYOKTYyHkb91MoNdTGd3JB
 ABrr1Ind4lSbMaXXcUtUP6iV15NU95CplEr8ln4DS9/syoM7O1ZiOdJb5uo/ywyC
 +VEh2LO+UyRn7V4me743C1UGASNTYoyOjGJs0t0en2L10gTZHx3tWLkKRvk0uF0E
 MOXJ/F8QRZxtb/RTUPX2MED/U/ZERLBbF/Jrdfunwj4NuF6mxOhM0MhFb+DCyxK0
 BmEfTIdmYzHkyKgp46OeaoJ+8cuyoHKC2DlZMoCl+mw6Fmv1XjVnnDag0Oxl2rv4
 ANCPZNGHvi5kC2t5fHu7zgOBgbb1IAkLLwcUA8SQ27dRQPfmsB6sm7YUAabuTu8N
 zc937WxpXw9FsjtmFb7JQR5yLsTIG4BYHSZFM1FF8YX6nFptm4OhKFevgYH8r3sg
 sj4qUZFVVc60xTnFR7Yr+ccBbNPQPzy1tRG70OZvMYtkPFqn+IrfxrBzZTxl6/bl
 5dWh6Y8L+A==
 =ZK3Q
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "A single fix for an issue that came up yesterday that we should plug
  for -rc6.

  This is a regression introduced in this cycle"

* tag 'io_uring-5.19-2022-07-09' of git://git.kernel.dk/linux-block:
  io_uring: check that we have a file table when allocating update slots
2022-07-10 09:14:54 -07:00
Jens Axboe d785a773be io_uring: check that we have a file table when allocating update slots
If IORING_FILE_INDEX_ALLOC is set asking for an allocated slot, the
helper doesn't check if we actually have a file table or not. The non
alloc path does do that correctly, and returns -ENXIO if we haven't set
one up.

Do the same for the allocated path, avoiding a NULL pointer dereference
when trying to find a free bit.

Fixes: a7c41b4687 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-09 07:02:10 -06:00
Linus Torvalds e5524c2a1f fscache fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmLInYwACgkQ+7dXa6fL
 C2tbmxAAlBkMvoIeyOlkMeAB3l1aK5CAYmgddjWOxY6UHNIGgXqWlGJCoxYngTmy
 j+5Uz1mMNh7TamSJpCTXUBbiLWxzoHI34oie+NT9VUXTTVw0/GQ//jQPlvE+HeJG
 Hmi0HR4lby1O6X9r63obzvDctJPhsVLdgJFtZ3UidwzgY2zYnScxwyCuwSfLUVnF
 YJACZWPFj1oWxOiS2bQZgnvCJpfJ6J0I5asmO0lI1tQDS0o4RoiRvDPTx8Z6K3KO
 8CSm5tvu+vYrbZ5aNCN9f0MZfNl+YZiNQyxEiIP9Phu9iZFRe9UgXvDrE9346uQM
 /Vxa5nvyj+D0Hg+l1jBFXvwoKxLHTZ/BaK74h2/QzRKgxKrHTAyYOgxqt7U/YCsa
 S489eHOGNwbqIKACljBcMX9YZeCdeixt4aK/VIRr+d1f2Ui30Z05ExuYOfGn4wBq
 VcSsBgAALwcyqrpi61URYvKduHTa2D8ATKNKXNM/i1Wv6KQDeom5jyKj3mE7UNsI
 WAgWE2xy74kjYs1Lc0izsfieyZrt1MqZiyYFjS86Cnt0wVmdAQYrQ3wciN5D9IJN
 MiUYuienGRf1MNM2tmFK+N7e+pa/waKcl15/d3Zal94rlATpR2bFOI1zJ9qJ8rpi
 k8B/eydAZx+oUthxWvfIdLwIREJzA49jzBHgfAy/CMjqPCj7gZ8=
 =FDkz
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fscache fixes from David Howells:

 - Fix a check in fscache_wait_on_volume_collision() in which the
   polarity is reversed. It should complain if a volume is still marked
   acquisition-pending after 20s, but instead complains if the mark has
   been cleared (ie. the condition has cleared).

   Also switch an open-coded test of the ACQUIRE_PENDING volume flag to
   use the helper function for consistency.

 - Not a fix per se, but neaten the code by using a helper to check for
   the DROPPED state.

 - Fix cachefiles's support for erofs to only flush requests associated
   with a released control file, not all requests.

 - Fix a race between one process invalidating an object in the cache
   and another process trying to look it up.

* tag 'fscache-fixes-20220708' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fscache: Fix invalidation/lookup race
  cachefiles: narrow the scope of flushed requests when releasing fd
  fscache: Introduce fscache_cookie_is_dropped()
  fscache: Fix if condition in fscache_wait_on_volume_collision()
2022-07-08 16:08:48 -07:00
Linus Torvalds 29837019d5 io_uring-5.19-2022-07-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmLIJGcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgS1D/9E2MxaWaHR+A35AXignJLaXxiLWlzTAwxr
 /ioT6omgogbiaoTrxVZMbW+6vAPbUamXA9yqv2fC/4RERBfz3Z6UBLa8Hp6SFCMQ
 vb/LSwRwnUuMr/HyF3hLjwISOi4oYs/uo3E7IpOPbpC/e0nDpJnGWx1UfQn0tJSH
 WVwZsLbUe8T9fhHA0uSHbMoMSvLQGqfwwY+MET2+j+ZDwMoet194yka22jJwfDbF
 l3cnUe2TQh6orRdzuagWX9WmdnWWyQM3DTqW2cSA0hepyxGMWkCMhMuyV5yqUXhD
 noHshcyL76h8WQi/BwYDAGYqGy1+FkOkuV3DmVnjHVQory17Ze8ijtImMoEWpkgl
 TwTTd2+o0ivcEd0JHLeqLHkTXKUENeUMTpJVuLotLMMdupIvF0jrdNWTWCM7uBto
 Q9JxIkEs+16bRqT+yzC4cNuzSQRL6+qQ5jVO5BsNmJoNvs15KN7vgAQ+uR4NCCIv
 GqHbTiBVsi7DYipS6jNi/bWnxDtIsNsn48WCdx52OYpd1NbkY3oEHMqBZ6rnsPJH
 /uek3VLajRZfG61EMUGlazitQdW3/z31wM0iP8Y8xPyvSNCbsJkFtrNO13jpuy9p
 0YRP3peXYb2eEzYpq355RSCCpyActDYp77hjcyYQP3gJcnoViZ6rBUHr5Rw5W6lN
 siwWz7aNXg==
 =w3SQ
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block

Pull io_uring tweak from Jens Axboe:
 "Just a minor tweak to an addition made in this release cycle: padding
  a 32-bit value that's in a 64-bit union to avoid any potential
  funkiness from that"

* tag 'io_uring-5.19-2022-07-08' of git://git.kernel.dk/linux-block:
  io_uring: explicit sqe padding for ioctl commands
2022-07-08 11:25:01 -07:00
Naohiro Aota b3a3b02557 btrfs: zoned: drop optimization of zone finish
We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
wrote fully into the zone.

The assumption is determined by "alloc_offset == capacity". This condition
won't work if the last ordered extent is canceled due to some errors. In
that case, we consider the zone is deactivated without sending the finish
command while it's still active.

This inconstancy results in activating another block group while we cannot
really activate the underlying zone, which causes the active zone exceeds
errors like below.

    BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
    nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
    active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
    nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
    active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0

Fix the issue by removing the optimization for now.

Fixes: 8376d9e1ed ("btrfs: zoned: finish superblock zone once no space left for new SB")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:18:00 +02:00
Christoph Hellwig 2963457829 btrfs: zoned: fix a leaked bioc in read_zone_info
The bioc would leak on the normal completion path and also on the RAID56
check (but that one won't happen in practice due to the invalid
combination with zoned mode).

Fixes: 7db1c5d14d ("btrfs: zoned: support dev-replace in zoned filesystems")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:13:32 +02:00
Filipe Manana a4527e1853 btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents
When doing a direct IO read or write, we always return -ENOTBLK when we
find a compressed extent (or an inline extent) so that we fallback to
buffered IO. This however is not ideal in case we are in a NOWAIT context
(io_uring for example), because buffered IO can block and we currently
have no support for NOWAIT semantics for buffered IO, so if we need to
fallback to buffered IO we should first signal the caller that we may
need to block by returning -EAGAIN instead.

This behaviour can also result in short reads being returned to user
space, which although it's not incorrect and user space should be able
to deal with partial reads, it's somewhat surprising and even some popular
applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
deal with short reads properly (or at all).

The short read case happens when we try to read from a range that has a
non-compressed and non-inline extent followed by a compressed extent.
After having read the first extent, when we find the compressed extent we
return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
treat the request as a short read, returning 0 (success) and waiting for
previously submitted bios to complete (this happens at
fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
read the remaining data, and pass it the number of bytes we were able to
read with direct IO. Than at filemap_read() if we get a page fault error
when accessing the read buffer, we return a partial read instead of an
-EFAULT error, because the number of bytes previously read is greater
than zero.

So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
compressed or an inline extent.

Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 19:13:22 +02:00
Christian Brauner 4a47c6385b ovl: turn of SB_POSIXACL with idmapped layers temporarily
This cycle we added support for mounting overlayfs on top of idmapped
mounts.  Recently I've started looking into potential corner cases when
trying to add additional tests and I noticed that reporting for POSIX ACLs
is currently wrong when using idmapped layers with overlayfs mounted on top
of it.

I have sent out an patch that fixes this and makes POSIX ACLs work
correctly but the patch is a bit bigger and we're already at -rc5 so I
recommend we simply don't raise SB_POSIXACL when idmapped layers are
used. Then we can fix the VFS part described below for the next merge
window so we can have good exposure in -next.

I'm going to give a rather detailed explanation to both the origin of the
problem and mention the solution so people know what's going on.

Let's assume the user creates the following directory layout and they have
a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
would expect files on your host system to be owned. For example, ~/.bashrc
for your regular user would be owned by 1000:1000 and /root/.bashrc would
be owned by 0:0. IOW, this is just regular boring filesystem tree on an
ext4 or xfs filesystem.

The user chooses to set POSIX ACLs using the setfacl binary granting the
user with uid 4 read, write, and execute permissions for their .bashrc
file:

        setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

Now they to expose the whole rootfs to a container using an idmapped
mount. So they first create:

        mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
        mkdir -pv /vol/contpool/ctrover/{over,work}
        chown 10000000:10000000 /vol/contpool/ctrover/{over,work}

The user now creates an idmapped mount for the rootfs:

        mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                      /var/lib/lxc/c2/rootfs \
                                      /vol/contpool/lowermap

This for example makes it so that
/var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
1000 as being owned by uid and gid 10001000 at
/vol/contpool/lowermap/home/ubuntu/.bashrc.

Assume the user wants to expose these idmapped mounts through an overlayfs
mount to a container.

       mount -t overlay overlay                      \
             -o lowerdir=/vol/contpool/lowermap,     \
                upperdir=/vol/contpool/overmap/over, \
                workdir=/vol/contpool/overmap/work   \
             /vol/contpool/merge

The user can do this in two ways:

(1) Mount overlayfs in the initial user namespace and expose it to the
    container.

(2) Mount overlayfs on top of the idmapped mounts inside of the container's
    user namespace.

Let's assume the user chooses the (1) option and mounts overlayfs on the
host and then changes into a container which uses the idmapping
0:10000000:65536 which is the same used for the two idmapped mounts.

Now the user tries to retrieve the POSIX ACLs using the getfacl command

        getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc

and to their surprise they see:

        # file: vol/contpool/merge/home/ubuntu/.bashrc
        # owner: 1000
        # group: 1000
        user::rw-
        user:4294967295:rwx
        group::r--
        mask::rwx
        other::r--

indicating the uid wasn't correctly translated according to the idmapped
mount. The problem is how we currently translate POSIX ACLs. Let's inspect
the callchain in this example:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

If the user chooses to use option (2) and mounts overlayfs on top of
idmapped mounts inside the container things don't look that much better:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  -> __vfs_getxattr()
                  |     -> handler->get == ovl_posix_acl_xattr_get()
                  |        -> ovl_xattr_get()
                  |           -> vfs_getxattr()
                  |              -> __vfs_getxattr()
                  |                 -> handler->get() /* lower filesystem callback */
                  |> posix_acl_fix_xattr_to_user()
                     {
                              4 = make_kuid(&init_user_ns, 4);
                              4 = mapped_kuid_fs(&init_user_ns, 4);
                              /* FAILURE */
                             -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                     }

As is easily seen the problem arises because the idmapping of the lower
mount isn't taken into account as all of this happens in do_gexattr(). But
do_getxattr() is always called on an overlayfs mount and inode and thus
cannot possible take the idmapping of the lower layers into account.

This problem is similar for fscaps but there the translation happens as
part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
callchain:

        setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc

The expected outcome here is that we'll receive the cap_net_raw capability
as we are able to map the uid associated with the fscap to 0 within our
container.  IOW, we want to see 0 as the result of the idmapping
translations.

If the user chooses option (1) we get the following callchain for fscaps:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                       ________________________________
                            -> cap_inode_getsecurity()                                         |                              |
                               {                                                               V                              |
                                        10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                        10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                               /* Expected result is 0 and thus that we own the fscap. */                     |
                                               0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                               }                                                                                              |
                               -> vfs_getxattr_alloc()                                                                        |
                                  -> handler->get == ovl_other_xattr_get()                                                    |
                                     -> vfs_getxattr()                                                                        |
                                        -> xattr_getsecurity()                                                                |
                                           -> security_inode_getsecurity()                                                    |
                                              -> cap_inode_getsecurity()                                                      |
                                                 {                                                                            |
                                                                0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                         10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                         10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                         |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

And if the user chooses option (2) we get:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                   -> vfs_getxattr()
                      -> xattr_getsecurity()
                         -> security_inode_getsecurity()                                                _______________________________
                            -> cap_inode_getsecurity()                                                  |                             |
                               {                                                                        V                             |
                                       10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                       10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                               /* Expected result is 0 and thus that we own the fscap. */                             |
                                              0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                               }                                                                                                      |
                               -> vfs_getxattr_alloc()                                                                                |
                                  -> handler->get == ovl_other_xattr_get()                                                            |
                                    |-> vfs_getxattr()                                                                                |
                                        -> xattr_getsecurity()                                                                        |
                                           -> security_inode_getsecurity()                                                            |
                                              -> cap_inode_getsecurity()                                                              |
                                                 {                                                                                    |
                                                                 0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                          10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                 0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                 |____________________________________________________________________|
                                                 }
                                                 -> vfs_getxattr_alloc()
                                                    -> handler->get == /* lower filesystem callback */

We can see how the translation happens correctly in those cases as the
conversion happens within the vfs_getxattr() helper.

For POSIX ACLs we need to do something similar. However, in contrast to
fscaps we cannot apply the fix directly to the kernel internal posix acl
data structure as this would alter the cached values and would also require
a rework of how we currently deal with POSIX ACLs in general which almost
never take the filesystem idmapping into account (the noteable exception
being FUSE but even there the implementation is special) and instead
retrieve the raw values based on the initial idmapping.

The correct values are then generated right before returning to
userspace. The fix for this is to move taking the mount's idmapping into
account directly in vfs_getxattr() instead of having it be part of
posix_acl_fix_xattr_to_user().

To this end we simply move the idmapped mount translation into a separate
step performed in vfs_{g,s}etxattr() instead of in
posix_acl_fix_xattr_{from,to}_user().

To see how this fixes things let's go back to the original example. Assume
the user chose option (1) and mounted overlayfs on top of idmapped mounts
on the host:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           |
                  |                                                 V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(&init_user_ns, 10000004);
                  |     }       |_________________________________________________
                  |                                                              |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                     }

And similarly if the user chooses option (1) and mounted overayfs on top of
idmapped mounts inside the container:

        idmapped mount /vol/contpool/merge:      0:10000000:65536
        caller's idmapping:                      0:10000000:65536
        overlayfs idmapping (ofs->creator_cred): 0:10000000:65536

        sys_getxattr()
        -> path_getxattr()
           -> getxattr()
              -> do_getxattr()
                  |> vfs_getxattr()
                  |  |> __vfs_getxattr()
                  |  |  -> handler->get == ovl_posix_acl_xattr_get()
                  |  |     -> ovl_xattr_get()
                  |  |        -> vfs_getxattr()
                  |  |           |> __vfs_getxattr()
                  |  |           |  -> handler->get() /* lower filesystem callback */
                  |  |           |> posix_acl_getxattr_idmapped_mnt()
                  |  |              {
                  |  |                              4 = make_kuid(&init_user_ns, 4);
                  |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                  |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                  |  |                       |_______________________
                  |  |              }                               |
                  |  |                                              |
                  |  |> posix_acl_getxattr_idmapped_mnt()           |
                  |     {                                           V
                  |             10000004 = make_kuid(&init_user_ns, 10000004);
                  |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                  |             10000004 = from_kuid(0(&init_user_ns, 10000004);
                  |             |_________________________________________________
                  |     }                                                        |
                  |                                                              |
                  |> posix_acl_fix_xattr_to_user()                               |
                     {                                                           V
                                 10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                        /* SUCCESS */
                                        4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                     }

The last remaining problem we need to fix here is ovl_get_acl(). During
ovl_permission() overlayfs will call:

        ovl_permission()
        -> generic_permission()
           -> acl_permission_check()
              -> check_acl()
                 -> get_acl()
                    -> inode->i_op->get_acl() == ovl_get_acl()
                        > get_acl() /* on the underlying filesystem)
                          ->inode->i_op->get_acl() == /*lower filesystem callback */
                 -> posix_acl_permission()

passing through the get_acl request to the underlying filesystem. This will
retrieve the acls stored in the lower filesystem without taking the
idmapping of the underlying mount into account as this would mean altering
the cached values for the lower filesystem. The simple solution is to have
ovl_get_acl() simply duplicate the ACLs, update the values according to the
idmapped mount and return it to acl_permission_check() so it can be used in
posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
released right after.

Link: https://github.com/brauner/mount-idmapped/issues/9
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: linux-unionfs@vger.kernel.org
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Fixes: bc70682a49 ("ovl: support idmapped layers")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-08 15:48:31 +02:00
Pavel Begunkov bdb2c48e4b io_uring: explicit sqe padding for ioctl commands
32 bit sqe->cmd_op is an union with 64 bit values. It's always a good
idea to do padding explicitly. Also zero check it in prep, so it can be
used in the future if needed without compatibility concerns.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com
[axboe: turn bitwise OR into logical variant]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-07 17:33:01 -06:00
David Howells 85e4ea1049 fscache: Fix invalidation/lookup race
If an NFS file is opened for writing and closed, fscache_invalidate() will
be asked to invalidate the file - however, if the cookie is in the
LOOKING_UP state (or the CREATING state), then request to invalidate
doesn't get recorded for fscache_cookie_state_machine() to do something
with.

Fix this by making __fscache_invalidate() set a flag if it sees the cookie
is in the LOOKING_UP state to indicate that we need to go to invalidation.
Note that this requires a count on the n_accesses counter for the state
machine, which that will release when it's done.

fscache_cookie_state_machine() then shifts to the INVALIDATING state if it
sees the flag.

Without this, an nfs file can get corrupted if it gets modified locally and
then read locally as the cache contents may not get updated.

Fixes: d24af13e2e ("fscache: Implement cookie invalidation")
Reported-by: Max Kellermann <mk@cm4all.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Max Kellermann <mk@cm4all.com>
Link: https://lore.kernel.org/r/YlWWbpW5Foynjllo@rabbit.intern.cm-ag [1]
2022-07-05 16:12:55 +01:00
Jia Zhu 65aa5f6fd8 cachefiles: narrow the scope of flushed requests when releasing fd
When an anonymous fd is released, only flush the requests
associated with it, rather than all of requests in xarray.

Fixes: 9032b6e858 ("cachefiles: implement on-demand read")
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-June/006937.html
2022-07-05 16:12:21 +01:00
Yue Hu 5c4588aea6 fscache: Introduce fscache_cookie_is_dropped()
FSCACHE_COOKIE_STATE_DROPPED will be read more than once, so let's add a
helper to avoid code duplication.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-May/006919.html
2022-07-05 16:12:20 +01:00
Yue Hu bf17455b9c fscache: Fix if condition in fscache_wait_on_volume_collision()
After waiting for the volume to complete the acquisition with timeout,
the if condition under which potential volume collision occurs should be
acquire the volume is still pending rather than not pending so that we
will continue to wait until the pending flag is cleared. Also, use the
existing test pending wrapper directly instead of test_bit().

Fixes: 62ab633523 ("fscache: Implement volume registration")
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Link: https://listman.redhat.com/archives/linux-cachefs/2022-May/006918.html
2022-07-05 16:12:20 +01:00
Ryusuke Konishi 5924e6ec15 nilfs2: fix incorrect masking of permission flags for symlinks
The permission flags of newly created symlinks are wrongly dropped on
nilfs2 with the current umask value even though symlinks should have 777
(rwxrwxrwx) permissions:

 $ umask
 0022
 $ touch file && ln -s file symlink; ls -l file symlink
 -rw-r--r--. 1 root root 0 Jun 23 16:29 file
 lrwxr-xr-x. 1 root root 4 Jun 23 16:29 symlink -> file

This fixes the bug by inserting a missing check that excludes
symlinks.

Link: https://lkml.kernel.org/r/1655974441-5612-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Tommy Pettersson <ptp@lysator.liu.se>
Reported-by: Ciprian Craciun <ciprian.craciun@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03 15:42:33 -07:00
Linus Torvalds 20855e4cb3 Fixes for 5.19-rc5:
- Fix statfs blocking on background inode gc workers
  - Fix some broken inode lock assertion code
  - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
    operation
  - Clean up xattr recovery to make it easier to understand.
  - Fix xattr leaf block verifiers tripping over empty blocks.
  - Remove complicated and error prone xattr leaf block bholding mess.
  - Fix a bug where an rt extent crossing EOF was treated as "posteof"
    blocks and cleaned unnecessarily.
  - Fix a UAF when log shutdown races with unmount.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmK/kVMACgkQ+H93GTRK
 tOs0tQ/+PYRhEDKrgocxZGJFNvnxqPRdEDu9k5XCnO2Y/DZRAF52F0JZaPtuiFH4
 12e9vzYYRNrE9KifzPWo4j2L067kFszt4XcAjytJuf5f6k/duX7XbsdMb17Qxd28
 mZDtBBSQCc9fcQo21u5SdZlPaD1SC1843jB4Oe7Sbo3AFvVAMwuBUgnp2TSDA8V0
 0q25PUD0ZvWP3UTQS4M4fW4WhFa5wF+GnLR1DZjryFIzuUp9JwdCQZHIFnp6cHq9
 TZMDJ4WhD9igMSzicRfgPoC8z/D3Mm0cFmRoURbG3GLzAeJ+e7PJ43rvlwq6Ajcv
 v5DhyQvFkiVjKLsrtJyvvUGSpkLL/touNG8MUE9I0heiiwb0QbP108aHWU8AS1Dr
 q7XHIxPaOhvlzVZN1uTuZE4N51/0NWITGKBwF0XU1b5D3wLyvOY6fbI7KLfkX2Sa
 4zHKn4QpHUIE9fs5Na3H6L+ndlJclo2DJA6lF26pLgmrT7NLmJG+r97XagBsp/pr
 X8qOvVMg1XJA37Vy1bTN5cfEYzTTksJk/fQ3AvSKHDCeP5u87kiZ6hqNnW6dD0YF
 D8VTX29rVQr5HavbcGCmAyBZpk4CfclCsWCQrZu9MCnQSW37HnObXPJkIWvzt8Mn
 j6emhPcYHy5TwSChxdpzl733ZX0KdkdOAgkWgqtod2E/7fe+g7Q=
 =8QeL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "This fixes some stalling problems and corrects the last of the
  problems (I hope) observed during testing of the new atomic xattr
  update feature.

   - Fix statfs blocking on background inode gc workers

   - Fix some broken inode lock assertion code

   - Fix xattr leaf buffer leaks when cancelling a deferred xattr update
     operation

   - Clean up xattr recovery to make it easier to understand.

   - Fix xattr leaf block verifiers tripping over empty blocks.

   - Remove complicated and error prone xattr leaf block bholding mess.

   - Fix a bug where an rt extent crossing EOF was treated as "posteof"
     blocks and cleaned unnecessarily.

   - Fix a UAF when log shutdown races with unmount"

* tag 'xfs-5.19-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: prevent a UAF when log IO errors race with unmount
  xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
  xfs: don't hold xattr leaf buffers across transaction rolls
  xfs: empty xattr leaf header blocks are not corruption
  xfs: clean up the end of xfs_attri_item_recover
  xfs: always free xattri_leaf_bp when cancelling a deferred op
  xfs: use invalidate_lock to check the state of mmap_lock
  xfs: factor out the common lock flags assert
  xfs: introduce xfs_inodegc_push()
  xfs: bound maximum wait time for inodegc work
2022-07-03 09:42:17 -07:00
Linus Torvalds 69cb6c6556 Notable regression fixes:
- Fix NFSD crash during NFSv4.2 READ_PLUS operation
 - Fix incorrect status code returned by COMMIT operation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmLAhFIACgkQM2qzM29m
 f5fK2w//TyUAzwoTZQEgmFbxNhXgUYC6uwWPqTCLMieStKtPwT8GsuXviY34QziF
 2vER0NW9Am2GQyL3gtiYFM07OoHhQ4gr86ltaeIHAHhcwm3eIs879ARwEsyN6eDR
 +RDpqnONwtg+yaepfCMc4Bki9Jex+mmoXro86nFPmH+TDM5QiIRY0ncBWSLVWvYT
 YciAgvL6vfo2G79NYOzohoTb15ydotmy6m9H70nN+a2l6bKOIT8cF4S8lZETJZXA
 Rlj+R0eE0iXZTtp7VsAfVAHHfOzexGJjE85hpVzGiZWbxe6o0WNmBpoHICXs9VoP
 WRkq6vBC9P6wJ7EOu84/SRlR/rQaxfrYB7beqTC2kO6t5Ka5/xpJMKZOhfukCMV+
 7+uPAUnSIRqpHG0hWm1kPto00bfXm9fMETQbOEw7UQ9dH/322iOJnXwy03ZEXtJq
 9G0k+yNcydoIs2g5OpNrw4f/wTQcdlbf1ZA5O9dAsxwS1ZTNpKUiE0Sd0Ez/0VIJ
 t5DZFWH44rs/60avxRYxtw965gNugTxb1YiKdXObvbFGf+xYtH0zePce++4kTJlE
 t886stLcHbuPYs4/XItwTxLMSnDM5UR22Icho7ElRHvPiWjZmc6o2fxCkWTzCOE2
 ylZowLFMYZ/KbrP5mcqLARKEP6oFJERXt/U8U0CWeVbZomy9GVY=
 =z5W8
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable regression fixes:

   - Fix NFSD crash during NFSv4.2 READ_PLUS operation

   - Fix incorrect status code returned by COMMIT operation"

* tag 'nfsd-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Fix READ_PLUS crasher
  NFSD: restore EINVAL error translation in nfsd_commit()
2022-07-02 11:20:56 -07:00
Linus Torvalds 76ff294e16 More NFS Client Bugfixes for Linux 5.19-rc
- Bugfixes:
   - Allocate a fattr for _nfs4_discover_trunking()
   - Fix module reference count leak in nfs4_run_state_manager()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmK/NHYACgkQ18tUv7Cl
 QOtbXBAAhivn5bqZvrKz4WI4WjddTRcjyvITiW6m26GZZVgNHz1Sc2Tp6TA6QEL8
 c7OLkj2SoA0SmO4kZX+gCpOzsapQA7FUWULFxpPxJFp/NCsgoxYqO5ZfX0qVpk6t
 1w8lGFqsMve3LdRmcqvbaIrvzJPMdsVvixrZwXRQMe/atvtUMmgHo4pkBPuQ5nv9
 FzWh0KRiohhEWSmncD0fjdYBCq4VOqrUEEn4BTeMoXwvg/noLj5GX3mS14CFH12Y
 +iOqQ05O48Ny8qhzeQ8bRat43t4cZoCpFUcwEPB0CWoNCqS4Qoqvn48Ic+4WMxpN
 nPg2CqkqaG2RUJozSJz8m+GQNbEohoGkruZXJh7TQaqWXrIJGRMBhKhI2b7hujBG
 meu0ypETzlbofjleCpevfvnNStoRTuakssMpcU/hfKjnfNIsHnADSYey0HWHWTAH
 ZaBT6N5Z3C6hRWz7wXIn3uTuWadZbDC9+HGvvyMpuP+PDQFM9exJfhoO0JQvQc/G
 dPB7SyINVj2Z9gguJaew4csRoSxZqxeLB28XHQ7yyYsmo7lbUWaeUgGGpynrrsZL
 Ysuh7JVoZyhSSdNb3b2vlnI7zK+F1qIkzg0/u2ZN8CLPfMaBCmPL031oFudpWJ2n
 TLnxoFHhsG2MTwua3CaOfk8QvNc2dT7ZsvxXHl7HxBlxGtBgSN4=
 =OYKI
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Allocate a fattr for _nfs4_discover_trunking()

 - Fix module reference count leak in nfs4_run_state_manager()

* tag 'nfs-for-5.19-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
  NFS: restore module put when manager exits.
2022-07-01 11:11:32 -07:00
Linus Torvalds 6f8693ea2b A filesystem fix, marked for stable. There appears to be a deeper
issue on the MDS side, but for now we are going with this one-liner
 to avoid busy looping and potential soft lockups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmK/C8MTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi1cEB/9CiJoDsc1v+DrP/4Ud/AbI4LMffMcr
 tkHmUo8ZT5D4feUzSFE6iKgb3gRCJUkYKzesywQ7Xhv7Mr6/DKB4+t9QtrympZFd
 sAg775mHkL0NI6/OLnLSRva/r627PFk6f1v8OWENOjsw01PLOtWAB/B5FqlgN8tG
 EQLfX0G83o4AXt4NcPCcsucPh7FxC2iKe8XWqAE6VTjkKnyz3IQHvSLweWV68U8R
 ht6eun8H+slx8Kw1lSZfW/XoFGFO4uKntCh/CKKH28ZqaXrxrdsfmXSVOMlOi351
 qxPfrTPgaSfvWQLbYQfPdQZCsfyyPgP2wdAVfpy56vk0yoxi2TLGBPsD
 =bu9O
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "A ceph filesystem fix, marked for stable.

  There appears to be a deeper issue on the MDS side, but for now we are
  going with this one-liner to avoid busy looping and potential soft
  lockups"

* tag 'ceph-for-5.19-rc5' of https://github.com/ceph/ceph-client:
  ceph: wait on async create before checking caps for syncfs
2022-07-01 11:06:21 -07:00
Linus Torvalds 0a35d1622d io_uring-5.19-2022-07-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmK+6CoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsZPD/9xPZTAJhX3/HNTjbi+FlSvTaJ/4rll98No
 1pzW+nZyBVr4yesnHW2qtLwLRaYMNAFjdJmakn1BIUau4IT4Eqhb8NEz4ZCKnDD2
 Kwi0q/9c0I/GxTnVXmwXPQzQkZarYLa8cppQr1L/L3el1xTU9qXUdpR7+vxPKi4J
 ADDP+7buRYp7Td2RfBD2lD4B7jNMpZYVC/2/Y3fixkuJvK4eYKuf+5K7zgmbahm5
 YOm86k3P7QN7saTxUeyUrwR/G6CoY99Dd54KadQAS4XkU1f6XuNjF6IsYjPUEZ1B
 pKlhK4mhGieMlW8yBti0BdJLLTAHVsL9Pa0Aqsv1EdZ3x/Mfp9kmwig9RAGREyQX
 gNs316VgsfnZb+AdImZ9EItRnPZ/1Z0//VOWiDy7CijKABCZCSFXqOwQ+Yonyfab
 ZoVXlwlvOaxmiQAWhJe2XKxzRtAfeQgyirmF95N+c/wtIH6dWzJeIs2xFLPIKCaY
 tkv5Ah4IBGxofJj1SNqKNRUcv6N/Hr7zs/p6yTQpVEoUzsKqzh1eNz8PDA3ewrq4
 C6nkXnZfidyqPuUZJIfOa02N/cPLUSclxdll6pHQfIMiwLBlV60pFcSsylgdYTE+
 XT/iwiiaSTPUUIkCTYhyoUpfZnNX6IoVpxKOuh5gLOmTz/+xlRfcRjcjuXIoneHQ
 D9qlUWbYLA==
 =Edge
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two minor tweaks:

   - While we still can, adjust the send/recv based flags to be in
     ->ioprio rather than in ->addr2. This is consistent with eg accept,
     and also doesn't waste a full 64-bit field for flags (Pavel)

   - 5.18-stable fix for re-importing provided buffers. Not much real
     world relevance here as it'll only impact non-pollable files gone
     async, which is more of a practical test case rather than something
     that is used in the wild (Dylan)"

* tag 'io_uring-5.19-2022-07-01' of git://git.kernel.dk/linux-block:
  io_uring: fix provided buffer import
  io_uring: keep sendrecv flags in ioprio
2022-07-01 10:52:01 -07:00
Darrick J. Wong 7561cea5db xfs: prevent a UAF when log IO errors race with unmount
KASAN reported the following use after free bug when running
generic/475:

 XFS (dm-0): Mounting V5 Filesystem
 XFS (dm-0): Starting recovery (logdev: internal)
 XFS (dm-0): Ending recovery (logdev: internal)
 Buffer I/O error on dev dm-0, logical block 20639616, async page read
 Buffer I/O error on dev dm-0, logical block 20639617, async page read
 XFS (dm-0): log I/O error -5
 XFS (dm-0): Filesystem has been shut down due to log error (0x2).
 XFS (dm-0): Unmounting Filesystem
 XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
 ==================================================================
 BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
 Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136

 CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
 Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
 Call Trace:
  <TASK>
  dump_stack_lvl+0x34/0x44
  print_report.cold+0x2b8/0x661
  ? do_raw_spin_lock+0x246/0x270
  kasan_report+0xab/0x120
  ? do_raw_spin_lock+0x246/0x270
  do_raw_spin_lock+0x246/0x270
  ? rwlock_bug.part.0+0x90/0x90
  xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
  process_one_work+0x672/0x1040
  worker_thread+0x59b/0xec0
  ? __kthread_parkme+0xc6/0x1f0
  ? process_one_work+0x1040/0x1040
  ? process_one_work+0x1040/0x1040
  kthread+0x29e/0x340
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  </TASK>

 Allocated by task 154099:
  kasan_save_stack+0x1e/0x40
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0x8d/0x2e0 [xfs]
  xlog_cil_init+0x1f/0x540 [xfs]
  xlog_alloc_log+0xd1e/0x1260 [xfs]
  xfs_log_mount+0xba/0x640 [xfs]
  xfs_mountfs+0xf2b/0x1d00 [xfs]
  xfs_fs_fill_super+0x10af/0x1910 [xfs]
  get_tree_bdev+0x383/0x670
  vfs_get_tree+0x7d/0x240
  path_mount+0xdb7/0x1890
  __x64_sys_mount+0x1fa/0x270
  do_syscall_64+0x2b/0x80
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

 Freed by task 154151:
  kasan_save_stack+0x1e/0x40
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  ____kasan_slab_free+0x110/0x190
  slab_free_freelist_hook+0xab/0x180
  kfree+0xbc/0x310
  xlog_dealloc_log+0x1b/0x2b0 [xfs]
  xfs_unmountfs+0x119/0x200 [xfs]
  xfs_fs_put_super+0x6e/0x2e0 [xfs]
  generic_shutdown_super+0x12b/0x3a0
  kill_block_super+0x95/0xd0
  deactivate_locked_super+0x80/0x130
  cleanup_mnt+0x329/0x4d0
  task_work_run+0xc5/0x160
  exit_to_user_mode_prepare+0xd4/0xe0
  syscall_exit_to_user_mode+0x1d/0x40
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This appears to be a race between the unmount process, which frees the
CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
generic/475 runs, it starts fsstress in the background, waits a few
seconds, and substitutes a dm-error device to simulate a disk falling
out of a machine.  If the fsstress encounters EIO on a pure data write,
it will exit but the filesystem will still be online.

The next thing the test does is unmount the filesystem, which tries to
clean the log, free the CIL, and wait for iclog IO completion.  If an
iclog was being written when the dm-error switch occurred, it can race
with log unmounting as follows:

Thread 1				Thread 2

					xfs_log_unmount
					xfs_log_clean
					xfs_log_quiesce
xlog_ioend_work
<observe error>
xlog_force_shutdown
test_and_set_bit(XLOG_IOERROR)
					xfs_log_force
					<log is shut down, nop>
					xfs_log_umount_write
					<log is shut down, nop>
					xlog_dealloc_log
					xlog_cil_destroy
					<wait for iclogs>
spin_lock(&log->l_cilp->xc_push_lock)
<KABOOM>

Therefore, free the CIL after waiting for the iclogs to complete.  I
/think/ this race has existed for quite a few years now, though I don't
remember the ~2014 era logging code well enough to know if it was a real
threat then or if the actual race was exposed only more recently.

Fixes: ac983517ec ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-07-01 09:09:52 -07:00
Amir Goldstein 868f9f2f8e vfs: fix copy_file_range() regression in cross-fs copies
A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file.

Before commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices") the kernel would return -EXDEV to userspace when trying to
copy a file across different filesystems.  After this commit, the
syscall doesn't fail anymore and instead returns zero (zero bytes
copied), as this file's content is generated on-the-fly and thus reports
a size of zero.

Another regression has been reported by He Zhe - the assertion of
WARN_ON_ONCE(ret == -EOPNOTSUPP) can be triggered from userspace when
copying from a sysfs file whose read operation may return -EOPNOTSUPP.

Since we do not have test coverage for copy_file_range() between any two
types of filesystems, the best way to avoid these sort of issues in the
future is for the kernel to be more picky about filesystems that are
allowed to do copy_file_range().

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5f ("vfs: allow copy_file_range to copy across
devices"), namely, cross-sb copy is not allowed for filesystems that do
not implement ->copy_file_range().

Filesystems that do implement ->copy_file_range() have full control of
the result - if this method returns an error, the error is returned to
the user.  Before this change this was only true for fs that did not
implement the ->remap_file_range() operation (i.e.  nfsv3).

Filesystems that do not implement ->copy_file_range() still fall-back to
the generic_copy_file_range() implementation when the copy is within the
same sb.  This helps the kernel can maintain a more consistent story
about which filesystems support copy_file_range().

nfsd and ksmbd servers are modified to fall-back to the
generic_copy_file_range() implementation in case vfs_copy_file_range()
fails with -EOPNOTSUPP or -EXDEV, which preserves behavior of
server-side-copy.

fall-back to generic_copy_file_range() is not implemented for the smb
operation FSCTL_DUPLICATE_EXTENTS_TO_FILE, which is arguably a correct
change of behavior.

Fixes: 5dae222a5f ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/20210212044405.4120619-1-drinkcat@chromium.org/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx+BnvW=ZmpsG47CyHFJwnw7zSX6Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-fsdevel/20210126135012.1.If45b7cdc3ff707bc1efa17f5366057d60603c45f@changeid/
Link: https://lore.kernel.org/linux-fsdevel/20210630161320.29006-1-lhenriques@suse.de/
Reported-by: Nicolas Boichat <drinkcat@chromium.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Fixes: 64bf5ff58d ("vfs: no fallback for ->copy_file_range")
Link: https://lore.kernel.org/linux-fsdevel/20f17f64-88cb-4e80-07c1-85cb96c83619@windriver.com/
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-30 15:16:38 -07:00
Scott Mayhew 4f40a5b554 NFSv4: Add an fattr allocation to _nfs4_discover_trunking()
This was missed in c3ed222745 ("NFSv4: Fix free of uninitialized
nfs4_label on referral lookup.") and causes a panic when mounting
with '-o trunkdiscovery':

PID: 1604   TASK: ffff93dac3520000  CPU: 3   COMMAND: "mount.nfs"
 #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee
 #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd
 #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed
 #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d
 #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e
    [exception RIP: nfs_fattr_init+0x5]
    RIP: ffffffffc0c18265  RSP: ffffb79140f73b08  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff93dac304a800  RCX: 0000000000000000
    RDX: ffffb79140f73bb0  RSI: ffff93dadc8cbb40  RDI: d03ee11cfaf6bd50
    RBP: ffffb79140f73be8   R8: ffffffffc0691560   R9: 0000000000000006
    R10: ffff93db3ffd3df8  R11: 0000000000000000  R12: ffff93dac4040000
    R13: ffff93dac2848e00  R14: ffffb79140f73b60  R15: ffffb79140f73b30
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4]
 #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4]
 #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4]
 #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs]
 #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs]
    RIP: 00007f6254fce26e  RSP: 00007ffc69496ac8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000000  RCX: 00007f6254fce26e
    RDX: 00005600220a82a0  RSI: 00005600220a64d0  RDI: 00005600220a6520
    RBP: 00007ffc69496c50   R8: 00005600220a8710   R9: 003035322e323231
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007ffc69496c50
    R13: 00005600220a8440  R14: 0000000000000010  R15: 0000560020650ef9
    ORIG_RAX: 00000000000000a5  CS: 0033  SS: 002b

Fixes: c3ed222745 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00
NeilBrown 080abad71e NFS: restore module put when manager exits.
Commit f49169c97f ("NFSD: Remove svc_serv_ops::svo_module") removed
calls to module_put_and_kthread_exit() from threads that acted as SUNRPC
servers and had a related svc_serv_ops structure.  This was correct.

It ALSO removed the module_put_and_kthread_exit() call from
nfs4_run_state_manager() which is NOT a SUNRPC service.

Consequently every time the NFSv4 state manager runs the module count
increments and won't be decremented.  So the nfsv4 module cannot be
unloaded.

So restore the module_put_and_kthread_exit() call.

Fixes: f49169c97f ("NFSD: Remove svc_serv_ops::svo_module")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-30 16:13:00 -04:00
Dylan Yudaken 09007af2b6 io_uring: fix provided buffer import
io_import_iovec uses the s pointer, but this was changed immediately
after the iovec was re-imported and so it was imported into the wrong
place.

Change the ordering.

Fixes: 2be2eb02e2 ("io_uring: ensure reads re-import for selected buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220630132006.2825668-1-dylany@fb.com
[axboe: ensure we don't half-import as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-30 11:34:41 -06:00
Linus Torvalds 9fb3bb25d1 Merge tag 'fsnotify_for_v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fanotify fix from Jan Kara:
 "A fix for recently added fanotify API to have stricter checks and
  refuse some invalid flag combinations to make our life easier in the
  future"

* tag 'fsnotify_for_v5.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fanotify: refine the validation checks on non-dir inode mask
2022-06-30 09:57:18 -07:00
Pavel Begunkov 29c1ac230e io_uring: keep sendrecv flags in ioprio
We waste a u64 SQE field for flags even though we don't need as many
bits and it can be used for something more useful later. Store io_uring
specific send/recv flags in sqe->ioprio instead of ->addr2.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 0455d4ccec ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg")
[axboe: change comment in io_uring.h as well]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-30 07:15:50 -06:00
Linus Torvalds 732f306943 six ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmK7GlQACgkQiiy9cAdy
 T1FIPQv/agDjMbpfqI4+3JrDzGofGyUgq31bQRwjKouM8dxC7iE0KRaerzDUYur1
 oZSaqidCZPiFTrROQ2Kuzf8YAKFS0XubaH3HTJd5yu2/WiApsSNwelcDYtrolq5T
 OnC3t5tpkVDUcx8ynHAGrYe0mEkY5+IIYkKU9O9JduPlDX4qC796Iel9ESg5LoDR
 0fx+HTdCr3OdfOipOEUl4HxmwWsx8u7RywmNPusaH21jbuC9w1LWSLBWuFIZywio
 bHKq57P9k/I9OXTGV7DuEaf2uvZd+ceH1ZahgXIvDdHzHM183ltdTWiyIbmtLDAj
 iYyos6UZsdK5IpN8+z9wY1zqUfj0btr+UVhbrXRwv5dIH+C0tnYCwDGNXszUh7Id
 XTeQWnaI+BaBA9VuqpvUtSXIJTsikaTYdnk4m/rLDYPaHVeuCKv1QkcitWpjHvIt
 FPlIvB95QMy6CrWtlGxjwDpNBLnca4NKigLTCESJaIKcvnn25kA8OtRG6tHbpvVM
 KiCkjH14
 =8zjT
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull ksmbd server fixes from Steve French:

 - seek null check (don't use f_seek op directly and blindly)

 - offset validation in FSCTL_SET_ZERO_DATA

 - fallocate fix (relates e.g. to xfstests generic/091 and 263)

 - two cleanup fixes

 - fix socket settings on some arch

* tag '5.19-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: use vfs_llseek instead of dereferencing NULL
  ksmbd: check invalid FileOffset and BeyondFinalZero in FSCTL_ZERO_DATA
  ksmbd: set the range of bytes to zero without extending file size in FSCTL_ZERO_DATA
  ksmbd: remove duplicate flag set in smb2_write
  ksmbd: smbd: Remove useless license text when SPDX-License-Identifier is already used
  ksmbd: use SOCK_NONBLOCK type for kernel_accept()
2022-06-29 09:20:40 -07:00
Jeff Layton 8692969e91 ceph: wait on async create before checking caps for syncfs
Currently, we'll call ceph_check_caps, but if we're still waiting
on the reply, we'll end up spinning around on the same inode in
flush_dirty_session_caps. Wait for the async create reply before
flushing caps.

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/55823
Fixes: fbed7045f5 ("ceph: wait for async create reply before sending any cap messages")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2022-06-29 18:02:57 +02:00
Darrick J. Wong 8944c6fb8a xfs: dont treat rt extents beyond EOF as eofblocks to be cleared
On a system with a realtime volume and a 28k realtime extent,
generic/491 fails because the test opens a file on a frozen filesystem
and closing it causes xfs_release -> xfs_can_free_eofblocks to
mistakenly think that the the blocks of the realtime extent beyond EOF
are posteof blocks to be freed.  Realtime extents cannot be partially
unmapped, so this is pointless.  Worse yet, this triggers posteof
cleanup, which stalls on a transaction allocation, which is why the test
fails.

Teach the predicate to account for realtime extents properly.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-06-29 08:47:56 -07:00
Darrick J. Wong e53bcffad0 xfs: don't hold xattr leaf buffers across transaction rolls
Now that we've established (again!) that empty xattr leaf buffers are
ok, we no longer need to bhold them to transactions when we're creating
new leaf blocks.  Get rid of the entire mechanism, which should simplify
the xattr code quite a bit.

The original justification for using bhold here was to prevent the AIL
from trying to write the empty leaf block into the fs during the brief
time that we release the buffer lock.  The reason for /that/ was to
prevent recovery from tripping over the empty ondisk block.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-29 08:47:56 -07:00
Darrick J. Wong 7be3bd8856 xfs: empty xattr leaf header blocks are not corruption
TLDR: Revert commit 51e6104fdb ("xfs: detect empty attr leaf blocks in
xfs_attr3_leaf_verify") because it was wrong.

Every now and then we get a corruption report from the kernel or
xfs_repair about empty leaf blocks in the extended attribute structure.
We've long thought that these shouldn't be possible, but prior to 5.18
one would shake loose in the recoveryloop fstests about once a month.

A new addition to the xattr leaf block verifier in 5.19-rc1 makes this
happen every 7 minutes on my testing cloud.  I added a ton of logging to
detect any time we set the header count on an xattr leaf block to zero.
This produced the following dmesg output on generic/388:

XFS (sda4): ino 0x21fcbaf leaf 0x129bf78 hdcount==0!
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 xfs_attr3_leaf_create+0x187/0x230
 xfs_attr_shortform_to_leaf+0xd1/0x2f0
 xfs_attr_set_iter+0x73e/0xa90
 xfs_xattri_finish_update+0x45/0x80
 xfs_attr_finish_item+0x1b/0xd0
 xfs_defer_finish_noroll+0x19c/0x770
 __xfs_trans_commit+0x153/0x3e0
 xfs_attr_set+0x36b/0x740
 xfs_xattr_set+0x89/0xd0
 __vfs_setxattr+0x67/0x80
 __vfs_setxattr_noperm+0x6e/0x120
 vfs_setxattr+0x97/0x180
 setxattr+0x88/0xa0
 path_setxattr+0xc3/0xe0
 __x64_sys_setxattr+0x27/0x30
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

So now we know that someone is creating empty xattr leaf blocks as part
of converting a sf xattr structure into a leaf xattr structure.  The
conversion routine logs any existing sf attributes in the same
transaction that creates the leaf block, so we know this is a setxattr
to a file that has no attributes at all.

Next, g/388 calls the shutdown ioctl and cycles the mount to trigger log
recovery.  I also augmented buffer item recovery to call ->verify_struct
on any attr leaf blocks and complain if it finds a failure:

XFS (sda4): Unmounting Filesystem
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Starting recovery (logdev: internal)
XFS (sda4): xattr leaf daddr 0x129bf78 hdrcount == 0!
Call Trace:
 <TASK>
 dump_stack_lvl+0x34/0x44
 xfs_attr3_leaf_verify+0x3b8/0x420
 xlog_recover_buf_commit_pass2+0x60a/0x6c0
 xlog_recover_items_pass2+0x4e/0xc0
 xlog_recover_commit_trans+0x33c/0x350
 xlog_recovery_process_trans+0xa5/0xe0
 xlog_recover_process_data+0x8d/0x140
 xlog_do_recovery_pass+0x19b/0x720
 xlog_do_log_recovery+0x62/0xc0
 xlog_do_recover+0x33/0x1d0
 xlog_recover+0xda/0x190
 xfs_log_mount+0x14c/0x360
 xfs_mountfs+0x517/0xa60
 xfs_fs_fill_super+0x6bc/0x950
 get_tree_bdev+0x175/0x280
 vfs_get_tree+0x1a/0x80
 path_mount+0x6f5/0xaa0
 __x64_sys_mount+0x103/0x140
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fc61e241eae

And a moment later, the _delwri_submit of the recovered buffers trips
the same verifier and recovery fails:

XFS (sda4): Metadata corruption detected at xfs_attr3_leaf_verify+0x393/0x420 [xfs], xfs_attr3_leaf block 0x129bf78
XFS (sda4): Unmount and run xfs_repair
XFS (sda4): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
00000010: 00 00 00 00 01 29 bf 78 00 00 00 00 00 00 00 00  .....).x........
00000020: a5 1b d0 02 b2 9a 49 df 8e 9c fb 8d f8 31 3e 9d  ......I......1>.
00000030: 00 00 00 00 02 1f cb af 00 00 00 00 10 00 00 00  ................
00000040: 00 50 0f b0 00 00 00 00 00 00 00 00 00 00 00 00  .P..............
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
XFS (sda4): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0x37f/0x3b0 [xfs] (fs/xfs/xfs_buf.c:1518).  Shutting down filesystem.
XFS (sda4): Please unmount the filesystem and rectify the problem(s)
XFS (sda4): log mount/recovery failed: error -117
XFS (sda4): log mount failed

I think I see what's going on here -- setxattr is racing with something
that shuts down the filesystem:

Thread 1				Thread 2
--------				--------
xfs_attr_sf_addname
xfs_attr_shortform_to_leaf
<create empty leaf>
xfs_trans_bhold(leaf)
xattri_dela_state = XFS_DAS_LEAF_ADD
<roll transaction>
					<flush log>
					<shut down filesystem>
xfs_trans_bhold_release(leaf)
<discover fs is dead, bail>

Thread 3
--------
<cycle mount, start recovery>
xlog_recover_buf_commit_pass2
xlog_recover_do_reg_buffer
<replay empty leaf buffer from recovered buf item>
xfs_buf_delwri_queue(leaf)
xfs_buf_delwri_submit
_xfs_buf_ioapply(leaf)
xfs_attr3_leaf_write_verify
<trip over empty leaf buffer>
<fail recovery>

As you can see, the bhold keeps the leaf buffer locked and thus prevents
the *AIL* from tripping over the ichdr.count==0 check in the write
verifier.  Unfortunately, it doesn't prevent the log from getting
flushed to disk, which sets up log recovery to fail.

So.  It's clear that the kernel has always had the ability to persist
attr leaf blocks with ichdr.count==0, which means that it's part of the
ondisk format now.

Unfortunately, this check has been added and removed multiple times
throughout history.  It first appeared in[1] kernel 3.10 as part of the
early V5 format patches.  The check was later discovered to break log
recovery and hence disabled[2] during log recovery in kernel 4.10.
Simultaneously, the check was added[3] to xfs_repair 4.9.0 to try to
weed out the empty leaf blocks.  This was still not correct because log
recovery would recover an empty attr leaf block successfully only for
regular xattr operations to trip over the empty block during of the
block during regular operation.  Therefore, the check was removed
entirely[4] in kernel 5.7 but removal of the xfs_repair check was
forgotten.  The continued complaints from xfs_repair lead to us
mistakenly re-adding[5] the verifier check for kernel 5.19.  Remove it
once again.

[1] 517c22207b ("xfs: add CRCs to attr leaf blocks")
[2] 2e1d23370e ("xfs: ignore leaf attr ichdr.count in verifier
                   during log replay")
[3] f7140161 ("xfs_repair: junk leaf attribute if count == 0")
[4] f28cef9e4d ("xfs: don't fail verifier on empty attr3 leaf
                   block")
[5] 51e6104fdb ("xfs: detect empty attr leaf blocks in
                   xfs_attr3_leaf_verify")

Looking at the rest of the xattr code, it seems that files with empty
leaf blocks behave as expected -- listxattr reports no attributes;
getxattr on any xattr returns nothing as expected; removexattr does
nothing; and setxattr can add attributes just fine.

Original-bug: 517c22207b ("xfs: add CRCs to attr leaf blocks")
Still-not-fixed-by: 2e1d23370e ("xfs: ignore leaf attr ichdr.count in verifier during log replay")
Removed-in: f28cef9e4d ("xfs: don't fail verifier on empty attr3 leaf block")
Fixes: 51e6104fdb ("xfs: detect empty attr leaf blocks in xfs_attr3_leaf_verify")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-06-29 08:47:56 -07:00
Amir Goldstein 8698e3bab4 fanotify: refine the validation checks on non-dir inode mask
Commit ceaf69f8ea ("fanotify: do not allow setting dirent events in
mask of non-dir") added restrictions about setting dirent events in the
mask of a non-dir inode mark, which does not make any sense.

For backward compatibility, these restictions were added only to new
(v5.17+) APIs.

It also does not make any sense to set the flags FAN_EVENT_ON_CHILD or
FAN_ONDIR in the mask of a non-dir inode.  Add these flags to the
dir-only restriction of the new APIs as well.

Move the check of the dir-only flags for new APIs into the helper
fanotify_events_supported(), which is only called for FAN_MARK_ADD,
because there is no need to error on an attempt to remove the dir-only
flags from non-dir inode.

Fixes: ceaf69f8ea ("fanotify: do not allow setting dirent events in mask of non-dir")
Link: https://lore.kernel.org/linux-fsdevel/20220627113224.kr2725conevh53u4@quack3.lan/
Link: https://lore.kernel.org/r/20220627174719.2838175-1-amir73il@gmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-28 11:18:13 +02:00
Alexey Khoroshilov 8a9ffb8c85 NFSD: restore EINVAL error translation in nfsd_commit()
commit 555dbf1a9a ("nfsd: Replace use of rwsem with errseq_t")
incidentally broke translation of -EINVAL to nfserr_notsupp.
The patch restores that.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Fixes: 555dbf1a9a ("nfsd: Replace use of rwsem with errseq_t")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2022-06-27 10:33:05 -04:00
Darrick J. Wong f94e08b602 xfs: clean up the end of xfs_attri_item_recover
The end of this function could use some cleanup -- the EAGAIN
conditionals make it harder to figure out what's going on with the
disposal of xattri_leaf_bp, and the dual error/ret variables aren't
needed.  Turn the EAGAIN case into a separate block documenting all the
subtleties of recovering in the middle of an xattr update chain, which
makes the rest of the prologue much simpler.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-06-26 14:43:28 -07:00
Darrick J. Wong b822ea17fd xfs: always free xattri_leaf_bp when cancelling a deferred op
While running the following fstest with logged xattrs DISabled, I
noticed the following:

# FSSTRESS_AVOID="-z -f unlink=1 -f rmdir=1 -f creat=2 -f mkdir=2 -f
getfattr=3 -f listfattr=3 -f attr_remove=4 -f removefattr=4 -f
setfattr=20 -f attr_set=60" ./check generic/475

INFO: task u9:1:40 blocked for more than 61 seconds.
      Tainted: G           O      5.19.0-rc2-djwx #rc2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:u9:1            state:D stack:12872 pid:   40 ppid:     2 flags:0x00004000
Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
Call Trace:
 <TASK>
 __schedule+0x2db/0x1110
 schedule+0x58/0xc0
 schedule_timeout+0x115/0x160
 __down_common+0x126/0x210
 down+0x54/0x70
 xfs_buf_lock+0x2d/0xe0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
 xfs_buf_item_unpin+0x227/0x3a0 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
 xfs_trans_committed_bulk+0x18e/0x320 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
 xlog_cil_committed+0x2ea/0x360 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
 xlog_cil_push_work+0x60f/0x690 [xfs 0532c1cb1d67dd81d15cb79ac6e415c8dec58f73]
 process_one_work+0x1df/0x3c0
 worker_thread+0x53/0x3b0
 kthread+0xea/0x110
 ret_from_fork+0x1f/0x30
 </TASK>

This appears to be the result of shortform_to_leaf creating a new leaf
buffer as part of adding an xattr to a file.  The new leaf buffer is
held and attached to the xfs_attr_intent structure, but then the
filesystem shuts down.  Instead of the usual path (which adds the attr
to the held leaf buffer which releases the hold), we instead cancel the
entire deferred operation.

Unfortunately, xfs_attr_cancel_item doesn't release any attached leaf
buffers, so we leak the locked buffer.  The CIL cannot do anything
about that, and hangs.  Fix this by teaching it to release leaf buffers,
and make XFS a little more careful about not leaving a dangling
reference.

The prologue of xfs_attri_item_recover is (in this author's opinion) a
little hard to figure out, so I'll clean that up in the next patch.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-06-26 14:43:28 -07:00
Kaixu Xia 82af880639 xfs: use invalidate_lock to check the state of mmap_lock
We should use invalidate_lock and XFS_MMAPLOCK_SHARED to check the state
of mmap_lock rw_semaphore in xfs_isilocked(), rather than i_rwsem and
XFS_IOLOCK_SHARED.

Fixes: 2433480a7e ("xfs: Convert to use invalidate_lock")
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-26 14:43:28 -07:00
Kaixu Xia ca76a761ea xfs: factor out the common lock flags assert
There are similar lock flags assert in xfs_ilock(), xfs_ilock_nowait(),
xfs_iunlock(), thus we can factor it out into a helper that is clear.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-26 14:43:28 -07:00
Linus Torvalds 413c1f1491 Minor things, mainly - mailmap updates, MAINTAINERS updates, etc.
Fixes for post-5.18 changes:
 
 - fix for a damon boot hang, from SeongJae
 
 - fix for a kfence warning splat, from Jason Donenfeld
 
 - fix for zero-pfn pinning, from Alex Williamson
 
 - fix for fallocate hole punch clearing, from Mike Kravetz
 
 Fixes pre-5.18 material:
 
 - fix for a performance regression, from Marcelo
 
 - fix for a hwpoisining BUG from zhenwei pi
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYri4RgAKCRDdBJ7gKXxA
 jmhsAQDCvGqtIUhgkTwid8KBRNbowsg0LXd6k+gUjcxBhH403wEA0r0cxxkDAmgr
 QNXn/qZRzQP2ji+pdjH9NBOsd2g2XQA=
 =UGJ7
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Minor things, mainly - mailmap updates, MAINTAINERS updates, etc.

  Fixes for this merge window:

   - fix for a damon boot hang, from SeongJae

   - fix for a kfence warning splat, from Jason Donenfeld

   - fix for zero-pfn pinning, from Alex Williamson

   - fix for fallocate hole punch clearing, from Mike Kravetz

  Fixes for previous releases:

   - fix for a performance regression, from Marcelo

   - fix for a hwpoisining BUG from zhenwei pi"

* tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Christian Marangi
  mm/memory-failure: disable unpoison once hw error happens
  hugetlbfs: zero partial pages during fallocate hole punch
  mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py
  mm: re-allow pinning of zero pfns
  mm/kfence: select random number before taking raw lock
  MAINTAINERS: add maillist information for LoongArch
  MAINTAINERS: update MM tree references
  MAINTAINERS: update Abel Vesa's email
  MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer
  MAINTAINERS: add Miaohe Lin as a memory-failure reviewer
  mailmap: add alias for jarkko@profian.com
  mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized
  kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal
  mm: lru_cache_disable: use synchronize_rcu_expedited
  mm/page_isolation.c: fix one kernel-doc comment
2022-06-26 14:00:55 -07:00
Linus Torvalds 82708bb1eb for-5.19-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmK4dV4ACgkQxWXV+ddt
 WDs4uQ/7B0XqPK05NJntJfwnuIoT/yOreKf47wt/6DyFV3CDMFte/qzaZwthwu6P
 F0GMpSYAlVszLlML5elvF9VXymlV+e+QROtbD6QCNLNW1IwHA7ZiF5fV/a1Rj930
 XSuaDyVFPAK7892RR6yMQ20IeMBuvqiAhXWEzaIJ2tIcAHn+fP+VkY8Nc0aZj3iC
 mI+ep4n93karDxmnHVGUxJTxAe0l/uNopx+fYBWQDj7HuoMLo0Cu+rAdv0gRIxi2
 RWUBkR4e4PBwV1OFScwNCsljjt6bHdUHrtdB3fo5Hzu9cO5hHdL7NEsKB1K2w7rV
 bgNuNqfj6Y4xUBchAfQO5CCJ9ISci5KoJ4RBpk6EprZR3QN40kN8GPlhi2519K7w
 F3d8jolDDHlkqxIsqoe47MYOcSepNEadVNsiYKb0rM6doilfxyXiu6dtTFMrC8Vy
 K2HDCdTyuIgw+TnwqT1puaUwxiIL8DFJf1CVyjwGuQ4UgaIEkHXKIsCssyyJ76Jh
 QkWX1aeRldbfkVArJWHQWqDQopx9pFBz1gjlws0YjAsU5YijOOXva464P9Rxg+Gq
 4pRlgnO48joQam9bRirP2Z6yhqa4O6jkzKDOXSYduAUYD7IMfpsYnz09wKS95jj+
 QCrR7VmKnpQdsXg5a/mqyacfIH30ph002VywRxPiFM89Syd25yo=
 =rUrf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - zoned relocation fixes:
      - fix critical section end for extent writeback, this could lead
        to out of order write
      - prevent writing to previous data relocation block group if space
        gets low

 - reflink fixes:
      - fix race between reflinking and ordered extent completion
      - proper error handling when block reserve migration fails
      - add missing inode iversion/mtime/ctime updates on each iteration
        when replacing extents

 - fix deadlock when running fsync/fiemap/commit at the same time

 - fix false-positive KCSAN report regarding pid tracking for read locks
   and data race

 - minor documentation update and link to new site

* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Documentation: update btrfs list of features and link to readthedocs.io
  btrfs: fix deadlock with fsync+fiemap+transaction commit
  btrfs: don't set lock_owner when locking extent buffer for reading
  btrfs: zoned: fix critical section of relocation inode writeback
  btrfs: zoned: prevent allocation from previous data relocation BG
  btrfs: do not BUG_ON() on failure to migrate space when replacing extents
  btrfs: add missing inode updates on each iteration when replacing extents
  btrfs: fix race between reflinking and ordered extent completion
2022-06-26 10:11:36 -07:00
Linus Torvalds 97d4d02697 Description for this pull request:
- Use updated exfat_chain directly instead of snapshot values  in rename.
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAmK3ryYWHGxpbmtpbmpl
 b25Aa2VybmVsLm9yZwAKCRBnC/sDUUQhCLBLEADQV1zAIN3/NwvrHlsB/8fUDLoM
 DgnAfvGSQ61A5DYpx3clo5BtvpRhQIj22/3Jj3AMcUDXUkrEvsgLe1R7tQ6xUu/w
 PXwDPU89F5hI/nviyJYB6g8EjpWctzvbkkgR9eJO6ZxBna7VYHpz3tljnciW9slR
 v4tPD7iSlhaMkb+3qO4ll3LvMx1uKRUATu1C9sh5YbesYAN6A3De6fcSi7lPuXKN
 JefjWOWQEir+/9Hcrjxz/FNOdYzKS6CSE4ps6AJx+mqXkChcWLgwp5+Q+oDOYnAG
 SQ9Tk2pW37Ba7+WlM9HJc5vM2j1a0Ww4HFQEqAG/yzQmP1N97mX4Jv3/9nao0ojY
 OUEchgVIutPojFK1ykXUd4RZSDkLPq+LREtfQ++gfO2/oKlDhS6TwXYMg3Tg5kW5
 q8TIWXzfd+waYCcHtP4MGsh6dVGT15REukbKHzFb6X+e/R1f9cGnI6a68187eDb5
 bewEw/zIwx3GH8UT1k4sjwxvkGFln/HuAS11g5cLbcYG/htSehY1u/ir4ddmNn2u
 bRpUe0KuyzZfBdJ2bvOWJ37P7n4YOweRWudjMOIGENS5NwLFb4zd64NwHzklUmiw
 cpwMsaMxzzOdmf9TobjJ0ioRNh9eiltaxBme1KK22tJSNfGG/6+UnxryKLMFc5Rd
 8KBPcuD8/rOKYdIi4Q==
 =GSFP
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat fix from Namjae Jeon:

 - Use updated exfat_chain directly instead of snapshot values in
   rename.

* tag 'exfat-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: use updated exfat_chain directly during renaming
2022-06-26 08:41:04 -07:00
Linus Torvalds 918c30dffd 7 SMB3 fixes, addressing important multichannel, reconnect issues
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmK3e7AACgkQiiy9cAdy
 T1EYgwwArQqUZTxIAQ8zcM1AKP7CbCmZWKt1EMH3tYghnYc0D1oY7XAZ6obrbVNU
 jstw09T+fc2V/5lCKQ5FRIXFwS9DuG7/pnYeEWoft37zhe5mE/uVdIVbd139LE6t
 Ho96+OY6xkUL+cd2v6v2SyECxE+ahJQBvOmfmY3bvvr9MGWR1pC6aU182cQQCUKs
 sOoPEj/KTYRc2AutMHu0xJTIEkrGkQBaUrUd+YbQKfoMg48WFkTVHl+2XhKumkM+
 2uF97G5P+1J4WPlc/XlsnmJfA93J8H1Ex6rfv3NuMpBh0N6Q5YSCfcheXVTN5yN6
 Px9r5Da1n+M1/WeGSUplm1r+z3jSYYJT9RRi0QPzYBzwlngyqdcfMsspU9qnBEdv
 Yya1QXn3cO0MY1y1SwwX0OqGc3AVdAvTAoPgBJG5fecnhf+X7UVe9dpcg7wRzXE/
 bK97MOdDf4UE9u8NhFcTSmpubu7iplkY1CtMqGDd5VA4pSuRGNYsnQbo+7ZsaG0k
 6Etl+sCz
 =Y554
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Fixes addressing important multichannel, and reconnect issues.

  Multichannel mounts when the server network interfaces changed, or ip
  addresses changed, uncovered problems, especially in reconnect, but
  the patches for this were held up until recently due to some lock
  conflicts that are now addressed.

  Included in this set of fixes:

   - three fixes relating to multichannel reconnect, dynamically
     adjusting the list of server interfaces to avoid problems during
     reconnect

   - a lock conflict fix related to the above

   - two important fixes for negotiate on secondary channels (null
     netname can unintentionally cause multichannel to be disabled to
     some servers)

   - a reconnect fix (reporting incorrect IP address in some cases)"

* tag '5.19-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update cifs_ses::ip_addr after failover
  cifs: avoid deadlocks while updating iface
  cifs: periodically query network interfaces from server
  cifs: during reconnect, update interface if necessary
  cifs: change iface_list from array to sorted linked list
  smb3: use netname when available on secondary channels
  smb3: fix empty netname context on secondary channels
2022-06-26 08:34:52 -07:00
Jason A. Donenfeld 067baa9a37 ksmbd: use vfs_llseek instead of dereferencing NULL
By not checking whether llseek is NULL, this might jump to NULL. Also,
it doesn't check FMODE_LSEEK. Fix this by using vfs_llseek(), which
always does the right thing.

Fixes: f441584858 ("cifsd: add file operations")
Cc: stable@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: Ronnie Sahlberg <lsahlber@redhat.com>
Cc: Hyunchul Lee <hyc.lee@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-25 19:52:49 -05:00
Linus Torvalds 29eeafc661 f2fs-fix-5.19
This includes some urgent fixes to avoid generating corrupted inodes
 caused by compressed and inline_data files. In addition, another patch
 tries to avoid wrong error report which prevents a roll-forward
 recovery.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmK2WREACgkQQBSofoJI
 UNJtuA//a9/7svQ32hK2/mGE9boK8V1tQEeOnTS79toMOh/AajAAlQyo7PmNuY3Z
 CkvT3wFJ7KzTgHZ7pHSAMdXX3grb+xs9vGqVdp6ICE4Le3p1QSdIaX7XCtTuhB3t
 p5u7yMuPorDFFKTJ9Ijq6/3xiS/qoKLCITAgzxMW8fdJzgJGU9qM2XMFw6r7fQnq
 sCQAJLGI0mZUkL0eDeb5iBTup9fSh3O5VEtXiOxqOI97tyUpeCt68PfTT3xW6viB
 u0QVaxTQYyM9/e61KpdgbhX7pfhz3mWsUgCvTZ9nH2siM9j0tWm3Q/vtMdnH1ETk
 bau2100B/hDywkulGrRYDmiYBbFQ/DZyPXxnE8kxe5AOejq47t1HDEmzd+fnac1x
 1eHSSw/ZKVEMlQX0bGDSRBJM7hZBfCdq4cj5GbswQ8vsYJ/1FYKWTi8T6s8fYTD3
 6QPkDxKDHemcbNbbFnHlBjxrb+L1QmVZK+WDqmTe9Nh2G1Er/nnhjM3T7D6iOJG9
 9egE+37r90Z/I3CFOKelMxJ1cpVq7/baunCSe1sN7y40WwLfUOfkATctl8TyuN/1
 gwLshYdTrvn6m5GKNkL/Nsu4o5HewIak+SJdP3HXahEk1ZMzVPWvz+xb5CnbziJk
 U0gc7rwhc8rpTjePTVYmOeaYwDJi6WTIjRQqhW6CxdkTYB2ttPA=
 =m3Fh
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fixes from Jaegeuk Kim:
 "Some urgent fixes to avoid generating corrupted inodes caused by
  compressed and inline_data files.

  In addition, avoid a wrong error report which prevents a roll-forward
  recovery"

* tag 'f2fs-for-5.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: do not count ENOENT for error case
  f2fs: fix iostat related lock protection
  f2fs: attach inline_data after setting compression
2022-06-25 09:19:51 -07:00
Paulo Alcantara af3a6d1018 cifs: update cifs_ses::ip_addr after failover
cifs_ses::ip_addr wasn't being updated in cifs_session_setup() when
reconnecting SMB sessions thus returning wrong value in
/proc/fs/cifs/DebugData.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Cc: stable@kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-24 13:34:28 -05:00
Linus Torvalds 598f240487 io_uring-5.19-2022-06-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmK19YQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiycD/0TUfhJbMosCPjfIQXD4pHx+hWhN24B4RA6
 /2EDMfmhtm8gHVNZzeh0Qyh9UHXbteZK3uhiAUE42MzMr4FJ1ow3Lt+28Ou9dkF5
 tMvwANRiipXFjIJJez2v30zZ2nozhaJlPsdrSq9YY48kS2F9nYGVm07rPQ0gMdoI
 Awjwb515xK+VMciSDpYo9BcBj1LqDr+yYAyPELt8UlSuvEhZ0TauYzyP7VCSgByI
 aA8BWe5Gh5LLbEg3JoGAE1eG/Xs1OJjPAL/fY9C8k9umCmH3dQvpsOwtek1v359D
 JuL/Q1M/iPdq8TRg+Dj+ynv92EDVULuwnSQdOypAQIXKCVdCvCak4QwK0rQ8vn+c
 AinbHMaKpDc28P07ISBpPsvvpinktBd3fVfNLtq6tn2epkqYXvPcZa6n9La4Jrh8
 zAt3YIzKt60LSbrOs8jervVF+YZpCU0xKt8WFbhwy5D8POIgRUX8Nu5sI5e8vFEL
 vdzhEzEJrL6HlOo2LOQbX4zMHG2IqPcUJQo5Yt2DXOIos5cJifPnhv8OMTQ1dZIG
 gS3N2DgH4AA0FP1awiB7C45sVltDDKb/DEgTUdde4UmP0I4Cy7LXjxrYn58kA1mi
 l+c/465D1US/fmfzc2sXxlKhMA932ICNeJldZwBJByTRdfV1gDCMWgY4B7XRlQMZ
 LuGKsxtUIw==
 =Z57a
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A few fixes that should go into the 5.19 release. All are fixing
  issues that either happened in this release, or going to stable.

  In detail:

   - A small series of fixlets for the poll handling, all destined for
     stable (Pavel)

   - Fix a merge error from myself that caused a potential -EINVAL for
     the recv/recvmsg flag setting (me)

   - Fix a kbuf recycling issue for partial IO (me)

   - Use the original request for the inflight tracking (me)

   - Fix an issue introduced this merge window with trace points using a
     custom decoder function, which won't work for perf (Dylan)"

* tag 'io_uring-5.19-2022-06-24' of git://git.kernel.dk/linux-block:
  io_uring: use original request task for inflight tracking
  io_uring: move io_uring_get_opcode out of TP_printk
  io_uring: fix double poll leak on repolling
  io_uring: fix wrong arm_poll error handling
  io_uring: fail links when poll fails
  io_uring: fix req->apoll_events
  io_uring: fix merge error in checking send/recv addr2 flags
  io_uring: mark reissue requests with REQ_F_PARTIAL_IO
2022-06-24 11:02:26 -07:00
Shyam Prasad N 8da33fd11c cifs: avoid deadlocks while updating iface
We use cifs_tcp_ses_lock to protect a lot of things.
Not only does it protect the lists of connections, sessions,
tree connects, open file lists, etc., we also use it to
protect some fields in each of it's entries.

In this case, cifs_mark_ses_for_reconnect takes the
cifs_tcp_ses_lock to traverse the lists, and then calls
cifs_update_iface. However, that can end up calling
cifs_put_tcp_session, which picks up the same lock again.

Avoid this by taking a ref for the session, drop the lock,
and then call update iface.

Also, in cifs_update_iface, avoid nested locking of iface_lock
and chan_lock, as much as possible. When unavoidable, we need
to pick iface_lock first.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-24 09:17:56 -05:00
Namjae Jeon b5e5f9dfc9 ksmbd: check invalid FileOffset and BeyondFinalZero in FSCTL_ZERO_DATA
FileOffset should not be greater than BeyondFinalZero in FSCTL_ZERO_DATA.
And don't call ksmbd_vfs_zero_data() if length is zero.

Cc: stable@vger.kernel.org
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-23 23:30:46 -05:00
Namjae Jeon 18e39fb960 ksmbd: set the range of bytes to zero without extending file size in FSCTL_ZERO_DATA
generic/091, 263 test failed since commit f66f8b94e7 ("cifs: when
extending a file with falloc we should make files not-sparse").
FSCTL_ZERO_DATA sets the range of bytes to zero without extending file
size. The VFS_FALLOCATE_FL_KEEP_SIZE flag should be used even on
non-sparse files.

Cc: stable@vger.kernel.org
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-23 23:30:46 -05:00
Hyunchul Lee 745bbc0995 ksmbd: remove duplicate flag set in smb2_write
The writethrough flag is set again if is_rdma_channel is false.

Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-23 23:30:46 -05:00
Dave Chinner 5e672cd69f xfs: introduce xfs_inodegc_push()
The current blocking mechanism for pushing the inodegc queue out to
disk can result in systems becoming unusable when there is a long
running inodegc operation. This is because the statfs()
implementation currently issues a blocking flush of the inodegc
queue and a significant number of common system utilities will call
statfs() to discover something about the underlying filesystem.

This can result in userspace operations getting stuck on inodegc
progress, and when trying to remove a heavily reflinked file on slow
storage with a full journal, this can result in delays measuring in
hours.

Avoid this problem by adding "push" function that expedites the
flushing of the inodegc queue, but doesn't wait for it to complete.

Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this
mechanism so they don't block but still ensure that queued
operations are expedited.

Fixes: ab23a77687 ("xfs: per-cpu deferred inode inactivation queues")
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: fix _getquota_next to use _inodegc_push too]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-23 13:34:38 -07:00
Dave Chinner 7cf2b0f961 xfs: bound maximum wait time for inodegc work
Currently inodegc work can sit queued on the per-cpu queue until
the workqueue is either flushed of the queue reaches a depth that
triggers work queuing (and later throttling). This means that we
could queue work that waits for a long time for some other event to
trigger flushing.

Hence instead of just queueing work at a specific depth, use a
delayed work that queues the work at a bound time. We can still
schedule the work immediately at a given depth, but we no long need
to worry about leaving a number of items on the list that won't get
processed until external events prevail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-06-23 13:34:38 -07:00
Linus Torvalds fa1796a835 Tracing fixes:
- Check for NULL in kretprobe_dispatcher()
   NULL can now be passed in, make sure it can handle it
 
 - Clean up unneeded #endif #ifdef of the same preprocessor check in the
   middle of the block.
 
 - Comment clean up
 
 - Remove unneeded initialization of the "ret" variable in
   __trace_uprobe_create()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYrMu9hQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpuZAP9gS8Xcd7nenV3i9j4lCFktWQrvQwvh
 wyNb9UuLqPVMUQEAkk4hzq38P2UvEOZ+v+WdJnXfOb3wpFhrxWFycz5ZVAw=
 =9WXA
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Check for NULL in kretprobe_dispatcher()

   NULL can now be passed in, make sure it can handle it

 - Clean up unneeded #endif #ifdef of the same preprocessor
   check in the middle of the block.

 - Comment clean up

 - Remove unneeded initialization of the "ret" variable in
   __trace_uprobe_create()

* tag 'trace-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/uprobes: Remove unwanted initialization in __trace_uprobe_create()
  tracefs: Fix syntax errors in comments
  tracing: Simplify conditional compilation code in tracing_set_tracer()
  tracing/kprobes: Check whether get_kretprobe() returns NULL in kretprobe_dispatcher()
2022-06-23 12:24:49 -05:00
Jens Axboe 386e4fb696 io_uring: use original request task for inflight tracking
In prior kernels, we did file assignment always at prep time. This meant
that req->task == current. But after deferring that assignment and then
pushing the inflight tracking back in, we've got the inflight tracking
using current when it should in fact now be using req->task.

Fixup that error introduced by adding the inflight tracking back after
file assignments got modifed.

Fixes: 9cae36a094 ("io_uring: reinstate the inflight tracking")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-23 11:06:43 -06:00
Shyam Prasad N 6e1c1c08cd cifs: periodically query network interfaces from server
Currently, we only query the server for network interfaces
information at the time of mount, and never afterwards.
This can be a problem, especially for services like Azure,
where the IP address of the channel endpoints can change
over time.

With this change, we schedule a 600s polling of this info
from the server for each tree connect.

An alternative for periodic polling was to do this only at
the time of reconnect. But this could delay the reconnect
time slightly. Also, there are some challenges w.r.t how
we have cifs_reconnect implemented today.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N b54034a73b cifs: during reconnect, update interface if necessary
Going forward, the plan is to periodically query the server
for it's interfaces (when multichannel is enabled).

This change allows checking for inactive interfaces during
reconnect, and reconnect to a new interface if necessary.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N aa45dadd34 cifs: change iface_list from array to sorted linked list
A server's published interface list can change over time, and needs
to be updated. We've storing iface_list as a simple array, which
makes it difficult to manipulate an existing list.

With this change, iface_list is modified into a linked list of
interfaces, which is kept sorted by speed.

Also added a reference counter for an iface entry, so that each
channel can maintain a backpointer to the iface and drop it
easily when needed.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:51:43 -05:00
Shyam Prasad N 9de74996a7 smb3: use netname when available on secondary channels
Some servers do not allow null netname contexts, which would cause
multichannel to revert to single channel when mounting to some
servers (e.g. Azure xSMB). The previous patch fixed that by avoiding
incorrectly sending the netname context when there would be a null
hostname sent in the netname context, while this patch fixes the null
hostname for the secondary channel by using the hostname of the
primary channel for the secondary channel.

Fixes: 4c14d7043f ("cifs: populate empty hostnames for extra channels")
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-22 19:46:53 -05:00
Linus Torvalds 3abc3ae553 9p-for-5.19-rc4: fid refcount and fscache fixes
This contains a couple of fixes:
  - fid refcounting was incorrect in some corner cases and would
 leak resources, only freed at umount time. The first three commits
 fix three such cases
  - cache=loose or fscache was broken when trying to write a partial
 page to a file with no read permission since the rework a few releases
 ago. The fix taken here is just to restore old behavior of using the
 special 'writeback_fid' for such reads, which is open as root/RDWR
 and such not get complains that we try to read on a WRONLY fid.
 Long-term it'd be nice to get rid of this and not issue the read at
 all (skip cache?) in such cases, but that direction hasn't progressed
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmKynEUACgkQq06b7GqY
 5nBOwQ//c1AoCuzt8gXefaBy9dDvaq/Cg5a339bUGmsvRJS8dHWTx2/HO7ncf3wE
 59uRh+ipLxXmHTkkLz13JtaVAFQ2HYlxKwmyvakBIjGVgDC+IYm9vkPHb2Z2yIBY
 D6XTuNufnb+/lrqekrmHiT2+eJOi2MhxPNyjXUAML7KKny6LpzdwymF/KIEsCbR8
 EbRrSf+KTnCssIfJlrZUbbk2UkbW18uG/V1MgThN3rgj+bgG/oB+lU6BELCIOQc2
 +0io2dg+ZgfJIK2fpBKF64vK2ILMSNEJ8obkfWgqOyI/LBOya38Z/cSbuzPMBwZd
 P2A2zQmjp8oYSbXM8EGaSFTXix28Lxljk5vvT/xbEipzyUU3UZAPJJE6UX9M66UF
 d/FHA8ljDVuRrknM0yDv5sqBYRB8uuEBtUiKGBO6k5zPTn0A7oEzEviryMCiEUF5
 1fbe/PWrFLnZMB2hWZ1aiY0tyopivp67zo6mRY/qehCihb/QlpiVNLGCC1e3eMdu
 FHPR3pSD1B5jFurOB8Wn1zUMjsZsnIjvpOET4WiP9pU9SJpOCd2fsAo69POHZVfA
 NIJxZ9MqW+3/eK+7CDmwnJLhTNRvvrQmTH55Ex61HTcn+2KFIqizCr/I6sQUl/g0
 teAB8T5UlS6+nDDWfZouUiXcm0He2C56RyJOCYlagHD1qYm//Gg=
 =2yZw
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.19-rc4' of https://github.com/martinetd/linux

Pull 9pfs fixes from Dominique Martinet:
 "A couple of fid refcount and fscache fixes:

   - fid refcounting was incorrect in some corner cases and would leak
     resources, only freed at umount time. The first three commits fix
     three such cases

   - 'cache=loose' or fscache was broken when trying to write a partial
     page to a file with no read permission since the rework a few
     releases ago.

     The fix taken here is just to restore old behavior of using the
     special 'writeback_fid' for such reads, which is open as root/RDWR
     and such not get complains that we try to read on a WRONLY fid.

     Long-term it'd be nice to get rid of this and not issue the read at
     all (skip cache?) in such cases, but that direction hasn't
     progressed"

* tag '9p-for-5.19-rc4' of https://github.com/martinetd/linux:
  9p: fix EBADF errors in cached mode
  9p: Fix refcounting during full path walks for fid lookups
  9p: fix fid refcount leak in v9fs_vfs_get_link
  9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl
2022-06-22 08:09:49 -05:00
Pavel Begunkov c0737fa9a5 io_uring: fix double poll leak on repolling
We have re-polling for partial IO, so a request can be polled twice. If
it used two poll entries the first time then on the second
io_arm_poll_handler() it will find the old apoll entry and NULL
kmalloc()'ed second entry, i.e. apoll->double_poll, so leaking it.

Fixes: 10c873334f ("io_uring: allow re-poll if we made progress")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fee2452494222ecc7f1f88c8fb659baef971414a.1655852245.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 17:24:37 -06:00
Pavel Begunkov 9d2ad2947a io_uring: fix wrong arm_poll error handling
Leaving ip.error set when a request was punted to task_work execution is
problematic, don't forget to clear it.

Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a6c84ef4182c6962380aebe11b35bdcb25b0ccfb.1655852245.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 17:24:37 -06:00
Pavel Begunkov c487a5ad48 io_uring: fail links when poll fails
Don't forget to cancel all linked requests of poll request when
__io_arm_poll_handler() failed.

Fixes: aa43477b04 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a78aad962460f9fdfe4aa4c0b62425c88f9415bc.1655852245.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 17:24:37 -06:00
Linus Torvalds ff872b76b3 for-5.19-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmKxvkkACgkQxWXV+ddt
 WDsQYhAAofZGaOdBwSDvGA4srB2ieDIFoMeNb1NYp2P5vafPo3Q5AAvgGAeKhp5x
 g2C7W/8q2GMJ+B9SjyiBkVufuQmCWbFKxStQM3QysYoj/EyKyp7SXtO4YMWHz2T3
 nfMMlPo2aNpr7Z2s+tcjhthq/hIvVFi6kweRFNvacM2bb/17IxgAdqLpQBqK5xe9
 /IGSUTw75jSd2sZSyzBqrqshKDonmJ7u4qCV2X5hTPi8w4AUDERJrm0bOnikNXHx
 4LnNDmSIA0BEXybHwEAShoK0ge66z1kP1UspQNB7pKriJcyroNPjgm/fMZJiRKIc
 zEYEMSzTYQa5eDwhXCz5PCaPqY4y/ovfYCsmySVXt1a7wgplVl+vsOaesE2NFVCX
 FE36d58L+4I8iTJhpVCNmEU9N/spfvAr3mBAcKCkbp9WKyGJ9/2yJpRThkV8Pw2Y
 bzhFIYRs1CJvkK7P4Cp+FSfzJx6tvYAqblvE97VUt83PuqS1Fb49lKdr5DZnbplV
 vDkewmvXSmHH9Ic5xBeTJXJZ+yeibk/0LSNEKczWva6f60h0ubF0OI6BzmS+NZbN
 HyitKerX0ZyFi5VUOZ+PKzXfR3ZlX3SmjAcHrl9BjZjFOJkpxAx6yWBzdnkitb+O
 fYyT68H4IetxwkghPVBv8qFCkuNy/i9NsEILcAAXd8CHGQlfwDA=
 =eORM
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - print more error messages for invalid mount option values

 - prevent remount with v1 space cache for subpage filesystem

 - fix hang during unmount when block group reclaim task is running

* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: add error messages to all unrecognized mount options
  btrfs: prevent remounting to v1 space cache for subpage mount
  btrfs: fix hang during unmount when block group reclaim task is running
2022-06-21 12:06:04 -05:00
David Howells cb78d1b5ef afs: Fix dynamic root getattr
The recent patch to make afs_getattr consult the server didn't account
for the pseudo-inodes employed by the dynamic root-type afs superblock
not having a volume or a server to access, and thus an oops occurs if
such a directory is stat'd.

Fix this by checking to see if the vnode->volume pointer actually points
anywhere before following it in afs_getattr().

This can be tested by stat'ing a directory in /afs.  It may be
sufficient just to do "ls /afs" and the oops looks something like:

        BUG: kernel NULL pointer dereference, address: 0000000000000020
        ...
        RIP: 0010:afs_getattr+0x8b/0x14b
        ...
        Call Trace:
         <TASK>
         vfs_statx+0x79/0xf5
         vfs_fstatat+0x49/0x62

Fixes: 2aeb8c86d4 ("afs: Fix afs_getattr() to refetch file status if callback break occurred")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/165408450783.1031787.7941404776393751186.stgit@warthog.procyon.org.uk/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-21 11:47:30 -05:00
Jaegeuk Kim 82c7863ed9 f2fs: do not count ENOENT for error case
Otherwise, we can get a wrong cp_error mark.

Cc: <stable@vger.kernel.org>
Fixes: a7b8618aa2 ("f2fs: avoid infinite loop to flush node pages")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-21 08:29:56 -07:00
Pavel Begunkov aacf2f9f38 io_uring: fix req->apoll_events
apoll_events should be set once in the beginning of poll arming just as
poll->events and not change after. However, currently io_uring resets it
on each __io_poll_execute() for no clear reason. There is also a place
in __io_arm_poll_handler() where we add EPOLLONESHOT to downgrade a
multishot, but forget to do the same thing with ->apoll_events, which is
buggy.

Fixes: 81459350d5 ("io_uring: cache req->apoll->events in req->cflags")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/0aef40399ba75b1a4d2c2e85e6e8fd93c02fc6e4.1655814213.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 07:49:05 -06:00
Jens Axboe b60cac14bb io_uring: fix merge error in checking send/recv addr2 flags
With the dropping of the IOPOLL checking in the per-opcode handlers,
we inadvertently left two checks in the recv/recvmsg and send/sendmsg
prep handlers for the same thing, and one of them includes addr2 which
holds the flags for these opcodes.

Fix it up and kill the redundant checks.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-21 07:47:13 -06:00
Josef Bacik bf7ba8ee75 btrfs: fix deadlock with fsync+fiemap+transaction commit
We are hitting the following deadlock in production occasionally

Task 1		Task 2		Task 3		Task 4		Task 5
		fsync(A)
		 start trans
						start commit
				falloc(A)
				 lock 5m-10m
				 start trans
				  wait for commit
fiemap(A)
 lock 0-10m
  wait for 5m-10m
   (have 0-5m locked)

		 have btrfs_need_log_full_commit
		  !full_sync
		  wait_ordered_extents
								finish_ordered_io(A)
								lock 0-5m
								DEADLOCK

We have an existing dependency of file extent lock -> transaction.
However in fsync if we tried to do the fast logging, but then had to
fall back to committing the transaction, we will be forced to call
btrfs_wait_ordered_range() to make sure all of our extents are updated.

This creates a dependency of transaction -> file extent lock, because
btrfs_finish_ordered_io() will need to take the file extent lock in
order to run the ordered extents.

Fix this by stopping the transaction if we have to do the full commit
and we attempted to do the fast logging.  Then attach to the transaction
and commit it if we need to.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:47:08 +02:00
Zygo Blaxell 97e86631bc btrfs: don't set lock_owner when locking extent buffer for reading
In 196d59ab9c "btrfs: switch extent buffer tree lock to rw_semaphore"
the functions for tree read locking were rewritten, and in the process
the read lock functions started setting eb->lock_owner = current->pid.
Previously lock_owner was only set in tree write lock functions.

Read locks are shared, so they don't have exclusive ownership of the
underlying object, so setting lock_owner to any single value for a
read lock makes no sense.  It's mostly harmless because write locks
and read locks are mutually exclusive, and none of the existing code
in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what
nonsense is written in lock_owner when no writer is holding the lock.

KCSAN does care, and will complain about the data race incessantly.
Remove the assignments in the read lock functions because they're
useless noise.

Fixes: 196d59ab9c ("btrfs: switch extent buffer tree lock to rw_semaphore")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:46:56 +02:00
Naohiro Aota 19ab78ca86 btrfs: zoned: fix critical section of relocation inode writeback
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.

Fix it by extending the critical section.

Fixes: 35156d8527 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:46:30 +02:00
Naohiro Aota 343d8a3085 btrfs: zoned: prevent allocation from previous data relocation BG
After commit 5f0addf7b8 ("btrfs: zoned: use dedicated lock for data
relocation"), we observe IO errors on e.g, btrfs/232 like below.

  [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
  <snip>
  [09.9][T4038707] Call Trace:
  [09.5][T4038707]  <TASK>
  [09.3][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
  [09.6][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
  [09.2][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
  [09.3][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
  [09.5][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
  [09.2][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
  [09.4][T4038707]  ? lock_downgrade+0x6a0/0x6a0
  [09.2][T4038707]  ? orc_find.part.0+0x1ed/0x300
  [09.5][T4038707]  ? __module_address.part.0+0x25/0x300
  [09.0][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
  <snip>
  [09.4][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
  [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
  [09.9][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
  [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
  [09.4][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
  [09.9][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0

The IO errors occur when we allocate a regular extent in previous data
relocation block group.

On zoned btrfs, we use a dedicated block group to relocate a data
extent. Thus, we allocate relocating data extents (pre-alloc) only from
the dedicated block group and vice versa. Once the free space in the
dedicated block group gets tight, a relocating extent may not fit into
the block group. In that case, we need to switch the dedicated block
group to the next one. Then, the previous one is now freed up for
allocating a regular extent. The BG is already not enough to allocate
the relocating extent, but there is still room to allocate a smaller
extent. Now the problem happens. By allocating a regular extent while
nocow IOs for the relocation is still on-going, we will issue WRITE IOs
(for relocation) and ZONE APPEND IOs (for the regular writes) at the
same time. That mixed IOs confuses the write pointer and arises the
unaligned write errors.

This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
btrfs_block_group. We set this bit before releasing the dedicated block
group, and no extent are allocated from a block group having this bit
set. This bit is similar to setting block_group->ro, but is different from
it by allowing nocow writes to start.

Once all the nocow IO for relocation is done (hooked from
btrfs_finish_ordered_io), we reset the bit to release the block group for
further allocation.

Fixes: c2707a2556 ("btrfs: zoned: add a dedicated data relocation block group")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:43:48 +02:00
Filipe Manana 650c9caba3 btrfs: do not BUG_ON() on failure to migrate space when replacing extents
At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
space from the transaction block reserve into the local block reserve,
we trigger a BUG_ON(). This is because it should not be possible to have
a failure here, as we reserved more space when we started the transaction
than the space we want to migrate. However having a BUG_ON() is way too
drastic, we can perfectly handle the failure and return the error to the
caller. So just do that instead, and add a WARN_ON() to make it easier
to notice the failure if it ever happens (which is particularly useful
for fstests, and the warning will trigger a failure of a test case).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:43:27 +02:00
Filipe Manana 983d8209c6 btrfs: add missing inode updates on each iteration when replacing extents
When replacing file extents, called during fallocate, hole punching,
clone and deduplication, we may not be able to replace/drop all the
target file extent items with a single transaction handle. We may get
-ENOSPC while doing it, in which case we release the transaction handle,
balance the dirty pages of the btree inode, flush delayed items and get
a new transaction handle to operate on what's left of the target range.

By dropping and replacing file extent items we have effectively modified
the inode, so we should bump its iversion and update its mtime/ctime
before we update the inode item. This is because if the transaction
we used for partially modifying the inode gets committed by someone after
we release it and before we finish the rest of the range, a power failure
happens, then after mounting the filesystem our inode has an outdated
iversion and mtime/ctime, corresponding to the values it had before we
changed it.

So add the missing iversion and mtime/ctime updates.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:43:21 +02:00
Filipe Manana d4597898ba btrfs: fix race between reflinking and ordered extent completion
While doing a reflink operation, if an ordered extent for a file range
that does not overlap with the source and destination ranges of the
reflink operation happens, we can end up having a failure in the reflink
operation and return -EINVAL to user space.

The following sequence of steps explains how this can happen:

1) We have the page at file offset 315392 dirty (under delalloc);

2) A reflink operation for this file starts, using the same file as both
   source and destination, the source range is [372736, 409600) (length of
   36864 bytes) and the destination range is [208896, 245760);

3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
   and destination ranges, and wait for any ordered extents in those range
   to complete;

4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
   the inode, but we neither wait for it to complete nor any ordered
   extents to complete. This results in starting delalloc for the page at
   file offset 315392 and creating an ordered extent for that single page
   range;

5) We then move to btrfs_clone() and enter the loop to find file extent
   items to copy from the source range to destination range;

6) In the first iteration we end up at last file extent item stored in
   leaf A:

   (...)
   item 131 key (143616 108 315392) itemoff 5101 itemsize 53
            extent data disk bytenr 1903988736 nr 73728
            extent data offset 12288 nr 61440 ram 73728

   This represents the file range [315392, 376832), which overlaps with
   the source range to clone.

   @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
   is therefore set to 376832 (315392 + 61440).

   @off (372736) is > key.offset (315392), so @new_key.offset is set to
   the value of @destoff (208896).

   @new_key.offset == @last_dest_end (208896) so @drop_start is set to
   208896 (@new_key.offset).

   @datal is adjusted to 4096, as @off is > @key.offset.

   So in this iteration we call btrfs_replace_file_extents() for the range
   [208896, 212991] (a single page, which is
   [@drop_start, @new_key.offset + @datal - 1]).

   @last_dest_end is set to 212992 (@new_key.offset + @datal =
   208896 + 4096 = 212992).

   Before the next iteration of the loop, @key.offset is set to the value
   376832, which is @next_key_min_offset;

7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
   but this time pointing beyond the last slot of leaf A, as that's where
   a key with offset 376832 should be at if it existed. So end up calling
   btrfs_next_leaf();

8) btrfs_next_leaf() releases the path, but before it searches again the
   tree for the next key/leaf, the ordered extent for the single page
   range at file offset 315392 completes. That results in trimming the
   file extent item we processed before, adjusting its key offset from
   315392 to 319488, reducing its length from 61440 to 57344 and inserting
   a new file extent item for that single page range, with a key offset of
   315392 and a length of 4096.

   Leaf A now looks like:

     (...)
     item 132 key (143616 108 315392) itemoff 4995 itemsize 53
              extent data disk bytenr 1801666560 nr 4096
              extent data offset 0 nr 4096 ram 4096
     item 133 key (143616 108 319488) itemoff 4942 itemsize 53
              extent data disk bytenr 1903988736 nr 73728
              extent data offset 16384 nr 57344 ram 73728

9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
   at slot 133, since it's the first key that follows what was the last
   key we saw (143616 108 315392). In fact it's the same item we processed
   before, but its key offset was changed, so it counts as a new key;

10) So now we have:

    @key.offset == 319488
    @datal == 57344

    @off (372736) is > key.offset (319488), so @new_key.offset is set to
    208896 (@destoff value).

    @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
    is set to 212992 (@last_dest_end value).

    @datal is adjusted to 4096 because @off > @key.offset.

    So in this iteration we call btrfs_replace_file_extents() for the
    invalid range of [212992, 212991] (which is
    [@drop_start, @new_key.offset + @datal - 1]).

    This range is empty, the end offset is smaller than the start offset
    so btrfs_replace_file_extents() returns -EINVAL, which we end up
    returning to user space and fail the reflink operation.

    This all happens because the range of this file extent item was
    already processed in the previous iteration.

This scenario can be triggered very sporadically by fsx from fstests, for
example with test case generic/522.

So fix this by having btrfs_clone() skip file extent items that cover a
file range that we have already processed.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-21 14:43:13 +02:00
Steve French 73130a7b1a smb3: fix empty netname context on secondary channels
Some servers do not allow null netname contexts, which would cause
multichannel to revert to single channel when mounting to some
servers (e.g. Azure xSMB).

Fixes: 4c14d7043f ("cifs: populate empty hostnames for extra channels")
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-20 16:23:50 -05:00
Jens Axboe 1bacd264d3 io_uring: mark reissue requests with REQ_F_PARTIAL_IO
If we mark for reissue, we assume that the buffer will remain stable.
Hence if are using a provided buffer, we need to ensure that we stick
with it for the duration of that request.

This only affects block devices that use provided buffers, as those are
the only ones that get marked with REQ_F_REISSUE.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-20 06:39:27 -06:00
Daeho Jeong 61803e9843 f2fs: fix iostat related lock protection
Made iostat related locks safe to be called from irq context again.

Cc: <stable@vger.kernel.org>
Fixes: a1e09b03e6 ("f2fs: use iomap for direct I/O")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Stanley Chu <stanley.chu@mediatek.com>
Tested-by: Eddie Huang <eddie.huang@mediatek.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19 15:16:12 -07:00
Jaegeuk Kim 4cde00d507 f2fs: attach inline_data after setting compression
This fixes the below corruption.

[345393.335389] F2FS-fs (vdb): sanity_check_inode: inode (ino=6d0, mode=33206) should not have inline_data, run fsck to fix

Cc: <stable@vger.kernel.org>
Fixes: 677a82b44e ("f2fs: fix to do sanity check for inline inode")
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2022-06-19 15:16:10 -07:00
Linus Torvalds 063232b6c4 Fixes for 5.19-rc3:
- Fix a bug where inode flag changes would accidentally drop nrext64.
  - Fix a race condition when toggling LARP mode.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmKqyp4ACgkQ+H93GTRK
 tOtnURAAmJUASVXnixuuqRp8srbotuWc9EGJY+0/UFAfnfSlgasVeS1XB5bZ1CZP
 QhRYgDfPnuDvXwNrz3LHFL1ihll1whbJeXP2tYnCTolB8yFutk/xDLmwvXuRVR0y
 yzbbl6MtnHZ7SThhsXgUoJ3b0ItVxq8xN/0h1VVr0OI2zUryOR+Kd1c/G3VIPPZ6
 ZXyigcdQFAqB1oB/f2D6yHIqtIZopS+kwtcMTBz0qr82Tvp4Vzh9OMCU6BwdtidG
 o/UIBSrliW8qgrXom5Asy5mmLCa3wou7JfQc176ADbG09XjxoL0djHF5ZcbpQT7i
 A3WRQwwsNPfTGmyukngk2rH9JoeVSzvhyXD2ArrLJB/Ra097reXpsH0ABm63ova3
 YV8sX8BCoTjNzoN+abHq9jXxfcLaesJyZKfm6wU1bJ/0nkSYnGqwI9tWii18lRUQ
 GuVEShDMJAIUYWo2ysmm1fRhNM7I9+kE8ZprNBuUnK3ej9efZQPV20uOzqDI7H0Z
 6IW1JKHZr4WHAHeymkl8AHKt6U6+tCBjSUT/CGlfph+NNvytd2XvvEAIW5oFMEvA
 fMvYSnuk40tb6LpBGQcXxRjl14BvgBgc2omkVZuJf1X3rkg7i6U9zJv9rp87CBhl
 PnEnLvDa86KHxmq2Jxs1rh0LYu2OzCNGsoxICf8w4mloZmEFIqA=
 =vvDX
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "There's not a whole lot this time around (I'm still on vacation) but
  here are some important fixes for new features merged in -rc1:

   - Fix a bug where inode flag changes would accidentally drop nrext64

   - Fix a race condition when toggling LARP mode"

* tag 'xfs-5.19-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes
  xfs: fix variable state usage
  xfs: fix TOCTOU race involving the new logged xattrs control knob
2022-06-19 09:24:49 -05:00
Linus Torvalds 354c6e071b Fix a variety of bugs, many of which were found by folks using fuzzing
or error injection.  Also fix up how test_dummy_encryption mount
 option is handled for the new mount API.  Finally, fix/cleanup a
 number of comments and ext4 Documentation files.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmKuYpcACgkQ8vlZVpUN
 gaMXwwf8DSHJ3gI2Lo0wrzJm7KSS0C+HK29/rtLCZdxECQsZR156ZzSF3zAFKOwK
 Yx3RJwiFxrciUUytY/MWTyalCk+M8oW1093SfRqNNZCbZNi33acnbTqioa7INnDw
 snFGGEU1y0M0AUduxNWPr71P80sTyQa0ZplIc4YeR98zzMvoWgi1dvo4wNdtJNQb
 Gb0FtBhgP+IeK50eBlK4O0Eg5kqd0V5OeTLUYUfsWqU28ap8dHYE48I6sIdHx6az
 sa6b2+YRuBxJUV61FNujuVtkDgUHXtXM97kkGpywRSLjo4iFxlQvX9Ew4lBD9RDI
 b0YHVzK/DU9M3VfiYgzGwShCb/M68w==
 =NtNY
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a variety of bugs, many of which were found by folks using fuzzing
  or error injection.

  Also fix up how test_dummy_encryption mount option is handled for the
  new mount API.

  Finally, fix/cleanup a number of comments and ext4 Documentation
  files"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix a doubled word "need" in a comment
  ext4: add reserved GDT blocks check
  ext4: make variable "count" signed
  ext4: correct the judgment of BUG in ext4_mb_normalize_request
  ext4: fix bug_on ext4_mb_use_inode_pa
  ext4: fix up test_dummy_encryption handling for new mount API
  ext4: use kmemdup() to replace kmalloc + memcpy
  ext4: fix super block checksum incorrect after mount
  ext4: improve write performance with disabled delalloc
  ext4: fix warning when submitting superblock in ext4_commit_super()
  ext4, doc: remove unnecessary escaping
  ext4: fix incorrect comment in ext4_bio_write_page()
  fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment
2022-06-18 21:51:12 -05:00
Linus Torvalds ace2045ed5 2 smb3 debugging improvements
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmKuOPcACgkQiiy9cAdy
 T1FNoAv/VZwWl1J5iFVAbZhLAt/LhkL/1Ee8TeMRxa7QExifBJ4latsi1duOXBnR
 bRQ+lFuDmg1cuma4aayH7bHGnZZMEoZku0bpj/h8MOTf/w+GLIUUH/0LSEOi1klz
 nmj3fbJ4TMF/rA0Elsz4/iJIZhka3QbTAS3y7l9SlsLlgktoKJuZpEEuRgFsYNEW
 zdQMbb7q53L2txDDZAnR5TqesDgzeXePnvVRZDPAar8HnYrOg4sC6ueqxJtUKKBP
 TcC/2956tXHqd+5EYyH2Vuspf38dGxYs5qIhsMokRoMx42dAQ824JeuFy+D7eps6
 /hwDp+U1XIdllQW7qVD8MZ5CzIZlFKTZGu/B4Uh7GAtzluIAFyayGhVcDdj7LFVV
 fEaR8N9og9DEAmqUhsKLBZM656lhpu38cOslpGqNw0gCSZNxLyyp1hNkXrVlYv9L
 SwclZjoQbOBMPGriyv0h6rSaNoR+J7hps8cpW/eVnXMC5VNnsrXM+EYPJbu8EWYL
 SLJKZp6g
 =oBRl
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Two cifs debugging improvements - one found to deal with debugging a
  multichannel problem and one for a recent fallocate issue

  This does include the two larger multichannel reconnect (dynamically
  adjusting interfaces on reconnect) patches, because we recently found
  an additional problem with multichannel to one server type that I want
  to include at the same time"

* tag '5.19-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: when a channel is not found for server, log its connection id
  smb3: add trace point for SMB2_set_eof
2022-06-18 21:44:44 -05:00
Xiang wangx 1f3ddff375 ext4: fix a doubled word "need" in a comment
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Link: https://lore.kernel.org/r/20220605091503.12513-1-wangxiang@cdjrlc.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:36:20 -04:00
Zhang Yi b55c3cd102 ext4: add reserved GDT blocks check
We capture a NULL pointer issue when resizing a corrupt ext4 image which
is freshly clear resize_inode feature (not run e2fsck). It could be
simply reproduced by following steps. The problem is because of the
resize_inode feature was cleared, and it will convert the filesystem to
meta_bg mode in ext4_resize_fs(), but the es->s_reserved_gdt_blocks was
not reduced to zero, so could we mistakenly call reserve_backup_gdb()
and passing an uninitialized resize_inode to it when adding new group
descriptors.

 mkfs.ext4 /dev/sda 3G
 tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck
 mount /dev/sda /mnt
 resize2fs /dev/sda 8G

 ========
 BUG: kernel NULL pointer dereference, address: 0000000000000028
 CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748
 ...
 RIP: 0010:ext4_flex_group_add+0xe08/0x2570
 ...
 Call Trace:
  <TASK>
  ext4_resize_fs+0xbec/0x1660
  __ext4_ioctl+0x1749/0x24e0
  ext4_ioctl+0x12/0x20
  __x64_sys_ioctl+0xa6/0x110
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f2dd739617b
 ========

The fix is simple, add a check in ext4_resize_begin() to make sure that
the es->s_reserved_gdt_blocks is zero when the resize_inode feature is
disabled.

Cc: stable@kernel.org
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220601092717.763694-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:36:08 -04:00
Ding Xiang bc75a6eb85 ext4: make variable "count" signed
Since dx_make_map() may return -EFSCORRUPTED now, so change "count" to
be a signed integer so we can correctly check for an error code returned
by dx_make_map().

Fixes: 46c116b920 ("ext4: verify dir block before splitting it")
Cc: stable@kernel.org
Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com>
Link: https://lore.kernel.org/r/20220530100047.537598-1-dingxiang@cmss.chinamobile.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:57 -04:00
Baokun Li cf4ff938b4 ext4: correct the judgment of BUG in ext4_mb_normalize_request
ext4_mb_normalize_request() can move logical start of allocated blocks
to reduce fragmentation and better utilize preallocation. However logical
block requested as a start of allocation (ac->ac_o_ex.fe_logical) should
always be covered by allocated blocks so we should check that by
modifying and to or in the assertion.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:57 -04:00
Baokun Li a08f789d2a ext4: fix bug_on ext4_mb_use_inode_pa
Hulk Robot reported a BUG_ON:
==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
 ext4_mb_new_blocks+0x9df/0x5d30
 ext4_ext_map_blocks+0x1803/0x4d80
 ext4_map_blocks+0x3a4/0x1a10
 ext4_writepages+0x126d/0x2c30
 do_writepages+0x7f/0x1b0
 __filemap_fdatawrite_range+0x285/0x3b0
 file_write_and_wait_range+0xb1/0x140
 ext4_sync_file+0x1aa/0xca0
 vfs_fsync_range+0xfb/0x260
 do_fsync+0x48/0xa0
[...]
==================================================================

Above issue may happen as follows:
-------------------------------------
do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
           ext4_mb_mark_diskspace_used
            >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

we can easily reproduce this problem with the following commands:
	`fallocate -l100M disk`
	`mkfs.ext4 -b 1024 -g 256 disk`
	`mount disk /mnt`
	`fsstress -d /mnt -l 0 -n 1000 -p 1`

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP.
Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur
when the size is truncated. So start should be the start position of
the group where ac_o_ex.fe_logical is located after alignment.
In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP
is very large, the value calculated by start_off is more accurate.

Cc: stable@kernel.org
Fixes: cd648b8a8f ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220528110017.354175-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Eric Biggers 85456054e1 ext4: fix up test_dummy_encryption handling for new mount API
Since ext4 was converted to the new mount API, the test_dummy_encryption
mount option isn't being handled entirely correctly, because the needed
fscrypt_set_test_dummy_encryption() helper function combines
parsing/checking/applying into one function.  That doesn't work well
with the new mount API, which split these into separate steps.

This was sort of okay anyway, due to the parsing logic that was copied
from fscrypt_set_test_dummy_encryption() into ext4_parse_param(),
combined with an additional check in ext4_check_test_dummy_encryption().
However, these overlooked the case of changing the value of
test_dummy_encryption on remount, which isn't allowed but ext4 wasn't
detecting until ext4_apply_options() when it's too late to fail.
Another bug is that if test_dummy_encryption was specified multiple
times with an argument, memory was leaked.

Fix this up properly by using the new helper functions that allow
splitting up the parse/check/apply steps for test_dummy_encryption.

Fixes: cebe85d570 ("ext4: switch to the new mount api")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220526040412.173025-1-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Shuqi Zhang 4efd9f0d12 ext4: use kmemdup() to replace kmalloc + memcpy
Replace kmalloc + memcpy with kmemdup()

Signed-off-by: Shuqi Zhang <zhangshuqi3@huawei.com>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525030120.803330-1-zhangshuqi3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:43 -04:00
Ye Bin 9b6641dd95 ext4: fix super block checksum incorrect after mount
We got issue as follows:
[home]# mount  /dev/sda  test
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
[home]# dmesg
EXT4-fs (sda): warning: mounting fs with errors, running e2fsck is recommended
EXT4-fs (sda): Errors on filesystem, clearing orphan list.
EXT4-fs (sda): recovery complete
EXT4-fs (sda): mounted filesystem with ordered data mode. Quota mode: none.
[home]# debugfs /dev/sda
debugfs 1.46.5 (30-Dec-2021)
Checksum errors in superblock!  Retrying...

Reason is ext4_orphan_cleanup will reset ‘s_last_orphan’ but not update
super block checksum.

To solve above issue, defer update super block checksum after
ext4_orphan_cleanup.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Cc: stable@kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220525012904.1604737-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-18 19:35:24 -04:00
Shyam Prasad N 5d24968f5b cifs: when a channel is not found for server, log its connection id
cifs_ses_get_chan_index gets the index for a given server pointer.
When a match is not found, we warn about a possible bug.
However, printing details about the non-matching server could be
more useful to debug here.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-18 14:55:06 -05:00
Xiang wangx 93a8c044b9 tracefs: Fix syntax errors in comments
Delete the redundant word 'to'.

Link: https://lkml.kernel.org/r/20220605092729.13010-1-wangxiang@cdjrlc.com

Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-06-17 19:01:28 -04:00
Linus Torvalds 4b35035bcf NFS Client Fixes for Linux 5.19-rc
- Bugfixes:
   - Add FMODE_CAN_ODIRECT support to NFSv4 so opens don't fail
   - Fix trunking detection & cl_max_connect setting
   - Avoid pnfs_update_layout() livelocks
   - Don't keep retrying pNFS if the server replies with NFS4ERR_UNAVAILABLE
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmKs2ZwACgkQ18tUv7Cl
 QOsnGxAA6g0M7IU6g375rfqxq9a/XqSlvwIeuSOb14WNybh7D6qXOa+iInsHFIl7
 d7coORg846SOdUl218hVcp8Ba1OTsj/XAruJllsrHQnB50raBJ/nUbIbQlrxGYKn
 WzMtVnyLQ8Ml+rvERINPqVcgUBJ0PGWMiRi/h8OcUlWylV5ZI/irpkaZiuXamzhe
 O0Wa04N/82bUU03dEmQ0ZuPuhMn5JbOMaSzciRvHEV8nLvqvRAhGIVBtvrrYF0VB
 UfZx/4DTRXDD5/RA65hX2vgixZh7/cLTv4pr26wmfDBofo1zDiFsQpPS8QaZo5bt
 Sw+UQK1c15kW/EJS+au90mazBmFnk5UX2BOyfN+Cg5/GlHjE0YHV1f2ejbHycsyh
 Rcsu8nxNa5T82mg2EjOCqK2YWy8mGHYr5MJTYftL/uE8NdqP/DSgNpqNSQhQD2Bm
 vzsG0wP5RP+i3pRWWQnXOlZE+GdaKxtXKtg2ZjHx1Wkb2QIUbgbxST59q/U5QnIN
 MJKS1nhbxQA0dGo3ClzYNe76S9It7DDE0A3mdvIzPwRSheQhgmF4UlzyTWgSdyfw
 lnT5EK3pQat6cvdaszjMn1f6vx4BvTuhXUE9eH35opVMYykvAU6hz4ypllxKoR/h
 BES6KMJPKXn77ICJJVC3RR5w2v756DRNMeOfkjCi18TiuTVFXek=
 =nTJk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Add FMODE_CAN_ODIRECT support to NFSv4 so opens don't fail

 - Fix trunking detection & cl_max_connect setting

 - Avoid pnfs_update_layout() livelocks

 - Don't keep retrying pNFS if the server replies with NFS4ERR_UNAVAILABLE

* tag 'nfs-for-5.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file
  sunrpc: set cl_max_connect when cloning an rpc_clnt
  pNFS: Avoid a live lock condition in pnfs_update_layout()
  pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
2022-06-17 15:17:57 -05:00
Linus Torvalds f8e174c307 io_uring-5.19-2022-06-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKsc6oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgGxD/9YB9O3Dw2WOlzE+bnbadDEL0/XdaMVSZQX
 t5pfz1YTBUf/KgF+HWo6cGgvNupNjo6a2FAGJiGIXaEx2lZbKw7gUEPXohqY18h6
 alLPzt881whWESXTjpsDtc57PfCVY/K5/5ebqN5AhoXCgtl6CvlePJZH8uzBMq5F
 vGjcBgdofum677uNPSEpn6AzIGVtd9jI6Rg8r/a4iRdzeAJlkp1ifVh424qGgWtQ
 fQuoV83EPut/RTUodXZwJ/2XrdJwNDex98LEmp1Pi78IprGawrQ5F9JzsypQR2ie
 8ajLe6xn4wiXuWFr3pE9paow3c1APuftJ/PRXqBHoh2X6sMI4G2B2UNDkKrlK6DD
 9r5INcKzpMY390nN6GnSD1BSWBGNuglu9mASXDKFXL/JK+XNi6nYlaXdPn4uAhyR
 Cp41xx3gGf3r8aq8Pv+YNRej3kpNSi8oHKhYPToxn+EwPX8TpTdexnQC4ZKWNMbZ
 Mg1hY5Z0NxuhEyvKlTXZmOF8dlf2dTZYJoqHHeYhvcoZT9dWwjrINXqJvqsCyywB
 2fPOPjdn1SuBwsugSkYkMlsbLm4rlyLCLnEL2SgcbzyQ2rubN5UFcp3ouJOEt5Nz
 HDZi4s7LBOZTGmnmtev5GOA7kDCQ2EqOcRZQOdWPSa5g5pOL11ahxRW0KESSsPik
 1pTBDjTfxg==
 =55JE
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.19-2022-06-16' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Bigger than usual at this time, both because we missed -rc2, but also
  because of some reverts that we chose to do. In detail:

   - Adjust mapped buffer API while we still can (Dylan)

   - Mapped buffer fixes (Dylan, Hao, Pavel, me)

   - Fix for uring_cmd wrong API usage for task_work (Dylan)

   - Fix for bug introduced in fixed file closing (Hao)

   - Fix race in buffer/file resource handling (Pavel)

   - Revert the NOP support for CQE32 and buffer selection that was
     brought up during the merge window (Pavel)

   - Remove IORING_CLOSE_FD_AND_FILE_SLOT introduced in this merge
     window. The API needs further refining, so just yank it for now and
     we'll revisit for a later kernel.

   - Series cleaning up the CQE32 support added in this merge window,
     making it more integrated rather than sitting on the side (Pavel)"

* tag 'io_uring-5.19-2022-06-16' of git://git.kernel.dk/linux-block: (21 commits)
  io_uring: recycle provided buffer if we punt to io-wq
  io_uring: do not use prio task_work_add in uring_cmd
  io_uring: commit non-pollable provided mapped buffers upfront
  io_uring: make io_fill_cqe_aux honour CQE32
  io_uring: remove __io_fill_cqe() helper
  io_uring: fix ->extra{1,2} misuse
  io_uring: fill extra big cqe fields from req
  io_uring: unite fill_cqe and the 32B version
  io_uring: get rid of __io_fill_cqe{32}_req()
  io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOT
  Revert "io_uring: add buffer selection support to IORING_OP_NOP"
  Revert "io_uring: support CQE32 for nop operation"
  io_uring: limit size of provided buffer ring
  io_uring: fix types in provided buffer ring
  io_uring: fix index calculation
  io_uring: fix double unlock for pbuf select
  io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
  io_uring: openclose: fix bug of closing wrong fixed file
  io_uring: fix not locked access to fixed buf table
  io_uring: fix races with buffer table unregister
  ...
2022-06-17 11:14:07 -07:00
Linus Torvalds 5c0cd3d4a9 Merge tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull writeback and ext2 fixes from Jan Kara:
 "A fix for writeback bug which prevented machines with kdevtmpfs from
  booting and also one small ext2 bugfix in IO error handling"

* tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  init: Initialize noop_backing_dev_info early
  ext2: fix fs corruption when trying to remove a non-empty directory with IO error
2022-06-17 10:09:24 -07:00
Jens Axboe 6436c770f1 io_uring: recycle provided buffer if we punt to io-wq
io_arm_poll_handler() will recycle the buffer appropriately if we end
up arming poll (or if we're ready to retry), but not for the io-wq case
if we have attempted poll first.

Explicitly recycle the buffer to avoid both hanging on to it too long,
but also to avoid multiple reads grabbing the same one. This can happen
for ring mapped buffers, since it hasn't necessarily been committed.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Link: https://github.com/axboe/liburing/issues/605
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-17 06:24:26 -06:00
Mike Kravetz 68d32527d3 hugetlbfs: zero partial pages during fallocate hole punch
hugetlbfs fallocate support was originally added with commit 70c3547e36
("hugetlbfs: add hugetlbfs_fallocate()").  Initial support only operated
on whole hugetlb pages.  This makes sense for populating files as other
interfaces such as mmap and truncate require hugetlb page size alignment. 
Only operating on whole hugetlb pages for the hole punch case was a
simplification and there was no compelling use case to zero partial pages.

In a recent discussion[1] it was assumed that hugetlbfs hole punch would
zero partial hugetlb pages as that is in line with the man page
description saying 'partial filesystem blocks are zeroed'.  However, the
hugetlbfs hole punch code actually does this:

        hole_start = round_up(offset, hpage_size);
        hole_end = round_down(offset + len, hpage_size);

Modify code to zero partial hugetlb pages in hole punch range.  It is
possible that application code could note a change in behavior.  However,
that would imply the code is passing in an unaligned range and expecting
only whole pages be removed.  This is unlikely as the fallocate
documentation states the opposite.

The current hugetlbfs fallocate hole punch behavior is tested with the
libhugetlbfs test fallocate_align[2].  This test will be updated to
validate partial page zeroing.

[1] https://lore.kernel.org/linux-mm/20571829-9d3d-0b48-817c-b6b15565f651@redhat.com/
[2] https://github.com/libhugetlbfs/libhugetlbfs/blob/master/tests/fallocate_align.c

Link: https://lkml.kernel.org/r/YqeiMlZDKI1Kabfe@monkey
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-16 19:11:32 -07:00
Steve French 7c05eae8db smb3: add trace point for SMB2_set_eof
In order to debug problems with file size being reported incorrectly
temporarily (in this case xfstest generic/584 intermittent failure)
we need to add trace point for the non-compounded code path where
we set the file size (SMB2_set_eof).  The new trace point is:
   "smb3_set_eof"

Here is sample output from the tracepoint:

            TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
              | |         |   |||||     |         |
          xfs_io-75403   [002] ..... 95219.189835: smb3_set_eof: xid=221 sid=0xeef1cbd2 tid=0x27079ee6 fid=0x52edb58c offset=0x100000
 aio-dio-append--75418   [010] ..... 95219.242402: smb3_set_eof: xid=226 sid=0xeef1cbd2 tid=0x27079ee6 fid=0xae89852d offset=0x0

Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-16 18:07:10 -05:00
Dominique Martinet b0017602fd 9p: fix EBADF errors in cached mode
cached operations sometimes need to do invalid operations (e.g. read
on a write only file)
Historic fscache had added a "writeback fid", a special handle opened
RW as root, for this. The conversion to new fscache missed that bit.

This commit reinstates a slightly lesser variant of the original code
that uses the writeback fid for partial pages backfills if the regular
user fid had been open as WRONLY, and thus would lack read permissions.

Link: https://lkml.kernel.org/r/20220614033802.1606738-1-asmadeus@codewreck.org
Fixes: eb497943fa ("9p: Convert to using the netfs helper lib to do reads and caching")
Cc: stable@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>
Reported-By: Christian Schoenebeck <linux_oss@crudebyte.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Tested-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-17 06:03:30 +09:00
Jan Kara 8d5459c11f ext4: improve write performance with disabled delalloc
When delayed allocation is disabled (either through mount option or
because we are running low on free space), ext4_write_begin() allocates
blocks with EXT4_GET_BLOCKS_IO_CREATE_EXT flag. With this flag extent
merging is disabled and since ext4_write_begin() is called for each page
separately, we end up with a *lot* of 1 block extents in the extent tree
and following writeback is writing 1 block at a time which results in
very poor write throughput (4 MB/s instead of 200 MB/s). These days when
ext4_get_block_unwritten() is used only by ext4_write_begin(),
ext4_page_mkwrite() and inline data conversion, we can safely allow
extent merging to happen from these paths since following writeback will
happen on different boundaries anyway. So use
EXT4_GET_BLOCKS_CREATE_UNRIT_EXT instead which restores the performance.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220520111402.4252-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 12:17:56 -04:00
Zhang Yi 15baa7dcad ext4: fix warning when submitting superblock in ext4_commit_super()
We have already check the io_error and uptodate flag before submitting
the superblock buffer, and re-set the uptodate flag if it has been
failed to write out. But it was lockless and could be raced by another
ext4_commit_super(), and finally trigger '!uptodate' WARNING when
marking buffer dirty. Fix it by submit buffer directly.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20220520023216.3065073-1-yi.zhang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 11:50:48 -04:00
Dylan Yudaken 32fc810b36 io_uring: do not use prio task_work_add in uring_cmd
io_req_task_prio_work_add has a strict assumption that it will only be
used with io_req_task_complete. There is a codepath that assumes this is
the case and will not even call the completion function if it is hit.

For uring_cmd with an arbitrary completion function change the call to the
correct non-priority version.

Fixes: ee692a21e9 ("fs,io_uring: add infrastructure for uring-cmd")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20220616135011.441980-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 09:10:26 -06:00
Wang Jianjian 48e02e6113 ext4: fix incorrect comment in ext4_bio_write_page()
Signed-off-by: Wang Jianjian <wangjianjian3@huawei.com>
Link: https://lore.kernel.org/r/20220520022255.2120576-1-wangjianjian3@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 11:03:16 -04:00
Yang Li 4f5bf12732 fs: fix jbd2_journal_try_to_free_buffers() kernel-doc comment
Add the description of @folio and remove @page in function kernel-doc
comment to remove warnings found by running scripts/kernel-doc, which
is caused by using 'make W=1'.

fs/jbd2/transaction.c:2149: warning: Function parameter or member
'folio' not described in 'jbd2_journal_try_to_free_buffers'
fs/jbd2/transaction.c:2149: warning: Excess function parameter 'page'
description in 'jbd2_journal_try_to_free_buffers'

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220512075432.31763-1-yang.lee@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2022-06-16 10:36:09 -04:00
Jens Axboe a76c0b31ee io_uring: commit non-pollable provided mapped buffers upfront
For recv/recvmsg, IO either completes immediately or gets queued for a
retry. This isn't the case for read/readv, if eg a normal file or a block
device is used. Here, an operation can get queued with the block layer.
If this happens, ring mapped buffers must get committed immediately to
avoid that the next read can consume the same buffer.

Check if we're dealing with pollable file, when getting a new ring mapped
provided buffer. If it's not, commit it immediately rather than wait post
issue. If we don't wait, we can race with completions coming in, or just
plain buffer reuse by committing after a retry where others could have
grabbed the same buffer.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16 07:14:44 -06:00
Ye Bin 27cfa25895 ext2: fix fs corruption when trying to remove a non-empty directory with IO error
We got issue as follows:
[home]# mount  /dev/sdd  test
[home]# cd test
[test]# ls
dir1  lost+found
[test]# rmdir  dir1
ext2_empty_dir: inject fault
[test]# ls
lost+found
[test]# cd ..
[home]# umount test
[home]# fsck.ext2 -fn  /dev/sdd
e2fsck 1.42.9 (28-Dec-2013)
Pass 1: Checking inodes, blocks, and sizes
Inode 4065, i_size is 0, should be 1024.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Unconnected directory inode 4065 (/???)
Connect to /lost+found? no

'..' in ... (4065) is / (2), should be <The NULL inode> (0).
Fix? no

Pass 4: Checking reference counts
Inode 2 ref count is 3, should be 4.  Fix? no

Inode 4065 ref count is 2, should be 3.  Fix? no

Pass 5: Checking group summary information

/dev/sdd: ********** WARNING: Filesystem still has errors **********

/dev/sdd: 14/128016 files (0.0% non-contiguous), 18477/512000 blocks

Reason is same with commit 7aab5c84a0. We can't assume directory
is empty when read directory entry failed.

Link: https://lore.kernel.org/r/20220615090010.1544152-1-yebin10@huawei.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-16 10:55:45 +02:00
Darrick J. Wong e89ab76d7e xfs: preserve DIFLAG2_NREXT64 when setting other inode attributes
It is vitally important that we preserve the state of the NREXT64 inode
flag when we're changing the other flags2 fields.

Fixes: 9b7d16e34b ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:33 -07:00
Darrick J. Wong 10930b254d xfs: fix variable state usage
The variable @args is fed to a tracepoint, and that's the only place
it's used.  This is fine for the kernel, but for userspace, tracepoints
are #define'd out of existence, which results in this warning on gcc
11.2:

xfs_attr.c: In function ‘xfs_attr_node_try_addname’:
xfs_attr.c:1440:42: warning: unused variable ‘args’ [-Wunused-variable]
 1440 |         struct xfs_da_args              *args = attr->xattri_da_args;
      |                                          ^~~~

Clean this up.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:32 -07:00
Darrick J. Wong f4288f0182 xfs: fix TOCTOU race involving the new logged xattrs control knob
I found a race involving the larp control knob, aka the debugging knob
that lets developers enable logging of extended attribute updates:

Thread 1			Thread 2

echo 0 > /sys/fs/xfs/debug/larp
				setxattr(REPLACE)
				xfs_has_larp (returns false)
				xfs_attr_set

echo 1 > /sys/fs/xfs/debug/larp

				xfs_attr_defer_replace
				xfs_attr_init_replace_state
				xfs_has_larp (returns true)
				xfs_attr_init_remove_state

				<oops, wrong DAS state!>

This isn't a particularly severe problem right now because xattr logging
is only enabled when CONFIG_XFS_DEBUG=y, and developers *should* know
what they're doing.

However, the eventual intent is that callers should be able to ask for
the assistance of the log in persisting xattr updates.  This capability
might not be required for /all/ callers, which means that dynamic
control must work correctly.  Once an xattr update has decided whether
or not to use logged xattrs, it needs to stay in that mode until the end
of the operation regardless of what subsequent parallel operations might
do.

Therefore, it is an error to continue sampling xfs_globals.larp once
xfs_attr_change has made a decision about larp, and it was not correct
for me to have told Allison that ->create_intent functions can sample
the global log incompat feature bitfield to decide to elide a log item.

Instead, create a new op flag for the xfs_da_args structure, and convert
all other callers of xfs_has_larp and xfs_sb_version_haslogxattrs within
the attr update state machine to look for the operations flag.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2022-06-15 23:13:32 -07:00
Dave Wysochanski 5ee3d10f84 NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file
Commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag")
added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add
it for NFSv4.x.  This causes direct io on NFSv4.x to fail open
with EINVAL:
  mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4
  dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct
  dd: failed to open '/mnt/nfs4/file.bin': Invalid argument
  dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct
  dd: failed to open '/mnt/dir1/file1.bin': Invalid argument

Fixes: a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-15 15:03:12 -04:00
Linus Torvalds 979086f5e0 fs.fixes.v5.19-rc3
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYqmpKwAKCRCRxhvAZXjc
 ogvLAQCsgqKYjmqx1s9ta8PXH9qiTWLQh1/s3ONCAvSBe0rYRAD9HPwbUoxguqxr
 T2RzjuX2+rqzA5qTErjQqVEftn7DgAo=
 =+P6m
 -----END PGP SIGNATURE-----

Merge tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull vfs idmapping fix from Christian Brauner:
 "This fixes an issue where we fail to change the group of a file when
  the caller owns the file and is a member of the group to change to.

  This is only relevant on idmapped mounts.

  There's a detailed description in the commit message and regression
  tests have been added to xfstests"

* tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: account for group membership
2022-06-15 09:04:55 -07:00
Pavel Begunkov c5595975b5 io_uring: make io_fill_cqe_aux honour CQE32
Don't let io_fill_cqe_aux() post 16B cqes for CQE32 rings, neither the
kernel nor the userspace expect this to happen.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/64fae669fae1b7083aa15d0cd807f692b0880b9a.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:56 -06:00
Pavel Begunkov cd94903d3b io_uring: remove __io_fill_cqe() helper
In preparation for the following patch, inline __io_fill_cqe(), there is
only one user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71dab9afc3cde3f8b64d26f20d3b60bdc40726ff.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:42 -06:00
Pavel Begunkov 2caf9822f0 io_uring: fix ->extra{1,2} misuse
We don't really know the state of req->extra{1,2] fields in
__io_fill_cqe_req(), if an opcode handler is not aware of CQE32 option,
it never sets them up properly. Track the state of those fields with a
request flag.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4b3e5be512fbf4debec7270fd485b8a3b014d464.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov 29ede2014c io_uring: fill extra big cqe fields from req
The only user of io_req_complete32()-like functions is cmd
requests. Instead of keeping the whole complete32 family, remove them
and provide the extras in already added for inline completions
req->extra{1,2}. When fill_cqe_res() finds CQE32 option enabled
it'll use those fields to fill a 32B cqe.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/af1319eb661b1f9a0abceb51cbbf72b8002e019d.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov f43de1f888 io_uring: unite fill_cqe and the 32B version
We want just one function that will handle both normal cqes and 32B
cqes. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still
not entirely correct yet, but saves us from cases when we fill an CQE of
a wrong size.

Fixes: 76c68fbf1a ("io_uring: enable CQE32")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8085c5b2f74141520f60decd45334f87e389b718.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Pavel Begunkov 91ef75a7db io_uring: get rid of __io_fill_cqe{32}_req()
There are too many cqe filling helpers, kill __io_fill_cqe{32}_req(),
use __io_fill_cqe{32}_req_filled() instead, and then rename it. It'll
simplify fixing in following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c18e0d191014fb574f24721245e4e3fddd0b6917.1655287457.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-15 05:06:09 -06:00
Tyler Hicks 2a3dcbccd6 9p: Fix refcounting during full path walks for fid lookups
Decrement the refcount of the parent dentry's fid after walking
each path component during a full path walk for a lookup. Failure to do
so can lead to fids that are not clunked until the filesystem is
unmounted, as indicated by this warning:

 9pnet: found fid 3 not clunked

The improper refcounting after walking resulted in open(2) returning
-EIO on any directories underneath the mount point when using the virtio
transport. When using the fd transport, there's no apparent issue until
the filesytem is unmounted and the warning above is emitted to the logs.

In some cases, the user may not yet be attached to the filesystem and a
new root fid, associated with the user, is created and attached to the
root dentry before the full path walk is performed. Increment the new
root fid's refcount to two in that situation so that it can be safely
decremented to one after it is used for the walk operation. The new fid
will still be attached to the root dentry when
v9fs_fid_lookup_with_uid() returns so a final refcount of one is
correct/expected.

Link: https://lkml.kernel.org/r/20220527000003.355812-2-tyhicks@linux.microsoft.com
Link: https://lkml.kernel.org/r/20220612085330.1451496-4-asmadeus@codewreck.org
Fixes: 6636b6dcc3 ("9p: add refcount to p9_fid struct")
Cc: stable@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
[Dominique: fix clunking fid multiple times discussed in second link]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-15 12:05:33 +09:00
Dominique Martinet e5690f2632 9p: fix fid refcount leak in v9fs_vfs_get_link
we check for protocol version later than required, after a fid has
been obtained. Just move the version check earlier.

Link: https://lkml.kernel.org/r/20220612085330.1451496-3-asmadeus@codewreck.org
Fixes: 6636b6dcc3 ("9p: add refcount to p9_fid struct")
Cc: stable@vger.kernel.org
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-15 12:05:29 +09:00
Dominique Martinet beca774fc5 9p: fix fid refcount leak in v9fs_vfs_atomic_open_dotl
We need to release directory fid if we fail halfway through open

This fixes fid leaking with xfstests generic 531

Link: https://lkml.kernel.org/r/20220612085330.1451496-2-asmadeus@codewreck.org
Fixes: 6636b6dcc3 ("9p: add refcount to p9_fid struct")
Cc: stable@vger.kernel.org
Reported-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Tyler Hicks <tyhicks@linux.microsoft.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-06-15 12:05:21 +09:00
Pavel Begunkov d884b6498d io_uring: remove IORING_CLOSE_FD_AND_FILE_SLOT
This partially reverts a7c41b4687

Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some
users, but it tries to do two things at a time and it's not clear how to
handle errors and what to return in a single result field when one part
fails and another completes well. Kill it for now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/837c745019b3795941eee4fcfd7de697886d645b.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Pavel Begunkov aa165d6d2b Revert "io_uring: add buffer selection support to IORING_OP_NOP"
This reverts commit 3d200242a6.

Buffer selection with nops was used for debugging and benchmarking but
is useless in real life. Let's revert it before it's released.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c5012098ca6b51dfbdcb190f8c4e3c0bf1c965dc.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Pavel Begunkov 8899ce4b2f Revert "io_uring: support CQE32 for nop operation"
This reverts commit 2bb04df7c2.

CQE32 nops were used for debugging and benchmarking but it doesn't
target any real use case. Revert it, we can return it back if someone
finds a good way to use it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5ff623d84ccb4b3f3b92a3ea41cdcfa612f3d96f.1655224415.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-14 10:57:40 -06:00
Christian Brauner 168f912893
fs: account for group membership
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.

When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.

Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.

The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.

When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.

We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.

Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.

New regression test sent to xfstests.

Link: https://github.com/lxc/lxd/issues/10537
Link: https://lore.kernel.org/r/20220613111517.2186646-1-brauner@kernel.org
Fixes: 2f221d6f7b ("attr: handle idmapped mounts")
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org # 5.15+
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2022-06-14 12:18:47 +02:00
Jens Axboe feaf625e70 Merge branch 'io_uring/io_uring-5.19' of https://github.com/isilence/linux into io_uring-5.19
Pull io_uring fixes from Pavel.

* 'io_uring/io_uring-5.19' of https://github.com/isilence/linux:
  io_uring: fix double unlock for pbuf select
  io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
  io_uring: openclose: fix bug of closing wrong fixed file
  io_uring: fix not locked access to fixed buf table
  io_uring: fix races with buffer table unregister
  io_uring: fix races with file table unregister
2022-06-13 06:52:52 -06:00
Dylan Yudaken f9437ac0f8 io_uring: limit size of provided buffer ring
The type of head and tail do not allow more than 2^15 entries in a
provided buffer ring, so do not allow this.
At 2^16 while each entry can be indexed, there is no way to
disambiguate full vs empty.

Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-4-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:33 -06:00
Dylan Yudaken c6e9fa5c0a io_uring: fix types in provided buffer ring
The type of head needs to match that of tail in order for rollover and
comparisons to work correctly.

Without this change the comparison of tail to head might incorrectly allow
io_uring to use a buffer that userspace had not given it.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-3-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:31 -06:00
Dylan Yudaken 97da4a5379 io_uring: fix index calculation
When indexing into a provided buffer ring, do not subtract 1 from the
index.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220613101157.3687-2-dylany@fb.com
Reviewed-by: Hao Xu <howeyxu@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-13 05:13:09 -06:00
Pavel Begunkov fc9375e3f7 io_uring: fix double unlock for pbuf select
io_buffer_select(), which is the only caller of io_ring_buffer_select(),
fully handles locking, mutex unlock in io_ring_buffer_select() will lead
to double unlock.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 11:37:41 +01:00
Hao Xu 42db0c00e2 io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
When we use ring-mapped provided buffer, we should consume it before
arm poll if partial io has been done. Otherwise the buffer may be used
by other requests and thus we lost the data.

Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by: Hao Xu <howeyxu@tencent.com>
[pavel: 5.19 rebase]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 11:37:30 +01:00
Hao Xu e71d7c56dd io_uring: openclose: fix bug of closing wrong fixed file
Don't update ret until fixed file is closed, otherwise the file slot
becomes the error code.

Fixes: a7c41b4687 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
Signed-off-by: Hao Xu <howeyxu@tencent.com>
[pavel: 5.19 rebase]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 11:37:03 +01:00
Pavel Begunkov 05b538c176 io_uring: fix not locked access to fixed buf table
We can look inside the fixed buffer table only while holding
->uring_lock, however in some cases we don't do the right async prep for
IORING_OP_{WRITE,READ}_FIXED ending up with NULL req->imu forcing making
an io-wq worker to try to resolve the fixed buffer without proper
locking.

Move req->imu setup into early req init paths, i.e. io_prep_rw(), which
is called unconditionally for rw requests and under uring_lock.

Fixes: 634d00df5e ("io_uring: add full-fledged dynamic buffers support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 09:53:41 +01:00
Pavel Begunkov d11d31fc5d io_uring: fix races with buffer table unregister
Fixed buffer table quiesce might unlock ->uring_lock, potentially
letting new requests to be submitted, don't allow those requests to
use the table as they will race with unregistration.

Reported-and-tested-by: van fantasy <g1042620637@gmail.com>
Fixes: bd54b6fe33 ("io_uring: implement fixed buffers registration similar to fixed files")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 09:53:27 +01:00
Pavel Begunkov b0380bf6da io_uring: fix races with file table unregister
Fixed file table quiesce might unlock ->uring_lock, potentially letting
new requests to be submitted, don't allow those requests to use the
table as they will race with unregistration.

Reported-and-tested-by: van fantasy <g1042620637@gmail.com>
Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
2022-06-13 09:53:07 +01:00
Linus Torvalds 2275c6babf 3 smb3 reconnect fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmKkwyEACgkQiiy9cAdy
 T1EqyAv/d8aDey0rQzGy918wzJd91gZrNFOJUpzVhIs3O5MakBgeoYn+S6rySl1+
 xs6lXTQdSEyiL0edqTIq8iqA+iuhLPCBW2BWa/Zw089yHM/Ho3tjc5gBl5w38OcF
 7NpFUInkg+yoBYWY9cCwjL83YaPxhcLKGY7S6WWptUxzf5Sg6eUqXCkMS7eUV6hb
 YniMa5uWZSJtqY4F6qw/NOw90QekodEmfL4lLU/GXOnDxlJ8v5Ztf3aGHITWNwsd
 ovhutUSai/tZz9fYHp6yOZYDcl4i0brOa3dIyU2tr52TdtzS73he8rE+Th4bu+uM
 XTXvrDCTwsnOTiRFyyBJcaVDF+6LpqPEcURqLEbVOf0xXHyoEQ4zEwVFQJIBYOP4
 Oy8XeXQePRxCnToI2cFZaw85IkLikoZ+4PggFkbsaFdJfkboR7b+XVhkfGVr5jnn
 A6Unwrn3f6LS6MLbsDAjUpfzftwRyhbcvYCukeYKWz836xAx5tyOBIZA3m6keXah
 LyhX4qSb
 =mHWJ
 -----END PGP SIGNATURE-----

Merge tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs client fixes from Steve French:
 "Three reconnect fixes, all for stable as well.

  One of these three reconnect fixes does address a problem with
  multichannel reconnect, but this does not include the additional
  fix (still being tested) for dynamically detecting multichannel
  adapter changes which will improve those reconnect scenarios even
  more"

* tag '5.19-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: populate empty hostnames for extra channels
  cifs: return errors during session setup during reconnects
  cifs: fix reconnect on smb3 mount types
2022-06-12 11:05:44 -07:00
Christophe JAILLET 06ee1c0aeb ksmbd: smbd: Remove useless license text when SPDX-License-Identifier is already used
An SPDX-License-Identifier is already in place. There is no need to
duplicate part of the corresponding license.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-11 11:18:26 -05:00
Namjae Jeon fe0fde09e1 ksmbd: use SOCK_NONBLOCK type for kernel_accept()
I found that normally it is O_NONBLOCK but there are different value
for some arch.

/include/linux/net.h:
#ifndef SOCK_NONBLOCK
#define SOCK_NONBLOCK   O_NONBLOCK
#endif

/arch/alpha/include/asm/socket.h:
#define SOCK_NONBLOCK   0x40000000

Use SOCK_NONBLOCK instead of O_NONBLOCK for kernel_accept().

Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Namjae Jeon <linkinjeon@kerne.org>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-11 11:18:26 -05:00
Linus Torvalds 0885eacdc8 Notable changes:
- There is now a backup maintainer for NFSD
 
 Notable fixes:
 - Prevent array overruns in svc_rdma_build_writes()
 - Prevent buffer overruns when encoding NFSv3 READDIR results
 - Fix a potential UAF in nfsd_file_put()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmKjhbsACgkQM2qzM29m
 f5cgig/6A9gC2c9v4lR2fH6ufiCWvJBfuVaBbToubwktJHaDLvqH56JcvS3s/gKL
 PKGmbQTI/6lgmVgJqQSxJUnfe6wzHx8G1MdjlZEIwi3pUeiV+LpiprZz9TOhgYYV
 YDnXRGhb4wKOm75w8rb6X8k106XdKwdBaRQwb88FDawffWoEY0XNYrlmNmmWi8To
 ELlOlIwRCBbKJoJ6yEEWQRrBuVBXapbsn29tipZXbdo58g+vL0yDQq9s97b0mHhi
 C2apAN2+k18FiBJsA7b7pW1l/P6k9FNEeetvgWyN8OSMpPNmt0vz1HvKaIstPgg1
 BX6rgWe5eQBFEk2KNvSGHrV3R+wAp7jeuVpHUMjxXvzmfj4exJV/H8lu+qZJNDGN
 ybCJatomR4APFxk+s1kptlzNo7zfyPz15L80HmWIngYJ/lrBOoKPIIi3bwPQcBwW
 q2Rc+SlvpqbJvEcomgF/lqQN6inmx44J+KpOSA/S8qSIdSkz0iaZsDahFxgZNe82
 h+X/i1maRtnSIvWdGMR7O6kEFT5jky35WlTv/VutTOsUwA4mUU9vZUnufBBHJH07
 nOdLMi/QS/O5GOnlyegrODtN75wi+IeKt+WMNmnN+JB8Tsg0kZwjOsc/dbQfyJqP
 PrQJ5AUP0TMm90B1873z8yKmhWeXtB71vgAI/d53aadBG4ZEstU=
 =6ugk
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:
 "Notable changes:

   - There is now a backup maintainer for NFSD

  Notable fixes:

   - Prevent array overruns in svc_rdma_build_writes()

   - Prevent buffer overruns when encoding NFSv3 READDIR results

   - Fix a potential UAF in nfsd_file_put()"

* tag 'nfsd-5.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Remove pointer type casts from xdr_get_next_encode_buffer()
  SUNRPC: Clean up xdr_get_next_encode_buffer()
  SUNRPC: Clean up xdr_commit_encode()
  SUNRPC: Optimize xdr_reserve_space()
  SUNRPC: Fix the calculation of xdr->end in xdr_get_next_encode_buffer()
  SUNRPC: Trap RDMA segment overflows
  NFSD: Fix potential use-after-free in nfsd_file_put()
  MAINTAINERS: reciprocal co-maintainership for file locking and nfsd
2022-06-10 17:28:43 -07:00
Shyam Prasad N 4c14d7043f cifs: populate empty hostnames for extra channels
Currently, the secondary channels of a multichannel session
also get hostname populated based on the info in primary channel.
However, this will end up with a wrong resolution of hostname to
IP address during reconnect.

This change fixes this by not populating hostname info for all
secondary channels.

Fixes: 5112d80c16 ("cifs: populate server_hostname for extra channels")
Cc: stable@vger.kernel.org
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-10 18:55:02 -05:00
David Howells 40a8110120 netfs: Rename the netfs_io_request cleanup op and give it an op pointer
The netfs_io_request cleanup op is now always in a position to be given a
pointer to a netfs_io_request struct, so this can be passed in instead of
the mapping and private data arguments (both of which are included in the
struct).

So rename the ->cleanup op to ->free_request (to match ->init_request) and
pass in the I/O pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
2022-06-10 20:55:21 +01:00
Linus Torvalds e81fb4198e netfs: Further cleanups after struct netfs_inode wrapper introduced
Change the signature of netfs helper functions to take a struct netfs_inode
pointer rather than a struct inode pointer where appropriate, thereby
relieving the need for the network filesystem to convert its internal inode
format down to the VFS inode only for netfslib to bounce it back up.  For
type safety, it's better not to do that (and it's less typing too).

Give netfs_write_begin() an extra argument to pass in a pointer to the
netfs_inode struct rather than deriving it internally from the file
pointer.  Note that the ->write_begin() and ->write_end() ops are intended
to be replaced in the future by netfslib code that manages this without the
need to call in twice for each page.

netfs_readpage() and similar are intended to be pointed at directly by the
address_space_operations table, so must stick to the signature dictated by
the function pointers there.

Changes
=======
- Updated the kerneldoc comments and documentation [DH].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/CAHk-=wgkwKyNmNdKpQkqZ6DnmUL-x9hp0YBnUGjaPFEAdxDTbw@mail.gmail.com/
2022-06-10 20:55:21 +01:00
David Howells 102d841055 afs: Fix some checker issues
Remove an unused global variable and make another static as reported by
make C=1.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
2022-06-10 20:55:21 +01:00
Linus Torvalds ad6e076498 zonefs fixes for 5.19-rc2
* Fix handling of the explicit-open mount option, and in particular the
   conditions under which this option can be ignored.
 
 * Fix a problem with zonefs iomap_begin method, causing a hang in
   iomap_readahead() when a readahead request reaches the end of a file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCYqMh/wAKCRDdoc3SxdoY
 dvd+AP4jNRFhAedXl0mIutoP4k0XwblSz9RwrXLOYzkOtgpXGQD+Lps42w6EQliE
 wWuuL4syVgKamolj0WGcPLarGZC7LQA=
 =neot
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fixes from Damien Le Moal:

 - Fix handling of the explicit-open mount option, and in particular the
   conditions under which this option can be ignored.

 - Fix a problem with zonefs iomap_begin method, causing a hang in
   iomap_readahead() when a readahead request reaches the end of a file.

* tag 'zonefs-5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: fix zonefs_iomap_begin() for reads
  zonefs: Do not ignore explicit_open with active zone limit
  zonefs: fix handling of explicit_open option on mount
2022-06-10 10:56:28 -07:00
David Howells 874c8ca1e6 netfs: Fix gcc-12 warning by embedding vfs inode in netfs_i_context
While randstruct was satisfied with using an open-coded "void *" offset
cast for the netfs_i_context <-> inode casting, __builtin_object_size() as
used by FORTIFY_SOURCE was not as easily fooled.  This was causing the
following complaint[1] from gcc v12:

  In file included from include/linux/string.h:253,
                   from include/linux/ceph/ceph_debug.h:7,
                   from fs/ceph/inode.c:2:
  In function 'fortify_memset_chk',
      inlined from 'netfs_i_context_init' at include/linux/netfs.h:326:2,
      inlined from 'ceph_alloc_inode' at fs/ceph/inode.c:463:2:
  include/linux/fortify-string.h:242:25: warning: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    242 |                         __write_overflow_field(p_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fix this by embedding a struct inode into struct netfs_i_context (which
should perhaps be renamed to struct netfs_inode).  The struct inode
vfs_inode fields are then removed from the 9p, afs, ceph and cifs inode
structs and vfs_inode is then simply changed to "netfs.inode" in those
filesystems.

Further, rename netfs_i_context to netfs_inode, get rid of the
netfs_inode() function that converted a netfs_i_context pointer to an
inode pointer (that can now be done with &ctx->inode) and rename the
netfs_i_context() function to netfs_inode() (which is now a wrapper
around container_of()).

Most of the changes were done with:

  perl -p -i -e 's/vfs_inode/netfs.inode/'g \
        `git grep -l 'vfs_inode' -- fs/{9p,afs,ceph,cifs}/*.[ch]`

Kees suggested doing it with a pair structure[2] and a special
declarator to insert that into the network filesystem's inode
wrapper[3], but I think it's cleaner to embed it - and then it doesn't
matter if struct randomisation reorders things.

Dave Chinner suggested using a filesystem-specific VFS_I() function in
each filesystem to convert that filesystem's own inode wrapper struct
into the VFS inode struct[4].

Version #2:
 - Fix a couple of missed name changes due to a disabled cifs option.
 - Rename nfs_i_context to nfs_inode
 - Use "netfs" instead of "nic" as the member name in per-fs inode wrapper
   structs.

[ This also undoes commit 507160f46c ("netfs: gcc-12: temporarily
  disable '-Wattribute-warning' for now") that is no longer needed ]

Fixes: bc899ee1c8 ("netfs: Add a netfs inode context")
Reported-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
cc: Jonathan Corbet <corbet@lwn.net>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Steve French <smfrench@gmail.com>
cc: William Kucharski <william.kucharski@oracle.com>
cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
cc: Dave Chinner <david@fromorbit.com>
cc: linux-doc@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-afs@lists.infradead.org
cc: ceph-devel@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: samba-technical@lists.samba.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/r/d2ad3a3d7bdd794c6efb562d2f2b655fb67756b9.camel@kernel.org/ [1]
Link: https://lore.kernel.org/r/20220517210230.864239-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/r/20220518202212.2322058-1-keescook@chromium.org/ [3]
Link: https://lore.kernel.org/r/20220524101205.GI2306852@dread.disaster.area/ [4]
Link: https://lore.kernel.org/r/165296786831.3591209.12111293034669289733.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165305805651.4094995.7763502506786714216.stgit@warthog.procyon.org.uk # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-09 13:55:00 -07:00
Linus Torvalds 3d9f55c57b \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmKiO9UACgkQnJ2qBz9k
 QNk9+Af/RjaJEozyj/He7nqj1xncN6bIJzeyOqQVJNkHBsKYt7oDFvSuYI1Kbzk+
 x7/x8dRtVR3kRZCO6VarETkzGp6Nw10RdzFKqT2FRmQ66wVZaXPQeqVZqwXSKdtR
 qgU892e9S2SqUH9EyUwk3D/HwLr1VNKKp6B0N+By7EwKmZdyTg5siFJ26+z+QpJQ
 wo84nN/m6GgHSm+c8kMFa+cs635tMY3+vP4nviUKyuDTxW3Yu6maIa5973WLiFqo
 EZSLtSfXYasjoOl5fN3AaO0dAl8fRJIh6wsgbeQI/NeUYMIqKWslW+5esq1SwreS
 r1+Xig8MmxDJ/1I3i/L/aDM7FipY9A==
 =kMe8
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull ext2, writeback, and quota fixes and cleanups from Jan Kara:
 "A fix for race in writeback code and two cleanups in quota and ext2"

* tag 'fs_for_v5.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Prevent memory allocation recursion while holding dq_lock
  writeback: Fix inode->i_io_list not be protected by inode->i_lock error
  fs: Fix syntax errors in comments
2022-06-09 12:26:05 -07:00
Linus Torvalds 507160f46c netfs: gcc-12: temporarily disable '-Wattribute-warning' for now
This is a pure band-aid so that I can continue merging stuff from people
while some of the gcc-12 fallout gets sorted out.

In particular, gcc-12 is very unhappy about the kinds of pointer
arithmetic tricks that netfs does, and that makes the fortify checks
trigger in afs and ceph:

  In function ‘fortify_memset_chk’,
      inlined from ‘netfs_i_context_init’ at include/linux/netfs.h:327:2,
      inlined from ‘afs_set_netfs_context’ at fs/afs/inode.c:61:2,
      inlined from ‘afs_root_iget’ at fs/afs/inode.c:543:2:
  include/linux/fortify-string.h:258:25: warning: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning]
    258 |                         __write_overflow_field(p_size_field, size);
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

and the reason is that netfs_i_context_init() is passed a 'struct inode'
pointer, and then it does

        struct netfs_i_context *ctx = netfs_i_context(inode);

        memset(ctx, 0, sizeof(*ctx));

where that netfs_i_context() function just does pointer arithmetic on
the inode pointer, knowing that the netfs_i_context is laid out
immediately after it in memory.

This is all truly disgusting, since the whole "netfs_i_context is laid
out immediately after it in memory" is not actually remotely true in
general, but is just made to be that way for afs and ceph.

See for example fs/cifs/cifsglob.h:

  struct cifsInodeInfo {
        struct {
                /* These must be contiguous */
                struct inode    vfs_inode;      /* the VFS's inode record */
                struct netfs_i_context netfs_ctx; /* Netfslib context */
        };
	[...]

and realize that this is all entirely wrong, and the pointer arithmetic
that netfs_i_context() is doing is also very very wrong and wouldn't
give the right answer if netfs_ctx had different alignment rules from a
'struct inode', for example).

Anyway, that's just a long-winded way to say "the gcc-12 warning is
actually quite reasonable, and our code happens to work but is pretty
disgusting".

This is getting fixed properly, but for now I made the mistake of
thinking "the week right after the merge window tends to be calm for me
as people take a breather" and I did a sustem upgrade.  And I got gcc-12
as a result, so to continue merging fixes from people and not have the
end result drown in warnings, I am fixing all these gcc-12 issues I hit.

Including with these kinds of temporary fixes.

Cc: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/all/AEEBCF5D-8402-441D-940B-105AA718C71F@chromium.org/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-06-09 11:29:36 -07:00
Sungjong Seo 204e6ceaa1 exfat: use updated exfat_chain directly during renaming
In order for a file to access its own directory entry set,
exfat_inode_info(ei) has two copied values. One is ei->dir, which is
a snapshot of exfat_chain of the parent directory, and the other is
ei->entry, which is the offset of the start of the directory entry set
in the parent directory.

Since the parent directory can be updated after the snapshot point,
it should be used only for accessing one's own directory entry set.

However, as of now, during renaming, it could try to traverse or to
allocate clusters via snapshot values, it does not make sense.

This potential problem has been revealed when exfat_update_parent_info()
was removed by commit d8dad2588a ("exfat: fix referencing wrong parent
directory information after renaming"). However, I don't think it's good
idea to bring exfat_update_parent_info() back.

Instead, let's use the updated exfat_chain of parent directory diectly.

Fixes: d8dad2588a ("exfat: fix referencing wrong parent directory information after renaming")
Reported-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Sungjong Seo <sj1557.seo@samsung.com>
Tested-by: Wang Yugui <wangyugui@e16-tech.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
2022-06-09 21:26:32 +09:00
Damien Le Moal c1c1204c0d zonefs: fix zonefs_iomap_begin() for reads
If a readahead is issued to a sequential zone file with an offset
exactly equal to the current file size, the iomap type is set to
IOMAP_UNWRITTEN, which will prevent an IO, but the iomap length is
calculated as 0. This causes a WARN_ON() in iomap_iter():

[17309.548939] WARNING: CPU: 3 PID: 2137 at fs/iomap/iter.c:34 iomap_iter+0x9cf/0xe80
[...]
[17309.650907] RIP: 0010:iomap_iter+0x9cf/0xe80
[...]
[17309.754560] Call Trace:
[17309.757078]  <TASK>
[17309.759240]  ? lock_is_held_type+0xd8/0x130
[17309.763531]  iomap_readahead+0x1a8/0x870
[17309.767550]  ? iomap_read_folio+0x4c0/0x4c0
[17309.771817]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[17309.778848]  ? lock_release+0x370/0x750
[17309.784462]  ? folio_add_lru+0x217/0x3f0
[17309.790220]  ? reacquire_held_locks+0x4e0/0x4e0
[17309.796543]  read_pages+0x17d/0xb60
[17309.801854]  ? folio_add_lru+0x238/0x3f0
[17309.807573]  ? readahead_expand+0x5f0/0x5f0
[17309.813554]  ? policy_node+0xb5/0x140
[17309.819018]  page_cache_ra_unbounded+0x27d/0x450
[17309.825439]  filemap_get_pages+0x500/0x1450
[17309.831444]  ? filemap_add_folio+0x140/0x140
[17309.837519]  ? lock_is_held_type+0xd8/0x130
[17309.843509]  filemap_read+0x28c/0x9f0
[17309.848953]  ? zonefs_file_read_iter+0x1ea/0x4d0 [zonefs]
[17309.856162]  ? trace_contention_end+0xd6/0x130
[17309.862416]  ? __mutex_lock+0x221/0x1480
[17309.868151]  ? zonefs_file_read_iter+0x166/0x4d0 [zonefs]
[17309.875364]  ? filemap_get_pages+0x1450/0x1450
[17309.881647]  ? __mutex_unlock_slowpath+0x15e/0x620
[17309.888248]  ? wait_for_completion_io_timeout+0x20/0x20
[17309.895231]  ? lock_is_held_type+0xd8/0x130
[17309.901115]  ? lock_is_held_type+0xd8/0x130
[17309.906934]  zonefs_file_read_iter+0x356/0x4d0 [zonefs]
[17309.913750]  new_sync_read+0x2d8/0x520
[17309.919035]  ? __x64_sys_lseek+0x1d0/0x1d0

Furthermore, this causes iomap_readahead() to loop forever as
iomap_readahead_iter() always returns 0, making no progress.

Fix this by treating reads after the file size as access to holes,
setting the iomap type to IOMAP_HOLE, the iomap addr to IOMAP_NULL_ADDR
and using the length argument as is for the iomap length. To simplify
the code with this change, zonefs_iomap_begin() is split into the read
variant, zonefs_read_iomap_begin() and zonefs_read_iomap_ops, and the
write variant, zonefs_write_iomap_begin() and zonefs_write_iomap_ops.

Reported-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jorgen Hansen <Jorgen.Hansen@wdc.com>
2022-06-08 19:13:55 +09:00
Damien Le Moal 96eca145cb zonefs: Do not ignore explicit_open with active zone limit
A zoned device may have no limit on the number of open zones but may
have a limit on the number of active zones it can support. In such
case, the explicit_open mount option should not be ignored to ensure
that the open() system call activates the zone with an explicit zone
open command, thus guaranteeing that the zone can be written.

Enforce this by ignoring the explicit_open mount option only for
devices that have both the open and active zone limits equal to 0.

Fixes: 87c9ce3ffe ("zonefs: Add active seq file accounting")
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-06-08 15:38:44 +09:00
Damien Le Moal a2a513be71 zonefs: fix handling of explicit_open option on mount
Ignoring the explicit_open mount option on mount for devices that do not
have a limit on the number of open zones must be done after the mount
options are parsed and set in s_mount_opts. Move the check to ignore
the explicit_open option after the call to zonefs_parse_options() in
zonefs_fill_super().

Fixes: b5c00e9757 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
2022-06-08 15:38:42 +09:00
David Sterba e3a4167c88 btrfs: add error messages to all unrecognized mount options
Almost none of the errors stemming from a valid mount option but wrong
value prints a descriptive message which would help to identify why
mount failed. Like in the linked report:

  $ uname -r
  v4.19
  $ mount -o compress=zstd /dev/sdb /mnt
  mount: /mnt: wrong fs type, bad option, bad superblock on
  /dev/sdb, missing codepage or helper program, or other error.
  $ dmesg
  ...
  BTRFS error (device sdb): open_ctree failed

Errors caused by memory allocation failures are left out as it's not a
user error so reporting that would be confusing.

Link: https://lore.kernel.org/linux-btrfs/9c3fec36-fc61-3a33-4977-a7e207c3fa4e@gmx.de/
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-07 17:29:50 +02:00
Shyam Prasad N 8ea21823aa cifs: return errors during session setup during reconnects
During reconnects, we check the return value from
cifs_negotiate_protocol, and have handlers for both success
and failures. But if that passes, and cifs_setup_session
returns any errors other than -EACCES, we do not handle
that. This fix adds a handler for that, so that we don't
go ahead and try a tree_connect on a failed session.

Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-06 18:23:38 -05:00
Trond Myklebust 880265c77a pNFS: Avoid a live lock condition in pnfs_update_layout()
If we're about to send the first layoutget for an empty layout, we want
to make sure that we drain out the existing pending layoutget calls
first. The reason is that these layouts may have been already implicitly
returned to the server by a recall to which the client gave a
NFS4ERR_NOMATCHING_LAYOUT response.

The problem is that wait_var_event_killable() could in principle see the
plh_outstanding count go back to '1' when the first process to wake up
starts sending a new layoutget. If it fails to get a layout, then this
loop can continue ad infinitum...

Fixes: 0b77f97a7e ("NFSv4/pnfs: Fix layoutget behaviour after invalidation")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-06 11:53:55 -04:00
Trond Myklebust fe44fb23d6 pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
If the server tells us that a pNFS layout is not available for a
specific file, then we should not keep pounding it with further
layoutget requests.

Fixes: 183d9e7b11 ("pnfs: rework LAYOUTGET retry handling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-06-06 11:53:54 -04:00
Qu Wenruo 0591f04036 btrfs: prevent remounting to v1 space cache for subpage mount
Upstream commit 9f73f1aef9 ("btrfs: force v2 space cache usage for
subpage mount") forces subpage mount to use v2 cache, to avoid
deprecated v1 cache which doesn't support subpage properly.

But there is a loophole that user can still remount to v1 cache.

The existing check will only give users a warning, but does not really
prevent to do the remount.

Although remounting to v1 will not cause any problems since the v1 cache
will always be marked invalid when mounted with a different page size,
it's still better to prevent v1 cache at all for subpage mounts.

Fixes: 9f73f1aef9 ("btrfs: force v2 space cache usage for subpage mount")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-06 16:18:59 +02:00
Filipe Manana 31e70e5278 btrfs: fix hang during unmount when block group reclaim task is running
When we start an unmount, at close_ctree(), if we have the reclaim task
running and in the middle of a data block group relocation, we can trigger
a deadlock when stopping an async reclaim task, producing a trace like the
following:

[629724.498185] task:kworker/u16:7   state:D stack:    0 pid:681170 ppid:     2 flags:0x00004000
[629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[629724.501267] Call Trace:
[629724.501759]  <TASK>
[629724.502174]  __schedule+0x3cb/0xed0
[629724.502842]  schedule+0x4e/0xb0
[629724.503447]  btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
[629724.504534]  ? prepare_to_wait_exclusive+0xc0/0xc0
[629724.505442]  flush_space+0x423/0x630 [btrfs]
[629724.506296]  ? rcu_read_unlock_trace_special+0x20/0x50
[629724.507259]  ? lock_release+0x220/0x4a0
[629724.507932]  ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
[629724.508940]  ? do_raw_spin_unlock+0x4b/0xa0
[629724.509688]  btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
[629724.510922]  process_one_work+0x252/0x5a0
[629724.511694]  ? process_one_work+0x5a0/0x5a0
[629724.512508]  worker_thread+0x52/0x3b0
[629724.513220]  ? process_one_work+0x5a0/0x5a0
[629724.514021]  kthread+0xf2/0x120
[629724.514627]  ? kthread_complete_and_exit+0x20/0x20
[629724.515526]  ret_from_fork+0x22/0x30
[629724.516236]  </TASK>
[629724.516694] task:umount          state:D stack:    0 pid:719055 ppid:695412 flags:0x00004000
[629724.518269] Call Trace:
[629724.518746]  <TASK>
[629724.519160]  __schedule+0x3cb/0xed0
[629724.519835]  schedule+0x4e/0xb0
[629724.520467]  schedule_timeout+0xed/0x130
[629724.521221]  ? lock_release+0x220/0x4a0
[629724.521946]  ? lock_acquired+0x19c/0x420
[629724.522662]  ? trace_hardirqs_on+0x1b/0xe0
[629724.523411]  __wait_for_common+0xaf/0x1f0
[629724.524189]  ? usleep_range_state+0xb0/0xb0
[629724.524997]  __flush_work+0x26d/0x530
[629724.525698]  ? flush_workqueue_prep_pwqs+0x140/0x140
[629724.526580]  ? lock_acquire+0x1a0/0x310
[629724.527324]  __cancel_work_timer+0x137/0x1c0
[629724.528190]  close_ctree+0xfd/0x531 [btrfs]
[629724.529000]  ? evict_inodes+0x166/0x1c0
[629724.529510]  generic_shutdown_super+0x74/0x120
[629724.530103]  kill_anon_super+0x14/0x30
[629724.530611]  btrfs_kill_super+0x12/0x20 [btrfs]
[629724.531246]  deactivate_locked_super+0x31/0xa0
[629724.531817]  cleanup_mnt+0x147/0x1c0
[629724.532319]  task_work_run+0x5c/0xa0
[629724.532984]  exit_to_user_mode_prepare+0x1a6/0x1b0
[629724.533598]  syscall_exit_to_user_mode+0x16/0x40
[629724.534200]  do_syscall_64+0x48/0x90
[629724.534667]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[629724.535318] RIP: 0033:0x7fa2b90437a7
[629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
[629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
[629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
[629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
[629724.541796]  </TASK>

This happens because:

1) Before entering close_ctree() we have the async block group reclaim
   task running and relocating a data block group;

2) There's an async metadata (or data) space reclaim task running;

3) We enter close_ctree() and park the cleaner kthread;

4) The async space reclaim task is at flush_space() and runs all the
   existing delayed iputs;

5) Before the async space reclaim task calls
   btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
   doing the data block group relocation, creates a delayed iput at
   replace_file_extents() (called when COWing leaves that have file extent
   items pointing to relocated data extents, during the merging phase
   of relocation roots);

6) The async reclaim space reclaim task blocks at
   btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;

7) The task at close_ctree() then calls cancel_work_sync() to stop the
   async space reclaim task, but it blocks since that task is waiting for
   the delayed iput to be run;

8) The delayed iput is never run because the cleaner kthread is parked,
   and no one else runs delayed iputs, resulting in a hang.

So fix this by stopping the async block group reclaim task before we
park the cleaner kthread.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-06-06 16:18:52 +02:00
Matthew Wilcox (Oracle) 537e11cdc7 quota: Prevent memory allocation recursion while holding dq_lock
As described in commit 02117b8ae9 ("f2fs: Set GF_NOFS in
read_cache_page_gfp while doing f2fs_quota_read"), we must not enter
filesystem reclaim while holding the dq_lock.  Prevent this more generally
by using memalloc_nofs_save() while holding the lock.

Link: https://lore.kernel.org/r/20220605143815.2330891-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-06 10:08:10 +02:00
Jchao Sun 10e1407310 writeback: Fix inode->i_io_list not be protected by inode->i_lock error
Commit b35250c081 ("writeback: Protect inode->i_io_list with
inode->i_lock") made inode->i_io_list not only protected by
wb->list_lock but also inode->i_lock, but inode_io_list_move_locked()
was missed. Add lock there and also update comment describing
things protected by inode->i_lock. This also fixes a race where
__mark_inode_dirty() could move inode under flush worker's hands
and thus sync(2) could miss writing some inodes.

Fixes: b35250c081 ("writeback: Protect inode->i_io_list with inode->i_lock")
Link: https://lore.kernel.org/r/20220524150540.12552-1-sunjunchao2870@gmail.com
CC: stable@vger.kernel.org
Signed-off-by: Jchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-06 09:54:30 +02:00
Xiang wangx 2aab03b867 fs: Fix syntax errors in comments
Delete the redundant word 'not'.

Link: https://lore.kernel.org/r/20220605125509.14837-1-wangxiang@cdjrlc.com
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2022-06-06 09:53:03 +02:00
Paulo Alcantara c36ee7dab7 cifs: fix reconnect on smb3 mount types
cifs.ko defines two file system types: cifs & smb3, and
__cifs_get_super() was not including smb3 file system type when
looking up superblocks, therefore failing to reconnect tcons in
cifs_tree_connect().

Fix this by calling iterate_supers_type() on both file system types.

Link: https://lore.kernel.org/r/CAFrh3J9soC36+BVuwHB=g9z_KB5Og2+p2_W+BBoBOZveErz14w@mail.gmail.com
Cc: stable@vger.kernel.org
Tested-by: Satadru Pramanik <satadru@gmail.com>
Reported-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
2022-06-06 01:04:12 -05:00
Linus Torvalds 6684cf4290 fix for breakage in #work.fd this window
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCYpz/IQAKCRBZ7Krx/gZQ
 666mAPwKOC/voemjl2m+RpSruxAbdlRvKei3IY8YxLfyv+rmUQD9HKLJJtQX2VRF
 QTFmQ3p7kx30ydwSbyY8Kpw3VMCDxgs=
 =1ZKm
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull file descriptor fix from Al Viro:
 "Fix for breakage in #work.fd this window"

* tag 'pull-work.fd-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix the breakage in close_fd_get_file() calling conventions change
2022-06-05 17:14:03 -07:00
Al Viro 40a1926022 fix the breakage in close_fd_get_file() calling conventions change
It used to grab an extra reference to struct file rather than
just transferring to caller the one it had removed from descriptor
table.  New variant doesn't, and callers need to be adjusted.

Reported-and-tested-by: syzbot+47dd250f527cb7bebf24@syzkaller.appspotmail.com
Fixes: 6319194ec5 ("Unify the primitives for file descriptor closing")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-05 15:03:03 -04:00