Commit graph

68570 commits

Author SHA1 Message Date
Pavel Begunkov a1bb3cd589 io_uring: fix __io_uring_files_cancel() with TASK_UNINTERRUPTIBLE
If the tctx inflight number haven't changed because of cancellation,
__io_uring_task_cancel() will continue leaving the task in
TASK_UNINTERRUPTIBLE state, that's not expected by
__io_uring_files_cancel(). Ensure we always call finish_wait() before
retrying.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-26 08:51:08 -07:00
Miklos Szeredi 0b964446c6 ecryptfs: fix uid translation for setxattr on security.capability
Prior to commit 7c03e2cda4 ("vfs: move cap_convert_nscap() call into
vfs_setxattr()") the translation of nscap->rootid did not take stacked
filesystems (overlayfs and ecryptfs) into account.

That patch fixed the overlay case, but made the ecryptfs case worse.

Restore old the behavior for ecryptfs that existed before the overlayfs
fix.  This does not fix ecryptfs's handling of complex user namespace
setups, but it does make sure existing setups don't regress.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tyler Hicks <code@tyhicks.com>
Fixes: 7c03e2cda4 ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Tyler Hicks <code@tyhicks.com>
2021-01-26 01:47:14 +00:00
Johannes Berg f8ad8187c3 fs/pipe: allow sendfile() to pipe again
After commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops") sendfile() could no longer send data
from a real file to a pipe, breaking for example certain cgit
setups (e.g. when running behind fcgiwrap), because in this
case cgit will try to do exactly this: sendfile() to a pipe.

Fix this by using iter_file_splice_write for the splice_write
method of pipes, as suggested by Christoph.

Cc: stable@vger.kernel.org
Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-25 12:32:26 -08:00
Filipe Manana 9ad6d91f05 btrfs: fix log replay failure due to race with space cache rebuild
After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.

If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:

    btrfs_cache_block_group(cache, 1);

That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).

So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:

static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
       if (ret < 0 || search_start != *offset)
            return -EINVAL;
(...)

It's the "search_start != *offset" that results in the condition being
evaluated to true.

When this happens we got the following in dmesg/syslog:

[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed

We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.

However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.

So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().

This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
e747853cae ("btrfs: load free space cache asynchronously"), but in
fact the issue has been there for years, it was just much less likely
to happen.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:53 +01:00
Su Yue c41ec4529d btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
This effectively reverts commit d5c8238849 ("btrfs: convert
data_seqcount to seqcount_mutex_t").

While running fstests on 32 bits test box, many tests failed because of
warnings in dmesg. One of those warnings (btrfs/003):

  [66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G           O      5.11.0-rc4-custom+ #5
  [66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
  [66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
  [66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
  [66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
  [66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
  [66.441485] Call Trace:
  [66.441510]  btrfs_relocate_chunk+0xb1/0x100 [btrfs]
  [66.441529]  ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
  [66.441562]  btrfs_balance+0x8ed/0x13b0 [btrfs]
  [66.441586]  ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441619]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441643]  btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
  [66.441664]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441683]  btrfs_ioctl+0x414/0x2ae0 [btrfs]
  [66.441700]  ? __lock_acquire+0x35f/0x2650
  [66.441717]  ? lockdep_hardirqs_on+0x87/0x120
  [66.441720]  ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
  [66.441724]  ? call_rcu+0x2d3/0x530
  [66.441731]  ? __might_fault+0x41/0x90
  [66.441736]  ? kvm_sched_clock_read+0x15/0x50
  [66.441740]  ? sched_clock+0x8/0x10
  [66.441745]  ? sched_clock_cpu+0x13/0x180
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441750]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
  [66.441768]  __ia32_sys_ioctl+0x165/0x8a0
  [66.441773]  ? __this_cpu_preempt_check+0xf/0x11
  [66.441785]  ? __might_fault+0x89/0x90
  [66.441791]  __do_fast_syscall_32+0x54/0x80
  [66.441796]  do_fast_syscall_32+0x32/0x70
  [66.441801]  do_SYSENTER_32+0x15/0x20
  [66.441805]  entry_SYSENTER_32+0x9f/0xf2
  [66.441808] EIP: 0xab7b5549
  [66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
  [66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
  [66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
  [66.441833] irq event stamp: 42579
  [66.441835] hardirqs last  enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
  [66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
  [66.441840] softirqs last  enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
  [66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60

  ========================================================================
  btrfs_remove_chunk+0x58b/0x7b0:
  __seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
  (inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
  (inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
  ========================================================================

The warning is produced by lockdep_assert_held() in
__seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled.
And "olumes.c:2994 is btrfs_device_set_bytes_used() with mutex lock
fs_info->chunk_mutex held already.

After adding some debug prints, the cause was found that many
__alloc_device() are called with NULL @fs_info (during scanning ioctl).
Inside the function, btrfs_device_data_ordered_init() is expanded to
seqcount_mutex_init().  In this scenario, its second
parameter info->chunk_mutex  is &NULL->chunk_mutex which equals
to offsetof(struct btrfs_fs_info, chunk_mutex) unexpectedly. Thus,
seqcount_mutex_init() is called in wrong way. And later
btrfs_device_get/set helpers trigger lockdep warnings.

The device and filesystem object lifetimes are different and we'd have
to synchronize initialization of the btrfs_device::data_seqcount with
the fs_info, possibly using some additional synchronization. It would
still not prevent concurrent access to the seqcount lock when it's used
for read and initialization.

Commit d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
does not mention a particular problem being fixed so revert should not
cause any harm and we'll get the lockdep warning fixed.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c8238849 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:50 +01:00
Josef Bacik 2f96e40212 btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups.  While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed918285 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-25 18:44:37 +01:00
Trond Myklebust d29b468da4 pNFS/NFSv4: Improve rejection of out-of-order layouts
If a layoutget ends up being reordered w.r.t. a layoutreturn, e.g. due
to a layoutget-on-open not knowing a priori which file to lock, then we
must assume the layout is no longer being considered valid state by the
server.
Incrementally improve our ability to reject such states by using the
cached old stateid in conjunction with the plh_barrier to try to
identify them.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-24 20:52:31 -05:00
Trond Myklebust 1bcf34fdac pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
When we're scheduling a layoutreturn, we need to ignore any further
incoming layouts with sequence ids that are going to be affected by the
layout return.

Fixes: 44ea8dfce0 ("NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-24 20:52:31 -05:00
Trond Myklebust 08bd8dbe88 pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
If the server returns a new stateid that does not match the one in our
cache, then try to return the one we hold instead of just invalidating
it on the client side. This ensures that both client and server will
agree that the stateid is invalid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-24 20:52:30 -05:00
Trond Myklebust 814b849713 pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
If the server returns a new stateid that does not match the one in our
cache, then pnfs_layout_process() will leak the layout segments returned
by pnfs_mark_layout_stateid_invalid().

Fixes: 9888d837f3 ("pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-24 20:52:30 -05:00
Jens Axboe b18032bb0a io_uring: only call io_cqring_ev_posted() if events were posted
This normally doesn't cause any extra harm, but it does mean that we'll
increment the eventfd notification count, if one has been registered
with the ring. This can confuse applications, when they see more
notifications on the eventfd side than are available in the ring.

Do the nice thing and only increment this count, if we actually posted
(or even overflowed) events.

Reported-and-tested-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:13:56 -07:00
Jens Axboe 84965ff8a8 io_uring: if we see flush on exit, cancel related tasks
Ensure we match tasks that belong to a dead or dying task as well, as we
need to reap those in addition to those belonging to the exiting task.

Cc: stable@vger.kernel.org # 5.9+
Reported-by: Josef Grieb <josef.grieb@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 18:13:56 -07:00
Linus Torvalds ef7b1a0ea8 io_uring-5.11-2021-01-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmANwKAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprSsD/9Gp1dLmjYv0EhBAKSqC3UaXCLIZujontwb
 OWEM03ssjQrg0s5MNEAvzmFdScJequVMoGay6OeLf2jXA0xl9jr71kThVuwziZAK
 otspuMV2YrgL8FHclg52Uq3b+RKJGyCjNE2HlytRVGKwp501c6rcOESkeiiSJ/gE
 92lkJCATkle8CLfiRp+zf7owbiEFvJrztpa4CLgNCLlQyYiqMK0Vnct2J29NLoXt
 fWxVOx8rTpQy1P0y+YfsMh4uxcb7MzI2bG2Cqm7G0VACe9XtzaXorTxuKTC1sQ4d
 Q1+h1eYIlTITcQ3WwZ1FZzF/OPidJHd/BKQRlP4qfCwt5f4SN4/hH9Mxhg9K3Uq2
 c51wl0RD8FPI+S8XOe0btKopnZmhIqci76jKSFkbYAxm3XzVgiBOl4huD6e5JrLl
 3VpxTNsATmMgatdAvxAb5DzoeUUVMmo4sb74mtpavXPg/i6E64joOxfcozLswXgc
 rwa83GBigXOLythGTemeySG9GlCl0MCU5MioFRO2men5+rF2Lq2MfY51GEnP+if+
 8O27/PsN+d9vuCwWHqhyDFvUIF2s6KqvDKoj3X54xKf5PpB3mISUYlamyvoVZzMw
 t3OBalUq51JAxYhxTiF7RMYKEz7pBQ2aTBy1YV3sXLEcSzgt0iskWDi8mYB5/2lc
 oIiSTapZDA==
 =Vbhq
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Still need a final cancelation fix that isn't quite done done,
  expected in the next day or two. That said, this contains:

   - Wakeup fix for IOPOLL requests

   - SQPOLL split close op handling fix

   - Ensure that any use of io_uring fd itself is marked as inflight

   - Short non-regular file read fix (Pavel)

   - Fix up bad false positive warning (Pavel)

   - SQPOLL fixes (Pavel)

   - In-flight removal fix (Pavel)"

* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
  io_uring: account io_uring internal files as REQ_F_INFLIGHT
  io_uring: fix sleeping under spin in __io_clean_op
  io_uring: fix short read retries for non-reg files
  io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
  io_uring: fix skipping disabling sqo on exec
  io_uring: fix uring_flush in exit_files() warning
  io_uring: fix false positive sqo warning on flush
  io_uring: iopoll requests should also wake task ->in_idle state
2021-01-24 12:30:14 -08:00
Linus Torvalds 5130680642 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "18 patches.

  Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
  memory-failure, and highmem), ubsan, proc, and MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add a couple more files to the Clang/LLVM section
  proc_sysctl: fix oops caused by incorrect command parameters
  powerpc/mm/highmem: use __set_pte_at() for kmap_local()
  mips/mm/highmem: use set_pte() for kmap_local()
  mm/highmem: prepare for overriding set_pte_at()
  sparc/mm/highmem: flush cache and TLB
  mm: fix page reference leak in soft_offline_page()
  ubsan: disable unsigned-overflow check for i386
  kasan, mm: fix resetting page_alloc tags for HW_TAGS
  kasan, mm: fix conflicts with init_on_alloc/free
  kasan: fix HW_TAGS boot parameters
  kasan: fix incorrect arguments passing in kasan_add_zero_shadow
  kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
  mm: fix numa stats for thp migration
  mm: memcg: fix memcg file_dirty numa stat
  mm: memcg/slab: optimize objcg stock draining
  mm: fix initialization of struct page for holes in memory layout
  x86/setup: don't remove E820_TYPE_RAM for pfn 0
2021-01-24 12:16:34 -08:00
Linus Torvalds 443d11297b Driver core fixes for 5.11-rc5
Here are some small driver core fixes for 5.11-rc5 that resolve some
 reported problems:
 	- revert of a -rc1 patch that was causing problems with some
 	  machines
 	- device link device name collision problem fix (busses only
 	  have to name devices unique to their bus, not unique to all
 	  busses)
 	- kernfs splice bugfixes to resolve firmware loading problems
 	  for Qualcomm systems.
 	- other tiny driver core fixes for minor issues reported.
 
 All of these have been in linux-next with no reported problems.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYA1qGw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymd2ACgzVEpWJddtLNz+9guU9kAIIPcNboAn2GreCle
 vgNkgCapi2ZjYtWBk8Cl
 =YNIw
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some small driver core fixes for 5.11-rc5 that resolve some
  reported problems:

   - revert of a -rc1 patch that was causing problems with some machines

   - device link device name collision problem fix (busses only have to
     name devices unique to their bus, not unique to all busses)

   - kernfs splice bugfixes to resolve firmware loading problems for
     Qualcomm systems.

   - other tiny driver core fixes for minor issues reported.

  All of these have been in linux-next with no reported problems"

* tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Fix device link device name collision
  driver core: Extend device_is_dependent()
  kernfs: wire up ->splice_read and ->splice_write
  kernfs: implement ->write_iter
  kernfs: implement ->read_iter
  Revert "driver core: Reorder devices on successful probe"
  Driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  drivers core: Free dma_range_map when driver probe failed
2021-01-24 11:05:48 -08:00
Xiaoming Ni 697edcb0e4 proc_sysctl: fix oops caused by incorrect command parameters
The process_sysctl_arg() does not check whether val is empty before
invoking strlen(val).  If the command line parameter () is incorrectly
configured and val is empty, oops is triggered.

For example:
  "hung_task_panic=1" is incorrectly written as "hung_task_panic", oops is
  triggered. The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "phram" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org>	[5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-24 10:34:53 -08:00
Linus Torvalds 4dcd3bcc20 an important signal handling patch for stable, and two small cleanup patches
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAM+XcACgkQiiy9cAdy
 T1E44gwAnonpHv2eDi6Lo6/Ev3hP7kETaIcYc3Jql61gz0rn880cNh67C5osdxVY
 hdaKycuwQOBQbHcrt6kOfpOOcnjVXb92L/THzUJcHgUQx4bUFHNhuYt8NTrQFsnd
 ke+nLVbRcAiQRVAN46d9y2YQBA64SGUrbjjGyFD0ew13ORpsGRVBQ1UjEXzUvObV
 d0rFrlVoAIZA3RAj5uxBeHnyIy3pZaviEhg0ZlL4QEqEDhcKh82kWcX4AdZFOpwI
 qrgJ1/TtJPj4bhPPCsWXwkSQSF9JETmigRCn03fRyykvdt2f4xIgvVyaqhWqmpvv
 BZL8BKzutcFw7c5FTgNvjxAH5G2xGpefdPKE+ve0AwGnL0yrWIjk6tbpCPs9BqEe
 nD2Pxgb+r9aCtXYYYgkZ59YOF50gdtJF7ffFqMWZ+OGckFSHOpLeFP8IW9Q102b9
 VRkMTZQSd0E60ClVvuaEFIVHX52bJxM0kgdtH/2DVGdARllA4QjZuQdXZuNughYi
 zKckmEyQ
 =wHrR
 -----END PGP SIGNATURE-----

Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "An important signal handling patch for stable, and two small cleanup
  patches"

* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: do not fail __smb_send_rqst if non-fatal signals are pending
  fs/cifs: Simplify bool comparison.
  fs/cifs: Assign boolean values to a bool variable
2021-01-24 09:27:14 -08:00
Jens Axboe 02a13674fa io_uring: account io_uring internal files as REQ_F_INFLIGHT
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 10:15:33 -07:00
Pavel Begunkov 9d5c819068 io_uring: fix sleeping under spin in __io_clean_op
[   27.629441] BUG: sleeping function called from invalid context
	at fs/file.c:402
[   27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0,
	pid: 1012, name: io_wqe_worker-0
[   27.633220] 1 lock held by io_wqe_worker-0/1012:
[   27.634286]  #0: ffff888105e26c98 (&ctx->completion_lock)
	{....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70
[   27.649249] Call Trace:
[   27.649874]  dump_stack+0xac/0xe3
[   27.650666]  ___might_sleep+0x284/0x2c0
[   27.651566]  put_files_struct+0xb8/0x120
[   27.652481]  __io_clean_op+0x10c/0x2a0
[   27.653362]  __io_cqring_fill_event+0x2c1/0x350
[   27.654399]  __io_req_complete.part.102+0x41/0x70
[   27.655464]  io_openat2+0x151/0x300
[   27.656297]  io_issue_sqe+0x6c/0x14e0
[   27.660991]  io_wq_submit_work+0x7f/0x240
[   27.662890]  io_worker_handle_work+0x501/0x8a0
[   27.664836]  io_wqe_worker+0x158/0x520
[   27.667726]  kthread+0x134/0x180
[   27.669641]  ret_from_fork+0x1f/0x30

Instead of cleaning files on overflow, return back overflow cancellation
into io_uring_cancel_files(). Previously it was racy to clean
REQ_F_OVERFLOW flag, but we got rid of it, and can do it through
repetitive attempts targeting all matching requests.

Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24 10:15:33 -07:00
Ronnie Sahlberg 214a5ea081 cifs: do not fail __smb_send_rqst if non-fatal signals are pending
RHBZ 1848178

The original intent of returning an error in this function
in the patch:
  "CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals even before we have had a chance to try to
send it (the reported problem could be reproduced e.g.
by exiting a child process when the parent process was in
the midst of calling futimens to update a file's timestamps).

In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.

Fixes: b30c74c73c ("CIFS: Mask off signals when sending SMB packets")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-23 01:28:20 -06:00
Linus Torvalds a9034304ff A patch to zero out sensitive cryptographic data and two minor cleanups
prompted by the fact that a bunch of code was moved in this cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmAK9skTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyGwB/9rZnaaYR6Frqc2tzE5vbVtjAxvhftn
 pGDr8laOHiK5jnKR+ljNlPAe07TSEK+qVidulX05moujKrZeIrDUJZnEpScrssZO
 o7Tm99dHziqJc10liembtSZzB3LzGJyW1hgavC5Vjo7JW+EZ+YR9x2pFKCO7Hz/M
 QlT6kQmXZLnsLB2OieAyC9Yb7IMD1wfiOHHvOZDeFpIn49Z8reFahI+dSwwK/uOv
 UouxZKKuaikSTvzp8WmTiuCpsUfBMOaDy5/pWLfBS+/116K2aieJmzSjUb2MZwDT
 cLGhzrkyeCkBeFO5vhnob3n9KqWXN03I9rPB25StcrHCRYcHa3z/D/4k
 =9COg
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to zero out sensitive cryptographic data and two minor
  cleanups prompted by the fact that a bunch of code was moved in this
  cycle"

* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
  libceph: fix "Boolean result is used in bitwise operation" warning
  libceph, ceph: disambiguate ceph_connection_operations handlers
  libceph: zero out session key and connection secret
2021-01-22 13:47:25 -08:00
Pavel Begunkov 9a173346bd io_uring: fix short read retries for non-reg files
Sockets and other non-regular files may actually expect short reads to
happen, don't retry reads for them. Because non-reg files don't set
FMODE_BUF_RASYNC and so it won't do second/retry do_read, we can filter
out those cases after first do_read() attempt with ret>0.

Cc: stable@vger.kernel.org # 5.9+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-22 12:42:54 -07:00
Jens Axboe 607ec89ed1 io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but hasn't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:

BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS:  0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lock_acquire+0x31a/0x440
 ? close_fd_get_file+0x39/0x160
 ? __lock_acquire+0x647/0x18c0
 _raw_spin_lock+0x2c/0x40
 ? close_fd_get_file+0x39/0x160
 close_fd_get_file+0x39/0x160
 io_issue_sqe+0x1334/0x14e0
 ? lock_acquire+0x31a/0x440
 ? __io_free_req+0xcf/0x2e0
 ? __io_free_req+0x175/0x2e0
 ? find_held_lock+0x28/0xb0
 ? io_wq_submit_work+0x7f/0x240
 io_wq_submit_work+0x7f/0x240
 io_wq_cancel_cb+0x161/0x580
 ? io_wqe_wake_worker+0x114/0x360
 ? io_uring_get_socket+0x40/0x40
 io_async_find_and_cancel+0x3b/0x140
 io_issue_sqe+0xbe1/0x14e0
 ? __lock_acquire+0x647/0x18c0
 ? __io_queue_sqe+0x10b/0x5f0
 __io_queue_sqe+0x10b/0x5f0
 ? io_req_prep+0xdb/0x1150
 ? mark_held_locks+0x6d/0xb0
 ? mark_held_locks+0x6d/0xb0
 ? io_queue_sqe+0x235/0x4b0
 io_queue_sqe+0x235/0x4b0
 io_submit_sqes+0xd7e/0x12a0
 ? _raw_spin_unlock_irq+0x24/0x30
 ? io_sq_thread+0x3ae/0x940
 io_sq_thread+0x207/0x940
 ? do_wait_intr_irq+0xc0/0xc0
 ? __ia32_sys_io_uring_enter+0x650/0x650
 kthread+0x134/0x180
 ? kthread_create_worker_on_cpu+0x90/0x90
 ret_from_fork+0x1f/0x30

Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.

For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.

Cc: stable@vger.kernel.org
Fixes: b5dba59e0c ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-22 12:42:54 -07:00
Linus Torvalds 9f29bd8b2e \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmAJfUoACgkQnJ2qBz9k
 QNnH+Qf/Q/e0zGxGW7Snj6Kz4VL9yHwfvQUdZPFashv+Uff1jnPkXQbnJia2mUNE
 6g4XsMTuXTF13TvNmf93MBHAmlJSicUfRNqyDx9HP0VrNy2NarIwDcN4yAoOs2cZ
 jwzwOVDJmDf/EALVv+6JUySRq/v5f4EEihtYjVEXxVNh7rZiLOBYnY5wPLvicU3h
 ButzuOnH0F4aqBKuZsanJYDIswmJ05awxy4wu4SWoyghc+KUc61pXeND1KHa4LoR
 1A4NL7OcjZkpHvaWCw8FKgEQnyTyvbi78aSSrOLcZhdT6l5jQt8xaRwz66zk9Yw2
 mxQe9YefTcvlmm8iNn3B7QxMmJmZ7g==
 =agn9
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fs and udf fixes from Jan Kara:
 "A lazytime handling fix from Eric Biggers and a fix of UDF session
  handling for large devices"

* tag 'fs_for_v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: fix the problem that the disc content is not displayed
  fs: fix lazytime expiration handling in __writeback_single_inode()
2021-01-21 11:45:40 -08:00
Christoph Hellwig f2d6c2708b kernfs: wire up ->splice_read and ->splice_write
Wire up the splice_read and splice_write methods to the default
helpers using ->read_iter and ->write_iter now that those are
implemented for kernfs.  This restores support to use splice and
sendfile on kernfs files.

Fixes: 36e2c7421f ("fs: don't allow splice read/write without explicit ops")
Reported-by: Siddharth Gupta <sidgup@codeaurora.org>
Tested-by: Siddharth Gupta <sidgup@codeaurora.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-4-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-21 18:30:28 +01:00
Christoph Hellwig cc099e0b39 kernfs: implement ->write_iter
Switch kernfs to implement the write_iter method instead of plain old
write to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-21 18:30:28 +01:00
Christoph Hellwig 4eaad21a6a kernfs: implement ->read_iter
Switch kernfs to implement the read_iter method instead of plain old
read to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210120204631.274206-2-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-21 18:30:28 +01:00
Linus Torvalds 9791581c04 for-5.11-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAIojwACgkQxWXV+ddt
 WDstnw/+O0KSsK6ChZCNdjqFAgWL41RYj0fPOgM/8xlNaQyYHS0Jczeoud6m/2Wm
 U41kTb/a6xpmx0Z/2uf/5pDIBPFld/IUuUf/AdJsMzy8Bpky2/sfg6Kmx0tKGLXQ
 1WKp9ox0MlAUI0Tz/jGfX5rwsIgWKYKIF2iGUio/H1ktR3l+cXlmLWsSIB43F6VL
 AjKRRyFCNU//dV7syNMmmj9yU0HpSs53SpWxUIURuTFaE71LyUgzaxDTlZ6c/PET
 e4wdf8nl0wzEESCgSUPdh2AWNNiTEbbGhhhNi9250PUyQki2f4AGBlxVSLZH/fDn
 6PbBDvefW4umCMeMxxmgnYJU6tG78qg/LvxzZXt54rOtB0WMbrIl0u7hFCVhQ3Qk
 nqrS4tqeaL+OeuR6xamBMaRohgRFa9S+QVjTwtDFo/oVYH4TVvQDfKQS6GsWwDvB
 ySzz3WewoFqhe47cMsy28Dg49xkDSIJIr5hmSNGSXTreZ2JIa+qLKywoH87+YDIE
 ql0PN47z4NB+MbWDV7SZM8DCVqiQ7+1LOV9bPmqfvNl3YTfvXyMaoPLmWWVstPr2
 iyhXrvESgm1s2RCF1a0tXIkv82L6QYjJ3eeEDsvAmtKBouNL9BnMvwi3zW5yKiry
 m1qj7C7e6C1TivYitcCfbRCKqeAnUv8VwkSbW9BvNDe7i5AD++U=
 =gSYr
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more one line fixes for various bugs, stable material.

   - fix send when emitting clone operation from the same file and root

   - fix double free on error when cleaning backrefs

   - lockdep fix during relocation

   - handle potential error during reloc when starting transaction

   - skip running delayed refs during commit (leftover from code removal
     in this dev cycle)"

* tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't clear ret in btrfs_start_dirty_block_groups
  btrfs: fix lockdep splat in btrfs_recover_relocation
  btrfs: do not double free backref nodes on error
  btrfs: don't get an EINTR during drop_snapshot for reloc
  btrfs: send: fix invalid clone operations when cloning from the same file and root
  btrfs: no need to run delayed refs after commit_fs_roots during commit
2021-01-20 14:15:33 -08:00
Takashi Iwai db58465f11 cachefiles: Drop superfluous readpages aops NULL check
After the recent actions to convert readpages aops to readahead, the
NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
hit falsely.  More badly, it's an ASSERT() call, and this panics.

Drop the superfluous NULL checks for fixing this regression.

[DH: Note that cachefiles never actually used readpages, so this check was
 never actually necessary]

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883
BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245
Fixes: 9ae326a690 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-20 11:33:51 -08:00
Linus Torvalds f419f031de Fixes:
- Avoid exposing parent of root directory in NFSv3 READDIRPLUS results
 - Fix a tracepoint change that went in the initial 5.11 merge
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl//AVMACgkQM2qzM29m
 f5dTWg//c2prRAhE1V9fwIDczJn8MM0tXljoWWgSpuslbd8Bgv0Ss8mitvr4B3pO
 JhzBdWcTb2/3j2D52LbLOGjr0z6BCvXX1Gp0QUnC96lhaNBC5aby309xQpSkhPbQ
 j2jw3CImGbiH7YY2BGsjcUx5mfpIkJMbg7rPSSOVHufIiUZLCg98Y3JJJKrJk+78
 qGFgaAHqBLzLK96F7Sz9q8du5lsiCbpLgx+qWjpaJEfJ0XWbEe2jA/uakrb1OzoD
 OkpG8RjZiJFAhWGdnR8y7eJQ7FyIi8h7BYAr4AlE97YZRZdjDqyummshJkKKVG2f
 5u4B225cKkcmVfLQem7Ym+nVFneR7/WLy00O12v08d0s54RLDp4xjdKplgLnHdwB
 AJg+l6K/AN24UtyE1OUIuOKJsLZd+DSANYNzZrCjeF8o6LKsKSUGrRtRbNVmtyJH
 qBYXR3gXrNt9lWYU+i/4OfJIVfksWjjyRk2/ww83INi5KxixuL0w8BcMpaTC1qQg
 ds+rmvosLvtfnY2k0wdScYbQZHoFvf+qJHRDhOVq4lWgpooExOMXKUry6k5AVOd4
 EchDX870Qe6wc4uT8xafmizD6hdJXCDN0rTGTuGnMoksoBZ7uCCsyyztbfNGiFMC
 i+0wCIWkHU3LgfHQMmTJ3J6e8mgTWPD3pTOJU5xoizQnTHGoTho=
 =Qf5O
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Avoid exposing parent of root directory in NFSv3 READDIRPLUS results

 - Fix a tracepoint change that went in the initial 5.11 merge

* tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Move the svc_xdr_recvfrom tracepoint again
  nfsd4: readdirplus shouldn't return parent of export
2021-01-19 13:01:50 -08:00
Josef Bacik 34d1eb0e59 btrfs: don't clear ret in btrfs_start_dirty_block_groups
If we fail to update a block group item in the loop we'll break, however
we'll do btrfs_run_delayed_refs and lose our error value in ret, and
thus not clean up properly.  Fix this by only running the delayed refs
if there was no failure.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 16:00:11 +01:00
Josef Bacik fb28610097 btrfs: fix lockdep splat in btrfs_recover_relocation
While testing the error paths of relocation I hit the following lockdep
splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.10.0-rc6+ #217 Not tainted
  ------------------------------------------------------
  mount/779 is trying to acquire lock:
  ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340

  but task is already holding lock:
  ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (btrfs-root-00){++++}-{3:3}:
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x27/0x100
	 btrfs_read_lock_root_node+0x31/0x40
	 btrfs_search_slot+0x462/0x8f0
	 btrfs_update_root+0x55/0x2b0
	 btrfs_drop_snapshot+0x398/0x750
	 clean_dirty_subvols+0xdf/0x120
	 btrfs_recover_relocation+0x534/0x5a0
	 btrfs_start_pre_rw_mount+0xcb/0x170
	 open_ctree+0x151f/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (sb_internal#2){.+.+}-{0:0}:
	 start_transaction+0x444/0x700
	 insert_balance_item.isra.0+0x37/0x320
	 btrfs_balance+0x354/0xf40
	 btrfs_ioctl_balance+0x2cf/0x380
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1120/0x1e10
	 lock_acquire+0x116/0x370
	 __mutex_lock+0x7e/0x7b0
	 btrfs_recover_balance+0x2f0/0x340
	 open_ctree+0x1095/0x1726
	 btrfs_mount_root.cold+0x12/0xea
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 vfs_kern_mount.part.0+0x71/0xb0
	 btrfs_mount+0x10d/0x380
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x433/0xc10
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(sb_internal#2);
				 lock(btrfs-root-00);
    lock(&fs_info->balance_mutex);

   *** DEADLOCK ***

  2 locks held by mount/779:
   #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100

  stack backtrace:
  CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   ? trace_call_bpf+0x139/0x260
   __lock_acquire+0x1120/0x1e10
   lock_acquire+0x116/0x370
   ? btrfs_recover_balance+0x2f0/0x340
   __mutex_lock+0x7e/0x7b0
   ? btrfs_recover_balance+0x2f0/0x340
   ? btrfs_recover_balance+0x2f0/0x340
   ? rcu_read_lock_sched_held+0x3f/0x80
   ? kmem_cache_alloc_trace+0x2c4/0x2f0
   ? btrfs_get_64+0x5e/0x100
   btrfs_recover_balance+0x2f0/0x340
   open_ctree+0x1095/0x1726
   btrfs_mount_root.cold+0x12/0xea
   ? rcu_read_lock_sched_held+0x3f/0x80
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   vfs_kern_mount.part.0+0x71/0xb0
   btrfs_mount+0x10d/0x380
   ? __kmalloc_track_caller+0x2f2/0x320
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   ? capable+0x3a/0x60
   path_mount+0x433/0xc10
   __x64_sys_mount+0xe3/0x120
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is straightforward to fix, simply release the path before we setup
the balance_ctl.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:59 +01:00
Josef Bacik 49ecc679ab btrfs: do not double free backref nodes on error
Zygo reported the following KASAN splat:

  BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
  Read of size 8 at addr ffff888112402950 by task btrfs/28836

  CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  Call Trace:
   dump_stack+0xbc/0xf9
   ? btrfs_backref_cleanup_node+0x18a/0x420
   print_address_description.constprop.8+0x21/0x210
   ? record_print_text.cold.34+0x11/0x11
   ? btrfs_backref_cleanup_node+0x18a/0x420
   ? btrfs_backref_cleanup_node+0x18a/0x420
   kasan_report.cold.10+0x20/0x37
   ? btrfs_backref_cleanup_node+0x18a/0x420
   __asan_load8+0x69/0x90
   btrfs_backref_cleanup_node+0x18a/0x420
   btrfs_backref_release_cache+0x83/0x1b0
   relocate_block_group+0x394/0x780
   ? merge_reloc_roots+0x4a0/0x4a0
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   ? check_flags.part.50+0x6c/0x1e0
   ? btrfs_relocate_chunk+0x120/0x120
   ? kmem_cache_alloc_trace+0xa06/0xcb0
   ? _copy_from_user+0x83/0xc0
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   ? __kasan_check_read+0x11/0x20
   ? check_chain_key+0x1f4/0x2f0
   ? __asan_loadN+0xf/0x20
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? kvm_sched_clock_read+0x18/0x30
   ? check_chain_key+0x1f4/0x2f0
   ? lock_downgrade+0x3f0/0x3f0
   ? handle_mm_fault+0xad6/0x2150
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags.part.50+0x6c/0x1e0
   ? check_flags+0x26/0x30
   ? lock_is_held_type+0xc3/0xf0
   ? syscall_enter_from_user_mode+0x1b/0x60
   ? do_syscall_64+0x13/0x80
   ? rcu_read_lock_sched_held+0xa1/0xd0
   ? __kasan_check_read+0x11/0x20
   ? __fget_light+0xae/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f4c4bdfe427

  Allocated by task 28836:
   kasan_save_stack+0x21/0x50
   __kasan_kmalloc.constprop.18+0xbe/0xd0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x410/0xcb0
   btrfs_backref_alloc_node+0x46/0xf0
   btrfs_backref_add_tree_node+0x60d/0x11d0
   build_backref_tree+0xc5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 28836:
   kasan_save_stack+0x21/0x50
   kasan_set_track+0x20/0x30
   kasan_set_free_info+0x1f/0x30
   __kasan_slab_free+0xf3/0x140
   kasan_slab_free+0xe/0x10
   kfree+0xde/0x200
   btrfs_backref_error_cleanup+0x452/0x530
   build_backref_tree+0x1a5/0x700
   relocate_tree_blocks+0x2be/0xb90
   relocate_block_group+0x2eb/0x780
   btrfs_relocate_block_group+0x26e/0x4c0
   btrfs_relocate_chunk+0x52/0x120
   btrfs_balance+0xe2e/0x1900
   btrfs_ioctl_balance+0x3a7/0x460
   btrfs_ioctl+0x24c8/0x4360
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache().  This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up.  However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache.  Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.

Fixes: 75bfb9aff4 ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:53 +01:00
Josef Bacik 18d3bff411 btrfs: don't get an EINTR during drop_snapshot for reloc
This was partially fixed by f3e3d9cc35 ("btrfs: avoid possible signal
interruption of btrfs_drop_snapshot() on relocation tree"), however it
missed a spot when we restart a trans handle because we need to end the
transaction.  The fix is the same, simply use btrfs_join_transaction()
instead of btrfs_start_transaction() when deleting reloc roots.

Fixes: f3e3d9cc35 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-18 15:44:47 +01:00
lianzhi chang 5cdc4a6950 udf: fix the problem that the disc content is not displayed
When the capacity of the disc is too large (assuming the 4.7G
specification), the disc (UDF file system) will be burned
multiple times in the windows (Multisession Usage). When the
remaining capacity of the CD is less than 300M (estimated
value, for reference only), open the CD in the Linux system,
the content of the CD is displayed as blank (the kernel will
say "No VRS found"). Windows can display the contents of the
CD normally.
Through analysis, in the "fs/udf/super.c": udf_check_vsd
function, the actual value of VSD_MAX_SECTOR_OFFSET may
be much larger than 0x800000. According to the current code
logic, it is found that the type of sbi->s_session is "__s32",
 when the remaining capacity of the disc is less than 300M
(take a set of test values: sector=3154903040,
sbi->s_session=1540464, sb->s_blocksize_bits=11 ), the
calculation result of "sbi->s_session << sb->s_blocksize_bits"
 will overflow. Therefore, it is necessary to convert the
type of s_session to "loff_t" (when udf_check_vsd starts,
assign a value to _sector, which is also converted in this
way), so that the result will not overflow, and then the
content of the disc can be displayed normally.

Link: https://lore.kernel.org/r/20210114075741.30448-1-changlianzhi@uniontech.com
Signed-off-by: lianzhi chang <changlianzhi@uniontech.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-18 12:06:33 +01:00
Jiapeng Zhong 16a78851e1 fs/cifs: Simplify bool comparison.
Fix the follow warnings:

./fs/cifs/connect.c: WARNING: Comparison of 0/1 to bool variable

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-17 19:22:24 -06:00
Jiapeng Zhong 2be449fcf3 fs/cifs: Assign boolean values to a bool variable
Fix the following coccicheck warnings:

./fs/cifs/connect.c:3386:2-21: WARNING: Assignment of 0/1 to
bool variable.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-17 19:22:18 -06:00
Linus Torvalds a527a2b32d Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Several assorted fixes.

  I still think that audit ->d_name race is better fixed this way for
  the benefit of backports, with any possibly fancier variants done on
  top of it"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dump_common_audit_data(): fix racy accesses to ->d_name
  iov_iter: fix the uaccess area in copy_compat_iovec_from_user
  umount(2): move the flag validity checks first
2021-01-17 12:16:47 -08:00
Pavel Begunkov 0b5cd6c32b io_uring: fix skipping disabling sqo on exec
If there are no requests at the time __io_uring_task_cancel() is called,
tctx_inflight() returns zero and and it terminates not getting a chance
to go through __io_uring_files_cancel() and do
io_disable_sqo_submit(). And we absolutely want them disabled by the
time cancellation ends.

Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-16 21:02:49 -07:00
Pavel Begunkov 4325cb498c io_uring: fix uring_flush in exit_files() warning
WARNING: CPU: 1 PID: 11100 at fs/io_uring.c:9096
	io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
RIP: 0010:io_uring_flush+0x326/0x3a0 fs/io_uring.c:9096
Call Trace:
 filp_close+0xb4/0x170 fs/open.c:1280
 close_files fs/file.c:401 [inline]
 put_files_struct fs/file.c:416 [inline]
 put_files_struct+0x1cc/0x350 fs/file.c:413
 exit_files+0x7e/0xa0 fs/file.c:433
 do_exit+0xc22/0x2ae0 kernel/exit.c:820
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x3e9/0x20a0 kernel/signal.c:2770
 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

An SQPOLL ring creator task may have gotten rid of its file note during
exit and called io_disable_sqo_submit(), but the io_uring is still left
referenced through fdtable, which will be put during close_files() and
cause a false positive warning.

First split the warning into two for more clarity when is hit, and the
add sqo_dead check to handle the described case.

Reported-by: syzbot+a32b546d58dde07875a1@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-16 12:14:02 -07:00
Pavel Begunkov 6b393a1ff1 io_uring: fix false positive sqo warning on flush
WARNING: CPU: 1 PID: 9094 at fs/io_uring.c:8884
	io_disable_sqo_submit+0x106/0x130 fs/io_uring.c:8884
Call Trace:
 io_uring_flush+0x28b/0x3a0 fs/io_uring.c:9099
 filp_close+0xb4/0x170 fs/open.c:1280
 close_fd+0x5c/0x80 fs/file.c:626
 __do_sys_close fs/open.c:1299 [inline]
 __se_sys_close fs/open.c:1297 [inline]
 __x64_sys_close+0x2f/0xa0 fs/open.c:1297
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_uring's final close() may be triggered by any task not only the
creator. It's well handled by io_uring_flush() including SQPOLL case,
though a warning in io_disable_sqo_submit() will fallaciously fire by
moving this warning out to the only call site that matters.

Reported-by: syzbot+2f5d1785dc624932da78@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-16 12:14:02 -07:00
Jens Axboe c93cc9e16d io_uring: iopoll requests should also wake task ->in_idle state
If we're freeing/finishing iopoll requests, ensure we check if the task
is in idling in terms of cancelation. Otherwise we could end up waiting
forever in __io_uring_task_cancel() if the task has active iopoll
requests that need cancelation.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-16 12:13:59 -07:00
Linus Torvalds 11c0239ae2 io_uring-5.11-2021-01-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmADKaIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnjVD/402IBhG40EoLMdcztTbOF7S5JZLbECbWam
 HPN99D0YyjRQNwsd7CPzeO+7W4zO4P0T/NcnXLF1TmPnkRynDNsT4oygcLrJoH8n
 xZoAXi6ykM/4eE+4NSaNfWF/f3DtGepsIHMpSNNyzE+qxz4I36/EPTu7M13ga1A4
 H6pAbvU1DrDZ+z62rhWcb+4ouUV7QDS02UTNo9cW4ueeMNLyW/Zxp9lBzNuAkbzF
 gRlsxkhXLgcqtA36EdRWKs9fOwcdTyiMGzS5DRRmmE/O/a3//yl0ENFTPmWd6+jf
 daZcAYW6BdB+0va1sYGbECl9zOkMlhcm/hNFnl3ZmFIW5a4QiRy2WLoWuiNwVSYM
 nGCz0l5P2pP32tt0fQlSQh8KdI+py+c7MhJ1IaRhJi8X4t1MHoOR3vlilw0UvrdQ
 zpm8H+hMmL3g9/6pZLwxBvd31N1Y/A8ghh6qTdQ2EwdPyjImR3Xv/wYUqdbfvBo5
 VcB6ccoFwuP5KzQcyI2j2WRcx7AtHZuiVGvoqkA8QL9Onq3GWoY8tnkf/sP8RxlQ
 wL91wVkT46UMkCLeY7QCGdx4nnmSahCEa8SQEYxibnh++9jh+c6Y5FtZc3YNjnED
 GRFT9vJmqNmrvKCLo78iTf/F8vJllGXQRKDeMuU9DVj3ylhZOBSEhpmMIZwdXNQa
 CqMCqia4Xg==
 =XL14
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "We still have a pending fix for a cancelation issue, but it's still
  being investigated. In the meantime:

   - Dead mm handling fix (Pavel)

   - SQPOLL setup error handling (Pavel)

   - Flush timeout sequence fix (Marcelo)

   - Missing finish_wait() for one exit case"

* tag 'io_uring-5.11-2021-01-16' of git://git.kernel.dk/linux-block:
  io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()
  io_uring: flush timeouts that should already have expired
  io_uring: do sqo disable on install_fd error
  io_uring: fix null-deref in io_disable_sqo_submit
  io_uring: don't take files/mm for a dead task
  io_uring: drop mm and files after task_work_run
2021-01-16 11:12:02 -08:00
Linus Torvalds 9348b73c2e mm: don't play games with pinned pages in clear_page_refs
Turning a pinned page read-only breaks the pinning after COW.  Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-16 10:51:26 -08:00
Linus Torvalds 29a951dfb3 mm: fix clear_refs_write locking
Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back.  Just get the write lock.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-16 10:46:39 -08:00
Jens Axboe a8d13dbccb io_uring: ensure finish_wait() is always called in __io_uring_task_cancel()
If we enter with requests pending and performm cancelations, we'll have
a different inflight count before and after calling prepare_to_wait().
This causes the loop to restart. If we actually ended up canceling
everything, or everything completed in-between, then we'll break out
of the loop without calling finish_wait() on the waitqueue. This can
trigger a warning on exit_signals(), as we leave the task state in
TASK_UNINTERRUPTIBLE.

Put a finish_wait() after the loop to catch that case.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-15 16:04:23 -07:00
Linus Torvalds 0bc9bc1d8b A number of bug fixes for ext4:
* For the new fast_commit feature
    * Fix some error handling codepaths in whiteout handling and
      mountpoint sampling
    * Fix how we write ext4_error information so it goes through the journal
      when journalling is active, to avoid races that can lead to lost
      error information, superblock checksum failures, or DIF/DIX features.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmAB8eMACgkQ8vlZVpUN
 gaMUxAf+MW22dceTto2RO0ox9OEBNoZDFiVnlEuUaIOxkqOlovIWaqX7wwuF/121
 +FaNeDVzqNSS/QjQSB5lHF5OfHCD2u1Ef/bGzCm9cQyeN2/n0sCsStfPCcyLHy/0
 4R8PsjF0xhhbCETLcAc0U/YBFEoqSn1i7DG5nnpx63Wt1S/SSMmTAXzafWbzisEZ
 XNsz3CEPCDDSmSzOt3qMMHxkSoOZhYcLe7fCoKkhZ2pvTyrQsHrne6NNLtxc+sDL
 AcKkaI0EWFiFRhebowQO/5ouq6nnGKLCsukuZN9//Br8ht5gNcFpuKNVFl+LOiM6
 ud4H3qcRokcdPPAn3uwI0AJKFXqLvg==
 =Dgdj
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "A number of bug fixes for ext4:

   - Fix for the new fast_commit feature

   - Fix some error handling codepaths in whiteout handling and
     mountpoint sampling

   - Fix how we write ext4_error information so it goes through the
     journal when journalling is active, to avoid races that can lead to
     lost error information, superblock checksum failures, or DIF/DIX
     features"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: remove expensive flush on fast commit
  ext4: fix bug for rename with RENAME_WHITEOUT
  ext4: fix wrong list_splice in ext4_fc_cleanup
  ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
  ext4: don't leak old mountpoint samples
  ext4: drop ext4_handle_dirty_super()
  ext4: fix superblock checksum failure when setting password salt
  ext4: use sbi instead of EXT4_SB(sb) in ext4_update_super()
  ext4: save error info to sb through journal if available
  ext4: protect superblock modifications with a buffer lock
  ext4: drop sync argument of ext4_commit_super()
  ext4: combine ext4_handle_error() and save_error_info()
2021-01-15 14:54:24 -08:00
Linus Torvalds 7cd3c41261 2 small cifs fixes for stable (including an important handle leak fix) and 3 small cleanup patches
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmABLeEACgkQiiy9cAdy
 T1G5VAv/ZqpBJp6A/SOFGKlSBVTUqf++AWo/DsTZQLx938j6URChazImFb3eJ3FU
 WvqeUW9kza1OICVg0Wf3II2RCMDwJBqxfSCsj5qK53g9By4X1tsiatQZd9rWfBSc
 AW8nw5hzjrZ2e3//4gieSXZOHfT6n80hteAlIBNWvjN7w6R1DvZ7cIMkS6+M4EEn
 RBfcQlUTAihxFC8/RUqXMK8+jIXvt3hiRdI4gYV5GuAD8C7CvLKZJc0n6WZ0hPR5
 IPlrY5j2J8yB+R+FkJ1Qjpzz7KrNCpkOyB0DPoHHlqRAT46oM8yfGGf2wlVr6v3A
 dG0DPQ5JAqB3vK9/3frJencXYRDCgLc4ZVwdFGrvv/wNZd29KBKiigfWIrxjnM13
 tcN9Noy5mX791YvdYKpvijTSJqWslB04gLy2HuEkOAyOKo2sqiEtemB4bRPeJr1E
 cmheItvkOhNY13KZKyrTgnUy61H9eLpjhS+CJZl3dArhaPNtI+4OGuAv+BIQQNNf
 FbNvJWmQ
 =7ueC
 -----END PGP SIGNATURE-----

Merge tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two small cifs fixes for stable (including an important handle leak
  fix) and three small cleanup patches"

* tag '5.11-rc3-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: style: replace one-element array with flexible-array
  cifs: connect: style: Simplify bool comparison
  fs: cifs: remove unneeded variable in smb3_fs_context_dup
  cifs: fix interrupted close commands
  cifs: check pointer before freeing
2021-01-15 14:39:21 -08:00
Daejun Park e9f53353e1 ext4: remove expensive flush on fast commit
In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast
commit block when barrier is enabled.  However, in recovery phase,
ext4 compares CRC value in the tail.  So it is sufficient to add
REQ_FUA and REQ_PREFLUSH on the block that has tail.

Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15 14:41:31 -05:00
yangerkun 6b4b8e6b4a ext4: fix bug for rename with RENAME_WHITEOUT
We got a "deleted inode referenced" warning cross our fsstress test. The
bug can be reproduced easily with following steps:

  cd /dev/shm
  mkdir test/
  fallocate -l 128M img
  mkfs.ext4 -b 1024 img
  mount img test/
  dd if=/dev/zero of=test/foo bs=1M count=128
  mkdir test/dir/ && cd test/dir/
  for ((i=0;i<1000;i++)); do touch file$i; done # consume all block
  cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD,
    /dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in
    ext4_rename will return ENOSPC!!
  cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1
  We will get the output:
  "ls: cannot access 'test/dir/file1': Structure needs cleaning"
  and the dmesg show:
  "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
  deleted inode referenced: 139"

ext4_rename will create a special inode for whiteout and use this 'ino'
to replace the source file's dir entry 'ino'. Once error happens
latter(the error above was the ENOSPC return from ext4_add_entry in
ext4_rename since all space has been consumed), the cleanup do drop the
nlink for whiteout, but forget to restore 'ino' with source file. This
will trigger the bug describle as above.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: cd808deced ("ext4: support RENAME_WHITEOUT")
Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2021-01-15 14:41:31 -05:00
Daejun Park 31e203e09f ext4: fix wrong list_splice in ext4_fc_cleanup
After full/fast commit, entries in staging queue are promoted to main
queue. In ext4_fs_cleanup function, it splice to staging queue to
staging queue.

Fixes: aa75f4d3da ("ext4: main fast-commit commit path")
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2021-01-15 14:40:12 -05:00
Yi Li 23dd561ad9 ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
instead of IS_ERR_OR_NULL to fix this.

2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
and go to call iput properly.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Yi Li <yili@winhong.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2021-01-15 14:39:14 -05:00
Marcelo Diop-Gonzalez f010505b78 io_uring: flush timeouts that should already have expired
Right now io_flush_timeouts() checks if the current number of events
is equal to ->timeout.target_seq, but this will miss some timeouts if
there have been more than 1 event added since the last time they were
flushed (possible in io_submit_flush_completions(), for example). Fix
it by recording the last sequence at which timeouts were flushed so
that the number of events seen can be compared to the number of events
needed without overflow.

Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-15 10:02:28 -07:00
YANG LI e54fd0716c cifs: style: replace one-element array with flexible-array
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use "flexible array members"[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/
    deprecated.html#zero-length-and-one-element-arrays

Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-13 13:36:45 -06:00
YANG LI ed6b1920f8 cifs: connect: style: Simplify bool comparison
Fix the following coccicheck warning:
./fs/cifs/connect.c:3740:6-21: WARNING: Comparison of 0/1 to bool
variable

Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot<abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-13 12:55:40 -06:00
Menglong Dong c13e7af042 fs: cifs: remove unneeded variable in smb3_fs_context_dup
'rc' in smb3_fs_context_dup is not used and can be removed.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-13 12:55:37 -06:00
Paulo Alcantara 2659d3bff3 cifs: fix interrupted close commands
Retry close command if it gets interrupted to not leak open handles on
the server.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reported-by: Duncan Findlay <duncf@duncf.ca>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Fixes: 6988a619f5 ("cifs: allow syscalls to be restarted in __smb_send_rqst()")
Cc: stable@vger.kernel.org
Reviewd-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-13 12:55:33 -06:00
Tom Rix 77b6ec01c2 cifs: check pointer before freeing
clang static analysis reports this problem

dfs_cache.c:591:2: warning: Argument to kfree() is a constant address
  (18446744073709551614), which is not memory allocated by malloc()
        kfree(vi);
        ^~~~~~~~~

In dfs_cache_del_vol() the volume info pointer 'vi' being freed
is the return of a call to find_vol().  The large constant address
is find_vol() returning an error.

Add an error check to dfs_cache_del_vol() similar to the one done
in dfs_cache_update_vol().

Fixes: 54be1f6c1c ("cifs: Add DFS cache routines")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
CC: <stable@vger.kernel.org> # v5.0+
Signed-off-by: Steve French <stfrench@microsoft.com>
2021-01-13 12:55:29 -06:00
Eric Biggers 1e249cb5b7 fs: fix lazytime expiration handling in __writeback_single_inode()
When lazytime is enabled and an inode is being written due to its
in-memory updated timestamps having expired, either due to a sync() or
syncfs() system call or due to dirtytime_expire_interval having elapsed,
the VFS needs to inform the filesystem so that the filesystem can copy
the inode's timestamps out to the on-disk data structures.

This is done by __writeback_single_inode() calling
mark_inode_dirty_sync(), which then calls ->dirty_inode(I_DIRTY_SYNC).

However, this occurs after __writeback_single_inode() has already
cleared the dirty flags from ->i_state.  This causes two bugs:

- mark_inode_dirty_sync() redirties the inode, causing it to remain
  dirty.  This wastefully causes the inode to be written twice.  But
  more importantly, it breaks cases where sync_filesystem() is expected
  to clean dirty inodes.  This includes the FS_IOC_REMOVE_ENCRYPTION_KEY
  ioctl (as reported at
  https://lore.kernel.org/r/20200306004555.GB225345@gmail.com), as well
  as possibly filesystem freezing (freeze_super()).

- Since ->i_state doesn't contain I_DIRTY_TIME when ->dirty_inode() is
  called from __writeback_single_inode() for lazytime expiration,
  xfs_fs_dirty_inode() ignores the notification.  (XFS only cares about
  lazytime expirations, and it assumes that i_state will contain
  I_DIRTY_TIME during those.)  Therefore, lazy timestamps aren't
  persisted by sync(), syncfs(), or dirtytime_expire_interval on XFS.

Fix this by moving the call to mark_inode_dirty_sync() to earlier in
__writeback_single_inode(), before the dirty flags are cleared from
i_state.  This makes filesystems be properly notified of the timestamp
expiration, and it avoids incorrectly redirtying the inode.

This fixes xfstest generic/580 (which tests
FS_IOC_REMOVE_ENCRYPTION_KEY) when run on ext4 or f2fs with lazytime
enabled.  It also fixes the new lazytime xfstest I've proposed, which
reproduces the above-mentioned XFS bug
(https://lore.kernel.org/r/20210105005818.92978-1-ebiggers@kernel.org).

Alternatively, we could call ->dirty_inode(I_DIRTY_SYNC) directly.  But
due to the introduction of I_SYNC_QUEUED, mark_inode_dirty_sync() is the
right thing to do because mark_inode_dirty_sync() now knows not to move
the inode to a writeback list if it is currently queued for sync.

Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Cc: stable@vger.kernel.org
Depends-on: 5afced3bf2 ("writeback: Avoid skipping inode writeback")
Link: https://lore.kernel.org/r/20210112190253.64307-2-ebiggers@kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-01-13 17:26:21 +01:00
Pavel Begunkov 06585c497b io_uring: do sqo disable on install_fd error
WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717
	io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717
Call Trace:
 io_uring_release+0x3e/0x50 fs/io_uring.c:8759
 __fput+0x283/0x920 fs/file_table.c:280
 task_work_run+0xdd/0x190 kernel/task_work.c:140
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
 exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

failed io_uring_install_fd() is a special case, we don't do
io_ring_ctx_wait_and_kill() directly but defer it to fput, though still
need to io_disable_sqo_submit() before.

note: it doesn't fix any real problem, just a warning. That's because
sqring won't be available to the userspace in this case and so SQPOLL
won't submit anything.

Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-13 08:29:17 -07:00
Pavel Begunkov b4411616c2 io_uring: fix null-deref in io_disable_sqo_submit
general protection fault, probably for non-canonical address
	0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref
	in range [0x0000000000000110-0x0000000000000117]
RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline]
RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891
Call Trace:
 io_uring_create fs/io_uring.c:9711 [inline]
 io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_disable_sqo_submit() might be called before user rings were
allocated, don't do io_ring_set_wakeup_flag() in those cases.

Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com
Fixes: d9d05217cb ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-13 08:29:17 -07:00
Linus Torvalds e609571b5f NFS client bugfixes for Linux 5.11
Highlights include:
 
 Bugfixes:
 - Fix parsing of link-local IPv6 addresses
 - Fix confusing logging of mount errors that was introduced by the
   fsopen() patchset.
 - Fix a tracing use after free in _nfs4_do_setlk()
 - Layout return-on-close fixes when called from nfs4_evict_inode()
 - Layout segments were being leaked in pnfs_generic_clear_request_commit()
 - Don't leak DS commits in pnfs_generic_retry_commit()
 - Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
   calls iput() on an inode after the super block has gone away.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/806IACgkQZwvnipYK
 APKIZA/+L+LvkMXflS9TQGGccpOPw+BBW5ixi2DabFYLqHz6WXNnIcUStU0NtF3q
 uHM2YrJT0XtWtQ8W6fWcsfdeS/1ixciXDS/5RH/o2e+fFMNg1lPWAeOc4brQSDFd
 DYEc7lSqw0D/pX8vY4dFIrpQorU2hnasjMK582JU7mDYXveRMLB/Bhcq9qBP2XgQ
 LVUpnHU/3dayvFGmr/sPzzZk/rIEfPaHU/J0YLbPfrEGFOo/mZKqstfS4ZkINAWp
 0yRD90s1hWTfRcxAiDaUoYPoxEw5AYjdbwC82owOaEa0zNWA2U7tD94UeVS51JCJ
 DtCn81znWaF4jVzes4VGzPlWirYoumthJwrKpKh04tEwo0a4V4AtsOAg2IbxfE/O
 CYsfwjwikzW4nOEerv22zOHICLNd2IP65kHAACaN0NVhS7dlLSuckwnMILdstD2Z
 x0LHxFhyRQe5c7bf6W6Jal2E/ThyD2qaUmSIxWweTq93OldD0mTLGHO7e2/chXwP
 3xkcuZLpU6bmg9QzmylWZWBB3ncDtC95VlRv/IV29mbN3a8XjJaugSOAwjx14JNT
 OFlJtLav2pvCwFLUutvgAMSgbshhfkwdUoUUHrcabXNL/4QBeeZB/pp9Ytr3NoBT
 xxC6nmB/Af7FtRnTrTpOSlH9s1NEB3JN4uMNx4kAKC+ZLySdMPQ=
 =08H3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

   - Fix parsing of link-local IPv6 addresses

   - Fix confusing logging of mount errors that was introduced by the
     fsopen() patchset.

   - Fix a tracing use after free in _nfs4_do_setlk()

   - Layout return-on-close fixes when called from nfs4_evict_inode()

   - Layout segments were being leaked in
     pnfs_generic_clear_request_commit()

   - Don't leak DS commits in pnfs_generic_retry_commit()

   - Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
     calls iput() on an inode after the super block has gone away"

* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: nfs_igrab_and_active must first reference the superblock
  NFS: nfs_delegation_find_inode_server must first reference the superblock
  NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
  NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
  NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
  pNFS: Stricter ordering of layoutget and layoutreturn
  pNFS: Clean up pnfs_layoutreturn_free_lsegs()
  pNFS: We want return-on-close to complete when evicting the inode
  pNFS: Mark layout for return if return-on-close was not sent
  net: sunrpc: interpret the return value of kstrtou32 correctly
  NFS: Adjust fs_context error logging
  NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
2021-01-12 09:38:53 -08:00
Filipe Manana 518837e650 btrfs: send: fix invalid clone operations when cloning from the same file and root
When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with EINVAL, when attempting
to execute the clone operation.

Example reproducer:

  $ cat test-send-clone.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  # Create our test file with a single and large extent (1M) and with
  # different content for different file ranges that will be reflinked
  # later.
  xfs_io -f \
         -c "pwrite -S 0xab 0 128K" \
         -c "pwrite -S 0xcd 128K 128K" \
         -c "pwrite -S 0xef 256K 256K" \
         -c "pwrite -S 0x1a 512K 512K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send -f /tmp/snap1.send $MNT/snap1

  # Now do a series of changes to our file such that we end up with
  # different parts of the extent reflinked into different file offsets
  # and we overwrite a large part of the extent too, so no file extent
  # items refer to that part that was overwritten. This used to confuse
  # the algorithm used by the kernel to figure out which file ranges to
  # clone, making it attempt to clone from a source range starting at
  # the current eof of the file, resulting in the receiver to fail since
  # it is an invalid clone operation.
  #
  xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
         -c "reflink $MNT/foobar 0K 512K 256K" \
         -c "reflink $MNT/foobar 512K 128K 256K" \
         -c "pwrite -S 0x73 384K 640K" \
         $MNT/foobar

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

  echo -e "\nFile digest in the original filesystem:"
  md5sum $MNT/snap2/foobar

  # Now unmount the filesystem, create a new one, mount it and try to
  # apply both send streams to recreate both snapshots.
  umount $DEV

  mkfs.btrfs -f $DEV >/dev/null
  mount $DEV $MNT

  btrfs receive -f /tmp/snap1.send $MNT
  btrfs receive -f /tmp/snap2.send $MNT

  # Must match what we got in the original filesystem of course.
  echo -e "\nFile digest in the new filesystem:"
  md5sum $MNT/snap2/foobar

  umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

  $ ./test-send-clone.sh
  wrote 131072/131072 bytes at offset 0
  128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
  wrote 131072/131072 bytes at offset 131072
  128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
  wrote 262144/262144 bytes at offset 262144
  256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
  wrote 524288/524288 bytes at offset 524288
  512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  linked 983040/983040 bytes at offset 1048576
  960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
  linked 262144/262144 bytes at offset 524288
  256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
  linked 262144/262144 bytes at offset 131072
  256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
  wrote 655360/655360 bytes at offset 393216
  640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2

  File digest in the original filesystem:
  9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to foobar: Invalid argument

  File digest in the new filesystem:
  132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
   the extent is shared and can be (fully or partially) found at file
   offset 0.

   At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
   64K and, since the file extent item at file offset 0 has a data offset
   of 0, it issues a clone operation, from the same file and root, that
   has a source range offset of 64K, destination offset of 1M and a length
   of 64K, since the extent item at file offset 0 refers only to the first
   128K of the shared extent.

   After this clone operation, the file size (and eof) at the receiver is
   increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
   it checks for the next file extent item, which starts at file offset
   128K. This file extent item has a data offset of 0 and a length of
   256K, so a clone operation with a source range offset of 256K, a
   destination offset of 1088K (1M + 64K) and length of 128K is issued.

   After this operation the file size (and eof) at the receiver increases
   from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
   it checks for the next file extent item, located at file offset 384K.
   This file extent item points to a different extent, not the one we want
   to clone, with a length of 640K. So we issue a write operation into the
   file range 1216K (1088K + 128K, end of the last clone operation), with
   a length of 640K and with a data matching the one we can find for that
   range in send root.

   After this operation, the file size (and eof) at the receiver increases
   from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
   we look into the file extent item, which is for file offset 1M and it
   points to the extent we want to clone, with a data offset of 64K and a
   length of 960K.

   However this matches the file offset we started with, the start of the
   range to clone into. So we can't for sure find any file extent item
   from here onwards with the rest of the data we want to clone, yet we
   proceed and since the file extent item points to the shared extent,
   with a data offset of 64K, we issue a clone operation with a source
   range starting at file offset 1856K, which matches the file extent
   item's offset, 1M, plus the amount of data cloned and written so far,
   which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
   operation is invalid since the source range offset matches the current
   eof of the file in the receiver. We should have stopped looking for
   extents to clone at this point and instead fallback to write, which
   would simply the contain the data in the file range from 1856K to
   1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massimo.b@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
Fixes: 11f2069c11 ("Btrfs: send, allow clone operations within the same file")
CC: stable@vger.kernel.org # 5.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-12 15:35:09 +01:00
David Sterba 14ff8e1970 btrfs: no need to run delayed refs after commit_fs_roots during commit
The inode number cache has been removed in this dev cycle, there's one
more leftover. We don't need to run the delayed refs again after
commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
is no more since 5297199a8b ("btrfs: remove inode number cache
feature").

Nothing else between commit_fs_roots and btrfs_qgroup_account_extents
could create new delayed refs so the qgroup consistency should be safe.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-12 15:35:04 +01:00
J. Bruce Fields 51b2ee7d00 nfsd4: readdirplus shouldn't return parent of export
If you export a subdirectory of a filesystem, a READDIRPLUS on the root
of that export will return the filehandle of the parent with the ".."
entry.

The filehandle is optional, so let's just not return the filehandle for
".." if we're at the root of an export.

Note that once the client learns one filehandle outside of the export,
they can trivially access the rest of the export using further lookups.

However, it is also not very difficult to guess filehandles outside of
the export.  So exporting a subdirectory of a filesystem should
considered equivalent to providing access to the entire filesystem.  To
avoid confusion, we recommend only exporting entire filesystems.

Reported-by: Youjipeng <wangzhibei1999@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-12 08:54:14 -05:00
Linus Torvalds 6e68b9961f for-5.11-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/8jD4ACgkQxWXV+ddt
 WDteWQ//QcpD6STpLwAC+g6zJyJln7Au9lfQvawugvOJssbtdPkJQP3ZiK+Izwi/
 /xagu6XMazJM+47acNJKDNntOqVkp+O6CxEbLU+rL/D288L3HEGxayZ2LL90wm6J
 tbIebOE+BSVZ/5oe0jVdqZXwYvUtTiJ7PoFgrZPXJCnddSitZRD3tC4Wmi/Yo5+0
 +7CW6PT3/s7KARwYXpgpMM5vi8qO2nfHfTUdRlSh59g7zC/TH7HiitL6roHzlX1k
 g/aaKYLVcg62OPpw7ZXwde/qH8n1TR+H5WX6vBInqd/9jYcNkVGqijCgBeL1TJkN
 Vx/b69ccODK2GNzuuYoo3k3XvSwZWsOTZp+k4y3EZ1cMONMo1snu/xglYsvSZvUL
 lNCQlA9hIZNskRwEvkEea68/bQdiOl6xezgR9tajMlmz7oCsV/Cz/MJ+RfqaxdH3
 bV6eTTex67lQfzAda+gN+zjBrFzQdmK700gKimdzF1XfcYmmCIdZVX8Gm/N6ldQN
 LNRe8zYRaqrmRk9PQ355RqYDZmft/wLiUV6V0j74oV65WpPe2R4pULWdmPAGm6Oj
 UWM+ZR3u9m8asg7ghKYgct2pxCS3+gLbDNXNcOSxYxthEEZB2JqkAMjtjCfwJilN
 PXfuXaBKRmRck+AcYfbBrfJOljQ+zAJdTK/Rid40TwwpFCe/jjY=
 =G3R4
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "More material for stable trees.

   - tree-checker: check item end overflow

   - fix false warning during relocation regarding extent type

   - fix inode flushing logic, caused notable performance regression
     (since 5.10)

   - debugging fixups:
      - print correct offset for reloc tree key
      - pass reliable fs_info pointer to error reporting helper"

* tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: shrink delalloc pages instead of full inodes
  btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
  btrfs: tree-checker: check if chunk item end overflows
  btrfs: prevent NULL pointer dereference in extent_io_tree_panic
  btrfs: print the actual offset in btrfs_root_name
2021-01-11 14:18:56 -08:00
Linus Torvalds c912fd05fa Fixes:
- Fix major TCP performance regression
 - Get NFSv4.2 READ_PLUS regression tests to pass
 - Improve NFSv4 COMPOUND memory allocation
 - Fix sparse warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl/138wACgkQM2qzM29m
 f5c0QQ/+NkUxtmXd5lKXjzB0NcXsiQm9QxGvY52Oj75DHHprGmGkNEQAKczr/1Gu
 l+MArFXJTITrZRwbqQMA4uxwgCfup51atI12c27n1u5T9+bMicJIjT5yCtQ7rT2t
 U70VSZKgBlWWTcvfiEcFc1rloI3IY5c4ZYpeMxaXseegn6w3LYQfkZLcRdRleSz3
 P0IO59Eow8Wt/GxRXpeYv0sK2m8OK1OyknKAzbq9swrc0ARJzKIwuTDs7jPtlvg5
 SkDOTrXdSHwVvTrCqr9BwaNtQa76xR/Zo5UqKYgyzx3/NQ7h39hRTR5xLVst+Ynh
 3TgOPS0YDWlmRzjX0xhr5y+rwWFxRvS6uecaIMOSuqABQ1F0RwbfXE/XplQLhk1E
 kjL819y5MuUpOdjMx5SZEo0pC7VeAoqGmzvTunpf974ExTNvDiKf0fPFs74cYUzG
 /a4k3DYJQbzUgG1PzPElbKbPUwSk/W/M7p9Tw7R9dnX2huVa/2J6TllbnbUi6REf
 4qVqCe3WXFHE8Q9FCBuYEaTddToPqA4M98B8ba/pDYiqgfI8goWvGEQukuL7RES0
 0i3G5SMC5zScgk44RMewyNrzl8IzCJXITv39+YDQ9O4FVJJXTSAMoyQ5aXlzVhc6
 v+b4560cXoltEecFzooKjNbb+2FURKNgfeDk9xgG2DoydzelipU=
 =POBn
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd fixes from Chuck Lever:

 - Fix major TCP performance regression

 - Get NFSv4.2 READ_PLUS regression tests to pass

 - Improve NFSv4 COMPOUND memory allocation

 - Fix sparse warning

* tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  NFSD: Restore NFSv4 decoding's SAVEMEM functionality
  SUNRPC: Handle TCP socket sends with kernel_sendpage() again
  NFSD: Fix sparse warning in nfssvc.c
  nfsd: Don't set eof on a truncated READ_PLUS
  nfsd: Fixes for nfsd4_encode_read_plus_data()
2021-01-11 11:35:46 -08:00
Pavel Begunkov 621fadc223 io_uring: don't take files/mm for a dead task
In rare cases a task may be exiting while io_ring_exit_work() trying to
cancel/wait its requests. It's ok for __io_sq_thread_acquire_mm()
because of SQPOLL check, but is not for __io_sq_thread_acquire_files().
Play safe and fail for both of them.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-11 07:39:54 -07:00
Pavel Begunkov d434ab6db5 io_uring: drop mm and files after task_work_run
__io_req_task_submit() run by task_work can set mm and files, but
io_sq_thread() in some cases, and because __io_sq_thread_acquire_mm()
and __io_sq_thread_acquire_files() do a simple current->mm/files check
it may end up submitting IO with mm/files of another task.

We also need to drop it after in the end to drop potentially grabbed
references to them.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-11 07:39:54 -07:00
Trond Myklebust 896567ee7f NFS: nfs_igrab_and_active must first reference the superblock
Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: ea7c38fef0 ("NFSv4: Ensure we reference the inode for return-on-close in delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 16:29:28 -05:00
Trond Myklebust 113aac6d56 NFS: nfs_delegation_find_inode_server must first reference the superblock
Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: e39d8a186e ("NFSv4: Fix an Oops during delegation callbacks")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 16:29:28 -05:00
Linus Torvalds ed41fd071c block-5.11-2021-01-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/7KA0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpn6WEACeUa97qyzm7G8/E5ejBL6lXSTRXNc8qa+h
 YCdrDltkqs6OHAuEyUCwGw3zPmb7fp4M5RLZ/Dp9EtMwld45HfoN6mpRe0+i4U96
 iAkHMNUo6ytp3wXX1XKgZ0FhcSOSwkQK8CMzmLPn+pxkDYzQPFg38AUISPpoDA/L
 YNh4tEiHHd5oprHIzludE00m2i1oYNrBcmUe27sKxR0mak0kEJtxr4cXLrqBtN3k
 9C31A0gstCINSHmQPAcRvFerDxDM0WPYQ7K6UEXfkCfbyf6i+1eG/qLUwUCdm9MD
 Rjot6dXzQ2LzqJbaAZndjJRDRZx2xpC2TNlNaBjYzSOC6AXSY0MKiZBCnH/i/OoZ
 f0Bq/k7LVeMbyu02cgIis4DPLabfG+XQUOniu4HQTrzK8+neApAlCwINc73cvQOb
 hBS+LfUVqP6K6g3oVGSvqG01wj2HK69SWMNKTr9GZ3GIqrcWYtA/JnqFfTE7/KwC
 H7rkPL8i3+NBXmjjz6hm8hx3MrnekKJpsdCBicm9OOYqJRbkGVjoUYeDFz5MElfp
 k71u2WDQ81aiqfWajsJkZaUFxZgUrRzuWeyBZiQQP9kJEMzUUiDSg4K+0WJhk5bO
 Y0EX0sdCz8k9IBKfi2+FcF5dYj3RDolALmBDrrcfchTW0h7vxMpn4rr/ueN7gViz
 rW/Gj9pRsA==
 =CClj
 -----END PGP SIGNATURE-----

Merge tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Missing CRC32 selections (Arnd)

 - Fix for a merge window regression with bdev inode init (Christoph)

 - bcache fixes

 - rnbd fixes

 - NVMe pull request from Christoph:
    - fix a race in the nvme-tcp send code (Sagi Grimberg)
    - fix a list corruption in an nvme-rdma error path (Israel Rukshin)
    - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar)
    - add the susystem NQN quirk for a Samsung driver (Gopal Tiwari)
    - fix two compiler warnings in nvme-fcloop (James Smart)
    - don't call sleeping functions from irq context in nvme-fc (James Smart)
    - remove an unused argument (Max Gurtovoy)
    - remove unused exports (Minwoo Im)

 - Use-after-free fix for partition iteration (Ming)

 - Missing blk-mq debugfs flag annotation (John)

 - Bdev freeze regression fix (Satya)

 - blk-iocost NULL pointer deref fix (Tejun)

* tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits)
  bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
  bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket
  bcache: check unsupported feature sets for bcache register
  bcache: fix typo from SUUP to SUPP in features.h
  bcache: set pdev_set_uuid before scond loop iteration
  blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
  block/rnbd-clt: avoid module unload race with close confirmation
  block/rnbd: Adding name to the Contributors List
  block/rnbd-clt: Fix sg table use after free
  block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close
  block/rnbd: Select SG_POOL for RNBD_CLIENT
  block: pre-initialize struct block_device in bdev_alloc_inode
  fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb
  nvme: remove the unused status argument from nvme_trace_bio_complete
  nvmet-rdma: Fix list_del corruption on queue establishment failure
  nvme: unexport functions with no external caller
  nvme: avoid possible double fetch in handling CQE
  nvme-tcp: Fix possible race of io_work and direct send
  nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
  nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings
  ...
2021-01-10 12:53:08 -08:00
Linus Torvalds d430adfea8 io_uring-5.11-2021-01-10
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/7J/cQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr2IEACmFmPGOHiRgo3ET6F+pXufVtqN/TiiOKIM
 GNznobO62HRL35ZTSVMH86u+koouWfoBCWtNMOq6ln2wMbNnHUKFgCdNhhX1vXPg
 T404WTVFtqw19wBigLUBnANDWAHOOVOdcbShNuznE/PYgC1faz9HFjuKY7GVnrzU
 xOSuw1k8P8swwf5h641fIwGI1sdyBgFHRsG8YwVBVJtmf4B8tRSQvAF1fesIPw6j
 ZQTcfAAFl29Kj1Tjlog/rD75CS6xVQRd59AaJqvF8khMAsp/VHb7snAPMZJ2EEO2
 6by5zNbh5mhqOhWFe7jvem+MBFGARx/Ol5cfKPehGsjLoOAk/QJU6tOYA5T1YHdY
 aOB5QR/c/H7uHdWBu6pcU9sT3b7LvOJGJF+kXHt/N2A8k03jXmmWk+LoxrRsXrbr
 Tvo7ehOIHdgnOuy2M9R9a/c0WtjY6mzBM9OEyFT3oa1ijM5mjAgU1YAVKJgjOa83
 eeN9p50+OcFH34bDU2bGCbDdnfZvplcgcpib1p2A5H0seoci2Tn//ihJ7us4+JTP
 5hZS1alkL/ngwr8LGoltS/RnJyJsP2ZJO8NzrRli66e6EflUXnJtla8zP4EASKmQ
 FnD5Yt2TFtBTt/MOAhUKG2whD5uvUtJIRLll33i6gfO97LSMZVkH43pji4XEAZkR
 idVotzElKw==
 =+yAt
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.11-2021-01-10' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "A bit larger than I had hoped at this point, but it's all changes that
  will be directed towards stable anyway. In detail:

   - Fix a merge window regression on error return (Matthew)

   - Remove useless variable declaration/assignment (Ye Bin)

   - IOPOLL fixes (Pavel)

   - Exit and cancelation fixes (Pavel)

   - fasync lockdep complaint fix (Pavel)

   - Ensure SQPOLL is synchronized with creator life time (Pavel)"

* tag 'io_uring-5.11-2021-01-10' of git://git.kernel.dk/linux-block:
  io_uring: stop SQPOLL submit on creator's death
  io_uring: add warn_once for io_uring_flush()
  io_uring: inline io_uring_attempt_task_drop()
  io_uring: io_rw_reissue lockdep annotations
  io_uring: synchronise ev_posted() with waitqueues
  io_uring: dont kill fasync under completion_lock
  io_uring: trigger eventfd for IOPOLL
  io_uring: Fix return value from alloc_fixed_file_ref_node
  io_uring: Delete useless variable ‘id’ in io_prep_async_work
  io_uring: cancel more aggressively in exit_work
  io_uring: drop file refs after task cancel
  io_uring: patch up IOPOLL overflow_flush sync
  io_uring: synchronise IOPOLL on task_submit fail
2021-01-10 12:39:38 -08:00
Linus Torvalds a440e4d761 - A fix for fanotify_mark() missing the conversion of x86_32 native
syscalls which take 64-bit arguments to the compat handlers due to
 former having a general compat handler. (Brian Gerst)
 
 - Add a forgotten pmd page destructor call to pud_free_pmd_page() where
 a pmd page is freed. (Dan Williams)
 
 - Make IN/OUT insns with an u8 immediate port operand handling for
 SEV-ES guests more precise by using only the single port byte and not
 the whole s32 value of the insn decoder. (Peter Gonda)
 
 - Correct a straddling end range check before returning the proper MTRR
 type, when the end address is the same as top of memory. (Ying-Tsun
 Huang)
 
 - Change PQR_ASSOC MSR update scheme when moving a task to a resctrl
 resource group to avoid significant performance overhead with some
 resctrl workloads. (Fenghua Yu)
 
 - Avoid the actual task move overhead when the task is already in the
 resource group. (Fenghua Yu)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/61xYACgkQEsHwGGHe
 VUrcvQ//dAWAteCC/BXVHpgcWrBOgPrkwv7aAo70bIO50fUj4pHPYbfhOJU1ey7j
 5o4FrqdsOVhGfZjQzvT/juLsr9mQHsfszxKpDTLyK3wVtUtIODYXzgiXRc/qfZDO
 ozXCVUsUSKJgrIcKTBQbmugK36iZZk+ER+qzUaqd0aq8mocdtSSO8b14uaRJw3MR
 vumqmEmEEcyM9XK0UgTLPcf6Uhu+Mlg3YSNkV5Qhu0yiCTJaqeEySsytUcRsnnF/
 z8AkxZP03Q65o3aoRoSGZihHNKTkNucbavYp70LkcqopoHlC+XERvya9ANRibLPi
 /+s9GQUm4QPg7XRHLB8dXFZ9RY3YGUeE60BUxVZa4vI3pwciPQD5tbvUF3F/jEN0
 PYLy/zVlAkDfI6Z8wTl8DNmd8nd/rE0F4p5zayjpQUWsjjfZDrh+GzBl/YsMuYRp
 G8dk3tEUc8KREBEccv/YzuVcE0AhX4t1tkn3l2Le5v+4PbwRWBm2uNOiRfn4OM31
 iB4E4yCHBnBhTyBA0TkWuHV1TJX6Tb2+0g+D49ZoMGFVoBd8NL6f+dBr0psjX/U+
 RsZucit0FcJG2VhJNXEPD+rwNZ6XPfDmIU9GNTAmXUuoKR/kqT8D/NWYkqmKh/Vw
 +F2EIgOZVhQVOvLKWRut+4qmQRStm6B3UBJimEDySUJPT72O+dU=
 =2/Eq
 -----END PGP SIGNATURE-----

Merge tag 'x86_urgent_for_v5.11_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:
 "As expected, fixes started trickling in after the holidays so here is
  the accumulated pile of x86 fixes for 5.11:

   - A fix for fanotify_mark() missing the conversion of x86_32 native
     syscalls which take 64-bit arguments to the compat handlers due to
     former having a general compat handler. (Brian Gerst)

   - Add a forgotten pmd page destructor call to pud_free_pmd_page()
     where a pmd page is freed. (Dan Williams)

   - Make IN/OUT insns with an u8 immediate port operand handling for
     SEV-ES guests more precise by using only the single port byte and
     not the whole s32 value of the insn decoder. (Peter Gonda)

   - Correct a straddling end range check before returning the proper
     MTRR type, when the end address is the same as top of memory.
     (Ying-Tsun Huang)

   - Change PQR_ASSOC MSR update scheme when moving a task to a resctrl
     resource group to avoid significant performance overhead with some
     resctrl workloads. (Fenghua Yu)

   - Avoid the actual task move overhead when the task is already in the
     resource group. (Fenghua Yu)"

* tag 'x86_urgent_for_v5.11_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Don't move a task to the same resource group
  x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR
  x86/mtrr: Correct the range check before performing MTRR type lookups
  x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
  x86/mm: Fix leak of pmd ptlock
  fanotify: Fix sys_fanotify_mark() on native x86-32
2021-01-10 11:31:17 -08:00
Trond Myklebust cb2856c597 NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
If we exit _lgopen_prepare_attached() without setting a layout, we will
currently leak the plh_outstanding counter.

Fixes: 411ae722d1 ("pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:52 -05:00
Trond Myklebust 46c9ea1d4f NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
We must ensure that we pass a layout segment to nfs_retry_commit() when
we're cleaning up after pnfs_bucket_alloc_ds_commits(). Otherwise,
requests that should be committed to the DS will get committed to the
MDS.
Do so by ensuring that pnfs_bucket_get_committing() always tries to
return a layout segment when it returns a non-empty page list.

Fixes: c84bea5944 ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:52 -05:00
Trond Myklebust 1757655d78 NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
In pnfs_generic_clear_request_commit(), we try calling
pnfs_free_bucket_lseg() before we remove the request from the DS bucket.
That will always fail, since the point is to test for whether or not
that bucket is empty.

Fixes: c84bea5944 ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:52 -05:00
Trond Myklebust 2c8d5fc37f pNFS: Stricter ordering of layoutget and layoutreturn
If a layout return is in progress, we should wait for it to complete,
in case the layout segment we are picking up gets returned too.

Fixes: 30cb3ee299 ("pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:52 -05:00
Trond Myklebust c18d1e17ba pNFS: Clean up pnfs_layoutreturn_free_lsegs()
Remove the check for whether or not the stateid is NULL, and fix up the
callers.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:52 -05:00
Trond Myklebust 078000d02d pNFS: We want return-on-close to complete when evicting the inode
If the inode is being evicted, it should be safe to run return-on-close,
so we should do it to ensure we don't inadvertently leak layout segments.

Fixes: 1c5bd76d17 ("pNFS: Enable layoutreturn operation for return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:51 -05:00
Trond Myklebust 67bbceedc9 pNFS: Mark layout for return if return-on-close was not sent
If the layout return-on-close failed because the layoutreturn was never
sent, then we should mark the layout for return again.

Fixes: 9c47b18cf7 ("pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:51 -05:00
Scott Mayhew c98e9daa59 NFS: Adjust fs_context error logging
Several existing dprink()/dfprintk() calls were converted to use the new
mount API logging macros by commit ce8866f091 ("NFS: Attach
supplementary error information to fs_context").  If the fs_context was
not created using fsopen() then it will not have had a log buffer
allocated for it, and the new mount API logging macros will wind up
calling printk().

This can result in syslog messages being logged where previously there
were none... most notably "NFS4: Couldn't follow remote path", which can
happen if the client is auto-negotiating a protocol version with an NFS
server that doesn't support the higher v4.x versions.

Convert the nfs_errorf(), nfs_invalf(), and nfs_warnf() macros to check
for the existence of the fs_context's log buffer and call dprintk() if
it doesn't exist.  Add nfs_ferrorf(), nfs_finvalf(), and nfs_warnf(),
which do the same thing but take an NFS debug flag as an argument and
call dfprintk().  Finally, modify the "NFS4: Couldn't follow remote
path" message to use nfs_ferrorf().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207385
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: ce8866f091 ("NFS: Attach supplementary error information to fs_context.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:39 -05:00
Pavel Begunkov d9d05217cb io_uring: stop SQPOLL submit on creator's death
When the creator of SQPOLL io_uring dies (i.e. sqo_task), we don't want
its internals like ->files and ->mm to be poked by the SQPOLL task, it
have never been nice and recently got racy. That can happen when the
owner undergoes destruction and SQPOLL tasks tries to submit new
requests in parallel, and so calls io_sq_thread_acquire*().

That patch halts SQPOLL submissions when sqo_task dies by introducing
sqo_dead flag. Once set, the SQPOLL task must not do any submission,
which is synchronised by uring_lock as well as the new flag.

The tricky part is to make sure that disabling always happens, that
means either the ring is discovered by creator's do_exit() -> cancel,
or if the final close() happens before it's done by the creator. The
last is guaranteed by the fact that for SQPOLL the creator task and only
it holds exactly one file note, so either it pins up to do_exit() or
removed by the creator on the final put in flush. (see comments in
uring_flush() around file->f_count == 2).

One more place that can trigger io_sq_thread_acquire_*() is
__io_req_task_submit(). Shoot off requests on sqo_dead there, even
though actually we don't need to. That's because cancellation of
sqo_task should wait for the request before going any further.

note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the
caller would enter the ring to get an error, but it still doesn't
guarantee that the flag won't be cleared.

note 2: if final __userspace__ close happens not from the creator
task, the file note will pin the ring until the task dies.

Fixed: b1b6b5a30d ("kernel/io_uring: cancel io_uring before task works")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09 09:21:43 -07:00
Pavel Begunkov 6b5733eb63 io_uring: add warn_once for io_uring_flush()
files_cancel() should cancel all relevant requests and drop file notes,
so we should never have file notes after that, including on-exit fput
and flush. Add a WARN_ONCE to be sure.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09 09:21:43 -07:00
Pavel Begunkov 4f793dc40b io_uring: inline io_uring_attempt_task_drop()
A simple preparation change inlining io_uring_attempt_task_drop() into
io_uring_flush().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09 09:21:43 -07:00
Pavel Begunkov 55e6ac1e1f io_uring: io_rw_reissue lockdep annotations
We expect io_rw_reissue() to take place only during submission with
uring_lock held. Add a lockdep annotation to check that invariant.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-09 09:21:43 -07:00
Linus Torvalds 996e435fd4 zonefs fixes for 5.11-rc3
A single patch from Arnd in this pull request to fix a missing
 dependency in zonefs Kconfig.
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCX/j4XAAKCRDdoc3SxdoY
 dq7JAQDx1EafzD1RgEKoa0gKNZMBkNr08zaF1viEg9RoAu+KsAEApem43PmHfUQx
 FSH9cSzkqbgcnusBsWv8pdCKS1R0iQs=
 =Fd60
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:
 "A single patch from Arnd to fix a missing dependency in zonefs
  Kconfig"

* tag 'zonefs-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: select CONFIG_CRC32
2021-01-08 18:04:43 -08:00
Linus Torvalds ef0ba05538 poll: fix performance regression due to out-of-line __put_user()
The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cfc2
("x86: Make __put_user() generate an out-of-line call").

I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more.  But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.

Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.

But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array.  So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.

To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.

Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Laight <David.Laight@aculab.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-01-08 11:06:29 -08:00
Josef Bacik e076ab2a2c btrfs: shrink delalloc pages instead of full inodes
Commit 38d715f494 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot.  However this introduced a pretty serious
performance regression.  To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush.  The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221ce
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us.  This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe.  I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

  5.9		0m54.258s
  5.10		1m26.212s
  5.10+patch	0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:

  5.10.5            4m00s
  5.10.5+patch      1m08s
  5.11-rc2	    5m14s
  5.11-rc2+patch    1m30s

Reported-by: René Rebe <rene@exactcode.de>
Fixes: 38d715f494 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add my test results ]
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-08 16:36:44 +01:00
Christoph Hellwig 2d2f6f1b47 block: pre-initialize struct block_device in bdev_alloc_inode
bdev_evict_inode and bdev_free_inode are also called for the root inode
of bdevfs, for which bdev_alloc is never called.  Move the zeroing o
f struct block_device and the initialization of the bd_bdi field into
bdev_alloc_inode to make sure they are initialized for the root inode
as well.

Fixes: e6cb53827e ("block: initialize struct block_device in bdev_alloc")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07 20:57:53 -07:00
Satya Tangirala 04a6a536bc fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb
freeze/thaw_bdev() currently use bdev->bd_fsfreeze_count to infer
whether or not bdev->bd_fsfreeze_sb is valid (it's valid iff
bd_fsfreeze_count is non-zero). thaw_bdev() doesn't nullify
bd_fsfreeze_sb.

But this means a freeze_bdev() call followed by a thaw_bdev() call can
leave bd_fsfreeze_sb with a non-null value, while bd_fsfreeze_count is
zero. If freeze_bdev() is called again, and this time
get_active_super() returns NULL (e.g. because the FS is unmounted),
we'll end up with bd_fsfreeze_count > 0, but bd_fsfreeze_sb is
*untouched* - it stays the same (now garbage) value. A subsequent
thaw_bdev() will decide that the bd_fsfreeze_sb value is legitimate
(since bd_fsfreeze_count > 0), and attempt to use it.

Fix this by always setting bd_fsfreeze_sb to NULL when
bd_fsfreeze_count is successfully decremented to 0 in thaw_sb().
Alternatively, we could set bd_fsfreeze_sb to whatever
get_active_super() returns in freeze_bdev() whenever bd_fsfreeze_count
is successfully incremented to 1 from 0 (which can be achieved cleanly
by moving the line currently setting bd_fsfreeze_sb to immediately
after the "sync:" label, but it might be a little too subtle/easily
overlooked in future).

This fixes the currently panicking xfstests generic/085.

Fixes: 040f04bd2e ("fs: simplify freeze_bdev/thaw_bdev")
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07 09:25:54 -07:00
Qu Wenruo 50e31ef486 btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.

Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.

The dmesg would look like:

  [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
  [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
  [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
  [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

[CAUSE]
The ENOENT error is returned from the following call chain:

  add_data_references()
  |- delete_v1_space_cache();
     |- if (!found)
	   return -ENOENT;

The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.

With extra debugging, the offending tree block looks like this:

  leaf bytenr = 42676709441536, data_bytenr = 34626327621632

                ctime 1567904822.739884119 (2019-09-08 03:07:02)
                mtime 0.0 (1970-01-01 01:00:00)
                otime 0.0 (1970-01-01 01:00:00)
        item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
                generation 1517381 type 2 (prealloc)
                prealloc data disk byte 34626327621632 nr 262144 <<<
                prealloc data offset 0 nr 262144
        item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
                generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
                lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
                uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.

[FIX]
The code is modified in commit 19b546d7a1 ("btrfs: relocation: Use
btrfs_find_all_leafs to locate data extent parent tree leaves"), before
that commit, we use something like

  "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
ignoring BTRFS_FILE_EXTENT_PREALLOC.

Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.

Reported-by: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
Fixes: 19b546d7a1 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: stable@vger.kernel.org # 5.6+
Tested-By: Stéphane Lesimple <stephane_btrfs2@lesimple.fr>
Reviewed-by: Su Yue <l@damenly.su>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Su Yue 347fb0cfc9 btrfs: tree-checker: check if chunk item end overflows
While mounting a crafted image provided by user, kernel panics due to
the invalid chunk item whose end is less than start.

  [66.387422] loop: module loaded
  [66.389773] loop0: detected capacity change from 262144 to 0
  [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
  [66.431061] BTRFS info (device loop0): disk space caching is enabled
  [66.431078] BTRFS info (device loop0): has skinny extents
  [66.437101] BTRFS error: insert state: end < start 29360127 37748736
  [66.437136] ------------[ cut here ]------------
  [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
  [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G           O      5.11.0-rc1-custom #45
  [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
  [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
  [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
  [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
  [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
  [66.437447] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.437451] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.437460] PKRU: 55555554
  [66.437464] Call Trace:
  [66.437475]  set_extent_bit+0x652/0x740 [btrfs]
  [66.437539]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.437576]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.437621]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.437674]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.437708]  ? kvm_sched_clock_read+0x18/0x40
  [66.437739]  open_ctree+0xb32/0x1734 [btrfs]
  [66.437781]  ? bdi_register_va+0x1b/0x20
  [66.437788]  ? super_setup_bdi_name+0x79/0xd0
  [66.437810]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.437854]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437873]  legacy_get_tree+0x34/0x60
  [66.437880]  vfs_get_tree+0x2d/0xc0
  [66.437888]  vfs_kern_mount.part.0+0x78/0xc0
  [66.437897]  vfs_kern_mount+0x13/0x20
  [66.437902]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.437940]  ? kfree+0x5ff/0x670
  [66.437944]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.437962]  legacy_get_tree+0x34/0x60
  [66.437974]  vfs_get_tree+0x2d/0xc0
  [66.437983]  path_mount+0x48c/0xd30
  [66.437998]  __x64_sys_mount+0x108/0x140
  [66.438011]  do_syscall_64+0x38/0x50
  [66.438018]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.438023] RIP: 0033:0x7f0138827f6e
  [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [66.438078] irq event stamp: 18169
  [66.438082] hardirqs last  enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0
  [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0
  [66.438092] softirqs last  enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20
  [66.438103] ---[ end trace e114b111db64298b ]---
  [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
  [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
  [66.441069] ------------[ cut here ]------------
  [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
  [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G        W  O      5.11.0-rc1-custom #45
  [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
  [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [66.458356] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [66.459841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [66.462196] PKRU: 55555554
  [66.462692] Call Trace:
  [66.463139]  set_extent_bit.cold+0x30/0x98 [btrfs]
  [66.464049]  set_extent_bits_nowait+0x1d/0x20 [btrfs]
  [66.490466]  add_extent_mapping+0x1e0/0x2f0 [btrfs]
  [66.514097]  read_one_chunk+0x33c/0x420 [btrfs]
  [66.534976]  btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
  [66.555718]  ? kvm_sched_clock_read+0x18/0x40
  [66.575758]  open_ctree+0xb32/0x1734 [btrfs]
  [66.595272]  ? bdi_register_va+0x1b/0x20
  [66.614638]  ? super_setup_bdi_name+0x79/0xd0
  [66.633809]  btrfs_mount_root.cold+0x12/0xeb [btrfs]
  [66.652938]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.671925]  legacy_get_tree+0x34/0x60
  [66.690300]  vfs_get_tree+0x2d/0xc0
  [66.708221]  vfs_kern_mount.part.0+0x78/0xc0
  [66.725808]  vfs_kern_mount+0x13/0x20
  [66.742730]  btrfs_mount+0x11f/0x3c0 [btrfs]
  [66.759350]  ? kfree+0x5ff/0x670
  [66.775441]  ? __kmalloc_track_caller+0x217/0x3b0
  [66.791750]  legacy_get_tree+0x34/0x60
  [66.807494]  vfs_get_tree+0x2d/0xc0
  [66.823349]  path_mount+0x48c/0xd30
  [66.838753]  __x64_sys_mount+0x108/0x140
  [66.854412]  do_syscall_64+0x38/0x50
  [66.869673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [66.885093] RIP: 0033:0x7f0138827f6e
  [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
  [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
  [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
  [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
  [67.216138] ---[ end trace e114b111db64298c ]---
  [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
  [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
  [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
  [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
  [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
  [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
  [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
  [67.465436] FS:  00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
  [67.511660] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
  [67.558449] PKRU: 55555554
  [67.581146] note: mount[613] exited with preempt_count 2

The image has a chunk item which has a logical start 37748736 and length
18446744073701163008 (-8M). The calculated end 29360127 overflows.
EEXIST was caught by insert_state() because of the duplicate end and
extent_io_tree_panic() was called.

Add overflow check of chunk item end to tree checker so it can be
detected early at mount time.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Su Yue 29b665cc51 btrfs: prevent NULL pointer dereference in extent_io_tree_panic
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04 ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Josef Bacik 71008734d2 btrfs: print the actual offset in btrfs_root_name
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid.  Fix this helper to take the key
so we have access to the offset when we need it.

Fixes: 457f1864b5 ("btrfs: pretty print leaked root name")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-07 17:25:05 +01:00
Pavel Begunkov b1445e59cc io_uring: synchronise ev_posted() with waitqueues
waitqueue_active() needs smp_mb() to be in sync with waitqueues
modification, but we miss it in io_cqring_ev_posted*() apart from
cq_wait() case.

Take an smb_mb() out of wq_has_sleeper() making it waitqueue_active(),
and place it a few lines before, so it can synchronise other
waitqueue_active() as well.

The patch doesn't add any additional overhead, so even if there are
no problems currently, it's just safer to have it this way.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07 07:48:09 -07:00
Pavel Begunkov 4aa84f2ffa io_uring: dont kill fasync under completion_lock
CPU0                    CPU1
       ----                    ----
  lock(&new->fa_lock);
                               local_irq_disable();
                               lock(&ctx->completion_lock);
                               lock(&new->fa_lock);
  <Interrupt>
    lock(&ctx->completion_lock);

 *** DEADLOCK ***

Move kill_fasync() out of io_commit_cqring() to io_cqring_ev_posted(),
so it doesn't hold completion_lock while doing it. That saves from the
reported deadlock, and it's just nice to shorten the locking time and
untangle nested locks (compl_lock -> wq_head::lock).

Reported-by: syzbot+91ca3f25bd7f795f019c@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07 07:48:09 -07:00
Pavel Begunkov 80c18e4ac2 io_uring: trigger eventfd for IOPOLL
Make sure io_iopoll_complete() tries to wake up eventfd, which currently
is skipped together with io_cqring_ev_posted() for non-SQPOLL IOPOLL.

Add an iopoll version of io_cqring_ev_posted(), duplicates a bit of
code, but they actually use different sets of wait queues may be for
better.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-07 07:48:09 -07:00
Linus Torvalds 71c061d244 for-5.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/0cI8ACgkQxWXV+ddt
 WDspQw/8DcC8zhGgunk0m2kcXd6dFOGbsr3hNGCsgUSKESRw6AgTZ0rJf/QLjayF
 /vaJWzQW9ijfZ92fWZS+mrmskk0N8RFOsEvkCRLesgRaasbrkchLBo5HGQasOBEV
 LXyU878GrBkNaHzClJz+JdU26i0d17BFdddgtZVQ1St9Wd9ecc7Q6iqG80RWFeE7
 uVbhv+QjocM3EieOnwIy5Mz6jZgJLYwqw7/y2njKduBeJtbt1K1j/y7IJk0WFMUM
 8eUpDL6vlAHB8FjV2wWOzO46bbEaUpaBADM6yabrq0lnM0kr7Rb+WV/WSLM/AZ3g
 Hzs4qROOEP+zjfZ5nYjJQDJRMpSipZomsUY5uMZnhRxlZuHPaoBotRRzs5AIZYj2
 BnkfucOcjxS/JTBD//ltJXE8RxbMIyMBBBipbBwqmxOkR9gM9BPuJ6iJPfUX//gG
 1GHJ+FPns8ua3JW21ih6H31xNEPS36tsywvE8yCEtEWMxCFCBwgGu+4D8KpGBjtY
 ySFxkxxAbTuFi9fqSE/mBC+6lpbVTO0OvizuoEQh8C2izkXRbDsDVgPN8d7rCW7h
 Cdox4DUp61sNf+G3ll9Dv9ceAXroZTVRTHGjlav6NAFpydz3yPo5x54Ex7S+k3oN
 BAcZEl1Tl3hz4WxF8Ywc+yJ8n8l9AVa3KcYRXVbyVjTGg+JjU94=
 =jlQf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes that arrived before the end of the year:

   - a bunch of fixes related to transaction handle lifetime wrt various
     operations (umount, remount, qgroup scan, orphan cleanup)

   - async discard scheduling fixes

   - fix item size calculation when item keys collide for extend refs
     (hardlinks)

   - fix qgroup flushing from running transaction

   - fix send, wrong file path when there is an inode with a pending
     rmdir

   - fix deadlock when cloning inline extent and low on free metadata
     space"

* tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: run delayed iputs when remounting RO to avoid leaking them
  btrfs: add assertion for empty list of transactions at late stage of umount
  btrfs: fix race between RO remount and the cleaner task
  btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
  btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
  btrfs: merge critical sections of discard lock in workfn
  btrfs: fix racy access to discard_ctl data
  btrfs: fix async discard stall
  btrfs: tests: initialize test inodes location
  btrfs: send: fix wrong file path when there is an inode with a pending rmdir
  btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
  btrfs: correctly calculate item size used when item key collision happens
  btrfs: fix deadlock when cloning inline extent and low on free metadata space
2021-01-06 11:19:08 -08:00
Dave Wysochanski 3d1a90ab0e NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
It is only safe to call the tracepoint before rpc_put_task() because
'data' is freed inside nfs4_lock_release (rpc_release).

Fixes: 48c9579a1a ("Adding stateid information to tracepoints")
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-06 12:34:57 -05:00