Commit graph

67384 commits

Author SHA1 Message Date
Matthew Wilcox (Oracle) 73bb49da50 mm/readahead: make page_cache_ra_unbounded take a readahead_control
Define it in the callers instead of in page_cache_ra_unbounded().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Biggers <ebiggers@google.com>
Link: https://lkml.kernel.org/r/20200903140844.14194-4-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:16 -07:00
Matthew Wilcox (Oracle) 01c7026705 fs: add a filesystem flag for THPs
The page cache needs to know whether the filesystem supports THPs so that
it doesn't send THPs to filesystems which can't handle them.  Dave Chinner
points out that getting from the page mapping to the filesystem type is
too many steps (mapping->host->i_sb->s_type->fs_flags) so cache that
information in the address space flags.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20200916032717.22917-1-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-16 11:11:15 -07:00
Filipe Manana 1afc708dca btrfs: fix relocation failure due to race with fallocate
When doing a fallocate() we have a short time window, after reserving an
extent and before starting a transaction, where if relocation for the block
group containing the reserved extent happens, we can end up missing the
extent in the data relocation inode causing relocation to fail later.

This only happens when we don't pass a transaction to the internal
fallocate function __btrfs_prealloc_file_range(), which is for all the
cases where fallocate() is called from user space (the internal use cases
include space cache extent allocation and relocation).

When the race triggers the relocation failure, it produces a trace like
the following:

  [200611.995995] ------------[ cut here ]------------
  [200611.997084] BTRFS: Transaction aborted (error -2)
  [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs]
  [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...)
  [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs]
  [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...)
  [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction.
  [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286
  [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem
  [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000
  [200612.011078] BTRFS info (device sdb): forced readonly
  [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff
  [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000
  [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88
  [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08
  [200612.016182] FS:  00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000
  [200612.017174] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0
  [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [200612.021520] Call Trace:
  [200612.022434]  btrfs_cow_block+0x10b/0x250 [btrfs]
  [200612.023407]  do_relocation+0x54e/0x7b0 [btrfs]
  [200612.024343]  ? do_raw_spin_unlock+0x4b/0xc0
  [200612.025280]  ? _raw_spin_unlock+0x29/0x40
  [200612.026200]  relocate_tree_blocks+0x3bc/0x6d0 [btrfs]
  [200612.027088]  relocate_block_group+0x2f3/0x600 [btrfs]
  [200612.027961]  btrfs_relocate_block_group+0x15e/0x340 [btrfs]
  [200612.028896]  btrfs_relocate_chunk+0x38/0x110 [btrfs]
  [200612.029772]  btrfs_balance+0xb22/0x1790 [btrfs]
  [200612.030601]  ? btrfs_ioctl_balance+0x253/0x380 [btrfs]
  [200612.031414]  btrfs_ioctl_balance+0x2cf/0x380 [btrfs]
  [200612.032279]  btrfs_ioctl+0x620/0x36f0 [btrfs]
  [200612.033077]  ? _raw_spin_unlock+0x29/0x40
  [200612.033948]  ? handle_mm_fault+0x116d/0x1ca0
  [200612.034749]  ? up_read+0x18/0x240
  [200612.035542]  ? __x64_sys_ioctl+0x83/0xb0
  [200612.036244]  __x64_sys_ioctl+0x83/0xb0
  [200612.037269]  do_syscall_64+0x33/0x80
  [200612.038190]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [200612.038976] RIP: 0033:0x7fa55d07ed87
  [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...)
  [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87
  [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003
  [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0
  [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614
  [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470
  [200612.046361] irq event stamp: 0
  [200612.047040] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
  [200612.048387] softirqs last  enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0
  [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [200612.049722] ---[ end trace 49006c6876e65227 ]---

The race happens like this:

1) Task A starts an fallocate() (plain or zero range) and it calls
   __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL;

2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to
   block group X;

3) Before task A gets into btrfs_replace_file_extents(), through the call
   to insert_prealloc_file_extent(), task B starts relocation of block
   group X;

4) Task B enters btrfs_relocate_block_group() and it sets block group X to
   RO mode;

5) Task B enters relocate_block_group(), it calls prepare_to_relocate()
   whichs joins/starts a transaction and then commits the transaction;

6) Task B then starts scanning the extent tree looking for extents that
   belong to block group X - it does not find yet the extent reserved by
   task A, since that extent was not yet added to the extent tree, as its
   delayed reference was not even yet created at this point;

7) The data relocation inode ends up not having the extent reserved by
   task A associated to it;

8) Task A then starts a transaction through btrfs_replace_file_extents(),
   inserts a file extent item in the subvolume tree pointing to the
   reserved extent and creates a delayed reference for it;

9) Task A finishes and returns success to user space;

10) Later on, while relocation is still in progress, the leaf where task A
    inserted the new file extent item is COWed, so we end up at
    __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in
    turn calls relocation.c:replace_file_extents();

11) At relocation.c:replace_file_extents() we iterate over all the items in
    the leaf and find the file extent item pointing to the extent that was
    allocated by task A, and then call relocation.c:get_new_location(), to
    find the new location for the extent;

12) However relocation.c:get_new_location() fails, returning -ENOENT,
    because it couldn't find a corresponding file extent item associated
    with the data relocation inode. This is because the extent was not seen
    in the extent tree at step 6). The -ENOENT error is propagated to
    __btrfs_cow_block(), which aborts the transaction.

So fix this simply by decrementing the block group's number of reservations
after calling insert_prealloc_file_extent(), as relocation waits for that
counter to go down to zero before calling prepare_to_relocate() and start
looking for extents in the extent tree.

This issue only started to happen recently as of commit 8fccebfa53
("btrfs: fix metadata reservation for fallocate that leads to transaction
aborts"), because now we can reserve an extent before starting/joining a
transaction, and previously we always did it after that, so relocation
ended up waiting for a concurrent fallocate() to finish because before
searching for the extents of the block group, it starts/joins a transaction
and then commits it (at prepare_to_relocate()), which made it wait for the
fallocate task to complete first.

Fixes: 8fccebfa53 ("btrfs: fix metadata reservation for fallocate that leads to transaction aborts")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-16 16:01:56 +02:00
David Howells 7530d3eb3d afs: Don't assert on unpurgeable server records
Don't give an assertion failure on unpurgeable afs_server records - which
kills the thread - but rather emit a trace line when we are purging a
record (which only happens during network namespace removal or rmmod) and
print a notice of the problem.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-16 14:39:34 +01:00
David Howells dca54a7bbb afs: Add tracing for cell refcount and active user count
Add a tracepoint to log the cell refcount and active user count and pass in
a reason code through various functions that manipulate these counters.

Additionally, a helper function, afs_see_cell(), is provided to log
interesting places that deal with a cell without actually doing any
accounting directly.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-16 14:39:21 +01:00
David Howells 1d0e850a49 afs: Fix cell removal
Fix cell removal by inserting a more final state than AFS_CELL_FAILED that
indicates that the cell has been unpublished in case the manager is already
requeued and will go through again.  The new AFS_CELL_REMOVED state will
just immediately leave the manager function.

Going through a second time in the AFS_CELL_FAILED state will cause it to
try to remove the cell again, potentially leading to the proc list being
removed.

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Reported-by: syzbot+b994ecf2b023f14832c1@syzkaller.appspotmail.com
Reported-by: syzbot+0e0db88e1eb44a91ae8d@syzkaller.appspotmail.com
Reported-by: syzbot+2d0585e5efcd43d113c2@syzkaller.appspotmail.com
Reported-by: syzbot+1ecc2f9d3387f1d79d42@syzkaller.appspotmail.com
Reported-by: syzbot+18d51774588492bf3f69@syzkaller.appspotmail.com
Reported-by: syzbot+a5e4946b04d6ca8fa5f3@syzkaller.appspotmail.com
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
2020-10-16 14:38:26 +01:00
David Howells 286377f6bd afs: Fix cell purging with aliases
When the afs module is removed, one of the things that has to be done is to
purge the cell database.  afs_cell_purge() cancels the management timer and
then starts the cell manager work item to do the purging.  This does a
single run through and then assumes that all cells are now purged - but
this is no longer the case.

With the introduction of alias detection, a later cell in the database can
now be holding an active count on an earlier cell (cell->alias_of).  The
purge scan passes by the earlier cell first, but this can't be got rid of
until it has discarded the alias.  Ordinarily, afs_unuse_cell() would
handle this by setting the management timer to trigger another pass - but
afs_set_cell_timer() doesn't do anything if the namespace is being removed
(net->live == false).  rmmod then hangs in the wait on cells_outstanding in
afs_cell_purge().

Fix this by making afs_set_cell_timer() directly queue the cell manager if
net->live is false.  This causes additional management passes.

Queueing the cell manager increments cells_outstanding to make sure the
wait won't complete until all cells are destroyed.

Fixes: 8a070a9648 ("afs: Detect cell aliases 1 - Cells with root volumes")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-16 14:38:26 +01:00
David Howells 88c853c3f5 afs: Fix cell refcounting by splitting the usage counter
Management of the lifetime of afs_cell struct has some problems due to the
usage counter being used to determine whether objects of that type are in
use in addition to whether anyone might be interested in the structure.

This is made trickier by cell objects being cached for a period of time in
case they're quickly reused as they hold the result of a setup process that
may be slow (DNS lookups, AFS RPC ops).

Problems include the cached root volume from alias resolution pinning its
parent cell record, rmmod occasionally hanging and occasionally producing
assertion failures.

Fix this by splitting the count of active users from the struct reference
count.  Things then work as follows:

 (1) The cell cache keeps +1 on the cell's activity count and this has to
     be dropped before the cell can be removed.  afs_manage_cell() tries to
     exchange the 1 to a 0 with the cells_lock write-locked, and if
     successful, the record is removed from the net->cells.

 (2) One struct ref is 'owned' by the activity count.  That is put when the
     active count is reduced to 0 (final_destruction label).

 (3) A ref can be held on a cell whilst it is queued for management on a
     work queue without confusing the active count.  afs_queue_cell() is
     added to wrap this.

 (4) The queue's ref is dropped at the end of the management.  This is
     split out into a separate function, afs_manage_cell_work().

 (5) The root volume record is put after a cell is removed (at the
     final_destruction label) rather then in the RCU destruction routine.

 (6) Volumes hold struct refs, but aren't active users.

 (7) Both counts are displayed in /proc/net/afs/cells.

There are some management function changes:

 (*) afs_put_cell() now just decrements the refcount and triggers the RCU
     destruction if it becomes 0.  It no longer sets a timer to have the
     manager do this.

 (*) afs_use_cell() and afs_unuse_cell() are added to increase and decrease
     the active count.  afs_unuse_cell() sets the management timer.

 (*) afs_queue_cell() is added to queue a cell with approprate refs.

There are also some other fixes:

 (*) Don't let /proc/net/afs/cells access a cell's vllist if it's NULL.

 (*) Make sure that candidate cells in lookups are properly destroyed
     rather than being simply kfree'd.  This ensures the bits it points to
     are destroyed also.

 (*) afs_dec_cells_outstanding() is now called in cell destruction rather
     than at "final_destruction".  This ensures that cell->net is still
     valid to the end of the destructor.

 (*) As a consequence of the previous two changes, move the increment of
     net->cells_outstanding that was at the point of insertion into the
     tree to the allocation routine to correctly balance things.

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-10-16 14:38:22 +01:00
Olga Kornievskaia 8c39076c27 NFSv4.2: support EXCHGID4_FLAG_SUPP_FENCE_OPS 4.2 EXCHANGE_ID flag
RFC 7862 introduced a new flag that either client or server is
allowed to set: EXCHGID4_FLAG_SUPP_FENCE_OPS.

Client needs to update its bitmask to allow for this flag value.

v2: changed minor version argument to unsigned int

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-16 09:28:43 -04:00
David Howells 92e3cc91d8 afs: Fix rapid cell addition/removal by not using RCU on cells tree
There are a number of problems that are being seen by the rapidly mounting
and unmounting an afs dynamic root with an explicit cell and volume
specified (which should probably be rejected, but that's a separate issue):

What the tests are doing is to look up/create a cell record for the name
given and then tear it down again without actually using it to try to talk
to a server.  This is repeated endlessly, very fast, and the new cell
collides with the old one if it's not quick enough to reuse it.

It appears (as suggested by Hillf Danton) that the search through the RB
tree under a read_seqbegin_or_lock() under RCU conditions isn't safe and
that it's not blocking the write_seqlock(), despite taking two passes at
it.  He suggested that the code should take a ref on the cell it's
attempting to look at - but this shouldn't be necessary until we've
compared the cell names.  It's possible that I'm missing a barrier
somewhere.

However, using an RCU search for this is overkill, really - we only need to
access the cell name in a few places, and they're places where we're may
end up sleeping anyway.

Fix this by switching to an R/W semaphore instead.

Additionally, draw the down_read() call inside the function (renamed to
afs_find_cell()) since all the callers were taking the RCU read lock (or
should've been[*]).

[*] afs_probe_cell_name() should have been, but that doesn't appear to be
involved in the bug reports.

The symptoms of this look like:

	general protection fault, probably for non-canonical address 0xf27d208691691fdb: 0000 [#1] PREEMPT SMP KASAN
	KASAN: maybe wild-memory-access in range [0x93e924348b48fed8-0x93e924348b48fedf]
	...
	RIP: 0010:strncasecmp lib/string.c:52 [inline]
	RIP: 0010:strncasecmp+0x5f/0x240 lib/string.c:43
	 afs_lookup_cell_rcu+0x313/0x720 fs/afs/cell.c:88
	 afs_lookup_cell+0x2ee/0x1440 fs/afs/cell.c:249
	 afs_parse_source fs/afs/super.c:290 [inline]
	...

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Reported-by: syzbot+459a5dce0b4cb70fd076@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
cc: syzkaller-bugs@googlegroups.com
2020-10-16 14:04:59 +01:00
Steve French 29e2792304 smb3.1.1: add new module load parm enable_gcm_256
Add new module load parameter enable_gcm_256. If set, then add
AES-256-GCM (strongest encryption type) to the list of encryption
types requested. Put it in the list as the second choice (since
AES-128-GCM is faster and much more broadly supported by
SMB3 servers).  To make this stronger encryption type, GCM-256,
required (the first and only choice, you would use module parameter
"require_gcm_256."

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:15 -05:00
Steve French fbfd0b46af smb3.1.1: add new module load parm require_gcm_256
Add new module load parameter require_gcm_256. If set, then only
request AES-256-GCM (strongest encryption type).

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:15 -05:00
Stefan Metzmacher 330857a5d8 cifs: map STATUS_ACCOUNT_LOCKED_OUT to -EACCES
This is basically the same as STATUS_LOGON_FAILURE,
but after the account is locked out.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:14 -05:00
Steve French 682955491a SMB3.1.1: add defines for new signing negotiate context
Currently there are three supported signing algorithms

Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:14 -05:00
Ronnie Sahlberg c6cc4c5a72 cifs: handle -EINTR in cifs_setattr
RHBZ: 1848178

Some calls that set attributes, like utimensat(), are not supposed to return
-EINTR and thus do not have handlers for this in glibc which causes us
to leak -EINTR to the applications which are also unprepared to handle it.

For example tar will break if utimensat() return -EINTR and abort unpacking
the archive. Other applications may break too.

To handle this we add checks, and retry, for -EINTR in cifs_setattr()

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:14 -05:00
Rohith Surabattula 8e670f77c4 Handle STATUS_IO_TIMEOUT gracefully
Currently STATUS_IO_TIMEOUT is not treated as retriable error.
It is currently mapped to ETIMEDOUT and returned to userspace
for most system calls. STATUS_IO_TIMEOUT is returned by server
in case of unavailability or throttling errors.

This patch will map the STATUS_IO_TIMEOUT to EAGAIN, so that it
can be retried. Also, added a check to drop the connection to
not overload the server in case of ongoing unavailability.

Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-15 23:58:04 -05:00
Linus Torvalds 9ff9b0d392 networking changes for the 5.10 merge window
Add redirect_neigh() BPF packet redirect helper, allowing to limit stack
 traversal in common container configs and improving TCP back-pressure.
 Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.
 
 Expand netlink policy support and improve policy export to user space.
 (Ge)netlink core performs request validation according to declared
 policies. Expand the expressiveness of those policies (min/max length
 and bitmasks). Allow dumping policies for particular commands.
 This is used for feature discovery by user space (instead of kernel
 version parsing or trial and error).
 
 Support IGMPv3/MLDv2 multicast listener discovery protocols in bridge.
 
 Allow more than 255 IPv4 multicast interfaces.
 
 Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
 packets of TCPv6.
 
 In Multi-patch TCP (MPTCP) support concurrent transmission of data
 on multiple subflows in a load balancing scenario. Enhance advertising
 addresses via the RM_ADDR/ADD_ADDR options.
 
 Support SMC-Dv2 version of SMC, which enables multi-subnet deployments.
 
 Allow more calls to same peer in RxRPC.
 
 Support two new Controller Area Network (CAN) protocols -
 CAN-FD and ISO 15765-2:2016.
 
 Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
 kernel problem.
 
 Add TC actions for implementing MPLS L2 VPNs.
 
 Improve nexthop code - e.g. handle various corner cases when nexthop
 objects are removed from groups better, skip unnecessary notifications
 and make it easier to offload nexthops into HW by converting
 to a blocking notifier.
 
 Support adding and consuming TCP header options by BPF programs,
 opening the doors for easy experimental and deployment-specific
 TCP option use.
 
 Reorganize TCP congestion control (CC) initialization to simplify life
 of TCP CC implemented in BPF.
 
 Add support for shipping BPF programs with the kernel and loading them
 early on boot via the User Mode Driver mechanism, hence reusing all the
 user space infra we have.
 
 Support sleepable BPF programs, initially targeting LSM and tracing.
 
 Add bpf_d_path() helper for returning full path for given 'struct path'.
 
 Make bpf_tail_call compatible with bpf-to-bpf calls.
 
 Allow BPF programs to call map_update_elem on sockmaps.
 
 Add BPF Type Format (BTF) support for type and enum discovery, as
 well as support for using BTF within the kernel itself (current use
 is for pretty printing structures).
 
 Support listing and getting information about bpf_links via the bpf
 syscall.
 
 Enhance kernel interfaces around NIC firmware update. Allow specifying
 overwrite mask to control if settings etc. are reset during update;
 report expected max time operation may take to users; support firmware
 activation without machine reboot incl. limits of how much impact
 reset may have (e.g. dropping link or not).
 
 Extend ethtool configuration interface to report IEEE-standard
 counters, to limit the need for per-vendor logic in user space.
 
 Adopt or extend devlink use for debug, monitoring, fw update
 in many drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw,
 mv88e6xxx, dpaa2-eth).
 
 In mlxsw expose critical and emergency SFP module temperature alarms.
 Refactor port buffer handling to make the defaults more suitable and
 support setting these values explicitly via the DCBNL interface.
 
 Add XDP support for Intel's igb driver.
 
 Support offloading TC flower classification and filtering rules to
 mscc_ocelot switches.
 
 Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
 fixed interval period pulse generator and one-step timestamping in
 dpaa-eth.
 
 Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
 offload.
 
 Add Lynx PHY/PCS MDIO module, and convert various drivers which have
 this HW to use it. Convert mvpp2 to split PCS.
 
 Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
 7-port Mediatek MT7531 IP.
 
 Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
 and wcn3680 support in wcn36xx.
 
 Improve performance for packets which don't require much offloads
 on recent Mellanox NICs by 20% by making multiple packets share
 a descriptor entry.
 
 Move chelsio inline crypto drivers (for TLS and IPsec) from the crypto
 subtree to drivers/net. Move MDIO drivers out of the phy directory.
 
 Clean up a lot of W=1 warnings, reportedly the actively developed
 subsections of networking drivers should now build W=1 warning free.
 
 Make sure drivers don't use in_interrupt() to dynamically adapt their
 code. Convert tasklets to use new tasklet_setup API (sadly this
 conversion is not yet complete).
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+ItRwACgkQMUZtbf5S
 IrtTMg//UxpdR/MirT1DatBU0K/UGAZY82hV7F/UC8tPgjfHZeHvWlDFxfi3YP81
 PtPKbhRZ7DhwBXefUp6nY3UdvjftrJK2lJm8prJUPSsZRye8Wlcb7y65q7/P2y2U
 Efucyopg6RUrmrM0DUsIGYGJgylQLHnMYUl/keCsD4t5Bp4ksyi9R2t5eitGoWzh
 r3QGdbSa0AuWx4iu0i+tqp6Tj0ekMBMXLVb35dtU1t0joj2KTNEnSgABN3prOa8E
 iWYf2erOau68Ogp3yU3miCy0ZU4p/7qGHTtzbcp677692P/ekak6+zmfHLT9/Pjy
 2Stq2z6GoKuVxdktr91D9pA3jxG4LxSJmr0TImcGnXbvkMP3Ez3g9RrpV5fn8j6F
 mZCH8TKZAoD5aJrAJAMkhZmLYE1pvDa7KolSk8WogXrbCnTEb5Nv8FHTS1Qnk3yl
 wSKXuvutFVNLMEHCnWQLtODbTST9DI/aOi6EctPpuOA/ZyL1v3pl+gfp37S+LUTe
 owMnT/7TdvKaTD0+gIyU53M6rAWTtr5YyRQorX9awIu/4Ha0F0gYD7BJZQUGtegp
 HzKt59NiSrFdbSH7UdyemdBF4LuCgIhS7rgfeoUXMXmuPHq7eHXyHZt5dzPPa/xP
 81P0MAvdpFVwg8ij2yp2sHS7sISIRKq17fd1tIewUabxQbjXqPc=
 =bc1U
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

 - Add redirect_neigh() BPF packet redirect helper, allowing to limit
   stack traversal in common container configs and improving TCP
   back-pressure.

   Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

 - Expand netlink policy support and improve policy export to user
   space. (Ge)netlink core performs request validation according to
   declared policies. Expand the expressiveness of those policies
   (min/max length and bitmasks). Allow dumping policies for particular
   commands. This is used for feature discovery by user space (instead
   of kernel version parsing or trial and error).

 - Support IGMPv3/MLDv2 multicast listener discovery protocols in
   bridge.

 - Allow more than 255 IPv4 multicast interfaces.

 - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
   packets of TCPv6.

 - In Multi-patch TCP (MPTCP) support concurrent transmission of data on
   multiple subflows in a load balancing scenario. Enhance advertising
   addresses via the RM_ADDR/ADD_ADDR options.

 - Support SMC-Dv2 version of SMC, which enables multi-subnet
   deployments.

 - Allow more calls to same peer in RxRPC.

 - Support two new Controller Area Network (CAN) protocols - CAN-FD and
   ISO 15765-2:2016.

 - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
   kernel problem.

 - Add TC actions for implementing MPLS L2 VPNs.

 - Improve nexthop code - e.g. handle various corner cases when nexthop
   objects are removed from groups better, skip unnecessary
   notifications and make it easier to offload nexthops into HW by
   converting to a blocking notifier.

 - Support adding and consuming TCP header options by BPF programs,
   opening the doors for easy experimental and deployment-specific TCP
   option use.

 - Reorganize TCP congestion control (CC) initialization to simplify
   life of TCP CC implemented in BPF.

 - Add support for shipping BPF programs with the kernel and loading
   them early on boot via the User Mode Driver mechanism, hence reusing
   all the user space infra we have.

 - Support sleepable BPF programs, initially targeting LSM and tracing.

 - Add bpf_d_path() helper for returning full path for given 'struct
   path'.

 - Make bpf_tail_call compatible with bpf-to-bpf calls.

 - Allow BPF programs to call map_update_elem on sockmaps.

 - Add BPF Type Format (BTF) support for type and enum discovery, as
   well as support for using BTF within the kernel itself (current use
   is for pretty printing structures).

 - Support listing and getting information about bpf_links via the bpf
   syscall.

 - Enhance kernel interfaces around NIC firmware update. Allow
   specifying overwrite mask to control if settings etc. are reset
   during update; report expected max time operation may take to users;
   support firmware activation without machine reboot incl. limits of
   how much impact reset may have (e.g. dropping link or not).

 - Extend ethtool configuration interface to report IEEE-standard
   counters, to limit the need for per-vendor logic in user space.

 - Adopt or extend devlink use for debug, monitoring, fw update in many
   drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
   dpaa2-eth).

 - In mlxsw expose critical and emergency SFP module temperature alarms.
   Refactor port buffer handling to make the defaults more suitable and
   support setting these values explicitly via the DCBNL interface.

 - Add XDP support for Intel's igb driver.

 - Support offloading TC flower classification and filtering rules to
   mscc_ocelot switches.

 - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
   fixed interval period pulse generator and one-step timestamping in
   dpaa-eth.

 - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
   offload.

 - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
   this HW to use it. Convert mvpp2 to split PCS.

 - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
   7-port Mediatek MT7531 IP.

 - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
   and wcn3680 support in wcn36xx.

 - Improve performance for packets which don't require much offloads on
   recent Mellanox NICs by 20% by making multiple packets share a
   descriptor entry.

 - Move chelsio inline crypto drivers (for TLS and IPsec) from the
   crypto subtree to drivers/net. Move MDIO drivers out of the phy
   directory.

 - Clean up a lot of W=1 warnings, reportedly the actively developed
   subsections of networking drivers should now build W=1 warning free.

 - Make sure drivers don't use in_interrupt() to dynamically adapt their
   code. Convert tasklets to use new tasklet_setup API (sadly this
   conversion is not yet complete).

* tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
  Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
  net, sockmap: Don't call bpf_prog_put() on NULL pointer
  bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
  bpf, sockmap: Add locking annotations to iterator
  netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
  net: fix pos incrementment in ipv6_route_seq_next
  net/smc: fix invalid return code in smcd_new_buf_create()
  net/smc: fix valid DMBE buffer sizes
  net/smc: fix use-after-free of delayed events
  bpfilter: Fix build error with CONFIG_BPFILTER_UMH
  cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
  net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
  bpf: Fix register equivalence tracking.
  rxrpc: Fix loss of final ack on shutdown
  rxrpc: Fix bundle counting for exclusive connections
  netfilter: restore NF_INET_NUMHOOKS
  ibmveth: Identify ingress large send packets.
  ibmveth: Switch order of ibmveth_helper calls.
  cxgb4: handle 4-tuple PEDIT to NAT mode translation
  selftests: Add VRF route leaking tests
  ...
2020-10-15 18:42:13 -07:00
Linus Torvalds bbf6259903 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial updates from Jiri Kosina:
 "The latest advances in computer science from the trivial queue"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  xtensa: fix Kconfig typo
  spelling.txt: Remove some duplicate entries
  mtd: rawnand: oxnas: cleanup/simplify code
  selftests: vm: add fragment CONFIG_GUP_BENCHMARK
  perf: Fix opt help text for --no-bpf-event
  HID: logitech-dj: Fix spelling in comment
  bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG
  MAINTAINERS: rectify MMP SUPPORT after moving cputype.h
  scif: Fix spelling of EACCES
  printk: fix global comment
  lib/bitmap.c: fix spello
  fs: Fix missing 'bit' in comment
2020-10-15 15:11:56 -07:00
Linus Torvalds 4a165feba2 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl+IUDwACgkQnJ2qBz9k
 QNnjoAf8CVCf/N2kblZywcy/Ks/2WzlZDU4S/dVr93O2VLXEb6kT+onZq4tc/+sY
 QIV8kUMGvt0Ez/KpwY6/wOUVEPZn8bSqM0vUsB+Xi8CwSz/CwcDKTbU/tm1667J0
 eSFX96OL2hT5BockTzuESFohtykiPU3CvW10ae02N6k6XKTS0+frkn3cebJKno2O
 NDt0j2HkVjvpO6DqS7lnceFxo4t6DV6YXJHMtkJqRk8hYItzjyjOB6Df+1WHJJOA
 4jxlobY4qR2MHh1y2ncryPvvQ91sMIxx1beMMXXUIXr+8Zu8aFVWBLgSwyoxBgRP
 G+ctJu3zMCmeeGnkpLfSlJK9ipYUMw==
 =nXw4
 -----END PGP SIGNATURE-----

Merge tag 'dio_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull direct-io fix from Jan Kara:
 "Fix for unaligned direct IO read past EOF in legacy DIO code"

* tag 'dio_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  direct-io: defer alignment check until after the EOF check
  direct-io: don't force writeback for reads beyond EOF
  direct-io: clean up error paths of do_blockdev_direct_IO
2020-10-15 15:03:10 -07:00
Linus Torvalds b77a69b81c \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAl+ITrkACgkQnJ2qBz9k
 QNnNDgf/fEA4pI24FUlvdndDSLS51XEueSuzqjCU1cQ1C1uVmAf//gXkyQ7wJ/ef
 Ph8hvHIaezpG6gE3xEQkREvf4EZQiIYDpjprz6ARLxn0rMdMDAqVDZ+5+F2Rlrk4
 uPPYgc8cbyIHMNLQ2SBFRzb0xm/tuNlvLaQawKiaoZI8NdKJ1U8uGt7o1QFrDGGs
 XdMdoYRHEYbaXao4PCH96JjNEA8zzPUhbDNYB+wwwqzzx5vfWLZK6SU0VivojNDD
 JV4VhvYrQUkZ4gwePYhmS18Kp6GRkGM18Cu7Nh/R1ltUk4AdHmjTNGeRbGXqjlso
 Q7v5tg5fQ0MUCcHzuZgmqgkgCd5pHw==
 =roOT
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF, reiserfs, ext2, quota fixes from Jan Kara:

 - a couple of UDF fixes for issues found by syzbot fuzzing

 - a couple of reiserfs fixes for issues found by syzbot fuzzing

 - some minor ext2 cleanups

 - quota patches to support grace times beyond year 2038 for XFS quota
   APIs

* tag 'fs_for_v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Fix oops during mount
  udf: Limit sparing table size
  udf: Remove pointless union in udf_inode_info
  udf: Avoid accessing uninitialized data on failed inode read
  quota: clear padding in v2r1_mem2diskdqb()
  reiserfs: Initialize inode keys properly
  udf: Fix memory leak when mounting
  udf: Remove redundant initialization of variable ret
  reiserfs: only call unlock_new_inode() if I_NEW
  ext2: Fix some kernel-doc warnings in balloc.c
  quota: Expand comment describing d_itimer
  quota: widen timestamps for the fs_disk_quota structure
  reiserfs: Fix memory leak in reiserfs_parse_options()
  udf: Use kvzalloc() in udf_sb_alloc_bitmap()
  ext2: remove duplicate include
2020-10-15 14:56:15 -07:00
Matthew Wilcox (Oracle) 7b84b665c8 fs: Allow a NULL pos pointer to __kernel_read
Match the behaviour of new_sync_read() and __kernel_write().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-15 14:20:42 -04:00
Matthew Wilcox (Oracle) 4c207ef482 fs: Allow a NULL pos pointer to __kernel_write
Linus prefers that callers be allowed to pass in a NULL pointer for ppos
like new_sync_write().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-15 14:20:41 -04:00
Trond Myklebust 094eca3719 NFSv4: Fix up RCU annotations for struct nfs_netns_client
The identifier is read as an RCU protected string. Its value may
be changed during the lifetime of the network namespace by writing
a new string into the sysfs pseudofile (at which point, we free the
old string only after a call to synchronize_rcu()).

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-15 13:31:08 -04:00
Linus Torvalds 726eb70e0d Char/Misc driver patches for 5.10-rc1
Here is the big set of char, misc, and other assorted driver subsystem
 patches for 5.10-rc1.
 
 There's a lot of different things in here, all over the drivers/
 directory.  Some summaries:
 	- soundwire driver updates
 	- habanalabs driver updates
 	- extcon driver updates
 	- nitro_enclaves new driver
 	- fsl-mc driver and core updates
 	- mhi core and bus updates
 	- nvmem driver updates
 	- eeprom driver updates
 	- binder driver updates and fixes
 	- vbox minor bugfixes
 	- fsi driver updates
 	- w1 driver updates
 	- coresight driver updates
 	- interconnect driver updates
 	- misc driver updates
 	- other minor driver updates
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4g8YQ8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+yngKgCeNpArCP/9vQJRK9upnDm8ZLunSCUAn1wUT/2A
 /bTQ42c/WRQ+LU828GSM
 =6sO2
 -----END PGP SIGNATURE-----

Merge tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver updates from Greg KH:
 "Here is the big set of char, misc, and other assorted driver subsystem
  patches for 5.10-rc1.

  There's a lot of different things in here, all over the drivers/
  directory. Some summaries:

   - soundwire driver updates

   - habanalabs driver updates

   - extcon driver updates

   - nitro_enclaves new driver

   - fsl-mc driver and core updates

   - mhi core and bus updates

   - nvmem driver updates

   - eeprom driver updates

   - binder driver updates and fixes

   - vbox minor bugfixes

   - fsi driver updates

   - w1 driver updates

   - coresight driver updates

   - interconnect driver updates

   - misc driver updates

   - other minor driver updates

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (396 commits)
  binder: fix UAF when releasing todo list
  docs: w1: w1_therm: Fix broken xref, mistakes, clarify text
  misc: Kconfig: fix a HISI_HIKEY_USB dependency
  LSM: Fix type of id parameter in kernel_post_load_data prototype
  misc: Kconfig: add a new dependency for HISI_HIKEY_USB
  firmware_loader: fix a kernel-doc markup
  w1: w1_therm: make w1_poll_completion static
  binder: simplify the return expression of binder_mmap
  test_firmware: Test partial read support
  firmware: Add request_partial_firmware_into_buf()
  firmware: Store opt_flags in fw_priv
  fs/kernel_file_read: Add "offset" arg for partial reads
  IMA: Add support for file reads without contents
  LSM: Add "contents" flag to kernel_read_file hook
  module: Call security_kernel_post_load_data()
  firmware_loader: Use security_post_load_data()
  LSM: Introduce kernel_post_load_data() hook
  fs/kernel_read_file: Add file_size output argument
  fs/kernel_read_file: Switch buffer size arg to size_t
  fs/kernel_read_file: Remove redundant size argument
  ...
2020-10-15 10:01:51 -07:00
Darrick J. Wong 407e9c63ee vfs: move the generic write and copy checks out of mm
The generic write check helpers also don't have much to do with the page
cache, so move them to the vfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-15 09:50:01 -07:00
Darrick J. Wong 1b2c54d63c vfs: move the remap range helpers to remap_range.c
Complete the migration by moving the file remapping helper functions out
of read_write.c and into remap_range.c.  This reduces the clutter in the
first file and (eventually) will make it so that we can compile out the
second file if it isn't needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-15 09:48:49 -07:00
Bob Peterson e2c6c8a797 gfs2: eliminate GLF_QUEUED flag in favor of list_empty(gl_holders)
Before this patch, glock.c maintained a flag, GLF_QUEUED, which indicated
when a glock had a holder queued. It was only checked for inode glocks,
although set and cleared by all glocks, and it was only used to determine
whether the glock should be held for the minimum hold time before releasing.

The problem is that the flag is not accurate at all. If a process holds
the glock, the flag is set. When they dequeue the glock, it only cleared
the flag in cases when the state actually changed. So if the state doesn't
change, the flag may still be set, even when nothing is queued.

This happens to iopen glocks often: the get held in SH, then the file is
closed, but the glock remains in SH mode.

We don't need a special flag to indicate this: we can simply tell whether
the glock has any items queued to the holders queue. It's a waste of cpu
time to maintain it.

This patch eliminates the flag in favor of simply checking list_empty
on the glock holders.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 17:04:53 +02:00
Bob Peterson b2a846dbef gfs2: Ignore journal log writes for jdata holes
When flushing out its ail1 list, gfs2_write_jdata_page calls function
__block_write_full_page passing in function gfs2_get_block_noalloc.
But there was a problem when a process wrote to a jdata file, then
truncated it or punched a hole, leaving references to the blocks within
the new hole in its ail list, which are to be written to the journal log.

In writing them to the journal, after calling gfs2_block_map, function
gfs2_get_block_noalloc determined that the (hole-punched) block was not
mapped, so it returned -EIO to generic_writepages, which passed it back
to gfs2_ail1_start_one. This, in turn, performed a withdraw, assuming
there was a real IO error writing to the journal.

This might be a valid error when writing metadata to the journal, but for
journaled data writes, it does not warrant a withdraw.

This patch adds a check to function gfs2_block_map that makes an exception
for journaled data writes that correspond to jdata holes: If the iomap
get function returns a block type of IOMAP_HOLE, it instead returns
-ENODATA which does not cause the withdraw. Other errors are returned as
before.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:04 +02:00
Bob Peterson a6645745d4 gfs2: simplify gfs2_block_map
Function gfs2_block_map had a lot of redundancy between its create and
no_create paths. This patch simplifies the code to eliminate the redundancy.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:04 +02:00
Bob Peterson 6302d6f43e gfs2: Only set PageChecked if we have a transaction
With jdata writes, we frequently got into situations where gfs2 deadlocked
because of this calling sequence:

gfs2_ail1_start
   gfs2_ail1_flush - for every tr on the sd_ail1_list:
      gfs2_ail1_start_one - for every bd on the tr's tr_ail1_list:
         generic_writepages
	    write_cache_pages passing __writepage()
	       calls clear_page_dirty_for_io which calls set_page_dirty:
	          which calls jdata_set_page_dirty which sets PageChecked.
	       __writepage() calls
	          mapping->a_ops->writepage AKA gfs2_jdata_writepage

However, gfs2_jdata_writepage checks if PageChecked is set, and if so, it
ignores the write and redirties the page. The problem is that write_cache_pages
calls clear_page_dirty_for_io, which often calls set_page_dirty(). See comments
in page-writeback.c starting with "Yes, Virginia". If it's jdata,
set_page_dirty will call jdata_set_page_dirty which will set PageChecked.
That causes a conflict because it makes it look like the page has been
redirtied by another writer, in which case we need to skip writing it and
redirty the page. That ends up in a deadlock because it isn't a "real" writer
and nothing will ever clear PageChecked.

If we do have a real writer, it will have started a transaction. So this
patch checks if a transaction is in use, and if not, it skips setting
PageChecked. That way, the page will be dirtied, cleaned, and written
appropriately.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 249ffe18c6 gfs2: don't lock sd_ail_lock in gfs2_releasepage
Patch 380f7c65a7 changed gfs2_releasepage
so that it held the sd_ail_lock spin_lock for most of its processing.
It did this for some mysterious undocumented bug somewhere in the
evict code path. But in the nine years since, evict has been reworked
and fixed many times, and so have the transactions and ail list.
I can't see a reason to hold the sd_ail_lock unless it's protecting
the actual ail lists hung off the transactions. Therefore, this patch
removes the locking to increase speed and efficiency, and to further help
us rework the log flush code to be more concurrent with transactions.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 36c783092d gfs2: make gfs2_ail1_empty_one return the count of active items
This patch is one baby step toward simplifying the journal management.
It simply changes function gfs2_ail1_empty_one from a void to an int and
makes it return a count of active items. This allows the caller to check
the return code rather than list_empty on the tr_ail1_list. This way
we can, in a later patch, combine transaction ail1 and ail2 lists.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 68942870c6 gfs2: Wipe jdata and ail1 in gfs2_journal_wipe, formerly gfs2_meta_wipe
Before this patch, when blocks were freed, it called gfs2_meta_wipe to
take the metadata out of the pending journal blocks. It did this mostly
by calling another function called gfs2_remove_from_journal. This is
shortsighted because it does not do anything with jdata blocks which
may also be in the journal.

This patch expands the function so that it wipes out jdata blocks from
the journal as well, and it wipes it from the ail1 list if it hasn't
been written back yet. Since it now processes jdata blocks as well,
the function has been renamed from gfs2_meta_wipe to gfs2_journal_wipe.

New function gfs2_ail1_wipe wants a static view of the ail list, so it
locks the sd_ail_lock when removing items. To accomplish this, function
gfs2_remove_from_journal no longer locks the sd_ail_lock, and it's now
the caller's responsibility to do so.

I was going to make sd_ail_lock locking conditional, but the practice is
generally frowned upon. For details, see: https://lwn.net/Articles/109066/

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 97c5e43d51 gfs2: enhance log_blocks trace point to show log blocks free
This patch adds some code to enhance the log_blocks trace point. It
reports the number of free log blocks. This makes the trace point much
more useful, especially for debugging performance problems when we can
tell when the journal gets full and needs to wait for flushes, etc.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 77650bdbd2 gfs2: add missing log_blocks trace points in gfs2_write_revokes
Function gfs2_write_revokes was incrementing and decrementing the number
of log blocks free, but there was never a log_blocks trace point for it.
Thus, the free blocks from a log_blocks trace would jump around
mysteriously.

This patch adds the missing trace points so the trace makes more sense.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Bob Peterson 21b6924bb7 gfs2: rename gfs2_write_full_page to gfs2_write_jdata_page, remove parm
Since the function is only used for writing jdata pages, this patch
simply renames function gfs2_write_full_page to a more appropriate
name: gfs2_write_jdata_page. This makes the code easier to understand.

The function was only called in one place, which passed in a pointer to
function gfs2_get_block_noalloc. The function doesn't need to be
passed in. Therefore, this also eliminates the unnecessary parameter
to increase efficiency.

I also took the liberty of cleaning up the function comments.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Anant Thazhemadam 0ddc5154b2 gfs2: add validation checks for size of superblock
In gfs2_check_sb(), no validation checks are performed with regards to
the size of the superblock.
syzkaller detected a slab-out-of-bounds bug that was primarily caused
because the block size for a superblock was set to zero.
A valid size for a superblock is a power of 2 between 512 and PAGE_SIZE.
Performing validation checks and ensuring that the size of the superblock
is valid fixes this bug.

Reported-by: syzbot+af90d47a37376844e731@syzkaller.appspotmail.com
Tested-by: syzbot+af90d47a37376844e731@syzkaller.appspotmail.com
Suggested-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
[Minor code reordering.]
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-15 14:29:03 +02:00
Darrick J. Wong 02e83f46eb vfs: move generic_remap_checks out of mm
I would like to move all the generic helpers for the vfs remap range
functionality (aka clonerange and dedupe) into a separate file so that
they won't be scattered across the vfs and the mm subsystems.  The
eventual goal is to be able to deselect remap_range.c if none of the
filesystems need that code, but the tricky part here is picking a
stable(ish) part of the merge window to rearrange code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-14 16:47:08 -07:00
Linus Torvalds fe151462bd Driver Core patches for 5.10-rc1
Here is the "big" set of driver core patches for 5.10-rc1
 
 They include a lot of different things, all related to the driver core
 and/or some driver logic:
 	- sysfs common write functions to make it easier to audit sysfs
 	  attributes
 	- device connection cleanups and fixes
 	- devm helpers for a few functions
 	- NOIO allocations for when devices are being removed
 	- minor cleanups and fixes
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCX4c4yA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylS7gCfcS+7/PE42eXxMY0z8rBX8aDMadIAn2DVEghA
 Eoh9UoMEW4g1uMKORA0c
 =CVAW
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the "big" set of driver core patches for 5.10-rc1

  They include a lot of different things, all related to the driver core
  and/or some driver logic:

   - sysfs common write functions to make it easier to audit sysfs
     attributes

   - device connection cleanups and fixes

   - devm helpers for a few functions

   - NOIO allocations for when devices are being removed

   - minor cleanups and fixes

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (31 commits)
  regmap: debugfs: use semicolons rather than commas to separate statements
  platform/x86: intel_pmc_core: do not create a static struct device
  drivers core: node: Use a more typical macro definition style for ACCESS_ATTR
  drivers core: Use sysfs_emit for shared_cpu_map_show and shared_cpu_list_show
  mm: and drivers core: Convert hugetlb_report_node_meminfo to sysfs_emit
  drivers core: Miscellaneous changes for sysfs_emit
  drivers core: Reindent a couple uses around sysfs_emit
  drivers core: Remove strcat uses around sysfs_emit and neaten
  drivers core: Use sysfs_emit and sysfs_emit_at for show(device *...) functions
  sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
  dyndbg: use keyword, arg varnames for query term pairs
  driver core: force NOIO allocations during unplug
  platform_device: switch to simpler IDA interface
  driver core: platform: Document return type of more functions
  Revert "driver core: Annotate dev_err_probe() with __must_check"
  Revert "test_firmware: Test platform fw loading on non-EFI systems"
  iio: adc: xilinx-xadc: use devm_krealloc()
  hwmon: pmbus: use more devres helpers
  devres: provide devm_krealloc()
  syscore: Use pm_pr_dbg() for syscore_{suspend,resume}()
  ...
2020-10-14 16:09:32 -07:00
Andrii Nakryiko 09cad07547 fs: fix NULL dereference due to data race in prepend_path()
Fix data race in prepend_path() with re-reading mnt->mnt_ns twice
without holding the lock.

is_mounted() does check for NULL, but is_anon_ns(mnt->mnt_ns) might
re-read the pointer again which could be NULL already, if in between
reads one of kern_unmount()/kern_unmount_array()/umount_tree() sets
mnt->mnt_ns to NULL.

This is seen in production with the following stack trace:

  BUG: kernel NULL pointer dereference, address: 0000000000000048
  ...
  RIP: 0010:prepend_path.isra.4+0x1ce/0x2e0
  Call Trace:
    d_path+0xe6/0x150
    proc_pid_readlink+0x8f/0x100
    vfs_readlink+0xf8/0x110
    do_readlinkat+0xfd/0x120
    __x64_sys_readlinkat+0x1a/0x20
    do_syscall_64+0x42/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: f2683bd8d5 ("[PATCH] fix d_absolute_path() interplay with fsmount()")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-14 14:54:45 -07:00
Jamie Iles c2a04b02c0 gfs2: use-after-free in sysfs deregistration
syzkaller found the following splat with CONFIG_DEBUG_KOBJECT_RELEASE=y:

  Read of size 1 at addr ffff000028e896b8 by task kworker/1:2/228

  CPU: 1 PID: 228 Comm: kworker/1:2 Tainted: G S                5.9.0-rc8+ #101
  Hardware name: linux,dummy-virt (DT)
  Workqueue: events kobject_delayed_cleanup
  Call trace:
   dump_backtrace+0x0/0x4d8
   show_stack+0x34/0x48
   dump_stack+0x174/0x1f8
   print_address_description.constprop.0+0x5c/0x550
   kasan_report+0x13c/0x1c0
   __asan_report_load1_noabort+0x34/0x60
   memcmp+0xd0/0xd8
   gfs2_uevent+0xc4/0x188
   kobject_uevent_env+0x54c/0x1240
   kobject_uevent+0x2c/0x40
   __kobject_del+0x190/0x1d8
   kobject_delayed_cleanup+0x2bc/0x3b8
   process_one_work+0x96c/0x18c0
   worker_thread+0x3f0/0xc30
   kthread+0x390/0x498
   ret_from_fork+0x10/0x18

  Allocated by task 1110:
   kasan_save_stack+0x28/0x58
   __kasan_kmalloc.isra.0+0xc8/0xe8
   kasan_kmalloc+0x10/0x20
   kmem_cache_alloc_trace+0x1d8/0x2f0
   alloc_super+0x64/0x8c0
   sget_fc+0x110/0x620
   get_tree_bdev+0x190/0x648
   gfs2_get_tree+0x50/0x228
   vfs_get_tree+0x84/0x2e8
   path_mount+0x1134/0x1da8
   do_mount+0x124/0x138
   __arm64_sys_mount+0x164/0x238
   el0_svc_common.constprop.0+0x15c/0x598
   do_el0_svc+0x60/0x150
   el0_svc+0x34/0xb0
   el0_sync_handler+0xc8/0x5b4
   el0_sync+0x15c/0x180

  Freed by task 228:
   kasan_save_stack+0x28/0x58
   kasan_set_track+0x28/0x40
   kasan_set_free_info+0x24/0x48
   __kasan_slab_free+0x118/0x190
   kasan_slab_free+0x14/0x20
   slab_free_freelist_hook+0x6c/0x210
   kfree+0x13c/0x460

Use the same pattern as f2fs + ext4 where the kobject destruction must
complete before allowing the FS itself to be freed.  This means that we
need an explicit free_sbd in the callers.

Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
[Also go to fail_free when init_names fails.]
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:43 +02:00
Andrew Price 0e539ca1bb gfs2: Fix NULL pointer dereference in gfs2_rgrp_dump
When an rindex entry is found to be corrupt, compute_bitstructs() calls
gfs2_consist_rgrpd() which calls gfs2_rgrp_dump() like this:

    gfs2_rgrp_dump(NULL, rgd->rd_gl, fs_id_buf);

gfs2_rgrp_dump then dereferences the gl without checking it and we get

    BUG: KASAN: null-ptr-deref in gfs2_rgrp_dump+0x28/0x280

because there's no rgrp glock involved while reading the rindex on mount.

Fix this by changing gfs2_rgrp_dump to take an rgrp argument.

Reported-by: syzbot+43fa87986bdd31df9de6@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:43 +02:00
Christoph Hellwig 2164f9b918 gfs2: use iomap for buffered I/O in ordered and writeback mode
Switch to using the iomap readpage and writepage helpers for all I/O in
the ordered and writeback modes, and thus eliminate using buffer_heads
for I/O in these cases.  The journaled data mode is left untouched.

(Andreas Gruenbacher: In gfs2_unstuffer_page, switch from mark_buffer_dirty
to set_page_dirty instead of accidentally leaving the page / buffer clean.)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:42 +02:00
Bob Peterson ee1e2c773e gfs2: call truncate_inode_pages_final for address space glocks
Before this patch, we were not calling truncate_inode_pages_final for the
address space for glocks, which left the possibility of a leak. We now
take care of the problem instead of complaining, and we do it during
glock tear-down..

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:42 +02:00
Bob Peterson 0a0d9f55c2 gfs2: simplify the logic in gfs2_evict_inode
Now that we've factored out the deleted and undeleted dinode cases
in gfs2_evict_inode, we can greatly simplify the logic. Now the
function is easy to read and understand.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:42 +02:00
Bob Peterson d90be6ab9a gfs2: factor evict_linked_inode out of gfs2_evict_inode
Now that we've factored out the delete-dinode case to simplify
gfs2_evict_inode, we take it a step further and factor out the other
case: where we don't delete the inode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:42 +02:00
Bob Peterson 53dbc27eb1 gfs2: further simplify gfs2_evict_inode with new func evict_should_delete
This patch further simplifies function gfs2_evict_inode() by adding a
new function evict_should_delete. The function may also lock the inode
glock.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:42 +02:00
Bob Peterson 6e7e9a5055 gfs2: factor evict_unlinked_inode out of gfs2_evict_inode
Function gfs2_evict_inode is way too big, complex and unreadable. This
is a baby step toward breaking it apart to be more readable. It factors
out the portion that deletes the online bits for a dinode that is
unlinked and needs to be deleted. A future patch will factor out more.
(If I factor out too much, the patch itself becomes unreadable).

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:41 +02:00
Bob Peterson 23d828fc3f gfs2: rename variable error to ret in gfs2_evict_inode
Function gfs2_evict_inode is too big and unreadable. This patch is just
a baby step toward improving that. This first step just renames variable
error to ret. This will help make future patches more readable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:41 +02:00
Liu Shixin e8a8023ee0 gfs2: convert to use DEFINE_SEQ_ATTRIBUTE macro
Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:41 +02:00
Bob Peterson 521031fa97 gfs2: Fix bad comment for trans_drain
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:41 +02:00
Andreas Gruenbacher 5a61ae1402 gfs2: Make sure we don't miss any delayed withdraws
Commit ca399c96e9 changes gfs2_log_flush to not withdraw the
filesystem while holding the log flush lock, but it fails to check if
the filesystem needs to be withdrawn once the log flush lock has been
released.  Likewise, commit f05b86db31 depends on gfs2_log_flush to
trigger for delayed withdraws.  Add that and clean up the code flow
somewhat.

In gfs2_put_super, add a check for delayed withdraws that have been
missed to prevent these kinds of bugs in the future.

Fixes: ca399c96e9 ("gfs2: flesh out delayed withdraw for gfs2_log_flush")
Fixes: f05b86db31 ("gfs2: Prepare to withdraw as soon as an IO error occurs in log write")
Cc: stable@vger.kernel.org # v5.7+: 462582b99b: gfs2: add some much needed cleanup for log flushes that fail
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-10-14 23:54:41 +02:00
Linus Torvalds 2fc61f25fb New code for 5.10:
- Clean up the buffer ioend calling path so that the retry strategy
   isn't quite so scattered everywhere.
 - Clean up m_sb_bp handling.
 - New feature: storing inode btree counts in the AGI to speed up certain
   mount time per-AG block reservation operatoins and add a little more
   metadata redundancy.
 - New feature: Widen inode timestamps and quota grace expiration
   timestamps to support dates through the year 2486.
 - Get rid of more of our custom buffer allocation API wrappers.
 - Use a proper VLA for shortform xattr structure namevals.
 - Force the log after reflinking or deduping into a file that is opened
   with O_SYNC or O_DSYNC.
 - Fix some math errors in the realtime allocator.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl9iOCEACgkQ+H93GTRK
 tOvn8Q//VYzuMUIxoc9pVfyd0L3ThROKEO2cwzbGEauXXIginmMqfITVCMWtLg9/
 siDnXGFDetqF9g+1DjyzLucVGAleEd0QSCrWqTxNjHUeUu6AtNrv716jQ09BTZD+
 SoD9oZhG0k9seGWrVkQRSfyVyARxw/OEGbnGe5xCR3VTXG/xMTRFOJMsbCM6trhL
 1nHJDhsoWSDeSYi59GEm/nscZ3yuZOcnnkTMpQrdWX+Y2BHzlUqrXuSf1ZxWmfQm
 2RPVxdRRt8Mt0n28oo7eGQ1tC7nYHDqVRlZcM8IuBGIu3kDrPhNVWOIkjzaaBYf9
 goZG67RZ/Rm3GDtlaWtz0KRDpJUKOWV+SuSh3MtpqZSb91llAN2tVbiHKYHW72zG
 Bi+9RCadHKpuU1iyLGP+eaMPkVGV0H2co1DuENJYPy9wQTdRg2LD3qGJuNr7YwMR
 Rs9QBDntjFO0WbC9UiGkIxSryYdgUctLLv3/eT0LpwvoygabFjNQ69hguFRHhfP0
 d1AxS8o2qzyf1v+0NVqM9FAPhiqSY1SBd+seawF5tlQL4OY2BMUgAwdtVqRJapyU
 BoLSih507nbw+R0FqayKpcUraU7OFrY5SZ21hOsHKt9AR3XW+ntU9ySZMM4EdAXx
 DunsgaAk6hvZy99y10O1f1Wo30jZKjgnEGHi/IgjG7FKD3iSIvw=
 =99iy
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "The biggest changes are two new features for the ondisk metadata: one
  to record the sizes of the inode btrees in the AG to increase
  redundancy checks and to improve mount times; and a second new feature
  to support timestamps until the year 2486.

  We also fixed a problem where reflinking into a file that requires
  synchronous writes wouldn't actually flush the updates to disk; clean
  up a fair amount of cruft; and started fixing some bugs in the
  realtime volume code.

  Summary:

   - Clean up the buffer ioend calling path so that the retry strategy
     isn't quite so scattered everywhere.

   - Clean up m_sb_bp handling.

   - New feature: storing inode btree counts in the AGI to speed up
     certain mount time per-AG block reservation operatoins and add a
     little more metadata redundancy.

   - New feature: Widen inode timestamps and quota grace expiration
     timestamps to support dates through the year 2486.

   - Get rid of more of our custom buffer allocation API wrappers.

   - Use a proper VLA for shortform xattr structure namevals.

   - Force the log after reflinking or deduping into a file that is
     opened with O_SYNC or O_DSYNC.

   - Fix some math errors in the realtime allocator"

* tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits)
  xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size
  xfs: make sure the rt allocator doesn't run off the end
  xfs: Remove unneeded semicolon
  xfs: force the log after remapping a synchronous-writes file
  xfs: Convert xfs_attr_sf macros to inline functions
  xfs: Use variable-size array for nameval in xfs_attr_sf_entry
  xfs: Remove typedef xfs_attr_shortform_t
  xfs: remove typedef xfs_attr_sf_entry_t
  xfs: Remove kmem_zalloc_large()
  xfs: enable big timestamps
  xfs: trace timestamp limits
  xfs: widen ondisk quota expiration timestamps to handle y2038+
  xfs: widen ondisk inode timestamps to deal with y2038+
  xfs: redefine xfs_ictimestamp_t
  xfs: redefine xfs_timestamp_t
  xfs: move xfs_log_dinode_to_disk to the log recovery code
  xfs: refactor quota timestamp coding
  xfs: refactor default quota grace period setting code
  xfs: refactor quota expiration timer modification
  xfs: explicitly define inode timestamp range
  ...
2020-10-14 14:06:06 -07:00
Chengguang Xu 788e96d1d3 f2fs: code cleanup by removing unnecessary check
f2fs_seek_block() is only used for regular file,
so don't have to check inline dentry in it.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-14 13:23:41 -07:00
Jamie Iles ae284d87ab f2fs: wait for sysfs kobject removal before freeing f2fs_sb_info
syzkaller found that with CONFIG_DEBUG_KOBJECT_RELEASE=y, unmounting an
f2fs filesystem could result in the following splat:

  kobject: 'loop5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 250)
  kobject: 'f2fs_xattr_entry-7:5' ((____ptrval____)): kobject_release, parent 0000000000000000 (delayed 750)
  ------------[ cut here ]------------
  ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x98
  WARNING: CPU: 0 PID: 699 at lib/debugobjects.c:485 debug_print_object+0x180/0x240
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 0 PID: 699 Comm: syz-executor.5 Tainted: G S                5.9.0-rc8+ #101
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   dump_backtrace+0x0/0x4d8
   show_stack+0x34/0x48
   dump_stack+0x174/0x1f8
   panic+0x360/0x7a0
   __warn+0x244/0x2ec
   report_bug+0x240/0x398
   bug_handler+0x50/0xc0
   call_break_hook+0x160/0x1d8
   brk_handler+0x30/0xc0
   do_debug_exception+0x184/0x340
   el1_dbg+0x48/0xb0
   el1_sync_handler+0x170/0x1c8
   el1_sync+0x80/0x100
   debug_print_object+0x180/0x240
   debug_check_no_obj_freed+0x200/0x430
   slab_free_freelist_hook+0x190/0x210
   kfree+0x13c/0x460
   f2fs_put_super+0x624/0xa58
   generic_shutdown_super+0x120/0x300
   kill_block_super+0x94/0xf8
   kill_f2fs_super+0x244/0x308
   deactivate_locked_super+0x104/0x150
   deactivate_super+0x118/0x148
   cleanup_mnt+0x27c/0x3c0
   __cleanup_mnt+0x28/0x38
   task_work_run+0x10c/0x248
   do_notify_resume+0x9d4/0x1188
   work_pending+0x8/0x34c

Like the error handling for f2fs_register_sysfs(), we need to wait for
the kobject to be destroyed before returning to prevent a potential
use-after-free.

Fixes: bf9e697ecd ("f2fs: expose features to sysfs entry")
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Signed-off-by: Jamie Iles <jamie@nuviainc.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-14 13:23:30 -07:00
Linus Torvalds 37187df45a New code for 5.10:
- Don't WARN_ON weird states that unprivileged users can create.
 - Don't invalidate page cache when direct writes want to fall back to
   buffered.
 - Fix some problems when readahead ios fail.
 - Fix a problem where inline data pages weren't getting flushed during
   an unshare operation.
 - Rework iomap to support arbitrarily many blocks per page in
   preparation to support THP for the page cache.
 - Fix a bug in the blocksize < pagesize buffered io path where we could
   fail to initialize the many-blocks-per-page uptodate bitmap correctly
   when the backing page is actually up to date.  This could cause us to
   forget to write out dirty pages.
 - Split out the generic_write_sync at the end of the directio write path
   so that btrfs can drop the inode lock before sync'ing the file.
 - Call inode_dio_end before trying to sync the file after a O_DSYNC
   direct write (instead of afterwards) to match the behavior of the
   old directio code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl9yB04ACgkQ+H93GTRK
 tOuZxw/+IrBV3HV45PtqQX+HC2F4ebax26cIJrmCQD0neiu16I7H3COjIGN/YOGw
 bN04VirC3bG4BtzVHO/eRHQOCwCevIpP3LkhT6yOfOgkO4Z9Xn/O7E+7uYtgT5Qi
 dBqOFe/aoB6+uHEHaioWUTxF1MlsVqEK/yPWjbSIdQGKFVE03Azj4V5QHtBouF2+
 pNEk7lbBnF0ua3biambeyDO3JTR9dsziIPH8QzQ4M/fMuNLfR2v0s6d4Ol/ndVrC
 Lp3RtThLcioAXh8xSPMO6RVUFfK97SLgNCRngApFbIJn85z9yq7eI7llnhO+XcHF
 FBJ+XottlwJFDt+0xNUaHmjkfUH9GoK8VeFOd3zHvp6xgZZpDkjG2JJk9ZC8Qnn5
 xg4grGngWshNdxFBf8S/O73bAJ1SyRcD5ePYGyMfiij3beGJ0aulKGoYOdDfC/4c
 hHcUc8XpjHSobg5gklQijBif0WIQos1Z4OyDK9d2LqrJOO0NUypO/t2YIdgPFzkj
 rXLmWlKsUYSZyefI5Z8q0AVy7TQGxstS9poC3lkXlsszQ1E5BNup0/bhCGTgCW+5
 az9m41KXxPEDLxieOvIAUhHSSP02IAGQ9Lvvat1GnGfEqShAEWS/IvmIxHDbvyNW
 lZ0NLqNKsItKBH0oIPsrP7fHz2ES1hUIMIaLbApUwKpUcAxrCLY=
 =ocIt
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.10-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "There's not a lot of new stuff going on here -- a little bit of code
  refactoring to make iomap workable with btrfs' fsync locking model,
  cleanups in preparation for adding THP support for filesystems, and
  fixing a data corruption issue for blocksize < pagesize filesystems.

  Summary:

   - Don't WARN_ON weird states that unprivileged users can create.

   - Don't invalidate page cache when direct writes want to fall back to
     buffered.

   - Fix some problems when readahead ios fail.

   - Fix a problem where inline data pages weren't getting flushed
     during an unshare operation.

   - Rework iomap to support arbitrarily many blocks per page in
     preparation to support THP for the page cache.

   - Fix a bug in the blocksize < pagesize buffered io path where we
     could fail to initialize the many-blocks-per-page uptodate bitmap
     correctly when the backing page is actually up to date. This could
     cause us to forget to write out dirty pages.

   - Split out the generic_write_sync at the end of the directio write
     path so that btrfs can drop the inode lock before sync'ing the
     file.

   - Call inode_dio_end before trying to sync the file after a O_DSYNC
     direct write (instead of afterwards) to match the behavior of the
     old directio code"

* tag 'iomap-5.10-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: Call inode_dio_end() before generic_write_sync()
  iomap: Allow filesystem to call iomap_dio_complete without i_rwsem
  iomap: Set all uptodate bits for an Uptodate page
  iomap: Change calling convention for zeroing
  iomap: Convert iomap_write_end types
  iomap: Convert write_count to write_bytes_pending
  iomap: Convert read_count to read_bytes_pending
  iomap: Support arbitrarily many blocks per page
  iomap: Use bitmap ops to set uptodate bits
  iomap: Use kzalloc to allocate iomap_page
  fs: Introduce i_blocks_per_page
  iomap: Fix misplaced page flushing
  iomap: Use round_down/round_up macros in __iomap_write_begin
  iomap: Mark read blocks uptodate in write_begin
  iomap: Clear page error before beginning a write
  iomap: Fix direct I/O write consistency check
  iomap: fix WARN_ON_ONCE() from unprivileged users
2020-10-14 12:23:00 -07:00
Vivek Goyal 42d3e2d041 virtiofs: calculate number of scatter-gather elements accurately
virtiofs currently maps various buffers in scatter gather list and it looks
at number of pages (ap->pages) and assumes that same number of pages will
be used both for input and output (sg_count_fuse_req()), and calculates
total number of scatterlist elements accordingly.

But looks like this assumption is not valid in all the cases. For example,
Cai Qian reported that trinity, triggers warning with virtiofs sometimes.
A closer look revealed that if one calls ioctl(fd, 0x5a004000, buf), it
will trigger following warning.

WARN_ON(out_sgs + in_sgs != total_sgs)

In this case, total_sgs = 8, out_sgs=4, in_sgs=3. Number of pages is 2
(ap->pages), but out_sgs are using both the pages but in_sgs are using
only one page. In this case, fuse_do_ioctl() sets different size values
for input and output.

args->in_args[args->in_numargs - 1].size == 6656
args->out_args[args->out_numargs - 1].size == 4096

So current method of calculating how many scatter-gather list elements
will be used is not accurate. Make calculations more precise by parsing
size and ap->descs.

Reported-by: Qian Cai <cai@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/linux-fsdevel/5ea77e9f6cb8c2db43b09fbd4158ab2d8c066a0a.camel@redhat.com/
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-14 14:16:22 +02:00
Daeho Jeong 8c8cf26ae3 f2fs: fix writecount false positive in releasing compress blocks
In current condition check, if it detects writecount, it return -EBUSY
regardless of f_mode of the file. Fixed it.

Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-13 23:23:34 -07:00
Chao Yu af4b6b8edf f2fs: introduce check_swap_activate_fast()
check_swap_activate() will lookup block mapping via bmap() one by one, so
its performance is very bad, this patch introduces check_swap_activate_fast()
to use f2fs_fiemap() to boost this process, since f2fs_fiemap() will lookup
block mappings in batch, therefore, it can improve swapon()'s performance
significantly.

Note that this enhancement only works when page size is equal to f2fs' block
size.

Testcase: (backend device: zram)
- touch file
- pin & fallocate file to 8GB
- mkswap file
- swapon file

Before:
real	0m2.999s
user	0m0.000s
sys	0m2.980s

After:
real	0m0.081s
user	0m0.000s
sys	0m0.064s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-13 23:23:34 -07:00
Chao Yu 6ed29fe1ca f2fs: don't issue flush in f2fs_flush_device_cache() for nobarrier case
This patch changes f2fs_flush_device_cache() to skip issuing flush for
nobarrier case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-13 23:23:34 -07:00
Jaegeuk Kim 86f33603f8 f2fs: handle errors of f2fs_get_meta_page_nofail
First problem is we hit BUG_ON() in f2fs_get_sum_page given EIO on
f2fs_get_meta_page_nofail().

Quick fix was not to give any error with infinite loop, but syzbot caught
a case where it goes to that loop from fuzzed image. In turned out we abused
f2fs_get_meta_page_nofail() like in the below call stack.

- f2fs_fill_super
 - f2fs_build_segment_manager
  - build_sit_entries
   - get_current_sit_page

INFO: task syz-executor178:6870 can't die for more than 143 seconds.
task:syz-executor178 state:R
 stack:26960 pid: 6870 ppid:  6869 flags:0x00004006
Call Trace:

Showing all locks held in the system:
1 lock held by khungtaskd/1179:
 #0: ffffffff8a554da0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6242
1 lock held by systemd-journal/3920:
1 lock held by in:imklog/6769:
 #0: ffff88809eebc130 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:930
1 lock held by syz-executor178/6870:
 #0: ffff8880925120e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0x201/0xaf0 fs/super.c:229

Actually, we didn't have to use _nofail in this case, since we could return
error to mount(2) already with the error handler.

As a result, this patch tries to 1) remove _nofail callers as much as possible,
2) deal with error case in last remaining caller, f2fs_get_sum_page().

Reported-by: syzbot+ee250ac8137be41d7b13@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-13 23:23:29 -07:00
Suren Baghdasaryan 67197a4f28 mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary
Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec9 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray <timmurray@google.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Eugene Syromiatnikov <esyr@redhat.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
Cc: John Johansen <john.johansen@canonical.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:35 -07:00
Chinwen Chang ff9f47f6f0 mm: proc: smaps_rollup: do not stall write attempts on mmap_lock
smaps_rollup will try to grab mmap_lock and go through the whole vma list
until it finishes the iterating.  When encountering large processes, the
mmap_lock will be held for a longer time, which may block other write
requests like mmap and munmap from progressing smoothly.

There are upcoming mmap_lock optimizations like range-based locks, but the
lock applied to smaps_rollup would be the coarse type, which doesn't avoid
the occurrence of unpleasant contention.

To solve aforementioned issue, we add a check which detects whether anyone
wants to grab mmap_lock for write attempts.

Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Song Liu <songliubraving@fb.com>
Cc: Jimmy Assarsson <jimmyassarsson@gmail.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Link: http://lkml.kernel.org/r/1597715898-3854-4-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:31 -07:00
Chinwen Chang 03b4b11493 mm: smaps*: extend smap_gather_stats to support specified beginning
Extend smap_gather_stats to support indicated beginning address at which
it should start gathering.  To achieve the goal, we add a new parameter
@start assigned by the caller and try to refactor it for simplicity.

If @start is 0, it will use the range of @vma for gathering.

Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jimmy Assarsson <jimmyassarsson@gmail.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/1597715898-3854-3-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:31 -07:00
Matthew Wilcox (Oracle) 8cf886463e proc: optimise smaps for shmem entries
Avoid bumping the refcount on pages when we're only interested in the
swap entries.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Link: https://lkml.kernel.org/r/20200910183318.20139-5-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:29 -07:00
Luo Jiaxing 97383c741b fs_parse: mark fs_param_bad_value() as static
We found the following warning when build kernel with W=1:

fs/fs_parser.c:192:5: warning: no previous prototype for `fs_param_bad_value' [-Wmissing-prototypes]
int fs_param_bad_value(struct p_log *log, struct fs_parameter *param)
     ^
CC      drivers/usb/gadget/udc/snps_udc_core.o

And no header file define a prototype for this function, so we should mark
it as static.

Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/1601293463-25763-1-git-send-email-luojiaxing@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Randy Dunlap da5c1c0bb3 fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr
Fix kernel-doc warnings in fs/xattr.c:

../fs/xattr.c:251: warning: Function parameter or member 'dentry' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'name' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'value' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'size' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'flags' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'delegated_inode' not described in '__vfs_setxattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'dentry' not described in '__vfs_removexattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'name' not described in '__vfs_removexattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'delegated_inode' not described in '__vfs_removexattr_locked'

Fixes: 08b5d5014a ("xattr: break delegations in {set,remove}xattr")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Frank van der Linden <fllinden@amazon.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Link: http://lkml.kernel.org/r/7a3dd5a2-5787-adf3-d525-c203f9910ec4@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Gang He 8dd71b25c5 ocfs2: fix potential soft lockup during fstrim
When we discard unused blocks on a mounted ocfs2 filesystem, fstrim
handles each block goup with locking/unlocking global bitmap meta-file
repeatedly. we should let fstrim thread take a break(if need) between
unlock and lock, this will avoid the potential soft lockup problem,
and also gives the upper applications more IO opportunities, these
applications are not blocked for too long at writing files.

Signed-off-by: Gang He <ghe@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Link: https://lkml.kernel.org/r/20200927015815.14904-1-ghe@suse.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Randy Dunlap 679edeb0ed ocfs2: delete repeated words in comments
Drop duplicated words {the, and} in comments.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Link: https://lkml.kernel.org/r/20200811021845.25134-1-rdunlap@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Rustam Kovhaev 4f8c94022f ntfs: add check for mft record size in superblock
Number of bytes allocated for mft record should be equal to the mft record
size stored in ntfs superblock as reported by syzbot, userspace might
trigger out-of-bounds read by dereferencing ctx->attr in ntfs_attr_find()

Reported-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Link: https://syzkaller.appspot.com/bug?extid=aed06913f36eff9b544e
Link: https://lkml.kernel.org/r/20200824022804.226242-1-rkovhaev@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13 18:38:27 -07:00
Sargun Dhillon 61ca2c4afd NFS: Only reference user namespace from nfs4idmap struct instead of cred
The nfs4idmapper only needs access to the user namespace, and not the
entire cred struct. This replaces the struct cred* member with
struct user_namespace*. This is mostly hygiene, so we don't have to
hold onto the cred object, which has extraneous references to
things like user_struct. This also makes switching away
from init_user_ns more straightforward in the future.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-13 15:56:54 -04:00
Linus Torvalds 6ad4bf6ea1 io_uring-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EXPEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiR4EAC3trm1ojXVF7y9/XRhcPpb4Pror+ZA1coO
 gyoy+zUuCEl9WCzzHWqXULMYMP0YzNJnJs0oLQPA1s0sx1H4uDMl/UXg0OXZisYG
 Y59Kca3c1DHFwj9KPQXfGmCEjc/rbDWK5TqRc2iZMz+6E5Mt71UFZHtenwgV1zD8
 hTmZZkzLCu2ePfOvrPONgL5tDqPWGVyn61phoC7cSzMF66juXGYuvQGktzi/m6q+
 jAxUnhKvKTlLB9wsq3s5X/20/QD56Yuba9U+YxeeNDBE8MDWQOsjz0mZCV1fn4p3
 h/6762aRaWaXH7EwMtsHFUWy7arJZg/YoFYNYLv4Ksyy3y4sMABZCy3A+JyzrgQ0
 hMu7vjsP+k22X1WH8nyejBfWNEmxu6dpgckKrgF0dhJcXk/acWA3XaDWZ80UwfQy
 isKRAP1rA0MJKHDMIwCzSQJDPvtUAkPptbNZJcUSU78o+pPoCaQ93V++LbdgGtKn
 iGJJqX05dVbcsDx5X7fluphjkUTC4yFr7ZgLgbOIedXajWRD8iOkO2xxCHk6SKFl
 iv9entvRcX9k3SHK9uffIUkRBUujMU0+HCIQFCO1qGmkCaS5nSrovZl4HoL7L/Dj
 +T8+v7kyJeklLXgJBaE7jk01O4HwZKjwPWMbCjvL9NKk8j7c1soYnRu5uNvi85Mu
 /9wn671s+w==
 =udgj
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:

 - Add blkcg accounting for io-wq offload (Dennis)

 - A use-after-free fix for io-wq (Hillf)

 - Cancelation fixes and improvements

 - Use proper files_struct references for offload

 - Cleanup of io_uring_get_socket() since that can now go into our own
   header

 - SQPOLL fixes and cleanups, and support for sharing the thread

 - Improvement to how page accounting is done for registered buffers and
   huge pages, accounting the real pinned state

 - Series cleaning up the xarray code (Willy)

 - Various cleanups, refactoring, and improvements (Pavel)

 - Use raw spinlock for io-wq (Sebastian)

 - Add support for ring restrictions (Stefano)

* tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (62 commits)
  io_uring: keep a pointer ref_node in file_data
  io_uring: refactor *files_register()'s error paths
  io_uring: clean file_data access in files_register
  io_uring: don't delay io_init_req() error check
  io_uring: clean leftovers after splitting issue
  io_uring: remove timeout.list after hrtimer cancel
  io_uring: use a separate struct for timeout_remove
  io_uring: improve submit_state.ios_left accounting
  io_uring: simplify io_file_get()
  io_uring: kill extra check in fixed io_file_get()
  io_uring: clean up ->files grabbing
  io_uring: don't io_prep_async_work() linked reqs
  io_uring: Convert advanced XArray uses to the normal API
  io_uring: Fix XArray usage in io_uring_add_task_file
  io_uring: Fix use of XArray in __io_uring_files_cancel
  io_uring: fix break condition for __io_uring_register() waiting
  io_uring: no need to call xa_destroy() on empty xarray
  io_uring: batch account ->req_issue and task struct references
  io_uring: kill callback_head argument for io_req_task_work_add()
  io_uring: move req preps out of io_issue_sqe()
  ...
2020-10-13 12:36:21 -07:00
Linus Torvalds 3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Linus Torvalds dfef313e99 Changes since last update:
- fix an issue which can cause overlay permission problem
    due to duplicated permission check for "trusted." xattrs;
 
  - add REQ_RAHEAD flag to readahead requests for blktrace;
 
  - several random cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCX4VJTxUcaHNpYW5na2Fv
 QHJlZGhhdC5jb20ACgkQOTcx3B+15gRiDgD/QZO8yQezBRPnsuGto7PmkVM9epgp
 fNkWRvPgxsnOh1wA/3NjrxtKyE4TRYQOglcv6RBm19PATSrTnZZ5pD1KGV0O
 =yrlK
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs updates from Gao Xiang:
 "This cycle addresses a reported permission issue with overlay due to a
  duplicated permission check for "trusted." xattrs. Also, a REQ_RAHEAD
  flag is added now to all readahead requests in order to trace
  readahead I/Os. The others are random cleanups.

  All commits have been tested and have been in linux-next as well.

  Summary:

   - fix an issue which can cause overlay permission problem due to
     duplicated permission check for "trusted." xattrs;

   - add REQ_RAHEAD flag to readahead requests for blktrace;

   - several random cleanup"

* tag 'erofs-for-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: remove unnecessary enum entries
  erofs: add REQ_RAHEAD flag to readahead requests
  erofs: fold in should_decompress_synchronously()
  erofs: avoid unnecessary variable `err'
  erofs: remove unneeded parameter
  erofs: avoid duplicated permission check for "trusted." xattrs
2020-10-13 09:09:42 -07:00
Linus Torvalds 11e3235b43 for-5.10-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl+EuosACgkQxWXV+ddt
 WDtwLBAAiVqGGeLPwiy7R4/vMJKY9RYhf2WtCI7/AEtj6efiT1NU1Y5spw5QyJO6
 +zIvhRx/3zXXkz52poVS8kM0aIxH06a7p9SXBvTf6D5CIQf1QJZVZTjgJKVVaYkE
 0LsudJgstjowc02k/pN2uLQzfWPLIscatXTGbUKmz2QEf/1opef8EJy8asnbyCXP
 FTo5a0EN0B1Gq5GPBpdSOLmGyCA51W709EDR4uDr+iRIjI7yfsyGYNJiIEZsOJ8f
 tXLxwx2/CA7OWmlTwGJr180UnjVG/gUu9wm3x9QG5Fre8TwshLEllJaZ9DYLebQl
 JKs15pbaTJQa4R/AVZyD1JuVvV5tCdcfvNOffKR3iFNTeq72Kxoj7W+DQn4z5CgO
 PX3SnuoujWIY1hOhXQLXVXd3MTzRtZ21mn33tUFY2dfJq0U28LygkQg/QsYZxn9E
 WWF0d/ml1q9aF6B68n9ZH9jIOlfGZLKf4Dp65ND5nCpUe3k1H+93JZySQndmw+G7
 gfZMGr9319irM+MqRoxlP4ujL972axZhguF/YZxjPv47uB34n1Q1uLJApsjA/bQl
 7SSe1mNFcm4h+xXzwXTYum7yQ8i41ThPX1UAOmIz0geiGn3W6vBuaOX0n5vU9Ds3
 3RdBmijQmrnua1zaoCj5sr4XmU3I146iNOapfHyQdQhhENEicSc=
 =FNjY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Mostly core updates with a few user visible bits and fixes.

  Hilights:

   - fsync performance improvements
      - less contention of log mutex (throughput +4%, latency -14%,
        dbench with 32 clients)
      - skip unnecessary commits for link and rename (throughput +6%,
        latency -30%, rename latency -75%, dbench with 16 clients)
      - make fast fsync wait only for writeback (throughput +10..40%,
        runtime -1..-20%, dbench with 1 to 64 clients on various
        file/block sizes)

   - direct io is now implemented using the iomap infrastructure, that's
     the main part, we still have a workaround that requires an iomap
     API update, coming in 5.10

   - new sysfs exports:
      - information about the exclusive filesystem operation status
        (balance, device add/remove/replace, ...)
      - supported send stream version

  Core:

   - use ticket space reservations for data, fair policy using the same
     infrastructure as metadata

   - preparatory work to switch locking from our custom tree locks to
     standard rwsem, now the locking context is propagated to all
     callers, actual switch is expected to happen in the next dev cycle

   - seed device structures are now using list API

   - extent tracepoints print proper tree id

   - unified range checks for extent buffer helpers

   - send: avoid using temporary buffer for copying data

   - remove unnecessary RCU protection from space infos

   - remove unused readpage callback for metadata, enabling several
     cleanups

   - replace indirect function calls for end io hooks and remove
     extent_io_ops completely

  Fixes:

   - more lockdep warning fixes

   - fix qgroup reservation for delayed inode and an occasional
     reservation leak for preallocated files

   - fix device replace of a seed device

   - fix metadata reservation for fallocate that leads to transaction
     aborts

   - reschedule if necessary when logging directory items or when
     cloning lots of extents

   - tree-checker: fix false alert caused by legacy btrfs root item

   - send: fix rename/link conflicts for orphanized inodes

   - properly initialize device stats for seed devices

   - skip devices without magic signature when mounting

  Other:

   - error handling improvements, BUG_ONs replaced by proper handling,
     fuzz fixes

   - various function parameter cleanups

   - various W=1 cleanups

   - error/info messages improved

  Mishaps:

   - commit 62cf539120 ("btrfs: move btrfs_rm_dev_replace_free_srcdev
     outside of all locks") is a rebase leftover after the patch got
     merged to 5.9-rc8 as a466c85edc ("btrfs: move
     btrfs_rm_dev_replace_free_srcdev outside of all locks"), the
     remaining part is trivial and the patch is in the middle of the
     series so I'm keeping it there instead of rebasing"

* tag 'for-5.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (161 commits)
  btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag
  btrfs: annotate device name rcu_string with __rcu
  btrfs: skip devices without magic signature when mounting
  btrfs: cleanup cow block on error
  btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
  fs: remove no longer used dio_end_io()
  btrfs: return error if we're unable to read device stats
  btrfs: init device stats for seed devices
  btrfs: remove struct extent_io_ops
  btrfs: call submit_bio_hook directly for metadata pages
  btrfs: stop calling submit_bio_hook for data inodes
  btrfs: don't opencode is_data_inode in end_bio_extent_readpage
  btrfs: call submit_bio_hook directly in submit_one_bio
  btrfs: remove extent_io_ops::readpage_end_io_hook
  btrfs: replace readpage_end_io_hook with direct calls
  btrfs: send, recompute reference path after orphanization of a directory
  btrfs: send, orphanize first all conflicting inodes when processing references
  btrfs: tree-checker: fix false alert caused by legacy btrfs root item
  btrfs: use unaligned helpers for stack and header set/get helpers
  btrfs: free-space-cache: use unaligned helpers to access data
  ...
2020-10-13 09:01:36 -07:00
Linus Torvalds c024a81125 dlm for 5.10
This set continues the ongoing rework of the low level
 communication layer in the dlm.  The focus here is on
 improvements to connection handling, and reworking the
 receiving of messages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJfhJylAAoJEDgbc8f8gGmqiRYP/R8rHeiAtBuTQebIG2S2FRjT
 OoCsr6F240SxyNJYAJlsV4kWGLtRQ0qnHhWku6nAreg9Yw/+0C7ZRHwDNoJsE2/9
 JuCW6HhqN6jYcWWhV3BZA7wvWzPfzdC7Jnla7f9GGB9ToFlani7CLEj5qzkyEIxh
 KaXfFGTHBCftM20HaNcExqBwmB0bn7jiavlR2Nqnsh/FW+er1HPa4rIIJqYy6k31
 ymf0XJ3kZYgf/I4iArUZkR7FKHHy1GhWW10NSQ/DDfwGtkbQ1Lw8IdBZ/zkyheAG
 aInFcxEt+hQPTMOBSB4hJn4+QPvyNAd9UxjFLuaHawUNglH73PXBk77kGgj8xJGU
 qRcaugX5brVV1tpY2UPQO8MC8ITadmKRa7uZkRoI2hIZfsZO2z+TSgRkegsSDIlD
 wXYLQslSYImZ5k42wHqaONxD4nW/haZxdrhul4sP8Z1+d5WmoPE1UDlXMTvbyp/N
 iW3+jhvPc1NAyzEPdmMj/y7zmCX+yrlkRrO1YjTkpEOIpN5uUaxg/1g8ok5OProR
 Xyx4b9gv3r8/3CGQvYTOiNr9ZUj8kPR8rWv4fbQXAt+kQMAKnIjyRKp4SpfFlp1L
 pKv9UU/sdUsduCoiRDD5+SiqNLGDWSH5UAkYvkH0cz3QDELSEYkLlMA6O00QsDAH
 o1f+TFcNFKsphk47DXVo
 =/n6H
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set continues the ongoing rework of the low level communication
  layer in the dlm.

  The focus here is on improvements to connection handling, and
  reworking the receiving of messages"

* tag 'dlm-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: fix race in nodeid2con
  fs: dlm: rework receive handling
  fs: dlm: disallow buffer size below default
  fs: dlm: handle range check as callback
  fs: dlm: fix mark per nodeid setting
  fs: dlm: remove lock dependency warning
  fs: dlm: use free_con to free connection
  fs: dlm: handle possible othercon writequeues
  fs: dlm: move free writequeue into con free
  fs: dlm: fix configfs memory leak
  fs: dlm: fix dlm_local_addr memory leak
  fs: dlm: make connection hash lockless
  fs: dlm: synchronize dlm before shutdown
2020-10-13 08:59:39 -07:00
Linus Torvalds 6f5032a852 fscrypt updates for 5.10
This release, we rework the implementation of creating new encrypted
 files in order to fix some deadlocks and prepare for adding fscrypt
 support to CephFS, which Jeff Layton is working on.
 
 We also export a symbol in preparation for the above-mentioned CephFS
 support and also for ext4/f2fs encrypt+casefold support.
 
 Finally, there are a few other small cleanups.
 
 As usual, all these patches have been in linux-next with no reported
 issues, and I've tested them with xfstests.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX4SD7xQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKy/AAP92oOybTcuahmvAtHqZP9jAFPJrbI3r
 6QLpMFtWznJoOQEAogaWsavtOIBx9afdOfRNj0zdoBIjpXgyMuzR10Ou2gE=
 =B/Mj
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt updates from Eric Biggers:
 "This release, we rework the implementation of creating new encrypted
  files in order to fix some deadlocks and prepare for adding fscrypt
  support to CephFS, which Jeff Layton is working on.

  We also export a symbol in preparation for the above-mentioned CephFS
  support and also for ext4/f2fs encrypt+casefold support.

  Finally, there are a few other small cleanups.

  As usual, all these patches have been in linux-next with no reported
  issues, and I've tested them with xfstests"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: export fscrypt_d_revalidate()
  fscrypt: rename DCACHE_ENCRYPTED_NAME to DCACHE_NOKEY_NAME
  fscrypt: don't call no-key names "ciphertext names"
  fscrypt: use sha256() instead of open coding
  fscrypt: make fscrypt_set_test_dummy_encryption() take a 'const char *'
  fscrypt: handle test_dummy_encryption in more logical way
  fscrypt: move fscrypt_prepare_symlink() out-of-line
  fscrypt: make "#define fscrypt_policy" user-only
  fscrypt: stop pretending that key setup is nofs-safe
  fscrypt: require that fscrypt_encrypt_symlink() already has key
  fscrypt: remove fscrypt_inherit_context()
  fscrypt: adjust logging for in-creation inodes
  ubifs: use fscrypt_prepare_new_inode() and fscrypt_set_context()
  f2fs: use fscrypt_prepare_new_inode() and fscrypt_set_context()
  ext4: use fscrypt_prepare_new_inode() and fscrypt_set_context()
  ext4: factor out ext4_xattr_credits_for_new_inode()
  fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()
  fscrypt: restrict IV_INO_LBLK_32 to ino_bits <= 32
  fscrypt: drop unused inode argument from fscrypt_fname_alloc_buffer
2020-10-13 08:54:00 -07:00
Darrick J. Wong ace74e797a xfs: annotate grabbing the realtime bitmap/summary locks in growfs
Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and
summary locks when we grow the realtime volume, just like we do most
everywhere else.  This shuts up lockdep warnings about grabbing the
ILOCK class of locks recursively:

============================================
WARNING: possible recursive locking detected
5.9.0-rc4-djw #rc4 Tainted: G           O
--------------------------------------------
xfs_growfs/4841 is trying to acquire lock:
ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

but task is already holding lock:
ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-13 08:41:31 -07:00
Darrick J. Wong 7249c95a3f xfs: make xfs_growfs_rt update secondary superblocks
When we call growfs on the data device, we update the secondary
superblocks to reflect the updated filesystem geometry.  We need to do
this for growfs on the realtime volume too, because a future xfs_repair
run could try to fix the filesystem using a backup superblock.

This was observed by the online superblock scrubbers while running
xfs/233.  One can also trigger this by growing an rt volume, cycling the
mount, and creating new rt files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-13 08:41:31 -07:00
Darrick J. Wong f4c32e87de xfs: fix realtime bitmap/summary file truncation when growing rt volume
The realtime bitmap and summary files are regular files that are hidden
away from the directory tree.  Since they're regular files, inode
inactivation will try to purge what it thinks are speculative
preallocations beyond the incore size of the file.  Unfortunately,
xfs_growfs_rt forgets to update the incore size when it resizes the
inodes, with the result that inactivating the rt inodes at unmount time
will cause their contents to be truncated.

Fix this by updating the incore size when we change the ondisk size as
part of updating the superblock.  Note that we don't do this when we're
allocating blocks to the rt inodes because we actually want those blocks
to get purged if the growfs fails.

This fixes corruption complaints from the online rtsummary checker when
running xfs/233.  Since that test requires rmap, one can also trigger
this by growing an rt volume, cycling the mount, and creating rt files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-13 08:41:31 -07:00
Linus Torvalds 22230cd2c5 Merge branch 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat mount cleanups from Al Viro:
 "The last remnants of mount(2) compat buried by Christoph.

  Buried into NFS, that is.

  Generally I'm less enthusiastic about "let's use in_compat_syscall()
  deep in call chain" kind of approach than Christoph seems to be, but
  in this case it's warranted - that had been an NFS-specific wart,
  hopefully not to be repeated in any other filesystems (read: any new
  filesystem introducing non-text mount options will get NAKed even if
  it doesn't mess the layout up).

  IOW, not worth trying to grow an infrastructure that would avoid that
  use of in_compat_syscall()..."

* 'compat.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: remove compat_sys_mount
  fs,nfs: lift compat nfs4 mount data handling into the nfs code
  nfs: simplify nfs4_parse_monolithic
2020-10-12 16:44:57 -07:00
Linus Torvalds e18afa5bfa Merge branch 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat quotactl cleanups from Al Viro:
 "More Christoph's compat cleanups: quotactl(2)"

* 'work.quota-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  quota: simplify the quotactl compat handling
  compat: add a compat_need_64bit_alignment_fixup() helper
  compat: lift compat_s64 and compat_u64 to <asm-generic/compat.h>
2020-10-12 16:37:13 -07:00
Linus Torvalds 85ed13e78d Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat iovec cleanups from Al Viro:
 "Christoph's series around import_iovec() and compat variant thereof"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  security/keys: remove compat_keyctl_instantiate_key_iov
  mm: remove compat_process_vm_{readv,writev}
  fs: remove compat_sys_vmsplice
  fs: remove the compat readv/writev syscalls
  fs: remove various compat readv/writev helpers
  iov_iter: transparently handle compat iovecs in import_iovec
  iov_iter: refactor rw_copy_check_uvector and import_iovec
  iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
  compat.h: fix a spelling error in <linux/compat.h>
2020-10-12 16:35:51 -07:00
Linus Torvalds e6412f9833 EFI changes for v5.10:
- Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree.
 
  - Relax decompressed image placement rules for 32-bit ARM
 
  - Add support for passing MOK certificate table contents via a config table
    rather than a EFI variable.
 
  - Add support for 18 bit DIMM row IDs in the CPER records.
 
  - Work around broken Dell firmware that passes the entire Boot#### variable
    contents as the command line
 
  - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can
    identify it in the memory map listings.
 
  - Don't abort the boot on arm64 if the EFI RNG protocol is available but
    returns with an error
 
  - Replace slashes with exclamation marks in efivarfs file names
 
  - Split efi-pstore from the deprecated efivars sysfs code, so we can
    disable the latter on !x86.
 
  - Misc fixes, cleanups and updates.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl+Ec9QRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1inTQ//TYj3kJq/7sWfUAxmAsWnUEC005YCNf0T
 x3kJQv3zYX4Rl4eEwkff8S1PrqqvwUP5yUZYApp8HD9s9CYvzz5iG5xtf/jX+QaV
 06JnTMnkoycx2NaOlbr1cmcIn4/cAhQVYbVCeVrlf7QL8enNTBr5IIQmo4mgP8Lc
 mauSsO1XU8ZuMQM+JcZSxAkAPxlhz3dbR5GteP4o2K4ShQKpiTCOfOG1J3FvUYba
 s1HGnhHFlkQr6m3pC+iG8dnAG0YtwHMH1eJVP7mbeKUsMXz944U8OVXDWxtn81pH
 /Xt/aFZXnoqwlSXythAr6vFTuEEn40n+qoOK6jhtcGPUeiAFPJgiaeAXw3gO0YBe
 Y8nEgdGfdNOMih94McRd4M6gB/N3vdqAGt+vjiZSCtzE+nTWRyIXSGCXuDVpkvL4
 VpEXpPINnt1FZZ3T/7dPro4X7pXALhODE+pl36RCbfHVBZKRfLV1Mc1prAUGXPxW
 E0MfaM9TxDnVhs3VPWlHmRgavee2MT1Tl/ES4CrRHEoz8ZCcu4MfROQyao8+Gobr
 VR+jVk+xbyDrykEc6jdAK4sDFXpTambuV624LiKkh6Mc4yfHRhPGrmP5c5l7SnCd
 aLp+scQ4T7sqkLuYlXpausXE3h4sm5uur5hNIRpdlvnwZBXpDEpkzI8x0C9OYr0Q
 kvFrreQWPLQ=
 =ZNI8
 -----END PGP SIGNATURE-----

Merge tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull EFI changes from Ingo Molnar:

 - Preliminary RISC-V enablement - the bulk of it will arrive via the
   RISCV tree.

 - Relax decompressed image placement rules for 32-bit ARM

 - Add support for passing MOK certificate table contents via a config
   table rather than a EFI variable.

 - Add support for 18 bit DIMM row IDs in the CPER records.

 - Work around broken Dell firmware that passes the entire Boot####
   variable contents as the command line

 - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we
   can identify it in the memory map listings.

 - Don't abort the boot on arm64 if the EFI RNG protocol is available
   but returns with an error

 - Replace slashes with exclamation marks in efivarfs file names

 - Split efi-pstore from the deprecated efivars sysfs code, so we can
   disable the latter on !x86.

 - Misc fixes, cleanups and updates.

* tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
  efi: mokvar: add missing include of asm/early_ioremap.h
  efi: efivars: limit availability to X86 builds
  efi: remove some false dependencies on CONFIG_EFI_VARS
  efi: gsmi: fix false dependency on CONFIG_EFI_VARS
  efi: efivars: un-export efivars_sysfs_init()
  efi: pstore: move workqueue handling out of efivars
  efi: pstore: disentangle from deprecated efivars module
  efi: mokvar-table: fix some issues in new code
  efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure
  efivarfs: Replace invalid slashes with exclamation marks in dentries.
  efi: Delete deprecated parameter comments
  efi/libstub: Fix missing-prototypes in string.c
  efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it
  cper,edac,efi: Memory Error Record: bank group/address and chip id
  edac,ghes,cper: Add Row Extension to Memory Error Record
  efi/x86: Add a quirk to support command line arguments on Dell EFI firmware
  efi/libstub: Add efi_warn and *_once logging helpers
  integrity: Load certs from the EFI MOK config table
  integrity: Move import of MokListRT certs to a separate routine
  efi: Support for MOK variable config table
  ...
2020-10-12 13:26:49 -07:00
Scott Mayhew a2d24bcb97 nfs: add missing "posix" local_lock constant table definition
"mount -o local_lock=posix..." was broken by the mount API conversion
due to the missing constant.

Fixes: e38bb238ed ("NFS: Convert mount option parsing to use functionality from fs_parser.h")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-12 13:23:14 -04:00
Linus Torvalds 6734e20e39 arm64 updates for 5.10
- Userspace support for the Memory Tagging Extension introduced by Armv8.5.
   Kernel support (via KASAN) is likely to follow in 5.11.
 
 - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
   switching.
 
 - Fix and subsequent rewrite of our Spectre mitigations, including the
   addition of support for PR_SPEC_DISABLE_NOEXEC.
 
 - Support for the Armv8.3 Pointer Authentication enhancements.
 
 - Support for ASID pinning, which is required when sharing page-tables with
   the SMMU.
 
 - MM updates, including treating flush_tlb_fix_spurious_fault() as a no-op.
 
 - Perf/PMU driver updates, including addition of the ARM CMN PMU driver and
   also support to handle CPU PMU IRQs as NMIs.
 
 - Allow prefetchable PCI BARs to be exposed to userspace using normal
   non-cacheable mappings.
 
 - Implementation of ARCH_STACKWALK for unwinding.
 
 - Improve reporting of unexpected kernel traps due to BPF JIT failure.
 
 - Improve robustness of user-visible HWCAP strings and their corresponding
   numerical constants.
 
 - Removal of TEXT_OFFSET.
 
 - Removal of some unused functions, parameters and prototypes.
 
 - Removal of MPIDR-based topology detection in favour of firmware
   description.
 
 - Cleanups to handling of SVE and FPSIMD register state in preparation
   for potential future optimisation of handling across syscalls.
 
 - Cleanups to the SDEI driver in preparation for support in KVM.
 
 - Miscellaneous cleanups and refactoring work.
 -----BEGIN PGP SIGNATURE-----
 
 iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl+AUXMQHHdpbGxAa2Vy
 bmVsLm9yZwAKCRC3rHDchMFjNFc1B/4q2Kabe+pPu7s1f58Q+OTaEfqcr3F1qh27
 F1YpFZUYxg0GPfPsFrnbJpo5WKo7wdR9ceI9yF/GHjs7A/MSoQJis3pG6SlAd9c0
 nMU5tCwhg9wfq6asJtl0/IPWem6cqqhdzC6m808DjeHuyi2CCJTt0vFWH3OeHEhG
 cfmLfaSNXOXa/MjEkT8y1AXJ/8IpIpzkJeCRA1G5s18PXV9Kl5bafIo9iqyfKPLP
 0rJljBmoWbzuCSMc81HmGUQI4+8KRp6HHhyZC/k0WEVgj3LiumT7am02bdjZlTnK
 BeNDKQsv2Jk8pXP2SlrI3hIUTz0bM6I567FzJEokepvTUzZ+CVBi
 =9J8H
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "There's quite a lot of code here, but much of it is due to the
  addition of a new PMU driver as well as some arm64-specific selftests
  which is an area where we've traditionally been lagging a bit.

  In terms of exciting features, this includes support for the Memory
  Tagging Extension which narrowly missed 5.9, hopefully allowing
  userspace to run with use-after-free detection in production on CPUs
  that support it. Work is ongoing to integrate the feature with KASAN
  for 5.11.

  Another change that I'm excited about (assuming they get the hardware
  right) is preparing the ASID allocator for sharing the CPU page-table
  with the SMMU. Those changes will also come in via Joerg with the
  IOMMU pull.

  We do stray outside of our usual directories in a few places, mostly
  due to core changes required by MTE. Although much of this has been
  Acked, there were a couple of places where we unfortunately didn't get
  any review feedback.

  Other than that, we ran into a handful of minor conflicts in -next,
  but nothing that should post any issues.

  Summary:

   - Userspace support for the Memory Tagging Extension introduced by
     Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

   - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
     switching.

   - Fix and subsequent rewrite of our Spectre mitigations, including
     the addition of support for PR_SPEC_DISABLE_NOEXEC.

   - Support for the Armv8.3 Pointer Authentication enhancements.

   - Support for ASID pinning, which is required when sharing
     page-tables with the SMMU.

   - MM updates, including treating flush_tlb_fix_spurious_fault() as a
     no-op.

   - Perf/PMU driver updates, including addition of the ARM CMN PMU
     driver and also support to handle CPU PMU IRQs as NMIs.

   - Allow prefetchable PCI BARs to be exposed to userspace using normal
     non-cacheable mappings.

   - Implementation of ARCH_STACKWALK for unwinding.

   - Improve reporting of unexpected kernel traps due to BPF JIT
     failure.

   - Improve robustness of user-visible HWCAP strings and their
     corresponding numerical constants.

   - Removal of TEXT_OFFSET.

   - Removal of some unused functions, parameters and prototypes.

   - Removal of MPIDR-based topology detection in favour of firmware
     description.

   - Cleanups to handling of SVE and FPSIMD register state in
     preparation for potential future optimisation of handling across
     syscalls.

   - Cleanups to the SDEI driver in preparation for support in KVM.

   - Miscellaneous cleanups and refactoring work"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
  Revert "arm64: initialize per-cpu offsets earlier"
  arm64: random: Remove no longer needed prototypes
  arm64: initialize per-cpu offsets earlier
  kselftest/arm64: Check mte tagged user address in kernel
  kselftest/arm64: Verify KSM page merge for MTE pages
  kselftest/arm64: Verify all different mmap MTE options
  kselftest/arm64: Check forked child mte memory accessibility
  kselftest/arm64: Verify mte tag inclusion via prctl
  kselftest/arm64: Add utilities and a test to validate mte memory
  perf: arm-cmn: Fix conversion specifiers for node type
  perf: arm-cmn: Fix unsigned comparison to less than zero
  arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
  arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
  arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
  arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
  KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
  arm64: Get rid of arm64_ssbd_state
  KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
  KVM: arm64: Get rid of kvm_arm_have_ssbd()
  KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
  ...
2020-10-12 10:00:51 -07:00
Anna Schumaker 9f0b5792f0 NFSD: Encode a full READ_PLUS reply
Reply to the client with multiple hole and data segments. I use the
result of the first vfs_llseek() call for encoding as an optimization so
we don't have to immediately repeat the call. This also lets us encode
any remaining reply as data if we get an unexpected result while trying
to calculate a hole.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-12 10:29:45 -04:00
Anna Schumaker 278765ea07 NFSD: Return both a hole and a data segment
But only one of each right now. We'll expand on this in the next patch.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-12 10:29:45 -04:00
Anna Schumaker 2db27992dd NFSD: Add READ_PLUS hole segment encoding
However, we still only reply to the READ_PLUS call with a single segment
at this time.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-12 10:29:45 -04:00
Anna Schumaker 528b84934e NFSD: Add READ_PLUS data support
This patch adds READ_PLUS support for returning a single
NFS4_CONTENT_DATA segment to the client. This is basically the same as
the READ operation, only with the extra information about data segments.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-12 10:29:45 -04:00
Chuck Lever cc028a10a4 NFSD: Hoist status code encoding into XDR encoder functions
The original intent was presumably to reduce code duplication. The
trade-off was:

- No support for an NFSD proc function returning a non-success
  RPC accept_stat value.
- No support for void NFS replies to non-NULL procedures.
- Everyone pays for the deduplication with a few extra conditional
  branches in a hot path.

In addition, nfsd_dispatch() leaves *statp uninitialized in the
success path, unlike svc_generic_dispatch().

Address all of these problems by moving the logic for encoding
the NFS status code into the NFS XDR encoders themselves. Then
update the NFS .pc_func methods to return an RPC accept_stat
value.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-12 10:29:44 -04:00
Jeff Layton c74d79af90 ceph: comment cleanups and clarifications
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 16d68903f5 ceph: break up send_cap_msg
Push the allocation of the msg and the send into the caller. Rename
the function to encode_cap_msg and make it void return.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 5231198089 ceph: drop separate mdsc argument from __send_cap
We can get it from the session if we need it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Matthew Wilcox (Oracle) c403c3a2fb ceph: promote to unsigned long long before shifting
On 32-bit systems, this shift will overflow for files larger than 4GB.

Cc: stable@vger.kernel.org
Fixes: 61f6881621 ("ceph: check caps in filemap_fault and page_mkwrite")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 7edf1ec5b2 ceph: don't SetPageError on readpage errors
PageError really only has meaning within a particular subsystem. Nothing
looks at this bit in the core kernel code, and ceph itself doesn't care
about it. Don't bother setting the PageError bit on error.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov f6fbdcd997 ceph: mark ceph_fmt_xattr() as printf-like for better type checking
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 1cc1699070 ceph: fold ceph_update_writeable_page into ceph_write_begin
...and reorganize the loop for better clarity.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 6390987f2f ceph: fold ceph_sync_writepages into writepage_nounlock
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 9b4862ecae ceph: fold ceph_sync_readpages into ceph_readpage
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton d45156bf46 ceph: don't call ceph_update_writeable_page from page_mkwrite
page_mkwrite should only be called with Uptodate pages, so we should
only need to flush incompatible snap contexts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Jeff Layton 18d620f063 ceph: break out writeback of incompatible snap context to separate function
When dirtying a page, we have to flush incompatible contexts. Move the
search for an incompatible context into a separate function, and fix up
the caller to wait and retry if there is one.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:27 +02:00
Ilya Dryomov 4bb926e83f ceph: add a note explaining session reject error string
error_string key in the metadata map of MClientSession message
is intended for humans, but unfortunately became part of the on-wire
format with the introduction of recover_session=clean mode in commit
131d7eb4fa ("ceph: auto reconnect after blacklisted").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Ilya Dryomov 0b98acd618 libceph, rbd, ceph: "blacklist" -> "blocklist"
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton 2e16929660 ceph: have ceph_writepages_start call pagevec_lookup_range_tag
Currently it calls pagevec_lookup_range_nr_tag(), but that may be
inefficient, as we might end up having to search several times as we get
down to looking for fewer pages to fill the array.

Thus spake Willy:

"I think ceph is misusing pagevec_lookup_range_nr_tag().  Let's suppose
 you get a range which is AAAAbbbbAAAAbbbbAAAAbbbbbbbb(...)bbbbAAAA and
 you try to fetch max_pages=13.  First loop will get AAAAbbbbAAAAb and
 have 8 locked_pages.  The next call will get bbbAA and now
 locked_pages=10.  Next call gets AAb ... and now you're iterating your
 way through all the 'b' one page at a time until you find that first A."

'A' here refers to pages that are eligible for writeback and 'b'
represents ones that aren't (for whatever reason).

Not capping the number of return pages may mean that we sometimes find
more pages than are needed, but the extra references will just get put
at the end.

Ceph is also the only caller of pagevec_lookup_range_nr_tag(), so this
change should allow us to eliminate that call as well. That will be done
in a follow-on patch.

Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton 470a5c77ea ceph: use kill_anon_super helper
ceph open-codes this around some other activity and the rationale
for it isn't clear. There is no need to delay free_anon_bdev until
the end of kill_sb.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Xiubo Li 1dd8d47081 ceph: metrics for opened files, pinned caps and opened inodes
In client for each inode, it may have many opened files and may
have been pinned in more than one MDS servers. And some inodes
are idle, which have no any opened files.

This patch will show these metrics in the debugfs, likes:

item                               total
-----------------------------------------
opened files  / total inodes       14 / 5
pinned i_caps / total inodes       7  / 5
opened inodes / total inodes       3  / 5

Will send these metrics to ceph, which will be used by the `fs top`,
later.

[ jlayton: drop unrelated hunk, count hashed inodes instead of
           allocated ones ]

URL: https://tracker.ceph.com/issues/47005
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Xiubo Li 2678da88f4 ceph: add ceph_sb_to_mdsc helper support to parse the mdsc
This will help simplify the code.

[ jlayton: fix minor merge conflict in quota.c ]

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Jeff Layton c5f575ed08 ceph: drop special-casing for ITER_PIPE in ceph_sync_read
This special casing was added in 7ce469a53e (ceph: fix splice
read for no Fc capability case). The confirm callback for ITER_PIPE
expects that the page is Uptodate and returns an error otherwise.

A simpler workaround is just to use the Uptodate bit, which has no
meaning for anonymous pages. Rip out the special casing for ITER_PIPE
and just SetPageUptodate before we copy to the iter.

Cc: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Yanhu Cao 3a8ebe0b8b ceph: add column 'mds' to show caps in more user friendly
In multi-mds, the 'caps' debugfs file will have duplicate ino,
add the 'mds' column to indicate which mds session the cap belongs to.

Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Luis Henriques 1c30c90733 ceph: remove unnecessary return in switch statement
Since there's a return immediately after the 'break', there's no need for
this extra 'return' in the S_IFDIR case.

Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:26 +02:00
Yan, Zheng a33f6432b3 ceph: encode inodes' parent/d_name in cap reconnect message
Since nautilus, MDS tracks dirfrags whose child inodes have caps in open
file table. When MDS recovers, it prefetches all of these dirfrags. This
avoids using backtrace to load inodes. But dirfrags prefetch may load
lots of useless inodes into cache, and make MDS run out of memory.

Recent MDS adds an option that disables dirfrags prefetch. When dirfrags
prefetch is disabled. Recovering MDS only prefetches corresponding dir
inodes. Including inodes' parent/d_name in cap reconnect message can
help MDS to load inodes into its cache.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-10-12 15:29:25 +02:00
Miklos Szeredi 413daa1a3f fuse: connection remove fix
Re-add lost removal of fc from fuse_conn_list and the control filesystem.

Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: fcee216beb ("fuse: split fuse_mount off of fuse_conn")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-12 10:28:14 +02:00
Ronnie Sahlberg d1542cf616 cifs: compute full_path already in cifs_readdir()
Cleanup patch for followon to cache additional information for the root directory
when directory lease held.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-11 23:57:19 -05:00
Ronnie Sahlberg 9e81e8ff74 cifs: return cached_fid from open_shroot
Cleanup patch for followon to cache additional information for the root directory
when directory lease held.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-11 23:57:19 -05:00
Steve French 3984bdc049 update structure definitions from updated protocol documentation
MS-SMB2 was updated recently to include new protocol definitions for
updated compression payload header and new RDMA transform capabilities
Update structure definitions in smb2pdu.h to match

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-10-11 23:57:18 -05:00
Steve French 119e489681 smb3: add defines for new crypto algorithms
In encryption capabilities negotiate context can now request
AES256 GCM or CCM

Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-10-11 23:57:18 -05:00
Boris Protopopov 57c1760740 Convert trailing spaces and periods in path components
When converting trailing spaces and periods in paths, do so
for every component of the path, not just the last component.
If the conversion is not done for every path component, then
subsequent operations in directories with trailing spaces or
periods (e.g. create(), mkdir()) will fail with ENOENT. This
is because on the server, the directory will have a special
symbol in its name, and the client needs to provide the same.

Signed-off-by: Boris Protopopov <pboris@amazon.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-10-11 23:57:18 -05:00
Zhihao Cheng e2a05cc7f8 ubifs: mount_ubifs: Release authentication resource in error handling path
Release the authentication related resource in some error handling
branches in mount_ubifs().

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: <stable@vger.kernel.org>  # 4.20+
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11 22:05:50 +02:00
Zhihao Cheng bb674a4d4d ubifs: Don't parse authentication mount options in remount process
There is no need to dump authentication options while remounting,
because authentication initialization can only be doing once in
the first mount process. Dumping authentication mount options in
remount process may cause memory leak if UBIFS has already been
mounted with old authentication mount options.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: <stable@vger.kernel.org>  # 4.20+
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11 22:05:49 +02:00
Zhihao Cheng 47f6d9ce45 ubifs: Fix a memleak after dumping authentication mount options
Fix a memory leak after dumping authentication mount options in error
handling branch.

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: <stable@vger.kernel.org>  # 4.20+
Fixes: d8a22773a1 ("ubifs: Enable authentication support")
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2020-10-11 22:05:49 +02:00
Linus Torvalds 5b697f86f9 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fix from Al Viro:
 "Fixes an obvious bug (memory leak introduced in 5.8)"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: Fix memory leaks in create_pipe_files()
2020-10-11 11:11:35 -07:00
Vladimir Zapolskiy 64b7f674c2 cifs: Fix incomplete memory allocation on setxattr path
On setxattr() syscall path due to an apprent typo the size of a dynamically
allocated memory chunk for storing struct smb2_file_full_ea_info object is
computed incorrectly, to be more precise the first addend is the size of
a pointer instead of the wanted object size. Coincidentally it makes no
difference on 64-bit platforms, however on 32-bit targets the following
memcpy() writes 4 bytes of data outside of the dynamically allocated memory.

  =============================================================================
  BUG kmalloc-16 (Not tainted): Redzone overwritten
  -----------------------------------------------------------------------------

  Disabling lock debugging due to kernel taint
  INFO: 0x79e69a6f-0x9e5cdecf @offset=368. First byte 0x73 instead of 0xcc
  INFO: Slab 0xd36d2454 objects=85 used=51 fp=0xf7d0fc7a flags=0x35000201
  INFO: Object 0x6f171df3 @offset=352 fp=0x00000000

  Redzone 5d4ff02d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc  ................
  Object 6f171df3: 00 00 00 00 00 05 06 00 73 6e 72 75 62 00 66 69  ........snrub.fi
  Redzone 79e69a6f: 73 68 32 0a                                      sh2.
  Padding 56254d82: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
  CPU: 0 PID: 8196 Comm: attr Tainted: G    B             5.9.0-rc8+ #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
  Call Trace:
   dump_stack+0x54/0x6e
   print_trailer+0x12c/0x134
   check_bytes_and_report.cold+0x3e/0x69
   check_object+0x18c/0x250
   free_debug_processing+0xfe/0x230
   __slab_free+0x1c0/0x300
   kfree+0x1d3/0x220
   smb2_set_ea+0x27d/0x540
   cifs_xattr_set+0x57f/0x620
   __vfs_setxattr+0x4e/0x60
   __vfs_setxattr_noperm+0x4e/0x100
   __vfs_setxattr_locked+0xae/0xd0
   vfs_setxattr+0x4e/0xe0
   setxattr+0x12c/0x1a0
   path_setxattr+0xa4/0xc0
   __ia32_sys_lsetxattr+0x1d/0x20
   __do_fast_syscall_32+0x40/0x70
   do_fast_syscall_32+0x29/0x60
   do_SYSENTER_32+0x15/0x20
   entry_SYSENTER_32+0x9f/0xf2

Fixes: 5517554e43 ("cifs: Add support for writing attributes on SMB2+")
Signed-off-by: Vladimir Zapolskiy <vladimir@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-10 15:52:54 -07:00
Pavel Begunkov b2e9685283 io_uring: keep a pointer ref_node in file_data
->cur_refs of struct fixed_file_data always points to percpu_ref
embedded into struct fixed_file_ref_node. Don't overuse container_of()
and offsetting, and point directly to fixed_file_ref_node.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 600cf3f8b3 io_uring: refactor *files_register()'s error paths
Don't keep repeating cleaning sequences in error paths, write it once
in the and use labels. It's less error prone and looks cleaner.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 5398ae6985 io_uring: clean file_data access in files_register
Keep file_data in a local var and replace with it complex references
such as ctx->file_data.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 692d836351 io_uring: don't delay io_init_req() error check
Don't postpone io_init_req() error checks and do that right after
calling it. There is no control-flow statements or dependencies with
sqe/submitted accounting, so do those earlier, that makes the code flow
a bit more natural.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 062d04d731 io_uring: clean leftovers after splitting issue
Kill extra if in io_issue_sqe() and place send/recv[msg] calls
appropriately under switch's cases.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov a71976f3fa io_uring: remove timeout.list after hrtimer cancel
Remove timeouts from ctx->timeout_list after hrtimer_try_to_cancel()
successfully cancels it. With this we don't need to care whether there
was a race and it was removed in io_timeout_fn(), and that will be handy
for following patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 0bdf7a2ddb io_uring: use a separate struct for timeout_remove
Don't use struct io_timeout for both IORING_OP_TIMEOUT and
IORING_OP_TIMEOUT_REMOVE, they're quite different. Split them in two,
that allows to remove an unused field in struct io_timeout, and btw kill
->flags not used by either. This also easier to follow, especially for
timeout remove.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 71b547c048 io_uring: improve submit_state.ios_left accounting
state->ios_left isn't decremented for requests that don't need a file,
so it might be larger than number of SQEs left. That in some
circumstances makes us to grab more files that is needed so imposing
extra put.
Deaccount one ios_left for each request.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:25 -06:00
Pavel Begunkov 8371adf53c io_uring: simplify io_file_get()
Keep ->needs_file_no_error check out of io_file_get(), and let callers
handle it. It makes it more straightforward. Also, as the only error it
can hand back -EBADF, make it return a file or NULL.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:24 -06:00
Pavel Begunkov 479f517be5 io_uring: kill extra check in fixed io_file_get()
ctx->nr_user_files == 0 IFF ctx->file_data == NULL and there fixed files
are not used. Hence, verifying fds only against ctx->nr_user_files is
enough. Remove the other check from hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:24 -06:00
Pavel Begunkov 233295130e io_uring: clean up ->files grabbing
Move work.files grabbing into io_prep_async_work() to all other work
resources initialisation. We don't need to keep it separately now, as
->ring_fd/file are gone. It also allows to not grab it when a request
is not going to io-wq.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:24 -06:00
Pavel Begunkov 5bf5e464f1 io_uring: don't io_prep_async_work() linked reqs
There is no real reason left for preparing io-wq work context for linked
requests in advance, remove it as this might become a bottleneck in some
cases.

Reported-by: Roman Gershman <romger@amazon.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-10 12:49:20 -06:00
Chao Yu d662fad143 f2fs: fix to set SBI_NEED_FSCK flag for inconsistent inode
If compressed inode has inconsistent fields on i_compress_algorithm,
i_compr_blocks and i_log_cluster_size, we missed to set SBI_NEED_FSCK
to notice fsck to repair the inode, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-09 10:29:31 -07:00
Matthew Wilcox (Oracle) 5e2ed8c4f4 io_uring: Convert advanced XArray uses to the normal API
There are no bugs here that I've spotted, it's just easier to use the
normal API and there are no performance advantages to using the more
verbose advanced API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 09:00:05 -06:00
Matthew Wilcox (Oracle) 236434c343 io_uring: Fix XArray usage in io_uring_add_task_file
The xas_store() wasn't paired with an xas_nomem() loop, so if it couldn't
allocate memory using GFP_NOWAIT, it would leak the reference to the file
descriptor.  Also the node pointed to by the xas could be freed between
the call to xas_load() under the rcu_read_lock() and the acquisition of
the xa_lock.

It's easier to just use the normal xa_load/xa_store interface here.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
[axboe: fix missing assign after alloc, cur_uring -> tctx rename]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 08:59:40 -06:00
Matthew Wilcox (Oracle) ce765372bc io_uring: Fix use of XArray in __io_uring_files_cancel
We have to drop the lock during each iteration, so there's no advantage
to using the advanced API.  Convert this to a standard xa_for_each() loop.

Reported-by: syzbot+27c12725d8ff0bfe1a13@syzkaller.appspotmail.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-09 08:52:26 -06:00
Max Reitz bf109c6404 fuse: implement crossmounts
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT
in fuse_attr.flags.  The inode will then be marked as S_AUTOMOUNT, and the
.d_automount implementation creates a new submount at that location, so
that the submount gets a distinct st_dev value.

Note that all submounts get a distinct superblock and a distinct st_dev
value, so for virtio-fs, even if the same filesystem is mounted more than
once on the host, none of its mount points will have the same st_dev.  We
need distinct superblocks because the superblock points to the root node,
but the different host mounts may show different trees (e.g. due to
submounts in some of them, but not in others).

Right now, this behavior is only enabled when fuse_conn.auto_submounts is
set, which is the case only for virtio-fs.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09 16:33:47 +02:00
Trond Myklebust 39d43d1641 NFSv4: Use the net namespace uniquifier if it is set
If a container sets a net namespace specific uniquifier, then use that
in the setclientid/exchangeid process.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-09 10:05:06 -04:00
Trond Myklebust 1aee551334 NFSv4: Clean up initialisation of uniquified client id strings
When the user sets a uniquifier, then ensure we copy the string
so that calls to strlen() etc are atomic with calls to snprintf().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-09 10:04:36 -04:00
Eric Biggers f6322f3f12 f2fs: reject CASEFOLD inode flag without casefold feature
syzbot reported:

    general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
    CPU: 0 PID: 6860 Comm: syz-executor835 Not tainted 5.9.0-rc8-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:utf8_casefold+0x43/0x1b0 fs/unicode/utf8-core.c:107
    [...]
    Call Trace:
     f2fs_init_casefolded_name fs/f2fs/dir.c:85 [inline]
     __f2fs_setup_filename fs/f2fs/dir.c:118 [inline]
     f2fs_prepare_lookup+0x3bf/0x640 fs/f2fs/dir.c:163
     f2fs_lookup+0x10d/0x920 fs/f2fs/namei.c:494
     __lookup_hash+0x115/0x240 fs/namei.c:1445
     filename_create+0x14b/0x630 fs/namei.c:3467
     user_path_create fs/namei.c:3524 [inline]
     do_mkdirat+0x56/0x310 fs/namei.c:3664
     do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [...]

The problem is that an inode has F2FS_CASEFOLD_FL set, but the
filesystem doesn't have the casefold feature flag set, and therefore
super_block::s_encoding is NULL.

Fix this by making sanity_check_inode() reject inodes that have
F2FS_CASEFOLD_FL when the filesystem doesn't have the casefold feature.

Reported-by: syzbot+05139c4039d0679e19ff@syzkaller.appspotmail.com
Fixes: 2c2eb7a300 ("f2fs: Support case-insensitive file name lookups")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-08 21:24:40 -07:00
Jaegeuk Kim 48046cb55d f2fs: fix memory alignment to support 32bit
In 32bit system, 64-bits key breaks memory alignment.
This fixes the commit "f2fs: support 64-bits key in f2fs rb-tree node entry".

Reported-by: Nicolas Chauvet <kwizart@gmail.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-10-08 21:24:40 -07:00
Jens Axboe ed6930c920 io_uring: fix break condition for __io_uring_register() waiting
Colin reports that there's unreachable code, since we only ever break
if ret == 0. This is correct, and is due to a reversed logic condition
in when to break or not.

Break out of the loop if we don't process any task work, in that case
we do want to return -EINTR.

Fixes: af9c1a44f8 ("io_uring: process task work in io_uring_register()")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 20:37:45 -06:00
Chengguang Xu 915f4c9358 erofs: remove unnecessary enum entries
Opt_nouser_xattr and Opt_noacl are useless, so just remove them.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: Gao Xiang <hsiangkao@redhat.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20201005071550.66193-1-cgxu519@mykernel.net
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
2020-10-09 10:37:42 +08:00
Jakub Kicinski 9d49aea13f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Small conflict around locking in rxrpc_process_event() -
channel_lock moved to bundle in next, while state lock
needs _bh() from net.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-08 15:44:50 -07:00
Linus Torvalds b9e3aa2a9b Description for this pull request:
- Fix use of uninitialized spinlock on error path.
   - Fix missing err assignment in exfat_build_inode().
 -----BEGIN PGP SIGNATURE-----
 
 iQJMBAABCgA2FiEE6NzKS6Uv/XAAGHgyZwv7A1FEIQgFAl99VP4YHG5hbWphZS5q
 ZW9uQHNhbXN1bmcuY29tAAoJEGcL+wNRRCEIALQP/igSZRelxWYA2QwpcMoRsgvV
 xwqqeyol+BJXJa5/tHqO+m5+2Q2Z6B93VHlQ7GUSLsgkqjhubUiWceMAipajK+uS
 WB2qvgREsS2h0mocyC/U22v5PEcaMpqLqFrPjCsyEZzhfT188ImkeOBb+/0Eu4dO
 lhHjrX88E55Bxe9Zn9Gylh73iMfq1aq+ENTKIsUpMk+9qwZUjqprKJDjhDi642Q7
 jSnb7Az/15Ixlmed2r0+9osgcqBYM/U4g/D1k2anD9bOeXFup5O0AS3kMJn8wTj6
 L17BUOf39II3L5AkXKs1RyC6sTUmJMHOjT77P1HbQkIZqgXAYt5f9USGfwIE8/m3
 OmYiBmLQolLTQTzAV7Miup6g1GrByyvsWUjcD8X4s9kTP8DgRxtyj0vxbYM6501g
 bbwWXFDn1Rv7n1DXJVi61CgWiaAk98XeH3y05Or9wVAOpVPFtBP5WRzv3HOyH0kA
 8+bzMyuhbz8IPKphiCly96XgXnqF81GN4a/UQtHMKx7ZEYfEj8BogTH5+SFQVYkq
 ekC/Yiy+17wPw+kTn4TZ3oTvMuYmULaNLPBhjXsolr7Sm7EDio5dCk1Nz8xZdKHK
 9HgT2O+SkYaOLyEvDdq9IZBnYOaUgiMjEWf3cC9Ylec7Rtk3JTh+qRohcLj48yZY
 fT+XjJFGNdxGu6wIqppo
 =W6Bn
 -----END PGP SIGNATURE-----

Merge tag 'exfat-for-5.9-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat

Pull exfat fixes from Namjae Jeon:

 - Fix use of uninitialized spinlock on error path

 - Fix missing err assignment in exfat_build_inode()

* tag 'exfat-for-5.9-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
  exfat: fix use of uninitialized spinlock on error path
  exfat: fix pointer error checking
2020-10-08 11:10:13 -07:00
David Howells ec0fa0b659 afs: Fix deadlock between writeback and truncate
The afs filesystem has a lock[*] that it uses to serialise I/O operations
going to the server (vnode->io_lock), as the server will only perform one
modification operation at a time on any given file or directory.  This
prevents the the filesystem from filling up all the call slots to a server
with calls that aren't going to be executed in parallel anyway, thereby
allowing operations on other files to obtain slots.

  [*] Note that is probably redundant for directories at least since
      i_rwsem is used to serialise directory modifications and
      lookup/reading vs modification.  The server does allow parallel
      non-modification ops, however.

When a file truncation op completes, we truncate the in-memory copy of the
file to match - but we do it whilst still holding the io_lock, the idea
being to prevent races with other operations.

However, if writeback starts in a worker thread simultaneously with
truncation (whilst notify_change() is called with i_rwsem locked, writeback
pays it no heed), it may manage to set PG_writeback bits on the pages that
will get truncated before afs_setattr_success() manages to call
truncate_pagecache().  Truncate will then wait for those pages - whilst
still inside io_lock:

    # cat /proc/8837/stack
    [<0>] wait_on_page_bit_common+0x184/0x1e7
    [<0>] truncate_inode_pages_range+0x37f/0x3eb
    [<0>] truncate_pagecache+0x3c/0x53
    [<0>] afs_setattr_success+0x4d/0x6e
    [<0>] afs_wait_for_operation+0xd8/0x169
    [<0>] afs_do_sync_operation+0x16/0x1f
    [<0>] afs_setattr+0x1fb/0x25d
    [<0>] notify_change+0x2cf/0x3c4
    [<0>] do_truncate+0x7f/0xb2
    [<0>] do_sys_ftruncate+0xd1/0x104
    [<0>] do_syscall_64+0x2d/0x3a
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

The writeback operation, however, stalls indefinitely because it needs to
get the io_lock to proceed:

    # cat /proc/5940/stack
    [<0>] afs_get_io_locks+0x58/0x1ae
    [<0>] afs_begin_vnode_operation+0xc7/0xd1
    [<0>] afs_store_data+0x1b2/0x2a3
    [<0>] afs_write_back_from_locked_page+0x418/0x57c
    [<0>] afs_writepages_region+0x196/0x224
    [<0>] afs_writepages+0x74/0x156
    [<0>] do_writepages+0x2d/0x56
    [<0>] __writeback_single_inode+0x84/0x207
    [<0>] writeback_sb_inodes+0x238/0x3cf
    [<0>] __writeback_inodes_wb+0x68/0x9f
    [<0>] wb_writeback+0x145/0x26c
    [<0>] wb_do_writeback+0x16a/0x194
    [<0>] wb_workfn+0x74/0x177
    [<0>] process_one_work+0x174/0x264
    [<0>] worker_thread+0x117/0x1b9
    [<0>] kthread+0xec/0xf1
    [<0>] ret_from_fork+0x1f/0x30

and thus deadlock has occurred.

Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact
that the caller is holding i_rwsem doesn't preclude more pages being
dirtied through an mmap'd region.

Fix this by:

 (1) Use the vnode validate_lock to mediate access between afs_setattr()
     and afs_writepages():

     (a) Exclusively lock validate_lock in afs_setattr() around the whole
     	 RPC operation.

     (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to
     	 shared-lock validate_lock and returning immediately if we couldn't
     	 get it.

     (c) If WB_SYNC_ALL is set, wait for the lock.

     The validate_lock is also used to validate a file and to zap its cache
     if the file was altered by a third party, so it's probably a good fit
     for this.

 (2) Move the truncation outside of the io_lock in setattr, using the same
     hook as is used for local directory editing.

     This requires the old i_size to be retained in the operation record as
     we commit the revised status to the inode members inside the io_lock
     still, but we still need to know if we reduced the file size.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-08 10:50:55 -07:00
Gabriel Krisman Bertazi 41b21af388 direct-io: defer alignment check until after the EOF check
Prior to commit 9fe55eea7e ("Fix race when checking i_size on direct
i/o read"), an unaligned direct read past end of file would trigger EOF,
since generic_file_aio_read detected this read-at-EOF condition and
skipped the direct IO read entirely, returning 0. After that change, the
read now reaches dio_generic, which detects the misalignment and returns
EINVAL.

This consolidates the generic direct-io to follow the same behavior of
filesystems.  Apparently, this fix will only affect ocfs2 since other
filesystems do this verification before calling do_blockdev_direct_IO,
with the exception of f2fs, which has the same bug, but is fixed in the
next patch.

it can be verified by a read loop on a file that does a partial read
before EOF (On file that doesn't end at an aligned address).  The
following code fails on an unaligned file on filesystems without
prior validation without this patch, but not on btrfs, ext4, and xfs.

  while (done < total) {
    ssize_t delta = pread(fd, buf + done, total - done, off + done);
    if (!delta)
      break;
    ...
  }

Fix this regression by moving the misalignment check to after the EOF
check added by commit 74cedf9b6c ("direct-io: Fix negative return from
dio read beyond eof").

Based on a patch by Jamie Liu.

Link: https://lore.kernel.org/r/20201008062620.2928326-4-krisman@collabora.com
Reported-by: Jamie Liu <jamieliu@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-08 18:26:46 +02:00
Gabriel Krisman Bertazi 0a9164cb7f direct-io: don't force writeback for reads beyond EOF
If a DIO read starts past EOF, the kernel won't attempt it, so we don't
need to flush dirty pages before failing the syscall.

Link: https://lore.kernel.org/r/20201008062620.2928326-3-krisman@collabora.com
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-08 18:26:33 +02:00
Gabriel Krisman Bertazi 46d716025a direct-io: clean up error paths of do_blockdev_direct_IO
In preparation to resort DIO checks, reduce code duplication of error
handling in do_blockdev_direct_IO.

Link: https://lore.kernel.org/r/20201008062620.2928326-2-krisman@collabora.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-08 18:26:01 +02:00
Jens Axboe ca6484cd30 io_uring: no need to call xa_destroy() on empty xarray
The kernel test robot reports this lockdep issue:

[child1:659] mbind (274) returned ENOSYS, marking as inactive.
[child1:659] mq_timedsend (279) returned ENOSYS, marking as inactive.
[main] 10175 iterations. [F:7781 S:2344 HI:2397]
[   24.610601]
[   24.610743] ================================
[   24.611083] WARNING: inconsistent lock state
[   24.611437] 5.9.0-rc7-00017-g0f2122045b9462 #5 Not tainted
[   24.611861] --------------------------------
[   24.612193] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   24.612660] ksoftirqd/0/7 [HC0[0]:SC1[3]:HE0:SE0] takes:
[   24.613086] f00ed998 (&xa->xa_lock#4){+.?.}-{2:2}, at: xa_destroy+0x43/0xc1
[   24.613642] {SOFTIRQ-ON-W} state was registered at:
[   24.614024]   lock_acquire+0x20c/0x29b
[   24.614341]   _raw_spin_lock+0x21/0x30
[   24.614636]   io_uring_add_task_file+0xe8/0x13a
[   24.614987]   io_uring_create+0x535/0x6bd
[   24.615297]   io_uring_setup+0x11d/0x136
[   24.615606]   __ia32_sys_io_uring_setup+0xd/0xf
[   24.615977]   do_int80_syscall_32+0x53/0x6c
[   24.616306]   restore_all_switch_stack+0x0/0xb1
[   24.616677] irq event stamp: 939881
[   24.616968] hardirqs last  enabled at (939880): [<8105592d>] __local_bh_enable_ip+0x13c/0x145
[   24.617642] hardirqs last disabled at (939881): [<81b6ace3>] _raw_spin_lock_irqsave+0x1b/0x4e
[   24.618321] softirqs last  enabled at (939738): [<81b6c7c8>] __do_softirq+0x3f0/0x45a
[   24.618924] softirqs last disabled at (939743): [<81055741>] run_ksoftirqd+0x35/0x61
[   24.619521]
[   24.619521] other info that might help us debug this:
[   24.620028]  Possible unsafe locking scenario:
[   24.620028]
[   24.620492]        CPU0
[   24.620685]        ----
[   24.620894]   lock(&xa->xa_lock#4);
[   24.621168]   <Interrupt>
[   24.621381]     lock(&xa->xa_lock#4);
[   24.621695]
[   24.621695]  *** DEADLOCK ***
[   24.621695]
[   24.622154] 1 lock held by ksoftirqd/0/7:
[   24.622468]  #0: 823bfb94 (rcu_callback){....}-{0:0}, at: rcu_process_callbacks+0xc0/0x155
[   24.623106]
[   24.623106] stack backtrace:
[   24.623454] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 5.9.0-rc7-00017-g0f2122045b9462 #5
[   24.624090] Call Trace:
[   24.624284]  ? show_stack+0x40/0x46
[   24.624551]  dump_stack+0x1b/0x1d
[   24.624809]  print_usage_bug+0x17a/0x185
[   24.625142]  mark_lock+0x11d/0x1db
[   24.625474]  ? print_shortest_lock_dependencies+0x121/0x121
[   24.625905]  __lock_acquire+0x41e/0x7bf
[   24.626206]  lock_acquire+0x20c/0x29b
[   24.626517]  ? xa_destroy+0x43/0xc1
[   24.626810]  ? lock_acquire+0x20c/0x29b
[   24.627110]  _raw_spin_lock_irqsave+0x3e/0x4e
[   24.627450]  ? xa_destroy+0x43/0xc1
[   24.627725]  xa_destroy+0x43/0xc1
[   24.627989]  __io_uring_free+0x57/0x71
[   24.628286]  ? get_pid+0x22/0x22
[   24.628544]  __put_task_struct+0xf2/0x163
[   24.628865]  put_task_struct+0x1f/0x2a
[   24.629161]  delayed_put_task_struct+0xe2/0xe9
[   24.629509]  rcu_process_callbacks+0x128/0x155
[   24.629860]  __do_softirq+0x1a3/0x45a
[   24.630151]  run_ksoftirqd+0x35/0x61
[   24.630443]  smpboot_thread_fn+0x304/0x31a
[   24.630763]  kthread+0x124/0x139
[   24.631016]  ? sort_range+0x18/0x18
[   24.631290]  ? kthread_create_worker_on_cpu+0x17/0x17
[   24.631682]  ret_from_fork+0x1c/0x28

which is complaining about xa_destroy() grabbing the xa lock in an
IRQ disabling fashion, whereas the io_uring uses cases aren't interrupt
safe. This is really an xarray issue, since it should not assume the
lock type. But for our use case, since we know the xarray is empty at
this point, there's no need to actually call xa_destroy(). So just get
rid of it.

Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-08 07:46:52 -06:00
Matthew Wilcox (Oracle) f5f7ab168b 9P: Cast to loff_t before multiplying
On 32-bit systems, this multiplication will overflow for files larger
than 4GB.

Link: http://lkml.kernel.org/r/20201004180428.14494-2-willy@infradead.org
Cc: stable@vger.kernel.org
Fixes: fb89b45cdf ("9P: introduction of a new cache=mmap model.")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2020-10-07 23:57:30 +02:00
Jens Axboe faf7b51c06 io_uring: batch account ->req_issue and task struct references
Identical to how we handle the ctx reference counts, increase by the
batch we're expecting to submit, and handle any slow path residual,
if any. The request alloc-and-issue path is very hot, and this makes
a noticeable difference by avoiding an two atomic incs for each
individual request.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-07 12:55:42 -06:00
Anna Schumaker bff049a3b5 NFS: Decode a full READ_PLUS reply
Decode multiple hole and data segments sent by the server, placing
everything directly where they need to go in the xdr pages.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07 14:28:40 -04:00
Anna Schumaker c05eafad6b NFS: Add READ_PLUS hole segment decoding
We keep things simple for now by only decoding a single hole or data
segment returned by the server, even if they returned more to us.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07 14:28:40 -04:00
Anna Schumaker c567552612 NFS: Add READ_PLUS data segment support
This patch adds client support for decoding a single NFS4_CONTENT_DATA
segment returned by the server. This is the simplest implementation
possible, since it does not account for any hole segments in the reply.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07 14:28:39 -04:00
Anna Schumaker a14a63594c NFS: Use xdr_page_pos() in NFSv4 decode_getacl()
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-07 14:28:39 -04:00
Kaixu Xia e5b23740db xfs: fix the indent in xfs_trans_mod_dquot
The formatting is strange in xfs_trans_mod_dquot, so do a reindent.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07 08:40:29 -07:00
Kaixu Xia 97611f9366 xfs: do the ASSERT for the arguments O_{u,g,p}dqpp
If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid
than them currently associated with the inode, the arguments
O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them.

Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07 08:40:29 -07:00
Darrick J. Wong 8ffa90e114 xfs: fix deadlock and streamline xfs_getfsmap performance
Refactor xfs_getfsmap to improve its performance: instead of indirectly
calling a function that copies one record to userspace at a time, create
a shadow buffer in the kernel and copy the whole array once at the end.
On the author's computer, this reduces the runtime on his /home by ~20%.

This also eliminates a deadlock when running GETFSMAP against the
realtime device.  The current code locks the rtbitmap to create
fsmappings and copies them into userspace, having not released the
rtbitmap lock.  If the userspace buffer is an mmap of a sparse file that
itself resides on the realtime device, the write page fault will recurse
into the fs for allocation, which will deadlock on the rtbitmap lock.

Fixes: 4c934c7dd6 ("xfs: report realtime space information via the rtbitmap")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07 08:40:29 -07:00
Darrick J. Wong acd1ac3aa2 xfs: limit entries returned when counting fsmap records
If userspace asked fsmap to count the number of entries, we cannot
return more than UINT_MAX entries because fmh_entries is u32.
Therefore, stop counting if we hit this limit or else we will waste time
to return truncated results.

Fixes: e89c041338 ("xfs: implement the GETFSMAP ioctl")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07 08:40:29 -07:00
Darrick J. Wong 74f4d6a1e0 xfs: only relog deferred intent items if free space in the log gets low
Now that we have the ability to ask the log how far the tail needs to be
pushed to maintain its free space targets, augment the decision to relog
an intent item so that we only do it if the log has hit the 75% full
threshold.  There's no point in relogging an intent into the same
checkpoint, and there's no need to relog if there's plenty of free space
in the log.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:29 -07:00
Darrick J. Wong ed1575daf7 xfs: expose the log push threshold
Separate the computation of the log push threshold and the push logic in
xlog_grant_push_ail.  This enables higher level code to determine (for
example) that it is holding on to a logged intent item and the log is so
busy that it is more than 75% full.  In that case, it would be desirable
to move the log item towards the head to release the tail, which we will
cover in the next patch.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:29 -07:00
Darrick J. Wong 4e919af782 xfs: periodically relog deferred intent items
There's a subtle design flaw in the deferred log item code that can lead
to pinning the log tail.  Taking up the defer ops chain examples from
the previous commit, we can get trapped in sequences like this:

Caller hands us a transaction t0 with D0-D3 attached.  The defer ops
chain will look like the following if the transaction rolls succeed:

t1: D0(t0), D1(t0), D2(t0), D3(t0)
t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

In transaction 9, we finish d9 and try to roll to t10 while holding onto
an intent item for D3 that we logged in t0.

The previous commit changed the order in which we place new defer ops in
the defer ops processing chain to reduce the maximum chain length.  Now
make xfs_defer_finish_noroll capable of relogging the entire chain
periodically so that we can always move the log tail forward.  Most
chains will never get relogged, except for operations that generate very
long chains (large extents containing many blocks with different sharing
levels) or are on filesystems with small logs and a lot of ongoing
metadata updates.

Callers are now required to ensure that the transaction reservation is
large enough to handle logging done items and new intent items for the
maximum possible chain length.  Most callers are careful to keep the
chain lengths low, so the overhead should be minimal.

The decision to relog an intent item is made based on whether the intent
was logged in a previous checkpoint, since there's no point in relogging
an intent into the same checkpoint.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 27dada070d xfs: change the order in which child and parent defer ops are finished
The defer ops code has been finishing items in the wrong order -- if a
top level defer op creates items A and B, and finishing item A creates
more defer ops A1 and A2, we'll put the new items on the end of the
chain and process them in the order A B A1 A2.  This is kind of weird,
since it's convenient for programmers to be able to think of A and B as
an ordered sequence where all the sub-tasks for A must finish before we
move on to B, e.g. A A1 A2 D.

Right now, our log intent items are not so complex that this matters,
but this will become important for the atomic extent swapping patchset.
In order to maintain correct reference counting of extents, we have to
unmap and remap extents in that order, and we want to complete that work
before moving on to the next range that the user wants to swap.  This
patch fixes defer ops to satsify that requirement.

The primary symptom of the incorrect order was noticed in an early
performance analysis of the atomic extent swap code.  An astonishingly
large number of deferred work items accumulated when userspace requested
an atomic update of two very fragmented files.  The cause of this was
traced to the same ordering bug in the inner loop of
xfs_defer_finish_noroll.

If the ->finish_item method of a deferred operation queues new deferred
operations, those new deferred ops are appended to the tail of the
pending work list.  To illustrate, say that a caller creates a
transaction t0 with four deferred operations D0-D3.  The first thing
defer ops does is roll the transaction to t1, leaving us with:

t1: D0(t0), D1(t0), D2(t0), D3(t0)

Let's say that finishing each of D0-D3 will create two new deferred ops.
After finish D0 and roll, we'll have the following chain:

t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

d4 and d5 were logged to t1.  Notice that while we're about to start
work on D1, we haven't actually completed all the work implied by D0
being finished.  So far we've been careful (or lucky) to structure the
dfops callers such that D1 doesn't depend on d4 or d5 being finished,
but this is a potential logic bomb.

There's a second problem lurking.  Let's see what happens as we finish
D1-D3:

t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

Let's say that d4-d11 are simple work items that don't queue any other
operations, which means that we can complete each d4 and roll to t6:

t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
...
t11: d10(t4), d11(t4)
t12: d11(t4)
<done>

When we try to roll to transaction #12, we're holding defer op d11,
which we logged way back in t4.  This means that the tail of the log is
pinned at t4.  If the log is very small or there are a lot of other
threads updating metadata, this means that we might have wrapped the log
and cannot get roll to t11 because there isn't enough space left before
we'd run into t4.

Let's shift back to the original failure.  I mentioned before that I
discovered this flaw while developing the atomic file update code.  In
that scenario, we have a defer op (D0) that finds a range of file blocks
to remap, creates a handful of new defer ops to do that, and then asks
to be continued with however much work remains.

So, D0 is the original swapext deferred op.  The first thing defer ops
does is rolls to t1:

t1: D0(t0)

We try to finish D0, logging d1 and d2 in the process, but can't get all
the work done.  We log a done item and a new intent item for the work
that D0 still has to do, and roll to t2:

t2: D0'(t1), d1(t1), d2(t1)

We roll and try to finish D0', but still can't get all the work done, so
we log a done item and a new intent item for it, requeue D0 a second
time, and roll to t3:

t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

If it takes 48 more rolls to complete D0, then we'll finally dispense
with D0 in t50:

t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)

We then try to roll again to get a chain like this:

t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
...
t152: d102(t50)
<done>

Notice that in rolling to transaction #51, we're holding on to a log
intent item for d1 that was logged in transaction #1.  This means that
the tail of the log is pinned at t1.  If the log is very small or there
are a lot of other threads updating metadata, this means that we might
have wrapped the log and cannot roll to t51 because there isn't enough
space left before we'd run into t1.  This is of course problem #2 again.

But notice the third problem with this scenario: we have 102 defer ops
tied to this transaction!  Each of these items are backed by pinned
kernel memory, which means that we risk OOM if the chains get too long.

Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
future; problem #2 applies (rarely) to the current upstream, and problem
#3 applies to work under development.

This is not how incremental deferred operations were supposed to work.
The dfops design of logging in the same transaction an intent-done item
and a new intent item for the work remaining was to make it so that we
only have to juggle enough deferred work items to finish that one small
piece of work.  Deferred log item recovery will find that first
unfinished work item and restart it, no matter how many other intent
items might follow it in the log.  Therefore, it's ok to put the new
intents at the start of the dfops chain.

For the first example, the chains look like this:

t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
t3: d5(t1), D1(t0), D2(t0), D3(t0)
...
t9: d9(t7), D3(t0)
t10: D3(t0)
t11: d10(t10), d11(t10)
t12: d11(t10)

For the second example, the chains look like this:

t1: D0(t0)
t2: d1(t1), d2(t1), D0'(t1)
t3: d2(t1), D0'(t1)
t4: D0'(t1)
t5: d1(t4), d2(t4), D0''(t4)
...
t148: D0<50 primes>(t147)
t149: d101(t148), d102(t148)
t150: d102(t148)
<done>

This actually sucks more for pinning the log tail (we try to roll to t10
while holding an intent item that was logged in t1) but we've solved
problem #1.  We've also reduced the maximum chain length from:

    sum(all the new items) + nr_original_items

to:

    max(new items that each original item creates) + nr_original_items

This solves problem #3 by sharply reducing the number of defer ops that
can be attached to a transaction at any given time.  The change makes
the problem of log tail pinning worse, but is improvement we need to
solve problem #2.  Actually solving #2, however, is left to the next
patch.

Note that a subsequent analysis of some hard-to-trigger reflink and COW
livelocks on extremely fragmented filesystems (or systems running a lot
of IO threads) showed the same symptoms -- uncomfortably large numbers
of incore deferred work items and occasional stalls in the transaction
grant code while waiting for log reservations.  I think this patch and
the next one will also solve these problems.

As originally written, the code used list_splice_tail_init instead of
list_splice_init, so change that, and leave a short comment explaining
our actions.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
Darrick J. Wong ff4ab5e02a xfs: fix an incore inode UAF in xfs_bui_recover
In xfs_bui_item_recover, there exists a use-after-free bug with regards
to the inode that is involved in the bmap replay operation.  If the
mapping operation does not complete, we call xfs_bmap_unmap_extent to
create a deferred op to finish the unmapping work, and we retain a
pointer to the incore inode.

Unfortunately, the very next thing we do is commit the transaction and
drop the inode.  If reclaim tears down the inode before we try to finish
the defer ops, we dereference garbage and blow up.  Therefore, create a
way to join inodes to the defer ops freezer so that we can maintain the
xfs_inode reference until we're done with the inode.

Note: This imposes the requirement that there be enough memory to keep
every incore inode in memory throughout recovery.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 64a3f3315b xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering
In most places in XFS, we have a specific order in which we gather
resources: grab the inode, allocate a transaction, then lock the inode.
xfs_bui_item_recover doesn't do it in that order, so fix it to be more
consistent.  This also makes the error bailout code a bit less weird.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 919522e89f xfs: clean up bmap intent item recovery checking
The bmap intent item checking code in xfs_bui_item_recover is spread all
over the function.  We should check the recovered log item at the top
before we allocate any resources or do anything else, so do that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 929b92f640 xfs: xfs_defer_capture should absorb remaining transaction reservation
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the transaction reservation type
from the old transaction so that when we continue the dfops chain, we
still use the same reservation parameters.

Doing this means that the log item recovery functions get to determine
the transaction reservation instead of abusing tr_itruncate in yet
another part of xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 4f9a60c480 xfs: xfs_defer_capture should absorb remaining block reservations
When xfs_defer_capture extracts the deferred ops and transaction state
from a transaction, it should record the remaining block reservations so
that when we continue the dfops chain, we can reserve the same number of
blocks to use.  We capture the reservations for both data and realtime
volumes.

This adds the requirement that every log intent item recovery function
must be careful to reserve enough blocks to handle both itself and all
defer ops that it can queue.  On the other hand, this enables us to do
away with the handwaving block estimation nonsense that was going on in
xlog_finish_defer_ops.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:28 -07:00
Darrick J. Wong e6fff81e48 xfs: proper replay of deferred ops queued during log recovery
When we replay unfinished intent items that have been recovered from the
log, it's possible that the replay will cause the creation of more
deferred work items.  As outlined in commit 509955823c ("xfs: log
recovery should replay deferred ops in order"), later work items have an
implicit ordering dependency on earlier work items.  Therefore, recovery
must replay the items (both recovered and created) in the same order
that they would have been during normal operation.

For log recovery, we enforce this ordering by using an empty transaction
to collect deferred ops that get created in the process of recovering a
log intent item to prevent them from being committed before the rest of
the recovered intent items.  After we finish committing all the
recovered log items, we allocate a transaction with an enormous block
reservation, splice our huge list of created deferred ops into that
transaction, and commit it, thereby finishing all those ops.

This is /really/ hokey -- it's the one place in XFS where we allow
nested transactions; the splicing of the defer ops list is is inelegant
and has to be done twice per recovery function; and the broken way we
handle inode pointers and block reservations cause subtle use-after-free
and allocator problems that will be fixed by this patch and the two
patches after it.

Therefore, replace the hokey empty transaction with a structure designed
to capture each chain of deferred ops that are created as part of
recovering a single unfinished log intent.  Finally, refactor the loop
that replays those chains to do so using one transaction per chain.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07 08:40:28 -07:00
Darrick J. Wong 901219bb25 xfs: remove XFS_LI_RECOVERED
The ->iop_recover method of a log intent item removes the recovered
intent item from the AIL by logging an intent done item and committing
the transaction, so it's superfluous to have this flag check.  Nothing
else uses it, so get rid of the flag entirely.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07 08:40:27 -07:00
Darrick J. Wong b80b29d602 xfs: remove xfs_defer_reset
Remove this one-line helper since the assert is trivially true in one
call site and the rest obscures a bitmask operation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07 08:40:27 -07:00
Nikolay Borisov 1fd4033dd0 btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag
Commit 8d875f95da ("btrfs: disable strict file flushes for
renames and truncates") eliminated the notion of ordered operations and
instead BTRFS_INODE_ORDERED_DATA_CLOSE only remained as a flag
indicating that a file's content should be synced to disk in case a
file is truncated and any writes happen to it concurrently. In fact
this intendend behavior was broken until it was fixed in
f6dc45c7a9 ("Btrfs: fix filemap_flush call in btrfs_file_release").

All things considered let's give the flag a more descriptive name. Also
slightly reword comments.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:18:00 +02:00
Madhuparna Bhowmik 8d1a7aae89 btrfs: annotate device name rcu_string with __rcu
This patch fixes the following sparse errors in
fs/btrfs/super.c in function btrfs_show_devname()

  fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces):
  fs/btrfs/super.c:    struct rcu_string [noderef] <asn:4> *
  fs/btrfs/super.c:    struct rcu_string *

The error was because of the following line in function btrfs_show_devname():

  if (first_dev)
	 seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\");

Annotating the btrfs_device::name member with __rcu fixes the sparse
error.

Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Anand Jain 96c2e067ed btrfs: skip devices without magic signature when mounting
Many things can happen after the device is scanned and before the device
is mounted.  One such thing is losing the BTRFS_MAGIC on the device.
If it happens we still won't free that device from the memory and cause
the userland confusion.

For example: As the BTRFS_IOC_DEV_INFO still carries the device path
which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists
device which does not belong to the filesystem anymore:

  $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb
  $ wipefs -a /dev/sdb
  # /dev/sdb does not contain magic signature
  $ mount -o degraded /dev/sda /btrfs
  $ btrfs fi show -m
  Label: none  uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd
	  Total devices 2 FS bytes used 128.00KiB
	  devid    1 size 3.00GiB used 571.19MiB path /dev/sda
	  devid    2 size 3.00GiB used 571.19MiB path /dev/sdb

We need to distinguish the missing signature and invalid superblock, so
add a specific error code ENODATA for that. This also fixes failure of
fstest btrfs/198.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Josef Bacik 572c83acdc btrfs: cleanup cow block on error
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:

  $ cat /proc/2143497/stack
  [<0>] __btrfs_tree_lock+0x108/0x250
  [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
  [<0>] btree_write_cache_pages+0x15a/0x3b0
  [<0>] do_writepages+0x28/0xb0
  [<0>] __writeback_single_inode+0x54/0x5c0
  [<0>] writeback_sb_inodes+0x1e8/0x510
  [<0>] wb_writeback+0xcc/0x440
  [<0>] wb_workfn+0xd7/0x650
  [<0>] process_one_work+0x236/0x560
  [<0>] worker_thread+0x55/0x3c0
  [<0>] kthread+0x13a/0x150
  [<0>] ret_from_fork+0x1f/0x30

This is because we got an error while COWing a block, specifically here

        if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                if (ret) {
                        btrfs_abort_transaction(trans, ret);
                        return ret;
                }
        }

  [16402.241552] BTRFS: Transaction aborted (error -2)
  [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
  [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
  [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
  [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
  [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
  [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
  [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
  [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
  [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
  [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
  [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
  [16402.257867] Call Trace:
  [16402.258072]  btrfs_cow_block+0x109/0x230
  [16402.258356]  btrfs_search_slot+0x530/0x9d0
  [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
  [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
  [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
  [16402.259949]  btrfs_replace_file_extents+0x190/0x820
  [16402.260873]  btrfs_clone+0x9ae/0xc00
  [16402.261139]  btrfs_extent_same_range+0x66/0x90
  [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
  [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
  [16402.262821]  vfs_dedupe_file_range+0x189/0x220
  [16402.263150]  do_vfs_ioctl+0x552/0x700
  [16402.263662]  __x64_sys_ioctl+0x62/0xb0
  [16402.264023]  do_syscall_64+0x33/0x40
  [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [16402.264862] RIP: 0033:0x7f542c7d15cb
  [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
  [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
  [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
  [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
  [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
  [16402.271498] irq event stamp: 0
  [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
  [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
  [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
  [16402.274338] ---[ end trace 737874a5a41a8236 ]---
  [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.276179] BTRFS info (device dm-9): forced readonly
  [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
  [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
  [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30

The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode.  This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.

Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Goldwyn Rodrigues e3c57805f8 btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
Since we now perform direct reads using i_rwsem, we can remove this
inode flag used to co-ordinate unlocked reads.

The truncate call takes i_rwsem. This means it is correctly synchronized
with concurrent direct reads.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jth@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Goldwyn Rodrigues c33fe275b5 fs: remove no longer used dio_end_io()
Since we removed the last user of dio_end_io() when btrfs got converted
to iomap infrastructure ("btrfs: switch to iomap for direct IO"), remove
the helper function dio_end_io().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:59 +02:00
Josef Bacik 92e26df43b btrfs: return error if we're unable to read device stats
I noticed when fixing device stats for seed devices that we simply threw
away the return value from btrfs_search_slot().  This is because we may
not have stat items, but we could very well get an error, and thus miss
reporting the error up the chain.

Fix this by returning ret if it's an actual error, and then stop trying
to init the rest of the devices stats and return the error up the chain.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:58 +02:00
Josef Bacik 124604eb50 btrfs: init device stats for seed devices
We recently started recording device stats across the fleet, and noticed
a large increase in messages such as this

  BTRFS warning (device dm-0): get dev_stats failed, not yet valid

on our tiers that use seed devices for their root devices.  This is
because we do not initialize the device stats for any seed devices if we
have a sprout device and mount using that sprout device.  The basic
steps for reproducing are:

  $ mkfs seed device
  $ mount seed device
  # fill seed device
  $ umount seed device
  $ btrfstune -S 1 seed device
  $ mount seed device
  $ btrfs device add -f sprout device /mnt/wherever
  $ umount /mnt/wherever
  $ mount sprout device /mnt/wherever
  $ btrfs device stats /mnt/wherever

This will fail with the above message in dmesg.

Fix this by iterating over the fs_devices->seed if they exist in
btrfs_init_dev_stats.  This fixed the problem and properly reports the
stats for both devices.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_device_init_dev_stats ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:17:58 +02:00
Nikolay Borisov 905eb88bce btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which
was initialising it for inodes. No functional changes.

Removing 8 bytes from extent_io_tree in turn reduces size of other
structures where it is embedded, notably btrfs_inode where it reduces
size by 24 bytes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:25 +02:00
Nikolay Borisov 1b36294a6c btrfs: call submit_bio_hook directly for metadata pages
No need to go through a function pointer indirection simply call
submit_bio_hook directly by exporting and renaming the helper to
btrfs_submit_metadata_bio. This makes the code more readable and should
result in somewhat faster code due to no longer paying the price for
specualtive attack mitigations that come with indirect function calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:25 +02:00
Nikolay Borisov 908930f3ed btrfs: stop calling submit_bio_hook for data inodes
Instead export and rename the function to btrfs_submit_data_bio and
call it directly in submit_one_bio. This avoids paying the cost for
speculative attacks mitigations and improves code readability.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov be17b3afc4 btrfs: don't opencode is_data_inode in end_bio_extent_readpage
Use the is_data_inode helper.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov cd0537449c btrfs: call submit_bio_hook directly in submit_one_bio
BTRFS has 2 inode types (for the purposes of the code in submit_one_bio)
- ordinary data inodes (including the freespace inode) and the btree
inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can
never be called from submit_one_bio so just remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov 1f03d9cfda btrfs: remove extent_io_ops::readpage_end_io_hook
It's no longer used so let's remove it.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Nikolay Borisov 9a446d6a9f btrfs: replace readpage_end_io_hook with direct calls
Don't call readpage_end_io_hook for the btree inode.  Instead of relying
on indirect calls to implement metadata buffer validation simply check
if the inode whose page we are processing equals the btree inode. If it
does call the necessary function.

This is an improvement in 2 directions:

1. We aren't paying the penalty of indirect calls in a post-speculation
   attacks world.

2. The function is now named more explicitly so it's obvious what's
   going on

This is in preparation to removing struct extent_io_ops altogether.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:24 +02:00
Filipe Manana 9c2b4e0347 btrfs: send, recompute reference path after orphanization of a directory
During an incremental send, when an inode has multiple new references we
might end up emitting rename operations for orphanizations that have a
source path that is no longer valid due to a previous orphanization of
some directory inode. This causes the receiver to fail since it tries
to rename a path that does not exists.

Example reproducer:

  $ cat reproducer.sh
  #!/bin/bash

  mkfs.btrfs -f /dev/sdi >/dev/null
  mount /dev/sdi /mnt/sdi

  touch /mnt/sdi/f1
  touch /mnt/sdi/f2
  mkdir /mnt/sdi/d1
  mkdir /mnt/sdi/d1/d2

  # Filesystem looks like:
  #
  # .                           (ino 256)
  # |----- f1                   (ino 257)
  # |----- f2                   (ino 258)
  # |----- d1/                  (ino 259)
  #        |----- d2/           (ino 260)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
  btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

  # Now do a series of changes such that:
  #
  # *) inode 258 has one new hardlink and the previous name changed
  #
  # *) both names conflict with the old names of two other inodes:
  #
  #    1) the new name "d1" conflicts with the old name of inode 259,
  #       under directory inode 256 (root)
  #
  #    2) the new name "d2" conflicts with the old name of inode 260
  #       under directory inode 259
  #
  # *) inodes 259 and 260 now have the old names of inode 258
  #
  # *) inode 257 is now located under inode 260 - an inode with a number
  #    smaller than the inode (258) for which we created a second hard
  #    link and swapped its names with inodes 259 and 260
  #
  ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link
  mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1

  # Swap d1 and f2.
  mv /mnt/sdi/d1 /mnt/sdi/tmp
  mv /mnt/sdi/f2 /mnt/sdi/d1
  mv /mnt/sdi/tmp /mnt/sdi/f2

  # Swap d2 and f2_link
  mv /mnt/sdi/f2/d2 /mnt/sdi/tmp
  mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2
  mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link

  # Filesystem now looks like:
  #
  # .                                (ino 256)
  # |----- d1                        (ino 258)
  # |----- f2/                       (ino 259)
  #        |----- f2_link/           (ino 260)
  #        |       |----- f1         (ino 257)
  #        |
  #        |----- d2                 (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
  btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

  mkfs.btrfs -f /dev/sdj >/dev/null
  mount /dev/sdj /mnt/sdj

  btrfs receive -f /tmp/snap1.send /mnt/sdj
  btrfs receive -f /tmp/snap2.send /mnt/sdj

  umount /mnt/sdi
  umount /mnt/sdj

When executed the receive of the incremental stream fails:

  $ ./reproducer.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory

This happens because:

1) When processing inode 257 we end up computing the name for inode 259
   because it is an ancestor in the send snapshot, and at that point it
   still has its old name, "d1", from the parent snapshot because inode
   259 was not yet processed. We then cache that name, which is valid
   until we start processing inode 259 (or set the progress to 260 after
   processing its references);

2) Later we start processing inode 258 and collecting all its new
   references into the list sctx->new_refs. The first reference in the
   list happens to be the reference for name "d1" while the reference for
   name "d2" is next (the last element of the list).
   We compute the full path "d1/d2" for this second reference and store
   it in the reference (its ->full_path member). The path used for the
   new parent directory was "d1" and not "f2" because inode 259, the
   new parent, was not yet processed;

3) When we start processing the new references at process_recorded_refs()
   we start with the first reference in the list, for the new name "d1".
   Because there is a conflicting inode that was not yet processed, which
   is directory inode 259, we orphanize it, renaming it from "d1" to
   "o259-6-0";

4) Then we start processing the new reference for name "d2", and we
   realize it conflicts with the reference of inode 260 in the parent
   snapshot. So we issue an orphanization operation for inode 260 by
   emitting a rename operation with a destination path of "o260-6-0"
   and a source path of "d1/d2" - this source path is the value we
   stored in the reference earlier at step 2), corresponding to the
   ->full_path member of the reference, however that path is no longer
   valid due to the orphanization of the directory inode 259 in step 3).
   This makes the receiver fail since the path does not exists, it should
   have been "o259-6-0/d2".

Fix this by recomputing the full path of a reference before emitting an
orphanization if we previously orphanized any directory, since that
directory could be a parent in the new path. This is a rare scenario so
keeping it simple and not checking if that previously orphanized directory
is in fact an ancestor of the inode we are trying to orphanize.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
Filipe Manana 98272bb77b btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new
references for an inode we end up issuing rename or link operations that
have an invalid path, which contains the orphanized name of a directory
before we actually orphanized it, causing the receiver to fail.

The following reproducer triggers such scenario:

  $ cat reproducer.sh
  #!/bin/bash

  mkfs.btrfs -f /dev/sdi >/dev/null
  mount /dev/sdi /mnt/sdi

  touch /mnt/sdi/a
  touch /mnt/sdi/b
  mkdir /mnt/sdi/testdir
  # We want "a" to have a lower inode number then "testdir" (257 vs 259).
  mv /mnt/sdi/a /mnt/sdi/testdir/a

  # Filesystem looks like:
  #
  # .                           (ino 256)
  # |----- testdir/             (ino 259)
  # |          |----- a         (ino 257)
  # |
  # |----- b                    (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1
  btrfs send -f /tmp/snap1.send /mnt/sdi/snap1

  # Now rename 259 to "testdir_2", then change the name of 257 to
  # "testdir" and make it a direct descendant of the root inode (256).
  # Also create a new link for inode 257 with the old name of inode 258.
  # By swapping the names and location of several inodes and create a
  # nasty dependency chain of rename and link operations.
  mv /mnt/sdi/testdir/a /mnt/sdi/a2
  touch /mnt/sdi/testdir/a
  mv /mnt/sdi/b /mnt/sdi/b2
  ln /mnt/sdi/a2 /mnt/sdi/b
  mv /mnt/sdi/testdir /mnt/sdi/testdir_2
  mv /mnt/sdi/a2 /mnt/sdi/testdir

  # Filesystem now looks like:
  #
  # .                            (ino 256)
  # |----- testdir_2/            (ino 259)
  # |          |----- a          (ino 260)
  # |
  # |----- testdir               (ino 257)
  # |----- b                     (ino 257)
  # |----- b2                    (ino 258)

  btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2
  btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2

  mkfs.btrfs -f /dev/sdj >/dev/null
  mount /dev/sdj /mnt/sdj

  btrfs receive -f /tmp/snap1.send /mnt/sdj
  btrfs receive -f /tmp/snap2.send /mnt/sdj

  umount /mnt/sdi
  umount /mnt/sdj

When running the reproducer, the receive of the incremental send stream
fails:

  $ ./reproducer.sh
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
  At subvol /mnt/sdi/snap1
  Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
  At subvol /mnt/sdi/snap2
  At subvol snap1
  At snapshot snap2
  ERROR: link b -> o259-6-0/a failed: No such file or directory

The problem happens because of the following:

1) Before we start iterating the list of new references for inode 257,
   we generate its current path and store it at @valid_path, done at
   the very beginning of process_recorded_refs(). The generated path
   is "o259-6-0/a", containing the orphanized name for inode 259;

2) Then we iterate over the list of new references, which has the
   references "b" and "testdir" in that specific order;

3) We process reference "b" first, because it is in the list before
   reference "testdir". We then issue a link operation to create
   the new reference "b" using a target path corresponding to the
   content at @valid_path, which corresponds to "o259-6-0/a".
   However we haven't yet orphanized inode 259, its name is still
   "testdir", and not "o259-6-0". The orphanization of 259 did not
   happen yet because we will process the reference named "testdir"
   for inode 257 only in the next iteration of the loop that goes
   over the list of new references.

Fix the issue by having a preliminar iteration over all the new references
at process_recorded_refs(). This iteration is responsible only for doing
the orphanization of other inodes that have and old reference that
conflicts with one of the new references of the inode we are currently
processing. The emission of rename and link operations happen now in the
next iteration of the new references.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
Qu Wenruo 1465af12e2 btrfs: tree-checker: fix false alert caused by legacy btrfs root item
Commit 259ee7754b ("btrfs: tree-checker: Add ROOT_ITEM check")
introduced btrfs root item size check, however btrfs root item has two
versions, the legacy one which just ends before generation_v2 member, is
smaller than current btrfs root item size.

This caused btrfs kernel to reject valid but old tree root leaves.

Fix this problem by also allowing legacy root item, since kernel can
already handle them pretty well and upgrade to newer root item format
when needed.

Reported-by: Martin Steigerwald <martin@lichtvoll.de>
Fixes: 259ee7754b ("btrfs: tree-checker: Add ROOT_ITEM check")
CC: stable@vger.kernel.org # 5.4+
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
David Sterba e97659cefe btrfs: use unaligned helpers for stack and header set/get helpers
In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
pointer assignment but we should use the helpers for unaligned access
for clarity. It hasn't been a problem so far because of the natural
alignment.

Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
from stack that has an aligned start but some members may not be aligned
due to packing. This as well hasn't caused problems so far.

Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
David Sterba 6994ca367c btrfs: free-space-cache: use unaligned helpers to access data
The free space inode stores the tracking data, checksums etc, using the
io_ctl structure and moving the pointers. The data are generally aligned
to at least 4 bytes (u32 for CRC) so it's not completely unaligned but
for clarity we should use the proper helpers whenever a struct is
initialized from io_ctl->cur pointer.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
David Sterba e2f896b318 btrfs: send: use helpers for unaligned access to header members
The header is mapped onto the send buffer and thus its members may be
potentially unaligned so use the helpers instead of directly assigning
the pointers. This has worked so far but let's use the helpers to make
that clear.

Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:23 +02:00
Qu Wenruo 2c53a14dd3 btrfs: use own btree inode io_tree owner id
Btree inode is special compared to all other inode extent io_trees,
although it has a btrfs inode, it doesn't have the track_uptodate bit at
all.

This means a lot of things like extent locking doesn't even need to be
applied to btree io tree.

Since it's so special, adds a new owner value for it to make debuging a
little easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Johannes Thumshirn 6b613cc97f btrfs: reschedule when cloning lots of extents
We have several occurrences of a soft lockup from fstest's generic/175
testcase, which look more or less like this one:

  watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
  Kernel panic - not syncing: softlockup: hung tasks
  CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   <IRQ>
   dump_stack+0x77/0xa0
   panic+0xfa/0x2cb
   watchdog_timer_fn.cold+0x85/0xa5
   ? lockup_detector_update_enable+0x50/0x50
   __hrtimer_run_queues+0x99/0x4c0
   ? recalibrate_cpu_khz+0x10/0x10
   hrtimer_run_queues+0x9f/0xb0
   update_process_times+0x28/0x80
   tick_handle_periodic+0x1b/0x60
   __sysvec_apic_timer_interrupt+0x76/0x210
   asm_call_on_stack+0x12/0x20
   </IRQ>
   sysvec_apic_timer_interrupt+0x7f/0x90
   asm_sysvec_apic_timer_interrupt+0x12/0x20
  RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
  RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
  RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
  RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
  RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
  R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
  R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
   ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
   btrfs_release_path+0x67/0x80 [btrfs]
   btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
   btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
   btrfs_clone+0x9ba/0xbd0 [btrfs]
   btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
   ? file_update_time+0xcd/0x120
   btrfs_remap_file_range+0x322/0x3b0 [btrfs]
   do_clone_file_range+0xb7/0x1e0
   vfs_clone_file_range+0x30/0xa0
   ioctl_file_clone+0x8a/0xc0
   do_vfs_ioctl+0x5b2/0x6f0
   __x64_sys_ioctl+0x37/0xa0
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f87977fc247
  RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
  RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
  RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
  R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
  R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---

All of these lockup reports have the call chain btrfs_clone_files() ->
btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
both source and destination extents locked and loops over the source
extent to create the clones.

Conditionally reschedule in the btrfs_clone() loop, to give some time back
to other processes.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Denis Efremov bae12df966 btrfs: use kvcalloc for allocation in btrfs_ioctl_send()
Replace kvzalloc() call with kvcalloc() that also checks the size
internally. There's a standalone overflow check in the function so we
can return invalid parameter combination.  Use array_size() helper to
compute the memory size for clone_sources_tmp.

Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Denis Efremov 8eb2fd0015 btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send()
btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.

[1] https://patchwork.kernel.org/patch/9757891/#20529627

Fixes: 818e010bf9 ("btrfs: replace opencoded kvzalloc with the helper")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Nikolay Borisov c0a4360305 btrfs: remove inode argument from btrfs_start_ordered_extent
The passed in ordered_extent struct is always well-formed and contains
the inode making the explicit argument redundant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:22 +02:00
Nikolay Borisov 510f85edf1 btrfs: remove inode argument from add_pending_csums
It's used to reference the csum root which can be done from the trans
handle as well. Simplify the signature and while at it also remove the
noinline attribute as the function uses only at most 16 bytes of stack
space.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov 3c38c877fc btrfs: sink inode argument in insert_ordered_extent_file_extent
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov 71fe0a55da btrfs: switch btrfs_remove_ordered_extent to btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov 633cc816f7 btrfs: clean BTRFS_I usage in btrfs_destroy_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov 0f20881249 btrfs: open code extent_read_full_page to its sole caller
This makes reading the code a tad easier by decreasing the level of
indirection by one.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov fd513000eb btrfs: sink mirror_num argument in __do_readpage
It's always set to 0 by the 2 callers so move it inside __do_readpage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:21 +02:00
Nikolay Borisov 6f15af6060 btrfs: sink read_flags argument into extent_read_full_page
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove
it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov 003c286aef btrfs: sink mirror_num argument in extent_read_full_page
It's always set to 0 from the sole caller - btrfs_readpage.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov c1be9c1ad5 btrfs: promote extent_read_full_page to btrfs_readpage
Now that btrfs_readpage is the only caller of extent_read_full_page the
latter can be open coded in the former. Use the occassion to rename
__extent_read_full_page to extent_read_full_page. To facillitate this
change submit_one_bio has to be exported as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov 72cffee463 btrfs: remove mirror_num argument from extent_read_full_page
It's called only from btrfs_readpage which always passes 0 so just sink
the argument into extent_read_full_page.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov 1a5ee1e626 btrfs: remove btrfs_get_extent indirection from __do_readpage
Now that this function is only responsible for reading data pages it's
no longer necessary to pass get_extent_t parameter across several
layers of functions. This patch removes this parameter from multiple
functions: __get_extent_map/__do_readpage/__extent_read_full_page/
extent_read_full_page and simply calls btrfs_get_extent directly in
__get_extent_map.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:20 +02:00
Nikolay Borisov 208d6341e8 btrfs: remove btree_get_extent
The sole purpose of this function was to satisfy the requirements of
__do_readpage. Since that function is no longer used to read metadata
pages the need to keep btree_get_extent around has also disappeared.
Simply remove it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Nikolay Borisov 0420177c08 btrfs: simplify metadata pages reading
Metadata pages currently use __do_readpage to read metadata pages,
unfortunately this function is also used to deal with ordinary data
pages. This makes the metadata pages reading code to go through multiple
hoops in order to adhere to __do_readpage invariants. Most of these are
necessary for data pages which could be compressed. For metadata it's
enough to simply build a bio and submit it.

To this effect simply call submit_extent_page directly from
read_extent_buffer_pages which is the only callpath used to populate
extent_buffers with data. This in turn enables further cleanups.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Nikolay Borisov 2f1d3e4b93 btrfs: remove btree_readpage
There is no way for this function to be called as ->readpage() since
it's called from
generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead
code. BTRFS doesn't utilize the first 3 for the btree inode and
implements it's owon readhead mechanism. So simply remove the function.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Filipe Manana bb56f02f26 btrfs: reschedule if necessary when logging directory items
Logging directories with many entries can take a significant amount of
time, and in some cases monopolize a cpu/core for a long time if the
logging task doesn't happen to block often enough.

Johannes and Lu Fengqi reported test case generic/041 triggering a soft
lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
case we log an inode with 3002 hard links, and because the test removed
one hard link before fsyncing the file, the inode logging causes the
parent directory do be logged as well, which has 6004 directory items to
log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
so it can take a significant amount of time and trigger the soft lockup.

So just make tree-log.c:log_dir_items() reschedule when necessary,
releasing the current search path before doing so and then resume from
where it was before the reschedule.

The stack trace produced when the soft lockup happens is the following:

[10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
[10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
[10480.284915] irq event stamp: 29646366
[10480.285987] hardirqs last  enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
[10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
[10480.290856] softirqs last  enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
[10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
[10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
[10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
[10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
[10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
[10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
[10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
[10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
[10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
[10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
[10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
[10480.321104] FS:  00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
[10480.323559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
[10480.327259] Call Trace:
[10480.328286]  ? overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.329784]  __kmalloc+0x831/0xa20
[10480.331009]  ? btrfs_get_32+0xb0/0x1d0 [btrfs]
[10480.332464]  overwrite_item+0x1f0/0x5a0 [btrfs]
[10480.333948]  log_dir_items+0x2ee/0x570 [btrfs]
[10480.335413]  log_directory_changes+0x82/0xd0 [btrfs]
[10480.336926]  btrfs_log_inode+0xc9b/0xda0 [btrfs]
[10480.338374]  ? init_once+0x20/0x20 [btrfs]
[10480.339711]  btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
[10480.341257]  ? dget_parent+0x97/0x2e0
[10480.342480]  btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
[10480.343977]  btrfs_sync_file+0x24b/0x5e0 [btrfs]
[10480.345381]  do_fsync+0x38/0x70
[10480.346483]  __x64_sys_fsync+0x10/0x20
[10480.347703]  do_syscall_64+0x2d/0x70
[10480.348891]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10480.350444] RIP: 0033:0x7fa1dc80970b
[10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
[10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
[10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
[10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
[10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
[10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60

Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
CC: stable@vger.kernel.org # 4.4+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Josef Bacik 49ea112da0 btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc3+ #5 Not tainted
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc+0x37/0x270
	 alloc_inode+0x82/0xb0
	 iget_locked+0x10d/0x2c0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x240
	 sysfs_get_tree+0x16/0x40
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (kernfs_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_dir_ns+0x7a/0xb0
	 sysfs_create_dir_ns+0x60/0xb0
	 kobject_add_internal+0xc0/0x2c0
	 kobject_add+0x6e/0x90
	 btrfs_sysfs_add_block_group_type+0x102/0x160
	 btrfs_make_block_group+0x167/0x230
	 btrfs_alloc_chunk+0x54f/0xb80
	 btrfs_chunk_alloc+0x18e/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_insert_empty_items+0x64/0xb0
	 btrfs_new_inode+0x225/0x730
	 btrfs_create+0xab/0x1f0
	 lookup_open.isra.0+0x52d/0x690
	 path_openat+0x2a7/0x9e0
	 do_filp_open+0x75/0x100
	 do_sys_openat2+0x7b/0x130
	 __x64_sys_openat+0x46/0x70
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_lookup_inode+0x2a/0x8f
	 __btrfs_update_delayed_inode+0x80/0x240
	 btrfs_commit_inode_delayed_inode+0x119/0x120
	 btrfs_evict_inode+0x357/0x500
	 evict+0xcf/0x1f0
	 do_unlinkat+0x1a9/0x2b0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x119c/0x1fc0
	 lock_acquire+0xa7/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x119c/0x1fc0
   lock_acquire+0xa7/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa7/0x3d0
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This happens because when we link in a block group with a new raid index
type we'll create the corresponding sysfs entries for it.  This is
problematic because while restriping we're holding the chunk_mutex, and
while mounting we're holding the tree locks.

Fixing this isn't pretty, we move the call to the sysfs stuff into the
btrfs_create_pending_block_groups() work, where we're not holding any
locks.  This creates a slight race where other threads could see that
there's no sysfs kobj for that raid type, and race to create the
sysfs dir.  Fix this by wrapping the creation in space_info->lock, so we
only get one thread calling kobject_add() for the new directory.  We
don't worry about the lock on cleanup as it only gets deleted on
unmount.

On mount it's more straightforward, we loop through the space_infos
already, just check every raid index in each space_info and added the
sysfs entries for the corresponding block groups.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Josef Bacik 7280490500 btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed.
We create all the space_info's on mount, and we destroy them on unmount.
The list never changes and we're protected from messing with it by the
normal mount/umount path, so kill the RCU stuff around it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:19 +02:00
Nikolay Borisov 7269ddd2f6 btrfs: improve error message in setup_items_for_insert
Reword and update formats to match variable types.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update formats ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:18 +02:00
Nikolay Borisov da9ffb242c btrfs: add kerneldoc for setup_items_for_insert
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:18 +02:00
Nikolay Borisov fc0d82e103 btrfs: sink total_data parameter in setup_items_for_insert
That parameter can easily be derived based on the "data_size" and "nr"
parameters exploit this fact to simply the function's signature. No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:18 +02:00
Nikolay Borisov 3dc9dc8969 btrfs: eliminate total_size parameter from setup_items_for_insert
The value of this argument can be derived from the total_data as it's
simply the value of the data size + size of btrfs_items being touched.
Move the parameter calculation inside the function. This results in a
simpler interface and also a minor size reduction:

./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
Function                                     old     new   delta
btrfs_duplicate_item                         260     259      -1
setup_items_for_insert                      1200    1190     -10
btrfs_insert_empty_items                     177     154     -23

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:18 +02:00
Nikolay Borisov fc0716c2f6 btrfs: re-arrange statements in setup_items_for_insert
Rearrange statements calculating the offset of the newly added items so
that the calculation has to be done only once. No functional change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:18 +02:00
Omar Sandoval 7573df5547 btrfs: sysfs: export supported send stream version
This reports the latest send stream version supported by the kernel as
the feature in /sys/fs/btrfs/features/send_stream_version .

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Omar Sandoval c9a949af13 btrfs: send: use btrfs_file_extent_end() in send_write_or_clone()
send_write_or_clone() basically has an open-coded copy of
btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
instead of sectorsize. Fix and simplify the code by using
btrfs_file_extent_end().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Omar Sandoval 8c7d9fe06f btrfs: send: avoid copying file data
send_write() currently copies from the page cache to sctx->read_buf, and
then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
zeroes sctx->read_buf and then copies from sctx->read_buf to
sctx->send_buf. However, if we write the TLV header manually, we can
copy to sctx->send_buf directly and get rid of sctx->read_buf.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Omar Sandoval a9b2e0de92 btrfs: send: get rid of i_size logic in send_write()
send_write()/fill_read_buf() have some logic for avoiding reading past
i_size. However, everywhere that we call
send_write()/send_extent_data(), we've already clamped the length down
to i_size. Get rid of the i_size handling, which simplifies the next
change.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Filipe Manana 0cbb5bdfea btrfs: rename btrfs_insert_clone_extent() to a more generic name
Now that we use the same mechanism to replace all the extents in a file
range with either a hole, an existing extent (when cloning) or a new
extent (when using fallocate), the name of btrfs_insert_clone_extent()
no longer reflects its genericity.

So rename it to btrfs_insert_replace_extent(), since what it does is
to either insert an existing extent or a new extent into a file range.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Filipe Manana 306bfec02b btrfs: rename btrfs_punch_hole_range() to a more generic name
The function btrfs_punch_hole_range() is now used to replace all the file
extents in a given file range with an extent described in the given struct
btrfs_replace_extent_info argument. This extent can either be an existing
extent that is being cloned or it can be a new extent (namely a prealloc
extent). When that argument is NULL it only punches a hole (drops all the
existing extents) in the file range.

So rename the function to btrfs_replace_file_extents().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:17 +02:00
Filipe Manana bf385648fa btrfs: rename struct btrfs_clone_extent_info to a more generic name
Now that we can use btrfs_clone_extent_info to convey information for a
new prealloc extent as well, and not just for existing extents that are
being cloned, rename it to btrfs_replace_extent_info, which reflects the
fact that this is now more generic and it is used to replace all existing
extents in a file range with the extent described by the structure.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:16 +02:00
Filipe Manana fb870f6cdd btrfs: remove item_size member of struct btrfs_clone_extent_info
The value of item_size of struct btrfs_clone_extent_info is always set to
the size of a non-inline file extent item, and in fact the infrastructure
that uses this structure (btrfs_punch_hole_range()) does not work with
inline file extents at all (and it is not supposed to).

So just remove that field from the structure and use directly
sizeof(struct btrfs_file_extent_item) instead. Also assert that the
file extent type is not inline at btrfs_insert_clone_extent().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:16 +02:00
Filipe Manana 8fccebfa53 btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume
that reserving 3 units of metadata space is enough, that at most we touch
one leaf in subvolume/fs tree for removing existing file extent items and
inserting a new file extent item. This assumption is generally true for
most common use cases. However when we end up needing to remove file extent
items from multiple leaves, we can end up failing with -ENOSPC and abort
the current transaction, turning the filesystem to RO mode. When this
happens a stack trace like the following is dumped in dmesg/syslog:

[ 1500.620934] ------------[ cut here ]------------
[ 1500.620938] BTRFS: Transaction aborted (error -28)
[ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
[ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
[ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
[ 1500.621026] Code: 8b 40 50 f0 48 (...)
[ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
[ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
[ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
[ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
[ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
[ 1500.621039] FS:  00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
[ 1500.621041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
[ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1500.621049] Call Trace:
[ 1500.621076]  btrfs_prealloc_file_range+0x10/0x20 [btrfs]
[ 1500.621087]  btrfs_fallocate+0xccd/0x1280 [btrfs]
[ 1500.621108]  vfs_fallocate+0x14d/0x290
[ 1500.621112]  ksys_fallocate+0x3a/0x70
[ 1500.621117]  __x64_sys_fallocate+0x1a/0x20
[ 1500.621120]  do_syscall_64+0x33/0x80
[ 1500.621123]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1500.621126] RIP: 0033:0x7fb5b248c477
[ 1500.621128] Code: 89 7c 24 08 (...)
[ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
[ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
[ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
[ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
[ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
[ 1500.621151] irq event stamp: 1026217
[ 1500.621154] hardirqs last  enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
[ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
[ 1500.621159] softirqs last  enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
[ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
[ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
[ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left

When we use fallocate() internally, for reserving an extent for a space
cache, inode cache or relocation, we can't hit this problem since either
there aren't any file extent items to remove from the subvolume tree or
there is at most one.

When using plain fallocate() it's very unlikely, since that would require
having many file extent items representing holes for the target range and
crossing multiple leafs - we attempt to increase the range (merge) of such
file extent items when punching holes, so at most we end up with 2 file
extent items for holes at leaf boundaries.

However when using the zero range operation of fallocate() for a large
range (100+ MiB for example) that's fairly easy to trigger. The following
example reproducer triggers the issue:

  $ cat reproducer.sh
  #!/bin/bash

  umount /dev/sdj &> /dev/null
  mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
  mount /dev/sdj /mnt/sdj

  # Create a 100M file with many file extent items. Punch a hole every 8K
  # just to speedup the file creation - we could do 4K sequential writes
  # followed by fsync (or O_SYNC) as well, but that takes a lot of time.
  file_size=$((100 * 1024 * 1024))
  xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
  for ((i = 0; i < $file_size; i += 8192)); do
      xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
  done

  # Force a transaction commit, so the zero range operation will be forced
  # to COW all metadata extents it need to touch.
  sync

  xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar

  umount /mnt/sdj

  $ ./reproducer.sh
  wrote 104857600/104857600 bytes at offset 0
  100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
  fallocate: No space left on device

  $ dmesg
  <shows the same stack trace pasted before>

To fix this use the existing infrastructure that hole punching and
extent cloning use for replacing a file range with another extent. This
deals with doing the removal of file extent items and inserting the new
one using an incremental approach, reserving more space when needed and
always ensuring we don't leave an implicit hole in the range in case
we need to do multiple iterations and a crash happens between iterations.

A test case for fstests will follow up soon.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:16 +02:00
YueHaibing a31a5876fa btrfs: remove unused function calc_global_rsv_need_space()
It is not used since commit 0096420adb ("btrfs: do not
account global reserve in can_overcommit").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:16 +02:00
Anand Jain 0725c0c935 btrfs: move btrfs_dev_replace_update_device_in_mapping_tree to drop declaration
The function is short and simple, we can get rid of the declaration as
it's not necessary for a static function. Move it before its first
caller.  No functional changes.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:13:15 +02:00
Anand Jain c83b60c0e4 btrfs: simplify gotos in open_seed_device
The function does not have a common exit block and returns immediatelly
so there's no point having the goto. Remove the two cases.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain e493e8f9bc btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device()
We can check the argument value directly, no need for the temporary
variable.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain 1888709d71 btrfs: remove tmp variable for list traversal in btrfs_init_dev_replace_tgtdev
In the function btrfs_init_dev_replace_tgtdev(), the local variable
devices is used only once, we can remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain e17125b52b btrfs: use sprout device_list_mutex in btrfs_init_devices_late
On a mounted sprout filesystem, all threads now are using the
sprout::device_list_mutex, and this is the only code using the
seed::device_list_mutex. This patch converts to use the sprouts
fs_info->fs_devices->device_list_mutex.

The same reasoning holds true here, that device delete is holding
the sprout::device_list_mutex.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain 2fca0db076 btrfs: reada: lock all seed/sprout devices in __reada_start_machine
On an fs mounted using a sprout device, the seed fs_devices are
maintained in a linked list under fs_info->fs_devices. Each seeds
fs_devices also has device_list_mutex initialized to protect against the
potential race with delete threads. But the delete thread (at
btrfs_rm_device()) is holding the fs_info::fs_devices::device_list_mutex
mutex which belongs to sprout device_list_mutex instead of seed
device_list_mutex. Moreover, there aren't any significient benefits in
using the seed::device_list_mutex instead of sprout::device_list_mutex.

So this patch converts them of using the seed::device_list_mutex to
sprout::device_list_mutex.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain 7ad3912a70 btrfs: handle errors in btrfs_sysfs_add_fs_devices
btrfs_sysfs_add_fs_devices() is called by btrfs_sysfs_add_mounted().
btrfs_sysfs_add_mounted() assumes that btrfs_sysfs_add_fs_devices() will
either add sysfs entries for all the devices or none. So this patch keeps up
to its caller expecatation and cleans up the created sysfs entries if it
has to fail at some device in the list.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:22 +02:00
Anand Jain 30b0e4e0e3 btrfs: initialize sysfs devid and device link for seed device
We don't initialize the sysfs devid kobject and device-link yet for the
seed devices in an sprouted filesystem.
So this patch initializes the seed device devid kobject and the device
link in the sysfs.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:21 +02:00
Anand Jain 53f8a74cbe btrfs: split and refactor btrfs_sysfs_remove_devices_dir
Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
argument to indicate whether to free all devices or just one device.

Export btrfs_sysfs_remove_device() as device operations outside of
sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir().

btrfs_sysfs_remove_devices_dir() is renamed to
btrfs_sysfs_remove_fs_devices() to suite its new role.

Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
so it is redeclared s static. And the same function had to be moved
before its first caller.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:21 +02:00
Anand Jain cd36da2e7e btrfs: simplify parameters of btrfs_sysfs_add_devices_dir
When we add a device we need to add it to sysfs, so instead of using the
btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
add a device or all of fs_devices, call the helper function directly
btrfs_sysfs_add_device() and thus make it non-static.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:21 +02:00
Anand Jain 6a416a018f btrfs: make btrfs_sysfs_remove_devices_dir return void
btrfs_sysfs_remove_devices_dir() return value is unused declare it as
void.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:21 +02:00
Anand Jain 985e233e96 btrfs: add btrfs_sysfs_remove_device helper
btrfs_sysfs_remove_devices_dir() removes device link and devid kobject
(sysfs entries) for a device or all the devices in the btrfs_fs_devices.
In preparation to remove these sysfs entries for the seed as well, add
a btrfs_sysfs_remove_device() helper function and avoid code
duplication.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:21 +02:00
Anand Jain 178a16c940 btrfs: add btrfs_sysfs_add_device helper
btrfs_sysfs_add_devices_dir() adds device link and devid kobject
(sysfs entries) for a device or all the devices in the btrfs_fs_devices.
In preparation to add these sysfs entries for the seed as well, add
a btrfs_sysfs_add_device() helper function and avoid code duplication.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Anand Jain c6a5d95495 btrfs: fix replace of seed device
If you replace a seed device in a sprouted fs, it appears to have
successfully replaced the seed device, but if you look closely, it
didn't.  Here is an example.

  $ mkfs.btrfs /dev/sda
  $ btrfstune -S1 /dev/sda
  $ mount /dev/sda /btrfs
  $ btrfs device add /dev/sdb /btrfs
  $ umount /btrfs
  $ btrfs device scan --forget
  $ mount -o device=/dev/sda /dev/sdb /btrfs
  $ btrfs replace start -f /dev/sda /dev/sdc /btrfs
  $ echo $?
  0

  BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
  BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished

  $ btrfs fi show
  Label: none  uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
	  Total devices 2 FS bytes used 256.00KiB
	  devid    1 size 3.00GiB used 520.00MiB path /dev/sdc
	  devid    2 size 3.00GiB used 896.00MiB path /dev/sdb

  Label: none  uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
	  Total devices 1 FS bytes used 128.00KiB
	  devid    1 size 3.00GiB used 536.00MiB path /dev/sda

So as per the replace start command and kernel log replace was successful.
Now let's try to clean mount.

  $ umount /btrfs
  $ btrfs device scan --forget

  $ mount -o device=/dev/sdc /dev/sdb /btrfs
  mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.

  [  636.157517] BTRFS error (device sdc): failed to read chunk tree: -2
  [  636.180177] BTRFS error (device sdc): open_ctree failed

That's because per dev items it is still looking for the original seed
device.

 $ btrfs inspect-internal dump-tree -d /dev/sdb

	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
		devid 1 total_bytes 3221225472 bytes_used 545259520
		io_align 4096 io_width 4096 sector_size 4096 type 0
		generation 6 start_offset 0 dev_group 0
		seek_speed 0 bandwidth 0
		uuid 59368f50-9af2-4b17-91da-8a783cc418d4  <--- seed uuid
		fsid 10bd3202-0415-43af-96a8-d5409f310a7e  <--- seed fsid
	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
		devid 2 total_bytes 3221225472 bytes_used 939524096
		io_align 4096 io_width 4096 sector_size 4096 type 0
		generation 0 start_offset 0 dev_group 0
		seek_speed 0 bandwidth 0
		uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a  <- sprout uuid
		fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid

But the replaced target has the following uuid+fsid in its superblock
which doesn't match with the expected uuid+fsid in its devitem.

  $ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid'
  generation	20
  dev_item.uuid	59368f50-9af2-4b17-91da-8a783cc418d4
  dev_item.fsid	ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match]
  dev_item.devid	1

So if you provide the original seed device the mount shall be
successful.  Which so long happening in the test case btrfs/163.

  $ btrfs device scan --forget
  $ mount -o device=/dev/sda /dev/sdb /btrfs

Fix in this patch:
If a seed is not sprouted then there is no replacement of it, because of
its read-only filesystem with a read-only device. Similarly, in the case
of a sprouted filesystem, the seed device is still read only. So, mark
it as you can't replace a seed device, you can only add a new device and
then delete the seed device. If replace is attempted then returns
-EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Anand Jain 79dae17d8d btrfs: improve device scanning messages
Systems booting without the initramfs seems to scan an unusual kind
of device path (/dev/root). And at a later time, the device is updated
to the correct path. We generally print the process name and PID of the
process scanning the device but we don't capture the same information if
the device path is rescanned with a different pathname.

The current message is too long, so drop the unnecessary UUID and add
process name and PID.

While at this also update the duplicate device warning to include the
process name and PID so the messages are consistent

CC: stable@vger.kernel.org # 4.19+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Josef Bacik 457f1864b5 btrfs: pretty print leaked root name
I'm a actual human being so am incapable of converting u64 to s64 in my
head, so add a helper to get the pretty name of a root objectid and use
that helper to spit out the name for any special roots for leaked roots,
so I don't have to scratch my head and figure out which root I messed up
the refs for.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Goldwyn Rodrigues 66a2823c54 btrfs: sysfs: export currently running exclusive operation
/sys/fs/<fsid>/exclusive_operation contains the currently executing
exclusive operation. Add a sysfs_notify() when operation end, so
userspace can be notified of exclusive operation is finished.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Goldwyn Rodrigues c3e1f96c37 btrfs: enumerate the type of exclusive operation in progress
Instead of using a flag bit for exclusive operation, use a variable to
store which exclusive operation is being performed.  Introduce an API
to start and finish an exclusive operation.

This would enable another way for tools to check which operation is
running on why starting an exclusive operation failed. The followup
patch adds a sysfs_notify() to alert userspace when the state changes, so
userspace can perform select() on it to get notified of the change.

This would enable us to enqueue a command which will wait for current
exclusive operation to complete before issuing the next exclusive
operation. This has been done synchronously as opposed to a background
process, or else error collection (if any) will become difficult.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:20 +02:00
Josef Bacik ca10845a56 btrfs: sysfs: init devices outside of the chunk_mutex
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-rc3+ #4 Not tainted
  ------------------------------------------------------
  kswapd0/100 is trying to acquire lock:
  ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330

  but task is already holding lock:
  ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (fs_reclaim){+.+.}-{0:0}:
	 fs_reclaim_acquire+0x65/0x80
	 slab_pre_alloc_hook.constprop.0+0x20/0x200
	 kmem_cache_alloc+0x37/0x270
	 alloc_inode+0x82/0xb0
	 iget_locked+0x10d/0x2c0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x240
	 sysfs_get_tree+0x16/0x40
	 vfs_get_tree+0x28/0xc0
	 path_mount+0x434/0xc00
	 __x64_sys_mount+0xe3/0x120
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (kernfs_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_link+0x63/0xa0
	 sysfs_do_create_link_sd+0x5e/0xd0
	 btrfs_sysfs_add_devices_dir+0x81/0x130
	 btrfs_init_new_device+0x67f/0x1250
	 btrfs_ioctl+0x1ef/0x2e20
	 __x64_sys_ioctl+0x83/0xb0
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7e0
	 btrfs_chunk_alloc+0x125/0x3a0
	 find_free_extent+0xdf6/0x1210
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb0/0x310
	 alloc_tree_block_no_bg_flush+0x4a/0x60
	 __btrfs_cow_block+0x11a/0x530
	 btrfs_cow_block+0x104/0x220
	 btrfs_search_slot+0x52e/0x9d0
	 btrfs_insert_empty_items+0x64/0xb0
	 btrfs_insert_delayed_items+0x90/0x4f0
	 btrfs_commit_inode_delayed_items+0x93/0x140
	 btrfs_log_inode+0x5de/0x2020
	 btrfs_log_inode_parent+0x429/0xc90
	 btrfs_log_new_name+0x95/0x9b
	 btrfs_rename2+0xbb9/0x1800
	 vfs_rename+0x64f/0x9f0
	 do_renameat2+0x320/0x4e0
	 __x64_sys_rename+0x1f/0x30
	 do_syscall_64+0x33/0x40
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 __lock_acquire+0x119c/0x1fc0
	 lock_acquire+0xa7/0x3d0
	 __mutex_lock+0x7e/0x7e0
	 __btrfs_release_delayed_node.part.0+0x3f/0x330
	 btrfs_evict_inode+0x24c/0x500
	 evict+0xcf/0x1f0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x44/0x50
	 super_cache_scan+0x161/0x1e0
	 do_shrink_slab+0x178/0x3c0
	 shrink_slab+0x17c/0x290
	 shrink_node+0x2b2/0x6d0
	 balance_pgdat+0x30a/0x670
	 kswapd+0x213/0x4c0
	 kthread+0x138/0x160
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/100:
   #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
   #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0

  stack backtrace:
  CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb8
   check_noncircular+0x12d/0x150
   __lock_acquire+0x119c/0x1fc0
   lock_acquire+0xa7/0x3d0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   __mutex_lock+0x7e/0x7e0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? __btrfs_release_delayed_node.part.0+0x3f/0x330
   ? lock_acquire+0xa7/0x3d0
   ? find_held_lock+0x2b/0x80
   __btrfs_release_delayed_node.part.0+0x3f/0x330
   btrfs_evict_inode+0x24c/0x500
   evict+0xcf/0x1f0
   dispose_list+0x48/0x70
   prune_icache_sb+0x44/0x50
   super_cache_scan+0x161/0x1e0
   do_shrink_slab+0x178/0x3c0
   shrink_slab+0x17c/0x290
   shrink_node+0x2b2/0x6d0
   balance_pgdat+0x30a/0x670
   kswapd+0x213/0x4c0
   ? _raw_spin_unlock_irqrestore+0x41/0x50
   ? add_wait_queue_exclusive+0x70/0x70
   ? balance_pgdat+0x670/0x670
   kthread+0x138/0x160
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30

This happens because we are holding the chunk_mutex at the time of
adding in a new device.  However we only need to hold the
device_list_mutex, as we're going to iterate over the fs_devices
devices.  Move the sysfs init stuff outside of the chunk_mutex to get
rid of this lockdep splat.

CC: stable@vger.kernel.org # 4.4.x: f3cd2c5811: btrfs: sysfs, rename device_link add/remove functions
CC: stable@vger.kernel.org # 4.4.x
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Nikolay Borisov facee0a09c btrfs: make extent_fiemap take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Nikolay Borisov 948dfeb86b btrfs: make btrfs_zero_range_check_range_boundary take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Nikolay Borisov 998acfe8ff btrfs: make copy_inline_to_page take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Nikolay Borisov 3c5641a83a btrfs: make btrfs_find_ordered_sum take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:19 +02:00
Nikolay Borisov f1bbde8d5f btrfs: make get_extent_skip_holes take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:18 +02:00
Nikolay Borisov 3347c48f27 btrfs: make btrfs_writepage_endio_finish_ordered btrfs_inode-centric
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:18 +02:00
Nikolay Borisov 53ac7ead24 btrfs: make btrfs_invalidatepage work on btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:18 +02:00
Nikolay Borisov 6fee248d2b btrfs: convert btrfs_inode_sectorsize to take btrfs_inode
It's counterintuitive to have a function named btrfs_inode_xxx which
takes a generic inode. Also move the function to btrfs_inode.h so that
it has access to the definition of struct btrfs_inode.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:18 +02:00
Nikolay Borisov 90c0304c63 btrfs: make btrfs_dec_test_ordered_pending take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:18 +02:00
Nikolay Borisov acbf1dd0fc btrfs: make ordered extent tracepoint take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:17 +02:00
Nikolay Borisov 6d072c8e29 btrfs: make btrfs_lookup_first_ordered_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:17 +02:00
Nikolay Borisov b79b724969 btrfs: make inode_tree_del take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:17 +02:00
Josef Bacik ca9d473a3e btrfs: use BTRFS_NESTED_NEW_ROOT for double splits
I've made this change separate since it requires both of the newly added
NESTED flags and I didn't want to slip it into one of those changes.

If we do a double split of a node we can end up doing a
BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it
appears as a double lock.  Since we're maxed out on subclasses, use
BTRFS_NESTED_NEW_ROOT if we had to do a double split.  This is OK
because we won't have to do a double split if we had to insert a new
root, and the new root would be at a higher level anyway.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:17 +02:00
Josef Bacik cf6f34aa3a btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots
The way we add new roots is confusing from a locking perspective for
lockdep.  We generally have the rule that we lock things in order from
highest level to lowest, but in the case of adding a new level to the
tree we actually allocate a new block for the root, which makes the
locking go in reverse.  A similar issue exists for snapshotting, we cow
the original root for the root of a new tree, however they're at the
same level.  Address this by using BTRFS_NESTING_NEW_ROOT for these
operations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:17 +02:00
Josef Bacik 4dff97e690 btrfs: introduce BTRFS_NESTING_SPLIT for split blocks
If we are splitting a leaf/node, we could do something like the
following

lock(leaf)  BTRFS_NESTING_NORMAL
  lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
    push from leaf -> left
      reset path to point to left
        split left
          allocate new block, lock block BTRFS_NESTING_SPLIT

at the new block point we need to have a different nesting level,
because we have already used either BTRFS_NESTING_LEFT or
BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
adjacent leaves.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik bf59a5a216 btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW
For similar reasons as BTRFS_NESTING_COW, we need
BTRFS_NESTING_LEFT/RIGHT_COW.  The pattern is this

lock leaf -> BTRFS_NESTING_NORMAL
  cow leaf -> BTRFS_NESTING_COW
    split leaf
      lock left -> BTRFS_NESTING_LEFT
        cow left -> BTRFS_NESTING_LEFT_COW

We need this in order to indicate to lockdep that these locks are
discrete and are being taken in a safe order.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik bf77467a93 btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT
Our lockdep maps are based on rootid+level, however in some cases we
will lock adjacent blocks on the same level, namely in searching forward
or in split/balance.  Because of this lockdep will complain, so we need
a separate subclass to indicate to lockdep that these are different
locks.

lock leaf -> BTRFS_NESTING_NORMAL
  cow leaf -> BTRFS_NESTING_COW
    split leaf
       lock left -> BTRFS_NESTING_LEFT
       lock right -> BTRFS_NESTING_RIGHT

The above graph illustrates the need for this new nesting subclass.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik 9631e4cc1a btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
When we COW a block we are holding a lock on the original block, and
then we lock the new COW block.  Because our lockdep maps are based on
root + level, this will make lockdep complain.  We need a way to
indicate a subclass for locking the COW'ed block, so plumb through our
btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.

The reason I've added all this extra infrastructure is because there
will be need of different nesting classes in follow up patches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik fd7ba1c120 btrfs: add nesting tags to the locking helpers
We will need these when we switch to an rwsem, so plumb in the
infrastructure here to use later on.  I violate the 80 character limit
some here because it'll be cleaned up later.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik 51899412dd btrfs: introduce btrfs_path::recurse
Our current tree locking stuff allows us to recurse with read locks if
we're already holding the write lock.  This is necessary for the space
cache inode, as we could be holding a lock on the root_tree root when we
need to cache a block group, and thus need to be able to read down the
root_tree to read in the inode cache.

We can get away with this in our current locking, but we won't be able
to with a rwsem.  Handle this by purposefully annotating the places
where we require recursion, so that in the future we can maybe come up
with a way to avoid the recursion.  In the case of the free space inode,
this will be superseded by the free space tree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:16 +02:00
Josef Bacik 329ced799b btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed
Nested locking with lockdep and everything else refers to lock hierarchy
within the same lock map.  This is how we indicate the same locks for
different objects are ok to take in a specific order, for our use case
that would be to take the lock on a leaf and then take a lock on an
adjacent leaf.

What ->lock_nested _actually_ refers to is if we happen to already be
holding the write lock on the extent buffer and we're allowing a read
lock to be taken on that extent buffer, which is recursion.  Rename this
so we don't get confused when we switch to a rwsem and have to start
using the _nested helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Nikolay Borisov b9ba017fb0 btrfs: don't opencode sync_blockdev in btrfs_init_new_device
Instead of opencoding filemap_write_and_wait simply call syncblockdev as
it makes it abundantly clear what's going on and why this is used. No
semantics changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Nikolay Borisov 4ae312e972 btrfs: remove redundant code from btrfs_free_stale_devices
Following the refactor of btrfs_free_stale_devices in
7bcb8164ad ("btrfs: use device_list_mutex when removing stale devices")
fs_devices are freed after they have been iterated by the inner
list_for_each so the use-after-free fixed by introducing the break in
fd649f10c3 ("btrfs: Fix use-after-free when cleaning up fs_devs with
a single stale device") is no longer necessary. Just remove it
altogether. No functional changes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Nikolay Borisov 44cab9ba37 btrfs: refactor locked condition in btrfs_init_new_device
Invert unlocked to locked and exploit the fact it can only ever be
modified if we are adding a new device to a seed filesystem. This allows
to simplify the check in error: label. No semantics changes.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Nikolay Borisov f4cfa9bdd4 btrfs: use RCU for quick device check in btrfs_init_new_device
When adding a new device there's a mandatory check to see if a device is
being duplicated to the filesystem it's added to. Since this is a
read-only operations not necessary to take device_list_mutex and can simply
make do with an rcu-readlock.

Using just RCU is safe because there won't be another device add delete
running in parallel as btrfs_init_new_device is called only from
btrfs_ioctl_add_dev.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:15 +02:00
Qu Wenruo d16c702fe4 btrfs: ctree: check key order before merging tree blocks
[BUG]
With a crafted image, btrfs can panic at btrfs_del_csums():

  kernel BUG at fs/btrfs/ctree.c:3188!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
  RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
  RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
  RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
  RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
  R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
  R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
  FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
  Call Trace:
  truncate_one_csum+0xac/0xf0
  btrfs_del_csums+0x24f/0x3a0
  __btrfs_free_extent.isra.72+0x5a7/0xbe0
  __btrfs_run_delayed_refs+0x539/0x1120
  btrfs_run_delayed_refs+0xdb/0x1b0
  btrfs_commit_transaction+0x52/0x950
  ? start_transaction+0x94/0x450
  transaction_kthread+0x163/0x190
  kthread+0x105/0x140
  ? btrfs_cleanup_transaction+0x560/0x560
  ? kthread_destroy_worker+0x50/0x50
  ret_from_fork+0x35/0x40
  Modules linked in:
  ---[ end trace 93bf9db00e6c374e ]---

[CAUSE]
This crafted image has a tricky key order corruption:

  checksum tree key (CSUM_TREE ROOT_ITEM 0)
  node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
          ...
          key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
          key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
          ...

  leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
          item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
                  range start 73785344 end 75497472 length 1712128
          item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
                  range start 75497472 end 75501568 length 4096
          item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
                  range start 75501568 end 77283328 length 1781760
          item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
                  range start 77283328 end 77287424 length 4096
          item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
                  range start 4120596480 end 4120903680 length 307200
  leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
          item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
                  range start 77594624 end 79306752 length 1712128
          ...

Note the item 4 key of leaf 29757440, which is obviously too large, and
even larger than the first key of the next leaf.

However it still follows the key order in that tree block, thus tree
checker is unable to detect it at read time, since tree checker can only
work inside one leaf, thus such complex corruption can't be detected in
advance.

[FIX]
The next time to detect such problem is at tree block merge time,
which is in push_node_left(), balance_node_right(), push_leaf_left() or
push_leaf_right().

Now we check if the key order of the right-most key of the left node is
larger than the left-most key of the right node.

By this we don't need to call the full tree-checker, while still keeping
the key order correct as key order in each node is already checked by
tree checker thus we only need to check the above two slots.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Qu Wenruo 07cce5cf3b btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
[BUG]
With a crafted image, btrfs can panic at insert_inline_extent_backref():

  kernel BUG at fs/btrfs/extent-tree.c:1857!
  invalid opcode: 0000 [#1] SMP PTI
  CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
  RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
  RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
  RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
  RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
  RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
  FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
  Call Trace:
  ? _cond_resched+0x1a/0x50
  __btrfs_inc_extent_ref.isra.64+0x7e/0x240
  ? btrfs_merge_delayed_refs+0xa5/0x330
  __btrfs_run_delayed_refs+0x653/0x1120
  btrfs_run_delayed_refs+0xdb/0x1b0
  btrfs_commit_transaction+0x52/0x950
  ? start_transaction+0x94/0x450
  transaction_kthread+0x163/0x190
  kthread+0x105/0x140
  ? btrfs_cleanup_transaction+0x560/0x560
  ? kthread_destroy_worker+0x50/0x50
  ret_from_fork+0x35/0x40
  Modules linked in:
  ---[ end trace 2ad8b3de903cf825 ]---

[CAUSE]
Due to extent tree corruption (still valid by itself, but bad cross
ref), we can allocate an extent which is still in extent tree.  The
offending tree block of that case is from csum tree.  The newly
allocated tree block is also for csum tree.

Then we will try to insert a tree block ref for the existing tree block
ref.

For a tree extent item, tree block can never be shared directly by the
same tree twice.  We have such BUG_ON() to prevent such problem, but
this is not a proper error handling.

[FIX]
Replace that BUG_ON() with proper error message and leaf dump for debug
build.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Qu Wenruo 1c2a07f598 btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
__btrfs_free_extent() is doing two things:

1. Reduce the refs number of an extent backref
   Either it's an inline extent backref (inside EXTENT/METADATA item) or
   a keyed extent backref (SHARED_* item).
   We only need to locate that backref line, either reduce the number or
   remove the backref line completely.

2. Update the refs count in EXTENT/METADATA_ITEM

During step 1), we will try to locate the EXTENT/METADATA_ITEM without
triggering another btrfs_search_slot() as fast path.

Only when we fail to locate that item, we will trigger another
btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we
updated/deleted the backref line.

And we have a lot of strict checks on things like refs_to_drop against
extent refs and special case checks for single ref extents.

There are 7 BUG_ON()s, although they're doing correct checks, they can
be triggered by crafted images.

This patch improves the function:

- Introduce two examples to show what __btrfs_free_extent() is doing
  One inline backref case and one keyed case.  Should cover most cases.

- Kill all BUG_ON()s with proper error message and optional leaf dump

- Add comment to show the overall flow

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
[ The report triggers one BUG_ON() in __btrfs_free_extent() ]
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Qu Wenruo f98b6215d7 btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:

- No overflow check
  Values like start = 1024 len = -1024 can still pass the basic
   (start + len) > eb->len check.

- Checks are not consistent
  For read_extent_buffer() we only check (start + len) against eb->len.
  While for memcmp_extent_buffer() we also check start against eb->len.

- Different error reporting mechanism
  We use WARN() in read_extent_buffer() but BUG() in
  memcpy_extent_buffer().

- Still modify memory if the request is obviously wrong
  In read_extent_buffer() even we find (start + len) > eb->len, we still
  call memset(dst, 0, len), which can easily cause memory access error
  if start + len overflows.

To address above problems, this patch creates a new common function to
check such access, check_eb_range().

- Add overflow check
  This function checks start, start + len against eb->len and overflow
  check.

- Unified checks

- Unified error reports
  Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
  And also do btrfs_warn() message for non-debug build.

- Exit ASAP if check fails
  No more possible memory corruption.

- Add extra comment for @start @len used in those functions as it's
  sometimes confused with the logical addressing instead of a range
  inside the eb space

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Nikolay Borisov 217f5004fe btrfs: rework error detection in init_tree_roots
To avoid duplicating 3 lines of code the error detection logic in
init_tree_roots is somewhat quirky. It first checks for the presence of
any error condition, then checks for the specific condition to perform
any specific actions. That's spurious because directly checking for
each respective error condition and doing the necessary steps is more
obvious. While at it change the -EUCLEAN to -EIO in case the extent
buffer is not read correctly, this is in line with other sites which
return -EIO when the eb couldn't be read.

Additionally it results in smaller code and the code reads
more linearly:

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95)
Function                                     old     new   delta
open_ctree                                 17243   17148     -95
Total: Before=113104, After=113009, chg -0.08%

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:14 +02:00
Qu Wenruo e85fde5162 btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:

  generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)

And with the following metadata leak:

  BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
  Call Trace:
   btrfs_put_super+0x15/0x17 [btrfs]
   generic_shutdown_super+0x72/0x110
   kill_anon_super+0x18/0x30
   btrfs_kill_super+0x17/0x30 [btrfs]
   deactivate_locked_super+0x3b/0xa0
   deactivate_super+0x40/0x50
   cleanup_mnt+0x135/0x190
   __cleanup_mnt+0x12/0x20
   task_work_run+0x64/0xb0
   __prepare_exit_to_usermode+0x1bc/0x1c0
   __syscall_return_slowpath+0x47/0x230
   do_syscall_64+0x64/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ---[ end trace a6cfd45ba80e4e06 ]---
  BTRFS error (device dm-3): qgroup reserved space leaked
  BTRFS info (device dm-3): disk space caching is enabled
  BTRFS info (device dm-3): has skinny extents

[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:

  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
  btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
  btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072

It's pretty obvious that, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().

This leads to the leakage.

[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.

And free the qgroup reserved metadata space when releasing the
block_rsv.

To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.

Fixes: 733e03a0b2 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:13 +02:00
Qu Wenruo b4c5d8fdff btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode
For delayed inode facility, qgroup metadata is reserved for it, and
later freed.

However we're freeing more bytes than we reserved.
In btrfs_delayed_inode_reserve_metadata():

	num_bytes = btrfs_calc_metadata_size(fs_info, 1);
	...
		ret = btrfs_qgroup_reserve_meta_prealloc(root,
				fs_info->nodesize, true);
		...
		if (!ret) {
			node->bytes_reserved = num_bytes;

But in btrfs_delayed_inode_release_metadata():

	if (qgroup_free)
		btrfs_qgroup_free_meta_prealloc(node->root,
				node->bytes_reserved);
	else
		btrfs_qgroup_convert_reserved_meta(node->root,
				node->bytes_reserved);

This means, we're always releasing more qgroup metadata rsv than we have
reserved.

This won't trigger selftest warning, as btrfs qgroup metadata rsv has
extra protection against cases like quota enabled half-way.

But we still need to fix this problem any way.

This patch will use the same num_bytes for qgroup metadata rsv so we
could handle it correctly.

Fixes: f218ea6c47 ("btrfs: delayed-inode: Remove wrong qgroup meta reservation calls")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:13 +02:00
Josef Bacik 425c6ed648 btrfs: do not hold device_list_mutex when closing devices
The following lockdep splat

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
------------------------------------------------------
fsstress/8739 is trying to acquire lock:
ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70

but task is already holding lock:
ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #10 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x129/0x210
       btrfs_page_mkwrite+0x6a/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

-> #9 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       perf_read+0x141/0x2c0
       vfs_read+0xad/0x1b0
       ksys_read+0x5f/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #8 (&cpuctx_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x88/0x150
       perf_event_init+0x1db/0x20b
       start_kernel+0x3ae/0x53c
       secondary_startup_64+0xa4/0xb0

-> #7 (pmus_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x4f/0x150
       cpuhp_invoke_callback+0xb1/0x900
       _cpu_up.constprop.26+0x9f/0x130
       cpu_up+0x7b/0xc0
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x71
       kernel_init_freeable+0x110/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #6 (cpu_hotplug_lock){++++}-{0:0}:
       cpus_read_lock+0x39/0xb0
       kmem_cache_create_usercopy+0x28/0x230
       kmem_cache_create+0x12/0x20
       bioset_init+0x15e/0x2b0
       init_bio+0xa3/0xaa
       do_one_initcall+0x5a/0x2e0
       kernel_init_freeable+0x1f4/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #5 (bio_slab_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       bioset_init+0xbc/0x2b0
       __blk_alloc_queue+0x6f/0x2d0
       blk_mq_init_queue_data+0x1b/0x70
       loop_add+0x110/0x290 [loop]
       fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
       do_one_initcall+0x5a/0x2e0
       do_init_module+0x5a/0x220
       load_module+0x2459/0x26e0
       __do_sys_finit_module+0xba/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #4 (loop_ctl_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       lo_open+0x18/0x50 [loop]
       __blkdev_get+0xec/0x570
       blkdev_get+0xe8/0x150
       do_dentry_open+0x167/0x410
       path_openat+0x7c9/0xa80
       do_filp_open+0x93/0x100
       do_sys_openat2+0x22a/0x2e0
       do_sys_open+0x4b/0x80
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       blkdev_put+0x1d/0x120
       close_fs_devices.part.31+0x84/0x130
       btrfs_close_devices+0x44/0xb0
       close_ctree+0x296/0x2b2
       generic_shutdown_super+0x69/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_run_dev_stats+0x49/0x480
       commit_cowonly_roots+0xb5/0x2a0
       btrfs_commit_transaction+0x516/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_commit_transaction+0x4bb/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       __mutex_lock+0x9f/0x930
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_pagefaults);
                               lock(&mm->mmap_lock#2);
                               lock(sb_pagefaults);
  lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by fsstress/8739:
 #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0

stack backtrace:
CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 ? btrfs_get_alloc_profile+0x150/0x210
 lock_acquire+0x9e/0x360
 ? btrfs_record_root_in_trans+0x43/0x70
 __mutex_lock+0x9f/0x930
 ? btrfs_record_root_in_trans+0x43/0x70
 ? lock_acquire+0x9e/0x360
 ? join_transaction+0x5d/0x450
 ? find_held_lock+0x2d/0x90
 ? btrfs_record_root_in_trans+0x43/0x70
 ? join_transaction+0x3d5/0x450
 ? btrfs_record_root_in_trans+0x43/0x70
 btrfs_record_root_in_trans+0x43/0x70
 start_transaction+0xd1/0x5d0
 btrfs_dirty_inode+0x42/0xd0
 file_update_time+0xc8/0x110
 btrfs_page_mkwrite+0x10c/0x4a0
 ? handle_mm_fault+0x5e/0x1730
 do_page_mkwrite+0x4d/0xc0
 ? __do_fault+0x32/0x150
 handle_mm_fault+0x103c/0x1730
 exc_page_fault+0x340/0x660
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7faa6c9969c4

Was seen in testing.  The fix is similar to that of

  btrfs: open device without device_list_mutex

where we're holding the device_list_mutex and then grab the bd_mutex,
which pulls in a bunch of dependencies under the bd_mutex.  We only ever
call btrfs_close_devices() on mount failure or unmount, so we're save to
not have the device_list_mutex here.  We're already holding the
uuid_mutex which keeps us safe from any external modification of the
fs_devices.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:13 +02:00
Josef Bacik 62cf539120 btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our
final blkdev_put() on the bdev, which will grab the bd_mutex.  As such
we want to be holding as few locks as possible, so move this call
outside of the dev_replace->lock_finishing_cancel_unmount lock.  Since
we're modifying the fs_devices we need to make sure we're holding the
uuid_mutex here, so take that as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:12:13 +02:00
Nikolay Borisov 68abf36016 btrfs: remove alloc_list splice in btrfs_prepare_sprout
btrfs_prepare_sprout is called when the first rw device is added to a
seed filesystem. This means the filesystem can't have its alloc_list
be non-empty, since seed filesystems are read only. Simply remove the
code altogether.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:59 +02:00
Nikolay Borisov 427c8fddb1 btrfs: document some invariants of seed code
Without good understanding of how seed devices works it's hard to grok
some of what the code in open_seed_devices or btrfs_prepare_sprout does.

Add comments hopefully reducing some of the cognitive load.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:58 +02:00
Nikolay Borisov 944d3f9fac btrfs: switch seed device to list api
While this patch touches a bunch of files the conversion is
straighforward. Instead of using the implicit linked list anchored at
btrfs_fs_devices::seed the code is switched to using
list_for_each_entry.

Previous patches in the series already factored out code that processed
both main and seed devices so in those cases the factored out functions
are called on the main fs_devices and then on every seed dev inside
list_for_each_entry.

Using list api also allows to simplify deletion from the seed dev list
performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by
substituting a while() loop with a simple list_del_init.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:58 +02:00
Nikolay Borisov c4989c2fd0 btrfs: simplify setting/clearing fs_info to btrfs_fs_devices
It makes no sense to have sysfs-related routines be responsible for
properly initialising the fs_info pointer of struct btrfs_fs_device.
Instead this can be streamlined by making it the responsibility of
btrfs_init_devices_late to initialize it. That function already
initializes fs_info of every individual device in btrfs_fs_devices.

As far as clearing it is concerned it makes sense to move it to
close_fs_devices. That function is only called when struct
btrfs_fs_devices is no longer in use - either for holding seeds or
main devices for a mounted filesystem.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:58 +02:00
Nikolay Borisov 54eed6ae8d btrfs: make close_fs_devices return void
The return value of this function conveys absolutely no information.
All callers already check the state of fs_devices->opened to decide how
to proceed. So convert the function to returning void. While at it make
btrfs_close_devices also return void.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Nikolay Borisov 3712ccb7f1 btrfs: factor out loop logic from btrfs_free_extra_devids
This prepares the code to switching seeds devices to a proper list.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Nikolay Borisov dc0ab488d2 btrfs: factor out reada loop in __reada_start_machine
This is in preparation for moving fs_devices to proper lists.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Nikolay Borisov 1028d1c48b btrfs: remove err variable from btrfs_get_extent
There's no practical reason too use 'err' as a variable to convey
errors. In fact it's value is either set explicitly in the beginning of
the function or it simply takes the value of 'ret'. Not conforming to
the usual pattern of having ret be the only variable used to convey
errors makes the code more error prone to bugs. In fact one such bug
was introduced by 6bf9e4bd6a ("btrfs: inode: Verify inode mode toi
avoid NULL pointer dereference") by assigning the error value to 'ret'
and not 'err'.

Let's fix that issue and make the function less tricky by leaving only
ret to convey error values.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Josef Bacik 0eb79294db btrfs: dio iomap DSYNC workaround
iomap dio will run generic_write_sync() for us if the iocb is DSYNC.
This is problematic for us because of 2 reasons:

1. we hold the inode_lock() during this operation, and we take it in
   generic_write_sync()
2. we hold a read lock on the dio_sem but take the write lock in fsync

Since we don't want to rip out this code right now, but reworking the
locking is a bit much to do at this point, work around this problem with
this masterpiece of a patch.

First, we clear DSYNC on the iocb so that the iomap stuff doesn't know
that it needs to handle the sync.  We save this fact in
current->journal_info, because we need to see do special things once
we're in iomap_begin, and we have no way to pass private information
into iomap_dio_rw().

Next we specify a separate iomap_dio_ops for sync, which implements an
->end_io() callback that gets called when the dio completes.  This is
important for AIO, because we really do need to run generic_write_sync()
if we complete asynchronously.  However if we're still in the submitting
context when we enter ->end_io() we clear the flag so that the submitter
knows they're the ones that needs to run generic_write_sync().

This is meant to be temporary.  We need to work out how to eliminate the
inode_lock() and the dio_sem in our fsync and use another mechanism to
protect these operations.

Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Goldwyn Rodrigues f85781fb50 btrfs: switch to iomap for direct IO
We're using direct io implementation based on buffer heads. This patch
switches to the new iomap infrastructure.

Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
iomap_begin() for iomap direct I/O functions. This function allocates
and locks all the blocks required for the I/O.  btrfs_submit_direct() is
used as the submit_io() hook for direct I/O ops.

Since we need direct I/O reads to go through iomap_dio_rw(), we change
file_operations.read_iter() to a btrfs_file_read_iter() which calls
btrfs_direct_IO() for direct reads and falls back to
generic_file_buffered_read() for incomplete reads and buffered reads.

We don't need address_space.direct_IO() anymore: set it to noop.

Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
capable of direct I/O reads from a hole, so we don't need to return
-ENOENT.

Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
exclusive in case of writes. This guards against simultaneous truncates.

Use iomap->iomap_end() to check for failed or incomplete direct I/O:

  - for writes, call __endio_write_update_ordered()
  - for reads, unlock extents

btrfs_dio_data is now hooked in iomap->private and not
current->journal_info. It carries the reservation variable and the
amount of data submitted, so we can calculate the amount of data to call
__endio_write_update_ordered in case of an error.

This patch removes last use of struct buffer_head from btrfs.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:57 +02:00
Qu Wenruo 154f7cb868 btrfs: add owner and fs_info to alloc_state io_tree
Commit 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io
tree") introduced btrfs_device::alloc_state extent io tree, but it
doesn't initialize the fs_info and owner member.

This means the following features are not properly supported:

- Fs owner report for insert_state() error
  Without fs_info initialized, although btrfs_err() won't panic, it
  won't output which fs is causing the error.

- Wrong owner for trace events
  alloc_state will get the owner as pinned extents.

Fix this by assiging proper fs_info and owner for
btrfs_device::alloc_state.

Fixes: 1c11b63eff ("btrfs: replace pending/pinned chunks lists with io tree")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:56 +02:00
Marcos Paulo de Souza 4c448ce8b4 btrfs: make read_block_group_item return void
Since it's inclusion on 9afc66498a ("btrfs: block-group: refactor how
we read one block group item") this function always returned 0, so there
is no need to check for the returned value.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:56 +02:00
Leon Romanovsky 24646481fb btrfs: sysfs: fix unused-but-set-variable warnings
The compilation with W=1 generates the following warnings:
 fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
  1630 |  int ret;
       |      ^~~
 fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
  1629 |  u64 features;
       |      ^~~~~~~~

[ The unused variables are leftover from e410e34fad ("Revert "btrfs:
  synchronize incompat feature bits with sysfs files""), which needs
  to be properly fixed by moving feature bit manipulation from the sysfs
  context.  Silence the warning to save pepople time, we got several
  reports. ]

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:56 +02:00
Filipe Manana 487781796d btrfs: make fast fsyncs wait only for writeback
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.

Until commit b5e6c3e170 ("btrfs: always wait on ordered extents at
fsync time"), for fast fsyncs, we used to start logging without even
waiting for the writeback to complete first, we would wait for it to
complete after logging, while holding a transaction open, which lead to
performance issues when using cgroups and probably for other cases too,
as wait for IO while holding a transaction handle should be avoided as
much as possible. After that, for fast fsyncs, we started to wait for
ordered extents to complete before starting to log, which adds some
latency to fsyncs and we even got at least one report about a performance
drop which bisected to that particular change:

https://lore.kernel.org/linux-btrfs/20181109215148.GF23260@techsingularity.net/

This change makes fast fsyncs only wait for writeback to finish before
starting to log the inode, instead of waiting for both the writeback to
finish and for the ordered extents to complete. This brings back part of
the logic we had that extracts checksums from in flight ordered extents,
which are not yet in the checksums tree, and making sure transaction
commits wait for the completion of ordered extents previously logged
(by far most of the time they have already completed by the time a
transaction commit starts, resulting in no wait at all), to avoid any
data loss if an ordered extent completes after the transaction used to
log an inode is committed, followed by a power failure.

When there are no other tasks accessing the checksums and the subvolume
btrees, the ordered extent completion is pretty fast, typically taking
100 to 200 microseconds only in my observations. However when there are
other tasks accessing these btrees, ordered extent completion can take a
lot more time due to lock contention on nodes and leaves of these btrees.
I've seen cases over 2 milliseconds, which starts to be significant. In
particular when we do have concurrent fsyncs against different files there
is a lot of contention on the checksums btree, since we have many tasks
writing the checksums into the btree and other tasks that already started
the logging phase are doing lookups for checksums in the btree.

This change also turns all ranged fsyncs into full ranged fsyncs, which
is something we already did when not using the NO_HOLES features or when
doing a full fsync. This is to guarantee we never miss checksums due to
writeback having been triggered only for a part of an extent, and we end
up logging the full extent but only checksums for the written range, which
results in missing checksums after log replay. Allowing ranged fsyncs to
operate again only in the original range, when using the NO_HOLES feature
and doing a fast fsync is doable but requires some non trivial changes to
the writeback path, which can always be worked on later if needed, but I
don't think they are a very common use case.

Several tests were performed using fio for different numbers of concurrent
jobs, each writing and fsyncing its own file, for both sequential and
random file writes. The tests were run on bare metal, no virtualization,
on a box with 12 cores (Intel i7-8700), 64Gb of RAM and a NVMe device,
with a kernel configuration that is the default of typical distributions
(debian in this case), without debug options enabled (kasan, kmemleak,
slub debug, debug of page allocations, lock debugging, etc).

The following script that calls fio was used:

  $ cat test-fsync.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 5 ]; then
    echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE [write|randwrite]"
    exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3
  BLOCK_SIZE=$4
  WRITE_MODE=$5

  if [ "$WRITE_MODE" != "write" ] && [ "$WRITE_MODE" != "randwrite" ]; then
    echo "Invalid WRITE_MODE, must be 'write' or 'randwrite'"
    exit 1
  fi

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=$WRITE_MODE
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=$BLOCK_SIZE
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT
  fio /tmp/fio-job.ini
  umount $MNT

The results were the following:

*************************
*** sequential writes ***
*************************

==== 1 job, 8GiB file, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=36.6MiB/s (38.4MB/s), 36.6MiB/s-36.6MiB/s (38.4MB/s-38.4MB/s), io=8192MiB (8590MB), run=223689-223689msec

After patch:

WRITE: bw=40.2MiB/s (42.1MB/s), 40.2MiB/s-40.2MiB/s (42.1MB/s-42.1MB/s), io=8192MiB (8590MB), run=203980-203980msec
(+9.8%, -8.8% runtime)

==== 2 jobs, 4GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=35.8MiB/s (37.5MB/s), 35.8MiB/s-35.8MiB/s (37.5MB/s-37.5MB/s), io=8192MiB (8590MB), run=228950-228950msec

After patch:

WRITE: bw=43.5MiB/s (45.6MB/s), 43.5MiB/s-43.5MiB/s (45.6MB/s-45.6MB/s), io=8192MiB (8590MB), run=188272-188272msec
(+21.5% throughput, -17.8% runtime)

==== 4 jobs, 2GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=50.1MiB/s (52.6MB/s), 50.1MiB/s-50.1MiB/s (52.6MB/s-52.6MB/s), io=8192MiB (8590MB), run=163446-163446msec

After patch:

WRITE: bw=64.5MiB/s (67.6MB/s), 64.5MiB/s-64.5MiB/s (67.6MB/s-67.6MB/s), io=8192MiB (8590MB), run=126987-126987msec
(+28.7% throughput, -22.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=64.0MiB/s (68.1MB/s), 64.0MiB/s-64.0MiB/s (68.1MB/s-68.1MB/s), io=8192MiB (8590MB), run=126075-126075msec

After patch:

WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=8192MiB (8590MB), run=94358-94358msec
(+35.6% throughput, -25.2% runtime)

==== 16 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=79.8MiB/s (83.6MB/s), 79.8MiB/s-79.8MiB/s (83.6MB/s-83.6MB/s), io=8192MiB (8590MB), run=102694-102694msec

After patch:

WRITE: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=8192MiB (8590MB), run=76446-76446msec
(+34.1% throughput, -25.6% runtime)

==== 32 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=93.2MiB/s (97.7MB/s), 93.2MiB/s-93.2MiB/s (97.7MB/s-97.7MB/s), io=16.0GiB (17.2GB), run=175836-175836msec

After patch:

WRITE: bw=111MiB/s (117MB/s), 111MiB/s-111MiB/s (117MB/s-117MB/s), io=16.0GiB (17.2GB), run=147001-147001msec
(+19.1% throughput, -16.4% runtime)

==== 64 jobs, 512MiB files, fsync frequency 1, block size 64KiB ====

Before patch:

WRITE: bw=108MiB/s (114MB/s), 108MiB/s-108MiB/s (114MB/s-114MB/s), io=32.0GiB (34.4GB), run=302656-302656msec

After patch:

WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=246003-246003msec
(+23.1% throughput, -18.7% runtime)

************************
***   random writes  ***
************************

==== 1 job, 8GiB file, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=11.5MiB/s (12.0MB/s), 11.5MiB/s-11.5MiB/s (12.0MB/s-12.0MB/s), io=8192MiB (8590MB), run=714281-714281msec

After patch:

WRITE: bw=11.6MiB/s (12.2MB/s), 11.6MiB/s-11.6MiB/s (12.2MB/s-12.2MB/s), io=8192MiB (8590MB), run=705959-705959msec
(+0.9% throughput, -1.7% runtime)

==== 2 jobs, 4GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=12.8MiB/s (13.5MB/s), 12.8MiB/s-12.8MiB/s (13.5MB/s-13.5MB/s), io=8192MiB (8590MB), run=638101-638101msec

After patch:

WRITE: bw=13.1MiB/s (13.7MB/s), 13.1MiB/s-13.1MiB/s (13.7MB/s-13.7MB/s), io=8192MiB (8590MB), run=625374-625374msec
(+2.3% throughput, -2.0% runtime)

==== 4 jobs, 2GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=15.4MiB/s (16.2MB/s), 15.4MiB/s-15.4MiB/s (16.2MB/s-16.2MB/s), io=8192MiB (8590MB), run=531146-531146msec

After patch:

WRITE: bw=17.8MiB/s (18.7MB/s), 17.8MiB/s-17.8MiB/s (18.7MB/s-18.7MB/s), io=8192MiB (8590MB), run=460431-460431msec
(+15.6% throughput, -13.3% runtime)

==== 8 jobs, 1GiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=19.9MiB/s (20.8MB/s), 19.9MiB/s-19.9MiB/s (20.8MB/s-20.8MB/s), io=8192MiB (8590MB), run=412664-412664msec

After patch:

WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=8192MiB (8590MB), run=368589-368589msec
(+11.6% throughput, -10.7% runtime)

==== 16 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=8192MiB (8590MB), run=279924-279924msec

After patch:

WRITE: bw=30.4MiB/s (31.9MB/s), 30.4MiB/s-30.4MiB/s (31.9MB/s-31.9MB/s), io=8192MiB (8590MB), run=269258-269258msec
(+3.8% throughput, -3.8% runtime)

==== 32 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=36.9MiB/s (38.7MB/s), 36.9MiB/s-36.9MiB/s (38.7MB/s-38.7MB/s), io=16.0GiB (17.2GB), run=443581-443581msec

After patch:

WRITE: bw=41.6MiB/s (43.6MB/s), 41.6MiB/s-41.6MiB/s (43.6MB/s-43.6MB/s), io=16.0GiB (17.2GB), run=394114-394114msec
(+12.7% throughput, -11.2% runtime)

==== 64 jobs, 512MiB files, fsync frequency 16, block size 4KiB ====

Before patch:

WRITE: bw=45.9MiB/s (48.1MB/s), 45.9MiB/s-45.9MiB/s (48.1MB/s-48.1MB/s), io=32.0GiB (34.4GB), run=714614-714614msec

After patch:

WRITE: bw=48.8MiB/s (51.1MB/s), 48.8MiB/s-48.8MiB/s (51.1MB/s-51.1MB/s), io=32.0GiB (34.4GB), run=672087-672087msec
(+6.3% throughput, -6.0% runtime)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:56 +02:00
Filipe Manana 75b463d2b4 btrfs: do not commit logs and transactions during link and rename operations
Since commit d4682ba03e ("Btrfs: sync log after logging new name") we
started to commit logs, and fallback to transaction commits when we failed
to log the new names or commit the logs, after link and rename operations
when the target inodes (or their parents) were previously logged in the
current transaction. This was to avoid losing directories despite an
explicit fsync on them when they are ancestors of some inode that got a
new named logged, due to a link or rename operation. However that adds the
cost of starting IO and waiting for it to complete, which can cause higher
latencies for applications.

Instead of doing that, just make sure that when we log a new name for an
inode we don't mark any of its ancestors as logged, so that if any one
does an fsync against any of them, without doing any other change on them,
the fsync commits the log. This way we only pay the cost of a log commit
(or a transaction commit if something goes wrong or a new block group was
created) if the application explicitly asks to fsync any of the parent
directories.

Using dbench, which mixes several filesystems operations including renames,
revealed some significant latency gains. The following script that uses
dbench was used to test this:

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-m single -d single"
  THREADS=16

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -t 300 -D $MNT $THREADS

  umount $MNT

The test was run on bare metal, no virtualization, on a box with 12 cores
(Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
configuration that is the default of typical distributions (debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc).

Results before this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    10750455     0.011   155.088
 Close         7896674     0.001     0.243
 Rename         455222     2.158  1101.947
 Unlink        2171189     0.067   121.638
 Deltree           256     2.425     7.816
 Mkdir             128     0.002     0.003
 Qpathinfo     9744323     0.006    21.370
 Qfileinfo     1707092     0.001     0.146
 Qfsinfo       1786756     0.001    11.228
 Sfileinfo      875612     0.003    21.263
 Find          3767281     0.025     9.617
 WriteX        5356924     0.011   211.390
 ReadX        16852694     0.003     9.442
 LockX           35008     0.002     0.119
 UnlockX         35008     0.001     0.138
 Flush          753458     4.252  1102.249

Throughput 1128.35 MB/sec  16 clients  16 procs  max_latency=1102.255 ms

Results after this patch:

16 clients, after

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    11471098     0.012   448.281
 Close         8426396     0.001     0.925
 Rename         485746     0.123   267.183
 Unlink        2316477     0.080    63.433
 Deltree           288     2.830    11.144
 Mkdir             144     0.003     0.010
 Qpathinfo    10397420     0.006    10.288
 Qfileinfo     1822039     0.001     0.169
 Qfsinfo       1906497     0.002    14.039
 Sfileinfo      934433     0.004     2.438
 Find          4019879     0.026    10.200
 WriteX        5718932     0.011   200.985
 ReadX        17981671     0.003    10.036
 LockX           37352     0.002     0.076
 UnlockX         37352     0.001     0.109
 Flush          804018     5.015   778.033

Throughput 1201.98 MB/sec  16 clients  16 procs  max_latency=778.036 ms
(+6.5% throughput, -29.4% max latency, -75.8% rename latency)

Test case generic/498 from fstests tests the scenario that the previously
mentioned commit fixed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:56 +02:00
Filipe Manana 5522a27e59 btrfs: do not take the log_mutex of the subvolume when pinning the log
During a rename we pin the log to make sure no one commits a log that
reflects an ongoing rename operation, as it might result in a committed
log where it recorded the unlink of the old name without having recorded
the new name. However we are taking the subvolume's log_mutex before
incrementing the log_writers counter, which is not necessary since that
counter is atomic and we only remove the old name from the log and add
the new name to the log after we have incremented log_writers, ensuring
that no one can commit the log after we have removed the old name from
the log and before we added the new name to the log.

By taking the log_mutex lock we are just adding unnecessary contention on
the lock, which can become visible for workloads that mix renames with
fsyncs, writes for files opened with O_SYNC and unlink operations (if the
inode or its parent were fsynced before in the current transaction).

So just remove the lock and unlock of the subvolume's log_mutex at
btrfs_pin_log_trans().

Using dbench, which mixes different types of operations that end up taking
that mutex (fsyncs, renames, unlinks and writes into files opened with
O_SYNC) revealed some small gains. The following script that calls dbench
was used:

  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/btrfs
  MOUNT_OPTIONS="-o ssd -o space_cache=v2"
  MKFS_OPTIONS="-m single -d single"
  THREADS=32

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -s -t 600 -D $MNT $THREADS

  umount $MNT

The test was run on bare metal, no virtualization, on a box with 12 cores
(Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
configuration that is the default of typical distributions (debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc).

Results before this patch:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4410848     0.017   738.640
 Close        3240222     0.001     0.834
 Rename        186850     7.478  1272.476
 Unlink        890875     0.128   785.018
 Deltree          128     2.846    12.081
 Mkdir             64     0.002     0.003
 Qpathinfo    3997659     0.009    11.171
 Qfileinfo     701307     0.001     0.478
 Qfsinfo       733494     0.002     1.103
 Sfileinfo     359362     0.004     3.266
 Find         1546226     0.041     4.128
 WriteX       2202803     7.905  1376.989
 ReadX        6917775     0.003     3.887
 LockX          14392     0.002     0.043
 UnlockX        14392     0.001     0.085
 Flush         309225     0.128  1033.936

Throughput 231.555 MB/sec (sync open)  32 clients  32 procs  max_latency=1376.993 ms

Results after this patch:

Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4603244     0.017   232.776
 Close        3381299     0.001     1.041
 Rename        194871     7.251  1073.165
 Unlink        929730     0.133   119.233
 Deltree          128     2.871    10.199
 Mkdir             64     0.002     0.004
 Qpathinfo    4171343     0.009    11.317
 Qfileinfo     731227     0.001     1.635
 Qfsinfo       765079     0.002     3.568
 Sfileinfo     374881     0.004     1.220
 Find         1612964     0.041     4.675
 WriteX       2296720     7.569  1178.204
 ReadX        7213633     0.003     3.075
 LockX          14976     0.002     0.076
 UnlockX        14976     0.001     0.061
 Flush         322635     0.102   579.505

Throughput 241.4 MB/sec (sync open)  32 clients  32 procs  max_latency=1178.207 ms
(+4.3% throughput, -14.4% max latency)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba 1b51d6fce4 btrfs: send: remove indirect callback parameter for changed_cb
There's a custom callback passed to btrfs_compare_trees which happens to
be named exactly same as the existing function implementing it. This is
confusing and the indirection is not necessary for our needs. Compiler
is clever enough to call it directly so there's effectively no change.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba 8bb1cf1ba6 btrfs: scrub: rename ratelimit state varaible to avoid shadowing
There's already defined _rs within ctree.h:btrfs_printk_ratelimited,
local variables should not use _ to avoid such name clashes with
macro-local variables.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba 0af447d050 btrfs: remove unnecessarily shadowed variables
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
the same as the one we already have.

In btrfs_backref_finish_upper_links, rb_node is same type and used
as temporary cursor to the tree.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba cb4c919830 btrfs: compression: move declarations to header
The declarations of compression algorithm callbacks are defined in the
.c file as they're used from there. Compiler warns that there are no
declarations for public functions when compiling lzo.c/zlib.c/zstd.c.
Fix that by moving the declarations to the header as it's the common
place for all of them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
David Sterba 9e6df7cedf btrfs: remove const from btrfs_feature_set_name
The function btrfs_feature_set_name returns a const char pointer, the
second const is not necessary and reported as a warning:

 In file included from fs/btrfs/space-info.c:6:
 fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
    16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
       | ^~~~~

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
Qu Wenruo e21139c621 btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need()
We're just doing rounding up to sectorsize to calculate the lockend.
There is no need to do the unnecessary length calculation, just direct
round_up() is enough.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik c4923027bd btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang.  This
turned out to be because we'd get into this spot where we were no longer
making progress on data reservations because our exit condition was not
met.  The log is basically

while (!space_info->full && !list_empty(&space_info->tickets))
	flush_space(space_info, flush_state);

where flush state is our various flush states, but doesn't include
ALLOC_CHUNK_FORCE.  This is because we actually lead with allocating
chunks, and so the assumption was that once you got to the actual
flushing states you could no longer allocate chunks.  This was a stupid
assumption, because you could have deleted block groups that would be
reclaimed by a transaction commit, thus unsetting space_info->full.
This is essentially what happens with generic/102, and so sometimes
you'd get stuck in the flushing loop because we weren't allocating
chunks, but flushing space wasn't giving us what we needed to make
progress.

Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states,
that way we will eventually bail out because we did end up with
space_info->full if we free'd a chunk previously.  Otherwise, as is the
case for this test, we'll allocate our chunk and continue on our happy
merry way.

Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik 1a7a92c8dd btrfs: add a comment explaining the data flush steps
The data flushing steps are not obvious to people other than myself and
Chris.  Write a giant comment explaining the reasoning behind each flush
step for data as well as why it is in that particular order.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik 5705674081 btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data
reservations to use an async reclaim helper to satisfy tickets.  Before
we could have multiple tasks race in and both allocate chunks, resulting
in more data chunks than we would necessarily need.  Serializing these
allocations and making a single thread responsible for flushing will
only allocate chunks as needed, as well as cut down on transaction
commits and other flush related activities.

Priority reservations will still work as they have before, simply
trying to allocate a chunk until they can make their reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik cb3e393045 btrfs: flush delayed refs when trying to reserve data space
We can end up with freed extents in the delayed refs, and thus
may_commit_transaction() may not think we have enough pinned space to
commit the transaction and we'll ENOSPC early.  Handle this by running
the delayed refs in order to make sure pinned is uptodate before we try
to commit the transaction.

Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:54 +02:00
Josef Bacik 327feeeb2e btrfs: run delayed iputs before committing the transaction for data
Before we were waiting on iputs after we committed the transaction, but
this doesn't really make much sense.  We want to reclaim any space we
may have in order to be more likely to commit the transaction, due to
pinned space being added by running the delayed iputs.  Fix this by
making delayed iputs run before committing the transaction.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik bb86bd3db8 btrfs: don't force commit if we are data
We used to unconditionally commit the transaction at least 2 times and
then on the 3rd try check against pinned space to make sure committing
the transaction was worth the effort.  This is overkill, we know nobody
is going to steal our reservation, and if we can't make our reservation
with the pinned amount simply bail out.

This also cleans up the passing of bytes_needed to
may_commit_transaction, as that was the thing we added into place in
order to accomplish this behavior.  We no longer need it so remove that
mess.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik 0282700135 btrfs: drop the commit_cycles stuff for data reservations
This was an old wart left over from how we previously did data
reservations.  Before we could have people race in and take a
reservation while we were flushing space, so we needed to make sure we
looped a few times before giving up.  Now that we're using the ticketing
infrastructure we don't have to worry about this and can drop the logic
altogether.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik f3bda421c1 btrfs: use the same helper for data and metadata reservations
Now that data reservations follow the same pattern as metadata
reservations we can simply rename __reserve_metadata_bytes to
__reserve_bytes and use that helper for data reservations.

Things to keep in mind, btrfs_can_overcommit() returns 0 for data,
because we can never overcommit.  We also will never pass in FLUSH_ALL
for data, so we'll simply be added to the priority list and go straight
into handle_reserve_ticket.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik 0532a6f8b6 btrfs: serialize data reservations if we are flushing
Nikolay reported a problem where generic/371 would fail sometimes with a
slow drive.  The gist of the test is that we fallocate a file in
parallel with a pwrite of a different file.  These two files combined
are smaller than the file system, but sometimes the pwrite would ENOSPC.

A fair bit of investigation uncovered the fact that the fallocate
workload was racing in and grabbing the free space that the pwrite
workload was trying to free up so it could make its own reservation.
After a few loops of this eventually the pwrite workload would error out
with an ENOSPC.

We've had the same problem with metadata as well, and we serialized all
metadata allocations to satisfy this problem.  This wasn't usually a
problem with data because data reservations are more straightforward,
but obviously could still happen.

Fix this by not allowing reservations to occur if there are any pending
tickets waiting to be satisfied on the space info.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik 1004f6860f btrfs: use ticketing for data space reservations
Now that we have all the infrastructure in place, use the ticketing
infrastructure to make data allocations.  This still maintains the exact
same flushing behavior, but now we're using tickets to get our
reservations satisfied.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:53 +02:00
Josef Bacik 8698fc4eb7 btrfs: add btrfs_reserve_data_bytes and use it
Create a new function btrfs_reserve_data_bytes() in order to handle data
reservations.  This uses the new flush types and flush states to handle
making data reservations.

This patch specifically does not change any functionality, and is
purposefully not cleaned up in order to make bisection easier for the
future patches.  The new helper is identical to the old helper in how it
handles data reservations.  We first try to force a chunk allocation,
and then we run through the flush states all at once and in the same
order that they were done with the old helper.

Subsequent patches will clean this up and change the behavior of the
flushing, and it is important to keep those changes separate so we can
easily bisect down to the patch that caused the regression, rather than
the patch that made us start using the new infrastructure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik a1ed0a8216 btrfs: add the data transaction commit logic into may_commit_transaction
Data space flushing currently unconditionally commits the transaction
twice in a row, and the last time it checks if there's enough pinned
extents to satisfy its reservation before deciding to commit the
transaction for the 3rd and final time.

Encode this logic into may_commit_transaction().  In the next patch we
will pass in U64_MAX for bytes_needed the first two times, and the final
time we will pass in the actual bytes we need so the normal logic will
apply.

This patch exists solely to make the logical changes I will make to the
flushing state machine separate to make it easier to bisect any
performance related regressions.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik 058e6d1d26 btrfs: add flushing states for handling data reservations
Currently the way we do data reservations is by seeing if we have enough
space in our space_info.  If we do not and we're a normal inode we'll

1) Attempt to force a chunk allocation until we can't anymore.
2) If that fails we'll flush delalloc, then commit the transaction, then
   run the delayed iputs.

If we are a free space inode we're only allowed to force a chunk
allocation.  In order to use the normal flushing mechanism we need to
encode this into a flush state array for normal inodes.  Since both will
start with allocating chunks until the space info is full there is no
need to add this as a flush state, this will be handled specially.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik 448b966b49 btrfs: check tickets after waiting on ordered extents
Right now if the space is freed up after the ordered extents complete
(which is likely since the reservations are held until they complete),
we would do extra delalloc flushing before we'd notice that we didn't
have any more tickets.  Fix this by moving the tickets check after our
wait_ordered_extents check.

Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik 38d715f494 btrfs: use btrfs_start_delalloc_roots in shrink_delalloc
The original iteration of flushing had us flushing delalloc and then
checking to see if we could make our reservation, thus we were very
careful about how many pages we would flush at once.

But now that everything is async and we satisfy tickets as the space
becomes available we don't have to keep track of any of this, simply
try and flush the number of dirty inodes we may have in order to
reclaim space to make our reservation.  This cleans up our delalloc
flushing significantly.

The async_pages stuff is dropped because btrfs_start_delalloc_roots()
handles the case that we generate async extents for us, so we no longer
require this extra logic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik 39753e4a3a btrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc
We are going to use the ticket infrastructure for data, so use the
btrfs_space_info_free_bytes_may_use() helper in
btrfs_free_reserved_data_space_noquota() so we get the
btrfs_try_granting_tickets call when we free our reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:52 +02:00
Josef Bacik 99ffb43e5d btrfs: call btrfs_try_granting_tickets when reserving space
If we have compression on we could free up more space than we reserved,
and thus be able to make a space reservation.  Add the call for this
scenario.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik 2732798c9b btrfs: call btrfs_try_granting_tickets when unpinning anything
When unpinning we were only calling btrfs_try_granting_tickets() if
global_rsv->space_info == space_info, which is problematic because we
use ticketing for SYSTEM chunks, and want to use it for DATA as well.
Fix this by moving this call outside of that if statement.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik 3308234a7e btrfs: call btrfs_try_granting_tickets when freeing reserved bytes
We were missing a call to btrfs_try_granting_tickets in
btrfs_free_reserved_bytes, so add it to handle the case where we're able
to satisfy an allocation because we've freed a pending reservation.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik c6c453032e btrfs: make ALLOC_CHUNK use the space info flags
We have traditionally used flush_space() to flush metadata space, so
we've been unconditionally using btrfs_metadata_alloc_profile() for our
profile to allocate a chunk. However if we're going to use this for
data we need to use btrfs_get_alloc_profile() on the space_info we pass
in.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik 920a9958c2 btrfs: make shrink_delalloc take space_info as an arg
Currently shrink_delalloc just looks up the metadata space info, but
this won't work if we're trying to reclaim space for data chunks.  We
get the right space_info we want passed into flush_space, so simply pass
that along to shrink_delalloc.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik d7f81fac97 btrfs: handle U64_MAX for shrink_delalloc
Data allocations are going to want to pass in U64_MAX for flushing
space, adjust shrink_delalloc to handle this properly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:51 +02:00
Josef Bacik 288be2d997 btrfs: remove orig from shrink_delalloc
We don't use this anywhere inside of shrink_delalloc since 17024ad0a0
("Btrfs: fix early ENOSPC due to delalloc"), remove it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Josef Bacik b49121393f btrfs: change nr to u64 in btrfs_start_delalloc_roots
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
btrfs_start_delalloc_roots() that takes an int for nr, which makes using
them in conjunction, especially for something like (u64)-1, annoying and
inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
adjust start_delalloc_inodes() and it's callers appropriately.

This means we've adjusted start_delalloc_inodes() to take a pointer of
nr since we want to preserve the ability for start-delalloc_inodes() to
return an error, so simply make it do the nr adjusting as necessary.

Part of adjusting the callers to this means changing
btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
confusing because it seems unrelated, but the caller of
btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
function variable that needs to be changed.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Nikolay Borisov 8e56008180 btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid
It can be accessed from 'fs_devices' as it's identical to
fs_info->fs_devices. Also add a comment about why we are calling the
function. No semantic changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Nikolay Borisov 57297c1e8e btrfs: remove spurious BUG_ON in btrfs_get_extent
That BUG_ON cannot ever trigger because as the comment there states -
'err' is always set. Simply remove it as it brings no value.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Randy Dunlap 260db43cd2 btrfs: delete duplicated words + other fixes in comments
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:50 +02:00
Qu Wenruo 437490fed3 btrfs: tracepoints: output proper root owner for trace_find_free_extent()
The current trace event always output result like this:

 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)

T's saying we're allocating data extent for EXTENT tree, which is not
even possible.

It's because we always use EXTENT tree as the owner for
trace_find_free_extent() without using the @root from
btrfs_reserve_extent().

This patch will change the parameter to use proper @root for
trace_find_free_extent():

Now it looks much better:

 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
 find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
 find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)

Reported-by: Hans van Kranenburg <hans@knorrie.org>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:49 +02:00
Namjae Jeon 8ff006e57a exfat: fix use of uninitialized spinlock on error path
syzbot reported warning message:

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1d6/0x29e lib/dump_stack.c:118
 register_lock_class+0xf06/0x1520 kernel/locking/lockdep.c:893
 __lock_acquire+0xfd/0x2ae0 kernel/locking/lockdep.c:4320
 lock_acquire+0x148/0x720 kernel/locking/lockdep.c:5029
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 exfat_cache_inval_inode+0x30/0x280 fs/exfat/cache.c:226
 exfat_evict_inode+0x124/0x270 fs/exfat/inode.c:660
 evict+0x2bb/0x6d0 fs/inode.c:576
 exfat_fill_super+0x1e07/0x27d0 fs/exfat/super.c:681
 get_tree_bdev+0x3e9/0x5f0 fs/super.c:1342
 vfs_get_tree+0x88/0x270 fs/super.c:1547
 do_new_mount fs/namespace.c:2875 [inline]
 path_mount+0x179d/0x29e0 fs/namespace.c:3192
 do_mount fs/namespace.c:3205 [inline]
 __do_sys_mount fs/namespace.c:3413 [inline]
 __se_sys_mount+0x126/0x180 fs/namespace.c:3390
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

If exfat_read_root() returns an error, spinlock is used in
exfat_evict_inode() without initialization. This patch combines
exfat_cache_init_inode() with exfat_inode_init_once() to initialize
spinlock by slab constructor.

Fixes: c35b6810c4 ("exfat: add exfat cache")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: syzbot <syzbot+b91107320911a26c9a95@syzkaller.appspotmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-07 14:27:13 +09:00
Tetsuhiro Kohada d6c9efd924 exfat: fix pointer error checking
Fix missing result check of exfat_build_inode().
And use PTR_ERR_OR_ZERO instead of PTR_ERR.

Signed-off-by: Tetsuhiro Kohada <kohada.t2@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
2020-10-07 14:26:55 +09:00
Linus Torvalds d1a819a2ec splice: teach splice pipe reading about empty pipe buffers
Tetsuo Handa reports that splice() can return 0 before the real EOF, if
the data in the splice source pipe is an empty pipe buffer.  That empty
pipe buffer case doesn't happen in any normal situation, but you can
trigger it by doing a write to a pipe that fails due to a page fault.

Tetsuo has a test-case to show the behavior:

  #define _GNU_SOURCE
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <unistd.h>

  int main(int argc, char *argv[])
  {
	const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
	int pipe_fd[2] = { -1, -1 };
	pipe(pipe_fd);
	write(pipe_fd[1], NULL, 4096);
	/* This splice() should wait unless interrupted. */
	return !splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
  }

which results in

    write(5, NULL, 4096)                    = -1 EFAULT (Bad address)
    splice(4, NULL, 3, NULL, 65536, 0)      = 0

and this can confuse splice() users into believing they have hit EOF
prematurely.

The issue was introduced when the pipe write code started pre-allocating
the pipe buffers before copying data from user space.

This is modified verion of Tetsuo's original patch.

Fixes: a194dfe6e6 ("pipe: Rearrange sequence in pipe_write() to preallocate slot")
Link:https://lore.kernel.org/linux-fsdevel/20201005121339.4063-1-penguin-kernel@I-love.SAKURA.ne.jp/
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-06 10:27:22 -07:00
Ashish Sangwan 247db73560 NFS: fix nfs_path in case of a rename retry
We are generating incorrect path in case of rename retry because
we are restarting from wrong dentry. We should restart from the
dentry which was received in the call to nfs_path.

CC: stable@vger.kernel.org
Signed-off-by: Ashish Sangwan <ashishsangwan2@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-06 10:21:18 -04:00
Amir Goldstein be4df0cea0 ovl: use generic vfs_ioc_setflags_prepare() helper
Canonalize to ioctl FS_* flags instead of inode S_* flags.

Note that we do not call the helper vfs_ioc_fssetxattr_check()
for FS_IOC_FSSETXATTR ioctl. The reason is that underlying filesystem
will perform all the checks. We only need to perform the capability
check before overriding credentials.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-06 15:38:15 +02:00
Amir Goldstein 61536bed21 ovl: support [S|G]ETFLAGS and FS[S|G]ETXATTR ioctls for directories
[S|G]ETFLAGS and FS[S|G]ETXATTR ioctls are applicable to both files and
directories, so add ioctl operations to dir as well.

We teach ovl_real_fdget() to get the realfile of directories which use
a different type of file->private_data.

Ifdef away compat ioctl implementation to conform to standard practice.

With this change, xfstest generic/079 which tests these ioctls on files
and directories passes.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-06 15:38:14 +02:00
David S. Miller 8b0308fe31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 18:40:01 -07:00
Christoph Hellwig 10ed16662d block: add a bdget_part helper
All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct.  Add
a helper just for that and then mark bdget static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:38:33 -06:00
Kees Cook 0fa8e08464 fs/kernel_file_read: Add "offset" arg for partial reads
To perform partial reads, callers of kernel_read_file*() must have a
non-NULL file_size argument and a preallocated buffer. The new "offset"
argument can then be used to seek to specific locations in the file to
fill the buffer to, at most, "buf_size" per call.

Where possible, the LSM hooks can report whether a full file has been
read or not so that the contents can be reasoned about.

Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201002173828.2099543-14-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:37:04 +02:00
Kees Cook 2039bda1fa LSM: Add "contents" flag to kernel_read_file hook
As with the kernel_load_data LSM hook, add a "contents" flag to the
kernel_read_file LSM hook that indicates whether the LSM can expect
a matching call to the kernel_post_read_file LSM hook with the full
contents of the file. With the coming addition of partial file read
support for kernel_read_file*() API, the LSM will no longer be able
to always see the entire contents of a file during the read calls.

For cases where the LSM must read examine the complete file contents,
it will need to do so on its own every time the kernel_read_file
hook is called with contents=false (or reject such cases). Adjust all
existing LSMs to retain existing behavior.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-12-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:37:03 +02:00
Kees Cook 885352881f fs/kernel_read_file: Add file_size output argument
In preparation for adding partial read support, add an optional output
argument to kernel_read_file*() that reports the file size so callers
can reason more easily about their reading progress.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-8-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:37:03 +02:00
Kees Cook 113eeb5177 fs/kernel_read_file: Switch buffer size arg to size_t
In preparation for further refactoring of kernel_read_file*(), rename
the "max_size" argument to the more accurate "buf_size", and correct
its type to size_t. Add kerndoc to explain the specifics of how the
arguments will be used. Note that with buf_size now size_t, it can no
longer be negative (and was never called with a negative value). Adjust
callers to use it as a "maximum size" when *buf is NULL.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-7-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:34:19 +02:00
Kees Cook f7a4f689bc fs/kernel_read_file: Remove redundant size argument
In preparation for refactoring kernel_read_file*(), remove the redundant
"size" argument which is not needed: it can be included in the return
code, with callers adjusted. (VFS reads already cannot be larger than
INT_MAX.)

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-6-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:34:18 +02:00
Kees Cook 5287b07f6d fs/kernel_read_file: Split into separate source file
These routines are used in places outside of exec(2), so in preparation
for refactoring them, move them into a separate source file,
fs/kernel_read_file.c.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Link: https://lore.kernel.org/r/20201002173828.2099543-5-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:34:18 +02:00
Scott Branden b89999d004 fs/kernel_read_file: Split into separate include file
Move kernel_read_file* out of linux/fs.h to its own linux/kernel_read_file.h
include file. That header gets pulled in just about everywhere
and doesn't really need functions not related to the general fs interface.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/r/20200706232309.12010-2-scott.branden@broadcom.com
Link: https://lore.kernel.org/r/20201002173828.2099543-4-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:34:18 +02:00
Kees Cook c307459b9d fs/kernel_read_file: Remove FIRMWARE_PREALLOC_BUFFER enum
FIRMWARE_PREALLOC_BUFFER is a "how", not a "what", and confuses the LSMs
that are interested in filtering between types of things. The "how"
should be an internal detail made uninteresting to the LSMs.

Fixes: a098ecd2fa ("firmware: support loading into a pre-allocated buffer")
Fixes: fd90bc559b ("ima: based on policy verify firmware signatures (pre-allocated buffer)")
Fixes: 4f0496d8ff ("ima: based on policy warn about loading firmware (pre-allocated buffer)")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Scott Branden <scott.branden@broadcom.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201002173828.2099543-2-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-05 13:34:18 +02:00
Christoph Hellwig 598b3cec83 fs: remove compat_sys_vmsplice
Now that import_iovec handles compat iovecs, the native vmsplice syscall
can be used for the compat case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:02:15 -04:00
Christoph Hellwig 5f764d624a fs: remove the compat readv/writev syscalls
Now that import_iovec handles compat iovecs, the native readv and writev
syscalls can be used for the compat case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:02:14 -04:00
Christoph Hellwig 3523a9d454 fs: remove various compat readv/writev helpers
Now that import_iovec handles compat iovecs as well, all the duplicated
code in the compat readv/writev helpers is not needed.  Remove them
and switch the compat syscall handlers to use the native helpers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:02:14 -04:00
Christoph Hellwig 89cd35c58b iov_iter: transparently handle compat iovecs in import_iovec
Use in compat_syscall to import either native or the compat iovecs, and
remove the now superflous compat_import_iovec.

This removes the need for special compat logic in most callers, and
the remaining ones can still be simplified by using __import_iovec
with a bool compat parameter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-03 00:02:13 -04:00
Jakub Kicinski 66a9b9287d genetlink: move to smaller ops wherever possible
Bulk of the genetlink users can use smaller ops, move them.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-02 19:11:11 -07:00
Linus Torvalds 702bfc891d io_uring-5.9-2020-10-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl93Z48QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmp4EACwxi4UVnL0zhaOBmXfqxDuaXViwkfVZNxx
 d40y+DcCewnpZMk2G9cES8OKG+Tu2GFX2yl1m2XdrIWJ6jpnGFKJOkNQGfPDQrT3
 fI7qFrEDeSVeLUMMBxtvZLW8w2D0KcNCgla4h/ESXI9xtPTZdYXhYQY0zfuWalUC
 ZplUgAWlHx82qJari7ZmIfeVtpAoujTvkccRe+/RtPv5vO+UsvP7kqPSCYMGqhHS
 7z5gK3Nw+PNMWrzZVZ6Rw5nLeExx9PJGgiEkitEjn7mRJELXV9eWnTt9D0eVwaec
 WO7OSQmrJLmMFER4ZhkDNJkXZFvlYUCygnwJQmH70LflRqUEA00O6wX4J32O3NIg
 fIDWKMGGANFU5atL+RHqfQgUYq0GY1UsIvZxJnwRwv1QssmJoQq9fpT6VYqiQMik
 2JAeWyMqTGI4vRNmVJKTR/13SpRUYrvS3wHN53kCaBBhE5Y/vFksgOGgXZBG/TPk
 odpegeJOTa5xuS0YcKIK6yL/xHENct1Y1BtVjczrXKJz0E90n5ZdIR0lEg6Ij3B1
 jZUwKiS2sY09eBaJIQvtD4hIaw5VgqtwinKTyt7MBw/6pCqJpSZtaV0Uvgvjq/Se
 1ifUo4cWwQBccZLgWeWoEalio2fNIyb+J+sm7eu9Xygjl67U2M8oMfAN2JjkM7As
 btLazer4lg==
 =fo3Z
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - fix for async buffered reads if read-ahead is fully disabled (Hao)

 - double poll match fix

 - ->show_fdinfo() potential ABBA deadlock complaint fix

* tag 'io_uring-5.9-2020-10-02' of git://git.kernel.dk/linux-block:
  io_uring: fix async buffered reads when readahead is disabled
  io_uring: fix potential ABBA deadlock in ->show_fdinfo()
  io_uring: always delete double poll wait entry on match
2020-10-02 14:38:10 -07:00
Linus Torvalds d4fce2e20f Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull epoll fixes from Al Viro:
 "Several race fixes in epoll"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ep_create_wakeup_source(): dentry name can change under you...
  epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
  epoll: replace ->visited/visited_list with generation count
  epoll: do not insert into poll queues until all sanity checks are done
2020-10-02 10:37:08 -07:00
Linus Torvalds 4e3b9ce271 for-5.9-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl93REAACgkQxWXV+ddt
 WDv0/A//XYr1XLC/5sMILHqYZ4ogiFxC3Nfjeyt6vfBPX3J0d2eHnw5Rw+ZHHHdQ
 qtoKWom9ZwCxjybghwmvfxJuohy+6Sc764aEj+rYpUcCmmUZsAZZpmwpZqpYG+0H
 DEn9p45T0MO+r5lsF/GdNqqsdXZfUlZy7PweIhZucQxENM8cowklqKCo4AU2IEW4
 203THU3UxQayn0um6kaiesioh8TtT+R9UVAyyA3n6lGINHKG8AMy0ulS/M2Uzgq5
 eAzWne4Opy+wLxubBdeqruPiQrFQp+JV/YhTTEHGKRXykRYXwZnCDYdK27X4UKkt
 g3Ne0cEd/JuxZfb3Mzsd7+MF0xr9xKJPziFXv7YZt0LkiHE+B0b/DwA9FksR9sdO
 4BY2oe0gztstIMqQ5qnriJMDQxonyUt2G65YW8sCI9b32vRYaHLhCWZRYzbmftEO
 W4FJOnAI2It3Ib0CUkBjkPYkmH113Q6g59k015IpoYRGmExhnC59zhuijdmthxFJ
 S5PXFymVhxt9iMOKM0jE17Rp/j4hVg/bdFVHJryzlOsldjq63Vukqoo24SQhiqfY
 qYn/Ilkc/h1YD/pxehFAhZcbGfEdjD5oo8OkGoKIUXfv35r7JH/5F/x+4DxZNnYk
 n0oHJ7WBR01AlHAcuTvsN7z9O2ZX6wZufkkgKYLBvtGtyC71T3A=
 =MT2i
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Two more fixes.

  One is for a lockdep warning/lockup (also caught by syzbot), that one
  has been seen in practice. Regarding the other syzbot reports
  mentioned last time, they don't seem to be urgent and reliably
  reproducible so they'll be fixed later.

  The second fix is for a potential corruption when device replace
  finishes and the in-memory state of trim is not copied to the new
  device"

* tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix filesystem corruption after a device replace
  btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
  btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
2020-10-02 10:09:40 -07:00
Chuck Lever 4b74fd793a NFSD: Map nfserr_wrongsec outside of nfsd_dispatch
Refactor: Handle this NFS version-specific mapping in the only
place where nfserr_wrongsec is generated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:42 -04:00
Chuck Lever 14168d678a NFSD: Remove the RETURN_STATUS() macro
Refactor: I'm about to change the return value from .pc_func. Clear
the way by replacing the RETURN_STATUS() macro with logic that
plants the status code directly into the response structure.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:42 -04:00
Chuck Lever f0af22101d NFSD: Call NFSv2 encoders on error returns
Remove special dispatcher logic for NFSv2 error responses. These are
rare to the point of becoming extinct, but all NFS responses have to
pay the cost of the extra conditional branches.

With this change, the NFSv2 error cases now get proper
xdr_ressize_check() calls.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:42 -04:00
Chuck Lever 1841b9b614 NFSD: Fix .pc_release method for NFSv2
nfsd_release_fhandle() assumes that rqstp->rq_resp always points to
an nfsd_fhandle struct. In fact, no NFSv2 procedure uses struct
nfsd_fhandle as its response structure.

So far that has been "safe" to do because the res structs put the
resp->fh field at that same offset as struct nfsd_fhandle. I don't
think that's a guarantee, though, and there is certainly nothing
preventing a developer from altering the fields in those structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:42 -04:00
Chuck Lever 7cf8357043 NFSD: Remove vestigial typedefs
Clean up: These are not used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:42 -04:00
Chuck Lever 85085aacef NFSD: Refactor nfsd_dispatch() error paths
nfsd_dispatch() is a hot path. Ensure the compiler takes the
processing of rare error cases out of line.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever 4c96cb56ee NFSD: Clean up nfsd_dispatch() variables
For consistency and code legibility, use a similar organization of
variables as svc_generic_dispatch().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever 383c440d4f NFSD: Clean up stale comments in nfsd_dispatch()
Add a documenting comment for the function. Remove comments that
simply describe obvious aspects of the code, but leave comments
that explain the differences in processing of each NFS version.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever 84c138e78d NFSD: Clean up switch statement in nfsd_dispatch()
Reorder the arms so the compiler places checks for the most frequent
case first.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever dcc46991d3 NFSD: Encoder and decoder functions are always present
nfsd_dispatch() is a hot path. Let's optimize the XDR method calls
for the by-far common case, which is that the XDR methods are indeed
present.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever ba1df797e5 NFSACL: Replace PROC() macro with open code
Clean up: Follow-up on ten-year-old commit b9081d90f5 ("NFS: kill
off complicated macro 'PROC'") by performing the same conversion in
the NFSACL code. To reduce the chance of error, I copied the original
C preprocessor output and then made some minor edits.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever 49d9960821 lockd: Replace PROC() macro with open code
Clean up: Follow-up on ten-year-old commit b9081d90f5 ("NFS: kill
off complicated macro 'PROC'") by performing the same conversion in
the lockd code. To reduce the chance of error, I copied the original
C preprocessor output and then made some minor edits.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Chuck Lever 6b3dccd48d NFSD: Add missing NFSv2 .pc_func methods
There's no protection in nfsd_dispatch() against a NULL .pc_func
helpers. A malicious NFS client can trigger a crash by invoking the
unused/unsupported NFSv2 ROOT or WRITECACHE procedures.

The current NFSD dispatcher does not support returning a void reply
to a non-NULL procedure, so the reply to both of these is wrong, for
the moment.

Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-10-02 09:37:41 -04:00
Yang Shi 5904c16d22 fs: nfs: return per memcg count for xattr shrinkers
The list_lru_count() returns the pre node count, but the new xattr
shrinkers are memcg aware, so the shrinkers should return per memcg
count by calling list_lru_shrink_count() instead.  Otherwise over-shrink
might be experienced.  The problem was spotted by visual code
inspection.

Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-02 08:46:46 -04:00
Benjamin Coddington b4868b44c5 NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE
Since commit 0e0cb35b41 ("NFSv4: Handle NFS4ERR_OLD_STATEID in
CLOSE/OPEN_DOWNGRADE") the following livelock may occur if a CLOSE races
with the update of the nfs_state:

Process 1           Process 2           Server
=========           =========           ========
 OPEN file
                    OPEN file
                                        Reply OPEN (1)
                                        Reply OPEN (2)
 Update state (1)
 CLOSE file (1)
                                        Reply OLD_STATEID (1)
 CLOSE file (2)
                                        Reply CLOSE (-1)
                    Update state (2)
                    wait for state change
 OPEN file
                    wake
 CLOSE file
 OPEN file
                    wake
 CLOSE file
 ...
                    ...

We can avoid this situation by not issuing an immediate retry with a bumped
seqid when CLOSE/OPEN_DOWNGRADE receives NFS4ERR_OLD_STATEID.  Instead,
take the same approach used by OPEN and wait at least 5 seconds for
outstanding stateid updates to complete if we can detect that we're out of
sequence.

Note that after this change it is still possible (though unlikely) that
CLOSE waits a full 5 seconds, bumps the seqid, and retries -- and that
attempt races with another OPEN at the same time.  In order to avoid this
race (which would result in the livelock), update
nfs_need_update_open_stateid() to handle the case where:
 - the state is NFS_OPEN_STATE, and
 - the stateid doesn't match the current open stateid

Finally, nfs_need_update_open_stateid() is modified to be idempotent and
renamed to better suit the purpose of signaling that the stateid passed
is the next stateid in sequence.

Fixes: 0e0cb35b41 ("NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-02 08:43:09 -04:00
Nick Desaulniers fb08334bb3 nfs: remove incorrect fallthrough label
There is no case after the default from which to fallthrough to. Clang
will error in this case (unhelpfully without context, see link below)
and GCC will with -Wswitch-unreachable.

The previous commit should have just replaced the comment with a break
statement.

If we consider implicit fallthrough to be a design mistake of C, then
all case statements should be terminated with one of the following
statements:
* break
* continue
* return
* fallthrough
* goto
* (call of function with __attribute__(__noreturn__))

Fixes: 2a1390c95a69 ("nfs: Convert to use the preferred fallthrough macro")
Link: https://bugs.llvm.org/show_bug.cgi?id=47539
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-10-02 08:43:08 -04:00
Joe Perches 2efc459d06 sysfs: Add sysfs_emit and sysfs_emit_at to format sysfs output
Output defects can exist in sysfs content using sprintf and snprintf.

sprintf does not know the PAGE_SIZE maximum of the temporary buffer
used for outputting sysfs content and it's possible to overrun the
PAGE_SIZE buffer length.

Add a generic sysfs_emit function that knows that the size of the
temporary buffer and ensures that no overrun is done.

Add a generic sysfs_emit_at function that can be used in multiple
call situations that also ensures that no overrun is done.

Validate the output buffer argument to be page aligned.
Validate the offset len argument to be within the PAGE_SIZE buf.

Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/884235202216d464d61ee975f7465332c86f76b2.1600285923.git.joe@perches.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-02 12:02:30 +02:00
Linus Torvalds 472e5b056f pipe: remove pipe_wait() and fix wakeup race with splice
The pipe splice code still used the old model of waiting for pipe IO by
using a non-specific "pipe_wait()" that waited for any pipe event to
happen, which depended on all pipe IO being entirely serialized by the
pipe lock.  So by checking the state you were waiting for, and then
adding yourself to the wait queue before dropping the lock, you were
guaranteed to see all the wakeups.

Strictly speaking, the actual wakeups were not done under the lock, but
the pipe_wait() model still worked, because since the waiter held the
lock when checking whether it should sleep, it would always see the
current state, and the wakeup was always done after updating the state.

However, commit 0ddad21d3e ("pipe: use exclusive waits when reading or
writing") split the single wait-queue into two, and in the process also
made the "wait for event" code wait for _two_ wait queues, and that then
showed a race with the wakers that were not serialized by the pipe lock.

It's only splice that used that "pipe_wait()" model, so the problem
wasn't obvious, but Josef Bacik reports:

 "I hit a hang with fstest btrfs/187, which does a btrfs send into
  /dev/null. This works by creating a pipe, the write side is given to
  the kernel to write into, and the read side is handed to a thread that
  splices into a file, in this case /dev/null.

  The box that was hung had the write side stuck here [pipe_write] and
  the read side stuck here [splice_from_pipe_next -> pipe_wait].

  [ more details about pipe_wait() scenario ]

  The problem is we're doing the prepare_to_wait, which sets our state
  each time, however we can be woken up either with reads or writes. In
  the case above we race with the WRITER waking us up, and re-set our
  state to INTERRUPTIBLE, and thus never break out of schedule"

Josef had a patch that avoided the issue in pipe_wait() by just making
it set the state only once, but the deeper problem is that pipe_wait()
depends on a level of synchonization by the pipe mutex that it really
shouldn't.  And the whole "wait for any pipe state change" model really
isn't very good to begin with.

So rather than trying to work around things in pipe_wait(), remove that
legacy model of "wait for arbitrary pipe event" entirely, and actually
create functions that wait for the pipe actually being readable or
writable, and can do so without depending on the pipe lock serializing
everything.

Fixes: 0ddad21d3e ("pipe: use exclusive waits when reading or writing")
Link: https://lore.kernel.org/linux-fsdevel/bfa88b5ad6f069b2b679316b9e495a970130416c.1601567868.git.josef@toxicpanda.com/
Reported-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-and-tested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-01 19:14:36 -07:00
Alexander Aring 4f2b30fd9b fs: dlm: fix race in nodeid2con
This patch fixes a race in nodeid2con in cases that we parallel running
a lookup and both will create a connection structure for the same nodeid.
It's a rare case to create a new connection structure to keep reader
lockless we just do a lookup inside the protection area again and drop
previous work if this race happens.

Fixes: a47666eb76 ("fs: dlm: make connection hash lockless")
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2020-10-01 09:25:07 -05:00
Qian Cai 8a018eb55e pipe: Fix memory leaks in create_pipe_files()
Calling pipe2() with O_NOTIFICATION_PIPE could results in memory
leaks unless watch_queue_init() is successful.

        In case of watch_queue_init() failure in pipe2() we are left
with inode and pipe_inode_info instances that need to be freed.  That
failure exit has been introduced in commit c73be61ced ("pipe: Add
general notification queue support") and its handling should've been
identical to nearby treatment of alloc_file_pseudo() failures - it
is dealing with the same situation.  As it is, the mainline kernel
leaks in that case.

        Another problem is that CONFIG_WATCH_QUEUE and !CONFIG_WATCH_QUEUE
cases are treated differently (and the former leaks just pipe_inode_info,
the latter - both pipe_inode_info and inode).

        Fixed by providing a dummy wacth_queue_init() in !CONFIG_WATCH_QUEUE
case and by having failures of wacth_queue_init() handled the same way
we handle alloc_file_pseudo() ones.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: Qian Cai <cai@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-01 09:40:35 -04:00
Jan Kara c2bb80b8bd reiserfs: Fix oops during mount
With suitably crafted reiserfs image and mount command reiserfs will
crash when trying to verify that XATTR_ROOT directory can be looked up
in / as that recurses back to xattr code like:

 xattr_lookup+0x24/0x280 fs/reiserfs/xattr.c:395
 reiserfs_xattr_get+0x89/0x540 fs/reiserfs/xattr.c:677
 reiserfs_get_acl+0x63/0x690 fs/reiserfs/xattr_acl.c:209
 get_acl+0x152/0x2e0 fs/posix_acl.c:141
 check_acl fs/namei.c:277 [inline]
 acl_permission_check fs/namei.c:309 [inline]
 generic_permission+0x2ba/0x550 fs/namei.c:353
 do_inode_permission fs/namei.c:398 [inline]
 inode_permission+0x234/0x4a0 fs/namei.c:463
 lookup_one_len+0xa6/0x200 fs/namei.c:2557
 reiserfs_lookup_privroot+0x85/0x1e0 fs/reiserfs/xattr.c:972
 reiserfs_fill_super+0x2b51/0x3240 fs/reiserfs/super.c:2176
 mount_bdev+0x24f/0x360 fs/super.c:1417

Fix the problem by bailing from reiserfs_xattr_get() when xattrs are not
yet initialized.

CC: stable@vger.kernel.org
Reported-by: syzbot+9b33c9b118d77ff59b6f@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
2020-10-01 11:15:31 +02:00
Jens Axboe 87c4311fd2 io_uring: kill callback_head argument for io_req_task_work_add()
We always use &req->task_work anyway, no point in passing it in.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 21:00:16 -06:00
Pavel Begunkov c1379e247a io_uring: move req preps out of io_issue_sqe()
All request preparations are done only during submission, reflect it in
the code by moving io_req_prep() much earlier into io_queue_sqe().

That's much cleaner, because it doen't expose bits to async code which
it won't ever use. Also it makes the interface harder to misuse, and
there are potential places for bugs.

For instance, __io_queue() doesn't clear @sqe before proceeding to a
next linked request, that could have been disastrous, but hopefully
there are linked requests IFF sqe==NULL, so not actually a bug.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov bfe7655983 io_uring: decouple issuing and req preparation
io_issue_sqe() does two things at once, trying to prepare request and
issuing them. Split it in two and deduplicate with io_defer_prep().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov 73debe68b3 io_uring: remove nonblock arg from io_{rw}_prep()
All io_*_prep() functions including io_{read,write}_prep() are called
only during submission where @force_nonblock is always true. Don't keep
propagating it and instead remove the @force_nonblock argument
from prep() altogether.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov a88fc40021 io_uring: set/clear IOCB_NOWAIT into io_read/write
Move setting IOCB_NOWAIT from io_prep_rw() into io_read()/io_write(), so
it's set/cleared in a single place. Also remove @force_nonblock
parameter from io_prep_rw().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov 2d199895d2 io_uring: remove F_NEED_CLEANUP check in *prep()
REQ_F_NEED_CLEANUP is set only by io_*_prep() and they're guaranteed to
be called only once, so there is no one who may have set the flag
before. Kill REQ_F_NEED_CLEANUP check in these *prep() handlers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:46 -06:00
Pavel Begunkov 5b09e37e27 io_uring: io_kiocb_ppos() style change
Put brackets around bitwise ops in a complex expression

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:45 -06:00
Pavel Begunkov 291b2821e0 io_uring: simplify io_alloc_req()
Extract common code from if/else branches. That is cleaner and optimised
even better.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:38:45 -06:00
Jens Axboe 145cc8c665 io-wq: kill unused IO_WORKER_F_EXITING
This flag is no longer used, remove it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Hillf Danton c4068bf898 io-wq: fix use-after-free in io_wq_worker_running
The smart syzbot has found a reproducer for the following issue:

 ==================================================================
 BUG: KASAN: use-after-free in instrument_atomic_write include/linux/instrumented.h:71 [inline]
 BUG: KASAN: use-after-free in atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
 BUG: KASAN: use-after-free in io_wqe_inc_running fs/io-wq.c:301 [inline]
 BUG: KASAN: use-after-free in io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
 Write of size 4 at addr ffff8882183db08c by task io_wqe_worker-0/7771

 CPU: 0 PID: 7771 Comm: io_wqe_worker-0 Not tainted 5.9.0-rc4-syzkaller #0
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
 Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x198/0x1fd lib/dump_stack.c:118
  print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
  __kasan_report mm/kasan/report.c:513 [inline]
  kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
  check_memory_region_inline mm/kasan/generic.c:186 [inline]
  check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
  instrument_atomic_write include/linux/instrumented.h:71 [inline]
  atomic_inc include/asm-generic/atomic-instrumented.h:240 [inline]
  io_wqe_inc_running fs/io-wq.c:301 [inline]
  io_wq_worker_running+0xde/0x110 fs/io-wq.c:613
  schedule_timeout+0x148/0x250 kernel/time/timer.c:1879
  io_wqe_worker+0x517/0x10e0 fs/io-wq.c:580
  kthread+0x3b5/0x4a0 kernel/kthread.c:292
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

 Allocated by task 7768:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
  kasan_set_track mm/kasan/common.c:56 [inline]
  __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
  kmem_cache_alloc_node_trace+0x17b/0x3f0 mm/slab.c:3594
  kmalloc_node include/linux/slab.h:572 [inline]
  kzalloc_node include/linux/slab.h:677 [inline]
  io_wq_create+0x57b/0xa10 fs/io-wq.c:1064
  io_init_wq_offload fs/io_uring.c:7432 [inline]
  io_sq_offload_start fs/io_uring.c:7504 [inline]
  io_uring_create fs/io_uring.c:8625 [inline]
  io_uring_setup+0x1836/0x28e0 fs/io_uring.c:8694
  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

 Freed by task 21:
  kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
  kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
  kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
  __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
  __cache_free mm/slab.c:3418 [inline]
  kfree+0x10e/0x2b0 mm/slab.c:3756
  __io_wq_destroy fs/io-wq.c:1138 [inline]
  io_wq_destroy+0x2af/0x460 fs/io-wq.c:1146
  io_finish_async fs/io_uring.c:6836 [inline]
  io_ring_ctx_free fs/io_uring.c:7870 [inline]
  io_ring_exit_work+0x1e4/0x6d0 fs/io_uring.c:7954
  process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
  worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
  kthread+0x3b5/0x4a0 kernel/kthread.c:292
  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

 The buggy address belongs to the object at ffff8882183db000
  which belongs to the cache kmalloc-1k of size 1024
 The buggy address is located 140 bytes inside of
  1024-byte region [ffff8882183db000, ffff8882183db400)
 The buggy address belongs to the page:
 page:000000009bada22b refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2183db
 flags: 0x57ffe0000000200(slab)
 raw: 057ffe0000000200 ffffea0008604c48 ffffea00086a8648 ffff8880aa040700
 raw: 0000000000000000 ffff8882183db000 0000000100000002 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff8882183daf80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  ffff8882183db000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 >ffff8882183db080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
  ffff8882183db100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8882183db180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==================================================================

which is down to the comment below,

	/* all workers gone, wq exit can proceed */
	if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
		complete(&wqe->wq->done);

because there might be multiple cases of wqe in a wq and we would wait
for every worker in every wqe to go home before releasing wq's resources
on destroying.

To that end, rework wq's refcount by making it independent of the tracking
of workers because after all they are two different things, and keeping
it balanced when workers come and go. Note the manager kthread, like
other workers, now holds a grab to wq during its lifetime.

Finally to help destroy wq, check IO_WQ_BIT_EXIT upon creating worker
and do nothing for exiting wq.

Cc: stable@vger.kernel.org # v5.5+
Reported-by: syzbot+45fa0a195b941764e0f0@syzkaller.appspotmail.com
Reported-by: syzbot+9af99580130003da82b1@syzkaller.appspotmail.com
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Joseph Qi dbbe9c6424 io_uring: show sqthread pid and cpu in fdinfo
In most cases we'll specify IORING_SETUP_SQPOLL and run multiple
io_uring instances in a host. Since all sqthreads are named
"io_uring-sq", it's hard to distinguish the relations between
application process and its io_uring sqthread.
With this patch, application can get its corresponding sqthread pid
and cpu through show_fdinfo.
Steps:
1. Get io_uring fd first.
$ ls -l /proc/<pid>/fd | grep -w io_uring
2. Then get io_uring instance related info, including corresponding
sqthread pid and cpu.
$ cat /proc/<pid>/fdinfo/<io_uring_fd>

pos:	0
flags:	02000002
mnt_id:	13
SqThread:	6929
SqThreadCpu:	2
UserFiles:	1
    0: testfile
UserBufs:	0
PollList:

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
[axboe: fixed for new shared SQPOLL]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe af9c1a44f8 io_uring: process task work in io_uring_register()
We do this for CQ ring wait, in case task_work completions come in. We
should do the same in io_uring_register(), to avoid spurious -EINTR
if the ring quiescing ends up having to process task_work to complete
the operation

Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Dennis Zhou 91d8f5191e io_uring: add blkcg accounting to offloaded operations
There are a few operations that are offloaded to the worker threads. In
this case, we lose process context and end up in kthread context. This
results in ios to be not accounted to the issuing cgroup and
consequently end up as issued by root. Just like others, adopt the
personality of the blkcg too when issuing via the workqueues.

For the SQPOLL thread, it will live and attach in the inited cgroup's
context.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe de2939388b io_uring: improve registered buffer accounting for huge pages
io_uring does account any registered buffer as pinned/locked memory, and
checks limit and fails if the given user doesn't have a big enough limit
to register the ranges specified. However, if huge pages are used, we
are potentially under-accounting the memory in terms of what gets pinned
on the vm side.

This patch rectifies that, by ensuring that we account the full size of
a compound page, regardless of how much of it is being registered. Huge
pages are not accounted mulitple times - if multiple sections of a huge
page is registered, then the page is only accounted once.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Zheng Bin 14db84110d io_uring: remove unneeded semicolon
Fixes coccicheck warning:

fs/io_uring.c:4242:13-14: Unneeded semicolon

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe e95eee2dee io_uring: cap SQ submit size for SQPOLL with multiple rings
In the spirit of fairness, cap the max number of SQ entries we'll submit
for SQPOLL if we have multiple rings. If we don't do that, we could be
submitting tons of entries for one ring, while others are waiting to get
service.

The value of 8 is somewhat arbitrarily chosen as something that allows
a fair bit of batching, without using an excessive time per ring.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Jens Axboe e8c2bc1fb6 io_uring: get rid of req->io/io_async_ctx union
There's really no point in having this union, it just means that we're
always allocating enough room to cater to any command. But that's
pointless, as the ->io field is request type private anyway.

This gets rid of the io_async_ctx structure, and fills in the required
size in the io_op_defs[] instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Pavel Begunkov 4be1c61512 io_uring: kill extra user_bufs check
Testing ctx->user_bufs for NULL in io_import_fixed() is not neccessary,
because in that case ctx->nr_user_bufs would be zero, and the following
check would fail.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:34 -06:00
Pavel Begunkov ab0b196ce5 io_uring: fix overlapped memcpy in io_req_map_rw()
When io_req_map_rw() is called from io_rw_prep_async(), it memcpy()
iorw->iter into itself. Even though it doesn't lead to an error, such a
memcpy()'s aliasing rules violation is considered to be a bad practise.

Inline io_req_map_rw() into io_rw_prep_async(). We don't really need any
remapping there, so it's much simpler than the generic implementation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-30 20:32:33 -06:00