Commit graph

10233 commits

Author SHA1 Message Date
Marcos Paulo de Souza 0ff40a910f btrfs: introduce btrfs_search_backwards function
It's a common practice to start a search using offset (u64)-1, which is
the u64 maximum value, meaning that we want the search_slot function to
be set in the last item with the same objectid and type.

Once we are in this position, it's a matter to start a search backwards
by calling btrfs_previous_item, which will check if we'll need to go to
a previous leaf and other necessary checks, only to be sure that we are
in last offset of the same object and type.

The new btrfs_search_backwards function does the all these steps when
necessary, and can be used to avoid code duplication.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
David Sterba ea3dc7d2d1 btrfs: print if fsverity support is built in when loading module
As fsverity support depends on a config option, print that at module
load time like we do for similar features.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov 705242538f btrfs: verity metadata orphan items
Writing out the verity data is too large of an operation to do in a
single transaction. If we are interrupted before we finish creating
fsverity metadata for a file, or fail to clean up already created
metadata after a failure, we could leak the verity items that we already
committed.

To address this issue, we use the orphan mechanism. When we start
enabling verity on a file, we also add an orphan item for that inode.
When we are finished, we delete the orphan. However, if we are
interrupted midway, the orphan will be present at mount and we can
cleanup the half-formed verity state.

There is a possible race with a normal unlink operation: if unlink and
verity run on the same file in parallel, it is possible for verity to
succeed and delete the still legitimate orphan added by unlink. Then, if
we are interrupted and mount in that state, we will never clean up the
inode properly. This is also possible for a file created with O_TMPFILE.
Check nlink==0 before deleting to avoid this race.

A final thing to note is that this is a resurrection of using orphans to
signal an operation besides "delete this inode". The old case was to
signal the need to do a truncate. That case still technically applies
for mounting very old file systems, so we need to take some care to not
clobber it. To that end, we just have to be careful that verity orphan
cleanup is a no-op for non-verity files.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov 146054090b btrfs: initial fsverity support
Add support for fsverity in btrfs. To support the generic interface in
fs/verity, we add two new item types in the fs tree for inodes with
verity enabled. One stores the per-file verity descriptor and btrfs
verity item and the other stores the Merkle tree data itself.

Verity checking is done in end_page_read just before a page is marked
uptodate. This naturally handles a variety of edge cases like holes,
preallocated extents, and inline extents. Some care needs to be taken to
not try to verity pages past the end of the file, which are accessed by
the generic buffered file reading code under some circumstances like
reading to the end of the last page and trying to read again. Direct IO
on a verity file falls back to buffered reads.

Verity relies on PageChecked for the Merkle tree data itself to avoid
re-walking up shared paths in the tree. For this reason, we need to
cache the Merkle tree data. Since the file is immutable after verity is
turned on, we can cache it at an index past EOF.

Use the new inode ro_flags to store verity on the inode item, so that we
can enable verity on a file, then rollback to an older kernel and still
mount the file system and read the file. Since we can't safely write the
file anymore without ruining the invariants of the Merkle tree, we mark
a ro_compat flag on the file system when a file has verity enabled.

Acked-by: Eric Biggers <ebiggers@google.com>
Co-developed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Boris Burkov 77eea05e78 btrfs: add ro compat flags to inodes
Currently, inode flags are fully backwards incompatible in btrfs. If we
introduce a new inode flag, then tree-checker will detect it and fail.
This can even cause us to fail to mount entirely. To make it possible to
introduce new flags which can be read-only compatible, like VERITY, we
add new ro flags to btrfs without treating them quite so harshly in
tree-checker. A read-only file system can survive an unexpected flag,
and can be mounted.

As for the implementation, it unfortunately gets a little complicated.

The on-disk representation of the inode, btrfs_inode_item, has an __le64
for flags but the in-memory representation, btrfs_inode, uses a u32.
David Sterba had the nice idea that we could reclaim those wasted 32 bits
on disk and use them for the new ro_compat flags.

It turns out that the tree-checker code which checks for unknown flags
is broken, and ignores the upper 32 bits we are hoping to use. The issue
is that the flags use the literal 1 rather than 1ULL, so the flags are
signed ints, and one of them is specifically (1 << 31). As a result, the
mask which ORs the flags is a negative integer on machines where int is
32 bit twos complement. When tree-checker evaluates the expression:

  btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)

The mask is something like 0x80000abc, which gets promoted to u64 with
sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
all the upper bits zeroed, and we can't detect unexpected flags.

This suggests that we can't use those bits after all. Luckily, we have
good reason to believe that they are zero anyway. Inode flags are
metadata, which is always checksummed, so any bit flips that would
introduce 1s would cause a checksum failure anyway (excluding the
improbable case of the checksum getting corrupted exactly badly).

Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
inode flag should preserve its value and not add leading zeroes
(at least for twos complement). The only place that flag
(BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
the root item, and indeed for that inode we see 0xffffffff80000000 as
the flags on disk. However, that inode is never seen by tree checker,
nor is it used in a context where verity might be meaningful.
Theoretically, a future ro flag might cause trouble on that inode, so we
should proactively clean up that mess before it does.

With the introduction of the new ro flags, keep two separate unsigned
masks and check them against the appropriate u32. Since we no longer run
afoul of sign extension, this also stops writing out 0xffffffff80000000
in root_item inodes going forward.

Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Anand Jain efc222f8d7 btrfs: simplify return values in btrfs_check_raid_min_devices
Function btrfs_check_raid_min_devices() returns error code from the enum
btrfs_err_code and it starts from 1. So there is no need to check if ret
is > 0. So drop this check and also drop the local variable ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:09 +02:00
Qu Wenruo 7361b4ae03 btrfs: remove the dead comment in writepage_delalloc()
When btrfs_run_delalloc_range() failed, we will error out.

But there is a strange comment mentioning that
btrfs_run_delalloc_range() could have returned value >0 to indicate the
IO has already started.

Commit 40f765805f ("Btrfs: split up __extent_writepage to lower stack
usage") introduced the comment, but unfortunately at that time, we were
already using @page_started to indicate that case, and still return 0.

Furthermore, even if that comment was right (which is not), we would
return -EIO if the IO had already started.

By all means the comment is incorrect, just remove the comment along
with the dead check.

Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
make sure we either return 0 or error, no positive return value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
David Sterba b2f78e8805 btrfs: allow degenerate raid0/raid10
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices. This
works as expected eg. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. Unpatched would silently create a single profile, while newly
it would be raid0.

The motivation is to allow to preserve the profile type as long as it
possible for some intermediate state (device removal, conversion), or
when there are disks of different size, with raid0 the otherwise
unusable space of the last device will be used too. Similarly for
raid10, though the two largest devices would need to be the same.

Unpatched kernel will mount and use the degenerate profiles just fine
but won't allow any operation that would not satisfy the stricter device
number constraints, eg. not allowing to go from 3 to 2 devices for
raid10 or various profile conversions.

Example output:

  # btrfs fi us -T .
  Overall:
      Device size:                  10.00GiB
      Device allocated:              1.01GiB
      Device unallocated:            8.99GiB
      Device missing:                  0.00B
      Used:                        200.61MiB
      Free (estimated):              9.79GiB      (min: 9.79GiB)
      Free (statfs, df):             9.79GiB
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                3.25MiB      (used: 0.00B)
      Multiple profiles:                  no

		Data      Metadata  System
  Id Path       RAID0     single    single   Unallocated
  -- ---------- --------- --------- -------- -----------
   1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
  -- ---------- --------- --------- -------- -----------
     Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
     Used       200.25MiB 352.00KiB 16.00KiB

  # btrfs dev us .
  /dev/sda10, ID: 1
     Device size:            10.00GiB
     Device slack:              0.00B
     Data,RAID0/1:            1.00GiB
     Metadata,single:         8.00MiB
     System,single:           1.00MiB
     Unallocated:             8.99GiB

Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
profile is printed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Filipe Manana bd54f381a1 btrfs: do not pin logs too early during renames
During renames we pin the logs of the roots a bit too early, before the
calls to btrfs_insert_inode_ref(). We can pin the logs after those calls,
since those will not change anything in a log tree.

In a scenario where we have multiple and diverse filesystem operations
running in parallel, those calls can take a significant amount of time,
due to lock contention on extent buffers, and delay log commits from other
tasks for longer than necessary.

So just pin logs after calls to btrfs_insert_inode_ref() and right before
the first operation that can update a log tree.

The following script that uses dbench was used for testing:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 120 16

  umount $MNT

The tests were run on a machine with 12 cores, 64G of RAN, a NVMe device
and using a non-debug kernel config (Debian's default config).

The results compare a branch without this patch and without the previous
patch in the series, that has the subject:

 "btrfs: eliminate some false positives when checking if inode was logged"

Versus the same branch with these two patches applied.

dbench with 8 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4391359     0.009   249.745
 Close        3225882     0.001     3.243
 Rename        185953     0.065   240.643
 Unlink        886669     0.049   249.906
 Deltree          112     2.455   217.433
 Mkdir             56     0.002     0.004
 Qpathinfo    3980281     0.004     3.109
 Qfileinfo     697579     0.001     0.187
 Qfsinfo       729780     0.002     2.424
 Sfileinfo     357764     0.004     1.415
 Find         1538861     0.016     4.863
 WriteX       2189666     0.010     3.327
 ReadX        6883443     0.002     0.729
 LockX          14298     0.002     0.073
 UnlockX        14298     0.001     0.042
 Flush         307777     2.447   303.663

Throughput 1149.6 MB/sec  8 clients  8 procs  max_latency=303.666 ms

dbench with 8 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    4269920     0.009   213.532
 Close        3136653     0.001     0.690
 Rename        180805     0.082   213.858
 Unlink        862189     0.050   172.893
 Deltree          112     2.998   218.328
 Mkdir             56     0.002     0.003
 Qpathinfo    3870158     0.004     5.072
 Qfileinfo     678375     0.001     0.194
 Qfsinfo       709604     0.002     0.485
 Sfileinfo     347850     0.004     1.304
 Find         1496310     0.017     5.504
 WriteX       2129613     0.010     2.882
 ReadX        6693066     0.002     1.517
 LockX          13902     0.002     0.075
 UnlockX        13902     0.001     0.055
 Flush         299276     2.511   220.189

Throughput 1187.33 MB/sec  8 clients  8 procs  max_latency=220.194 ms

+3.2% throughput, -31.8% max latency

dbench with 16 clients, results before:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5978334     0.028   156.507
 Close        4391598     0.001     1.345
 Rename        253136     0.241   155.057
 Unlink       1207220     0.182   257.344
 Deltree          160     6.123    36.277
 Mkdir             80     0.003     0.005
 Qpathinfo    5418817     0.012     6.867
 Qfileinfo     949929     0.001     0.941
 Qfsinfo       993560     0.002     1.386
 Sfileinfo     486904     0.004     2.829
 Find         2095088     0.059     8.164
 WriteX       2982319     0.017     9.029
 ReadX        9371484     0.002     4.052
 LockX          19470     0.002     0.461
 UnlockX        19470     0.001     0.990
 Flush         418936     2.740   347.902

Throughput 1495.31 MB/sec  16 clients  16 procs  max_latency=347.909 ms

dbench with 16 clients, results after:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    5711833     0.029   131.240
 Close        4195897     0.001     1.732
 Rename        241849     0.204   147.831
 Unlink       1153341     0.184   231.322
 Deltree          160     6.086    30.198
 Mkdir             80     0.003     0.021
 Qpathinfo    5177011     0.012     7.150
 Qfileinfo     907768     0.001     0.793
 Qfsinfo       949205     0.002     1.431
 Sfileinfo     465317     0.004     2.454
 Find         2001541     0.058     7.819
 WriteX       2850661     0.017     9.110
 ReadX        8952289     0.002     3.991
 LockX          18596     0.002     0.655
 UnlockX        18596     0.001     0.179
 Flush         400342     2.879   293.607

Throughput 1565.73 MB/sec  16 clients  16 procs  max_latency=293.611 ms

+4.6% throughput, -16.9% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Filipe Manana 6e8e777deb btrfs: eliminate some false positives when checking if inode was logged
When checking if an inode was previously logged in the current transaction
through the helper inode_logged(), we can return some false positives that
can be easily eliminated. These correspond to the cases where an inode has
a ->logged_trans value that is not zero and its value is smaller then the
ID of the current transaction. This means we know exactly that the inode
was never logged before in the current transaction, so we can return false
and avoid the callers to do extra work:

1) Having btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log()
   unnecessarily join a log transaction and do deletion searches in a log
   tree that will not find anything. This just adds unnecessary contention
   on extent buffer locks;

2) Having btrfs_log_new_name() unnecessarily log an inode when it is not
   needed. If the inode was not logged before, we don't need to log it in
   LOG_INODE_EXISTS mode.

So just make sure that any false positive only happens when ->logged_trans
has a value of 0.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Naohiro Aota 42b5d73b5d btrfs: drop unnecessary ASSERT from btrfs_submit_direct()
When on SINGLE block group, btrfs_get_io_geometry() will return "the
size of the block group - the offset of the logical address within the
block group" as geom.len. Since we allow up to 8 GiB zone size on zoned
filesystem, we can have up to 8 GiB block group, so can have up to 8 GiB
geom.len as well. With this setup, we easily hit the "ASSERT(geom.len <=
INT_MAX);".

The ASSERT looks like to guard btrfs_bio_clone_partial() and bio_trim()
which both take "int" (now u64 due to the previous patch). So to be
precise the ASSERT should check if clone_len <= UINT_MAX. But actually,
clone_len is already capped by bio.bi_iter.bi_size which is unsigned
int. So the ASSERT is not necessary.

Drop the ASSERT and properly compare submit_len and geom.len in u64.
Then, let the implicit casting to convert it to u64.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Chaitanya Kulkarni 21dda654d4 btrfs: fix argument type of btrfs_bio_clone_partial()
The offset and can never be negative use unsigned int instead of int
type for them.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:08 +02:00
Josef Bacik b377630527 btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking
sync_inode() has some holes that can cause problems if we're under heavy
ENOSPC pressure.  If there's writeback running on a separate thread
sync_inode() will skip writing the inode altogether.  What we really
want is to make sure writeback has been started on all the pages to make
sure we can see the ordered extents and wait on them if appropriate.
Switch to this new helper which will allow us to accomplish this and
avoid ENOSPC'ing early.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik e16460707e btrfs: wait on async extents when flushing delalloc
I've been debugging an early ENOSPC problem in production and finally
root caused it to this problem.  When we switched to the per-inode in
38d715f494 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") I pulled out the async extent handling, because we
were doing the correct thing by calling filemap_flush() if we had async
extents set.  This would properly wait on any async extents by locking
the page in the second flush, thus making sure our ordered extents were
properly set up.

However when I switched us back to page based flushing, I used
sync_inode(), which allows us to pass in our own wbc.  The problem here
is that sync_inode() is smarter than the filemap_* helpers, it tries to
avoid calling writepages at all.  This means that our second call could
skip calling do_writepages altogether, and thus not wait on the pagelock
for the async helpers.  This means we could come back before any ordered
extents were created and then simply continue on in our flushing
mechanisms and ENOSPC out when we have plenty of space to use.

Fix this by putting back the async pages logic in shrink_delalloc.  This
allows us to bulk write out everything that we need to, and then we can
wait in one place for the async helpers to catch up, and then wait on
any ordered extents that are created.

Fixes: e076ab2a2c ("btrfs: shrink delalloc pages instead of full inodes")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik 03fe78cc29 btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc
We have been hitting some early ENOSPC issues in production with more
recent kernels, and I tracked it down to us simply not flushing delalloc
as aggressively as we should be.  With tracing I was seeing us failing
all tickets with all of the block rsvs at or around 0, with very little
pinned space, but still around 120MiB of outstanding bytes_may_used.
Upon further investigation I saw that we were flushing around 14 pages
per shrink call for delalloc, despite having around 2GiB of delalloc
outstanding.

Consider the example of a 8 way machine, all CPUs trying to create a
file in parallel, which at the time of this commit requires 5 items to
do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
size waiting on reservations.  Now assume we have 128MiB of delalloc
outstanding.  With our current math we would set items to 20, and then
set to_reclaim to 20 * 256k, or 5MiB.

Assuming that we went through this loop all 3 times, for both
FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
leave a fair bit of delalloc reservations still hanging around by the
time we go to ENOSPC out all the remaining tickets.

Fix this two ways.  First, change the calculations to be a fraction of
the total delalloc bytes on the system.  Prior to this change we were
calculating based on dirty inodes so our math made more sense, now it's
just completely unrelated to what we're actually doing.

Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
gone through the flush states at least once.  This will empty the system
of all delalloc so we're sure to be truly out of space when we start
failing tickets.

I'm tagging stable 5.10 and forward, because this is where we started
using the page stuff heavily again.  This affects earlier kernel
versions as well, but would be a pain to backport to them as the
flushing mechanisms aren't the same.

CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:07 +02:00
Josef Bacik fcdef39c03 btrfs: enable a tracepoint when we fail tickets
When debugging early enospc problems it was useful to have a tracepoint
where we failed all tickets so I could check the state of the enospc
counters at failure time to validate my fixes.  This adds the tracpoint
so you can easily get that information.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Josef Bacik ac98141d14 btrfs: wake up async_delalloc_pages waiters after submit
We use the async_delalloc_pages mechanism to make sure that we've
completed our async work before trying to continue our delalloc
flushing.  The reason for this is we need to see any ordered extents
that were created by our delalloc flushing.  However we're waking up
before we do the submit work, which is before we create the ordered
extents.  This is a pretty wide race window where we could potentially
think there are no ordered extents and thus exit shrink_delalloc
prematurely.  Fix this by waking us up after we've done the work to
create ordered extents.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo 963e4db83e btrfs: unify regular and subpage error paths in __extent_writepage()
[BUG]
When running btrfs/160 in a loop for subpage with experimental
compression support, it has a high chance to crash (~20%):

 BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ordered-data.c:238!
 Internal error: Oops - BUG: 0 [#1] SMP
 pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
 lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
 Call trace:
  __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
  btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
  run_delalloc_nocow+0x81c/0x8fc [btrfs]
  btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
  writepage_delalloc+0xc0/0x1ac [btrfs]
  __extent_writepage+0xf4/0x370 [btrfs]
  extent_write_cache_pages+0x288/0x4f4 [btrfs]
  extent_writepages+0x58/0xe0 [btrfs]
  btrfs_writepages+0x1c/0x30 [btrfs]
  do_writepages+0x60/0x110
  __filemap_fdatawrite_range+0x108/0x170
  filemap_fdatawrite_range+0x20/0x30
  btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
  __btrfs_write_out_cache+0x34c/0x480 [btrfs]
  btrfs_write_out_cache+0x144/0x220 [btrfs]
  btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
  btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
  btrfs_sync_fs+0x64/0x1cc [btrfs]
  sync_fs_one_sb+0x3c/0x50
  iterate_supers+0xcc/0x1d4
  ksys_sync+0x6c/0xd0
  __arm64_sys_sync+0x1c/0x30
  invoke_syscall+0x50/0x120
  el0_svc_common.constprop.0+0x4c/0xd4
  do_el0_svc+0x30/0x9c
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1b0
  el0_sync+0x198/0x1c0
 ---[ end trace 336f67369ae6e0af ]---

[CAUSE]
For subpage case, we can have multiple sectors inside a page, this makes
it possible for __extent_writepage() to have part of its page submitted
before returning.

In btrfs/160, we are using dm-dust to emulate write error, this means
for certain pages, we could have everything running fine, but at the end
of __extent_writepage(), one of the submitted bios fails due to dm-dust.

Then the page is marked Error, and we change @ret from 0 to -EIO.

This makes the caller extent_write_cache_pages() to error out, without
submitting the remaining pages.

Furthermore, since we're erroring out for free space cache, it doesn't
really care about the error and will update the inode and retry the
writeback.

Then we re-run the delalloc range, and will try to insert the same
delalloc range while previous delalloc range is still hanging there,
triggering the above error.

[FIX]
The proper fix is to handle errors from __extent_writepage() properly,
by ending the remaining ordered extent.

But that fix needs the following changes:

- Know at exactly which sector the error happened
  Currently __extent_writepage_io() works for the full page, can't
  return at which sector we hit the error.

- Grab the ordered extent covering the failed sector

As a hotfix for subpage case, here we unify the error paths in
__extent_writepage().

In fact, the "if (PageError(page))" branch never get executed if @ret is
still 0 for non-subpage cases.

As for non-subpage case, we never submit current page in
__extent_writepage(), but only add current page into bio.
The bio can only get submitted in next page.

Thus we never get PageError() set due to IO failure, thus when we hit
the branch, @ret is never 0.

By simply removing that @ret assignment, we let subpage case ignore the
IO failure, thus only error out for fatal errors just like regular
sectorsize.

So that IO error won't be treated as fatal error not trigger the hanging
OE problem.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo 95ea0486b2 btrfs: allow read-write for 4K sectorsize on 64K page size systems
Since now we support data and metadata read-write for subpage, remove
the RO requirement for subpage mount.

There are some extra limitations though:

- For now, subpage RW mount is still considered experimental
  Thus that mount warning will still be there.

- No compression support
  There are still quite some PAGE_SIZE hard coded and quite some call
  sites use extent_clear_unlock_delalloc() to unlock locked_page.
  This will screw up subpage helpers.

  Now for subpage RW mount, no matter what mount option or inode attr is
  set, all writes will not be compressed.  Although reading compressed
  data has no problem.

- No defrag for subpage case
  The defrag support for subpage case will come in later patches, which
  will also rework the defrag workflow.

- No inline extent will be created
  This is mostly due to the fact that filemap_fdatawrite_range() will
  trigger more write than the range specified.
  In fallocate calls, this behavior can make us to writeback which can
  be inlined, before we enlarge the i_size.

  This is a very special corner case, and even current btrfs check won't
  report error on such inline extent + regular extent.
  But considering how much effort has been put to prevent such inline +
  regular, I'd prefer to cut off inline extent completely until we have
  a good solution.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo 9d9ea1e68a btrfs: subpage: fix relocation potentially overwriting last page data
[BUG]
When using the following script, btrfs will report data corruption after
one data balance with subpage support:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  $fsstress -w -n 8 -s 1620948986 -d $mnt/ -v > /tmp/fsstress
  sync
  btrfs balance start -d $mnt
  btrfs scrub start -B $mnt

Similar problem can be easily observed in btrfs/028 test case, there
will be tons of balance failure with -EIO.

[CAUSE]
Above fsstress will result the following data extents layout in extent
tree:
  item 10 key (13631488 EXTENT_ITEM 98304) itemoff 15889 itemsize 82
    refs 2 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 1339392 count 1
    extent data backref root FS_TREE objectid 259 offset 647168 count 1
  item 11 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 15865 itemsize 24
    block group used 102400 chunk_objectid 256 flags DATA
  item 12 key (13733888 EXTENT_ITEM 4096) itemoff 15812 itemsize 53
    refs 1 gen 7 flags DATA
    extent data backref root FS_TREE objectid 259 offset 729088 count 1

Then when creating the data reloc inode, the data reloc inode will look
like this:

	0	32K	64K	96K 100K	104K
	|<------ Extent A ----->|   |<- Ext B ->|

Then when we first try to relocate extent A, we setup the data reloc
inode with i_size 96K, then read both page [0, 64K) and page [64K, 128K).

For page 64K, since the i_size is just 96K, we fill range [96K, 128K)
with 0 and set it uptodate.

Then when we come to extent B, we update i_size to 104K, then try to read
page [64K, 128K).
Then we find the page is already uptodate, so we skip the read.
But range [96K, 128K) is filled with 0, not the real data.

Then we writeback the data reloc inode to disk, with 0 filling range
[96K, 128K), corrupting the content of extent B.

The behavior is caused by the fact that we still do full page read for
subpage case.

The bug won't really happen for regular sectorsize, as one page only
contains one sector.

[FIX]
This patch will fix the problem by invalidating range [i_size, PAGE_END]
in prealloc_file_extent_cluster().

So that if above example happens, when we preallocate the file extent
for extent B, we will clear the uptodate bits for range [96K, 128K),
allowing later relocate_one_page() to re-read the needed range.

There is a special note for the invalidating part.

Since we're not calling real btrfs_invalidatepage(), but just clearing
the subpage and page uptodate bits, we can leave a page half dirty and
half out of date.

Reading such page can cause a deadlock, as we normally expect a dirty
page to be fully uptodate.

Thus here we flush and wait the data reloc inode before doing the hacked
invalidating.  This won't cause extra overhead, as we're going to
writeback the data later anyway.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:06 +02:00
Qu Wenruo e3c62324e4 btrfs: subpage: fix false alert when relocating partial preallocated data extents
[BUG]
When relocating partial preallocated data extents (part of the
preallocated extent is written) for subpage, it can cause the following
false alert and make the relocation to fail:

  BTRFS info (device dm-3): balance: start -d
  BTRFS info (device dm-3): relocating block group 13631488 flags data
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  BTRFS warning (device dm-3): csum failed root -9 ino 257 off 4096 csum 0x98757625 expected csum 0x00000000 mirror 1
  BTRFS error (device dm-3): bdev /dev/mapper/arm_nvme-test errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  BTRFS info (device dm-3): balance: ended with status: -5

The minimal script to reproduce looks like this:

  mkfs.btrfs -f -s 4k $dev
  mount $dev -o nospace_cache $mnt
  xfs_io -f -c "falloc 0 8k" $mnt/file
  xfs_io -f -c "pwrite 0 4k" $mnt/file
  btrfs balance start -d $mnt

[CAUSE]
Function btrfs_verify_data_csum() checks if the full range has
EXTENT_NODATASUM bit for data reloc inode, if *all* bytes of the range
have EXTENT_NODATASUM bit, then it skip the range.

This works pretty well for regular sectorsize, as in that case
btrfs_verify_data_csum() is called for each sector, thus no problem at
all.

But for subpage case, btrfs_verify_data_csum() is called on each bvec,
which can contain several sectors, and since it checks *all* bytes for
EXTENT_NODATASUM bit, if we have some range with csum, then we will
continue checking all the sectors.

For the preallocated sectors, it doesn't have any csum, thus obviously
the csum won't match and cause the false alert.

[FIX]
Move the EXTENT_NODATASUM check into the main loop, so that we can check
each sector for EXTENT_NODATASUM bit for subpage case.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo 7c11d0ae43 btrfs: subpage: fix a potential use-after-free in writeback helper
[BUG]
There is a possible use-after-free bug when running generic/095.

 BUG: Unable to handle kernel data access on write at 0x6b6b6b6b6b6b725b
 Faulting instruction address: 0xc000000000283654
 c000000000283078 do_raw_spin_unlock+0x88/0x230
 c0000000012b1e14 _raw_spin_unlock_irqrestore+0x44/0x90
 c000000000a918dc btrfs_subpage_clear_writeback+0xac/0xe0
 c0000000009e0458 end_bio_extent_writepage+0x158/0x270
 c000000000b6fd14 bio_endio+0x254/0x270
 c0000000009fc0f0 btrfs_end_bio+0x1a0/0x200
 c000000000b6fd14 bio_endio+0x254/0x270
 c000000000b781fc blk_update_request+0x46c/0x670
 c000000000b8b394 blk_mq_end_request+0x34/0x1d0
 c000000000d82d1c lo_complete_rq+0x11c/0x140
 c000000000b880a4 blk_complete_reqs+0x84/0xb0
 c0000000012b2ca4 __do_softirq+0x334/0x680
 c0000000001dd878 irq_exit+0x148/0x1d0
 c000000000016f4c do_IRQ+0x20c/0x240
 c000000000009240 hardware_interrupt_common_virt+0x1b0/0x1c0

[CAUSE]
There is very small race window like the following in generic/095.

	Thread 1		|		Thread 2
--------------------------------+------------------------------------
  end_bio_extent_writepage()	| btrfs_releasepage()
  |- spin_lock_irqsave()	| |
  |- end_page_writeback()	| |
  |				| |- if (PageWriteback() ||...)
  |				| |- clear_page_extent_mapped()
  |				|    |- kfree(subpage);
  |- spin_unlock_irqrestore().

The race can also happen between writeback and btrfs_invalidatepage(),
although that would be much harder as btrfs_invalidatepage() has much
more work to do before the clear_page_extent_mapped() call.

[FIX]
Here we "wait" for the subapge spinlock to be released before we detach
subpage structure.
So this patch will introduce a new function, wait_subpage_spinlock(), to
do the "wait" by acquiring the spinlock and release it.

Since the caller has ensured the page is not dirty nor writeback, and
page is already locked, the only way to hold the subpage spinlock is
from endio function.
Thus we only need to acquire the spinlock to wait for any existing
holder.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo e046786619 btrfs: subpage: fix race between prepare_pages() and btrfs_releasepage()
[BUG]
When running generic/095, there is a high chance to crash with subpage
data RW support:

 assertion failed: PagePrivate(page) && page->private
 ------------[ cut here ]------------
 kernel BUG at fs/btrfs/ctree.h:3403!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3567 Comm: fio Tainted: 5.12.0-rc7-custom+ #17
 Hardware name: Khadas VIM3 (DT)
 Call trace:
  assertfail.constprop.0+0x28/0x2c [btrfs]
  btrfs_subpage_assert+0x80/0xa0 [btrfs]
  btrfs_subpage_set_uptodate+0x34/0xec [btrfs]
  btrfs_page_clamp_set_uptodate+0x74/0xa4 [btrfs]
  btrfs_dirty_pages+0x160/0x270 [btrfs]
  btrfs_buffered_write+0x444/0x630 [btrfs]
  btrfs_direct_write+0x1cc/0x2d0 [btrfs]
  btrfs_file_write_iter+0xc0/0x160 [btrfs]
  new_sync_write+0xe8/0x180
  vfs_write+0x1b4/0x210
  ksys_pwrite64+0x7c/0xc0
  __arm64_sys_pwrite64+0x24/0x30
  el0_svc_common.constprop.0+0x70/0x140
  do_el0_svc+0x28/0x90
  el0_svc+0x2c/0x54
  el0_sync_handler+0x1a8/0x1ac
  el0_sync+0x170/0x180
 Code: f0000160 913be042 913c4000 955444bc (d4210000)
 ---[ end trace 3fdd39f4cccedd68 ]---

[CAUSE]
Although prepare_pages() calls find_or_create_page(), which returns the
page locked, but in later prepare_uptodate_page() calls, we may call
btrfs_readpage() which will unlock the page before it returns.

This leaves a window where btrfs_releasepage() can sneak in and release
the page, clearing page->private and causing above ASSERT().

[FIX]
In prepare_uptodate_page(), we should not only check page->mapping, but
also PagePrivate() to ensure we are still holding the correct page which
has proper fs context setup.

Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo c8050b3b7f btrfs: subpage: reject raid56 filesystem and profile conversion
RAID56 is not only unsafe due to its write-hole problem, but also has
tons of hardcoded PAGE_SIZE.

Disable it for subpage support for now.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo e0eefe07f8 btrfs: subpage: allow submit_extent_page() to do bio split
Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.

But this behavior can't handle subpage case at all.

For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.

This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.

The proper way to handle it is:

- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio

Refactor submit_extent_page() so that it does the above iteration.

The main loop inside submit_extent_page() will look like this:

	cur = pg_offset;
	while (cur < pg_offset + size) {
		u32 offset = cur - pg_offset;
		int added;
		if (!bio_ctrl->bio) {
			/* Allocate new bio if needed */
		}

		/* Add as many bytes into the bio */
		added = btrfs_bio_add_page();

		if (added < size - offset) {
			/* The current bio is full, submit it */
		}
		cur += added;
	}

Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo 7367253a35 btrfs: subpage: disable inline extent creation
[BUG]
When running the following fsx command (extracted from generic/127) on
subpage filesystem, it can create inline extent with regular extents:

  fsx -q -l 262144 -o 65536 -S 191110531 -N 9057 -R -W $mnt/file > /tmp/fsx

The offending extent would look like:

  item 9 key (257 INODE_REF 256) itemoff 15703 itemsize 14
    index 2 namelen 4 name: file
  item 10 key (257 EXTENT_DATA 0) itemoff 14975 itemsize 728
    generation 7 type 0 (inline)
    inline extent data size 707 ram_bytes 707 compression 0 (none)
  item 11 key (257 EXTENT_DATA 4096) itemoff 14922 itemsize 53
    generation 7 type 2 (prealloc)
    prealloc data disk byte 102346752 nr 4096
    prealloc data offset 0 nr 4096

[CAUSE]
For subpage filesystem, the writeback is triggered in page units, which
means, even if we just want to writeback range [16K, 20K) for 64K page
system, we will still try to writeback any dirty sector of range [0, 64K).

This is never a problem if sectorsize == PAGE_SIZE, but for subpage,
this can cause unexpected problems.

For above test case, the last several operations from fsx are:

 9055 trunc      from 0x40000 to 0x2c3
 9057 falloc     from 0x164c to 0x19d2 (0x386 bytes)

In operation 9055, we dirtied sector [0, 4096), then in falloc, we call
btrfs_wait_ordered_range(inode, start=4096, len=4096), only expecting to
writeback any dirty data in [4096, 8192), but nothing else.

Unfortunately, in subpage case, above btrfs_wait_ordered_range() will
trigger writeback of the range [0, 64K), which includes the data at
[0, 4096).

And since at the call site, we haven't yet increased i_size, which is
still 707, this means cow_file_range() can insert an inline extent.

Resulting above inline + regular extent.

[WORKAROUND]
I don't really have any good short-term solution yet, as this means all
operations that would trigger writeback need to be reviewed for any
i_size change.

So here I choose to disable inline extent creation for subpage case as a
workaround.  We have done tons of work just to avoid such extent, so I
don't to create an exception just for subpage.

This only affects inline extent creation, subpage has no problem reading
existing inline extents at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:05 +02:00
Qu Wenruo cc1d0d93d5 btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:

 kernel BUG at fs/btrfs/file-item.c:667!
 Internal error: Oops - BUG: 0 [#1] SMP
 CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
 Hardware name: Radxa ROCK Pi 4B (DT)
 Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
 pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
 lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
 Call trace:
  btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
  btrfs_submit_bio_start+0x20/0x30 [btrfs]
  run_one_async_start+0x28/0x44 [btrfs]
  btrfs_work_helper+0x128/0x1b4 [btrfs]
  process_one_work+0x22c/0x430
  worker_thread+0x70/0x3a0
  kthread+0x13c/0x140
  ret_from_fork+0x10/0x30

[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().

Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.

This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.

In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.

But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.

This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.

[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().

Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo c283289812 btrfs: make relocate_one_page() handle subpage case
For subpage case, one page of data reloc inode can contain several file
extents, like this:

|<--- File extent A --->| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|

We can no longer use PAGE_SIZE directly for various operations.

This patch will relocate_one_page() to handle subpage case by:
- Iterating through all extents of a cluster when marking pages
  When marking pages dirty and delalloc, we need to check the cluster
  extent boundary.
  Now we introduce a loop to go extent by extent of a page, until we
  either finished the last extent, or reach the page end.

  By this, regular sectorsize == PAGE_SIZE can still work as usual, since
  we will do that loop only once.

- Iteration start from max(page_start, extent_start)
  Since we can have the following case:
			| FE B | FE C |<--- File extent D -->|
		|<--------- Page --------->|
  Thus we can't always start from page_start, but do a
  max(page_start, extent_start)

- Iteration end when the cluster is exhausted
  Similar to previous case, the last file extent can end before the page
  end:
|<--- File extent A --->| FE B | FE C |
		|<--------- Page --------->|
  In this case, we need to manually exit the loop after we have finished
  the last extent of the cluster.

- Reserve metadata space for each extent range
  Since now we can hit multiple ranges in one page, we should reserve
  metadata for each range, not simply PAGE_SIZE.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo f47960f49e btrfs: reloc: factor out relocation page read and dirty part
In function relocate_file_extent_cluster(), we have a big loop for
marking all involved page delalloc.

That part is long enough to be contained in one function, so this patch
will move that code chunk into a new function, relocate_one_page().

This also provides enough space for later subpage work.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo a6e66e6f8c btrfs: rework lzo_decompress_bio() to make it subpage compatible
For the initial subpage support, although we won't support compressed
write, we still need to support compressed read.

But for lzo_decompress_bio() it has several problems:

- The abuse of PAGE_SIZE for boundary detection
  For subpage case, we should follow sectorsize to detect the padding
  zeros.
  Using PAGE_SIZE will cause subpage compress read to skip certain
  bytes, and causing read error.

- Too many helper variables
  There are half a dozen helper variables, which is only making things
  harder to read

This patch will rework lzo_decompress_bio() to make it work for subpage:

- Use sectorsize to do boundary check, while still use PAGE_SIZE for
  page switching
  This allows us to have the same on-disk format for 4K sectorsize fs,
  while take advantage of larger page size.

- Use two main cursors
  Only @cur_in and @cur_out is utilized as the main cursor.
  The helper variables will only be declared inside the loop, and only 2
  helper variables needed.

- Introduce a helper function to copy compressed segment payload
  Introduce a new helper, copy_compressed_segment(), to copy a
  compressed segment to workspace buffer.
  This function will handle the page switching.

Now the net result is, with all the excessive comments and new helper
function, the refactored code is still smaller, and easier to read.

For other decompression code, they have no special padding rule, thus no
need to bother for initial subpage support, but will be refactored to
the same style later.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo 1c3dc1731e btrfs: rework btrfs_decompress_buf2page()
There are several bugs inside the function btrfs_decompress_buf2page()

- @start_byte doesn't take bvec.bv_offset into consideration
  Thus it can't handle case where the target range is not page aligned.

- Too many helper variables
  There are tons of helper variables, @buf_offset, @current_buf_start,
  @start_byte, @prev_start_byte, @working_bytes, @bytes.
  This hurts anyone who wants to read the function.

- No obvious main cursor for the iteartion
  A new problem caused by previous problem.

- Comments for parameter list makes no sense
  Like @buf_start is the offset to @buf, or offset inside the full
  decompressed extent? (Spoiler alert, the later case)
  And @total_out acts more like @buf_start + @size_of_buf.

  The worst is @disk_start.
  The real meaning of it is the file offset of the full decompressed
  extent.

This patch will rework the whole function by:

- Add a proper comment with ASCII art to explain the parameter list

- Rework parameter list
  The old @buf_start is renamed to @decompressed, to show how many bytes
  are already decompressed inside the full decompressed extent.
  The old @total_out is replaced by @buf_len, which is the decompressed
  data size.
  For old @disk_start and @bio, just pass @compressed_bio in.

- Use single main cursor
  The main cursor will be @cur_file_offset, to show what's the current
  file offset.
  Other helper variables will be declared inside the main loop, and only
  minimal amount of helper variables:
  * offset_inside_decompressed_buf:	The only real helper
  * copy_start_file_offset:		File offset we start memcpy
  * bvec_file_offset:			File offset of current bvec

Even with all these extensive comments, the final function is still
smaller than the original function, which is definitely a win.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo 557023ea9f btrfs: grab correct extent map for subpage compressed extent read
[BUG]
When subpage compressed read write support is enabled, btrfs/038 always
fails with EIO.

A simplified script can easily trigger the problem:

  mkfs.btrfs -f -s 4k $dev
  mount $dev $mnt -o compress=lzo

  xfs_io -f -c "truncate 118811" $mnt/foo
  xfs_io -c "pwrite -S 0x0d -b 39987 92267 39987" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap1

  xfs_io -c "pwrite -S 0x3e -b 80000 200000 80000" $mnt/foo > /dev/null
  sync

  xfs_io -c "pwrite -S 0xdc -b 10000 250000 10000" $mnt/foo > /dev/null
  xfs_io -c "pwrite -S 0xff -b 10000 300000 10000" $mnt/foo > /dev/null

  sync
  btrfs subvolume snapshot -r $mnt $mnt/mysnap2

  cat $mnt/mysnap2/foo
  # Above cat will fail due to EIO

[CAUSE]
The problem is in btrfs_submit_compressed_read().

When it tries to grab the extent map of the read range, it uses the
following call:

	em = lookup_extent_mapping(em_tree,
  				   page_offset(bio_first_page_all(bio)),
				   fs_info->sectorsize);

The problem is in the page_offset(bio_first_page_all(bio)) part.

The offending inode has the following file extent layout

        item 10 key (257 EXTENT_DATA 131072) itemoff 15639 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13680640 nr 4096
                extent data offset 0 nr 4096 ram 4096
                extent compression 0 (none)
        item 11 key (257 EXTENT_DATA 135168) itemoff 15586 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 0 nr 0
        item 12 key (257 EXTENT_DATA 196608) itemoff 15533 itemsize 53
                generation 8 type 1 (regular)
                extent data disk byte 13676544 nr 4096
                extent data offset 0 nr 53248 ram 86016
                extent compression 2 (lzo)

And the bio passed in has the following parameters:

page_offset(bio_first_page_all(bio))	= 131072
bio_first_bvec_all(bio)->bv_offset	= 65536

If we use page_offset(bio_first_page_all(bio) without adding bv_offset,
we will get an extent map for file offset 131072, not 196608.

This means we read uncompressed data from disk, and later decompression
will definitely fail.

[FIX]
Take bv_offset into consideration when trying to grab an extent map.

And add an ASSERT() to ensure we're really getting a compressed extent.

Thankfully this won't affect anything but subpage, thus we only need to
ensure this patch get merged before we enabled basic subpage support.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo ca62e85ded btrfs: disable compressed readahead for subpage
For current subpage support, we only support 64K page size with 4K
sector size.

This makes compressed readahead less effective, as maximum compressed
extent size is only 128K, 2x the page size.

On the other hand, the function add_ra_bio_pages() is still assuming
sectorsize == PAGE_SIZE, and code change may affect 4K page size
systems.

So for now, let's disable subpage compressed readahead for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:04 +02:00
Qu Wenruo 3670e6451b btrfs: subpage: check if there are compressed extents inside one page
[BUG]
When testing experimental subpage compressed write support, it hits a
NULL pointer dereference inside read path:

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
 pc : __pi_memcmp+0x28/0x1ec
 lr : check_data_csum+0xd0/0x274 [btrfs]
 Call trace:
  __pi_memcmp+0x28/0x1ec
  btrfs_verify_data_csum+0xf4/0x244 [btrfs]
  end_bio_extent_readpage+0x1d0/0x6b0 [btrfs]
  bio_endio+0x15c/0x1dc
  end_workqueue_fn+0x44/0x64 [btrfs]
  btrfs_work_helper+0x74/0x250 [btrfs]
  process_one_work+0x1d4/0x47c
  worker_thread+0x180/0x400
  kthread+0x11c/0x120
  ret_from_fork+0x10/0x30
 Code: 54000261 d100044c d343fd8c f8408403 (f8408424)
 ---[ end trace 9e2c59f33ea40866 ]---

[CAUSE]
When reading two compressed extents inside the same page, like the
following layout, we trigger above crash:

	0	32K	64K
	|-------|\\\\\\\|
	     |	     \- Compressed extent (A)
	     \--------- Compressed extent (B)

For compressed read, we don't need to populate its io_bio->csum, as we
rely on compressed_bio->csum to verify the compressed data, and then
copy the decompressed to inode pages.

Normally btrfs_verify_data_csum() skip such page by checking and
clearing its PageChecked flag

But since that flag is still for the full page, when endio for inode
page range [0, 32K) gets executed, it clears PageChecked flag for the
full page.

Then when endio for inode page range [32K, 64K) gets executed, since the
page no longer has PageChecked flag, it just continues checking, even
though io_bio->csum is NULL.

[FIX]
Thankfully there are only two users of PageChecked bit:

- Cow fixup
  Since subpage has its own way to trace page dirty (dirty_bitmap) and
  ordered bit (ordered_bitmap), it should never trigger cow fixup.

- Compressed read
  We can distinguish such read by just checking io_bio->csum.

So just check io_bio->csum before doing the verification to avoid such
NULL pointer dereference.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
Qu Wenruo 4c37a79384 btrfs: reset this_bio_flag to avoid inheriting old flags
In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
compressed extent.

This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
sector for one page, thus @this_bio_flag will only be set at most once.

But for subpage case, after hitting a compressed extent, @this_bio_flag
will always have EXTENT_BIO_COMPRESSED bit, even we're reading a regular
extent.

This will lead to various read errors, and causing new ASSERT() in
incoming subpage patches, which adds more strict check in
btrfs_submit_compressed_read().

Fix it by declaring @this_bio_flag inside the main loop and reset its
value for each iteration.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba 214cc18432 btrfs: constify and cleanup variables in comparators
Comparators just read the data and thus get const parameters. This
should be also preserved by the local variables, update all comparators
passed to sort or bsearch.

Cleanups:

- unnecessary casts are dropped
- btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
  and 'inline' is dropped as the function address is taken

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba d58ede8d1d btrfs: simplify data stripe calculation helpers
There are two helpers doing the same calculations based on nparity and
ncopies. calc_data_stripes can be simplified into one expression, so far
we don't have profile with both copies and parity, so there's no
effective change. calc_stripe_length should reuse the helper and not
repeat the same calculation.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba fe4f46d40c btrfs: merge alloc_device helpers
The device allocation is split to two functions, but one just calls the
other and they're very far in the file. Merge them together.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba 500a44c9b3 btrfs: uninline btrfs_bg_flags_to_raid_index
The helper does a simple translation from block group flags to index to
the btrfs_raid_array table. There's no apparent reason to inline the
function, the translation happens usually once per function and is not
called in a loop.

Making it a proper function saves quite some binary code (x86_64,
release config):

   text    data     bss     dec     hex filename
1164011   19253   14912 1198176  124860 pre/btrfs.ko
1161559   19253   14912 1195724  123ecc post/btrfs.ko

DELTA: -2451

Also add the const attribute as there are no side effects, this could
help compiler to optimize a few things without the function body.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba 6c154ba41b btrfs: tree-checker: add missing stripe checks for raid1c3/4 profiles
The stripe checks for raid1c3/raid1c4 are missing in the sequence in
btrfs_check_chunk_valid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:03 +02:00
David Sterba 0ac6e06b6c btrfs: tree-checker: use table values for stripe checks
There are hardcoded values in several checks regarding chunks and stripe
constraints. We have that defined in the raid table and ought to use it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
David Sterba 809d6902b3 btrfs: make btrfs_next_leaf static inline
btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
to header to avoid the function call overhead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
David Sterba f41b6ba93d btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pending
In commit e65f152e43 ("btrfs: refactor how we finish ordered extent io
for endio functions") there was last caller not using 1 for the uptodate
parameter. Now there's only one, passing 1, so we can remove it and
simplify the code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
David Sterba 25c1252a02 btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered
The uptodate parameter should be bool, change the type.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
Qu Wenruo a129ffb816 btrfs: remove unused start and end parameters from btrfs_run_delalloc_range()
Since commit d75855b451 ("btrfs: Remove
extent_io_ops::writepage_start_hook") removes the writepage_start_hook()
and adds btrfs_writepage_cow_fixup() function, there is no need to
follow the old hook parameters.

Remove the @start and @end hook, since currently the fixup check is full
page check, it doesn't need @start and @end hook.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
Marcos Paulo de Souza a7d1c5dc86 btrfs: introduce btrfs_lookup_match_dir
btrfs_search_slot is called in multiple places in dir-item.c to search
for a dir entry, and then calling btrfs_match_dir_name to return a
btrfs_dir_item.

In order to reduce the number of callers of btrfs_search_slot, create a
common function that looks for the dir key, and if found call
btrfs_match_dir_item_name.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:02 +02:00
Marcos Paulo de Souza f8ee80de7b btrfs: remove unneeded return variable in btrfs_lookup_file_extent
We can return from btrfs_search_slot directly which also shows that it
follows the same return value convention.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Marcos Paulo de Souza ad9a937850 btrfs: use btrfs_next_leaf instead of btrfs_next_item when slots > nritems
After calling btrfs_search_slot is a common practice to check if the
slot found isn't bigger than number of slots in the current leaf, and if
so, search for the same key in the next leaf by calling btrfs_next_leaf,
which calls btrfs_next_old_leaf to do the job.

Calling btrfs_next_item in the same situation would end up in the same
code flow, since

* btrfs_next_item
  * btrfs_next_old_item
    * if slot >= nritems(curr_leaf)
      btrfs_next_old_leaf

Change btrfs_verify_dev_extents and calculate_emulated_zone_size
functions to use btrfs_next_leaf in the same situation.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana c7bcbb2120 btrfs: remove ignore_offset argument from btrfs_find_all_roots()
Currently all the callers of btrfs_find_all_roots() pass a value of false
for its ignore_offset argument. This makes the argument pointless and we
can remove it and make btrfs_find_all_roots() always pass false as the
ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana 2ac691d8b3 btrfs: avoid unnecessary lock and leaf splits when updating inode in the log
During a fast fsync, if we have already fsynced the file before and in the
current transaction, we can make the inode item update more efficient and
avoid acquiring a write lock on the leaf's parent.

To update the inode item we are always using btrfs_insert_empty_item() to
get a path pointing to the inode item, which calls btrfs_search_slot()
with an "ins_len" argument of 'sizeof(struct btrfs_inode_item) +
sizeof(struct btrfs_item)', and that always results in the search taking
a write lock on the level 1 node that is the parent of the leaf that
contains the inode item. This adds unnecessary lock contention on log
trees when we have multiple fsyncs in parallel against inodes in the same
subvolume, which has a very significant impact due to the fact that log
trees are short lived and their height very rarely goes beyond level 2.

Also, by using btrfs_insert_empty_item() when we need to update the inode
item, we also end up splitting the leaf of the existing inode item when
the leaf has an amount of free space smaller than the size of an inode
item.

Improve this by using btrfs_seach_slot(), with a 0 "ins_len" argument,
when we know the inode item already exists in the log. This avoids these
two inefficiencies.

The following script, using fio, was used to perform the tests:

  $ cat fio-test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 4 ]; then
    echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
    exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3
  BLOCK_SIZE=$4

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=randwrite
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=$BLOCK_SIZE
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  EOF

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo
  echo "mount options: $MOUNT_OPTIONS"
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini
  umount $MNT

The tests were done on a physical machine, with 12 cores, 64G of RAM,
using a NVMEe device and using a non-debug kernel config (the default one
from Debian). The summary line from fio is provided below for each test
run.

With 8 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:

Before: WRITE: bw=28.3MiB/s (29.7MB/s), 28.3MiB/s-28.3MiB/s (29.7MB/s-29.7MB/s), io=2048MiB (2147MB), run=72297-72297msec
After:  WRITE: bw=28.7MiB/s (30.1MB/s), 28.7MiB/s-28.7MiB/s (30.1MB/s-30.1MB/s), io=2048MiB (2147MB), run=71411-71411msec

+1.4% throughput, -1.2% runtime

With 16 jobs, file size 256M, fsync frequency of 4 and a block size of 4K:

Before: WRITE: bw=40.0MiB/s (42.0MB/s), 40.0MiB/s-40.0MiB/s (42.0MB/s-42.0MB/s), io=4096MiB (4295MB), run=99980-99980msec
After:  WRITE: bw=40.9MiB/s (42.9MB/s), 40.9MiB/s-40.9MiB/s (42.9MB/s-42.9MB/s), io=4096MiB (4295MB), run=97933-97933msec

+2.2% throughput, -2.1% runtime

The changes are small but it's possible to be better on faster hardware as
in the test machine used disk utilization was pretty much 100% during the
whole time the tests were running (observed with 'iostat -xz 1').

The tests also included the previous patch with the subject of:
"btrfs: avoid unnecessary log mutex contention when syncing log".
So they compared a branch without that patch and without this patch versus
a branch with these two patches applied.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana e68107e51f btrfs: remove unnecessary list head initialization when syncing log
One of the last steps of syncing the log is to remove all log contexts
from the root's list of contexts, done at btrfs_remove_all_log_ctxs().
There we iterate over all the contexts in the list and delete each one
from the list, and after that we call INIT_LIST_HEAD() on the list. That
is unnecessary since at that point the list is empty.

So just remove the INIT_LIST_HEAD() call. It's not needed, increases code
size (bloat-o-meter reported a delta of -122 for btrfs_sync_log() after
this change) and increases two critical sections delimited by log mutexes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana e1a6d26483 btrfs: avoid unnecessary log mutex contention when syncing log
When syncing the log we acquire the root's log mutex just to update the
root's last_log_commit. This is unnecessary because:

1) At this point there can only be one task updating this value, which is
   the task committing the current log transaction. Any task that enters
   btrfs_sync_log() has to wait for the previous log transaction to commit
   and wait for the current log transaction to commit if someone else
   already started it (in this case it never reaches to the point of
   updating last_log_commit, as that is done by the committing task);

2) All readers of the root's last_log_commit don't acquire the root's
   log mutex. This is to avoid blocking the readers, potentially for too
   long and because getting a stale value of last_log_commit does not
   cause any functional problem, in the worst case getting a stale value
   results in logging an inode unnecessarily. Plus it's actually very
   rare to get a stale value that results in unnecessarily logging the
   inode.

So in order to avoid unnecessary contention on the root's log mutex,
which is used for several different purposes, like starting/joining a
log transaction and starting writeback of a log transaction, stop
acquiring the log mutex for updating the root's last_log_commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana cceaa89f02 btrfs: remove racy and unnecessary inode transaction update when using no-holes
When using the NO_HOLES feature and expanding the size of an inode, we
update the inode's last_trans, last_sub_trans and last_log_commit fields
at maybe_insert_hole() so that a fsync does know that the inode needs to
be logged (by making sure that btrfs_inode_in_log() returns false). This
happens for expanding truncate operations, buffered writes, direct IO
writes and when cloning extents to an offset greater than the inode's
i_size.

However the way we do it is racy, because in between setting the inode's
last_sub_trans and last_log_commit fields, the log transaction ID that was
assigned to last_sub_trans might be committed before we read the root's
last_log_commit and assign that value to last_log_commit. If that happens
it would make a future call to btrfs_inode_in_log() return true. This is
a race that should be extremely unlikely to be hit in practice, and it is
the same that was described by commit bc0939fcfa ("btrfs: fix race
between marking inode needs to be logged and log syncing").

The fix would simply be to set last_log_commit to the value we assigned
to last_sub_trans minus 1, like it was done in that commit. However
updating these two fields plus the last_trans field is pointless here
because all the callers of btrfs_cont_expand() (which is the only
caller of maybe_insert_hole()) always call btrfs_set_inode_last_trans()
or btrfs_update_inode() after calling btrfs_cont_expand(). Calling either
btrfs_set_inode_last_trans() or btrfs_update_inode() guarantees that the
next fsync will log the inode, as it makes btrfs_inode_in_log() return
false.

So just remove the code that explicitly sets the inode's last_trans,
last_sub_trans and last_log_commit fields.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:01 +02:00
Filipe Manana 5a656c3628 btrfs: stop doing GFP_KERNEL memory allocations in the ref verify tool
In commit 351cbf6e44 ("btrfs: use nofs allocations for running delayed
items") we wrapped all btree updates when running delayed items with
memalloc_nofs_save() and memalloc_nofs_restore(), due to a lock inversion
detected by lockdep involving reclaim and the mutex of delayed nodes.

The problem is because the ref verify tool does some memory allocations
with GFP_KERNEL, which can trigger reclaim and reclaim can trigger inode
eviction, which requires locking the mutex of an inode's delayed node.
On the other hand the ref verify tool is called when allocating metadata
extents as part of operations that modify a btree, which is a problem when
running delayed nodes, where we do btree updates while holding the mutex
of a delayed node. This is what caused the lockdep warning.

Instead of wrapping every btree update when running delayed nodes, change
the ref verify tool to never do GFP_KERNEL allocations, because:

1) We get less repeated code, which at the moment does not even have a
   comment mentioning why we need to setup the NOFS context, which is a
   recommended good practice as mentioned at
   Documentation/core-api/gfp_mask-from-fs-io.rst

2) The ref verify tool is something meant only for debugging and not
   something that should be enabled on non-debug / non-development
   kernels;

3) We may have yet more places outside delayed-inode.c where we have
   similar problem: doing btree updates while holding some lock and
   then having the GFP_KERNEL memory allocations, from the ref verify
   tool, trigger reclaim and trying again to acquire the same lock
   through the reclaim path.
   Or we could get more such cases in the future, therefore this change
   prevents getting into similar cases when using the ref verify tool.

Curiously most of the memory allocations done by the ref verify tool
were already using GFP_NOFS, except a few ones for no apparent reason.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
Filipe Manana 506650dcb3 btrfs: improve the batch insertion of delayed items
When we insert the delayed items of an inode, which corresponds to the
directory index keys for a directory (key type BTRFS_DIR_INDEX_KEY), we
do the following:

1) Pick the first delayed item from the rbtree and insert it into the
   fs/subvolume btree, using btrfs_insert_empty_item() for that;

2) Without releasing the path returned by btrfs_insert_empty_item(),
   keep collecting as many consecutive delayed items from the rbtree
   as possible, as long as each one's BTRFS_DIR_INDEX_KEY key is the
   immediate successor of the previously picked item and as long as
   they fit in the available space of the leaf the path points to;

3) Then insert all the collected items into the leaf;

4) Release the reserve metadata space for each collected item and
   release each item (implies deleting from the rbtree);

5) Unlock the path.

While this is much better than inserting items one by one, it can be
improved in a few aspects:

1) Instead of adding items based on the remaining free space of the
   leaf, collect as many items that can fit in a leaf and bulk insert
   them. This results in less and larger batches, reducing the total
   amount of time to insert the delayed items. For example when adding
   100K files to a directory, we ended up creating 1658 batches with
   very variable sizes ranging from 1 item to 118 items, on a filesystem
   with a node/leaf size of 16K. After this change, we end up with 839
   batches, with the vast majority of them having exactly 120 items;

2) We do the search for more items to batch, by iterating the rbtree,
   while holding a write lock on the leaf;

3) While still holding the leaf locked, we are releasing the reserved
   metadata for each item and then deleting each item, keeping a write
   lock on the leaf for longer than necessary. Releasing the delayed items
   one by one can take a significant amount of time, because deleting
   them from the rbtree can often be a bit slow when the deletion results
   in rebalancing the rbtree.

So change this so that we try to create larger batches, with a total
item size up to the maximum a leaf can support, and by unlocking the leaf
immediately after inserting the items, releasing the reserved metadata
space of each item and releasing each item without holding the write lock
on the leaf.

The following script that runs fs_mark was used to test this change:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nvme0n1
  MNT=/mnt/nvme0n1
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"
  FILES=1000000
  THREADS=16
  FILE_SIZE=0

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t 16"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

It was run on machine with 12 cores, 64G of ram, using a NVMe device and
using a non-debug kernel config (Debian's default config).

Results before this change:

FSUse%        Count         Size    Files/sec         App Overhead
     1     16000000            0      76182.1             72223046
     3     32000000            0      62746.9             80776528
     5     48000000            0      77029.0             93022381
     6     64000000            0      73691.6             95251075
     8     80000000            0      66288.0             85089634

Results after this change:

FSUse%        Count         Size    Files/sec         App Overhead
     1     16000000            0      79049.5 (+3.7%)     69700824
     3     32000000            0      65248.9 (+3.9%)     80583693
     5     48000000            0      77991.4 (+1.2%)     90040908
     6     64000000            0      75096.8 (+1.9%)     89862241
     8     80000000            0      66926.8 (+1.0%)     84429169

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
Qu Wenruo 2b29726c47 btrfs: rescue: allow ibadroots to skip bad extent tree when reading block group items
When extent tree gets corrupted, normally it's not extent tree root, but
one toasted tree leaf/node.

In that case, rescue=ibadroots mount option won't help as it can only
handle the extent tree root corruption.

This patch will enhance the behavior by:

- Allow fill_dummy_bgs() to ignore -EEXIST error

  This means we may have some block group items read from disk, but
  then hit some error halfway.

- Fallback to fill_dummy_bgs() if any error gets hit in
  btrfs_read_block_groups()

  Of course, this still needs rescue=ibadroots mount option.

With that, rescue=ibadroots can handle extent tree corruption more
gracefully and allow a better recover chance.

Reported-by: Zhenyu Wu <wuzy001@gmail.com>
Link: https://www.spinics.net/lists/linux-btrfs/msg114424.html
Reviewed-by: Su Yue <l@damenly.su>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
Marcos Paulo de Souza 6534c0c99d btrfs: pass NULL as trans to btrfs_search_slot if we only want to search
Using a transaction in btrfs_search_slot is only useful when we are
searching to add or modify the tree. When the function is used for
searching, insert length and mod arguments are 0, there is no need to
use a transaction.

No functional changes, changing for consistency.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
Filipe Manana 069a2e3778 btrfs: continue readahead of siblings even if target node is in memory
At reada_for_search(), when attempting to readahead a node or leaf's
siblings, we skip the readahead of the siblings if the node/leaf is
already in memory. That is probably fine for the READA_FORWARD and
READA_BACK readahead types, as they are used on contexts where we
end up reading some consecutive leaves, but usually not the whole btree.

However for a READA_FORWARD_ALWAYS mode, currently only used for full
send operations, it does not make sense to skip the readahead if the
target node or leaf is already loaded in memory, since we know the caller
is visiting every node and leaf of the btree in ascending order.

So change the behaviour to not skip the readahead when the target node is
already in memory and the readahead mode is READA_FORWARD_ALWAYS.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk, with 32GiB of RAM and
using a non-debug kernel config (Debian's default config).

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  file_count=2000000
  add_files $file_count 0 4

  echo
  echo "Creating snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

The duration of the full send operations, in seconds, were the following:

Before this change:  85 seconds
After this change:   76 seconds (-11.2%)

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
David Sterba 5da3847992 btrfs: check-integrity: drop kmap/kunmap for block pages
The pages in block_ctx have never been allocated from highmem (in
btrfsic_read_block) so the mapping is pointless and can be removed.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
David Sterba 4c2bf276b5 btrfs: compression: drop kmap/kunmap from generic helpers
The pages in compressed_pages are not from highmem anymore so we can
drop the mapping for checksum calculation and inline extent.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:19:00 +02:00
David Sterba bbaf9715f3 btrfs: compression: drop kmap/kunmap from zstd
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
David Sterba 696ab562e6 btrfs: compression: drop kmap/kunmap from zlib
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
David Sterba 8c945d32e6 btrfs: compression: drop kmap/kunmap from lzo
As we don't use highmem pages anymore, drop the kmap/kunmap. The kmap is
simply page_address and kunmap is a no-op.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
David Sterba b0ee5e1ec4 btrfs: drop from __GFP_HIGHMEM all allocations
The highmem flag is used for allocating pages for compression and for
raid56 pages. The high memory makes sense on 32bit systems but is not
without problems. On 64bit system's it's just another layer of wrappers.

The time the pages are allocated for compression or raid56 is relatively
short (about a transaction commit), so the pages are not blocked
indefinitely. As the number of pages depends on the amount of data being
written/read, there's a theoretical problem. A fast device on a 32bit
system could use most of the low memory pool, while with the highmem
allocation that would not happen. This was possibly the original idea
long time ago, but nowadays we optimize for 64bit systems.

This patch removes all usage of the __GFP_HIGHMEM flag for page
allocation, the kmap/kunmap are still in place and will be removed in
followup patches. Remaining is masking out the bit in
alloc_extent_state and __lookup_free_space_inode, that can safely stay.

Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
Anand Jain 23608d51a3 btrfs: cleanup fs_devices pointer usage in btrfs_trim_fs
Drop variable 'devices' (used only once) and add new variable for
the fs_devices, so it is used at two locations within btrfs_trim_fs()
function and also helps to access fs_devices->devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
Marcos Paulo de Souza 67d5e289a1 btrfs: remove max argument from generic_bin_search
Both callers use btrfs_header_nritems to feed the max argument.  Remove
the argument and let generic_bin_search call it itself.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
Nikolay Borisov 2eadb9e75e btrfs: make btrfs_finish_chunk_alloc private to block-group.c
One of the final things that must be done to add a new chunk is
inserting its device extent items in the device tree. They describe
the portion of allocated device physical space during phase 1 of
chunk allocation. This is currently done in btrfs_finish_chunk_alloc
whose name isn't very informative. What's more, this function is only
used in block-group.c but is defined as public. There isn't anything
special about it that would warrant it being defined in volumes.c.

Just move btrfs_finish_chunk_alloc and alloc_chunk_dev_extent to
block-group.c, make the former static and rename both functions to
insert_dev_extents and insert_dev_extent respectively.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:59 +02:00
Anand Jain 4a9531cf89 btrfs: check-integrity: drop unnecessary function prototypes
The function prototypes below aren't necessary as the functions are
first defined before called. Remove them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:58 +02:00
David Sterba b3b7e1d0b4 btrfs: add special case to setget helpers for 64k pages
On 64K pages the size of the extent_buffer::pages array is 1 and
compilation with -Warray-bounds warns due to

  kaddr = page_address(eb->pages[idx + 1]);

when reading byte range crossing page boundary.

This does never actually overflow the array because on 64K because all
the data fit in one page and bounds are checked by check_setget_bounds.

To fix the reported overflows and warnings add a compile-time condition
that will allow compiler to eliminate the dead code that reads from the
idx + 1 page.

Link: https://lore.kernel.org/lkml/20210623083901.1d49d19d@canb.auug.org.au/
CC: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:58 +02:00
Johannes Thumshirn 5a80d1c6a2 btrfs: zoned: remove max_zone_append_size logic
There used to be a patch in the original series for zoned support which
limited the extent size to max_zone_append_size, but this patch has been
dropped somewhere around v9.

We've decided to go the opposite direction, instead of limiting extents
in the first place we split them before submission to comply with the
device's limits.

Remove the related code, btrfs_fs_info::max_zone_append_size and
btrfs_zoned_device_info::max_zone_append_size.

This also removes the workaround for dm-crypt introduced in
1d68128c10 ("btrfs: zoned: fail mount if the device does not support
zone append") because the fix has been merged as f34ee1dce6 ("dm
crypt: Fix zoned block device support").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23 13:18:58 +02:00
Miklos Szeredi 0cad624662 vfs: add rcu argument to ->get_acl() callback
Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-18 22:08:24 +02:00
Linus Torvalds d6d09a6942 for-5.14-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEdNDEACgkQxWXV+ddt
 WDvfDw//cDnR8HZEtrHwHX9qHitcYs6pubdwwGAsFlSZ/wh0iX05TxUjho4gGYMZ
 Kp9PXipMOEdxNLJ8oaPkI+i8vIXxTWWqAm5ZePkV0cjg+vTgqqKf9NLcMtS34kP4
 /GeQJgul9oreTMbXCx219J0B6lKpl6Iv0sCaSFyN09GIPNI8F6nyDbJTA3JKTRWb
 ElP8mGvdUFFcOKsG6Wh6BU/WVU/My7d+HumApsRXB2lDwmMambAkX0iGpRElGrbD
 ub+5ya0WeO8DB6KsVa4W8cMO5sWV9L9FcXMtGlwLbIkOxFdHvP7CT1pvH3TZe9Wy
 mr8oAL01IktuNjZgQ5sUn+yISf+LuHnWjhpu+QBRuylZiwfpMwSPb0geLcrXcYGj
 i8ERlmJvwbm6dAQlQDbA3yZKH6+FzePyTR99std2LK9JtbqBaFeSS6WM05SpRUDJ
 FNHCLOzsswzBUE54nkqsb+A8tBXpcxnvQkrU+nJeDNUYM9w6S5mbCGeZjxe/n+ov
 TGprz1ar2Ppm9YMH0zj6wOM690nJZYNrAvtUmeCl1xlLERYIoV2jXS1SzkaMdcQu
 u3UVVsOCghPN5krEac4jgiGBdVHvjVnJb/qGBNTpj3aDX29PADHUJr7TmzGJBa8F
 ePWqDngDYi+cVrm9JMls9UJhaCVmhzLAXtN5X3+fKfe5bNQE4gU=
 =aR16
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix for cross-rename, adding a missing check for directory
  and subvolume, this could lead to a crash"

* tag 'for-5.14-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
2021-08-18 12:06:42 -07:00
Christoph Hellwig a6d3d49587 iomap: switch __iomap_dio_rw to use iomap_iter
Switch __iomap_dio_rw to use iomap_iter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-16 21:26:33 -07:00
NeilBrown 3f79f6f624 btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
Cross-rename lacks a check when that would prevent exchanging a
directory and subvolume from different parent subvolume. This causes
data inconsistencies and is caught before commit by tree-checker,
turning the filesystem to read-only.

Calling the renameat2 with RENAME_EXCHANGE flags like

  renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))

on two paths:

  namesrc = dir1/subvol1/dir2
 namedest = subvol2/subvol3

will cause key order problem with following write time tree-checker
report:

  [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
  [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
  [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
  [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
  [1194842.338793]                inode generation 3 size 16 mode 40755
  [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
  [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
  [1194842.338817]                dir oid 258 type 2
  [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
  [1194842.338830]                dir oid 257 type 2
  [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
  [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
  [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
  [1194842.338863]                inode generation 6 size 8 mode 40755
  [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
  [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
  [1194842.338883]                dir oid 256 type 2
  [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
  [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
  [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
  [1194842.339245] ------------[ cut here ]------------
  [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
  [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
  [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
  [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
  [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
  [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
  [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
  [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
  [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
  [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
  [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
  [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
  [1194842.512103] Call Trace:
  [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
  [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
  [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
  [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
  [1194842.647482]  process_one_work+0x262/0x5e0
  [1194842.647520]  worker_thread+0x4c/0x320
  [1194842.655935]  ? process_one_work+0x5e0/0x5e0
  [1194842.655946]  kthread+0x135/0x160
  [1194842.655953]  ? set_kthread_struct+0x40/0x40
  [1194842.655965]  ret_from_fork+0x1f/0x30
  [1194842.672465] irq event stamp: 1729
  [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
  [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
  [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
  [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0

The corrupted data will not be written, and filesystem can be unmounted
and mounted again (all changes since the last commit will be lost).

Add the missing check for new_ino so that all non-subvolumes must reside
under the same parent subvolume. There's an exception allowing to
exchange two subvolumes from any parents as the directory representing a
subvolume is only a logical link and does not have any other structures
related to the parent subvolume, unlike files, directories etc, that
are always in the inode namespace of the parent subvolume.

Fixes: cdd1fedf82 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.7+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-16 13:33:23 +02:00
Linus Torvalds 051df241e4 for-5.14-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmEEDKIACgkQxWXV+ddt
 WDtW+BAAnUD7h3ollIQo4C6hE9WaTG49Tp12Z00Og2m8hn4XyhI2QIaDz6a2CU7n
 MLQv16vZUQk5Z/VMtczM+5ZF5Rf0ywlMXnS4Sq5yKWT0YHpnH7q2nMAvg4gql/tJ
 Ldov92hnTrFAZX6vvkLVM5lZriY7fop3Lv2vHeAKu4CymAoisAv+SLa5xYkBR6Ig
 3S16+lh/rIRgssI7KuDnjp9iTXvnB1J2MbfAOLNfqjXGWUDumu1k7HWQSNYZnHJX
 L390/QS3F3K6Trxkf5MSUXOxQROqcGKQVKyAR5ZvyULKly84nDpiINze80yCopq/
 7//32pO43xDPb78c7saxSWtjdgX4XsBOdzIoiJZHnc5CTTbCcneLes8zz4fD6AGq
 vjZKDLTgiO/sRlkQHZQk1y+7CawrqbKkAG+O7MqF7KGOtQ1WLRGfAkFP732TBFXM
 TyoZ7ENh3TiFDdeRmkOonpQ2k3DctW+7z2BmdlsuSXgD8fFbEArfxnO1SnRHrmcr
 C8FNeSkks8MTL7uePNUxwlnB8uHuGWCgSuS++q4OkCnzA3AmO6cRlDoMT3RMwVB/
 wQxvqF/U6JJx16YOVqwA6ZjuUWVwyBj/WBKlaxgfghz8CUmDC0D4Xb2/S1UVcZi6
 bFRph0UKeE5LaduoNZYaAqMOinCXFmetjudPmWO4sWfPrLb1mOY=
 =J0Pw
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix -Warray-bounds warning, to help external patchset to make it
   default treewide

 - fix writeable device accounting (syzbot report)

 - fix fsync and log replay after a rename and inode eviction

 - fix potentially lost error code when submitting multiple bios for
   compressed range

* tag 'for-5.14-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: calculate number of eb pages properly in csum_tree_block
  btrfs: fix rw device counting in __btrfs_free_extra_devids
  btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
  btrfs: mark compressed range uptodate only if all bio succeed
2021-07-30 10:50:09 -07:00
David Sterba 7280305eb5 btrfs: calculate number of eb pages properly in csum_tree_block
Building with -Warray-bounds on systems with 64K pages there's a
warning:

  fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
  fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds]
    226 |   kaddr = page_address(buf->pages[i]);
        |                        ~~~~~~~~~~^~~
  ./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’
   1630 | #define page_address(page) lowmem_page_address(page)
        |                                                ^~~~
  In file included from fs/btrfs/ctree.h:32,
                   from fs/btrfs/disk-io.c:23:
  fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’
     98 |  struct page *pages[1];
        |               ^~~~~

The compiler has no way to know that in that case the nodesize is exactly
PAGE_SIZE, so the resulting number of pages will be correct (1).

Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE
explicitly 1.

Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-29 13:01:04 +02:00
Desmond Cheong Zhi Xi b2a6166768 btrfs: fix rw device counting in __btrfs_free_extra_devids
When removing a writeable device in __btrfs_free_extra_devids, the rw
device count should be decremented.

This error was caught by Syzbot which reported a warning in
close_fs_devices:

  WARNING: CPU: 1 PID: 9355 at fs/btrfs/volumes.c:1168 close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
  Modules linked in:
  CPU: 0 PID: 9355 Comm: syz-executor552 Not tainted 5.13.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:close_fs_devices+0x763/0x880 fs/btrfs/volumes.c:1168
  RSP: 0018:ffffc9000333f2f0 EFLAGS: 00010293
  RAX: ffffffff8365f5c3 RBX: 0000000000000001 RCX: ffff888029afd4c0
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
  RBP: ffff88802846f508 R08: ffffffff8365f525 R09: ffffed100337d128
  R10: ffffed100337d128 R11: 0000000000000000 R12: dffffc0000000000
  R13: ffff888019be8868 R14: 1ffff1100337d10d R15: 1ffff1100337d10a
  FS:  00007f6f53828700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000000047c410 CR3: 00000000302a6000 CR4: 00000000001506f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_close_devices+0xc9/0x450 fs/btrfs/volumes.c:1180
   open_ctree+0x8e1/0x3968 fs/btrfs/disk-io.c:3693
   btrfs_fill_super fs/btrfs/super.c:1382 [inline]
   btrfs_mount_root+0xac5/0xc60 fs/btrfs/super.c:1749
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x86/0x270 fs/super.c:1498
   fc_mount fs/namespace.c:993 [inline]
   vfs_kern_mount+0xc9/0x160 fs/namespace.c:1023
   btrfs_mount+0x3d3/0xb50 fs/btrfs/super.c:1809
   legacy_get_tree+0xea/0x180 fs/fs_context.c:592
   vfs_get_tree+0x86/0x270 fs/super.c:1498
   do_new_mount fs/namespace.c:2905 [inline]
   path_mount+0x196f/0x2be0 fs/namespace.c:3235
   do_mount fs/namespace.c:3248 [inline]
   __do_sys_mount fs/namespace.c:3456 [inline]
   __se_sys_mount+0x2f9/0x3b0 fs/namespace.c:3433
   do_syscall_64+0x3f/0xb0 arch/x86/entry/common.c:47
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Because fs_devices->rw_devices was not 0 after
closing all devices. Here is the call trace that was observed:

  btrfs_mount_root():
    btrfs_scan_one_device():
      device_list_add();   <---------------- device added
    btrfs_open_devices():
      open_fs_devices():
        btrfs_open_one_device();   <-------- writable device opened,
	                                     rw device count ++
    btrfs_fill_super():
      open_ctree():
        btrfs_free_extra_devids():
	  __btrfs_free_extra_devids();  <--- writable device removed,
	                              rw device count not decremented
	  fail_tree_roots:
	    btrfs_close_devices():
	      close_fs_devices();   <------- rw device count off by 1

As a note, prior to commit cf89af146b ("btrfs: dev-replace: fail
mount if we don't have replace item with target device"), rw_devices
was decremented on removing a writable device in
__btrfs_free_extra_devids only if the BTRFS_DEV_STATE_REPLACE_TGT bit
was not set for the device. However, this check does not need to be
reinstated as it is now redundant and incorrect.

In __btrfs_free_extra_devids, we skip removing the device if it is the
target for replacement. This is done by checking whether device->devid
== BTRFS_DEV_REPLACE_DEVID. Since BTRFS_DEV_STATE_REPLACE_TGT is set
only on the device with devid BTRFS_DEV_REPLACE_DEVID, no devices
should have the BTRFS_DEV_STATE_REPLACE_TGT bit set after the check,
and so it's redundant to test for that bit.

Additionally, following commit 82372bc816 ("Btrfs: make
the logic of source device removing more clear"), rw_devices is
incremented whenever a writeable device is added to the alloc
list (including the target device in btrfs_dev_replace_finishing), so
all removals of writable devices from the alloc list should also be
accompanied by a decrement to rw_devices.

Reported-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Fixes: cf89af146b ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
CC: stable@vger.kernel.org # 5.10+
Tested-by: syzbot+a70e2ad0879f160b9217@syzkaller.appspotmail.com
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28 19:02:49 +02:00
Filipe Manana ecc64fab7d btrfs: fix lost inode on log replay after mix of fsync, rename and inode eviction
When checking if we need to log the new name of a renamed inode, we are
checking if the inode and its parent inode have been logged before, and if
not we don't log the new name. The check however is buggy, as it directly
compares the logged_trans field of the inodes versus the ID of the current
transaction. The problem is that logged_trans is a transient field, only
stored in memory and never persisted in the inode item, so if an inode
was logged before, evicted and reloaded, its logged_trans field is set to
a value of 0, meaning the check will return false and the new name of the
renamed inode is not logged. If the old parent directory was previously
fsynced and we deleted the logged directory entries corresponding to the
old name, we end up with a log that when replayed will delete the renamed
inode.

The following example triggers the problem:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ echo -n "hello world" > /mnt/A/foo

  $ sync

  # Add some new file to A and fsync directory A.
  $ touch /mnt/A/bar
  $ xfs_io -c "fsync" /mnt/A

  # Now trigger inode eviction. We are only interested in triggering
  # eviction for the inode of directory A.
  $ echo 2 > /proc/sys/vm/drop_caches

  # Move foo from directory A to directory B.
  # This deletes the directory entries for foo in A from the log, and
  # does not add the new name for foo in directory B to the log, because
  # logged_trans of A is 0, which is less than the current transaction ID.
  $ mv /mnt/A/foo /mnt/B/foo

  # Now make an fsync to anything except A, B or any file inside them,
  # like for example create a file at the root directory and fsync this
  # new file. This syncs the log that contains all the changes done by
  # previous rename operation.
  $ touch /mnt/baz
  $ xfs_io -c "fsync" /mnt/baz

  <power fail>

  # Mount the filesystem and replay the log.
  $ mount /dev/sdc /mnt

  # Check the filesystem content.
  $ ls -1R /mnt
  /mnt/:
  A
  B
  baz

  /mnt/A:
  bar

  /mnt/B:
  $

  # File foo is gone, it's neither in A/ nor in B/.

Fix this by using the inode_logged() helper at btrfs_log_new_name(), which
safely checks if an inode was logged before in the current transaction.

A test case for fstests will follow soon.

CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28 19:02:30 +02:00
Goldwyn Rodrigues 240246f6b9 btrfs: mark compressed range uptodate only if all bio succeed
In compression write endio sequence, the range which the compressed_bio
writes is marked as uptodate if the last bio of the compressed (sub)bios
is completed successfully. There could be previous bio which may
have failed which is recorded in cb->errors.

Set the writeback range as uptodate only if cb->errors is zero, as opposed
to checking only the last bio's status.

Backporting notes: in all versions up to 4.4 the last argument is always
replaced by "!cb->errors".

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-28 18:59:23 +02:00
Linus Torvalds f0fddcec6b for-5.14-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmD7GZUACgkQxWXV+ddt
 WDs+BA/+OHDY2ROYEnysAqF1qaDENVVnUavnDYYa+Uk61KVvx0pm/mHY9SllsuH4
 WCIwCwH7LZs11cRYp3vD80t4OdVGBKaDvEfX+znMCQYuoBm6G5eT3n5jhsVFr1jJ
 EqUVzUY+S44IWAEhzkVnSAD4xnMsan8b+YnngIFSMEqJlH+on6w8FhyP0QXwInxk
 1kfjl8tDMiryKFaekfGX5WaeflEeWGoHNf2xYokzPD/Oq6TCaoLycar1YXH+80FM
 05Jl0+jfEWbaHouMNd8bW9nHnSxh30i7gorY17Q6KLOFDCThNiKZuypZsQcCi/df
 TbjQDNTZjSsReFvrFeFlEdGv3dFHBGxz1Ns7RFPfVeNgmN0WnOLmzS+4rmfGyi8L
 +3TQ6MGqgG0DppPwfB9caDvxYsbN23uA1v5J1B+Dsbo47lFWWIoBQBtDvErAiHEy
 KF7B4jIOWrx3ZYwv3pkE3D+D19sKkB9+wLnlwVSF77npKO1up8W0h4mPdMLZaznW
 TGBXxwqI4105MSX5UatBpX+HYATpEWG5tmeZz5ERGFNC/piILmY4iVz/c5Vguh9/
 iUQwjSudIDWgGxcL7VClqrdF7sucsml6Svb+ZrxckmK7pa97TG2bIlzJDg0eFcle
 NBcw8RBcBMUay/Y04cKHLJAj6OOjBiXnxKjjHrhvtaBmOV2SHpc=
 =kj2e
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few fixes and one patch to help some block layer API cleanups:

   - skip missing device when running fstrim

   - fix unpersisted i_size on fsync after expanding truncate

   - fix lock inversion problem when doing qgroup extent tracing

   - replace bdgrab/bdput usage, replace gendisk by block_device"

* tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: store a block_device in struct btrfs_ordered_extent
  btrfs: fix lock inversion problem when doing qgroup extent tracing
  btrfs: check for missing device in btrfs_trim_fs
  btrfs: fix unpersisted i_size on fsync after expanding truncate
2021-07-23 12:49:07 -07:00
Christoph Hellwig c7c3a6dcb1 btrfs: store a block_device in struct btrfs_ordered_extent
Store the block device instead of the gendisk in the btrfs_ordered_extent
structure instead of acquiring a reference to it later.

Note: this is from series removing bdgrab/bdput, btrfs is one of the
last users.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 15:50:15 +02:00
Filipe Manana 8949b9a114 btrfs: fix lock inversion problem when doing qgroup extent tracing
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
NULL value as the transaction handle argument, which makes that function
take the commit_root_sem semaphore, which is necessary when we don't hold
a transaction handle or any other mechanism to prevent a transaction
commit from wiping out commit roots.

However btrfs_qgroup_trace_extent_post() can be called in a context where
we are holding a write lock on an extent buffer from a subvolume tree,
namely from btrfs_truncate_inode_items(), called either during truncate
or unlink operations. In this case we end up with a lock inversion problem
because the commit_root_sem is a higher level lock, always supposed to be
acquired before locking any extent buffer.

Lockdep detects this lock inversion problem since we switched the extent
buffer locks from custom locks to semaphores, and when running btrfs/158
from fstests, it reported the following trace:

[ 9057.626435] ======================================================
[ 9057.627541] WARNING: possible circular locking dependency detected
[ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
[ 9057.628961] ------------------------------------------------------
[ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
[ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.632542]
               but task is already holding lock:
[ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.635255]
               which lock already depends on the new lock.

[ 9057.636292]
               the existing dependency chain (in reverse order) is:
[ 9057.637240]
               -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
[ 9057.638138]        down_read+0x46/0x140
[ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
[ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
[ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
[ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
[ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
[ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
[ 9057.643418]        evict+0xcf/0x1d0
[ 9057.643895]        do_unlinkat+0x1e9/0x300
[ 9057.644525]        do_syscall_64+0x3b/0xc0
[ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9057.645835]
               -> #0 (btrfs-tree-00){++++}-{3:3}:
[ 9057.646600]        __lock_acquire+0x130e/0x2210
[ 9057.647248]        lock_acquire+0xd7/0x310
[ 9057.647773]        down_read_nested+0x4b/0x140
[ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.655831]        process_one_work+0x247/0x5a0
[ 9057.656425]        worker_thread+0x55/0x3c0
[ 9057.656993]        kthread+0x155/0x180
[ 9057.657494]        ret_from_fork+0x22/0x30
[ 9057.658030]
               other info that might help us debug this:

[ 9057.659064]  Possible unsafe locking scenario:

[ 9057.659824]        CPU0                    CPU1
[ 9057.660402]        ----                    ----
[ 9057.660988]   lock(&fs_info->commit_root_sem);
[ 9057.661581]                                lock(btrfs-tree-00);
[ 9057.662348]                                lock(&fs_info->commit_root_sem);
[ 9057.663254]   lock(btrfs-tree-00);
[ 9057.663690]
                *** DEADLOCK ***

[ 9057.664437] 4 locks held by kworker/u16:4/30781:
[ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
[ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.670408]
               stack backtrace:
[ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
[ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[ 9057.674258] Call Trace:
[ 9057.674588]  dump_stack_lvl+0x57/0x72
[ 9057.675083]  check_noncircular+0xf3/0x110
[ 9057.675611]  __lock_acquire+0x130e/0x2210
[ 9057.676132]  lock_acquire+0xd7/0x310
[ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.677313]  ? lock_is_held_type+0xe8/0x140
[ 9057.677849]  down_read_nested+0x4b/0x140
[ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
[ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
[ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.685977]  ? ___ratelimit+0xa4/0x110
[ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.688649]  ? lock_is_held_type+0xe8/0x140
[ 9057.689180]  process_one_work+0x247/0x5a0
[ 9057.689696]  worker_thread+0x55/0x3c0
[ 9057.690175]  ? process_one_work+0x5a0/0x5a0
[ 9057.690731]  kthread+0x155/0x180
[ 9057.691158]  ? set_kthread_struct+0x40/0x40
[ 9057.691697]  ret_from_fork+0x22/0x30

Fix this by making btrfs_find_all_roots() never attempt to lock the
commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().

We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
from btrfs_qgroup_trace_extent_post(), because that would make backref
lookup not use commit roots and acquire read locks on extent buffers, and
therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
from the btrfs_truncate_inode_items() code path which has acquired a write
lock on an extent buffer of the subvolume btree.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 15:50:07 +02:00
Anand Jain 16a200f66e btrfs: check for missing device in btrfs_trim_fs
A fstrim on a degraded raid1 can trigger the following null pointer
dereference:

  BTRFS info (device loop0): allowing degraded mounts
  BTRFS info (device loop0): disk space caching is enabled
  BTRFS info (device loop0): has skinny extents
  BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
  BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
  BTRFS info (device loop0): enabling ssd optimizations
  BUG: kernel NULL pointer dereference, address: 0000000000000620
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP NOPTI
  CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
  Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
  RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
  RSP: 0018:ffff959541797d28 EFLAGS: 00010293
  RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
  RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
  RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
  R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
  R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
  FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
  Call Trace:
  btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
  btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
  ? selinux_file_ioctl+0x140/0x240
  ? syscall_trace_enter.constprop.0+0x188/0x240
  ? __x64_sys_ioctl+0x83/0xb0
  __x64_sys_ioctl+0x83/0xb0

Reproducer:

  $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
  $ mount /dev/loop0 /btrfs
  $ umount /btrfs
  $ btrfs dev scan --forget
  $ mount -o degraded /dev/loop0 /btrfs

  $ fstrim /btrfs

The reason is we call btrfs_trim_free_extents() for the missing device,
which uses device->bdev (NULL for missing device) to find if the device
supports discard.

Fix is to check if the device is missing before calling
btrfs_trim_free_extents().

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 15:49:49 +02:00
Filipe Manana 9acc8103ab btrfs: fix unpersisted i_size on fsync after expanding truncate
If we have an inode that does not have the full sync flag set, was changed
in the current transaction, then it is logged while logging some other
inode (like its parent directory for example), its i_size is increased by
a truncate operation, the log is synced through an fsync of some other
inode and then finally we explicitly call fsync on our inode, the new
i_size is not persisted.

The following example shows how to trigger it, with comments explaining
how and why the issue happens:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ touch /mnt/foo
  $ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar

  $ sync

  # Fsync bar, this will be a noop since the file has not yet been
  # modified in the current transaction. The goal here is to clear
  # BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
  $ xfs_io -c "fsync" /mnt/bar

  # Now rename both files, without changing their parent directory.
  $ mv /mnt/bar /mnt/bar2
  $ mv /mnt/foo /mnt/foo2

  # Increase the size of bar2 with a truncate operation.
  $ xfs_io -c "truncate 2M" /mnt/bar2

  # Now fsync foo2, this results in logging its parent inode (the root
  # directory), and logging the parent results in logging the inode of
  # file bar2 (its inode item and the new name). The inode of file bar2
  # is logged with an i_size of 0 bytes since it's logged in
  # LOG_INODE_EXISTS mode, meaning we are only logging its names (and
  # xattrs if it had any) and the i_size of the inode will not be changed
  # when the log is replayed.
  $ xfs_io -c "fsync" /mnt/foo2

  # Now explicitly fsync bar2. This resulted in doing nothing, not
  # logging the inode with the new i_size of 2M and the hole from file
  # offset 1M to 2M. Because the inode did not have the flag
  # BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
  # fsync of file foo2, its last_log_commit field was updated,
  # resulting in this explicit of file bar2 not doing anything.
  $ xfs_io -c "fsync" /mnt/bar2

  # File bar2 content and size before a power failure.
  $ od -A d -t x1 /mnt/bar2
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  2097152

  <power failure>

  # Mount the filesystem to replay the log.
  $ mount /dev/sdc /mnt

  # Read the file again, should have the same content and size as before
  # the power failure happened, but it doesn't, i_size is still at 1M.
  $ od -A d -t x1 /mnt/bar2
  0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
  *
  1048576

This started to happen after commit 209ecbb858 ("btrfs: remove stale
comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
no longer checks if the inode's list of modified extents is not empty.
However, checking that list is not the right way to address this case
and the check was added long time ago in commit 125c4cf9f3
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
for a different purpose, to address consecutive ranged fsyncs.

The reason that checking for the list emptiness makes this test pass is
because during an expanding truncate we create an extent map to represent
a hole from the old i_size to the new i_size, and add that extent map to
the list of modified extents in the inode. However if we are low on
available memory and we can not allocate a new extent map, then we don't
treat it as an error and just set the full sync flag on the inode, so that
the next fsync does not rely on the list of modified extents - so checking
for the emptiness of the list to decide if the inode needs to be logged is
not reliable, and results in not logging the inode if it was not possible
to allocate the extent map for the hole.

Fix this by ensuring that if we are only logging that an inode exists
(inode item, names/references and xattrs), we don't update the inode's
last_log_commit even if it does not have the full sync runtime flag set.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 5.13+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22 15:49:42 +02:00
Linus Torvalds f02bf8578b for-5.14-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDsjSEACgkQxWXV+ddt
 WDtnZRAAieSXta8GaJYNF4cKs7xHttIkNl0ljJHsJsKoN5kCxW22RWsf8gAyToT3
 XERkJfRksgMH0Th3StJqTxg0fQTSiSi1bcz+wJjMVvQev2gX8dw7O05GLZT5GTzx
 zquI57+OGDEpQdEM6YzrUl+tYnO0roibI2LQeMWUXYXJTy6F75zWjBqKcTGcnfGc
 d8bOi6ijN4F148zIxvr6ahHrQN9WGwD5OWA1I5RqHBadgwCDWsQIdE6/N1Kdavf5
 uW785lJ8a4VqOWyM7Y0kp4madnF9rwZ/CFyoQFJ51oG/NrUf469+bCBFM8VOEwSa
 c3ZaqvF8CF3sndSAYiI4MEBFbM2O4hIVl/B9NkjDXDu3VlkRwwHDxZfadvc4BzsG
 kfisaw/GbOvOv8ojxBq4ux2nbRIVul096HpZH4UWHs/MCQ5Ct40OP5sG77YZKQgf
 o+D65V3NMn1gnp+B8wqyNnraY4hAoBePoK9f3IH+WXF5hlk6gWkbWxmXxCIPvJM4
 XTJUcNCXDZtKA9KRgOmcP9fZSu4gyD3hbDRgU5nKkLLSGE+mE4BRmtnq91VnT7FA
 5Nxlrjw9Na9LoyXYaoHcCksj207KU6WVgIjK4OFJarLMWlSDYBwAQCX0+voG+ZBq
 qa6BuLpq2aJhB6Q4M3MdAQSbhfR6tcI+HENCQlFHa6Je7oY9NVQ=
 =v00S
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs zoned mode fixes from David Sterba:

 - fix deadlock when allocating system chunk

 - fix wrong mutex unlock on an error path

 - fix extent map splitting for append operation

 - update and fix message reporting unusable chunk space

 - don't block when background zone reclaim runs with balance in
   parallel

* tag 'for-5.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
  btrfs: don't block if we can't acquire the reclaim lock
  btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
  btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
  btrfs: fix deadlock with concurrent chunk allocations involving system chunks
  btrfs: zoned: print unusable percentage when reclaiming block groups
  btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
2021-07-13 12:02:07 -07:00
Filipe Manana ea32af47f0 btrfs: zoned: fix wrong mutex unlock on failure to allocate log root tree
When syncing the log, if we fail to allocate the root node for the log
root tree:

1) We are unlocking fs_info->tree_log_mutex, but at this point we have
   not yet locked this mutex;

2) We have locked fs_info->tree_root->log_mutex, but we end up not
   unlocking it;

So fix this by unlocking fs_info->tree_root->log_mutex instead of
fs_info->tree_log_mutex.

Fixes: e75f9fd194 ("btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex")
CC: stable@vger.kernel.org # 5.13+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 18:27:44 +02:00
Johannes Thumshirn 9cc0b837e1 btrfs: don't block if we can't acquire the reclaim lock
If we can't acquire the reclaim_bgs_lock on block group reclaim, we
block until it is free. This can potentially stall for a long time.

While reclaim of block groups is necessary for a good user experience on
a zoned file system, there still is no need to block as it is best
effort only, just like when we're deleting unused block groups.

CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:57 +02:00
Naohiro Aota abb99cfdaf btrfs: properly split extent_map for REQ_OP_ZONE_APPEND
Damien reported a test failure with btrfs/209. The test itself ran fine,
but the fsck ran afterwards reported a corrupted filesystem.

The filesystem corruption happens because we're splitting an extent and
then writing the extent twice. We have to split the extent though, because
we're creating too large extents for a REQ_OP_ZONE_APPEND operation.

When dumping the extent tree, we can see two EXTENT_ITEMs at the same
start address but different lengths.

$ btrfs inspect dump-tree /dev/nullb1 -t extent
...
   item 19 key (269484032 EXTENT_ITEM 126976) itemoff 15470 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1
   item 20 key (269484032 EXTENT_ITEM 262144) itemoff 15417 itemsize 53
           refs 1 gen 7 flags DATA
           extent data backref root FS_TREE objectid 257 offset 786432 count 1

The duplicated EXTENT_ITEMs originally come from wrongly split extent_map in
extract_ordered_extent(). Since extract_ordered_extent() uses
create_io_em() to split an existing extent_map, we will have
split->orig_start != split->start. Then, it will be logged with non-zero
"extent data offset". Finally, the logged entries are replayed into
a duplicated EXTENT_ITEM.

Introduce and use proper splitting function for extent_map. The function is
intended to be simple and specific usage for extract_ordered_extent() e.g.
not supporting compression case (we do not allow splitting compressed
extent_map anyway).

There was a question raised by Qu, in summary why we want to split the
extent map (and not the bio):

The problem is not the limit on the zone end, which as you mention is
the same as the block group end. The problem is that data write use zone
append (ZA) operations. ZA BIOs cannot be split so a large extent may
need to be processed with multiple ZA BIOs, While that is also true for
regular writes, the major difference is that ZA are "nameless" write
operation giving back the written sectors on completion. And ZA
operations may be reordered by the block layer (not intentionally
though). Combine both of these characteristics and you can see that the
data for a large extent may end up being shuffled when written resulting
in data corruption and the impossibility to map the extent to some start
sector.

To avoid this problem, zoned btrfs uses the principle "one data extent
== one ZA BIO". So large extents need to be split. This is unfortunate,
but we can revisit this later and optimize, e.g. merge back together the
fragments of an extent once written if they actually were written
sequentially in the zone.

Reported-by: Damien Le Moal <damien.lemoal@wdc.com>
Fixes: d22002fd37 ("btrfs: zoned: split ordered extent when bio is sent")
CC: stable@vger.kernel.org # 5.12+
CC: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:45 +02:00
Filipe Manana 79bd37120b btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:41 +02:00
Filipe Manana 1cb3db1cf3 btrfs: fix deadlock with concurrent chunk allocations involving system chunks
When a task attempting to allocate a new chunk verifies that there is not
currently enough free space in the system space_info and there is another
task that allocated a new system chunk but it did not finish yet the
creation of the respective block group, it waits for that other task to
finish creating the block group. This is to avoid exhaustion of the system
chunk array in the superblock, which is limited, when we have a thundering
herd of tasks allocating new chunks. This problem was described and fixed
by commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations").

However there are two very similar scenarios where this can lead to a
deadlock:

1) Task B allocated a new system chunk and task A is waiting on task B
   to finish creation of the respective system block group. However before
   task B ends its transaction handle and finishes the creation of the
   system block group, it attempts to allocate another chunk (like a data
   chunk for an fallocate operation for a very large range). Task B will
   be unable to progress and allocate the new chunk, because task A set
   space_info->chunk_alloc to 1 and therefore it loops at
   btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
   and set space_info->chunk_alloc to 0, but task A is waiting on task B
   to finish creation of the new system block group, therefore resulting
   in a deadlock;

2) Task B allocated a new system chunk and task A is waiting on task B to
   finish creation of the respective system block group. By the time that
   task B enter the final phase of block group allocation, which happens
   at btrfs_create_pending_block_groups(), when it modifies the extent
   tree, the device tree or the chunk tree to insert the items for some
   new block group, it needs to allocate a new chunk, so it ends up at
   btrfs_chunk_alloc() and keeps looping there because task A has set
   space_info->chunk_alloc to 1, but task A is waiting for task B to
   finish creation of the new system block group and release the reserved
   system space, therefore resulting in a deadlock.

In short, the problem is if a task B needs to allocate a new chunk after
it previously allocated a new system chunk and if another task A is
currently waiting for task B to complete the allocation of the new system
chunk.

Unfortunately this deadlock scenario introduced by the previous fix for
the system chunk array exhaustion problem does not have a simple and short
fix, and requires a big change to rework the chunk allocation code so that
chunk btree updates are all made in the first phase of chunk allocation.
And since this deadlock regression is being frequently hit on zoned
filesystems and the system chunk array exhaustion problem is triggered
in more extreme cases (originally observed on PowerPC with a node size
of 64K when running the fallocate tests from stress-ng), revert the
changes from that commit. The next patch in the series, with a subject
of "btrfs: rework chunk allocation to avoid exhaustion of the system
chunk array" does the necessary changes to fix the system chunk array
exhaustion problem.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
Fixes: eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:40 +02:00
Johannes Thumshirn 5f93e776c6 btrfs: zoned: print unusable percentage when reclaiming block groups
When we're automatically reclaiming a zone, because its zone_unusable
value is above the reclaim threshold, we're only logging how much
percent of the zone's capacity are used, but not how much of the
capacity is unusable.

Also print the percentage of the unusable space in the block group
before we're reclaiming it.

Example:

  BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable

CC: stable@vger.kernel.org # 5.13
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:37 +02:00
David Sterba 54afaae34e btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work
The types in calculation of the used percentage in the reclaiming
messages are both u64, though bg->length is either 1GiB (non-zoned) or
the zone size in the zoned mode. The upper limit on zone size is 8GiB so
this could theoretically overflow in the future, right now the values
fit.

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.13
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:34 +02:00
Linus Torvalds d3acb15a3a Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "iov_iter cleanups and fixes.

  There are followups, but this is what had sat in -next this cycle. IMO
  the macro forest in there became much thinner and easier to follow..."

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
  clean up copy_mc_pipe_to_iter()
  pipe_zero(): we don't need no stinkin' kmap_atomic()...
  iov_iter: clean csum_and_copy_...() primitives up a bit
  copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
  copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
  iterate_xarray(): only of the first iteration we might get offset != 0
  pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
  iov_iter: make iterator callbacks use base and len instead of iovec
  iov_iter: make the amount already copied available to iterator callbacks
  iov_iter: get rid of separate bvec and xarray callbacks
  iov_iter: teach iterate_{bvec,xarray}() about possible short copies
  iterate_bvec(): expand bvec.h macro forest, massage a bit
  iov_iter: unify iterate_iovec and iterate_kvec
  iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
  iterate_and_advance(): get rid of magic in case when n is 0
  csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
  iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
  [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
  iov_iter_npages(): don't bother with iterate_all_kinds()
  ...
2021-07-03 11:30:04 -07:00
Linus Torvalds df668a5fe4 for-5.14/block-2021-06-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmDbXAwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpr0HEADDJaSgjpnWQwH1RVLNagJa9KnktxZYsEs+
 as3QmDdpKRG3rEC9bdE7FLe/xq3WBaO5j1hTQ9P6IguqLyS1Df72DtTlKyaCrZoe
 zv9eIlY4lZUfksE2nzWmlN9uG0FBVXeEQpHCLSNbUZeK1zvV6+NNhQqw2kc0sEqu
 hReUFeMUbsMcu/w5T3XMVJNsTMCql9wta2H0q5hONQyJQSrIwa1D+sUdE5I8fO4j
 bnoYX9yxHX26EztX1UJiGRgoq5Trz7LY7hAfljKSkewpFwiHE2vBdq2L0C2RKsIV
 tTs2DjMCMQyPNeA7WAG8HlR4aPG+7+/fuBP1KJHkykjWXglWN7OqISuBv6rrBgQs
 gNRnZ4qmb1CzD6aLEBk59nHt6po6eMxXIW856YktKy8rKcrgK29qP44Z+oomkPKo
 ZjQ0wqN5CvpObM/dIKxl9bAJ4zQDHBt49d5nTTQLfWl/mgevu6ZNWD/hONyCQmFy
 zKKqQ/wkxWHutOsjC5/MKNb3ZRNH9tt9X+HfULO2DU6IqqifYw/ex4z4MVsBopJC
 7pPfd81kgC73TgXe1AaCwHqNWsrqYCuTK0ew1CtGudlS3lucMwtap4GBiCgg5gbu
 M8pEgwO4OcCLHyRUc8zdfqI7HumbprbFmojPkwGSEe0ofVD74lMhzbUj5jvTYY2B
 t8D2XcgyOA==
 =lhon
 -----END PGP SIGNATURE-----

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Remove of ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
2021-06-30 12:12:56 -07:00
Nikolay Borisov 629e33a168 btrfs: remove unused btrfs_fs_info::total_pinned
This got added 14 years ago in 324ae4df00 ("Btrfs: Add block group
pinned accounting back") but it was not ever used. Subsequently its
usage got gradually removed in 8790d502e4 ("Btrfs: Add support for
mirroring across drives") and 11833d66be ("Btrfs: improve async block
group caching"). Let's remove it for good!

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 19:58:26 +02:00
Josef Bacik 138a12d865 btrfs: rip out btrfs_space_info::total_bytes_pinned
We used this in may_commit_transaction() in order to determine if we
needed to commit the transaction.  However we no longer have that logic
and thus have no use of this counter anymore, so delete it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:55:25 +02:00
Josef Bacik 3ffad6961d btrfs: rip the first_ticket_bytes logic from fail_all_tickets
This was a trick implemented to handle the case where we had a giant
reservation in front of a bunch of little reservations in the ticket
queue.  If the giant reservation was too large for the transaction
commit to make a difference we'd ENOSPC everybody out instead of
committing the transaction.  This logic was put in to force us to go
back and re-try the transaction commit logic to see if we could make
progress.

Instead now we know we've committed the transaction, so any space that
would have been recovered is now available, and would be caught by the
btrfs_try_granting_tickets() in this loop, so we no longer need this
code and can simply delete it.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:51:48 +02:00
Josef Bacik 0480855392 btrfs: remove FLUSH_DELAYED_REFS from data ENOSPC flushing
Since we unconditionally commit the transaction now we no longer need to
run the delayed refs to make sure our total_bytes_pinned value is
uptodate, we can simply commit the transaction.  Remove this stage from
the data flushing list.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:51:10 +02:00
Josef Bacik c416a30cdd btrfs: rip out may_commit_transaction
may_commit_transaction was introduced before the ticketing
infrastructure existed.  There was a problem where we'd legitimately be
out of space, but every reservation would trigger a transaction commit
and then fail.  Thus if you had 1000 things trying to make a
reservation, they'd all do the flushing loop and thus commit the
transaction 1000 times before they'd get their ENOSPC.

This helper was introduced to short circuit this, if there wasn't space
that could be reclaimed by committing the transaction then simply ENOSPC
out.  This made true ENOSPC tests much faster as we didn't waste a bunch
of time.

However many of our bugs over the years have been from cases where we
didn't account for some space that would be reclaimed by committing a
transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
miscalculations, etc.  And in the meantime the original problem has been
solved with ticketing.  We no longer will commit the transaction 1000
times.  Instead we'll get 1000 waiters, we will go through the flushing
mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
out.  The ticketing infrastructure gives us a deterministic way to see
if we're making progress or not, thus we avoid a lot of extra work.

So simplify this step by simply unconditionally committing the
transaction.  This removes what is arguably our most common source of
early ENOSPC bugs and will allow us to drastically simplify many of the
things we track because we simply won't need them with this stuff gone.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:50:45 +02:00
Filipe Manana 35b22c19af btrfs: send: fix crash when memory allocations trigger reclaim
When doing a send we don't expect the task to ever start a transaction
after the initial check that verifies if commit roots match the regular
roots. This is because after that we set current->journal_info with a
stub (special value) that signals we are in send context, so that we take
a read lock on an extent buffer when reading it from disk and verifying
it is valid (its generation matches the generation stored in the parent).
This stub was introduced in 2014 by commit a26e8c9f75 ("Btrfs: don't
clear uptodate if the eb is under IO") in order to fix a concurrency issue
between send and balance.

However there is one particular exception where we end up needing to start
a transaction and when this happens it results in a crash with a stack
trace like the following:

[60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
[60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
[60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
[60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
[60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
[60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
[60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
[60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
[60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
[60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
[60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
[60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
[60015.902414] kernel: Call Trace:
[60015.902419] kernel:  kfence_unprotect+0x13/0x30
[60015.902423] kernel:  page_fault_oops+0x89/0x270
[60015.902427] kernel:  ? search_module_extables+0xf/0x40
[60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
[60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
[60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
[60015.902440] kernel:  exc_page_fault+0x67/0x150
[60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
[60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
[60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
[60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
[60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
[60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
[60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
[60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
[60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
[60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
[60015.902476] kernel:  evict+0xd1/0x180
[60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
[60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
[60015.902487] kernel:  ? iput+0x1a0/0x1a0
[60015.902489] kernel:  ? iput+0x1a0/0x1a0
[60015.902491] kernel:  list_lru_walk_one+0x47/0x70
[60015.902495] kernel:  prune_icache_sb+0x39/0x50
[60015.902497] kernel:  super_cache_scan+0x161/0x1f0
[60015.902501] kernel:  do_shrink_slab+0x142/0x240
[60015.902505] kernel:  shrink_slab+0x164/0x280
[60015.902509] kernel:  shrink_node+0x2c8/0x6e0
[60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
[60015.902514] kernel:  try_to_free_pages+0xda/0x190
[60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
[60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
[60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
[60015.902528] kernel:  pipe_write+0x30b/0x5c0
[60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
[60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
[60015.902538] kernel:  __kernel_write+0x13a/0x2b0
[60015.902541] kernel:  kernel_write+0x73/0x150
[60015.902543] kernel:  send_cmd+0x7b/0xd0
[60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
[60015.902549] kernel:  process_extent+0x19b/0xed0
[60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
[60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
[60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
[60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
[60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
[60015.902564] kernel:  ? psi_task_change+0x84/0xc0
[60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
[60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
[60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
[60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
[60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
[60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
[60015.902588] kernel:  do_syscall_64+0x33/0x40
[60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
[60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
[60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
[60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
[60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
[60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
[60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
[60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---

This happens because when writing to the pipe, by calling kernel_write(),
we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
gfp flags, which allow reclaim to happen if there is memory pressure. This
allocation happens at fs/pipe.c:pipe_write().

If the reclaim is triggered, inode eviction can be triggered and that in
turn can result in starting a transaction if the inode has a link count
of 0. The transaction start happens early on during eviction, when we call
btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
there is currently an open file descriptor for an inode with a link count
of 0 and the reclaim task gets a reference on the inode before that
descriptor is closed, in which case the reclaim task ends up doing the
final iput that triggers the inode eviction.

When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
the following assertion at transaction.c:start_transaction():

    /* Send isn't supposed to start transactions. */
    ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);

And when assertions are not enabled, it triggers a crash since after that
assertion we cast current->journal_info into a transaction handle pointer
and then dereference it:

   if (current->journal_info) {
       WARN_ON(type & TRANS_EXTWRITERS);
       h = current->journal_info;
       refcount_inc(&h->use_count);
       (...)

Which obviously results in a crash due to an invalid memory access.

The same type of issue can happen during other memory allocations we
do directly in the send code with kmalloc (and friends) as they use
GFP_KERNEL and therefore may trigger reclaim too, which started to
happen since 2016 after commit e780b0d1c1 ("btrfs: send: use
GFP_KERNEL everywhere").

The issue could be solved by setting up a NOFS context for the entire
send operation so that reclaim could not be triggered when allocating
memory or pages through kernel_write(). However that is not very friendly
and we can in fact get rid of the send stub because:

1) The stub was introduced way back in 2014 by commit a26e8c9f75
   ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
   issue exclusive to when send and balance are running in parallel,
   however there were other problems between balance and send and we do
   not allow anymore to have balance and send run concurrently since
   commit 9e967495e0 ("Btrfs: prevent send failures and crashes due
   to concurrent relocation"). More generically the issues are between
   send and relocation, and that last commit eliminated only the
   possibility of having send and balance run concurrently, but shrinking
   a device also can trigger relocation, and on zoned filesystems we have
   relocation of partially used block groups triggered automatically as
   well. The previous patch that has a subject of:

   "btrfs: ensure relocation never runs while we have send operations running"

   Addresses all the remaining cases that can trigger relocation.

2) We can actually allow starting and even committing transactions while
   in a send context if needed because send is not holding any locks that
   would block the start or the commit of a transaction.

So get rid of all the logic added by commit a26e8c9f75 ("Btrfs: don't
clear uptodate if the eb is under IO"). We can now always call
clear_extent_buffer_uptodate() at verify_parent_transid() since send is
the only case that uses commit roots without having a transaction open or
without holding the commit_root_sem.

Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-22 14:11:58 +02:00