Commit graph

283 commits

Author SHA1 Message Date
Linus Torvalds 7a7fd0de4a Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull kmap conversion updates from David Sterba:
 "This contains changes regarding kmap API use and eg conversion from
  kmap_atomic to kmap_local_page.

  The API belongs to memory management but to save cross-tree
  dependency headaches we've agreed to take it through the btrfs tree
  because there are some trivial conversions possible, while the rest
  will need some time and getting the easy cases out of the way would be
  convenient.

  The changes can be grouped:

   - function exports, new helpers

   - new VM_BUG_ON for additional verification; it's been discussed if
     it should be VM_BUG_ON or BUG_ON, the former was chosen due to
     performance reasons

   - code replaced by relevant helpers"

[ This is an updated version of a request that originally came in during
  the merge window, but I asked for some updates:

    https://lore.kernel.org/lkml/cover.1614090658.git.dsterba@suse.com/

  which is why this got merge after the merge window closed.  - Linus ]

* 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use copy_highpage() instead of 2 kmaps()
  btrfs: use memcpy_[to|from]_page() and kmap_local_page()
  mm/highmem: Add VM_BUG_ON() to mem*_page() calls
  mm/highmem: Introduce memcpy_page(), memmove_page(), and memset_page()
  mm/highmem: Convert memcpy_[to|from]_page() to kmap_local_page()
  mm/highmem: Lift memcpy_[to|from]_page to core
2021-03-01 11:24:18 -08:00
Ira Weiny 3590ec5899 btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs.

This pattern was lifted to the core common functions
memcpy_[to|from]_page().

Use these new functions to reduce the code, eliminate direct uses of
kmap, and leverage the new core functions use of kmap_local_page().

Also, there is 1 place where a kmap/memcpy is followed by an
optional memset.  Here we leave the kmap open coded to avoid remapping
the page but use kmap_local_page() directly.

Development of this patch was aided by the coccinelle script:

// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
//
// NOTE: Offsets and other expressions may be more complex than what the script
// will automatically generate.  Therefore a catchall rule is provided to find
// the pattern which then must be evaluated by hand.
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
// Comments:
// Options:

//
// simple memcpy version
//
@ memcpy_rule1 @
expression page, T, F, B, Off;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
-memcpy(ptr + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(ptr, F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, ptr + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, ptr, B);
+memcpy_from_page(T, page, 0, B);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule1
@
identifier memcpy_rule1.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

//
// Some callers kmap without a temp pointer
//
@ memcpy_rule2 @
expression page, T, Off, F, B;
@@

<+...
(
-memcpy(kmap(page) + Off, F, B);
+memcpy_to_page(page, Off, F, B);
|
-memcpy(kmap(page), F, B);
+memcpy_to_page(page, 0, F, B);
|
-memcpy(T, kmap(page) + Off, B);
+memcpy_from_page(T, page, Off, B);
|
-memcpy(T, kmap(page), B);
+memcpy_from_page(T, page, 0, B);
)
...+>
-kunmap(page);
// No need for the ptr variable removal

//
// Catch all
//
@ memcpy_rule3 @
expression page;
expression GenTo, GenFrom, GenSize;
identifier ptr;
type VP;
@@

(
-VP ptr = kmap(page);
|
-ptr = kmap(page);
|
-VP ptr = kmap_atomic(page);
|
-ptr = kmap_atomic(page);
)
<+...
(
//
// Some call sites have complex expressions within the memcpy
// match a catch all to be evaluated by hand.
//
-memcpy(GenTo, GenFrom, GenSize);
+memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
+memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
)
...+>
(
-kunmap(page);
|
-kunmap_atomic(ptr);
)

// Remove any pointers left unused
@
depends on memcpy_rule3
@
identifier memcpy_rule3.ptr;
type VP, VP1;
@@

-VP ptr;
	... when != ptr;
? VP1 ptr;

// <smpl>

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-26 12:45:15 +01:00
Qu Wenruo 04d4ba4c90 btrfs: make check_compressed_csum() to be subpage compatible
Currently check_compressed_csum() completely relies on sectorsize ==
PAGE_SIZE to do checksum verification for compressed extents.

To make it subpage compatible, this patch will:
- Do extra calculation for the csum range
  Since we have multiple sectors inside a page, we need to only hash
  the range we want, not the full page anymore.

- Do sector-by-sector hash inside the page

With this patch and previous conversion on
btrfs_submit_compressed_read(), now we can read subpage compressed
extents properly, and do proper csum verification.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:27 +01:00
Qu Wenruo be6a13613f btrfs: make btrfs_submit_compressed_read() subpage compatible
For compressed read, we always submit page read using page size.  This
doesn't work well with subpage, as for subpage one page can contain
several sectors.  Such submission will read range out of what we want,
and cause problems.

Thankfully to make it subpage compatible, we only need to change how the
last page of the compressed extent is read.

Instead of always adding a full page to the compressed read bio, if we're
at the last page, calculate the size using compressed length, so that we
only add part of the range into the compressed read bio.

Since we are here, also change the PAGE_SIZE used in
lookup_extent_mapping() to sectorsize.
This modification won't cause any functional change, as
lookup_extent_mapping() can handle the case where the search range is
larger than found extent range.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-22 17:15:25 +01:00
Qu Wenruo 32443de338 btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...

This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.

This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:

- We have page->mapping set
  page->mapping->host is used to grab btrfs_fs_info, thus we can only
  call this function after page is mapped to an inode.

  One call site attaches pages to inode manually, thus we have to modify
  the timing of set_page_extent_mapped() a bit.

- As soon as possible, before other operations
  Since memory allocation can fail, we have to do extra error handling.
  Calling set_page_extent_mapped() as soon as possible can simply the
  error handling for several call sites.

The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.

Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).

So we will stick to btrfs specific bitmap for now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08 22:59:03 +01:00
Qu Wenruo 6275193ef1 btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:

- Remove the @file_offset parameter
  There are two factors making the @file_offset parameter useless:

  * For csum lookup in csum tree, file offset makes no sense
    We only need disk_bytenr, which is unrelated to file_offset

  * page_offset (file offset) of each bvec is not contiguous.
    Pages can be added to the same bio as long as their on-disk bytenr
    is contiguous, meaning we could have pages at different file offsets
    in the same bio.

  Thus passing file_offset makes no sense any more.
  The only user of file_offset is for data reloc inode, we will use
  a new function, search_file_offset_in_bio(), to handle it.

- Extract the csum tree lookup into search_csum_tree()
  The new function will handle the csum search in csum tree.
  The return value is the same as btrfs_find_ordered_sum(), returning
  the number of found sectors which have checksum.

- Change how we do the main loop
  The only needed info from bio is:
  * the on-disk bytenr
  * the length

  After extracting the above info, we can do the search without bio
  at all, which makes the main loop much simpler:

	for (cur_disk_bytenr = orig_disk_bytenr;
	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
	     cur_disk_bytenr += count * sectorsize) {

		/* Lookup csum tree */
		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
					 search_len, csum_dst);
		if (!count) {
			/* Csum hole handling */
		}
	}

- Use single variable as the source to calculate all other offsets
  Instead of all different type of variables, we use only one main
  variable, cur_disk_bytenr, which represents the current disk bytenr.

  All involved values can be calculated from that variable, and
  all those variable will only be visible in the inner loop.

The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup.  Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:11 +01:00
David Sterba 1201b58b67 btrfs: drop casts of bio bi_sector
Since commit 72deb455b5 ("block: remove CONFIG_LBDAF") (5.2) the
sector_t type is u64 on all arches and configs so we don't need to
typecast it.  It used to be unsigned long and the result of sector size
shifts were not guaranteed to fit in the type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-09 19:16:05 +01:00
David Sterba 713cebfb98 btrfs: remove unnecessary local variables for checksum size
Remove local variable that is then used just once and replace it with
fs_info::csum_size.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:54:00 +01:00
David Sterba 223486c27b btrfs: switch cached fs_info::csum_size from u16 to u32
The fs_info value is 32bit, switch also the local u16 variables. This
leads to a better assembly code generated due to movzwl.

This simple change will shave some bytes on x86_64 and release config:

   text    data     bss     dec     hex filename
1090000   17980   14912 1122892  11224c pre/btrfs.ko
1089794   17980   14912 1122686  11217e post/btrfs.ko

DELTA: -206

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
David Sterba 55fc29bed8 btrfs: use cached value of fs_info::csum_size everywhere
btrfs_get_16 shows up in the system performance profiles (helper to read
16bit values from on-disk structures). This is partially because of the
checksum size that's frequently read along with data reads/writes, other
u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from
fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:59 +01:00
Josef Bacik 42437a6386 btrfs: introduce mount option rescue=ignorebadroots
In the face of extent root corruption, or any other core fs wide root
corruption we will fail to mount the file system.  This makes recovery
kind of a pain, because you need to fall back to userspace tools to
scrape off data.  Instead provide a mechanism to gracefully handle bad
roots, so we can at least mount read-only and possibly recover data from
the file system.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:41 +01:00
Josef Bacik 334c16d82c btrfs: push the NODATASUM check into btrfs_lookup_bio_sums
When we move to being able to handle NULL csum_roots it'll be cleaner to
just check in btrfs_lookup_bio_sums instead of at all of the caller
locations, so push the NODATASUM check into it as well so it's unified.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-12-08 15:53:39 +01:00
David Sterba cb4c919830 btrfs: compression: move declarations to header
The declarations of compression algorithm callbacks are defined in the
.c file as they're used from there. Compiler warns that there are no
declarations for public functions when compiling lzo.c/zlib.c/zstd.c.
Fix that by moving the declarations to the header as it's the common
place for all of them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-07 12:06:55 +02:00
Nikolay Borisov 93c4c033ec btrfs: remove fail label in check_compressed_csum
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:42 +02:00
Nikolay Borisov 5a9472fe7f btrfs: increment corrupt device counter during compressed read
If a compressed read fails due to checksum error only a line is printed
to dmesg, device corrupt counter is not modified.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:40 +02:00
Nikolay Borisov 26056eab4b btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_read
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read
before any bio submission is performed. Since that function is always
called with a valid bio it renders the ASSERT unnecessary.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:40 +02:00
Nikolay Borisov c7ee1819dc btrfs: make btrfs_submit_compressed_write take btrfs_inode
Majority of its uses are for btrfs_inode so take it as an argument
directly.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:31 +02:00
Nikolay Borisov bd242a08a6 btrfs: make btrfs_csum_one_bio takae btrfs_inode
Will enable converting btrfs_submit_compressed_write to btrfs_inode more
easily.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:26 +02:00
Anand Jain adbab6420c btrfs: unexport btrfs_compress_set_level()
btrfs_compress_set_level() can be static function in the file
compression.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:37 +02:00
Eric Biggers fd08001f17 btrfs: use crypto_shash_digest() instead of open coding
Use crypto_shash_digest() instead of crypto_shash_init() +
crypto_shash_update() + crypto_shash_final().  This is more efficient.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-05-25 11:25:27 +02:00
Mikhail Zaslonko 3fd396afc0 btrfs: use larger zlib buffer for s390 hardware compression
In order to benefit from s390 zlib hardware compression support,
increase the btrfs zlib workspace buffer size from 1 to 4 pages (if s390
zlib hardware support is enabled on the machine).

This brings up to 60% better performance in hardware on s390 compared to
the PAGE_SIZE buffer and much more compared to the software zlib
processing in btrfs.  In case of memory pressure, fall back to a single
page buffer during workspace allocation.

The data compressed with larger input buffers will still conform to zlib
standard and thus can be decompressed also on a systems that uses only
PAGE_SIZE buffer for btrfs zlib.

Link: http://lkml.kernel.org/r/20200108105103.29028-1-zaslonko@linux.ibm.com
Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Richard Purdie <rpurdie@rpsys.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eduard Shishkin <edward6@linux.ibm.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:40 -08:00
Omar Sandoval db72e47f79 btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums()
We can encode this in the offset parameter: -1 means use the page
offsets, anything else is a valid offset.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 16:40:54 +01:00
Omar Sandoval e62958fce9 btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers
Currently, we have two wrappers for __btrfs_lookup_bio_sums():
btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and
btrfs_lookup_bio_sums(), which is used everywhere else. The only
difference is that the _dio variant looks up csums starting at the given
offset instead of using the page index, which isn't actually direct
I/O-specific. Let's clean up the signature and return value of
__btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get
rid of the trivial helpers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 16:40:53 +01:00
Dennis Zhou 46bcff2bfc btrfs: fix compressed write bio blkcg attribution
Bio attribution is handled at bio_set_dev() as once we have a device, we
have a corresponding request_queue and then can derive the current css.
In special cases, we want to attribute to bio to someone else. This can
be done by calling bio_associate_blkg_from_css() or
kthread_associate_blkcg() depending on the scenario. Btrfs does this for
compressed writeback as they are handled by kworkers, so the latter can
be done here.

Commit 1a41802701 ("btrfs: drop bio_set_dev where not needed") removes
early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
above assumption that we'll have a request_queue when we are doing
association. To fix this, switch to using kthread_associate_blkcg().

Without this, we crash in btrfs/024:

  [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
  [ 3052.107013] #PF: supervisor read access in kernel mode
  [ 3052.107014] #PF: error_code(0x0000) - not-present page
  [ 3052.107015] PGD 0 P4D 0
  [ 3052.107021] Oops: 0000 [#1] SMP
  [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
  [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
  [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
  [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
  [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
  [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
  [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
  [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
  [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
  [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
  [ 3052.293367] FS:  0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
  [ 3052.293368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
  [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 3052.293372] PKRU: 55555554
  [ 3052.293376] Call Trace:
  [ 3052.402552]  btrfs_submit_compressed_write+0x137/0x390
  [ 3052.402558]  submit_compressed_extents+0x40f/0x4c0
  [ 3052.422401]  btrfs_work_helper+0x246/0x5a0
  [ 3052.422408]  process_one_work+0x200/0x570
  [ 3052.438601]  ? process_one_work+0x180/0x570
  [ 3052.438605]  worker_thread+0x4c/0x3e0
  [ 3052.438614]  kthread+0x103/0x140
  [ 3052.460735]  ? process_one_work+0x570/0x570
  [ 3052.460737]  ? kthread_mod_delayed_work+0xc0/0xc0
  [ 3052.460744]  ret_from_fork+0x24/0x30

Fixes: 1a41802701 ("btrfs: drop bio_set_dev where not needed")
Reported-by: Chris Murphy <chris@colorremedies.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30 16:07:19 +01:00
Dennis Zhou 7b62e66cbb btrfs: punt all bios created in btrfs_submit_compressed_write()
Compressed writes happen in the background via kworkers. However, this
causes bios to be attributed to root bypassing any cgroup limits from
the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
punt the bio to an appropriate cgroup specific workqueue and attribute
the IO properly. However, if btrfs_submit_compressed_write() creates a
new bio, we don't tag it the same way. Add the appropriate tagging for
subsequent bios.

Fixes: ec39f7696c ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-30 16:07:16 +01:00
David Sterba 1a41802701 btrfs: drop bio_set_dev where not needed
bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.

Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.

In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.

Where do we know which bdev to set:

* for regular IO: submit_stripe_bio that's called by btrfs_map_bio

* repair IO: repair_io_failure, read or write from specific device

* super block write (using buffer_heads but uses raw bdev) and barriers

* scrub: this does not use all regular IO paths as it needs to reach all
  copies, verify and fixup eventually, and for that all bdev management
  is independent

* raid56: rbio_add_io_page, for the RMW write

* integrity-checker: does it's own low-level block tracking

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 23:39:30 +01:00
David Sterba 0cf2521313 btrfs: compression: remove ops pointer from workspace_manager
We can infer the ops from the type that is now passed to all functions
that would need it, this makes workspace_manager::ops redundant and can
be removed.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba 1e00235160 btrfs: compression: inline free_workspace
Replace indirect calls to free_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba a3bbd2a9ee btrfs: compression: pass type to btrfs_put_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for free_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:59 +01:00
David Sterba c778df1406 btrfs: compression: inline alloc_workspace
Replace indirect calls to alloc_workspace by switch and calls to the
specific callbacks. This is mainly to get rid of the indirection due to
spectre vulnerability mitigations.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba 5907a9bb13 btrfs: compression: pass type to btrfs_get_workspace
We can infer the workspace_manager from type and the type will be used
in the following patch to call a common helper for alloc_workspace.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba bd3a5287cc btrfs: compression: inline put_workspace
Similar to get_workspace, majority of the callbacks is trivial, we don't
gain anything by the indirection, so replace them by a switch function.
Trivial callback implementations use the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba 6a0d12724b btrfs: compression: inline get_workspace
Majority of the callbacks is trivial, we don't gain anything by the
indirection, so replace them by a switch function.

ZLIB needs to adjust level in the callback and ZSTD workspace management
is complex, the rest is call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba d20f395f98 btrfs: compression: export alloc/free/get/put callbacks of all algos
The indirect calls will be replaced by a switch in compression.c.
(Switch is faster than indirect calls with when Spectre mitigations are
enabled).

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:58 +01:00
David Sterba 2510307e6c btrfs: compression: inline cleanup_workspace_manager
Replace loop calling to all algos with a list of direct calls to the
cleanup manager callback. When that becomes trivial it is replaced by
direct call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba 2dba714390 btrfs: compression: let workspace manager cleanup take only the type
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager cleanup helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba d551703347 btrfs: compression: inline init_workspace_manager
Replace loop calling to all algos with a list of direct calls to the
init manager callback. When that becomes trivial it is replaced by
direct call to the helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba 975db48330 btrfs: compression: let workspace manager init take only the type
With the access to the workspace structures, we can look it up together
with the compression ops inside the workspace manager init helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba be95104531 btrfs: compression: attach workspace manager to the ops
There's a lot of indirection when the generic code calls into
algo-specific callbacks to reach the private workspace manager structure
and back to the generic code.

To simplify that, export the workspace manager for heuristic, LZO and
ZLIB, while ZSTD is going to use it's own manager.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba 1e4eb74654 btrfs: switch compression callbacks to direct calls
The indirect calls bring some overhead due to spectre vulnerability
mitigations. The number of cases is small and below the threshold
(10-20) where indirect call would be better.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:57 +01:00
David Sterba c4bf665a31 btrfs: export compression and decompression callbacks
Export compress_pages, decompress_bio and decompress callbacks for all
compression algos. The indirect calls will be replaced by a switch.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:56 +01:00
Chengguang Xu ce96b7ffd1 btrfs: use better definition of number of compression type
The compression type upper limit constant is the same as the last value
and this is confusing.  In order to keep coding style consistent, use
BTRFS_NR_COMPRESS_TYPES as the total number that follows the idom of
'NR' being one more than the last value.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:55 +01:00
Chris Mason ec39f7696c Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios
Async CRCs and compression submit IO through helper threads, which means
they have IO priority inversions when cgroup IO controllers are in use.

This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.

For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.

For the async CRCs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:53 +01:00
Chris Mason 08635bae0b Btrfs: stop using btrfs_schedule_bio()
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call.  This has been used to make sure async crc and
compression helpers don't get stuck on IO submission.  To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.

All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
using btrfs_schedule_bio().  We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 12:46:52 +01:00
David Sterba 602cbe91fb btrfs: move cond_wake_up functions out of ctree
The file ctree.h serves as a header for everything and has become quite
bloated. Split some helpers that are generic and create a new file that
should be the catch-all for code that's not btrfs-specific.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:15 +02:00
David Sterba b0c1fe1eaf btrfs: compression: replace set_level callbacks by a common helper
The set_level callbacks do not do anything special and can be replaced
by a helper that uses the levels defined in the tables.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 14:59:11 +02:00
David Sterba e749af443f btrfs: lift bio_set_dev from bio allocation helpers
The block device is passed around for the only purpose to set it in new
bios. Move the assignment one level up. This is a preparatory patch for
further bdev cleanups.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:51 +02:00
Johannes Thumshirn aa53e3bfac btrfs: correctly validate compression type
Nikolay reported the following KASAN splat when running btrfs/048:

[ 1843.470920] ==================================================================
[ 1843.471971] BUG: KASAN: slab-out-of-bounds in strncmp+0x66/0xb0
[ 1843.472775] Read of size 1 at addr ffff888111e369e2 by task btrfs/3979

[ 1843.473904] CPU: 3 PID: 3979 Comm: btrfs Not tainted 5.2.0-rc3-default #536
[ 1843.475009] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1843.476322] Call Trace:
[ 1843.476674]  dump_stack+0x7c/0xbb
[ 1843.477132]  ? strncmp+0x66/0xb0
[ 1843.477587]  print_address_description+0x114/0x320
[ 1843.478256]  ? strncmp+0x66/0xb0
[ 1843.478740]  ? strncmp+0x66/0xb0
[ 1843.479185]  __kasan_report+0x14e/0x192
[ 1843.479759]  ? strncmp+0x66/0xb0
[ 1843.480209]  kasan_report+0xe/0x20
[ 1843.480679]  strncmp+0x66/0xb0
[ 1843.481105]  prop_compression_validate+0x24/0x70
[ 1843.481798]  btrfs_xattr_handler_set_prop+0x65/0x160
[ 1843.482509]  __vfs_setxattr+0x71/0x90
[ 1843.483012]  __vfs_setxattr_noperm+0x84/0x130
[ 1843.483606]  vfs_setxattr+0xac/0xb0
[ 1843.484085]  setxattr+0x18c/0x230
[ 1843.484546]  ? vfs_setxattr+0xb0/0xb0
[ 1843.485048]  ? __mod_node_page_state+0x1f/0xa0
[ 1843.485672]  ? _raw_spin_unlock+0x24/0x40
[ 1843.486233]  ? __handle_mm_fault+0x988/0x1290
[ 1843.486823]  ? lock_acquire+0xb4/0x1e0
[ 1843.487330]  ? lock_acquire+0xb4/0x1e0
[ 1843.487842]  ? mnt_want_write_file+0x3c/0x80
[ 1843.488442]  ? debug_lockdep_rcu_enabled+0x22/0x40
[ 1843.489089]  ? rcu_sync_lockdep_assert+0xe/0x70
[ 1843.489707]  ? __sb_start_write+0x158/0x200
[ 1843.490278]  ? mnt_want_write_file+0x3c/0x80
[ 1843.490855]  ? __mnt_want_write+0x98/0xe0
[ 1843.491397]  __x64_sys_fsetxattr+0xba/0xe0
[ 1843.492201]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1843.493201]  do_syscall_64+0x6c/0x230
[ 1843.493988]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1843.495041] RIP: 0033:0x7fa7a8a7707a
[ 1843.495819] Code: 48 8b 0d 21 de 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 be 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee dd 2b 00 f7 d8 64 89 01 48
[ 1843.499203] RSP: 002b:00007ffcb73bca38 EFLAGS: 00000202 ORIG_RAX: 00000000000000be
[ 1843.500210] RAX: ffffffffffffffda RBX: 00007ffcb73bda9d RCX: 00007fa7a8a7707a
[ 1843.501170] RDX: 00007ffcb73bda9d RSI: 00000000006dc050 RDI: 0000000000000003
[ 1843.502152] RBP: 00000000006dc050 R08: 0000000000000000 R09: 0000000000000000
[ 1843.503109] R10: 0000000000000002 R11: 0000000000000202 R12: 00007ffcb73bda91
[ 1843.504055] R13: 0000000000000003 R14: 00007ffcb73bda82 R15: ffffffffffffffff

[ 1843.505268] Allocated by task 3979:
[ 1843.505771]  save_stack+0x19/0x80
[ 1843.506211]  __kasan_kmalloc.constprop.5+0xa0/0xd0
[ 1843.506836]  setxattr+0xeb/0x230
[ 1843.507264]  __x64_sys_fsetxattr+0xba/0xe0
[ 1843.507886]  do_syscall_64+0x6c/0x230
[ 1843.508429]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[ 1843.509558] Freed by task 0:
[ 1843.510188] (stack is not available)

[ 1843.511309] The buggy address belongs to the object at ffff888111e369e0
                which belongs to the cache kmalloc-8 of size 8
[ 1843.514095] The buggy address is located 2 bytes inside of
                8-byte region [ffff888111e369e0, ffff888111e369e8)
[ 1843.516524] The buggy address belongs to the page:
[ 1843.517561] page:ffff88813f478d80 refcount:1 mapcount:0 mapping:ffff88811940c300 index:0xffff888111e373b8 compound_mapcount: 0
[ 1843.519993] flags: 0x4404000010200(slab|head)
[ 1843.520951] raw: 0004404000010200 ffff88813f48b008 ffff888119403d50 ffff88811940c300
[ 1843.522616] raw: ffff888111e373b8 000000000016000f 00000001ffffffff 0000000000000000
[ 1843.524281] page dumped because: kasan: bad access detected

[ 1843.525936] Memory state around the buggy address:
[ 1843.526975]  ffff888111e36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.528479]  ffff888111e36900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.530138] >ffff888111e36980: fc fc fc fc fc fc fc fc fc fc fc fc 02 fc fc fc
[ 1843.531877]                                                        ^
[ 1843.533287]  ffff888111e36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.534874]  ffff888111e36a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 1843.536468] ==================================================================

This is caused by supplying a too short compression value ('lz') in the
test-case and comparing it to 'lzo' with strncmp() and a length of 3.
strncmp() read past the 'lz' when looking for the 'o' and thus caused an
out-of-bounds read.

Introduce a new check 'btrfs_compress_is_valid_type()' which not only
checks the user-supplied value against known compression types, but also
employs checks for too short values.

Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes: 272e5326c7 ("btrfs: prop: fix vanished compression property after failed set")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02 12:30:48 +02:00
Johannes Thumshirn ea41d6b278 btrfs: remove assumption about csum type form btrfs_print_data_csum_error()
btrfs_print_data_csum_error() still assumed checksums to be 32 bit in
size.  Make it size agnostic.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Johannes Thumshirn d5178578bc btrfs: directly call into crypto framework for checksumming
Currently btrfs_csum_data() relied on the crc32c() wrapper around the
crypto framework for calculating the CRCs.

As we have our own crypto_shash structure in the fs_info now, we can
directly call into the crypto framework without going trough the wrapper.

This way we can even remove the btrfs_csum_data() and btrfs_csum_final()
wrappers.

The module dependency on crc32c is preserved via MODULE_SOFTDEP("pre:
crc32c"), which was previously provided by LIBCRC32C config option doing
the same.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:02 +02:00
Johannes Thumshirn 10fe6ca80d btrfs: don't assume compressed_bio sums to be 4 bytes
BTRFS has the implicit assumption that a checksum in compressed_bio is 4
bytes. While this is true for CRC32C, it is not for any other checksum.

Change the data type to be a byte array and adjust loop index calculation
accordingly.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:01 +02:00
Johannes Thumshirn 1e25a2e3ca btrfs: don't assume ordered sums to be 4 bytes
BTRFS has the implicit assumption that a checksum in btrfs_orderd_sums
is 4 bytes. While this is true for CRC32C, it is not for any other
checksum.

Change the data type to be a byte array and adjust loop index
calculation accordingly.

This includes moving the adjustment of 'index' by 'ins_size' in
btrfs_csum_file_blocks() before dividing 'ins_size' by the checksum
size, because before this patch the 'sums' member of 'struct
btrfs_ordered_sum' was 4 Bytes in size and afterwards it is only one
byte.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-01 13:35:00 +02:00
Linus Torvalds f49aa1de98 for-5.2-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlzi2E8ACgkQxWXV+ddt
 WDuwdg/9Gil8uC28r7HLk1DkMdUZp6qHPXC2D79iN63XOIyxtTv2Y/ZDOneHheTa
 NaW9DOe6PUWoVyrYRCM/BhRxouZp0cFlpMG1m8ABdaO3uSCzwlc9wHs7YPNOwiGJ
 DM3qikX4V8w0ECoY3Z9NzbHLGTi9INzgkuazWGQnplK1ZA7CHe4RLH1r442daTAO
 iFr+bhjODmwHyebXlK66dcOGw7HXp4ac+iyZnlivNcTipGtOTdA7kryZLaNmfepz
 JfMESxGMrLhdrd/YxeaDEVYRAh1ZSD57/WGrQDeRQ54qD2ELXmoPX0rAtquwoziS
 F1PSitiW0DzYGjS+KCKP9553tlEtJ5Md45k0AibK4h/aqCPy6s6khK/PfsHQT5K+
 lD0CqwB4zr9zOhS0n1uFRlNomzK4UZ2SPDtB4KMpCCEQLlvwJIkUqb3Bx6JZgAEH
 FPFEZGVX/Xyqv6w/VASHHhhoAGRJ/mIx+mU/RGVU+jFVBzwd0EmlCymFDMF2z44K
 8HZz7ib4fMvArR5S2uEz/h85JM7EzDG7YkPluzERiQy86Abi79QQl8qWfC7yBGYd
 K3g6VQM/H6NUprXqTNQ/NU7Zvrq5HPXC+NhrLvC+Ul0DlwLAwxRj8NeYImUuDDpi
 Du49hJcV0U2kWocvwdP+600y48UroioJHlqKtqlng3NKxdjUGxw=
 =qN6T
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Notable highlights:

   - fixes for some long-standing bugs in fsync that were quite hard to
     catch but now finaly fixed

   - some fixups to error handling paths that did not properly clean up
     (locking, memory)

   - fix to space reservation for inheriting properties"

* tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: tree-checker: detect file extent items with overlapping ranges
  Btrfs: fix race between ranged fsync and writeback of adjacent ranges
  Btrfs: avoid fallback to transaction commit during fsync of files with holes
  btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes
  btrfs: sysfs: don't leak memory when failing add fsid
  btrfs: sysfs: Fix error path kobject memory leak
  Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path
  btrfs: use the existing reserved items for our first prop for inheritance
  btrfs: don't double unlock on error in btrfs_punch_hole
  btrfs: Check the compression level before getting a workspace
2019-05-20 09:52:35 -07:00
Linus Torvalds 67a2422239 for-5.2/block-20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
 UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
 hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
 x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
 VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
 qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
 SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
 TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
 GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
 Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
 RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
 RcCuPeLAVg==
 =Ot0g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in this series, just fixes and improvements all over the
  map. This contains:

   - Series of fixes for sed-opal (David, Jonas)

   - Fixes and performance tweaks for BFQ (via Paolo)

   - Set of fixes for bcache (via Coly)

   - Set of fixes for md (via Song)

   - Enabling multi-page for passthrough requests (Ming)

   - Queue release fix series (Ming)

   - Device notification improvements (Martin)

   - Propagate underlying device rotational status in loop (Holger)

   - Removal of mtip32xx trim support, which has been disabled for years
     (Christoph)

   - Improvement and cleanup of nvme command handling (Christoph)

   - Add block SPDX tags (Christoph)

   - Cleanup/hardening of bio/bvec iteration (Christoph)

   - A few NVMe pull requests (Christoph)

   - Removal of CONFIG_LBDAF (Christoph)

   - Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
  block: fix mismerge in bvec_advance
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  blk-mq: always free hctx after request queue is freed
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  block: fix function name in comment
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  ...
2019-05-07 18:14:36 -07:00
Johnny Chang 2b90883c56 btrfs: Check the compression level before getting a workspace
When a file's compression property is set as zlib or zstd but leave
the compression mount option not be set, that means btrfs will try
to compress the file with default compression level. But in
btrfs_compress_pages(), it calls get_workspace() with level = 0.
This will return a workspace with a wrong compression level.
For zlib, the compression level in the workspace will be 0
(that means "store only"). And for zstd, the compression in the
workspace will be 1, not the default level 3.

How to reproduce:
  mkfs -t btrfs /dev/sdb
  mount /dev/sdb /mnt/
  mkdir /mnt/zlib
  btrfs property set /mnt/zlib/ compression zlib
  dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
  sync
  btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M

btrfs-debugfs output:
* before:
  ...
  (258 9961472): ram 524288 disk 1106247680 disk_size 524288
  file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00

* after:
 ...
 (258 10354688): ram 131072 disk 14217216 disk_size 4096
 file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00

The steps for zstd are similar, but need to put a debugging message to
show the level of the return workspace in zstd_get_workspace().

This commit adds a check of the compression level before getting a
workspace by set_level().

CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Johnny Chang <johnnyc@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-03 18:21:25 +02:00
Christoph Hellwig 2b070cfe58 block: remove the i argument to bio_for_each_segment_all
We only have two callers that need the integer loop iterator, and they
can easily maintain it themselves.

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-30 09:26:13 -06:00
Nikolay Borisov 6a8d2136ca btrfs: Use less confusing condition for uptodate parameter to btrfs_writepage_endio_finish_ordered
The uptodate parameter of btrfs_writepage_endio_finish_ordered is used
to signal whether an error has occured while writing the given page.
0 signals an error, which is propagated to callees and 1 signifies
success. In end_compressed_bio_write the ->bi_status is checked and
based on it either BLK_STS_OK (0) or BLK_STS_NOTSUPP (1) are used. While
from functional point of view this is ok it's a for the poor reader of
the code, since the block layer values are conflated with the semantics
of the parameter.

Just use plain 0 or 1. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29 19:02:24 +02:00
Linus Torvalds 80201fe175 for-5.1/block-20190302
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
 yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
 HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
 ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
 QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
 HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
 fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
 fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
 qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
 DRVVhYpocw==
 =VBaU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "Not a huge amount of changes in this round, the biggest one is that we
  finally have Mings multi-page bvec support merged. Apart from that,
  this pull request contains:

   - Small series that avoids quiescing the queue for sysfs changes that
     match what we currently have (Aleksei)

   - Series of bcache fixes (via Coly)

   - Series of lightnvm fixes (via Mathias)

   - NVMe pull request from Christoph. Nothing major, just SPDX/license
     cleanups, RR mp policy (Hannes), and little fixes (Bart,
     Chaitanya).

   - BFQ series (Paolo)

   - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
     for the fast path (Jianchao)

   - fops->iopoll() added for async IO polling, this is a feature that
     the upcoming io_uring interface will use (Christoph, me)

   - Partition scan loop fixes (Dongli)

   - mtip32xx conversion from managed resource API (Christoph)

   - cdrom registration race fix (Guenter)

   - MD pull from Song, two minor fixes.

   - Various documentation fixes (Marcos)

   - Multi-page bvec feature. This brings a lot of nice improvements
     with it, like more efficient splitting, larger IOs can be supported
     without growing the bvec table size, and so on. (Ming)

   - Various little fixes to core and drivers"

* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
  block: fix updating bio's front segment size
  block: Replace function name in string with __func__
  nbd: propagate genlmsg_reply return code
  floppy: remove set but not used variable 'q'
  null_blk: fix checking for REQ_FUA
  block: fix NULL pointer dereference in register_disk
  fs: fix guard_bio_eod to check for real EOD errors
  blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
  block: optimize bvec iteration in bvec_iter_advance
  block: introduce mp_bvec_for_each_page() for iterating over page
  block: optimize blk_bio_segment_split for single-page bvec
  block: optimize __blk_segment_map_sg() for single-page bvec
  block: introduce bvec_nth_page()
  iomap: wire up the iopoll method
  block: add bio_set_polled() helper
  block: wire up block device iopoll method
  fs: add an iopoll method to struct file_operations
  loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
  loop: do not print warn message if partition scan is successful
  block: bounce: make sure that bvec table is updated
  ...
2019-03-08 14:12:17 -08:00
Dennis Zhou d0ab62ce2d btrfs: change set_level() to bound the level passed in
Currently, the only user of set_level() is zlib which sets an internal
workspace parameter. As level is now plumbed into get_workspace(), this
can be handled there rather than separately.

This repurposes set_level() to bound the level passed in so it can be
used when setting the mounts compression level and as well as verifying
the level before getting a workspace. The other benefit is this divides
the meaning of compress(0) and get_workspace(0). The former means we
want to use the default compression level of the compression type. The
latter means we can use any workspace available.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 7bf4994304 btrfs: plumb level through the compression interface
Zlib compression supports multiple levels, but doesn't require changing
in how a workspace itself is created and managed. Zstd introduces a
different memory requirement such that higher levels of compression
require more memory.

This requires changes in how the alloc()/get() methods work for zstd.
This pach plumbs compression level through the interface as a parameter
in preparation for zstd compression levels.  This gives the compression
types opportunity to create/manage based on the compression level.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 92ee553036 btrfs: move to function pointers for get/put workspaces
The previous patch added generic helpers for get_workspace() and
put_workspace(). Now, we can migrate ownership of the workspace_manager
to be in the compression type code as the compression code itself
doesn't care beyond being able to get a workspace. The init/cleanup and
get/put methods are abstracted so each compression algorithm can decide
how they want to manage their workspaces.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 929f4baf93 btrfs: add compression interface in (get/put)_workspace
There are two levels of workspace management. First, alloc()/free()
which are responsible for actually creating and destroy workspaces.
Second, at a higher level, get()/put() which is the compression code
asking for a workspace from a workspace_manager.

The compression code shouldn't really care how it gets a workspace, but
that it got a workspace. This adds get_workspace() and put_workspace()
to be the higher level interface which is responsible for indexing into
the appropriate compression type. It also introduces
btrfs_put_workspace() and btrfs_get_workspace() to be the generic
implementations of the higher interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou 1666edabc8 btrfs: add helper methods for workspace manager init and cleanup
Workspace manager init and cleanup code is open coded inside a for loop
over the compression types. This forces each compression type to rely on
the same workspace manager implementation. This patch creates helper
methods that will be the generic implementation for btrfs workspace
management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou 10b94a51ca btrfs: unify compression ops with workspace_manager
Make the workspace_manager own the interface operations rather than
managing index-paired arrays for the workspace_manager and compression
operations.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou ca4ac360af btrfs: manage heuristic workspace as index 0
While the heuristic workspaces aren't really compression workspaces,
they use the same interface for managing them. So rather than branching,
let's just handle them once again as the index 0 compression type.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou acce85de12 btrfs: rename workspaces_list to workspace_manager
This is in preparation for zstd compression levels. As each level will
require different size of workspace, workspaces_list is no longer a
really fitting name.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:30 +01:00
Dennis Zhou 1972708a89 btrfs: add helpers for compression type and level
It is very easy to miss places that rely on a certain bitshifting for
decoding the type_level overloading. Add helpers to do this instead.

Cc: Omar Sandoval <osandov@osandov.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:30 +01:00
Ming Lei 6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Andrea Gelmini 52042d8e82 btrfs: Fix typos in comments and strings
The typos accumulate over time so once in a while time they get fixed in
a large patch.

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:50 +01:00
Johannes Thumshirn fdb1e12180 btrfs: use PAGE_ALIGNED instead of open-coding it
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.

Convert all open-coded occurrences of PAGE_ALIGNED().

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
Johannes Thumshirn 7073017aeb btrfs: use offset_in_page instead of open-coding it
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.

So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:45 +01:00
Nikolay Borisov da12fe5414 btrfs: Refactor btrfs_merge_bio_hook
This function really checks whether adding more data to the bio will
straddle a stripe/chunk. So first let's give it a more appropraite name
- btrfs_bio_fits_in_stripe. Secondly, the offset parameter was never
used to just remove it. Thirdly, pages are submitted to either btree or
data inodes so it's guaranteed that tree->ops is set so replace the
check with an ASSERT. Finally, document the parameters of the function.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:42 +01:00
Nikolay Borisov c629732d24 btrfs: Remove unused extent_state argument from btrfs_writepage_endio_finish_ordered
This parameter was never used, yet was part of the interface of the
function ever since its introduction as extent_io_ops::writepage_end_io_hook
in e6dcd2dc9c ("Btrfs: New data=ordered implementation"). Now that
NULL is passed everywhere as a value for this parameter let's remove it
for good. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:32 +01:00
Nikolay Borisov 7087a9d8db btrfs: Remove extent_io_ops::writepage_end_io_hook
This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures.  No
functional changes.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17 14:51:28 +01:00
Linus Torvalds dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Matthew Wilcox 0a943c65e7 btrfs: Convert page cache to XArray
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: David Sterba <dsterba@suse.com>
2018-10-21 10:46:41 -04:00
Colin Ian King 29c5e5d496 btrfs: remove unused pointer 'tree' in btrfs_submit_compressed_read
Pointer 'tree' is being assigned but is never used hence it is redundant
and can be removed. This is a leftover from cleanup patch
00032d38ea ("btrfs: drop extent_io_ops::merge_bio_hook
callback").

Cleans up clang warning:

  warning: variable 'tree' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Matthew Wilcox 3159f943aa xarray: Replace exceptional entries
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-29 22:47:49 -04:00
David Sterba 00032d38ea btrfs: drop extent_io_ops::merge_bio_hook callback
The data and metadata callback implementation both use the same
function. We can remove the call indirection completely.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:56 +02:00
David Sterba ebcc326316 btrfs: open-code bio_set_op_attrs
The helper is trivial and marked as deprecated.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:44 +02:00
David Sterba d7f663fa3f btrfs: prune unused includes
Remove includes if none of the interfaces and exports is used in the
given source file.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-06 13:12:43 +02:00
David Sterba 093258e6eb btrfs: replace waitqueue_actvie with cond_wake_up
Use the wrappers and reduce the amount of low-level details about the
waitqueue management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28 18:23:09 +02:00
Linus Torvalds e37563bb6c for-4.17-part2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrTQ4sACgkQxWXV+ddt
 WDti3Q/+MAeqsLTjvre2RQ3ka5hNyCuVftUIBmcP3YfJbt+xZYQyaewW4Xkfi3cm
 cbJE+zehzf5ag+RJhxk3OwFTvNfLGIO9asWs3b08NGUi6VzwL0/8B/iOdZPuHSAV
 TrecQIBE2Tp+xax9cQEnxav34D4dUtXNaDweGjp1MIIUkDneQP/I0vlTu7vafBgX
 UVxP6riL/MCs7sjTHGIPs0lv8L/fgdmo+dk5SnNuIPTOcFTQXgVrtHjw9IvbKWd4
 aq+sbNWoSrhXUfllbFg/wZqDe9tWn9E2f6m/H0ThSoNdxusSVgacOjFRYh20NKLW
 WGB8Amd/ItGtJwJ1CIypa7VX2U11UAi0XT7BeiK82rUNEJ6moRqFOXG861gRLoTZ
 SpH8uWO+e+CogfXob1KCndn5lot4AM2ZTkCqfrjpM35Nul72PZdne0CxNlmiRupY
 Fdt5GB+sg8plcMaRiYr++BbbHP5tggX1MrhLGEbx2XBs2eRdn+2Lv2I1Ig/U4NUb
 Vf+xk/tFLKGOTSZlbv7SV7ekXxG/3+7gAuL7A1XMETZCwBF4L3hwyW7CgkEBb3PC
 TqX8BwMaRpyp/FgW/QL6edjXZ3a64VaHIfqRPNks3lWWcCHVbzyiVPfEjx+AEJ/0
 abx6jXTJhLUBPuPxEgb5rsSv62RoxPoYHqCErrG95XwnZ8rSCts=
 =I0y0
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull more btrfs updates from David Sterba:
 "We have queued a few more fixes (error handling, log replay,
  softlockup) and the rest is SPDX updates that touche almost all files
  so the diffstat is long"

* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Only check first key for committed tree blocks
  btrfs: add SPDX header to Kconfig
  btrfs: replace GPL boilerplate by SPDX -- sources
  btrfs: replace GPL boilerplate by SPDX -- headers
  Btrfs: fix loss of prealloc extents past i_size after fsync log replay
  Btrfs: clean up resources during umount after trans is aborted
  btrfs: Fix possible softlock on single core machines
  Btrfs: bail out on error during replay_dir_deletes
  Btrfs: fix NULL pointer dereference in log_dir_items
2018-04-15 18:08:35 -07:00
David Sterba c1d7c514f7 btrfs: replace GPL boilerplate by SPDX -- sources
Remove GPL boilerplate text (long, short, one-line) and keep the rest,
ie. personal, company or original source copyright statements. Add the
SPDX header.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12 16:29:51 +02:00
Matthew Wilcox b93b016313 page cache: use xa_lock
Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
since we don't really care that it's a tree.

[willy@infradead.org: fix nds32, fs/dax.c]
  Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.orgLink: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11 10:28:39 -07:00
David Sterba e67c718b5b btrfs: add more __cold annotations
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26 15:09:39 +02:00
Linus Torvalds 31466f3ed7 for-4.16-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlpvikQACgkQxWXV+ddt
 WDs6qA//ZE7eEH0sKpD4Z+3gUevk/MMXwE9prRijEdjXz/K/UXtvpq0sI7HMQskZ
 Ls9Wmzof+3WEQoa08RQZFzwuclW1Udm09SqE2oHP2gXQB5rC0BtWdrlMaKUJy03y
 NUwxHetbE6TsFLU5HIVmi05NexNx9SVV6oJTWt00RlXTePw9Aoc88ikoXXUE2vqH
 wbH9/ccmM9EkDFxdG+YG5QX054kQV8/5RXdqBJnIiGVRX5ZsAY84AN9x9YoRCVUw
 wq9TfPu6XmeA6Uq6knpeLlXDms5w+FE3n5CduROk7Q7YNgpoZekF20c8uK8HzT4T
 KF8hc0QpQgRCVBJ8I4MbPSMRIDf3IWfZmWSDEDda/6/ep6Bl99b8PFvdDKDBMUct
 8wsgGrwGbHuz2l2QUIXjpBL9Cv9Tbu8vjmg0h2hFrpiH1c8JaXjKtJXAMtigWsZ1
 DdX+5Y0zqvV/YLpzKF4aMDWXIteN4qaznvjdmj3B7BxgcnITOV/cmPCyMplNrtUa
 Cs2fzGV5tpxhBzxE490v+frMULmLq2W1e6WPfFCqPKBCqulcR75TozDQS9M2/h4k
 uAZzVKoguHrUPP1ONVas9aVC05K473nbYPd28eecYwkZ4z32hK4SAdnQY1aPreTe
 axoV7p7a+i1bkzT5LK6gIfpddVWth8w45nz4P0lwxp0Z6XlkbJE=
 =Irul
 -----END PGP SIGNATURE-----

Merge tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Features or user visible changes:

   - fallocate: implement zero range mode

   - avoid losing data raid profile when deleting a device

   - tree item checker: more checks for directory items and xattrs

  Notable fixes:

   - raid56 recovery: don't use cached stripes, that could be
     potentially changed and a later RMW or recovery would lead to
     corruptions or failures

   - let raid56 try harder to rebuild damaged data, reading from all
     stripes if necessary

   - fix scrub to repair raid56 in a similar way as in the case above

  Other:

   - cleanups: device freeing, removed some call indirections, redundant
     bio_put/_get, unused parameters, refactorings and renames

   - RCU list traversal fixups

   - simplify mount callchain, remove recursing back when mounting a
     subvolume

   - plug for fsync, may improve bio merging on multiple devices

   - compression heurisic: replace heap sort with radix sort, gains some
     performance

   - add extent map selftests, buffered write vs dio"

* tag 'for-4.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (155 commits)
  btrfs: drop devid as device_list_add() arg
  btrfs: get device pointer from device_list_add()
  btrfs: set the total_devices in device_list_add()
  btrfs: move pr_info into device_list_add
  btrfs: make btrfs_free_stale_devices() to match the path
  btrfs: rename btrfs_free_stale_devices() arg to skip_dev
  btrfs: make btrfs_free_stale_devices() argument optional
  btrfs: make btrfs_free_stale_device() to iterate all stales
  btrfs: no need to check for btrfs_fs_devices::seeding
  btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it
  Btrfs: noinline merge_extent_mapping
  Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mapping
  Btrfs: extent map selftest: dio write vs dio read
  Btrfs: extent map selftest: buffered write vs dio read
  Btrfs: add extent map selftests
  Btrfs: move extent map specific code to extent_map.c
  Btrfs: add helper for em merge logic
  Btrfs: fix unexpected EEXIST from btrfs_get_extent
  Btrfs: fix incorrect block_len in merge_extent_mapping
  btrfs: Remove unused readahead spinlock
  ...
2018-01-29 14:04:23 -08:00
Nikolay Borisov 32506af595 btrfs: Remove redundant bio_get/set calls in compressed read/write paths
bio_get/set is necessary only if the bio is going to be referenced
following submissions. In the code paths where such calls are made
we don't really need them since the bio is referenced only if
btrfs_map_bio returns an error. And this function can return an error
prior to submission only. So referencing the bio is safe. Furthermore
we do call bio_endio which will consume the last reference. So let's
remove the redundant calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba 36243c9199 btrfs: heuristic: call get4bits directly
As it's a single instance and local to the file, we don't need to pass
it as an argument.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba 7add17befc btrfs: heuristic: open code copy_call callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba 23ae8c63aa btrfs: heuristic: open code get_num callback of radix sort
The callback is trivial and we don't need the abstraction for our
purposes. Let's open code it and also make the array types explicit.

Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:19 +01:00
David Sterba e128f9c3f7 btrfs: compression: add helper for type to string conversion
There are several places opencoding this conversion, add a helper now
that we have 3 compression algorithms.

Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:16 +01:00
Timofey Titovets 440c840cb4 Btrfs: compression heuristic: replace heap sort with radix sort
Slowest part of heuristic for now is kernel heap sort()
It's can take up to 55% of runtime on sorting bucket items.

As sorting will always call on most data sets to get correctly
byte_core_set_size, the only way to speed up heuristic, is to
speed up sort on bucket.

Add a general radix_sort function.
Radix sort require 2 buffers, one full size of input array
and one for store counters (jump addresses).

That increase usage per heuristic workspace +1KiB
8KiB + 1KiB -> 8KiB + 2KiB

That is LSD Radix, i use 4 bit as a base for calculating,
to make counters array acceptable small (16 elements * 8 byte).

That Radix sort implementation have several points to adjust,
I added him to make radix sort general usable in kernel,
like heap sort, if needed.

Performance tested in userspace copy of heuristic code,
throughput:
    - average <-> random data: ~3500 MiB/s - heap  sort
    - average <-> random data: ~6000 MiB/s - radix sort

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
[ coding style fixes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22 16:08:15 +01:00
Ming Lei c45a8f2def fs: convert to bio_last_bvec_all()
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei 263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Qu Wenruo eae8d82529 btrfs: Fix wild memory access in compression level parser
[BUG]
Kernel panic when mounting with "-o compress" mount option.
KASAN will report like:
------
==================================================================
BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
Read of size 1 at addr d86735fce994f800 by task mount/662
...
Call Trace:
 dump_stack+0xe3/0x175
 kasan_report+0x163/0x370
 __asan_load1+0x47/0x50
 strncmp+0x31/0xc0
 btrfs_compress_str2level+0x20/0x70 [btrfs]
 btrfs_parse_options+0xff4/0x1870 [btrfs]
 open_ctree+0x2679/0x49f0 [btrfs]
 btrfs_mount+0x1b7f/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 vfs_kern_mount+0x13/0x20
 btrfs_mount+0x31e/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 do_mount+0xaad/0x1a00
 SyS_mount+0x98/0xe0
 entry_SYSCALL_64_fastpath+0x1f/0xbe
------

[Cause]
For 'compress' and 'compress_force' options, its token doesn't expect
any parameter so its args[0] contains uninitialized data.
Accessing args[0] will cause above wild memory access.

[Fix]
For Opt_compress and Opt_compress_force, set compression level to
the default.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ set the default in advance ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 17:01:11 +01:00
Liu Bo f82b735936 Btrfs: add write_flags for compression bio
Compression code path has only flaged bios with REQ_OP_WRITE no matter
where the bios come from, but it could be a sync write if fsync starts
this writeback or a normal writeback write if wb kthread starts a
periodic writeback.

It breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttlling.

This passes writeback_control to compression write path so that it can
send bios with proper flags to block layer.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:31 +01:00
Timofey Titovets 19562430c6 Btrfs: heuristic: add Shannon entropy calculation
Byte distribution check in heuristic will filter edge data cases and
some time fail to classify input data.

Let's fix that by adding Shannon entropy calculation, that will cover
classification of most other data types.

As Shannon entropy needs log2 with some precision to work, let's use
ilog2(N) and for increased precision, by do ilog2(pow(N, 4)).

Shannon entropy has been slightly changed to avoid signed numbers and
division.

The calculation is direct by the formula, successor of precalculated
table or chains of if-else.

The accuracy errors of ilog2 are compensated by

@ENTROPY_LVL_ACEPTABLE 70 -> 65
@ENTROPY_LVL_HIGH      85 -> 80

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets 858177d38d Btrfs: heuristic: add byte core set calculation
Calculate byte core set for data sample:
- sort buckets' numbers in decreasing order
- count how many values cover 90% of the sample

If the core set size is low (<=25%), data are easily compressible.
If the core set size is high (>=80%), data are not compressible.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00
Timofey Titovets a288e92cac Btrfs: heuristic: add byte set calculation
Calculate byte set size for data sample:
- calculate how many unique bytes have been in the sample
- for all bytes count > 0, check if we're still in the low count range
  (~25%), such data are easily compressible, otherwise furhter analysis
  is needed

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:36 +01:00