Commit graph

7122 commits

Author SHA1 Message Date
Mike Christie c6adada5b7 dm: Start pr_preempt from the same starting path
pr_preempt has a similar issue as reserve where for all the
reservation types except the All Registrants ones the preempt can
create a reservation. And a follow up reservation or release needs to
go down the same path the preempt did. This has the pr_preempt work
like reserve and release where we always start from the first path in
the first group.

This commit has been tested with windows failover clustering's
validation test and libiscsi's PGR tests to check for regressions.
They both don't have tests to verify this case, so I tested it
manually.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie 08a3c338e0 dm: Fix PR release handling for non All Registrants
This commit fixes a bug where we are leaving the reservation in place
even though pr_release has run and returned success.

If we have a Write Exclusive, Exclusive Access, or Write/Exclusive
Registrants only reservation, the release must be sent down the path
that is the reservation holder. The problem is multipath_prepare_ioctl
most likely selected path N for the reservation, then later when we do
the release multipath_prepare_ioctl will select a completely different
path. The device will then return success becuase the nvme and scsi
specs say to return success if there is no reservation or if the
release is sent down from a path that is not the holder. We then think
we have released the reservation.

This commit has us loop over each path and send a release so we can
make sure the release is executed on the correct path. It has been
tested with windows failover clustering's validation test which checks
this case, and it has been tested manually (the libiscsi PGR tests
don't have a test case for this yet, but I will be adding one).

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie 7015108759 dm: Start pr_reserve from the same starting path
When an app does a pr_reserve it will go to whatever path we happen to
be using at the time. This can result in errors when the app does a
second pr_reserve call and expects success but gets a failure because
the reserve is not done on the holder's path. This commit has us
always start trying to do reserves from the first path in the first
group.

Windows failover clustering will produce the type of pattern above.
With this commit, we will now pass its validation test for this case.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Christie 8dd87f3c52 dm: Allow dm_call_pr to be used for path searches
The specs state that if you send a reserve down a path that is already
the holder success must be returned and if it goes down a path that
is not the holder reservation conflict must be returned. Windows
failover clustering will send a second reservation and expects that a
device returns success. The problem for multipathing is that for an
All Registrants reservation, we can send the reserve down any path but
for all other reservation types there is one path that is the holder.

To handle this we could add PR state to dm but that can get nasty.
Look at target_core_pr.c for an example of the type of things we'd
have to track. It will also get more complicated because other
initiators can change the state so we will have to add in async
event/sense handling.

This commit, and the 3 commits that follow, tries to keep dm simple
and keep just doing passthrough. This commit modifies dm_call_pr to be
able to find the first usable path that can execute our pr_op then
return. When dm_pr_reserve is converted to dm_call_pr in the next
commit for the normal case we will use the same path for every
reserve.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Mike Snitzer e120a5f1e7 dm: return early from dm_pr_call() if DM device is suspended
Otherwise PR ops may be issued while the broader DM device is being
reconfigured, etc.

Fixes: 9c72bad1f3 ("dm: call PR reserve/unreserve on each underlying device")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-28 17:29:56 -04:00
Luo Meng 3534e5a5ed dm thin: fix use-after-free crash in dm_sm_register_threshold_callback
Fault inject on pool metadata device reports:
  BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
  Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950

  CPU: 7 PID: 950 Comm: dmsetup Tainted: G        W         5.19.0-rc6 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x34/0x44
   print_address_description.constprop.0.cold+0xeb/0x3f4
   kasan_report.cold+0xe6/0x147
   dm_pool_register_metadata_threshold+0x40/0x80
   pool_ctr+0xa0a/0x1150
   dm_table_add_target+0x2c8/0x640
   table_load+0x1fd/0x430
   ctl_ioctl+0x2c4/0x5a0
   dm_ctl_ioctl+0xa/0x10
   __x64_sys_ioctl+0xb3/0xd0
   do_syscall_64+0x35/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This can be easily reproduced using:
  echo offline > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
  dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"

If a metadata commit fails, the transaction will be aborted and the
metadata space maps will be destroyed. If a DM table reload then
happens for this failed thin-pool, a use-after-free will occur in
dm_sm_register_threshold_callback (called from
dm_pool_register_metadata_threshold).

Fix this by in dm_pool_register_metadata_threshold() by returning the
-EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr()
with a new error message: "Error registering metadata threshold".

Fixes: ac8c3f3df6 ("dm thin: generate event when metadata threshold passed")
Cc: stable@vger.kernel.org
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-15 18:09:14 -04:00
Mikulas Patocka 2ee73ef60d dm writecache: count number of blocks discarded, not number of discard bios
Change dm-writecache, so that it counts the number of blocks discarded
instead of the number of discard bios. Make it consistent with the
read and write statistics counters that were changed to count the
number of blocks instead of bios.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:54:46 -04:00
Mikulas Patocka b2676e1482 dm writecache: count number of blocks written, not number of write bios
Change dm-writecache, so that it counts the number of blocks written
instead of the number of write bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:53:25 -04:00
Mikulas Patocka 2c6e755b49 dm writecache: count number of blocks read, not number of read bios
Change dm-writecache, so that it counts the number of blocks read
instead of the number of read bios. Bios can be split and requeued
using the dm_accept_partial_bio function, so counting bios caused
inaccurate results.

Fixes: e3a35d0340 ("dm writecache: add event counters")
Reported-by: Yu Kuai <yukuai1@huaweicloud.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:32 -04:00
Mikulas Patocka 9bc0c92e4b dm writecache: return void from functions
The functions writecache_map_remap_origin and writecache_bio_copy_ssd
only return a single value, thus they can be made to return void.

This helps simplify the following IO accounting changes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Mikulas Patocka 949d49ec30 dm kcopyd: use __GFP_HIGHMEM when allocating pages
dm-kcopyd doesn't access the allocated pages directly, it only passes
them to dm-io which adds them to a bio list - thus, we can allocate
the pages from high memory. This will reduce pressure on the low
memory when there are a large number of kcopyd jobs in progress.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Mikulas Patocka ca7dc242e3 dm writecache: set a default MAX_WRITEBACK_JOBS
dm-writecache has the capability to limit the number of writeback jobs
in progress. However, this feature was off by default. As such there
were some out-of-memory crashes observed when lowering the low
watermark while the cache is full.

This commit enables writeback limit by default. It is set to 256MiB or
1/16 of total system memory, whichever is smaller.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-14 15:52:31 -04:00
Zhang Jiaming 962c6296f0 dm snapshot: fix typo in snapshot_map() comment
Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:39 -04:00
Jiang Jian ce92fc4b8b dm raid: remove redundant "the" in parse_raid_params() comment
Signed-off-by: Jiang Jian <jiangjian@cdjrlc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:38 -04:00
Steven Lung 5c29e78473 dm cache: fix typo in 2 comment blocks
Replace neccessarily with necessarily.

Signed-off-by: Steven Lung <1030steven@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:37 -04:00
JeongHyeon Lee 20e6fc8562 dm verity: fix checkpatch close brace error
Resolves: ERROR: else should follow close brace '}'

Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:36 -04:00
Mike Snitzer 899ab445a4 dm table: rename dm_target variable in dm_table_add_target()
Rename from "tgt" to "ti" so that all of dm-table.c code uses the same
naming for dm_target variables.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:35 -04:00
Mike Snitzer 564b5c5476 dm table: audit all dm_table_get_target() callers
All callers of dm_table_get_target() are expected to do proper bounds
checking on the index they pass.

Move dm_table_get_target() to dm-core.h to make it extra clear that only
DM core code should be using it. Switch it to be inlined while at it.

Standardize all DM core callers to use the same for loop pattern and
make associated variables as local as possible. Rename some variables
(e.g. s/table/t/ and s/tgt/ti/) along the way.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:34 -04:00
Mike Snitzer 2aec377a29 dm table: remove dm_table_get_num_targets() wrapper
More efficient and readable to just access table->num_targets directly.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:33 -04:00
Ming Lei 8b211aaccb dm: add two stage requeue mechanism
Commit 61b6e2e532 ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io
represents split bio") reverted DM core's bio splitting back to using
bio_split()+bio_chain() because it was found that otherwise DM's
BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio
completion that would never occur.

Restore using bio_trim()+bio_inc_remaining(), like was done in commit
7dd76d1fee ("dm: improve bio splitting and associated IO
accounting"), but this time with proper handling for the above
scenario that is covered in more detail in the commit header for
61b6e2e532.

Solve this issue by adding a two staged dm_io requeue mechanism that
uses the new dm_bio_rewind() via dm_io_rewind():

1) requeue the dm_io into the requeue_list added to struct
   mapped_device, and schedule it via new added requeue work. This
   workqueue just clones the dm_io->orig_bio (which DM saves and
   ensures its end sector isn't modified). dm_io_rewind() uses the
   sectors and sectors_offset members of the dm_io that are recorded
   relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are
   then used to make that cloned bio reflect the subset of the
   original bio that is represented by the dm_io that is being
   requeued.

2) the 2nd stage requeue is same with original requeue, but
   io->orig_bio points to new cloned bio (which matches the requeued
   dm_io as described above).

This allows DM core to shift the need for bio cloning from bio-split
time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
handling (after IO completes with that error).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:49:32 -04:00
Ming Lei 61cbe7888d dm: add dm_bio_rewind() API to DM core
Commit 7759eb23fd ("block: remove bio_rewind_iter()") removed
a similar API for the following reasons:
    ```
    It is pointed that bio_rewind_iter() is one very bad API[1]:

    1) bio size may not be restored after rewinding

    2) it causes some bogus change, such as 5151842b9d (block: reset
    bi_iter.bi_done after splitting bio)

    3) rewinding really makes things complicated wrt. bio splitting

    4) unnecessary updating of .bi_done in fast path

    [1] https://marc.info/?t=153549924200005&r=1&w=2

    So this patch takes Kent's suggestion to restore one bio into its original
    state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
    given now bio_rewind_iter() is only used by bio integrity code.
    ```
However, saving off a copy of the 32 bytes bio->bi_iter in case rewind
needed isn't efficient because it bloats per-bio-data for what is an
unlikely case. That suggestion also ignores the need to restore
crypto and integrity info.

Add dm_bio_rewind() API for a specific use-case that is much more narrow
than the previous more generic rewind code that was reverted:

1) most bios have a fixed end sector since bio split is done from front
   of the bio, if driver just records how many sectors between current
   bio's start sector and the original bio's end sector, the original
   position can be restored. Keeping the original bio's end sector
   fixed is a _hard_ requirement for this interface!

2) if a bio's end sector won't change (usually bio_trim() isn't
   called, or in the case of DM it preserves original bio), user can
   restore the original position by storing sector offset from the
   current ->bi_iter.bi_sector to bio's end sector; together with
   saving bio size, only 8 bytes is needed to restore to original
   bio.

3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core
   needs to restore to an "original bio" which represents the current
   dm_io to be requeued (which may be a subset of the original bio).
   By storing the sector offset from the original bio's end sector and
   dm_io's size, dm_bio_rewind() can restore such original bio. See
   commit 7dd76d1fee ("dm: improve bio splitting and associated IO
   accounting") for more details on how DM does this. Leveraging this,
   allows DM core to shift the need for bio cloning from bio-split
   time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
   handling (after IO completes with that error).

4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done
   to bvec_iter and there is no effect on the fast path.

Implement dm_bio_rewind() by factoring out clear helpers that it calls:
dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter.

DM is able to ensure that dm_bio_rewind() is used safely but, given
the constraint that the bio's end must never change, other
hypothetical future callers may not take the same care. So make
dm_bio_rewind() and all supporting code local to DM to avoid risk of
hypothetical abuse. A "dm_" prefix was added to all functions to avoid
any namespace collisions.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-07-07 11:26:55 -04:00
Ming Lei 444fe04f7a dm: improve BLK_STS_DM_REQUEUE and BLK_STS_AGAIN handling
If either BLK_STS_DM_REQUEUE or BLK_STS_AGAIN is returned for POLLED
io, we requeue the original bio into deferred list and kick md->wq to
re-submit it to block layer.

Improve the handling in the following way:

1) Factor out dm_handle_requeue() for handling dm_io requeue.

2) Unify handling for BLK_STS_DM_REQUEUE and BLK_STS_AGAIN: clear
   REQ_POLLED for BLK_STS_DM_REQUEUE too, for the sake of simplicity,
   given BLK_STS_DM_REQUEUE is very unusual.

3) Queue md->wq explicitly in dm_handle_requeue(), so requeue handling
   becomes more robust.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 14:17:16 -04:00
Christoph Hellwig e810cb78bc dm: refactor dm_md_mempool allocation
The current split between dm_table_alloc_md_mempools and
dm_alloc_md_mempools is rather arbitrary, so merge the two
into one easy to follow function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 12:46:06 -04:00
Christoph Hellwig 4ed045d875 dm: unexport dm_get_reserved_rq_based_ios
dm_get_reserved_rq_based_ios is only used in the core dm code, so
remove the export.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-29 12:46:05 -04:00
Christoph Hellwig 8b9ab62662 block: remove blk_cleanup_disk
blk_cleanup_disk is nothing but a trivial wrapper for put_disk now,
so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28 06:33:15 -06:00
Christoph Hellwig c39493222e dm: open code blk_max_size_offset in max_io_len
max_io_len always passes an explicitly non-zero chunk_sectors into
blk_max_size_offset.  That means much of blk_max_size_offset is not
needed and can be open coded to simplify the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220614090934.570632-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-27 06:29:11 -06:00
Mikulas Patocka 90736eb323 dm mirror log: clear log bits up to BITS_PER_LONG boundary
Commit 85e123c27d ("dm mirror log: round up region bitmap size to
BITS_PER_LONG") introduced a regression on 64-bit architectures in the
lvm testsuite tests: lvcreate-mirror, mirror-names and vgsplit-operation.

If the device is shrunk, we need to clear log bits beyond the end of the
device. The code clears bits up to a 32-bit boundary and then calculates
lc->sync_count by summing set bits up to a 64-bit boundary (the commit
changed that; previously, this boundary was 32-bit too). So, it was using
some non-zeroed bits in the calculation and this caused misbehavior.

Fix this regression by clearing bits up to BITS_PER_LONG boundary.

Fixes: 85e123c27d ("dm mirror log: round up region bitmap size to BITS_PER_LONG")
Cc: stable@vger.kernel.org
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-23 14:55:43 -04:00
Ming Lei 61b6e2e532 dm: fix BLK_STS_DM_REQUEUE handling when dm_io represents split bio
Commit 7dd76d1fee ("dm: improve bio splitting and associated IO
accounting") removed using cloned bio when dm io splitting is needed.
Using bio_trim()+bio_inc_remaining() rather than bio_split()+bio_chain()
causes multiple dm_io instances to share the same original bio, and it
works fine if IOs are completed successfully.

But a regression was caused for the case when BLK_STS_DM_REQUEUE is
returned from any one of DM's cloned bios (whose dm_io share the same
orig_bio). In this BLK_STS_DM_REQUEUE case only the mapped subset of
the original bio for the current exact dm_io needs to be re-submitted.
However, since the original bio is shared among all dm_io instances,
the ->orig_bio actually only represents the last dm_io instance, so
requeue can't work as expected. Also when more than one dm_io is
requeued, the same original bio is requeued from all dm_io's
completion handler, then race is caused.

Fix this issue by still allocating one clone bio for completing io
only, then io accounting can rely on ->orig_bio being unmodified. This
is needed because the dm_io's sector_offset and sectors members are
recorded relative to an unmodified ->orig_bio.

In the future, we can go back to using bio_trim()+bio_inc_remaining()
for dm's io splitting but then delay needing a bio clone only when
handling BLK_STS_DM_REQUEUE, but that approach is a bit complicated
(so it needs a development cycle):
1) bio clone needs to be done in task context
2) a block interface for unwinding bio is required

Fixes: 7dd76d1fee ("dm: improve bio splitting and associated IO accounting")
Reported-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-23 14:33:13 -04:00
Mike Snitzer 78ccef9123 dm: do not return early from dm_io_complete if BLK_STS_AGAIN without polling
Commit 5291984004 ("dm: fix bio polling to handle possibile
BLK_STS_AGAIN") inadvertently introduced an early return from
dm_io_complete() without first queueing the bio to DM if BLK_STS_AGAIN
occurs and bio-polling is _not_ being used.

Fix this by only returning early from dm_io_complete() if the bio has
first been properly queued to DM. Otherwise, the bio will never finish
via bio_endio.

Fixes: 5291984004 ("dm: fix bio polling to handle possibile BLK_STS_AGAIN")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-21 13:48:52 -04:00
Nikos Tsironis 9ae6e8b1c9 dm era: commit metadata in postsuspend after worker stops
During postsuspend dm-era does the following:

1. Archives the current era
2. Commits the metadata, as part of the RPC call for archiving the
   current era
3. Stops the worker

Until the worker stops, it might write to the metadata again. Moreover,
these writes are not flushed to disk immediately, but are cached by the
dm-bufio client, which writes them back asynchronously.

As a result, the committed metadata of a suspended dm-era device might
not be consistent with the in-core metadata.

In some cases, this can result in the corruption of the on-disk
metadata. Suppose the following sequence of events:

1. Load a new table, e.g. a snapshot-origin table, to a device with a
   dm-era table
2. Suspend the device
3. dm-era commits its metadata, but the worker does a few more metadata
   writes until it stops, as part of digesting an archived writeset
4. These writes are cached by the dm-bufio client
5. Load the dm-era table to another device.
6. The new instance of the dm-era target loads the committed, on-disk
   metadata, which don't include the extra writes done by the worker
   after the metadata commit.
7. Resume the new device
8. The new dm-era target instance starts using the metadata
9. Resume the original device
10. The destructor of the old dm-era target instance is called and
    destroys the dm-bufio client, which results in flushing the cached
    writes to disk
11. These writes might overwrite the writes done by the new dm-era
    instance, hence corrupting its metadata.

Fix this by committing the metadata after the worker stops running.

stop_worker uses flush_workqueue to flush the current work. However, the
work item may re-queue itself and flush_workqueue doesn't wait for
re-queued works to finish.

This could result in the worker changing the metadata after they have
been committed, or writing to the metadata concurrently with the commit
in the postsuspend thread.

Use drain_workqueue instead, which waits until the work and all
re-queued works finish.

Fixes: eec40579d8 ("dm: add era target")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-21 13:35:01 -04:00
Linus Torvalds 462abc9de7 block-5.19-2022-06-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKr3gUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphQbD/9q3RlFADRFQICCPoTbh9RBflhjN3D21l7y
 OarLbQiOpSLT1KRrwUdzxKjNkps/tKrSWKke7znrGpgAWAUXrRX9O2mqraGwQ0JQ
 cxMI2DDnekQbDaIsv/9lAsF+9rPaSU1NQR09jm4jw5Wc59XnY6mjJF+Lb83Psoow
 r4MRMXg+JQZGOn4ekN/RujwRaDluK+R1RQPUePXM1/LZPbsbTMFbQR1hrPIcrsYp
 R5f2an3mpmOrYpVAZQBEl5F451pKv0lah+QVd6VY8/CyReWLTzKh4yHYZs7fgkAK
 0aTxXEskrQozLYQgEPefTDt1JoyHqMLQ1EIvVbSoGrlcgdmbjEpDd0ee2IhBdq6m
 80gJZOFlE/K4r5RFClIhrsYdJVdSQ+fHGqO+R0WLxWe+w0T80X6tUEewSRO0WnB9
 nlXvlj/SDmeXmsrkjFGH55yLubgKTXigCbwwvtdv24zsmmq5QQSgMrTkdey4NH7U
 C3zs8PVvGsqlXR7/N5GkaSvE/zPe6mx+pXF9SdzFwAAYYglqTZs1moInT3WGrqBw
 iK/aKJkoCDh06LHkNmQXKh2nTNCl/VLFvlcmyTfbKthfQC6dWD8v3Qe7ce4P7kyM
 bxEF9hAdJSPNPWXMITU4CRCz6JqK0vZUAUNn8g5lf7SJ1QFrqciwqAQF3/OIxeS0
 3+Qj2izcPA==
 =5gxO
 -----END PGP SIGNATURE-----

Merge tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph
      - Quirks, quirks, quirks to work around buggy consumer grade
        devices (Keith Bush, Ning Wang, Stefan Reiter, Rasheed Hsueh)
      - Better kernel messages for devices that need quirking (Keith
        Bush)
      - Make a kernel message more useful (Thomas Weißschuh)

 - MD pull request from Song, with a few fixes

 - blk-mq sysfs locking fixes (Ming)

 - BFQ stats fix (Bart)

 - blk-mq offline queue fix (Bart)

 - blk-mq flush request tag fix (Ming)

* tag 'block-5.19-2022-06-16' of git://git.kernel.dk/linux-block:
  block/bfq: Enable I/O statistics
  blk-mq: don't clear flush_rq from tags->rqs[]
  blk-mq: avoid to touch q->elevator without any protection
  blk-mq: protect q->elevator by ->sysfs_lock in blk_mq_elv_switch_none
  block: Fix handling of offline queues in blk_mq_alloc_request_hctx()
  md/raid5-ppl: Fix argument order in bio_alloc_bioset()
  Revert "md: don't unregister sync_thread with reconfig_mutex held"
  nvme-pci: disable write zeros support on UMIC and Samsung SSDs
  nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs
  nvme-pci: sk hynix p31 has bogus namespace ids
  nvme-pci: smi has bogus namespace ids
  nvme-pci: phison e12 has bogus namespace ids
  nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50
  nvme-pci: add trouble shooting steps for timeouts
  nvme: add bug report info for global duplicate id
  nvme: add device name to warning in uuid_show()
2022-06-17 11:22:58 -07:00
Mikulas Patocka 85e123c27d dm mirror log: round up region bitmap size to BITS_PER_LONG
The code in dm-log rounds up bitset_size to 32 bits. It then uses
find_next_zero_bit_le on the allocated region. find_next_zero_bit_le
accesses the bitmap using unsigned long pointers. So, on 64-bit
architectures, it may access 4 bytes beyond the allocated size.

Fix this bug by rounding up bitset_size to BITS_PER_LONG.

This bug was found by running the lvm2 testsuite with kasan.

Fixes: 29121bd0b0 ("[PATCH] dm mirror log: bitset_size fix")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-16 19:39:29 -04:00
Mikulas Patocka 1ee88de395 dm: fix narrow race for REQ_NOWAIT bios being issued despite no support
Starting with the commit 63a225c9fd20, device mapper has an optimization
that it will take cheaper table lock (dm_get_live_table_fast instead of
dm_get_live_table) if the bio has REQ_NOWAIT. The bios with REQ_NOWAIT
must not block in the target request routine, if they did, we would be
blocking while holding rcu_read_lock, which is prohibited.

The targets that are suitable for REQ_NOWAIT optimization (and that don't
block in the map routine) have the flag DM_TARGET_NOWAIT set. Device
mapper will test if all the targets and all the devices in a table
support nowait (see the function dm_table_supports_nowait) and it will set
or clear the QUEUE_FLAG_NOWAIT flag on its request queue according to
this check.

There's a test in submit_bio_noacct: "if ((bio->bi_opf & REQ_NOWAIT) &&
!blk_queue_nowait(q)) goto not_supported" - this will make sure that
REQ_NOWAIT bios can't enter a request queue that doesn't support them.

This mechanism works to prevent REQ_NOWAIT bios from reaching dm targets
that don't support the REQ_NOWAIT flag (and that may block in the map
routine) - except that there is a small race condition:

submit_bio_noacct checks if the queue has the QUEUE_FLAG_NOWAIT without
holding any locks. Immediatelly after this check, the device mapper table
may be reloaded with a table that doesn't support REQ_NOWAIT (for example,
if we start moving the logical volume or if we activate a snapshot).
However the REQ_NOWAIT bio that already passed the check in
submit_bio_noacct would be sent to device mapper, where it could be
redirected to a dm target that doesn't support REQ_NOWAIT - the result is
sleeping while we hold rcu_read_lock.

In order to fix this race, we double-check if the target supports
REQ_NOWAIT while we hold the table lock (so that the table can't change
under us).

Fixes: 563a225c9f ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-16 19:39:02 -04:00
Mikulas Patocka 5d7362d0d5 dm: fix use-after-free in dm_put_live_table_bio
dm_put_live_table_bio is called from the end of dm_submit_bio.
However, at this point, the bio may be already finished and the caller
may have freed the bio. Consequently, dm_put_live_table_bio accesses
the stale "bio" pointer.

Fix this bug by loading the bi_opf value and passing it to
dm_get_live_table_bio and dm_put_live_table_bio instead of the bio.

This bug was found by running the lvm2 testsuite with kasan.

Fixes: 563a225c9f ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-16 19:38:49 -04:00
Logan Gunthorpe f34fdcd4a0 md/raid5-ppl: Fix argument order in bio_alloc_bioset()
bio_alloc_bioset() takes a block device, number of vectors, the
OP flags, the GFP mask and the bio set. However when the prototype
was changed, the callisite in ppl_do_flush() had the OP flags and
the GFP flags reversed. This introduced some sparse error:

  drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3
				    (different base types)
  drivers/md/raid5-ppl.c:632:57:    expected unsigned int opf
  drivers/md/raid5-ppl.c:632:57:    got restricted gfp_t [usertype]
  drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4
  				    (different base types)
  drivers/md/raid5-ppl.c:633:61:    expected restricted gfp_t [usertype]
				    gfp_mask
  drivers/md/raid5-ppl.c:633:61:    got unsigned long long

The sparse error introduction may not have been reported correctly by
0day due to other work that was cleaning up other sparse errors in this
area.

Fixes: 609be10667 ("block: pass a block_device and opf to bio_alloc_bioset")
Cc: stable@vger.kernel.org # 5.18+
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
2022-06-15 10:32:48 -07:00
Guoqing Jiang d0a180341f Revert "md: don't unregister sync_thread with reconfig_mutex held"
The 07reshape5intr test is broke because of below path.

    md_reap_sync_thread
            -> mddev_unlock
            -> md_unregister_thread(&mddev->sync_thread)

And md_check_recovery is triggered by,

mddev_unlock -> md_wakeup_thread(mddev->thread)

then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.

Fixes: 8b48ec23cc ("md: don't unregister sync_thread with reconfig_mutex held")
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
2022-06-15 10:30:14 -07:00
Benjamin Marzinski 10eb3a0d51 dm: fix race in dm_start_io_acct
After commit 82f6cdcc36 ("dm: switch dm_io booleans over to proper
flags") dm_start_io_acct stopped atomically checking and setting
was_accounted, which turned into the DM_IO_ACCOUNTED flag. This opened
the possibility for a race where IO accounting is started twice for
duplicate bios. To remove the race, check the flag while holding the
io->lock.

Fixes: 82f6cdcc36 ("dm: switch dm_io booleans over to proper flags")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-15 11:51:41 -04:00
Mike Snitzer dddf305640 dm: fix zoned locking imbalance due to needless check in clone_endio
After the commit ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone"),
clone_endio() only calls dm_zone_endio() when DM targets remap the
clone bio's bdev to something other than the md->disk->part0 default.

However, if a DM target (e.g. dm-crypt) stacked ontop of a dm-zoned
does not remap the clone bio using bio_set_dev() then dm_zone_endio()
is not called at completion of the bios and zone locks are not
properly unlocked. This triggers a hang, in dm_zone_map_bio(), when
blktests block/004 is run for dm-crypt on zoned block devices. To
avoid the hang, simply remove the clone_endio() check that verifies
the target remapped the clone bio to a device other than the default.

Fixes: ca522482e3 ("dm: pass NULL bdev to bio_alloc_clone")
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-10 15:23:54 -04:00
Christoph Hellwig 29dec90a0f dm: fix bio_set allocation
The use of bioset_init_from_src mean that the pre-allocated pools weren't
used for anything except parameter passing, and the integrity pool
creation got completely lost for the actual live mapped_device.  Fix that
by assigning the actual preallocated dm_md_mempools to the mapped_device
and using that for I/O instead of creating new mempools.

Fixes: 2a2a4c510b ("dm: use bioset_init_from_src() to copy bio_set")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-06-08 14:04:14 -04:00
Linus Torvalds 78c6499c92 for-5.19/drivers-2022-06-02
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKZmkoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqyrD/4iyg2ULBPyljLoM3Ed8AONbrFBApenKDFN
 FjiFRZNll8zvJLTtP0GQqJIrljPuRlqb0IkbgqXl+yvPZ+wpB9HgIe2aohkOaqqJ
 KS49UNR/aIHwC2y7lwlcsdgVqqoPdc4wnZeaQvCsWPCBhCca/k0kR7s3uEhHMK92
 OWpV9osl/thLfOBwwt4IEaO1Koz8PM/fCR4XA2KLbs8E4P8EcSFglqi0ap7foLNr
 pQAJIlPjkmF6nw4xg5fdBjVBo//kVcuf9IBMi5/XinmUL1taFAcn5WyeOvBbi0Fs
 Sqp/pKkveM8xWZKrDyA/wf8nzRNpBBl6TOQEMFtV6FkZrij3pbKHlCiyL2gBR13e
 5gkbVvXgtdqDVlnqlvIV/Swfh5YQFtn7+vlHFOUjP+iObsBRo4fTvDhPTLoO/VCf
 tIAA7xnq/pquRKS/QGGC7ZxRVc3T1r+EpvbBP5Dc4CDGbbbyZLCSrOh5HWIb0I3k
 95GSiipTtf54KqZ9HiG/u+xNAFIdapXgU4Xm+JyDRWLdxSs5Nmy8VgpdflvxBfuo
 hCvJJw3vtDusjyHc7IafxaZlJVQT9tPgPshs8GfrCMCP19RCALD/5irspFh6dD35
 BQTIkhC68XNa0iNn/NTP3uxir/JwRoovxQkA9eD+r1NHsAbL8GTypfr5kKJxDaIK
 UhawfyZE3Q==
 =YqhC
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block

Pull more block driver updates from Jens Axboe:
 "A collection of stragglers that were late on sending in their changes
  and just followup fixes.

   - NVMe fixes pull request via Christoph:
       - set controller enable bit in a separate write (Niklas Cassel)
       - disable namespace identifiers for the MAXIO MAP1001 (Christoph)
       - fix a comment typo (Julia Lawall)"

   - MD fixes pull request via Song:
       - Remove uses of bdevname (Christoph Hellwig)
       - Bug fixes (Guoqing Jiang, and Xiao Ni)

   - bcache fixes series (Coly)

   - null_blk zoned write fix (Damien)

   - nbd fixes (Yu, Zhang)

   - Fix for loop partition scanning (Christoph)"

* tag 'for-5.19/drivers-2022-06-02' of git://git.kernel.dk/linux-block: (23 commits)
  block: null_blk: Fix null_zone_write()
  nvmet: fix typo in comment
  nvme: set controller enable bit in a separate write
  nvme-pci: disable namespace identifiers for the MAXIO MAP1001
  bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()
  nbd: use pr_err to output error message
  nbd: fix possible overflow on 'first_minor' in nbd_dev_add()
  nbd: fix io hung while disconnecting device
  nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
  nbd: fix race between nbd_alloc_config() and module removal
  nbd: call genl_unregister_family() first in nbd_cleanup()
  md: bcache: check the return value of kzalloc() in detached_dev_do_request()
  bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
  block, loop: support partitions without scanning
  bcache: avoid journal no-space deadlock by reserving 1 journal bucket
  bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
  bcache: improve multithreaded bch_sectors_dirty_init()
  bcache: improve multithreaded bch_btree_check()
  md: fix double free of io_acct_set bioset
  md: Don't set mddev private to NULL in raid0 pers->free
  ...
2022-06-03 10:25:56 -07:00
Linus Torvalds fa78526acc - Fix DM core's dm_table_supports_poll to return false if no data
devices.
 
 - Fix DM verity target so that it cannot be switched to a different DM
   target type (e.g. dm-linear) via DM table reload.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmKX0aMACgkQxSPxCi2d
 A1okwgf+NMX+Il0qgda1mZrId7nVdQGGyv7uapLXIxRm5Z+LcUxqw+hqzoeabeQ/
 7rou3KqsXuPcpu1AATPHis0Ub9CcXwtpbevf2rmh3Ey4kqLLuFqUP6IjwvFtyp2Y
 Ms9QAhTvIZXxAPcrb7HH2v7ULOCmdI89OAr8Q/hQ+F4wjI8BO2tNJ4WfxeqpMy5M
 EFwEO485Ct+XvLDek4+7hYxvSO/6ANgjgzWx4dwsP+iC9SFJurvNVnoXpIl+69DU
 v8R6Udp0buQqFscyfRbHVOYxVkBROMWg/lKX/4hhgiSoV8j5xSm9hp3S13BffHj7
 Bbp8cW+E6IkaASDpIRBAa/6a6k7K1w==
 =+C+g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM core's dm_table_supports_poll to return false if target has no
   data devices.

 - Fix DM verity target so that it cannot be switched to a different DM
   target type (e.g. dm-linear) via DM table reload.

* tag 'for-5.19/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm verity: set DM_TARGET_IMMUTABLE feature flag
  dm table: fix dm_table_supports_poll to return false if no data devices
2022-06-01 14:25:04 -07:00
Sarthak Kukreti 4caae58406 dm verity: set DM_TARGET_IMMUTABLE feature flag
The device-mapper framework provides a mechanism to mark targets as
immutable (and hence fail table reloads that try to change the target
type). Add the DM_TARGET_IMMUTABLE flag to the dm-verity target's
feature flags to prevent switching the verity target with a different
target type.

Fixes: a4ffc15219 ("dm: add verity target")
Cc: stable@vger.kernel.org
Signed-off-by: Sarthak Kukreti <sarthakkukreti@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-31 16:22:30 -04:00
Mike Snitzer 9571f829f3 dm table: fix dm_table_supports_poll to return false if no data devices
It was reported that the "generic/250" test in xfstests (which uses
the dm-error target) demonstrates a regression where the kernel
crashes in bioset_exit().

Since commit cfc97abcbe ("dm: conditionally enable
BIOSET_PERCPU_CACHE for dm_io bioset") the bioset_init() for the dm_io
bioset will setup the bioset's per-cpu alloc cache if all devices have
QUEUE_FLAG_POLL set.

But there was an bug where a target that doesn't have any data devices
(and that doesn't even set the .iterate_devices dm target callback)
will incorrectly return true from dm_table_supports_poll().

Fix this by updating dm_table_supports_poll() to follow dm-table.c's
well-worn pattern for testing that _all_ targets in a DM table do in
fact have underlying devices that set QUEUE_FLAG_POLL.

NOTE: An additional block fix is still needed so that
bio_alloc_cache_destroy() clears the bioset's ->cache member.
Otherwise, a DM device's table reload that transitions the DM device's
bioset from using a per-cpu alloc cache to _not_ using one will result
in bioset_exit() crashing in bio_alloc_cache_destroy() because dm's
dm_io bioset ("io_bs") was left with a stale ->cache member.

Fixes: cfc97abcbe ("dm: conditionally enable BIOSET_PERCPU_CACHE for dm_io bioset")
Reported-by: Matthew Wilcox <willy@infradead.org>
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
2022-05-31 14:44:17 -04:00
Coly Li a1a2d8f016 bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()
The kworker routine update_writeback_rate() is schedued to update the
writeback rate in every 5 seconds by default. Before calling
__update_writeback_rate() to do real job, semaphore dc->writeback_lock
should be held by the kworker routine.

At the same time, bcache writeback thread routine bch_writeback_thread()
also needs to hold dc->writeback_lock before flushing dirty data back
into the backing device. If the dirty data set is large, it might be
very long time for bch_writeback_thread() to scan all dirty buckets and
releases dc->writeback_lock. In such case update_writeback_rate() can be
starved for long enough time so that kernel reports a soft lockup warn-
ing started like:
  watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]

Such soft lockup condition is unnecessary, because after the writeback
thread finishes its job and releases dc->writeback_lock, the kworker
update_writeback_rate() may continue to work and everything is fine
indeed.

This patch avoids the unnecessary soft lockup by the following method,
- Add new member to struct cached_dev
  - dc->rate_update_retry (0 by default)
- In update_writeback_rate() call down_read_trylock(&dc->writeback_lock)
  firstly, if it fails then lock contention happens.
- If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), doesn't
  acquire the lock and reschedules the kworker for next try.
- If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, no retry
  anymore and call down_read(&dc->writeback_lock) to wait for the lock.

By the above method, at worst case update_writeback_rate() may retry for
1+ minutes before blocking on dc->writeback_lock by calling down_read().
For a 4TB cache device with 1TB dirty data, 90%+ of the unnecessary soft
lockup warning message can be avoided.

When retrying to acquire dc->writeback_lock in update_writeback_rate(),
of course the writeback rate cannot be updated. It is fair, because when
the kworker is blocked on the lock contention of dc->writeback_lock, the
writeback rate cannot be updated neither.

This change follows Jens Axboe's suggestion to a more clear and simple
version.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28 06:48:26 -06:00
Linus Torvalds 35cdd8656e libnvdimm for 5.19
- Add support for clearing memory error via pwrite(2) on DAX
 
 - Fix 'security overwrite' support in the presence of media errors
 
 - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs
 Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj
 kxOtJEU2bA9naninXGORsDUndiNkwQo=
 =gVIn
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and DAX updates from Dan Williams:
 "New support for clearing memory errors when a file is in DAX mode,
  alongside with some other fixes and cleanups.

  Previously it was only possible to clear these errors using a truncate
  or hole-punch operation to trigger the filesystem to reallocate the
  block, now, any page aligned write can opportunistically clear errors
  as well.

  This change spans x86/mm, nvdimm, and fs/dax, and has received the
  appropriate sign-offs. Thanks to Jane for her work on this.

  Summary:

   - Add support for clearing memory error via pwrite(2) on DAX

   - Fix 'security overwrite' support in the presence of media errors

   - Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"

* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  pmem: implement pmem_recovery_write()
  pmem: refactor pmem_clear_poison()
  dax: add .recovery_write dax_operation
  dax: introduce DAX_RECOVERY_WRITE dax access mode
  mce: fix set_mce_nospec to always unmap the whole page
  x86/mce: relocate set{clear}_mce_nospec() functions
  acpi/nfit: rely on mce->misc to determine poison granularity
  testing: nvdimm: asm/mce.h is not needed in nfit.c
  testing: nvdimm: iomap: make __nfit_test_ioremap a macro
  nvdimm: Allow overwrite in the presence of disabled dimms
  tools/testing/nvdimm: remove unneeded flush_workqueue
2022-05-27 15:49:30 -07:00
Jia-Ju Bai 40f567bbb3 md: bcache: check the return value of kzalloc() in detached_dev_do_request()
The function kzalloc() in detached_dev_do_request() can fail, so its
return value should be checked.

Fixes: bc082a55d2 ("bcache: fix inaccurate io state for detached bcache devices")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 09:49:48 -06:00
Coly Li 7d6b902ea0 bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init()
The local variables check_state (in bch_btree_check()) and state (in
bch_sectors_dirty_init()) should be fully filled by 0, because before
allocating them on stack, they were dynamically allocated by kzalloc().

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-27 09:49:48 -06:00
Linus Torvalds 7e284070ab - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL
set. This change improves DM's hipri bio polling (REQ_POLLED)
   performance by 7 - 20% depending on the system.
 
 - Update DM core to use jump_labels to further reduce cost of unlikely
   branches for zoned block devices, dm-stats and swap_bios throttling.
 
 - Various DM core changes to reduce bio-based DM overhead and simplify
   IO accounting.
 
 - Fundamental DM core improvements to dm_io reference counting and the
   elimination of using bio_split()+bio_chain() -- instead DM's
   bio-based IO accounting is updated to account that a split occurred.
 
 - Improve DM core's abnormal bio processing to do less work.
 
 - Improve DM core's hipri polling support to use a single list rather
   than an hlist.
 
 - Update DM core to pass NULL bdev to bio_alloc_clone() so that
   initialization that isn't useful for DM can be elided.
 
 - Add cond_resched to DM stats' various loops that loop over all
   entries.
 
 - Fix incorrect error code return from DM integrity's constructor.
 
 - Make DM crypt's printing of the key constant-time.
 
 - Update bio-based DM multipath to provide high-resolution timer to
   the Historical Service Time (HST) path selector.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmKOiAwACgkQxSPxCi2d
 A1qCxgf/VLmiywUR7zCIDiPyJkc547z2MlPVaTCzpclklyLZ5wgfTtqsb8RnEqRs
 KIXrvKbakbfZtrG0k9ZYdbIIVO4jHQavCXtVZH0Mj4XqQArDhHxZzlBx8i6BirO5
 aya7HKWHcXj+s7GF146hRcPtD53LbSNDrB/bVaiZcHxOemthSgtht2xwmZfzoCcS
 JtorpWkA97lgDbdSAcR0kn1zkHDWFkQMJgnUL9FotnlhOVXZVUvAbgb4OjfxB3fG
 bCYlytDhs+6VwO+/CZXcOA7LNNtkLJPEkxUxjrChS5do64SgJUlDvY66iB6+7K0w
 0Jri5D/NZvjwYVeLgXn5UwyQJ75cjQ==
 =zlaE
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - Enable DM core bioset's per-cpu bio cache if QUEUE_FLAG_POLL set.
   This change improves DM's hipri bio polling (REQ_POLLED) performance
   by 7 - 20% depending on the system.

 - Update DM core to use jump_labels to further reduce cost of unlikely
   branches for zoned block devices, dm-stats and swap_bios throttling.

 - Various DM core changes to reduce bio-based DM overhead and simplify
   IO accounting.

 - Fundamental DM core improvements to dm_io reference counting and the
   elimination of using bio_split()+bio_chain() -- instead DM's
   bio-based IO accounting is updated to account that a split occurred.

 - Improve DM core's abnormal bio processing to do less work.

 - Improve DM core's hipri polling support to use a single list rather
   than an hlist.

 - Update DM core to pass NULL bdev to bio_alloc_clone() so that
   initialization that isn't useful for DM can be elided.

 - Add cond_resched to DM stats' various loops that loop over all
   entries.

 - Fix incorrect error code return from DM integrity's constructor.

 - Make DM crypt's printing of the key constant-time.

 - Update bio-based DM multipath to provide high-resolution timer to the
   Historical Service Time (HST) path selector.

* tag 'for-5.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
  dm: pass NULL bdev to bio_alloc_clone
  dm cache metadata: remove unnecessary variable in __dump_mapping
  dm mpath: provide high-resolution timer to HST for bio-based
  dm crypt: make printing of the key constant-time
  dm integrity: fix error code in dm_integrity_ctr()
  dm stats: add cond_resched when looping over entries
  dm: improve abnormal bio processing
  dm: simplify bio-based IO accounting further
  dm: put all polled dm_io instances into a single list
  dm: improve dm_io reference counting
  dm: don't grab target io reference in dm_zone_map_bio
  dm: improve bio splitting and associated IO accounting
  dm: switch to bdev based IO accounting interfaces
  dm: pass dm_io instance to dm_io_acct directly
  dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
  dm: use bio_sectors in dm_aceept_partial_bio
  dm: simplify basic targets
  dm: conditionally enable branching for less used features
  dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio
  dm: move hot dm_io members to same cacheline as dm_target_io
  ...
2022-05-26 21:13:45 -07:00
Coly Li 32feee36c3 bcache: avoid journal no-space deadlock by reserving 1 journal bucket
The journal no-space deadlock was reported time to time. Such deadlock
can happen in the following situation.

When all journal buckets are fully filled by active jset with heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and inserting them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results btree node split, new journal request might be
triggered. For example, the btree grows one more level after the node
split, then the root node record in cache device super block will be
upgrade by bch_journal_meta() from bch_btree_set_root(). But there is no
space in journal buckets, the journal replay has to wait for new journal
bucket to be reclaimed after at least one journal bucket replayed. This
is one example that how the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket in
run time, and only permit the reserved journal bucket to be used during
cache set registration procedure for things like journal replay. Then
the journal space will never be fully filled, there is no chance for
journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" in struct journal, it is
inititalized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization done in
run_cache_set(). In the run time when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whetheer there is free journal bucket to allocate. By
this method, there is always 1 jouranl bucket reserved in run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24 06:19:33 -06:00
Coly Li 80db4e4707 bcache: remove incremental dirty sector counting for bch_sectors_dirty_init()
After making bch_sectors_dirty_init() being multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() doesn't
release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME)
bkeys. Because a read lock is added on btree root node to prevent the
btree to be split during the dirty sectors counting, other I/O requester
has no chance to gain the write lock even restart bcache_btree().

That is to say, the incremental dirty sectors counting is incompatible
to the multhreaded bch_sectors_dirty_init(). We have to choose one and
drop another one.

In my testing, with 512 bytes random writes, I generate 1.2T dirty data
and a btree with 400K nodes. With single thread and incremental dirty
sectors counting, it takes 30+ minites to register the backing device.
And with multithreaded dirty sectors counting, the backing device
registration can be accomplished within 2 minutes.

The 30+ minutes V.S. 2- minutes difference makes me decide to keep
multithreaded bch_sectors_dirty_init() and drop the incremental dirty
sectors counting. This is what this patch does.

But INIT_KEYS_EACH_TIME is kept, in sectors_dirty_init_fn() the CPU
will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This is to avoid the watchdog reports a bogus soft lockup
warning.

Fixes: b144e45fc5 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-24 06:19:33 -06:00