Commit graph

1587 commits

Author SHA1 Message Date
Kevin Wolf 303de47b2c Mark assert_bdrv_graph_readable/writable() GRAPH_RD/WRLOCK
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-16-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:08:23 +01:00
Kevin Wolf 4002ffdc4f graph-lock: TSA annotations for lock/unlock functions
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-15-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:08:23 +01:00
Emanuele Giuseppe Esposito 3f35f82e04 block: assert that graph read and writes are performed correctly
Remove the old assert_bdrv_graph_writable, and replace it with
the new version using graph-lock API.

See the function documentation for more information.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-14-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:08:23 +01:00
Emanuele Giuseppe Esposito 8aa77000c2 graph-lock: Implement guard macros
Similar to the implementation in lockable.h, implement macros to
automatically take and release the rdlock.

Create the empty GraphLockable and GraphLockableMainloop structs only to
use it as a type for G_DEFINE_AUTOPTR_CLEANUP_FUNC.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-4-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Paolo Bonzini aead9dc9d1 graph-lock: Introduce a lock to protect block graph operations
Block layer graph operations are always run under BQL in the main loop.
This is proved by the assertion qemu_in_main_thread() and its wrapper
macro GLOBAL_STATE_CODE.

However, there are also concurrent coroutines running in other iothreads
that always try to traverse the graph. Currently this is protected
(among various other things) by the AioContext lock, but once this is
removed, we need to make sure that reads do not happen while modifying
the graph.

We distinguish between writer (main loop, under BQL) that modifies the
graph, and readers (all other coroutines running in various AioContext),
that go through the graph edges, reading ->parents and->children.

The writer (main loop) has "exclusive" access, so it first waits for any
current read to finish, and then prevents incoming ones from entering
while it has the exclusive access.

The readers (coroutines in multiple AioContext) are free to access the
graph as long the writer is not modifying the graph. In case it is, they
go in a CoQueue and sleep until the writer is done.

If a coroutine changes AioContext, the counter in the original and new
AioContext are left intact, since the writer does not care where the
reader is, but only if there is one.

As a result, some AioContexts might have a negative reader count, to
balance the positive count of the AioContext that took the lock.  This
also means that when an AioContext is deleted it may have a nonzero
reader count. In that case we transfer the count to a global shared
counter so that the writer is always aware of all readers.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-3-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Kevin Wolf da0bd74434 block: Factor out bdrv_drain_all_begin_nopoll()
Provide a separate function that just quiesces the users of a node to
prevent new requests from coming in, but without waiting for the already
in-flight I/O to complete.

This function can be used in contexts where polling is not allowed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221207131838.239125-2-kwolf@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 0508d0be4b block/dirty-bitmap: convert coroutine-only functions to co_wrapper
bdrv_can_store_new_dirty_bitmap and bdrv_remove_persistent_dirty_bitmap
check if they are running in a coroutine, directly calling the
coroutine callback if it's the case.
Except that no coroutine calls such functions, therefore that check
can be removed, and function creation can be offloaded to
c_w.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-15-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 741443eb43 block: convert bdrv_create to co_wrapper
This function is never called in coroutine context, therefore
instead of manually creating a new coroutine, delegate it to the
block-coroutine-wrapper script, defining it as co_wrapper.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-14-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 76a2f554c1 block-coroutine-wrapper.py: introduce co_wrapper
This new annotation starts just a function wrapper that creates
a new coroutine. It assumes the caller is not a coroutine.
It will be the default annotation to be used in the future.

This is much better as c_w_mixed, because it is clear if the caller
is a coroutine or not, and provides the advantage of automating
the code creation. In the future all c_w_mixed functions will be
substituted by co_wrapper.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-11-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 1bd542016c block: rename generated_co_wrapper in co_wrapper_mixed
In preparation to the incoming new function specifiers,
rename g_c_w with a more meaningful name and document it.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-10-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 2475a0d0f4 block: bdrv_create_file is a coroutine_fn
It is always called in coroutine_fn callbacks, therefore
it can directly call bdrv_co_create().

Rename it to bdrv_co_create_file too.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-9-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 43a0d4f08b block-copy: add coroutine_fn annotations
These functions end up calling bdrv_common_block_status_above(), a
generated_co_wrapper function.
In addition, they also happen to be always called in coroutine context,
meaning all callers are coroutine_fn.
This means that the g_c_w function will enter the qemu_in_coroutine()
case and eventually suspend (or in other words call qemu_coroutine_yield()).
Therefore we can mark such functions coroutine_fn too.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-3-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Emanuele Giuseppe Esposito 7b52a921c1 block-io: introduce coroutine_fn duplicates for bdrv_common_block_status_above callers
bdrv_common_block_status_above() is a g_c_w, and it is being called by
many "wrapper" functions like bdrv_is_allocated(),
bdrv_is_allocated_above() and bdrv_block_status_above().

Because we want to eventually split the coroutine from non-coroutine
case in g_c_w, create duplicate wrappers that take care of directly
calling the same coroutine functions called in the g_c_w.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20221128142337.657646-2-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:43 +01:00
Kevin Wolf 606ed756c1 block: Remove poll parameter from bdrv_parent_drained_begin_single()
All callers of bdrv_parent_drained_begin_single() pass poll=false now,
so we don't need the parameter any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221118174110.55183-16-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 2398747128 block: Don't poll in bdrv_replace_child_noperm()
In order to make sure that bdrv_replace_child_noperm() doesn't have to
poll any more, get rid of the bdrv_parent_drained_begin_single() call.

This is possible now because we can require that the parent is already
drained through the child in question when the function is called and we
don't call the parent drain callbacks more than once.

The additional drain calls needed in callers cause the test case to run
its code in the drain handler too early (bdrv_attach_child() drains
now), so modify it to only enable the code after the test setup has
completed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221118174110.55183-15-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf a82a3bd135 block: Remove ignore_bds_parents parameter from drain_begin/end.
ignore_bds_parents is now ignored during drain_begin and drain_end, so
we can just remove it there. It is still a valid optimisation for
drain_all in bdrv_drained_poll(), so leave it around there.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221118174110.55183-13-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 57e05be343 block: Call drain callbacks only once
We only need to call both the BlockDriver's callback and the parent
callbacks when going from undrained to drained or vice versa. A second
drain section doesn't make a difference for the driver or the parent,
they weren't supposed to send new requests before and after the second
drain.

One thing that gets in the way is the 'ignore_bds_parents' parameter in
bdrv_do_drained_begin_quiesce() and bdrv_do_drained_end(): It means that
bdrv_drain_all_begin() increases bs->quiesce_counter, but does not
quiesce the parent through BdrvChildClass callbacks. If an additional
drain section is started now, bs->quiesce_counter will be non-zero, but
we would still need to quiesce the parent through BdrvChildClass in
order to keep things consistent (and unquiesce it on the matching
bdrv_drained_end(), even though the counter would not reach 0 yet as
long as the bdrv_drain_all() section is still active).

Instead of keeping track of this, let's just get rid of the parameter.
It was introduced in commit 6cd5c9d7b2 as an optimisation so that
during bdrv_drain_all(), we wouldn't recursively drain all parents up to
the root for each node, resulting in quadratic complexity. As it happens,
calling the callbacks only once solves the same problem, so as of this
patch, we'll still have O(n) complexity and ignore_bds_parents is not
needed any more.

This patch only ignores the 'ignore_bds_parents' parameter. It will be
removed in a separate patch.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20221118174110.55183-12-kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 299403aeda block: Remove subtree drains
Subtree drains are not used any more. Remove them.

After this, BdrvChildClass.attach/detach() don't poll any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20221118174110.55183-11-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 92140b9f3f stream: Replace subtree drain with a single node drain
The subtree drain was introduced in commit b1e1af394d as a way to avoid
graph changes between finding the base node and changing the block graph
as necessary on completion of the image streaming job.

The block graph could change between these two points because
bdrv_set_backing_hd() first drains the parent node, which involved
polling and can do anything.

Subtree draining was an imperfect way to make this less likely (because
with it, fewer callbacks are called during this window). Everyone agreed
that it's not really the right solution, and it was only committed as a
stopgap solution.

This replaces the subtree drain with a solution that simply drains the
parent node before we try to find the base node, and then call a version
of bdrv_set_backing_hd() that doesn't drain, but just asserts that the
parent node is already drained.

This way, any graph changes caused by draining happen before we start
looking at the graph and things stay consistent between finding the base
node and changing the graph.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20221118174110.55183-10-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 2f65df6e16 block: Remove drained_end_counter
drained_end_counter is unused now, nobody changes its value any more. It
can be removed.

In cases where we had two almost identical functions that only differed
in whether the caller passes drained_end_counter, or whether they would
poll for a local drained_end_counter to reach 0, these become a single
function.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20221118174110.55183-5-kwolf@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Kevin Wolf 5e8ac21717 block: Revert .bdrv_drained_begin/end to non-coroutine_fn
Polling during bdrv_drained_end() can be problematic (and in the future,
we may get cases for bdrv_drained_begin() where polling is forbidden,
and we don't care about already in-flight requests, but just want to
prevent new requests from arriving).

The .bdrv_drained_begin/end callbacks running in a coroutine is the only
reason why we have to do this polling, so make them non-coroutine
callbacks again. None of the callers actually yield any more.

This means that bdrv_drained_end() effectively doesn't poll any more,
even if AIO_WAIT_WHILE() loops are still there (their condition is false
from the beginning). This is generally not a problem, but in
test-bdrv-drain, some additional explicit aio_poll() calls need to be
added because the test case wants to verify the final state after BHs
have executed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20221118174110.55183-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-12-15 16:07:42 +01:00
Hanna Reitz d5f8d79c2f block: Make bdrv_child_get_parent_aio_context I/O
We want to use bdrv_child_get_parent_aio_context() from
bdrv_parent_drained_{begin,end}_single(), both of which are "I/O or GS"
functions.

Prior to 3ed4f708fe, all the implementations were I/O code anyway.
3ed4f708fe has put block jobs' AioContext field under the job mutex, so
to make child_job_get_parent_aio_context() work in an I/O context, we
need to take that lock there.

Furthermore, blk_root_get_parent_aio_context() is not marked as
anything, but is safe to run in an I/O context, so mark it that way now.
(blk_get_aio_context() is an I/O code function.)

With that done, all implementations explicitly are I/O code, so we can
mark bdrv_child_get_parent_aio_context() as I/O code, too, so callers
know it is safe to run from both GS and I/O contexts.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20221107151321.211175-2-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-11-10 14:58:34 +01:00
Stefan Hajnoczi d5ab9490cd Block layer patches
- Cleanup bs->backing and bs->file handling
 - Refactor bdrv_try_set_aio_context using transactions
 - Changes for improved coroutine_fn consistency
 - vhost-user-blk: fix the resize crash
 - io_uring: Use of io_uring_register_ring_fd() led to breakage, revert
 - vvfat: Fix some problems with r/w mode
 - Code cleanup
 - MAINTAINERS: Fold "Block QAPI, monitor, ..." into "Block layer core"
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmNazhIRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9ZyTw/8Dfck/SuxfyeLlnQItkjaV4cnqWOU8vHs
 9x0KhlptCs+HXdF/3iicpA0lHojn7mNnbdFGjPRY4E0LriQv91TQ5ycdEmrseFPf
 sgeQlgdKCVU/pHjZ2wYarm2pE43Cx85a5xuufmw+7w49dNNZn14l4t+DgviuClVM
 nuVaogfZFbYyetre+Qd2TgLl+gJ+0d4o7Zs5lSWLrT8t0L9AGkcWPA7Nrbl6loIE
 dOautV4G7jLjuMiCeJZOGcnuRVe3gCQ5rCGBFzzH4DUtz4BmiYx4hd3LMEsP0PMM
 CrsfDZS04Ztybl9M7TmJuwkAm1gx1JDMOuJuh18lbJocIOBvhkKKxY2wI5LIdZVI
 ZntmU36RowkX+GGu/PYpYyMjBDClJppZCl7vnjyLYsVt6r0Vu6SmlHpJhcRYabhe
 96Kv1LXH9A6+ogKPU3Layw6JGjg01GNr1ALuT7PO3pGto/JshmOuBEJJDucoF84M
 5AfxFCohMROVldwblA6M0eKnlQBgtr5BvtgbV54BBo88VlFJgDJFQn7R09cTFUEo
 UwaJoS+nIaiZ0bQQVZhZloVppUaTdVJojzfVRCZZctga96/tu1HSFnGLnbEFpUN3
 KOf+XnVNS6Ro+nPSDf9bMjbIom2JicGFfV+6yMgIoxY/d5UA2dTZfefil4TAlSod
 6PsTgg+jrm8=
 =/Fw0
 -----END PGP SIGNATURE-----

Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into staging

Block layer patches

- Cleanup bs->backing and bs->file handling
- Refactor bdrv_try_set_aio_context using transactions
- Changes for improved coroutine_fn consistency
- vhost-user-blk: fix the resize crash
- io_uring: Use of io_uring_register_ring_fd() led to breakage, revert
- vvfat: Fix some problems with r/w mode
- Code cleanup
- MAINTAINERS: Fold "Block QAPI, monitor, ..." into "Block layer core"

# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmNazhIRHGt3b2xmQHJl
# ZGhhdC5jb20ACgkQfwmycsiPL9ZyTw/8Dfck/SuxfyeLlnQItkjaV4cnqWOU8vHs
# 9x0KhlptCs+HXdF/3iicpA0lHojn7mNnbdFGjPRY4E0LriQv91TQ5ycdEmrseFPf
# sgeQlgdKCVU/pHjZ2wYarm2pE43Cx85a5xuufmw+7w49dNNZn14l4t+DgviuClVM
# nuVaogfZFbYyetre+Qd2TgLl+gJ+0d4o7Zs5lSWLrT8t0L9AGkcWPA7Nrbl6loIE
# dOautV4G7jLjuMiCeJZOGcnuRVe3gCQ5rCGBFzzH4DUtz4BmiYx4hd3LMEsP0PMM
# CrsfDZS04Ztybl9M7TmJuwkAm1gx1JDMOuJuh18lbJocIOBvhkKKxY2wI5LIdZVI
# ZntmU36RowkX+GGu/PYpYyMjBDClJppZCl7vnjyLYsVt6r0Vu6SmlHpJhcRYabhe
# 96Kv1LXH9A6+ogKPU3Layw6JGjg01GNr1ALuT7PO3pGto/JshmOuBEJJDucoF84M
# 5AfxFCohMROVldwblA6M0eKnlQBgtr5BvtgbV54BBo88VlFJgDJFQn7R09cTFUEo
# UwaJoS+nIaiZ0bQQVZhZloVppUaTdVJojzfVRCZZctga96/tu1HSFnGLnbEFpUN3
# KOf+XnVNS6Ro+nPSDf9bMjbIom2JicGFfV+6yMgIoxY/d5UA2dTZfefil4TAlSod
# 6PsTgg+jrm8=
# =/Fw0
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 27 Oct 2022 14:29:38 EDT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (58 commits)
  block/block-backend: blk_set_enable_write_cache is IO_CODE
  monitor: switch to *_co_* functions
  vmdk: switch to *_co_* functions
  vhdx: switch to *_co_* functions
  vdi: switch to *_co_* functions
  qed: switch to *_co_* functions
  qcow2: switch to *_co_* functions
  qcow: switch to *_co_* functions
  parallels: switch to *_co_* functions
  mirror: switch to *_co_* functions
  block: switch to *_co_* functions
  commit: switch to *_co_* functions
  vmdk: manually add more coroutine_fn annotations
  qcow2: manually add more coroutine_fn annotations
  qcow: manually add more coroutine_fn annotations
  blkdebug: add missing coroutine_fn annotation for indirect-called functions
  qcow2: add coroutine_fn annotation for indirect-called functions
  block: add missing coroutine_fn annotation to BlockDriverState callbacks
  coroutine-io: add missing coroutine_fn annotation to prototypes
  coroutine-lock: add missing coroutine_fn annotation to prototypes
  ...

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-30 15:15:12 -04:00
Alberto Faria c2d7680893 block: add missing coroutine_fn annotation to BlockDriverState callbacks
Signed-off-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221013123711.620631-9-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Alberto Faria 16bb776f5b block: add missing coroutine_fn annotation to prototypes
The functions are marked coroutine_fn in the definition.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221013123711.620631-6-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Alberto Faria 6894ee2bee monitor: add missing coroutine_fn annotation
hmp_block_resize and hmp_screendump are defined as a ".coroutine = true" command,
so they must be coroutine_fn.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20221013123711.620631-4-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Emanuele Giuseppe Esposito 142e690712 block: remove bdrv_try_set_aio_context and replace it with bdrv_try_change_aio_context
No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221025084952.2139888-11-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Emanuele Giuseppe Esposito a41cfda126 block: rename bdrv_child_try_change_aio_context in bdrv_try_change_aio_context
No functional changes intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221025084952.2139888-10-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Emanuele Giuseppe Esposito d2aafbb68a block: remove all unused ->can_set_aio_ctx and ->set_aio_ctx callbacks
Together with all _can_set_ and _set_ APIs, as they are not needed
anymore.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221025084952.2139888-9-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Emanuele Giuseppe Esposito e08cc001d5 bdrv_change_aio_context: use hash table instead of list of visited nodes
Minor performance improvement, but given that we have hash tables
available, avoid iterating in the visited nodes list every time just
to check if a node has been already visited.

The data structure is not actually a proper hash map, but an hash set,
as we are just adding nodes and not key,value pairs.

Suggested-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20221025084952.2139888-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Emanuele Giuseppe Esposito 7e8c182fb5 block: use transactions as a replacement of ->{can_}set_aio_context()
Simplify the way the aiocontext can be changed in a BDS graph.
There are currently two problems in bdrv_try_set_aio_context:
- There is a confusion of AioContext locks taken and released, because
  we assume that old aiocontext is always taken and new one is
  taken inside.

- It doesn't look very safe to call bdrv_drained_begin while some
  nodes have already switched to the new aiocontext and others haven't.
  This could be especially dangerous because bdrv_drained_begin polls, so
  something else could be executed while graph is in an inconsistent
  state.

Additional minor nitpick: can_set and set_ callbacks both traverse the
graph, both using the ignored list of visited nodes in a different way.

Therefore, get rid of all of this and introduce a new callback,
change_aio_context, that uses transactions to efficiently, cleanly
and most importantly safely change the aiocontext of a graph.

This new callback is a "merge" of the two previous ones:
- Just like can_set_aio_context, recursively traverses the graph.
  Marks all nodes that are visited using a GList, and checks if
  they *could* change the aio_context.
- For each node that passes the above check, drain it and add a new transaction
  that implements a callback that effectively changes the aiocontext.
- Once done, the recursive function returns if *all* nodes can change
  the AioContext. If so, commit the above transactions.
  Regardless of the outcome, call transaction.clean() to undo all drains
  done in the recursion.
- The transaction list is scanned only after all nodes are being drained, so
  we are sure that they all are in the same context, and then
  we switch their AioContext, concluding the drain only after all nodes
  switched to the new AioContext. In this way we make sure that
  bdrv_drained_begin() is always called under the old AioContext, and
  bdrv_drained_end() under the new one.
- Because of the above, we don't need to release and re-acquire the
  old AioContext every time, as everything is done once (and not
  per-node drain and aiocontext change).

Note that the "change" API is not yet invoked anywhere.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20221025084952.2139888-3-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Vladimir Sementsov-Ogievskiy 5bb0474778 block: Manipulate bs->file / bs->backing pointers in .attach/.detach
bs->file and bs->backing are a kind of duplication of part of
bs->children. But very useful diplication, so let's not drop them at
all:)

We should manage bs->file and bs->backing in same place, where we
manage bs->children, to keep them in sync.

Moreover, generic io paths are unprepared to BdrvChild without a bs, so
it's double good to clear bs->file / bs->backing when we detach the
child.

Detach is simple: if we detach bs->file or bs->backing child, just
set corresponding field to NULL.

Attach is a bit more complicated. But we still can precisely detect
should we set one of bs->file / bs->backing or not:

- if role is BDRV_CHILD_COW, we definitely deal with bs->backing
- else, if role is BDRV_CHILD_FILTERED (it must be also
  BDRV_CHILD_PRIMARY), it's a filtered child. Use
  bs->drv->filtered_child_is_backing to chose the pointer field to
  modify.
- else, if role is BDRV_CHILD_PRIMARY, we deal with bs->file
- in all other cases, it's neither bs->backing nor bs->file. It's some
  other child and we shouldn't care

OK. This change brings one more good thing: we can (and should) get rid
of all indirect pointers in the block-graph-change transactions:

bdrv_attach_child_common() stores BdrvChild** into transaction to clear
it on abort.

bdrv_attach_child_common() has two callers: bdrv_attach_child_noperm()
just pass-through this feature, bdrv_root_attach_child() doesn't need
the feature.

Look at bdrv_attach_child_noperm() callers:
  - bdrv_attach_child() doesn't need the feature
  - bdrv_set_file_or_backing_noperm() uses the feature to manage
    bs->file and bs->backing, we don't want it anymore
  - bdrv_append() uses the feature to manage bs->backing, again we
    don't want it anymore

So, we should drop this stuff! Great!

We could probably keep BdrvChild** argument to keep the int return
value, but it seems not worth the complexity.

Finally, we now set .file / .backing automatically in generic code and
want to restring setting them by hand outside of .attach/.detach.
So, this patch cleanups all remaining places where they were set.
To find such places I use:

  git grep '\->file ='
  git grep '\->backing ='
  git grep '&.*\<backing\>'
  git grep '&.*\<file\>'

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220726201134.924743-14-vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Vladimir Sementsov-Ogievskiy 71ca43852a block: document connection between child roles and bs->backing/bs->file
Make the informal rules formal. In further commit we'll add
corresponding assertions.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220726201134.924743-8-vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Vladimir Sementsov-Ogievskiy 1921b4f786 test-bdrv-graph-mod: fix filters to be filters
bdrv_pass_through is used as filter, even all node variables has
corresponding names. We want to append it, so it should be
backing-child-based filter like mirror_top.
So, in test_update_perm_tree, first child should be DATA, as we don't
want filters with two filtered children.

bdrv_exclusive_writer is used as a filter once. So it should be filter
anyway. We want to append it, so it should be backing-child-based
fitler too.

Make all FILTERED children to be PRIMARY as well. We are going to force
this rule by assertion soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220726201134.924743-7-vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Vladimir Sementsov-Ogievskiy 8393078032 block: introduce bdrv_open_file_child() helper
Almost all drivers call bdrv_open_child() similarly. Let's create a
helper for this.

The only not updated drivers that call bdrv_open_child() to set
bs->file are raw-format and snapshot-access:
    raw-format sometimes want to have filtered child but
        don't set drv->is_filter to true.
    snapshot-access wants only DATA | PRIMARY

Possibly we should implement drv->is_filter_func() handler, to consider
raw-format as filter when it works as filter.. But it's another story.

Note also, that we decrease assignments to bs->file in code: it helps
us restrict modifying this field in further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220726201134.924743-3-vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Vladimir Sementsov-Ogievskiy 046fd84fac block: BlockDriver: add .filtered_child_is_backing field
Unfortunately not all filters use .file child as filtered child. Two
exclusions are mirror_top and commit_top. Happily they both are private
filters. Bad thing is that this inconsistency is observable through qmp
commands query-block / query-named-block-nodes. So, could we just
change mirror_top and commit_top to use file child as all other filter
driver is an open question. Probably, we could do that with some kind
of deprecation period, but how to warn users during it?

For now, let's just add a field so we can distinguish them in generic
code, it will be used in further commits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220726201134.924743-2-vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:14:11 +02:00
Bin Meng 69fbfff95e block: Refactor get_tmp_filename()
At present there are two callers of get_tmp_filename() and they are
inconsistent.

One does:

    /* TODO: extra byte is a hack to ensure MAX_PATH space on Windows. */
    char *tmp_filename = g_malloc0(PATH_MAX + 1);
    ...
    ret = get_tmp_filename(tmp_filename, PATH_MAX + 1);

while the other does:

    s->qcow_filename = g_malloc(PATH_MAX);
    ret = get_tmp_filename(s->qcow_filename, PATH_MAX);

As we can see different 'size' arguments are passed. There are also
platform specific implementations inside the function, and the use
of snprintf is really undesirable.

The function name is also misleading. It creates a temporary file,
not just a filename.

Refactor this routine by changing its name and signature to:

    char *create_tmp_file(Error **errp)

and use g_get_tmp_dir() / g_mkstemp() for a consistent implementation.

While we are here, add some comments to mention that /var/tmp is
preferred over /tmp on non-win32 hosts.

Signed-off-by: Bin Meng <bin.meng@windriver.com>
Message-Id: <20221010040432.3380478-2-bin.meng@windriver.com>
[kwolf: Fixed incorrect errno negation and iotest 051]
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-27 20:13:32 +02:00
Stefan Hajnoczi f4ec04bae9 block: return errors from bdrv_register_buf()
Registering an I/O buffer is only a performance optimization hint but it
is still necessary to return errors when it fails.

Later patches will need to detect errors when registering buffers but an
immediate advantage is that error_report() calls are no longer needed in
block driver .bdrv_register_buf() functions.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20221013185908.1297568-8-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-26 14:56:42 -04:00
Stefan Hajnoczi e8b6535533 block: add BDRV_REQ_REGISTERED_BUF request flag
Block drivers may optimize I/O requests accessing buffers previously
registered with bdrv_register_buf(). Checking whether all elements of a
request's QEMUIOVector are within previously registered buffers is
expensive, so we need a hint from the user to avoid costly checks.

Add a BDRV_REQ_REGISTERED_BUF request flag to indicate that all
QEMUIOVector elements in an I/O request are known to be within
previously registered buffers.

Always pass the flag through to driver read/write functions. There is
little harm in passing the flag to a driver that does not use it.
Passing the flag to drivers avoids changes across many block drivers.
Filter drivers would need to explicitly support the flag and pass
through to their children when the children support it. That's a lot of
code changes and it's hard to remember to do that everywhere, leading to
silent reduced performance when the flag is accidentally dropped.

The only problematic scenario with the approach in this patch is when a
driver passes the flag through to internal I/O requests that don't use
the same I/O buffer. In that case the hint may be set when it should
actually be clear. This is a rare case though so the risk is low.

Some drivers have assert(!flags), which no longer works when
BDRV_REQ_REGISTERED_BUF is passed in. These assertions aren't very
useful anyway since the functions are called almost exclusively by
bdrv_driver_preadv/pwritev() so if we get flags handling right there
then the assertion is not needed.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20221013185908.1297568-7-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-26 14:56:42 -04:00
Stefan Hajnoczi 98b3ddc78b block: use BdrvRequestFlags type for supported flag fields
Use the enum type so GDB displays the enum members instead of printing a
numeric constant.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20221013185908.1297568-6-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-26 14:56:42 -04:00
Stefan Hajnoczi 4f384011c5 block: pass size to bdrv_unregister_buf()
The only implementor of bdrv_register_buf() is block/nvme.c, where the
size is not needed when unregistering a buffer. This is because
util/vfio-helpers.c can look up mappings by address.

Future block drivers that implement bdrv_register_buf() may not be able
to do their job given only the buffer address. Add a size argument to
bdrv_unregister_buf().

Also document the assumptions about
bdrv_register_buf()/bdrv_unregister_buf() calls. The same <host, size>
values that were given to bdrv_register_buf() must be given to
bdrv_unregister_buf().

gcc 11.2.1 emits a spurious warning that img_bench()'s buf_size local
variable might be uninitialized, so it's necessary to silence the
compiler.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20221013185908.1297568-5-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-10-26 14:56:42 -04:00
Emanuele Giuseppe Esposito ba6a910052 blockjob: remove unused functions
These public functions are not used anywhere, thus can be dropped.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20220926093214.506243-21-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-07 12:11:41 +02:00
Emanuele Giuseppe Esposito 3937e12cf8 blockjob.h: categorize fields in struct BlockJob
The same job lock is being used also to protect some of blockjob fields.
Categorize them just as done in job.h.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220926093214.506243-15-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-07 12:11:41 +02:00
Emanuele Giuseppe Esposito f41ab73fa2 blockjob: introduce block_job _locked() APIs
Just as done with job.h, create _locked() functions in blockjob.h

These functions will be later useful when caller has already taken
the lock. All blockjob _locked functions call job _locked functions.

Note: at this stage, job_{lock/unlock} and job lock guard macros
are *nop*.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220926093214.506243-8-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-07 12:11:41 +02:00
Emanuele Giuseppe Esposito fd4b14e299 aio-wait.h: introduce AIO_WAIT_WHILE_UNLOCKED
Same as AIO_WAIT_WHILE macro, but if we are in the Main loop
do not release and then acquire ctx_ 's aiocontext.

Once all Aiocontext locks go away, this macro will replace
AIO_WAIT_WHILE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20220926093214.506243-5-eesposit@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-07 12:11:41 +02:00
Paolo Bonzini 9fb26291ac nbd: remove incorrect coroutine_fn annotations
nbd_co_establish_connection_cancel() cancels a coroutine but is not called
from coroutine context itself, for example in nbd_cancel_in_flight()
and in timer callbacks reconnect_delay_timer_cb() and open_timer_cb().

Reviewed-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220922084924.201610-5-pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-10-07 12:11:40 +02:00
Keith Busch a7c5f67a78 block: move bdrv_qiov_is_aligned to file-posix
There is only user of bdrv_qiov_is_aligned(), so move the alignment
function to there and make it static.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Message-Id: <20220929200523.3218710-2-kbusch@meta.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-09-30 18:43:44 +02:00
Denis V. Lunev 131498f775 block: make serializing requests functions 'void'
Return codes of the following functions are never used in the code:
* bdrv_wait_serialising_requests_locked
* bdrv_wait_serialising_requests
* bdrv_make_request_serialising

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Fam Zheng <fam@euphon.net>
CC: Ronnie Sahlberg <ronniesahlberg@gmail.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Peter Lieven <pl@kamp.de>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20220817083736.40981-3-den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-09-30 18:43:44 +02:00
Denis V. Lunev b2aaf35477 block: pass OnOffAuto instead of bool to block_acct_setup()
We would have one more place for block_acct_setup() calling, which should
not corrupt original value.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
CC: Peter Krempa <pkrempa@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: John Snow <jsnow@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220824095044.166009-2-den@openvz.org>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-09-30 18:02:30 +02:00
Darren Kenny 43f76aac49 nvme: Fix misleading macro when mixed with ternary operator
Using the Parfait source code analyser and issue was found in
hw/nvme/ctrl.c where the macros NVME_CAP_SET_CMBS and NVME_CAP_SET_PMRS
are called with a ternary operatore in the second parameter, resulting
in a potentially unexpected expansion of the form:

  x ? a: b & FLAG_TEST

which will result in a different result to:

  (x ? a: b) & FLAG_TEST.

The macros should wrap each of the parameters in brackets to ensure the
correct result on expansion.

Signed-off-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-07-15 10:40:33 +02:00
Jinhao Fan 3f7fe8de3d hw/nvme: Implement shadow doorbell buffer support
Implement Doorbel Buffer Config command (Section 5.7 in NVMe Spec 1.3)
and Shadow Doorbel buffer & EventIdx buffer handling logic (Section 7.13
in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
command, the nvme_dbbuf_config function tries to associate each existing
SQ and CQ with its Shadow Doorbel buffer and EventIdx buffer address.
Queues created after the Doorbell Buffer Config command will have the
doorbell buffers associated with them when they are initialized.

In nvme_process_sq and nvme_post_cqe, proactively check for Shadow
Doorbell buffer changes instead of wait for doorbell register changes.
This reduces the number of MMIOs.

In nvme_process_db(), update the shadow doorbell buffer value with
the doorbell register value if it is the admin queue. This is a hack
since hosts like Linux NVMe driver and SPDK do not use shadow
doorbell buffer for the admin queue. Copying the doorbell register
value to the shadow doorbell buffer allows us to support these hosts
as well as spec-compliant hosts that use shadow doorbell buffer for
the admin queue.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
[k.jensen: rebased]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-07-15 10:40:33 +02:00
Alberto Faria e97190a405 block: Add bdrv_co_pwrite_sync()
Also convert bdrv_pwrite_sync() to being implemented using
generated_co_wrapper.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220609152744.3891847-9-afaria@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:56 +02:00
Alberto Faria 1d39c7098b block: Implement bdrv_{pread,pwrite,pwrite_zeroes}() using generated_co_wrapper
bdrv_{pread,pwrite}() now return -EIO instead of -EINVAL when 'bytes' is
negative, making them consistent with bdrv_{preadv,pwritev}() and
bdrv_co_{pread,pwrite,preadv,pwritev}().

bdrv_pwrite_zeroes() now also calls trace_bdrv_co_pwrite_zeroes() and
clears the BDRV_REQ_MAY_UNMAP flag when appropriate, which it didn't
previously.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Message-Id: <20220609152744.3891847-8-afaria@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:56 +02:00
Alberto Faria c1458c66b2 block: Make 'bytes' param of bdrv_co_{pread,pwrite,preadv,pwritev}() an int64_t
For consistency with other I/O functions, and in preparation to
implement bdrv_{pread,pwrite}() using generated_co_wrapper.

unsigned int fits in int64_t, so all callers remain correct.

bdrv_check_request32() is called further down the stack and causes -EIO
to be returned if 'bytes' is negative or greater than
BDRV_REQUEST_MAX_BYTES, which in turns never exceeds SIZE_MAX.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Message-Id: <20220609152744.3891847-7-afaria@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:56 +02:00
Alberto Faria ca71a64ee5 block: Make bdrv_co_pwrite() take a const buffer
It does not mutate the buffer.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220609152744.3891847-6-afaria@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:56 +02:00
Alberto Faria 32cc71def9 block: Change bdrv_{pread,pwrite,pwrite_sync}() param order
Swap 'buf' and 'bytes' around for consistency with
bdrv_co_{pread,pwrite}(), and in preparation to implement these
functions using generated_co_wrapper.

Callers were updated using this Coccinelle script:

    @@ expression child, offset, buf, bytes, flags; @@
    - bdrv_pread(child, offset, buf, bytes, flags)
    + bdrv_pread(child, offset, bytes, buf, flags)

    @@ expression child, offset, buf, bytes, flags; @@
    - bdrv_pwrite(child, offset, buf, bytes, flags)
    + bdrv_pwrite(child, offset, bytes, buf, flags)

    @@ expression child, offset, buf, bytes, flags; @@
    - bdrv_pwrite_sync(child, offset, buf, bytes, flags)
    + bdrv_pwrite_sync(child, offset, bytes, buf, flags)

Resulting overly-long lines were then fixed by hand.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20220609152744.3891847-3-afaria@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:55 +02:00
Alberto Faria 53fb7844f0 block: Add a 'flags' param to bdrv_{pread,pwrite,pwrite_sync}()
For consistency with other I/O functions, and in preparation to
implement them using generated_co_wrapper.

Callers were updated using this Coccinelle script:

    @@ expression child, offset, buf, bytes; @@
    - bdrv_pread(child, offset, buf, bytes)
    + bdrv_pread(child, offset, buf, bytes, 0)

    @@ expression child, offset, buf, bytes; @@
    - bdrv_pwrite(child, offset, buf, bytes)
    + bdrv_pwrite(child, offset, buf, bytes, 0)

    @@ expression child, offset, buf, bytes; @@
    - bdrv_pwrite_sync(child, offset, buf, bytes)
    + bdrv_pwrite_sync(child, offset, buf, bytes, 0)

Resulting overly-long lines were then fixed by hand.

Signed-off-by: Alberto Faria <afaria@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-Id: <20220609152744.3891847-2-afaria@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-07-12 12:14:55 +02:00
Vladimir Sementsov-Ogievskiy 15df6e6987 block/block-copy: block_copy(): add timeout_ns parameter
Add possibility to limit block_copy() call in time. To be used in the
next commit.

As timed-out block_copy() call will continue in background anyway (we
can't immediately cancel IO operation), it's important also give user a
possibility to pass a callback, to do some additional actions on
block-copy call finish.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
2022-06-29 10:56:12 +03:00
Emanuele Giuseppe Esposito 7455ff1aa0 aio_wait_kick: add missing memory barrier
It seems that aio_wait_kick always required a memory barrier
or atomic operation in the caller, but nobody actually
took care of doing it.

Let's put the barrier in the function instead, and pair it
with another one in AIO_WAIT_WHILE. Read aio_wait_kick()
comment for further explanation.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220524173054.12651-1-eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Vladimir Sementsov-Ogievskiy 618af89e55 block: simplify handling of try to merge different sized bitmaps
We have too much logic to simply check that bitmaps are of the same
size. Let's just define that hbitmap_merge() and
bdrv_dirty_bitmap_merge_internal() require their argument bitmaps be of
same size, this simplifies things.

Let's look through the callers:

For backup_init_bcs_bitmap() we already assert that merge can't fail.

In bdrv_reclaim_dirty_bitmap_locked() we gracefully handle the error
that can't happen: successor always has same size as its parent, drop
this logic.

In bdrv_merge_dirty_bitmap() we already has assertion and separate
check. Make the check explicit and improve error message.

Signed-off-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20220517111206.23585-4-v.sementsov-og@mail.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Stefan Hajnoczi 3399848b7f block: drop unused bdrv_co_drain() API
bdrv_co_drain() has not been used since commit 9a0cec664e ("mirror:
use bdrv_drained_begin/bdrv_drained_end") in 2016. Remove it so there
are fewer drain scenarios to worry about.

Use bdrv_drained_begin()/bdrv_drained_end() instead. They are "mixed"
functions that can be called from coroutine context. Unlike
bdrv_co_drain(), these functions provide control of the length of the
drained section, which is usually the right thing.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220521122714.3837731-1-stefanha@redhat.com>
Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Alberto Faria <afaria@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-06-24 17:07:06 +02:00
Łukasz Gieryk 11871f53ef hw/nvme: Add support for the Virtualization Management command
With the new command one can:
 - assign flexible resources (queues, interrupts) to primary and
   secondary controllers,
 - toggle the online/offline state of given controller.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-06-23 23:24:29 +02:00
Łukasz Gieryk 746d42b133 hw/nvme: Initialize capability structures for primary/secondary controllers
With four new properties:
 - sriov_v{i,q}_flexible,
 - sriov_max_v{i,q}_per_vf,
one can configure the number of available flexible resources, as well as
the limits. The primary and secondary controller capability structures
are initialized accordingly.

Since the number of available queues (interrupts) now varies between
VF/PF, BAR size calculation is also adjusted.

Signed-off-by: Łukasz Gieryk <lukasz.gieryk@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-06-23 23:24:29 +02:00
Lukasz Maniak 99f48ae7ae hw/nvme: Add support for Secondary Controller List
Introduce handling for Secondary Controller List (Identify command with
CNS value of 15h).

Secondary controller ids are unique in the subsystem, hence they are
reserved by it upon initialization of the primary controller to the
number of sriov_max_vfs.

ID reservation requires the addition of an intermediate controller slot
state, so the reserved controller has the address 0xFFFF.
A secondary controller is in the reserved state when it has no virtual
function assigned, but its primary controller is realized.
Secondary controller reservations are released to NULL when its primary
controller is unregistered.

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-06-23 23:24:28 +02:00
Lukasz Maniak 5e6f963f01 hw/nvme: Add support for Primary Controller Capabilities
Implementation of Primary Controller Capabilities data
structure (Identify command with CNS value of 14h).

Currently, the command returns only ID of a primary controller.
Handling of remaining fields are added in subsequent patches
implementing virtualization enhancements.

Signed-off-by: Lukasz Maniak <lukasz.maniak@linux.intel.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-06-23 23:24:28 +02:00
Eric Blake 58a6fdcc9e nbd/server: Allow MULTI_CONN for shared writable exports
According to the NBD spec, a server that advertises
NBD_FLAG_CAN_MULTI_CONN promises that multiple client connections will
not see any cache inconsistencies: when properly separated by a single
flush, actions performed by one client will be visible to another
client, regardless of which client did the flush.

We always satisfy these conditions in qemu - even when we support
multiple clients, ALL clients go through a single point of reference
into the block layer, with no local caching.  The effect of one client
is instantly visible to the next client.  Even if our backend were a
network device, we argue that any multi-path caching effects that
would cause inconsistencies in back-to-back actions not seeing the
effect of previous actions would be a bug in that backend, and not the
fault of caching in qemu.  As such, it is safe to unconditionally
advertise CAN_MULTI_CONN for any qemu NBD server situation that
supports parallel clients.

Note, however, that we don't want to advertise CAN_MULTI_CONN when we
know that a second client cannot connect (for historical reasons,
qemu-nbd defaults to a single connection while nbd-server-add and QMP
commands default to unlimited connections; but we already have
existing means to let either style of NBD server creation alter those
defaults).  This is visible by no longer advertising MULTI_CONN for
'qemu-nbd -r' without -e, as in the iotest nbd-qemu-allocation.

The harder part of this patch is setting up an iotest to demonstrate
behavior of multiple NBD clients to a single server.  It might be
possible with parallel qemu-io processes, but I found it easier to do
in python with the help of libnbd, and help from Nir and Vladimir in
writing the test.

Signed-off-by: Eric Blake <eblake@redhat.com>
Suggested-by: Nir Soffer <nsoffer@redhat.com>
Suggested-by: Vladimir Sementsov-Ogievskiy <v.sementsov-og@mail.ru>
Message-Id: <20220512004924.417153-3-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-12 13:10:52 +02:00
Eric Blake a5fced4021 qemu-nbd: Pass max connections to blockdev layer
The next patch wants to adjust whether the NBD server code advertises
MULTI_CONN based on whether it is known if the server limits to
exactly one client.  For a server started by QMP, this information is
obtained through nbd_server_start (which can support more than one
export); but for qemu-nbd (which supports exactly one export), it is
controlled only by the command-line option -e/--shared.  Since we
already have a hook function used by qemu-nbd, it's easiest to just
alter its signature to fit our needs.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20220512004924.417153-2-eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-12 13:10:52 +02:00
Markus Armbruster ea9cea93c6 Clean up decorations and whitespace around header guards
Cleaned up with scripts/clean-header-guards.pl.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-5-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
2022-05-11 16:50:32 +02:00
Markus Armbruster 52581c718c Clean up header guards that don't match their file name
Header guard symbols should match their file name to make guard
collisions less likely.

Cleaned up with scripts/clean-header-guards.pl, followed by some
renaming of new guard symbols picked by the script to better ones.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20220506134911.2856099-2-armbru@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
[Change to generated file ebpf/rss.bpf.skeleton.h backed out]
2022-05-11 16:49:06 +02:00
Nicolas Saenz Julienne 71ad4713cc util/event-loop-base: Introduce options to set the thread pool size
The thread pool regulates itself: when idle, it kills threads until
empty, when in demand, it creates new threads until full. This behaviour
doesn't play well with latency sensitive workloads where the price of
creating a new thread is too high. For example, when paired with qemu's
'-mlock', or using safety features like SafeStack, creating a new thread
has been measured take multiple milliseconds.

In order to mitigate this let's introduce a new 'EventLoopBase'
property to set the thread pool size. The threads will be created during
the pool's initialization or upon updating the property's value, remain
available during its lifetime regardless of demand, and destroyed upon
freeing it. A properly characterized workload will then be able to
configure the pool to avoid any latency spikes.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20220425075723.20019-4-nsaenzju@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-05-09 10:43:23 +01:00
Hanna Reitz 15aee7ac95 block: Classify bdrv_get_flags() as I/O function
This function is safe to call in an I/O context, and qcow2_do_open()
does so (invoked in an I/O context by qcow2_co_invalidate_cache()).

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220427114057.36651-2-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-05-04 15:55:23 +02:00
Vladimir Sementsov-Ogievskiy 1466ef6cbe qapi: rename BlockDirtyBitmapMergeSource to BlockDirtyBitmapOrStr
Rename the type to be reused. Old name is "what is it for". To be
natively reused for other needs, let's name it exactly "what is it".

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@openvz.org>
Message-Id: <20220314213226.362217-2-v.sementsov-og@mail.ru>
[eblake: Adjust S-o-b to Vladimir's new email, with permission]
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: John Snow <jsnow@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-04-26 13:13:50 -05:00
Marc-André Lureau 215aea0cb2 include: move qdict_{crumple,flatten} declarations
Move them where they belong, since the functions are implemented in block-qdict.c.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220420132624.2439741-25-marcandre.lureau@redhat.com>
2022-04-21 17:03:51 +04:00
Daniel P. Berrangé 046f98d075 block: pass desired TLS hostname through from block driver client
In

  commit a71d597b98
  Author: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  Date:   Thu Jun 10 13:08:00 2021 +0300

    block/nbd: reuse nbd_co_do_establish_connection() in nbd_open()

the use of the 'hostname' field from the BDRVNBDState struct was
lost, and 'nbd_connect' just hardcoded it to match the IP socket
address. This was a harmless bug at the time since we block use
with anything other than IP sockets.

Shortly though, we want to allow the caller to override the hostname
used in the TLS certificate checks. This is to allow for TLS
when doing port forwarding or tunneling. Thus we need to reinstate
the passing along of the 'hostname'.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20220304193610.3293146-3-berrange@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2022-03-07 15:58:42 -06:00
Vladimir Sementsov-Ogievskiy 1c14eaabdb block: introduce snapshot-access block driver
The new block driver simply utilizes snapshot-access API of underlying
block node.

In further patches we want to use it like this:

[guest]                   [NBD export]
   |                            |
   | root                       | root
   v                 file       v
[copy-before-write]<------[snapshot-access]
   |           |
   | file      | target
   v           v
[active-disk] [temp.img]

This way, NBD client will be able to read snapshotted state of active
disk, when active disk is continued to be written by guest. This is
known as "fleecing", and currently uses another scheme based on qcow2
temporary image which backing file is active-disk. New scheme comes
with benefits - see next commit.

The other possible application is exporting internal snapshots of
qcow2, like this:

[guest]          [NBD export]
   |                  |
   | root             | root
   v       file       v
[qcow2]<---------[snapshot-access]

For this, we'll need to implement snapshot-access API handlers in
qcow2 driver, and improve snapshot-access block driver (and API) to
make it possible to select snapshot by name. Another thing to improve
is size of snapshot. Now for simplicity we just use size of bs->file,
which is OK for backup, but for qcow2 snapshots export we'll need to
imporve snapshot-access API to get size of snapshot.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20220303194349.2304213-12-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:31 +01:00
Vladimir Sementsov-Ogievskiy ce14f3b407 block/io: introduce block driver snapshot-access API
Add new block driver handlers and corresponding generic wrappers.
It will be used to allow copy-before-write filter to provide
reach fleecing interface in further commit.

In future this approach may be used to allow reading qcow2 internal
snapshots, for example to export them through NBD.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-11-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:31 +01:00
Vladimir Sementsov-Ogievskiy 3b7ca26bdf block/reqlist: add reqlist_wait_all()
Add function to wait for all intersecting requests.
To be used in the further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-10-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy a6426475a7 block/dirty-bitmap: introduce bdrv_dirty_bitmap_status()
Add a convenient function similar with bdrv_block_status() to get
status of dirty bitmap.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy d088e6a48a block: intoduce reqlist
Split intersecting-requests functionality out of block-copy to be
reused in copy-before-write filter.

Note: while being here, fix tiny typo in MAINTAINERS.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy 177541e671 block/block-copy: add block_copy_reset()
Split block_copy_reset() out of block_copy_reset_unallocated() to be
used separately later.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy 1f7252e868 block/block-copy: block_copy_state_new(): add bitmap parameter
This will be used in the following commit to bring "incremental" mode
to copy-before-write filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-4-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy 34ffacb7f4 block/dirty-bitmap: bdrv_merge_dirty_bitmap(): add return value
That simplifies handling failure in existing code and in further new
usage of bdrv_merge_dirty_bitmap().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220303194349.2304213-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:33:30 +01:00
Vladimir Sementsov-Ogievskiy 45e62b464a block: fix preallocate filter: don't do unaligned preallocate requests
There is a bug in handling BDRV_REQ_NO_WAIT flag: we still may wait in
wait_serialising_requests() if request is unaligned. And this is
possible for the only user of this flag (preallocate filter) if
underlying file is unaligned to its request_alignment on start.

So, we have to fix preallocate filter to do only aligned preallocate
requests.

Next, we should fix generic block/io.c somehow. Keeping in mind that
preallocate is the only user of BDRV_REQ_NO_WAIT and that we have to
fix its behavior now, it seems more safe to just assert that we never
use BDRV_REQ_NO_WAIT with unaligned requests and add corresponding
comment. Let's do so.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20220215121609.38570-1-vsementsov@virtuozzo.com>
[hreitz: Rebased on block GS/IO split]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-03-07 09:19:20 +01:00
Emanuele Giuseppe Esposito abc5a79c64 block_int-common.h: split function pointers in BdrvChildClass
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-28-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 69c0bf1197 block_int-common.h: split function pointers in BlockDriver
Similar to the header split, also the function pointers in BlockDriver
can be split in I/O and global state.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-26-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 6b573efec8 include/block/snapshot: global state API + assertions
Snapshots run also under the BQL, so they all are
in the global state API. The aiocontext lock that they hold
is currently an overkill and in future could be removed.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-23-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 4ad3387637 include/block/blockjob.h: global state API
blockjob functions run always under the BQL lock.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-19-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 2015c4c28d include/block/blockjob_int.h: split header into I/O and GS API
Since the I/O functions are not many, keep a single file.
Also split the function pointers in BlockJobDriver.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20220303151616.325444-16-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 696bf4c78c block: introduce assert_bdrv_graph_writable
We want to be sure that the functions that write the child and
parent list of a bs are under BQL and drain.

BQL prevents from concurrent writings from the GS API, while
drains protect from I/O.

TODO: drains are missing in some functions using this assert.
Therefore a proper assertion will fail. Because adding drains
requires additional discussions, they will be added in future
series.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-15-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 967d7905d1 IO_CODE and IO_OR_GS_CODE for block_int I/O API
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-14-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito ebc2752b08 include/block/block_int: split header into I/O and global state API
Similarly to the previous patch, split block_int.h
in block_int-io.h and block_int-global-state.h

block_int-common.h contains the structures shared between
the two headers, and the functions that can't be categorized as
I/O or global state.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-12-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 384a48fb74 IO_CODE and IO_OR_GS_CODE for block I/O API
Mark all I/O functions with IO_CODE, and all "I/O OR GS" with
IO_OR_GS_CODE.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-6-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 3b491a9056 include/block/block: split header into I/O and global state API
block.h currently contains a mix of functions:
some of them run under the BQL and modify the block layer graph,
others are instead thread-safe and perform I/O in iothreads.
Some others can only be called by either the main loop or the
iothread running the AioContext (and not other iothreads),
and using them in another thread would cause deadlocks, and therefore
it is not ideal to define them as I/O.

It is not easy to understand which function is part of which
group (I/O vs GS vs "I/O or GS"), and this patch aims to clarify it.

The "GS" functions need the BQL, and often use
aio_context_acquire/release and/or drain to be sure they
can modify the graph safely.
The I/O function are instead thread safe, and can run in
any AioContext.
"I/O or GS" functions run instead in the main loop or in
a single iothread, and use BDRV_POLL_WHILE().

By splitting the header in two files, block-io.h
and block-global-state.h we have a clearer view on what
needs what kind of protection. block-common.h
contains common structures shared by both headers.

block.h is left there for legacy and to avoid changing
all includes in all c files that use the block APIs.

Assertions are added in the next patch.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220303151616.325444-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:18:25 +01:00
Emanuele Giuseppe Esposito 3b71719462 block: rename bdrv_invalidate_cache_all, blk_invalidate_cache and test_sync_op_invalidate_cache
Following the bdrv_activate renaming, change also the name
of the respective callers.

bdrv_invalidate_cache_all -> bdrv_activate_all
blk_invalidate_cache -> blk_activate
test_sync_op_invalidate_cache -> test_sync_op_activate

No functional change intended.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-5-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Emanuele Giuseppe Esposito a94750d956 block: introduce bdrv_activate
This function is currently just a wrapper for bdrv_invalidate_cache(),
but in future will contain the code of bdrv_co_invalidate_cache() that
has to always be protected by BQL, and leave the rest in the I/O
coroutine.

Replace all bdrv_invalidate_cache() invokations with bdrv_activate().

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220209105452.1694545-4-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:40 +01:00
Emanuele Giuseppe Esposito c1019d1687 crypto: perform permission checks under BQL
Move the permission API calls into driver-specific callbacks
that always run under BQL. In this case, bdrv_crypto_luks
needs to perform permission checks before and after
qcrypto_block_amend_options(). The problem is that the caller,
block_crypto_amend_options_generic_luks(), can also run in I/O
from .bdrv_co_amend(). This does not comply with Global State-I/O API split,
as permissions API must always run under BQL.

Firstly, introduce .bdrv_amend_pre_run() and .bdrv_amend_clean()
callbacks. These two callbacks are guaranteed to be invoked under
BQL, respectively before and after .bdrv_co_amend().
They take care of performing the permission checks
in the same way as they are currently done before and after
qcrypto_block_amend_options().
These callbacks are in preparation for next patch, where we
delete the original permission check. Right now they just add redundant
control.

Then, call .bdrv_amend_pre_run() before job_start in
qmp_x_blockdev_amend(), so that it will be run before the job coroutine
is created and stay in the main loop.
As a cleanup, use JobDriver's .clean() callback to call
.bdrv_amend_clean(), and run amend-specific cleanup callbacks under BQL.

After this patch, permission failures occur early in the blockdev-amend
job to update a LUKS volume's keys.  iotest 296 must now expect them in
x-blockdev-amend's QMP reply instead of waiting for the actual job to
fail later.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220209105452.1694545-2-eesposit@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20220304153729.711387-6-hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-03-04 18:14:39 +01:00
Naveen Nagar 44219b6029 hw/nvme: 64-bit pi support
This adds support for one possible new protection information format
introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard
and 48-bit reference tag. This version does not support storage tags.

Like the CRC16 support already present, this uses a software
implementation of CRC64 (so it is naturally pretty slow). But its good
enough for verification purposes.

This may go nicely hand-in-hand with the support that Keith submitted
for the Linux kernel[1].

  [1]: https://lore.kernel.org/linux-nvme/20220126165214.GA1782352@dhcp-10-100-145-180.wdc.com/T/

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-03-03 09:30:21 +01:00
Naveen Nagar 763c05dfb0 hw/nvme: add support for the lbafee hbs feature
Add support for up to 64 LBA formats through the LBAFEE field of the
Host Behavior Support feature.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-03-03 09:28:49 +01:00
Naveen Nagar d0c0697b9e hw/nvme: add host behavior support feature
Add support for getting and setting the Host Behavior Support feature.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-03-03 09:28:48 +01:00
Klaus Jensen e321b4cdc2 hw/nvme: add support for zoned random write area
Add support for TP 4076 ("Zoned Random Write Area"), v2021.08.23
("Ratified").

This adds three new namespace parameters: "zoned.numzrwa" (number of
zrwa resources, i.e. number of zones that can have a zrwa),
"zoned.zrwas" (zrwa size in LBAs), "zoned.zrwafg" (granularity in LBAs
for flushes).

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14 08:58:29 +01:00
Klaus Jensen 25872031e1 hw/nvme: add ozcs enum
Add enumeration for OZCS values.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14 08:58:29 +01:00
Klaus Jensen 6190d92ff7 hw/nvme: add struct for zone management send
Add struct for Zone Management Send in preparation for more zone send
flags.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2022-02-14 08:58:29 +01:00
Emanuele Giuseppe Esposito 751486c185 block.h: remove outdated comment
The comment "disk I/O throttling" doesn't make any sense at all
any more. It was added in commit 0563e19151 to describe
bdrv_io_limits_enable()/disable(), which were removed in commit
97148076, so the comment is just a forgotten leftover.

Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20220131125615.74612-1-eesposit@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2022-02-01 13:28:53 +01:00
Peter Maydell 1cd2ad11d3 Block layer patches
- qemu-storage-daemon: Add vhost-user-blk help
 - block-backend: Fix use-after-free for BDS pointers after aio_poll()
 - qemu-img: Fix sparseness of output image with unaligned ranges
 - vvfat: Fix crashes in read-write mode
 - Fix device deletion events with -device JSON syntax
 - Code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmHhf5gRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9YBMA//ZkaIigVsfjRoeUh2MccgOuvYpXZtq4po
 q7l6AwGLbBpTt5Fy468gYhwmXuwHCapTMRmvWf6mpb86jtJ6vdbE16L0Z4/Z9iiW
 C0w69fsAAP9XyI+f7Q5FNtzz3jWztKowgyhkU33izbwYM7dm5Xw1q5bDkOiIBNoO
 d8cdxLC1oQGEWJmGLgmbaM/ow0iDogFpT8zU5j0VE3uK01si8pblWlXm1SM3nOK9
 b4uROqKYsTzTny/zX7KxD4SX3UGKYK393rQxr5HdmTiW14uGfB+EVfBxJmn07Qch
 lWM/v9tYoP1aVbR6IL5osAQdmbDYX0zsRMq5UA+dQ6OqnE3GpluVrYIFoaUSoShf
 S704hYdWgO0sKfpAYgJgGo6y0mglnp9Z7xO4Ng3XUNj0gvfgnOe3CdCdXIOeTFwC
 eP+KlFvbUT2xpTqI6ttBgKCcwKHA3hgWCnlo39C80bL1ZVKWSqh6zORfwmptouQ3
 BmuhEqZRyoYrknrTELN+lIKK2gP6MLup/ymeXWOOOE58KSpmrdeBAXmgJNXX3ucx
 lAWGsIz0CxdaKQoZpKpikho4rhrGkqZ33B3H7mdcsKS6zYzmsDIqa9FzUjtpvN2V
 K/jXlK7dv58Y+LLzpcuJAf8HNnitA107WD5RA1s5nTw0ahD2GwR4UPzEhnSO9/nT
 yZ3dGUysj7Q=
 =dnBv
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- qemu-storage-daemon: Add vhost-user-blk help
- block-backend: Fix use-after-free for BDS pointers after aio_poll()
- qemu-img: Fix sparseness of output image with unaligned ranges
- vvfat: Fix crashes in read-write mode
- Fix device deletion events with -device JSON syntax
- Code cleanups

# gpg: Signature made Fri 14 Jan 2022 13:50:16 GMT
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream:
  iotests/testrunner.py: refactor test_field_width
  block: drop BLK_PERM_GRAPH_MOD
  qemu-img: make is_allocated_sectors() more efficient
  iotests: Test qemu-img convert of zeroed data cluster
  vvfat: Fix vvfat_write() for writes before the root directory
  vvfat: Fix size of temporary qcow file
  iotests/308: Fix for CAP_DAC_OVERRIDE
  iotests/stream-error-on-reset: New test
  block-backend: prevent dangling BDS pointers across aio_poll()
  qapi/block: Restrict vhost-user-blk to CONFIG_VHOST_USER_BLK_SERVER
  qemu-storage-daemon: Add vhost-user-blk help
  docs: Correct 'vhost-user-blk' spelling
  softmmu: fix device deletion events with -device JSON syntax
  include/sysemu/blockdev.h: remove drive_get_max_devs
  include/sysemu/blockdev.h: remove drive_mark_claimed_by_board and inline drive_def
  block_int: make bdrv_backing_overridden static

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2022-01-14 15:56:30 +00:00
Vladimir Sementsov-Ogievskiy 64631f3681 block: drop BLK_PERM_GRAPH_MOD
First, this permission never protected a node from being changed, as
generic child-replacing functions don't check it.

Second, it's a strange thing: it presents a permission of parent node
to change its child. But generally, children are replaced by different
mechanisms, like jobs or qmp commands, not by nodes.

Graph-mod permission is hard to understand. All other permissions
describe operations which done by parent node on its child: read,
write, resize. Graph modification operations are something completely
different.

The only place where BLK_PERM_GRAPH_MOD is used as "perm" (not shared
perm) is mirror_start_job, for s->target. Still modern code should use
bdrv_freeze_backing_chain() to protect from graph modification, if we
don't do it somewhere it may be considered as a bug. So, it's a bit
risky to drop GRAPH_MOD, and analyzing of possible loss of protection
is hard. But one day we should do it, let's do it now.

One more bit of information is that locking the corresponding byte in
file-posix doesn't make sense at all.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210902093754.2352-1-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Emanuele Giuseppe Esposito fa8fc1d09f block_int: make bdrv_backing_overridden static
bdrv_backing_overridden is only used in block.c, so there is
no need to leave it in block_int.h

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20211215121140.456939-2-eesposit@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2022-01-14 12:03:16 +01:00
Stefan Hajnoczi 826cc32423 aio-posix: split poll check from ready handler
Adaptive polling measures the execution time of the polling check plus
handlers called when a polled event becomes ready. Handlers can take a
significant amount of time, making it look like polling was running for
a long time when in fact the event handler was running for a long time.

For example, on Linux the io_submit(2) syscall invoked when a virtio-blk
device's virtqueue becomes ready can take 10s of microseconds. This
can exceed the default polling interval (32 microseconds) and cause
adaptive polling to stop polling.

By excluding the handler's execution time from the polling check we make
the adaptive polling calculation more accurate. As a result, the event
loop now stays in polling mode where previously it would have fallen
back to file descriptor monitoring.

The following data was collected with virtio-blk num-queues=2
event_idx=off using an IOThread. Before:

168k IOPS, IOThread syscalls:

  9837.115 ( 0.020 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 16, iocbpp: 0x7fcb9f937db0)    = 16
  9837.158 ( 0.002 ms): IO iothread1/620155 write(fd: 103, buf: 0x556a2ef71b88, count: 8)                         = 8
  9837.161 ( 0.001 ms): IO iothread1/620155 write(fd: 104, buf: 0x556a2ef71b88, count: 8)                         = 8
  9837.163 ( 0.001 ms): IO iothread1/620155 ppoll(ufds: 0x7fcb90002800, nfds: 4, tsp: 0x7fcb9f1342d0, sigsetsize: 8) = 3
  9837.164 ( 0.001 ms): IO iothread1/620155 read(fd: 107, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.174 ( 0.001 ms): IO iothread1/620155 read(fd: 105, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.176 ( 0.001 ms): IO iothread1/620155 read(fd: 106, buf: 0x7fcb9f939cc0, count: 512)                        = 8
  9837.209 ( 0.035 ms): IO iothread1/620155 io_submit(ctx_id: 140512552468480, nr: 32, iocbpp: 0x7fca7d0cebe0)    = 32

174k IOPS (+3.6%), IOThread syscalls:

  9809.566 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0cdd62be0)    = 32
  9809.625 ( 0.001 ms): IO iothread1/623061 write(fd: 103, buf: 0x5647cfba5f58, count: 8)                         = 8
  9809.627 ( 0.002 ms): IO iothread1/623061 write(fd: 104, buf: 0x5647cfba5f58, count: 8)                         = 8
  9809.663 ( 0.036 ms): IO iothread1/623061 io_submit(ctx_id: 140539805028352, nr: 32, iocbpp: 0x7fd0d0388b50)    = 32

Notice that ppoll(2) and eventfd read(2) syscalls are eliminated because
the IOThread stays in polling mode instead of falling back to file
descriptor monitoring.

As usual, polling is not implemented on Windows so this patch ignores
the new io_poll_read() callback in aio-win32.c.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20211207132336.36627-2-stefanha@redhat.com

[Fixed up aio_set_event_notifier() calls in
tests/unit/test-fdmon-epoll.c added after this series was queued.
--Stefan]

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2022-01-12 17:09:39 +00:00
Vladimir Sementsov-Ogievskiy 985cac8f20 blockjob: drop BlockJob.blk field
It's unused now (except for permission handling)[*]. The only reasonable
user of it was block-stream job, recently updated to use own blk. And
other block jobs prefer to use own source node related objects.

So, the arguments of dropping the field are:

 - block jobs prefer not to use it
 - block jobs usually has more then one node to operate on, and better
   to operate symmetrically (for example has both source and target
   blk's in specific block-job state structure)

*: BlockJob.blk is used to keep some permissions. We simply move
permissions to block-job child created in block_job_create() together
with blk.

In mirror, we just should not care anymore about restoring state of
blk. Most probably this code could be dropped long ago, after dropping
bs->job pointer. Now it finally goes away together with BlockJob.blk
itself.

iotest 141 output is updated, as "bdrv_has_blk(bs)" check in
qmp_blockdev_del() doesn't fail (we don't have blk now). Still, new
error message looks even better.

In iotest 283 we need to add a job id, otherwise "Invalid job ID"
happens now earlier than permission check (as permissions moved from
blk to block-job node).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-12-28 15:18:59 +01:00
Vladimir Sementsov-Ogievskiy df9a316505 blockjob: implement and use block_job_get_aio_context
We are going to drop BlockJob.blk. So let's retrieve block job context
from underlying job instead of main node.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Nikita Lapshin <nikita.lapshin@virtuozzo.com>
2021-12-28 15:18:23 +01:00
Stefano Garzarella 68d7946648 linux-aio: add dev_max_batch parameter to laio_io_unplug()
Between the submission of a request and the unplug, other devices
with larger limits may have been queued new requests without flushing
the batch.

Using the new `dev_max_batch` parameter, laio_io_unplug() can check
if the batch exceeds the device limit to flush the current batch.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-4-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Stefano Garzarella 512da21101 linux-aio: add dev_max_batch parameter to laio_co_submit()
This new parameter can be used by block devices to limit the
Linux AIO batch size more than the limit set by the AIO context.

file-posix backend supports this, passing its `aio-max-batch` option
previously added.

Add an helper function to calculate the maximum batch size.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20211026162346.253081-3-sgarzare@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02 13:03:35 +01:00
Paolo Bonzini cc07162953 block: introduce max_hw_iov for use in scsi-generic
Linux limits the size of iovecs to 1024 (UIO_MAXIOV in the kernel
sources, IOV_MAX in POSIX).  Because of this, on some host adapters
requests with many iovecs are rejected with -EINVAL by the
io_submit() or readv()/writev() system calls.

In fact, the same limit applies to SG_IO as well.  To fix both the
EINVAL and the possible performance issues from using fewer iovecs
than allowed by Linux (some HBAs have max_segments as low as 128),
introduce a separate entry in BlockLimits to hold the max_segments
value from sysfs.  This new limit is used only for SG_IO and clamped
to bs->bl.max_iov anyway, just like max_hw_transfer is clamped to
bs->bl.max_transfer.

Reported-by: Halil Pasic <pasic@linux.ibm.com>
Cc: Hanna Reitz <hreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org
Cc: qemu-stable@nongnu.org
Fixes: 18473467d5 ("file-posix: try BLKSECTGET on block devices too, do not round to power of 2", 2021-06-25)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210923130436.1187591-1-pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-10-06 10:25:55 +02:00
Vladimir Sementsov-Ogievskiy 621d17378a block: implement bdrv_new_open_driver_opts()
Add version of bdrv_new_open_driver() that supports QDict options.
We'll use it in further commit.

Simply add one more argument to bdrv_new_open_driver() is worse, as
there are too many invocations of bdrv_new_open_driver() to update
then.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210920115538.264372-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-10-06 10:25:55 +02:00
Emanuele Giuseppe Esposito a6297e1ade include/block.h: remove outdated comment
There are a couple of errors in bdrv_drained_begin header comment:
- block_job_pause does not exist anymore, it has been replaced
  with job_pause in b15de82867
- job_pause is automatically invoked as a .drained_begin callback
  (child_job_drained_begin) by the child_job BdrvChildClass struct
  in blockjob.c. So no additional pause should be required.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210903113800.59970-1-eesposit@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-10-06 10:25:55 +02:00
Vladimir Sementsov-Ogievskiy 0c8022876f block: use int64_t instead of int in driver discard handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver discard handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_pdiscard in
block/io.c. It is already prepared to work with 64bit requests, but
pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver.

Let's look at all updated functions:

blkdebug: all calculations are still OK, thanks to
  bdrv_check_qiov_request().
  both rule_check and bdrv_co_pdiscard are 64bit

blklogwrites: pass to blk_loc_writes_co_log which is 64bit

blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK

copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to
  cbw_do_copy_before_write which is 64bit

file-posix: one handler calls raw_account_discard() is 64bit and both
  handlers calls raw_do_pdiscard(). Update raw_do_pdiscard, which pass
  to RawPosixAIOData::aio_nbytes, which is 64bit (and calls
  raw_account_discard())

gluster: somehow, third argument of glfs_discard_async is size_t.
  Let's set max_pdiscard accordingly.

iscsi: iscsi_allocmap_set_invalid is 64bit,
  !is_byte_request_lun_aligned is 64bit.
  list.num is uint32_t. Let's clarify max_pdiscard and
  pdiscard_alignment.

mirror_top: pass to bdrv_mirror_top_do_write() which is
  64bit

nbd: protocol limitation. max_pdiscard is alredy set strict enough,
  keep it as is for now.

nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits
  to nvme_refresh_limits().

preallocate: pass to bdrv_co_pdiscard() which is 64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(),
  qcow2_cluster_discard() is 64bit.

raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too.

throttle: pass to bdrv_co_pdiscard() which is 64bit and to
  throttle_group_co_io_limits_intercept() which is 64bit as well.

test-block-iothread: bytes argument is unused

Great! Now all drivers are prepared to handle 64bit discard requests,
or else have explicit max_pdiscard limits.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy 39af49c0d7 block: make BlockLimits::max_pdiscard 64bit
We are going to support 64 bit discard requests. Now update the
limit variable. It's absolutely safe. The variable is set in some
drivers, and used in bdrv_co_pdiscard().

Update also max_pdiscard variable in bdrv_co_pdiscard(), so that
bdrv_co_pdiscard() is now prepared for 64bit requests. The remaining
logic including num, offset and bytes variables is already
supporting 64bit requests.

So the only thing that prevents 64 bit requests is limiting
max_pdiscard variable to INT_MAX in bdrv_co_pdiscard().
We'll drop this limitation after updating all block drivers.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-10-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy f34b2bcf8c block: use int64_t instead of int in driver write_zeroes handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver write_zeroes handlers bytes parameter to int64_t.

The only caller of all updated function is bdrv_co_do_pwrite_zeroes().

bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of
callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s
max_write_zeroes is limited to INT_MAX. So, updated functions all are
safe, they will not get "bytes" larger than before.

Still, let's look through all updated functions, and add assertions to
the ones which are actually unprepared to values larger than INT_MAX.
For these drivers also set explicit max_pwrite_zeroes limit.

Let's go:

blkdebug: calculations can't overflow, thanks to
  bdrv_check_qiov_request() in generic layer. rule_check() and
  bdrv_co_pwrite_zeroes() both have 64bit argument.

blklogwrites: pass to blk_log_writes_co_log() with 64bit argument.

blkreplay, copy-on-read, filter-compress: pass to
  bdrv_co_pwrite_zeroes() which is OK

copy-before-write: Calls cbw_do_copy_before_write() and
  bdrv_co_pwrite_zeroes, both have 64bit argument.

file-posix: both handler calls raw_do_pwrite_zeroes, which is updated.
  In raw_do_pwrite_zeroes() calculations are OK due to
  bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes
  which is uint64_t.
  Check also where that uint64_t gets handed:
  handle_aiocb_write_zeroes_block() passes a uint64_t[2] to
  ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate()
  which takes off_t (and we compile to always have 64-bit off_t), as
  does handle_aiocb_write_zeroes_unmap. All look safe.

gluster: bytes go to GlusterAIOCB::size which is int64_t and to
  glfs_zerofill_async works with off_t.

iscsi: Aha, here we deal with iscsi_writesame16_task() that has
  uint32_t num_blocks argument and iscsi_writesame16_task() has
  uint16_t argument. Make comments, add assertions and clarify
  max_pwrite_zeroes calculation.
  iscsi_allocmap_() functions already has int64_t argument
  is_byte_request_lun_aligned is simple to update, do it.

mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t
  argument

nbd: Aha, here we have protocol limitation, and NBDRequest::len is
  uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are
  OK for now.

nvme: Again, protocol limitation. And no inherent limit for
  write-zeroes at all. But from code that calculates cdw12 it's obvious
  that we do have limit and alignment. Let's clarify it. Also,
  obviously the code is not prepared to handle bytes=0. Let's handle
  this case too.
  trace events already 64bit

preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both
  64bit.

rbd: pass to qemu_rbd_start_co() which is 64bit.

qcow2: offset + bytes and alignment still works good (thanks to
  bdrv_check_qiov_request()), so tail calculation is OK
  qcow2_subcluster_zeroize() has 64bit argument, should be OK
  trace events updated

qed: qed_co_request wants int nb_sectors. Also in code we have size_t
  used for request length which may be 32bit. So, let's just keep
  INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and
  don't care.

raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both
  64bit.

throttle: Both throttle_group_co_io_limits_intercept() and
  bdrv_co_pwrite_zeroes() are 64bit.

vmdk: pass to vmdk_pwritev which is 64bit

quorum: pass to quorum_co_pwritev() which is 64bit

Hooray!

At this point all block drivers are prepared to support 64bit
write-zero requests, or have explicitly set max_pwrite_zeroes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: use <= rather than < in assertions relying on max_pwrite_zeroes]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy d544f5d3b1 block: make BlockLimits::max_pwrite_zeroes 64bit
We are going to support 64 bit write-zeroes requests. Now update the
limit variable. It's absolutely safe. The variable is set in some
drivers, and used in bdrv_co_do_pwrite_zeroes().

Update also max_write_zeroes variable in bdrv_co_do_pwrite_zeroes(), so
that bdrv_co_do_pwrite_zeroes() is now prepared to 64bit requests. The
remaining logic including num, offset and bytes variables is already
supporting 64bit requests.

So the only thing that prevents 64 bit requests is limiting
max_write_zeroes variable to INT_MAX in bdrv_co_do_pwrite_zeroes().
We'll drop this limitation after updating all block drivers.

Ah, we also have bdrv_check_request32() in bdrv_co_pwritev_part(). It
will be modified to do bdrv_check_request() for write-zeroes path.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy 485350497b block: use int64_t instead of uint64_t in copy_range driver handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver copy_range handlers parameters which are already
64bit to signed type.

Now let's consider all callers. Simple

  git grep '\->bdrv_co_copy_range'

shows the only caller:

  bdrv_co_copy_range_internal(), which does bdrv_check_request32(),
  so everything is OK.

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_co_copy_range_\(from\|to\)\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

shows no more callers. So, we are done.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-6-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:32 -05:00
Vladimir Sementsov-Ogievskiy e75abedab7 block: use int64_t instead of uint64_t in driver write handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver write handlers parameters which are already 64bit to
signed type.

While being here, convert also flags parameter to be BdrvRequestFlags.

Now let's consider all callers. Simple

  git grep '\->bdrv_\(aio\|co\)_pwritev\(_part\)\?'

shows that's there three callers of driver function:

 bdrv_driver_pwritev() and bdrv_driver_pwritev_compressed() in
 block/io.c, both pass int64_t, checked by bdrv_check_qiov_request() to
 be non-negative.

 qcow2_save_vmstate() does bdrv_check_qiov_request().

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_\(aio\|co\)_pwritev\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

shows several callers:

qcow2:
  qcow2_co_truncate() write at most up to @offset, which is checked in
    generic qcow2_co_truncate() by bdrv_check_request().
  qcow2_co_pwritev_compressed_task() pass the request (or part of the
    request) that already went through normal write path, so it should
    be OK

qcow:
  qcow_co_pwritev_compressed() pass int64_t, it's updated by this patch

quorum:
  quorum_co_pwrite_zeroes() pass int64_t and int - OK

throttle:
  throttle_co_pwritev_compressed() pass int64_t, it's updated by this
  patch

vmdk:
  vmdk_co_pwritev_compressed() pass int64_t, it's updated by this
  patch

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Vladimir Sementsov-Ogievskiy f7ef38dd13 block: use int64_t instead of uint64_t in driver read handlers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert driver read handlers parameters which are already 64bit to
signed type.

While being here, convert also flags parameter to be BdrvRequestFlags.

Now let's consider all callers. Simple

  git grep '\->bdrv_\(aio\|co\)_preadv\(_part\)\?'

shows that's there three callers of driver function:

 bdrv_driver_preadv() in block/io.c, passes int64_t, checked by
   bdrv_check_qiov_request() to be non-negative.

 qcow2_load_vmstate() does bdrv_check_qiov_request().

 do_perform_cow_read() has uint64_t argument. And a lot of things in
 qcow2 driver are uint64_t, so converting it is big job. But we must
 not work with requests that don't satisfy bdrv_check_qiov_request(),
 so let's just assert it here.

Still, the functions may be called directly, not only by drv->...
Let's check:

git grep '\.bdrv_\(aio\|co\)_preadv\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done

The only one such caller:

    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
    ...
    ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);

in tests/unit/test-bdrv-drain.c, and it's OK obviously.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix typos]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Vladimir Sementsov-Ogievskiy 558902cc3d qcow2: check request on vmstate save/load path
We modify the request by adding an offset to vmstate. Let's check the
modified request. It will help us to safely move .bdrv_co_preadv_part
and .bdrv_co_pwritev_part to int64_t type of offset and bytes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29 13:46:31 -05:00
Naveen Nagar 07a3dfa7c4 hw/nvme: fix verification of select field in namespace attachment
Fix is added to check for reserved value in select field for
namespace attachment

CC: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Naveen Nagar <naveen.n1@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-09-24 08:43:52 +02:00
Hanna Reitz 5a1cfd2150 block: Clarify that @bytes is no limit on *pnum
.bdrv_co_block_status() implementations are free to return a *pnum that
exceeds @bytes, because bdrv_co_block_status() in block/io.c will clamp
*pnum as necessary.

On the other hand, if drivers' implementations return values for *pnum
that are as large as possible, our recently introduced block-status
cache will become more effective.

So, make a note in block_int.h that @bytes is no upper limit for *pnum.

Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210812084148.14458-4-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2021-09-15 15:54:07 +02:00
Hanna Reitz 0bc329fbb0 block: block-status cache for data regions
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-15 15:54:06 +02:00
Hanna Reitz 33ff4c9e08 block: Drop BDS comment regarding bdrv_append()
There is a comment above the BDS definition stating care must be taken
to consider handling newly added fields in bdrv_append().

Actually, this comment should have said "bdrv_swap()" as of 4ddc07cac
(nine years ago), and in any case, bdrv_swap() was dropped in
8e419aefa (six years ago).  So no such care is necessary anymore.

Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210812084148.14458-2-hreitz@redhat.com>
2021-09-15 15:54:06 +02:00
Vladimir Sementsov-Ogievskiy abde8ac2a5 block/block-copy: block_copy_state_new(): drop extra arguments
The only caller pass copy_range and compress both false. Let's just
drop these arguments.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-35-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 14:38:08 +02:00
Vladimir Sementsov-Ogievskiy b518e9e9ef block/backup: move cluster size calculation to block-copy
The main consumer of cluster-size is block-copy. Let's calculate it
here instead of passing through backup-top.

We are going to publish copy-before-write filter soon, so it will be
created through options. But we don't want for now to make explicit
option for cluster-size, let's continue to calculate it automatically.
So, now is the time to get rid of cluster_size argument for
bdrv_cbw_append().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-10-vsementsov@virtuozzo.com>
[hreitz: Add qemu/error-report.h include to block/block-copy.c]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 14:03:11 +02:00
Vladimir Sementsov-Ogievskiy f8b9504bac block/block-copy: introduce block_copy_set_copy_opts()
We'll need a possibility to set compress and use_copy_range options
after initialization of the state. So make corresponding part of
block_copy_state_new() separate and public.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-8-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Vladimir Sementsov-Ogievskiy 49577723d4 block-copy: move detecting fleecing scheme to block-copy
We want to simplify initialization interface of copy-before-write
filter as we are going to make it public. So, let's detect fleecing
scheme exactly in block-copy code, to not pass this information through
extra levels.

Why not just set BDRV_REQ_SERIALISING unconditionally: because we are
going to implement new more efficient fleecing scheme which will not
rely on backing feature.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Vladimir Sementsov-Ogievskiy bd8f4c42c8 block: introduce bdrv_replace_child_bs()
Add function to transactionally replace bs inside BdrvChild.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-2-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
2021-09-01 12:57:31 +02:00
Klaus Jensen a316aa50e6 hw/nvme: use symbolic names for registers
Add the NvmeBarRegs enum and use these instead of explicit register
offsets.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-07-26 21:09:38 +02:00
Klaus Jensen 5d45edbeac hw/nvme: split pmrmsc register into upper and lower
The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
make up the 64 bit logical PMRMSC register.

Make it so.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-07-26 21:09:38 +02:00
Stefano Garzarella 1793ad0247 iothread: add aio-max-batch parameter
The `aio-max-batch` parameter will be propagated to AIO engines
and it will be used to control the maximum number of queued requests.

When there are in queue a number of requests equal to `aio-max-batch`,
the engine invokes the system call to forward the requests to the kernel.

This parameter allows us to control the maximum batch size to reduce
the latency that requests might accumulate while queued in the AIO
engine queue.

If `aio-max-batch` is equal to 0 (default value), the AIO engine will
use its default maximum batch size value.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20210721094211.69853-3-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-07-21 13:47:50 +01:00
Peter Maydell 42e1d798a6 Block layer patches
- Make blockdev-reopen stable
 - Remove deprecated qemu-img backing file without format
 - rbd: Convert to coroutines and add write zeroes support
 - rbd: Updated MAINTAINERS
 - export/fuse: Allow other users access to the export
 - vhost-user: Fix backends without multiqueue support
 - Fix drive-backup transaction endless drained section
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmDoRdIRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9bvgQ/+Ogq24n1UOQc8FEKRYfyhajNToQ9ofzWN
 iLiblSGx2QDq+CauD3qdu6z7DLlqEXeoM4NYM462oIPumptQj+9XZt7ftfh6FLWW
 4yJEbjfnVKOba+vFdJ+E0DStwnPaxYdnrPGd53cwHZfbZh4ZmkpTM350mzHHiLTb
 KYKOgWd+UHZbkYeCVNYTGe30SRBiKeAecTpsVZ5HVhe7LstjByuy5stk8dytLpdV
 YqdKOToZfOp77XiHr8YcLLp1HHBGlr5hw73V4SDas0beCp7hqtnAqsTYyXBue4xO
 4zfD4Gujr5JVOCb0crDTyOmOQY5E+y2dqFoOUF00D5AoN2vj4nfQ9ESkbqlE9BVh
 mgJ1izSokYlN2X8rIwGXNR5fbxRmxxfkAA4rScNRytj1KxDHyrDxrp/k8YFemxSQ
 qwgb/FBm0fcr69evPRzovKwZFhcyPremksluHQE4rZZ66qBQ2cGuDJPE7PWVTpPH
 67JCrIVK/O6n5p+4ilFHmQQ3aP3ol0frMFcboYVRchJ2MhIDTsfFL3F/tTK8hy86
 AmrrdQ1BQIAoKNOKnAmOSOUdExM55OcfPmX69+AhEk2GeWP6kgz5Pks4H3qCiKGf
 YoRk8F1V+N4q+C0mFFovB61bNQ6COIlBuzmD9EtmpDD/Ta3Wib+3ZnoGVIdPS+OI
 jyj+qJxd9z4=
 =kH+r
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- Make blockdev-reopen stable
- Remove deprecated qemu-img backing file without format
- rbd: Convert to coroutines and add write zeroes support
- rbd: Updated MAINTAINERS
- export/fuse: Allow other users access to the export
- vhost-user: Fix backends without multiqueue support
- Fix drive-backup transaction endless drained section

# gpg: Signature made Fri 09 Jul 2021 13:49:22 BST
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (28 commits)
  block: Make blockdev-reopen stable API
  iotests: Test reopening multiple devices at the same time
  block: Support multiple reopening with x-blockdev-reopen
  block: Acquire AioContexts during bdrv_reopen_multiple()
  block: Add bdrv_reopen_queue_free()
  qcow2: Fix dangling pointer after reopen for 'file'
  qemu-img: Improve error for rebase without backing format
  qemu-img: Require -F with -b backing image
  qcow2: Prohibit backing file changes in 'qemu-img amend'
  blockdev: fix drive-backup transaction endless drained section
  vhost-user: Fix backends without multiqueue support
  MAINTAINERS: add block/rbd.c reviewer
  block/rbd: fix type of task->complete
  iotests/fuse-allow-other: Test allow-other
  iotests/308: Test +w on read-only FUSE exports
  export/fuse: Let permissions be adjustable
  export/fuse: Give SET_ATTR_SIZE its own branch
  export/fuse: Add allow-other option
  export/fuse: Pass default_permissions for mount
  util/uri: do not check argument of uri_free()
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-07-10 19:55:21 +01:00
Kevin Wolf 6cf42ca2f9 block: Acquire AioContexts during bdrv_reopen_multiple()
As the BlockReopenQueue can contain nodes in multiple AioContexts, only
one of which may be locked when AIO_WAIT_WHILE() can be called, we can't
let the caller lock the right contexts. Instead, individually lock the
AioContext of a single node when iterating the queue.

Reintroduce bdrv_reopen() as a wrapper for reopening a single node that
drains the node and temporarily drops the AioContext lock for
bdrv_reopen_multiple().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210708114709.206487-4-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-09 13:19:11 +02:00
Alberto Garcia ab5b522879 block: Add bdrv_reopen_queue_free()
Move the code to free a BlockReopenQueue to a separate function.
It will be used in a subsequent patch.

[ kwolf: Also free explicit_options and options, and explicitly
  qobject_ref() the value when it continues to be used. This makes
  future memory leaks less likely. ]

Signed-off-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210708114709.206487-3-kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-07-09 13:19:11 +02:00
Stefan Hajnoczi 0f08586c71 util/async: add a human-readable name to BHs for debugging
It can be difficult to debug issues with BHs in production environments.
Although BHs can usually be identified by looking up their ->cb()
function pointer, this requires debug information for the program. It is
also not possible to print human-readable diagnostics about BHs because
they have no identifier.

This patch adds a name to each BH. The name is not unique per instance
but differentiates between cb() functions, which is usually enough. It's
done by changing aio_bh_new() and friends to macros that stringify cb.

The next patch will use the name field when reporting leaked BHs.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210414200247.917496-2-stefanha@redhat.com>
2021-07-05 11:40:32 +01:00
Peter Maydell 9c2647f750 Block layer patches
- Supporting changing 'file' in x-blockdev-reopen
 - ssh: add support for sha256 host key fingerprints
 - vhost-user-blk: Implement reconnection during realize
 - introduce QEMU_AUTO_VFREE
 - Don't require password of encrypted backing file for image creation
 - Code cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCAAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmDclTcRHGt3b2xmQHJl
 ZGhhdC5jb20ACgkQfwmycsiPL9bzaw/+PYQ9vPG+ZROWl633TUOQu7IYGZynXCET
 ZHlV2JlXnFH8QoO8A53U72cgVg+GlwDpiMCCtjEGMG1yfMBNe+DXR1wFUMDne1Gs
 qIFX4gIVpPGDi3gPeQvefLCwoN8VXwIxvCJj40YR9BY0cqQR6joI81kjVlqKemqB
 cHG4qJHmnphihSLep/dd2BRInkeHWXWU63jK4d7ctdCwHZPNDf0u0FEraxEqS/d0
 5tfdIl3haghhoDRRamyh5bLZBCW1KlpfTR98RdspcwpSlXq/FgV5N9CFa15XSYfQ
 rinSClSIOpLzG90+tBihROVBXvbugO5qZVQTG0Yg1tt4FGG8Cmiqf9MNXC5yctNg
 WnaQQipx/37deafGA4jqorZBJd1R87JLJTBFTpkB47XAFq/ltqsTDhrrfdS+jail
 Fd+qyqWg0Jx3JjdhSUpHvDKBBsErjoxtoyQIGakSreXGmj2UY6BFmGii7lnLZNLo
 +E81C7exnkCIGKkOHy+y9DkpVY/PEJKCG7uwcyy+F2qOqGUOxKLuZomWcLodo6Vf
 /eJ/UsLJt6HhXhXq/1ZZHmaORn8Lft1yr/9azoGXZ7er+jZcbEkhbcZmET+Y6ykq
 Vox/GmLkhyVkM96MA0lMW5hHPWUbF29m9Jmq3nNfvFWBcILEs4uWSlbd0M2oAmWj
 ung9sKIV/8s=
 =aB0a
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

- Supporting changing 'file' in x-blockdev-reopen
- ssh: add support for sha256 host key fingerprints
- vhost-user-blk: Implement reconnection during realize
- introduce QEMU_AUTO_VFREE
- Don't require password of encrypted backing file for image creation
- Code cleanups

# gpg: Signature made Wed 30 Jun 2021 17:00:55 BST
# gpg:                using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg:                issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [full]
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74  56FE 7F09 B272 C88F 2FD6

* remotes/kevin/tags/for-upstream: (24 commits)
  vhost-user-blk: Implement reconnection during realize
  vhost-user-blk: Factor out vhost_user_blk_realize_connect()
  vhost: Distinguish errors in vhost_dev_get_config()
  vhost-user-blk: Add Error parameter to vhost_user_blk_start()
  vhost: Return 0/-errno in vhost_dev_init()
  vhost: Distinguish errors in vhost_backend_init()
  vhost: Add Error parameter to vhost_dev_init()
  block/ssh: add support for sha256 host key fingerprints
  block/commit: use QEMU_AUTO_VFREE
  introduce QEMU_AUTO_VFREE
  iotests: Test replacing files with x-blockdev-reopen
  block: Allow changing bs->file on reopen
  block: BDRVReopenState: drop replace_backing_bs field
  block: move supports_backing check to bdrv_set_file_or_backing_noperm()
  block: bdrv_reopen_parse_backing(): simplify handling implicit filters
  block: bdrv_reopen_parse_backing(): don't check frozen child
  block: bdrv_reopen_parse_backing(): don't check aio context
  block: introduce bdrv_set_file_or_backing_noperm()
  block: introduce bdrv_remove_file_or_backing_child()
  block: comment graph-modifying function not updating permissions
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-07-02 11:46:32 +01:00
Peter Maydell 1ec2cd0ce2 hw/nvme patches
* namespace eui64 support (Heinrich)
 * aiocb refactoring (Klaus)
 * controller parameter for auto zone transitioning (Niklas)
 * misc fixes and additions (Gollu, Klaus, Keith)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmDbap8ACgkQTeGvMW1P
 Denz9wf9GtU4z7G0pQpFlv1smWPvQfv9qlkGiRvGcIQjEgPjcKjEsJC1CqtG6nZ5
 BGYzey706iM7NAs6wPkyUyHgWXl+N/Ku5nNBvNJuzTN5J9aHgnqehe8civvS6ONU
 0WtAQEEZCLu0Qbvw6KawCShXrJ6Jpmu+bYGggYEy5+baumLF0Qj50l2wRiDbImNu
 rE3iZuhEenwnNNSTTkyt2RT8lSzizAtth4bAqzl3Zw1Q3NO+qDCL0wPOWD7YuD8p
 LTZ8/ojB1+YUI3JrMM4MlzbJyGstdk2Y5zq6QpAcf86agaaFEmS0iqjOic52ImpL
 VzmZZZ0W7kmhJVc5H/Q5y/KAFIHnFA==
 =efjI
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

hw/nvme patches

* namespace eui64 support (Heinrich)
* aiocb refactoring (Klaus)
* controller parameter for auto zone transitioning (Niklas)
* misc fixes and additions (Gollu, Klaus, Keith)

# gpg: Signature made Tue 29 Jun 2021 19:46:55 BST
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (23 commits)
  hw/nvme: add 'zoned.zasl' to documentation
  hw/nvme: fix pin-based interrupt behavior (again)
  hw/nvme: fix missing check for PMR capability
  hw/nvme: documentation fix
  hw/nvme: fix endianess conversion and add controller list
  Partially revert "hw/block/nvme: drain namespaces on sq deletion"
  hw/nvme: reimplement format nvm to allow cancellation
  hw/nvme: reimplement zone reset to allow cancellation
  hw/nvme: reimplement the copy command to allow aio cancellation
  hw/nvme: add dw0/1 to the req completion trace event
  hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
  hw/nvme: remove assert from nvme_get_zone_by_slba
  hw/nvme: save reftag when generating pi
  hw/nvme: reimplement dsm to allow cancellation
  hw/nvme: add nvme_block_status_all helper
  hw/nvme: reimplement flush to allow cancellation
  hw/nvme: default for namespace EUI-64
  hw/nvme: namespace parameter for EUI-64
  hw/nvme: fix csi field for cns 0x00 and 0x11
  hw/nvme: add param to control auto zone transitioning to zone state closed
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-06-30 21:09:27 +01:00
Alberto Garcia ecd30d2d97 block: Allow changing bs->file on reopen
When the x-blockdev-reopen was added it allowed reconfiguring the
graph by replacing backing files, but changing the 'file' option was
forbidden. Because of this restriction some operations are not
possible, notably inserting and removing block filters.

This patch adds support for replacing the 'file' option. This is
similar to replacing the backing file and the user is likewise
responsible for the correctness of the resulting graph, otherwise this
can lead to data corruption.

Signed-off-by: Alberto Garcia <berto@igalia.com>
 [vsementsov: bdrv_reopen_parse_file_or_backing() is modified a lot]
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610120537.196183-9-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-29 16:51:00 +02:00
Vladimir Sementsov-Ogievskiy 3d0e8743f0 block: BDRVReopenState: drop replace_backing_bs field
It's used only in bdrv_reopen_commit(). "backing" is covered by the
loop through all children except for case when we removed backing child
during reopen.

Make it more obvious and drop extra boolean field: qdict_del will not
fail if there is no such entry.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610120537.196183-8-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-29 16:51:00 +02:00
Gollu Appalanaidu 5f4eb94dbb hw/nvme: fix endianess conversion and add controller list
Add the controller identifiers list CNS 0x13, available list of ctrls
in NVM Subsystem that may or may not be attached to namespaces.

In Identify Ctrl List of the CNS 0x12 and 0x13 no endian conversion
for the nsid field.

These two CNS values shows affect when there exists a Subsystem.
Added condition if there is no Subsystem return invalid field in
command.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-06-29 07:16:25 +02:00
Klaus Jensen 2a132309e4 hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
The nvme_check_prinfo() and nvme_dif_check() functions operate on the
16 bit "control" member of the NvmeCmd. These functions do not otherwise
operate on an NvmeCmd or an NvmeRequest, so change them to expect the
actual 4 bit PRINFO field and add constants that work on this field as
well.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-06-29 07:16:25 +02:00
Gollu Appalanaidu 18de1526ba hw/nvme: add identify namespace flbas/mc enums
Add enums for the Identify Namespace FLBAS and MC fields.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: squashed separate flbas/mc commits into one]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-06-29 07:16:25 +02:00
Peter Maydell 6512fa497c * Some Meson test conversions
* KVM dirty page ring buffer fix
 * KVM TSC scaling support
 * Fixes for SG_IO with /dev/sdX devices
 * (Non)support for host devices on iOS
 * -smp cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV5TIUHHBib256aW5p
 QHJlZGhhdC5jb20ACgkQv/vSX3jHroNySgf9HMnAtLWp36p2ie74o4rrW9x3Ojrm
 fuCq2i3q3nBhEKqqiyp+QQJGubE44mXEZQYtX89tOfSFgg7o6SLIoAcQQskr+In6
 f9I1jjpSVTls0AaGUO+iRn9KiTzeMWeo1l6Wht+2mfBL5XpNLaLLu/T49uPhjlvN
 zFi5blgILxIYMqMCD1joDBnIiqqDozr0p7QzRZD8re25sRhg0NHQxyIh3OxBPpJ9
 3Jhy1Us0cDWrwvPbxz6S5N0zesLu1ojtojVPy6iKjyHSv+6eiE6bHyIbS8duG5+H
 zBC1THOsUV3X1UvPAjuSNlgfNeobGAzmxSJ/evLgWWkpkx1mLtsnL5RARQ==
 =YoOL
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/bonzini-gitlab/tags/for-upstream' into staging

* Some Meson test conversions
* KVM dirty page ring buffer fix
* KVM TSC scaling support
* Fixes for SG_IO with /dev/sdX devices
* (Non)support for host devices on iOS
* -smp cleanups

# gpg: Signature made Fri 25 Jun 2021 15:16:18 BST
# gpg:                using RSA key F13338574B662389866C7682BFFBD25F78C7AE83
# gpg:                issuer "pbonzini@redhat.com"
# gpg: Good signature from "Paolo Bonzini <bonzini@gnu.org>" [full]
# gpg:                 aka "Paolo Bonzini <pbonzini@redhat.com>" [full]
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4  E2F7 7E15 100C CD36 69B1
#      Subkey fingerprint: F133 3857 4B66 2389 866C  7682 BFFB D25F 78C7 AE83

* remotes/bonzini-gitlab/tags/for-upstream: (28 commits)
  machine: reject -smp dies!=1 for non-PC machines
  machine: pass QAPI struct to mc->smp_parse
  machine: add error propagation to mc->smp_parse
  machine: move common smp_parse code to caller
  machine: move dies from X86MachineState to CpuTopology
  file-posix: handle EINTR during ioctl
  block: detect DKIOCGETBLOCKCOUNT/SIZE before use
  block: try BSD disk size ioctls one after another
  block: check for sys/disk.h
  block: feature detection for host block support
  file-posix: try BLKSECTGET on block devices too, do not round to power of 2
  block: add max_hw_transfer to BlockLimits
  block-backend: align max_transfer to request alignment
  osdep: provide ROUND_DOWN macro
  scsi-generic: pass max_segments via max_iov field in BlockLimits
  file-posix: fix max_iov for /dev/sg devices
  KVM: Fix dirty ring mmap incorrect size due to renaming accident
  configure, meson: convert libusbredir detection to meson
  configure, meson: convert libcacard detection to meson
  configure, meson: convert libusb detection to meson
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-06-28 21:04:22 +01:00
Emanuele Giuseppe Esposito 149009bef4 block-copy: atomic .cancelled and .finished fields in BlockCopyCallState
By adding acquire/release pairs, we ensure that .ret and .error_is_read
fields are written by block_copy_dirty_clusters before .finished is true,
and that they are read by API user after .finished is true.

The atomic here are necessary because the fields are concurrently modified
in coroutines, and read outside.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20210624072043.180494-6-eesposit@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2021-06-25 14:33:51 +03:00
Paolo Bonzini 24b36e9813 block: add max_hw_transfer to BlockLimits
For block host devices, I/O can happen through either the kernel file
descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
or the SCSI passthrough ioctl SG_IO.

In the latter case, the size of each transfer can be limited by the
HBA, while for file descriptor I/O the kernel is able to split and
merge I/O in smaller pieces as needed.  Applying the HBA limits to
file descriptor I/O results in more system calls and suboptimal
performance, so this patch splits the max_transfer limit in two:
max_transfer remains valid and is used in general, while max_hw_transfer
is limited to the maximum hardware size.  max_hw_transfer can then be
included by the scsi-generic driver in the block limits page, to ensure
that the stricter hardware limit is used.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-25 10:54:13 +02:00
Vladimir Sementsov-Ogievskiy 97cf89259e nbd/client-connection: add option for non-blocking connection attempt
We'll need a possibility of non-blocking nbd_co_establish_connection(),
so that it returns immediately, and it returns success only if a
connections was previously established in background.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-30-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 12:21:22 -05:00
Vladimir Sementsov-Ogievskiy 43cb34dede nbd/client-connection: return only one io channel
block/nbd doesn't need underlying sioc channel anymore. So, we can
update nbd/client-connection interface to return only one top-most io
channel, which is more straight forward.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-27-vsementsov@virtuozzo.com>
[eblake: squash in Vladimir's fixes for uninit usage caught by clang]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 12:20:53 -05:00
Vladimir Sementsov-Ogievskiy e0e67cbe58 nbd/client-connection: implement connection retry
Add an option for a thread to retry connecting until it succeeds. We'll
use nbd/client-connection both for reconnect and for initial connection
in nbd_open(), so we need a possibility to use same NBDClientConnection
instance to connect once in nbd_open() and then use retry semantics for
reconnect.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-21-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Vladimir Sementsov-Ogievskiy 130d49baa5 nbd/client-connection: add possibility of negotiation
Add arguments and logic to support nbd negotiation in the same thread
after successful connection.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210610100802.5888-20-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Vladimir Sementsov-Ogievskiy 5276c87c12 nbd: move connection code from block/nbd to nbd/client-connection
We now have bs-independent connection API, which consists of four
functions:

  nbd_client_connection_new()
  nbd_client_connection_release()
  nbd_co_establish_connection()
  nbd_co_establish_connection_cancel()

Move them to a separate file together with NBDClientConnection
structure which becomes private to the new API.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210610100802.5888-18-vsementsov@virtuozzo.com>
[eblake: comment tweaks]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:53 -05:00
Paolo Bonzini 5f50be9b58 async: the main AioContext is only "current" if under the BQL
If we want to wake up a coroutine from a worker thread, aio_co_wake()
currently does not work.  In that scenario, aio_co_wake() calls
aio_co_enter(), but there is no current AioContext and therefore
qemu_get_current_aio_context() returns the main thread.  aio_co_wake()
then attempts to call aio_context_acquire() instead of going through
aio_co_schedule().

The default case of qemu_get_current_aio_context() was added to cover
synchronous I/O started from the vCPU thread, but the main and vCPU
threads are quite different.  The main thread is an I/O thread itself,
only running a more complicated event loop; the vCPU thread instead
is essentially a worker thread that occasionally calls
qemu_mutex_lock_iothread().  It is only in those critical sections
that it acts as if it were the home thread of the main AioContext.

Therefore, this patch detaches qemu_get_current_aio_context() from
iothreads, which is a useless complication.  The AioContext pointer
is stored directly in the thread-local variable, including for the
main loop.  Worker threads (including vCPU threads) optionally behave
as temporary home threads if they have taken the big QEMU lock,
but if that is not the case they will always schedule coroutines
on remote threads via aio_co_schedule().

With this change, the stub qemu_mutex_iothread_locked() must be changed
from true to false.  The previous value of true was needed because the
main thread did not have an AioContext in the thread-local variable,
but now it does have one.

Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210609122234.544153-1-pbonzini@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[eblake: tweak commit message per Vladimir's review]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-06-18 10:59:52 -05:00
Paolo Bonzini c0d4aa82f8 vl: plumb keyval-based options into -readconfig
Let -readconfig support parsing command line options into QDict or
QemuOpts.  This will be used to add back support for objects in
-readconfig.

Cc: Markus Armbruster <armbru@redhat.com>
Cc: qemu-stable@nongnu.org
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210524105752.3318299-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-06-04 13:50:04 +02:00
Vladimir Sementsov-Ogievskiy 260242a833 block: drop BlockBackendRootState::read_only
Instead of keeping additional boolean field, let's store the
information in BDRV_O_RDWR bit of BlockBackendRootState::open_flags.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-4-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Vladimir Sementsov-Ogievskiy 975da07374 block: drop BlockDriverState::read_only
This variable is just a cache for !(bs->open_flags & BDRV_O_RDWR),
which we have to synchronize everywhere. Let's just drop it and
consistently use bdrv_is_read_only().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210527154056.70294-3-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Vladimir Sementsov-Ogievskiy fb62b58896 block/vvfat: child_vvfat_qcow: add .get_parent_aio_context, fix crash
Commit 3ca1f32257
"block: BdrvChildClass: add .get_parent_aio_context handler" introduced
new handler and commit 228ca37e12
"block: drop ctx argument from bdrv_root_attach_child" made a generic
use of it. But 3ca1f32257 didn't update
child_vvfat_qcow. Fix that.

Before that fix the command

./build/qemu-system-x86_64 -usb -device usb-storage,drive=fat16 \
  -drive file=fat:rw:fat-type=16:"<path of a host folder>",id=fat16,format=raw,if=none

crashes:

1  bdrv_child_get_parent_aio_context (c=0x559d62426d20)
    at ../block.c:1440
2  bdrv_attach_child_common
    (child_bs=0x559d62468190, child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     perm=3, shared_perm=4, opaque=0x559d62445690,
     child=0x7ffc74c2acc8, tran=0x559d6246ddd0, errp=0x7ffc74c2ae60)
    at ../block.c:2795
3  bdrv_attach_child_noperm
    (parent_bs=0x559d62445690, child_bs=0x559d62468190,
     child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     child=0x7ffc74c2acc8, tran=0x559d6246ddd0, errp=0x7ffc74c2ae60) at
    ../block.c:2855
4  bdrv_attach_child
    (parent_bs=0x559d62445690, child_bs=0x559d62468190,
     child_name=0x559d606f9e3d "write-target",
     child_class=0x559d60c58d20 <child_vvfat_qcow>, child_role=3,
     errp=0x7ffc74c2ae60) at ../block.c:2953
5  bdrv_open_child
    (filename=0x559d62464b80 "/var/tmp/vl.h3TIS4",
     options=0x559d6246ec20, bdref_key=0x559d606f9e3d "write-target",
     parent=0x559d62445690, child_class=0x559d60c58d20
     <child_vvfat_qcow>, child_role=3, allow_none=false,
     errp=0x7ffc74c2ae60) at ../block.c:3351
6  enable_write_target (bs=0x559d62445690, errp=0x7ffc74c2ae60) at
   ../block/vvfat.c:3176
7  vvfat_open (bs=0x559d62445690, options=0x559d6244adb0, flags=155650,
               errp=0x7ffc74c2ae60) at ../block/vvfat.c:1236
8  bdrv_open_driver (bs=0x559d62445690, drv=0x559d60d4f7e0
                     <bdrv_vvfat>, node_name=0x0,
                     options=0x559d6244adb0, open_flags=155650,
                     errp=0x7ffc74c2af70) at ../block.c:1557
9  bdrv_open_common (bs=0x559d62445690, file=0x0,
                     options=0x559d6244adb0, errp=0x7ffc74c2af70) at
...

(gdb) fr 1
 #1  0x0000559d603ea3bf in bdrv_child_get_parent_aio_context
     (c=0x559d62426d20) at ../block.c:1440
1440        return c->klass->get_parent_aio_context(c);
 (gdb) p c->klass
$1 = (const BdrvChildClass *) 0x559d60c58d20 <child_vvfat_qcow>
 (gdb) p c->klass->get_parent_aio_context
$2 = (AioContext *(*)(BdrvChild *)) 0x0

Fixes: 3ca1f32257
Fixes: 228ca37e12
Reported-by: John Arbuckle <programmingkidx@gmail.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210524101257.119377-2-vsementsov@virtuozzo.com>
Tested-by: John Arbuckle <programmingkidx@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-06-02 14:23:20 +02:00
Paolo Bonzini b02629550d replication: move include out of root directory
The replication.h file is included from migration/colo.c and tests/unit/test-replication.c,
so it should be in include/.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-26 14:49:46 +02:00
Peter Maydell 15e147b3c7 emulated nvme updates
* various fixes (Gollu Appalanaidu)
 * refactoring (me)
 * move to hw/nvme from hw/block (me)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmCiNFEACgkQTeGvMW1P
 DeldNQgAgF/FdkGPNNxNlT5Ei7zmycjbKKf7QDSfn4BXgP9a+7Ccil60kGFCX7Ib
 ERiTOFgofJ587GcYVghBSQv4zYKW+3uXIwfFSUxrMi1hG48j8+z7NyxeWUSXSeSi
 AqohaE70iZyoUx7HQWwRhDsoTnlgRfmpK5Ju9jVRE0be28Y5Z9/hSoyti3PeJPNb
 Igg4bYXTGnXIgJcAfXQBe4iHBn9iZrJB+mp59a8Avb62tP2zwgiRED2M0FNHWyhV
 s/Ra1SiUuEzDJROD47mlxCiqfstknZJ31EI34wJwwKJU2rSQUEMcSRZKc+k7dDHG
 qmQAGPwqS7EnW1uJDxSUFZ944nx+dg==
 =gum+
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

emulated nvme updates

* various fixes (Gollu Appalanaidu)
* refactoring (me)
* move to hw/nvme from hw/block (me)

# gpg: Signature made Mon 17 May 2021 10:16:01 BST
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request:
  hw/nvme: move nvme emulation out of hw/block
  hw/block/nvme: move zoned constraints checks
  hw/block/nvme: remove irrelevant zone resource checks
  hw/block/nvme: remove num_namespaces member
  hw/block/nvme: streamline namespace array indexing
  hw/block/nvme: add metadata offset helper
  hw/block/nvme: cache lba and ms sizes
  hw/block/nvme: replace nvme_ns_status
  hw/block/nvme: remove non-shared defines from header file
  hw/block/nvme: cleanup includes
  hw/block/nvme: consolidate header files
  hw/block/nvme: rename __nvme_select_ns_iocs
  hw/block/nvme: rename __nvme_advance_zone_wp
  hw/block/nvme: rename __nvme_zrm_open
  hw/block/nvme: align with existing style
  hw/block/nvme: function formatting fix
  hw/block/nvme: fix io-command set profile feature
  hw/block/nvme: consider metadata read aio return value in compare
  hw/block/nvme: rename reserved fields declarations
  hw/block/nvme: remove redundant invalid_lba_range trace

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-05-18 12:22:16 +01:00
Gollu Appalanaidu 312c3531bb hw/block/nvme: align with existing style
While QEMU coding style prefers lowercase hexadecimals in constants, the
NVMe subsystem uses the format from the NVMe specifications in comments,
i.e. 'h' suffix instead of '0x' prefix.

Fix this up across the code base.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: updated message; added conversion in a couple of missing comments]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-05-17 09:15:13 +02:00
Gollu Appalanaidu 9a31c61583 hw/block/nvme: rename reserved fields declarations
Align the 'rsvd1' reserved field declaration in NvmeBar with existing
style.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: minor commit message fixup]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-05-17 09:15:13 +02:00
Vladimir Sementsov-Ogievskiy c61ebf362d write-threshold: deal with includes
"qemu/typedefs.h" is enough for include/block/write-threshold.h header
with forward declaration of BlockDriverState. Also drop extra includes
from block/write-threshold.c and tests/unit/test-write-threshold.c

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210506090621.11848-9-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy 2e0e9cbd89 block/write-threshold: drop extra APIs
bdrv_write_threshold_exceeded() is unused.

bdrv_write_threshold_is_set() is used only to double check the value of
bs->write_threshold_offset in tests. No real sense in it (both tests do
check real value with help of bdrv_write_threshold_get())

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-5-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted commit message as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy ad578c56d5 block: drop write notifiers
They are unused now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-3-vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy 94783301b8 block/write-threshold: don't use write notifiers
write-notifiers are used only for write-threshold. New code for such
purpose should create filters.

Let's better special-case write-threshold and drop write notifiers at
all. (Actually, write-threshold is special-cased anyway, as the only
user of write-notifiers)

So, create a new direct interface for bdrv_co_write_req_prepare() and
drop all write-notifier related logic from write-threshold.c.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210506090621.11848-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
[mreitz: Adjusted comment as per Eric's suggestion]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Vladimir Sementsov-Ogievskiy 9c785cd714 mirror: stop cancelling in-flight requests on non-force cancel in READY
If mirror is READY than cancel operation is not discarding the whole
result of the operation, but instead it's a documented way get a
point-in-time snapshot of source disk.

So, we should not cancel any requests if mirror is READ and
force=false. Let's fix that case.

Note, that bug that we have before this commit is not critical, as the
only .bdrv_cancel_in_flight implementation is nbd_cancel_in_flight()
and it cancels only requests waiting for reconnection, so it should be
rare case.

Fixes: 521ff8b779
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210421075858.40197-1-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-05-14 16:14:10 +02:00
Kevin Wolf 35b7f4abd5 block: Add BDRV_O_NO_SHARE for blk_new_open()
Normally, blk_new_open() just shares all permissions. This was fine
originally when permissions only protected against uses in the same
process because no other part of the code would actually get to access
the block nodes opened with blk_new_open(). However, since we use it for
file locking now, unsharing permissions becomes desirable.

Add a new BDRV_O_NO_SHARE flag that is used in blk_new_open() to unshare
any permissions that can be unshared.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210422164344.283389-2-kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy ecb776bd93 block: refactor bdrv_child_set_perm_safe() transaction action
Old interfaces dropped, nobody directly calls
bdrv_child_set_perm_abort() and bdrv_child_set_perm_commit(), so we can
use personal state structure for the action and stop exploiting
BdrvChild structure. Also, drop "_safe" suffix which is redundant now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-35-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy 72373e40fb block: bdrv_reopen_multiple: refresh permissions on updated graph
Move bdrv_reopen_multiple to new paradigm of permission update:
first update graph relations, then do refresh the permissions.

We have to modify reopen process in file-posix driver: with new scheme
we don't have prepared permissions in raw_reopen_prepare(), so we
should reconfigure fd in raw_check_perm(). Still this seems more native
and simple anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-31-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy 1e4c797c75 block: make bdrv_refresh_limits() to be a transaction action
To be used in further commit.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-28-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy 3108a15cf0 block: introduce bdrv_drop_filter()
Using bdrv_replace_node() for removing filter is not good enough: it
keeps child reference of the filter, which may conflict with original
top node during permission update.

Instead let's create new interface, which will do all graph
modifications first and then update permissions.

Let's modify bdrv_replace_node_common(), allowing it additionally drop
backing chain child link pointing to new node. This is quite
appropriate for bdrv_drop_intermediate() and makes possible to add
new bdrv_drop_filter() as a simple wrapper.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-24-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:48 +02:00
Vladimir Sementsov-Ogievskiy 53e96d1e9f block: make bdrv_reopen_{prepare,commit,abort} private
These functions are called only from bdrv_reopen_multiple() in block.c.
No reason to publish them.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-8-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Vladimir Sementsov-Ogievskiy 228ca37e12 block: drop ctx argument from bdrv_root_attach_child
Passing parent aio context is redundant, as child_class and parent
opaque pointer are enough to retrieve it. Drop the argument and use new
bdrv_child_get_parent_aio_context() interface.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-7-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Vladimir Sementsov-Ogievskiy 3ca1f32257 block: BdrvChildClass: add .get_parent_aio_context handler
Add new handler to get aio context and implement it in all child
classes. Add corresponding public interface to be used soon.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210428151804.439460-6-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-04-30 12:27:47 +02:00
Klaus Jensen e548935634 hw/block/nvme: fix handling of private namespaces
Prior to this patch, if a private nvme-ns device (that is, a namespace
that is not linked to a subsystem) is wired up to an nvme-subsys linked
nvme controller device, the device fails to verify that the namespace id
is unique within the subsystem. NVM Express v1.4b, Section 6.1.6 ("NSID
and Namespace Usage") states that because the device supports Namespace
Management, "NSIDs *shall* be unique within the NVM subsystem".

Additionally, prior to this patch, private namespaces are not known to
the subsystem and the namespace is considered exclusive to the
controller with which it is initially wired up to. However, this is not
the definition of a private namespace; per Section 1.6.33 ("private
namespace"), a private namespace is just a namespace that does not
support multipath I/O or namespace sharing, which means "that it is only
able to be attached to one controller at a time".

Fix this by always allocating namespaces in the subsystem (if one is
linked to the controller), regardless of the shared/private status of
the namespace. Whether or not the namespace is shareable is controlled
by a new `shared` nvme-ns parameter.

Finally, this fix allows the nvme-ns `subsys` parameter to be removed,
since the `shared` parameter now serves the purpose of attaching the
namespace to all controllers in the subsystem upon device realization.
It is invalid to have an nvme-ns namespace device with a linked
subsystem without the parent nvme controller device also being linked to
one and since the nvme-ns devices will unconditionally be "attached" (in
QEMU terms that is) to an nvme controller device through an NvmeBus, the
nvme-ns namespace device can always get a reference to the subsystem of
the controller it is explicitly (using 'bus=' parameter) or implicitly
attaching to.

Fixes: e570768566 ("hw/block/nvme: support for shared namespace in subsystem")
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
2021-04-07 10:48:31 +02:00
Peter Maydell 8a40754bca emulated nvme updates and fixes
* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
 * fix for a bug in zone management receive (me)
 * metadata and end-to-end data protection support (me & Gollu Appalanaidu)
 * verify support (Gollu Appalanaidu)
 * multiple lba formats and format nvm support (Minwoo Im)
 
 and a couple of misc refactorings from me.
 
 v2:
   - remove an unintended submodule update. Argh.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBTP0wACgkQTeGvMW1P
 DelZvAf+Ijw77YLz2t0P97CiwoOAG4FADwo61WQJo2AHyS3JMYPVdgNUXF7UGt/S
 cNhU1JjdEylgZPKi5t9o/qgy7+vtSL0KKqoXIS0z9ZWZkdFsgObNetGULhaqXgaX
 4KcAt7PyJS/33uFqYGSVZxJTO4GtCy34hGw6XrVs388tQD1S+eI+oS5EYBN57rl9
 MrV5Z72iMCBx8Y074u81SD/u7n34b5WWbUOJADVI1rAKhilDhnkAhBQ/gK/UBodB
 trUUQY/KU+eCnXEqfBeMD6fL75UOAFOjnm7Kfk/Q/g3di3H2fu0RGLrJuSS1dG0n
 zmwgE7pQ1jT1UcB3QwAJLR3tGtd6sA==
 =HwqP
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

emulated nvme updates and fixes

* fixes for Coverity CID 1450756, 1450757 and 1450758 (me)
* fix for a bug in zone management receive (me)
* metadata and end-to-end data protection support (me & Gollu Appalanaidu)
* verify support (Gollu Appalanaidu)
* multiple lba formats and format nvm support (Minwoo Im)

and a couple of misc refactorings from me.

v2:
  - remove an unintended submodule update. Argh.

# gpg: Signature made Thu 18 Mar 2021 11:53:48 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request:
  hw/block/nvme: add support for the format nvm command
  hw/block/nvme: pull lba format initialization
  hw/block/nvme: prefer runtime helpers instead of device parameters
  hw/block/nvme: support multiple lba formats
  hw/block/nvme: add non-mdts command size limit for verify
  hw/block/nvme: add verify command
  hw/block/nvme: end-to-end data protection
  hw/block/nvme: add metadata support
  hw/block/nvme: fix zone management receive reporting too many zones
  hw/block/nvme: assert namespaces array indices
  hw/block/nvme: fix potential overflow

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-18 19:55:37 +00:00
Minwoo Im dc04d25e2f hw/block/nvme: add support for the format nvm command
Format NVM admin command can make a namespace or namespaces to be
with different LBA size and metadata size with protection information
types.

This patch introduces Format NVM command with LBA format, Metadata, and
Protection Information for the device. The secure erase operation things
and support for formatting zoned namespaces are yet to be added.

The parameter checks inside of this patch has been referred from
Keith's old branch.

Signed-off-by: Minwoo Im <minwoo.im@samsung.com>
[anaidu.gollu: rebased on e2e]
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased for reworked aio tracking]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:41:43 +01:00
Gollu Appalanaidu 3e1da158c4 hw/block/nvme: add verify command
See NVM Express 1.4, section 6.14 ("Verify Command").

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
[k.jensen: rebased, refactored for e2e]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:34:51 +01:00
Klaus Jensen 146f720c55 hw/block/nvme: end-to-end data protection
Add support for namespaces formatted with protection information. The
type of end-to-end data protection (i.e. Type 1, Type 2 or Type 3) is
selected with the `pi` nvme-ns device parameter. If the number of
metadata bytes is larger than 8, the `pil` nvme-ns device parameter may
be used to control the location of the 8-byte DIF tuple. The default
`pil` value of '0', causes the DIF tuple to be transferred as the last
8 bytes of the metadata. Set to 1 to store this in the first eight bytes
instead.

Co-authored-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-18 12:34:51 +01:00
Daniel P. Berrangé 81cbfd5088 block: remove dirty bitmaps 'status' field
The same information is available via the 'recording' and 'busy' fields.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2021-03-18 09:22:55 +00:00
Peter Maydell 9abda42bf2 nbd patches for 2021-03-09
- Add Vladimir as NBD co-maintainer
 - Fix reporting of holes in NBD_CMD_BLOCK_STATUS
 - Improve command-line parsing accuracy of large numbers (anything going
 through qemu_strtosz), including the deprecation of hex+suffix
 - Improve some error reporting in the block layer
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmBHlmIACgkQp6FrSiUn
 Q2q2cQgAqJWNb4J/ShjvzocDDPzJ0iBitFbg0huFPfbt4DScubEZo5wBJG7vOhOW
 hIHrWCRzGvRgsn0tcSfrgFaegmHKrLgjkibM7ou8ni9NC1kUBd3R/3FBNIMxhYf7
 Q8Kfspl0LRfMJDKF9jdCnQ4Gxcd6h2OIYZqiWVg8V4Tc8WdCpIVOah7e7wjuW8bT
 vgZvfboUWm5AmIF9j/MxuMn+HFZ4ArSuFVL80ZaXlD00vRra7u3HZ8pUfcOlOujg
 7HeouM1E5j3NNE6aZSN++x/EQ3sg0zmirbWUCcgAyRfdRkAmB15uh2PUzPxEIJKH
 UHUIW5LvNtz2+yzOAz2yK29OE523Yg==
 =blE1
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-nbd-2021-03-09' into staging

nbd patches for 2021-03-09

- Add Vladimir as NBD co-maintainer
- Fix reporting of holes in NBD_CMD_BLOCK_STATUS
- Improve command-line parsing accuracy of large numbers (anything going
through qemu_strtosz), including the deprecation of hex+suffix
- Improve some error reporting in the block layer

# gpg: Signature made Tue 09 Mar 2021 15:38:10 GMT
# gpg:                using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-nbd-2021-03-09:
  block/qcow2: refactor qcow2_update_options_prepare error paths
  block/qed: bdrv_qed_do_open: deal with errp
  block/qcow2: simplify qcow2_co_invalidate_cache()
  block/qcow2: read_cache_sizes: return status value
  block/qcow2-bitmap: return status from qcow2_store_persistent_dirty_bitmaps
  block/qcow2-bitmap: improve qcow2_load_dirty_bitmaps() interface
  block/qcow2: qcow2_get_specific_info(): drop error propagation
  blockjob: return status from block_job_set_speed()
  block/mirror: drop extra error propagation in commit_active_start()
  block: drop extra error propagation for bdrv_set_backing_hd
  blockdev: fix drive_backup_prepare() missed error
  block: check return value of bdrv_open_child and drop error propagation
  utils: Deprecate hex-with-suffix sizes
  utils: Improve qemu_strtosz() to have 64 bits of precision
  utils: Enhance testsuite for do_strtosz()
  nbd: server: Report holes for raw images
  MAINTAINERS: add Vladimir as co-maintainer of NBD

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-11 13:57:08 +00:00
Peter Maydell d689ecad07 hw/block/nvme updates
* NVMe subsystem support (`-device nvme-subsys`) (Minwoo Im)
 * Namespace (De|At)tachment support (Minwoo Im)
 * Simple Copy command support (Klaus Jensen)
 * Flush broadcast support (Gollu Appalanaidu)
 * QEMUIOVector/QEMUSGList duality refactoring (Klaus Jensen)
 
 plus various fixes from Minwoo, Gollu, Dmitry and me.
 
 v2:
   - add `nqn` nvme-subsys device parameter instead of using `id`.
     (Paolo)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmBHX5EACgkQTeGvMW1P
 Dek7ygf9H1g1wwp0Epo3U9z3wCWGcGz6uhBcqEW/T7MtmFg0G50C9iUGKOSe7xtv
 bWDabGJj0VCCzceeVUtMLfDXtDsekdUEiP/40OWuD3GsMDE7BJj0YIEklPioqa14
 maDuFoXbXvgmbPUGIf4IIvmQDTOg22K2oBkjCHA9nbaZM6qxIPC3wRtM77sxXDcW
 QdKQR4QXOs0lCXUlPdO9PhhnYcxt7k2/WxCKQsdF7ZL7USDkgZ/c7m54MUaw2st4
 gQWUAh6vxa+oEjpAjgivCdnmd6B4srpUsvMK3U4UKSpYFXus7teK2x88e8t/plKR
 9X0sgL40+PQVI9/CeBh0Z8l/vJYCvg==
 =WF8v
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

hw/block/nvme updates

* NVMe subsystem support (`-device nvme-subsys`) (Minwoo Im)
* Namespace (De|At)tachment support (Minwoo Im)
* Simple Copy command support (Klaus Jensen)
* Flush broadcast support (Gollu Appalanaidu)
* QEMUIOVector/QEMUSGList duality refactoring (Klaus Jensen)

plus various fixes from Minwoo, Gollu, Dmitry and me.

v2:
  - add `nqn` nvme-subsys device parameter instead of using `id`.
    (Paolo)

# gpg: Signature made Tue 09 Mar 2021 11:44:17 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (38 commits)
  hw/block/nvme: support Identify NS Attached Controller List
  hw/block/nvme: support changed namespace asynchronous event
  hw/block/nvme: support namespace attachment command
  hw/block/nvme: refactor nvme_select_ns_iocs
  hw/block/nvme: support allocated namespace type
  hw/block/nvme: fix allocated namespace list to 256
  hw/block/nvme: fix namespaces array to 1-based
  hw/block/nvme: support namespace detach
  hw/block/nvme: refactor nvme_dma
  hw/block/nvme: remove the req dependency in map functions
  hw/block/nvme: try to deal with the iov/qsg duality
  hw/block/nvme: fix strerror printing
  hw/block/nvme: remove block accounting for write zeroes
  hw/block/nvme: remove redundant len member in compare context
  hw/block/nvme: report non-mdts command size limit for dsm
  hw/block/nvme: add trace event for zone read check
  hw/block/nvme: fix potential compilation error
  hw/block/nvme: add identify trace event
  hw/block/nvme: remove unnecessary endian conversion
  hw/block/nvme: align zoned.zasl with mdts
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-03-10 20:11:33 +00:00
Minwoo Im 23fb7dfeca hw/block/nvme: support Identify NS Attached Controller List
Support Identify command for Namespace attached controller list.  This
command handler will traverse the controller instances in the given
subsystem to figure out whether the specified nsid is attached to the
controllers or not.

The 4096bytes Identify data will return with the first entry (16bits)
indicating the number of the controller id entries.  So, the data can
hold up to 2047 entries for the controller ids.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Minwoo Im f432fdfa12 hw/block/nvme: support changed namespace asynchronous event
If namespace inventory is changed due to some reasons (e.g., namespace
attachment/detachment), controller can send out event notifier to the
host to manage namespaces.

This patch sends out the AEN to the host after either attach or detach
namespaces from controllers.  To support clear of the event from the
controller, this patch also implemented Get Log Page command for Changed
Namespace List log type.  To return namespace id list through the
command, when namespace inventory is updated, id is added to the
per-controller list (changed_ns_list).

To indicate the support of this async event, this patch set
OAES(Optional Asynchronous Events Supported) in Identify Controller data
structure.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Minwoo Im 645ce1a70c hw/block/nvme: support namespace attachment command
This patch supports Namespace Attachment command for the pre-defined
nvme-ns device nodes.  Of course, attach/detach namespace should only be
supported in case 'subsys' is given.  This is because if we detach a
namespace from a controller, somebody needs to manage the detached, but
allocated namespace in the NVMe subsystem.

As command effect for the namespace attachment command is registered,
the host will be notified that namespace inventory is changed so that
host will rescan the namespace inventory after this command.  For
example, kernel driver manages this command effect via passthru IOCTL.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
[k.jensen: rebased for dma refactor]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:58 +01:00
Gollu Appalanaidu 67ce28a1fd hw/block/nvme: report non-mdts command size limit for dsm
Dataset Management is not subject to MDTS, but exceeded a certain size
per range causes internal looping. Report this limit (DMRSL) in the NVM
command set specific identify controller data structure.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-09 11:00:57 +01:00
Gollu Appalanaidu c94973288c hw/block/nvme: add broadcast nsid support flush command
Add support for using the broadcast nsid to issue a flush on all
namespaces through a single command.

Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Klaus Jensen 3862efff31 nvme: updated shared header for copy command
Add new data structures and types for the Simple Copy command.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-03-09 11:00:57 +01:00
Minwoo Im adc36b8d21 hw/block/nvme: add NMIC enum value for Identify Namespace
Added Namespace Multi-path I/O and Namespace Sharing Capabilities (NMIC)
field to support shared namespace from controller(s).

This field is in Identify Namespace data structure in [30].

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Minwoo Im 66b7e9bed0 hw/block/nvme: add CMIC enum value for Identify Controller
Added Controller Multi-path I/O and Namespace Sharing Capabilities
(CMIC) field to support multi-controller in the following patches.

This field is in Identify Controller data structure in [76].

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Tested-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-03-09 11:00:57 +01:00
Vladimir Sementsov-Ogievskiy 775d0c0508 blockjob: return status from block_job_set_speed()
Better to return status together with setting errp. It allows to avoid
error propagation in the caller.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20210202124956.63146-8-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-03-08 15:14:15 -06:00
Vladimir Sementsov-Ogievskiy 35f428ba39 qcow2-bitmap: make bytes_covered_by_bitmap_cluster() public
Rename bytes_covered_by_bitmap_cluster() to
bdrv_dirty_bitmap_serialization_coverage() and make it public.
It is needed as we are going to share it with bitmap loading in
parallels format.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
Message-Id: <20210224104707.88430-2-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-03-08 14:56:54 +01:00
Maxim Levitsky a890f08e58 block: add bdrv_co_delete_file_noerr
This function wraps bdrv_co_delete_file for the common case of removing a file,
which was just created by format driver, on an error condition.

It hides the -ENOTSUPP error, and reports all other errors otherwise.

Use it in luks driver

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201217170904.946013-3-mlevitsk@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-02-15 15:10:14 +01:00
Peter Maydell 392b9a74b9 bitmaps patches for 2021-02-12
- add 'transform' member to manipulate bitmaps across migration
 - work towards better error handling during bdrv_open
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmAnDQsACgkQp6FrSiUn
 Q2qc5Qf/SKVdpX4j7OnHF6sBuf/8LVWz4KazSqEU0ohazBJmafgJpH2EA5pXMXR4
 frZDWeanGmhj1MjMkta/++uvEBU/TMpW2z98mZvjErteXdnRQAlII/hOCI+QZJvg
 viQ5t1EyrkyXzUePOjs+AwqA5KHWbCKt6QqyItQ78HvI23sw/fuvHj0G67KbVzXZ
 VcSrVr0J7PXnZV/hWfg+C+Nn9Ro9tsVdn79awLYVQ7/SDro3hzylpcHMQaHMK2oe
 mX4D2kNq7s21E27Zb6vlknUhQPkMdETk0gfEbpn7sTVMEc58GRLC7Tqfx7l0JIFK
 5izVyA5vndKVxDGYPkbDK6VL2uDg4A==
 =+Epy
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/ericb/tags/pull-bitmaps-2021-02-12' into staging

bitmaps patches for 2021-02-12

- add 'transform' member to manipulate bitmaps across migration
- work towards better error handling during bdrv_open

# gpg: Signature made Fri 12 Feb 2021 23:19:39 GMT
# gpg:                using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [full]
# gpg:                 aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [full]
# gpg:                 aka "[jpeg image of size 6874]" [full]
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2  F3AA A7A1 6B4A 2527 436A

* remotes/ericb/tags/pull-bitmaps-2021-02-12:
  block: use return status of bdrv_append()
  block: return status from bdrv_append and friends
  qemu-iotests: 300: Add test case for modifying persistence of bitmap
  migration: dirty-bitmap: Allow control of bitmap persistence
  migration: dirty-bitmap: Use struct for alias map inner members

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-02-13 21:26:00 +00:00
Vladimir Sementsov-Ogievskiy a1e708fcda block: return status from bdrv_append and friends
The recommended use of qemu error api assumes returning status together
with setting errp and avoid void functions with errp parameter. Let's
improve bdrv_append and some friends to reduce error-propagation
overhead in further patches.

Choose int return status, because bdrv_replace_node_common() has call
to bdrv_check_update_perm(), which reports int status, which seems
correct to propagate.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210202124956.63146-2-vsementsov@virtuozzo.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-12 15:36:41 -06:00
Vladimir Sementsov-Ogievskiy bd54669a4a block: add new BlockDriver handler: bdrv_cancel_in_flight
It will be used to stop retrying NBD requests on mirror cancel.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210205163720.887197-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-12 09:45:18 -06:00
Peter Maydell 1214d55d1c Emulated NVMe device updates
* deallocate or unwritten logical block error feature (me)
   * dataset management command (me)
   * compare command (Gollu Appalanaidu)
   * namespace types (Niklas Cassel)
   * zoned namespaces (Dmitry Fomichev)
   * smart critical warning toggle (Zhenwei Pi)
   * allow cmb and pmr to coexist (me)
   * pmr rds/wds support (Naveen Nagar)
   * cmb v1.4 logic (Padmakar Kalghatgi)
 
 And a lot of smaller fixes from Gollu Appalanaidu and Minwoo Im.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEUigzqnXi3OaiR2bATeGvMW1PDekFAmAiON4ACgkQTeGvMW1P
 DenhxQf/WzoiO5bXmvZeRv+IHoIn5GhJ1NxLRYO5MX0diswh0BBwUmIscbEDMe1b
 GsD6rpd9YfEO80L/sEGqgV09HT+6e8YwDsNNZBTAsUNRVx0WxckgcGWcrzNNA9Nc
 WEl0Q8si5USSQ1C7djplLXdR6p4pbA/gIk6AjNIo3q2VK1ZqCBhQGESGEfGgrAXW
 xWo8C1V8dnKdxUYI2blbti44sElHZJ6jcF5N3Xmv0UUa1WL0hh0u6qr7IbCZe1kO
 SUFWMIGLF+1C35MUyWpgCjCn5cUdnTA0s/SLEuWtDlNYRhRRh0D6LZviVTZi38Wx
 6Cxg/bRkSlcKo1/jswwYcAaH7qQ4Eg==
 =NB9D
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/nvme-next-pull-request' into staging

Emulated NVMe device updates

  * deallocate or unwritten logical block error feature (me)
  * dataset management command (me)
  * compare command (Gollu Appalanaidu)
  * namespace types (Niklas Cassel)
  * zoned namespaces (Dmitry Fomichev)
  * smart critical warning toggle (Zhenwei Pi)
  * allow cmb and pmr to coexist (me)
  * pmr rds/wds support (Naveen Nagar)
  * cmb v1.4 logic (Padmakar Kalghatgi)

And a lot of smaller fixes from Gollu Appalanaidu and Minwoo Im.

# gpg: Signature made Tue 09 Feb 2021 07:25:18 GMT
# gpg:                using RSA key 522833AA75E2DCE6A24766C04DE1AF316D4F0DE9
# gpg: Good signature from "Klaus Jensen <its@irrelevant.dk>" [unknown]
# gpg:                 aka "Klaus Jensen <k.jensen@samsung.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DDCA 4D9C 9EF9 31CC 3468  4272 63D5 6FC5 E55D A838
#      Subkey fingerprint: 5228 33AA 75E2 DCE6 A247  66C0 4DE1 AF31 6D4F 0DE9

* remotes/nvme/tags/nvme-next-pull-request: (56 commits)
  hw/block/nvme: refactor the logic for zone write checks
  hw/block/nvme: fix zone boundary check for append
  hw/block/nvme: fix wrong parameter name 'cross_read'
  hw/block/nvme: align with existing style
  hw/block/nvme: fix set feature save field check
  hw/block/nvme: fix set feature for error recovery
  hw/block/nvme: error if drive less than a zone size
  hw/block/nvme: lift cmb restrictions
  hw/block/nvme: bump to v1.4
  hw/block/nvme: move cmb logic to v1.4
  hw/block/nvme: add PMR RDS/WDS support
  hw/block/nvme: disable PMR at boot up
  hw/block/nvme: remove redundant zeroing of PMR registers
  hw/block/nvme: rename PMR/CMB shift/mask fields
  hw/block/nvme: allow cmb and pmr to coexist
  hw/block/nvme: move msix table and pba to BAR 0
  hw/block/nvme: indicate CMB support through controller capabilities register
  hw/block/nvme: fix 64 bit register hi/lo split writes
  hw/block/nvme: add size to mmio read/write trace events
  hw/block/nvme: trigger async event during injecting smart warning
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2021-02-09 13:24:37 +00:00
Klaus Jensen c2a3640de8 hw/block/nvme: bump to v1.4
With the new CMB logic in place, bump the implemented specification
version to v1.4 by default.

This requires adding the setting the CNTRLTYPE field and modifying the
VWC field since 0x00 is no longer a valid value for bits 2:1.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:54 +01:00
Padmakar Kalghatgi f4319477b4 hw/block/nvme: move cmb logic to v1.4
Implement v1.4 logic for configuring the Controller Memory Buffer. By
default, the v1.4 scheme will be used (CMB must be explicitly enabled by
the host), so drivers that only support v1.3 will not be able to use the
CMB anymore.

To retain the v1.3 behavior, set the boolean 'legacy-cmb' nvme device
parameter.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Padmakar Kalghatgi <p.kalghatgi@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:54 +01:00
Klaus Jensen 8e9e8b4821 hw/block/nvme: rename PMR/CMB shift/mask fields
Use the correct field names.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
Andrzej Jakowski c705063129 hw/block/nvme: indicate CMB support through controller capabilities register
This patch sets CMBS bit in controller capabilities register when user
configures NVMe driver with CMB support, so capabilites are correctly
reported to guest OS.

Signed-off-by: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Reviewed-by: Maxim Levitsky <mlevitsky@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi c62720f137 hw/block/nvme: trigger async event during injecting smart warning
During smart critical warning injection by setting property from QMP
command, also try to trigger asynchronous event.

Suggested by Keith, if a event has already been raised, there is no
need to enqueue the duplicate event any more.

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: fix typo in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi 4714791b66 hw/block/nvme: add smart_critical_warning property
There is a very low probability that hitting physical NVMe disk
hardware critical warning case, it's hard to write & test a monitor
agent service.

For debugging purposes, add a new 'smart_critical_warning' property
to emulate this situation.

The orignal version of this change is implemented by adding a fixed
property which could be initialized by QEMU command line. Suggested
by Philippe & Klaus, rework like current version.

Test with this patch:
1, change smart_critical_warning property for a running VM:
 #virsh qemu-monitor-command nvme-upstream '{ "execute": "qom-set",
  "arguments": { "path": "/machine/peripheral-anon/device[0]",
  "property": "smart_critical_warning", "value":16 } }'
2, run smartctl in guest
 #smartctl -H -l error /dev/nvme0n1

  === START OF SMART DATA SECTION ===
  SMART overall-health self-assessment test result: FAILED!
  - volatile memory backup device has failed

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
zhenwei pi c6d1b5c13b nvme: introduce bit 5 for critical warning
According to NVM Express v1.4, Section 5.14.1.2 ("SMART / Health
Information"), introduce bit 5 for "Persistent Memory Region has become
read-only or unreliable".

Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
[k.jensen: minor brush ups in commit message]
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 21:15:53 +01:00
Klaus Jensen b05fde2881 hw/block/nvme: enum style fix
Align with existing style and use a typedef for header-file enums.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Tested-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
2021-02-08 21:15:53 +01:00
Dmitry Fomichev e9ba46eeaf nvme: Make ZNS-related definitions
Define values and structures that are needed to support Zoned
Namespace Command Set (NVMe TP 4053).

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Niklas Cassel 922e6f4ebd hw/block/nvme: Support allocated CNS command variants
Many CNS commands have "allocated" command variants. These include
a namespace as long as it is allocated, that is a namespace is
included regardless if it is active (attached) or not.

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the
same result as the active CNS command variants.

The reason for not hooking up this command completely is because the
NVMe specification requires the namespace management command to be
supported if the namespace attachment command is supported.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Niklas Cassel 141354d55b hw/block/nvme: Add support for Namespace Types
Define the structures and constants required to implement
Namespace Types support.

Namespace Types introduce a new command set, "I/O Command Sets",
that allows the host to retrieve the command sets associated with
a namespace. Introduce support for the command set and enable
detection for the NVM Command Set.

The new workflows for identify commands rely heavily on zero-filled
identify structs. E.g., certain CNS commands are defined to return
a zero-filled identify struct when an inactive namespace NSID
is supplied.

Add a helper function in order to avoid code duplication when
reporting zero-filled identify structures.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:34 +01:00
Dmitry Fomichev 62e8faa468 hw/block/nvme: Add Commands Supported and Effects log
This log page becomes necessary to implement to allow checking for
Zone Append command support in Zoned Namespace Command Set.

This commit adds the code to report this log page for NVM Command
Set only. The parts that are specific to zoned operation will be
added later in the series.

All incoming admin and i/o commands are now only processed if their
corresponding support bits are set in this log. This provides an
easy way to control what commands to support and what not to
depending on set CC.CSS.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
2021-02-08 20:58:32 +01:00
Klaus Jensen 6fd704a59a nvme: add namespace I/O optimization fields to shared header
This adds the NPWG, NPWA, NPDG, NPDA and NOWS family of fields to the
shared nvme.h header for use by later patches.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
2021-02-08 18:55:48 +01:00
Klaus Jensen 54064e51d1 hw/block/nvme: add dulbe support
Add support for reporting the Deallocated or Unwritten Logical Block
Error (DULBE).

Rely on the block status flags reported by the block layer and consider
any block with the BDRV_BLOCK_ZERO flag to be deallocated.

Multiple factors affect when a Write Zeroes command result in
deallocation of blocks.

  * the underlying file system block size
  * the blockdev format
  * the 'discard' and 'logical_block_size' parameters

     format | discard | wz (512B)  wz (4KiB)  wz (64KiB)
    -----------------------------------------------------
      qcow2    ignore   n          n          y
      qcow2    unmap    n          n          y
      raw      ignore   n          y          y
      raw      unmap    n          y          y

So, this works best with an image in raw format and 4KiB LBAs, since
holes can then be punched on a per-block basis (this assumes a file
system with a 4kb block size, YMMV). A qcow2 image, uses a cluster size
of 64KiB by default and blocks will only be marked deallocated if a full
cluster is zeroed or discarded. However, this *is* consistent with the
spec since Write Zeroes "should" deallocate the block if the Deallocate
attribute is set and "may" deallocate if the Deallocate attribute is not
set. Thus, we always try to deallocate (the BDRV_REQ_MAY_UNMAP flag is
always set).

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2021-02-08 18:55:48 +01:00
Daniel P. Berrangé 3d3e9b1f66 block: rename and alter bdrv_all_find_snapshot semantics
Currently bdrv_all_find_snapshot() will return 0 if it finds
a snapshot, -1 if an error occurs, or if it fails to find a
snapshot. New callers to be added want to distinguish between
the error scenario and failing to find a snapshot.

Rename it to bdrv_all_has_snapshot and make it return -1 on
error, 0 if no snapshot is found and 1 if snapshot is found.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-7-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé c22d644ca7 block: allow specifying name of block device for vmstate storage
Currently the vmstate will be stored in the first block device that
supports snapshots. Historically this would have usually been the
root device, but with UEFI it might be the variable store. There
needs to be a way to override the choice of block device to store
the state in.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-6-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé cf3a74c94f block: add ability to specify list of blockdevs during snapshot
When running snapshot operations, there are various rules for which
blockdevs are included/excluded. While this provides reasonable default
behaviour, there are scenarios that are not well handled by the default
logic. Some of the conditions do not have a single correct answer.

Thus there needs to be a way for the mgmt app to provide an explicit
list of blockdevs to perform snapshots across. This can be achieved
by passing a list of node names that should be used.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20210204124834.774401-5-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Daniel P. Berrangé e26f98e209 block: push error reporting into bdrv_all_*_snapshot functions
The bdrv_all_*_snapshot functions return a BlockDriverState pointer
for the invalid backend, which the callers then use to report an
error message. In some cases multiple callers are reporting the
same error message, but with slightly different text. In the future
there will be more error scenarios for some of these methods, which
will benefit from fine grained error message reporting. So it is
helpful to push error reporting down a level.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
[PMD: Initialize variables]
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <20210204124834.774401-2-berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
2021-02-08 11:19:51 +00:00
Roman Kagan 5082fc82a6 nbd: make nbd_read* return -EIO on error
NBD reconnect logic considers the error code from the functions that
read NBD messages to tell if reconnect should be attempted or not: it is
attempted on -EIO, otherwise the client transitions to NBD_CLIENT_QUIT
state (see nbd_channel_error).  This error code is propagated from the
primitives like nbd_read.

The problem, however, is that nbd_read itself turns every error into -1
rather than -EIO.  As a result, if the NBD server happens to die while
sending the message, the client in QEMU receives less data than it
expects, considers it as a fatal error, and wouldn't attempt
reestablishing the connection.

Fix it by turning every negative return from qio_channel_read_all into
-EIO returned from nbd_read.  Apparently that was the original behavior,
but got broken later.  Also adjust nbd_readXX to follow.

Fixes: e6798f06a6 ("nbd: generalize usage of nbd_read")
Signed-off-by: Roman Kagan <rvkagan@yandex-team.ru>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210129073859.683063-4-rvkagan@yandex-team.ru>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy a5215b8fdf block/io: use int64_t bytes in copy_range
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, convert now copy_range parameters which are already 64bit to signed
type.

It's safe as we don't work with requests overflowing BDRV_MAX_LENGTH
(which is less than INT64_MAX), and do check the requests in
bdrv_co_copy_range_internal() (by bdrv_check_request32(), which calls
bdrv_check_request()).

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-17-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy e9e52efdc5 block/io: support int64_t bytes in read/write wrappers
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

Now, since bdrv_co_preadv_part() and bdrv_co_pwritev_part() have been
updated, update all their wrappers.

For all of them type of 'bytes' is widening, so callers are safe. We
have update request_fn in blkverify.c simultaneously. Still it's just a
pointer to one of bdrv_co_pwritev() or bdrv_co_preadv(), and type is
widening for callers of the request_fn anyway.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-16-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: grammar tweak]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:12 -06:00
Vladimir Sementsov-Ogievskiy 37e9403ea8 block/io: support int64_t bytes in bdrv_co_p{read,write}v_part()
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

So, prepare bdrv_co_preadv_part() and bdrv_co_pwritev_part() and their
remaining dependencies now.

bdrv_pad_request() is updated simultaneously, as pointer to bytes passed
to it both from bdrv_co_pwritev_part() and bdrv_co_preadv_part().

So, all callers of bdrv_pad_request() are updated to pass 64bit bytes.
bdrv_pad_request() is already good for 64bit requests, add
corresponding assertion.

Look at bdrv_co_preadv_part() and bdrv_co_pwritev_part().
Type is widening, so callers are safe. Let's look inside the functions.

In bdrv_co_preadv_part() and bdrv_aligned_pwritev() we only pass bytes
to other already int64_t interfaces (and some obviously safe
calculations), it's OK.

In bdrv_co_do_zero_pwritev() aligned_bytes may become large now, still
it's passed to bdrv_aligned_pwritev which supports int64_t bytes.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-15-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:17:11 -06:00
Eric Blake 8024726459 block: use int64_t as bytes type in tracked requests
We are generally moving to int64_t for both offset and bytes parameters
on all io paths.

Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

All requests in block/io must not overflow BDRV_MAX_LENGTH, all
external users of BdrvTrackedRequest already have corresponding
assertions, so we are safe. Add some assertions still.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-9-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:14:15 -06:00
Vladimir Sementsov-Ogievskiy 801625e69d block/throttle-groups: throttle_group_co_io_limits_intercept(): 64bit bytes
The function is called from 64bit io handlers, and bytes is just passed
to throttle_account() which is 64bit too (unsigned though). So, let's
convert intermediate argument to 64bit too.

This patch is a first in the 64-bit-blocklayer series, so we are
generally moving to int64_t for both offset and bytes parameters on all
io paths. Main motivation is realization of 64-bit write_zeroes
operation for fast zeroing large disk chunks, up to the whole disk.

We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).

Patch-correctness audit by Eric Blake:

  Caller has 32-bit, this patch now causes widening which is safe:
  block/block-backend.c: blk_do_preadv() passes 'unsigned int'
  block/block-backend.c: blk_do_pwritev_part() passes 'unsigned int'
  block/throttle.c: throttle_co_pwrite_zeroes() passes 'int'
  block/throttle.c: throttle_co_pdiscard() passes 'int'

  Caller has 64-bit, this patch fixes potential bug where pre-patch
  could narrow, except it's easy enough to trace that callers are still
  capped at 2G actions:
  block/throttle.c: throttle_co_preadv() passes 'uint64_t'
  block/throttle.c: throttle_co_pwritev() passes 'uint64_t'

  Implementation in question: block/throttle-groups.c
  throttle_group_co_io_limits_intercept() takes 'unsigned int bytes'
  and uses it: argument to util/throttle.c throttle_account(uint64_t)

  All safe: it patches a latent bug, and does not introduce any 64-bit
  gotchas once throttle_co_p{read,write}v are relaxed, and assuming
  throttle_account() is not buggy.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201211183934.169161-7-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:14:00 -06:00
Vladimir Sementsov-Ogievskiy 69b55e03f7 block: refactor bdrv_check_request: add errp
It's better to pass &error_abort than just assert that result is 0: on
crash, we'll immediately see the reason in the backtrace.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201211183934.169161-2-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix iotest 206 fallout]
Signed-off-by: Eric Blake <eblake@redhat.com>
2021-02-03 08:00:33 -06:00
Vladimir Sementsov-Ogievskiy 143a6384f5 block/block-copy: drop unused argument of block_copy()
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-21-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy 5b49c2bdc1 block/block-copy: drop unused block_copy_set_progress_callback()
Drop unused code.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-20-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy e0323a045f blockjob: add set_speed to BlockJobDriver
We are going to use async block-copy call in backup, so we'll need to
passthrough setting backup speed to block-copy call.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy a6d23d56df block/block-copy: add block_copy_cancel
Add function to cancel running async block-copy call. It will be used
in backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy 7e032df0ea block/block-copy: add ratelimit to block-copy
We are going to directly use one async block-copy operation for backup
job, so we need rate limiter.

We want to maintain current backup behavior: only background copying is
limited and copy-before-write operations only participate in limit
calculation. Therefore we need one rate limiter for block-copy state
and boolean flag for block-copy call state for actual limitation.

Note, that we can't just calculate each chunk in limiter after
successful copying: it will not save us from starting a lot of async
sub-requests which will exceed limit too much. Instead let's use the
following scheme on sub-request creation:
1. If at the moment limit is not exceeded, create the request and
account it immediately.
2. If at the moment limit is already exceeded, drop create sub-request
and handle limit instead (by sleep).
With this approach we'll never exceed the limit more than by one
sub-request (which pretty much matches current backup behavior).

Note also, that if there is in-flight block-copy async call,
block_copy_kick() should be used after set-speed to apply new setup
faster. For that block_copy_kick() published in this patch.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy 26be9d62dd block/block-copy: add max_chunk and max_workers parameters
They will be used for backup.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-5-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy de4641b46b block/block-copy: implement block_copy_async
We'll need async block-copy invocation to use in backup directly.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-4-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy 86c6a3b690 qapi: backup: add perf.use-copy-range parameter
Experiments show, that copy_range is not always making things faster.
So, to make experimentation simpler, let's add a parameter. Some more
perf parameters will be added soon, so here is a new struct.

For now, add new backup qmp parameter with x- prefix for the following
reasons:

 - We are going to add more performance parameters, some will be
   related to the whole block-copy process, some only to background
   copying in backup (ignored for copy-before-write operations).
 - On the other hand, we are going to use block-copy interface in other
   block jobs, which will need performance options as well.. And it
   should be the same structure or at least somehow related.

So, there are too much unclean things about how the interface and now
we need the new options mostly for testing. Let's keep them
experimental for a while.

In do_backup_common() new x-perf parameter handled in a way to
make further options addition simpler.

We add use-copy-range with default=true, and we'll change the default
in further patch, after moving backup to use block-copy.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210116214705.822267-2-vsementsov@virtuozzo.com>
[mreitz: s/5\.2/6.0/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Vladimir Sementsov-Ogievskiy 7f4a396d76 qapi: block-stream: add "bottom" argument
The code already don't freeze base node and we try to make it prepared
for the situation when base node is changed during the operation. In
other words, block-stream doesn't own base node.

Let's introduce a new interface which should replace the current one,
which will in better relations with the code. Specifying bottom node
instead of base, and requiring it to be non-filter gives us the
following benefits:

 - drop difference between above_base and base_overlay, which will be
   renamed to just bottom, when old interface dropped

 - clean way to work with parallel streams/commits on the same backing
   chain, which otherwise become a problem when we introduce a filter
   for stream job

 - cleaner interface. Nobody will surprised the fact that base node may
   disappear during block-stream, when there is no word about "base" in
   the interface.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201216061703.70908-11-vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich e275458b29 copy-on-read: skip non-guest reads if no copy needed
If the flag BDRV_REQ_PREFETCH was set, skip idling read/write
operations in COR-driver. It can be taken into account for the
COR-algorithms optimization. That check is being made during the
block stream job by the moment.

Add the BDRV_REQ_PREFETCH flag to the supported_read_flags of the
COR-filter.

block: Modify the comment for the flag BDRV_REQ_PREFETCH as we are
going to use it alone and pass it to the COR-filter driver for further
processing.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-9-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich 897dd0ec4f block: include supported_read_flags into BDS structure
Add the new member supported_read_flags to the BlockDriverState
structure. It will control the flags set for copy-on-read operations.
Make the block generic layer evaluate supported read flags before they
go to a block driver.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
 [vsementsov: use assert instead of abort]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-8-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 14:36:37 +01:00
Andrey Shinkevich 880747a887 qapi: add filter-node-name to block-stream
Provide the possibility to pass the 'filter-node-name' parameter to the
block-stream job as it is done for the commit block job.

Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
  [vsementsov: comment indentation, s/Since: 5.2/Since: 6.0/]
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-5-vsementsov@virtuozzo.com>
[mreitz: s/commit/stream/]
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 11:26:54 +01:00
Andrey Shinkevich 8872ef78ab block: add API function to insert a node
Provide API for insertion a node to backing chain.

Suggested-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201216061703.70908-3-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2021-01-26 11:26:54 +01:00
Vladimir Sementsov-Ogievskiy d1a764d126 block: introduce BDRV_REQ_NO_WAIT flag
Add flag to make serialising request no wait: if there are conflicting
requests, just return error immediately. It's will be used in upcoming
preallocate filter.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-7-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy 8ac5aab255 block: bdrv_mark_request_serialising: split non-waiting function
We'll need a separate function, which will only "mark" request
serialising with specified align but not wait for conflicting
requests. So, it will be like old bdrv_mark_request_serialising(),
before merging bdrv_wait_serialising_requests_locked() into it.

To reduce the possible mess, let's do the following:

Public function that does both marking and waiting will be called
bdrv_make_request_serialising, and private function which will only
"mark" will be called tracked_request_set_serialising().

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201021145859.11201-6-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy 2153994e2e block: simplify comment to BDRV_REQ_SERIALISING
1. BDRV_REQ_NO_SERIALISING doesn't exist already, don't mention it.

2. We are going to add one more user of BDRV_REQ_SERIALISING, so
   comment about backup becomes a bit confusing here. The use case in
   backup is documented in block/backup.c, so let's just drop
   duplication here.

3. The fact that BDRV_REQ_SERIALISING is only for write requests is
   omitted. Add a note.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Message-Id: <20201021145859.11201-2-vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-12-18 12:35:55 +01:00
Vladimir Sementsov-Ogievskiy 8b1170012b block: introduce BDRV_MAX_LENGTH
We are going to modify block layer to work with 64bit requests. And
first step is moving to int64_t type for both offset and bytes
arguments in all block request related functions.

It's mostly safe (when widening signed or unsigned int to int64_t), but
switching from uint64_t is questionable.

So, let's first establish the set of requests we want to work with.
First signed int64_t should be enough, as off_t is signed anyway. Then,
obviously offset + bytes should not overflow.

And most interesting: (offset + bytes) being aligned up should not
overflow as well. Aligned to what alignment? First thing that comes in
mind is bs->bl.request_alignment, as we align up request to this
alignment. But there is another thing: look at
bdrv_mark_request_serialising(). It aligns request up to some given
alignment. And this parameter may be bdrv_get_cluster_size(), which is
often a lot greater than bs->bl.request_alignment.
Note also, that bdrv_mark_request_serialising() uses signed int64_t for
calculations. So, actually, we already depend on some restrictions.

Happily, bdrv_get_cluster_size() returns int and
bs->bl.request_alignment has 32bit unsigned type, but defined to be a
power of 2 less than INT_MAX. So, we may establish, that INT_MAX is
absolute maximum for any kind of alignment that may occur with the
request.

Note, that bdrv_get_cluster_size() is not documented to return power
of 2, still bdrv_mark_request_serialising() behaves like it is.
Also, backup uses bdi.cluster_size and is not prepared to it not being
power of 2.
So, let's establish that Qemu supports only power-of-2 clusters and
alignments.

So, alignment can't be greater than 2^30.

Finally to be safe with calculations, to not calculate different
maximums for different nodes (depending on cluster size and
request_alignment), let's simply set QEMU_ALIGN_DOWN(INT64_MAX, 2^30)
as absolute maximum bytes length for Qemu. Actually, it's not much less
than INT64_MAX.

OK, then, let's apply it to block/io.

Let's consider all block/io entry points of offset/bytes:

4 bytes/offset interface functions: bdrv_co_preadv_part(),
bdrv_co_pwritev_part(), bdrv_co_copy_range_internal() and
bdrv_co_pdiscard() and we check them all with bdrv_check_request().

We also have one entry point with only offset: bdrv_co_truncate().
Check the offset.

And one public structure: BdrvTrackedRequest. Happily, it has only
three external users:

 file-posix.c: adopted by this patch
 write-threshold.c: only read fields
 test-write-threshold.c: sets obviously small constant values

Better is to make the structure private and add corresponding
interfaces.. Still it's not obvious what kind of interface is needed
for file-posix.c. Let's keep it public but add corresponding
assertions.

After this patch we'll convert functions in block/io.c to int64_t bytes
and offset parameters. We can assume that offset/bytes pair always
satisfy new restrictions, and make
corresponding assertions where needed. If we reach some offset/bytes
point in block/io.c missing bdrv_check_request() it is considered a
bug. As well, if block/io.c modifies a offset/bytes request, expanding
it more then aligning up to request_alignment, it's a bug too.

For all io requests except for discard we keep for now old restriction
of 32bit request length.

iotest 206 output error message changed, as now test disk size is
larger than new limit. Add one more test case with new maximum disk
size to cover too-big-L1 case.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20201203222713.13507-5-vsementsov@virtuozzo.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-11 17:52:40 +01:00
Max Reitz 0c9b70d590 fuse: Allow exporting BDSs via FUSE
block-export-add type=fuse allows mounting block graph nodes via FUSE on
some existing regular file.  That file should then appears like a raw
disk image, and accesses to it result in accesses to the exported BDS.

Right now, we only implement the necessary block export functions to set
it up and shut it down.  We do not implement any access functions, so
accessing the mount point only results in errors.  This will be
addressed by a followup patch.

We keep a hash table of exported mount points, because we want to be
able to detect when users try to use a mount point twice.  This is
because we invoke stat() to check whether the given mount point is a
regular file, but if that file is served by ourselves (because it is
already used as a mount point), then this stat() would have to be served
by ourselves, too, which is impossible to do while we (as the caller)
are waiting for it to settle.  Therefore, keep track of mount point
paths to at least catch the most obvious instances of that problem.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20201027190600.192171-3-mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-11 17:52:39 +01:00
Peter Maydell 683685e72d Pull request for 5.2
NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
 improvements. Elena Afanasova's ioeventfd fixes are also included.
 
 Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEhpWov9P5fNqsNXdanKSrs4Grc8gFAl+ixjgACgkQnKSrs4Gr
 c8iZYgf+OB2eAGsdZO97fKh6VUUoRKa+BgWKuh37Cfpp3q+dLuIFMSKfU/UgprLc
 aowt6uTFfwudDV9KltUB2EiXIzpuf7JhMNOiDRkyEvYSj4KHRPsQmFCd35Nrjezy
 VvxSGafe2Z60Qnvcx+iGeMATSFX9YTcTZeHttC07v7dWn/yEK3b1hobcmjCcwWeR
 Ud8pjMyh5E2z/NpW8E669/byJf9iahx3LSQxSWt+9PVTPuftAB0Suu+m6svz1wvk
 sjVfIbtVWCp2BdGf5U6a2rEqF3+kIcFkfHp+MwgE0EdMz1wfjudaPl13a0C4DSun
 PSt9E+Ct5BTrDUvqCHvQDOaFiMZTPg==
 =Poyb
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/stefanha-gitlab/tags/block-pull-request' into staging

Pull request for 5.2

NVMe fixes to solve IOMMU issues on non-x86 and error message/tracing
improvements. Elena Afanasova's ioeventfd fixes are also included.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

# gpg: Signature made Wed 04 Nov 2020 15:18:16 GMT
# gpg:                using RSA key 8695A8BFD3F97CDAAC35775A9CA4ABB381AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>" [full]
# gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>" [full]
# Primary key fingerprint: 8695 A8BF D3F9 7CDA AC35  775A 9CA4 ABB3 81AB 73C8

* remotes/stefanha-gitlab/tags/block-pull-request: (33 commits)
  util/vfio-helpers: Assert offset is aligned to page size
  util/vfio-helpers: Convert vfio_dump_mapping to trace events
  util/vfio-helpers: Improve DMA trace events
  util/vfio-helpers: Trace where BARs are mapped
  util/vfio-helpers: Trace PCI BAR region info
  util/vfio-helpers: Trace PCI I/O config accesses
  util/vfio-helpers: Improve reporting unsupported IOMMU type
  block/nvme: Fix nvme_submit_command() on big-endian host
  block/nvme: Fix use of write-only doorbells page on Aarch64 arch
  block/nvme: Align iov's va and size on host page size
  block/nvme: Change size and alignment of prp_list_pages
  block/nvme: Change size and alignment of queue
  block/nvme: Change size and alignment of IDENTIFY response buffer
  block/nvme: Correct minimum device page size
  block/nvme: Set request_alignment at initialization
  block/nvme: Simplify nvme_cmd_sync()
  block/nvme: Simplify ADMIN queue access
  block/nvme: Correctly initialize Admin Queue Attributes
  block/nvme: Use definitions instead of magic values in add_io_queue()
  block/nvme: Introduce Completion Queue definitions
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-11-23 13:03:13 +00:00
Greg Kurz 009cde17a5 block: Move bdrv_drain_all_end_quiesce() to block_int.h
This function is really an internal helper for bdrv_close(). Update its
doc comment to make this clear and make the function private.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160387245480.131299.13430357162209598411.stgit@bahia>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-11-09 15:44:21 +01:00
Philippe Mathieu-Daudé 54248d4d73 block/nvme: Introduce Completion Queue definitions
Rename Submission Queue flags with 'Sq' to differentiate
submission queue flags from command queue flags, and introduce
Completion Queue flag definitions.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20201029093306.1063879-13-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03 19:06:21 +00:00
Peter Maydell 8680d6e364 nvme pull 2 Nov 2020
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE28EdLTc7SjdV9QLsYlFWYQpPbMAFAl+gI74ACgkQYlFWYQpP
 bMCBrA/9GXMZZGDfHFenXF+rS6J+ZKxtk29vq9Ly8KZ9YW7CzF9MP8qE/5iyFfmx
 d1BknXGQerW2kAzpkOq2/MKDklOc+0BAhaTdUaFR/ao5ZKuv2LQ8uFnKVoTrhTx9
 +HVkTVUTnez6ReCZVIrtN4+XVdyQTeQotJg6H2m5Q/BxQKcj6OMOlneuSGDn5vFN
 EWgDvEmfFEkzbN8FMXtkT35bg3vA5TGmfQRMk1SMMREOPxF04CaTVTxYscCpS0WC
 Cl+62mx4XLjscK7hwXuTNTrxeOLxZ2xLK5dhDd/qxBveio07mIM5X2psdKR0t5qX
 HLtm437T9CAYmyo8jgvM4KL8f+rbJnLd579qyVwIMsue28Qisj9nuWCTcaEpjfck
 4krhxJwxenRtqQ9wYrnbnQI5yQDIE6iUGf0toXwCNdJIr+FvyIcT7vJtTzZXtRI8
 sxwK5wfJ/WSey9uNLZGFbQuv4vjOMV+Nk3mEi1gUV8ujogo+2U6WUAE3NhqFLKn1
 YT6AJhDZvqL1f8gFrbiqR8xwvPrYmwK/tK38X1exSDOqiB7UNzR/apAb1oniul0e
 rS5xWzIs9APvkdWQssCHvrVDdh6VISXQ5bnT8lkfmvYrCTn2gUGAFXDrxZjXIaL9
 scCr8N9STkHmoYpc2ACRKIpfK3E1sDjGA8mAPemkxsLakNwBS4o=
 =s4KC
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/nvme/tags/pull-nvme-20201102' into staging

nvme pull 2 Nov 2020

# gpg: Signature made Mon 02 Nov 2020 15:20:30 GMT
# gpg:                using RSA key DBC11D2D373B4A3755F502EC625156610A4F6CC0
# gpg: Good signature from "Keith Busch <kbusch@kernel.org>" [unknown]
# gpg:                 aka "Keith Busch <keith.busch@gmail.com>" [unknown]
# gpg:                 aka "Keith Busch <keith.busch@intel.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DBC1 1D2D 373B 4A37 55F5  02EC 6251 5661 0A4F 6CC0

* remotes/nvme/tags/pull-nvme-20201102: (30 commits)
  hw/block/nvme: fix queue identifer validation
  hw/block/nvme: fix create IO SQ/CQ status codes
  hw/block/nvme: fix prp mapping status codes
  hw/block/nvme: report actual LBA data shift in LBAF
  hw/block/nvme: add trace event for requests with non-zero status code
  hw/block/nvme: add nsid to get/setfeat trace events
  hw/block/nvme: reject io commands if only admin command set selected
  hw/block/nvme: support for admin-only command set
  hw/block/nvme: validate command set selected
  hw/block/nvme: support per-namespace smart log
  hw/block/nvme: fix log page offset check
  hw/block/nvme: remove pointless rw indirection
  hw/block/nvme: update nsid when registered
  hw/block/nvme: change controller pci id
  pci: allocate pci id for nvme
  hw/block/nvme: support multiple namespaces
  hw/block/nvme: refactor identify active namespace id list
  hw/block/nvme: add support for sgl bit bucket descriptor
  hw/block/nvme: add support for scatter gather lists
  hw/block/nvme: harden cmb access
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2020-11-02 17:17:29 +00:00
Eric Blake 71719cd57f nbd: Add new qemu:allocation-depth metadata context
'qemu-img map' provides a way to determine which extents of an image
come from the top layer vs. inherited from a backing chain.  This is
useful information worth exposing over NBD.  There is a proposal to
add a QMP command block-dirty-bitmap-populate which can create a dirty
bitmap that reflects allocation information, at which point the
qemu:dirty-bitmap:NAME metadata context can expose that information
via the creation of a temporary bitmap, but we can shorten the effort
by adding a new qemu:allocation-depth metadata context that does the
same thing without an intermediate bitmap (this patch does not
eliminate the need for that proposal, as it will have other uses as
well).

While documenting things, remember that although the NBD protocol has
NBD_OPT_SET_META_CONTEXT, the rest of its documentation refers to
'metadata context', which is a more apt description of what is
actually being used by NBD_CMD_BLOCK_STATUS: the user is requesting
metadata by passing one or more context names.  So I also touched up
some existing wording to prefer the term 'metadata context' where it
makes sense.

Note that this patch does not actually enable any way to request a
server to enable this context; that will come in the next patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <20201027050556.269064-10-eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
2020-10-30 15:22:00 -05:00
Greg Kurz 1a6d3bd229 block: End quiescent sections when a BDS is deleted
If a BDS gets deleted during blk_drain_all(), it might miss a
call to bdrv_do_drained_end(). This means missing a call to
aio_enable_external() and the AIO context remains disabled for
ever. This can cause a device to become irresponsive and to
disrupt the guest execution, ie. hang, loop forever or worse.

This scenario is quite easy to encounter with virtio-scsi
on POWER when punching multiple blockdev-create QMP commands
while the guest is booting and it is still running the SLOF
firmware. This happens because SLOF disables/re-enables PCI
devices multiple times via IO/MEM/MASTER bits of PCI_COMMAND
register after the initial probe/feature negotiation, as it
tends to work with a single device at a time at various stages
like probing and running block/network bootloaders without
doing a full reset in-between. This naturally generates many
dataplane stops and starts, and thus many drain sections that
can race with blockdev_create_run(). In the end, SLOF bails
out.

It is somehow reproducible on x86 but it requires to generate
articial dataplane start/stop activity with stop/cont QMP
commands. In this case, seabios ends up looping for ever,
waiting for the virtio-scsi device to send a response to
a command it never received.

Add a helper that pairs all previously called bdrv_do_drained_begin()
with a bdrv_do_drained_end() and call it from bdrv_close().
While at it, update the "/bdrv-drain/graph-change/drain_all"
test in test-bdrv-drain so that it can catch the issue.

BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160346526998.272601.9045392804399803158.stgit@bahia.lan>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-27 15:26:20 +01:00
Alberto Garcia 46cd1e8a47 qcow2: Skip copy-on-write when allocating a zero cluster
Since commit c8bb23cbdb when a write
request results in a new allocation QEMU first tries to see if the
rest of the cluster outside the written area contains only zeroes.

In that case, instead of doing a normal copy-on-write operation and
writing explicit zero buffers to disk, the code zeroes the whole
cluster efficiently using pwrite_zeroes() with BDRV_REQ_NO_FALLBACK.

This improves performance very significantly but it only happens when
we are writing to an area that was completely unallocated before. Zero
clusters (QCOW2_CLUSTER_ZERO_*) are treated like normal clusters and
are therefore slower to allocate.

This happens because the code uses bdrv_is_allocated_above() rather
bdrv_block_status_above(). The former is not as accurate for this
purpose but it is faster. However in the case of qcow2 the underlying
call does already report zero clusters just fine so there is no reason
why we cannot use that information.

After testing 4KB writes on an image that only contains zero clusters
this patch results in almost five times more IOPS.

Signed-off-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <6d77cab968c501c44d6e1089b9bc91b04170b49e.1603731354.git.berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-10-27 15:26:20 +01:00
Gollu Appalanaidu 28fee5b5d0 hw/block/nvme: fix prp mapping status codes
Address 0 is not an invalid address. Remove those invalikd checks.

Unaligned PRP2 and PRP list entries should result in Invalid PRP Offset
status code and not Invalid Field. Fix that.

See NVMe Express v1.3d, Section 4.3 ("Physical Region Page Entry and
List").

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Gollu Appalanaidu <anaidu.gollu@samsung.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
2020-10-27 11:29:25 +01:00
Klaus Jensen 1b48e4611a hw/block/nvme: reject io commands if only admin command set selected
If the host sets CC.CSS to 111b, all commands submitted to I/O queues
should be completed with status Invalid Command Opcode.

Note that this is technically a v1.4 feature, but it does not hurt to
implement before we finally bump the reported version implemented.

Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-10-27 11:29:25 +01:00