Commit graph

4036 commits

Author SHA1 Message Date
Bart Van Assche 0d25bd072b block: Schedule runtime resume earlier
Instead of scheduling runtime resume of a request queue after a
request has been queued, schedule asynchronous resume during request
allocation. The new pm_request_resume() calls occur after
blk_queue_enter() has increased the q_usage_counter request queue
member. This change is needed for a later patch that will make request
allocation block while the queue status is not RPM_ACTIVE.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche 154b00d566 block: Split blk_pm_add_request() and blk_pm_put_request()
Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
two new functions and thereby separate legacy block layer code from code
that works for both the legacy block layer and blk-mq. A later patch will
add calls to the new functions in the blk-mq code.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche cd84a62e00 block, scsi: Change the preempt-only flag into a counter
The RQF_PREEMPT flag is used for three purposes:
- In the SCSI core, for making sure that power management requests
  are executed even if a device is in the "quiesced" state.
- For domain validation by SCSI drivers that use the parallel port.
- In the IDE driver, for IDE preempt requests.
Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context. Since
with this change scsi_device_quiesce() is no longer idempotent, make
that function return early if it is called for a quiesced queue.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Bart Van Assche bca6b067b0 block: Move power management code into a new source file
Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of header file <linux/blkdev.h>
and hence should help to reduce the build time of the Linux kernel
if CONFIG_PM is not defined.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 15:11:28 -06:00
Christoph Hellwig c39ae60dfb block: remove ARCH_BIOVEC_PHYS_MERGEABLE
Take the Xen check into the core code instead of delegating it to
the architectures.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-26 08:45:11 -06:00
Christoph Hellwig 6e768461c2 block: remove bvec_to_phys
We only use it in biovec_phys_mergeable and a m68k paravirt driver,
so just opencode it there.  Also remove the pointless unsigned long cast
for the offset in the opencoded instances.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:59 -06:00
Christoph Hellwig 3dccdae54f block: merge BIOVEC_SEG_BOUNDARY into biovec_phys_mergeable
These two checks should always be performed together, so merge them into
a single helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:57 -06:00
Christoph Hellwig 0e253391a9 block: add a missing BIOVEC_SEG_BOUNDARY check in bio_add_pc_page
The actual recaculation of segments in __blk_recalc_rq_segments will
do this check, so there is no point in forcing it if we know it won't
succeed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:55 -06:00
Christoph Hellwig 6a9f5f240a block: simplify BIOVEC_PHYS_MERGEABLE
Turn the macro into an inline, move it to blk.h and simplify the
arch hooks a bit.

Also rename the function to biovec_phys_mergeable as there is no need
to shout.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:54 -06:00
Christoph Hellwig 27ca1d4ed0 block: move req_gap_back_merge to blk.h
No need to expose these helpers outside the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:53 -06:00
Christoph Hellwig e9907009cb block: move req_gap_{back,front}_merge to blk-merge.c
Keep it close to the actual users instead of exposing the function to all
drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:51 -06:00
Christoph Hellwig 43b729bfe9 block: move integrity_req_gap_{back,front}_merge to blk.h
No need to expose these to drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-24 12:33:50 -06:00
Bart Van Assche c7b1bf5cca blk-mq: Document the functions that iterate over requests
Make it easier to understand the purpose of the functions that iterate
over requests by documenting their purpose. Fix several minor spelling
and grammer mistakes in comments in these functions.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:30:22 -06:00
Dennis Zhou (Facebook) 101246ec02 blkcg: rename blkg_try_get to blkg_tryget
blkg reference counting now uses percpu_ref rather than atomic_t. Let's
make this consistent with css_tryget. This renames blkg_try_get to
blkg_tryget and now returns a bool rather than the blkg or NULL.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:19 -06:00
Dennis Zhou (Facebook) b3b9f24f5f blkcg: change blkg reference counting to use percpu_ref
Now that every bio is associated with a blkg, this puts the use of
blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
the refcnt in blkg to use percpu_ref.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:18 -06:00
Dennis Zhou (Facebook) f0fcb3ec89 blkcg: remove additional reference to the css
The previous patch in this series removed carrying around a pointer to
the css in blkg. However, the blkg association logic still relied on
taking a reference on the css to ensure we wouldn't fail in getting a
reference for the blkg.

Here the implicit dependency on the css is removed. The association
continues to rely on the tryget logic walking up the blkg tree. This
streamlines the three ways that association can happen: normal, swap,
and writeback.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:15 -06:00
Dennis Zhou (Facebook) c839e7a03f blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that all bios are now associated with some blkg.
This now makes bio->bi_css unnecessary as blkg maintains a reference to
the blkcg already.

This patch removes the field bi_css and transfers corresponding uses to
access via bi_blkg.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:13 -06:00
Dennis Zhou (Facebook) 74b7c02a9b blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by
cgroups. There are two other paths that we want to attribute work back
to the appropriate cgroup: swap and writeback. Here we modify the way
swap tags bios to include the blkg. Writeback will be tackle in the next
patch.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:09 -06:00
Dennis Zhou (Facebook) 5bf9a1f3b4 blkcg: consolidate bio_issue_init to be a part of core
bio_issue_init among other things initializes the timestamp for an IO.
Rather than have this logic handled by policies, this consolidates it to
be on the init paths (normal, clone, bounce clone).

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:08 -06:00
Dennis Zhou (Facebook) a7b39b4e96 blkcg: always associate a bio with a blkg
Previously, blkg's were only assigned as needed by blk-iolatency and
blk-throttle. bio->css was also always being associated while blkg was
being looked up and then thrown away in blkcg_bio_issue_check.

This patch begins the cleanup of bio->css and bio->bi_blkg by always
associating a blkg in blkcg_bio_issue_check. This tries to create the
blkg, but if it is not possible, falls back to using the root_blkg of
the request_queue. Therefore, a bio will always be associated with a
blkg. The duplicate association logic is removed from blk-throttle and
blk-iolatency.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:06 -06:00
Dennis Zhou (Facebook) 07b05bcc32 blkcg: convert blkg_lookup_create to find closest blkg
There are several scenarios where blkg_lookup_create can fail. Examples
include the blkcg dying, request_queue is dying, or simply being OOM. At
the end of the day, most handle this by simply falling back to the
q->root_blkg and calling it a day.

This patch implements the notion of closest blkg. During
blkg_lookup_create, if it fails to create, return the closest blkg
found or the q->root_blkg. blkg_try_get_closest is introduced and used
during association so a bio is always attached to a blkg.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:05 -06:00
Dennis Zhou (Facebook) 49f4c2dc2b blkcg: update blkg_lookup_create to do locking
To know when to create a blkg, the general pattern is to do a
blkg_lookup and if that fails, lock and then do a lookup again and if
that fails finally create. It doesn't make much sense for everyone who
wants to do creation to write this themselves.

This changes blkg_lookup_create to do locking and implement this
pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create.
If a call site wants to do its own error handling or already owns the
queue lock, they can use __blkg_lookup_create. This will be used in
upcoming patches.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:03 -06:00
Dennis Zhou (Facebook) 27e6fa996c blkcg: fix ref count issue with bio_blkcg using task_css
The accessor function bio_blkcg either returns the blkcg associated with
the bio or finds one in the current context. This can cause an issue
when trying to associate a bio with a blkcg. Particularly, it's the
third case that is problematic:

	return css_to_blkcg(task_css(current, io_cgrp_id));

As the above may race against task migration and the cgroup exiting, it
is not always ok to take a reference on the blkcg returned from
bio_blkcg.

This patch adds association ahead of calling bio_blkcg rather than
after. This makes association a required and explicit step along the
code paths for calling bio_blkcg. blk_get_rl is modified as well to get
a reference to the blkcg it may use and blk_put_rl will always put the
reference back. Association is also moved above the bio_blkcg call to
ensure it will not return NULL in blk-iolatency.

BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
to address this in this series. I've created a private version of the
function with notes not to use it describing the flaw. Hopefully soon,
that code can be cleaned up.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:02 -06:00
Liu Bo 9ff01255a0 Blk-throttle: update to use rbtree with leftmost node cached
As rbtree has native support of caching leftmost node,
i.e. rb_root_cached, no need to do the caching by ourselves.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 19:31:46 -06:00
Christoph Hellwig 576ed91354 block: use bio_add_page in bio_iov_iter_get_pages
Replace a nasty hack with a different nasty hack to prepare for multipage
bio_vecs.  By moving the temporary page array as far up as possible in
the space allocated for the bio_vec array we can iterate forward over it
and thus use bio_add_page.  Using bio_add_page means we'll be able to
merge physically contiguous pages once support for multipath bio_vecs is
merged.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-20 09:11:40 -06:00
Paolo Valente c8765de0ad blok, bfq: do not plug I/O if all queues are weight-raised
To reduce latency for interactive and soft real-time applications, bfq
privileges the bfq_queues containing the I/O of these
applications. These privileged queues, referred-to as weight-raised
queues, get a much higher share of the device throughput
w.r.t. non-privileged queues. To preserve this higher share, the I/O
of any non-weight-raised queue must be plugged whenever a sync
weight-raised queue, while being served, remains temporarily empty. To
attain this goal, bfq simply plugs any I/O (from any queue), if a sync
weight-raised queue remains empty while in service.

Unfortunately, this plugging typically lowers throughput with random
I/O, on devices with internal queueing (because it reduces the filling
level of the internal queues of the device).

This commit addresses this issue by restricting the cases where
plugging is performed: if a sync weight-raised queue remains empty
while in service, then I/O plugging is performed only if some of the
active bfq_queues are *not* weight-raised (which is actually the only
circumstance where plugging is needed to preserve the higher share of
the throughput of weight-raised queues). This restriction proved able
to boost throughput in really many use cases needing only maximum
throughput.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:05 -06:00
Paolo Valente d0edc2473b block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash
The Achilles' heel of BFQ is its failing to reach a high throughput
with sync random I/O on flash storage with internal queueing, in case
the processes doing I/O have differentiated weights.

The cause of this failure is as follows. If at least two processes do
sync I/O, and have a different weight from each other, then BFQ plugs
I/O dispatching every time one of these processes, while it is being
served, remains temporarily without pending I/O requests. This
plugging is necessary to guarantee that every process enjoys a
bandwidth proportional to its weight; but it empties the internal
queue(s) of the drive. And this kills throughput with random I/O. So,
if some processes have differentiated weights and do both sync and
random I/O, the end result is a throughput collapse.

This commit tries to counter this problem by injecting the service of
other processes, in a controlled way, while the process in service
happens to have no I/O. This injection is performed only if the medium
is non rotational and performs internal queueing, and the process in
service does random I/O (service injection might be beneficial for
sequential I/O too, we'll work on that).

As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
SSD, and with five processes having differentiated weights and doing
sync random 4KB I/O, this commit makes the throughput with bfq grow by
400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
that reached with none. As some less random I/O is added to the mix,
the throughput becomes equal to or higher than that with none.

This commit is a very first attempt to recover throughput without
losing control, and certainly has many limitations. One is, e.g., that
the processes whose service is injected are not chosen so as to
distribute the extra bandwidth they receive in accordance to their
weights. Thus there might be loss of weighted fairness in some
cases. Anyway, this loss concerns extra service, which would not have
been received at all without this commit. Other limitations and issues
will probably show up with usage.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:03 -06:00
Paolo Valente cbeb869a3d block, bfq: correctly charge and reset entity service in all cases
BFQ schedules entities (which represent either per-process queues or
groups of queues) as a function of their timestamps. In particular, as
a function of their (virtual) finish times. The finish time of an
entity is computed as a function of the budget assigned to the entity,
assuming, tentatively, that the entity, once in service, will receive
an amount of service equal to its budget. Then, when the entity is
expired because it finishes to be served, this finish time is updated
as a function of the actual service received by the entity. This
allows the entity to be correctly charged with only the service
received, and then to be correctly re-scheduled.

Yet an entity may receive service also while not being the entity in
service (in the scheduling environment of its parent entity), for
several reasons. If the entity remains with no backlog while receiving
this 'unofficial' service, then it is expired. Also on such an
expiration, the finish time of the entity should be updated to account
for only the service actually received by the entity. Unfortunately,
such an update is not performed for an entity expiring without being
the entity in service.

In a similar vein, the service counter of the entity in service is
reset when the entity is expired, to be ready to be used for next
service cycle. This reset too should be performed also in case an
entity is expired because it remains empty after receiving service
while not being the entity in service. But in this case the reset is
not performed.

This commit performs the above update of the finish time and reset of
the service received, also for an entity expiring while not being the
entity in service.

Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14 13:06:02 -06:00
YueHaibing f8c0d7b16f blk-iolatency: remove set but not used variables 'changed' and 'blkiolat'
Fixes gcc '-Wunused-but-set-variable' warning:

block/blk-iolatency.c: In function 'scale_change':
block/blk-iolatency.c:301:7: warning:
 variable 'changed' set but not used [-Wunused-but-set-variable]

block/blk-iolatency.c: In function 'iolatency_set_limit':
block/blk-iolatency.c:765:24: warning:
 variable 'blkiolat' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-13 19:59:24 -06:00
Ming Lei 7759eb23fd block: remove bio_rewind_iter()
It is pointed that bio_rewind_iter() is one very bad API[1]:

1) bio size may not be restored after rewinding

2) it causes some bogus change, such as 5151842b9d (block: reset
bi_iter.bi_done after splitting bio)

3) rewinding really makes things complicated wrt. bio splitting

4) unnecessary updating of .bi_done in fast path

[1] https://marc.info/?t=153549924200005&r=1&w=2

So this patch takes Kent's suggestion to restore one bio into its original
state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
given now bio_rewind_iter() is only used by bio integrity code.

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Hannes Reinecke <hare@suse.com>
Suggested-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 15:12:24 -06:00
Konstantin Khlebnikov d5274b3cd6 block: bfq: swap puts in bfqg_and_blkg_put
Fix trivial use-after-free. This could be last reference to bfqg.

Fixes: 8f9bebc33d ("block, bfq: access and cache blkg data only when safe")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-06 11:32:58 -06:00
Mikulas Patocka 8b2ded1c94 block: don't warn when doing fsync on read-only devices
It is possible to call fsync on a read-only handle (for example, fsck.ext2
does it when doing read-only check), and this call results in kernel
warning.

The patch b089cfd95d ("block: don't warn for flush on read-only device")
attempted to disable the warning, but it is buggy and it doesn't
(op_is_flush tests flags, but bio_op strips off the flags).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 721c7fc701 ("block: fail op_is_write() requests to read-only partitions")
Cc: stable@vger.kernel.org	# 4.18
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-05 16:14:36 -06:00
Dennis Zhou (Facebook) 3111885015 blkcg: use tryget logic when associating a blkg with a bio
There is a very small change a bio gets caught up in a really
unfortunate race between a task migration, cgroup exiting, and itself
trying to associate with a blkg. This is due to css offlining being
performed after the css->refcnt is killed which triggers removal of
blkgs that reach their blkg->refcnt of 0.

To avoid this, association with a blkg should use tryget and fallback to
using the root_blkg.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:58 -06:00
Dennis Zhou (Facebook) 59b57717ff blkcg: delay blkg destruction until after writeback has finished
Currently, blkcg destruction relies on a sequence of events:
  1. Destruction starts. blkcg_css_offline() is called and blkgs
     release their reference to the blkcg. This immediately destroys
     the cgwbs (writeback).
  2. With blkgs giving up their reference, the blkcg ref count should
     become zero and eventually call blkcg_css_free() which finally
     frees the blkcg.

Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
on the completion of all writeback associated with the blkcg. A count of
the number of cgwbs is maintained and once that goes to zero, blkg
destruction can follow. This should prevent premature blkg destruction
related to writeback.

The new process for blkcg cleanup is as follows:
  1. Destruction starts. blkcg_css_offline() is called which offlines
     writeback. Blkg destruction is delayed on the cgwb_refcnt count to
     avoid punting potentially large amounts of outstanding writeback
     to root while maintaining any ongoing policies. Here, the base
     cgwb_refcnt is put back.
  2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
     and handles destruction of blkgs. This is where the css reference
     held by each blkg is released.
  3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
     This finally frees the blkg.

It seems in the past blk-throttle didn't do the most understandable
things with taking data from a blkg while associating with current. So,
the simplification and unification of what blk-throttle is doing caused
this.

Fixes: 08e18eab0c ("block: add bi_blkg to the bio for cgroups")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:56 -06:00
Dennis Zhou (Facebook) 6b06546206 Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
This reverts commit 4c6994806f.

Destroying blkgs is tricky because of the nature of the relationship. A
blkg should go away when either a blkcg or a request_queue goes away.
However, blkg's pin the blkcg to ensure they remain valid. To break this
cycle, when a blkcg is offlined, blkgs put back their css ref. This
eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f) breaks this order of events by trying to
destroy blkgs in css_free(). As the blkgs still hold references to the
blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
addressed in the following patch by delaying destruction of a blkg until
all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-31 14:48:54 -06:00
John Pittman db193954ed block: bsg: move atomic_t ref_count variable to refcount API
Currently, variable ref_count within the bsg_device struct is of
type atomic_t.  For variables being used as reference counters,
the refcount API should be used instead of atomic.  The newer
refcount API works to prevent counter overflows and use-after-free
bugs.  So, move this varable from the atomic API to refcount,
potentially avoiding the issues mentioned.

Signed-off-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:17:02 -06:00
Chengguang Xu 62d2a19407 block: remove unnecessary condition check
kmem_cache_destroy() can handle NULL pointer correctly, so there is
no need to check e->icq_cache before calling kmem_cache_destroy().

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 19:16:06 -06:00
Jens Axboe b0a84beb2e blk-wbt: remove dead code
We already note and mark discard and swap IO from bio_to_wbt_flags().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 13:32:12 -06:00
Jens Axboe 38cfb5a45e blk-wbt: improve waking of tasks
We have two potential issues:

1) After commit 2887e41b91, we only wake one process at the time when
   we finish an IO. We really want to wake up as many tasks as can
   queue IO. Before this commit, we woke up everyone, which could cause
   a thundering herd issue.

2) A task can potentially consume two wakeups, causing us to (in
   practice) miss a wakeup.

Fix both by providing our own wakeup function, which stops
__wake_up_common() from waking up more tasks if we fail to get a
queueing token. With the strict ordering we have on the wait list, this
wakes the right tasks and the right amount of tasks.

Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:26 -06:00
Jens Axboe 061a542753 blk-wbt: abstract out end IO completion handler
Prep patch for calling the handler from a different context,
no functional changes in this patch.

Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-27 11:27:24 -06:00
Jens Axboe c125311d96 blk-wbt: don't maintain inflight counts if disabled
A previous commit removed the ability to have per-rq flags. We used
those flags to maintain inflight counts. Since we don't have those
anymore, we have to always maintain inflight counts, even if wbt is
disabled. This is clearly suboptimal.

Add a queue quiesce around changing the wbt latency settings from sysfs
to work around this. With that, we can reliably put the enabled check in
our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
consistent for the lifetime of the request.

Fixes: c1c80384c8 ("block: remove external dependency on wbt_flags")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-23 09:34:46 -06:00
Jens Axboe c45e6a037a blk-wbt: fix has-sleeper queueing check
We need to do this inside the loop as well, or we can allow new
IO to supersede previous IO.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:32 -06:00
Jens Axboe b78820937b blk-wbt: use wq_has_sleeper() for wq active check
We need the memory barrier before checking the list head,
use the appropriate helper for this. The matching queue
side memory barrier is provided by set_current_state().

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Jens Axboe ffa358dcaa blk-wbt: move disable check into get_limit()
Check it in one place, instead of in multiple places.

Tested-by: Anchal Agarwal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-22 15:07:31 -06:00
Linus Torvalds 5bed49adfe for-4.19/post-20180822
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlt9on8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj1xEADKBmJlV9aVyxc5w6XggqAGeHqI4afFrl+v
 9fW6WUQMAaBUrr7PMIEJQ0Zm4B7KxgBaEWNtuuj4ULkjpgYm2AuGUuTJSyKz41rS
 Ma+KNyCA2Zmq4SvwGFbcdCuCbUqnoxTycscAgCjuDvIYLW0+nFGNc47ibmu9lZIV
 33Ef5LrxuCjhC2zyNxEdWpUxDCjoYzock85LW+wYyIYLU9uKdoExS+YmT8U+ebA/
 AkXBcxPztNDxwkcsIwgGVoTjwxiowqGz3uueWfyEmYgaCPiNOsxkoNQAtjX4ykQE
 hnqnHWyzJkRwbYo7Vd/bRAZXvszKGYE1YcJmu5QrNf0dK5MSq2o5OYJAEJWbucPj
 m0R2u7O9qbS2JEnxGrm5+oYJwBzwNY5/Lajr15WkljTqobKnqcvn/Hdgz/XdGtek
 0S1QHkkBsF7e+cax8sePWK+O3ilY7pl9CzyZKB/tJngl8A45Jv8xVojg0v3O7oS+
 zZib0rwWg/bwR/uN6uPCDcEsQusqL5YovB7m6NRVshwz6cV1zVNp2q+iOulk7KuC
 MprW4Du9CJf8HA19XtyJIG1XLstnuz+Exy+i5BiimUJ5InoEFDuj/6OZa6Qaczbo
 SrDDvpGtSf4h7czKpE5kV4uZiTOrjuI30TrI+4csdZ7HQIlboxNL72seNTLJs55F
 nbLjRM8L6g==
 =FS7e
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:

 - Set of bcache fixes and changes (Coly)

 - The flush warn fix (me)

 - Small series of BFQ fixes (Paolo)

 - wbt hang fix (Ming)

 - blktrace fix (Steven)

 - blk-mq hardware queue count update fix (Jianchao)

 - Various little fixes

* tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
  block/DAC960.c: make some arrays static const, shrinks object size
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: init hctx sched after update ctx and hctx mapping
  block: remove duplicate initialization
  tracing/blktrace: Fix to allow setting same value
  pktcdvd: fix setting of 'ret' error return for a few cases
  block: change return type to bool
  block, bfq: return nbytes and not zero from struct cftype .write() method
  block, bfq: improve code of bfq_bfqq_charge_time
  block, bfq: reduce write overcharge
  block, bfq: always update the budget of an entity when needed
  block, bfq: readd missing reset of parent-entity service
  blk-wbt: fix IO hang in wbt_wait()
  block: don't warn for flush on read-only device
  bcache: add the missing comments for smp_mb()/smp_wmb()
  bcache: remove unnecessary space before ioctl function pointer arguments
  bcache: add missing SPDX header
  bcache: move open brace at end of function definitions to next line
  bcache: add static const prefix to char * array declarations
  bcache: fix code comments style
  ...
2018-08-22 13:38:05 -07:00
Jianchao Wang f5bbbbe4d6 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to
account the inflight requests. It will access the queue_hw_ctx and
nr_hw_queues w/o any protection. When updating nr_hw_queues and
blk_mq_in_flight/rw occur concurrently, panic comes up.

Before update nr_hw_queues, the q will be frozen. So we could use
q_usage_counter to avoid the race. percpu_ref_is_zero is used here
so that we will not miss any in-flight request. The access to
nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
under rcu critical section, __blk_mq_update_nr_hw_queues could use
synchronize_rcu to ensure the zeroed q_usage_counter to be globally
visible.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:56 -06:00
Jianchao Wang d48ece209f blk-mq: init hctx sched after update ctx and hctx mapping
Currently, when update nr_hw_queues, IO scheduler's init_hctx will
be invoked before the mapping between ctx and hctx is adapted
correctly by blk_mq_map_swqueue. The IO scheduler init_hctx (kyber)
may depend on this mapping and get wrong result and panic finally.
A simply way to fix this is that switch the IO scheduler to 'none'
before update the nr_hw_queues, and then switch it back after
update nr_hw_queues. blk_mq_sched_init_/exit_hctx are removed due
to nobody use them any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-21 09:02:55 -06:00
Chaitanya Kulkarni fcedba42d9 block: remove duplicate initialization
This patch removes the duplicate initialization of q->queue_head
in the blk_alloc_queue_node(). This removes the 2nd initialization
so that we preserve the initialization order same as declaration
present in struct request_queue.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-17 12:33:36 -06:00
Chengguang Xu 599d067dd3 block: change return type to bool
Because blk_do_io_stat() only does a judgement about the request
contributes to IO statistics, it better changes return type to bool.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:44:17 -06:00
Maciej S. Szmigiero fc8ebd01de block, bfq: return nbytes and not zero from struct cftype .write() method
The value that struct cftype .write() method returns is then directly
returned to userspace as the value returned by write() syscall, so it
should be the number of bytes actually written (or consumed) and not zero.

Returning zero from write() syscall makes programs like /bin/echo or bash
spin.

Signed-off-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: e21b7a0b98 ("block, bfq: add full hierarchical scheduling and cgroups support")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-16 13:11:16 -06:00