linux/block
Tejun Heo ec14a87ee1 blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
blk-iocost sometimes causes the following crash:

  BUG: kernel NULL pointer dereference, address: 00000000000000e0
  ...
  RIP: 0010:_raw_spin_lock+0x17/0x30
  Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
  RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
  RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
  RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
  RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
  R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
  R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
  FS:  00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
  Call Trace:
   <TASK>
   ioc_weight_write+0x13d/0x410
   cgroup_file_write+0x7a/0x130
   kernfs_fop_write_iter+0xf5/0x170
   vfs_write+0x298/0x370
   ksys_write+0x5f/0xb0
   __x64_sys_write+0x1b/0x20
   do_syscall_64+0x3d/0x80
   entry_SYSCALL_64_after_hwframe+0x46/0xb0

This happens because iocg->ioc is NULL. The field is initialized by
ioc_pd_init() and never cleared. The NULL deref is caused by
blkcg_activate_policy() installing blkg_policy_data before initializing it.

blkcg_activate_policy() was doing the following:

1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
2. Initialize all pd's.
3. Online all pd's.

blkcg_activate_policy() only grabs the queue_lock and may release and
re-acquire the lock as allocation may need to sleep. ioc_weight_write()
grabs blkcg->lock and iterates all its blkg's. The two can race and if
ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a
pd which is not initialized yet, leading to the crash.

The crash can be reproduced with the following script:

  #!/bin/bash

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
  echo 100 > /sys/fs/cgroup/system.slice/io.weight
  bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
  sleep .2
  echo 100 > /sys/fs/cgroup/system.slice/io.weight

with the following patch applied:

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fc49be622e05..38d671d5e10c 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
> 		pd->online = false;
> 	}
>
> +       if (system_state == SYSTEM_RUNNING) {
> +               spin_unlock_irq(&q->queue_lock);
> +               ssleep(1);
> +               spin_lock_irq(&q->queue_lock);
> +       }
> +
> 	/* all allocated, init in the same order */
> 	if (pol->pd_init_fn)
> 		list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)

I don't see a reason why all pd's should be allocated, initialized and
onlined together. The only ordering requirement is that parent blkgs be
initialized and onlined before their children, which is guaranteed by the
walking order. Let's fix the bug by allocating, initializing and onlining the
pd for each blkg in turn, holding blkcg->lock over initialization and
onlining. This ensures that an installed blkg is always fully initialized
and onlined, removing the race window.
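
The difference between the two orderings can be sketched in userspace C.
This is a single-threaded model, not the kernel code: struct ioc, struct pd
and struct blkg are simplified stand-ins, and activate_old(), activate_new()
and count_uninitialized() are hypothetical names. The concurrent
ioc_weight_write() is modelled as a check made at the points where the real
code would not be holding blkcg->lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLKGS 3

/* Simplified stand-ins for the kernel structures (assumed layout). */
struct ioc { int unused; };
struct pd { struct ioc *ioc; bool online; };
struct blkg { struct pd *pd; };

static struct ioc the_ioc;
static struct pd pd_pool[NR_BLKGS];

/* Models the writer: counts installed pds whose ->ioc is still NULL. */
static int count_uninitialized(const struct blkg *blkgs)
{
	int bad = 0;

	assert(blkgs != NULL);
	for (int i = 0; i < NR_BLKGS; i++)
		if (blkgs[i].pd && !blkgs[i].pd->ioc)
			bad++;
	return bad;
}

/* Old ordering: install every pd, then init all, then online all. */
static int activate_old(struct blkg *blkgs)
{
	int seen;

	for (int i = 0; i < NR_BLKGS; i++) {
		pd_pool[i] = (struct pd){0};
		blkgs[i].pd = &pd_pool[i];	/* installed, not initialized */
	}
	/* The writer may run here, between step 1 and step 2. */
	seen = count_uninitialized(blkgs);
	for (int i = 0; i < NR_BLKGS; i++)
		pd_pool[i].ioc = &the_ioc;	/* step 2: init */
	for (int i = 0; i < NR_BLKGS; i++)
		pd_pool[i].online = true;	/* step 3: online */
	return seen;
}

/* New ordering: install, init and online one pd at a time. */
static int activate_new(struct blkg *blkgs)
{
	int seen = 0;

	for (int i = 0; i < NR_BLKGS; i++) {
		pd_pool[i] = (struct pd){0};
		/* In the kernel, blkcg->lock would be held from here... */
		blkgs[i].pd = &pd_pool[i];
		pd_pool[i].ioc = &the_ioc;
		pd_pool[i].online = true;
		/* ...to here, so the writer can only run afterwards. */
		seen += count_uninitialized(blkgs);
	}
	return seen;
}
```

In this model activate_old() reports every pd as visible while still
uninitialized, whereas activate_new() never exposes one, because the writer
can only look after a pd has been both initialized and onlined.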

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 9d179b8654 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-17 19:21:05 -06:00
..
partitions block/partition: fix signedness issue for Amiga partitions 2023-07-05 16:34:56 -06:00
badblocks.c block/badblocks: Remove redundant assignments 2022-04-23 07:15:26 -06:00
bdev.c block: Improve kernel-doc headers 2023-06-21 13:18:15 -06:00
bfq-cgroup.c blkcg: Restructure blkg_conf_prep() and friends 2023-04-13 06:46:49 -06:00
bfq-iosched.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
bfq-iosched.h block, bfq: remove BFQ_WEIGHT_LEGACY_DFL 2023-04-06 16:17:32 -06:00
bfq-wf2q.c block, bfq: inject I/O to underutilized actuators 2023-01-29 15:18:33 -07:00
bio-integrity.c bio-integrity: create multi-page bvecs in bio_integrity_add_page() 2023-08-09 16:05:35 -06:00
bio.c block: Bring back zero_fill_bio_iter 2023-08-14 15:40:42 -06:00
blk-cgroup-fc-appid.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
blk-cgroup-rwstat.c Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" 2023-02-14 14:24:09 -07:00
blk-cgroup-rwstat.h block: Use the new blk_opf_t type 2022-07-14 12:14:30 -06:00
blk-cgroup.c blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init 2023-08-17 19:21:05 -06:00
blk-cgroup.h for-6.4/block-2023-04-21 2023-04-26 12:52:58 -07:00
blk-core.c block: Add some exports for bcachefs 2023-08-14 15:40:42 -06:00
blk-crypto-fallback.c treewide: use get_random_bytes() when possible 2022-10-11 17:42:58 -06:00
blk-crypto-internal.h blk-crypto: remove blk_crypto_insert_cloned_request() 2023-03-16 09:35:09 -06:00
blk-crypto-profile.c blk-crypto: use dynamic lock class for blk_crypto_profile::lock 2023-07-05 16:36:12 -06:00
blk-crypto-sysfs.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-crypto.c blk-crypto: make blk_crypto_evict_key() more robust 2023-03-16 09:35:09 -06:00
blk-flush.c blk-flush: reuse rq queuelist in flush state machine 2023-07-17 08:18:21 -06:00
blk-ia-ranges.c block: make kobj_type structures constant 2023-02-09 09:38:16 -07:00
blk-integrity.c blk-integrity: register sysfs attributes on struct device 2023-04-26 18:22:50 -06:00
blk-ioc.c blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue() 2023-06-07 07:51:00 -06:00
blk-iocost.c blk-iocost: move wbt_enable/disable_default() out of spinlock 2023-06-26 09:53:36 -06:00
blk-iolatency.c block: fix bad lockdep annotation in blk-iolatency 2023-08-10 17:24:53 -06:00
blk-ioprio.c blk-ioprio: Introduce promote-to-rt policy 2023-06-06 22:26:26 -06:00
blk-ioprio.h blk-ioprio: pass a gendisk to blk_ioprio_init and blk_ioprio_exit 2022-09-26 19:09:31 -06:00
blk-lib.c blk-lib: fix blkdev_issue_secure_erase 2022-09-15 00:25:17 -06:00
blk-map.c for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
blk-merge.c blk-mq: release crypto keyslot before reporting I/O complete 2023-03-16 09:35:09 -06:00
blk-mq-cpumap.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-debugfs-zoned.c block: move zone related fields to struct gendisk 2022-07-06 06:46:26 -06:00
blk-mq-debugfs.c blk-mq: fix potential io hang by wrong 'wake_batch' 2023-06-12 09:55:53 -06:00
blk-mq-debugfs.h block: remove per-disk debugfs files in blk_unregister_queue 2022-06-17 07:31:05 -06:00
blk-mq-pci.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-sched.c blk-mq: cleanup __blk_mq_sched_dispatch_requests 2023-04-13 06:57:18 -06:00
blk-mq-sched.h blk-mq: make sure elevator callbacks aren't called for passthrough request 2023-05-18 19:42:54 -06:00
blk-mq-sysfs.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq-tag.c for-6.5/block-2023-06-23 2023-06-26 12:47:20 -07:00
blk-mq-virtio.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-mq.c block: Improve performance for BLK_MQ_F_BLOCKING drivers 2023-07-24 20:13:12 -06:00
blk-mq.h blk-mq: fix potential io hang by wrong 'wake_batch' 2023-06-12 09:55:53 -06:00
blk-pm.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-pm.h block: Remove unused blk_pm_*() function definitions 2021-02-22 06:33:48 -07:00
blk-rq-qos.c block/rq_qos: protect rq_qos apis with a new lock 2023-05-23 11:13:19 -06:00
blk-rq-qos.h blk-iolatency: s/blkcg_rq_qos/iolat_rq_qos/ 2023-04-13 06:46:49 -06:00
blk-settings.c block: don't allow enabling a cache on devices that don't support it 2023-07-17 08:18:18 -06:00
blk-stat.c blk-mq: include <linux/blk-mq.h> in block/blk-mq.h 2023-04-13 06:52:29 -06:00
blk-stat.h block: make queue stat accounting a reference 2021-12-14 17:23:05 -07:00
blk-sysfs.c block: don't allow enabling a cache on devices that don't support it 2023-07-17 08:18:18 -06:00
blk-throttle.c blk-throttle: Fix io statistics for cgroup v1 2023-06-25 08:00:39 -06:00
blk-throttle.h blk-throttle: Fix io statistics for cgroup v1 2023-06-25 08:00:39 -06:00
blk-timeout.c block: blk-timeout: delete duplicated word 2020-07-31 16:29:47 -06:00
blk-wbt.c Merge branch 'for-6.5/block-late' into block-6.5 2023-06-28 16:08:19 -06:00
blk-wbt.h blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled 2023-06-26 09:53:36 -06:00
blk-zoned.c Merge branch '6.5/scsi-staging' into 6.5/scsi-fixes 2023-07-11 12:15:15 -04:00
blk.h block: Add some exports for bcachefs 2023-08-14 15:40:42 -06:00
bounce.c block: change the blk_queue_bounce calling convention 2022-08-02 17:22:54 -06:00
bsg-lib.c scsi: replace the fmode_t argument to ->sg_io_fn with a simple bool 2023-06-12 08:04:04 -06:00
bsg.c SCSI misc on 20230629 2023-06-30 11:57:07 -07:00
disk-events.c block: increment diskseq on all media change events 2023-06-20 07:16:24 -06:00
early-lookup.c block: don't return -EINVAL for not found names in devt_from_devname 2023-06-22 09:09:33 -06:00
elevator.c block: Replace all non-returning strlcpy with strscpy 2023-06-01 09:13:31 -06:00
elevator.h blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
fops.c fs: add CONFIG_BUFFER_HEAD 2023-08-02 09:13:09 -06:00
genhd.c block: fix the exclusive open mask in disk_scan_partitions 2023-06-21 07:37:52 -06:00
holder.c block: don't allow a disk link holder to itself 2022-11-16 15:19:56 -07:00
ioctl.c block: fine-granular CAP_SYS_ADMIN for Persistent Reservation 2023-06-20 12:49:23 -06:00
ioprio.c scsi: block: Improve ioprio value validity checks 2023-06-16 12:04:30 -04:00
Kconfig block: use iomap for writes to block devices 2023-08-02 09:13:09 -06:00
Kconfig.iosched block: Default to use cgroup support for BFQ 2023-01-30 09:42:42 -07:00
kyber-iosched.c blk-mq: pass a flags argument to elevator_type->insert_requests 2023-04-13 06:52:30 -06:00
Makefile block: move the code to do early boot lookup of block devices to block/ 2023-06-05 10:57:40 -06:00
mq-deadline.c block/mq-deadline: use correct way to throttling write requests 2023-08-08 15:46:41 -06:00
opal_proto.h sed-opal: allow user authority to get locking range attributes. 2023-04-05 07:46:25 -06:00
sed-opal.c sed-opal: geometry feature reporting command 2023-04-19 14:07:13 -06:00
t10-pi.c block: add pi for extended integrity 2022-03-07 12:48:35 -07:00