linux/drivers/nvme/target
James Smart a96d4bd867 nvmet: fix fatal_err_work deadlock
Below is a stack trace for an issue that was reported.

What's happening is that the nvmet layer had its controller kato
timeout fire, which causes it to schedule its fatal error handler
via the fatal_err_work element. The error handler is invoked, which
calls the transport delete_ctrl() entry point, and as the transport
tears down the controller, nvmet_sq_destroy() ends up doing the final
put on the ctrl, causing it to enter its free routine. The ctrl free
routine does a cancel_work_sync() on the fatal_err_work element, which
then does a flush_work and wait_for_completion. But, as the wait runs
in the context of the very work element being flushed, it's in a
catch-22 and the thread hangs.
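For illustration only, here is a minimal kernel-style C sketch of that
self-flush pattern; struct obj, err_handler(), and obj_free() are
hypothetical stand-ins for the nvmet ctrl, the fatal_err_work handler,
and the ctrl free routine:

#include <linux/workqueue.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical object standing in for the nvmet ctrl. */
struct obj {
	struct kref		ref;		/* kref_init() done at allocation */
	struct work_struct	err_work;	/* plays the role of fatal_err_work */
};

static void obj_free(struct kref *ref)
{
	struct obj *o = container_of(ref, struct obj, ref);

	/*
	 * Deadlocks when reached from err_handler(): cancel_work_sync()
	 * flushes err_work and waits for it to complete, but we are
	 * still running *inside* err_work, so the wait never returns.
	 */
	cancel_work_sync(&o->err_work);
	kfree(o);
}

static void err_handler(struct work_struct *work)
{
	struct obj *o = container_of(work, struct obj, err_work);

	/*
	 * Teardown drops the final reference from within the work
	 * element itself, so obj_free() runs in err_work context.
	 */
	kref_put(&o->ref, obj_free);
}

The __cancel_work_timer()/flush_work()/wait_for_completion() frames in
the trace below are exactly that wait: the worker thread blocking on
the completion of the work element it is itself executing.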

[  326.903131] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[  326.909832] nvmet: ctrl 1 fatal error occurred!
[  327.643100] lpfc 0000:04:00.0: 0:6313 NVMET Defer ctx release xri x114 flg x2
[  494.582064] INFO: task kworker/0:2:243 blocked for more than 120 seconds.
[  494.589638]       Not tainted 4.14.0-rc1.James+ #1
[  494.594986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  494.603718] kworker/0:2     D    0   243      2 0x80000000
[  494.609839] Workqueue: events nvmet_fatal_error_handler [nvmet]
[  494.616447] Call Trace:
[  494.619177]  __schedule+0x28d/0x890
[  494.623070]  schedule+0x36/0x80
[  494.626571]  schedule_timeout+0x1dd/0x300
[  494.631044]  ? dequeue_task_fair+0x592/0x840
[  494.635810]  ? pick_next_task_fair+0x23b/0x5c0
[  494.640756]  wait_for_completion+0x121/0x180
[  494.645521]  ? wake_up_q+0x80/0x80
[  494.649315]  flush_work+0x11d/0x1a0
[  494.653206]  ? wake_up_worker+0x30/0x30
[  494.657484]  __cancel_work_timer+0x10b/0x190
[  494.662249]  cancel_work_sync+0x10/0x20
[  494.666525]  nvmet_ctrl_put+0xa3/0x100 [nvmet]
[  494.671482]  nvmet_sq_:q+0x64/0xd0 [nvmet]
[  494.676540]  nvmet_fc_delete_target_queue+0x202/0x220 [nvmet_fc]
[  494.683245]  nvmet_fc_delete_target_assoc+0x6d/0xc0 [nvmet_fc]
[  494.689743]  nvmet_fc_delete_ctrl+0x137/0x1a0 [nvmet_fc]
[  494.695673]  nvmet_fatal_error_handler+0x30/0x40 [nvmet]
[  494.701589]  process_one_work+0x149/0x360
[  494.706064]  worker_thread+0x4d/0x3c0
[  494.710148]  kthread+0x109/0x140
[  494.713751]  ? rescuer_thread+0x380/0x380
[  494.718214]  ? kthread_park+0x60/0x60

Correct this by having the fc transport switch to a different workqueue
context for the actual controller teardown, which may end up calling
cancel_work_sync().
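Continuing the hypothetical sketch above (struct obj gains a second
work element, del_work), the fix pattern looks roughly like this;
start_teardown() and do_teardown() are made-up names, not the
transport's actual entry points:

/* struct obj additionally carries: struct work_struct del_work; */

static void do_teardown(struct work_struct *work)
{
	struct obj *o = container_of(work, struct obj, del_work);

	/*
	 * The final put now happens in del_work context, not inside
	 * err_work, so cancel_work_sync(&o->err_work) in obj_free()
	 * can flush err_work and actually complete.
	 */
	kref_put(&o->ref, obj_free);
}

static void start_teardown(struct obj *o)
{
	/* Called from err_handler(): defer teardown rather than doing it inline. */
	INIT_WORK(&o->del_work, do_teardown);
	schedule_work(&o->del_work);
}

err_handler() then calls start_teardown() instead of dropping the
reference directly, so no work element ever waits on its own flush.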

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:42 +01:00
admin-cmd.c nvmet: add support for reporting the host identifier 2017-08-30 15:05:18 +02:00
configfs.c nvmet: use NVME_NSID_ALL 2017-08-28 21:38:19 +02:00
core.c nvmet: Change max_nsid in subsystem due to ns_disable if needed 2017-10-20 12:06:02 +02:00
discovery.c nvme: use a single NVME_AQ_DEPTH and relax it to 32 2017-06-28 08:14:13 -06:00
fabrics-cmd.c nvmet: implement valid sqhd values in completions 2017-09-25 08:56:05 -06:00
fc.c nvmet: fix fatal_err_work deadlock 2017-11-01 16:34:42 +01:00
fcloop.c nvme-fcloop: fix port deletes and callbacks 2017-09-25 12:42:11 -06:00
io-cmd.c block: replace bi_bdev with a gendisk pointer and partitions index 2017-08-23 12:49:55 -06:00
Kconfig nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME 2016-12-06 14:51:48 +01:00
loop.c nvme: consolidate common code from ->reset_work 2017-11-01 16:28:06 +01:00
Makefile nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME 2016-12-06 14:51:48 +01:00
nvmet.h nvmet: bump NVMET_NR_QUEUES to 128 2017-10-04 09:43:58 +02:00
rdma.c nvmet-rdma: remove redundant empty device add callout 2017-08-18 10:46:20 -04:00