Commit graph

932 commits

Author SHA1 Message Date
Xiaoguang Wang c07e671951 io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()
io_iopoll_complete() does not hold completion_lock to complete polled io,
so in io_wq_submit_work(), we can not call io_req_complete() directly, to
complete polled io, otherwise there maybe concurrent access to cqring,
defer_list, etc, which is not safe. Commit dad1b1242f ("io_uring: always
let io_iopoll_complete() complete polled io") has fixed this issue, but
Pavel reported that IOPOLL apart from rw can do buf reg/unreg requests(
IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not
good.

Given that io_iopoll_complete() is always called under uring_lock, so here
for polled io, we can also get uring_lock to fix this issue.

Fixes: dad1b1242f ("io_uring: always let io_iopoll_complete() complete polled io")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: don't deref 'req' after completing it']
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22 17:14:53 -07:00
Pavel Begunkov 9faadcc8ab io_uring: fix double io_uring free
Once we created a file for current context during setup, we should not
call io_ring_ctx_wait_and_kill() directly as it'll be done by fput(file)

Cc: stable@vger.kernel.org # 5.10
Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix unused 'ret' for !CONFIG_UNIX]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-22 17:14:50 -07:00
Pavel Begunkov a528b04ea4 io_uring: fix ignoring xa_store errors
xa_store() may fail, check the result.

Cc: stable@vger.kernel.org # 5.10
Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-21 13:01:10 -07:00
Pavel Begunkov f57555eda9 io_uring: end waiting before task cancel attempts
Get rid of TASK_UNINTERRUPTIBLE and waiting with finish_wait before
going for next iteration in __io_uring_task_cancel(), because
__io_uring_files_cancel() doesn't expect that sheduling is disallowed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20 11:05:39 -07:00
Pavel Begunkov 55583d72e2 io_uring: always progress task_work on task cancel
Might happen that __io_uring_cancel_task_requests() cancels nothing but
there are task_works pending. We need to always run them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20 11:05:39 -07:00
Jens Axboe 00c18640c2 io_uring: make ctx cancel on exit targeted to actual ctx
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the
io-wq when exiting. But that's not the case if they are shared, so
cancel for the specific ctx instead.

Cc: stable@vger.kernel.org
Fixes: 24369c2e3b ("io_uring: add io-wq workqueue sharing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-20 10:45:02 -07:00
Pavel Begunkov dd20166236 io_uring: fix 0-iov read buffer select
Doing vectored buf-select read with 0 iovec passed is meaningless and
utterly broken, forbid it.

Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-19 06:26:56 -07:00
Pavel Begunkov dfea9fce29 io_uring: close a small race gap for files cancel
The purpose of io_uring_cancel_files() is to wait for all requests
matching ->files to go/be cancelled. We should first drop files of a
request in io_req_drop_files() and only then make it undiscoverable for
io_uring_cancel_files.

First drop, then delete from list. It's ok to leave req->id->files
dangling, because it's not dereferenced by cancellation code, only
compared against. It would potentially go to sleep and be awaken by
following in io_req_drop_files() wake_up().

Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-18 08:16:02 -07:00
Pavel Begunkov 89448c47b8 io_uring: limit {io|sq}poll submit locking scope
We don't need to take uring_lock for SQPOLL|IOPOLL to do
io_cqring_overflow_flush() when cq_overflow_list is empty, remove it
from the hot path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 08:40:52 -07:00
Pavel Begunkov 09e88404f4 io_uring: inline io_cqring_mark_overflow()
There is only one user of it and the name is misleading, get rid of it
by inlining. By the way make overflow_flush's return value deduction
simpler.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 08:40:52 -07:00
Pavel Begunkov e23de15fdb io_uring: consolidate CQ nr events calculation
Add a helper which calculates number of events in CQ. Handcoded version
of it in io_cqring_overflow_flush() is not the clearest thing, so it
makes it slightly more readable.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 08:40:52 -07:00
Pavel Begunkov 9cd2be519d io_uring: remove racy overflow list fast checks
list_empty_careful() is not racy only if some conditions are met, i.e.
no re-adds after del_init. io_cqring_overflow_flush() does list_move(),
so it's actually racy.

Remove those checks, we have ->cq_check_overflow for the fast path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 08:40:52 -07:00
Pavel Begunkov cda286f071 io_uring: cancel reqs shouldn't kill overflow list
io_uring_cancel_task_requests() doesn't imply that the ring is going
away, it may continue to work well after that. The problem is that it
sets ->cq_overflow_flushed effectively disabling the CQ overflow feature

Split setting cq_overflow_flushed from flush, and do the first one only
on exit. It's ok in terms of cancellations because there is a
io_uring->in_idle check in __io_cqring_fill_event().

It also fixes a race with setting ->cq_overflow_flushed in
io_uring_cancel_task_requests, whuch's is not atomic and a part of a
bitmask with other flags. Though, the only other flag that's not set
during init is drain_next, so it's not as bad for sane architectures.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Fixes: 0f2122045b ("io_uring: don't rely on weak ->files references")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 08:40:45 -07:00
Jens Axboe 4bc4a91253 io_uring: hold mmap_sem for mm->locked_vm manipulation
The kernel doesn't seem to have clear rules around this, but various
spots are using the mmap_sem to serialize access to modifying the
locked_vm count. Play it safe and lock the mm for write when accounting
or unaccounting locked memory.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-17 07:53:33 -07:00
Jens Axboe a146468d76 io_uring: break links on shutdown failure
Ensure that the return value of __sys_shutdown_sock() is used to
potentially break links to the request, if we fail.

Fixes: 36f4fa6886 ("io_uring: add support for shutdown(2)")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-16 14:56:36 -07:00
Linus Torvalds ac7ac4618c for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
 2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
 wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
 N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
 geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
 e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
 fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
 IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
 Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
 Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
 GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
 Q1Xqs6xwww==
 =zo4w
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Another series of killing more code than what is being added, again
  thanks to Christoph's relentless cleanups and tech debt tackling.

  This contains:

   - blk-iocost improvements (Baolin Wang)

   - part0 iostat fix (Jeffle Xu)

   - Disable iopoll for split bios (Jeffle Xu)

   - block tracepoint cleanups (Christoph Hellwig)

   - Merging of struct block_device and hd_struct (Christoph Hellwig)

   - Rework/cleanup of how block device sizes are updated (Christoph
     Hellwig)

   - Simplification of gendisk lookup and removal of block device
     aliasing (Christoph Hellwig)

   - Block device ioctl cleanups (Christoph Hellwig)

   - Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)

   - Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)

   - sbitmap improvements (Pavel Begunkov)

   - Hybrid polling fix (Pavel Begunkov)

   - bvec iteration improvements (Pavel Begunkov)

   - Zone revalidation fixes (Damien Le Moal)

   - blk-throttle limit fix (Yu Kuai)

   - Various little fixes"

* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
  blk-mq: fix msec comment from micro to milli seconds
  blk-mq: update arg in comment of blk_mq_map_queue
  blk-mq: add helper allocating tagset->tags
  Revert "block: Fix a lockdep complaint triggered by request queue flushing"
  nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
  blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
  block: disable iopoll for split bio
  block: Improve blk_revalidate_disk_zones() checks
  sbitmap: simplify wrap check
  sbitmap: replace CAS with atomic and
  sbitmap: remove swap_lock
  sbitmap: optimise sbitmap_deferred_clear()
  blk-mq: skip hybrid polling if iopoll doesn't spin
  blk-iocost: Factor out the base vrate change into a separate function
  blk-iocost: Factor out the active iocgs' state check into a separate function
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Fix some typos in comments
  blktrace: fix up a kerneldoc comment
  block: remove the request_queue to argument request based tracepoints
  ...
2020-12-16 12:57:51 -08:00
Linus Torvalds 48aba79bcf for-5.11/io_uring-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XeDUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnF9D/4+l1r1G5AcsSsgEvu1aCjP83LLWrHIAA5+
 ca3OY6vwOjBvqI7oOoPcYJeYJ9uuGGQc31tDFJtP6Sl6Gk31AB4iSddyrowaX+t+
 UJyJNfsgWKiLjY48EyQJ0gIqjuvPq8hPGMGClJb1A7+w87fqBC5UwCWEnJmE7MaX
 401kIw0CRVWYTnDEOYxToss6D6gQ30E8UZjdJ0cG4g8xVQBY2kKwYR3F9tDlAwsY
 CF+RCKpibcKwnaNZJBL67ClWjj1hC0ivg0O0G+W1UYysesKKdWFRI2rmxvH55K5T
 7tHlfVuVPladNmlLVNZnCvyqBrFHyAZPmOsdv3xQOvJ7pZPaxKV9xIYryQKZW4H4
 9tKkj3T1aop/fDGqIMxgymZsWW+1vvxAmM+7WkdOPHwHRSakJ5wGIj6Ekpton+5y
 aixJUFq390o/o+S8PDO7mgzdvYrasv3iLl5UxnIcU3rq30wxnRKit4vUZny8DlzF
 gOTw7QSocximhGYci+Uz4d4/XdK2CHc6eZDkQDltgJXxIrdsrN0qKxMCEsMKgCR1
 RMiDv+52MP6kp/wpXiOHQF25YRnUOW0qfEjWKK6Ye28DGuKPPuIXtN/BUD3rjdIc
 IJX3lDfOI3PgXNX24nOarucrF+ootyRmE6tGTVZhCVBhUXGR+MGatGfkeCqnmNzZ
 gny2+UrGIQ==
 =ly9V
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Fairly light set of changes this time around, and mostly some bits
  that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a
  few features. In particular:

   - Cleanups around iovec import (David Laight, Pavel)

   - Add timeout support for io_uring_enter(2), which enables us to
     clean up liburing and avoid a timeout sqe submission in the
     completion path.

     The big win here is that it allows setups that split SQ and CQ
     handling into separate threads to avoid locking, as the CQ side
     will no longer submit when timeouts are needed when waiting for
     events (Hao Xu)

   - Add support for socket shutdown, and renameat/unlinkat.

   - SQPOLL cleanups and improvements (Xiaoguang Wang)

   - Allow SQPOLL setups for CAP_SYS_NICE, and enable regular
     (non-fixed) files to be used.

   - Cancelation improvements (Pavel)

   - Fixed file reference improvements (Pavel)

   - IOPOLL related race fixes (Pavel)

   - Lots of other little fixes and cleanups (mostly Pavel)"

* tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits)
  io_uring: fix io_cqring_events()'s noflush
  io_uring: fix racy IOPOLL flush overflow
  io_uring: fix racy IOPOLL completions
  io_uring: always let io_iopoll_complete() complete polled io
  io_uring: add timeout update
  io_uring: restructure io_timeout_cancel()
  io_uring: fix files cancellation
  io_uring: use bottom half safe lock for fixed file data
  io_uring: fix miscounting ios_left
  io_uring: change submit file state invariant
  io_uring: check kthread stopped flag when sq thread is unparked
  io_uring: share fixed_file_refs b/w multiple rsrcs
  io_uring: replace inflight_wait with tctx->wait
  io_uring: don't take fs for recvmsg/sendmsg
  io_uring: only wake up sq thread while current task is in io worker context
  io_uring: don't acquire uring_lock twice
  io_uring: initialize 'timeout' properly in io_sq_thread()
  io_uring: refactor io_sq_thread() handling
  io_uring: always batch cancel in *cancel_files()
  io_uring: pass files into kill timeouts/poll
  ...
2020-12-16 12:44:05 -08:00
Linus Torvalds 005b2a9dc8 tif-task_work.arch-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/YJxsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjpyEACBdW+YjenjTbkUPeEXzQgkBkTZUYw3g007
 DPcUT1g8PQZXYXlQvBKCvGhhIr7/KVcjepKoowiNQfBNGcIPJTVopW58nzpqAfTQ
 goI2WYGn5EKFFKBPvtH04cJD/Wo8muXdxynKtqyZbnGGgZjQxPrE259b8dpHjBSR
 6L7HHkk0D1oU/5b6h6Ocpg9mc/0iIUCZylySAYY3eGO0JaVPJaXgZSJZYgHxCHll
 Lb+/y/fXdtm/0PmQ3ko0ev54g3yEWqZIX0NsZW1asrButIy+KLzQ2Mz1xFLFDMag
 prtIfwb8tzgc4dFPY090C/azjCh5CPpxqYS6FkRwS0p86n6OhkyXrqfily5Hs4/B
 NC7CBPBSH/j+NKUK7CYZcpTzTpxPjUr9p0anUdlvMJz8FhTb/3YEEZ1UTeWOeHmk
 Yo5SxnFghLeZZeZ1ok6rdymnVa7WEX12SCLGQX31BB2mld0tNbKb4b+FsBF6OUMk
 IUaX6OjwDFVRaysC88BQ4hjcIP1HxsViG4/VZDX15gjAAH2Pvb+7tev+lcDcOhjz
 TCD4GNFspTFzRhh9nT7oxQ679qCh9G9zHbzuIRewnrS6iqvo5SJQB3dR2yrWZRRH
 ySkQFiHpYOlnLJYv0jg9COlGwo2FUdcvKhCvkjQKKBz48rzW/IC0LwKdRQWZDFk3
 FKGzP/NBig==
 =cadT
 -----END PGP SIGNATURE-----

Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block

Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe:
 "This sits on top of of the core entry/exit and x86 entry branch from
  the tip tree, which contains the generic and x86 parts of this work.

  Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL.

  With that done, we can get rid of JOBCTL_TASK_WORK from task_work and
  signal.c, and also remove a deadlock work-around in io_uring around
  knowing that signal based task_work waking is invoked with the sighand
  wait queue head lock.

  The motivation for this work is to decouple signal notify based
  task_work, of which io_uring is a heavy user of, from sighand. The
  sighand lock becomes a huge contention point, particularly for
  threaded workloads where it's shared between threads. Even outside of
  threaded applications it's slower than it needs to be.

  Roman Gershman <romger@amazon.com> reported that his networked
  workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU
  after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all
  spent hammering on the sighand lock, showing 57% of the CPU time there
  [1].

  There are further cleanups possible on top of this. One example is
  TIF_PATCH_PENDING, where a patch already exists to use
  TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more
  consolidation, but the work stands on its own as well"

[1] https://github.com/axboe/liburing/issues/215

* tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits)
  io_uring: remove 'twa_signal_ok' deadlock work-around
  kernel: remove checking for TIF_NOTIFY_SIGNAL
  signal: kill JOBCTL_TASK_WORK
  io_uring: JOBCTL_TASK_WORK is no longer used by task_work
  task_work: remove legacy TWA_SIGNAL path
  sparc: add support for TIF_NOTIFY_SIGNAL
  riscv: add support for TIF_NOTIFY_SIGNAL
  nds32: add support for TIF_NOTIFY_SIGNAL
  ia64: add support for TIF_NOTIFY_SIGNAL
  h8300: add support for TIF_NOTIFY_SIGNAL
  c6x: add support for TIF_NOTIFY_SIGNAL
  alpha: add support for TIF_NOTIFY_SIGNAL
  xtensa: add support for TIF_NOTIFY_SIGNAL
  arm: add support for TIF_NOTIFY_SIGNAL
  microblaze: add support for TIF_NOTIFY_SIGNAL
  hexagon: add support for TIF_NOTIFY_SIGNAL
  csky: add support for TIF_NOTIFY_SIGNAL
  openrisc: add support for TIF_NOTIFY_SIGNAL
  sh: add support for TIF_NOTIFY_SIGNAL
  um: add support for TIF_NOTIFY_SIGNAL
  ...
2020-12-16 12:33:35 -08:00
Linus Torvalds faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Linus Torvalds d635a69dd4 Networking updates for 5.11
Core:
 
  - support "prefer busy polling" NAPI operation mode, where we defer softirq
    for some time expecting applications to periodically busy poll
 
  - AF_XDP: improve efficiency by more batching and hindering
            the adjacency cache prefetcher
 
  - af_packet: make packet_fanout.arr size configurable up to 64K
 
  - tcp: optimize TCP zero copy receive in presence of partial or unaligned
         reads making zero copy a performance win for much smaller messages
 
  - XDP: add bulk APIs for returning / freeing frames
 
  - sched: support fragmenting IP packets as they come out of conntrack
 
  - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
 
 BPF:
 
  - BPF switch from crude rlimit-based to memcg-based memory accounting
 
  - BPF type format information for kernel modules and related tracing
    enhancements
 
  - BPF implement task local storage for BPF LSM
 
  - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
 
 Protocols:
 
  - mptcp: improve multiple xmit streams support, memory accounting and
           many smaller improvements
 
  - TLS: support CHACHA20-POLY1305 cipher
 
  - seg6: add support for SRv6 End.DT4/DT6 behavior
 
  - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
 
  - ppp_generic: add ability to bridge channels directly
 
  - bridge: Connectivity Fault Management (CFM) support as is defined in
            IEEE 802.1Q section 12.14.
 
 Drivers:
 
  - mlx5: make use of the new auxiliary bus to organize the driver internals
 
  - mlx5: more accurate port TX timestamping support
 
  - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
      the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging
 
  - rtw88: major bluetooth co-existance improvements
 
  - iwlwifi: support new 6 GHz frequency band
 
  - ath11k: Fast Initial Link Setup (FILS)
 
  - mt7915: dual band concurrent (DBDC) support
 
  - net: ipa: add basic support for IPA v4.5
 
 Refactor:
 
  - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
 
  - phy: add support for shared interrupts; get rid of multiple driver
         APIs and have the drivers write a full IRQ handler, slight growth
 	of driver code should be compensated by the simpler API which
 	also allows shared IRQs
 
  - add common code for handling netdev per-cpu counters
 
  - move TX packet re-allocation from Ethernet switch tag drivers to
    a central place
 
  - improve efficiency and rename nla_strlcpy
 
  - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot
 
 Old code removal:
 
  - wan: delete the DLCI / SDLA drivers
 
  - wimax: move to staging
 
  - wifi: remove old WDS wifi bridging support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
 IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
 qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
 N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
 XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
 s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
 4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
 hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
 C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
 LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
 +htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
 Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
 =GXs1
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll

   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher

   - af_packet: make packet_fanout.arr size configurable up to 64K

   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages

   - XDP: add bulk APIs for returning / freeing frames

   - sched: support fragmenting IP packets as they come out of conntrack

   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

  BPF:

   - BPF switch from crude rlimit-based to memcg-based memory accounting

   - BPF type format information for kernel modules and related tracing
     enhancements

   - BPF implement task local storage for BPF LSM

   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage

  Protocols:

   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements

   - TLS: support CHACHA20-POLY1305 cipher

   - seg6: add support for SRv6 End.DT4/DT6 behavior

   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

   - ppp_generic: add ability to bridge channels directly

   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.

  Drivers:

   - mlx5: make use of the new auxiliary bus to organize the driver
     internals

   - mlx5: more accurate port TX timestamping support

   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging

   - rtw88: major bluetooth co-existance improvements

   - iwlwifi: support new 6 GHz frequency band

   - ath11k: Fast Initial Link Setup (FILS)

   - mt7915: dual band concurrent (DBDC) support

   - net: ipa: add basic support for IPA v4.5

  Refactor:

   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior

   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs

   - add common code for handling netdev per-cpu counters

   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place

   - improve efficiency and rename nla_strlcpy

   - number of W=1 warning cleanups as we now catch those in a patchwork
     build bot

  Old code removal:

   - wan: delete the DLCI / SDLA drivers

   - wimax: move to staging

   - wifi: remove old WDS wifi bridging support"

* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ...
2020-12-15 13:22:29 -08:00
Jakub Kicinski a6b5e026e6 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-12-14

1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

3) Support for finding BTF based kernel attach targets through libbpf's
   bpf_program__set_attach_target() API, from Andrii Nakryiko.

4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
   KP Singh, Jiri Olsa and various others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
  selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
  bpf: Permits pointers on stack for helper calls
  libbpf: Expose libbpf ring_buffer epoll_fd
  selftests/bpf: Add set_attach_target() API selftest for module target
  libbpf: Support modules in bpf_program__set_attach_target() API
  selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
  selftests/bpf: Drop the need for LLVM's llc
  selftests/bpf: fix bpf_testmod.ko recompilation logic
  samples/bpf: Fix possible hang in xdpsock with multiple threads
  selftests/bpf: Make selftest compilation work on clang 11
  selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
  selftests/bpf: Drop tcp-{client,server}.py from Makefile
  selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
  selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
  selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
  selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
  selftests/bpf: Xsk selftests framework
  bpf: Only provide bpf_sock_from_file with CONFIG_NET
  bpf: Return -ENOTSUPP when attaching to non-kernel BTF
  xsk: Validate socket state in xsk_recvmsg, prior touching socket members
  ...
====================

Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 15:34:36 -08:00
Jens Axboe 355fb9e2b7 io_uring: remove 'twa_signal_ok' deadlock work-around
The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe
to use, regardless of context, as we won't be recursing into the signal
lock. So now that all archs are using that, we can drop this deadlock
work-around as it's always safe to use TWA_SIGNAL.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 09:17:38 -07:00
Jens Axboe 792ee0f6db io_uring: JOBCTL_TASK_WORK is no longer used by task_work
Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at
this point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 09:17:38 -07:00
Eric W. Biederman 9fe83c43e7 file: Rename __close_fd_get_file close_fd_get_file
The function close_fd_get_file is explicitly a variant of
__close_fd[1].  Now that __close_fd has been renamed close_fd, rename
close_fd_get_file to be consistent with close_fd.

When __alloc_fd, __close_fd and __fd_install were introduced the
double underscore indicated that the function took a struct
files_struct parameter.  The function __close_fd_get_file never has so
the naming has always been inconsistent.  This just cleans things up
so there are not any lingering mentions or references __close_fd left
in the code.

[1] 80cd795630 ("binder: fix use-after-free due to ksys_close() during fdget()")
Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:59 -06:00
Pavel Begunkov 59850d226e io_uring: fix io_cqring_events()'s noflush
Checking !list_empty(&ctx->cq_overflow_list) around noflush in
io_cqring_events() is racy, because if it fails but a request overflowed
just after that, io_cqring_overflow_flush() still will be called.

Remove the second check, it shouldn't be a problem for performance,
because there is cq_check_overflow bit check just above.

Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:02 -07:00
Pavel Begunkov 634578f800 io_uring: fix racy IOPOLL flush overflow
It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without
hodling uring_lock, because it does synchronisation differently. Make
sure we have it.

As for io_ring_exit_work(), we don't even need it there because
io_ring_ctx_wait_and_kill() already set force flag making all overflowed
requests to be dropped.

Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 31bff9a51b io_uring: fix racy IOPOLL completions
IOPOLL allows buffer remove/provide requests, but they doesn't
synchronise by rules of IOPOLL, namely it have to hold uring_lock.

Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang dad1b1242f io_uring: always let io_iopoll_complete() complete polled io
Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring():
[   95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0
[   95.505921]
[   95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G    B
W         5.10.0-rc5+ #1
[   95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   95.508248] Call Trace:
[   95.508683]  dump_stack+0x107/0x163
[   95.509323]  ? io_commit_cqring+0x3ec/0x8e0
[   95.509982]  print_address_description.constprop.0+0x3e/0x60
[   95.510814]  ? vprintk_func+0x98/0x140
[   95.511399]  ? io_commit_cqring+0x3ec/0x8e0
[   95.512036]  ? io_commit_cqring+0x3ec/0x8e0
[   95.512733]  kasan_report_invalid_free+0x51/0x80
[   95.513431]  ? io_commit_cqring+0x3ec/0x8e0
[   95.514047]  __kasan_slab_free+0x141/0x160
[   95.514699]  kfree+0xd1/0x390
[   95.515182]  io_commit_cqring+0x3ec/0x8e0
[   95.515799]  __io_req_complete.part.0+0x64/0x90
[   95.516483]  io_wq_submit_work+0x1fa/0x260
[   95.517117]  io_worker_handle_work+0xeac/0x1c00
[   95.517828]  io_wqe_worker+0xc94/0x11a0
[   95.518438]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.519151]  ? __kthread_parkme+0x11d/0x1d0
[   95.519806]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.520512]  ? io_worker_handle_work+0x1c00/0x1c00
[   95.521211]  kthread+0x396/0x470
[   95.521727]  ? _raw_spin_unlock_irq+0x24/0x30
[   95.522380]  ? kthread_mod_delayed_work+0x180/0x180
[   95.523108]  ret_from_fork+0x22/0x30
[   95.523684]
[   95.523985] Allocated by task 4035:
[   95.524543]  kasan_save_stack+0x1b/0x40
[   95.525136]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[   95.525882]  kmem_cache_alloc_trace+0x17b/0x310
[   95.533930]  io_queue_sqe+0x225/0xcb0
[   95.534505]  io_submit_sqes+0x1768/0x25f0
[   95.535164]  __x64_sys_io_uring_enter+0x89e/0xd10
[   95.535900]  do_syscall_64+0x33/0x40
[   95.536465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   95.537199]
[   95.537505] Freed by task 4035:
[   95.538003]  kasan_save_stack+0x1b/0x40
[   95.538599]  kasan_set_track+0x1c/0x30
[   95.539177]  kasan_set_free_info+0x1b/0x30
[   95.539798]  __kasan_slab_free+0x112/0x160
[   95.540427]  kfree+0xd1/0x390
[   95.540910]  io_commit_cqring+0x3ec/0x8e0
[   95.541516]  io_iopoll_complete+0x914/0x1390
[   95.542150]  io_do_iopoll+0x580/0x700
[   95.542724]  io_iopoll_try_reap_events.part.0+0x108/0x200
[   95.543512]  io_ring_ctx_wait_and_kill+0x118/0x340
[   95.544206]  io_uring_release+0x43/0x50
[   95.544791]  __fput+0x28d/0x940
[   95.545291]  task_work_run+0xea/0x1b0
[   95.545873]  do_exit+0xb6a/0x2c60
[   95.546400]  do_group_exit+0x12a/0x320
[   95.546967]  __x64_sys_exit_group+0x3f/0x50
[   95.547605]  do_syscall_64+0x33/0x40
[   95.548155]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason is that once we got a non EAGAIN error in io_wq_submit_work(),
we'll complete req by calling io_req_complete(), which will hold completion_lock
to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't
hold completion_lock to call io_commit_cqring(), then there maybe concurrent
access to ctx->defer_list, double free may happen.

To fix this bug, we always let io_iopoll_complete() complete polled io.

Cc: <stable@vger.kernel.org> # 5.5+
Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 9c8e11b36c io_uring: add timeout update
Support timeout updates through IORING_OP_TIMEOUT_REMOVE with passed in
IORING_TIMEOUT_UPDATE. Updates doesn't support offset timeout mode.
Oirignal timeout.off will be ignored as well.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove now unused 'ret' variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov fbd15848f3 io_uring: restructure io_timeout_cancel()
Add io_timeout_extract() helper, which searches and disarms timeouts,
but doesn't complete them. No functional changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov bee749b187 io_uring: fix files cancellation
io_uring_cancel_files()'s task check condition mistakenly got flipped.

1. There can't be a request in the inflight list without
IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler.
2. Also, don't call the function for files==NULL to not do such a check,
all that staff is already handled well by its counter part,
__io_uring_cancel_task_requests().

With that just flip the task check.

Also, it iowq-cancels all request of current task there, don't forget to
set right ->files into struct io_task_cancel.

Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task")
Reported-by: syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Jens Axboe ac0648a56c io_uring: use bottom half safe lock for fixed file data
io_file_data_ref_zero() can be invoked from soft-irq from the RCU core,
hence we need to ensure that the file_data lock is bottom half safe. Use
the _bh() variants when grabbing this lock.

Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov bd5bbda72f io_uring: fix miscounting ios_left
io_req_init() doesn't decrement state->ios_left if a request doesn't
need ->file, it just returns before that on if(!needs_file). That's
not really a problem but may cause overhead for an additional fput().
Also inline and kill io_req_set_file() as it's of no use anymore.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 6e1271e60c io_uring: change submit file state invariant
Keep submit state invariant of whether there are file refs left based on
state->nr_refs instead of (state->file==NULL), and always check against
the first one. It's easier to track and allows to remove 1 if. It also
automatically leaves struct submit_state in a consistent state after
io_submit_state_end(), that's not used yet but nice.

btw rename has_refs to file_refs for more clarity.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang 65b2b21348 io_uring: check kthread stopped flag when sq thread is unparked
syzbot reports following issue:
INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
task:syz-executor.2  state:D stack:28744 pid:12399 ppid:  8504 flags:0x00004004
Call Trace:
 context_switch kernel/sched/core.c:3773 [inline]
 __schedule+0x893/0x2170 kernel/sched/core.c:4522
 schedule+0xcf/0x270 kernel/sched/core.c:4600
 schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
 do_wait_for_common kernel/sched/completion.c:85 [inline]
 __wait_for_common kernel/sched/completion.c:106 [inline]
 wait_for_common kernel/sched/completion.c:117 [inline]
 wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
 kthread_stop+0x17a/0x720 kernel/kthread.c:596
 io_put_sq_data fs/io_uring.c:7193 [inline]
 io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
 io_finish_async fs/io_uring.c:7297 [inline]
 io_sq_offload_create fs/io_uring.c:8015 [inline]
 io_uring_create fs/io_uring.c:9433 [inline]
 io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: Unable to access opcode bytes at RIP 0x45de8f.
RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
      Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Currently we don't have a reproducer yet, but seems that there is a
race in current codes:
=> io_put_sq_data
      ctx_list is empty now.       |
==> kthread_park(sqd->thread);     |
                                   | T1: sq thread is parked now.
==> kthread_stop(sqd->thread);     |
    KTHREAD_SHOULD_STOP is set now.|
===> kthread_unpark(k);            |
                                   | T2: sq thread is now unparkd, run again.
                                   |
                                   | T3: sq thread is now preempted out.
                                   |
===> wake_up_process(k);           |
                                   |
                                   | T4: Since sqd ctx_list is empty, needs_sched will be true,
                                   | then sq thread sets task state to TASK_INTERRUPTIBLE,
                                   | and schedule, now sq thread will never be waken up.
===> wait_for_completion           |

I have artificially used mdelay() to simulate above race, will get same
stack like this syzbot report, but to be honest, I'm not sure this code
race triggers syzbot report.

To fix this possible code race, when sq thread is unparked, need to check
whether sq thread has been stopped.

Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 36f72fe279 io_uring: share fixed_file_refs b/w multiple rsrcs
Double fixed files for splice/tee are done in a nasty way, it takes 2
ref_node refs, and during the second time it blindly overrides
req->fixed_file_refs hoping that it haven't changed. That works because
all that is done under iouring_lock in a single go but is error-prone.

Bind everything explicitly to a single ref_node and take only one ref,
with current ref_node ordering it's guaranteed to keep all files valid
awhile the request is inflight.

That's mainly a cleanup + preparation for generic resource handling,
but also saves pcpu_ref get/put for splice/tee with 2 fixed files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov c98de08c99 io_uring: replace inflight_wait with tctx->wait
As tasks now cancel only theirs requests, and inflight_wait is awaited
only in io_uring_cancel_files(), which should be called with ->in_idle
set, instead of keeping a separate inflight_wait use tctx->wait.

That will add some spurious wakeups but actually is safer from point of
not hanging the task.

e.g.
task1                   | IRQ
                        | *start* io_complete_rw_common(link)
                        |        link: req1 -> req2 -> req3(with files)
*cancel_files()         |
io_wq_cancel(), etc.    |
                        | put_req(link), adds to io-wq req2
schedule()              |

So, task1 will never try to cancel req2 or req3. If req2 is
long-standing (e.g. read(empty_pipe)), this may hang.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Pavel Begunkov 10cad2c40d io_uring: don't take fs for recvmsg/sendmsg
We don't even allow not plain data msg_control, which is disallowed in
__sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
there are cases that can be, e.g. IOSQE_ASYNC.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:01 -07:00
Xiaoguang Wang 2e9dbe902d io_uring: only wake up sq thread while current task is in io worker context
If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread
task context or in io worker task context. If current task context is sq
thread, we don't need to check whether should wake up sq thread.

io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory
barrier, before this patch, perf shows obvious overhead:
  Samples: 481K of event 'cycles', Event count (approx.): 299807382878
  Overhead  Comma  Shared Object     Symbol
     3.69%  :9630  [kernel.vmlinux]  [k] io_issue_sqe

With this patch, perf shows:
  Samples: 482K of event 'cycles', Event count (approx.): 299929547283
  Overhead  Comma  Shared Object     Symbol
     0.70%  :4015  [kernel.vmlinux]  [k] io_issue_sqe

It shows some obvious improvements.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang 906a3c6f9c io_uring: don't acquire uring_lock twice
Both IOPOLL and sqes handling need to acquire uring_lock, combine
them together, then we just need to acquire uring_lock once.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang a0d9205f7d io_uring: initialize 'timeout' properly in io_sq_thread()
Some static checker reports below warning:
    fs/io_uring.c:6939 io_sq_thread()
    error: uninitialized symbol 'timeout'.

This is a false positive, but let's just initialize 'timeout' to make
sure we don't trip over this.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Xiaoguang Wang 0836924634 io_uring: refactor io_sq_thread() handling
There are some issues about current io_sq_thread() implementation:
  1. The prepare_to_wait() usage in __io_sq_thread() is weird. If
multiple ctxs share one same poll thread, one ctx will put poll thread
in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't
need to change task's stat at all. I think only if all ctxs don't have
work to do, we can do it.
  2. We use round-robin strategy to make multiple ctxs share one same
poll thread, but there are various condition in __io_sq_thread(), which
seems complicated and may affect round-robin strategy.

To improve above issues, I take below actions:
  1. If multiple ctxs share one same poll thread, only if all all ctxs
don't have work to do, we can call prepare_to_wait() and schedule() to
make poll thread enter sleep state.
  2. To make round-robin strategy more straight, I simplify
__io_sq_thread() a bit, it just does io poll and sqes submit work once,
does not check various condition.
  3. For multiple ctxs share one same poll thread, we choose the biggest
sq_thread_idle among these ctxs as timeout condition, and will update
it when ctx is in or out.
  4. Not need to check EBUSY especially, if io_submit_sqes() returns
EBUSY, IORING_SQ_CQ_OVERFLOW should be set, helper in liburing should
be aware of cq overflow and enters kernel to flush work.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov f6edbabb83 io_uring: always batch cancel in *cancel_files()
Instead of iterating over each request and cancelling it individually in
io_uring_cancel_files(), try to cancel all matching requests and use
->inflight_list only to check if there anything left.

In many cases it should be faster, and we can reuse a lot of code from
task cancellation.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 6b81928d4c io_uring: pass files into kill timeouts/poll
Make io_poll_remove_all() and io_kill_timeouts() to match against files
as well. A preparation patch, effectively not used by now.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov b52fda00dd io_uring: don't iterate io_uring_cancel_files()
io_uring_cancel_files() guarantees to cancel all matching requests,
that's not necessary to do that in a loop. Move it up in the callchain
into io_uring_cancel_task_requests().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov df9923f967 io_uring: cancel only requests of current task
io_uring_cancel_files() cancels all request that match files regardless
of task. There is no real need in that, cancel only requests of the
specified task. That also handles SQPOLL case as it already changes task
to it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 08d2363464 io_uring: add a {task,files} pair matching helper
Add io_match_task() that matches both task and files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 06de5f5973 io_uring: simplify io_task_match()
If IORING_SETUP_SQPOLL is set all requests belong to the corresponding
SQPOLL task, so skip task checking in that case and always match.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 2846c481c9 io_uring: inline io_import_iovec()
Inline io_import_iovec() and leave only its former __io_import_iovec()
renamed to the original name. That makes it more obious what is reused in
io_read/write().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00
Pavel Begunkov 632546c4b5 io_uring: remove duplicated io_size from rw
io_size and iov_count in io_read() and io_write() hold the same value,
kill the last one.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-09 12:04:00 -07:00