linux

mirror of https://github.com/torvalds/linux synced 2024-07-21 18:51:47 +00:00

Author	SHA1	Message	Date
Jens Axboe	e053aaf4da	io_uring: fix issue with io_write() not always undoing sb_start_write() This is actually an older issue, but we never used to hit the -EAGAIN path before having done sb_start_write(). Make sure that we always call kiocb_end_write() if we need to retry the write, so that we keep the calls to sb_start_write() etc balanced. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:32 -06:00
Stefan Roesch	4e17aaab54	io_uring: Add support for async buffered writes This enables the async buffered writes for the filesystems that support async buffered writes in io-uring. Buffered writes are enabled for blocks that are already in the page cache or can be acquired with noio. Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220616212221.2024518-12-shr@fb.com [axboe: adapt to 5.20 branch] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:32 -06:00
Stefan Roesch	66fa3cedf1	fs: Add async write file modification handling. This adds a file_modified_async() function to return -EAGAIN if the request either requires to remove privileges or needs to update the file modification time. This is required for async buffered writes, so the request gets handled in the io worker of io-uring. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-11-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:32 -06:00
Stefan Roesch	6a2aa5d85d	fs: Split off inode_needs_update_time and __file_update_time This splits off the functions inode_needs_update_time() and __file_update_time() from the function file_update_time(). This is required to support async buffered writes. No intended functional changes in this patch. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-10-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Stefan Roesch	faf99b5635	fs: add __remove_file_privs() with flags parameter This adds the function __remove_file_privs, which allows the caller to pass the kiocb flags parameter. No intended functional changes in this patch. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-9-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Stefan Roesch	8017553980	fs: add a FMODE_BUF_WASYNC flags for f_mode This introduces the flag FMODE_BUF_WASYNC. If devices support async buffered writes, this flag can be set. It also modifies the check in generic_write_checks to take async buffered writes into consideration. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-8-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Stefan Roesch	18e419f6e8	iomap: Return -EAGAIN from iomap_write_iter() If iomap_write_iter() encounters -EAGAIN, return -EAGAIN to the caller. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-7-shr@fb.com Reviewed-by: Christoph Hellwig <hch@lst.de> [axboe: make the suggested ternary edit] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Stefan Roesch	cae2de6978	iomap: Add async buffered write support This adds async buffered write support to iomap. This replaces the call to balance_dirty_pages_ratelimited() with the call to balance_dirty_pages_ratelimited_flags. This allows to specify if the write request is async or not. In addition this also moves the above function call to the beginning of the function. If the function call is at the end of the function and the decision is made to throttle writes, then there is no request that io-uring can wait on. By moving it to the beginning of the function, the write request is not issued, but returns -EAGAIN instead. io-uring will punt the request and process it in the io-worker. By moving the function call to the beginning of the function, the write throttling will happen one page later. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-6-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Stefan Roesch	9753b868fd	iomap: Add flags parameter to iomap_page_create() Add the kiocb flags parameter to the function iomap_page_create(). Depending on the value of the flags parameter it enables different gfp flags. No intended functional changes in this patch. Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20220623175157.1715274-5-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Jan Kara	fe6c9c6e3e	mm: Add balance_dirty_pages_ratelimited_flags() function This adds the helper function balance_dirty_pages_ratelimited_flags(). It adds the parameter flags to balance_dirty_pages_ratelimited(). The flags parameter is passed to balance_dirty_pages(). For async buffered writes the flag value will be BDP_ASYNC. If balance_dirty_pages() gets called for async buffered write, we don't want to wait. Instead we need to indicate to the caller that throttling is needed so that it can stop writing and offload the rest of the write to a context that can block. The new helper function is also used by balance_dirty_pages_ratelimited(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220623175157.1715274-4-shr@fb.com [axboe: fix kerneltest bot 'ret' issue] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Jan Kara	e92eebbb09	mm: Move updates of dirty_exceeded into one place Transition of wb->dirty_exceeded from 0 to 1 happens before we go to sleep in balance_dirty_pages() while transition from 1 to 0 happens when exiting from balance_dirty_pages(), possibly based on old values. This does not make a lot of sense since wb->dirty_exceeded should simply reflect whether wb is over dirty limit and so we should ratelimit entering to balance_dirty_pages() less. Move the two updates together. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220623175157.1715274-3-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Jan Kara	ea6813be07	mm: Move starting of background writeback into the main balancing loop We start background writeback if we are over background threshold after exiting the main loop in balance_dirty_pages(). This may result in basing the decision on already stale values (we may have slept for significant amount of time) and it is also inconvenient for refactoring needed for async dirty throttling. Move the check into the main waiting loop. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Stefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220623175157.1715274-2-shr@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:31 -06:00
Jens Axboe	f6b543fd03	io_uring: ensure REQ_F_ISREG is set async offload If we're offloading requests directly to io-wq because IOSQE_ASYNC was set in the sqe, we can miss hashing writes appropriately because we haven't set REQ_F_ISREG yet. This can cause a performance regression with buffered writes, as io-wq then no longer correctly serializes writes to that file. Ensure that we set the flags in io_prep_async_work(), which will cause the io-wq work item to be hashed appropriately. Fixes: `584b0180f0` ("io_uring: move read/write file prep state into actual opcode handler") Link: https://lore.kernel.org/io-uring/20220608080054.GB22428@xsang-OptiPlex-9020/ Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Jens Axboe	4f6a94d337	net: fix compat pointer in get_compat_msghdr() A previous change enabled external users to copy the data before calling __get_compat_msghdr(), but didn't modify get_compat_msghdr() or __io_compat_recvmsg_copy_hdr() to take that into account. They are both stil passing in the __user pointer rather than the copied version. Ensure we pass in the kernel struct, not the pointer to the user data. Link: https://lore.kernel.org/all/46439555-644d-08a1-7d66-16f8f9a320f0@samsung.com/ Fixes: 1a3e4e94a1b9 ("net: copy from user before calling __get_compat_msghdr") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Michal Koutný	4890422992	io_uring: Don't require reinitable percpu_ref The commit `8bb649ee1d` ("io_uring: remove ring quiesce for io_uring_register") removed the worklow relying on reinit/resurrection of the percpu_ref, hence, initialization with that requested is a relic. This is based on code review, this causes no real bug (and theoretically can't). Technically it's a revert of commit `214828962d` ("io_uring: initialize percpu refcounters using PERCU_REF_ALLOW_REINIT") but since the flag omission is now justified, I'm not making this a revert. Fixes: `8bb649ee1d` ("io_uring: remove ring quiesce for io_uring_register") Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Dylan Yudaken	9b0fc3c054	io_uring: fix types in io_recvmsg_multishot_overflow io_recvmsg_multishot_overflow had incorrect types on non x64 system. But also it had an unnecessary INT_MAX check, which could just be done by changing the type of the accumulator to int (also simplifying the casts). Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: a8b38c4ce724 ("io_uring: support multishot in recvmsg") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220715130252.610639-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Uros Bizjak	4ccc6db090	io_uring: Use atomic_long_try_cmpxchg in __io_account_mem Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg (ptr, old, new) == old in __io_account_mem. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, atomic_long_try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails, enabling further code simplifications. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Dylan Yudaken	9bb66906f2	io_uring: support multishot in recvmsg Similar to multishot recv, this will require provided buffers to be used. However recvmsg is much more complex than recv as it has multiple outputs. Specifically flags, name, and control messages. Support this by introducing a new struct io_uring_recvmsg_out with 4 fields. namelen, controllen and flags match the similar out fields in msghdr from standard recvmsg(2), payloadlen is the length of the payload following the header. This struct is placed at the start of the returned buffer. Based on what the user specifies in struct msghdr, the next bytes of the buffer will be name (the next msg_namelen bytes), and then control (the next msg_controllen bytes). The payload will come at the end. The return value in the CQE is the total used size of the provided buffer. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220714110258.1336200-4-dylany@fb.com [axboe: style fixups, see link] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:18 -06:00
Dylan Yudaken	72c531f8ef	net: copy from user before calling __get_compat_msghdr this is in preparation for multishot receive from io_uring, where it needs to have access to the original struct user_msghdr. functionally this should be a no-op. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220714110258.1336200-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	7fa875b8e5	net: copy from user before calling __copy_msghdr this is in preparation for multishot receive from io_uring, where it needs to have access to the original struct user_msghdr. functionally this should be a no-op. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220714110258.1336200-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	6d2f75a0cf	io_uring: support 0 length iov in buffer select in compat Match up work done in "io_uring: allow iov_len = 0 for recvmsg and buffer select", but for compat code path. Fixes: a68caad69ce5 ("io_uring: allow iov_len = 0 for recvmsg and buffer select") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220708181838.1495428-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	e2df2ccb75	io_uring: fix multishot ending when not polled If multishot is not actually polling then return IOU_OK rather than the result. If the result was > 0 this will confuse things further up the callstack which expect a return <= 0. Fixes: 1300ebb20286 ("io_uring: multishot recv") Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220708181838.1495428-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Jens Axboe	43e0bbbd0b	io_uring: add netmsg cache For recvmsg/sendmsg, if they don't complete inline, we currently need to allocate a struct io_async_msghdr for each request. This is a somewhat large struct. Hook up sendmsg/recvmsg to use the io_alloc_cache. This reduces the alloc + free overhead considerably, yielding 4-5% of extra performance running netbench. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Jens Axboe	9731bc9855	io_uring: impose max limit on apoll cache Caches like this tend to grow to the peak size, and then never get any smaller. Impose a max limit on the size, to prevent it from growing too big. A somewhat randomly chosen 512 is the max size we'll allow the cache to get. If a batch of frees come in and would bring it over that, we simply start kfree'ing the surplus. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Jens Axboe	9b797a37c4	io_uring: add abstraction around apoll cache In preparation for adding limits, and one more user, abstract out the core bits of the allocation+free cache. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Jens Axboe	9da7471ed1	io_uring: move apoll cache to poll.c This is where it's used, move the flush handler in there. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Pavel Begunkov	e8375e43ca	io_uring: consolidate hash_locked io-wq handling Don't duplicate code disabling REQ_F_HASH_LOCKED for IO_URING_F_UNLOCKED (i.e. io-wq), move the handling into __io_arm_poll_handler(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ff0ffdfaa65b3d536131535c3dad3c63d9b7bb0.1657203020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Pavel Begunkov	b21a51e26e	io_uring: clear REQ_F_HASH_LOCKED on hash removal Instead of clearing REQ_F_HASH_LOCKED while arming a poll, unset the bit when we're removing the entry from the table in io_poll_tw_hash_eject(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/02e48bb88d6f1480c94ac2924c43ad1fbd48e92a.1657203020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Pavel Begunkov	ceff501790	io_uring: don't race double poll setting REQ_F_ASYNC_DATA Just as with io_poll_double_prepare() setting REQ_F_DOUBLE_POLL, we can race with the first poll entry when setting REQ_F_ASYNC_DATA. Move it under io_poll_double_prepare(). Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/df6920f509c11115aa2bce8b34dc5fdb0eb98920.1657203020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Pavel Begunkov	7a121ced6e	io_uring: don't miss setting REQ_F_DOUBLE_POLL When adding a second poll entry we should set REQ_F_DOUBLE_POLL unconditionally. We might race with the first entry removal but that doesn't change the rule. Fixes: a18427bb2d9b ("io_uring: optimise submission side poll_refs") Reported-and-tested-by: syzbot+49950ba66096b1f0209b@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8b680d83ded07424db83e8745585e7a6d72826ef.1657203020.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	cf0dd9527e	io_uring: disable multishot recvmsg recvmsg has semantics that do not make it trivial to extend to multishot. Specifically it has user pointers and returns data in the original parameter. In order to make this API useful these will need to be somehow included with the provided buffers. For now remove multishot for recvmsg as it is not useful. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220704140106.200167-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	e0486f3f7c	io_uring: only trace one of complete or overflow In overflow we see a duplcate line in the trace, and in some cases 3 lines (if initial io_post_aux_cqe fails). Instead just trace once for each CQE Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-13-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	9b26e811e9	io_uring: fix io_uring_cqe_overflow trace format Make the trace format consistent with io_uring_complete for cflags Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-12-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	b3fdea6ecb	io_uring: multishot recv Support multishot receive for io_uring. Typical server applications will run a loop where for each recv CQE it requeues another recv/recvmsg. This can be simplified by using the existing multishot functionality combined with io_uring's provided buffers. The API is to add the IORING_RECV_MULTISHOT flag to the SQE. CQEs will then be posted (with IORING_CQE_F_MORE flag set) when data is available and is read. Once an error occurs or the socket ends, the multishot will be removed and a completion without IORING_CQE_F_MORE will be posted. The benefit to this is that the recv is much more performant. * Subsequent receives are queued up straight away without requiring the application to finish a processing loop. * If there are more data in the socket (sat the provided buffer size is smaller than the socket buffer) then the data is immediately returned, improving batching. * Poll is only armed once and reused, saving CPU cycles Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-11-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	cbd2574854	io_uring: fix multishot accept ordering Similar to multishot poll, drop multishot accept when CQE overflow occurs. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-10-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	a2da676376	io_uring: fix multishot poll on overflow On overflow, multishot poll can still complete with the IORING_CQE_F_MORE flag set. If in the meantime the user clears a CQE and a the poll was cancelled then the poll will post a CQE without the IORING_CQE_F_MORE (and likely result -ECANCELED). However when processing the application will encounter the non-overflow CQE which indicates that there will be no more events posted. Typical userspace applications would free memory associated with the poll in this case. It will then subsequently receive the earlier CQE which has overflowed, which breaks the contract given by the IORING_CQE_F_MORE flag. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-9-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	52120f0fad	io_uring: add allow_overflow to io_post_aux_cqe Some use cases of io_post_aux_cqe would not want to overflow as is, but might want to change the flags/result. For example multishot receive requires in order CQE, and so if there is an overflow it would need to stop receiving until the overflow is taken care of. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-8-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	114eccdf0e	io_uring: add IOU_STOP_MULTISHOT return code For multishot we want a way to signal the caller that multishot has ended but also this might not be an error return. For example sockets return 0 when closed, which should end a multishot recv, but still have a CQE with result 0 Introduce IOU_STOP_MULTISHOT which does this and indicates that the return code is stored inside req->cqe Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-7-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:17 -06:00
Dylan Yudaken	2ba69707d9	io_uring: clean up io_poll_check_events return values The values returned are a bit confusing, where 0 and 1 have implied meaning, so add some definitions for them. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-6-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Dylan Yudaken	d4e097dae2	io_uring: recycle buffers on error Rather than passing an error back to the user with a buffer attached, recycle the buffer immediately. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-5-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Dylan Yudaken	5702196e7d	io_uring: allow iov_len = 0 for recvmsg and buffer select When using BUFFER_SELECT there is no technical requirement that the user actually provides iov, and this removes one copy_from_user call. So allow iov_len to be 0. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-4-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Dylan Yudaken	32f3c434b1	io_uring: restore bgid in io_put_kbuf Attempt to restore bgid. This is needed when recycling unused buffers as the next time around it will want the correct bgid. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-3-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Dylan Yudaken	b8c015598c	io_uring: allow 0 length for buffer select If user gives 0 for length, we can set it from the available buffer size. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220630091231.1456789-2-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Pavel Begunkov	6e73dffbb9	io_uring: let to set a range for file slot allocation From recently io_uring provides an option to allocate a file index for operation registering fixed files. However, it's utterly unusable with mixed approaches when for a part of files the userspace knows better where to place it, as it may race and users don't have any sane way to pick a slot and hoping it will not be taken. Let the userspace to register a range of fixed file slots in which the auto-allocation happens. The use case is splittting the fixed table in two parts, where on of them is used for auto-allocation and another for slot-specified operations. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/66ab0394e436f38437cf7c44676e1920d09687ad.1656154403.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Jens Axboe	e6130eba8a	io_uring: add support for passing fixed file descriptors With IORING_OP_MSG_RING, one ring can send a message to another ring. Extend that support to also allow sending a fixed file descriptor to that ring, enabling one ring to pass a registered descriptor to another one. Arguments are extended to pass in: sqe->addr3 fixed file slot in source ring sqe->file_index fixed file slot in destination ring IORING_OP_MSG_RING is extended to take a command argument in sqe->addr. If set to zero (or IORING_MSG_DATA), it sends just a message like before. If set to IORING_MSG_SEND_FD, a fixed file descriptor is sent according to the above arguments. Two common use cases for this are: 1) Server needs to be shutdown or restarted, pass file descriptors to another onei 2) Backend is split, and one accepts connections, while others then get the fd passed and handle the actual connection. Both of those are classic SCM_RIGHTS use cases, and it's not possible to support them with direct descriptors today. By default, this will post a CQE to the target ring, similarly to how IORING_MSG_DATA does it. If IORING_MSG_RING_CQE_SKIP is set, no message is posted to the target ring. The issuer is expected to notify the receiver side separately. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Jens Axboe	f110ed8498	io_uring: split out fixed file installation and removal Put it with the filetable code, which is where it belongs. While doing so, have the helpers take a ctx rather than an io_kiocb. It doesn't make sense to use a request, as it's not an operation on the request itself. It applies to the ring itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Gustavo A. R. Silva	8fcf4c48f4	io_uring: replace zero-length array with flexible-array member There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Pavel Begunkov	fbb8bb0291	io_uring: remove ctx->refs pinning on enter io_uring_enter() takes ctx->refs, which was previously preventing racing with register quiesce. However, as register now doesn't touch the refs, we can freely kill extra ctx pinning and rely on the fact that we're holding a file reference preventing the ring from being destroyed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Pavel Begunkov	3273c4407a	io_uring: don't check file ops of registered rings Registered rings are per definitions io_uring files, so we don't need to additionally verify them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00
Pavel Begunkov	ad8b261d83	io_uring: remove extra TIF_NOTIFY_SIGNAL check io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so no need to have an second check in io_run_task_work_sig(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2022-07-24 18:39:16 -06:00

1 2 3 4 5 ...

1107116 commits