linux/fs
Paolo Abeni 58c9b016e1 epoll: use refcount to reduce ep_mutex contention
We are observing huge contention on the epmutex during an http
connection/rate test:

 83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
[...]
           |--66.96%--__fput
                      |--60.04%--eventpoll_release_file
                                 |--58.41%--__mutex_lock.isra.6
                                           |--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.

To reduce the contention this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

The eventpoll struct is released by whoever - among EP file close() and
and the monitored file close() drops its last reference.

Additionally, this introduces a new 'dying' flag to prevent races between
the EP file close() and the monitored file close().
ep_eventpoll_release() marks, under f_lock spinlock, each epitem as dying
before removing it, while EP file close() does not touch dying epitems.

The above is needed as both close operations could run concurrently and
drop the EP reference acquired via the epitem entry. Without the above
flag, the monitored file close() could reach the EP struct via the epitem
list while the epitem is still listed and then try to put it after its
disposal.

An alternative could be avoiding touching the references acquired via
the epitems at EP file close() time, but that could leave the EP struct
alive for potentially unlimited time after EP file close(), with nasty
side effects.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connections/rate
grows by ~60%.

To make the change more readable this additionally renames ep_free() to
ep_clear_and_put(), and moves the actual memory cleanup in a separate
ep_free() helper.

Link: https://lkml.kernel.org/r/4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Xiumei Mu <xmu@redhiat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-08 13:45:38 -07:00
..
9p 9p patches for 6.3 merge window (part 1) 2023-03-01 08:52:49 -08:00
adfs
affs for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
afs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
autofs
befs
bfs
btrfs for-6.3-rc4-tag 2023-04-02 10:57:12 -07:00
cachefiles
ceph Two small fixes from Xiubo and myself, marked for stable. 2023-03-02 10:48:30 -08:00
cifs cifs: get rid of dead check in smb2_reconnect() 2023-03-30 17:56:30 -05:00
coda hardening updates for v6.3-rc1 2023-02-21 11:07:23 -08:00
configfs
cramfs fs/cramfs/inode.c: initialize file_ra_state 2023-03-02 21:54:23 -08:00
crypto fscrypt: check for NULL keyring in fscrypt_put_master_key_activeref() 2023-03-18 21:08:03 -07:00
debugfs ARM: SoC drivers for 6.3 2023-02-27 10:04:49 -08:00
devpts
dlm Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ecryptfs This update includes the following changes: 2023-02-21 18:10:50 -08:00
efivarfs A healthy mix of EFI contributions this time: 2023-02-23 14:41:48 -08:00
efs
erofs erofs: use wrapper i_blocksize() in erofs_file_read_iter() 2023-03-09 23:36:04 +08:00
exfat Description for this pull request: 2023-03-01 08:42:27 -08:00
exportfs
ext2 for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ext4 ext4: fix possible double unlock when moving a directory 2023-03-17 21:53:52 -04:00
f2fs f2fs-for-6.3-rc1 2023-02-27 16:18:51 -08:00
fat There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
freevxfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
fscache
fuse fuse update for 6.3 2023-02-27 09:53:58 -08:00
gfs2 Reinstate "GFS2: free disk inode which is deleted by remote node -V2" 2023-03-23 19:37:56 +01:00
hfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
hfsplus fs: hfsplus: fix UAF issue in hfsplus_put_super 2023-03-02 21:54:23 -08:00
hostfs This pull request contains the following changes for UML: 2023-03-01 09:13:00 -08:00
hpfs
hugetlbfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
iomap - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
isofs
jbd2 Bug fixes and regressions for ext4, the most serious of which is a 2023-03-12 08:55:55 -07:00
jffs2 This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
jfs Just one simple sanity check 2023-03-01 08:47:19 -08:00
kernfs Driver core changes for 6.3-rc1 2023-02-24 12:58:55 -08:00
ksmbd ksmbd: fix slab-out-of-bounds in init_smb2_rsp_hdr 2023-04-02 23:08:56 -05:00
lockd lockd: set file_lock start and end when decoding nlm4 testargs 2023-03-14 14:00:55 -04:00
minix Merge branch 'work.minix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:01:15 -08:00
netfs iov: Fix netfs_extract_user_to_sg() 2023-03-01 18:18:25 -06:00
nfs nfs: remove empty if statement from nfs3_prepare_get_acl 2023-04-08 13:45:36 -07:00
nfs_common
nfsd nfsd-6.3 fixes: 2023-04-04 11:20:55 -07:00
nilfs2 nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy() 2023-03-23 17:18:32 -07:00
nls
notify RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ntfs There is no particular theme here - mainly quick hits all over the tree. 2023-02-23 17:55:40 -08:00
ntfs3 - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
ocfs2 ocfs2: fix data corruption after failed write 2023-03-07 17:04:55 -08:00
omfs
openpromfs
orangefs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
overlayfs fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
proc ELF: fix all "Elf" typos 2023-04-08 13:45:37 -07:00
pstore
qnx4
qnx6
quota RCU pull request for v6.3 2023-02-21 10:45:51 -08:00
ramfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
reiserfs - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
romfs
smbfs_common smb3: Replace smb2pdu 1-element arrays with flex-arrays 2023-02-20 17:25:43 -06:00
squashfs revert "squashfs: harden sanity check in squashfs_read_xattr_id_table" 2023-02-03 17:52:25 -08:00
sysfs
sysv Merge branch 'work.sysv' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2023-02-24 19:03:26 -08:00
tracefs
ubifs This pull request contains updates for JFFS2, UBI and UBIFS 2023-03-01 09:06:51 -08:00
udf udf: Warn if block mapping is done for in-ICB files 2023-03-06 16:38:25 +01:00
ufs
unicode
vboxsf
verity fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY 2023-03-15 22:50:41 -07:00
xfs xfs: fix mismerged tracepoints 2023-03-24 13:16:01 -07:00
zonefs zonefs: Do not propagate iomap_dio_rw() ENOTBLK error to user space 2023-03-30 20:56:02 +09:00
aio.c Merge branch 'mm-hotfixes-stable' into mm-stable 2023-02-10 15:34:48 -08:00
anon_inodes.c
attr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
bad_inode.c
binfmt_elf.c ELF: fix all "Elf" typos 2023-04-08 13:45:37 -07:00
binfmt_elf_fdpic.c ELF: fix all "Elf" typos 2023-04-08 13:45:37 -07:00
binfmt_elf_test.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
buffer.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
char_dev.c
compat_binfmt_elf.c
coredump.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
d_path.c
dax.c fsdax: dax_unshare_iter() should return a valid length 2023-02-03 17:52:24 -08:00
dcache.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c epoll: use refcount to reduce ep_mutex contention 2023-04-08 13:45:38 -07:00
exec.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
fcntl.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
fhandle.c
file.c fs: prevent out-of-bounds array speculation when closing a file descriptor 2023-03-09 22:46:21 -05:00
file_table.c
filesystems.c
fs-writeback.c mm: convert mem_cgroup_css_from_page() to mem_cgroup_css_from_folio() 2023-02-02 22:33:19 -08:00
fs_context.c
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fsopen.c
init.c
inode.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
internal.h for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
ioctl.c
Kconfig
Kconfig.binfmt
kernel_read_file.c
libfs.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
locks.c filelocks: use mount idmapping for setlease permission check 2023-03-09 22:36:12 +01:00
Makefile for-6.3/dio-2023-02-16 2023-02-20 14:10:36 -08:00
mbcache.c
mnt_idmapping.c
mount.h
mpage.c - Daniel Verkamp has contributed a memfd series ("mm/memfd: add 2023-02-23 17:09:35 -08:00
namei.c NFSD 6.3 Release Notes 2023-02-22 14:21:40 -08:00
namespace.c fs: drop peer group ids under namespace lock 2023-03-31 12:13:37 +02:00
no-block.c
nsfs.c
open.c vfs: avoid duplicating creds in faccessat if possible 2023-02-27 16:39:19 -08:00
pipe.c
pnode.c
pnode.h
posix_acl.c fs.acl.v6.3 2023-02-20 12:14:33 -08:00
proc_namespace.c
read_write.c
readdir.c
remap_range.c
select.c
seq_file.c
signalfd.c
splice.c splice: Remove redundant assignment to ret 2023-03-09 10:10:31 +01:00
stack.c
stat.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
statfs.c
super.c fscrypt: destroy keyring after security_sb_delete() 2023-03-14 10:30:30 -07:00
sync.c
sysctls.c
timerfd.c
userfaultfd.c mm: replace vma->vm_flags direct modifications with modifier calls 2023-02-09 16:51:39 -08:00
utimes.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00
xattr.c fs.idmapped.v6.3 2023-02-20 11:53:11 -08:00