Commit graph

62890 commits

Author SHA1 Message Date
Trond Myklebust 59e356a967 NFS: Use the 64-bit server readdir cookies when possible
When we're running as a 64-bit architecture and are not running in
32-bit compatibility mode, it is better to use the 64-bit readdir
cookies that supplied by the server. Doing so improves the accuracy
of telldir()/seekdir(), particularly when the directory is changing,
for instance, when doing 'rm -rf'.

We still fall back to using the 32-bit offsets on 32-bit architectures
and when in compatibility mode.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 08:34:28 -04:00
Linus Torvalds 34d5a4b336 Fix for yet another subtle futex issue. The futex code used ihold() to
prevent inodes from vanishing, but ihold() does not guarantee inode
 persistence. Replace the inode pointer with a per boot, machine wide,
 unique inode identifier. The second commit fixes the breakage of the hash
 mechanism whihc causes a 100% performance regression.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl5uQJsTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoYp4EAC5fr/AyRaIn/AEIZUmoyK6ELUaknfH
 Z788avxDB/t5GkzC9A2dMpybYi78tzLSAEfB8jYgwbrqExapVtiqvjGZ1RIi3HoN
 f/DWLnOb2s+yYQ3BQlHu4RdKONEzCqBwKFpElGRv3JzCY8Qeh5cQBzdqzvOEFmYw
 P7DJVtJRZ2dud7AzJ+xk6KuNIKCT2F7Djmtop6nq1EVw0J/2oYOVgQu76APBj7cj
 32srLmpP4xcQiJmWLC5ksXKiZrMPnyNfwXhHFufNvJ2Re6+Wf8mmglqG/5DmA+Ns
 Sq3L7D7yXwtWQZ8Po1qnWhPDZVXQbWzHyTn4YAMJAK7yoO7mut8jgECt+A8vf4L+
 hsc41c6THfdCQQ9gmxLL+c08nZGlmvIC4/1RsihNZ3kd2o4k6Ah9xFp8lBFcpjWd
 7tuhakNqJvUOvB34t2AYqzMFbZ/FJG+QNGyIW0bTUn4YIgRPxI/zsdMxqGVBZ4oN
 0iuy1kPLGbGAnLU9thkiVMmAyaPesuiB6f+mmzobEUgGI35GrCJi6a4YaTG1sqFn
 Gl8oPzcU2n+DWbVBfJrVFHJye7oi78kCw6wpNLBCJQp8NP8doAH0Sgspglg52E/p
 G4GGLz0vGauHBC5wQ3WYiGLImWbzC1dwKdcNE7dhuTgXbhz8ChVlOSU9Fu4+pGpq
 6URL6DVTLwDZPg==
 =e2iB
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull futex fix from Thomas Gleixner:
 "Fix for yet another subtle futex issue.

  The futex code used ihold() to prevent inodes from vanishing, but
  ihold() does not guarantee inode persistence. Replace the inode
  pointer with a per boot, machine wide, unique inode identifier.

  The second commit fixes the breakage of the hash mechanism which
  causes a 100% performance regression"

* tag 'locking-urgent-2020-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Unbreak futex hashing
  futex: Fix inode life-time issue
2020-03-15 12:55:52 -07:00
Linus Torvalds b0ea262a23 NFS Client Bugfixes for Linux 5.6-rc5
Fixes:
 - Ensure the fs_context has the correct fs_type when mounting and submounting
 - Fix leaking of ctx->nfs_server.hostname
 - Add minor version to fscache key to prevent collisions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5r+QkACgkQ18tUv7Cl
 QOtq4Q/+Oo707rb3N7DrPikUARB8D7FMTs/m/+xSPNm2DSllImIXJUdckqaoZkwc
 DwBMLw+ZDvHtcNytytJQOWJNp9LGjHpZ20g0TLr2p2/JRrQyGgpc0FxTJONwA5Pp
 zU6MSgqqfMZ5nLgxpMKsqoPNzO45sS8SKi2I6yZIupLZlsZOzF8L1wL/zc6gJvv8
 71UGrSId9mEMKCrE8EQRx7etct5VPuP+pXfDGz4oaI1tdEmfmx3FoJlzZA1/Pf90
 YSHdGZb7mR3LFkFRDlnh6NFHWU+yE+b5iWCt32ifO8pCN/CyIUvBxQblx4VLA47H
 6S5nrYA96zRcQwhh9B/8yWLiqqxXo2hNl574uBJL/iDqSKSmkEBxZmCbE3aFEGa8
 ClWlF6T5z4dlcAlKWXQkn3EXBHzL5+Opev5dArMhqNkr55g4z9Opsa6sc0ZWdywf
 h/rSM8bHn9SNYkCGFHQ1MjAn6eNU0vVQ/s9DhM2xdtyfyTQOOHx5yA/KF6aGG5oQ
 3mlVEJCEfsBKyWWjHhq3e/7ezgLlKRRlauxdLgjmKy+PmtlY6mGii0eF601e9OSL
 RvK7I5/9spbcYmkyyQs0BxitrDZObyhxk31hgNrUMlN/JJrhPhKiyll8/Z7Z4sq6
 QP6T/Vfn2FORA3ZAMtMH6V/ZOeiXUZjOkBRVWArIrPwOBJn/EQY=
 =3DDn
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "These are mostly fscontext fixes, but there is also one that fixes
  collisions seen in fscache:

   - Ensure the fs_context has the correct fs_type when mounting and
     submounting

   - Fix leaking of ctx->nfs_server.hostname

   - Add minor version to fscache key to prevent collisions"

* tag 'nfs-for-5.6-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs: add minor version to nfs_server_key for fscache
  NFS: Fix leak of ctx->nfs_server.hostname
  NFS: Don't hard-code the fs_type when submounting
  NFS: Ensure the fs_context has the correct fs_type before mounting
2020-03-13 15:21:32 -07:00
Linus Torvalds 7e6d869f5f fuse fixes for 5.6-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXmpHOAAKCRDh3BK/laaZ
 PP0XAQCN52kSOBiSvr8xiQrO5YOONo4yfPDi6qIk/ltvA1yr6wEA3NWAepAL07AS
 n51hMi02+JNXuMVnxOm0z2us5/PYJw0=
 =MJC1
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fix from Miklos Szeredi:
 "Fix an Oops introduced in v5.4"

* tag 'fuse-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: fix stack use after return
2020-03-13 15:19:38 -07:00
Linus Torvalds 2af82177af overlayfs fixes for 5.6-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXmufyAAKCRDh3BK/laaZ
 POXNAQDmkgiy41nUQZ3LxtGKstsgVuzFhqBq+erinBPcF1r9mQEA/xJp4uc2Q8NO
 JKZZHyWFLtAN8gGNYTCli4vrm1LoKQc=
 =JV3K
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "Fix three bugs introduced in this cycle"

* tag 'ovl-fixes-5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix lockdep warning for async write
  ovl: fix some xino configurations
  ovl: fix lock in ovl_llseek()
2020-03-13 15:17:21 -07:00
Linus Torvalds 5007928eae io_uring-5.6-2020-03-13
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5rxtkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpv/xEACifgfgyE3a2ZM7w2VTe41IpMxOouEnUWOJ
 oVKRp+9gkynE8pUGlE1igTa7T2nQIZM+Qd0KWqknkP2iFiQaNXSqqr8U6qIz9lzV
 I6SAcj0Pa2FRzlRly5UXLKiadIHbt2OfP6PIk6sXTcMCFUXb75/WzNFVOnNnBuee
 j8F5JUw45xyLXvQnfxpYSt8LeZyGYLoOwJEZX3j+hFHl1GCqSrAY8EB5tkXFbCZi
 L9JdJYOBEvnwFF4qxWl++2bmEOywnKeFea84JqbGr9BaVrDAOjAWMairZAU82xiI
 EWdQRKkSyDzrl+TACz/ri4J87fzE8FhBpHLufSY3HCxizaayNawxItDg5CCW1ghn
 i+bEaKq6djZn1CpSU0w0CTfA1g0D1DnErBS82znC8ciV1ZflAed8oADh3/+X64j8
 HzPT1DRoDGnzp4pBwTiZcG7Jb605Mh8i1TY1p35riaUbIR4y84BVNroEUHtO5Cmh
 U09efdYifsU9XM+u0OXK+SvrHqtDb6EVSx5x37qiV1SVxZ3JSsr9/uTjnBOrjH5W
 nUjqCzQfJZYSNmvRT6aSGDzk5wON95nnv7hYE9HWER/Cw7/VwKdJmBwehIAZUaXG
 NxJ7I/mVndGKV8ghoN119XVl7t2i56Ctj2pwu/UJH7lZB/Yfu9qZ5oKpku/Kbriy
 pYqSdy8J/Q==
 =0jJw
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
 "Just a single fix here, improving the RCU callback ordering from last
  week. After a bit more perusing by Paul, he poked a hole in the
  original"

* tag 'io_uring-5.6-2020-03-13' of git://git.kernel.dk/linux-block:
  io_uring: ensure RCU callback ordering with rcu_barrier()
2020-03-13 13:00:08 -07:00
Jann Horn ddd2b85ff7 afs: Use kfree_rcu() instead of casting kfree() to rcu_callback_t
afs_put_addrlist() casts kfree() to rcu_callback_t. Apart from being wrong
in theory, this might also blow up when people start enforcing function
types via compiler instrumentation, and it means the rcu_head has to be
first in struct afs_addr_list.

Use kfree_rcu() instead, it's simpler and more correct.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-13 10:47:33 -07:00
Miklos Szeredi c853680453 ovl: fix lockdep warning for async write
Lockdep reports "WARNING: lock held when returning to user space!" due to
async write holding freeze lock over the write.  Apparently aio.c already
deals with this by lying to lockdep about the state of the lock.

Do the same here.  No need to check for S_IFREG() here since these file ops
are regular-only.

Reported-by: syzbot+9331a354f4f624a52a55@syzkaller.appspotmail.com
Fixes: 2406a307ac ("ovl: implement async IO routines")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-13 15:53:06 +01:00
Amir Goldstein 53afcd310e ovl: fix some xino configurations
Fix up two bugs in the coversion to xino_mode:
1. xino=off does not always end up in disabled mode
2. xino=auto on 32bit arch should end up in disabled mode

Take a proactive approach to disabling xino on 32bit kernel:
1. Disable XINO_AUTO config during build time
2. Disable xino with a warning on mount time

As a by product, xino=on on 32bit arch also ends up in disabled mode.
We never intended to enable xino on 32bit arch and this will make the
rest of the logic simpler.

Fixes: 0f831ec85e ("ovl: simplify ovl_same_sb() helper")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-13 15:53:06 +01:00
Linus Torvalds 807f030b44 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes for old crap in ->atomic_open() instances"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cifs_atomic_open(): fix double-put on late allocation failure
  gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
2020-03-12 15:51:26 -07:00
Al Viro d9a9f4849f cifs_atomic_open(): fix double-put on late allocation failure
several iterations of ->atomic_open() calling conventions ago, we
used to need fput() if ->atomic_open() failed at some point after
successful finish_open().  Now (since 2016) it's not needed -
struct file carries enough state to make fput() work regardless
of the point in struct file lifecycle and discarding it on
failure exits in open() got unified.  Unfortunately, I'd missed
the fact that we had an instance of ->atomic_open() (cifs one)
that used to need that fput(), as well as the stale comment in
finish_open() demanding such late failure handling.  Trivially
fixed...

Fixes: fe9ec8291f "do_last(): take fput() on error after opening to out:"
Cc: stable@kernel.org # v4.7+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12 18:25:20 -04:00
Al Viro 2103913265 gfs2_atomic_open(): fix O_EXCL|O_CREAT handling on cold dcache
with the way fs/namei.c:do_last() had been done, ->atomic_open()
instances needed to recognize the case when existing file got
found with O_EXCL|O_CREAT, either by falling back to finish_no_open()
or failing themselves.  gfs2 one didn't.

Fixes: 6d4ade986f (GFS2: Add atomic_open support)
Cc: stable@kernel.org # v3.11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-03-12 18:21:24 -04:00
Amir Goldstein 531d3040bc ovl: fix lock in ovl_llseek()
ovl_inode_lock() is interruptible. When inode_lock() in ovl_llseek()
was replaced with ovl_inode_lock(), we did not add a check for error.

Fix this by making ovl_inode_lock() uninterruptible and change the
existing call sites to use an _interruptible variant.

Reported-by: syzbot+66a9752fa927f745385e@syzkaller.appspotmail.com
Fixes: b1f9d3858f ("ovl: use ovl_inode_lock in ovl_llseek()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-03-12 16:38:10 +01:00
Linus Torvalds e6e6ec48dd fscrypt fix for v5.6-rc6
Fix a bug where if userspace is writing to encrypted files while the
 FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
 dirty inodes could be evicted, causing writes could be lost or the
 filesystem to hang due to a use-after-free.  This was encountered during
 real-world use, not just theoretical.
 
 Tested with the existing fscrypt xfstests, and with a new xfstest I
 wrote to reproduce this bug.  This fix does expose an existing bug with
 '-o lazytime' that Ted is working on fixing, but this fix is more
 critical and needed anyway regardless of the lazytime fix.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXmk8HxQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK4YiAQC1RZyH4/mZ890Or6s8SzCgJTVmiLk9
 ZTO/56XmLte6LAD+IBAExqDkkybmAF0rQ4kY1oL75f/e/nEs+50TXra9NQc=
 =s2KD
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt

Pull fscrypt fix from Eric Biggers:
 "Fix a bug where if userspace is writing to encrypted files while the
  FS_IOC_REMOVE_ENCRYPTION_KEY ioctl (introduced in v5.4) is running,
  dirty inodes could be evicted, causing writes could be lost or the
  filesystem to hang due to a use-after-free. This was encountered
  during real-world use, not just theoretical.

  Tested with the existing fscrypt xfstests, and with a new xfstest I
  wrote to reproduce this bug. This fix does expose an existing bug with
  '-o lazytime' that Ted is working on fixing, but this fix is more
  critical and needed anyway regardless of the lazytime fix"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
  fscrypt: don't evict dirty inodes after removing key
2020-03-11 13:35:34 -07:00
Jens Axboe 805b13adde io_uring: ensure RCU callback ordering with rcu_barrier()
After more careful studying, Paul informs me that we cannot rely on
ordering of RCU callbacks in the way that the the tagged commit did.
The current construct looks like this:

	void C(struct rcu_head *rhp)
	{
		do_something(rhp);
		call_rcu(&p->rh, B);
	}

	call_rcu(&p->rh, A);
	call_rcu(&p->rh, C);

and we're relying on ordering between A and B, which isn't guaranteed.
Make this explicit instead, and have a work item issue the rcu_barrier()
to ensure that A has run before we manually execute B.

While thorough testing never showed this issue, it's dependent on the
per-cpu load in terms of RCU callbacks. The updated method simplifies
the code as well, and eliminates the need to maintain an rcu_head in
the fileset data.

Fixes: c1e2148f8e ("io_uring: free fixed_file_data after RCU grace period")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-08 20:07:28 -06:00
Linus Torvalds b34e5c1332 Driver core / debugfs fixes for 5.6-rc5
Here are 4 small driver core / debugfs patches for 5.6-rc3
 
 They are:
 	- debugfs api cleanup now that all callers for
 	  debugfs_create_regset32() have been fixed up.  This was
 	  waiting until after the -rc1 merge as these fixes came in
 	  through different trees
 	- driver core sync state fixes based on reports of minor issues
 	  found in the feature
 
 All of these have been in linux-next with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXmS2Lg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ylvNgCfbnALILZh05QJPCfZv/seNFcFYLIAnRNAzxAU
 mTPqUqTp5+WMXSzGigMa
 =NyIX
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core and debugfs fixes from Greg KH:
 "Here are four small driver core / debugfs patches for 5.6-rc3:

   - debugfs api cleanup now that all debugfs_create_regset32() callers
     have been fixed up. This was waiting until after the -rc1 merge as
     these fixes came in through different trees

   - driver core sync state fixes based on reports of minor issues found
     in the feature

  All of these have been in linux-next with no reported issues"

* tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Skip unnecessary work when device doesn't have sync_state()
  driver core: Add dev_has_sync_state()
  driver core: Call sync_state() even if supplier has no consumers
  debugfs: remove return value of debugfs_create_regset32()
2020-03-08 10:39:40 -05:00
Eric Biggers 2b4eae95c7 fscrypt: don't evict dirty inodes after removing key
After FS_IOC_REMOVE_ENCRYPTION_KEY removes a key, it syncs the
filesystem and tries to get and put all inodes that were unlocked by the
key so that unused inodes get evicted via fscrypt_drop_inode().
Normally, the inodes are all clean due to the sync.

However, after the filesystem is sync'ed, userspace can modify and close
one of the files.  (Userspace is *supposed* to close the files before
removing the key.  But it doesn't always happen, and the kernel can't
assume it.)  This causes the inode to be dirtied and have i_count == 0.
Then, fscrypt_drop_inode() failed to consider this case and indicated
that the inode can be dropped, causing the write to be lost.

On f2fs, other problems such as a filesystem freeze could occur due to
the inode being freed while still on f2fs's dirty inode list.

Fix this bug by making fscrypt_drop_inode() only drop clean inodes.

I've written an xfstest which detects this bug on ext4, f2fs, and ubifs.

Fixes: b1c0ec3599 ("fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl")
Cc: <stable@vger.kernel.org> # v5.4+
Link: https://lore.kernel.org/r/20200305084138.653498-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-03-07 18:43:07 -08:00
Linus Torvalds c200376527 io_uring-5.6-2020-03-07
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5j8gkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphtKEADIid1/6xG6EO965jKjR1G3e7pnA7M6Ek01
 T0svGLMSYtPV9aRERiWDWdyCE01C0kjwWvmpiTCmWr0sm3bJYBB+NaDXkCtwa1IW
 uFPMNDpeCQijQI1sImbeP2yN2ufGY5r7Y9RCMU7+iKgcao3pFaR136y7UfBHykJ8
 Iyp/sir5FRHlEzrGyoXOe1j131BZrDGCa+cuPyAOlr75abN+TDazJAv05MGBQVfI
 wc4hOHy0+D07juXP3ZD8UptoLTXPNk+tcAIqAEIaEuPxmRxq1lOfnM506rWyp2sy
 XZrQhUblkL8nqfqXASYGQcY/DaNxhEvbzn86MaCKm4qf12uCiP0/DS3hFY/32lAt
 VX9eOYenX1zTRLQoRNwvVHT4+m+Splp7IpICFK9bSGk1jp3rbclSXmWITqSWkOgi
 C45wAAmWw4lzrbxcEDfBAns/lcwsrPwHn12WdM9ofk2I1jTDubO47c/oFEzEn0w/
 IixdKeMVnifNoytP9XFcUcotNzc/NPiPvMNgCkNm59kUHfXMXx6HHyTLO/JUzjZ9
 B/s2LkC23EksjEGC3gQiQxighyvNCsN0Wv9L7InaCjJY5IpcOoL495fnPCPfaOaW
 7c6xrkRxvHN8bSsKmESywcFjtBv23OtlTfbma7hjdByaGkW/M62qdT6DOiQcoiX/
 Ts7YOMtPdQ==
 =aukx
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Here are a few io_uring fixes that should go into this release. This
  contains:

   - Removal of (now) unused io_wq_flush() and associated flag (Pavel)

   - Fix cancelation lockup with linked timeouts (Pavel)

   - Fix for potential use-after-free when freeing percpu ref for fixed
     file sets

   - io-wq cancelation fixups (Pavel)"

* tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
  io_uring: fix lockup with timeouts
  io_uring: free fixed_file_data after RCU grace period
  io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
  io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
2020-03-07 14:20:29 -06:00
Pavel Begunkov f0e20b8943 io_uring: fix lockup with timeouts
There is a recipe to deadlock the kernel: submit a timeout sqe with a
linked_timeout (e.g.  test_single_link_timeout_ception() from liburing),
and SIGKILL the process.

Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout
isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it
during io_put_free() to cancel the linked timeout. Probably, the same
can happen with another io_kill_timeout() call site, that is
io_commit_cqring().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-07 08:35:56 -07:00
Linus Torvalds 30fe0d07fd for-5.6-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5iXe8ACgkQxWXV+ddt
 WDvWGg/+LFP+Y8Qz6xHTl3vXuGJKjCr7X/MIi69r2N0JFoCUeXyOdxeSlOuNCfhb
 HiLZzfA5TYoptsdLJAXQLy7nPKFCQcc+J19Mbt2+aebpdGqfgN+YZEGkltfKL8Ao
 xjOGu5HROFFpNTtnwa1dYOQkyVuZ8oafuJxwVJ8T28fxepRvBbi5jmy3lb1ypL3W
 NoIPBe+67g5z/W0ATFmBMF7cCbvS5gsEGWKpbbjh7r8ZHJkhUaxVU7YdxPqlXrAO
 ejZfiJUwi8rTGm0zd8A5TX/wsxSeBEXolvh91k5tatTljjzROHa028KRg2voUZIW
 C5/7X+Z2C3gzuT0o7TGLBOR6CkVhkSutDV8/QIE6hDjZ/aCMNi0mIFco1hG8jjd1
 jQfjemjj7PWuVEnZ6EuVSoHSXjZvBvX66of40YhTQEtSaJpcZU4jP26+8cXENN6+
 6WbWcQpEQbT0cp0YKWhWvAIwGMf0jmWESISeFMEaF0eQd8BtzrH1qYcs3JTmXvHC
 XmC47hoEJLhjQkAgQ4oNa5PZQzR1wEfW/4FPdqlADOR2frE1wDiKdrpN/dkAYHdQ
 edNlo9u0+bRWCP40p04i2IUX/aUAc+me9QxiZwxT3Fw0g5QBSE2035Ly4spvT8NZ
 gIvwnq1KGxmtrJSo5Lpkv4bjHYbByYMOiGJUMOTCIEdqajFI224=
 =06pr
 -----END PGP SIGNATURE-----

Merge tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One fixup for DIO when in use with the new checksums, a missed case
  where the checksum size was still assuming u32"

* tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix RAID direct I/O reads with alternate csums
2020-03-06 14:56:46 -06:00
Linus Torvalds 0b25d45803 File locking fixes for v5.6
-----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAl5igvYTHGpsYXl0b25A
 a2VybmVsLm9yZwAKCRAADmhBGVaCFdbFD/9ZP3XDY+ngnN5nsSYS4QuzudlncnZL
 ceRLD5YykNPLOAesr7DWI8EDky+IFL5w4wRHVxAbOeHpj3haySLefV9vsM/G6sm4
 CiHdikx7uls184r5WYK3jfB19UF3UIePUjTnAtxOpemjkLv58Z15nPNGGQv9lkFJ
 dJbCk1kdwaEA3LYEyXiGC/ianaxLtiqBy+C0d581OZn3ty551c8vmF0Ziz5tcuot
 aObPE3f8sYNxDuTDZcseRxvXUfMS1Qj/tMxeDDIXryX71zIsFbQ6PMPUNHGHGit/
 uoeuprDy90mLqGuEEuUfVaXjn8zEPFlW8IHy1OJ4fFNQ0X/HYa2/CFTA2BiVrpfM
 1lVYKWuMz+mCq9i8wzF/+ikQ9QVMG2cSb0i4kyuAb+RBP+PDjNTbTLjFeEIJVz6O
 yN9MUXWH5XS8liFq2F5VbITwpSJEk7vxiTGDT1zU38HXFdrxL0FRC60TKhkplLzO
 9xsj9jUBV/sD5ohwq9Ga+kcXOB/KA/9iW3TMfBApq7oWIxaEfW7rQ6A/O5tuF/hX
 q2mwrRoEx6tpCy77KFBLT89iF0gzV3xzadwWcnpDkFC7x2OkMmZPPr2nWeJS6qbN
 hPOD1fiWW/NXMXs7foQ9HZ7HdbQMDI7olnf1sjkh4pq2MKDWsJLvNB4fYwZUxhpn
 8K4B+9yfIofvpg==
 =H/ky
 -----END PGP SIGNATURE-----

Merge tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "Just a couple of late-breaking patches for the file locking code. The
  second patch (from yangerkun) fixes a rather nasty looking potential
  use-after-free that should go to stable.

  The other patch could technically wait for 5.7, but it's fairly
  innocuous so I figured we might as well take it"

* tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  locks: fix a potential use-after-free problem when wakeup a waiter
  fcntl: Distribute switch variables for initialization
2020-03-06 14:55:27 -06:00
Jens Axboe c1e2148f8e io_uring: free fixed_file_data after RCU grace period
The percpu refcount protects this structure, and we can have an atomic
switch in progress when exiting. This makes it unsafe to just free the
struct normally, and can trigger the following KASAN warning:

BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
Read of size 1 at addr ffff888181a19a30 by task swapper/0/0

CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 <IRQ>
 dump_stack+0x76/0xa0
 print_address_description.constprop.0+0x3b/0x60
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 __kasan_report.cold+0x1a/0x3d
 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0
 rcu_core+0x370/0x830
 ? percpu_ref_exit+0x50/0x50
 ? rcu_note_context_switch+0x7b0/0x7b0
 ? run_rebalance_domains+0x11d/0x140
 __do_softirq+0x10a/0x3e9
 irq_exit+0xd5/0xe0
 smp_apic_timer_interrupt+0x86/0x200
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:default_idle+0x26/0x1f0

Fix this by punting the final exit and free of the struct to RCU, then
we know that it's safe to do so. Jann suggested the approach of using a
double rcu callback to achieve this. It's important that we do a nested
call_rcu() callback, as otherwise the free could be ordered before the
atomic switch, even if the latter was already queued.

Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com
Suggested-by: Jann Horn <jannh@google.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06 10:15:21 -07:00
yangerkun 6d390e4b5d locks: fix a potential use-after-free problem when wakeup a waiter
'16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the
logic to check waiter->fl_blocker without blocked_lock_lock. And it will
trigger a UAF when we try to wakeup some waiter:

Thread 1 has create a write flock a on file, and now thread 2 try to
unlock and delete flock a, thread 3 try to add flock b on the same file.

Thread2                         Thread3
                                flock syscall(create flock b)
	                        ...flock_lock_inode_wait
				    flock_lock_inode(will insert
				    our fl_blocked_member list
				    to flock a's fl_blocked_requests)
				   sleep
flock syscall(unlock)
...flock_lock_inode_wait
    locks_delete_lock_ctx
    ...__locks_wake_up_blocks
        __locks_delete_blocks(
	b->fl_blocker = NULL)
	...
                                   break by a signal
				   locks_delete_block
				    b->fl_blocker == NULL &&
				    list_empty(&b->fl_blocked_requests)
	                            success, return directly
				 locks_free_lock b
	wake_up(&b->fl_waiter)
	trigger UAF

Fix it by remove this logic, and this patch may also fix CVE-2019-19769.

Cc: stable@vger.kernel.org
Fixes: 16306a61d3 ("fs/locks: always delete_block after waiting.")
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-06 11:54:13 -05:00
OGAWA Hirofumi bc87302a09 fat: fix uninit-memory access for partial initialized inode
When get an error in the middle of reading an inode, some fields in the
inode might be still not initialized.  And then the evict_inode path may
access those fields via iput().

To fix, this makes sure that inode fields are initialized.

Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-06 07:06:09 -06:00
Peter Zijlstra 8019ad13ef futex: Fix inode life-time issue
As reported by Jann, ihold() does not in fact guarantee inode
persistence. And instead of making it so, replace the usage of inode
pointers with a per boot, machine wide, unique inode identifier.

This sequence number is global, but shared (file backed) futexes are
rare enough that this should not become a performance issue.

Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-03-06 11:06:15 +01:00
Linus Torvalds 8b614cb8f1 five small cifs/smb3 fixes, two for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5dnvEACgkQiiy9cAdy
 T1FaWAv/XnyYfYh6H4fhtgtfNxW9xt9mkHo/AohHcf2rk2erqjVz0lHVe7SuS9C5
 EpDYnijZKa//aiIV6VzDymPaMrXQZ+oCAExAzLPmWZnLeZ65Q02K2P1F3KvURdue
 4nLjuOyzyG4YYkoBi4wKneu1Ji377m9L6BpSfM+MzPScCOl8OV/vv/nBRY1N6gIY
 Rreq5iipRaDhifsaOgiA501sUu7mvpPEHNpluCtFmY4iTHQzYqjWZ5ZGXr2xz63n
 5VV8KWWn/p3nhJGt7L/1aynws59AdEd5GNZ5FbDQHokx9n3MMnyl4QGDzUehnhlY
 Ym6n50QA5QMn9I9NLg8I2aD6z4vNIj9kZxersoHduf4UsA9CyPaucUIyV81mt683
 AZIqtz8H21fgJXOQ3nv4uNc8Yyt1SGQfFDo1EfphwLl6LaE8rx3CFEnVoNLM+jqb
 nyRB/NxLtDWVQhYM8Bg/TP7iMqknHtarfZirv48LFdXLlhb83+qpSSHy0zVy9vli
 y/0B7rEI
 =zLW4
 -----END PGP SIGNATURE-----

Merge tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Five small cifs/smb3 fixes, two for stable (one for a reconnect
  problem and the other fixes a use case when renaming an open file)"

* tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Use #define in cifs_dbg
  cifs: fix rename() by ensuring source handle opened with DELETE bit
  cifs: add missing mount option to /proc/mounts
  cifs: fix potential mismatch of UNC paths
  cifs: don't leak -EAGAIN for stat() during reconnect
2020-03-03 17:31:19 -06:00
Kees Cook 0a68ff5e2e fcntl: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

fs/fcntl.c: In function ‘send_sigio_to_task’:
fs/fcntl.c:738:20: warning: statement will never be executed [-Wswitch-unreachable]
  738 |   kernel_siginfo_t si;
      |                    ^~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-03 10:55:06 -05:00
Omar Sandoval e7a04894c7 btrfs: fix RAID direct I/O reads with alternate csums
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes
32-bit checksums. If using a larger checksum, this leads to spurious
failures when a direct I/O read crosses a stripe. This is easy
to reproduce:

  # mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd
  ...
  # mount /dev/vdc /mnt
  # cd /mnt
  # dd if=/dev/urandom of=foo bs=1M count=1 status=none
  # dd if=foo of=/dev/null bs=1M iflag=direct status=none
  dd: error reading 'foo': Input/output error
  # dmesg | tail -1
  [  135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ...

Fix it by using the actual checksum size.

Fixes: 1e25a2e3ca ("btrfs: don't assume ordered sums to be 4 bytes")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-03 15:26:08 +01:00
Pavel Begunkov 80ad894382 io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL
io_wq_flush() is buggy, during cancelation of a flush, the associated
work may be passed to the caller's (i.e. io_uring) @match callback. That
callback is expecting it to be embedded in struct io_kiocb. Cancelation
of internal work probably doesn't make a lot of sense to begin with.

As the flush helper is no longer used, just delete it and the associated
work flag.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 14:03:24 -07:00
Pavel Begunkov fc04c39bae io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the
callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may
return next work, which will be ignored and lost.

Cancel the whole link.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02 07:20:08 -07:00
Linus Torvalds e70869821a Two more bug fixes (including a regression) for 5.6
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5cMPoACgkQ8vlZVpUN
 gaNYmgf/WX4/jMSYQu2fICudCqLr5fkLqsybvYGZGei3F8BaJ90zohQAQybNznWS
 iyF0JzrOp37b/o0haz7KfDr7xVB3lAVsKu9Bglq+zL8mc9IkPmjhCXuLbknUtOUw
 j3aVdntt4d6S3szbtP4PIZxNqh+/4KJDS2soWvuNWRpYMOv2yoMClptWWQtsimAt
 3fYpxasSz0Jrhtbuf+I1oID++wOycDT3RKiko5tpLlQiFVoKBzfou+0ZdkC4+UIl
 KvcpMBm1ijdGAaN9jfb2L2KCY5UdSvmeVui3sMXtHBEpKMJl2QsClylR1wGfgBKi
 +YMEsjBONxKo3kH2DaPJaU6LEm8JuQ==
 =rszH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Two more bug fixes (including a regression) for 5.6"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
  jbd2: fix data races at struct journal_head
2020-03-01 16:35:08 -06:00
Dan Carpenter 37b0b6b8b9 ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()
If sbi->s_flex_groups_allocated is zero and the first allocation fails
then this code will crash.  The problem is that "i--" will set "i" to
-1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
is type promoted to unsigned and becomes UINT_MAX.  Since UINT_MAX
is more than zero, the condition is true so we call kvfree(new_groups[-1]).
The loop will carry on freeing invalid memory until it crashes.

Fixes: 7c990728b9 ("ext4: fix potential race between s_flex_groups online resizing and access")
Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29 17:48:08 -05:00
Qian Cai 6c5d911249 jbd2: fix data races at struct journal_head
journal_head::b_transaction and journal_head::b_next_transaction could
be accessed concurrently as noticed by KCSAN,

 LTP: starting fsync04
 /dev/zero: Can't open blockdev
 EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
 EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
 ==================================================================
 BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]

 write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
  __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
  __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
  jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
  (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
  kjournald2+0x13b/0x450 [jbd2]
  kthread+0x1cd/0x1f0
  ret_from_fork+0x27/0x50

 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
  jbd2_write_access_granted+0x1b2/0x250 [jbd2]
  jbd2_write_access_granted at fs/jbd2/transaction.c:1155
  jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
  __ext4_journal_get_write_access+0x50/0x90 [ext4]
  ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
  ext4_mb_new_blocks+0x54f/0xca0 [ext4]
  ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
  ext4_map_blocks+0x3b4/0x950 [ext4]
  _ext4_get_block+0xfc/0x270 [ext4]
  ext4_get_block+0x3b/0x50 [ext4]
  __block_write_begin_int+0x22e/0xae0
  __block_write_begin+0x39/0x50
  ext4_write_begin+0x388/0xb50 [ext4]
  generic_perform_write+0x15d/0x290
  ext4_buffered_write_iter+0x11f/0x210 [ext4]
  ext4_file_write_iter+0xce/0x9e0 [ext4]
  new_sync_write+0x29c/0x3b0
  __vfs_write+0x92/0xa0
  vfs_write+0x103/0x260
  ksys_write+0x9d/0x130
  __x64_sys_write+0x4c/0x60
  do_syscall_64+0x91/0xb05
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 5 locks held by fsync04/25724:
  #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
  #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
  #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
  #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
  #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
 irq event stamp: 1407125
 hardirqs last  enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
 softirqs last  enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
 softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

The plain reads are outside of jh->b_state_lock critical section which result
in data races. Fix them by adding pairs of READ|WRITE_ONCE().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29 13:40:02 -05:00
Linus Torvalds 74dea5d99d io_uring-5.6-2020-02-28
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5ZXkgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprZqEACOvhiprH9Q75Pp+ZwQknM3xGyJRWI3Mbj9
 ZOyTVTK0qhTeaq6rN4MLSYevXOh+L68x5WRt1YJ1UnQRE0i8+ZQyZczqLKxxl8gF
 trhbYDXjvXIWr9zvdtiL01PKKu4Vjjp6eZAomrbxCTFku0qn76fo9wDgGPRGL+Kx
 lNO/6QvCXr9EjDniEUhlQsxTad5xc4sL0cnL4s2i7RlTCYtW4WJXJMC/4Gkg69j+
 W5GBZyjJDa8Sj3pEbLjtDtA4ooE9VMaldb7ZvR62ONUVwGpftPsbN7UhVlhyhpW+
 8v4ZEf07CxB246+hj7oL0RvEW3+/nB2hym1ySMXyBzpbx4O1JOUG7hQtNgdLRbCZ
 27IOg2O36qbUKM1hUwn7Qm3XAfBPQdFpVmqE2+E9MEOKzigLzhRP6Bu5d9x9VQGh
 JDxsm3B8PRHFJVAasiYu0p7mlx/+BCLjB84UrMB3I9UCBuVfk4mtmuwZX+mcK2PR
 pV1xJlEMYKme3cz2/u6uB8p3Nq6ipE1nSVrI6AnfEvJbQ9sFL61KaG4wHKPvtb0y
 mlNgc4seSjiWcBR2/84561a4CSmlXAn9dWMIGdHFFA43mTPYGc5omTcM8FwcEDkW
 cTFGB8sFukcTNmOw62HUHYI1vPpowX6apV08lEQrScz7GiK5piTYqTFNneqEzcwZ
 3bIMisH3Gg==
 =WheR
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

 - Fix for a race with IOPOLL used with SQPOLL (Xiaoguang)

 - Only show ->fdinfo if procfs is enabled (Tobias)

 - Fix for a chain with multiple personalities in the SQEs

 - Fix for a missing free of personality idr on exit

 - Removal of the spin-for-work optimization

 - Fix for next work lookup on request completion

 - Fix for non-vec read/write result progation in case of links

 - Fix for a fileset references on switch

 - Fix for a recvmsg/sendmsg 32-bit compatability mode

* tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block:
  io_uring: fix 32-bit compatability with sendmsg/recvmsg
  io_uring: define and set show_fdinfo only if procfs is enabled
  io_uring: drop file set ref put/get on switch
  io_uring: import_single_range() returns 0/-ERROR
  io_uring: pick up link work on submit reference drop
  io-wq: ensure work->task_pid is cleared on init
  io-wq: remove spin-for-work optimization
  io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
  io_uring: fix personality idr leak
  io_uring: handle multiple personalities in link chains
2020-02-28 11:39:14 -08:00
Linus Torvalds bfeb4f9977 zonefs fixes for 5.6-rc4
Two fixes in this pull request:
 * Revert the initial decision to silently ignore IOCB_NOWAIT for
   asynchronous direct IOs to sequential zone files. Instead, return an
   error to the user to signal that the feature is not supported (from
   Christoph)
 * A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures if
   no other file system already selected this option (from Johannes).
 
 Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXljJWAAKCRDdoc3SxdoY
 dmztAP9Sj74cHVTxac+HoDKwf6DYWfjPWonT5tO4wc8q0PBDOgEAhKzHQJZNqJvd
 a0BrEf/t6RLWDgsi75cB/U6HsiGkiA0=
 =+maQ
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fixes from Damien Le Moal:
 "Two fixes in here:

   - Revert the initial decision to silently ignore IOCB_NOWAIT for
     asynchronous direct IOs to sequential zone files. Instead, return
     an error to the user to signal that the feature is not supported
     (from Christoph)

   - A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures
     if no other file system already selected this option (from
     Johannes)"

* tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: select FS_IOMAP
  zonefs: fix IOCB_NOWAIT handling
2020-02-28 08:34:47 -08:00
Jens Axboe d876836204 io_uring: fix 32-bit compatability with sendmsg/recvmsg
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise
the iovec import for these commands will not do the right thing and fail
the command with -EINVAL.

Found by running the test suite compiled as 32-bit.

Cc: stable@vger.kernel.org
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 14:17:49 -07:00
Tobias Klauser bebdb65e07 io_uring: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define and use io_uring_show_fdinfo and its helper functions if
CONFIG_PROC_FS is set.

Fixes: 87ce955b24 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27 06:56:21 -07:00
Jens Axboe dd3db2a34c io_uring: drop file set ref put/get on switch
Dan reports that he triggered a warning on ring exit doing some testing:

percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 rcu_core+0x1e4/0x4d0
 __do_softirq+0xdb/0x2f1
 irq_exit+0xa0/0xb0
 smp_apic_timer_interrupt+0x60/0x140
 apic_timer_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:default_idle+0x23/0x170
Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95

Turns out that this is due to percpu_ref_switch_to_atomic() only
grabbing a reference to the percpu refcount if it's not already in
atomic mode. io_uring drops a ref and re-gets it when switching back to
percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
bit, but that isn't reliable.

We don't actually need to juggle these refcounts between atomic and
percpu switch, we can just do them when we've switched to atomic mode.
This removes the need for FFD_F_ATOMIC, which wasn't reliable.

Fixes: 05f3fb3c53 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 10:53:33 -07:00
Jens Axboe 3a9015988b io_uring: import_single_range() returns 0/-ERROR
Unlike the other core import helpers, import_single_range() returns 0 on
success, not the length imported. This means that links that depend on
the result of non-vec based IORING_OP_{READ,WRITE} that were added for
5.5 get errored when they should not be.

Fixes: 3a6820f2bb ("io_uring: add non-vectored read/write commands")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 07:06:57 -07:00
Jens Axboe 2a44f46781 io_uring: pick up link work on submit reference drop
If work completes inline, then we should pick up a dependent link item
in __io_queue_sqe() as well. If we don't do so, we're forced to go async
with that item, which is suboptimal.

This also fixes an issue with io_put_req_find_next(), which always looks
up the next work item. That should only be done if we're dropping the
last reference to the request, to prevent multiple lookups of the same
work item.

Outside of being a fix, this also enables a good cleanup series for 5.7,
where we never have to pass 'nxt' around or into the work handlers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26 07:05:30 -07:00
Johannes Thumshirn 0dda2ddb7d zonefs: select FS_IOMAP
Zonefs makes use of iomap internally, so it should also select iomap in
Kconfig.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-26 16:58:15 +09:00
Christoph Hellwig 7c69eb84d9 zonefs: fix IOCB_NOWAIT handling
IOCB_NOWAIT can't just be ignored as it breaks applications expecting
it not to block.  Just refuse the operation as applications must handle
that (e.g. by falling back to a thread pool).

Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-26 16:57:35 +09:00
Jens Axboe 2d141dd2ca io-wq: ensure work->task_pid is cleared on init
We use ->task_pid for exit cancellation, but we need to ensure it's
cleared to zero for io_req_work_grab_env() to do the right thing. Take
a suggestion from Bart and clear the whole thing, just setting the
function passed in. This makes it more future proof as well.

Fixes: 36282881a7 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25 13:23:48 -07:00
Scott Mayhew 55dee1bc0d nfs: add minor version to nfs_server_key for fscache
An NFS client that mounts multiple exports from the same NFS
server with higher NFSv4 versions disabled (i.e. 4.2) and without
forcing a specific NFS version results in fscache index cookie
collisions and the following messages:
[  570.004348] FS-Cache: Duplicate cookie detected

Each nfs_client structure should have its own fscache index cookie,
so add the minorversion to nfs_server_key.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200145
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-25 13:53:24 -05:00
Scott Mayhew 75a9b91761 NFS: Fix leak of ctx->nfs_server.hostname
If userspace passes an nfs_mount_data struct in the data argument of
mount(2), then nfs23_parse_monolithic() or nfs4_parse_monolithic()
will allocate memory for ctx->nfs_server.hostname.  This needs to be
freed in nfs_parse_source(), which also allocates memory for
ctx->nfs_server.hostname, otherwise a leak will occur.

Reported-by: syzbot+193c375dcddb4f345091@syzkaller.appspotmail.com
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-25 13:48:21 -05:00
Scott Mayhew 1821b26a1f NFS: Don't hard-code the fs_type when submounting
Hard-coding the fstype causes "nfs4" mounts to appear as "nfs",
which breaks scripts that do "umount -at nfs4".

Reported-by: Patrick Steinhardt <ps@pks.im>
Fixes: f2aedb713c ("NFS: Add fs_context support.")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-25 13:31:19 -05:00
Jens Axboe 3030fd4cb7 io-wq: remove spin-for-work optimization
Andres reports that buffered IO seems to suck up more cycles than we
would like, and he narrowed it down to the fact that the io-wq workers
will briefly spin for more work on completion of a work item. This was
a win on the networking side, but apparently some other cases take a
hit because of it. Remove the optimization to avoid burning more CPU
than we have to for disk IO.

Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25 08:57:37 -07:00
Xiaoguang Wang bdcd3eab2a io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL
After making ext4 support iopoll method:
  let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
we found fio can easily hang in fio_ioring_getevents() with below fio
job:
    rm -f testfile; sync;
    sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
-rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
-bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.

There are two issues that results in this hang, one reason is that
when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio
does not use io_uring_enter to get completed events, it relies on
kernel io_sq_thread to poll for completed events.

Another reason is that there is a race: when io_submit_sqes() in
io_sq_thread() submits a batch of sqes, variable 'inflight' will
record the number of submitted reqs, then io_sq_thread will poll for
reqs which have been added to poll_list. But note, if some previous
reqs have been punted to io worker, these reqs will won't be in
poll_list timely. io_sq_thread() will only poll for a part of previous
submitted reqs, and then find poll_list is empty, reset variable
'inflight' to be zero. If app just waits these deferred reqs and does
not wake up io_sq_thread again, then hang happens.

For app that entirely relies on io_sq_thread to poll completed requests,
let io_iopoll_req_issued() wake up io_sq_thread properly when adding new
element to poll_list, and when io_sq_thread prepares to sleep, check
whether poll_list is empty again, if not empty, continue to poll.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25 08:40:43 -07:00
Joe Perches fb4b5f1346 cifs: Use #define in cifs_dbg
All other uses of cifs_dbg use defines so change this one.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-24 14:20:38 -06:00
Aurelien Aptel 86f740f2ae cifs: fix rename() by ensuring source handle opened with DELETE bit
To rename a file in SMB2 we open it with the DELETE access and do a
special SetInfo on it. If the handle is missing the DELETE bit the
server will fail the SetInfo with STATUS_ACCESS_DENIED.

We currently try to reuse any existing opened handle we have with
cifs_get_writable_path(). That function looks for handles with WRITE
access but doesn't check for DELETE, making rename() fail if it finds
a handle to reuse. Simple reproducer below.

To select handles with the DELETE bit, this patch adds a flag argument
to cifs_get_writable_path() and find_writable_file() and the existing
'bool fsuid_only' argument is converted to a flag.

The cifsFileInfo struct only stores the UNIX open mode but not the
original SMB access flags. Since the DELETE bit is not mapped in that
mode, this patch stores the access mask in cifs_fid on file open,
which is accessible from cifsFileInfo.

Simple reproducer:

	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <unistd.h>
	#define E(s) perror(s), exit(1)

	int main(int argc, char *argv[])
	{
		int fd, ret;
		if (argc != 3) {
			fprintf(stderr, "Usage: %s A B\n"
			"create&open A in write mode, "
			"rename A to B, close A\n", argv[0]);
			return 0;
		}

		fd = openat(AT_FDCWD, argv[1], O_WRONLY|O_CREAT|O_SYNC, 0666);
		if (fd == -1) E("openat()");

		ret = rename(argv[1], argv[2]);
		if (ret) E("rename()");

		ret = close(fd);
		if (ret) E("close()");

		return ret;
	}

$ gcc -o bugrename bugrename.c
$ ./bugrename /mnt/a /mnt/b
rename(): Permission denied

Fixes: 8de9e86c67 ("cifs: create a helper to find a writeable handle by path name")
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2020-02-24 14:20:38 -06:00