Some messages sent by the MDS entail a session sequence number
increment, and the MDS will drop certain types of requests on the floor
when the sequence numbers don't match.
In particular, a REQUEST_CLOSE message can cross with one of the
sequence morphing messages from the MDS which can cause the client to
stall, waiting for a response that will never come.
Originally, this meant an up to 5s delay before the recurring workqueue
job kicked in and resent the request, but a recent change made it so
that the client would never resend, causing a 60s stall unmounting and
sometimes a blockisting event.
Add a new helper for incrementing the session sequence and then testing
to see whether a REQUEST_CLOSE needs to be resent, and move the handling
of CEPH_MDS_SESSION_CLOSING into that function. Change all of the
bare sequence counter increments to use the new helper.
Reorganize check_session_state with a switch statement. It should no
longer be called when the session is CLOSING, so throw a warning if it
ever is (but still handle that case sanely).
[ idryomov: whitespace, pr_err() call fixup ]
URL: https://tracker.ceph.com/issues/47563
Fixes: fa99677342 ("ceph: fix potential mdsc use-after-free crash")
Reported-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull misc vfs updates from Al Viro:
"Assorted stuff all over the place (the largest group here is
Christoph's stat cleanups)"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: remove KSTAT_QUERY_FLAGS
fs: remove vfs_stat_set_lookup_flags
fs: move vfs_fstatat out of line
fs: implement vfs_stat and vfs_lstat in terms of vfs_fstatat
fs: remove vfs_statx_fd
fs: omfs: use kmemdup() rather than kmalloc+memcpy
[PATCH] reduce boilerplate in fsid handling
fs: Remove duplicated flag O_NDELAY occurring twice in VALID_OPEN_FLAGS
selftests: mount: add nosymfollow tests
Add a "nosymfollow" mount option.
Push the allocation of the msg and the send into the caller. Rename
the function to encode_cap_msg and make it void return.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
On 32-bit systems, this shift will overflow for files larger than 4GB.
Cc: stable@vger.kernel.org
Fixes: 61f6881621 ("ceph: check caps in filemap_fault and page_mkwrite")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
PageError really only has meaning within a particular subsystem. Nothing
looks at this bit in the core kernel code, and ceph itself doesn't care
about it. Don't bother setting the PageError bit on error.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
page_mkwrite should only be called with Uptodate pages, so we should
only need to flush incompatible snap contexts.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When dirtying a page, we have to flush incompatible contexts. Move the
search for an incompatible context into a separate function, and fix up
the caller to wait and retry if there is one.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
error_string key in the metadata map of MClientSession message
is intended for humans, but unfortunately became part of the on-wire
format with the introduction of recover_session=clean mode in commit
131d7eb4fa ("ceph: auto reconnect after blacklisted").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently it calls pagevec_lookup_range_nr_tag(), but that may be
inefficient, as we might end up having to search several times as we get
down to looking for fewer pages to fill the array.
Thus spake Willy:
"I think ceph is misusing pagevec_lookup_range_nr_tag(). Let's suppose
you get a range which is AAAAbbbbAAAAbbbbAAAAbbbbbbbb(...)bbbbAAAA and
you try to fetch max_pages=13. First loop will get AAAAbbbbAAAAb and
have 8 locked_pages. The next call will get bbbAA and now
locked_pages=10. Next call gets AAb ... and now you're iterating your
way through all the 'b' one page at a time until you find that first A."
'A' here refers to pages that are eligible for writeback and 'b'
represents ones that aren't (for whatever reason).
Not capping the number of return pages may mean that we sometimes find
more pages than are needed, but the extra references will just get put
at the end.
Ceph is also the only caller of pagevec_lookup_range_nr_tag(), so this
change should allow us to eliminate that call as well. That will be done
in a follow-on patch.
Reported-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph open-codes this around some other activity and the rationale
for it isn't clear. There is no need to delay free_anon_bdev until
the end of kill_sb.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In client for each inode, it may have many opened files and may
have been pinned in more than one MDS servers. And some inodes
are idle, which have no any opened files.
This patch will show these metrics in the debugfs, likes:
item total
-----------------------------------------
opened files / total inodes 14 / 5
pinned i_caps / total inodes 7 / 5
opened inodes / total inodes 3 / 5
Will send these metrics to ceph, which will be used by the `fs top`,
later.
[ jlayton: drop unrelated hunk, count hashed inodes instead of
allocated ones ]
URL: https://tracker.ceph.com/issues/47005
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will help simplify the code.
[ jlayton: fix minor merge conflict in quota.c ]
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This special casing was added in 7ce469a53e (ceph: fix splice
read for no Fc capability case). The confirm callback for ITER_PIPE
expects that the page is Uptodate and returns an error otherwise.
A simpler workaround is just to use the Uptodate bit, which has no
meaning for anonymous pages. Rip out the special casing for ITER_PIPE
and just SetPageUptodate before we copy to the iter.
Cc: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In multi-mds, the 'caps' debugfs file will have duplicate ino,
add the 'mds' column to indicate which mds session the cap belongs to.
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since there's a return immediately after the 'break', there's no need for
this extra 'return' in the S_IFDIR case.
Signed-off-by: Luis Henriques <lhenriques@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since nautilus, MDS tracks dirfrags whose child inodes have caps in open
file table. When MDS recovers, it prefetches all of these dirfrags. This
avoids using backtrace to load inodes. But dirfrags prefetch may load
lots of useless inodes into cache, and make MDS run out of memory.
Recent MDS adds an option that disables dirfrags prefetch. When dirfrags
prefetch is disabled. Recovering MDS only prefetches corresponding dir
inodes. Including inodes' parent/d_name in cap reconnect message can
help MDS to load inodes into its cache.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases
to avoid potential data integrity issues when CephFS is re-exported
via NFS or CIFS and a fix for the bulk of W=1 compilation warnings.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl9I9UITHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi5HcB/0XKCF9pJmHM7vyMpZvALaNn97B7jgD
o4Lq0RBnIyum0UltnxvrROO1LGyurFH0hWofRykEWXy8Qt/9fZTddClOmY+6W+7P
A1MVInSxCXpNHj5y1uMuwhWkLmuNWnW1aKJXBn8tXKsmrYQM3SmbCMRx8a2GxJk9
mxl9zAtTRsih0AovRbde93i4FMpeXsyDk9EcyiJFcgnDjJpCAXgN9smu9xjPskOI
rrFQBjsrYB04FVuv5lEB/xZI/2QLM2FqzlpIRsa6udtYRsDnOtgIbLq8p0sNJnC6
QFxDaWFUJAXN/b1UWZz8yw2Y7nF7r47Hg6fDM4PBQ09PQEbIQKY+qn3R
=6+bm
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"We have an inode number handling change, prompted by s390x which is a
64-bit architecture with a 32-bit ino_t, a patch to disallow leases to
avoid potential data integrity issues when CephFS is re-exported via
NFS or CIFS and a fix for the bulk of W=1 compilation warnings"
* tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client:
ceph: don't allow setlease on cephfs
ceph: fix inode number handling on arches with 32-bit ino_t
libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
Leases don't currently work correctly on kcephfs, as they are not broken
when caps are revoked. They could eventually be implemented similarly to
how we did them in libcephfs, but for now don't allow them.
[ idryomov: no need for simple_nosetlease() in ceph_dir_fops and
ceph_snapdir_fops ]
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tuan and Ulrich mentioned that they were hitting a problem on s390x,
which has a 32-bit ino_t value, even though it's a 64-bit arch (for
historical reasons).
I think the current handling of inode numbers in the ceph driver is
wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
actually not a problem. 32-bit arches can deal with 64-bit inode numbers
just fine when userland code is compiled with LFS support (the common
case these days).
What we really want to do is just use 64-bit numbers everywhere, unless
someone has mounted with the ino32 mount option. In that case, we want
to ensure that we hash the inode number down to something that will fit
in 32 bits before presenting the value to userland.
Add new helper functions that do this, and only do the conversion before
presenting these values to userland in getattr and readdir.
The inode table hashvalue is changed to just cast the inode number to
unsigned long, as low-order bits are the most likely to vary anyway.
While it's not strictly required, we do want to put something in
inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
the size of the ino_t type.
NOTE: This is a user-visible change on 32-bit arches:
1/ inode numbers will be seen to have changed between kernel versions.
32-bit arches will see large inode numbers now instead of the hashed
ones they saw before.
2/ any really old software not built with LFS support may start failing
stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
can do about these, but hopefully the intersection of people running
such code on ceph will be very small.
The workaround for both problems is to mount with "-o ino32".
[ idryomov: changelog tweak ]
URL: https://tracker.ceph.com/issues/46828
Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Most session messages contain a feature mask, but the MDS will
routinely send a REJECT message with one that is zero-length.
Commit 0fa8263367 ("ceph: fix endianness bug when handling MDS
session feature bits") fixed the decoding of the feature mask,
but failed to account for the MDS sending a zero-length feature
mask. This causes REJECT message decoding to fail.
Skip trying to decode a feature mask if the word count is zero.
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/46823
Fixes: 0fa8263367 ("ceph: fix endianness bug when handling MDS session feature bits")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Patrick Donnelly <pdonnell@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing some tests with multiple mds, we were seeing many mds
forwarding requests between them, causing clients to resend.
If the request is a modification operation and the mode is set to
USE_AUTH_MDS, then the auth mds should be selected to handle the
request. If auth mds for frag is already set, then it should be returned
directly without further processing.
The current logic is wrong because it only returns directly if
mode is USE_AUTH_MDS, but we want to do that for all modes. If we don't,
then when the frag's mds is not equal to cap session's mds, the request
will get sent to the wrong MDS needlessly.
Drop the mode check in this condition.
Signed-off-by: Yanhu Cao <gmayyyha@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing some testing recently, I hit some page allocation failures
on mount, when creating the wb_pagevec_pool for the mount. That
requires 128k (32 contiguous pages), and after thrashing the memory
during an xfstests run, sometimes that would fail.
128k for each mount seems like a lot to hold in reserve for a rainy
day, so let's change this to a global mempool that gets allocated
when the module is plugged in.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Symlink inodes should have the security context set in their xattrs on
creation. We already set the context on creation, but we don't attach
the pagelist. The effect is that symlink inodes don't get an SELinux
context set on them at creation, so they end up unlabeled instead of
inheriting the proper context. Make it do so.
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The variable mds is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the ceph_mdsc_init() fails, it will free the mdsc already.
Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix build warnings:
fs/ceph/mdsmap.c: In function ‘ceph_mdsmap_decode’:
fs/ceph/mdsmap.c:192:7: warning: variable ‘info_cv’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:177:7: warning: variable ‘state_seq’ set but not used [-Wunused-but-set-variable]
fs/ceph/mdsmap.c:123:15: warning: variable ‘mdsmap_cv’ set but not used [-Wunused-but-set-variable]
Note that p is increased in ceph_decode_*.
Signed-off-by: Jia Yang <jiayang5@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Drop duplicated words "down" and "the" in fs/ceph/.
[ idryomov: merge into a single patch ]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Send metric flags to the MDS, indicating what metrics the client
supports. Currently that consists of cap statistics, and read, write and
metadata latencies.
URL: https://tracker.ceph.com/issues/43435
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will send the caps/read/write/metadata metrics to any available MDS
once per second, which will be the same as the userland client. It will
skip the MDS sessions which don't support the metric collection, as the
MDSs will close socket connections when they get an unknown type
message.
We can disable the metric sending via the disable_send_metrics module
parameter.
[ jlayton: fix up endianness bug in ceph_mdsc_send_metrics() ]
URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If the session is already in closed state, we should skip it.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rationale:
Reduces attack surface on kernel devs opening the links for MITM
as HTTPS traffic is much harder to manipulate.
Deterministic algorithm:
For each file:
If not .svg:
For each line:
If doesn't contain `\bxmlns\b`:
For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
If both the HTTP and HTTPS versions
return 200 OK and serve the same content:
Replace HTTP with HTTPS.
[ idryomov: Do the same for the CRUSH paper and replace
ceph.newdream.net with ceph.io. ]
Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Remove unnecassary casts in the argument to kfree.
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In aio case, if the completion comes very fast just before the
ceph_read_iter() returns to fs/aio.c, the kiocb will be freed in
the completion callback, then if ceph_read_iter() access again
we will potentially hit the use-after-free bug.
[ jlayton: initialize direct_lock early, and use it everywhere ]
URL: https://tracker.ceph.com/issues/45649
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make this loop look a bit more sane. Also optimize away the spinlock
release/reacquire if we can't get an inode reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Make sure the delayed work stopped before releasing the resources.
cancel_delayed_work_sync() will only guarantee that the work finishes
executing if the work is already in the ->worklist. That means after
the cancel_delayed_work_sync() returns, it will leave the work requeued
if it was rearmed at the end. That can lead to a use after free once the
work struct is freed.
Fix it by flushing the delayed work instead of trying to cancel it, and
ensure that the work doesn't rearm if the mdsc is stopping.
URL: https://tracker.ceph.com/issues/46293
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
...and let the errnos bubble up to the callers.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will help to reduce using the global mdsc->mutex lock in many
places.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
And remove the unsed mdsc parameter to simplify the code.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
It make no sense to check the caps when reconnecting to mds. And
for the async dirop caps, they will be put by its _cb() function,
so when releasing the requests, it will make no sense too.
URL: https://tracker.ceph.com/issues/45635
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
send_mds_reconnect takes the s_mutex while the mdsc->mutex is already
held. That inverts the locking order documented in mds_client.h. Drop
the mdsc->mutex, acquire the s_mutex and then reacquire the mdsc->mutex
to prevent a deadlock.
URL: https://tracker.ceph.com/issues/45609
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Similarly to commit 03f219041f ("ceph: check i_nlink while converting
a file handle to dentry"), this fixes another corner case with
name_to_handle_at/open_by_handle_at. The issue has been detected by
xfstest generic/467, when doing:
- name_to_handle_at("/cephfs/myfile")
- open("/cephfs/myfile")
- unlink("/cephfs/myfile")
- sync; sync;
- drop caches
- open_by_handle_at()
The call to open_by_handle_at should not fail because the file hasn't been
deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE
shall be returned only if i_nlink is 0 *and* i_count is 1.
This patch also makes sure we have LINK caps before checking i_nlink.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>