For NFSv4.1/4.2, an atomic upgrade of a delegation from a
read delegation to a write delegation is allowed and can
result in signoficantly improved performance.
This patch adds support for this atomic upgrade, plus fixes
a couple of other delegation related bugs. Since there were
three cases where delegations were being issued, the patch
factors this out into a separate function called
nfsrv_issuedelegations().
This patch should only affect the NFSv4.1/4.2 behaviour
when delegations are enabled, which is not the default.
MFC after: 1 month
The PR reported hangs that were avoided when this commit was
reverted. Since it was only a cleanup, revert it.
The LORs in the PR need further investigation, since I think
readahead only hides the problem.
PR: 279138
This reverts commit fbe965591f.
Whenever file is created, the vnode_create_vobject() function will
try to determine its size by calling vn_getsize_locked() as size 0
is ambigious: it means either the file size is 0 or the file size
is unknown.
Introduce special value for the size argument: VNODE_NO_SIZE.
Only when it is given, the vnode_create_vobject() will try to obtain
file's size on its own.
Introduce dedicated vnode_disk_create_vobject() for use by
g_vfs_open(), so we don't have to call vn_isdisk() in the common case
(for regular files).
Handle the case of mediasize==0 in g_vfs_open().
Reviewed by: alc, kib, markj, olce
Approved by: oshogbo (mentor), allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D45244
Which allows tmpfs_pager_writecount_recalc() to reliably detect
reclaimed vnode and make its accesses to object->un_pager.swp.private
(== vp) safe against reclaim. Note that vnode instantiation already
assigns v_object under the object lock.
Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D45119
This is mostly to reduce the diff with CheriBSD which adds additional
constants to enum uio_rw, but also matches the normal style used for
uio_segflg.
Reviewed by: kib, emaste
Obtained from: CheriBSD
Differential Revision: https://reviews.freebsd.org/D45142
For a very long time, the NFS client has done readahead for
directory blocks. Unlike data blocks, the readahead cannot
begin until the Readdir RPC reply has been received, since
the directory offset cookie in that Readdir RPC reply is needed.
As such, the readahead is serialized and does not seem to
provide any real benefit.
Recent testing/benchmarking shows that removing this
readahead code for Readdir does not have a negative impact
on performance.
Therefore, this patch deletes the readahead code for Readdir,
which simplifies the code and may make future changes simpler.
MFC after: 1 month
RFC8881 specifies that, when a Link operation occurs on an
NFSv4, that file delegations issued to other clients must
be recalled. Discovered during a recent discussion on nfsv4@ietf.org.
Although I have not observed a problem caused by not doing
the required delegation recall, it is definitely required
by the RFC, so this patch makes the server do the recall.
Tested during a recent NFSv4 IETF Bakeathon event.
MFC after: 1 week
There are a few places in which unionfs_rename() accesses fvp's private
data without holding the necessary lock/interlock. Moreover, the
implementation completely fails to handle the case in which fdvp is not
the same as tdvp; in this case it simply fails to lock fdvp at all.
Finally, it locks fvp while potentially already holding tvp's lock, but
makes no attempt to deal with possible LOR there.
Fix this by optimistically using the vnode interlock to protect
the short accesses to fdvp and fvp private data, sequentially.
If a file copy or shadow directory creation is required to prepare
the upper FS for the rename operation, the interlock must be dropped
and fdvp/fvp locked as necessary.
Additionally, use ERELOOKUP (as suggested by kib@) to simplify the
locking logic and eliminate unionfs_relookup() calls for file-copy/
shadow-directory cases that require tdvp's lock to be dropped.
Reviewed by: kib (earlier version), olce
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D44788
There is only one place in the unpatched sources where B_DIRECT is
set in the NFS client and this code is never executed. As such, this patch
removes this code that is never executed, since B_DIRECT should never
be set.
During a IETF testing event this week, I saw a crash in ncl_doio_directwrite(),
but this function is only called if B_DIRECT is set.
I cannot explain how ncl_doio_directwrite() got called, but once this patch
was applied to the sources, the crash did not recur. This is not surprising,
since this patch deleted the function.
Reviewed by: kib, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D44980
When an initial attempt to close an NFSv4 lock returns NFSERR_DELAY,
the open structure is put on a list for delayed closing. When this
is done, the nfso_own field is set to NULL, so it cannot be used by
nfsrpc_doclose().
Without this patch, the NFSv4 client can crash when a NFSv4 server
replies NFSERR_DELAY to a Close operation. Fortunately, most extant
NFSv4 servers do not do this. This patch avoids the crash for any
that do return NFSERR_DELAY for Close.
Found during a IETF bakeathon testing event this week.
MFC after: 5 days
Commit 196787f79e erroneously assumed that the client code for
Open/Claim_deleg_cur_FH was broken, but it was not.
It was actually wireshark that was broken and indicated
that the correct XDR was bogus.
This reverts the part of 196787f79e that changed the arguments for
Open/Claim_deleg_cur_FH.
Found during the IETF bakeathon testing event this week.
MFC after: 3 days
This reverts commit f300335d9a.
It turns out that the old code was correct and it was wireshark
that was broken and indicated that the RPC's XDR was bogus.
Found during IETF bakeathon testing this week.
While here remove an old comment regarding preallocation; it appears to
refer to an optimization that is almost certainly irrelevant at this
point.
No functional change intended.
MFC after: 1 week
This setting causes the NFS server to check that all RPCs are sent from
a privileged (<= 1023) port, rejecting those that are not. This
slightly raises the bar for a user with network access to an
unauthenticated NFS server to access exported NFS filesystems.
Users that use traditional NFS clients (e.g., those provided by FreeBSD
or Linux) should not see any difference, assuming that unprivileged
filesystem mounting is disallowed.
Note that the setting is per-VNET, so may be overridden in VNET jails
without affecting the rest of the system.
Discussed with: freebsd-arch@
Reviewed by: rmacklem, bz, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D44906
If access from unreserved ports is disabled, then a remote host can
cause an NFS server to log a message by sending a packet. This is
useful for diagnosing problems but bad for resiliency in the case where
the server is being spammed with a large number of rejected requests.
Limit prints to once per second (racily).
Reviewed by: rmacklem, emaste
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44819
Add sys/errno.h, sys/malloc.h, sys/queue.h, and vm/uma.h as needed.
sys/sysproto.h currently includes sys/acl.h which currently includes
sys/param.h, sys/queue.h, and vm/uma.h which in turn bring in
sys/errno.h sys/malloc.h.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44465
The author reported that this patch was needed to avoid
crashes on a fairly busy RISC-V system. The author did not
provide details w.r.t. the crashes. Although I
have not seen any such crash, the patch looks reasonable
and I have not found any regressions when testing it.
Since "rdirplus" is not a default option, the patch is
only needed if you are doing NFS mounts with the "rdirplus"
mount option and seeing crashes related to the name cache.
MFC after: 1 week
There are a few spots in which unionfs_lookup() accesses unionfs vnode
private data without holding the corresponding vnode lock or interlock.
Reviewed by: kib, olce
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44601
This lets tarfs provide readahead/behind hints to the VFS, which helps
memory-mapped I/O performance, important when running faulting in
executables out of a tarfs mount as one might if tarfs is used to back
the root filesystem, for example. The improvement is particularly
noticeable when the backing tarball is zstd-compressed.
The implementation simply returns the extent of the virtual block
containing the target offset, clamped by the maximum I/O size. This is
perhaps simplistic; it effectively just chooses values that would
correspond to a single VOP_READ call in tarfs_read_file().
Reviewed by: des, kib
MFC after: 1 month
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D44626
There is no obvious reason to use a value smaller than that.
Reviewed by: des, kib
MFC after: 1 week
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D44627
Previously, we would error out if we encountered a global extended
header, because we don't know what it means. This doesn't really
matter though, and traditionally, tar implementations have either
ignored them or treated them as plain files, so just ignore them.
This allows tarfs to mount tar files created by `git archive`.
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Reviewed by: kevans
Differential Revision: https://reviews.freebsd.org/D44600
unionfs has a bunch of clunky special-case code to avoid creating
unionfs wrapper vnodes for AF_UNIX sockets. This was added in 2008
to address PR 118346, but in the intervening years the VOP_UNP_*
operations have been added to provide a clean interface to allow
sockets to work in the presence of stacked filesystems.
PR: 275871
Reviewed by: kib (prior version), olce
Tested by: Karlo Miličević <karlo98.m@gmail.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44288
NFSv4.2 supports a Copy operation, which avoids file data being
read to the client and then written back to the server, if both
input and output files are on the same NFSv4.2 mount for
copy_file_range(2).
Unfortunately, this Copy operation can take a long time under
certain circumstances. If this occurs concurrently with a RPC
that requires an exclusive lock on the nfsd such as ExchangeID
done for a new mount, the result can be an nfsd "stall" until
the Copy completes.
This patch adds a sysctl that can be set to limit the size of
a Copy operation or, if set to 0, disable Copy operations.
The use of this sysctl and other ways to avoid Copy operations
taking too long will be documented in the nfsd.4 man page by
a separate commit.
MFC after: 2 weeks
Since non-doomed unionfs vnodes always share their primary lock with
either the lower or upper vnode, any forwarded call to the base FS
which transiently drops that upper or lower vnode lock may result in
the unionfs vnode becoming completely unlocked during that transient
window. The unionfs vnode may then become doomed by a concurrent
forced unmount, which can lead to either or both of the following:
--Complete loss of the unionfs lock: in the process of being
doomed, the unionfs vnode switches back to the default vnode lock,
so even if the base FS VOP reacquires the upper/lower vnode lock,
that no longer translates into the unionfs vnode being relocked.
This will then violate that caller's locking assumptions as well
as various assertions that are enabled with DEBUG_VFS_LOCKS.
--Complete less of reference on the upper/lower vnode: the caller
normally holds a reference on the unionfs vnode, while the unionfs
vnode in turn holds references on the upper/lower vnodes. But in
the course of being doomed, the unionfs vnode will drop the latter
set of references, which can effectively lead to the base FS VOP
executing with no references at all on its vnode, violating the
assumption that vnodes can't be recycled during these calls and
(if lucky) violating various assertions in the base FS.
Fix this by adding two new functions, unionfs_forward_vop_start_pair()
and unionfs_forward_vop_finish_pair(), which are intended to bookend
any forwarded VOP which may transiently unlock the relevant vnode(s).
These functions are currently only applied to VOPs that modify file
state (and require vnode reference and lock state to be identical at
call entry and exit), as the common reason for transiently dropping
locks is to update filesystem metadata.
Reviewed by: olce
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44076
The checksum code assumed that struct ustar_header filled an entire
block and calculcated the checksum based on the size of the structure.
The header is in fact only 500 bytes long while the checksum covers
the entire block (“logical record” in POSIX terms). Add padding and
an assertion, and clean up the checksum code.
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D44226
* Reject hard or soft links with an empty target path. Currently, a
debugging kernel will hit an assertion in tarfs_lookup_path() while
a non-debugging kernel will happily create a link to the mount root.
* Use a temporary variable to store the result of the link target path,
and copy it to tnp->other only once we have found it to be valid.
Otherwise we error out after creating a reference to the target but
before incrementing the target's reference count, which results in a
use-after-free situation in the cleanup code.
* Correctly return ENOENT from tarfs_lookup_path() if the requested
path was not found and create_dirs is false. Luckily, existing
callers did not rely solely on the return value.
MFC after: 3 days
PR: 277360
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Reviewed by: sjg
Differential Revision: https://reviews.freebsd.org/D44161
FAT file systems do not use inodes, instead all file meta-information
is stored in directory entries.
FAT12 and FAT16 use a fixed size area for root directories, with
typically 512 entries of 32 bytes each (for a total of 16 KB) on hard
disk formats. The file system data is stored in clusters of typically
512 to 4096 bytes, depending on the size of the file system.
The current code uses the offset of a DOS 8.3 style directory entry as
a pseudo-inode, which leads to inode values of 0 to 16368 for typical
root directories with 512 entries.
Sub-directories use 2 cluster length plus the byte offset of the
directory entry in the data area for the pseudo-inode, which may be
as low as 1024 in case of 512 byte clusters. A sub-directory in
cluster 2 and with 512 byte clusters will therefore lead to a
re-use of inode 1024 when there are at least 32 DOS 8.3 style
filenames in the root directory (or 11 14-character Windows
long file names, each of which takes up 3 directory entries).
FAT32 file systems are not affected by this issue and FAT12/FAT16
file systems with larger cluster sizes are unlikely to have as
many directory entries in the root directory as are required to
cause the collision.
This commit leads to inode numbers that are guaranteed to not collide
for all valid FAT12 and FAT16 file system parameters. It does also
provide a small speed-up due to more efficient use of the vnode cache.
Approved by: mckusick
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D43978