A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad93 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens.
This commit should not affect the high level semantics of open
handling.
MFC after: 2 weeks
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad93 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens. This patch also moves any found match to the front
of the hash list, to try and maintain the hash lists in recently
used ordering (least recently used at the end of the list).
This commit should not affect the high level semantics of open
handling.
MFC after: 2 weeks
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
This patch adds a table of hash lists for the opens, hashed on
file handle. This table will be used by future commits to
search for an open based on file handle more efficiently.
MFC after: 2 weeks
The most difficult NFSv4 client recovery case happens when the
lease has expired on the server. For NFSv4.0, the client will
receive a NFSERR_EXPIRED reply from the server to indicate this
has happened.
For NFSv4.1/4.2, most RPCs have a Sequence operation and, as such,
the client will receive a NFSERR_BADSESSION reply when the lease
has expired for these RPCs. The client will then call nfscl_recover()
to handle the NFSERR_BADSESSION reply. However, for the expired lease
case, the first reclaim Open will fail with NFSERR_NOGRACE.
This patch recognizes this case and calls nfscl_expireclient()
to handle the recovery from an expired lease.
This patch only affects NFSv4.1/4.2 mounts when the lease
expires on the server, due to a network partitioning that
exceeds the lease duration or similar.
MFC after: 2 weeks
There are two module declarations in the nfscl.ko module for "nfscl"
and "nfs". Both of these declarations had MODULE_DEPEND() calls.
This patch deletes the MODULE_DEPEND() calls for "nfs" to avoid
confusion with respect to what modules this module is dependent upon.
The patch also adds comments explaining why there are two module
declarations within the module.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30102
There is a NFSv4 file attribute called TimeCreate
that can be used for va_birthtime.
r362175 added some support for use of TimeCreate.
This patch completes support of va_birthtime by adding
support for setting this attribute to the server.
It also eanbles the client to
acquire and set the attribute for a NFSv4
server that supports the attribute.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30156
When loading attributes from the cache, the NFS client is careful to
copy only the fields that it initialized. After fetching attributes
from the server, however, it would copy the entire vattr structure
initialized from the RPC response, so uninitialized stack bytes would
end up being copied to userspace. In particular, va_birthtime (v2 and
v3) and va_gen (v3) had this problem.
Use a common subroutine to copy fields provided by the NFS client, and
ensure that we provide a dummy va_gen for the v3 case.
Reviewed by: rmacklem
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30090
Commit aad780464f added a function called nfscl_delegreturnvp()
to return delegations during the NFS VOP_RECLAIM().
The function erroneously assumed that nm_clp would
be non-NULL. It will be NULL for NFSV4.0 mounts until
a regular file is opened. It will also be NULL during
vflush() in nfs_unmount() for a forced dismount.
This patch adds a check for clp == NULL to fix this.
Also, since it makes no sense to call nfscl_delegreturnvp()
during a forced dismount, the patch adds a check for that
case and does not do the call during forced dismounts.
PR: 255436
Reported by: ish@amail.plala.or.jp
MFC after: 2 weeks
After a vnode is recycled it can no longer be
acquired via vfs_hash_get() and, as such,
a delegation for the vnode cannot be recalled.
In the unlikely event that a delegation still
exists when the vnode is being recycled, return
the delegation since it will no longer be
recallable.
Until you have this patch in your NFSv4 client,
you should consider avoiding the use of delegations.
MFC after: 2 weeks
Without this patch, if a NFSv4 server recalled a
delegation when the file is not open, the renew
thread would block in the NFS VOP_INACTIVE()
trying to acquire the client state lock that it
already holds.
This patch fixes the problem by delaying the
vrele() call until after the client state
lock is released.
This bug has been in the NFSv4 client for
a long time, but since it only affects
delegation when recalled due to another
client opening the file, it got missed
during previous testing.
Until you have this patch in your client,
you should avoid the use of delegations.
MFC after: 2 weeks
Since 7763814fc9 nfsrpc_setclient() uses mem_alloc() that is macro
around malloc(M_RPC). M_RPC is provided by xdr.ko.
Reviewed by: rmacklem
Sponsored by: Mellanox Technologies/NVidia Networking
MFC after: 1 week
During a recent testing event, it was reported that the NFSv4.1/4.2
server erroneously bound the back channel to a new TCP connection.
RFC5661 specifies that the fore channel is implicitly bound to a
new TCP connection when an RPC with Sequence (almost any of them)
is done on it. For the back channel to be bound to the new TCP
connection, an explicit BindConnectionToSession must be done as
the first RPC on the new connection.
Since new TCP connections are created by the "reconnect" layer
(sys/rpc/clnt_rc.c) of the krpc, this patch adds an optional
upcall done by the krpc whenever a new connection is created.
The patch also adds the specific upcall function that does a
BindConnectionToSession and configures the krpc to call it
when required.
This is necessary for correct interoperability with NFSv4.1/NFSv4.2
servers when the nfscbd daemon is running.
If doing NFSv4.1/NFSv4.2 mounts without this patch, it is
recommended that the nfscbd daemon not be running and that
the "pnfs" mount option not be specified.
PR: 254840
Comments by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D29475
Commit fdc9b2d50f replaced a couple of while loops with LIST_FOREACH()
loops. This patch factors the body of that loop out into a separate
function called nfscl_checkown().
This prepares the code for future changes to use a hash table of
lists for open searches via file handle.
This patch should not result in a semantics change.
MFC after: 2 weeks
This patch replaces a couple of while() loops with LIST_FOREACH() loops.
While here, declare a couple of variables "bool".
I think LIST_FOREACH() is preferred and makes the code more readable.
This also prepares the code for future changes to use a hash table of
lists for open searches via file handle.
This patch should not result in a semantics change.
MFC after: 2 weeks
If a delegation for a file has been acquired, the "oneopenown" option
was ignored when the local open was issued. This could result in multiple
openowners/opens for a file, that would be transferred to the server
when the delegation was recalled.
This would not be serious, but could result in more than one openowner.
Since the Amazon/EFS does not issue delegations, this probably never
occurs in practice.
Spotted during code inspection.
This small patch fixes the code so that it checks for "oneopenown"
when doing client local opens on a delegation.
MFC after: 2 weeks
During a recent NFSv4 testing event a test server caused a hang
where "umount -N" failed. The renew thread was sleeping on "nfsv4lck"
and the "umount" was sleeping, waiting for the renew thread to
terminate.
This is the second of two patches that is hoped to fix the renew thread
so that it will terminate when "umount -N" is done on the mount.
This patch adds a 5second timeout on the msleep()s and checks for
the forced dismount flag so that the renew thread will
wake up and see the forced dismount flag. Normally a wakeup()
will occur in less than 5seconds, but if a premature return from
msleep() does occur, it will simply loop around and msleep() again.
The patch also adds the "mp" argument to nfsv4_lock() so that it
will return when the forced dismount flag is set.
While here, replace the nfsmsleep() wrapper that was used for portability
with the actual msleep() call.
MFC after: 2 weeks
During a recent NFSv4 testing event a test server was replying
NFSERR_OLDSTATEID for layout stateids presented to the server
for LayoutReturn operations. Upon rereading RFC5661, it was
apparent that the FreeBSD NFSv4.1/4.2 pNFS client did not
maintain the seqid field of the layout stateid correctly.
This patch is believed to correct the problem. Tested against
a FreeBSD pNFS server with diagnostics added to check the stateid's
seqid did not indicate problems. Unfortunately, testing aginst
this server will not happen in the near future, so the fix may
not be correct yet.
MFC after: 2 weeks
During a recent virtual NFSv4 testing event, a bug in the FreeBSD client
was detected when doing I/O DS operations on a Flexible File Layout pNFS
server. For an NFSv3 DS, the Read/Write/Commit nfsstats were incremented
instead of the ReadDS/WriteDS/CommitDS counts.
This patch fixes this.
Only the RPC counts reported by nfsstat(1) were affected by this bug,
the I/O operations were performed correctly.
MFC after: 2 weeks
During a recent virtual NFSv4 testing event, a bug in the FreeBSD client
was detected when doing a File Layout pNFS DS I/O operation.
The size of the I/O operation was smaller than expected.
The I/O size is specified as a stripe unit size in bits 6->31 of nflh_util
in the layout. I had misinterpreted RFC5661 and had shifted the value
right by 6 bits. The correct interpretation is to use the value as
presented (it is always an exact multiple of 64), clearing bits 0->5.
This patch fixes this.
Without the patch, I/O through the DSs work, but the I/O size is 1/64th
of what is optimal.
MFC after: 2 weeks
During code inspection I noticed that the n_direofoffset field
of the NFS node was being manipulated without any lock being
held to make it SMP safe.
This patch adds locking of the NFS node's mutex around
handling of n_direofoffset to make it SMP safe.
I have not seen any failure that could be attributed to n_direofoffset
being manipulated concurrently by multiple processors, but I think this
is possible, since directories are read with shared vnode
locking, plus locks only on individual buffer cache blocks.
However, there have been as yet unexplained issues w.r.t reading
large directories over NFS that could have conceivably been caused
by concurrent manipulation of n_direofoffset.
MFC after: 2 weeks
Commit 3fe2c68ba2 dealt with a panic in cache_enter_time() where
the vnode referred to the directory argument.
It would also be possible to get these panics if a broken
NFS server were to return the directory as an new object being
created within the directory or in a Lookup reply.
This patch adds checks to avoid the panics and logs
messages to indicate that the server is broken for the
file object creation cases.
Reviewd by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28987
Juraj Lutter (otis@) reported a panic "dvp != vp not true" in
cache_enter_time() called from the NFS client's nfsrpc_readdirplus()
function.
This is specific to an NFSv3 mount with the "rdirplus" mount
option. Unlike NFSv4, NFSv3 replies to ReaddirPlus
includes entries for the current directory.
This trivial patch avoids doing a cache_enter_time()
call for the current directory to avoid the panic.
Reported by: otis
Tested by: otis
Reviewed by: mjg
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28969
in6_selectsrc() may call fib6_lookup() in some cases, which requires
epoch. Wrap in6_selectsrc* calls into epoch inside its users.
Mark it as requiring epoch by adding NET_EPOCH_ASSERT().
MFC after: 1 weeek
Differential Revision: https://reviews.freebsd.org/D28647
Otherwise writing thread might wait on sbusy state of the pages which were
busied by itself, similarly to nfs_read(). But also we need to clear
NVNSETSZKSIP flag possibly set by ncl_pager_setsize(), to not undo
extension done by write.
Reported by: bdrewery
Reviewed by: rmacklem
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28306
When using NFS-over-TLS, an NFS client can optionally provide an X.509
certificate to the server during the TLS handshake. For some situations,
such as different NFS servers or different certificates being mapped
to different user credentials on the NFS server, there may be a need
for different mounts to provide different certificates.
This new mount option called "tlscertname" may be used to specify a
non-default certificate be provided. This alernate certificate will
be stored in /etc/rpc.tlsclntd in a file with a name based on what is
provided by this mount option.
successful RPC.
Without this patch, the NFSv4.2 VOP_COPY_FILE_RANGE() client call would
loop until the copy "len" was completed. The problem with doing this is
that it might take a considerable time to complete for a large "len".
By returning after a single successful Copy RPC that copied some of the
data, the application that did the copy_file_range(2) syscall will be
more responsive to signal delivery for large "len" copies.
The KERN_TLS only supports TCP, so use of the "tls" option with "udp" will
not work. This patch adds a test for this case, so that the mount is not
attempted when both "tls" and "udp" are specified.
An Internet Draft titled "Towards Remote Procedure Call Encryption By Default"
(soon to be an RFC I think) describes how Sun RPC is to use TLS with NFS
as a specific application case.
Various commits prepared the NFS code to use KERN_TLS, mainly enabling use
of ext_pgs mbufs for large RPC messages.
r364475 added TLS support to the kernel RPC.
This commit (which is the final one for kernel changes required to do
NFS over TLS) adds support for three export flags:
MNT_EXTLS - Requires a TLS connection.
MNT_EXTLSCERT - Requires a TLS connection where the client presents a valid
X.509 certificate during TLS handshake.
MNT_EXTLSCERTUSER - Requires a TLS connection where the client presents a
valid X.509 certificate with "user@domain" in the otherName
field of the SubjectAltName during TLS handshake.
Without these export options, clients are permitted, but not required, to
use TLS.
For the client, a new nmount(2) option called "tls" makes the client do
a STARTTLS Null RPC and TLS handshake for all TCP connections used for the
mount. The CLSET_TLS client control option is used to indicate to the kernel RPC
that this should be done.
Unless the above export flags or "tls" option is used, semantics should
not change for the NFS client nor server.
For NFS over TLS to work, the userspace daemons rpctlscd(8) { for client }
or rpctlssd(8) daemon { for server } must be running.
This is a partial revert of r363210, since the "use_ext" argument added
by that commit is not actually useful.
This patch should not result in any semantics change.
r363001 added support for ext_pgs mbufs to nfsm_uiombuf().
By inspection, I noticed that "mlen" was not set non-zero and, as such, there
would be an iteration of the loop that did nothing.
This patch sets it.
This bug would have no effect on the system, since the ext_pgs mbuf code
is not yet enabled.
For NFSv4.0, the server creates a server->client TCP connection for callbacks.
If the client mount on the server is using TLS, enable TLS for this callback
TCP connection.
TLS connections from clients will not be supported until the kernel RPC
changes are committed.
Since this changes the internal ABI between the NFS kernel modules that
will require a version bump, delete newnfs_trimtrailing(), which is no
longer used.
Since LCL_TLSCB is not yet set, these changes should not have any semantic
affect at this time.
This patch uses a slightly different algorithm for nfsm_uiombuflist() for
the non-ext_pgs case, where a variable called "mcp" is maintained, pointing to
the current location that mbuf data can be filled into. This avoids use of
mtod(mp, char *) + mp->m_len to calculate the location, since this does
not work for ext_pgs mbufs and I think it makes the algorithm more readable.
This change should not result in semantic changes for the non-ext_pgs case.
The patch also deletes come unneeded code.
It also adds support for anonymous page ext_pgs mbufs to nfsm_split().
This is another in the series of commits that add support to the NFS client
and server for building RPC messages in ext_pgs mbufs with anonymous pages.
This is useful so that the entire mbuf list does not need to be
copied before calling sosend() when NFS over TLS is enabled.
At this time for this case, use of ext_pgs mbufs cannot be enabled, since
ktls_encrypt() replaces the unencrypted data with encrypted data in place.
Until such time as this can be enabled, there should be no semantic change.
Also, note that this code is only used by the NFS client for a mirrored pNFS
server.
This patch modifies writing to mirrored pNFS DSs slightly so that there is
only one m_copym() call for a mirrored pair instead of two of them.
This call replaces the custom nfsm_copym() call, which is no longer needed
and deleted by this patch. The patch does introduce a new nfsm_split()
function that only calls m_split() for the non-ext_pgs case.
The semantics of nfsm_uiombuflist() is changed to include code that nul
pads the generated mbuf list. This was done by nfsm_copym() prior to this patch.
The main reason for this change is that it allows the data to be a list
of ext_pgs mbufs, since the m_copym() is for the entire mbuf list.
This support will be added in a future commit.
This patch only affects writing to mirrored flexible file layout pNFS servers.
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
Remove all variations of rtrequest <rtrequest1_fib, rtrequest_fib,
in6_rtrequest, rtrequest_fib> and their uses and switch to
to rib_action(). This is part of the new routing KPI.
Submitted by: Neel Chauhan <neel AT neelc DOT org>
Differential Revision: https://reviews.freebsd.org/D25546
The code in nfscl_dofflayout() loops when a flexible file layout server
provides a small write data limit (no extant server is known to do this).
If/when it looped, it erroneously reused the "drpc" argument for the
mirror worker thread, corrupting it.
This patch fixes the problem by only using the calling thread after the
first loop iteration.
Found during testing by simulating a server with a small write size.
Since no extant pNFS server is known to provide a small write size,
this fix it not needed in practice at this time.
MFC after: 2 weeks
The statement "nd->nd_bpos = mcp;" was in both the if and else. Correct,
but potentially confusing. This patch fixes this.
There should be no semantics change caused by this commit.
This patch uses a slightly different algorithm for the non-ext_pgs case,
where a variable called "mcp" is maintained, pointing to the current
location that mbuf data can be filled into. This avoids use of
mtod(mp, char *) + mp->m_len to calculate the location, since this does
not work for ext_pgs mbufs and I think it makes the algorithm more readable.
This change should not result in semantic changes for the non-ext_pgs case.
This is another in the series of commits that add support to the NFS client
and server for building RPC messages in ext_pgs mbufs with anonymous pages.
This is useful so that the entire mbuf list does not need to be
copied before calling sosend() when NFS over TLS is enabled.
Since ND_EXTPG is never set yet, there is no semantic change at this time.
should be used.
For KERN_TLS (and possibly some other future network interface) the mbuf
list passed into sosend() must be ext_pgs mbufs. The krpc could simply
copy all the mbuf data into ext_pgs mbufs before calling sosend(), but
that would be inefficient for large RPC messages.
This patch adds an argument to nfscl_reqstart() to indicate that it should
fill the RPC message into ext_pgs mbufs.
It also adds fields to "struct nfsrv_descript" needed for building NFS RPC
messages in ext_pgs mbufs, along with new flags for this.
Since the argument is always "false", this commit should not result in any
semantic change. However, this commit prepares the code
for future commits that will add support for building of NFS RPC messages
in ext_pgs mbufs.
These macro definitions are no longer needed as the NFS OSX port is long
dead. The vfs_statfs macro conflicts with the vfsops field of the same
name.
Submitted by: shivank@
Reviewed by: rmacklem
MFC after: 2 weeks
Sponsored by: Google, Inc. (GSoC 2020)
Differential Revision: https://reviews.freebsd.org/D25263
fib4_lookup_nh_ represents pre-epoch generation of fib api,
providing less guarantees over pointer validness and requiring
on-stack data copying.
Switch call to use new fib4_lookup(), allowing to eventually
deprecate old api.
Differential Revision: https://reviews.freebsd.org/D24977
Currently the only reason of refcounting rtentries is the need to report
the rtable operation details immediately after the execution.
Delaying rtentry reclamation allows to stop refcounting and simplify the code.
Additionally, this change allows to reimplement rib_lookup_info(), which
is used by some of the customers to get the matching prefix along
with nexthops, in more efficient way.
The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to
nhop_priv to be able to reliably set curvnet even during vnet teardown.
Rest of the reference counting code will be removed in the D24867 .
Differential Revision: https://reviews.freebsd.org/D24866
The NFS code had a bunch of Mac OS/X accessor functions named uio_XXX
left over from the port to Mac OS/X. Since that port is long forgotten,
replace the calls with the code generated by the FreeBSD macros for these
in nfskpiport.h. This allows the macros to be deleted from nfskpiport.h
and I think makes the code more readable.
This patch should not result in any semantic change.
The macros CAST_USER_ADDR_T() and CAST_DOWN() were used for the Mac OS/X
port. The first of these macros was a no-op for FreeBSD and the second
is no longer used.
This patch gets rid of them. It also deletes the "mbuf_t" typedef which
is no longer used in the FreeBSD code from nfskpiport.h
This patch should not change semantics.
RFC5661 specifies that a client's recovery upon receipt of NFSERR_BADSESSION
should first consist of a CreateSession operation using the extant ClientID.
If that fails, then a full recovery beginning with the ExchangeID operation
is to be done.
Without this patch, the FreeBSD client did not attempt the CreateSession
operation with the extant ClientID and went directly to a full recovery
beginning with ExchangeID. I have had this patch several years, but since
no extant NFSv4.n server required the CreateSession with extant ClientID,
I have never committed it.
I an committing it now, since I suspect some future NFSv4.n server will
require this and it should not negatively impact recovery for extant NFSv4.n
servers, since they should all return NFSERR_STATECLIENTID for this first
CreateSession.
The patched client has been tested for recovery against both the FreeBSD
and Linux NFSv4.n servers and no problems have been observed.
MFC after: 1 month
I missed the "atomic" field of the RemoveExtendedAttribute operation's
reply when I implemented it. It worked between FreeBSD client and server,
since it was missed for both, but it did not conform to RFC 8276.
This patch adds the field for both client and server.
Thanks go to Frank for doing interoperability testing of the extended
attribute support against patches for Linux.
Submitted by: Frank van der Linden <fllinden@amazon.com>
Reported by: Frank van der Linden <fllinden@amazon.com>
I did not realize that zero length attributes are allowed, but they are.
This patch fixes the NFSv4.2 client and server to handle zero length
extended attributes correctly.
Submitted by: Frank van der Linden <fllinden@amazon.com> (earlier version)
Reported by: Frank van der Linden <fllinder@amazon.com>
When the code was ported to Mac OS/X, mbuf handling functions were
converted to using the Mac OS/X accessor functions. For FreeBSD, they
are a simple set of macros in sys/fs/nfs/nfskpiport.h.
Since porting to Mac OS/X is no longer a consideration, replacement of
these macros with the code generated by them makes the code more
readable.
When support for external page mbufs is added as needed by the KERN_TLS,
the patch becomes simpler if done without the macros.
This patch should not result in any semantic change.
This is the final patch of this series and the macros should now be
able to be deleted from the .h files in a future commit.
When the code was ported to Mac OS/X, mbuf handling functions were
converted to using the Mac OS/X accessor functions. For FreeBSD, they
are a simple set of macros in sys/fs/nfs/nfskpiport.h.
Since porting to Mac OS/X is no longer a consideration, replacement of
these macros with the code generated by them makes the code more
readable.
When support for external page mbufs is added as needed by the KERN_TLS,
the patch becomes simpler if done without the macros.
This patch should not result in any semantic change.
This conversion will be committed one file at a time.
This NFS lock device driver was replaced by the kernel NLM around FreeBSD7 and
has not normally been used since then.
To use it, the kernel had to be built without "options NFSLOCKD" and
the nfslockd.ko had to be deleted as well.
Since it uses Giant and is no longer used, this patch removes it.
With this device driver removed, there is now a lot of unused code
in the userland rpc.lockd. That will be removed on a future commit.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22933
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
If node attribute returned in the reply for read rpc indicate
truncation, and it happens that the vnode is exclusively locked,
update of the node attributes would try to shrink vnode size. Since
during the read some vnode pages were busied by the reading thread,
vnode_pager_setsize() deadlocks waiting for the busy state owned by
the caller.
Use a thread-local flag to indicate that NFS read owns some (s)busy
pages states and postpone the call to vnode_pager_setsize() until the
thread relinguishes the ownership.
Diagnosed by: rlibby
Tested by: pho, rlibby
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
If nfsrpc_getdirpath() returns NFSERR_MINORVERMISMATCH, it would erroneously
get mapped to EIO. This was not particularily harmful, but would make it
hard for sysadmins to diagnose why an NFSv4 mount is failing.
mount_nfs.c still needs to be fixed so that it does not report
NFSERR_MINORVERMISMATCH as an unknown error 10021.
MFC after: 1 week
None of these case were actually using the variable(s) uninitialized, but
I figured that silencing the warnings via initializing them made sense.
Some of these predated r355677.
This patch adds support for NFSv4.2 (RFC-7862) and Extended Attributes
(RFC-8276) to the NFS client and server.
NFSv4.2 is comprised of several optional features that can be supported
in addition to NFSv4.1. This patch adds the following optional features:
- posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED)
- posix_fallocate()
- intra server file range copying via the copy_file_range(2) syscall
--> Avoiding data tranfer over the wire to/from the NFS client.
- lseek(SEEK_DATA/SEEK_HOLE)
- Extended attribute syscalls for "user" namespace attributes as defined
by RFC-8276.
Although this patch is fairly large, it should not affect support for
the other versions of NFS. However it does add two new sysctls that allow
a sysadmin to limit which minor versions of NFSv4 a server supports, allowing
a sysadmin to disable NFSv4.2.
Unfortunately, when the NFS stats structure was last revised, it was assumed
that there would be no additional operations added beyond what was
specified in RFC-7862. However RFC-8276 did add additional operations,
forcing the NFS stats structure to revised again. It now has extra unused
entries in all arrays, so that future extensions to NFSv4.2 can be
accomodated without revising this structure again.
A future commit will update nfsstat(1) to report counts for the new NFSv4.2
specific operations/procedures.
This patch affects the internal interface between the nfscommon, nfscl and
nfsd modules and, as such, they all must be upgraded simultaneously.
I will do a version bump (although arguably not needed), due to this.
This code has survived a "make universe" but has not been built with a
recent GCC. If you encounter build problems, please email me.
Relnotes: yes
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
We might race with reclaim, and then this is no longer a nfs vnode, in
which case we do not need to handle deferred vnode_pager_setsize()
either.
Reported by: rk@ronald.org
PR: 242184
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
Make the nfsclient always call vnode_pager_setsize() with the vnode
exclusively locked. This ensures that page fault always can find the
backing page if the object size check succeeded. Set VV_VMSIZEVNLOCK
flag on NFS nodes.
The main offender breaking the interface in nfsclient is
nfs_loadattrcache(), which is used whenever server responded with
updated attributes, which can happen on non-changing operations as
well. Also, iod threads only have buffers locked (and even that is
LK_KERNPROC), but they still may call nfs_loadattrcache() on RPC
response.
Instead of immediately calling vnode_pager_setsize() if server
response indicated changed file size, but the vnode is not exclusively
locked, set a new node flag NVNSETSZSKIP. When the vnode exclusively
locked, or when we can temporary upgrade the lock to exclusive, call
vnode_pager_setsize(), by providing the nfsclient VOP_LOCK() implementation.
Tested by: pho
Discussed with: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
To be consistent with replacing the mtx_lock()/mtx_unlock() calls on
the NFS node mutex (n_mtx) and ncl_iod_mutex, this patch replaces
all mtx_assert() calls on these mutexes with macros as well.
This will simplify changing these locks to sx locks in a future commit.
However, this change may be delayed indefinitely, since it appears there
is a deadlock when vnode_pager_setsize() is called to shrink the size
and the NFS node lock is held.
There is no semantic change as a result of this commit.
Suggested by: kib
MFC after: 1 week
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called and the iod lock is held when the NFS node lock
is acquired, the iod mutex will need to be changed to an sx lock as well.
To simply the future commit that changes both the NFS node lock and iod lock
to sx locks, this commit replaces all mtx_lock()/mtx_unlock() calls on the
iod lock with macros.
There is no semantic change as a result of this commit.
I don't know when the future commit will happen and be MFC'd, so I have
set the MFC on this commit to one week so that it can be MFC'd at the same
time.
Suggested by: kib
MFC after: 1 week
For a long time, some places in the NFS code have locked/unlocked the
NFS node lock with the macros NFSLOCKNODE()/NFSUNLOCKNODE() whereas
others have simply used mtx_lock()/mtx_unlock().
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called, replace all occurrences of mtx_lock/mtx_unlock
with the macros to simply making the change to an sx lock in future commit.
There is no semantic change as a result of this commit.
I am not sure if the change to an sx lock will be MFC'd soon, so I put
an MFC of 1 week on this commit so that it could be MFC'd with that commit.
Suggested by: kib
MFC after: 1 week
node lock when shrinking.
This is similar to r252528, applied to the above commit.
Apparently there is a race which makes necessary at least to keep the
n_size and pager size consistent when extending. Current suspect is
that iod threads perform vnode_pager_setsize() without taking the
vnode lock, which corrupts the file content.
Reported and tested by: Masachika ISHIZUKA <ish@amail.plala.or.jp>
Discussed with: rmacklem (related issues)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vnode_pager_setsize() under the node mutex.
r248567 moved some calls of vnode_pager_setsize() after the node lock
is unlocked, do the rest now.
Reported and tested by: peterj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
These were fully neutered in r177676 (2008), but not removed at the time for
unclear reasons. They're totally dead code, so go ahead and yank them now.
No functional change.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.
Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
This silly code segment has existed in the sources since it was brought
into FreeBSD 10 years ago. I honestly have no idea why this was done.
It was possible that I thought that it might have been better to not
set B_ASYNC for the "else" case, but I can't remember.
Anyhow, this patch gets rid of the if/else that does the same thing
either way, since it looks silly and upsets a static analyser.
This will have no semantic effect on the NFS client.
PR: 238167
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377
The more appropriate place to do the flushing is VOP_OPEN(). This was
uncovered because VOP_SET_TEXT() is now called with the vnode'
vm_object rlocked, which is incompatible with the flush operations.
After the move, there is no need for NFS-specific VOP_SET_TEXT
overload.
Sponsored by: The FreeBSD Foundation
MFC after: 30 days
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
r340744 broke the NFSv4 client, because it replaced pfind_locked() with a
call to pfind(), since pfind() acquires the sx lock for the pid hash and
the NFSv4 already holds a mutex when it does the call.
The patch fixes the problem by recreating a pfind_any_locked() and adding the
functions pidhash_slockall() and pidhash_sunlockall to acquire/release
all of the pid hash locks.
These functions are then used by the NFSv4 client instead of acquiring
the allproc_lock and calling pfind().
Reviewed by: kib, mjg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19887
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
that can happen when rerooting into NFSv4 rootfs with kernel
built with INVARIANTS.
I've talked to rmacklem@ (back in 2017), and while the root cause
is still unknown, the case guarded by assertion (nfscl_doclose()
being called from VOP_INACTIVE) is believed to be safe, and the
whole thing seems to run just fine.
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
The NFS client code (nfsrpc_readdir() and nfsrpc_readdirplus()) wasn't
filling in parts of the readdir reply, such as d_pad[01] and the bytes
at the end of d_name within d_reclen. As such, data left in a buffer cache
block could be leaked to userland in the readdir reply.
This patch makes sure all of the data is filled in.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib, markj
MFC after: 2 weeks
Prior to this patch, nfs_advlock() did NFSVOPUNLOCK(); return (error);
in many places. This patch replaces these code sequenences with a "goto out;"
and does the NFSVOPUNLOCK(); return (error); at the end of the function
in order to make the vnode locking simpler.
This patch does not change the semantics of nfs_advlock().
Suggested by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D17853
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
A crash was reported where the crash occurred in nfs_advlock() when the
NFS_ISV4(vp) macro was being executed. This was caused by the vnode
being VI_DOOMED due to a forced dismount in progress.
This patch fixes the problem by locking the vnode before executing the
NFS_ISV4() macro.
Tested by: rlibby
PR: 232673
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D17757
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.
The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.
VFS_OP()s wrap is straightforward.
Requested and reviewed by: mjg (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17658
Newer versions of gcc generate "might not be initialized" warnings for
several variables in nfsrpc_doiods(). I have checked and all of these
variables are assigned values before they are used.
In the one case of "tdrpc", it could have passed garbage as an argument
to nfscl_dofflayoutio() when mirrorcnt is one. However nfscl_dofflayoutio() only
uses the argument when mirrorcnt > 1, so it wasn't actually broken.
This patch initializes "tdrpc" to avoid confusion and initializes the rest
to make the compiler happy.
Requested by: mmacy
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.
Requested by: mmacy
When a NFSv4.1 client mount using pNFS detects a failure trying to do a
Renew (actually just a Sequence operation), the code would simply try
again and again and again every 30sec.
This would tie up the "nfscl" thread, which should also be doing other
things like Renews on other DSs and the MDS.
This patch adds code which closes down the TCP connection and marks it
defunct when Renew detects an failure to communicate with the DS, so
further Renews will not be attempted until a new working TCP connection to
the DS is established.
It also makes the call to nfscl_cancelreqs() unconditional, since
nfscl_cancelreqs() checks the NFSCLDS_SAMECONN flag and does so while holding
the lock.
This fix only applies to the NFSv4.1 client whne using pNFS and without it
the only effect would have been an "nfscl" thread busy doing Renew attempts
on an unresponsive DS.
MFC after: 2 weeks
Without this patch, the client side NFSv4.1 pNFS code erroneously did writes
and commits to both DS mirrors using the TCP connection of the first one.
For my test setup this worked, since I have both DSs running on the same
machine, but it would have failed when the DSs are on separate machines.
This patch fixes the code to use the correct TCP connection for each DS.
This patch should only affect the NFSv4.1 client when using "pnfs" mounts
to mirrored DSs.
MFC after: 2 weeks
So long as the TCP connection to a pNFS DS isn't shared with other DSs,
it can be closed down when the DS is being disabled in the pNFS client.
This causes any RPCs in progress to fail.
This patch only affects the NFSv4.1 pNFS client when errors occur
while doing I/O on a DS.
MFC after: 2 weeks
an I/O attempt on a DS to the server via LayoutReturn.
The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errrors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This patch only affects behaviour of the pNFS client and only when using
Flexible File layouts.
MFC after: 2 weeks
Without this patch, the NFSv4.1 pNFS client shared a single TCP connection
for all DSs that resided on the same machine. This made disabling one of
the DSs impossible. Although unlikely, it is possible that the storage
subsystem has failed in such a way that the storage for one DS on a machine
is no longer functioning correctly, but the storage used by another DS on
the same machine is still ok. For this case, it would be nice if a system
can fail one of the DSs without failing them all.
This patch changes the default behaviour to use separate TCP connections
for each DS even if they reside on the same machine.
I do not believe that this will be a problem for extant pNFS servers, but
a sysctl can be set to restore the old behaviour if this change causes a
problem for an extant pNFS server.
This patch only affects the NFSv4.1 pNFS client.
MFC after: 2 weeks
When the NFSv4.1 pNFS client gets an error for a DS I/O operation using a
Flexible File layout, it returns the layout with an error.
This patch changes the code slightly, so that it returns the layout for all
errors except EACCES and lets the MDS decide what to do based on the error.
It also makes a couple of changes to nfscl_layoutrecall() to ensure that
the first layoutreturn(s) will have the error in the reply.
Plus, the patch adds a wakeup() so that the "nfscl" thread won't wait 1sec
before doing the LayoutReturn.
Tested against the pNFS service.
This patch should not affect non-pNFS use of the client.
The unused "dsp" argument will be used by a future patch that disables the
connection to the DS when possible.
MFC after: 2 weeks
The Flexible File layout LayoutReturn operation has argument fields where
an I/O error encountered when attempting I/O on a DS can be reported back
to the MDS.
This patch adds code to the client to do this for the Flexible File layout
mirrored case.
This patch should only affect mounts using the "pnfs" option against servers
that support the Flexible File layout.
MFC after: 2 weeks
Four functions nfscl_reqstart(), nfscl_fillsattr(), nfsm_stateidtom()
and nfsmnt_mdssession() are now called from within the nfsd.
As such, they needed to be moved from nfscl.ko to nfscommon.ko so that
nfsd.ko would load when nfscl.ko wasn't loaded.
Reported by: herbert@gojira.at
the old encodings for the lower 16 and 32 bits and only using the
higher 32 bits for unusually large major and minor numbers. This
change breaks compatibility with the previous encoding (which was only
used in -current).
Fix truncation to (essentially) 16-bit dev_t in newnfs v3.
Any encoding of device numbers gives an ABI, so it can't be changed
without translations for compatibility. Extra bits give the much
larger complication that the translations need to compress into fewer
bits. Fortunately, more than 32 bits are rarely needed, so
compression is rarely needed except for 16-bit linux dev_t where it
was always needed but never done.
The previous encoding moved the major number into the top 32 bits.
Almost no translation code handled this, so the major number was blindly
truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with
major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent
this and blindly truncated it to 2. But if this mknod was run on any
released version of FreeBSD, it gives dev_t = 0x102. ffs can represent
this, but in the previous encoding it was not decoded, giving major = 0,
minor = 0x102.
The presence of bugs was most obvious for exporting dev_t's from an
old system to -current, since bugs in newnfs augment them. I fixed
oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed
to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then
further in -current. E.g., old ad0 with major = 234, minor = 0x10002
had the correct (major, minor) number on the wire, but newnfs truncated
this to (234, 2) and then the previous encoding shifted the major
number into oblivion as seen by ffs or old applications.
I first tried to fix this by translating on every ABI/API boundary, but
there are too many boundaries and too many sloppy translations by blind
truncation. So use the old encoding for the low 32 bits so that sloppy
translations work no worse than before provided the high 32 bits are
not set. Add some error checking for when bits are lost. Keep not
doing any error checking for translations for almost everything in
compat/linux.
compat/freebsd32/freebsd32_misc.c:
Optionally check for losing bits after possibly-truncating assignments as
before.
compat/linux/linux_stats.c:
Depend on the representation being compatible with Linux's (or just with
itself for local use) and spell some of the translations as assignments in
a macro that hides the details.
fs/nfsclient/nfs_clcomsubs.c:
Essentially the same fix as in 1996, except there is now no possible
truncation in makedev() itself. Also fix nearby style bugs.
kern/vfs_syscalls.c:
As for freebsd32. Also update the sysctl description to include file
numbers, and change it to describe device ids as device numbers.
sys/types.h:
Use inline functions (wrapped by macros) since the expressions are now
a bit too complicated for plain macros. Describe the encoding and
some of the reasons for it. 16-bit compatibility didn't leave many
reasonable choices for the 32-bit encoding, and 32-bit compatibility
doesn't leave many reasonable choices for the 64-bit encoding. My
choice is to put the 8 new minor bits in the low 8 bits of the top 32
bits. This minimizes discontiguities.
Reviewed by: kib (except for rewrite of the comment in linux_stats.c)
This code merge adds a pNFS service to the NFSv4.1 server. Although it is
a large commit it should not affect behaviour for a non-pNFS NFS server.
Some documentation on how this works can be found at:
http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
and will hopefully be turned into a proper document soon.
This is a merge of the kernel code. Userland and man page changes will
come soon, once the dust settles on this merge.
It has passed a "make universe", so I hope it will not cause build problems.
It also adds NFSv4.1 server support for the "current stateid".
Here is a brief overview of the pNFS service:
A pNFS service separates the Read/Write oeprations from all the other NFSv4.1
Metadata operations. It is hoped that this separation allows a pNFS service
to be configured that exceeds the limits of a single NFS server for either
storage capacity and/or I/O bandwidth.
It is possible to configure mirroring within the data servers (DSs) so that
the data storage file for an MDS file will be mirrored on two or more of
the DSs.
When this is used, failure of a DS will not stop the pNFS service and a
failed DS can be recovered once repaired while the pNFS service continues
to operate. Although two way mirroring would be the norm, it is possible
to set a mirroring level of up to four or the number of DSs, whichever is
less.
The Metadata server will always be a single point of failure,
just as a single NFS server is.
A Plan B pNFS service consists of a single MetaData Server (MDS) and K
Data Servers (DS), all of which are recent FreeBSD systems.
Clients will mount the MDS as they would a single NFS server.
When files are created, the MDS creates a file tree identical to what a
single NFS server creates, except that all the regular (VREG) files will
be empty. As such, if you look at the exported tree on the MDS directly
on the MDS server (not via an NFS mount), the files will all be of size 0.
Each of these files will also have two extended attributes in the system
attribute name space:
pnfsd.dsfile - This extended attrbute stores the information that
the MDS needs to find the data storage file(s) on DS(s) for this file.
pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime
and Change attributes for the file, so that the MDS doesn't need to
acquire the attributes from the DS for every Getattr operation.
For each regular (VREG) file, the MDS creates a data storage file on one
(or more if mirroring is enabled) of the DSs in one of the "dsNN"
subdirectories. The name of this file is the file handle
of the file on the MDS in hexadecimal so that the name is unique.
The DSs use subdirectories named "ds0" to "dsN" so that no one directory
gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize
on the MDS, with the default being 20.
For production servers that will store a lot of files, this value should
probably be much larger.
It can be increased when the "nfsd" daemon is not running on the MDS,
once the "dsK" directories are created.
For pNFS aware NFSv4.1 clients, the FreeBSD server will return two pieces
of information to the client that allows it to do I/O directly to the DS.
DeviceInfo - This is relatively static information that defines what a DS
is. The critical bits of information returned by the FreeBSD
server is the IP address of the DS and, for the Flexible
File layout, that NFSv4.1 is to be used and that it is
"tightly coupled".
There is a "deviceid" which identifies the DeviceInfo.
Layout - This is per file and can be recalled by the server when it
is no longer valid. For the FreeBSD server, there is support
for two types of layout, call File and Flexible File layout.
Both allow the client to do I/O on the DS via NFSv4.1 I/O
operations. The Flexible File layout is a more recent variant
that allows specification of mirrors, where the client is
expected to do writes to all mirrors to maintain them in a
consistent state. The Flexible File layout also allows the
client to report I/O errors for a DS back to the MDS.
The Flexible File layout supports two variants referred to as
"tightly coupled" vs "loosely coupled". The FreeBSD server always
uses the "tightly coupled" variant where the client uses the
same credentials to do I/O on the DS as it would on the MDS.
For the "loosely coupled" variant, the layout specifies a
synthetic user/group that the client uses to do I/O on the DS.
The FreeBSD server does not do striping and always returns
layouts for the entire file. The critical information in a layout
is Read vs Read/Writea and DeviceID(s) that identify which
DS(s) the data is stored on.
At this time, the MDS generates File Layout layouts to NFSv4.1 clients
that know how to do pNFS for the non-mirrored DS case unless the sysctl
vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File
layouts are generated.
The mirrored DS configuration always generates Flexible File layouts.
For NFS clients that do not support NFSv4.1 pNFS, all I/O operations
are done against the MDS which acts as a proxy for the appropriate DS(s).
When the MDS receives an I/O RPC, it will do the RPC on the DS as a proxy.
If the DS is on the same machine, the MDS/DS will do the RPC on the DS as
a proxy and so on, until the machine runs out of some resource, such as
session slots or mbufs.
As such, DSs must be separate systems from the MDS.
Tested by: james.rose@framestore.com
Relnotes: yes
There were a couple of cases in newnfs_request() that it assumed that it
was an NFSv4.1 mount with a session. This should always be the case when
a Sequence operation is in the reply or the server replies NFSERR_BADSESSION.
However, if a server was broken and sent an erroneous reply, these safety
belt checks should avoid trouble.
The one check required a small tweak to nfsmnt_mdssession() so that it
returns NULL when there is no session instead of the offset of the field
in the structure (0x8 for i386).
This patch should have no effect on normal operation of the client.
Found by inspection during pNFS server development.
MFC after: 2 weeks
The Flexible File layout case wasn't handled by LayoutRecall callbacks
because it just checked for File layout and returned NFSERR_NOMATCHLAYOUT
otherwise. This patch adds the Flexible File layout handling.
Found during testing of the pNFS server.
MFC after: 2 weeks
The intent was that the default would be based on number of CPUs, but the
code disabled using taskqueue() by default.
This code is only executed when mounting a NFSv4.1 server that supports the
Flexible File layout for pNFS and, since such servers are rare, this change
shouldn't result in a POLA violation.
(The FreeBSD pNFS server is still a project and the only other one that
uses Flexible File layout is being developed by Primary Data and I don't
know if they have even shipped any to customers yet.)
Found while testing the pNFS server.
The sleep for I/O completion during an NFSv4.1 pNFS layout recall used
the wrong event value and could result in the "[nfscl]" thread hung
for the mount.
This patch fixes the event to be the correct.
This bug will only affect NFSv4.1 pnfs mounts and only when the server
does a layout recall callback, so it won't affect many. Without the patch,
a mount without the "pnfs" option will avoid the problem.
Found during testing of the pNFS server.
MFC after: 1 week
Using a pointer after setting it NULL is probably not a good plan.
Spotted by inspection during changes for Flexible File Layout Ioerr handling.
This code path obviously isn't normally executed.
MFC after: 1 week
Mechanically replace uses of MALLOC/FREE with appropriate invocations of
malloc(9) / free(9) (a series of sed expressions). Something like:
* MALLOC(a, b, ... -> a = malloc(...
* FREE( -> free(
* free((caddr_t) -> free(
No functional change.
For now, punt on modifying contrib ipfilter code, leaving a definition of
the macro in its KMALLOC().
Reported by: jhb
Reviewed by: cy, imp, markj, rmacklem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14035
When allocating memory through malloc(9), we always expect the amount of
memory requested to be unsigned as a negative value would either stand for
an error or an overflow.
Unsign some values, found when considering the use of mallocarray(9), to
avoid unnecessary casting. Also consider that indexes should be of
at least the same size/type as the upper limit they pretend to index.
MFC after: 3 weeks
Uses of mallocarray(9).
The use of mallocarray(9) has rocketed the required swap to build FreeBSD.
This is likely caused by the allocation size attributes which put extra pressure
on the compiler.
Given that most of these checks are superfluous we have to choose better
where to use mallocarray(9). We still have more uses of mallocarray(9) but
hopefully this is enough to bring swap usage to a reasonable level.
Reported by: wosch
PR: 225197
Focus on code where we are doing multiplications within malloc(9). None of
these ire likely to overflow, however the change is still useful as some
static checkers can benefit from the allocation attributes we use for
mallocarray.
This initial sweep only covers malloc(9) calls with M_NOWAIT. No good
reason but I started doing the changes before r327796 and at that time it
was convenient to make sure the sorrounding code could handle NULL values.
X-Differential revision: https://reviews.freebsd.org/D13837
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
gcc complaints that the comparision is always false due to the value
range, and the cast does not prevent the analysis. Split the LP64
vs. ILP32 clamping as a workaround.
Sponsored by: The FreeBSD Foundation
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.
To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.
While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.
Discussed with: bde
Reviewed by: kib (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D12572
- Define a NFS_LINK_MAX as UINT32_MAX to match the wire protocol.
- Use NFS_LINK_MAX instead of LINK_MAX as the fallback value reported
for a PATHCONF RPC by the NFS server.
- Use NFS_LINK_MAX instead of LINK_MAX as the default value reported
by the NFS client pathconf() if not overridden by the NFS server.
- When reading the link count out of an RPC reply, read the full 32
bits instead of the lower 16 bits.
Reviewed by: rmacklem (earlier version)
Sponsored by: Chelsio Communications
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
When the NFSv4.1 pNFS client is using a Flexible File Layout specifying
mirrored Data Servers, it must do the writes and commits to all mirrors.
This patch modifies the client to use a taskqueue to perform these writes
and commits concurrently.
The number of threads can't be changed for taskqueue(9), so it is set
to 4 * mp_ncpus by default, but this can be overridden by setting the
sysctl vfs.nfs.pnfsiothreads.
Differential Revision: https://reviews.freebsd.org/D12632
This patch adds support for the Flexible File Layout to the pNFS client.
Although the patch is rather large, it should only affect NFS mounts
using the "pnfs" option against pNFS servers that do not support File
Layout.
There are still a couple of things missing from the Flexible File Layout
client implementation:
- The code does not yet do a LayoutReturn with I/O error stats when
I/O error(s) occur when attempting to do I/O on a DS.
This will be fixed in a future commit, since it is important for the
MDS to know that I/O on a DS is failing.
- The current code does writes and commits to mirror DSs serially.
Making them happen concurrently will be done in a future commit,
after discussion on freebsd-current@ on the best way to do this.
- The code does not handle NFSv4.0 DSs. Since there is no extant pNFS
server that implements NFSv4.0 DSs and NFSv4.1 DSs makes more sense
now, I don't intend to implement this until there is a need for it.
There is support for NFSv4.1 and NFSv3 DSs.
This patch changes nfsv4_getipaddr() and nfsrpc_fillsa() to use
a sockaddr_in * and sockaddr_in6 * instead of sockaddr_storage, to
avoid allocating the latter on the stack. It also moves the nfsrpc_fillsa()
call to after the completion of parsing of the DeviceInfo reply from
the server. This patch is in preparation for addition of Flex File
Layout support in a future commit.
It only affects the "pnfs" NFSv4.1 client mount option and should not
have changed its semantics.
When a "pnfs" NFSv4.1 mount was unmounted, it didn't free up the layouts
and deviceinfo structures. This leak only affects "pnfs" mounts and only
when the mount is umounted.
Found while testing the pNFS Flexible File layout client code.
MFC after: 2 weeks
This patch adds "vers" and "minorvers" arguments to nfscl_reqstart().
The patch always passes them in as "0" and that implies no change
in semantics. These arguments will be used by a future commit that
adds support for the Flexible File Layout.
nfsm_uiombuflist() zero filled the mbuf list to a multiple of 4bytes
as required for XDR. Unfortunately that modified an mbuf list after
it was m_copym()'d and was broken. This patch removes the zero filling code.
Since nfsm_uiombuflist() is not yet used in head/current, this has no
effect on users.
The function will be used by a future commit of code that adds Flex
File Layout support.
Make the NFSv4 pNFS client function nfsrpc_layoutget() a static, since it
is only used in sys/fs/nfsclient/nfs_clrpcops.c.
This prepares the code for future patches that add Flex File layout
support.
This patch adds a new function called nfsm_uiombuflist(), which is
similar to nfsm_uiombuf(), but doesn't not use the fields in
struct nfsrv_descript. This new function will be used by the pNFS client
for writing to mirrors using Flex Files layout.
The function is not yet called anywhere.
Also, get rid of #ifndef APPLE, which is ancient cruft left over from
the Mac OSX port of the NFSv4 client.
Simplify nfsrpc_layoutreturn() args. in preparation for the addition
of Flex File layout support, since File layout uses a 0 length field.
Flex Files does use a longer field, but that will be added in a
subsequent commit.
The code in nfscl_doflayoutio() bogusly used FREAD instead of
NFSV4OPEN_ACCESSREAD. Since both happen to be defined as "1", this
worked and the patch doesn't result in a functional change.
Found by inspection during development of Flex File Layout support.
MFC after: 2 weeks
Currently several paths in the NFS client upgrade the shared vnode
lock to exclusive, which might cause temporal dropping of the lock.
This action appears to be fatal for nullfs mounts over NFS. If the
operation is performed over nullfs vnode, then bypassed down to NFS
VOP, and the lock is dropped, other thread might reclaim the upper
nullfs vnode. Since on reclaim the nullfs vnode lock and NFS vnode
lock are split, the original lock state of the nullfs vnode is not
restored. As result, VFS operations receive not locked vnode after a
VOP call.
Stop upgrading the vnode lock when we check the consistency or flush
buffers as result of detected inconsistency. Instead, allocate a new
lockmgr lock for each NFS node, which is locked exclusive instead of
the vnode lock upgrade. In other words, the other parallel
modification of the vnode are excluded by either vnode lock conflict
or exclusivity of the new lock when the vnode lock is shared.
Also revert r316529 because now the vnode cannot be reclaimed during
ncl_vinvalbuf().
In collaboration with: pho
Reviewed by: rmacklem
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12083
When an NFS mount is hung against an unresponsive NFS server, the "umount -f"
option can be used to dismount the mount. Unfortunately, "umount -f" gets
hung as well if a "umount" without "-f" has already been done. Usually,
this is because of a vnode lock being held by the "umount" for the mounted-on
vnode.
This patch adds kernel code so that a new "-N" option can be added to "umount",
allowing it to avoid getting hung for this case.
It adds two flags. One indicates that a forced dismount is about to happen
and the other is used, along with setting mnt_data == NULL, to handshake
with the nfs_unmount() VFS call.
It includes a slight change to the interface used between the client and
common NFS modules, so I bumped __FreeBSD_version to ensure both modules are
rebuilt.
Tested by: pho
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D11735
If the nfsrpc_createlayoutrpc() call in nfsrpc_getcreatelayout() fails,
the code used nfhpp when it might be set NULL. This patch checks for
the error cases (laystat != 0) and avoids using nfhpp for the failure case.
This would only affect NFSv4.1 mounts with the "pnfs" option.
Found while testing the "umount -N" patch not yet in head.
MFC after: 2 weeks
This patch defines a macro that checks for MNTK_UNMOUNTF and replaces
explicit checks with this macro. It has no effect on semantics, but
prepares the code for a future patch where there will also be a
NFS specific flag for "forced dismount about to occur".
Suggested by: kib
MFC after: 2 weeks
Suppose that a file on NFS has partially filled last page, and this
page is dirty. NFS VOP_PAGEOUT() method only marks the the page clean
up to the block of the last written byte, leaving other blocks dirty.
Also any page which erronously exists in the vnode vm_object past EOF
is also left marked as dirty.
With the introduction of the buf-cache coherent pager, each pass of
syncer over the object with such page results in creation of B_DELWRI
buffer due to VOP_WRITE() call. This buffer is noted on next syncer
pass, which results e.g. a visible manifestation of shutdown never
finishing vnode sync. Note that before buf-cache coherency commit, a
dirty page might left never synced to server if a partial writes
occur.
Fix this by clearing dirty bits after EOF. Only blocks of the partial
page which are completely after EOF are marked clean, to avoid
possible user data loss.
Reported by: mav
Reviewed by: alc, markj
Tested by: mav, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D11697
r320062 used nm_rsize, nm_wsize to set the maximum request/response sizes for
the NFSv4.1 session. If rsize,wsize are not specified as options, the
value of nm_rsize, nm_wsize is 0 at session creation, resulting in
values for request/response that are too small.
This patch fixes the problem. A workaround is to specify rsize=N,wsize=N
mount options explicitly, so they are set before session creation.
This bug only affects NFSv4.1 mounts against some non-FreeBSD servers.
MFC after: 1 week
r320062 used nm_rsize, nm_wsize to set the maximum request/response sizes for
the NFSv4.1 session. If rsize,wsize are not specified as options, the
value of nm_rsize, nm_wsize is 0 at session creation, resulting in
values for request/response that are too small.
This patch fixes the problem. A workaround is to specify rsize=N,wsize=N
mount options explicitly, so they are set before session creation.
This bug only affects NFSv4.1 mounts against some non-FreeBSD servers.
MFC after: 1 week
Update filesystems not currently using vop_stdpathconf() in pathconf
VOPs to use vop_stdpathconf() for any configuration variables that do
not have filesystem-specific values. vop_stdpathconf() is used for
variables that have system-wide settings as well as providing default
values for some values based on system limits. Filesystems can still
explicitly override individual settings.
PR: 219851
Reported by: cem
Reviewed by: cem, kib, ngie
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D11541
If an NFSv3 server were to reply with weak cache consistency attributes,
but not post operation attributes, the client would use garbage attributes
from memory. This was spotted during work on the code for the NFSv4.1 client.
I have never seen evidence that this happens and it wouldn't make sense
for an NFSv3 server to do this, so this patch is basically "theoretical",
but does fix the problem if a server were to do the above.
PR: 219552
MFC after: 2 weeks
A NFSv4.1/pNFS server using File Layout can specify that Commit operations
are to be done against the DS instead of MDS. Since no extant pNFS
server did this, the code was untested and "#ifdef notyet".
The FreeBSD pNFS server I am developing does specify that Commits be done
through the DS, so the code has been enabled/tested.
This patch should only affect the case of a pNFS server that specfies
Commits through the DS.
PR: 219551
MFC after: 2 weeks
When the NFSv4.1 client is doing pNFS, it needs to get an Open and
a Layout for every file it will be doing I/O on. The current code
does two separate RPCs to get these. This patch adds two new compounds
that do the both the Open and LayoutGet in the same RPC, reducing the
RPC count.
It also factors out the code that sets up and parses the LayoutGet operation
into separate functions, so that the code doesn't get duplicated for
these new RPCs.
This patch is fairly large, but should only affect the NFSv4.1 client
when the "pnfs" option is specified.
PR: 219550
MFC after: 2 weeks
initialized.
bdrewery@ has reported panics "newnfs_copycred: negative nfsc_ngroups".
The only way I can see that this occurs is that the credentials field of
the open structure gets used before being filled in.
I am not sure quite how this happens, but for the file create case, the
code is serialized via the vnode lock on the directory. If, somehow, a
link to the same file gets created just after file creation, this might
occur.
This patch ensures that the credentials field is initialized to a reasonable
set of credentials before the structure is linked into any list, so I
this should ensure it is initialized before use.
I am committing the patch now, since bdrewery@ notes that the panics
are intermittent and it may be months before he knows if the patch fixes
his problem.
Reported by: bdrewery
MFC after: 2 weeks
r320070 removed the definition of maxbcachebuf from sys/param.h to
fix the build for arm.
This patch adds the definition of maxbcachebuf to sys/buf.h, which
should be ok, since sys/buf.h is not being included in arm/arm/elf_note.S.
Suggested by: kib
MFC after: 2 weeks
The code still doesn't use d_off. That will come in a future commit.
The code also removes the checks for servers returning a fileno that
doesn't fit in 32bits, since that should work ok now.
Bump __FreeBSD_version since this patch changes the interface between
the NFS kernel modules.
Reviewed by: kib
arm build.
In the arm build, elf_note.S includes sys/param.h and then does an
elf macro called ELFNOTE(). Although the compile error doesn't make
sense to me, I believe it just means that an "extern ..." can't exist
in param.h for this inclusion case.
I suspect adding #if !defined(LOCORE) might fix the build, but this
commit just takes the definition out.
I will ask freebsd-current@ what is the best was to deal with this
and do a subsequent commit after that.
Reported by: melounmichal@gmail.com
By making MAXBCACHEBUF a tunable, it can be increased to allow for
larger read/write data sizes for the NFS client.
The tunable is limited to MAXPHYS, which is currently 128K.
Making MAXPHYS a tunable or increasing its value is being discussed,
since it would be nice to support a read/write data size of 1Mbyte
for the NFS client when mounting the AmazonEFS file service.
Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D10991
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).
Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.
Note that the change is mostly cosmetic. Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..
Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by: The FreeBSD Foundation
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
The code specified the length of a layout as INT64_MAX instead of
UINT64_MAX. This could result in getting a layout for less than the
full file for extremely large files. Although having little practical
effect, this patch corrects this in the code.
Detected during recent testing of the pNFS server.
MFC after: 2 weeks
The nfsv4_seqsession() call returns NFSERR_REPLYFROMCACHE when it has a
reply in the session, due to a requestor retry. The code erroneously
assumed a return of 0 for this case. This patch fixes this and adds
a KASSERT(). This would be an extremely rare occurrence. It was found
during code inspection during the pNFS server development.
MFC after: 2 weeks
An NFSv4 server has the option of allowing a Read to be done using a Write
Open. If this is not allowed, the server will return NFSERR_OPENMODE.
This patch attempts the read with a write open and then disables this
if the server replies NFSERR_OPENMODE.
This change will avoid some uses of the special stateids. This will be
useful for pNFS/DS Reads, since they cannot use special stateids.
It will also be useful for any NFSv4 server that does not support reading
via the special stateids. It has been tested against both types of NFSv4 server.
MFC after: 2 weeks
The NFSv4.1/pNFS client does not use/need a backchannel for the Data Server (DS)
sessions, so the flag should only be set for MetaData Server (MDS) sessions.
This patch should have been a part of r317275.
MFC after: 2 weeks
The "return layout on close" case in the pNFS client was badly broken.
Fortunately, extant pNFS servers that I have tested against do not
do this. This patch fixes it. It also changes the way the layout stateid.seqid
is set for LayoutReturn. I think this change is correct w.r.t. the RFC,
but I am not 100% sure.
This was found during recent testing of the pNFS server under development.
MFC after: 2 weeks
The NFSv4.1/pNFS client wasn't doing a newnfs_disconnect() call for the
connection to the Data Server (DS) under some circumstances. The main
effect of this was a leak of malloc'd structures in the krpc. This patch
adds the newnfs_disconnect() calls to fix this.
Detected during recent testing against the pNFS server under development.
MFC after: 2 weeks
The nfscl_mtofh() function didn't check for failed operations and, as such,
would have returned EBADRPC for these cases, due to parsing failure.
This patch adds checks, so that it returns with ND_NOMOREDATA set.
This is needed for future use in the pNFS server and acts as a safety
belt in the meantime.
MFC after: 2 weeks
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
The "cred" argument of ncl_flush() is unused and it was confusing to have
the code passing in NULL for this argument in some cases. This patch deletes
this argument.
There is no semantic change because of this patch.
MFC after: 2 weeks
Some NFSv4.1 servers such as AmazonEFS can only support a small fixed number
of open_owner4s. This patch adds a mount option called "oneopenown" that
can be used for NFSv4.1 mounts to make the client do all Opens with the
same open_owner4 string. This option can only be used with NFSv4.1 and
may not work correctly when Delegations are is use.
Reported by: cperciva
Tested by: cperciva
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8988
A function called svcpool_close() was added to the server side krpc by
r313735, so that a pool could be closed without destroying the data structures.
This little patch adds a call to it for the callback pool (svcpool_nfscbd),
so that the nfscbd daemon can be killed/restarted and continue to work
correctly.
MFC after: 2 weeks
When an mmap'd text file is written and then executed immediately
afterwards, it was possible that the modify time would change after the
text file was executing, resulting in the process executing the file
being killed. This was usually only observed when the file system's
times were set to higher resolution, but could have occurred for any
time resolution.
This was reported on a recent email list discussion.
This patch adds a VOP_SET_TEXT() to the NFS client which flushed all
dirty pages to the NFS server and then makes sure that n_mtime is up
to date to avoid this from occurring.
Thanks go to kib@ and pho@ for their help with developing this patch.
Tested by: pho
Reviewed by: kib
MFC after: 2 weeks
If the ExchangeID/CreateSession operations done by an NFSv4.1 client
after the server crashes/reboots fails, it is possible that some process/thread
is waiting for an open_owner lock. If the client state is free'd, this
can cause a crash.
This would not normally happen, but has been observed on a mount of the
AmazonEFS service.
Reported by: cperciva
Tested by: cperciva
PR: 216086
MFC after: 2 weeks
during recovery.
If the NFSv4.1 client gets a NFSv4.1 NFSERR_BADSESSION reply to an Open/Lock
operation while recovering from the server crash/reboot, allow the opens
to be retained for a subsequent recovery attempt. Since NFSv4.1 servers
should only reply NFSERR_BADSESSION after a crash/reboot that has lost
state, this case should almost never happen.
However, for the AmazonEFS file service, this has been observed when
the client does a fresh TCP connection for RPCs.
Reported by: cperciva
Tested by: cperciva
PR: 216088
MFC after: 2 weeks
Instead, issue a diagnostic and return appropriate error if
ncl_flush() was unable to clean buffer queue after the specified
number or retries.
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
This patch gives a requestor of the exclusive lock on the client state
in the NFSv4 client priority over shared lock requestors. This avoids
the server crash recovery thread being starved out by other threads doing
RPCs.
Tested by: cperciva
PR: 216087
MFC after: 2 weeks
When the NFSv4 client Commit operation encountered a stale write verifier,
it erroneously mapped that to EIO. This could have caused recently written
data to be lost when a server crashes/reboots between an UNSTABLE write
and the subsequent commit. This patch fixes this.
The bug was only for the NFSv4 client and did not affect NFSv3.
Tested by: cperciva
PR: 215887
MFC after: 2 weeks
If an operation that preceeds a Setattr in an NFSv4 compound fails,
there is no bitmap of attributes to parse. Without this patch, the
parsing would fail and return EBADRPC instead of the correct failure
error. This could break recovery from a server crash/reboot.
Tested by: cperciva
PR: 215883
MFC after: 2 weeks
Write out the dirty pages using VOP_WRITE() instead of directly
calling ncl_writerpc(). The state of the buffers now reflects the
write, fixing some hard to diagnose consistency and write order
issues. The change also allowed to remove remapping of paged out
pages into kernel space and related allocation of the phys buffer.
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D10241
ncl_vinvalbuf() might need to upgrade vnode lock, allowing the vnode
to be reclaimed by other thread. Handle the situation, indicated by
the returned error zero and VI_DOOMED iflag set, converting it into
EBADF. Handle all calls, even where the vnode is exclusively locked
right now.
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D10241
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
Right now this is not critical, but will be after planned increase of
MNAMELEN from 88 to 1k.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
For most NFSv4.1 servers, a NFS4ERR_BAD_SESSION error is a rare failure
that indicates that the server has lost session/open/lock state.
However, recent testing by cperciva@ against the AmazonEFS server found
several problems with client recovery from this due to it generating this
failure frequently.
Briefly, the problems fixed are:
- If all session slots were in use at the time of the failure, some processes
would continue to loop waiting for a slot on the old session forever.
- If an RPC that doesn't use open/lock state failed with NFS4ERR_BAD_SESSION,
it would fail the RPC/syscall instead of initiating recovery and then
looping to retry the RPC.
- If a successful reply to an RPC for an old session wasn't processed
until after a new session was created for a NFS4ERR_BAD_SESSION error,
it would erroneously update the new session and corrupt it.
- The use of the first element of the session list in the nfs mount
structure (which is always the current metadata session) was slightly
racey. With changes for the above problems it became more racey, so all
uses of this head pointer was wrapped with a NFSLOCKMNT()/NFSUNLOCKMNT().
- Although the kernel malloc() usually allocates more bytes than requested
and, as such, this wouldn't have caused problems, the allocation of a
session structure was 1 byte smaller than it should have been.
(Null termination byte for the string not included in byte count.)
There are probably still problems with a pNFS data server that fails
with NFS4ERR_BAD_SESSION, but I have no server that does this to test
against (the AmazonEFS server doesn't do pNFS), so I can't fix these yet.
Although this patch is fairly large, it should only affect the handling
of NFS4ERR_BAD_SESSION error replies from an NFSv4.1 server.
Thanks go to cperciva@ for the extension testing he did to help isolate/fix
these problems.
Reported by: cperciva
Tested by: cperciva
MFC after: 3 months
Differential Revision: https://reviews.freebsd.org/D8745
the vnode is inactivated. This contradicts with the nullfs caching
which keeps upper vnode around, as consequence keeping the use
reference to lower vnode.
Add a filesystem flag to request nullfs to not cache when mounted over
that filesystem, and set the flag for nfs v4 mounts.
Reported by: asomers
Reviewed by: rmacklem
Tested by: asomers, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The pager, due to its construction, implements clustering for the
page-ins. In particular, buildworld load demonstrates reduction of
the READ RPCs from 39k down to 24k. No change in real or CPU time was
observed.
Discussed with, and measured by: bde
No objections from: rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week