system/freebsd-src

mirror of https://github.com/freebsd/freebsd-src synced 2024-10-15 04:43:53 +00:00

Author	SHA1	Message	Date
Alexander V. Chernikov	9d5df78e64	Fix NOINET6 build broken by r361575. Reported by: ci, hps	2020-05-28 09:52:28 +00:00
Alexander V. Chernikov	c74ce5cca3	Make NFS address selection use fib4_lookup(). fib4_lookup_nh_ represents pre-epoch generation of fib api, providing less guarantees over pointer validness and requiring on-stack data copying. Switch call to use new fib4_lookup(), allowing to eventually deprecate old api. Differential Revision: https://reviews.freebsd.org/D24977	2020-05-28 07:35:07 +00:00
Alexander V. Chernikov	2bbab0af6d	Use epoch(9) for rtentries to simplify control plane operations. Currently the only reason of refcounting rtentries is the need to report the rtable operation details immediately after the execution. Delaying rtentry reclamation allows to stop refcounting and simplify the code. Additionally, this change allows to reimplement rib_lookup_info(), which is used by some of the customers to get the matching prefix along with nexthops, in more efficient way. The change keeps per-vnet rtzone uma zone. It adds nh_vnet field to nhop_priv to be able to reliably set curvnet even during vnet teardown. Rest of the reference counting code will be removed in the D24867 . Differential Revision: https://reviews.freebsd.org/D24866	2020-05-23 10:21:02 +00:00
Ryan Moeller	b9cc3262bc	nfs: Remove APPLESTATIC macro It is no longer useful. Reviewed by: rmacklem Approved by: mav (mentor) MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D24811	2020-05-12 13:23:25 +00:00
Ryan Moeller	32033b3d30	Remove APPLEKEXT ifndefs They are no longer useful. Reviewed by: rmacklem Approved by: mav (mentor) MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D24752	2020-05-08 14:39:38 +00:00
Rick Macklem	5ecf33c6c4	Get rid of uio_XXX macros used for the Mac OS/X port. The NFS code had a bunch of Mac OS/X accessor functions named uio_XXX left over from the port to Mac OS/X. Since that port is long forgotten, replace the calls with the code generated by the FreeBSD macros for these in nfskpiport.h. This allows the macros to be deleted from nfskpiport.h and I think makes the code more readable. This patch should not result in any semantic change.	2020-04-28 02:11:02 +00:00
Rick Macklem	e4a458bb1b	Remove Mac OS/X macros that did nothing for FreeBSD. The macros CAST_USER_ADDR_T() and CAST_DOWN() were used for the Mac OS/X port. The first of these macros was a no-op for FreeBSD and the second is no longer used. This patch gets rid of them. It also deletes the "mbuf_t" typedef which is no longer used in the FreeBSD code from nfskpiport.h This patch should not change semantics.	2020-04-25 02:18:59 +00:00
Rick Macklem	897d7d45ba	Make the NFSv4.n client's recovery from NFSERR_BADSESSION RFC5661 conformant. RFC5661 specifies that a client's recovery upon receipt of NFSERR_BADSESSION should first consist of a CreateSession operation using the extant ClientID. If that fails, then a full recovery beginning with the ExchangeID operation is to be done. Without this patch, the FreeBSD client did not attempt the CreateSession operation with the extant ClientID and went directly to a full recovery beginning with ExchangeID. I have had this patch several years, but since no extant NFSv4.n server required the CreateSession with extant ClientID, I have never committed it. I an committing it now, since I suspect some future NFSv4.n server will require this and it should not negatively impact recovery for extant NFSv4.n servers, since they should all return NFSERR_STATECLIENTID for this first CreateSession. The patched client has been tested for recovery against both the FreeBSD and Linux NFSv4.n servers and no problems have been observed. MFC after: 1 month	2020-04-22 21:00:14 +00:00
Rick Macklem	0bda1ddd33	Fix the NFSv4.2 extended attribute support for remove extended attrbute. I missed the "atomic" field of the RemoveExtendedAttribute operation's reply when I implemented it. It worked between FreeBSD client and server, since it was missed for both, but it did not conform to RFC 8276. This patch adds the field for both client and server. Thanks go to Frank for doing interoperability testing of the extended attribute support against patches for Linux. Submitted by: Frank van der Linden <fllinden@amazon.com> Reported by: Frank van der Linden <fllinden@amazon.com>	2020-04-15 21:27:52 +00:00
Rick Macklem	fb8ed4c5f8	Fix the NFSv2 extended attribute support to handle 0 length attributes. I did not realize that zero length attributes are allowed, but they are. This patch fixes the NFSv4.2 client and server to handle zero length extended attributes correctly. Submitted by: Frank van der Linden <fllinden@amazon.com> (earlier version) Reported by: Frank van der Linden <fllinder@amazon.com>	2020-04-14 22:57:21 +00:00
Rick Macklem	e3e7c612f3	Replace mbuf macros with the code they would generate in the NFS code. When the code was ported to Mac OS/X, mbuf handling functions were converted to using the Mac OS/X accessor functions. For FreeBSD, they are a simple set of macros in sys/fs/nfs/nfskpiport.h. Since porting to Mac OS/X is no longer a consideration, replacement of these macros with the code generated by them makes the code more readable. When support for external page mbufs is added as needed by the KERN_TLS, the patch becomes simpler if done without the macros. This patch should not result in any semantic change. This is the final patch of this series and the macros should now be able to be deleted from the .h files in a future commit.	2020-04-11 23:37:58 +00:00
Rick Macklem	3133bbf7a4	Replace mbuf macros with the code they would generate in the NFS code. When the code was ported to Mac OS/X, mbuf handling functions were converted to using the Mac OS/X accessor functions. For FreeBSD, they are a simple set of macros in sys/fs/nfs/nfskpiport.h. Since porting to Mac OS/X is no longer a consideration, replacement of these macros with the code generated by them makes the code more readable. When support for external page mbufs is added as needed by the KERN_TLS, the patch becomes simpler if done without the macros. This patch should not result in any semantic change. This conversion will be committed one file at a time.	2020-04-10 22:42:14 +00:00
Rick Macklem	8de97f394e	Remove the old NFS lock device driver that uses Giant. This NFS lock device driver was replaced by the kernel NLM around FreeBSD7 and has not normally been used since then. To use it, the kernel had to be built without "options NFSLOCKD" and the nfslockd.ko had to be deleted as well. Since it uses Giant and is no longer used, this patch removes it. With this device driver removed, there is now a lot of unused code in the userland rpc.lockd. That will be removed on a future commit. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D22933	2020-04-09 14:44:46 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Konstantin Belousov	0ff51c98d1	Fix NFS client deadlock when read reports truncated node. If node attribute returned in the reply for read rpc indicate truncation, and it happens that the vnode is exclusively locked, update of the node attributes would try to shrink vnode size. Since during the read some vnode pages were busied by the reading thread, vnode_pager_setsize() deadlocks waiting for the busy state owned by the caller. Use a thread-local flag to indicate that NFS read owns some (s)busy pages states and postpone the call to vnode_pager_setsize() until the thread relinguishes the ownership. Diagnosed by: rlibby Tested by: pho, rlibby Sponsored by: The FreeBSD Foundation MFC after: 1 week	2020-02-22 20:50:30 +00:00
Kyle Evans	6a5abb1ee5	Provide O_SEARCH O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping permissions checks on the directory itself after the initial open(). This is close to the semantics we've historically applied for O_EXEC on a directory, which is UB according to POSIX. Conveniently, O_SEARCH on a file is also explicitly undefined behavior according to POSIX, so O_EXEC would be a fine choice. The spec goes on to state that O_SEARCH and O_EXEC need not be distinct values, but they're not defined to be the same value. This was pointed out as an incompatibility with other systems that had made its way into libarchive, which had assumed that O_EXEC was an alias for O_SEARCH. This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a directory is checked in vn_open_vnode already, so for completeness we add a NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not re-check that when descending in namei. [0] https://pubs.opengroup.org/onlinepubs/9699919799/ Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23247	2020-02-02 16:34:57 +00:00
Mateusz Guzik	b249ce48ea	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427	2020-01-03 22:29:58 +00:00
Rick Macklem	05dcd5d2c8	Fix nfsmount() so that it will return NFSERR_MINORVERMISMATCH. If nfsrpc_getdirpath() returns NFSERR_MINORVERMISMATCH, it would erroneously get mapped to EIO. This was not particularily harmful, but would make it hard for sysadmins to diagnose why an NFSv4 mount is failing. mount_nfs.c still needs to be fixed so that it does not report NFSERR_MINORVERMISMATCH as an unknown error 10021. MFC after: 1 week	2019-12-25 01:15:38 +00:00
Mateusz Guzik	6fa079fc3f	vfs: flatten vop vectors This eliminates the following loop from all VOP calls: while(vop != NULL && \ vop->vop_spare2 == NULL && vop->vop_bypass == NULL) vop = vop->vop_default; Reviewed by: jeff Tesetd by: pho Differential Revision: https://reviews.freebsd.org/D22738	2019-12-16 00:06:22 +00:00
Rick Macklem	f808cf7294	Silence some "might not be initialized" warnings for riscv64. None of these case were actually using the variable(s) uninitialized, but I figured that silencing the warnings via initializing them made sense. Some of these predated r355677.	2019-12-13 21:38:08 +00:00
Rick Macklem	bf6ac05aa3	Add some more initializations to quiet riscv build. The one case in nfs_copy_file_range() was a legitimate case, although it would probably never occur in practice.	2019-12-13 01:34:25 +00:00
Rick Macklem	c057a37818	Add support for NFSv4.2 to the NFS client and server. This patch adds support for NFSv4.2 (RFC-7862) and Extended Attributes (RFC-8276) to the NFS client and server. NFSv4.2 is comprised of several optional features that can be supported in addition to NFSv4.1. This patch adds the following optional features: - posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED) - posix_fallocate() - intra server file range copying via the copy_file_range(2) syscall --> Avoiding data tranfer over the wire to/from the NFS client. - lseek(SEEK_DATA/SEEK_HOLE) - Extended attribute syscalls for "user" namespace attributes as defined by RFC-8276. Although this patch is fairly large, it should not affect support for the other versions of NFS. However it does add two new sysctls that allow a sysadmin to limit which minor versions of NFSv4 a server supports, allowing a sysadmin to disable NFSv4.2. Unfortunately, when the NFS stats structure was last revised, it was assumed that there would be no additional operations added beyond what was specified in RFC-7862. However RFC-8276 did add additional operations, forcing the NFS stats structure to revised again. It now has extra unused entries in all arrays, so that future extensions to NFSv4.2 can be accomodated without revising this structure again. A future commit will update nfsstat(1) to report counts for the new NFSv4.2 specific operations/procedures. This patch affects the internal interface between the nfscommon, nfscl and nfsd modules and, as such, they all must be upgraded simultaneously. I will do a version bump (although arguably not needed), due to this. This code has survived a "make universe" but has not been built with a recent GCC. If you encounter build problems, please email me. Relnotes: yes	2019-12-12 23:22:55 +00:00
Mateusz Guzik	abd80ddb94	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715	2019-12-08 21:30:04 +00:00
Rick Macklem	a95cd06e9a	Delete an unused external declaration. Since nfsv4_opflag is no longer used in nfs_clcomsubs.c, delete the external declaration of it. Found during NFSv4.2 code merge. MFC after: 2 weeks	2019-12-08 16:59:36 +00:00
Konstantin Belousov	9698d99230	In nfs_lock(), recheck vp->v_data after lock before accessing it. We might race with reclaim, and then this is no longer a nfs vnode, in which case we do not need to handle deferred vnode_pager_setsize() either. Reported by: rk@ronald.org PR: 242184 Sponsored by: The FreeBSD Foundation MFC after: 3 days	2019-11-29 13:55:56 +00:00
Jeff Roberson	67d0e29304	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116	2019-10-29 21:06:34 +00:00
Konstantin Belousov	c6ba06d86c	Fix interface between nfsclient and vnode pager. Make the nfsclient always call vnode_pager_setsize() with the vnode exclusively locked. This ensures that page fault always can find the backing page if the object size check succeeded. Set VV_VMSIZEVNLOCK flag on NFS nodes. The main offender breaking the interface in nfsclient is nfs_loadattrcache(), which is used whenever server responded with updated attributes, which can happen on non-changing operations as well. Also, iod threads only have buffers locked (and even that is LK_KERNPROC), but they still may call nfs_loadattrcache() on RPC response. Instead of immediately calling vnode_pager_setsize() if server response indicated changed file size, but the vnode is not exclusively locked, set a new node flag NVNSETSZSKIP. When the vnode exclusively locked, or when we can temporary upgrade the lock to exclusive, call vnode_pager_setsize(), by providing the nfsclient VOP_LOCK() implementation. Tested by: pho Discussed with: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D21883	2019-10-22 16:17:38 +00:00
Jeff Roberson	0012f373e4	(4/6) Protect page valid with the busy lock. Atomics are used for page busy and valid state when the shared busy is held. The details of the locking protocol and valid and dirty synchronization are in the updated vm_page.h comments. Reviewed by: kib, markj Tested by: pho Sponsored by: Netflix, Intel Differential Revision: https://reviews.freebsd.org/D21594	2019-10-15 03:45:41 +00:00
Mateusz Guzik	d511f93e45	nfsclient: add root vnode caching See r353150. Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21646	2019-10-06 22:17:29 +00:00
Rick Macklem	ee7201a725	Replace all mtx_assert() calls for n_mtx and ncl_iod_mutex with macros. To be consistent with replacing the mtx_lock()/mtx_unlock() calls on the NFS node mutex (n_mtx) and ncl_iod_mutex, this patch replaces all mtx_assert() calls on these mutexes with macros as well. This will simplify changing these locks to sx locks in a future commit. However, this change may be delayed indefinitely, since it appears there is a deadlock when vnode_pager_setsize() is called to shrink the size and the NFS node lock is held. There is no semantic change as a result of this commit. Suggested by: kib MFC after: 1 week	2019-09-26 02:54:45 +00:00
Rick Macklem	b662b41e62	Replace all mtx_lock()/mtx_unlock() on the iod lock with macros. Since the NFS node mutex needs to change to an sx lock so it can be held when vnode_pager_setsize() is called and the iod lock is held when the NFS node lock is acquired, the iod mutex will need to be changed to an sx lock as well. To simply the future commit that changes both the NFS node lock and iod lock to sx locks, this commit replaces all mtx_lock()/mtx_unlock() calls on the iod lock with macros. There is no semantic change as a result of this commit. I don't know when the future commit will happen and be MFC'd, so I have set the MFC on this commit to one week so that it can be MFC'd at the same time. Suggested by: kib MFC after: 1 week	2019-09-24 23:38:10 +00:00
Rick Macklem	5d85e12f44	Replace all mtx_lock()/mtx_unlock() on n_mtx with the macros. For a long time, some places in the NFS code have locked/unlocked the NFS node lock with the macros NFSLOCKNODE()/NFSUNLOCKNODE() whereas others have simply used mtx_lock()/mtx_unlock(). Since the NFS node mutex needs to change to an sx lock so it can be held when vnode_pager_setsize() is called, replace all occurrences of mtx_lock/mtx_unlock with the macros to simply making the change to an sx lock in future commit. There is no semantic change as a result of this commit. I am not sure if the change to an sx lock will be MFC'd soon, so I put an MFC of 1 week on this commit so that it could be MFC'd with that commit. Suggested by: kib MFC after: 1 week	2019-09-24 01:58:54 +00:00
Konstantin Belousov	6fd583583b	Further refine r352393, only call vnode_pager_setsize() outside the node lock when shrinking. This is similar to r252528, applied to the above commit. Apparently there is a race which makes necessary at least to keep the n_size and pager size consistent when extending. Current suspect is that iod threads perform vnode_pager_setsize() without taking the vnode lock, which corrupts the file content. Reported and tested by: Masachika ISHIZUKA <ish@amail.plala.or.jp> Discussed with: rmacklem (related issues) Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-17 18:41:39 +00:00
Konstantin Belousov	1246ee664b	nfscl_loadattrcache: fix rest of the cases to not call vnode_pager_setsize() under the node mutex. r248567 moved some calls of vnode_pager_setsize() after the node lock is unlocked, do the rest now. Reported and tested by: peterj Sponsored by: The FreeBSD Foundation MFC after: 1 week	2019-09-16 13:26:27 +00:00
Conrad Meyer	a6935d085c	Remove long-dead BUF_ASSERT_{,UN}HELD assertions These were fully neutered in r177676 (2008), but not removed at the time for unclear reasons. They're totally dead code, so go ahead and yank them now. No functional change.	2019-09-05 21:43:33 +00:00
Konstantin Belousov	6470c8d3db	Rework v_object lifecycle for vnodes. Current implementation of vnode_create_vobject() and vnode_destroy_vobject() is written so that it prepared to handle the vm object destruction for live vnode. Practically, no filesystems use this, except for some remnants that were present in UFS till today. One of the consequences of that model is that each filesystem must call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result all of them get rid of the v_object in reclaim. Move the call to vnode_destroy_vobject() to vgonel() before VOP_RECLAIM(). This makes v_object stable: either the object is NULL, or it is valid vm object till the vnode reclamation. Remove code from vnode_create_vobject() to handle races with the parallel destruction. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412	2019-08-29 07:50:25 +00:00
Rick Macklem	6aab442af9	Get rid of extraneous initialization. Get rid of an extraneous initialization, mainly to keep a static analyser happy. No semantic change. PR: 238167 Submitted by: Alexey Dokuchaev	2019-05-31 03:13:09 +00:00
Rick Macklem	26fd36b29d	Clean up silly code case. This silly code segment has existed in the sources since it was brought into FreeBSD 10 years ago. I honestly have no idea why this was done. It was possible that I thought that it might have been better to not set B_ASYNC for the "else" case, but I can't remember. Anyhow, this patch gets rid of the if/else that does the same thing either way, since it looks silly and upsets a static analyser. This will have no semantic effect on the NFS client. PR: 238167	2019-05-31 00:56:31 +00:00
Alan Somers	65417f5e27	Remove "struct ucred" argument from vtruncbuf vtruncbuf takes a "struct ucred" argument. AFAICT, it's been unused ever since that function was first added in r34611. Remove it. Also, remove some "struct ucred" arguments from fuse and nfs functions that were only used by vtruncbuf. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20377	2019-05-24 20:27:50 +00:00
Konstantin Belousov	391918a3c1	Do not flush NFS node from NFS VOP_SET_TEXT(). The more appropriate place to do the flushing is VOP_OPEN(). This was uncovered because VOP_SET_TEXT() is now called with the vnode' vm_object rlocked, which is incompatible with the flush operations. After the move, there is no need for NFS-specific VOP_SET_TEXT overload. Sponsored by: The FreeBSD Foundation MFC after: 30 days	2019-05-06 08:49:43 +00:00
Konstantin Belousov	78022527bb	Switch to use shared vnode locks for text files during image activation. kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition. The change removes VV_TEXT, replacing it with the condition v_writecount <= -1, and puts v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference when the segment is unmapped, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access. Atomic check for writeability and increment of v_writecount is performed by the VOP. vn_truncate() now increments v_writecount around VOP_SETATTR() call, lack of which is arguably a bug on its own. nullfs bypasses v_writecount to the lower vnode always, so nullfs vnode has its own v_writecount correct, and lower vnode gets all references, since object->handle is always lower vnode. On the text vnode' vm object dealloc, the v_writecount value is reset to zero, and deadfs vop_unset_text short-circuit the operation. Reclamation of lowervp always reclaims all nullfs vnodes referencing lowervp first, so no stray references are left. Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D19923	2019-05-05 11:20:43 +00:00
Rick Macklem	eeb1f3ed51	Fix the NFSv4 client to safely find processes. r340744 broke the NFSv4 client, because it replaced pfind_locked() with a call to pfind(), since pfind() acquires the sx lock for the pid hash and the NFSv4 already holds a mutex when it does the call. The patch fixes the problem by recreating a pfind_any_locked() and adding the functions pidhash_slockall() and pidhash_sunlockall to acquire/release all of the pid hash locks. These functions are then used by the NFSv4 client instead of acquiring the allproc_lock and calling pfind(). Reviewed by: kib, mjg MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D19887	2019-04-15 01:27:15 +00:00
Rodney W. Grimes	6c1c6ae537	Use IN_foo() macros from sys/netinet/in.h inplace of handcrafted code There are a few places that use hand crafted versions of the macros from sys/netinet/in.h making it difficult to actually alter the values in use by these macros. Correct that by replacing handcrafted code with proper macro usage. Reviewed by: karels, kristof Approved by: bde (mentor) MFC after: 3 weeks Sponsored by: John Gilmore Differential Revision: https://reviews.freebsd.org/D19317	2019-04-04 19:01:13 +00:00
Edward Tomasz Napierala	2df8bd90c8	Drop unused 'p' argument to nfsv4_strtogid(). MFC after: 2 weeks Sponsored by: DARPA, AFRL	2019-03-12 15:07:47 +00:00
Edward Tomasz Napierala	0658ac3943	Drop unused 'p' argument to nfsv4_strtouid(). MFC after: 2 weeks Sponsored by: DARPA, AFRL	2019-03-12 15:02:52 +00:00
Simon J. Gerraty	f5fdf82d82	Add _PC_ACL_* to vop_stdpathconf This avoid EINVAL from tmpfs etc. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D19512	2019-03-11 20:40:56 +00:00
Edward Tomasz Napierala	c9172fb4f1	Work around the "nfscl: bad open cnt on server" assertion that can happen when rerooting into NFSv4 rootfs with kernel built with INVARIANTS. I've talked to rmacklem@ (back in 2017), and while the root cause is still unknown, the case guarded by assertion (nfscl_doclose() being called from VOP_INACTIVE) is believed to be safe, and the whole thing seems to run just fine. Obtained from: CheriBSD MFC after: 2 weeks Sponsored by: DARPA, AFRL	2019-02-19 12:45:37 +00:00
Gleb Smirnoff	756a541279	Allocate pager bufs from UMA instead of 80-ish mutex protected linked list. o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many pbufs are we going to have set. In various subsystems that are going to utilize pbufs create private zones via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(), and sets a limit on created zone. After startup preallocate pbufs according to requirements of all pbuf zones. Subsystems that used to have a private limit with old allocator now have private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS, swap, vnode pager. The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9), aio(4). They should have their private limits, but changing that is out of scope of this commit. o Fetch tunable value of kern.nswbuf from init_param2() and while here move NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only this option. Default values aren't touched by this commit, but they probably should be reviewed wrt to modern hardware. This change removes a tight bottleneck from sendfile(2) operation, that uses pbufs in vnode pager. Other pagers also would benefit from faster allocation. Together with: gallatin Tested by: pho	2019-01-15 01:02:16 +00:00
Rick Macklem	f86bce1770	Make sure the NFS readdir client fills in all "struct dirent" data. The NFS client code (nfsrpc_readdir() and nfsrpc_readdirplus()) wasn't filling in parts of the readdir reply, such as d_pad[01] and the bytes at the end of d_name within d_reclen. As such, data left in a buffer cache block could be leaked to userland in the readdir reply. This patch makes sure all of the data is filled in. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib, markj MFC after: 2 weeks	2018-11-23 00:17:47 +00:00
Mateusz Guzik	53011553fa	proc: convert pfind & friends to use pidhash locks and other cleanup pfind_locked is retired as it relied on allproc which unnecessarily restricts locking of the hash. Sponsored by: The FreeBSD Foundation	2018-11-21 20:15:56 +00:00
Rick Macklem	6ad8a6eaa4	Change nfs_advlock() so that the NFSVOPUNLOCK() is mostly done at the end. Prior to this patch, nfs_advlock() did NFSVOPUNLOCK(); return (error); in many places. This patch replaces these code sequenences with a "goto out;" and does the NFSVOPUNLOCK(); return (error); at the end of the function in order to make the vnode locking simpler. This patch does not change the semantics of nfs_advlock(). Suggested by: kib Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D17853	2018-11-06 22:50:50 +00:00
Brooks Davis	318f0d7720	Use declared types for caddr_t arguments. Leave ptrace(2) alone for the moment as it's defined to take a caddr_t. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17852	2018-11-06 18:46:38 +00:00
Brooks Davis	1493c2ee62	Make vop_symlink take a const target path. This will enable callers to take const paths as part of syscall decleration improvements. Where doing so is easy and non-distruptive carry the const through implementations. In UFS the value is passed to an interface that must take non-const values. In ZFS, const poisoning would touch code shared with upstream and it's not worth adding diffs. Bump __FreeBSD_version for external API consumers. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17805	2018-11-02 14:42:36 +00:00
Rick Macklem	881a9516a2	Fix NFS client vnode locking to avoid a crash during forced dismount. A crash was reported where the crash occurred in nfs_advlock() when the NFS_ISV4(vp) macro was being executed. This was caused by the vnode being VI_DOOMED due to a forced dismount in progress. This patch fixes the problem by locking the vnode before executing the NFS_ISV4() macro. Tested by: rlibby PR: 232673 Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17757	2018-11-01 15:27:22 +00:00
Konstantin Belousov	8ff7fad1d7	Only call sigdeferstop() for NFS. Use bypass to catch any NFS VOP dispatch and route it through the wrapper which does sigdeferstop() and then dispatches original VOP. NFS does not need a bypass below it, which is not supported. The vop offset in the vop_vector is added since otherwise it is impossible to get vop_op_t from the internal table, and I did not wanted to create the layered fs only to wrap NFS VOPs. VFS_OP()s wrap is straightforward. Requested and reviewed by: mjg (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17658	2018-10-23 21:43:41 +00:00
Rick Macklem	ac0d649588	Silence newer gcc warnings. Newer versions of gcc generate "might not be initialized" warnings for several variables in nfsrpc_doiods(). I have checked and all of these variables are assigned values before they are used. In the one case of "tdrpc", it could have passed garbage as an argument to nfscl_dofflayoutio() when mirrorcnt is one. However nfscl_dofflayoutio() only uses the argument when mirrorcnt > 1, so it wasn't actually broken. This patch initializes "tdrpc" to avoid confusion and initializes the rest to make the compiler happy. Requested by: mmacy	2018-08-02 20:10:59 +00:00
Rick Macklem	743d528198	Silence newer gcc warnings. Newer versions of gcc generate "set, but not used" warnings. Add __unused macros to silence these warnings. Although the variables are not being used, they are values parsed from arguments to callback RPCs that might be needed in the future. Requested by: mmacy	2018-07-30 20:25:32 +00:00
Eitan Adler	33f4bccaa6	Use https over http for FreeBSD pages	2018-07-27 10:40:48 +00:00
Rick Macklem	5da3882447	Shut down the TCP connection to a DS in the pNFS client when Renew fails. When a NFSv4.1 client mount using pNFS detects a failure trying to do a Renew (actually just a Sequence operation), the code would simply try again and again and again every 30sec. This would tie up the "nfscl" thread, which should also be doing other things like Renews on other DSs and the MDS. This patch adds code which closes down the TCP connection and marks it defunct when Renew detects an failure to communicate with the DS, so further Renews will not be attempted until a new working TCP connection to the DS is established. It also makes the call to nfscl_cancelreqs() unconditional, since nfscl_cancelreqs() checks the NFSCLDS_SAMECONN flag and does so while holding the lock. This fix only applies to the NFSv4.1 client whne using pNFS and without it the only effect would have been an "nfscl" thread busy doing Renew attempts on an unresponsive DS. MFC after: 2 weeks	2018-07-15 18:54:44 +00:00
Rick Macklem	89c64a3a4f	Fix the pNFS client when mirrors aren't on the same machine. Without this patch, the client side NFSv4.1 pNFS code erroneously did writes and commits to both DS mirrors using the TCP connection of the first one. For my test setup this worked, since I have both DSs running on the same machine, but it would have failed when the DSs are on separate machines. This patch fixes the code to use the correct TCP connection for each DS. This patch should only affect the NFSv4.1 client when using "pnfs" mounts to mirrored DSs. MFC after: 2 weeks	2018-07-14 19:51:44 +00:00
Rick Macklem	0e7bd20bb2	Close down the TCP connection to a pNFS DS when it is disabled. So long as the TCP connection to a pNFS DS isn't shared with other DSs, it can be closed down when the DS is being disabled in the pNFS client. This causes any RPCs in progress to fail. This patch only affects the NFSv4.1 pNFS client when errors occur while doing I/O on a DS. MFC after: 2 weeks	2018-07-13 20:03:05 +00:00
Rick Macklem	83f526de6a	Change the pNFS client so that it does not report an NFSERR_STALE from an I/O attempt on a DS to the server via LayoutReturn. The current FreeBSD client can generate these errors for an operational DS while doing a recovery of a mirror after a mirrored DS has been repaired. I am not sure why these errors occur, but my best current guess is a race between the Layout Recall issued by the kernel code run from pnfsdscopymr(8) and a Read operation on the DS for the file bing copied. The errrors are not fatal, since the client falls back on doing I/O through the MDS, which can do the I/O successfully as a proxy. (The fact that the MDS can do this indicates that the file does still exist on the functioning DS.) This patch only affects behaviour of the pNFS client and only when using Flexible File layouts. MFC after: 2 weeks	2018-07-13 12:39:27 +00:00
Rick Macklem	a6fed5f514	Modify the NFSv4.1 pNFS client to use separate TCP connections for DSs. Without this patch, the NFSv4.1 pNFS client shared a single TCP connection for all DSs that resided on the same machine. This made disabling one of the DSs impossible. Although unlikely, it is possible that the storage subsystem has failed in such a way that the storage for one DS on a machine is no longer functioning correctly, but the storage used by another DS on the same machine is still ok. For this case, it would be nice if a system can fail one of the DSs without failing them all. This patch changes the default behaviour to use separate TCP connections for each DS even if they reside on the same machine. I do not believe that this will be a problem for extant pNFS servers, but a sysctl can be set to restore the old behaviour if this change causes a problem for an extant pNFS server. This patch only affects the NFSv4.1 pNFS client. MFC after: 2 weeks	2018-07-12 20:46:22 +00:00
Rick Macklem	2e35b8fe24	Change the NFSv4.1 pNFS client so that it returns the DS error in layoutreturn. When the NFSv4.1 pNFS client gets an error for a DS I/O operation using a Flexible File layout, it returns the layout with an error. This patch changes the code slightly, so that it returns the layout for all errors except EACCES and lets the MDS decide what to do based on the error. It also makes a couple of changes to nfscl_layoutrecall() to ensure that the first layoutreturn(s) will have the error in the reply. Plus, the patch adds a wakeup() so that the "nfscl" thread won't wait 1sec before doing the LayoutReturn. Tested against the pNFS service. This patch should not affect non-pNFS use of the client. The unused "dsp" argument will be used by a future patch that disables the connection to the DS when possible. MFC after: 2 weeks	2018-06-22 21:25:27 +00:00
Rick Macklem	2bad64241c	Make the pNFS NFSv4.1 client return a Flexible File layout upon error. The Flexible File layout LayoutReturn operation has argument fields where an I/O error encountered when attempting I/O on a DS can be reported back to the MDS. This patch adds code to the client to do this for the Flexible File layout mirrored case. This patch should only affect mounts using the "pnfs" option against servers that support the Flexible File layout. MFC after: 2 weeks	2018-06-17 16:30:06 +00:00
Rick Macklem	c338c94d20	Move four functions in nfscl.ko to nfscommon.ko. Four functions nfscl_reqstart(), nfscl_fillsattr(), nfsm_stateidtom() and nfsmnt_mdssession() are now called from within the nfsd. As such, they needed to be moved from nfscl.ko to nfscommon.ko so that nfsd.ko would load when nfscl.ko wasn't loaded. Reported by: herbert@gojira.at	2018-06-14 10:00:19 +00:00
Bruce Evans	ab35e1c71b	Fix the encoding of major and minor numbers in 64-bit dev_t by restoring the old encodings for the lower 16 and 32 bits and only using the higher 32 bits for unusually large major and minor numbers. This change breaks compatibility with the previous encoding (which was only used in -current). Fix truncation to (essentially) 16-bit dev_t in newnfs v3. Any encoding of device numbers gives an ABI, so it can't be changed without translations for compatibility. Extra bits give the much larger complication that the translations need to compress into fewer bits. Fortunately, more than 32 bits are rarely needed, so compression is rarely needed except for 16-bit linux dev_t where it was always needed but never done. The previous encoding moved the major number into the top 32 bits. Almost no translation code handled this, so the major number was blindly truncated away in most 32-bit encodings. E.g., for ffs, mknod(8) with major = 1 and minor = 2 gave dev_t = 0x10000002; ffs cannot represent this and blindly truncated it to 2. But if this mknod was run on any released version of FreeBSD, it gives dev_t = 0x102. ffs can represent this, but in the previous encoding it was not decoded, giving major = 0, minor = 0x102. The presence of bugs was most obvious for exporting dev_t's from an old system to -current, since bugs in newnfs augment them. I fixed oldnfs to support 32-bit dev_t in 1996 (r16634), but this regressed to 16-bit dev_t in newnfs, first to the old 16-bit encoding and then further in -current. E.g., old ad0 with major = 234, minor = 0x10002 had the correct (major, minor) number on the wire, but newnfs truncated this to (234, 2) and then the previous encoding shifted the major number into oblivion as seen by ffs or old applications. I first tried to fix this by translating on every ABI/API boundary, but there are too many boundaries and too many sloppy translations by blind truncation. So use the old encoding for the low 32 bits so that sloppy translations work no worse than before provided the high 32 bits are not set. Add some error checking for when bits are lost. Keep not doing any error checking for translations for almost everything in compat/linux. compat/freebsd32/freebsd32_misc.c: Optionally check for losing bits after possibly-truncating assignments as before. compat/linux/linux_stats.c: Depend on the representation being compatible with Linux's (or just with itself for local use) and spell some of the translations as assignments in a macro that hides the details. fs/nfsclient/nfs_clcomsubs.c: Essentially the same fix as in 1996, except there is now no possible truncation in makedev() itself. Also fix nearby style bugs. kern/vfs_syscalls.c: As for freebsd32. Also update the sysctl description to include file numbers, and change it to describe device ids as device numbers. sys/types.h: Use inline functions (wrapped by macros) since the expressions are now a bit too complicated for plain macros. Describe the encoding and some of the reasons for it. 16-bit compatibility didn't leave many reasonable choices for the 32-bit encoding, and 32-bit compatibility doesn't leave many reasonable choices for the 64-bit encoding. My choice is to put the 8 new minor bits in the low 8 bits of the top 32 bits. This minimizes discontiguities. Reviewed by: kib (except for rewrite of the comment in linux_stats.c)	2018-06-13 12:22:00 +00:00
Rick Macklem	90d2dfab19	Merge the pNFS server code from projects/pnfs-planb-server into head. This code merge adds a pNFS service to the NFSv4.1 server. Although it is a large commit it should not affect behaviour for a non-pNFS NFS server. Some documentation on how this works can be found at: http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt and will hopefully be turned into a proper document soon. This is a merge of the kernel code. Userland and man page changes will come soon, once the dust settles on this merge. It has passed a "make universe", so I hope it will not cause build problems. It also adds NFSv4.1 server support for the "current stateid". Here is a brief overview of the pNFS service: A pNFS service separates the Read/Write oeprations from all the other NFSv4.1 Metadata operations. It is hoped that this separation allows a pNFS service to be configured that exceeds the limits of a single NFS server for either storage capacity and/or I/O bandwidth. It is possible to configure mirroring within the data servers (DSs) so that the data storage file for an MDS file will be mirrored on two or more of the DSs. When this is used, failure of a DS will not stop the pNFS service and a failed DS can be recovered once repaired while the pNFS service continues to operate. Although two way mirroring would be the norm, it is possible to set a mirroring level of up to four or the number of DSs, whichever is less. The Metadata server will always be a single point of failure, just as a single NFS server is. A Plan B pNFS service consists of a single MetaData Server (MDS) and K Data Servers (DS), all of which are recent FreeBSD systems. Clients will mount the MDS as they would a single NFS server. When files are created, the MDS creates a file tree identical to what a single NFS server creates, except that all the regular (VREG) files will be empty. As such, if you look at the exported tree on the MDS directly on the MDS server (not via an NFS mount), the files will all be of size 0. Each of these files will also have two extended attributes in the system attribute name space: pnfsd.dsfile - This extended attrbute stores the information that the MDS needs to find the data storage file(s) on DS(s) for this file. pnfsd.dsattr - This extended attribute stores the Size, AccessTime, ModifyTime and Change attributes for the file, so that the MDS doesn't need to acquire the attributes from the DS for every Getattr operation. For each regular (VREG) file, the MDS creates a data storage file on one (or more if mirroring is enabled) of the DSs in one of the "dsNN" subdirectories. The name of this file is the file handle of the file on the MDS in hexadecimal so that the name is unique. The DSs use subdirectories named "ds0" to "dsN" so that no one directory gets too large. The value of "N" is set via the sysctl vfs.nfsd.dsdirsize on the MDS, with the default being 20. For production servers that will store a lot of files, this value should probably be much larger. It can be increased when the "nfsd" daemon is not running on the MDS, once the "dsK" directories are created. For pNFS aware NFSv4.1 clients, the FreeBSD server will return two pieces of information to the client that allows it to do I/O directly to the DS. DeviceInfo - This is relatively static information that defines what a DS is. The critical bits of information returned by the FreeBSD server is the IP address of the DS and, for the Flexible File layout, that NFSv4.1 is to be used and that it is "tightly coupled". There is a "deviceid" which identifies the DeviceInfo. Layout - This is per file and can be recalled by the server when it is no longer valid. For the FreeBSD server, there is support for two types of layout, call File and Flexible File layout. Both allow the client to do I/O on the DS via NFSv4.1 I/O operations. The Flexible File layout is a more recent variant that allows specification of mirrors, where the client is expected to do writes to all mirrors to maintain them in a consistent state. The Flexible File layout also allows the client to report I/O errors for a DS back to the MDS. The Flexible File layout supports two variants referred to as "tightly coupled" vs "loosely coupled". The FreeBSD server always uses the "tightly coupled" variant where the client uses the same credentials to do I/O on the DS as it would on the MDS. For the "loosely coupled" variant, the layout specifies a synthetic user/group that the client uses to do I/O on the DS. The FreeBSD server does not do striping and always returns layouts for the entire file. The critical information in a layout is Read vs Read/Writea and DeviceID(s) that identify which DS(s) the data is stored on. At this time, the MDS generates File Layout layouts to NFSv4.1 clients that know how to do pNFS for the non-mirrored DS case unless the sysctl vfs.nfsd.default_flexfile is set non-zero, in which case Flexible File layouts are generated. The mirrored DS configuration always generates Flexible File layouts. For NFS clients that do not support NFSv4.1 pNFS, all I/O operations are done against the MDS which acts as a proxy for the appropriate DS(s). When the MDS receives an I/O RPC, it will do the RPC on the DS as a proxy. If the DS is on the same machine, the MDS/DS will do the RPC on the DS as a proxy and so on, until the machine runs out of some resource, such as session slots or mbufs. As such, DSs must be separate systems from the MDS. Tested by: james.rose@framestore.com Relnotes: yes	2018-06-12 19:36:32 +00:00
Rick Macklem	73b1879c2d	Add a couple of safety belt checks to the NFSv4.1 client related to sessions. There were a couple of cases in newnfs_request() that it assumed that it was an NFSv4.1 mount with a session. This should always be the case when a Sequence operation is in the reply or the server replies NFSERR_BADSESSION. However, if a server was broken and sent an erroneous reply, these safety belt checks should avoid trouble. The one check required a small tweak to nfsmnt_mdssession() so that it returns NULL when there is no session instead of the offset of the field in the structure (0x8 for i386). This patch should have no effect on normal operation of the client. Found by inspection during pNFS server development. MFC after: 2 weeks	2018-06-11 19:00:07 +00:00
Rick Macklem	8097753476	Add checks for the Flexible File layout to LayoutRecall callbacks. The Flexible File layout case wasn't handled by LayoutRecall callbacks because it just checked for File layout and returned NFSERR_NOMATCHLAYOUT otherwise. This patch adds the Flexible File layout handling. Found during testing of the pNFS server. MFC after: 2 weeks	2018-06-10 19:03:21 +00:00
Rick Macklem	dec8894b45	Fix the default number of threads for Flex File layout pNFS client I/O. The intent was that the default would be based on number of CPUs, but the code disabled using taskqueue() by default. This code is only executed when mounting a NFSv4.1 server that supports the Flexible File layout for pNFS and, since such servers are rare, this change shouldn't result in a POLA violation. (The FreeBSD pNFS server is still a project and the only other one that uses Flexible File layout is being developed by Primary Data and I don't know if they have even shipped any to customers yet.) Found while testing the pNFS server.	2018-06-02 00:11:26 +00:00
Rick Macklem	260785fe60	Fix the sleep event for layout recall. The sleep for I/O completion during an NFSv4.1 pNFS layout recall used the wrong event value and could result in the "[nfscl]" thread hung for the mount. This patch fixes the event to be the correct. This bug will only affect NFSv4.1 pnfs mounts and only when the server does a layout recall callback, so it won't affect many. Without the patch, a mount without the "pnfs" option will avoid the problem. Found during testing of the pNFS server. MFC after: 1 week	2018-05-26 23:02:15 +00:00
Matt Macy	b7faa59dee	nfsclient: warnings cleanups	2018-05-20 06:14:12 +00:00
Rick Macklem	cb0d9834d4	Fix use of pointer after being set NULL. Using a pointer after setting it NULL is probably not a good plan. Spotted by inspection during changes for Flexible File Layout Ioerr handling. This code path obviously isn't normally executed. MFC after: 1 week	2018-04-20 11:38:29 +00:00
Conrad Meyer	222daa421f	style: Remove remaining deprecated MALLOC/FREE macros Mechanically replace uses of MALLOC/FREE with appropriate invocations of malloc(9) / free(9) (a series of sed expressions). Something like: * MALLOC(a, b, ... -> a = malloc(... * FREE( -> free( * free((caddr_t) -> free( No functional change. For now, punt on modifying contrib ipfilter code, leaving a definition of the macro in its KMALLOC(). Reported by: jhb Reviewed by: cy, imp, markj, rmacklem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14035	2018-01-25 22:25:13 +00:00
Pedro F. Giffuni	d821d36419	Unsign some values related to allocation. When allocating memory through malloc(9), we always expect the amount of memory requested to be unsigned as a negative value would either stand for an error or an overflow. Unsign some values, found when considering the use of mallocarray(9), to avoid unnecessary casting. Also consider that indexes should be of at least the same size/type as the upper limit they pretend to index. MFC after: 3 weeks	2018-01-22 02:08:10 +00:00
Pedro F. Giffuni	ac2fffa4b7	Revert r327828, r327949, r327953, r328016-r328026, r328041: Uses of mallocarray(9). The use of mallocarray(9) has rocketed the required swap to build FreeBSD. This is likely caused by the allocation size attributes which put extra pressure on the compiler. Given that most of these checks are superfluous we have to choose better where to use mallocarray(9). We still have more uses of mallocarray(9) but hopefully this is enough to bring swap usage to a reasonable level. Reported by: wosch PR: 225197	2018-01-21 15:42:36 +00:00
Pedro F. Giffuni	d48d1a6464	nfsclient: make some use of mallocarray(9). Focus on code where we are doing multiplications within malloc(9). None of these ire likely to overflow, however the change is still useful as some static checkers can benefit from the allocation attributes we use for mallocarray. This initial sweep only covers malloc(9) calls with M_NOWAIT. No good reason but I started doing the changes before r327796 and at that time it was convenient to make sure the sorrounding code could handle NULL values. X-Differential revision: https://reviews.freebsd.org/D13837	2018-01-15 21:14:56 +00:00
Eitan Adler	caa7e52f3f	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno	2017-12-27 03:23:21 +00:00
Alexander Kabaev	151ba7933a	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385	2017-12-25 04:48:39 +00:00
Konstantin Belousov	95c8838c2a	Fix build for LP64 arches with gcc. gcc complaints that the comparision is always false due to the value range, and the cast does not prevent the analysis. Split the LP64 vs. ILP32 clamping as a workaround. Sponsored by: The FreeBSD Foundation	2017-12-21 23:08:10 +00:00
John Baldwin	b501cc5da6	Rework pathconf handling for FIFOs. On the one hand, FIFOs should respect other variables not supported by the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.). These values are fs-specific and must come from a fs-specific method. On the other hand, filesystems that support FIFOs are required to support _PC_PIPE_BUF on directory vnodes that can contain FIFOs. Given this latter requirement, once the fs-specific VOP_PATHCONF method supports _PC_PIPE_BUF for directories, it is also suitable for FIFOs permitting a single VOP_PATHCONF method to be used for both FIFOs and non-FIFOs. To that end, retire all of the FIFO-specific pathconf methods from filesystems and change FIFO-specific vnode operation switches to use the existing fs-specific VOP_PATHCONF method. For fifofs, set it's VOP_PATHCONF to VOP_PANIC since it should no longer be used. While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that only filesystems supporting FIFOs will report a value. In addition, only report a valid _PC_PIPE_BUF for directories and FIFOs. Discussed with: bde Reviewed by: kib (part of a larger patch) MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D12572	2017-12-19 22:39:05 +00:00
John Baldwin	a0a073b16d	Update NFS to handle larger link counts post ino64. - Define a NFS_LINK_MAX as UINT32_MAX to match the wire protocol. - Use NFS_LINK_MAX instead of LINK_MAX as the fallback value reported for a PATHCONF RPC by the NFS server. - Use NFS_LINK_MAX instead of LINK_MAX as the default value reported by the NFS client pathconf() if not overridden by the NFS server. - When reading the link count out of an RPC reply, read the full 32 bits instead of the lower 16 bits. Reviewed by: rmacklem (earlier version) Sponsored by: Chelsio Communications	2017-12-19 19:18:48 +00:00
Pedro F. Giffuni	d63027b668	sys/fs: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts.	2017-11-27 15:15:37 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Xin LI	11d38f2ca0	Remove unused header.	2017-11-19 03:52:03 +00:00
Rick Macklem	f49c813c1d	Use taskqueue(9) to do writes/commits to mirrored DSs concurrently. When the NFSv4.1 pNFS client is using a Flexible File Layout specifying mirrored Data Servers, it must do the writes and commits to all mirrors. This patch modifies the client to use a taskqueue to perform these writes and commits concurrently. The number of threads can't be changed for taskqueue(9), so it is set to 4 * mp_ncpus by default, but this can be overridden by setting the sysctl vfs.nfs.pnfsiothreads. Differential Revision: https://reviews.freebsd.org/D12632	2017-10-16 23:28:12 +00:00
Rick Macklem	b949cc41d1	Add Flex File Layout support to the NFSv4.1 pNFS client. This patch adds support for the Flexible File Layout to the pNFS client. Although the patch is rather large, it should only affect NFS mounts using the "pnfs" option against pNFS servers that do not support File Layout. There are still a couple of things missing from the Flexible File Layout client implementation: - The code does not yet do a LayoutReturn with I/O error stats when I/O error(s) occur when attempting to do I/O on a DS. This will be fixed in a future commit, since it is important for the MDS to know that I/O on a DS is failing. - The current code does writes and commits to mirror DSs serially. Making them happen concurrently will be done in a future commit, after discussion on freebsd-current@ on the best way to do this. - The code does not handle NFSv4.0 DSs. Since there is no extant pNFS server that implements NFSv4.0 DSs and NFSv4.1 DSs makes more sense now, I don't intend to implement this until there is a need for it. There is support for NFSv4.1 and NFSv3 DSs.	2017-10-05 20:10:40 +00:00
Rick Macklem	be3d32ad6e	Change nfsv4_getipaddr() and nfsrpc_fillsa() to not use sockaddr_storage. This patch changes nfsv4_getipaddr() and nfsrpc_fillsa() to use a sockaddr_in * and sockaddr_in6 * instead of sockaddr_storage, to avoid allocating the latter on the stack. It also moves the nfsrpc_fillsa() call to after the completion of parsing of the DeviceInfo reply from the server. This patch is in preparation for addition of Flex File Layout support in a future commit. It only affects the "pnfs" NFSv4.1 client mount option and should not have changed its semantics.	2017-09-28 22:33:01 +00:00
Rick Macklem	bd290946e9	Fix a memory leak that occurred in the pNFS client. When a "pnfs" NFSv4.1 mount was unmounted, it didn't free up the layouts and deviceinfo structures. This leak only affects "pnfs" mounts and only when the mount is umounted. Found while testing the pNFS Flexible File layout client code. MFC after: 2 weeks	2017-09-27 23:23:41 +00:00
Mark Johnston	47f11baaca	Use C99 initializers for DTrace provider methods. This makes the definitions easier to read and more cscope-friendly. MFC after: 1 week	2017-09-27 17:46:38 +00:00
Rick Macklem	a8462c582c	Add major and minor version arguments to nfscl_reqstart(). This patch adds "vers" and "minorvers" arguments to nfscl_reqstart(). The patch always passes them in as "0" and that implies no change in semantics. These arguments will be used by a future commit that adds support for the Flexible File Layout.	2017-09-26 23:42:44 +00:00
Rick Macklem	c36e087097	Remove 0 filling from nfsm_uiombuflist(). nfsm_uiombuflist() zero filled the mbuf list to a multiple of 4bytes as required for XDR. Unfortunately that modified an mbuf list after it was m_copym()'d and was broken. This patch removes the zero filling code. Since nfsm_uiombuflist() is not yet used in head/current, this has no effect on users. The function will be used by a future commit of code that adds Flex File Layout support.	2017-09-24 19:43:31 +00:00
Rick Macklem	0f29b8292d	Make the nfsrpc_layoutget() function a static. Make the NFSv4 pNFS client function nfsrpc_layoutget() a static, since it is only used in sys/fs/nfsclient/nfs_clrpcops.c. This prepares the code for future patches that add Flex File layout support.	2017-09-19 23:28:22 +00:00
Rick Macklem	2742a21091	Add a new function called nfsm_uiombuflist(), similar to nfsm_uiombuf(). This patch adds a new function called nfsm_uiombuflist(), which is similar to nfsm_uiombuf(), but doesn't not use the fields in struct nfsrv_descript. This new function will be used by the pNFS client for writing to mirrors using Flex Files layout. The function is not yet called anywhere. Also, get rid of #ifndef APPLE, which is ancient cruft left over from the Mac OSX port of the NFSv4 client.	2017-09-19 21:31:36 +00:00
Rick Macklem	b0932afacc	Simplify nfsrpc_layoutreturn() args. Simplify nfsrpc_layoutreturn() args. in preparation for the addition of Flex File layout support, since File layout uses a 0 length field. Flex Files does use a longer field, but that will be added in a subsequent commit.	2017-09-19 20:45:25 +00:00
Rick Macklem	ab118d04be	Simplify nfsrpc_layoutcommit() args. Simplify nfsrpc_layoutcommit() args. in preparation for the addition of Flex File layout support, since it also uses a 0 length field.	2017-09-19 20:18:41 +00:00
Rick Macklem	ccf038250a	Fix bogus FREAD with NFSV4OPEN_ACCESSREAD. No functional change. The code in nfscl_doflayoutio() bogusly used FREAD instead of NFSV4OPEN_ACCESSREAD. Since both happen to be defined as "1", this worked and the patch doesn't result in a functional change. Found by inspection during development of Flex File Layout support. MFC after: 2 weeks	2017-09-17 22:18:01 +00:00
Konstantin Belousov	e5cffdd34b	Do not drop NFS vnode lock when performing consistency checks. Currently several paths in the NFS client upgrade the shared vnode lock to exclusive, which might cause temporal dropping of the lock. This action appears to be fatal for nullfs mounts over NFS. If the operation is performed over nullfs vnode, then bypassed down to NFS VOP, and the lock is dropped, other thread might reclaim the upper nullfs vnode. Since on reclaim the nullfs vnode lock and NFS vnode lock are split, the original lock state of the nullfs vnode is not restored. As result, VFS operations receive not locked vnode after a VOP call. Stop upgrading the vnode lock when we check the consistency or flush buffers as result of detected inconsistency. Instead, allocate a new lockmgr lock for each NFS node, which is locked exclusive instead of the vnode lock upgrade. In other words, the other parallel modification of the vnode are excluded by either vnode lock conflict or exclusivity of the new lock when the vnode lock is shared. Also revert r316529 because now the vnode cannot be reclaimed during ncl_vinvalbuf(). In collaboration with: pho Reviewed by: rmacklem Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12083	2017-08-20 10:08:45 +00:00
Rick Macklem	47cbff34fa	Add kernel support for the NFS client forced dismount "umount -N" option. When an NFS mount is hung against an unresponsive NFS server, the "umount -f" option can be used to dismount the mount. Unfortunately, "umount -f" gets hung as well if a "umount" without "-f" has already been done. Usually, this is because of a vnode lock being held by the "umount" for the mounted-on vnode. This patch adds kernel code so that a new "-N" option can be added to "umount", allowing it to avoid getting hung for this case. It adds two flags. One indicates that a forced dismount is about to happen and the other is used, along with setting mnt_data == NULL, to handshake with the nfs_unmount() VFS call. It includes a slight change to the interface used between the client and common NFS modules, so I bumped __FreeBSD_version to ensure both modules are rebuilt. Tested by: pho Reviewed by: kib MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D11735	2017-07-29 19:52:47 +00:00
Rick Macklem	23e148e9a4	Fix possible crash for the NFSv4.1 pNFS client. If the nfsrpc_createlayoutrpc() call in nfsrpc_getcreatelayout() fails, the code used nfhpp when it might be set NULL. This patch checks for the error cases (laystat != 0) and avoids using nfhpp for the failure case. This would only affect NFSv4.1 mounts with the "pnfs" option. Found while testing the "umount -N" patch not yet in head. MFC after: 2 weeks	2017-07-29 02:25:49 +00:00
Rick Macklem	16f300fa4a	Replace the checks for MNTK_UNMOUNTF with a macro that does the same thing. This patch defines a macro that checks for MNTK_UNMOUNTF and replaces explicit checks with this macro. It has no effect on semantics, but prepares the code for a future patch where there will also be a NFS specific flag for "forced dismount about to occur". Suggested by: kib MFC after: 2 weeks	2017-07-27 20:55:31 +00:00
Konstantin Belousov	555b7bb4c8	Mark pages after EOF as clean after pageout. Suppose that a file on NFS has partially filled last page, and this page is dirty. NFS VOP_PAGEOUT() method only marks the the page clean up to the block of the last written byte, leaving other blocks dirty. Also any page which erronously exists in the vnode vm_object past EOF is also left marked as dirty. With the introduction of the buf-cache coherent pager, each pass of syncer over the object with such page results in creation of B_DELWRI buffer due to VOP_WRITE() call. This buffer is noted on next syncer pass, which results e.g. a visible manifestation of shutdown never finishing vnode sync. Note that before buf-cache coherency commit, a dirty page might left never synced to server if a partial writes occur. Fix this by clearing dirty bits after EOF. Only blocks of the partial page which are completely after EOF are marked clean, to avoid possible user data loss. Reported by: mav Reviewed by: alc, markj Tested by: mav, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11697	2017-07-26 20:07:05 +00:00
Konstantin Belousov	cc2c26223b	Move rtvals initialization out of the region protected by NFS node lock. Noted by: alc Reviewed by: alc, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D11697	2017-07-26 20:01:31 +00:00
Rick Macklem	f8181b5e0e	r320062 introduced a bug when doing NFSv4.1 mounts against some non-FreeBSD servers. r320062 used nm_rsize, nm_wsize to set the maximum request/response sizes for the NFSv4.1 session. If rsize,wsize are not specified as options, the value of nm_rsize, nm_wsize is 0 at session creation, resulting in values for request/response that are too small. This patch fixes the problem. A workaround is to specify rsize=N,wsize=N mount options explicitly, so they are set before session creation. This bug only affects NFSv4.1 mounts against some non-FreeBSD servers. MFC after: 1 week	2017-07-21 00:14:43 +00:00
Rick Macklem	06ea10c60b	Revert r321308. I'll commit a better fix soon.	2017-07-20 23:59:47 +00:00
Rick Macklem	a9d104fd89	r320062 introduced a bug when doing NFSv4.1 mounts against some non-FreeBSD servers. r320062 used nm_rsize, nm_wsize to set the maximum request/response sizes for the NFSv4.1 session. If rsize,wsize are not specified as options, the value of nm_rsize, nm_wsize is 0 at session creation, resulting in values for request/response that are too small. This patch fixes the problem. A workaround is to specify rsize=N,wsize=N mount options explicitly, so they are set before session creation. This bug only affects NFSv4.1 mounts against some non-FreeBSD servers. MFC after: 1 week	2017-07-20 23:15:50 +00:00
John Baldwin	15a88f8158	Consistently use vop_stdpathconf() for default pathconf values. Update filesystems not currently using vop_stdpathconf() in pathconf VOPs to use vop_stdpathconf() for any configuration variables that do not have filesystem-specific values. vop_stdpathconf() is used for variables that have system-wide settings as well as providing default values for some values based on system limits. Filesystems can still explicitly override individual settings. PR: 219851 Reported by: cem Reviewed by: cem, kib, ngie MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D11541	2017-07-11 21:55:20 +00:00
Rick Macklem	ad6eb97601	Fix an NFSv3 client case that probably never happens. If an NFSv3 server were to reply with weak cache consistency attributes, but not post operation attributes, the client would use garbage attributes from memory. This was spotted during work on the code for the NFSv4.1 client. I have never seen evidence that this happens and it wouldn't make sense for an NFSv3 server to do this, so this patch is basically "theoretical", but does fix the problem if a server were to do the above. PR: 219552 MFC after: 2 weeks	2017-06-28 21:37:08 +00:00
Rick Macklem	81b07aac10	Add support to the NFSv4.1/pNFS client for commits through the DS. A NFSv4.1/pNFS server using File Layout can specify that Commit operations are to be done against the DS instead of MDS. Since no extant pNFS server did this, the code was untested and "#ifdef notyet". The FreeBSD pNFS server I am developing does specify that Commits be done through the DS, so the code has been enabled/tested. This patch should only affect the case of a pNFS server that specfies Commits through the DS. PR: 219551 MFC after: 2 weeks	2017-06-26 00:43:04 +00:00
Rick Macklem	a351e99ce6	Add two new compound RPCs to the NFSv4.1/pNFS client. When the NFSv4.1 client is doing pNFS, it needs to get an Open and a Layout for every file it will be doing I/O on. The current code does two separate RPCs to get these. This patch adds two new compounds that do the both the Open and LayoutGet in the same RPC, reducing the RPC count. It also factors out the code that sets up and parses the LayoutGet operation into separate functions, so that the code doesn't get duplicated for these new RPCs. This patch is fairly large, but should only affect the NFSv4.1 client when the "pnfs" option is specified. PR: 219550 MFC after: 2 weeks	2017-06-24 20:01:21 +00:00
Rick Macklem	6d7963ecd4	Ensure that the credentials field of the NFSv4 client open structure is initialized. bdrewery@ has reported panics "newnfs_copycred: negative nfsc_ngroups". The only way I can see that this occurs is that the credentials field of the open structure gets used before being filled in. I am not sure quite how this happens, but for the file create case, the code is serialized via the vnode lock on the directory. If, somehow, a link to the same file gets created just after file creation, this might occur. This patch ensures that the credentials field is initialized to a reasonable set of credentials before the structure is linked into any list, so I this should ensure it is initialized before use. I am committing the patch now, since bdrewery@ notes that the panics are intermittent and it may be months before he knows if the patch fixes his problem. Reported by: bdrewery MFC after: 2 weeks	2017-06-22 00:17:15 +00:00
Rick Macklem	ee791357a2	Add the definition of maxbcachebuf to sys/buf.h. r320070 removed the definition of maxbcachebuf from sys/param.h to fix the build for arm. This patch adds the definition of maxbcachebuf to sys/buf.h, which should be ok, since sys/buf.h is not being included in arm/arm/elf_note.S. Suggested by: kib MFC after: 2 weeks	2017-06-19 22:07:53 +00:00
Rick Macklem	95ac7f1a74	Fix the NFS client/server so that it actually uses the 64bit ino_t filenos. The code still doesn't use d_off. That will come in a future commit. The code also removes the checks for servers returning a fileno that doesn't fit in 32bits, since that should work ok now. Bump __FreeBSD_version since this patch changes the interface between the NFS kernel modules. Reviewed by: kib	2017-06-18 21:48:31 +00:00
Rick Macklem	1d9f01b18e	Take "extern int maxbcachebuf" out of sys/param.h, since it breaks the arm build. In the arm build, elf_note.S includes sys/param.h and then does an elf macro called ELFNOTE(). Although the compile error doesn't make sense to me, I believe it just means that an "extern ..." can't exist in param.h for this inclusion case. I suspect adding #if !defined(LOCORE) might fix the build, but this commit just takes the definition out. I will ask freebsd-current@ what is the best was to deal with this and do a subsequent commit after that. Reported by: melounmichal@gmail.com	2017-06-18 12:28:43 +00:00
Rick Macklem	d1c5e240a8	Make MAXBCACHEBUF a tunable called vfs.maxbcachebuf. By making MAXBCACHEBUF a tunable, it can be increased to allow for larger read/write data sizes for the NFS client. The tunable is limited to MAXPHYS, which is currently 128K. Making MAXPHYS a tunable or increasing its value is being discussed, since it would be nice to support a read/write data size of 1Mbyte for the NFS client when mounting the AmazonEFS file service. Reviewed by: kib MFC after: 2 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D10991	2017-06-17 22:24:19 +00:00
Konstantin Belousov	a8e7f543af	Fix bug in r318997: remove the line which overrides vn_fsid() calculation. Noted by: jhb Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation	2017-05-30 21:20:54 +00:00
Konstantin Belousov	03311f117b	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method or reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit.. Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation	2017-05-27 17:00:30 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Rick Macklem	46adb5dcf8	Make nfscl_mtofh() return ENXIO when nfhpp == NULL. r317272 introduced a case where nfscl_mtofh() could return 0 when nfhpp is NULL. This patch makes it return ENXIO for this case. MFC after: 1 week	2017-05-15 13:14:13 +00:00
Rick Macklem	845eb84c56	Modify the NFSv4.1/pNFS client to ask for a maximum length of layout. The code specified the length of a layout as INT64_MAX instead of UINT64_MAX. This could result in getting a layout for less than the full file for extremely large files. Although having little practical effect, this patch corrects this in the code. Detected during recent testing of the pNFS server. MFC after: 2 weeks	2017-04-29 00:34:53 +00:00
Rick Macklem	ad81354ceb	Fix handling of a NFSv4.1 callback reply from the session cache. The nfsv4_seqsession() call returns NFSERR_REPLYFROMCACHE when it has a reply in the session, due to a requestor retry. The code erroneously assumed a return of 0 for this case. This patch fixes this and adds a KASSERT(). This would be an extremely rare occurrence. It was found during code inspection during the pNFS server development. MFC after: 2 weeks	2017-04-26 21:54:53 +00:00
Rick Macklem	6406db24cb	Make the NFSv4 client to use a write open for reading if allowed by the server. An NFSv4 server has the option of allowing a Read to be done using a Write Open. If this is not allowed, the server will return NFSERR_OPENMODE. This patch attempts the read with a write open and then disables this if the server replies NFSERR_OPENMODE. This change will avoid some uses of the special stateids. This will be useful for pNFS/DS Reads, since they cannot use special stateids. It will also be useful for any NFSv4 server that does not support reading via the special stateids. It has been tested against both types of NFSv4 server. MFC after: 2 weeks	2017-04-23 21:51:28 +00:00
Rick Macklem	b845c29a03	Don't set the connection-back-channel flag for DS sessions. The NFSv4.1/pNFS client does not use/need a backchannel for the Data Server (DS) sessions, so the flag should only be set for MetaData Server (MDS) sessions. This patch should have been a part of r317275. MFC after: 2 weeks	2017-04-23 21:36:32 +00:00
Rick Macklem	4e47dd1885	Fix the NFSv4.1/pNFS client return layout on close. The "return layout on close" case in the pNFS client was badly broken. Fortunately, extant pNFS servers that I have tested against do not do this. This patch fixes it. It also changes the way the layout stateid.seqid is set for LayoutReturn. I think this change is correct w.r.t. the RFC, but I am not 100% sure. This was found during recent testing of the pNFS server under development. MFC after: 2 weeks	2017-04-22 22:37:44 +00:00
Rick Macklem	c20a721023	Fix some krpc leaks for the NFSv4.1/pNFS client. The NFSv4.1/pNFS client wasn't doing a newnfs_disconnect() call for the connection to the Data Server (DS) under some circumstances. The main effect of this was a leak of malloc'd structures in the krpc. This patch adds the newnfs_disconnect() calls to fix this. Detected during recent testing against the pNFS server under development. MFC after: 2 weeks	2017-04-22 20:55:39 +00:00
Rick Macklem	e96af29419	Add checks for failed operations to the NFSv4 client function nfscl_mtofh(). The nfscl_mtofh() function didn't check for failed operations and, as such, would have returned EBADRPC for these cases, due to parsing failure. This patch adds checks, so that it returns with ND_NOMOREDATA set. This is needed for future use in the pNFS server and acts as a safety belt in the meantime. MFC after: 2 weeks	2017-04-21 21:43:00 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Rick Macklem	6d4377c1ae	Remove unused "cred" argument to ncl_flush(). The "cred" argument of ncl_flush() is unused and it was confusing to have the code passing in NULL for this argument in some cases. This patch deletes this argument. There is no semantic change because of this patch. MFC after: 2 weeks	2017-04-14 13:25:45 +00:00
Rick Macklem	037a2012e9	Add an NFSv4.1 mount option for "use one openowner". Some NFSv4.1 servers such as AmazonEFS can only support a small fixed number of open_owner4s. This patch adds a mount option called "oneopenown" that can be used for NFSv4.1 mounts to make the client do all Opens with the same open_owner4 string. This option can only be used with NFSv4.1 and may not work correctly when Delegations are is use. Reported by: cperciva Tested by: cperciva MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8988	2017-04-13 21:54:19 +00:00
Rick Macklem	60287c0c03	Add call to svcpool_close() for the NFSv4 callback pool (svcpool_nfscbd). A function called svcpool_close() was added to the server side krpc by r313735, so that a pool could be closed without destroying the data structures. This little patch adds a call to it for the callback pool (svcpool_nfscbd), so that the nfscbd daemon can be killed/restarted and continue to work correctly. MFC after: 2 weeks	2017-04-13 20:16:29 +00:00
Rick Macklem	0649fcaea5	Fix the NFS client for "text file modified, process killed" mmap'd case. When an mmap'd text file is written and then executed immediately afterwards, it was possible that the modify time would change after the text file was executing, resulting in the process executing the file being killed. This was usually only observed when the file system's times were set to higher resolution, but could have occurred for any time resolution. This was reported on a recent email list discussion. This patch adds a VOP_SET_TEXT() to the NFS client which flushed all dirty pages to the NFS server and then makes sure that n_mtime is up to date to avoid this from occurring. Thanks go to kib@ and pho@ for their help with developing this patch. Tested by: pho Reviewed by: kib MFC after: 2 weeks	2017-04-12 21:37:12 +00:00
Rick Macklem	7d9b62e1a1	Don't throw away Open state when a NFSv4.1 client recovery fails. If the ExchangeID/CreateSession operations done by an NFSv4.1 client after the server crashes/reboots fails, it is possible that some process/thread is waiting for an open_owner lock. If the client state is free'd, this can cause a crash. This would not normally happen, but has been observed on a mount of the AmazonEFS service. Reported by: cperciva Tested by: cperciva PR: 216086 MFC after: 2 weeks	2017-04-11 22:47:02 +00:00
Rick Macklem	9e507a6a00	During a server crash recovery, fix the NFSv4.1 client for a NFSERR_BADSESSION during recovery. If the NFSv4.1 client gets a NFSv4.1 NFSERR_BADSESSION reply to an Open/Lock operation while recovering from the server crash/reboot, allow the opens to be retained for a subsequent recovery attempt. Since NFSv4.1 servers should only reply NFSERR_BADSESSION after a crash/reboot that has lost state, this case should almost never happen. However, for the AmazonEFS file service, this has been observed when the client does a fresh TCP connection for RPCs. Reported by: cperciva Tested by: cperciva PR: 216088 MFC after: 2 weeks	2017-04-11 20:28:15 +00:00
Konstantin Belousov	6627d919de	Remove debugging printf. Instead, issue a diagnostic and return appropriate error if ncl_flush() was unable to clean buffer queue after the specified number or retries. Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2017-04-11 08:29:12 +00:00
Rick Macklem	8efffe80ba	Avoid starvation of the server crash recovery thread for the NFSv4 client. This patch gives a requestor of the exclusive lock on the client state in the NFSv4 client priority over shared lock requestors. This avoids the server crash recovery thread being starved out by other threads doing RPCs. Tested by: cperciva PR: 216087 MFC after: 2 weeks	2017-04-10 01:28:01 +00:00
Rick Macklem	4e7dcfab2d	Fix the NFSv4 client hndling of a stale write verifier in the Commit operation. When the NFSv4 client Commit operation encountered a stale write verifier, it erroneously mapped that to EIO. This could have caused recently written data to be lost when a server crashes/reboots between an UNSTABLE write and the subsequent commit. This patch fixes this. The bug was only for the NFSv4 client and did not affect NFSv3. Tested by: cperciva PR: 215887 MFC after: 2 weeks	2017-04-09 21:50:21 +00:00
Rick Macklem	83a37350bf	Fix parsing failure for NFSv4 Setattr operation for failed case. If an operation that preceeds a Setattr in an NFSv4 compound fails, there is no bitmap of attributes to parse. Without this patch, the parsing would fail and return EBADRPC instead of the correct failure error. This could break recovery from a server crash/reboot. Tested by: cperciva PR: 215883 MFC after: 2 weeks	2017-04-09 12:32:22 +00:00
Konstantin Belousov	a046da7e90	Remove spl*() calls from the nfsclient code. Style adjustments in the related lines in ncl_writebp(). Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-04-06 12:44:34 +00:00
Konstantin Belousov	ea52525928	Make nfs pageout coherent with the dirty state of the buffers. Write out the dirty pages using VOP_WRITE() instead of directly calling ncl_writerpc(). The state of the buffers now reflects the write, fixing some hard to diagnose consistency and write order issues. The change also allowed to remove remapping of paged out pages into kernel space and related allocation of the phys buffer. Reviewed by: markj, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 17:26:20 +00:00
Konstantin Belousov	b73cd4d344	Handle nfs IO_ASYNC write requests asynchronously. Reviewed by: markj, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 17:20:31 +00:00
Konstantin Belousov	48fe926362	Handle possible vnode reclamation after ncl_vinvalbuf() call. ncl_vinvalbuf() might need to upgrade vnode lock, allowing the vnode to be reclaimed by other thread. Handle the situation, indicated by the returned error zero and VI_DOOMED iflag set, converting it into EBADF. Handle all calls, even where the vnode is exclusively locked right now. Reviewed by: markj, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D10241	2017-04-05 17:11:39 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Konstantin Belousov	bd1623def1	Do not access memory past the buffer end. Do not accept and silently truncate too long hostname. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-16 06:36:16 +00:00
Konstantin Belousov	599009e261	Do not allocate char[MNAMELEN] on stack in nfsclient. Right now this is not critical, but will be after planned increase of MNAMELEN from 88 to 1k. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-02-16 06:34:20 +00:00
Ed Maste	1dc349ab95	prefix UFS symbols with UFS_ to reduce namespace pollution Specifically: ROOTINO -> UFS_ROOTINO WINO -> UFS_WINO NXADDR -> UFS_NXADDR NDADDR -> UFS_NDADDR NIADDR -> UFS_NIADDR MAXSYMLINKLEN_UFS[12] -> UFS[12]_MAXSYMLINKLEN (for consistency) Also prefix ext2's and nandfs's NDADDR and NIADDR with EXT2_ and NANDFS_ Reviewed by: kib, mckusick Obtained from: NetBSD MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D9536	2017-02-15 19:50:26 +00:00
Konstantin Belousov	1c32456953	Use type-independent formats for printing nlink_t and ino_t. Extracted from: ino64 work by gleb, mckusick Discussed with: mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week	2017-01-06 16:59:33 +00:00
Rick Macklem	b2fc0141d9	Fix NFSv4.1 client recovery from NFS4ERR_BAD_SESSION errors. For most NFSv4.1 servers, a NFS4ERR_BAD_SESSION error is a rare failure that indicates that the server has lost session/open/lock state. However, recent testing by cperciva@ against the AmazonEFS server found several problems with client recovery from this due to it generating this failure frequently. Briefly, the problems fixed are: - If all session slots were in use at the time of the failure, some processes would continue to loop waiting for a slot on the old session forever. - If an RPC that doesn't use open/lock state failed with NFS4ERR_BAD_SESSION, it would fail the RPC/syscall instead of initiating recovery and then looping to retry the RPC. - If a successful reply to an RPC for an old session wasn't processed until after a new session was created for a NFS4ERR_BAD_SESSION error, it would erroneously update the new session and corrupt it. - The use of the first element of the session list in the nfs mount structure (which is always the current metadata session) was slightly racey. With changes for the above problems it became more racey, so all uses of this head pointer was wrapped with a NFSLOCKMNT()/NFSUNLOCKMNT(). - Although the kernel malloc() usually allocates more bytes than requested and, as such, this wouldn't have caused problems, the allocation of a session structure was 1 byte smaller than it should have been. (Null termination byte for the string not included in byte count.) There are probably still problems with a pNFS data server that fails with NFS4ERR_BAD_SESSION, but I have no server that does this to test against (the AmazonEFS server doesn't do pNFS), so I can't fix these yet. Although this patch is fairly large, it should only affect the handling of NFS4ERR_BAD_SESSION error replies from an NFSv4.1 server. Thanks go to cperciva@ for the extension testing he did to help isolate/fix these problems. Reported by: cperciva Tested by: cperciva MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D8745	2016-12-23 23:14:53 +00:00
Konstantin Belousov	abc1515601	NFSv4 client tracks opens, and the track records are only dropped when the vnode is inactivated. This contradicts with the nullfs caching which keeps upper vnode around, as consequence keeping the use reference to lower vnode. Add a filesystem flag to request nullfs to not cache when mounted over that filesystem, and set the flag for nfs v4 mounts. Reported by: asomers Reviewed by: rmacklem Tested by: asomers, rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-27 09:20:58 +00:00
Konstantin Belousov	753a007f0d	Use buffer pager for NFS. The pager, due to its construction, implements clustering for the page-ins. In particular, buildworld load demonstrates reduction of the READ RPCs from 39k down to 24k. No change in real or CPU time was observed. Discussed with, and measured by: bde No objections from: rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-22 10:58:24 +00:00
Konstantin Belousov	fc2c3afee0	Minor cleanup, remove unneeded XXX comments and unused re-define. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-11-22 10:24:59 +00:00
Rick Macklem	1b819cf265	Update the nfsstats structure to include the changes needed by the patch in D1626 plus changes so that it includes counts for NFSv4.1 (and the draft of NFSv4.2). Also, make all the counts uint64_t and add a vers field at the beginning, so that future revisions can easily be implemented. There is code in place to handle the old vesion of the nfsstats structure for backwards binary compatibility. Subsequent commits will update nfsstat(8) to use the new fields. Submitted by: will (earlier version) Reviewed by: ken MFC after: 1 month Relnotes: yes Differential Revision: https://reviews.freebsd.org/D1626	2016-08-12 22:44:59 +00:00
Konstantin Belousov	ad600ac8e3	Remove ncl_printf(), use printf(9) directly. After r303710 the function duplicates printf(). Correct function names in the messages []. Noted by: bde [] Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-03 15:58:20 +00:00
Konstantin Belousov	83d7cf21ea	Remove unneeded (recursing) Giant acquisition around vprintf(9). Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-08-03 11:49:17 +00:00
Konstantin Belousov	20de93c6c0	Clean other flags in ncl_inactive, only. Add comment explaining why other flags should be unset. Suggested and reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 12 days Approved by: re (gjb)	2016-06-26 14:18:28 +00:00
Konstantin Belousov	8f73d398ed	Since VOP_INACTIVE() is not guaranteed to be called, all cleanups executed by inactive methods, must be repeated on reclaim. In particular, unlink and free sillyrenamed vnode both on inactivation and reclaim. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb)	2016-06-25 11:34:06 +00:00
Konstantin Belousov	e37dfd3d2b	Do not access NFS data for reclaimed vnode. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (delphij)	2016-06-19 18:29:43 +00:00
Conrad Meyer	ab8316b8df	nfs_clvfsops: Fix leading whitespace introduced in r299848 Replace spaces with tabs. No functional change. Sponsored by: EMC / Isilon Storage Division	2016-06-07 20:16:01 +00:00
Conrad Meyer	15634fd60c	nfs_clvfsops: Prevent strdup of stack garbage with bogus mount specs If strlen(hostp) was zero, the stack array 'nam' would never be initialized before being strdup()ed. Fix this by initializing it to the empty string. It's possible some external condition makes this case impossible, in which case, an assertion instead of this workaround is appropriate. Introduced in r299848. Reported by: Coverity CID: 1355336 Sponsored by: EMC / Isilon Storage Division	2016-06-07 20:00:20 +00:00
Gleb Smirnoff	fefbf77024	Comment fix: the getsockaddr() is actually meant here. Reviewed by: rmacklem	2016-05-18 17:40:53 +00:00
Edward Tomasz Napierala	0d1654c39b	Make it possible to reroot into NFS. This means one can have eg an NFSv4 root over WiFi: boot from md_root (small rootfs image preloaded by loader(8)), setup WiFi, and then reroot into the actual root, over NFS. Note that it's currently limited to NFSv4, and due to problems with nfsuserd(8) it requres a workaround on the server side: one needs to set the vfs.nfsd.enable_stringtouid=1 sysctl and not run nfsuserd(8) on either the server or the client side. Reviewed by: rmacklem@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6347	2016-05-15 08:34:59 +00:00
Konstantin Belousov	b6a60ae74a	Use vfs_hash_ref(9) to eliminate LK_EXCLOTHER kludge. As a consequence, the nfs client override of VOP_LOCK1() is no longer needed. Reviewed and tested by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 1 week	2016-05-11 06:35:46 +00:00
Pedro F. Giffuni	a96c9b30e2	NFS: spelling fixes on comments. No funcional change.	2016-04-29 16:07:25 +00:00
Rick Macklem	84aa8a8ad1	Bruce Evans reported that there was a performance regression between the old and new NFS clients. He did a good job of isolating the problem which was caused by the new NFS client not setting the post write mtime correctly. The new NFS client code was cloned from the old client, but was incorrect, because the mtime in the nfs vnode's cache wasn't yet updated. This patch fixes this problem. The patch also adds missing mutex locking. Reported and tested by: bde MFC after: 2 weeks	2016-04-11 21:55:21 +00:00
Pedro F. Giffuni	74b8d63dcc	Cleanup unnecessary semicolons from the kernel. Found with devel/coccinelle.	2016-04-10 23:07:00 +00:00
Bjoern A. Zeeb	8676704962	Unbreak NOIP builds after r294084.	2016-01-15 16:45:36 +00:00
Alexander V. Chernikov	d3bf8f6486	Make nfscl_getmyip() use new routing KPI. * Use standard IPv6 SAS instead of rt->rt_ifa address. * Make address lookup work for IPv6 LLA. * Save address into buffer provided by caller instead of using static vars. Discussed with: rmacklem	2016-01-15 09:05:14 +00:00
Gleb Smirnoff	f17f88d3e0	Fix breakage caused by r292373 in ZFS/FUSE/NFS/SMBFS. With the new VOP_GETPAGES() KPI the "count" argument counts pages already, and doesn't need to be translated from bytes to pages. While here make it consistent that rbehind and rahead are updated only if we doesn't return error. Pointy hat to: glebius	2015-12-16 23:48:50 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceeded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strenghtening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Kirk McKusick	43a993bb7d	For performance reasons, it is useful to have a single string used as the name of a filesystem when setting it as the first parameter to the getnewvnode() function. Most filesystems call getnewvnode from just one place so can use a literal string as the first parameter. However, NFS calls getnewvnode from two places, so we create a global constant string that can be used by the two instances. This change also collapses two instances of getnewvnode() in the UFS filesystem to a single call. Reviewed by: kib Tested by: Peter Holm	2015-11-29 21:01:02 +00:00
Rick Macklem	b179878dde	Revert r283330 since it broke directory caching in the client. At this time I cannot see a way to fix directory caching when it has partial blocks in the buffer cache, due to the fact that the syscall's uio_offset won't stay the same as the lblkno * NFS_DIRBLKSIZ offset. Reported by: bde MFC after: 2 weeks	2015-11-21 00:15:41 +00:00
Rick Macklem	f315383406	mnt_stat.f_iosize (which is used to set bo_bsize) must be set to the largest size of buffer cache block or the mapping of the buffer is bogus. When a mount with rsize=4096,wsize=4096 was done, f_iosize would be set to 4096. This resulted in corrupted directory data, since the buffer cache block size for directories is NFS_DIRBLKSIZ (8192). This patch fixes the code so that it always sets f_iosize to at least NFS_DIRBLKSIZ. Tested by: krichy@cflinux.hu PR: 177971 MFC after: 2 weeks	2015-11-17 01:44:26 +00:00
Conrad Meyer	b5af3f30a7	nfsclient: Protest loudly when GETATTR responses are invalid BROKEN NFS SERVER OR MIDDLEWARE: Certain WAN "accelerators" attempt to cache NFS GETATTR traffic, but actually corrupt it (e.g., responding to requests with attributes for totally different files). Warn very verbosely when this is detected. Linux' NFS client has a similar warning. Adds a sysctl/tunable (vfs.nfs.fileid_maxwarnings) to configure the quantity of warnings; default to 10. (Zero disables; -1 is unlimited.) Adds a failpoint to aid in validating the warning / behavior with a non-broken server. Use something like: sysctl 'debug.fail_point.nfscl_force_fileid_warning=10%return(1)' Reviewed by: rmacklem Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3304	2015-08-05 22:27:30 +00:00
Rick Macklem	2a3508eb48	If a "principal" argument isn't provided for a Kerberized NFS mount, the kernel would generate a bogus one with a ":/<path>" suffix. This would only occur for the case where there was no explicit "principal" argument and the getaddrinfo() call in mount_nfs.c failed to a return a cannonical name for the server. This patch fixes this unusual case. PR: 201073 Submitted by: masato@itc.naist.jp MFC after: 2 weeks	2015-07-03 22:11:07 +00:00
Rick Macklem	d189dcb6e2	Alex Burlyga reported a POLA violation for the new NFS client as compared to the old NFS client via email to the freebsd-fs@ mailing list. For the new client, when multiple clients attempted to create a symbolic link concurrently, more that one client would report success instead of EEXIST. This was caused by code in the new client that mapped EEXIST to OK assuming it was caused by a retried RPC request. Since the old client did not do this, the patch defaults to the old behaviour and permits the new behaviour to be enabled via a sysctl. Reported by: alex.burlyga.ietf@gmail.com Tested by: alex.burlyga.ietf@gmail.com MFC after: 2 weeks	2015-07-03 01:15:21 +00:00
Gleb Smirnoff	093ebe1d28	o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async(). o Provide an extensive set of assertions for input array of pages. o Remove now duplicate assertions from different pagers. Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-06-17 22:44:27 +00:00
Rick Macklem	262a84286d	The NFS client generated directory block(s) with d_fileno == 0 so that it would not return less data than requested. Since returning less directory data than requested is not a problem for FreeBSD and even UFS no longer returns directory structures with d_fileno == 0, this patch stops the client from doing this. Although entries with d_fileno == 0 should not be a problem, the man pages no longer document that these entries should be ignored, so there was a concern that these entries might be an issue in the future. Suggested by: trasz Tested by: trasz MFC after: 2 weeks	2015-05-23 21:58:41 +00:00
Rick Macklem	86b9457f5b	The NFS client wasn't handling getdirentries(2) requests for sizes that are not an exact multiple of DIRBLKSIZ correctly. Fortunately readdir(3) always uses an exact multiple of DIRBLKSIZ, so few applications were affected. This patch fixes this problem by reducing the size of the directory read to an exact multiple of DIRBLKSIZ. Tested by: trasz Reported by: trasz Reviewed by: trasz MFC after: 2 weeks	2015-05-21 23:14:18 +00:00
Alexander Motin	a87627b26b	Do not promote large async writes to sync. Present implementation of large sync writes is too strict and so can be quite slow. Instead of doing that, execute large async write in chunks, syncing each chunk separately. It would be good to fix large sync writes too, but I leave it to somebody with more skills in this area. Reviewed by: rmacklem MFC after: 1 week	2015-05-14 10:04:42 +00:00
Rick Macklem	7cfdc2a7bc	MAXBSIZE defines both the largest UFS block size and the largest size for a buffer in the buffer cache. This patch defines a new constant MAXBCACHEBUF, which is the largest size for a buffer in the buffer cache. Having a separate constant allows MAXBCACHEBUF to be set larger than MAXBSIZE on a per-architecture basis, so that NFS can do larger read/writes for these architectures. It modifies sys/param.h so that BKVASIZE can also be set on a per-architecture basis. A couple of cases where NFS used MAXBSIZE instead of NFS_MAXBSIZE is fixed as well. Differential Revision: https://reviews.freebsd.org/D2330 Reviewed by: mav, kib MFC after: 2 weeks	2015-04-25 00:52:01 +00:00
Pedro F. Giffuni	2f39c91019	Prevent a double free. This is similar to r281756 so set the ptr NULL after free as a safety belt against future changes. Obtained from: HardenedBSD (b2e77ced9ae213d358b44d98f552d9ae4636ecac) Submitted by: Oliver Pinter Revewed by: rmacklem	2015-04-20 16:40:13 +00:00
Pedro F. Giffuni	a3a4b110da	nfsrpc_createv4: fix double free. Reported by: Oliver Pinter, clang static checker Obtained from: HardenedBSD (commit 63cac77c42c0c3fc67da62f97d5ab651d52ae707) Reviewed by: rmacklem MFC after: 5 days	2015-04-19 23:55:59 +00:00
Alexander Motin	afdfc9a40d	Change wcommitsize default from one empirical value to another. The new value is more predictable with growing RAM size: hibufspace maxvnodes old new i386: 256MB 32980992 15800 2198732 2097152 2GB 94027776 107677 878764 4194304 amd64: 256MB 32980992 15800 2198732 2097152 1GB 114114560 68062 1678155 4194304 4GB 217055232 111807 1955452 4194304 16GB 1717846016 337308 5097465 16777216 64GB 1734918144 1164427 1490479 16777216 256GB 1734918144 4426453 391983 16777216 Reviewed by: rmacklem MFC after: 2 weeks	2015-04-19 11:34:41 +00:00
Edward Tomasz Napierala	50a220c699	Replace "new NFS" with just "NFS" in some sysctl description strings. Sponsored by: The FreeBSD Foundation	2015-04-19 06:18:41 +00:00
Rick Macklem	dda11d4ab9	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks	2015-04-15 20:16:31 +00:00
Gleb Smirnoff	4d6481a4c9	o Enhance vm_pager_free_nonreq() function: - Allow to call the function with vm object lock held. - Allow to specify reqpage that doesn't match any page in the region, meaning freeing all pages. o Utilize the new function in couple more places in vnode pager. Reviewed by: alc, kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-03-17 19:19:19 +00:00
Rick Macklem	07d491dede	r245508 modified the NFS client's Setattr RPC to use VA_UTIMES_NULL to indicate whether it should set the time to the current tod on the server. This had the side effect of making the NFS client use the client's timestamp for exclusive create, starting with FreeBSD9.2. Unfortunately a bug in some Solaris NFS servers causes these servers to return NFS_OK to the Setattr RPC done during exclusive create, but not actually set the file's mode, leaving the file's mode == 0. This patch restores the NFS client's behaviour to use the server's tod for the exclusive open's Setattr RPC, to avoid the Solaris server bug and to restore the pre-FreeBSD9.2 NFS behaviour. Discussed on: freebsd-fs PR: 186293 MFC after: 3 months	2014-12-28 21:13:52 +00:00
Rick Macklem	2f88b3d20a	Delete some duplicate code that was harmless because exactly the same code is at the end of the nfscl_checksattr() function that is called just before it. As such, this code had already been executed and didn't do anything. MFC after: 1 week	2014-12-25 22:29:37 +00:00
Rick Macklem	62c23db947	Fix kernel builds with "options NFS_DEBUG" that were broken by r276096. Also delete the two kernel options NFS_GATHERDELAY, NFS_WDELAYHASHSIZ which are no longer used. Reported by: bz	2014-12-23 14:24:36 +00:00
Rick Macklem	c15882f091	Remove the old NFS client and server from head, which means that the NFSCLIENT and NFSSERVER kernel options will no longer work. This commit only removes the kernel components. Removal of unused code in the user utilities will be done later. This commit does not include an addition to UPDATING, but that will be committed in a few minutes. Discussed on: freebsd-fs	2014-12-23 00:47:46 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Edward Tomasz Napierala	2fbe0cff73	Fix handling of "conn" mount_nfs(8) option. Reviewed by: rmacklem@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2014-10-30 09:25:03 +00:00
Edward Tomasz Napierala	5a06ac3540	Add support for "timeo", "actimeo", "noac", and "proto" options to mount_nfs(8). They are implemented on Linux, OS X, and Solaris, and thus can be expected to appear in automounter maps. Reviewed by: rmacklem@ MFC after: 1 month Sponsored by: The FreeBSD Foundation	2014-10-30 08:50:01 +00:00
Rick Macklem	6a30c96cdc	Clip the settings for the NFS rsize, wsize mount options to a power of 2. For non-power of 2 settings, intermittent page faults have been reported. Although the bug that causes these page faults/crashes has not been identified, it does not appear to occur when rsize, wsize is a power of 2. Reported by: tcberner@gmail.com MFC after: 2 weeks	2014-10-22 22:27:51 +00:00
Rick Macklem	fcf121d481	Revert r273481 so it can be recoded using fls(), which some feel will make it more readable.	2014-10-22 21:57:35 +00:00
Rick Macklem	88cc4e92da	Clip the settings for the NFS rsize, wsize mount options to a power of 2. For non-power of 2 settings, intermittent page faults have been reported. Although the bug that causes these page faults/crashes has not been identified, it does not appear to occur when rsize, wsize is a power of 2. Reported by: tcberner@gmail.com MFC after: 2 weeks	2014-10-22 20:47:11 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Alan Cox	396b3e34b4	Avoid an exclusive acquisition of the object lock on the expected execution path through the NFS clients' getpages functions. Introduce vm_pager_free_nonreq(). This function can be used to eliminate code that is duplicated in many getpages functions. Also, in contrast to the code that currently appears in those getpages functions, vm_pager_free_nonreq() avoids acquiring an exclusive object lock in one case. Reviewed by: kib MFC after: 6 weeks Sponsored by: EMC / Isilon Storage Division	2014-09-14 18:07:55 +00:00
Konstantin Belousov	65589a29f4	Check for the cross-device cross-link attempt in the VFS, instead of forcing filesystem VOP_LINK() methods to repeat the code. In tmpfs_link(), remove redundand check for the type of the source, already done by VFS. Note that NFS server already performs this check before calling VOP_LINK(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-16 14:04:46 +00:00
Rick Macklem	c59e4cc34d	Merge the NFSv4.1 server code in projects/nfsv4.1-server over into head. The code is not believed to have any effect on the semantics of non-NFSv4.1 server behaviour. It is a rather large merge, but I am hoping that there will not be any regressions for the NFS server. MFC after: 1 month	2014-07-01 20:47:16 +00:00
Rick Macklem	2d5f835917	There might be a potential race condition for the NFSv4 client when a newly created file has another open done on it that update the open mode. This patch moves the code that updates the open mode up into the block where the mutex is held to ensure this cannot happen. No bug caused by this potential race has been observed, but this fix is a safety belt to ensure it cannot happen. MFC after: 2 weeks	2014-06-28 21:47:15 +00:00
Rick Macklem	c0990edac6	Modify the NFSv4 client's Pathconf RPC (actually a Getattr Op.) so that it only does the RPC for names that are answered by the RPC. Doing the RPC for other names is harmless, but unnecessary. MFC after: 2 weeks	2014-04-23 22:13:10 +00:00
Rick Macklem	9eeef7464b	Fixes mkdir for the NFSv2 client that was broken by r264705. Reported by: bdrewery MFC after: 2 weeks	2014-04-22 04:42:46 +00:00
Rick Macklem	c7b560b9b4	For an NFSv4 mount with the "nocto" option, don't get the up to date file attributes upon close. This reduces the Getattr RPC count by about 65% for software builds. MFC after: 2 weeks	2014-04-21 19:10:23 +00:00
Rick Macklem	c3e4a7261c	Modify the NFSv4 client create/mkdir RPC so that it acquires post-create/mkdir directory attributes. This allows the RPC to name cache the newly created directory and reduces the lookup RPC count for applications creating a lot of directories. MFC after: 2 weeks	2014-04-20 22:19:00 +00:00
Rick Macklem	de1a42bd0c	Modify the NFSv4 client open/create RPC so that it acquires post-open/create directory attributes. This allows the RPC to name cache the newly created file and reduces the lookup RPC count by about 10% for software builds. MFC after: 2 weeks	2014-04-19 19:40:20 +00:00
Rick Macklem	a6f8e64e74	Modify the Lookup RPC for NFSv4 so that it acquires directory attributes. This allows the client to cache directory names when they are looked up, reducing the Lookup RPC count by about 40% for software builds. MFC after: 2 weeks	2014-04-18 22:05:34 +00:00
Robert Watson	4a14441044	Update kernel inclusions of capability.h to use capsicum.h instead; some further refinement is required as some device drivers intended to be portable over FreeBSD versions rely on __FreeBSD_version to decide whether to include capability.h. MFC after: 3 weeks	2014-03-16 10:55:57 +00:00
Rick Macklem	b921158ae0	The NFSv4 client was passing both the p and cred arguments to nfsv4_fillattr() as NULLs for the Getattr callback. This caused nfsv4_fillattr() to not fill in the Change attribute for the reply. I believe this was a violation of the RFC, but had little effect on server behaviour. This patch passes a non-NULL p argument to fix this. MFC after: 1 week	2013-12-24 00:48:39 +00:00
Rick Macklem	6b8fe5d59d	The NFSv4.1 client didn't return NFSv4.1 specific error codes for the Getattr and Recall callbacks. This patch fixes it. Since the NFSv4.1 specific error codes would only happen for abnormal circumstances, this patch has little effect, in practice. MFC after: 1 week	2013-12-23 15:16:53 +00:00
Rick Macklem	cf766161ff	For software builds, the NFS client does many small synchronous (with FILE_SYNC) writes because non-contiguous byte ranges in the same buffer cache block are being written. This patch adds a new mount option "noncontigwr" which allows the non-contiguous byte ranges to be combined, with the dirty byte range becoming the superset of the bytes that are dirty, if the file has not been file locked. This reduces the number of writes significantly for software builds. The only case where this change might break existing applications is where an application is writing non-overlapping byte ranges within the same buffer cache block of a file from multiple clients concurrently. Since such an application would normally do file locking on the file, avoiding the byte range merge for files that have been file locked should be sufficient for most (maybe all?) cases. Submitted by: jhb (earlier version) Reviewed by: kib MFC after: 3 weeks	2013-12-07 23:05:59 +00:00
Sergey Kandaurov	0d8dc7cc39	- Nuke a second copy of nfscl_attrcache extern declarations from under ifdef KDTRACE_HOOKS. This fixes kernel build with options KDTRACE_HOOKS. - Fix style inconsistencies.	2013-11-26 22:41:40 +00:00
Gleb Smirnoff	285e7a2d97	Fix build, attempt two.	2013-11-26 20:27:57 +00:00
Gleb Smirnoff	6882b8ea66	Fix build.	2013-11-26 10:34:34 +00:00
Attilio Rao	54366c0bd7	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip	2013-11-25 07:38:45 +00:00
Rick Macklem	42b6336a98	Fix an NFSv4.1 client specific case where a forced dismount would hang. The hang occurred in nfsv4_setsequence() when it couldn't find an available session slot and is fixed by checking for a forced dismount in progress and just returning for this case. MFC after: 1 month	2013-11-09 21:24:56 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. #define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD \| CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t cap_rights_init(cap_rights_t rights, ...); void cap_rights_set(cap_rights_t rights, ...); void cap_rights_clear(cap_rights_t rights, ...); bool cap_rights_is_set(const cap_rights_t rights, ...); bool cap_rights_is_valid(const cap_rights_t rights); void cap_rights_merge(cap_rights_t dst, const cap_rights_t src); void cap_rights_remove(cap_rights_t dst, const cap_rights_t src); bool cap_rights_contains(const cap_rights_t big, const cap_rights_t little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP \| CAP_PDKILL); Providing several rights that belongs to the same array's element this way is correct, but is not advised. It should only be used for aliases definition. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
Rick Macklem	f7d8291af0	Crashes have been observed for NFSv4.1 mounts when the system is being shut down which were caused by the nfscbd_pool being destroyed before the backchannel is disabled. This patch is believed to fix the problem, by simply avoiding ever destroying the nfscbd_pool. Since the NFS client module cannot be unloaded, this should not cause a memory leak. MFC after: 2 weeks	2013-09-04 22:47:56 +00:00
Rick Macklem	8fe6bddff7	Forced dismounts of NFS mounts can fail when thread(s) are stuck waiting for an RPC reply from the server while holding the mount point busy (mnt_lockref incremented). This happens because dounmount() msleep()s waiting for mnt_lockref to become 0, before calling VFS_UNMOUNT(). This patch adds a new VFS operation called VFS_PURGE(), which the NFS client implements as purging RPCs in progress. Making this call before checking mnt_lockref fixes the problem, by ensuring that the VOP_xxx() calls will fail and unbusy the mount point. Reported by: sbruno Reviewed by: kib MFC after: 2 weeks	2013-09-01 23:02:59 +00:00
Rick Macklem	88a2437a65	Add support for host-based (Kerberos 5 service principal) initiator credentials to the kernel rpc. Modify the NFSv4 client to add support for the gssname and allgssname mount options to use this capability. Requires the gssd daemon to be running with the "-h" option. Reviewed by: jhb	2013-07-09 01:05:28 +00:00
Rick Macklem	a820822ec8	A problem with the old NFS client where large writes to large files would sometimes result in a corrupted file was reported via email. This problem appears to have been caused by r251719 (reverting r251719 fixed the problem). Although I have not been able to reproduce this problem, I suspect it is caused by another thread increasing np->n_size after the mtx_unlock(&np->n_mtx) but before the vnode_pager_setsize() call. Since the np->n_mtx mutex serializes updates to np->n_size, doing the vnode_pager_setsize() with the mutex locked appears to avoid the problem. Unfortunately, vnode_pager_setsize() where the new size is smaller, cannot be called with a mutex held. This patch returns the semantics to be close to pre-r251719 (actually pre-r248567, r248581, r248567 for the new client) such that the call to vnode_pager_setsize() is only delayed until after the mutex is unlocked when np->n_size is shrinking. Since the file is growing when being written, I believe this will fix the corruption. A better solution might be to replace the mutex with a sleep lock, but that is a non-trivial conversion, so this fix is hoped to be sufficient in the meantime. Reported by: David G. Lawrence (dg@dglawrence.com) Tested by: David G. Lawrence (to be done soon) Reviewed by: kib MFC after: 1 week	2013-07-03 00:19:03 +00:00
Rick Macklem	2e6a4b0c55	Fix r252074 so that it builds on 64bit arches.	2013-06-22 21:58:21 +00:00
Rick Macklem	1dd95a046c	The NFSv4.1 LayoutCommit operation requires a valid offset and length. (0, 0 is not sufficient) This patch a loop for each file layout, using the offset, length of each file layout in a separate LayoutCommit.	2013-06-21 22:46:16 +00:00
Rick Macklem	562395581b	When the NFSv4.1 client is writing to a pNFS Data Server (DS), the file's size attribute does not get updated. As such, it is necessary to invalidate the attribute cache before clearing NMODIFIED for pNFS. MFC after: 2 weeks	2013-06-21 22:26:18 +00:00
Rick Macklem	315c38d135	Since some NFSv4 servers enforce the requirement for a reserved port#, enable use of the (no)resvport mount option for NFSv4. I had thought that the RFC required that non-reserved port #s be allowed, but I couldn't find it in the RFC. MFC after: 2 weeks	2013-06-21 19:41:30 +00:00
Jeff Roberson	22a722605d	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf	2013-05-31 00:43:41 +00:00
Rick Macklem	734b03c38d	Post-r248567, there were times when the client would return a truncated directory for some NFS servers. This turned out to be because the size of a directory reported by an NFS server can be smaller that the ufs-like directory created from the RPC XDR in the client. This patch fixes the problem by changing r248567 so that vnode_pager_setsize() is only done for regular files. Reported and tested by: hartmut.brandt@dlr.de Reviewed by: kib MFC after: 1 week	2013-05-28 22:36:01 +00:00
Rick Macklem	77a03c148c	Add support for the eofflag to nfs_readdir() in the new NFS client so that it works under a unionfs mount. Submitted by: Jared Yanovich (slovichon@gmail.com) Reviewed by: kib MFC after: 2 weeks	2013-05-12 21:48:08 +00:00
Rick Macklem	64a0e848ab	When an NFS unmount occurs, once vflush() writes the last dirty buffer for the last vnode on the mount back to the server, it returns. At that point, the code continues with the unmount, including freeing up the nfs specific part of the mount structure. It is possible that an nfsiod thread will try to check for an empty I/O queue in the nfs specific part of the mount structure after it has been free'd by the unmount. This patch avoids this problem by setting the iodmount entries for the mount back to NULL while holding the mutex in the unmount and checking the appropriate entry is non-NULL after acquiring the mutex in the nfsiod thread. Reported and tested by: pho Reviewed by: kib MFC after: 2 weeks	2013-04-18 23:20:16 +00:00
Rick Macklem	175b3f31d3	Both NFS clients can deadlock when using the "rdirplus" mount option. This can occur when an nfsiod thread that already holds a buffer lock attempts to acquire a vnode lock on an entry in the directory (a LOR) when another thread holding the vnode lock is waiting on an nfsiod thread. This patch avoids the deadlock by disabling readahead for this case, so the nfsiod threads never do readdirplus. Since readaheads for directories need the directory offset cookie from the previous read, they cannot normally happen in parallel. As such, testing by jhb@ and myself didn't find any performance degredation when this patch is applied. If there is a case where this results in a significant performance degradation, mounting without the "rdirplus" option can be done to re-enable readahead for directories. Reported and tested by: jhb Reviewed by: jhb MFC after: 2 weeks	2013-04-18 13:09:04 +00:00
Kenneth D. Merry	d96b98a360	Revamp the old NFS server's File Handle Affinity (FHA) code so that it will work with either the old or new server. The FHA code keeps a cache of currently active file handles for NFSv2 and v3 requests, so that read and write requests for the same file are directed to the same group of threads (reads) or thread (writes). It does not currently work for NFSv4 requests. They are more complex, and will take more work to support. This improves read-ahead performance, especially with ZFS, if the FHA tuning parameters are configured appropriately. Without the FHA code, concurrent reads that are part of a sequential read from a file will be directed to separate NFS threads. This has the effect of confusing the ZFS zfetch (prefetch) code and makes sequential reads significantly slower with clients like Linux that do a lot of prefetching. The FHA code has also been updated to direct write requests to nearby file offsets to the same thread in the same way it batches reads, and the FHA code will now also send writes to multiple threads when needed. This improves sequential write performance in ZFS, because writes to a file are now more ordered. Since NFS writes (generally less than 64K) are smaller than the typical ZFS record size (usually 128K), out of order NFS writes to the same block can trigger a read in ZFS. Sending them down the same thread increases the odds of their being in order. In order for multiple write threads per file in the FHA code to be useful, writes in the NFS server have been changed to use a LK_SHARED vnode lock, and upgrade that to LK_EXCLUSIVE if the filesystem doesn't allow multiple writers to a file at once. ZFS is currently the only filesystem that allows multiple writers to a file, because it has internal file range locking. This change does not affect the NFSv4 code. This improves random write performance to a single file in ZFS, since we can now have multiple writers inside ZFS at one time. I have changed the default tuning parameters to a 22 bit (4MB) window size (from 256K) and unlimited commands per thread as a result of my benchmarking with ZFS. The FHA code has been updated to allow configuring the tuning parameters from loader tunable variables in addition to sysctl variables. The read offset window calculation has been slightly modified as well. Instead of having separate bins, each file handle has a rolling window of bin_shift size. This minimizes glitches in throughput when shifting from one bin to another. sys/conf/files: Add nfs_fha_new.c and nfs_fha_old.c. Compile nfs_fha.c when either the old or the new NFS server is built. sys/fs/nfs/nfsport.h, sys/fs/nfs/nfs_commonport.c: Bring in changes from Rick Macklem to newnfs_realign that allow it to operate in blocking (M_WAITOK) or non-blocking (M_NOWAIT) mode. sys/fs/nfs/nfs_commonsubs.c, sys/fs/nfs/nfs_var.h: Bring in a change from Rick Macklem to allow telling nfsm_dissect() whether or not to wait for mallocs. sys/fs/nfs/nfsm_subs.h: Bring in changes from Rick Macklem to create a new nfsm_dissect_nonblock() inline function and NFSM_DISSECT_NONBLOCK() macro. sys/fs/nfs/nfs_commonkrpc.c, sys/fs/nfsclient/nfs_clkrpc.c: Add the malloc wait flag to a newnfs_realign() call. sys/fs/nfsserver/nfs_nfsdkrpc.c: Setup the new NFS server's RPC thread pool so that it will call the FHA code. Add the malloc flag argument to newnfs_realign(). Unstaticize newnfs_nfsv3_procid[] so that we can use it in the FHA code. sys/fs/nfsserver/nfs_nfsdsocket.c: In nfsrvd_dorpc(), add NFSPROC_WRITE to the list of RPC types that use the LK_SHARED lock type. sys/fs/nfsserver/nfs_nfsdport.c: In nfsd_fhtovp(), if we're starting a write, check to see whether the underlying filesystem supports shared writes. If not, upgrade the lock type from LK_SHARED to LK_EXCLUSIVE. sys/nfsserver/nfs_fha.c: Remove all code that is specific to the NFS server implementation. Anything that is server-specific is now accessed through a callback supplied by that server's FHA shim in the new softc. There are now separate sysctls and tunables for the FHA implementations for the old and new NFS servers. The new NFS server has its tunables under vfs.nfsd.fha, the old NFS server's tunables are under vfs.nfsrv.fha as before. In fha_extract_info(), use callouts for all server-specific code. Getting file handles and offsets is now done in the individual server's shim module. In fha_hash_entry_choose_thread(), change the way we decide whether two reads are in proximity to each other. Previously, the calculation was a simple shift operation to see whether the offsets were in the same power of 2 bucket. The issue was that there would be a bucket (and therefore thread) transition, even if the reads were in close proximity. When there is a thread transition, reads wind up going somewhat out of order, and ZFS gets confused. The new calculation simply tries to see whether the offsets are within 1 << bin_shift of each other. If they are, the reads will be sent to the same thread. The effect of this change is that for sequential reads, if the client doesn't exceed the max_reqs_per_nfsd parameter and the bin_shift is set to a reasonable value (22, or 4MB works well in my tests), the reads in any sequential stream will largely be confined to a single thread. Change fha_assign() so that it takes a softc argument. It is now called from the individual server's shim code, which will pass in the softc. Change fhe_stats_sysctl() so that it takes a softc parameter. It is now called from the individual server's shim code. Add the current offset to the list of things printed out about each active thread. Change the num_reads and num_writes counters in the fha_hash_entry structure to 32-bit values, and rename them num_rw and num_exclusive, respectively, to reflect their changed usage. Add an enable sysctl and tunable that allows the user to disable the FHA code (when vfs.XXX.fha.enable = 0). This is useful for before/after performance comparisons. nfs_fha.h: Move most structure definitions out of nfs_fha.c and into the header file, so that the individual server shims can see them. Change the default bin_shift to 22 (4MB) instead of 18 (256K). Allow unlimited commands per thread. sys/nfsserver/nfs_fha_old.c, sys/nfsserver/nfs_fha_old.h, sys/fs/nfsserver/nfs_fha_new.c, sys/fs/nfsserver/nfs_fha_new.h: Add shims for the old and new NFS servers to interface with the FHA code, and callbacks for the The shims contain all of the code and definitions that are specific to the NFS servers. They setup the server-specific callbacks and set the server name for the sysctl and loader tunable variables. sys/nfsserver/nfs_srvkrpc.c: Configure the RPC code to call fhaold_assign() instead of fha_assign(). sys/modules/nfsd/Makefile: Add nfs_fha.c and nfs_fha_new.c. sys/modules/nfsserver/Makefile: Add nfs_fha_old.c. Reviewed by: rmacklem Sponsored by: Spectra Logic MFC after: 2 weeks	2013-04-17 21:00:22 +00:00
Konstantin Belousov	159a400eb6	Strip the unnneeded spaces, mostly at the end of lines. MFC after: 3 days	2013-04-01 09:56:48 +00:00
Konstantin Belousov	4d569af96c	Initialize the variable to avoid (false) compiler warning about use of an uninitialized local. Reported by: Ivan Klymenko <fidaj@ukr.net> MFC after: 2 weeks	2013-03-21 12:59:24 +00:00
Konstantin Belousov	7157d8f7ab	Do not call vnode_pager_setsize() while a NFS node mutex is locked. vnode_pager_setsize() might sleep waiting for the page after EOF be unbusied. Call vnode_pager_setsize() both for the regular and directory vnodes. Reported by: mich Reviewed by: rmacklem Discussed with: avg, jhb MFC after: 2 weeks	2013-03-21 07:25:08 +00:00
Ed Maste	96ecfd9813	Fix remainder calculation when biosize is not a power of 2 In common configurations biosize is a power of two, but is not required to be so. Thanks to markj@ for spotting an additional case beyond my original patch. Reviewed by: rmacklem@	2013-03-19 13:06:11 +00:00
John Baldwin	3b14c753ff	Revert 195703 and 195821 as this special stop handling in NFS is now implemented via VFCF_SBDRY rather than passing PBDRY to individual sleep calls.	2013-03-13 21:06:03 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Pawel Jakub Dawidek	2609222ab4	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib	2013-03-02 00:53:12 +00:00
John Baldwin	593efaf9f7	Further refine the handling of stop signals in the NFS client. The changes in r246417 were incomplete as they did not add explicit calls to sigdeferstop() around all the places that previously passed SBDRY to _sleep(). In addition, nfs_getcacheblk() could trigger a write RPC from getblk() resulting in sigdeferstop() recursing. Rather than manually deferring stop signals in specific places, change the VFS_() and VOP_() methods to defer stop signals for filesystems which request this behavior via a new VFCF_SBDRY flag. Note that this has to be a VFC flag rather than a MNTK flag so that it works properly with VFS_MOUNT() when the mount is not yet fully constructed. For now, only the NFS clients are set this new flag in VFS_SET(). A few other related changes: - Add an assertion to ensure that TDF_SBDRY doesn't leak to userland. - When a lookup request uses VOP_READLINK() to follow a symlink, mark the request as being on behalf of the thread performing the lookup (cnp_thread) rather than using a NULL thread pointer. This causes NFS to properly handle signals during this VOP on an interruptible mount. PR: kern/176179 Reported by: Russell Cattelan (sigdeferstop() recursion) Reviewed by: kib MFC after: 1 month	2013-02-21 19:02:50 +00:00
Warner Losh	b96f7e0a60	The request queue is already locked, so we don't need the splsofclock/splx here to note future work.	2013-02-21 02:43:44 +00:00
Konstantin Belousov	6168020f66	Be conservative and do not try to consume more bytes than was requested from the server for the read operation. Server shall not reply with too large size, but client should be resilent too. Reviewed by: rmacklem MFC after: 1 week	2013-01-27 09:34:25 +00:00
John Baldwin	a89a2c8ba4	Further cleanups to use of timestamps in NFS: - Use NFSD_MONOSEC (which maps to time_uptime) instead of the seconds portion of wall-time stamps to manage timeouts on events. - Remove unused nd_starttime from the per-request structure in the new NFS server. - Use nanotime() for the modification time on a delegation to get as precise a time as possible. - Use time_second instead of extracting the second from a call to getmicrotime(). Submitted by: bde (3) Reviewed by: bde, rmacklem MFC after: 2 weeks	2013-01-25 15:25:24 +00:00
John Baldwin	d177f14da9	Use vfs_timestamp() to set file timestamps rather than invoking getmicrotime() or getnanotime() directly in NFS. Reviewed by: rmacklem, bde MFC after: 1 week	2013-01-18 18:43:38 +00:00
John Baldwin	39804bc89d	Remove a no-longer-used variable after the previous change to use VA_UTIMES_NULL. Submitted by: bde, rmacklem MFC after: 1 week	2013-01-17 18:45:20 +00:00
John Baldwin	5055536eec	Use the VA_UTIMES_NULL flag to detect when NULL was passed to utimes() instead of comparing the desired time against the current time as a heuristic. Reviewed by: rmacklem MFC after: 1 week	2013-01-16 21:52:31 +00:00
Rick Macklem	ef8f1261d2	Add "nfsstat -m" support for the two new NFS mount options added by r244042.	2012-12-09 22:23:50 +00:00
Rick Macklem	1f60bfd822	Move the NFSv4.1 client patches over from projects/nfsv4.1-client to head. I don't think the NFS client behaviour will change unless the new "minorversion=1" mount option is used. It includes basic NFSv4.1 support plus support for pNFS using the Files Layout only. All problems detecting during an NFSv4.1 Bakeathon testing event in June 2012 have been resolved in this code and it has been tested against the NFSv4.1 server available to me. Although not reviewed, I believe that kib@ has looked at it.	2012-12-08 22:52:39 +00:00
Gleb Smirnoff	eb1b1807af	Mechanically substitute flags from historic mbuf allocator with malloc(9) flags within sys. Exceptions: - sys/contrib not touched - sys/mbuf.h edited manually	2012-12-05 08:04:20 +00:00
Rick Macklem	99d2727d67	Add an nfssvc() option to the kernel for the new NFS client which dumps out the actual options being used by an NFS mount. This will be used to implement a "-m" option for nfsstat(1). Reviewed by: alfred MFC after: 2 weeks	2012-12-02 01:16:04 +00:00
Attilio Rao	c6e0355cee	r16312 is not any longer real since many years (likely since when VFS received granular locking) but the comment present in UFS has been copied all over other filesystems code incorrectly for several times. Removes comments that makes no sense now. Reviewed by: kib MFC after: 3 days	2012-11-19 22:43:45 +00:00

... 3 4 5 6 7 ...

651 commits