Since non-doomed unionfs vnodes always share their primary lock with
either the lower or upper vnode, any forwarded call to the base FS
which transiently drops that upper or lower vnode lock may result in
the unionfs vnode becoming completely unlocked during that transient
window. The unionfs vnode may then become doomed by a concurrent
forced unmount, which can lead to either or both of the following:
--Complete loss of the unionfs lock: in the process of being
doomed, the unionfs vnode switches back to the default vnode lock,
so even if the base FS VOP reacquires the upper/lower vnode lock,
that no longer translates into the unionfs vnode being relocked.
This will then violate that caller's locking assumptions as well
as various assertions that are enabled with DEBUG_VFS_LOCKS.
--Complete less of reference on the upper/lower vnode: the caller
normally holds a reference on the unionfs vnode, while the unionfs
vnode in turn holds references on the upper/lower vnodes. But in
the course of being doomed, the unionfs vnode will drop the latter
set of references, which can effectively lead to the base FS VOP
executing with no references at all on its vnode, violating the
assumption that vnodes can't be recycled during these calls and
(if lucky) violating various assertions in the base FS.
Fix this by adding two new functions, unionfs_forward_vop_start_pair()
and unionfs_forward_vop_finish_pair(), which are intended to bookend
any forwarded VOP which may transiently unlock the relevant vnode(s).
These functions are currently only applied to VOPs that modify file
state (and require vnode reference and lock state to be identical at
call entry and exit), as the common reason for transiently dropping
locks is to update filesystem metadata.
Reviewed by: olce
Tested by: pho
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D44076
unionfs_mkshadowdir() may be invoked on a non-leaf pathname component
during lookup, in which case the NUL terminator of the pathname buffer
will be well beyond the end of the current component. cn_namelen in
this case will still (correctly) indicate the length of only the
current component, but ZFS in particular does not currently respect
cn_namelen, leading to the creation on inacessible files with slashes
in their names. Work around this behavior by temporarily NUL-
terminating the current pathname component for the call to VOP_MKDIR().
https://github.com/openzfs/zfs/issues/15705 has been filed to track
a proper upstream fix for the issue at hand.
PR: 275871
Reported by: Karlo Miličević <karlo98.m@gmail.com>
Tested by: Karlo Miličević <karlo98.m@gmail.com>
Reviewed by: kib, olce
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D43818
Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.
Sponsored by: Netflix
If either the lower or upper vnode is found to be doomed after
locking it, the newly-created unionfs node won't be associated
with it and its lock will be dropped. In that case, clear the
uppervp and lowervp locals as necessary to avoid further use
of the vnode in unionfs_nodeget(). If the upper vnode is doomed
but the lower vnode remains valid, additionally reset the unionfs
node's v_vnlock field to point to the lower vnode lock.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D39767
If either of vnodes is shared locked, lock must not be recursed.
Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444
To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>
As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).
Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).
The new field lands in a 1-byte hole, thus it does not grow the struct.
Bump __FreeBSD_version to 1400077
Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759
instead of looking at SAVESTART
This is a step towards removing the flag.
Reviewed by: mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D34468
This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542
The unionfs root vnode will always share a lock with its lower vnode.
If unionfs was mounted with the 'below' option, this will also be the
vnode covered by the unionfs mount. During unmount, the covered vnode
will be locked by dounmount() while the unionfs root vnode will be
locked by vgone(). This effectively requires recursion on the same
underlying like, albeit through two different vnodes.
Reported by: pho
Reviewed by: kib, markj, pho
Differential Revision: https://reviews.freebsd.org/D34109
Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D34071
This avoids a potentially wild reference to the mount object.
Additionally, simplify some of the checks around VV_ROOT in
unionfs_nodeget().
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33914
I was somehow convinced that insmntque calls insmntque1 with a NULL
destructor. Unfortunately this worked well enough to not immediately
blow up in simple testing.
Keep not using the destructor in previously patched filesystems though
as it avoids unnecessary casts.
Noted by: kib
Reported by: pho
do_execve() will hold the vnode lock shared when it calls VOP_OPEN(),
but unionfs_open() requires the lock to be held exclusive to
correctly synchronize node status updates. This requirement is
asserted in unionfs_get_node_status().
Change unionfs_open() to temporarily upgrade the lock as is already
done in unionfs_close(). Related to this, fix various cases throughout
unionfs in which vnodes are not checked for reclamation following lock
upgrades that may have temporarily dropped the lock. Also fix another
related issue in which unionfs_lock() can incorrectly add LK_NOWAIT
during a downgrade operation, which trips a lockmgr assertion.
Reviewed by: kib (prior version), markj, pho
Reported by: pho
Differential Revision: https://reviews.freebsd.org/D33729
Use atomics to track the writecount granted to the underlying FS,
and avoid holding the vnode interlock while calling the underling FS'
VOP_ADD_WRITECOUNT(). This also fixes a WITNESS warning about nesting
the same lock type. Also add comments explaining why we need to track
the writecount on the unionfs vnode in the first place. Finally,
simplify writecount management to only use the upper vnode and assert
that we shouldn't have an active writecount on the lower vnode through
unionfs.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33611
Also remove a couple of write-only variables found by the recent clang
update. No functional change intended.
Discussed with: kib
Differential Revision: https://reviews.freebsd.org/D33008
Instead of validating that a vnode belongs to unionfs only when the
caller attempts to extract the upper or lower vnode pointers, do this
validation any time the caller tries to extract a unionfs_node from
the vnode private data.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32629
The lower FS VOP_READDIR() shouldn't return an empty read without
setting EOF; don't try to handle this case only for non-DIAGNOSTIC
builds.
Noted by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32629
Add an assertion to unionfs_node_update() that the upper vnode is
exclusively locked; we already make the same assertion for the lower
vnode.
Also, assert in unionfs_noderem() that the vnode lock is not recursed
and acquire v_lock with LK_NOWAIT. Since v_lock is not the active
lock for the vnode at this point, it should not be contended.
Finally, remove VDIR assertions from unionfs_get_cached_vnode().
lvp/uvp will be referenced but not locked at this point, so v_type
may concurrently change due to vgonel(). The cached unionfs node,
if one exists, would only have made it into the cache if lvp/uvp
were of type VDIR at the time of insertion; the corresponding
VDIR assert in unionfs_ins_cached_vnode() should be safe because
lvp/uvp will be locked by that time and will not be used if either
is doomed.
Noted by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32629
--Clearing cached subdirectories in unionfs_noderem() should be done
under the vnode interlock
--When preparing to switch the vnode lock in both unionfs_node_update()
and unionfs_noderem(), the incoming lock should be acquired before
updating the v_vnlock field to point to it. Otherwise we effectively
break the locking contract for a brief window.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32629
unionfs uses a per-directory hashtable to cache subdirectory nodes.
Currently this hashtable is looked up using the directory name, but
since unionfs nodes aren't removed from the cache until they're
reclaimed, this poses some problems. For example, if a directory is
created on a unionfs mount shortly after deleting a previous directory
with the same path, the cache may end up reusing the node for the
previous directory, including its upper/lower FS vnodes. Operations
against those vnodes with then likely fail because the vnodes
represent deleted files; for example UFS will reject VOP_MKDIR()
against such a vnode because its effective link count is 0. This may
then manifest as e.g. mkdir(2) or open(2) returning ENOENT for an
attempt to create a file under the re-created directory.
While it would be possible to fix this by explicitly managing the
name-based cache during delete or rename operations, or by rejecting
cache hits if the underlying FS vnodes don't match those passed to
unionfs_nodeget(), it seems cleaner to instead hash the unionfs nodes
based on their underlying FS vnodes. Since unionfs prefers to operate
against the upper vnode if one is present, the lower vnode will only
be used for hashing as long as the upper vnode is NULL. This should
also make hashing faster by eliminating string traversal and using
the already-computed hash index stored in each vnode.
While here, fix a couple of other cache-related issues:
--Remove 8 bytes of unnecessary baggage from each unionfs node by
getting rid of the stored hash mask field. The mask is knowable
at compile time.
--When a matching node is found in the cache, reference its vnode
using vrefl() while still holding the vnode interlock. Previously
unionfs_nodeget() would vref() the vnode after the interlock was
dropped, but the vnode may be reclaimed during that window. This
caused intermittent panics from vn_lock(9) during unionfs stress
testing.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D32533
"rm-style" system calls such as kern_frmdirat() and kern_funlinkat()
don't supply SAVENAME to preserve the pathname buffer for subsequent
vnode ops. For unionfs this poses an issue because the pathname may
be needed for a relookup operation in unionfs_remove()/unionfs_rmdir().
Currently unionfs doesn't check for this case, leading to a panic on
DIAGNOSTIC kernels and use-after-free of cn_nameptr otherwise.
The unionfs node's stored buffer would suffice as a replacement for
cnp->cn_nameptr in some (but not all) cases, but it's cleaner to just
ensure that unionfs vnode ops always have a valid cn_nameptr by setting
SAVENAME in unionfs_lookup().
While here, do some light cleanup in unionfs_lookup() and assert that
HASBUF is always present in the relevant relookup calls.
Reported by: pho
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D32148
This fixes an insta-panic when attempting to use unionfs with
DEBUG_VFS_LOCKS. Note that unionfs still has a long way to
go before it's generally stable or usable.
Reviewed by: kib (prior version), markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31917
Running stress2 unionfs tests reliably produces a namei_zone corruption
panic due to unionfs_relookup() attempting to NUL-terminate a newly-
allocate pathname buffer without first validating the buffer length.
Instead, avoid allocating new pathname buffers in unionfs entirely,
using already-provided buffers while ensuring the the correct flags
are set in struct componentname to prevent freeing or manipulation
of those buffers at lower layers.
While here, also compute and store the path length once in the unionfs
node instead of constantly invoking strlen() on it.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31728
Each unionfs node holds a reference to its parent directory vnode.
A single open file reference can therefore end up keeping an
arbitrarily deep vnode hierarchy in place. When that reference is
released, the resulting VOP_RECLAIM call chain can then exhaust the
kernel stack.
This is easily reproducible by running the unionfs.sh stress2 test.
Fix it by deferring recursive unionfs vnode release to taskqueue
context.
PR: 238883
Reviewed By: kib (earlier version), markj
Differential Revision: https://reviews.freebsd.org/D30748
Allocate nameidata on stack and NDPREINIT() it, for compatibility with
assumptions from other filesystems' lookup code.
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
For the most part the code was passing the LK_RELEASE flag.
The 2 cases which did not use the VOP_UNLOCK_FLAGS macro.
This fixes a panic when stacking unionfs on top of e.g., tmpfs when
debug is enabled.
Note there are latent bugs which prevent unionfs from working with debug
regardless of this change.
PR: 243064
Reported by: Mason Loring Bliss
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
- Provide unionfs_add_writecount() which passes the writecount to the
lower or upper vnode as appropriate.
- In unionfs VOP_RECLAIM() implementation, annulate unionfs
writecounts from upper or lower vnode. It is not clear that it is
always correct to remove the all references from either lower or
upper vnode, but we currently do not track which vnode get how many
refs anyway.
Reported and tested by: t_uemura@macome.co.jp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.
Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.
Right now, each VOP_LOOKUP() implementation explicitely knowns about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter(). Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.
Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.
Suggested by: Chris Torek <torek@pi-coral.com>
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
received granular locking) but the comment present in UFS has been
copied all over other filesystems code incorrectly for several times.
Removes comments that makes no sense now.
Reviewed by: kib
MFC after: 3 days
was still possible to open for write from the lower filesystem. There
is a symmetric situation where the binary could already has file
descriptors opened for write, but it can be executed from the nullfs
overlay.
Handle the issue by passing one v_writecount reference to the lower
vnode if nullfs vnode has non-zero v_writecount. Note that only one
write reference can be donated, since nullfs only keeps one use
reference on the lower vnode. Always use the lower vnode v_writecount
for the checks.
Introduce the VOP_GET_WRITECOUNT to read v_writecount, which is
currently always bypassed to the lower vnode, and VOP_ADD_WRITECOUNT
to manipulate the v_writecount value, which manages a single bypass
reference to the lower vnode. Caling the VOPs instead of directly
accessing v_writecount provide the fix described in the previous
paragraph.
Tested by: pho
MFC after: 3 weeks
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.
Conducted and reviewed by: attilio
Tested by: pho