Commit graph

2520 commits

Author SHA1 Message Date
Konstantin Belousov 026502d9ed UFS quotaoff: start write before unbusying
Otherwise the mount point could be unmounted meantime.

Reported and tested by:	pho
Reviewed by:	jah
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35638
2022-06-29 12:36:59 +03:00
Konstantin Belousov bc6d0d72f4 UFS rename: make it reliable when using SU and reaching nlink limit
PR:	165392
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35577
2022-06-24 17:46:26 +03:00
Kirk McKusick ce6296caa3 Fix build break in 50dc4c7.
No functional change intended.

MFC after:   1 month (with 076002f24d)
2022-06-23 19:54:18 -07:00
Kirk McKusick 50dc4c7df4 When a superblock integrity check fails, report the cause of the failure.
No functional change intended.

MFC after:   1 month (with 076002f24d)
Differential Revision: https://reviews.freebsd.org/D35219
2022-06-23 17:39:53 -07:00
Chuck Silvers f1b4324b81 ffs: fix vn_read_from_obj() usage for PAGE_SIZE > block size
vn_read_from_obj() requires that all pages of a vnode (except the last
partial page) be either completely valid or completely invalid,
but for file systems with block size smaller than PAGE_SIZE,
partially valid pages may exist anywhere in the file.
Do not enable the vn_read_from_obj() path in this case.

Reviewed by:	mckusick, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34836
2022-06-22 14:57:29 -07:00
Konstantin Belousov 8db679af66 UFS: make mkdir() and link() reliable when using SU and reaching nlink limit
i_nlink overflow might be transient, i_effnlink indicates the final
value of the link count after all dependencies would be resolved. So if
i_nlink reached the maximum but i_efflink did not, we should be able to
make the link by syncing.

We must sync the whole filesystem to resolve dependencies,
which requires unlocking vnodes locked for VOPs.  Use existing
ERELOOKUP/VOP_UNLOCK_PAIR() mechanism to restart the VOP if sync with
unlock was done.

PR:	165392
Reported by:	Vsevolod Volkov <vvv@colocall.net>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35514
2022-06-22 15:35:47 +03:00
Chuck Silvers 82817f26f8 ffs: fix vn_io_fault_pgmove() offset for PAGE_SIZE > block size
The "offset" argument to vn_io_fault_pgmove() is supposed to be
the offset within the page, but for ffs we currently use the offset
within the block.  When the block size is at least as large as the
page size then these values are the same, but when the page size is
larger than the block size then we need to add the offset of
the block within the page as well.

Sponsored by:	Netflix

Reviewed by:	mckusick, kib, markj
Differential Revision:	https://reviews.freebsd.org/D34835
2022-06-21 17:54:18 -07:00
Kirk McKusick 800a53b445 Bug fix to UFS/FFS superblock integrity checks when reading a superblock.
One of the checks was that the cylinder group size (fs_cgsize)
matched that calculated by CGSIZE(). The value calculated by CGSIZE()
has changed over time as the filesystem has evolved. Thus comparing
the value of CGSIZE() of the current generation filesystem may not
match the size as computed by CGSIZE() that was in effect at the
time an older filesystem was created. Therefore the check for
fs_cgsize is changed to simply ensure that it is not larger than
the filesystem blocksize (fs_bsize).

Reported by: Martin Birgmeier
Tested by:   Martin Birgmeier
MFC after:   1 month (with 076002f24d)
PR:          264450
Differential Revision: https://reviews.freebsd.org/D35219
2022-06-11 11:05:14 -07:00
Gordon Bergling a429d3050e ufs: Fix a typo a source code comment
- s/droped/dropped/

MFC after:	3 days
2022-06-04 15:23:53 +02:00
Kirk McKusick bc218d8920 Two bug fixes to UFS/FFS superblock integrity checks when reading a superblock.
Two bugs have been reported with the UFS/FFS superblock integrity
checks that were added in commit 076002f24d.

The code checked that fs_sblockactualloc was properly set to the
location of the superblock. The fs_sblockactualloc field was an
addition to the superblock in commit dffce2150e on Jan 26 2018
and used a field that was zero in filesystems created before it
was added. The integrity check had to be expanded to accept the
fs_sblockactualloc field being zero so as not to reject filesystems
created before Jan 26 2018.

The integrity check set an upper bound on the value of fs_maxcontig
based on the maximum transfer size supported by the kernel. It
required that fs->fs_maxcontig <= maxphys / fs->fs_bsize. The kernel
variable maxphys defines the maximum transfer size permitted by the
controllers and/or buffering. The fs_maxcontig parameter controls the
maximum number of blocks that the filesystem will read or write in
a single transfer. It is calculated when the filesystem is created
as maxphys / fs_bsize. The bug appeared in the loader because it
uses a maxphys of 128K even when running on a system that supports
larger values. If the filesystem was built on a system that supports
a larger maxphys (1M is typical) it will have configured fs_maxcontig
for that larger system so would fail the test when run with the smaller
maxphys used by the loader. So we bound the upper allowable limit
for fs_maxconfig to be able to at least work with a 1M maxphys on the
smallest block size filesystem: 1M / 4096 == 256. We then use the
limit for fs_maxcontig as fs_maxcontig <= MAX(256, maxphys / fs_bsize).
There is no harm in allowing the mounting of filesystems that make larger
than maxphys I/O requests because those (mostly 32-bit machines) can
(very slowly) handle I/O requests that exceed maxphys.

Thanks to everyone who helped sort out the problems and the fixes.

Reported by:  Cy Schubert, David Wolfskill
Diagnosis by: Mark Johnston, John Baldwin
Reviewed by:  Warner Losh
Tested by:    Cy Schubert, David Wolfskill
MFC after:    1 month (with 076002f24d)
Differential Revision: https://reviews.freebsd.org/D35219
2022-05-31 19:58:37 -07:00
Kirk McKusick 076002f24d Do comprehensive UFS/FFS superblock integrity checks when reading a superblock.
Historically only minimal checks were made of a superblock when it
was read in as it was assumed that fsck would have been run to
correct any errors before attempting to use the filesystem. Recently
several bug reports have been submitted reporting kernel panics
that can be triggered by deliberately corrupting filesystem superblocks,
see Bug 263979 - [meta] UFS / FFS / GEOM crash (panic) tracking
which is tracking the reported corruption bugs.

This change upgrades the checks that are performed. These additional
checks should prevent panics from a corrupted superblock. Although
it appears in only one place, the new code will apply to the kernel
modules and (through libufs) user applications that read in superblocks.

Reported by:  Robert Morris and Neeraj
Reviewed by:  kib
Tested by:    Peter Holm
PR:           263979
MFC after:    1 month
Differential Revision: https://reviews.freebsd.org/D35219
2022-05-27 12:22:07 -07:00
Kirk McKusick 187d7e9821 Reduce code nesting in readsuper().
No functional change.
2022-05-15 15:02:24 -07:00
Konstantin Belousov ca7c2d2eed UFS: clear fs_fmod once more, in the buffer data copy.
This is needed for in-kernel copy of the code, where allocation might
happen after fs_fmod is cleared in ffs_sbput() but before the write.

Reported by:	markj
Reviewed by:	chs, markj
PR:	263765
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35149
2022-05-09 23:46:05 +03:00
Konstantin Belousov 4ac2df8f4c ffs_use_bwrite: make the superblock snapshot more consistent
Copy in-memory struct fs to the superblock buffer under the UFS mutex.

Reviewed by:	chs, markj
PR:	263765
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35149
2022-05-09 23:45:27 +03:00
Stefan Eßer ecbbb0c85e ffs: plug a set-but-not-used var 2022-04-19 16:51:12 +02:00
Konstantin Belousov 5c075d6404 ufs/acl.h: forward-declare struct inode
Right now it is incidentally declared in sys/lockf.h, which will be
corrected shortly.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov 8cc19b1e47 Style.
Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Gordon Bergling d4b3b0c2ef ufs: Fix a typo in a source code comment
- s/explicitely/explicitly/

MFC after:	3 days
2022-04-09 09:13:31 +02:00
Chuck Silvers 3dc5f8e19d ffs: wait for trims earlier during unmount to avoid panic
All softdep processing is supposed to be completed by
softdep_flushfiles() and no more deps are supposed to be created after
that, but if a pending trim completes after softdep_flushfiles() and
before softdep_unmount() then the blkfree that is performed by
ffs_blkfree_trim_task() will create a dep when none should exist, and
if softdep_unmount() is called before that dep is freed then the
kernel will panic.  Prevent this by waiting for trims to complete
earlier in the unmount process, in ffs_flushfiles(), so that any deps
will be freed and any modified CG buffers will be flushed by the final
fsync of the devvp in ffs_flushfiles() as intended.

Reviewed by:	mckusick, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34806
2022-04-08 10:19:40 -07:00
Gordon Bergling 2733b242e4 ffs(3): Fix a common typo in source code comments
- s/quadradically/quadratically/

Obtained from:	NetBSD
MFC after:	3 days
2022-03-28 19:37:03 +02:00
Mateusz Guzik bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) 2022-03-24 10:20:51 +00:00
Robert Wing ab2dbd9b87 ffs_mount(): fix snapshotting
Commit 0455cc7104 broke snapshotting for ffs. In that commit,
ffs_mount() was changed so the namei() lookup for a disk device happens
before ffs_snapshot(). This caused the issue where namei() would lookup
the snapshot file and fail because the file doesn't exist. Even if it did
exist, taking a snapshot would still fail since it's not a disk device.

Fix this by taking a snapshot of the filesystem as-is and return without
altering ro/rw or any other attributes that are passed in.

Reported by:    pho
Reviewed by:	mckusick
Fixes: 0455cc7104 ("ffs_mount(): return early if namei() fails to lookup disk device")
Differential Revision:	https://reviews.freebsd.org/D34562
2022-03-16 17:32:37 -08:00
Robert Wing 0455cc7104 ffs_mount(): return early if namei() fails to lookup disk device
With soft updates enabled, an INVARIANTS panic is hit in ffs_unmount().

The problem occurs in ffs_mount() when upgrading a mount from ro->rw.
During a mount update, the soft update code gets set up but doesn't get
cleaned up if namei() fails when looking up the disk device.

Avoid this scenario by looking up the disk device first and bail early
if the namei() lookup fails.

PR:             256511
MFC After:      2 weeks
Reviewed by:	mckusick, kib
Differential Revision:	https://reviews.freebsd.org/D30870
2022-03-07 10:48:44 -09:00
Konstantin Belousov 0af463e661 ffs_read(): lock buffers after snaplk with LK_NOWITNESS
Reviewed and tested by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34179
2022-02-06 03:26:22 +02:00
Konstantin Belousov 303d3ae7e8 ufs, msdosfs: do not record witness order when creating vnode
When allocating new vnode, we need to lock it exclusively before
making it externally visible.  Since other threads cannot observe the
vnode yet, current lock order cannot create LoR conditions.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34126
2022-02-01 10:51:55 +02:00
Konstantin Belousov 99aa3b731c ffs: lock buffers after snaplk with LK_NOWITNESS
Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34073
2022-02-01 06:54:50 +02:00
Konstantin Belousov e11b2b69c5 ffs_alloc.c: order includes alphabetically
Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34073
2022-02-01 06:54:50 +02:00
Konstantin Belousov 8d8589b385 ufs: be more persistent with finishing some operations
when the vnode is doomed after relock.  The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it is
still belongs to UFS, which is determined by non-NULL v_data.  Not
finishing some operations, e.g. not syncing the inode block only because
the vnode started reclamation, is not correct.

Add macro IS_UFS() which incapsulates the v_data != NULL, and use it
instead of VN_IS_DOOMED() for places where the operation completion is
important.

Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:21 +02:00
Konstantin Belousov 4559700a0a ffs_snapblkfree(): add a comment explaining lockmgr invocation
Reviewed by:	markj, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:21 +02:00
Konstantin Belousov 0cdc603308 ufs: Use IS_SNAPSHOT()
Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:21 +02:00
Kirk McKusick ddf162d1d1 ufs: handle LoR between snap lock and vnode lock
When a filesystem is mounted all of its associated snapshots must
be activated. It first allocates a snapshot lock (snaplk) that will
be shared by all the snapshot vnodes associated with the filesystem.
As part of each snapshot file activation, it must replace its own
ufs vnode lock with the snaplk. In this way acquiring the snaplk
gives exclusive access to all the snapshots for the filesystem.

A write to a ufs vnode first acquires the ufs vnode lock for the
file to be written then acquires the snaplk. Once it has the snaplk,
it can check all the snapshots to see if any of them needs to make
a copy of the block that is about to be written. This ffs_copyonwrite()
code path establishes the ufs vnode followed by snaplk locking
order.

When a filesystem is unmounted it has to release all of its snapshot
vnodes. Part of doing the release is to revert the snapshot vnode
from using the snaplk to using its original vnode lock. While holding
the snaplk, the vnode lock has to be acquired, the vnode updated
to reference it, then the snaplk released. Acquiring the vnode lock
while holding the snaplk violates the ufs vnode then snaplk order.
Because the vnode lock is unused, using LK_EXCLUSIVE | LK_NOWAIT
to acquire it will always succeed and the LK_NOWAIT prevents the
reverse lock order from being recorded.

This change was made in January 2021 (173779b98f) to avoid an LOR
violation in ffs_snapshot_unmount(). The same LOR issue was recently
found again when removing a snapshot in ffs_snapremove() which must
also revert the snaplk to the original vnode lock as part of freeing it.

The unwind in ffs_snapremove() deals with the case in which the
snaplk is held as a recursive lock holding multiple references.
Specifically an equal number of references are made on the vnode
lock. This change factors out the lock reversion operations into a
new function revert_snaplock() which handles both the recursive
locks and avoids the LOR. The new revert_snaplock() function is
then used in both ffs_snapshot_unmount() and in ffs_snapremove().

Reviewed by:  kib
Tested by:    Peter Holm
MFC after:    2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33946
2022-01-27 23:03:35 -08:00
Kirk McKusick 7ef56fb049 Avoid unnecessary setting of UFS flag requesting fsck(8) be run.
When the kernel is requested to mount a filesystem with a bad superblock
check hash, it would set the flag in the superblock requesting that the
fsck(8) program be run. The flag is only written to disk as part of a
superblock update. Since the superblock always has its check hash updated
when it is written to disk, the problem for which the flag has been set
will no longer exist. Hence, it is counter-productive to set the flag
as it will just cause an unnecessary run of fsck if it ever gets written.

Sponsored by: Netflix
2022-01-09 16:18:28 -08:00
Kirk McKusick 1fbcaa13b0 When doing a read-only mount of a UFS filesystem using gjournal(8),
suppress error message about a missing gjournal provider.

Submitted by: Andreas Longwitz
MFC after:    2 weeks
Sponsored by: Netflix
2022-01-02 14:04:39 -08:00
Jessica Clarke 324150d6da ufs: Avoid subobject overflow in snapshot expunge code
The code here tries to be smart and zeroes out both di_db and di_ib with
a single bzero call, thereby overrunning the di_db subobject. This is
fine on most architectures, if a little dodgy. However, on CHERI, the
compiler can optionally restrict the bounds on pointers to subobjects to
just that subobject, in order to mitigate intra-object buffer overflows,
and this is enabled in CheriBSD's pure-capability kernels.

Instead, use separate bzero calls for each array, and let the compiler
optimise it as it sees fit; even if it's not generating inline zeroing
code, Clang will happily optimise two consecutive bzero's to a single
larger call.

Reviewed by:	mckusick
Differential Revision:	https://reviews.freebsd.org/D33651
2022-01-02 20:55:49 +00:00
Jessica Clarke 5b13fa7987 ufs: Rework shortlink handling to avoid subobject overflows
Shortlinks occupy the space of both di_db and di_ib when used. However,
everywhere that wants to read or write a shortlink takes a pointer do
di_db and promptly runs off the end of it into di_ib. This is fine on
most architectures, if a little dodgy. However, on CHERI, the compiler
can optionally restrict the bounds on pointers to subobjects to just
that subobject, in order to mitigate intra-object buffer overflows, and
this is enabled in CheriBSD's pure-capability kernels.

Instead, clean this up by inserting a union such that a new di_shortlink
can be added with the right size and element type, avoiding the need to
cast and allowing the use of the DIP macro to access the field. This
also mirrors how the ext2fs code implements extents support, with the
exact same structure other than having a uint32_t i_data[] instead of a
char di_shortlink[].

Reviewed by:	mckusick, jhb
Differential Revision:	https://reviews.freebsd.org/D33650
2022-01-02 20:55:36 +00:00
Alan Somers b214fcceac Change VOP_READDIR's cookies argument to a **uint64_t
The cookies argument is only used by the NFS server.  NFSv2 defines the
cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits.  Our
VOP_READDIR, however, has always defined it as u_long, which is 32 bits
on some architectures.  Change it to 64 bits on all architectures.  This
doesn't matter for any in-tree file systems, but it matters for some
FUSE file systems that use 64-bit directory cookies.

PR:             260375
Reviewed by:    rmacklem
Differential Revision: https://reviews.freebsd.org/D33404
2021-12-15 20:54:57 -07:00
Gordon Bergling f9af3151fa Revert "ffs(3): Fix a typo in a sysctl description"
It should be

- s/contigous/contiguous/ not continuous

Reported by:	tuexen@

This reverts commit 42efe994ec.
2021-12-05 13:45:47 +01:00
Gordon Bergling 42efe994ec ffs(3): Fix a typo in a sysctl description
- s/contigous/continuous/

MFC after:	3 days
2021-12-04 12:15:34 +01:00
Mateusz Guzik 7e1d3eefd4 vfs: remove the unused thread argument from NDINIT*
See b4a58fbf64 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.
2021-11-25 22:50:42 +00:00
Gordon Bergling bebff61587 ffs_softdep: Fix a typo in a source code comment
- s/conditonally/conditionally/

MFC after:	3 days
2021-11-19 19:17:41 +01:00
Konstantin Belousov c34a5148e8 ffs: fix newly introduced LOR between mntfs vnode lock and topology lock
The mntfs vnode lock should be before topology, as established in
ffs_mountfs().  Extend the locked region in ffs_unmount().

Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33013
2021-11-16 20:01:31 +02:00
Kirk McKusick 9b8eb1c5b6 Followup to f2b391528a to improve printed message.
Sponsored by: Netflix
2021-11-15 16:10:02 -08:00
Kirk McKusick 9e9dcac95a Allow forced r/w mount of UFS/FFS filesystem with a bad check hash.
Normally a UFS/FFS filesystem with a bad check hash can only be
mounted read only. With this commit the mount(8) -f (force) option
can be used to force a read-write mount of a UFS/FFS filesystem with
a bad check hash. Conveniently the filesystem will proceed to
update its on-disk superblock with a corrected check hash.

Sponsored by: Netflix
2021-11-15 16:03:47 -08:00
Kirk McKusick f2b391528a Add ability to suppress UFS/FFS superblock check-hash failure messages.
When reading UFS/FFS superblocks that have check hashes, both the kernel
and libufs print an error message if the check hash is incorrect. This
commit adds the ability to request that the error message not be made.
It is intended for use by programs like fsck that wants to print its
own error message and by kernel subsystems like glabel that just wants
to check for possible filesystem types.

This capability will be used in followup commits.

Sponsored by: Netflix
2021-11-15 09:11:54 -08:00
Kirk McKusick b366ee4868 Consolodate four copies of the STDSB define into a single place.
The STDSB macro is passed to the ffs_sbget() routine to fetch a
UFS/FFS superblock "from the stadard place". It was identically defined
in lib/libufs/libufs.h, stand/libsa/ufs.c, sys/ufs/ffs/ffs_extern.h,
and sys/ufs/ffs/ffs_subr.c. Delete it from these four files and
define it instead in sys/ufs/ffs/fs.h. All existing uses of this macro
already include sys/ufs/ffs/fs.h so no include changes need to be made.

No functional change intended.

Sponsored by: Netflix
2021-11-14 22:10:16 -08:00
Konstantin Belousov eede22d66d ffs_snapshot: do not assert that um_devvp is locked
It is not, and the lock is not needed there

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32761
2021-11-13 01:00:54 +02:00
Konstantin Belousov 25809a018d mntfs: lock mntfs pseudo devfs vnode properly
Require devvp locked for mntfs_freevp(), to have it locked around
vgone().  Make that true for ffs, which is the only consumer of
the interface.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32761
2021-11-13 01:00:41 +02:00
Konstantin Belousov 76b05e3e39 ffs: Remove assertions about locked um_devvp in several places
Namely, ffs_blkfree_cg(), and ffs_flushfiles().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32761
2021-11-13 01:00:33 +02:00
Konstantin Belousov 2030ee0e1b ufs: remove write-only variables
Mark variables as __diagused for invariant-only vars

Reviewed by:	imp, mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32577
2021-10-21 21:40:46 +03:00
Mateusz Guzik b4a58fbf64 vfs: remove cn_thread
It is always curthread.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D32453
2021-10-11 13:21:47 +00:00
Kyle Evans 6b88668f0b vfs: remove dead fifoop VOP_KQFILTER implementations
These began to become obsolete in d6d64f0f2c (r137739) and the deal
was later sealed in 003e18aef4 (r137801) when vfs.fifofs.fops was
dropped and vop-bypass for pipes became mandatory.

PR:		225934
Suggested by:	markj
Reviewe by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D32270
2021-10-03 01:02:51 -05:00
Robert Wing 9acea16404 ffs: retire unused fsckpid mount option
The fsckpid mount option was introduced in 927a12ae16 along with a
couple sysctl's to support SU+J with snapshots. However, those sysctl's
were never used and eventually removed in f2620e9ceb.

There are no in-tree consumers of this mount option.

Reviewed by:	mckusick, kib
Differential Revision:	https://reviews.freebsd.org/D32015
2021-10-02 15:11:40 -08:00
Kirk McKusick 4a365e863f Avoid "consumer not attached in g_io_request" panic when disk lost
while using a UFS snapshot.

The UFS filesystem supports snapshots. Each snapshot is a file whose
contents are a frozen image of the disk partition on which the filesystem
resides. Each time an existing block in the filesystem is modified,
the filesystem checks whether that block was in use at the time that
the snapshot was taken. If so, and if it has not already been copied,
a new block is allocated from among the blocks that were not in use
at the time that the snapshot was taken and placed in the snapshot file
to replace the entry that has not yet been copied. The previous contents
of the block are copied to the newly allocated snapshot file block,
and the write to the original is then allowed to proceed.

The block allocation is done using the usual UFS_BALLOC() routine
which allocates the needed block in the snapshot and returns a
buffer that is set up to write data into the newly allocated block.
In usual filesystem operation, the contents for the new block is
copied from user space into the buffer and the buffer is then written
to the file using bwrite(), bawrite(), or bdwrite(). In the case of a
snapshot the new block must be filled from the disk block that is about
to be rewritten. The snapshot routine has a function readblock() that
it uses to read the `about to be rewritten' disk block.

/*
 * Read the specified block into the given buffer.
 */
static int
readblock(snapvp, bp, lbn)
	struct vnode *snapvp;
	struct buf *bp;
	ufs2_daddr_t lbn;
{
	struct inode *ip;
	struct bio *bip;
	struct fs *fs;

	ip = VTOI(snapvp);
	fs = ITOFS(ip);

	bip = g_alloc_bio();
	bip->bio_cmd = BIO_READ;
	bip->bio_offset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
	bip->bio_data = bp->b_data;
	bip->bio_length = bp->b_bcount;
	bip->bio_done = NULL;

	g_io_request(bip, ITODEVVP(ip)->v_bufobj.bo_private);
	bp->b_error = biowait(bip, "snaprdb");
	g_destroy_bio(bip);
	return (bp->b_error);
}

When the underlying disk fails, its GEOM module is removed.
Subsequent attempts to access it should return the ENXIO error.
The functionality of checking for the lost disk and returning
ENXIO is handled by the g_vfs_strategy() routine:

void
g_vfs_strategy(struct bufobj *bo, struct buf *bp)
{
	struct g_vfs_softc *sc;
	struct g_consumer *cp;
	struct bio *bip;

	cp = bo->bo_private;
	sc = cp->geom->softc;

	/*
	 * If the provider has orphaned us, just return ENXIO.
	 */
	mtx_lock(&sc->sc_mtx);
	if (sc->sc_orphaned || sc->sc_enxio_active) {
		mtx_unlock(&sc->sc_mtx);
		bp->b_error = ENXIO;
		bp->b_ioflags |= BIO_ERROR;
		bufdone(bp);
		return;
	}
	sc->sc_active++;
	mtx_unlock(&sc->sc_mtx);

	bip = g_alloc_bio();
	bip->bio_cmd = bp->b_iocmd;
	bip->bio_offset = bp->b_iooffset;
	bip->bio_length = bp->b_bcount;
	bdata2bio(bp, bip);
	if ((bp->b_flags & B_BARRIER) != 0) {
		bip->bio_flags |= BIO_ORDERED;
		bp->b_flags &= ~B_BARRIER;
	}
	if (bp->b_iocmd == BIO_SPEEDUP)
		bip->bio_flags |= bp->b_ioflags;
	bip->bio_done = g_vfs_done;
	bip->bio_caller2 = bp;
	g_io_request(bip, cp);
}

Only after checking that the device is present does it construct
the "bio" request and call g_io_request(). When readblock()
constructs its own "bio" request and calls g_io_request() directly
it panics with "consumer not attached in g_io_request" when the
underlying device no longer exists.

The fix is to have readblock() call g_vfs_strategy() rather than
constructing its own "bio" request:

/*
 * Read the specified block into the given buffer.
 */
static int
readblock(snapvp, bp, lbn)
	struct vnode *snapvp;
	struct buf *bp;
	ufs2_daddr_t lbn;
{
	struct inode *ip;
	struct fs *fs;

	ip = VTOI(snapvp);
	fs = ITOFS(ip);

	bp->b_iocmd = BIO_READ;
	bp->b_iooffset = dbtob(fsbtodb(fs, blkstofrags(fs, lbn)));
	bp->b_iodone = bdone;
	g_vfs_strategy(&ITODEVVP(ip)->v_bufobj, bp);
	bufwait(bp);
	return (bp->b_error);
}

Here it uses the buffer that will eventually be written to the disk.
The g_vfs_strategy() routine uses four parts of the buffer: b_bcount,
b_iocmd, b_iooffset, and b_data.

The b_bcount field is already correctly set for the buffer. It is
safe to set the b_iocmd and b_iooffset fields as they are set
correctly when the later write is done. The write path will also
clear the B_DONE flag that our use of the buffer will set. The
b_iodone callback has to be set to bdone() which will do just
notification that the I/O is done in bufdone(). The rest of
bufdone() includes things like processing the softdeps associated
with the buffer should not be done until the buffer has been
written. Bufdone() will set b_iodone back to NULL after using it,
so the full bufdone() processing will be done when the buffer is
written. The final change from the previous version of readblock()
is that it used the b_data for the destination of the read while
g_vfs_strategy() uses the bdata2bio() function to take advantage
of VMIO when it is available.

Differential revision: https://reviews.freebsd.org/D32150
Reviewed by:  kib, chs
MFC after:    1 week
Sponsored by: Netflix
2021-09-27 20:04:51 -07:00
Kirk McKusick d7770a5495 Eliminate snaplk / bufwait LOR when creating UFS snapshots
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of creating a
new UFS snapshot, it has to have its individual vnode lock replaced
with the filesystem's snapshot lock.

The lock order for regular vnodes with respect to buffer locks is that
they must first acquire the vnode lock, then a buffer lock. The order
for the snapshot lock is reversed: a buffer lock must be acquired before
the snapshot lock.

When creating a new snapshot, the snapshot file must retain its vnode
lock until it has allocated all the blocks that it needs before
switching to the snapshot lock. This update moves one final piece of
the initial snapshot block allocation so that it is done before the
newly created snapshot is switched to use the snapshot lock.

Reported by:  Witness code
MFC after:    1 week
Sponsored by: Netflix
2021-09-18 17:02:30 -07:00
Konstantin Belousov 197a4f29f3 buffer pager: allow get_blksize method to return error
Reported and reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31998
2021-09-17 20:29:55 +03:00
Robert Wing 440320b620 ffs: remove unused thread argument from ffs_reload()
MFC After:      1 week
Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D31127
2021-09-04 12:25:10 -08:00
Konstantin Belousov bb536de6c0 ffs_update(): Do not assume that EBUSY can only come LK_NOWAIT trylock
Instead do protective check for the local flags and do not interpret
EBUSY specially if we did not request trylock mode for bread().

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-08-31 07:38:35 +03:00
Konstantin Belousov f822d4feb8 ffs_update(): recalculate flags after relocking the vnode
Inode type could migrate between snapshot and regular types while the
vnode is unlocked.  Recalculate flags specific for snapshot after relock.

Reviewed by:	mckusick
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-08-31 07:38:35 +03:00
Keith Owens 3b29c8b4bd ddb: do not assume that ffs is mounted with softdep
Avoid a panic when debugging with "show ffs" in ddb.

Reviewed By:	kib, markj, mckusick
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31622
2021-08-24 21:00:19 -05:00
Gordon Bergling 464a166c27 ufs_dirhash: Correct a typo in a comment
- s/memry/memory/

MFC after:	3 days
2021-08-20 09:59:18 +02:00
Konstantin Belousov 8df4bc48c8 ufs rename: ensure that the result of ufs_checkpath() is stable
ufs_rename() calls ufs_checkpath() to ensure that the target directory
is not a child of the source.  If not, rename would create a loop.
For instance:
	source->X1->X2->target
and if source moved under target, we get corrupted filesystem.
Suppose that we initially have
	source->X1 .... and X2->target
where X1 is not on path from root to X2.  Then ufs_checkpath() accepts
the inodes, but there is nothing preventing parallel rename of X2 to become
under X1, after checkpath finished.

Ensure stability of ufs_checkpath() result by taking a per-mount sx in
ufs_rename right before ufs_checkpath() and till the end.

Reviewed by:	chs, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2021-08-13 17:52:26 +03:00
Konstantin Belousov 2e2212b4f5 Style: wrap the long line, definition of ufs_checkpath()
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-08-13 17:52:20 +03:00
Kirk McKusick a91716efeb Clean up orphaned indirdep dependency structures after disk failure.
During forcible unmount after a disk failure there is a bug that
causes one or more indirdep dependency structures to fail to be
deallocated. Until we manage to track down why they fail to get
cleaned up, this code tracks them down and eliminates them so that
the unmount can succeed.

Reported by:  Peter Holm
Help from:    kib
Reviewed by:  Chuck Silvers
Tested by:    Peter Holm
MFC after:    7 days
Sponsored by: Netflix
2021-07-29 16:31:16 -07:00
Kirk McKusick 412b5e40a7 Diagnotic improvement to soft dependency structure management.
The soft updates diagnotic code keeps a list for each type of soft
update dependency. When a new block is allocated for a file it is
initially tracked by a "newblk" dependency. The "newblk" dependency
eventually becomes either an "allocdirect" dependency or an "indiralloc"
dependency. The diagnotic code failed to move the "newblk" from the list
of "newblk"s to its new type list.

No functional change intended.

Reviewed by:  Chuck Silvers (as part of a larger change)
Tested by:    Peter Holm (as part of a larger change)
Sponsored by: Netflix
2021-07-29 16:13:54 -07:00
Jason A. Harmening 211ec9b7d6 FFS: remove ffs_fsfail_task
Now that dounmount() supports a dedicated taskqueue, we can simply call
it with MNT_DEFERRED directly from the failing context.  This also
avoids blocking taskqueue_thread with a potentially-expensive unmount
operation.

Reviewed by:	kib, mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D31016
2021-07-24 12:52:41 -07:00
Jason A. Harmening c746ed724d Allow stacked filesystems to be recursively unmounted
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem.  The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount().  This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount.  Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Reviewed by:	kib, mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D31016
2021-07-24 12:52:00 -07:00
John Baldwin 58109a87d4 Use an ANSI C function declaration for journal_check_space.
GCC6 fails to compile this due to a -Wstrict-prototypes error.

Sponsored by:	Chelsio Communications
2021-07-23 15:59:11 -07:00
Konstantin Belousov 50acaaef54 ffs_softdep: force sync if journal is low in journal_check_space
This effectively causes syncing of the mount point from softdep_prealloc(),
softdep_prerename(), and softdep_prelink().  Typically it avoids the need
for journal suspension at this point, at all.

Suggested and reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:47:05 +03:00
Konstantin Belousov 2126f103e0 ffs_softdep.c: add journal_check_space() helper
Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:47:05 +03:00
Konstantin Belousov 64b494a105 softdep_prelink(): only do sync if other thread changed the vnode metadata since previous prelink
We call into softdep_prerename() and softdep_prelink() when there is
low free space in the journal. Functions sync all vnodes participating
in the VOP, in the hope that this would reduce journal utilization. But
if the vnodes are already synced, doing sync would only spend writes,
journal is filled not due to the records from modifications of our
vnodes.

Remember original seqc numbers for vnodes, and only initiate syncs when
seqc changed.

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:54 +03:00
Konstantin Belousov f756546662 ufs_rename(): only do softdep_prerename() when other thread changed a vnode
Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Konstantin Belousov d4d289cd51 ffs: mark block (re-)allocations as seqc writes
Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Konstantin Belousov 5eacde3eb8 ufs_rename(): softdep_prerename() does something only for SU+J
so call it only in SU+J case

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Konstantin Belousov d0929a990c ffs: reduce number of dvp relocks in softdep_prelink()
If vp == NULL, we unlocked and then immediately relocked dvp there.

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Konstantin Belousov b2b40b28b1 ufs_vnops.c: style
Wrap too long functions declarations.

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Mark Johnston b2f9575646 ffs: Correct the input size check in sysctl_ffs_fsck()
Make sure we return an error if no input was specified, since
SYSCTL_IN() will report success in that case.

Reported by:	KMSAN
Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30586
2021-05-31 18:59:18 -04:00
Jason A. Harmening a4b07a2701 VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Also, add stbool.h to libprocstat modules which #define _KERNEL
before including sys/mount.h.  Otherwise they'll pull in sys/types.h
before defining _KERNEL and therefore won't have the bool definition
they need for mp_busy.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30556
2021-05-30 14:53:47 -07:00
Jason A. Harmening 271fcf1c28 Revert commits 6d3e78ad6c and 54256e7954
Parts of libprocstat like to pretend they're kernel components for the
sake of including mount.h, and including sys/types.h in the _KERNEL
case doesn't fix the build for some reason.  Revert both the
VFS_QUOTACTL() change and the follow-up "fix" for now.
2021-05-29 17:48:02 -07:00
Jason A. Harmening 6d3e78ad6c VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30218
2021-05-29 14:05:39 -07:00
Konstantin Belousov f784da883f Move mnt_maxsymlinklen into appropriate fs mount data structures
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-Note:	struct mount layout
Differential revision:	https://reviews.freebsd.org/D30325
2021-05-22 15:16:09 +03:00
Don Morris f17a590085 ufs: Avoid M_WAITOK allocations when building a dirhash
At this point the directory's vnode lock is held, so blocking while
waiting for free pages makes the system more susceptible to deadlock in
low memory conditions.  This is particularly problematic on NUMA systems
as UMA currently implements a strict first-touch policy.

ufsdirhash_build() already uses M_NOWAIT for other allocations and
already handled failures for the block array allocation, so just convert
to M_NOWAIT.

PR:		253992
Reviewed by:	markj, mckusick, vangyzen
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29045
2021-05-20 11:25:45 -04:00
Kirk McKusick 9a2fac6ba6 Fix handling of embedded symbolic links (and history lesson).
The original filesystem release (4.2BSD) had no embedded sysmlinks.
Historically symbolic links were just a different type of file, so
the content of the symbolic link was contained in a single disk block
fragment. We observed that most symbolic links were short enough that
they could fit in the area of the inode that normally holds the block
pointers. So we created embedded symlinks where the content of the
link was held in the inode's pointer area thus avoiding the need to
seek and read a data fragment and reducing the pressure on the block
cache. At the time we had only UFS1 with 32-bit block pointers,
so the test for a fastlink was:

	di_size < (NDADDR + NIADDR) * sizeof(daddr_t)

(where daddr_t would be ufs1_daddr_t today).

When embedded symlinks were added, a spare field in the superblock
with a known zero value became fs_maxsymlinklen. New filesystems
set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded
symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus
filesystems that preceeded this change always read from blocks
(since fs->fs_maxsymlinklen == 0) and newer ones used embedded
symlinks if they fit. Similarly symlinks created on pre-embedded
symlink filesystems always spill into blocks while newer ones will
embed if they fit.

At the same time that the embedded symbolic links were added, the
on-disk directory structure was changed splitting the former
u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen.
Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can
be used to distinguish old directory formats. In retrospect that
should have just been an added flag, but we did not realize we
needed to know about that change until it was already in production.

Code was split into ufs/ffs so that the log structured filesystem could
use ufs functionality while doing its own disk layout. This meant
that no ffs superblock fields could be used in the ufs code. Thus
ffs superblock fields that were needed in ufs code had to be copied
to fields in the mount structure. Since ufs_readlink needed to know
if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen.

The kernel panic that arose to making this fix was triggered when a
disk error created an inode of type symlink with no allocated data
blocks but a large size. When readlink was called the uiomove was
attempted which segment faulted.

static int
ufs_readlink(ap)
	struct vop_readlink_args /* {
		struct vnode *a_vp;
		struct uio *a_uio;
		struct ucred *a_cred;
	} */ *ap;
{
	struct vnode *vp = ap->a_vp;
	struct inode *ip = VTOI(vp);
	doff_t isize;

	isize = ip->i_size;
	if ((isize < vp->v_mount->mnt_maxsymlinklen) ||
	    DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
		return (uiomove(SHORTLINK(ip), isize, ap->a_uio));
	}
	return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred));
}

The second part of the "if" statement that adds

	DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */

is problematic. It never appeared in BSD released by Berkeley because
as noted above mnt_maxsymlinklen is 0 for old format filesystems, so
will always fall through to the VOP_READ as it should. I had to dig
back through `git blame' to find that Rodney Grimes added it as
part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.''
He must have brought it across from an earlier FreeBSD. Unfortunately
the source-control logs for FreeBSD up to the merger with the
AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the
agreement to let FreeBSD remain unencumbered, so I cannot pin-point
where that line got added on the FreeBSD side.

The one change needed here is that mnt_maxsymlinklen is declared as
an `int' and should be changed to be `u_int64_t'.

This discovery led us to check out the code that deletes symbolic
links. Specifically

	if (vp->v_type == VLNK &&
	    (ip->i_size < vp->v_mount->mnt_maxsymlinklen ||
	     datablocks == 0)) {
		if (length != 0)
			panic("ffs_truncate: partial truncate of symlink");
		bzero(SHORTLINK(ip), (u_int)ip->i_size);
		ip->i_size = 0;
		DIP_SET(ip, i_size, 0);
		UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE);
		if (needextclean)
			goto extclean;
		return (ffs_update(vp, waitforupdate));
	}

Here too our broken symlink inode with no data blocks allocated
and a large size will segment fault as we are incorrectly using the
test that we have no data blocks to decide that it is an embdedded
symbolic link and attempting to bzero past the end of the inode.
The test for datablocks == 0 is unnecessary as the test for
ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right
thing in all cases.

The test for datablocks == 0 was added by David Greenman in this commit:

Author: David Greenman <dg@FreeBSD.org>
Date:   Tue Aug 2 13:51:05 1994 +0000

    Completed (hopefully) the kernel support for old style "fastlinks".

    Notes:
	svn path=/head/; revision=1821

I am guessing that he likely earlier added the incorrect test in the
ufs_readlink code.

I asked David if he had any recollection of why he made this change.
Amazingly, he still had a recollection of why he had made a one-line
change more than twenty years ago. And unsurpisingly it was because
he had been stuck between a rock and a hard place.

FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code
base. Prior to that, there were three years of development in all
areas of the kernel, including the filesystem code, from the combined
set of people including Bill Jolitz, Patchkit contributors, and
FreeBSD Project members. The compatibility issue at hand was caused
by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite
changes David had to find a way to provide compatibility with both
the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite.
He felt that these changes would provide compatibility with both systems.

In his words:
``My recollection is that the 'FASTLINKS' symlinks support in
FreeBSD-1.x, as implemented by Curt Mayer, worked differently than
4.4BSD. He used a spare field in the inode to duplicately store the
length. When the 4.4BSD-Lite merge was done, the optimized symlinks
support for existing filesystems (those that were initialized in
FreeBSD-1.x) were broken due to the FFS on-disk structure of
4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to
restore the backward compatibility with FreeBSD-1.x filesystems.
I think it was the best that could be done in the somewhat urgent
circumstances of the post Berkeley-USL settlement. Also, regarding
Rod's massive commit with little explanation, some context: John
Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to
the 386 platform in just 10 days. It was by far the most intense
hacking effort of my life. In addition to the porting of tons of
FreeBSD-1 code, I think we wrote more than 30,000 lines of new code
in that time to deal with the missing pieces and architectural
changes of 4.4BSD-Lite. We didn't make many notes along the way.
There was a lot of pressure to get something out to the rest of the
developer community as fast as possible, so detailed discrete commits
didn't happen - it all came as a giant wad, which is why Rod's
commit message was worded the way it was.''

Reported by:  Chuck Silvers
Tested by:    Chuck Silvers
History by:   David Greenman Lawrence
MFC after:    1 week
Sponsored by: Netflix
2021-05-16 17:04:11 -07:00
Konstantin Belousov e3d6759585 b_vflags update requries bufobj lock
The trunc_dependencies() issue was reported by	Alexander Lochmann
<alexander.lochmann@tu-dortmund.de>, who found the problem by performing
lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-15 15:47:42 +03:00
Kirk McKusick 14d0cd7225 Ensure that the mount command shows "with quotas" when quotas are enabled.
When quotas are enabled with the quotaon(8) command, it sets the
MNT_QUOTA flag in the mount structure mnt_flag field. The mount
structure holds a cached copy of the filesystem statfs structure
in mnt_stat that includes a copy of the mnt_flag field in
mnt_stat.f_flags. The mnt_stat structure may not be updated for
hours. Since the mount command requests mount details using the
MNT_NOWAIT option, it gets the mount's mnt_stat statfs structure
whose f_flags field does not yet show the MNT_QUOTA flag being set
in mnt_flag.

The fix is to have quotaon(8) set the MNT_QUOTA flag in both mnt_flag
and in mnt_stat.f_flags so that it will be immediately visible to
callers of statfs(2).

Reported by:  Christos Chatzaras
Tested by:    Christos Chatzaras
PR:           254682
MFC after:    3 days
Sponsored by: Netflix
2021-04-14 15:25:08 -07:00
Konstantin Belousov 0b3948e73b softdep_unmount: assert that no dandling dependencies are left
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 7a8d4b4da6 FFS: assign fully initialized struct mount_softdeps to um_softdep
Other threads observing the non-NULL um_softdep can assume that it is
safe to use it. This is important for ro->rw remounts where change from
read-only to read-write status cannot be made atomic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 2af934cc15 Assert that um_softdep is NULL on free(ump), i.e. softdep_unmount() was called
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov f776c54cee ffs_mount: when remounting ro->rw and sbupdate failed, cleanup softdeps
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov d7e5e37416 softdep_unmount: handle spurious wakeups
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov fabbc3d879 softdep_flush(): do not access ump after we acked FLUSH_EXIT and unlocked SU lock
otherwise we might follow a pointer in the freed memory.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 7c7a6681fa ffs: clear MNT_SOFTDEP earlier when remounting rw to ro
Suppose that we remount rw->ro and in parallel some reader tries to
instantiate a vnode, e.g. during lookup.  Suppose that softdep_unmount()
already started, but we did not cleared the MNT_SOFTDEP flag yet.
Then ffs_vgetf() calls into softdep_load_inodeblock() which accessed
destroyed hashes and freed memory.

Set/clear fs_ronly simultaneously (WRT to files flush) with MNT_SOFTDEP.
It might be reasonable to move the change of fs_ronly to under MNT_ILOCK,
but no readers take it.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov 7f682bdcab Rework MOUNTED/DOING SOFTDEP/SUJ macros
Now MNT_SOFTDEP indicates that SU are active in any variant +-J, and
SU+J is indicated by MNT_SOFTDEP | MNT_SUJ combination.  The reason is
that unmount will be able to easily hide SU from other operations by
clearing MNT_SOFTDEP while keeping the record of the active journal.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov 81cdb19e04 ffs softdep: clear ump->um_softdep on softdep_unmount()
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov a285d3edac ffs_extern.h: Add comments for ffs_vgetf() flags
Requested and reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:59 +02:00
Konstantin Belousov fd97fa6463 Add FFSV_FORCEINODEDEP flag for ffs_vgetf()
It will be used to allow SU flush code to sync the volume while external
consumers see that SU is already disabled on the filesystem.  Use it where
ffs_vgetf() called by SU code to process dependencies.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:38 +02:00
Konstantin Belousov 25aac48d2c simplify journal_mount: move the out label after success block
This removes the need to check for error == 0.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:37 +02:00
Konstantin Belousov 8742817ba6 FFS extattr: fix handling of the tail
There are three issues with change that stopped truncating ea area before
write, and resulted in possible zero tail in the ea area:
- Truncate to zero checked i_ea_len after the reference was dropped,
  making the last drop effectively truncate to zero length always.
- Loop to fill uio for zeroing specified too large length, that triggered
  assert in normal situation.
- Integrity check could trip over the tail, instead we must allow
  partial header or header with zero length, and clamp ea image in
  memory at it.

Reported by:	arichardson
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Fixup:	5e198e7646
Differential Revision:	https://reviews.freebsd.org/D28999
2021-03-02 02:19:34 +02:00
Konstantin Belousov 6f30ac9995 Call softdep_prealloc() before taking ffs_lock_ea(), if unlock is committing
softdep_prealloc() must be called to ensure enough journal space is
available, before ffs_extwrite(). Also it must be done before taking
ffs_lock_ea(), because it calls ffs_syncvnode(), potentially dropping
the vnode lock.

Reviewed by:	mckusick
Tested by:	pho
MFC after:      1 week
Sponsored by:   The FreeBSD Foundation
2021-02-24 09:55:21 +02:00
Konstantin Belousov 5e198e7646 ffs_close_ea: do not relock vnode under lock_ea
ffs_lock_ea is after the vnode lock, so vnode must not be relocked under
lock_ea. Move ffs_truncate() call in ffs_close_ea() after the lock_ea is
dropped, and only truncate to length zero, since this is the only mode
supported by ffs_truncate() for EAs. Previously code did truncation and
then write.

Zero the part of the ext area that is unused, if truncation is due but not
done because ea area is not zero-length.

Reviewed by:	mckusick
Tested by:	pho
MFC after:      1 week
Sponsored by:   The FreeBSD Foundation
2021-02-24 09:55:04 +02:00
Konstantin Belousov c6d68ca842 ffs_vnops.c: style
Use local var to shorten ap->a_vp expression.

Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:53 +02:00
Konstantin Belousov 4983146279 ffs: do not call softdep_prealloc() from UFS_BALLOC()
Do it in ffs_write(), where we can gracefuly handle relock and its
consequences. In particular, recheck the v_data to see if the vnode
reclamation ended, and return EBADF when we cannot proceed with the
write.

Reviewed by:	mckusick
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:50 +02:00
Konstantin Belousov cc9958bf22 ffs_reallocblks: change the guard for softdep_prealloc() call to DOINGSUJ()
instead of DOINGSOFTDEP().  The softdep_prealloc() function does nothing
in SU case.

Note that the call should be safe with regard to the vnode relock,
because it is called with MNT_NOWAIT, which does not descend into fsync.

Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:30 +02:00
Konstantin Belousov 2bfd8992c7 vnode: move write cluster support data to inodes.
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by:	mjg
Reviewed by:	asomers (previous version), fsu (ext2 parts), mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov d485c77f20 Remove #define _KERNEL hacks from libprocstat
Make sys/buf.h, sys/pipe.h, sys/fs/devfs/devfs*.h headers usable in
userspace, assuming that the consumer has an idea what it is for.
Unhide more material from sys/mount.h and sys/ufs/ufs/inode.h,
sys/ufs/ufs/ufsmount.h for consumption of userspace tools, with the
same caveat.

Remove unacceptable hack from usr.sbin/makefs which relied on sys/buf.h
being unusable in userspace, where it override struct buf with its own
definition.  Instead, provide struct m_buf and struct m_vnode and adapt
code to use local variants.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov c31480a1f6 UFS snapshots: properly set the vm object size.
Citing Kirk:
The previous code [before 8563de2f27 -- kib] did not call
vnode_pager_setsize() but worked because later in ffs_snapshot() it
does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE()
allocated the extra block at the end of the file which caused it to do
the needed vnode_pager_setsize(). But the new code had already allocated
the extra block, so UFS_WRITE() did not extend the size and thus did not
do the vnode_pager_setsize().

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Reviewed by:	mckusick
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:11:52 +02:00
Kirk McKusick 8563de2f27 Fix bug 253158 - Panic: snapacct_ufs2: bad block - mksnap_ffs(8) crash
The panic reported in 253158 arises because the /mnt/.snap/.factory
snapshot allocated the last block in the filesystem. The snapshot
code allocates the last block in the filesystem as a way of setting
its length to be the size of the filesystem. Part of taking a
snapshot is to remove all the earlier snapshots from the image of
the newest snapshot so that newer snapshots will not claim the blocks
of the earlier snapshots. The panic occurs when the new snapshot
finds that both it and an earlier snapshot claim the same block.

The fix is to set the size of the snapshot to be one block after
the last block in the filesystem. This block can never be allocated
since it is not a valid block in the filesystem. This extra block
is used as a place to store the initial list of blocks that the
snapshot has already copied and is used to avoid a deadlock in and
speed up the ffs_copyonwrite() function.

Reported by:  Harald Schmalzbauer
Tested by:    Peter Holm
PR:           253158
Sponsored by: Netflix
2021-02-11 21:31:16 -08:00
Konstantin Belousov adf28ab456 fifo: minor comment and assert improvements.
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 26af9f72f7 ffs_unlock: assert that IN_ENDOFF is not leaked past locked scope
This catches both missed processing of IN_ENDOFF and missed application
of VOP_VPUT_PAIR() after VOP that created an entry in the directory.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 28703d2713 ffs softdep: Force processing of VI_OWEINACT vnodes when there is inode shortage
Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc()
is unable to find a free inode.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 2011b44fa3 softdep_request_cleanup: wait for softdep_request_clean_flush() to pass
if we noted a parallel request is active and declined to overflow the
system with parallel redundant sync of the vnodes.  But we need to wait
for the flush to finish to see if there are any freed resources.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 013168db8c ufs_inactive(): stop hiding ERELOOKUP from ffs_truncate(), return it.
VFS should retry inactivation when possible, then. This should provide
timely removal of unlinked unreferenced inodes.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 6aed2435c8 ufs vnops: brace softdep_prelink() with DOINGSUJ instead of DOINGSOFTDEP
because softdep_prelink() is reverted to NOP for non-J case.  There is no
need to do anything before ufs_direnter() in SU/non-J case, everything
required to sync the directory is done in VOP_VPUT_PAIR().

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 week
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov ede40b0675 ffs softdep: remove will_direnter argument of softdep_prelink()
Originally this was done in 8a1509e442 to forcibly cover cases
where a hole in the directory could be created by extending into
indirect block, since dependency of writing out indirect block is not
tracked.  This results in excessive amount of fsyncing the directories,
where all creation of new entry forced fsync before it.  This is not needed,
it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides
the required hook to only perform required syncing.

The series of changes culminating in this commit puts the performance of
metadata-intensive loads back to that before 8a1509e442.

Analyzed by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 06f2918ab8 ufs_direnter: directory truncation does not need special case for rename
In ufs_rename case, tdvp is locked from the place where ufs_direnter()
is done till VOP_VPUT_PAIR(), which means that we no longer need to specially
handle rename in ufs_direnter().  Truncation, if possible, is done in the
same way in ffs_vput_pair() both for rename and other VOPs calling
ufs_direnter().  Remove isrename argument and set IN_ENDOFF if
ufs_direnter() succeeded and directory needs truncation.

In ffs_vput_pair(), stop verifying the condition that directory needs
truncation when IN_ENDOFF is set, instead assert that the condition is
true.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 038fe6e089 ufs_rename: use VOP_VPUT_PAIR and rely on directory sync/truncation there
Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 74a3652f83 ufs_direnter: move directory truncation to ffs_vput_pair().
VOP_VPUT_PAIR() provides the hook to do the truncation right before
unlock, which is required since truncation might need to fsync(), which
itself might unlock the directory vnode.

Set new flag IN_ENDOFF which indicates that i_endoff is valid and should
be checked against inode size. Excessive size is chomped, but this
operation is advisory and failure to truncate should not result in the
failure of the main VOP.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 30bfb2fa0f ffs_vput_pair(): try harder to recover from the vnode reclaim
In particular, if unlock_vp is false, save vp's inode number and
generation. If ffs_inotovp() can re-create the vnode with the same
number and generation after we finished with handling dvp, then we most
likely raced with unmount, and were able to restore atomicity of open.
We use FFSV_REPLACE_DOOMED there, to drop the old vnode.

This additional recovery is not strictly required, but it improves the
quality of the implementation.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov f2c9d038bd FFS: implement special VOP_VPUT_PAIR().
It cleans IN_NEEDSYNC flag on dvp before returning, by applying
ffs_syncvnode() until success or an error different from ERELOOKUP.
IN_NEEDSYNC cleanup is required to avoid creating holes in the directories
when extended into indirect block.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov be44e98637 ffs_snapshot: use VOP_VPUT_PAIR after VOP_CREATE.
If the snapshot embrio was reclaimed under us, return error outright.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 08c2dc2841 ufs_direnter/SU: unconditionally UFS_UPDATE inode when extending directory
for all kinds of async/SU mount variants.

Submitted by:	mckusick
Reviewed by:	chs
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 1de1e2bfbf ffs_syncvnode: only clear IN_NEEDSYNC after successfull sync
If it is cleaned before the sync, other threads might see the inode without
the flag set, because syncing could unlock it.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 89fd61d955 Merge ufs_fhtovp() into ffs_inotovp().
The function alone was not used for anything but ffs_fstovp() for long time.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 5952c86c78 ffs_inotovp(): interface to convert (ino, gen) into alive vnode
It generalizes the VFS_FHTOVP() interface, making it possible to fetch
the inode without faking filehandle.  Also it adds the ffs flags argument
which allows to control ffs_vgetf() call.

Requested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov f16c26b1c0 ffs: Add FFSV_REPLACE_DOOMED flag to ffs_vgetf()
It specifies that caller requests a fresh non-doomed vnode.  If doomed
vnode is found in the hash, it should behave similarly to FFSV_REPLACE.
Or, to put it differently, the flag is same as FFSV_REPLACE, but only
when the found hashed vnode is doomed.

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov e94f2f1be3 ffs: call ufsdirhash_dirtrunc() right after setting directory size
Later processing of ffs_truncate() might temporary unlock the directory
vnode, causing unsychronized dirhash and inode sizes if update is
postponed to UFS_TRUNCATE() callers.

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:19 +02:00
Konstantin Belousov bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Konstantin Belousov 0281f88e5d ffs_vnops.c: Move opt_*.h includes to the top.
as it is done in other places.  Header files might need options defined
for correct operation.

Reviewed by:	chs, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-12 03:02:19 +02:00
Kirk McKusick a63eae65ff Revert 2d4422e799, Eliminate lock order reversal in UFS ffs_unmount().
After discussion with Chuck Silvers (chs@) we have decided that
there is a better way to resolve this lock order reversal which
will be committed separately.

Sponsored by: Netflix
2021-01-30 00:03:37 -08:00
Mateusz Guzik c892d60a1d ufs: denote lack of support for lockless symlink lookup
It is unclear without investigating if it can be provided without using
extra memory, so for the time being just don't.
2021-01-23 15:04:43 +00:00
Kirk McKusick 79a5c790bd Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix
2021-01-15 16:36:42 -08:00
Kirk McKusick 173779b98f Eliminate lock order reversal in UFS when unmounting filesystems
with snapshots.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update.  As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

The lock order reversal happens because vnode locks must be acquired
before snapshot locks. When unmounting we must lock both the snapshot
lock and the vnode lock before swapping them so that the vnode will
be continuously locked during the swap. For each vnode representing
a snapshot, we must first acquire the snapshot lock to ensure
exclusive access to it and its original lock.  We then face a lock
order reversal when we try to acquire the original vnode lock. The
problem is eliminated by doing a non-blocking exclusive lock on the
original lock which will always succeed since there are no users
of that lock.

Sponsored by: Netflix
2021-01-15 16:03:01 -08:00
Mateusz Guzik 6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Kirk McKusick 2d4422e799 Eliminate lock order reversal in UFS ffs_unmount().
UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.

The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.

Sponsored by: Netflix
2021-01-11 16:49:07 -08:00
Thomas Munro e7347be9e3 ffs: Support O_DSYNC.
Respect the new IO_DATASYNC flag when performing synchronous writes.
Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some
cases, matching the behaviour of fdatasync(2).

Reviewed by: kib
Differential Review: https://reviews.freebsd.org/D25160
2021-01-08 13:15:56 +13:00
Mateusz Guzik 3e506a67bb vfs: add v_irflag accessors
Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27793
2021-01-03 06:50:06 +00:00
Mateusz Guzik 9997aedb8f ufs: use VNPASS when asserting on a vnode in ufs_read_pgcache 2021-01-01 03:14:11 +00:00
Mark Johnston ace3d9475c ffs: Avoid out-of-bounds accesses in the fs_active bitmap
We use a bitmap to track which cylinder groups have changed between
snapshot creation and filesystem suspension.  The "legs" of the bitmap
are four bytes wide (see ACTIVESET()) so we must round up the allocation
size to a multiple of four bytes.

I believe this bug is harmless since UMA/kmem_* will both pad the
allocation and zero the full allocation.  Note that malloc() does inline
zeroing when the allocation size is known at compile-time.

Reported by:	pho (using KASAN)
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27731
2020-12-23 11:16:40 -05:00
Ryan Libby 93dba42c0e ffs: quiet -Wstrict-prototypes
Reviewed by:	kib, markj, mckusick
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27558
2020-12-11 22:51:57 +00:00
Kirk McKusick bb3c01ec79 Document the BA_CLRBUF flag used in ufs and ext2fs filesystems.
Suggested by: kib
MFC after:    3 days
Sponsored by: Netflix
2020-12-06 20:50:21 +00:00
Konstantin Belousov 2c7ada9917 ufs: handle two more cases of possible VNON vnode returned from VFS_VGET().
Reported by:	kevans
Reviewed by:	mckusick, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27457
2020-12-06 18:09:14 +00:00
Konstantin Belousov 21a45add50 ffs: do not read full direct blocks if they are going to be overwritten.
BA_CLRBUF specifies that existing context of the block will be
completely overwritten by caller, so there is no reason to spend io
fetching existing data.  We do the same for indirect blocks.

Reported by:	tmunro
Reviewed by:	mckusick, tmunro
Tested by:	pho, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27353
2020-11-30 17:03:26 +00:00
Konstantin Belousov cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Konstantin Belousov 92bcefd1d2 clear_inodedeps: handle ERELOOKUP from ffs_syncvnode().
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-11-26 18:03:24 +00:00
Konstantin Belousov 07ef907f6e ffs_softdep.c: get_parent_vp(): Fix bp lock leak when inum inode was already freed.
Reported by:	markj, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-11-25 17:12:21 +00:00
Konstantin Belousov 8a1509e442 Handle LoR in flush_pagedep_deps().
When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock.  Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed.  This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps().  The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
     ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed.  All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP.  For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied.  Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes.  [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged.  A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive().  In this case, dropping the
vnode lock is not safe.  Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT.  ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:30:10 +00:00
Konstantin Belousov 738ea0010b Add ffs_inode_bwrite() helper.
In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:19:59 +00:00
Konstantin Belousov 7b795aa3c0 Revert r367669 to re-commit with proper message 2020-11-14 05:19:44 +00:00
Konstantin Belousov c0d2077f41 Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:17:04 +00:00
Konstantin Belousov 61846fc4dc Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:10:39 +00:00