Commit graph

1627 commits

Author SHA1 Message Date
Konstantin Belousov e3d6759585 b_vflags update requries bufobj lock
The trunc_dependencies() issue was reported by	Alexander Lochmann
<alexander.lochmann@tu-dortmund.de>, who found the problem by performing
lock analysis using LockDoc, see https://doi.org/10.1145/3302424.3303948.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-15 15:47:42 +03:00
Konstantin Belousov 0b3948e73b softdep_unmount: assert that no dandling dependencies are left
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 7a8d4b4da6 FFS: assign fully initialized struct mount_softdeps to um_softdep
Other threads observing the non-NULL um_softdep can assume that it is
safe to use it. This is important for ro->rw remounts where change from
read-only to read-write status cannot be made atomic.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 2af934cc15 Assert that um_softdep is NULL on free(ump), i.e. softdep_unmount() was called
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov f776c54cee ffs_mount: when remounting ro->rw and sbupdate failed, cleanup softdeps
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov d7e5e37416 softdep_unmount: handle spurious wakeups
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov fabbc3d879 softdep_flush(): do not access ump after we acked FLUSH_EXIT and unlocked SU lock
otherwise we might follow a pointer in the freed memory.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Konstantin Belousov 7c7a6681fa ffs: clear MNT_SOFTDEP earlier when remounting rw to ro
Suppose that we remount rw->ro and in parallel some reader tries to
instantiate a vnode, e.g. during lookup.  Suppose that softdep_unmount()
already started, but we did not cleared the MNT_SOFTDEP flag yet.
Then ffs_vgetf() calls into softdep_load_inodeblock() which accessed
destroyed hashes and freed memory.

Set/clear fs_ronly simultaneously (WRT to files flush) with MNT_SOFTDEP.
It might be reasonable to move the change of fs_ronly to under MNT_ILOCK,
but no readers take it.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov 7f682bdcab Rework MOUNTED/DOING SOFTDEP/SUJ macros
Now MNT_SOFTDEP indicates that SU are active in any variant +-J, and
SU+J is indicated by MNT_SOFTDEP | MNT_SUJ combination.  The reason is
that unmount will be able to easily hide SU from other operations by
clearing MNT_SOFTDEP while keeping the record of the active journal.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov 81cdb19e04 ffs softdep: clear ump->um_softdep on softdep_unmount()
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:07 +02:00
Konstantin Belousov a285d3edac ffs_extern.h: Add comments for ffs_vgetf() flags
Requested and reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:59 +02:00
Konstantin Belousov fd97fa6463 Add FFSV_FORCEINODEDEP flag for ffs_vgetf()
It will be used to allow SU flush code to sync the volume while external
consumers see that SU is already disabled on the filesystem.  Use it where
ffs_vgetf() called by SU code to process dependencies.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:38 +02:00
Konstantin Belousov 25aac48d2c simplify journal_mount: move the out label after success block
This removes the need to check for error == 0.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:30:37 +02:00
Konstantin Belousov 8742817ba6 FFS extattr: fix handling of the tail
There are three issues with change that stopped truncating ea area before
write, and resulted in possible zero tail in the ea area:
- Truncate to zero checked i_ea_len after the reference was dropped,
  making the last drop effectively truncate to zero length always.
- Loop to fill uio for zeroing specified too large length, that triggered
  assert in normal situation.
- Integrity check could trip over the tail, instead we must allow
  partial header or header with zero length, and clamp ea image in
  memory at it.

Reported by:	arichardson
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Fixup:	5e198e7646
Differential Revision:	https://reviews.freebsd.org/D28999
2021-03-02 02:19:34 +02:00
Konstantin Belousov 6f30ac9995 Call softdep_prealloc() before taking ffs_lock_ea(), if unlock is committing
softdep_prealloc() must be called to ensure enough journal space is
available, before ffs_extwrite(). Also it must be done before taking
ffs_lock_ea(), because it calls ffs_syncvnode(), potentially dropping
the vnode lock.

Reviewed by:	mckusick
Tested by:	pho
MFC after:      1 week
Sponsored by:   The FreeBSD Foundation
2021-02-24 09:55:21 +02:00
Konstantin Belousov 5e198e7646 ffs_close_ea: do not relock vnode under lock_ea
ffs_lock_ea is after the vnode lock, so vnode must not be relocked under
lock_ea. Move ffs_truncate() call in ffs_close_ea() after the lock_ea is
dropped, and only truncate to length zero, since this is the only mode
supported by ffs_truncate() for EAs. Previously code did truncation and
then write.

Zero the part of the ext area that is unused, if truncation is due but not
done because ea area is not zero-length.

Reviewed by:	mckusick
Tested by:	pho
MFC after:      1 week
Sponsored by:   The FreeBSD Foundation
2021-02-24 09:55:04 +02:00
Konstantin Belousov c6d68ca842 ffs_vnops.c: style
Use local var to shorten ap->a_vp expression.

Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:53 +02:00
Konstantin Belousov 4983146279 ffs: do not call softdep_prealloc() from UFS_BALLOC()
Do it in ffs_write(), where we can gracefuly handle relock and its
consequences. In particular, recheck the v_data to see if the vnode
reclamation ended, and return EBADF when we cannot proceed with the
write.

Reviewed by:	mckusick
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:50 +02:00
Konstantin Belousov cc9958bf22 ffs_reallocblks: change the guard for softdep_prealloc() call to DOINGSUJ()
instead of DOINGSOFTDEP().  The softdep_prealloc() function does nothing
in SU case.

Note that the call should be safe with regard to the vnode relock,
because it is called with MNT_NOWAIT, which does not descend into fsync.

Reviewed by:	mckusick
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-02-24 09:54:30 +02:00
Konstantin Belousov 2bfd8992c7 vnode: move write cluster support data to inodes.
The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by:	mjg
Reviewed by:	asomers (previous version), fsu (ext2 parts), mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28679
2021-02-21 11:38:21 +02:00
Konstantin Belousov c31480a1f6 UFS snapshots: properly set the vm object size.
Citing Kirk:
The previous code [before 8563de2f27 -- kib] did not call
vnode_pager_setsize() but worked because later in ffs_snapshot() it
does a UFS_WRITE() to output the snaplist. Previously the UFS_WRITE()
allocated the extra block at the end of the file which caused it to do
the needed vnode_pager_setsize(). But the new code had already allocated
the extra block, so UFS_WRITE() did not extend the size and thus did not
do the vnode_pager_setsize().

PR:	253158
Reported by:	Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Reviewed by:	mckusick
Tested by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-16 07:11:52 +02:00
Kirk McKusick 8563de2f27 Fix bug 253158 - Panic: snapacct_ufs2: bad block - mksnap_ffs(8) crash
The panic reported in 253158 arises because the /mnt/.snap/.factory
snapshot allocated the last block in the filesystem. The snapshot
code allocates the last block in the filesystem as a way of setting
its length to be the size of the filesystem. Part of taking a
snapshot is to remove all the earlier snapshots from the image of
the newest snapshot so that newer snapshots will not claim the blocks
of the earlier snapshots. The panic occurs when the new snapshot
finds that both it and an earlier snapshot claim the same block.

The fix is to set the size of the snapshot to be one block after
the last block in the filesystem. This block can never be allocated
since it is not a valid block in the filesystem. This extra block
is used as a place to store the initial list of blocks that the
snapshot has already copied and is used to avoid a deadlock in and
speed up the ffs_copyonwrite() function.

Reported by:  Harald Schmalzbauer
Tested by:    Peter Holm
PR:           253158
Sponsored by: Netflix
2021-02-11 21:31:16 -08:00
Konstantin Belousov 26af9f72f7 ffs_unlock: assert that IN_ENDOFF is not leaked past locked scope
This catches both missed processing of IN_ENDOFF and missed application
of VOP_VPUT_PAIR() after VOP that created an entry in the directory.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 28703d2713 ffs softdep: Force processing of VI_OWEINACT vnodes when there is inode shortage
Such vnodes prevent inode reuse, and should be force-cleared when ffs_valloc()
is unable to find a free inode.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 2011b44fa3 softdep_request_cleanup: wait for softdep_request_clean_flush() to pass
if we noted a parallel request is active and declined to overflow the
system with parallel redundant sync of the vnodes.  But we need to wait
for the flush to finish to see if there are any freed resources.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:22 +02:00
Konstantin Belousov 013168db8c ufs_inactive(): stop hiding ERELOOKUP from ffs_truncate(), return it.
VFS should retry inactivation when possible, then. This should provide
timely removal of unlinked unreferenced inodes.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov b59a8e63d6 Stop ignoring ERELOOKUP from VOP_INACTIVE()
When possible, relock the vnode and retry inactivation.  Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov ede40b0675 ffs softdep: remove will_direnter argument of softdep_prelink()
Originally this was done in 8a1509e442 to forcibly cover cases
where a hole in the directory could be created by extending into
indirect block, since dependency of writing out indirect block is not
tracked.  This results in excessive amount of fsyncing the directories,
where all creation of new entry forced fsync before it.  This is not needed,
it is enough to fsync when IN_NEEDSYNC is set, and VOP_VPUT_PAIR() provides
the required hook to only perform required syncing.

The series of changes culminating in this commit puts the performance of
metadata-intensive loads back to that before 8a1509e442.

Analyzed by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 06f2918ab8 ufs_direnter: directory truncation does not need special case for rename
In ufs_rename case, tdvp is locked from the place where ufs_direnter()
is done till VOP_VPUT_PAIR(), which means that we no longer need to specially
handle rename in ufs_direnter().  Truncation, if possible, is done in the
same way in ffs_vput_pair() both for rename and other VOPs calling
ufs_direnter().  Remove isrename argument and set IN_ENDOFF if
ufs_direnter() succeeded and directory needs truncation.

In ffs_vput_pair(), stop verifying the condition that directory needs
truncation when IN_ENDOFF is set, instead assert that the condition is
true.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 74a3652f83 ufs_direnter: move directory truncation to ffs_vput_pair().
VOP_VPUT_PAIR() provides the hook to do the truncation right before
unlock, which is required since truncation might need to fsync(), which
itself might unlock the directory vnode.

Set new flag IN_ENDOFF which indicates that i_endoff is valid and should
be checked against inode size. Excessive size is chomped, but this
operation is advisory and failure to truncate should not result in the
failure of the main VOP.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov 30bfb2fa0f ffs_vput_pair(): try harder to recover from the vnode reclaim
In particular, if unlock_vp is false, save vp's inode number and
generation. If ffs_inotovp() can re-create the vnode with the same
number and generation after we finished with handling dvp, then we most
likely raced with unmount, and were able to restore atomicity of open.
We use FFSV_REPLACE_DOOMED there, to drop the old vnode.

This additional recovery is not strictly required, but it improves the
quality of the implementation.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov f2c9d038bd FFS: implement special VOP_VPUT_PAIR().
It cleans IN_NEEDSYNC flag on dvp before returning, by applying
ffs_syncvnode() until success or an error different from ERELOOKUP.
IN_NEEDSYNC cleanup is required to avoid creating holes in the directories
when extended into indirect block.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:21 +02:00
Konstantin Belousov be44e98637 ffs_snapshot: use VOP_VPUT_PAIR after VOP_CREATE.
If the snapshot embrio was reclaimed under us, return error outright.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 1de1e2bfbf ffs_syncvnode: only clear IN_NEEDSYNC after successfull sync
If it is cleaned before the sync, other threads might see the inode without
the flag set, because syncing could unlock it.

Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 89fd61d955 Merge ufs_fhtovp() into ffs_inotovp().
The function alone was not used for anything but ffs_fstovp() for long time.

Suggested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov 5952c86c78 ffs_inotovp(): interface to convert (ino, gen) into alive vnode
It generalizes the VFS_FHTOVP() interface, making it possible to fetch
the inode without faking filehandle.  Also it adds the ffs flags argument
which allows to control ffs_vgetf() call.

Requested by:	mckusick
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov f16c26b1c0 ffs: Add FFSV_REPLACE_DOOMED flag to ffs_vgetf()
It specifies that caller requests a fresh non-doomed vnode.  If doomed
vnode is found in the hash, it should behave similarly to FFSV_REPLACE.
Or, to put it differently, the flag is same as FFSV_REPLACE, but only
when the found hashed vnode is doomed.

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:20 +02:00
Konstantin Belousov e94f2f1be3 ffs: call ufsdirhash_dirtrunc() right after setting directory size
Later processing of ffs_truncate() might temporary unlock the directory
vnode, causing unsychronized dirhash and inode sizes if update is
postponed to UFS_TRUNCATE() callers.

Reviewed by:	chs, mkcusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-02-12 03:02:19 +02:00
Konstantin Belousov bf0db19339 buf SU hooks: track buf_start() calls with B_IOSTARTED flag
and only call buf_complete() if previously started.  Some error paths,
like CoW failire, might skip buf_start() and do bufdone(), which itself
call buf_complete().

Various SU handle_written_XXX() functions check that io was started
and incomplete parts of the buffer data reverted before restoring them.
This is a useful invariant that B_IO_STARTED on buffer layer allows to
keep instead of changing check and panic into check and return.

Reported by:	pho
Reviewed by:	chs, mckusick
Tested by:	pho
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundations
2021-02-12 03:02:19 +02:00
Konstantin Belousov 0281f88e5d ffs_vnops.c: Move opt_*.h includes to the top.
as it is done in other places.  Header files might need options defined
for correct operation.

Reviewed by:	chs, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-02-12 03:02:19 +02:00
Kirk McKusick a63eae65ff Revert 2d4422e799, Eliminate lock order reversal in UFS ffs_unmount().
After discussion with Chuck Silvers (chs@) we have decided that
there is a better way to resolve this lock order reversal which
will be committed separately.

Sponsored by: Netflix
2021-01-30 00:03:37 -08:00
Kirk McKusick 79a5c790bd Eliminate a locking panic when cleaning up UFS snapshots after a
disk failure.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.

Sponsored by: Netflix
2021-01-15 16:36:42 -08:00
Kirk McKusick 173779b98f Eliminate lock order reversal in UFS when unmounting filesystems
with snapshots.

Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update.  As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.

The lock order reversal happens because vnode locks must be acquired
before snapshot locks. When unmounting we must lock both the snapshot
lock and the vnode lock before swapping them so that the vnode will
be continuously locked during the swap. For each vnode representing
a snapshot, we must first acquire the snapshot lock to ensure
exclusive access to it and its original lock.  We then face a lock
order reversal when we try to acquire the original vnode lock. The
problem is eliminated by doing a non-blocking exclusive lock on the
original lock which will always succeed since there are no users
of that lock.

Sponsored by: Netflix
2021-01-15 16:03:01 -08:00
Mateusz Guzik 6b3a9a0f3d Convert remaining cap_rights_init users to cap_rights_init_one
semantic patch:

@@

expression rights, r;

@@

- cap_rights_init(&rights, r)
+ cap_rights_init_one(&rights, r)
2021-01-12 13:16:10 +00:00
Kirk McKusick 2d4422e799 Eliminate lock order reversal in UFS ffs_unmount().
UFS uses a new "mntfs" pseudo file system which provides private
device vnodes for a file system to safely access its disk device.
The original device vnode is saved in um_odevvp to hold the exclusive
lock on the device so that any attempts to open it for writing will
fail. But it is otherwise unused and has its BO_NOBUFS flag set to
enforce that file systems using mntfs vnodes do not accidentally
use the original devfs vnode. When the file system is unmounted,
um_odevvp is no longer needed and is released.

The lock order reversal happens because device vnodes must be locked
before UFS vnodes. During unmount, the root directory vnode lock
is held. When when calling vrele() on um_odevvp, vrele() attempts to
exclusive lock um_odevvp causing the lock order reversal. The problem
is eliminated by doing a non-blocking exclusive lock on um_odevvp
which will always succeed since there are no users of um_odevvp.
With um_odevvp locked, it can be released using vput which does not
attempt to do a blocking exclusive lock request and thus avoids the
lock order reversal.

Sponsored by: Netflix
2021-01-11 16:49:07 -08:00
Thomas Munro e7347be9e3 ffs: Support O_DSYNC.
Respect the new IO_DATASYNC flag when performing synchronous writes.
Compared to O_SYNC, O_DSYNC lets us skip updating the inode in some
cases, matching the behaviour of fdatasync(2).

Reviewed by: kib
Differential Review: https://reviews.freebsd.org/D25160
2021-01-08 13:15:56 +13:00
Mark Johnston ace3d9475c ffs: Avoid out-of-bounds accesses in the fs_active bitmap
We use a bitmap to track which cylinder groups have changed between
snapshot creation and filesystem suspension.  The "legs" of the bitmap
are four bytes wide (see ACTIVESET()) so we must round up the allocation
size to a multiple of four bytes.

I believe this bug is harmless since UMA/kmem_* will both pad the
allocation and zero the full allocation.  Note that malloc() does inline
zeroing when the allocation size is known at compile-time.

Reported by:	pho (using KASAN)
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27731
2020-12-23 11:16:40 -05:00
Ryan Libby 93dba42c0e ffs: quiet -Wstrict-prototypes
Reviewed by:	kib, markj, mckusick
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27558
2020-12-11 22:51:57 +00:00
Konstantin Belousov 21a45add50 ffs: do not read full direct blocks if they are going to be overwritten.
BA_CLRBUF specifies that existing context of the block will be
completely overwritten by caller, so there is no reason to spend io
fetching existing data.  We do the same for indirect blocks.

Reported by:	tmunro
Reviewed by:	mckusick, tmunro
Tested by:	pho, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27353
2020-11-30 17:03:26 +00:00
Konstantin Belousov cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Konstantin Belousov 92bcefd1d2 clear_inodedeps: handle ERELOOKUP from ffs_syncvnode().
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-11-26 18:03:24 +00:00
Konstantin Belousov 07ef907f6e ffs_softdep.c: get_parent_vp(): Fix bp lock leak when inum inode was already freed.
Reported by:	markj, pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-11-25 17:12:21 +00:00
Konstantin Belousov 8a1509e442 Handle LoR in flush_pagedep_deps().
When operating in SU or SU+J mode, ffs_syncvnode() might need to
instantiate other vnode by inode number while owning syncing vnode
lock.  Typically this other vnode is the parent of our vnode, but due
to renames occuring right before fsync (or during fsync when we drop
the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other
vnode while owning the lock for vnode which owns the buffer, for which
the dependencies are flushed.  This creates another instance of the
same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper
get_parent_vp() and use it in flush_pagedep_deps().  The case for safe
relocking of two vnodes with undefined lock order was extracted into
vn helper vn_lock_pair().

Due to call sequence
     ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
ffs_syncvnode() indicates with ERELOOKUP that passed vnode was
unlocked in process, and can return ENOENT if the passed vnode
reclaimed.  All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory
entry in in-memory directory inode, and this information is then used
by UFS code that creates/removed directory entry in the actual
mutating VOPs, it is critical that directory vnode lock is not dropped
between lookup and VOP.  For softdep_prelink(), which ensures that
later link/unlink operation can proceed without overflowing the
journal, calls were moved to the place where it is safe to drop
processing VOP because mutations are not yet applied.  Then, ERELOOKUP
causes restart of the whole VFS operation (typically VFS syscall) at
top level, including the re-lookup of the involved pathes.  [Note that
we already do the same restart for failing calls to vn_start_write(),
so formally this patch does not introduce new behavior.]

Similarly, unsafe calls to fsync in snapshot creation code were
plugged.  A possible view on these failures is that it does not make
sense to continue creating snapshot if the snapshot vnode was
reclaimed due to forced unmount.

It is possible that relock/ERELOOKUP situation occurs in
ffs_truncate() called from ufs_inactive().  In this case, dropping the
vnode lock is not safe.  Detect the situation with VI_DOINGINACT and
reschedule inactivation by setting VI_OWEINACT.  ufs_inactive()
rechecks VI_OWEINACT and avoids reclaiming vnode is truncation failed
this way.

In ffs_truncate(), allocation of the EOF block for partial truncation
is re-done after vnode is synced, since we cannot leave the buffer
locked through ffs_syncvnode().

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:30:10 +00:00
Konstantin Belousov 738ea0010b Add ffs_inode_bwrite() helper.
In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:19:59 +00:00
Konstantin Belousov 7b795aa3c0 Revert r367669 to re-commit with proper message 2020-11-14 05:19:44 +00:00
Konstantin Belousov c0d2077f41 Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:17:04 +00:00
Konstantin Belousov 61846fc4dc Add a framework that tracks exclusive vnode lock generation count for UFS.
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored.  Enabled under DIAGNOSTICS.

UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents.  If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.

Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter.  Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.

Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock.  It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-14 05:10:39 +00:00
Mark Johnston f44994874b ffs: Clamp BIO_SPEEDUP length
On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by
softdep_request_cleanup() may be negative when assigned to bp->b_bcount,
which has type "long".

Clamp the size to LONG_MAX.  Also convert the unused g_io_speedup() to
use an off_t for the magnitude of the shortage for consistency with
softdep_send_speedup().

Reviewed by:	chs, kib
Reported by:	pho
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27081
2020-11-11 13:48:07 +00:00
Conrad Meyer e6790841f7 UFS2: Fix DoS due to corrupted extattrfile
Prior versions of FreeBSD (11.x) may have produced a corrupt extattr file.
(Specifically, r312416 accidentally fixed this defect by removing a strcpy.)
CURRENT FreeBSD supports disk images from those prior versions of FreeBSD.
Validate the internal structure as soon as we read it in from disk, to
prevent these extattr files from causing invariants violations and DoS.

Attempting to access the extattr portion of these files results in
EINTEGRITY.  At this time, the only way to repair files damaged in this way
is to copy the contents to another file and move it over the original.

PR:		244089
Reported by:	Andrea Venturoli <ml AT netfence.it>
Reviewed by:	kib
Discussed with:	mckusick (earlier draft)
Security:	no
Differential Revision:	https://reviews.freebsd.org/D27010
2020-10-30 19:00:42 +00:00
Edward Tomasz Napierala bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Kirk McKusick 996d40f91d Various new check-hash checks have been added to the UFS filesystem
over various major releases. Superblock check hashes were added for
the 12 release and cylinder-group and inode check hashes will appear
in the 13 release.

When a disk with a UFS filesystem is writably mounted, the kernel
clears the feature flags for anything that it does not support. For
example, if a UFS disk from a 12-stable kernel is mounted on an
11-stable system, the 11-stable kernel will clear the flag in the
filesystem superblock that indicates that superblock check-hashs
are being maintained. Thus if the disk is later moved back to a
12-stable system, the 12-stable system will know to ignore its
incorrect check-hash.

If the only filesystem modification done on the earlier kernel is
to run a utility such as growfs(8) that modifies the superblock but
neither updates the check-hash nor clears the feature flag indicating
that it does not support the check-hash, the disk will fail to mount
if it is moved back to its original newer kernel.

This patch moves the code that clears the filesystem feature flags
from the mount code (ffs_mountfs()) to the code that reads the
superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
code and is imported into libufs(3), all the filesystem utilities
will now also clear these flags when they make modifications to the
filesystem.

As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
and repair bad superblock check-hashes rather than refusing to run.
This change allows fsck to recover filesystems that have been impacted
by utilities older than those created after this change and is a
sensible thing to do in any event.

Reported by:  John Baldwin (jhb@)
MFC after:    2 weeks
Sponsored by: Netflix
2020-10-25 00:43:48 +00:00
Brooks Davis 44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Konstantin Belousov e1ef4c29a3 Do not leak B_BARRIER.
Normally when a buffer with B_BARRIER is written, the flag is cleared
by g_vfs_strategy() when creating bio.  But in some cases FFS buffer
might not reach g_vfs_strategy(), for instance when copy-on-write
reports an error like ENOSPC.  In this case buffer is returned to
dirty queue and might be written later by other means.  Among then
bdwrite() reasonably asserts that B_BARRIER is not set.

In fact, the only current use of B_BARRIER is for lazy inode block
initialization, where write of the new inode block is fenced against
cylinder group write to mark inode as used.  The situation could be
seen that we break dependency by updating cg without written out
inode.  Practically since CoW was not able to find space for a copy of
inode block, for the same reason cg group block write should fail.

Reported by:	pho
Discussed with:	chs, imp, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26511
2020-10-08 22:41:02 +00:00
Konstantin Belousov 96474d2a3f Do not copy vp into f_data for DTYPE_VNODE files.
The pointer to vnode is already stored into f_vnode, so f_data can be
reused.  Fix all found users of f_data for DTYPE_VNODE.

Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.

Reviewed by:	markj (previous version)
Discussed with:	freqlabs (openzfs chunk)
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26346
2020-09-15 21:55:21 +00:00
Mateusz Guzik d90f2c3617 ufs: clean up empty lines in .c and .h files 2020-09-01 21:23:00 +00:00
Mateusz Guzik 7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Mateusz Guzik a92a971bbb vfs: remove the thread argument from vget
It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)
2020-08-16 17:18:54 +00:00
Mateusz Guzik 03337743db vfs: clean MNTK_FPLOOKUP if MNT_UNION is set
Elides checking it during lookup.
2020-08-10 11:51:21 +00:00
Mateusz Guzik e5e10c82ec ufs: only pass LK_ADAPTIVE if LK_NODDLKTREAT is set
This restores the pre-adaptive spinning state for SU which livelocks
otherwise. Note this is a bug in SU.

Reported by:	pho
2020-08-04 23:09:15 +00:00
Mateusz Guzik 9d5a594f0b ufs: add support for lockless lookup
ACLs are not supported, meaning their presence will force the use of the old lookup.

Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25579
2020-07-25 10:38:05 +00:00
Mateusz Guzik 31ad4050fe lockmgr: add adaptive spinning
It is very conservative. Only spinning when LK_ADAPTIVE is passed, only on
exclusive lock and never when any waiters are present. buffer cache is remains
not spinning.

This reduces total sleep times during buildworld etc., but it does not shorten
total real time (culprits are contention in the vm subsystem along with slock +
upgrade which is not covered).

For microbenchmarks: open3_processes -t 52 (open/close of the same file for
writing) ops/s:
before: 258845
after: 801638

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D25753
2020-07-22 12:30:31 +00:00
Kirk McKusick 93440bbefd The binary representation of the superblock (the fs structure) is written
out verbatim to the disk: see ffs_sbput() in sys/ufs/ffs/ffs_subr.c.
It contains a pointer to the fs_summary_info structure. This pointer
value inadvertently causes garbage to be stored. It is garbage because
the pointer to the fs_summary_info structure is the address the then
current stack or heap. Although a mere pointer does not reveal anything
useful (like a part of a private key) to an attacker, garbage output
deteriorates reproducibility.

This commit zeros out the pointer to the fs_summary_info structure
before writing the out the superblock.

Reviewed by:  kib
Tested by:    Peter Holm
PR:           246983
Sponsored by: Netflix
2020-06-19 01:04:25 +00:00
Kirk McKusick 34816cb9ae Move the pointers stored in the superblock into a separate
fs_summary_info structure. This change was originally done
by the CheriBSD project as they need larger pointers that
do not fit in the existing superblock.

This cleanup of the superblock eases the task of the commit
that immediately follows this one.

Suggested by: brooks
Reviewed by:  kib
PR:           246983
Sponsored by: Netflix
2020-06-19 01:02:53 +00:00
Chuck Silvers d9a8abf6c2 Move all of the functions in ffs_subr.c that are only used by the ufs kernel
module from that file into ffs_vfsops.c.  This fixes the build for kernel
configs that don't include FFS.

PR:		247256
Submitted by:	glebius
Reviewed by:	mckusick (earlier version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D25285
2020-06-17 23:39:52 +00:00
Rick Macklem 1f7104d720 Fix export_args ex_flags field so that is 64bits, the same as mnt_flags.
Since mnt_flags was upgraded to 64bits there has been a quirk in
"struct export_args", since it hold a copy of mnt_flags
in ex_flags, which is an "int" (32bits).
This happens to currently work, since all the flag bits used in ex_flags are
defined in the low order 32bits. However, new export flags cannot be defined.
Also, ex_anon is a "struct xucred", which limits it to 16 additional groups.
This patch revises "struct export_args" to make ex_flags 64bits and replaces
ex_anon with ex_uid, ex_ngroups and ex_groups (which points to a
groups list, so it can be malloc'd up to NGROUPS in size.
This requires that the VFS_CHECKEXP() arguments change, so I also modified the
last "secflavors" argument to be an array pointer, so that the
secflavors could be copied in VFS_CHECKEXP() while the export entry is locked.
(Without this patch VFS_CHECKEXP() returns a pointer to the secflavors
array and then it is used after being unlocked, which is potentially
a problem if the exports entry is changed.
In practice this does not occur when mountd is run with "-S",
but I think it is worth fixing.)

This patch also deleted the vfs_oexport_conv() function, since
do_mount_update() does the conversion, as required by the old vfs_cmount()
calls.

Reviewed by:	kib, freqlabs
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D25088
2020-06-14 00:10:18 +00:00
Kirk McKusick 513274c79c Clear the IN_SIZEMOD and IN_IBLKDATA flags only when doing a
synchronous inode update.

The IN_SIZEMOD and IN_IBLKDATA flags indicate changes to the
file size and block pointer fields in the inode. When these
fields have been changed, the fsync() and fsyncdata() system
calls must write the inode to ensure their semantics that the
file is on stable store.

The IN_SIZEMOD and IN_IBLKDATA flags cannot be cleared until
a synchronous write of the inode is done. If they are cleared
on an asynchronous write, then the inode may not yet have been
written to the disk when an fsync() or fsyncdata() call is done.
Absent these flags, these calls would not know that they needed
to write the inode. Thus, these flags only can be cleared on
synchronous writes of the inode. Since the inode will be locked
for the duration of the I/O that writes it to disk, no fsync()
or fsyncdata() will be able to run before the on-disk inode
is complete.

Reviewed by: kib
MFC with: -r361785
Differential revision:  https://reviews.freebsd.org/D25072
2020-06-06 20:17:56 +00:00
Kirk McKusick 52488b5148 Further evaluation of the POSIX spec for fdatasync() shows that it
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.

This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.

Reviewed by: kib
MFC with: -r361785
Differential revision:  https://reviews.freebsd.org/D25072
2020-06-05 01:00:55 +00:00
Stefan Eßer 23e84cf153 Fix obvious typo: IN_BLKDATA should be IN_IBLKDATA 2020-06-04 19:54:25 +00:00
Kirk McKusick 30296c428a Two additional places that need to identify IN_IBLKDATA.
Reviewed by: kib
MFC with: -r361785
Differential Revision:  https://reviews.freebsd.org/D25072
2020-06-04 18:35:21 +00:00
Konstantin Belousov 7428630b75 UFS: write inode block for fdatasync(2) if pointers in inode where allocated
The fdatasync() description in POSIX specifies that
    all I/O operations shall be completed as defined for synchronized I/O
    data integrity completion.
and then the explanation of Synchronized I/O Data Integrity Completion says
    The write is complete only when the data specified in the write
    request is successfully transferred and all file system
    information required to retrieve the data is successfully
    transferred.

For UFS this means that all pointers must be on disk. Indirect
pointers already contribute to the list of dirty data blocks, so only
direct blocks and root pointers to indirect blocks, both of which
reside in the inode block, should be taken care of. In ffs_balloc(),
mark the inode with the new flag IN_IBLKDATA that specifies that
ffs_syncvnode(DATA_ONLY) needs a call to ffs_update() to flush the
inode block.

Reviewed by:	mckusick
Discussed with:	tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D25072
2020-06-04 12:23:15 +00:00
Chuck Silvers d79ff54b5c This commit enables a UFS filesystem to do a forcible unmount when
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.

The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

   - ENXIO, which means that this disk is gone and the lower layers
     of the storage stack already guarantee that no future I/O to
     this disk will succeed.

   - EIO (or most other errors), which means that this particular
     I/O request has failed but subsequent I/O requests to this
     disk might still succeed.

For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.

This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.

Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.

Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.

The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.

Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.

Coauthered by: mckusick
Reviewed by:   kib
Tested by:     Peter Holm
Approved by:   mckusick (mentor)
Sponsored by:  Netflix
Differential Revision:  https://reviews.freebsd.org/D24088
2020-05-25 23:47:31 +00:00
John Baldwin 71d11ee322 Update name of description of vfs.ffs.setsize in comment.
Previously it used the name 'adjsize' instead of 'setsize'.
2020-05-22 17:23:43 +00:00
John Baldwin f2620e9ceb Retire two unused background fsck sysctls.
These two sysctls were added to support UFS softupdates journalling
with snapshots.  However, the changes to fsck to use them were never
committed and there have never been any in-tree uses of these sysctls.

More details from Kirk:

When journalling got added to soft updates, its journal rollback freed
blocks that it thought were no longer in use. But it does not take
snapshots into account (i.e., if a snapshot is still using it, then it
cannot be freed). So I added the needed logic to fsck by having the
free go through the kernel's blkfree code so it could grab blocks that
were still needed by snapshots. That is done using the setbufoutput
hack. I never got that code working reliably, so it is still sitting
in my work directory. Which also explains why you still cannot take
snapshots on filesystems running with journalling...

In looking over my use of this feature, and in particular the troubles
I was having with it, I conclude that it may be better to extract the
code from the kernel that handles freeing blocks claimed by snapshots
and putting it into fsck directly. My original intent was that it is
complex and at the time changing, so only having to maintain it in one
place was appealing. But at this point it has not changed in years and
the hacks like setinode and setbufoutput to be able to use the kernel
code is sufficiently ugly, that I am leaning towards just extracting
it.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D24484
2020-04-21 17:42:32 +00:00
Konstantin Belousov 71f2642988 ufs: apply suspension for non-forced rw unmounts.
Forced rw unmounts and remounts from rw to ro already suspend
filesystem, which closes races with writers instantiating new vnodes
while unmount flushes the queue.  Original intent of not including
non-forced unmounts into this regime was to allow such unmounts to
fail if writer was active, but this did not worked well.

Similar change, but causing all unmount, even involving only ro
filesystem, were proposed in D24088, but I believe that suspending ro
is undesirable, and definitely spends CPU time.

Reported by:	markj
Discussed with:	chs, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-04-10 01:24:16 +00:00
Kirk McKusick 621a274820 Fixing the soft update macros in -r359612 triggered a previously
hidden bug in the file truncation code. Until that bug is tracked
down and fixed, revert to the old behavior.

Reported by: Peter Holm
Reviewed by: kib, Chuck Silvers
2020-04-09 23:51:18 +00:00
Kirk McKusick c79f5a4328 Revert -r359612 as it can cause other panics.
An updated version will be made when the issue has been resolved.

Reported by: Peter Holm
2020-04-06 20:23:47 +00:00
Kirk McKusick 2baca88584 When shrinking the size of a directory it is sometimes necessary to
sync it to disk before shrinking it. Complete the sync before getting
the buffer for the block to be updated to do the shrink to avoid
panicing with a recursive lock on one of the directory's buffers.

Reviewed by:  Chuck Silvers (chs)
MFC after:    3 days
Sponsored by: Netflix
2020-04-03 20:43:25 +00:00
Konstantin Belousov abfdf76791 VOP_GETPAGES_ASYNC(): consistently call iodone() callback in case of error.
Reviewed by:	glebius, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D24038
2020-03-30 21:44:30 +00:00
Kirk McKusick 95ca762da8 When mounting a UFS filesystem, return EINTEGRITY rather than EIO
when a superblock check-hash error is detected. This change clarifies
a mount that failed due to media hardware failures (EIO) from a mount
that failed due to media errors (EINTEGRITY) that can be corrected by
running fsck(8).

Sponsored by: Netflix
2020-03-11 21:00:40 +00:00
Chuck Silvers 69b3fdfa0b Use the devfs vnode rather than the mntfs vnode for permissions checks.
I missed this one in r358714.

Reported by:	pho
Reviewed by:	mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
2020-03-09 15:55:13 +00:00
Mateusz Guzik d2222aa0e9 fd: use smr for managing struct pwd
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D23889
2020-03-08 00:23:36 +00:00
Chuck Silvers f15ccf8836 Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by:	kib, mckusick
Approved by:	imp (mentor)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23787
2020-03-06 18:41:37 +00:00
Mateusz Guzik 8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Pawel Biernacki 7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Kirk McKusick 98b6844690 Additional KASSERTs to ensure the consistency of the soft updates
indirdep structure. No functional change.

Tested by:    Peter Holm (as part of a larger patch)
Sponsored by: Netflix
2020-02-18 23:56:23 +00:00
Scott Long 1353215314 Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command.  Add complimentary support to the CAM periphs that
support it.  This is a redo of r357710.
2020-02-16 23:10:59 +00:00
Mateusz Guzik 4d51e175f9 ufs: use faster lockgmr entry points in ffs_lock 2020-02-15 21:48:48 +00:00
Scott Long 85eb41f751 Revert r357710 and 357711 until they can be debugged 2020-02-10 14:27:28 +00:00
Scott Long 7d99bda79e Add rudamentary support for UFS to probe whether a block device supports the
BIO_SPEEDUP command.  Add complimentary support to the CAM periphs that
support it.
2020-02-10 00:23:20 +00:00
Chuck Silvers 62612737d6 With INVARIANTS, track all softdep dependency structures centrally
so that we can find them in dumps.

Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2020-02-03 17:47:14 +00:00
Mateusz Guzik f1fa1ba3d0 Fix up various vnode-related asserts which did not dump the used vnode 2020-02-03 14:25:32 +00:00
Mateusz Guzik 6c44a3e019 ufs: add vgone calls for unconstructed vnodes in the error path
This mostly eliminates the requirement that vput never unlocks the vnode
before calling VOP_INACTIVE. Note it may still be present for other
filesystems.

See r356126 for an example bug.

Note vput stopped doing early unlock in r357070 thus this change does
not affect correctness as it is.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23215
2020-01-26 00:38:06 +00:00
Mateusz Guzik d93762b94d vfs: stop handling VI_OWEINACT in vget
vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.

Reviewed by:	jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23184
2020-01-24 07:45:59 +00:00
Warner Losh 38b37b93d4 We only want to send the speedup to the lower layers when there's a shortage.
Only send a speedup when there's a shortage. While this is a little racy, lost
races aren't a big deal for this function. If there's a shorage just popping up
after we check these values, then we'll catch it next time. If there's a
shortage that's just clearing up, we may do some work at the lower layers a
little sooner than we otherwise would have. Sicne shortages are relatively rare
events, both races are acceptable.

Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23182
2020-01-17 01:16:23 +00:00
Warner Losh 3cf5dd8401 Use buf to send speedup
It turns out there's a problem with using g_io to send the speedup. It leads to
a race when there's a resource shortage when a disk fails.

Instead, send BIO_SPEEDUP via struct buf. This is pretty straight forward,
except we need to transfer the bio_flags from b_ioflags for BIO_SPEEDUP commands
in g_vfs_strategy.

Reviewed by: kirk, chs
Differential Revision: https://reviews.freebsd.org/D23117
2020-01-17 01:16:19 +00:00
Kirk McKusick 0297c1384a When sync'ing a mount point, the mount point's vnodes were scanned
twice. Once to update the changed inodes, and a second time to update
changed quota information. This change merges these two scans into a
single scan which does both inode and quota updates.

MFC after: 7 days
2020-01-14 22:27:46 +00:00
Jeff Roberson 815b74866a Fix a long standing bug in journaled soft-updates. The dirrem structure
needs to handle file removal, directory removal, file move, directory move,
etc.  The code in handle_workitem_remove() needs to propagate any completed
journal entries to the write that will render the change stable.  In the
case of a moved directory this means the new parent.  However, for an
overwrite that frees a directory (DIRCHG) we must move the jsegdep to the
removed inode to be released when it is stable in the cg bitmap or the
unlinked inode list.  This case was previously unhandled and caused a
panic.

Reported by:	mckusick, pho
Reviewed by:	mckusick
Tested by:	pho
2020-01-14 02:00:24 +00:00
Mateusz Guzik f1fcaffd8e ufs: relax an overzealous assert added in r356671
Part of i_flag can persist across a drop to hold count of 0, at which
point the vnode is taken off the lazy list. Then whoever locks and unlocks
the vnode can trip on the assert.

This trips over kyua running a test untarring character devices to ufs.

Reported by:	lwhsu
2020-01-13 14:33:51 +00:00
Mateusz Guzik 80663cadb8 ufs: use lazy list instead of active list for syncer
Quota code is temporarily regressed to do a full vnode scan.

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22996
2020-01-13 02:35:15 +00:00
Mateusz Guzik ac4ec14188 ufs: add a setter for inode i_flag field
This will be used later to add vnodes to the lazy list.

Reviewed by:	kib (previous version), jeff
Tested by:	pho (in a larger patch)
Differential Revision:	https://reviews.freebsd.org/D22994
2020-01-13 02:31:51 +00:00
Mateusz Guzik 8dbc63520c vfs: drop thread argument from vinactive 2020-01-05 00:59:47 +00:00
Mateusz Guzik b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Konstantin Belousov 4085590d39 ufs: do not leave non-reclaimed vnodes with zero i_mode around.
After a recent change, vput() relocks even the exclusively locked
vnode before inactivating it.  Before that, UFS could safely
instantiate a vnode for cleared inode, then the last vput() after
ffs_vgetf() noted that ip->i_mode == 0 and recycled.  Now, it is
possible for other threads to note the half-constructed vnode, e.g. to
insert it into hash, which makes other threads to use it despite mode
is zero, before inactivation and reclaim.

Handle the found cases in SU code, by explicitly doing reclaim.
Assert that other places get fully constructed inode from ffs_vgetf(),
which cannot be cleared before dependencies are resolved.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-12-27 16:43:34 +00:00
Warner Losh 56e4d45895 Drop a sleepable lock when we plan on sleeping
g_io_speedup waits for the completion of the speedup request before proceeding
using biowait(), but check_clear_deps is called with the softdeps lock held
(which is non-sleepable). It's safe to drop this lock around the call to
speedup, so do that.

Submitted by: Peter Holm
Reviewed by: kib@
2019-12-18 16:01:15 +00:00
Warner Losh 22dd705f1f Add BIO_SPEEDUP signalling to UFS
When we have a resource shortage in UFS, send down a BIO_SPEEDUP to
give the CAM I/O scheduler a heads up that we have a resource shortage
and that it should bias its decisions knowing that.

Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
2019-12-17 00:13:40 +00:00
Mateusz Guzik 6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tesetd by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
Mateusz Guzik abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Kirk McKusick d00066a5f9 Currently the breadn_flags() and getblkx() interfaces are passed
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.

No functional change.

Reviewed by:  kib
Tested by:    Peter Holm
Sponsored by: Netflix
2019-12-03 23:07:09 +00:00
Chuck Silvers 2ac044e6bc As part of creating a snapshot, set fs->fs_fmod to 0 in the snapshot image
because nothing ever changes this field for read-only mounts and we want
to verify that it is still 0 when we unmount.

Reviewed by:	mckusick
Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2019-11-28 00:37:43 +00:00
Chuck Silvers 2bcfb938f4 In ffs_freefile(), use a separate variable to hold the inode number within
the cg rather than reusuing "ino" for this purpose.  This reduces the diff
for an upcoming change that improves handling of I/O errors.

No functional change.

Reviewed by:	mckusick
Approved by:	mckusick (mentor)
Sponsored by:	Netflix
2019-11-25 19:31:38 +00:00
Kirk McKusick 486b9a61f7 Add some KASSERTs. Reacquire a mutex after a kernel printf rather
than holding it during the printf. White space cleanup.

Sponsored by: Netflix
2019-11-20 01:10:01 +00:00
Jeff Roberson 67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Kirk McKusick 7792f70137 Soft updates needs to keep an on-disk linked list of inodes that
have been unlinked, but are still referenced by open file descriptors.
These inodes cannot be freed until the final file descriptor reference
has been closed. If the system crashes while they are still being
referenced, these inodes and their referenced blocks need to be
freed by fsck. By having them on a linked list with the head pointer
in the superblock, fsck can quickly find and process them rather
than having to check every inode in the filesystem to see if it is
unreferenced.

When updating the head pointer of this list of unlinked inodes in
the superblock, the superblock check-hash was not getting updated.
If the system crashed with the incorrect superblock check-hash, the
superblock would appear to be corrupted. This patch ensures that
the superblock check-hash is updated when updating the head pointer
of the unlinked inodes list.

There is no need to MFC as superblock check hashes first appeared in
13.0.

Tested by:    Peter Holm
Sponsored by: Netflix
2019-10-24 19:47:18 +00:00
Mark Johnston c456a0a1a6 Abbreviate softdep lock names.
The softdep lock names were unusually long and tended to stick out in
lock profiling reports.  Abbreviate them and make them consistent with
our conventional style for lock names.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22042
2019-10-18 17:01:27 +00:00
Mateusz Guzik e35cd9e38f ufs: add root vnode caching
See r353150.

Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D21646
2019-10-06 22:18:03 +00:00
Eric van Gyzen fdd888dee3 Add CTLFLAG_STATS to several debug.softdep sysctl OIDs
Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:44:52 +00:00
Kirk McKusick 44d37182ce Update ffs_getcg() function to accept a flags parameter to be passed
to breadn_flags() in preparation for later need when doing forcible
unmount when disk dies or is removed.

No functional change.

Sponsored by: Netflix
2019-10-04 05:28:36 +00:00
Mateusz Guzik 4cace859c2 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
Mateusz Guzik e87f3f72f1 vfs: manage mnt_writeopcount with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21575
2019-09-16 21:33:16 +00:00
Konstantin Belousov d89ac450a7 Remove some unneeded vfs_busy() calls in SU code.
When softdep_fsync() is running, a caller must already started write
for the mount point.  Since unmount or remount to ro suspends mount
point, it cannot run in parallel with softdep_fsync(), which makes
vfs_busy() call there not needed.

Doing blocking vfs_busy() there effectively causes lock order reversal
between vn_start_write() and setting MNTK_UNMOUNT, because
vfs_busy(mp, 0) sleeps waiting for MNTK_UNMOUNT becoming clear, while
unmount sets the flag and starts the suspension.

Note that all other uses of vfs_busy() in SU code are non-blocking.

Reported by:	chs by mckusick
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-09-09 11:22:38 +00:00
Conrad Meyer f3cf622523 ufs: Remove redundant brelse() after r294954
Same automation.

No functional change.
2019-09-06 08:08:33 +00:00
Konstantin Belousov 1604022248 UFS: stop reusing the vnode for reallocated inode.
In ffs_valloc(), force reclaim existing vnode on inode reuse, instead
of trying to re-initialize the same vnode for new purposes.  This is
done in preparation of changes to the vp->v_object lifecycle handling.

A new FFSV_REPLACE flag to ffs_vgetf() directs the function to
vgone(9) the vnode if found in vfs hash, instead of returning it.

Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:45:23 +00:00
Konstantin Belousov e671edac06 De-commision the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-23 19:40:10 +00:00
Kirk McKusick 5a0d467f5f Clarify comment that describes how the FS_METACKHASH is managed.
MFC after: 3 days
2019-08-13 20:56:44 +00:00
Kirk McKusick 9454b4fd78 A race condition existed between the time a UFS/FFS superblock check
hash was computed and the time that the superblock was copied to a
buffer to be written to disk. The result was a failed superblock
check hash the next time that the superblock was read.

The fix is to compute the check hash after the superblock has been
copied to a buffer to be written.

PR:           236504
Reported by:  Peter Holm
Tested by:    Peter Holm
Sponsored by: Netflix
2019-08-06 18:10:34 +00:00
Kirk McKusick 90381b1ca9 When updating the user or group disk quotas for the return of inodes or
disk blocks, set the FORCE flag in the call to chkiq() or chkdq() since
the user is always allowed to return resources and hence there is no need
to check the user's credential .

Reported by:    Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:    FS-1-UFS-1: Denial Of Service in mount (prison_priv_check)
Discussed with: kib
MFC:            1 week
Sponsored by:   Netflix
2019-07-31 22:44:58 +00:00
Kirk McKusick fdf34aa3a5 The error reported in FS-14-UFS-3 can only happen on UFS/FFS
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.

A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.

This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.

Reported by:  Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as:  FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by:  kib
Sponsored by: Netflix
2019-07-17 22:07:43 +00:00
Kirk McKusick 1fd136ec5e When a process attempts to allocate space on a full filesystem, a
filesystem full message is sent to the offending process or the
kernel log if the offending process cannot be identified.

To prevent an explotion of messages, the kernel ppsratecheck()
function is used to limit the messages to one per second. This
revision changes the variable that tracks the rate of these messages
from a systemwide limit to a per-filesystem limit by moving it from
a global variable to a variable in the ufsmount structure.

Suggested by: kib
Reviewed by:  kib
Sponsored by: Netflix
2019-07-16 23:12:27 +00:00
Kirk McKusick daba4da81d Add a new "untrusted" option to the mount command. Its purpose
is to notify the kernel that the file system is untrusted and it
should use more extensive checks on the file-system's metadata
before using it. This option is intended to be used when mounting
file systems from untrusted media such as USB memory sticks or other
externally-provided media.

It will initially be used by the UFS/FFS file system, but should
likely be expanded to be used by other file systems that may appear
on external media like msdosfs, exfat, and ext2fs.

Reviewed by:  kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20786
2019-07-01 23:22:26 +00:00
Mark Johnston 6137883ff3 Remove references to splbio in ffs_softdep.c.
Assert that the per-mountpoint softdep mutex is held in modified
functions that do not already have this assertion.  No functional
change intended.

Reviewed by:	kib, mckusick (previous version)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20741
2019-06-26 16:28:42 +00:00
Xin LI f89d207279 Separate kernel crc32() implementation to its own header (gsb_crc32.h) and
rename the source to gsb_crc32.c.

This is a prerequisite of unifying kernel zlib instances.

PR:		229763
Submitted by:	Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision:	https://reviews.freebsd.org/D20193
2019-06-17 19:49:08 +00:00
Kirk McKusick e94828443c Add a missing bresle() in seldom-used error return. 2019-05-28 17:31:35 +00:00
Kirk McKusick af6aeacb3e Convert use of UFS-specific #ifdef DEBUG to DIAGNOSTIC or INVARIANTS
as appropriate. No functional change intended.

Suggested-by: markj
2019-05-28 16:32:04 +00:00
Kirk McKusick 298184acb8 Add function name and line number debugging information to softupdates
worklist structures to help track their movement between work lists.
No functional change to the operation of soft updates intended.
2019-05-27 06:22:43 +00:00
Alan Somers 65417f5e27 Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it.  Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20377
2019-05-24 20:27:50 +00:00
Conrad Meyer daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Konstantin Belousov 5ffc99e2e4 Handle races when remounting UFS volume from ro to rw.
In particular, ensure that writers are not unleashed before SU
structures are initialized.  Also, correctly handle MNT_ASYNC before
this.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-04-08 15:20:05 +00:00
Mariusz Zaborski a1304030b8 Introduce funlinkat syscall that always us to check if we are removing
the file associated with the given file descriptor.

Reviewed by:	kib, asomers
Reviewed by:	cem, jilles, brooks (they reviewed previous version)
Discussed with:	pjd, and many others
Differential Revision:	https://reviews.freebsd.org/D14567
2019-04-06 09:34:26 +00:00
Kirk McKusick 69166928c7 This is an additional and hopefully final fix for bug report 230962.
This bug was introduced with the change to use softdep_bp_to_mp()
in January 2018 changes -r327723 and -r327821. The softdep_bp_to_mp()
function failed to include VSOCK as one of the valid cases.

Although local-domain sockets do not allocate blocks in the filesystem,
they will allocate blocks if they use extended attributes (such as
ACLs). Thus, softdep_bp_to_mp() needs to return a non-NULL mount
pointer when presented with a socket vnode so that the soft updates
write complete will properly process the soft updates structures
associated with the extended attribute blocks. It was the failure
to process these soft updates structures, thus leaving them hanging
off the buffer, which lead to the "panic: softdep_deallocate_dependencies:
dangling deps" when trying to clean up the buffer after it was written.

PR:           230962
Reported by:  2t8mr7kx9f@protonmail.com
Reviewed by:  kib
Tested by:    Peter Holm
MFC after:    1 week
Sponsored by: Netflix
2019-03-20 23:11:05 +00:00
Kirk McKusick 42a5a356a8 Add KASSERT to the softdep_disk_write_complete() function in the
soft dependency code to ensure that it will be able to avoid a
dangling dependency.

Sponsored by: Netflix
2019-03-12 00:10:31 +00:00