Explicitly pass the struct thread argument.
Move the function prototype from sys/systm.h to geom/geom.h. There is
no need for almost every kernel source file to see the prototype; it is
now used only by kern/vfs_mountroot.c outside geom/geom_event.c, where
the function is defined.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Make most AST handlers dynamically registered. This allows
subsystem-specific handler source to be located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.
Also, it allows some handlers to be designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and the NFS server no longer need to be aware
of UFS softdep processing.
The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads have been processed through the AST handlers in it.
In fact, this is already the existing behavior for hwpmc.ko and ufs.ko.
I do not think it is worth the effort and the runtime overhead to try to
fix it.
Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Rather than trying to shoehorn flags into the requested superblock
address, create a separate flags parameter to the ffs_sbget()
function in sys/ufs/ffs/ffs_subr.c. The ffs_sbget() function is
used both in the kernel and in user-level utilities through export
to the sbget() function in the libufs(3) library (see sbget(3)
for details). The kernel uses ffs_sbget() when mounting UFS
filesystems, in the glabel(8) and gjournal(8) GEOM utilities,
and in the standalone library used when booting the system
from a UFS root filesystem.
The ffs_sbget() function reads the superblock located at the byte
offset specified by its sblockloc parameter. The value UFS_STDSB
may be specified for sblockloc to request that the standard
location for the superblock be read.
The two existing options are now flags:
UFS_NOHASHFAIL will note if the check hash is wrong but will still
return the superblock. This is used by the bootstrap code to
give the system a chance to come up so that fsck can be run to
correct the problem.
UFS_NOMSG indicates that superblock inconsistency error messages
should not be printed. It is used by programs like fsck that
want to print their own error message and programs like glabel(8)
that just want to know if a UFS filesystem exists on a partition.
One additional flag is added:
UFS_NOCSUM causes only the superblock itself to be returned, but does
not read in any auxiliary data structures like the cylinder group
summary information. It is used by clients like glabel(8) that
just want to check for possible filesystem types. Using UFS_NOCSUM
skips the superblock checks for csum data which allows superblocks
that have corrupted csum data to be read and used.
The validate_sblock() function checks that the superblock has not
been corrupted in a way that can crash or hang the system. Unless
the UFS_NOMSG flag is specified, it will print out any errors that
it finds. Prior to this commit, validate_sblock() returned as soon
as it found an inconsistency, so it would print at most one message.
It now does all its checks, so when UFS_NOMSG has not been specified
it will print out everything that it finds inconsistent.
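The "run every check, report unless silenced" behaviour described above
can be sketched in plain C. This is only an illustration: EX_NOMSG,
struct ex_sblock, ex_validate() and the two checks are hypothetical
stand-ins, not the real UFS definitions or the real validate_sblock()
checks.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for UFS_NOMSG; the value is illustrative only. */
#define EX_NOMSG 0x1

/* Hypothetical, heavily reduced superblock with just two fields. */
struct ex_sblock {
	long bsize;	/* block size in bytes */
	long fsize;	/* fragment size in bytes */
};

/*
 * Run every check instead of returning at the first failure.
 * Messages are printed unless EX_NOMSG is set.
 * Returns the number of inconsistencies found (0 means valid).
 */
static int
ex_validate(const struct ex_sblock *sb, int flags)
{
	int nerr = 0;

	assert(sb != NULL);
	if (sb->bsize <= 0 || (sb->bsize & (sb->bsize - 1)) != 0) {
		nerr++;
		if ((flags & EX_NOMSG) == 0)
			fprintf(stderr, "bsize %ld is not a power of two\n",
			    sb->bsize);
	}
	if (sb->fsize <= 0 || sb->fsize > sb->bsize) {
		nerr++;
		if ((flags & EX_NOMSG) == 0)
			fprintf(stderr, "fsize %ld is out of range\n",
			    sb->fsize);
	}
	return (nerr);
}
```

A caller that only wants a yes/no answer passes EX_NOMSG and tests the
return value against zero.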
Sponsored by: The FreeBSD Foundation
With clang 15, the following -Werror warning is produced:
sys/geom/geom_subr.c:484:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
g_wither_washer()
^
void
This is because g_wither_washer() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.
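A minimal illustration of the mismatch, using a hypothetical function
ex_init() rather than the actual GEOM code:

```c
#include <assert.h>

/* Declaration with a real prototype: the function takes no arguments. */
static int ex_init(void);

/*
 * Before C23, writing "ex_init()" with an empty list here would define
 * the function without a prototype, which is what -Wstrict-prototypes
 * complains about. Spelling out (void) makes the definition match the
 * declaration above.
 */
static int
ex_init(void)
{
	return (0);
}
```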
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/geom/geom_io.c:272:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
g_io_init()
^
void
This is because g_io_init() is declared with a (void) argument list, but
defined with an empty argument list. Make the definition match the
declaration.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/geom/geom_event.c:261:13: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
g_run_events()
^
void
sys/geom/geom_event.c:405:12: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
g_do_wither()
^
void
sys/geom/geom_event.c:449:13: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
g_event_init()
^
void
This is because g_run_events(), g_do_wither(), and g_event_init() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
Historically, GEOM utilities (gpart(8), gstripe(8), gmirror(8),
etc) used the gctl_error() routine to report errors. If they called
gctl_error() they would exit with EXIT_FAILURE, otherwise they would
return with EXIT_SUCCESS. If they used gctl_error() to output an
informational message, for example when run with the -v (verbose)
option, they would mistakenly exit with EXIT_FAILURE. A further
limitation of the gctl_error() function was that it could only be
called once. Messages from any additional calls to gctl_error()
would be silently discarded.
To resolve these problems a new function, gctl_msg(), has been added.
It can be called multiple times to output multiple messages. It
also has an additional errno argument which should be zero if it is
an informational message or an errno value (EINVAL, EBUSY, etc) if
it is an error. When done, the gctl_post_messages() function should
be called to indicate that all messages have been posted. If any
of the messages had a non-zero errno, the utility will exit with
EXIT_FAILURE. If only informational messages (with zero errno) were
posted, the utility will exit with EXIT_SUCCESS.
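The exit-status rule can be sketched in userspace C as follows. The
names ex_msgq, ex_msg() and ex_post_messages() are hypothetical; this
does not reproduce the real gctl request structure or the actual
gctl_msg() signature.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define EX_MAXMSG 8

/* Hypothetical message queue; not the real gctl request structure. */
struct ex_msgq {
	const char *text[EX_MAXMSG];
	int nmsg;
	int failed;	/* set once any message carries a non-zero errno */
};

/*
 * Queue one message. err is 0 for an informational message and an
 * errno value for an error. Messages beyond EX_MAXMSG are dropped,
 * but an error is still remembered.
 */
static void
ex_msg(struct ex_msgq *q, int err, const char *text)
{
	assert(q != NULL && text != NULL);
	if (q->nmsg < EX_MAXMSG)
		q->text[q->nmsg++] = text;
	if (err != 0)
		q->failed = 1;
}

/* Print everything queued and return the exit status to use. */
static int
ex_post_messages(const struct ex_msgq *q)
{
	assert(q != NULL);
	for (int i = 0; i < q->nmsg; i++)
		printf("%s\n", q->text[i]);
	return (q->failed ? EXIT_FAILURE : EXIT_SUCCESS);
}
```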
Tested by: Peter Holm
PR: 265184
MFC after: 1 week
Before this patch CAM periph drivers called both disk_alloc() and
disk_create() at the same time on periph creation, but then prevented
disks from being opened until periph probe completion with
cam_periph_hold(). As a result, especially if a disk misbehaved during
the probe, the GEOM event thread, triggered to taste the disk, got
blocked on the open attempt, potentially for a long time, unable to
process other events.
This patch moves the disk_create() call from periph creation to the end
of the probe. To allow disk_create() calls from non-sleepable CAM
contexts, some of its duties requiring memory allocations are moved
either back to disk_alloc() or forward to g_disk_create(), so now
disk_alloc() and disk_add_alias() are the only disk methods that require
sleeping. If a disk fails during the probe, disk_create() may just be
skipped, going directly to disk_destroy(). Other method calls during
that time are just ignored. Since GEOM may now see the disks after the
CAM bus scan has already completed, introduce per-periph boot hold
functions. The enclosure driver already had such a mechanism, so just
generalize it.
Reviewed by: imp
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D35784
SES allows element descriptors to contain characters like spaces and
quotes that devfs does not allow to appear in device aliases. Since SES
element descriptors are outside of the kernel's control, we should
gracefully handle a failure to create a device physical path alias.
PR: 264513
Reported by: Yuri <yuri@aetern.org>
Reviewed by: imp, mav
Sponsored by: Axcient
MFC after: 2 weeks
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35174
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35173
Add support for the following NOTE events:
NOTE_OPEN, NOTE_CLOSE, NOTE_CLOSE_WRITE, NOTE_READ, and NOTE_WRITE.
Differential Revision: https://reviews.freebsd.org/D34777
The contract with the lower layers is that once ENXIO is reported, all
further I/O to the device is not possible. This is reported when the
device departs for good or changes in some material manner out from
underneath the system. Since the lower layers terminate all pending I/O
when this is detected with ENXIO, reporting more than one provides no
extra value. ENXIO suppression is done with atomics due to the race
described in e8827f4094. It's on the error path and a rare event, so
this won't affect performance.
Sponsored by: Netflix
Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D35034
On the 0 -> 1 transition of sc_enxio_active, report that we're doing
this. This is a rare, but interesting, event. Convert to using atomics
to set this field to prevent a rare race:
In CAM, when we invalidate a device, one thread (T1) will start the
process in error processing called from *dadone
(cam_periph_error). This routine will queue work to xpt_async_td
(T2) and indicate to *dadone to call biodone(ENXIO) for the bio. T2
wakes up and basically waits to acquire the periph lock. T2 will do
so when T1 drops the periph lock just before T1's call to
biodone. T2 acquires the lock and calls biodone(ENXIO) on all
pending bios. These two threads will race and we could lose the
printf or get two in rare cases. Since we only touch sc_enxio_active
in an error path that's infrequent, the extra atomic traffic will be
rare but will ensure robustness.
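A sketch of the report-once idea, using C11 atomics as a stand-in for
the kernel's atomic(9) primitives. ex_enxio_active and ex_mark_enxio()
are hypothetical names, not the actual softc field or code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the sc_enxio_active field. */
static atomic_int ex_enxio_active;

/*
 * Returns true only for the caller that performs the 0 -> 1
 * transition, so exactly one thread reports the event even if
 * several threads race to get here.
 */
static bool
ex_mark_enxio(void)
{
	int expected = 0;

	return (atomic_compare_exchange_strong(&ex_enxio_active,
	    &expected, 1));
}
```

Only the caller that gets true back would print the message; every
later caller sees the flag already set and stays quiet.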
Sponsored by: Netflix
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35037
We have a report of a panic in GELI that appears to go away when
unmapped I/O is disabled. Add a tunable to make such investigations
easier in the future. No functional change intended.
PR: 262894
Reviewed by: asomers
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34944
This code was marked gone_in(14), so it can now be removed.
The only consumer of this interface is dumpon(8). We do not maintain
strict backwards compatibility for this utility because a) it
can't/shouldn't be used from a jail or chroot and b) it is a highly
specific interface unique to FreeBSD. The host's (presumably more
up-to-date) copy of dumpon(8) should be used to configure kernel dump
devices.
Reviewed by: markj, emaste
MFC after: never
Differential Revision: https://reviews.freebsd.org/D34914
This code was marked gone_in(13), so its time has passed.
The only consumer of this interface is dumpon(8). We do not maintain
strict backwards compatibility for this utility because a) it
can't/shouldn't be used from a jail or chroot and b) it is a highly
specific interface unique to FreeBSD. The host's (presumably more
up-to-date) copy of dumpon(8) should be used to configure kernel dump
devices.
Reviewed by: markj, emaste
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34913
Traditionally, GEOM's primary channel of information from kernel to
user-space was confxml, fetched by libgeom through the kern.geom.confxml
sysctl. It is convenient and informative, representing the full state of
GEOM in a single XML document. But problems start to arise on systems
with hundreds of disks, where the full confxml size reaches many
megabytes, taking significant time first to write it and then to parse
it.
This patch introduces an alternative solution, allowing a much smaller
XML document to be fetched: a subset of the full confxml, limited to
64KB and representing only one specified geom and optionally its
parents. It uses the existing GEOM control interface, extended with a
new "getxml" verb. In case of any error, such as a buffer overflow, it
just transparently falls back to the traditional full confxml. This
patch uses the new API in user-space GEOM tools where possible.
Reviewed by: imp
MFC after: 2 months
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D34529
Create g_part_getattr to allow gpart geoms to have their attributes queried.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32782
The gunion(8) utility is used to track changes to a read-only disk on
a writable disk. Logically, a writable disk is placed over a read-only
disk. Write requests are intercepted and stored on the writable
disk. Read requests are first checked to see if they have been
written on the top (writable disk) and if found are returned. If
they have not been written on the top disk, then they are read from
the lower disk.
The gunion(8) utility can be especially useful if you have a large
disk with a corrupted filesystem that you are unsure of how to
repair. You can use gunion(8) to place another disk over the corrupted
disk and then attempt to repair the filesystem. If the repair fails,
you can revert all the changes in the upper disk and be back to the
unchanged state of the lower disk thus allowing you to try another
approach to repairing it. If the repair is successful you can commit
all the writes recorded on the top disk to the lower disk.
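The read, write, revert and commit behaviour described above can be
modelled with a small in-memory sketch. The ex_union structure and its
helpers are hypothetical and ignore everything a real GEOM class has to
deal with (I/O scheduling, persistence of the map, locking).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define EX_NBLK  4
#define EX_BLKSZ 8

/* Hypothetical in-memory union: a writable top layer over a lower disk. */
struct ex_union {
	unsigned char lower[EX_NBLK][EX_BLKSZ];
	unsigned char upper[EX_NBLK][EX_BLKSZ];
	bool written[EX_NBLK];	/* which blocks live in the top layer */
};

/* Writes never touch the lower disk; they land in the top layer. */
static void
ex_write(struct ex_union *u, int blk, const void *data)
{
	assert(u != NULL && blk >= 0 && blk < EX_NBLK);
	memcpy(u->upper[blk], data, EX_BLKSZ);
	u->written[blk] = true;
}

/* Reads prefer the top layer and fall through to the lower disk. */
static const unsigned char *
ex_read(const struct ex_union *u, int blk)
{
	assert(u != NULL && blk >= 0 && blk < EX_NBLK);
	return (u->written[blk] ? u->upper[blk] : u->lower[blk]);
}

/* Revert: forget all recorded writes; the lower disk is unchanged. */
static void
ex_revert(struct ex_union *u)
{
	memset(u->written, 0, sizeof(u->written));
}

/* Commit: copy every recorded write down to the lower disk. */
static void
ex_commit(struct ex_union *u)
{
	for (int b = 0; b < EX_NBLK; b++) {
		if (u->written[b]) {
			memcpy(u->lower[b], u->upper[b], EX_BLKSZ);
			u->written[b] = false;
		}
	}
}
```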
Another use of the gunion(8) utility is to try out upgrades to your
system. Place the upper disk over the disk holding your filesystem
that is to be upgraded and then run the upgrade on it. If it works,
commit it; if it fails, revert the upgrade.
Further details can be found in the gunion(8) manual page.
Reviewed by: Chuck Silvers, kib (earlier version)
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D32697
The gctl_error() function provides GEOM modules with the ability
to report only a single message. When running with the verbose
flag, commands that handle multiple devices may want to report a
message for each of the devices on which it operates. This commit
adds the gctl_msg() function that can be called multiple times
to post messages. When finished issuing messages, the application
must either call gctl_post_messages() or call gctl_error() to cause
the messages to be reported to the calling process.
Tested by: Peter Holm
When using bio's created by g_clone_bio() or g_duplicate_bio()
their consumer device (the device to which their I/O requests
are sent) is listed by the geom debugging facility as [unknown].
If available, this update lists the consumer associated with
the bio's parent.
MFC after: 2 weeks
Sponsored by: Netflix
All I/O requests through the taste consumers are synchronous, done
with g_read_data() and without any locks held. It makes no sense
to delegate the I/O to g_down/g_up threads.
This removes many context switches during disk retaste.
MFC after: 2 weeks
The geom_gate API provides 2 distinct paths for exchanging error
details between the kernel and the userland client: Including an error
code in the g_gate_ctl_io structure passed in the ioctl(2) call or
having the ioctl(2) call return -1 with an error code in errno. The
latter reflects errors in the ioctl(2) call itself whilst the former
reflects errors within the geom_gate instance.
The G_GATE_CMD_START ioctl blocks waiting for an I/O request to be
directed to the geom_gate instance and the wait can fail
(necessitating an error return) if the geom_gate instance is destroyed
or if the msleep(9) fails. The code previously treated both error
cases identically: Returning ECANCELED as a geom_gate instance error
(which the ggatec treats as a fatal error). Whilst this is the correct
behaviour if the geom_gate instance is destroyed, a msleep(9) failure
is unrelated to the geom_gate instance itself and should be reported
as an ioctl(2) "failure". The distinction is important because
msleep(9) can return ERESTART, which means the system call should be
retried (and this will occur automatically as part of the generic
syscall return processing).
This change alters the msleep(9) handling to directly return the error
code from msleep(9), which ensures ERESTART is correctly handled,
rather than being treated as a fatal error.
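The two reporting paths can be sketched as below. struct ex_ctl_io,
ex_wait_result() and the EX_ERESTART value are hypothetical and only
model the decision, not the actual geom_gate ioctl handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative value only; the kernel defines its own ERESTART. */
#define EX_ERESTART (-1)

/* Hypothetical reduced form of the structure passed to the ioctl. */
struct ex_ctl_io {
	int error;	/* error within the gate instance itself */
};

/*
 * Decide how a failed wait is reported.
 * destroyed:   the gate instance went away while waiting.
 * sleep_error: non-zero if the sleep itself failed.
 * The return value is the ioctl's own result; io->error carries the
 * in-band instance error.
 */
static int
ex_wait_result(struct ex_ctl_io *io, int destroyed, int sleep_error)
{
	assert(io != NULL);
	io->error = 0;
	if (destroyed) {
		/* Instance error: the client treats this as fatal. */
		io->error = ECANCELED;
		return (0);
	}
	/* Sleep failure: hand the code back so a restart can happen. */
	return (sleep_error);
}
```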
Reviewed by: Johannes Totz <jo@bruelltuete.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33996
All I/O requests through the taste consumer are synchronous, done
with g_read_data() and without any locks held. It makes no sense
to delegate the I/O to g_down/g_up threads.
This removes many context switches during disk retaste.
MFC after: 2 weeks
The only cases when direct dispatch does not make sense are I/O
submission from the down thread and completion from the up thread. In
all other cases, if both consumer and producer are OK about it, we
can save on context switches.
MFC after: 2 weeks
Unlike normal consumers all taste consumer I/O is synchronous, done
with g_read_data() and without any locks held. It makes no sense to
delegate I/O submission to g_down thread.
This should remove a number of context switches during disk retaste.
MFC after: 2 weeks
Like BIO_FLUSH, there is no reason for consumers to pass a BIO_SPEEDUP
request with non-NULL bio_data, so assert this.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In particular, there is no need to allocate a data block when passing
BIO_FLUSH requests to child providers, and g_io_request() asserts that
bp->bio_data == NULL for such requests.
PR: 255131
Reported and tested by: nvass@gmx.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
I observed a situation where some read requests failed when a 2-way geom
mirror lost one disk. The problem appears to be in the logic that skips
retrying a failed request when a mirror has only one active disk.
Generally, that makes sense. But during a transition from two disks to
one it is possible that the request failed on the failing disk before it
was inactivated and, so, the remaining active disk is the disk that
should be tried.
This change adds an additional check to ensure that it was the (only)
active disk that was already tried.
Reviewed by: mav
MFC after: 3 weeks
As documented in the HiFive Unmatched Software Reference Manual.
Reviewed by: imp, mhorne
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34010
These routines are used internally by GEOM to dispatch I/O requests to a
provider, typically for tasting or for updating GEOM class metadata
blocks.
These routines assumed that partial I/O did not occur without setting
BIO_ERROR, but this is possible in at least two cases:
- Some or all of the I/O range is beyond the provider's mediasize.
In this scenario g_io_check() truncates the bounds of the request
before it is handed to the target provider.
- A read from vnode-backed md(4) device returns EOF (the backing vnode
is allowed to be smaller than the device itself) or partial vnode I/O
occurs.
In these scenarios g_read_data() could return a partially uninitialized
buffer. Many consumers are not affected by the first case, since the
offsets used for provider metadata or tasting are relative to the
provider's mediasize, but in some cases metadata is read at fixed
offsets, such as when searching for a UFS superblock using the offsets
defined by SBLOCKSEARCH.
Thus, modify the routines to explicitly check for a non-zero residual
and return EIO in that case. Remove a related check from the
DIOCGDELETE ioctl handler; it is handled within g_delete_data() now.
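The residual check can be illustrated with a memory-backed model. The
ex_provider structure and both functions are hypothetical; the lower
layer deliberately truncates out-of-range requests without reporting an
error, which is the situation described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical memory-backed provider. */
struct ex_provider {
	const unsigned char *media;
	size_t mediasize;
};

/*
 * Lower layer: copies what it can and reports the untransferred byte
 * count through *resid. A truncated request is not an error here.
 */
static int
ex_io_read(const struct ex_provider *pp, size_t off, void *buf,
    size_t len, size_t *resid)
{
	size_t avail = off < pp->mediasize ? pp->mediasize - off : 0;
	size_t n = len < avail ? len : avail;

	if (n > 0)
		memcpy(buf, pp->media + off, n);
	*resid = len - n;
	return (0);
}

/*
 * Wrapper: treat any short transfer as EIO so that callers never see
 * a partially filled buffer.
 */
static int
ex_read_data(const struct ex_provider *pp, size_t off, void *buf,
    size_t len)
{
	size_t resid;
	int error;

	assert(pp != NULL && buf != NULL);
	error = ex_io_read(pp, off, buf, len, &resid);
	if (error == 0 && resid != 0)
		error = EIO;
	return (error);
}
```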
Reviewed by: mav, imp, kib
Reported by: KMSAN
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31293
The only event hooked up is NOTE_ATTRIB, which is triggered when the
device is resized. Support for other NOTE_* events to follow.
Reviewed by: kib, jhb
Differential Revision: https://reviews.freebsd.org/D33402
IVs are not the size of keys as a general case. Most often they are
the size of a single block.
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33885
It must be greater than zero, and be a multiple of the device block size.
In collaboration with: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33721
With crafted input to the G_GATE_CMD_CREATE ioctl, geom_gate can be made
to print kernel memory to the system console, potentially revealing
sensitive data from whatever was previously in that memory page.
But but but: this is a case of the sys admin misconfiguring, and you'd
need root privileges to do this.
Submitted By: Johannes Totz <jo@bruelltuete.com>
MFC after: 2 weeks
Reviewed By: asomers
Differential Revision: https://reviews.freebsd.org/D31727
- Remove timeouts from msleep()'s. Those should always be woken up.
- Move wakeup() under the lock to not call on possibly freed pointer.
- Remove some dead code.
MFC after: 2 weeks
The commit at hand happens to break the userspace build, as the header
is included by sbin/gbde/gbde.c and the __diagused macro is not provided
to userspace.
Revert until this gets sorted out.
This reverts commit 26e837e2d4.
This definition enables callers to estimate remaining space on the
kstack, and take action on it. Notably, it enables optimizations in the
GEOM and netgraph subsystems to directly dispatch work items when there
is sufficient stack space, rather than queuing them for a worker thread.
Implement it for riscv, arm, and mips. Remove the #ifdefs, so it will
not go unimplemented elsewhere.
PR: 259157
Reviewed by: mav, kib, markj (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32580
A single provider may have multiple consumers, and locking one of the
consumers is not sufficient to protect the provider. The only part of
the provider this locking protects now, though, is its statistics.
Reported by: Arka Sharma <arka.sw1988@gmail.com>
MFC after: 2 weeks
The mntfs vnode lock should be before topology, as established in
ffs_mountfs(). Extend the locked region in ffs_unmount().
Reported and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D33013
disk labels.
When the geom label subsystem is checking labels to discover if they
are UFS/FFS filesystems, do not print a kernel error message if a
superblock is found with a check-hash error. That issue is best
handled later if an attempt is made to actually use the filesystem.
Sponsored by: Netflix
It is needed for g_vfs_close() invalidating the buffers. We rely on the
vnode lock for correctness.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761
Move the efimedia reporting to g_part_mbr_efimedia and use that from
g_part_mbr_dumpconf to report it.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32781
Move the efimedia reporting to g_part_gpt_efimedia and use that from
g_part_gpt_dumpconf to report it.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D32780
I dropped the + 1 from the other two instances in each file but failed
to do so for this one, resulting in a more egregious buffer overread
than the one I was fixing (since the read character ended up in the
output if there was space).
Reported by: Jenkins
Fixes: 34fb1c133c ("Fix intra-object buffer overread for labeled msdosfs volumes")
Volume labels, like directory entries, are padded with spaces and so
have no NUL terminator. Whilst the MIN for the dsize argument to strlcpy
ensures that the copy does not overflow the destination, strlcpy is
defined to return the number of characters in the source string,
regardless of the provided dsize, and so keeps reading until it finds a
NUL, which likely exists somewhere within the following fields. On
CHERI with the subobject bounds enabled in the compiler, however, this
buffer overread will be detected and trap with a bounds violation.
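One way to copy such a field without ever reading past it is sketched
below. ex_copy_label() is a hypothetical helper shown for illustration;
it is not necessarily how the commit itself fixes the problem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a fixed-width, space-padded field that has no NUL terminator.
 * Only the first srclen bytes of src are ever read. Trailing spaces
 * are trimmed and the result is always NUL-terminated.
 * Returns the length of the resulting string.
 */
static size_t
ex_copy_label(char *dst, size_t dstsize, const char *src, size_t srclen)
{
	size_t n;

	assert(dst != NULL && src != NULL && dstsize > 0);
	n = srclen < dstsize - 1 ? srclen : dstsize - 1;
	while (n > 0 && src[n - 1] == ' ')
		n--;
	memcpy(dst, src, n);
	dst[n] = '\0';
	return (n);
}
```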
Found by: CHERI
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D32579
- Ensure that the computed MFT record size isn't negative or larger than
maxphys before trying to read $Volume.
- Guard against truncated records in volume metadata.
- Ensure that the record length is large enough to contain the volume
name.
- Verify that the (UTF-16-encoded) volume name's length is a multiple of
two.
PR: 258833, 258914
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This fixes a memory leak when the last GELI provider is destroyed,
introduced in 2dbc9a388e. This patch was originally developed in late
2019, and the flag was necessary to prevent zone drainage under memory
pressure. Today, with f09cbea31a, UMA is fixed not to drain into
reserves.
Discussed with: jtl, markj
Fixes: 2dbc9a388e
PR: 258787
When we get low on memory, the VM system tries to free some by swapping
pages. However, if we are so low on free pages that GELI allocations block,
then the swapout operation cannot complete. This keeps the VM system from
being able to free enough memory so the allocation can complete.
To alleviate this, keep a UMA pool at the GELI layer which is used for data
buffer allocation in the fast path, and reserve some of that memory for swap
operations. If an IO operation is a swap, then use the reserved memory. If
the allocation still fails, return ENOMEM instead of blocking.
For non-swap allocations, change the default to using M_NOWAIT. In general,
this *should* be better, since it gives upper layers a signal of the memory
pressure and a chance to manage their failure strategy appropriately. However,
a user can set the kern.geom.eli.blocking_malloc sysctl/tunable to restore
the previous M_WAITOK strategy.
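The reserve policy can be modelled with a simple counter. struct
ex_pool and ex_pool_get() are hypothetical and say nothing about the
real UMA interfaces; they only show swap requests being allowed to dip
into a slice that ordinary requests cannot touch, with ENOMEM instead
of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical buffer pool, counted in items. */
struct ex_pool {
	int free_items;	/* items currently available */
	int reserve;	/* items only swap I/O may consume */
};

/*
 * Non-blocking allocation. Ordinary requests must leave the reserve
 * untouched; swap requests may use it. Returns 0 on success and
 * ENOMEM when nothing suitable is left.
 */
static int
ex_pool_get(struct ex_pool *p, bool is_swap)
{
	int floor;

	assert(p != NULL);
	floor = is_swap ? 0 : p->reserve;
	if (p->free_items <= floor)
		return (ENOMEM);
	p->free_items--;
	return (0);
}
```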
Submitted by: jtl
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D24400
Make sure that the provider sector size is large enough to contain a
valid label before trying to read it. We performed this check already
for most label types, but not for several filesystem labels.
Reported by: syzbot+f52918174cdf193ae29c@syzkaller.appspotmail.com
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
- In g_disk_start(), verify that the data to be written is initialized
according to KMSAN shadow state.
- In g_disk_done(), verify that the block driver updated shadow state as
expected, so as to catch sources of false positives early.
Sponsored by: The FreeBSD Foundation
When an active g_vfs is orphaned due to an underlying disk going away
the destroy is deferred until the filesystem is unmounted in
g_vfs_done(). However, g_vfs_done() is invoked from a non-sleepable
context and cannot use M_WAITOK to allocate the event. Instead,
allocate the event in g_vfs_orphan() and save it in the softc to be
retrieved by the last call to g_vfs_done().
Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31354
Preallocate a geom_event (using the new g_alloc_event) when we create
a disk. When we create the disk, we're going to be in a sleepable
context, so we can always allocate this extra bit of memory. Then use
this preallocated memory to free the disk. CAM can try to free the disk
from an unsleepable context if there was I/O outstanding when the disk
was destroyed (say because the SIM said it had gone away). The I/O
context isn't sleepable. Rather than trying to invent a retry mechanism
and making sure all the other geom_disk consumers did it properly,
preallocating the event ensures that the geom_disk will be properly torn
down, even when there's memory pressure when the disk departs.
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30544
g_alloc_event will allocate storage for an opaque event. g_post_event_ep
can use memory returned by g_alloc_event to send an event from a context
that might not be able to allocate the event. Occasionally, we can
allocate memory when we create an object, but not while we're destroying
it. This allows one to allocate at creation time memory to use when
destroying the object.
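The allocate-at-creation, consume-at-destruction pattern looks roughly
like this in userspace C. The ex_event and ex_disk types and their
helpers are hypothetical and do not reproduce the g_alloc_event or
g_post_event_ep signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical event: a callback plus its argument. */
struct ex_event {
	void (*func)(void *);
	void *arg;
};

struct ex_disk {
	struct ex_event *destroy_ev;	/* allocated up front */
	int destroyed;
};

/* Creation runs where allocating (and sleeping) is allowed. */
static struct ex_disk *
ex_disk_create(void)
{
	struct ex_disk *d = calloc(1, sizeof(*d));

	if (d == NULL)
		return (NULL);
	d->destroy_ev = calloc(1, sizeof(*d->destroy_ev));
	if (d->destroy_ev == NULL) {
		free(d);
		return (NULL);
	}
	return (d);
}

static void
ex_destroy_cb(void *arg)
{
	((struct ex_disk *)arg)->destroyed = 1;
}

/*
 * Teardown path: performs no allocation. It fills in and hands over
 * the event that was set aside when the disk was created.
 */
static struct ex_event *
ex_disk_post_destroy(struct ex_disk *d)
{
	struct ex_event *ev;

	assert(d != NULL && d->destroy_ev != NULL);
	ev = d->destroy_ev;
	d->destroy_ev = NULL;
	ev->func = ex_destroy_cb;
	ev->arg = d;
	return (ev);
}
```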
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30544
This partially reverts commit af433832f7.
Since such bogus disklabels still exist in the wild, we now probe for a
disklabel to decide whether to ignore the UFS partition or not; if there
is a label then we use the old behaviour, and if there isn't one then we
use the new behaviour.
Reviewed by: cy, mckusick
Differential Revision: https://reviews.freebsd.org/D31068
When authentication is configured, GELI ensures that the amount of data
per sector is a multiple of 16 bytes. This is done in
eli_metadata_softc(). When the digest size is not a multiple of 16
bytes, this leaves some extra pad bytes at the end of every sector, and
they were not being zeroed before being written to disk. In particular,
this happens with the HMAC/SHA1, HMAC/RIPEMD160 and HMAC/SHA384 data
authentication algorithms.
This change ensures that they are zeroed before being written to disk.
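The arithmetic behind the pad bytes can be sketched as follows, assuming
a 512-byte sector and a 20-byte digest as with HMAC/SHA1. The helper
names and the "digest, then data, then pad" layout are assumptions made
for illustration, not a description of the actual on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Data bytes per sector: what is left after the digest, rounded down
 * to a multiple of 16. */
static size_t
ex_data_per_sector(size_t sectorsize, size_t digestlen)
{
	size_t data;

	assert(sectorsize > digestlen);
	data = sectorsize - digestlen;
	return (data - data % 16);
}

/*
 * Zero the unused tail of an encoded sector (assumed layout: digest,
 * data, pad). Returns the number of pad bytes that were cleared.
 */
static size_t
ex_zero_pad(unsigned char *sector, size_t sectorsize, size_t digestlen)
{
	size_t used;

	assert(sector != NULL);
	used = digestlen + ex_data_per_sector(sectorsize, digestlen);
	memset(sector + used, 0, sectorsize - used);
	return (sectorsize - used);
}
```

With these numbers 512 - 20 = 492 bytes remain, 480 of them carry data,
and 12 pad bytes are left at the end of each sector.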
Reported by: KMSAN
Reviewed by: delphij, asomers
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31170
Ensure that string buffers and pad bytes are zero-filled before writing
graid3 metadata.
Reported by: KMSAN
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Ensure that string buffers and pad bytes are zero-filled before writing
gconcat metadata. Also make sure to zero the full block buffer before
encoding the metadata and writing.
Fix some style bugs in g_concat_write_metadata() while here.
Reported by: KMSAN
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The mirror metadata fields contain string buffers and pad bytes, neither
of which was being zeroed before metadata was written to disk. Also, the
metadata structure is smaller than the sector size, and in one case
gmirror was failing to zero-fill the full buffer before writing.
Fix these problems by pre-zeroing the metadata structure and the sector
buffer.
Reported by: KMSAN
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
We removed sysinstall(8) back in 2011, so this workaround should be long
since unnecessary. This workaround can end up breaking cases that are
hit in the real world, such as dd'ing a small pre-built disk image to a
large partition that you intend to grow on first boot and that uses a
UFS disk label for / in its /etc/fstab (as the only reliable thing a raw
UFS image can reference).
Reviewed by: imp, mckusick
Differential Revision: https://reviews.freebsd.org/D30825
Implement the "gconcat append" command which can be used
to append a disk to the end of an existing gconcat device
without unmounting.
If the gconcat device is using the "automatic" method, i.e.,
stores metadata on the devices, new metadata is written
to all existing components, as well as to the newly added one.
Pull Request: https://github.com/freebsd/freebsd-src/pull/472
Reviewed by: imp@
zfsd uses a device's physical path attribute to automatically replace a
missing ZFS disk when a blank disk is inserted into the same physical
slot. Currently gmultipath passes through its underlying providers'
physical path attribute. That may cause zfsd to replace a missing
gmultipath provider with a newly arrived, single-path disk. That would
be bad.
This commit fixes that problem by simply appending "/mp" to the
underlying providers' physical path, in a manner similar to what geli
already does.
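The suffixing itself amounts to a bounded string format. The sketch
below uses a hypothetical helper, ex_mp_physpath(); how the real code
reports the attribute and handles short buffers is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build the multipath physical path by appending "/mp" to the
 * underlying provider's path. Returns 0 on success, or -1 if the
 * result would not fit in dst.
 */
static int
ex_mp_physpath(char *dst, size_t dstsize, const char *underlying)
{
	int n;

	assert(dst != NULL && underlying != NULL);
	n = snprintf(dst, dstsize, "%s/mp", underlying);
	return (n < 0 || (size_t)n >= dstsize ? -1 : 0);
}
```

Because the multipath device now reports a different path than a bare
disk in the same slot, a consumer that matches on physical path will
not confuse the two.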
Sponsored by: Axcient
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D29941
Currently, OpenCrypto consumers can request asynchronous dispatch by
setting a flag in the cryptop. (Currently only IPSec may do this.) I
think this is a bit confusing: we (conditionally) set cryptop flags to
request async dispatch, and then crypto_dispatch() immediately examines
those flags to see if the consumer wants async dispatch. The flag names
are also confusing since they don't specify what "async" applies to:
dispatch or completion.
Add a new KPI, crypto_dispatch_async(), rather than encoding the
requested dispatch type in each cryptop. crypto_dispatch_async() falls
back to crypto_dispatch() if the session's driver provides asynchronous
dispatch. Get rid of CRYPTOP_ASYNC() and CRYPTOP_ASYNC_KEEPORDER().
Similarly, add crypto_dispatch_batch() to request processing of a tailq
of cryptops, rather than encoding the scheduling policy using cryptop
flags. Convert GELI, the only user of this interface (disabled by
default) to use the new interface.
Add CRYPTO_SESS_SYNC(), which can be used by consumers to determine
whether crypto requests will be dispatched synchronously. This is just
a helper macro. Use it instead of looking at cap flags directly.
Fix style in crypto_done(). Also get rid of CRYPTO_RETW_EMPTY() and
just check the relevant queues directly. This could result in some
unnecessary wakeups but I think it's very uncommon to be using more than
one queue per worker in a given workload, so checking all three queues
is a waste of cycles.
Reviewed by: jhb
Sponsored by: Ampere Computing
Submitted by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28194
This fixes a failed assertion in a scenario where the provider
disappears, disk_gone() gets called, and at the exact same
time something else closes the device node triggering a retaste.
Reviewed By: mav
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27330
Replace MAXPHYS by the runtime variable maxphys. It is initialized from
MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
Make the b_pages[] array in struct buf flexible. Size b_pages[] for
buffer cache buffers exactly to atop(maxbcachebuf) (currently it is
sized to atop(MAXPHYS)), and b_pages[] for pbufs is sized to
atop(maxphys) + 1. The +1 for pbufs allows several pbuf consumers, among
them vmapbuf(), to use unaligned buffers still sized to maxphys, esp.
when such buffers come from userspace (*). Overall, we save a
significant amount of otherwise wasted memory in b_pages[] for buffer
cache buffers, while bumping MAXPHYS to the desired high value.
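The need for the extra page can be checked with a little arithmetic.
The sketch assumes 4096-byte pages and a 128KB maxphys purely for
illustration; ex_pages_spanned() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SIZE 4096u	/* illustrative page size */

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t
ex_pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	assert(len > 0);
	first = addr / EX_PAGE_SIZE;
	last = (addr + len - 1) / EX_PAGE_SIZE;
	return ((size_t)(last - first + 1));
}
```

A page-aligned 128KB buffer touches 32 pages, but the same buffer
starting 512 bytes into a page touches 33, which is the case the +1
covers.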
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except the place which initializes maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embedded structures,
get a private MAXPHYS-like constant; their conversion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, were either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
On 32-bit platforms, the computed size of the BIO_SPEEDUP requested by
softdep_request_cleanup() may be negative when assigned to bp->b_bcount,
which has type "long".
Clamp the size to LONG_MAX. Also convert the unused g_io_speedup() to
use an off_t for the magnitude of the shortage for consistency with
softdep_send_speedup().
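The clamp can be sketched as below. ex_clamp_to_long() is a
hypothetical helper; on platforms where long is 64 bits wide the
comparison never triggers and the value passes through unchanged.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Clamp a 64-bit shortage to what fits in a long, so that storing it
 * in a long-typed field cannot wrap to a negative value on 32-bit
 * platforms.
 */
static long
ex_clamp_to_long(int64_t shortage)
{
	assert(shortage >= 0);
	if (shortage > (int64_t)LONG_MAX)
		return (LONG_MAX);
	return ((long)shortage);
}
```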
Reviewed by: chs, kib
Reported by: pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27081