Commit graph

301 commits

Author SHA1 Message Date
Alan Somers 9d449caddd md: round-trip the MUSTDEALLOC and RESERVE options
If those options are requested when the device is created, ensure that
they will be reported by MDIOCQUERY.

MFC after:	2 weeks
Reviewed by:	imp
Pull Request:	https://github.com/freebsd/freebsd-src/pull/1270
2024-06-01 12:40:19 -06:00
John Baldwin f75764fea3 md: Merge two switch statements in mdstart_vnode
While here, use bp->bio_cmd instead of auio.uio_rw to drive read vs
write behavior.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D45155
2024-05-10 13:43:23 -07:00
Andrew Gallatin 13a5a46c49 Fix new users of MAXPHYS and hide it from the kernel namespace
In cd85379104, kib made maxphys a load-time tunable.  This made
the #define MAXPHYS in sys/param.h  almost entirely obsolete, as
it could now be overridden by kern.maxphys at boot time, or by
opt_maxphys.h.

However, decades of tradition have led to several new, incorrect, uses
of MAXPHYS in other parts of the kernel, mostly by seasoned
developers.  I've corrected those uses here in a mechanical fashion,
and verified that it fixes a bug in the md driver that I was
experiencing.

Since using MAXPHYS is such an easy mistake to make, it is best to
hide it from the kernel namespace.  So I've moved its definition to
_maxphys.h, which is now included in param.h only for userspace.

That brings up the fact that lots of userspace programs use MAXPHYS
for different reasons, most of them probably wrong.  Userspace consumers
that really need to know the value of maxphys should probably be
changed to use the kern.maxphys sysctl.  But that's outside the scope
of this change.

Reviewed by: imp, jkim, kib, markj
Fixes: 30038a8b4e ("md: Get rid of the pbuf zone")
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D44986
2024-04-30 15:29:06 -04:00
Warner Losh 29363fb446 sys: Remove ancient SCCS tags.
Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by:		Netflix
2023-11-26 22:23:30 -07:00
Warner Losh 95ee2897e9 sys: Remove $FreeBSD$: two-line .h pattern
Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
2023-08-16 11:54:11 -06:00
Mike Karels 58a46cfd75 md driver compat32: fix structure padding for arm, powerpc
Because the 32-bit md_ioctl structure contains 64-bit members, arm
and powerpc add padding to a multiple of 8.  i386 doesn't do this.
The md_ioctl32 definition was correct for amd64/i386 without padding,
but wrong for arm64 and powerpc64.  Make __packed__ conditional on
__amd64__, and test for the expected size on non-amd64.  Note that
mdconfig is used in the ATF test suite.  Note, I verified the
structure size for powerpc, but was unable to test.

MFC after:	1 week
Reviewed by:	jrtc27
Differential Revision:	https://reviews.freebsd.org/D41339
Discussed with:	jhibbits
2023-08-08 09:09:03 -05:00
Jessica Clarke 8a6ab0f71f Pre-quote macros passed to .incbin to avoid unwanted substitution
Currently for the MFS, firmware and VDSO template assembly files we pass
the path to include with .incbin unquoted and use __XSTRING within the
assembly file to stringify it. However, __XSTRING doesn't just perform a
single level of expansion, it performs the normal full expansion of the
macro, and so if the path itself happens to tokenise to something that
includes a defined macro in it that will itself be substituted. For
example, with #define MACRO 1, a path like /path/containing/MACRO/in/it
will expand to /path/containing/1/in/it and then, when stringified, end
up as "/path/containing/1/in/it", not the intended string. Normally,
macros have names that start or end witih underscores and are unlikely
to appear in a tokenised path (even if technically they could), but now
that we've switched to GNU C as of commit ec41a96daa ("sys: Switch the
kernel's C standard from C99 to GNU99.") there are a few new macros
defined which don't start or end with underscores: unix, which is always
defined to 1, and i386, which is defined to 1 on i386. The former
probably doesn't appear in user paths in practice, but the latter has
been seen to and is likely quite common in the wild.

Fix this by defining the macro pre-quoted instead of using __XSTRING.
Note that technically we don't need to do this for vdso_wrap.S today as
all the paths passed to it are safe file names with no user-controlled
prefix but we should do it anyway for consistency and robustness against
future changes.

This allows make tinderbox to pass when built with source and object
directories inside ~/path-with-unix, which would otherwise expand to
~/path-with-1 and break.

PR:	272744
Fixes:	ec41a96daa ("sys: Switch the kernel's C standard from C99 to GNU99.")
2023-07-28 05:08:43 +01:00
Mark Johnston 30038a8b4e md: Get rid of the pbuf zone
The zone is used solely to provide KVA for mapping BIOs so that we can
pass mapped buffers to VOP_READ and VOP_WRITE.  Currently we preallocate
nswbuf/10 bufs for this purpose during boot.

The intent was to limit KVA usage on 32-bit systems, but the
preallocation means that we in fact consumed more KVA than needed unless
one has more than nswbuf/10 (typically 25) vnode-backed MD devices
in existence, which I would argue is the uncommon case.

Meanwhile, all I/O to an MD is handled by a dedicated thread, so we can
instead simply preallocate the KVA region at MD device creation time.

Event:		BSDCan 2023
Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D40215
2023-05-23 10:27:10 -04:00
Konstantin Belousov ad8feb1ed9 md.c: another style fix
Noted by:	jkim
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-01-20 04:39:36 +02:00
Konstantin Belousov 6189672e60 Handle ERELOOKUP from VOP_FSYNC() in several other places
We need to repeat the operation if the vnode was relocked.

Reported and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38114
2023-01-20 03:54:56 +02:00
Mateusz Guzik bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) 2022-03-24 10:20:51 +00:00
Aleksandr Fedorov cb28dfb27d md(4): Add dummy support of the BIO_FLUSH command for malloc and swap
backend.

PR:	260200
Reported by:	editor@callfortesting.org
Reviewed by:	vmaffione (mentor), markj
Approved by:	vmaffione (mentor), markj
Differential Revision:	https://reviews.freebsd.org/D34260
2022-02-17 22:21:56 +03:00
Kyle Evans b9c92d631c Annotate geom_md with MODULE_VERSION
This was missed in 74d6c131cb where other geom modules were annotated
with MODULE_VERSION.  Again, the problem is the same: we can't detect
that geom_md is loaded into the kernel without it.

This was noticed in release builds on the cluster; mdconfig attempts to
load geom_md because it can't detect it in the kernel, but the cluster
config includes md(4) and does not build the kmod.  This problem would
have been masked on hosts with the kmod built, as the kmod attempts to
register the g_md module and fails.  With this commit, mdconfig would
not even try to load it again.

Reported by:	re (cperciva)
MFC after:	3 days
2022-02-10 00:16:19 -06:00
Mateusz Guzik 7e1d3eefd4 vfs: remove the unused thread argument from NDINIT*
See b4a58fbf64 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.
2021-11-25 22:50:42 +00:00
Ka Ho Ng 3703c18883 md: Add MD_MUSTDEALLOC support
This adds an option to detect if hole-punching is implemented by the
underlying file system.  If this flag is set, and if the underlying file
system does not support hole-punching, md(4) fails BIO_DELETE requests
with EOPNOTSUPP.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D31883
2021-09-11 20:04:52 +08:00
Mark Johnston 47619b6044 md: Clamp to a multiple of the sector size when resizing
We do this when creating md(4) devices, in kern_mdattach_locked(), but
not when resizing the provider.  Apply the same policy when resizing, as
many GEOM classes do not expect to deal with providers for which
pp->mediasize % pp->sectorsize != 0.

Reported by:	syzkaller
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-08-31 15:50:04 -04:00
Ka Ho Ng 78267c2e70 md: Replace BIO_DELETE emulation with vn_deallocate(9)
Both zero-filling and/or deallocation can be done with vn_deallocate(9).

Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D28899
2021-08-19 18:30:13 +08:00
Alex Richardson 69e18c9b7b sys/dev/md: Drop unncessary __GLOBL(mfs_root)
LLVM12 complains if you change the symbol binding:
error: mfs_root_end changed binding to STB_WEAK [-Werror,-Winline-asm]
error: mfs_root changed binding to STB_WEAK [-Werror,-Winline-asm]
2021-03-30 14:59:43 +01:00
Mark Johnston c4cceb1d0d md: Fix a race in mdstart_swap()
Release a grabbed page's busy state only after marking it as referenced.
Otherwise there exists a narrow window where the page could be freed
before the update.  Before r356902 this was not a problem since the
object lock was held.

Discussed with:	kib
Sponsored by:	The FreeBSD Foundation
2021-01-04 08:26:14 -05:00
Mark Johnston 795a009b32 md: Set bio_completed properly in the face of errors
Account for any residual bytes.  This is only relevant for vnode-backed
md(4) devices.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27738
2020-12-27 16:49:35 -05:00
Mark Johnston a7a7c306bf md: Fix a read-after-free in BIO_GETATTR handling
g_handleattr_int() consumes the bio if the attribute matches, so when we
check bp->bio_cmd bp may have been freed.

Move GETATTR handling to a separate function to avoid the problem.  We
do not need to set bio_completed for such bios, g_handleattr_int() will
handle it.  Also remove the setting of bio_resid before the
devstat_end_transaction_bio() call.  All of the md(4) bio handlers set
bio_resid already.

Reported by:	KASAN
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27724
2020-12-23 11:16:40 -05:00
Konstantin Belousov cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Mateusz Piotrowski e2a03adb53 Fix a typo in a license comment
Approved by:	kaktus (src)
2020-11-12 15:50:18 +00:00
John Baldwin f54c6ef100 Use a template assembly file to generate the embedded MFS.
This uses the .incbin directive to pull in the MFS image contents.
Using assembly directly ensures that symbols can be defined with the
name and properties (such as .size) desired without having to rename
symbols, etc. via a second objcopy invocation.  Since it is compiled
by the C compiler driver, it also avoids the need for all of the
EMBEDFS* make variables.

Suggested by:	jrtc27
Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26781
2020-10-20 16:48:45 +00:00
Mateusz Guzik 1224a253d8 md: clean up empty lines in .c and .h files 2020-09-01 22:08:52 +00:00
Mark Johnston 3507b8d467 Remove some redundant assignments and computations.
Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D25400
2020-06-28 21:34:38 +00:00
Mark Johnston 84242cf68a Call swap_pager_freespace() from vm_object_page_remove().
All vm_object_page_remove() callers, except
linux_invalidate_mapping_pages() in the LinuxKPI, free swap space when
removing a range of pages from an object.  The LinuxKPI case appears to
be an unintentional omission that could result in leaked swap blocks, so
unconditionally free swap space in vm_object_page_remove() to protect
against similar bugs in the future.

Reviewed by:	alc, kib
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D25329
2020-06-25 15:21:21 +00:00
Jeff Roberson 7aaf252c96 Convert a few triviail consumers to the new unlocked grab API.
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23847
2020-02-28 20:34:30 +00:00
Jeff Roberson d6e13f3b4d Don't hold the object lock while calling getpages.
The vnode pager does not want the object lock held.  Moving this out allows
further object lock scope reduction in callers.  While here add some missing
paging in progress calls and an assert.  The object handle is now protected
explicitly with pip.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D23033
2020-01-19 23:47:32 +00:00
Mateusz Guzik b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mark Johnston 2c14385aa2 Fix a page leak in the md(4) swap I/O path.
r356147 removed a vm_page_activate() call, but this is required to
ensure that pages end up in the page queues in the first place.

Restore the pre-r356157 logic.  Now, without the page lock, the
vm_page_active() check is racy, but this race is harmless.

Reviewed by:	alc, kib
Reported and tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23024
2020-01-03 18:48:53 +00:00
Alexander Motin 5a93d93e8b Avoid duplicate I/O statistics accounting.
Alike to geom_disk free the provider statistics structure and point GEOM
toward local statistics.  It allows to save some CPU time.

MFC after:	2 weeks
2020-01-03 04:37:47 +00:00
Alexander Motin 024932aae9 Use atomic for start_count in devstat_start_transaction().
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places.  It would be cool if we had
more SMP-friendly statistics, but this helps too.

Sponsored by:	iXsystems, Inc.
2019-12-30 03:13:38 +00:00
Mark Johnston 9f5632e6c8 Remove page locking for queue operations.
With the previous reviews, the page lock is no longer required in order
to perform queue operations on a page.  It is also no longer needed in
the page queue scans.  This change effectively eliminates remaining uses
of the page lock and also the false sharing caused by multiple pages
sharing a page lock.

Reviewed by:	jeff
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D22885
2019-12-28 19:04:00 +00:00
Jeff Roberson a808177864 Add a deferred free mechanism for freeing swap space that does not require
an exclusive object lock.

Previously swap space was freed on a best effort basis when a page that
had valid swap was dirtied, thus invalidating the swap copy.  This may be
done inconsistently and requires the object lock which is not always
convenient.

Instead, track when swap space is present.  The first dirty is responsible
for deleting space or setting PGA_SWAP_FREE which will trigger background
scans to free the swap space.

Simplify the locking in vm_fault_dirty() now that we can reliably identify
the first dirty.

Discussed with:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22654
2019-12-15 03:15:06 +00:00
Mateusz Guzik abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Jeff Roberson 0f9e06e18b Fix a few places that free a page from an object without busy held. This is
tightening constraints on busy as a precursor to lockless page lookup and
should largely be a NOP for these cases.

Reviewed by:	alc, kib, markj
Differential Revision:	https://reviews.freebsd.org/D22611
2019-12-02 22:42:05 +00:00
Jeff Roberson 0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Mark Johnston fee2a2fa39 Change synchonization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accomodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Brooks Davis dcb235ab9e md(4): remove the unused and unusable MDIOCLIST ioctl.
It is unused, the ABI was broken in r322969, and it is broken by design
(more than MDNPAD md devices can exist and there is no way to retreive
them with this interface).

mdconfig(8) was converted to use libgeom to obtain this information
in r157160 and any other consumers of MDIOCLIST should likewise be
converted.

Reviewed by:	emaste
Relnotes:	yes
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D18936
2019-08-16 18:57:32 +00:00
Kirk McKusick 7e1a6d4777 When using the force option to shut down a memory-disk device,
I/O operations already in its queue were not being properly drained.
The GEOM framework does the queue draining, but the device driver
needs to wait for the draining to happen. The waiting is done by
adding a g_md_providergone() function to wait for the I/O operations
to finish up.

It is likely that every GEOM provider that implements orphaning
attached GEOM consumers needs to use the "providergone" mechanism
for this same reason, but some of them do not do so. Apparently
Kenneth Merry (ken@) added the drain for just such races, but he
missed adding it to some of the device drivers that needed it.

Submitted by: Chuck Silvers
Reviewed by:  imp
Tested by:    Chuck Silvers
MFC after:    1 week
Sponsored by: Netflix
2019-03-31 21:34:58 +00:00
Gleb Smirnoff 756a541279 Allocate pager bufs from UMA instead of 80-ish mutex protected linked list.
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
  pbufs are we going to have set.
  In various subsystems that are going to utilize pbufs create private zones
  via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
  and sets a limit on created zone. After startup preallocate pbufs according
  to requirements of all pbuf zones.

  Subsystems that used to have a private limit with old allocator now have
  private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
  swap, vnode pager.

  The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
  aio(4). They should have their private limits, but changing that is out of
  scope of this commit.

o Fetch tunable value of kern.nswbuf from init_param2() and while here move
  NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
  this option.
  Default values aren't touched by this commit, but they probably should be
  reviewed wrt to modern hardware.

This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.

Together with:	gallatin
Tested by:	pho
2019-01-15 01:02:16 +00:00
Bruce Evans c907940bf9 Fix devstat on md devices, second attempt. r341765 depends on
g_io_deliver() finishing initialization of the bio, but g_io_deliver()
actually destroys the bio.  INVARIANTS makes the bug obvious by
overwriting the bio with garbage.

Restore the old order for calling devstat (except don't restore not calling
it for the error case), and translate to the devstat KPI so that this order
works.

Reviewed by:	kib
2018-12-22 22:59:11 +00:00
Bruce Evans 9e5ed8593f Use VOP_ADVISE() with POSIX_FADV_DONTNEED instead of IO_DIRECT to
implement not double-caching for reads from vnode-backed md devices.
Use VOP_ADVISE() similarly instead of !IO_DIRECT unsimilarly for writes.
Add a "cache" option to mdconfig to allow changing the default of not
caching.

This depends on a recent commit to fix VOP_ADVISE().  A previous version
had optimizations for sequential i/o's (merge the i/o's and only uncache
for discontiguous i/o's and for full blocks), but optimizations and
knowledge of block boundaries belong in VOP_ADVISE().  Read-ahead should
also be handled better, by supporting it in md and discarding it in
VOP_ADVISE().

POSIX_FADV_DONTNEED is ignored by zfs, but so is IO_DIRECT.

POSIX_FADV_DONTNEED works better than IO_DIRECT if it is not ignored,
since it only discards from the buffer cache immediately, while
IO_DIRECT also discards from the page cache immediately.

IO_DIRECT was not used for writes since it was claimed to be too slow,
but most of the slowness for writes is from doing them synchronously by
default.  Non-synchronous writes still deadlock in many cases.

IO_DIRECT only has a special implementation for ffs reads with DIRECTIO
configured.  Otherwise, if it is not ignored than it uses the buffer and
page caches normally except for discarding everything after each i/o,
and then it has much the same overheads as POSIX_FADV_DONTNEED.  The
overheads for reading with ffs and DIRECTIO were similar in tests of md.

Reviewed by:	kib
2018-12-21 08:15:31 +00:00
Bruce Evans dac6a0d559 Fix devstat on md devices.
devstat_end_transaction() was called before the i/o was actually ended
(by delivering it to GEOM), so at least the i/o length was messed up.
It was always recorded as 0, so the average transaction size and the
average transfer rate was always displayed as 0.

devstat_end_transaction() was not called at all for the error case, so
there were sometimes multiple starts per end.  I didn't observe this in
practice and don't know if it did much damage.  I think it extended the
length of the i/o to the next transaction.

Reviewed by:	kib
2018-12-09 15:34:20 +00:00
Breno Leitao 7b2c7b92be md: use prestaged mfs_root
On PowerNV systems, the rootfs is passed through kexec, which loads the rootfs
into memory and set two fdt entries to describe where the file is located in
the memory;

I need to pass this memory region to the md device as a mfs_root, but, current
md driver does not support two things:

 * Just getting a pointer from an external (bootloader) memory. If I need to
workaround it, I would need to declare a static array and memcopy from this
external memory to this static variable.

 * The size of the image. The usage of mfs_root_end, which is not a pointer,
seems to be not possible for this prestaged scenario.

This patch simply adds a new way to load mfs_root from memory.

Differential Revision: https://reviews.freebsd.org/D15625
Approved by: kib, jhibbits (mentor)
2018-06-07 13:57:34 +00:00
Brooks Davis 6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Brooks Davis 026cac8213 Move 32-bit compat for md(4) ioctls into the md code.
This is more correct in that ioctl commands have no meaning until they
hit the handler associated with the file descriptor.

Add support for MDIOCRESIZE_32 which was missed when it was added.

Reviewed by:	cem, kib, markj (various versions)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14714
2018-03-27 16:07:54 +00:00
Brooks Davis 34a77b9741 Move uio enums to sys/_uio.h.
Include _uio.h instead of uio.h in several headers to reduce header
polution.

Fix a few places that relied on header polution to get the uio.h header.

I have not moved struct uio as many more things that use it rely on
header polution to get other definitions from uio.h.

Reviewed by:	cem, kib, markj
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14811
2018-03-27 15:20:03 +00:00
Brooks Davis 064c9c3d42 Add a request structure and make the implementation use it.
This allows compatibility translation to take place on the stack
(md_ioctl is too big) and is more suitable as a public interface within
the kernel than the kern_ioctl interface.

Except for the initialization of the md_req from the md_ioctl
(including detection of kernel md_file pointers) and the updating
of the md_ioctl prior to return, this is a mechanical replacment
of md_ioctl and mdio with md_req and mdr.

Reviewed by:	markj, cem, kib (assorted versions)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14704
2018-03-15 21:42:49 +00:00