This removes support for reading and writing volumes using the
following algorithms:
- Triple DES
- Blowfish
- MD5 HMAC integrity
In addition, this commit adds an explicit whitelist of supported
algorithms to give a better error message when an invalid or
unsupported algorithm is used by an existing volume.
Reviewed by: cem
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24343
- The linked list of cryptoini structures used in session
initialization is replaced with a new flat structure: struct
crypto_session_params. This structure includes a new mode to define
how the other fields should be interpreted. Available modes
include:
- COMPRESS (for compression/decompression)
- CIPHER (for simple encryption/decryption)
- DIGEST (computing and verifying digests)
- AEAD (combined auth and encryption such as AES-GCM and AES-CCM)
- ETA (combined auth and encryption using encrypt-then-authenticate)
Additional modes could be added in the future (e.g. if we wanted to
support TLS MtE for AES-CBC in the kernel we could add a new mode
for that. TLS modes might also affect how AAD is interpreted, etc.)
The flat structure also includes the key lengths and algorithms as
before. However, code doesn't have to walk the linked list and
switch on the algorithm to determine which key is the auth key vs
encryption key. The 'csp_auth_*' fields are always used for auth
keys and settings and 'csp_cipher_*' for cipher. (Compression
algorithms are stored in csp_cipher_alg.)
- Drivers no longer register a list of supported algorithms. This
doesn't quite work when you factor in modes (e.g. a driver might
support both AES-CBC and SHA2-256-HMAC separately but not combined
for ETA). Instead, a new 'crypto_probesession' method has been
added to the kobj interface for symmetric crypto drivers. This
method returns a negative value on success (similar to how
device_probe works) and the crypto framework uses this value to pick
the "best" driver. There are three constants for hardware
(e.g. ccr), accelerated software (e.g. aesni), and plain software
(cryptosoft) that give preference in that order. One effect of this
is that if you request only hardware when creating a new session,
you will no longer get a session using accelerated software.
Another effect is that the default setting to disallow software
crypto via /dev/crypto now disables accelerated software.
Once a driver is chosen, 'crypto_newsession' is invoked as before.
- Crypto operations are now solely described by the flat 'cryptop'
structure. The linked list of descriptors has been removed.
A separate enum has been added to describe the type of data buffer
in use instead of using CRYPTO_F_* flags to make it easier to add
more types in the future if needed (e.g. wired userspace buffers for
zero-copy). It will also make it easier to re-introduce separate
input and output buffers (in-kernel TLS would benefit from this).
Try to make the flags related to IV handling less insane:
- CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv'
member of the operation structure. If this flag is not set, the
IV is stored in the data buffer at the 'crp_iv_start' offset.
- CRYPTO_F_IV_GENERATE means that a random IV should be generated
and stored into the data buffer. This cannot be used with
CRYPTO_F_IV_SEPARATE.
If a consumer wants to deal with explicit vs implicit IVs, etc. it
can always generate the IV however it needs and store partial IVs in
the buffer and the full IV/nonce in crp_iv and set
CRYPTO_F_IV_SEPARATE.
The layout of the buffer is now described via fields in cryptop.
crp_aad_start and crp_aad_length define the boundaries of any AAD.
Previously with GCM and CCM you defined an auth crd with this range,
but for ETA your auth crd had to span both the AAD and plaintext
(and they had to be adjacent).
crp_payload_start and crp_payload_length define the boundaries of
the plaintext/ciphertext. Modes that only do a single operation
(COMPRESS, CIPHER, DIGEST) should only use this region and leave the
AAD region empty.
If a digest is present (or should be generated), its starting
location is marked by crp_digest_start.
Instead of using the CRD_F_ENCRYPT flag to determine the direction
of the operation, cryptop now includes an 'op' field defining the
operation to perform. For digests I've added a new VERIFY digest
mode which assumes a digest is present in the input and fails the
request with EBADMSG if it doesn't match the internally-computed
digest. GCM and CCM already assumed this, and the new AEAD mode
requires this for decryption. The new ETA mode now also requires
this for decryption, so IPsec and GELI no longer do their own
authentication verification. Simple DIGEST operations can also do
this, though there are no in-tree consumers.
To eventually support some refcounting to close races, the session
cookie is now passed to crypto_getop() and clients should no longer
set crp_session directly.
- Asymmetric crypto operation structures should be allocated via
crypto_getkreq() and freed via crypto_freekreq(). This permits the
crypto layer to track open asym requests and close races with a
driver trying to unregister while asym requests are in flight.
- crypto_copyback, crypto_copydata, crypto_apply, and
crypto_contiguous_subsegment now accept the 'crp' object as the
first parameter instead of individual members. This makes it easier
to deal with different buffer types in the future as well as
separate input and output buffers. It's also simpler for driver
writers to use.
- bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer.
This understands the various types of buffers so that drivers that
use DMA do not have to be aware of different buffer types.
- Helper routines now exist to build an auth context for HMAC IPAD
and OPAD. This reduces some duplicated work among drivers.
- Key buffers are now treated as const throughout the framework and in
device drivers. However, session key buffers provided when a session
is created are expected to remain alive for the duration of the
session.
- GCM and CCM sessions now only specify a cipher algorithm and a cipher
key. The redundant auth information is not needed or used.
- For cryptosoft, split up the code a bit such that the 'process'
callback now invokes a function pointer in the session. This
function pointer is set based on the mode (in effect) though it
simplifies a few edge cases that would otherwise be in the switch in
'process'.
It does split up GCM vs CCM which I think is more readable even if there
is some duplication.
- I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC
as an auth algorithm and updated cryptocheck to work with it.
- Combined cipher and auth sessions via /dev/crypto now always use ETA
mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored.
This was actually documented as being true in crypto(4) before, but
the code had not implemented this before I added the CIPHER_FIRST
flag.
- I have not yet updated /dev/crypto to be aware of explicit modes for
sessions. I will probably do that at some point in the future as well
as teach it about IV/nonce and tag lengths for AEAD so we can support
all of the NIST KAT tests for GCM and CCM.
- I've split up the existing crypto.9 manpage into several pages
of which many are written from scratch.
- I have converted all drivers and consumers in the tree and verified
that they compile, but I have not tested all of them. I have tested
the following drivers:
- cryptosoft
- aesni (AES only)
- blake2
- ccr
and the following consumers:
- cryptodev
- IPsec
- ktls_ocf
- GELI (lightly)
I have not tested the following:
- ccp
- aesni with sha
- hifn
- kgssapi_krb5
- ubsec
- padlock
- safe
- armv8_crypto (aarch64)
- glxsb (i386)
- sec (ppc)
- cesa (armv7)
- cryptocteon (mips64)
- nlmsec (mips64)
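The driver selection described for 'crypto_probesession' above can be
sketched in userspace C. This is only an illustration under stated
assumptions: struct mock_driver, pick_best_driver() and the PROBE_*
names and values are hypothetical stand-ins, not the kernel's
definitions. The sketch also assumes a rejecting driver returns a
positive errno. Only the ordering (hardware, then accelerated software,
then plain software) is taken from the text.

```c
#include <stddef.h>

/*
 * Illustrative probe values (assumed for this sketch). Only the
 * ordering matters: a value closer to zero is a better match.
 */
#define PROBE_HARDWARE        (-100)
#define PROBE_ACCEL_SOFTWARE  (-200)
#define PROBE_SOFTWARE        (-500)

/* Hypothetical stand-in for a registered crypto driver. */
struct mock_driver {
	const char *name;
	int probe_result;	/* negative: session supported; else rejected */
};

/*
 * Return the driver with the best (closest to zero) negative probe
 * value, or NULL if no driver accepts the session.
 */
static const struct mock_driver *
pick_best_driver(const struct mock_driver *drv, size_t n)
{
	const struct mock_driver *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (drv[i].probe_result >= 0)
			continue;	/* driver rejected the session */
		if (best == NULL || drv[i].probe_result > best->probe_result)
			best = &drv[i];
	}
	return (best);
}
```

With one plain software driver and one accelerated software driver
accepting a session, the accelerated one is picked; once a hardware
driver also accepts it, the hardware driver wins.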
Discussed with: cem
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D23677
GELI used to fail with EINVAL when a read request spanned a disk
sector whose contents did not match the sector's authentication tag.
The recently-added EINTEGRITY more closely matches the error in
this case.
Reviewed by: cem, mckusick
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24131
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Approved by: kib (mentor, blanket)
Differential Revision: https://reviews.freebsd.org/D23637
Currently the installer will tag geliboot partitions with both BOOT and
GELIBOOT; the former allows the kernel to taste it at boot, while the latter
is what the loaders key off of.
However, it seems reasonable to assume that if a provider's been tagged with
GELIBOOT that the kernel should also take that as a hint to taste/attach at
boot. This would allow us to stop tagging GELIBOOT partitions with BOOT in
bsdinstall, but I'm not sure that there's a compelling reason to do so any
time soon.
Reviewed by: oshogbo
Differential Revision: https://reviews.freebsd.org/D23387
For the moment, suppress the operation not supported messages at this level. In
the fullness of time, we will have better error tracking so we can diagnose
issues in the future.
Reviewed by: scottl@
If the bio was split into several bios going down, completion computes
bio_completed of the original bio as the sum of the bio_completed of the
splits. For BIO_SPEEDUP, bio_length means something different than the
length: it is the requested speedup amount, and is duplicated into the
splits, which is in fact reasonable, since we cannot know how the
previous activity was distributed among subordinate geoms. Obviously,
the sum of n bio_length is greater than bio_length for n > 1, which
triggers the assert that bio_length >= bio_completed for e.g. geom_stripe
and geom_raid3.
Fix this by reassigning bio_completed from bio_length for completed
BIO_SPEEDUP. I do not think it really matters what we return in
bio_completed.
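The overshoot and the fix can be illustrated with a small userspace
sketch. The struct mock_bio type, the MOCK_BIO_* codes and
mock_complete_parent() are hypothetical names invented for this
example. They only mimic the fields discussed above and are not the
GEOM implementation.

```c
#include <stdint.h>

#define MOCK_BIO_WRITE    1	/* hypothetical command codes */
#define MOCK_BIO_SPEEDUP  2

/* Minimal stand-in for the bio fields discussed above. */
struct mock_bio {
	int     cmd;
	int64_t length;		/* for SPEEDUP: requested speedup amount */
	int64_t completed;
};

/*
 * Fold the completion of n split children into the parent. Each
 * SPEEDUP child carries the full requested amount, so the plain sum
 * would exceed the parent's length; clamp it by reassigning from
 * length.
 */
static void
mock_complete_parent(struct mock_bio *parent, const struct mock_bio *child,
    int n)
{
	parent->completed = 0;
	for (int i = 0; i < n; i++)
		parent->completed += child[i].completed;
	if (parent->cmd == MOCK_BIO_SPEEDUP)
		parent->completed = parent->length;
}
```

For ordinary reads and writes the children's lengths partition the
parent's, so the sum is left untouched.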
Reported and tested by: pho
Reviewed by: imp
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D23380
%20%20%20 is ugly and doesn't really help make human-readable devfs names.
PR: 243318
Reported by: Peter Eriksson <pen AT lysator.liu.se>
Relnotes: yes
It turns out there's a problem with using g_io to send the speedup. It leads to
a race when there's a resource shortage when a disk fails.
Instead, send BIO_SPEEDUP via struct buf. This is pretty straightforward,
except we need to transfer the bio_flags from b_ioflags for BIO_SPEEDUP commands
in g_vfs_strategy.
Reviewed by: kirk, chs
Differential Revision: https://reviews.freebsd.org/D23117
While some geom layers pass unknown commands down, not all do. For the ones that
don't, pass BIO_SPEEDUP down to the providers that constitute the geom, as
applicable. No changes to vinum or virstor because I was unsure how to add this
support, and I'm also unsure how to test these. gvinum doesn't implement
BIO_FLUSH either, so it may just be poorly maintained. gvirstor is for testing
and not supporting BIO_SPEEDUP is fine.
Reviewed by: chs
Differential Revision: https://reviews.freebsd.org/D23183
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
Combined with the earlier nstart/nend removal, this allows removing several
locks from the request path of GEOM and a few other places. It would be cool if we had
more SMP-friendly statistics, but this helps too.
Sponsored by: iXsystems, Inc.
Those counters were abused for a decade to work around the broken
orphanization process in different classes by delaying the call while there
are active requests. On one hand this did not close all the races, while on
the other it was quite expensive on SMP, trashing the cache lines of
consumer and provider twice per request and requiring locks. It lost its
purpose after I manually went through all the GEOM classes in base and made
orphanization wait for either provider close or request completion.
Consumer counters are still used under INVARIANTS to detect premature
consumer close and detach. Provider counters are removed completely.
Sponsored by: iXsystems, Inc.
This code was not actively maintained since it was introduced 10 years ago.
It lacks support for many later GEOM features, such as direct dispatch,
unmapped I/O, stripesize/stripeoffset, resize, etc. Plus it is the only
remaining use of GEOM nstart/nend request counters, used there to implement
live insertion/removal, questionable by itself. Plus, as a number of people
commented, GEOM is not the best place for I/O scheduler, since it has
limited information about layers both above and below it, required for
efficient scheduling. Plus with the modern shift to SSDs there is just no
more significant need for this kind of scheduling.
Approved by: imp, phk, luigi
Relnotes: yes
If we postpone consumer destruction till close, then the close calls should
not be ignored. Delay geom withering till the last close too.
MFC after: 2 weeks
X-MFC-with: r356162
Sponsored by: iXsystems, Inc.
Previous code closed and destroyed consumer even with I/O in progress.
This patch postpones the destruction till the last close.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previous code closed and detached consumer even with I/O still in progress.
This patch adds locking and request counting to postpone the close till
the last of running requests completes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Change the "count_until_fail" option of gnop: it now enables the configured
failure rates instead of setting them to 100%.
The original patch introduced the new flag, which sets the fail/rate to 100%
after N requests. In some cases, we don't want a 100% failure
probability; we just want to start failing at some point.
For example, at an early stage, we may want to allow some read/write requests
before having some requests delayed, e.g. when we try to mount the partition
or import the pool.
Another case may be to check how scrub in ZFS will behave on different stages.
This allows us to cover more cases.
The previous behavior still may be configured.
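A rough sketch of the new semantics, under stated assumptions: struct
mock_nop and mock_nop_should_fail() are hypothetical names, and the
random roll is passed in by the caller so that the example stays
deterministic. This is not the gnop code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-provider state for this sketch. */
struct mock_nop {
	uint64_t count_until_fail;	/* 0: failure rate always active */
	uint64_t seen;			/* requests let through so far */
	unsigned fail_pct;		/* configured failure rate, 0..100 */
};

/*
 * Decide whether a request fails. 'roll' is a caller-supplied random
 * number in [0, 100). The first count_until_fail requests always pass;
 * after that the configured rate applies instead of a forced 100%.
 */
static bool
mock_nop_should_fail(struct mock_nop *sc, unsigned roll)
{
	if (sc->count_until_fail != 0 && sc->seen < sc->count_until_fail) {
		sc->seen++;
		return (false);		/* still in the grace period */
	}
	return (roll < sc->fail_pct);
}
```

Setting fail_pct to 100 reproduces the previous behavior, where every
request after the first N fails.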
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D22632
Thanks to this option we can create more than one gnop provider from a
single provider. This may be useful for temporarily labeling some data
on the disk.
Reviewed by: markj, allanjude, bcr
Differential Revision: https://reviews.freebsd.org/D22304
Previous code closed and destroyed consumer even with I/O in progress.
This patch postpones the destruction till the last close, identical to
GEOM_STRIPE, since they seem to have common origin.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Previous code closed and destroyed direct read consumer even with I/O still
in progress. This patch adds locking and request counting to postpone the
close till the last of running requests completes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
The previous code destroyed the softc even with the provider still open,
which resulted in a panic under load. This change postpones the free till the final close,
when we know for sure there will be no more I/O requests.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
gvinum was the only GEOM class using the consumer nstart/nend fields. Making
it do its own accounting for orphanization purposes will eventually allow
removing the burden of that SMP-expensive accounting from GEOM.
Also, the previous implementation spun in a tight event loop, waiting
for all active BIOs to complete, while the new one knows exactly when it
is possible to close the consumer.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Add BIO_SPEEDUP bio command and g_io_speedup wrapper. It tells the
lower layers that the upper layers are dealing with some shortage
(dirty pages and/or disk blocks). The lower layers should do what they
can to speed up anything that's been delayed.
The first use will be to tell the CAM I/O scheduler that any TRIM
shaping should be short-circuited because the system needs
blocks. We'll also call it when there are too many resources used by
UFS.
Reviewed by: kirk, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D18351
to specify an optional separator to insert before partition name;
e.g. if it's set to "c/", you'll get "ada0c/s1" instead of "ada0s1".
(It cannot be set to just "/", since ada0 is a device node, not
a directory.)
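The effect of the separator on the resulting name can be shown with a
small sketch. mock_part_name() is a hypothetical helper written for
this example, not the function used by the partitioning code.

```c
#include <stdio.h>

/*
 * Build a partition provider name from the disk name, an optional
 * separator and the partition suffix. Returns what snprintf() returns.
 */
static int
mock_part_name(char *buf, size_t len, const char *disk, const char *sep,
    const char *suffix)
{
	return (snprintf(buf, len, "%s%s%s", disk, sep, suffix));
}
```

With disk "ada0", separator "c/" and suffix "s1" this yields
"ada0c/s1"; with an empty separator it yields "ada0s1".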
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D22193
pp->private just can not be NULL in those places.
In g_disk_start() and g_disk_ioctl() both dp != NULL and !dp->d_destroyed
should always be true if disk_gone() and disk_destroy() are used properly,
since GEOM does not send requests to errored providers. If the protocol is
not followed, then no amount of additional checks here give real safety.
In g_disk_access() though the checks are useful, since GEOM blocks only
new opens for errored providers, but allows closes. It should not happen
if disk_gone() and disk_destroy() are used properly, but may otherwise.
To improve cases when disk_gone() is not used, call it from disk_destroy().
It does not give full guarantees, but it errors the provider and makes
GEOM block unwanted requests at least after some race.
MFC after: 2 weeks
For normal I/Os consumer and provider statuses are checked by g_io_check().
But ioctl calls often do not go through it, being dispatched directly. This
change makes their semantics more alike, protecting lower levels.
MFC after: 2 weeks
Add ATF tests for most gmultipath operations. Add some dtrace probes too,
primarily for configuration changes that happen in response to provider
errors.
PR: 178473
MFC after: 2 weeks
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D22235
In most cases with debug disabled this function does nothing, but argument
passing and the call still cost measurable time due to cache misses, etc.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It closes the race condition and so allows removing a few NULL checks.
Also while there, use dev->si_drv1 in addition to cp->private to store
the softc pointer. For calls coming from the dev side this gives a reliable
cache hit instead of the frequent miss before.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
to geom, and nothing we call requires it to be held. It's left over
from a time when the latter wasn't the case. Retire it.
Reviewed in concept: scottl@
via 'diskinfo -v'. This avoids the need to track it down via CAM,
and should also work for disks that don't use CAM. And since it's
inherited thru the GEOM hierarchy, in most cases one doesn't need
to walk the GEOM graph either, e.g. you can use it on a partition
instead of disk itself.
Reviewed by: allanjude, imp
Sponsored by: Klara Inc
Differential Revision: https://reviews.freebsd.org/D22249
filling in the same defaults that the current userland module uses.
This allows an old geom_nop.so userland module to work with a new kernel.
Approved by: imp (mentor)
Reviewed by: cem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21972
the request. It is the same as gctl_get_paraml() except that the request
is not marked with an error if the parameter is not present.
Approved by: imp (mentor)
Reviewed by: cem
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21972
I/O requests after the given number have been allowed through.
Approved by: imp (mentor)
Reviewed by: rpokala kib 0mp mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21593
GEOM is supposed to be topology-agnostic, but the GPT and BSD partition code
has arbitrary restrictions on nesting that are annoying in cases such as
running VMs on raw partitions (since the VM's partitioning scheme is not
visible to the host).
This patch adds sysctls to disable the restrictions except in the case of
BSD label (and similar) partitions with offset 0 (where we need to avoid
recursively recognizing the label).
Submitted by: Andrew Gierth
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D21350
The Zstd format bumps the CLOOP major number to 4 to avoid incompatibility
with older systems. Support in geom_uzip(4) is conditional on the ZSTDIO
kernel option, which is enabled in amd64 GENERIC, but not all in-tree
configurations.
mkuzip(8) was modified slightly to always initialize the nblocks + 1'th
offset in the CLOOP file format. Previously, it was only initialized in the
case where the final compressed block happened to be unaligned w.r.t.
DEV_BSIZE. The "Fake" last+1 block change in r298619 means that the final
compressed block's 'blen' was never correct unless the compressed uzip image
happened to be BSIZE-aligned. This happened in about 1 out of every 512
cases. The zlib and lzma decompressors are probably tolerant of extra trash
following the frame they were told to decode, but Zstd complains that the
input size is incorrect.
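Why the nblocks + 1'th offset matters can be seen from how a block
length is derived from the offset table. The sketch below assumes the
offsets are already in host byte order and held in memory.
mock_cloop_blen() is a hypothetical helper, not code from mkuzip(8) or
geom_uzip(4).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * toc[] holds nblocks + 1 offsets. toc[i] is where compressed block i
 * starts and toc[nblocks] marks the end of the last block. The length
 * of block i is the distance to the next offset, so the last block's
 * length is only right if toc[nblocks] was initialized.
 */
static uint64_t
mock_cloop_blen(const uint64_t *toc, size_t nblocks, size_t i)
{
	if (i >= nblocks)
		return (0);	/* no such block */
	return (toc[i + 1] - toc[i]);
}
```

A decoder that is strict about its input size, as Zstd is described to
be above, needs the exact length for every block, including the last.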
Correspondingly, geom_uzip(4) was modified slightly to avoid trashing the
nblocks + 1'th offset when it is known to be initialized to a good value.
This corrects the calculated final real cluster compressed length to match
that printed by mkuzip(8).
mkuzip(8) was refactored somewhat to reduce code duplication and increase
ease of adding other compression formats.
* Input block size validation was pulled out of individual compression
init routines into main().
* Init routines now validate a user-provided compression level or select
an algorithm-specific default, if none was provided.
* A new interface for calculating the maximal compressed size of an
incompressible input block was added for each driver. The generic code
uses it to validate against MAXPHYS as well as to allocate compression
result buffers in the generic code.
* Algorithm selection is now driven by a table lookup, to increase ease of
adding other formats in the future.
mkuzip(8) gained the ability to explicitly specify a compression level with
'-C'. The prior defaults -- 9 for zlib and 6 for lzma -- are maintained.
The new zstd default is 9, to match zlib.
Rather than select lzma or zlib with '-L' or its absence, respectively, a
new argument '-A <algorithm>' is provided to select 'zlib', 'lzma', or
'zstd'. '-L' is considered deprecated, but will probably never be removed.
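A table-driven selection with per-algorithm default levels could look
roughly like the sketch below. The struct and function names are
hypothetical and the real mkuzip(8) table carries more per-algorithm
state. Only the algorithm names and the default levels (9, 6, 9) are
taken from the text above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: algorithm name and its default level. */
struct mock_alg {
	const char *name;
	int default_level;
};

static const struct mock_alg mock_algs[] = {
	{ "zlib", 9 },
	{ "lzma", 6 },
	{ "zstd", 9 },
};

/* Look up an algorithm by name; NULL if it is not in the table. */
static const struct mock_alg *
mock_alg_lookup(const char *name)
{
	for (size_t i = 0; i < sizeof(mock_algs) / sizeof(mock_algs[0]); i++)
		if (strcmp(mock_algs[i].name, name) == 0)
			return (&mock_algs[i]);
	return (NULL);
}

/* Use the user's level if one was given, else the algorithm default. */
static int
mock_alg_level(const struct mock_alg *alg, const int *user_level)
{
	return (user_level != NULL ? *user_level : alg->default_level);
}
```

Adding another format then only needs one more table row rather than
another branch in the option parsing.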
All of the new features were documented in mkuzip.8; the page was also
cleaned up slightly.
Relnotes: yes
Follow-up on r322318 and r322319 and remove the deprecated modules.
Shift some now-unused kernel files into userspace utilities that incorporate
them. Remove references to removed GEOM classes in userspace utilities.
Reviewed by: imp (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21249
- Use new zlib headers;
- Removed z_alloc and z_free to use the common sys/dev/zlib version.
- Replace z_compressBound with compressBound from zlib.
While there, limit LZMA CFLAGS to apply only for g_uzip_lzma.c.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp> (with changes,
bugs are mine)
Differential Revision: https://reviews.freebsd.org/D20271
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
This allows simulating a disk that responds slowly to I/O requests.
Reviewed by: markj, bcr, pjd (previous version)
Differential Revision: https://reviews.freebsd.org/D21052
If g_mirror_taste encountered an error at g_mirror_add_disk, it might
try to g_mirror_destroy the device with the G_MIRROR_DEVICE_FLAG_TASTING
flag still set. This would wait on a worker to complete the destruction
with g_mirror_try_destroy, but that function bails out if the tasting
flag is set, resulting in a deadlock. Clear the tasting flag before
trying to destroy the device.
Test Plan:
sysctl debug.fail_point.mnowait="1%return"
kyua test -k /usr/tests/sys/geom/class/mirror/Kyuafile
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20744
NANDFS has been broken for years. Remove it. The NAND drivers that
remain are for ancient parts that are no longer relevant. They are
polled, have terrible performance and are just for ancient arm
hardware. NAND parts have evolved significantly from this early work
and little to none of it would be relevant should someone need to
update to support raw nand. This code has been off by default for
years and has violated the vnode protocol leading to panics since it
was committed.
Numerous posts to arch@ and other locations have found no actual users
for this software.
Relnotes: Yes
No Objection From: arch@
Differential Revision: https://reviews.freebsd.org/D20745
When it comes to megabytes of text, the difference between sbuf_printf() and
sbuf_cat() becomes substantial.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
On large systems those sysctls may generate megabytes of output. Before
this change the sbuf(9) code was growing the buffer by 4KB at a time, many
times over, generating tons of TLB shootdowns. Unfortunately the existing
sbuf_new_for_sysctl() mechanism, which is supposed to help with this issue,
is not applicable in this case, since all the sbuf writes are done in a
different kernel thread.
This change improves the situation in two ways:
- on the first sysctl call, which does not provide any output buffer, it
sets a special sbuf drain function that just counts the data and so does
not need a big buffer;
- on the second sysctl call it uses the size saved on the previous call as
the initial buffer size, so that in most cases there will be no
reallocation, unless the GEOM topology changed significantly.
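The two-call pattern can be sketched in userspace C as follows. This is
only an analogy: struct mock_sink and the mock_* functions are
hypothetical and do not use the sbuf(9) API. The "counting" mode stands
in for the drain function and saved_size for the value remembered
between sysctl calls.

```c
#include <stdlib.h>
#include <string.h>

/* Output sink: counts bytes always, stores them only if buf != NULL. */
struct mock_sink {
	char  *buf;	/* NULL in counting mode */
	size_t cap;
	size_t len;	/* bytes produced so far */
};

static void
mock_sink_cat(struct mock_sink *s, const char *str)
{
	size_t n = strlen(str);

	if (s->buf != NULL && s->len + n <= s->cap)
		memcpy(s->buf + s->len, str, n);
	s->len += n;
}

/* Hypothetical generator standing in for the configuration dump. */
static void
mock_generate(struct mock_sink *s, int nprov)
{
	for (int i = 0; i < nprov; i++)
		mock_sink_cat(s, "<provider/>");
}

static size_t saved_size;	/* remembered between the two calls */

/* First call: no output buffer, just count and remember the size. */
static size_t
mock_probe_size(int nprov)
{
	struct mock_sink s = { NULL, 0, 0 };

	mock_generate(&s, nprov);
	saved_size = s.len;
	return (s.len);
}

/*
 * Second call: allocate the saved size up front. Returns NULL if the
 * output outgrew it (real code would grow the buffer instead).
 * The caller frees the result.
 */
static char *
mock_fetch(int nprov, size_t *lenp)
{
	struct mock_sink s = { malloc(saved_size + 1), saved_size, 0 };

	if (s.buf == NULL)
		return (NULL);
	mock_generate(&s, nprov);
	if (s.len > s.cap) {
		free(s.buf);
		return (NULL);
	}
	s.buf[s.len] = '\0';
	*lenp = s.len;
	return (s.buf);
}
```

As long as the generated output does not grow between the two calls,
the second pass completes without any reallocation.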
MFC after: 1 week
Sponsored by: iXsystems, Inc.
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
operations already in its queue were not being properly drained.
The GEOM framework does the queue draining, but the module needs
to wait for the draining to happen. The waiting is done by adding
a g_nop_providergone() function to wait for the I/O operations to
finish up. This change is similar to change -r345758 made to the
memory-disk driver.
Submitted by: Chuck Silvers
Tested by: Chuck Silvers
MFC after: 1 week
Sponsored by: Netflix
- Triple DES has been formally deprecated in Kerberos (RFC 8429)
and is soon to be deprecated in IPsec (RFC 8221).
- Blowfish is deprecated. FreeBSD doesn't support its successor
(Twofish).
- MD5 is generally considered a weak digest that has known attacks.
geli refuses to create new volumes using these algorithms via 'geli
init'. It also warns when attaching to existing volumes or creating
temporary volumes via 'geli onetime' . The plan is to fully remove
support for these algorithms in FreeBSD 13.
Note that none of these algorithms have ever been the default
algorithm used by geli(8). Users would have had to explicitly select
these algorithms when creating volumes in the past.
Reviewed by: cem, delphij
MFC after: 3 days
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20344
Allow users to specify multiple dump configurations in a prioritized list.
This enables fallback to secondary device(s) if primary dump fails. E.g.,
one might configure a preference for netdump, but fallback to disk dump as a
second choice if netdump is unavailable.
This change does not list-ify netdump configuration, which is tracked
separately from ordinary disk dumps internally; only one netdump
configuration can be made at a time, for now. It also does not implement
IPv6 netdump.
savecore(8) is already capable of scanning and iterating multiple devices
from /etc/fstab or passed on the command line.
This change doesn't update the rc or loader variables 'dumpdev' in any way;
it can still be set to configure a single dump device, and rc.d/savecore
still uses it as a single device. Only dumpon(8) is updated to be able to
configure the more complicated configurations for now.
As part of revving the ABI, unify netdump and disk dump configuration ioctl
/ structure, and leave room for ipv6 netdump as a future possibility.
Backwards-compatibility ioctls are added to smooth ABI transition,
especially for developers who may not keep kernel and userspace perfectly
synced.
Reviewed by: markj, scottl (earlier version)
Relnotes: maybe
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D19996
There's a race between the initialization of devsoftc.mtx (by devinit)
and the creation of the geom worker thread g_run_events, which calls
devctl_queue_data_f. Both of those are initialized at SI_SUB_DRIVERS
and SI_ORDER_FIRST, which means the geom worker thread can be created
before the mutex has been initialized, leading to the panic below:
wpanic: mtx_lock() of spin mutex (null) @ /usr/home/osstest/build.135317.build-amd64-freebsd/freebsd/sys/kern/subr_bus.c:620
cpuid = 3
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe003b968710
vpanic() at vpanic+0x19d/frame 0xfffffe003b968760
panic() at panic+0x43/frame 0xfffffe003b9687c0
__mtx_lock_flags() at __mtx_lock_flags+0x145/frame 0xfffffe003b968810
devctl_queue_data_f() at devctl_queue_data_f+0x6a/frame 0xfffffe003b968840
g_dev_taste() at g_dev_taste+0x463/frame 0xfffffe003b968a00
g_load_class() at g_load_class+0x1bc/frame 0xfffffe003b968a30
g_run_events() at g_run_events+0x197/frame 0xfffffe003b968a70
fork_exit() at fork_exit+0x84/frame 0xfffffe003b968ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe003b968ab0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 13 tid 100029 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why
Fix this by initializing geom at SI_ORDER_SECOND instead of
SI_ORDER_FIRST.
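The ordering argument can be illustrated with a userspace sketch that
sorts init entries by (subsystem, order). The MOCK_* constants and the
struct are hypothetical stand-ins, with values chosen only to preserve
the relative order. This is not the kernel's SYSINIT machinery.

```c
#include <stdlib.h>

/* Illustrative values (assumed); only their relative order matters. */
#define MOCK_SI_SUB_DRIVERS   0x3100000u
#define MOCK_SI_ORDER_FIRST   0x0u
#define MOCK_SI_ORDER_SECOND  0x1u

struct mock_sysinit {
	const char *name;
	unsigned subsystem;
	unsigned order;
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
mock_sysinit_cmp(const void *a, const void *b)
{
	const struct mock_sysinit *x = a, *y = b;

	if (x->subsystem != y->subsystem)
		return (x->subsystem < y->subsystem ? -1 : 1);
	if (x->order != y->order)
		return (x->order < y->order ? -1 : 1);
	return (0);	/* tie: relative order is not guaranteed */
}

static void
mock_sysinit_sort(struct mock_sysinit *v, size_t n)
{
	qsort(v, n, sizeof(*v), mock_sysinit_cmp);
}
```

When both entries share the same subsystem and order, the comparison
returns 0 and nothing forces devinit to run first. Moving geom to the
second slot removes the tie.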
Sponsored by: Citrix Systems R&D
Reviewed by: kevans, markj
Differential revision: https://reviews.freebsd.org/D20148
destroy_dev_sched_cb() is excessively asynchronous, and during a media change
retaste the new provider may appear sooner than the device of the previous
one gets destroyed.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
provider grows, GELI will expand automatically and will move the metadata
to the new location of the last sector.
This functionality is turned on by default. It can be turned off with the
-R flag, but it is not recommended - if the underlying provider grows and
automatic expansion is turned off, it won't be possible to attach this
provider again, as the metadata is no longer located in the last sector.
If the automatic expansion is turned off and the underlying provider grows,
GELI will only log a message with the previous size of the provider, so
recovery can be easier.
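The reason a grown provider can no longer be attached without
expansion follows from where the metadata lives. A minimal sketch,
assuming sizes in bytes and hypothetical helper names that are not
part of GELI:

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata occupies the last sector of the provider. */
static int64_t
mock_md_offset(int64_t mediasize, int64_t sectorsize)
{
	return (mediasize - sectorsize);
}

/*
 * True when metadata written for old_size is no longer in the last
 * sector after a resize to new_size and therefore has to be moved.
 */
static bool
mock_md_must_move(int64_t old_size, int64_t new_size, int64_t sectorsize)
{
	return (mock_md_offset(old_size, sectorsize) !=
	    mock_md_offset(new_size, sectorsize));
}
```

This is also why logging the previous provider size helps recovery: it
is enough to recompute the old metadata offset.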
Obtained from: Fudo Security
provider's mediasize changes.
While here, use GEOM nomenclature to describe providers instead of calling
them device nodes.
Obtained from: Fudo Security
Tested in: AWS
While geom_flashmap has always supported label names for its slices, it does
so by appending "s.labelname" to the provider device name, meaning you still
have to know the name and unit of the hardware device to use the labels.
These changes add support for device-independent geom_flashmap labels, using
the standard geom_label infrastructure. geom_flashmap now creates a softc
struct attached to its geom, and as it creates slices it stores the label
into an array in the softc. The new geom_label_flashmap uses those labels
when tasting a geom_flashmap provider.
Differential Revision: https://reviews.freebsd.org/D19535
In revision 254095, gpt_entries is not set to match the on-disk
hdr_entries, but rather is computed based on available space.
There are 2 problems with this:
1. The GPT backend respects hdr_entries and only reads and writes
that number of partition entries. On top of that, CRC32 is
computed over the table that has hdr_entries elements. When
the common code works on what is possibly a larger number, the
behaviour becomes inconsistent and problematic. In particular,
it would be possible to add a new partition that on a reboot
isn't there anymore.
2. The calculation of gpt_entries is based on flawed assumptions.
The GPT specification does not dictate that sectors are laid
out in a particular way that the available space can be
determined by looking at LBAs. In practice, implementations
do the same thing, because there's no reason to do it any
other way. Still, GPT allows certain freedoms that can be
exploited in some form or shape if the need arises.
PR: 229977
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19438
Embedded lzma decompression library becomes a module usable by other
consumers, in addition to geom_uzip.
The most important code changes are:
- removal of XZ_DEC_SINGLE define, we need the code to work
with XZ_DEC_DYNALLOC;
- xz_crc32_init() call is removed from geom_uzip, xz module handles
initialization on its own.
xz is no longer embedded into geom_uzip, instead the depend line for
the module is provided, and corresponding kernel option is added to
each MIPS kernel config file using geom_uzip.
The commit also carries unrelated cleanup by removing excess "device geom_uzip"
in places which were missed in r344479.
Reviewed by: cem, hselasky, ray, slavash (previous versions)
Sponsored by: Mellanox Technologies
Differential revision: https://reviews.freebsd.org/D19266
MFC after: 3 weeks
The DIOCGETZONE ioctl can be used to fetch the zone list of an SMR
drive, and the caller specifies the number of entries it wants to fetch.
Clamp the caller's request to a sane limit so that a user cannot attempt
large allocations. Callers already need to invoke the ioctl multiple
times to fetch the full list in general, so there's no harm in limiting
the number of entries returned.
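The clamping described above can be sketched as follows. This is only an
illustration: ZONE_ENTRIES_MAX, clamp_zone_entries() and zone_alloc_size()
are hypothetical names, and the cap of 1024 is an assumed value, not the
limit chosen by the actual change.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cap; the real limit used by the kernel is not stated here. */
#define ZONE_ENTRIES_MAX 1024u

/* Clamp a caller-supplied entry count to the cap. */
static uint32_t
clamp_zone_entries(uint32_t requested)
{
	return (requested > ZONE_ENTRIES_MAX ? ZONE_ENTRIES_MAX : requested);
}

/*
 * Size of the buffer needed for the clamped request.  Because the count
 * is bounded, the multiplication stays small for any sane entry size.
 */
static size_t
zone_alloc_size(uint32_t requested, size_t entry_size)
{
	return ((size_t)clamp_zone_entries(requested) * entry_size);
}
```

A caller that wants more entries than the cap simply receives fewer and
issues the ioctl again, which it already has to do in general.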
Fix style while here.
admbug: 807
Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by: asomers, ken
Tested by: ken
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19249
Otherwise a privileged user can trigger a memory allocation of
unbounded size, or an integer overflow in the subsequent
geom_alloc_copyin() call, leading to out-of-bounds accesses.
Hard-code a large limit to circumvent this problem.
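A minimal sketch of this kind of check, assuming a hypothetical ARGS_MAX
limit and a hypothetical helper args_size_ok(); neither name comes from the
actual change. It rejects counts above the limit and also guards the
multiplication that computes the copy-in size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hard-coded limit on the number of arguments. */
#define ARGS_MAX 2048u

/*
 * Validate a user-supplied argument count and compute the total size.
 * Returns false if the count exceeds the limit or if nargs * arg_size
 * would overflow size_t; *total is only written on success.
 */
static bool
args_size_ok(uint32_t nargs, size_t arg_size, size_t *total)
{
	if (nargs > ARGS_MAX)
		return (false);
	if (arg_size != 0 && nargs > SIZE_MAX / arg_size)
		return (false);
	*total = (size_t)nargs * arg_size;
	return (true);
}
```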
admbug: 854
Reported by: Anonymous of the Shellphish Grill Team
Reviewed by: ae
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19251
gmirror's sc_flags is shared between some on-disk state and some runtime
only state. There's no real reason for that and they could probably be
split up. Until they are, locate all of the flags for the same field
near each other in the source, for clarity.
No functional change.
Sponsored by: Dell EMC Isilon
g_handleattr() fills out bp->bio_completed; otherwise, g_getattr()
returns an error in response to the query. This caused BIO_DELETE
support to not be propagated through stacked configurations, e.g.,
a gconcat of gmirror volumes would not handle BIO_DELETE even when
the gmirrors do. g_io_getattr() was not affected by the problem.
PR: 232676
Reported and tested by: noah.bergbauer@tum.de
MFC after: 1 week
Mutexes in the I/O path were taken twice per I/O to atomically access
several variables in order to close and/or destroy the device on last request
completion. I found a way to fit all the required info into one integer,
suitable for atomic operations. This opens a race window on device close,
but the addition of a timeout to the msleep() there should cover it.
Profiling shows removal of significant spinning time on those mutexes
and IOPS increase from ~600K to >800K to NVMe on 72-core systems.
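The general technique of packing a request count and a state flag into one
atomically updated integer can be sketched with C11 atomics. The bit layout
and the names ST_CLOSING, ST_COUNT, io_start(), io_done() and
request_close() are assumptions made for illustration; they are not the
layout used by the actual change.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ST_CLOSING 0x80000000u	/* hypothetical "close requested" bit */
#define ST_COUNT   0x7fffffffu	/* low bits: in-flight request count */

/* Account for a new in-flight request. */
static void
io_start(_Atomic uint32_t *state)
{
	atomic_fetch_add(state, 1);
}

/*
 * Account for a completed request.  Returns true when this was the last
 * in-flight request and a close had been requested, i.e. the caller is
 * the one that should wake up the closing thread.
 */
static bool
io_done(_Atomic uint32_t *state)
{
	uint32_t old = atomic_fetch_sub(state, 1);

	return ((old & ST_COUNT) == 1 && (old & ST_CLOSING) != 0);
}

/* Mark the device as closing without touching the request count. */
static void
request_close(_Atomic uint32_t *state)
{
	atomic_fetch_or(state, ST_CLOSING);
}
```

Both the count and the flag are read in a single atomic operation, so no
mutex is needed on the completion path.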
MFC after: 1 month
Sponsored by: iXsystems, Inc.
I mistakenly added a lock assertion to this routine at the last minute
without confirming it was held during g_mirror_create. It isn't (it isn't
even initialized yet). Mea culpa. Access is exclusive in both callers,
just not always by that particular lock.
Reported by: lwhsu
X-MFC-With: r341840, r341674
r341674 inadvertently introduced a bug where newer mirror components being
tasted would clear the high sc_flags that are not controlled by component
metadata, such as G_MIRROR_DEVICE_FLAG_TASTING. This could plausibly expose
a small window of time during STARTING where device destruction might race
with mirror component addition, probably resulting in a crash.
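The class of bug can be illustrated with a small sketch, assuming a
hypothetical split in which the low 16 bits of the flags word come from
on-disk metadata and the high bits are runtime-only. The mask value, the
FLAG_TASTING bit and merge_flags() are illustrative names, not the
definitions used by gmirror.

```c
#include <stdint.h>

#define FLAGS_ONDISK_MASK 0x0000ffffu	/* hypothetical: metadata-owned bits */
#define FLAG_TASTING      0x00020000u	/* hypothetical runtime-only bit */

/*
 * Refresh the metadata-owned bits from a newer component while leaving
 * the runtime-only high bits untouched.  Assigning md_flags directly
 * would clear bits such as FLAG_TASTING.
 */
static uint32_t
merge_flags(uint32_t cur, uint32_t md_flags)
{
	return ((cur & ~FLAGS_ONDISK_MASK) | (md_flags & FLAGS_ONDISK_MASK));
}
```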
Reviewed by: markj
X-MFC-With: r341674
Differential Revision: https://reviews.freebsd.org/D18521
Re-apply r341665 with format strings fixed.
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata as
we taste components.
Like mediasize beforehand, remove some checks from g_mirror_check_metadata
which would evict valid components due to metadata that can change over a
mirror's lifetime. g_mirror_check_metadata is invoked long before we check
genid/syncid and decide which component(s) are newest and whether or not we
have quorum.
Before checking if we can enter RUNNING (i.e., we have quorum) after a NEW
component is added, first remove any known stale or inconsistent disks from
the mirrorset, rather than removing them *after* deciding we have quorum.
Check if we have quorum after removing these components.
Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all eligible, boot-time available mirror components are tasted before
RUNNING a gmirror.
When we are instructed to forget mirror components, bump the generation id
to avoid confusion with such stale components later.
Add a basic test case for STARTING -> RUNNING startup behavior around stale
genids.
PR: 232671, 232835
Submitted by: Cindy Yang <cyang AT isilon.com> (previous version)
Reviewed by: markj (kernel portions)
Discussed with: asomers, Cindy Yang
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D18062
superblock has a check-hash error, an error message noting the
superblock check-hash failure is printed and the mount fails. The
administrator then runs fsck to repair the filesystem and when
successful, the filesystem can once again be mounted.
This approach fails if the filesystem in question is a root filesystem
from which you are trying to boot. Here, the loader fails when trying
to access the filesystem to get the kernel to boot. So it is necessary
to allow the loader to ignore the superblock check-hash error and make
a best effort to read the kernel. The filesystem may be sufficiently
corrupted that the read attempt fails, but there is no harm in trying
since the loader makes no attempt to write to the filesystem.
Once the kernel is loaded and starts to run, it attempts to mount its
root filesystem. Once again, failure means that it breaks to its prompt
to ask where to get its root filesystem. Unless you have an alternate
root filesystem, you are stuck.
Since the root filesystem is initially mounted read-only, it is
safe to make an attempt to mount the root filesystem with the failed
superblock check-hash. Thus, when asked to mount a root filesystem
with a failed superblock check-hash, the kernel prints a warning
message that the root filesystem superblock check-hash needs repair,
but notes that it is ignoring the error and proceeding. It does
mark the filesystem as needing an fsck which prevents it from being
enabled for writing until fsck has been run on it. The net effect
is that the reboot fails to single user, but at least at that point
the administrator has the tools at hand to fix the problem.
Reported by: Rick Macklem (rmacklem@)
Discussed with: Warner Losh (imp@)
Sponsored by: Netflix
handling slightly out-of-bound requests properly (r340187).
Perform the range check here rather than rely on g_delete_data() to DTRT.
g_delete_data() would always return success for requests
starting at the byte immediately after the provider's media boundary.
MFC after: 4 weeks
from setting the volume serial number. This unbreaks older boot blocks
that don't support serial numbers, and allows boot0cfg to set the serial
number itself if requested by the user.
Submitted by: lev@, yuripv@
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17386
i/o into last_sector+N is handled differently for N==1 and N>1 cases to
accommodate that, so some other approach would be needed to fix DIOCGDELETE
ioctl(2).
fully beyond the end of the provider's media. The only exception is made
for the zero length transfers which are allowed to be just on the
boundary. Previously, any requests starting on the boundary (i.e. next
byte after the last one) have been allowed to go through.
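The boundary rule can be sketched as below. This reflects one reading of
the description above (a request starting past the end is rejected, and a
request starting exactly on the boundary is accepted only when its length
is zero); request_start_ok() is a hypothetical helper, not the kernel
function that performs the check.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a request may start at the given offset.  Offsets past
 * the end of the media are rejected; an offset exactly on the boundary
 * is accepted only for zero-length transfers.
 */
static bool
request_start_ok(int64_t offset, int64_t length, int64_t mediasize)
{
	if (offset < 0 || length < 0)
		return (false);
	if (offset > mediasize)
		return (false);
	if (offset == mediasize && length > 0)
		return (false);
	return (true);
}
```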
No response from: freebsd-geom@, phk
MFC after: 1 month
GEOM's stripeoffset overflows at the 4 gigabyte boundary (2^32)
because of its u_int type. This leads to incorrect data in the output
generated by the "sysctl kern.geom.confxml" command, "graid list" etc.,
for example when a GEOM array has volumes larger than 4G.
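The truncation is easy to reproduce in isolation. In this sketch
offset_as_u32() is a hypothetical helper that stores a 64-bit byte offset
in a 32-bit unsigned field, the way a u_int member would on common
platforms.

```c
#include <stdint.h>

/*
 * Store a 64-bit offset in a 32-bit unsigned field.  Anything at or
 * above 2^32 wraps around, so only the low 32 bits survive.
 */
static uint32_t
offset_as_u32(uint64_t off)
{
	return ((uint32_t)off);
}
```

An offset of 5 GiB (2^32 + 2^30) therefore reads back as 1 GiB, and an
offset of exactly 4 GiB reads back as 0.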
This change does not affect ABI but changes KBI. No MFC planned.
Differential Revision: https://reviews.freebsd.org/D13426
In r332361 and r333439, two new parameters were added to the geli attach
verb using gctl_get_paraml, which requires the value to be present.
This prevents an old geli(8) binary from attaching a geli(4) device,
as it has no knowledge of the new parameters.
Restore backward compatibility by treating the absence of these two
values as seeing the default value supplied by userland.
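The compatibility pattern (look the parameter up, and fall back to a
default when an older caller did not send it) can be sketched in plain C.
struct param, find_param() and get_param_or_default() are hypothetical
stand-ins for illustration only; they are not the gctl interface, and the
parameter names in the usage note are made up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name/value pair as passed in by a userland request. */
struct param {
	const char *name;
	int value;
};

/* Return the named parameter, or NULL when the caller did not send it. */
static const struct param *
find_param(const struct param *params, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(params[i].name, name) == 0)
			return (&params[i]);
	}
	return (NULL);
}

/* Treat a missing parameter as the default instead of as an error. */
static int
get_param_or_default(const struct param *params, size_t n,
    const char *name, int dflt)
{
	const struct param *p = find_param(params, n, name);

	return (p != NULL ? p->value : dflt);
}
```

With a request that only carries an "oldflag" entry, asking for "newflag"
with a default of 0 yields 0 rather than a failure, which is the behaviour
an old binary needs.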
PR: 232595
Reviewed by: oshogbo
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D17680