Changes the device name for NVMe and NVMe-oF namespaces from using "ns"
to "n" to be more compatible with other operating systems. For example,
a device which was previously /dev/nvme0ns1 is now /dev/nvme0n1.
Preserves the existing functionality by creating alias from nvmeXnY to
nvmeXnsY.
Reviewed by: imp
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D45414
When possible, we split up I/Os to NVMe drives that advertise a
preferred alignment. Add a counter for this.
Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D45311
It's easy to overlook the chain of events that lead to tr->deadline
being updated. Add a comment here to explain what otherwise looks like
an oversight w/o careful study.
Sponsored by: Netflix
We don't need to dereference qpair to get the ctrlr pointer each time,
so use the cached value. It's not going to change. No change intended.
Sponsored by: Netflix
nvme_qpair_complete_tracker and nvme_qpair_manual_complete_tracker have
to be called without the qpair lock, so assert its unowned.
Sponsored by: Netflix
Add definition for page types 7 and 8 for host initiated telemetry and
controller initiated telemetry (they differ by one byte, but that byte
that's defined in the host version is reserved in the controller
version).
Sponsored by: Netflix
This was already true for most architectures due to uint64_t structure
members. However, i386 is special in that it only requires 4 byte
alignment for uint64_t. As a result, casts from struct nvme_command
to struct nvmf_fabric_cmd were raising a "cast increases alignment"
warning on i386. Explicitly aligning struct nvme_command pacifies
this warning on i386.
Reported by: rscheff
Sponsored by: Chelsio Communications
This ensures that embedded uint64_t values used for statistics
counters are aligned when allocating a structure on the stack or as
part of a containing structure. In particular this quiets
-Waddress-of-packed-member warnings from GCC when compiling the code
in nvmfd to update the stats.
Reported by: GCC
This defines structures, ioctl commands, and related constants used
for both the Fabrics host and controller.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44706
We can't post a AER for this page, so there's no need to be able to swap
it to host byte order. It's not one of the standard defined pages that
can post via AER, and the vendor's public docs for this temperature page
don't suggest it's possible to get over or under event changes. Since
nvmecontrol no longer needsd the swap routine, remove it since it's
now unused.
Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D44659
Add sys/errno.h, sys/malloc.h, sys/queue.h, and vm/uma.h as needed.
sys/sysproto.h currently includes sys/acl.h which currently includes
sys/param.h, sys/queue.h, and vm/uma.h which in turn bring in
sys/errno.h sys/malloc.h.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44465
Add all the bits from the NVMe 2.0 base specification: CMD_EFFECTS to
indicate the commands and effects log page is supported, TELEMETRY to
indicate that the telemetry log pages and protocols are supported,
PERSISTENT_EVENTS to indicate the persistent event log is supported,
LOG_PAGES_PAGE to indicate that various log pages related to log page
and command support are supported: L0, L5, L12, and L13. and
DA4_TELEMETRY to indicate that the DA4 area is supported for telemetry
data.
Sponsored by: Netflix
This is used in NVMe over Fabrics to enumerate a list of available
controllers.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44446
nvme(4) doesn't check this flag, but Fabrics implementations may need
to set this flag in the log page attributes cdata field.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44444
This is not used in nvme(4) but is used in NVMe over Fabrics
transports which use SGLs to describe buffers instead of PRPs.
While here, adjust the shift value for the FUSE field to be relative
to the 'fuse' member of 'struct nvme_command'.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44443
Fabrics capsules use an SGL structure instead of prp1/2 addresses to
describe the data buffer used for a command. The SGL structure is
added to a union with the existing prp1/2 fields.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44442
These are useful for NVMe over Fabrics.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44441
We only see a request with a failed controller while we're in the
process of failing the controller. Add a comment to that effect.
Sponsored by: Netflix
We're logging when we start a reset, but not when we complete it, nor
the result. Create now log a success or timed_out event for the reset.
Currently, the only detectable error we have from reset is 'failure to
become ready in time,' though the code looks like it might be more
generic. Log this and if we ever have other failure modes, change the
logging to devd when that happens.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D44211
Change the devctl events slightly for the controller. SMART errors will
log the changed bits in the NVME SMART Critical Warning State as its
event.
Reset will now emit 'event=start'. Soon more.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D44210
Split the devctl aspect of things out to its own function in
nvme_ctrlr_devctl_log. In preparing to document this, and based on
actual use, we want something different for the SMART errors, so this
will facilitate that.
Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D44209
In particular, don't try to byteswap the values as 64-bit integers and
always print a non-empty version as a string.
Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44121
This macro accepts a field name and a value for the field and
constructs the shifted field value.
Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43604
A few of these omitted a shift of 0, but this is more consistent.
Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43602
The current macro always builds a full mask for a named field, so use
the M suffix for mask.
Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43601
struct nvme_hmb_desc contains a pad field which was not getting
initialized before being synced. This doesn't have much consequence but
triggers a report from KMSAN, which verifies that host-filled DMA memory
is initialized before it is made visible to the device. So, let's just
initialize it properly.
Reported by: KMSAN
Reviewed by: mav, imp
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D43090
Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.
Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/
Sponsored by: Netflix
Instead, use the attribtue bits from the identification data to
determine if we should listen to namespace changes and firmware
activation. Should have no functional change, though we may stop
listening for events that will never happen.
Sponsored by: Netflix
sys/cam/cam.h includes opt_cam.h, so none of the clients need to do
this. cam.h does all the right dancing to conditionally include
opt_cam.h only when it makes sense. It generally only matters when
cam_debug.h is included (it must be included before that). Many of the
stray opt_cam.h includes were after cam_debug.h which would be a problem
were it not included in cam/cam.h. The other users of CAM options that
aren't debug all already include cam/cam.h.
Also trim unneeded sys/cdefs.h files from the files touched.
Sponsored by: Netflix
KIOXIA CD8 SSDs routinely take ~25 seconds to delete non-empty
namespace. In some cases like hot-plug it takes longer, triggering
timeout and controller resets after just 30 seconds. Linux for many
years has separate 60 seconds timeout for admin queue. This patch
does the same. And it is good to be consistent.
Sponsored by: iXsystems, Inc.
Reviewed by: imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D42454
While we should have cleared all the pending I/O prior to calling
nvme_qpair_destroy, which should ensure that if the callout_drain causes
a call to nvme_qpair_timeout(), it won't schedule any new
timeout. However, it doesn't hurt to set timeout_pending to false in
nvme_qpair_destroy() and have nvme_qpair_timeout() exit early if it sees
it w/o scheduling a timeout. Since we don't otherwise stop the timeout
until we're about to destroy the qpair, this ensures we fail safe. The
lock/unlock also ensures the callout_drain will either remove the callout,
or wait for it to run with the early bailout.
We can likely further improve this by using callout_stop() inside the
pending lock. I'll investigate that for future refinement.
Sponsored by: Netflix
Suggestions by: jhb
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D42065
While it seemed like a good idea to have this state, we can do
everything we wanted with the state by checking ctrlr->is_failed since
that's set before we start failing the qpairs. Add some comments about
racing when we're failing the controller, though in practice I'm not
sure that kind of race could even be lost.
Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42051
After da8324a925, the pre/post hooks are gone. So remove a coment
about why we don't call them in this case.
Sponsored by: Netflix
Reviewed by: chuck, jhb
Differential Revision: https://reviews.freebsd.org/D42050
da8324a925 removed one of the two instances of NVME_2X_RESET. It
failed to snag the other one, and remove it from the options file.
Remove from both of those here.
Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42049
In 4b977e6dda we removed the call to nvme_ctrlr_post_failed_request
because we can now directly fail requests in this context since we're in
the reset task already. No need to queue it. I left it in place against
future need, but it's been two years and no panics have resulted. Since
the static analysis (code checking) and the dyanmic analysis (surviving
in the field for 2 years, including at $WORK where we know we've gone
through this path when we've failed drives) both signal that it's not
really needed, go ahead and GC it. If we discover at a later date a flaw
in this analysis, we can add it back easily enough by reverting this and
4b977e6dda.
Sponsored by: Netflix
Reviewed by: chuck, gallatin, jhb
Differential Revision: https://reviews.freebsd.org/D42048
When running nvme passthrough commands through the ioctl interface
memory is mapped with vmapbuf() but not unmapped. This results in leaked
memory whenever a process executes an nvme passthrough command with a
data buffer. This can be replicated with a simple c function (error
checks skipped for brevity):
void leak_memory(int nvme_ns_fd, uint16_t nblocks) {
struct nvme_pt_command pt = {
.cmd = {
.opc = NVME_OPC_READ,
.cdw12 = nblocks - 1,
},
.len = nblocks * 512, // Assumes devices with 512 byte lba
.is_read = 1, // Reads and writes should both trigger leak
}
void *buf;
posix_memalign(&buf, nblocks * 512);
pt.buf = buf;
ioctl(nvme_ns_fd, NVME_PASSTHROUGH_COMMAND, &pt);
free(buf);
}
Signed-off-by: David Sloan <david.sloan@eideticom.com>
PR: 273626
Reviewed by: imp, markj
MFC after: 1 week
When we're suspending, we get messages about waiting for the controller
to reset. These are in error: we're not waiting for it to reset. We put
the recovery state as part of suspending, so we should suppress these as
a false positive.
Also remove a stray debug that's left over from earlier versions of
the recovery code that no longer makes sense.
Sponsored by: Netflix
Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.
Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.
Also, tidy up the failure case as well: failing a queue is a superset of
disabling it, so no need to call disable first. This solves solves some
locking issues with recursion since we don't need to recurse.. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.
Next, kill the pre/post reset stuff. It's completely bogus since we
disable the qparis, we don't need to also hold the lock through the
reset: disabling will cause the ISR to return early. This keeps us from
recursing on the recovery lock when resuming. We only need the recovery
lock to avoid a specific race between the timer and the ISR.
Finally, kill NVME_RESET_2X. It'S been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.
These are all interrelated, so need to be done at the same time.
Sponsored by: Netflix
Reviewed by: jhb
Tested by: jhb (made sure suspend / resume worked)
MFC After: 3 days
Differential Revision: https://reviews.freebsd.org/D41866