433 commits

Author SHA1 Message Date
Chuck Tuffli ce75bfcac9 nvme: Change namespace device name
Changes the device name for NVMe and NVMe-oF namespaces from using "ns"
to "n" to be more compatible with other operating systems. For example,
a device which was previously /dev/nvme0ns1 is now /dev/nvme0n1.

Preserves the existing functionality by creating an alias from nvmeXnY to
nvmeXnsY.

Reviewed by:	imp
MFC after:	1 month
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D45414
2024-06-01 04:14:14 -07:00
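The naming change above can be illustrated with a small userland sketch. The helpers `ns_dev_name` and `ns_dev_alias` are hypothetical names invented for this example; they only format the two strings the commit message describes and are not the driver's own code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format the new-style name, e.g. "nvme0n1". */
static int
ns_dev_name(char *buf, size_t len, unsigned ctrlr, unsigned nsid)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "nvme%un%u", ctrlr, nsid));
}

/* Hypothetical helper: format the legacy name kept as an alias, e.g. "nvme0ns1". */
static int
ns_dev_alias(char *buf, size_t len, unsigned ctrlr, unsigned nsid)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "nvme%uns%u", ctrlr, nsid));
}
```

For controller 0 and namespace 1 the two helpers produce the pair of names quoted in the commit message.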
Warner Losh d09ee08f10 nvme: Count number of alignment splits
When possible, we split up I/Os to NVMe drives that advertise a
preferred alignment. Add a counter for this.

Sponsored by:		Netflix
Reviewed by:		chuck, mav
Differential Revision:	https://reviews.freebsd.org/D45311
2024-05-24 08:32:47 -06:00
Warner Losh 0dd84c3b11 nvme: Add comment about where tr->deadline is set
It's easy to overlook the chain of events that lead to tr->deadline
being updated. Add a comment here to explain what otherwise looks like
an oversight w/o careful study.

Sponsored by:		Netflix
2024-05-13 16:14:04 -06:00
Warner Losh c931cf6af0 nvme: Slight simplification
We don't need to dereference qpair to get the ctrlr pointer each time,
so use the cached value. It's not going to change. No change intended.

Sponsored by:		Netflix
2024-05-13 16:14:04 -06:00
Warner Losh 9db8ca92b9 nvme: Slightly rework this loop to match FreeBSD style
Update the comment for the code, and slightly rework the code in the
'fast exit' paradigm that FreeBSD generally tries to do.

Sponsored by:		Netflix
2024-05-13 16:14:04 -06:00
Warner Losh 5a178b831a nvme: Add locking asserts
nvme_qpair_complete_tracker and nvme_qpair_manual_complete_tracker have
to be called without the qpair lock, so assert it's unowned.

Sponsored by:		Netflix
2024-05-13 16:14:03 -06:00
John Baldwin da4230af3f nvme/f: Use strlcpy instead of strncpy + manual string termination
Reviewed by:	dab, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D45153
2024-05-13 12:04:03 -07:00
John Baldwin 01fc488381 nvme: Use strlcpy instead of strncpy to ensure termination
Reviewed by:	dab, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D45152
2024-05-13 12:03:49 -07:00
Warner Losh e84a75f936 nvme: Add telemetry page definitions
Add definitions for page types 7 and 8 for host initiated telemetry and
controller initiated telemetry (they differ by one byte: the byte that's
defined in the host version is reserved in the controller version).

Sponsored by:		Netflix
2024-05-11 12:09:50 -06:00
John Baldwin ebcfab998e nvme: Explicitly align struct nvme_command on an 8 byte boundary
This was already true for most architectures due to uint64_t structure
members.  However, i386 is special in that it only requires 4 byte
alignment for uint64_t.  As a result, casts from struct nvme_command
to struct nvmf_fabric_cmd were raising a "cast increases alignment"
warning on i386.  Explicitly aligning struct nvme_command pacifies
this warning on i386.

Reported by:	rscheff
Sponsored by:	Chelsio Communications
2024-05-08 16:05:39 -07:00
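A minimal sketch of the explicit-alignment idea, assuming a GCC- or Clang-compatible compiler. `struct demo_cmd` is a hypothetical stand-in, not the real `struct nvme_command`, and the attribute spelling is the compiler's generic form rather than the macro the kernel uses.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a command structure with 64-bit members.
 * Without the attribute, an ABI that only requires 4-byte alignment for
 * uint64_t (as described for i386) would give the struct 4-byte alignment.
 */
struct demo_cmd {
	uint32_t cdw0;
	uint32_t nsid;
	uint64_t prp1;
	uint64_t prp2;
} __attribute__((aligned(8)));

/* With the attribute, the alignment is 8 on every architecture. */
static_assert(alignof(struct demo_cmd) == 8,
    "demo_cmd must be 8-byte aligned");
```

Pinning the alignment on the type itself means a cast to another 8-byte-aligned structure no longer increases the required alignment.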
John Baldwin 29d7e39f56 nvme: Bump the alignment of struct nvme_health_information_page to 8
This ensures that embedded uint64_t values used for statistics
counters are aligned when allocating a structure on the stack or as
part of a containing structure.  In particular this quiets
-Waddress-of-packed-member warnings from GCC when compiling the code
in nvmfd to update the stats.

Reported by:	GCC
2024-05-07 13:54:00 -07:00
John Baldwin 5e3e444230 nvme: Add constants for the Fused Operation (FUSE) field in commands
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44845
2024-05-02 16:31:02 -07:00
John Baldwin d86edc181a nvmf.h: New header defining ioctls for NVMe over Fabrics
This defines structures, ioctl commands, and related constants used
for both the Fabrics host and controller.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44706
2024-05-02 16:27:13 -07:00
Warner Losh 97b77de2d9 nvme: Eliminate intel_log_temp_stats_swapbytes
We can't post an AER for this page, so there's no need to be able to swap
it to host byte order. It's not one of the standard defined pages that
can post via AER, and the vendor's public docs for this temperature page
don't suggest it's possible to get over or under event changes. Since
nvmecontrol no longer needs the swap routine, remove it since it's
now unused.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D44659
2024-04-16 21:30:19 -06:00
Brooks Davis 6bb132ba1e Reduce reliance on sys/sysproto.h pollution
Add sys/errno.h, sys/malloc.h, sys/queue.h, and vm/uma.h as needed.

sys/sysproto.h currently includes sys/acl.h which currently includes
sys/param.h, sys/queue.h, and vm/uma.h which in turn bring in
sys/errno.h and sys/malloc.h.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D44465
2024-04-15 21:35:40 +01:00
Warner Losh 0b8f21e8d1 nvme: Add LPA bits
Add all the bits from the NVMe 2.0 base specification: CMD_EFFECTS to
indicate the commands and effects log page is supported, TELEMETRY to
indicate that the telemetry log pages and protocols are supported,
PERSISTENT_EVENTS to indicate the persistent event log is supported,
LOG_PAGES_PAGE to indicate that various log pages related to log page
and command support are supported: L0, L5, L12, and L13, and
DA4_TELEMETRY to indicate that the DA4 area is supported for telemetry
data.

Sponsored by:		Netflix
2024-04-05 16:53:47 -06:00
John Baldwin 21d3a84db4 nvme: Add NVMe over Fabrics fields to nvme_controller_data
Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44448
2024-03-22 17:24:52 -07:00
John Baldwin 7fa8adb8c5 nvme: Add constants for the Controller Attributes field in cdata
Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44447
2024-03-22 17:24:31 -07:00
John Baldwin 88ecf154c7 nvme: Add constants and types for the discovery log page
This is used in NVMe over Fabrics to enumerate a list of available
controllers.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44446
2024-03-22 17:24:18 -07:00
John Baldwin b354bb04cb nvme: Add constants for fields in AER completion dword 0
Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44445
2024-03-22 17:24:06 -07:00
John Baldwin cbda1886ab nvme: Add constants for the extended data for Get Log Page command flag
nvme(4) doesn't check this flag, but Fabrics implementations may need
to set this flag in the log page attributes cdata field.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44444
2024-03-22 17:23:46 -07:00
John Baldwin b8cb8dd362 nvme: Add constants for the PSDT field in cdw0
This is not used in nvme(4) but is used in NVMe over Fabrics
transports which use SGLs to describe buffers instead of PRPs.

While here, adjust the shift value for the FUSE field to be relative
to the 'fuse' member of 'struct nvme_command'.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44443
2024-03-22 17:23:24 -07:00
John Baldwin f21a54d190 nvme: Add SGL structure and constants for use in NVMe commands
Fabrics capsules use an SGL structure instead of prp1/2 addresses to
describe the data buffer used for a command.  The SGL structure is
added to a union with the existing prp1/2 fields.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44442
2024-03-22 17:23:09 -07:00
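The union described above might look roughly like the following sketch. `struct demo_sgl` and `struct demo_dptr` are hypothetical names, and the field layout (8-byte address, 4-byte length, 3 reserved bytes, 1 type byte) is an assumption made for illustration rather than a copy of the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte SGL descriptor: address, length, reserved, type. */
struct demo_sgl {
	uint64_t address;
	uint32_t length;
	uint8_t  reserved[3];
	uint8_t  type;
};

/* Hypothetical data-pointer portion of a command: PRPs and SGL overlap. */
struct demo_dptr {
	union {
		struct {
			uint64_t prp1;
			uint64_t prp2;
		};
		struct demo_sgl sgl;
	};
};

/* Both views must occupy the same 16 bytes. */
static_assert(sizeof(struct demo_sgl) == 16, "SGL descriptor is 16 bytes");
static_assert(sizeof(struct demo_dptr) == 16, "union must not grow");
static_assert(offsetof(struct demo_dptr, prp2) == 8, "prp2 follows prp1");
```

Because the two views share storage, existing code that writes prp1/prp2 keeps working while Fabrics code fills in the SGL view of the same bytes.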
John Baldwin 1931b75e00 nvme: Export constants for min and max queue sizes
These are useful for NVMe over Fabrics.

Reviewed by:	imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44441
2024-03-22 17:23:02 -07:00
Warner Losh fe52c3384c nvme_sim: Add comment about the is_failed test
We only see a request with a failed controller while we're in the
process of failing the controller. Add a comment to that effect.

Sponsored by:		Netflix
2024-03-07 12:05:28 -07:00
Warner Losh 2a2682ee53 nvme: Add SMART WARNING for persistent memory region
NVMe 2.0 added persistent memory regions, and this bit reports critical
warnings / errors with those regions.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D44213
2024-03-06 18:38:59 -07:00
Warner Losh 5cdedf676d nvme: Log reset success or failure to devd
We're logging when we start a reset, but not when we complete it, nor
the result. Now log a success or timed_out event for the reset.
Currently, the only detectable error we have from reset is 'failure to
become ready in time,' though the code looks like it might be more
generic. Log this and if we ever have other failure modes, change the
logging to devd when that happens.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D44211
2024-03-06 18:38:59 -07:00
Warner Losh 4f817fcf6a nvme: Change devctl events for the controller
Change the devctl events slightly for the controller. SMART errors will
log the changed bits in the NVME SMART Critical Warning State as its
event.

Reset will now emit 'event=start'. Soon more.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D44210
2024-03-06 18:38:59 -07:00
Warner Losh fc3afe9395 nvme: split devctl out to its own function
Split the devctl aspect of things out to its own function in
nvme_ctrlr_devctl_log. In preparing to document this, and based on
actual use, we want something different for the SMART errors, so this
will facilitate that.

Sponsored by:		Netflix
Reviewed by:		chuck, mav
Differential Revision:	https://reviews.freebsd.org/D44209
2024-03-06 18:38:59 -07:00
Warner Losh c5246cb7b0 nvme: Report only the unknown bits
When we get a smart error that's unknown, report only the unknown
(reserved) bits of the Critical Warning Bitfield.

Sponsored by:		Netflix
2024-03-01 16:04:27 -07:00
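Reporting only the unknown bits amounts to masking off the bits the driver already understands. A minimal sketch follows; `DEMO_CW_KNOWN_MASK` and `cw_unknown_bits` are hypothetical, and the choice of bits 0-4 as "known" is an assumption for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: bits 0-4 of the bitfield are the known ones. */
#define DEMO_CW_KNOWN_MASK	0x1fu

static_assert(DEMO_CW_KNOWN_MASK <= 0xffu, "mask must fit in one byte");

/* Return only the reserved (unknown) bits of a critical warning byte. */
static uint8_t
cw_unknown_bits(uint8_t critical_warning)
{
	return (critical_warning & (uint8_t)~DEMO_CW_KNOWN_MASK);
}
```

With this mask, a value of 0x83 yields 0x80: the two known low bits are dropped and only the reserved high bit remains to be reported.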
John Baldwin 7485926e09 nvme: Firmware revisions in the firmware slot info logpage are ASCII strings
In particular, don't try to byteswap the values as 64-bit integers and
always print a non-empty version as a string.

Reviewed by:	chuck, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44121
2024-03-01 14:18:43 -08:00
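Treating a firmware revision as an ASCII field rather than a 64-bit integer could look like this sketch. It assumes an 8-byte field that is not NUL-terminated and may be space-padded; `DEMO_FW_REV_LEN` and `fw_rev_to_str` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_FW_REV_LEN	8	/* assumed width of one revision field */

/*
 * Copy a fixed-width ASCII revision into a NUL-terminated string,
 * trimming trailing spaces and NULs. No byte swapping is involved.
 * An all-blank field yields an empty string.
 */
static void
fw_rev_to_str(const char rev[DEMO_FW_REV_LEN], char *out, size_t outlen)
{
	size_t n = DEMO_FW_REV_LEN;

	assert(outlen > DEMO_FW_REV_LEN);
	while (n > 0 && (rev[n - 1] == ' ' || rev[n - 1] == '\0'))
		n--;
	memcpy(out, rev, n);
	out[n] = '\0';
}
```

A caller can then print the slot only when the resulting string is non-empty, which matches the behaviour the commit message describes.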
John Baldwin 5650bd3fe8 nvme: Use the NVMEF macro to construct fields
Reviewed by:	chuck, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43605
2024-01-29 11:01:13 -08:00
John Baldwin 3a477a9b70 nvme: Add NVMEF helper macro as the inverse of NVMEV
This macro accepts a field name and a value for the field and
constructs the shifted field value.

Reviewed by:	chuck
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43604
2024-01-29 11:00:57 -08:00
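The relationship between the three helper macros can be sketched as follows. The macro bodies are an assumption reconstructed from the descriptions in these commit messages (mask, extract value, construct field), and `DEMO_FIELD` with its shift and mask is a hypothetical field invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 3-bit field located at bits 4-6. */
#define DEMO_FIELD_SHIFT	4
#define DEMO_FIELD_MASK		0x7u

/* Assumed shapes of the helpers, keyed by a field name prefix. */
#define NVMEM(name)		(name##_MASK << name##_SHIFT)
#define NVMEV(name, data)	(((data) >> name##_SHIFT) & name##_MASK)
#define NVMEF(name, val)	(((val) & name##_MASK) << name##_SHIFT)

/* Replace DEMO_FIELD inside reg with val, leaving other bits alone. */
static uint32_t
set_demo_field(uint32_t reg, uint32_t val)
{
	assert(val <= DEMO_FIELD_MASK);
	return ((reg & ~(uint32_t)NVMEM(DEMO_FIELD)) | NVMEF(DEMO_FIELD, val));
}
```

Under these definitions, constructing a field and then extracting it returns the original value, which is the sense in which one macro is the inverse of the other.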
John Baldwin 8488fc417f nvme: Use the NVMEM macro instead of expanded versions
A few of these omitted a shift of 0, but this is more consistent.

Reviewed by:	chuck
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43602
2024-01-29 10:59:37 -08:00
John Baldwin 1dade1f255 nvme: Rename NVMEB helper macro to NVMEM
The current macro always builds a full mask for a named field, so use
the M suffix for mask.

Reviewed by:	chuck, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43601
2024-01-29 10:58:28 -08:00
John Baldwin 479680f235 nvme: Use the NVMEV macro instead of expanded versions
Reviewed by:	chuck
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43595
2024-01-29 10:30:54 -08:00
Alexander Motin b46c7b1ed4 nvme: Add some bits from NVMe 2.0c spec.
MFC after:	1 week
2023-12-27 13:50:54 -05:00
Mark Johnston d9b7301bb7 nvme: Initialize HMB entries before loading them into the controller
struct nvme_hmb_desc contains a pad field which was not getting
initialized before being synced.  This doesn't have much consequence but
triggers a report from KMSAN, which verifies that host-filled DMA memory
is initialized before it is made visible to the device.  So, let's just
initialize it properly.

Reported by:	KMSAN
Reviewed by:	mav, imp
MFC after:	1 week
Sponsored by:	Klara, Inc.
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D43090
2023-12-18 17:45:24 -05:00
Warner Losh fdafd315ad sys: Automated cleanup of cdefs and other formatting
Apply the following automated changes to try to eliminate
no-longer-needed sys/cdefs.h includes as well as now-empty
blank lines in a row.

Remove /^#if.*\n#endif.*\n#include\s+<sys/cdefs.h>.*\n/
Remove /\n+#include\s+<sys/cdefs.h>.*\n+#if.*\n#endif.*\n+/
Remove /\n+#if.*\n#endif.*\n+/
Remove /^#if.*\n#endif.*\n/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/types.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/param.h>/
Remove /\n+#include\s+<sys/cdefs.h>\n#include\s+<sys/capsicum.h>/

Sponsored by:		Netflix
2023-11-26 22:24:00 -07:00
Warner Losh 34a6ad848f nvme: Don't use version to listen for events for ns and fw changes
Instead, use the attribute bits from the identification data to
determine if we should listen to namespace changes and firmware
activation. Should have no functional change, though we may stop
listening for events that will never happen.

Sponsored by:		Netflix
2023-11-17 21:25:57 -07:00
Warner Losh fd9a4a67d0 cam: Minor opt_cam.h cleanup
sys/cam/cam.h includes opt_cam.h, so none of the clients need to do
this. cam.h does all the right dancing to conditionally include
opt_cam.h only when it makes sense. It generally only matters when
cam_debug.h is included (it must be included before that). Many of the
stray opt_cam.h includes were after cam_debug.h which would be a problem
were it not included in cam/cam.h. The other users of CAM options that
aren't debug all already include cam/cam.h.

Also trim unneeded sys/cdefs.h files from the files touched.

Sponsored by:		Netflix
2023-11-06 10:47:15 -07:00
Alexander Motin 8d6c0743e3 nvme: Introduce longer timeouts for admin queue
KIOXIA CD8 SSDs routinely take ~25 seconds to delete a non-empty
namespace.  In some cases like hot-plug it takes longer, triggering
a timeout and controller resets after just 30 seconds. Linux has had a
separate 60-second timeout for the admin queue for many years.  This
patch does the same.  And it is good to be consistent.

Sponsored by:	iXsystems, Inc.
Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D42454
2023-11-06 11:05:48 -05:00
Warner Losh afc3d49b17 nvme: Close a race in destroying qpair and timeouts
We should have cleared all the pending I/O prior to calling
nvme_qpair_destroy, which should ensure that if the callout_drain causes
a call to nvme_qpair_timeout(), it won't schedule any new
timeout. However, it doesn't hurt to set timeout_pending to false in
nvme_qpair_destroy() and have nvme_qpair_timeout() exit early if it sees
it w/o scheduling a timeout. Since we don't otherwise stop the timeout
until we're about to destroy the qpair, this ensures we fail safe. The
lock/unlock also ensures the callout_drain will either remove the callout,
or wait for it to run with the early bailout.

We can likely further improve this by using callout_stop() inside the
pending lock. I'll investigate that for future refinement.

Sponsored by:		Netflix
Suggestions by:		jhb
Reviewed by:		gallatin
Differential Revision:	https://reviews.freebsd.org/D42065
2023-10-10 16:13:57 -06:00
Warner Losh 9cd7b62473 nvme: Eliminate RECOVERY_FAILED state
While it seemed like a good idea to have this state, we can do
everything we wanted with the state by checking ctrlr->is_failed since
that's set before we start failing the qpairs. Add some comments about
racing when we're failing the controller, though in practice I'm not
sure that kind of race could even be lost.

Sponsored by:		Netflix
Reviewed by:		chuck, gallatin, jhb
Differential Revision:	https://reviews.freebsd.org/D42051
2023-10-10 16:13:57 -06:00
Warner Losh 6b2a6e9cb0 nvme: Remove stale comment
After da8324a925, the pre/post hooks are gone. So remove a comment
about why we don't call them in this case.

Sponsored by:		Netflix
Reviewed by:		chuck, jhb
Differential Revision:	https://reviews.freebsd.org/D42050
2023-10-10 16:13:56 -06:00
Warner Losh 4026128983 nvme: Really remove NVME_2X_RESET
da8324a925 removed one of the two instances of NVME_2X_RESET. It
failed to snag the other one, and remove it from the options file.
Remove from both of those here.

Sponsored by:		Netflix
Reviewed by:		chuck, gallatin, jhb
Differential Revision:	https://reviews.freebsd.org/D42049
2023-10-10 16:13:56 -06:00
Warner Losh bc85cd303c nvme: gc nvme_ctrlr_post_failed_request and related task stuff
In 4b977e6dda we removed the call to nvme_ctrlr_post_failed_request
because we can now directly fail requests in this context since we're in
the reset task already. No need to queue it. I left it in place against
future need, but it's been two years and no panics have resulted. Since
the static analysis (code checking) and the dynamic analysis (surviving
in the field for 2 years, including at $WORK where we know we've gone
through this path when we've failed drives) both signal that it's not
really needed, go ahead and GC it. If we discover at a later date a flaw
in this analysis, we can add it back easily enough by reverting this and
4b977e6dda.

Sponsored by:		Netflix
Reviewed by:		chuck, gallatin, jhb
Differential Revision:	https://reviews.freebsd.org/D42048
2023-10-10 16:13:56 -06:00
David Sloan 7ea866eb14 nvme: Fix memory leak in pt ioctl commands
When running nvme passthrough commands through the ioctl interface
memory is mapped with vmapbuf() but not unmapped. This results in leaked
memory whenever a process executes an nvme passthrough command with a
data buffer. This can be replicated with a simple C function (error
checks skipped for brevity):

void leak_memory(int nvme_ns_fd, uint16_t nblocks) {
	struct nvme_pt_command pt = {
		.cmd = {
			.opc = NVME_OPC_READ,
			.cdw12 = nblocks - 1,
		},
		.len = nblocks * 512, // Assumes devices with 512 byte lba
		.is_read = 1, // Reads and writes should both trigger leak
	};
	void *buf;

	// posix_memalign() takes (memptr, alignment, size); the 4096 byte
	// alignment is an assumption, the original call omitted it.
	posix_memalign(&buf, 4096, nblocks * 512);
	pt.buf = buf;
	ioctl(nvme_ns_fd, NVME_PASSTHROUGH_COMMAND, &pt);
	free(buf);
}

Signed-off-by: David Sloan <david.sloan@eideticom.com>

PR:		273626
Reviewed by:	imp, markj
MFC after:	1 week
2023-10-02 11:50:14 -04:00
Warner Losh 1d6021cd72 nvme: Suppress noise messages
When we're suspending, we get messages about waiting for the controller
to reset. These are in error: we're not waiting for it to reset. We put
the recovery state as part of suspending, so we should suppress these as
a false positive.

Also remove a stray debug that's left over from earlier versions of
the recovery code that no longer makes sense.

Sponsored by:		Netflix
2023-09-25 22:21:58 -06:00
Warner Losh da8324a925 nvme: Fix locking protocol violation to fix suspend / resume
Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.

Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.

Also, tidy up the failure case as well: failing a queue is a superset of
disabling it, so no need to call disable first. This solves some
locking issues with recursion since we don't need to recurse. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.

Next, kill the pre/post reset stuff. It's completely bogus: since we
disable the qpairs, we don't need to also hold the lock through the
reset, because disabling will cause the ISR to return early. This keeps us from
recursing on the recovery lock when resuming. We only need the recovery
lock to avoid a specific race between the timer and the ISR.

Finally, kill NVME_2X_RESET. It's been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.

These are all interrelated, so need to be done at the same time.

Sponsored by:		Netflix
Reviewed by:		jhb
Tested by:		jhb (made sure suspend / resume worked)
MFC After:		3 days
Differential Revision:	https://reviews.freebsd.org/D41866
2023-09-24 07:17:18 -06:00