To quote the manual:
The pointer passed in as name and type is saved rather than the data
it points to. The data pointed to must remain stable until the mutex
is destroyed.
It seems that the type is actually copied, but the name is stored as
a pointer indeed.
mmc_cam_sim_alloc used a name stored on stack.
So, a corrupt mutex name would be reported.
For example:
lock order reversal: (sleepable after non-sleepable)
1st 0xd7285b20 <8A><C0><C0>P@<C1><D0>P@<C1>^D^A (aw_mmc_sim, sleep mutex) @ /usr/devel/git/orange/sys/cam/cam_xpt.c:2804
This change moves the name to struct mmc_sim.
Also, that name is used as the sim name as well.
Unused mtx_name variable is removed too.
Set the bucket size to be SBT_1US/50000 + 1 to be the first number >
20us. I had this uncommitted in my three when I pushed 2283206935
since kern.cam.iosched.bucket_base_us was reporting 19us.
Fixes: 2283206935
Sponsored by: Netflix
Add sysctls to publish the latency bucket size, number, and stride. Move
to putting all the iosched stuff under kern.cam.iosched as well and move
kern.cam.do_dynamic_iosched to kern.cam.iosched.dynamic. In addition, move
kern.cam.io_sched_alpha_bits to kern.cam.iosched.alpha_bits. Publish
kern.cam.iosched.bucket_base (the smallest bucket time), .bucket_ratio
(the geometric progression factor * 100), and .buckets (the total number
of buckets).
Move to publishing 20 buckets, starting at 20us. This allows us to get
better resolution on the short end and detect preformance degredation of
the NVMe drives I've tested on, even with the uncertainty of bucketing.
Sponsored by: Netflix
As part of converting the code to a while loop, the unconditional
initialization of wired to false was lost.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33163
Delete all the write only variables in CAM. At worst, the only behavior
change would be to prevent core dumps from chasing NULL pointers (though
I think in all these cases the pointers can't be NULL).
Sponsored by: Netflix
The SanDisk SD8SB8U1 and likely others pad their serial number with
spaces on the end rather than the start (at least when connected to a
SAS3008). This makes them difficult to wire unit numbers to with the
serial because you have to specify the trailing spaces. Instead, strip
out the trailing spaces.
We already strip leading spaces both here. In addition, when glabel
creates the devfs device nodes, leading and trailing spaces are removed
already (so there will be no change there either).
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32684
For scsi, ata and nvme, at least, we read a serial number from the
device (if the device supports it, some scsi drives do not) and record
it during the *_xpt probe device state machine before it posts the
AC_FOUND_DEVICE async event. For mmc, no serial number is ever
retrieved, so it's always NULL. Add the ability to match this serial
number during device wiring.
This mechanism is competely optional, and often times using a label
and/or some other attribute of the device is easier. However, other
times wiring a unit to a serial number simplifies management as most
monitoring tools require the *daX device and having it stable from boot
to boot helps with data continuity. It can be especially helpful for
nvme where no other means exists to reliably tie a ndaX device to an
underlying nvme drive and namespace.
A similar mechanism exists in Linux to mange device unit numbers with
udev.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32683
If we assigned just a lun as a wired unit (something that camperiphunit
will accept), we failed to properly skip over that unit when computing a
next unit number. Add lun so the code matches the comments that we have
to skip all the same criteria that camperiphunit uses to select wired
units for a driver.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32682
When scanning the resources that are wired for this driver, skip any
that whose number doesn't match newunit. They aren't relevant. Switch to
positive logic to break out of the loop (and thus go to the next unit)
if we find either a target resource or an at resource. This makes the
code easier to read and modify.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32681
The code in camperiphunit rejects "scbus" as an 'at' location that would
allow any other wiring to use that unit number. Yet in
camperiphunitnext, if we have a no target and the 'at' location of
'scbus' it would be excluded on the basis that it's a wiring
cadidate. This is improper and appears to be a hold-over of the
pre-hints / pre-newbus config system, so remove it.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32680
1. During CD probing, we read the TOC header to find the number of
entries, then read the TOC itself. The header determines the number
of entries, which determines the amount of data to read from the
device into the softc in the CD_STATE_MEDIA_TOC_FULL state. We
hard-code a limit of 99 tracks (plus one for the lead-out) in the
softc, but were not validating that the size reported by the media
would fit in this hard-coded limit. Kernel memory corruption could
occur if not.[1] Add validation to check this, and refuse to cache
the TOC if it would not fit.
2. The CDIOCPLAYTRACKS ioctl uses caller provided track numbers to index
into the TOC, but we only validate the starting index. Add
validation of the ending index.
Also, raise the hard-coded limit from 100 tracks to 170, per a
suggestion from Ken.
Reported by: C Turt <ecturt@gmail.com> [1]
Reviewed by: ken, avg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32803
In the ATA/ATAPI spec these are space-padded fixed-length strings with
no NUL-terminator (and byte swapped). When performing the identify we
call ata_param_fixup to swap the bytes back to be in order, strip any
leading/trailing spaces and coalesce consecutive spaces, padding with
NULs. However, if the input has no padding spaces, the fixed-up strings
are still not NUL-terminated. This causes two issues. The first is that
strlcpy will truncate the string by replacing the final byte with a NUL.
The second is that strlcpy will keep reading src until it finds a NUL in
order to calculate the return value, which is defined as the length of
src (so that callers can then compare it with the dsize input to see if
the input string was truncated), thereby reading past the end of the
buffer and into whatever adjacent fields are in the structure. In
practice there's a NUL byte somewhere in the structure, but on CHERI
with subobject bounds enabled in the compiler this overread will be
detected and trap as a bounds violation.
Note this matches ata_xpt's aprobedone, which does a bcopy to a
malloc'ed buffer and manually NUL-terminates it for the CAM path's
device's serial_num.
Found by: CHERI
Reviewed by: imp, scottl
Differential Revision: https://reviews.freebsd.org/D32567
At least for SAS that we only support now disks are typically
connected to the same bus as the enclosure. Limiting the search
scope makes it much faster on systems with multiple buses and
thousands of disks.
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D32305
Remove *_MATCH_NONE enums, making no sense and so never used. Make
*_MATCH_ANY enums 0 (no any match flags set), previously used by
*_MATCH_NONE. Bump CAM_VERSION to 0x1a reflecting those changes and
add compat shims.
When traversing through buses and devices do not descend if we can
already see that requested pattern does not match the bus or device.
It allows to save significant amount of time on system with thousands
of disks when doing limited searches.
Reviewed by: imp
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D32304
Depending on the state of the target doneq thread at the time of the
panic, the wakeup can hang indefinitely in thread_lock_block_wait().
That function should likely be modified to return immediately if the
scheduler is stopped, but it is also preferable to avoid wakeups in
general after a panic.
Reported by: pho
Reviewed by: mav, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32126
Define structures related to the depop set of commands (GET PHYSICAL ELEMENT
STATUS, REMOVE ELEMENT AND TRUNCATE, and RESTORE ELEMENT AND REBUILD) as
well as the CDB construction routines.
Also create scsi_wrap.c. This will have convenience routines that will do all
the elements of allocating the ccb, generating the CDB, sending the command
(looping as necessary for cases where data is returned, but it's size isn't
known up front), etc. As this functionality is fleshed out, calling many
camcontrol commands programatically gets much easier.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29017
cfiscsi_datamove_out() can race with cfiscsi_session_terminate_tasks()
and enqueue a new task after the latter function has aborted existing
tasks. This could result in a deadlock as
cfiscsi_session_terminate_tasks() waited forever for this task to
complete.
Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31892
This adds support for SCSI UNMAP command to file-backed LUNs, if the
underlying file system has a non-zerofilling VOP_DEALLOCATE
implementation where some or all parts of the requested operation range
may be deallocated.
Sponsored by: The FreeBSD Foundation
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D31922
There is a data race between cdsysctlinit and cdcheckmedia. Both
functions change softc->flags without synchronization.
Submitted by: Arseny Smalyuk <smalukav@gmail.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31726
This turns debugging printf() into a KASSERT().
Reviewed By: imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31523
If some device is plugged back in after unplug before the probe periph
destroyed, it will just restart the probe process. But I've found that
PROBE_INQUIRY_CKSUM flag not cleared between the iterations may cause
AC_FOUND_DEVICE not reported on the second iteration, and because of
AC_LOST_DEVICE reported during the first iteration, the device end up
configured, but without any periphs attached.
We've found that enabled serial console and 102-disk JBOD cause enough
probe delays to easily trigger the issue for half of the disks. This
change fixes it reliably on my tests.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
When using iSCSI PDU offload (cxgbei) on T6 adapters, a burst of
received PDUs can be reported via a single message to the driver.
Previously the driver passed these multi-PDU bursts up to the iSCSI
stack up as a single "large" PDU by rewriting the buffer offset, data
segment length, and DataSN fields in the iSCSI header. The DataSN
field in particular was rewritten so that each of the "large" PDUs
used consecutively increasing values. While this worked, the forged
DataSN values did not match the ExpDataSN value in the subsequent SCSI
Response PDU. The initiator does not currently verify this value, but
the forged DataSN values prevent adding a check.
To avoid this, allow a logical iSCSI PDU (struct icl_pdu) to describe
a burst of PDUs via a new 'ip_additional_pdus' field. Normally this
field is set to zero when 'struct icl_pdu' represents a single PDU.
If logical PDU represents a burst of on-the-wire PDUs, then 'ip_npdus'
contains the count of additional on-the-wire PDUs. The header of this
"large" PDU is still modified, but the DataSN field now contains the
DataSN value of the first on-the-wire PDU in the burst.
Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31577
This turns debugging printf() into a KASSERT().
It's for ATA for now; SCSI will came later.
Reviewed By: imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31380
Similar to TSO, iSCSI segmentation offload permits the upper layers to
submit a "large" virtual PDU which is split up into multiple segments
(PDUs) on the wire. Similar to how the TCP/IP headers are used as
templates for TSO, the BHS at the start of a large PDU is used as a
template to construct the specific BHS at the start of each PDU. In
particular, the DataSN is incremented for each subsequent PDU, and the
'F' flag is only set on the last PDU.
struct icl_conn has a new 'ic_hw_isomax' field which defaults to 0,
but can be set to the largest virtual PDU a backend supports. If this
value is non-zero, the iSCSI target and initiator use this size
instead of 'ic_max_send_data_segment_length' to determine the maximum
size for SCSI Data-In and SCSI Data-Out PDUs. Note that since PDUs
can be constructed from multiple buffers before being dispatched, the
target and initiator must wait for the PDU to be fully constructed
before determining the number of DataSN values were consumed (and thus
updating the per-transfer DataSN value used for the start of the next
PDU).
The target generates large PDUs for SCSI Data-In PDUs in
cfiscsi_datamove_in(). The initiator generates large PDUs for SCSI
Data-Out PDUs generated in response to an R2T.
Reviewed by: mav
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31222
and remove repetetive code that calculates vnode locking type for write.
Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405
This makes the da(4) driver use UMA for its CCBs by default,
like ada(4) already does. Please let me know via email
if you notice any suspicious kernel messages,
Reviewed By: imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31257
Note that, as pointed out by scottl@, this code should really look
a bit different, in that the stack allocations should be replaced
with dynamic allocation, and the periph creation should be moved
to a context where one can use M_WAITOK. See the review for more
details. For now let's go with a minimal fix until we're done with
UMA CCBs.
Reviewed By: mav, imp
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D30298
kern.cam.do_dynamic_iosched is really a bool, so change its type to
bool. While I'm here, also use the CTLFLAG_TUN flag instead of a
separate tunable line for it and kern.cam.iosched_alpha_bits.
MFC After: 1 week
Sponsored by: Netflix
In ata_dev_advinfo() and nvme_dev_advinfo(), if the physical path is
being stored and there is a malloc failure (malloc(9) is called with
M_NOWAIT), we could wind up in a situation where the device's
physpath_len is set to the length the user provided, but the physpath
itself is NULL.
If another context then comes in to fetch the physical path value, we
would wind up trying to memcpy a NULL pointer into the caller's buffer.
So, set the physpath_len to 0 when we free the physpath on entry into
the store case for the physical path. Reset the length to a non-zero
value only after we've successfully malloced a buffer to hold it.
This code mirrors scsi_xpt.c does already as well.
Signed-off-by: Young Xiao <92siuyang@gmail.com>
Reviewed by: imp
PR: 238014