/*
 * This file and its contents are supplied under the terms of the
 * Common Development and Distribution License ("CDDL"), version 1.0.
 * You may only use this file in accordance with the terms of version
 * 1.0 of the CDDL.
 *
 * A full copy of the text of the CDDL should have accompanied this
 * source. A copy of the CDDL is also available via the Internet at
 * http://www.illumos.org/license/CDDL.
 */

/*
 * Copyright 2016 Nexenta Systems, Inc. All rights reserved.
 * Copyright 2016 Tegile Systems, Inc. All rights reserved.
 * Copyright (c) 2016 The MathWorks, Inc. All rights reserved.
 */
/*
 * blkdev driver for NVMe compliant storage devices
 *
 * This driver was written to conform to version 1.0e of the NVMe
 * specification. It may work with newer versions, but that is completely
 * untested and disabled by default.
 *
 * The driver has only been tested on x86 systems and will not work on big-
 * endian systems without changes to the code accessing registers and data
 * structures used by the hardware.
 *
 *
 * Interrupt Usage:
 *
 * The driver will use a FIXED interrupt while configuring the device as the
 * specification requires. Later in the attach process it will switch to MSI-X
 * or MSI if supported. The driver wants to have one interrupt vector per CPU,
 * but it will work correctly if fewer are available. Interrupts can be shared
 * by queues; the interrupt handler will iterate through the I/O queue array in
 * steps of n_intr_cnt. Usually only the admin queue will share an interrupt
 * with one I/O queue. The interrupt handler will retrieve completed commands
 * from all queues sharing an interrupt vector and will post them to a taskq
 * for completion processing.
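 *
 * For example (purely illustrative numbers): with 8 I/O queues and 4
 * interrupt vectors, the handler for vector 2 services queues 2 and 6,
 * stepping through the queue array in increments of n_intr_cnt (4).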
 *
 *
 * Command Processing:
 *
 * NVMe devices can have up to 65536 I/O queue pairs, with each queue holding
 * up to 65536 I/O commands. The driver will configure one I/O queue pair per
 * available interrupt vector, with the queue length usually much smaller than
 * the maximum of 65536. If the hardware doesn't provide enough queues, fewer
 * interrupt vectors will be used.
 *
 * Additionally the hardware provides a single special admin queue pair that
 * can hold up to 4096 admin commands.
 *
 * From the hardware perspective both queues of a queue pair are independent,
 * but they share some driver state: the command array (holding pointers to
 * commands currently being processed by the hardware) and the active command
 * counter. Access to the submission side of a queue pair and the shared state
 * is protected by nq_mutex. The completion side of a queue pair does not need
 * that protection apart from its access to the shared state; it is called only
 * in the interrupt handler, which does not run concurrently for the same
 * interrupt vector.
 *
 * When a command is submitted to a queue pair the active command counter is
 * incremented and a pointer to the command is stored in the command array. The
 * array index is used as command identifier (CID) in the submission queue
 * entry. Some commands may take a very long time to complete, and if the queue
 * wraps around in that time a submission may find the next array slot still
 * being used by a long-running command. In this case the array is sequentially
 * searched for the next free slot. The length of the command array is the same
 * as the configured queue length.
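 *
 * For example (illustrative): with a queue length of 4, if slot 1 is still
 * occupied by a long-running command when the queue wraps around to it, the
 * search skips ahead and the new command is stored in slot 2, which then
 * becomes its CID.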
 *
 *
 * Namespace Support:
 *
 * NVMe devices can have multiple namespaces, each being an independent data
 * store. The driver supports multiple namespaces and creates a blkdev interface
 * for each namespace found. Namespaces can have various attributes to support
 * thin provisioning and protection information. This driver does not support
 * any of these attributes and ignores namespaces that have them.
 *
 *
 * Blkdev Interface:
 *
 * This driver uses blkdev to do all the heavy lifting involved with presenting
 * a disk device to the system. As a result, the processing of I/O requests is
 * relatively simple as blkdev takes care of partitioning, boundary checks, DMA
 * setup, and splitting of transfers into manageable chunks.
 *
 * I/O requests coming in from blkdev are turned into NVM commands and posted to
 * an I/O queue. The queue is selected by taking the CPU id modulo the number of
 * queues. There is currently no timeout handling of I/O commands.
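 *
 * For example (illustrative): with 4 I/O queues, requests submitted on CPU 5
 * and CPU 9 both map to the same queue, since 5 % 4 == 9 % 4 == 1.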
 *
 * Blkdev also supports querying device/media information and generating a
 * devid. The driver reports the best block size as determined by the namespace
 * format back to blkdev as physical block size to support partition and block
 * alignment. The devid is composed using the device vendor ID, model number,
 * serial number, and the namespace ID.
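 *
 * For example (illustrative values only): vendor ID 0x8086, model "FOO",
 * serial "S123" and namespace 1 would yield the devid "8086-FOO-S123-1",
 * following the "%4X-%s-%s-%X" format used in nvme_prepare_devid() below.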
 *
 *
 * Error Handling:
 *
 * Error handling is currently limited to detecting fatal hardware errors,
 * either by asynchronous events, or synchronously through command status or
 * admin command timeouts. In case of severe errors the device is fenced off
 * and all further requests will return EIO. FMA is then called to fault the
 * device.
 *
 * The hardware has a limit for outstanding asynchronous event requests. Before
 * this limit is known the driver assumes it is at least 1 and posts a single
 * asynchronous request. Later, when the limit is known, more asynchronous event
 * requests are posted to allow quicker reception of error information. When an
 * asynchronous event is posted by the hardware the driver will parse the error
 * status fields and log information or fault the device, depending on the
 * severity of the asynchronous event. The asynchronous event request is then
 * reused and posted to the admin queue again.
 *
 * On command completion the command status is checked for errors. In case of
 * errors indicating a driver bug the driver panics. Almost all other error
 * status values just cause EIO to be returned.
 *
 * Command timeouts are currently detected for all admin commands except
 * asynchronous event requests. If a command times out and the hardware appears
 * to be healthy the driver attempts to abort the command. If this fails the
 * driver assumes the device to be dead, fences it off, and calls FMA to retire
 * it. In general admin commands are issued at attach time only. No timeout
 * handling of normal I/O commands is presently done.
 *
 * The ABORT command itself may also time out; in that case the device is
 * likewise declared dead and fenced off.
 *
 *
 * Quiesce / Fast Reboot:
 *
 * The driver currently does not support fast reboot. A quiesce(9E) entry point
 * is still provided which is used to send a shutdown notification to the
 * device.
 *
 *
 * Driver Configuration:
 *
 * The following driver properties can be changed to control some aspects of
 * the driver's behavior (an illustrative nvme.conf sketch follows the list):
 * - strict-version: can be set to 0 to allow devices conforming to newer
 *   versions to be used
 * - ignore-unknown-vendor-status: can be set to 1 to not treat any vendor
 *   specific command status as a fatal error leading to device faulting
 * - admin-queue-len: the maximum length of the admin queue (16-4096)
 * - io-queue-len: the maximum length of the I/O queues (16-65536)
 * - async-event-limit: the maximum number of asynchronous event requests to be
 *   posted by the driver
 * - volatile-write-cache-enable: can be set to 0 to disable the volatile write
 *   cache
 * - min-phys-block-size: the minimum physical block size to report to blkdev,
 *   which is among other things the basis for ZFS vdev ashift
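 *
 * As an illustrative sketch (not a shipped file), a /kernel/drv/nvme.conf
 * overriding some of these defaults could contain:
 *
 *	strict-version=0;
 *	io-queue-len=1024;
 *	volatile-write-cache-enable=0;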
 *
 *
 * TODO:
 * - figure out sane default for I/O queue depth reported to blkdev
 * - polled I/O support to support kernel core dumping
 * - FMA handling of media errors
 * - support for devices supporting very large I/O requests using chained PRPs
 * - support for querying log pages from user space
 * - support for configuring hardware parameters like interrupt coalescing
 * - support for media formatting and hard partitioning into namespaces
 * - support for big-endian systems
 * - support for fast reboot
 */

#include <sys/byteorder.h>
#ifdef _BIG_ENDIAN
#error nvme driver needs porting for big-endian platforms
#endif
173 #include <sys/modctl.h>
174 #include <sys/conf.h>
175 #include <sys/devops.h>
177 #include <sys/sunddi.h>
178 #include <sys/bitmap.h>
179 #include <sys/sysmacros.h>
180 #include <sys/param.h>
181 #include <sys/varargs.h>
182 #include <sys/cpuvar.h>
183 #include <sys/disp.h>
184 #include <sys/blkdev.h>
185 #include <sys/atomic.h>
186 #include <sys/archsystm.h>
187 #include <sys/sata/sata_hba.h>
189 #include "nvme_reg.h"
190 #include "nvme_var.h"

/* NVMe spec version supported */
static const int nvme_version_major = 1;
static const int nvme_version_minor = 0;

/* tunable for admin command timeout in seconds, default is 1s */
static volatile int nvme_admin_cmd_timeout = 1;

static int nvme_attach(dev_info_t *, ddi_attach_cmd_t);
static int nvme_detach(dev_info_t *, ddi_detach_cmd_t);
static int nvme_quiesce(dev_info_t *);
static int nvme_fm_errcb(dev_info_t *, ddi_fm_error_t *, const void *);
static int nvme_setup_interrupts(nvme_t *, int, int);
static void nvme_release_interrupts(nvme_t *);
static uint_t nvme_intr(caddr_t, caddr_t);

static void nvme_shutdown(nvme_t *, int, boolean_t);
static boolean_t nvme_reset(nvme_t *, boolean_t);
static int nvme_init(nvme_t *);
static nvme_cmd_t *nvme_alloc_cmd(nvme_t *, int);
static void nvme_free_cmd(nvme_cmd_t *);
static nvme_cmd_t *nvme_create_nvm_cmd(nvme_namespace_t *, uint8_t,
    bd_xfer_t *);
static int nvme_admin_cmd(nvme_cmd_t *, int);
static int nvme_submit_cmd(nvme_qpair_t *, nvme_cmd_t *);
static nvme_cmd_t *nvme_retrieve_cmd(nvme_t *, nvme_qpair_t *);
static boolean_t nvme_wait_cmd(nvme_cmd_t *, uint_t);
static void nvme_wakeup_cmd(void *);
static void nvme_async_event_task(void *);

static int nvme_check_unknown_cmd_status(nvme_cmd_t *);
static int nvme_check_vendor_cmd_status(nvme_cmd_t *);
static int nvme_check_integrity_cmd_status(nvme_cmd_t *);
static int nvme_check_specific_cmd_status(nvme_cmd_t *);
static int nvme_check_generic_cmd_status(nvme_cmd_t *);
static inline int nvme_check_cmd_status(nvme_cmd_t *);

static void nvme_abort_cmd(nvme_cmd_t *);
static int nvme_async_event(nvme_t *);
static void *nvme_get_logpage(nvme_t *, uint8_t, ...);
static void *nvme_identify(nvme_t *, uint32_t);
static boolean_t nvme_set_features(nvme_t *, uint32_t, uint8_t, uint32_t,
    uint32_t *);
static boolean_t nvme_write_cache_set(nvme_t *, boolean_t);
static int nvme_set_nqueues(nvme_t *, uint16_t);

static void nvme_free_dma(nvme_dma_t *);
static int nvme_zalloc_dma(nvme_t *, size_t, uint_t, ddi_dma_attr_t *,
    nvme_dma_t **);
static int nvme_zalloc_queue_dma(nvme_t *, uint32_t, uint16_t, uint_t,
    nvme_dma_t **);
static void nvme_free_qpair(nvme_qpair_t *);
static int nvme_alloc_qpair(nvme_t *, uint32_t, nvme_qpair_t **, int);
static int nvme_create_io_qpair(nvme_t *, nvme_qpair_t *, uint16_t);

static inline void nvme_put64(nvme_t *, uintptr_t, uint64_t);
static inline void nvme_put32(nvme_t *, uintptr_t, uint32_t);
static inline uint64_t nvme_get64(nvme_t *, uintptr_t);
static inline uint32_t nvme_get32(nvme_t *, uintptr_t);

static boolean_t nvme_check_regs_hdl(nvme_t *);
static boolean_t nvme_check_dma_hdl(nvme_dma_t *);

static int nvme_fill_prp(nvme_cmd_t *, bd_xfer_t *);

static void nvme_bd_xfer_done(void *);
static void nvme_bd_driveinfo(void *, bd_drive_t *);
static int nvme_bd_mediainfo(void *, bd_media_t *);
static int nvme_bd_cmd(nvme_namespace_t *, bd_xfer_t *, uint8_t);
static int nvme_bd_read(void *, bd_xfer_t *);
static int nvme_bd_write(void *, bd_xfer_t *);
static int nvme_bd_sync(void *, bd_xfer_t *);
static int nvme_bd_devid(void *, dev_info_t *, ddi_devid_t *);

static int nvme_prp_dma_constructor(void *, void *, int);
static void nvme_prp_dma_destructor(void *, void *);

static void nvme_prepare_devid(nvme_t *, uint32_t);

static void *nvme_state;
static kmem_cache_t *nvme_cmd_cache;

/*
 * DMA attributes for queue DMA memory
 *
 * Queue DMA memory must be page aligned. The maximum length of a queue is
 * 65536 entries, and an entry can be 64 bytes long.
 */
static ddi_dma_attr_t nvme_queue_dma_attr = {
	.dma_attr_version	= DMA_ATTR_V0,
	.dma_attr_addr_lo	= 0,
	.dma_attr_addr_hi	= 0xffffffffffffffffULL,
	.dma_attr_count_max	= (UINT16_MAX + 1) * sizeof (nvme_sqe_t) - 1,
	.dma_attr_align		= 0x1000,
	.dma_attr_burstsizes	= 0x7ff,
	.dma_attr_minxfer	= 0x1000,
	.dma_attr_maxxfer	= (UINT16_MAX + 1) * sizeof (nvme_sqe_t),
	.dma_attr_seg		= 0xffffffffffffffffULL,
	.dma_attr_sgllen	= 1,
	.dma_attr_granular	= 1,
};

/*
 * DMA attributes for transfers using Physical Region Page (PRP) entries
 *
 * A PRP entry describes one page of DMA memory using the page size specified
 * in the controller configuration's memory page size register (CC.MPS). It
 * uses a 64bit base address aligned to this page size. There is no limitation
 * on chaining PRPs together for arbitrarily large DMA transfers.
 */
static ddi_dma_attr_t nvme_prp_dma_attr = {
	.dma_attr_version	= DMA_ATTR_V0,
	.dma_attr_addr_lo	= 0,
	.dma_attr_addr_hi	= 0xffffffffffffffffULL,
	.dma_attr_count_max	= 0xfff,
	.dma_attr_align		= 0x1000,
	.dma_attr_burstsizes	= 0x7ff,
	.dma_attr_minxfer	= 0x1000,
	.dma_attr_maxxfer	= 0x1000,
	.dma_attr_seg		= 0xfff,
	.dma_attr_sgllen	= -1,
	.dma_attr_granular	= 1,
};

/*
 * DMA attributes for transfers using scatter/gather lists
 *
 * A SGL entry describes a chunk of DMA memory using a 64bit base address and a
 * 32bit length field. SGL Segment and SGL Last Segment entries require the
 * length to be a multiple of 16 bytes.
 */
static ddi_dma_attr_t nvme_sgl_dma_attr = {
	.dma_attr_version	= DMA_ATTR_V0,
	.dma_attr_addr_lo	= 0,
	.dma_attr_addr_hi	= 0xffffffffffffffffULL,
	.dma_attr_count_max	= 0xffffffffUL,
	.dma_attr_burstsizes	= 0x7ff,
	.dma_attr_minxfer	= 0x10,
	.dma_attr_maxxfer	= 0xfffffffffULL,
	.dma_attr_seg		= 0xffffffffffffffffULL,
	.dma_attr_sgllen	= -1,
	.dma_attr_granular	= 0x10,
};

static ddi_device_acc_attr_t nvme_reg_acc_attr = {
	.devacc_attr_version	= DDI_DEVICE_ATTR_V0,
	.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC,
	.devacc_attr_dataorder	= DDI_STRICTORDER_ACC
};

static struct dev_ops nvme_dev_ops = {
	.devo_rev	= DEVO_REV,
	.devo_getinfo	= ddi_no_info,
	.devo_identify	= nulldev,
	.devo_probe	= nulldev,
	.devo_attach	= nvme_attach,
	.devo_detach	= nvme_detach,
	.devo_bus_ops	= NULL,
	.devo_quiesce	= nvme_quiesce,
};

static struct modldrv nvme_modldrv = {
	.drv_modops	= &mod_driverops,
	.drv_linkinfo	= "NVMe v1.0e",
	.drv_dev_ops	= &nvme_dev_ops
};

static struct modlinkage nvme_modlinkage = {
	.ml_rev		= MODREV_1,
	.ml_linkage	= { &nvme_modldrv, NULL }
};

static bd_ops_t nvme_bd_ops = {
	.o_version	= BD_OPS_VERSION_0,
	.o_drive_info	= nvme_bd_driveinfo,
	.o_media_info	= nvme_bd_mediainfo,
	.o_devid_init	= nvme_bd_devid,
	.o_sync_cache	= nvme_bd_sync,
	.o_read		= nvme_bd_read,
	.o_write	= nvme_bd_write,
};

int
_init(void)
{
	int error;

	error = ddi_soft_state_init(&nvme_state, sizeof (nvme_t), 1);
	if (error != DDI_SUCCESS)
		return (error);

	nvme_cmd_cache = kmem_cache_create("nvme_cmd_cache",
	    sizeof (nvme_cmd_t), 64, NULL, NULL, NULL, NULL, NULL, 0);

	bd_mod_init(&nvme_dev_ops);

	error = mod_install(&nvme_modlinkage);
	if (error != DDI_SUCCESS) {
		ddi_soft_state_fini(&nvme_state);
		bd_mod_fini(&nvme_dev_ops);
	}

	return (error);
}

int
_fini(void)
{
	int error;

	error = mod_remove(&nvme_modlinkage);
	if (error == DDI_SUCCESS) {
		ddi_soft_state_fini(&nvme_state);
		kmem_cache_destroy(nvme_cmd_cache);
		bd_mod_fini(&nvme_dev_ops);
	}

	return (error);
}

int
_info(struct modinfo *modinfop)
{
	return (mod_info(&nvme_modlinkage, modinfop));
}

static inline void
nvme_put64(nvme_t *nvme, uintptr_t reg, uint64_t val)
{
	ASSERT(((uintptr_t)(nvme->n_regs + reg) & 0x7) == 0);

	/*LINTED: E_BAD_PTR_CAST_ALIGN*/
	ddi_put64(nvme->n_regh, (uint64_t *)(nvme->n_regs + reg), val);
}

static inline void
nvme_put32(nvme_t *nvme, uintptr_t reg, uint32_t val)
{
	ASSERT(((uintptr_t)(nvme->n_regs + reg) & 0x3) == 0);

	/*LINTED: E_BAD_PTR_CAST_ALIGN*/
	ddi_put32(nvme->n_regh, (uint32_t *)(nvme->n_regs + reg), val);
}

static inline uint64_t
nvme_get64(nvme_t *nvme, uintptr_t reg)
{
	uint64_t val;

	ASSERT(((uintptr_t)(nvme->n_regs + reg) & 0x7) == 0);

	/*LINTED: E_BAD_PTR_CAST_ALIGN*/
	val = ddi_get64(nvme->n_regh, (uint64_t *)(nvme->n_regs + reg));

	return (val);
}

static inline uint32_t
nvme_get32(nvme_t *nvme, uintptr_t reg)
{
	uint32_t val;

	ASSERT(((uintptr_t)(nvme->n_regs + reg) & 0x3) == 0);

	/*LINTED: E_BAD_PTR_CAST_ALIGN*/
	val = ddi_get32(nvme->n_regh, (uint32_t *)(nvme->n_regs + reg));

	return (val);
}

static boolean_t
nvme_check_regs_hdl(nvme_t *nvme)
{
	ddi_fm_error_t error;

	ddi_fm_acc_err_get(nvme->n_regh, &error, DDI_FME_VERSION);

	if (error.fme_status != DDI_FM_OK)
		return (B_TRUE);

	return (B_FALSE);
}

static boolean_t
nvme_check_dma_hdl(nvme_dma_t *dma)
{
	ddi_fm_error_t error;

	if (dma == NULL)
		return (B_FALSE);

	ddi_fm_dma_err_get(dma->nd_dmah, &error, DDI_FME_VERSION);

	if (error.fme_status != DDI_FM_OK)
		return (B_TRUE);

	return (B_FALSE);
}

static void
nvme_free_dma_common(nvme_dma_t *dma)
{
	if (dma->nd_dmah != NULL)
		(void) ddi_dma_unbind_handle(dma->nd_dmah);
	if (dma->nd_acch != NULL)
		ddi_dma_mem_free(&dma->nd_acch);
	if (dma->nd_dmah != NULL)
		ddi_dma_free_handle(&dma->nd_dmah);
}

static void
nvme_free_dma(nvme_dma_t *dma)
{
	nvme_free_dma_common(dma);
	kmem_free(dma, sizeof (*dma));
}

static void
nvme_prp_dma_destructor(void *buf, void *private)
{
	nvme_dma_t *dma = (nvme_dma_t *)buf;

	nvme_free_dma_common(dma);
}

static int
nvme_alloc_dma_common(nvme_t *nvme, nvme_dma_t *dma,
    size_t len, uint_t flags, ddi_dma_attr_t *dma_attr)
{
	if (ddi_dma_alloc_handle(nvme->n_dip, dma_attr, DDI_DMA_SLEEP, NULL,
	    &dma->nd_dmah) != DDI_SUCCESS) {
		/*
		 * Due to DDI_DMA_SLEEP this can't be DDI_DMA_NORESOURCES, and
		 * the only other possible error is DDI_DMA_BADATTR which
		 * indicates a driver bug which should cause a panic.
		 */
		dev_err(nvme->n_dip, CE_PANIC,
		    "!failed to get DMA handle, check DMA attributes");
		return (DDI_FAILURE);
	}

	/*
	 * ddi_dma_mem_alloc() can only fail when DDI_DMA_NOSLEEP is specified
	 * or the flags are conflicting, which isn't the case here.
	 */
	(void) ddi_dma_mem_alloc(dma->nd_dmah, len, &nvme->n_reg_acc_attr,
	    DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL, &dma->nd_memp,
	    &dma->nd_len, &dma->nd_acch);

	if (ddi_dma_addr_bind_handle(dma->nd_dmah, NULL, dma->nd_memp,
	    dma->nd_len, flags | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
	    &dma->nd_cookie, &dma->nd_ncookie) != DDI_DMA_MAPPED) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to bind DMA memory");
		atomic_inc_32(&nvme->n_dma_bind_err);
		nvme_free_dma_common(dma);
		return (DDI_FAILURE);
	}

	return (DDI_SUCCESS);
}

static int
nvme_zalloc_dma(nvme_t *nvme, size_t len, uint_t flags,
    ddi_dma_attr_t *dma_attr, nvme_dma_t **ret)
{
	nvme_dma_t *dma = kmem_zalloc(sizeof (nvme_dma_t), KM_SLEEP);

	if (nvme_alloc_dma_common(nvme, dma, len, flags, dma_attr) !=
	    DDI_SUCCESS) {
		*ret = NULL;
		kmem_free(dma, sizeof (nvme_dma_t));
		return (DDI_FAILURE);
	}

	bzero(dma->nd_memp, dma->nd_len);

	*ret = dma;
	return (DDI_SUCCESS);
}

static int
nvme_prp_dma_constructor(void *buf, void *private, int flags)
{
	nvme_dma_t *dma = (nvme_dma_t *)buf;
	nvme_t *nvme = (nvme_t *)private;

	if (nvme_alloc_dma_common(nvme, dma, nvme->n_pagesize,
	    DDI_DMA_READ, &nvme->n_prp_dma_attr) != DDI_SUCCESS) {
		return (-1);
	}

	ASSERT(dma->nd_ncookie == 1);

	dma->nd_cached = B_TRUE;

	return (0);
}

static int
nvme_zalloc_queue_dma(nvme_t *nvme, uint32_t nentry, uint16_t qe_len,
    uint_t flags, nvme_dma_t **dma)
{
	uint32_t len = nentry * qe_len;
	ddi_dma_attr_t q_dma_attr = nvme->n_queue_dma_attr;

	len = roundup(len, nvme->n_pagesize);

	q_dma_attr.dma_attr_minxfer = len;

	if (nvme_zalloc_dma(nvme, len, flags, &q_dma_attr, dma)
	    != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to get DMA memory for queue");
		goto fail;
	}

	if ((*dma)->nd_ncookie != 1) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!got too many cookies for queue DMA");
		goto fail;
	}

	return (DDI_SUCCESS);

fail:
	if (*dma) {
		nvme_free_dma(*dma);
		*dma = NULL;
	}

	return (DDI_FAILURE);
}

static void
nvme_free_qpair(nvme_qpair_t *qp)
{
	int i;

	mutex_destroy(&qp->nq_mutex);

	if (qp->nq_sqdma != NULL)
		nvme_free_dma(qp->nq_sqdma);
	if (qp->nq_cqdma != NULL)
		nvme_free_dma(qp->nq_cqdma);

	if (qp->nq_active_cmds > 0)
		for (i = 0; i != qp->nq_nentry; i++)
			if (qp->nq_cmd[i] != NULL)
				nvme_free_cmd(qp->nq_cmd[i]);

	if (qp->nq_cmd != NULL)
		kmem_free(qp->nq_cmd, sizeof (nvme_cmd_t *) * qp->nq_nentry);

	kmem_free(qp, sizeof (nvme_qpair_t));
}

static int
nvme_alloc_qpair(nvme_t *nvme, uint32_t nentry, nvme_qpair_t **nqp,
    int idx)
{
	nvme_qpair_t *qp = kmem_zalloc(sizeof (*qp), KM_SLEEP);

	mutex_init(&qp->nq_mutex, NULL, MUTEX_DRIVER,
	    DDI_INTR_PRI(nvme->n_intr_pri));

	if (nvme_zalloc_queue_dma(nvme, nentry, sizeof (nvme_sqe_t),
	    DDI_DMA_WRITE, &qp->nq_sqdma) != DDI_SUCCESS)
		goto fail;

	if (nvme_zalloc_queue_dma(nvme, nentry, sizeof (nvme_cqe_t),
	    DDI_DMA_READ, &qp->nq_cqdma) != DDI_SUCCESS)
		goto fail;

	qp->nq_sq = (nvme_sqe_t *)qp->nq_sqdma->nd_memp;
	qp->nq_cq = (nvme_cqe_t *)qp->nq_cqdma->nd_memp;
	qp->nq_nentry = nentry;

	qp->nq_sqtdbl = NVME_REG_SQTDBL(nvme, idx);
	qp->nq_cqhdbl = NVME_REG_CQHDBL(nvme, idx);

	qp->nq_cmd = kmem_zalloc(sizeof (nvme_cmd_t *) * nentry, KM_SLEEP);

	*nqp = qp;
	return (DDI_SUCCESS);

fail:
	nvme_free_qpair(qp);
	*nqp = NULL;

	return (DDI_FAILURE);
}

static nvme_cmd_t *
nvme_alloc_cmd(nvme_t *nvme, int kmflag)
{
	nvme_cmd_t *cmd = kmem_cache_alloc(nvme_cmd_cache, kmflag);

	if (cmd == NULL)
		return (cmd);

	bzero(cmd, sizeof (nvme_cmd_t));

	cmd->nc_nvme = nvme;

	mutex_init(&cmd->nc_mutex, NULL, MUTEX_DRIVER,
	    DDI_INTR_PRI(nvme->n_intr_pri));
	cv_init(&cmd->nc_cv, NULL, CV_DRIVER, NULL);

	return (cmd);
}

static void
nvme_free_cmd(nvme_cmd_t *cmd)
{
	if (cmd->nc_dma) {
		if (cmd->nc_dma->nd_cached)
			kmem_cache_free(cmd->nc_nvme->n_prp_cache,
			    cmd->nc_dma);
		else
			nvme_free_dma(cmd->nc_dma);
		cmd->nc_dma = NULL;
	}

	cv_destroy(&cmd->nc_cv);
	mutex_destroy(&cmd->nc_mutex);

	kmem_cache_free(nvme_cmd_cache, cmd);
}

static int
nvme_submit_cmd(nvme_qpair_t *qp, nvme_cmd_t *cmd)
{
	nvme_reg_sqtdbl_t tail = { 0 };

	mutex_enter(&qp->nq_mutex);

	if (qp->nq_active_cmds == qp->nq_nentry) {
		mutex_exit(&qp->nq_mutex);
		return (DDI_FAILURE);
	}

	cmd->nc_completed = B_FALSE;

	/*
	 * Try to insert the cmd into the active cmd array at the nq_next_cmd
	 * slot. If the slot is already occupied advance to the next slot and
	 * try again. This can happen for long-running commands like async
	 * event requests.
	 */
	while (qp->nq_cmd[qp->nq_next_cmd] != NULL)
		qp->nq_next_cmd = (qp->nq_next_cmd + 1) % qp->nq_nentry;
	qp->nq_cmd[qp->nq_next_cmd] = cmd;

	qp->nq_active_cmds++;

	cmd->nc_sqe.sqe_cid = qp->nq_next_cmd;
	bcopy(&cmd->nc_sqe, &qp->nq_sq[qp->nq_sqtail], sizeof (nvme_sqe_t));
	(void) ddi_dma_sync(qp->nq_sqdma->nd_dmah,
	    sizeof (nvme_sqe_t) * qp->nq_sqtail,
	    sizeof (nvme_sqe_t), DDI_DMA_SYNC_FORDEV);
	qp->nq_next_cmd = (qp->nq_next_cmd + 1) % qp->nq_nentry;
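
	/*
	 * Advance the tail pointer and write it to the submission queue tail
	 * doorbell register to hand the new entry to the hardware.
	 */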
	tail.b.sqtdbl_sqt = qp->nq_sqtail = (qp->nq_sqtail + 1) % qp->nq_nentry;
	nvme_put32(cmd->nc_nvme, qp->nq_sqtdbl, tail.r);

	mutex_exit(&qp->nq_mutex);
	return (DDI_SUCCESS);
}

static nvme_cmd_t *
nvme_retrieve_cmd(nvme_t *nvme, nvme_qpair_t *qp)
{
	nvme_reg_cqhdbl_t head = { 0 };

	nvme_cqe_t *cqe;
	nvme_cmd_t *cmd;

	(void) ddi_dma_sync(qp->nq_cqdma->nd_dmah, 0,
	    sizeof (nvme_cqe_t) * qp->nq_nentry, DDI_DMA_SYNC_FORKERNEL);

	cqe = &qp->nq_cq[qp->nq_cqhead];

	/* Check phase tag of CQE. Hardware inverts it for new entries. */
	if (cqe->cqe_sf.sf_p == qp->nq_phase)
		return (NULL);

	ASSERT(nvme->n_ioq[cqe->cqe_sqid] == qp);
	ASSERT(cqe->cqe_cid < qp->nq_nentry);

	mutex_enter(&qp->nq_mutex);
	cmd = qp->nq_cmd[cqe->cqe_cid];
	qp->nq_cmd[cqe->cqe_cid] = NULL;
	qp->nq_active_cmds--;
	mutex_exit(&qp->nq_mutex);

	ASSERT(cmd->nc_nvme == nvme);
	ASSERT(cmd->nc_sqid == cqe->cqe_sqid);
	ASSERT(cmd->nc_sqe.sqe_cid == cqe->cqe_cid);
	bcopy(cqe, &cmd->nc_cqe, sizeof (nvme_cqe_t));
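
	/*
	 * The completion queue entry carries the current submission queue
	 * head pointer, telling us how far the hardware has progressed
	 * through the submission queue.
	 */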
	qp->nq_sqhead = cqe->cqe_sqhd;

	head.b.cqhdbl_cqh = qp->nq_cqhead = (qp->nq_cqhead + 1) % qp->nq_nentry;

	/* Toggle phase on wrap-around. */
	if (qp->nq_cqhead == 0)
		qp->nq_phase = qp->nq_phase ? 0 : 1;

	nvme_put32(cmd->nc_nvme, qp->nq_cqhdbl, head.r);

	return (cmd);
}

static int
nvme_check_unknown_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	dev_err(cmd->nc_nvme->n_dip, CE_WARN,
	    "!unknown command status received: opc = %x, sqid = %d, cid = %d, "
	    "sc = %x, sct = %x, dnr = %d, m = %d", cmd->nc_sqe.sqe_opc,
	    cqe->cqe_sqid, cqe->cqe_cid, cqe->cqe_sf.sf_sc, cqe->cqe_sf.sf_sct,
	    cqe->cqe_sf.sf_dnr, cqe->cqe_sf.sf_m);

	if (cmd->nc_xfer != NULL)
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);

	if (cmd->nc_nvme->n_strict_version) {
		cmd->nc_nvme->n_dead = B_TRUE;
		ddi_fm_service_impact(cmd->nc_nvme->n_dip, DDI_SERVICE_LOST);
	}

	return (EIO);
}

static int
nvme_check_vendor_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	dev_err(cmd->nc_nvme->n_dip, CE_WARN,
	    "!vendor specific command status received: opc = %x, sqid = %d, "
	    "cid = %d, sc = %x, sct = %x, dnr = %d, m = %d", cmd->nc_sqe.sqe_opc,
	    cqe->cqe_sqid, cqe->cqe_cid, cqe->cqe_sf.sf_sc, cqe->cqe_sf.sf_sct,
	    cqe->cqe_sf.sf_dnr, cqe->cqe_sf.sf_m);
	if (!cmd->nc_nvme->n_ignore_unknown_vendor_status) {
		cmd->nc_nvme->n_dead = B_TRUE;
		ddi_fm_service_impact(cmd->nc_nvme->n_dip, DDI_SERVICE_LOST);
	}

	return (EIO);
}

static int
nvme_check_integrity_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	switch (cqe->cqe_sf.sf_sc) {
	case NVME_CQE_SC_INT_NVM_WRITE:
		/* TODO: post ereport */
		bd_error(cmd->nc_xfer, BD_ERR_MEDIA);
		return (EIO);

	case NVME_CQE_SC_INT_NVM_READ:
		/* TODO: post ereport */
		bd_error(cmd->nc_xfer, BD_ERR_MEDIA);
		return (EIO);

	default:
		return (nvme_check_unknown_cmd_status(cmd));
	}
}

static int
nvme_check_generic_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	switch (cqe->cqe_sf.sf_sc) {
	case NVME_CQE_SC_GEN_SUCCESS:
		return (0);

	/*
	 * Errors indicating a bug in the driver should cause a panic.
	 */
	case NVME_CQE_SC_GEN_INV_OPC:
		/* Invalid Command Opcode */
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "invalid opcode in cmd %p", (void *)cmd);
		return (0);

	case NVME_CQE_SC_GEN_INV_FLD:
		/* Invalid Field in Command */
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "invalid field in cmd %p", (void *)cmd);
		return (0);

	case NVME_CQE_SC_GEN_ID_CNFL:
		/* Command ID Conflict */
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "cmd ID conflict in cmd %p", (void *)cmd);
		return (0);

	case NVME_CQE_SC_GEN_INV_NS:
		/* Invalid Namespace or Format */
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "invalid NS/format in cmd %p", (void *)cmd);
		return (0);

	case NVME_CQE_SC_GEN_NVM_LBA_RANGE:
		/* LBA Out Of Range */
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "LBA out of range in cmd %p", (void *)cmd);
		return (0);

	/*
	 * Non-fatal errors, handle gracefully.
	 */
	case NVME_CQE_SC_GEN_DATA_XFR_ERR:
		/* Data Transfer Error (DMA) */
		/* TODO: post ereport */
		atomic_inc_32(&cmd->nc_nvme->n_data_xfr_err);
		bd_error(cmd->nc_xfer, BD_ERR_NTRDY);
		return (EIO);

	case NVME_CQE_SC_GEN_INTERNAL_ERR:
		/*
		 * Internal Error. The spec (v1.0, section 4.5.1.2) says
		 * detailed error information is returned as async event,
		 * so we pretty much ignore the error here and handle it
		 * in the async event handler.
		 */
		atomic_inc_32(&cmd->nc_nvme->n_internal_err);
		bd_error(cmd->nc_xfer, BD_ERR_NTRDY);
		return (EIO);

	case NVME_CQE_SC_GEN_ABORT_REQUEST:
		/*
		 * Command Abort Requested. This normally happens only when a
		 * command times out.
		 */
		/* TODO: post ereport or change blkdev to handle this? */
		atomic_inc_32(&cmd->nc_nvme->n_abort_rq_err);
		return (ECANCELED);

	case NVME_CQE_SC_GEN_ABORT_PWRLOSS:
		/* Command Aborted due to Power Loss Notification */
		ddi_fm_service_impact(cmd->nc_nvme->n_dip, DDI_SERVICE_LOST);
		cmd->nc_nvme->n_dead = B_TRUE;
		return (EIO);

	case NVME_CQE_SC_GEN_ABORT_SQ_DEL:
		/* Command Aborted due to SQ Deletion */
		atomic_inc_32(&cmd->nc_nvme->n_abort_sq_del);
		return (ECANCELED);

	case NVME_CQE_SC_GEN_NVM_CAP_EXC:
		/* Capacity Exceeded */
		atomic_inc_32(&cmd->nc_nvme->n_nvm_cap_exc);
		bd_error(cmd->nc_xfer, BD_ERR_MEDIA);
		return (EIO);

	case NVME_CQE_SC_GEN_NVM_NS_NOTRDY:
		/* Namespace Not Ready */
		atomic_inc_32(&cmd->nc_nvme->n_nvm_ns_notrdy);
		bd_error(cmd->nc_xfer, BD_ERR_NTRDY);
		return (EIO);

	default:
		return (nvme_check_unknown_cmd_status(cmd));
	}
}

static int
nvme_check_specific_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	switch (cqe->cqe_sf.sf_sc) {
	case NVME_CQE_SC_SPC_INV_CQ:
		/* Completion Queue Invalid */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_SQUEUE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_cq_err);
		return (EINVAL);

	case NVME_CQE_SC_SPC_INV_QID:
		/* Invalid Queue Identifier */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_SQUEUE ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_DELETE_SQUEUE ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_CQUEUE ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_DELETE_CQUEUE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_qid_err);
		return (EINVAL);

	case NVME_CQE_SC_SPC_MAX_QSZ_EXC:
		/* Max Queue Size Exceeded */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_SQUEUE ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_CQUEUE);
		atomic_inc_32(&cmd->nc_nvme->n_max_qsz_exc);
		return (EINVAL);

	case NVME_CQE_SC_SPC_ABRT_CMD_EXC:
		/* Abort Command Limit Exceeded */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_ABORT);
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "abort command limit exceeded in cmd %p", (void *)cmd);
		return (0);

	case NVME_CQE_SC_SPC_ASYNC_EVREQ_EXC:
		/* Async Event Request Limit Exceeded */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_ASYNC_EVENT);
		dev_err(cmd->nc_nvme->n_dip, CE_PANIC, "programming error: "
		    "async event request limit exceeded in cmd %p",
		    (void *)cmd);
		return (0);

	case NVME_CQE_SC_SPC_INV_INT_VECT:
		/* Invalid Interrupt Vector */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_CREATE_CQUEUE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_int_vect);
		return (EINVAL);

	case NVME_CQE_SC_SPC_INV_LOG_PAGE:
		/* Invalid Log Page */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_GET_LOG_PAGE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_log_page);
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);
		return (EINVAL);

	case NVME_CQE_SC_SPC_INV_FORMAT:
		/* Invalid Format */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_FORMAT);
		atomic_inc_32(&cmd->nc_nvme->n_inv_format);
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);
		return (EINVAL);

	case NVME_CQE_SC_SPC_INV_Q_DEL:
		/* Invalid Queue Deletion */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_DELETE_CQUEUE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_q_del);
		return (EINVAL);

	case NVME_CQE_SC_SPC_NVM_CNFL_ATTR:
		/* Conflicting Attributes */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_DSET_MGMT ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_READ ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_WRITE);
		atomic_inc_32(&cmd->nc_nvme->n_cnfl_attr);
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);
		return (EINVAL);

	case NVME_CQE_SC_SPC_NVM_INV_PROT:
		/* Invalid Protection Information */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_COMPARE ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_READ ||
		    cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_WRITE);
		atomic_inc_32(&cmd->nc_nvme->n_inv_prot);
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);
		return (EINVAL);

	case NVME_CQE_SC_SPC_NVM_READONLY:
		/* Write to Read Only Range */
		ASSERT(cmd->nc_sqe.sqe_opc == NVME_OPC_NVM_WRITE);
		atomic_inc_32(&cmd->nc_nvme->n_readonly);
		bd_error(cmd->nc_xfer, BD_ERR_ILLRQ);
		return (EROFS);

	default:
		return (nvme_check_unknown_cmd_status(cmd));
	}
}

static inline int
nvme_check_cmd_status(nvme_cmd_t *cmd)
{
	nvme_cqe_t *cqe = &cmd->nc_cqe;

	/* take a shortcut if everything is alright */
	if (cqe->cqe_sf.sf_sct == NVME_CQE_SCT_GENERIC &&
	    cqe->cqe_sf.sf_sc == NVME_CQE_SC_GEN_SUCCESS)
		return (0);

	if (cqe->cqe_sf.sf_sct == NVME_CQE_SCT_GENERIC)
		return (nvme_check_generic_cmd_status(cmd));
	else if (cqe->cqe_sf.sf_sct == NVME_CQE_SCT_SPECIFIC)
		return (nvme_check_specific_cmd_status(cmd));
	else if (cqe->cqe_sf.sf_sct == NVME_CQE_SCT_INTEGRITY)
		return (nvme_check_integrity_cmd_status(cmd));
	else if (cqe->cqe_sf.sf_sct == NVME_CQE_SCT_VENDOR)
		return (nvme_check_vendor_cmd_status(cmd));

	return (nvme_check_unknown_cmd_status(cmd));
}

/*
 * nvme_abort_cmd_cb -- replaces nc_callback of aborted commands
 *
 * This function takes care of cleaning up aborted commands. The command
 * status is checked to catch any fatal errors.
 */
static void
nvme_abort_cmd_cb(void *arg)
{
	nvme_cmd_t *cmd = arg;

	/*
	 * Grab the command mutex. Once we have it we hold the last reference
	 * to the command and can safely free it.
	 */
	mutex_enter(&cmd->nc_mutex);
	(void) nvme_check_cmd_status(cmd);
	mutex_exit(&cmd->nc_mutex);

	nvme_free_cmd(cmd);
}

static void
nvme_abort_cmd(nvme_cmd_t *abort_cmd)
{
	nvme_t *nvme = abort_cmd->nc_nvme;
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	nvme_abort_cmd_t ac = { 0 };

	sema_p(&nvme->n_abort_sema);

	ac.b.ac_cid = abort_cmd->nc_sqe.sqe_cid;
	ac.b.ac_sqid = abort_cmd->nc_sqid;

	/*
	 * Drop the mutex of the aborted command. From this point on
	 * we must assume that the abort callback has freed the command.
	 */
	mutex_exit(&abort_cmd->nc_mutex);

	cmd->nc_sqid = 0;
	cmd->nc_sqe.sqe_opc = NVME_OPC_ABORT;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_cdw10 = ac.r;

	/*
	 * Send the ABORT to the hardware. The ABORT command will return
	 * _after_ the aborted command has completed (aborted or otherwise).
	 */
	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		sema_v(&nvme->n_abort_sema);
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for ABORT");
		atomic_inc_32(&nvme->n_abort_failed);
		return;
	}
	sema_v(&nvme->n_abort_sema);

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!ABORT failed with sct = %x, sc = %x",
		    cmd->nc_cqe.cqe_sf.sf_sct, cmd->nc_cqe.cqe_sf.sf_sc);
		atomic_inc_32(&nvme->n_abort_failed);
	} else {
		atomic_inc_32(&nvme->n_cmd_aborted);
	}

	nvme_free_cmd(cmd);
}

/*
 * nvme_wait_cmd -- wait for command completion or timeout
 *
 * Returns B_TRUE if the command completed normally.
 *
 * Returns B_FALSE if the command timed out and an abort was attempted. The
 * command mutex will be dropped and the command must be considered freed. The
 * freeing of the command is normally done by the abort command callback.
 *
 * In case of a serious error or a timeout of the abort command the hardware
 * will be declared dead and FMA will be notified.
 */
static boolean_t
nvme_wait_cmd(nvme_cmd_t *cmd, uint_t sec)
{
	clock_t timeout = ddi_get_lbolt() + drv_usectohz(sec * MICROSEC);
	nvme_t *nvme = cmd->nc_nvme;
	nvme_reg_csts_t csts;

	ASSERT(mutex_owned(&cmd->nc_mutex));
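
	/* cv_timedwait(9F) returns -1 when the timeout expires. */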
	while (!cmd->nc_completed) {
		if (cv_timedwait(&cmd->nc_cv, &cmd->nc_mutex, timeout) == -1)
			break;
	}

	if (cmd->nc_completed)
		return (B_TRUE);

	/*
	 * The command timed out. Change the callback to the cleanup function.
	 */
	cmd->nc_callback = nvme_abort_cmd_cb;

	/*
	 * Check controller for fatal status, any errors associated with the
	 * register or DMA handle, or for a double timeout (abort command
	 * timed out). If necessary log a warning and call FMA.
	 */
	csts.r = nvme_get32(nvme, NVME_REG_CSTS);
	dev_err(nvme->n_dip, CE_WARN, "!command timeout, "
	    "OPC = %x, CFS = %d", cmd->nc_sqe.sqe_opc, csts.b.csts_cfs);
	atomic_inc_32(&nvme->n_cmd_timeout);

	if (csts.b.csts_cfs ||
	    nvme_check_regs_hdl(nvme) ||
	    nvme_check_dma_hdl(cmd->nc_dma) ||
	    cmd->nc_sqe.sqe_opc == NVME_OPC_ABORT) {
		ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST);
		nvme->n_dead = B_TRUE;
		mutex_exit(&cmd->nc_mutex);
	} else {
		/*
		 * Try to abort the command. The command mutex is released by
		 * nvme_abort_cmd().
		 * If the abort succeeds it will have freed the aborted command.
		 * If the abort fails for other reasons we must assume that the
		 * command may complete at any time, and the callback will free
		 * it for us.
		 */
		nvme_abort_cmd(cmd);
	}

	return (B_FALSE);
}

static void
nvme_wakeup_cmd(void *arg)
{
	nvme_cmd_t *cmd = arg;

	mutex_enter(&cmd->nc_mutex);
	/*
	 * There is a slight chance that this command completed shortly after
	 * the timeout was hit in nvme_wait_cmd() but before the callback was
	 * changed. Catch that case here and clean up accordingly.
	 */
	if (cmd->nc_callback == nvme_abort_cmd_cb) {
		mutex_exit(&cmd->nc_mutex);
		nvme_abort_cmd_cb(cmd);
		return;
	}

	cmd->nc_completed = B_TRUE;
	cv_signal(&cmd->nc_cv);
	mutex_exit(&cmd->nc_mutex);
}

static void
nvme_async_event_task(void *arg)
{
	nvme_cmd_t *cmd = arg;
	nvme_t *nvme = cmd->nc_nvme;
	nvme_error_log_entry_t *error_log = NULL;
	nvme_health_log_t *health_log = NULL;
	nvme_async_event_t event;
	int ret;

	/*
	 * Check for errors associated with the async request itself. The only
	 * command-specific error is "async event limit exceeded", which
	 * indicates a programming error in the driver and causes a panic in
	 * nvme_check_cmd_status().
	 *
	 * Other possible errors are various scenarios where the async request
	 * was aborted, or internal errors in the device. Internal errors are
	 * reported to FMA, the command aborts need no special handling here.
	 */
	if (nvme_check_cmd_status(cmd)) {
		dev_err(cmd->nc_nvme->n_dip, CE_WARN,
		    "!async event request returned failure, sct = %x, "
		    "sc = %x, dnr = %d, m = %d", cmd->nc_cqe.cqe_sf.sf_sct,
		    cmd->nc_cqe.cqe_sf.sf_sc, cmd->nc_cqe.cqe_sf.sf_dnr,
		    cmd->nc_cqe.cqe_sf.sf_m);

		if (cmd->nc_cqe.cqe_sf.sf_sct == NVME_CQE_SCT_GENERIC &&
		    cmd->nc_cqe.cqe_sf.sf_sc == NVME_CQE_SC_GEN_INTERNAL_ERR) {
			cmd->nc_nvme->n_dead = B_TRUE;
			ddi_fm_service_impact(cmd->nc_nvme->n_dip,
			    DDI_SERVICE_LOST);
		}
		nvme_free_cmd(cmd);
		return;
	}

	event.r = cmd->nc_cqe.cqe_dw0;

	/* Clear CQE and re-submit the async request. */
	bzero(&cmd->nc_cqe, sizeof (nvme_cqe_t));
	ret = nvme_submit_cmd(nvme->n_adminq, cmd);

	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to resubmit async event request");
		atomic_inc_32(&nvme->n_async_resubmit_failed);
		nvme_free_cmd(cmd);
	}

	switch (event.b.ae_type) {
	case NVME_ASYNC_TYPE_ERROR:
		if (event.b.ae_logpage == NVME_LOGPAGE_ERROR) {
			error_log = (nvme_error_log_entry_t *)
			    nvme_get_logpage(nvme, event.b.ae_logpage);
		} else {
			dev_err(nvme->n_dip, CE_WARN, "!wrong logpage in "
			    "async event reply: %d", event.b.ae_logpage);
			atomic_inc_32(&nvme->n_wrong_logpage);
		}

		switch (event.b.ae_info) {
		case NVME_ASYNC_ERROR_INV_SQ:
			dev_err(nvme->n_dip, CE_PANIC, "programming error: "
			    "invalid submission queue");
			return;

		case NVME_ASYNC_ERROR_INV_DBL:
			dev_err(nvme->n_dip, CE_PANIC, "programming error: "
			    "invalid doorbell write value");
			return;

		case NVME_ASYNC_ERROR_DIAGFAIL:
			dev_err(nvme->n_dip, CE_WARN, "!diagnostic failure");
			ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST);
			nvme->n_dead = B_TRUE;
			atomic_inc_32(&nvme->n_diagfail_event);
			break;

		case NVME_ASYNC_ERROR_PERSISTENT:
			dev_err(nvme->n_dip, CE_WARN, "!persistent internal "
			    "device error");
			ddi_fm_service_impact(nvme->n_dip, DDI_SERVICE_LOST);
			nvme->n_dead = B_TRUE;
			atomic_inc_32(&nvme->n_persistent_event);
			break;

		case NVME_ASYNC_ERROR_TRANSIENT:
			dev_err(nvme->n_dip, CE_WARN, "!transient internal "
			    "device error");
			/* TODO: send ereport */
			atomic_inc_32(&nvme->n_transient_event);
			break;

		case NVME_ASYNC_ERROR_FW_LOAD:
			dev_err(nvme->n_dip, CE_WARN,
			    "!firmware image load error");
			atomic_inc_32(&nvme->n_fw_load_event);
			break;
		}
		break;

	case NVME_ASYNC_TYPE_HEALTH:
		if (event.b.ae_logpage == NVME_LOGPAGE_HEALTH) {
			health_log = (nvme_health_log_t *)
			    nvme_get_logpage(nvme, event.b.ae_logpage, -1);
		} else {
			dev_err(nvme->n_dip, CE_WARN, "!wrong logpage in "
			    "async event reply: %d", event.b.ae_logpage);
			atomic_inc_32(&nvme->n_wrong_logpage);
		}

		switch (event.b.ae_info) {
		case NVME_ASYNC_HEALTH_RELIABILITY:
			dev_err(nvme->n_dip, CE_WARN,
			    "!device reliability compromised");
			/* TODO: send ereport */
			atomic_inc_32(&nvme->n_reliability_event);
			break;

		case NVME_ASYNC_HEALTH_TEMPERATURE:
			dev_err(nvme->n_dip, CE_WARN,
			    "!temperature above threshold");
			/* TODO: send ereport */
			atomic_inc_32(&nvme->n_temperature_event);
			break;

		case NVME_ASYNC_HEALTH_SPARE:
			dev_err(nvme->n_dip, CE_WARN,
			    "!spare space below threshold");
			/* TODO: send ereport */
			atomic_inc_32(&nvme->n_spare_event);
			break;
		}
		break;

	case NVME_ASYNC_TYPE_VENDOR:
		dev_err(nvme->n_dip, CE_WARN, "!vendor specific async event "
		    "received, info = %x, logpage = %x", event.b.ae_info,
		    event.b.ae_logpage);
		atomic_inc_32(&nvme->n_vendor_event);
		break;

	default:
		dev_err(nvme->n_dip, CE_WARN, "!unknown async event received, "
		    "type = %x, info = %x, logpage = %x", event.b.ae_type,
		    event.b.ae_info, event.b.ae_logpage);
		atomic_inc_32(&nvme->n_unknown_event);
		break;
	}

	if (error_log)
		kmem_free(error_log, sizeof (nvme_error_log_entry_t) *
		    nvme->n_error_log_len);

	if (health_log)
		kmem_free(health_log, sizeof (nvme_health_log_t));
}

static int
nvme_admin_cmd(nvme_cmd_t *cmd, int sec)
{
	int ret;

	mutex_enter(&cmd->nc_mutex);
	ret = nvme_submit_cmd(cmd->nc_nvme->n_adminq, cmd);

	if (ret != DDI_SUCCESS) {
		mutex_exit(&cmd->nc_mutex);
		dev_err(cmd->nc_nvme->n_dip, CE_WARN,
		    "!nvme_submit_cmd failed");
		atomic_inc_32(&cmd->nc_nvme->n_admin_queue_full);
		nvme_free_cmd(cmd);
		return (DDI_FAILURE);
	}

	if (nvme_wait_cmd(cmd, sec) == B_FALSE) {
		/*
		 * The command timed out. An abort command was posted that
		 * will take care of the cleanup.
		 */
		return (DDI_FAILURE);
	}
	mutex_exit(&cmd->nc_mutex);

	return (DDI_SUCCESS);
}

static int
nvme_async_event(nvme_t *nvme)
{
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	int ret;

	cmd->nc_sqid = 0;
	cmd->nc_sqe.sqe_opc = NVME_OPC_ASYNC_EVENT;
	cmd->nc_callback = nvme_async_event_task;

	ret = nvme_submit_cmd(nvme->n_adminq, cmd);

	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_submit_cmd failed for ASYNCHRONOUS EVENT");
		nvme_free_cmd(cmd);
		return (DDI_FAILURE);
	}

	return (DDI_SUCCESS);
}

static void *
nvme_get_logpage(nvme_t *nvme, uint8_t logpage, ...)
{
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	void *buf = NULL;
	nvme_getlogpage_t getlogpage = { 0 };
	size_t bufsize;
	va_list ap;

	va_start(ap, logpage);

	cmd->nc_sqid = 0;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_opc = NVME_OPC_GET_LOG_PAGE;

	getlogpage.b.lp_lid = logpage;

	switch (logpage) {
	case NVME_LOGPAGE_ERROR:
		cmd->nc_sqe.sqe_nsid = (uint32_t)-1;
		bufsize = nvme->n_error_log_len *
		    sizeof (nvme_error_log_entry_t);
		break;

	case NVME_LOGPAGE_HEALTH:
		cmd->nc_sqe.sqe_nsid = va_arg(ap, uint32_t);
		bufsize = sizeof (nvme_health_log_t);
		break;

	case NVME_LOGPAGE_FWSLOT:
		cmd->nc_sqe.sqe_nsid = (uint32_t)-1;
		bufsize = sizeof (nvme_fwslot_log_t);
		break;

	default:
		dev_err(nvme->n_dip, CE_WARN, "!unknown log page requested: %d",
		    logpage);
		atomic_inc_32(&nvme->n_unknown_logpage);
		goto fail;
	}

	va_end(ap);
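
	/*
	 * The Number of Dwords (NUMD) field in the GET LOG PAGE command is
	 * zero-based: a value of n requests n + 1 dwords.
	 */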
	getlogpage.b.lp_numd = bufsize / sizeof (uint32_t) - 1;

	cmd->nc_sqe.sqe_cdw10 = getlogpage.r;

	if (nvme_zalloc_dma(nvme, (getlogpage.b.lp_numd + 1) *
	    sizeof (uint32_t),
	    DDI_DMA_READ, &nvme->n_prp_dma_attr, &cmd->nc_dma) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_zalloc_dma failed for GET LOG PAGE");
		goto fail;
	}

	if (cmd->nc_dma->nd_ncookie > 2) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!too many DMA cookies for GET LOG PAGE");
		atomic_inc_32(&nvme->n_too_many_cookies);
		goto fail;
	}

	cmd->nc_sqe.sqe_dptr.d_prp[0] = cmd->nc_dma->nd_cookie.dmac_laddress;
	if (cmd->nc_dma->nd_ncookie > 1) {
		ddi_dma_nextcookie(cmd->nc_dma->nd_dmah,
		    &cmd->nc_dma->nd_cookie);
		cmd->nc_sqe.sqe_dptr.d_prp[1] =
		    cmd->nc_dma->nd_cookie.dmac_laddress;
	}

	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for GET LOG PAGE");
		return (NULL);
	}

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!GET LOG PAGE failed with sct = %x, sc = %x",
		    cmd->nc_cqe.cqe_sf.sf_sct, cmd->nc_cqe.cqe_sf.sf_sc);
		goto fail;
	}

	buf = kmem_alloc(bufsize, KM_SLEEP);
	bcopy(cmd->nc_dma->nd_memp, buf, bufsize);

fail:
	nvme_free_cmd(cmd);

	return (buf);
}

static void *
nvme_identify(nvme_t *nvme, uint32_t nsid)
{
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	void *buf = NULL;

	cmd->nc_sqid = 0;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_opc = NVME_OPC_IDENTIFY;
	cmd->nc_sqe.sqe_nsid = nsid;
	cmd->nc_sqe.sqe_cdw10 = nsid ? NVME_IDENTIFY_NSID : NVME_IDENTIFY_CTRL;

	if (nvme_zalloc_dma(nvme, NVME_IDENTIFY_BUFSIZE, DDI_DMA_READ,
	    &nvme->n_prp_dma_attr, &cmd->nc_dma) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_zalloc_dma failed for IDENTIFY");
		goto fail;
	}

	if (cmd->nc_dma->nd_ncookie > 2) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!too many DMA cookies for IDENTIFY");
		atomic_inc_32(&nvme->n_too_many_cookies);
		goto fail;
	}

	cmd->nc_sqe.sqe_dptr.d_prp[0] = cmd->nc_dma->nd_cookie.dmac_laddress;
	if (cmd->nc_dma->nd_ncookie > 1) {
		ddi_dma_nextcookie(cmd->nc_dma->nd_dmah,
		    &cmd->nc_dma->nd_cookie);
		cmd->nc_sqe.sqe_dptr.d_prp[1] =
		    cmd->nc_dma->nd_cookie.dmac_laddress;
	}

	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for IDENTIFY");
		return (NULL);
	}

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!IDENTIFY failed with sct = %x, sc = %x",
		    cmd->nc_cqe.cqe_sf.sf_sct, cmd->nc_cqe.cqe_sf.sf_sc);
		goto fail;
	}

	buf = kmem_alloc(NVME_IDENTIFY_BUFSIZE, KM_SLEEP);
	bcopy(cmd->nc_dma->nd_memp, buf, NVME_IDENTIFY_BUFSIZE);

fail:
	nvme_free_cmd(cmd);

	return (buf);
}

static boolean_t
nvme_set_features(nvme_t *nvme, uint32_t nsid, uint8_t feature, uint32_t val,
    uint32_t *res)
{
	_NOTE(ARGUNUSED(nsid));
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	boolean_t ret = B_FALSE;

	ASSERT(res != NULL);

	cmd->nc_sqid = 0;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_opc = NVME_OPC_SET_FEATURES;
	cmd->nc_sqe.sqe_cdw10 = feature;
	cmd->nc_sqe.sqe_cdw11 = val;

	switch (feature) {
	case NVME_FEAT_WRITE_CACHE:
		if (!nvme->n_write_cache_present)
			goto fail;
		break;

	case NVME_FEAT_NQUEUES:
		break;

	default:
		goto fail;
	}

	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for SET FEATURES");
		return (ret);
	}

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!SET FEATURES %d failed with sct = %x, sc = %x",
		    feature, cmd->nc_cqe.cqe_sf.sf_sct,
		    cmd->nc_cqe.cqe_sf.sf_sc);
		goto fail;
	}

	*res = cmd->nc_cqe.cqe_dw0;
	ret = B_TRUE;

fail:
	nvme_free_cmd(cmd);
	return (ret);
}

static boolean_t
nvme_write_cache_set(nvme_t *nvme, boolean_t enable)
{
	nvme_write_cache_t nwc = { 0 };

	if (enable)
		nwc.b.wc_wce = 1;

	if (!nvme_set_features(nvme, 0, NVME_FEAT_WRITE_CACHE, nwc.r, &nwc.r))
		return (B_FALSE);

	return (B_TRUE);
}

static int
nvme_set_nqueues(nvme_t *nvme, uint16_t nqueues)
{
	nvme_nqueue_t nq = { 0 };

	nq.b.nq_nsq = nq.b.nq_ncq = nqueues - 1;

	if (!nvme_set_features(nvme, 0, NVME_FEAT_NQUEUES, nq.r, &nq.r)) {
		return (0);
	}

	/*
	 * Always use the same number of submission and completion queues, and
	 * never use more than the requested number of queues.
	 */
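	/*
	 * For example (illustrative): if 8 queue pairs are requested but the
	 * controller grants 4 submission and 6 completion queues (returned
	 * 0-based as 3 and 5), this evaluates to MIN(8, MIN(3, 5) + 1) == 4.
	 */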
	return (MIN(nqueues, MIN(nq.b.nq_nsq, nq.b.nq_ncq) + 1));
}

static int
nvme_create_io_qpair(nvme_t *nvme, nvme_qpair_t *qp, uint16_t idx)
{
	nvme_cmd_t *cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	nvme_create_queue_dw10_t dw10 = { 0 };
	nvme_create_cq_dw11_t c_dw11 = { 0 };
	nvme_create_sq_dw11_t s_dw11 = { 0 };

	dw10.b.q_qid = idx;
	dw10.b.q_qsize = qp->nq_nentry - 1;

	c_dw11.b.cq_pc = 1;
	c_dw11.b.cq_ien = 1;
	c_dw11.b.cq_iv = idx % nvme->n_intr_cnt;

	cmd->nc_sqid = 0;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_opc = NVME_OPC_CREATE_CQUEUE;
	cmd->nc_sqe.sqe_cdw10 = dw10.r;
	cmd->nc_sqe.sqe_cdw11 = c_dw11.r;
	cmd->nc_sqe.sqe_dptr.d_prp[0] = qp->nq_cqdma->nd_cookie.dmac_laddress;

	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for CREATE CQUEUE");
		return (DDI_FAILURE);
	}

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!CREATE CQUEUE failed with sct = %x, sc = %x",
		    cmd->nc_cqe.cqe_sf.sf_sct, cmd->nc_cqe.cqe_sf.sf_sc);
		nvme_free_cmd(cmd);
		return (DDI_FAILURE);
	}

	nvme_free_cmd(cmd);

	s_dw11.b.sq_pc = 1;
	s_dw11.b.sq_cqid = idx;

	cmd = nvme_alloc_cmd(nvme, KM_SLEEP);
	cmd->nc_sqid = 0;
	cmd->nc_callback = nvme_wakeup_cmd;
	cmd->nc_sqe.sqe_opc = NVME_OPC_CREATE_SQUEUE;
	cmd->nc_sqe.sqe_cdw10 = dw10.r;
	cmd->nc_sqe.sqe_cdw11 = s_dw11.r;
	cmd->nc_sqe.sqe_dptr.d_prp[0] = qp->nq_sqdma->nd_cookie.dmac_laddress;

	if (nvme_admin_cmd(cmd, nvme_admin_cmd_timeout) != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!nvme_admin_cmd failed for CREATE SQUEUE");
		return (DDI_FAILURE);
	}

	if (nvme_check_cmd_status(cmd)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!CREATE SQUEUE failed with sct = %x, sc = %x",
		    cmd->nc_cqe.cqe_sf.sf_sct, cmd->nc_cqe.cqe_sf.sf_sc);
		nvme_free_cmd(cmd);
		return (DDI_FAILURE);
	}

	nvme_free_cmd(cmd);

	return (DDI_SUCCESS);
}

static boolean_t
nvme_reset(nvme_t *nvme, boolean_t quiesce)
{
	nvme_reg_csts_t csts;
	int i;

	nvme_put32(nvme, NVME_REG_CC, 0);

	csts.r = nvme_get32(nvme, NVME_REG_CSTS);
	if (csts.b.csts_rdy == 1) {
		nvme_put32(nvme, NVME_REG_CC, 0);
		for (i = 0; i != nvme->n_timeout * 10; i++) {
			csts.r = nvme_get32(nvme, NVME_REG_CSTS);
			if (csts.b.csts_rdy == 0)
				break;

			if (quiesce)
				drv_usecwait(50000);
			else
				delay(drv_usectohz(50000));
		}
	}

	nvme_put32(nvme, NVME_REG_AQA, 0);
	nvme_put32(nvme, NVME_REG_ASQ, 0);
	nvme_put32(nvme, NVME_REG_ACQ, 0);

	csts.r = nvme_get32(nvme, NVME_REG_CSTS);
	return (csts.b.csts_rdy == 0 ? B_TRUE : B_FALSE);
}

static void
nvme_shutdown(nvme_t *nvme, int mode, boolean_t quiesce)
{
	nvme_reg_cc_t cc;
	nvme_reg_csts_t csts;
	int i;

	ASSERT(mode == NVME_CC_SHN_NORMAL || mode == NVME_CC_SHN_ABRUPT);

	cc.r = nvme_get32(nvme, NVME_REG_CC);
	cc.b.cc_shn = mode & 0x3;
	nvme_put32(nvme, NVME_REG_CC, cc.r);

	for (i = 0; i != 10; i++) {
		csts.r = nvme_get32(nvme, NVME_REG_CSTS);
		if (csts.b.csts_shst == NVME_CSTS_SHN_COMPLETE)
			break;

		if (quiesce)
			drv_usecwait(100000);
		else
			delay(drv_usectohz(100000));
	}
}

static void
nvme_prepare_devid(nvme_t *nvme, uint32_t nsid)
{
	char model[sizeof (nvme->n_idctl->id_model) + 1];
	char serial[sizeof (nvme->n_idctl->id_serial) + 1];

	bcopy(nvme->n_idctl->id_model, model, sizeof (nvme->n_idctl->id_model));
	bcopy(nvme->n_idctl->id_serial, serial,
	    sizeof (nvme->n_idctl->id_serial));

	model[sizeof (nvme->n_idctl->id_model)] = '\0';
	serial[sizeof (nvme->n_idctl->id_serial)] = '\0';

	(void) snprintf(nvme->n_ns[nsid - 1].ns_devid,
	    sizeof (nvme->n_ns[0].ns_devid), "%4X-%s-%s-%X",
	    nvme->n_idctl->id_vid, model, serial, nsid);
}

static int
nvme_init(nvme_t *nvme)
{
	nvme_reg_cc_t cc = { 0 };
	nvme_reg_aqa_t aqa = { 0 };
	nvme_reg_asq_t asq = { 0 };
	nvme_reg_acq_t acq = { 0 };
	nvme_reg_cap_t cap;
	nvme_reg_vs_t vs;
	nvme_reg_csts_t csts;
	int i;
	char model[sizeof (nvme->n_idctl->id_model) + 1];
	char *vendor, *product;

	/* Check controller version */
	vs.r = nvme_get32(nvme, NVME_REG_VS);
	dev_err(nvme->n_dip, CE_CONT, "?NVMe spec version %d.%d",
	    vs.b.vs_mjr, vs.b.vs_mnr);

	if (nvme_version_major < vs.b.vs_mjr ||
	    (nvme_version_major == vs.b.vs_mjr &&
	    nvme_version_minor < vs.b.vs_mnr)) {
		dev_err(nvme->n_dip, CE_WARN, "!no support for version > %d.%d",
		    nvme_version_major, nvme_version_minor);
		if (nvme->n_strict_version)
			goto fail;
	}

	/* retrieve controller configuration */
	cap.r = nvme_get64(nvme, NVME_REG_CAP);

	if ((cap.b.cap_css & NVME_CAP_CSS_NVM) == 0) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!NVM command set not supported by hardware");
		goto fail;
	}

	nvme->n_nssr_supported = cap.b.cap_nssrs;
	nvme->n_doorbell_stride = 4 << cap.b.cap_dstrd;
	nvme->n_timeout = cap.b.cap_to;
	nvme->n_arbitration_mechanisms = cap.b.cap_ams;
	nvme->n_cont_queues_reqd = cap.b.cap_cqr;
	nvme->n_max_queue_entries = cap.b.cap_mqes + 1;

	/*
	 * The MPSMIN and MPSMAX fields in the CAP register use 0 to specify
	 * the base page size of 4k (1<<12), so add 12 here to get the real
	 * page size.
	 */
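	/*
	 * For example (illustrative): a device reporting MPSMIN == 0 and
	 * MPSMAX == 0 supports only 4k pages, so on x86 (PAGESHIFT == 12)
	 * this yields n_pageshift == 12 and n_pagesize == 4096.
	 */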
1896 nvme
->n_pageshift
= MIN(MAX(cap
.b
.cap_mpsmin
+ 12, PAGESHIFT
),
1897 cap
.b
.cap_mpsmax
+ 12);
1898 nvme
->n_pagesize
= 1UL << (nvme
->n_pageshift
);
1901 * Set up Queue DMA to transfer at least 1 page-aligned page at a time.
1903 nvme
->n_queue_dma_attr
.dma_attr_align
= nvme
->n_pagesize
;
1904 nvme
->n_queue_dma_attr
.dma_attr_minxfer
= nvme
->n_pagesize
;
1907 * Set up PRP DMA to transfer 1 page-aligned page at a time.
1908 * Maxxfer may be increased after we identified the controller limits.
1910 nvme
->n_prp_dma_attr
.dma_attr_maxxfer
= nvme
->n_pagesize
;
1911 nvme
->n_prp_dma_attr
.dma_attr_minxfer
= nvme
->n_pagesize
;
1912 nvme
->n_prp_dma_attr
.dma_attr_align
= nvme
->n_pagesize
;
1913 nvme
->n_prp_dma_attr
.dma_attr_seg
= nvme
->n_pagesize
- 1;
1916 * Reset controller if it's still in ready state.
1918 if (nvme_reset(nvme
, B_FALSE
) == B_FALSE
) {
1919 dev_err(nvme
->n_dip
, CE_WARN
, "!unable to reset controller");
1920 ddi_fm_service_impact(nvme
->n_dip
, DDI_SERVICE_LOST
);
1921 nvme
->n_dead
= B_TRUE
;
1926 * Create the admin queue pair.
1928 if (nvme_alloc_qpair(nvme
, nvme
->n_admin_queue_len
, &nvme
->n_adminq
, 0)
1930 dev_err(nvme
->n_dip
, CE_WARN
,
1931 "!unable to allocate admin qpair");
1934 nvme
->n_ioq
= kmem_alloc(sizeof (nvme_qpair_t
*), KM_SLEEP
);
1935 nvme
->n_ioq
[0] = nvme
->n_adminq
;
1937 nvme
->n_progress
|= NVME_ADMIN_QUEUE
;
1939 (void) ddi_prop_update_int(DDI_DEV_T_NONE
, nvme
->n_dip
,
1940 "admin-queue-len", nvme
->n_admin_queue_len
);
1942 aqa
.b
.aqa_asqs
= aqa
.b
.aqa_acqs
= nvme
->n_admin_queue_len
- 1;
1943 asq
= nvme
->n_adminq
->nq_sqdma
->nd_cookie
.dmac_laddress
;
1944 acq
= nvme
->n_adminq
->nq_cqdma
->nd_cookie
.dmac_laddress
;
1946 ASSERT((asq
& (nvme
->n_pagesize
- 1)) == 0);
1947 ASSERT((acq
& (nvme
->n_pagesize
- 1)) == 0);
1949 nvme_put32(nvme
, NVME_REG_AQA
, aqa
.r
);
1950 nvme_put64(nvme
, NVME_REG_ASQ
, asq
);
1951 nvme_put64(nvme
, NVME_REG_ACQ
, acq
);
1953 cc
.b
.cc_ams
= 0; /* use Round-Robin arbitration */
1954 cc
.b
.cc_css
= 0; /* use NVM command set */
1955 cc
.b
.cc_mps
= nvme
->n_pageshift
- 12;
1956 cc
.b
.cc_shn
= 0; /* no shutdown in progress */
1957 cc
.b
.cc_en
= 1; /* enable controller */
1958 cc
.b
.cc_iosqes
= 6; /* submission queue entry is 2^6 bytes long */
1959 cc
.b
.cc_iocqes
= 4; /* completion queue entry is 2^4 bytes long */
1961 nvme_put32(nvme
, NVME_REG_CC
, cc
.r
);
1964 * Wait for the controller to become ready.
1966 csts
.r
= nvme_get32(nvme
, NVME_REG_CSTS
);
1967 if (csts
.b
.csts_rdy
== 0) {
1968 for (i
= 0; i
!= nvme
->n_timeout
* 10; i
++) {
1969 delay(drv_usectohz(50000));
1970 csts
.r
= nvme_get32(nvme
, NVME_REG_CSTS
);
1972 if (csts
.b
.csts_cfs
== 1) {
1973 dev_err(nvme
->n_dip
, CE_WARN
,
1974 "!controller fatal status at init");
1975 ddi_fm_service_impact(nvme
->n_dip
,
1977 nvme
->n_dead
= B_TRUE
;
1981 if (csts
.b
.csts_rdy
== 1)
1986 if (csts
.b
.csts_rdy
== 0) {
1987 dev_err(nvme
->n_dip
, CE_WARN
, "!controller not ready");
1988 ddi_fm_service_impact(nvme
->n_dip
, DDI_SERVICE_LOST
);
1989 nvme
->n_dead
= B_TRUE
;
1994 * Assume an abort command limit of 1. We'll destroy and re-init
1995 * that later when we know the true abort command limit.
1997 sema_init(&nvme
->n_abort_sema
, 1, NULL
, SEMA_DRIVER
, NULL
);
2000 * Setup initial interrupt for admin queue.
2002 if ((nvme_setup_interrupts(nvme
, DDI_INTR_TYPE_MSIX
, 1)
2004 (nvme_setup_interrupts(nvme
, DDI_INTR_TYPE_MSI
, 1)
2006 (nvme_setup_interrupts(nvme
, DDI_INTR_TYPE_FIXED
, 1)
2008 dev_err(nvme
->n_dip
, CE_WARN
,
2009 "!failed to setup initial interrupt");
2014 * Post an asynchronous event command to catch errors.
2016 if (nvme_async_event(nvme
) != DDI_SUCCESS
) {
2017 dev_err(nvme
->n_dip
, CE_WARN
,
2018 "!failed to post async event");
	/*
	 * Identify Controller
	 */
	nvme->n_idctl = nvme_identify(nvme, 0);
	if (nvme->n_idctl == NULL) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to identify controller");
		goto fail;
	}
	/*
	 * Get Vendor & Product ID
	 */
	bcopy(nvme->n_idctl->id_model, model, sizeof (nvme->n_idctl->id_model));
	model[sizeof (nvme->n_idctl->id_model)] = '\0';
	sata_split_model(model, &vendor, &product);

	if (vendor == NULL)
		nvme->n_vendor = strdup("NVMe");
	else
		nvme->n_vendor = strdup(vendor);

	nvme->n_product = strdup(product);
	/*
	 * Get controller limits.
	 */
	nvme->n_async_event_limit = MAX(NVME_MIN_ASYNC_EVENT_LIMIT,
	    MIN(nvme->n_admin_queue_len / 10,
	    MIN(nvme->n_idctl->id_aerl + 1, nvme->n_async_event_limit)));

	(void) ddi_prop_update_int(DDI_DEV_T_NONE, nvme->n_dip,
	    "async-event-limit", nvme->n_async_event_limit);

	nvme->n_abort_command_limit = nvme->n_idctl->id_acl + 1;

	/*
	 * Reinitialize the semaphore with the true abort command limit
	 * supported by the hardware. It's not necessary to disable interrupts
	 * as only command aborts use the semaphore, and no commands are
	 * executed or aborted while we're here.
	 */
	sema_destroy(&nvme->n_abort_sema);
	sema_init(&nvme->n_abort_sema, nvme->n_abort_command_limit - 1, NULL,
	    SEMA_DRIVER, NULL);

	nvme->n_progress |= NVME_CTRL_LIMITS;
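	/*
	 * MDTS is reported as a power of two in units of the minimum memory
	 * page size. A value of 0 means no limit, which is capped here at
	 * 65536 pages.
	 */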
	if (nvme->n_idctl->id_mdts == 0)
		nvme->n_max_data_transfer_size = nvme->n_pagesize * 65536;
	else
		nvme->n_max_data_transfer_size =
		    1ull << (nvme->n_pageshift + nvme->n_idctl->id_mdts);

	nvme->n_error_log_len = nvme->n_idctl->id_elpe + 1;
	/*
	 * Limit n_max_data_transfer_size to what we can handle in one PRP.
	 * Chained PRPs are currently unsupported.
	 *
	 * This is a no-op on hardware which doesn't support a transfer size
	 * big enough to require chained PRPs.
	 */
	nvme->n_max_data_transfer_size = MIN(nvme->n_max_data_transfer_size,
	    (nvme->n_pagesize / sizeof (uint64_t) * nvme->n_pagesize));

	nvme->n_prp_dma_attr.dma_attr_maxxfer = nvme->n_max_data_transfer_size;
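	/*
	 * Example: with 4k pages a PRP list page holds 4096 / 8 = 512 64-bit
	 * entries, so the cap above works out to 512 * 4k = 2MB per transfer.
	 */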
	/*
	 * Make sure the minimum/maximum queue entry sizes are not
	 * larger/smaller than the default.
	 */
	if (((1 << nvme->n_idctl->id_sqes.qes_min) > sizeof (nvme_sqe_t)) ||
	    ((1 << nvme->n_idctl->id_sqes.qes_max) < sizeof (nvme_sqe_t)) ||
	    ((1 << nvme->n_idctl->id_cqes.qes_min) > sizeof (nvme_cqe_t)) ||
	    ((1 << nvme->n_idctl->id_cqes.qes_max) < sizeof (nvme_cqe_t)))
		goto fail;
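	/*
	 * nvme_write_cache_set() below is presumed to wrap the Set Features
	 * command for the Volatile Write Cache feature (feature ID 0x06).
	 */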
	/*
	 * Check for the presence of a Volatile Write Cache. If present,
	 * enable or disable based on the value of the property
	 * volatile-write-cache-enable (default is enabled).
	 */
	nvme->n_write_cache_present =
	    nvme->n_idctl->id_vwc.vwc_present == 0 ? B_FALSE : B_TRUE;

	(void) ddi_prop_update_int(DDI_DEV_T_NONE, nvme->n_dip,
	    "volatile-write-cache-present",
	    nvme->n_write_cache_present ? 1 : 0);

	if (!nvme->n_write_cache_present) {
		nvme->n_write_cache_enabled = B_FALSE;
	} else if (!nvme_write_cache_set(nvme, nvme->n_write_cache_enabled)) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to %sable volatile write cache",
		    nvme->n_write_cache_enabled ? "en" : "dis");

		/*
		 * Assume the cache is (still) enabled.
		 */
		nvme->n_write_cache_enabled = B_TRUE;
	}

	(void) ddi_prop_update_int(DDI_DEV_T_NONE, nvme->n_dip,
	    "volatile-write-cache-enable",
	    nvme->n_write_cache_enabled ? 1 : 0);
	/*
	 * Grab a copy of all mandatory log pages.
	 *
	 * TODO: should go away once user space tool exists to print logs
	 */
	nvme->n_error_log = (nvme_error_log_entry_t *)
	    nvme_get_logpage(nvme, NVME_LOGPAGE_ERROR);
	nvme->n_health_log = (nvme_health_log_t *)
	    nvme_get_logpage(nvme, NVME_LOGPAGE_HEALTH, -1);
	nvme->n_fwslot_log = (nvme_fwslot_log_t *)
	    nvme_get_logpage(nvme, NVME_LOGPAGE_FWSLOT);
	/*
	 * Identify Namespaces
	 */
	nvme->n_namespace_count = nvme->n_idctl->id_nn;
	nvme->n_ns = kmem_zalloc(sizeof (nvme_namespace_t) *
	    nvme->n_namespace_count, KM_SLEEP);
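	/*
	 * Namespace IDs are 1-based, hence the "i + 1" passed to
	 * nvme_identify() and used in the messages below, while the n_ns
	 * array is indexed from 0.
	 */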
	for (i = 0; i != nvme->n_namespace_count; i++) {
		nvme_identify_nsid_t *idns;
		int last_rp;

		nvme->n_ns[i].ns_nvme = nvme;
		nvme->n_ns[i].ns_idns = idns = nvme_identify(nvme, i + 1);

		if (idns == NULL) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!failed to identify namespace %d", i + 1);
			goto fail;
		}

		nvme->n_ns[i].ns_id = i + 1;
		nvme->n_ns[i].ns_block_count = idns->id_nsize;
		nvme->n_ns[i].ns_block_size =
		    1 << idns->id_lbaf[idns->id_flbas.lba_format].lbaf_lbads;
		nvme->n_ns[i].ns_best_block_size = nvme->n_ns[i].ns_block_size;

		nvme_prepare_devid(nvme, nvme->n_ns[i].ns_id);
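		/*
		 * lbaf_lbads is the log2 of the LBA data size; an LBADS of 0
		 * marks an unused format entry and ends the search below.
		 */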
		/*
		 * Find the LBA format with no metadata and the best relative
		 * performance. A value of 3 means "degraded", 0 is best.
		 */
		last_rp = 3;
		for (int j = 0; j <= idns->id_nlbaf; j++) {
			if (idns->id_lbaf[j].lbaf_lbads == 0)
				break;
			if (idns->id_lbaf[j].lbaf_ms != 0)
				continue;
			if (idns->id_lbaf[j].lbaf_rp >= last_rp)
				continue;
			last_rp = idns->id_lbaf[j].lbaf_rp;
			nvme->n_ns[i].ns_best_block_size =
			    1 << idns->id_lbaf[j].lbaf_lbads;
		}

		if (nvme->n_ns[i].ns_best_block_size < nvme->n_min_block_size)
			nvme->n_ns[i].ns_best_block_size =
			    nvme->n_min_block_size;
		/*
		 * We currently don't support namespaces that use either:
		 * - thin provisioning
		 * - protection information
		 */
		if (idns->id_nsfeat.f_thin ||
		    idns->id_dps.dp_pinfo) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!ignoring namespace %d, unsupported features: "
			    "thin = %d, pinfo = %d", i + 1,
			    idns->id_nsfeat.f_thin, idns->id_dps.dp_pinfo);
			nvme->n_ns[i].ns_ignore = B_TRUE;
		}
	}
	/*
	 * Try to set up MSI/MSI-X interrupts.
	 */
	if ((nvme->n_intr_types & (DDI_INTR_TYPE_MSI | DDI_INTR_TYPE_MSIX))
	    != 0) {
		nvme_release_interrupts(nvme);

		nqueues = MIN(UINT16_MAX, ncpus);

		if ((nvme_setup_interrupts(nvme, DDI_INTR_TYPE_MSIX,
		    nqueues) != DDI_SUCCESS) &&
		    (nvme_setup_interrupts(nvme, DDI_INTR_TYPE_MSI,
		    nqueues) != DDI_SUCCESS)) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!failed to setup MSI/MSI-X interrupts");
			goto fail;
		}
	}

	nqueues = nvme->n_intr_cnt;
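	/*
	 * nvme_set_nqueues() negotiates the I/O queue count with the
	 * controller, presumably via the Set Features (Number of Queues)
	 * command; the controller may grant fewer queues than requested,
	 * which is handled further below.
	 */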
	/*
	 * Create I/O queue pairs.
	 */
	nvme->n_ioq_count = nvme_set_nqueues(nvme, nqueues);
	if (nvme->n_ioq_count == 0) {
		dev_err(nvme->n_dip, CE_WARN,
		    "!failed to set number of I/O queues to %d", nqueues);
		goto fail;
	}
	/*
	 * Reallocate I/O queue array
	 */
	kmem_free(nvme->n_ioq, sizeof (nvme_qpair_t *));
	nvme->n_ioq = kmem_zalloc(sizeof (nvme_qpair_t *) *
	    (nvme->n_ioq_count + 1), KM_SLEEP);
	nvme->n_ioq[0] = nvme->n_adminq;
	/*
	 * If we got fewer queues than we asked for we might as well give
	 * some of the interrupt vectors back to the system.
	 */
	if (nvme->n_ioq_count < nqueues) {
		nvme_release_interrupts(nvme);

		if (nvme_setup_interrupts(nvme, nvme->n_intr_type,
		    nvme->n_ioq_count) != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!failed to reduce number of interrupts");
			goto fail;
		}
	}
	/*
	 * Alloc & register I/O queue pairs
	 */
	nvme->n_io_queue_len =
	    MIN(nvme->n_io_queue_len, nvme->n_max_queue_entries);
	(void) ddi_prop_update_int(DDI_DEV_T_NONE, nvme->n_dip, "io-queue-len",
	    nvme->n_io_queue_len);
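	/*
	 * Queue pair 0 is the admin queue, so I/O queue pair IDs start at 1;
	 * this is why n_ioq was sized n_ioq_count + 1 and the loop below
	 * starts at index 1.
	 */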
	for (i = 1; i != nvme->n_ioq_count + 1; i++) {
		if (nvme_alloc_qpair(nvme, nvme->n_io_queue_len,
		    &nvme->n_ioq[i], i) != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!unable to allocate I/O qpair %d", i);
			goto fail;
		}

		if (nvme_create_io_qpair(nvme, nvme->n_ioq[i], i)
		    != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!unable to create I/O qpair %d", i);
			goto fail;
		}
	}
	/*
	 * Post more asynchronous event commands to reduce event reporting
	 * latency as suggested by the spec.
	 */
	for (i = 1; i != nvme->n_async_event_limit; i++) {
		if (nvme_async_event(nvme) != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!failed to post async event %d", i);
			goto fail;
		}
	}

	return (DDI_SUCCESS);

fail:
	(void) nvme_reset(nvme, B_FALSE);
	return (DDI_FAILURE);
}
static uint_t
nvme_intr(caddr_t arg1, caddr_t arg2)
{
	/*LINTED: E_PTR_BAD_CAST_ALIGN*/
	nvme_t *nvme = (nvme_t *)arg1;
	int inum = (int)(uintptr_t)arg2;
	int ccnt = 0;
	int qnum;
	nvme_cmd_t *cmd;

	if (inum >= nvme->n_intr_cnt)
		return (DDI_INTR_UNCLAIMED);

	/*
	 * The interrupt vector a queue uses is calculated as queue_idx %
	 * intr_cnt in nvme_create_io_qpair(). Iterate through the queue array
	 * in steps of n_intr_cnt to process all queues using this vector.
	 */
	for (qnum = inum;
	    qnum < nvme->n_ioq_count + 1 && nvme->n_ioq[qnum] != NULL;
	    qnum += nvme->n_intr_cnt) {
		while ((cmd = nvme_retrieve_cmd(nvme, nvme->n_ioq[qnum]))) {
			taskq_dispatch_ent((taskq_t *)cmd->nc_nvme->n_cmd_taskq,
			    cmd->nc_callback, cmd, TQ_NOSLEEP, &cmd->nc_tqent);
			ccnt++;
		}
	}

	return (ccnt > 0 ? DDI_INTR_CLAIMED : DDI_INTR_UNCLAIMED);
}
static void
nvme_release_interrupts(nvme_t *nvme)
{
	int i;

	for (i = 0; i < nvme->n_intr_cnt; i++) {
		if (nvme->n_inth[i] == NULL)
			break;

		if (nvme->n_intr_cap & DDI_INTR_FLAG_BLOCK)
			(void) ddi_intr_block_disable(&nvme->n_inth[i], 1);
		else
			(void) ddi_intr_disable(nvme->n_inth[i]);

		(void) ddi_intr_remove_handler(nvme->n_inth[i]);
		(void) ddi_intr_free(nvme->n_inth[i]);
	}

	kmem_free(nvme->n_inth, nvme->n_inth_sz);
	nvme->n_inth = NULL;
	nvme->n_inth_sz = 0;

	nvme->n_progress &= ~NVME_INTERRUPTS;
}
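/*
 * nvme_setup_interrupts() follows the usual DDI sequence: query the
 * supported interrupt types, size the allocation with
 * ddi_intr_get_nintrs()/ddi_intr_get_navail(), allocate handles with
 * ddi_intr_alloc(), add the handler for each vector, and finally enable
 * them (using block enable where DDI_INTR_FLAG_BLOCK is set).
 */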
static int
nvme_setup_interrupts(nvme_t *nvme, int intr_type, int nqpairs)
{
	int nintrs, navail, count;
	int ret;
	int i;

	if (nvme->n_intr_types == 0) {
		ret = ddi_intr_get_supported_types(nvme->n_dip,
		    &nvme->n_intr_types);
		if (ret != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!%s: ddi_intr_get_supported_types failed",
			    __func__);
			return (ret);
		}
	}

	if ((nvme->n_intr_types & intr_type) == 0)
		return (DDI_FAILURE);

	ret = ddi_intr_get_nintrs(nvme->n_dip, intr_type, &nintrs);
	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN, "!%s: ddi_intr_get_nintrs failed",
		    __func__);
		return (ret);
	}

	ret = ddi_intr_get_navail(nvme->n_dip, intr_type, &navail);
	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN, "!%s: ddi_intr_get_navail failed",
		    __func__);
		return (ret);
	}

	/* We want at most one interrupt per queue pair. */
	if (navail > nqpairs)
		navail = nqpairs;

	nvme->n_inth_sz = sizeof (ddi_intr_handle_t) * navail;
	nvme->n_inth = kmem_zalloc(nvme->n_inth_sz, KM_SLEEP);

	ret = ddi_intr_alloc(nvme->n_dip, nvme->n_inth, intr_type, 0, navail,
	    &count, 0);
	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN, "!%s: ddi_intr_alloc failed",
		    __func__);
		goto fail;
	}

	nvme->n_intr_cnt = count;

	ret = ddi_intr_get_pri(nvme->n_inth[0], &nvme->n_intr_pri);
	if (ret != DDI_SUCCESS) {
		dev_err(nvme->n_dip, CE_WARN, "!%s: ddi_intr_get_pri failed",
		    __func__);
		goto fail;
	}

	for (i = 0; i < count; i++) {
		ret = ddi_intr_add_handler(nvme->n_inth[i], nvme_intr,
		    (void *)nvme, (void *)(uintptr_t)i);
		if (ret != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!%s: ddi_intr_add_handler failed", __func__);
			goto fail;
		}
	}

	(void) ddi_intr_get_cap(nvme->n_inth[0], &nvme->n_intr_cap);

	for (i = 0; i < count; i++) {
		if (nvme->n_intr_cap & DDI_INTR_FLAG_BLOCK)
			ret = ddi_intr_block_enable(&nvme->n_inth[i], 1);
		else
			ret = ddi_intr_enable(nvme->n_inth[i]);

		if (ret != DDI_SUCCESS) {
			dev_err(nvme->n_dip, CE_WARN,
			    "!%s: enabling interrupt %d failed", __func__, i);
			goto fail;
		}
	}

	nvme->n_intr_type = intr_type;

	nvme->n_progress |= NVME_INTERRUPTS;

	return (DDI_SUCCESS);

fail:
	nvme_release_interrupts(nvme);

	return (ret);
}
static int
nvme_fm_errcb(dev_info_t *dip, ddi_fm_error_t *fm_error, const void *arg)
{
	_NOTE(ARGUNUSED(arg));

	pci_ereport_post(dip, fm_error, NULL);
	return (fm_error->fme_status);
}
static int
nvme_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
	nvme_t *nvme;
	int instance;
	int nregs;
	off_t regsize;
	int i;
	char name[32];

	if (cmd != DDI_ATTACH)
		return (DDI_FAILURE);

	instance = ddi_get_instance(dip);

	if (ddi_soft_state_zalloc(nvme_state, instance) != DDI_SUCCESS)
		return (DDI_FAILURE);

	nvme = ddi_get_soft_state(nvme_state, instance);
	ddi_set_driver_private(dip, nvme);
	nvme->n_dip = dip;

	nvme->n_strict_version = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "strict-version", 1) == 1 ? B_TRUE : B_FALSE;
	nvme->n_ignore_unknown_vendor_status = ddi_prop_get_int(DDI_DEV_T_ANY,
	    dip, DDI_PROP_DONTPASS, "ignore-unknown-vendor-status", 0) == 1 ?
	    B_TRUE : B_FALSE;
	nvme->n_admin_queue_len = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "admin-queue-len", NVME_DEFAULT_ADMIN_QUEUE_LEN);
	nvme->n_io_queue_len = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "io-queue-len", NVME_DEFAULT_IO_QUEUE_LEN);
	nvme->n_async_event_limit = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "async-event-limit",
	    NVME_DEFAULT_ASYNC_EVENT_LIMIT);
	nvme->n_write_cache_enabled = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "volatile-write-cache-enable", 1) != 0 ?
	    B_TRUE : B_FALSE;
	nvme->n_min_block_size = ddi_prop_get_int(DDI_DEV_T_ANY, dip,
	    DDI_PROP_DONTPASS, "min-phys-block-size",
	    NVME_DEFAULT_MIN_BLOCK_SIZE);

	if (!ISP2(nvme->n_min_block_size) ||
	    (nvme->n_min_block_size < NVME_DEFAULT_MIN_BLOCK_SIZE)) {
		dev_err(dip, CE_WARN, "!min-phys-block-size %s, "
		    "using default %d", ISP2(nvme->n_min_block_size) ?
		    "too low" : "not a power of 2",
		    NVME_DEFAULT_MIN_BLOCK_SIZE);
		nvme->n_min_block_size = NVME_DEFAULT_MIN_BLOCK_SIZE;
	}

	if (nvme->n_admin_queue_len < NVME_MIN_ADMIN_QUEUE_LEN)
		nvme->n_admin_queue_len = NVME_MIN_ADMIN_QUEUE_LEN;
	else if (nvme->n_admin_queue_len > NVME_MAX_ADMIN_QUEUE_LEN)
		nvme->n_admin_queue_len = NVME_MAX_ADMIN_QUEUE_LEN;

	if (nvme->n_io_queue_len < NVME_MIN_IO_QUEUE_LEN)
		nvme->n_io_queue_len = NVME_MIN_IO_QUEUE_LEN;

	if (nvme->n_async_event_limit < 1)
		nvme->n_async_event_limit = NVME_DEFAULT_ASYNC_EVENT_LIMIT;

	nvme->n_reg_acc_attr = nvme_reg_acc_attr;
	nvme->n_queue_dma_attr = nvme_queue_dma_attr;
	nvme->n_prp_dma_attr = nvme_prp_dma_attr;
	nvme->n_sgl_dma_attr = nvme_sgl_dma_attr;

	/*
	 * Setup FMA support.
	 */
	nvme->n_fm_cap = ddi_getprop(DDI_DEV_T_ANY, dip,
	    DDI_PROP_CANSLEEP | DDI_PROP_DONTPASS, "fm-capable",
	    DDI_FM_EREPORT_CAPABLE | DDI_FM_ACCCHK_CAPABLE |
	    DDI_FM_DMACHK_CAPABLE | DDI_FM_ERRCB_CAPABLE);

	ddi_fm_init(dip, &nvme->n_fm_cap, &nvme->n_fm_ibc);

	if (nvme->n_fm_cap) {
		if (nvme->n_fm_cap & DDI_FM_ACCCHK_CAPABLE)
			nvme->n_reg_acc_attr.devacc_attr_access =
			    DDI_FLAGERR_ACC;

		if (nvme->n_fm_cap & DDI_FM_DMACHK_CAPABLE) {
			nvme->n_prp_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
			nvme->n_sgl_dma_attr.dma_attr_flags |= DDI_DMA_FLAGERR;
		}

		if (DDI_FM_EREPORT_CAP(nvme->n_fm_cap) ||
		    DDI_FM_ERRCB_CAP(nvme->n_fm_cap))
			pci_ereport_setup(dip);

		if (DDI_FM_ERRCB_CAP(nvme->n_fm_cap))
			ddi_fm_handler_register(dip, nvme_fm_errcb,
			    (void *)nvme);
	}

	nvme->n_progress |= NVME_FMA_INIT;

	/*
	 * The spec defines several register sets. Only the controller
	 * registers (set 1) are currently used.
	 */
	if (ddi_dev_nregs(dip, &nregs) == DDI_FAILURE ||
	    nregs < 2 ||
	    ddi_dev_regsize(dip, 1, &regsize) == DDI_FAILURE)
		goto fail;

	if (ddi_regs_map_setup(dip, 1, &nvme->n_regs, 0, regsize,
	    &nvme->n_reg_acc_attr, &nvme->n_regh) != DDI_SUCCESS) {
		dev_err(dip, CE_WARN, "!failed to map regset 1");
		goto fail;
	}

	nvme->n_progress |= NVME_REGS_MAPPED;

	/*
	 * Create taskq for command completion.
	 */
	(void) snprintf(name, sizeof (name), "%s%d_cmd_taskq",
	    ddi_driver_name(dip), ddi_get_instance(dip));
	nvme->n_cmd_taskq = ddi_taskq_create(dip, name, MIN(UINT16_MAX, ncpus),
	    TASKQ_DEFAULTPRI, 0);
	if (nvme->n_cmd_taskq == NULL) {
		dev_err(dip, CE_WARN, "!failed to create cmd taskq");
		goto fail;
	}

	/*
	 * Create PRP DMA cache
	 */
	(void) snprintf(name, sizeof (name), "%s%d_prp_cache",
	    ddi_driver_name(dip), ddi_get_instance(dip));
	nvme->n_prp_cache = kmem_cache_create(name, sizeof (nvme_dma_t),
	    0, nvme_prp_dma_constructor, nvme_prp_dma_destructor,
	    NULL, (void *)nvme, NULL, 0);

	if (nvme_init(nvme) != DDI_SUCCESS)
		goto fail;

	/*
	 * Attach the blkdev driver for each namespace.
	 */
	for (i = 0; i != nvme->n_namespace_count; i++) {
		if (nvme->n_ns[i].ns_ignore)
			continue;

		nvme->n_ns[i].ns_bd_hdl = bd_alloc_handle(&nvme->n_ns[i],
		    &nvme_bd_ops, &nvme->n_prp_dma_attr, KM_SLEEP);

		if (nvme->n_ns[i].ns_bd_hdl == NULL) {
			dev_err(dip, CE_WARN,
			    "!failed to get blkdev handle for namespace %d", i);
			goto fail;
		}

		if (bd_attach_handle(dip, nvme->n_ns[i].ns_bd_hdl)
		    != DDI_SUCCESS) {
			dev_err(dip, CE_WARN,
			    "!failed to attach blkdev handle for namespace %d",
			    i);
			goto fail;
		}
	}

	return (DDI_SUCCESS);

fail:
	/* attach successful anyway so that FMA can retire the device */
	if (nvme->n_dead)
		return (DDI_SUCCESS);

	(void) nvme_detach(dip, DDI_DETACH);

	return (DDI_FAILURE);
}
static int
nvme_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
	int instance, i;
	nvme_t *nvme;

	if (cmd != DDI_DETACH)
		return (DDI_FAILURE);

	instance = ddi_get_instance(dip);

	nvme = ddi_get_soft_state(nvme_state, instance);

	if (nvme == NULL)
		return (DDI_FAILURE);

	if (nvme->n_ns) {
		for (i = 0; i != nvme->n_namespace_count; i++) {
			if (nvme->n_ns[i].ns_bd_hdl) {
				(void) bd_detach_handle(
				    nvme->n_ns[i].ns_bd_hdl);
				bd_free_handle(nvme->n_ns[i].ns_bd_hdl);
			}

			if (nvme->n_ns[i].ns_idns)
				kmem_free(nvme->n_ns[i].ns_idns,
				    sizeof (nvme_identify_nsid_t));
		}

		kmem_free(nvme->n_ns, sizeof (nvme_namespace_t) *
		    nvme->n_namespace_count);
	}

	if (nvme->n_progress & NVME_INTERRUPTS)
		nvme_release_interrupts(nvme);

	if (nvme->n_cmd_taskq)
		ddi_taskq_wait(nvme->n_cmd_taskq);

	if (nvme->n_ioq_count > 0) {
		for (i = 1; i != nvme->n_ioq_count + 1; i++) {
			if (nvme->n_ioq[i] != NULL) {
				/* TODO: send destroy queue commands */
				nvme_free_qpair(nvme->n_ioq[i]);
			}
		}

		kmem_free(nvme->n_ioq, sizeof (nvme_qpair_t *) *
		    (nvme->n_ioq_count + 1));
	}

	if (nvme->n_prp_cache != NULL) {
		kmem_cache_destroy(nvme->n_prp_cache);
	}

	if (nvme->n_progress & NVME_REGS_MAPPED) {
		nvme_shutdown(nvme, NVME_CC_SHN_NORMAL, B_FALSE);
		(void) nvme_reset(nvme, B_FALSE);
	}

	if (nvme->n_cmd_taskq)
		ddi_taskq_destroy(nvme->n_cmd_taskq);

	if (nvme->n_progress & NVME_CTRL_LIMITS)
		sema_destroy(&nvme->n_abort_sema);

	if (nvme->n_progress & NVME_ADMIN_QUEUE)
		nvme_free_qpair(nvme->n_adminq);

	if (nvme->n_idctl)
		kmem_free(nvme->n_idctl, sizeof (nvme_identify_ctrl_t));

	if (nvme->n_progress & NVME_REGS_MAPPED)
		ddi_regs_map_free(&nvme->n_regh);

	if (nvme->n_progress & NVME_FMA_INIT) {
		if (DDI_FM_ERRCB_CAP(nvme->n_fm_cap))
			ddi_fm_handler_unregister(nvme->n_dip);

		if (DDI_FM_EREPORT_CAP(nvme->n_fm_cap) ||
		    DDI_FM_ERRCB_CAP(nvme->n_fm_cap))
			pci_ereport_teardown(nvme->n_dip);

		ddi_fm_fini(nvme->n_dip);
	}

	if (nvme->n_vendor != NULL)
		strfree(nvme->n_vendor);

	if (nvme->n_product != NULL)
		strfree(nvme->n_product);

	ddi_soft_state_free(nvme_state, instance);

	return (DDI_SUCCESS);
}
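/*
 * quiesce(9E) must not block; the abrupt shutdown mode and the B_TRUE
 * arguments below presumably select the polled, non-blocking paths of
 * nvme_shutdown() and nvme_reset().
 */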
static int
nvme_quiesce(dev_info_t *dip)
{
	int instance;
	nvme_t *nvme;

	instance = ddi_get_instance(dip);

	nvme = ddi_get_soft_state(nvme_state, instance);

	if (nvme == NULL)
		return (DDI_FAILURE);

	nvme_shutdown(nvme, NVME_CC_SHN_ABRUPT, B_TRUE);

	(void) nvme_reset(nvme, B_TRUE);

	return (DDI_FAILURE);
}
static int
nvme_fill_prp(nvme_cmd_t *cmd, bd_xfer_t *xfer)
{
	nvme_t *nvme = cmd->nc_nvme;
	int nprp_page, nprp;
	uint64_t *prp;

	if (xfer->x_ndmac == 0)
		return (DDI_FAILURE);

	cmd->nc_sqe.sqe_dptr.d_prp[0] = xfer->x_dmac.dmac_laddress;
	ddi_dma_nextcookie(xfer->x_dmah, &xfer->x_dmac);

	if (xfer->x_ndmac == 1) {
		cmd->nc_sqe.sqe_dptr.d_prp[1] = 0;
		return (DDI_SUCCESS);
	} else if (xfer->x_ndmac == 2) {
		cmd->nc_sqe.sqe_dptr.d_prp[1] = xfer->x_dmac.dmac_laddress;
		return (DDI_SUCCESS);
	}

	xfer->x_ndmac--;
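	/*
	 * More than two cookies: PRP2 must point to a PRP list. E.g. with 4k
	 * pages nprp_page is 4096 / 8 - 1 = 511 usable entries per list page
	 * (the last entry is presumably reserved for chaining to another
	 * list), so the VERIFY below holds as long as at most 511 cookies
	 * remain.
	 */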
	nprp_page = nvme->n_pagesize / sizeof (uint64_t) - 1;
	ASSERT(nprp_page > 0);
	nprp = (xfer->x_ndmac + nprp_page - 1) / nprp_page;

	/*
	 * We currently don't support chained PRPs and set up our DMA
	 * attributes to reflect that. If we still get an I/O request
	 * that needs a chained PRP something is very wrong.
	 */
	VERIFY(nprp == 1);

	cmd->nc_dma = kmem_cache_alloc(nvme->n_prp_cache, KM_SLEEP);
	bzero(cmd->nc_dma->nd_memp, cmd->nc_dma->nd_len);

	cmd->nc_sqe.sqe_dptr.d_prp[1] = cmd->nc_dma->nd_cookie.dmac_laddress;

	/*LINTED: E_PTR_BAD_CAST_ALIGN*/
	for (prp = (uint64_t *)cmd->nc_dma->nd_memp;
	    xfer->x_ndmac > 0;
	    prp++, xfer->x_ndmac--) {
		*prp = xfer->x_dmac.dmac_laddress;
		ddi_dma_nextcookie(xfer->x_dmah, &xfer->x_dmac);
	}

	(void) ddi_dma_sync(cmd->nc_dma->nd_dmah, 0, cmd->nc_dma->nd_len,
	    DDI_DMA_SYNC_FORDEV);
	return (DDI_SUCCESS);
}
static nvme_cmd_t *
nvme_create_nvm_cmd(nvme_namespace_t *ns, uint8_t opc, bd_xfer_t *xfer)
{
	nvme_t *nvme = ns->ns_nvme;
	nvme_cmd_t *cmd;

	/*
	 * Blkdev only sets BD_XFER_POLL when dumping, so don't sleep.
	 */
	cmd = nvme_alloc_cmd(nvme, (xfer->x_flags & BD_XFER_POLL) ?
	    KM_NOSLEEP : KM_SLEEP);

	if (cmd == NULL)
		return (NULL);

	cmd->nc_sqe.sqe_opc = opc;
	cmd->nc_callback = nvme_bd_xfer_done;
	cmd->nc_xfer = xfer;

	switch (opc) {
	case NVME_OPC_NVM_WRITE:
	case NVME_OPC_NVM_READ:
		VERIFY(xfer->x_nblks <= 0x10000);

		cmd->nc_sqe.sqe_nsid = ns->ns_id;

		cmd->nc_sqe.sqe_cdw10 = xfer->x_blkno & 0xffffffffu;
		cmd->nc_sqe.sqe_cdw11 = (xfer->x_blkno >> 32);
		cmd->nc_sqe.sqe_cdw12 = (uint16_t)(xfer->x_nblks - 1);

		if (nvme_fill_prp(cmd, xfer) != DDI_SUCCESS)
			goto fail;
		break;

	case NVME_OPC_NVM_FLUSH:
		cmd->nc_sqe.sqe_nsid = ns->ns_id;
		break;

	default:
		goto fail;
	}

	return (cmd);

fail:
	nvme_free_cmd(cmd);
	return (NULL);
}
static void
nvme_bd_xfer_done(void *arg)
{
	nvme_cmd_t *cmd = arg;
	bd_xfer_t *xfer = cmd->nc_xfer;
	int error = 0;

	error = nvme_check_cmd_status(cmd);
	nvme_free_cmd(cmd);

	bd_xfer_done(xfer, error);
}
static void
nvme_bd_driveinfo(void *arg, bd_drive_t *drive)
{
	nvme_namespace_t *ns = arg;
	nvme_t *nvme = ns->ns_nvme;

	/*
	 * blkdev maintains one queue size per instance (namespace),
	 * but all namespaces share the I/O queues.
	 * TODO: need to figure out a sane default, or use per-NS I/O queues,
	 * or change blkdev to handle EAGAIN
	 */
	drive->d_qsize = nvme->n_ioq_count * nvme->n_io_queue_len
	    / nvme->n_namespace_count;

	/*
	 * d_maxxfer is not set, which means the value is taken from the DMA
	 * attributes specified to bd_alloc_handle.
	 */

	drive->d_removable = B_FALSE;
	drive->d_hotpluggable = B_FALSE;

	drive->d_target = ns->ns_id;
	drive->d_lun = 0;

	drive->d_model = nvme->n_idctl->id_model;
	drive->d_model_len = sizeof (nvme->n_idctl->id_model);
	drive->d_vendor = nvme->n_vendor;
	drive->d_vendor_len = strlen(nvme->n_vendor);
	drive->d_product = nvme->n_product;
	drive->d_product_len = strlen(nvme->n_product);
	drive->d_serial = nvme->n_idctl->id_serial;
	drive->d_serial_len = sizeof (nvme->n_idctl->id_serial);
	drive->d_revision = nvme->n_idctl->id_fwrev;
	drive->d_revision_len = sizeof (nvme->n_idctl->id_fwrev);
}
static int
nvme_bd_mediainfo(void *arg, bd_media_t *media)
{
	nvme_namespace_t *ns = arg;

	media->m_nblks = ns->ns_block_count;
	media->m_blksize = ns->ns_block_size;
	media->m_readonly = B_FALSE;
	media->m_solidstate = B_TRUE;

	media->m_pblksize = ns->ns_best_block_size;

	return (0);
}
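/*
 * I/O commands are spread across the I/O queue pairs by submitting on the
 * queue indexed by (CPU id % n_ioq_count) + 1, keeping submissions roughly
 * CPU-local; index 0 is the admin queue and is never used for I/O.
 */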
static int
nvme_bd_cmd(nvme_namespace_t *ns, bd_xfer_t *xfer, uint8_t opc)
{
	nvme_t *nvme = ns->ns_nvme;
	nvme_cmd_t *cmd;

	if (nvme->n_dead)
		return (EIO);

	/* No polling for now */
	if (xfer->x_flags & BD_XFER_POLL)
		return (EIO);

	cmd = nvme_create_nvm_cmd(ns, opc, xfer);
	if (cmd == NULL)
		return (ENOMEM);

	cmd->nc_sqid = (CPU->cpu_id % nvme->n_ioq_count) + 1;
	ASSERT(cmd->nc_sqid <= nvme->n_ioq_count);

	if (nvme_submit_cmd(nvme->n_ioq[cmd->nc_sqid], cmd)
	    != DDI_SUCCESS)
		return (EAGAIN);

	return (0);
}
static int
nvme_bd_read(void *arg, bd_xfer_t *xfer)
{
	nvme_namespace_t *ns = arg;

	return (nvme_bd_cmd(ns, xfer, NVME_OPC_NVM_READ));
}

static int
nvme_bd_write(void *arg, bd_xfer_t *xfer)
{
	nvme_namespace_t *ns = arg;

	return (nvme_bd_cmd(ns, xfer, NVME_OPC_NVM_WRITE));
}
static int
nvme_bd_sync(void *arg, bd_xfer_t *xfer)
{
	nvme_namespace_t *ns = arg;

	if (ns->ns_nvme->n_dead)
		return (EIO);

	/*
	 * If the volatile write cache is not present or not enabled the FLUSH
	 * command is a no-op, so we can take a shortcut here.
	 */
	if (!ns->ns_nvme->n_write_cache_present) {
		bd_xfer_done(xfer, ENOTSUP);
		return (0);
	}

	if (!ns->ns_nvme->n_write_cache_enabled) {
		bd_xfer_done(xfer, 0);
		return (0);
	}

	return (nvme_bd_cmd(ns, xfer, NVME_OPC_NVM_FLUSH));
}
static int
nvme_bd_devid(void *arg, dev_info_t *devinfo, ddi_devid_t *devid)
{
	nvme_namespace_t *ns = arg;

	return (ddi_devid_init(devinfo, DEVID_ENCAP, strlen(ns->ns_devid),
	    ns->ns_devid, devid));
}