		The MSI Driver Guide HOWTO
	Tom L Nguyen tom.l.nguyen@intel.com
	Revised Feb 12, 2004 by Martine Silbermann
		email: Martine.Silbermann@hp.com
	Revised Jun 25, 2004 by Tom L Nguyen
1. About this guide

This guide describes the basics of Message Signaled Interrupts (MSI),
the advantages of using MSI over traditional interrupt mechanisms, and
how to enable your driver to use MSI or MSI-X. Also included is a
Frequently Asked Questions (FAQ) section.
PCI devices can be single-function or multi-function. In either case,
when this text talks about enabling or disabling MSI on a "device
function," it is referring to one specific PCI device and function and
not to all functions on a PCI device (unless the PCI device has only
one function).

2. Copyright 2003 Intel Corporation
3. What is MSI/MSI-X?

Message Signaled Interrupt (MSI), as described in the PCI Local Bus
Specification Revision 2.3 or later, is an optional feature, and a
required feature for PCI Express devices. MSI enables a device function
to request service by sending an Inbound Memory Write on its PCI bus to
the FSB as a Message Signaled Interrupt transaction. Because MSI is
generated in the form of a Memory Write, all transaction conditions,
such as a Retry, Master-Abort, Target-Abort or normal completion, are
supported.
A PCI device that supports MSI must also support the pin IRQ assertion
interrupt mechanism to provide backward compatibility for systems that
do not support MSI. In systems which support MSI, the bus driver is
responsible for initializing the message address and message data of
the device function's MSI/MSI-X capability structure during device
initial configuration.
An MSI capable device function indicates MSI support by implementing
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
enable both.
The MSI capability structure contains the Message Control register,
the Message Address register and the Message Data register. These
registers provide the bus driver control over MSI. The Message Control
register indicates the MSI capability supported by the device. The
Message Address register specifies the target address and the Message
Data register specifies the characteristics of the message. To request
service, the device function writes the content of the Message Data
register to the target address. The device and its software driver
are prohibited from writing to these registers.
The MSI-X capability structure is an optional extension to MSI. It
uses an independent and separate capability structure. There are
some key advantages to implementing the MSI-X capability structure
over the MSI capability structure, as described below:
	- Support a larger maximum number of vectors per function.

	- Provide the ability for system software to configure
	  each vector with an independent message address and message
	  data, specified by a table that resides in Memory Space.

	- MSI and MSI-X both support per-vector masking. Per-vector
	  masking is an optional extension of MSI but a required
	  feature for MSI-X. Per-vector masking provides the kernel the
	  ability to mask/unmask a single MSI while running its
	  interrupt service routine. If per-vector masking is
	  not supported, then the device driver should provide the
	  hardware/software synchronization to ensure that the device
	  generates MSI when the driver wants it to do so.
4. Why use MSI?

As a benefit to the simplification of board design, MSI allows board
designers to remove out-of-band interrupt routing. MSI is another
step towards a legacy-free environment.

Due to increasing pressure on chipset and processor packages to
reduce pin count, the need for interrupt pins is expected to
diminish over time. Devices, due to pin constraints, may implement
messages to increase performance.
PCI Express endpoints use INTx emulation (in-band messages) instead
of IRQ pin assertion. Using INTx emulation requires interrupt
sharing among devices connected to the same node (PCI bridge) while
MSI is unique (non-shared) and does not require BIOS configuration
support. As a result, the PCI Express technology requires MSI
support for better interrupt performance.

Using MSI enables the device functions to support two or more
vectors, which can be configured to target different CPUs to
increase scalability.
5. Configuring a driver to use MSI/MSI-X

By default, the kernel will not enable MSI/MSI-X on all devices that
support this capability. The CONFIG_PCI_MSI kernel option
must be selected to enable MSI/MSI-X support.
5.1 Including MSI/MSI-X support into the kernel

To allow MSI/MSI-X capable device drivers to selectively enable
MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
below), the VECTOR based scheme needs to be enabled by setting
CONFIG_PCI_MSI during kernel config.

Since the target of the inbound message is the local APIC,
CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.
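As a quick reference, a kernel configured as described above would
carry at least the following two options (a minimal fragment; the rest
of the configuration depends on the platform):

	CONFIG_X86_LOCAL_APIC=y
	CONFIG_PCI_MSI=y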
5.2 Configuring for MSI support

Due to the non-contiguous fashion in vector assignment of the
existing Linux kernel, this version does not support multiple
messages regardless of whether a device function is capable of
supporting more than one vector. To enable MSI on a device function's
MSI capability structure, a device driver must call the function
pci_enable_msi() explicitly.
5.2.1 API pci_enable_msi

int pci_enable_msi(struct pci_dev *dev)

With this new API, a device driver that wants to have MSI
enabled on its device function must call this API to enable MSI.
A successful call will initialize the MSI capability structure
with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
pre-assigned dev->irq with a new MSI vector. To avoid a conflict
between the newly assigned vector and the existing pre-assigned
vector, a device driver must call this API before calling
request_irq().
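Below is a minimal sketch of the calling sequence described above. The
device name "foo", the probe routine and the interrupt handler are
hypothetical; the point being illustrated is only the ordering
(pci_enable_msi() before request_irq()) and the fallback to legacy mode
when MSI cannot be enabled:

	#include <linux/pci.h>
	#include <linux/interrupt.h>

	static irqreturn_t foo_interrupt(int irq, void *dev_id)
	{
		/* Acknowledge and handle the device interrupt here. */
		return IRQ_HANDLED;
	}

	static int foo_probe(struct pci_dev *dev, const struct pci_device_id *id)
	{
		int ret, msi_enabled;

		ret = pci_enable_device(dev);
		if (ret)
			return ret;

		/* Switch the device function to MSI mode.  If this fails,
		 * dev->irq still holds the pre-assigned IOAPIC vector and
		 * the device stays in pin-IRQ assertion/INTx emulation mode. */
		msi_enabled = !pci_enable_msi(dev);

		/* request_irq() must be called after pci_enable_msi() so
		 * that it registers the newly assigned MSI vector held in
		 * dev->irq. */
		ret = request_irq(dev->irq, foo_interrupt, 0, "foo", dev);
		if (ret) {
			if (msi_enabled)
				pci_disable_msi(dev);
			pci_disable_device(dev);
		}
		return ret;
	}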
5.2.2 API pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msi()
when a device driver is unloading. This API restores dev->irq with
the pre-assigned IOAPIC vector and switches a device's interrupt
mode to PCI pin-irq assertion/INTx emulation mode.

Note that a device driver should always call free_irq() on the MSI vector
that it has done request_irq() on before calling this API. Failure to do
so results in a BUG_ON() and a device will be left with MSI enabled and
leaks its vector.
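A minimal teardown sketch matching the hypothetical probe sequence
shown earlier; the important point is that free_irq() is called before
pci_disable_msi():

	static void foo_remove(struct pci_dev *dev)
	{
		/* Release the MSI vector first... */
		free_irq(dev->irq, dev);

		/* ...then switch the device back to pin-IRQ assertion/INTx
		 * mode; dev->irq is restored to the pre-assigned IOAPIC
		 * vector. */
		pci_disable_msi(dev);

		pci_disable_device(dev);
	}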
5.2.3 MSI mode vs. legacy mode diagram

The below diagram shows the events which switch the interrupt
mode on the MSI-capable device function between MSI mode and
PIN-IRQ assertion mode.

	 ------------   pci_enable_msi    ------------------------
	|            | <===============  |                        |
	|  MSI MODE  |                   | PIN-IRQ ASSERTION MODE |
	|            | ===============>  |                        |
	 ------------   pci_disable_msi   ------------------------

Figure 1. MSI Mode vs. Legacy Mode
In Figure 1, a device operates by default in legacy mode. Legacy
in this context means PCI pin-irq assertion or PCI-Express INTx
emulation. A successful MSI request (using pci_enable_msi()) switches
a device's interrupt mode to MSI mode. The pre-assigned IOAPIC vector
stored in dev->irq is saved by the PCI subsystem and a newly
assigned MSI vector replaces dev->irq.

To return to its default mode, a device driver should always call
pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
device driver should always call free_irq() on the MSI vector it has
done request_irq() on before calling pci_disable_msi(). Failure to do
so results in a BUG_ON() and a device will be left with MSI enabled and
leaks its vector. Otherwise, the PCI subsystem restores a device's
dev->irq with the pre-assigned IOAPIC vector and marks the released
MSI vector as unused.
Once marked as unused, there is no guarantee that the PCI
subsystem will reserve this MSI vector for the device. Depending on
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, this MSI vector may be
re-assigned.

For the case where the PCI subsystem re-assigns this MSI vector to
another driver, a request to switch back to MSI mode may result
in being assigned a different MSI vector or a failure if no more
vectors are available.
5.3 Configuring for MSI-X support

Due to the ability of the system software to configure each vector of
the MSI-X capability structure with an independent message address
and message data, the non-contiguous fashion in vector assignment of
the existing Linux kernel has no impact on supporting multiple
messages on an MSI-X capable device function. To enable MSI-X on
a device function's MSI-X capability structure, its device driver
must call the function pci_enable_msix() explicitly.
The function pci_enable_msix(), once invoked, enables either
all or none of the requested vectors, depending on the current
availability of PCI vector resources. If PCI vector resources are
available for the number of vectors requested by a device driver,
this function will configure the MSI-X table of the device's MSI-X
capability structure with the requested messages. For example, a
device may be capable of supporting a maximum of 32 vectors while
its software driver may typically request only 4 vectors. It is
recommended that the device driver call this function once during
its initialization phase.
Unlike the function pci_enable_msi(), the function pci_enable_msix()
does not replace the pre-assigned IOAPIC dev->irq with a new MSI
vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
into the 'vector' field of each element of the second argument.
Note that the pre-assigned IOAPIC dev->irq is valid only if the device
operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by the
device driver to use dev->irq to request interrupt service
may result in unpredictable behavior.
For each MSI-X vector granted, a device driver is responsible for calling
other functions like request_irq(), enable_irq(), etc. to enable
this vector with its corresponding interrupt service handler. It is
a device driver's choice to assign all vectors to the same
interrupt service handler or each vector to a unique interrupt
service handler.
5.3.1 Handling MMIO address space of MSI-X Table

The PCI 3.0 specification has implementation notes that the MMIO address
space for a device's MSI-X structure should be isolated so that the
software system can set different pages for controlling accesses to the
MSI-X structure. The implementation of MSI support requires the PCI
subsystem, not a device driver, to maintain full control of the MSI-X
table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X
table/MSI-X PBA. A device driver is prohibited from requesting the MMIO
address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem
will fail to enable MSI-X on its hardware device when it calls the
function pci_enable_msix().
5.3.2 Handling MSI-X allocation

The number of MSI-X vectors allocated to a function depends on the
number of MSI capable devices and MSI-X capable devices populated
in the system. The policy of allocating MSI-X vectors to a function
is defined as follows:

	# of MSI-X vectors allocated to a function = (x - y) / z  where

	x =	The number of available PCI vector resources by the time
		the device driver calls pci_enable_msix(). The PCI vector
		resources are the sum of the number of unassigned vectors
		(new) and the number of released vectors when any MSI/MSI-X
		device driver switches its hardware device back to a legacy
		mode or is hot-removed. The number of unassigned vectors
		may exclude some vectors reserved, as defined in parameter
		NR_HP_RESERVED_VECTORS, for the case where the system is
		capable of supporting hot-add/hot-remove operations. Users
		may change the value defined in NR_HP_RESERVED_VECTORS to
		meet their specific needs.

	y =	The number of MSI capable devices populated in the system.
		This policy ensures that each MSI capable device has its
		vector reserved to avoid the case where some MSI-X capable
		drivers may attempt to claim all available vector resources.

	z =	The number of MSI-X capable devices populated in the system.
		This policy ensures that the maximum (x - y) is distributed
		evenly among MSI-X capable devices.

Note that the PCI subsystem scans y and z during a bus enumeration.
When the PCI subsystem completes configuring the MSI/MSI-X capability
structure of a device as requested by its device driver, y/z is
decremented accordingly.
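As a purely illustrative example of the formula above (the numbers are
invented): if 40 PCI vector resources are available when
pci_enable_msix() is called (x = 40), 4 MSI capable devices are
populated (y = 4) and 2 MSI-X capable devices are populated (z = 2),
then each MSI-X capable function may be allocated at most
(40 - 4) / 2 = 18 vectors.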
5.3.3 Handling MSI-X shortages

For the case where fewer MSI-X vectors are allocated to a function
than requested, the function pci_enable_msix() will return the
maximum number of MSI-X vectors available to the caller. A device
driver may re-send its request with that number of vectors or fewer,
as shown in the sketch after the list below. For example, if a device
driver requests 5 vectors but the number of available vectors is 3,
a value of 3 will be returned as a result of the pci_enable_msix()
call. A function could be designed so that its driver uses only
3 MSI-X table entries, in different combinations such as ABC--,
A-B-C, A--CB, etc. Note that this patch does not support multiple
entries with the same vector. Any attempt by a device driver to use
5 MSI-X table entries with 3 vectors, as in ABBCC, AABCC, BCCBA, etc.,
will result in a failure of the function pci_enable_msix(). Below are
the reasons why supporting multiple entries with the same vector is
an undesirable solution.

	- The PCI subsystem cannot determine the entry that
	  generated the message to mask/unmask MSI while handling
	  software driver ISR. Attempting to walk through all MSI-X
	  table entries (2048 max) to mask/unmask any match vector
	  is an undesirable solution.

	- Walking through all MSI-X table entries (2048 max) to handle
	  SMP affinity of any match vector is an undesirable solution.
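A minimal retry sketch for the shortage case described above. The
initial request size is hypothetical, and 'entries' is assumed to have
been prepared as described in section 5.3.4 below; the point is only
that a positive return value can be fed back into a second
pci_enable_msix() call:

	int ret, nvec = 5;	/* hypothetical initial request */

	ret = pci_enable_msix(dev, entries, nvec);
	if (ret > 0) {
		/* Fewer vectors are available than requested; retry with
		 * the number of vectors the PCI subsystem reports. */
		nvec = ret;
		ret = pci_enable_msix(dev, entries, nvec);
	}
	if (ret < 0) {
		/* A negative return is a real failure; fall back to MSI
		 * or to legacy INTx mode here. */
	}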
5.3.4 API pci_enable_msix

int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)

This API enables a device driver to request the PCI subsystem
to enable MSI-X messages on its hardware device. Depending on
the availability of PCI vector resources, the PCI subsystem enables
either all or none of the requested vectors.

Argument 'dev' points to the device (pci_dev) structure.

Argument 'entries' is a pointer to an array of msix_entry structs.
The number of entries is indicated in argument 'nvec'.
struct msix_entry is defined in drivers/pci/msi.h:

struct msix_entry {
	u16	vector; /* kernel uses to write allocated vector */
	u16	entry;  /* driver uses to specify entry */
};
A device driver is responsible for initializing the 'entry' field of
each element with a unique entry number supported by the MSI-X table.
Otherwise, -EINVAL will be returned as a result. A successful return
of zero indicates the PCI subsystem completed initializing each of the
requested entries of the MSI-X table with message address and message
data. Last but not least, the PCI subsystem will write the 1:1
vector-to-entry mapping into the 'vector' field of each element. A
device driver is responsible for keeping track of allocated MSI-X
vectors in its internal data structure.

A return of zero indicates that the requested number of MSI-X vectors
was successfully allocated. A return greater than zero indicates an
MSI-X vector shortage. A return of less than zero indicates
a failure. This failure may be a result of duplicate entries
specified in the second argument, or a result of no available vectors,
or a result of failing to initialize the MSI-X table entries.
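The sketch below puts the pieces of this section together for a driver
that wants 4 vectors. The "foo" names, the handler and the request
size of 4 are hypothetical, and the shortage-handling retry from
section 5.3.3 is omitted for brevity:

	#include <linux/pci.h>
	#include <linux/interrupt.h>

	#define FOO_NR_VECTORS	4	/* hypothetical request size */

	static struct msix_entry foo_msix_entries[FOO_NR_VECTORS];

	static irqreturn_t foo_msix_interrupt(int irq, void *dev_id)
	{
		/* Per-vector device handling goes here. */
		return IRQ_HANDLED;
	}

	static int foo_setup_msix(struct pci_dev *dev)
	{
		int i, ret;

		/* Each element must name a unique MSI-X table entry; the
		 * kernel fills in the 'vector' field on success. */
		for (i = 0; i < FOO_NR_VECTORS; i++)
			foo_msix_entries[i].entry = i;

		ret = pci_enable_msix(dev, foo_msix_entries, FOO_NR_VECTORS);
		if (ret)
			return ret;	/* > 0: shortage, < 0: failure */

		/* dev->irq is NOT updated in MSI-X mode, so each granted
		 * vector from the 'vector' fields is registered separately. */
		for (i = 0; i < FOO_NR_VECTORS; i++) {
			ret = request_irq(foo_msix_entries[i].vector,
					  foo_msix_interrupt, 0, "foo-msix", dev);
			if (ret) {
				while (--i >= 0)
					free_irq(foo_msix_entries[i].vector, dev);
				pci_disable_msix(dev);
				return ret;
			}
		}
		return 0;
	}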
5.3.5 API pci_disable_msix

void pci_disable_msix(struct pci_dev *dev)

This API should always be used to undo the effect of pci_enable_msix()
when a device driver is unloading. Note that a device driver should
always call free_irq() on all MSI-X vectors it has done request_irq()
on before calling this API. Failure to do so results in a BUG_ON() and
a device will be left with MSI-X enabled and leaks its vectors.
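The matching teardown for the hypothetical sketch in section 5.3.4;
every granted vector is freed before MSI-X is disabled:

	static void foo_teardown_msix(struct pci_dev *dev)
	{
		int i;

		/* free_irq() on every vector obtained via request_irq()... */
		for (i = 0; i < FOO_NR_VECTORS; i++)
			free_irq(foo_msix_entries[i].vector, dev);

		/* ...and only then undo pci_enable_msix(). */
		pci_disable_msix(dev);
	}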
5.3.6 MSI-X mode vs. legacy mode diagram

The below diagram shows the events which switch the interrupt
mode on the MSI-X capable device function between MSI-X mode and
PIN-IRQ assertion mode (legacy).

	 --------------   pci_enable_msix(,,n)  ------------------------
	|              | <===============       |                        |
	|  MSI-X MODE  |                        | PIN-IRQ ASSERTION MODE |
	|              | ===============>       |                        |
	 --------------   pci_disable_msix      ------------------------

Figure 2. MSI-X Mode vs. Legacy Mode
In Figure 2, a device operates by default in legacy mode. A
successful MSI-X request (using pci_enable_msix()) switches a
device's interrupt mode to MSI-X mode. The pre-assigned IOAPIC vector
stored in dev->irq is saved by the PCI subsystem; however,
unlike MSI mode, the PCI subsystem does not replace dev->irq with an
assigned MSI-X vector because the PCI subsystem already writes the 1:1
vector-to-entry mapping into the 'vector' field of each element
specified in the second argument.

To return to its default mode, a device driver should always call
pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
a device driver should always call free_irq() on all MSI-X vectors it
has done request_irq() on before calling pci_disable_msix(). Failure
to do so results in a BUG_ON() and a device will be left with MSI-X
enabled and leaks its vectors. Otherwise, the PCI subsystem switches a
device function's interrupt mode from MSI-X mode to legacy mode and
marks all allocated MSI-X vectors as unused.
Once marked as unused, there is no guarantee that the PCI
subsystem will reserve these MSI-X vectors for the device. Depending on
the availability of current PCI vector resources and the number of
MSI/MSI-X requests from other drivers, these MSI-X vectors may be
re-assigned.

For the case where the PCI subsystem re-assigns these MSI-X vectors
to other drivers, a request to switch back to MSI-X mode may result
in being assigned a different set of MSI-X vectors or a failure if no
more vectors are available.
5.4 Handling functions implementing both MSI and MSI-X capabilities

For the case where a function implements both MSI and MSI-X
capabilities, the PCI subsystem enables a device to run either in MSI
mode or MSI-X mode but not both. A device driver determines whether it
wants MSI or MSI-X enabled on its hardware device. Once a device
driver requests MSI, for example, it is prohibited from requesting
MSI-X; in other words, a device driver is not permitted to ping-pong
between MSI mode and MSI-X mode at run-time.
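For a function that implements both capabilities, a common pattern is
to try MSI-X first and fall back to MSI, then to legacy INTx. A hedged
sketch reusing the hypothetical helpers from the previous sections
(foo_setup_msix(), foo_interrupt()):

	static int foo_setup_interrupts(struct pci_dev *dev)
	{
		/* Prefer MSI-X when the driver can use multiple vectors. */
		if (!foo_setup_msix(dev))
			return 0;

		/* Otherwise fall back to a single MSI vector... */
		if (!pci_enable_msi(dev))
			dev_info(&dev->dev, "running in MSI mode\n");

		/* ...or, if that also fails, dev->irq still holds the
		 * pre-assigned IOAPIC vector (legacy INTx mode). */
		return request_irq(dev->irq, foo_interrupt, 0, "foo", dev);
	}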
5.5 Hardware requirements for MSI/MSI-X support

MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

5.5.1 System hardware support

Since the target of the MSI address is the CPU's local APIC, enabling
MSI/MSI-X support in the Linux kernel is dependent on whether existing
system hardware supports a local APIC. Users should verify that their
system supports local APIC operation by testing that it runs when
CONFIG_X86_LOCAL_APIC=y.
In an SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
however, in a UP environment, users must manually set
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_MSI enables the VECTOR based scheme and the option for
MSI-capable device drivers to selectively enable MSI/MSI-X.

Note that the CONFIG_X86_IO_APIC setting is irrelevant because the
MSI/MSI-X vector is allocated anew at runtime and MSI/MSI-X support
does not depend on BIOS support. This key independence enables
MSI/MSI-X support on future IOxAPIC-free platforms.
5.5.2 Device hardware support

The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure in its PCI capability list. By
default, this capability structure will not be initialized by
the kernel to enable MSI during the system boot. In other words,
the device function runs in its default pin assertion mode.
Note that in many cases the hardware supporting MSI has bugs,
which may result in system hangs. The software driver of specific
MSI-capable hardware is responsible for deciding whether to call
pci_enable_msi() or not. A return of zero indicates the kernel
successfully initialized the MSI/MSI-X capability structure of the
device function. The device function is now running in MSI/MSI-X mode.
5.6 How to tell whether MSI/MSI-X is enabled on a device function

At the driver level, a return of zero from the function call of
pci_enable_msi()/pci_enable_msix() indicates to a device driver that
its device function is initialized successfully and ready to run in
MSI/MSI-X mode.

At the user level, users can use the command 'cat /proc/interrupts'
to display the vectors allocated for devices and their interrupt
MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode being
enabled on a SCSI Adaptec 39320D Ultra320 controller.
	           CPU0       CPU1
	  0:     324639          0    IO-APIC-edge   timer
	  1:       1186          0    IO-APIC-edge   i8042
	  2:          0          0    XT-PIC         cascade
	 12:       2797          0    IO-APIC-edge   i8042
	 14:       6543          0    IO-APIC-edge   ide0
	 15:          1          0    IO-APIC-edge   ide1
	169:          0          0    IO-APIC-level  uhci-hcd
	185:          0          0    IO-APIC-level  uhci-hcd
	193:        138         10    PCI-MSI        aic79xx
	201:         30          0    PCI-MSI        aic79xx
	225:         30          0    IO-APIC-level  aic7xxx
	233:         30          0    IO-APIC-level  aic7xxx
6. MSI quirks

Several PCI chipsets or devices are known to not support MSI.
The PCI stack provides 3 possible levels of MSI disabling:
* on a single device
* on all devices behind a specific bridge
* globally
6.1. Disabling MSI on a single device

Under some circumstances it might be required to disable MSI on a
single device. This may be achieved either by not calling
pci_enable_msi() at all, or by setting the pci_dev->no_msi flag
before (most of the time in a quirk).
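A minimal quirk sketch for the second option. The vendor and device ID
macros are placeholders, not a real quirk from the kernel tree:

	static void quirk_no_msi_mydev(struct pci_dev *dev)
	{
		/* Tell the PCI stack never to enable MSI on this device. */
		dev->no_msi = 1;
	}
	DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_EXAMPLE, PCI_DEVICE_ID_EXAMPLE,
				quirk_no_msi_mydev);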
6.2. Disabling MSI below a bridge

The vast majority of MSI quirks are required by PCI bridges not
being able to route MSI between busses. In this case, MSI has to be
disabled on all devices behind the bridge. This is achieved by setting
the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge's
subordinate bus. There is no need to set the same flag on bridges that
are below the broken bridge. When pci_enable_msi() is called to enable
MSI on a device, pci_msi_supported() takes care of checking the NO_MSI
flag in all parent busses of the device.
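A hedged sketch of such a bridge quirk (again with placeholder IDs);
it simply marks the bridge's subordinate bus as unable to carry MSI:

	static void quirk_disable_msi_below(struct pci_dev *dev)
	{
		/* Flag every device behind this bridge as MSI-incapable. */
		if (dev->subordinate)
			dev->subordinate->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
	}
	DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_EXAMPLE,
				PCI_DEVICE_ID_EXAMPLE_BRIDGE,
				quirk_disable_msi_below);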
Some bridges actually support dynamic enabling/disabling of MSI
by changing some bits in their PCI configuration space (especially
the Hypertransport chipsets such as the nVidia nForce and Serverworks
HT2000). It may then be required to update the NO_MSI flag on the
corresponding devices in the sysfs hierarchy. To enable MSI support
on device "0000:00:0e", do:

	echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus

To disable MSI support, echo 0 instead of 1. Note that it should be
used with caution since changing this value might break interrupts.
6.3. Disabling MSI globally

Some extreme cases may require disabling MSI globally on the system.
For now, the only known case is a Serverworks PCI-X chipset (MSI is
not supported on several busses that are not all connected to the
chipset in the Linux PCI hierarchy). In the vast majority of other
cases, disabling MSI only behind a specific bridge is enough.

For debugging purposes, the user may also pass pci=nomsi on the kernel
command-line to explicitly disable MSI globally. But, once the
appropriate quirks are added to the kernel, this option should not be
required anymore.
6.4. Finding why MSI cannot be enabled on a device

Assuming that MSI is not enabled on a device, you should look at
dmesg to find messages that quirks may output when disabling MSI
on some devices, some bridges or even globally.
Then, lspci -t gives the list of bridges above a device. Reading
/sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI
is enabled (1) or disabled (0). If 0 is found in the msi_bus file of
a single bridge above the device, MSI cannot be enabled.
7. FAQ

Q1. Are there any limitations on using MSI?

A1. If the PCI device supports MSI and conforms to the
specification and the platform supports the APIC local bus,
then using MSI should work.
Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
AMD processors)? In P3 IPIs are transmitted on the APIC local
bus and in P4 and Xeon they are transmitted on the system
bus. Are there any implications with this?

A2. MSI support enables a PCI device to send an inbound
memory write (0xfeexxxxx as target address) on its PCI bus
directly to the FSB. Since the message address has the
redirection hint bit cleared, it should work.
Q3. The target address 0xfeexxxxx will be translated by the
Host Bridge into an interrupt message. Are there any
limitations on the chipsets such as Intel 8xx, Intel e7xxx,
or any other chipsets out there?

A3. If these chipsets support an inbound memory write with
target address set as 0xfeexxxxx, as conforming to the PCI
specification 2.3 or later, then it should work.
Q4. From the driver point of view, if the MSI is lost because
of errors occurring during an inbound memory write, then it may
wait forever. Is there a mechanism for it to recover?

A4. Since the target of the transaction is an inbound memory
write, all transaction termination conditions (Retry,
Master-Abort, Target-Abort, or normal completion) are
supported. A device sending an MSI must abide by all the PCI
rules and conditions regarding that inbound memory write. So,
if a retry is signaled it must retry, etc... We believe that
the recommendation for Abort is also a retry (refer to the PCI
specification 2.3 or later).