4 Copyright (c) 2014 Virtual Open Systems Sarl.
6 This work is licensed under the terms of the GNU GPL, version 2 or later.
7 See the COPYING file in the top-level directory.
This protocol aims to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
communicates over a Unix domain socket and shares file descriptors in the
ancillary data of the messages.
16 The protocol defines 2 sides of the communication, master and slave. Master is
17 the application that shares its virtqueues, in our case QEMU. Slave is the
18 consumer of the virtqueues.
20 In the current implementation QEMU is the Master, and the Slave is intended to
21 be a software Ethernet switch running in user space, such as Snabbswitch.
23 Master and slave can be either a client (i.e. connecting) or server (listening)
24 in the socket communication.
29 Note that all numbers are in the machine native byte order. A vhost-user message
30 consists of 3 header fields and a payload:
32 ------------------------------------
33 | request | flags | size | payload |
34 ------------------------------------
 * Request: 32-bit type of the request
 * Flags: 32-bit bit field:
   - Lower 2 bits are the version (currently 0x01)
   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
   - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
     details.
 * Size: 32-bit size of the payload
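
The header layout above can be sketched in C as follows. This is a minimal
illustration, not QEMU's code: the struct name, the macro names and the helper
functions are hypothetical, and only the bit positions and field widths come
from the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions come from the flags list; the macro names are illustrative. */
#define VU_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: protocol version */
#define VU_FLAG_REPLY        (1u << 2)  /* bit 2: set on each slave reply */
#define VU_FLAG_NEED_REPLY   (1u << 3)  /* bit 3: need_reply */
#define VU_VERSION           0x1u       /* current version */

/* The three 32-bit header fields, all in machine native byte order. */
struct vu_header {
    uint32_t request;
    uint32_t flags;
    uint32_t size;      /* size of the payload that follows, in bytes */
};

static inline uint32_t vu_version(const struct vu_header *h)
{
    return h->flags & VU_FLAG_VERSION_MASK;
}

static inline bool vu_is_reply(const struct vu_header *h)
{
    return (h->flags & VU_FLAG_REPLY) != 0;
}

static inline bool vu_needs_reply(const struct vu_header *h)
{
    return (h->flags & VU_FLAG_NEED_REPLY) != 0;
}

/* Flags a slave would put on a reply: current version plus the reply bit. */
static inline uint32_t vu_reply_flags(void)
{
    return VU_VERSION | VU_FLAG_REPLY;
}
```

A receiver would typically read the 12-byte header first, check the version,
and only then read "size" bytes of payload.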
45 Depending on the request type, payload can be:
47 * A single 64-bit integer
52 u64: a 64-bit unsigned integer
54 * A vring state description
62 * A vring address description
63 --------------------------------------------------------------
64 | index | flags | size | descriptor | used | available | log |
65 --------------------------------------------------------------
67 Index: a 32-bit vring index
68 Flags: a 32-bit vring flags
69 Descriptor: a 64-bit ring address of the vring descriptor table
70 Used: a 64-bit ring address of the vring used ring
71 Available: a 64-bit ring address of the vring available ring
72 Log: a 64-bit guest address for logging
74 Note that a ring address is an IOVA if VIRTIO_F_IOMMU_PLATFORM has been
75 negotiated. Otherwise it is a user address.
77 * Memory regions description
78 ---------------------------------------------------
79 | num regions | padding | region0 | ... | region7 |
80 ---------------------------------------------------
82 Num regions: a 32-bit number of regions
86 -----------------------------------------------------
87 | guest address | size | user address | mmap offset |
88 -----------------------------------------------------
90 Guest address: a 64-bit guest address of the region
92 User address: a 64-bit user address
93 mmap offset: 64-bit offset where region starts in the mapped memory
96 ---------------------------
97 | log size | log offset |
98 ---------------------------
99 log size: size of area used for logging
100 log offset: offset from start of supplied file descriptor
101 where logging starts (i.e. where guest address 0 would be logged)
104 ---------------------------------------------------------
105 | iova | size | user address | permissions flags | type |
106 ---------------------------------------------------------
108 IOVA: a 64-bit I/O virtual address programmed by the guest
110 User address: a 64-bit user address
   Permissions: an 8-bit value:
    - 3: Read/Write access
   Type: an 8-bit IOTLB message type:
    - 1: IOTLB miss
    - 2: IOTLB update
    - 3: IOTLB invalidate
    - 4: IOTLB access fail
In QEMU the vhost-user message is implemented with the following struct:

typedef struct VhostUserMsg {
    VhostUserRequest request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
        struct vhost_vring_state state;
        struct vhost_vring_addr addr;
        VhostUserMemory memory;
        VhostUserLog log;
        struct vhost_iotlb_msg iotlb;
    };
} QEMU_PACKED VhostUserMsg;
141 The protocol for vhost-user is based on the existing implementation of vhost
142 for the Linux Kernel. Most messages that can be sent via the Unix domain socket
143 implementing vhost-user have an equivalent ioctl to the kernel implementation.
The communication consists of the master sending message requests and the slave
sending message replies. Most of the requests don't require replies. Here is a
list of the ones that do:
149 * VHOST_USER_GET_FEATURES
150 * VHOST_USER_GET_PROTOCOL_FEATURES
151 * VHOST_USER_GET_VRING_BASE
152 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
154 [ Also see the section on REPLY_ACK protocol extension. ]
156 There are several messages that the master sends with file descriptors passed
157 in the ancillary data:
159 * VHOST_USER_SET_MEM_TABLE
160 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
161 * VHOST_USER_SET_LOG_FD
162 * VHOST_USER_SET_VRING_KICK
163 * VHOST_USER_SET_VRING_CALL
164 * VHOST_USER_SET_VRING_ERR
165 * VHOST_USER_SET_SLAVE_REQ_FD
167 If Master is unable to send the full message or receives a wrong reply it will
168 close the connection. An optional reconnection mechanism can be implemented.
Any protocol extensions are gated by protocol feature bits,
which allows full backwards compatibility on both master
and slave.
As older slaves don't support negotiating protocol features,
a feature bit was dedicated for this purpose:
#define VHOST_USER_F_PROTOCOL_FEATURES 30
Starting and stopping rings
---------------------------
179 Client must only process each ring when it is started.
181 Client must only pass data between the ring and the
182 backend, when the ring is enabled.
184 If ring is started but disabled, client must process the
185 ring without talking to the backend.
187 For example, for a networking device, in the disabled state
188 client must not supply any new RX packets, but must process
189 and discard any TX packets.
If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized
in an enabled state.

If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
in a disabled state. Client must not pass data to/from the backend until ring is
enabled by VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has been
disabled by VHOST_USER_SET_VRING_ENABLE with parameter 0.

Each ring is initialized in a stopped state. Client must not process it until
ring is started, or after it has been stopped.
202 Client must start ring upon receiving a kick (that is, detecting that file
203 descriptor is readable) on the descriptor specified by
204 VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
205 VHOST_USER_GET_VRING_BASE.
207 While processing the rings (whether they are enabled or not), client must
208 support changing some configuration aspects on the fly.
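
The started/stopped and enabled/disabled rules above can be summarised as a
small state machine. The sketch below is only an illustration of those rules:
all type and function names are hypothetical, and a real backend would track
much more per-ring state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-ring state as described in this section (names are illustrative). */
struct vu_ring {
    bool started;   /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled;   /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

/* What the client is allowed to do with the ring right now. */
enum vu_ring_action {
    VU_RING_IDLE,      /* stopped: do not process the ring at all */
    VU_RING_DISCARD,   /* started but disabled: process without the backend */
    VU_RING_FORWARD    /* started and enabled: pass data to/from the backend */
};

/* Rings start stopped; they start enabled only without protocol features. */
static void vu_ring_init(struct vu_ring *r, bool protocol_features_negotiated)
{
    assert(r != 0);
    r->started = false;
    r->enabled = !protocol_features_negotiated;
}

/* Kick fd became readable. */
static void vu_ring_on_kick(struct vu_ring *r)
{
    r->started = true;
}

/* VHOST_USER_GET_VRING_BASE received. */
static void vu_ring_on_get_vring_base(struct vu_ring *r)
{
    r->started = false;
}

/* VHOST_USER_SET_VRING_ENABLE received with parameter 1 or 0. */
static void vu_ring_on_set_vring_enable(struct vu_ring *r, uint32_t param)
{
    r->enabled = (param != 0);
}

static enum vu_ring_action vu_ring_action(const struct vu_ring *r)
{
    if (!r->started) {
        return VU_RING_IDLE;
    }
    return r->enabled ? VU_RING_FORWARD : VU_RING_DISCARD;
}
```

For a networking device, VU_RING_DISCARD corresponds to dropping TX packets
and supplying no RX packets.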
210 Multiple queue support
211 ----------------------
213 Multiple queue is treated as a protocol extension, hence the slave has to
214 implement protocol features first. The multiple queues feature is supported
215 only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.
The max number of queues the slave supports can be queried with message
VHOST_USER_GET_QUEUE_NUM. Master should stop when the number of
requested queues is bigger than that.
221 As all queues share one connection, the master uses a unique index for each
222 queue in the sent message to identify a specified queue. One queue pair
223 is enabled initially. More queues are enabled dynamically, by sending
224 message VHOST_USER_SET_VRING_ENABLE.
During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the VHOST_F_LOG_ALL vhost feature.

To start/stop logging of data/used ring writes, the master may send messages
VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.
All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
VHOST_VRING_F_LOG is part of ring's flags.
242 Dirty pages are of size:
243 #define VHOST_LOG_PAGE 0x1000
245 The log memory fd is provided in the ancillary data of
246 VHOST_USER_SET_LOG_BASE message when the slave has
247 VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.
249 The size of the log is supplied as part of VhostUserMsg
250 which should be large enough to cover all known guest
251 addresses. Log starts at the supplied offset in the
252 supplied file descriptor.
253 The log covers from address 0 to the maximum of guest
254 regions. In pseudo-code, to mark page at "addr" as dirty:
256 page = addr / VHOST_LOG_PAGE
257 log[page / 8] |= 1 << page % 8
259 Where addr is the guest physical address.
261 Use atomic operations, as the log may be concurrently manipulated.
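
The pseudo-code above could look as follows in C11, using an atomic OR so that
concurrent writers do not lose bits. This is a sketch under assumptions: the
function names are hypothetical, the log is assumed to be already mapped and
addressable as an array of atomic bytes, and addresses beyond the log are
silently ignored, which is a choice made here and not something the text
mandates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Mark the page containing guest physical address "addr" as dirty. */
static void vu_log_page(_Atomic uint8_t *log, uint64_t log_size, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    assert(log != 0);
    if (page / 8 >= log_size) {
        return; /* outside the log: ignored in this sketch */
    }
    atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page touched by a write of "len" bytes starting at "addr". */
static void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                         uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        vu_log_page(log, log_size, page * VHOST_LOG_PAGE);
    }
}
```

A write that straddles a page boundary marks both pages, which is why the
range helper iterates from the first to the last touched page.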
263 Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
264 is set for this ring), log_guest_addr should be used to calculate the log
265 offset: the write to first byte of the used ring is logged at this offset from
266 log start. Also note that this value might be outside the legal guest physical
267 address range (i.e. does not have to be covered by the VhostUserMemory table),
268 but the bit offset of the last byte of the ring must fall within
269 the size supplied by VhostUserLog.
VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.
Once the source has finished migration, rings will be stopped by
the source. No further update must be done before rings are
restarted.
282 The master sends a list of vhost memory regions to the slave using the
283 VHOST_USER_SET_MEM_TABLE message. Each region has two base addresses: a guest
284 address and a user address.
286 Messages contain guest addresses and/or user addresses to reference locations
287 within the shared memory. The mapping of these addresses works as follows.
289 User addresses map to the vhost memory region containing that user address.
291 When the VIRTIO_F_IOMMU_PLATFORM feature has not been negotiated:
 * Guest addresses map to the vhost memory region containing that guest
   address.
296 When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated:
298 * Guest addresses are also called I/O virtual addresses (IOVAs). They are
299 translated to user addresses via the IOTLB.
301 * The vhost memory region guest address is not used.
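
For the case without VIRTIO_F_IOMMU_PLATFORM, the lookup described above can be
sketched as a linear search over the regions received in
VHOST_USER_SET_MEM_TABLE. The struct and function names are hypothetical, the
64-bit width of the size field is an assumption, and the returned offset is
meant to be added to wherever the slave mapped the corresponding file
descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One memory region as laid out in the memory regions description. */
struct vu_mem_region {
    uint64_t guest_addr;
    uint64_t size;          /* width assumed to be 64 bits */
    uint64_t user_addr;
    uint64_t mmap_offset;
};

/* Which of the region's two base addresses an address is relative to. */
enum vu_addr_kind { VU_ADDR_GUEST, VU_ADDR_USER };

/*
 * Find the region containing "addr" and compute the offset of that address
 * inside the mapping of the region's fd. Returns false if no region matches.
 */
static bool vu_addr_to_offset(const struct vu_mem_region *regions, uint32_t n,
                              enum vu_addr_kind kind, uint64_t addr,
                              uint32_t *index, uint64_t *offset)
{
    assert(regions != NULL && index != NULL && offset != NULL);
    for (uint32_t i = 0; i < n; i++) {
        const struct vu_mem_region *r = &regions[i];
        uint64_t base = (kind == VU_ADDR_GUEST) ? r->guest_addr : r->user_addr;

        if (addr >= base && addr - base < r->size) {
            *index = i;
            *offset = r->mmap_offset + (addr - base);
            return true;
        }
    }
    return false;
}
```

When VIRTIO_F_IOMMU_PLATFORM has been negotiated, only the VU_ADDR_USER lookup
applies, after the IOVA has been translated through the IOTLB.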
306 When the VIRTIO_F_IOMMU_PLATFORM feature has been negotiated, the master
307 sends IOTLB entries update & invalidation by sending VHOST_USER_IOTLB_MSG
308 requests to the slave with a struct vhost_iotlb_msg as payload. For update
309 events, the iotlb payload has to be filled with the update message type (2),
310 the I/O virtual address, the size, the user virtual address, and the
311 permissions flags. Addresses and size must be within vhost memory regions set
312 via the VHOST_USER_SET_MEM_TABLE request. For invalidation events, the iotlb
313 payload has to be filled with the invalidation message type (3), the I/O virtual
314 address and the size. On success, the slave is expected to reply with a zero
315 payload, non-zero otherwise.
The slave relies on the slave communication channel (see "Slave communication"
section below) to send IOTLB miss and access failure events, by sending
VHOST_USER_SLAVE_IOTLB_MSG requests to the master with a struct vhost_iotlb_msg
as payload. For miss events, the iotlb payload has to be filled with the miss
message type (1), the I/O virtual address and the permissions flags. For access
failure events, the iotlb payload has to be filled with the access failure
message type (4), the I/O virtual address and the permissions flags.
For synchronization purposes, the slave may rely on the reply-ack feature,
so the master may send a reply when the operation is completed, provided the
reply-ack feature is negotiated and the slave requests a reply. For miss
events, a completed operation means either that the master sent an update
message containing the IOTLB entry for the requested address and permission,
or that the master sent nothing because the IOTLB miss message is invalid
(invalid IOVA or permission).
The master isn't expected to take the initiative to send IOTLB update messages,
as the slave sends IOTLB miss messages for the guest virtual memory areas it
needs to access.
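
To make the update, invalidate and miss flow more concrete, here is a small
sketch of a slave-side IOTLB cache. It is not taken from any implementation:
the struct mirrors the IOTLB message fields listed earlier, the fixed table
size and all names are hypothetical, permissions are assumed to be a bit mask
(inferred from value 3 meaning read/write), and address overflow is not
handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Message types as given in the text. */
#define VU_IOTLB_MISS        1
#define VU_IOTLB_UPDATE      2
#define VU_IOTLB_INVALIDATE  3
#define VU_IOTLB_ACCESS_FAIL 4

#define VU_IOTLB_MAX 16 /* hypothetical cache capacity */

struct vu_iotlb_msg {
    uint64_t iova;
    uint64_t size;
    uint64_t uaddr;
    uint8_t perm;
    uint8_t type;
};

struct vu_iotlb {
    struct vu_iotlb_msg entry[VU_IOTLB_MAX];
    unsigned count;
};

/* Apply an update or invalidate; the return value is the reply payload. */
static uint64_t vu_iotlb_handle(struct vu_iotlb *tlb,
                                const struct vu_iotlb_msg *m)
{
    assert(tlb != 0 && m != 0);
    switch (m->type) {
    case VU_IOTLB_UPDATE:
        if (tlb->count == VU_IOTLB_MAX) {
            return 1; /* cache full: report failure in this sketch */
        }
        tlb->entry[tlb->count++] = *m;
        return 0;
    case VU_IOTLB_INVALIDATE: {
        unsigned kept = 0;
        for (unsigned i = 0; i < tlb->count; i++) {
            const struct vu_iotlb_msg *e = &tlb->entry[i];
            bool overlap = e->iova < m->iova + m->size &&
                           m->iova < e->iova + e->size;
            if (!overlap) {
                tlb->entry[kept++] = tlb->entry[i];
            }
        }
        tlb->count = kept;
        return 0;
    }
    default:
        return 1; /* miss and access-fail messages go slave to master */
    }
}

/* Translate an IOVA; on a miss, fill in the message the slave would send. */
static bool vu_iotlb_translate(const struct vu_iotlb *tlb, uint64_t iova,
                               uint8_t perm, uint64_t *uaddr,
                               struct vu_iotlb_msg *miss)
{
    for (unsigned i = 0; i < tlb->count; i++) {
        const struct vu_iotlb_msg *e = &tlb->entry[i];
        if (iova >= e->iova && iova - e->iova < e->size &&
            (e->perm & perm) == perm) {
            *uaddr = e->uaddr + (iova - e->iova);
            return true;
        }
    }
    *miss = (struct vu_iotlb_msg){ .iova = iova, .perm = perm,
                                   .type = VU_IOTLB_MISS };
    return false;
}
```

The miss message built by vu_iotlb_translate() would be sent as a
VHOST_USER_SLAVE_IOTLB_MSG request over the slave communication channel.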
338 An optional communication channel is provided if the slave declares
339 VHOST_USER_PROTOCOL_F_SLAVE_REQ protocol feature, to allow the slave to make
340 requests to the master.
342 The fd is provided via VHOST_USER_SET_SLAVE_REQ_FD ancillary data.
344 A slave may then send VHOST_USER_SLAVE_* messages to the master
345 using this fd communication channel.
350 #define VHOST_USER_PROTOCOL_F_MQ 0
351 #define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
352 #define VHOST_USER_PROTOCOL_F_RARP 2
353 #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VHOST_USER_PROTOCOL_F_NET_MTU 4
355 #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5
356 #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6
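
As an illustration of how these bits are used, the sketch below computes the
set of protocol features a master might enable. The helper names are
hypothetical, and the policy of intersecting both sides' bitmasks is an
assumption about a typical master rather than a requirement stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES  30

#define VHOST_USER_PROTOCOL_F_MQ        0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3

/* Test a single bit in a 64-bit feature bitmask. */
static inline bool vu_has_feature(uint64_t features, unsigned bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1) != 0;
}

/*
 * Given the reply to VHOST_USER_GET_FEATURES, the reply to
 * VHOST_USER_GET_PROTOCOL_FEATURES and the master's own supported set,
 * return the bitmask to send with VHOST_USER_SET_PROTOCOL_FEATURES.
 * Returns 0 if the slave cannot negotiate protocol features at all.
 */
static uint64_t vu_negotiate_protocol_features(uint64_t slave_features,
                                               uint64_t slave_protocol,
                                               uint64_t master_protocol)
{
    if (!vu_has_feature(slave_features, VHOST_USER_F_PROTOCOL_FEATURES)) {
        return 0;
    }
    return slave_protocol & master_protocol;
}
```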
361 * VHOST_USER_GET_FEATURES
364 Equivalent ioctl: VHOST_GET_FEATURES
368 Get from the underlying vhost implementation the features bitmask.
369 Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
370 VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
372 * VHOST_USER_SET_FEATURES
      Equivalent ioctl: VHOST_SET_FEATURES
378 Enable features in the underlying vhost implementation using a bitmask.
379 Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
380 VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.
382 * VHOST_USER_GET_PROTOCOL_FEATURES
385 Equivalent ioctl: VHOST_GET_FEATURES
389 Get the protocol feature bitmask from the underlying vhost implementation.
390 Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
391 VHOST_USER_GET_FEATURES.
392 Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
393 this message even before VHOST_USER_SET_FEATURES was called.
395 * VHOST_USER_SET_PROTOCOL_FEATURES
      Equivalent ioctl: VHOST_SET_FEATURES
401 Enable protocol features in the underlying vhost implementation.
402 Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
403 VHOST_USER_GET_FEATURES.
404 Note: slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
405 this message even before VHOST_USER_SET_FEATURES was called.
407 * VHOST_USER_SET_OWNER
410 Equivalent ioctl: VHOST_SET_OWNER
413 Issued when a new connection is established. It sets the current Master
414 as an owner of the session. This can be used on the Slave as a
415 "session start" flag.
417 * VHOST_USER_RESET_OWNER
422 This is no longer used. Used to be sent to request disabling
423 all rings, but some clients interpreted it to also discard
424 connection state (this interpretation would lead to bugs).
425 It is recommended that clients either ignore this message,
426 or use it to disable all rings.
428 * VHOST_USER_SET_MEM_TABLE
431 Equivalent ioctl: VHOST_SET_MEM_TABLE
432 Master payload: memory regions description
434 Sets the memory map regions on the slave so it can translate the vring
435 addresses. In the ancillary data there is an array of file descriptors
436 for each memory mapped region. The size and ordering of the fds matches
437 the number and ordering of memory regions.
439 * VHOST_USER_SET_LOG_BASE
442 Equivalent ioctl: VHOST_SET_LOG_BASE
446 Sets logging shared memory space.
447 When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
448 feature, the log memory fd is provided in the ancillary data of
449 VHOST_USER_SET_LOG_BASE message, the size and offset of shared
450 memory area provided in the message.
453 * VHOST_USER_SET_LOG_FD
456 Equivalent ioctl: VHOST_SET_LOG_FD
459 Sets the logging file descriptor, which is passed as ancillary data.
461 * VHOST_USER_SET_VRING_NUM
464 Equivalent ioctl: VHOST_SET_VRING_NUM
465 Master payload: vring state description
467 Set the size of the queue.
469 * VHOST_USER_SET_VRING_ADDR
472 Equivalent ioctl: VHOST_SET_VRING_ADDR
473 Master payload: vring address description
476 Sets the addresses of the different aspects of the vring.
478 * VHOST_USER_SET_VRING_BASE
481 Equivalent ioctl: VHOST_SET_VRING_BASE
482 Master payload: vring state description
484 Sets the base offset in the available vring.
486 * VHOST_USER_GET_VRING_BASE
      Equivalent ioctl: VHOST_GET_VRING_BASE
490 Master payload: vring state description
491 Slave payload: vring state description
493 Get the available vring base offset.
495 * VHOST_USER_SET_VRING_KICK
498 Equivalent ioctl: VHOST_SET_VRING_KICK
501 Set the event file descriptor for adding buffers to the vring. It
502 is passed in the ancillary data.
503 Bits (0-7) of the payload contain the vring index. Bit 8 is the
504 invalid FD flag. This flag is set when there is no file descriptor
505 in the ancillary data. This signals that polling should be used
506 instead of waiting for a kick.
508 * VHOST_USER_SET_VRING_CALL
511 Equivalent ioctl: VHOST_SET_VRING_CALL
514 Set the event file descriptor to signal when buffers are used. It
515 is passed in the ancillary data.
516 Bits (0-7) of the payload contain the vring index. Bit 8 is the
517 invalid FD flag. This flag is set when there is no file descriptor
518 in the ancillary data. This signals that polling will be used
519 instead of waiting for the call.
521 * VHOST_USER_SET_VRING_ERR
524 Equivalent ioctl: VHOST_SET_VRING_ERR
527 Set the event file descriptor to signal when error occurs. It
528 is passed in the ancillary data.
529 Bits (0-7) of the payload contain the vring index. Bit 8 is the
530 invalid FD flag. This flag is set when there is no file descriptor
531 in the ancillary data.
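
The payload encoding shared by VHOST_USER_SET_VRING_KICK,
VHOST_USER_SET_VRING_CALL and VHOST_USER_SET_VRING_ERR can be sketched as
below. The macro and function names are hypothetical, and the payload is
assumed to be the single 64-bit integer form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VU_VRING_IDX_MASK  0xffULL      /* bits 0-7: vring index */
#define VU_VRING_NOFD_MASK (1ULL << 8)  /* bit 8: invalid FD flag */

/* Extract the vring index from the payload. */
static inline unsigned vu_vring_index(uint64_t payload)
{
    return (unsigned)(payload & VU_VRING_IDX_MASK);
}

/* True when no file descriptor accompanies the message. */
static inline bool vu_vring_no_fd(uint64_t payload)
{
    return (payload & VU_VRING_NOFD_MASK) != 0;
}

/* Build the payload on the master side. */
static inline uint64_t vu_vring_fd_payload(unsigned index, bool have_fd)
{
    assert(index <= VU_VRING_IDX_MASK);
    return (uint64_t)index | (have_fd ? 0 : VU_VRING_NOFD_MASK);
}
```

A slave that sees the invalid FD flag on a kick or call message should fall
back to polling for that ring, as described above.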
533 * VHOST_USER_GET_QUEUE_NUM
536 Equivalent ioctl: N/A
540 Query how many queues the backend supports. This request should be
541 sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol
542 features by VHOST_USER_GET_PROTOCOL_FEATURES.
544 * VHOST_USER_SET_VRING_ENABLE
547 Equivalent ioctl: N/A
548 Master payload: vring state description
550 Signal slave to enable or disable corresponding vring.
      This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
      has been negotiated.
554 * VHOST_USER_SEND_RARP
557 Equivalent ioctl: N/A
      Ask vhost user backend to broadcast a fake RARP to notify that the
      migration is terminated, for guests that do not support GUEST_ANNOUNCE.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
      is present in VHOST_USER_GET_PROTOCOL_FEATURES.
      The first 6 bytes of the payload contain the MAC address of the guest to
      allow the vhost user backend to construct and broadcast the fake RARP.
568 * VHOST_USER_NET_SET_MTU
571 Equivalent ioctl: N/A
574 Set host MTU value exposed to the guest.
575 This request should be sent only when VIRTIO_NET_F_MTU feature has been
576 successfully negotiated, VHOST_USER_F_PROTOCOL_FEATURES is present in
577 VHOST_USER_GET_FEATURES and protocol feature bit
578 VHOST_USER_PROTOCOL_F_NET_MTU is present in
579 VHOST_USER_GET_PROTOCOL_FEATURES.
580 If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond
581 with zero in case the specified MTU is valid, or non-zero otherwise.
583 * VHOST_USER_SET_SLAVE_REQ_FD
586 Equivalent ioctl: N/A
589 Set the socket file descriptor for slave initiated requests. It is passed
590 in the ancillary data.
      This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
      has been negotiated, and protocol feature bit
      VHOST_USER_PROTOCOL_F_SLAVE_REQ is present in
      VHOST_USER_GET_PROTOCOL_FEATURES.
594 If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond
595 with zero for success, non-zero otherwise.
597 * VHOST_USER_IOTLB_MSG
600 Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type)
601 Master payload: struct vhost_iotlb_msg
      Send IOTLB messages with struct vhost_iotlb_msg as payload.
      Master sends such requests to update and invalidate entries in the device
      IOTLB. The slave has to acknowledge the request by sending zero as u64
      payload for success, non-zero otherwise.
      This request should be sent only when VIRTIO_F_IOMMU_PLATFORM feature
      has been successfully negotiated.
611 * VHOST_USER_SET_VRING_ENDIAN
614 Equivalent ioctl: VHOST_SET_VRING_ENDIAN
615 Master payload: vring state description
      Set the endianness of a VQ for legacy devices. Little-endian is indicated
      with state.num set to 0 and big-endian is indicated with state.num set
      to 1. Other values are invalid.
      This request should be sent only when VHOST_USER_PROTOCOL_F_CROSS_ENDIAN
      has been negotiated.
      Backends that negotiated this feature should handle both endiannesses
      and expect this message once (per VQ) during device configuration
      (i.e. before the master starts the VQ).
629 * VHOST_USER_SLAVE_IOTLB_MSG
632 Equivalent ioctl: N/A (equivalent to VHOST_IOTLB_MSG message type)
633 Slave payload: struct vhost_iotlb_msg
636 Send IOTLB messages with struct vhost_iotlb_msg as payload.
637 Slave sends such requests to notify of an IOTLB miss, or an IOTLB
638 access failure. If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated,
639 and slave set the VHOST_USER_NEED_REPLY flag, master must respond with
640 zero when operation is successfully completed, or non-zero otherwise.
      This request should be sent only when VIRTIO_F_IOMMU_PLATFORM feature
      has been successfully negotiated.
644 VHOST_USER_PROTOCOL_F_REPLY_ACK:
645 -------------------------------
646 The original vhost-user specification only demands replies for certain
647 commands. This differs from the vhost protocol implementation where commands
648 are sent over an ioctl() call and block until the client has completed.
With this protocol extension negotiated, the sender (QEMU) can set the
"need_reply" [Bit 3] flag on any command. This indicates that
the client MUST respond with a VhostUserMsg whose payload indicates success or
failure. The payload should be set to zero on success or non-zero on failure,
unless the message already has an explicit reply body.

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to be more
resilient for selective requests.
661 For the message types that already solicit a reply from the client, the
662 presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
663 no behavioural change. (See the 'Communication' section for details.)
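
The reply rules of this extension can be condensed into a small decision
helper. This is a sketch only: the enum, the function names and the
"has_explicit_reply" parameter are hypothetical, and the caller is assumed to
know which requests already solicit a reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VU_FLAG_NEED_REPLY (1u << 3) /* bit 3 of the header flags */

enum vu_reply_kind {
    VU_NO_REPLY,    /* nothing is sent back */
    VU_REPLY_BODY,  /* the request's own explicit reply body */
    VU_REPLY_ACK    /* u64 payload: zero on success, non-zero on failure */
};

/* Decide what the client sends back for a received request. */
static enum vu_reply_kind vu_reply_kind(uint32_t flags,
                                        bool reply_ack_negotiated,
                                        bool has_explicit_reply)
{
    if (has_explicit_reply) {
        return VU_REPLY_BODY; /* need_reply brings no behavioural change */
    }
    if (reply_ack_negotiated && (flags & VU_FLAG_NEED_REPLY)) {
        return VU_REPLY_ACK;
    }
    return VU_NO_REPLY;
}

/* Payload of an acknowledgement for a handler result ("err" is 0 on success). */
static inline uint64_t vu_ack_payload(int err)
{
    return err ? 1 : 0;
}
```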