Copyright (c) 2014 Virtual Open Systems Sarl.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

This protocol aims to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
communicates over a Unix domain socket, which lets it share file descriptors in
the ancillary data of a message.

The protocol defines two sides of the communication, master and slave. The master
is the application that shares its virtqueues, in our case QEMU. The slave is the
consumer of the virtqueues.

In the current implementation QEMU is the master, and the slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can each be either a client (i.e. connecting) or a server
(i.e. listening) in the socket communication.

Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:

------------------------------------
| request | flags | size | payload |
------------------------------------

 * Request: 32-bit type of the request
 * Flags: 32-bit bit field:
   - Lower 2 bits are the version (currently 0x01)
   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
   - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
     details.
 * Size: 32-bit size of the payload

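The flags layout above can be sketched in C as follows. This is only an
illustration: the struct and helper names (vu_header, VU_FLAG_*, vu_reply_header)
are hypothetical and not taken from QEMU, and echoing the request type in the
reply header is an assumption rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the flags description above. */
#define VU_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: version */
#define VU_FLAG_REPLY        (1u << 2)  /* bit 2: reply flag */
#define VU_FLAG_NEED_REPLY   (1u << 3)  /* bit 3: need_reply flag */
#define VU_VERSION           0x1u       /* current version */

/* The three header fields, all in machine native byte order. */
struct vu_header {
    uint32_t request;
    uint32_t flags;
    uint32_t size;   /* size of the payload that follows */
};

static uint32_t vu_flags_version(uint32_t flags)
{
    return flags & VU_FLAG_VERSION_MASK;
}

static bool vu_flags_is_reply(uint32_t flags)
{
    return (flags & VU_FLAG_REPLY) != 0;
}

static bool vu_flags_need_reply(uint32_t flags)
{
    return (flags & VU_FLAG_NEED_REPLY) != 0;
}

/*
 * Build the header of a slave reply carrying payload_size bytes.
 * Assumption: the reply repeats the request type of the message it answers.
 */
static struct vu_header vu_reply_header(const struct vu_header *req,
                                        uint32_t payload_size)
{
    assert(req != NULL);
    struct vu_header rep = {
        .request = req->request,
        .flags = VU_VERSION | VU_FLAG_REPLY,
        .size = payload_size,
    };
    return rep;
}
```

A receiver would typically check vu_flags_version() first and reject a message
whose version it does not understand before looking at the other bits.
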
Depending on the request type, payload can be:

 * A single 64-bit integer

   u64: a 64-bit unsigned integer

 * A vring state description

 * A vring address description
   --------------------------------------------------------------
   | index | flags | size | descriptor | used | available | log |
   --------------------------------------------------------------

   Index: a 32-bit vring index
   Flags: a 32-bit vring flags
   Descriptor: a 64-bit user address of the vring descriptor table
   Used: a 64-bit user address of the vring used ring
   Available: a 64-bit user address of the vring available ring
   Log: a 64-bit guest address for logging

 * Memory regions description
   ---------------------------------------------------
   | num regions | padding | region0 | ... | region7 |
   ---------------------------------------------------

   Num regions: a 32-bit number of regions

   A region is:
   -----------------------------------------------------
   | guest address | size | user address | mmap offset |
   -----------------------------------------------------

   Guest address: a 64-bit guest address of the region
   User address: a 64-bit user address
   mmap offset: 64-bit offset where region starts in the mapped memory

 * Log description
   ---------------------------
   | log size | log offset |
   ---------------------------
   log size: size of area used for logging
   log offset: offset from start of supplied file descriptor
       where logging starts (i.e. where guest address 0 would be logged)

In QEMU the vhost-user message is implemented with the following struct:

typedef struct VhostUserMsg {
    VhostUserRequest request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
        struct vhost_vring_state state;
        struct vhost_vring_addr addr;
        VhostUserMemory memory;
        VhostUserLog log;
    };
} QEMU_PACKED VhostUserMsg;

Communication
-------------

The protocol for vhost-user is based on the existing implementation of vhost
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.

The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:

 * VHOST_USER_GET_FEATURES
 * VHOST_USER_GET_PROTOCOL_FEATURES
 * VHOST_USER_GET_VRING_BASE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)

[ Also see the section on REPLY_ACK protocol extension. ]

There are several messages that the master sends with file descriptors passed
in the ancillary data:

 * VHOST_USER_SET_MEM_TABLE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
 * VHOST_USER_SET_LOG_FD
 * VHOST_USER_SET_VRING_KICK
 * VHOST_USER_SET_VRING_CALL
 * VHOST_USER_SET_VRING_ERR

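Passing file descriptors in ancillary data is done with SCM_RIGHTS control
messages on the Unix domain socket. The following is a minimal sketch of both
directions using the standard sendmsg()/recvmsg() calls. The function names and
the VU_MAX_FDS limit (one descriptor per memory region, at most 8) are
assumptions made for this example, not part of the protocol text.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define VU_MAX_FDS 8  /* assumption: at most one fd per memory region */

/* Control buffer, kept in a union so it is aligned for struct cmsghdr. */
union vu_control {
    char buf[CMSG_SPACE(sizeof(int) * VU_MAX_FDS)];
    struct cmsghdr align;
};

/* Send len bytes plus nfds descriptors as SCM_RIGHTS ancillary data. */
static ssize_t vu_send_with_fds(int sock, const void *buf, size_t len,
                                const int *fds, size_t nfds)
{
    assert(nfds <= VU_MAX_FDS);
    union vu_control control;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    if (nfds > 0) {
        memset(&control, 0, sizeof(control));
        msg.msg_control = control.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
        memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
    }
    return sendmsg(sock, &msg, 0);
}

/*
 * Receive up to len bytes. Any descriptors that arrived with the data are
 * stored in fds (room for VU_MAX_FDS entries) and counted in *nfds.
 */
static ssize_t vu_recv_with_fds(int sock, void *buf, size_t len,
                                int *fds, size_t *nfds)
{
    union vu_control control;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = control.buf,
        .msg_controllen = sizeof(control.buf),
    };

    *nfds = 0;
    ssize_t n = recvmsg(sock, &msg, 0);
    if (n < 0) {
        return n;
    }
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            *nfds = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
            memcpy(fds, CMSG_DATA(c), *nfds * sizeof(int));
        }
    }
    return n;
}
```

A real implementation would also loop on short reads and close any received
descriptors it does not expect for the given request.
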
If the master is unable to send the full message or receives a wrong reply it
will close the connection. An optional reconnection mechanism can be implemented.

Any protocol extensions are gated by protocol feature bits,
which allows full backwards compatibility on both master
and slave.
As older slaves don't support negotiating protocol features,
a feature bit was dedicated for this purpose:
#define VHOST_USER_F_PROTOCOL_FEATURES 30

Starting and stopping rings
---------------------------
The client must only process each ring when it is started.

The client must only pass data between the ring and the
backend when the ring is enabled.

If a ring is started but disabled, the client must process the
ring without talking to the backend.

For example, for a networking device, in the disabled state the
client must not supply any new RX packets, but must process
and discard any TX packets.

If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is initialized
in an enabled state.

If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
in a disabled state. The client must not pass data to/from the backend until the
ring is enabled by VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has
been disabled by VHOST_USER_SET_VRING_ENABLE with parameter 0.

Each ring is initialized in a stopped state. The client must not process it until
the ring is started, or after it has been stopped.

The client must start a ring upon receiving a kick (that is, detecting that the
file descriptor is readable) on the descriptor specified by
VHOST_USER_SET_VRING_KICK, and stop the ring upon receiving
VHOST_USER_GET_VRING_BASE.

While processing the rings (whether they are enabled or not), the client must
support changing some configuration aspects on the fly.

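The started/stopped and enabled/disabled rules above amount to two independent
flags per ring. The sketch below models them in C. All names are hypothetical,
and it assumes that a ring starts out enabled when
VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-ring state: two independent flags. */
struct vu_ring {
    bool started;  /* the ring may be processed */
    bool enabled;  /* data may flow between ring and backend */
};

/* Every ring begins stopped; the initial enabled state depends on whether
 * VHOST_USER_F_PROTOCOL_FEATURES was negotiated (assumed: enabled if not). */
static void vu_ring_init(struct vu_ring *r, bool protocol_features_negotiated)
{
    assert(r != NULL);
    r->started = false;
    r->enabled = !protocol_features_negotiated;
}

/* A kick on the VHOST_USER_SET_VRING_KICK descriptor starts the ring. */
static void vu_ring_on_kick(struct vu_ring *r)
{
    r->started = true;
}

/* VHOST_USER_GET_VRING_BASE stops the ring. */
static void vu_ring_on_get_vring_base(struct vu_ring *r)
{
    r->started = false;
}

/* VHOST_USER_SET_VRING_ENABLE: parameter 1 enables, 0 disables. */
static void vu_ring_on_set_vring_enable(struct vu_ring *r, unsigned param)
{
    r->enabled = (param != 0);
}

/* The ring may be processed only while it is started. */
static bool vu_ring_may_process(const struct vu_ring *r)
{
    return r->started;
}

/* Data may reach the backend only while started and enabled. */
static bool vu_ring_may_use_backend(const struct vu_ring *r)
{
    return r->started && r->enabled;
}
```

For a networking device, a ring for which vu_ring_may_process() is true but
vu_ring_may_use_backend() is false corresponds to the case where TX packets are
processed and discarded and no RX packets are supplied.
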
Multiple queue support
----------------------

Multiple queue support is treated as a protocol extension, hence the slave has
to implement protocol features first. The multiple queues feature is supported
only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.

The max number of queues the slave supports can be queried with message
VHOST_USER_GET_QUEUE_NUM. The master should stop when the number of
requested queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue. One queue pair
is enabled initially. More queues are enabled dynamically, by sending
message VHOST_USER_SET_VRING_ENABLE.

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the VHOST_F_LOG_ALL vhost feature.

To start/stop logging of data/used ring writes, the master may send messages
VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
VHOST_VRING_F_LOG is part of ring's flags.

Dirty pages are of size:
#define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of the
VHOST_USER_SET_LOG_BASE message when the slave has the
VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.

The size of the log is supplied as part of VhostUserMsg
which should be large enough to cover all known guest
addresses. Log starts at the supplied offset in the
supplied file descriptor.
The log covers from address 0 to the maximum of guest
regions. In pseudo-code, to mark page at "addr" as dirty:

page = addr / VHOST_LOG_PAGE
log[page / 8] |= 1 << (page % 8)

Where addr is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.

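The pseudo-code above, combined with the requirement to use atomic operations,
could look like the following C11 sketch. The function names and the explicit
log_size bound check are assumptions of this example. The log is treated as an
array of atomic bytes, one bit per VHOST_LOG_PAGE-sized page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Atomically set the dirty bit of the page that contains guest physical
 * address addr. log_size is the size of the log area in bytes. */
static void vu_log_page(_Atomic uint8_t *log, uint64_t log_size, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    assert(page / 8 < log_size);
    atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
}

/* Mark every page touched by a write of len bytes starting at addr. */
static void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                         uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        vu_log_page(log, log_size, page * VHOST_LOG_PAGE);
    }
}
```

A write that straddles a page boundary sets two bits, which is why the range
helper iterates from the first to the last touched page instead of logging only
the start address.
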
Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
is set for this ring), log_guest_addr should be used to calculate the log
offset: the write to the first byte of the used ring is logged at this offset
from log start. Also note that this value might be outside the legal guest
physical address range (i.e. does not have to be covered by the VhostUserMemory
table), but the bit offset of the last byte of the ring must fall within
the size supplied by VhostUserLog.

VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data. It may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by
the source. No further update must be done before rings are
restarted.

#define VHOST_USER_PROTOCOL_F_MQ             0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
#define VHOST_USER_PROTOCOL_F_RARP           2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK      3
#define VHOST_USER_PROTOCOL_F_NET_MTU        4

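As an illustration of how these bits are gated by
VHOST_USER_F_PROTOCOL_FEATURES, the sketch below computes the set of protocol
features a master could enable. The helper names are hypothetical, and taking
the intersection of what both sides support is an assumption about a reasonable
master policy, not a rule stated in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES  30

#define VHOST_USER_PROTOCOL_F_MQ        0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD 1
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3

/* Test a single bit in a 64-bit feature mask. */
static bool vu_has_feature(uint64_t features, unsigned bit)
{
    assert(bit < 64);
    return ((features >> bit) & 1) != 0;
}

/*
 * Protocol features the master may enable. Protocol features are only legal
 * when the slave reported VHOST_USER_F_PROTOCOL_FEATURES, so without that
 * bit the result is empty.
 */
static uint64_t vu_negotiate_protocol_features(uint64_t slave_features,
                                               uint64_t slave_protocol,
                                               uint64_t master_protocol)
{
    if (!vu_has_feature(slave_features, VHOST_USER_F_PROTOCOL_FEATURES)) {
        return 0;
    }
    return slave_protocol & master_protocol;
}
```

The returned mask would then be the argument of
VHOST_USER_SET_PROTOCOL_FEATURES.
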
 * VHOST_USER_GET_FEATURES

      Equivalent ioctl: VHOST_GET_FEATURES

      Get the features bitmask from the underlying vhost implementation.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_FEATURES

      Equivalent ioctl: VHOST_SET_FEATURES

      Enable features in the underlying vhost implementation using a bitmask.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_GET_PROTOCOL_FEATURES

      Equivalent ioctl: VHOST_GET_FEATURES

      Get the protocol feature bitmask from the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: a slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_PROTOCOL_FEATURES

      Equivalent ioctl: VHOST_SET_FEATURES

      Enable protocol features in the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: a slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_OWNER

      Equivalent ioctl: VHOST_SET_OWNER

      Issued when a new connection is established. It sets the current master
      as the owner of the session. This can be used on the slave as a
      "session start" flag.

 * VHOST_USER_RESET_OWNER

      This is no longer used. It used to be sent to request disabling
      all rings, but some clients interpreted it to also discard
      connection state (this interpretation would lead to bugs).
      It is recommended that clients either ignore this message,
      or use it to disable all rings.

 * VHOST_USER_SET_MEM_TABLE

      Equivalent ioctl: VHOST_SET_MEM_TABLE
      Master payload: memory regions description

      Sets the memory map regions on the slave so it can translate the vring
      addresses. In the ancillary data there is an array of file descriptors,
      one for each memory mapped region. The number and ordering of the fds
      match the number and ordering of memory regions.

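To illustrate the translation step, the sketch below looks up a master user
address (such as the descriptor, used or available address of a vring) in the
region table and returns the matching region index and the offset inside that
region's file descriptor. The struct and function names are hypothetical, the
32-bit width of the padding field is an assumption, and a real slave would
add the result to the base of its own mmap() of the fd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VU_MAX_REGIONS 8  /* region0 .. region7 */

/* Hypothetical field names; layout mirrors the region description. */
struct vu_region {
    uint64_t guest_addr;
    uint64_t size;
    uint64_t user_addr;
    uint64_t mmap_offset;
};

struct vu_mem_table {
    uint32_t nregions;
    uint32_t padding;  /* assumed to be 32 bits wide */
    struct vu_region regions[VU_MAX_REGIONS];
};

/*
 * Find the region containing master user address uaddr. On success, store
 * the region index (which is also the index of its fd in the ancillary data)
 * and the offset of uaddr within that fd, then return true.
 */
static bool vu_translate_user_addr(const struct vu_mem_table *t, uint64_t uaddr,
                                   uint32_t *region, uint64_t *fd_offset)
{
    assert(t != NULL && region != NULL && fd_offset != NULL);
    for (uint32_t i = 0; i < t->nregions && i < VU_MAX_REGIONS; i++) {
        const struct vu_region *r = &t->regions[i];

        if (uaddr >= r->user_addr && uaddr - r->user_addr < r->size) {
            *region = i;
            *fd_offset = r->mmap_offset + (uaddr - r->user_addr);
            return true;
        }
    }
    return false;
}
```

An address that falls in no region should be treated as an error by the slave
rather than dereferenced.
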
 * VHOST_USER_SET_LOG_BASE

      Equivalent ioctl: VHOST_SET_LOG_BASE

      Sets logging shared memory space.
      When the slave has the VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
      feature, the log memory fd is provided in the ancillary data of the
      VHOST_USER_SET_LOG_BASE message, and the size and offset of the shared
      memory area are provided in the message.

 * VHOST_USER_SET_LOG_FD

      Equivalent ioctl: VHOST_SET_LOG_FD

      Sets the logging file descriptor, which is passed as ancillary data.

 * VHOST_USER_SET_VRING_NUM

      Equivalent ioctl: VHOST_SET_VRING_NUM
      Master payload: vring state description

      Set the size of the queue.

 * VHOST_USER_SET_VRING_ADDR

      Equivalent ioctl: VHOST_SET_VRING_ADDR
      Master payload: vring address description

      Sets the addresses of the different aspects of the vring.

 * VHOST_USER_SET_VRING_BASE

      Equivalent ioctl: VHOST_SET_VRING_BASE
      Master payload: vring state description

      Sets the base offset in the available vring.

 * VHOST_USER_GET_VRING_BASE

      Equivalent ioctl: VHOST_GET_VRING_BASE
      Master payload: vring state description
      Slave payload: vring state description

      Get the available vring base offset.

 * VHOST_USER_SET_VRING_KICK

      Equivalent ioctl: VHOST_SET_VRING_KICK

      Set the event file descriptor for adding buffers to the vring. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling should be used
      instead of waiting for a kick.

 * VHOST_USER_SET_VRING_CALL

      Equivalent ioctl: VHOST_SET_VRING_CALL

      Set the event file descriptor to signal when buffers are used. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling will be used
      instead of waiting for the call.

 * VHOST_USER_SET_VRING_ERR

      Equivalent ioctl: VHOST_SET_VRING_ERR

      Set the event file descriptor to signal when an error occurs. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data.

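The payload layout shared by the KICK, CALL and ERR messages above (vring index
in bits 0-7, invalid FD flag in bit 8) can be decoded as in this sketch. The
macro and function names are made up for the example, and it assumes the
payload is the single 64-bit integer form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the two fields described above. */
#define VU_VRING_IDX_MASK 0xffu      /* bits 0-7: vring index */
#define VU_VRING_NOFD     (1u << 8)  /* bit 8: invalid FD flag */

/* Build the payload for a given vring index. */
static uint64_t vu_vring_fd_payload(unsigned index, bool has_fd)
{
    assert(index <= VU_VRING_IDX_MASK);
    return (uint64_t)index | (has_fd ? 0 : VU_VRING_NOFD);
}

/* Extract the vring index from the payload. */
static unsigned vu_vring_fd_index(uint64_t payload)
{
    return (unsigned)(payload & VU_VRING_IDX_MASK);
}

/* True when a file descriptor accompanies the message; false means the
 * invalid FD flag is set and polling is used instead. */
static bool vu_vring_fd_present(uint64_t payload)
{
    return (payload & VU_VRING_NOFD) == 0;
}
```
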
 * VHOST_USER_GET_QUEUE_NUM

      Equivalent ioctl: N/A

      Query how many queues the backend supports. This request should be
      sent only when VHOST_USER_PROTOCOL_F_MQ is set in the protocol
      features returned by VHOST_USER_GET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_VRING_ENABLE

      Equivalent ioctl: N/A
      Master payload: vring state description

      Signal the slave to enable or disable the corresponding vring.
      This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
      has been negotiated.

 * VHOST_USER_SEND_RARP

      Equivalent ioctl: N/A

      Ask the vhost user backend to broadcast a fake RARP to notify that the
      migration has terminated, for guests that do not support GUEST_ANNOUNCE.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
      is present in VHOST_USER_GET_PROTOCOL_FEATURES.
      The first 6 bytes of the payload contain the MAC address of the guest to
      allow the vhost user backend to construct and broadcast the fake RARP.

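Since the MAC address occupies the first 6 bytes of the payload, a backend can
copy it out of the payload's in-memory representation without worrying about
byte order. This sketch assumes the payload is held as a 64-bit integer; the
function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VU_MAC_LEN 6

/* Copy the first 6 bytes of the payload, as laid out in memory, into mac. */
static void vu_rarp_mac(uint64_t payload, uint8_t mac[VU_MAC_LEN])
{
    assert(mac != NULL);
    memcpy(mac, &payload, VU_MAC_LEN);
}
```
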
 * VHOST_USER_NET_SET_MTU

      Equivalent ioctl: N/A

      Set host MTU value exposed to the guest.
      This request should be sent only when VIRTIO_NET_F_MTU feature has been
      successfully negotiated, VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES and protocol feature bit
      VHOST_USER_PROTOCOL_F_NET_MTU is present in
      VHOST_USER_GET_PROTOCOL_FEATURES.
      If VHOST_USER_PROTOCOL_F_REPLY_ACK is negotiated, slave must respond
      with zero in case the specified MTU is valid, or non-zero otherwise.

VHOST_USER_PROTOCOL_F_REPLY_ACK:
--------------------------------
The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where commands
are sent over an ioctl() call and block until the client has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
"need_reply" [Bit 3] flag on any command. This indicates that
the client MUST respond with a Payload VhostUserMsg indicating success or
failure. The payload should be set to zero on success or non-zero on failure,
unless the message already has an explicit reply body.

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
resilient for selective requests.

For the message types that already solicit a reply from the client, the
presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
no behavioural change. (See the 'Communication' section for details.)
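
On the slave side, the rules of this extension reduce to a small decision. The
sketch below is one possible reading: an acknowledgement is sent only when the
extension was negotiated, the need_reply bit is set, and the request does not
already have an explicit reply body. The names are hypothetical, and the choice
of 1 as the non-zero failure value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VU_FLAG_NEED_REPLY (1u << 3)  /* bit 3 of the flags field */

/* Decide whether the slave has to send a zero/non-zero acknowledgement. */
static bool vu_must_send_ack(uint32_t flags, bool reply_ack_negotiated,
                             bool has_explicit_reply)
{
    if (has_explicit_reply) {
        /* The request is answered with its own reply body instead. */
        return false;
    }
    return reply_ack_negotiated && (flags & VU_FLAG_NEED_REPLY) != 0;
}

/* Payload of the acknowledgement: zero on success, non-zero on failure. */
static uint64_t vu_ack_payload(int err)
{
    return err == 0 ? 0 : 1;
}
```
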