Copyright (c) 2014 Virtual Open Systems Sarl.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

This protocol aims to complement the ioctl interface used to control the
vhost implementation in the Linux kernel. It implements the control plane needed
to establish virtqueue sharing with a user space process on the same host. It
communicates over a Unix domain socket and shares file descriptors in the
ancillary data of the messages.

The protocol defines two sides of the communication, master and slave. Master is
the application that shares its virtqueues, in our case QEMU. Slave is the
consumer of the virtqueues.

In the current implementation QEMU is the Master, and the Slave is intended to
be a software Ethernet switch running in user space, such as Snabbswitch.

Master and slave can be either a client (i.e. connecting) or server (listening)
in the socket communication.

Note that all numbers are in the machine native byte order. A vhost-user message
consists of 3 header fields and a payload:

------------------------------------
| request | flags | size | payload |
------------------------------------

 * Request: 32-bit type of the request
 * Flags: 32-bit bit field:
   - Lower 2 bits are the version (currently 0x01)
   - Bit 2 is the reply flag - needs to be sent on each reply from the slave
   - Bit 3 is the need_reply flag - see VHOST_USER_PROTOCOL_F_REPLY_ACK for
     details.
 * Size: 32-bit size of the payload
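
The flag bits listed above can be decoded with a few masks. The following is
a minimal sketch, not QEMU's code: the struct, macro and function names are
made up for illustration, and only the bit positions come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout taken from the flags description; names are illustrative. */
#define VU_VERSION_MASK    0x3u        /* lower 2 bits: protocol version */
#define VU_VERSION         0x1u        /* current version */
#define VU_REPLY_MASK      (1u << 2)   /* set on every reply from the slave */
#define VU_NEED_REPLY_MASK (1u << 3)   /* master asks for an explicit reply */

/* Fixed-size header that precedes the payload (hypothetical type name). */
struct vu_header {
    uint32_t request;   /* type of the request */
    uint32_t flags;     /* version, reply and need_reply bits */
    uint32_t size;      /* payload size in bytes */
};

static inline uint32_t vu_version(uint32_t flags)
{
    return flags & VU_VERSION_MASK;
}

static inline bool vu_is_reply(uint32_t flags)
{
    return (flags & VU_REPLY_MASK) != 0;
}

static inline bool vu_needs_reply(uint32_t flags)
{
    return (flags & VU_NEED_REPLY_MASK) != 0;
}

/* Flags a slave would put on a reply: current version plus the reply bit. */
static inline uint32_t vu_reply_flags(void)
{
    return VU_VERSION | VU_REPLY_MASK;
}
```

A receiver would typically check vu_version() first and drop the connection
on a mismatch before looking at the request type.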

Depending on the request type, payload can be:

 * A single 64-bit integer

   u64: a 64-bit unsigned integer

 * A vring state description

 * A vring address description
   --------------------------------------------------------------
   | index | flags | size | descriptor | used | available | log |
   --------------------------------------------------------------

   Index: a 32-bit vring index
   Flags: 32-bit vring flags
   Descriptor: a 64-bit user address of the vring descriptor table
   Used: a 64-bit user address of the vring used ring
   Available: a 64-bit user address of the vring available ring
   Log: a 64-bit guest address for logging

 * Memory regions description
   ---------------------------------------------------
   | num regions | padding | region0 | ... | region7 |
   ---------------------------------------------------

   Num regions: a 32-bit number of regions

   -----------------------------------------------------
   | guest address | size | user address | mmap offset |
   -----------------------------------------------------

   Guest address: a 64-bit guest address of the region
   User address: a 64-bit user address
   mmap offset: 64-bit offset where region starts in the mapped memory
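
To illustrate how a slave might use these fields, here is a hedged sketch of
a guest address lookup. It assumes the slave has already mapped each region's
file descriptor and stored the resulting base pointer; the types, the
mmap_base field and the function name are hypothetical, and the size field is
assumed to be 64 bits wide like the other region fields.

```c
#include <stddef.h>
#include <stdint.h>

/* One region entry, field order as in the diagram (assumed 64-bit each). */
struct vu_region {
    uint64_t guest_addr;   /* guest address of the region */
    uint64_t size;         /* region size in bytes */
    uint64_t user_addr;    /* user address in the master */
    uint64_t mmap_offset;  /* where the region starts in the mapped memory */
};

/* Hypothetical slave-side bookkeeping: the region as received plus the
 * address returned by the slave's own mmap() of the matching fd. */
struct vu_mapped_region {
    struct vu_region desc;
    uint8_t *mmap_base;
};

/* Translate a guest address to a local pointer. Returns NULL unless one
 * region contains the whole range [guest_addr, guest_addr + len). */
static void *vu_gpa_to_va(const struct vu_mapped_region *regions,
                          uint32_t num_regions,
                          uint64_t guest_addr, uint64_t len)
{
    for (uint32_t i = 0; i < num_regions; i++) {
        const struct vu_region *r = &regions[i].desc;

        if (guest_addr < r->guest_addr) {
            continue;
        }
        uint64_t off = guest_addr - r->guest_addr;
        /* Written this way to avoid overflow in off + len. */
        if (off >= r->size || len > r->size - off) {
            continue;
        }
        return regions[i].mmap_base + r->mmap_offset + off;
    }
    return NULL;
}
```

Rejecting ranges that straddle a region boundary is a design choice of this
sketch; a real implementation could instead split the access per region.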

 * Log description
   -------------------------
   | log size | log offset |
   -------------------------
   log size: size of area used for logging
   log offset: offset from start of supplied file descriptor
       where logging starts (i.e. where guest address 0 would be logged)

In QEMU the vhost-user message is implemented with the following struct:

typedef struct VhostUserMsg {
    VhostUserRequest request;
    uint32_t flags;
    uint32_t size;
    union {
        uint64_t u64;
        struct vhost_vring_state state;
        struct vhost_vring_addr addr;
        VhostUserMemory memory;
        VhostUserLog log;
    };
} QEMU_PACKED VhostUserMsg;

The protocol for vhost-user is based on the existing implementation of vhost
for the Linux Kernel. Most messages that can be sent via the Unix domain socket
implementing vhost-user have an equivalent ioctl to the kernel implementation.

The communication consists of master sending message requests and slave sending
message replies. Most of the requests don't require replies. Here is a list of
the ones that do:

 * VHOST_USER_GET_FEATURES
 * VHOST_USER_GET_PROTOCOL_FEATURES
 * VHOST_USER_GET_VRING_BASE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)

[ Also see the section on REPLY_ACK protocol extension. ]

There are several messages that the master sends with file descriptors passed
in the ancillary data:

 * VHOST_USER_SET_MEM_TABLE
 * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
 * VHOST_USER_SET_LOG_FD
 * VHOST_USER_SET_VRING_KICK
 * VHOST_USER_SET_VRING_CALL
 * VHOST_USER_SET_VRING_ERR
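
On a Unix domain socket, descriptors travel as SCM_RIGHTS control messages
next to the regular message bytes. The sketch below shows one way to send and
receive a message together with a single descriptor using the POSIX
sendmsg()/recvmsg() calls. The helper names are hypothetical, and a message
such as VHOST_USER_SET_MEM_TABLE would carry an array of descriptors rather
than one.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union vu_fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send len bytes from buf with fd attached as ancillary data. */
static ssize_t vu_send_with_fd(int sock, const void *buf, size_t len, int fd)
{
    union vu_fd_ctrl ctrl;
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}

/* Receive up to len bytes; *fd gets the passed descriptor, or -1 if none. */
static ssize_t vu_recv_with_fd(int sock, void *buf, size_t len, int *fd)
{
    union vu_fd_ctrl ctrl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    *fd = -1;
    ssize_t n = recvmsg(sock, &msg, 0);
    if (n < 0) {
        return n;
    }
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            memcpy(fd, CMSG_DATA(c), sizeof(int));
            break;
        }
    }
    return n;
}
```

The received descriptor is a new descriptor number in the receiving process
that refers to the same open file, so the receiver owns it and should close
it when done.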

If Master is unable to send the full message or receives a wrong reply it will
close the connection. An optional reconnection mechanism can be implemented.

Any protocol extensions are gated by protocol feature bits,
which allows full backwards compatibility on both master
and slave.
As older slaves don't support negotiating protocol features,
a feature bit was dedicated for this purpose:
#define VHOST_USER_F_PROTOCOL_FEATURES 30
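
As a small illustration of this gating, a master could test the dedicated bit
before touching any protocol feature message. This is only a sketch: the
function names are hypothetical, and choosing the intersection of both sides'
protocol feature masks is an assumption about master policy, not something
the protocol mandates.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30

/* True if the features word obtained via VHOST_USER_GET_FEATURES
 * advertises support for protocol feature negotiation. */
static inline bool vu_has_protocol_features(uint64_t features)
{
    return ((features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1) != 0;
}

/* Protocol features a master could enable: none for an older slave,
 * otherwise the bits that both sides understand (assumed policy). */
static inline uint64_t vu_pick_protocol_features(uint64_t features,
                                                 uint64_t slave_pf,
                                                 uint64_t master_pf)
{
    if (!vu_has_protocol_features(features)) {
        return 0;
    }
    return slave_pf & master_pf;
}
```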

Starting and stopping rings
---------------------------
Client must only process each ring when it is started.

Client must only pass data between the ring and the
backend, when the ring is enabled.

If ring is started but disabled, client must process the
ring without talking to the backend.

For example, for a networking device, in the disabled state
client must not supply any new RX packets, but must process
and discard any TX packets.

If VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated, the ring is
initialized in an enabled state.

If VHOST_USER_F_PROTOCOL_FEATURES has been negotiated, the ring is initialized
in a disabled state. Client must not pass data to/from the backend until ring
is enabled by VHOST_USER_SET_VRING_ENABLE with parameter 1, or after it has
been disabled by VHOST_USER_SET_VRING_ENABLE with parameter 0.

Each ring is initialized in a stopped state. Client must not process it until
ring is started, or after it has been stopped.

Client must start ring upon receiving a kick (that is, detecting that file
descriptor is readable) on the descriptor specified by
VHOST_USER_SET_VRING_KICK, and stop ring upon receiving
VHOST_USER_GET_VRING_BASE.

While processing the rings (whether they are enabled or not), client must
support changing some configuration aspects on the fly.
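
The started/stopped and enabled/disabled states are independent, which can be
modelled with two flags per ring. The following sketch uses hypothetical names
throughout, and assumes that a ring starts out enabled when
VHOST_USER_F_PROTOCOL_FEATURES was not negotiated.

```c
#include <stdbool.h>

/* Per-ring state tracked by the client (illustrative type). */
struct vu_ring_state {
    bool started;   /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled;   /* controlled by VHOST_USER_SET_VRING_ENABLE */
};

enum vu_ring_action {
    VU_RING_IGNORE,   /* stopped: do not process the ring at all */
    VU_RING_DRAIN,    /* started but disabled: process, no backend I/O */
    VU_RING_FORWARD   /* started and enabled: pass data to/from backend */
};

/* Rings begin stopped; the initial enable state depends on negotiation
 * (assumption: enabled when protocol features were not negotiated). */
static void vu_ring_init(struct vu_ring_state *r, bool protocol_features)
{
    r->started = false;
    r->enabled = !protocol_features;
}

static void vu_ring_on_kick(struct vu_ring_state *r)
{
    r->started = true;
}

static void vu_ring_on_get_vring_base(struct vu_ring_state *r)
{
    r->started = false;
}

/* Parameter 1 enables the ring, parameter 0 disables it. */
static void vu_ring_on_set_vring_enable(struct vu_ring_state *r,
                                        unsigned int param)
{
    r->enabled = (param == 1);
}

static enum vu_ring_action vu_ring_action(const struct vu_ring_state *r)
{
    if (!r->started) {
        return VU_RING_IGNORE;
    }
    return r->enabled ? VU_RING_FORWARD : VU_RING_DRAIN;
}
```

For a networking device, VU_RING_DRAIN would correspond to discarding TX
packets and supplying no RX packets.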

Multiple queue support
----------------------

Multiple queue is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature VHOST_USER_PROTOCOL_F_MQ (bit 0) is set.

The max number of queues the slave supports can be queried with message
VHOST_USER_GET_QUEUE_NUM. Master should stop when the number of
requested queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue. One queue pair
is enabled initially. More queues are enabled dynamically, by sending
message VHOST_USER_SET_VRING_ENABLE.

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the VHOST_F_LOG_ALL vhost feature.

To start/stop logging of data/used ring writes, the master may send messages
VHOST_USER_SET_FEATURES with VHOST_F_LOG_ALL and VHOST_USER_SET_VRING_ADDR with
VHOST_VRING_F_LOG in ring's flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
VHOST_VRING_F_LOG is part of ring's flags.

Dirty pages are of size:
#define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message when the slave has
VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature.

The size of the log is supplied as part of VhostUserMsg
which should be large enough to cover all known guest
addresses. Log starts at the supplied offset in the
supplied file descriptor.
The log covers from address 0 to the maximum of guest
regions. In pseudo-code, to mark page at "addr" as dirty:

page = addr / VHOST_LOG_PAGE
log[page / 8] |= 1 << (page % 8)

Where addr is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
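
The pseudo-code can be turned into C with an atomic OR per page. This is a
minimal sketch using C11 <stdatomic.h>; the function name, the log_size
bounds check and the choice to type the log as an array of atomic bytes are
assumptions of the example, not requirements of the protocol.

```c
#include <stdatomic.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Mark every page touched by a write of len bytes at guest physical
 * address addr. log points at the start of the log (offset already
 * applied) and log_size is its size in bytes. */
static void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                         uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }

    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        if (page / 8 >= log_size) {
            break;   /* page lies outside the supplied log: skip it */
        }
        /* Atomic read-modify-write, since others may update the byte. */
        atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
    }
}
```

A write that crosses a page boundary sets one bit per page it touches, so a
2-byte write at 0x8fff marks both page 8 and page 9.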

Note that when logging modifications to the used ring (when VHOST_VRING_F_LOG
is set for this ring), log_guest_addr should be used to calculate the log
offset: the write to first byte of the used ring is logged at this offset from
log start. Also note that this value might be outside the legal guest physical
address range (i.e. does not have to be covered by the VhostUserMemory table),
but the bit offset of the last byte of the ring must fall within
the size supplied by VhostUserLog.

VHOST_USER_SET_LOG_FD is an optional message with an eventfd in
ancillary data. It may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by
the source. No further update must be done before rings are
restarted.

#define VHOST_USER_PROTOCOL_F_MQ             0
#define VHOST_USER_PROTOCOL_F_LOG_SHMFD      1
#define VHOST_USER_PROTOCOL_F_RARP           2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK      3

 * VHOST_USER_GET_FEATURES

      Equivalent ioctl: VHOST_GET_FEATURES

      Get the features bitmask from the underlying vhost implementation.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_FEATURES

      Equivalent ioctl: VHOST_SET_FEATURES

      Enable features in the underlying vhost implementation using a bitmask.
      Feature bit VHOST_USER_F_PROTOCOL_FEATURES signals slave support for
      VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES.

 * VHOST_USER_GET_PROTOCOL_FEATURES

      Equivalent ioctl: VHOST_GET_FEATURES

      Get the protocol feature bitmask from the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: a slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_PROTOCOL_FEATURES

      Equivalent ioctl: VHOST_SET_FEATURES

      Enable protocol features in the underlying vhost implementation.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES.
      Note: a slave that reported VHOST_USER_F_PROTOCOL_FEATURES must support
      this message even before VHOST_USER_SET_FEATURES was called.

 * VHOST_USER_SET_OWNER

      Equivalent ioctl: VHOST_SET_OWNER

      Issued when a new connection is established. It sets the current Master
      as an owner of the session. This can be used on the Slave as a
      "session start" flag.

 * VHOST_USER_RESET_OWNER

      This is no longer used. It used to be sent to request disabling
      all rings, but some clients interpreted it to also discard
      connection state (this interpretation would lead to bugs).
      It is recommended that clients either ignore this message,
      or use it to disable all rings.

 * VHOST_USER_SET_MEM_TABLE

      Equivalent ioctl: VHOST_SET_MEM_TABLE
      Master payload: memory regions description

      Sets the memory map regions on the slave so it can translate the vring
      addresses. The ancillary data contains an array of file descriptors,
      one for each memory mapped region. The size and ordering of the fds
      match the number and ordering of memory regions.

 * VHOST_USER_SET_LOG_BASE

      Equivalent ioctl: VHOST_SET_LOG_BASE

      Sets logging shared memory space.
      When slave has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol
      feature, the log memory fd is provided in the ancillary data of
      VHOST_USER_SET_LOG_BASE message, and the size and offset of the
      shared memory area are provided in the message.

 * VHOST_USER_SET_LOG_FD

      Equivalent ioctl: VHOST_SET_LOG_FD

      Sets the logging file descriptor, which is passed as ancillary data.

 * VHOST_USER_SET_VRING_NUM

      Equivalent ioctl: VHOST_SET_VRING_NUM
      Master payload: vring state description

      Set the size of the queue.

 * VHOST_USER_SET_VRING_ADDR

      Equivalent ioctl: VHOST_SET_VRING_ADDR
      Master payload: vring address description

      Sets the addresses of the different aspects of the vring.

 * VHOST_USER_SET_VRING_BASE

      Equivalent ioctl: VHOST_SET_VRING_BASE
      Master payload: vring state description

      Sets the base offset in the available vring.

 * VHOST_USER_GET_VRING_BASE

      Equivalent ioctl: VHOST_GET_VRING_BASE
      Master payload: vring state description
      Slave payload: vring state description

      Get the available vring base offset.

 * VHOST_USER_SET_VRING_KICK

      Equivalent ioctl: VHOST_SET_VRING_KICK

      Set the event file descriptor for adding buffers to the vring. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling should be used
      instead of waiting for a kick.
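
The payload encoding shared by VHOST_USER_SET_VRING_KICK,
VHOST_USER_SET_VRING_CALL and VHOST_USER_SET_VRING_ERR can be decoded as in
the sketch below. It assumes the payload arrives as a single 64-bit integer;
the macro and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define VU_VRING_IDX_MASK  0xffu       /* bits 0-7: vring index */
#define VU_VRING_NOFD_MASK (1u << 8)   /* bit 8: invalid FD flag */

/* Index of the vring the message refers to. */
static inline unsigned int vu_vring_index(uint64_t payload)
{
    return (unsigned int)(payload & VU_VRING_IDX_MASK);
}

/* True if a file descriptor is expected in the ancillary data; false
 * means the invalid FD flag is set and polling is to be used instead. */
static inline bool vu_vring_has_fd(uint64_t payload)
{
    return (payload & VU_VRING_NOFD_MASK) == 0;
}
```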

 * VHOST_USER_SET_VRING_CALL

      Equivalent ioctl: VHOST_SET_VRING_CALL

      Set the event file descriptor to signal when buffers are used. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data. This signals that polling will be used
      instead of waiting for the call.

 * VHOST_USER_SET_VRING_ERR

      Equivalent ioctl: VHOST_SET_VRING_ERR

      Set the event file descriptor to signal when an error occurs. It
      is passed in the ancillary data.
      Bits (0-7) of the payload contain the vring index. Bit 8 is the
      invalid FD flag. This flag is set when there is no file descriptor
      in the ancillary data.

 * VHOST_USER_GET_QUEUE_NUM

      Equivalent ioctl: N/A

      Query how many queues the backend supports. This request should be
      sent only when VHOST_USER_PROTOCOL_F_MQ is set in queried protocol
      features by VHOST_USER_GET_PROTOCOL_FEATURES.

 * VHOST_USER_SET_VRING_ENABLE

      Equivalent ioctl: N/A
      Master payload: vring state description

      Signal slave to enable or disable corresponding vring.
      This request should be sent only when VHOST_USER_F_PROTOCOL_FEATURES
      has been negotiated.

 * VHOST_USER_SEND_RARP

      Equivalent ioctl: N/A

      Ask the vhost-user backend to broadcast a fake RARP to announce that
      the migration has terminated, for guests that do not support
      GUEST_ANNOUNCE.
      Only legal if feature bit VHOST_USER_F_PROTOCOL_FEATURES is present in
      VHOST_USER_GET_FEATURES and protocol feature bit VHOST_USER_PROTOCOL_F_RARP
      is present in VHOST_USER_GET_PROTOCOL_FEATURES.
      The first 6 bytes of the payload contain the MAC address of the guest to
      allow the vhost-user backend to construct and broadcast the fake RARP.

VHOST_USER_PROTOCOL_F_REPLY_ACK:
--------------------------------
The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where commands
are sent over an ioctl() call and block until the client has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
"need_reply" [Bit 3] flag on any command. This indicates that
the client MUST respond with a Payload VhostUserMsg indicating success or
failure. The payload should be set to zero on success or non-zero on failure,
unless the message already has an explicit reply body.
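
On the client side, the acknowledgement logic might look like the sketch
below. The struct and function names are hypothetical, the packed attribute
is a GCC/Clang extension, and echoing the request type in the reply is an
assumption of this example. A caller would skip this helper for requests
that already have an explicit reply body.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VU_VERSION                      0x1u
#define VU_REPLY_MASK                   (1u << 2)
#define VU_NEED_REPLY_MASK              (1u << 3)
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3

/* Header followed by a u64 payload: the shape of an acknowledgement. */
struct vu_ack_msg {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
    uint64_t u64;
} __attribute__((packed));

/* Fill ack for a request whose handler returned err (0 means success).
 * Returns false when no acknowledgement is due, i.e. REPLY_ACK was not
 * negotiated or the need_reply flag is not set on the request. */
static bool vu_build_ack(struct vu_ack_msg *ack, uint32_t request,
                         uint32_t req_flags, uint64_t negotiated_pf, int err)
{
    bool ack_negotiated =
        ((negotiated_pf >> VHOST_USER_PROTOCOL_F_REPLY_ACK) & 1) != 0;

    if (!ack_negotiated || (req_flags & VU_NEED_REPLY_MASK) == 0) {
        return false;
    }

    memset(ack, 0, sizeof(*ack));
    ack->request = request;                  /* assumed: echo request type */
    ack->flags = VU_VERSION | VU_REPLY_MASK; /* mark message as a reply */
    ack->size = sizeof(ack->u64);
    ack->u64 = err ? 1 : 0;                  /* zero on success */
    return true;
}
```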

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to be
more resilient for selective requests.

For the message types that already solicit a reply from the client, the
presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
no behavioural change. (See the 'Communication' section for details.)