docs/rdma.txt

   1 (RDMA: Remote Direct Memory Access)
   2 RDMA Live Migration Specification, Version # 1
   3 ==============================================
   4 Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
   5 Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
   6
   7 Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
   8
   9 An *exhaustive* paper (2010), linked on the QEMU wiki above, shows
  10 additional performance details.
  11
  12 Contents:
  13 =========
  14 * Introduction
  15 * Before running
  16 * Running
  17 * Performance
  18 * RDMA Migration Protocol Description
  19 * Versioning and Capabilities
  20 * QEMUFileRDMA Interface
  21 * Migration of pc.ram
  22 * Error handling
  23 * TODO
  24
  25 Introduction:
  26 =============
  27
  28 RDMA helps make your migration more deterministic under heavy load because
  29 of the significantly lower latency and higher throughput over TCP/IP. This is
  30 because the RDMA I/O architecture reduces the number of interrupts and
  31 data copies by bypassing the host networking stack. In particular, a TCP-based
  32 migration, under certain types of memory-bound workloads, may take a more
  33 unpredictable amount of time to complete the migration if the amount of
  34 memory tracked during each live migration iteration round cannot keep pace
  35 with the rate of dirty memory produced by the workload.
  36
  37 RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA
  38 over Converged Ethernet) and Infiniband-based. This implementation of
  39 migration using RDMA is capable of using both technologies because of
  40 the use of the OpenFabrics OFED software stack that abstracts out the
  41 programming model irrespective of the underlying hardware.
  42
  43 Refer to openfabrics.org or your respective RDMA hardware vendor for
  44 an understanding of how to verify that you have the OFED software stack
  45 installed in your environment. You should be able to successfully link
  46 against the "librdmacm" and "libibverbs" libraries and development headers
  47 for a working build of QEMU to run successfully using RDMA Migration.
  48
  49 BEFORE RUNNING:
  50 ===============
  51
  52 Use of RDMA during migration requires pinning and registering memory
  53 with the hardware. This means that memory must be physically resident
  54 before the hardware can transmit that memory to another machine.
  55 If this is not acceptable for your application or product, then the use
  56 of RDMA migration may in fact be harmful to co-located VMs or other
  57 software on the machine if there is not sufficient memory available to
  58 relocate the entire footprint of the virtual machine. If so, then the
  59 use of RDMA is discouraged and it is recommended to use standard TCP migration.
  60
  61 Experimental: Next, decide if you want dynamic page registration (the
  62 default). For example, if you have an 8GB RAM virtual machine, but only
  63 1GB is in active use, then turning on x-rdma-pin-all instead will cause
  64 all 8GB to be pinned and resident in memory. Pinning everything mostly
  65 affects the bulk-phase round of the migration and can be enabled for
  66 extremely high-performance RDMA hardware using the following command:
  67
  68 QEMU Monitor Command:
  69 $ migrate_set_capability x-rdma-pin-all on # disabled by default
  70
  71 Performing this action will cause all 8GB to be pinned, so if that's
  72 not what you want, then please ignore this step altogether.
  73
  74 On the other hand, this will also significantly speed up the bulk round
  75 of the migration, which can greatly reduce the "total" time of your migration.
  76 Example performance of this using an idle VM in the previous example
  77 can be found in the "Performance" section.
  78
  79 Note: for very large virtual machines (hundreds of GBs), pinning
  80 *all* of the memory of your virtual machine in the kernel is very expensive
  81 and may extend the initial bulk iteration time by many seconds,
  82 thus extending the total migration time. However, this will not
  83 affect the determinism or predictability of your migration: you will
  84 still gain from the benefits of advanced pinning with RDMA.
  85
  86 RUNNING:
  87 ========
  88
  89 First, set the migration speed to match your hardware's capabilities:
  90
  91 QEMU Monitor Command:
  92 $ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
  93
  94 Next, on the destination machine, add the following to the QEMU command line:
  95
  96 qemu ..... -incoming x-rdma:host:port
  97
  98 Finally, perform the actual migration on the source machine:
  99
 100 QEMU Monitor Command:
 101 $ migrate -d x-rdma:host:port
 102
 103 PERFORMANCE
 104 ===========
 105
 106 Here is a brief summary of total migration time and downtime using RDMA:
 107 Using a 40gbps infiniband link performing a worst-case stress test,
 108 using an 8GB RAM virtual machine:
 109
 110 Using the following command:
 111 $ apt-get install stress
 112 $ stress --vm-bytes 7500M --vm 1 --vm-keep
 113
 114 1. Migration throughput: 26 gigabits/second.
 115 2. Downtime (stop time) varies between 15 and 100 milliseconds.
 116
 117 EFFECTS of memory registration on bulk phase round:
 118
 119 For example, in the same 8GB RAM example, with all 8GB of memory in
 120 active use and the VM itself completely idle, using the same 40 gbps
 121 infiniband link:
 122
 123 1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
 124 2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
 125
 126 These numbers would of course scale up to whatever size virtual machine
 127 you have to migrate using RDMA.
 128
 129 Enabling this feature does *not* have any measurable effect on
 130 migration *downtime*. This is because, without this feature, all of the
 131 memory will have already been registered in advance during
 132 the bulk round and does not need to be re-registered during the successive
 133 iteration rounds.
 134
 135 RDMA Migration Protocol Description:
 136 ====================================
 137
 138 Migration with RDMA is separated into two parts:
 139
 140 1. The transmission of the pages using RDMA
 141 2. Everything else (a control channel is introduced)
 142
 143 "Everything else" is transmitted using a formal
 144 protocol now, consisting of infiniband SEND messages.
 145
 146 An infiniband SEND message is the standard ibverbs
 147 message used by applications of infiniband hardware.
 148 The only difference between a SEND message and an RDMA
 149 message is that SEND messages cause notifications
 150 to be posted to the completion queue (CQ) on the
 151 infiniband receiver side, whereas RDMA messages (used
 152 for pc.ram) do not (to behave like an actual DMA).
 153
 154 Messages in infiniband require two things:
 155
 156 1. registration of the memory that will be transmitted
 157 2. (SEND only) work requests to be posted on both
 158    sides of the network before the actual transmission
 159    can occur.
 160
 161 RDMA messages are much easier to deal with. Once the memory
 162 on the receiver side is registered and pinned, we're
 163 basically done. All that is required is for the sender
 164 side to start dumping bytes onto the link.
 165
 166 (Memory is not released from pinning until the migration
 167 completes, given that RDMA migrations are very fast.)
 168
 169 SEND messages require more coordination because the
 170 receiver must have reserved space (using a receive
 171 work request) on the receive queue (RQ) before QEMUFileRDMA
 172 can start using them to carry all the bytes as
 173 a control transport for migration of device state.
 174
 175 To begin the migration, the initial connection setup is
 176 as follows (migration-rdma.c):
 177
 178 1. Receiver and Sender are started (command line or libvirt)
 179 2. Both sides post two RQ work requests
 180 3. Receiver does listen()
 181 4. Sender does connect()
 182 5. Receiver does accept()
 183 6. Check versioning and capabilities (described later)
 184
 185 At this point, we define a control channel on top of SEND messages
 186 which is described by a formal protocol. Each SEND message has a
 187 header portion and a data portion (but together are transmitted
 188 as a single SEND message).
 189
 190 Header:
 191     * Length  (of the data portion, uint32, network byte order)
 192     * Type    (what command to perform, uint32, network byte order)
 193     * Repeat  (Number of commands in data portion, same type only)
 194
 195 The 'Repeat' field is here to support future multiple page registrations
 196 in a single message without any need to change the protocol itself
 197 so that the protocol is compatible against multiple versions of QEMU.
 198 Version #1 requires that all server implementations of the protocol must
 199 check this field and register all requests found in the array of commands located
 200 in the data portion and return an equal number of results in the response.
 201 The maximum number of repeats is hard-coded to 4096. This is a conservative
 202 limit based on the maximum size of a SEND message along with empirical
 203 observations on the maximum future benefit of simultaneous page registrations.
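
The header layout and the repeat limit described above can be sketched as
follows. This is a minimal illustration in Python, not QEMU's implementation:
the field order, the treatment of 'Repeat' as a uint32 in network byte order,
and the function names are assumptions made for this example.

```python
import struct

# Assumed layout: Length, Type, Repeat, each a uint32 in network byte order.
# The text states the width and byte order for Length and Type only;
# treating Repeat the same way is an assumption of this sketch.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit stated in the text


def pack_control_message(msg_type, data=b"", repeat=1):
    """Build one SEND payload: the header followed by the data portion."""
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds the protocol limit")
    return HEADER.pack(len(data), msg_type, repeat) + data


def unpack_control_message(payload):
    """Split a SEND payload into (type, repeat, data) and validate it."""
    length, msg_type, repeat = HEADER.unpack_from(payload)
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds the protocol limit")
    data = payload[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return msg_type, repeat, data
```

A receiver built this way would reject an oversized 'Repeat' value before
touching the data portion, which keeps the bounds check in one place.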
 204
 205 The 'type' field has 10 different command values:
 206     1. Unused
 207     2. Error              (sent to the source during bad things)
 208     3. Ready              (control-channel is available)
 209     4. QEMU File          (for sending non-live device state)
 210     5. RAM Blocks request (used right after connection setup)
 211     6. RAM Blocks result  (used right after connection setup)
 212     7. Compress page      (zap zero page and skip registration)
 213     8. Register request   (dynamic chunk registration)
 214     9. Register result    ('rkey' to be used by sender)
 215     10. Register finished  (registration for current iteration finished)
 216
 217 A single control message, as hinted above, can contain within the data
 218 portion an array of many commands of the same type. If there is more than
 219 one command, then the 'repeat' field will be greater than 1.
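
For illustration, the command list can be written as an enumeration together
with the kind of check a receiver performs on an incoming command. The numeric
values below simply follow the list positions used in this document; the
actual on-the-wire encoding is not stated here and may differ. The names
ControlType and check_expected are hypothetical.

```python
from enum import IntEnum


class ControlType(IntEnum):
    # Values mirror the numbering of the list above (an assumption);
    # they are not taken from the QEMU sources.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10


def check_expected(received, expected):
    """Validate a received command type against the one we are waiting for."""
    received = ControlType(received)  # raises ValueError for unknown values
    if received is ControlType.ERROR:
        raise RuntimeError("peer reported an error")
    if received is not expected:
        raise RuntimeError(f"expected {expected.name}, got {received.name}")
    return received
```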
 220
 221 After connection setup, message 5 & 6 are used to exchange ram block
 222 information and optionally pin all the memory if requested by the user.
 223
 224 After ram block exchange is completed, we have two protocol-level
 225 functions, responsible for communicating control-channel commands
 226 using the above list of values:
 227
 228 Logically:
 229
 230 qemu_rdma_exchange_recv(header, expected command type)
 231
 232 1. We transmit a READY command to let the sender know that
 233    we are *ready* to receive some data bytes on the control channel.
 234 2. Before attempting to receive the expected command, we post another
 235    RQ work request to replace the one we just used up.
 236 3. Block on a CQ event channel and wait for the SEND to arrive.
 237 4. When the send arrives, librdmacm will unblock us.
 238 5. Verify that the command-type and version received match the ones we expected.
 239
 240 qemu_rdma_exchange_send(header, data, optional response header & data):
 241
 242 1. Block on the CQ event channel waiting for a READY command
 243    from the receiver to tell us that the receiver
 244    is *ready* for us to transmit some new bytes.
 245 2. Optionally: if we are expecting a response from the command
 246    (that we have not yet transmitted), let's post an RQ
 247    work request to receive that data a few moments later.
 248 3. When the READY arrives, librdmacm will
 249    unblock us and we immediately post an RQ work request
 250    to replace the one we just used up.
 251 4. Now, we can actually post the work request to SEND
 252    the requested command type of the header we were asked for.
 253 5. Optionally, if we are expecting a response (as before),
 254    we block again and wait for that response using the additional
 255    work request we previously posted. (This is used to carry
 256    'Register result' commands #9 back to the sender which
 257    hold the rkey needed to perform RDMA. Note that the virtual address
 258    corresponding to this rkey was already exchanged at the beginning
 259    of the connection (described below).)
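
The READY handshake in the two functions above can be simulated without any
RDMA hardware. The sketch below is only an analogy: Python queues stand in
for SEND messages and completion events, there is no equivalent of posting
RQ work requests, and the optional response path is omitted. The Endpoint
class and the symbolic command names are hypothetical.

```python
import queue
import threading

READY = "ready"          # symbolic stand-ins for the command types
QEMU_FILE = "qemu-file"


class Endpoint:
    """One side of the control channel; queues replace the RDMA link."""

    def __init__(self, inbox, outbox):
        self.inbox = inbox    # messages arriving from the peer
        self.outbox = outbox  # messages sent to the peer

    def exchange_recv(self, expected, timeout=5):
        self.outbox.put((READY, b""))                 # announce readiness
        kind, data = self.inbox.get(timeout=timeout)  # block until the SEND arrives
        if kind != expected:                          # verify the command type
            raise RuntimeError(f"expected {expected}, got {kind}")
        return data

    def exchange_send(self, kind, data, timeout=5):
        ready, _ = self.inbox.get(timeout=timeout)    # block until READY arrives
        if ready != READY:
            raise RuntimeError(f"expected {READY}, got {ready}")
        self.outbox.put((kind, data))                 # now transmit the command


def demo():
    to_receiver, to_sender = queue.Queue(), queue.Queue()
    sender = Endpoint(inbox=to_sender, outbox=to_receiver)
    receiver = Endpoint(inbox=to_receiver, outbox=to_sender)
    received = []
    worker = threading.Thread(
        target=lambda: received.append(receiver.exchange_recv(QEMU_FILE)))
    worker.start()
    sender.exchange_send(QEMU_FILE, b"device state")
    worker.join(timeout=5)
    return received
```

The point of the handshake is visible in exchange_send(): nothing is
transmitted until the receiver has announced that it is ready for the
message.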
 260
 261 The remaining command types (not including 'ready')
 262 described above all use the aforementioned two functions to do the hard work:
 263
 264 1. After connection setup, RAMBlock information is exchanged using
 265    this protocol before the actual migration begins. This information includes
 266    a description of each RAMBlock on the server side as well as the virtual addresses
 267    and lengths of each RAMBlock. This is used by the client to determine the
 268    start and stop locations of chunks and how to register them dynamically
 269    before performing the RDMA operations.
 270 2. During runtime, once a 'chunk' becomes full of pages ready to
 271    be sent with RDMA, the registration commands are used to ask the
 272    other side to register the memory for this chunk and respond
 273    with the result (rkey) of the registration.
 274 3. The QEMUFile interfaces also call these functions (described below)
 275    when transmitting non-live state, such as devices or to send
 276    its own protocol information during the migration process.
 277 4. Finally, zero pages are only checked if a page has not yet been registered
 278    using chunk registration (or not checked at all and unconditionally
 279    written if chunk registration is disabled). This is accomplished using
 280    the "Compress" command listed above. If the page has *not* been registered
 281    then we check the entire chunk for zero. Only if the entire chunk is
 282    zero do we send a compress command to zap the page on the other side.
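
One way to read the zero-page rule in item 4 is as a small decision function.
The sketch below follows the reading that the zero check applies only to
chunks that have not yet been registered, so that an all-zero chunk can skip
registration entirely. The function names and the returned action labels are
hypothetical.

```python
def chunk_is_zero(chunk):
    """Return True if every byte in the chunk is zero."""
    return not any(chunk)


def next_action(chunk, already_registered, chunk_registration=True):
    """Decide how a chunk is sent, following the rule described above."""
    if not chunk_registration:
        # Registration disabled: never checked, always written.
        return "rdma-write"
    if already_registered:
        # The rkey is already known, so the zero check is skipped.
        return "rdma-write"
    if chunk_is_zero(chunk):
        # All zero: send a compress command and skip registration.
        return "compress"
    return "register-then-write"
```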
 283
 284 Versioning and Capabilities
 285 ===========================
 286 Current version of the protocol is version #1.
 287
 288 The same version applies to both protocol traffic and capabilities
 289 negotiation. (i.e. There is only one version number that is referred to
 290 by all communication).
 291
 292 librdmacm provides the user with a 'private data' area to be exchanged
 293 at connection-setup time before any infiniband traffic is generated.
 294
 295 Header:
 296     * Version (protocol version validated before send/recv occurs), uint32, network byte order
 297     * Flags   (bitwise OR of each capability), uint32, network byte order
 298
 299 There is no data portion of this header right now, so there is
 300 no length field. The maximum size of the 'private data' section
 301 is only 192 bytes per the Infiniband specification, so it's not
 302 very useful for data anyway. This structure needs to remain small.
 303
 304 This private data area is a convenient place to check for protocol
 305 versioning because the user does not need to register memory to
 306 transmit a few bytes of version information.
 307
 308 This is also a convenient place to negotiate capabilities
 309 (like dynamic page registration).
 310
 311 If the version is invalid, we throw an error.
 312
 313 If the version is new, we only negotiate the capabilities that the
 314 requested version is able to perform and ignore the rest.
 315
 316 Currently there is only *one* capability in Version #1: dynamic page registration
 317
 318 Finally: Negotiation happens with the Flags field: If the primary-VM
 319 sets a flag, but the destination does not support this capability, it
 320 will return a zero-bit for that flag and the primary-VM will understand
 321 that as not being an available capability and will thus disable that
 322 capability on the primary-VM side.
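
The flag negotiation can be sketched as a bitwise AND on both sides. This is
an illustration only: the bit value chosen for dynamic page registration and
the function names are assumptions, and for simplicity the sketch rejects any
version other than 1.

```python
import struct

CAP_HEADER = struct.Struct("!II")  # Version, Flags: uint32, network byte order
PROTOCOL_VERSION = 1
CAP_DYNAMIC_REGISTRATION = 0x01    # hypothetical bit assignment


def pack_capabilities(version, flags):
    """Build the bytes placed in the 'private data' area."""
    return CAP_HEADER.pack(version, flags)


def answer_request(private_data, supported_flags):
    """Destination side: validate the version, keep only supported flags."""
    version, requested = CAP_HEADER.unpack(private_data)
    if version != PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version")
    return CAP_HEADER.pack(version, requested & supported_flags)


def apply_reply(reply, requested):
    """Source side: a zero bit in the reply disables that capability."""
    _, granted = CAP_HEADER.unpack(reply)
    return requested & granted
```

With this shape, a destination that does not know a capability bit never has
to recognize it: the bit falls out of the AND, and the source reads the
zero bit as "not available".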
 323
 324 QEMUFileRDMA Interface:
 325 =======================
 326
 327 QEMUFileRDMA introduces a couple of new functions:
 328
 329 1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
 330 2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
 331
 332 These two functions are very short and simply use the protocol
 333 described above to deliver bytes without changing the upper-level
 334 users of QEMUFile that depend on a bytestream abstraction.
 335
 336 Finally, how do we hand off the actual bytes to get_buffer()?
 337
 338 Again, because we're trying to "fake" a bytestream abstraction
 339 using an analogy not unlike individual UDP frames, we have
 340 to hold on to the bytes received from control-channel's SEND
 341 messages in memory.
 342
 343 Each time we receive a complete "QEMU File" control-channel
 344 message, the bytes from SEND are copied into a small local holding area.
 345
 346 Then, we return the number of bytes requested by get_buffer()
 347 and leave the remaining bytes in the holding area until get_buffer()
 348 comes around for another pass.
 349
 350 If the buffer is empty, then we follow the same steps
 351 listed above and issue another "QEMU File" protocol command,
 352 asking for a new SEND message to re-fill the buffer.
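
The holding-area behaviour can be sketched as follows. This is not the QEMU
code: the HoldingArea class and the fetch_message callable (which stands in
for receiving the next "QEMU File" control message) are hypothetical.

```python
class HoldingArea:
    """Serve byte-stream reads out of whole control-channel messages."""

    def __init__(self, fetch_message):
        # fetch_message is a hypothetical callable that returns the bytes
        # of the next complete "QEMU File" message.
        self._fetch = fetch_message
        self._buf = b""

    def get_buffer(self, size):
        """Return up to 'size' bytes, refilling only when the area is empty."""
        if not self._buf:
            self._buf = self._fetch()
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

A caller asking for 4 bytes at a time from messages of 6 and 2 bytes would
receive 4, then the remaining 2, and only then trigger a refill.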
 353
 354 Migration of pc.ram:
 355 ====================
 356
 357 At the beginning of the migration (migration-rdma.c),
 358 the sender and the receiver populate the list of RAMBlocks
 359 to be registered with each other into a structure.
 360 Then, using the aforementioned protocol, they exchange a
 361 description of these blocks with each other, to be used later
 362 during the iteration of main memory. This description includes
 363 a list of all the RAMBlocks, their offsets and lengths, virtual
 364 addresses and possibly includes pre-registered RDMA keys in case dynamic
 365 page registration was disabled on the server-side, otherwise not.
 366
 367 Main memory is not migrated with the aforementioned protocol,
 368 but is instead migrated with normal RDMA Write operations.
 369
 370 Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
 371 Chunk size is not dynamic, but it could be in a future implementation.
 372 There's nothing to indicate that this is useful right now.
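
The chunk arithmetic implied by a fixed 1 Megabyte chunk size can be sketched
as below. Counting chunks from the start address of the RAMBlock is an
assumption of this example, as are the function names; the actual
implementation may align chunks differently.

```python
CHUNK_SIZE = 1 << 20  # 1 Megabyte, as stated in the text


def chunk_index(block_start, addr):
    """Index of the chunk containing 'addr' (assumes chunks start at block_start)."""
    return (addr - block_start) // CHUNK_SIZE


def chunk_bounds(block_start, block_length, index):
    """Start and end address of a chunk; the last chunk may be short."""
    start = block_start + index * CHUNK_SIZE
    end = min(start + CHUNK_SIZE, block_start + block_length)
    return start, end
```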
 373
 374 When a chunk is full (or a flush() occurs), the memory backed by
 375 the chunk is registered with librdmacm and pinned in memory on
 376 both sides using the aforementioned protocol.
 377 After pinning, an RDMA Write is generated and transmitted
 378 for the entire chunk.
 379
 380 Chunks are also transmitted in batches: This means that we
 381 do not request that the hardware signal the completion queue
 382 for the completion of *every* chunk. The current batch size
 383 is about 64 chunks (corresponding to 64 MB of memory).
 384 Only the last chunk in a batch must be signaled.
 385 This helps keep everything as asynchronous as possible
 386 and helps keep the hardware busy performing RDMA operations.
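
The batching rule can be illustrated by computing which chunk writes would
request a completion. The sketch assumes a batch size of exactly 64 and that
the final chunk of a partial batch is also signaled; both are assumptions,
and signaled_flags is a hypothetical name.

```python
BATCH_SIZE = 64  # "about 64 chunks" in the text; exact value assumed here


def signaled_flags(num_chunks, batch_size=BATCH_SIZE):
    """For each chunk, True if its RDMA Write should signal the CQ."""
    return [
        (i + 1) % batch_size == 0 or i == num_chunks - 1
        for i in range(num_chunks)
    ]
```

For 130 chunks this yields three signaled writes (chunks 63, 127 and 129),
so the sender waits on three completions instead of 130.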
 387
 388 Error handling:
 389 ===============
 390
 391 Infiniband has what is called a "Reliable, Connected"
 392 link (one of 4 choices). This is the mode
 393 we use for RDMA migration.
 394
 395 If a *single* message fails,
 396 the decision is to abort the migration entirely and
 397 clean up all the RDMA descriptors and unregister all
 398 the memory.
 399
 400 After cleanup, the Virtual Machine is returned to normal
 401 operation the same way that would happen if the TCP
 402 socket is broken during a non-RDMA based migration.
 403
 404 TODO:
 405 =====
 406 1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
 407    renamed to 'rdma' after the experimental phase of this work has
 408    completed upstream.
 409 2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
 410    are not compatible with infiniband memory pinning and will result in
 411    an aborted migration (but with the source VM left unaffected).
 412 3. Use of the recent /proc/<pid>/pagemap would likely speed up
 413    the use of KSM and ballooning while using RDMA.
 414 4. Some form of balloon-device usage tracking would also
 415    help alleviate some issues.