This is the design document for multi-process QEMU. It does not
necessarily reflect the status of the current implementation, which
may lack features or be considerably different from what is described
in this document. This document is still useful as a description of
the goals and general direction of this feature.

Please refer to the following wiki for the latest details:
https://wiki.qemu.org/Features/MultiProcessQEMU

Multi-process QEMU
===================

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU into multiple processes reduces the attack
surface by helping to limit each component in the system to accessing
only the resources that it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes. Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, it would
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited and, indeed, have been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation
code provides interface points where the QEMU functions that perform
device emulation can be separated from the QEMU functions that manage
the emulation of guest CPU instructions. The devices emulated in the
separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM. For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.
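
The descriptor passing mentioned here relies on a kernel facility of UNIX
domain sockets: an ``SCM_RIGHTS`` control message attached to ``sendmsg()``.
The sketch below shows only that underlying mechanism, not the vhost-user
message format. The helper names ``send_fd()`` and ``recv_fd()`` are
hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control-message buffer sized for exactly one descriptor. */
union fd_ctrl {
    struct cmsghdr align;               /* forces suitable alignment */
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send descriptor 'fd' over connected socket 'sock'. */
static int send_fd(int sock, int fd)
{
    char byte = 0;  /* one data byte must accompany the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor; returns it, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file
as the sender's, which is what lets eventfds and memory-backing files be
shared between the two processes.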

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver. This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor through which the
vhost application can directly receive MMIO store notifications from the
KVM driver, instead of needing them to be sent to the QEMU process first.
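
To make the eventfd semantics concrete, here is a minimal sketch using the
Linux ``eventfd()`` interface directly. In the real path the producer is the
KVM driver rather than a ``write()`` from user space; the function names
``kick_notify()`` and ``kick_drain()`` are hypothetical.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Producer side: add one event to the eventfd counter. KVM performs the
 * equivalent when the guest stores to a registered address.
 */
static int kick_notify(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: read and reset the counter. Returns how many events
 * were coalesced since the last read, or -1 on error.
 */
static int64_t kick_drain(int efd)
{
    uint64_t count;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

Note that the consumer learns only that one or more events occurred. No
address or store data travels with the notification, which suits virtio
queue kicks but not arbitrary device registers.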

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again,
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.
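
The effect of mapping the same descriptor in two places can be sketched as
follows. A temporary file stands in for the guest RAM backing file, and the
two mappings stand in for the views held by QEMU and by the application; in
practice they would live in different processes. The helper names are
hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked temporary file of 'len' bytes; returns an fd or -1. */
static int make_backing_fd(size_t len)
{
    FILE *f = tmpfile();
    int fd;

    if (f == NULL) {
        return -1;
    }
    fd = dup(fileno(f));    /* keep the file alive after fclose() */
    fclose(f);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map 'len' bytes of 'fd' shared and writable; returns NULL on failure. */
static void *map_guest_ram(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

Because both mappings use ``MAP_SHARED`` on the same file, a store through
one view is visible through the other without any message being exchanged.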

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` the guest
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous: guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous: the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

Qemu-io is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation). Qemu-io is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends an emulation
process will provide are specified on its command line, as they would
be for QEMU. For example:

::

    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
    -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 emulated disk named
*drive0*, backed by the file node *file0*, as its backend.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a UNIX domain
socket that connects with the Proxy object. This is a required argument.

::

    disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified the same way as for a QEMU
process:

::

    disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the UNIX socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.

QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is
a pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace; i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific. For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.
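
One possible shape for such forwarding callbacks is sketched below. It is
not QEMU code: the wire format ``struct mmio_msg`` and every function name
are assumptions made for illustration, and the remote device is reduced to
an array of 64-bit registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum { MMIO_LOAD = 0, MMIO_STORE = 1 };

/* Hypothetical wire format for one forwarded access. */
struct mmio_msg {
    uint64_t offset;    /* offset within the device region */
    uint64_t data;      /* store data, unused for loads */
    uint32_t size;      /* access size in bytes */
    uint32_t type;      /* MMIO_LOAD or MMIO_STORE */
};

static int send_msg(int sock, const struct mmio_msg *m)
{
    return write(sock, m, sizeof(*m)) == (ssize_t)sizeof(*m) ? 0 : -1;
}

/* Proxy side: what a write callback registered by the proxy might do. */
static int proxy_mmio_write(int sock, uint64_t offset, uint64_t data,
                            uint32_t size)
{
    struct mmio_msg m = { offset, data, size, MMIO_STORE };

    return send_msg(sock, &m);
}

/* Proxy side: a load is a request followed by a blocking reply. */
static int proxy_mmio_read_request(int sock, uint64_t offset, uint32_t size)
{
    struct mmio_msg m = { offset, 0, size, MMIO_LOAD };

    return send_msg(sock, &m);
}

static int proxy_mmio_read_reply(int sock, uint64_t *data)
{
    return read(sock, data, sizeof(*data)) == (ssize_t)sizeof(*data) ? 0 : -1;
}

/* Remote side: apply one forwarded access to a toy register file. */
static int remote_handle_one(int sock, uint64_t *regs, size_t nregs)
{
    struct mmio_msg m;
    uint64_t idx;

    if (read(sock, &m, sizeof(m)) != (ssize_t)sizeof(m)) {
        return -1;
    }
    idx = m.offset / sizeof(uint64_t);
    if (idx >= nregs) {
        return -1;
    }
    if (m.type == MMIO_STORE) {
        regs[idx] = m.data;
        return 0;
    }
    return write(sock, &regs[idx], sizeof(regs[idx])) ==
           (ssize_t)sizeof(regs[idx]) ? 0 : -1;
}
```

The asymmetry is deliberate: a store can be posted and forgotten, while a
load must wait for the reply before the guest CPU can proceed.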

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much as vhost applications handle them.
One of the initial messages sent to the emulation process is a guest
memory table. Each entry in this table consists of a file descriptor and
size that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
must be backed by file descriptors, such as when QEMU is given the
*-mem-path* command line option.

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space. When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. It will also have an "rid" option to the command, just as the
*-device* command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.
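
The "length and data" handling can be pictured as treating the remote
state as an opaque blob. The sketch below drains a descriptor until EOF
into a length-tagged buffer and later replays it into another descriptor.
``struct proxy_vmstate`` and both function names are hypothetical; the real
proxy would hook this into QEMU's *vmstate* machinery.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Opaque remote device state as held by the proxy (hypothetical layout). */
struct proxy_vmstate {
    size_t len;
    uint8_t *data;
};

/* Source host: read the remote process's state from 'fd' until EOF. */
static int proxy_save(int fd, struct proxy_vmstate *st)
{
    uint8_t chunk[256];
    ssize_t n;

    st->len = 0;
    st->data = NULL;
    while ((n = read(fd, chunk, sizeof(chunk))) > 0) {
        uint8_t *grown = realloc(st->data, st->len + (size_t)n);

        if (grown == NULL) {
            n = -1;
            break;
        }
        memcpy(grown + st->len, chunk, (size_t)n);
        st->data = grown;
        st->len += (size_t)n;
    }
    if (n < 0) {
        free(st->data);
        st->data = NULL;
        st->len = 0;
        return -1;
    }
    return 0;
}

/* Destination host: replay the saved bytes to the new remote process. */
static int proxy_restore(int fd, const struct proxy_vmstate *st)
{
    size_t off = 0;

    while (off < st->len) {
        ssize_t n = write(fd, st->data + off, st->len - off);

        if (n <= 0) {
            return -1;
        }
        off += (size_t)n;
    }
    return 0;
}
```

The proxy never interprets the bytes, so the remote device's state format
can change without the proxy needing to know about it.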

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization sequence will follow the same sequence
followed by QEMU. It will first initialize the backend objects, then
device emulation objects. The JSON descriptions sent by the QEMU process
will drive which objects need to be created.

-  address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

-  RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.

-  PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process. In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

-  PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

-  PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a local
virtual address. The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

-  IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.

-  IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.
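
The lookup, miss, and purge paths can be sketched with a small
direct-mapped cache. Everything here is an assumption made for
illustration: the 4 KiB page size, the slot count, the structure layout,
and the ``miss`` callback, which stands in for the message round trip to
QEMU. ``demo_miss()`` is a stand-in translation used only to exercise the
code.

```c
#include <stdbool.h>
#include <stdint.h>

#define IOTLB_SLOTS      16
#define IOTLB_PAGE_SHIFT 12
#define IOTLB_PAGE_MASK  ((UINT64_C(1) << IOTLB_PAGE_SHIFT) - 1)

struct iotlb_entry {
    bool valid;
    uint64_t iova_page;     /* DMA (IO virtual) page number */
    uint64_t gpa_page;      /* guest physical page number */
};

struct iotlb {
    struct iotlb_entry slot[IOTLB_SLOTS];
    /* Stand-in for asking QEMU; returns false if there is no mapping. */
    bool (*miss)(uint64_t iova_page, uint64_t *gpa_page);
    unsigned misses;        /* number of requests sent to QEMU */
};

/* Translate a DMA address to a guest PA, filling the cache on a miss. */
static bool iotlb_translate(struct iotlb *t, uint64_t iova, uint64_t *gpa)
{
    uint64_t page = iova >> IOTLB_PAGE_SHIFT;
    struct iotlb_entry *e = &t->slot[page % IOTLB_SLOTS];

    if (!e->valid || e->iova_page != page) {
        uint64_t gpa_page;

        t->misses++;
        if (!t->miss(page, &gpa_page)) {
            return false;
        }
        e->valid = true;
        e->iova_page = page;
        e->gpa_page = gpa_page;
    }
    *gpa = (e->gpa_page << IOTLB_PAGE_SHIFT) | (iova & IOTLB_PAGE_MASK);
    return true;
}

/* Act on an unmap request from QEMU covering [iova, iova + len). */
static void iotlb_purge(struct iotlb *t, uint64_t iova, uint64_t len)
{
    uint64_t first, last;
    int i;

    if (len == 0) {
        return;
    }
    first = iova >> IOTLB_PAGE_SHIFT;
    last = (iova + len - 1) >> IOTLB_PAGE_SHIFT;
    for (i = 0; i < IOTLB_SLOTS; i++) {
        struct iotlb_entry *e = &t->slot[i];

        if (e->valid && e->iova_page >= first && e->iova_page <= last) {
            e->valid = false;
        }
    }
}

/* Demo translation only: guest page = DMA page + 0x100. */
static bool demo_miss(uint64_t iova_page, uint64_t *gpa_page)
{
    *gpa_page = iova_page + 0x100;
    return true;
}
```

A purge must invalidate before the next translation of the same page;
otherwise the device could keep using a mapping the guest has revoked.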

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore: the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much as the guest memory table is.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

-  guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+
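
As a sketch, the table could map onto a C structure like the one below,
together with the lookup a driver would need to match an access against
the configured ranges. The field names follow the table; the field widths,
the enum, and ``range_lookup()`` are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum range_bus { RANGE_BUS_MEMORY, RANGE_BUS_IO };

/* Field names follow the table above; the widths are assumptions. */
struct guest_phys_range {
    uint64_t addr;          /* range base address */
    uint64_t len;           /* range length */
    enum range_bus bus;     /* addr type (memory or IO) */
    uint32_t id;            /* range ID (e.g., PCI BAR) */
};

/*
 * Find the range containing 'addr' on 'bus'. On success, returns the
 * range and stores the offset within it; returns NULL if none matches.
 */
static const struct guest_phys_range *
range_lookup(const struct guest_phys_range *r, size_t n,
             enum range_bus bus, uint64_t addr, uint64_t *offset)
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Compare via subtraction so addr + len cannot overflow. */
        if (r[i].bus == bus && addr >= r[i].addr &&
            addr - r[i].addr < r[i].len) {
            *offset = addr - r[i].addr;
            return &r[i];
        }
    }
    return NULL;
}
```

The bus type takes part in the match because the same numeric address can
be valid in both the memory and the IO address space.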

-  MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

-  MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.
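
A minimal sketch of the two queues follows. The request fields follow the
table above; the fixed queue depth, the field widths, and all function
names are assumptions. It also assumes replies arrive in the order the
requests were read, which the text above does not state.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMIO_QUEUE_DEPTH 8

/* Field names follow the table above; the widths are assumptions. */
struct mmio_request {
    uint32_t rid;       /* range MMIO is within */
    uint64_t offset;    /* offset within rid */
    uint32_t type;      /* e.g., load or store */
    uint32_t len;       /* MMIO length */
    uint64_t data;      /* store data */
    uint64_t seq;       /* sequence ID */
};

struct mmio_fifo {
    struct mmio_request slot[MMIO_QUEUE_DEPTH];
    unsigned head;      /* index of the oldest entry */
    unsigned count;     /* number of queued entries */
};

static bool fifo_push(struct mmio_fifo *q, const struct mmio_request *r)
{
    if (q->count == MMIO_QUEUE_DEPTH) {
        return false;
    }
    q->slot[(q->head + q->count) % MMIO_QUEUE_DEPTH] = *r;
    q->count++;
    return true;
}

static bool fifo_pop(struct mmio_fifo *q, struct mmio_request *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->slot[q->head];
    q->head = (q->head + 1) % MMIO_QUEUE_DEPTH;
    q->count--;
    return true;
}

/* The emulation program reads a request: move it from pending to sent. */
static bool mmio_fetch(struct mmio_fifo *pending, struct mmio_fifo *sent,
                       struct mmio_request *out)
{
    if (sent->count == MMIO_QUEUE_DEPTH || !fifo_pop(pending, out)) {
        return false;
    }
    return fifo_push(sent, out);
}

/* Validate a reply: it must name the oldest unacknowledged request. */
static bool mmio_ack(struct mmio_fifo *sent, uint64_t seq)
{
    struct mmio_request done;

    if (sent->count == 0 || sent->slot[sent->head].seq != seq) {
        return false;
    }
    return fifo_pop(sent, &done);
}
```

Rejecting a reply whose sequence ID does not match keeps a misbehaving
emulation program from completing an MMIO that was never handed to it.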

-  scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.
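
The ordering rule in the last sentence reduces to a counter check. The
sketch below keeps only that bookkeeping and leaves out the per-CPU wait
queues and all locking, which a real implementation would need. The
structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Ordering bookkeeping only; wait queues and locking are omitted. */
struct scoreboard {
    unsigned posted_stores;     /* stores sent but not yet replied to */
    uint64_t next_seq;          /* next sequence ID to hand out */
};

/* Record a posted store and return the sequence ID assigned to it. */
static uint64_t sb_post_store(struct scoreboard *sb)
{
    sb->posted_stores++;
    return sb->next_seq++;
}

/* Record a store reply; returns false if no store was outstanding. */
static bool sb_store_done(struct scoreboard *sb)
{
    if (sb->posted_stores == 0) {
        return false;
    }
    sb->posted_stores--;
    return true;
}

/* A load may complete only once all earlier stores have completed. */
static bool sb_load_may_complete(const struct scoreboard *sb)
{
    return sb->posted_stores == 0;
}
```

In a threaded implementation, a CPU thread issuing a load would sleep on
its wait queue until this condition holds and its own reply has arrived.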

-  device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending an MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.
 687
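One possible shape for such an access map is a list of entries, each
covering part of the image and recording whether loads there are free of
side-effects and which access sizes are permitted. The entry layout and
the ``shadow_can_satisfy()`` helper below are assumptions of this
sketch, not a defined interface.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical map entry covering [start, start + len) of the image. */
    struct shadow_map_entry {
        uint64_t start;
        uint64_t len;
        bool no_side_effects; /* loads may be completed from the shadow */
        uint8_t size_mask;    /* OR of the allowed access sizes: 1, 2, 4, 8 */
    };

    /*
     * Return true if a load of 'size' bytes at 'offset' can be completed
     * from the shadow image without a request to the emulation program.
     */
    static bool shadow_can_satisfy(const struct shadow_map_entry *map, size_t n,
                                   uint64_t offset, uint8_t size)
    {
        assert(map != NULL || n == 0);

        /* Only power-of-two access sizes are meaningful here. */
        if (size == 0 || (size & (size - 1)) != 0) {
            return false;
        }
        for (size_t i = 0; i < n; i++) {
            const struct shadow_map_entry *e = &map[i];

            /* The whole access must fall inside the entry. */
            if (offset < e->start || size > e->len ||
                offset - e->start > e->len - size) {
                continue;
            }
            return e->no_side_effects && (e->size_mask & size) != 0;
        }
        return false; /* unmapped areas always go to the emulation program */
    }

In this sketch, anything not explicitly marked as free of side-effects
falls back to a MMIO request, which is the conservative default.
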
master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
'''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

-  create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

-  ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs a MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

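The guest physical range structure is not defined by this document. As a
rough sketch, it could carry a base address, a length, and the range ID
reported in MMIO requests. The code below models only the matching step
in plain C; in the kernel that lookup would be done by the KVM IO bus
code rather than by a helper like the hypothetical
``pa_range_lookup()``.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical argument for the KVM_DEV_USER_PA_RANGE command. */
    struct user_pa_range {
        uint64_t gpa; /* guest physical base address */
        uint64_t len; /* length of the range in bytes */
        uint32_t rid; /* range ID reported in MMIO requests */
    };

    /*
     * Find the range that fully contains the access. On a hit, return
     * the range and store the offset within it; return NULL otherwise.
     */
    static const struct user_pa_range *
    pa_range_lookup(const struct user_pa_range *ranges, size_t n,
                    uint64_t addr, uint32_t size, uint64_t *offset)
    {
        assert(offset != NULL);

        for (size_t i = 0; i < n; i++) {
            const struct user_pa_range *r = &ranges[i];

            if (addr >= r->gpa && size <= r->len &&
                addr - r->gpa <= r->len - size) {
                *offset = addr - r->gpa;
                return r;
            }
        }
        return NULL;
    }

When the guest reprograms a BAR, the old range stops matching and the
new one starts matching, which corresponds to the unregister and
register sequence described above.
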
*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to a MMIO
indication.

-  destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

-  read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

-  write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number is zero, and a non side-effect load was
waiting for posted stores to complete, the load is continued.

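The movement of requests between the two queues on read and write can
be modelled in user-space C as shown below. This is a simplified sketch:
all names and the fixed queue capacity are assumptions, replies are
matched by sequence ID only, and the points where a real driver would
wake sleeping threads are marked with comments.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define QUEUE_CAP 8 /* arbitrary capacity for this sketch */

    struct mmio_req {
        uint32_t seq;
        bool is_store;
    };

    /* FIFO of requests; entries [0, count) are ordered oldest first. */
    struct mmio_queue {
        struct mmio_req req[QUEUE_CAP];
        size_t count;
    };

    struct user_dev {
        struct mmio_queue pending; /* not yet read by the emulation program */
        struct mmio_queue sent;    /* read, but not yet acknowledged */
        unsigned posted_stores;    /* scoreboard counter */
    };

    static bool queue_push(struct mmio_queue *q, struct mmio_req r)
    {
        if (q->count == QUEUE_CAP) {
            return false;
        }
        q->req[q->count++] = r;
        return true;
    }

    /* Queue a new guest MMIO; false means the caller must wait for space. */
    static bool dev_post(struct user_dev *d, struct mmio_req r)
    {
        if (!queue_push(&d->pending, r)) {
            return false;
        }
        if (r.is_store) {
            d->posted_stores++;
        }
        return true;
    }

    /* read(): copy out up to 'max' requests and move them to the sent queue. */
    static size_t dev_read(struct user_dev *d, struct mmio_req *out, size_t max)
    {
        size_t n = 0;

        while (n < max && n < d->pending.count &&
               queue_push(&d->sent, d->pending.req[n])) {
            out[n] = d->pending.req[n];
            n++;
        }
        /* Drop the moved entries from the head of the pending queue. */
        memmove(&d->pending.req[0], &d->pending.req[n],
                (d->pending.count - n) * sizeof(d->pending.req[0]));
        d->pending.count -= n;
        return n; /* threads waiting for pending space would be woken here */
    }

    /* write(): match one reply against the sent queue by sequence ID. */
    static bool dev_write_reply(struct user_dev *d, uint32_t seq)
    {
        for (size_t i = 0; i < d->sent.count; i++) {
            if (d->sent.req[i].seq != seq) {
                continue;
            }
            if (d->sent.req[i].is_store) {
                assert(d->posted_stores > 0);
                d->posted_stores--;
            }
            memmove(&d->sent.req[i], &d->sent.req[i + 1],
                    (d->sent.count - i - 1) * sizeof(d->sent.req[0]));
            d->sent.count--;
            return true; /* the thread waiting for this reply would be woken */
        }
        return false; /* reply does not match anything that was sent */
    }

Rejecting replies that do not match an entry in the sent queue is the
validation role the document assigns to that queue.
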
-  ioctl

There are several ``ioctl()``\ s that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

-  poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

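On the emulation program side, waiting for requests is an ordinary
``poll()`` call. The helper below is a minimal sketch; its name is made
up, and because the slave descriptor does not exist outside this design,
any readable file descriptor (a pipe, for instance) can stand in for it
when trying the code.

.. code-block:: c

    #include <assert.h>
    #include <poll.h>
    #include <unistd.h>

    /*
     * Wait up to timeout_ms milliseconds for fd to become readable.
     * Returns 1 if data is waiting, 0 on timeout, and -1 on error.
     */
    static int wait_for_mmio(int fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc;

        assert(fd >= 0);
        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            return -1;
        }
        return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
    }

A return value of 1 would be followed by a ``read()`` on the descriptor
to fetch the pending MMIO request structures.
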
-  mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

-  read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

-  write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

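The decisions made by the two callbacks can be summarized as a small
classification step. The enum and function names below are hypothetical,
and the sketch leaves out the actual queueing and sleeping; it only
shows which path an access would take.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>

    enum mmio_action {
        MMIO_FROM_SHADOW,  /* complete the load from the shadow image */
        MMIO_WAIT_STORES,  /* wait for posted stores, then use the shadow */
        MMIO_SYNC_REQUEST, /* queue the load and sleep until the reply */
        MMIO_POSTED,       /* queue the store and resume the guest */
        MMIO_WAIT_SPACE,   /* sleep until the pending queue has room */
    };

    /* Path taken by a guest load, given this CPU's posted store count. */
    static enum mmio_action classify_load(bool has_side_effects,
                                          int posted_stores)
    {
        assert(posted_stores >= 0);

        if (has_side_effects) {
            return MMIO_SYNC_REQUEST;
        }
        /* PCI ordering: older stores must complete before the load. */
        return posted_stores > 0 ? MMIO_WAIT_STORES : MMIO_FROM_SHADOW;
    }

    /* Path taken by a guest store, given the pending queue state. */
    static enum mmio_action classify_store(bool pending_queue_full)
    {
        return pending_queue_full ? MMIO_WAIT_SPACE : MMIO_POSTED;
    }
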
interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

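From the emulation process's point of view, raising an interrupt
through an irq file descriptor is an 8-byte write to an eventfd. The
sketch below uses the Linux ``eventfd()`` interface directly; the helper
names are made up. In the design described here the KVM driver would be
the consumer of the eventfd, so ``irq_drain()`` exists only to make the
example observable from user space.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    /* Signal the interrupt bound to irq_fd; 0 on success, -1 on error. */
    static int irq_raise(int irq_fd)
    {
        uint64_t one = 1;

        assert(irq_fd >= 0);
        return write(irq_fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
    }

    /*
     * Read and reset the eventfd counter. Stands in for the consumer
     * side, which would be the KVM driver in the real setup.
     */
    static int irq_drain(int irq_fd, uint64_t *count)
    {
        assert(irq_fd >= 0);
        return read(irq_fd, count, sizeof(*count)) == (ssize_t)sizeof(*count)
               ? 0 : -1;
    }

Because an eventfd accumulates writes into a counter, two signals that
arrive before the consumer runs are seen as a single wakeup with a count
of two.
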
intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

intx irq descriptor
"""""""""""""""""""

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling *eventfds*,
and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. The
interrupt route can be found with ``pci_device_route_intx_to_irq()``.

intx routing changes
""""""""""""""""""""

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may
have multiple MSI interrupts associated with it, so multiple irq
descriptors may need to be sent to the emulation program.

MSI/X irq descriptor
""""""""""""""""""""

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes
""""""""""""""""""""""""""

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI-X tables exist in device memory space,
not config space. Much like the BAR case above, the proxy object must
look at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control adds a further set of controls, managed by the
OS itself, on top of discretionary access control. It also adds other
attributes to processes and files such as types, roles, and categories,
and can establish rules for how processes and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

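The effect of such rules can be pictured as a default-deny lookup in a
table of allowed (process type, file type) pairs. The sketch below is a
toy model in C rather than real policy syntax, and every type name in it
is invented for the example.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* One rule: processes of process_type may access files of file_type. */
    struct te_rule {
        const char *process_type;
        const char *file_type;
    };

    /* Hypothetical policy; none of these type names exist in a real policy. */
    static const struct te_rule te_policy[] = {
        { "qemu_disk_emu_t", "guest_disk_image_t" },
        { "qemu_net_emu_t",  "tun_tap_device_t" },
    };

    /* Access is denied unless a rule explicitly allows it. */
    static bool te_allowed(const char *process_type, const char *file_type)
    {
        assert(process_type != NULL && file_type != NULL);

        for (size_t i = 0; i < sizeof(te_policy) / sizeof(te_policy[0]); i++) {
            if (strcmp(te_policy[i].process_type, process_type) == 0 &&
                strcmp(te_policy[i].file_type, file_type) == 0) {
                return true;
            }
        }
        return false;
    }

With this table, neither the main QEMU process nor a network emulation
process would be able to open a guest disk image.
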
Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

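The superset rule is easy to express with bitmasks. The sketch below
assumes at most 64 categories so that a set fits in one ``uint64_t``;
that limit and the function names are choices made for the example only.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* A category set as a bitmask: bit n set means category n is present. */
    typedef uint64_t category_set;

    /* Build a set holding the single category n (n < 64 in this sketch). */
    static category_set category(unsigned n)
    {
        assert(n < 64);
        return (category_set)1 << n;
    }

    /* Grant access if the process's set is a superset of the file's set. */
    static bool category_allows(category_set process, category_set file)
    {
        return (process & file) == file;
    }

Two disk emulation processes given categories 0 and 1 could each open
only the image labelled with their own category, while a process holding
both categories could open either.
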
Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.