<HTML>
<HEAD>
<!-- Created by texi2html 1.56k from kqemu-tech.texi on 30 May 2008 -->
<TITLE>QEMU Accelerator Technical Documentation</TITLE>
</HEAD>
<BODY>
<H1>QEMU Accelerator Technical Documentation</H1>
<P>
<P><HR><P>
<H1>Table of Contents</H1>
<UL>
<LI><A NAME="TOC1" HREF="kqemu-tech.html#SEC1">1. Introduction</A>
<LI><A NAME="TOC2" HREF="kqemu-tech.html#SEC2">2. API definition</A>
<UL>
<LI><A NAME="TOC3" HREF="kqemu-tech.html#SEC3">2.1 RAM, Physical and Virtual addresses</A>
<LI><A NAME="TOC4" HREF="kqemu-tech.html#SEC4">2.2 RAM page dirtiness</A>
<LI><A NAME="TOC5" HREF="kqemu-tech.html#SEC5">2.3 <TT>`/dev/kqemu'</TT> device</A>
<LI><A NAME="TOC6" HREF="kqemu-tech.html#SEC6">2.4 <CODE>KQEMU_GET_VERSION</CODE> ioctl</A>
<LI><A NAME="TOC7" HREF="kqemu-tech.html#SEC7">2.5 <CODE>KQEMU_INIT</CODE> ioctl</A>
<LI><A NAME="TOC8" HREF="kqemu-tech.html#SEC8">2.6 <CODE>KQEMU_SET_PHYS_MEM</CODE> ioctl</A>
<LI><A NAME="TOC9" HREF="kqemu-tech.html#SEC9">2.7 <CODE>KQEMU_MODIFY_RAM_PAGE</CODE> ioctl</A>
<LI><A NAME="TOC10" HREF="kqemu-tech.html#SEC10">2.8 <CODE>KQEMU_EXEC</CODE> ioctl</A>
</UL>
<LI><A NAME="TOC11" HREF="kqemu-tech.html#SEC11">3. KQEMU inner working and limitations</A>
<UL>
<LI><A NAME="TOC12" HREF="kqemu-tech.html#SEC12">3.1 Inner working</A>
<LI><A NAME="TOC13" HREF="kqemu-tech.html#SEC13">3.2 General limitations</A>
<LI><A NAME="TOC14" HREF="kqemu-tech.html#SEC14">3.3 Security</A>
<LI><A NAME="TOC15" HREF="kqemu-tech.html#SEC15">3.4 Development Ideas</A>
</UL>
</UL>
<P><HR><P>
<P>
<H1><A NAME="SEC1" HREF="kqemu-tech.html#TOC1">1. Introduction</A></H1>
<P>
The QEMU Accelerator (KQEMU) is a driver allowing a user application
to run x86 code in a Virtual Machine (VM). The code can be either user
or kernel code, in 64, 32 or 16 bit protected mode. KQEMU is very
similar in essence to the Linux <CODE>vm86</CODE> system call, but it adds
some new concepts to improve memory handling.
<P>
KQEMU has been ported to several host OSes (currently Linux, Windows,
FreeBSD and Solaris). It can execute code from many guest OSes (e.g.
Linux, Windows 2000/XP) even if the host CPU does not support hardware
virtualization.
<P>
In this document, we assume that the reader has a good knowledge of the
x86 processor and of the problems associated with the virtualization
of x86 code.
<H1><A NAME="SEC2" HREF="kqemu-tech.html#TOC2">2. API definition</A></H1>
<P>
We describe version 1.3.0 of the Linux implementation. The
implementations on other OSes use the same calls, so they can be
understood by reading the Linux API specification.
<H2><A NAME="SEC3" HREF="kqemu-tech.html#TOC3">2.1 RAM, Physical and Virtual addresses</A></H2>
<P>
KQEMU manipulates three kinds of addresses:
<UL>
<LI>RAM addresses are between 0 and the available VM RAM size minus one.
They are currently stored in 32 bit words.
<LI>Physical addresses are addresses after MMU translation.
<LI>Virtual addresses are addresses before MMU translation.
</UL>
<P>
KQEMU has a physical page table which associates a RAM address or a
device I/O address range with a given physical page. It also tells
whether a given RAM address is visible as read-only memory. The same
RAM address can be mapped at several different physical addresses.
Only 4 GB of physical address space is supported in the current KQEMU
implementation. Hence the bits of order &#62;= 32 of the physical
addresses are ignored.
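<P>
The two address rules above can be illustrated with a minimal sketch.
The function names and <CODE>SKETCH_PAGE_SIZE</CODE> are hypothetical
and are not part of the KQEMU API; the 4K page size is taken from the
<CODE>ram_size</CODE> rule of the <CODE>KQEMU_INIT</CODE> ioctl.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4K pages, as for ram_size */

/* Physical addresses: bits of order >= 32 are ignored. */
uint32_t sketch_phys_addr(uint64_t phys_addr)
{
    return (uint32_t)(phys_addr & 0xffffffffu);
}

/* RAM addresses run from 0 to ram_size - 1; return the RAM page index. */
uint32_t sketch_ram_page_index(uint32_t ram_addr, uint64_t ram_size)
{
    assert(ram_addr < ram_size);
    return ram_addr / SKETCH_PAGE_SIZE;
}
```

With this sketch, two physical addresses that differ only in bits 32
and above refer to the same physical page.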
<H2><A NAME="SEC4" HREF="kqemu-tech.html#TOC4">2.2 RAM page dirtiness</A></H2>
<P>
It is very important for the VM to be able to tell if a given RAM page
has been modified. This can be used to optimize VGA refreshes, to flush
a dynamic translator cache (when used with QEMU), to handle live
migration or to optimize MMU emulation.
<P>
In KQEMU, each RAM page has an associated <EM>dirty byte</EM> in the
array <CODE>init_params.ram_dirty</CODE>. The dirty byte is set to
<CODE>0xff</CODE> if the corresponding RAM page is modified. That way, at
most 8 clients can each manage their own dirty bit for every page.
<P>
KQEMU reserves one dirty bit, <CODE>0x04</CODE>, for its internal use.
<P>
The client must notify KQEMU if some entries of the array
<CODE>init_params.ram_dirty</CODE> were modified from <CODE>0xff</CODE> to a
different value. The addresses of the corresponding RAM pages are stored
by the client in the array <CODE>init_params.ram_pages_to_update</CODE>.
<P>
The client must also notify KQEMU if a RAM page has been modified
independently of the <CODE>init_params.ram_dirty</CODE> state. This is done
with the <CODE>init_params.modified_ram_pages</CODE> array.
<P>
Symmetrically, KQEMU notifies the client if a RAM page has been
modified with the <CODE>init_params.modified_ram_pages</CODE> array. The
client can use this information, for example, to invalidate a dynamic
translation cache.
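<P>
As an illustration of the dirty byte scheme, here is a minimal sketch
of how a client might test and clear its own bit. Only the value
<CODE>0xff</CODE> and the reserved bit <CODE>0x04</CODE> come from the
text above; the helper names and the client bit <CODE>0x01</CODE> are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KQEMU_DIRTY_FLAG  0x04 /* reserved by KQEMU, per the text */
#define SKETCH_CLIENT_DIRTY_FLAG 0x01 /* hypothetical bit owned by a client */

/* A write to the page sets every client's bit at once. */
void sketch_mark_page_modified(uint8_t *ram_dirty, size_t page)
{
    ram_dirty[page] = 0xff;
}

/* Non-zero if the page was modified since the owner of flag cleared it. */
int sketch_page_is_dirty(const uint8_t *ram_dirty, size_t page, uint8_t flag)
{
    return (ram_dirty[page] & flag) != 0;
}

/*
 * Clear one client's bit. Returns 1 when the byte left the value 0xff,
 * which is the case the client must report to KQEMU through
 * ram_pages_to_update.
 */
int sketch_clear_dirty(uint8_t *ram_dirty, size_t page, uint8_t flag)
{
    int was_all_dirty = (ram_dirty[page] == 0xff);

    assert(flag != 0 && flag != SKETCH_KQEMU_DIRTY_FLAG);
    ram_dirty[page] &= (uint8_t)~flag;
    return was_all_dirty;
}
```

Clearing a single bit leaves the other clients' bits untouched, which
is why the whole byte is set to <CODE>0xff</CODE> on each modification.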
<H2><A NAME="SEC5" HREF="kqemu-tech.html#TOC5">2.3 <TT>`/dev/kqemu'</TT> device</A></H2>
<P>
A user client wishing to create a new virtual machine must open the
device <TT>`/dev/kqemu'</TT>. There is no hard limit on the number of
virtual machines that can be created and run at the same time, except
for the available memory.
<H2><A NAME="SEC6" HREF="kqemu-tech.html#TOC6">2.4 <CODE>KQEMU_GET_VERSION</CODE> ioctl</A></H2>
<P>
It returns the KQEMU API version as an int. The client must use it to
determine if it is compatible with the KQEMU driver.
<H2><A NAME="SEC7" HREF="kqemu-tech.html#TOC7">2.5 <CODE>KQEMU_INIT</CODE> ioctl</A></H2>
<P>
Input parameter: <CODE>struct kqemu_init init_params</CODE>
<P>
It must be called once to initialize the VM. The following structure
is used as input parameter:
<PRE>
struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};
</PRE>
<P>
The pointers <CODE>ram_base</CODE>, <CODE>ram_dirty</CODE>,
<CODE>pages_to_flush</CODE>, <CODE>ram_pages_to_update</CODE> and
<CODE>modified_ram_pages</CODE> must be page aligned and must point to
user allocated memory.
<P>
On Linux, due to a kernel bug related to memory swapping, the
corresponding memory must be mmapped from a file. We plan to remove
this restriction in a future implementation.
<P>
<CODE>ram_size</CODE> must be a multiple of 4K and is the quantity of RAM
allocated to the VM.
<P>
<CODE>ram_base</CODE> is a pointer to the VM RAM. It must contain at least
<CODE>ram_size</CODE> bytes.
<P>
<CODE>ram_dirty</CODE> is a pointer to a byte array of length
<CODE>ram_size/4096</CODE>. Each byte indicates if the corresponding VM RAM
page has been modified (see section <A HREF="kqemu-tech.html#SEC4">2.2 RAM page dirtiness</A>).
<P>
<CODE>pages_to_flush</CODE> is a pointer to an array of
<CODE>KQEMU_MAX_PAGES_TO_FLUSH</CODE> entries. It is used to indicate which
TLB entries must be flushed before executing code in the VM.
<P>
<CODE>ram_pages_to_update</CODE> is a pointer to an array of
<CODE>KQEMU_MAX_RAM_PAGES_TO_UPDATE</CODE> entries. It is used to notify the
VM that some RAM pages have been dirtied.
<P>
<CODE>modified_ram_pages</CODE> is a pointer to an array of
<CODE>KQEMU_MAX_MODIFIED_RAM_PAGES</CODE> entries. It is used to notify the
VM or the client that RAM pages have been modified.
<P>
The value 0 is returned if the ioctl succeeded.
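<P>
The constraints on <CODE>init_params</CODE> can be checked before
issuing the ioctl. The following is a minimal sketch under the
assumption of 4K pages. The <CODE>sketch_</CODE> helpers are
hypothetical, and the structure is redeclared from the listing above
only to keep the example self-contained. It does not allocate or mmap
any memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4K pages, per the ram_size rule */

/* Fields copied from the structure shown in this section. */
struct kqemu_init {
    uint8_t *ram_base;
    uint64_t ram_size;
    uint8_t *ram_dirty;
    uint64_t *pages_to_flush;
    uint64_t *ram_pages_to_update;
    uint64_t *modified_ram_pages;
};

/* Non-null and aligned on a page boundary. */
static int sketch_page_aligned(const void *p)
{
    return p != NULL && ((uintptr_t)p % SKETCH_PAGE_SIZE) == 0;
}

/* Length in bytes of the ram_dirty array: one byte per RAM page. */
uint64_t sketch_ram_dirty_len(uint64_t ram_size)
{
    assert(ram_size % SKETCH_PAGE_SIZE == 0);
    return ram_size / SKETCH_PAGE_SIZE;
}

/* Return 1 if the parameters meet the constraints listed above, else 0. */
int sketch_init_params_ok(const struct kqemu_init *p)
{
    if (p->ram_size % SKETCH_PAGE_SIZE != 0)
        return 0;
    return sketch_page_aligned(p->ram_base)
        && sketch_page_aligned(p->ram_dirty)
        && sketch_page_aligned(p->pages_to_flush)
        && sketch_page_aligned(p->ram_pages_to_update)
        && sketch_page_aligned(p->modified_ram_pages);
}
```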
<H2><A NAME="SEC8" HREF="kqemu-tech.html#TOC8">2.6 <CODE>KQEMU_SET_PHYS_MEM</CODE> ioctl</A></H2>
<P>
The following structure is used as input parameter:
<PRE>
struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};
</PRE>
<P>
The ioctl modifies the internal KQEMU physical to RAM mappings. After
the ioctl is executed, the physical address range <CODE>[phys_addr;
phys_addr + size[</CODE> is mapped to the RAM addresses <CODE>[ram_addr;
ram_addr + size[</CODE> if <CODE>io_index</CODE> is <CODE>KQEMU_IO_MEM_RAM</CODE> or
<CODE>KQEMU_IO_MEM_ROM</CODE>. If <CODE>KQEMU_IO_MEM_ROM</CODE> is used,
writes to the RAM are ignored.
<P>
When <CODE>io_index</CODE> is <CODE>KQEMU_IO_MEM_UNASSIGNED</CODE>, the
physical memory range corresponds to a device I/O region. When a
memory access is made to it, <CODE>KQEMU_EXEC</CODE> returns with
<CODE>cpu_state.retval</CODE> set to <CODE>KQEMU_RET_SOFTMMU</CODE>.
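<P>
The half-open range mapping can be sketched as follows. The helper
<CODE>sketch_phys_to_ram</CODE> is hypothetical and only mirrors the
arithmetic described above. It assumes the entry was registered as RAM
or ROM and does not inspect <CODE>io_index</CODE>, since the values of
the <CODE>KQEMU_IO_MEM_*</CODE> constants are not given in this
document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields copied from the structure shown in this section. */
struct kqemu_phys_mem {
    uint64_t phys_addr;
    uint64_t size;
    uint64_t ram_addr;
    uint32_t io_index;
    uint32_t padding1;
};

/*
 * If phys lies in [phys_addr; phys_addr + size[, store the matching RAM
 * address in *ram_addr and return 1. Otherwise return 0 and leave
 * *ram_addr unchanged.
 */
int sketch_phys_to_ram(const struct kqemu_phys_mem *m, uint64_t phys,
                       uint64_t *ram_addr)
{
    assert(m != NULL && ram_addr != NULL);
    if (phys < m->phys_addr || phys - m->phys_addr >= m->size)
        return 0;
    *ram_addr = m->ram_addr + (phys - m->phys_addr);
    return 1;
}
```

The upper bound is exclusive: the address <CODE>phys_addr + size</CODE>
itself is outside the mapped range.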
<H2><A NAME="SEC9" HREF="kqemu-tech.html#TOC9">2.7 <CODE>KQEMU_MODIFY_RAM_PAGE</CODE> ioctl</A></H2>
<P>
Input parameter: <CODE>int nb_pages</CODE>
<P>
Notify the VM that <CODE>nb_pages</CODE> RAM pages were modified. The
corresponding RAM page addresses are written by the client in the
<CODE>init_params.modified_ram_pages</CODE> array given with the
<CODE>KQEMU_INIT</CODE> ioctl.
<P>
Note: This ioctl currently does nothing, but clients must use it
for later compatibility.
<H2><A NAME="SEC10" HREF="kqemu-tech.html#TOC10">2.8 <CODE>KQEMU_EXEC</CODE> ioctl</A></H2>
<P>
Input/Output parameter: <CODE>struct kqemu_cpu_state cpu_state</CODE>
<P>
Structure definitions:
<PRE>
struct kqemu_segment_cache {
    uint16_t selector;
    uint16_t padding1;
    uint32_t flags;
    uint64_t base;
    uint32_t limit;
    uint32_t padding2;
};

struct kqemu_cpu_state {
    uint64_t regs[16];
    uint64_t eip;
    uint64_t eflags;

    struct kqemu_segment_cache segs[6]; /* selector values */
    struct kqemu_segment_cache ldt;
    struct kqemu_segment_cache tr;
    struct kqemu_segment_cache gdt; /* only base and limit are used */
    struct kqemu_segment_cache idt; /* only base and limit are used */

    uint64_t cr0;
    uint64_t cr2;
    uint64_t cr3;
    uint64_t cr4;
    uint64_t a20_mask;

    /* sysenter registers */
    uint64_t sysenter_cs;
    uint64_t sysenter_esp;
    uint64_t sysenter_eip;
    uint64_t efer;
    uint64_t star;

    uint64_t lstar;
    uint64_t cstar;
    uint64_t fmask;
    uint64_t kernelgsbase;

    uint64_t tsc_offset;

    uint64_t dr0;
    uint64_t dr1;
    uint64_t dr2;
    uint64_t dr3;
    uint64_t dr6;
    uint64_t dr7;

    uint8_t cpl;
    uint8_t user_only;
    uint16_t padding1;

    uint32_t error_code; /* error_code when exiting with an exception */
    uint64_t next_eip; /* next eip value when exiting with an interrupt */
    uint32_t nb_pages_to_flush;
    int32_t retval;

    uint32_t nb_ram_pages_to_update;

    uint32_t nb_modified_ram_pages;
};
</PRE>
<P>
Execute x86 instructions in the VM context. The full x86 CPU state is
defined in this structure. It contains in particular the values of the
8 (or 16 for x86_64) general purpose registers, the contents of the
segment caches, and the RIP and EFLAGS values.
<P>
If <CODE>cpu_state.user_only</CODE> is 1, a user only emulation is
done. <CODE>cpu_state.cpl</CODE> must be 3 in that case.
<P>
<CODE>KQEMU_EXEC</CODE> does the following:
<OL>
<LI>Update the internal dirty state of the
<CODE>cpu_state.nb_ram_pages_to_update</CODE> RAM pages from the array
<CODE>init_params.ram_pages_to_update</CODE>. If
<CODE>cpu_state.nb_ram_pages_to_update</CODE> has the value
<CODE>KQEMU_RAM_PAGES_UPDATE_ALL</CODE>, it means that all the RAM pages may
have been dirtied. The array <CODE>init_params.ram_pages_to_update</CODE> is
ignored in that case.
<LI>Update the internal KQEMU state by taking into account that the
<CODE>cpu_state.nb_modified_ram_pages</CODE> RAM pages from the array
<CODE>init_params.modified_ram_pages</CODE> were modified by the client.
<LI>Flush the virtual CPU TLB entries corresponding to the virtual
addresses in the array <CODE>init_params.pages_to_flush</CODE> of length
<CODE>cpu_state.nb_pages_to_flush</CODE>. If
<CODE>cpu_state.nb_pages_to_flush</CODE> is <CODE>KQEMU_FLUSH_ALL</CODE>, all the
TLBs are flushed. The array <CODE>init_params.pages_to_flush</CODE> is
ignored in that case.
<LI>Load the virtual CPU state from <CODE>cpu_state</CODE>.
<LI>Execute some code in the VM context.
<LI>Save the virtual CPU state into <CODE>cpu_state</CODE>.
<LI>Indicate the reason why the execution was stopped in
<CODE>cpu_state.retval</CODE>.
<LI>Update <CODE>cpu_state.nb_pages_to_flush</CODE> and
<CODE>init_params.pages_to_flush</CODE> to notify the client that some
virtual CPU TLBs were flushed. The client can use this notification to
synchronize its own virtual TLBs with KQEMU.
<LI>Set <CODE>cpu_state.nb_ram_pages_to_update</CODE> to 1 if some
RAM dirty bytes changed from dirty (<CODE>0xff</CODE>) to a non-dirty
value. Otherwise, <CODE>cpu_state.nb_ram_pages_to_update</CODE> is set to 0.
<LI>Update <CODE>cpu_state.nb_modified_ram_pages</CODE> and
<CODE>init_params.modified_ram_pages</CODE> to notify the client that some
RAM pages were modified.
</OL>
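<P>
On the client side, the TLB flush request of step 3 has to be prepared
before each call. The sketch below shows one possible way to queue
virtual addresses and fall back to a full flush when the array is full.
The fallback policy is an assumption, not something this document
specifies, and the two constants are placeholders: the real values of
<CODE>KQEMU_MAX_PAGES_TO_FLUSH</CODE> and <CODE>KQEMU_FLUSH_ALL</CODE>
come from the KQEMU header.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values, standing in for the constants of the KQEMU header. */
#define SKETCH_MAX_PAGES_TO_FLUSH 4
#define SKETCH_FLUSH_ALL (SKETCH_MAX_PAGES_TO_FLUSH + 1)

/* Stand-in for init_params.pages_to_flush and cpu_state.nb_pages_to_flush. */
struct sketch_flush_queue {
    uint64_t pages_to_flush[SKETCH_MAX_PAGES_TO_FLUSH];
    uint32_t nb_pages_to_flush;
};

/* Queue one virtual address; request a full flush once the array is full. */
void sketch_queue_flush(struct sketch_flush_queue *q, uint64_t vaddr)
{
    assert(q->nb_pages_to_flush <= SKETCH_FLUSH_ALL);
    if (q->nb_pages_to_flush == SKETCH_FLUSH_ALL)
        return; /* a full flush is already requested */
    if (q->nb_pages_to_flush == SKETCH_MAX_PAGES_TO_FLUSH) {
        q->nb_pages_to_flush = SKETCH_FLUSH_ALL; /* array is then ignored */
        return;
    }
    q->pages_to_flush[q->nb_pages_to_flush++] = vaddr;
}
```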
<P>
<CODE>cpu_state.retval</CODE> indicates the reason why the execution was
stopped:
<DL COMPACT>
<DT><CODE>KQEMU_RET_EXCEPTION | n</CODE>
<DD>
The virtual CPU raised an exception and KQEMU cannot handle it. The
exception number <VAR>n</VAR> is stored in the 8 low order bits. The field
<CODE>cpu_state.error_code</CODE> contains the exception error code if it is
needed. It should be noted that in <EM>user only</EM> emulation, KQEMU
handles no exceptions by itself.
<DT><CODE>KQEMU_RET_INT | n</CODE>
<DD>
(<EM>user only</EM> emulation) The virtual CPU generated a software
interrupt (INT instruction for example). The interrupt number <VAR>n</VAR>
is stored in the 8 low order bits. The field <CODE>cpu_state.next_eip</CODE>
contains the value of RIP after the instruction raising the
interrupt. <CODE>cpu_state.eip</CODE> contains the value of RIP at the
instruction raising the interrupt.
<DT><CODE>KQEMU_RET_SOFTMMU</CODE>
<DD>
The virtual CPU could not handle the current instruction. This is not
a fatal error. Usually the client just needs to interpret it. It can
happen for the following reasons:
<UL>
<LI>a memory access to an unassigned address or unknown device type;
<LI>an instruction cannot be accurately executed by KQEMU
(e.g. SYSENTER, HLT, ...);
<LI>more than <CODE>KQEMU_MAX_MODIFIED_RAM_PAGES</CODE> RAM pages were modified;
<LI>some unsupported bits were modified in CR0 or CR4;
<LI>GDT.base or LDT.base is not a multiple of 8;
<LI>the GDT or LDT tables were modified while CPL = 3;
<LI>EFLAGS.VM was set.
</UL>
<DT><CODE>KQEMU_RET_INTR</CODE>
<DD>
A signal from the OS interrupted KQEMU.
<DT><CODE>KQEMU_RET_SYSCALL</CODE>
<DD>
(<EM>user only</EM> emulation) The SYSCALL instruction was executed. The
field <CODE>cpu_state.next_eip</CODE> contains the value of RIP after the
instruction. <CODE>cpu_state.eip</CODE> contains the RIP of the instruction.
<DT><CODE>KQEMU_RET_ABORT</CODE>
<DD>
An unrecoverable error was detected. This is usually due to a bug in
KQEMU, so it should never happen.
</DL>
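<P>
For the two return codes that carry a number <VAR>n</VAR>, the value
can be split with a mask. The following is a minimal sketch with
hypothetical helper names. It relies only on the statement that
<VAR>n</VAR> occupies the 8 low order bits, and it assumes nothing
about the numeric values of the <CODE>KQEMU_RET_*</CODE> constants. The
other return codes carry no number and are best compared as whole
values.

```c
#include <assert.h>
#include <stdint.h>

/* Exception or interrupt number n: the 8 low order bits of retval. */
unsigned sketch_retval_number(int32_t retval)
{
    return (uint32_t)retval & 0xffu;
}

/* Remaining bits, to compare with KQEMU_RET_EXCEPTION or KQEMU_RET_INT. */
uint32_t sketch_retval_base(int32_t retval)
{
    return (uint32_t)retval & ~(uint32_t)0xffu;
}

/* Mirror of the "BASE | n" notation used in the list above. */
int32_t sketch_make_retval(uint32_t base, unsigned n)
{
    assert((base & 0xffu) == 0 && n <= 0xffu);
    return (int32_t)(base | n);
}
```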
<H1><A NAME="SEC11" HREF="kqemu-tech.html#TOC11">3. KQEMU inner working and limitations</A></H1>
<H2><A NAME="SEC12" HREF="kqemu-tech.html#TOC12">3.1 Inner working</A></H2>
<P>
The main priorities when implementing KQEMU were simplicity and
security. Unlike other virtualization systems, it does not do any
dynamic translation or code patching.
<UL>
<LI>KQEMU always executes the target code at CPL = 3 on the host
processor. It means that KQEMU can use the page protections to ensure
that the VM cannot modify the host OS or the KQEMU monitor. Moreover,
it means that KQEMU does not need to modify the segment limits to
ensure memory protection. Another advantage is that this method works
with 64 bit code too.
<LI>KQEMU maintains a shadow page table simulating the TLBs of the
virtual CPU. The shadow page table persists between calls to
<CODE>KQEMU_EXEC</CODE>.
<LI>When the target CPL is 3, the target GDT and LDT are copied to
the host GDT and LDT so that the LAR and LSL instructions return
a meaningful value. This is important for 16 bit code.
<LI>When the target CPL is different from 3, the host GDT and LDT
are cleared so that any segment loading causes a General Protection
Fault. That way, KQEMU can intercept every segment loading.
<LI>All the code running with EFLAGS.IF = 0 is interpreted so that
EFLAGS.IF can be accurately reset in the VM. Fortunately, modern OSes
tend to execute very little code with interrupts disabled.
<LI>KQEMU maintains dirty bits for every RAM page so that modified
RAM pages can be tracked. It is useful to know if the GDT and LDT are
modified in user mode, and will be useful later to optimize shadow
page table switching. It is also useful to maintain the coherency of
the user space QEMU translation cache.
</UL>
<H2><A NAME="SEC13" HREF="kqemu-tech.html#TOC13">3.2 General limitations</A></H2>
<P>
Note 1: KQEMU does not currently use the hardware virtualization
features of newer x86 CPUs. We expect that the limitations would be
different in that case.
<P>
Note 2: KQEMU supports both x86 and x86_64 CPUs.
<P>
Before entering the VM, the following conditions must be satisfied:
<OL>
<LI>CR0.PE = 1 (protected mode must be enabled)
<LI>CR0.MP = 1 (native math support)
<LI>CR0.WP = 1 (write protection for user pages)
<LI>EFLAGS.VM = 0 (no VM86 support)
<LI>At least 8 consecutive GDT descriptors must be available
(currently at a fixed location in the GDT).
<LI>At least 32 MB of virtual address space must be free (currently
at a fixed location).
<LI>All the pages containing the LDT and GDT must be RAM pages.
</OL>
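<P>
The first four conditions are plain bit tests on the guest CR0 and
EFLAGS values. The following minimal sketch uses the architectural bit
positions (CR0.PE is bit 0, CR0.MP is bit 1, CR0.WP is bit 16,
EFLAGS.VM is bit 17). The macro and function names are hypothetical
and are not taken from the KQEMU sources. Conditions 5 to 7 depend on
the descriptor tables and the memory layout, so they are not covered
here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CR0_PE    (1u << 0)  /* protected mode enable */
#define SKETCH_CR0_MP    (1u << 1)  /* monitor coprocessor */
#define SKETCH_CR0_WP    (1u << 16) /* write protect */
#define SKETCH_EFLAGS_VM (1u << 17) /* virtual 8086 mode */

static_assert(SKETCH_EFLAGS_VM == 0x20000u, "EFLAGS.VM is bit 17");

/* Return 1 if conditions 1 to 4 of the list above hold, else 0. */
int sketch_can_enter_vm(uint64_t cr0, uint64_t eflags)
{
    const uint64_t required = SKETCH_CR0_PE | SKETCH_CR0_MP | SKETCH_CR0_WP;

    return (cr0 & required) == required
        && (eflags & SKETCH_EFLAGS_VM) == 0;
}
```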
<P>
If EFLAGS.IF is set, the following assumptions are made on the
executing code:
<OL>
<LI>If EFLAGS.IOPL = 3, EFLAGS.IOPL = 0 is returned in EFLAGS.
<LI>POPF cannot be used to clear EFLAGS.IF.
<LI>RDTSC returns host cycles (could be improved if needed).
<LI>The values returned by SGDT, SIDT, SLDT are invalid.
<LI>Reading CS.rpl and SS.rpl always returns 3 regardless of the CPL.
<LI>In 64 bit mode with CPL != 3, reading SS.sel does not give 0
if the OS stored 0 in it.
<LI>LAR, LSL, VERR, VERW return invalid results if CPL != 3.
<LI>The CS and SS segment caches must be consistent with the descriptor
tables.
<LI>The DS, ES, FS, and GS segment caches must be consistent with the
descriptor tables for CPL = 3.
<LI>Some rarely used instructions trap to the user space client
(performance issue).
</OL>
<P>
If EFLAGS.IF is reset, the code is interpreted, so the VM code can be
accurately executed. Some instructions trap to the user space emulator
because the interpreter does not handle them. A limitation of the
interpreter is that segment limits are currently not always tested.
<H2><A NAME="SEC14" HREF="kqemu-tech.html#TOC14">3.3 Security</A></H2>
<P>
The VM code is always run with CPL = 3 on the host, so <EM>the VM
code has no more privilege than regular user code</EM>.
<P>
The MMU is used to protect the memory used by the KQEMU monitor. That
way, no segment limit patching is necessary. Moreover, the guest OS is
free to use any virtual address, in particular the ones near the start
or the end of the virtual address space. The price to pay is that CR3
must be modified at every emulated system call because different page
tables are needed for user and kernel modes.
<H2><A NAME="SEC15" HREF="kqemu-tech.html#TOC15">3.4 Development Ideas</A></H2>
<UL>
<LI>Instead of interpreting the code when IF=0, compile it dynamically.
The dynamic compiler itself can be implemented in user space, so the
kernel module would be simplified.
<LI>Use APIs closer to KVM.
<LI>Optimization of the page table shadowing. A shadow page table cache
could be implemented by tracking the modification of the guest page
tables. The exact performance gains are difficult to estimate because
the tracking itself would introduce some performance loss.
<LI>Support of guest SMP. There is no particular problem except
when a RAM page must be unlocked because the host does not have enough
memory. This particular case needs specific Inter Processor Interrupts
(IPI).
<LI>Dynamic relocation of the monitor code so that a 32 MB hole
in the guest address space is found automatically without making
assumptions about the guest OS.
</UL>
<P><HR><P>
This document was generated on 30 May 2008 using
<A HREF="http://wwwinfo.cern.ch/dis/texi2html/">texi2html</A>&nbsp;1.56k.
</BODY>
</HTML>