On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

We describe in much greater detail the theory behind the nested VMX feature,
its implementation and its performance characteristics, in the OSDI 2010 paper
"The Turtles Project: Design and Implementation of Nested Virtualization",
available at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

Single-level virtualization has two levels - the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0, the guest hypervisor, which we call L1, and its nested guest, which we
call L2.

The current code supports running Linux guests under KVM guests.
Only 64-bit guest hypervisors are supported.

Additional patches for running Windows under guest KVM, and Linux under
guest VMware server, and support for nested EPT, are currently running in
the lab, and will be sent as follow-on patchsets.

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

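Whether the option took effect can be checked from user space. The following
is a minimal sketch, not part of KVM: it assumes that the loaded module
exposes its parameter as a sysfs file (typically
/sys/module/kvm_intel/parameters/nested) and that the file holds "Y"/"N" or
"1"/"0" depending on the kernel version. The function names are hypothetical.

```c
#include <stdio.h>

/* Interpret the text read from the "nested" parameter file.
 * Assumption: a boolean parameter is printed as "Y"/"N" and an integer
 * one as "1"/"0", so both spellings are accepted here. */
static int nested_value_enabled(const char *text)
{
	return text != NULL &&
	       (text[0] == 'Y' || text[0] == 'y' || text[0] == '1');
}

/* Returns 1 if nested VMX is enabled, 0 if it is disabled, and -1 if the
 * parameter file cannot be read (for example, kvm-intel is not loaded). */
static int nested_vmx_enabled(const char *path)
{
	char buf[16];
	FILE *f = fopen(path, "r");
	int ok;

	if (f == NULL)
		return -1;
	ok = fgets(buf, sizeof(buf), f) != NULL;
	fclose(f);
	return ok ? nested_value_enabled(buf) : -1;
}
```

A caller would pass the sysfs path mentioned above and treat -1 as "module
not loaded or parameter not available".
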
No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

	-cpu host (emulated CPU has all features of the real CPU)

	-cpu qemu64,+vmx (add just the vmx feature to a named CPU type)

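Inside L1, one way to confirm that the emulated CPU really advertises VMX is
to look for the "vmx" word on the "flags" line of /proc/cpuinfo. The sketch
below illustrates that check under the assumption that L1 runs Linux; the
helper names are hypothetical and the code is not taken from qemu or KVM.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in `line`. */
static int line_has_flag(const char *line, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = line;

	if (n == 0)
		return 0;
	while ((p = strstr(p, flag)) != NULL) {
		int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[n] == '\0' || p[n] == ' ' ||
			   p[n] == '\t' || p[n] == '\n';

		if (starts && ends)
			return 1;
		p += n;
	}
	return 0;
}

/* Scan a cpuinfo-style file. Returns 1 if a "flags" line lists vmx,
 * 0 if none does, and -1 if the file cannot be opened. */
static int cpuinfo_has_vmx(const char *path)
{
	char line[8192];
	FILE *f = fopen(path, "r");
	int found = 0;

	if (f == NULL)
		return -1;
	while (!found && fgets(line, sizeof(line), f) != NULL) {
		if (strncmp(line, "flags", 5) == 0)
			found = line_has_flag(line, "vmx");
	}
	fclose(f);
	return found;
}
```

Calling cpuinfo_has_vmx("/proc/cpuinfo") in a guest started with the default
qemu64 CPU type would be expected to return 0, and 1 once one of the options
above is given.
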
Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure. It is struct vmcs12 from arch/x86/kvm/vmx.c.

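To illustrate how an opaque structure can still be reached through VMREAD,
here is a toy sketch of an emulated VMREAD that maps a field encoding to an
offset inside a structure. It is not KVM's implementation: struct toy_vmcs12
holds only five fields, the table and function names are hypothetical, and
natural-width fields are assumed to be 64 bits wide. The encodings themselves
and the meaning of bits 14:13 (field width) follow Intel's VMX specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for struct vmcs12, reduced to a handful of fields. */
struct toy_vmcs12 {
	uint64_t guest_cr0;
	uint64_t guest_rsp;
	uint64_t guest_rip;
	uint32_t exception_bitmap;
	uint16_t guest_cs_selector;
};

/* Field encodings as defined by Intel's VMX specification. */
enum {
	ENC_GUEST_CS_SELECTOR = 0x0802,
	ENC_EXCEPTION_BITMAP  = 0x4004,
	ENC_GUEST_CR0         = 0x6800,
	ENC_GUEST_RSP         = 0x681c,
	ENC_GUEST_RIP         = 0x681e
};

struct toy_field {
	uint32_t encoding;
	size_t offset;
};

static const struct toy_field toy_fields[] = {
	{ ENC_GUEST_CS_SELECTOR, offsetof(struct toy_vmcs12, guest_cs_selector) },
	{ ENC_EXCEPTION_BITMAP,  offsetof(struct toy_vmcs12, exception_bitmap) },
	{ ENC_GUEST_CR0,         offsetof(struct toy_vmcs12, guest_cr0) },
	{ ENC_GUEST_RSP,         offsetof(struct toy_vmcs12, guest_rsp) },
	{ ENC_GUEST_RIP,         offsetof(struct toy_vmcs12, guest_rip) },
};

/* Width in bytes implied by bits 14:13 of the encoding:
 * 0 = 16-bit, 1 = 64-bit, 2 = 32-bit, 3 = natural width (8 assumed). */
static size_t toy_field_width(uint32_t encoding)
{
	switch ((encoding >> 13) & 3) {
	case 0:
		return 2;
	case 2:
		return 4;
	default:
		return 8;
	}
}

/* Emulated VMREAD: stores the field value in *out and returns 0,
 * or returns -1 if the encoding is not in the table. */
static int toy_vmread(const struct toy_vmcs12 *v, uint32_t encoding,
		      uint64_t *out)
{
	size_t i;

	for (i = 0; i < sizeof(toy_fields) / sizeof(toy_fields[0]); i++) {
		const char *p;
		uint16_t v16;
		uint32_t v32;
		uint64_t v64;

		if (toy_fields[i].encoding != encoding)
			continue;
		p = (const char *)v + toy_fields[i].offset;
		switch (toy_field_width(encoding)) {
		case 2:
			memcpy(&v16, p, sizeof(v16));
			*out = v16;
			break;
		case 4:
			memcpy(&v32, p, sizeof(v32));
			*out = v32;
			break;
		default:
			memcpy(&v64, p, sizeof(v64));
			*out = v64;
		}
		return 0;
	}
	return -1;
}
```

The point of the table is that L1 only ever names fields by encoding, so the
in-memory layout can stay private to the implementation.
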
The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02", the VMCS
which L0 builds to actually run L2. How this is done is explained in the
paper cited above.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width cr3_target_value0;
		natural_width cr3_target_value1;
		natural_width cr3_target_value2;
		natural_width cr3_target_value3;
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_ldtr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};

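Because a layout change breaks live migration unless the revision is bumped,
it can help to make accidental changes fail at build time. The following is a
sketch of that idea only: it uses a cut-down, hypothetical struct toy_layout
and a made-up TOY_VMCS12_REVISION value, it is not taken from KVM's source,
and it relies on the GCC/Clang packed attribute (which the kernel's __packed
macro expands to).

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical revision number; bump it whenever the layout changes. */
#define TOY_VMCS12_REVISION 1u

/* Cut-down stand-in for struct vmcs12 with the same leading fields. */
struct __attribute__((packed)) toy_layout {
	uint32_t revision_id;
	uint32_t abort;
	uint32_t launch_state;
	uint32_t padding[7];
	uint64_t guest_cr0;
	uint16_t guest_cs_selector;
};

/* Pin every offset: moving or resizing a field now breaks the build,
 * which is the moment to update the revision number as well. */
static_assert(offsetof(struct toy_layout, revision_id) == 0, "revision_id moved");
static_assert(offsetof(struct toy_layout, abort) == 4, "abort moved");
static_assert(offsetof(struct toy_layout, launch_state) == 8, "launch_state moved");
static_assert(offsetof(struct toy_layout, guest_cr0) == 40, "guest_cr0 moved");
static_assert(offsetof(struct toy_layout, guest_cs_selector) == 48,
	      "guest_cs_selector moved");
static_assert(sizeof(struct toy_layout) == 50, "struct size changed");

/* A saved image is usable only if it was written with the same revision. */
static int toy_layout_compatible(uint32_t saved_revision)
{
	return saved_revision == TOY_VMCS12_REVISION;
}
```

On the receiving side of a migration, a check such as
toy_layout_compatible() would reject state saved with a different layout
instead of misinterpreting it.
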
These patches were written by:
	Abel Gordon, abelg <at> il.ibm.com
	Nadav Har'El, nyh <at> il.ibm.com
	Orit Wasserman, oritw <at> il.ibm.com
	Ben-Ami Yassor, benami <at> il.ibm.com
	Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
	Anthony Liguori, aliguori <at> us.ibm.com
	Mike Day, mdday <at> us.ibm.com
	Michael Factor, factor <at> il.ibm.com
	Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
	Avi Kivity, avi <at> redhat.com
	Gleb Natapov, gleb <at> redhat.com
	Marcelo Tosatti, mtosatti <at> redhat.com
	Kevin Tian, kevin.tian <at> intel.com