<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Control Groups Resource Management</h1>

    <ul id="toc"></ul>

    <p>
      The QEMU and LXC drivers make use of the Linux "Control Groups" facility
      for applying resource management to their virtual machines and containers.
    </p>
    <h2><a id="requiredControllers">Required controllers</a></h2>

    <p>
      The control groups filesystem supports multiple "controllers". By default
      the init system (such as systemd) should mount all controllers compiled
      into the kernel at <code>/sys/fs/cgroup/$CONTROLLER-NAME</code>. Libvirt
      will never attempt to mount any controllers itself, merely detect where
      they are mounted.
    </p>
    <p>
      The QEMU driver is capable of using the <code>cpuset</code>,
      <code>cpu</code>, <code>cpuacct</code>, <code>memory</code>,
      <code>blkio</code> and <code>devices</code> controllers.
      None of them are compulsory. If any controller is not mounted,
      the resource management APIs which use it will cease to operate.
      It is possible to explicitly turn off use of a controller,
      even when mounted, via the <code>/etc/libvirt/qemu.conf</code>
      configuration file.
    </p>
    <p>
      The LXC driver is capable of using the <code>cpuset</code>,
      <code>cpu</code>, <code>cpuacct</code>, <code>freezer</code>,
      <code>memory</code>, <code>blkio</code> and <code>devices</code>
      controllers. The <code>cpuacct</code>, <code>devices</code>
      and <code>memory</code> controllers are compulsory. Without
      them mounted, no containers can be started. If any of the
      other controllers are not mounted, the resource management APIs
      which use them will cease to operate.
    </p>
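    <p>
      A quick way to check what the host currently provides is to query the
      kernel and the cgroups mount points directly. This is only an
      illustrative sketch; the exact controller list and mount layout will
      differ between hosts, and the <code>cgroup_controllers</code> setting in
      <code>/etc/libvirt/qemu.conf</code> is the knob typically used to disable
      individual controllers:
    </p>

    <pre>
$ cat /proc/cgroups
$ ls /sys/fs/cgroup
# grep cgroup_controllers /etc/libvirt/qemu.conf
</pre>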
    <h2><a id="currentLayout">Current cgroups layout</a></h2>

    <p>
      As of libvirt 1.0.5 or later, the cgroups layout created by libvirt has been
      simplified, in order to facilitate the setup of resource control policies by
      administrators / management applications. The new layout is based on the concepts
      of "partitions" and "consumers". A "consumer" is a cgroup which holds the
      processes for a single virtual machine or container. A "partition" is a cgroup
      which does not contain any processes, but can have resource controls applied.
      A "partition" will have zero or more child directories, which may be either
      "consumers" or "partitions".
    </p>
    <p>
      As of libvirt 1.1.1 or later, the cgroups layout will have some slight
      differences when running on a host with systemd 205 or later. The overall
      tree structure is the same, but there are some differences in the naming
      conventions for the cgroup directories. Thus the following docs are split
      in two, one describing systemd hosts and the other non-systemd hosts.
    </p>
    <h3><a id="currentLayoutSystemd">Systemd cgroups integration</a></h3>

    <p>
      On hosts which use systemd, each consumer maps to a systemd scope unit,
      while partitions map to a systemd slice unit.
    </p>
    <h4><a id="systemdScope">Systemd scope naming</a></h4>

    <p>
      The systemd convention is for the scope name of virtual machines / containers
      to be of the general format <code>machine-$NAME.scope</code>. Libvirt forms the
      <code>$NAME</code> part of this by concatenating the driver type with the id
      and truncated name of the guest, and then escaping any systemd reserved
      characters.
      So for a guest <code>demo</code> running under the <code>lxc</code> driver,
      we get a <code>$NAME</code> of <code>lxc-12345-demo</code> which when escaped
      is <code>lxc\x2d12345\x2ddemo</code>. So the complete scope name is
      <code>machine-lxc\x2d12345\x2ddemo.scope</code>.
      The scope names map directly to the cgroup directory names.
    </p>
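    <p>
      The escaping shown above follows the standard systemd unit name escaping
      rules, so it can be reproduced with the <code>systemd-escape</code> tool.
      A minimal sketch, reusing the example name from above:
    </p>

    <pre>
$ systemd-escape 'lxc-12345-demo'
lxc\x2d12345\x2ddemo
</pre>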
    <h4><a id="systemdSlice">Systemd slice naming</a></h4>

    <p>
      The systemd convention for slice naming is that a slice should include the
      name of all of its parents prepended on its own name. So for a libvirt
      partition <code>/machine/engineering/testing</code>, the slice name will
      be <code>machine-engineering-testing.slice</code>. Again the slice names
      map directly to the cgroup directory names. Systemd creates three top level
      slices by default, <code>system.slice</code>, <code>user.slice</code> and
      <code>machine.slice</code>. All virtual machines or containers created
      by libvirt will be associated with <code>machine.slice</code> by default.
    </p>
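    <p>
      The mapping from a libvirt partition path to a slice name can likewise be
      previewed with <code>systemd-escape --path</code>. Again this is just an
      illustrative sketch using the example partition from above:
    </p>

    <pre>
$ systemd-escape --path --suffix=slice /machine/engineering/testing
machine-engineering-testing.slice
</pre>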
    <h4><a id="systemdLayout">Systemd cgroup layout</a></h4>

    <p>
      Given this, a possible systemd cgroups layout involving 3 qemu guests,
      3 lxc containers and 3 custom child slices, would be:
    </p>
    <pre>
$ROOT
  |
  +- system.slice
  |   |
  |   +- libvirtd.service
  |
  +- machine.slice
      |
      +- machine-qemu\x2d1\x2dvm1.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2d2\x2dvm2.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2d3\x2dvm3.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-engineering.slice
      |   |
      |   +- machine-engineering-testing.slice
      |   |   |
      |   |   +- machine-lxc\x2d11111\x2dcontainer1.scope
      |   |
      |   +- machine-engineering-production.slice
      |       |
      |       +- machine-lxc\x2d22222\x2dcontainer2.scope
      |
      +- machine-marketing.slice
          |
          +- machine-lxc\x2d33333\x2dcontainer3.scope
</pre>
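    <p>
      On a running systemd host an equivalent view of this tree can be obtained
      with the standard systemd tools. The commands below are just examples;
      the output depends on which guests are actually running:
    </p>

    <pre>
# systemd-cgls machine.slice
# systemctl status 'machine-qemu\x2d1\x2dvm1.scope'
</pre>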
    <h3><a id="currentLayoutGeneric">Non-systemd cgroups layout</a></h3>

    <p>
      On hosts which do not use systemd, each consumer has a corresponding cgroup
      named <code>$VMNAME.libvirt-{qemu,lxc}</code>. Each consumer is associated
      with exactly one partition, which also has a corresponding cgroup usually
      named <code>$PARTNAME.partition</code>. The exceptions to this naming rule
      are the three top level default partitions, named <code>/system</code> (for
      system services), <code>/user</code> (for user login sessions) and
      <code>/machine</code> (for virtual machines and containers). By default
      every consumer will of course be associated with the <code>/machine</code>
      partition.
    </p>
    <p>
      Given this, a possible non-systemd cgroups layout involving 3 qemu guests,
      3 lxc containers and 3 custom child partitions, would be:
    </p>
    <pre>
$ROOT
  |
  +- system
  |   |
  |   +- libvirtd.service
  |
  +- machine
      |
      +- qemu-1-vm1.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- qemu-2-vm2.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- qemu-3-vm3.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- engineering.partition
      |   |
      |   +- testing.partition
      |   |   |
      |   |   +- lxc-11111-container1.libvirt-lxc
      |   |
      |   +- production.partition
      |       |
      |       +- lxc-22222-container2.libvirt-lxc
      |
      +- marketing.partition
          |
          +- lxc-33333-container3.libvirt-lxc
</pre>
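    <p>
      On a non-systemd host the placement of a guest can be confirmed through
      the proc filesystem instead. A rough sketch, where <code>$QEMU_PID</code>
      stands for the process id of the guest's QEMU process and the
      <code>cpu</code> controller is assumed to be mounted at
      <code>/sys/fs/cgroup/cpu</code>:
    </p>

    <pre>
# cat /proc/$QEMU_PID/cgroup
# ls /sys/fs/cgroup/cpu/machine
</pre>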
    <h2><a id="customPartiton">Using custom partitions</a></h2>

    <p>
      If there is a need to apply resource constraints to groups of
      virtual machines or containers, then the single default
      partition <code>/machine</code> may not be sufficiently
      flexible. The administrator may wish to sub-divide the
      default partition, for example into "testing" and "production"
      partitions, and then assign each guest to a specific
      sub-partition. This is achieved via a small element addition
      to the guest domain XML config, just below the main <code>domain</code>
      element:
    </p>
    <pre>
  &lt;resource&gt;
    &lt;partition&gt;/machine/production&lt;/partition&gt;
  &lt;/resource&gt;
</pre>
    <p>
      Note that the partition names in the guest XML are using a
      generic naming format, not the low level naming convention
      required by the underlying host OS. That is, you should not include
      any of the <code>.partition</code> or <code>.slice</code>
      suffixes in the XML config. Given a partition name
      <code>/machine/production</code>, libvirt will automatically
      apply the platform specific translation required to get
      <code>/machine/production.partition</code> (non-systemd)
      or <code>/machine.slice/machine-production.slice</code>
      (systemd) as the underlying cgroup name.
    </p>
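    <p>
      As an illustrative check of this translation, assuming the
      <code>cpu</code> controller is mounted at <code>/sys/fs/cgroup/cpu</code>
      and the partition has already been created:
    </p>

    <pre>
# systemd host
$ ls -d /sys/fs/cgroup/cpu/machine.slice/machine-production.slice

# non-systemd host
$ ls -d /sys/fs/cgroup/cpu/machine/production.partition
</pre>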
    <p>
      Libvirt will not auto-create the cgroups directory to back
      this partition. In the future, libvirt / virsh will provide
      APIs / commands to create custom partitions, but currently
      this is left as an exercise for the administrator.
    </p>
    <p>
      <strong>Note:</strong> the ability to place guests in custom
      partitions is only available with libvirt &gt;= 1.0.5, using
      the new cgroup layout. The legacy cgroups layout described
      later in this document did not support customization per guest.
    </p>
    <h3><a id="createSystemd">Creating custom partitions (systemd)</a></h3>

    <p>
      Given the XML config above, the admin on a systemd based host would
      need to create a unit file <code>/etc/systemd/system/machine-production.slice</code>:
    </p>
    <pre>
# cat &gt; /etc/systemd/system/machine-production.slice &lt;&lt;EOF
[Unit]
Description=VM production slice
Before=slices.target
Wants=machine.slice
EOF
# systemctl start machine-production.slice
</pre>
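    <p>
      The resulting slice can then be checked with systemctl; for example:
    </p>

    <pre>
# systemctl status machine-production.slice
</pre>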
    <h3><a id="createNonSystemd">Creating custom partitions (non-systemd)</a></h3>

    <p>
      Given the XML config above, the admin on a non-systemd based host
      would need to create a cgroup named <code>/machine/production.partition</code>:
    </p>
    <pre>
# cd /sys/fs/cgroup
# for i in blkio cpu,cpuacct cpuset devices freezer memory net_cls perf_event
do
  mkdir $i/machine/production.partition
done
# for i in cpuset.cpus cpuset.mems
do
  cat cpuset/machine/$i &gt; cpuset/machine/production.partition/$i
done
</pre>
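    <p>
      A simple way to confirm the partition directory now exists under every
      mounted controller (illustrative; adjust to the controllers actually
      present on the host):
    </p>

    <pre>
# ls -d /sys/fs/cgroup/*/machine/production.partition
</pre>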
    <h2><a id="resourceAPIs">Resource management APIs/commands</a></h2>

    <p>
      Since libvirt aims to provide an API which is portable across
      hypervisors, the concept of cgroups is not exposed directly
      in the API or XML configuration. It is considered to be an
      internal implementation detail. Instead libvirt provides a
      set of APIs for applying resource controls, which are then
      mapped to corresponding cgroup tunables.
    </p>
    <h3>Scheduler tuning</h3>

    <p>
      Parameters from the "cpu" controller are exposed via the
      <code>schedinfo</code> command in virsh.
    </p>
    <pre>
# virsh schedinfo demo
Scheduler      : posix
cpu_shares     : 1024
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1
</pre>
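    <p>
      The same command can also change these tunables. The value below is
      purely illustrative, and takes effect through the <code>cpu</code>
      controller, so that controller must be mounted:
    </p>

    <pre>
# virsh schedinfo demo --set cpu_shares=2048
</pre>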
    <h3>Block I/O tuning</h3>

    <p>
      Parameters from the "blkio" controller are exposed via the
      <code>blkiotune</code> command in virsh.
    </p>
    <pre>
# virsh blkiotune demo
weight         : 500
device_weight  :
</pre>
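    <p>
      The weight can be updated with the same command. The value is only an
      example and must fall within the range accepted by the host I/O
      scheduler:
    </p>

    <pre>
# virsh blkiotune demo --weight 600
</pre>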
    <h3>Memory tuning</h3>

    <p>
      Parameters from the "memory" controller are exposed via the
      <code>memtune</code> command in virsh.
    </p>
    <pre>
# virsh memtune demo
hard_limit     : 580192
soft_limit     : unlimited
swap_hard_limit: unlimited
</pre>
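    <p>
      Limits can be changed with the same command. Values are interpreted in
      kibibytes by default; the figure below is only an example:
    </p>

    <pre>
# virsh memtune demo --soft-limit 1048576
</pre>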
    <h3>Network tuning</h3>

    <p>
      The <code>net_cls</code> controller is not currently used. Instead traffic
      filter policies are set directly against individual virtual
      network interfaces.
    </p>
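    <p>
      For completeness, per-interface bandwidth limits can be applied with the
      <code>domiftune</code> command; the interface name <code>vnet0</code> and
      the average/peak/burst values below are purely illustrative:
    </p>

    <pre>
# virsh domiftune demo vnet0 --inbound 1000,2000,3000
</pre>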
    <h2><a id="legacyLayout">Legacy cgroups layout</a></h2>

    <p>
      Prior to libvirt 1.0.5, the cgroups layout created by libvirt was different
      from that described above, and did not allow for administrator customization.
      Libvirt used a fixed, 3-level hierarchy <code>libvirt/{qemu,lxc}/$VMNAME</code>
      which was rooted at the point in the hierarchy where libvirtd itself was
      located. So if libvirtd was placed at <code>/system/libvirtd.service</code>
      by systemd, the groups for each virtual machine / container would be located
      at <code>/system/libvirtd.service/libvirt/{qemu,lxc}/$VMNAME</code>. In addition
      to this, the QEMU driver created further child groups for each vCPU thread and the
      emulator thread(s). This led to a hierarchy that looked like:
    </p>
    <pre>
$ROOT
  |
  +- system
      |
      +- libvirtd.service
          |
          +- libvirt
              |
              +- qemu
              |   |
              |   +- vm1
              |   |   |
              |   |   +- emulator
              |   |   +- vcpu0
              |   |   +- vcpu1
              |   |
              |   +- vm2
              |   |   |
              |   |   +- emulator
              |   |   +- vcpu0
              |   |   +- vcpu1
              |   |
              |   +- vm3
              |       |
              |       +- emulator
              |       +- vcpu0
              |       +- vcpu1
              |
              +- lxc
                  |
                  +- container1
                  |
                  +- container2
                  |
                  +- container3
</pre>
    <p>
      Although current releases are much improved, historically the use of deep
      hierarchies has had a significant negative impact on kernel scalability.
      The legacy libvirt cgroups layout highlighted these problems, to the detriment
      of the performance of virtual machines and containers.
    </p>
  </body>
</html>