Copyright (c) 2015-2016 Linaro Ltd.

This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.

This document outlines the design for multi-threaded TCG system-mode
emulation. The current user-mode emulation mirrors the thread
structure of the translated executable. Some of the work will be
applicable to both system and linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as systems
being emulated gained additional cores and per-core performance gains
for host systems started to level off.

We introduce a new running mode where each vCPU will run on its own
user-space thread. This will be enabled by default for all front-end
and back-end (FE/BE) combinations that have had the required work
done to support it.

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.

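The lookup order described above can be sketched as follows. This is
a minimal illustration, not QEMU's code: every name (DemoTB,
demo_tb_find() and so on) is hypothetical, a small open-addressed
array stands in for QHT, and "code generation" is reduced to
allocating a record under a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_JMP_CACHE_SIZE 256   /* hypothetical sizes */
#define DEMO_TABLE_SIZE     1024
#define DEMO_POOL_SIZE      512   /* < table size, so a free slot always exists */

/* Hypothetical stand-in for a TranslationBlock. */
typedef struct DemoTB {
    uint64_t pc;
} DemoTB;

/* Per-vCPU cache: only the owning thread writes it, but accesses are atomic. */
typedef struct DemoVCPU {
    _Atomic(DemoTB *) jmp_cache[DEMO_JMP_CACHE_SIZE];
} DemoVCPU;

static _Atomic(DemoTB *) demo_table[DEMO_TABLE_SIZE]; /* shared, lockless reads */
static pthread_mutex_t demo_gen_lock = PTHREAD_MUTEX_INITIALIZER;
static DemoTB demo_pool[DEMO_POOL_SIZE];
static size_t demo_pool_used;

/* Lockless lookup in the shared table (linear probing, no removal). */
static DemoTB *demo_table_lookup(uint64_t pc)
{
    for (size_t i = 0; i < DEMO_TABLE_SIZE; i++) {
        size_t slot = (pc + i) % DEMO_TABLE_SIZE;
        DemoTB *tb = atomic_load_explicit(&demo_table[slot],
                                          memory_order_acquire);
        if (tb == NULL) {
            return NULL;
        }
        if (tb->pc == pc) {
            return tb;
        }
    }
    return NULL;
}

/* Slow path: "generate" a block while holding the generation lock. */
static DemoTB *demo_tb_gen(uint64_t pc)
{
    pthread_mutex_lock(&demo_gen_lock);
    DemoTB *tb = demo_table_lookup(pc); /* another thread may have won */
    if (tb == NULL) {
        assert(demo_pool_used < DEMO_POOL_SIZE); /* a real system would flush */
        tb = &demo_pool[demo_pool_used++];
        tb->pc = pc;
        for (size_t i = 0; i < DEMO_TABLE_SIZE; i++) {
            size_t slot = (pc + i) % DEMO_TABLE_SIZE;
            if (atomic_load_explicit(&demo_table[slot],
                                     memory_order_relaxed) == NULL) {
                /* Release store publishes the fully initialised block. */
                atomic_store_explicit(&demo_table[slot], tb,
                                      memory_order_release);
                break;
            }
        }
    }
    pthread_mutex_unlock(&demo_gen_lock);
    return tb;
}

/* Hot path: per-vCPU cache, then shared table, then generation. */
DemoTB *demo_tb_find(DemoVCPU *cpu, uint64_t pc)
{
    size_t idx = pc % DEMO_JMP_CACHE_SIZE;
    DemoTB *tb = atomic_load_explicit(&cpu->jmp_cache[idx],
                                      memory_order_acquire);
    if (tb != NULL && tb->pc == pc) {
        return tb;
    }
    tb = demo_table_lookup(pc);
    if (tb == NULL) {
        tb = demo_tb_gen(pc);
    }
    atomic_store_explicit(&cpu->jmp_cache[idx], tb, memory_order_release);
    return tb;
}
```

Only demo_tb_gen() takes a lock; a hit in either cache level returns
without any mutex traffic, which is the property the design
requirement asks for.
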
We need to protect the entire code generation cycle including any post
generation patching of the translated code. This also implies a shared
translation buffer which contains code running on all cores. Any
execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held on a
per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

Mainly as part of the linux-user work all code generation is
serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
place of mmap_lock() in linux-user.

Currently the whole system shares a single code generation buffer
which when full will force a flush of all translations and start from
scratch again. Some operations also force a full flush of translations
including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the main run loop.

There are a number of look-up caches that need to be properly updated,
including:

  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a
linked list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of the
tb_lock().

The global page table is protected by the tb_lock() in system-mode and
mmap_lock() in linux-user mode.

The lookup caches are updated atomically and the lookup hash uses QHT
which is designed for concurrent safe lookup.

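As an illustration of the invalidation steps above, here is a minimal
sketch in which each block keeps a list of the jumps that were patched
to point at it. The types and functions are hypothetical, a NULL jump
target is assumed to mean "return to the run loop", and a single mutex
plays the role of the lock protecting the lists. Removing the block
from the page list and hash table is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_INCOMING 8   /* hypothetical fixed bound for the sketch */

typedef struct DemoTB DemoTB;

struct DemoTB {
    atomic_bool invalid;               /* checked by lookups */
    _Atomic(DemoTB *) jmp_target[2];   /* NULL means "exit to run loop" */
    /* Jumps patched to point at this block; protected by demo_tb_lock. */
    struct {
        DemoTB *src;
        int slot;
    } incoming[DEMO_MAX_INCOMING];
    int n_incoming;
};

static pthread_mutex_t demo_tb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Patch src's jump slot to go straight to dst. Returns false if refused. */
bool demo_tb_link(DemoTB *src, int slot, DemoTB *dst)
{
    bool linked = false;

    assert(slot == 0 || slot == 1);
    pthread_mutex_lock(&demo_tb_lock);
    if (!atomic_load(&dst->invalid) &&
        dst->n_incoming < DEMO_MAX_INCOMING &&
        atomic_load(&src->jmp_target[slot]) == NULL) {
        dst->incoming[dst->n_incoming].src = src;
        dst->incoming[dst->n_incoming].slot = slot;
        dst->n_incoming++;
        /* The jump itself is a single atomic store. */
        atomic_store(&src->jmp_target[slot], dst);
        linked = true;
    }
    pthread_mutex_unlock(&demo_tb_lock);
    return linked;
}

/* Mark tb invalid and revert every jump that was patched to reach it. */
void demo_tb_invalidate(DemoTB *tb)
{
    pthread_mutex_lock(&demo_tb_lock);
    atomic_store(&tb->invalid, true);
    for (int i = 0; i < tb->n_incoming; i++) {
        DemoTB *src = tb->incoming[i].src;
        atomic_store(&src->jmp_target[tb->incoming[i].slot], NULL);
    }
    tb->n_incoming = 0;
    pthread_mutex_unlock(&demo_tb_lock);
}
```

A vCPU already executing inside the invalidated block is not stopped
by this; the sketch only guarantees that no new direct jump or lookup
will reach it.
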
The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which once populated will allow
a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a vCPU thread other than their own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

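The idea of forcing the slow-path by setting flags in the TLB address
can be sketched like this. The layout is an assumption made for the
example (a 4K page and one hypothetical flag bit, DEMO_TLB_SLOW); the
point is only that the fast-path compare fails once any low bit is set,
and that another thread can set such a bit with one atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_BITS 12                      /* assumed page size: 4K */
#define DEMO_PAGE_MASK (~((UINT64_C(1) << DEMO_PAGE_BITS) - 1))
#define DEMO_TLB_SLOW  UINT64_C(1)             /* hypothetical flag bit */

/* Hypothetical TLB entry: page address with flag bits in the low bits. */
typedef struct DemoTLBEntry {
    _Atomic uint64_t addr_write;
} DemoTLBEntry;

/* Fill an entry for a page-aligned guest address (owning vCPU only). */
void demo_tlb_set(DemoTLBEntry *e, uint64_t page)
{
    assert((page & ~DEMO_PAGE_MASK) == 0);
    atomic_store_explicit(&e->addr_write, page, memory_order_relaxed);
}

/*
 * Fast-path check: compare the page part of the address with the entry.
 * Any flag bit set in the entry makes the comparison fail, so the access
 * falls through to the slow path without a separate flag test.
 */
bool demo_tlb_fast_hit(DemoTLBEntry *e, uint64_t vaddr)
{
    uint64_t tag = atomic_load_explicit(&e->addr_write,
                                        memory_order_relaxed);
    return (vaddr & DEMO_PAGE_MASK) == tag;
}

/* May be called from another thread: one atomic read-modify-write. */
void demo_tlb_force_slow_path(DemoTLBEntry *e)
{
    atomic_fetch_or_explicit(&e->addr_write, DEMO_TLB_SLOW,
                             memory_order_relaxed);
}
```

Because the flag update is a single atomic OR, the owning vCPU sees
either the old entry or the flagged entry, never a torn value.
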
Some operations require updating the TLBs of a number of vCPUs at the
same time in a synchronised manner.

    - can be across-vCPUs
    - cross vCPU TLB flush may need other vCPU brought to halt
    - change may need to be visible to the calling vCPU immediately
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

We have updated cputlb.c to defer cross-vCPU operations with
async_run_on_cpu(), which ensures each vCPU sees a coherent state
when it next runs its work (in a few instructions time).

A new set of operations (tlb_flush_*_all_cpus) take an additional flag
which when set will force synchronisation by setting the source vCPU's
work as "safe work" and exiting the cpu run loop. This ensures that by
the time execution restarts all flush operations have completed.

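The deferral described above can be pictured with a small per-vCPU
work queue. This is a sketch under assumptions, not the cputlb.c
implementation: DemoCPU and the demo_* functions are hypothetical, the
queue is a fixed-size array, and the step that kicks the target vCPU
out of translated code is omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TLB_SIZE 8   /* hypothetical sizes */
#define DEMO_MAX_WORK 4

typedef struct DemoCPU DemoCPU;
typedef void (*demo_work_fn)(DemoCPU *cpu);

struct DemoCPU {
    unsigned long tlb[DEMO_TLB_SIZE];   /* touched only by the owning thread */
    pthread_mutex_t work_lock;          /* protects work[] and n_work */
    demo_work_fn work[DEMO_MAX_WORK];
    size_t n_work;
};

void demo_cpu_init(DemoCPU *cpu)
{
    memset(cpu->tlb, 0, sizeof(cpu->tlb));
    cpu->n_work = 0;
    pthread_mutex_init(&cpu->work_lock, NULL);
}

/* The actual flush: always runs on the thread that owns the TLB. */
static void demo_do_tlb_flush(DemoCPU *cpu)
{
    memset(cpu->tlb, 0, sizeof(cpu->tlb));
}

/* Any thread: queue work for another vCPU instead of touching its TLB. */
void demo_queue_work(DemoCPU *cpu, demo_work_fn fn)
{
    pthread_mutex_lock(&cpu->work_lock);
    assert(cpu->n_work < DEMO_MAX_WORK);
    cpu->work[cpu->n_work++] = fn;
    pthread_mutex_unlock(&cpu->work_lock);
}

void demo_tlb_flush_other(DemoCPU *target)
{
    demo_queue_work(target, demo_do_tlb_flush);
}

/* Owning thread: drain the queue between translation blocks. */
void demo_process_queued_work(DemoCPU *cpu)
{
    demo_work_fn pending[DEMO_MAX_WORK];
    size_t n;

    pthread_mutex_lock(&cpu->work_lock);
    n = cpu->n_work;
    memcpy(pending, cpu->work, n * sizeof(pending[0]));
    cpu->n_work = 0;
    pthread_mutex_unlock(&cpu->work_lock);

    for (size_t i = 0; i < n; i++) {
        pending[i](cpu);   /* run outside the lock */
    }
}
```

The requesting thread never writes the target's TLB; the flush takes
effect when the target next drains its queue, which is why a caller
that needs the flush to have completed has to wait for that point.
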
TLB flag updates are all done atomically and are also protected by the
tb_lock() which is used by the functions that update the TLB in bulk.

Not really a limitation but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently thanks to KVM work any access to IO memory is automatically
protected by the global iothread mutex, also known as the BQL (Big
QEMU Lock). Any IO region that doesn't use the global mutex is
expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent write-up on the various forms of
memory barrier and the guarantees they can provide [1].

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring for example
a change to a signal flag will only be visible once the changes to
the payload are.

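The signal-flag pattern mentioned above looks roughly like this in
host C11 code. It is a generic illustration using standard fences and
POSIX threads, with hypothetical names; it shows the ordering guarantee
a guest expects from its barriers, not how TCG emits them.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int demo_payload;        /* plain data guarded by the flag */
static atomic_int demo_flag;    /* 0 = not ready, 1 = payload published */

static void *demo_producer(void *arg)
{
    demo_payload = *(int *)arg;
    /* Order the payload write before the flag write. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&demo_flag, 1, memory_order_relaxed);
    return NULL;
}

/* Start a producer thread, wait for its flag, return the payload seen. */
int demo_message_pass(int value)
{
    pthread_t thread;
    int seen;
    int rc;

    demo_payload = 0;
    atomic_store(&demo_flag, 0);

    rc = pthread_create(&thread, NULL, demo_producer, &value);
    assert(rc == 0);
    (void)rc;

    while (atomic_load_explicit(&demo_flag, memory_order_relaxed) == 0) {
        /* spin until the flag becomes visible */
    }
    /* Order the flag read before the payload read. */
    atomic_thread_fence(memory_order_acquire);
    seen = demo_payload;

    pthread_join(thread, NULL);
    return seen;
}
```

Without the two fences the consumer could legally observe the flag
set while still reading a stale payload on a weakly ordered host.
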
DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas ARM has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

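Both requirements can be illustrated with a toy pass over an op
stream. The representation is an assumption made for this sketch: a
barrier carries a bitmask of the orderings it enforces (the DEMO_MO_*
names are hypothetical), merging is a bitwise OR, and the orderings a
host already guarantees are masked off. It is not the tcg_optimize()
code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ordering bits: "earlier X must complete before later Y". */
enum {
    DEMO_MO_LD_LD = 1 << 0,
    DEMO_MO_ST_LD = 1 << 1,
    DEMO_MO_LD_ST = 1 << 2,
    DEMO_MO_ST_ST = 1 << 3
};

typedef enum { DEMO_OP_LD, DEMO_OP_ST, DEMO_OP_MB } DemoOpKind;

typedef struct DemoOp {
    DemoOpKind kind;
    unsigned mo;        /* ordering bits, only meaningful for DEMO_OP_MB */
} DemoOp;

/*
 * Rewrite ops[0..n) in place and return the new length.
 *  - orderings in host_implied are removed from every barrier
 *  - a barrier left with no orderings is dropped
 *  - back-to-back barriers collapse into one carrying the union
 */
size_t demo_merge_barriers(DemoOp *ops, size_t n, unsigned host_implied)
{
    size_t out = 0;

    assert(ops != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        DemoOp op = ops[i];

        if (op.kind == DEMO_OP_MB) {
            op.mo &= ~host_implied;
            if (op.mo == 0) {
                continue;                   /* host gives this for free */
            }
            if (out > 0 && ops[out - 1].kind == DEMO_OP_MB) {
                ops[out - 1].mo |= op.mo;   /* merge to the stronger one */
                continue;
            }
        }
        ops[out++] = op;
    }
    return out;
}
```

A load or store between two barriers keeps them apart, since merging
across a memory access would change which accesses are ordered.
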
The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place to ensure the

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
some sort of test and conditional store will be truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is ARM's ldrex/strex
pair where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
ARM's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically presenting a locking ABI which
assumes cmpxchg-like semantics.

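To make the ABA exposure concrete, here is a minimal sketch of a
load/store exclusive pair emulated with a compare-and-swap. The
structure and function names are hypothetical and the code uses host
C11 atomics directly instead of TCG ops; it only shows the general
technique of remembering the loaded value and using it as the expected
value of the later cmpxchg.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU monitor state recorded by the exclusive load. */
typedef struct DemoCPU {
    _Atomic uint32_t *excl_addr;
    uint32_t excl_val;
} DemoCPU;

/* Emulated load-exclusive: remember where we loaded from and what we saw. */
uint32_t demo_load_exclusive(DemoCPU *cpu, _Atomic uint32_t *addr)
{
    assert(addr != NULL);
    cpu->excl_addr = addr;
    cpu->excl_val = atomic_load(addr);
    return cpu->excl_val;
}

/*
 * Emulated store-exclusive: succeeds only if memory still holds the value
 * seen by the exclusive load. Returns 0 on success and 1 on failure,
 * following the strex convention.
 */
int demo_store_exclusive(DemoCPU *cpu, _Atomic uint32_t *addr,
                         uint32_t newval)
{
    uint32_t expected = cpu->excl_val;
    bool ok = cpu->excl_addr == addr &&
              atomic_compare_exchange_strong(addr, &expected, newval);

    cpu->excl_addr = NULL;      /* the monitor is cleared either way */
    return ok ? 0 : 1;
}
```

If another thread changes the location from A to B and back to A
between the two calls, the store still succeeds, whereas a real
exclusive monitor would have failed it. That difference is the ABA
problem referred to above.
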
The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.

[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt