docs/devel/multi-thread-tcg.txt

   1 Copyright (c) 2015-2016 Linaro Ltd.
   2
   3 This work is licensed under the terms of the GNU GPL, version 2 or
   4 later. See the COPYING file in the top-level directory.
   5
   6 Introduction
   7 ============
   8
   9 This document outlines the design for multi-threaded TCG system-mode
  10 emulation. The current user-mode emulation mirrors the thread
  11 structure of the translated executable. Some of the work will be
  12 applicable to both system and linux-user emulation.
  13
  14 The original system-mode TCG implementation was single threaded and
  15 dealt with multiple CPUs with simple round-robin scheduling. This
  16 simplified a lot of things but became increasingly limited as systems
  17 being emulated gained additional cores and per-core performance gains
  18 for host systems started to level off.
  19
  20 vCPU Scheduling
  21 ===============
  22
  23 We introduce a new running mode where each vCPU will run on its own
  24 user-space thread. This will be enabled by default for all FE/BE
  25 combinations that have had the required work done to support this
  26 safely.
  27
  28 In the general case of running translated code there should be no
  29 inter-vCPU dependencies and all vCPUs should be able to run at full
  30 speed. Synchronisation will only be required while accessing internal
  31 shared data structures or when the emulated architecture requires a
  32 coherent representation of the emulated machine state.
  33
  34 Shared Data Structures
  35 ======================
  36
  37 Main Run Loop
  38 -------------
  39
  40 Even when there is no code being generated there are a number of
  41 structures associated with the hot-path through the main run-loop.
  42 These are associated with looking up the next translation block to
  43 execute. These include:
  44
  45     tb_jmp_cache (per-vCPU, cache of recent jumps)
  46     tb_ctx.htable (global hash table, phys address->tb lookup)
  47
  48 As TB linking only occurs when blocks are in the same page this code
  49 is critical to performance as looking up the next TB to execute is the
  50 most common reason to exit the generated code.
  51
  52 DESIGN REQUIREMENT: Make access to lookup structures safe with
  53 multiple reader/writer threads. Minimise any lock contention to do it.
  54
  55 The hot-path avoids using locks where possible. The tb_jmp_cache is
  56 updated with atomic accesses to ensure consistent results. The fall
  57 back QHT based hash table is also designed for lockless lookups. Locks
  58 are only taken when code generation is required or TranslationBlocks
  59 have their block-to-block jumps patched.
  60
  61 Global TCG State
  62 ----------------
  63
  64 ### User-mode emulation
  65 We need to protect the entire code generation cycle including any post
  66 generation patching of the translated code. This also implies a shared
  67 translation buffer which contains code running on all cores. Any
  68 execution path that comes to the main run loop will need to hold a
  69 mutex for code generation. This also includes times when we need flush
  70 code or entries from any shared lookups/caches. Structures held on a
  71 per-vCPU basis won't need locking unless other vCPUs will need to
  72 modify them.
  73
  74 DESIGN REQUIREMENT: Add locking around all code generation and TB
  75 patching.
  76
  77 (Current solution)
  78
  79 Code generation is serialised with mmap_lock().
  80
  81 ### !User-mode emulation
  82 Each vCPU has its own TCG context and associated TCG region, thereby
  83 requiring no locking.
  84
  85 Translation Blocks
  86 ------------------
  87
  88 Currently the whole system shares a single code generation buffer
  89 which when full will force a flush of all translations and start from
  90 scratch again. Some operations also force a full flush of translations
  91 including:
  92
  93   - debugging operations (breakpoint insertion/removal)
  94   - some CPU helper functions
  95
  96 This is done with the async_safe_run_on_cpu() mechanism to ensure all
  97 vCPUs are quiescent when changes are being made to shared global
  98 structures.
  99
 100 More granular translation invalidation events are typically due
 101 to a change of the state of a physical page:
 102
 103   - code modification (self modify code, patching code)
 104   - page changes (new page mapping in linux-user mode)
 105
 106 While setting the invalid flag in a TranslationBlock will stop it
 107 being used when looked up in the hot-path there are a number of other
 108 book-keeping structures that need to be safely cleared.
 109
 110 Any TranslationBlocks which have been patched to jump directly to the
 111 now invalid blocks need the jump patches reversing so they will return
 112 to the C code.
 113
 114 There are a number of look-up caches that need to be properly updated
 115 including the:
 116
 117   - jump lookup cache
 118   - the physical-to-tb lookup hash table
 119   - the global page table
 120
 121 The global page table (l1_map) which provides a multi-level look-up
 122 for PageDesc structures which contain pointers to the start of a
 123 linked list of all Translation Blocks in that page (see page_next).
 124
 125 Both the jump patching and the page cache involve linked lists that
 126 the invalidated TranslationBlock needs to be removed from.
 127
 128 DESIGN REQUIREMENT: Safely handle invalidation of TBs
 129                       - safely patch/revert direct jumps
 130                       - remove central PageDesc lookup entries
 131                       - ensure lookup caches/hashes are safely updated
 132
 133 (Current solution)
 134
 135 The direct jump themselves are updated atomically by the TCG
 136 tb_set_jmp_target() code. Modification to the linked lists that allow
 137 searching for linked pages are done under the protection of tb->jmp_lock,
 138 where tb is the destination block of a jump. Each origin block keeps a
 139 pointer to its destinations so that the appropriate lock can be acquired before
 140 iterating over a jump list.
 141
 142 The global page table is a lockless radix tree; cmpxchg is used
 143 to atomically insert new elements.
 144
 145 The lookup caches are updated atomically and the lookup hash uses QHT
 146 which is designed for concurrent safe lookup.
 147
 148 Parallel code generation is supported. QHT is used at insertion time
 149 as the synchronization point across threads, thereby ensuring that we only
 150 keep track of a single TranslationBlock for each guest code block.
 151
 152 Memory maps and TLBs
 153 --------------------
 154
 155 The memory handling code is fairly critical to the speed of memory
 156 access in the emulated system. The SoftMMU code is designed so the
 157 hot-path can be handled entirely within translated code. This is
 158 handled with a per-vCPU TLB structure which once populated will allow
 159 a series of accesses to the page to occur without exiting the
 160 translated code. It is possible to set flags in the TLB address which
 161 will ensure the slow-path is taken for each access. This can be done
 162 to support:
 163
 164   - Memory regions (dividing up access to PIO, MMIO and RAM)
 165   - Dirty page tracking (for code gen, SMC detection, migration and display)
 166   - Virtual TLB (for translating guest address->real address)
 167
 168 When the TLB tables are updated by a vCPU thread other than their own
 169 we need to ensure it is done in a safe way so no inconsistent state is
 170 seen by the vCPU thread.
 171
 172 Some operations require updating a number of vCPUs TLBs at the same
 173 time in a synchronised manner.
 174
 175 DESIGN REQUIREMENTS:
 176
 177   - TLB Flush All/Page
 178     - can be across-vCPUs
 179     - cross vCPU TLB flush may need other vCPU brought to halt
 180     - change may need to be visible to the calling vCPU immediately
 181   - TLB Flag Update
 182     - usually cross-vCPU
 183     - want change to be visible as soon as possible
 184   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
 185     - This is a per-vCPU table - by definition can't race
 186     - updated by its own thread when the slow-path is forced
 187
 188 (Current solution)
 189
 190 We have updated cputlb.c to defer operations when a cross-vCPU
 191 operation with async_run_on_cpu() which ensures each vCPU sees a
 192 coherent state when it next runs its work (in a few instructions
 193 time).
 194
 195 A new set up operations (tlb_flush_*_all_cpus) take an additional flag
 196 which when set will force synchronisation by setting the source vCPUs
 197 work as "safe work" and exiting the cpu run loop. This ensure by the
 198 time execution restarts all flush operations have completed.
 199
 200 TLB flag updates are all done atomically and are also protected by the
 201 corresponding page lock.
 202
 203 (Known limitation)
 204
 205 Not really a limitation but the wait mechanism is overly strict for
 206 some architectures which only need flushes completed by a barrier
 207 instruction. This could be a future optimisation.
 208
 209 Emulated hardware state
 210 -----------------------
 211
 212 Currently thanks to KVM work any access to IO memory is automatically
 213 protected by the global iothread mutex, also known as the BQL (Big
 214 Qemu Lock). Any IO region that doesn't use global mutex is expected to
 215 do its own locking.
 216
 217 However IO memory isn't the only way emulated hardware state can be
 218 modified. Some architectures have model specific registers that
 219 trigger hardware emulation features. Generally any translation helper
 220 that needs to update more than a single vCPUs of state should take the
 221 BQL.
 222
 223 As the BQL, or global iothread mutex is shared across the system we
 224 push the use of the lock as far down into the TCG code as possible to
 225 minimise contention.
 226
 227 (Current solution)
 228
 229 MMIO access automatically serialises hardware emulation by way of the
 230 BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
 231 and also defer the reset/startup of vCPUs to the vCPU context by way
 232 of async_run_on_cpu().
 233
 234 Updates to interrupt state are also protected by the BQL as they can
 235 often be cross vCPU.
 236
 237 Memory Consistency
 238 ==================
 239
 240 Between emulated guests and host systems there are a range of memory
 241 consistency models. Even emulating weakly ordered systems on strongly
 242 ordered hosts needs to ensure things like store-after-load re-ordering
 243 can be prevented when the guest wants to.
 244
 245 Memory Barriers
 246 ---------------
 247
 248 Barriers (sometimes known as fences) provide a mechanism for software
 249 to enforce a particular ordering of memory operations from the point
 250 of view of external observers (e.g. another processor core). They can
 251 apply to any memory operations as well as just loads or stores.
 252
 253 The Linux kernel has an excellent write-up on the various forms of
 254 memory barrier and the guarantees they can provide [1].
 255
 256 Barriers are often wrapped around synchronisation primitives to
 257 provide explicit memory ordering semantics. However they can be used
 258 by themselves to provide safe lockless access by ensuring for example
 259 a change to a signal flag will only be visible once the changes to
 260 payload are.
 261
 262 DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
 263
 264 This would enforce a strong load/store ordering so all loads/stores
 265 complete at the memory barrier. On single-core non-SMP strongly
 266 ordered backends this could become a NOP.
 267
 268 Aside from explicit standalone memory barrier instructions there are
 269 also implicit memory ordering semantics which comes with each guest
 270 memory access instruction. For example all x86 load/stores come with
 271 fairly strong guarantees of sequential consistency whereas Arm has
 272 special variants of load/store instructions that imply acquire/release
 273 semantics.
 274
 275 In the case of a strongly ordered guest architecture being emulated on
 276 a weakly ordered host the scope for a heavy performance impact is
 277 quite high.
 278
 279 DESIGN REQUIREMENTS: Be efficient with use of memory barriers
 280        - host systems with stronger implied guarantees can skip some barriers
 281        - merge consecutive barriers to the strongest one
 282
 283 (Current solution)
 284
 285 The system currently has a tcg_gen_mb() which will add memory barrier
 286 operations if code generation is being done in a parallel context. The
 287 tcg_optimize() function attempts to merge barriers up to their
 288 strongest form before any load/store operations. The solution was
 289 originally developed and tested for linux-user based systems. All
 290 backends have been converted to emit fences when required. So far the
 291 following front-ends have been updated to emit fences when required:
 292
 293     - target-i386
 294     - target-arm
 295     - target-aarch64
 296     - target-alpha
 297     - target-mips
 298
 299 Memory Control and Maintenance
 300 ------------------------------
 301
 302 This includes a class of instructions for controlling system cache
 303 behaviour. While QEMU doesn't model cache behaviour these instructions
 304 are often seen when code modification has taken place to ensure the
 305 changes take effect.
 306
 307 Synchronisation Primitives
 308 --------------------------
 309
 310 There are two broad types of synchronisation primitives found in
 311 modern ISAs: atomic instructions and exclusive regions.
 312
 313 The first type offer a simple atomic instruction which will guarantee
 314 some sort of test and conditional store will be truly atomic w.r.t.
 315 other cores sharing access to the memory. The classic example is the
 316 x86 cmpxchg instruction.
 317
 318 The second type offer a pair of load/store instructions which offer a
 319 guarantee that a region of memory has not been touched between the
 320 load and store instructions. An example of this is Arm's ldrex/strex
 321 pair where the strex instruction will return a flag indicating a
 322 successful store only if no other CPU has accessed the memory region
 323 since the ldrex.
 324
 325 Traditionally TCG has generated a series of operations that work
 326 because they are within the context of a single translation block so
 327 will have completed before another CPU is scheduled. However with
 328 the ability to have multiple threads running to emulate multiple CPUs
 329 we will need to explicitly expose these semantics.
 330
 331 DESIGN REQUIREMENTS:
 332   - Support classic atomic instructions
 333   - Support load/store exclusive (or load link/store conditional) pairs
 334   - Generic enough infrastructure to support all guest architectures
 335 CURRENT OPEN QUESTIONS:
 336   - How problematic is the ABA problem in general?
 337
 338 (Current solution)
 339
 340 The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
 341 can be used directly or combined to emulate other instructions like
 342 Arm's ldrex/strex instructions. While they are susceptible to the ABA
 343 problem so far common guests have not implemented patterns where
 344 this may be a problem - typically presenting a locking ABI which
 345 assumes cmpxchg like semantics.
 346
 347 The code also includes a fall-back for cases where multi-threaded TCG
 348 ops can't work (e.g. guest atomic width > host atomic width). In this
 349 case an EXCP_ATOMIC exit occurs and the instruction is emulated with
 350 an exclusive lock which ensures all emulation is serialised.
 351
 352 While the atomic helpers look good enough for now there may be a need
 353 to look at solutions that can more closely model the guest
 354 architectures semantics.
 355
 356 ==========
 357
 358 [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt