docs/devel/multi-thread-tcg.rst

   1 ..
   2   Copyright (c) 2015-2020 Linaro Ltd.
   3
   4   This work is licensed under the terms of the GNU GPL, version 2 or
   5   later. See the COPYING file in the top-level directory.
   6
   7 Introduction
   8 ============
   9
  10 This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
  11 system-mode emulation. user-mode emulation has always mirrored the
  12 thread structure of the translated executable although some of the
  13 changes done for MTTCG system emulation have improved the stability of
  14 linux-user emulation.
  15
  16 The original system-mode TCG implementation was single threaded and
  17 dealt with multiple CPUs with simple round-robin scheduling. This
  18 simplified a lot of things but became increasingly limited as systems
  19 being emulated gained additional cores and per-core performance gains
  20 for host systems started to level off.
  21
  22 vCPU Scheduling
  23 ===============
  24
  25 We introduce a new running mode where each vCPU will run on its own
  26 user-space thread. This is enabled by default for all FE/BE
  27 combinations where the host memory model is able to accommodate the
  28 guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
  29 guest has had the required work done to support this safely
  30 (TARGET_SUPPORTS_MTTCG).
  31
  32 System emulation will fall back to the original round robin approach
  33 if:
  34
  35 * forced by --accel tcg,thread=single
  36 * enabling --icount mode
  37 * 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)
  38
  39 In the general case of running translated code there should be no
  40 inter-vCPU dependencies and all vCPUs should be able to run at full
  41 speed. Synchronisation will only be required while accessing internal
  42 shared data structures or when the emulated architecture requires a
  43 coherent representation of the emulated machine state.
  44
  45 Shared Data Structures
  46 ======================
  47
  48 Main Run Loop
  49 -------------
  50
  51 Even when there is no code being generated there are a number of
  52 structures associated with the hot-path through the main run-loop.
  53 These are associated with looking up the next translation block to
  54 execute. These include:
  55
  56     tb_jmp_cache (per-vCPU, cache of recent jumps)
  57     tb_ctx.htable (global hash table, phys address->tb lookup)
  58
  59 As TB linking only occurs when blocks are in the same page this code
  60 is critical to performance as looking up the next TB to execute is the
  61 most common reason to exit the generated code.
  62
  63 DESIGN REQUIREMENT: Make access to lookup structures safe with
  64 multiple reader/writer threads. Minimise any lock contention to do it.
  65
  66 The hot-path avoids using locks where possible. The tb_jmp_cache is
  67 updated with atomic accesses to ensure consistent results. The fall
  68 back QHT based hash table is also designed for lockless lookups. Locks
  69 are only taken when code generation is required or TranslationBlocks
  70 have their block-to-block jumps patched.
  71
  72 Global TCG State
  73 ----------------
  74
  75 User-mode emulation
  76 ~~~~~~~~~~~~~~~~~~~
  77
  78 We need to protect the entire code generation cycle including any post
  79 generation patching of the translated code. This also implies a shared
  80 translation buffer which contains code running on all cores. Any
  81 execution path that comes to the main run loop will need to hold a
  82 mutex for code generation. This also includes times when we need flush
  83 code or entries from any shared lookups/caches. Structures held on a
  84 per-vCPU basis won't need locking unless other vCPUs will need to
  85 modify them.
  86
  87 DESIGN REQUIREMENT: Add locking around all code generation and TB
  88 patching.
  89
  90 (Current solution)
  91
  92 Code generation is serialised with mmap_lock().
  93
  94 !User-mode emulation
  95 ~~~~~~~~~~~~~~~~~~~~
  96
  97 Each vCPU has its own TCG context and associated TCG region, thereby
  98 requiring no locking during translation.
  99
 100 Translation Blocks
 101 ------------------
 102
 103 Currently the whole system shares a single code generation buffer
 104 which when full will force a flush of all translations and start from
 105 scratch again. Some operations also force a full flush of translations
 106 including:
 107
 108   - debugging operations (breakpoint insertion/removal)
 109   - some CPU helper functions
 110   - linux-user spawning its first thread
 111
 112 This is done with the async_safe_run_on_cpu() mechanism to ensure all
 113 vCPUs are quiescent when changes are being made to shared global
 114 structures.
 115
 116 More granular translation invalidation events are typically due
 117 to a change of the state of a physical page:
 118
 119   - code modification (self modify code, patching code)
 120   - page changes (new page mapping in linux-user mode)
 121
 122 While setting the invalid flag in a TranslationBlock will stop it
 123 being used when looked up in the hot-path there are a number of other
 124 book-keeping structures that need to be safely cleared.
 125
 126 Any TranslationBlocks which have been patched to jump directly to the
 127 now invalid blocks need the jump patches reversing so they will return
 128 to the C code.
 129
 130 There are a number of look-up caches that need to be properly updated
 131 including the:
 132
 133   - jump lookup cache
 134   - the physical-to-tb lookup hash table
 135   - the global page table
 136
 137 The global page table (l1_map) which provides a multi-level look-up
 138 for PageDesc structures which contain pointers to the start of a
 139 linked list of all Translation Blocks in that page (see page_next).
 140
 141 Both the jump patching and the page cache involve linked lists that
 142 the invalidated TranslationBlock needs to be removed from.
 143
 144 DESIGN REQUIREMENT: Safely handle invalidation of TBs
 145                       - safely patch/revert direct jumps
 146                       - remove central PageDesc lookup entries
 147                       - ensure lookup caches/hashes are safely updated
 148
 149 (Current solution)
 150
 151 The direct jump themselves are updated atomically by the TCG
 152 tb_set_jmp_target() code. Modification to the linked lists that allow
 153 searching for linked pages are done under the protection of tb->jmp_lock,
 154 where tb is the destination block of a jump. Each origin block keeps a
 155 pointer to its destinations so that the appropriate lock can be acquired before
 156 iterating over a jump list.
 157
 158 The global page table is a lockless radix tree; cmpxchg is used
 159 to atomically insert new elements.
 160
 161 The lookup caches are updated atomically and the lookup hash uses QHT
 162 which is designed for concurrent safe lookup.
 163
 164 Parallel code generation is supported. QHT is used at insertion time
 165 as the synchronization point across threads, thereby ensuring that we only
 166 keep track of a single TranslationBlock for each guest code block.
 167
 168 Memory maps and TLBs
 169 --------------------
 170
 171 The memory handling code is fairly critical to the speed of memory
 172 access in the emulated system. The SoftMMU code is designed so the
 173 hot-path can be handled entirely within translated code. This is
 174 handled with a per-vCPU TLB structure which once populated will allow
 175 a series of accesses to the page to occur without exiting the
 176 translated code. It is possible to set flags in the TLB address which
 177 will ensure the slow-path is taken for each access. This can be done
 178 to support:
 179
 180   - Memory regions (dividing up access to PIO, MMIO and RAM)
 181   - Dirty page tracking (for code gen, SMC detection, migration and display)
 182   - Virtual TLB (for translating guest address->real address)
 183
 184 When the TLB tables are updated by a vCPU thread other than their own
 185 we need to ensure it is done in a safe way so no inconsistent state is
 186 seen by the vCPU thread.
 187
 188 Some operations require updating a number of vCPUs TLBs at the same
 189 time in a synchronised manner.
 190
 191 DESIGN REQUIREMENTS:
 192
 193   - TLB Flush All/Page
 194     - can be across-vCPUs
 195     - cross vCPU TLB flush may need other vCPU brought to halt
 196     - change may need to be visible to the calling vCPU immediately
 197   - TLB Flag Update
 198     - usually cross-vCPU
 199     - want change to be visible as soon as possible
 200   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
 201     - This is a per-vCPU table - by definition can't race
 202     - updated by its own thread when the slow-path is forced
 203
 204 (Current solution)
 205
 206 We have updated cputlb.c to defer operations when a cross-vCPU
 207 operation with async_run_on_cpu() which ensures each vCPU sees a
 208 coherent state when it next runs its work (in a few instructions
 209 time).
 210
 211 A new set up operations (tlb_flush_*_all_cpus) take an additional flag
 212 which when set will force synchronisation by setting the source vCPUs
 213 work as "safe work" and exiting the cpu run loop. This ensure by the
 214 time execution restarts all flush operations have completed.
 215
 216 TLB flag updates are all done atomically and are also protected by the
 217 corresponding page lock.
 218
 219 (Known limitation)
 220
 221 Not really a limitation but the wait mechanism is overly strict for
 222 some architectures which only need flushes completed by a barrier
 223 instruction. This could be a future optimisation.
 224
 225 Emulated hardware state
 226 -----------------------
 227
 228 Currently thanks to KVM work any access to IO memory is automatically
 229 protected by the global iothread mutex, also known as the BQL (Big
 230 Qemu Lock). Any IO region that doesn't use global mutex is expected to
 231 do its own locking.
 232
 233 However IO memory isn't the only way emulated hardware state can be
 234 modified. Some architectures have model specific registers that
 235 trigger hardware emulation features. Generally any translation helper
 236 that needs to update more than a single vCPUs of state should take the
 237 BQL.
 238
 239 As the BQL, or global iothread mutex is shared across the system we
 240 push the use of the lock as far down into the TCG code as possible to
 241 minimise contention.
 242
 243 (Current solution)
 244
 245 MMIO access automatically serialises hardware emulation by way of the
 246 BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
 247 and also defer the reset/startup of vCPUs to the vCPU context by way
 248 of async_run_on_cpu().
 249
 250 Updates to interrupt state are also protected by the BQL as they can
 251 often be cross vCPU.
 252
 253 Memory Consistency
 254 ==================
 255
 256 Between emulated guests and host systems there are a range of memory
 257 consistency models. Even emulating weakly ordered systems on strongly
 258 ordered hosts needs to ensure things like store-after-load re-ordering
 259 can be prevented when the guest wants to.
 260
 261 Memory Barriers
 262 ---------------
 263
 264 Barriers (sometimes known as fences) provide a mechanism for software
 265 to enforce a particular ordering of memory operations from the point
 266 of view of external observers (e.g. another processor core). They can
 267 apply to any memory operations as well as just loads or stores.
 268
 269 The Linux kernel has an excellent `write-up
 270 <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
 271 on the various forms of memory barrier and the guarantees they can
 272 provide.
 273
 274 Barriers are often wrapped around synchronisation primitives to
 275 provide explicit memory ordering semantics. However they can be used
 276 by themselves to provide safe lockless access by ensuring for example
 277 a change to a signal flag will only be visible once the changes to
 278 payload are.
 279
 280 DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
 281
 282 This would enforce a strong load/store ordering so all loads/stores
 283 complete at the memory barrier. On single-core non-SMP strongly
 284 ordered backends this could become a NOP.
 285
 286 Aside from explicit standalone memory barrier instructions there are
 287 also implicit memory ordering semantics which comes with each guest
 288 memory access instruction. For example all x86 load/stores come with
 289 fairly strong guarantees of sequential consistency whereas Arm has
 290 special variants of load/store instructions that imply acquire/release
 291 semantics.
 292
 293 In the case of a strongly ordered guest architecture being emulated on
 294 a weakly ordered host the scope for a heavy performance impact is
 295 quite high.
 296
 297 DESIGN REQUIREMENTS: Be efficient with use of memory barriers
 298        - host systems with stronger implied guarantees can skip some barriers
 299        - merge consecutive barriers to the strongest one
 300
 301 (Current solution)
 302
 303 The system currently has a tcg_gen_mb() which will add memory barrier
 304 operations if code generation is being done in a parallel context. The
 305 tcg_optimize() function attempts to merge barriers up to their
 306 strongest form before any load/store operations. The solution was
 307 originally developed and tested for linux-user based systems. All
 308 backends have been converted to emit fences when required. So far the
 309 following front-ends have been updated to emit fences when required:
 310
 311     - target-i386
 312     - target-arm
 313     - target-aarch64
 314     - target-alpha
 315     - target-mips
 316
 317 Memory Control and Maintenance
 318 ------------------------------
 319
 320 This includes a class of instructions for controlling system cache
 321 behaviour. While QEMU doesn't model cache behaviour these instructions
 322 are often seen when code modification has taken place to ensure the
 323 changes take effect.
 324
 325 Synchronisation Primitives
 326 --------------------------
 327
 328 There are two broad types of synchronisation primitives found in
 329 modern ISAs: atomic instructions and exclusive regions.
 330
 331 The first type offer a simple atomic instruction which will guarantee
 332 some sort of test and conditional store will be truly atomic w.r.t.
 333 other cores sharing access to the memory. The classic example is the
 334 x86 cmpxchg instruction.
 335
 336 The second type offer a pair of load/store instructions which offer a
 337 guarantee that a region of memory has not been touched between the
 338 load and store instructions. An example of this is Arm's ldrex/strex
 339 pair where the strex instruction will return a flag indicating a
 340 successful store only if no other CPU has accessed the memory region
 341 since the ldrex.
 342
 343 Traditionally TCG has generated a series of operations that work
 344 because they are within the context of a single translation block so
 345 will have completed before another CPU is scheduled. However with
 346 the ability to have multiple threads running to emulate multiple CPUs
 347 we will need to explicitly expose these semantics.
 348
 349 DESIGN REQUIREMENTS:
 350   - Support classic atomic instructions
 351   - Support load/store exclusive (or load link/store conditional) pairs
 352   - Generic enough infrastructure to support all guest architectures
 353 CURRENT OPEN QUESTIONS:
 354   - How problematic is the ABA problem in general?
 355
 356 (Current solution)
 357
 358 The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
 359 can be used directly or combined to emulate other instructions like
 360 Arm's ldrex/strex instructions. While they are susceptible to the ABA
 361 problem so far common guests have not implemented patterns where
 362 this may be a problem - typically presenting a locking ABI which
 363 assumes cmpxchg like semantics.
 364
 365 The code also includes a fall-back for cases where multi-threaded TCG
 366 ops can't work (e.g. guest atomic width > host atomic width). In this
 367 case an EXCP_ATOMIC exit occurs and the instruction is emulated with
 368 an exclusive lock which ensures all emulation is serialised.
 369
 370 While the atomic helpers look good enough for now there may be a need
 371 to look at solutions that can more closely model the guest
 372 architectures semantics.