docs/internals/t-chaining-notes.txt

Verification todo
~~~~~~~~~~~~~~~~~
check that illegal insns on all targets don't cause the _toIR.c's to
assert.  [DONE: amd64 x86 ppc32 ppc64 arm s390]

check also with --vex-guest-chase-cond=yes

check that all targets can run their insn set tests with
--vex-guest-max-insns=1.

all targets: run some tests using --profile-flags=... to exercise
function patchProfInc_<arch> [DONE: amd64 x86 ppc32 ppc64 arm s390]

figure out if there is a way to write a test program that checks
that event checks are actually getting triggered


Cleanups
~~~~~~~~
host_arm_isel.c and host_arm_defs.c: get rid of global var arm_hwcaps.

host_x86_defs.c, host_amd64_defs.c: return proper VexInvalRange
records from the patchers, instead of {0,0}, so that transparent
self hosting works properly.

host_ppc_defs.h: is RdWrLR still needed?  If not delete.

ditto ARM, Ld8S

Comments that used to be in m_scheduler.c:
   tchaining tests:
   - extensive spinrounds
   - with sched quantum = 1  -- check that handle_noredir_jump
     doesn't return with INNER_COUNTERZERO
   other:
   - out of date comment w.r.t. bit 0 set in libvex_trc_values.h
   - can VG_TRC_BORING still happen?  if not, rm
   - memory leaks in m_transtab (InEdgeArr/OutEdgeArr leaking?)
   - move do_cacheflush out of m_transtab
   - more economical unchaining when nuking an entire sector
   - ditto w.r.t. cache flushes
   - verify case of 2 paths from A to B
   - check -- is IP_AT_SYSCALL still right?


Optimisations
~~~~~~~~~~~~~
ppc: chain_XDirect: generate short form jumps when possible

ppc64: immediate generation is terrible .. should be able
       to do better

arm codegen: Generate ORRS for CmpwNEZ32(Or32(x,y))

all targets: when nuking an entire sector, don't bother to undo the
patching for any translations within the sector (nor with their
invalidations).

(somewhat implausible) for jumps to disp_cp_indir, have multiple
copies of disp_cp_indir, one for each of the possible registers that
could have held the target guest address before jumping to the stub.
Then disp_cp_indir wouldn't have to reload it from memory each time.
Might also have the effect of spreading out the indirect mispredict
burden somewhat (across the multiple copies.)


Implementation notes
~~~~~~~~~~~~~~~~~~~~
T-chaining changes -- summary

* The code generators (host_blah_isel.c, host_blah_defs.[ch]) interact
  more closely with Valgrind than before.  In particular the
  instruction selectors must use one of 3 different kinds of
  control-transfer instructions: XDirect, XIndir and XAssisted.
  All archs must use these in the same way; ad-hoc control transfer
  instructions are no longer allowed.
  (more detail below)


* With T-chaining, translations can jump between each other without
  going through the dispatcher loop every time.  This means that the
  event check (decrement a counter, and exit if it goes negative)
  that the dispatcher loop previously did must now be compiled into
  each translation.


* The assembly dispatcher code (dispatch-arch-os.S) is still
  present.  It still provides table lookup services for
  indirect branches, but it also provides a new feature:
  dispatch points, to which the generated code jumps.  There
  are 5:

  VG_(disp_cp_chain_me_to_slowEP):
  VG_(disp_cp_chain_me_to_fastEP):
    These are chain-me requests, used for Boring conditional and
    unconditional jumps to destinations known at JIT time.  The
    generated code calls these (doesn't jump to them) and the
    stub recovers the return address.  These calls never return;
    instead the call is done so that the stub knows where the
    calling point is.  It needs to know this so it can patch
    the calling point to the requested destination.
  VG_(disp_cp_xindir):
    Old-style table lookup and go; used for indirect jumps.
  VG_(disp_cp_xassisted):
    Most general and slowest kind.  Can transfer to anywhere, but
    first returns to the scheduler to handle some other event (eg a
    syscall) before continuing.
  VG_(disp_cp_evcheck_fail):
    Code jumps here when the event check fails.


* New instructions in backends: XDirect, XIndir and XAssisted.
  XDirect is used for chainable jumps.  It is compiled into a
  call to VG_(disp_cp_chain_me_to_slowEP) or
  VG_(disp_cp_chain_me_to_fastEP).

  XIndir is used for indirect jumps.  It is compiled into a jump
  to VG_(disp_cp_xindir).

  XAssisted is used for "assisted" (do something first, then jump)
  transfers.  It is compiled into a jump to VG_(disp_cp_xassisted).

  All 3 of these may be conditional.

  More complexity: in some circumstances (no-redir translations)
  all transfers must be done with XAssisted.  In such cases the
  instruction selector will be told this.
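
  As a rough illustration of the rules above, the choice between the
  three transfer kinds can be modelled in C as follows.  This is a
  sketch only: XferKind, JumpInfo, choose_xfer and dispatch_point are
  hypothetical names invented for this example and are not part of
  VEX, and the real instruction selectors work on IR jump kinds rather
  than on two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the three control-transfer kinds. */
typedef enum { XFER_DIRECT, XFER_INDIR, XFER_ASSISTED } XferKind;

/* What is known about a jump at JIT time (illustrative fields only). */
typedef struct {
    bool is_boring;         /* no scheduler work is needed before the jump */
    bool dest_is_constant;  /* destination is known at JIT time */
} JumpInfo;

/* Pick a transfer kind.  In a no-redir translation every transfer is
   forced to XAssisted, as the notes require. */
XferKind choose_xfer(JumpInfo j, bool no_redir)
{
    if (no_redir || !j.is_boring)
        return XFER_ASSISTED;
    return j.dest_is_constant ? XFER_DIRECT : XFER_INDIR;
}

/* Name of the dispatch point the generated code would target.
   to_fast_ep only matters for XFER_DIRECT. */
const char *dispatch_point(XferKind k, bool to_fast_ep)
{
    switch (k) {
    case XFER_DIRECT:
        return to_fast_ep ? "VG_(disp_cp_chain_me_to_fastEP)"
                          : "VG_(disp_cp_chain_me_to_slowEP)";
    case XFER_INDIR:
        return "VG_(disp_cp_xindir)";
    case XFER_ASSISTED:
        return "VG_(disp_cp_xassisted)";
    }
    assert(!"unknown transfer kind");
    return NULL;
}
```

  In this model a Boring jump to a constant destination becomes
  XFER_DIRECT, unless no_redir is set, in which case it is demoted
  to XFER_ASSISTED like everything else.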


* Patching: XDirect is compiled into, essentially,
     %r11 = &VG_(disp_cp_chain_me_to_{slow,fast}EP)
     call *%r11
  Backends must provide a function (eg) chainXDirect_AMD64
  which converts it into a jump to a specified destination, either
     jmp $delta-of-PCs
  or
     %r11 = 64-bit immediate
     jmpq *%r11
  depending on branch distance.

  Backends must also provide a function (eg) unchainXDirect_AMD64
  which restores the original call-to-the-stub version.
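
  The chain/unchain pair can be sketched in C as below.  This is an
  illustration, not the code in host_amd64_defs.c: chain_site,
  unchain_site and put_le are hypothetical names, the byte values are
  the standard x86-64 encodings of the instructions shown above, and
  the 13-byte site size is simply the length of the movabsq/call pair.
  What the real patcher writes into the bytes left over after a short
  jump is not covered by these notes; the sketch leaves them alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SITE_LEN = 13, SHORT_JMP_LEN = 5 };

/* Store the low n bytes of v at p, least significant byte first. */
static void put_le(uint8_t *p, uint64_t v, int n)
{
    for (int i = 0; i < n; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Unchained form: movabsq $stub, %r11 ; call *%r11 */
void unchain_site(uint8_t site[SITE_LEN], uint64_t stub)
{
    assert(site != NULL);
    site[0] = 0x49; site[1] = 0xBB;                     /* movabsq $imm64, %r11 */
    put_le(&site[2], stub, 8);
    site[10] = 0x41; site[11] = 0xFF; site[12] = 0xD3;  /* call *%r11 */
}

/* Chained form.  site_addr is the address at which the site executes.
   Returns true if the short (rel32) form was used. */
bool chain_site(uint8_t site[SITE_LEN], uint64_t site_addr, uint64_t dest)
{
    assert(site != NULL);
    /* rel32 is measured from the end of the 5-byte jmp. */
    int64_t delta = (int64_t)(dest - (site_addr + SHORT_JMP_LEN));
    if (delta >= INT32_MIN && delta <= INT32_MAX) {
        site[0] = 0xE9;                                 /* jmp rel32 */
        put_le(&site[1], (uint64_t)delta, 4);
        return true;               /* remaining 8 bytes are left as-is */
    }
    site[0] = 0x49; site[1] = 0xBB;                     /* movabsq $dest, %r11 */
    put_le(&site[2], dest, 8);
    site[10] = 0x41; site[11] = 0xFF; site[12] = 0xE3;  /* jmpq *%r11 */
    return false;
}
```

  Because both forms fit inside the original 13 bytes, a site can be
  chained and unchained repeatedly without moving any other code.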


* Event checks.  Each translation now has two entry points,
  the slow one (slowEP) and the fast one (fastEP).  Like this:

     slowEP:
        counter--
        if (counter < 0) goto VG_(disp_cp_evcheck_fail)
     fastEP:
        (rest of the translation)

  slowEP is used for control flow transfers that are or might be
  a back edge in the control flow graph.  Insn selectors are
  given the address of the highest guest byte in the block so
  they can determine which edges are definitely not back edges.

  The counter is placed in the first 8 bytes of the guest state,
  and the address of VG_(disp_cp_evcheck_fail) is placed in
  the next 8 bytes.  This allows very compact checks on all
  targets, since no immediates need to be synthesised, eg:

    decq 0(%baseblock-pointer)
    jns  fastEP
    jmpq *8(%baseblock-pointer)
    fastEP:

  On amd64 a non-failing check is therefore 2 insns; all 3 occupy
  just 8 bytes.

  On amd64 the event check is generated by a single special
  pseudo-instruction, AMD64_EvCheck.
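
  The two decisions described above (which entry point an edge should
  target, and what the check at slowEP does) can be modelled in C as
  below.  The names choose_entry, event_check and entries_until_fail
  are hypothetical.  The rule "destination above the highest guest
  byte of the block means not a back edge" is an assumption about how
  a selector might use that address, not a statement of what each
  backend does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ENTER_AT_FAST_EP, ENTER_AT_SLOW_EP } EntryPoint;

/* An edge whose destination lies above the highest guest byte of the
   current block is treated as definitely not a back edge.  Anything
   else might be one, so it must go through the event check. */
EntryPoint choose_entry(uint64_t dest, uint64_t max_guest_addr)
{
    return dest > max_guest_addr ? ENTER_AT_FAST_EP : ENTER_AT_SLOW_EP;
}

/* The slowEP sequence: decrement, then test.  Returns true when
   execution falls through to fastEP, false when it would leave via
   VG_(disp_cp_evcheck_fail). */
bool event_check(int64_t *counter)
{
    assert(counter != NULL);
    *counter -= 1;
    return *counter >= 0;
}

/* Count how many slowEP entries succeed before the check fails. */
unsigned entries_until_fail(int64_t initial_counter)
{
    assert(initial_counter >= 0);
    int64_t counter = initial_counter;
    unsigned ok = 0;
    while (event_check(&counter))
        ok++;
    return ok;
}
```

  With a counter that starts at N, exactly N entries via slowEP fall
  through to fastEP before control is handed back through the
  evcheck-fail dispatch point.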


* BB profiling (for --profile-flags=).  The dispatch assembly
  dispatch-arch-os.S no longer deals with this and so is much
  simplified.  Instead the profile inc is compiled into each
  translation, as the insn immediately following the event
  check.  Again, on amd64 a pseudo-insn AMD64_ProfInc is used.
  Counters are now 64 bit even on 32 bit hosts, to avoid overflow.

  One complexity is that at JIT time the address of the counter
  is not yet known.  To solve this, VexTranslateResult
  now returns the offset of the profile inc in the generated
  code.  When the counter address is known, VEX can be called
  again to patch it in.  Backends must supply eg
  patchProfInc_AMD64 to make this happen.
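
  The two-step scheme (emit a placeholder, patch it later at the
  returned offset) might look roughly like this in C.  Everything
  here is an assumption made for illustration: the function names are
  hypothetical, the instruction pair "movabsq $imm64, %r11 ; incq
  (%r11)" and the zero placeholder are choices of this sketch, and the
  notes do not say what AMD64_ProfInc actually expands to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PROFINC_LEN = 13 };

/* Emit "movabsq $0, %r11 ; incq (%r11)" at code[0..12].  The zero
   immediate stands in for the not-yet-known counter address. */
void emit_profinc_placeholder(uint8_t *code)
{
    assert(code != NULL);
    code[0] = 0x49; code[1] = 0xBB;                     /* movabsq $imm64, %r11 */
    for (int i = 0; i < 8; i++)
        code[2 + i] = 0;
    code[10] = 0x49; code[11] = 0xFF; code[12] = 0x03;  /* incq (%r11) */
}

/* Patch counter_addr into the profile inc at offset offs.  Returns
   false if the expected instruction pair is not found there. */
bool patch_profinc(uint8_t *code, size_t code_len, size_t offs,
                   uint64_t counter_addr)
{
    if (offs > code_len || code_len - offs < PROFINC_LEN)
        return false;
    const uint8_t *p = code + offs;
    if (p[0] != 0x49 || p[1] != 0xBB ||
        p[10] != 0x49 || p[11] != 0xFF || p[12] != 0x03)
        return false;
    for (int i = 0; i < 8; i++)
        code[offs + 2 + i] = (uint8_t)(counter_addr >> (8 * i));
    return true;
}

/* Read back the 64-bit immediate of the profile inc at offset offs. */
uint64_t read_profinc_target(const uint8_t *code, size_t offs)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)code[offs + 2 + i] << (8 * i);
    return v;
}
```

  The offs argument plays the role of the offset reported back in
  VexTranslateResult: the caller records it at translation time and
  passes it in again once the counter has been allocated.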


* Front end changes (guest_blah_toIR.c)

  The way the guest program counter is handled has changed
  significantly.  Previously, the guest PC was updated (in IR)
  at the start of each instruction, except for the first insn
  in an IRSB.  That scheme was inconsistent and doesn't work with
  the new framework.

  Now, each instruction must update the guest PC as its last
  IR statement -- not its first.  And there is no special exemption
  for the first insn in the block.  As before, most of these updates
  are optimised out by ir_opt, so there are no efficiency concerns.

  As a logical side effect of this, exits (IRStmt_Exit) and the
  block-end transfer are both considered to write to the guest state
  (the guest PC) and so need to be told the offset of it.

  IR generators (eg disInstr_AMD64) are no longer allowed to set
  IRSB::next to specify the block-end transfer address.  Instead they
  now indicate, to the generic steering logic that drives them (iow,
  guest_generic_bb_to_IR.c), that the block has ended.  This then
  generates effectively "goto GET(PC)" (which, again, is optimised
  away).  What this does mean is that if the IR generator function
  ends the IR of the last instruction in the block with an incorrect
  assignment to the guest PC, execution will transfer to an incorrect
  destination -- making the error obvious quickly.
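
  A toy C model of the new PC discipline is given below.  It is not
  VEX IR: ToyGuestState, ToyInsn, exec_insn and run_block are
  hypothetical, and each "instruction" is reduced to the one effect
  that matters here, namely that its final action is a write to the
  guest PC and that the block-end transfer simply reads that value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t pc; } ToyGuestState;

typedef struct {
    uint64_t addr;     /* guest address of the insn */
    uint32_t len;      /* length in bytes */
    bool     is_jump;  /* unconditional jump? */
    uint64_t target;   /* jump target, used only if is_jump */
} ToyInsn;

/* "Execute" one insn.  Whatever else it does, its final action is to
   write the guest PC, mirroring the last-IR-statement rule.  There is
   no exemption for the first insn in a block. */
void exec_insn(ToyGuestState *st, const ToyInsn *insn)
{
    /* ... other effects of the insn would go here ... */
    st->pc = insn->is_jump ? insn->target : insn->addr + insn->len;
}

/* Run a block.  The block-end transfer is just "goto GET(PC)", so the
   destination is whatever the last insn wrote. */
uint64_t run_block(ToyGuestState *st, const ToyInsn *insns, size_t n)
{
    assert(st != NULL && (insns != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        exec_insn(st, &insns[i]);
    return st->pc;
}
```

  If the last insn in a block writes the wrong PC value, run_block
  returns that wrong value directly, which is the "error becomes
  obvious quickly" property described above.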
 215   assignment to the guest PC, execution will transfer to an incorrect
 216   destination -- making the error obvious quickly.