docs/ssapre.txt

   1
   2 SSAPRE stands for "SSA based Partial Redundancy Elimination".
   3
   4 The algorithm is explained in this paper:
   5
   6 Partial Redundancy Elimination in SSA Form (1999)
   7 Robert Kennedy, Sun Chan, SHIN-MING LIU, RAYMOND LO, PENG TU, FRED CHOW
   8 ACM Transactions on Programming Languages and Systems
   9
  10 http://citeseer.ist.psu.edu/kennedy99partial.html
  11
  12 In this document I give a gentle introduction to the concept of "partial"
  13 redundancies, and I explain the basic design decisions I took in implementing
  14 SSAPRE, but the paper is essential to understand the code.
  15
  16 Partial Redundancy Elimination (or PRE) is an optimization that (guess what?)
  17 tries to remove redundant computations.
  18 It achieves this by saving the result of "not redundant" evaluations of
  19 expressions into appositely created temporary variables, so that "redundant"
  20 evaluations can be replaced by a load from the appropriate variable.
  21
  22 Of course, on register starved architectures (x86) a temporary could cost more
  23 than the evaluation itself... PRE guarantees that the live range of the
  24 introduced variables is the minimal possible, but the added pressure on the
  25 register allocator can be an issue.
  26
  27 The nice thing about PRE is that it not only removes "full" redundancies, but
  28 also "partial" ones.
  29 A full redundancy is easy to spot, and straightforward to handle, like in the
  30 following example (in every example here, the "expression" is "a + b"):
  31
  32 int FullRedundancy1 (int a, int b) {
  33      int v1 = a + b;
  34      int v2 = a + b;
  35      return v1 + v2;
  36 }
  37
  38 PRE would transform it like this:
  39
  40 int FullRedundancy1 (int a, int b) {
  41      int t = a + b;
  42      int v1 = t;
  43      int v2 = t;
  44      return v1 + v2;
  45 }
  46
  47 Of course, either a copy propagation pass or a register allocator smart enough
  48 to remove unneeded variables would be necessary afterwords.
  49
  50 Another example of full redundancy is the following:
  51
  52 int FullRedundancy2 (int a, int b) {
  53      int v1;
  54
  55      if (a >= 0) {
  56           v1 = a + b; // BB1
  57      } else {
  58           a = -a; // BB2
  59           v1 = a + b;
  60      }
  61
  62      int v2 = a + b; // BB3
  63      return v1 + v2;
  64 }
  65
  66 Here the two expressions in BB1 and BB2 are *not* the same thing (a is
  67 modified in BB2), but both are redundant with the expression in BB3, so the
  68 code can be transformed like this:
  69
  70 int FullRedundancy2 (int a, int b) {
  71      int v1;
  72      int t;
  73
  74      if (a >= 0) {
  75           t = a + b; // BB1
  76           v1 = t;
  77      } else {
  78           a = -a; // BB2
  79           t = a + b;
  80           v1 = t;
  81      }
  82
  83      int v2 = t; // BB3
  84      return v1 + v2;
  85 }
  86
  87 Note that there are still two occurrences of the expression, while it can be
  88 easily seen that one (at the beginning of BB3) would suffice.
  89 This, however, is not a redundancy for PRE, because there is no path in the
  90 CFG where the expression is evaluated twice.
  91 Maybe this other kind of redundancy (which affects code size, and not the
  92 computations that are actually performed) would be eliminated by code hoisting,
  93 but I should check it; anyway, it is not a PRE related thing.
  94
  95 An example of partial redundancy, on the other hand, is the following:
  96
  97 int PartialRedundancy (int a, int b) {
  98      int v1;
  99
 100      if (a >= 0) {
 101           v1 = a + b; // BB1
 102      } else {
 103           v1 = 0; // BB2
 104      }
 105
 106      int v2 = a + b; // BB3
 107      return v1 + v2;
 108 }
 109
 110 The redundancy is partial because the expression is computed more than once
 111 along some path in the CFG, not all paths.
 112 In fact, on the path BB1 - BB3 the expression is computed twice, but on the
 113 path BB2 - BB3 it is computed only once.
 114 In this case, PRE must insert new occurrences of the expression in order to
 115 obtain a full redundancy, and then use temporary variables as before.
 116 Adding a computation in BB2 would do the job.
 117
 118 One nice thing about PRE is that loop invariants can be seen as partial
 119 redundancies.
 120 The idea is that you can get into the loop from two locations: from before the
 121 loop (at the 1st iteration), and from inside the loop itself (at any other
 122 iteration). If there is a computation inside the loop that is in fact a loop
 123 invariant, PRE will spot this, and will handle the BB before the loop as a
 124 place where to insert a new computation to get a full redundancy.
 125 At this point, the computation inside the loop would be replaced by an use of
 126 the temporary stored before the loop, effectively performing "loop invariant
 127 code motion".
 128
 129 Now, this is what PRE does to the code.
 130
 131 But how does it work?
 132
 133 In "classic" solutions, PRE is formulated as a data flow analysis problem.
 134 The Muchnick provides a detailed description of the algorithm in this way (it
 135 is by far the most complex problem of this kind in the whole book).
 136 The point is that this algorithm does not exploit the SSA form.
 137 In fact, it has to perform all that amount of data flow analysis exactly
 138 because it does not take advantage of the work already done if you have put
 139 the program into SSA form.
 140
 141 The SSAPRE algorithm, on the other hand, is designed with SSA in mind.
 142 It fully exploits the properties of SSA variables, it also explicitly reuses
 143 some data structures that must have been computed when building the SSA form,
 144 and takes great care to output its code in that form already (which means that
 145 the temporaries it introduces are already "versioned", with all the phi
 146 variables correctly placed).
 147
 148 The main concept used in this algorithm is the "Factored Redundancy Graph" (or
 149 FRG in short). Basically, for each given expression, this graph shows all its
 150 occurrences in the program, and redundant ones are linked to their
 151 "representative occurrence" (the 1st occurrence met in a CFG traversal).
 152 The central observation is that the FRG is "factored" because each expression
 153 occurrence has exactly one representative occurrence, in the same way as the
 154 SSA form is a "factored" use-definition graph (each use is related to exactly
 155 one definition).
 156 And in fact building the FRG is much like building an SSA representation, with
 157 PHI nodes and all.
 158 By the way, I use "PHI" for "PHI expression occurrences", and "phi" for the
 159 usual phi definitions in SSA, because the paper uses an uppercase phi greek
 160 letter for PHI occurrences (while the lowercase one is standard in SSA).
 161
 162 The really interesting point is that the FRG for a given expression has exactly
 163 the same "shape" of the use-definition graph for the temporary var that must
 164 be introduced to remove the redundancy, and this is why SSAPRE can easily emit
 165 its output code in correct SSA form.
 166
 167 One particular characteristic of the SSAPRE algorithm is that it is "sparse",
 168 in the sense that it operates on expressions individually, looking only at the
 169 specific nodes it needs. This is in contrast to the classical way of solving
 170 data flow analysis problems, which is to operate globally, using data
 171 structures that work "in parallel" on all the entities they operate on (in
 172 practice bit sets, with for instance one bit per variable, or one bit per
 173 expression occurrence, you get the idea). This is handy (it exploits the
 174 parallelism of bit operations), but in general setting up those data
 175 structures is time consuming, and if the number of the things represented by
 176 the bits is not fixed in advance the code can get quite messy (bit sets must
 177 become growable).
 178 Moreover, applying special handling to individual expressions becomes a very
 179 tricky thing.
 180
 181 SSAPRE, on the other hand, looks at the whole program (method in our case) only
 182 once, when it scans the code to collect (and classify) all expression
 183 occurrences.
 184 From here on, it operates on one expression at a time, looking only at its
 185 specific data structures (which, for each expression, are much smaller than
 186 the whole program, so the operations are fast).
 187
 188 This approach has another advantage: the data structures used to operate on
 189 one expressions can be recycled when operating on other expressions, making
 190 the memory usage of the compiler lower, and (also important) avoiding losing
 191 time with memory allocations at all.
 192 This reflects directly on the design of those data structures.
 193 We can better see these advantages following which data structures are used
 194 during the application of SSAPRE to a method.
 195
 196 The steps that are performed are the following:
 197
 198     * Initialization: scan the whole program, collect all the occurrences, and
 199       build a worklist of expressions to be processed (each worklist entry
 200       describes all the occurrences of the given expression).
 201 Here the data structures are the following:
 202 - One struct (the working area), containing the worklist and other
 203  "global" data. The worklist itself contains an entry for each expression
 204   which in turn has an entry for each occurrence.
 205 - One "info" struct for each BB, containing interesting dominance and CFG
 206   related properties of the BB.
 207
 208 Then, for each entry in the worklist, these operations are performed:
 209
 210     * PHI placement: find where the PHI nodes of the FRG must be placed.
 211     * Renaming: assign appropriate "redundancy classes" to all occurrences (it
 212       is like assigning variable versions when building an SSA form).
 213     * Analyze: compute various flags in PHI nodes (which are the only places
 214       that define where additional computations may be added).
 215       This conceptually is composed of two data flow analysis passes, which in
 216       practice only scan the PHI nodes in the FRG, not the whole code, so they
 217       are not that heavy.
 218     * Finalize: make so that the FRG is exactly like the use-def graph for the
 219       temporary that will be introduced (it generally must be reshaped a bit
 220       according to the flags computed in the previous step).
 221       This is also made of two steps, but more for implementation reasons than
 222       for conceptual ones.
 223     * Code motion: actually update the code using the FRG.
 224 Here, what's needed is the following:
 225 - PHI occurrences (and some flags sbout them)
 226 - PHI argument occurrences (and some flags sbout them)
 227 - The renaming stack
 228 In practice, one can observe that each BB can have at most one PHI (we work
 229 on one expression at a time), and also one PHI argument (which we consider
 230 occurring at the end of the BB). Therefore, we do not have separate structures
 231 for these, but store them directly in the BB infos (which are kept for the
 232 whole SSAPRE invocation).
 233 The renaming stack is managed directly as a list of occurrences, with
 234 special handling for PHI nodes (which, being represented directly by their
 235 BB, are not "occurrences").
 236
 237
 238 So far, the only two missing things (with respect to SSAPRE in the paper) are
 239 unneeded PHIs eliminantion and the handling of "composite" expressions.
 240 Otherwise, the implementation is complete.
 241
 242 Other interesting issues are:
 243 - SSAPRE has the assumption that:
 244   - each SSA variable is related to one "original" (not SSA) variable, and
 245   - no more than one version of each originsl variable is live at the same time
 246     in the CFG.
 247   It would be better to relax these assumptions.
 248 - SSAPRE operates on "syntactic" redundancies, not on "values".
 249   GVNPRE (or other means of merging GVN) would be a nice alternative, see
 250   "http://www.cs.purdue.edu/homes/vandrutj/papers/thesis.pdf".