docs/Architecture.rst

   1 ================
   2 The Architecture
   3 ================
   4
   5 Polly is a loop optimizer for LLVM. Starting from LLVM-IR it detects and
   6 extracts interesting loop kernels. For each kernel a mathematical model is
   7 derived which precisely describes the individual computations and memory
   8 accesses in the kernels. Within Polly a variety of analysis and code
   9 transformations are performed on this mathematical model. After all
  10 optimizations have been derived and applied, optimized LLVM-IR is regenerated
  11 and inserted into the LLVM-IR module.
  12
  13 .. image:: images/architecture.png
  14     :align: center
  15
  16 Polly in the LLVM pass pipeline
  17 -------------------------------
  18
  19 The standard LLVM pass pipeline as it is used in -O1/-O2/-O3 mode of clang/opt
  20 consists of a sequence of passes that can be grouped in different conceptual
  21 phases. The first phase, we call it here **Canonicalization**, is a scalar
  22 canonicalization phase that contains passes like -mem2reg, -instcombine,
  23 -cfgsimplify, or early loop unrolling. It has the goal of removing and
  24 simplifying the given IR as much as possible focusing mostly on scalar
  25 optimizations. The second phase consists of three conceptual groups that  are
  26 executed in the so-called **Inliner cycle**, This is again a set of **Scalar
  27 Simplification** passes, a set of **Simple Loop Optimizations**, and the
  28 **Inliner** itself. Even though these passes make up the majority of the LLVM
  29 pass pipeline, the primary goal of these passes is still canonicalization
  30 without loosing semantic information that complicates later analysis. As part of
  31 the inliner cycle, the LLVM inliner step-by-step tries to inline functions, runs
  32 canonicalization passes to exploit newly exposed simplification opportunities,
  33 and then tries to inline the further simplified functions. Some simple loop
  34 optimizations are executed as part of the inliner cycle. Even though they
  35 perform some optimizations, their primary goal is still the simplification of
  36 the program code. Loop invariant code motion is one such optimization that
  37 besides being beneficial for program performance also allows us to move
  38 computation out of loops and in the best case enables us to eliminate certain
  39 loops completely.  Only after the inliner cycle has been finished, a last
  40 **Target Specialization** phase is run, where IR complexity is deliberately
  41 increased to take advantage of target specific features that maximize the
  42 execution performance on the device we target. One of the principal
  43 optimizations in this phase is vectorization, but also target specific loop
  44 unrolling, or some loop transformations (e.g., distribution) that expose more
  45 vectorization opportunities.
  46
  47 .. image:: images/LLVM-Passes-only.png
  48     :align: center
  49
  50 Polly can conceptually be run at three different positions in the pass pipeline.
  51 As an early optimizer before the standard LLVM pass pipeline, as a later
  52 optimizer as part of the target specialization sequence, and theoretically also
  53 with the loop optimizations in the inliner cycle. We only discuss the first two
  54 options, as running Polly in the inline loop, is likely to disturb the inliner
  55 and is consequently not a good idea.
  56
  57 .. image:: images/LLVM-Passes-all.png
  58     :align: center
  59
  60 Running Polly early before the standard pass pipeline has the benefit that the
  61 LLVM-IR processed by Polly is still very close to the original input code.
  62 Hence, it is less likely that transformations applied by LLVM change the IR in
  63 ways not easily understandable for the programmer. As a result, user feedback is
  64 likely better and it is less likely that kernels that in C seem a perfect fit
  65 for Polly have been transformed such that Polly can not handle them any
  66 more. On the other hand, codes that require inlining to be optimized won't
  67 benefit if Polly is scheduled at this position. The additional set of
  68 canonicalization passes required will result in a small, but general compile
  69 time increase and some random run-time performance changes due to slightly
  70 different IR being passed through the optimizers. To force Polly to run early in
  71 the pass pipleline use the option *-polly-position=early* (default today).
  72
  73 .. image:: images/LLVM-Passes-early.png
  74     :align: center
  75
  76 Running Polly right before the vectorizer has the benefit that the full inlining
  77 cycle has been run and as a result even heavily templated C++ code could
  78 theoretically benefit from Polly (more work is necessary to make Polly here
  79 really effective). As the IR that is passed to Polly has already been
  80 canonicalized, there is also no need to run additional canonicalization passes.
  81 General compile time is almost not affected by Polly, as detection of loop
  82 kernels is generally very fast and the actual optimization and cleanup passes
  83 are only run on functions which contain loop kernels that are worth optimizing.
  84 However, due to the many optimizations that LLVM runs before Polly the IR that
  85 reaches Polly often has additional scalar dependences that make Polly a lot less
  86 efficient. To force Polly to run before the vectorizer in the pass pipleline use
  87 the option *-polly-position=before-vectorizer*. This position is not yet the
  88 default for Polly, but work is on its way to be effective even in presence of
  89 scalar dependences. After this work has been completed, Polly will likely use
  90 this position by default.
  91
  92 .. image:: images/LLVM-Passes-late.png
  93     :align: center