1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3 <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
4 <html>
5 <head>
6 <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
7 <title>Polly - Examples</title>
8 <link type="text/css" rel="stylesheet" href="menu.css">
9 <link type="text/css" rel="stylesheet" href="content.css">
10 </head>
11 <body>
12 <div id="box">
13 <!--#include virtual="menu.html.incl"-->
14 <div id="content">
15 <!--=====================================================================-->
16 <h1>Execute the individual Polly passes manually</h1>
17 <!--=====================================================================-->
19 <p>
20 This example presents the individual passes that are involved when optimizing
21 code with Polly. We show how to execute them individually and explain for each
22 which analysis is performed or what transformation is applied. In this example
23 the polyhedral transformation is user-provided to show how much performance
24 improvement can be expected by an optimal automatic optimizer.</p>
26 The files used and created in this example are available in the Polly checkout
27 in the folder <em>www/experiments/matmul</em>. They can be created automatically
28 by running the <em>www/experiments/matmul/runall.sh</em> script.
30 <ol>
31 <li><h4>Create LLVM-IR from the C code</h4>
Polly works on LLVM-IR, so the source files first have to be translated into
LLVM-IR. If more than one file should be optimized, the files can be combined into
a single file with llvm-link.
37 <pre class="code">clang -S -emit-llvm matmul.c -o matmul.s</pre>
38 </li>
41 <li><h4>Load Polly automatically when calling the 'opt' tool</h4>
Polly is not built into opt or bugpoint; it is a shared library that has
to be loaded into these tools explicitly. The Polly library is called
LLVMPolly.so. It is available in the build/lib/ directory. For convenience we create
an alias that automatically loads Polly whenever 'opt' is called.
47 <pre class="code">
export PATH_TO_POLLY_LIB="$HOME/polly/build/lib"
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"</pre>
50 </li>
52 <li><h4>Prepare the LLVM-IR for Polly</h4>
Polly is only able to work with code that matches a canonical form. To translate
the LLVM-IR into this form we use a set of canonicalization passes. They are
scheduled by using '-polly-canonicalize'.
57 <pre class="code">opt -S -polly-canonicalize matmul.s &gt; matmul.preopt.ll</pre></li>
59 <li><h4>Show the SCoPs detected by Polly (optional)</h4>
To check whether Polly was able to detect SCoPs, we print the
structure of the detected SCoPs. In our example two SCoPs were detected: one in
'init_array', the other in 'main'.
65 <pre class="code">opt -basicaa -polly-ast -analyze -q matmul.preopt.ll</pre>
67 <pre>
init_array():
for (c2=0;c2&lt;=1023;c2++) {
  for (c4=0;c4&lt;=1023;c4++) {
    Stmt_5(c2,c4);
  }
}

main():
for (c2=0;c2&lt;=1023;c2++) {
  for (c4=0;c4&lt;=1023;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6&lt;=1023;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
84 </pre>
85 </li>
86 <li><h4>Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)</h4>
88 Polly can use graphviz to graphically show a CFG in which the detected SCoPs are
89 highlighted. It can also create '.dot' files that can be translated by
90 the 'dot' utility into various graphic formats.
92 <pre class="code">opt -basicaa -view-scops -disable-output matmul.preopt.ll
93 opt -basicaa -view-scops-only -disable-output matmul.preopt.ll</pre>
94 The output for the different functions<br />
95 view-scops:
96 <a href="experiments/matmul/scops.main.dot.png">main</a>,
97 <a href="experiments/matmul/scops.init_array.dot.png">init_array</a>,
98 <a href="experiments/matmul/scops.print_array.dot.png">print_array</a><br />
99 view-scops-only:
100 <a href="experiments/matmul/scopsonly.main.dot.png">main</a>,
101 <a href="experiments/matmul/scopsonly.init_array.dot.png">init_array</a>,
102 <a href="experiments/matmul/scopsonly.print_array.dot.png">print_array</a>
103 </li>
105 <li><h4>View the polyhedral representation of the SCoPs</h4>
106 <pre class="code">opt -basicaa -polly-scops -analyze matmul.preopt.ll</pre>
107 <pre>
108 [...]
109 Printing analysis 'Polly - Create polyhedral description of Scops' for region:
110 'for.cond =&gt; for.end19' in function 'init_array':
111 Context:
112 { [] }
113 Statements {
114 Stmt_5
115 Domain&nbsp;:=
116 { Stmt_5[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
117 Schedule&nbsp;:=
118 { Stmt_5[i0, i1] -&gt; schedule[0, i0, 0, i1, 0] };
119 WriteAccess&nbsp;:=
120 { Stmt_5[i0, i1] -&gt; MemRef_A[1037i0 + i1] };
121 WriteAccess&nbsp;:=
122 { Stmt_5[i0, i1] -&gt; MemRef_B[1047i0 + i1] };
123 FinalRead
124 Domain&nbsp;:=
125 { FinalRead[0] };
126 Schedule&nbsp;:=
127 { FinalRead[i0] -&gt; schedule[200000000, o1, o2, o3, o4] };
128 ReadAccess&nbsp;:=
129 { FinalRead[i0] -&gt; MemRef_A[o0] };
130 ReadAccess&nbsp;:=
131 { FinalRead[i0] -&gt; MemRef_B[o0] };
133 [...]
134 Printing analysis 'Polly - Create polyhedral description of Scops' for region:
135 'for.cond =&gt; for.end30' in function 'main':
136 Context:
137 { [] }
138 Statements {
139 Stmt_4
140 Domain&nbsp;:=
141 { Stmt_4[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
142 Schedule&nbsp;:=
143 { Stmt_4[i0, i1] -&gt; schedule[0, i0, 0, i1, 0, 0, 0] };
144 WriteAccess&nbsp;:=
145 { Stmt_4[i0, i1] -&gt; MemRef_C[1067i0 + i1] };
146 Stmt_6
147 Domain&nbsp;:=
148 { Stmt_6[i0, i1, i2]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023 };
149 Schedule&nbsp;:=
150 { Stmt_6[i0, i1, i2] -&gt; schedule[0, i0, 0, i1, 1, i2, 0] };
151 ReadAccess&nbsp;:=
152 { Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
153 ReadAccess&nbsp;:=
154 { Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2] };
155 ReadAccess&nbsp;:=
156 { Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2] };
157 WriteAccess&nbsp;:=
158 { Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
159 FinalRead
160 Domain&nbsp;:=
161 { FinalRead[0] };
162 Schedule&nbsp;:=
163 { FinalRead[i0] -&gt; schedule[200000000, o1, o2, o3, o4, o5, o6] };
164 ReadAccess&nbsp;:=
165 { FinalRead[i0] -&gt; MemRef_C[o0] };
166 ReadAccess&nbsp;:=
167 { FinalRead[i0] -&gt; MemRef_A[o0] };
168 ReadAccess&nbsp;:=
169 { FinalRead[i0] -&gt; MemRef_B[o0] };
171 [...]
172 </pre>
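<p>The Schedule relations above can be read as execution timestamps: statement
instances run in lexicographic order of their schedule tuples. The following
Python sketch is not part of Polly. It assumes a hypothetical bound N = 3 instead
of 1024, sorts the instances of Stmt_4 and Stmt_6 by the schedule tuples printed
for 'main', and checks that this reproduces the order of the original loop nest.</p>

```python
N = 3  # hypothetical small bound; the output above uses 0..1023

def stmt4_schedule(i0, i1):
    # { Stmt_4[i0, i1] -> schedule[0, i0, 0, i1, 0, 0, 0] }
    return (0, i0, 0, i1, 0, 0, 0)

def stmt6_schedule(i0, i1, i2):
    # { Stmt_6[i0, i1, i2] -> schedule[0, i0, 0, i1, 1, i2, 0] }
    return (0, i0, 0, i1, 1, i2, 0)

# Collect all statement instances in an arbitrary order (all Stmt_6 first).
instances = [(stmt6_schedule(i0, i1, i2), "Stmt_6", (i0, i1, i2))
             for i0 in range(N) for i1 in range(N) for i2 in range(N)]
instances += [(stmt4_schedule(i0, i1), "Stmt_4", (i0, i1))
              for i0 in range(N) for i1 in range(N)]

# Sorting by schedule tuple gives the execution order.
by_schedule = [(name, idx) for _, name, idx in sorted(instances)]

# The order of the original loop nest, written out directly.
loop_order = []
for i0 in range(N):
    for i1 in range(N):
        loop_order.append(("Stmt_4", (i0, i1)))
        for i2 in range(N):
            loop_order.append(("Stmt_6", (i0, i1, i2)))

print(by_schedule == loop_order)
```

<p>The constant entries in the tuples (the 0 and 1 in the fifth position) are what
place Stmt_4 before the innermost loop that contains Stmt_6.</p>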
173 </li>
175 <li><h4>Show the dependences for the SCoPs</h4>
176 <pre class="code">opt -basicaa -polly-dependences -analyze matmul.preopt.ll</pre>
177 <pre>Printing analysis 'Polly - Calculate dependences for SCoP' for region:
178 'for.cond =&gt; for.end19' in function 'init_array':
179 Must dependences:
181 May dependences:
183 Must no source:
185 May no source:
187 Printing analysis 'Polly - Calculate dependences for SCoP' for region:
188 'for.cond =&gt; for.end30' in function 'main':
189 Must dependences:
190 { Stmt_4[i0, i1] -&gt; Stmt_6[i0, i1, 0]&nbsp;:
191 i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023;
192 Stmt_6[i0, i1, i2] -&gt; Stmt_6[i0, i1, 1 + i2]&nbsp;:
193 i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1022;
194 Stmt_6[i0, i1, 1023] -&gt; FinalRead[0]&nbsp;:
195 i1 &lt;= 1091540 - 1067i0 and i1 &gt;= -1067i0 and i1 &gt;= 0 and i1 &lt;= 1023;
196 Stmt_6[1023, i1, 1023] -&gt; FinalRead[0]&nbsp;:
197 i1 &gt;= 0 and i1 &lt;= 1023
199 May dependences:
201 Must no source:
202 { Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2]&nbsp;:
203 i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
204 Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2]&nbsp;:
205 i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
206 FinalRead[0] -&gt; MemRef_A[o0];
207 FinalRead[0] -&gt; MemRef_B[o0]
208 FinalRead[0] -&gt; MemRef_C[o0]&nbsp;:
209 o0 &gt;= 1092565 or (exists (e0 = [(o0)/1067]: o0 &lt;= 1091540 and o0 &gt;= 0
210 and 1067e0 &lt;= -1024 + o0 and 1067e0 &gt;= -1066 + o0)) or o0 &lt;= -1;
212 May no source:
214 </pre></li>
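<p>The 'Must dependences' and 'Must no source' relations for 'main' can be
reproduced on a small instance by brute force. The sketch below is only an
illustration and not how Polly computes dependences: it assumes a hypothetical
bound N = 3, models the arrays as two-dimensional instead of using the linearized
subscripts, and leaves out the FinalRead statement. It replays the accesses listed
in the polyhedral description and records, for every read, the last statement
instance that wrote the same cell.</p>

```python
N = 3  # hypothetical small bound; the output above uses 0..1023

def accesses():
    """Yield (instance, reads, writes) in the original execution order."""
    for i0 in range(N):
        for i1 in range(N):
            # Stmt_4 only writes C[i0][i1].
            yield ("Stmt_4", i0, i1), [], [("C", i0, i1)]
            for i2 in range(N):
                # Stmt_6 reads C[i0][i1], A[i0][i2], B[i2][i1], writes C[i0][i1].
                reads = [("C", i0, i1), ("A", i0, i2), ("B", i2, i1)]
                yield ("Stmt_6", i0, i1, i2), reads, [("C", i0, i1)]

last_writer = {}   # memory cell -> statement instance that wrote it last
flow = set()       # (writer instance, reader instance) pairs
no_source = set()  # (reader instance, cell) pairs never written inside the SCoP

for inst, reads, writes in accesses():
    for cell in reads:
        if cell in last_writer:
            flow.add((last_writer[cell], inst))
        else:
            no_source.add((inst, cell))
    for cell in writes:
        last_writer[cell] = inst

print(len(flow), len(no_source))
```

<p>The pairs in <em>flow</em> have the two shapes shown above,
Stmt_4[i0, i1] -&gt; Stmt_6[i0, i1, 0] and Stmt_6[i0, i1, i2] -&gt; Stmt_6[i0, i1, 1 + i2],
while every entry of <em>no_source</em> is a read of A or B.</p>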
216 <li><h4>Export jscop files</h4>
Polly can export the polyhedral representation in so-called jscop files. A jscop
file stores the polyhedral representation of a SCoP in JSON format.
220 <pre class="code">opt -basicaa -polly-export-jscop matmul.preopt.ll</pre>
221 <pre>Writing SCoP 'for.cond =&gt; for.end19' in function 'init_array' to './init_array___%for.cond---%for.end19.jscop'.
222 Writing SCoP 'for.cond =&gt; for.end30' in function 'main' to './main___%for.cond---%for.end30.jscop'.
223 </pre></li>
225 <li><h4>Import the changed jscop files and print the updated SCoP structure
226 (optional)</h4>
<p>Polly can reimport jscop files in which the schedules of the statements have
been changed. These changed schedules are used to describe transformations.
It is possible to import different jscop files by providing the postfix
of the jscop file that is imported.</p>
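<p>The file that is read for a given postfix can be derived from the log lines
shown in this example. The helper below is a hypothetical sketch: the naming
scheme is inferred from those log lines only, not taken from the Polly sources,
and the function name <em>jscop_path</em> is made up for illustration.</p>

```python
def jscop_path(function, entry, exit_block, postfix="", directory="."):
    """Build the jscop file name as it appears in the 'Reading JScop' lines."""
    name = f"{function}___%{entry}---%{exit_block}.jscop"
    if postfix:
        # The postfix is appended after a dot, e.g. '.jscop.interchanged'.
        name += "." + postfix
    return f"{directory}/{name}"

print(jscop_path("main", "for.cond", "for.end30", "interchanged"))
```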
<p>We apply three different transformations to the SCoP in the main function.
The jscop files describing these transformations are handwritten (and available
in <em>www/experiments/matmul</em>).</p>
235 <h5>No Polly</h5>
237 <p>As a baseline we do not call any Polly code generation, but only apply the
238 normal -O3 optimizations.</p>
240 <pre class="code">
241 opt matmul.preopt.ll -basicaa \
242 -polly-import-jscop \
243 -polly-ast -analyze
244 </pre>
245 <pre>
246 [...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6&lt;=1535;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
256 [...]
257 </pre>
258 <h5>Interchange (and Fission to allow the interchange)</h5>
259 <p>We split the loops and can now apply an interchange of the loop dimensions that
260 enumerate Stmt_6.</p>
261 <pre class="code">
262 opt matmul.preopt.ll -basicaa \
263 -polly-import-jscop -polly-import-jscop-postfix=interchanged \
264 -polly-ast -analyze
265 </pre>
266 <pre>
267 [...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    for (c6=0;c6&lt;=1535;c6++) {
      Stmt_6(c2,c6,c4);
    }
  }
}
283 [...]
284 </pre>
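<p>That fission followed by interchange preserves the result can be checked on
a small instance. The sketch below is an illustration under assumptions: the
bodies of Stmt_4 (C[i][j] = 0) and Stmt_6 (C[i][j] += A[i][k] * B[k][j]) are
inferred from the access relations shown earlier, the bound N = 4 and the input
values are hypothetical, and integer data is used so that the comparison is exact.</p>

```python
N = 4  # hypothetical small bound; the example above uses 0..1535
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i * j + 1 for j in range(N)] for i in range(N)]

def matmul_original():
    C = [[None] * N for _ in range(N)]
    for c2 in range(N):
        for c4 in range(N):
            C[c2][c4] = 0                          # Stmt_4(c2, c4)
            for c6 in range(N):
                C[c2][c4] += A[c2][c6] * B[c6][c4]  # Stmt_6(c2, c4, c6)
    return C

def matmul_interchanged():
    C = [[None] * N for _ in range(N)]
    # First loop nest after fission: only the initialization.
    for c2 in range(N):
        for c4 in range(N):
            C[c2][c4] = 0                          # Stmt_4(c2, c4)
    # Second loop nest: the two inner dimensions of Stmt_6 are swapped.
    for c2 in range(N):
        for c4 in range(N):
            for c6 in range(N):
                i, j, k = c2, c6, c4               # Stmt_6(c2, c6, c4)
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul_original() == matmul_interchanged())
```

<p>After the interchange the innermost loop runs over j, so under the assumed
statement body consecutive iterations touch neighbouring elements of one row of
B and of C.</p>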
285 <h5>Interchange + Tiling</h5>
<p>In addition to the interchange we now tile the second loop nest.</p>
288 <pre class="code">
289 opt matmul.preopt.ll -basicaa \
290 -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
291 -polly-ast -analyze
292 </pre>
293 <pre>
294 [...]
295 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
296 [...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2+=64) {
  for (c3=0;c3&lt;=1535;c3+=64) {
    for (c4=0;c4&lt;=1535;c4+=64) {
      for (c5=c2;c5&lt;=c2+63;c5++) {
        for (c6=c4;c6&lt;=c4+63;c6++) {
          for (c7=c3;c7&lt;=c3+63;c7++) {
            Stmt_6(c5,c7,c6);
          }
        }
      }
    }
  }
}
316 [...]
317 </pre>
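<p>Tiling only changes the order in which the instances of Stmt_6 are visited;
every instance is still executed exactly once. The sketch below mirrors the
structure of the tiled loop nest with hypothetical small values (N = 8, tile
size T = 4 instead of 1536 and 64) and assumes, as in the example, that N is a
multiple of T.</p>

```python
N, T = 8, 4  # hypothetical sizes; the example above uses 1536 and 64

def tiled_order():
    """Instances of Stmt_6 in the order of the tiled loop nest."""
    order = []
    for c2 in range(0, N, T):                    # tile loops
        for c3 in range(0, N, T):
            for c4 in range(0, N, T):
                for c5 in range(c2, c2 + T):     # point loops inside one tile
                    for c6 in range(c4, c4 + T):
                        for c7 in range(c3, c3 + T):
                            order.append((c5, c7, c6))  # Stmt_6(c5, c7, c6)
    return order

# The untiled iteration domain in lexicographic order.
domain = [(i0, i1, i2)
          for i0 in range(N) for i1 in range(N) for i2 in range(N)]

order = tiled_order()
print(sorted(order) == domain)
```

<p>The first T*T*T entries of <em>order</em> all lie in the tile that starts at
(0, 0, 0), which is the block of data the tiled code works on before it moves on
to the next tile.</p>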
318 <h5>Interchange + Tiling + Strip-mining to prepare vectorization</h5>
To allow vectorization later on, we create a so-called trivially parallelizable
loop. It is innermost, parallel and has only four iterations, so it can be
replaced by 4-element SIMD instructions.
322 <pre class="code">
323 opt matmul.preopt.ll -basicaa \
324 -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
325 -polly-ast -analyze </pre>
327 <pre>
328 [...]
329 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
330 [...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2+=64) {
  for (c3=0;c3&lt;=1535;c3+=64) {
    for (c4=0;c4&lt;=1535;c4+=64) {
      for (c5=c2;c5&lt;=c2+63;c5++) {
        for (c6=c4;c6&lt;=c4+63;c6++) {
          for (c7=c3;c7&lt;=c3+63;c7+=4) {
            for (c8=c7;c8&lt;=c7+3;c8++) {
              Stmt_6(c5,c8,c6);
            }
          }
        }
      }
    }
  }
}
352 [...]
353 </pre>
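<p>Strip-mining splits one loop into a loop over chunks and a short loop inside
each chunk. The sketch below applies this to a single row update, assuming the
statement body C[i][j] += A[i][k] * B[k][j] inferred from the access relations.
The function names are hypothetical, and the list slice only stands in for one
4-element SIMD operation; it is not what Polly's code generation emits.</p>

```python
VL = 4  # chunk length, matching the four iterations of the c8 loop

def row_update_scalar(c_row, a_ik, b_row):
    """One pass of the j loop, element by element."""
    out = list(c_row)
    for j in range(len(out)):
        out[j] += a_ik * b_row[j]
    return out

def row_update_strip_mined(c_row, a_ik, b_row):
    """The same pass, with the j loop split into chunks of VL elements."""
    out = list(c_row)
    for c7 in range(0, len(out), VL):        # outer loop: c7 += 4
        lanes = slice(c7, c7 + VL)           # inner loop: c8 = c7 .. c7+3
        # All lanes are independent, so they can be computed together.
        out[lanes] = [c + a_ik * b for c, b in zip(out[lanes], b_row[lanes])]
    return out

c_row = [1, 2, 3, 4, 5, 6, 7, 8]
b_row = [8, 7, 6, 5, 4, 3, 2, 1]
print(row_update_scalar(c_row, 3, b_row) == row_update_strip_mined(c_row, 3, b_row))
```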
355 </li>
<li><h4>Generate code for the SCoPs</h4>
<p>
This generates new code for the SCoPs detected by Polly.
If -polly-import-jscop is present, the transformations specified in the imported
jscop files are applied.</p>
362 <pre class="code">opt matmul.preopt.ll | opt -O3 &gt; matmul.normalopt.ll</pre>
363 <pre class="code">
364 opt -basicaa \
365 -polly-import-jscop -polly-import-jscop-postfix=interchanged \
366 -polly-codegen matmul.preopt.ll \
367 | opt -O3 &gt; matmul.polly.interchanged.ll</pre>
368 <pre>
369 Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
370 './init_array___%for.cond---%for.end19.jscop.interchanged'.
371 File could not be read: No such file or directory
372 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
373 './main___%for.cond---%for.end30.jscop.interchanged'.
374 </pre>
375 <pre class="code">
376 opt -basicaa \
377 -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
378 -polly-codegen matmul.preopt.ll \
379 | opt -O3 &gt; matmul.polly.interchanged+tiled.ll</pre>
380 <pre>
381 Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
382 './init_array___%for.cond---%for.end19.jscop.interchanged+tiled'.
383 File could not be read: No such file or directory
384 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
385 './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
386 </pre>
387 <pre class="code">
388 opt -basicaa \
389 -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
390 -polly-codegen -polly-vectorizer=polly matmul.preopt.ll \
391 | opt -O3 &gt; matmul.polly.interchanged+tiled+vector.ll</pre>
392 <pre>
393 Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
394 './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
395 File could not be read: No such file or directory
396 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
397 './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
398 </pre>
399 <pre class="code">
400 opt -basicaa \
401 -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
402 -polly-codegen -polly-vectorizer=polly -polly-parallel matmul.preopt.ll \
| opt -O3 &gt; matmul.polly.interchanged+tiled+vector+openmp.ll</pre>
404 <pre>
405 Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
406 './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
407 File could not be read: No such file or directory
408 Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
409 './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
410 </pre>
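<p>The four Polly invocations above differ only in the postfix and in two
optional flags. The following sketch assembles the argument lists for the first
'opt' call of each pipeline. It is a hypothetical helper, uses only the flags
that appear in this example, does not run anything, and assumes that Polly is
loaded into 'opt' as set up earlier.</p>

```python
def polly_codegen_cmd(postfix, vectorize=False, parallel=False,
                      source="matmul.preopt.ll"):
    """Return the argument list of the first 'opt' call in the pipeline."""
    cmd = ["opt", "-basicaa",
           "-polly-import-jscop",
           f"-polly-import-jscop-postfix={postfix}",
           "-polly-codegen"]
    if vectorize:
        cmd.append("-polly-vectorizer=polly")
    if parallel:
        cmd.append("-polly-parallel")
    cmd.append(source)
    return cmd

variants = [
    polly_codegen_cmd("interchanged"),
    polly_codegen_cmd("interchanged+tiled"),
    polly_codegen_cmd("interchanged+tiled+vector", vectorize=True),
    polly_codegen_cmd("interchanged+tiled+vector", vectorize=True, parallel=True),
]
for cmd in variants:
    print(" ".join(cmd))
```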
412 <li><h4>Create the executables</h4>
Create one executable optimized with plain -O3 as well as a set of executables
optimized in different ways with Polly. The first changes only the loop structure,
the second adds tiling, the third adds vectorization, and the last one additionally
uses OpenMP parallelism.
418 <pre class="code">
419 llc matmul.normalopt.ll -o matmul.normalopt.s &amp;&amp; \
420 gcc matmul.normalopt.s -o matmul.normalopt.exe
421 llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s &amp;&amp; \
422 gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
423 llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s &amp;&amp; \
424 gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
425 llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s &amp;&amp; \
426 gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
427 llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s &amp;&amp; \
gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe </pre>
430 <li><h4>Compare the runtime of the executables</h4>
Comparing the runtimes of the different executables shows that a simple
loop interchange gives the largest performance boost here. Adding
vectorization and OpenMP parallelism improves the performance
significantly further.
436 <pre class="code">time ./matmul.normalopt.exe</pre>
437 <pre>42.68 real, 42.55 user, 0.00 sys</pre>
438 <pre class="code">time ./matmul.polly.interchanged.exe</pre>
439 <pre>04.33 real, 4.30 user, 0.01 sys</pre>
440 <pre class="code">time ./matmul.polly.interchanged+tiled.exe</pre>
441 <pre>04.11 real, 4.10 user, 0.00 sys</pre>
442 <pre class="code">time ./matmul.polly.interchanged+tiled+vector.exe</pre>
443 <pre>01.39 real, 1.36 user, 0.01 sys</pre>
444 <pre class="code">time ./matmul.polly.interchanged+tiled+vector+openmp.exe</pre>
445 <pre>00.66 real, 2.58 user, 0.02 sys</pre>
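<p>The speedups implied by these measurements can be computed from the 'real'
times. The sketch below uses only the numbers reported above; the dictionary
keys are shorthand for the executable names.</p>

```python
# Wall-clock ('real') times in seconds, as reported above.
real_seconds = {
    "normalopt": 42.68,
    "interchanged": 4.33,
    "interchanged+tiled": 4.11,
    "interchanged+tiled+vector": 1.39,
    "interchanged+tiled+vector+openmp": 0.66,
}

baseline = real_seconds["normalopt"]
speedup = {name: baseline / t for name, t in real_seconds.items()}

for name, factor in speedup.items():
    print(f"{name:34s} {factor:5.1f}x")
```

<p>Relative to plain -O3, the interchange alone gives a speedup of roughly 9.9x,
and the fully optimized OpenMP version roughly 65x.</p>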
446 </li>
447 </ol>
449 </div>
450 </div>
451 </body>
452 </html>