README

Requirements:

- automake, autoconf, libtool
        (not needed when compiling a release)
- pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config)
        (not needed when compiling a release using the included isl and pet)
- gmp (http://gmplib.org/)
- libyaml (http://pyyaml.org/wiki/LibYAML)
        (only needed if you want to compile the pet executable)
- LLVM/clang libraries, 2.9 or higher (http://clang.llvm.org/get_started.html)
        Unless you have some other reasons for wanting to use the svn version,
        it is best to install the latest release (3.9).
        For more details, see pet/README.

If you are installing on Ubuntu, then you can install the following packages:

automake autoconf libtool pkg-config libgmp3-dev libyaml-dev libclang-dev llvm

Note that you need at least version 3.2 of libclang-dev (Ubuntu raring).
Older versions of this package did not include the required libraries.
If you are using an older version of Ubuntu, then you need to compile and
install LLVM/clang from source.


Preparing:

Grab the latest release and extract it, or get the source from
the git repository as follows.  Building from the git repository
requires autoconf, automake, libtool and pkg-config.

        git clone git://repo.or.cz/ppcg.git
        cd ppcg
        ./get_submodules.sh
        ./autogen.sh


Compilation:

        ./configure
        make
        make check

If you have installed any of the required libraries in a non-standard
location, then you may need to use the --with-gmp-prefix,
--with-libyaml-prefix and/or --with-clang-prefix options
when calling "./configure".


Using PPCG to generate CUDA or OpenCL code

To convert a fragment of a C program to CUDA or OpenCL, insert a line
containing

        #pragma scop

before the fragment and add a line containing

        #pragma endscop

after the fragment.  To generate CUDA code run

        ppcg --target=cuda file.c

where file.c is the file containing the fragment.  The generated
code is stored in file_host.cu and file_kernel.cu.

To generate OpenCL code run

        ppcg --target=opencl file.c

where file.c is the file containing the fragment.  The generated code
is stored in file_host.c and file_kernel.cl.

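As a minimal sketch of what an annotated input file might look like,
the following hypothetical fragment marks a simple loop with the two
pragmas.  The function name saxpy, the size N and the loop body are
assumptions made for this illustration and do not come from the PPCG
distribution.  A regular C compiler ignores the unknown pragmas
(possibly with a warning), so the same file can still be compiled
for the host.

```c
#define N 1024    /* hypothetical array size */

/* Hypothetical input fragment: y = a * x + y over N elements.
 * The loop between the two pragmas is the fragment to be converted. */
void saxpy(float a, float x[N], float y[N])
{
#pragma scop
    for (int i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];
#pragma endscop
}
```

If this were saved as file.c, the commands shown above would be
applied to it unchanged.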

Specifying tile, grid and block sizes

The iteration space tile size, grid size and block size can
be specified using the --sizes option.  The argument is a union map
in isl notation mapping kernels identified by their sequence number
in a "kernel" space to singleton sets in the "tile", "grid" and "block"
spaces.  The sizes are specified outermost to innermost.

The dimension of the "tile" space indicates the (maximal) number of loop
dimensions to tile.  The elements of the single integer tuple
specify the tile sizes in each dimension.
In case of hybrid tiling, the first element is half the size of
the tile in the time (sequential) dimension.  The second element
specifies the number of elements in the base of the hexagon.
The remaining elements specify the tile sizes in the remaining space
dimensions.

The dimension of the "grid" space indicates the (maximal) number of block
dimensions in the grid.  The elements of the single integer tuple
specify the number of blocks in each dimension.

The dimension of the "block" space indicates the (maximal) number of thread
dimensions in the block.  The elements of the single integer tuple
specify the number of threads in each dimension.

For example,

    { kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }

specifies that in kernel 0, two loops should be tiled with a tile
size of 64 in both dimensions and that all kernels except kernel 4
should be run using a block of 16 threads.

Since PPCG performs some scheduling, it can be difficult to predict
what exactly will end up in a kernel.  If you want to specify
tile, grid or block sizes, you may want to run PPCG first with the defaults,
examine the kernels and then run PPCG again with the desired sizes.
Instead of examining the kernels, you can also specify the option
--dump-sizes on the first run to obtain the effectively used default sizes.

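To make the meaning of a tile size more concrete, the following C sketch
shows what tiling two loops with a size of 64 in both dimensions (as in
tile[64,64]) amounts to: the iteration space is cut into blocks of
64 by 64 points that are visited one after the other.  This is a
hand-written illustration of the tiling concept only, not the code that
PPCG generates.  The names M, TILE, min_int, touch_untiled and
touch_tiled are made up for this example.

```c
#define M 100      /* hypothetical problem size, not a multiple of TILE */
#define TILE 64    /* tile size in both dimensions, as in tile[64,64] */

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Original loop nest: visit every point of the M x M iteration space. */
void touch_untiled(int count[M][M])
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < M; ++j)
            count[i][j] += 1;
}

/* Tiled loop nest: the two outer loops enumerate the tiles,
 * the two inner loops enumerate the points inside one tile.
 * The min_int() calls clip partial tiles at the boundary. */
void touch_tiled(int count[M][M])
{
    for (int ti = 0; ti < M; ti += TILE)
        for (int tj = 0; tj < M; tj += TILE)
            for (int i = ti; i < min_int(ti + TILE, M); ++i)
                for (int j = tj; j < min_int(tj + TILE, M); ++j)
                    count[i][j] += 1;
}
```

Both functions visit each point exactly once; only the order in which
the points are visited differs.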

Compiling the generated CUDA code with nvcc

To get optimal performance from nvcc, it is important to choose --arch
according to your target GPU.  Specifically, use the flag "--arch sm_20"
for Fermi, "--arch sm_30" for GK10x Kepler and "--arch sm_35" for
GK110 Kepler.  We discourage the use of older cards as we have seen
correctness issues with compilation for older architectures.
Note that in the absence of any --arch flag, nvcc defaults to
"--arch sm_13".  This will not only be slower, but can also cause
correctness issues.
If you want to obtain results that are identical to those obtained
by the original code, then you may need to disable some optimizations
by passing the "--fmad=false" option.


Compiling the generated OpenCL code with gcc

To compile the host code you need to link against the file
ocl_utilities.c which contains utility functions used by the generated
OpenCL host code.  To compile the host code with gcc, run

  gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL

Note that we have experienced the generated OpenCL code freezing
on some inputs (e.g., the PolyBench symm benchmark) when using
at least some versions of the Nvidia OpenCL library, while the
corresponding CUDA code runs fine.
We have experienced no such freezes when using AMD, ARM or Intel
OpenCL libraries.

By default, the compiled executable will need the _kernel.cl file at
run time.  Alternatively, the option --opencl-embed-kernel-code may be
given to place the kernel code in a string literal.  The kernel code is
then compiled into the host binary, such that the _kernel.cl file is no
longer needed at run time.  Any kernel include files, in particular
those supplied using --opencl-include-file, will still be required at
run time.


Function calls

Function calls inside the analyzed fragment are reproduced
in the CUDA or OpenCL code, but for now it is left to the user
to make sure that the functions that are being called are
available from the generated kernels.

In the case of OpenCL code, the --opencl-include-file option
may be used to specify one or more files to be #include'd
from the generated code.  These files may then contain
the definitions of the functions being called from the
program fragment.  If the pathnames of the included files
are relative to the current directory, then you may need
to additionally specify the option "--opencl-compiler-options=-I."
to make sure that the files can be found by the OpenCL compiler.
The included files may contain definitions of types used by the
generated kernels.  By default, PPCG generates definitions for
types as needed, but these definitions may collide with those in
the included files, as PPCG does not consider the contents of the
included files.  The --no-opencl-print-kernel-types option prevents
PPCG from generating type definitions.


GNU extensions

By default, PPCG may print out macro definitions that involve
GNU extensions such as __typeof__ and statement expressions.
Some compilers may not support these extensions.
In particular, OpenCL 1.2 beignet 1.1.1 (git-6de6918)
has been reported not to support __typeof__.
The use of these extensions can be turned off with the
--no-allow-gnu-extensions option.

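For readers unfamiliar with these extensions, the sketch below contrasts
a macro written with __typeof__ and a statement expression with a plain
ISO C alternative.  Both macros and both wrapper functions are
hypothetical and only illustrate the language features involved; they
are not copied from PPCG output.  The first macro needs a compiler that
supports GNU extensions (such as gcc or clang), while the second one
avoids them at the cost of evaluating its arguments more than once.

```c
/* GNU C version: ({ ... }) is a statement expression and
 * __typeof__ yields the type of its operand.
 * Each argument is evaluated exactly once. */
#define MIN_GNU(x, y) \
    ({ __typeof__(x) _x = (x); __typeof__(y) _y = (y); _x < _y ? _x : _y; })

/* ISO C version: no extensions, but an argument with side effects
 * (e.g. i++) may be evaluated twice. */
#define MIN_ISO(x, y) ((x) < (y) ? (x) : (y))

int min_gnu_int(int a, int b)
{
    return MIN_GNU(a, b);
}

int min_iso_int(int a, int b)
{
    return MIN_ISO(a, b);
}
```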

Processing PolyBench

When processing a PolyBench/C 3.2 benchmark, you should always specify
-DPOLYBENCH_USE_C99_PROTO on the ppcg command line.  Otherwise, the source
files are inconsistent, having fixed-size arrays but parametrically
bounded loops iterating over them.
However, you should not specify this define when compiling
the PPCG generated code using nvcc since CUDA does not support VLAs.

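The kind of inconsistency described here can be illustrated with a small
hand-written example.  The two functions below are hypothetical and are
not taken from PolyBench: the first declares its array with a fixed size
while the loops are bounded by the parameter n, and the second uses
a C99 variable-length array prototype so that the array size and the
loop bounds agree.

```c
#define NMAX 8    /* hypothetical fixed array size */

/* Fixed-size array, parametric loop bounds:
 * nothing ties n to the declared size NMAX. */
void scale_fixed(int n, double A[NMAX][NMAX])
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] *= 2.0;
}

/* C99 prototype: the array is declared in terms of n,
 * so the declaration is consistent with the loops. */
void scale_c99(int n, double A[n][n])
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] *= 2.0;
}
```

The second form relies on a variable-length array parameter, which is
the feature that is not available when compiling with nvcc.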

CUDA and function overloading

While CUDA supports function overloading based on the argument types,
no such function overloading exists in the input language C.  Since PPCG
simply prints out the same function name as in the original code, this
may result in a different function being called based on the types
of the arguments.  For example, if the original code contains a call
to the function sqrt() with a float argument, then the argument will
be promoted to a double and the sqrt() function will be called.
In the transformed (CUDA) code, however, overloading will cause the
function sqrtf() to be called.  Until this issue has been resolved in PPCG,
we recommend that users either explicitly call the function sqrtf() or
explicitly cast the argument to double in the input code.


Contact

For bug reports, feature requests and questions,
contact http://groups.google.com/group/isl-development

Whenever you report a bug, please mention the exact version of PPCG
that you are using (output of "./ppcg --version").  If you are unable
to compile PPCG, then report the git version (output of "git describe")
or the version number included in the name of the tarball.


Citing PPCG

If you use PPCG for your research, you are invited to cite
the following paper.

@article{Verdoolaege2013PPCG,
    author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
                G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
                Catthoor, Francky},
    title = {Polyhedral parallel code generation for CUDA},
    journal = {ACM Trans. Archit. Code Optim.},
    issue_date = {January 2013},
    volume = {9},
    number = {4},
    month = jan,
    year = {2013},
    issn = {1544-3566},
    pages = {54:1--54:23},
    doi = {10.1145/2400682.2400713},
    acmid = {2400713},
    publisher = {ACM},
    address = {New York, NY, USA},
}