README

Requirements:

- automake, autoconf, libtool
        (not needed when compiling a release)
- pkg-config (http://www.freedesktop.org/wiki/Software/pkg-config)
        (not needed when compiling a release using the included isl and pet)
- gmp (http://gmplib.org/)
- libyaml (http://pyyaml.org/wiki/LibYAML)
        (only needed if you want to compile the pet executable)
- LLVM/clang libraries, 2.9 or higher (http://clang.llvm.org/get_started.html)
        Unless you have some other reasons for wanting to use the svn version,
        it is best to install the latest supported release.
        For more details, including the latest supported release,
        see pet/README.

If you are installing on Ubuntu, then you can install the following packages:

automake autoconf libtool pkg-config libgmp3-dev libyaml-dev libclang-dev llvm

Note that you need at least version 3.2 of libclang-dev (Ubuntu raring).
Older versions of this package did not include the required libraries.
If you are using an older version of Ubuntu, then you need to compile and
install LLVM/clang from source.


Preparing:

Grab the latest release and extract it or get the source from
the git repository as follows.  This process requires autoconf,
automake, libtool and pkg-config.

        git clone git://repo.or.cz/ppcg.git
        cd ppcg
        ./get_submodules.sh
        ./autogen.sh


Compilation:

        ./configure
        make
        make check

If you have installed any of the required libraries in a non-standard
location, then you may need to use the --with-gmp-prefix,
--with-libyaml-prefix and/or --with-clang-prefix options
when calling "./configure".


Using PPCG to generate CUDA or OpenCL code

To convert a fragment of a C program to CUDA, insert a line containing

        #pragma scop

before the fragment and add a line containing

        #pragma endscop

after the fragment.  To generate CUDA code run

        ppcg --target=cuda file.c

where file.c is the file containing the fragment.  The generated
code is stored in file_host.cu and file_kernel.cu.

To generate OpenCL code run

        ppcg --target=opencl file.c

where file.c is the file containing the fragment.  The generated code
is stored in file_host.c and file_kernel.cl.
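
As a minimal sketch of such an input, the following hypothetical file.c
marks a single loop as the fragment to be converted.  The function name
vector_add and the size N are assumptions made for illustration only.

```c
/* file.c: hypothetical input for "ppcg --target=cuda file.c". */
#define N 1024

/* Adds b to a element-wise.  The loop between the two pragmas is
 * the fragment that PPCG is asked to convert. */
void vector_add(float a[N], float b[N])
{
#pragma scop
    for (int i = 0; i < N; ++i)
        a[i] = a[i] + b[i];
#pragma endscop
}
```

A regular C compiler ignores the two pragmas (possibly with a warning),
so the same file can still be compiled unchanged for reference runs.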


Specifying tile, grid and block sizes

The iteration space tile size, grid size and block size can
be specified using the --sizes option.  The argument is a union map
in isl notation mapping kernels identified by their sequence number
in a "kernel" space to singleton sets in the "tile", "grid" and "block"
spaces.  The sizes are specified outermost to innermost.

The dimension of the "tile" space indicates the (maximal) number of loop
dimensions to tile.  The elements of the single integer tuple
specify the tile sizes in each dimension.
In case of hybrid tiling, the first element is half the size of
the tile in the time (sequential) dimension.  The second element
specifies the number of elements in the base of the hexagon.
The remaining elements specify the tile sizes in the remaining space
dimensions.

The dimension of the "grid" space indicates the (maximal) number of block
dimensions in the grid.  The elements of the single integer tuple
specify the number of blocks in each dimension.

The dimension of the "block" space indicates the (maximal) number of thread
dimensions in the block.  The elements of the single integer tuple
specify the number of threads in each dimension.

For example,

    { kernel[0] -> tile[64,64]; kernel[i] -> block[16] : i != 4 }

specifies that in kernel 0, two loops should be tiled with a tile
size of 64 in both dimensions and that all kernels except kernel 4
should be run using a block of 16 threads.
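
To give a feel for what a tile size of 64 in both dimensions means,
the following hand-written sketch tiles a two-dimensional loop nest.
It is not PPCG output: the function name transpose_tiled and the sizes
are assumptions, and the code generated by PPCG will look different.

```c
#define N 256
#define TILE 64    /* corresponds to each entry of tile[64,64] */

/* Hand-tiled version of "for i, for j: out[j][i] = in[i][j]".
 * The two outer loops enumerate the tiles; the two inner loops
 * visit the 64x64 iterations inside one tile.
 * Assumes that N is a multiple of TILE. */
void transpose_tiled(float out[N][N], float in[N][N])
{
    for (int ti = 0; ti < N; ti += TILE)
        for (int tj = 0; tj < N; tj += TILE)
            for (int i = ti; i < ti + TILE; ++i)
                for (int j = tj; j < tj + TILE; ++j)
                    out[j][i] = in[i][j];
}
```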

Since PPCG performs some scheduling, it can be difficult to predict
what exactly will end up in a kernel.  If you want to specify
tile, grid or block sizes, you may want to run PPCG first with the defaults,
examine the kernels and then run PPCG again with the desired sizes.
Instead of examining the kernels, you can also specify the option
--dump-sizes on the first run to obtain the effectively used default sizes.


Compiling the generated CUDA code with nvcc

To get optimal performance from nvcc, it is important to choose --arch
according to your target GPU.  Specifically, use the flag "--arch sm_20"
for Fermi, "--arch sm_30" for GK10x Kepler and "--arch sm_35" for
GK110 Kepler.  We discourage the use of older cards as we have seen
correctness issues with compilation for older architectures.
Note that in the absence of any --arch flag, nvcc defaults to
"--arch sm_13".  This will not only be slower, but can also cause
correctness issues.
If you want to obtain results that are identical to those obtained
by the original code, then you may need to disable some optimizations
by passing the "--fmad=false" option.


Compiling the generated OpenCL code with gcc

To compile the host code you need to link against the file
ocl_utilities.c which contains utility functions used by the generated
OpenCL host code.  To compile the host code with gcc, run

  gcc -std=c99 file_host.c ocl_utilities.c -lOpenCL

Note that we have experienced the generated OpenCL code freezing
on some inputs (e.g., the PolyBench symm benchmark) when using
at least some versions of the NVIDIA OpenCL library, while the
corresponding CUDA code runs fine.
We have experienced no such freezes when using AMD, ARM or Intel
OpenCL libraries.

By default, the compiled executable will need the _kernel.cl file at
run time.  Alternatively, the option --opencl-embed-kernel-code may be
given to place the kernel code in a string literal.  The kernel code is
then compiled into the host binary, such that the _kernel.cl file is no
longer needed at run time.  Any kernel include files, in particular
those supplied using --opencl-include-file, will still be required at
run time.


Function calls

Function calls inside the analyzed fragment are reproduced
in the CUDA or OpenCL code, but for now it is left to the user
to make sure that the functions that are being called are
available from the generated kernels.

In the case of OpenCL code, the --opencl-include-file option
may be used to specify one or more files to be #include'd
from the generated code.  These files may then contain
the definitions of the functions being called from the
program fragment.  If the pathnames of the included files
are relative to the current directory, then you may need
to additionally specify the option "--opencl-compiler-options=-I."
to make sure that the files can be found by the OpenCL compiler.
The included files may contain definitions of types used by the
generated kernels.  By default, PPCG generates definitions for
types as needed, but these definitions may collide with those in
the included files, as PPCG does not consider the contents of the
included files.  The --no-opencl-print-kernel-types option prevents
PPCG from generating type definitions.


GNU extensions

By default, PPCG may print out macro definitions that involve
GNU extensions such as __typeof__ and statement expressions.
Some compilers may not support these extensions.
In particular, OpenCL 1.2 beignet 1.1.1 (git-6de6918)
has been reported not to support __typeof__.
The use of these extensions can be turned off with the
--no-allow-gnu-extensions option.
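
The kind of macro meant here can be sketched as follows.  This is
an illustration only: the macro names min_gnu and min_plain and
the helper clamp_index are assumptions, and the exact definitions
printed by PPCG may differ.

```c
/* Uses two GNU extensions: __typeof__ and a statement expression
 * "({ ... })".  Each argument is evaluated exactly once. */
#define min_gnu(x, y)                   \
    ({ __typeof__(x) _x = (x);          \
       __typeof__(y) _y = (y);          \
       _x < _y ? _x : _y; })

/* Plain C alternative: portable, but one of the arguments
 * is evaluated twice. */
#define min_plain(x, y) ((x) < (y) ? (x) : (y))

/* Hypothetical helper: clamps index i to the last valid
 * position of an array of n elements. */
int clamp_index(int i, int n)
{
    return min_gnu(i, n - 1);
}
```

The first form needs a compiler that accepts GNU extensions, such as
gcc or clang; the second form only relies on standard C.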


Processing PolyBench

When processing a PolyBench/C 3.2 benchmark, you should always specify
-DPOLYBENCH_USE_C99_PROTO on the ppcg command line.  Otherwise, the source
files are inconsistent, having fixed size arrays but parametrically
bounded loops iterating over them.
However, you should not specify this define when compiling
the PPCG-generated code using nvcc since CUDA does not support VLAs.
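
The inconsistency can be pictured with two hand-written prototypes.
They are not copied from PolyBench; the function names and the size N
are assumptions chosen for illustration.

```c
#define N 8    /* hypothetical compile-time dataset size */

/* Fixed size array, but loops bounded by the run-time parameter n:
 * the declared array size and the loop bounds do not agree. */
void scale_fixed(int n, double A[N][N], double alpha)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] *= alpha;
}

/* C99 prototype with a variable length array (VLA) parameter:
 * the array size and the loop bounds are both expressed in n. */
void scale_c99(int n, double A[n][n], double alpha)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            A[i][j] *= alpha;
}
```

The second form is the consistent one for analysis, but its VLA
parameter is exactly what cannot be compiled as CUDA code.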


CUDA and function overloading

While CUDA supports function overloading based on the argument types,
no such function overloading exists in the input language C.  Since PPCG
simply prints out the same function name as in the original code, this
may result in a different function being called based on the types
of the arguments.  For example, if the original code contains a call
to the function sqrt() with a float argument, then the argument will
be promoted to a double and the sqrt() function will be called.
In the transformed (CUDA) code, however, overloading will cause the
function sqrtf() to be called.  Until this issue has been resolved in PPCG,
we recommend that users either explicitly call the function sqrtf() or
explicitly cast the argument to double in the input code.


Contact

For bug reports, feature requests and questions,
contact http://groups.google.com/group/isl-development

Whenever you report a bug, please mention the exact version of PPCG
that you are using (output of "./ppcg --version").  If you are unable
to compile PPCG, then report the git version (output of "git describe")
or the version number included in the name of the tarball.


Citing PPCG

If you use PPCG for your research, you are invited to cite
the following paper.

@article{Verdoolaege2013PPCG,
    author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
                G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
                Catthoor, Francky},
    title = {Polyhedral parallel code generation for CUDA},
    journal = {ACM Trans. Archit. Code Optim.},
    issue_date = {January 2013},
    volume = {9},
    number = {4},
    month = jan,
    year = {2013},
    issn = {1544-3566},
    pages = {54:1--54:23},
    doi = {10.1145/2400682.2400713},
    acmid = {2400713},
    publisher = {ACM},
    address = {New York, NY, USA},
}