To better utilize multicore chips, it has become increasingly popular to
adopt shared-memory multithreading programming methods to exploit parallelism
on a node. For example, in hybrid MPI programs, OpenMP is the most popular
choice. When launching such hybrid programs, users have to make sure there are
spare physical cores allocated to the shared-memory multithreading runtime.
Otherwise, because the two independent runtime systems are not coordinated,
the runtime that handles distributed-memory programming may suffer from
resource contention. If spare cores are allocated, in the same way as when
launching an MPI+OpenMP hybrid program, \charmpp{} will work well with any
shared-memory parallel programming language (e.g. OpenMP). As with ordinary
OpenMP applications, the number of threads used in the OpenMP parts of the
program can be controlled with the {\tt OMP\_NUM\_THREADS} environment
variable. See Sec.~\ref{charmrun} for details on how to propagate such
environment variables.
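
For instance, a hybrid run might be launched as sketched below. The binary
name and core counts are placeholders, and whether the environment variable
reaches all processes depends on the launcher; see Sec.~\ref{charmrun} for
the propagation mechanism appropriate to your machine layer.
\begin{verbatim}
# Run on 8 PEs with 4 OpenMP threads per process
# (illustrative only; propagation depends on the launcher):
OMP_NUM_THREADS=4 ./charmrun +p8 ./hybrid_app
\end{verbatim}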

If no spare cores are allocated, then to avoid resource contention a
\emph{unified runtime} is needed that supports both intra-node shared-memory
multithreading parallelism and inter-node distributed-memory
message-passing parallelism. Additionally, a parallel application may have
only a small fraction of its critical computation that is suitable for
shared-memory parallelism (though the savings on critical computation may
also reduce the communication cost, leading to further performance
improvement). In that case, dedicating physical cores on every node to the
shared-memory multithreading runtime wastes computational power, because
those dedicated cores sit idle during most of the application's execution.
This also indicates the necessity of a unified runtime supporting both types
of parallelism.

The \emph{CkLoop} library is an add-on to the \charmpp{} runtime that achieves
such a unified runtime. The library implements a simple OpenMP-like shared-memory
multithreading runtime that reuses \charmpp{} PEs to perform tasks spawned by
the multithreading runtime. This library targets the SMP mode of \charmpp{}.

The \emph{CkLoop} library is built in
\$CHARM\_DIR/\$MACH\_LAYER/tmp/libs/ck-libs/ckloop by executing ``make''.
To use it in an application, include ``CkLoopAPI.h'' in
the source code. The interface functions of this library are as
follows:

\begin{itemize}
\item CProxy\_FuncCkLoop \textbf{CkLoop\_Init}(int
numThreads=0): This function initializes the CkLoop library, and it only needs
to be called once on a single PE during the initialization phase of the
application. The argument ``numThreads'' is only used in non-SMP mode,
specifying the number of threads to be created for single-node shared-memory
parallelism. It is ignored in SMP mode.

\item void \textbf{CkLoop\_Exit}(CProxy\_FuncCkLoop ckLoop): This function is
intended to be used in non-SMP mode, as it frees the resources
(e.g. terminates the spawned threads) used by the CkLoop library. It should
be called on just one PE.

\item void \textbf{CkLoop\_Parallelize}( \\
HelperFn func, /* the function that finishes partial work on another thread */ \\
int paramNum, /* the number of parameters for func */ \\
void *param, /* the input parameters for the above func */ \\
int numChunks, /* number of chunks to be partitioned */ \\
int lowerRange, /* lower range of the loop-like parallelization [lowerRange, upperRange] */ \\
int upperRange, /* upper range of the loop-like parallelization [lowerRange, upperRange] */ \\
int sync=1, /* toggle implicit barrier after each parallelized loop */ \\
void *redResult=NULL, /* the reduction result, ONLY SUPPORTS a SINGLE VAR of TYPE int/float/double */ \\
REDUCTION\_TYPE type=CKLOOP\_NONE, /* type of the reduction result */ \\
CallerFn cfunc=NULL, /* caller PE will call this function before ckloop is done and before starting to work on its chunks */ \\
int cparamNum=0, void *cparam=NULL /* the input parameters to the above function */ \\
) \\
``HelperFn'' is defined as ``typedef void (*HelperFn)(int first, int last, void *result, int paramNum, void *param);''
where ``result'' is the buffer for the reduction result on a single simple-type variable.
``CallerFn'' is defined as ``typedef void (*CallerFn)(int paramNum, void *param);''
A usage sketch follows this list.
\end{itemize}
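
The following is a minimal sketch of how these functions fit together,
parallelizing a sum over an array with a double-typed reduction. The reduction
constant CKLOOP\_DOUBLE\_SUM and the surrounding setup (the array ``data'' and
its length ``N'') are assumptions for illustration; see \examplerefdir{ckloop}
for complete, authoritative examples.
\begin{verbatim}
#include "CkLoopAPI.h"

/* Matches the HelperFn signature: sums data[first..last] into *result.
   The array arrives through the 'param' pointer. */
void sumHelper(int first, int last, void *result, int paramNum, void *param) {
  double *data = (double *)param;
  double partial = 0.0;
  for (int i = first; i <= last; i++) partial += data[i];
  *(double *)result = partial; /* partial result for this chunk */
}

/* Once, during startup, on a single PE: */
CProxy_FuncCkLoop ckLoopProxy = CkLoop_Init();

/* Later, e.g. from a chare's entry method ('data' and 'N' assumed to exist): */
double total = 0.0;
CkLoop_Parallelize(sumHelper, 1, (void *)data,
                   8 /* numChunks */, 0, N - 1,
                   1 /* sync */, &total, CKLOOP_DOUBLE_SUM);
\end{verbatim}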

Examples using this library can be found in \examplerefdir{ckloop} and in the
widely used molecular dynamics simulation application
NAMD\footnote{http://www.ks.uiuc.edu/Research/namd}.

\section{Charm++/Converse Runtime Scheduler Integrated OpenMP}
The compiler-provided OpenMP runtime library can work with Charm++, but it
creates its own thread pool, so Charm++ and OpenMP can suffer from
oversubscription. The integrated OpenMP runtime library parallelizes OpenMP
regions in each chare and runs on the Charm++ runtime without
oversubscription. The integrated runtime creates OpenMP user-level threads,
which can migrate among PEs within a node. This fine-grained parallelism helps
resolve load imbalance within a node: when PEs become idle, they help other
busy PEs within the node via work-stealing.
\subsection{Instructions to build and use the integrated OpenMP library}
\subsubsection{Instructions to build}
The OpenMP library can be built with the `omp' keyword and any SMP version of
Charm++, including the multicore build, when you build Charm++ or AMPI.\\
\begin{verbatim}
e.g.) $CHARM_DIR/build charm++ multicore-linux64 omp
      $CHARM_DIR/build charm++ netlrts-linux-x86_64 smp omp
\end{verbatim}
This library is based on the LLVM OpenMP runtime library, so it supports the
ABI used by the clang, icc, and gcc compilers.

The following compilers are verified to support this integrated library on
Linux:
\begin{itemize}
\item GCC: 4.8 or newer
\item ICC: 15.0 or newer
\item Clang: 3.7 or newer
\end{itemize}

You can use this integrated OpenMP with clang on IBM Bluegene machines without
special compilation flags (you don't need to add -fopenmp or -openmp with
clang on Bluegene).

On Linux, recent distributions install an OpenMP-capable version of clang by
default; for example, Ubuntu has shipped with clang newer than 3.7 since
15.10. Depending on which version of clang is installed in your working
environment, you may need to follow additional instructions to use this
integrated OpenMP with clang. The following instructions are for clang on
Ubuntu where the default clang is older than 3.7. If you want to use clang on
other Linux distributions, you can use their package managers to install clang
and the OpenMP library. This installation of clang adds headers for the OpenMP
environment routines and allows the OpenMP directives to be parsed. However,
on Ubuntu the installation of clang doesn't come with its own OpenMP runtime
library, which results in an error message saying that the compiler-provided
OpenMP library fails to link. That library is not needed by the integrated
OpenMP runtime, but you need to work around the error for your code to
compile. The following shows how to avoid the error.

\begin{verbatim}
/* When you want to compile the integrated OpenMP on Ubuntu where the
   pre-installed clang is older than 3.7, you can use the integrated OpenMP
   with the following instructions.
   e.g.) On Ubuntu 14.04, the default clang is version 3.4. */
sudo apt-get install clang-3.8    // you can use any version of clang >= 3.8
sudo ln -svT /usr/bin/clang-3.8 /usr/bin/clang
sudo ln -svT /usr/bin/clang++-3.8 /usr/bin/clang++

$(CHARM_DIR)/build charm++ multicore-linux64 clang omp --with-production -j8
echo '!<arch>' > $(CHARM_DIR)/lib/libomp.a  // dummy library to avoid the link error
\end{verbatim}

On Mac, the Apple-provided clang installed by default doesn't support OpenMP.
We're working on supporting this library on Mac with an OpenMP-enabled clang,
which can be downloaded and installed through Homebrew or MacPorts.
Currently, this integrated library is built and compiled on Mac with normal
GCC, which can be downloaded and installed via Homebrew or MacPorts. You
should set environment variables so that the Charm++ build script uses the
normal gcc installed from Homebrew or MacPorts. The following is an example
using Homebrew on Mac OS X 10.12.5.

\begin{verbatim}
/* Install Homebrew from https://brew.sh
 * Install gcc using 'brew' */
brew install gcc

/* gcc, g++ and other binaries are installed at /usr/local/Cellar/gcc/<version>/bin
 * You need to make symbolic links to the gcc binaries at /usr/local/bin
 * In this example, gcc 7.1.0 is installed in that directory. */
cd /usr/local/bin
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-7 gcc
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/g++-7 g++
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-nm-7 gcc-nm
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ranlib-7 gcc-ranlib
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ar-7 gcc-ar

/* Finally, set the PATH variable so that these binaries are found first
   by the build script. */
export PATH=/usr/local/bin:$PATH
\end{verbatim}

In addition, this library will be supported on Windows in the next release of Charm++.

\subsubsection{How to use the integrated OpenMP on Charm++}

To use this library in your applications, you have to add `-module OmpCharm'
to the compile flags to link this library instead of the compiler-provided
one. Without `-module OmpCharm', your application will use the
compiler-provided OpenMP library, which runs on its own separate runtime.
(You don't need to add `-fopenmp' or `-openmp' with gcc and icc; these flags
are included in the predefined compile options when you build Charm++ with
`omp'.) A hypothetical build is sketched below.
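
For example, compiling and linking might look like the following (the file
names are placeholders for illustration):
\begin{verbatim}
charmc -c myprogram.C
charmc -module OmpCharm myprogram.o -o myprogram
\end{verbatim}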

This integrated OpenMP adjusts the number of OpenMP instances on each chare,
so the number of OpenMP instances can change from one OpenMP region to the
next during execution. If your code shares a data structure among OpenMP
instances in a parallel region, you can set the size of the data structure
before the start of the OpenMP region with ``omp\_get\_max\_threads()''
and index into it within each OpenMP instance with ``omp\_get\_thread\_num()''.
After the OpenMP region, you can iterate over the data structure to combine
partial results with ``CmiGetCurKnownOmpThreads()'', which returns the number
of OpenMP threads for the latest OpenMP region on the PE where a chare is
running. The following example shows how to use shared data structures across
OpenMP regions on the integrated OpenMP with Charm++.
\begin{verbatim}
/* Maximum possible number of OpenMP threads in the upcoming OpenMP region.
   Users can restrict this number with 'omp_set_num_threads()' for each chare
   and with the environment variable 'OMP_NUM_THREADS' for all chares.
   By default, omp_get_max_threads() returns the number of PEs for each
   logical node. */
int maxAvailableThreads = omp_get_max_threads();
int *partialResult = new int[maxAvailableThreads]{0};

/* Partial sum for subsets of iterations assigned to each OpenMP thread.
   The integrated OpenMP runtime decides how many OpenMP threads to create
   with some internal heuristics. */
#pragma omp parallel for
for (int i = 0; i < 128; i++) {
  partialResult[omp_get_thread_num()] += i;
}

/* CmiGetCurKnownOmpThreads() tells how many OpenMP threads were created in
   the latest OpenMP region, so we can read each thread's partial result. */
for (int j = 0; j < CmiGetCurKnownOmpThreads(); j++)
  CkPrintf("partial sum of thread %d: %d \n", j, partialResult[j]);
\end{verbatim}

\subsection{Limitations of the current implementation}
\subsubsection{The lack of a barrier within an OpenMP region}
In the OpenMP standard, within an OpenMP region each worksharing construct
ends with an implicit barrier, so that the result of an earlier worksharing
construct can be used by later constructs. However, our current implementation
doesn't support this barrier between worksharing constructs in the same OpenMP
region. The following example executes two `omp for' loops within a single
OpenMP region.
\begin{verbatim}
int result = 0;
#pragma omp parallel
{
  #pragma omp for reduction(+: result)
  for (int i = 0; i < 128; i++) {
    result += i;
  } /* An implicit barrier should exist here for the threads within this team,
       but the current implementation of the integrated OpenMP doesn't provide
       one. The OpenMP threads within this team continue to move forward
       without being blocked, as if the 'omp for' had a 'nowait' clause. */

  #pragma omp for reduction(+: result)
  for (int j = 0; j < 128; j++) {
    result += j;
  }
  /* Because there was no barrier before this second 'omp for' construct,
     the value in 'result' may not be consistent with what is expected. */
} /* We provide an implicit barrier at the end of each OpenMP parallel region. */
\end{verbatim}

So if you want to use multiple worksharing constructs, you should put them in
separate `parallel' regions. The example above can be rewritten as follows.
\begin{verbatim}
int result = 0;
#pragma omp parallel for reduction(+: result)
for (int i = 0; i < 128; i++) {
  result += i;
}

#pragma omp parallel for reduction(+: result)
for (int j = 0; j < 128; j++) {
  result += j;
}
\end{verbatim}
The implicit barrier between worksharing constructs will be supported in the next release of Charm++.

\subsubsection{The list of supported pragmas}
This library is forked from the LLVM OpenMP library supporting OpenMP 4.0.
Among the many directives specified in OpenMP 4.0, a limited set is supported.
The following pragmas have been confirmed to work with this library.
\begin{verbatim}
omp_atomic
omp_master
omp_critical
omp_master_3
omp_get_wtick
omp_for_private
omp_in_parallel
omp_parallel_if
omp_for_reduction
omp_for_lastprivate
omp_for_firstprivate
omp_for_schedule_static
omp_for_schedule_dynamic
omp_get_num_threads
omp_parallel_for_if
omp_parallel_shared
omp_section_private
omp_parallel_default
omp_parallel_private
omp_parallel_reduction
omp_parallel_for_private
omp_parallel_firstprivate
omp_sections_reduction
omp_section_lastprivate
omp_section_firstprivate
omp_parallel_for_firstprivate
omp_parallel_for_lastprivate
omp_parallel_for_reduction
omp_parallel_sections_firstprivate
omp_parallel_sections_lastprivate
omp_parallel_sections_private
omp_parallel_sections_reduction
omp_for_schedule_guided
omp_flush
omp_get_wtime
\end{verbatim}
The other directives in the OpenMP standard will be supported in the next version.

A simple example using this library can be found in \examplerefdir{openmp}.
You can compare CkLoop and the integrated OpenMP with this example. For a
sufficiently large problem, the total execution time of this example is
faster with OpenMP than with CkLoop, thanks to load balancing through
work-stealing between threads within a node, even though the execution time
of each individual chare can be slower with OpenMP because idle PEs help busy
PEs.