To better utilize multicore chips, it has become increasingly popular to
adopt shared-memory multithreading programming methods to exploit parallelism
on a node. For example, in hybrid MPI programs, OpenMP is the most popular
choice. When launching such hybrid programs, users have to make sure there are
spare physical cores allocated to the shared-memory multithreading runtime.
Otherwise, because the two independent runtime systems are not coordinated,
the runtime that handles distributed-memory programming may suffer from
resource contention. If spare cores are allocated, in the same way as when
launching an MPI+OpenMP hybrid program, \charmpp{} will work well with any
shared-memory parallel programming language (e.g. OpenMP). As with ordinary
OpenMP applications, the number of threads used in the OpenMP parts of the
program can be controlled with the {\tt OMP\_NUM\_THREADS} environment
variable. See Sec.~\ref{charmrun} for details on how to propagate such
environment variables.
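
For instance, a hybrid run might be launched as sketched below. The binary
name and core counts are placeholders, and whether the environment variable
reaches all processes depends on the launcher; see Sec.~\ref{charmrun} for
the propagation mechanism appropriate to your machine layer.
\begin{verbatim}
# Run on 8 PEs with 4 OpenMP threads per process
# (illustrative only; propagation depends on the launcher):
OMP_NUM_THREADS=4 ./charmrun +p8 ./hybrid_app
\end{verbatim}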

If no spare cores are allocated, then to avoid resource contention a
\emph{unified runtime} is needed that supports both intra-node shared-memory
multithreading parallelism and inter-node distributed-memory
message-passing parallelism. Additionally, a parallel application may have
only a small fraction of its critical computation that is suitable for
shared-memory parallelism (though the savings on critical computation may
also reduce the communication cost, leading to further performance
improvement). In that case, dedicating physical cores on every node to the
shared-memory multithreading runtime wastes computational power, because
those dedicated cores sit idle during most of the application's execution.
This also indicates the necessity of a unified runtime supporting both types
of parallelism.

The \emph{CkLoop} library is an add-on to the \charmpp{} runtime that achieves
such a unified runtime. The library implements a simple OpenMP-like shared-memory
multithreading runtime that reuses \charmpp{} PEs to perform tasks spawned by
the multithreading runtime. This library targets the SMP mode of \charmpp{}.

The \emph{CkLoop} library is built in
\$CHARM\_DIR/\$MACH\_LAYER/tmp/libs/ck-libs/ckloop by executing ``make''.
To use it in an application, include ``CkLoopAPI.h'' in
the source code. The interface functions of this library are as
follows:

\begin{itemize}
\item CProxy\_FuncCkLoop \textbf{CkLoop\_Init}(int
numThreads=0): This function initializes the CkLoop library, and it only needs
to be called once on a single PE during the initialization phase of the
application. The argument ``numThreads'' is only used in non-SMP mode,
specifying the number of threads to be created for single-node shared-memory
parallelism. It is ignored in SMP mode.

\item void \textbf{CkLoop\_Exit}(CProxy\_FuncCkLoop ckLoop): This function is
intended to be used in non-SMP mode, as it frees the resources
(e.g. terminates the spawned threads) used by the CkLoop library. It should
be called on just one PE.

\item void \textbf{CkLoop\_Parallelize}( \\
HelperFn func, /* the function that finishes partial work on another thread */ \\
int paramNum, /* the number of parameters for func */ \\
void *param, /* the input parameters for the above func */ \\
int numChunks, /* number of chunks to be partitioned */ \\
int lowerRange, /* lower range of the loop-like parallelization [lowerRange, upperRange] */ \\
int upperRange, /* upper range of the loop-like parallelization [lowerRange, upperRange] */ \\
int sync=1, /* toggle implicit barrier after each parallelized loop */ \\
void *redResult=NULL, /* the reduction result, ONLY SUPPORTS a SINGLE VAR of TYPE int/float/double */ \\
REDUCTION\_TYPE type=CKLOOP\_NONE, /* type of the reduction result */ \\
CallerFn cfunc=NULL, /* caller PE will call this function before ckloop is done and before starting to work on its chunks */ \\
int cparamNum=0, void *cparam=NULL /* the input parameters to the above function */ \\
) \\
``HelperFn'' is defined as ``typedef void (*HelperFn)(int first, int last, void *result, int paramNum, void *param);''
where ``result'' is the buffer for the reduction result on a single simple-type variable.
``CallerFn'' is defined as ``typedef void (*CallerFn)(int paramNum, void *param);''
A usage sketch follows this list.
\end{itemize}
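
The following is a minimal sketch of how these functions fit together,
parallelizing a sum over an array with a double-typed reduction. The reduction
constant CKLOOP\_DOUBLE\_SUM and the surrounding setup (the array ``data'' and
its length ``N'') are assumptions for illustration; see \examplerefdir{ckloop}
for complete, authoritative examples.
\begin{verbatim}
#include "CkLoopAPI.h"

/* Matches the HelperFn signature: sums data[first..last] into *result.
   The array arrives through the 'param' pointer. */
void sumHelper(int first, int last, void *result, int paramNum, void *param) {
  double *data = (double *)param;
  double partial = 0.0;
  for (int i = first; i <= last; i++) partial += data[i];
  *(double *)result = partial; /* partial result for this chunk */
}

/* Once, during startup, on a single PE: */
CProxy_FuncCkLoop ckLoopProxy = CkLoop_Init();

/* Later, e.g. from a chare's entry method ('data' and 'N' assumed to exist): */
double total = 0.0;
CkLoop_Parallelize(sumHelper, 1, (void *)data,
                   8 /* numChunks */, 0, N - 1,
                   1 /* sync */, &total, CKLOOP_DOUBLE_SUM);
\end{verbatim}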

Examples using this library can be found in \examplerefdir{ckloop} and in the
widely used molecular dynamics simulation application
NAMD\footnote{http://www.ks.uiuc.edu/Research/namd}.

\section{Charm++/Converse Runtime Scheduler Integrated OpenMP}
The compiler-provided OpenMP runtime library can work with Charm++, but it
creates its own thread pool, so Charm++ and OpenMP can suffer from
oversubscription. The integrated OpenMP runtime library parallelizes OpenMP
regions in each chare and runs on the Charm++ runtime without
oversubscription. The integrated runtime creates OpenMP user-level threads,
which can migrate among PEs within a node. This fine-grained parallelism helps
resolve load imbalance within a node: when PEs become idle, they help other
busy PEs within the node via work-stealing.
\subsection{Instructions to build and use the integrated OpenMP library}
\subsubsection{Instructions to build}
The OpenMP library can be built with the `omp' keyword and any SMP version of
Charm++, including the multicore build, when you build Charm++ or AMPI.\\
\begin{verbatim}
e.g.) $CHARM_DIR/build charm++ multicore-linux64 omp
      $CHARM_DIR/build charm++ netlrts-linux-x86_64 smp omp
\end{verbatim}
This library is based on the LLVM OpenMP runtime library, so it supports the
ABI used by the clang, icc, and gcc compilers.

The following compilers are verified to support this integrated library on
Linux:
\begin{itemize}
\item GCC: 4.8 or newer
\item ICC: 15.0 or newer
\item Clang: 3.7 or newer
\end{itemize}

You can use this integrated OpenMP with clang on IBM Bluegene machines without
special compilation flags (you don't need to add -fopenmp or -openmp with
clang on Bluegene).

On Linux, recent distributions install an OpenMP-capable version of clang by
default; for example, Ubuntu has shipped with clang newer than 3.7 since
15.10. Depending on which version of clang is installed in your working
environment, you may need to follow additional instructions to use this
integrated OpenMP with clang. The following instructions are for clang on
Ubuntu where the default clang is older than 3.7. If you want to use clang on
other Linux distributions, you can use their package managers to install clang
and the OpenMP library. This installation of clang adds headers for the OpenMP
environment routines and allows the OpenMP directives to be parsed. However,
on Ubuntu the installation of clang doesn't come with its own OpenMP runtime
library, which results in an error message saying that the compiler-provided
OpenMP library fails to link. That library is not needed by the integrated
OpenMP runtime, but you need to work around the error for your code to
compile. The following shows how to avoid the error.

\begin{verbatim}
/* When you want to compile the integrated OpenMP on Ubuntu where the
   pre-installed clang is older than 3.7, you can use the integrated OpenMP
   with the following instructions.
   e.g.) On Ubuntu 14.04, the default clang is version 3.4. */
sudo apt-get install clang-3.8    // you can use any version of clang >= 3.8
sudo ln -svT /usr/bin/clang-3.8 /usr/bin/clang
sudo ln -svT /usr/bin/clang++-3.8 /usr/bin/clang++

$(CHARM_DIR)/build charm++ multicore-linux64 clang omp --with-production -j8
echo '!<arch>' > $(CHARM_DIR)/lib/libomp.a  // dummy library to avoid the link error
\end{verbatim}

On Mac, the Apple-provided clang installed by default doesn't support OpenMP.
We're working on supporting this library on Mac with an OpenMP-enabled clang,
which can be downloaded and installed through Homebrew or MacPorts.
Currently, this integrated library is built and compiled on Mac with normal
GCC, which can be downloaded and installed via Homebrew or MacPorts. You
should set environment variables so that the Charm++ build script uses the
normal gcc installed from Homebrew or MacPorts. The following is an example
using Homebrew on Mac OS X 10.12.5.

\begin{verbatim}
/* Install Homebrew from https://brew.sh
 * Install gcc using 'brew' */
brew install gcc

/* gcc, g++ and other binaries are installed at /usr/local/Cellar/gcc/<version>/bin
 * You need to make symbolic links to the gcc binaries at /usr/local/bin
 * In this example, gcc 7.1.0 is installed in that directory. */
cd /usr/local/bin
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-7 gcc
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/g++-7 g++
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-nm-7 gcc-nm
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ranlib-7 gcc-ranlib
ln -sv /usr/local/Cellar/gcc/7.1.0/bin/gcc-ar-7 gcc-ar

/* Finally, set the PATH variable so that these binaries are found first
   by the build script. */
export PATH=/usr/local/bin:$PATH
\end{verbatim}

In addition, this library will be supported on Windows in the next release of Charm++.

\subsubsection{How to use the integrated OpenMP on Charm++}

To use this library in your applications, you have to add `-module OmpCharm'
to the compile flags to link this library instead of the compiler-provided
one. Without `-module OmpCharm', your application will use the
compiler-provided OpenMP library, which runs on its own separate runtime.
(You don't need to add `-fopenmp' or `-openmp' with gcc and icc; these flags
are included in the predefined compile options when you build Charm++ with
`omp'.) A hypothetical build is sketched below.
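
For example, compiling and linking might look like the following (the file
names are placeholders for illustration):
\begin{verbatim}
charmc -c myprogram.C
charmc -module OmpCharm myprogram.o -o myprogram
\end{verbatim}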

This integrated OpenMP adjusts the number of OpenMP instances on each chare,
so the number of OpenMP instances can change from one OpenMP region to the
next during execution. If your code shares a data structure among OpenMP
instances in a parallel region, you can set the size of the data structure
before the start of the OpenMP region with ``omp\_get\_max\_threads()''
and index into it within each OpenMP instance with ``omp\_get\_thread\_num()''.
After the OpenMP region, you can iterate over the data structure to combine
partial results with ``CmiGetCurKnownOmpThreads()'', which returns the number
of OpenMP threads for the latest OpenMP region on the PE where a chare is
running. The following example shows how to use shared data structures across
OpenMP regions on the integrated OpenMP with Charm++.
\begin{verbatim}
/* Maximum possible number of OpenMP threads in the upcoming OpenMP region.
   Users can restrict this number with 'omp_set_num_threads()' for each chare
   and with the environment variable 'OMP_NUM_THREADS' for all chares.
   By default, omp_get_max_threads() returns the number of PEs for each
   logical node. */
int maxAvailableThreads = omp_get_max_threads();
int *partialResult = new int[maxAvailableThreads]{0};

/* Partial sum for subsets of iterations assigned to each OpenMP thread.
   The integrated OpenMP runtime decides how many OpenMP threads to create
   with some internal heuristics. */
#pragma omp parallel for
for (int i = 0; i < 128; i++) {
  partialResult[omp_get_thread_num()] += i;
}

/* CmiGetCurKnownOmpThreads() tells how many OpenMP threads were created in
   the latest OpenMP region, so we can read each thread's partial result. */
for (int j = 0; j < CmiGetCurKnownOmpThreads(); j++)
  CkPrintf("partial sum of thread %d: %d \n", j, partialResult[j]);
\end{verbatim}

\subsection{Limitations of the current implementation}
\subsubsection{The lack of a barrier within an OpenMP region}
In the OpenMP standard, within an OpenMP region each worksharing construct
ends with an implicit barrier, so that the result of an earlier worksharing
construct can be used by later constructs. However, our current implementation
doesn't support this barrier between worksharing constructs in the same OpenMP
region. The following example executes two `omp for' loops within a single
OpenMP region.
\begin{verbatim}
int result = 0;
#pragma omp parallel
{
  #pragma omp for reduction(+: result)
  for (int i = 0; i < 128; i++) {
    result += i;
  } /* An implicit barrier should exist here for the threads within this team,
       but the current implementation of the integrated OpenMP doesn't provide
       one. The OpenMP threads within this team continue to move forward
       without being blocked, as if the 'omp for' had a 'nowait' clause. */

  #pragma omp for reduction(+: result)
  for (int j = 0; j < 128; j++) {
    result += j;
  }
  /* Because there was no barrier before this second 'omp for' construct,
     the value in 'result' may not be consistent with what is expected. */
} /* We provide an implicit barrier at the end of each OpenMP parallel region. */
\end{verbatim}

So if you want to use multiple worksharing constructs, you should put them in
separate `parallel' regions. The example above can be rewritten as follows.
\begin{verbatim}
int result = 0;
#pragma omp parallel for reduction(+: result)
for (int i = 0; i < 128; i++) {
  result += i;
}

#pragma omp parallel for reduction(+: result)
for (int j = 0; j < 128; j++) {
  result += j;
}
\end{verbatim}
The implicit barrier between worksharing constructs will be supported in the next release of Charm++.

\subsubsection{The list of supported pragmas}
This library is forked from the LLVM OpenMP library supporting OpenMP 4.0.
Among the many directives specified in OpenMP 4.0, a limited set is supported.
The following pragmas have been confirmed to work with this library.
\begin{verbatim}
omp_atomic
omp_master
omp_critical
omp_master_3
omp_get_wtick
omp_for_private
omp_in_parallel
omp_parallel_if
omp_for_reduction
omp_for_lastprivate
omp_for_firstprivate
omp_for_schedule_static
omp_for_schedule_dynamic
omp_get_num_threads
omp_parallel_for_if
omp_parallel_shared
omp_section_private
omp_parallel_default
omp_parallel_private
omp_parallel_reduction
omp_parallel_for_private
omp_parallel_firstprivate
omp_sections_reduction
omp_section_lastprivate
omp_section_firstprivate
omp_parallel_for_firstprivate
omp_parallel_for_lastprivate
omp_parallel_for_reduction
omp_parallel_sections_firstprivate
omp_parallel_sections_lastprivate
omp_parallel_sections_private
omp_parallel_sections_reduction
omp_for_schedule_guided
omp_flush
omp_get_wtime
\end{verbatim}
The other directives in the OpenMP standard will be supported in the next version.

A simple example using this library can be found in \examplerefdir{openmp}.
You can compare CkLoop and the integrated OpenMP with this example. For a
sufficiently large problem, the total execution time of this example is
faster with OpenMP than with CkLoop, thanks to load balancing through
work-stealing between threads within a node, even though the execution time
of each individual chare can be slower with OpenMP because idle PEs help busy
PEs.