% This file is part of the GROMACS molecular simulation package.
%
% Copyright (c) 2013, by the GROMACS development team, led by
% David van der Spoel, Berk Hess, Erik Lindahl, and including many
% others, as listed in the AUTHORS file in the top-level source
% directory and at http://www.gromacs.org.
%
% GROMACS is free software; you can redistribute it and/or
% modify it under the terms of the GNU Lesser General Public License
% as published by the Free Software Foundation; either version 2.1
% of the License, or (at your option) any later version.
%
% GROMACS is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% Lesser General Public License for more details.
%
% You should have received a copy of the GNU Lesser General Public
% License along with GROMACS; if not, see
% http://www.gnu.org/licenses, or write to the Free Software Foundation,
% Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
%
% If you want to redistribute modifications to GROMACS, please
% consider that scientific software is very special. Version
% control is crucial - bugs must be traceable. We will be happy to
% consider code for inclusion in the official distribution, but
% derived work must not be called official GROMACS. Details are found
% in the README & COPYING files - if they are missing, get the
% official version at http://www.gromacs.org.
%
% To help us fund GROMACS development, we humbly ask that you cite
% the research papers on the package. Check out http://www.gromacs.org

\section{Parallelization}
\label{sec:par}

\newcommand{\abs}[1]{\mid \! {#1} \! \mid}

The purpose of this section is to discuss the
\normindex{parallelization} of the principal MD algorithm, not to
describe the algorithms that are in practical use for molecular
systems with their complex variety of atoms and terms in the force
field descriptions. We shall therefore consider as an example a simple
system consisting only of a single type of atom with a simple form of
the interaction potential. The emphasis will be on the special
problems that arise when the algorithm is implemented on a parallel
computer.

The simple model problem already contains the bottleneck of all MD
simulations: the computationally intensive evaluation of the
{\em non-bonded} forces between pairs of atoms, based on the distance
between particles. Complex molecular systems will in addition
involve many different kinds of {\em bonded} forces between designated
atoms. Such interactions add to the complexity of the algorithm but do
not modify the basic considerations concerning parallelization.
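
To make the model problem concrete, the following is a minimal serial
sketch of such a pairwise force evaluation. It assumes, purely for
illustration, a Lennard-Jones form for the interaction and plain C
arrays for coordinates and forces; it is not the {\gromacs} inner loop.
\begin{footnotesize}
\begin{verbatim}
/* Minimal O(N^2) pair force loop for the model problem.
 * x[i][0..2] are coordinates, f[i][0..2] the forces (zeroed before the
 * call); c6 and c12 are assumed Lennard-Jones parameters. */
void calc_forces(int natoms, double x[][3], double f[][3],
                 double c6, double c12)
{
  int    i, j, d;
  double dx[3], r2, rinv2, rinv6, fscal;

  for (i = 0; i < natoms; i++) {
    for (j = i+1; j < natoms; j++) {
      r2 = 0;
      for (d = 0; d < 3; d++) {
        dx[d] = x[i][d] - x[j][d];
        r2   += dx[d]*dx[d];
      }
      rinv2 = 1.0/r2;
      rinv6 = rinv2*rinv2*rinv2;
      /* F(r)/r for V(r) = c12/r^12 - c6/r^6 */
      fscal = (12.0*c12*rinv6*rinv6 - 6.0*c6*rinv6)*rinv2;
      for (d = 0; d < 3; d++) {
        f[i][d] += fscal*dx[d];   /* the pair force acts on  */
        f[j][d] -= fscal*dx[d];   /* both particles          */
      }
    }
  }
}
\end{verbatim}
\end{footnotesize}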

\subsection{Methods of parallelization}
There are a number of methods to parallelize the MD algorithm, each
with its own advantages and disadvantages. The method of choice
depends on the hardware and compilers available.
We list them here:
\begin{enumerate}
\item[1] {\em \normindex{Message Passing}.}\\
        In this method, which is more or less the traditional
        way of parallel programming, all the parallelism is
        explicitly programmed by the user. The disadvantage
        is that it takes extra code and effort, the advantage
        is that the programmer keeps full control over the data
        flow and can do optimizations a compiler could not come
        up with.

        The implementation is typically done by calling a set of
        library routines to send and receive data to and from
        other processors. Almost all hardware vendors support
        this way of parallelism in their C and Fortran compilers.

\item[2] {\em \swapindex{Data}{Parallel}.}\\
        This method lets the user define arrays on which to
        operate in parallel. Programming this way is much
        like vectorizing: recurrence is not parallelized
        ({\eg} {\tt for(i=1; (i<MAX); i++) a[i] = a[i-1] + 1;}
        neither vectorizes nor parallelizes, because for
        every i the result from the previous step is needed).

        The advantage of data parallelism is that it is
        easier for the user; the compiler takes care of the
        parallelism. The disadvantage is that it is supported
        by a small (though growing) number of hardware vendors,
        and that it is much harder to maintain a program that has to
        run on both parallel and sequential machines, because
        the only standard language that supports it is Fortran-90,
        which is not available on many platforms.
\end{enumerate}
Both methods allow the MD algorithm to be implemented without much
trouble. Message passing MD algorithms have been published
since the mid-1980s (\cite{Fincham87}, \cite{Raine89})
and development is still continuing.
Data parallel programming is newer,
but starting from a well-vectorized program it is not hard to do.

Our implementation of MD is a message passing one, the reason for which
is partly historical: the project to develop a parallel MD program started
when Fortran-90 was still in the making, and no compilers were
expected to be available.
Having done some experiments with data parallel programming on a
Connection Machine (CM-5), we still believe that message passing is the
way to go, because of its portability to other hardware and
the poor performance of the code produced by the compilers.
Moreover, data parallel programming
has the same drawback as vectorization: the part of the program that is
not vectorized or parallelized determines the run time of the program
(\normindex{Amdahl's law}).
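
For reference, Amdahl's law states that if a fraction $f$ of the run time
of a program can be parallelized over $P$ processors, the maximum
attainable speed-up is
\beq
S(P) ~=~ \frac{1}{(1-f) + f/P}
\eeq
so that even for $P \rightarrow \infty$ the speed-up is limited to
$1/(1-f)$ by the non-parallelized part of the code.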

The approach we took to parallelism was a minimalist one: use as few
non-standard elements in the software as possible, and use the
simplest \swapindex{processor}{topology} that does the job. We therefore
decided to use a standard language (ANSI-C) with as few non-standard
routines as possible. Only five of the communication routines we use
are non-standard. It is therefore very easy to port our code to other machines.

For an $O(N^2)$ problem like MD, one of the best schemes for the
inter-processor connections is a ring, so our software requires that a
ring is present in the inter-processor connections. A ring can essentially
always be mapped onto another network like a \normindex{hypercube}, a
bus interface (Ethernet {\eg} using
\seeindex{Message Passing Interface}{MPI} \normindex{MPI}) or
a \normindex{tree}
(CM-5). Some hardware vendors have very luxurious connection schemes
that connect every processor to every other processor, but we do not
really need such a scheme and therefore do not use it, even though it
might come in handy at times. The advantage of this simple ring scheme
is that {\gromacs} performs extremely well even on inexpensive
workstation clusters.

When using a message passing scheme one has to divide the particles
over processors, which can be done in two ways:
\begin{itemize}
\item {\em \swapindex{Space}{Decomposition}.}\\
      An element of space is allocated to each processor: when dividing
      a cubic box with edge $b$ over $P$ processors, this can be done by
      giving each processor a slab of thickness $b/P$. This method
      has the advantage
      that each processor has about the same number of interactions
      to calculate (at least when the simulated system has a homogeneous
      density, like a liquid or a gas). The disadvantage is that a lot of
      bookkeeping is necessary for particles that move over processor
      boundaries. For more complex systems, such as macromolecules, there
      are also 3- and 4-atom interactions that make
      the bookkeeping too complicated for our taste.
\item {\em \swapindex{Particle}{Decomposition}.}\\
      Every processor is allocated a number of particles. When
      dividing $N$ particles over $P$ processors each processor will
      get $N/P$ particles. The implementation of this method
      is described in the next section.
\end{itemize}

\begin{figure}[p]
\centerline{\includegraphics[width=10cm]{plots/int-mat}}
\caption[The interaction matrix.]{The interaction matrix (left) and
the same using action$~=~-$reaction (right).}
\label{fig:int_mat}
\end{figure}

\begin{table}[p]
\centerline{
\newcolumntype{s}[1]{D{/}{/}{#1}}
\begin{tabular}{|l|s4|s4|s4|s4|}
\dline
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 1}
            & \mcc{1}{i mod 2 = 1} \\
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2}
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2} \\
\hline
N mod 2 = 1 & N/2 & N/2     & N/2     & N/2 \\
N mod 4 = 2 & N/2 & N/2     & N/2 - 1 & N/2 - 1 \\
N mod 4 = 0 & N/2 & N/2 - 1 & N/2 - 1 & N/2 \\
\dline
\end{tabular}
}
\caption[The number of interactions between particles.]{The number of
interactions between particles. The number of $j$ particles per $i$
particle is a function of the total number of particles $N$ and the
particle number $i$. Note that here the $/$ operator is used for
integer division, {\ie} truncating the remainder.}
\label{tab:decomp}
\end{table}

\begin{figure}[p]
\centerline{\includegraphics[width=\linewidth]{plots/decomp}}
\caption[Interaction matrices for different $N$.]{Interaction matrices
for different $N$. The number of $j$-particles an $i$-particle interacts
with depends on the {\em total} number of particles and on the
{\em particle number}.}
\label{fig:decomp}
\end{figure}

\subsection{MD on a ring of processors}
When a neighbor list is not used the MD problem is in principle an
$O(N^2)$ problem, as each particle can interact
with every other. This can be simplified using Newton's third law
\beq
F_{ij} ~=~ -F_{ji}.
\label{eqn:Newt3}
\eeq
This implies that there is half a matrix of interactions (without the
diagonal; a particle does not interact with itself) to consider
(\figref{int_mat}).
When we reflect the upper right triangle of interactions to the lower
left triangle of the matrix, we still cover all possible interactions,
but now every row in the matrix has almost the same number of points
or possible interactions. We can now assign a (preferably equal)
number of rows to each processor to compute the forces and at the same
time a number of particles to do the update on, the {\em home}
particles. The number of interactions per particle depends on the
{\em total number} $N$ of particles (see \figref{decomp}) and on the
{\em particle number} $i$. The exact formulae are given in
\tabref{decomp}.
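
The entries of \tabref{decomp} translate directly into code. The
following function is a small illustrative sketch (not part of
{\gromacs}) that returns the number of $j$ particles assigned to $i$
particle {\tt i} for a total of {\tt N} particles:
\begin{footnotesize}
\begin{verbatim}
/* Number of j particles that i particle i interacts with,
 * following the table above; '/' is integer division. */
int n_interactions(int N, int i)
{
  if (N % 2 == 1)                /* N odd            */
    return N/2;
  else if (N % 4 == 2)           /* N even, N/2 odd  */
    return (i % 2 == 0) ? N/2 : N/2 - 1;
  else                           /* N even, N/2 even */
    return ((i % 2 == 0) == (i < N/2)) ? N/2 : N/2 - 1;
}
\end{verbatim}
\end{footnotesize}
Summing this function over all $i$ from 0 to $N-1$ indeed gives
$N(N-1)/2$, the total number of pairs.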

A flow chart of the algorithm is given in \figref{mdpar}.
\begin{figure}
\centerline{\includegraphics[height=18cm]{plots/mdpar}}
\caption[The Parallel MD algorithm.]{The Parallel MD algorithm. If
the steps marked * are left out we have the sequential algorithm
again.}
\label{fig:mdpar}
\end{figure}
\vfill

It is the same as the sequential algorithm, except for two
communication steps. After the particles have been reset in the box,
each processor sends its coordinates onward (left) and then starts
computation of the forces. After this step each processor holds the
{\em partial forces} for the particles it has available, {\eg}
processor 0 holds forces acting on home particles from processors 0,
1, 2 and 3. These forces must be accumulated and sent back (right) to
the home processor. Finally the update of the velocities and
coordinates is done on the home processor.

The {\tt communicate\_r} routine is given below in full C code:\\
\begin{footnotesize}
\begin{verbatim}
void communicate_r(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * nprocs = number of processors
 * pid    = processor id (0..nprocs-1)
 * vecs   = vectors
 * start  = starting index in vecs for each processor
 * homenr = number of home particles for each processor
 */
{
  int i;     /* processor counter                            */
  int shift; /* the number of processors to communicate with */
  int cur;   /* current processor to send data from          */
  int next;  /* next processor on the ring (using modulo)    */

  cur   = pid;
  shift = nprocs/2;

  for (i=0; (i<shift); i++) {
    next = (cur+1) % nprocs;
    send   (left,  vecs[start[cur]],  homenr[cur]);
    receive(right, vecs[start[next]], homenr[next]);
    cur = next;
  }
}
\end{verbatim}
\end{footnotesize}
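
The accumulation of the partial forces proceeds in the opposite
direction along the ring. The routine below is a hypothetical sketch,
written by analogy with {\tt communicate\_r}; it is not the literal
{\gromacs} code. The helper {\tt receive\_add} is assumed to receive a
block of vectors into a temporary buffer and add it to the local force
array:
\begin{footnotesize}
\begin{verbatim}
void accumulate_f(int nprocs,int pid,rvec f[],int start[],int homenr[])
/*
 * f = partial forces; on return the block of processor pid holds the
 *     fully accumulated forces on its home particles
 */
{
  int i;     /* processor counter                            */
  int shift; /* the number of processors to communicate with */
  int cur;   /* block of partial forces currently passed on  */
  int prev;  /* preceding block on the ring                  */

  shift = nprocs/2;
  cur   = (pid + shift) % nprocs;   /* farthest block we hold */

  for (i=0; (i<shift); i++) {
    prev = (cur-1+nprocs) % nprocs;
    send       (right, f[start[cur]],  homenr[cur]);
    receive_add(left,  f[start[prev]], homenr[prev]);
    cur = prev;
  }
}
\end{verbatim}
\end{footnotesize}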

The data flow around the ring is visualized in \figref{ring}.
Note that because of the ring topology each processor automatically
gets the proper particles to interact with.
\begin{figure}
\centerline{\includegraphics[width=6cm]{plots/ring}}
\caption{Data flow in a ring of processors.}
\label{fig:ring}
\end{figure}

\section{Parallel Molecular Dynamics}
In this chapter we describe some details of the \swapindex{parallel}{MD}
algorithm used
in {\gromacs}. This also includes some information on neighbor searching and
a side excursion to parallel sorting.
Please note the following definition, used throughout this chapter:\\
{\bf Definition:} {\natom}: number of particles; {\nproc}: number of processors.\\
{\gromacs} employs two different grids: the neighbor searching grid ({\nsgrid})
and the combined charge/potential grid ({\fftgrid}), as will be described below.
To maximize the confusion,
these two grids are mapped onto a grid of processors when {\gromacs} runs on a
parallel computer.

\subsection{Domain decomposition}
Modern parallel computers, such as an IBM SP/2 or a Cray T3E,
consist of relatively small numbers of relatively fast scalar
processors (typically 8 to 256). The communication channels that are
available in hardware on these machines are not directly visible to
the programmer; a software layer (usually \normindex{MPI})
hides this, and makes communication from all processors to all
others possible. In contrast, in the {\gromacs} hardware~\cite{Berendsen95a}
only communication in a ring was available, {\ie} each processor could
communicate with its direct neighbors only.

It seems logical to map the computational box of an MD simulation
onto a 3D grid of
processors ({\eg} $4\times4\times4$ for a 64 processor system). This
ensures that most interactions that are local in space can be computed
with information from neighboring processors only. However, this means
that there have to be communication channels in 3 dimensions too, which
is not necessarily the case.
Although this may be overcome in software, such a mapping complicates the MD
software as well, without clear performance benefits on most
parallel computers.

Therefore we opt for a simple one-dimensional division scheme
for the computational box. Each processor gets a slab of this box in
the X dimension.
For the communication between processors this has two main advantages:
\begin{enumerate}
\item Simplicity of coding. Communication can only be to two neighbors
(called {\em left} and {\em right} in {\gromacs}).
\item Communication can usually be done in large chunks, which makes it
more efficient on most hardware platforms.
\end{enumerate}

Most interactions in molecular dynamics have in principle a
short-range character. Bonds, angles and dihedrals are guaranteed to
have the corresponding particles close in space.

\subsection{Domain decomposition for non-bonded forces}
For large parallel computers, domain decomposition is preferable over
particle decomposition, since it makes load
balancing easier. Without load balancing the scaling of the code is
rather poor. For this purpose, the computational box is divided into
{\nproc} slabs, where {\nproc} is equal to the number of processors.
There are multiple ways of dividing the box over processors, but since
the {\gromacs} code assumes a ring topology for the processors, it is
logical to cut the system into slabs along just one dimension, the X
dimension. The algorithm for neighbor searching then becomes:
\begin{enumerate}
\item   Make a list of charge group indices sorted on (increasing) X coordinate
(\figref{parsort}).
{\bf Note} that care must be taken to parallelize the sorting algorithm
as well. See \secref{parsort}.
\item   Divide this list into slabs, with each slab having the same number of
charge groups.
\item   Put the particles corresponding to the local slab on a 3D {\nsgrid} as
described in \secref{nsgrid}.
\item   Communicate the {\nsgrid} to neighboring processors (not necessarily to all
processors). The number of neighboring {\nsgrid} cells (N$_{gx}$) to
communicate is determined by the cut-off length $r_c$ according to
\beq
N_{gx} ~=~ \frac{r_c \nproc}{l_x}
\eeq
where $l_x$ is the box length in the slabbing direction.
\item   On each processor compute the neighbor list for all charge groups in
its slab using the normal grid-based neighbor searching.
\end{enumerate}

\begin{figure}
\centerline{\includegraphics[width=10cm]{plots/parsort}}
\caption[Index in the coordinate array.]{Index in the coordinate
array. The division in slabs is indicated by dashed lines.}
\label{fig:parsort}
\end{figure}

For homogeneous systems, this is close to optimal load balancing,
without actually doing load balancing. For inhomogeneous systems, such
as membranes or interfaces, the slabs should be perpendicular to the
interface; this way, each processor has ``a little bit of
everything''. The {\gromacs} utility program {\tt editconf} has an
option to rotate a whole computational box.

The following observations are important here:
\begin{itemize}
\item   Particles may diffuse from one slab to the other, therefore each processor
must hold coordinates for all particles all the time, and distribute forces
back to all processors as well.
\item   Velocities are kept on the ``home processor'' for each particle,
where the integration of Newton's equations is done.
\item   Fixed interaction lists (bonds, angles, etc.) are each kept
on a single processor. Since all processors have all
coordinates, it does not matter where interactions are
calculated. The division is actually done by the {\gromacs}
preprocessor {\tt grompp}, and care is taken that, as far as
possible, every processor gets the same number of bonded
interactions.
\end{itemize}

In all, this makes for a mixed particle decomposition/domain decomposition
scheme for the parallelization of the MD code. The communication costs are
four times higher than for the simple particle decomposition method
described in \secref{par}
(the whole coordinate and force arrays are communicated across the whole
ring, rather than half the arrays over half the ring).
However, for large numbers of processors the improved load balancing
easily compensates for this.

\subsection{Parallel \normindex{PPPM}}
A further reason for domain decomposition is the PPPM algorithm. This
algorithm works with a 3D Fast Fourier Transform. It employs a
discrete grid of dimensions (\nx,\ny,\nz), the {\fftgrid}. The
algorithm consists of five steps, each of which has to be
parallelized:
\begin{enumerate}
\item   Spreading charges on the {\fftgrid} to obtain the charge
distribution $\rho(\ve{r})$.
This involves the following sub-steps
(sketched in code after this enumeration):
\begin{enumerate}
\item[{\bf a.}] put the particle in the box
\item[{\bf b.}] find the {\fftgrid} cell in which the particle resides
\item[{\bf c.}] add the charge of the particle times the appropriate
weight factor (see \secref{pppm}) to
each of the 27 grid points ($3 \times 3 \times 3$).
\end{enumerate}
In the parallel case, the {\fftgrid}
must be filled on each processor with its
share of the particles, and subsequently the {\fftgrid}s of all processors
must be summed to find the total charge distribution. It may be clear that
this induces a large amount of unnecessary work, unless we use domain
decomposition. If each processor only has particles in a certain region
of space, it only has to calculate the charge distribution for
that region of space. Since {\gromacs} works with slabs, this means that
each processor fills the {\fftgrid} cells corresponding to its slab in
space, and addition of {\fftgrid}s need only be done for neighboring slabs.\\
To be more precise, the slab $x$ range for processor $i$ is defined as:
\beq
i\, \frac{l_x}{M} \le x <\, (i+1)\frac{l_x}{M}
\eeq
Particles with $x$ coordinates in this range will add to the charge
distribution on the following range
of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-1 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+2
\eeq
where trunc indicates the truncation of a real number to the largest integer
smaller than or equal to that real number.

\item   Doing the Fourier transform of the charge distribution $\rho(\ve{r})$
in parallel to obtain $\hat{\rho}(\ve{k})$. This is done using
the FFTW library (see \href{http://www.fftw.org}{www.fftw.org}),
which employs the MPI library for message passing
(note that there are also \swapindex{shared}{memory} versions
of the FFTW code).\\
This FFT algorithm actually uses slabs as well (good
thinking!). Each processor does 2D FFTs on its slab, and then
the whole {\fftgrid} is transposed {\em in place}
({\ie} without using extra memory). This means that after the
FFT the X and Y components are swapped. To complete the FFT,
this swapping should in principle be undone (by transposing
back). Happily the FFTW code has an option to omit this,
which we use in the next step.
\item   Convolute $\hat{\rho}(\ve{k})$ with the Fourier transform of the
charge spread function $\hat{g}(\ve{k})$ (which we have tabulated before)
to obtain the potential $\hat{\phi}(\ve{k})$.
As an optimization, we store $\hat{g}(\ve{k})$ in transposed form
as well, matching the transposed form of $\hat{\rho}(\ve{k})$
which we get from the FFTW routine. After this step we have the
potential $\hat{\phi}(\ve{k})$ in Fourier space, but still on the
transposed {\fftgrid}.
\item   Do an inverse transform of $\hat{\phi}(\ve{k})$ to obtain
${\phi}(\ve{r})$. Since the algorithm must do a transpose of the data,
this step actually yields the wanted result: the un-transposed
potential in real space.
\item   Interpolate the potential ${\phi}(\ve{r})$ in real space at the
particle positions to obtain forces and energy. For this step the same
considerations about parallelism hold as for the charge spreading.
However, in this case more neighboring grid cells are needed, implying
that we need the following set of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-3 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+4
\eeq
\end{enumerate}
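
The charge-spreading sub-steps of step 1 can be sketched as follows for
a single particle. This is an illustrative fragment only: the grid
dimensions {\tt NX}, {\tt NY}, {\tt NZ}, the box lengths and the
one-dimensional weight function {\tt W()} (the weight factors of
\secref{pppm}, which in practice also depend on the position of the
particle within its cell) are assumed, and the periodic wrapping of the
indices is made explicit:
\begin{footnotesize}
\begin{verbatim}
/* Spread charge q of a particle at (x,y,z), already put in the box (a),
 * onto the 27 fftgrid points around its cell. W(d) is an assumed weight
 * factor for a grid offset d = -1, 0, +1. */
void spread_charge(double q, double x, double y, double z,
                   double lx, double ly, double lz,
                   double grid[NX][NY][NZ])
{
  int ix, iy, iz;   /* cell in which the particle resides (b) */
  int dx, dy, dz;   /* offsets to the neighboring cells       */
  int jx, jy, jz;   /* periodically wrapped grid indices      */

  ix = (int)(x*NX/lx);
  iy = (int)(y*NY/ly);
  iz = (int)(z*NZ/lz);

  for (dx = -1; dx <= 1; dx++)
    for (dy = -1; dy <= 1; dy++)
      for (dz = -1; dz <= 1; dz++) {
        jx = (ix+dx+NX) % NX;
        jy = (iy+dy+NY) % NY;
        jz = (iz+dz+NZ) % NZ;
        grid[jx][jy][jz] += q*W(dx)*W(dy)*W(dz);   /* (c) */
      }
}
\end{verbatim}
\end{footnotesize}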

The algorithm as sketched above requires communication for spreading
the charges, for the forward and backward FFTs, and for interpolating
the forces. The {\gromacs} parts of the program use only left and
right communication, {\ie} two communication channels. The FFTW
routines actually use other forms of communication as well, and they
are coded with MPI calls for message passing. This implies
that {\gromacs} can only perform the PPPM algorithm on parallel
computers that support MPI. However, most
\swapindex{shared}{memory} computers, such as the SGI Origin, also
support MPI using the
shared memory for communication.

\subsection{Parallel sorting}
\label{sec:parsort}
For the domain decomposition part of {\gromacs} it is necessary to sort the
coordinates (or rather the index to the coordinates) every time a
neighbor list is made.
If we use brute force, and sort all coordinates on each processor (which is
technically possible since we have all the coordinates), then this sorting
procedure will take a constant wall clock time, proportional to
{\natom}$\log${\natom},
regardless of the number of processors. We can however do a little
better, if we assume that particles diffuse only slowly.
A parallel sorting algorithm can be conceived as follows: \\
At the first step of the simulation
\begin{enumerate}
\item   Do a full sort of all indices using {\eg} the Quicksort algorithm that is
built into the standard C library (a minimal sketch is given after this list).
\item   Divide the sorted array into slabs (as described above, see
\figref{parsort}).
\end{enumerate}
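
The initial full sort might look as follows. This is a hypothetical
sketch, not the {\gromacs} routine: {\tt index[]} holds the particle
indices to be sorted and {\tt xc} points to the X coordinates; the
comparison function is passed to the standard C library routine
{\tt qsort()}:
\begin{footnotesize}
\begin{verbatim}
#include <stdlib.h>

static const double *xc;   /* X coordinates used by the comparison */

static int compare_x(const void *a, const void *b)
{
  double d = xc[*(const int *)a] - xc[*(const int *)b];

  return (d < 0) ? -1 : ((d > 0) ? 1 : 0);
}

/* Sort natoms particle indices on increasing X coordinate */
void sort_indices(int natoms, int index[], const double x[])
{
  xc = x;
  qsort(index, natoms, sizeof(index[0]), compare_x);
}
\end{verbatim}
\end{footnotesize}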

At subsequent steps of the simulation:
\begin{enumerate}
\item   Send the indices for each processor to the preceding processor (if
not processor 0) and to the next processor (if not processor {\nproc}-1). The
communication associated with this operation is proportional to
2{\natom}/{\nproc}.
\item   Sort the combined indices of the three (or two) processors. Note that
the CPU time associated with sorting is now proportional to
(3{\natom}/{\nproc})$\log$(3{\natom}/{\nproc}).
\item   On each processor, the indices belonging to its slab can be determined
from the order of the array (\figref{parsort}).
\end{enumerate}

%\section{A Worked Example}
%Suppose our box size is 4.0 nm and the cut-off is 0.8 nm. For neighborsearching
%we use {\dgrid} = 2, such that there are 10 {\nsgrid} cells.

% LocalWords:  Parallelization parallelization parallelize Fortran vectorizing
% LocalWords:  parallelized vectorize Amdahl's MPI ij ji decomp mdpar nprocs gx
% LocalWords:  pid rvec vecs homenr dihedrals indices parsort nsgrid editconf
% LocalWords:  preprocessor grompp PPPM pppm trunc FFTW FFT FFTs Convolute un
% LocalWords:  SGI Quicksort formulae