% This file is part of the GROMACS molecular simulation package.
%
% Copyright (c) 2013, by the GROMACS development team, led by
% David van der Spoel, Berk Hess, Erik Lindahl, and including many
% others, as listed in the AUTHORS file in the top-level source
% directory and at http://www.gromacs.org.
%
% GROMACS is free software; you can redistribute it and/or
% modify it under the terms of the GNU Lesser General Public License
% as published by the Free Software Foundation; either version 2.1
% of the License, or (at your option) any later version.
%
% GROMACS is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% Lesser General Public License for more details.
%
% You should have received a copy of the GNU Lesser General Public
% License along with GROMACS; if not, see
% http://www.gnu.org/licenses, or write to the Free Software Foundation,
% Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
%
% If you want to redistribute modifications to GROMACS, please
% consider that scientific software is very special. Version
% control is crucial - bugs must be traceable. We will be happy to
% consider code for inclusion in the official distribution, but
% derived work must not be called official GROMACS. Details are found
% in the README & COPYING files - if they are missing, get the
% official version at http://www.gromacs.org.
%
% To help us fund GROMACS development, we humbly ask that you cite
% the research papers on the package. Check out http://www.gromacs.org

\section{Parallelization}
\label{sec:par}

\newcommand{\abs}[1]{\mid \! {#1} \! \mid}

The purpose of this section is to discuss the
\normindex{parallelization} of the principal MD algorithm, not to
describe the algorithms that are in practical use for molecular
systems with their complex variety of atoms and terms in the force
field descriptions. We shall therefore consider as an example a simple
system consisting only of a single type of atom with a simple form of
the interaction potential. The emphasis will be on the special
problems that arise when the algorithm is implemented on a parallel
computer.

The simple model problem already contains the bottleneck of all MD
simulations: the computationally intensive evaluation of the
{\em non-bonded} forces between pairs of atoms, based on the distance
between particles. Complex molecular systems will in addition
involve many different kinds of {\em bonded} forces between designated
atoms. Such interactions add to the complexity of the algorithm but do
not modify the basic considerations concerning parallelization.
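
To make the model problem concrete, the following is a minimal serial
sketch of such a pairwise force evaluation. It assumes, purely for
illustration, a Lennard-Jones form for the interaction and plain C
arrays for coordinates and forces; it is not the {\gromacs} inner loop.
\begin{footnotesize}
\begin{verbatim}
/* Minimal O(N^2) pair force loop for the model problem.
 * x[i][0..2] are coordinates, f[i][0..2] the forces (zeroed before the
 * call); c6 and c12 are assumed Lennard-Jones parameters. */
void calc_forces(int natoms, double x[][3], double f[][3],
                 double c6, double c12)
{
  int    i, j, d;
  double dx[3], r2, rinv2, rinv6, fscal;

  for (i = 0; i < natoms; i++) {
    for (j = i+1; j < natoms; j++) {
      r2 = 0;
      for (d = 0; d < 3; d++) {
        dx[d] = x[i][d] - x[j][d];
        r2   += dx[d]*dx[d];
      }
      rinv2 = 1.0/r2;
      rinv6 = rinv2*rinv2*rinv2;
      /* F(r)/r for V(r) = c12/r^12 - c6/r^6 */
      fscal = (12.0*c12*rinv6*rinv6 - 6.0*c6*rinv6)*rinv2;
      for (d = 0; d < 3; d++) {
        f[i][d] += fscal*dx[d];   /* the pair force acts on  */
        f[j][d] -= fscal*dx[d];   /* both particles          */
      }
    }
  }
}
\end{verbatim}
\end{footnotesize}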

\subsection{Methods of parallelization}
There are a number of methods to parallelize the MD algorithm, each
with its own advantages and disadvantages. The method of choice
depends on the hardware and compilers available.
We list them here:
\begin{enumerate}
\item[1] {\em \normindex{Message Passing}.}\\
        In this method, which is more or less the traditional
        way of parallel programming, all the parallelism is
        explicitly programmed by the user. The disadvantage
        is that it takes extra code and effort, the advantage
        is that the programmer keeps full control over the data
        flow and can do optimizations a compiler could not come
        up with.

        The implementation is typically done by calling a set of
        library routines to send and receive data to and from
        other processors. Almost all hardware vendors support
        this way of parallelism in their C and Fortran compilers.

\item[2] {\em \swapindex{Data}{Parallel}.}\\
        This method lets the user define arrays on which to
        operate in parallel. Programming this way is much
        like vectorizing: recurrence is not parallelized
        ({\eg} {\tt for(i=1; (i<MAX); i++) a[i] = a[i-1] + 1;}
        neither vectorizes nor parallelizes, because for
        every i the result from the previous step is needed).

        The advantage of data parallelism is that it is
        easier for the user; the compiler takes care of the
        parallelism. The disadvantage is that it is supported
        by a small (though growing) number of hardware vendors,
        and that it is much harder to maintain a program that has to
        run on both parallel and sequential machines, because
        the only standard language that supports it is Fortran-90,
        which is not available on many platforms.
\end{enumerate}
Both methods allow the MD algorithm to be implemented without much
trouble. Message passing MD algorithms have been published
since the mid-1980s (\cite{Fincham87}, \cite{Raine89})
and development is still continuing.
Data parallel programming is newer,
but starting from a well-vectorized program it is not hard to do.

Our implementation of MD is a message passing one, the reason for which
is partly historical: the project to develop a parallel MD program started
when Fortran-90 was still in the making, and no compilers were
expected to be available.
Having done some experiments with data parallel programming on a
Connection Machine (CM-5), we still believe that message passing is the
way to go, because of its portability to other hardware and
the poor performance of the code produced by the compilers.
Moreover, data parallel programming
has the same drawback as vectorization: the part of the program that is
not vectorized or parallelized determines the run time of the program
(\normindex{Amdahl's law}).
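
For reference, Amdahl's law states that if a fraction $f$ of the run time
of a program can be parallelized over $P$ processors, the maximum
attainable speed-up is
\beq
S(P) ~=~ \frac{1}{(1-f) + f/P}
\eeq
so that even for $P \rightarrow \infty$ the speed-up is limited to
$1/(1-f)$ by the non-parallelized part of the code.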

The approach we took to parallelism was a minimalist one: use as few
non-standard elements in the software as possible, and use the
simplest \swapindex{processor}{topology} that does the job. We therefore
decided to use a standard language (ANSI-C) with as few non-standard
routines as possible. Only five of the communication routines we use
are non-standard. It is therefore very easy to port our code to other machines.

For an $O(N^2)$ problem like MD, one of the best schemes for the
inter-processor connections is a ring, so our software requires that a
ring is present in the inter-processor connections. A ring can essentially
always be mapped onto another network like a \normindex{hypercube}, a
bus interface (Ethernet {\eg} using
\seeindex{Message Passing Interface}{MPI} \normindex{MPI}) or
a \normindex{tree}
(CM-5). Some hardware vendors have very luxurious connection schemes
that connect every processor to every other processor, but we do not
really need such a scheme and therefore do not use it, even though it
might come in handy at times. The advantage of this simple ring scheme
is that {\gromacs} performs extremely well even on inexpensive
workstation clusters.

When using a message passing scheme one has to divide the particles
over processors, which can be done in two ways:
\begin{itemize}
\item {\em \swapindex{Space}{Decomposition}.}\\
      An element of space is allocated to each processor: when dividing
      a cubic box with edge $b$ over $P$ processors, this can be done by
      giving each processor a slab of thickness $b/P$. This method
      has the advantage
      that each processor has about the same number of interactions
      to calculate (at least when the simulated system has a homogeneous
      density, like a liquid or a gas). The disadvantage is that a lot of
      bookkeeping is necessary for particles that move over processor
      boundaries. For more complex systems, such as macromolecules, there
      are also 3- and 4-atom interactions that make
      the bookkeeping too complicated for our taste.
\item {\em \swapindex{Particle}{Decomposition}.}\\
      Every processor is allocated a number of particles. When
      dividing $N$ particles over $P$ processors each processor will
      get $N/P$ particles. The implementation of this method
      is described in the next section.
\end{itemize}

\begin{figure}[p]
\centerline{\includegraphics[width=10cm]{plots/int-mat}}
\caption[The interaction matrix.]{The interaction matrix (left) and
the same using action$~=~-$reaction (right).}
\label{fig:int_mat}
\end{figure}

\begin{table}[p]
\centerline{
\newcolumntype{s}[1]{D{/}{/}{#1}}
\begin{tabular}{|l|s4|s4|s4|s4|}
\dline
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 1}
            & \mcc{1}{i mod 2 = 1} \\
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2}
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2} \\
\hline
N mod 2 = 1 & N/2 & N/2     & N/2     & N/2 \\
N mod 4 = 2 & N/2 & N/2     & N/2 - 1 & N/2 - 1 \\
N mod 4 = 0 & N/2 & N/2 - 1 & N/2 - 1 & N/2 \\
\dline
\end{tabular}
}
\caption[The number of interactions between particles.]{The number of
interactions between particles. The number of $j$ particles per $i$
particle is a function of the total number of particles $N$ and the
particle number $i$. Note that here the $/$ operator is used for
integer division, {\ie} truncating the remainder.}
\label{tab:decomp}
\end{table}

\begin{figure}[p]
\centerline{\includegraphics[width=\linewidth]{plots/decomp}}
\caption[Interaction matrices for different $N$.]{Interaction matrices
for different $N$. The number of $j$-particles an $i$-particle interacts
with depends on the {\em total} number of particles and on the
{\em particle number}.}
\label{fig:decomp}
\end{figure}

\subsection{MD on a ring of processors}
When a neighbor list is not used the MD problem is in principle an
$O(N^2)$ problem, as each particle can interact
with every other. This can be simplified using Newton's third law
\beq
F_{ij} ~=~ -F_{ji}.
\label{eqn:Newt3}
\eeq
This implies that there is half a matrix of interactions (without the
diagonal; a particle does not interact with itself) to consider
(\figref{int_mat}).
When we reflect the upper right triangle of interactions to the lower
left triangle of the matrix, we still cover all possible interactions,
but now every row in the matrix has almost the same number of points
or possible interactions. We can now assign a (preferably equal)
number of rows to each processor to compute the forces and at the same
time a number of particles to do the update on, the {\em home}
particles. The number of interactions per particle depends on the
{\em total number} $N$ of particles (see \figref{decomp}) and on the
{\em particle number} $i$. The exact formulae are given in
\tabref{decomp}.
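
The entries of \tabref{decomp} translate directly into code. The
following function is a small illustrative sketch (not part of
{\gromacs}) that returns the number of $j$ particles assigned to $i$
particle {\tt i} for a total of {\tt N} particles:
\begin{footnotesize}
\begin{verbatim}
/* Number of j particles that i particle i interacts with,
 * following the table above; '/' is integer division. */
int n_interactions(int N, int i)
{
  if (N % 2 == 1)                /* N odd            */
    return N/2;
  else if (N % 4 == 2)           /* N even, N/2 odd  */
    return (i % 2 == 0) ? N/2 : N/2 - 1;
  else                           /* N even, N/2 even */
    return ((i % 2 == 0) == (i < N/2)) ? N/2 : N/2 - 1;
}
\end{verbatim}
\end{footnotesize}
Summing this function over all $i$ from 0 to $N-1$ indeed gives
$N(N-1)/2$, the total number of pairs.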

A flow chart of the algorithm is given in \figref{mdpar}.
\begin{figure}
\centerline{\includegraphics[height=18cm]{plots/mdpar}}
\caption[The Parallel MD algorithm.]{The Parallel MD algorithm. If
the steps marked * are left out we have the sequential algorithm
again.}
\label{fig:mdpar}
\end{figure}
\vfill

It is the same as the sequential algorithm, except for two
communication steps. After the particles have been reset in the box,
each processor sends its coordinates onward (left) and then starts
computation of the forces. After this step each processor holds the
{\em partial forces} for the particles it has available, {\eg}
processor 0 holds forces acting on home particles from processors 0,
1, 2 and 3. These forces must be accumulated and sent back (right) to
the home processor. Finally the update of the velocities and
coordinates is done on the home processor.

The {\tt communicate\_r} routine is given below in full C code:\\
\begin{footnotesize}
\begin{verbatim}
void communicate_r(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * nprocs = number of processors
 * pid    = processor id (0..nprocs-1)
 * vecs   = vectors
 * start  = starting index in vecs for each processor
 * homenr = number of home particles for each processor
 */
{
  int i;     /* processor counter                            */
  int shift; /* the number of processors to communicate with */
  int cur;   /* current processor to send data from          */
  int next;  /* next processor on the ring (using modulo)    */

  cur   = pid;
  shift = nprocs/2;

  for (i=0; (i<shift); i++) {
    next = (cur+1) % nprocs;
    send   (left,  vecs[start[cur]],  homenr[cur]);
    receive(right, vecs[start[next]], homenr[next]);
    cur = next;
  }
}
\end{verbatim}
\end{footnotesize}
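
The accumulation of the partial forces proceeds in the opposite
direction along the ring. The routine below is a hypothetical sketch,
written by analogy with {\tt communicate\_r}; it is not the literal
{\gromacs} code. The helper {\tt receive\_add} is assumed to receive a
block of vectors into a temporary buffer and add it to the local force
array:
\begin{footnotesize}
\begin{verbatim}
void accumulate_f(int nprocs,int pid,rvec f[],int start[],int homenr[])
/*
 * f = partial forces; on return the block of processor pid holds the
 *     fully accumulated forces on its home particles
 */
{
  int i;     /* processor counter                            */
  int shift; /* the number of processors to communicate with */
  int cur;   /* block of partial forces currently passed on  */
  int prev;  /* preceding block on the ring                  */

  shift = nprocs/2;
  cur   = (pid + shift) % nprocs;   /* farthest block we hold */

  for (i=0; (i<shift); i++) {
    prev = (cur-1+nprocs) % nprocs;
    send       (right, f[start[cur]],  homenr[cur]);
    receive_add(left,  f[start[prev]], homenr[prev]);
    cur = prev;
  }
}
\end{verbatim}
\end{footnotesize}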

The data flow around the ring is visualized in \figref{ring}.
Note that because of the ring topology each processor automatically
gets the proper particles to interact with.
\begin{figure}
\centerline{\includegraphics[width=6cm]{plots/ring}}
\caption{Data flow in a ring of processors.}
\label{fig:ring}
\end{figure}

\section{Parallel Molecular Dynamics}
In this chapter we describe some details of the \swapindex{parallel}{MD}
algorithm used
in {\gromacs}. This also includes some information on neighbor searching and
a side excursion to parallel sorting.
Please note the following definition, used throughout this chapter:\\
{\bf Definition:} {\natom}: number of particles; {\nproc}: number of processors.\\
{\gromacs} employs two different grids: the neighbor searching grid ({\nsgrid})
and the combined charge/potential grid ({\fftgrid}), as will be described below.
To maximize the confusion,
these two grids are mapped onto a grid of processors when {\gromacs} runs on a
parallel computer.

\subsection{Domain decomposition}
Modern parallel computers, such as an IBM SP/2 or a Cray T3E,
consist of relatively small numbers of relatively fast scalar
processors (typically 8 to 256). The communication channels that are
available in hardware on these machines are not directly visible to
the programmer; a software layer (usually \normindex{MPI})
hides this, and makes communication from all processors to all
others possible. In contrast, in the {\gromacs} hardware~\cite{Berendsen95a}
only communication in a ring was available, {\ie} each processor could
communicate with its direct neighbors only.

It seems logical to map the computational box of an MD simulation
onto a 3D grid of
processors ({\eg} $4\times4\times4$ for a 64 processor system). This
ensures that most interactions that are local in space can be computed
with information from neighboring processors only. However, this means
that there have to be communication channels in 3 dimensions too, which
is not necessarily the case.
Although this may be overcome in software, such a mapping complicates the MD
software as well, without clear performance benefits on most
parallel computers.

Therefore we opt for a simple one-dimensional division scheme
for the computational box. Each processor gets a slab of this box in
the X dimension.
For the communication between processors this has two main advantages:
\begin{enumerate}
\item Simplicity of coding. Communication can only be to two neighbors
(called {\em left} and {\em right} in {\gromacs}).
\item Communication can usually be done in large chunks, which makes it
more efficient on most hardware platforms.
\end{enumerate}

Most interactions in molecular dynamics have in principle a
short-range character. Bonds, angles and dihedrals are guaranteed to
have the corresponding particles close in space.

\subsection{Domain decomposition for non-bonded forces}
For large parallel computers, domain decomposition is preferable over
particle decomposition, since it makes load
balancing easier. Without load balancing the scaling of the code is
rather poor. For this purpose, the computational box is divided into
{\nproc} slabs, where {\nproc} is equal to the number of processors.
There are multiple ways of dividing the box over processors, but since
the {\gromacs} code assumes a ring topology for the processors, it is
logical to cut the system into slabs along just one dimension, the X
dimension. The algorithm for neighbor searching then becomes:
\begin{enumerate}
\item   Make a list of charge group indices sorted on (increasing) X coordinate
(\figref{parsort}).
{\bf Note} that care must be taken to parallelize the sorting algorithm
as well. See \secref{parsort}.
\item   Divide this list into slabs, with each slab having the same number of
charge groups.
\item   Put the particles corresponding to the local slab on a 3D {\nsgrid} as
described in \secref{nsgrid}.
\item   Communicate the {\nsgrid} to neighboring processors (not necessarily to all
processors). The number of neighboring {\nsgrid} cells (N$_{gx}$) to
communicate is determined by the cut-off length $r_c$ according to
\beq
N_{gx} ~=~ \frac{r_c \nproc}{l_x}
\eeq
where $l_x$ is the box length in the slabbing direction.
\item   On each processor compute the neighbor list for all charge groups in
its slab using the normal grid-based neighbor searching.
\end{enumerate}

\begin{figure}
\centerline{\includegraphics[width=10cm]{plots/parsort}}
\caption[Index in the coordinate array.]{Index in the coordinate
array. The division in slabs is indicated by dashed lines.}
\label{fig:parsort}
\end{figure}

For homogeneous systems, this is close to optimal load balancing,
without actually doing load balancing. For inhomogeneous systems, such
as membranes or interfaces, the slabs should be perpendicular to the
interface; this way, each processor has ``a little bit of
everything''. The {\gromacs} utility program {\tt editconf} has an
option to rotate a whole computational box.

The following observations are important here:
\begin{itemize}
\item   Particles may diffuse from one slab to the other, therefore each processor
must hold coordinates for all particles all the time, and distribute forces
back to all processors as well.
\item   Velocities are kept on the ``home processor'' for each particle,
where the integration of Newton's equations is done.
\item   Fixed interaction lists (bonds, angles, etc.) are each kept
on a single processor. Since all processors have all
coordinates, it does not matter where interactions are
calculated. The division is actually done by the {\gromacs}
preprocessor {\tt grompp}, and care is taken that, as far as
possible, every processor gets the same number of bonded
interactions.
\end{itemize}

In all, this makes for a mixed particle decomposition/domain decomposition
scheme for the parallelization of the MD code. The communication costs are
four times higher than for the simple particle decomposition method
described in \secref{par}
(the whole coordinate and force arrays are communicated across the whole
ring, rather than half the arrays over half the ring).
However, for large numbers of processors the improved load balancing
easily compensates for this.

\subsection{Parallel \normindex{PPPM}}
A further reason for domain decomposition is the PPPM algorithm. This
algorithm works with a 3D Fast Fourier Transform. It employs a
discrete grid of dimensions (\nx,\ny,\nz), the {\fftgrid}. The
algorithm consists of five steps, each of which has to be
parallelized:
\begin{enumerate}
\item   Spreading charges on the {\fftgrid} to obtain the charge
distribution $\rho(\ve{r})$.
This involves the following sub-steps
(sketched in code after this enumeration):
\begin{enumerate}
\item[{\bf a.}] put the particle in the box
\item[{\bf b.}] find the {\fftgrid} cell in which the particle resides
\item[{\bf c.}] add the charge of the particle times the appropriate
weight factor (see \secref{pppm}) to
each of the 27 grid points ($3 \times 3 \times 3$).
\end{enumerate}
In the parallel case, the {\fftgrid}
must be filled on each processor with its
share of the particles, and subsequently the {\fftgrid}s of all processors
must be summed to find the total charge distribution. It may be clear that
this induces a large amount of unnecessary work, unless we use domain
decomposition. If each processor only has particles in a certain region
of space, it only has to calculate the charge distribution for
that region of space. Since {\gromacs} works with slabs, this means that
each processor fills the {\fftgrid} cells corresponding to its slab in
space, and addition of {\fftgrid}s need only be done for neighboring slabs.\\
To be more precise, the slab $x$ range for processor $i$ is defined as:
\beq
i\, \frac{l_x}{M} \le x <\, (i+1)\frac{l_x}{M}
\eeq
Particles with $x$ coordinates in this range will add to the charge
distribution on the following range
of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-1 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+2
\eeq
where trunc indicates the truncation of a real number to the largest integer
smaller than or equal to that real number.

\item   Doing the Fourier transform of the charge distribution $\rho(\ve{r})$
in parallel to obtain $\hat{\rho}(\ve{k})$. This is done using
the FFTW library (see \href{http://www.fftw.org}{www.fftw.org}),
which employs the MPI library for message passing
(note that there are also \swapindex{shared}{memory} versions
of the FFTW code).\\
This FFT algorithm actually uses slabs as well (good
thinking!). Each processor does 2D FFTs on its slab, and then
the whole {\fftgrid} is transposed {\em in place}
({\ie} without using extra memory). This means that after the
FFT the X and Y components are swapped. To complete the FFT,
this swapping should in principle be undone (by transposing
back). Happily the FFTW code has an option to omit this,
which we use in the next step.
\item   Convolute $\hat{\rho}(\ve{k})$ with the Fourier transform of the
charge spread function $\hat{g}(\ve{k})$ (which we have tabulated before)
to obtain the potential $\hat{\phi}(\ve{k})$.
As an optimization, we store $\hat{g}(\ve{k})$ in transposed form
as well, matching the transposed form of $\hat{\rho}(\ve{k})$
which we get from the FFTW routine. After this step we have the
potential $\hat{\phi}(\ve{k})$ in Fourier space, but still on the
transposed {\fftgrid}.
\item   Do an inverse transform of $\hat{\phi}(\ve{k})$ to obtain
${\phi}(\ve{r})$. Since the algorithm must do a transpose of the data,
this step actually yields the wanted result: the un-transposed
potential in real space.
\item   Interpolate the potential ${\phi}(\ve{r})$ in real space at the
particle positions to obtain forces and energy. For this step the same
considerations about parallelism hold as for the charge spreading.
However, in this case more neighboring grid cells are needed, implying
that we need the following set of {\fftgrid} slabs in the $x$ direction:
\beq
{\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-3 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+4
\eeq
\end{enumerate}
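
The charge-spreading sub-steps of step 1 can be sketched as follows for
a single particle. This is an illustrative fragment only: the grid
dimensions {\tt NX}, {\tt NY}, {\tt NZ}, the box lengths and the
one-dimensional weight function {\tt W()} (the weight factors of
\secref{pppm}, which in practice also depend on the position of the
particle within its cell) are assumed, and the periodic wrapping of the
indices is made explicit:
\begin{footnotesize}
\begin{verbatim}
/* Spread charge q of a particle at (x,y,z), already put in the box (a),
 * onto the 27 fftgrid points around its cell. W(d) is an assumed weight
 * factor for a grid offset d = -1, 0, +1. */
void spread_charge(double q, double x, double y, double z,
                   double lx, double ly, double lz,
                   double grid[NX][NY][NZ])
{
  int ix, iy, iz;   /* cell in which the particle resides (b) */
  int dx, dy, dz;   /* offsets to the neighboring cells       */
  int jx, jy, jz;   /* periodically wrapped grid indices      */

  ix = (int)(x*NX/lx);
  iy = (int)(y*NY/ly);
  iz = (int)(z*NZ/lz);

  for (dx = -1; dx <= 1; dx++)
    for (dy = -1; dy <= 1; dy++)
      for (dz = -1; dz <= 1; dz++) {
        jx = (ix+dx+NX) % NX;
        jy = (iy+dy+NY) % NY;
        jz = (iz+dz+NZ) % NZ;
        grid[jx][jy][jz] += q*W(dx)*W(dy)*W(dz);   /* (c) */
      }
}
\end{verbatim}
\end{footnotesize}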

The algorithm as sketched above requires communication for spreading
the charges, for the forward and backward FFTs, and for interpolating
the forces. The {\gromacs} parts of the program use only left and
right communication, {\ie} two communication channels. The FFTW
routines actually use other forms of communication as well, and they
are coded with MPI calls for message passing. This implies
that {\gromacs} can only perform the PPPM algorithm on parallel
computers that support MPI. However, most
\swapindex{shared}{memory} computers, such as the SGI Origin, also
support MPI using the
shared memory for communication.

\subsection{Parallel sorting}
\label{sec:parsort}
For the domain decomposition part of {\gromacs} it is necessary to sort the
coordinates (or rather the index to the coordinates) every time a
neighbor list is made.
If we use brute force, and sort all coordinates on each processor (which is
technically possible since we have all the coordinates), then this sorting
procedure will take a constant wall clock time, proportional to
{\natom}$\log${\natom},
regardless of the number of processors. We can however do a little
better, if we assume that particles diffuse only slowly.
A parallel sorting algorithm can be conceived as follows: \\
At the first step of the simulation
\begin{enumerate}
\item   Do a full sort of all indices using {\eg} the Quicksort algorithm that is
built into the standard C library (a minimal sketch is given after this list).
\item   Divide the sorted array into slabs (as described above, see
\figref{parsort}).
\end{enumerate}
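
The initial full sort might look as follows. This is a hypothetical
sketch, not the {\gromacs} routine: {\tt index[]} holds the particle
indices to be sorted and {\tt xc} points to the X coordinates; the
comparison function is passed to the standard C library routine
{\tt qsort()}:
\begin{footnotesize}
\begin{verbatim}
#include <stdlib.h>

static const double *xc;   /* X coordinates used by the comparison */

static int compare_x(const void *a, const void *b)
{
  double d = xc[*(const int *)a] - xc[*(const int *)b];

  return (d < 0) ? -1 : ((d > 0) ? 1 : 0);
}

/* Sort natoms particle indices on increasing X coordinate */
void sort_indices(int natoms, int index[], const double x[])
{
  xc = x;
  qsort(index, natoms, sizeof(index[0]), compare_x);
}
\end{verbatim}
\end{footnotesize}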

At subsequent steps of the simulation:
\begin{enumerate}
\item   Send the indices for each processor to the preceding processor (if
not processor 0) and to the next processor (if not processor {\nproc}-1). The
communication associated with this operation is proportional to
2{\natom}/{\nproc}.
\item   Sort the combined indices of the three (or two) processors. Note that
the CPU time associated with sorting is now proportional to
(3{\natom}/{\nproc})$\log$(3{\natom}/{\nproc}).
\item   On each processor, the indices belonging to its slab can be determined
from the order of the array (\figref{parsort}).
\end{enumerate}

%\section{A Worked Example}
%Suppose our box size is 4.0 nm and the cut-off is 0.8 nm. For neighborsearching
%we use {\dgrid} = 2, such that there are 10 {\nsgrid} cells.

% LocalWords:  Parallelization parallelization parallelize Fortran vectorizing
% LocalWords:  parallelized vectorize Amdahl's MPI ij ji decomp mdpar nprocs gx
% LocalWords:  pid rvec vecs homenr dihedrals indices parsort nsgrid editconf
% LocalWords:  preprocessor grompp PPPM pppm trunc FFTW FFT FFTs Convolute un
% LocalWords:  SGI Quicksort formulae