From 617038e21d776d74a7b4e2d36709b331f491788a Mon Sep 17 00:00:00 2001
From: Mark Abraham
Date: Tue, 15 Jul 2014 16:40:19 +0200
Subject: [PATCH] New and reorganized documentation

Covers more mdrun options, and moves a bit of "practical" content from
the reference manual to the user guide.

Imported and updated information from the wiki page on cutoff schemes,
and consolidated it with information from the reference manual.

Updated some uses of "atom" to "particle" in both guides.

We could update the performance numbers, but with the impending
removal of the group scheme, I don't think that's worth bothering
about. E.g. on Haswell, Erik has already found that the group scheme
is a bit slower than Verlet, even for unbuffered water systems.

Change-Id: I6410ba9fc08bb133ec8669e14dba11bcbd454fe3
---
 docs/manual/algorithms.tex            | 112 ++++++------------------
 docs/user-guide/cutoff-schemes.rst    | 159 +++++++++++++++++++++++++++++++++-
 docs/user-guide/mdp-options.rst       |  44 +++++-----
 docs/user-guide/mdrun-features.rst    |  11 +++
 docs/user-guide/mdrun-performance.rst |  53 +++++++++---
 5 files changed, 258 insertions(+), 121 deletions(-)

diff --git a/docs/manual/algorithms.tex b/docs/manual/algorithms.tex
index ac864521a4..7ea3e25879 100644
--- a/docs/manual/algorithms.tex
+++ b/docs/manual/algorithms.tex
@@ -472,9 +472,9 @@ of $j$ is less than a given cut-off radius $R_c$. Some
 of the particle pairs that fulfill this criterion are excluded, when their
 interaction is already fully accounted for by bonded interactions.
 {\gromacs} employs a {\em pair list} that contains those particle pairs for which
-non-bonded forces must be calculated. The pair list contains atoms
-$i$, a displacement vector for atom $i$, and all particles $j$ that
-are within \verb'rlist' of this particular image of atom $i$. The
+non-bonded forces must be calculated. The pair list contains particles
+$i$, a displacement vector for particle $i$, and all particles $j$ that
+are within \verb'rlist' of this particular image of particle $i$. The
 list is updated every \verb'nstlist' steps, where \verb'nstlist' is
 typically 10. There is an option to calculate the total non-bonded
 force on each particle due to all particle in a shell around the
@@ -491,7 +491,7 @@ $O(N^2)$ algorithm is still available under some conditions.
 \subsubsection{\normindex{Cut-off schemes}: group versus Verlet}
 
 From version 4.6, {\gromacs} supports two different cut-off scheme
-setups: the original one based on atom groups and one using a Verlet
+setups: the original one based on particle groups and one using a Verlet
 buffer. There are some important differences that affect results,
 performance and feature support. The group scheme can be made to work
 (almost) like the Verlet scheme, but this will lead to a decrease in
@@ -500,27 +500,28 @@ which are abundant in many simulations, but on the most recent x86
 processors, this advantage is negated by the better instruction-level
 parallelism available in the Verlet-scheme implementation. The group
 scheme is deprecated in version 5.0, and will be removed in a future
-version.
+version. For practical details of choosing and setting up
+cut-off schemes, please see the User Guide.
 
 In the group scheme, a neighbor list is generated consisting of pairs
-of groups of at least one atom. These groups were originally
+of groups of at least one particle. These groups were originally
 \swapindex{charge}{group}s \ifthenelse{\equal{\gmxlite}{1}}{}{(see
 \secref{chargegroup})}, but with a proper treatment of long-range
-electrostatics, performance is their only advantage. A pair of groups
+electrostatics, performance in unbuffered simulations is their only advantage. A pair of groups
 is put into the neighbor list when their center of geometry is within
-the cut-off distance. Interactions between all atom pairs (one from
+the cut-off distance. Interactions between all particle pairs (one from
 each charge group) are calculated for a certain number of MD steps,
 until the neighbor list is updated. This setup is efficient, as the
-neighbor search only checks distance between charge group pair, not
-atom pairs (saves a factor of $3 \times 3 = 9$ with a three-atom water
+neighbor search only checks distances between charge-group pairs, not
+particle pairs (saves a factor of $3 \times 3 = 9$ with a three-particle water
 model) and the non-bonded force kernels can be optimized for, say, a
 water molecule ``group''. Without explicit buffering, this setup leads
-to energy drift as some atom pairs which are within the cut-off don't
+to energy drift as some particle pairs that are within the cut-off don't
 interact and some outside the cut-off do interact. This can be caused
 by
 \begin{itemize}
-\item atoms moving across the cut-off between neighbor search steps, and/or
-\item for charge groups consisting of more than one atom, atom pairs
+\item particles moving across the cut-off between neighbor search steps, and/or
+\item for charge groups consisting of more than one particle, particle pairs
 moving in/out of the cut-off when their charge group center of
 geometry distance is outside/inside of the cut-off.
 \end{itemize}
@@ -530,18 +531,18 @@ artifacts are depends on the system, the properties in which you are
 interested, and the cut-off setup.
 
 The Verlet cut-off scheme uses a buffered pair list by default. It
-also uses clusters of atoms, but these are not static as in the group
+also uses clusters of particles, but these are not static as in the group
 scheme. Rather, the clusters are defined spatially and consist of 4 or
-8 atoms, which is convenient for stream computing, using e.g. SSE, AVX
+8 particles, which is convenient for stream computing, using e.g. SSE, AVX
 or CUDA on GPUs. At neighbor search steps, a pair list is created
 with a Verlet buffer, ie. the pair-list cut-off is larger than the
-interaction cut-off. In the non-bonded force kernels, forces are only
-added when an atom pair is within the cut-off distance at that
-particular time step. This ensures that as atoms move between pair
-search steps, forces between nearly all atoms within the cut-off
-distance are calculated. We say {\em nearly} all atoms, because
+interaction cut-off. In the non-bonded kernels, interactions are only
+computed when a particle pair is within the cut-off distance at that
+particular time step. This ensures that as particles move between pair
+search steps, forces between nearly all particles within the cut-off
+distance are calculated. We say {\em nearly} all particles, because
 {\gromacs} uses a fixed pair list update frequency for
-efficiency. An atom-pair, whose distance was outside the cut-off,
+efficiency. A particle pair whose distance was outside the cut-off
 could possibly move enough during this fixed number of steps that its
 distance is now within the cut-off.
 This small chance results in a small energy drift, and the size of the
@@ -549,76 +550,17 @@ chance depends on the temperature. When temperature coupling is used,
 the buffer size can be determined automatically, given a certain
 tolerance on the energy drift.
 
-The {\tt mdp} file settings specific to the Verlet scheme are:
-\begin{verbatim}
-cutoff-scheme = Verlet
-verlet-buffer-tolerance = 0.005
-\end{verbatim}
-The Verlet buffer size is determined from the latter option, which is
-by default set to 0.005 kJ/mol/ps pair energy error per atom. Note that
-errors in pair energies cancel and the effect on the total energy drift
-is usually at least an order of magnitude smaller than the tolerance.
-Furthermore, the drift of the total energy is affected by many other
-factors; often, the contribution from the constraint algorithm dominates.
-
-For constant-energy (NVE) simulations, the buffer size will be
-inferred from the temperature that corresponds to the velocities
-(either those generated, if applicable, or those found in the input
-configuration). Alternatively, the tolerance can be set to -1 and a
-buffer set manually by specifying {\tt rlist} $>$ {\tt max(rcoulomb,
- rvdw)}. The simplest way to get a reasonable buffer size is to use
-an NVT {\tt mdp} file with the target temperature set to what you
-expect in your NVE simulation, and transfer the buffer size printed by
-{\tt grompp} to your NVE {\tt mdp} file.
-
 The Verlet cut-off scheme is implemented in a very efficient fashion
 based on clusters of particles. The simplest example is a cluster
 size of 4 particles. The pair list is then constructed based on
 cluster pairs. The cluster-pair search is much faster than searching
 based on particle pairs, because $4 \times 4 = 16$ particle pairs are
 put in the list at once. The non-bonded force calculation kernel can then
-calculate all 16 particle-pair interactions at once, which maps nicely
-to SIMD units which can perform multiple floating operations at once
-(e.g. SSE, AVX, CUDA on GPUs, BlueGene FPUs). These non-bonded kernels
+calculate many particle-pair interactions at once, which maps nicely
+to SIMD or SIMT units on modern hardware, which can perform multiple
+floating-point operations at once. These non-bonded kernels
 are much faster than the kernels used in the group scheme for most
-types of systems, except for water molecules on processors with short
-SIMD widths when not using a buffered pair list. This latter case is
-common for (bio-)molecular simulations, so for greatest speed, it is
-worth comparing the performance of both schemes.
-
-As the Verlet cut-off scheme was introduced in version 4.6, not
-all features of the group scheme are supported yet. The Verlet scheme
-supports a few new features which the group scheme does not support.
-A list of features not (fully) supported in both cut-off schemes is
-given in \tabref{cutoffschemesupport}.
-
-\begin{table}
-\centerline{
-\begin{tabular}{|l|c|c|}
-\dline
-Non-bonded interaction feature & group & Verlet \\
-\dline
-unbuffered cut-off scheme & $\surd$ & not by default \\
-exact cut-off & shift/switch & $\surd$ \\
-shifted interactions & force+energy & energy \\
-switched potential & $\surd$ & $\surd$ \\
-switched forces & $\surd$ & $\surd$ \\
-non-periodic systems & $\surd$ & Z + walls \\
-implicit solvent & $\surd$ & \\
-free energy perturbed non-bondeds & $\surd$ & $\surd$ \\
-group energy contributions & $\surd$ & CPU (not on GPU) \\
-energy group exclusions & $\surd$ & \\
-AdResS multi-scale & $\surd$ & \\
-OpenMP multi-threading & only PME & $\surd$ \\
-native GPU support & & $\surd$ \\
-Lennard-Jones PME & $\surd$ & $\surd$ \\
-\dline
-\end{tabular}
-}
-\caption{Differences (only) in the support of non-bonded features
- between the group and Verlet cut-off schemes.}
-\label{tab:cutoffschemesupport}
-\end{table}
+types of systems, particularly on newer hardware.
 
 \ifthenelse{\equal{\gmxlite}{1}}{}{
 \subsubsection{Energy drift and pair-list buffering}
@@ -632,7 +574,7 @@ displacements and the shape of the potential at the cut-off.
 The displacement distribution along one dimension for a freely moving
 particle with mass $m$ over time $t$ at temperature $T$ is Gaussian
 with zero mean and variance $\sigma^2 = t\,k_B T/m$. For the distance
-between two atoms, the variance changes to $\sigma^2 = \sigma_{12}^2 =
+between two particles, the variance changes to $\sigma^2 = \sigma_{12}^2 =
 t\,k_B T(1/m_1+1/m_2)$. Note that in practice particles usually
 interact with other particles over time $t$ and therefore the real
 displacement distribution is much narrower. Given a non-bonded
diff --git a/docs/user-guide/cutoff-schemes.rst b/docs/user-guide/cutoff-schemes.rst
index bcae2d7edf..5cdfdfd6c0 100644
--- a/docs/user-guide/cutoff-schemes.rst
+++ b/docs/user-guide/cutoff-schemes.rst
@@ -1,4 +1,161 @@
 Non-bonded cut-off schemes
 ==========================
 
-TODO in future patch
+The default cut-off scheme in |Gromacs| |version| is based on classical
+buffered Verlet lists. These are implemented extremely efficiently
+on modern CPUs and accelerators, and support nearly all of the
+algorithms used in |Gromacs|.
+
+Before version 4.6, |Gromacs| always used pair-lists based on groups
+of particles. These groups of particles were originally charge
+groups, which were necessary with plain cut-off electrostatics. With
+the use of PME (or reaction-field with a buffer), charge groups are
+no longer necessary (and are ignored in the Verlet scheme). In
+|Gromacs| 4.6 and later, the group-based cut-off scheme is still
+available, but is **deprecated in 5.0 and 5.1**. It is still
+available mainly for backwards compatibility, to support the
+algorithms that have not yet been converted, and for the few cases
+where it may allow faster simulations with bio-molecular systems
+dominated by water.
+
+Without PME, the group cut-off scheme should generally be combined
+with a buffered pair-list to help avoid artefacts. However, the
+group-scheme kernels that can implement this are much slower than
+either the unbuffered group-scheme kernels or the buffered
+Verlet-scheme kernels. Use of the Verlet scheme is strongly encouraged
+for all kinds of simulations, because it is easier and faster to run
+correctly. In particular, GPU acceleration is available only with the
+Verlet scheme.
+
+The Verlet scheme uses properly buffered lists with exact cut-offs.
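+For example, the scheme and its buffer are selected with mdp settings
+like these (0.005 kJ/mol/ps is the default tolerance)::
+
+    cutoff-scheme           = Verlet
+    verlet-buffer-tolerance = 0.005
+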
+The size of the buffer is chosen with :mdp:`verlet-buffer-tolerance`
+to permit a certain level of drift. Both the LJ and Coulomb potentials
+are shifted to zero by subtracting their value at the cut-off. This
+ensures that the energy is the integral of the force. Still, it is
+advisable to have small forces at the cut-off, and hence to use PME or
+reaction-field with infinite epsilon.
+
+Non-bonded scheme feature comparison
+------------------------------------
+
+All |Gromacs| |version| features not directly related to non-bonded
+interactions are supported in both schemes. Eventually, all non-bonded
+features will be supported in the Verlet scheme. A table describing
+the compatibility of the non-bonded features with the two schemes is
+given below.
+
+Table: Support levels within the group and Verlet cut-off schemes
+for features related to non-bonded interactions
+
+==================================== ============ =======
+Feature                              group        Verlet
+==================================== ============ =======
+unbuffered cut-off scheme            default      not by default
+exact cut-off                        shift/switch always
+potential-shift interactions         yes          yes
+potential-switch interactions        yes          yes
+force-switch interactions            yes          yes
+non-periodic systems                 yes          Z + walls
+implicit solvent                     yes          no
+free energy perturbed non-bondeds    yes          yes
+energy group contributions           yes          only on CPU
+energy group exclusions              yes          no
+AdResS multi-scale                   yes          no
+OpenMP multi-threading               only PME     all
+native GPU support                   no           yes
+Coulomb PME                          yes          yes
+Lennard-Jones PME                    yes          yes
+virtual sites                        yes          yes
+User-supplied tabulated interactions yes          no
+Buckingham VdW interactions          yes          no
+rcoulomb != rvdw                     yes          no
+twin-range                           yes          no
+==================================== ============ =======
+
+Performance
+-----------
+
+The performance of the group cut-off scheme depends very much on the
+composition of the system and the use of buffering. There are
+optimized kernels for interactions with water, so anything with a lot
+of water runs very fast. But if you want properly buffered
+interactions, you need to add a buffer that takes into account both
+charge-group size and diffusion, and to check each interaction against
+the cut-off length at each time step. This makes simulations much
+slower. The performance of the Verlet scheme with the new non-bonded
+kernels is independent of system composition, and the scheme is
+intended to always run with a buffered pair-list. Typically, the
+buffer size is 0 to 10% of the cut-off, so you could win a bit of
+performance by reducing or removing the buffer, but this might not be
+a good trade-off against simulation quality.
+
+The table below shows a performance comparison of most of the relevant
+setups. Any atomistic model will have performance comparable to tips3p
+(which has LJ on the hydrogens), unless a united-atom force field is
+used. The performance of a protein in water will be between the tip3p
+and tips3p performance. The group scheme is optimized for water
+interactions, which means a single charge group containing one particle
+with LJ, and 2 or 3 particles without LJ. Kernels for such water models
+are roughly twice as fast as those for a comparable system with LJ
+and/or without charge groups. The implementation of the Verlet cut-off
+scheme has no interaction-specific optimizations, except for only
+calculating half of the LJ interactions if fewer than half of the
+particles have LJ.
+For molecules solvated in water, the scaling of the Verlet scheme to
+higher numbers of cores is better than that of the group scheme,
+because the load is more balanced. On the most recent Intel CPUs, the
+absolute performance of the Verlet scheme exceeds that of the group
+scheme, even for water-only systems.
+
+Table: Performance in ns/day of various water systems under different
+non-bonded setups in |Gromacs| using either 8 thread-MPI ranks (group
+scheme) or 8 OpenMP threads (Verlet scheme). 3000 particles, 1.0 nm
+cut-off, PME with 0.11 nm grid, dt=2 fs, Intel Core i7 2600 (AVX), 3.4
+GHz + Nvidia GTX660Ti
+
+======================== ================= =============== ================ =====================
+system                   group, unbuffered group, buffered Verlet, buffered Verlet, buffered, GPU
+======================== ================= =============== ================ =====================
+tip3p, charge groups     208               116             170              450
+tips3p, charge groups    129               63              162              450
+tips3p, no charge groups 104               75              162              450
+======================== ================= =============== ================ =====================
+
+How to use the Verlet scheme
+----------------------------
+
+The Verlet scheme is the default, selected with the option
+:mdp:`cutoff-scheme`. The value of the mdp option
+:mdp:`verlet-buffer-tolerance` will add a pair-list buffer whose size
+is tuned for the given energy drift (in kJ/mol/ns per particle). The
+effective drift is usually much lower, as :ref:`gmx grompp` assumes
+constant particle velocities. (Note that in single precision,
+constraints cause a drift of around 0.0001 kJ/mol/ns per particle for
+normal atomistic simulations, so it does not make sense to go much
+lower.) Details on how the buffer size is chosen can be found in the
+reference below and in the Reference Manual.
+
+For constant-energy (NVE) simulations, the buffer size will be
+inferred from the temperature that corresponds to the velocities
+(either those generated, if applicable, or those found in the input
+configuration). Alternatively, :mdp:`verlet-buffer-tolerance` can be
+set to -1 and a buffer set manually by specifying :mdp:`rlist` greater
+than the larger of :mdp:`rcoulomb` and :mdp:`rvdw`. The simplest way
+to get a reasonable buffer size is to use an NVT mdp file with the
+target temperature set to what you expect in your NVE simulation, and
+transfer the buffer size printed by :ref:`gmx grompp` to your NVE mdp
+file.
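+
+For example, a minimal sketch of NVE settings with a manually chosen
+buffer (the 1.0 nm cut-offs and the 0.1 nm buffer are illustrative,
+not recommendations)::
+
+    cutoff-scheme           = Verlet
+    ; disable automatic buffer sizing and set it by hand via rlist
+    verlet-buffer-tolerance = -1
+    rcoulomb                = 1.0
+    rvdw                    = 1.0
+    rlist                   = 1.1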
+
+When a GPU is used, :mdp:`nstlist` is automatically increased by
+mdrun, usually to 20 or more; :mdp:`rlist` is increased along with it
+to stay below the target energy drift. Further information on running
+mdrun with GPUs is available in the mdrun performance section of this
+guide.
+
+Further information
+-------------------
+
+For further information on algorithmic and implementation details of
+the Verlet cut-off scheme and the MxN kernels, as well as detailed
+performance analysis, please consult the following article:
+
+Páll, S. and Hess, B. A flexible algorithm for calculating pair
+interactions on SIMD architectures. Comput. Phys. Commun. 184,
+2641–2650 (2013).
diff --git a/docs/user-guide/mdp-options.rst b/docs/user-guide/mdp-options.rst
index 07b3688b02..5129754f26 100644
--- a/docs/user-guide/mdp-options.rst
+++ b/docs/user-guide/mdp-options.rst
@@ -555,32 +555,32 @@ Neighbor searching
 
    (0.005) \[kJ/mol/ps\]
 
-   Useful only with the :mdp:`Verlet` :mdp:`cutoff-scheme`. This
-   sets the maximum allowed error for pair interactions per particle
-   caused by the Verlet buffer, which indirectly sets
-   :mdp:`rlist`. As both :mdp:`nstlist` and the Verlet buffer size
-   are fixed (for performance reasons), particle pairs not in the pair
-   list can occasionally get within the cut-off distance during
+   Useful only with the :mdp:`Verlet` :mdp:`cutoff-scheme`. This sets
+   the maximum allowed error for pair interactions per particle caused
+   by the Verlet buffer, which indirectly sets :mdp:`rlist`. As both
+   :mdp:`nstlist` and the Verlet buffer size are fixed (for
+   performance reasons), particle pairs not in the pair list can
+   occasionally get within the cut-off distance during
    :mdp:`nstlist` -1 steps. This causes very small jumps in the
    energy. In a constant-temperature ensemble, these very small energy
    jumps can be estimated for a given cut-off and :mdp:`rlist`. The
    estimate assumes a homogeneous particle distribution, hence the
    errors might be slightly underestimated for multi-phase
-   systems. For longer pair-list life-time (:mdp:`nstlist` -1) *
-   :mdp:`dt` the buffer is overestimated, because the interactions
-   between particles are ignored. Combined with cancellation of
-   errors, the actual drift of the total energy is usually one to two
-   orders of magnitude smaller. Note that the generated buffer size
-   takes into account that the |Gromacs| pair-list setup leads to a
-   reduction in the drift by a factor 10, compared to a simple
-   particle-pair based list. Without dynamics (energy minimization
-   etc.), the buffer is 5% of the cut-off. For NVE simulations the
-   initial temperature is used, unless this is zero, in which case a
-   buffer of 10% is used. For NVE simulations the tolerance usually
-   needs to be lowered to achieve proper energy conservation on the
-   nanosecond time scale. To override the automated buffer setting,
-   use :mdp:`verlet-buffer-tolerance` =-1 and set :mdp:`rlist`
-   manually.
+   systems (see the `reference manual`_ for details). For a longer
+   pair-list life-time, (:mdp:`nstlist` -1) * :mdp:`dt`, the buffer is
+   overestimated, because the estimate ignores interactions between
+   particles. Combined with cancellation of errors, the actual drift
+   of the total energy is usually one to two orders of magnitude
+   smaller. Note that the generated buffer size takes into account
+   that the |Gromacs| pair-list setup leads to a reduction in the
+   drift by a factor of 10, compared to a simple particle-pair based
+   list. Without dynamics (energy minimization etc.), the buffer is
+   5% of the cut-off. For NVE simulations the initial temperature is
+   used, unless this is zero, in which case a buffer of 10% is
+   used. For NVE simulations the tolerance usually needs to be
+   lowered to achieve proper energy conservation on the nanosecond
+   time scale. To override the automated buffer setting, use
+   :mdp:`verlet-buffer-tolerance` =-1 and set :mdp:`rlist` manually.
 
 .. mdp:: rlist
@@ -2866,3 +2866,5 @@ User defined thingies
 These you can use if you modify code. You can pass integers and reals
 and groups to your subroutine. Check the inputrec definition in
 ``src/gromacs/legacyheaders/types/inputrec.h``
+
+.. _reference manual: gmx-manual-parent-dir_
diff --git a/docs/user-guide/mdrun-features.rst b/docs/user-guide/mdrun-features.rst
index ede2590d4b..1dc811f629 100644
--- a/docs/user-guide/mdrun-features.rst
+++ b/docs/user-guide/mdrun-features.rst
@@ -120,3 +120,14 @@ input files. The random seed for replica exchange is set with
 ``-reseed``. After every exchange, the velocities are scaled and
 neighbor searching is performed. See the Reference Manual for more
 details on how replica exchange functions in GROMACS.
+
+Controlling the length of the simulation
+----------------------------------------
+
+Normally, the length of an MD simulation is best managed through the
+mdp option :mdp:`nsteps`; however, there are situations where more
+control is useful. ``mdrun -nsteps 100`` overrides the mdp file and
+executes 100 steps. ``mdrun -maxh 2.5`` will terminate the simulation
+shortly before 2.5 hours elapse, which can be useful when running
+under cluster queues (as long as the queueing system never suspends
+the simulation).
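+
+For example (a minimal sketch: ``gmx mdrun`` is the usual launcher,
+and the ``-s topol.tpr`` input name is illustrative)::
+
+    # override nsteps from the mdp/tpr input and run 100 steps
+    gmx mdrun -s topol.tpr -nsteps 100
+
+    # stop shortly before 2.5 wall-clock hours have elapsed
+    gmx mdrun -s topol.tpr -maxh 2.5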
diff --git a/docs/user-guide/mdrun-performance.rst b/docs/user-guide/mdrun-performance.rst
index cba6ff2771..df22386f8d 100644
--- a/docs/user-guide/mdrun-performance.rst
+++ b/docs/user-guide/mdrun-performance.rst
@@ -112,7 +112,7 @@ see the Reference Manual. The most important of these are
 members of its domain. A GPU may perform work for more
 than one PP rank, but it is normally most efficient to use a single
 PP rank per GPU and for that rank to have thousands of
-atoms. When the work of a PP rank is done on the CPU, mdrun
+particles. When the work of a PP rank is done on the CPU, mdrun
 will make extensive use of the SIMD capabilities of the core. There
 are various `command-line options