From 617038e21d776d74a7b4e2d36709b331f491788a Mon Sep 17 00:00:00 2001
From: Mark Abraham
Date: Tue, 15 Jul 2014 16:40:19 +0200
Subject: [PATCH] New and reorganized documentation

Covers more mdrun options, and moves a bit of "practical" content from
the reference manual to the user guide.

Imported and updated information from the wiki page on cutoff schemes,
and consolidated it with information from the reference manual.

Updated some uses of "atom" to "particle" in both guides.

We could update the performance numbers, but with the impending
removal of the group scheme, I don't think that's worth bothering
about. E.g. on Haswell, Erik has already found that the group scheme
is a bit slower than Verlet, even for unbuffered water systems.

Change-Id: I6410ba9fc08bb133ec8669e14dba11bcbd454fe3
---
 docs/manual/algorithms.tex            | 112 ++++++------------------
 docs/user-guide/cutoff-schemes.rst    | 159 +++++++++++++++++++++++++++++++++-
 docs/user-guide/mdp-options.rst       |  44 +++++-----
 docs/user-guide/mdrun-features.rst    |  11 +++
 docs/user-guide/mdrun-performance.rst |  53 +++++++++---
 5 files changed, 258 insertions(+), 121 deletions(-)

diff --git a/docs/manual/algorithms.tex b/docs/manual/algorithms.tex
index ac864521a4..7ea3e25879 100644
--- a/docs/manual/algorithms.tex
+++ b/docs/manual/algorithms.tex
@@ -472,9 +472,9 @@ of $j$ is less than a given cut-off radius $R_c$. Some
 of the particle pairs that fulfill this criterion are excluded, when their
 interaction is already fully accounted for by bonded interactions.
 {\gromacs} employs a {\em pair list} that contains those particle pairs for which
-non-bonded forces must be calculated. The pair list contains atoms
-$i$, a displacement vector for atom $i$, and all particles $j$ that
-are within \verb'rlist' of this particular image of atom $i$. The
+non-bonded forces must be calculated. The pair list contains particles
+$i$, a displacement vector for particle $i$, and all particles $j$ that
+are within \verb'rlist' of this particular image of particle $i$. The
 list is updated every \verb'nstlist' steps, where \verb'nstlist' is
 typically 10. There is an option to calculate the total non-bonded
 force on each particle due to all particle in a shell around the
@@ -491,7 +491,7 @@ $O(N^2)$ algorithm is still available under some conditions.
 \subsubsection{\normindex{Cut-off schemes}: group versus Verlet}
 
 From version 4.6, {\gromacs} supports two different cut-off scheme
-setups: the original one based on atom groups and one using a Verlet
+setups: the original one based on particle groups and one using a Verlet
 buffer. There are some important differences that affect results,
 performance and feature support. The group scheme can be made to work
 (almost) like the Verlet scheme, but this will lead to a decrease in
@@ -500,27 +500,28 @@ which are abundant in many simulations, but on the most recent x86
 processors, this advantage is negated by the better instruction-level
 parallelism available in the Verlet-scheme implementation. The group
 scheme is deprecated in version 5.0, and will be removed in a future
-version.
+version. For practical details of choosing and setting up
+cut-off schemes, please see the User Guide.
 
 In the group scheme, a neighbor list is generated consisting of pairs
-of groups of at least one atom. These groups were originally
+of groups of at least one particle. These groups were originally
 \swapindex{charge}{group}s \ifthenelse{\equal{\gmxlite}{1}}{}{(see
 \secref{chargegroup})}, but with a proper treatment of long-range
-electrostatics, performance is their only advantage. A pair of groups
+electrostatics, performance in unbuffered simulations is their only advantage. A pair of groups
 is put into the neighbor list when their center of geometry is within
-the cut-off distance. Interactions between all atom pairs (one from
+the cut-off distance. Interactions between all particle pairs (one from
 each charge group) are calculated for a certain number of MD steps,
 until the neighbor list is updated. This setup is efficient, as the
-neighbor search only checks distance between charge group pair, not
-atom pairs (saves a factor of $3 \times 3 = 9$ with a three-atom water
+neighbor search only checks distances between charge-group pairs, not
+particle pairs (saves a factor of $3 \times 3 = 9$ with a three-particle water
 model) and the non-bonded force kernels can be optimized for, say, a
 water molecule ``group''. Without explicit buffering, this setup leads
-to energy drift as some atom pairs which are within the cut-off don't
+to energy drift as some particle pairs that are within the cut-off don't
 interact and some outside the cut-off do interact. This can be caused
 by
 \begin{itemize}
-\item atoms moving across the cut-off between neighbor search steps, and/or
-\item for charge groups consisting of more than one atom, atom pairs
+\item particles moving across the cut-off between neighbor search steps, and/or
+\item for charge groups consisting of more than one particle, particle pairs
 moving in/out of the cut-off when their charge group center of
 geometry distance is outside/inside of the cut-off.
 \end{itemize}
@@ -530,18 +531,18 @@ artifacts are depends on the system, the properties in which you are
 interested, and the cut-off setup.
 
 The Verlet cut-off scheme uses a buffered pair list by default. It
-also uses clusters of atoms, but these are not static as in the group
+also uses clusters of particles, but these are not static as in the group
 scheme. Rather, the clusters are defined spatially and consist of 4 or
-8 atoms, which is convenient for stream computing, using e.g. SSE, AVX
+8 particles, which is convenient for stream computing, using e.g. SSE, AVX
 or CUDA on GPUs. At neighbor search steps, a pair list is created
 with a Verlet buffer, ie. the pair-list cut-off is larger than the
-interaction cut-off. In the non-bonded force kernels, forces are only
-added when an atom pair is within the cut-off distance at that
-particular time step. This ensures that as atoms move between pair
-search steps, forces between nearly all atoms within the cut-off
-distance are calculated. We say {\em nearly} all atoms, because
+interaction cut-off. In the non-bonded kernels, interactions are only
+computed when a particle pair is within the cut-off distance at that
+particular time step. This ensures that as particles move between pair
+search steps, forces between nearly all particles within the cut-off
+distance are calculated. We say {\em nearly} all particles, because
 {\gromacs} uses a fixed pair list update frequency for
-efficiency. An atom-pair, whose distance was outside the cut-off,
+efficiency. A particle pair whose distance was outside the cut-off
 could possibly move enough during this fixed number of steps that its
 distance is now within the cut-off.
 This small chance results in a small energy drift, and the size of the
@@ -549,76 +550,17 @@ chance depends on the temperature. When temperature coupling is used,
 the buffer size can be determined automatically, given a certain
 tolerance on the energy drift.
 
-The {\tt mdp} file settings specific to the Verlet scheme are:
-\begin{verbatim}
-cutoff-scheme = Verlet
-verlet-buffer-tolerance = 0.005
-\end{verbatim}
-The Verlet buffer size is determined from the latter option, which is
-by default set to 0.005 kJ/mol/ps pair energy error per atom. Note that
-errors in pair energies cancel and the effect on the total energy drift
-is usually at least an order of magnitude smaller than the tolerance.
-Furthermore, the drift of the total energy is affected by many other
-factors; often, the contribution from the constraint algorithm dominates.
-
-For constant-energy (NVE) simulations, the buffer size will be
-inferred from the temperature that corresponds to the velocities
-(either those generated, if applicable, or those found in the input
-configuration). Alternatively, the tolerance can be set to -1 and a
-buffer set manually by specifying {\tt rlist} $>$ {\tt max(rcoulomb,
- rvdw)}. The simplest way to get a reasonable buffer size is to use
-an NVT {\tt mdp} file with the target temperature set to what you
-expect in your NVE simulation, and transfer the buffer size printed by
-{\tt grompp} to your NVE {\tt mdp} file.
-
 The Verlet cut-off scheme is implemented in a very efficient fashion
 based on clusters of particles. The simplest example is a cluster
 size of 4 particles. The pair list is then constructed based on
 cluster pairs. The cluster-pair search is much faster than searching
 based on particle pairs, because $4 \times 4 = 16$ particle pairs are
 put in the list at once. The non-bonded force calculation kernel can then
-calculate all 16 particle-pair interactions at once, which maps nicely
-to SIMD units which can perform multiple floating operations at once
-(e.g. SSE, AVX, CUDA on GPUs, BlueGene FPUs). These non-bonded kernels
+calculate many particle-pair interactions at once, which maps nicely
+to SIMD or SIMT units on modern hardware, which can perform multiple
+floating-point operations at once. These non-bonded kernels
 are much faster than the kernels used in the group scheme for most
-types of systems, except for water molecules on processors with short
-SIMD widths when not using a buffered pair list. This latter case is
-common for (bio-)molecular simulations, so for greatest speed, it is
-worth comparing the performance of both schemes.
-
-As the Verlet cut-off scheme was introduced in version 4.6, not
-all features of the group scheme are supported yet. The Verlet scheme
-supports a few new features which the group scheme does not support.
-A list of features not (fully) supported in both cut-off schemes is
-given in \tabref{cutoffschemesupport}.
-
-\begin{table}
-\centerline{
-\begin{tabular}{|l|c|c|}
-\dline
-Non-bonded interaction feature & group & Verlet \\
-\dline
-unbuffered cut-off scheme & $\surd$ & not by default \\
-exact cut-off & shift/switch & $\surd$ \\
-shifted interactions & force+energy & energy \\
-switched potential & $\surd$ & $\surd$ \\
-switched forces & $\surd$ & $\surd$ \\
-non-periodic systems & $\surd$ & Z + walls \\
-implicit solvent & $\surd$ & \\
-free energy perturbed non-bondeds & $\surd$ & $\surd$ \\
-group energy contributions & $\surd$ & CPU (not on GPU) \\
-energy group exclusions & $\surd$ & \\
-AdResS multi-scale & $\surd$ & \\
-OpenMP multi-threading & only PME & $\surd$ \\
-native GPU support & & $\surd$ \\
-Lennard-Jones PME & $\surd$ & $\surd$ \\
-\dline
-\end{tabular}
-}
-\caption{Differences (only) in the support of non-bonded features
- between the group and Verlet cut-off schemes.}
-\label{tab:cutoffschemesupport}
-\end{table}
+types of systems, particularly on newer hardware.
 
 \ifthenelse{\equal{\gmxlite}{1}}{}{
 \subsubsection{Energy drift and pair-list buffering}
@@ -632,7 +574,7 @@ displacements and the shape of the potential at the cut-off.
 The displacement distribution along one dimension for a freely moving
 particle with mass $m$ over time $t$ at temperature $T$ is Gaussian
 with zero mean and variance $\sigma^2 = t\,k_B T/m$. For the distance
-between two atoms, the variance changes to $\sigma^2 = \sigma_{12}^2 =
+between two particles, the variance changes to $\sigma^2 = \sigma_{12}^2 =
 t\,k_B T(1/m_1+1/m_2)$. Note that in practice particles usually
 interact with other particles over time $t$ and therefore the real
 displacement distribution is much narrower. Given a non-bonded
diff --git a/docs/user-guide/cutoff-schemes.rst b/docs/user-guide/cutoff-schemes.rst
index bcae2d7edf..5cdfdfd6c0 100644
--- a/docs/user-guide/cutoff-schemes.rst
+++ b/docs/user-guide/cutoff-schemes.rst
@@ -1,4 +1,161 @@
 Non-bonded cut-off schemes
 ==========================
 
-TODO in future patch
+The default cut-off scheme in |Gromacs| |version| is based on classical
+buffered Verlet lists. These are implemented extremely efficiently
+on modern CPUs and accelerators, and support nearly all of the
+algorithms used in |Gromacs|.
+
+Before version 4.6, |Gromacs| always used pair-lists based on groups
+of particles. These groups of particles were originally charge
+groups, which were necessary with plain cut-off electrostatics. With
+the use of PME (or reaction-field with a buffer), charge groups are
+no longer necessary (and are ignored in the Verlet scheme). In
+|Gromacs| 4.6 and later, the group-based cut-off scheme is still
+available, but is **deprecated in 5.0 and 5.1**. It is still
+available mainly for backwards compatibility, to support the
+algorithms that have not yet been converted, and for the few cases
+where it may allow faster simulations with bio-molecular systems
+dominated by water.
+
+Without PME, the group cut-off scheme should generally be combined
+with a buffered pair-list to help avoid artefacts. However, the
+group-scheme kernels that can implement this are much slower than
+either the unbuffered group-scheme kernels or the buffered
+Verlet-scheme kernels. Use of the Verlet scheme is strongly encouraged
+for all kinds of simulations, because it is easier and faster to run
+correctly. In particular, GPU acceleration is available only with the
+Verlet scheme.
+
+The Verlet scheme uses properly buffered lists with exact cut-offs.
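+For example, the scheme and its buffer are selected with mdp settings
+like these (0.005 kJ/mol/ps is the default tolerance)::
+
+    cutoff-scheme           = Verlet
+    verlet-buffer-tolerance = 0.005
+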
+The size of the buffer is chosen with :mdp:`verlet-buffer-tolerance`
+to permit a certain level of drift. Both the LJ and Coulomb potentials
+are shifted to zero by subtracting their value at the cut-off. This
+ensures that the energy is the integral of the force. Still, it is
+advisable to have small forces at the cut-off, and hence to use PME or
+reaction-field with infinite epsilon.
+
+Non-bonded scheme feature comparison
+------------------------------------
+
+All |Gromacs| |version| features not directly related to non-bonded
+interactions are supported in both schemes. Eventually, all non-bonded
+features will be supported in the Verlet scheme. A table describing
+the compatibility of the non-bonded features with the two schemes is
+given below.
+
+Table: Support levels within the group and Verlet cut-off schemes
+for features related to non-bonded interactions
+
+==================================== ============ =======
+Feature                              group        Verlet
+==================================== ============ =======
+unbuffered cut-off scheme            default      not by default
+exact cut-off                        shift/switch always
+potential-shift interactions         yes          yes
+potential-switch interactions        yes          yes
+force-switch interactions            yes          yes
+non-periodic systems                 yes          Z + walls
+implicit solvent                     yes          no
+free energy perturbed non-bondeds    yes          yes
+energy group contributions           yes          only on CPU
+energy group exclusions              yes          no
+AdResS multi-scale                   yes          no
+OpenMP multi-threading               only PME     all
+native GPU support                   no           yes
+Coulomb PME                          yes          yes
+Lennard-Jones PME                    yes          yes
+virtual sites                        yes          yes
+User-supplied tabulated interactions yes          no
+Buckingham VdW interactions          yes          no
+rcoulomb != rvdw                     yes          no
+twin-range                           yes          no
+==================================== ============ =======
+
+Performance
+-----------
+
+The performance of the group cut-off scheme depends very much on the
+composition of the system and the use of buffering. There are
+optimized kernels for interactions with water, so anything with a lot
+of water runs very fast. But if you want properly buffered
+interactions, you need to add a buffer that takes into account both
+charge-group size and diffusion, and to check each interaction against
+the cut-off length at each time step. This makes simulations much
+slower. The performance of the Verlet scheme with the new non-bonded
+kernels is independent of system composition, and the scheme is
+intended to always run with a buffered pair-list. Typically, the
+buffer size is 0 to 10% of the cut-off, so you could win a bit of
+performance by reducing or removing the buffer, but this might not be
+a good trade-off against simulation quality.
+
+The table below shows a performance comparison of most of the relevant
+setups. Any atomistic model will have performance comparable to tips3p
+(which has LJ on the hydrogens), unless a united-atom force field is
+used. The performance of a protein in water will be between the tip3p
+and tips3p performance. The group scheme is optimized for water
+interactions, which means a single charge group containing one particle
+with LJ, and 2 or 3 particles without LJ. Kernels for such water models
+are roughly twice as fast as those for a comparable system with LJ
+and/or without charge groups. The implementation of the Verlet cut-off
+scheme has no interaction-specific optimizations, except for only
+calculating half of the LJ interactions if fewer than half of the
+particles have LJ.
+For molecules solvated in water, the scaling of the Verlet scheme to
+higher numbers of cores is better than that of the group scheme,
+because the load is more balanced. On the most recent Intel CPUs, the
+absolute performance of the Verlet scheme exceeds that of the group
+scheme, even for water-only systems.
+
+Table: Performance in ns/day of various water systems under different
+non-bonded setups in |Gromacs| using either 8 thread-MPI ranks (group
+scheme) or 8 OpenMP threads (Verlet scheme). 3000 particles, 1.0 nm
+cut-off, PME with 0.11 nm grid, dt=2 fs, Intel Core i7 2600 (AVX), 3.4
+GHz + Nvidia GTX660Ti
+
+======================== ================= =============== ================ =====================
+system                   group, unbuffered group, buffered Verlet, buffered Verlet, buffered, GPU
+======================== ================= =============== ================ =====================
+tip3p, charge groups     208               116             170              450
+tips3p, charge groups    129               63              162              450
+tips3p, no charge groups 104               75              162              450
+======================== ================= =============== ================ =====================
+
+How to use the Verlet scheme
+----------------------------
+
+The Verlet scheme is the default, selected with the option
+:mdp:`cutoff-scheme`. The value of the mdp option
+:mdp:`verlet-buffer-tolerance` will add a pair-list buffer whose size
+is tuned for the given energy drift (in kJ/mol/ns per particle). The
+effective drift is usually much lower, as :ref:`gmx grompp` assumes
+constant particle velocities. (Note that in single precision,
+constraints cause a drift of around 0.0001 kJ/mol/ns per particle for
+normal atomistic simulations, so it does not make sense to go much
+lower.) Details on how the buffer size is chosen can be found in the
+reference below and in the Reference Manual.
+
+For constant-energy (NVE) simulations, the buffer size will be
+inferred from the temperature that corresponds to the velocities
+(either those generated, if applicable, or those found in the input
+configuration). Alternatively, :mdp:`verlet-buffer-tolerance` can be
+set to -1 and a buffer set manually by specifying :mdp:`rlist` greater
+than the larger of :mdp:`rcoulomb` and :mdp:`rvdw`. The simplest way
+to get a reasonable buffer size is to use an NVT mdp file with the
+target temperature set to what you expect in your NVE simulation, and
+transfer the buffer size printed by :ref:`gmx grompp` to your NVE mdp
+file.
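+
+For example, a minimal sketch of NVE settings with a manually chosen
+buffer (the 1.0 nm cut-offs and the 0.1 nm buffer are illustrative,
+not recommendations)::
+
+    cutoff-scheme           = Verlet
+    ; disable automatic buffer sizing and set it by hand via rlist
+    verlet-buffer-tolerance = -1
+    rcoulomb                = 1.0
+    rvdw                    = 1.0
+    rlist                   = 1.1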
+
+When a GPU is used, :mdp:`nstlist` is automatically increased by
+mdrun, usually to 20 or more; :mdp:`rlist` is increased along with it
+to stay below the target energy drift. Further information on running
+mdrun with GPUs is available in the mdrun performance section of this
+guide.
+
+Further information
+-------------------
+
+For further information on algorithmic and implementation details of
+the Verlet cut-off scheme and the MxN kernels, as well as detailed
+performance analysis, please consult the following article:
+
+Páll, S. and Hess, B. A flexible algorithm for calculating pair
+interactions on SIMD architectures. Comput. Phys. Commun. 184,
+2641–2650 (2013).
diff --git a/docs/user-guide/mdp-options.rst b/docs/user-guide/mdp-options.rst
index 07b3688b02..5129754f26 100644
--- a/docs/user-guide/mdp-options.rst
+++ b/docs/user-guide/mdp-options.rst
@@ -555,32 +555,32 @@ Neighbor searching
 
    (0.005) \[kJ/mol/ps\]
 
-   Useful only with the :mdp:`Verlet` :mdp:`cutoff-scheme`. This
-   sets the maximum allowed error for pair interactions per particle
-   caused by the Verlet buffer, which indirectly sets
-   :mdp:`rlist`. As both :mdp:`nstlist` and the Verlet buffer size
-   are fixed (for performance reasons), particle pairs not in the pair
-   list can occasionally get within the cut-off distance during
+   Useful only with the :mdp:`Verlet` :mdp:`cutoff-scheme`. This sets
+   the maximum allowed error for pair interactions per particle caused
+   by the Verlet buffer, which indirectly sets :mdp:`rlist`. As both
+   :mdp:`nstlist` and the Verlet buffer size are fixed (for
+   performance reasons), particle pairs not in the pair list can
+   occasionally get within the cut-off distance during
    :mdp:`nstlist` -1 steps. This causes very small jumps in the
    energy. In a constant-temperature ensemble, these very small energy
    jumps can be estimated for a given cut-off and :mdp:`rlist`. The
    estimate assumes a homogeneous particle distribution, hence the
    errors might be slightly underestimated for multi-phase
-   systems. For longer pair-list life-time (:mdp:`nstlist` -1) *
-   :mdp:`dt` the buffer is overestimated, because the interactions
-   between particles are ignored. Combined with cancellation of
-   errors, the actual drift of the total energy is usually one to two
-   orders of magnitude smaller. Note that the generated buffer size
-   takes into account that the |Gromacs| pair-list setup leads to a
-   reduction in the drift by a factor 10, compared to a simple
-   particle-pair based list. Without dynamics (energy minimization
-   etc.), the buffer is 5% of the cut-off. For NVE simulations the
-   initial temperature is used, unless this is zero, in which case a
-   buffer of 10% is used. For NVE simulations the tolerance usually
-   needs to be lowered to achieve proper energy conservation on the
-   nanosecond time scale. To override the automated buffer setting,
-   use :mdp:`verlet-buffer-tolerance` =-1 and set :mdp:`rlist`
-   manually.
+   systems (see the `reference manual`_ for details). For a longer
+   pair-list life-time, (:mdp:`nstlist` -1) * :mdp:`dt`, the buffer is
+   overestimated, because the estimate ignores interactions between
+   particles. Combined with cancellation of errors, the actual drift
+   of the total energy is usually one to two orders of magnitude
+   smaller. Note that the generated buffer size takes into account
+   that the |Gromacs| pair-list setup leads to a reduction in the
+   drift by a factor of 10, compared to a simple particle-pair based
+   list. Without dynamics (energy minimization etc.), the buffer is
+   5% of the cut-off. For NVE simulations the initial temperature is
+   used, unless this is zero, in which case a buffer of 10% is
+   used. For NVE simulations the tolerance usually needs to be
+   lowered to achieve proper energy conservation on the nanosecond
+   time scale. To override the automated buffer setting, use
+   :mdp:`verlet-buffer-tolerance` =-1 and set :mdp:`rlist` manually.
 
 .. mdp:: rlist
@@ -2866,3 +2866,5 @@ User defined thingies
 These you can use if you modify code. You can pass integers and reals
 and groups to your subroutine. Check the inputrec definition in
 ``src/gromacs/legacyheaders/types/inputrec.h``
+
+.. _reference manual: gmx-manual-parent-dir_
diff --git a/docs/user-guide/mdrun-features.rst b/docs/user-guide/mdrun-features.rst
index ede2590d4b..1dc811f629 100644
--- a/docs/user-guide/mdrun-features.rst
+++ b/docs/user-guide/mdrun-features.rst
@@ -120,3 +120,14 @@ input files. The random seed for replica exchange is set with
 ``-reseed``. After every exchange, the velocities are scaled and
 neighbor searching is performed. See the Reference Manual for more
 details on how replica exchange functions in GROMACS.
+
+Controlling the length of the simulation
+----------------------------------------
+
+Normally, the length of an MD simulation is best managed through the
+mdp option :mdp:`nsteps`; however, there are situations where more
+control is useful. ``mdrun -nsteps 100`` overrides the mdp file and
+executes 100 steps. ``mdrun -maxh 2.5`` will terminate the simulation
+shortly before 2.5 hours elapse, which can be useful when running
+under cluster queues (as long as the queueing system never suspends
+the simulation).
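+
+For example (a minimal sketch: ``gmx mdrun`` is the usual launcher,
+and the ``-s topol.tpr`` input name is illustrative)::
+
+    # override nsteps from the mdp/tpr input and run 100 steps
+    gmx mdrun -s topol.tpr -nsteps 100
+
+    # stop shortly before 2.5 wall-clock hours have elapsed
+    gmx mdrun -s topol.tpr -maxh 2.5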
diff --git a/docs/user-guide/mdrun-performance.rst b/docs/user-guide/mdrun-performance.rst
index cba6ff2771..df22386f8d 100644
--- a/docs/user-guide/mdrun-performance.rst
+++ b/docs/user-guide/mdrun-performance.rst
@@ -112,7 +112,7 @@ see the Reference Manual. The most important of these are
 members of its domain. A GPU may perform work for more
 than one PP rank, but it is normally most efficient to use a single
 PP rank per GPU and for that rank to have thousands of
-atoms. When the work of a PP rank is done on the CPU, mdrun
+particles. When the work of a PP rank is done on the CPU, mdrun
 will make extensive use of the SIMD capabilities of the core. There
 are various `command-line options