Accompanying changes for analytical CUDA PME kernels
[gromacs/manual.git] / implement.tex
blob84338cbefc059c7cb4fc0b9670cff5c0e55cbc8c
1 % This file is part of the GROMACS molecular simulation package.
3 % Copyright (c) 2013, by the GROMACS development team, led by
4 % David van der Spoel, Berk Hess, Erik Lindahl, and including many
5 % others, as listed in the AUTHORS file in the top-level source
6 % directory and at http://www.gromacs.org.
8 % GROMACS is free software; you can redistribute it and/or
9 % modify it under the terms of the GNU Lesser General Public License
10 % as published by the Free Software Foundation; either version 2.1
11 % of the License, or (at your option) any later version.
13 % GROMACS is distributed in the hope that it will be useful,
14 % but WITHOUT ANY WARRANTY; without even the implied warranty of
15 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
16 % Lesser General Public License for more details.
18 % You should have received a copy of the GNU Lesser General Public
19 % License along with GROMACS; if not, see
20 % http://www.gnu.org/licenses, or write to the Free Software Foundation,
21 % Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
23 % If you want to redistribute modifications to GROMACS, please
24 % consider that scientific software is very special. Version
25 % control is crucial - bugs must be traceable. We will be happy to
26 % consider code for inclusion in the official distribution, but
27 % derived work must not be called official GROMACS. Details are found
28 % in the README & COPYING files - if they are missing, get the
29 % official version at http://www.gromacs.org.
31 % To help us fund GROMACS development, we humbly ask that you cite
32 % the research papers on the package. Check out http://www.gromacs.org
34 \chapter{Some implementation details}
35 In this chapter we will present some implementation details. This is
36 far from complete, but we deemed it necessary to clarify some things
37 that would otherwise be hard to understand.
39 \section{Single Sum Virial in {\gromacs}}
40 \label{sec:virial}
41 The \normindex{virial} $\Xi$ can be written in full tensor form as:
42 \beq
43 \Xi~=~-\half~\sum_{i < j}^N~\rvij\otimes\Fvij
44 \eeq
45 where $\otimes$ denotes the {\em direct product} of two vectors.\footnote
46 {$({\bf u}\otimes{\bf v})^{\ab}~=~{\bf u}_{\al}{\bf v}_{\be}$} When this is
47 computed in the inner loop of an MD program 9 multiplications and 9
48 additions are needed.\footnote{The calculation of
49 Lennard-Jones and Coulomb forces is about 50 floating point operations.}
51 Here it is shown how it is possible to extract the virial calculation
52 from the inner loop~\cite{Bekker93b}.
54 \subsection{Virial}
55 In a system with \index{periodic boundary conditions}, the
56 periodicity must be taken into account for the virial:
57 \beq
58 \Xi~=~-\half~\sum_{i < j}^{N}~\rnij\otimes\Fvij
59 \eeq
60 where $\rnij$ denotes the distance vector of the
61 {\em nearest image} of atom $i$ from atom $j$. In this definition we add
62 a {\em shift vector} $\delta_i$ to the position vector $\rvi$
63 of atom $i$. The difference vector $\rnij$ is thus equal to:
64 \beq
65 \rnij~=~\rvi+\delta_i-\rvj
66 \eeq
67 or in shorthand:
68 \beq
69 \rnij~=~\rni-\rvj
70 \eeq
71 In a triclinic system, there are 27 possible images of $i$; when a truncated
72 octahedron is used, there are 15 possible images.
74 \subsection{Virial from non-bonded forces}
75 Here the derivation for the single sum virial in the {\em non-bonded force}
76 routine is given. $i \neq j$ in all formulae below.
77 \newcommand{\di}{\delta_{i}}
78 \newcommand{\qrt}{\frac{1}{4}}
79 \bea
80 \Xi
81 &~=~&-\half~\sum_{i < j}^{N}~\rnij\otimes\Fvij \\
82 &~=~&-\qrt\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di-\rvj)\otimes\Fvij \\
83 &~=~&-\qrt\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di)\otimes\Fvij-\rvj\otimes\Fvij \\
84 &~=~&-\qrt\left(\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di)\otimes\Fvij~-~\sum_{i=1}^N~\sum_{j=1}^N ~\rvj\otimes\Fvij\right) \\
85 &~=~&-\qrt\left(\sum_{i=1}^N~(\rvi+\di)\otimes\sum_{j=1}^N~\Fvij~-~\sum_{j=1}^N ~\rvj\otimes\sum_{i=1}^N~\Fvij\right) \\
86 &~=~&-\qrt\left(\sum_{i=1}^N~(\rvi+\di)\otimes\Fvi~+~\sum_{j=1}^N ~\rvj\otimes\Fvj\right) \\
87 &~=~&-\qrt\left(2~\sum_{i=1}^N~\rvi\otimes\Fvi+\sum_{i=1}^N~\di\otimes\Fvi\right)
88 \eea
89 In these formulae we introduced:
90 \bea
91 \Fvi&~=~&\sum_{j=1}^N~\Fvij \\
92 \Fvj&~=~&\sum_{i=1}^N~\Fvji
93 \eea
94 which is the total force on $i$ with respect to $j$. Because we use Newton's Third Law:
95 \beq
96 \Fvij~=~-\Fvji
97 \eeq
98 we must, in the implementation, double the term containing the shift $\delta_i$.
100 \subsection{The intra-molecular shift (mol-shift)}
101 For the bonded forces and SHAKE it is possible to make a {\em mol-shift}
102 list, in which the periodicity is stored. We simple have an array {\tt mshift}
103 in which for each atom an index in the {\tt shiftvec} array is stored.
105 The algorithm to generate such a list can be derived from graph theory,
106 considering each particle in a molecule as a bead in a graph, the bonds
107 as edges.
108 \begin{enumerate}
109 \item[1] Represent the bonds and atoms as bidirectional graph
110 \item[2] Make all atoms white
111 \item[3] Make one of the white atoms black (atom $i$) and put it in the
112 central box
113 \item[4] Make all of the neighbors of $i$ that are currently
114 white, gray
115 \item[5] Pick one of the gray atoms (atom $j$), give it the
116 correct periodicity with respect to any of
117 its black neighbors
118 and make it black
119 \item[6] Make all of the neighbors of $j$ that are currently
120 white, gray
121 \item[7] If any gray atom remains, go to [5]
122 \item[8] If any white atom remains, go to [3]
123 \end{enumerate}
124 Using this algorithm we can
125 \begin{itemize}
126 \item optimize the bonded force calculation as well as SHAKE
127 \item calculate the virial from the bonded forces
128 in the single sum method again
129 \end{itemize}
131 Find a representation of the bonds as a bidirectional graph.
133 \subsection{Virial from Covalent Bonds}
134 Since the covalent bond force gives a contribution to the virial, we have:
135 \bea
136 b &~=~& \|\rnij\| \\
137 V_b &~=~& \half k_b(b-b_0)^2 \\
138 \Fvi &~=~& -\nabla V_b \\
139 &~=~& k_b(b-b_0)\frac{\rnij}{b} \\
140 \Fvj &~=~& -\Fvi
141 \eea
142 The virial contribution from the bonds then is:
143 \bea
144 \Xi_b &~=~& -\half(\rni\otimes\Fvi~+~\rvj\otimes\Fvj) \\
145 &~=~& -\half\rnij\otimes\Fvi
146 \eea
148 \subsection{Virial from SHAKE}
149 An important contribution to the virial comes from shake. Satisfying
150 the constraints a force {\bf G} that is exerted on the particles ``shaken.'' If this
151 force does not come out of the algorithm (as in standard SHAKE) it can be
152 calculated afterward (when using {\em leap-frog}) by:
153 \bea
154 \Delta\rvi&~=~&\rvi(t+\Dt)-
155 [\rvi(t)+{\bf v}_i(t-\frac{\Dt}{2})\Dt+\frac{\Fvi}{m_i}\Dt^2] \\
156 {\bf G}_i&~=~&\frac{m_i\Delta\rvi}{\Dt^2}
157 \eea
158 This does not help us in the general case. Only when no periodicity
159 is needed (like in rigid water) this can be used, otherwise
160 we must add the virial calculation in the inner loop of SHAKE.
162 When it {\em is} applicable the virial can be calculated in the single sum way:
163 \beq
164 \Xi~=~-\half\sum_i^{N_c}~\rvi\otimes\Fvi
165 \eeq
166 where $N_c$ is the number of constrained atoms.
168 %Another method is the Non-Iterative shake as proposed (and implemented)
169 %by Yoneya. In this algorithm the Lagrangian multipliers are solved in a
170 %matrix equation, and given these multipliers it is easy to get the periodicity
171 %in the virial afterwards.
173 %More...
176 \section{Optimizations}
177 Here we describe some of the algorithmic optimizations used
178 in {\gromacs}, apart from parallelism.
179 One of these, the implementation of the
180 1.0/sqrt(x) function is treated separately in \secref{sqrt}.
181 The most important other optimizations are described below.
183 \subsection{Inner Loops for Water}
184 \label{sec:waterloops}
185 {\gromacs} uses special inner loops to calculate non-bonded
186 interactions for water molecules with other atoms, and yet
187 another set of loops for interactions between pairs of
188 water molecules. There highly optimized loops for two types of water models.
189 For three site models similar to
190 SPC~\cite{Berendsen81}, {\ie}:
191 \begin{enumerate}
192 \item There are three atoms in the molecule.
193 \item The whole molecule is a single charge group.
194 \item The first atom has Lennard-Jones (\secref{lj}) and
195 Coulomb (\secref{coul}) interactions.
196 \item Atoms two and three have only Coulomb interactions,
197 and equal charges.
198 \end{enumerate}
199 These loops also works for the SPC/E~\cite{Berendsen87} and
200 TIP3P~\cite{Jorgensen83} water models.
201 And for four site water models similar to TIP4P~\cite{Jorgensen83}:
202 \begin{enumerate}
203 \item There are four atoms in the molecule.
204 \item The whole molecule is a single charge group.
205 \item The first atom has only Lennard-Jones (\secref{lj}) interactions.
206 \item Atoms two and three have only Coulomb (\secref{coul}) interactions,
207 and equal charges.
208 \item Atom four has only Coulomb interactions.
209 \end{enumerate}
211 The benefit of these implementations is that there are more floating-point
212 operations in a single loop, which implies that some compilers
213 can schedule the code better. However, it turns out that even
214 some of the most advanced compilers have problems with scheduling,
215 implying that manual tweaking is necessary to get optimum
216 \normindex{performance}.
217 This may include common-sub-expression elimination, or moving
218 code around.
220 \subsection{Fortran Code}
221 Unfortunately, on a few platforms \normindex{Fortran} compilers are
222 still better than C-compilers. For some machines ({\eg} SGI
223 Power Challenge) the difference may be up to a factor of 3, in the
224 case of vector computers this may be even larger. Therefore, some of
225 the routines that take up a lot of computer time have been translated
226 into Fortran and even assembly code for Intel and AMD x86 processors.
227 In most cases, the Fortran or assembly loops should be selected
228 automatically by the {\tt configure} script when appropriate, but you can
229 also tweak this by setting options to the {\tt configure} script.
231 \section{Computation of the 1.0/sqrt function}
232 \label{sec:sqrt}
233 \subsection{Introduction}
234 The {\gromacs} project started with the development of a $1/\sqrt{x}$
235 processor that calculates:
236 \begin{equation}
237 Y(x) = \frac{1}{ \sqrt{x} }
238 \end{equation}
239 As the project continued, the {\intel} processor was used to implement
240 {\gromacs}, which now turned into almost a full software project. The
241 $1/\sqrt{x}$ processor was implemented using a Newton-Raphson
242 iteration scheme for one step. For this it needed look-up tables to
243 provide the initial approximation. The $1/\sqrt{x}$ function makes it
244 possible to use two almost independent tables for the exponent seed
245 and the fraction seed with the IEEE floating-point representation.
247 \subsection{General}
248 According to~\cite{Bekker87} the $1/\sqrt{x}$ function can be evaluated using
249 the Newton-Raphson iteration scheme. The inverse function is:
250 \begin{equation}
251 X(y) = \frac{1}{y^{2}}
252 \end{equation}
253 So instead of calculating:
254 \begin{equation}
255 Y(a) = q
256 \end{equation}
257 the equation:
258 \begin{equation}
259 X(q) - a = 0
260 \label{eqn:inversef}
261 \end{equation}
262 can now be solved using Newton-Raphson. An iteration is performed by
263 calculating:
264 \begin{equation}
265 y_{n+1} = y_{n} - \frac{f(y_{n})}{f'(y_{n})}
266 \label{eqn:nr}
267 \end{equation}
268 The absolute error $\varepsilon$, in this approximation is defined by:
269 \begin{equation}
270 \varepsilon \equiv y_{n} - q
271 \end{equation}
272 Using Taylor series expansion to estimate the error results in:
273 \begin{equation}
274 \varepsilon _{n+1} = - \frac{\varepsilon _{n}^{2}}{2}
275 \frac{ f''(y_{n})}{f'(y_{n})}
276 \label{eqn:taylor}
277 \end{equation}
278 according to~\cite{Bekker87} equation (3.2). This is an estimation of the
279 absolute error.
281 \subsection{Applied to floating-point numbers}
282 Floating-point numbers in IEEE 32 bit single-precision format have a nearly
283 constant relative error of $\Delta x / x = 2^{-24}$. As seen earlier in the
284 Taylor series expansion equation (\eqnref{taylor}), the error in every
285 iteration step is absolute and in general dependent of $y$. If the error is
286 expressed as a relative error $\varepsilon_{r}$ the following holds:
287 \begin{equation}
288 \varepsilon _{{r}_{n+1}} \equiv \frac{\varepsilon_{n+1}}{y}
289 \end{equation}
290 and so:
291 \begin{equation}
292 \varepsilon _{{r}_{n+1}} =
293 - ( \frac{\varepsilon _{n}}{y} )^{2} y \frac{ f''}{2f'}
294 \end{equation}
295 For the function $f(y) = y^{-2}$ the term $y f''/2f'$ is constant (equal
296 to $-3/2$) so the relative error $\varepsilon _{r_{n}}$ is independent of $y$.
297 \begin{equation}
298 \varepsilon _{{r}_{n+1}} =
299 \frac{3}{2} (\varepsilon_{r_{n}})^{2}
300 \label{eqn:epsr}
301 \end{equation}
303 The conclusion of this is that the function $1/\sqrt{x}$ can be
304 calculated with a specified accuracy.
306 \begin{figure}
307 \begin{center}
308 \newcommand{\twltt}{\tt}
309 \setlength{\unitlength}{0.0080in}
310 \begin{picture}(489,176)(40,390)
311 \thicklines
312 \put(180,505){$\underbrace{\hspace{2.68in}}$}
313 \put( 60,505){$\underbrace{\hspace{0.88in}}$}
314 \put( 40,510){\framebox(480,30){}}
315 \put( 45,505){\vector( 0,-1){ 15}}
316 \put( 40,540){\dashbox{4}(0,0){}}
317 \multiput(250,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
318 \multiput(220,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
319 \multiput(190,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
320 \multiput(280,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
321 \multiput(310,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
322 \multiput(340,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
323 \multiput(370,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
324 \multiput(400,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
325 \multiput(430,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
326 \multiput(460,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
327 \multiput(490,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
328 \multiput(205,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
329 \multiput(235,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
330 \multiput(265,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
331 \multiput(295,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
332 \multiput(325,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
333 \multiput(355,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
334 \multiput(385,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
335 \multiput(415,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
336 \multiput(445,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
337 \multiput(475,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
338 \multiput(505,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
339 \put( 40,510){\framebox(480,30){}}
340 \put( 40,540){\dashbox{4}(0,0){}}
341 \multiput(250,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
342 \multiput(220,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
343 \multiput(190,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
344 \multiput(160,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
345 \multiput(130,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
346 \multiput(100,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
347 \multiput(280,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
348 \multiput(310,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
349 \multiput(340,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
350 \multiput(370,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
351 \multiput(400,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
352 \multiput(430,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
353 \multiput(460,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
354 \multiput(490,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
355 \multiput( 70,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
356 \put( 55,540){\line( 0,-1){ 30}}
357 \multiput( 85,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
358 \multiput(115,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
359 \multiput(145,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
360 \multiput(205,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
361 \multiput(235,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
362 \multiput(265,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
363 \multiput(295,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
364 \multiput(325,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
365 \multiput(355,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
366 \multiput(385,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
367 \multiput(415,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
368 \multiput(445,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
369 \multiput(475,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
370 \multiput(505,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
371 \put(175,540){\line( 0,-1){ 30}}
372 \put(175,540){\line( 0,-1){ 30}}
373 \multiput(145,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
374 \multiput(115,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
375 \multiput( 85,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
376 \put( 55,540){\line( 0,-1){ 30}}
377 \multiput( 70,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
378 \multiput(100,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
379 \multiput(130,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
380 \multiput(160,540)(0.00000,-8.57143){4}{\line( 0,-1){ 4.286}}
381 \put(345,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $F$}}}
382 \put(110,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $E$}}}
383 \put( 40,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $S$}}}
384 \put(505,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $0$}}}
385 \put(160,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $23$}}}
386 \put( 40,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $31$}}}
387 \put(140,400){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $Value=(-1)^{S}(2^{E-127})(1.F)$}}}
388 \put(505,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $0$}}}
389 \put(160,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $23$}}}
390 \put( 40,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $31$}}}
391 \put(140,400){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $Value=(-1)^{S}(2^{E-127})(1.F)$}}}
392 \end{picture}
393 \end{center}
394 \caption{IEEE single-precision floating-point format}
395 \label{fig:ieee}
396 \end{figure}
398 \subsection{Specification of the look-up table}
399 To calculate the function $1/\sqrt{x}$ using the previously mentioned
400 iteration scheme, it is clear that the first estimation of the solution must
401 be accurate enough to get precise results. The requirements for the
402 calculation are
403 \begin{itemize}
404 \item Maximum possible accuracy with the used IEEE format
405 \item Use only one iteration step for maximum speed
406 \end{itemize}
408 The first requirement states that the result of $1/\sqrt{x}$ may have a
409 relative error $\varepsilon_{r}$ equal to the $\varepsilon_{r}$ of a IEEE 32
410 bit single-precision floating-point number. From this, the $1/\sqrt{x}$
411 of the initial approximation can be derived, rewriting the definition of
412 the relative error for succeeding steps (\eqnref{epsr}):
413 \begin{equation}
414 \frac{\varepsilon_{n}}{y} =
415 \sqrt{\varepsilon_ {r_{n+1}} \frac{2f'}{yf''}}
416 \end{equation}
417 So for the look-up table the needed accuracy is:
418 \begin{equation}
419 \frac{\Delta Y}{Y} = \sqrt{\frac{2}{3} 2^{-24}}
420 \label{eqn:accy}
421 \end{equation}
422 which defines the width of the table that must be $\geq 13$ bit.
424 At this point the relative error, $\varepsilon_{r_{n}}$, of the look-up table
425 is known. From this the maximum relative error in the argument can be
426 calculated as follows. The absolute error $\Delta x$ is defined as:
427 \begin{equation}
428 \Delta x \equiv \frac{\Delta Y}{Y'}
429 \end{equation}
430 and thus:
431 \begin{equation}
432 \frac{\Delta x}{Y} = \frac{\Delta Y}{Y} (Y')^{-1}
433 \end{equation}
434 and thus:
435 \begin{equation}
436 \Delta x = constant \frac{Y}{Y'}
437 \end{equation}
438 For the $1/\sqrt{x}$ function, $Y / Y' \sim x$ holds, so
439 $\Delta x / x = constant$. This is a property of the used floating-point
440 representation as earlier mentioned. The needed accuracy of the argument of the
441 look-up table follows from:
442 \begin{equation}
443 \frac{\Delta x}{x} = -2 \frac{\Delta Y}{Y}
444 \end{equation}
445 So, using the floating-point accuracy (\eqnref{accy}):
446 \begin{equation}
447 \frac{\Delta x}{x} = -2 \sqrt{\frac{2}{3} 2^{-24}}
448 \end{equation}
449 This defines the length of the look-up table which should be $\geq 12$ bit.
451 \subsection{Separate exponent and fraction computation}
452 The used IEEE 32 bit single-precision floating-point format specifies
453 that a number is represented by a exponent and a fraction. The previous
454 section specifies for every possible floating-point number the look-up table
455 length and width. Only the size of the fraction of a floating-point number
456 defines the accuracy. The conclusion from this can be that the size of the
457 look-up table is length of look-up table, earlier specified, times the size of
458 the exponent ($2^{12}2^{8}, 1Mb$). The $1/\sqrt{x}$ function has the
459 property that the exponent is independent of the fraction. This becomes clear
460 if the floating-point representation is used. Define:
461 \begin{equation}
462 x \equiv (-1)^{S}(2^{E-127})(1.F)
463 \label{eqn:fpdef}
464 \end{equation}
465 See \figref{ieee}, where $0 \leq S \leq 1$, $0 \leq E \leq 255$,
466 $1 \leq 1.F < 2$ and $S$, $E$, $F$ integer (normalization conditions).
467 The sign bit ($S$) can be omitted because $1/\sqrt{x}$ is only defined
468 for $x > 0$. The $1/\sqrt{x}$ function applied to $x$ results in:
469 \begin{equation}
470 y(x) = \frac{1}{\sqrt{x}}
471 \end{equation}
473 \begin{equation}
474 y(x) = \frac{1}{\sqrt{(2^{E-127})(1.F)}}
475 \end{equation}
476 This can be rewritten as:
477 \begin{equation}
478 y(x) = (2^{E-127})^{-1/2}(1.F)^{-1/2}
479 \label{eqn:yx}
480 \end{equation}
481 Define:
482 \begin{equation}
483 (2^{E'-127}) \equiv (2^{E-127})^{-1/2}
484 \end{equation}
485 \begin{equation}
486 1.F'\equiv (1.F)^{-1/2}
487 \end{equation}
488 Then $\frac{1}{\sqrt{2}} < 1.F' \leq 1$ holds, so the condition
489 $1 \leq 1.F' < 2$, which is essential for normalized real representation, is
490 not valid anymore. By introducing an extra term, this can be corrected.
491 Rewrite the $1/\sqrt{x}$ function applied to floating-point numbers (\eqnref{yx}) as:
492 \begin{equation}
493 y(x) = (2^{\frac{127-E}{2}-1}) (2(1.F)^{-1/2})
494 \end{equation}
495 and:
496 \begin{equation}
497 (2^{E'-127}) \equiv (2^{\frac{127-E}{2}-1})
498 \label{eqn:exp}
499 \end{equation}
500 \begin{equation}
501 1.F'\equiv 2(1.F)^{-1/2}
502 \label{eqn:frac}
503 \end{equation}
504 Then $\sqrt{2} < 1.F \leq 2$ holds. This is not the exact valid range as
505 defined for normalized floating-point numbers in \eqnref{fpdef}.
506 The value $2$ causes the problem. By mapping this value on the nearest
507 representation $< 2$, this can be solved. The small error that is introduced
508 by this approximation is within the allowable range.
510 The integer representation of the exponent is the next problem. Calculating
511 $(2^{\frac{127-E}{2}-1})$ introduces a fractional result if $(127-E) = odd$.
512 This is again easily accounted for by splitting up the calculation into an
513 odd and an even part. For $(127-E) = even$ $E'$ in equation (\eqnref{exp})
514 can be exactly calculated in integer arithmetic as a function of $E$.
515 \begin{equation}
516 E' = \frac{127-E}{2} + 126
517 \end{equation}
519 For $(127-E) = odd$ equation (\eqnref{yx}) can be rewritten as:
520 \begin{equation}
521 y(x) = (2^{\frac{127-E-1}{2}})(\frac{1.F}{2})^{-1/2}
522 \end{equation}
523 Thus:
524 \begin{equation}
525 E' = \frac{126-E}{2} + 127
526 \end{equation}
527 which also can be calculated exactly in integer arithmetic.
528 {\bf Note} that the fraction is automatically corrected for its range earlier
529 mentioned, so the exponent does not need an extra correction.
531 The conclusions from this are:
532 \begin{itemize}
533 \item The fraction and exponent look-up table are independent. The fraction
534 look-up table exists of two tables (odd and even exponent) so the odd/even
535 information of the exponent (lsb bit) has to be used to select the right
536 table.
537 \item The exponent table is an 256 x 8 bit table, initialized for $odd$
538 and $even$.% as specified before
539 \end{itemize}
541 \subsection{Implementation}
542 The look-up tables can be generated by a small C program, which uses
543 floating-point numbers and operations with IEEE 32 bit single-precision format.
544 Note that because of the $odd$/$even$ information that is needed, the
545 fraction table is twice the size earlier specified (13 bit i.s.o. 12 bit).
546 %\figref{expgen} in the appendix shows how to fill the exponent table,
547 %\figref{fractgen} shows how to fill the fraction table.
549 The function according to~\eqnref{nr} has to be implemented.
550 Applied to the $1/\sqrt{x}$ function, equation~\eqnref{inversef} leads to:
551 \begin{equation}
552 f = a - \frac{1}{y^{2}}
553 \end{equation}
554 and so:
555 \begin{equation}
556 f' = \frac{2}{y^{3}}
557 \end{equation}
559 \begin{equation}
560 y_{n+1} = y_{n} - \frac{ a - \frac{1}{y^{2}_{n}} }{ \frac{2}{y^{3}_{n}} }
561 \end{equation}
563 \begin{equation}
564 y_{n+1} = \frac{y_{n}}{2} (3 - a y^{2}_{n})
565 \end{equation}
566 Where $y_{0}$ can be found in the look-up tables, and $y_{1}$ gives the result
567 to the maximum accuracy.
568 %In \figref{as_implementation}) the assembler implementation is given.
569 It is clear that only one iteration extra (in double
570 precision) is needed for a double-precision result.
572 \section{Modifying GROMACS}
573 The following files have to be edited in case you want to add a bonded
574 potential of any type.
575 \begin{enumerate}
576 \item {\tt include/bondf.h}
577 \item {\tt include/types/idef.h}
578 \item {\tt include/types/nrnb.h}
579 \item {\tt include/types/enums.h}
580 \item {\tt include/grompp.h}
581 \item {\tt src/kernel/topdirs.c}
582 \item {\tt src/gmxlib/tpxio.c}
583 \item {\tt src/gmxlib/bondfree.c}
584 \item {\tt src/gmxlib/ifunc.c}
585 \item {\tt src/gmxlib/nrnb.c}
586 \item {\tt src/kernel/convparm.c}
587 \item {\tt src/kernel/topdirs.c}
588 \item {\tt src/kernel/topio.c}
589 \end{enumerate}
591 % LocalWords: Virial virial triclinic intra mol mshift shiftvec sqrt SPC lj yf
592 % LocalWords: coul Fortran SGI AMD Raphson IEEE taylor epsr accy ieee yx fpdef
593 % LocalWords: lsb nr inversef src formulae GROMACS