after multiple objections of libtom users [1], we decided to change licensing
[libtomfloat.git] / float.tex
blobcec4b463aaff7dbe7c809497ba6c2119efe01a8d
1 \documentclass[b5paper]{book}
2 \usepackage{hyperref}
3 \usepackage{makeidx}
4 \usepackage{amssymb}
5 \usepackage{color}
6 \usepackage{alltt}
7 \usepackage{graphicx}
8 \usepackage{layout}
9 \def\union{\cup}
10 \def\intersect{\cap}
11 \def\getsrandom{\stackrel{\rm R}{\gets}}
12 \def\cross{\times}
13 \def\cat{\hspace{0.5em} \| \hspace{0.5em}}
14 \def\catn{$\|$}
15 \def\divides{\hspace{0.3em} | \hspace{0.3em}}
16 \def\nequiv{\not\equiv}
17 \def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}}
18 \def\lcm{{\rm lcm}}
19 \def\gcd{{\rm gcd}}
20 \def\log{{\rm log}}
21 \def\ord{{\rm ord}}
22 \def\abs{{\mathit abs}}
23 \def\rep{{\mathit rep}}
24 \def\mod{{\mathit\ mod\ }}
25 \renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})}
26 \newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor}
27 \newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil}
28 \def\Or{{\rm\ or\ }}
29 \def\And{{\rm\ and\ }}
30 \def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}}
31 \def\implies{\Rightarrow}
32 \def\undefined{{\rm ``undefined"}}
33 \def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}}
34 \let\oldphi\phi
35 \def\phi{\varphi}
36 \def\Pr{{\rm Pr}}
37 \newcommand{\str}[1]{{\mathbf{#1}}}
38 \def\F{{\mathbb F}}
39 \def\N{{\mathbb N}}
40 \def\Z{{\mathbb Z}}
41 \def\R{{\mathbb R}}
42 \def\C{{\mathbb C}}
43 \def\Q{{\mathbb Q}}
44 \definecolor{DGray}{gray}{0.5}
45 \newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}}
46 \def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}}
47 \def\gap{\vspace{0.5ex}}
48 \makeindex
49 \begin{document}
50 \frontmatter
51 \pagestyle{empty}
52 \title{LibTomFloat User Manual \\ v0.02}
53 \author{Tom St Denis \\ tomstdenis@iahu.ca}
54 \maketitle
55 This text and the library are hereby placed in the public domain. This book has been formatted for B5 [176x250] paper using the \LaTeX{} {\em book}
56 macro package.
58 \vspace{10cm}
60 \begin{flushright}Open Source. Open Academia. Open Minds.
62 \mbox{ }
64 Tom St Denis,
66 Ontario, Canada
67 \end{flushright}
69 \tableofcontents
70 \listoffigures
71 \mainmatter
72 \pagestyle{headings}
73 \chapter{Introduction}
74 \section{What is LibTomFloat?}
75 LibTomFloat is a library of source code that provides multiple precision floating point arithmetic. It allows developers to manipulate floating
76 point numbers of variable precision. The library was written in portable ISO C source code and depends upon the public domain
77 LibTomMath package.
79 Along with providing the core mathematical operations such as addition and subtraction LibTomFloat also provides various complicated algorithms
80 such as trigonometry's sine, cosine and tangent operators as well as Calculus's square root, inverse square root, exponential and logarithm
81 operators.
83 LibTomFloat has been written for portability and numerical stability and is not particularly optimized for any given platform. It uses optimal
84 algorithms for manipulating the mantissa by using LibTomMath and uses numerically stable series for the various trig and calculus functions.
86 \section{License}
87 LibTomFloat is public domain.
89 \section{Building LibTomFloat}
90 LibTomFloat requires version 0.30 or higher of LibTomMath to be installed in order to build. Once LibTomMath is installed building LibTomFloat
91 is as simple as:
93 \begin{alltt}
94 make
95 \end{alltt}
97 Which will build ``libtomfloat.a'' and along with ``tomfloat.h'' complete an installation of LibTomFloat. You can also use the make target
98 ``install'' to automatically build and copy the files (into *NIX specific) locations.
100 \begin{alltt}
101 make install
102 \end{alltt}
104 \textbf{Note}: LibTomFloat does not use ISO C's native floating point types which means that the standard math library does not have to be
105 linked in. This also means that LibTomFloat will work decently on platforms that do not have a floating point unit.
108 \section{Purpose of LibTomFloat}
109 LibTomFloat is as much as an exercise in hardcore math for myself as it is a service to any programmer who needs high precision float point
110 data types. ISO C provides for fairly reasonable precision floating point data types but is limited. A proper analogy is LibTomFloat solves
111 ISO C's floating point problems in the same way LibTomMath solves ISO C's integer data type problems.
113 A classic example of a good use for large precision floats is long simulations where the numbers are not perfectly stable. A $128$--bit mantissa
114 (for example) can provide for exceptional precision.
116 That and knowing the value of $e$ to 512 bits is fun.
118 \section{How the types work}
120 \index{mantissa} \index{exponent}
121 The floating point types are emulated with three components. The \textbf{mantissa}, the \textbf{exponent} and the \textbf{radix}.
122 The mantissa forms the digits of number being represented. The exponent scales the number to give it a larger range. The radix controls
123 how many bits there are in the mantissa. The larger the radix the more precise the types become.
125 The representation of a number is given by the simple product $m \cdot 2^e$ where $m$ is the mantissa and $e$ the exponent. Numbers are
126 always normalized such that there are $radix$ bits per mantissa. For example, with $radix = 16$ the number $2$ is represented by
127 $32768 \cdot 2^{-14}$. A zero is represented by a mantissa of zero and an exponent of one and is a special case.
129 The sign flag is a standard ISO C ``long'' which gives it the range $2^{-31} \le e < 2^{31}$ which is considerably large.
131 Technically, LibTomFloat does not implement IEEE standard floating point types. The exponent is not normalized and the sign flag does not
132 count as a bit in the radix. There is also no ``implied'' bit in this system. The mantissa explicitly dictates the digits.
134 \chapter{Getting Started with LibTomFloat}
135 \section{Building Programs}
136 In order to use libTomFloat you must include ``tomfloat.h'' and link against the appropriate library file (typically
137 libtomfloat.a). There is no library initialization required and the entire library is thread safe.
139 \section{Return Codes}
140 There are three possible return codes a function may return.
142 \index{MP\_OKAY}\index{MP\_YES}\index{MP\_NO}\index{MP\_VAL}\index{MP\_MEM}
143 \begin{figure}[here!]
144 \begin{center}
145 \begin{small}
146 \begin{tabular}{|l|l|}
147 \hline \textbf{Code} & \textbf{Meaning} \\
148 \hline MP\_OKAY & The function succeeded. \\
149 \hline MP\_VAL & The function input was invalid. \\
150 \hline MP\_MEM & Heap memory exhausted. \\
151 \hline &\\
152 \hline MP\_YES & Response is yes. \\
153 \hline MP\_NO & Response is no. \\
154 \hline
155 \end{tabular}
156 \end{small}
157 \end{center}
158 \caption{Return Codes}
159 \end{figure}
161 The last two codes listed are not actually ``return'ed'' by a function. They are placed in an integer (the caller must
162 provide the address of an integer it can store to) which the caller can access. To convert one of the three return codes
163 to a string use the following function.
165 \index{mp\_error\_to\_string}
166 \begin{alltt}
167 char *mp_error_to_string(int code);
168 \end{alltt}
170 This will return a pointer to a string which describes the given error code. It will not work for the return codes
171 MP\_YES and MP\_NO.
173 \section{Data Types}
175 To better work with LibTomFloat it helps to know what makes up the primary data type within LibTomFloat.
177 \begin{alltt}
178 typedef struct \{
179 mp_int mantissa;
180 long radix,
181 exp;
182 \} mp_float;
183 \end{alltt}
185 The mp\_float data type is what all LibTomFloat functions will operate with and upon. The members of the structre are as follows:
187 \begin{enumerate}
188 \item The \textbf{mantissa} variable is a LibTomMath mp\_int that represents the digits of the float. Since it's a mp\_int it can accomodate
189 any practical range of numbers.
190 \item The \textbf{radix} variable is the precision desired for the mp\_float in bits. The higher the value the more precise (and slow) the
191 calculations are. This value must be larger than two and ideally shouldn't be lower than what a ``double'' provides (55-bits of mantissa).
192 \item The \textbf{exp} variable is the exponent associated with the number.
193 \end{enumerate}
195 \section{Function Organization}
197 Many of the functions operate as their LibTomMath counterparts. That is the source operands are on the left and the destination is on the
198 right. For instance:
200 \begin{alltt}
201 mpf_add(&a, &b, &c); /* c = a + b */
202 mpf_mul(&a, &a, &c); /* c = a * a */
203 mpf_div(&a, &b, &c); /* c = a / b */
204 \end{alltt}
206 One major difference (and similar to LibTomPoly) is that the radix of the destination operation controls the radix of the internal computation and
207 the final result. For instance, if $a$ and $b$ have a $24$--bit mantissa and $c$ has a $96$--bit mantissa then all three operations are performed
208 with $96$--bits of precision.
210 This is non--issue for algorithms such as addition or multiplication but more important for the series calculations such as division, inversion,
211 square roots, etc.
213 All functions normalize the result before returning.
215 \section{Initialization}
216 \subsection{Single Initializers}
218 To initialize or clear a single mp\_float use the following two functions.
220 \index{mpf\_init} \index{mpf\_clear}
221 \begin{alltt}
222 int mpf_init(mp_float *a, long radix);
223 void mpf_clear(mp_float *a);
224 \end{alltt}
226 mpf\_init will initialize $a$ with the given radix to the default value of zero. mpf\_clear will free the memory used by the
227 mp\_float.
229 \begin{alltt}
230 int main(void)
232 mp_float a;
233 int err;
235 /* initialize a mp_float with a 96-bit mantissa */
236 if ((err = mpf_init(&a, 96)) != MP_OKAY) \{
237 // error handle
240 /* we now have a 96-bit mp_float ready ... do work */
242 /* done */
243 mpf_clear(&a);
245 return EXIT_SUCCESS;
247 \end{alltt}
249 \subsection{Multiple Initializers}
251 To initialize or clear multiple mp\_floats simultaneously use the following two functions.
253 \index{mpf\_init\_multi} \index{mpf\_clear\_multi}
254 \begin{alltt}
255 int mpf_init_multi(long radix, mp_float *a, ...);
256 void mpf_clear_multi(mp_float *a, ...);
257 \end{alltt}
259 mpf\_init\_multi will initialize a \textbf{NULL} terminated list of mp\_floats with the same given radix. mpf\_clear\_multi will free
260 up a \textbf{NULL} terminated list of mp\_floats.
262 \begin{alltt}
263 int main(void)
265 mp_float a, b;
266 int err;
268 /* initialize two mp_floats with a 96-bit mantissa */
269 if ((err = mpf_init_multi(96, &a, &b, NULL)) != MP_OKAY) \{
270 // error handle
273 /* we now have two 96-bit mp_floats ready ... do work */
275 /* done */
276 mpf_clear_multi(&a, &b, NULL);
278 return EXIT_SUCCESS;
280 \end{alltt}
282 \subsection{Initialization of Copies}
284 In order to initialize an mp\_float and make a copy of a source mp\_float the following function has been provided.
286 \index{mpf\_init\_copy}
287 \begin{alltt}
288 int mpf_init_copy(mp_float *a, mp_float *b);
289 \end{alltt}
291 This will initialize $b$ and make it a copy of $a$.
293 \begin{alltt}
294 int main(void)
296 mp_float a, b;
297 int err;
299 /* initialize a mp_float with a 96-bit mantissa */
300 if ((err = mpf_init(&a, 96)) != MP_OKAY) \{
301 // error handle
304 /* we now have a 96-bit mp_float ready ... do work */
306 /* now make our copy */
307 if ((err = mpf_init_copy(&a, &b)) != MP_OKAY) \{
308 // error handle
311 /* now b is a copy of a */
313 /* done */
314 mpf_clear_multi(&a, &b, NULL);
316 return EXIT_SUCCESS;
318 \end{alltt}
320 \section{Data Movement}
321 \subsection{Copying}
322 In order to copy one mp\_float into another mp\_float the following function has been provided.
324 \index{mpf\_copy}
325 \begin{alltt}
326 int mpf_copy(mp_float *src, mp_float *dest);
327 \end{alltt}
328 This will copy the mp\_float from $src$ into $dest$. Note that the final radix of $dest$ will be that of $src$.
330 \begin{alltt}
331 int main(void)
333 mp_float a, b;
334 int err;
336 /* initialize two mp_floats with a 96-bit mantissa */
337 if ((err = mpf_init_multi(96, &a, &b, NULL)) != MP_OKAY) \{
338 // error handle
341 /* we now have two 96-bit mp_floats ready ... do work */
343 /* put a into b */
344 if ((err = mpf_copy(&a, &b)) != MP_OKAY) \{
345 // error handle
348 /* done */
349 mpf_clear_multi(&a, &b, NULL);
351 return EXIT_SUCCESS;
353 \end{alltt}
355 \subsection{Exchange}
357 To exchange the contents of two mp\_float data types use this f00.
359 \index{mpf\_exch}
360 \begin{alltt}
361 void mpf_exch(mp_float *a, mp_float *b);
362 \end{alltt}
364 This will swap the contents of $a$ and $b$.
366 \chapter{Basic Operations}
367 \section{Normalization}
369 \subsection{Simple Normalization}
370 Normalization is not required by the user unless they fiddle with the mantissa on their own. If that's the case you can
371 use this function.
372 \index{mpf\_normalize}
373 \begin{alltt}
374 int mpf_normalize(mp_float *a);
375 \end{alltt}
376 This will fix up the mantissa of $a$ such that the leading bit is one (if the number is non--zero).
378 \subsection{Normalize to New Radix}
379 In order to change the radix of a non--zero number you must call this function.
381 \index{mpf\_normalize\_to}
382 \begin{alltt}
383 int mpf_normalize_to(mp_float *a, long radix);
384 \end{alltt}
385 This will change the radix of $a$ then normalize it accordingly.
387 \section{Constants}
389 \subsection{Quick Constants}
390 The following are helpers for various numbers.
392 \index{mpf\_const\_0} \index{mpf\_const\_d} \index{mpf\_const\_ln\_d} \index{mpf\_const\_sqrt\_d}
393 \begin{alltt}
394 int mpf_const_0(mp_float *a);
395 int mpf_const_d(mp_float *a, long d);
396 int mpf_const_ln_d(mp_float *a, long b);
397 int mpf_const_sqrt_d(mp_float *a, long b);
398 \end{alltt}
400 mpf\_const\_0 will set $a$ to a valid representation of zero. mpf\_const\_d will set $a$ to a valid signed representation of
401 $d$. mpf\_const\_ln\_d will set $a$ to the natural logarithm of $b$. mpf\_const\_sqrt\_d will set $a$ to the square root of
402 $b$.
404 The next set of constants (fig. \ref{fig:const}) compute the standard constants as defined in ``math.h''.
405 \begin{figure}[here]
406 \begin{center}
407 \begin{tabular}{|l|l|}
408 \hline \textbf{Function Name} & \textbf{Value} \\
409 mpf\_const\_e & $e$ \\
410 mpf\_const\_l2e & log$_2(e)$ \\
411 mpf\_const\_l10e & log$_{10}(e)$ \\
412 mpf\_const\_le2 & ln$(2)$ \\
413 mpf\_const\_pi & $\pi$ \\
414 mpf\_const\_pi2 & $\pi / 2$ \\
415 mpf\_const\_pi4 & $\pi / 4$ \\
416 mpf\_const\_1pi & $1 / \pi$ \\
417 mpf\_const\_2pi & $2 / \pi$ \\
418 mpf\_const\_2rpi & $2 / \sqrt{\pi}$ \\
419 mpf\_const\_r2 & ${\sqrt{2}}$ \\
420 mpf\_const\_1r2 & $1 / {\sqrt{2}}$ \\
421 \hline
422 \end{tabular}
423 \end{center}
424 \caption{LibTomFloat Constants.}
425 \label{fig:const}
426 \end{figure}
428 All of these functions accept a single input argument. They calculate the constant at run--time using the precision specified in the input
429 argument.
431 \begin{alltt}
432 int main(void)
434 mp_float a;
435 int err;
437 /* initialize a mp_float with a 96-bit mantissa */
438 if ((err = mpf_init(&a, 96)) != MP_OKAY) \{
439 // error handle
442 /* let's find out what the square root of 2 is (approximately ;-)) */
443 if ((err = mpf_const_r2(&a)) != MP_OKAY) \{
444 // error handle
447 /* now a has sqrt(2) to 96-bits of precision */
449 /* done */
450 mpf_clear(&a);
452 return EXIT_SUCCESS;
454 \end{alltt}
456 \section{Sign Manipulation}
457 To manipulate the sign of a mp\_float use the following two functions.
459 \index{mpf\_abs} \index{mpf\_neg}
460 \begin{alltt}
461 int mpf_abs(mp_float *a, mp_float *b);
462 int mpf_neg(mp_float *a, mp_float *b);
463 \end{alltt}
465 mpf\_abs computes the absolute of $a$ and stores it in $b$. mpf\_neg computes the negative of $a$ and stores it in $b$. Note that the numbers
466 are normalized to the radix of $b$ before being returned.
468 \begin{alltt}
469 int main(void)
471 mp_float a;
472 int err;
474 /* initialize a mp_float with a 96-bit mantissa */
475 if ((err = mpf_init(&a, 96)) != MP_OKAY) \{
476 // error handle
479 /* let's find out what the square root of 2 is (approximately ;-)) */
480 if ((err = mpf_const_r2(&a)) != MP_OKAY) \{
481 // error handle
484 /* now make it negative */
485 if ((err = mpf_neg(&a, &a)) != MP_OKAY) \{
486 // error handle
489 /* done */
490 mpf_clear(&a);
492 return EXIT_SUCCESS;
494 \end{alltt}
496 \chapter{Basic Algebra}
497 \section{Algebraic Operators}
499 The following four functions provide for basic addition, subtraction, multiplication and division of mp\_float numbers.
501 \index{mpf\_add} \index{mpf\_sub} \index{mpf\_mul} \index{mpf\_div}
502 \begin{alltt}
503 int mpf_add(mp_float *a, mp_float *b, mp_float *c);
504 int mpf_sub(mp_float *a, mp_float *b, mp_float *c);
505 int mpf_mul(mp_float *a, mp_float *b, mp_float *c);
506 int mpf_div(mp_float *a, mp_float *b, mp_float *c);
507 \end{alltt}
508 These functions perform their respective operations on $a$ and $b$ and store the result in $c$.
510 \subsection{Additional Interfaces}
511 In order to make programming easier with the library the following four functions have been provided as well.
513 \index{mpf\_add\_d} \index{mpf\_sub\_d} \index{mpf\_mul\_d} \index{mpf\_div\_d}
514 \begin{alltt}
515 int mpf_add_d(mp_float *a, long b, mp_float *c);
516 int mpf_sub_d(mp_float *a, long b, mp_float *c);
517 int mpf_mul_d(mp_float *a, long b, mp_float *c);
518 int mpf_div_d(mp_float *a, long b, mp_float *c);
519 \end{alltt}
520 These work like the previous four functions except the second argument is a ``long'' type. This allow operations with
521 mixed mp\_float and integer types (specifically constants) to be performed relatively easy.
523 \textit{I will put an example of all op/op\_d functions here...}
525 \subsection{Additional Operators}
526 The next three functions round out the simple algebraic operators.
528 \index{mpf\_mul\_2} \index{mpf\_div\_2} \index{mpf\_sqr}
529 \begin{alltt}
530 int mpf_mul_2(mp_float *a, mp_float *b);
531 int mpf_div_2(mp_float *a, mp_float *b);
532 int mpf_sqr(mp_float *a, mp_float *b);
533 \end{alltt}
535 mpf\_mul\_2 and mpf\_div\_2 multiply (or divide) $a$ by two and store it in $b$. mpf\_sqr squares $a$ and stores it in $b$. mpf\_sqr is
536 faster than using mpf\_mul for squaring mp\_floats.
538 \section{Comparisons}
539 To compare two mp\_floats the following function can be used.
540 \index{mp\_cmp}
541 \begin{alltt}
542 int mpf_cmp(mp_float *a, mp_float *b);
543 \end{alltt}
544 This will compare $a$ to $b$ and return one of the LibTomMath comparison flags. Simply put, if $a$ is larger than $b$ it returns
545 MP\_GT. If $a$ is smaller than $b$ it returns MP\_LT, otherwise it returns MP\_EQ. The comparison is signed.
547 To quickly compare an mp\_float to a ``long'' the following is provided.
549 \index{mpf\_cmp\_d}
550 \begin{alltt}
551 int mpf_cmp_d(mp_float *a, long b, int *res);
552 \end{alltt}
554 Which compares $a$ to $b$ and stores the result in $res$. This function can fail which is unlike the digit compare from LibTomMath.
556 \chapter{Advanced Algebra}
557 \section{Powers}
558 \subsection{Exponential}
559 The following function computes $exp(x)$ otherwise known as $e^x$.
561 \index{mpf\_exp}
562 \begin{alltt}
563 int mpf_exp(mp_float *a, mp_float *b);
564 \end{alltt}
566 This computes $e^a$ and stores it into $b$.
568 \subsection{Power Operator}
569 The following function computes the generic $a^b$ operation.
571 \index{mpf\_pow}
572 \begin{alltt}
573 int mpf_pow(mp_float *a, mp_float *b, mp_float *c);
574 \end{alltt}
575 This computes $a^b$ and stores the result in $c$.
577 \subsection{Natural Logarithm}
579 The following function computes the natural logarithm.
580 \index{mpf\_ln}
581 \begin{alltt}
582 int mpf_ln(mp_float *a, mp_float *b);
583 \end{alltt}
584 This computes $ln(a)$ and stores the result in $b$.
586 \section{Inversion and Roots}
588 \subsection{Inverse Square Root}
589 The following function computes $1 / \sqrt{x}$.
591 \index{mpf\_invsqrt}
592 \begin{alltt}
593 int mpf_invsqrt(mp_float *a, mp_float *b);
594 \end{alltt}
596 This computes $1 / \sqrt{a}$ and stores the result in $b$.
598 \subsection{Inverse}
600 The following function computes $1 / x$.
601 \index{mpf\_inv}
602 \begin{alltt}
603 int mpf_inv(mp_float *a, mp_float *b);
604 \end{alltt}
605 This computes $1/a$ and stores the result in $b$.
608 \subsection{Square Root}
610 The following function computes $\sqrt{x}$.
612 \index{mpf\_sqrt}
613 \begin{alltt}
614 int mpf_sqrt(mp_float *a, mp_float *b);
615 \end{alltt}
617 This computes $\sqrt{a}$ and stores the result in $b$.
619 \section{Trigonometry Functions}
620 The following functions compute various trigonometric functions. All inputs are assumed to be in radians.
622 \index{mpf\_cos} \index{mpf\_sin} \index{mpf\_tan} \index{mpf\_acos} \index{mpf\_asin} \index{mpf\_atan}
623 \begin{alltt}
624 int mpf_cos(mp_float *a, mp_float *b);
625 int mpf_sin(mp_float *a, mp_float *b);
626 int mpf_tan(mp_float *a, mp_float *b);
627 int mpf_acos(mp_float *a, mp_float *b);
628 int mpf_asin(mp_float *a, mp_float *b);
629 int mpf_atan(mp_float *a, mp_float *b);
630 \end{alltt}
632 These all compute their respective trigonometric function on $a$ and store the result in $b$. The ``a'' prefix stands for ``arc'' or more
633 commonly known as inverse.
635 \input{float.ind}
637 \end{document}