manual/=float.texinfo

@node Floating-Point Limits
@chapter Floating-Point Limits
@pindex <float.h>
@cindex floating-point number representation
@cindex representation of floating-point numbers

Because floating-point numbers are represented internally as approximate
quantities, algorithms for manipulating floating-point data often need
to be parameterized in terms of the accuracy of the representation.
Some of the functions in the C library itself need this information; for
example, the algorithms for printing and reading floating-point numbers
(@pxref{I/O on Streams}) and for calculating trigonometric and
irrational functions (@pxref{Mathematics}) use information about the
underlying floating-point representation to avoid round-off error and
loss of accuracy.  User programs that implement numerical analysis
techniques also often need to be parameterized in this way in order to
minimize or compute error bounds.

The specific representation of floating-point numbers varies from
machine to machine.  The GNU C Library defines a set of parameters that
characterize each of the supported floating-point representations on a
particular system.

@menu
* Floating-Point Representation::   Definitions of terminology.
* Floating-Point Parameters::       Descriptions of the library facilities.
* IEEE Floating Point::             An example of a common representation.
@end menu

@node Floating-Point Representation
@section Floating-Point Representation

This section introduces the terminology used to characterize the
representation of floating-point numbers.

You are probably already familiar with most of these concepts in terms
of scientific or exponential notation for floating-point numbers.  For
example, the number @code{123456.0} could be expressed in exponential
notation as @code{1.23456e+05}, a shorthand notation indicating that the
mantissa @code{1.23456} is multiplied by the base @code{10} raised to
the power @code{5}.

More formally, the internal representation of a floating-point number
can be characterized in terms of the following parameters:

@itemize @bullet
@item
The @dfn{sign} is either @code{-1} or @code{1}.
@cindex sign (of floating-point number)

@item
The @dfn{base} or @dfn{radix} for exponentiation is an integer greater
than @code{1}.  This is a constant for the particular representation.
@cindex base (of floating-point number)
@cindex radix (of floating-point number)

@item
The @dfn{exponent} is the power to which the base is raised.  The upper
and lower bounds of the exponent value are constants for the particular
representation.
@cindex exponent (of floating-point number)

Sometimes, in the actual bits representing the floating-point number,
the exponent is @dfn{biased} by adding a constant to it, so that it is
always represented as an unsigned quantity.  This matters only if you
have some reason to pick apart the bit fields making up the
floating-point number by hand, which the GNU library does not support.
The bias is therefore ignored in the discussion that follows.
@cindex bias, in exponent (of floating-point number)

@item
The @dfn{mantissa} or @dfn{significand} is an unsigned integer.
@cindex mantissa (of floating-point number)
@cindex significand (of floating-point number)

@item
The @dfn{precision} of the mantissa.  If the base of the representation
is @var{b}, then the precision is the number of base-@var{b} digits in
the mantissa.  This is a constant for the particular representation.

Many floating-point representations have an implicit @dfn{hidden bit} in
the mantissa.  Any such hidden bits are counted in the precision.
Again, the GNU library provides no facilities for dealing with such low-level
aspects of the representation.
@cindex precision (of floating-point number)
@cindex hidden bit, in mantissa (of floating-point number)
@end itemize

The mantissa of a floating-point number actually represents an implicit
fraction whose denominator is the base raised to the power of the
precision.  Since the largest representable mantissa is one less than
this denominator, the value of the fraction is always strictly less than
@code{1}.  The mathematical value of a floating-point number is then the
product of this fraction, the sign, and the base raised to the exponent.

If the floating-point number is @dfn{normalized}, the mantissa is also
greater than or equal to the base raised to the power of one less
than the precision (unless the number represents a floating-point zero,
in which case the mantissa is zero).  The fractional quantity is
therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
the base.
@cindex normalized floating-point number
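As an illustration of the fraction-and-exponent model, here is a minimal
sketch that uses the standard @code{frexp} function from @file{math.h}
to split a @code{double} into a fraction and a power of two.  The helper
name @code{fraction_is_normalized} is made up for this example, and the
sketch assumes a representation whose base is 2, since @code{frexp}
always works in powers of two.

```c
#include <math.h>

/* Hypothetical helper: return 1 if the fraction that frexp reports for
   VALUE lies in the range [1/2, 1), as expected for a normalized
   base-2 number, and store the matching exponent in *EXPONENT.  */
int
fraction_is_normalized (double value, int *exponent)
{
  double fraction = frexp (value, exponent);

  /* frexp returns a fraction with the same sign as VALUE.  */
  if (fraction < 0.0)
    fraction = -fraction;

  return fraction >= 0.5 && fraction < 1.0;
}
```

For @code{123456.0} the function returns 1 and stores 17 in
@code{*exponent}, because 123456 equals 0.94189453125 times 2 raised to
the power 17.  For zero it returns 0, since zero has no normalized form.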

@node Floating-Point Parameters
@section Floating-Point Parameters

@strong{Incomplete:}  This section needs some more concrete examples
of what these parameters mean and how to use them in a program.

These macro definitions can be accessed by including the header file
@file{<float.h>} in your program.

Macro names starting with @samp{FLT_} refer to the @code{float} type,
while names beginning with @samp{DBL_} refer to the @code{double} type
and names beginning with @samp{LDBL_} refer to the @code{long double}
type.  (In implementations that do not support @code{long double} as
a distinct data type, the values for those constants are the same
as the corresponding constants for the @code{double} type.)@refill

Note that only @code{FLT_RADIX} is guaranteed to be a constant
expression, so the other macros listed here cannot be reliably used in
places that require constant expressions, such as @samp{#if}
preprocessing directives and array size specifications.
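
The following sketch shows one way to respect this restriction:
@code{FLT_RADIX} is tested in a @samp{#if} directive, while
@code{FLT_MANT_DIG} is examined only at run time.  The macro
@code{RADIX_IS_BINARY} and the function @code{float_mantissa_bits} are
names invented for this example.

```c
#include <float.h>

/* FLT_RADIX is a constant expression, so it may appear in #if.  */
#if FLT_RADIX == 2
#define RADIX_IS_BINARY 1
#else
#define RADIX_IS_BINARY 0
#endif

/* Hypothetical helper: return the number of bits in the float
   mantissa, or -1 if the radix is not 2.  FLT_MANT_DIG is read in
   ordinary code, because it need not be a constant expression.  */
int
float_mantissa_bits (void)
{
  if (!RADIX_IS_BINARY)
    return -1;
  return FLT_MANT_DIG;
}
```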

Although the ANSI C standard specifies minimum and maximum values for
most of these parameters, the GNU C implementation uses whatever
floating-point representations are supported by the underlying hardware.
So whether GNU C actually satisfies the ANSI C requirements depends on
what machine it is running on.

@comment float.h
@comment ANSI
@defvr Macro FLT_ROUNDS
This value characterizes the rounding mode for floating-point addition.
The following values indicate standard rounding modes:

@table @code
@item -1
The mode is indeterminable.
@item 0
Rounding is towards zero.
@item 1
Rounding is to the nearest number.
@item 2
Rounding is towards positive infinity.
@item 3
Rounding is towards negative infinity.
@end table

@noindent
Any other value represents a machine-dependent nonstandard rounding
mode.
@end defvr
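
A program that reports its numerical environment might translate the
value of @code{FLT_ROUNDS} into text.  The sketch below follows the
table above; the function names @code{rounding_mode_name} and
@code{current_rounding_mode_name} are hypothetical.

```c
#include <float.h>

/* Hypothetical helper: map a FLT_ROUNDS value to a description.  */
const char *
rounding_mode_name (int mode)
{
  switch (mode)
    {
    case -1: return "indeterminable";
    case 0:  return "toward zero";
    case 1:  return "to nearest";
    case 2:  return "toward positive infinity";
    case 3:  return "toward negative infinity";
    default: return "machine-dependent";
    }
}

/* Describe the mode that the implementation currently reports.  */
const char *
current_rounding_mode_name (void)
{
  return rounding_mode_name (FLT_ROUNDS);
}
```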

@comment float.h
@comment ANSI
@defvr Macro FLT_RADIX
This is the value of the base, or radix, of the exponent representation.
This is guaranteed to be a constant expression, unlike the other macros
described in this section.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{float} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{long double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_DIG
This is the number of decimal digits of precision for the @code{float}
data type.  Technically, if @var{p} and @var{b} are the precision and
base (respectively) for the representation, then the decimal precision
@var{q} is the maximum number of decimal digits such that any
floating-point number with @var{q} base 10 digits can be rounded to a
floating-point number with @var{p} base @var{b} digits and back again,
without change to the @var{q} decimal digits.

The value of this macro is guaranteed to be at least @code{6}.
@end defvr
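
The round trip that defines @code{FLT_DIG} can be tried out directly.
The sketch below formats a value with a given number of significant
decimal digits, narrows it to @code{float}, formats it again, and
compares the two strings.  The function name
@code{survives_float_round_trip} is invented for this example, and the
value is assumed to lie within the range of @code{float}.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return 1 if VALUE, written with DIGITS
   significant decimal digits, reads the same after being rounded to
   float and converted back.  */
int
survives_float_round_trip (double value, int digits)
{
  char before[64];
  char after[64];
  float narrowed = (float) value;

  /* "%.*e" prints one digit before the decimal point and the
     requested number after it, hence DIGITS - 1.  */
  snprintf (before, sizeof before, "%.*e", digits - 1, value);
  snprintf (after, sizeof after, "%.*e", digits - 1, (double) narrowed);
  return strcmp (before, after) == 0;
}
```

On a machine that uses the IEEE single-precision format,
@code{survives_float_round_trip (8.589973e9, 6)} returns 1, while asking
for seven digits returns 0: the nearest @code{float} is 8589973504,
which prints as @code{8.589974e+09}.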

@comment float.h
@comment ANSI
@defvr Macro DBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{double} data
type.  The value of this macro is guaranteed to be at least @code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{long double}
data type.  The value of this macro is guaranteed to be at least
@code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_EXP
This is the minimum negative integer such that the mathematical value
@code{FLT_RADIX} raised to one less than this power can be represented
as a normalized floating-point number of type @code{float}.  In terms of
the actual implementation, this corresponds to the smallest exponent
that a normalized number can have.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_10_EXP
This is the minimum negative integer such that the mathematical value
@code{10} raised to this power can be represented as a normalized
floating-point number of type @code{float}.  This is guaranteed to be no
greater than @code{-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long
double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_EXP
This is the maximum integer such that the mathematical value
@code{FLT_RADIX} raised to one less than this power can be represented
as a floating-point number of type @code{float}.  In terms of the actual
implementation, this corresponds to the largest exponent that a finite
number can have.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
data type.
@end defvr
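
The relationship between the exponent limits and the extreme values of
the type can be checked with a few lines of code.  This is a sketch
only: @code{radix_power} and @code{float_exponent_limits_agree} are
hypothetical names, and the check assumes that @code{double} has a wider
exponent range than @code{float}, so that @code{FLT_RADIX} raised to
@code{FLT_MAX_EXP} can be computed without overflow.

```c
#include <float.h>

/* Hypothetical helper: compute FLT_RADIX raised to the power N by
   repeated multiplication or division, using double arithmetic.  */
double
radix_power (int n)
{
  double result = 1.0;

  for (; n > 0; n--)
    result *= FLT_RADIX;
  for (; n < 0; n++)
    result /= FLT_RADIX;
  return result;
}

/* Return 1 if the exponent limits for float agree with FLT_MIN and
   FLT_MAX: the radix raised to FLT_MIN_EXP - 1 is the smallest
   normalized value, the radix raised to FLT_MAX_EXP - 1 still fits,
   and the radix raised to FLT_MAX_EXP does not.  */
int
float_exponent_limits_agree (void)
{
  return radix_power (FLT_MIN_EXP - 1) == FLT_MIN
         && radix_power (FLT_MAX_EXP - 1) <= FLT_MAX
         && radix_power (FLT_MAX_EXP) > FLT_MAX;
}
```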

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_10_EXP
This is the maximum integer such that the mathematical value @code{10}
raised to this power can be represented as a floating-point number of
type @code{float}.  This is guaranteed to be at least @code{37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long
double} data type.
@end defvr
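
The decimal exponent limits answer the question of which powers of ten
fit in a type.  The following sketch tests a power of ten against the
normalized range of @code{float}.  The names @code{power_of_ten} and
@code{decimal_exponent_fits_float} are made up for this example; the
power is built by repeated multiplication in @code{double}, so it is
only approximate and assumes that @code{double} has a wider range than
@code{float}.

```c
#include <float.h>

/* Hypothetical helper: approximate 10 raised to the power N.  */
double
power_of_ten (int n)
{
  double result = 1.0;

  for (; n > 0; n--)
    result *= 10.0;
  for (; n < 0; n++)
    result /= 10.0;
  return result;
}

/* Return 1 if 10 raised to the power N lies within the range of
   normalized float values.  */
int
decimal_exponent_fits_float (int n)
{
  double value = power_of_ten (n);

  return value >= FLT_MIN && value <= FLT_MAX;
}
```

Every exponent from @code{FLT_MIN_10_EXP} through @code{FLT_MAX_10_EXP}
should pass this test, and the exponents just outside that interval
should fail it.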

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX
The value of this macro is the maximum representable floating-point
number of type @code{float}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{long double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{float}, and is
guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{double}, and
is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{long double},
and is guaranteed to be no more than @code{1E-37}.
@end defvr
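
A typical use of @code{FLT_MAX} and @code{FLT_MIN} is to decide whether
a @code{double} value can be stored in a @code{float} without
overflowing or dropping below the normalized range.  The enumeration and
the function @code{classify_for_float} below are hypothetical names, and
the sketch makes no attempt to handle values that are not numbers.

```c
#include <float.h>

enum float_fit
  {
    FITS_FLOAT,        /* Zero, or within the normalized float range.  */
    OVERFLOWS_FLOAT,   /* Magnitude is greater than FLT_MAX.  */
    UNDERFLOWS_FLOAT   /* Nonzero magnitude is less than FLT_MIN.  */
  };

/* Hypothetical helper: report how VALUE relates to the range of
   normalized float values.  */
enum float_fit
classify_for_float (double value)
{
  double magnitude = value < 0.0 ? -value : value;

  if (magnitude > FLT_MAX)
    return OVERFLOWS_FLOAT;
  if (magnitude != 0.0 && magnitude < FLT_MIN)
    return UNDERFLOWS_FLOAT;
  return FITS_FLOAT;
}
```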

@comment float.h
@comment ANSI
@defvr Macro FLT_EPSILON
This is the difference between @code{1.0} and the smallest
floating-point number of type @code{float} that is greater than
@code{1.0}.  It's guaranteed to be no greater than @code{1E-5}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{double}
type.  The value is guaranteed to be no greater than @code{1E-9}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
type.  The value is guaranteed to be no greater than @code{1E-9}.
@end defvr
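
One way to see what @code{FLT_EPSILON} measures is to search for it by
halving a candidate until adding half of it to 1 no longer changes the
sum.  The sketch below is illustrative only; the function name
@code{measured_float_epsilon} is invented, a binary representation that
rounds to nearest is assumed, and the @code{volatile} qualifier is there
to force each sum to be stored as a @code{float} rather than kept in a
wider register.

```c
#include <float.h>

/* Hypothetical helper: find the gap between 1 and the next larger
   float by repeated halving.  */
float
measured_float_epsilon (void)
{
  float epsilon = 1.0f;
  volatile float sum = 1.0f + epsilon / 2.0f;

  /* Keep halving while 1 plus half the candidate is still
     distinguishable from 1.  */
  while (sum > 1.0f)
    {
      epsilon /= 2.0f;
      sum = 1.0f + epsilon / 2.0f;
    }
  return epsilon;
}

/* Return 1 if the measured value agrees with the header.  */
int
epsilon_matches_header (void)
{
  return measured_float_epsilon () == FLT_EPSILON;
}
```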

@node IEEE Floating Point
@section IEEE Floating Point

Here is an example showing how these parameters work for a common
floating-point representation, specified by the @cite{IEEE Standard for
Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)}.

The IEEE single-precision float representation uses a base of 2.  There
is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
precision is 24 base-2 digits), and an 8-bit exponent that can represent
values in the range -125 to 128, inclusive.

So, for an implementation that uses this representation for the
@code{float} data type, appropriate values for the corresponding
parameters are:

@example
FLT_RADIX                         2
FLT_MANT_DIG                     24
FLT_DIG                           6
FLT_MIN_EXP                    -125
FLT_MIN_10_EXP                  -37
FLT_MAX_EXP                     128
FLT_MAX_10_EXP                  +38
FLT_MIN             1.17549435E-38F
FLT_MAX             3.40282347E+38F
FLT_EPSILON         1.19209290E-07F
@end example
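
A program can compare the parameters of its own @code{float} type with
the values in this table.  The sketch below does so; the function name
@code{float_matches_ieee_single} is hypothetical.  A match shows only
that the parameters agree, not that the arithmetic conforms to the IEEE
standard in every respect.

```c
#include <float.h>

/* Hypothetical helper: return 1 if every float parameter equals the
   value listed for the IEEE single-precision format.  */
int
float_matches_ieee_single (void)
{
  return FLT_RADIX == 2
         && FLT_MANT_DIG == 24
         && FLT_DIG == 6
         && FLT_MIN_EXP == -125
         && FLT_MIN_10_EXP == -37
         && FLT_MAX_EXP == 128
         && FLT_MAX_10_EXP == 38
         && FLT_MIN == 1.17549435E-38F
         && FLT_MAX == 3.40282347E+38F
         && FLT_EPSILON == 1.19209290E-07F;
}
```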