Floating-point

Almost all modern computers approximate real numbers using floating-point arithmetic as defined in the IEEE 754 standard.


Single Precision

The IEEE 754 single precision number requires 32 bits of storage.

31 30      23 22                     0
 S  EEEEEEEE   FFFFFFFFFFFFFFFFFFFFFFF
  • S - Sign bit
  • E - Exponent
  • F - Fraction


The value of the 32 bit word:

  • If E = 255 and F is nonzero, then V = NaN ("Not a number")
  • If E = 255 and F is zero and S is 1, then V = -Infinity
  • If E = 255 and F is zero and S is 0, then V = Infinity
  • If 0<E<255 then V = (-1)**S * 2 ** (E-127) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-126) * (0.F). These are "denormalized" (also called "subnormal") values.
  • If E = 0 and F is zero and S is 1, then V = -0
  • If E = 0 and F is zero and S is 0, then V = 0
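
The field layout and the cases listed above can be explored with a small C sketch. It assumes that "float" is a 32 bit IEEE 754 single precision number, which the code checks at compile time. The names f32_fields, f32_split and f32_class are made up for this illustration and do not come from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32 bit IEEE 754 single precision number. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper type holding the three fields of the word. */
typedef struct {
    unsigned sign;      /* S: bit 31 */
    unsigned exponent;  /* E: bits 30..23, stored with a bias of 127 */
    uint32_t fraction;  /* F: bits 22..0 */
} f32_fields;

/* Split a float into its S, E and F fields. */
f32_fields f32_split(float value)
{
    uint32_t bits;
    f32_fields out;

    memcpy(&bits, &value, sizeof bits);  /* well-defined way to read the bits */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.fraction = bits & 0x7FFFFFu;
    return out;
}

/* Name the case from the list above that a value falls into.
   "subnormal" is the E = 0, F nonzero case. */
const char *f32_class(float value)
{
    f32_fields f = f32_split(value);

    if (f.exponent == 255) return f.fraction ? "NaN" : "Infinity";
    if (f.exponent == 0)   return f.fraction ? "subnormal" : "zero";
    return "normal";
}
```

For instance, f32_split(-0.75f) yields S = 1, E = 126 and F = 0x400000, since -0.75 is -1.1 in binary times 2 ** (-1).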

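Single precision carries 24 significant bits, which corresponds to roughly 7 significant decimal digits (at least 6 are always preserved). For example, in single precision the sum 1 + 0.000 000 01 gives an answer of just 1, because the addend is smaller than half the gap between 1 and the next representable value (about 0.000 000 12).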

The C type corresponding to single precision is "float".
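
The limited precision of single precision can be observed directly from C. The following is a minimal sketch, assuming "float" is IEEE 754 single precision (checked at compile time). The function name f32_lost_next_to_one is hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Assumption: float has 24 significant bits (IEEE 754 single precision). */
static_assert(FLT_MANT_DIG == 24, "float must be IEEE 754 single precision");

/* Returns true when adding x to 1.0f leaves the result at exactly 1.0f,
   that is, when x is too small to register next to 1. */
bool f32_lost_next_to_one(float x)
{
    /* volatile forces the sum to be rounded to float before the comparison,
       even on platforms that evaluate expressions in a wider format. */
    volatile float sum = 1.0f + x;
    return sum == 1.0f;
}
```

Under these assumptions, f32_lost_next_to_one(1e-8f) is true, while f32_lost_next_to_one(FLT_EPSILON) is false: FLT_EPSILON from <float.h> is the gap between 1 and the next larger float.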

Double Precision

The IEEE 754 double precision number requires 64 bits of storage.

63 62         52 51                                                  0
 S  EEEEEEEEEEE   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  • S - Sign bit
  • E - Exponent
  • F - Fraction


The value of the 64 bit word:

  • If E = 2047 and F is nonzero, then V = NaN ("Not a number")
  • If E = 2047 and F is zero and S is 1, then V = -Infinity
  • If E = 2047 and F is zero and S is 0, then V = Infinity
  • If 0<E<2047 then V = (-1)**S * 2 ** (E-1023) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-1022) * (0.F). These are "denormalized" (also called "subnormal") values.
  • If E = 0 and F is zero and S is 1, then V = -0
  • If E = 0 and F is zero and S is 0, then V = 0
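
As a check on the formulas above, the following C sketch extracts S, E and F from a double and rebuilds the value from them. It assumes "double" is a 64 bit IEEE 754 number (checked at compile time) and handles only the finite cases. The names scale_pow2 and f64_from_fields are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: double is a 64 bit IEEE 754 double precision number. */
static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Multiply x by 2**n using repeated doublings or halvings.
   Each step is exact for the values used below. */
static double scale_pow2(double x, int n)
{
    for (; n > 0; n--) x *= 2.0;
    for (; n < 0; n++) x *= 0.5;
    return x;
}

/* Rebuild a finite double from its S, E and F fields.
   The E = 2047 cases (NaN, infinities) are not handled here. */
double f64_from_fields(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);

    int      s = (int)(bits >> 63);
    int      e = (int)((bits >> 52) & 0x7FF);
    uint64_t f = bits & 0xFFFFFFFFFFFFFull;       /* low 52 bits */

    /* 0.F as a number in [0, 1): F scaled by 2**(-52) */
    double frac = scale_pow2((double)f, -52);

    double magnitude = (e == 0)
        ? scale_pow2(frac, -1022)                 /* zero or E = 0 case */
        : scale_pow2(1.0 + frac, e - 1023);       /* 0 < E < 2047 case  */

    return s ? -magnitude : magnitude;
}
```

For any finite double x, f64_from_fields(x) == x should hold if the formulas are applied correctly.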

Double precision carries 53 significant bits, which is roughly 15 to 16 significant figures in decimal. The C type is, believe it or not, "double".
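
The difference between the two formats (24 significant bits in single precision against 53 in double precision, counting the implicit leading 1) can be seen by narrowing a double to a float and back. This is a minimal sketch under the assumption that both C types follow IEEE 754. The function name fits_in_float is hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Assumption: float and double are the IEEE 754 formats described above. */
static_assert(FLT_MANT_DIG == 24 && DBL_MANT_DIG == 53,
              "IEEE 754 float and double assumed");

/* True when x can be stored in a float and read back unchanged. */
bool fits_in_float(double x)
{
    /* volatile makes sure the value is really rounded to float */
    volatile float narrowed = (float)x;
    return (double)narrowed == x;
}
```

With these assumptions, fits_in_float(16777216.0) is true (that is 2 ** 24), but fits_in_float(16777217.0) is false, because 2 ** 24 + 1 needs 25 significant bits. Likewise fits_in_float(0.1) is false, since the double and float approximations of 0.1 differ.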