Floating-point
From HvWiki
Almost all modern computers approximate real numbers using floating-point arithmetic as defined in the IEEE 754 standard.
Single Precision
The IEEE 754 single precision number requires 32 bits of storage.
Bit:    31  30......23  22.....................0
Field:  S   EEEEEEEE    FFFFFFFFFFFFFFFFFFFFFFF
- S - Sign bit
- E - Exponent
- F - Fraction
The value V of the 32-bit word is determined as follows:
- If E = 255 and F is nonzero, then V = NaN ("Not a number")
- If E = 255 and F is zero and S is 1, then V = -Infinity
- If E = 255 and F is zero and S is 0, then V = Infinity
- If 0<E<255 then V = (-1)**S * 2 ** (E-127) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
- If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values, usually called denormalized or subnormal numbers.
- If E = 0 and F is zero and S is 1, then V = -0
- If E = 0 and F is zero and S is 0, then V = 0
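The case list above can be sketched in C as follows. This is only an illustration under stated assumptions: the names `decode_single`, `scale_by_pow2` and `bits_to_float` are made up for this example, and `bits_to_float` assumes that the platform's `float` is an IEEE 754 single precision number stored with the same byte order as a 32-bit integer.

```c
#include <assert.h>
#include <math.h>    /* only for the NAN and INFINITY macros */
#include <stdint.h>
#include <string.h>

/* Multiply x by 2**n using repeated doubling or halving.
   Every step is exact for the values passed in below. */
static float scale_by_pow2(float x, int n)
{
    for (; n > 0; n--) x *= 2.0f;
    for (; n < 0; n++) x *= 0.5f;
    return x;
}

/* Decode a 32-bit word by hand, one branch per case in the list. */
float decode_single(uint32_t word)
{
    uint32_t s = word >> 31;            /* bit 31 */
    uint32_t e = (word >> 23) & 0xFFu;  /* bits 30..23 */
    uint32_t f = word & 0x7FFFFFu;      /* bits 22..0 */
    float sign = s ? -1.0f : 1.0f;

    if (e == 255)
        return f ? NAN : sign * INFINITY;
    if (e == 0)
        /* 0.F * 2**(-126) = F * 2**(-149); gives +0 or -0 when F is zero */
        return sign * scale_by_pow2((float)f, -126 - 23);
    /* 1.F * 2**(E-127): the implicit leading 1 becomes bit 23 */
    return sign * scale_by_pow2((float)(f | 0x800000u), (int)e - 127 - 23);
}

/* Reinterpret the same bits as the machine's own float, for comparison.
   Assumption: float is 32 bits wide and follows IEEE 754. */
float bits_to_float(uint32_t word)
{
    float v;
    assert(sizeof v == sizeof word);
    memcpy(&v, &word, sizeof v);
    return v;
}
```

For instance, `decode_single(0x3F800000)` has S = 0, E = 127 and F = 0, so it yields 1.0. On a platform that meets the assumption, the hand-decoded value should compare equal to `bits_to_float` of the same word for every pattern that is not a NaN.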
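Single precision corresponds roughly to 7 significant decimal figures (24 significant bits, counting the implicit leading 1), of which only 6 are guaranteed in every case. The spacing between adjacent single precision values near 1 is 2 ** (-23), about 0.000 000 12, so with single precision the sum 1 + 0.000 000 01 will give an answer of just 1.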
The C type corresponding to single precision is "float" on practically all current platforms.
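The limited precision of "float" can be observed directly: an increment that is much smaller than the gap between neighbouring single precision values near 1 is lost completely. The sketch below assumes that `float` is IEEE 754 single precision; the function name `is_absorbed_by_one` is made up for this example.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if adding `small` to 1.0f leaves the result equal to 1.0f. */
int is_absorbed_by_one(float small)
{
    assert(small >= 0.0f);  /* this sketch only looks at non-negative increments */
    /* volatile forces the sum to be rounded to float before the comparison,
       even if the compiler evaluates expressions in a wider type */
    volatile float sum = 1.0f + small;
    return sum == 1.0f;
}
```

Under that assumption, `is_absorbed_by_one(1e-8f)` returns 1, while `is_absorbed_by_one(FLT_EPSILON)` returns 0, because `FLT_EPSILON` from `<float.h>` is exactly the gap between 1 and the next larger "float".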
Double Precision
The IEEE 754 double precision number requires 64 bits of storage.
Bit:    63  62.........52  51..................................................0
Field:  S   EEEEEEEEEEE    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
- S - Sign bit
- E - Exponent
- F - Fraction
The value V of the 64-bit word is determined as follows:
- If E = 2047 and F is nonzero, then V = NaN ("Not a number")
- If E = 2047 and F is zero and S is 1, then V = -Infinity
- If E = 2047 and F is zero and S is 0, then V = Infinity
- If 0<E<2047 then V = (-1)**S * 2 ** (E-1023) * (1.F) where "1.F" represents the binary number created by prefixing F with an implicit leading 1 and a binary point.
- If E = 0 and F is nonzero, then V = (-1)**S * 2 ** (-1022) * (0.F). These are "unnormalized" values, usually called denormalized or subnormal numbers.
- If E = 0 and F is zero and S is 1, then V = -0
- If E = 0 and F is zero and S is 0, then V = 0
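Going in the other direction, the S, E and F fields can be pulled out of an existing value and matched against the cases above. The following is a minimal sketch, assuming that `double` is an IEEE 754 double precision number stored with the same byte order as a 64-bit integer. The type `double_fields` and the functions `split_double` and `classify_fields` are names invented for this example, and the E = 0, F nonzero case is labelled "subnormal" in the code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;      /* S: 1 bit */
    unsigned exponent;  /* E: 11 bits, still biased by 1023 */
    uint64_t fraction;  /* F: 52 bits */
} double_fields;

/* Copy the bits of v into an integer and cut out the three fields. */
double_fields split_double(double v)
{
    uint64_t word;
    double_fields p;
    assert(sizeof v == sizeof word);  /* assumption: double is 64 bits wide */
    memcpy(&word, &v, sizeof word);
    p.sign = (unsigned)(word >> 63);                 /* bit 63 */
    p.exponent = (unsigned)((word >> 52) & 0x7FFu);  /* bits 62..52 */
    p.fraction = word & 0xFFFFFFFFFFFFFull;          /* bits 51..0 */
    return p;
}

/* Name the case from the list that the fields fall into. */
const char *classify_fields(double_fields p)
{
    if (p.exponent == 2047)
        return p.fraction ? "NaN" : "Infinity";
    if (p.exponent == 0)
        return p.fraction ? "subnormal" : "zero";
    return "normalized";
}
```

As an example, -2.5 is -1.25 * 2 ** 1, so `split_double(-2.5)` should report S = 1, E = 1024 and an F whose only set bit is bit 50 (binary 1.01 with the leading 1 dropped).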
Double precision corresponds roughly to 15 to 16 significant figures in decimal (53 significant bits, counting the implicit leading 1). The C type is - believe it or not - "double".
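One way to see the difference between the two formats is to ask which whole numbers survive a round trip. With a 23-bit fraction plus the implicit leading 1, single precision has 24 significant bits, and double precision has 53. The sketch below assumes that "float" and "double" are the IEEE 754 types described on this page; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if n can be stored in a float without being rounded. */
int survives_float(int64_t n)
{
    /* keep well inside the int64_t range so converting back is defined */
    assert(n > -(1LL << 62) && n < (1LL << 62));
    volatile float f = (float)n;  /* volatile: force rounding to float */
    return (int64_t)f == n;
}

/* Returns 1 if n can be stored in a double without being rounded. */
int survives_double(int64_t n)
{
    assert(n > -(1LL << 62) && n < (1LL << 62));
    volatile double d = (double)n;
    return (int64_t)d == n;
}
```

Under these assumptions 16777216 (2 ** 24) survives in a "float" but 16777217 does not, while a "double" keeps every whole number up to 9007199254740992 (2 ** 53) and first loses 9007199254740993.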

