Floating point
A floating-point number is a digital representation for a number in a
certain subset of the rational numbers, and is often used to approximate an
arbitrary real number on a computer. In particular, it represents an integer
or fixed-point number (the mantissa) multiplied by a base (usually 2) to
some integer power (the exponent); it is the binary analog of scientific
notation (base 10). A floating-point calculation is an arithmetic
calculation done with floating-point numbers, and often involves some
approximation or rounding because the result of an operation may not be
exactly representable.
In a floating-point number, the number of significant digits (the relative
precision) is a constant, rather than the absolute precision as in fixed-point.
Representation
A floating-point number a is represented by two numbers m and e, such that a
= m × b^e. In any such system we pick a base b (called the base of
numeration, also the radix) and a precision p (how many digits to store). m
(which is called the mantissa, also the significand) is a p-digit number of
the form ±d.ddd...ddd (each digit being an integer between 0 and b-1
inclusive). If the leading digit of m is non-zero then the number is said to
be normalized. Some descriptions use a separate sign bit (s, which
represents -1 or +1) and require m to be positive. e is called the exponent.
(For more on the concept of "mantissa", see common logarithm.)
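As a rough illustration of the decomposition a = m × b^e for b = 2, the sketch below uses Python's standard math.frexp. Note that frexp normalizes the mantissa to the form 0.1ddd... (0.5 <= |m| < 1) rather than d.ddd..., and the helper name decompose is only an illustrative choice.

```python
import math

def decompose(a):
    """Split a into (m, e) with a == m * 2**e and 0.5 <= |m| < 1 (m == 0 for zero)."""
    m, e = math.frexp(a)
    return m, e

m, e = decompose(6.0)
print(m, e)              # 0.75 3, since 6 = 0.75 * 2**3
print(m * 2**e == 6.0)   # True: scaling by a power of two is exact here
```

math.ldexp(m, e) performs the inverse operation and rebuilds the original value from the pair.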
This scheme allows a large range of magnitudes to be represented with a
fixed number of digits, which is not possible in fixed-point notation.
As an example, a floating point number with four decimal digits (b=10, p=4)
could be used to represent 4321 or 0.00004321, but would not have enough
precision to represent 432.123 and 43212.3 (which would have to be rounded
to 432.1 and 43210). Of course, in practice, the number of digits is usually
larger than four.
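The four-digit decimal example can be reproduced with Python's standard decimal module. This is a minimal sketch: the context with prec=4 stands in for the b=10, p=4 system, the helper name to_four_digits is hypothetical, and the module's default rounding mode (round half to even) is assumed.

```python
from decimal import Context, Decimal

ctx = Context(prec=4)  # p = 4 significant decimal digits

def to_four_digits(text):
    """Round a decimal literal to 4 significant digits."""
    return ctx.create_decimal(text)

for text in ["4321", "0.00004321", "432.123", "43212.3"]:
    print(text, "->", to_four_digits(text))
```

The first two values survive unchanged, while 432.123 and 43212.3 come back equal to 432.1 and 43210 (the latter printed in exponent form).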
Hidden bit
When using binary (b=2), one bit can be saved if all numbers are required to
be normalized. The leading digit of the mantissa of a normalized binary
number is always non-zero; in binary, that means it is always 1. It
therefore does not need to be stored explicitly: for a normalized number it
can be understood to be 1. The IEEE standard exploits this fact. Requiring
all numbers to be normalized means that 0 cannot be represented; typically
some special representation of zero is chosen.
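The hidden bit can be observed by taking apart a 64-bit IEEE 754 double. The sketch below assumes the binary64 layout (1 sign bit, 11 exponent bits with bias 1023, 52 stored fraction bits) and handles only normalized values; the function names fields and rebuild are illustrative.

```python
import struct

def fields(x):
    """Return (sign, biased_exponent, stored_fraction) of a binary64 value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Rebuild a normalized value, restoring the hidden leading 1."""
    significand = 1 + fraction / 2**52   # hidden bit added back
    return (-1) ** sign * significand * 2.0 ** (exponent - 1023)

print(fields(1.0))                # (0, 1023, 0): the leading 1 is not stored
print(fields(0.0))                # (0, 0, 0): zero uses a special all-zero pattern
print(rebuild(*fields(-0.375)))   # -0.375
```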
Usage in computing
While the examples above represent numbers in the decimal system (that is,
the base of numeration is b = 10), computers usually use the binary system,
which means that b = 2. In computers, floating-point
numbers are sized by the number of bits used to store them. This size is
usually 32 bits or 64 bits, often called "single-precision" and
"double-precision". A few machines offer larger sizes; Intel FPUs such as the
8087 (and its descendants integrated into the x86 architecture) offer 80-bit
floating-point numbers for intermediate results, and several systems offer
128-bit floating-point, generally implemented in software.
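The difference between the 32-bit and 64-bit sizes can be seen with Python's standard struct module. This is a sketch that assumes the interpreter's float is a 64-bit IEEE double, which is the usual case; the helper name through_single is hypothetical.

```python
import struct
import sys

def through_single(x):
    """Round a Python float (a 64-bit double) to 32-bit single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(struct.calcsize("<f") * 8, struct.calcsize("<d") * 8)  # 32 64
print(sys.float_info.mant_dig)      # significand bits of the native float type
print(through_single(0.1) == 0.1)   # False: single precision keeps fewer bits
print(through_single(0.5) == 0.5)   # True: 0.5 is exact in both sizes
```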
Problems with floating point
Floating-point numbers usually behave very similarly to the real numbers
they are used to approximate. However, this can easily lead programmers into
over-confidently ignoring the need for numerical analysis. There are many
cases where floating-point numbers do not model real numbers well, even in
simple cases such as representing the decimal fraction 0.1, which cannot be
exactly represented in any binary floating-point format. For this reason,
financial software tends to avoid binary floating-point representation.
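The 0.1 case is easy to check in Python, whose float type is normally a binary double. The sketch below uses the standard fractions and decimal modules to expose the value that is actually stored; the variable name stored is illustrative.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.1)            # exact rational value of the double nearest 0.1
print(stored == Fraction(1, 10))  # False: the stored value is not 1/10
print(Decimal(0.1))               # full decimal expansion, slightly above 0.1
print(0.1 + 0.2 == 0.3)           # False in binary floating point
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True in decimal arithmetic
```

Decimal arithmetic of this kind is one common way for financial code to avoid the problem.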
Errors in floating-point computation can be:
* Rounding
o Non-representable numbers
o Rounding of arithmetic operations
* Absorption: 1·10^15 + 1 = 1·10^15 (when fewer than 16 significant
decimal digits are kept)
* Cancellation: subtraction between nearly equal operands
resulting from rounded operations
* Overflow / Underflow
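Each kind of error in the list above can be triggered with a one-line expression. The sketch assumes 64-bit IEEE doubles, so absorption shows up at 1e16 rather than at 1e15 (a double still has enough digits for 1e15 + 1); the variable names are illustrative.

```python
import math

absorbed = 1e16 + 1              # absorption: the added 1 is lost
cancelled = (1 + 1e-15) - 1      # cancellation: rounding error in 1 + 1e-15 dominates
overflowed = 1e308 * 10          # overflow: result exceeds the largest double
underflowed = 1e-200 * 1e-200    # underflow: result is too small and becomes zero

print(absorbed == 1e16)              # True
print(cancelled == 1e-15)            # False
print(abs(cancelled - 1e-15) / 1e-15)  # relative error of the difference
print(math.isinf(overflowed))        # True
print(underflowed == 0.0)            # True
```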
IEEE standard
The IEEE has standardized the computer representation of floating-point
numbers in IEEE 754. This standard is followed by almost all modern
machines. The only exceptions are IBM mainframes, which recently acquired an
IEEE mode, and Cray vector machines, where the T90 series had an IEEE
version, but the SV1 still uses the Cray floating-point format.
Examples
* The value of Pi, π = 3.1415926..._10 in decimal, is equivalent to
binary 11.001001000011111..._2. When represented in a computer that
allocates 17 bits for the mantissa, it becomes 0.11001001000011111
× 2^2. Hence the floating-point representation would start with
bits 01100100100001111 and end with bits 10 (which represent the
exponent 2 in the binary system). (Note: the first zero indicates a
positive number; the ending 10_2 = 2_10.)
* The value -0.375_10 = -0.011_2, or -0.11 × 2^-1. In 2's complement
notation, -1 is represented as 11111111 (assuming 8 bits are used in
the exponent). In floating-point notation, the number will start with a
1 for the sign bit, followed by 110000... and then followed by 11111111 at
the end, or 1110...011111111 (where ... are zeros).
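The layout used in these examples (sign bit, then mantissa bits in 0.1ddd... form, then a 2's-complement exponent) can be sketched in a few lines. The function toy_encode and its parameters are hypothetical names for this article's ad hoc format, not part of any standard, and the mantissa is truncated rather than rounded, which is what the Pi example appears to do.

```python
import math

def toy_encode(x, mant_bits=16, exp_bits=8):
    """Encode x as sign + mantissa bits (0.1ddd... form) + 2's-complement exponent."""
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))              # abs(x) == m * 2**e, 0.5 <= m < 1
    mantissa = int(m * 2**mant_bits)       # keep mant_bits bits, truncating the rest
    exponent = e & ((1 << exp_bits) - 1)   # 2's complement within exp_bits bits
    return sign + format(mantissa, f"0{mant_bits}b") + format(exponent, f"0{exp_bits}b")

print(toy_encode(-0.375))           # sign 1, mantissa 1100...0, exponent 11111111
print(toy_encode(math.pi, 17, 8))   # sign 0, mantissa 11001001000011111, exponent 00000010
```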
Note that although the examples in this article use a consistent system of
floating-point notation, that notation differs from the IEEE standard.
For example, in IEEE 754 the exponent sits between the sign bit and the
mantissa, not at the end of the number. The IEEE exponent is also stored as
a biased integer instead of a 2's-complement number. The examples therefore
illustrate how floating-point numbers could be represented, but the actual
bits shown are different from those of an IEEE 754-compliant number. The
placement of the bits in the IEEE standard enables two floating-point
numbers to be compared bitwise (sans sign bit) to yield a result without
interpreting the actual values; the arbitrary system used in this article
cannot do the same. The examples are nevertheless adequate as textbook
illustrations, since they highlight all the major components of a
floating-point notation, and they show that a non-standard notation also
works as long as it is consistent.
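The bitwise comparison property of the IEEE layout can be checked for non-negative values by sorting doubles on their raw bit patterns. This is a sketch assuming the binary64 format; the helper name bit_pattern and the sample values are illustrative, and negative numbers are left out because the sign bit needs separate handling.

```python
import struct

def bit_pattern(x):
    """Return the 64 bits of a double as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

values = [3.5, 0.0, 1e-300, 2.0, 1e300, 0.375]
by_value = sorted(values)
by_bits = sorted(values, key=bit_pattern)
print(by_value == by_bits)  # True: integer order of the bits matches numeric order
```

This works because the biased exponent occupies the high-order bits and grows with the magnitude of the number.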
This content from Wikipedia is licensed under the GNU Free Documentation License.