Study notes: Floating point


1. IEEE 754

A 32-bit float is laid out as:
1 sign bit | 8-bit exponent | 23-bit significand (fraction)
Categories:
1). Normalized (exponent is neither all 0s nor all 1s)
Value is 1.xx..x * 2^expo, where expo = (exponent field read as unsigned) - 127. The range is [1-127, 254-127], that is, [-126, 127].

2). De-normalized (exponent = 0..0s, and significand != 0..0s)
Value is 0.xx..x * 2^-126
Spacing is uniform, and it is the finest spacing available.
The pattern "S 0...0 0...0" (exponent and significand both all 0s) represents +/- zero.

3). Special (exponent = 1..1s)
S 1...1 0...0 : +/- infinity
S 1...1 x...x : NaN (significand != 0..0s)
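
To make the three categories concrete, here is a minimal sketch that splits a float into its fields and names its category. The names FloatFields, decode and category are hypothetical helpers introduced for illustration, and the code assumes float is the 32-bit IEEE 754 type.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// The three fields of a 32-bit IEEE 754 float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits
};

// Copy the raw bits out with memcpy (avoids aliasing problems) and split them.
FloatFields decode(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Name the category from the exponent and fraction fields.
std::string category(float f) {
    const FloatFields p = decode(f);
    if (p.exponent == 0xFFu) return p.fraction == 0 ? "infinity" : "NaN";
    if (p.exponent == 0u)    return p.fraction == 0 ? "zero" : "denormalized";
    return "normalized";
}
```

For example, decode(1.0f) should give sign 0, exponent 127 and fraction 0, so the unbiased exponent is 127 - 127 = 0 and the value is 1.0 * 2^0.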

Density (symmetric about 0, so only the positive side is shown)
from        to          granularity
0           2^-126      2^-23 * 2^-126 = 2^-149 (de-normalized case, 0.xxx * 2^-126)
2^-126      2^-125      2^-23 * 2^-126 = 2^-149 (normalized from here on, 1.xxx * 2^-126)
2^-125      2^-124      2^-23 * 2^-125 = 2^-148
...
2^0         2^1         2^-23 * 2^0 = 2^-23
2^1         2^2         2^-23 * 2^1 = 2^-22
...
2^23        2^24        2^-23 * 2^23 = 2^0 = 1
...
At 2^23 = 8388608, roughly 8.4 million, the granularity is already 1, which means there is no 8388608.1, 8388608.5, etc.
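
The spacing can be checked directly with std::nextafter from <cmath>, which returns the neighbouring representable value. This is a small sketch; spacing_above is a hypothetical helper name, and it assumes float is the 32-bit IEEE 754 type with denormals enabled.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
float spacing_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Under those assumptions, spacing_above(1.0f) should equal 2^-23, spacing_above(8388608.0f) should equal 1, and spacing_above(0.0f) should equal 2^-149, the smallest de-normalized value.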

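So, how do we determine whether two floating-point numbers are equal?
      First, define what "equal" means for the problem at hand.
      Then, based on context, choose from operator==, an absolute epsilon, a relative epsilon, etc. Refer to: Bruce Dawson's articles on comparing floating-point numbers.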

2. Correspondence with C++ standard library <limits>

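numeric_limits<float>::epsilon() : the gap between 1.0 and the next representable float, so 1.0 + e != 1.0. From the density table it is 2^-23 (about 1.192*10^-7).
numeric_limits<float>::max() : 1.11..1 * 2^127 = 2^128 - 2^104, roughly 3.4*10^38. Note that numeric_limits<float>::infinity() is not this value; it is the special pattern S 1...1 0...0.
numeric_limits<double>::epsilon() : 2^-52, roughly 2.2 * 10^-16. The VS debugger usually displays up to 15 digits after the decimal point, which is a reasonable match for that precision.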
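
These values can be rebuilt by hand and compared against <limits>. The sketch below assumes 32-bit IEEE 754 floats and the default round-to-nearest mode; max_float_by_hand and epsilon_by_search are hypothetical helper names used only for this illustration.

```cpp
#include <cmath>
#include <limits>

// Largest finite float built by hand: 1.11...1 (binary) * 2^127.
float max_float_by_hand() {
    const float significand = 2.0f - std::ldexp(1.0f, -23);  // 1.11...1 in binary
    return std::ldexp(significand, 127);
}

// Halve e until adding the next smaller power of two no longer changes 1.0f.
float epsilon_by_search() {
    float e = 1.0f;
    for (;;) {
        volatile float sum = 1.0f + e / 2.0f;  // volatile forces rounding to float
        if (sum == 1.0f) return e;
        e /= 2.0f;
    }
}
```

Under those assumptions, max_float_by_hand() should compare equal to std::numeric_limits<float>::max(), and epsilon_by_search() should compare equal to std::numeric_limits<float>::epsilon().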

3. Arithmetic overflows, underflows etc

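Why is no exception thrown after a floating-point arithmetic overflow? By default, IEEE 754 arithmetic does not trap: the operation returns infinity or NaN and execution continues.
e.g.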
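#include <cmath>   // std::pow, std::isnan

int main()
{
   float base = 2.0f, expo = 127.0f;
   float a = std::pow(base, expo);   // 2^127, still finite
   float b = a * base;               // overflows to infinity, shown as 1.#INF by VS

   bool nan = std::isnan(b);         // false: infinity is not NaN
   nan = std::isnan(b + b);          // false: inf + inf = inf
   nan = std::isnan(b - b);          // true:  inf - inf = NaN
   nan = std::isnan(b * b);          // false: inf * inf = inf
   nan = std::isnan(b / b);          // true:  inf / inf = NaN
   (void)nan;                        // silence the unused-value warning
   return 0;
}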
e.g.
A nonzero floating-point value divided by 0.0 also gives +/- infinity (1.#INF), and 0.0 / 0.0 gives NaN. No exception is thrown in either case, so be aware of that.
p.s. An integer divided by 0 is also tricky. By section 5.6 of C++0x:
             If the second operand of / or % is zero, the behavior is undefined.
      So anything could happen. Using VS2012 Express, an unrecoverable error occurs and the program is forced to exit.
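
Although no C++ exception is thrown, IEEE 754 records these events in status flags, which C++11 exposes through <cfenv>. The sketch below is an illustration under assumptions: flags_after_multiply and flags_after_divide are hypothetical helper names, and compilers differ in how strictly they honor floating-point environment access, so the flag results may vary with compiler and optimization settings.

```cpp
#include <cfenv>

// Flags of interest; the inexact flag is deliberately left out.
const int kWatched = FE_OVERFLOW | FE_DIVBYZERO | FE_INVALID;

// Multiply x by y and return which of the watched status flags were raised.
int flags_after_multiply(float x, float y) {
    volatile float vx = x, vy = y;   // volatile keeps the operation at run time
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float result = vx * vy;
    (void)result;
    return std::fetestexcept(kWatched);
}

// Divide x by y and return which of the watched status flags were raised.
int flags_after_divide(float x, float y) {
    volatile float vx = x, vy = y;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float result = vx / vy;
    (void)result;
    return std::fetestexcept(kWatched);
}
```

On a typical IEEE 754 platform, an overflowing multiply is expected to raise FE_OVERFLOW, a nonzero value divided by 0.0 is expected to raise FE_DIVBYZERO, and 0.0 / 0.0 is expected to raise FE_INVALID.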

0 0