Study notes: Floating point


1. IEEE 754

A 32-bit float is represented as:

Sign bit (1 bit)      8-bit exponent      23-bit significand

Categories:

1). Normalized (exponent is neither 0..0s nor 1..1s)

1.xx..x * 2^expo, where expo = (exponent read as an unsigned integer) - 127, so it ranges over [1-127, 254-127], that is [-126, 127].


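2). De-normalized (exponent = 0..0s, and significand != 0..0s)

0.xx..x * 2^-126

The spacing between neighbouring values is uniform over the whole de-normalized range, and it is the finest spacing anywhere on the float number line.

Specifically, the pattern "S  0...0  0...0" (exponent and significand both all zeros) represents +/- zero.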


3) Special

S  1...1  0...0 : +/- infinity

S  1...1  x...x (significand != 0..0s) : NaN
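
The three categories can be checked directly against the bit pattern. The following is a minimal sketch, assuming float is a 32-bit IEEE 754 type on the platform; decode, classify and FloatBits are hypothetical names made up for this illustration, not part of any library.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes a 32-bit IEEE 754 float");

// Hypothetical helper type: the three fields of a 32-bit float.
struct FloatBits {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 bits, the x's after the binary point
};

// Splits a float into sign, biased exponent and fraction.
FloatBits decode(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined way to read the bits
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Names the category of a value, following the list above.
std::string classify(float f) {
    const FloatBits b = decode(f);
    if (b.exponent == 0xFFu) return b.fraction == 0 ? "infinity" : "NaN";
    if (b.exponent == 0)     return b.fraction == 0 ? "zero" : "denormalized";
    return "normalized";
}
```

For example, decode(1.0f) gives sign 0, exponent 127 and fraction 0, which matches 1.0 * 2^(127-127).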

Density (symmetric about 0, so only the positive half is shown)

from        to          granularity
0           2^-126      2^-23 * 2^-126 = 2^-149   (de-normalized case, 0.xxx * 2^-126)
2^-126      2^-125      2^-23 * 2^-126 = 2^-149   (normalized from here on, 1.xxx * 2^-126; same spacing as the de-normalized range)
...
2^0         2^1         2^-23 * 2^0 = 2^-23
2^1         2^2         2^-23 * 2^1 = 2^-22
...
2^23        2^24        2^-23 * 2^23 = 2^0 = 1
...

When a float reaches 2^23 = 8388608, roughly 8 million, the granularity is already 1, which means there is no 8388608.5: the next float after 8388608 is 8388609.
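
The granularity figures can be measured rather than derived. This is a small sketch, assuming a C++11 <cmath> that provides std::nextafter and an IEEE 754 float; spacing_above is a hypothetical helper name used only here.

```cpp
#include <cmath>

// Gap between x and the next representable float above it.
// Assumes x is finite and below the largest finite float.
float spacing_above(float x) {
    return std::nextafter(x, INFINITY) - x;
}
```

With these assumptions, spacing_above(1.0f) is 2^-23, spacing_above(8388608.0f) is 1, and spacing_above(0.0f) is the smallest de-normalized value, 2^-149.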


So, how do we determine whether two floating-point numbers are equal?

      First, define "equal".

      Then, based on context, choose from operator==, an absolute epsilon, a relative epsilon, etc. Refer to Bruce Dawson's articles on comparing floating-point numbers.
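
As a rough illustration of the two epsilon styles, here is a minimal sketch. It is a common textbook pattern, not a reproduction of Bruce Dawson's code; the function names are hypothetical, and the tolerance values are an assumption that the caller has to pick for the problem at hand.

```cpp
#include <algorithm>
#include <cmath>

// Absolute tolerance: reasonable when the expected magnitude is known,
// for example when comparing against values near zero.
bool nearly_equal_abs(float a, float b, float abs_eps) {
    return std::fabs(a - b) <= abs_eps;
}

// Relative tolerance: the allowed difference scales with the larger
// magnitude, so it follows the granularity of the floats involved.
bool nearly_equal_rel(float a, float b, float rel_eps) {
    const float largest = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= largest * rel_eps;
}
```

Around one million, neighbouring floats are already 0.0625 apart, so an absolute epsilon of 0.001 behaves exactly like operator== there, while a relative epsilon of 1e-5 still accepts 1000000 and 1000001 as equal. Neither check handles NaN or infinity specially.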

2. Correspondence with C++ standard library <limits>

numeric_limits<float>::epsilon() : the difference between 1.0 and the next representable float above it, i.e. the granularity in the range [2^0, 2^1). Now we can see why it is 2^-23 (about 1.192 * 10^-7).

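numeric_limits<float>::max() : 1.11..1 * 2^127 = 2^128 - 2^104, roughly 3.4 * 10^38. numeric_limits<float>::infinity() is not this value; it is the separate bit pattern 0  1...1  0...0 from the Special category.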

numeric_limits<double>::epsilon() : 2^-52, roughly 2.2 * 10^-16. The VS debugger usually displays up to 15 digits beyond the decimal point, which is not a bad idea, since a double carries about 15 to 16 significant decimal digits.
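
The correspondence can be checked with exact powers of two. The sketch below assumes an IEEE 754 platform; limits_match_ieee754 is a hypothetical name, and std::ldexp(x, n) computes x * 2^n exactly for these arguments.

```cpp
#include <cmath>
#include <limits>

// True if the <limits> constants equal the powers of two from these notes.
bool limits_match_ieee754() {
    typedef std::numeric_limits<float> F;
    typedef std::numeric_limits<double> D;
    return F::epsilon() == std::ldexp(1.0f, -23)
        && D::epsilon() == std::ldexp(1.0, -52)
        && F::min() == std::ldexp(1.0f, -126)         // smallest normalized
        && F::denorm_min() == std::ldexp(1.0f, -149)  // smallest de-normalized
        && F::max() == std::ldexp(2.0f - F::epsilon(), 127);  // 1.11..1 * 2^127
}
```

Note that F::min() is the smallest positive normalized float, not the most negative value, which often surprises people coming from numeric_limits<int>::min().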

3. Arithmetic overflows, underflows etc

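Why is no exception thrown after a floating-point arithmetic overflow? Under the default IEEE 754 behaviour, an overflow produces infinity and an invalid operation produces NaN; the computation simply continues, and no C++ exception is involved.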

e.g.

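#include <cmath>   // std::pow, std::isnan (C++11); older MSVC has _isnan in <float.h>

int main()
{
    float base = 2.0f, expo = 127.0f;
    float a = std::pow(base, expo);   // 2^127, still finite
    float b = a * base;               // 2^128 overflows: b is infinity, shown as 1.#INF by VS

    bool nan = std::isnan(b);         // false: infinity is not NaN
    nan = std::isnan(b + b);          // false: inf + inf = inf
    nan = std::isnan(b - b);          // true:  inf - inf = NaN
    nan = std::isnan(b * b);          // false: inf * inf = inf
    nan = std::isnan(b / b);          // true:  inf / inf = NaN
    (void)nan;                        // results are meant to be watched in the debugger
    return 0;
}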

e.g.

On an IEEE 754 implementation, a non-zero floating-point value divided by 0.0 also gives +/- 1.#INF, and 0.0 / 0.0 gives NaN; in neither case is an exception thrown. Be aware of that.
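
Instead of throwing, the hardware records what happened in status flags, which <cfenv> can read. The following is a sketch under a few assumptions: an IEEE 754 platform, a C++11 <cfenv>, and a compiler that leaves the division in place (the volatile variables are there to discourage constant folding and reordering). DivisionOutcome and divide_by_zero are hypothetical names for this illustration.

```cpp
#include <cfenv>
#include <cmath>   // std::isinf / std::isnan for inspecting the result

// Hypothetical result type: the quotient plus the flags the division raised.
struct DivisionOutcome {
    float value;
    bool divbyzero;  // FE_DIVBYZERO was raised
    bool invalid;    // FE_INVALID was raised
};

// Divides numerator by 0.0f at run time and reports the raised flags.
DivisionOutcome divide_by_zero(float numerator) {
    volatile float n = numerator;
    volatile float zero = 0.0f;
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean set of flags
    volatile float q = n / zero;         // no C++ exception, only flags
    DivisionOutcome out = { q,
                            std::fetestexcept(FE_DIVBYZERO) != 0,
                            std::fetestexcept(FE_INVALID) != 0 };
    return out;
}
```

Under these assumptions, divide_by_zero(1.0f) yields +infinity with the divide-by-zero flag set, and divide_by_zero(0.0f) yields NaN with the invalid flag set. Strictly conforming code would also need #pragma STDC FENV_ACCESS ON, which not every compiler supports.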

p.s. Integer division by 0 is also tricky. Section 5.6 of C++0x says:

             If the second operand of / or % is zero, the behavior is undefined.

        So anything could happen. With VS2012 Express, an unrecoverable error occurs and the program is forced to exit.

