About float and double.
For this first post, I planned to present several square root approximations, but since I manipulate the bits of the float representation, I need to first write about float/double in C++.
All the code below can be found in my github under BSD licence.
Table of Contents
Floating points
float
double
bit_cast
Limitations
Comparing floats
Fun facts
Links
Floating point numbers:
The IEC 559/IEEE 754 is a technical standard for floatingpoint computation.
In C++, compliance with IEC 559 can be checked with the is_iec559 member of std::numeric_limits
.
Nearly all modern CPUs from Intel, AMD and ARMs and GPUs from NVIDIA and AMD should be compliant.
float and double computations can be done with different rounding modes
All the tests in this article were done with the default Round to nearest
mode.
float:
The type float
is 32 bits: 1 bit, 8 bits of and 23 bits of .
When the exponent is 255:
The float is either infinity or NaN (Not a Number):
 Negative infinity is
 Positive infinity is
 NaNs may be signalling or quiet.
Comparisons with NaNs (>, >=, <, <= and ==) always return false even when both sides have the same bit pattern:
NaN != NaN
So if you want to accept only non negative numbers and NaNs (including ones with 1 as sign bit), you may change
assert(x>= 0);
into
assert(!(x<0));
When the exponent is in :
The bits are interpreted as .
When exponent is :
The bits are interpreted as .
Because of this format, there are two zeros that compare equal but have different bits:
0.0f : +0.0f :
Other numbers with exponent are called denormalized.
The smallest positive float is 1.401e45#DEN: . The biggest finite float is 3.40282347e+38: .
double:
The type double
is 64 bits: 1 bit, 11 bits of and 52 bits of .
It follows the same rules but the offset on the exponent is 1023 instead of 127, and the special exponents are 0 and 2047.
bit_cast:
To access the bits of a float, the safest way is to use memcpy, as both reinterpret_cast and unions are conflicting with strong aliasing rules according to the standard (see 3.9 and 3.1010):
uint32_t bits;
memcpy(&bits, &someFloat, sizeof(someFloat));
It can be wrapped in a template, which I chose to call bit_cast:
template<typename DstType, typename SrcType>
DstType bit_cast(SrcType src)
{
static_assert(sizeof(DstType) == sizeof(SrcType),
"sizeof(DstType) != sizeof(SrcType)");
DstType dst;
memcpy(&dst, &src, sizeof(SrcType));
return dst;
}
While looking for a better name for it, I found that chromium uses the same one so I gave up :D.
Limitations:
Since float has a finite number of bits, it is obviously not able to represent all the real numbers in [1.401e45, 3.40282347e+38].
This has some consequences:
Dividing by 2.0f or multiplying by 0.5 is the same. Dividing by 3.0f and multiplying by 1.0f/3.0f is not the same:
0.00472378824f * (1.0f/3.0f) = 0.001574596
16
0.00472378824f / 3.0f = 0.001574596
04
Because 1/3 can’t be represented exactly as a float:
1.0f/3.0f == 0.3333333**4**3f with round to nearest
Another consequence of this is that the following loop is infinite:
for(float f = 0.0f; f < 20000000.0f; ++f){}
as 16777216.0f + 1.0f
equals 16777216.0f
: ++f will never reach 20000000.0f !
Since the mantissa is only 24 bits (including the implicit leading 1 or 0), adding a number with a very small exponent to one with a very big exponent will leave the mantissa of the big number unaffected by the addition of the small one.
This problem is particularly important when subtracting nearly equal numbers (or adding nearly opposite ones).
Let’s say that I have two real numbers num1 and num2 both needing 26 bits of mantissa, and differing only on the least significant one.
Because float has only 24 bits of mantissa, the last two bits are already lost in previous computations at the point where num1 – num2 is computed, which leads to catastrophic cancellation.
Example:
num1 = 16777216.0f + 1.0f
num2 = 16777216.0f
num1  num2 = 0.0f
But with double:
num1 = 16777216.0 + 1.0
num2 = 16777216.0
num1  num2 = 1.0
Note that double has the exact same problem, but you need even bigger differences, like 10000000000000000.0+1.0.
They are different ways to overcome this when high precision is needed, like interval arithmetic or this.
Comparing floats:
In general, you should avoid direct equality comparisons for float, as they usually don’t do what you may expect:
Unless they are produced by the exact same steps, rounding errors may make two numbers you expect to be equal a bit different.

If you want two floats to have the exact same bits, do
bit_cast<uint32_t>(lhs) == bit_cast<uint32_t>(rhs)
(this considers +0 and 0 as different, but NaNs with same bits as equal). 
If you want to check if they are approximately equal, do
fabs(lhsrhs) <= epsilon*((fabs(lhs) < fabs(rhs) ? fabs(rhs) : fabs(lhs)
(from The art of computer programming) with epsilon chosen for your application. 
If you specifically want exactly the same bits with the exception of zeroes and NaNs, use the regular
==
operator.
Fun facts:

More than half of positive floats are inferior to 2 (exponents to ).

It is possible to use radixsort for float!

Another idea that I found important is that as
float
are only 32bits, it is possible to test every value when trying to approximate/implement a function with a single parameter (sqrt takes less than a few minutes for example) if it is not too complex.
Links:
What Every Computer Scientist Should Know About FloatingPoint Arithmetic
Wikipedia: Singleprecision floatingpoint format
Wikipedia: Doubleprecision floatingpoint format
Kahan Summation
As this is my first post using Jekyll, please feel free to criticize both the form and the content!