About float and double.

For this first post, I planned to present several square root approximations, but since I manipulate the bits of the float representation, I need to first write about float/double in C++.

All the code below can be found in my github under BSD licence.

Table of Contents

Floating points
Comparing floats
Fun facts

Floating point numbers:

The IEC 559/IEEE 754 is a technical standard for floating-point computation. In C++, compliance with IEC 559 can be checked with the is_iec559 member of std::numeric_limits. Nearly all modern CPUs from Intel, AMD and ARMs and GPUs from NVIDIA and AMD should be compliant.

float and double computations can be done with different rounding modes All the tests in this article were done with the default Round to nearest mode.


The type float is 32 bits: 1 bit, 8 bits of and 23 bits of .

When the exponent is 255:

The float is either infinity or NaN (Not a Number):

  • Negative infinity is
  • Positive infinity is
  • NaNs may be signalling or quiet.

Comparisons with NaNs (>, >=, <, <= and ==) always return false even when both sides have the same bit pattern:

NaN != NaN

So if you want to accept only non negative numbers and NaNs (including ones with 1 as sign bit), you may change

assert(x>= 0); 



When the exponent is in :

The bits are interpreted as .

When exponent is :

The bits are interpreted as .

Because of this format, there are two zeros that compare equal but have different bits:

-0.0f : +0.0f :

Other numbers with exponent are called denormalized.

The smallest positive float is 1.401e-45#DEN: . The biggest finite float is 3.40282347e+38: .


The type double is 64 bits: 1 bit, 11 bits of and 52 bits of . It follows the same rules but the offset on the exponent is 1023 instead of 127, and the special exponents are 0 and 2047.


To access the bits of a float, the safest way is to use memcpy, as both reinterpret_cast and unions are conflicting with strong aliasing rules according to the standard (see 3.9 and 3.10-10):

uint32_t bits;
memcpy(&bits, &someFloat, sizeof(someFloat));

It can be wrapped in a template, which I chose to call bit_cast:

template<typename DstType, typename SrcType>
DstType bit_cast(SrcType src)
	static_assert(sizeof(DstType) == sizeof(SrcType), 
	"sizeof(DstType) != sizeof(SrcType)");

	DstType dst;
	memcpy(&dst, &src, sizeof(SrcType));
	return dst;

While looking for a better name for it, I found that chromium uses the same one so I gave up :D.


Since float has a finite number of bits, it is obviously not able to represent all the real numbers in [1.401e-45, 3.40282347e+38].

This has some consequences:

Dividing by 2.0f or multiplying by 0.5 is the same. Dividing by 3.0f and multiplying by 1.0f/3.0f is not the same:

0.00472378824f * (1.0f/3.0f) = 0.00157459616 0.00472378824f / 3.0f = 0.00157459604

Because 1/3 can’t be represented exactly as a float:

1.0f/3.0f == 0.3333333**4**3f with round to nearest

Another consequence of this is that the following loop is infinite:

for(float f = 0.0f; f < 20000000.0f; ++f){} 

as 16777216.0f + 1.0f equals 16777216.0f: ++f will never reach 20000000.0f !

Since the mantissa is only 24 bits (including the implicit leading 1 or 0), adding a number with a very small exponent to one with a very big exponent will leave the mantissa of the big number unaffected by the addition of the small one.

This problem is particularly important when subtracting nearly equal numbers (or adding nearly opposite ones).

Let’s say that I have two real numbers num1 and num2 both needing 26 bits of mantissa, and differing only on the least significant one.

Because float has only 24 bits of mantissa, the last two bits are already lost in previous computations at the point where num1 – num2 is computed, which leads to catastrophic cancellation.


num1 = 16777216.0f + 1.0f num2 = 16777216.0f num1 - num2 = 0.0f

But with double:

num1 = 16777216.0 + 1.0 num2 = 16777216.0 num1 - num2 = 1.0

Note that double has the exact same problem, but you need even bigger differences, like 10000000000000000.0+1.0.

They are different ways to overcome this when high precision is needed, like interval arithmetic or this.

Comparing floats:

In general, you should avoid direct equality comparisons for float, as they usually don’t do what you may expect:

Unless they are produced by the exact same steps, rounding errors may make two numbers you expect to be equal a bit different.

  • If you want two floats to have the exact same bits, do bit_cast<uint32_t>(lhs) == bit_cast<uint32_t>(rhs) (this considers +0 and -0 as different, but NaNs with same bits as equal).

  • If you want to check if they are approximately equal, do fabs(lhs-rhs) <= epsilon*((fabs(lhs) < fabs(rhs) ? fabs(rhs) : fabs(lhs) (from The art of computer programming) with epsilon chosen for your application.

  • If you specifically want exactly the same bits with the exception of zeroes and NaNs, use the regular == operator.

Fun facts:

  • More than half of positive floats are inferior to 2 (exponents to ).

  • It is possible to use radixsort for float!

  • Another idea that I found important is that as float are only 32bits, it is possible to test every value when trying to approximate/implement a function with a single parameter (sqrt takes less than a few minutes for example) if it is not too complex.

Written on November 13, 2016