Floating Point One-Pager

Broadly speaking this post will go into increasing technical detail so that you may stop reading at the point where you either:

  1. Lose Interest
  2. Find the Answer you are After
  3. Lose the Ability to Follow My Garbled Writing

Note that regardless, this is by no means an exhaustive exploration of the topic. There is some significant glossing-over-details; I have merely tried to put “The Rules” into some context, and point the way to some problems you might want to delve deeper into by reading the references at the bottom for yourself.

After studying and digesting articles for a week, the only thing I can be certain of is that I have never fully comprehended all the dangers inherent in floating point numbers. And I never will. And neither, probably, will you.

All we’ll be able to do is take as much care as possible, and hope for the best.

Now, let’s go limit the damage we all might do.

Into the lair of IEEE 754 we go!

The Rules

If you don’t want to think, if you just want some rules-of-thumb to blindly follow, this is the section for you.

  • Multiplications and divisions are safe
  • Series of additions and subtractions should be ordered to operate on values of the closest magnitude first where possible
  • Do not compare floating point numbers directly
  • Do not use “Epsilon” to compare floating point numbers either
  • Use bankers rounding (round-to-even) whenever possible
  • Run code as 64-bit executables whenever possible
  • Never make any assumptions about how many decimal places are “good”, no, not even 2!

Comparing the Wrong Way

Most, if not all, programmers know that directly comparing two floating point numbers for equality does not work. If they are almost-but-not-quite-completely the same number, the code will fail.

0.1 x 10 != 1.0

Rounding errors along the way will kill any exact comparison.

The most common solution that gets suggested is to check if the absolute difference between two numbers is smaller than some tiny amount, and declare the numbers equal if that is the case:

if (Math.Abs(a - b) < 0.00001)
    ... NOTE: WRONG WAY TO DO IT! ...

if (Math.Abs(a - b) < Double.Epsilon)
    ... NOTE: WRONG WAY TO DO IT! ...

The former will work or not depending on the magnitude of “a” and “b”; if the two numbers are large relative to the “0.00001” bias it will work, but if they are smaller or even close to the bias itself, then the code will suddenly start failing.

The latter will probably never work; it is essentially the same as saying the two numbers need to be exactly equal, because “Double.Epsilon” is the smallest representable fraction in doubles, and the only thing that is smaller than the smallest representable fraction is equality.

Do not use either of these approaches.

Comparing using Relative Error

It is much better to do a comparison based on relative error, because the difference will scale with the inputs; the author of The Floating Point Guide recommends the following:

static bool RelativeCompare(double a, double b, double epsilon)
    // Compare NaNs and infinites
    if (a == b)
        return true;

    // Where relative error is meaningless use a direct epsilon
    double diff = Math.Abs(a - b);
    if ((a == 0) || (b == 0) || diff < double.MinValue)
        return diff < epsilon * double.MinValue;

    // Otherwise use relative error
    return diff / (Math.Abs(a) + Math.Abs(b)) < epsilon;

See The Floating Point Guide for a more detailed explanation and a link to an extensive test-suite for this code.

Comparing using ULPs

The most advanced solution to comparing floating point numbers is based on a key trick; when you re-interpret IEEE 754 floating point numbers as integers, the difference between two of those integers represents how many different floating point numbers exist between the two that were converted.

If you want a bit more explanation I suggest reading the 2012 edition of Comparing Floating Point Numbers by Bruce Dawson.

A .NET implementation of the relevant algorithms:

// Slow but safe version
static long DoubleToLong(double d)
    return BitConverter.ToInt64(BitConverter.GetBytes(d), 0);

// Fast but "unsafe" version
static unsafe long DoubleToLong(double d)
    return *((long*)&d);

static long UlpDistance(double a, double b)
    long intA = DoubleToLong(a);
    if (intA < 0)
        intA = Int64.MinValue - intA;

    long intB = DoubleToLong(b);
    if (intB < 0)
        intB = Int64.MinValue - intB;

    return Math.Abs(intA - intB);

static bool UlpCompare(double a, double b, long distance)
    return UlpDistance(a, b) <= distance;

The UlpDistance function returns the distance between two double precision floating point numbers. If this returns:

  • 0 – the floating point numbers are exactly identical
  • 1 – the floating point numbers are adjacent
  • 2 – there is one possible floating point representation between them
  • etc.

And by using the UlpCompare function you can indicate how close two floating point numbers must be to be considered equal, in a scale-independent way. In essence, the magnitude of the bias scales with the size of the values being compared.

A distance of about a million is enough to reasonably establish equality.

Problem Scenarios

Multiplication and Division

It may at first seem counter-intuitive that multiplication and division would not pose any problems. The reason for this is fairly simple though.

To illustrate, lets work in decimal and pretend we have 4 significant digits in our floating point representation:

(2.131 x 102) x (1.943 x 10-4)
= (non-rounded interim value) 4.140533 x 10-2
= 4.141 x 10-2

Even when two floating point numbers have wildly different exponents, the act of multiplying or dividing simply shifts the exponent of the result. And on either side of the calculation we have a full complement of significant digits.

Addition and Subtraction

For exactly the same reason, additions and subtractions can be problematic when dealing with big differences in exponents:

(2.131 x 102) + (1.943 x 10-4)
= 213.1 + 0.0001943
= (non-rounded interim value) 213.1001943
= 213.1
= 2.131 x 102

With a single addition or subtraction this loss of precision is unavoidable, because at the end of the calculation everything has to fit again into a finite amount of precision.

However, if you add a large series of numbers it suddenly becomes important to think about your additions:

(2.131 x 102) + (1.943 x 10-4) + … 999 times the same number
= 213.1 + 0.0001943 + … 999 more
= 213.1 + … 999 more
= 213.1 + 0.0001943 + … 998 more
= …
= 2.131 x 102

Every time an addition is performed in the CPU, the result needs to fit within our floating point representation and gets rounded to the original value. If we however perform the additions the other way around:

(2.131 x 102) + (1.943 x 10-5) + … 999 times the same number
= 213.1 + 0.0001943 + … 999 more
= 213.1 + 0.0003886 + … 998 more
= …
= 213.1 + 0.1943
= 213.2943
= 2.133 x 102

It’s not a big difference in this case, but depending on the number of additions involved and the relative magnitudes of the numbers it can make more or less of a difference. And all for the sake of re-ordering some additions.

If you were really keen you could probably develop some LINQ extensions in .NET that automatically re-order addition and subtraction sequences into the sequence in which the result is most accurate.

For now the point is: consider whether the values you are adding have wildly divergent magnitudes, and where possible try to order them to keep values with the same magnitude closer together to optimise for the most accurate result.


Have you ever used Math.Round in .NET? Were you a little confused at first? So was I.

You see, it uses Bankers Rounding by default (overloads exist where a rounding algorithm can be specified). When rounding 2.5, you’d usually expect to end up with 3, but Bankers Rounding doesn’t round all .5s equally. It rounds towards even, so 1.5 becomes 2, but 2.5 also becomes 2, then 3.5 becomes 4, and so on.

This is actually a very sensible default, but it isn’t well explained. In most cases the default for IEEE 754 operations is the same for the same reasons.

The problem with “normal” rounding is that it rounds 4 values down (0.1 – 0.4) and it rounds 5 values up (0.5 – 0.9). For one rounding operation this isn’t too big a deal, but most calculations round multiple times along the way. The compounding effect of this rounding bias is that results can slowly creep upwards with every rounding.

Bankers Rounding however on average rounds just as much up as it does down. As a result repeating rounding along the path of a calculation will jitter up just as much as down and on average not introduce any bias.

If you have end-users that might be confused by Bankers Rounding, then try to restrict “normal” rounding only to the last operation before display and keep using Bankers Rounding internally.

Representational Error

Below in the technical bits there is an explanation why IEEE 754 floating point numbers cannot precisely represent the value 0.1; it results in an infinite binary expansion:

0.1 (decimal) = 1.1001100110011… (binary) × 2−4

As a consequence it is safe to say that most real-world values cannot be accurately represented in double precision floating point numbers; as soon as you parse a value from a text file or a NUMBER from a database into a double, it is no longer completely accurate.

Basically, any real-world value becomes inaccurate before you even perform any operations on it.

The only reason this isn’t patently obvious all over the place is because floating point numbers tend to get rounded again before display, and in most cases this rounds away the error that this parsing introduces.

0.1 (decimal) becomes 0.10000000000000000555111… once it is parsed. Display routines will never show that many decimal places for a double, but it’s only a matter of a few calculations before that error creeps into decimals that do show up. Nine additions of the same number is all it takes to turn this into a real error.

It is important to remember that errors are pervasive and unavoidable.

It’s all about strategically structuring code to limit the ways in which these errors can compound and inflate.

Loss of Scale Issues

This is a particularly insidious problem, because it tends to not be evident when a program is first written. It literally takes time for this problem to become evident, typically once a program is running in production, or long after a library is first written.

This problem exhibits itself when measuring elapsed time (since application start, or some fixed point in the past), or when data volumes might increase over time.

When the elapsed time is small, or calculated values operate on limited data, this means that the magnitude of values will also be small. As a result, a large amount of the precision of a float is still in the fractional part of the result.

As time passes, as data volumes grow, as the magnitude of results grows, less and less of the floating point precision is represented in the fractional part.

If you assume your values will always have 2, 3, 5, any number of accurate decimal places, you may one day be taken by surprise when that stops being true. A single-precision float has only 6-9 significant digits. Once the magnitude of your value goes into the millions, you cannot count on any accurate decimals. Doubles have a bit more headroom, but it is still not impossible to run out of headroom with 15 significant digits.

And your calculations will start showing visible inaccuracies in any decimal places well before you hit the limits on the result.

The Technical Bits

Double precision floating point numbers are 64-bit numbers, laid out as follows:

Double precision floating point bits

The meaning of these bits is expressed in the formula in the header of this post:

Double precision floating point interpretation


  • The sign-bit is set for negative numbers
  • The fraction is applied as the binary “decimal” places in: 1.xxxxx…
  • The exponent is used to multiply this fraction across 2-1022 to 21023

As a result, double precision can roughly express:

  • 15 significant decimal places of precision
  • Values as small as ±5.0 × 10−324
  • Values as large as ±1.7 × 10308

Magic with Integers – 1

Less obviously, but more remarkably: this carefully designed binary format means that if you re-interpret two doubles of the same sign as 64-bit integers and subtract these integers, the difference will tell you how many distinct double precision floating point numbers lie in between. (0 = equal, 1 = none in between, 2 = 1 in between, etc.)

Conversely, if you increment or decrement the integer representation of a floating point number, it will give you the very next larger or smaller floating point number that can be represented.

Magic with Integers – 2

Between using the fraction bits with a leading 1 and an exponent of 1075 (corresponding to 252), and the de-normalized values including 0, doubles can contain exact representations of all integers from -(253) up to +(253). Within these bounds all arithmetic with integer inputs and results will be exact at all times.

Because of the 52-bit fraction though, once you pass 253, you will only be able to exactly represent even numbers. And past 254 only multiples of 4, and so on.

The distance between representable integers doubles past each power of 2.

Floating Point Spacing

This is also true in the opposite direction, where up to 252 all halves can be exactly represented, and up to 251 all quarters, and so on.

If you were to plot marks of exactly representable integers in the double precision floating point domain you would have a ruler where towards the left, passing each power of two, the marks get twice as close together and to the right, passing each power of two, the marks get twice as distant.

Double Precision Floating Point Ruler
Double Precision Floating Point Ruler

Representation Difficulties

The biggest problem with binary floating point numbers is that we live in a decimal world. Our reality has a nasty habit of thinking in terms of tenths, or hundredths, or thousandths.

Per the formula for floating point representations, something simple can prove impossible to represent:

0.1 (decimal) = 1.1001100110011… (binary) × 2−4

To exactly represent a decimal 0.1, you need an infinitely recurring fraction in binary. As a result, the closest possible representation in binary that fits within a double translates to approximately 0.10000000000000000555111… in decimal.

And depending on how these tiny inaccuracies stack it might get better or worse.

If you multiply 0.1 by 10.0, you get an exactly represented 1 due to some fortuitous rounding. If you add 0.1 ten times however, you get something slightly above 1. The type and order of arithmetic operations can either make the problem worse, or purely coincidentally, as if it never existed.

Internal Representation Trouble

If all this wasn’t difficult enough yet, in a well-meaning effort to improve accuracy, the x87 floating point instructions have 80-bit internal registers available. Arguably use of this extra precision does not harm the accuracy of calculations, but depending on when they do or do not get used, two applications executing the exact same sequence of operations may provide different results with no easy way to determine which of the two is the more accurate result.

It goes something like this:

  • Calculations that remain in internal registers use 80-bits
  • Except when they get written to/from memory, in which case they are rounded to 64-bit. When and whether this happens depends on how good the optimizer of the compiler is, and it may alter from version to version of the compiler itself.
  • Or if the FPU is switched into a mode where it forces 64-bit precision internally. From what I have read it is highly likely the .NET runtime does this, but none of the sources are committing to saying so outright.
  • Also, 64-bit executables will be 64-bit only because they will be using SSE and not x87, which has never and will never have 80-bit registers.

All-in-all, it is best to try and force applications to use 64-bit internally to ensure repeatable behaviour. And while at it, switch to a 64-bit version of Excel, and hope that Microsoft does not decide to do anything clever internally.

These discrepancies are the most annoying in dev/test scenarios where developers and testers get different results. It can be very hard to be certain whether a difference is due to discrepancies in accuracy, or if either algorithm is incorrect.

Decimal To The Rescue!

Working in .NET, one possible solution is to use Decimal type wherever possible.

The reason this helps has less to do with the fact this type uses 96 bits for the representation of the digits. It has a lot more to do with the exponent. The formula for Decimal is as follows:

(-1)s × (96-bit integer number) × 10-e, where e = 0-28.

The exponent is applied to powers of 10, which means that commonly occurring real-world fractions can be exactly represented. If you deal with money, a decimal can contain 2 exact fractional digits.

The only down-side is performance; you pay for these awkward-for-computers-to-process powers of 10 with up to 20-fold reduction in performance for arithmetic operations. This is definitely a worst-case-scenario number, but regardless it is a steep price to pay.


If you want to investigate further yourself, consider the following sources:

  1. Wikipedia: Double-precision floating-point format
  2. The Floating-Point Guide
  3. What Every Computer Scientist Should Know About Floating-Point Arithmetic
  4. IEEE 754-2008

One thought on “Floating Point One-Pager”

Comments are closed.