Go Back   Orbiter-Forum > Blogs > cjp
Register Blogs Orbinauts List Social Groups FAQ Projects Mark Forums Read

Rate this Entry

Floating point variables: nothing is what it seems

Posted 03-16-2010 at 09:14 PM by cjp

It took some time since I first discovered the blog function on this forum, until I found a topic that would be suitable to post here. But with the high number of (potential) software developers here, I think that this first post will be useful to some of you.

Floating point numbers
The topic of today is floating point numbers. Programmers often think of them as the computer equivalent of what mathematicians call real numbers. Real programmers also know that this is not the whole truth, and that you have to take into account several accuracy issues. For instance, (1.0e-20 + 1.0) - 1.0 does not become 1.0e-20, because there are not enough digits available to represent 1.000000000000000000001, and hence the result of the first addition is rounded to 1.0.

There are even weirder issues. For instance, 0.1 can not exactly be represented in the binary floating point format, so the value that is stored is closer to 0.10000000000000001. And if you calculate (0.1+0.2)-0.3, it actually becomes 5.5511151231257827e-17 instead of zero. On the other hand, (1.0+2.0)-3.0 becomes zero, because 1.0, 2.0 and 3.0 can be represented exactly, and the addition and subtraction can happen without loss of precision.

Usually, a programmer can get away with ignoring these issues, because they only add a very small amount of inaccuracy. Usually, it is sufficient if a programmer knows that inaccuracies can occur (and that as a result, exact comparisons like a==1.0 are often a bad idea), and has some knowledge of what kind of equations can create large errors (such as (1.0e-20 + 1.0) - 1.0).

I am currently working with TBlaxland on some bug fixes for . While I was localizing an issue on the handling of parabolic orbits, I discovered that floating point values sometimes behave a lot weirder than the programming textbooks say!

The code I was investigating had the following structure:
double e = <some formula>;

if(e == 1.0)
    <modify e so that it becomes slightly different from 1.0>;
The reason for the second part of the code is that some of the formulas in the rest of the code can't handle exact values of e==1.0. Typically, a formula could e.g. include something like x / (e - 1.0). This can handle any value of e, except 1.0. This also justifies the exact comparison 'e == 1.0' in my code.

The problem I had was that somehow, the value of e was still exactly 1.0 after this code fragment. So, I added some tests like this:
double e = <some formula>;

if(e == 1.0)
    <modify e so that it becomes slightly different from 1.0>;
    printf("Test 1: %d\n", e == 1.0);
printf("Test 2: %d\n", e == 1.0);
Strangely enough, the first test was not triggered (indicating that e was not 1.0 in the first place), but the second test returned 1 (true).

Further testing showed that adding more printf lines made the problem magically disappear. Also, disabling optimization made the bug disappear. So I started searching for compiler bugs on optimization of floating point code.

What you didn't know about floating point numbers
I found the following page about the GCC compiler:
It links to the following document, which is also worth reading:

It turns out that this is not really a compiler bug: instead, it is a bug in 32-bit x86 processors. The makers of GCC deliberately decided to not make a workaround for this bug, because that would dramatically hurt performance.

The source of the problem is that x86 processors only really supports 'extended precision' 80-bit floating point operations. When loaded from RAM into floating point registers, all other formats are converted into this format, and they are converted back when floating point registers are stored back into RAM.

Usually, this is not a problem, because the extra size in the registers creates extra precision, which is only a good thing for most applications. The problem appears when the compiler starts optimizing things.

Unpredictable behavior
One of the most basic optimization techniques applied by a compiler is that it places as many variables as possible in processor registers. Access to registers is many times faster than access to RAM, so this generally makes programs much faster. Sometimes, it is still necessary to store variables in RAM, e.g. when registers are needed for other variables, or when function calls are involved.

Usually, the programmer doesn't have to worry about this, because the compiler takes care of it, and it makes sure that the variables behave in just the same way in register memory as they would in RAM. Guess what: for floating point variables, this isn't exactly the case. You better start worrying now about compiler optimizations!

As I wrote before, floating point registers have a higher precision than floating point values stored in RAM. This also means that they have a different precision, and hence they behave differently.

So, how does this explain the weird behavior of my code in KOST?
What probably happened was that, after the value of e was evaluated, e was still present in a register. It had a value that was extremely close to, but different from 1.0. As a result, the 'e == 1.0' test, which was completely evaluated using register values, evaluated to false, and the code between the {} brackets was not executed. At a later point (for instance, the 'printf("Test2:%d\n", e==1.0);' line), the compiler decided that e should be stored in RAM. In the process of storing to RAM, e lost some precision, and as a result it became exactly equal to 1.0.

So, one moment, e==1.0 is false, and the next moment it is true. You can't predict when this will happen, because you can't control how the compiler optimizes your code.

The solution
The real solution would be to have a better floating point unit in your processor. Luckily, this is already the case for many processors: the SSE extension contains floating point instructions that solve exactly this issue. You can tell your compiler to use SSE instructions instead of the old x87 instructions, but unfortunately, this makes your executables non-portable to old computers (pre-Pentium III), so most compilers have this option disabled by default. The good thing is that SSE is available on all 64-bit x86 processors, and naturally, many compilers have SSE enabled by default for 64-bit executables.

A more portable solution is to tell your compiler to store all floating point variables in RAM. For GCC, this is the "-ffloat-store" option. Unfortunately, this does have a serious impact on the efficiency of your program.

The solution which I'll try to use in KOST is to try to deal with the situation, by making the code robust against unexpected changes in accuracy. This is a bit more work, but the result is bot portable and fast, and it is consistent with the typical way of using KOST, where the user decides which compiler options are used. Besides, I might also use this opportunity to increase the accuracy of values of (e-1.0).

In retrospect, I realize that I've already seen this bug before. In my game Ultimate Stunts, I once had a bug that 3D shapes where messed up on 64-bit systems. It turned out that my code implicitly depended on the increased accuracy of temporary values. On 64-bit systems, this increased accuracy was not present, and as a result the code failed. In that case, the solution was simple: explicitly use a higher-accuracy type for the temporary value, so that the calculation works on all platforms.

I think that, if you are planning to do a lot of floating point programming, sooner or later you are going to to run into this problem. I hope that having read this post will then help you recognize this problem.
Posted in Uncategorized
Views 5341 Comments 3
« Prev     Main     Next »
Total Comments 3


  1. Old Comment
    tblaxland's Avatar
    Thanks for that cjp. A thorough explanation of a tricky problem.
    Posted 03-16-2010 at 11:04 PM by tblaxland tblaxland is offline
  2. Old Comment
    Wraith's Avatar
    A more portable solution is to tell your compiler to store all floating point variables in RAM. For GCC, this is the "-ffloat-store" option. Unfortunately, this does have a serious impact on the efficiency of your program.
    Well, it's broken, because it works for variables only, not for intermediate computations, eg.
    float a = (0.3f  + x) * ( 0.4f + y); 
    float b = 5.0f + a;
    will only forcibly store-load x,y,a and b, but neither the results of 0.3f + x nor 0.4f + y. Imagine if you break a complex expression into several simpler ones and discover that the results are not quite the same.

    I personally avoid direct comparisons of floats whenever I can, resorting to a slightly slower, but much safer fabsf(x - y) <= TOL instead of x==y.
    Posted 03-17-2010 at 08:53 PM by Wraith Wraith is offline
  3. Old Comment
    cjp's Avatar
    Originally Posted by Wraith View Comment
    A more portable solution is to tell your compiler to store all floating point variables in RAM. For GCC, this is the "-ffloat-store" option. Unfortunately, this does have a serious impact on the efficiency of your program.
    Well, it's broken, because it works for variables only, not for intermediate computations,
    True, but it fixes two important things:
    • There is no longer a difference between the optimized and the non-optimized code
    • Variables no longer change their value unexpectedly in the middle of the program.
    But, agreed, with -ffloat-store, you still have temporaries with a precision that is different from what you'd expect from the equation. So, it is only a partial solution.
    Posted 03-22-2010 at 04:25 PM by cjp cjp is offline

All times are GMT. The time now is 07:09 AM.

Quick Links Need Help?

About Us | Rules & Guidelines | TOS Policy | Privacy Policy

Orbiter-Forum is hosted at Orbithangar.com
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2019, vBulletin Solutions Inc.
Copyright 2007 - 2017, Orbiter-Forum.com. All rights reserved.