Floating point comparison irreproducibility

My Ph.D. student and I have encountered a problem in a physics data analysis context that I could use some insight on. We have code that analyzes data from one of the LHC experiments and gives irreproducible results. In particular, the results of calculations obtained from the same binary, run on the same machine, can differ between successive executions. We are aware of the many different sources of irreproducibility, but have excluded the usual suspects.

We have tracked the problem down to irreproducibility of (double precision) floating point comparison operations when comparing two numbers that nominally have the same value. This can happen occasionally as a result of prior steps in the analysis. An example we just found tests whether a number is less than 0.3 (note that we NEVER test for equality between floating values). It turns out that, due to the geometry of the detector, the calculation could occasionally produce a result that was exactly 0.3 (or its closest double precision representation).

We are well aware of the pitfalls in comparing floating point numbers and also of the potential for excess precision in the FPU to affect the results of the comparison. The question I would like to have answered is "why are the results irreproducible?" Is it because the FPU register load or other FPU instructions are not clearing the excess bits, so that "leftover" bits from previous calculations affect the results? (This seems unlikely.) I saw a suggestion on another forum that context switches between processes or threads could also change floating point comparison results, because the FPU state is stored on the stack and thus truncated. Any comments on these or other possible explanations would be appreciated.

Abrasion answered 15/2, 2011 at 19:35 Comment(10)
Could you please add a reference to the suggestion about context switches? While I can picture a processor moving accumulator data and discarding bits, this mechanism does not seem a good explanation to me, and some more detail could be interesting.Openwork
Perhaps using different compiler optimization flags might fix this problem.Aparicio
@Coffee on Mars: That was going to be my suggestion, so I think I can explain :) The issue is that the FPU can be using a higher number of bits in its registers, in some recent processors as many as 80 bits for doubles. Now, in a single threaded environment the FPU will be able to perform all the operations with that precision and you will get one result. If you add other threads/processes to the mix, when the OS performs the context switch it has to store the value of the 80-bit register in a 64-bit double, losing precision.Cristoforo
Not an answer, but a guess: it might be wiser for the software to do the comparison on the integer form of the float, i.e., reinterpreting the float's sign-exponent-mantissa bits as an unsigned long. This also implies storing the values of the experiment in that form, so you don't have issues when the numbers are too close to each other.Ineslta
To test your theory of "excess bits" I might suggest declaring and setting a new variable prior to the comparison. This will degrade performance and may not be a "solution" by any means but it would be an interesting way to reject your hypothesis.Uncomfortable
Also, is your program multithreaded or distributed ? The only cases where I encountered such discrepancies was with (subtle) race conditions. Also MPI could be a possible culprit if you're using it.Necrophobia
@Uncomfortable PK, a different approach that actually forces the compiler into writing to a cache line (the optimizer could remove the extra variable) is to declare the variables volatile. That will degrade performance a little, as it forces going through the L1 cache (and if there is false sharing it might end up going to the L2-L3 caches)Cristoforo
Sorry for neglecting important information like platform and OS as I had to finish the submission in a hurry. The calculations are being carried out on x86_64 platform and 64 bit Linux (Scientific linux). Here's some of /proc/cpuinfo and /proc/version:Abrasion
proc/version: Linux version 2.6.18-238.1.1.el5 gcc version 4.1.2 20080704 (Red Hat 4.1.2-50) Note that Scientific Linux is a variant of RedHatAbrasion
@David Rodriguez: that was the situation I was thinking about, but the problem I thought I saw in it was that the processor could be expected to save the exact state of the registers, as opposed to making a logical conversion to its logical "most significant" closest state. Bo Persson, in a comment below, cites the FSAVE/FRSTOR instructions which I would expect to be used in a context switch situation.Openwork

At a guess, what's happening is that your computations are normally being carried out to a few extra bits of precision inside the FPU, and only rounded at specific points (e.g., when you assign a result to a variable).

When there's a context switch, however, the state of the FPU has to be saved and restored -- and there's at least a pretty fair chance that those extra bits are not being saved and restored in the context switch. When it happens, that probably wouldn't cause a major change, but if (for example) you later subtract off a fixed amount from each and multiply what's left, the difference would be multiplied as well.

To be clear: I doubt that "left over" bits would be the culprit. Rather, it would be loss of extra bits causing rounding at slightly different points in the computation.

Laurelaureano answered 15/2, 2011 at 19:53 Comment(3)
I would bet that the OS is using the FSAVE/FRSTOR instructions, which actually dump and restore all the bits of the x87 state. This includes all 80 bits of the internal registers. Assuming an x86 with a 32 bit OS, which Brian forgot to tell us if he is using. :-)Lindeberg
@Bo Persson: the FP registers have 80 visible bits, but most have at least one or two "guard bits" as well, to give a properly rounded LSB in the 80 that are visible.Laurelaureano
All such conversions and state save/restore are deterministic. The saved FP state is a perfect copy.Merrow

What platform?

Most FPUs can internally store more accuracy than the IEEE double representation, to avoid rounding error in intermediate results. There is often a compiler switch to trade speed for accuracy - see http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx

Lanny answered 15/2, 2011 at 19:42 Comment(2)
And how would this produce non-deterministic results?Merrow
Context switches as described by JerryLanny

Is the program multi-threaded?

If yes, I would suspect a race condition.

If not, program execution is deterministic. The most probable reason for getting different results given the same inputs is undefined behaviour, i.e., a bug in your program: reading an uninitialized variable, a stale pointer, overwriting the lowest bits of some FP number on the stack, etc. The possibilities are endless. If you're running this on Linux, try running it under Valgrind and see if it uncovers some bugs.

BTW, how did you narrow down the problem to FP comparison?

(Long shot: failing hardware? E.g., failing RAM chip might cause data to be read differently on different occasions. Though, that'd probably crash the OS rather quickly.)

Any other explanation is implausible -- bugs in the OS or the HW would have not gone undiscovered for long.

Merrow answered 15/2, 2011 at 22:1 Comment(6)
Program is not multi-threaded though it is running in a multi-process context. And it has been run through Valgrind (including Memcheck) with no problems. In frustration at not identifying the source of irreproducibility we resorted to low-tech debugging -- dumping the value being compared to an analysis "cut" (0.3) to cout. Two separate executions both print 0.3 to 15 significant digit precision ( ... << std::setw(15) << dR ) yet the subsequent line that performs the comparison of dR to 0.3 produces a different result.Abrasion
why don't you log hex dumps of the variable values? log2(10^15) is ~49.82, and there are more mantissa bits in a long double.Meneau
OK, though valgrind does not always detect all issues. I suggest that you perform your debugging on the machine code level. Set breakpoint on the print statement, execute instruction by instruction and inspect the state of all registers after each instruction. Be sure to also look at XMM registers. Can you paste the code in question here (what is around the print statement and the comparison itself)?Merrow
I agree with this analysis. I sincerely doubt the context switch thing, unless I am proven wrong of course. I don't believe that register states are pushed onto the stack incorrectly.Necrophobia
In 2016 if this is still a problem, you could recompile with -fsanitize=undefined instead. This should catch more arithmetic errors I think.Marquet
It could also be some part of the code or a third-party library it loads/links with being badly behaved and leaving the floating-point rounding mode (or other runtime floating-point configuration) in the non-default state. Of course, that library would also have to be doing something nondeterministic (e.g. random numbers, I/O, or threading), but that's possible.Gentilism

I made this:

#include <stdio.h>
#include <stdlib.h>

typedef long double ldbl;

ldbl x[1<<20];

void hexdump( void* p, int N ) {
  for( int i=0; i<N; i++ ) printf( "%02X", ((unsigned char*)p)[i] );
}

int main( int argc, char** argv ) {

  printf( "sizeof(long double)=%i\n", (int)sizeof(ldbl) );

  if( argc<2 ) return 1;

  int i;
  ldbl a = ldbl(1)/atoi(argv[1]);

  for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) x[i]=a;

  while(1) {
    for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) if( x[i]!=a ) {
      hexdump( &a, sizeof(a) );
      printf( " " );
      hexdump( &x[i], sizeof(x[i]) );
      printf( "\n" );
    }
  }

}

compiled with IntelC using /Qlong_double, so that it produced this:

;;;     for( i=0; i<sizeof(x)/sizeof(x[0]); i++ ) if( x[i]!=a ) {

        xor       ebx, ebx                                      ;25.10
                                ; LOE ebx f1
.B1.9:                          ; Preds .B1.19 .B1.8
        mov       esi, ebx                                      ;25.47
        shl       esi, 4                                        ;25.47
        fld       TBYTE PTR [?x@@3PA_TA+esi]                    ;25.51
        fucomp                                                  ;25.57
        fnstsw    ax                                            ;25.57
        sahf                                                    ;25.57
        jp        .B1.10        ; Prob 0%                       ;25.57
        je        .B1.19        ; Prob 79%                      ;25.57
[...]
.B1.19:                         ; Preds .B1.18 .B1.9
        inc       ebx                                           ;25.41
        cmp       ebx, 1048576                                  ;25.17
        jb        .B1.9         ; Prob 82%                      ;25.17

and started 10 instances with different "seeds". As you can see, it compares the 10-byte long doubles from memory with one on the FPU stack, so in the case when the OS doesn't preserve full precision, we'd surely see an error. And well, they're still running without detecting anything... which is not really surprising, because x86 has instructions to save/restore the whole FPU state at once, and anyway an OS which won't preserve full precision would be completely broken.

So either it's some unique OS/CPU/compiler combination, or differing comparison results are produced after changing something in the program and recompiling it, or it's a bug in the program, e.g. a buffer overrun.

Meneau answered 15/2, 2011 at 20:48 Comment(0)

The CPU's internal FPU can store floating points at higher accuracy than double or float. These values have to be converted whenever the values in the register have to be stored somewhere else, including when memory is swapped out into cache (this I know for a fact) and a context switch or OS interrupt on that core sounds like another easy source. Of course, the timing of OS interrupts or context switches or the swapping of non-hot memory is completely unpredictable and uncontrollable by the application.

Of course, this depends on platform, but your description sounds like you run on a modern desktop or server (so x86).

Abramabramo answered 15/2, 2011 at 19:59 Comment(3)
Nice try, but all such conversions are deterministic. The FP state is saved via dedicated instructions and such save/restore mechanism does not lose information.Merrow
@zvrba: Deterministic does not mean lossless. An instruction which is deterministic doesn't behave deterministically if called non-deterministically.Abramabramo
Huh? All FP instructions produce exactly the same (even bitwise!) results given the same inputs, no matter under which circumstances they are called. (With the possible exception of one of the arguments being infinity or NaN.)Merrow

I'll just merge some of the comments from David Rodriguez and Bo Persson and make a wild guess.

Could it be task switching while using SSE3 instructions? Based on this Intel article on using SSE3 instructions, the instructions to preserve register state, FSAVE and FRSTOR, have been replaced by FXSAVE and FXRSTOR, which should handle the full length of the accumulator.

On an x64 machine, I suppose that the "incorrect" instruction could be contained in some external compiled library.

Openwork answered 15/2, 2011 at 22:33 Comment(3)
Those instructions are typically not used by a user-mode program, and if the linux kernel had such bug (i.e., using wrong instructions to save/restore FP state), it would have been found long ago.Merrow
Yes, I believe you're right. On the other side, a good task switching should not leave behind bits in the registers, and such a possibility has been discussed in the comments to the question. I'll leave the answer here, as a reference for a possible inquire in the assembly side of the problem.Openwork
Nobody here has yet proven that task switching is leaving bits in the registers. Hardware instructions meant for FP state switching certainly do not leave any garbage around.Merrow

You are certainly hitting GCC Bug n°323, which, as others point out, is due to the excess precision of the FPU.

Solutions:

  • Using SSE (or AVX, it's 2016...) to perform your computations
  • Using the -ffloat-store compile switch. From the GCC docs:

Do not store floating-point variables in registers, and inhibit other options that might change whether a floating-point value is taken from a register or memory.
This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.

Slogan answered 8/10, 2016 at 6:21 Comment(1)
That bug alone would not explain non-determinism across different runs of the same binary, on the same machine, with the same inputs.Gentilism
