I and my Ph.D. student have encountered a problem in a physics data analysis context that I could use some insight on. We have code that analyzes data from one of the LHC experiments that gives irreproducible results. In particular, the results of calculations obtained from the same binary, run on the same machine can differ between successive executions. We are aware of the many different sources of irreproducibility, but have excluded the usual suspects.
We have tracked the problem down to irreproducibility of (double precision) floating point comparison operations when comparing two numbers that that nominally have the same value. This can happen occasionally as a result of prior steps in the analysis. An example we just found an example that tests whether a number is less than 0.3 (note that we NEVER test for equality between floating values). It turns out that due to the geometry of the detector, it was possible for the calculation to occasionally produce a result which would be exactly 0.3 (or its closest double precision representation).
We are well aware of the pitfalls in comparing floating point numbers and also with the potential for excess precision in the FPU to affect the results of the comparison. The question I would like to have answered is "why are the results irreproducible?" Is it because the FPU register load or other FPU instructions are not clearing the excess bits and thus "leftover" bits from previous calculations are affecting the results? (this seems unlikely) I saw a suggestion on another forum that context switches between processes or threads could also induce a change in floating point comparison results due to the contents of the FPU being stored on the stack, and thus, being truncated. Any comments on these =or other possible explanations would be appreciated.
volatile
. That will degrade performance a little, as it forces going through L1 cache (and if there is false sharing it might end up going to L2-L3 cache) – Cristoforo