How reproducible are floating point CPU operations on x86-64?

Note: this question is about CPU instructions, not high-level languages (where you are at the mercy of the compiler)


From a popular answer:

The same floating-point operations, run on the same hardware, always produce the same result.

Can we make a stronger guarantee though, on x86-64? What if the hardware is a bit different? Are CPU instructions reproducible within the same family of CPUs? Where is the boundary of reproducibility?

Dunkirk answered 13/11, 2023 at 19:25 Comment(24)
What if the hardware is a bit different? ... then that's not the same hardwareGulgee
Different generations of x86 processors produce different numerical results, even for the same vendor. AFAIK the elementary operations fadd, fsub, fmul, fdiv, fsqrt are stable, but transcendental instructions as well as more advanced elementary operations (like rsqrtss) can vary quite a bit depending on µarch, even for processors from the same vendor.Tamatamable
@Tamatamable Thanks. Within the same generation of CPUs, can we expect complete reproducibility?Dunkirk
@Dunkirk Within the same microarchitecture, of course. However, I cannot say for sure if different microarchitectures of the same generation (e.g. Intel's mainline offering vs. their low-power offering) always have the same behaviour. But note also that within a microarchitecture, different binnings have different ISA extensions and if your code does runtime dispatch based on CPU features, it may end up executing different code paths, leading to different results.Tamatamable
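A minimal sketch of that dispatch hazard (not taken from any particular library; assumes a GCC/Clang x86-64 build, compiled with -ffp-contract=off and -lm so the compiler does not fuse the two-step version on its own): a code path that uses FMA and one that uses separate multiply and add are each deterministic, yet they disagree in the last bits.

    #include <math.h>    /* fma */
    #include <float.h>   /* DBL_EPSILON */
    #include <stdio.h>

    int main(void) {
        double a = 1.0 + DBL_EPSILON, b = 1.0 - DBL_EPSILON, c = -1.0;

        double two_step = a * b + c;     /* a*b rounds to exactly 1.0, so this is 0.0 */
        double fused    = fma(a, b, c);  /* one rounding: keeps the -epsilon^2 term   */

        printf("mul+add: %.17g\n", two_step);   /* 0             */
        printf("fma:     %.17g\n", fused);      /* about -4.9e-32 */
        return 0;
    }

A library doing runtime dispatch could pick either path depending on the CPUID feature flags, so two CPUs of the same family could print different numbers from the same binary.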
Are you including x87 instructions like fsin that aren't required to be "correctly rounded" (precise to the last mantissa bit) the way IEEE basic ops are? (+ - * / and sqrt). Most x86-64 math libraries don't use 387 instructions because they're not fast. Outside of x87, the only SSE/AVX instructions that leave room for implementation-dependent results are I think rsqrtss/ps and rcpss/ps like @Tamatamable mentioned. And with AVX-512, VRSQRT14PS/pd and vrcp14ps/pd. (And probably the Xeon Phi 28-bit versions and Xeon Phi vexp2ps/pd)Functionalism
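A small sketch of how to observe that split on a given machine (assumes SSE intrinsics; the rsqrtss bits are the part that is allowed to differ between microarchitectures, while the divss/sqrtss result is correctly rounded and therefore fixed):

    #include <immintrin.h>
    #include <math.h>      /* sqrtf; link with -lm */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float x = 3.0f;
        float approx = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));  /* rsqrtss: ~12-bit approximation  */
        float exact  = 1.0f / sqrtf(x);                             /* correctly rounded on any x86-64 */

        uint32_t ab, eb;
        memcpy(&ab, &approx, sizeof ab);
        memcpy(&eb, &exact,  sizeof eb);
        printf("rsqrtss: %.9g (0x%08x)\n", approx, ab);
        printf("exact:   %.9g (0x%08x)\n", exact,  eb);
        return 0;
    }

The second line's bit pattern should match everywhere; the first line's is the one that would need checking across CPU generations.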
@PeterCordes For the purposes of this question, I'm including all legal instructions that are not intentionally nondeterministic. Assuming the CPU flags (such as in lscpu) and the CPU family numbers are the same, can we expect reproducible results?Dunkirk
Just the "Family" field in CPUID? That's 6 for Intel CPUs from PPro to current, other than Pentium 4. Sandybridge is basically a new microarchitecture family, but it inherits a lot from P6 and they didn't bump the family number. The "model" changes every microarchitecture but not "family". With the same CPUID feature flags, Haswell through current Alder Lake haven't added any FP-related stuff (if we skip Ice Lake / Tiger Lake that have AVX-512 even on client CPUs, or look at "Celeron" versions of those), although there are other features that would make lscpu output different.Functionalism
I don't know for sure if fsincos etc. microcode might have changed at any point between those, or rsqrtps, but Haswell didn't have AVX-512 at all, Skylake and later do but they leave it disabled. If there is an increase in precision of rsqrtps to make it the same as vrsqrt14ps, I'd expect it between Haswell and some later CPU, perhaps Skylake (1st gen with AVX-512). That's an interesting question I don't know the answer to. I'd encourage you to flesh out your question with an awareness of this kind of detail.Functionalism
Within a single microarchitecture generation, I expect it's fully deterministic across CPUs. Probably no microcode updates have ever changed results of math instructions. (The fdiv bug in P5 was one of the motivations for having microcode in P6 that's loaded from the firmware and updateable, but I don't think they've made a similar mistake since.)Functionalism
I think the point that is getting lost here is that apart from legacy x87 instructions that you shouldn't use anyway and some instructions that are explicitly just approximations (rsqrt), all instructions are accurate within the limits of floating point. Addition, multiplication, division, square root all produce the correctly rounded result and do so fully deterministically. They have done so for ages (not counting CPU bugs) and will continue to do so in future generations of x86 chips.Ancelin
@Ancelin Exact reproducibility is a pretty desirable property in science, even when all possible results are equally valid.Dunkirk
@Dunkirk I don't understand why you are telling me this because my comment said that this is what you will get. So, yes, I do agree that exact reproducibility is a desirable property. Good for us that the SSE floating point instructions are deterministic and always produce the correctly rounded result, meaning the result closest to the infinitely precise result.Ancelin
@Ancelin Thanks! I misread your original commentDunkirk
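One way to spot-check that claim on any particular machine (a sketch; the two reference bit patterns in the comments are the correctly rounded IEEE-754 binary64 values, which every compliant x86-64 CPU should produce):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>    /* sqrt; link with -lm */

    static uint64_t bits_of(double d) {
        uint64_t b;
        memcpy(&b, &d, sizeof b);
        return b;
    }

    int main(void) {
        volatile double one = 1.0, three = 3.0, two = 2.0;  /* volatile: force runtime divsd/sqrtsd */

        /* Correctly rounded binary64 results, identical on every IEEE-754-compliant CPU. */
        printf("1.0/3.0 = 0x%016llx (expect 0x3fd5555555555555)\n",
               (unsigned long long)bits_of(one / three));
        printf("sqrt(2) = 0x%016llx (expect 0x3ff6a09e667f3bcd)\n",
               (unsigned long long)bits_of(sqrt(two)));
        return 0;
    }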
You need to start with the IEEE-754 floating-point format, not the implementation, and the exact and reproducible values you will get for each set of inputs. Then talk about whether the processor has properly met that standard, or made up its own, or has bugs, or other issues. From the linked question: order of operations matters, so a+b+c vs c+b+a may vary due to rounding of the intermediate values, for example, yes? But a+b+c with the processor in the same mode, with the same inputs/instructions, etc., should give the same result.Benson
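A tiny sketch of that ordering effect: each addition below is correctly rounded and fully deterministic, yet the two summation orders give different answers for the same three inputs.

    #include <stdio.h>

    int main(void) {
        double a = 1e16, b = -1e16, c = 1.0;

        printf("(a + b) + c = %g\n", (a + b) + c);   /* 0 + 1 = 1                               */
        printf("(c + b) + a = %g\n", (c + b) + a);   /* c + b rounds to -1e16, so the result is 0 */
        return 0;
    }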
The IEEE-754 version from way back, when I studied it and worked on a compliant floating-point unit, had of course different rounding modes, but also, depending on whether exceptions were enabled or not, the spec defined different results for certain operations. So a+b+c, at least back then, with the same exact inputs but in different modes, can of course give different results, due to the modes. A non-buggy chip, or let's say a family of some range of generations of non-buggy x86 processors, should give the same results in the same modes for the same inputs and the same operations.Benson
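A sketch of that mode dependence using the standard C fenv interface (the volatile qualifiers keep the compiler from folding the division at compile time under the default mode; GCC may also want -frounding-math):

    #include <fenv.h>
    #include <stdio.h>

    int main(void) {
        volatile double x = 1.0, y = 3.0;

        fesetround(FE_TONEAREST);
        printf("to nearest: %.17g\n", x / y);
        fesetround(FE_UPWARD);
        printf("upward:     %.17g\n", x / y);   /* one ULP higher than the line above */
        fesetround(FE_TONEAREST);               /* restore the default mode */
        return 0;
    }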
Someone in the prior question said that if you compile the same C code two times you can get different results, which, depending on context, may or may not be true. And you can have the same problems with assembly as well. A human-written decimal string, for example, is converted to some IEEE-754 binary floating-point format through the shared/system C library the compiler/assembler is using; that conversion is subject to the mode, etc., the COMPILER is in at the time, and may not produce the binary version of the value you expected, and thus a result that is not what you expected.Benson
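A sketch of how to check that particular worry on a given toolchain: compare the bits the compiler produced for a decimal literal against the bits the system C library's strtod produces for the same string at run time.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        double lit = 0.1;                   /* converted by the compiler at build time */
        double run = strtod("0.1", NULL);   /* converted by the C library at run time  */

        uint64_t lb, rb;
        memcpy(&lb, &lit, sizeof lb);
        memcpy(&rb, &run, sizeof rb);
        printf("literal: 0x%016llx\n", (unsigned long long)lb);
        printf("strtod:  0x%016llx\n", (unsigned long long)rb);
        printf("%s\n", lb == rb ? "same bits" : "DIFFERENT bits");
        return 0;
    }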
That reference may not be related to what I experienced, but doing it twice, for any language, in the same mode, same day, same computer, etc., should give the same results. Compilers generally do not have randomizers and do not make different code on each build from the same sources.Benson
Sorry: a range of non-buggy x86 processors that claim to support the same standard, in the same mode, etc., etc.Benson
@Benson Thanks! I replaced "assembly-level" with "CPU" in the question.Dunkirk
IMO it is about the spec and whether you conform or not (and whether the spec changed, and how that may affect results). Intel was heavily involved in creating the original spec; I'm not sure what their involvement has been since. If there are operations that you support that are not part of a spec, then, well, there you go for those. But being the same company, and being Intel specifically, there is a spec, internal or external, and multiple departments validating it (because they don't get along and don't talk to each other).Benson
Note that the IEEE 754 standard allows some different behaviors concerning the exceptions (e.g. underflow before vs after rounding, fma(0,∞,qNaN)…). There may also be differences concerning NaN encoding. But I don't know whether the x86-64 spec completely specifies such cases. Moreover, in the past, concerning the x87 hardware math functions, AMD had more accurate implementations than Intel, but AMD users were complaining that they did not get the same results as with Intel processors.Distaste
@vinc17: If exceptions are masked, the bit-pattern in the destination register is standardized, right? If the exact result is just over half the smallest subnormal, it's guaranteed to round up to non-zero (unless the rounding mode is truncation or towards -Inf, and assuming FTZ isn't set). My Skylake does indeed report DE UE PE (Denormal, Underflow, and Precision exceptions) in MXCSR after 7.0064923216e-39f / 5005000.0f (with divss in asm) which produces the binary32 bit-pattern 0x00000001. (The first number is min_subnormal * 5000000). vs. only DE w. x / 5000000.0Functionalism
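A rough reproduction of that experiment (a sketch assuming an ordinary x86-64 build where the float division compiles to divss; the constants come from the comment above, and the flag macros are the standard xmmintrin ones):

    #include <xmmintrin.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        volatile float num = 7.0064923216e-39f;   /* about FLT_TRUE_MIN * 5000000 (subnormal) */
        volatile float den = 5005000.0f;

        _MM_SET_EXCEPTION_STATE(0);               /* clear the sticky MXCSR exception flags */
        float q = num / den;                      /* divss */
        unsigned ex = _MM_GET_EXCEPTION_STATE();

        uint32_t qb;
        memcpy(&qb, &q, sizeof qb);
        printf("result bits: 0x%08x\n", qb);      /* 0x00000001 reported in the comment above */
        printf("DE=%d UE=%d PE=%d\n",
               !!(ex & _MM_EXCEPT_DENORM),
               !!(ex & _MM_EXCEPT_UNDERFLOW),
               !!(ex & _MM_EXCEPT_INEXACT));
        return 0;
    }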
@PeterCordes non-NaN FP results are fully predictable from the IEEE/IEC spec, but when the result is a NaN you need to lean on additional architecture-specific knowledge regarding which (canonical) NaN is produced when operands are non-NaN, which NaN is propagated when more than one operand is a NaN, and similarly for FP->INT conversions: what result is produced for out-of-range operands.Corriecorriedale
@amonakov: Oh right, other than NaN payloads. (x86 does standardize out-of-range FP->int conversions, signed produce MSB-set, rest clear (INT_MIN) which Intel calls the "integer indefinite" bit-pattern. e.g. felixcloutier.com/x86/cvtsd2si Unsigned (AVX-512) produce all-set (UINT_MAX). I didn't check AMD's manuals, but this is clearly documented and easily observable. But yes, some non-x86 ISAs do produce different bit-patterns.)Functionalism
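A sketch of that conversion behaviour using the cvtsd2si intrinsic (which avoids the C-level undefined behaviour a plain cast of an out-of-range value would have):

    #include <emmintrin.h>
    #include <math.h>    /* NAN */
    #include <stdio.h>

    int main(void) {
        int big = _mm_cvtsd_si32(_mm_set_sd(1e30));   /* out of int32 range */
        int nan = _mm_cvtsd_si32(_mm_set_sd(NAN));    /* NaN input          */

        /* Both print the x86 "integer indefinite" value, INT_MIN (0x80000000). */
        printf("1e30 -> %d (0x%08x)\n", big, (unsigned)big);
        printf("NaN  -> %d (0x%08x)\n", nan, (unsigned)nan);
        return 0;
    }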
