Why does C++ rounding behavior (for compile-time constants) change if math is moved to an inline function?
Consider the following functions:

#include <limits>

static inline float Eps(const float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  return (1.0f + eps) * x - x;
}

float Eps1() {
  return Eps(0xFFFFFFp-24f);
}

float Eps2() {
  const float eps = std::numeric_limits<float>::epsilon();
  const float x = 0xFFFFFFp-24f;
  return (1.0f + eps) * x - x;
}

At -O2 with -std=c++20, both functions compile down to a single movss followed by a ret with clang 16.0.0 targeting x86, and to a mov followed by a bx with gcc 11.2.1 targeting ARM. The assembly generated for ARM is consistent with a returned value of ~5.96e-8, but the x86 assembly is not: Eps1() (using the inline function) returns ~1.19e-7 while Eps2() returns ~5.96e-8. [Compiler Explorer / Godbolt]

.LCPI0_0:
  .long 0x33ffffff # float 1.19209282E-7
Eps1(): # @Eps1()
  movss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
  ret
.LCPI1_0:
  .long 0x33800000 # float 5.96046448E-8
Eps2(): # @Eps2()
  movss xmm0, dword ptr [rip + .LCPI1_0] # xmm0 = mem[0],zero,zero,zero
  ret

I can sort of understand the compiler choosing either option. With x = 0xFFFFFFp-24f (i.e. the next representable value below 1.0f), both compilers consistently round (1.0f + eps) * x to 1.0f, which means that (1.0f + eps) * x - x will give the smaller value. However, the ULP of 1.0f is twice that of 0xFFFFFFp-24f, so something like a multiply-add instruction that preserves extra precision would have an intermediate value of roughly 1.0 + 0.5 * eps, which yields the larger value.
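For what it's worth, here's a minimal runtime sketch (not from the original post) that reproduces both values. std::fmaf stands in for a contracted multiply-add; compile with -ffp-contract=off so the first expression genuinely rounds after the multiply:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
  const float eps = std::numeric_limits<float>::epsilon();  // 2^-23
  const float x = 0xFFFFFFp-24f;                            // 1 - 2^-24

  // Rounding after the multiply: the exact product is 1 + 2^-24 - 2^-47,
  // which rounds down to 1.0f, so the subtraction leaves 2^-24 ~ 5.96e-8.
  const float stepwise = (1.0f + eps) * x - x;

  // Exact product kept through the subtraction, as a contracted FMA would:
  // the result is 2^-23 - 2^-47 ~ 1.19e-7, matching Eps1()'s constant.
  const float fused = std::fmaf(1.0f + eps, x, -x);

  std::printf("stepwise: %.8e\nfused:    %.8e\n", stepwise, fused);
}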

The thing I don't understand is why the answer changes depending on whether the math is in an inline function or directly invoked. Is there somewhere in the standard that rationalizes this, is this undefined behavior, or is this a Clang bug?

Jab answered 10/5, 2023 at 3:49 Comment(1)
As I understand it, later C++ standards allow but do not require floating-point expression contraction. So (unless you explicitly overrule it) there is nothing preventing a compiler from doing things differently (and therefore producing different numerical results) when compiling with optimisation versus without. – Autarky

With clang 16's default of -ffp-contract=on (like #pragma STDC FP_CONTRACT ON), ISO C++ allows the compiler to keep infinite precision for FP temporaries or not, at its choice, including on a case-by-case basis; notably, contracting a*b+c into fma(a,b,c). This applies even during constant propagation at compile time. ISO C++ allows either default for the pragma.
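As an aside, here's a minimal sketch of pinning this down in source with the pragma (clang honors it; GCC's C++ front end largely ignores it, so treat this as clang-specific):

#include <limits>

// Forbid contraction in this file, so the multiply must be rounded to
// float before the subtraction. (Assumed here: clang's pragma support.)
#pragma STDC FP_CONTRACT OFF

float EpsNoContract(float x) {
  const float eps = std::numeric_limits<float>::epsilon();
  return (1.0f + eps) * x - x;  // with x = 0xFFFFFFp-24f: ~5.96e-8
}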


If you use -ffp-contract=off (or, surprisingly, -ffp-contract=fast), both functions return 5.96046448E-8. I haven't checked, but the difference is probably fma(1.0f + eps, x, -x) vs. rounding after the *x step.

It's a strange quirk that compile-time evaluation rounds differently depending on whether or not the math is in an inline function.

With a runtime-variable float x function arg, it makes the same asm for both Eps1 and Eps2, contracting both to vfmsub132ss, when you compile with -march=haswell or -march=x86-64-v3 to make FMA available.

If you want to write source that definitely does round between steps (except with -ffp-contract=fast), don't make multiple things part of the same expression. https://godbolt.org/z/15T5jcdfv shows an Eps3 using separate statements (sketched below) giving a return value that matches Eps2 but not Eps1 (with -ffp-contract=on).
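A sketch of what that Eps3 presumably looks like (the exact code is in the Godbolt link; the point is the statement split):

#include <limits>

// Each statement is its own full-expression, so with -ffp-contract=on the
// product must be rounded to float before the subtraction.
// (-ffp-contract=fast may still contract across the statements.)
float Eps3() {
  const float eps = std::numeric_limits<float>::epsilon();
  const float x = 0xFFFFFFp-24f;
  const float prod = (1.0f + eps) * x;  // rounds to exactly 1.0f
  return prod - x;                      // 1.0f - x == 2^-24 ~ 5.96e-8
}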


Theory about the reason for a difference with an inline function

Probably clang / LLVM's internals contracted the inline function into an FMA before inlining it, so constant propagation happened through an FMA for Eps1.

But in Eps2, the constant was available right away, so constant propagation could be done first. Plugging in numbers one operation at a time is cheaper for the compiler than optimizing the abstract operations, and evidently it does just that, without first looking for the opportunity to contract into a single FMA.

https://godbolt.org/z/7c9EYK8fb shows versions with float x function args, with contract on/off, confirming that Eps1 and Eps2 do compile to the same asm using FMA when FP contraction is enabled. (And Eps3 doesn't contract, as expected. Unless you use -ffp-contract=fast.)

As for why -ffp-contract=fast gives the same constant-propagation result as -ffp-contract=off: perhaps when the compiler doesn't have to track whether operations were part of the same source statement, it can defer the pass that looks for contraction. Deferring it until after inlining and constant propagation would explain why constant propagation gave the same result as with contraction disabled.

Etra answered 10/5, 2023 at 4:22 Comment(1)
Thank you for the reply. This doesn't definitively answer my question, but it seems very plausible and explains a lot. One nit though: in your first Godbolt link, the computation in Eps3() is written directly in the function body, just like Eps2(), so the fact that they give the same result isn't super meaningful. However, I did confirm that moving the computation in Eps3() into an inline function doesn't change the constant propagation. – Jab
