With GCC 5.3, the following code compiled with -O3 -mfma
float mul_add(float a, float b, float c) {
return a*b + c;
}
produces the following assembly
vfmadd132ss %xmm1, %xmm2, %xmm0
ret
I noticed that GCC already does this with -O3 as of GCC 4.8.
Clang 3.7 with -O3 -mfma produces
vmulss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0
retq
but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3.
I am surprised that GCC does this with -O3, because this answer says
The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.
This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.
However, this link says
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.
So now I am confused and concerned.
- Is GCC justified in using FMA with -O3?
- Does fusing violate strict IEEE floating-point behaviour?
- If fusing does violate IEEE floating-point behaviour, and since GCC defines __STDC_IEC_559__, isn't this a contradiction?
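To see what a given compiler and flag combination actually claims, the relevant macros can be queried directly. This is a minimal sketch with hypothetical function names; __STDC_IEC_559__, FP_FAST_FMA and FLT_EVAL_METHOD are standard C99 macros, but whether each is defined depends on the compiler, the C library and the target flags.

```c
#include <float.h>
#include <math.h>

/* 1 if the implementation claims IEC 60559 (IEEE 754) conformance, else 0. */
int claims_iec_559(void) {
#ifdef __STDC_IEC_559__
    return 1;
#else
    return 0;
#endif
}

/* 1 if <math.h> advertises fma() on double as generally about as fast as
   a separate multiply and add, else 0. */
int has_fast_fma(void) {
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}

/* The evaluation method for floating-point expressions, from <float.h>. */
int flt_eval_method(void) {
    return FLT_EVAL_METHOD;
}
```

Comparing these values across builds with and without -mfma is one way to check what each build claims about conformance.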
Since FMA can be emulated in software, it seems to me there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.
Apparently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast and with Clang it's not. Other options such as -ffp-contract=on and -ffp-contract=off do not produce the FMA instruction.
For example, Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.
I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. In all cases I also used -O3 -mfma.
With GCC the answer is simple. #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.
GCC uses fma with -ffp-contract=fast (the default).
Clang uses fma
- with -ffp-contract=fast.
- with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).
In other words, with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) sets -ffp-contract=fast.
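The pragma placement tested here might look like the following sketch. The function names are hypothetical; per the results above, Clang with -mfma is expected to fuse only the first function, while GCC ignores the pragma in both (and may warn about it) and follows -ffp-contract instead.

```c
/* Contraction explicitly allowed for this function body. */
float mul_add_contract_on(float a, float b, float c) {
#pragma STDC FP_CONTRACT ON
    return a * b + c;
}

/* Contraction explicitly disallowed for this function body. */
float mul_add_contract_off(float a, float b, float c) {
#pragma STDC FP_CONTRACT OFF
    return a * b + c;
}
```

Inspecting the generated assembly (for example with -S) for vfmadd132ss versus vmulss/vaddss is how the difference shows up; for inputs whose product is exactly representable, both versions return the same value.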
I looked into MSVC and ICC.
MSVC uses the fma instruction with /O2 /arch:AVX2 /fp:fast. With MSVC, /fp:precise is the default.
ICC uses fma with -O3 -march=core-avx2 (actually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC, use -fp-model strict or -no-fma.
So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 with ICC), but Clang and MSVC do not.
clang doesn't do this. I'm not posting this as an answer, since it's not based on any real standards documentation, just my understanding of how I think things should work / should have been designed, given the material in the question. – Misreckon
double fma(double x, double y, double z); instead as that is a function call that in an optimized compiler will call the expected assembly code. This does not violate "IEEE floating-point behaviour". – Dearing