How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX:
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.

I'd like to know how to do this best in code, and I also want to know how it's done internally in the CPU, i.e. with the superscalar architecture. Let's say I want to do a long sum such as the following in SSE:

//sum = a1*b1 + a2*b2 + a3*b3 +... where a is a scalar and b is a SIMD vector (e.g. from matrix multiplication)
sum = _mm_set1_ps(0.0f);
a1  = _mm_set1_ps(a[0]); 
b1  = _mm_load_ps(&b[0]);
sum = _mm_add_ps(sum, _mm_mul_ps(a1, b1));

a2  = _mm_set1_ps(a[1]); 
b2  = _mm_load_ps(&b[4]);
sum = _mm_add_ps(sum, _mm_mul_ps(a2, b2));

a3  = _mm_set1_ps(a[2]); 
b3  = _mm_load_ps(&b[8]);
sum = _mm_add_ps(sum, _mm_mul_ps(a3, b3));
...

My question is: how does this get converted to simultaneous multiply and add? Can the data be dependent? I mean, can the CPU execute _mm_add_ps(sum, _mm_mul_ps(a1, b1)) as a simultaneous multiply and add, or do the registers used in the multiplication and the addition have to be independent?

Lastly, how does this apply to FMA (on Haswell)? Is _mm_add_ps(sum, _mm_mul_ps(a1, b1)) automatically converted to a single FMA instruction or micro-operation?

Excite answered 10/4, 2013 at 18:2 Comment(0)

The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not -std=c*, e.g. -std=c++14). For Clang, contraction across statements is only enabled with -ffp-contract=fast. (With just the #pragma enabled, contraction happens only within a single expression like a+b*c, not across separate C++ statements.)

This is different from strict vs. relaxed floating point (or, in GCC terms, -ffast-math vs. -fno-fast-math), which would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.
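
To make the single-rounding point concrete, here is a minimal scalar sketch (the constants are only chosen to expose the extra rounding; compile with contraction disabled, e.g. -ffp-contract=off, so the unfused line really performs two roundings):

#include <math.h>
#include <stdio.h>

int main(void) {
    // a*b = 1 + 2^-26 + 2^-54 exactly; the 2^-54 part is lost when the
    // product is rounded to double before the add, but kept by fma().
    double a = 1.0 + 0x1p-27;
    double b = 1.0 + 0x1p-27;
    double c = -1.0;

    double unfused = a * b + c;    // two roundings (multiply, then add)
    double fused   = fma(a, b, c); // one rounding of the exact a*b + c

    printf("unfused: %a\nfused:   %a\n", unfused, fused);
    return 0;
}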

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.


So the best way to make sure you actually get the FMA instructions you want is to actually use the provided intrinsics for them (a short sketch follows the lists below):

FMA3 Intrinsics: (Intel Haswell - introduced alongside AVX2)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations...

FMA4 Intrinsics: (AMD Bulldozer - introduced alongside XOP)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations...
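
For example, the accumulation from the question maps directly onto FMA3. A minimal sketch, assuming a and b are the same arrays as in the question (b 16-byte aligned) and the code is built with e.g. -mfma; the function name is just a placeholder:

#include <immintrin.h>

// Same dot-product-style accumulation as in the question, but with each
// multiply+add pair fused into one FMA3 instruction.
__m128 three_fmas(const float *a, const float *b) {
    __m128 sum = _mm_setzero_ps();
    sum = _mm_fmadd_ps(_mm_set1_ps(a[0]), _mm_load_ps(&b[0]), sum); // sum = a1*b1 + sum, one rounding
    sum = _mm_fmadd_ps(_mm_set1_ps(a[1]), _mm_load_ps(&b[4]), sum);
    sum = _mm_fmadd_ps(_mm_set1_ps(a[2]), _mm_load_ps(&b[8]), sum);
    return sum;
}
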
Hairdo answered 10/4, 2013 at 18:33 Comment(22)
Thanks, that more or less answers my question about FMA. I should really spend some time learning some x86 assembly. That would probably answer most of my questions.Excite
As for your question about whether a multiply and an add can be done simultaneously (FMA): the answer is no, since the add uses the result of the multiply. So you eat the latency of add + multiply. An FMA instruction does both operations together, usually with the same latency as a single multiply, so the add is free.Hairdo
Thanks, that's what I thought. Now I just need to figure out how to organize my code so that the sum I defined above does independent adds and multiplies simultaneously (so I avoid latencies).Excite
You actually don't need to spend too much time on it. The hardware is capable of reordering things. As it is right now, all the multiplies are independent, but the adds are all dependent on the sum variable. So your best bet is to split the sum variable into multiple variables and run multiple sum chains in parallel across all of them. Then at the end, combine them back into one sum.Hairdo
Okay, I do something like that already (unrolling my loop). I assume I don't need separate sum variables for every sum? Right now I do four (sum1, sum2, sum3, sum4) and then add those together and increment my loop counter by four.Excite
You only need to separate them as much as it takes to reach the max throughput. The critical path is on the additions. The latency of an addps is 3 cycles. But the throughput is 1. So you need a minimum of 3 separate sum chains to fully utilize it. You currently have 4, so that's sufficient.Hairdo
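
A rough sketch of that multi-accumulator unrolling (hypothetical function and loop bounds; assumes FMA3 is available, b is 16-byte aligned, and n is a multiple of 4):

#include <immintrin.h>

// Four independent accumulator chains hide the add/FMA latency;
// they are combined only once at the end.
__m128 unrolled_sum(const float *a, const float *b, int n) {
    __m128 sum0 = _mm_setzero_ps();
    __m128 sum1 = _mm_setzero_ps();
    __m128 sum2 = _mm_setzero_ps();
    __m128 sum3 = _mm_setzero_ps();

    for (int i = 0; i < n; i += 4) {
        sum0 = _mm_fmadd_ps(_mm_set1_ps(a[i+0]), _mm_load_ps(&b[4*(i+0)]), sum0);
        sum1 = _mm_fmadd_ps(_mm_set1_ps(a[i+1]), _mm_load_ps(&b[4*(i+1)]), sum1);
        sum2 = _mm_fmadd_ps(_mm_set1_ps(a[i+2]), _mm_load_ps(&b[4*(i+2)]), sum2);
        sum3 = _mm_fmadd_ps(_mm_set1_ps(a[i+3]), _mm_load_ps(&b[4*(i+3)]), sum3);
    }

    // Combine the independent chains back into one sum.
    return _mm_add_ps(_mm_add_ps(sum0, sum1), _mm_add_ps(sum2, sum3));
}
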
Thanks, that answers all my questions.Excite
gcc in practice does fuse mul/add intrinsics at -O3, without -ffast-math, even in situations where FLT_EVAL_METHOD=0, not 2. This is supported by some documentation (that keeping infinite precision for temporaries is always legal, to allow collapsing expressions). There's a new question about this.Sundry
@PeterCordes That's good to know. Either I wasn't aware of that at the time, or things have changed.Hairdo
Are you sure that "the compiler will violate strict IEEE floating-point behavior by fusing"? I am not so sure any more. I think maybe IEEE supports multiple "modes", one of which allows single-rounding contractions. GCC uses x87 for 32-bit code by default and I don't think that violates strict IEEE behaviour. GCC by default uses x87 for 32-bit code, SSE for 64-bit code, and FMA for FMA hardware. I don't think any of these cases necessarily violate IEEE. Clang on the other hand uses SSE by default in all three cases, so it's more consistent.Ope
@Zboson I believe it's not allowed if you store the intermediate into a variable first. C/C++ allows intermediates to be done at higher precision. That implies it can go ahead and fuse _mm_add_ps(c, _mm_mul_ps(a, b)). But not if you do temp = _mm_mul_ps(a, b); temp = _mm_add_ps(temp, c); But we're getting into very pedantic parts of the standard so I can't say for sure. It also doesn't help when compilers like ICC (by default) will break from the standard specification.Hairdo
Well, that's why I am asking: GCC by default does FMA contraction (it uses -ffp-contract=fast by default) if you use -mfma, so I want to know if GCC is breaking the standard. From your answer I would assume it is. I know ICC uses a fast floating-point model by default.Ope
I am surprised nobody answered my question about this (maybe because I asked more than one question). Maybe I need to ask it again but be more clear.Ope
I think your answer is misleading since a compiler can use FMA by default without breaking IEEE rules https://mcmap.net/q/15000/-fused-multiply-add-and-default-rounding-modesOpe
@Zboson Feel free to edit it accordingly. Since this apparently seems to have changed over the years. At this point, you know more than me on the topic.Hairdo
Just to confirm, not available in Sandy Bridge?Cinerarium
@AlecTeal Correct. Sandy Bridge does not have FMA.Hairdo
@Hairdo just to confirm (I should have been clearer) Sandy Bridge lacks /any/ FMA instruction? IIRC that's "no FMA in:" SSSE3, SSE4.1, SSE4.2 and AVX and obviously the other SSEs. I can't believe it was nearly 8 years ago! I'm pretty sure Intel/AMD kept switching on FMA3/4 which came from SSE4 (which got dropped, but adendums 1 and 2 didn't) - Intel went with neither in SB? For my work I still support SB and I've been designing something with FMA (the concept, not an extension) in mind, so this is a shock to me!Cinerarium
@AlecTeal That's correct. Sandy Bridge has no FMA at all. There are some nontrivial technical reasons why Intel flip-flopped with the FMA3/4 stuff and why it took them so long to do any FMA for their x86 line.Hairdo
@Hairdo I had no idea their hatred of not having /total/ control over AMD (a thing they can trot out when someone with "anti-trust" on a folder walks by) was imbued so deep in the resulting hardware! I love that they call it IA-64 now. The technical/political thing has been discussed to death elsewhere; on a serious note (and speaking of IA-64), Intel has done FMA before, as did many of the other RISCs that got killed off by it (RIP ALPHA, WE LOVE YOU), which became HP's bastard step-child from Compaq; HP was strongly backing Itanium, and this was well post standardisation of FP.Cinerarium
In any event @Hairdo thanks for the news; no wonder I couldn't find it in the manual. I just told myself I'd missed it because Intel had their own name for it and the document was huge; the document for the thing that powers just shy of 90% of the stuff the thing works on. But that's a me problem :P Thanks againCinerarium
Maybe a good idea to mention scalar fma() from math.h, as a building-block that can be auto-vectorized. It seems to work around some missed-optimization bugs like in Compiler is not producing FMA instructions for simple loop compiled for AVX512 CPU where GCC can vectorize for -march=cascade-lake but doesn't use FMA.Sundry
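
As a sketch of that building-block approach (the function and array names are just placeholders; with e.g. -O3 -march=haswell the compiler can turn the fmaf() calls into vectorized vfmadd instructions):

#include <math.h>

// y[i] = a*x[i] + y[i], expressed with the scalar fused multiply-add from
// math.h so the compiler is free to keep the fusion when it vectorizes.
void saxpy_fma(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = fmaf(a, x[i], y[i]);
}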

I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

float mul_add(float a, float b, float c) {
    return a*b + c;
}

__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

With the right compiler options (see below) every compiler will generate a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, only MSVC fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

The following compiler options are sufficient to generate vfmadd instructions (except for mul_addv with MSVC).

GCC:   -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC:   -O1 -march=core-avx2
MSVC:  /O1 /arch:AVX2 /fp:fast

GCC 4.9 does not contract mul_addv to a single fma instruction, but GCC 5.1 and later do. I don't know when the other compilers started doing this.

Ope answered 25/12, 2015 at 9:40 Comment(5)
See also #pragma STDC FP_CONTRACT ON. Stephen Canon points out that it allows contraction only within a single statement, not across statements. (lists.llvm.org/pipermail/cfe-dev/2015-September/045110.html). Also note that gcc enables contraction only with -std=gnu*, not with -std=c11 or whatever. (And then it enables contraction across statements, beyond what IEEE + ISO C strictly allow). Another test function that uses separate variables might be worth trying.Sundry
@PeterCordes, see this https://mcmap.net/q/15000/-fused-multiply-add-and-default-rounding-modes/2542702 and Stephen Canon's answer. I think what GCC is doing is okay according to Stephen's answer (assuming that GCC does not ignore STDC FP_CONTRACT, which unfortunately it does, last time I checked).Ope
Your question there only asks about return a*b + c;, not about float mul = a*b; return mul + c;. Read Stephen's mailing-list post carefully: he mentions that clang's STDC FP_CONTRACT ON only enables contraction within an expression, unlike clang's -ffp-contract=fast which would enable it for my second example in this comment, too. That's why clang has separate on vs. fast settings for the command-line option. See my recent edits to Mysticial's answer on this question. It's messier than I thought at first :(Sundry
@PeterCordes, one of my points is that GCC ignores #pragma STDC FP_CONTRACT. At least it did last time I checked. I should check this again (for e.g. -std=gnu99 and -std=c99 or whatever).Ope
I think that's still true. And its actual behaviour goes beyond what #pragma STDC FP_CONTRACT ON allows, so it's not quite like defaulting that to ON and failing to provide a way to turn it off. I think from what I've read that IEEE + C doesn't specify a #pragma STDC FP_CONTRACT FAST, even though that is a useful setting.Sundry
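
A small sketch of the distinction being discussed (hypothetical test functions): with #pragma STDC FP_CONTRACT ON, or clang's -ffp-contract=on, only the single-expression version may legally be contracted into an FMA; fusing the two-statement version requires -ffp-contract=fast.

float fuse_one_expr(float a, float b, float c) {
    return a*b + c;        // contraction allowed within a single expression
}

float fuse_two_stmts(float a, float b, float c) {
    float mul = a * b;     // the intermediate is rounded when stored to the variable
    return mul + c;        // so fusing across the statements needs -ffp-contract=fast
}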
