_mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps?

I have an image-processing algorithm that computes the sum of a[i]*b[i] + c[i]*d[i] over arrays, using AVX. The pseudo code is as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    __m256 abm=_mm256_mul_ps(am, bm);
    __m256 cdm=_mm256_mul_ps(cm, dm);
    __m256 abcdm=_mm256_add_ps(abm, cdm);
    sum=_mm256_add_ps(sum, abcdm);
}
double time1=(cv::getTickCount()-start)/cv::getTickFrequency();

I change _mm256_mul_ps and _mm256_add_ps on the above to _mm256_fmadd_ps as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    sum=_mm256_fmadd_ps(am, bm, sum);
    sum=_mm256_fmadd_ps(cm, dm, sum);
}
double time2=(cv::getTickCount()-start)/cv::getTickFrequency();

But the second version is slower than the first! time1 for the mul+add code is 50 ms, while time2 for the FMA code is 90 ms. Is _mm256_fmadd_ps really slower than _mm256_mul_ps + _mm256_add_ps?

I am using Ubuntu 16.04 and GCC 7.5.0, with compiler flags -fopenmp -march=native -O3.

Engage asked 18/2, 2021 at 13:10
How are you benchmarking this? You aren't including all those calls to new in your timing, are you? – And
@AndrewHenle Yes: the calls to new are not included in my timing. I have changed the code above. – Engage
This is all about latency vs. throughput. In the first example you have only one _mm256_add_ps that depends on the result from the previous loop iteration; in the second you have two _mm256_fmadd_ps that depend on each other and on the previous loop iteration. (There is probably a duplicate for this...) – Bev
@chtz: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) is sort of a canonical for multiple accumulators, but it's pretty involved and not an exact dup for this (where the problem is making the dep chain longer with two FMAs into sum). I decided it might be worth posting a summary answer instead of just the link. – Charlatan
There is also this answer that tackles this problem: #65818732 – Boothman
@AndreySemashev: Oh yeah, that's a decent beginner example of a simple dot product with multiple accumulators; added a link to my answer. I think we were lacking a canonical duplicate for a dot product of an array (not just a single vector). I retitled it; the main downside is that the code in the question is almost too naive (the braindead loads), but it's still pretty good. – Charlatan

Your reduction loops both bottleneck on latency, not throughput, because you're only using one FP vector accumulator. The FMA one is slower because you made the critical path longer (a chain of 2 instructions per loop iteration instead of just 1).

In the add case, the loop-carried dependency chain for sum is only sum=_mm256_add_ps(sum, abcdm);. The other instructions are independent across iterations, and can have that abcdm input ready to go before the previous vaddps has produced this iteration's sum input.

In the fma case, the loop-carried dep chain goes through two _mm256_fmadd_ps operations, both into sum, so yes, you'd expect it to be about twice as slow.

Unroll with more accumulators to hide FP latency (like normal for a dot product). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for much more detail about that and how OoO exec works.

Also see Improving performance of floating-point dot-product of an array with SIMD for a much simpler beginner-friendly example of 2 accumulators.

(Adding up those separate __m256 sum0, sum1, sum2, etc. variables should be done after the loop. You can also use an array, __m256 sum[4], to save typing. You can even use an inner loop over that array; most compilers will fully unroll small fixed-count loops, so you get the desired unrolled asm with each __m256 in a separate YMM register.)
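As a concrete illustration, here is a minimal sketch of that unrolled loop; the factor-of-4 unroll, the names, and the assumption that N is a multiple of 32 are mine, not from the question:

__m256 sum[4];
for (int j = 0; j < 4; j++)
    sum[j] = _mm256_setzero_ps();

for (int i = 0; i < N; i += 32)   // assumes N is a multiple of 32
{
    for (int j = 0; j < 4; j++)   // small fixed trip count: compilers fully unroll this
    {
        __m256 am = _mm256_loadu_ps(a + i + 8*j);
        __m256 bm = _mm256_loadu_ps(b + i + 8*j);
        __m256 cm = _mm256_loadu_ps(c + i + 8*j);
        __m256 dm = _mm256_loadu_ps(d + i + 8*j);
        sum[j] = _mm256_fmadd_ps(am, bm, sum[j]);   // 4 independent dep chains,
        sum[j] = _mm256_fmadd_ps(cm, dm, sum[j]);   // one per accumulator
    }
}

// Combine the accumulators only after the loop.
__m256 total = _mm256_add_ps(_mm256_add_ps(sum[0], sum[1]),
                             _mm256_add_ps(sum[2], sum[3]));

With four chains in flight, out-of-order exec can overlap the FMA latencies, and the 2-loads-per-clock limit becomes the bottleneck instead.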

Or let clang auto-vectorize this; it will normally do that unrolling with multiple accumulators for you.
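For example, given a plain scalar reduction like the sketch below, clang at -O3 -ffast-math -march=native will typically vectorize it and do the multiple-accumulator unrolling on its own (-ffast-math, or at least reassociation, is needed to let the compiler reorder the FP sum):

float dot2(const float *a, const float *b, const float *c, const float *d, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i]*b[i] + c[i]*d[i];
    return sum;
}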

Or, if you for some reason didn't want to unroll, you could use FMA while keeping the loop-carried latency low with sum += fma(a, b, c*d); (one mul, one FMA, one add). That is of course assuming your compiler didn't "contract" your mul and add into an FMA for you if you compiled with -ffast-math; GCC will do that aggressively across statements by default, clang won't.
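Using the loads from the question's loop, that body might look like this sketch, where only the final _mm256_add_ps is part of the loop-carried dependency chain:

__m256 cdm   = _mm256_mul_ps(cm, dm);          // independent work each iteration
__m256 abcdm = _mm256_fmadd_ps(am, bm, cdm);   // a*b + c*d, still independent
sum = _mm256_add_ps(sum, abcdm);               // only this add is in the sum dep chain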

Once you do this, your throughput will bottleneck on 2 loads per clock (that's the best case, and requires aligned arrays to avoid cache-line splits, which new won't give you), so using FMA barely helps except to reduce the front-end bottleneck, compared to a multiple-accumulator mul/add version that needs to run 1 FP op per load to keep up. Either way, using multiple accumulators will let you go faster than either original loop: one iteration (4 loads) per 2 cycles, instead of 1 per 3 cycles against the vaddps latency bottleneck.


On Skylake and later, FMA/add/mul all have the same latency: 4 cycles. On Haswell/Broadwell, vaddps latency is 3 cycles (one dedicated FP add unit) while FMA latency is 5.

Zen2 has 3 cycle vaddps, 5 cycle vfma....ps (https://uops.info/). (2/clock throughput for both, and on different execution ports, so you could in theory run 2 FMAs and 2 vaddps per clock on Zen2.)

With your longer-latency FMA loop being less than twice as slow, I'm guessing you might be on a Skylake-derived CPU. Perhaps the mul/add version was bottlenecking a bit on the front end or on resource conflicts, and not quite achieving the expected latency-limited speed of 1 iteration per 3 clocks.

In general, see https://uops.info/ for latency and uops / port breakdowns. (also https://agner.org/optimize/).

Charlatan answered 18/2, 2021 at 13:40
Some CPUs (maybe not x86?) have a different latency with respect to the different arguments of FMA: they start the multiplication first, and only need the accumulator a cycle or two later. Anyway, FMA exists foremost for the extra precision, which makes it a more complicated operation than mul+add, so it wouldn't be shocking for it to be more expensive, even if in practice it seldom is, except for the latency that is at the heart of this question. – Harmonie
@MarcGlisse: Indeed, Aarch64 what is late-forwarding? is an example of that. But no x86 CPUs work that way. The way they do OoO scheduling, all inputs have to be ready before the uop is sent to the execution port. uops.info did actually test this, e.g. for vfmadd213ps on Skylake the latency is 4 cycles from each of the 3 inputs to the output. (The main table would show [5:4] or similar if there were multiple latencies, but it doesn't for any of Zen 1/2, HSW/SKL/ICL.) – Charlatan
They did measure 5:4 latency for Cascade Lake: uops.info/html-instr/VFMADD231PS_XMM_XMM_XMM.html – Hughmanick
Where does this claim come from: "The way they do OoO scheduling, all inputs have to be ready before sending the uop to the execution port"? Elsewhere there is evidence that OoO scheduling is speculative, reissuing uops when an input turns out not to be ready. – Hughmanick
@amonakov: Fair point; "expected to be ready that cycle" would be more precise. For forwarding from other ALU instructions with known latency, the scheduler knows the cycle in which the result can be put on the bypass-forwarding network. (And in later cycles it can be read from the physical register.) But yes, forwarding from memory optimistically hopes for an L1d hit and may have to replay uops dependent on the load result, on Intel at least. AFAIK it's still true that a single uop can't start (successfully dispatch) unless all its inputs are ready, which is the point I intended. – Charlatan
@amonakov: Thanks for the Cascade Lake result; those XMM FMAs should be running on p0/p1, so any possible weirdness from the extra 512-bit FMA unit on port 5 shouldn't come into it. The dep-breaking vmovupd xmm1,xmm15 seems to be copying from a register that was only ever initialized by vzeroupper, not written by an earlier math op. That makes me wonder whether, as in Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says (extra bypass latency "infects" a register indefinitely), tests run with different versions of the nanobench software could be affected by this. – Charlatan
