_mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps?

I have an image-processing algorithm that computes the sum of a[i]*b[i] + c[i]*d[i] over arrays, using AVX. The pseudo code is as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    __m256 abm=_mm256_mul_ps(am, bm);
    __m256 cdm=_mm256_mul_ps(cm, dm);
    __m256 abcdm=_mm256_add_ps(abm, cdm);
    sum=_mm256_add_ps(sum, abcdm);
}
double time1=(cv::getTickCount()-start)/cv::getTickFrequency();

I change _mm256_mul_ps and _mm256_add_ps on the above to _mm256_fmadd_ps as follows:

float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];

//assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
double start=cv::getTickCount();
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    sum=_mm256_fmadd_ps(am, bm, sum);
    sum=_mm256_fmadd_ps(cm, dm, sum);
}
double time2=(cv::getTickCount()-start)/cv::getTickFrequency();

But the second version is slower than the first! time1 for the mul+add code is 50 ms, while time2 for the FMA code is 90 ms. Is _mm256_fmadd_ps really slower than _mm256_mul_ps + _mm256_add_ps?

I am using Ubuntu 16.04 and GCC 7.5.0, with compiler flags -fopenmp -march=native -O3.

Engage asked 18/2, 2021 at 13:10
How are you benchmarking this? You aren't including all those calls to new in your timing, are you? – And
@AndrewHenle Yes: the calls to new are not included in my timing. I have changed the code above. – Engage
This is all about latency vs. throughput. In the first example you have only one _mm256_add_ps that depends on the result from the previous loop iteration; in the second you have two _mm256_fmadd_ps that depend on each other and on the previous loop iteration. (There is probably a duplicate for this...) – Bev
@chtz: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) is sort of a canonical for multiple accumulators, but it's pretty involved and not an exact dup for this (where the problem is making the dep chain longer with two FMAs into sum). I decided it might be worth posting a summary answer instead of just the link. – Charlatan
There is also this answer that tackles this problem: #65818732 – Boothman
@AndreySemashev: Oh yeah, that's a decent beginner example of a simple dot product with multiple accumulators; added a link to my answer. I think we were lacking a canonical duplicate for a dot product of an array (not just a single vector). I retitled it; the main downside is that the code in the question is almost too naive (the braindead loads), but it's still pretty good. – Charlatan

Your reduction loops both bottleneck on latency, not throughput, because you're only using one FP vector accumulator. The FMA one is slower because you made the critical path longer (a chain of 2 instructions per loop iteration instead of just 1).

In the add case, the loop-carried dependency chain for sum is only sum=_mm256_add_ps(sum, abcdm);. The other instructions are independent across iterations, and can have that abcdm input ready to go before the previous vaddps has produced this iteration's sum input.

In the fma case, the loop-carried dep chain goes through two _mm256_fmadd_ps operations, both into sum, so yes, you'd expect it to be about twice as slow.

Unroll with more accumulators to hide FP latency (like normal for a dot product). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for much more detail about that and how OoO exec works.

Also see Improving performance of floating-point dot-product of an array with SIMD for a much simpler beginner-friendly example of 2 accumulators.

(Adding up those separate __m256 sum0, sum1, sum2, etc. variables should be done after the loop. You can also use an array, __m256 sum[4], to save typing. You can even use an inner loop over that array; most compilers will fully unroll small fixed-count loops, so you get the desired unrolled asm with each __m256 in a separate YMM register.)
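As a concrete illustration, here is a minimal sketch of that unrolled loop; the factor-of-4 unroll, the names, and the assumption that N is a multiple of 32 are mine, not from the question:

__m256 sum[4];
for (int j = 0; j < 4; j++)
    sum[j] = _mm256_setzero_ps();

for (int i = 0; i < N; i += 32)   // assumes N is a multiple of 32
{
    for (int j = 0; j < 4; j++)   // small fixed trip count: compilers fully unroll this
    {
        __m256 am = _mm256_loadu_ps(a + i + 8*j);
        __m256 bm = _mm256_loadu_ps(b + i + 8*j);
        __m256 cm = _mm256_loadu_ps(c + i + 8*j);
        __m256 dm = _mm256_loadu_ps(d + i + 8*j);
        sum[j] = _mm256_fmadd_ps(am, bm, sum[j]);   // 4 independent dep chains,
        sum[j] = _mm256_fmadd_ps(cm, dm, sum[j]);   // one per accumulator
    }
}

// Combine the accumulators only after the loop.
__m256 total = _mm256_add_ps(_mm256_add_ps(sum[0], sum[1]),
                             _mm256_add_ps(sum[2], sum[3]));

With four chains in flight, out-of-order exec can overlap the FMA latencies, and the 2-loads-per-clock limit becomes the bottleneck instead.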

Or let clang auto-vectorize this; it will normally do that unrolling with multiple accumulators for you.
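For example, given a plain scalar reduction like the sketch below, clang at -O3 -ffast-math -march=native will typically vectorize it and do the multiple-accumulator unrolling on its own (-ffast-math, or at least reassociation, is needed to let the compiler reorder the FP sum):

float dot2(const float *a, const float *b, const float *c, const float *d, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i]*b[i] + c[i]*d[i];
    return sum;
}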

Or, if you for some reason didn't want to unroll, you could use FMA while keeping the loop-carried latency low with sum += fma(a, b, c*d); (one mul, one FMA, one add). That is of course assuming your compiler didn't "contract" your mul and add into an FMA for you if you compiled with -ffast-math; GCC will do that aggressively across statements by default, clang won't.
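Using the loads from the question's loop, that body might look like this sketch, where only the final _mm256_add_ps is part of the loop-carried dependency chain:

__m256 cdm   = _mm256_mul_ps(cm, dm);          // independent work each iteration
__m256 abcdm = _mm256_fmadd_ps(am, bm, cdm);   // a*b + c*d, still independent
sum = _mm256_add_ps(sum, abcdm);               // only this add is in the sum dep chain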

Once you do this, your throughput will bottleneck on 2 loads per clock (that's the best case, and requires aligned arrays to avoid cache-line splits, which new won't give you), so using FMA barely helps except to reduce the front-end bottleneck, compared to a multiple-accumulator mul/add version that needs to run 1 FP op per load to keep up. Either way, using multiple accumulators will let you go faster than either original loop: one iteration (4 loads) per 2 cycles, instead of 1 per 3 cycles against the vaddps latency bottleneck.


On Skylake and later, FMA/add/mul all have the same latency: 4 cycles. On Haswell/Broadwell, vaddps latency is 3 cycles (one dedicated FP add unit) while FMA latency is 5.

Zen2 has 3 cycle vaddps, 5 cycle vfma....ps (https://uops.info/). (2/clock throughput for both, and on different execution ports, so you could in theory run 2 FMAs and 2 vaddps per clock on Zen2.)

With your longer-latency FMA loop being less than twice as slow, I'm guessing you might be on a Skylake-derived CPU. Perhaps the mul/add version was bottlenecking a bit on the front end or on resource conflicts, and not quite achieving the expected latency-limited speed of 1 iteration per 3 clocks.

In general, see https://uops.info/ for latency and uops / port breakdowns. (also https://agner.org/optimize/).

Charlatan answered 18/2, 2021 at 13:40
Some CPUs (maybe not x86?) have a different latency with respect to the different arguments of FMA: they start the multiplication first, and only need the accumulator a cycle or two later. Anyway, FMA exists foremost for the extra precision, which makes it a more complicated operation than mul+add, so it wouldn't be shocking for it to be more expensive, even if in practice it seldom is, except for the latency that is at the heart of this question. – Harmonie
@MarcGlisse: Indeed, Aarch64 what is late-forwarding? is an example of that. But no x86 CPUs work that way. The way they do OoO scheduling, all inputs have to be ready before the uop is sent to the execution port. uops.info did actually test this, e.g. for vfmadd213ps on Skylake the latency is 4 cycles from each of the 3 inputs to the output. (The main table would show [5:4] or similar if there were multiple latencies, but it doesn't for any of Zen 1/2, HSW/SKL/ICL.) – Charlatan
They did measure 5:4 latency for Cascade Lake: uops.info/html-instr/VFMADD231PS_XMM_XMM_XMM.html – Hughmanick
Where does this claim come from: "The way they do OoO scheduling, all inputs have to be ready before sending the uop to the execution port"? Elsewhere there is evidence that OoO scheduling is speculative, reissuing uops when an input turns out not to be ready. – Hughmanick
@amonakov: Fair point; "expected to be ready that cycle" would be more precise. For forwarding from other ALU instructions with known latency, the scheduler knows the cycle in which the result can be put on the bypass-forwarding network. (And in later cycles it can be read from the physical register.) But yes, forwarding from memory optimistically hopes for an L1d hit and may have to replay uops dependent on the load result, on Intel at least. AFAIK it's still true that a single uop can't start (successfully dispatch) unless all its inputs are ready, which is the point I intended. – Charlatan
@amonakov: Thanks for the Cascade Lake result; those XMM FMAs should be running on p0/p1, so any possible weirdness from the extra 512-bit FMA unit on port 5 shouldn't come into it. The dep-breaking vmovupd xmm1,xmm15 seems to be copying from a register that was only ever initialized by vzeroupper, not written by an earlier math op. That makes me wonder whether, as in Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says (extra bypass latency "infects" a register indefinitely), tests run with different versions of the nanobench software could be affected by this. – Charlatan
