Optimize for fast multiplication but slow addition: FMA and doubledouble

Asked 1/6, 2015 at 12:25 Answered 3/7, 2018 at 11:23

assembly x86 floating-point fma double-double-arithmetic

When I first got a Haswell processor I tried implementing FMA to determine the Mandelbrot set. The main algorithm is this:

intn = 0;
for(int32_t i=0; i<maxiter; i++) {
    floatn x2 = square(x), y2 = square(y); //square(x) = x*x
    floatn r2 = x2 + y2;
    booln mask = r2<cut; //booln is in the float domain non integer domain
    if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
    n -= mask
    floatn t = x*y; mul2(t); //mul2(t): t*=2
    x = x2 - y2 + cx;
    y = t + cy;
}

This determines if n pixels are in the Mandelbrot set. So for double floating point it runs over 4 pixels (floatn = __m256d, intn = __m256i). This requires 4 SIMD floating point multiplication and four SIMD floating point additions.

Then I modified this to work with FMA like this

intn n = 0; 
for(int32_t i=0; i<maxiter; i++) {
    floatn r2 = mul_add(x,x,y*y);
    booln mask = r2<cut;
    if(!horizontal_or(mask)) break;
    add_mask(n,mask);
    floatn t = x*y;
    x = mul_sub(x,x, mul_sub(y,y,cx));
    y = mul_add(2.0f,t,cy);
}

where mul_add calls _mm256_fmad_pd and mul_sub calls _mm256_fmsub_pd. This method uses 4 FMA SIMD operations, and two SIMD multiplications which is two less arithmetic operations then without FMA. Additionally, FMA and multiplication can use two ports and addition only one.

To make my tests less biased I zoomed into a region which is entirely in the Mandelbrot set so all the values are maxiter. In this case the method using FMA is about 27% faster. That's certainly an improvement but going from SSE to AVX doubled my performance so I was hoping for maybe another factor of two with FMA.

But then I found this answer in regards to FMA where it says

The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.

and later gives an example of double*double to double-double multiplication

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */

From this, I concluded that I was implementing FMA non-optimally and so I decided to implement SIMD double-double. I implemented double-double based on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float so I modified it for double-double. Additionally, instead of packing one double-double value in a SIMD registers I pack 4 double-double values into one AVX high register and one AVX low register.

For the Mandelbrot set what I really need is double-double multiplication and addition. In that paper these are the df64_add and df64_mult functions. The image below shows the assembly for my df64_mult function for software FMA (left) and hardware FMA (right). This clearly shows that hardware FMA is a big improvement for double-double multiplication.

fma software vs hardware

So how does hardware FMA perform in the double-double Mandelbrot set calculation? The answer is that's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs 4 double-double additions, and four double-double multiplications (x*x, y*y, x*y, and 2*(x*y)). However, the 2*(x*y) multiplication is trivial for double-double so this multiplication can be ignored in the cost. Therefore, the reason I think the improvement using hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).

It used to be that multiplication was slower than addition (and programers used several tricks to avoid multiplication) but with Haswell it seems that it's the other way around. Not only due to FMA but also because multiplication can use two ports but addition only one.

So my questions (finally) are:

How does one optimize when addition is slow compared to multiplication?
Is there an algebraic way to change my algorithm to use more multiplications and less additions? I know there are method to do the reverse, e.g. (x+y)*(x+y) - (x*x+y*y) = 2*x*y which use two more additions for one less multiplication.
Is there a way to simply the df64_add function (e.g. using FMA)?

In case anyone is wondering the double-double method is about ten times slower than double. That's not so bad I think as if there was a hardware quad-precision type it would likely be at least twice as slow as double so my software method is about five times slower than what I would expect for hardware if it existed.

df64_add assembly

vmovapd 8(%rsp), %ymm0
movq    %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd  %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0
vaddpd  %ymm1, %ymm0, %ymm2
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6
vsubpd  %ymm6, %ymm5, %ymm5
vsubpd  %ymm6, %ymm1, %ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3
vaddpd  %ymm1, %ymm4, %ymm2
vaddpd  %ymm5, %ymm3, %ymm3
vsubpd  %ymm4, %ymm2, %ymm4
vsubpd  %ymm4, %ymm1, %ymm1
vaddpd  %ymm3, %ymm1, %ymm0
vaddpd  %ymm0, %ymm2, %ymm1
vsubpd  %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd  %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret

Grin answered 1/6, 2015 at 12:25 Comment(17)

As Agner Fog points out in some of his optimization resources (agner.org/optimize but I don't remember the exact file or page), sometimes FMA makes things slower. I think I remember that the example was x*x + y*y, where, for some Intel processor model, the latency if implemented as mul-fma makes the timing worse than mul-mul-add, which contains more parallelism. – Colettecoleus 1/6, 2015 at 12:47

@PascalCuoq, so far I have been using IACA and then looking at the timing. For double it's been mostly a guessing game since there are so many permutations with FMA for the Mandelbrot set. With double-double it's pretty clear where to use it. But the main problem is addition with double-double. It's so slow compared to multiplication. What can be done? – Grin 1/6, 2015 at 12:54

A correctly programmed double-double addition comprises pretty exactly 20 basic floating-point operations. You may encounter versions that use fewer instructions, however these lose accuracy precisely when you need and want it, which is the effective subtraction of operands almost identical in magnitude. Speed-up of double-double addition requires HW improvements. Max-magnitude and min-magnitude operations help a little bit, and three-input addition with single rounding helps a lot. Some processor architectures offer the former, I don't know any that offer the latter, – Traweek 1/6, 2015 at 17:41

If you're just after a bit more accuracy than double, you could compute x2hi + x2lo + y2hi + y2lo + cx by doing a Knuth 2-sum of x2hi and y2hi (6 flops), another Knuth 2-sum with cx (another 6 flops), then just sum the four remainder terms. You could compute 2*x*y+cy using the FMA-based double*double -> double double product, doing a Knuth 2-sum with cy, and adding together the remainder terms. (I haven't tested this or done a serious analysis, so this suggestion may be garbage.) – Regale 1/6, 2015 at 23:20

@njuffa, my double-double add function df64_add uses exactly 20 add/sub instructions (though each one operates on four doubles so it's really 80 floating point operations in 20 instructions). That's in agreement of your statement. If you write up an answer with more information particularly on the Max/min-magnitude operations and/or can you provide a paper discussing this I would greatly appreciate it. – Grin 2/6, 2015 at 9:23

@tmyklebu, thanks, I'll give your idea a try and get back to you. – Grin 2/6, 2015 at 10:34

@Zboson: Min and max by magnitude lets you use "Fast2Sum"; if x is larger in magnitude than y, you can do hi = x+y; lo = (x-hi)+y; the difference and the sum in the computation of lo are exact. "The handbook of floating-point arithmetic" has all these tricks and many more. I found Ogita, Rump, and Oishi's "Accurate sum and dot product" pretty educational, but it's by no means the original reference for Fast2Sum. – Regale 2/6, 2015 at 14:26

@tmyklebu, thank you for the clarification of min and max magnitude. I understand now. I'm already using Fast2Sum in my general double-double addition. But understanding the details may help my use it to simplify the Mandelbrot set calculation (which I'm mostly doing just for something to test double-double and parallel computing on). That handbook is rather expensive but it looks like maybe I should get it. – Grin 2/6, 2015 at 14:49

@tmyklebu, thank you for the paper "Accurate sum and dot product". I just printed it and will read it this evening. – Grin 2/6, 2015 at 14:56

IEEE-754 (2008) defined MinMag and MaxMag operations. Itanium had a hardware implementation, instructions famin and famax. How to use these to shave off two operations of double-double addition is shown in N. Louvet's 2007 Ph.D. thesis. The three-input addition cuts operation count in half, as I recall. I can't find the reference right now (giant mess in my collected papers), I guess it was research by Sylvie Boldo, published within the past five years. I did a prototype implementation at the time, but can't find that either. – Traweek 2/6, 2015 at 14:56

@njuffa, I found the relevant section in N. Louvet's thesis (TwoSumMinMax(a,b)). He also has another function TwoSumCond(a,b) I can consider. I think the minmag and maxmag functions are the ones defined here. I'll see if these can help me with AVX2. – Grin 2/6, 2015 at 15:33

@Zboson As I recall it is very important that famax() and famin() are used in such a way that when |a|==|b| the further computation continues using both a and b which may be of opposite sign. So for TwoSum() a.k.a. add12() you would want something like this: s=a+b; x=famax(a,b); y=famin(b,a); e=(x-s)+y; return (e, s); Note the argument swap between the calls to famax() and famin(). – Traweek 3/6, 2015 at 15:31

@Traweek I found a faster solution for double-double addition from the paper hal.archives-ouvertes.fr/docs/00/06/33/56/PDF/float-float.pdf. I posted an answer with the solution. If you have time, let me know what you think. – Grin 4/6, 2015 at 12:21

I am aware of the paper (see stackoverflow.com/questions/6769881/…) but have not looked at it in a long time. Your answer looks plausible, but as always in such cases I would suggest detailed testing. I once had a sign error in a double-double addition code and did not notice for months, because the chance of hitting the bug randomly was miniscule. You would want to focus testing on cases of subtractive cancellation, i.e. with arguments close in magnitude but of opposite sign. – Traweek 4/6, 2015 at 12:38

@njuffa, good point. I need to make my test unit testing more robust. What I do now is to run my fractal generator and zoom to a region where the single float results are bogus. Then I compare doublefloat with double and using my new definition of doubledouble addition actually had less errors (.01% of the pixels disagreed with the new method instead of .02% with the old - single float disagrees almost 100%). I'll try and make a more advanced unit test. I don't have a good method (yet) to test doubledouble because I don't have anything more accurate than double besides doubledouble. – Grin 4/6, 2015 at 12:46

I used a multi-precision library to test my double-double implementation. Specifically, I used Brent's MP library. Since this is Fortran code from 1978, I would not recommend it for new development, but I have used it for more than twenty years and am simply carrying the code forward. – Traweek 4/6, 2015 at 12:52

@njuffa, thanks, I'll get a MP library and compare with that. That's a good idea. – Grin 4/6, 2015 at 12:57

To answer my third question I found a faster solution for double-double addition. I found an alternative definition in the paper Implementation of float-float operators on graphics hardware.

Theorem 5 (Add22 theorem) Let be ah+al and bh+bl the float-float arguments of the following
algorithm:
Add22 (ah ,al ,bh ,bl)
1 r = ah ⊕ bh
2 if | ah | ≥ | bh | then
3     s = ((( ah ⊖ r ) ⊕ bh ) ⊕ b l ) ⊕ a l
4 e l s e
5     s = ((( bh ⊖ r ) ⊕ ah ) ⊕ a l ) ⊕ b l
6 ( rh , r l ) = add12 ( r , s )
7 return (rh , r l)

Here is how I implemented this (pseudo-code):

static inline doubledoublen add22(doubledoublen const &a, doubledouble const &b) {
    doublen aa,ab,ah,bh,al,bl;
    booln mask;
    aa = abs(a.hi);                //_mm256_and_pd
    ab = abs(b.hi); 
    mask = aa >= ab;               //_mm256_cmple_pd
    // z = select(cut,x,y) is a SIMD version of z = cut ? x : y;
    ah = select(mask,a.hi,b.hi);   //_mm256_blendv_pd
    bh = select(mask,b.hi,a.hi);
    al = select(mask,a.lo,b.lo);
    bl = select(mask,b.lo,a.lo);

    doublen r, s;
    r = ah + bh;
    s = (((ah - r) + bh) + bl ) + al;
    return two_sum(r,s);
}

This definition of Add22 uses 11 additions instead of 20 but it requires some additional code to determine if |ah| >= |bh|. Here is a discussion on how to implement SIMD minmag and maxmag functions. Fortunately, most of the additional code does not use port 1. Now only 12 instructions go to port 1 instead of 20.

Here is a throughput analysis form IACA for the new Add22

Throughput Analysis Report
--------------------------
Block Throughput: 12.05 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 12.0 | 2.5    2.5  | 2.5    2.5  | 2.0  | 10.0 | 0.0  | 2.0  |
---------------------------------------------------------------------------------------


| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rip]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx]
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm4, ymmword ptr [rsi]
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm2, ymm4, ymm3
|   1    |           |     |           |           |     | 1.0 |     |     |    | vandpd ymm3, ymm0, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vcmppd ymm2, ymm3, ymm2, 0x2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm1, ymm0, ymm4, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm4, ymm4, ymm0, ymm2
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rdx+0x20]
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm5, ymm0, ymm3, ymm2
|   2    |           |     |           |           |     | 2.0 |     |     |    | vblendvpd ymm0, ymm3, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm1, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm3, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm3
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm3, ymm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

and here is the throughput analysis from the old

Throughput Analysis Report
--------------------------
Block Throughput: 20.00 Cycles       Throughput Bottleneck: Port1

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 0.0    0.0  | 20.0 | 2.0    2.0  | 2.0    2.0  | 2.0  | 0.0  | 0.0  | 2.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm0, ymmword ptr [rsi]
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm1, ymmword ptr [rdx]
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovapd ymm3, ymmword ptr [rsi+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm4, ymm0, ymm1
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovapd ymm5, ymmword ptr [rdx+0x20]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm4, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm0, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm5, ymm5, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm6, ymm1, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm3, ymm3, ymm6
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm2, ymm4, ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm3, ymm3, ymm5
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm4, ymm2, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm1, ymm1, ymm4
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm0, ymm1, ymm3
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vaddpd ymm1, ymm2, ymm0
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm2, ymm1, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi], ymm1
|   1    |           | 1.0 |           |           |     |     |     |     | CP | vsubpd ymm0, ymm0, ymm2
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovapd ymmword ptr [rdi+0x20], ymm0

A better solution would be if there were three operand single rounding mode instructions besides FMA. It seems to me there should be single rounding mode instructions for

a + b + c
a * b + c //FMA - this is the only one in x86 so far
a * b * c

Grin answered 4/6, 2015 at 12:20 Comment(3)

What does a*b*c get you? Sum-of-three makes double double addition trivial and FMA makes double double multiplication trivial, but a*b*c doesn't seem to have an obvious purpose like that. – Regale 4/6, 2015 at 14:38

@tmyklebu, I don't know. Maybe you're right. For double-double I agree with you that a*b*c does not seem to have a point. I am wondering what other applications these three operand one rounding mode instructions are useful for? I have only used it for double-double. – Grin 4/6, 2015 at 15:17

@tmyklebu, actually come to think of it the only ones that are necessary are a + b - c and a*b - c. – Grin 12/6, 2015 at 14:18

To speed up the algorithm, I use a simplified version based on 2 fma, 1 mul and 2 add. I process 8 iterations that way. Then compute the escape radius and rollback the last 8 iterations if necessary.

The following critical loop X = X^2 + C written with x86 intrinsics is nicely unrolled by the compiler, and you will spot after unrolling that the 2 FMA operations not badly dependent of each other.

//  IACA_START;
for (j = 0; j < 8; j++) {
    Xrm = _mm256_mul_ps(Xre, Xim);
    Xtt = _mm256_fmsub_ps(Xim, Xim, Cre);
    Xrm = _mm256_add_ps(Xrm, Xrm);
    Xim = _mm256_add_ps(Cim, Xrm);
    Xre = _mm256_fmsub_ps(Xre, Xre, Xtt);
}       // for
//  IACA_END;

And then I compute the escape radius (|X| < threshold), which costs an other fma and another multiplication, only every 8 iterations.

cmp = _mm256_mul_ps(Xre, Xre);
cmp = _mm256_fmadd_ps(Xim, Xim, cmp);
cmp = _mm256_cmp_ps(cmp, vec_threshold, _CMP_LE_OS);
if (_mm256_testc_si256((__m256i) cmp, vec_one)) {
    i += 8;
    continue;
}

You mention "addition is slow", this is not exactly true, but you are right, multiplication throughput get higher and higher over time on recent architectures.

Multiplication latencies and dependencies are the key. FMA has a throughput of 1 cycle and a latency of 5 cycles. the execution of independent FMA instructions can overlap.

Additions based on the result of a multiplication get the full latency hit.

So you have to break these immediate dependencies by doing "code stitching" and compute 2 points in the same loop, and just interleave the code before checking with IACA what will be going on. The following code has 2 sets of variables (suffixed by 0 and 1 for X0=X0^2+C0, X1=X1^2+C1) and starts to fill the FMA holes

for (j = 0; j < 8; j++) {
    Xrm0 = _mm256_mul_ps(Xre0, Xim0);
    Xrm1 = _mm256_mul_ps(Xre1, Xim1);
    Xtt0 = _mm256_fmsub_ps(Xim0, Xim0, Cre);
    Xtt1 = _mm256_fmsub_ps(Xim1, Xim1, Cre);
    Xrm0 = _mm256_add_ps(Xrm0, Xrm0);
    Xrm1 = _mm256_add_ps(Xrm1, Xrm1);
    Xim0 = _mm256_add_ps(Cim0, Xrm0);
    Xim1 = _mm256_add_ps(Cim1, Xrm1);
    Xre0 = _mm256_fmsub_ps(Xre0, Xre0, Xtt0);
    Xre1 = _mm256_fmsub_ps(Xre1, Xre1, Xtt1);
}       // for

To summarize,

you can halve the number of instructions in your critical loop
you can add more independent instructions and get advantage of high throughput vs. low latency of the multiplications and fused multiply and add.

Holism answered 26/6, 2018 at 22:15 Comment(5)

What you're calling "code stitching" looks like software pipelining. But yeah, keeping more data in flight to hide FP latency exposes more instruction-level parallelism. Unrolling over more points is a good way to turn the data-parallelism in the original problem into SIMD + ILP. – Otherness 26/6, 2018 at 22:39

Your FMA latency / throughput numbers appear to be from Ryzen or Bulldozer-family. But what do you mean "Additions based on the result of a multiplication get the full latency hit"? Any dependent instruction gets the full latency hit; FMA->FMA has the same latency as MUL->ADD. (Even on Bulldozer, where there's special fast forwarding within the FMA unit, making it 5c instead of 6c. agner.org/optimize). Anyway, all Intel CPUs with FMA have 0.5c throughput for FMA. But 1c throughput for ADD, on CPUs before Skylake (which does everything on the FMA unit, lowering it to 4c lat). – Otherness 26/6, 2018 at 22:44

0.5c is based on units with 2 FMA engines. It is quite hard to "fill the pipeline" for every single variant of haswell, skylake, coffeelake, cubilake ..... I use the simple threshold "ideal number of fma" = "latency" / "throughput" . I know the code I suggest is not optimal, since hyperthreads can still fill the holes. Pipeline the calculations for 3 x 8 points would have been better, If the compiler can avoid spilling live registers to memory (use avx512vl variants of skylake ...) . – Holism 27/6, 2018 at 7:23

You're using 256-bit FMA. All Intel CPUs with FMA support have two 256-bit FMA units. You're thinking of Skylake-AVX512 where some only have one 512-bit FMA unit. (SKX shuts down the vector ALU on port 1 when 512-bit uops are in flight. The two 256-bit FMA units on port 0 / port 1 work as a single 512-bit FMA unit. Some models have an extra 512-bit FMA unit on port 5 which activates only for 512-bit uops.) – Otherness 27/6, 2018 at 17:16

@Holism The point of my question was about using double-double to increase the precision. I already know how to optimize for double. Your method will lose precision after zooming in past 10^-7. But I have found a better method for the Mandelbrot set using perturbation theory which moves the limiting factor from the mantissa to the exponent. So now I can go to 10^-308 which is much better than double-double which does not change the precision of the exponent. Anyway, your answer does not answer my question at all as far as I can tell. – Grin 28/6, 2018 at 7:40

You mention the following code:

vsubpd  %ymm0, %ymm4, %ymm2
vsubpd  %ymm2, %ymm1, %ymm1  <-- immediate dependency ymm2
vsubpd  %ymm2, %ymm4, %ymm2
vsubpd  %ymm2, %ymm0, %ymm0  <-- immediate dependency ymm2
vaddpd  %ymm1, %ymm0, %ymm2  <-- immediate dependency ymm0
vaddpd  %ymm5, %ymm3, %ymm1
vsubpd  %ymm3, %ymm1, %ymm6  <-- immediate dependency ymm1
vsubpd  %ymm6, %ymm5, %ymm5  <-- immediate dependency ymm6
vsubpd  %ymm6, %ymm1, %ymm6  <-- dependency ymm1, ymm6
vaddpd  %ymm1, %ymm2, %ymm1
vsubpd  %ymm6, %ymm3, %ymm3  <-- dependency ymm6
vaddpd  %ymm1, %ymm4, %ymm2 
vaddpd  %ymm5, %ymm3, %ymm3  <-- dependency ymm3
vsubpd  %ymm4, %ymm2, %ymm4 
vsubpd  %ymm4, %ymm1, %ymm1  <-- immediate dependency ymm4
vaddpd  %ymm3, %ymm1, %ymm0  <-- immediate dependency ymm1, ymm3
vaddpd  %ymm0, %ymm2, %ymm1  <-- immediate dependency ymm0
vsubpd  %ymm2, %ymm1, %ymm2  <-- immediate dependency ymm1

if you check carefully, these are mostly dependent operations, and a basic rule about latency/throughput efficiency is not met . Most instructions depend on the outcome of the previous one, or 2 instructions before. This sequence contains a critical path of 30 cycles (about 9 or 10 instructions about "3 cycles latency"/"1 cycle throughput").

Your IACA reports "CP" => instruction in the critical path, and the evaluated cost is 20 cycles throughput. You should get the latency report because it is the one that matter if you are interested in execution speed.

To remove the cost of this critical path, you have to interleave about 20 more similar instructions if the compiler cannot do it (e.g. because your double-double code is in a separate library compiled without -flto optimizations and vzeroupper everywhere at function entry and exit, vectorizer only works well with inlined code).

A possibility is to run 2 computations in parallel (see about code stitching in a previous post to improve the pipelining)

If I assume that your double-double code looks like this "standard" implementation

// (r,e) = x + y
#define two_sum(x, y, r, e) 
    do { double t; r = x + y; t = r - x; e = (x - (r - t)) + (y - t); } while (0)
#define two_difference(x, y, r, e) \
    do { double t; r = x - y; t = r - x; e = (x - (r - t)) - (y + t); } while (0)
.....

Then you have to consider the following code, where instructions are interleaved at quite a fine grain.

// (r1, e1) = x1 + y1, (r2, e2) x2 + y2
#define two_sum(x1, y1, x2, y2, r1, e1, r2, e2) 
    do { double t1, t2 \
    r1 = x1 + y1; r2 = x2 + y2; \
    t1 = r1 - x1; t2 = r2 - x2; \
    e1 = (x1 - (r1 - t1)) + (y1 - t1); e2 = (x2 - (r2 - t2)) + (y2 - t2);  \
} while (0)
....

Then this creates code like the following one (about same critical path in a latency report, and about 35 instructions). For details about run-time, Out-Of-Order execution should fly over that without stalling.

vsubsd  %xmm2, %xmm0, %xmm8
vsubsd  %xmm3, %xmm1, %xmm1
vaddsd  %xmm4, %xmm4, %xmm4
vaddsd  %xmm5, %xmm5, %xmm5
vsubsd  %xmm0, %xmm8, %xmm9
vsubsd  %xmm9, %xmm8, %xmm10
vaddsd  %xmm2, %xmm9, %xmm2
vsubsd  %xmm10, %xmm0, %xmm0
vsubsd  %xmm2, %xmm0, %xmm11
vaddsd  %xmm14, %xmm4, %xmm2
vaddsd  %xmm11, %xmm1, %xmm12
vsubsd  %xmm4, %xmm2, %xmm0
vaddsd  %xmm12, %xmm8, %xmm13
vsubsd  %xmm0, %xmm2, %xmm11
vsubsd  %xmm0, %xmm14, %xmm1
vaddsd  %xmm6, %xmm13, %xmm3
vsubsd  %xmm8, %xmm13, %xmm8
vsubsd  %xmm11, %xmm4, %xmm4
vsubsd  %xmm13, %xmm3, %xmm15
vsubsd  %xmm8, %xmm12, %xmm12
vaddsd  %xmm1, %xmm4, %xmm14
vsubsd  %xmm15, %xmm3, %xmm9
vsubsd  %xmm15, %xmm6, %xmm6
vaddsd  %xmm7, %xmm12, %xmm7
vsubsd  %xmm9, %xmm13, %xmm10
vaddsd  16(%rsp), %xmm5, %xmm9
vaddsd  %xmm6, %xmm10, %xmm15
vaddsd  %xmm14, %xmm9, %xmm10
vaddsd  %xmm15, %xmm7, %xmm13
vaddsd  %xmm10, %xmm2, %xmm15
vaddsd  %xmm13, %xmm3, %xmm6
vsubsd  %xmm2, %xmm15, %xmm2
vsubsd  %xmm3, %xmm6, %xmm3
vsubsd  %xmm2, %xmm10, %xmm11
vsubsd  %xmm3, %xmm13, %xmm0

Summary:

inline your double-double source code: the compiler and vectorizer cannot optimize across function calls because of ABI constraints, and across memory accesses by fear of aliasing.
stitch code to balance throughput and latency and maximize the CPU ports usage (and also maximize instructions per cycle), as long as the compiler does not spill too much registers to memory.

You can track the optimization impacts with perf utility (packages linux-tools-generic and linux-cloud-tools-generic) to get the number of instructions executed and the number of instructions per cycle.

Holism answered 3/7, 2018 at 11:23 Comment(2)

out-of-order execution will hide dependency chains as long as they aren't loop-carried, or partially hide them if there's any instruction-level parallelism. The IACA report would say "dependency" or "latency" (I forget which) if the latency of a loop-carried dep chain was the bottleneck, rather than front-end or an execution port. Instruction-scheduling on modern x86 is generally not very important. – Otherness 3/7, 2018 at 11:28

@PeterCordes You are right, but in the context of the Mandelbrot example, execution of the next loop depends on the result of the current loop. The solution I suggest is to start to interleave 2 independent dependency chains in such simple loops (and each dependency chain is already vectorized). Running the same code on 2 hyperthreads is another way to fill the latency holes. – Holism 3/7, 2018 at 19:17

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags