When I first got a Haswell processor I tried using FMA to compute the Mandelbrot set. The main algorithm is this:
intn n = 0;
for(int32_t i=0; i<maxiter; i++) {
floatn x2 = square(x), y2 = square(y); //square(x) = x*x
floatn r2 = x2 + y2;
booln mask = r2<cut; //booln is in the floating-point domain, not the integer domain
if(!horizontal_or(mask)) break; //_mm256_testz_pd(mask)
n -= mask; //mask is all ones (-1) where true, so this increments n for lanes still inside
floatn t = x*y; mul2(t); //mul2(t): t*=2
x = x2 - y2 + cx;
y = t + cy;
}
This determines whether n pixels are in the Mandelbrot set. For double-precision floating point it runs over 4 pixels at a time (floatn = __m256d, intn = __m256i). Each iteration requires four SIMD floating-point multiplications and four SIMD floating-point additions.
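For reference, a scalar single-pixel version of the same loop looks roughly like this (a sketch: it assumes z starts at c and that cut is the squared escape radius, e.g. 4.0, neither of which is shown above):

// Scalar, single-pixel sketch of the loop above.
int mandel_iters(double cx, double cy, int maxiter, double cut) {
    double x = cx, y = cy;              // assuming z0 = c
    int n = 0;
    for (int i = 0; i < maxiter; i++) {
        double x2 = x * x, y2 = y * y;  // square(x), square(y)
        if (x2 + y2 >= cut) break;      // the SIMD loop runs until every lane escapes
        n++;                            // the SIMD loop does this with n -= mask
        double t = 2.0 * x * y;         // mul2(t)
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}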
Then I modified this to work with FMA like this:
intn n = 0;
for(int32_t i=0; i<maxiter; i++) {
floatn r2 = mul_add(x,x,y*y);
booln mask = r2<cut;
if(!horizontal_or(mask)) break;
add_mask(n,mask);
floatn t = x*y;
x = mul_sub(x,x, mul_sub(y,y,cx));
y = mul_add(2.0f,t,cy);
}
where mul_add calls _mm256_fmadd_pd and mul_sub calls _mm256_fmsub_pd. This method uses four FMA SIMD operations and two SIMD multiplications, which is two fewer arithmetic operations than without FMA. Additionally, on Haswell FMA and multiplication can issue on two ports while addition can use only one.
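For clarity, mul_add and mul_sub are thin wrappers over the FMA intrinsics; on raw __m256d they look roughly like this (the real wrappers operate on the floatn type):

#include <immintrin.h>

// mul_add(a,b,c) = a*b + c and mul_sub(a,b,c) = a*b - c, each with a single rounding.
static inline __m256d mul_add(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmadd_pd(a, b, c);
}
static inline __m256d mul_sub(__m256d a, __m256d b, __m256d c) {
    return _mm256_fmsub_pd(a, b, c);
}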
To make my tests less biased I zoomed into a region which is entirely inside the Mandelbrot set, so every pixel reaches maxiter. In this case the FMA method is about 27% faster. That's certainly an improvement, but going from SSE to AVX doubled my performance, so I was hoping for maybe another factor of two from FMA.
But then I found this answer regarding FMA, which says:
The important aspect of the fused-multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction — It helps with performance because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.
and later gives an example of double*double to double-double multiplication:
high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */
From this I concluded that I was using FMA sub-optimally, so I decided to implement SIMD double-double. I based my implementation on the paper Extended-Precision Floating-Point Numbers for GPU Computation. The paper is for double-float, so I modified it for double-double. Additionally, instead of packing one double-double value in a SIMD register, I pack 4 double-double values into one AVX high register and one AVX low register.
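With that layout, the error-free product from the quoted answer looks roughly like this for one register of four doubles (a sketch with my own names, not the paper's):

#include <immintrin.h>

// Exact product per lane: hi + lo == a*b. With hardware FMA the rounding
// error of the product is recovered with a single fmsub.
static inline void two_prod_fma(__m256d a, __m256d b, __m256d *hi, __m256d *lo) {
    __m256d p = _mm256_mul_pd(a, b);
    *hi = p;
    *lo = _mm256_fmsub_pd(a, b, p);   // a*b - p with only one rounding
}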
For the Mandelbrot set, what I really need is double-double multiplication and addition. In that paper these are the df64_add and df64_mult functions.
The image below shows the assembly for my df64_mult function with software FMA (left) and hardware FMA (right). This clearly shows that hardware FMA is a big improvement for double-double multiplication.
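For comparison, the software-FMA path has to recover the same error term with a Veltkamp split and Dekker's product. A rough sketch of what that costs per register (under the usual no-overflow assumptions; not necessarily the exact code behind the left-hand assembly):

#include <immintrin.h>

// Veltkamp split: a == hi + lo, with hi holding the upper half of the mantissa.
static inline void split(__m256d a, __m256d *hi, __m256d *lo) {
    const __m256d splitter = _mm256_set1_pd(134217729.0);   // 2^27 + 1
    __m256d t = _mm256_mul_pd(splitter, a);
    *hi = _mm256_sub_pd(t, _mm256_sub_pd(t, a));
    *lo = _mm256_sub_pd(a, *hi);
}

// Dekker's product without FMA: p + e == a*b exactly per lane.
// Roughly 17 operations versus 2 with the FMA-based product above.
static inline void two_prod_dekker(__m256d a, __m256d b, __m256d *p, __m256d *e) {
    __m256d ahi, alo, bhi, blo;
    split(a, &ahi, &alo);
    split(b, &bhi, &blo);
    *p = _mm256_mul_pd(a, b);
    __m256d err = _mm256_sub_pd(_mm256_mul_pd(ahi, bhi), *p);
    err = _mm256_add_pd(err, _mm256_mul_pd(ahi, blo));
    err = _mm256_add_pd(err, _mm256_mul_pd(alo, bhi));
    *e  = _mm256_add_pd(err, _mm256_mul_pd(alo, blo));
}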
So how does hardware FMA perform in the double-double Mandelbrot calculation? The answer is that it's only about 15% faster than with software FMA. That's much less than I hoped for. The double-double Mandelbrot calculation needs four double-double additions and four double-double multiplications (x*x, y*y, x*y, and 2*(x*y)). However, the 2*(x*y) multiplication is trivial for double-double, so it can be ignored in the cost. Therefore, I think the reason the improvement from hardware FMA is so small is that the calculation is dominated by the slow double-double addition (see the assembly below).
It used to be that multiplication was slower than addition (and programmers used several tricks to avoid multiplication), but with Haswell it seems to be the other way around, not only because of FMA but also because multiplication can use two ports while addition can use only one.
So my questions (finally) are:
- How does one optimize when addition is slow compared to multiplication?
- Is there an algebraic way to change my algorithm to use more multiplications and fewer additions? I know there are methods to do the reverse, e.g. (x+y)*(x+y) - (x*x+y*y) = 2*x*y, which uses two more additions for one less multiplication.
- Is there a way to simplify the df64_add function (e.g. using FMA)?
In case anyone is wondering, the double-double method is about ten times slower than plain double. I don't think that's so bad: if there were a hardware quad-precision type it would likely be at least twice as slow as double, so my software method would be only about five times slower than what I would expect from such hardware if it existed.
df64_add assembly:
vmovapd 8(%rsp), %ymm0
movq %rdi, %rax
vmovapd 72(%rsp), %ymm1
vmovapd 40(%rsp), %ymm3
vaddpd %ymm1, %ymm0, %ymm4
vmovapd 104(%rsp), %ymm5
vsubpd %ymm0, %ymm4, %ymm2
vsubpd %ymm2, %ymm1, %ymm1
vsubpd %ymm2, %ymm4, %ymm2
vsubpd %ymm2, %ymm0, %ymm0
vaddpd %ymm1, %ymm0, %ymm2
vaddpd %ymm5, %ymm3, %ymm1
vsubpd %ymm3, %ymm1, %ymm6
vsubpd %ymm6, %ymm5, %ymm5
vsubpd %ymm6, %ymm1, %ymm6
vaddpd %ymm1, %ymm2, %ymm1
vsubpd %ymm6, %ymm3, %ymm3
vaddpd %ymm1, %ymm4, %ymm2
vaddpd %ymm5, %ymm3, %ymm3
vsubpd %ymm4, %ymm2, %ymm4
vsubpd %ymm4, %ymm1, %ymm1
vaddpd %ymm3, %ymm1, %ymm0
vaddpd %ymm0, %ymm2, %ymm1
vsubpd %ymm2, %ymm1, %ymm2
vmovapd %ymm1, (%rdi)
vsubpd %ymm2, %ymm0, %ymm0
vmovapd %ymm0, 32(%rdi)
vzeroupper
ret
Comments:

- … x*x + y*y, where, for some Intel processor model, the latency if implemented as mul-fma makes the timing worse than mul-mul-add, which contains more parallelism. – Colettecoleus
- … double, you could compute x2hi + x2lo + y2hi + y2lo + cx by doing a Knuth 2-sum of x2hi and y2hi (6 flops), another Knuth 2-sum with cx (another 6 flops), then just sum the four remainder terms. You could compute 2*x*y + cy using the FMA-based double*double -> double-double product, doing a Knuth 2-sum with cy, and adding together the remainder terms. (I haven't tested this or done a serious analysis, so this suggestion may be garbage.) – Regale
- df64_add uses exactly 20 add/sub instructions (though each one operates on four doubles, so it's really 80 floating-point operations in 20 instructions). That's in agreement with your statement. If you write up an answer with more information, particularly on the max/min-magnitude operations, and/or can provide a paper discussing this, I would greatly appreciate it. – Grin
- If x is larger in magnitude than y, you can do hi = x+y; lo = (x-hi)+y; the difference and the sum in the computation of lo are exact. "The Handbook of Floating-Point Arithmetic" has all these tricks and many more. I found Ogita, Rump, and Oishi's "Accurate Sum and Dot Product" pretty educational, but it's by no means the original reference for Fast2Sum. – Regale
- … famin and famax. How to use these to shave two operations off double-double addition is shown in N. Louvet's 2007 Ph.D. thesis. The three-input addition cuts the operation count in half, as I recall. I can't find the reference right now (giant mess in my collected papers); I guess it was research by Sylvie Boldo, published within the past five years. I did a prototype implementation at the time, but can't find that either. – Traweek
- famax() and famin() are used in such a way that when |a| == |b| the further computation continues using both a and b, which may be of opposite sign. So for TwoSum() a.k.a. add12() you would want something like this: s=a+b; x=famax(a,b); y=famin(b,a); e=(x-s)+y; return (e, s); Note the argument swap between the calls to famax() and famin(). – Traweek