I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are defined as:
minmag(a,b) = |a|<|b| ? a : b
maxmag(a,b) = |a|>|b| ? a : b
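To pin down what I mean, here is a minimal scalar sketch of these definitions. The names `minmag_ref`, `maxmag_ref` and `maxminmag_ref` are hypothetical and only for illustration. Note that, taken literally, both definitions return b when |a| == |b|, so a combined routine should drive both selections from a single comparison to avoid losing a on a tie:

```cpp
#include <cmath>

// Literal scalar transcription of the two definitions.
static inline double minmag_ref(double a, double b) {
    return std::fabs(a) < std::fabs(b) ? a : b;
}

static inline double maxmag_ref(double a, double b) {
    return std::fabs(a) > std::fabs(b) ? a : b;
}

// Combined form (hypothetical helper): after the call, |a| >= |b|.
// One comparison decides both outputs, so on a tie a and b are left as they are.
static inline void maxminmag_ref(double &a, double &b) {
    if (std::fabs(a) < std::fabs(b)) {
        double t = a;
        a = b;
        b = t;
    }
}
```

For example, `maxminmag_ref` applied to (1.0, -4.0) leaves -4.0 in the first argument and 1.0 in the second.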
I want these for float and double, and my target hardware is Haswell. What I really need is code that calculates both at once. Here is what I have for SSE4.1 for double (the AVX code is almost identical):
static inline void maxminmag(__m128d & a, __m128d & b) {
    // mask that clears the sign bit: 0x7FFFFFFFFFFFFFFF in each 64-bit lane
    __m128d mask = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF));
    __m128d aa = _mm_and_pd(a,mask);    // |a|
    __m128d ab = _mm_and_pd(b,mask);    // |b|
    __m128d cmp = _mm_cmple_pd(ab,aa);  // all ones where |b| <= |a|
    __m128d cmpi = _mm_xor_pd(cmp, _mm_castsi128_pd(_mm_set1_epi32(-1)));  // inverted mask
    __m128d minmag = _mm_blendv_pd(a, b, cmp);   // b where cmp is set, else a
    __m128d maxmag = _mm_blendv_pd(a, b, cmpi);  // b where cmp is clear, else a
    a = maxmag, b = minmag;
}
However, this is not as efficient as I would like. Is there a better method, or at least an alternative worth considering? I would like to avoid port 1, since I already have many additions/subtractions using that port. The _mm_cmple_pd intrinsic goes to port 1.
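One small variant worth noting is to reuse the same compare mask for both blends and swap the blend operands instead of inverting the mask. The sketch below is untested for performance. The names `maxminmag_onecmp`, `maxminmag_demo` and the `SSE41_TARGET` macro are hypothetical, and the macro is only there so the code compiles without -msse4.1 on GCC/Clang. It removes the XOR but keeps the compare, so it does not take anything off port 1:

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_blendv_pd

#if defined(__GNUC__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

// Hypothetical variant of maxminmag: one compare, two blends, no XOR.
// After the call, a holds the larger-magnitude lanes and b the smaller ones.
SSE41_TARGET static inline void maxminmag_onecmp(__m128d &a, __m128d &b) {
    const __m128d absmask =
        _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));
    __m128d aa = _mm_and_pd(a, absmask);         // |a|
    __m128d ab = _mm_and_pd(b, absmask);         // |b|
    __m128d cmp = _mm_cmple_pd(ab, aa);          // all ones where |b| <= |a|
    __m128d minmag = _mm_blendv_pd(a, b, cmp);   // b where cmp is set, else a
    __m128d maxmag = _mm_blendv_pd(b, a, cmp);   // a where cmp is set, else b
    a = maxmag;
    b = minmag;
}

// Small helper for experimenting with plain arrays of two doubles.
SSE41_TARGET static void maxminmag_demo(const double *a, const double *b,
                                        double *hi, double *lo) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    maxminmag_onecmp(va, vb);
    _mm_storeu_pd(hi, va);
    _mm_storeu_pd(lo, vb);
}
```

With a = {1.0, -5.0} and b = {-3.0, 2.0}, this gives hi = {-3.0, -5.0} and lo = {1.0, 2.0}.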
The main function I am interested in is this:
//given |a| > |b|
static inline doubledouble4 quick_two_sum(const double4 & a, const double4 & b) {
    double4 s = a + b;
    double4 e = b - (s - a);
    return (doubledouble4){s, e};
}
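So what I am really after is this:
static inline doubledouble4 two_sum_MinMax(const double4 & a, const double4 & b) {
    double4 hi = a, lo = b;  // local copies, since maxminmag modifies its arguments
    maxminmag(hi, lo);       // now |hi| >= |lo| in every lane
    return quick_two_sum(hi, lo);
}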
Edit: My goal is for two_sum_MinMax to be faster than two_sum below:
static inline doubledouble4 two_sum(const double4 &a, const double4 &b) {
    double4 s = a + b;
    double4 v = s - a;
    double4 e = (a - (s - v)) + (b - v);
    return (doubledouble4){s, e};
}
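For checking the variants against each other, a scalar model can help. This is only a sketch: `dd_ref` is a hypothetical stand-in for doubledouble4, the `_ref` functions are scalar copies of the routines above, and it assumes strict IEEE double arithmetic (no -ffast-math, which would let the compiler simplify the error term away):

```cpp
#include <cmath>
#include <utility>

struct dd_ref { double hi, lo; };  // hypothetical scalar stand-in for doubledouble4

// Three add/subs; e is only the exact rounding error when |a| >= |b|.
static inline dd_ref quick_two_sum_ref(double a, double b) {
    double s = a + b;
    double e = b - (s - a);
    return {s, e};
}

// Six add/subs; no ordering requirement on a and b.
static inline dd_ref two_sum_ref(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}

// Order by magnitude first, then use the three-operation version.
static inline dd_ref two_sum_minmax_ref(double a, double b) {
    if (std::fabs(a) < std::fabs(b)) std::swap(a, b);
    return quick_two_sum_ref(a, b);
}
```

With a = 1e-17 and b = 1.0, `two_sum_ref` and `two_sum_minmax_ref` both return hi = 1.0 and lo = 1e-17, whereas calling `quick_two_sum_ref` directly in that order returns lo = 0 and loses the error term. That is why the ordering step is needed.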
Edit: here is the ultimate function I'm after. It does 20 add/subs, all of which go to port 1 on Haswell. Using my implementation of two_sum_MinMax from this question gets it down to 16 add/subs on port 1, but it has worse latency and is still slower. You can see the assembly for this function and read more about why I care about this at optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble
static inline doubledouble4 adddd(const doubledouble4 &a, const doubledouble4 &b) {
    doubledouble4 s, t;
    s = two_sum(a.hi, b.hi);
    t = two_sum(a.lo, b.lo);
    s.lo += t.hi;
    s = quick_two_sum(s.hi, s.lo);
    s.lo += t.lo;
    s = quick_two_sum(s.hi, s.lo);
    return s;
    // 2*two_sum, 2 add, 2*quick_two_sum = 2*6 + 2 + 2*3 = 20 add
}
[...] minmag = blendv(a, b, cmp); maxmag = blendv(b, a, cmp); do the same as your code while reusing the same mask? – Mcevoy
maxmag = _mm_blendv_pd(a, b, cmpi); maybe I should have called it icmp instead of cmpi. The i is for invert. – Worthington
[...] two_sum is supposed to do? It doesn't make much sense to me at first glance. Is it for Kahan summation, or something like that? – Eyepiece
s means sum and e means error (I assume). See this question and read the comments for why I'm interested in this. – Worthington
[...] double4 and doubledouble4 data types were (I'm more of an integer/fixed-point guy myself ;-)). – Eyepiece
[...] two_sum function I mentioned uses six adds/subs. When you know |a|>|b| it can be done in three. – Worthington
[...] 1.0). Latency=5, so don't do it on the critical path. – Greathearted