SIMD minmag and maxmag
Asked Answered
W

1

7

I want to implement SIMD minmag and maxmag functions. As far as I understand these functions are

minmag(a,b) = |a|<|b| ? a : b
maxmag(a,b) = |a|>|b| ? a : b

I want these for float and double and my target hardware is Haswell. What I really need is code which calculates both. Here is what I have for SSE4.1 for double (the AVX code is almost identical)

static inline void maxminmag(__m128d & a, __m128d & b) {
    __m128d mask    = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF));
    __m128d aa      = _mm_and_pd(a,mask);
    __m128d ab      = _mm_and_pd(b,mask);
    __m128d cmp     = _mm_cmple_pd(ab,aa);
    __m128d cmpi    = _mm_xor_pd(cmp, _mm_castsi128_pd(_mm_set1_epi32(-1)));
    __m128d minmag  = _mm_blendv_pd(a, b, cmp);
    __m128d maxmag  = _mm_blendv_pd(a, b, cmpi);
    a = maxmag, b = minmag;
}

However, this is not as efficient as I would like. Is there a better method or at least an alternative worth considering? I would like to try to avoid port 1 since I already have many additions/subtractions using that port. The _mm_cmple_pd instrinsic goes to port 1.

The main function I am interested is this:

//given |a| > |b|
static inline doubledouble4 quick_two_sum(const double4 & a, const double4 & b)  {
    double4 s = a + b;
    double4 e = b - (s - a);
    return (doubledouble4){s, e};
}

So what I am really after is this

static inline doubledouble4 two_sum_MinMax(const double4 & a, const double4 & b) {
    maxminmag(a,b);       
    return quick_to_sum(a,b);
}

Edit: My goal is for two_sum_MinMax to be faster than two_sum below:

static inline doubledouble4 two_sum(const double4 &a, const double4 &b) {
        double4 s = a + b;
        double4 v = s - a;
        double4 e = (a - (s - v)) + (b - v);
        return (doubledouble4){s, e};
}

Edit: here is the ultimate function I'm after. It does 20 add/subs all of which go to port 1 on Haswell. Using my implementation of two_sum_MinMax in this question gets it down to 16 add/subs on port 1 but it has worse latency and is still slower. You can see the assembly for this function and read more about why I care about this at optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble

static inline doublefloat4 adddd(const doubledouble4 &a, const doubledouble4 &b) {
        doubledouble4 s, t;
        s = two_sum(a.hi, b.hi);
        t = two_sum(a.lo, b.lo);
        s.lo += t.hi;
        s = quick_two_sum(s.hi, s.lo);
        s.lo += t.lo;
        s = quick_two_sum(s.hi, s.lo);
        return s;
        // 2*two_sum, 2 add, 2*quick_two_sum = 2*6 + 2 + 2*3 = 20 add
}
Worthington answered 3/6, 2015 at 11:37 Comment(15)
Correct me if I'm wrong, but wouldn't minmag = blendv(a, b, cmp); maxmag = blendv(b, a, cmp); do the same as your code while reusing the same mask?Mcevoy
@EOF, it's maxmag = _mm_blendv_pd(a, b, cmpi); maybe I should have called it icmp instead of cmpi. The i for invert.Worthington
Yes, I'm aware. But you can also invert the blend by reversing the arguments...Mcevoy
@EOF, oh, good point! I read your comment wrong not you mine.Worthington
Can you give a brief explanation of what two_sum is supposed to do ? It doesn't make much sense to me at first glance. Is it for Kahan summation, or something like that ?Eyepiece
@PaulR, it's not Kahan summation. It's doing (double + double) to doubledouble addition. It's like 64-bit + 64-bit to 128-bit integer addition but for floating point instead of integer: s means sum and e means error (I assume). See this question and read the comments for why I'm interested in this.Worthington
OK - thanks - that makes more sense now - I guess I wasn't sure what the double4 and doubledouble4 data types were (I'm more of an integer/fixed-point guy myself ;-)).Eyepiece
@Zboson This paper describes a method to do doubledouble addition with only 3 additional instructions when you don't know which number is greater in magnitude. I can't say whether it's faster than the solutions here since the 3 extra instructions are adds/subs which would contend for the same execution units.Underwood
@Mysticial, thanks, that's one of the paper's I'm using. The two_sum function I mentioned uses six adds/subs. When you know |a|>|b| it can be done in three.Worthington
You can do an FP add on port 0 by using an FMA (multiplying by 1.0). Latency=5, so don't do it on the critical path.Greathearted
@PeterCordes, I already tried that, it was not better. See Evgeny Kluev's comment in Paul R's answer.Worthington
You said you only tried replacing ALL your adds with FMAs. If dependency chains were an issue at all, then 5 cycle latency instead of 3 would be a problem. Only about half of the adds need to be FMA to balance port0/port1 (depending on what ports the other insns use).Greathearted
@PeterCordes, you're correct. I could tune it suing some additions with FMA instead of all. I have not done this. I might look into again later.Worthington
Try IACA to find adds that aren't on the critical path, and turn those into FMA, if you do get back to it.Greathearted
@PeterCordes, good point. I was using IACA but I did not think to only turn the adds not on the critical path to FMA.Worthington
E
7

Here's an alternate implementation which uses fewer instructions:

static inline void maxminmag_test(__m128d & a, __m128d & b) {
    __m128d cmp     = _mm_add_pd(a, b); // test for mean(a, b) >= 0
    __m128d amin    = _mm_min_pd(a, b);
    __m128d amax    = _mm_max_pd(a, b);
    __m128d minmag  = _mm_blendv_pd(amin, amax, cmp);
    __m128d maxmag  = _mm_blendv_pd(amax, amin, cmp);
    a = maxmag, b = minmag;
}

It uses a somewhat subtle algorithm (see below), combined with the fact that we can use the sign bit as a selection mask.

It also uses @EOF's suggestion of using only one mask and switching the operand order, which saves an instruction.

I've tested it with a small number of cases and it seems to match your original implementation.


Algorithm:

 if (mean(a, b) >= 0)       // this can just be reduced to (a + b) >= 0
 {
     minmag = min(a, b);
     maxmag = max(a, b);
 }
 else
 {
     minmag = max(a, b);
     maxmag = min(a, b);
 }
Eyepiece answered 3/6, 2015 at 12:20 Comment(14)
That's a clever solution. I wanted to use the min/max instructions but did not think of this. Thanks. I'll give a try and get back to you.Worthington
Do you know what ports min and max use? I need to beat three add/sub instructions. Your solution already uses one add so if min and max still use port 1 I'm not sure it will be any better (but it's certainly worth a try).Worthington
I added to the end of my question the function two_sum which I want to replace with two_sum_MaxMin if it's faster.Worthington
@Zboson: According to Agner Fog's instruction tables, MAXPD on Haswell is port 1, 3 cycles latency. You could shift some work to port 5 by amin = _mm_min_pd(a,b); amax = a^b^amin.Mcevoy
@Zboson: and you could shift some work to port 0 by using FMA instead of "add".Rillis
@EvgenyKluev, good point, I tried that last night using fma(a,1.0,b) but it gave worse performance. But I used it everywhere (for every addition and subtraction). Maybe if I use fma in some cases instead of for every addition and subtraction it will help.Worthington
I made one more edit (sorry) with the ultimate function I'm after. It's ``doubledouble + doubledouble to doubledouble. It needs 20 additions/subtractions. Using my implementation of minmag and maxmag gets it down to 16.Worthington
You provided a clever alternative to my solution. However, your solution is slower than the two_sum function. Also the bit-wise and instructions I think have a latency of 1 and don't use port 1. Additionally, your solution is a bit less accurate. I think that's because of the addition you do which can "carry" but I'm not sure. Nevertheless, I really appreciate your answer.Worthington
Re accuracy - I guess there might be an edge case when a == -b - I think think you're effectively using |a| > |b| in you test whereas I'm using |a| >= |b| - it didn't show up in my limited testing but it might be worth looking at in case it affects the final result.Eyepiece
@PaulR, I found a quicker solution to my double-double addition using these minmaxmag functions. See my answer stackoverflow.com/questions/30573443/…Worthington
I think I am doing |a| >= |b| (using _mm_cmple_pd(ab,aa))Worthington
I just realized in my definition above I did not define what happens when |a| == |b|. There appear to be different definitions. The one in OpenCL returns max(a,b) or min (a,b) when |a| == |b| another definition returns the first argument. I guess I should ask a question about this. My code does the second case.Worthington
E.g. For a = -3, b = 3 should maxmag return a=-3 or max(a,b)=3?Worthington
It appears even the max and min functions in IEEE 754 2008 can be user defined e.g max(-0,+0) can return the first argument en.wikipedia.org/wiki/IEEE_754_revision#min_and_maxWorthington

© 2022 - 2024 — McMap. All rights reserved.