Horizontal minimum and maximum using SSE
Asked Answered
G

3

14

I have a function using SSE to do a lot of stuff, and the profiler shows me that the code portion I use to compute the horizontal minimum and maximum consumes most of the time.

I have been using the following implementation for the minimum for instance:

static inline int16_t hMin(__m128i buffer) {
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m1));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m2));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m3));
    buffer = _mm_min_epi8(buffer, _mm_shuffle_epi8(buffer, m4));
    return ((int8_t*) ((void *) &buffer))[0];
}

I need to compute the minimum and the maximum of 16 1-byte integers, as you see.

Any good suggestions are highly appreciated :)

Thanks

Garrotte answered 7/3, 2014 at 17:17 Comment(8)
What are the values of the masks m1, m2, m3, and m4?Gilbertgilberta
I don't know if there is a better way. Without a horizontal max min operator you have to do it in four binary steps (which I think you are doing): compare 8 with 8, then 4 with 4, then 2 with 2 and then 1 with 1. With int4 it takes two steps: #9878200Gilbertgilberta
Yes, sorry, I forgot the shuffle control masks. They are used as Z boson assumed beforeGarrotte
As highlighted in Marat Dukhan's answer as the primarily reason, taking the address of an SIMD vector is not the correct way of extracting element values into the CPU register because that style will force the value to be written to memory. (Neither is accessing the value of the "struct" directly.) Changing the code to _mm_cvtsi128_si32 will be the best thing to do.Photodynamics
It's probably worth mentioning that if your calling code has (much) more than 16 1-byte integers to find the overall minimum of then you can do this much quicker by just accumulating the results of a single _mm_min_epi8 operation into an __m128i value, and doing the step in your function of combining the results once at the end.Haplosis
@Haplosis Excellent suggestion! Will definitely give it a try laterGarrotte
@Haplosis I was too optimistic, the vectors I use (and in other frequent use cases) are the result of a scan operation over a bytewise population count or a similar operation, i.e. if the length is more than 16 bytes, you will not be able to store the amount in extreme cases. In cases where this can be done without the prefix sum step, it will be for sure a huge improvement.Garrotte
Glad to hear that it at least partially helps. Thank you for the description, I think I'd need a little more information however to fully understand the problem. I think you mean integer rollover will occur in the vector components in some cases. Keep in mind you can pretend negative bit patterns are positive for some operations. Speaking generally, you could perhaps limit some input somewhere to avoid the case all together. Or you could maybe unpack to 16-bit values and do your computations on that. That should still be a win if it allows you to combine once at the end.Haplosis
B
9

I suggest two changes:

  • Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
  • Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:

    static inline int16_t hMin(__m128i buffer)
    {
        buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
        buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
        buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
        buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
        return (int8_t)_mm_cvtsi128_si32(buffer);
    }
    
Bihar answered 7/3, 2014 at 20:0 Comment(2)
I have tested the method today, it works for most cases, but it fails to find the minimum when it is located at first position, e.g. if you test it with the vector (0,1,..,15), it will return that the minimum is 1. For all other cases it seems to work!Garrotte
@Garrotte You're right, there was a bug. Now its fixed.Bihar
G
18

SSE 4.1 has an instruction that does almost what you want. Its name is PHMINPOSUW, C/C++ intrinsic is _mm_minpos_epu16. It is limited to 16-bit unsigned values and cannot give maximum, but these problems could be easily solved.

  1. If you need to find minimum of non-negative bytes, do nothing. If bytes may be negative, add 128 to each. If you need maximum, subtract each from 127.
  2. Use either _mm_srli_pi16 or _mm_shuffle_epi8, and then _mm_min_epu8 to get 8 pairwise minimum values in even bytes and zeros in odd bytes of some XMM register. (These zeros are produced by shift/shuffle instruction and should remain at their places after _mm_min_epu8).
  3. Use _mm_minpos_epu16 to find minimum among these values.
  4. Extract the resulting minimum value with _mm_cvtsi128_si32.
  5. Undo effect of step 1 to get the original byte value.

Here is an example that returns maximum of 16 signed bytes:

static inline int16_t hMax(__m128i buffer)
{
    __m128i tmp1 = _mm_sub_epi8(_mm_set1_epi8(127), buffer);
    __m128i tmp2 = _mm_min_epu8(tmp1, _mm_srli_epi16(tmp1, 8));
    __m128i tmp3 = _mm_minpos_epu16(tmp2);
    return (int8_t)(127 - _mm_cvtsi128_si32(tmp3));
}
Gilli answered 8/3, 2014 at 10:54 Comment(3)
I gave your method also a try in my benchmarks, it is indeed a bit faster than the other methods.Garrotte
Nice, I like how _mm_min_epu8 leaves zeros in the high half of each 16-bit element, since unsigned_min(0,x) = 0. So there's no instruction spent on just zero-extending to 16-bit.Kedge
Note that subtracting 128 is the same as adding or XORing with 128 (since there's nowhere for the carry to go). pxor runs on more ports than psubb (and is commutative, giving the optimizer more flexibility with register allocation), so you should prefer that when range-shifting to unsigned.Kedge
B
9

I suggest two changes:

  • Replace ((int8_t*) ((void *) &buffer))[0] with _mm_cvtsi128_si32.
  • Replace _mm_shuffle_epi8 with _mm_shuffle_epi32/_mm_shufflelo_epi16 which have lower latency on recent AMD processors and Intel Atom, and will save you memory load operations:

    static inline int16_t hMin(__m128i buffer)
    {
        buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(3, 2, 3, 2)));
        buffer = _mm_min_epi8(buffer, _mm_shuffle_epi32(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
        buffer = _mm_min_epi8(buffer, _mm_shufflelo_epi16(buffer, _MM_SHUFFLE(1, 1, 1, 1)));
        buffer = _mm_min_epi8(buffer, _mm_srli_epi16(buffer, 8));
        return (int8_t)_mm_cvtsi128_si32(buffer);
    }
    
Bihar answered 7/3, 2014 at 20:0 Comment(2)
I have tested the method today, it works for most cases, but it fails to find the minimum when it is located at first position, e.g. if you test it with the vector (0,1,..,15), it will return that the minimum is 1. For all other cases it seems to work!Garrotte
@Garrotte You're right, there was a bug. Now its fixed.Bihar
R
0

here's an implementation without shuffle, shuffle is slow on AMD 5000 Ryzen 7 for some reason

    float max_elem3() const {
        __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
        __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
        __m128 c = _mm_max_ps(a, b); // ..., max(x, z), ..., ...
        Vector4 res = _mm_max_ps(mm, c); // ..., max(y, max(x, z)), ..., ...
        return res.y;
    }

    float min_elem3() const {
        __m128 a = _mm_unpacklo_ps(mm, mm); // x x y y
        __m128 b = _mm_unpackhi_ps(mm, mm); // z z w w
        __m128 c = _mm_min_ps(a, b); // ..., min(x, z), ..., ...
        Vector4 res = _mm_min_ps(mm, c); // ..., min(y, min(x, z)), ..., ...
        return res.y;
    }
Roan answered 13/1, 2023 at 9:33 Comment(3)
shufps xmm, xmm, i8 is 1 uop with 1 cycle latency and 2/clock throughput on Zen 1/2/3. 3/clock on Zen4. If some other version of this with _mm_shuffle_ps was slow in some context, it's probably not because of shufps. It does cost extra code size vs. unpcklps/hps, though, and unlike pshufd would still require a movaps register copy (if you don't compile with AVX).Kedge
This pattern of shuffling, with 2 shuffles on the original data that can happen in parallel, is good especially on Zen which can run more than 1 shuffle per clock. (Ice Lake has 2/c shufps but only 1/c unpcklps) uops.info. So that's a good trick if you only care about 3 elements and min/max(y,w) always keeps y. Maybe with w = NaN, if you have the right operand order for your shuffles?Kedge
Awesome insights, I'm honestly not an expert with SIMD but I came up with the above solution and thankfully it works good I'm still trying to optimize other parts of the program, you can find full code here: github.com/iyadahmed/bvhRoan

© 2022 - 2024 — McMap. All rights reserved.