Getting max value in a __m128i vector with SSE?

I have just started using SSE and I am confused about how to get the maximum integer value (max) of a __m128i. For instance:

__m128i t = _mm_setr_ps(0,1,2,3);
// max(t) = 3;

Searching around led me to the MAXPS instruction, but I can't seem to find how to use it with "xmmintrin.h".

Also, is there any documentation for "xmmintrin.h" that you would recommend, rather than looking into the header file itself?

Godfearing answered 26/3, 2012 at 18:32 Comment(2)
The shuffles you need are the same as for a horizontal sum, or pretty much any other horizontal reduction. See stackoverflow.com/questions/6996764/… for some optimized versions for float, integer, and double, with SSE2, SSE3, and AVX. Also discussion of what shuffles are optimal on which CPUs. - Behn
This question seems to be confused about float vs. integer. __m128i is an integer vector. *_ps and MAXPS are packed-single float. For documentation, see the SSE tag wiki for links, and many more links at stackoverflow.com/tags/x86/info. One very good resource is Intel's intrinsics search/finder, which has details on what each one does, but not as much detail as in the asm instruction reference manual. - Behn
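
To make the float/integer distinction from the comment above concrete, here is a minimal sketch (not part of the original thread). The helper names make_ints, make_floats and ints_to_floats are hypothetical, chosen for illustration; the intrinsics themselves are standard SSE/SSE2. An integer vector is built with an *_epi32 intrinsic and a float vector with a *_ps intrinsic; mixing the two, as in the question's snippet, would normally be rejected by the compiler.

```c
#include <emmintrin.h>  /* SSE2: __m128i and the *_epi32 intrinsics (also pulls in SSE) */

/* Hypothetical helper: an integer vector, built with an *_epi32 intrinsic. */
static __m128i make_ints(void) { return _mm_setr_epi32(0, 1, 2, 3); }

/* Hypothetical helper: a float vector, built with a *_ps intrinsic. */
static __m128 make_floats(void) { return _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f); }

/* Numeric conversion of four signed 32-bit ints to four floats (cvtdq2ps). */
static __m128 ints_to_floats(__m128i v) { return _mm_cvtepi32_ps(v); }
```

MAXPS corresponds to _mm_max_ps and works on __m128; for a __m128i holding 32-bit integers, the integer counterpart is _mm_max_epi32, which requires SSE4.1.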

If you find yourself needing to do horizontal operations on vectors, especially if it's inside an inner loop, then it's usually a sign that you are approaching your SIMD implementation in the wrong way. SIMD likes to operate element-wise on vectors - "vertically" if you like, not horizontally.
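
As a rough illustration of the "vertical" style described above, the sketch below keeps a running element-wise maximum inside the loop and performs a single horizontal reduction only after the loop. The names max_epi32_sse2 and array_max are hypothetical, and the code assumes the element count is a nonzero multiple of 4. It sticks to SSE2 by emulating the signed 32-bit max with a compare and a blend; with SSE4.1 available, _mm_max_epi32 could be used instead.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>

/* SSE2-only signed 32-bit max (SSE4.1 offers _mm_max_epi32 for this). */
static __m128i max_epi32_sse2(__m128i a, __m128i b) {
    __m128i gt = _mm_cmpgt_epi32(a, b);   /* all-ones in lanes where a > b */
    return _mm_or_si128(_mm_and_si128(gt, a), _mm_andnot_si128(gt, b));
}

/* Hypothetical helper: maximum of n ints; assumes n is a nonzero multiple of 4. */
static int array_max(const int *p, size_t n) {
    __m128i acc = _mm_loadu_si128((const __m128i *)p);
    for (size_t i = 4; i < n; i += 4)     /* vertical: lane-wise running max */
        acc = max_epi32_sse2(acc, _mm_loadu_si128((const __m128i *)(p + i)));
    /* horizontal: done once, outside the loop */
    acc = max_epi32_sse2(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = max_epi32_sse2(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);        /* lane 0 now holds the overall max */
}
```

The point of this layout is that the cost of the shuffles is paid once per array rather than once per iteration.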

As for documentation, there is a very useful reference on intel.com which contains all the opcodes and intrinsics for everything from MMX through the various flavours of SSE all the way up to AVX and AVX-512.

Cassation answered 26/3, 2012 at 19:19 Comment(3)
Thank you for the link. The horizontal part is for a loop condition only but I will revise my approach. - Godfearing
The link is currently: software.intel.com/sites/landingpage/IntrinsicsGuide - Labarbera
@MarkLakata: thanks - answer updated - I miss the old off-line guide - as well as working without an internet connection it was also useful in that you could scrape the data for other uses. Never mind though - the new online version is still good. - Cassation

In case anyone cares, and since intrinsics seem to be the way to go these days, here is a solution in terms of intrinsics.

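int horizontal_max_Vec4i(__m128i x) {
    // Note: _mm_max_epi32 requires SSE4.1 (<smmintrin.h>)
    __m128i max1 = _mm_shuffle_epi32(x, _MM_SHUFFLE(0,0,3,2));    // lanes 2,3 moved down to lanes 0,1
    __m128i max2 = _mm_max_epi32(x,max1);                         // lane 0 = max(x0,x2), lane 1 = max(x1,x3)
    __m128i max3 = _mm_shuffle_epi32(max2, _MM_SHUFFLE(0,0,0,1)); // lane 1 moved down to lane 0
    __m128i max4 = _mm_max_epi32(max2,max3);                      // lane 0 = max of all four elements
    return _mm_cvtsi128_si32(max4);                               // extract lane 0
}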

I don't know if that's any better than this:

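int horizontal_max_Vec4i(__m128i x) {
    int result[4] __attribute__((aligned(16)));   // GCC/Clang alignment syntax
    _mm_store_si128((__m128i *) result, x);       // aligned store of all four lanes
    int m01 = result[0] > result[1] ? result[0] : result[1];
    int m23 = result[2] > result[3] ? result[2] : result[3];
    return m01 > m23 ? m01 : m23;                 // scalar max, no max() helper needed
}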
Kermanshah answered 4/9, 2013 at 14:39 Comment(0)

According to this page, there is no horizontal max, and you need to test the elements vertically:

movhlps xmm1,xmm0         ; Move top two floats to lower part of xmm1
maxps   xmm0,xmm1         ; Get the maximum of the two sets of floats
pshufd  xmm1,xmm0,$55     ; Move second float to lower part of xmm1
maxps   xmm0,xmm1         ; Get the maximum of the two remaining floats

Conversely, getting the minimum:

movhlps xmm1,xmm0
minps   xmm0,xmm1
pshufd  xmm1,xmm0,$55
minps   xmm0,xmm1
Feriga answered 26/3, 2012 at 19:13 Comment(6)
pshufd between maxps instructions has extra latency on many CPUs (including Intel). SSE3 movshdup will duplicate the upper float in each half of the register, so you can use it to avoid a movaps copy. - Behn
@PeterCordes, Could you write your own optimized solution? Would it be different if it was a vector of float? Thank You. - Gusher
@Royi: this answer is for a vector of float (because the question is mis-titled or mixed up about float vs. integer, see my comments on the question). Optimized for which microarchitecture(s), and with which level of SSE? SSE3? Or limited to SSE2? Or AVX2? See stackoverflow.com/questions/6996764/… (but replace add with max) for various optimized float and integer shuffles. - Behn
Let's say SSE4. Optimized for Haswell and above. Thank You. P. S. I meant using SSE Intrinsics, Isn't the above Assembly? - Gusher
@PeterCordes, I used it as can be shown here - codereview.stackexchange.com/questions/177658. Is that what you meant? Any idea why is it still so slow? - Gusher
@Royi: Yes, the above is assembly. Writing the same thing with intrinsics just requires some _mm_cast intrinsics, except that starting with movhlps into a different register to save a movaps usually requires that you have a left-over C variable, because _mm_undefined_ps() sometimes gets you an xor-zeroed register in some compilers, which defeats the purpose of trying to save instructions. - Behn
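
For readers who want the float version above in intrinsics rather than assembly, here is one possible sketch (not taken from the thread). The function name horizontal_max_ps is hypothetical. It follows the same movhlps/maxps pattern, but swaps pshufd for the float shuffle _mm_shuffle_ps so no integer-domain instruction sits between the two max operations, and it makes no attempt at the register-saving trick discussed in the last comment.

```c
#include <xmmintrin.h>  /* SSE: _mm_movehl_ps, _mm_max_ps, _mm_shuffle_ps */

/* Hypothetical helper: maximum of the four floats in v. */
static float horizontal_max_ps(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);   /* upper two floats copied to lanes 0,1 */
    __m128 m2  = _mm_max_ps(v, hi);     /* lane 0 = max(v0,v2), lane 1 = max(v1,v3) */
    __m128 odd = _mm_shuffle_ps(m2, m2, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane 1 */
    __m128 m1  = _mm_max_ss(m2, odd);   /* lane 0 = max of all four */
    return _mm_cvtss_f32(m1);           /* extract lane 0 */
}
```

Swapping _mm_max_ps/_mm_max_ss for _mm_min_ps/_mm_min_ss gives the corresponding minimum.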

There is no Horizontal Maximum opcode in SSE (at least up until the point where I stopped keeping track of new SSE instructions).

So you are stuck doing some shuffling. What you end up with is...

movhlps %xmm0, %xmm1            # Move top two floats to lower part of %xmm1
maxps   %xmm1, %xmm0            # Get maximum of sets of two floats
pshufd  $0x55, %xmm0, %xmm1     # Move second float to lower part of %xmm1
maxps   %xmm1, %xmm0            # Get maximum of all four floats originally in %xmm0

http://locklessinc.com/articles/instruction_wishlist/

MSDN has the intrinsic and macro function mappings documented

http://msdn.microsoft.com/en-us/library/t467de55.aspx

Beggar answered 26/3, 2012 at 19:12 Comment(0)
