Divide 8-bit integers by 4 (or shift) using SSE

Asked 9/1, 2017 at 19:32 Answered 11/6, 2024 at 7:54

How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?

Betimes answered 9/1, 2017 at 19:32 Comment(7)

I think just specifying the correct -march or -mtune makes it happen automagically: godbolt.org/g/jxGyFd – Alliteration 9/1, 2017 at 19:39

First of all, that tool is awesome for Q&A pages like StackOverflow; I immediately bookmarked it. As for the actual content of the answer: thanks, I'll have a look at the assembly. If the compiler does it automatically in some cases, I should be able to read it out of the assembly anyway. – Betimes 9/1, 2017 at 19:42

@RichardHodges I find that code fairly disappointing actually, Clang does a good job though. – Mara 10/1, 2017 at 13:24

@harold why disappointing? remember that it's dealing with the pathological case of the block not being aligned on an SSE-compatible boundary. Once it handles the edge cases, the main block is vectorised. If you marked the pointers as being aligned I'm sure you'd see perfect code. – Alliteration 10/1, 2017 at 15:21

Hmm, no, annotating the pointers with a 16-byte alignment doesn't help GCC. It turns out that the real difference is that mtune=native is causing Clang to assume AVX2 support, whereas GCC is assuming only AVX support. If you explicitly pass -mavx2 to GCC, you get much better output that resembles Clang's. Clang's, of course, doesn't change. A good lesson in why "native" doesn't make much sense for an online compiler whose system specs you don't control. :-) @richard – Planchet 10/1, 2017 at 16:8

@RichardHodges the fixup isn't the problem, the main SSE-using part is. It converts to words (which is in the first place not necessary and causes significant loss of throughput), and then it emits useless vpand's (which would have been useful if it hadn't converted to words, but it did, so the only reason I can see for them is that GCC is scared of the saturating pack and couldn't figure out that it would be harmless) – Mara 10/1, 2017 at 22:23

@harold I see. Perhaps it's just not a popular use case so has not caught the attention of the code generator maintainers? – Alliteration 10/1, 2017 at 23:20

Unfortunately there are no SSE shift instructions for 8-bit elements. If the elements are unsigned 8-bit values, you can use a 16-bit shift and mask out the unwanted high bits that leak in from the neighbouring byte, e.g.

v = _mm_srli_epi16(v, 2);
v = _mm_and_si128(v, _mm_set1_epi8(0x3f));

For 8 bit signed elements it's a little fiddlier, but still possible, although it might just be easier to unpack to 16 bits, do the shifts, then pack back to 8 bits.
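
For the signed case, one way to make the "fiddlier" route concrete is to do the same logical shift and mask, then sign-extend from the bit where the old sign bit landed. The sketch below is only an illustration under that assumption: srai2_epi8, floor_div4 and count_mismatches_signed are made-up names, not real intrinsics. Note that the result is an arithmetic shift, which rounds toward negative infinity rather than toward zero as C's / 4 does for negative values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Arithmetic shift right by 2 of sixteen signed 8-bit lanes.
   Hypothetical helper name; there is no such intrinsic in SSE. */
static inline __m128i srai2_epi8(__m128i v)
{
    /* Logical shift of the 16-bit lanes, then clear the two bits that
       leaked in from the neighbouring byte. */
    __m128i x = _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8(0x3f));
    /* After the shift the original sign bit sits at bit 5 (0x20);
       (x ^ m) - m sign-extends from that bit. */
    const __m128i m = _mm_set1_epi8(0x20);
    return _mm_sub_epi8(_mm_xor_si128(x, m), m);
}

/* Scalar reference: floor(x / 4), written without shifting a negative value. */
static int floor_div4(int x)
{
    return x >= 0 ? x / 4 : -((-x + 3) / 4);
}

/* Counts how many of the 256 possible byte values disagree with the reference. */
static int count_mismatches_signed(void)
{
    int bad = 0;
    for (int base = -128; base < 128; base += 16) {
        int8_t in[16], out[16];
        for (int i = 0; i < 16; i++)
            in[i] = (int8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, srai2_epi8(v));
        for (int i = 0; i < 16; i++)
            if (out[i] != floor_div4(in[i]))
                bad++;
    }
    return bad;
}
```

This costs four instructions plus two constants, so unpacking to 16 bits may still be the simpler choice when the data has to be widened anyway.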

Dexamethasone answered 9/1, 2017 at 20:18 Comment(6)

Thanks, just solved it myself by writing a macro which fakes epi8: #define _mm_srli_epi8(mm, Imm) _mm_and_si128(_mm_set1_epi8(0xFF >> Imm), _mm_srli_epi32(mm, Imm)) – Betimes 9/1, 2017 at 20:19

@miho: Note that there aren't really any advantages to writing this as a macro instead of an inline function here. – Abbey 9/1, 2017 at 20:27

@DietrichEpp: actually some compilers complain if the Imm in _mm_srli_epi32 is not a literal constant (particularly in debug builds), which can be a problem with inline functions, although you should be fine with current/recent versions of gcc, clang, ICC. – Dexamethasone 9/1, 2017 at 20:29

Ah yes, of course. There are still other ways you can avoid macros here. – Abbey 9/1, 2017 at 20:33

Yes, it depends whether your code needs to be portable and work with a bunch of different compilers, in which case you have to play to the lowest common denominator (MSVC usually). If you only need to work with one (or a few) compiler versions then you can probably get away with an inline function here. – Dexamethasone 9/1, 2017 at 20:35

@PaulR The hack that I use is to make the shift amount a template argument. – Wagtail 20/1, 2017 at 21:7
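
For plain C, where a template argument is not available, one possible way to avoid both the macro and the immediate-operand complaint is to pass the count in a register with _mm_srl_epi16. This is a sketch of that idea, not necessarily what any commenter above had in mind; srl_epu8 and count_mismatches_shift are hypothetical names.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Logical right shift of sixteen unsigned 8-bit lanes by n (0..7).
   Hypothetical helper: the count travels in an XMM register, so it
   does not have to be a compile-time constant. */
static inline __m128i srl_epu8(__m128i v, int n)
{
    __m128i count = _mm_cvtsi32_si128(n);              /* count in low 64 bits */
    __m128i mask  = _mm_set1_epi8((char)(0xFF >> n));  /* keeps the low 8-n bits */
    return _mm_and_si128(_mm_srl_epi16(v, count), mask);
}

/* Counts how many of the 256 byte values differ from the scalar v >> n. */
static int count_mismatches_shift(int n)
{
    int bad = 0;
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], out[16];
        for (int i = 0; i < 16; i++)
            in[i] = (uint8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, srl_epu8(v, n));
        for (int i = 0; i < 16; i++)
            if (out[i] != (in[i] >> n))
                bad++;
    }
    return bad;
}
```

When the shift amount is a known constant such as 2, the immediate form with _mm_srli_epi16 remains the more direct choice.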

If your 8-bit integers are unsigned, you can use:

static inline __m128i divBy4(__m128i v)
{
  const __m128i zeros = _mm_set1_epi8(0);
  __m128i vHalf = _mm_avg_epu8(zeros, v); // (0 + v) / 2 = v / 2
  __m128i vQuarter = _mm_avg_epu8(zeros, vHalf); // (0 + (v/2)) / 2 = v/4
  return vQuarter;
}

Please note that the calculation is actually:
(0 + v + 1) / 2 = (v + 1)/2
(0 + (v + 1)/2 + 1) / 2 = (v + 3)/4
So if you want a correct rounded result you can subtract 1 from vHalf for (v+2)/4

If you don't want rounding at all aka integer division, you can subtract 1 from v.

In most cases, the small rounding error isn't worth another instruction. Also, you should move the zero constant out of the function.
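
The rounding of these variants is easy to get wrong, so it may be worth checking them over all 256 byte values. The sketch below does that; avg_twice, avg_twice_nearest, shift_and_mask and count_mismatches are hypothetical names, and the saturating subtract (_mm_subs_epu8) is an assumption added here so that 0 does not wrap to 255. Under this check the plain double average matches (v+3)/4, a saturating subtract of 1 from v before the first average matches the round-to-nearest (v+2)/4, and the shift-and-mask approach matches the truncating v/4.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

typedef __m128i (*vec_fn)(__m128i);

/* Two averages against zero, as in divBy4 above. */
static __m128i avg_twice(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    return _mm_avg_epu8(zero, _mm_avg_epu8(zero, v));
}

/* Same, but with a saturating v - 1 first (0 stays 0). */
static __m128i avg_twice_nearest(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i vm1 = _mm_subs_epu8(v, _mm_set1_epi8(1));
    return _mm_avg_epu8(zero, _mm_avg_epu8(zero, vm1));
}

/* 16-bit shift plus mask: truncating division by 4. */
static __m128i shift_and_mask(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8(0x3f));
}

/* Counts byte values v for which f(v) differs from (v + bias) / 4. */
static int count_mismatches(vec_fn f, int bias)
{
    int bad = 0;
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], out[16];
        for (int i = 0; i < 16; i++)
            in[i] = (uint8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)out, f(v));
        for (int i = 0; i < 16; i++)
            if (out[i] != (in[i] + bias) / 4)
                bad++;
    }
    return bad;
}
```

For example, count_mismatches(avg_twice, 3) and count_mismatches(shift_and_mask, 0) should both return 0, while count_mismatches(avg_twice, 0) should not.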

Pagurian answered 11/6, 2024 at 7:54 Comment(2)

"Also, you should move the zero constant out of the function." Why? The compiler will hoist it out of a loop for you after inlining, unless you're using an ancient MSVC version. And you definitely don't want to make it static const; for some reason static const __m128i doesn't properly optimize away: a static constructor runs at startup to write to the BSS. (Normally copying from .rodata, but here hopefully would still materialize it with pxor xmm0,xmm0 which is as cheap as a NOP on Intel CPUs, and pretty cheap on AMD too.) – Smriti 11/6, 2024 at 19:32

"If you don't want rounding at all aka integer division, you can subtract 1 from v." That could wrap 0 to 255. You could use set1(-1) instead of set1(0) - also cheap to materialize on the fly for compilers, with pcmpeqd xmm1,xmm1 or similar. Or just use PaulR's shift/AND since that's also just 2 instructions, although with a non-trivial constant that would have to get loaded from .rodata. – Smriti 11/6, 2024 at 19:36
