How can I divide 16 8-bit integers by 4 (or shift them 2 to the right) using SSE intrinsics?
Unfortunately there are no SSE shift instructions for 8 bit elements. If the elements are 8 bit unsigned then you can use a 16 bit shift and mask out the unwanted high bits, e.g.
v = _mm_srli_epi16(v, 2);
v = _mm_and_si128(v, _mm_set1_epi8(0x3f));
For 8 bit signed elements it's a little fiddlier, but still possible, although it might just be easier to unpack to 16 bits, do the shifts, then pack back to 8 bits.
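As a rough sketch of the two signed options mentioned above (not necessarily what the answer's author had in mind), the first function below reuses the shift-and-mask idea and then sign-extends the 6-bit result with an XOR/subtract, and the second widens to 16 bits, shifts, and packs back. The helper names srli2_epu8, srai2_epi8 and srai2_epi8_unpack are made up for illustration, and SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned: logical shift right by 2 in every byte (hypothetical helper). */
static inline __m128i srli2_epu8(__m128i v)
{
    /* The 16-bit shift moves two bits of each high byte into the top of the
       neighbouring low byte; the mask clears the top two bits of every byte. */
    return _mm_and_si128(_mm_srli_epi16(v, 2), _mm_set1_epi8(0x3f));
}

/* Signed: arithmetic shift right by 2 in every byte (hypothetical helper). */
static inline __m128i srai2_epi8(__m128i v)
{
    /* After the logical shift the original sign bit sits at bit 5 (0x20). */
    const __m128i sign = _mm_set1_epi8(0x20);
    __m128i t = srli2_epu8(v);
    /* (t ^ 0x20) - 0x20 sign-extends the 6-bit value back to 8 bits. */
    return _mm_sub_epi8(_mm_xor_si128(t, sign), sign);
}

/* Signed alternative: widen to 16 bits, shift, pack back (hypothetical helper). */
static inline __m128i srai2_epi8_unpack(__m128i v)
{
    /* Interleaving v with itself puts each byte in the high half of a 16-bit
       lane, so one arithmetic shift by 8 + 2 sign-extends and shifts. */
    __m128i lo = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 10);
    __m128i hi = _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 10);
    /* Results lie in [-32, 31], so the saturating pack never saturates. */
    return _mm_packs_epi16(lo, hi);
}
```

Both signed versions behave like an arithmetic shift, so negative values round toward negative infinity (-3 becomes -1), whereas C's -3 / 4 truncates to 0. If true division semantics are needed for negative inputs, an extra correction step would be required.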
#define _mm_srli_epi8(mm, Imm) _mm_and_si128(_mm_set1_epi8(0xFF >> Imm), _mm_srli_epi32(mm, Imm)) – Betimes
Imm in _mm_srli_epi32 is not a literal constant (particularly in debug builds), which can be a problem with inline functions, although you should be fine with current/recent versions of gcc, clang, ICC. – Dexamethasone
If your 8-bit integers are unsigned, you can use:
static inline __m128i divBy4(__m128i v)
{
    const __m128i zeros = _mm_set1_epi8(0);
    __m128i vHalf = _mm_avg_epu8(zeros, v); // (0 + v) / 2 = v / 2
    __m128i vQuarter = _mm_avg_epu8(zeros, vHalf); // (0 + (v/2)) / 2 = v/4
    return vQuarter;
}
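Please note that _mm_avg_epu8 rounds up, so the calculation is actually:
(0 + v + 1) / 2 = (v + 1)/2
(0 + (v + 1)/2 + 1) / 2 = (v + 3)/4
So if you want a correctly rounded result, (v + 2)/4, subtract 1 from v before the first average, using a saturating subtract (_mm_subs_epu8) so that v = 0 does not wrap around.
If you don't want rounding at all, i.e. truncating integer division, subtract 3 from v instead, again with saturation.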
In most cases, the small rounding error isn't worth another instruction. Also, you should move the zero constant out of the function.
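The rounding behaviour of the double-average approach can be checked exhaustively, since there are only 256 byte values. The sketch below is an illustration under assumptions: the function names are hypothetical, and the saturating pre-subtract (_mm_subs_epu8) is one possible way to adjust the rounding, not something the answer itself spells out.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Two averages with zero: each byte becomes (v + 3) / 4, i.e. rounded up. */
static inline __m128i div4_avg_up(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();
    return _mm_avg_epu8(zero, _mm_avg_epu8(zero, v));
}

/* Round to nearest, halves up: (v + 2) / 4. */
static inline __m128i div4_avg_nearest(__m128i v)
{
    /* The saturating subtract keeps 0 at 0 instead of wrapping to 255. */
    return div4_avg_up(_mm_subs_epu8(v, _mm_set1_epi8(1)));
}

/* Truncating division: v / 4. */
static inline __m128i div4_avg_trunc(__m128i v)
{
    return div4_avg_up(_mm_subs_epu8(v, _mm_set1_epi8(3)));
}

/* Hypothetical checker: returns 1 if all 256 byte values match the formulas. */
static int check_div4_variants(void)
{
    for (int base = 0; base < 256; base += 16) {
        uint8_t in[16], up[16], nearest[16], down[16];
        for (int i = 0; i < 16; i++)
            in[i] = (uint8_t)(base + i);
        __m128i v = _mm_loadu_si128((const __m128i *)in);
        _mm_storeu_si128((__m128i *)up, div4_avg_up(v));
        _mm_storeu_si128((__m128i *)nearest, div4_avg_nearest(v));
        _mm_storeu_si128((__m128i *)down, div4_avg_trunc(v));
        for (int i = 0; i < 16; i++) {
            int x = in[i];
            if (up[i] != (x + 3) / 4 || nearest[i] != (x + 2) / 4 || down[i] != x / 4)
                return 0;
        }
    }
    return 1;
}
```

Each adjusted variant costs one extra instruction plus a constant, which is the trade-off against simply accepting the round-up behaviour.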
static const; for some reason static const __m128i doesn't properly optimize away: a static constructor runs at startup to write to the BSS. (Normally copying from .rodata, but here hopefully would still materialize it with pxor xmm0,xmm0 which is as cheap as a NOP on Intel CPUs, and pretty cheap on AMD too.) – Smriti
set1(-1) instead of set1(0) - also cheap to materialize on the fly for compilers, with pcmpeqd xmm1,xmm1 or similar. Or just use PaulR's shift/AND since that's also just 2 instructions, although with a non-trivial constant that would have to get loaded from .rodata. – Smriti
mtune=native is causing Clang to assume AVX2 support, whereas GCC is assuming only AVX support. If you explicitly pass -mavx2 to GCC, you get much better output that resembles Clang's. Clang's, of course, doesn't change. A good lesson in why "native" doesn't make much sense for an online compiler whose system specs you don't control. :-) @richard – Planchet
vpand's (which would have been useful if it hadn't converted to words, but it did, so the only reason I can see for them is that GCC is scared of the saturating pack and couldn't figure out that it would be harmless) – Mara