SSE Instructions: Byte+Short

Asked 17/5, 2012 at 14:2 Answered 29/6, 2016 at 13:17

I have very long byte arrays that need to be added to a destination array of type short (or int). Is there an SSE instruction, or a set of instructions, that does this?

Laurinda answered 17/5, 2012 at 14:2 Comment(0)

You need to unpack each vector of 8 bit values to two vectors of 16 bit values and then add those.

__m128i v = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
__m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
__m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }

where v is a vector of 16 x 8 bit values and vl, vh are the two unpacked vectors of 8 x 16 bit values.

Note that I'm assuming that the 8 bit values are unsigned so when unpacking to 16 bits the high byte is set to 0 (i.e. no sign extension).
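To connect this back to the question, here is a minimal sketch of adding a byte array into a 16 bit destination array using the unpack approach. The function name add_u8_to_u16 and the restriction that n is a multiple of 16 are assumptions made for the example; unaligned loads and stores are used so the arrays need no particular alignment.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements.
   Assumes unsigned bytes and n a multiple of 16. */
static void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 16 == 0);
    const __m128i zero = _mm_setzero_si128();

    for (size_t i = 0; i < n; i += 16)
    {
        /* load 16 bytes and zero-extend them to 2 x 8 16 bit values */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);  /* src[i+0 .. i+7]  */
        __m128i vh = _mm_unpackhi_epi8(v, zero);  /* src[i+8 .. i+15] */

        /* load the matching 16 destination values, add, store back */
        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, vh));
    }
}
```

Note that _mm_add_epi16 wraps around on overflow, so the destination values are assumed to stay within 16 bits.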

If you want to sum a lot of these vectors and get a 32 bit result then a useful trick is to use _mm_madd_epi16 with a multiplier of 1, e.g.

__m128i vsuml = _mm_set1_epi32(0);
__m128i vsumh = _mm_set1_epi32(0);
__m128i vsum;
int sum;

for (int i = 0; i < N; i += 16)
{
    __m128i v = _mm_load_si128((const __m128i *)&x[i]); // x is a byte array and must be 16 byte aligned
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
    vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
    vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
}
// do horizontal sum of 4 partial sums and store in scalar int
vsum = _mm_add_epi32(vsuml, vsumh);
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
sum = _mm_cvtsi128_si32(vsum);
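For reference, the fragment above could be wrapped into a self-contained function along the following lines. This is only a sketch: the name sum_u8 is made up for the example, it uses unaligned loads instead of _mm_load_si128, it adds a scalar loop for the last n % 16 bytes, and it assumes the total fits in a 32 bit int.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of n unsigned bytes as a 32 bit int. */
static int32_t sum_u8(const uint8_t *x, size_t n)
{
    assert(x != NULL || n == 0);
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsuml = zero;
    __m128i vsumh = zero;
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(x + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);
        __m128i vh = _mm_unpackhi_epi8(v, zero);
        /* madd with 1 adds adjacent 16 bit pairs into 32 bit lanes */
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, ones));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, ones));
    }

    /* horizontal sum of the 4 partial sums */
    __m128i vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    int32_t sum = _mm_cvtsi128_si32(vsum);

    /* scalar tail for the remaining n % 16 bytes */
    for (; i < n; i++)
        sum += x[i];

    return sum;
}
```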

Kriegspiel answered 17/5, 2012 at 14:18 Comment(4)

Pardon my ignorance but are you sure this is correct? This vsum = _mm_madd_epi16(vh, _mm_set1_epi16(1)); would erase the previous value of vsum. – Slovenly 15/2, 2015 at 20:52

@Alexandros: you're right, and I see at least one other mistake in there too - I guess I must have been in a hurry when I wrote this answer - I'll fix the code soon, but I'm travelling at present. – Kriegspiel 15/2, 2015 at 20:55

Thanks Paul, no hurry. You have helped me a lot in the past, so any time you can, fix it. Have a nice trip!! – Slovenly 15/2, 2015 at 21:1

Actually it wasn't hard to fix - I can't test it right now but the code should at least be close to a working solution now. – Kriegspiel 15/2, 2015 at 21:1

If you need to sign-extend your byte vectors instead of zero-extending them, use the SSE4.1 instruction pmovsxbw (_mm_cvtepi8_epi16). Unlike the unpack hi/lo instructions, pmovsx can only read from the low half/quarter/eighth of a source register.

pmovsx can take its source directly from memory, although the intrinsics make this really clumsy to express. Since shuffle throughput is more limited than load throughput on most CPUs, it's probably preferable to do two load+pmovsx than to do one load + three shuffles.
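As a rough illustration of the two load + pmovsx approach with intrinsics, the sketch below adds signed bytes into a 16 bit destination. The function name add_i8_to_i16 is hypothetical, n is assumed to be a multiple of 16, and the GCC/Clang target attribute is only there so the example builds without -msse4.1. Whether the compiler folds each 8 byte load into the pmovsxbw memory operand is up to the compiler.

```c
#include <assert.h>
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] with sign extension.
   Assumes n is a multiple of 16 and a CPU with SSE4.1. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    assert(n % 16 == 0);

    for (size_t i = 0; i < n; i += 16)
    {
        /* two 8 byte loads, each sign-extended to 8 x 16 bit */
        __m128i lo = _mm_cvtepi8_epi16(
            _mm_loadl_epi64((const __m128i *)(src + i)));
        __m128i hi = _mm_cvtepi8_epi16(
            _mm_loadl_epi64((const __m128i *)(src + i + 8)));

        __m128i dl = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i dh = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(dl, lo));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(dh, hi));
    }
}
```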

Crews answered 29/6, 2016 at 13:17 Comment(0)
