Deinterleave a vector of nibbles using SIMD
I have an input vector of 16384 signed 4-bit integers, packed into 8192 bytes. I need to deinterleave the values and unpack them into signed 8-bit integers in two separate arrays.

a, b, c, d are 4-bit values.
A, B, C, D are 8-bit values.

Input = [ab,cd,...]
Out_1 = [A,C, ...]
Out_2 = [B,D, ...]

I can do this quite easily in C++.

constexpr size_t size = 8192; // 8192 packed bytes = 16384 signed 4-bit values
int8_t input[size]; // raw packed 4-bit integers, two per byte
int8_t out_1[size]; // sign-extended low nibbles
int8_t out_2[size]; // sign-extended high nibbles

for (size_t i = 0; i < size; i++) {
    out_1[i] = input[i] << 4;  // move the low nibble to the top
    out_1[i] = out_1[i] >> 4;  // arithmetic shift back down to sign-extend
    out_2[i] = input[i] >> 4;  // arithmetic shift sign-extends the high nibble
}

I would like this to run as fast as possible on general-purpose processors. Good SIMD implementations of deinterleaving 8-bit into 16-bit integers exist, such as in VOLK, but I cannot find even basic byte-wise SIMD shift operations.

https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_8ic_deinterleave_16i_x2.h#L63

Thanks!

Plumley asked 31/7, 2020 at 22:55 — Comments (5):
This may be interesting: https://mcmap.net/q/22918/-avx-4-bit-integers — Castroprauxel
x86 SIMD doesn't have byte shifts; you have to emulate them via 16-bit or wider shifts and mask away the bits that came into the top of each byte (SSE/SIMD shift with one-byte element size / granularity?) — see the sketch after these comments. This kind of sucks when you want arithmetic shifts to sign-extend; perhaps a different trick to get the high bits set could work for a fixed shift count. Like maybe xor with 0xf8 to set the high bits and flip the 4th bit, then paddb with 0x08 will correct bit 4 and either carry out and clear the high bits, or leave them set. — Amanda
Wait, your question says you need signed, but your C++ used uint8_t for everything, not int8_t. Unsigned is much easier: just shift and mask. (Shifting twice for the low half is inefficient even if you have byte shifts; AND with _mm_set1_epi8(0x0f).) — Amanda
Updated SSE/SIMD shift with one-byte element size / granularity? with that idea: 4 uops (for Intel) is better than previous emulations of the non-existent psrab (_mm_srai_epi8). — Amanda
With uint8_t input, your out_2 results are still broken (zero-extended, not sign-extended). You could make it an int8_t*, or cast like ((int8_t)input[i]) >> 4. That does actually auto-vectorize, fairly well with clang, fairly poorly with GCC: godbolt.org/z/zYhff7 — Amanda
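A minimal sketch of the byte-shift emulation described in these comments, assuming SSE2; the helper names are made up for illustration:

#include <emmintrin.h>

// SSE2 has no psrlb: emulate a byte-granularity logical right shift by 4
// with a 16-bit shift, then mask off the bits that crossed byte boundaries.
inline __m128i srli_epi8_by4(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
}

// The arithmetic variant (the non-existent psrab / _mm_srai_epi8):
// sign-extend the now zero-extended nibble via the (x ^ 8) - 8 identity.
inline __m128i srai_epi8_by4(__m128i v)
{
    __m128i x = srli_epi8_by4(v);
    x = _mm_xor_si128(x, _mm_set1_epi8(0x08));
    return _mm_sub_epi8(x, _mm_set1_epi8(0x08));
}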

Here is an example. Your question contained code that used unsigned operations, but the text asked about signed, so I was not sure which you wanted. If unsigned is what you want, just remove the part that implements sign extension.

// Requires SSE2 (<emmintrin.h>); input, out_1, out_2 and size are as in the question
const __m128i mm_mask = _mm_set1_epi32(0x0F0F0F0F);       // keeps the low nibble of each byte
const __m128i mm_signed_max = _mm_set1_epi32(0x07070707); // largest non-negative signed 4-bit value

for (size_t i = 0u, n = size / 16u; i < n; ++i)
{
    // Load and deinterleave input half-bytes
    __m128i mm_input_even = _mm_loadu_si128(reinterpret_cast< const __m128i* >(input) + i);
    __m128i mm_input_odd = _mm_srli_epi32(mm_input_even, 4);

    mm_input_even = _mm_and_si128(mm_input_even, mm_mask);
    mm_input_odd = _mm_and_si128(mm_input_odd, mm_mask);

    // If you need sign extension, you need the following
    // Get the sign bits
    __m128i mm_sign_even = _mm_cmpgt_epi8(mm_input_even, mm_signed_max);
    __m128i mm_sign_odd = _mm_cmpgt_epi8(mm_input_odd, mm_signed_max);

    // Combine sign bits with deinterleaved input
    mm_input_even = _mm_or_si128(mm_input_even, _mm_andnot_si128(mm_mask, mm_sign_even));
    mm_input_odd = _mm_or_si128(mm_input_odd, _mm_andnot_si128(mm_mask, mm_sign_odd));

    // Store the results
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_1) + i, mm_input_even);
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_2) + i, mm_input_odd);
}

If your size is not a multiple of 16, you will also need to handle the tail bytes; you could use your non-vectorized code for that.
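For example, a scalar tail loop along these lines (assuming the size and buffers from the question) could run after the vectorized loop:

// Hypothetical tail handling: round size down to a multiple of 16,
// then process the remaining bytes with the scalar code from the question.
for (size_t i = size & ~static_cast< size_t >(15); i < size; ++i)
{
    out_1[i] = static_cast< int8_t >(input[i] << 4) >> 4; // sign-extend the low nibble
    out_2[i] = static_cast< int8_t >(input[i]) >> 4;      // arithmetic shift sign-extends the high nibble
}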

Note that in the code above you don't need byte-granular shifts, since you have to apply the mask anyway; any coarser-grained shift will do here.

Tingley answered 1/8, 2020 at 9:28 — Comments (7):
Style note: I find mm_ in your var names makes it harder to read; too many mms everywhere. When I want to distinguish between a scalar and a vector with similar names, I often put a v as the first letter of the var name, like vsign. But here I'd just use __m128i input_even = ... — Amanda
See SSE/SIMD shift with one-byte element size / granularity? for a more efficient way to sign-extend the low 4 bits into a byte in 2 instructions, using pxor / paddb with set1(0x08) (with the upper 4 already zeroed). Should be more efficient than pcmpgtb / pandn / por. Hmm, especially if we do the pxor before separating into high/low halves! We can pxor with set1_epi8(0x88) before shifting; see the sketch after these comments. — Amanda
Some versions using that trick, or vpshufb, including one which auto-vectorizes nicely that way with clang: godbolt.org/z/sGafhr. Unfortunate GCC missed-optimizations make the pshufb version less attractive (doing an extra load of the same data) if you care about GCC. Will post as an answer at some point. — Amanda
@PeterCordes The pxor/paddb trick is nice. I don't think moving the pxor before deinterleaving will be faster, because it just moves the same latency within the dependency chain. The two pxors at a later point would be executed in parallel, so they are effectively equivalent to the one pxor before the deinterleave. But the one pxor will probably consume slightly less power and definitely less memory. — Tingley
Re the code on Godbolt: note that you have loop unrolling enabled for clang but not gcc. Also, if you find inefficiencies with gcc, please report them upstream (gcc.gnu.org/bugzilla). — Tingley
This bottlenecks on front-end throughput, not latency! Saving front-end uops is how you optimize this. Each 32-byte chunk of input is independent, so the only loop-carried dependency chain is the loop counter; out-of-order exec can overlap across iterations. — Amanda
Re loop unrolling: no unrolling is the default for gcc (without PGO); unrolling is the default for clang. I had the option there because I had been looking at the clang -fno-unroll-loops output to simplify it for easy reading. — Amanda
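For completeness, a sketch of the pxor-before-deinterleave idea from these comments, assuming SSE2 and the buffers from the question. Flipping bit 3 of each nibble up front (pxor with set1_epi8(0x88)) lets one subtract per half finish the sign extension via the (x ^ 8) - 8 identity; treat this as an illustration of the comments rather than a drop-in replacement for the answer:

const __m128i flip = _mm_set1_epi8(static_cast< char >(0x88)); // flips bit 3 of each nibble
const __m128i low_mask = _mm_set1_epi8(0x0F);
const __m128i bias = _mm_set1_epi8(0x08);

for (size_t i = 0u, n = size / 16u; i < n; ++i)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast< const __m128i* >(input) + i);
    v = _mm_xor_si128(v, flip); // one pxor serves both output halves

    __m128i even = _mm_and_si128(v, low_mask);                   // low nibbles, bit 3 flipped
    __m128i odd = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask); // high nibbles, bit 3 flipped

    // (x ^ 8) - 8 sign-extends a 4-bit value sitting in a zeroed byte
    even = _mm_sub_epi8(even, bias);
    odd = _mm_sub_epi8(odd, bias);

    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_1) + i, even);
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_2) + i, odd);
}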
