I have an input vector of 16384 signed 4-bit integers, packed into 8192 bytes. I need to deinterleave the values and unpack them into signed 8-bit integers in two separate arrays.
a,b,c,d are 4 bit values.
A,B,C,D are 8 bit values.
Input = [ab,cd,...]
Out_1 = [A,C, ...]
Out_2 = [B,D, ...]
I can do this quite easily in C++.
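constexpr size_t size = 8192; // 16384 packed 4-bit values, two per byte
int8_t input[size]; // raw packed 4-bit integers
int8_t out_1[size];
int8_t out_2[size];
for (size_t i = 0; i < size; i++) {
    out_1[i] = input[i] << 4; // move the low nibble into the high half of the byte
    out_1[i] = out_1[i] >> 4; // arithmetic shift back down sign-extends it
    out_2[i] = input[i] >> 4; // arithmetic shift sign-extends the high nibble
}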
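To make the intended mapping concrete, here is a minimal self-contained sketch of the same scalar loop as a function. The name `unpack_nibbles_scalar` is hypothetical, and the sketch takes the input as `uint8_t` and casts before shifting. It assumes that converting to `int8_t` wraps and that `>>` on a negative value is an arithmetic shift, which C++20 guarantees and earlier standards leave implementation-defined.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference routine: low nibble -> out_1, high nibble -> out_2,
// each sign-extended from 4 bits to 8 bits.
void unpack_nibbles_scalar(const uint8_t* input, int8_t* out_1, int8_t* out_2,
                           std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // Move the low nibble to the top of the byte, reinterpret as signed,
        // then shift back down so the nibble's sign bit is replicated.
        const int8_t low_at_top = static_cast<int8_t>(static_cast<uint8_t>(input[i] << 4));
        out_1[i] = static_cast<int8_t>(low_at_top >> 4);
        // The high nibble only needs a signed shift of the original byte.
        out_2[i] = static_cast<int8_t>(static_cast<int8_t>(input[i]) >> 4);
    }
}
```

With this sketch, the input bytes 0x21, 0xF8, 0x7F should give out_1 = 1, -8, -1 and out_2 = 2, -1, 7.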
I would like to implement this to run as fast as possible on general-purpose processors. Good SIMD implementations of 8-bit deinterleaving to 16-bit integers exist, such as the one in VOLK, but I cannot find even basic bytewise SIMD shift operators.
https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_8ic_deinterleave_16i_x2.h#L63
Thanks!
xor with 0xf8 to set the high bits and flip the 4th bit, then paddb with 0x08 will correct bit 4 and either carry-out and clear the high bits, or leave them set. – Amanda
uint8_t for everything, not int8_t. Unsigned is much easier, just shift and mask. (Shifting twice for the low half is inefficient even if you have byte shifts; AND with _mm_set1_epi8(0x0f)) – Amanda
psrab (_mm_srai_epi8). – Amanda
uint8_t input, your out_2 results are still broken. (zero-extended not sign-extended.) You could make it an int8_t*, or cast it like ((int8_t)input[i]) >> 4. That does actually auto-vectorize, fairly well with clang, fairly poorly with GCC: godbolt.org/z/zYhff7 – Amanda
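The mask, xor and add steps suggested in the comments can be combined into one possible SSE2 sketch. This is an illustration under stated assumptions, not a tuned implementation: the names `unpack_nibbles` and `sign_extend_nibble` are hypothetical, the vector path is only compiled on x86 targets with SSE2, and a scalar loop handles the tail and other targets. Since SSE2 has no byte-granularity shift, the high nibble is obtained with a 16-bit shift followed by an AND with 0x0f. The xor/add step only works once the upper four bits of each byte are zero, which the AND guarantees.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: sign-extend a 4-bit value held in bits 0..3 of a byte
// whose upper four bits are zero. XOR with 0xf8 sets bits 4..7 and flips
// bit 3; adding 0x08 then either carries out and clears the high bits
// (non-negative nibble) or leaves them set (negative nibble).
static inline int8_t sign_extend_nibble(uint8_t nibble) {
    return static_cast<int8_t>(static_cast<uint8_t>((nibble ^ 0xf8u) + 0x08u));
}

// Hypothetical routine: low nibbles go to out_1, high nibbles to out_2.
void unpack_nibbles(const uint8_t* input, int8_t* out_1, int8_t* out_2,
                    std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__) || defined(_M_X64)
    const __m128i low_mask = _mm_set1_epi8(0x0f);
    const __m128i flip = _mm_set1_epi8(static_cast<char>(0xf8));
    const __m128i fix = _mm_set1_epi8(0x08);
    for (; i + 16 <= n; i += 16) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input + i));
        __m128i lo = _mm_and_si128(v, low_mask);
        // Shift 16-bit lanes by 4, then mask away bits that crossed over
        // from the neighbouring byte.
        __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask);
        lo = _mm_add_epi8(_mm_xor_si128(lo, flip), fix);
        hi = _mm_add_epi8(_mm_xor_si128(hi, flip), fix);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_1 + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_2 + i), hi);
    }
#endif
    // Scalar tail, and the only path on targets without SSE2.
    for (; i < n; ++i) {
        out_1[i] = sign_extend_nibble(static_cast<uint8_t>(input[i] & 0x0f));
        out_2[i] = sign_extend_nibble(static_cast<uint8_t>(input[i] >> 4));
    }
}
```

Calling `unpack_nibbles(input, out_1, out_2, 8192)` would process the whole buffer from the question. Whether this beats what clang auto-vectorizes from the plain cast-and-shift loop would need to be measured on the target machine.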