Say there are a lot of uint32s stored in aligned memory (uint32 *p); how can I convert them to uint8s with SIMD?
I see there is _mm256_cvtepi32_epi8/vpmovdb, but it belongs to AVX-512 and my CPU doesn't support it 😢
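For reference, the conversion is assumed here to be plain truncation to the low byte of each value (which is what vpmovdb does); a scalar baseline, with an illustrative name, would be:

#include <stdint.h>
#include <stddef.h>

// Scalar reference: keep only the low byte of each uint32_t.
void convertToBytesScalar( const uint32_t* source, uint8_t* dest, size_t count )
{
    for( size_t i = 0; i < count; i++ )
        dest[ i ] = (uint8_t)source[ i ];
}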
If you really have a lot of them, I would do something like this (untested).
The main loop reads 64 bytes per iteration, containing 16 uint32_t values, shuffles the bytes around to implement the truncation, merges the result into a single register, and writes 16 bytes with a vector store instruction.
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )
{
    // 4 bytes of the shuffle mask to fetch bytes 0, 4, 8 and 12 from a 16-byte source vector
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle the first 8 values of the batch, making the first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle the last 8 values of the batch, making the last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into the correct positions; this zeroes out the rest of the bytes.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // The unused bytes were zeroed out, so a bitwise OR merges the two registers, very fast.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final shuffle of the 32-bit values into the correct positions
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store the lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }
    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
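Not part of the original answer, just a minimal driver sketch showing one way to call it; the buffer size and test values are arbitrary, and _mm_malloc is used to satisfy the 32-byte alignment the aligned loads assume. The count is deliberately not a multiple of 16, so the scalar tail also runs.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main()
{
    const size_t count = 1000;  // not a multiple of 16, to exercise the remainder loop
    uint32_t* src = ( uint32_t* )_mm_malloc( count * sizeof( uint32_t ), 32 );
    uint8_t* dst = ( uint8_t* )_mm_malloc( count, 16 );
    for( size_t i = 0; i < count; i++ )
        src[ i ] = (uint32_t)( i * 259 );  // arbitrary values; truncation keeps the low byte
    convertToBytes( src, dst, count );
    printf( "%u %u %u\n", (unsigned)dst[ 0 ], (unsigned)dst[ 1 ], (unsigned)dst[ 999 ] );
    _mm_free( src );
    _mm_free( dst );
    return 0;
}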
res16: do the 32->16 byte shuffle with one vpermd (or maybe even vpermq), rather than vextracti128 + vpor. Unless you're tuning for Zen1 (where lane-extract is very cheap), just 1 shuffle is better than shuffle+or. – Clonus

vpshufb + vpermd. IDK if that's any better, although Skylake runs vpblendvb as 2 uops for any ALU port. With a 64-byte aligned source, you can arrange it so none of the loads are cache-line splits. – Clonus

The 2x vpackssdw + vpackuswb + vpermd strategy seems better than this if you have lots of data: fewer shuffles per __m256i of input, vs. this using 3 shuffles per 2 input vectors. – Clonus

vpshufb is faster than vpackssdw, 2x throughput, 1/3 latency. I don't think it's a clear win, but on new AMD CPUs packing is probably faster: packing is the same speed as vpshufb, and bitwise ops throughput is higher, 3-4 instructions/clock. – Ionogen

vpackssdw / vpackuswb is nice. Good point about shuffle throughputs on Ice Lake and later; ironically I was just commenting about vshufpd vs. vpermilpd imm8 on another of your answers. – Clonus

… vpshufb. All the vpack... instructions treat their input as signed, even if they do unsigned saturation of the output (like vpackusdw), so 0xFFFFFFFF would signed-saturate to 0 (-1 to 0) rather than to 0xFFFF (UINT_MAX -> USHORT_MAX). – Clonus
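To make the pack-based alternative from the comments concrete, here is an untested sketch (my naming and structure, not from the thread) of the 2x vpackssdw + vpackuswb + vpermd strategy. The inputs are masked to their low byte first, so the signed saturation the last comment warns about can never trigger; it handles 32 values per iteration.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void convertToBytesPack( const uint32_t* source, uint8_t* dest, size_t count )
{
    const __m256i lowByte = _mm256_set1_epi32( 0xFF );
    // After the two in-lane packs, the 8 result dwords hold bytes in the order
    // a0-3, b0-3, c0-3, d0-3, a4-7, b4-7, c4-7, d4-7; this permute restores a, b, c, d order.
    const __m256i fixupPermute = _mm256_setr_epi32( 0, 4, 1, 5, 2, 6, 3, 7 );
    const size_t countRounded = count & ~( (size_t)31 );
    size_t i = 0;
    for( ; i < countRounded; i += 32 )
    {
        // Load 32 inputs and mask each value to its low byte (truncation, not saturation)
        const __m256i a = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + i ) ), lowByte );
        const __m256i b = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + i + 8 ) ), lowByte );
        const __m256i c = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + i + 16 ) ), lowByte );
        const __m256i d = _mm256_and_si256( _mm256_load_si256( ( const __m256i* )( source + i + 24 ) ), lowByte );
        // vpackssdw: 32 -> 16 bits, in-lane; safe because the values are already 0..255
        const __m256i ab = _mm256_packs_epi32( a, b );
        const __m256i cd = _mm256_packs_epi32( c, d );
        // vpackuswb: 16 -> 8 bits, in-lane
        const __m256i abcd = _mm256_packus_epi16( ab, cd );
        // vpermd: undo the lane interleaving produced by the packs
        const __m256i res = _mm256_permutevar8x32_epi32( abcd, fixupPermute );
        _mm256_storeu_si256( ( __m256i* )( dest + i ), res );
    }
    // Deal with the remainder
    for( ; i < count; i++ )
        dest[ i ] = (uint8_t)source[ i ];
}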