How to convert uint32 to uint8 using SIMD, but not AVX-512?

Say there are a lot of uint32s stored in aligned memory uint32 *p; how do I convert them to uint8s with SIMD?

I see there is _mm256_cvtepi32_epi8 / vpmovdb, but it belongs to AVX-512 and my CPU doesn't support it 😢
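
For reference, a minimal sketch of what that single-instruction AVX-512 route would look like (it needs AVX512F + AVX512VL, so it is not usable here); the function name is made up for illustration. vpmovdb truncates each 32-bit lane to its low byte, which matches the semantics clarified in the comments below.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hypothetical illustration only: requires AVX512F + AVX512VL.
void convertWithAvx512( const uint32_t* source, uint8_t* dest, size_t count )
{
    size_t i = 0;
    // 8 values per iteration: vpmovdb truncates each 32-bit lane to its low byte
    for( ; i + 8 <= count; i += 8 )
    {
        const __m256i v = _mm256_loadu_si256( ( const __m256i* )( source + i ) );
        const __m128i b = _mm256_cvtepi32_epi8( v );   // 8 valid bytes in the low half
        _mm_storel_epi64( ( __m128i* )( dest + i ), b );
    }
    for( ; i < count; i++ )
        dest[ i ] = (uint8_t)source[ i ];
}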

Aryanize answered 7/9, 2020 at 9:14 Comment(5)
How exactly do you want to convert them? With saturation or truncation? What is the range of the 32-bit values? – Euryale
Truncate them to 255 – Aryanize
You might be best starting with vpshufb. All the vpack... instructions treat their input as signed, even if they do unsigned saturation of the output (like vpackusdw), so 0xFFFFFFFF would signed-saturate to 0 (-1 to 0) rather than to 0xFFFF (UINT_MAX -> USHORT_MAX). – Clonus
> Truncate them to 255 -- This does not clarify things. What should be the result of converting the value 256? – Euryale
Well, I mean just pick the lowest 8 bits; 0x87654321 shall become 0x21. – Aryanize
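
For clarity, a trivial scalar sketch of the truncation being asked for (no SIMD, just keep the low byte of each value):

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t v = 0x87654321u;
    const uint8_t  b = (uint8_t)v;   // keep only the lowest 8 bits
    printf( "0x%08X -> 0x%02X\n", (unsigned)v, (unsigned)b );   // prints 0x87654321 -> 0x21
}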

If you really have a lot of them, I would do something like this (untested).

The main loop reads 64 bytes per iteration, containing 16 uint32_t values, shuffles the bytes around to implement the truncation, merges the result into a single register, and writes 16 bytes with a vector store instruction.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void convertToBytes( const uint32_t* source, uint8_t* dest, size_t count )
{
    // 4 bytes of the shuffle mask to fetch bytes 0, 4, 8 and 12 from a 16-byte source vector
    constexpr int shuffleScalar = 0x0C080400;
    // Mask to shuffle first 8 values of the batch, making first 8 bytes of the result
    const __m256i shuffMaskLow = _mm256_setr_epi32( shuffleScalar, -1, -1, -1, -1, shuffleScalar, -1, -1 );
    // Mask to shuffle last 8 values of the batch, making last 8 bytes of the result
    const __m256i shuffMaskHigh = _mm256_setr_epi32( -1, -1, shuffleScalar, -1, -1, -1, -1, shuffleScalar );
    // Indices for the final _mm256_permutevar8x32_epi32
    const __m256i finalPermute = _mm256_setr_epi32( 0, 5, 2, 7, 0, 5, 2, 7 );

    const uint32_t* const sourceEnd = source + count;
    // Vectorized portion, each iteration handles 16 values.
    // Round down the count making it a multiple of 16.
    const size_t countRounded = count & ~( (size_t)15 );
    const uint32_t* const sourceEndAligned = source + countRounded;
    while( source < sourceEndAligned )
    {
        // Load 16 inputs into 2 vector registers
        const __m256i s1 = _mm256_load_si256( ( const __m256i* )source );
        const __m256i s2 = _mm256_load_si256( ( const __m256i* )( source + 8 ) );
        source += 16;
        // Shuffle bytes into correct positions; this zeroes out the rest of the bytes.
        const __m256i low = _mm256_shuffle_epi8( s1, shuffMaskLow );
        const __m256i high = _mm256_shuffle_epi8( s2, shuffMaskHigh );
        // Unused bytes were zeroed out, using bitwise OR to merge, very fast.
        const __m256i res32 = _mm256_or_si256( low, high );
        // Final shuffle of the 32-bit values into correct positions
        const __m256i res16 = _mm256_permutevar8x32_epi32( res32, finalPermute );
        // Store lower 16 bytes of the result
        _mm_storeu_si128( ( __m128i* )dest, _mm256_castsi256_si128( res16 ) );
        dest += 16;
    }

    // Deal with the remainder
    while( source < sourceEnd )
    {
        *dest = (uint8_t)( *source );
        source++;
        dest++;
    }
}
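
A minimal usage sketch, assuming a 64-byte-aligned source buffer (the aligned loads above need at least 32-byte alignment); the allocation size is rounded up to a multiple of 64 to satisfy aligned_alloc (C++17), and the values chosen here are just for illustration:

#include <cstdint>
#include <cstdlib>
#include <cstdio>

int main()
{
    const size_t count = 1000;   // deliberately not a multiple of 16, to exercise the scalar tail
    const size_t bytes = ( count * sizeof( uint32_t ) + 63 ) & ~size_t( 63 );
    uint32_t* src = (uint32_t*)aligned_alloc( 64, bytes );
    uint8_t* dst = (uint8_t*)malloc( count );
    for( size_t i = 0; i < count; i++ )
        src[ i ] = 0x87654300u + (uint32_t)i;
    convertToBytes( src, dst, count );
    printf( "%02X %02X\n", (unsigned)dst[ 0 ], (unsigned)dst[ 999 ] );   // low bytes of the first and last inputs
    free( src );
    free( dst );
    return 0;
}
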
Ionogen answered 7/9, 2020 at 13:27 Comment(7)
If you arrange your epi8 shuffles correctly, you should be able to do the final res16 32->16 byte shuffle with one vpermd (or maybe even vpermq), rather than vextracti128 + vpor. Unless you're tuning for Zen1 (where lane-extract is very cheap), just 1 shuffle is better than shuffle+or. – Clonus
Hmm, another alternative would be differently-aligned loads to feed a byte-blend + vpshufb + vpermd. IDK if that's any better, although Skylake runs vpblendvb as 2 uops for any ALU port. With a 64-byte aligned source, you can arrange it so none of the loads are cache-line splits. – Clonus
@PeterCordes I wouldn’t mess with loads. The only reason sequential RAM loads are fast is the prefetcher in CPUs; dense aligned sequential access is the best case for that piece of hardware. Once you start introducing offsets, you’re at the mercy of the implementation, which may or may not do a good job performance-wise. – Ionogen
Interesting point, that might possibly throw off L1d prefetching. But the main prefetchers are in L2 and they only see the stream of requests from L1 for full cache lines. I'd guess even L1d prefetch would probably still be fine; you have an unrolled loop where each load sees an offset of 64 bytes since the last iteration, and the fact that the loads are offset from each other by 31 bytes is not AFAIK significant. I think there was another Q&A where someone implemented a similar alternating pair of slightly overlapping loads + blends for a similar problem with good results. – Clonus
How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i) uses 4 shuffles per 4 input vectors to make a __m256i, vs. this using 3 shuffles per 2 input vectors. The 2x vpackssdw + vpackuswb + vpermd strategy seems better than this, if you have lots of data (see the sketch after these comments). – Clonus
@PeterCordes Yeah, but that strategy would need 4 extra bitwise instructions to zero the high 3 bytes in each integer, to work around the saturation of those packing instructions. And on latest-gen Intel CPUs, vpshufb is faster than vpackssdw: 2x the throughput, 1/3 the latency. I don’t think it’s a clear win, but on new AMD CPUs packing is probably faster: packing is the same speed as vpshufb, and bitwise-op throughput is higher, 3-4 instructions/clock. – Ionogen
Oh right, in the FP conversion question, the floats were supposed to be in a 0..255 value range. This one doesn't specify, so some use-cases probably need truncation. If you actually wanted saturation, vpackssdw / vpackuswb is nice. Good point about shuffle throughputs on Ice Lake and later; ironically I was just commenting about vshufpd vs. vpermilpd imm8 on another of your answers. – Clonus
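
A rough sketch of the pack-based alternative discussed in the comments above — untested, assuming truncation is wanted, so each dword is masked down to its low byte first to sidestep the signed-saturation issue; the helper name is made up. It narrows 32 values (four __m256i) into one __m256i with 2x vpackssdw + vpackuswb + one vpermd:

#include <immintrin.h>

// Hypothetical helper: truncate 4 vectors of 8 uint32 each into 32 bytes.
inline __m256i truncatePack32( __m256i a, __m256i b, __m256i c, __m256i d )
{
    const __m256i byteMask = _mm256_set1_epi32( 0xFF );
    // Clear the high 3 bytes of every dword so the saturating packs can't alter the values
    a = _mm256_and_si256( a, byteMask );
    b = _mm256_and_si256( b, byteMask );
    c = _mm256_and_si256( c, byteMask );
    d = _mm256_and_si256( d, byteMask );
    // 32 -> 16 bit packs (vpackssdw), then 16 -> 8 bit pack (vpackuswb)
    const __m256i ab = _mm256_packs_epi32( a, b );
    const __m256i cd = _mm256_packs_epi32( c, d );
    const __m256i abcd = _mm256_packus_epi16( ab, cd );
    // The in-lane packs leave the 32-bit groups interleaved across lanes; one vpermd restores the order
    const __m256i fixOrder = _mm256_setr_epi32( 0, 4, 1, 5, 2, 6, 3, 7 );
    return _mm256_permutevar8x32_epi32( abcd, fixOrder );
}

The caller would loop in steps of 32 values, store the full 32-byte result per iteration, and finish with a scalar tail like the one in the answer.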
