How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)
What I want to do is:

  1. Multiply the input floating point number by a fixed factor.
  2. Convert them to 8-bit signed char.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].

I can only target the AVX2 instruction set, so intrinsic functions like _mm256_cvtepi32_epi8 (AVX-512) can't be used. I would like to use _mm256_packs_epi16, but it interleaves its two inputs. :(

I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is a matrix, actually; num_rows and width are the number of rows and columns of the matrix
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

Any help is welcome, thank you so much!

Oleaceous answered 10/8, 2018 at 3:54 Comment(4)
Is truncation ok? Use _mm256_shuffle_epi8. Otherwise use pack(same,same), or better pack 4 vectors of floats down to 1 vector of int8_t in multiple steps: 2x epi32 and 1x epi16. (and then fix the in-lane ordering with a single vpermq). See SSE - AVX conversion from double to char for an example using 128-bit epi32 -> epi8Curling
Lane-crossing correction is similar to the float->int16 case: How can I convert a vector of float to short int using avx instructions?. Bizarre, there are no hits on SO (other than this) for _mm256_packs_epi16, so no exact duplicates of this exist.Curling
@PeterCordes Truncation is fine. BTW, can you tell me which solution is the fastest? Is throughput the absolute standard? thx!Oleaceous
You have multiple vectors of input floats, so 2x vpackssdw + 1x vpacksswb + 1x vpermd to produce 1 wide vector from 4 input vectors is better than 4x vpshufb + 4x vpermd + 4x stores.Curling
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)

Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.

Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(Compiles nicely on the Godbolt compiler explorer).

Call this in a loop and _mm256_store_si256 the resulting vector.


(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)


With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.


If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.

Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)

Curling answered 10/8, 2018 at 4:55 Comment(5)
Works perfectly. Related info and detailed explanation are very helpful. Thank you very much.Oleaceous
Peter, What would you do for uint16 instead of uint8?Syncretism
@Royi: Replace _mm256_packs_epi32 with _mm256_packus_epi32, and stop after that step. Seems pretty obvious.Curling
Yep, I figure it out and that's what I did. Thank You.Syncretism
This is a full usage I did to the code - codereview.stackexchange.com/a/219207/7723.Syncretism
Please check the IEEE 754 standard format used to store float values. First understand how float and double are laid out in memory; then you will know how to convert a float or double to a char. It is quite simple.

Feltonfelts answered 10/8, 2018 at 7:16 Comment(4)
It's a question about SIMD and AVX.Goatskin
x86 has a machine instruction to convert float to integer (in fact it has multiple, for scalar vs. packed, and legacy x87). Doing it yourself with bit-manipulation would be slower than my answer, which converts 8 floats per core clock cycle on Haswell or Skylake. IDK if you're talking about printing a float to a decimal string, but this question is about converting them to int8_t. For converting to a decimal string, yes you normally do want to pick apart the exponent and significand.Curling
I don't know of such an instruction; I just started to learn these things. That's why, according to my knowledge (at this stage), I posted this answer, and I guarantee that it will definitely work.Feltonfelts
When you posted it, the OP had already replied to my answer to confirm that it definitely works. If you're just learning, I suggest you read my answer and follow the links in it (including to Intel's intrinsics guide), and only post your own if you're confident yours is an improvement. And look at stackoverflow.com/tags/sse/info for some intro-to-SIMD stuff.Curling

© 2022 - 2024 — McMap. All rights reserved.