How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i)
What I want to do is:

  1. Multiply the input floating point number by a fixed factor.
  2. Convert them to 8-bit signed char.

Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].

I can only target the AVX2 instruction set, so intrinsic functions like _mm256_cvtepi32_epi8 (AVX-512) can't be used. I would like to use _mm256_packs_epi16, but it interleaves its two inputs. :(

I also wrote some code that converts 32-bit float to 16-bit int, and it works exactly as I want.

void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
  // input is a matrix, actually; num_rows and width are the number of rows and columns of the matrix
  assert(width % 16 == 0);

  int num_input_chunks = width / 16;

  __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);

  for (int i = 0; i < num_rows; ++i) {
    const float* input_row = input + i * width;
    __m256i* output_row = output + i * num_input_chunks;
    for (int j = 0; j < num_input_chunks; ++j) {
      const float* x = input_row + j * 16;
      // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.

      __m256 f_0 = _mm256_loadu_ps(x);
      __m256 f_1 = _mm256_loadu_ps(x + 8);

      __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
      __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);

      __m256i i_0 = _mm256_cvtps_epi32(m_0);
      __m256i i_1 = _mm256_cvtps_epi32(m_1);

      *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
    }
  }
}

Any help is welcome, thank you so much!

Oleaceous answered 10/8, 2018 at 3:54 Comment(4)
Is truncation ok? Use _mm256_shuffle_epi8. Otherwise use pack(same,same), or better pack 4 vectors of floats down to 1 vector of int8_t in multiple steps: 2x epi32 and 1x epi16. (and then fix the in-lane ordering with a single vpermq). See SSE - AVX conversion from double to char for an example using 128-bit epi32 -> epi8Curling
Lane-crossing correction is similar to the float->int16 case: How can I convert a vector of float to short int using avx instructions?. Bizarre, there are no hits on SO (other than this) for _mm256_packs_epi16, so no exact duplicates of this exist.Curling
@PeterCordes Truncation is fine. BTW, can you tell me which solution is the fastest? Is throughput the absolute standard? thx!Oleaceous
You have multiple vectors of input floats, so 2x vpackssdw + 1x vpacksswb + 1x vpermd to produce 1 wide vector from 4 input vectors is better than 4x vpshufb + 4x vpermd + 4x stores.Curling
For good throughput with multiple source vectors, it's a good thing that _mm256_packs_epi16 has 2 input vectors instead of producing a narrower output. (AVX512 _mm256_cvtepi32_epi8 isn't necessarily the most efficient way to do things, because the version with a memory destination decodes to multiple uops, or the regular version gives you multiple small outputs that need to be stored separately.)

Or are you complaining about how it operates in-lane? Yes that's annoying, but _mm256_packs_epi32 does the same thing. If it's ok for your outputs to have interleaved groups of data there, do the same thing for this, too.

Your best bet is to combine 4 vectors down to 1, in 2 steps of in-lane packing (because there's no lane-crossing pack). Then use one lane-crossing shuffle to fix it up.

#include <immintrin.h>
// loads 128 bytes = 32 floats
// converts and packs with signed saturation to 32 int8_t
__m256i pack_float_int8(const float*p) {
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(p));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(p+8));
    __m256i c = _mm256_cvtps_epi32(_mm256_loadu_ps(p+16));
    __m256i d = _mm256_cvtps_epi32(_mm256_loadu_ps(p+24));
    __m256i ab = _mm256_packs_epi32(a,b);        // 16x int16_t
    __m256i cd = _mm256_packs_epi32(c,d);
    __m256i abcd = _mm256_packs_epi16(ab, cd);   // 32x int8_t
    // packed to one vector, but in [ a_lo, b_lo, c_lo, d_lo | a_hi, b_hi, c_hi, d_hi ] order
    // if you can deal with that in-memory format (e.g. for later in-lane unpack), great, you're done

    // but if you need sequential order, then vpermd:
    __m256i lanefix = _mm256_permutevar8x32_epi32(abcd, _mm256_setr_epi32(0,4, 1,5, 2,6, 3,7));
    return lanefix;
}

(Compiles nicely on the Godbolt compiler explorer).

Call this in a loop and _mm256_store_si256 the resulting vector.


(For uint8_t unsigned destination, use _mm256_packus_epi16 for the 16->8 step and keep everything else the same. We still use signed 32->16 packing, because 16 -> u8 vpackuswb packing still takes its epi16 input as signed. You need -1 to be treated as -1, not +0xFFFF, for unsigned saturation to clamp it to 0.)


With 4 total shuffles per 256-bit store, 1 shuffle per clock throughput will be the bottleneck on Intel CPUs. You should get a throughput of one float vector per clock, bottlenecked on port 5. (https://agner.org/optimize/). Or maybe bottlenecked on memory bandwidth if data isn't hot in L2.


If you only have a single vector to do, you could consider using _mm256_shuffle_epi8 to put the low byte of each epi32 element into the low 32 bits of each lane, then _mm256_permutevar8x32_epi32 for lane-crossing.

Another single-vector alternative (good on Ryzen) is extracti128 + 128-bit packssdw + packsswb. But that's still only good if you're just doing a single vector. (Still on Ryzen, you'll want to work in 128-bit vectors to avoid extra lane-crossing shuffles, because Ryzen splits every 256-bit instruction into (at least) 2 128-bit uops.)

Curling answered 10/8, 2018 at 4:55 Comment(5)
Works perfectly. Related info and detailed explanation are very helpful. Thank you very much.Oleaceous
Peter, What would you do for uint16 instead of uint8?Syncretism
@Royi: Replace _mm256_packs_epi32 with _mm256_packus_epi32, and stop after that step. Seems pretty obvious.Curling
Yep, I figure it out and that's what I did. Thank You.Syncretism
This is a full usage I did to the code - codereview.stackexchange.com/a/219207/7723.Syncretism
Please check the IEEE 754 standard format used to store float values. First understand how float and double are laid out in memory; then you will know how to convert a float or double to a char. It is quite simple.

Feltonfelts answered 10/8, 2018 at 7:16 Comment(4)
It's a question about SIMD and AVX.Goatskin
x86 has a machine instruction to convert float to integer (in fact it has multiple, for scalar vs. packed, and legacy x87). Doing it yourself with bit-manipulation would be slower than my answer, which converts 8 floats per core clock cycle on Haswell or Skylake. IDK if you're talking about printing a float to a decimal string, but this question is about converting them to int8_t. For converting to a decimal string, yes you normally do want to pick apart the exponent and significand.Curling
I don't know of such an instruction; I just started to learn these things. That's why, according to my knowledge (at this stage), I posted this answer, and I guarantee that it will definitely work.Feltonfelts
When you posted it, the OP had already replied to my answer to confirm that it definitely works. If you're just learning, I suggest you read my answer and follow the links in it (including to Intel's intrinsics guide), and only post your own if you're confident yours is an improvement. And look at stackoverflow.com/tags/sse/info for some intro-to-SIMD stuff.Curling

© 2022 - 2024 — McMap. All rights reserved.