What I want to do is:
- Multiply the input floating-point numbers by a fixed factor.
- Convert them to 8-bit signed char.
Note that most of the inputs have a small absolute range of values, like [-6, 6], so that the fixed factor can map them to [-127, 127].
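As a scalar reference for the two steps above, here is a minimal sketch. The helper names QuantMultFor and QuantizeScalar are hypothetical, and the sketch assumes that round-to-nearest followed by clamping to [-127, 127] is the intended behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: factor that maps [-max_abs, max_abs] onto [-127, 127].
inline float QuantMultFor(float max_abs) {
  assert(max_abs > 0.0f);
  return 127.0f / max_abs;
}

// Scalar reference: scale, round to nearest (current rounding mode),
// then clamp to [-127, 127] before narrowing to 8 bits.
inline void QuantizeScalar(const float* input, std::int8_t* output,
                           float quant_mult, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) {
    float scaled = std::nearbyint(input[i] * quant_mult);
    scaled = std::min(127.0f, std::max(-127.0f, scaled));
    output[i] = static_cast<std::int8_t>(scaled);
  }
}
```

A scalar version like this can serve as a baseline for checking any vectorized implementation, apart from the clamp bound: the pack instructions saturate to -128 rather than -127.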
I am restricted to the AVX2 instruction set, so intrinsics like _mm256_cvtepi32_epi8 (which requires AVX-512) can't be used. I would like to use _mm256_packs_epi16, but it mixes two inputs together. :(
I also wrote some code that converts 32-bit floats to 16-bit ints, and it works exactly as I want.
    #include <cassert>
    #include <immintrin.h>

    // input is actually a matrix; num_rows and width are its numbers of rows and columns.
    void Quantize(const float* input, __m256i* output, float quant_mult, int num_rows, int width) {
      assert(width % 16 == 0);
      int num_input_chunks = width / 16;
      // Broadcast the scale factor to all 8 float lanes.
      __m256 avx2_quant_mult = _mm256_set1_ps(quant_mult);
      for (int i = 0; i < num_rows; ++i) {
        const float* input_row = input + i * width;
        __m256i* output_row = output + i * num_input_chunks;
        for (int j = 0; j < num_input_chunks; ++j) {
          const float* x = input_row + j * 16;
          // Process 16 floats at once, since each __m256i can contain 16 16-bit integers.
          __m256 f_0 = _mm256_loadu_ps(x);
          __m256 f_1 = _mm256_loadu_ps(x + 8);
          __m256 m_0 = _mm256_mul_ps(f_0, avx2_quant_mult);
          __m256 m_1 = _mm256_mul_ps(f_1, avx2_quant_mult);
          // Convert to 32-bit ints using the current rounding mode.
          __m256i i_0 = _mm256_cvtps_epi32(m_0);
          __m256i i_1 = _mm256_cvtps_epi32(m_1);
          // Saturating pack to int16; note that it operates within each 128-bit lane.
          *(output_row + j) = _mm256_packs_epi32(i_0, i_1);
        }
      }
    }
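One detail of the 16-bit version that may matter to readers: _mm256_packs_epi32 packs within each 128-bit lane, so the 16 results are stored as x0..x3, x8..x11, x4..x7, x12..x15 rather than in source order. If source order is needed, the following sketch adds one 64-bit permute. The function name Quantize16InOrder is hypothetical, and the target attribute is a GCC/Clang extension assumed here so the snippet compiles without -mavx2.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: converts 16 floats to 16 int16_t values in source order.
#if defined(__GNUC__)
__attribute__((target("avx2")))  // assumption: GCC/Clang; other compilers use their own AVX2 switch
#endif
void Quantize16InOrder(const float* x, std::int16_t* out, float quant_mult) {
  assert(x != nullptr && out != nullptr);
  const __m256 mult = _mm256_set1_ps(quant_mult);
  const __m256i i_0 = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x), mult));
  const __m256i i_1 = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 8), mult));
  // After the in-lane pack: [x0..x3, x8..x11 | x4..x7, x12..x15].
  const __m256i packed = _mm256_packs_epi32(i_0, i_1);
  // Select 64-bit quarters in the order 0, 2, 1, 3 (imm8 = 0xD8) to restore x0..x15.
  const __m256i ordered = _mm256_permute4x64_epi64(packed, 0xD8);
  _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), ordered);
}
```

Whether the extra permute is worth it depends on the consumer of the data: if the next stage only needs a consistent layout, the interleaved order can be left as is.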
Any help is welcome, thank you so much!
_mm256_shuffle_epi8. Otherwise use pack(same, same), or better, pack 4 vectors of floats down to 1 vector of int8_t in multiple steps: 2x epi32 and 1x epi16 (and then fix the in-lane ordering with a single vpermq). See "SSE - AVX conversion from double to char" for an example using 128-bit epi32 -> epi8. – Curling
_mm256_packs_epi16, so no exact duplicates of this exist. – Curling
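The multi-step pack suggested in the comment (2x epi32, then 1x epi16, then an ordering fix) could look roughly like the sketch below. The function name QuantizeToInt8 and its signature are hypothetical, and the target attribute is a GCC/Clang extension assumed here so the snippet compiles without -mavx2. After packing four vectors, the out-of-order groups are 32 bits wide, so this sketch uses a dword permute (_mm256_permutevar8x32_epi32) for the final fix; that choice is an assumption of the sketch, not something stated in the comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: scales `count` floats and stores them as saturated int8_t
// in source order. Requires count to be a multiple of 32.
#if defined(__GNUC__)
__attribute__((target("avx2")))  // assumption: GCC/Clang; other compilers use their own AVX2 switch
#endif
void QuantizeToInt8(const float* input, std::int8_t* output,
                    float quant_mult, std::size_t count) {
  assert(count % 32 == 0);
  const __m256 mult = _mm256_set1_ps(quant_mult);
  // Dword indices that undo the in-lane interleaving of the two pack steps.
  const __m256i fix_order = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
  for (std::size_t i = 0; i < count; i += 32) {
    const float* x = input + i;
    // Scale and convert 4 x 8 floats to 32-bit ints (current rounding mode).
    const __m256i a = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x), mult));
    const __m256i b = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 8), mult));
    const __m256i c = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 16), mult));
    const __m256i d = _mm256_cvtps_epi32(_mm256_mul_ps(_mm256_loadu_ps(x + 24), mult));
    // 32 -> 16 bits with signed saturation, per 128-bit lane.
    const __m256i ab = _mm256_packs_epi32(a, b);
    const __m256i cd = _mm256_packs_epi32(c, d);
    // 16 -> 8 bits with signed saturation, per 128-bit lane.
    // Dword layout now: [a0-3, b0-3, c0-3, d0-3 | a4-7, b4-7, c4-7, d4-7].
    const __m256i abcd = _mm256_packs_epi16(ab, cd);
    // Reorder dwords to [a0-3, a4-7, b0-3, b4-7, c0-3, c4-7, d0-3, d4-7].
    const __m256i ordered = _mm256_permutevar8x32_epi32(abcd, fix_order);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(output + i), ordered);
  }
}
```

Note that the two pack steps saturate to [-128, 127], so a scaled value below -127 can come out as -128; if the strict [-127, 127] range matters, an extra clamp (for example with _mm256_max_epi8) would be needed.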