Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes.
There is an intrinsic, _mm_cvtps_pi8, that will convert 32-bit floats to 8-bit signed ints, but the problem there is that any value over 127 gets clamped to 127. I can't find any instructions that will clamp to unsigned 8-bit values.
I have an intuition that what I may want to do is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8, followed by a move instruction to get the four bytes I care about into memory. Is that the best way to do it? I'm going to see if I can figure out how to encode the shuffle control mask.
UPDATE: The following appears to do exactly what I want. Is there a better way?
    #include <tmmintrin.h>
    #include <stdio.h>

    unsigned char out[8];
    unsigned char shuf[8] = { 0, 2, 4, 6, 128, 128, 128, 128 };
    float ins[4] = {500, 0, 120, 240};

    int main()
    {
        __m128 x = _mm_load_ps(ins);       // Load the floats
        __m64 y = _mm_cvtps_pi16(x);       // Convert them to 16-bit ints
        __m64 sh = *(__m64*)shuf;          // Get the shuffle mask into a register
        y = _mm_shuffle_pi8(y, sh);        // Shuffle the lower byte of each into the first four bytes
        *(int*)out = _mm_cvtsi64_si32(y);  // Store the lower 32 bits
        printf("%d\n", out[0]);
        printf("%d\n", out[1]);
        printf("%d\n", out[2]);
        printf("%d\n", out[3]);
        return 0;
    }
UPDATE2: Here's an even better solution based on Harold's answer:
    #include <smmintrin.h>
    #include <stdio.h>

    unsigned char out[8];
    float ins[4] = {10.4, 10.6, 120, 100000};

    int main()
    {
        __m128 x = _mm_load_ps(ins);        // Load the floats
        __m128i y = _mm_cvtps_epi32(x);     // Convert them to 32-bit ints
        y = _mm_packus_epi32(y, y);         // Pack down to 16 bits
        y = _mm_packus_epi16(y, y);         // Pack down to 8 bits
        *(int*)out = _mm_cvtsi128_si32(y);  // Store the lower 32 bits
        printf("%d\n", out[0]);
        printf("%d\n", out[1]);
        printf("%d\n", out[2]);
        printf("%d\n", out[3]);
        return 0;
    }
_mm_shuffle_pi8 is the mm-register version, right? Don't forget your _mm_empty – Darwin
[...] -mfpmath=sse on the compiler command line. – Sermon
Why not replace _mm_packus_epi32 with _mm_packs_epi32? As Peter said, it works just fine and requires only SSE2. Yours (based on harold's) requires SSE4.1 – Fasto
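For reference, the SSE2-only variant suggested in the last comment might look like the sketch below. The helper name floats_to_u8_sse2 is made up for illustration, and the unaligned load and the memcpy store are assumptions added here, not something taken from the posts above. Using _mm_packs_epi32 for the first step saturates to the signed 16-bit range, which matters because _mm_packus_epi16 reads its input as signed 16-bit values: 100000 becomes 32767 and then 255, whereas after an unsigned first pack it would be 65535, which the second pack sees as -1 and turns into 0.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcpy */

/* Hypothetical helper (the name is illustrative only): converts four floats
   to four unsigned bytes, saturating to 0..255, using SSE2 alone. */
static void floats_to_u8_sse2(const float *in, unsigned char *out)
{
    __m128  x = _mm_loadu_ps(in);    /* unaligned load of four floats */
    __m128i y = _mm_cvtps_epi32(x);  /* round to 32-bit ints (current rounding mode) */
    y = _mm_packs_epi32(y, y);       /* signed saturation to -32768..32767 */
    y = _mm_packus_epi16(y, y);      /* unsigned saturation to 0..255 */
    int lo = _mm_cvtsi128_si32(y);   /* low 32 bits hold the four result bytes */
    memcpy(out, &lo, sizeof lo);     /* store without a pointer cast */
}
```

Called with {10.4, 10.6, 120, 100000}, this should write 10, 11, 120, 255, assuming the default round-to-nearest mode. Negative inputs come out as 0.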