SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers
Asked Answered
S

2

8

Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes.

There is an intrinsic _mm_cvtps_pi8 that will convert 32-bit to 8-bit signed int, but the problem there is that any value over 127 gets clamped to 127. I can't find any instructions that will clamp to unsigned 8-bit values.

I have an intuition that what I may want to do is some combination of _mm_cvtps_pi16 and _mm_shuffle_pi8 followed by move instruction to get the four bytes I care about into memory. Is that the best way to do it? I'm going to see if I can figure out how to encode the shuffle control mask.

UPDATE: The following appears to do exactly what I want. Is there a better way?

#include <tmmintrin.h>
#include <stdio.h>

unsigned char out[8];
unsigned char shuf[8] = { 0, 2, 4, 6, 128, 128, 128, 128 };
float ins[4] = {500, 0, 120, 240};

int main()
{
    __m128 x = _mm_load_ps(ins);    // Load the floats
    __m64 y = _mm_cvtps_pi16(x);    // Convert them to 16-bit ints
    __m64 sh = *(__m64*)shuf;       // Get the shuffle mask into a register
    y = _mm_shuffle_pi8(y, sh);     // Shuffle the lower byte of each into the first four bytes
    *(int*)out = _mm_cvtsi64_si32(y); // Store the lower 32 bits

    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}

UPDATE2: Here's an even better solution based on Harold's answer:

#include <smmintrin.h>
#include <stdio.h>

unsigned char out[8];
float ins[4] = {10.4, 10.6, 120, 100000};

int main()
{   
    __m128 x = _mm_load_ps(ins);       // Load the floats
    __m128i y = _mm_cvtps_epi32(x);    // Convert them to 32-bit ints
    y = _mm_packus_epi32(y, y);        // Pack down to 16 bits
    y = _mm_packus_epi16(y, y);        // Pack down to 8 bits
    *(int*)out = _mm_cvtsi128_si32(y); // Store the lower 32 bits

    printf("%d\n", out[0]);
    printf("%d\n", out[1]);
    printf("%d\n", out[2]);
    printf("%d\n", out[3]);
    return 0;
}
Sermon answered 24/4, 2015 at 19:35 Comment(3)
Wait, you know _mm_shuffle_pi8 is the mm-register version, right? Don't forget your _mm_emptyDarwindarwinian
@harold: Oh, good point. However, I have -mfpmath=sse on the compiler command line.Sermon
May I suggest replacing that _mm_packus_epi32 by _mm_packs_epi32? As Peter said, it works just fine and requires only SSE2. Yours (based on harold's) requires SSE4.1Fasto
D
11

There is no direct conversion from float to byte, _mm_cvtps_pi8 is a composite. _mm_cvtps_pi16 is also a composite, and in this case it's just doing some pointless stuff that you undo with the shuffle. They also return annoying __m64's.

Anyway, we can convert to dwords (signed, but that doesn't matter), and then pack (unsigned) or shuffle them into bytes. _mm_shuffle_(e)pi8 generates a pshufb, Core2 45nm and AMD processors aren't too fond of it and you have to get a mask from somewhere.

Either way you don't have to round to the nearest integer first, the convert will do that. At least, if you haven't messed with the rounding mode.

Using packs 1: (not tested) -- probably not useful, packusdw already outputs unsigned words but then packuswb wants signed words again. Kept around because it is referred to elsewhere.

cvtps2dq xmm0, xmm0  
packusdw xmm0, xmm0     ; unsafe: saturates to a different range than packuswb accepts
packuswb xmm0, xmm0
movd somewhere, xmm0

Using different shuffles:

cvtps2dq xmm0, xmm0  
packssdw xmm0, xmm0     ; correct: signed saturation on first step to feed packuswb
packuswb xmm0, xmm0
movd somewhere, xmm0

Using shuffle: (not tested)

cvtps2dq xmm0, xmm0
pshufb xmm0, [shufmask]
movd somewhere, xmm0

shufmask: db 0, 4, 8, 12, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h, 80h
Darwindarwinian answered 24/4, 2015 at 19:52 Comment(6)
I really like your pack solution. What's nice is that the rounding AND the clamping happen automatically. There is one corner case, however, although I don't think it affects me: If I put, say, 100000 into one of the floats, the first time, it gets clamped to 65535 (I assume). The second time, however, it gets reinterpreted as a signed value (-1) and then clamped to zero by the packuswb. Any low-cost fix for this?Sermon
@TimothyMiller maybe, I can't really think of anything clever, just the obvious "pminuw with 255"Darwindarwinian
@TimothyMiller: Yeah, packuswb treats its input as signed, but output as unsigned, so there's a problem. You could use pand to mask off the even-numbered bytes between packusdw and packuswb to achieve the same result as pminuw. Or work with floats in the [-128..127] range, and convert them to the [0..255] range with paddb a vector of 128s.Sadomasochism
I think I solved the issue: just use packssdw as the first step, because that's how packuswb will interpret it. I added that as an answer. I feel like I must be missing something, or else I feel dumb for not thinking of this last time I was looking, when I wrote an answer for stackoverflow.com/questions/32284106/…Sadomasochism
It should be noted that packusdw requires SSE4 (AMD's SSE4a doesn't support it).Antifebrile
I agree with zett42. But replacing it by packssdw (as Peter suggested) seems to work fine and bring us back to SSE2.Fasto
S
7

We can solve the unsigned clamping issue by doing the first stage of packing with signed saturation. [0-255] fits in a signed 16-bit int, so values in that range will remain unclamped. Values outside that range will stay on the same side of it. Thus, the signed16 -> unsigned8 step will clamp them correctly.

;; SSE2: good for arrays of inputs
cvtps2dq xmm0, [rsi]      ; 4 floats
cvtps2dq xmm1, [rsi+16]   ; 4 more floats
packssdw xmm0, xmm1       ; 8 int16_t

cvtps2dq xmm1, [rsi+32]
cvtps2dq xmm2, [rsi+48]
packssdw xmm1, xmm2       ; 8 more int16_t
                          ; signed because that's how packuswb treats its input
packuswb xmm0, xmm1       ; 16 uint8_t
movdqa   [rdi], xmm0

This only requires SSE2, not SSE4.1 for packusdw.

I assume this is the reason SSE2 only included signed pack from dword to word, but both signed and unsigned pack from word to byte. packuswd is only useful if your final goal is uint16_t, rather than further packing. (Since then you'd need to mask off the sign bit before feeding it to a further pack).

If you did use packusdw -> packuswb, you'd get bogus results when the first step saturated to a uint16_t > 0x7fff. packuswb would interpret that as a negative int16_t and saturate it to 0. packssdw would saturate such inputs to 0x7fff, the max int16_t.

(If your 32-bit inputs are always <= 0x7fff, you can use either, but SSE4.1 packusdw takes more instruction bytes than SSE2 packsswd, and never runs faster.)


If your source values can't be negative, and you only have one vector of 4 floats, not many, you can use harold's pshufb idea. If not, you need to clamp negative values to zero rather than truncate the by shuffling the low bytes into place.

Using

;; SSE4.1, good for a single vector.  Use the PACK version above for arrays
cvtps2dq   xmm0, xmm0
pmaxsd     xmm0, zeroed-register
pshufb     xmm0, [mask]
movd       [somewhere], xmm0

may be slightly more efficient than using two pack instructions, because pmax can run on port 1 or 5 (Intel Haswell). cvtps2dq is port 1 only, pshufb and pack* are port 5 only.

Sadomasochism answered 3/12, 2015 at 21:44 Comment(3)
In my case I got negative values, so harold's shuffle was not enough. Your shuffle works, but unfortunately requires SSE4.1 because of the pmaxsd. Both SSE4.1 solutions (packs and suffle) run at the same speed on my i7 980x. Will give your first solution a try now.Fasto
Your first suggestion, using packssdw, works great (used it with harold's). Now we got SSE2 and SSE4.1! (both run at the same speed too)Fasto
AVX2 version of this: How to convert 32-bit float to 8-bit signed char? (4:1 packing of int32 to int8 __m256i) (u8 instead of i8 is just a matter of changing the final step to use vpackuswb aka _mm256_packus_epi16). Also how to convert uint32 to uint8 using simd but not avx512? shows an interesting strategy with 2x vpshufb -> vpor -> vpermd to create one 128-bit vector.Sadomasochism

© 2022 - 2024 — McMap. All rights reserved.