Clamp unsigned int to 0x10000 using SSE2

I want to clamp 32-bit unsigned ints to a fixed value (0x10000) using only SSE2 instructions.

Basically, this C code: if (c>0x10000) c=0x10000;

The code below works, but I'm wondering whether it can be simplified, given that the limit is a specific constant (0xFFFF+0x0001):

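movdqa    xmm3, xmm0   ; xmm0 contains 4 dword unsigned values
movdqa    xmm4, xmm5   ; xmm5: four dword 0x10000 values
pxor      xmm3, xmm5   ; sign bit of xmm3 = sign bit of the input
pcmpgtd   xmm4, xmm0   ; signed compare: 0x10000 > input
psrad     xmm3, 31     ; all-ones where the input has its top bit set
pxor      xmm4, xmm3   ; flip the compare result there (unsigned fix-up)
pand      xmm0, xmm4   ; keep the input where input < 0x10000
pandn     xmm4, xmm5   ; 0x10000 where input >= 0x10000
por       xmm0, xmm4   ; merge the two cases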
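
For readers who prefer intrinsics, the sequence above can be sketched in C roughly as follows. This is a minimal sketch, not the asker's code: the function names clamp_u32_sse2 and clamp_one_sse2 are made up for illustration, and only standard SSE2 intrinsics from emmintrin.h are used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp four unsigned 32-bit lanes to 0x10000 with SSE2 only.
   Hypothetical name; mirrors the asm sequence step by step. */
static __m128i clamp_u32_sse2(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);

    /* All-ones in lanes whose top bit is set (x ^ limit has x's sign bit). */
    __m128i sign = _mm_srai_epi32(_mm_xor_si128(x, limit), 31);

    /* Signed compare limit > x, then flip it where x is "negative",
       which turns it into an unsigned x < limit. */
    __m128i below = _mm_xor_si128(_mm_cmpgt_epi32(limit, x), sign);

    /* below ? x : limit */
    return _mm_or_si128(_mm_and_si128(x, below),
                        _mm_andnot_si128(below, limit));
}

/* Convenience wrapper for trying a single value (illustrative only). */
static uint32_t clamp_one_sse2(uint32_t c)
{
    __m128i r = clamp_u32_sse2(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

The wrapper broadcasts one value to all four lanes and reads back lane 0, which is convenient for spot-checking edge cases such as 0x10001 or 0x80000000.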

The value of c is in the range 0x00000000-0xFFFFFFFF, but code that assumes it is in the range 0x00000000-0x00FFFFFF or 0x00000000-0x00FF0000 may be acceptable.

Franzen answered 2/2 at 17:46 Comment(8)
c = (c & 0xFFFF) | (c == 0x10000 ? 0x10000 : 0). The second part (inside the second parentheses) is actually just 2 ops: pcmpeqd and pandn. -- Jemena
@Jemena Doesn't work, try c = 0x10001. -- Steward
Oh yes indeed, forget about my answer. -- Jemena
Just for the record, for future readers without the SSE2 constraint, SSE4.1 pminud / _mm_min_epu32 is what you want. SSE2 only had min/max for a couple combinations of size and signedness that were relevant for audio and pixels in the late 90s. -- Rothko
If you can assume the 0 to 0x00FFFFFF value range like your edit added, all your inputs will be treated as non-negative by pcmpgtd so you can remove the pxor/psrad correction for "different signs" which emulates an unsigned compare. @fuz's deleted answer shows that (fuz: does notify work with 's after your username?). So one pand can replace pxor/psrad/pxor if you need to clear high garbage to narrow to that value-range. -- Rothko
@PeterCordes I got a notification, so seems like it does work. -- Steward
I think it might just be possible to utilise SSE2 pminub in this particular case to mask against 0x0001ffff, then psub 0x00010000. So a 1 in the high word becomes 0 and 0 in the high word 0xffff. Use pshufw to duplicate the high word as masks into the low word, pand the mask against the result then padd 0x00010000 (untested). -- Kerek
@MartinBrown I think this would work. Some minor improvements: Use pminsw (against 0x00017fff) instead of pminub to extend the input range up to 0x7fffffff. Also, there is no pshufw on a full 128 bit register, but duplicating a 0xffff/0x0 mask can be achieved with an arithmetic right shift (psrad xmm,16). -- Erogenous
E
6

Here is an SSE2 solution that works on the full input range using saturating addition. It requires 4 uops and 2 constants (plus one register copy):

(Edit: minor improvement to the previous version. Neither of the required constants gets destroyed.)

The right-hand columns describe what happens if the high 16 bits of the input (x.h) are zero (in that case x.l needs to be returned) or not zero (in that case 0x10000 needs to be returned).

// assumes xmm1 contains 0xffffffff in every dword -- can be generated by pcmpeqd xmm1, xmm1
// assumes xmm3 contains 0xfffe0000 in every dword -- could be generated by left-shifting a ffffffff vector (pslld by 17)

                           x.h==0      x.h!=0
    paddusw xmm0, xmm3     [fffe,x.l]  [ffff,x.l]
    movdqa  xmm2, xmm0
    psrld   xmm2, 16       [0000,fffe] [0000,ffff]
    psubw   xmm2, xmm1     [0001,ffff] [0001,0000]
    pand    xmm0, xmm2     [0000,x.l]  [0001,0000]

If you have SSE4.1, pminud is of course simpler and better. And if you don't need to cover the full input range of xmm0, the solution by fuz is more generic, easier, and more straightforward (it also has a slightly shorter dependency chain and requires just one constant vector).
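
The table above can be checked with a small intrinsics version. This is only a sketch under the same assumptions as the asm (two constant vectors, SSE2 only); the names clamp_u32_sat and clamp_one_sat are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Full-range clamp to 0x10000 via saturating 16-bit addition.
   Hypothetical name; one intrinsic per instruction of the asm version. */
static __m128i clamp_u32_sat(__m128i x)
{
    const __m128i all_ones = _mm_set1_epi32(-1);               /* xmm1 */
    const __m128i bias     = _mm_set1_epi32((int)0xfffe0000u); /* xmm3 */

    /* paddusw: high word becomes fffe if x.h == 0, else saturates to ffff;
       low word is unchanged because it adds 0. */
    __m128i t = _mm_adds_epu16(x, bias);

    /* psrld 16: move that high word down, high word becomes 0000. */
    __m128i m = _mm_srli_epi32(t, 16);

    /* psubw by ffff per word, i.e. add 1 with wraparound:
       [0001,ffff] if x.h == 0, [0001,0000] otherwise. */
    m = _mm_sub_epi16(m, all_ones);

    /* pand: [0000,x.l] if x.h == 0, [0001,0000] otherwise. */
    return _mm_and_si128(t, m);
}

/* Convenience wrapper for trying a single value (illustrative only). */
static uint32_t clamp_one_sat(uint32_t c)
{
    __m128i r = clamp_u32_sat(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Note that 0x10000 itself takes the x.h != 0 path and still comes out as 0x10000, so the boundary value needs no special handling.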

Erogenous answered 3/2 at 11:15 Comment(0)
S
5

If the range can be assumed to be 0x00000000 to 0x7fffffff or narrower, you can pretend the values are signed and simplify the sequence to:

; xmm0 contains 4 dword unsigned values (input)
; xmm5 contains [0x10000, 0x10000, 0x10000, 0x10000]
movdqa    xmm1, xmm5
pcmpgtd   xmm1, xmm0  ; input < 0x10000
pand      xmm0, xmm1  ; input < 0x10000 ? input :       0
pandn     xmm1, xmm5  ; input < 0x10000 ?     0 : 0x10000
por       xmm0, xmm1  ; input < 0x10000 ? input : 0x10000
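
As a rough intrinsics sketch of the same sequence (function names are made up for illustration): it relies on the stated assumption that inputs are at most 0x7fffffff. A lane with the top bit set compares as negative, so it passes through unclamped.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clamp to 0x10000, valid only for inputs in 0..0x7fffffff.
   Hypothetical name; uses the signed compare directly. */
static __m128i clamp_u31_sse2(__m128i x)
{
    const __m128i limit = _mm_set1_epi32(0x10000);
    __m128i below = _mm_cmpgt_epi32(limit, x);       /* input < 0x10000 */
    return _mm_or_si128(_mm_and_si128(x, below),     /* below ? input : 0 */
                        _mm_andnot_si128(below, limit)); /* below ? 0 : 0x10000 */
}

/* Convenience wrapper for trying a single value (illustrative only). */
static uint32_t clamp_one_u31(uint32_t c)
{
    __m128i r = clamp_u31_sse2(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```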

With SSE4.1, you can further simplify the code to just:

pminud    xmm0, xmm5  ; input < 0x10000 ? input : 0x10000
Steward answered 2/2 at 18:2 Comment(0)
D
5

"assumes it is in the range 0x00000000-0x00FFFFFF"

minps     xmm0, xmm5

This works if you haven't set DAZ (Denormals Are Zero) in MXCSR. With DAZ set (bit 1<<6 = 0x40), minps treats 0x10000 as representing exactly 0.0, so the result is 0x00000000.
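
In C with intrinsics, the idea might look like the sketch below. It assumes DAZ is clear in MXCSR (the usual default when nothing like -ffast-math startup code has changed it) and that inputs stay within 0x00000000-0x00FFFFFF; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in SSE minps */
#include <stdint.h>

/* Clamp to 0x10000 by treating the integer bit patterns as floats.
   Assumes DAZ is not set and inputs are in 0..0x00FFFFFF.
   Hypothetical name; the casts only reinterpret bits, they emit no code. */
static __m128i clamp_u24_minps(__m128i x)
{
    const __m128 limit = _mm_castsi128_ps(_mm_set1_epi32(0x10000));
    __m128 r = _mm_min_ps(_mm_castsi128_ps(x), limit);   /* minps */
    return _mm_castps_si128(r);
}

/* Convenience wrapper for trying a single value (illustrative only). */
static uint32_t clamp_one_minps(uint32_t c)
{
    __m128i r = clamp_u24_minps(_mm_set1_epi32((int)c));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```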

This is very slow on some of the CPUs where it would be useful (because of microcode assists for denormals), including first-gen Core 2 Duo (E6600), which has SSSE3 but not SSE4.1 for pminud. A test loop has a throughput of 1 minps per clock with normalized inputs, but with these subnormals it averages 119 cycles per minps. It's fast on Skylake even with subnormals.

Note that linking with gcc -ffast-math will include CRT startup code that sets FTZ and DAZ, so real programs can have it set without doing any x86-specific stuff. DAZ avoids minps slowdowns on CPUs like Core 2, but of course makes it non-useful for playing with small integers.

(FTZ doesn't affect minps; it doesn't have to round its output.)


This might have some extra bypass latency between SIMD-integer instructions (and minps itself has multi-cycle latency), but it is still better for throughput than SSE2 emulation of SSE4.1 pminsd / pminud, on CPUs where it doesn't take a microcode assist due to subnormal inputs.

Integer values in this limited range are bit-patterns for finite non-negative floats (IEEE binary32). Larger integer bit-patterns represent larger-magnitude values, up to the first NaN (0x7F800001).

Half the values in this range have exponent field = 0 (bits 30:23), so are subnormal aka denormal floats. 0x00800000 is the bit-pattern for the smallest normalized float.
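
The ordering and exponent-field claims above can be spot-checked with plain scalar C. This is just an illustrative sketch; the helper names are made up, and it assumes float is IEEE binary32 (true on x86).

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float without any value conversion. */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Biased exponent field, bits 30:23. Zero means zero or subnormal. */
static unsigned exponent_field(uint32_t bits)
{
    return (bits >> 23) & 0xFFu;
}

/* Nonzero if a is below b both as unsigned integers and as floats. */
static int same_order(uint32_t a, uint32_t b)
{
    return (a < b) && (float_from_bits(a) < float_from_bits(b));
}
```

For example, same_order(0x0000FFFF, 0x00010000) holds, exponent_field(0x00010000) is 0 (a subnormal), and float_from_bits(0x00800000) equals FLT_MIN, the smallest normalized float.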

Dewittdewlap answered 2/2 at 22:57 Comment(4)
Cool idea. I expanded the answer to explain why it works, and why it doesn't if DAZ is set. (I tested on my Skylake CPU to confirm that minps produces 0.0 with either ordering of the operands, if either of them has a subnormal element.) -- Rothko
Unfortunately Core 2 Duo has microcode assists for this: a simple loop that starts a new dep chain of 3 minps every iteration averages one per 119 cycles, vs. 1/cycle with normalized values (or with DAZ set) on a Core 2 Duo E6600. Some AMD CPUs might not take microcode assists for minps, so there might be some CPUs without SSE4.1 where this is useful. And it's fast on newer Intel like Skylake (probably since Sandybridge), so if your goal is code that can run fast on modern CPUs, and run without faulting (but arbitrarily slowly) on all CPUs, this could work. -- Rothko
If anyone's considering using this, make sure to test on Alder Lake E-cores (Gracemont); they might have slowdowns for denormals even though modern Intel P-cores don't. -- Rothko
With two additional bitops, you could avoid the DAZ problems, e.g., minps(x^0x40000000, 0x40010000) ^ 0x40000000. The code would still work for inputs up to 0x3fffffff (edit: not quite, since 0x7fffffff is a NaN -- unless you make sure that the non-NaN argument of minps is returned in that case). Or, with higher latency, but working up to 0x7fffffff: cvtps2dq(minps(cvtdq2ps(x), float(0x10000))). -- Erogenous
