assumes it is in the range 0x00000000-0x00FFFFFF
minps xmm0, xmm5
This works if you haven't set DAZ (Denormals Are Zero) in MXCSR. With DAZ set (bit 1<<6 = 0x40), minps treats 0x10000 as representing exactly 0.0, so the result is 0x00000000.
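As a rough sketch with intrinsics (the function names here are made up for illustration, they are not from any library), the single-instruction clamp could look like this, assuming every lane is already in 0x00000000-0x00FFFFFF and DAZ is clear:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 integer types and casts; also pulls in the SSE1 _mm_min_ps */
#include <stdint.h>

/* Clamp four 32-bit lanes to `limit` with one minps.
   Assumption: all lanes of v and limit are in 0x00000000..0x00FFFFFF,
   and DAZ is clear in MXCSR (the default when not built with -ffast-math). */
static __m128i min_u24_via_minps(__m128i v, __m128i limit)
{
    __m128 r = _mm_min_ps(_mm_castsi128_ps(v), _mm_castsi128_ps(limit));
    return _mm_castps_si128(r);  /* casts are free: same bits, different type */
}

/* Hypothetical scalar wrapper, only for trying out single values. */
static uint32_t min_u24_scalar(uint32_t x, uint32_t limit)
{
    assert(x <= 0x00FFFFFFu && limit <= 0x00FFFFFFu);
    __m128i r = min_u24_via_minps(_mm_cvtsi32_si128((int)x),
                                  _mm_cvtsi32_si128((int)limit));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

The casts compile to no instructions, so the whole clamp is the one minps.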
This is very slow on some of the CPUs where it would be useful (because of microcode assists for denormals), including first-gen Core 2 Duo (E6600), which has SSSE3 but not SSE4.1 for pminud. A test loop has a throughput of 1 minps per clock with normalized inputs, but with these subnormals it averages 119 cycles per minps. It's fast on Skylake even with subnormals.
Note that linking with gcc -ffast-math will include CRT startup code that sets FTZ and DAZ, so real programs can have it set without doing any x86-specific stuff. DAZ avoids minps slowdowns on CPUs like Core 2, but of course makes it non-useful for playing with small integers.
(FTZ doesn't affect minps; it doesn't have to round its output.)
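If you can't be sure how the program was linked, one option is to check or clear DAZ yourself around the code that relies on this trick. A minimal sketch, where the helper names and the MXCSR_DAZ macro are my own and only _mm_getcsr / _mm_setcsr are real intrinsics:

```c
#include <assert.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */

#define MXCSR_DAZ 0x0040u  /* bit 6 of MXCSR: Denormals Are Zero */

/* Non-zero if DAZ is currently set for the calling thread. */
static int daz_is_set(void)
{
    return (_mm_getcsr() & MXCSR_DAZ) != 0;
}

/* Clear DAZ and return the previous MXCSR value, so the caller
   can restore it later with _mm_setcsr(old). */
static unsigned int clear_daz(void)
{
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old & ~MXCSR_DAZ);
    assert(!daz_is_set());
    return old;
}
```

MXCSR is per-thread state, so this only affects the thread that calls it.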
This might have some extra bypass latency between SIMD-integer instructions (and itself has multi-cycle latency), but still better for throughput than SSE2 emulation of SSE4.1 pminsd / pminud on CPUs where it doesn't take a microcode assist due to subnormal inputs.
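For comparison, here is a sketch of one common way to emulate pminsd with SSE2 only: a signed compare followed by an and/andnot/or blend. The function names are hypothetical. For inputs that are non-negative as signed integers (which includes the limited range used here), signed and unsigned min agree, so it also stands in for pminud:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Signed 32-bit min using only SSE2: pcmpgtd, pand, pandn, por. */
static __m128i min_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);            /* all-ones where a > b */
    return _mm_or_si128(_mm_and_si128(a_gt_b, b),      /* take b where a > b  */
                        _mm_andnot_si128(a_gt_b, a));  /* otherwise take a    */
}

/* Hypothetical scalar wrapper, only for trying out single values. */
static uint32_t min_epi32_sse2_scalar(uint32_t x, uint32_t limit)
{
    assert(x <= 0x7FFFFFFFu && limit <= 0x7FFFFFFFu);  /* signed compare is valid here */
    __m128i r = min_epi32_sse2(_mm_cvtsi32_si128((int)x),
                               _mm_cvtsi32_si128((int)limit));
    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

That is four instructions (plus a register copy without AVX) where the float trick needs one.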
Integer values in this limited range are bit-patterns for finite non-negative floats (IEEE binary32). Larger integer bit-patterns represent larger-magnitude values, up to the first NaN (0x7F800001).
Half the values in this range have exponent field = 0 (bits 30:23), so are subnormal aka denormal floats. 0x00800000 is the bit-pattern for the smallest normalized float.
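To see the ordering and the subnormal split concretely, here is a small portable sketch (no SSE needed) that reinterprets bit-patterns as floats. It assumes float is IEEE binary32 and that subnormals aren't being flushed; the helper names are my own:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes float is IEEE binary32). */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Non-zero if the exponent field (bits 30:23) is 0 and the mantissa is non-zero. */
static int is_subnormal_pattern(uint32_t bits)
{
    return ((bits >> 23) & 0xFFu) == 0 && (bits & 0x007FFFFFu) != 0;
}

/* Non-zero if integer order and float order agree for two patterns
   in the limited range 0x00000000..0x00FFFFFF. */
static int same_order(uint32_t a, uint32_t b)
{
    assert(a <= 0x00FFFFFFu && b <= 0x00FFFFFFu);
    return (a < b) == (float_from_bits(a) < float_from_bits(b));
}
```

For example, 0x10000 is a subnormal pattern, while 0x00800000 is not.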
c = (c & 0xFFFF) | (c == 0x10000 ? 0x10000 : 0) the second part - inside the second parentheses - is actually just 2 ops: pcmpeqd and pandn – Jemena
c = 0x10001 – Steward
pminud / _mm_min_epu32 is what you want. SSE2 only had min/max for a couple combinations of size and signedness that were relevant for audio and pixels in the late 90s. – Rothko
pcmpgtd so you can remove the pxor / psrad correction for "different signs" which emulates an unsigned compare. @fuz's deleted answer shows that (fuz: does notify work with 's after your username?). So one pand can replace pxor/psrad/pxor if you need to clear high garbage to narrow to that value-range. – Rothko
pminsw (against 0x00017fff) instead of pminub to extend the input range up to 0x7fffffff. Also, there is no pshufw on a full 128 bit register, but duplicating a 0xffff / 0x0 mask can be achieved with an arithmetic right shift (psrad xmm,16). – Erogenous