Branchy: one compare-and-branch, and an LEA + CMOV on the clamping path.
Branchless: one CMP and two single-uop CMOVs.
(Or fully branchless using the same trick in @Martijn's answer, which might be good if out-of-range values are common but unpredictable.)
It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD to clamp values to a signed range. (Or 2x packssdw + packuswb to pack 4 input vectors down to one vector of 8-bit elements with eventual unsigned saturation.)
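For example, a minimal SSE4.1 sketch of that vector clamp (the register assignments here are my own, not from the question):

; xmm0 = four signed dword inputs
; xmm1 = (limit-1) broadcast to all four lanes, xmm2 = all-zero
pminsd xmm0, xmm1   ; per-element signed min: clamp the high end to limit-1
pmaxsd xmm0, xmm2   ; per-element signed max: clamp the low end to 0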
In your original, a couple of things are clearly sub-optimal. Most of them only matter for code-size, but using LEA for a non-destructive add is a small but clear win.
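For example (register names are mine, not the exact instructions from the question), a copy-then-add pair like

mov ecx, ebx
add ecx, 1

can be a single non-destructive instruction:

lea ecx, [rbx + 1]   ; ecx = ebx + 1 without clobbering ebx; also leaves FLAGS untouched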
See the links in the x86 wiki, esp. Agner Fog's guides.
If clamping is uncommon, branch on it being needed at all:
You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.
...
cmp rax, r12
jae .clamp ; not-taken fast-path = no clamping
.clamp_finished:
...
ret
.clamp:
; flags still set from the cmp rax, r12
; we only get here if rax >= r12 unsigned: either rax >= r12 signed (ge), or rax < 0 (l: signed rax < r12 means orig rax was negative, since r12 > 0)
; mov r15d, 0 ; or zero it outside the loop so it can be used when needed. Can't xor-zero because we need to preserve flags
lea rax, [r12-1] ; still doesn't modify flags
cmovl rax, r15 ; rax=0 if orig_rax<r12 (signed), which means we got here because orig_rax<0
jmp .clamp_finished
quick perf analysis for Intel Skylake:
Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.
Clamping-needed case: One taken compare-and-branch uop, plus 3 more uops (lea, 1 for cmov, 1 more to jmp back.) Latency for rax: 2 cycles from the later of RAX or R12 being ready (cmp + lea in parallel, then cmov reading FLAGS from cmp).
On Intel Haswell or earlier, cmovl is 2 uops and costs an extra cycle of latency on the critical path, so 3 total.
Obviously you can use jb instead of jae to skip over the clamping lea/cmov inline (sketched below), instead of pulling them out of the main flow. See the section below for motivation for that. (And/or see Anatolyg's excellent answer, which covers this. I got the cool trick of using jb to do the [0 .. limit] check with one branch from Anatolyg's answer, too.)
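A minimal sketch of that inline layout (assuming r15 was zeroed outside the loop, as in the block above):

cmp rax, r12
jb .in_range        ; unsigned below: 0 <= rax < r12, no clamping needed
lea rax, [r12-1]    ; assume the too-high case; doesn't modify flags
cmovl rax, r15      ; signed rax < r12 here means orig rax was negative: clamp to 0
.in_range: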
I think the version using jae/cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency even when clamping is needed.
An alternative branchy implementation of the .clamp code block that doesn't need a zeroed register would be:
.clamp:
lea rax, [r12-1]     ; assume the too-high case; lea doesn't modify flags
jge .clamp_finished  ; flags still from cmp rax, r12: signed ge means orig rax really was too high
xor eax, eax         ; otherwise orig rax was negative: clamp to 0
jmp .clamp_finished
It still computes a result it might throw away, cmov-style. However, the xor-zeroing starts a fresh dependency chain, so when it does execute it doesn't have to wait for the lea to write rax.
An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast-path through the code (as few instructions and as few taken-branches as possible). Depending on how infrequently branches are taken, it can be worth putting the code for the uncommon case off at the end of the function.
func:
...
test
jcc .unlikely
...
.ret_from_unlikely:
...
... ;; lots of code
ret
.unlikely:
xor eax,eax
jmp .ret_from_unlikely ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.
GCC does this, I think, when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.
random thoughts: The code
if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax = r12 - 1; }   // if r12 can be zero, the else matters
is equivalent to
eax = min(eax, r12-1);
eax = max(eax, 0);
r12 can't be negative, but the OP didn't say it couldn't be zero. This ordering preserves the if/else semantics. (Edit: actually the OP did say you can assume r12 > 0, not just >= 0.) If we had a fast min/max in asm, we could use it here; vector max is a single instruction, but scalar takes more code, as sketched below.
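A scalar sketch of that min/max formulation with plain CMP/CMOV (my own register choices; it needs a spare register and isn't trying to be optimal):

lea rdx, [r12-1]    ; upper bound = r12 - 1
cmp rax, rdx
cmovg rax, rdx      ; rax = min(rax, r12-1)  (signed)
xor edx, edx        ; rdx = 0 (clobbers flags, so re-test below)
test rax, rax
cmovl rax, rdx      ; rax = max(rax, 0)

That's six instructions, versus one pminsd + one pmaxsd for a whole vector of elements, which is the point about scalar taking more code.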
Related code review: Fastest way to clamp an integer to the range 0-255. But that's in C, and none of the compiler-generated asm versions are optimal. Still, it provided some initial inspiration.
Also related: current clang pessimizes std::clamp into storing, selecting a pointer, and reloading. https://bugs.llvm.org/show_bug.cgi?id=47271
TODO: file missed-optimization bug reports with this clamp peephole so compilers can look for it.
For now, I have versions that branchlessly clamp to [0, limit] (closed range at both ends, so to limit instead of limit-1). That required some trickery to avoid cmova / cmovbe, which are still 2 uops on Intel because they read both CF and ZF, unlike most CMOV predicates which read only CF or only flags from the SPAZO group.
# gcc -nostdlib -static testloop.S -o testloop &&
# taskset -c 3 perf stat --all-user -etask-clock:u,context-switches:u,cpu-migrations:u,page-faults:u,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u -r1 ./testloop
# or idq.mite_uops to make sure it's low
.intel_syntax noprefix
.global _start
_start:
mov edi, 34
mov esi, 100
xor ecx, ecx
mov ebp, 100000000
.p2align 6
.loop:
# ~3.10 cycles latency, with unroll or an imul to give OoO scheduling an easier time. Can be 3.3c in worse cases.
.macro clamp0n dst, n, high
xor \dst, \dst
cmp \n, \high
cmovg \dst, \high # prepare clamped value: n>high [signed] ? high : 0
cmovbe \dst, \n # copy original if n <= high [unsigned], i.e. in range
.endm
# ~4.00 cycles latency, no ILP for 2-uop cmovbe
.macro clamp0n_rev dst, n, high
xor \dst, \dst
cmp \n, \high
cmovbe \dst, \n # copy original if n <= high [unsigned]; no clamping
cmovg \dst, \high # high if n>high (replacing 0), else leave orig
.endm
# ~3.00 cycles latency, only single-uop CMOV
.macro clamp0n_rev_intel dst, n, high
xor \dst, \dst
cmp \n, \high
cmovb \dst, \n # copy original if n < high [unsigned]; no clamping. (cmovbe is 2 uops on Intel, let next insn handle that case)
cmovge \dst, \high # high if n>=high (replacing 0), else leave orig. copy on equal restores the value destroyed by cmovb
.endm
# ~3.1 to 3.3 cycle latency
.macro clamp0n_inplace_destroy_zero n, high, tmp
xor \tmp, \tmp
cmp \n, \high
cmovg \tmp, \high # prepare clamped value: 0 or high, per signed compare.
cmova \n, \tmp # if clamping needed at all, apply clamped value
.endm
# 4.0 cycles latency.
.macro clamp0n_inplace n, high, zero
cmp \n, \high
cmova \n, \zero # if clamping needed at all, apply 0. 2 uops on Intel. could be 2nd if we destroy \high?
cmovg \n, \high # if signed greater than limit, clamp to high
.endm
# 3.0 cycles latency, only single uop CMOV.
.macro clamp0n_inplace_intel n, high, zero
cmp \n, \high
cmovae \n, \zero # if clamping needed at all, apply 0. (or on equal, to avoid 2-uop cmov)
cmovge \n, \high # if signed greater than limit, clamp to high. (or on equal, to restore the correct value)
.endm
#define CLAMP_INPLACE clamp0n_inplace_intel
#define CLAMP_COPY clamp0n_rev_intel
CLAMP_INPLACE edi, esi, ecx
CLAMP_INPLACE edi, esi, ecx
CLAMP_INPLACE edi, esi, ecx
CLAMP_INPLACE edi, esi, ecx
# imul edi, edi
#if 0
//#define clamp0n clamp0n_rev // use the slow version
CLAMP_COPY eax, edi, esi
and eax, edi
imul edi, eax, 123
#endif
#if 0
CLAMP_COPY edi, eax, esi
CLAMP_COPY eax, edi, esi
CLAMP_COPY edi, eax, esi
CLAMP_COPY rax, rdi, rsi # 64-bit for REX prefixes, keep dec/jnz off a 32-byte boundary so uop cache works (JCC erratum mitigation)
CLAMP_COPY rdi, rax, rsi
#endif
#nop # pad the loop up to 32 total uops. Tiny benefit on Skylake in this artificial, fully latency-bound case.
dec ebp
jnz .loop
xor edi,edi
mov eax,231 # __NR_exit_group from /usr/include/asm/unistd_64.h
syscall # sys_exit_group(0)