x86 assembly - optimization of clamping rax to [ 0 .. limit )

I'm writing a simple assembly procedure which, naturally, aims to be as quick as possible. However, a certain part, located in the innermost loop, doesn't seem 'right', and I believe it is possible to come up with a cleverer and quicker implementation, maybe even one without conditional jumps. The code implements a simple thing:

if rax < 0 then rax := 0 else if rax >= r12 then rax := r12 - 1

And here's my naive implementation:

cmp rax, 0
jge offsetXGE
   mov rax, 0
   jmp offsetXReady
offsetXGE:
   cmp rax, r12
   jl offsetXReady
   mov rax, r12
   dec rax
offsetXReady:

Any ideas are welcome, even those using MMX and some masking tricks.

EDIT: To answer some questions in comments - yes we can assume that r12 > 0 but rax can be negative.

Lahey answered 3/12, 2015 at 16:24 Comment(5)
Do you work with eax or rax? Are both eax and r12 signed?Subtenant
Your pseudo code doesn't mention rax but you're using it in your actual code. Is that intentional? They don't match. Outside of that, the code looks simple enough.Andri
Also, do you need the else? Can we assume r12 > 0?Tintinnabulum
Oops, my mistake, in the pseudocode I meant rax. Both registers are signed, but r12 cannot hold a negative value.Lahey
Are the flags already set from whatever generated rax? If the rax comes from an insn that sets flags, can you re-order the code before this to put that instruction at the very end?Boony

Branchy: one compare-and-branch, and an LEA + cmov.

Branchless: one CMP, 2 single-uop CMOVs.

(Or fully branchless using the same trick in @Martijn's answer, which might be good if out-of-range values are common but unpredictable.)


It's not worth moving scalar data to the vector regs for one or two instructions, and then moving it back. If you could usefully do whole vectors at a time, then you could use PMINSD/PMAXSD to clamp values to a signed range. (Or 2x packssdw + packuswb to pack 4 input vectors down to one vector of 8-bit elements with eventual unsigned saturation.)
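
If you did have whole vectors of values, a minimal SSE4.1 sketch of that clamp (the register choice and the limit-broadcast setup here are illustrative assumptions, not from the question):

    ; xmm0 = four signed dword values, r12d = limit (assumed to fit in 32 bits)
    movd     xmm1, r12d
    pshufd   xmm1, xmm1, 0     ; broadcast limit into all 4 lanes
    pcmpeqd  xmm2, xmm2        ; all-ones = -1 in every lane
    paddd    xmm1, xmm2        ; xmm1 = limit-1 per lane
    pxor     xmm3, xmm3        ; xmm3 = 0
    pminsd   xmm0, xmm1        ; x = min(x, limit-1)   (SSE4.1)
    pmaxsd   xmm0, xmm3        ; x = max(x, 0)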

In your original, a couple of things are clearly sub-optimal:

  • cmp rax, 0 / jge: test rax, rax / jge sets the same flags with a shorter encoding (comparing against an immediate 0 is never needed).

  • mov rax, 0: xor eax, eax is the standard zeroing idiom: smaller, recognized as dependency-breaking, and writing eax zero-extends into rax.

  • mov rax, r12 / dec rax: lea rax, [r12 - 1] does the copy and the subtract in one instruction, without touching flags.

The first two only matter for code-size most of the time, but LEA for a non-destructive add is a small but clear win.
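
Applying just those three fixes to the original branchy structure (a sketch only; the layouts below are still better when clamping is rare):

    test rax, rax
    jge  offsetXGE
    xor  eax, eax           ; rax := 0 (writing eax zero-extends)
    jmp  offsetXReady
offsetXGE:
    cmp  rax, r12
    jl   offsetXReady
    lea  rax, [r12 - 1]     ; rax := r12 - 1, flags and r12 untouched
offsetXReady: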

See the links in the wiki, esp. Agner Fog's guides.


If clamping is uncommon, branch on it being needed at all:

You need a register (or memory location) holding 0, or else an extra instruction to mov reg, 0.

    ...
    cmp  rax, r12
    jae  .clamp      ; not-taken fast-path = no clamping
.clamp_finished:

    ...
    ret

.clamp:   
    ; flags still set from the cmp rax, r12
    ; we only get here if rax >= r12 unsigned: either rax >= r12 (signed `ge`), or rax is negative (in which case signed `l`, rax < r12, is true)

    ; mov r15d, 0        ; or zero it outside the loop so it can be used when needed.  Can't xor-zero because we need to preserve flags

    lea    rax, [r12-1]  ; still doesn't modify flags
    cmovl  rax, r15      ; rax=0 if  orig_rax<r12 (signed), which means we got here because orig_rax<0
    jmp  .clamp_finished

quick perf analysis for Intel Skylake:

  • Fast path: one not-taken compare-and-branch uop. Latency for rax: 0 cycles.

  • Clamping-needed case: One taken compare-and-branch uop, plus 3 more uops (lea, 1 for cmov, 1 more to jmp back.) Latency for rax: 2 cycles from the later of RAX or R12 being ready (cmp + lea in parallel, then cmov reading FLAGS from cmp).

    On Intel Haswell or earlier, cmovl is 2 uops and costs an extra cycle of latency on the critical path, so 3 total.

Obviously you can use jb instead of jae to skip over the clamping lea/cmov, instead of pulling them out of the main flow. See the section below for motivation for that. (And/or see Anatolyg's excellent answer, which covers this. I got the cool trick of using jb to do the [0 .. limit) check with one branch from Anatolyg's answer, too.)
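
A minimal sketch of that in-line layout (still assuming a register, r15 as above, already holds zero):

    cmp    rax, r12
    jb     .in_range        ; unsigned below the limit: in range, skip the clamp
    lea    rax, [r12-1]     ; assume it was too high
    cmovl  rax, r15         ; signed rax < r12 here means rax was negative: clamp to 0 instead
.in_range: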

I think the version using jae/cmov is the best bet here, even though cmov has a lot of downsides and isn't always faster. Its input operands were already needed, so it's not adding much latency even when clamping is needed.

An alternative branchy implementation of the .clamp code block that doesn't need a zeroed register would be:

.clamp:
    lea    rax, [r12-1]
    jge  .clamp_finished
    xor    eax, eax
    jmp  .clamp_finished

It still computes a result it might throw away, cmov-style. However, the following xor starts a fresh dependency chain, so it doesn't have to wait for lea to write rax if the xor-zeroing executes.


An important question is how often you expect these branches to be taken. If there's a common case (e.g. the no-clamping case), make that the fast-path through the code (as few instructions and as few taken-branches as possible). Depending on how infrequently branches are taken, it can be worth putting the code for the uncommon case off at the end of the function.

func:
    ...
    test
    jcc .unlikely
    ...        
.ret_from_unlikely:
    ...
    ... ;; lots of code
    ret

.unlikely:
    xor   eax,eax
    jmp .ret_from_unlikely   ;; this extra jump makes the slow path slower, but that's worth it to make the fast path faster.

GCC does this, I think, when it decides a branch is unlikely to be taken. So instead of having the typical case take a branch that skips some instructions, the common case falls through. Typically, the default branch prediction is not-taken for forward jumps, so this never even needs a branch-predictor entry until it sees the unlikely case.


random thoughts: The code

if (eax < 0) { eax = 0; }
else if (eax >= r12) { eax = r12 - 1; }  // If r12 can be zero, the else matters

is equivalent to

eax = min(eax, r12-1);
eax = max(eax, 0);

r12 can't be negative, but the OP didn't say it couldn't be zero. This ordering preserves the if/else semantics. (edit: actually the OP did say you can assume r12 > 0, not just >= 0.) If we had a fast min/max in asm, we could use it here. Vector max is a single instruction, but scalar min/max takes more code.
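
A minimal scalar sketch of that min-then-max ordering with CMOV (rcx and rdx as scratch registers are my choice for illustration; it lands very close to the conditional-move answer below):

    lea    rcx, [r12-1]
    cmp    rax, rcx
    cmovg  rax, rcx         ; rax = min(rax, r12-1), signed
    xor    edx, edx
    test   rax, rax
    cmovs  rax, rdx         ; rax = max(rax, 0)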


Related code review: Fastest way to clamp an integer to the range 0-255. But that's in C, and none of the compiler-generated asm versions are optimal. Still, it provided some initial inspiration.

Also related, current clang pessimizes std::clamp into storing, selecting a pointer, and reloading. https://bugs.llvm.org/show_bug.cgi?id=47271

TODO: file missed-optimization bug reports with this clamp peephole so compilers can look for it.

For now, I have versions that branchlessly clamp to [0, limit] (a closed range at both ends, so clamping to limit instead of limit-1). That required some trickery to get it done while avoiding cmova / cmovbe (which are still 2 uops on Intel, unlike most CMOV predicates, which only read CF or some of the SPAZO flags).

# gcc -nostdlib -static testloop.S -o testloop && 
# taskset -c 3 perf stat --all-user -etask-clock:u,context-switches:u,cpu-migrations:u,page-faults:u,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u -r1 ./testloop

# or idq.mite_uops to make sure it's low

.intel_syntax noprefix
.global _start
_start:

    mov     edi, 34
    mov     esi, 100
    xor     ecx, ecx

    mov     ebp, 100000000

.p2align 6
.loop:

# ~3.10 cycles latency, with unroll or an imul to give OoO scheduling an easier time. Can be 3.3c in worse cases.
.macro  clamp0n  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovg  \dst, \high       # prepare clamped value: n>high [signed] ? high : 0
    cmovbe \dst, \n          # copy original if n <= high  [unsigned], i.e. in range
.endm

# ~4.00 cycles latency, no ILP for 2-uop cmovbe
.macro  clamp0n_rev  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovbe \dst, \n          # copy original if n <= high [unsigned]; no clamping
    cmovg  \dst, \high       # high if n>high (replacing 0), else leave orig
.endm

# ~3.00 cycles latency, only single-uop CMOV
.macro  clamp0n_rev_intel  dst, n, high
    xor    \dst, \dst
    cmp    \n, \high
    cmovb  \dst, \n          # copy original if n < high [unsigned]; no clamping.  (cmovbe is 2 uops on Intel, let next insn handle that case)
    cmovge \dst, \high       # high if n>=high (replacing 0), else leave orig.  copy on equal restores the value destroyed by cmovb
.endm

# ~3.1 to 3.3 cycle latency
.macro  clamp0n_inplace_destroy_zero  n, high, tmp
    xor    \tmp, \tmp
    cmp    \n, \high
    cmovg  \tmp, \high       # prepare clamped value: 0 or high, per signed compare.
    cmova  \n, \tmp         # if clamping needed at all, apply clamped value
.endm

# 4.0 cycles latency.
.macro  clamp0n_inplace  n, high, zero
    cmp    \n, \high
    cmova  \n, \zero        # if clamping needed at all, apply 0.  2 uops on Intel.  could be 2nd if we destroy \high?
    cmovg  \n, \high        # if signed greater than limit, clamp to high
.endm

# 3.0 cycles latency, only single uop CMOV.
.macro  clamp0n_inplace_intel  n, high, zero
    cmp    \n, \high
    cmovae  \n, \zero        # if clamping needed at all, apply 0.  (or on equal, to avoid 2-uop cmov)
    cmovge  \n, \high        # if signed greater than limit, clamp to high. (or on equal, to restore the correct value)
.endm

#define CLAMP_INPLACE clamp0n_inplace_intel
#define CLAMP_COPY    clamp0n_rev_intel

   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
   CLAMP_INPLACE  edi, esi, ecx
#   imul     edi, edi

#if 0
//#define clamp0n clamp0n_rev        // use the slow version
   CLAMP_COPY   eax, edi, esi
   and      eax, edi
   imul     edi, eax, 123
#endif

#if 0
   CLAMP_COPY  edi, eax, esi

   CLAMP_COPY  eax, edi, esi
   CLAMP_COPY  edi, eax, esi

   CLAMP_COPY  rax, rdi, rsi    # 64-bit for REX prefixes, keep dec/jnz off a 32-byte boundary so uop cache works (JCC erratum mitigation)
   CLAMP_COPY  rdi, rax, rsi
#endif

#nop  # pad the loop up to 32 total uops.  Tiny benefit on Skylake in this artificial, fully latency-bound case.
    dec ebp
    jnz .loop

    xor edi,edi
    mov eax,231   # __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       # sys_exit_group(0)
Boony answered 3/12, 2015 at 19:46 Comment(1)
Related: some answers on this C++ codereview Fastest way to clamp an integer to the range 0-255 show compiler output for some versions, including std::clamp.Boony

A common trick (compilers use it) is to make an unsigned comparison:

    cmp rax, r12
    jb done
    ...
    ...
done:

Here, if rax is negative, when interpreted as an unsigned number (by jb, "jump if below") it looks like a large number (greater than 2^63), so unsigned comparison lumps together the two "exceptional" cases (less than 0 and too big).

If the exceptional case is very rare, the performance of the code denoted by ... doesn't matter much, and the usual case contains one conditional branch, usually taken. If you want to improve it even more, you can rearrange the code like this:

    cmp rax, r12
    jae work_needed
done:
    (your code continued here)

work_needed:
    jl upper_limit_done
    lea rax, [r12 - 1]
upper_limit_done:
    test rax, rax
    jns lower_limit_done
    xor rax, rax
lower_limit_done:
    jmp done

Here, the usual path contains a branch that is usually not taken. This probably provides some minor improvement, at the expense of a slower exceptional case.

Richerson answered 3/12, 2015 at 17:12 Comment(2)
I love that unsigned-compare trick. I came up with an even more efficient sequence that I used in my answer, though. (I had a bunch of stuff written when you posted this)Boony
I learned about this trick from this question Fastest way to determine if an integer is between two integers (inclusive) with known sets of values. Strangely enough, even not-very-old compilers don't use that trick.Epiglottis

Edit: as pointed out by Peter Cordes, CMOVxx makes things a lot shorter. I am including my original answer below to show that it can indeed be done without CMOVxx.

It can be done without branches using conditional moves:

lea   rbx, [r12-1]
cmp   rax, r12
cmovge rax, rbx       ; clamp to r12-1 if it was too high
xor   rbx, rbx
test  rax, rax
cmovl rax, rbx        ; clamp to 0 if it was negative

Or optimized to only compare once, using the same signed + unsigned compare trick from Peter's answer:

xor    edx, edx
lea    rcx, [r12-1]
cmp    rax, r12
cmovae rax, rdx        ; clamp to 0 if unsigned >= r12
cmovge rax, rcx        ; clamp to r12-1 if signed >= r12, replacing 0. Else leave orig RAX

If rax >=(unsigned) r12, the value definitely needs to be clamped, to either 0 or r12-1.

For negative RAX inputs, only the ae condition will be true, not ge, so the final result is 0. But for signed-positive inputs that are too high, ge will be true. It doesn't matter that ae was also true, the 2nd cmov will overwrite the 0. Otherwise (for negative RAX), it will leave the 0 alone.

r12 is known to be signed positive so ge being true implies ae being true.
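
A quick trace with hypothetical values (say r12 = 100, so rcx = 99 and rdx = 0 after the setup above):

    ; rax = 42  : below 100 unsigned and signed   -> neither cmov fires               -> 42
    ; rax = 250 : >= 100 unsigned and signed      -> cmovae writes 0, then cmovge writes 99
    ; rax = -7  : huge unsigned, but < 100 signed -> only cmovae fires                -> 0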

On a CPU with 1-cycle latency cmovae / ge (Broadwell and later, and AMD: https://uops.info/), this has 3-cycle latency from RAX input to RAX output. (cmova is 2 uops, 2 cycle latency on modern Intel from reading both ZF and CF, but fortunately we don't need it.)


Original answer using old-school (1990s) techniques:

There is a way to do it without any branches. It is a bit more involved because every possible outcome is calculated, even if it is not used. How it compares performance-wise to branching code I don't know, but there is no branch penalty.

The trick to avoid branching is to turn the carry flag (CF) into a bitmask by using subtract-with-borrow of a register from itself. After that, you can use that register as an and-mask to select the desired outcome.

; trick #1: rbx made 0 when rax negative, else 0xFFFFFFFFFFFFFFFF

mov rbx, rax
shl rbx, 1      ; negative => CF
cmc             ; CF 0 when rax negative
sbb rbx, rbx    ; this sets CF into all bits of rbx

; trick #2: rcx made 0xFFFFFFFFFFFFFFFF if less than r12, or else 0
; note that r12 should be positive for this to make sense

cmp rax, r12    ; CF 1 when rax < r12 (unsigned compare)
sbb rcx, rcx    ; this sets CF into all bits of rcx

and rax, rcx    ; use rcx mask on rax

; calculate highest allowed value
mov rdx, r12
dec rdx
not rcx         ; invert rcx mask
and rdx, rcx    ; apply to replacement value

or rax, rdx
and rax, rbx
Sly answered 21/12, 2020 at 14:22 Comment(4)
I don't think mov/shl/cmc/sbb/and is better than xor-zero / test/cmovns on any existing CPUs. Same for the high limit, you're spending a lot of instructions to manually blend when cmp / cmovg would work.Boony
or rax, rax is worse than test rax,rax - Test whether a register is zero with CMP reg,0 vs OR reg,reg? explains why: unnecessarily lengthens the critical-path dependency chain through RAX by another cycle. Even if this is on a P6-family CPU, the surrounding instructions write RAX so an extra write won't be needed to avoid register-file read stalls. Also, I think you can borrow the trick from my answer of using signed less-than and unsigned above to only need cmp once. (zero a register ahead of the first compare).Boony
I wanted to upvote this answer because you got most of the way to a good version, so I went ahead and fixed that or vs. test missed-optimization. I also added a version optimized with the unsigned + signed compare trick from my answer. Feel free to reword the text / code into your own words.Boony
Could the opening salvo be replaced with cqd / not edx? cqd is not listed in uops.info, but if it's similar to cdq, then this would seem to be ahead.Gretna

Fewer jumps, cleaner I think:

xor  rdx, rdx
test rax, rax
js   OK
lea  rdx, [r12 - 1]
cmp  rax, r12
jge  OK
mov  rdx, rax
OK:
mov  rax, rdx
A1 answered 3/12, 2015 at 16:40 Comment(4)
Mayhaps that wasn't clear in my pseudo code - but instead of lea rdx, [r12 - 1] I think this line should be mov rdx, r12 and dec rdxLahey
@Rames: /facepalm. Those have identical results, except LEA doesn't set flagsBoony
@PeterCordes is invoking his Inner Picard today ;)Gummous
lea rdx, [r12 - 1] is Load Effective Address. In this particular case it computes an address by taking the value of r12 (without actually modifying it), subtracting 1, and then storing that value in rdx. LEA doesn't actually access any memory (and doesn't set any flags); it simply computes values.Gummous
