Let's attempt to define a function that returns the maximum of two values x and y. A sufficient condition for the formulas presented later to be valid is that, for signed integers, -2^30 <= x, y <= 2^30 - 1, and, for unsigned integers, 0 <= x, y <= 2^31 - 1 (i.e., they are only required to work on a reduced integer range of effectively one bit less). The most universal (and straightforward) implementation would be:
int max(int x, int y) {
    return (x > y) ? x : y;
}
Up until the Pentium Pro, this would be the generated assembly in x86:
max: # GCC -O3 -march=pentium -m32
mov eax, DWORD PTR [esp+8]
cmp eax, DWORD PTR [esp+4]
jge .L4
mov eax, DWORD PTR [esp+4]
.L4:
ret
But starting with the Pentium Pro, a new instruction, CMOVcc r32, r/m, was introduced, which performs a conditional move depending on some specified status flags:
The CMOVcc instructions check the state of one or more of the status flags in the EFLAGS register (CF, OF, PF, SF, and ZF) and perform a move operation if the flags are in a specified state (or condition). A condition code (cc) is associated with each instruction to indicate the condition being tested for. If the condition is not satisfied, a move is not performed and execution continues with the instruction following the CMOVcc instruction. These instructions can move a 16- or 32-bit value from memory to a general-purpose register or from one general-purpose register to another. Conditional moves of 8-bit register operands are not supported. The terms "less" and "greater" are used for comparisons of signed integers and the terms "above" and "below" are used for unsigned integers.
This allows the creation of a branchless maximum function (if you don't know why a branchless implementation is generally a good thing, you might need to read a little bit about instruction pipelining):
max: # GCC -O3 -march=pentiumpro -m32
mov edx, DWORD PTR [esp+4]
mov eax, DWORD PTR [esp+8]
cmp eax, edx
cmovl eax, edx
ret
With the function calling conventions of x86-64, this is further simplified to:
max: # GCC -O3
cmp esi, edi
mov eax, edi
cmovge eax, esi
ret
Are there other architectures whose compilers produce branchless max/min functions? Well, here's the code for ARMv7:
max: # CLANG -O3
cmp r0, r1
movle r0, r1
bx lr
ARMv8-A (AArch64) has a quite interesting instruction, csel:
max: # GCC -O3
cmp w1, w0
csel w0, w1, w0, ge
ret
What about RISC-V? Well, there seems to be no equivalent of conditional moves, so the compiler generates the basic branch-based version:
max: # rv32gc clang -O3
blt a1, a0, .LBB1_2
mv a0, a1
.LBB1_2:
ret
Can we do better? In Hacker’s Delight, Warren presents the following formula, which works in the aforementioned reduced integer range (one bit less) without any branches. In C:
int max_v1(int x, int y) {
    return x - ((x - y) & ((x - y) >> 31));
}
Here, >> stands for an arithmetic (signed) shift right. The hand-written RISC-V assembly for this function is just four instructions:
max_v1: # hand-written
sub a1, a0, a1
srai a2, a1, 31
and a1, a1, a2
sub a0, a0, a1
ret
It’s interesting to analyse how this is compiled for x86-64. Because the function only works in the reduced range defined above, it is not exactly equivalent to the maximum, and the optimiser doesn’t produce quite what we might expect:
max_v1: # Clang -O1
mov eax, edi
xor ecx, ecx
mov edx, edi
sub edx, esi
cmovns edx, ecx
sub eax, edx
ret
It’s branchless all right, though it depends on the conditional move. How does this compile for RISC-V?
max_v1: # rv32gc clang -O3
sub a2, a0, a1
bltz a2, .LBB0_2
mv a1, a0
.LBB0_2:
mv a0, a1
ret
Wait, what? We have a branch! Four instructions and a branch. Why? Let’s take a peek at the unoptimised version:
max_v1: # rv32gc clang -O0
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
addi s0, sp, 16
sw a0, -12(s0)
sw a1, -16(s0)
lw a0, -12(s0)
lw a1, -16(s0)
sub a1, a0, a1 # <
srai a2, a1, 31 # <
and a1, a1, a2 # <
sub a0, a0, a1 # <
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
ret
If you ignore all the stack manipulation (present because the compiler isn’t optimising the use of the a registers for function parameters), the four lines marked with # < are precisely the code we produced by hand. Why, oh why, is the optimiser, in all its wisdom, introducing a branch?
P.S. I have a theory, but I would first love to hear from the experts.
Comments:

(…) pcmpeqb against zero, useful for strlen/strcpy), single-bit clear/extract/invert like x86 bts/btr reg, reg|imm, shift-left-and-add like x86 lea reg, [reg+reg*2, 4, or 8]) – Flattish

(…) cmov, but my answer on RISC-V branchless coding shows clang -march=rv32gcb0p93 using cmov (extension C and B version 0.93). raw.githubusercontent.com/riscv/riscv-bitmanip/master/… (B version 0.94) shows cmov as part of its Zbt subset. And it seems extension B has many more parts, including byte and nibble shuffles with a variable control, like pshufb. – Flattish