Why is rv32gc optimising branchless code with branches for RISC-V?
Let's attempt to define a function that returns the maximum of two values x and y. (For the branchless formulas presented later, a sufficient condition for validity is that, for signed integers, –2^30 <= x, y <= 2^30 – 1, and for unsigned integers, 0 <= x, y <= 2^31 – 1; i.e., they are only required to work on an integer range effectively one bit smaller.) The most universal (and straightforward) implementation would be:

int max(int x, int y) {
    return (x > y) ? x : y;
}

Up until the Pentium Pro, this would be the assembly generated for x86:

max:  # GCC -O3 -march=pentium -m32
  mov eax, DWORD PTR [esp+8]
  cmp eax, DWORD PTR [esp+4]
  jge .L4
  mov eax, DWORD PTR [esp+4]
.L4:
  ret

But starting with the Pentium Pro, a new instruction CMOVcc r32, r/m32 was introduced that performs a conditional move depending on some specified status flags:

The CMOVcc instructions check the state of one or more of the status flags in the EFLAGS register (CF, OF, PF, SF, and ZF) and perform a move operation if the flags are in a specified state (or condition). A condition code (cc) is associated with each instruction to indicate the condition being tested for. If the condition is not satisfied, a move is not performed and execution continues with the instruction following the CMOVcc instruction. These instructions can move a 16- or 32-bit value from memory to a general-purpose register or from one general-purpose register to another. Conditional moves of 8-bit register operands are not supported. The terms "less" and "greater" are used for comparisons of signed integers and the terms "above" and "below" are used for unsigned integers.

This allows the creation of a branchless maximum function (if you don't know why a branchless implementation is generally a good thing, you might need to read a little bit about instruction pipelining):

max:  # GCC -O3 -march=pentiumpro -m32
  mov edx, DWORD PTR [esp+4]
  mov eax, DWORD PTR [esp+8]
  cmp eax, edx
  cmovl eax, edx
  ret

With the function calling conventions of x86-64, this is further simplified to:

max:  # GCC -O3 
  cmp esi, edi
  mov eax, edi
  cmovge eax, esi
  ret

Are there other architectures whose compilers produce branchless max/min functions? Well, here's the code for ARMv7:

max:  # CLANG -O3
  cmp r0, r1
  movle r0, r1
  bx lr

ARMv8-A and ARM64 have quite an interesting instruction:

max:  # GCC -O3
  cmp w1, w0
  csel w0, w1, w0, ge
  ret

What about RISC-V? Well, there seems to be no equivalent of conditional moves, so the compiler generates the basic branch-based version:

max:  # rv32gc clang -O3
  blt a1, a0, .LBB1_2
  mv a0, a1
.LBB1_2:
  ret

Can we do better? In Hacker’s Delight, Warren presents the following formula, which works in the aforementioned range (i.e., on an integer range effectively one bit smaller) without any branches. In C:

int max_v1(int x, int y) {
    return (x - ((x-y) & ((x-y) >> 31)));
}

where >> stands for an arithmetic (signed) right shift. The hand-written RISC-V assembly for this function is just 4 instructions:

max_v1:  # hand-written
  sub a1, a0, a1
  srai a2, a1, 31
  and a1, a1, a2
  sub a0, a0, a1
  ret

It’s interesting to analyse how this is compiled for x86-64. Because the function only works in the defined range, it is not exactly equivalent to the maximum. But the optimiser doesn’t quite produce what we would expect:

max_v1:  # Clang -O1
  mov eax, edi
  xor ecx, ecx
  mov edx, edi
  sub edx, esi
  cmovns edx, ecx
  sub eax, edx
  ret

It’s branchless alright, though it depends on the conditional move. How is this compiled for RISC-V?

max_v1:  # rv32gc clang -O3
  sub a2, a0, a1
  bltz a2, .LBB0_2
  mv a1, a0
.LBB0_2:
  mv a0, a1
  ret

Wait, what? We have a branch! Four instructions and a branch. Why? Let’s take a peek at the unoptimised version:

max_v1:  # rv32gc clang -O0
  addi sp, sp, -16
  sw ra, 12(sp)
  sw s0, 8(sp)
  addi s0, sp, 16
  sw a0, -12(s0)
  sw a1, -16(s0)
  lw a0, -12(s0)
  lw a1, -16(s0)
  sub a1, a0, a1   # <
  srai a2, a1, 31  # <
  and a1, a1, a2   # <
  sub a0, a0, a1   # <
  lw ra, 12(sp)
  lw s0, 8(sp)
  addi sp, sp, 16
  ret

If you ignore all the stack manipulation (at -O0 the compiler isn’t keeping the function parameters in the a registers), the four lines marked with # < are precisely the code we produced by hand. Why, oh why, is the optimiser, in all its wisdom, introducing a branch?


P.S. I have a theory, but I would first love to hear from the experts.

Unfit answered 11/3, 2022 at 0:37 Comment(3)
Compilers are sometimes able to disentangle bithack idioms back into some internal representation of the logic, and then materialize it again in a "straightforward" way (like how they see through xor-swap). On RISC-V, that means with a branch for this. That may be what's going on. (Flattish)
RISC-V extension B (five-embeddev.com/riscv-bitmanip/draft/bitmanip.html) has a conditional move, and a bunch of other standard bit-manipulation stuff like popcnt, clz / ctz, rotates, byte-reverse, and sign-extend byte and halfword. And fun stuff like andn, xorn, orn, carryless multiply, min/max, horizontal OR inverted within bytes (like pcmpeqb against zero, useful for strlen/strcpy), single-bit clear/extract/invert like x86 bts/btr reg, reg|imm, and shift-left-and-add like x86 lea reg, [reg+reg*2,4,or 8]. (Flattish)
Hrm, that link I found doesn't mention cmov, but my answer on RISC-V branchless coding shows clang -march=rv32gcb0p93 using cmov. (Extension C and B version 0.93.) raw.githubusercontent.com/riscv/riscv-bitmanip/master/… (B version 0.94) shows cmov as part of the Zbt subset. And it seems extension B has many more parts, including byte and nibble shuffles with a variable control, like pshufb. (Flattish)

Clang makes this transform for x86-sans-cmov as well: try clang -m32 -march=i486 -O2. It results from how the LLVM optimization pipeline works.

For such small testcases, you can easily inspect how the compiler transforms the code on your own. On Compiler Explorer, select "Add New > LLVM Opt Pipeline" in the compiler pane; it will spawn a new pane showing the sequence of compiler passes, with passes that changed the code highlighted in green: https://godbolt.org/z/8jEq1nxGY

You can see that LLVM's InstCombine pass recognized the sequence of instructions

%sub = sub nsw i32 %x, %y
%shr = ashr i32 %sub, 31
%and = and i32 %sub, %shr
%sub2 = sub nsw i32 %x, %and
ret i32 %sub2

and replaced it with

%sub2 = call i32 @llvm.smax.i32(i32 %x, i32 %y)
ret i32 %sub2

Then, at the beginning of the machine code generation passes, the smax intrinsic was expanded into a branchy sequence.

Note that unlike Clang, GCC 12 emits the desired branchless code (it doesn't pattern-match this expression):

sub     a1,a0,a1
srai    a5,a1,31
and     a5,a5,a1
sub     a0,a0,a5
ret
Seiter answered 10/10, 2022 at 8:58 Comment(2)
CMOVcc was added to x86 with the i686 (Pentium Pro), so obviously if you compile for i486 it can't use those instructions. (Waterlog)
@phuclv: yes, I'm well aware, and that's exactly the point: the behaviour is not specific to rv32gc; it is reproducible on other architectures where LLVM lacks a branchless expansion for min/max. (Seiter)

© 2022 - 2024 — McMap. All rights reserved.