RISC-V branchless coding

On Intel AVX you can write branchless code: instead of branching for case 0 or case 1, you compute both cases and blend the results based on a condition.

AVX does this for eight floats at a time using the vblendvps instruction.

You can also do this in a scalar way, without vectors, using the x86 CMOVcc instruction, which performs a move conditionally.

NOTE: ARM has CSEL and NEON has VBSL.

Can RISC-V (RV64) do a conditional scalar move like this, so that you do not have to branch for

a = c ? x : y;

As I understand it, RISC-V implementations are typically in-order, so they would benefit even more than x86 from not having to branch. (The latter can at least shuffle some instructions around, and even execute speculatively past the branch to hide latency.)

The closest thing I can find w.r.t. branchless operation for RISC-V is SLT (Set Less Than), but that sets its destination to 1 or 0, which would then need multiplications? Wouldn't it be more useful to have SLT set it to -1 or 0 instead, so that we could AND with it?

UPDATE

When doing:

int foo(int a, int b, int x, int y)
{
    return a < b ? x : y;
}

I tried a poor man's version of branchless code using SLT. I am not sure I did it completely right; using the bitmask 0 - condition (where the condition is 0 or 1), I came up with:

branchless:
    SLT t0,a0,a1        # t0 = (a < b) ? 1 : 0
    SUB t0,zero,t0      # t0 = 0 - t0: all-ones mask if a < b, else 0
    NOT t1,t0           # t1 = inverted mask
    AND t0,a2,t0        # t0 = x & mask
    AND t1,a3,t1        # t1 = y & ~mask
    OR  a0,t0,t1        # a0 = (a < b) ? x : y
    RET
    .size   branchless, .-branchless

as the branchless version of:

branched:
    BGE a0,a1,.L2       # if (a >= b) keep y in a3
    MV  a3,a2           # else a3 = x
.L2:
    MV  a0,a3           # return value
    RET
    .size   branched, .-branched

I wonder if I used too many instructions for this. On random data I measured the branching version to be slightly faster than the non-branching one, but not by much.

Pomposity answered 22/5, 2022 at 19:27 Comment(5)
and then would need multiplications? - or sub / and on the opposite condition. Yes, for bithacks and branchless stuff, 0 / -1 would be more useful. But since C implementations typically use a bool whose object representation must be 0 / 1 to allow cheaper conversion to int, that's what MIPS and RISC-V did for their compare-into-register instructions. (And/or possibly other reasons.)Swagman
docs.boom-core.org/en/latest/sections/intro-overview/boom.htmlHampden
@Bram: To get your 0/-1 mask, you do mask = (c==0)-1. Like x86 test/setcc / dec. It might or might not actually be worth doing on RISC-V, depending on the microarchitecture and how unpredictable it is.Swagman
@PeterCordes Ah, got it. Thanks again. I will try to bench that against branch on the Sipeed Lichee RV86 that I have here.Pomposity
They left out shift and add (scaling addition) from the original, apparently thinking that compressed instructions plus fusing would do the job. (The compressed instructions are limited to a subset of registers.)Hampden

Update: see sh1's answer for the current situation: there's a conditional-zero instruction, like a cmov with x0 hard-wired as one source. The full cmov was dropped from the proposal before extension B made it to v1.0 (and extension B was split into some separate parts). An article has some details and links on the situation as of mid 2023.

Current compilers no longer support b as a single-letter extension name either.


The proposed RISC-V extension B includes cmov (with 4 operands: 3 inputs and a separate destination!). (Version 0.93 was current when the rest of this answer was written.)

I think David Patterson (one of the lead architects behind MIPS and RISC-V) really dislikes cmov (along with short-vector SIMD like SSE/AVX) and thinks CPUs should specially handle "hammock" branches (that jump forward over a single instruction like a move) if they want to do that. Something like that. So this seems to be a case of philosophical purity getting in the way of including useful instructions. (AArch64 is a much more pragmatic design, still being RISC in the ways that matter for a high-performance implementation.)

And/or perhaps a desire to limit instructions to at most 2 inputs, if there aren't any other 3-input instructions. That means a scalar pipeline only needs 2 register read ports, not 3, if it strictly follows this restriction. (That also means no add-with-carry, making extended-precision math quite a pain for numbers wider than 2 registers, when you have to deal with carry-in and carry-out to the same add operation.)
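To make that pain concrete (my sketch, not part of the original answer), a double-width add without add-with-carry has to materialize each carry with an unsigned compare, which is what sltu ends up doing on RISC-V:

#include <stdint.h>

// 128-bit add from 64-bit limbs without an add-with-carry instruction.
// The carry-out of the low limb is an unsigned compare (sltu on RISC-V);
// once a limb needs both a carry-in and a carry-out, the sequence grows fast.
void add128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi,
            uint64_t *r_lo, uint64_t *r_hi)
{
    uint64_t lo = a_lo + b_lo;
    uint64_t carry = lo < a_lo;   // 1 if the low-limb add wrapped around
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}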

You can emulate cmov as you say with a mask for AND/ANDnot/OR, but that takes quite a few instructions and is usually not worth it except possibly on wide and deep out-of-order machines, where the amount of work discarded by a branch miss is a lot bigger. (mask = (c == 0) - 1; which you can do with sltiu / addi reg,reg,-1 to turn 0 into -1 and 1 into 0.)
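As a concrete sketch of that trick (my code, not from the original answer), here it is spelled out in C; a compiler targeting baseline RV64 can lower it to roughly sltiu/addi plus the AND/NOT/AND/OR sequence from the question:

// Branchless select using mask = (c == 0) - 1: the mask is all-ones when c
// is true and 0 when it is false, so the two ANDs and the OR pick one arm.
static inline int select_branchless(int c, int x, int y)
{
    unsigned mask = (c == 0) - 1u;          // 0 or 0xFFFFFFFF
    return (int)((x & mask) | (y & ~mask));
}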

You kind of have it backwards in terms of which kind of microarchitecture benefits more from CMOV, although there are potential benefits either way. And an in-order machine already kind of has to wait at a conditional branch for the condition to resolve, vs. an out-of-order machine treating control dependencies very differently from data dependencies. As discussed in gcc optimization flag -O3 makes code slower than -O2, data dependencies through cmov can create a loop-carried dependency chain that's a bigger bottleneck than highly predictable branches.
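A sketch of the kind of loop where that matters (my example, in the spirit of that Q&A): with a conditional select, each new maximum depends on the previous one, so the loop is limited by the select's latency; with a well-predicted branch, out-of-order execution can run ahead.

// Running-maximum loop: compiled to a select/cmov, the update of m is a
// loop-carried data dependency; compiled to a branch that is usually
// predicted correctly (new maxima become rare), iterations can overlap.
int running_max(const int *a, int n)
{
    int m = a[0];
    for (int i = 1; i < n; i++)
        m = (a[i] > m) ? a[i] : m;
    return m;
}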

There are some out-of-order exec RISC-V designs, maybe even some that are open-source. For example, Erik Eidt linked The Berkeley Out-of-Order Machine (BOOM).


Extension B: where they put all the fun instructions they left out

The RISC-V extension B proposal has a conditional move, along with scalar min/max, popcount, leading/trailing zero count, bitfield insert/extract, two-register shifts, and a bunch of more esoteric stuff. https://five-embeddev.com/riscv-bitmanip/draft/bext.html#conditional-move-cmov

Looking at the list of proposed instructions, it's amazing what got left out of baseline RISC-V, like sign-extension of narrow integers (currently requires slli/srai) if it's not already guaranteed by the calling convention or a load instruction, and standard stuff like popcount and leading/trailing zero count that most ISAs have.

Godbolt shows clang 12.0 using cmov, min, and sext.b. In that clang version, -O3 -Wall -menable-experimental-extensions -march=rv32gcb0p93 was the magic incantation to do that. Extension B 0.93 is enabled by the b0p93 part of the string. (Extension B isn't finalized, and I don't know what version clang 14.0 was looking for; its error message wasn't helpful, and just plain -march=rv32gcb didn't get the compiler to actually use cmov.)

//  -march=rv32gcb0p93 includes extension b 0.93 (0p93)

int sel(int x, int y, int c){
    return c ? x : y;
}
# extension B  clang
        cmov    a0, a2, a0, a1
        ret

# baseline gcc11.3  (clang and GCC12 waste several mv instructions)
        bne     a2,zero,.L2
        mv      a0,a1
.L2:
        ret
int min(int x, int y, int c){
    return (x<y) ? x : y;
}
# extension B  clang
        min     a0, a0, a1
        ret

# baseline gcc
        ble     a0,a1,.L5
        mv      a0,a1
.L5:
        ret
int sext(int c){
    return (signed char)c;
}
# extension B  clang
        sext.b  a0, a0
        ret

# baseline gcc
        slli    a0,a0,24
        srai    a0,a0,24
        ret
Swagman answered 22/5, 2022 at 22:48 Comment(10)
Thanks for the elaborate answer. A great find! I guess the lack of conditional moves really starts to hurt when doing SIMD, because you can't branch differently for each element. I guess you would really need that B extension when you use vectors?Pomposity
@Bram: Existence of scalar cmov or not is unrelated to what SIMD instructions are available. RISC-V vector extensions (which allow hardware to provide whatever actual vector length it wants, like ARM SVE) I think have first-class masking support like AVX-512, at least for loads/stores. And I think they have compare-into-mask. There was also a RISC-V proposed extension with short-vector SIMD; IDK if it had an instruction like x86 vblendvps or if you'd have to emulate it like x86 used to need before SSE4.1 with and/andn/or.Swagman
Things have changed a bit since this was written. I posted my own answer so it's available, but if you want to update this then here's a reminder.Cathodoluminescence
@sh1: Thanks for the update. I think I'll leave my answer alone for historical interest, at most adding a note that it's outdated and pointing them at yours. It's not like a sub-optimal strategy or something, it just plain won't work on newer toolchains so people will have to find your answer for newer GCC or clang versions. Interesting ISA-design choice that neatly brings it back to a 3-operand instruction. On an ISA with a zero register, selecting vs. zero was something the old cmov could already do without any setup, so it's strictly less powerful, but yeah, equally efficient in many cases.Swagman
The design is in line with your optimisation in the -O2 versus -O3 question. I can't claim to know what went on in meetings but I do see it as a strength that the instruction offers an explicit and unambiguous fast path so everybody knows which idioms to lean into.Cathodoluminescence
So really, it all boils down to the original RISC-V design decision to not have a flags register. If you have one, as x86 and ARM do, then cmov only needs two inputs (with flags implicit), and doesn't need any special encodings, extra data paths, etc.Threewheeler
@NateEldredge: x86 CMOV has only two explicit inputs, but it decodes to 2 uops on Intel CPUs before Broadwell because microarchitecturally it has 3 inputs (2 registers, and FLAGS from a third physical register file entry; a SnB-family PRF entry stores FLAGS and/or the integer reg output from the same instruction.) Only with the addition of FMA requiring the scheduler to track 3-input uops did it become possible in the next generation to do that for 3-input integer ops like adc and cmov as well. (Actually cmova and cmovbe have 4 inputs so 2 uops: both CF and the SPAZO group, no merging.)Swagman
@NateEldredge even Arm leans away from using condition wherever possible. Aarch64 has very few instructions to set or use flags. Basically as few as they could get away with without upsetting the ecosystem.Cathodoluminescence
@sh1: AArch64's branchless instructions are very powerful, much better than x86's. Since AArch64 has a zero register, it can conditionally zero, or with csinc can materialize a 0/1. Classic ARM spends 4 bits of every instruction word on a predicate, leaning as far in on branchless conditional execution as possible. (And Thumb can predicate with it). Which means you even have conditional store and load that suppress faults on bad pointers, not just ALU select like A64 and most others. (x86 w. APX is introducing conditional load/store with fault suppression, like masked SIMD loads/stores.)Swagman
David Patterson influenced Sparc (that is why they got register windows). Hennessy architected MIPS. RISCV definitely follows MIPS more closely. I do not know either of their positions on conditional moves.Ochoa

OK, cmov didn't make it.

Right now you'll need to look at the Zicond extension to get the instructions czero.eqz and czero.nez. Both take a value operand and a condition operand: czero.eqz rd, rs1, rs2 returns zero when the condition rs2 is zero and rs1 otherwise, and czero.nez does the same with the sense of the condition inverted.

For example:

int cmov(bool c, int x, int y) {
    return c ? x : y;
}

gives:

cmov(bool, int, int):                             # @cmov(bool, int, int)
        czero.nez       a2, a2, a0    # a2 = c ? 0 : y
        czero.eqz       a0, a1, a0    # a0 = c ? x : 0
        or      a0, a0, a2            # a0 = c ? x : y
        ret

Obviously this looks a lot better when one of the operands is constant zero, which is fairly common. And if you're looking at something like c ? x : (x + y), that can become x + (c ? 0 : y).
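For instance (my sketch, with the register assignment c in a0, x in a1, y in a2 assumed as in the example above), the rewritten form leaves only one czero plus an add:

// Same value as c ? x : (x + y); expected codegen is roughly
//   czero.nez a2, a2, a0   # a2 = c ? 0 : y
//   add       a0, a1, a2   # a0 = x + (c ? 0 : y)
int select_add(int c, int x, int y)
{
    return x + (c ? 0 : y);
}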

To enable this optimisation in clang right now requires: -menable-experimental-extensions -march=rv64gc_zicond1p0

Once everything is settled I suppose this will become: -march=rv64gc_zicond

If you already have a setting for -march=, just tack _zicond1p0 onto the end of it.

In the SIMD space (-march=rv64gcv) you have __riscv_vmerge_*() intrinsics.

What did survive in the B extension family is min/max. You get access to these with -march=rv64gc_zbb, and aside from the obvious uses, you can sometimes refactor things to use them as masking operations.
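A small example of that kind of refactor (mine, not from the answer): a clamp is a select with a constant arm, which min/max cover directly.

// With -march=rv64gc_zbb a compiler can lower this to a single
// "max a0, a0, zero" instead of a branch or a mask sequence.
int clamp_at_zero(int x)
{
    return x > 0 ? x : 0;
}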

Cathodoluminescence answered 22/9, 2023 at 22:45 Comment(2)
This answer deserves upvotes, too, people! Since this was posted, three people upvoted my answer (hopefully because of the computer-architecture background on conditional moves), but mine is still the only upvote on this one that actually answers the question for current RISC-V!Swagman
That's the bug of smaller communities here on stackexchange/overflow -- only the first or topmost answer gets the majority of upvotes.Elery
