Convert between big-endian and little-endian on RISC-V

Asked 30/8, 2018 at 14:33 Answered 14/3, 2020 at 8:25

What is the simplest way to work with big-endian values in RISC-V at the assembly language level? That is, how to load a big-endian value from memory into a register, work with the register value in native-endian (little-endian), then store it back into memory in big-endian. 16, 32 and 64 bit values are used in many network protocols and file formats.

I couldn't find a byte-swap instruction (equivalent to BSWAP on x86 or REV on ARM) in the manual, nor anything about big-endian loads and stores.

Longitudinal answered 30/8, 2018 at 14:33 Comment(9)

Have you tried checking the documentation? This is really a question that can be answered by a quick read of the relevant document. – Heterochromous 30/8, 2018 at 14:34

Yes, multiple sources. Byte swap is such a common operation that I thought I must have missed something, that's why I'm asking here. – Longitudinal 30/8, 2018 at 14:36

If it's not in the spec it probably isn't there. – Heterochromous 30/8, 2018 at 14:43

I tried to check what compilers do when you ask for a byte-swap, but Godbolt's clang risc-v install is broken and tries to use x86 inline asm for endian.h be32toh(). godbolt.org/z/6MzVWa. Maybe writing pure C that compilers could recognize as a byte-swap would work, but wouldn't prove the non-existence of an instruction. – Tjaden 30/8, 2018 at 14:58

Thanks for the effort, Peter. I reworked the question into a how-to question instead of asking specifically about an instruction made for the purpose. Hope that's better. – Longitudinal 30/8, 2018 at 15:10

@Longitudinal Fair enough. Downvote retracted. – Heterochromous 30/8, 2018 at 15:47

This can be done with a single instruction with the XBitmanip Extension, that's not core and that's not even a finalized extension. Is that within the scope of this question anyway? – Callow 30/8, 2018 at 16:8

@harold Yeah, just noticed that too. I think it is within scope and worth keeping in mind. Since the extension deals with something as basic as bit manipulation it could be extremely widely adopted once finalized, which would make it a viable solution for most purposes. – Longitudinal 30/8, 2018 at 16:16

I was looking into this, investigating code quality... See gcc.godbolt.org/z/dwMH9S which shows generated code for several approaches, dealing with 64-bit stores. – Culmination 14/12, 2018 at 17:36

There is no mention of a byte-swap instruction in the latest RISC-V User-Level ISA Manual (version 2.1). However, the manual has a placeholder for “B” Standard Extension for Bit Manipulation. Some draft materials from that extension's working group are collected on GitHub. In particular, the draft specification talks about a grev instruction (generalized reverse) that can do 16, 32 and 64-bit byte-swaps:

This instruction provides a single hardware instruction that can implement all of byte-order swap, bitwise reversal, short-order-swap, word-order-swap (RV64), nibble-order swap, bitwise reversal in a byte, etc, all from a single hardware instruction. It takes in a single register value and an immediate that controls which function occurs, through controlling the levels in the recursive tree at which reversals occur.
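
To make the "levels in the recursive tree" idea concrete, here is a rough C model of a 32-bit generalized reverse, written along the lines of the reference code in the draft Bitmanip specification. The function name `grev32` and the control values mentioned in the comments are illustrative assumptions, not a statement of what any ratified instruction does.

```c
#include <stdint.h>

/* Software model (a sketch) of a 32-bit generalized reverse.
 * Each bit of the control value k enables one level of swapping:
 * bit 0 swaps adjacent bits, bit 1 adjacent bit pairs, bit 2 adjacent
 * nibbles, bit 3 adjacent bytes, bit 4 the two half-words. */
uint32_t grev32(uint32_t x, uint32_t k)
{
    k &= 31u;
    if (k & 1u)  x = ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    if (k & 2u)  x = ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    if (k & 4u)  x = ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    if (k & 8u)  x = ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    if (k & 16u) x = ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}
```

In this model a control value of 24 (levels 8 and 16) gives a 32-bit byte swap, 31 gives a full bit reversal, and 7 reverses the bits inside each byte.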

~~The extension B working group was "dissolved for bureaucratic reasons in November 2017" before they could finalize the spec.~~

In 2020 the working group is active again, posting their work at the linked GitHub repo.

As a result, there currently doesn't seem to be anything simpler than doing the usual shift-mask-or dance. I couldn't find any assembly language bswap intrinsic in the GCC or clang riscv ports. As an example, here's a disassembly of the bswapsi2 function (which byte-swaps a 32-bit value) emitted by the riscv64-linux-gnu-gcc compiler version 8.1.0-12:

000000000000068a <__bswapsi2>:
 68a:   0185169b                slliw   a3,a0,0x18
 68e:   0185579b                srliw   a5,a0,0x18
 692:   8fd5                    or      a5,a5,a3
 694:   66c1                    lui     a3,0x10
 696:   4085571b                sraiw   a4,a0,0x8
 69a:   f0068693                addi    a3,a3,-256 # ff00 <__global_pointer$+0xd6a8>
 69e:   8f75                    and     a4,a4,a3
 6a0:   8fd9                    or      a5,a5,a4
 6a2:   0085151b                slliw   a0,a0,0x8
 6a6:   00ff0737                lui     a4,0xff0
 6aa:   8d79                    and     a0,a0,a4
 6ac:   8d5d                    or      a0,a0,a5
 6ae:   2501                    sext.w  a0,a0
 6b0:   8082                    ret
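
For readers who prefer C to disassembly, the same shift-mask-or pattern can be sketched as follows. The function name `bswap32_shift_mask` is made up for this example; the masks correspond to the 0xff00 and 0xff0000 constants built by the lui/addi instructions above.

```c
#include <stdint.h>

/* Byte-swap a 32-bit value using only shifts, masks and ORs. */
static inline uint32_t bswap32_shift_mask(uint32_t x)
{
    return (x << 24)                  /* byte 0 -> byte 3 */
         | ((x << 8) & 0x00FF0000u)   /* byte 1 -> byte 2 */
         | ((x >> 8) & 0x0000FF00u)   /* byte 2 -> byte 1 */
         | (x >> 24);                 /* byte 3 -> byte 0 */
}
```

Applying the function twice returns the original value, which makes for an easy sanity check.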

Longitudinal answered 30/8, 2018 at 16:42 Comment(1)

As of early 2020, the "B" extension is not dead, i.e. it moved to github.com/riscv/riscv-bitmanip - the old repository's README was updated in March 2019 with the note: 'RISC-V XBitmanip is now the official RISC-V Bitmanip draft' – Floodgate 11/2, 2020 at 19:54

The RISC-V ISA has no explicit byte-swapping instructions. Your best bet is to use a C builtin to perform this calculation, which in GCC land would be something like __builtin_bswap32(). This gives the compiler the most information possible so it can make good decisions. With the current set of defined ISAs you'll almost certainly end up calling into a routine, but if a B extension is ever defined you will transparently get better generated code. The full set of defined builtins is available online: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html .

If you're stuck doing this in assembly, then your best bet is to call into an existing byte swap routine. The canonical one for a 32-bit swap is __bswapsi2, which is part of libgcc -- you're probably using that anyway, so it'll be around. That's what the compiler currently does so all you're losing is eliding the function call when there's a better implementation available.

As a concrete example, here's a C function

unsigned swapb(unsigned in) { return __builtin_bswap32(in); }

and the generated assembly

swapb:
    addi    sp,sp,-16
    sd  ra,8(sp)
    call    __bswapsi2
    ld  ra,8(sp)
    sext.w  a0,a0
    addi    sp,sp,16
    jr  ra
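
Building on the builtin, the load / work / store round trip from the question might look like the sketch below. The helper names `load_be32` and `store_be32` are invented for this example, and the `__BYTE_ORDER__` check assumes a GCC- or clang-compatible compiler.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian value from memory into native byte order. */
static inline uint32_t load_be32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);      /* native load, safe for any alignment */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);     /* swap only on little-endian targets */
#endif
    return v;
}

/* Write a native 32-bit value to memory in big-endian byte order. */
static inline void store_be32(void *p, uint32_t v)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    memcpy(p, &v, sizeof v);
}
```

A typical use is `store_be32(buf, load_be32(buf) + 1)` to increment a big-endian counter in place; the compiler is then free to pick whatever byte-swap sequence the target offers.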

Vouge answered 4/10, 2018 at 18:52 Comment(2)

Worth noting this will only work when not using the -nodefaultlibs flag. – Bile 19/1, 2020 at 21:29

Strangely, GCC doesn't inline the call to __bswapsi2 in your example. Does it do that if you compile with -O2 or a greater optimization level? – Longitudinal 12/2, 2020 at 13:14

Unlike x86, RISC-V doesn't have something like movbe (which can load and byte-swap in one instruction).

Thus, on RISC-V you load/store as usual and after/before the load/store you have to swap the bytes with extra instructions.

The RISC-V "B" (Bitmanip) extension (version 0.92) contains generalized bit reverse instructions (grev, grevi) and several pseudo-instructions that you could use for byte swapping:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
RISC-V    ARM      X86      Comment
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
rev       RBIT     ☐        bit reverse
rev8.h    REV16    ☐        byte-reverse half-word (lower 16 bit)
rev8.w    REV32    ☐        byte-reverse word (lower 32 bit)
rev8      REV      BSWAP    byte-reverse whole register
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

(Table based on Table 2.5, RISC-V Bitmanip Extension V0.92, page 18)

As of 2020-03, the "B" extension has draft status, thus support in hardware and emulators is limited.

Without the "B" extension you have to implement the byte swapping with several base instructions. See for example page 16 in the "B" specification or look at the disassembled code of the __builtin_bswap16, __builtin_bswap32 and __builtin_bswap64 gcc/clang intrinsics.
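
As a minimal sketch of what "several base instructions" amounts to for the other widths, here are 16-bit and 64-bit swaps written with plain shifts, masks and ORs. The function names are hypothetical, and this is one possible formulation rather than the sequence any particular compiler emits.

```c
#include <stdint.h>

/* 16-bit swap: exchange the two bytes. */
static inline uint16_t bswap16_shift(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* 64-bit swap in three levels: bytes within half-words,
 * half-words within words, then the two words. */
static inline uint64_t bswap64_steps(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
    x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
    return (x << 32) | (x >> 32);
}
```

The three-level form needs two 64-bit mask constants, which on RV64 each cost a few instructions to materialise; that is part of why a dedicated instruction is attractive.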

Floodgate answered 14/3, 2020 at 8:25 Comment(0)

Note that while it's nice and convenient to have an instruction to do it, the __bswapsi2 function used in other answers will run at around 400 MB/s on a 1.5 GHz HiFive Unleashed, which is quite a lot faster than the gigE interface is ever going to move data around.

Even on the HiFive1 running at the default 256 MHz it will do 60 MB/s and you've only got 16 KB of RAM and a bunch of GPIOs that you're not going to wiggle at more than a few MHz or maybe 10s of MHz.

I'm on the BitManipulation working group. The full GREV instruction needs a fair bit of hardware (something close to a multiplier), so small microcontrollers might never include it. However, we're planning to use the same GREVI opcodes that give full word bit reversal and byte order reversal and implement them as simpler special-case instructions that don't need much circuitry, and hopefully everyone will include them.

Edit, March 2023:

BitManip was ratified in November 2021, with rev8 (reverse bytes in a register) in Zbb and brev8 (reverse bits in each byte; called rev.b in earlier drafts) in Zbkb. Applied in sequence, the two reverse all the bits in a register in two instructions.
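
The two-instruction bit reversal can be checked with a small C model. This is only a sketch: `model_rev8` and `model_brev8` are invented names that imitate the two operations on a 32-bit value (as on RV32), and `bitrev32_naive` is a reference loop used for comparison.

```c
#include <stdint.h>

/* Model of a byte reversal across a 32-bit register. */
static uint32_t model_rev8(uint32_t x)
{
    return (x << 24) | ((x << 8) & 0x00FF0000u)
         | ((x >> 8) & 0x0000FF00u) | (x >> 24);
}

/* Model of reversing the bits inside each byte, bytes staying in place. */
static uint32_t model_brev8(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    return x;
}

/* Reference: reverse all 32 bits one at a time. */
static uint32_t bitrev32_naive(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```

For any input, `model_brev8(model_rev8(x))` should equal `bitrev32_naive(x)`, and the order of the two steps does not matter.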

I now have in my hands a cheap SBC (VisionFive 2) implementing Zba and Zbb extensions, so have a working rev8.

Jehiah answered 30/8, 2019 at 7:35 Comment(3)

Thanks - extremely useful to get the lowdown from someone involved in the design! Do you operate on some kind of a schedule or is the spec done when it's done? – Longitudinal 30/8, 2019 at 10:55

Arguments based on "using 100% of the CPU time for this tiny step" easily break down when you want to do something with the native-endian data. If you use up a significant fraction of your processing-time budget on endian handling, that doesn't leave as much time for real work. It also makes handling on the fly every time the data is loaded (to avoid the memory bandwidth of an extra copy) less attractive. e.g. recent x86 has a movbe instruction that loads or stores + converts on the fly. (I'm sure you know that, and sometimes you don't need to do more than copy. But worth pointing out for future readers.) – Tjaden 30/8, 2019 at 12:40

@Longitudinal as a community-driven standard by volunteers it's done when it's done, and when we get sufficient buy-in from representatives of the various companies and other interested groups. We are, however, trying to keep it moving along. I don't think we'll be adding any more instructions (my GORC was a very late addition). Everything has been implemented (with unofficial opcodes) in binutils, spike, gcc (most), and in HDL and is finding its way into FPGA cores. The next few months will be used to run a lot of software to check that each instruction pulls its weight in code size or runtime. – Jehiah 31/8, 2019 at 19:58
