Mixed destination/source operand order in RISC-V assembly syntax for loads vs. stores

Most instructions in RISC-V assembler order the destination operand before the source one, e.g.:

li  t0, 22        # destination, source
li  t1, 1         # destination, source
add t2, t0, t1    # destination, source

But the store instructions have that order reversed:

sb    t0, (sp)    # source, destination
lw    t1, (a0)    # destination, source
vlb.v v4, (a1)    # destination, source
vsb.v v5, (a2)    # source, destination

How come?

What is the motivation for this (arguably) asymmetric assembler syntax design?

Moua answered 18/1, 2020 at 16:4 Comment(6)
The memory operand comes second. You figure out the data movement from the instruction. Only the stores are reversed; the loads are of course fine. The syntax matches the machine code encoding where the operands are in the same fields no matter if it's a load or a store. – Pathological
This is typical of load/store architectures: the loads and stores are written the same way, just with a different opcode. Other architectures use some kind of "move" instruction for both instead, so there the order and nature of the operands determine whether memory is read or written. – Polarimeter
Btw, the fields for stores are arranged slightly differently from loads on RISC-V, a change from MIPS, where loads and stores both use the same I-type instruction format. This is so that loads share the dest register field with R-types, and stores share the 2 source register fields with R-types. From a microarchitectural point of view, the store instruction has 2 source registers and no dest register. – Polarimeter
fwiw, MIPS, PowerPC, HP-PA, ARM and RISC-V all use this same form for loads & stores. – Polarimeter
@Jester, hm, the assembler could still implement the store instruction syntax with an order that differs from the one in the encoding, with trivial implementation effort. I mean, RISC-V assembly also has pseudo-instructions where the encoding isn't as direct, either. – Moua
Since the reasoning behind many RISC-V design choices is documented (e.g. in books, mailing lists, bug trackers), there is a good chance that this question can be answered based on facts and references, rather than just opinion. The official RISC-V assembly syntax is used throughout the specs and similar material. Thus, there is some reason to believe that the operand order in the assembly syntax was part of some deliberation based on experience and considering some kind of engineering trade-offs. – Moua

I don't see a real inconsistency in RISC-V assembly when it comes to destination and source operands: The destination operand – when it's part of the instruction encoding – always corresponds to the first operand in the assembly language.

If we look at the following instruction examples from four of the six different instruction formats (see footnote 1):

  • R-type: add t0, t1, t2
  • I-type: addi t0, t1, 11
  • J-type: jal ra, off
  • U-type: lui t0, 0x12345

In the assembly instructions above, the destination operand is the first operand. Clearly, this destination operand corresponds to the destination register in the instruction encoding.
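As a rough illustration of that claim (a sketch, not part of the original answer), the Python snippet below builds the machine words for the four examples by hand and reads back bits 11:7, which is where the base ISA places rd in these formats. The helper name rd_field is made up for this example, and the unspecified jal offset "off" is assumed to be 0.

```python
# ABI register names used in the examples, mapped to their x-register numbers.
RA, T0, T1, T2 = 1, 5, 6, 7


def rd_field(word):
    """Return bits 11:7 of a 32-bit instruction word (the rd field)."""
    return (word >> 7) & 0x1F


# Hand-assembled encodings; funct3/funct7 are 0 for add and addi.
examples = {
    "add t0, t1, t2": (T2 << 20) | (T1 << 15) | (T0 << 7) | 0x33,    # R-type
    "addi t0, t1, 11": (11 << 20) | (T1 << 15) | (T0 << 7) | 0x13,   # I-type
    "jal ra, 0": (RA << 7) | 0x6F,            # J-type, offset assumed to be 0
    "lui t0, 0x12345": (0x12345 << 12) | (T0 << 7) | 0x37,           # U-type
}

for text, word in examples.items():
    print(f"{text:<16} {word:#010x}  rd = x{rd_field(word)}")
```

In every case the register written first in the assembly text is the one that ends up in the rd field.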

Now, let's focus on the store instructions (S-type format). As an example, consider the following store instruction:

sw t0, 8(sp)

I think it is crystal clear that t0 above is a source operand since the store instruction stores its contents in memory.

We can be tempted to think that 8(sp) is a destination operand. However, by closely looking at the S-type instruction format:

[Figure: S-type instruction format]

We can tell that the 8(sp) part in the assembly instruction above isn't really a single operand but actually two: the immediate 8 (imm) and the source register sp (rs1). If the instruction could instead be expressed like this (similar to addi; see footnote 2):

sw t0, sp, 8

It would become evident that this instruction takes three operands, not just two.

The register sp is only read, never modified, so it can't be considered a destination register. It is a source register, just as t0 (the register whose contents the store instruction writes to memory) is. Memory is the destination operand, since it is what receives the contents of t0.

The S-type instruction format doesn't encode a destination operand. What the instruction does encode is addressing information on the destination operand. For sw t0, 8(sp), the destination operand is the word in memory at the location specified by the effective address that the store instruction calculates from sp and 8. The register sp contains part of that addressing information about that word in memory (i.e., the destination operand).
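To make the field layout concrete, here is a minimal Python sketch (not from the original answer) that hand-encodes a load and a store and then inspects bits 11:7, where rd lives in the formats that have one. The function names encode_lw, encode_sw and bits_11_7 are invented for this illustration, and only the three registers needed here are mapped.

```python
# ABI names mapped to x-register numbers (only the ones needed here).
REG = {"sp": 2, "t0": 5, "t1": 6}


def encode_lw(rd, imm, rs1):
    """I-type load: imm[11:0] | rs1 | funct3=010 | rd | opcode=0x03."""
    imm &= 0xFFF
    return (imm << 20) | (REG[rs1] << 15) | (0b010 << 12) | (REG[rd] << 7) | 0x03


def encode_sw(rs2, imm, rs1):
    """S-type store: imm[11:5] | rs2 | rs1 | funct3=010 | imm[4:0] | opcode=0x23."""
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (REG[rs2] << 20) | (REG[rs1] << 15)
            | (0b010 << 12) | ((imm & 0x1F) << 7) | 0x23)


def bits_11_7(word):
    """Bits 11:7 hold rd in I-type, but the low immediate bits in S-type."""
    return (word >> 7) & 0x1F


lw_word = encode_lw("t1", 8, "sp")   # lw t1, 8(sp)
sw_word = encode_sw("t0", 8, "sp")   # sw t0, 8(sp)
print(f"lw t1, 8(sp) -> {lw_word:#010x}, bits 11:7 = {bits_11_7(lw_word)}")
print(f"sw t0, 8(sp) -> {sw_word:#010x}, bits 11:7 = {bits_11_7(sw_word)}")
```

For the load, bits 11:7 contain 6 (t1, the destination register). For the store, the same bits contain 8, the low part of the offset, while t0 sits in the rs2 field and sp in rs1: the word has no destination-register field at all.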

Summary

Assembly instructions in RISC-V that encode a destination operand have this operand as the first one. A store instruction, however, doesn't encode a destination operand. Its destination operand is a location in memory, and the address of this location in memory is computed from the contents of the instruction source operands.


1. We could possibly argue that the jal ra, off instruction above has an additional destination operand, namely pc, because pc is updated in the following way: pc ← pc + SignExtension(off). However, executing any other instruction also results in modifying pc, e.g., incrementing pc by four (may be different for branches and jalr). Anyway, pc is not encoded in any instruction, and it is not directly accessible to the programmer as a register. Therefore, it is not of interest to the discussion. For the same reason, I've also omitted the B-type format from this discussion.

2. Or just the other way around: think of it as if you could express addi t0, t0, -1 as addi t0, -1(t0). Would you then say that addi takes two operands (i.e., t0 and -1(t0))?

Francklyn answered 29/1, 2020 at 10:6 Comment(2)
You make a reasonable argument, but instruction-encoding details don't need to inform / correspond to asm source-level syntax. It does make parsing simpler for writing an assembler, I guess. It would be perfectly valid to always have the destination on the left, including for stores. Assembly languages for CISC ISAs typically do that, and the mental model is a "memory destination operand", where the relevant chunk of memory is selected by the addressing mode. (And yes, stores do have 2 inputs: data and addressing mode.) There's no reason you can't think this way about RISC-V. – Drawplate
It's basically just historical convention at this point that asm-source syntax for RISC ISAs doesn't put the memory destination on the left. Even ARM does str reg, [reg, reg] or whatever, and it's not as RISCy as MIPS or POWER. – Drawplate

Assembly language is defined by the assembler, the program. It is up to the author(s) to pick the syntax. An assembler could choose to have the syntax

bob pickle,(jar)

and that would be perfectly valid syntax to store one register to the address held in another. You could probably even use the equivalent of a #define in some assembly language syntaxes.
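As a toy illustration of the point that the syntax is the assembler's choice, the Python sketch below accepts a word store either in the usual order or in a destination-first order and emits the same machine word for both. Everything here is hypothetical: the function assemble_store, its dest_first flag, and the "sw 8(sp), t0" spelling are invented for this example and are not a description of any existing assembler.

```python
import re

# ABI names mapped to x-register numbers (only the ones needed here).
REG = {"sp": 2, "t0": 5}

MEM = r"(-?\d+)?\((\w+)\)"   # optional decimal offset, then (base register)
REGISTER = r"(\w+)"

STANDARD = re.compile(rf"sw\s+{REGISTER}\s*,\s*{MEM}$")     # sw t0, 8(sp)
DEST_FIRST = re.compile(rf"sw\s+{MEM}\s*,\s*{REGISTER}$")   # sw 8(sp), t0


def encode_sw(rs2, imm, rs1):
    """S-type store: imm[11:5] | rs2 | rs1 | funct3=010 | imm[4:0] | opcode=0x23."""
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (0b010 << 12) | ((imm & 0x1F) << 7) | 0x23)


def assemble_store(line, dest_first=False):
    """Parse one 'sw' line in either operand order and return its encoding."""
    pattern = DEST_FIRST if dest_first else STANDARD
    match = pattern.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse: {line!r}")
    if dest_first:
        offset, base, src = match.groups()
    else:
        src, offset, base = match.groups()
    # A missing offset, as in "sw t0, (sp)", is treated as 0.
    return encode_sw(REG[src], int(offset or 0), REG[base])


usual = assemble_store("sw t0, 8(sp)")
flipped = assemble_store("sw 8(sp), t0", dest_first=True)
print(f"{usual:#010x} {flipped:#010x}")
```

Only the parsing step differs between the two spellings; the encoder neither knows nor cares which order the source text used.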

The "why" question really means you want to talk to the actual developer, who is likely not trolling Stack Overflow (although you might get lucky), so this question does not have an actual answer.

To have a chance at success, it is in the best interest of the processor's developers to create, or hire someone to create, an assembler initially and a toolchain later for their new processor. That would include someone sitting down, examining the machine code, and creating a language from it. A third-party assembler for a target has a better chance at success if its instruction syntax resembles that of the original, but why bother making a new one if you are not going to mix it up? The instruction syntax is only a part of the whole language defined by the assembler, and you will find wide variations for MIPS, ARM, etc., and will over time for RISC-V, although the desire to make new tools has gone down dramatically over the last couple of decades.

The only rules a successful assembler has to follow are the ones defined by the logic; the syntax can be whatever the authors choose, for whatever reason they choose. So you have to ask each author/team if you want to know, and I'm not sure that even Bugzilla would get you there.

A related "why" question: since we spent so much of our early life with the destination on the left

y = mx + b

and not

mx + b = y

what sane person would design an assembly language where the instruction part has the destination on the right? Even the high-level languages don't do that.

A possible answer to your question is that someone way back was lazy and used the same code for load and store, and/or cut and pasted it, and that at least the RISC folks who came after followed that convention.

Not just for Intel but for all the major/minor instruction sets you find syntax incompatibilities across tools, x86, arm, mips, msp430, avr, 8051, 6502, z80, etc, and eventually risc-v if not already. The folks that add targets to gnu assembler must take pride in making incompatible assembly languages as they do it so often.

The location within the instruction is generally irrelevant to the assembly language. The authors start off either being in the destination first camp or destination last camp.

add r0,r1,r2  ; r0 = r1 + r2 

add r0,r1,r2  ; r0 + r1 -> r2

And then the naming of registers is free-form and sometimes varies: ax, %ax, r0, $0.

A recent (horrible) fad, which I assume comes from MIPS and its use in schools, is naming registers v0, a0, t0, etc., and that is infecting other unrelated instruction sets. The mangling of different instruction set habits is happening a lot these days.

They choose how to indicate indirection @r1, (r1), [r1]...

How to indicate pre/post increment/modification and so on as they work through the instructions.

Some choose 4(r1) where another would use [r1,#4].

The first assembly language an individual learned, or the one they have used most heavily, plays a role in how they like to handle others. Some folks just have to make their own tool to avoid having to learn another language or deal with what they don't like about another language; thus the AT&T thing, and possibly the gnu assembler choices. The same definitely goes for the way MIPS handled a calling convention and how that notion (feature?) infected other tools and possibly classrooms.

Look at the evolution of x86 assembly languages in particular (the AT&T vs Intel being irrelevant to what I am talking about) over time.

As it should be, you simply learn the language that the assembler uses and move on, or you write your own assembler to match the language you prefer. If you publish it and others like it, then it can work its way into the norm, and you are seeing that happen.

Short answer: because other assembly languages do it. You can see a clear connection between RISC-V and MIPS in their design, so no doubt the authors of the documentation also followed the MIPS style that they had been used to leading up to RISC-V. Exceptions to the rule happen, although it would be more of a purist solution to always have the destination on the left. What is more important is consistency, as you pointed out. Don't have one flavor of store one way and another flavor another. Look at MRS/MSR in a typical ARM syntax: destination/source is in the middle, in the same place.

As far as the gnu assembler goes, binutils is open source, so you are perfectly free to switch it around; likewise, you are free to create your own assembler with the ordering and syntax you wish. If you want it to be part of a chain, then, as with the current toolchains, you need to create/change the compiler to match the assembler and linker.

If this is strictly a "why" question, then it is primarily opinion-based and should be closed. The author of the documentation and author of the assembler (backend) were free to choose and this was the choice.

Ptero answered 18/1, 2020 at 17:33 Comment(0)
