What was the original reason for the design of AT&T assembly syntax?
When writing assembly for x86 or amd64, a programmer can use either "Intel" syntax (used e.g. by the nasm assembler) or "AT&T" syntax (used e.g. by the gas assembler). "Intel" syntax is more popular on Windows, while "AT&T" is more popular on UNIX(-like) systems.

But the Intel and AMD manuals, that is, the manuals written by the creators of the chips, both use the "Intel" syntax.

I'm wondering: what was the original idea behind the design of the "AT&T" syntax? What was the benefit of moving away from the notation used by the creators of the processor?

Indigoid answered 15/2, 2017 at 8:22 Comment(5)
You will have to dig up Dennis Ritchie for the answer to that.Morphine
Doesn't have anything to do with the OS; the toolchain matters. GNU adopted AT&T syntax, which made it common on Unix. And on Windows, lots of programmers use GNU tooling as well. Easier for the compiler, not for humans: it can tell which registers get trashed, and the syntax scales more easily across architectures. With a dash of Motorola semantics, common on old Unixes. Intel syntax requires a decompiler to figure out the same details. Notably, Microsoft gave up on it for their x64 compiler and only supports intrinsics.Strawflower
@HansPassant Note that AT&T syntax was standard for x86 (and 8086 before) long before the GNU project ported their assembler. It's called AT&T syntax because it was used by AT&T's UNIX port if I recall correctly.Mathildamathilde
Note that this question has been asked before but the duplicate wasn't caught when this was asked.Mathildamathilde
@fuz: It's not an exact duplicate. That's asking about why AT&T syntax is designed the way it is. This is asking why it was even worth breaking compatibility and making a new syntax. As usual with those kinds of questions, the answers end up overlapping a lot, though.Company

UNIX was for a long time developed on the PDP-11, a 16-bit computer from DEC with a fairly simple instruction set. Nearly every instruction has two operands, each of which can have one of the following eight addressing modes, here shown in the MACRO-11 assembly language:

0n  Rn        register
1n  (Rn)      deferred
2n  (Rn)+     autoincrement
3n  @(Rn)+    autoincrement deferred
4n  -(Rn)     autodecrement
5n  @-(Rn)    autodecrement deferred
6n  X(Rn)     index
7n  @X(Rn)    index deferred

Immediates and direct addresses can be encoded by cleverly re-using some addressing modes on R7, the program counter:

27  #imm      immediate
37  @#imm     absolute
67  addr      relative
77  @addr     relative deferred

As the UNIX tty driver used @ and # as line-editing (kill and erase) characters, $ was substituted for # and * for @.
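For example, the DEC notation on the left becomes the UNIX notation on the right (a sketch; the constants and registers here are just illustrative):

```asm
mov #5, r0        →   mov $5, r0        / immediate: load the constant 5
mov @#500, r1     →   mov *$500, r1     / absolute: load the word at address 500
mov @(r2)+, r3    →   mov *(r2)+, r3    / autoincrement deferred
```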

The first operand field in a PDP-11 instruction word refers to the source operand, while the second refers to the destination. This is reflected in the assembly language's operand order, which is source first, then destination. For example, the instruction word

011203
||||||
|||||`- register 3
||||`-- addressing mode 0: register
|||`--- register 2
||`---- address mode 1: deferred
|`----- operation 1: mov
`------ operand size 0: word

refers to the instruction

movw (R2),R3

which moves the word pointed to by R2 to R3.

This syntax was adapted to the 8086 CPU and its addressing modes:

mr0 X(bx,si)  bx + si indexed
mr1 X(bx,di)  bx + di indexed
mr2 X(bp,si)  bp + si indexed
mr3 X(bp,di)  bp + di indexed
mr4 X(si)     si indexed
mr5 X(di)     di indexed
mr6 X(bp)     bp indexed
mr7 X(bx)     bx indexed
3rR R         register
0r6 addr      direct

Here m is 0 if there is no displacement, 1 if there is a one-byte displacement, 2 if there is a two-byte displacement, and 3 if a register is used instead of a memory operand. If two operands exist, the other operand is always a register and is encoded in the r digit; otherwise, r encodes another three bits of the opcode.

Immediates aren't possible in this addressing scheme; all instructions that take immediates encode that fact in their opcode. Immediates are spelled $imm, just like in the PDP-11 syntax.
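To mirror the PDP-11 decoding example above, here is a sketch of how a few 8086 encodings line up with this scheme (the mov r16, r/m16 opcode 8b followed by the mode byte, shown in hex with its octal reading; the operands are illustrative):

```asm
8b 04       movw (%si), %ax     # octal 004: m=0, r=0 (ax), rm=4 (si indexed, no displacement)
8b 44 02    movw 2(%si), %ax    # octal 104: m=1, a one-byte displacement follows
8b c1       movw %cx, %ax       # octal 301: m=3, register operand (cx)
```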

While Intel always used a dst, src operand ordering for its assembler, there was no particularly compelling reason to adopt this convention, and the UNIX assembler was written to use the src, dst operand ordering known from the PDP-11.

Some inconsistencies with this ordering crept into the implementation of the 8087 floating-point instructions, possibly because Intel gave the two possible directions of non-commutative floating-point instructions different mnemonics, which do not match the operand ordering used by AT&T's syntax.

The PDP-11 instructions jmp (jump) and jsr (jump to subroutine) jump to the address of their operand. Thus, jmp foo would jump to foo and jmp *foo would jump to the address stored in the variable foo, similar to how lea computes addresses on the 8086.

The syntax for the x86's jmp and call instructions was designed as if these instructions worked like on the PDP-11, which is why jmp foo jumps to foo and jmp *foo jumps to the value at address foo, even though the 8086 doesn't actually have deferred addressing. This has the advantage of syntactically distinguishing direct jumps from indirect jumps without requiring a $ prefix for every direct jump target, but it doesn't make a lot of sense logically.
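In GNU as, the distinction looks like this (foo is an arbitrary label, the registers are chosen for illustration):

```asm
jmp foo         # direct: jump to the label foo
jmp *foo        # indirect: jump to the address stored at foo
jmp *%eax       # indirect: jump to the address held in %eax
call *8(%ebx)   # indirect call through a pointer in memory
```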

The syntax was expanded to specify segment prefixes using a colon:

seg:addr

When the 80386 was introduced, this scheme was adapted to its new SIB addressing modes using a four-part generic addressing mode:

disp(base,index,scale)

where disp is a displacement, base is a base register, index is an index register, and scale is 1, 2, 4, or 8, scaling the index register by that amount. This is equivalent to the Intel syntax:

[disp+base+index*scale]
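For instance, the same memory operand written both ways (registers and displacement chosen arbitrarily):

```asm
movl 8(%ebx,%esi,4), %eax    # AT&T: disp=8, base=ebx, index=esi, scale=4
mov  eax, [ebx+esi*4+8]      ; the same operand in Intel (nasm) syntax
```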

Another remarkable feature of the PDP-11 is that most instructions are available in a byte and a word variant. Which one you use is indicated by a b or w suffix to the mnemonic, which directly toggles the most significant bit of the instruction word:

 010001   movw r0,r1
 110001   movb r0,r1

This was also adopted for AT&T syntax, as most 8086 instructions are indeed available in both a byte mode and a word mode. Later, the 80386 introduced 32-bit operations (suffixed l for long) and the AMD K8 introduced 64-bit operations (suffixed q for quad).
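In today's GNU as, the four suffixes look like this (a store through %rsi, chosen arbitrarily):

```asm
movb %al,  (%rsi)    # byte  ( 8 bits)
movw %ax,  (%rsi)    # word  (16 bits)
movl %eax, (%rsi)    # long  (32 bits)
movq %rax, (%rsi)    # quad  (64 bits)
```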

Last but not least, the original convention was to prefix C-language symbols with an underscore (as is still done on Windows), so that a C function named ax could be distinguished from the register ax. When Unix System Laboratories developed the ELF binary format, they decided to get rid of this decoration. As there is otherwise no way to distinguish a direct address from a register, a % prefix was added to every register:

mov direct,%eax # move memory at direct to %eax

And that's how we got today's AT&T syntax.

Mathildamathilde answered 15/2, 2017 at 13:6 Comment(26)
NASM uses a different design: it prevents you from declaring symbols that conflict with register names, and requires declaring all symbols with either extern foobar, foobar equ 123, or as a label. The parser can use a table to distinguish between registers and symbols (and mnemonics in some cases). For example, mov rdi, rxd treats rxd as a symbol because there's no register of that name, and assembles to a mov r64, imm32 (or imm64). AT&T syntax makes it possible to have a global symbol like eax. A C compiler would have a hard time compiling int eax = 0; to NASM syntax.Company
@PeterCordes That's why C compilers prefix symbols with underscores on some platforms. How does NASM cope with new register names that could possibly conflict with existing symbol names?Mathildamathilde
In YASM at least, you can disable support for new extensions by using something like CPU Conroe AMD (to enable SSSE3 and syscall, since it seems that CPU Conroe doesn't allow syscall even in 64-bit mode...). I think that lets you use ymm0 as a symbol name. NASM also supports a CPU directive, but with different args.Company
Actually I tested, and that doesn't work. CPU Conroe / extern ymm0 is an error in 32 or 64 bit. (And so is ymm0: dd 123). Maybe you just need to change your label names if you want to upgrade to a new version of NASM / YASM that adds support for a new instruction set.Company
@PeterCordes Now that's just terrible design: I can't be sure if my assembly assembles with future versions of nasm, and there is no easy migration path except for a possibly impossible (due to the need to keep an API) large-scale refactoring.Mathildamathilde
Yeah, that doesn't seem like good design. It was never designed to be a good compiler-output format (since you can't compile int eax=0;), although I think NASM syntax was designed in a.out days, before Linux switched to ELF, so maybe the problem wasn't foreseen. I asked #45891145 since I'm curious about the answer. I hadn't thought of ABI compatibility making it harder than just search/replacing a symbol name, but that's a great point.Company
@Mathildamathilde - to be fair, it's also common in pretty much every other language to use "unadorned" identifiers which could clash with identifiers introduced in later versions of the language. Granted, the analogy isn't exact since new identifiers are introduced by the languages themselves while NASM has to cope with "imposed" changes as the hardware changes, but it's not totally insane. Your code will continue to compile fine on old versions of NASM, and if you want to use the new regs you'll just have to fix any conflict in your small amount (right?) of hand-written assembly.Undesirable
@Undesirable Imagine you have written a library in assembly that defines a symbol named ymm0 as a part of its API. Now you update your assembler and nasm refuses to assemble your library. You can't rename the symbol as that would break your library's API, causing every program that uses it to stop working. Note that most serious programming languages have a backup plan for this kind of situation: they never introduce new keywords (instead, new functionality is added as builtin functions that can be shadowed) or they reserve a namespace for new keywords (like C does).Mathildamathilde
Yes, it would suck in that case. Some languages don't introduce new keywords, true, but that comes at a price: oddball overloading of existing keywords or other workarounds. It might be the right tradeoff for those languages which have billions or probably trillions of lines of existing code and where compatibility is everything. Snippets of low-level code written in asm may lead to a different tradeoff. The "backup plan" for nasm is to use the original working version of nasm until which time you decide you want to overhaul your code for the new extension. It will work indefinitely.Undesirable
It's no different with language extensions and changes. Any large project will usually have incompatibilities when moving to a new version of the language spec, be it C99, C++11, Java 8, (add more here), etc. You stay on the old version until you want to upgrade. One difference is newer compiler versions usually let you upgrade the compiler but use a flag to stay on the old version (e.g., -std=...) but maybe NASM doesn't. Just keeping the very small binary around is no big thing though (keeping old C++ compilers around is more of a PITA!).Undesirable
@Undesirable In cases where the symbol in question is part of the API, there is often no way to “overhaul the code.” Renaming the symbol would break the API and that's absolutely unacceptable.Mathildamathilde
@Mathildamathilde - yes, it would be a problem if you somehow have a public ymm0 symbol. A reasonable workaround would be to use an old version of nasm or simply inline asm in C code to create that entry point which delegates to code compiled with new nasm. Or else have a post-assembly step that fixes up the symbol in the .o file. I'm not aware of smoother solutions to this problem probably because the actual incidence of code with public symbols called ymm0 is vanishingly small. NASM is open source so you can always submit a patch if you are the outlier with such symbols.Undesirable
@BeeOnRope: That's why I picked bnd0 and k0 as more plausible examples of name clashes for this question. Still vanishingly small, but more plausible as a totally-unexpected name clash. (Or I've seen C that used ymm0 as a local variable name. Using it as a global and then trying to interface with that from NASM is another use case). I think you're right: the answer to my question is "no, NASM is not forward-compatible", and using an old nasm version is the only option.Company
@peter - yup, I saw that. The k (mask) registers are the biggest issue. They are likely to be heavily used (especially compared to bnd since at this point MPX is looking a bit like a dud feature) in new code. I still think the public API risk is pretty small though. If you are choosing two character names for your public symbols you also stand a reasonable chance of clashing at static or dynamic link time with other equally short-sighted libraries.Undesirable
@BeeOnRope: oh, turns out NASM lets you write $eax to refer to the symbol eax. Ross answered my question :)Company
I guess NASM is innocent of the charge of "terrible design" at least in this respect. Now if they only fix their DWARF info generation.Undesirable
The fsub vs. fsubr issue is documented in the GAS manual in the AT&T syntax bugs section. It's a mistake, not a reasonable design choice, because reversing the operands for a mnemonic doesn't just do the opposite thing. (And in objdump, this applies even to -Mintel disassembly so GAS .intel_syntax is broken, too.) NASM fsubr st3,st0 disassembles as dc e3 fsub st(3),st with objdump -Mintel, while NASM fsubr st0,st3 disassembles as d8 eb fsubr st,st(3) (no syntax bug). So it's just wildly inconsistent.Company
@Peter Cordes Isn't this what I say? That there are inconsistencies wrt. the x87 instructions?Mathildamathilde
Your phrasing of possibly because Intel gave the two possible directions of non-commutative floating point instructions different mnemonics made me think you were saying that it was a valid design choice, and it was just a disagreement over which direction was reverse. But it's worse than that.Company
@PeterCordes I said that because I speculate that maybe they didn't want to flip the instruction mnemonics; I suppose the UNIX assembler might originally only have had the single-operand forms of fsub and fsubr and weirdness happened when they added the two-operand forms.Mathildamathilde
mov (R2),R3 doesn't move the content of R2 to R3; it moves the content of the memory location whose address is in R2 to R3.Gender
@Rhialto The wording is misleading. It was meant to mean something along the lines of “moves the content of (the memory cell described by) R2 to R3.” Let me fix this.Mathildamathilde
I learned assembly on MACRO-11 (and its syntax) so I notice such little things :-)Gender
Just to be clear: In Bell Labs Unix, # was the "erase" (backspace/delete) character, @ was the (line) "kill" (control-U nowadays) character -- I think this is because of limitations of the ASR33 teletype. This is why @ was not used in C and why # was used only at the start of a line, for the preprocessor.Swirl
@Swirl The C preprocessor was a fairly late addition to the language. It is plausible that they already had glass teletypes by the time it was added, though the timeline is not clear to me.Mathildamathilde
"The Unix Programming Environment" by Kernighan & Pike, published 1983, documents the "@" and "#" characters for erase and kill (but notes that local installations might be using something else), so these two characters seem to have ingrained themselves fairly well even after hardcopy terminals became obsolete. Wikipedia states the C preprocessor was introduced in 1973.Swirl

Assembly language is defined by the assembler, the software that parses the assembly language. The only "standard" is the machine code, which has to match the processor; but if you take 100 programmers and give them the machine-code standard (without any assembly-language hints), you will end up with somewhere between 1 and 100 different assembly languages, all of which will work perfectly well for all use cases of that processor (bare metal, operating system, application work) so long as they make a complete tool that fits in with a toolchain.

It is in the best interest of the creator of the instruction set, the machine code, to create both a document describing the instruction set and an assembler, the first tool you need. They can contract it out or make it in house; either way doesn't matter. But having an assembler with a syntax, and a document for the machine code that uses the assembler's syntax to connect the dots between the two, will give anyone possibly interested in that processor a starting point. That was the case with Intel and the 8086/88. But that doesn't mean that masm and tasm were completely compatible with Intel's assembler. Even if the syntax per instruction matched, the per-instruction syntax is only part of the assembly language; there is a lot of non-instruction syntax: directives, macro language, etc. And that was the DOS end of the world; there was also the UNIX end, and thus AT&T. The GNU folks at the time were on the UNIX end of the world, so it makes perfect sense that they used the AT&T syntax or a derivative of it, as they generally mess up assembly language during a port. Perhaps there is an exception.

nasm and some others like it are an attempt to continue the masm syntax, as masm is a closed-source Microsoft tool (as was tasm, and whatever came with Borland C if that wasn't tasm as well). These might be open-sourced now, but there is no need: it's easier to write one from scratch than to try to port that code (which I assume would have to be built with a modern compiler), and nasm already exists.

The why question is like asking you why you chose the pair of socks you chose this morning, or on any particular day. Your socks may not have as big an impact on the rest of the world, but the question is equally irrelevant and/or unanswerable. The answer goes back in part to asking 100 programmers to make an assembler for the same machine-code definition. Some of these programmers may be experienced with assembly language and may choose to create an assembly language in the image of one they have used before, which means several of them will make ones that look pretty similar to each other. But the ones they used before may differ, so there would be groups of these: similar, but still different. Then, let's say 30 years later, ask each of those 100 people the why question... if they are still alive. It's like asking you why you declared a variable in a program you wrote 30 years ago the way you did.

Headfirst answered 29/5, 2017 at 13:49 Comment(11)
Thanks for the answer, but I fail to see the point in your last paragraph.Indigoid
Pick any program you wrote 30 years ago, post it, and describe for us why you chose the variable names you chose and why you declared them in the order you declared them. This is the same as asking the authors of an assembly language they wrote ages ago why they chose the syntax they chose on that particular day. You would have to ask them specifically, everyone on the team, as even though they may have all contributed to the same work, they may all have a different answer to the why question...Headfirst
If not old enough to have 30-year-old source code, then take a coloring book from when you were, say, 3 or 4 years old: why did you choose the colors you chose for that work? Did a parent help you? If so, what is their answer to why you chose a specific color vs your answer? These kinds of why questions make no sense. The GNU connection to AT&T is pretty obvious based on history. Why Microsoft went with Intel syntax, though, who knows... it is a why question.Headfirst
I'm not asking what socks Dennis Ritchie was wearing when he was implementing the 'for' loop code generation in the C compiler, or how many minutes he spent looking through the window that day. I am asking about the history and origin of a trend I've observed. By your logic, most historical questions should be tagged as irrelevant. If the answer is obvious to you, the best thing you can do is to answer it, not explain how obvious and irrelevant it is. If someone's asking, it's not irrelevant.Indigoid
To be fair, for many other ISAs, your "between 1 and 100" ended up being pretty much "exactly 1" and even in the case of Intel x86 it's "just 2", so it's pretty valid to wonder how both syntaxes originated.Undesirable
@Undesirable There are several dozen x86 syntaxes (mainstream), handfuls of mainstream ARM syntaxes, MIPS geez tons, hundreds of thousands (every Nth CS or CE student for the last so many decades).Headfirst
Well perhaps, but only a few would account for 99.99% of the market and all of those fit closely into "intel-like" or AT&T. Each student designing his own assembly syntax? Is it true? Does it matter? From what I've seen there is a fairly standard syntax for most RISC archs. It's not my area of expertise though, but it would seem incredibly unusual if those ISAs didn't have a very narrow set of used assembly syntax(es).Undesirable
intel, masm, tasm, borland asm, zortech, watcom, plus others I can't remember, plus the current ones. I would say those covered a significant market share in their time. Just look at the current ARM ones that divide the market share. Maybe you are getting hung up only on the actual instructions, which again ARM is another big one that has differences ARM created, not to mention gcc going their own way, and some folks trying to infect ARM register naming with both GNU x86 and MIPS. But that is just the instructionsHeadfirst
then you have the rest of the language that makes the code unportable; the instructions are the relatively easy part. Give it another couple or three decades of experience, and look at what has been going on just in the GNU world in the last handful of years with the mainstream targets, and you will see what I am talking about...Headfirst
I'll grant you that there are more flavors than a standardized language like C, and that the flavors have somewhat more variation (even the C compilers have some variation), but I think the trend is still towards de-facto standardization on a few variants. Time will tell I suppose.Undesirable
I'm really not a huge fan of your answer because as my answer outlines, there is actually a very good reason for nearly every single aspect of AT&T syntax. It's far from being an arbitrary choice like which colour socks you wear.Mathildamathilde
