Distinguishing memory from constant in GNU as .intel_syntax

I have an instruction written in Intel syntax (using gas as my assembler) that looks like this:

mov rdx, msg_size
...
msg: .ascii "Hello, world!\n"
     .set msg_size, . - msg

but that mov instruction is being assembled to mov 0xe,%rdx, rather than mov $0xe,%rdx, as I would expect. How should I write the first instruction (or the definition of msg_size) to get the expected behavior?

Antigorite answered 6/9, 2016 at 18:1 Comment(5)
When I try that, I get undefined reference to `$msg_size' - Antigorite
Oh, sorry, I missed the part about Intel syntax. In true MASM syntax you wouldn't need to do anything. You could try OFFSET msg_size. - Barimah
Yes, that works, thank you. I'm a bit too used to nasm, I think... - Antigorite
@RossRidge Sorry, one more question. What works similarly in lea <reg>, [<reg> + <constant>]? - Antigorite
In that context it shouldn't matter. It's unambiguously a displacement. - Barimah

Use mov edx, OFFSET symbol to get the symbol "address" as an immediate, rather than loading from it as an address. This works for actual label addresses as well as symbols you set to an integer with .set.

For the msg address (not the msg_size assemble-time constant) in 64-bit code, you may want lea rdx, [RIP+msg] for a PIE executable, where static addresses don't fit in 32 bits. See "How to load address of function or label into register".


In GAS .intel_syntax noprefix mode:

  • OFFSET symbol works like AT&T $symbol. This is somewhat like MASM.
  • symbol works like AT&T symbol (i.e. a dereference) for unknown symbols.
  • [symbol] is always an effective-address, never an immediate, in GAS and NASM/YASM. LEA doesn't load from the address but it still uses the memory-operand machine encoding. (That's why lea uses the same syntax).
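A minimal sketch of the three operand forms above, using the symbol names from the question. It assumes x86-64 GAS on Linux and is meant to be assembled and inspected with objdump -d -Mintel, not run (the loads from a tiny absolute address would fault). The comments state what the rules in this answer predict, so check them against your own disassembly.

```
.intel_syntax noprefix

.text
.globl _start
_start:
    mov edx, OFFSET msg_size    # immediate, like AT&T mov $msg_size, %edx
    mov eax, msg_size           # msg_size is not defined yet: memory operand (a load)
    mov eax, [msg_size]         # brackets: always a memory operand
    lea rsi, [msg_size]         # same memory-operand encoding, but no load happens

.section .rodata
msg: .ascii "Hello, world!\n"
    .set msg_size, . - msg      # 14, defined after the instructions that use it
```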

Interpretation of bare symbol depends on order of declaration

GAS is a one-pass assembler (which goes back and fills in symbol values once they're known).

It decides on the opcode and encoding for mov rdx, symbol when it first encounters that line. An earlier msize= . - msg or .equ / .set will make it choose mov reg, imm32, but a later directive won't be visible yet.

The default assumption for not-yet-defined symbols is that symbol is an address in some section (like you get from defining it with a label like symbol:, or from .set symbol, .). And because GAS .intel_syntax is like MASM not NASM, a bare symbol is treated like [symbol] - a memory operand.

If you put a .set or msg_length=msg_end - msg directive at the top of your file, before the instructions that reference it, they would assemble to the mov reg, imm32 (mov-immediate) form. (Unlike in AT&T syntax where you always need a $ for an immediate even for numeric literals like 1234.)

For example, here is source and disassembly interleaved. It was assembled with gcc -g -c foo.s and disassembled with objdump -drwC -S -Mintel foo.o (as --version reports GNU assembler (GNU Binutils) 2.34):

0000000000000000 <l1>:
.intel_syntax noprefix

l1:     
mov eax, OFFSET equsym
   0:   b8 01 00 00 00          mov    eax,0x1
mov eax, equsym            #### treated as a load
   5:   8b 04 25 01 00 00 00    mov    eax,DWORD PTR ds:0x1
mov rax, big               #### 32-bit sign-extended absolute load address, even though the constant was unsigned positive
   c:   48 8b 04 25 aa aa aa aa         mov    rax,QWORD PTR ds:0xffffffffaaaaaaaa
mov rdi, OFFSET label
  14:   48 c7 c7 00 00 00 00    mov    rdi,0x0  17: R_X86_64_32S        .text+0x1b

000000000000001b <label>:

label:
nop
  1b:   90                      nop

.equ equsym, . - label            # equsym = 1
big = 0xaaaaaaaa

mov eax, OFFSET equsym
  1c:   b8 01 00 00 00          mov    eax,0x1
mov eax, equsym           #### treated as an immediate
  21:   b8 01 00 00 00          mov    eax,0x1
mov rax, big              #### constant doesn't fit in 32-bit sign extended, assembler can see it when picking encoding so it picks movabs imm64
  26:   48 b8 aa aa aa aa 00 00 00 00   movabs rax,0xaaaaaaaa

It's always safe to use mov edx, OFFSET msg_size to treat any symbol (or even a numeric literal) as an immediate regardless of how it was defined. So it's exactly like AT&T $ except that it's optional when GAS already knows the symbol value is just a number, not an address in some section. For consistency it's probably a good idea to always use OFFSET msg_size so your code doesn't change meaning if some future programmer moves code around so the data section and related directives are no longer first. (Including future you who's forgotten these strange details that are unlike most assemblers.)

BTW, .set is a synonym for .equ, and there's also symbol=value syntax for setting a value which is also synonymous to .set.
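A small sketch of the three spellings, assuming the hypothetical names size_a, size_b and size_c (they are not from the question). All three are defined before the instructions, so even a bare symbol would be taken as an immediate here; OFFSET is used anyway, as recommended above.

```
.intel_syntax noprefix

.section .rodata
msg: .ascii "Hello, world!\n"
msg_end:

.set size_a, msg_end - msg      # .set directive
.equ size_b, msg_end - msg      # .equ directive, same meaning
size_c = msg_end - msg          # symbol = expression, same meaning

.text
    mov eax, OFFSET size_a      # each of these should disassemble
    mov eax, OFFSET size_b      # to the same mov eax, imm32
    mov eax, OFFSET size_c      # with the string length (14) as the immediate
```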


Operand-size: generally use 32-bit unless a value needs 64

mov rdx, OFFSET symbol will assemble to mov r/m64, sign_extended_imm32. You don't want that for a small length (vastly less than 4GiB) unless it's a negative constant, not an address. You also don't want movabs r64, imm64 for addresses; that's inefficient.

It's safe under GNU/Linux to write mov edx, OFFSET symbol in a position-dependent executable, and in fact you should always do that or use lea rdx, [rip + symbol], never a sign-extended 32-bit immediate, unless you're writing code that will be loaded into the high 2GB of virtual address space (e.g. a kernel). See "How to load address of function or label into register".

See also "32-bit absolute addresses no longer allowed in x86-64 Linux?" for more about PIE executables being the default in modern distros.
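Putting the advice together, here is a sketch of the question's program for x86-64 Linux: a RIP-relative lea for the address and a 32-bit mov with OFFSET for the size. The file name hello.s is an assumption; the syscall numbers (1 for write, 60 for exit) are the x86-64 Linux ones.

```
.intel_syntax noprefix

.section .rodata
msg: .ascii "Hello, world!\n"
    .set msg_size, . - msg      # assemble-time constant, 14

.text
.globl _start
_start:
    mov eax, 1                  # write
    mov edi, 1                  # fd 1 = stdout
    lea rsi, [rip + msg]        # buffer address, works in PIE and non-PIE
    mov edx, OFFSET msg_size    # length as a 32-bit immediate
    syscall

    mov eax, 60                 # exit
    xor edi, edi                # status 0
    syscall
```

It can be built with something like as hello.s -o hello.o followed by ld hello.o -o hello. Writing a 32-bit register zero-extends into the full 64-bit register, which is why edx is enough for a small length.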


Tip: if you know the AT&T syntax or the NASM syntax for something, use that to produce the encoding you want and then disassemble with objdump -Mintel to find out the right syntax for .intel_syntax noprefix.

But that doesn't help here because disassembly will just show the numeric literal like mov edx, 123, not mov edx, OFFSET name_not_in_object_file. Looking at gcc -masm=intel compiler output can also help, but again compilers do their own constant-propagation instead of using symbols for assemble-time constants.

BTW, no open-source projects that I'm aware of contain GAS intel_syntax source code. If they use gas, they use AT&T syntax. Otherwise they use NASM/YASM. (You sometimes also see MSVC inline asm in open source projects).


Same effect in AT&T syntax, or for [RIP + symbol]

This is a lot more artificial since you wouldn't normally do this with an integer constant that wasn't an address. I include it here just to show another facet of GAS's behaviour depending on a symbol being defined or not at a point during its 1 pass.

See "How do RIP-relative variable references like "[RIP + _a]" in x86-64 GAS Intel-syntax work?": [RIP + symbol] is interpreted as using relative addressing to reach symbol, not actually adding two addresses. But [RIP + 4] is taken literally, as an offset relative to the end of this instruction.

So again, it matters what GAS knows about a symbol when it reaches an instruction that references it, because it's 1-pass. If undefined, it assumes it's a normal symbol. If defined as a numeric value with no section associated, it works like a literal number.

_start:
foo=4
jmpq *foo(%rip)
jmpq *bar(%rip)
bar=4

That assembles to the first jump being the same as jmp *4(%rip), loading a pointer from 4 bytes past the end of the current instruction. But the 2nd jump uses a symbol relocation for bar, with a RIP-relative addressing mode to reach the absolute address of the symbol bar, whatever that may turn out to be.

0000000000000000 <.text>:
   0:   ff 25 04 00 00 00       jmp    QWORD PTR [rip+0x4]        # a <.text+0xa>
   6:   ff 25 00 00 00 00       jmp    QWORD PTR [rip+0x0]        # c <bar+0x8> 8: R_X86_64_PC32        *ABS*

After linking with ld foo.o, the executable has:

  401000:       ff 25 04 00 00 00       jmp    *0x4(%rip)        # 40100a <bar+0x401006>
  401006:       ff 25 f8 ef bf ff       jmp    *-0x401008(%rip)        # 4 <bar>
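For comparison, a sketch of the same experiment written in .intel_syntax noprefix. This is an untested assumption on my part: I'd expect it to produce the same two encodings as the AT&T version above, since the operand is the same once parsed, but verify with objdump -drwC -Mintel before relying on it.

```
.intel_syntax noprefix

_start:
foo = 4
    jmp QWORD PTR [rip + foo]   # foo is already a plain number: expected to act like [rip + 4]
    jmp QWORD PTR [rip + bar]   # bar is not defined yet: expected to get a relocation against bar
bar = 4
```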
Kalinda answered 7/9, 2016 at 3:6 Comment(7)
Thanks for the detailed answer, Peter! - Antigorite
"no open-source projects that I'm aware of contain GAS intel_syntax source code." Wow, really? Maybe it's all what you get used to, but I loathe the AT&T syntax. If I was going to spend any amount of effort writing/maintaining GAS inline assembly, I would definitely prefer to use the Intel syntax. Is there some technical reason why people don't do this? Like, are there significant limitations of GAS with respect to Intel syntax? (I'm mostly an MSVC guy. The MASM syntax is great, but the inability to specify input parameters for inline assembly makes it difficult to use for optimization.) - Swinger
@CodyGray: When I was first teaching myself x86 asm, I was used to AT&T syntax because that's what gcc / objdump used. The explicit $ on immediate operands was a big plus. But once I realized the only real insn set reference was in Intel syntax, and got used to that a bit, I started to realize that it seems easier / nicer to read. Dest on the left, and nicer memory operand syntax. I always use Intel-syntax mode on Godbolt and stuff like that. I only use AT&T syntax when writing gcc missed-optimization bug reports, since that seems to be what compiler devs choose. - Kalinda
@Cody: as far as technical limitations, I don't think so. But it's not as well documented. I seem to recall an SO question where there was some sense of GAS intel_syntax being semi-ambiguous or limiting, but I can't remember what the issue was. I think if you want Intel syntax, though, it's usually even nicer to use NASM or YASM, so it's unusual to go half-way and use gas .intel_syntax. - Kalinda
@CodyGray: Oh, I just remembered one specific downside: gas .intel_syntax still has the AT&T syntax design bug that some forms of the non-commutative x87 insns (like fdivr and fdiv) are reversed. I just re-tested this, and NASM fsubr st3, st0 is objdump (AT&T) fsub %st,%st(3) and also objdump -Mintel fsub st(3),st. (Note the fsub mnemonic in both cases.) Agner Fog's objconv -fyasm disassembles it as fsubr st3, st(0), with the correct mnemonic. Thumbs down for design bugs that can't be fixed for compat reasons. - Kalinda
Oh, good point about the non-commutative FP instructions. That left me stumped for a good couple of days (hard to Google for) when I was trying to teach myself how the GAS inline assembler worked using Godbolt. It doesn't help that the mnemonics themselves used by different disassemblers are often inconsistent. Fortunately, the x87 is becoming less and less relevant. I'm really not familiar with NASM or YASM for inline assembly. Naively, I'd imagine that using the compiler toolchain's built-in assembler increases your chances for better code. - Swinger
@CodyGray: You can't use YASM/NASM for inline asm at all. Projects like x264 and x265 that use NASM/YASM have the asm in separate files (and make extensive use of assembler macros). - Kalinda
