What's the difference between the x86-64 AT&T instructions movq and movabsq?

After reading this Stack Overflow answer and this document, I still don't understand the difference between movq and movabsq.

My current understanding is that in movabsq, the first operand is a 64-bit immediate operand whereas movq sign-extends a 32-bit immediate operand. From the 2nd document referenced above:

Moving immediate data to a 64-bit register can be done either with the movq instruction, which will sign extend a 32-bit immediate value, or with the movabsq instruction, when a full 64-bit immediate is required.

In the first reference, Peter states:

Interesting experiment: movq $0xFFFFFFFF, %rax is probably not encodeable, because it's not representable with a sign-extended 32-bit immediate, and needs either the imm64 encoding or the %eax destination encoding.

(editor's note: this mistaken assumption is fixed in the current version of that answer).

However, when I assemble/run this it seems to work fine:

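        .section .rodata
str:
        .string "0x%lx\n"          # printf format string
        .text
        .globl  main
main:
        pushq   %rbp               # set up a frame; also realigns %rsp to 16 bytes for the call
        movq    %rsp, %rbp
        movl    $str, %edi         # 1st arg: absolute 32-bit address, only valid in a non-PIE executable
        movq    $0xFFFFFFFF, %rsi  # 2nd arg: the immediate under test
        xorl    %eax, %eax         # %al = 0: no vector registers used by this variadic call
        call    printf
        xorl    %eax, %eax         # return 0
        popq    %rbp
        ret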

$ clang file.s -o file && ./file

prints 0xffffffff. (This also works for larger values, for instance if you add a few more "F"s.) movabsq produces identical output.

Is Clang inferring what I want? If it is, is there still a benefit to movabsq over movq?

Did I miss something?

Asked by Aliquant on 20/9, 2018 at 22:14. Comments (3):
When in doubt, check the disassembly. I have not tried clang, but gas silently converts that to movabsq. The point of movabsq is to make it explicit that you want a 64-bit immediate, e.g. even if the literal would fit into 32 bits, or for symbols. – Nameplate
Maybe try with $label as an immediate, so the assembler doesn't know at assemble time whether the label's absolute address will fit in a 32-bit immediate or not. That's not known until link time. – Flotsam
I updated my answer on the other question, with mov $symbol, %rdi and movabs $symbol, %rdi and so on. Thanks for catching that mistaken assumption. I even tested an old version of GAS from 2008, and it still used the 10-byte encoding instead of truncating, so my guess was probably never correct for assemble-time constants, only link-time constants (addresses). – Flotsam

There are three kinds of moves that fill a 64-bit register with an immediate:

  1. Moving to the low 32-bit part: B8+rd id, 5 bytes
    Example (Intel / AT&T): mov eax, 241 / mov[l] $241, %eax
    Writing the low 32-bit part zeroes the upper 32 bits.

  2. Moving with a 64-bit immediate: 48 B8+rd io, 10 bytes
    Example (Intel / AT&T): mov rax, 0xf1f1f1f1f1f1f1f1 / mov[abs][q] $0xf1f1f1f1f1f1f1f1, %rax
    Moves a full 64-bit immediate.

  3. Moving with a sign-extended 32-bit immediate: 48 C7 /0 id, 7 bytes
    Example (Intel / AT&T): mov rax, 0xffffffffffffffff / mov[q] $0xffffffffffffffff, %rax
    Moves a signed 32-bit immediate, sign-extended, into the full 64-bit register.
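As a rough illustration of the byte layouts listed above, here is a minimal C sketch that emits each form. The helper names (put_le, enc_mov_r32_imm32, enc_movabs_r64_imm64, enc_mov_r64_simm32) are hypothetical. Only the registers numbered 0-7 (rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi) are handled; r8-r15 would additionally need the REX.B bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the low n bytes of v in little-endian order (x86 immediates are little-endian). */
static void put_le(uint8_t *out, uint64_t v, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (uint8_t)(v >> (8 * i));
}

/* Form (1): mov r32, imm32 -> B8+rd id (5 bytes); the CPU zero-extends into the 64-bit register. */
static size_t enc_mov_r32_imm32(uint8_t *out, unsigned reg, uint32_t imm)
{
    assert(reg < 8);                    /* eax..edi only: no REX prefix needed */
    out[0] = (uint8_t)(0xB8 + reg);
    put_le(out + 1, imm, 4);
    return 5;
}

/* Form (2): movabs r64, imm64 -> REX.W B8+rd io (10 bytes). */
static size_t enc_movabs_r64_imm64(uint8_t *out, unsigned reg, uint64_t imm)
{
    assert(reg < 8);                    /* rax..rdi only: r8-r15 would also need REX.B */
    out[0] = 0x48;                      /* REX.W */
    out[1] = (uint8_t)(0xB8 + reg);
    put_le(out + 2, imm, 8);
    return 10;
}

/* Form (3): mov r/m64, imm32 -> REX.W C7 /0 id (7 bytes); the CPU sign-extends imm32. */
static size_t enc_mov_r64_simm32(uint8_t *out, unsigned reg, int32_t imm)
{
    assert(reg < 8);
    out[0] = 0x48;                      /* REX.W */
    out[1] = 0xC7;
    out[2] = (uint8_t)(0xC0 | reg);     /* ModRM: mod=11 (register), reg field=/0, rm=reg */
    put_le(out + 3, (uint32_t)imm, 4);
    return 7;
}
```

For example, encoding the question's movq $0xFFFFFFFF, %rsi (register 6) with form (2) yields 48 BE FF FF FF FF 00 00 00 00.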

Notice that at the assembly level there is room for ambiguity: movq is used for both the second and the third case.

For each immediate value we have:

  • (a) Values in [0, 0x7fff_ffff] can be encoded with (1), (2) and (3).
  • (b) Values in [0x8000_0000, 0xffff_ffff] can be encoded with (1) and (2).
  • (c) Values in [0x1_0000_0000, 0xffff_ffff_7fff_ffff] can be encoded only with (2).
  • (d) Values in [0xffff_ffff_8000_0000, 0xffff_ffff_ffff_ffff] can be encoded with (2) and (3).
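The ranges (a) to (d) can be checked mechanically. Below is a hedged C sketch that returns the set of forms able to load a given 64-bit value exactly. The function name and the bit flags are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One bit per encoding form, numbered as in the list above (hypothetical names). */
enum {
    FORM1_ZX32  = 1,  /* (1) B8+rd id: imm32 zero-extended      */
    FORM2_IMM64 = 2,  /* (2) REX.W B8+rd io: full imm64          */
    FORM3_SX32  = 4   /* (3) REX.W C7 /0 id: imm32 sign-extended */
};

/* Return the set of forms that can load exactly v into a 64-bit register. */
static unsigned encodable_forms(uint64_t v)
{
    unsigned forms = FORM2_IMM64;                            /* (2) holds every value */

    if (v <= 0xFFFFFFFFull)                                  /* cases (a) and (b) */
        forms |= FORM1_ZX32;
    if (v <= 0x7FFFFFFFull || v >= 0xFFFFFFFF80000000ull)    /* cases (a) and (d) */
        forms |= FORM3_SX32;

    return forms;
}
```

Only values in range (c) come back with a single form, which matches the remark that follows about the third case.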

All the cases but the third have at least two possible encodings.
When more than one encoding is available, the assembler usually picks the shortest one, but not always.

For GAS:
  • movabs[q] always corresponds to (2).
  • mov[q] corresponds to (3) for cases (a) and (d), and to (2) for the other cases.
  • It never generates (1) for a move to a 64-bit register.
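The GAS rule above can be written as a small decision function. This is a sketch of the behaviour as described here, not GAS's actual code. The names are hypothetical, and any optimisation options a particular GAS version may offer are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encoding forms, numbered as in the list above (hypothetical names). */
enum mov_form { FORM_ZX32 = 1, FORM_IMM64 = 2, FORM_SX32 = 3 };

/* Model of the described GAS choice for an immediate move to a 64-bit register:
 * movabsq always gets (2); movq gets (3) when the value fits a sign-extended
 * imm32 (cases (a) and (d)), otherwise (2). (1) is never chosen here. */
static enum mov_form gas_form_for_r64(uint64_t imm, bool movabs)
{
    bool fits_sx32 = imm <= 0x7FFFFFFFull || imm >= 0xFFFFFFFF80000000ull;
    if (movabs || !fits_sx32)
        return FORM_IMM64;
    return FORM_SX32;
}

/* Instruction length in bytes of each form (for rax..rdi destinations). */
static unsigned form_length(enum mov_form f)
{
    switch (f) {
    case FORM_ZX32:  return 5;
    case FORM_SX32:  return 7;
    case FORM_IMM64: return 10;
    }
    assert(!"unknown form");
    return 0;
}
```

Under this model, the question's movq $0xFFFFFFFF, %rsi falls in case (b). So GAS emits the 10-byte form (2), exactly as movabsq would, which is why both versions print the same value.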

To make it pick (1) we have to write mov[l] $0xffffffff, %edi ourselves, which is equivalent to moving 0xffffffff into %rdi (I believe GAS won't convert a move to a 64-bit register into a move to its lower 32-bit half, even when the two are equivalent).


In the 16/32-bit era, distinguishing between (1) and (3) was not considered really important (although GAS does let you pick a specific form), since (3) was not a sign-extending operation but just an artefact of the original 8086 encoding.

The mov instruction was never split into two forms to account for (1) and (3), instead a single mov was being used with the assembler almost always picking (1) over (3).

With the new 64-bit registers, always using 64-bit immediates would make the code far too bloated (and could easily exceed the maximum instruction length of 15 bytes), so it was not worth extending (1) to always take a 64-bit immediate.
Instead, (1) still has a 32-bit immediate and zero-extends it (which also breaks any false dependency on the old register value), and (2) was introduced for the rare cases where a 64-bit immediate operand is actually needed.
Taking the chance, (3) was also changed: it still takes a 32-bit immediate, but now sign-extends it.
(1) and (3) should suffice for the most common immediates (like 1 or -1).
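The difference between zero- and sign-extension decides which immediates (1) and (3) can share. Here is a small C model, with hypothetical function names, of the 64-bit value each form leaves in the register for a given 32-bit immediate field.

```c
#include <assert.h>
#include <stdint.h>

/* Form (1), mov r32, imm32: writing a 32-bit register clears bits 63..32. */
static uint64_t result_of_form1(uint32_t imm32)
{
    return (uint64_t)imm32;
}

/* Form (3), mov r/m64, imm32: bit 31 of imm32 is copied into bits 63..32. */
static uint64_t result_of_form3(uint32_t imm32)
{
    uint64_t v = imm32;
    if (v & 0x80000000u)
        v |= 0xFFFFFFFF00000000ull;
    return v;
}
```

So 1 means the same under both forms. The 32-bit pattern 0xFFFFFFFF means 0xffffffff under (1) but -1 under (3). That is why -1 uses (3), while 0xffffffff needs (1) or (2).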

However, the difference between (1)/(3) and (2) is deeper than the old difference between (1) and (3): (1) and (3) both have a 32-bit immediate operand, while (2) has a 64-bit one.

Why would one want an artificially lengthened instruction?
As described in the linked answer, one use case could be padding so that the top of the next loop is at a multiple of 16/32 bytes, without needing any NOP instructions.
This sacrifices code density (more space in the instruction cache) and decode efficiency outside the loop for better front-end efficiency for each loop iteration. But longer instructions are still generally cheaper for the front-end than having to decode some NOPs as well.

Another, more frequent, use case is when one only needs to generate a machine-code template.
For example, a JIT may prepare the sequence of instructions in advance and fill in the immediate values only at run time.
In that case using (2) greatly simplifies the handling, since there is always enough room for any possible value.
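Here is a minimal C sketch of that idea. It assumes a hypothetical two-instruction stub, movabsq $imm64, %rax; ret (bytes 48 B8 io C3). Because (2) always reserves 8 bytes, the immediate sits at a fixed offset and can be patched with any value. The names are illustrative. Making the buffer executable (for example with mmap and PROT_EXEC on POSIX systems) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical template: movabsq $imm64, %rax ; ret */
enum { IMM64_OFFSET = 2, TEMPLATE_LEN = 11 };

static const uint8_t movabs_rax_ret[TEMPLATE_LEN] = {
    0x48, 0xB8,                 /* REX.W B8+0: movabsq $imm64, %rax */
    0, 0, 0, 0, 0, 0, 0, 0,     /* imm64 placeholder, patched later */
    0xC3                        /* ret */
};

/* Copy the template into buf and store imm, little-endian, in the placeholder. */
static void fill_template(uint8_t *buf, uint64_t imm)
{
    assert(buf != NULL);
    memcpy(buf, movabs_rax_ret, TEMPLATE_LEN);
    for (int i = 0; i < 8; i++)
        buf[IMM64_OFFSET + i] = (uint8_t)(imm >> (8 * i));
}
```

With forms (1) or (3), the JIT would instead have to re-check each value's range and possibly change the instruction length, shifting everything after it.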

Another case is patching: in a debug build of a program, specific calls could be made indirectly through a register that has just been loaded with (2), so that a debugger can easily redirect the call to any new target.

Phlegm answered on 21/9, 2018 at 0:39. Comments (7):
There is also a fourth special case for the accumulator register not listed here. You mean the AL/AX/EAX/RAX load/store from/to a 64-bit absolute address? If you include that, then you also need to include mov r64, r/m64 with a normal addressing mode. – Flotsam
Do nops generate µOPs at all? I thought they were discarded in the decoder. – Subscribe
@Subscribe As far as I know they are still issued to the BE. That's how I interpreted the results of a couple of experiments I made after having seen a comment about this from PeterCordes. Anyway I carefully chose the term "no op" and not "nop" to include other de-facto no op instructions besides nop. – Phlegm
@PeterCordes Oh, you are right! I somehow was sure the form of [48] A0/1 was mov A, imm. Thank you. Now there you are here, do you mind taking a look at the comment above? XD – Phlegm
@fuz: nop on Intel CPUs takes a fused-domain uop all the way through the pipeline. Running NOPs is not usually important for performance, so Intel didn't give them special support other than of course zero unfused-domain uops / no execution unit. It's possible for nop to be a branch target, so not having it in the uop cache would be a complication. Also, performance counters for instructions retired count NOP. I guess losing NOP from perf counters would be ok, if they wanted to drop it at issue/rename time, but presumably it's not as easy as we might imagine to get the right RIP... – Flotsam
@PeterCordes Couldn't the nop be eliminated at issue time, i.e. not issuing it from the IDQ but still having it in the DSB? This will only save an entry in the ROB since nop don't execute and I don't know how much complexity this would introduce in the BE to handle the correct retirement. Probably not worth it. – Phlegm
@MargaretBloom: IDK how important it is to have the nop in the ROB for the purposes of getting the right RIP if the instruction before or after it faults or something, though. Or if each instruction knows both its start and end address instead of having to refer to the end of the previous. If TF is set, the ISA requires it to trap after the nop. (If TF isn't renamed, then this could still be handled in issue/rename, maybe by translating it into a pseudo-nop like mov rax,rax? Or just not filtering out the nop?) I'm sure it's complicated in ways we have no idea about. – Flotsam
