That's correct: x86 machine code can't encode an instruction with two explicit memory operands (arbitrary addresses specified in []).
> What's the recommended register
Any register you don't need to save/restore.
In all the mainstream 32-bit and 64-bit calling conventions, EAX, ECX, and EDX are call-clobbered, so AL, CL, and DL are good choices. For a byte or word copy, you typically want a movzx load into a 32-bit register, then an 8-bit or 16-bit store. This avoids a false dependency on the old value of the register. Only use a narrow 16-bit or 8-bit mov load if you actively want to merge into the low bits of another value. x86's movzx is the analogue of instructions like ARM ldrb.
    movzx  ecx, byte [rdi]     ; load CL, zero-extending into RCX
    mov    [rdi+10], cl        ; byte store of the low 8 bits
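The same pattern applies to a word copy. As a minimal sketch in NASM syntax, assuming 64-bit mode with the source address in RSI and the destination in RDI (those register assignments and the constant are only for illustration), the second half shows the one case where a narrow load is the right tool:

```nasm
bits 64

; Word (16-bit) copy: zero-extending load, then a 16-bit store.
movzx  eax, word [rsi]     ; writes all of EAX (and zeroes the top of RAX)
mov    [rdi], ax           ; stores only the low 16 bits

; Narrow load used on purpose, to merge into an existing value.
mov    eax, 0x12340000     ; illustrative value whose upper half we want to keep
mov    ax, [rsi]           ; replaces only bits 0..15; the upper bits of EAX are kept
```

The first pair has no input dependency on whatever EAX held before; the last instruction necessarily depends on the old EAX, which is fine when that merge is what you want.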
In 64-bit mode, SIL, DIL, r8b, r9b and so on are also fine choices, but require a REX prefix in the machine code for the store so there's a minor code-size reason to avoid them.
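To make the code-size point concrete, here is a sketch of the same byte store with three different source registers, with the machine-code bytes worked out by hand in the comments (NASM syntax, 64-bit mode assumed):

```nasm
bits 64

mov  [rdi+10], cl      ; 88 4F 0A       3 bytes, no REX prefix needed
mov  [rdi+10], sil     ; 40 88 77 0A    4 bytes, REX (0x40) selects SIL instead of DH
mov  [rdi+10], r8b     ; 44 88 47 0A    4 bytes, REX.R extends the register field to R8
```

Without the 0x40 prefix, the ModRM byte 0x77 would name DH as the source, which is why SIL and DIL can't be encoded without REX.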
Generally avoid writing AH, BH, CH, or DH for performance reasons, unless you understand how your target CPUs handle partial registers and are sure that false dependencies or partial-register merging stalls either won't happen or won't be a problem in your code.
> (or should I use the stack instead)?
First of all, you can't push a single byte at all, so there's no way you could do a byte load / byte store from the stack. For a word, dword, or qword (depending on CPU mode), you could push [src] / pop [dst], but that's a lot slower than copying via a register. It introduces an extra store/reload store-forwarding latency before the data can be read from the final destination, and takes more uops.
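For comparison, a sketch of both ways to copy a qword in 64-bit mode, assuming the source address is in RSI and the destination address in RDI (an illustrative choice, not something the question specifies):

```nasm
bits 64

; Via the stack: the data is stored, reloaded, and stored again.
push qword [rsi]       ; load from [rsi], store to the stack
pop  qword [rdi]       ; reload from the stack, store to [rdi]

; Via a call-clobbered register: one load, one store.
mov  rax, [rsi]
mov  [rdi], rax
```

Both sequences leave RSP unchanged afterwards, but only the second avoids the round trip through stack memory.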
The exception is when the desired destination is on the stack and you can't optimize that local variable into a register: then push [src] is just fine, because it copies the value there and allocates stack space for it in one instruction.
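A sketch of that case in 64-bit mode, again assuming the source address is in RSI (hypothetical), next to the longer sequence it replaces:

```nasm
bits 64

; Allocate an 8-byte local and initialize it from memory in one instruction.
push qword [rsi]       ; RSP -= 8, then [rsp] = the qword at [rsi]
; ... code that uses [rsp] as the local goes here ...
add  rsp, 8            ; release the local

; The same effect without push, using RAX as a temporary.
sub  rsp, 8
mov  rax, [rsi]
mov  [rsp], rax
; ... code that uses [rsp] as the local goes here ...
add  rsp, 8
```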
See https://agner.org/optimize/ and other x86 performance links in the x86 tag wiki.