assembly cltq and movslq difference
Chapter 3 of Computer Systems A Programmer's Perspective (2nd Edition) mentions that
cltq is equivalent to movslq %eax, %rax.

Why did they create a new instruction (cltq) instead of just using movslq %eax,%rax? Isn't that redundant?

Rustyrut answered 10/6, 2016 at 8:28 Comment(3)
First off, cltq is the name the GNU assembler uses for the x86 instruction called cdqe. This is a two-byte instruction (1-byte REX prefix, 1-byte opcode), and the instruction has been in use since the 16-bit 8086. movslq is more recent (base opcode added with the 32-bit extension in 386) and takes 3 bytes for the equivalent functionality. – Posada
@EOF, technically, movslq is even newer, using the opcode that was ARPL in 32/16-bit modes. It's still 3 bytes, though, including the REX prefix. (Instead of 4 bytes for movswq or something.) In Intel syntax, it's called movsxd. – Kirsti
Could someone explain what movslq does, please? – Bohlin
K
23

TL;DR: use cltq (aka cdqe) when possible, because it's one byte shorter than the exactly-equivalent movslq %eax, %rax. That's a very minor advantage (so don't sacrifice anything else to make this happen), but it's a reason to keep a value in eax if you're going to sign-extend it a lot.
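
For anyone wondering what the two instructions actually compute, here is a minimal C sketch of 32-to-64-bit two's complement sign extension. The function name is made up for illustration; it only models the register contents, not the encoding or timing.

```c
#include <stdint.h>

/* Hypothetical helper modelling cltq / movslq %eax, %rax:
   bit 31 of the 32-bit value is copied into bits 32..63. */
uint64_t sign_extend_32_to_64(uint32_t eax)
{
    uint64_t rax = eax;                       /* start from a zero-extended copy */
    if (eax & UINT32_C(0x80000000))           /* sign bit set? */
        rax |= UINT64_C(0xFFFFFFFF00000000);  /* fill the upper half with ones */
    return rax;
}
```

For example, 0x80000000 becomes 0xFFFFFFFF80000000, while 0x7FFFFFFF stays 0x000000007FFFFFFF. In ordinary C you get the same effect by converting an int32_t to int64_t.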

This is mostly relevant for compiler-writers (compiling signed-integer loop counters indexing arrays); stuff like sign-extending a loop counter every iteration only happens when compilers don't manage to take advantage of signed overflow being undefined behaviour to avoid it. Human programmers will just decide what's signed vs. unsigned to save instructions.
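
A sketch of the kind of loop meant here, assuming a 64-bit target where int is 32 bits. The function names are hypothetical. With an int index, the counter has to be widened to 64 bits to form an address, so a compiler that doesn't prove the extension away may emit cltq or movslq inside the loop; with a size_t index there is nothing to extend. Whether a sign-extension actually appears depends on the compiler and its options.

```c
#include <stddef.h>

/* int index: i is 32 bits, but a[i] needs a 64-bit offset */
long sum_int_index(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* size_t index: already pointer-width, no widening needed */
long sum_size_index(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Both functions return the same result; the only difference is the type of the counter the compiler has to work with.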

(Sign-extending into a different register with movsx / movslq can avoid lengthening the dependency chain for the 32-bit value, which is relevant if it's updated in a loop.)


Related: complete run-down on Intel vs. AT&T mnemonics for the different sizes of the instructions that sign-extend within RAX (cltq), or from EAX into EDX:EAX (cltd), with the equivalent movsx / movs?t?: What does cltq do in assembly?.


The history

Actually, the 32->64 bit form of MOVSX (called movslq in AT&T syntax), is the new one, new with AMD64. The Intel-syntax mnemonic is actually MOVSXD. The opcode is 63 /r (so it's 3 bytes including the necessary REX prefix, vs. 4 bytes for 8->64 or 16->64 MOVSX). AMD repurposed the opcode from ARPL, which doesn't exist in 64-bit mode.
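
The byte counts can be seen by writing out the encodings. This is just a sketch with made-up array names; the bytes follow the opcodes quoted above with RAX/EAX/AX/AL as operands (ModRM = C0), and are worth double-checking against your assembler's output.

```c
/* 64-bit-mode machine code for sign-extending into RAX */
const unsigned char cltq_bytes[]     = { 0x48, 0x98 };             /* cltq / cdqe:          REX.W 98       */
const unsigned char movslq_eax_rax[] = { 0x48, 0x63, 0xC0 };       /* movslq %eax, %rax:    REX.W 63 /r    */
const unsigned char movswq_ax_rax[]  = { 0x48, 0x0F, 0xBF, 0xC0 }; /* movswq %ax, %rax:     REX.W 0F BF /r */
const unsigned char movsbq_al_rax[]  = { 0x48, 0x0F, 0xBE, 0xC0 }; /* movsbq %al, %rax:     REX.W 0F BE /r */
```

So cltq is 2 bytes, movslq is 3, and the 8- or 16-bit-source forms are 4 because of the 0F escape byte.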

To understand the history, remember that current x86 wasn't designed all at once. First there was 16-bit 8086, with no MOVSX/MOVZX at all, just CBW and CWD. Then 386 added MOVSX/MOVZX (and wider versions of CBW/CWD for sign-extending within eax or into edx). Then AMD extended all of that to 64-bit.

The REX versions of the existing MOVSX opcodes still have an 8 or 16-bit source, but sign-extend all the way to 64 bits instead of just 32. The operand-size prefix lets you encode movsbw, aka movsx r16, r/m8. IDK what happens if you use an operand-size prefix and REX.W at the same time. Or what happens if you use an operand-size prefix with the 16-bit source form of MOVSX. Probably it's just an expensive way to encode MOV, like using 63 /r without a REX prefix (which Intel's instruction-set manual recommends against).


cltq (aka CDQE) is just the obvious way to extend the existing cwtl (aka CWDE) with a REX.W prefix to promote the operand-size to 64 bits. The original form of this, cbtw (aka CBW), was in 8086, predating MOVSX, and was the only sane way to sign-extend anything. Since shifts with immediate count>1 were a 186 feature, the least bad other option seems to be mov ah, al / mov cl, 7 / sar ah, cl to broadcast the sign bit to all positions.
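
To make the 8086 workaround concrete, here is a small C model of the mov ah, al / mov cl, 7 / sar ah, cl sequence. The helper names are hypothetical, and the arithmetic shift is written out by hand because >> on negative values is implementation-defined in C. It's only a sketch of the register effects.

```c
#include <stdint.h>

/* Arithmetic right shift of an 8-bit value by count (1..7). */
static uint8_t sar8(uint8_t v, unsigned count)
{
    /* if the sign bit is set, the vacated top bits are filled with ones */
    uint8_t sign_fill = (v & 0x80) ? (uint8_t)(0xFF << (8 - count)) : 0;
    return (uint8_t)((v >> count) | sign_fill);
}

/* Models the three-instruction sequence and returns the resulting AX. */
uint16_t sign_extend_al_8086(uint8_t al)
{
    uint8_t ah = al;         /* mov ah, al */
    unsigned cl = 7;         /* mov cl, 7  */
    ah = sar8(ah, cl);       /* sar ah, cl : every bit of AH is now AL's sign bit */
    return (uint16_t)((ah << 8) | al);
}
```

The result matches what the single-byte CBW does: AL = 0x80 gives AX = 0xFF80, and AL = 0x7F gives AX = 0x007F.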

Also, don't confuse cwtl with cwtd (aka CWD: sign extend ax into dx:ax, e.g. to set up for idiv).
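
A minimal C sketch of the difference, with made-up function names: cwtl widens the value in place inside eax, while cwtd leaves ax alone and only writes dx.

```c
#include <stdint.h>

/* cwtl (Intel: CWDE): sign-extend AX within EAX; returns the new EAX. */
uint32_t cwtl_model(uint16_t ax)
{
    return (ax & 0x8000) ? (UINT32_C(0xFFFF0000) | ax) : ax;
}

/* cwtd (Intel: CWD): AX is unchanged; returns the new DX,
   which is 16 copies of AX's sign bit (the high half of DX:AX). */
uint16_t cwtd_model_dx(uint16_t ax)
{
    return (ax & 0x8000) ? 0xFFFF : 0x0000;
}
```

For ax = 0x8001, cwtl leaves eax = 0xFFFF8001, whereas cwtd leaves ax = 0x8001 and sets dx = 0xFFFF.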

The AT&T mnemonics are pretty horrible here. l vs. d, really? The Intel mnemonics all have e on the end for the ones that extend within rax, and not for the ones that extend into (part of) rdx. Except for CBW, but of course that extends al into ax: even 8086 had 16-bit registers, so it never needed to store 16-bit values in dl:al. (idiv r/m8 uses ax as a source reg, not dl:al, and puts the results in ah and al.)


Redundancies

Yes, this is one of many redundancies in x86 assembly language, e.g. sub eax,eax to zero rax vs. xor eax,eax. (mov eax,0 isn't totally redundant, because it doesn't affect flags. If you include slight differences like that as redundant, or even instructions that run on different execution ports, there are lots of ways to do some things.)

If I had the chance to modify the x86-64 ISA, I would probably give MOVZX and MOVSX single-byte opcodes (instead of 0F XX two-byte escaped opcodes), at least the 8-bit-source versions. So movsx eax, byte [mem] would be as compact as mov al, [mem]. (They're already the same performance on Intel CPUs: handled entirely in the load port, with no ALU uop). Most real code fails to take advantage of [u]int16_t arrays for higher cache density, so I think movs/zx from word to dword or qword is rarer. Or maybe there's enough wide-character code around to justify shorter opcodes for MOVZX r32/r64, r/m16. To make some room, we can drop the CBW / CWDE / CDQE opcode entirely. I might keep CWD / CDQ / CQO as a useful setup for idiv, which has no one-instruction equivalent.

In reality, probably having fewer single-byte opcodes and more escape prefixes would be a lot more useful (e.g. so common SSE2 insns can be 2 opcode bytes + ModRM, instead of the usual 3 or 4 opcode bytes). Instruction-decoding is less of a bottleneck with shorter instructions in high-performance loops. But if x86-64 machine code is too different from 32-bit, we need extra decode transistors. That may be ok now that power limitations have made dark silicon a thing, because a core would never need its 32-bit decoder powered on at the same time as its 64-bit decoder. That wasn't the case when AMD was designing AMD64. (err, HyperThreading alternating cycles between logical threads running in 32-bit and 64-bit would stop you from fully shutting down either, if they were separate.)

Instead of CDQ, we could have made shift instructions with a separate destination operand (so the source register is not destroyed), so sar edx, eax, 31 would do CDQ in 3 bytes. Dropping the one-byte xchg-with-eax opcodes (other than 0x90 xchg eax,eax NOP) would free up lots of coding space for sar, shr, shl without needing the Reg field of the ModRM as extra opcode bits. (And of course remove the doesn't-affect-flags special case for shift_count=0, to kill the input dependency on FLAGS.)

(I'd also have changed setcc r/m8 to setcc r/m32. Or maybe setcc r32/m8. (Memory dst uses a separate ALU uop anyway, so it could decode as setcc tmp32 and store the low 8 of that). It's almost always used by xor-zeroing a destination, and you have to juggle that vs. the flag-setting.)

AMD had the chance to do (some of) this with AMD64, but chose to be conservative to share as many instruction-decode transistors as possible. (Can't really fault them for that, but it's unfortunate that political/economic circumstances resulted in x86 missing its only chance for the foreseeable future to drop some of its legacy baggage.) It also meant less work modifying code generation / analysis software, but that's a one-time cost and small potatoes compared to potentially making every x86-64 CPU run faster and have smaller binaries.


See also the tag wiki for more links, including this old appendix from the NASM manual documenting when every form of every instruction was introduced.

Related: MOVZX missing 32 bit register to 64 bit register.

Kirsti answered 10/6, 2016 at 10:47 Comment(11)
Why is this answer so popular? I usually get fewer upvotes than this for answers with a clever use of SSE or something. This is just some instruction-set arcana, and I thought most of it was pretty obvious if you think about how x86 was extended from 16 to 32, and then to 64. – Kirsti
Don't ask, don't tell? I mean if you really want I can downvote you to set the world a bit more straight. – Had
@BeeOnRope: heh, I was thinking more along the lines of upvoting my answer on #36327600, #35517378, https://mcmap.net/q/14722/-packing-bcd-to-dpd-how-to-improve-this-amd64-assembly-routine, or some of the other answers I spent a long time on :P – Kirsti
I added a solution for the first one which should get to 32B/cycle. – Had
@Peter, thanks for the useful answer. Would it not be nice to use slightly simpler answers, so that newbies are not stunned by details they poorly understand and don't overlook the answer somewhere in the middle because they got lost, which reduces the usefulness of an otherwise high-quality answer? – Counts
@Peter, wading through a complicated discussion of an even more complicated instruction format left over from the early Intel days, just to find the answer to a simple question, is no fun, I would tell you. – Counts
I took a guess that you just wanted to know which one to use, and put the TL;DR at the top. It was already in bold... – Kirsti
@Peter, I came from the gas documentation, seeking the difference between the two. IMHO, this looks like too deep coverage, a bit overwhelming. Others, of course, may have different opinions. – Counts
@BulatM.: So is my edit what you were hoping for? The question already says that they are the same. The historical background on the evolution of the instruction set is what made the question worth answering, IMO. Also, sorting out the potentially-confusing naming of CDQ and CDQE. – Kirsti
Could someone explain what movslq does, please? – Bohlin
@ibodi: it's movsxd, 2's complement sign-extension from 32 to 64 bits. See also What does cltq do in assembly? for answers that show examples of what happens to the bits for movslq %eax, %rax. – Kirsti
