Why does SSE/AVX lack loading an immediate value?

As far as I know, SSE/AVX has no instruction for loading an immediate. One workaround is to load the value into a general-purpose register and then movd it into a vector register, but compilers seem to consider this more costly than loading from memory, even for a single scalar value.

This means a memory access is needed every time an operation uses a common constant such as 1, 0x80000000, 0x7fffffff, 0x3f800000, 0x3f000000, etc. Encoding these values in the machine code would take 4 bytes each, but so does a 32-bit absolute or RIP-relative address, and I believe an immediate load is cheaper than any sort of memory load.
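
For reference, the register-based workaround can be sketched with SSE2 intrinsics. This is only an illustration: the helper names (`scalar_from_bits`, `broadcast_from_bits`, `abs_via_mask`) are made up for this example, and the instruction sequences named in the comments are what a compiler would typically emit, not a guarantee. An optimizing compiler is free to turn any of these into a load from a constant pool instead.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 header) */
#include <stdint.h>

/* Hypothetical helper: put a 32-bit pattern into the low lane of an XMM
 * register via an integer register (typically mov r32, imm32 + movd). */
static inline __m128 scalar_from_bits(uint32_t bits)
{
    return _mm_castsi128_ps(_mm_cvtsi32_si128((int)bits));
}

/* Hypothetical helper: the same pattern in all four lanes
 * (typically movd + pshufd when it is not folded to a memory load). */
static inline __m128 broadcast_from_bits(uint32_t bits)
{
    return _mm_castsi128_ps(_mm_set1_epi32((int)bits));
}

/* Example use: fabsf() by clearing the sign bit with the 0x7fffffff mask. */
static inline float abs_via_mask(float x)
{
    __m128 mask = scalar_from_bits(0x7fffffffu);
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(x), mask));
}
```

Comparing the compiler output for `abs_via_mask` with and without optimization shows whether the constant ends up as an immediate `mov` followed by `movd`, or as a memory operand.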

I always thought something like movss xmm, imm32 or broadcastss xmm, imm32 would be nice to have, but there must be a reason for not making such instructions. Why was it designed this way?

Mischance answered 6/5, 2022 at 17:6 Comment(9)
By contrast, ARM NEON does have instructions that broadcast an immediate value into a vector. Reasons posted as answers won't be convincing if they apply equally to NEON. – Barnabas
This is likely to be unanswerable unless somebody from the SSE/AVX design team sees the question and is willing to discuss what they were thinking. – Damick
The standard solution for this is to load a constant from memory. This is how the instruction set was designed and it's the same on MMX and the x87 floating point unit. – Lenhard
Several of those constants (where all the set bits are contiguous at one end of the register) can be generated in 2 instructions, starting with pcmpeqd xmm0,xmm0 (all-ones). See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide. But 2 instructions is still worse than 1, or a memory source operand, so compilers generally don't do that. – Doall
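
A minimal sketch of that on-the-fly generation, written with SSE2 intrinsics rather than raw assembly. The function names are invented for this example. Since a compiler may constant-fold these expressions back into a memory load, this shows the idea rather than a way to force the exact instructions. The 0x3f800000 case is included only to show that a pattern not anchored at one end of the register needs a third instruction.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* pcmpeqd x,x: comparing a register with itself sets every bit. */
static inline __m128i all_ones(void)
{
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* Two-instruction constants: all-ones plus one shift per 32-bit lane. */
static inline __m128i abs_mask(void)  { return _mm_srli_epi32(all_ones(), 1);  } /* 0x7fffffff */
static inline __m128i sign_mask(void) { return _mm_slli_epi32(all_ones(), 31); } /* 0x80000000 */
static inline __m128i int_one(void)   { return _mm_srli_epi32(all_ones(), 31); } /* 1 */

/* 1.0f (0x3f800000) has its set bits in the middle: shift right, then left. */
static inline __m128i float_one_bits(void)
{
    return _mm_slli_epi32(_mm_srli_epi32(all_ones(), 25), 23);
}

/* Helper for inspecting the low 32-bit lane. */
static inline uint32_t lane0(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```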
@Lenhard I don't know much about the history of x86, but I think x87 was designed to load a constant from memory because it was originally a stack-machine-like coprocessor, and MMX was built on top of x87. SSE was a totally new design, so it doesn't have to follow x87 and MMX. – Mischance
AVX-512 has vpbroadcastd x/y/zmm, eax, so you can construct any set1_epi32() constant with a mov-immediate + that. (Strangely, compilers do sometimes use that, but not pcmpeqd / psrld.) – Doall
I've wondered if lack of mov-immediate to vector reg was a matter of never decoding more than a 1-byte immediate for vector instructions. Or some other quirk of convenience / inconvenience for existing Intel microarchitectures. Intel has definitely gimped their ISA for short-term convenience in the past, like SSE1 with cvtsi2ss and sqrtss merging into an XMM (false dep) instead of zero-extending, because P3 handles 128-bit vectors as two 64-bit halves, so zero-extending would take 2 uops to write a full reg. GCC spends extra dep-breaking pxor instructions to work around it. – Doall
@PeterCordes: But even a one-byte immediate could have been very useful. The NEON move-immediate only includes an 8-bit immediate (with a few different options for how to decode it), and that probably covers 95% of use cases. – Beane
@NateEldredge: Right, yes, 1 byte might actually be a better design choice than 32-bit. (Although ARM already has complex decoding for immediates for Thumb mode, while x86 at most does sign-extension. Except for bitfields for control operands for stuff like roundps, but that's kinda different.) Still, not a great argument. Possibly something microarchitectural about getting immediates used as values, as opposed to shuffle or other control operands for SIMD instructions. – Doall

© 2022 - 2024 — McMap. All rights reserved.