Why does SSE/AVX lack loading an immediate value?


Asked 6/5, 2022 at 17:06 Answered 6/5, 2022 at 17:06

assembly x86 sse instruction-set immediate-operand

As far as I know, there is no instruction in SSE/AVX for loading an immediate. One workaround is to load the value into a general-purpose register and then movd it into a vector register, but compilers seem to think this is more costly than loading from memory, even for a single scalar value.
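To make the two options concrete, here is a minimal sketch in C with SSE2 intrinsics, assuming an x86-64 target. The function names and the `one_in_memory` variable are made up for illustration. The first function expresses the register route (typically mov + movd), the second the memory route (typically movss from a constant).

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 header) */

/* Register route: put the bit pattern in a general-purpose register,
   then move it into the low lane of an XMM register. */
static float one_via_gpr(void) {
    __m128i bits = _mm_cvtsi32_si128(0x3f800000);  /* bit pattern of 1.0f */
    return _mm_cvtss_f32(_mm_castsi128_ps(bits));
}

/* Memory route: the constant lives in memory and is loaded as a scalar. */
static const float one_in_memory = 1.0f;

static float one_via_memory(void) {
    return _mm_cvtss_f32(_mm_load_ss(&one_in_memory));
}
```

Both functions return 1.0f. An optimizing compiler is free to fold either one into whatever form it prefers, so the generated assembly may not match the intrinsics one-to-one.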

This makes a memory access necessary for every operation that uses a common constant such as 1, 0x80000000, 0x7fffffff, 0x3f800000, 0x3f000000, etc. Granted, encoding these values in the machine code would take 4 bytes each, but so does a 32-bit absolute or RIP-relative address, and I believe an immediate load is cheaper than any sort of memory load.

I always thought something like movss xmm, imm32 or broadcastss xmm, imm32 would be nice to have, but there must be a reason for not making such instructions. Why was it designed this way?

Mischance answered 6/5, 2022 at 17:06 Comment(9)

By contrast, ARM NEON does have instructions that broadcast an immediate value into a vector. Reasons posted as an answer won't be convincing if they apply equally to NEON. – Barnabas 6/5, 2022 at 17:18

This is likely to be unanswerable unless somebody from the SSE/AVX design team sees the question and is willing to discuss what they were thinking. – Damick 6/5, 2022 at 17:25

The standard solution for this is to load a constant from memory. This is how the instruction set was designed and it's the same on MMX and the x87 floating point unit. – Lenhard 6/5, 2022 at 17:30

Several of those constants (where all the set bits are contiguous at one end of the register) can be generated in 2 instructions, starting with pcmpeqd xmm0,xmm0 (all-ones). See What are the best instruction sequences to generate vector constants on the fly? and Agner Fog's guide. But 2 instructions is still worse than 1, or a memory source operand, so compilers generally don't do that. – Doall 6/5, 2022 at 17:51
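As a rough illustration of the two-instruction trick, here is a sketch in C with SSE2 intrinsics, assuming an x86-64 target. The helper names are hypothetical. It builds the 0x80000000 and 0x7fffffff masks from all-ones with a single shift each (pcmpeqd followed by pslld or psrld).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* All-ones in every lane: compare a register with itself (pcmpeqd). */
static __m128i all_ones(void) {
    __m128i z = _mm_setzero_si128();
    return _mm_cmpeq_epi32(z, z);
}

/* 0x80000000 in every lane: all-ones shifted left by 31 (pslld). */
static __m128i sign_mask(void) {
    return _mm_slli_epi32(all_ones(), 31);
}

/* 0x7fffffff in every lane: all-ones shifted right logically by 1 (psrld). */
static __m128i abs_mask(void) {
    return _mm_srli_epi32(all_ones(), 1);
}

/* Illustration helper: read the low 32-bit lane as an unsigned value. */
static uint32_t lane0(__m128i v) {
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

Constants such as 0x3f800000, whose set bits sit in the middle of the lane, need a second shift on top of this, which is three instructions in total. A compiler may also constant-fold these intrinsics back into a memory load, so hand-written assembly is the only way to be sure of the sequence.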

@Lenhard I don't know much about the history of x86, but I think x87 was designed to load a constant from memory because it was originally a stack-machine-like coprocessor, and MMX was built on top of x87. SSE was a totally new design, so it doesn't have to follow x87 and MMX. – Mischance 6/5, 2022 at 17:51

AVX-512 has vpbroadcastd x/y/zmm, eax, so you can construct any set1_epi32() constant with a mov-immediate + that. (Strangely compilers do sometimes use that, but not pcmpeqd / psrld). – Doall 6/5, 2022 at 17:56

I've wondered if lack of mov-immediate to vector reg was a matter of never decoding more than a 1-byte immediate for vector instructions. Or some other quirk of convenience / inconvenience for existing Intel microarchitectures. Intel has definitely gimped their ISA for short-term convenience in the past, like SSE1 with cvtsi2ss and sqrtss merging into an XMM (false dep) instead of zero extending, because P3 handles 128-bit vectors as two 64-bit halves, so zero-extending would take 2 uops to write a full reg. GCC spends extra dep-breaking pxor instructions to work around it. – Doall 6/5, 2022 at 17:57

@PeterCordes: But even a one-byte immediate could have been very useful. The NEON move-immediate only includes an 8-bit immediate (with a few different options for how to decode it), and that probably covers 95% of use cases. – Beane 6/5, 2022 at 18:12

@NateEldredge: Right, yes, 1 byte might actually be a better design choice than 32-bit. (Although ARM already has complex decoding for immediates for Thumb mode, while x86 at most does sign-extension. Except for bitfields for control operands for stuff like roundps, but that's kinda different.) Still, not a great argument. Possibly something microarchitectural about getting immediates used as values, as opposed to shuffle or other control operands for SIMD instructions. – Doall 6/5, 2022 at 18:23
