As far as I know, there is no SSE/AVX instruction for loading an immediate. One workaround is to load the value into a general-purpose register and `movd` it into a vector register, but compilers seem to think this is more costly than loading from memory, even for a single scalar value.
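For illustration, here is a minimal sketch of that workaround using SSE2 intrinsics. The helper names `scalar_from_bits` and `abs_via_mask` are hypothetical, and a compiler is free to turn the constant into a memory load anyway, which is exactly the behavior described above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE1 ones) */
#include <stdint.h>

/* Hypothetical helper: materialize a 32-bit pattern in an XMM register
   via a general-purpose register, i.e. roughly
       mov  eax, imm32
       movd xmm0, eax                                                  */
static inline __m128 scalar_from_bits(uint32_t bits)
{
    __m128i v = _mm_cvtsi32_si128((int)bits); /* movd: low lane = bits, upper lanes zeroed */
    return _mm_castsi128_ps(v);               /* type reinterpretation only, no instruction */
}

/* Example use: scalar absolute value by clearing the sign bit. */
static inline float abs_via_mask(float x)
{
    __m128 mask = scalar_from_bits(0x7fffffffu);
    return _mm_cvtss_f32(_mm_and_ps(_mm_set_ss(x), mask));
}
```

Calling `abs_via_mask(-2.5f)` yields `2.5f`; whether the mask actually comes from `mov` + `movd` or from a load is up to the compiler and its tuning options.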
This makes a memory access necessary every time you do an operation with common constants such as `1`, `0x80000000`, `0x7fffffff`, `0x3f800000`, `0x3f000000`, etc. Granted, encoding these values in the machine code would occupy 4 bytes each, but so does a 32-bit absolute or `rip`-relative address, and I believe an immediate load is cheaper than any sort of memory load.
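As a reminder of why those particular bit patterns come up so often, here is a small sketch in plain C. The helper names are hypothetical, and it assumes `float` is IEEE 754 binary32 (true on x86): `0x3f800000` and `0x3f000000` are the encodings of `1.0f` and `0.5f`, `0x80000000` is the sign-bit mask, and `0x7fffffff` is its complement.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE 754 binary32). */
static float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret a float as its 32-bit pattern. */
static uint32_t bits_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* XOR with 0x80000000 flips the sign bit (negation). */
static float flip_sign(float f)
{
    return float_from_bits(bits_from_float(f) ^ 0x80000000u);
}

/* AND with 0x7fffffff clears the sign bit (absolute value). */
static float clear_sign(float f)
{
    return float_from_bits(bits_from_float(f) & 0x7fffffffu);
}
```

In SIMD code the same masks are typically applied with `xorps` and `andps`, which is why they usually end up as memory operands.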
I always thought something like `movss xmm, imm32` or `broadcastss xmm, imm32` would be nice to have, but there must be a reason such instructions were never added. Why was it designed this way?
`pcmpeqd xmm0,xmm0` (all-ones). See "What are the best instruction sequences to generate vector constants on the fly?" and Agner Fog's guide. But 2 instructions is still worse than 1, or a memory source operand, so compilers generally don't do that. – Doall
`vpbroadcastd x/y/zmm, eax`, so you can construct any `set1_epi32()` constant with a mov-immediate + that. (Strangely, compilers do sometimes use that, but not `pcmpeqd` / `psrld`.) – Doall
`cvtsi2ss` and `sqrtss` merging into an XMM (false dep) instead of zero-extending, because P3 handles 128-bit vectors as two 64-bit halves, so zero-extending would take 2 uops to write a full reg. GCC spends extra dep-breaking `pxor` instructions to work around it. – Doall
`roundps`, but that's kinda different.) Still, not a great argument. Possibly something microarchitectural about getting immediates used as values, as opposed to shuffle or other control operands for SIMD instructions. – Doall
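The on-the-fly constant generation mentioned in the comments above (`pcmpeqd` followed by shifts) can be sketched with SSE2 intrinsics as follows. The function names are hypothetical, and an optimizing compiler may well constant-fold these sequences back into a memory load, so the code only illustrates the arithmetic. The last helper uses `movd` + `pshufd` as an SSE2 stand-in for the `vpbroadcastd` from a general-purpose register named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* pcmpeqd x,x: every lane compares equal to itself -> 0xffffffff per lane. */
static inline __m128i gen_all_ones(void)
{
    __m128i x = _mm_setzero_si128();
    return _mm_cmpeq_epi32(x, x);
}

/* psrld 1 on all-ones -> 0x7fffffff (abs mask). */
static inline __m128i gen_abs_mask(void)
{
    return _mm_srli_epi32(gen_all_ones(), 1);
}

/* pslld 31 on all-ones -> 0x80000000 (sign mask). */
static inline __m128i gen_sign_mask(void)
{
    return _mm_slli_epi32(gen_all_ones(), 31);
}

/* psrld 25 leaves 0x7f; pslld 23 moves it to 0x3f800000 (1.0f). */
static inline __m128i gen_one_ps_bits(void)
{
    return _mm_slli_epi32(_mm_srli_epi32(gen_all_ones(), 25), 23);
}

/* Arbitrary constant: mov-immediate to a GPR, movd, then pshufd to broadcast. */
static inline __m128i broadcast_from_gpr(uint32_t bits)
{
    return _mm_shuffle_epi32(_mm_cvtsi32_si128((int)bits), 0);
}

/* Read back the low 32-bit lane for inspection. */
static inline uint32_t first_lane(__m128i v)
{
    return (uint32_t)_mm_cvtsi128_si32(v);
}
```

Each generated constant costs two or three instructions, which matches the point made in the first comment: that is still more than a single memory source operand.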