Yes, using vpermilps-immediate is normally a missed optimization vs. vshufps (except on Knight's Landing), wasting 1 byte of code size for the same operation with the same performance.
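The 1-byte difference comes from the VEX prefix: vshufps lives in the 0F opcode map and can use the 2-byte VEX prefix, while vpermilps-immediate lives in the 0F3A map, which always needs the 3-byte VEX prefix. As a rough sketch, the byte strings below are hand-assembled for one register-register example (the `ENCODINGS` table is purely illustrative); check them against your own assembler's output before relying on them.

```python
# Hand-assembled encodings for one xmm register-register case.
# Assumption: low registers (xmm0..7), so vshufps qualifies for the 2-byte VEX prefix.
ENCODINGS = {
    # C5 F0 = 2-byte VEX (vvvv=xmm1), C6 = opcode, C1 = ModRM, 00 = imm8
    "vshufps xmm0, xmm1, xmm1, 0": bytes.fromhex("C5 F0 C6 C1 00"),
    # C4 E3 79 = 3-byte VEX (map 0F3A, pp=66), 04 = opcode, C1 = ModRM, 00 = imm8
    "vpermilps xmm0, xmm1, 0": bytes.fromhex("C4 E3 79 04 C1 00"),
}

for asm, code in ENCODINGS.items():
    print(f"{len(code)} bytes  {code.hex(' ')}  {asm}")
```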
I think the main point of vpermilps is that it's available with a vector control operand. Before AVX, the only variable-control shuffle was integer pshufb.
VPERMILPS ymm1, ymm2, ymm3/m256 - Permute single-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.
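To make the vector-control form concrete, here is a minimal scalar model in Python. It is a sketch of the semantics only, not the instruction itself: vectors are plain lists with index 0 as the lowest element, and `vpermilps_var` is a hypothetical helper name. Each output element is picked from within its own 128-bit lane using the low 2 bits of the corresponding control element.

```python
def vpermilps_var(src, ctrl):
    """Scalar model of VPERMILPS with a vector control operand.

    src and ctrl are lists of equal length (4 per 128-bit lane).
    Only the low 2 bits of each control element are used, and the
    selection never crosses a 128-bit lane boundary.
    """
    assert len(src) == len(ctrl) and len(src) % 4 == 0
    out = []
    for i in range(len(src)):
        lane_base = i - i % 4          # first element of this element's lane
        out.append(src[lane_base + (ctrl[i] & 3)])
    return out


src = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # a ymm register, low element first
ctrl = [3, 2, 1, 0, 0, 0, 2, 2]                  # a different shuffle per lane
print(vpermilps_var(src, ctrl))
# -> [3.0, 2.0, 1.0, 0.0, 4.0, 4.0, 6.0, 6.0]
```

Unlike the immediate form, the control here can differ between the two lanes and can be computed at run time.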
But of course the immediate form has a totally separate opcode, and you're asking why it exists. Intel definitely could have included only the vector version, so the question becomes "why did they include the immediate version?" It takes at least a bit of extra decode hardware. The shuffle unit already has hardware to unpack immediate control operands in this form, because it's identical to vshufps, so perhaps it was cheap-ish to implement?
The only thing you can do with immediate vpermilps that you can't with vshufps is load+shuffle in one instruction, like vpermilps ymm0, [rdi], 0b00011011 to reverse the elements in each lane of the source. But like most instructions with an immediate, it can't micro-fuse a memory operand so it's still 2 fused-domain uops for the front end. (On AMD CPUs, it actually does save front-end bandwidth.) Still, it saves code-size vs. vmovups ymm0, [rdi] / vshufps ymm0,ymm0,ymm0, 0b00011011.
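The equivalence claimed above can be checked with a small scalar model. This is only a sketch of the documented semantics in Python (lists with index 0 as the lowest element; `vpermilps_imm` and `vshufps` are hypothetical helper names modelling the instructions), not a timing or encoding test.

```python
def vpermilps_imm(src, imm8):
    """Model of VPERMILPS dst, src, imm8: same 4x 2-bit selectors in every lane."""
    out = []
    for base in range(0, len(src), 4):           # one 128-bit lane at a time
        for j in range(4):
            sel = (imm8 >> (2 * j)) & 3
            out.append(src[base + sel])
    return out


def vshufps(a, b, imm8):
    """Model of VSHUFPS dst, a, b, imm8: low 2 results from a, high 2 from b."""
    out = []
    for base in range(0, len(a), 4):
        for j in range(4):
            sel = (imm8 >> (2 * j)) & 3
            source = a if j < 2 else b
            out.append(source[base + sel])
    return out


ymm = [0, 1, 2, 3, 4, 5, 6, 7]
print(vpermilps_imm(ymm, 0b00011011))   # -> [3, 2, 1, 0, 7, 6, 5, 4]

# With the same register as both sources, every immediate gives the same result.
assert all(vpermilps_imm(ymm, imm) == vshufps(ymm, ymm, imm) for imm in range(256))
```

The two models differ only in where the upper two elements of each lane come from, which is why passing the same source twice makes them identical.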
Other than that, I don't see much point. They both do the same shuffle in both 128-bit lanes, reusing the 4x 2-bit fields of the immediate for both lanes. (While vpermilpd and vshufpd both use 1-bit fields in their immediates, and can do different shuffles in each lane; the upper lane uses bits 2 and 3. And the ZMM versions use bits 4..7 for the upper 256. So again vpermilpd dst, src, imm is identical to vshufpd dst, src,src, imm, unless you use a memory source or you use a shuffle-control vector instead of immediate.)
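The double-precision case can be modelled the same way. The sketch below (plain Python lists, index 0 lowest; `vpermilpd_imm` and `vshufpd` are hypothetical helper names) uses one immediate bit per output element, so bits 2 and 3 control the upper lane of a YMM and bits 4..7 the upper half of a ZMM.

```python
def vpermilpd_imm(src, imm8):
    """Model of VPERMILPD dst, src, imm8: bit i picks low/high double of element i's lane."""
    return [src[(i & ~1) + ((imm8 >> i) & 1)] for i in range(len(src))]


def vshufpd(a, b, imm8):
    """Model of VSHUFPD dst, a, b, imm8: even results from a, odd results from b."""
    out = []
    for i in range(len(a)):
        source = a if i % 2 == 0 else b
        out.append(source[(i & ~1) + ((imm8 >> i) & 1)])
    return out


ymm = [0, 1, 2, 3]
# Bits 0-1 leave the low lane in order; bits 2-3 swap the high lane.
print(vpermilpd_imm(ymm, 0b0110))       # -> [0, 1, 3, 2]

# Same source twice: identical for xmm (2), ymm (4) and zmm (8) element counts.
for n in (2, 4, 8):
    v = list(range(n))
    assert all(vpermilpd_imm(v, imm) == vshufpd(v, v, imm) for imm in range(256))
```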
It makes you wonder if Intel forgot that VEX encoding was going to enable non-destructive vshufps to do the same thing for immediate shuffles.
Or maybe they had in mind their low-power CPUs, like Knight's Landing (Xeon Phi), where a 1-source shuffle is cheaper: vpermilps has 1-cycle throughput there, but vshufps or vperm2f128 has 2-cycle throughput and an extra cycle of latency. (According to Agner Fog's instruction tables.) So using vshufps with the same input twice is slower there.
But on Intel's big-core mainstream CPUs, yes, using vpermilps-immediate is a missed optimization vs. vshufps, unless you can use it with a memory source. vshufps would need the same memory source twice, which obviously isn't encodable.
AVX was designed years ahead of KNL, but maybe the ISA designers had in mind that some future CPU could be more efficient with a simpler shuffle.
Regular Silvermont (the out-of-order Atom that KNL is based on) doesn't support AVX, but it has 1 uop / 1-cycle throughput and latency for shufps. Goldmont has 0.5c throughput for shufps.
AFAIK, through Tremont (the successor to Goldmont Plus), Intel hadn't made a low-power core other than Xeon Phi with AVX. That changed with Gracemont (the E-cores in Alder Lake), which supports AVX2.
[...] vpermilps there are three 'orthogonal' options: 1. packed single versus packed double, 2. immediate shuffle control integer versus variable control vector, and 3. 128 bits xmm vs 256 bits ymm. Choosing different combinations, this leads to 8 different versions of vpermilps. Coincidentally one of them has the same behavior as vshufps. Therefore, it wouldn't be logical if vpermilps xmm0,xmm1,0x0 didn't exist. Nevertheless, one might prefer vshufps xmm0,xmm1,xmm1,0x0, which saves one byte indeed. (I'm not sure if this comment is suitable as an answer.) – Avron

[...] vpermilps with immediate control integers exists at all, because there is also vshufpd. – Avron

[...] shufps will leave upper lanes unmodified, and yes that's not available for vpermilps. But your example is vshufps, which does zero upper lanes, and can copy-and-shuffle. I guess that's potentially useful on CPUs where it won't cause an AVX/SSE transition stall (anything except Haswell/Broadwell), like SSE pinsrd/q for inserting into a YMM. I guess it also saves even more code bytes to use SSE1 shufps, if you avoid a REX prefix from xmm8..15. Just a 2-byte opcode + modrm + imm8 = 4 bytes total, vs. 5 with a 2-byte VEX + opcode. – Coomer