What's the point of the VPERMILPS instruction (_mm_permute_ps)?

The AVX instruction set introduced VPERMILPS, which seems to be a simplified version of SHUFPS (for the case where both input registers are the same).

For example, the following instruction:

c5 f0 c6 c1 00          vshufps xmm0,xmm1,xmm1,0x0

can be replaced with:

c4 e3 79 04 c1 00       vpermilps xmm0,xmm1,0x0

As you can see, the VPERMILPS version is one byte longer and does the same thing. According to the instruction tables, both instructions have 1-cycle latency and the same throughput.

What's the point of introducing this kind of instruction? Am I missing something?

Seditious answered 13/1, 2019 at 12:12 Comment(3)
With vpermilps there are three 'orthogonal' options: 1. packed single versus packed double, 2. immediate shuffle control integer versus variable control vector, and 3. 128-bit xmm vs. 256-bit ymm. Choosing different combinations leads to 8 different versions of vpermilps/vpermilpd. Coincidentally, one of them has the same behavior as vshufps. Therefore, it wouldn't be logical if vpermilps xmm0,xmm1,0x0 didn't exist. Nevertheless, one might prefer vshufps xmm0,xmm1,xmm1,0x0, which does indeed save one byte. (I'm not sure if this comment is suitable as an answer.) - Avron
The question remains, however, why vpermilpd with an immediate control integer exists at all, because there is also vshufpd. - Avron
re: your edit: Only the SSE encoding of shufps will leave upper lanes unmodified, and yes, that's not available for vpermilps. But your example is vshufps, which does zero upper lanes, and can copy-and-shuffle. I guess the SSE encoding is potentially useful on CPUs where it won't cause an AVX/SSE transition stall (anything except Haswell/Broadwell), like SSE pinsrd/q for inserting into a YMM. I guess it also saves even more code bytes to use SSE1 shufps, if you avoid a REX prefix from xmm8..15. Just a 2-byte opcode + modrm + imm8 = 4 bytes total, vs. 5 with a 2-byte VEX + opcode. - Coomer

Yes, using vpermilps-immediate is normally a missed optimization vs. vshufps (except on Knight's Landing), wasting 1 byte of code size for the same operation with the same performance.
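
To make "the same operation" concrete, here is a minimal scalar sketch in C of how the two 128-bit immediate forms decode their imm8. It is only a model of the documented semantics, not real SIMD code, and the function names (model_vshufps, model_vpermilps_imm, shuffles_agree_for_all_immediates) are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit VSHUFPS dst, src1, src2, imm8:
   the low two result elements are picked from src1 and the high two
   from src2, each by one 2-bit field of the immediate. */
static void model_vshufps(float dst[4], const float src1[4],
                          const float src2[4], uint8_t imm8)
{
    float tmp[4];
    tmp[0] = src1[imm8 & 3];
    tmp[1] = src1[(imm8 >> 2) & 3];
    tmp[2] = src2[(imm8 >> 4) & 3];
    tmp[3] = src2[(imm8 >> 6) & 3];
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of 128-bit VPERMILPS dst, src, imm8:
   all four result elements are picked from the single source. */
static void model_vpermilps_imm(float dst[4], const float src[4],
                                uint8_t imm8)
{
    float tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    memcpy(dst, tmp, sizeof tmp);
}

/* Returns 1 if the two models agree for every possible immediate when
   vshufps is given the same register as both sources, else 0. */
int shuffles_agree_for_all_immediates(void)
{
    const float src[4] = {10.0f, 11.0f, 12.0f, 13.0f};
    for (int imm = 0; imm < 256; imm++) {
        float a[4], b[4];
        model_vshufps(a, src, src, (uint8_t)imm);
        model_vpermilps_imm(b, src, (uint8_t)imm);
        if (memcmp(a, b, sizeof a) != 0)
            return 0;
    }
    return 1;
}
```

With src1 == src2, the only difference between the two models disappears, which is why the instructions are interchangeable for register sources.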


I think the main point of vpermilps is that it's available with a vector control operand. Before AVX, the only variable-control shuffle was integer pshufb.

VPERMILPS ymm1, ymm2, ymm3/m256 - Permute single-precision floating-point values in ymm2 using controls from ymm3/m256 and store result in ymm1.
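
A rough scalar sketch of that vector-control form, assuming the documented behaviour that only the low 2 bits of each 32-bit control element are used and that selection never crosses a 128-bit lane. The names model_vpermilps_var and demo_per_lane_controls are hypothetical; this is a model in plain C, not intrinsics.

```c
#include <stdint.h>

/* Scalar model of 256-bit VPERMILPS with a vector control operand:
   result element i is picked from the same 128-bit lane as i, using
   the low 2 bits of control element i. */
void model_vpermilps_var(float dst[8], const float src[8],
                         const uint32_t ctrl[8])
{
    float tmp[8];
    for (int i = 0; i < 8; i++) {
        int lane_base = i & ~3;   /* 0 for elements 0..3, 4 for 4..7 */
        tmp[i] = src[lane_base + (int)(ctrl[i] & 3u)];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* Returns 1 if the model gives each lane its own, different shuffle. */
int demo_per_lane_controls(void)
{
    const float src[8]     = {0, 1, 2, 3,  4, 5, 6, 7};
    const uint32_t ctrl[8] = {3, 3, 0, 0,  1, 0, 1, 0};
    const float expect[8]  = {3, 3, 0, 0,  5, 4, 5, 4};
    float dst[8];

    model_vpermilps_var(dst, src, ctrl);
    for (int i = 0; i < 8; i++)
        if (dst[i] != expect[i])
            return 0;
    return 1;
}
```

Unlike the immediate form, the control here is per element, so the two lanes can be shuffled differently and the shuffle can be chosen at run time.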


But of course the immediate form has a totally separate opcode, and you're asking why it exists. Intel definitely could have included only the vector version, so the question becomes "why did they include the immediate version?" It takes at least a bit of extra decode hardware. The shuffle unit already has hardware to unpack immediate control operands in this form, because it's identical to vshufps, so perhaps it was cheap-ish to implement?

The only thing you can do with immediate vpermilps that you can't with vshufps is load+shuffle in one instruction, like vpermilps ymm0, [rdi], 0b00011011 to reverse the elements in each lane of the source. But like most instructions with an immediate, it can't micro-fuse a memory operand so it's still 2 fused-domain uops for the front end. (On AMD CPUs, it actually does save front-end bandwidth.) Still, it saves code-size vs. vmovups ymm0, [rdi] / vshufps ymm0,ymm0,ymm0, 0b00011011.

Other than that, I don't see much point. They both do the same shuffle in both 128-bit lanes, reusing the 4x 2-bit fields of the immediate for both lanes. (While vpermilpd and vshufpd both use 1-bit fields in their immediates, and can do different shuffles in each lane; the upper lane uses bits 2 and 3. And the ZMM versions use bits 4..7 for the upper 256. So again vpermilpd dst, src, imm is identical to vshufpd dst, src,src, imm, unless you use a memory source or you use a shuffle-control vector instead of immediate.)
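
The 1-bit-field decoding for the 256-bit pd versions can be sketched the same way. This is a scalar model in C under the semantics described above (bits 0 and 1 for the low lane, bits 2 and 3 for the high lane); the function names are invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 256-bit VSHUFPD dst, src1, src2, imm8: even result
   elements come from src1, odd ones from src2, one immediate bit each. */
static void model_vshufpd256(double dst[4], const double src1[4],
                             const double src2[4], uint8_t imm8)
{
    double tmp[4];
    tmp[0] = src1[0 + ((imm8 >> 0) & 1)];
    tmp[1] = src2[0 + ((imm8 >> 1) & 1)];
    tmp[2] = src1[2 + ((imm8 >> 2) & 1)];
    tmp[3] = src2[2 + ((imm8 >> 3) & 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of 256-bit VPERMILPD dst, src, imm8: bit i of the
   immediate picks the low or high element of result i's own lane. */
static void model_vpermilpd256_imm(double dst[4], const double src[4],
                                   uint8_t imm8)
{
    double tmp[4];
    for (int i = 0; i < 4; i++) {
        int lane_base = i & ~1;   /* 0 for elements 0..1, 2 for 2..3 */
        tmp[i] = src[lane_base + ((imm8 >> i) & 1)];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Returns 1 if both models agree for all 16 values of imm8[3:0] when
   vshufpd is given the same register twice. */
int pd_shuffles_agree(void)
{
    const double src[4] = {20.0, 21.0, 22.0, 23.0};
    for (int imm = 0; imm < 16; imm++) {
        double a[4], b[4];
        model_vshufpd256(a, src, src, (uint8_t)imm);
        model_vpermilpd256_imm(b, src, (uint8_t)imm);
        if (memcmp(a, b, sizeof a) != 0)
            return 0;
    }
    return 1;
}

/* Returns 1 if imm8 = 0x6 keeps the low lane in order while swapping
   the two elements of the high lane. */
int pd_lanes_can_differ(void)
{
    const double src[4]    = {20.0, 21.0, 22.0, 23.0};
    const double expect[4] = {20.0, 21.0, 23.0, 22.0};
    double dst[4];
    model_vpermilpd256_imm(dst, src, 0x6);
    return memcmp(dst, expect, sizeof dst) == 0;
}
```

The second check shows the per-lane independence that the ps immediate forms lack: one immediate leaves the low lane alone and swaps the high lane.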

It makes you wonder if Intel forgot that VEX encoding was going to enable non-destructive vshufps to do the same thing for immediate shuffles.


Or maybe they had in mind their low-power CPUs, like Knight's Landing (Xeon Phi), where a 1-source shuffle is cheaper:

vpermilps has 1-cycle throughput there, but vshufps or vperm2f128 has 2-cycle throughput and an extra cycle of latency. (According to Agner Fog's instruction tables.)

So using vshufps with the same input twice is slower there.

But on Intel's big-core mainstream CPUs, yes, using vpermilps-immediate is a missed optimization vs. vshufps, unless you can use it with a memory source. vshufps would need the same memory source twice, which obviously isn't encodable.

AVX was designed years ahead of KNL, but perhaps the ISA designers anticipated that some future CPU could be more efficient with a simpler shuffle.

Regular Silvermont (out-of-order Atom that KNL is based on) doesn't support AVX, but it has 1 uop / 1-cycle throughput and latency for shufps. Goldmont has 0.5c throughput for shufps.

AFAIK, as of early 2019, Intel hadn't shipped a low-power core (other than Xeon Phi) with AVX. Tremont, the successor to Goldmont Plus, doesn't have it either; Gracemont (the Alder Lake E-core) later added AVX and AVX2.

Coomer answered 13/1, 2019 at 14:58 Comment(0)
