Slow vpermpd instruction being generated; why?
I have a filter m_f which acts on an input vector v via

Real d2v = m_f[0]*v[i];
for (size_t j = 1; j < m_f.size(); ++j)
{
   d2v += m_f[j] * (v[i + j] + v[i - j]);
}

perf tells us where this loop is hot:

[perf annotate screenshot of the hot inner loop: cycles concentrated on vaddpd, vfma231pd, and vpermpd]

The vaddpd and vfma231pd make sense; surely we can't perform this operation without them. But slow vpermpd is baffling to me. What is it accomplishing?

Vicinal answered 27/1, 2019 at 3:25 Comment(0)

vpermpd should only be slowing you down here if your bottleneck is front-end throughput (feeding uops into the out-of-order core).

vpermpd isn't particularly "slow" unless you're on an AMD CPU. (Lane-crossing YMM shuffles are slow-ish on AMD's CPUs, because they have to decode into more than the normal 2 128-bit uops that 256-bit instructions are split into. vpermpd is 3 uops on Ryzen, or 4 with a memory source.)

On Intel, vpermpd with a memory source is always 2 uops for the front-end (even a non-indexed addressing mode can't micro-fuse).

If your loop only runs for a tiny number of iterations, then OoO exec may be able to hide the FMA latency and maybe actually bottleneck on the front end for this loop + surrounding code. That's possible, given how many counts the (inefficient) horizontal-sum code outside the loop is getting.

In that case, unrolling by 2 might help, but the extra overhead of checking whether even one iteration of the main loop can run could get costly for very small counts.


Otherwise (for large counts) your bottleneck is probably on the 4 to 5 cycle loop-carried dependency of doing an FMA with d2v as an input/output operand. Unrolling with multiple accumulators, and pointer-increments instead of indexing, would be a huge performance win. Like 2x or 3x.
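For illustration, a hand-unrolled version of the source loop with two accumulators might look like this. Just a sketch using the names from the question; it assumes Real is a floating-point type and leaves the odd-element tail to a scalar cleanup loop:

Real d2v_a = m_f[0] * v[i];   // two independent accumulators so the
Real d2v_b = 0;               // FMA latency chains can overlap
size_t j = 1;
for (; j + 1 < m_f.size(); j += 2)
{
   d2v_a += m_f[j]     * (v[i + j]     + v[i - j]);
   d2v_b += m_f[j + 1] * (v[i + j + 1] + v[i - j - 1]);
}
for (; j < m_f.size(); ++j)   // cleanup for an odd filter length
   d2v_a += m_f[j] * (v[i + j] + v[i - j]);
Real d2v = d2v_a + d2v_b;

Whether this buys anything depends on whether the compiler was already splitting the reduction into multiple vector accumulators under -ffast-math; checking the asm is the only way to know.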

Try clang, it will usually do that for you, and its skylake/haswell tunings unroll pretty aggressively. (e.g. clang -O3 -march=native -ffast-math)

GCC with -funroll-loops doesn't actually use multiple accumulators, IIRC. I haven't looked in a while, I might be wrong, but I think it will just repeat the loop body using the same accumulator register, not helping at all to run more dep chains in parallel. Clang will actually use 2 or 4 different vector registers to hold partial sums for d2v, and add them at the end outside the loop. (But for large sizes, 8 or more would be better. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?)

Unrolling would also make it worthwhile to use pointer increments, saving 1 uop in each of the vaddpd and vfmadd instructions on Intel SnB-family.
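At the source level that roughly corresponds to walking two pointers instead of indexing. A non-unrolled sketch just to show the pointer form, same assumptions as above:

const Real* fwd = &v[i + 1];   // walks forward through memory
const Real* bwd = &v[i - 1];   // walks backward through memory
Real d2v = m_f[0] * v[i];
for (size_t j = 1; j < m_f.size(); ++j, ++fwd, --bwd)
{
   d2v += m_f[j] * (*fwd + *bwd);
}

No guarantee gcc actually turns that into non-indexed addressing modes, but it takes the index arithmetic out of the source at least.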


Why is m_f.size() being kept in memory (cmp rax, [rsp+0x50]) instead of a register? Are you compiling with strict-aliasing disabled? The loop doesn't write memory, so that's just strange. Unless the compiler thinks the loop will run very few iterations, so it isn't worth emitting code outside the loop to load the bound into a register?
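If aliasing is what's going on, hoisting the size into a local before the loop is the usual workaround to convince the compiler it can keep the bound in a register. A sketch, using the same names as the question:

const size_t n = m_f.size();   // read the size once; nothing in the loop can change it
Real d2v = m_f[0] * v[i];
for (size_t j = 1; j < n; ++j)
{
   d2v += m_f[j] * (v[i + j] + v[i - j]);
}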

Copying and negating j every iteration looks like a missed optimization. It would obviously be more efficient to set up 2 registers outside the loop and do add rax, 0x20 / sub rbx, 0x20 every loop iteration instead of MOV+NEG.

If you have a [mcve] of this, it looks like several missed optimizations that could be reported as compiler bugs. This asm looks like gcc output to me.

It's disappointing that gcc uses such a terrible horizontal-sum idiom. VHADDPD is 3 uops, 2 of which need the shuffle port. Maybe try a newer version of GCC, like 8.2. Although I'm not sure whether avoiding VHADDPS/PD was part of closing GCC bug 80846 as fixed. That link is to my comment on the bug analyzing GCC's packed-single hsum code, which uses vhaddps twice.

It looks like your hsum following the loop is actually "hot", so you're suffering from gcc's compact but inefficient hsum.
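For reference, the usual alternative is to narrow to 128 bits first and finish with one shuffle and one add. A sketch in AVX intrinsics (not the code gcc emitted here, just the narrow-to-128-first idiom):

#include <immintrin.h>

// Horizontal sum of a __m256d: extract the high half, add the halves,
// then one in-lane shuffle + scalar add. No vhaddpd needed.
static inline double hsum256_pd(__m256d x)
{
   __m128d lo   = _mm256_castpd256_pd128(x);     // low 128 bits, no instruction
   __m128d hi   = _mm256_extractf128_pd(x, 1);   // vextractf128
   __m128d sum  = _mm_add_pd(lo, hi);            // {a+c, b+d}
   __m128d high = _mm_unpackhi_pd(sum, sum);     // {b+d, b+d}
   return _mm_cvtsd_f64(_mm_add_sd(sum, high));  // a+b+c+d
}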

Eigenfunction answered 27/1, 2019 at 5:49 Comment(9)
If you left out -ffast-math, clang might not even contract the multiply + add into an FMA. See -ffp-contract=fast, or use the pragma. But you need associative math (enabled as part of fast-math) to auto-vectorize the reduction.Eigenfunction
I compiled it on gcc-8 with -ffast-math -O3 -march=native; not sure of the minor version. Didn't give any flags for aliasing.Vicinal
For the test which produced the perf graphic, m_f.size() was 18.Vicinal
I couldn't get the godbolt link any more pared down than this, unfortunately.Vicinal
@user14717: your Godbolt link uses long double in several places, so a bunch of the asm is using scalar x87 for 80-bit long double precision. There are two vpermpd instructions in the output, but none in a loop with vaddpd+vfma*pd, just FMA or FMA+sub. What CPU are you on? -march=native isn't reproducible if you don't say what CPU you're compiling for. And are you maybe on Windows, where long double is the same as double (64-bit)?Eigenfunction
m_f.size() == 18 would mean the vectorized loop only runs 4 iterations, then there's cleanup. Cleanup overhead is going to matter a lot. Unrolling by 2 might help to reduce front-end uops, but OoO exec is probably hiding the dependency chain pretty well. (GCC's cleanup probably does scalar for the last 2 elements, instead of using 128-bit vectors until there's at most one element left. For even-numbered sizes, using 128-bit vectors as cleanup would be really good. Especially if the main loop were unrolled any, so more cleanup was needed.)Eigenfunction
@user14717: You also forgot to enable optimization in your Godbolt link. With -march=skylake, a loop like the one in your question but with - instead of + gets auto-vectorized basically identically at .L327 in the asm output (godbolt.org/z/WO3NrV). (from operator()). I found it with a search for cmp.*\[rsp\+, to find where the compiler was keeping a loop bound in stack memory, and used "scroll to source" in the right-click menu. Still not sure why the compiler doesn't load the loop bound into a register before looping; maybe it expects a very short loop?Eigenfunction
I tried with -funroll-loops, but gcc still uses mov+neg inside every unrolled iteration, instead of using -32 / -64 / etc offsets in an addressing mode. We probably can't avoid that missed optimization with gcc options; it has to be reported + fixed, and until then use a different compiler. clang7.0 does a much better job with unrolling (far fewer instructions, using offsets in addr modes, and multiple accumulators). godbolt.org/z/3ekv7E. And its hsum is more AMD-friendly, narrowing to 128-bit right away instead of using 256-bit haddpd.Eigenfunction
Your discussion is super helpful. Thanks so much!Vicinal

That is the v[i - j] term. Since that memory access moves backwards through memory as j increases, the shuffle is necessary to reverse the order of the 4 values that are read from memory.
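In intrinsics terms, the compiler does a normal 4-double load ending at v[i - j] and then reverses the element order so it lines up with the forward v[i + j] load. Something like this (illustrative only, not the exact code gcc generated):

#include <immintrin.h>

// p points at v + i - j - 3, i.e. the lowest address of the 4 backward elements.
__m256d load_reversed(const double* p)
{
   __m256d x = _mm256_loadu_pd(p);  // { p[0], p[1], p[2], p[3] }
   // vpermpd with imm8 = 0x1B reverses the 4 lanes: { p[3], p[2], p[1], p[0] }
   return _mm256_permute4x64_pd(x, _MM_SHUFFLE(0, 1, 2, 3));
}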

Grady answered 27/1, 2019 at 3:41 Comment(0)
