vpermpd
should only be slowing you down here if your bottleneck is front-end throughput (feeding uops into the out-of-order core).
vpermpd
isn't particularly "slow" unless you're on an AMD CPU. (Lane-crossing YMM shuffles are slow-ish on AMD's CPUs, because they have to decode into more than the normal 2 128-bit uops that 256-bit instructions are split into. vpermpd
is 3 uops on Ryzen, or 4 with a memory source.)
On Intel, vpermpd
with a memory source is always 2 uops for the front-end (even a non-indexed addressing mode can't micro-fuse). Bu
If your loop only runs for a tiny number of iterations, then OoO exec may be able to hide the FMA latency and maybe actually bottleneck on the front end for this loop + surrounding code. That's possible, given how many counts the (inefficient) horizontal-sum code outside the loop is getting.
In that case, maybe unrolling by 2 would help, but maybe the extra overhead to check if you can run even one iteration of the main loop could get costly for very small counts.
Otherwise (for large counts) your bottleneck is probably on the 4 to 5 cycle loop-carried dependency of doing an FMA with d2v
as an input/output operand. Unrolling with multiple accumulators, and pointer-increments instead of indexing, would be a huge performance win. Like 2x or 3x.
Try clang, it will usually do that for you, and its skylake/haswell tunings unroll pretty aggressively. (e.g. clang -O3 -march=native -ffast-math
)
GCC with -funroll-loops
doesn't actually use multiple accumulators, IIRC. I haven't looked in a while, I might be wrong, but I think it will just repeat the loop body using the same accumulator register, not helping at all to run more dep chains in parallel. Clang will actually use 2 or 4 different vector registers to hold partial sums for d2v
, and add them at the end outside the loop. (But for large sizes, 8 or more would be better. Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?)
Unrolling would also make it worthwhile to use pointer increments, saving 1 uop in each of the vaddpd
and vfmadd
instructions on Intel SnB-family.
Why is m_f.size();
being kept in memory (cmp rax, [rsp+0x50]
) instead of a register? Are you compiling with strict-aliasing disabled? The loop doesn't write memory, so that's just strange. Unless the compiler thinks the loop will run very few iterations, so not worth code outside the loop to load a max?
Copying and negating j
every iteration looks like a missed-optimization. Obviously more efficient to start with 2 registers outside the loop, and add rax,0x20
/ sub rbx, 0x20
every loop iteration instead of MOV+NEG.
If you have a [mcve] of this, it looks like several missed optimizations that could be reported as compiler bugs. This asm looks like gcc output to me.
It's disappointing that gcc uses such a terrible horizontal-sum idiom. VHADDPD is 3 uops, 2 of which need the shuffle port. Maybe try a newer version of GCC, like 8.2. Although I'm not sure if avoiding VHADDPS/PD was part of closing GCC bug 80846 as fixed. That link is to my comment on the bug analyzing GCC's hsum code using packed-single, using vhaddps
twice.
It looks like your hsum following the loop is actually "hot", so you're suffering from gcc's compact but inefficient hsum.
-ffast-math
, clang might not even contract the multiply + add into an FMA. See-ffp-contract=fast
, or use the pragma. But you need associative math (enabled as part of fast-math) to auto-vectorize the reduction. – Eigenfunction