The AVX encoding of vpblendvb has 4 operands (3 sources and a separate destination), and is multi-uop even on Intel P-cores (unlike the legacy-SSE 128-bit encoding), but is single-uop on Zen. A different algorithm can avoid it.
Alder Lake E-cores (Gracemont) are 5-wide out-of-order with reasonable out-of-order exec capability, but they're not great at 256-bit SIMD in general, and choke badly on 8-uop vpblendvb ymm in particular, including what looks like a front-end bottleneck. But your inner loop uses it every 4th instruction in a dependency chain (short enough for OoO exec to maybe partly hide, so we might just be seeing the effects of the back-end-throughput or front-end bottleneck).
Your implementation strategy / algorithm is something Zen 2 is great at but which is a stumbling block for Gracemont, amplifying the difference between 256-bit vs. 128-bit SIMD execution units.
Your i3-N305 is Alder Lake-N series. Like earlier Celeron / Pentium CPUs with N in their model number, the cores are low-power Silvermont-family. In this case Gracemont, the E-cores found in full Alder Lake chips. (Which are significantly beefier than Tremont or especially earlier generations like Goldmont Plus.) And it has AVX2+FMA which I guess is what justifies selling it as an i3.
https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/ is a good deep-dive on the CPU microarchitecture, with some comparisons to Zen 2, and microbenchmarks of cache bandwidth and latency (as part of an i9-12900k, IDK if the interconnect or L3 would be different in an i3-N series, but your benchmark fits in its 2M L2 cache; with a single core active, read bandwidth from L2 is about the same as L1d for sequential access.) No mention about how the decoders handle instructions that are more than 3 uops, but it does have a diagram showing the pair of 3-wide decode clusters. (If it's like previous Intel, any instruction more than 1 uop can only decode in the first decoder of a cluster, so that probably limits front-end throughput to two YMM vector instructions per clock even if they're the minimum 2 uops.)
Your Ryzen 3800X is a Zen 2, a full-fledged big core with good 256-bit SIMD load and ALU throughput (up from 128-bit in Zen 1, Ryzen 1xxx and 2xxx series). And single-uop vpblendvb.
The most important factors are:
Vector ALU and memory ports are 128-bit wide, and every 256-bit instruction decodes to (at least) 2 uops, except a few like vextracti128 and vpmovmskb. (So it's like Zen 1 and Bulldozer-family.) So uops per clock is about twice the IPC, when running code that's mostly vector instructions with a bit of scalar overhead. 2/clock load bandwidth only goes half as far when each load is only 128-bit.
That select compiles to a vpblendvb. Unfortunately that's very slow on Gracemont, see https://uops.info/ - VEX encodings of variable blends are 4 uops per 128-bit lane, so the YMM version is 8 uops with a measured throughput of one per 3.86 cycles. (Or 3.2 cycles for a memory source instead of a register, surprisingly.) Zen family runs the 4-operand vpblendvb as a single uop (with a choice of ports, even).
The legacy-SSE encoding only has 3 operands, one of them implicitly XMM0, and Gracemont runs that as a single uop. Even Alder Lake P-cores run vpblendvb x/ymm as 3 uops, up from 2 in Ice Lake, while SSE4.1 pblendvb xmm, xmm is single-uop on modern Intel P-cores, too.
Gracemont vpblendvb ymm also has 6 to 7 cycle latency, or 5c for the XMM version (vs. 2 to 3 on P-cores), depending on whether the data or control inputs are the critical path, vs. 1 cycle on Zen. That latency is even worse than its throughput, even with the front-end bottleneck. Out-of-order exec buffers (scheduler and ROB) are probably big enough to hide this over a chain of 7 of them, since you start a new dep chain every 256 bytes, but it's not great and would be a bottleneck in a loop that ran more iterations.
It seems Intel goofed when designing the AVX1 encoding of it (with a 4th register number in an immediate byte!) while Sandybridge-family was still being designed, not anticipating that their later CPUs would be able to handle 3-operand instructions as a single uop (motivated by FMA in Haswell, but benefiting others in Broadwell and later), nor that mov-elimination would remove the back-end execution-port cost of copying a register if needed (unlike here) when the original value is still wanted after an instruction that modifies an R+W destination in-place. FMA3 and later 3-input instructions like AVX-512 vpternlogd and vpermi/t2d have an R+W source/destination as the first operand. (k mask inputs to AVX-512 instructions are a separate forwarding network and a separate domain to track dependencies in, so they don't count.)
An 8-uop instruction inherently contributes to low IPC for the same uops/clock throughput, but it probably also stalls the front-end some, reducing uops/clock. Even Gracemont's 4-uop vpblendvb xmm has about the same bad throughput when running just that back-to-back, which is consistent with some kind of decode stall, or having to switch to a microcode ROM for instructions of more than 3 uops.
You could try to blend manually with _mm256_and_si256 / andnot / or, which would be 6 uops but avoid front-end stalls, for a total throughput cost of 1.33 cycles on the vector ALU ports. But clang will "optimize" those intrinsics to a vpblendvb since it knows the blend-control is a compare result, with all bits matching the sign bit.
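For reference, a manual blend might look something like this (blend_manual is just an illustrative name; it assumes the mask is a byte-compare result, so each byte is all-ones or all-zero):

static inline __m256i blend_manual(__m256i a, __m256i b, __m256i mask) {
    __m256i take_b = _mm256_and_si256(mask, b);      // bytes where mask is set
    __m256i take_a = _mm256_andnot_si256(mask, a);   // (~mask) & a: bytes where mask is clear
    return _mm256_or_si256(take_a, take_b);          // vpand + vpandn + vpor
}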
Clang trunk's -mtune=gracemont or -march=gracemont doesn't know it's slow on that uarch, at least not by splitting select into those. MSVC, or classic ICC, are a lot more literal about intrinsics. GCC does optimize some, but in this case it does use actual vpand/vpandn/vpor instructions (https://godbolt.org/z/3fc1jo9r4), so you could make a version that's worse on Ryzen, less bad on Gracemont, but not optimal anywhere. I think it's still worse on Gracemont than the noselect version below.
Your original is fairly good for Ryzen, but there's room for improvement in the cleanup, and in maybe scanning backwards to avoid inverting the compare to feed the blend. Or the branchy strategy might be best if an instance of the max element is often seen within the first 64 bytes so it's predictable. Just load + 7x vpmaxub ymm, mem, then reduce and scan.
Avoiding variable-blend
Your actual problem could be done other ways, for example unpacking your data with indices as chtz suggested in Looking for an efficient function to find an index of max element in SIMD vector using a library, so the max u16 element contains the data and the index. (And instead of loading, the index vectors can come from idx = _mm256_add_epi8(idx, _mm256_set1_epi8(32));. Or maybe that inner loop over 256 bytes can get fully unrolled so you have 8 registers holding index data.)
Since you'd probably want to use that improved reduction anyway, unpacking even earlier saves some cleanup work, and your loop is only 8 vectors.
For a sum of indices, I guess it's important that you get the first occurrence of a match? So you'd want to invert your indices so the max of data:index packed as a u16 picks the earlier index as the tie-break for equal data. That's what we want anyway for a cleanup that's going to use vphminposuw.
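A minimal sketch of what that inversion could look like, before the un-inverted version below (inv is just an illustrative buffer, assuming the idx vectors normally hold byte positions 0..255 within the 256-byte row):

alignas(32) uint8_t inv[8][32];
for (int j = 0; j < 8; j++)
    for (int i = 0; i < 32; i++)
        inv[j][i] = (uint8_t)(255 - (j*32 + i));   // inverted byte position
// On equal data, max of (data<<8 | inverted_index) picks the larger inverted index,
// i.e. the earlier position; recover the real position at the end as 255 - payload.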
This is what it might look like, without being clever about indices so it might be taking the last one.
int loop_vc_nested_noselect(const std::array<uint8_t, H*W> &img, const std::array<Vec32uc, 8> &idx) {
    int sum = 0;
    for (int i=0; i<H*W; i+=W) {
        __m256i tmpidx = _mm256_loadu_si256((__m256i*)&idx[0]);
        __m256i tmp = _mm256_loadu_si256((__m256i*)&img[i]);
        Vec16us vMaxlo = _mm256_unpacklo_epi8(tmpidx, tmp);
        Vec16us vMaxhi = _mm256_unpackhi_epi8(tmpidx, tmp);
        for (int j=1; j<8; j++) {
            Vec32uc vCurr, iCurr;
            iCurr.load(&idx[j]);   // these get hoisted out of the outer loop and reused across img iters
            vCurr.load(&img[i+j*32]);
            Vec16us lo = _mm256_unpacklo_epi8(iCurr, vCurr);
            Vec16us hi = _mm256_unpackhi_epi8(iCurr, vCurr);
            vMaxlo = max(vMaxlo, lo);
            vMaxhi = max(vMaxhi, hi);
            // vMax = max(vMax, max(lo,hi));  // GCC was optimizing to two dep chains anyway, and that's better on big-cores that can do more than 1 load+shuffle+max per clock
        }
        Vec16us vMax = max(vMaxlo, vMaxhi);
        // silly GCC uses vpextrw even though we're already truncating narrower
        auto maxidx = (uint8_t)horizontal_max(vMax);  // retrieve the payload from the bottom of the max
        // TODO: use phminposuw like the last part of maxpos_u8_noscan_unpack
        // with indices loaded and inverted once, outside the outer loop. (Manually unrolled if compilers don't do that for you)
        sum += maxidx;
    }
    return sum;
}
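One possible shape for that phminposuw cleanup (just a sketch, not the maxpos_u8_noscan_unpack mentioned in the TODO): since vphminposuw finds the minimum u16, invert the packed values so the minimum of the inverted vector corresponds to our maximum, then undo the inversion and keep the low byte.

// Hypothetical cleanup for the function above, using SSE4.1 _mm_minpos_epu16 (vphminposuw).
// vMax holds 16 packed u16s: data byte in the high half, index byte in the low half.
static inline unsigned max_payload_phminposuw(__m256i vMax) {
    __m128i m = _mm_max_epu16(_mm256_castsi256_si128(vMax),
                              _mm256_extracti128_si256(vMax, 1));      // reduce 256 -> 128
    __m128i inv = _mm_xor_si128(m, _mm_set1_epi16(-1));                // ~x, so min(~x) <=> max(x)
    __m128i r = _mm_minpos_epu16(inv);                                 // element 0 = the min value
    unsigned packed_max = 0xFFFFu ^ (unsigned)_mm_extract_epi16(r, 0); // undo the NOT
    return packed_max & 0xFF;                                          // low byte = index payload
}

This has the same take-the-last tie-break behaviour as the horizontal_max version above; the index inversion discussed earlier would fix that.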
Instead of loading indices, you could maybe just compute them with _mm256_sub_epi8(idx, _mm256_set1_epi8(-1)) (or add to go in descending order down from 255), although compilers will probably constant-propagate through that and make 8 vectors of constants, and the RIP-relative addressing mode to load that is larger code-size than [rsi+disp8] for the first 5 loads, but that's just the startup code. After the compiler's done unrolling, you definitely want it to have 8 vectors of indices that it generates once ahead of the loop.
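For example, something like this could materialize the 8 index vectors once ahead of the outer loop, using the +32 step mentioned earlier (idxv and init are illustrative names; compilers may well constant-propagate this into 8 vector constants anyway):

alignas(32) uint8_t init[32];
for (int i = 0; i < 32; i++) init[i] = (uint8_t)i;      // 0..31
__m256i idx = _mm256_load_si256((const __m256i*)init);
__m256i idxv[8];
for (int j = 0; j < 8; j++) {
    idxv[j] = idx;
    idx = _mm256_add_epi8(idx, _mm256_set1_epi8(32));   // next 32 byte positions within the row
}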
Godbolt. GCC -O3 -march=alderlake fully unrolls, loading all 8 index vectors before the outer loop and using them from registers. (Same in the original version.)
The inner loop looks like this; notice that it uses the same memory source operand twice to save front-end bandwidth at the cost of more back-end uops. This is actually ok on Gracemont as well as Alder Lake; vpunpckl/hbw is 2 front-end uops with or without a memory source operand (with 1.0 vs. 0.66 cycle throughput), but with separate loads I think the front-end would be a worse bottleneck, depending how fast it can decode 2-uop instructions. And the vpmaxuw per unpack is extra vector ALU work to keep ports busy so it doesn't bottleneck on loads.
Clang -mtune=gracemont chooses differently, but it doesn't load twice even tuning for Alder Lake / Ice Lake.
.L7:
vpunpcklbw ymm11, ymm7, YMMWORD PTR [rax+32]
vpunpckhbw ymm10, ymm7, YMMWORD PTR [rax+32]
add rax, 256
vpunpcklbw ymm0, ymm8, YMMWORD PTR [rax-256]
vpunpckhbw ymm9, ymm8, YMMWORD PTR [rax-256]
vpmaxuw ymm0, ymm0, ymm11
vpmaxuw ymm9, ymm9, ymm10
vpunpcklbw ymm11, ymm6, YMMWORD PTR [rax-192]
vpunpckhbw ymm10, ymm6, YMMWORD PTR [rax-192]
vpmaxuw ymm0, ymm0, ymm11
vpunpcklbw ymm11, ymm5, YMMWORD PTR [rax-160]
vpmaxuw ymm9, ymm9, ymm10
vpunpckhbw ymm10, ymm5, YMMWORD PTR [rax-160]
vpmaxuw ymm0, ymm0, ymm11
...
https://uica.uops.info/ predicts Ice Lake could run it at 14 cycles per iteration, vs. 17 for the vpblendvb version. And that's nearly bottlenecked on vector ALU ports, so Alder Lake would be even worse with the vpblendvb version.
I haven't analyzed by hand for Gracemont, or tried LLVM-MCA which might have a Gracemont model.
I also haven't looked at optimizing it to use vphminposuw as part of the cleanup, which would save even more, helping pay for the extra shuffle work we're doing per vector.
Or consider a branchy strategy, like finding the max and then searching the array for the first match (compare/movemask aka to_bits(curr == bcast_max), and if non-zero, return tzcnt(mask)). You never need to load vectors of index data, and an early match reduces the amount of work. (But it can mispredict, which might be much worse; still worth a try. But usefully microbenchmarking things that depend on correct branch prediction is hard - a microbenchmark can learn a pattern. Or if you make it totally random, it predicts worse than real data distributions.)
With only 8 vectors of data, that second-pass loop can be fully unrolled with no loads. The first pass can leave the data in registers. (But it would have to be fully unrolled, too, perhaps checking a pair of ymm regs at a time for a match, with shift/or and a 64-bit tzcnt. vpmovmskb r32, ymm is single-uop on Gracemont.) And it would mean separate load + max instructions in the first pass, not memory-source. Gracemont doesn't have a uop-cache but apparently its decoders manage ok for throughput. Perhaps not wonderfully with back-to-back 2-uop instructions.
(This is basically the same strategy your current cleanup is using, find the max then search for its position, but across the whole 8-vector array. Allowing reduction to 128-bit for most of the horizontal max work between the first and second pass is nice.)
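A rough sketch of that two-pass idea over one 256-byte row (untested; first_max_index is just an illustrative name, and this simple version re-loads in the second pass instead of keeping the 8 vectors in registers):

#include <immintrin.h>
#include <cstdint>

static int first_max_index(const uint8_t *row) {
    // First pass: max of all 256 bytes (load + 7x vpmaxub).
    __m256i vmax = _mm256_loadu_si256((const __m256i*)row);
    for (int j = 1; j < 8; j++)
        vmax = _mm256_max_epu8(vmax, _mm256_loadu_si256((const __m256i*)(row + j*32)));
    // Horizontal max of bytes: reduce to 128-bit, then within a qword via shifts.
    __m128i m = _mm_max_epu8(_mm256_castsi256_si128(vmax), _mm256_extracti128_si256(vmax, 1));
    m = _mm_max_epu8(m, _mm_unpackhi_epi64(m, m));
    m = _mm_max_epu8(m, _mm_srli_epi64(m, 32));
    m = _mm_max_epu8(m, _mm_srli_epi64(m, 16));
    m = _mm_max_epu8(m, _mm_srli_epi64(m, 8));
    __m256i bcast_max = _mm256_broadcastb_epi8(m);       // splat the max byte to all 32 lanes

    // Second pass: position of the first byte equal to the max.
    for (int j = 0; j < 8; j++) {
        __m256i v = _mm256_loadu_si256((const __m256i*)(row + j*32));
        unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, bcast_max));
        if (mask)
            return j*32 + (int)_tzcnt_u32(mask);
    }
    return -1;  // unreachable: the max always matches somewhere
}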
Commented version of your original, looking at how it compiled to asm:
int loop_vc_nested(const std::array<uint8_t, H*W> &img, const std::array<Vec32uc, 8> &idx) {
    int sum = 0;
    Vec32uc vMax, iMax, vCurr, iCurr;
    for (int i=0; i<H*W; i+=W) {
        iMax.load(&idx[0]);
        vMax.load(&img[i]);
        for (int j=1; j<8; j++) {
            iCurr.load(&idx[j]);   // these get hoisted out of the outer loop and reused across img iters
            vCurr.load(&img[i+j*32]);
            // unsigned > isn't available until AVX-512. VCL uses !(a == max(a,b))
            // GCC XORs the compare result, clang uses max and a==min(a,b)
            iMax = select(vCurr > vMax, iCurr, iMax);
            // scanning backwards from the end with a==max(a,b), we could still find the earliest max
            vMax = max(vMax, vCurr);
        }
#if 1
        Vec32uc vMaxAll{horizontal_max(vMax)};
        //size_t maxidx = horizontal_find_first(vMax == vMaxAll);  // total disaster on clang: non-inlined BSF wrapper forces vector spill/reload of the idx vectors
        size_t maxidx = _tzcnt_u32(to_bits(vMax == vMaxAll));
#else
        size_t maxidx = maxpos_u8_noscan_unpack(vMax);
#endif
        sum += iMax[maxidx];
    }
    return sum;
}
which compiles to code that loads the first 4 vectors early, then does some processing, then loads more as it goes. ymm1 = set1(-1); XORing with it does a NOT of the compare result.
# GCC13.2 -O3 -march=alderlake for the version of your source above
loop_vc_nested(std::array<unsigned char, 208896ul> const&, std::array<Vec32uc, 8ul> const&):
push rbp
mov rax, rdi
xor ecx, ecx
vpcmpeqd ymm1, ymm1, ymm1 # set1(-1)
mov rbp, rsp
and rsp, -32 # align the stack for the store that we index with movzx
vmovdqu ymm9, YMMWORD PTR [rsi+32] # idx[32..63]
vmovdqu ymm8, YMMWORD PTR [rsi] # idx[0..31]
... # and all 8 vectors of idx
lea rsi, [rdi+208896] # img.end()
.L2:
vmovdqu ymm0, YMMWORD PTR [rax+32]
vpmaxub ymm11, ymm0, YMMWORD PTR [rax]
add rax, 256
vpmaxub ymm10, ymm11, YMMWORD PTR [rax-192]
vpcmpeqb ymm0, ymm11, YMMWORD PTR [rax-256]
vpcmpeqb ymm11, ymm11, ymm10
vpxor ymm0, ymm0, ymm1
vpxor ymm11, ymm11, ymm1
vpblendvb ymm0, ymm8, ymm9, ymm0
vpblendvb ymm0, ymm0, ymm7, ymm11
vpmaxub ymm11, ymm10, YMMWORD PTR [rax-160]
vpcmpeqb ymm10, ymm10, ymm11
vpxor ymm10, ymm10, ymm1
vpblendvb ymm0, ymm0, ymm6, ymm10
vpmaxub ymm10, ymm11, YMMWORD PTR [rax-128]
vpcmpeqb ymm11, ymm11, ymm10
vpxor ymm11, ymm11, ymm1
vpblendvb ymm0, ymm0, ymm5, ymm11
vpmaxub ymm11, ymm10, YMMWORD PTR [rax-96]
vpcmpeqb ymm10, ymm10, ymm11
vpxor ymm10, ymm10, ymm1
vpblendvb ymm0, ymm0, ymm4, ymm10
vpmaxub ymm10, ymm11, YMMWORD PTR [rax-64]
vpcmpeqb ymm11, ymm11, ymm10
vpxor ymm11, ymm11, ymm1
vpblendvb ymm0, ymm0, ymm3, ymm11
vpmaxub ymm11, ymm10, YMMWORD PTR [rax-32]
vpcmpeqb ymm10, ymm10, ymm11
vpxor ymm10, ymm10, ymm1
vpblendvb ymm0, ymm0, ymm2, ymm10
## end of unrolled inner loop
vextracti128 xmm10, ymm11, 0x1 # start of horizontal_max
vpmaxub xmm12, xmm11, xmm10
vmovdqa YMMWORD PTR [rsp-32], ymm0 # store iMax
vpunpckhqdq xmm10, xmm12, xmm12
...
vpmaxub xmm10, xmm10, xmm12 # end of horizontal_max
vpbroadcastb ymm10, xmm10
vpcmpeqb ymm10, ymm10, ymm11
vpmovmskb edx, ymm10
tzcnt edx, edx # your actual original used BSF, much worse on AMD
and edx, 31 # this isn't in the source anywhere!
movzx edx, BYTE PTR [rsp-32+rdx]
add ecx, edx # sum +=
cmp rsi, rax
jne .L2          # }while(ptr != endptr);
mov eax, ecx
vzeroupper
ret
As mentioned in the comments I added, saving an instruction around the blend (to get the opposite condition) could be done with curr == max(vmax, curr), but that's true on a tie when your condition isn't. Looping backward could fix that, but might be harder for the prefetchers.
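A minimal sketch of that backwards loop, using the same VCL wrappers as the original (whether compilers keep the == compare without re-introducing the NOT would need checking):

int loop_vc_nested_backwards(const std::array<uint8_t, H*W> &img, const std::array<Vec32uc, 8> &idx) {
    int sum = 0;
    Vec32uc vMax, iMax, vCurr, iCurr;
    for (int i=0; i<H*W; i+=W) {
        iMax.load(&idx[7]);
        vMax.load(&img[i + 7*32]);
        for (int j=6; j>=0; j--) {           // scan the row's vectors last to first
            iCurr.load(&idx[j]);
            vCurr.load(&img[i + j*32]);
            vMax = max(vMax, vCurr);
            // vCurr == vMax here means vCurr >= the old vMax: no NOT of the compare needed,
            // and because we're going backwards, ties still keep the earliest index per lane.
            iMax = select(vCurr == vMax, iCurr, iMax);
        }
        Vec32uc vMaxAll{horizontal_max(vMax)};
        size_t maxidx = _tzcnt_u32(to_bits(vMax == vMaxAll));
        sum += iMax[maxidx];
    }
    return sum;
}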
(In asm at least, you could load all 8 vectors in forward order, or one from each cache line, but process them backwards. That makes out-of-order exec work even harder to hide load latency, assuming prefetch keeps streaming in order.)