Why is this index-of-max function over many arrays of 256 bytes so much slower on an Intel i3-N305 than on an AMD Ryzen 7 3800X?

I've run the same binaries compiled with gcc-13 (https://godbolt.org/z/qq5WrE8qx) on an Intel i3-N305 (3.8 GHz) and an AMD Ryzen 7 3800X (3.9 GHz) PC. This code uses the VCL library (https://github.com/vectorclass/version2):

int loop_vc_nested(const array<uint8_t, H*W> &img, const array<Vec32uc, 8> &idx) {
  int sum = 0;
  Vec32uc vMax, iMax, vCurr, iCurr;

  for (int i=0; i<H*W; i+=W) {
    iMax.load(&idx[0]);
    vMax.load(&img[i]);

    for (int j=1; j<8; j++) {
      iCurr.load(&idx[j]);
      vCurr.load(&img[i+j*32]);
      iMax = select(vCurr > vMax, iCurr, iMax);
      vMax = max(vMax, vCurr);
    }

    Vec32uc vMaxAll{horizontal_max(vMax)};
    sum += iMax[horizontal_find_first(vMax == vMaxAll)];
  }

  return sum;
}

The full benchmark source is here: https://github.com/pauljurczak/simd-benchmarks/blob/main/main-5-vcl-eve.cpp. Here are the timings:

Ubuntu 22.04.3 LTS on AMD Ryzen 7 3800X 8-Core Processor
gcc    v13.1   __cplusplus=202100
loop_vc_nested(): 3.597  3.777 [us]  108834

Ubuntu 23.10 on Intel(R) Core(TM) i3-N305
gcc    v13.1   __cplusplus=202100
loop_vc_nested(): 11.804  11.922 [us]  108834

That is an unexpected 3.2x slowdown. AFAIK, these CPUs have similar SIMD capabilities for a single-threaded program, and their performance on the 7-zip benchmark is very close. Why such a big gap?


Here is the output from perf. AMD Ryzen 7 3800X:

          3,841.61 msec task-clock                       #    1.000 CPUs utilized             
                20      context-switches                 #    5.206 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
             2,191      page-faults                      #  570.333 /sec                      
    14,909,837,582      cycles                           #    3.881 GHz                         (83.34%)
         3,509,824      stalled-cycles-frontend          #    0.02% frontend cycles idle        (83.34%)
     9,865,497,290      stalled-cycles-backend           #   66.17% backend cycles idle         (83.34%)
    42,856,816,868      instructions                     #    2.87  insn per cycle            
                                                  #    0.23  stalled cycles per insn     (83.34%)
     1,718,672,677      branches                         #  447.383 M/sec                       (83.34%)
         2,409,251      branch-misses                    #    0.14% of all branches             (83.29%)

Intel i3-N305:

         12,015.18 msec task-clock                       #    1.000 CPUs utilized             
                57      context-switches                 #    4.744 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
             2,196      page-faults                      #  182.769 /sec                      
    45,432,594,158      cycles                           #    3.781 GHz                         (74.97%)
    42,847,054,707      instructions                     #    0.94  insn per cycle              (87.48%)
     1,714,003,765      branches                         #  142.653 M/sec                       (87.48%)
         4,254,872      branch-misses                    #    0.25% of all branches             (87.51%)
                        TopdownL1                 #      0.2 %  tma_bad_speculation    
                                                  #     45.5 %  tma_retiring             (87.52%)
                                                  #     53.8 %  tma_backend_bound      
                                                  #     53.8 %  tma_backend_bound_aux  
                                                  #      0.5 %  tma_frontend_bound       (87.52%)

Compiler options: -O3 -Wno-narrowing -ffast-math -fno-trapping-math -fno-math-errno -ffinite-math-only -march=alderlake


Additional cache use information from perf stat -d on i3-N305:

    15,615,324,576      L1-dcache-loads                  #    1.294 G/sec                       (54.50%)
   <not supported>      L1-dcache-load-misses                                                 
            60,909      LLC-loads                        #    5.048 K/sec                       (54.50%)
             5,231      LLC-load-misses                  #    8.59% of all L1-icache accesses   (54.50%)

I installed the newest Intel C++ compiler in order to get -march=gracemont working. Performance did not improve, since the Intel compiler is based on clang, which performs worse than gcc on this benchmark. Here are the timings:

Ubuntu 23.10 on Intel(R) Core(TM) i3-N305
clang v17.0.0 (icx 2024.0.2.20231213) C++
loop_vc_nested(): 12.311  12.397 [us]  108834  # -march=native
loop_vc_nested(): 12.773  12.847 [us]  108834  # -march=alderlake
loop_vc_nested(): 12.418  12.519 [us]  108834  # -march=gracemont
loop_vc_unrolled(): 10.388  12.406 [us]  108834  # -march=gracemont
loop_vc_nested_noselect_2chains(): 6.686  10.454 [us]  109599  # -march=gracemont
Underwing answered 25/12, 2023 at 7:56

The AVX encoding of vpblendvb has 4 operands (3 sources and a separate destination), and is multi-uop even on Intel P-cores (unlike the legacy-SSE 128-bit encoding), but is single-uop on Zen. A different algorithm can avoid it.

Alder Lake E-cores (Gracemont) are 5-wide out-of-order with reasonable out-of-order exec capability, but they're not great at 256-bit SIMD in general, and they choke badly on 8-uop vpblendvb ymm in particular, which looks like it includes a front-end bottleneck. And your inner loop uses it as every 4th instruction in a dependency chain (short enough for OoO exec to maybe partly hide, so we might just be seeing the effects of the back-end-throughput or front-end bottleneck).

Your implementation strategy / algorithm is something Zen 2 is great at but which is a stumbling block for Gracemont, amplifying the difference between 256-bit vs. 128-bit SIMD execution units.


Your i3-N305 is Alder Lake-N series. Like earlier Celeron / Pentium CPUs with N in their model number, the cores are low-power Silvermont-family. In this case Gracemont, the E-cores found in full Alder Lake chips. (Which are significantly beefier than Tremont or especially earlier generations like Goldmont Plus.) And it has AVX2+FMA which I guess is what justifies selling it as an i3.

https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/ is a good deep-dive on the CPU microarchitecture, with some comparisons to Zen 2, and microbenchmarks of cache bandwidth and latency (measured as part of an i9-12900K; IDK if the interconnect or L3 would be different in an i3-N series, but your benchmark fits in its 2M L2 cache, and with a single core active, read bandwidth from L2 is about the same as L1d for sequential access). It doesn't mention how the decoders handle instructions that are more than 3 uops, but it does have a diagram showing the pair of 3-wide decode clusters. (If it's like previous Intel, any instruction of more than 1 uop can only decode in the first decoder of a cluster, so that probably limits front-end throughput to two YMM vector instructions per clock even if they're the minimum 2 uops.)

Your Ryzen 3800X is a Zen 2, a full-fledged big core with good 256-bit SIMD load and ALU throughput (up from 128-bit in Zen 1, Ryzen 1xxx and 2xxx series). And single-uop vpblendvb.

The most important factors are:

  • Vector ALU and memory ports are 128-bit wide, and every 256-bit instruction decodes to (at least) 2 uops, except a few like vextracti128 and vpmovmskb. (So it's like Zen 1 and Bulldozer-family). So uops per clock is about twice the IPC, when running code that's mostly vector instructions with a bit of scalar overhead. 2/clock load bandwidth only goes half as far when each load is only 128-bit.

  • That select compiles to a vpblendvb. Unfortunately that's very slow on Gracemont, see https://uops.info/ - VEX encodings of variable blends are 4 uops per 128-bit lane, so the YMM version is 8 uops with a measured throughput of one per 3.86 cycles. (Or 3.2 cycles for a memory source instead of register, surprisingly.) Zen family runs the 4-operand vpblendvb as a single uop (with a choice of ports even).

    The legacy-SSE encoding only has 3 operands, one of them implicitly XMM0, and Gracemont runs that as a single uop. Even Alder Lake P-cores run vpblendvb x/ymm as 3 uops, up from 2 in Ice Lake, while SSE4.1 pblendvb xmm, xmm is single uop on modern Intel P-cores, too.

    Gracemont vpblendvb ymm also has 6 to 7 cycle latency, or 5c for the XMM version (vs. 2 to 3 on P-cores), depending on whether the data or control input is the critical path, vs. 1 cycle on Zen. That's even worse than its throughput, even with the front-end bottleneck. Out-of-order exec buffers (scheduler and ROB) are probably big enough to hide this over a chain of 7 of them, since you start a new dep chain every 256 bytes, but it's not great and would be a bottleneck in a loop that ran more iterations.

    It seems Intel goofed when designing the AVX1 encoding of it (with a 4th register number in an immediate byte!) while Sandybridge-family was still being designed, not anticipating that their later CPUs would be able to handle 3-operand instructions as a single uop. (Motivated by FMA in Haswell, but benefiting others in Broadwell and later.) And that mov-elimination would remove the back-end execution port cost of copying a register if needed (unlike here) if the original value is needed after an instruction that modifies a R+W destination in-place. FMA3 and later 3-input instructions like AVX-512 vpternlogd and vpermi/t2d have an R+W source/destination as the first operand. (k mask inputs to AVX-512 instructions are a separate forwarding network and a separate domain to track dependencies in, so they don't count.)

    An 8-uop instruction inherently contributes to low IPC for the same uops/clock throughput, but it probably also stalls the front-end some, reducing uops/clock. Even Gracemont's 4-uop vpblendvb xmm has about the same bad throughput when running just it back-to-back, which is consistent with some kind of decode stall or a switch to the microcode ROM for instructions of more than 3 uops.


You could try to blend manually with _mm256_and_si256 / andnot / or, which would be 6 uops but avoid front-end stalls for a total throughput cost of 1.33 cycles on the vector ALU ports. But clang will "optimize" those intrinsics to a vpblendvb since it knows the blend-control is a compare result, with all bits matching the sign bit.

Clang trunk with -mtune=gracemont or -march=gracemont doesn't know that instruction is slow on that uarch, or at least doesn't split the select into those bitwise ops. MSVC, or classic ICC, are a lot more literal about intrinsics. GCC does optimize intrinsics some, but in this case it uses actual vpand/vpandn/vpor instructions (https://godbolt.org/z/3fc1jo9r4), so you could make a version that's worse on Ryzen, less bad on Gracemont, but not optimal anywhere. I think it's still worse on Gracemont than the noselect version below.
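For reference, the manual blend is just three bitwise intrinsics. A minimal sketch (blend_manual is a hypothetical helper name; as noted above, clang may fuse it back into vpblendvb, while GCC keeps the vpand/vpandn/vpor):

#include <immintrin.h>

// Returns b where the mask bytes are all-ones (e.g. a vpcmpeqb result), a elsewhere.
static inline __m256i blend_manual(__m256i a, __m256i b, __m256i mask) {
  return _mm256_or_si256(_mm256_and_si256(mask, b),      // take b where mask is set
                         _mm256_andnot_si256(mask, a));  // take a where mask is clear
}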

Your original is fairly good for Ryzen, but there's room for improvement in the cleanup, and in maybe scanning backwards to avoid inverting the compare to feed the blend. Or the branchy strategy might be best if an instance of the max element is often seen within the first 64 bytes so it's predictable. Just load + 7x vpmaxub ymm, mem, then reduce and scan.


Avoiding variable-blend

Your actual problem could be done other ways, for example unpacking your data with indices as chtz suggested in Looking for an efficient function to find an index of max element in SIMD vector using a library, so the max u16 element contains the data and the index. (And instead of loading, the index vector can come from idx = _mm256_add_epi8(idx, _mm256_set1_epi8(32));. Or maybe that inner loop over 256 bytes can get fully unrolled so you have 8 registers holding index data.)

Since you'd probably want to use that improved reduction anyway, unpacking even earlier saves some cleanup work, and your loop is only 8 vectors.

For a sum of indices, I guess it's important that you get the first occurrence of a match? So you'd want to invert your indices so the max of data:index packed as a u16 picks the earlier index when it's a tie-break for equal data. That's what we want anyway for a cleanup that's going to use vphminposuw.

This is what it might look like, without being clever about indices, so on a tie it might take the last occurrence.

int loop_vc_nested_noselect(const std::array<uint8_t, H*W> &img, const std::array<Vec32uc, 8> &idx) {
  int sum = 0;

  for (int i=0; i<H*W; i+=W) {
    __m256i tmpidx = _mm256_loadu_si256((__m256i*)&idx[0]);
    __m256i tmp = _mm256_loadu_si256((__m256i*)&img[i]);
    Vec16us vMaxlo = _mm256_unpacklo_epi8(tmpidx, tmp);
    Vec16us vMaxhi = _mm256_unpackhi_epi8(tmpidx, tmp);

    for (int j=1; j<8; j++) {
      Vec32uc vCurr, iCurr;
      iCurr.load(&idx[j]);  // these get hoisted out of the outer loop and reused across img iters
      vCurr.load(&img[i+j*32]);
      Vec16us lo = _mm256_unpacklo_epi8(iCurr, vCurr);
      Vec16us hi = _mm256_unpackhi_epi8(iCurr, vCurr);
      vMaxlo = max(vMaxlo, lo);
      vMaxhi = max(vMaxhi, hi);
          // vMax = max(vMax, max(lo,hi));  // GCC was optimizing to two dep chains anyway, and that's better on big-cores that can do more than 1 load+shuffle+max per clock
    }
    Vec16us vMax = max(vMaxlo, vMaxhi);

    // silly GCC uses vpextrw even though we're already truncating narrower
    auto maxidx = (uint8_t)horizontal_max(vMax); // retrieve the payload from the bottom of the max
    // TODO: use phminposuw like the last part of maxpos_u8_noscan_unpack
    // with indices loaded and inverted once, outside the outer loop.  (Manually unrolled if compilers don't do that for you)
    sum += maxidx;
  }

  return sum;
}

Instead of loading indices, you could maybe just compute them with _mm256_sub_epi8(idx, _mm256_set1_epi8(-1)) (or add, to go in descending order down from 255), although compilers will probably constant-propagate through that and make 8 vectors of constants. The RIP-relative addressing mode to load those is larger code-size than [rsi+disp8] for the first 5 loads, but that's just the startup code. After the compiler's done unrolling, you definitely want it to have 8 vectors of indices that it generates once ahead of the loop.
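For reference, a minimal sketch of building those 8 index vectors once, ahead of the outer loop (make_index_vectors is a hypothetical helper, assuming plain ascending 0..255 byte indices rather than the inverted variant discussed above):

#include <array>
#include <cstdint>
#include "vectorclass.h"   // Vec32uc

// Build the index vectors once instead of loading them from a caller-supplied array.
static std::array<Vec32uc, 8> make_index_vectors() {
  std::array<uint8_t, 256> bytes;
  for (int k = 0; k < 256; k++) bytes[k] = (uint8_t)k;  // byte index within a row
  std::array<Vec32uc, 8> idx;
  for (int j = 0; j < 8; j++)
    idx[j].load(&bytes[j * 32]);                        // 32 indices per vector
  return idx;
}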

On Godbolt, GCC -O3 -march=alderlake fully unrolls, loading all 8 index vectors before the outer loop and using them from registers. (Same in the original version.)

The inner loop looks like this; notice that it uses the same memory source operand twice to save front-end bandwidth at the cost of more back-end uops. This is actually OK on Gracemont as well as Alder Lake: vpunpckl/hbw is 2 front-end uops with or without a memory source operand, with 1.0 vs. 0.66 cycle throughput. With separate loads I think the front-end would be a worse bottleneck, depending on how fast it can decode 2-uop instructions. And the vpmaxuw per unpack is extra vector ALU work to keep ports busy so it doesn't bottleneck on loads.

Clang -mtune=gracemont chooses differently, but it doesn't load twice even tuning for Alder Lake / Ice Lake.

.L7:
        vpunpcklbw      ymm11, ymm7, YMMWORD PTR [rax+32]
        vpunpckhbw      ymm10, ymm7, YMMWORD PTR [rax+32]
        add     rax, 256
        vpunpcklbw      ymm0, ymm8, YMMWORD PTR [rax-256]
        vpunpckhbw      ymm9, ymm8, YMMWORD PTR [rax-256]
        vpmaxuw ymm0, ymm0, ymm11
        vpmaxuw ymm9, ymm9, ymm10
        vpunpcklbw      ymm11, ymm6, YMMWORD PTR [rax-192]
        vpunpckhbw      ymm10, ymm6, YMMWORD PTR [rax-192]
        vpmaxuw ymm0, ymm0, ymm11
        vpunpcklbw      ymm11, ymm5, YMMWORD PTR [rax-160]
        vpmaxuw ymm9, ymm9, ymm10
        vpunpckhbw      ymm10, ymm5, YMMWORD PTR [rax-160]
        vpmaxuw ymm0, ymm0, ymm11
...

https://uica.uops.info/ predicts Ice Lake could run it at 14 cycles per iteration, vs. 17 for the vpblendvb version. And that's nearly bottlenecked on vector ALU ports, so Alder Lake would be even worse with the vpblendvb version.

I haven't analyzed by hand for Gracemont, or tried LLVM-MCA which might have a Gracemont model.

I also haven't looked at optimizing it to use vphminposuw as part of the cleanup, which would save even more, helping pay for the extra shuffle work we're doing per vector.
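A minimal sketch of what that cleanup could look like (maxidx_from_packed is a hypothetical helper, not the maxpos_u8_noscan_unpack from the linked repo; it assumes the index bytes were stored inverted, i.e. as 255 - i, when building the idx vectors, which the noselect code above does not do):

#include <immintrin.h>

// Assumes each u16 lane of vMax is packed as (value << 8) | (255 - byte_index),
// so a bigger packed u16 means "bigger value, or same value at an earlier position".
static inline int maxidx_from_packed(__m256i vMax) {
  // Reduce 16 candidate u16 lanes to 8 in an xmm register.
  __m128i m = _mm_max_epu16(_mm256_castsi256_si128(vMax),
                            _mm256_extracti128_si256(vMax, 1));
  // phminposuw finds the *minimum*, so invert: min(~x) corresponds to max(x).
  __m128i res = _mm_minpos_epu16(_mm_xor_si128(m, _mm_set1_epi8(-1)));
  // Low byte of the winning (inverted) u16 is ~(255 - i) & 0xFF == i.
  return _mm_extract_epi16(res, 0) & 0xFF;
}

With that inverted-index assumption, the u16 max over value:index also picks the earliest position on a value tie, matching the first-occurrence intent of the original.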


Or consider a branchy strategy, like finding the max and then searching the array for the first match. (compare/movemask aka to_bits(curr == bcast_max), and if non-zero, return tzcnt(mask)). You never need to load vectors of index data, and an early match reduces the amount of work. (But it can mispredict, which might be much worse; still worth a try. But usefully microbenchmarking things that depend on correct branch prediction is hard - a microbenchmark can learn a pattern. Or if you make it totally random, it predicts worse than real data distributions.)

With only 8 vectors of data, that second pass loop can be fully unrolled with no loads. The first pass can leave the data in registers. (But it would have to be fully unrolled, too, perhaps checking a pair of ymm regs at a time for a match, with shift/or and a 64-bit tzcnt. vpmovmskb r32, ymm is single-uop on Gracemont.) And it would mean separate load + max instructions in the first pass, not memory-source. Gracemont doesn't have a uop-cache but apparently its decoders manage ok for throughput. Perhaps not wonderfully with back-to-back 2-uop instructions.

(This is basically the same strategy your current cleanup is using, find the max then search for its position, but across the whole 8-vector array. Allowing reduction to 128-bit for most of the horizontal max work between the first and second pass is nice.)
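A rough sketch of that two-pass structure for one 256-byte row (first_maxpos_branchy is a hypothetical helper; it isn't unrolled or register-blocked as suggested above, so the second pass reloads from memory):

#include <immintrin.h>
#include <cstdint>

// Pass 1 finds the max byte of the row; pass 2 scans for the first position
// equal to it, with an early-out branch per 32-byte vector.
static inline int first_maxpos_branchy(const uint8_t *row) {
  __m256i vmax = _mm256_loadu_si256((const __m256i*)row);
  for (int j = 1; j < 8; j++)
    vmax = _mm256_max_epu8(vmax, _mm256_loadu_si256((const __m256i*)(row + j*32)));

  // Horizontal max of 32 bytes, then broadcast it to every byte.
  __m128i m = _mm_max_epu8(_mm256_castsi256_si128(vmax),
                           _mm256_extracti128_si256(vmax, 1));
  m = _mm_max_epu8(m, _mm_srli_si128(m, 8));
  m = _mm_max_epu8(m, _mm_srli_si128(m, 4));
  m = _mm_max_epu8(m, _mm_srli_si128(m, 2));
  m = _mm_max_epu8(m, _mm_srli_si128(m, 1));
  __m256i bcast = _mm256_broadcastb_epi8(m);

  for (int j = 0; j < 8; j++) {   // second pass: branch on the first match
    __m256i v = _mm256_loadu_si256((const __m256i*)(row + j*32));
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, bcast));
    if (mask)
      return j*32 + (int)_tzcnt_u32(mask);
  }
  return -1;   // unreachable: the max is always found somewhere in the row
}

Note this returns the true first byte position in the row, which can differ from the original's tie-breaking (first lane with the max, then first vector for that lane), so the summed result may not match exactly.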


Commented version of your original, looking at how it compiled to asm:
int loop_vc_nested(const std::array<uint8_t, H*W> &img, const std::array<Vec32uc, 8> &idx) {
  int sum = 0;
  Vec32uc vMax, iMax, vCurr, iCurr;

  for (int i=0; i<H*W; i+=W) {
    iMax.load(&idx[0]);
    vMax.load(&img[i]);

    for (int j=1; j<8; j++) {
      iCurr.load(&idx[j]);  // these get hoisted out of the outer loop and reused across img iters
      vCurr.load(&img[i+j*32]);
      // unsigned > isn't available until AVX-512.  VCL uses !(a == max(a,b))
      // GCC XORs the compare result, clang uses max and a==min(a,b)
      iMax = select(vCurr > vMax, iCurr, iMax);
      // scanning backwards from the end with a==max(a,b), we could still find the earliest max
      vMax = max(vMax, vCurr);
    }

#if 1
   Vec32uc vMaxAll{horizontal_max(vMax)};
   //size_t maxidx = horizontal_find_first(vMax == vMaxAll); // total disaster on clang: non-inlined BSF wrapper forces vector spill/reload of the idx vectors
   size_t maxidx = _tzcnt_u32(to_bits(vMax == vMaxAll));
#else
    size_t maxidx = maxpos_u8_noscan_unpack(vMax);
#endif
    sum += iMax[maxidx];
  }

  return sum;
}

which compiles to code that loads the first 4 vectors early, then does some processing, loading more as it goes. ymm1 = set1(-1); XORing with it does a NOT of the compare result.

# GCC13.2 -O3 -march=alderlake for the version of your source above
loop_vc_nested(std::array<unsigned char, 208896ul> const&, std::array<Vec32uc, 8ul> const&):
        push    rbp
        mov     rax, rdi
        xor     ecx, ecx
        vpcmpeqd        ymm1, ymm1, ymm1   # set1(-1)
        mov     rbp, rsp
        and     rsp, -32                   # align the stack for the store that we index with movzx

        vmovdqu ymm9, YMMWORD PTR [rsi+32] # idx[32..63]
        vmovdqu ymm8, YMMWORD PTR [rsi]    # idx[0..31]
        ...        # and all 8 vectors of idx
        lea     rsi, [rdi+208896]         # img.end()
.L2:
        vmovdqu ymm0, YMMWORD PTR [rax+32]
        vpmaxub ymm11, ymm0, YMMWORD PTR [rax]
        add     rax, 256
        vpmaxub ymm10, ymm11, YMMWORD PTR [rax-192]
        vpcmpeqb        ymm0, ymm11, YMMWORD PTR [rax-256]
        vpcmpeqb        ymm11, ymm11, ymm10
        vpxor   ymm0, ymm0, ymm1
        vpxor   ymm11, ymm11, ymm1
        vpblendvb       ymm0, ymm8, ymm9, ymm0
        vpblendvb       ymm0, ymm0, ymm7, ymm11
        vpmaxub ymm11, ymm10, YMMWORD PTR [rax-160]
        vpcmpeqb        ymm10, ymm10, ymm11
        vpxor   ymm10, ymm10, ymm1
        vpblendvb       ymm0, ymm0, ymm6, ymm10
        vpmaxub ymm10, ymm11, YMMWORD PTR [rax-128]
        vpcmpeqb        ymm11, ymm11, ymm10
        vpxor   ymm11, ymm11, ymm1
        vpblendvb       ymm0, ymm0, ymm5, ymm11
        vpmaxub ymm11, ymm10, YMMWORD PTR [rax-96]
        vpcmpeqb        ymm10, ymm10, ymm11
        vpxor   ymm10, ymm10, ymm1
        vpblendvb       ymm0, ymm0, ymm4, ymm10
        vpmaxub ymm10, ymm11, YMMWORD PTR [rax-64]
        vpcmpeqb        ymm11, ymm11, ymm10
        vpxor   ymm11, ymm11, ymm1
        vpblendvb       ymm0, ymm0, ymm3, ymm11
        vpmaxub ymm11, ymm10, YMMWORD PTR [rax-32]
        vpcmpeqb        ymm10, ymm10, ymm11
        vpxor   ymm10, ymm10, ymm1
        vpblendvb       ymm0, ymm0, ymm2, ymm10
 ## end of unrolled inner loop
        vextracti128    xmm10, ymm11, 0x1   # start of horizontal_max
        vpmaxub xmm12, xmm11, xmm10
        vmovdqa YMMWORD PTR [rsp-32], ymm0   # store iMax
        vpunpckhqdq     xmm10, xmm12, xmm12
    ...
        vpmaxub xmm10, xmm10, xmm12       # end of horizontal_max
        vpbroadcastb    ymm10, xmm10
        vpcmpeqb        ymm10, ymm10, ymm11
        vpmovmskb       edx, ymm10
        tzcnt   edx, edx        # your actual original used BSF, much worse on AMD
        and     edx, 31         # this isn't in the source anywhere!
        movzx   edx, BYTE PTR [rsp-32+rdx]
        add     ecx, edx        # sum += 
        cmp     rsi, rax
        jne     .L2         }while(ptr != endptr);

        mov     eax, ecx
        vzeroupper
        ret

As mentioned in the comments I added, saving an instruction around the blend (to get the opposite condition) could be done with curr == max(vmax, curr), but that's true on a tie when your condition isn't. Looping backward could fix that, but might be harder for the prefetchers.

(In asm at least, you could load all 8 vectors in forward order, or one from each cache line, but process them backwards. That makes out-of-order exec work even harder to hide load latency, assuming prefetch keeps streaming in order.)
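Here's a rough sketch of what that backwards-looping source variant could look like (loop_vc_nested_backward is a hypothetical name; it drops into the same benchmark file as the functions above, still uses select/vpblendvb so it's mostly a Ryzen-oriented tweak, and its backwards loads are what might bother the prefetchers):

int loop_vc_nested_backward(const std::array<uint8_t, H*W> &img,
                            const std::array<Vec32uc, 8> &idx) {
  int sum = 0;

  for (int i = 0; i < H*W; i += W) {
    Vec32uc vMax, iMax, vCurr, iCurr;
    iMax.load(&idx[7]);
    vMax.load(&img[i + 7*32]);

    for (int j = 6; j >= 0; j--) {
      iCurr.load(&idx[j]);
      vCurr.load(&img[i + j*32]);
      Vec32uc newMax = max(vMax, vCurr);
      // vCurr >= vMax as a plain equality compare: no inverted condition, no vpxor.
      // Going backwards, a tie overwrites iMax, so the earliest index still wins.
      iMax = select(vCurr == newMax, iCurr, iMax);
      vMax = newMax;
    }

    Vec32uc vMaxAll{horizontal_max(vMax)};
    sum += iMax[horizontal_find_first(vMax == vMaxAll)];
  }

  return sum;
}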

Keble answered 25/12, 2023 at 11:53
