SSE4.1 phminposuw is a horizontal min of 8x 16-bit elements. Doing that reduction the generic way takes 3 shuffle/min steps, 6 instructions, so if we can use fewer uops than that to massage our data into an input for _mm_minpos_epu16, we come out ahead. It's a single uop with 3 to 4 cycle latency across Zen / Skylake / Gracemont, so it's also a latency win vs. 6 shuffle/min ops. (https://uops.info/)
It's not available for wider vectors even with AVX-512, so we do need two reduction steps to get our 32 elements down to 8, but we can get them zero-extended to 16-bit for free by being clever about it. (It does also find the position of the min, but only within our already-reduced input, which isn't useful here.)
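For reference, here's a sketch (my code, not from the answer) of the generic 3-step reduction that phminposuw replaces for 8x u16; it assumes SSE4.1 anyway for pminuw:

#include <immintrin.h>
#include <stdint.h>
// generic horizontal min of 8x u16: 3 shuffle+min steps, 6 instructions
static inline uint16_t hmin_epu16_generic(__m128i v)
{
    v = _mm_min_epu16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2)));   // min of 64-bit halves
    v = _mm_min_epu16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2,3,0,1)));   // min of 32-bit halves
    v = _mm_min_epu16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(2,3,0,1))); // min of the last pair, in element 0
    return (uint16_t)_mm_cvtsi128_si32(v);
}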
Bitwise NOT turns UINT8_MIN (0) into UINT8_MAX (255) and vice versa. It's the same thing as 255-x, which more obviously does what we want: a 1:1 transform that reverses an unsigned <= compare, without wrapping for any uint8_t value. (Thanks to @harold and @Homer512 in comments for fixing my initial idea.)
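As a tiny sketch (my naming, not part of the answer's code), the whole trick in vector form is one XOR with all-ones:

#include <immintrin.h>
// flip every byte so the largest unsigned value becomes the smallest:
// (uint8_t)~x == 255 - x for every x, so the order is exactly reversed, no wrapping
static inline __m256i flip_u8_for_min(__m256i v)
{
    return _mm256_xor_si256(v, _mm256_set1_epi8(-1));   // ~v per byte
}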
@chtz had a good idea: unpack the data with indices, so the index is part of the max or min u16 element we find, skipping the broadcast/compare/movemask/bit-scan. This is orthogonal to using phminpos, and works for any ISA and element size as long as we have a vertical min or max for elements that are twice as wide. The first reduction step changes to doing something to both inputs, widening the elements, instead of just shuffling the high half down to the bottom. The biggest benefit is critical-path latency, but it also helps throughput slightly.
In case of a tie (equal original 8-bit values), the low half, the index, is the tie-break. If you want the first occurrence of the max element, min with inverted elements and normal indices works: the smaller index wins the tie. Or max with normal elements and inverted indices, which you un-invert after some of the reduction, as in the code below.
#include "vectorclass.h"
int maxpos_u8_noscan_unpack(const Vec32uc v)
{
Vec32uc indices = _mm256_setr_epi8( ~0, ~1, ~2, ~3, ~4, ~5, ~6, ~7,
~8, ~9,~10,~11,~12,~13,~14,~15,
~16,~17,~18,~19,~20,~21,~22,~23,
~24,~25,~26,~27,~28,~29,~30,~31);
__m256i lo = _mm256_unpacklo_epi8(indices, v); // pair value:index in a u16 with value most significant
__m256i hi = _mm256_unpackhi_epi8(indices, v); // then find the max of those u16 elements and extract the payload
Vec16us vu16 = _mm256_max_epu16(lo, hi);
Vec8us vu8 = max(vu16.get_low(), vu16.get_high()); // narrow to 128-bit
// on a tie, we take the max ~index, i.e. the min index = first occurrence of the max
// inverted indices allow XOR with set1(-1), a cheaper constant to generate than set1_epi16(0xff00)
auto vinv = ~vu8; // 1:1 mapping that flips max to min, same as 255-v (for bytes)
// Now we're looking for the min u16, whose low 8 bits are the index of the original u8
__m128i minpos = _mm_minpos_epu16(vinv); // min value in the bottom u16 of the vector
unsigned min = _mm_cvtsi128_si32(minpos);
return (uint8_t)min; // movd + scalar movzx is cheaper than pextrb 0
// a smarter compiler would vmovd into a different reg like EDX, allowing mov-elimination for the movzx
}
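A quick sanity check I'd use (my code, assuming the function above is in the same file and vectorclass.h is on the include path); the duplicate max exercises the first-occurrence tie-break:

#include <cstdio>
#include <cstdint>

// scalar reference: index of the first occurrence of the max byte
static int maxpos_u8_ref(const uint8_t *p)
{
    int best = 0;
    for (int i = 1; i < 32; i++)
        if (p[i] > p[best]) best = i;   // strict >, so ties keep the earliest index
    return best;
}

int main()
{
    alignas(32) uint8_t buf[32] = {0};
    buf[5] = 200; buf[20] = 200; buf[9] = 7;   // duplicate max at positions 5 and 20
    Vec32uc v;
    v.load(buf);
    std::printf("ref=%d simd=%d\n", maxpos_u8_ref(buf), maxpos_u8_noscan_unpack(v));
}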
Godbolt with a test loop (uiCA) to see how they inline and hoist constants, i.e. to get a loop I can copy/paste into https://uica.uops.info/
I was going to write the indices as normal numbers and use unpack(~indices, v). But MSVC didn't constant-propagate through the bitwise NOT, doing it at run-time!
Clang is too clever for its own good: it sees that some of the indices input elements aren't used by _mm256_unpacklo_epi8, so it zeros them out. Same for the unpack-hi. But that means it needs to load 2 different vector constants instead of having both unpacks read parts of the same one! (TODO: report on https://github.com/llvm/llvm-project/issues/)
There are multiple strategies for unpacking:

- chtz's suggestion of vpunpckl/hbw, as above. The in-lane behaviour is funky for 256-bit shuffles, but they do pair corresponding elements of the two input vectors. This way needs fewer vector constants (just one non-trivial one, except when clang pessimizes it) and fewer vector ALU uops, but more of them are shuffles, so on Skylake and earlier, port 5 could be a bottleneck depending on the surrounding code. Not a problem as cleanup for your loop in Why performance of this function is so slow on Intel i3-N305 compared to AMD Ryzen 7 3800X? (where chtz's strategy could be expanded to your whole loop so the cleanup is just a max reduction of u16 elements, including the inversion for phminposuw). Ice Lake and later, and Gracemont, can run vpunpck integer shuffles on more than one port, so back-end port throughput bottlenecks aren't a problem there.

- Odd/even: shift the even elements left to the top of a u16, and mask to isolate the odds. (Then XOR to both flip the data and apply the index into the other half.) Fewer of the uops are shuffles, but 1 more total uop than unpack. And more of the instructions are 256-bit YMM, worse on Zen 1 or your Gracemont.

- With AVX-512 (or AVX10 256-bit), vpternlogd is useful (see the sketch after the shift/XOR code below), but not much else.

- Not using chtz's trick at all, i.e. bcast/cmp/movemask/scan: no vector constants needed at all. Perhaps good as a one-off, like cleanup for an infrequently-run loop. But it's 10 uops to get a scalar integer result (starting from data in a YMM vector) vs. 9 for vpunpckl/hbw on an Intel P-core. (7 uops for that to get the index in the bottom byte of an XMM register, or 8 total (including a vpand or vpmovzxbq) to get it zero-extended into a qword at the bottom, where you could vpaddd or vpaddq to accumulate it into an XMM total which you retrieve after your outer loop.)
shift/XOR strategy, fewer shuffles but more uops and constants, possible use-cases on Skylake
int maxpos_u8_noscan_odd_even(const Vec32uc v)
{
// 0,2,4... unpacked with FF bytes
__m256i indices_even = _mm256_setr_epi16(
0xff00, 0xff02, 0xff04, 0xff06, 0xff08, 0xff0a, 0xff0c, 0xff0e,
0xff10, 0xff12, 0xff14, 0xff16, 0xff18, 0xff1a, 0xff1c, 0xff1e);
__m256i indices_odd = _mm256_sub_epi16(indices_even, _mm256_set1_epi16(-1)); // save typing, and it wouldn't be too bad if the compiler actually did it at runtime this way. MSVC 19.14 (2017) actually does!
// _mm256_setr_epi16(0xff01, 0xff03, ...
__m256i even = _mm256_slli_epi16(v, 8); // bring values to the top of each u16
even = _mm256_xor_si256(even, indices_even); // AVX-512: merge-masked vpshufb? But that doesn't flip the data. Do that separately? Use merge-masked sub for odds?
__m256i odd = _mm256_and_si256(v, _mm256_set1_epi16(0xff00));
odd = _mm256_xor_si256(odd, indices_odd); // AVX-512 _mm256_ternlog_epi32 would be great
// NOT the data (255-v) while we unpack with indices; now we're looking for the min element
Vec16us vu16 = _mm256_min_epu16(odd, even); // values are the "key", bringing the index along with them
Vec8us vu8 = min(vu16.get_low(), vu16.get_high()); // narrow to 128-bit
__m128i minpos = _mm_minpos_epu16(vu8); // ~value:index in the bottom u16 of the vector
unsigned min = _mm_cvtsi128_si32(minpos);
return (uint8_t)min; // movd + scalar movzx is cheaper than pextrb 0
}
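If you do have AVX-512VL, here's a hedged sketch (my code, untested) of folding the AND+XOR for the odd elements into one vpternlogd, as the comments above hint; the immediate 0x6A encodes the boolean function (A & B) ^ C, with A = v, B = the 0xff00 mask, C = indices_odd:

#include <immintrin.h>
// assumes AVX-512F + AVX-512VL
static inline __m256i odd_with_index_avx512(__m256i v, __m256i indices_odd)
{
    return _mm256_ternarylogic_epi32(v, _mm256_set1_epi16((short)0xff00), indices_odd, 0x6A);
}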
Earlier version, with bcast / cmp / movemask / tzcnt cleanup
int maxpos_u8(const Vec32uc v)
{
auto vinv32 = ~v; // 1:1 mapping that flips max to min
// now we're looking for minpos_u8(vinv32).
Vec16uc vinv = min(vinv32.get_low(), vinv32.get_high());
// Reduce to 8x u16
Vec8us vus = Vec8us(min(vinv, _mm_srli_epi16(vinv, 8)));
// min of each pair 0-extended into a 16-bit element, since minu(0, x) = 0 in high half
Vec32uc minval = _mm256_broadcastb_epi8(_mm_minpos_epu16(vus)); // bottom u16 is the min value, second u16 is the min pos
auto match = (minval == vinv32);
return _tzcnt_u32(_mm256_movemask_epi8(match));
// VCL has silly inline asm for BSF.
// Use __builtin_ctz, std::countr_zero, or a portable tzcnt intrinsic since we have AVX2
}
AVX2 basically implies BMI1/2, so requiring it for tzcnt too won't exclude any AVX2 CPUs. Compile with GCC/Clang -march=x86-64-v3 (https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels), or MSVC -arch:AVX2 which actually implies BMI2 as well. Or use std::countr_zero from C++20 #include <bit>, but make sure to enable BMI so compilers can use tzcnt.
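For example (my sketch, not from the answer), the scalar cleanup can be written portably like this; with BMI enabled it should compile to a single tzcnt:

#include <bit>
#include <cstdint>
static inline int first_set_bit(uint32_t movemask_bits)
{
    return std::countr_zero(movemask_bits);   // 32 if no bit is set, but here a match always exists
}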
I changed the arg type to take a vector by value (in a YMM register). This means the non-inline version needs to shuffle instead of loading the 128-bit halves separately to reduce to 128. This is generally a good convention since callers might pass vectors that aren't already in memory, although you want small functions like this to inline.
Godbolt
# clang and GCC make identical asm, except clang spends a vzeroupper when not inlining.
maxpos_u8(Vec32uc):
vpcmpeqd ymm1, ymm1, ymm1 # all-ones mask, can be hoisted from loops
vpxor ymm0, ymm0, ymm1 # NOT = xor(v, -1)
vextracti128 xmm2, ymm0, 0x1 # same first shuffle step
vpminub xmm1, xmm0, xmm2
vpsrlw xmm2, xmm1, 8 # 16-bit element size shifts in zeros
vpminub xmm1, xmm1, xmm2
vphminposuw xmm1, xmm1
vpbroadcastb ymm1, xmm1 # then same cleanup
vpcmpeqb ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
tzcnt eax, eax
ret
According to https://uica.uops.info/, it's 10 uops for the front-end on Skylake (not counting the vpcmpeqd all-ones idiom that can be hoisted out of loops, and not counting the ret), vs. 14 uops for the VCL version (using tzcnt instead of branching around bsf).
It has 15 cycle latency from input to the final ymm0 result. (Plus maybe another 4 or 6 cycles for movemask + tzcnt, which is the same for both versions if we improve the VCL cleanup, or even with bsf on Intel.)
The VCL version is 16 cycle latency from ymm0 input to the vpcmpeqb result, so my version is better by 1 cycle for latency. (Probably by 2 cycles on Zen-family, where phminposuw is only 3c latency instead of 4 as on Intel.)
My version is 10 uops (not counting the vpcmpeqd, assuming it was hoisted out of a loop) vs. 14 for the VCL version (assuming it's changed to use tzcnt).
6 of the uops for the VCL version can only run on port 5 on Skylake, so a loop doing only this would bottleneck on that for throughput. On Ice Lake they're fairly evenly distributed since the shuffles can run on port 1 or 5.
Machine-code size for my version is smaller by 13 bytes (counting the vpcmpeqd set1(-1) since it's part of static code size); smaller is generally good for I-cache density. (The VCL shuffles could have avoided immediate operands for some of them, saving a couple of bytes.)
This worked out better than I thought it might. I was worried we were going to have to invert again after phminposuw to recover the actual max element, but matching against the transformed vinv32 avoids that. (If you do want the value as well, uint8_t maxval = ~_mm_cvtsi128_si32(minpos_result); is just a movd + not, and maybe a movzx if you need it zero-extended to 32-bit.) And phminposuw leaves the value at the bottom of the register with the position above that, so a broadcast to 256 worked without an extra shift or shuffle.
Also, at first I thought we might need an AND or something to zero-extend to 16-bit, or that I'd have to shift left so the value we wanted was in the most-significant half of the u16 elements (so low garbage wouldn't matter). But then we'd have needed to broadcast a byte that wasn't the lowest, which would have required a shift + vpbroadcastb, or AVX-512 vpermb.
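For the unpack-with-indices version at the top, both pieces are already sitting in the low u16 of the phminposuw result; a sketch (my naming) of pulling out value and position with only scalar ops:

#include <immintrin.h>
#include <cstdint>
// the low u16 of the minpos result from maxpos_u8_noscan_unpack is (~max_value << 8) | index
static inline void value_and_pos(__m128i minpos, uint8_t &maxval, int &index)
{
    unsigned lo = (unsigned)_mm_cvtsi128_si32(minpos);
    index  = lo & 0xff;              // position of the (first) max byte
    maxval = (uint8_t)~(lo >> 8);    // undo the bitwise NOT to recover the value
}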
VCL uses bsf!?!?
The most obvious thing to improve on is VCL's insane use of inline asm for a legacy bsf instruction (slow on AMD), instead of using GNU C __builtin_ctz on non-MSVC. That can compile to tzcnt on CPUs that have it (all CPUs with AVX2 have BMI1/2), and it's even safe on CPUs that don't have it: unlike lzcnt, the tzcnt result is compatible with bsf for all cases where __builtin_ctz has a well-defined result (non-zero input), and tzcnt = rep bsf, which runs as bsf on older CPUs. (There's some possible justification for using inline asm for bsr in bit_scan_reverse, but it's probably still better to use 63 - or 31 - __builtin_clz there.)
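For instance (my sketch, C++20), a bit_scan_reverse replacement without inline asm, assuming a non-zero input:

#include <bit>
#include <cstdint>
static inline int bsr_portable(uint32_t x)
{
    return 31 - std::countl_zero(x);   // same as std::bit_width(x) - 1 for x != 0
}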
And we know there will be a match, so horizontal_find_first's if (bits == 0) return -1; is totally useless even if we were using bsf instead of tzcnt. (tzcnt produces 32 or 64 in that case, the operand-size.)
Avoiding the broadcast: stay 256-bit the whole time?
This isn't better for u8 or u16 elements. (Also, chtz's idea of unpacking with indices made it a moot point.)
If you weren't using vphminposuw, you might avoid the vpbroadcastb by doing all your shuffles at 256-bit, so you finish with the max already broadcast, e.g. vperm2i128 (or vpermq) instead of vextracti128. But I'm not sure how you swap bytes within words or dwords without needing a control vector for vpshufb. Perhaps vpalignr ymm, same,same, imm8 with 8 / 4 / 2 / 1? It's port-5-only even on Ice Lake, and higher latency (2 cycles) on Zen 4 vs. vpshufd. The first two shuffles (by 8 and 4) can be done with vpshufd, which can run on port 1 or 5 on Ice Lake and later, for a better balance of uops across execution ports, which is generally good unless the surrounding code under-uses port 5.
But this strategy is worse for throughput on Zen 1 and Alder Lake E-cores (Gracemont), since all the 256-bit ops will be 2 uops. Those cores are still wide enough that it's OK for latency; you don't get resource conflicts.
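For completeness, a sketch of that all-256-bit reduction (my code, untested and not recommended over the versions above), using vpermq + vpshufd + vpalignr so the max is already broadcast when we compare:

#include <immintrin.h>
static inline int maxpos_u8_all256(__m256i v)
{
    __m256i m = _mm256_max_epu8(v, _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2))); // swap 128-bit halves
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(1,0,3,2)));  // swap 64-bit halves within lanes
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(2,3,0,1)));  // swap 32-bit halves
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 2));   // rotate each lane by 2 bytes: pair the u16 halves
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 1));   // rotate by 1 byte: max is now in every byte
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(m, v));
    return _tzcnt_u32(mask);                               // index of the first byte equal to the max
}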
Comments from the question thread:

- "… max call (2**3 == 8 ints == 32 byte). You compute for 8-bit ints. That needs 5 shuffles and max calls (2**5 == 32 byte). That's what you see in the assembly. I don't see any particular inefficiencies. Which part do you think is superfluous or inefficient?" – Floodlight

- "phminposuw? It only exists in a 128-bit form, but you could use it and broadcast the result to find its position in the original vector. But it's a min, so you'd need a 1:1 mapping that makes the largest unsigned element into the smallest. I think subtract-from-0 does that." – Relation

- "USHORT_MAX - x should work, right? Something like this: godbolt.org/z/ExnMc9rfc But if I understand OP correctly, he wants a library solution rather than intrinsics. I'm not aware of such a thing and SO doesn't permit that type of question anyway." – Floodlight

- "USHORT_MAX - x works, it's equivalent to x ^ USHORT_MAX too .. apparently GCC knows that" – Bordello