Efficiently find least significant set bit in a large array?
I have a huge memory block (a bit-vector) of N bits within one memory page; consider N to be around 5000 on average, i.e. 5k bits to store some flag information.
At certain points in time (super-frequent, performance-critical) I need to find the first set bit in this whole big bit-vector. Currently I do it per 64-bit word, i.e. with the help of __builtin_ctzll(). But when N grows and the search algorithm itself cannot be improved, there may still be a possibility to scale this search by widening the memory accesses. This is the main problem in a few words.
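
For reference, here is roughly the per-64-bit-word scan I use now (a minimal untested sketch with illustrative names, not my exact code):

// baseline: scan the bit-vector 64 bits at a time (GNU C builtins assumed)
#include <stddef.h>
#include <stdint.h>

static inline int first_set_bit64(const uint64_t *words, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        if (words[i])                                          // first non-zero 64-bit word
            return (int)(i * 64) + __builtin_ctzll(words[i]);  // plus bit index within it
    return -1;                                                 // no bit set anywhere
}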

There is a single assembly instruction, BSF, that gives the position of the lowest set bit (GCC's __builtin_ctzll()). So on x86-64 I can find the lowest set bit within a 64-bit word cheaply.

But what about scaling through memory width?
E.g. is there a way to do it efficiently with 128/256/512-bit registers?
Basically I'm interested in some C API function to achieve this, but also want to know what this method is based on.

UPD: As for CPUs, I'd like this optimization to support the following CPU lineups:
Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807, Cortex-A53/72)

P.S. In the mentioned algorithm, before the final bit scan I need to AND together k (on average 20-40) N-bit vectors (the AND result is just a preparatory stage for the bit scan). It would also be desirable to do this with wider memory accesses (i.e. more efficiently than ANDing per 64-bit word).

Read also: Find first set

Statecraft answered 19/5, 2021 at 14:43 Comment(44)
Registers aren't visible to other threads / CPU cores, and "atomically" reading your own private state either has no meaning or is trivial. Did you mean single asm instruction or something so some change to a register is atomic wrt. interrupts? But if you're interested in that, how does the C tag make sense?Theatrics
It's easy to find the first non-zero byte or dword element, using pcmpeqb / pmovmskb / not / bsf. Then you can vector store / byte reload and find the bit position within that. Or I guess maybe do something with movq/pextrq / 2x bsf + cmov, but that doesn't scale well to 256-bit vectors.Theatrics
@PeterCordes Yep, I meant single asm instruction and its "atomic" visibility in SMP of course. C-tag is because of GCC's wrapper __builtin_clzll - I also need some wrapper for natty C code.Statecraft
Writing registers doesn't have any SMP visibility. Registers are private to the thread/core. If you mean "efficient", say that. __builtin_clzll isn't necessarily a single instruction - without lzcnt (-march=haswell or -mbmi), it compiles to bsr reg,reg / xor reg, 63 (i.e. lzcnt(x) = 63 - bsr(x)). BSR gives you the most-significant set-bit position. But if you want first (LSB), that's trailing zeros anyway, BSF / __builtin_ctzll. Anyway, re: tags: do you actually care about C, or were you just tagging it because you wanted to mention a GCC feature?Theatrics
Also, is this specific to x86-64? Or are you looking for something portable to other ISAs?Theatrics
AVX-512 has SIMD lzcnt: felixcloutier.com/x86/vplzcntd:vplzcntq. But only in 32 or 64-bit chunks, not for the whole register. Finding the first non-zero element could be done with test-into-mask / vpcompressd, but you'd still probably need scalar to get a trailing-zero count, or to calculate where it was in the original vector without just bit-scanning a compare mask. No easy way to find the highest set-bit position in a compare mask either, except for kmov to scalar for BSR / lzcnt.Theatrics
@PeterCordes "is this specific to x86-64?" - this is mainly for x86-64, but will be good to have ARM support (optional). I mentioned C tag because I want some C API to handle this situation with 128/256-bit CPU words (as with __builtin_clzll). "Registers are private to the thread/core" - Certainly.Statecraft
@PeterCordes Yep, I need some way to find the first bit set which will be more efficient than do it with 64-bit registers with __builtin_clzll() (BSR + XOR). Suppose I have some huge bit-vector, e.g. 50k bit. And I want to find the first 1 bit (from either side) more efficiently than with __builtin_clzll().Statecraft
If you insist on it being "atomic" or a single CPU instruction, you're out of luck. (Except maybe on RISC-V128.) For a huge array, you scan for the first non-zero vector, and load the byte or dword that contains it, then bit-scan that and add the vector and element offsets. Make up your mind what you want your question to be about, and take out the word "atomic" from the title if you want the answer to be anything other than "impossible". (Or define exactly what you mean, atomic wrt. what possible observers.)Theatrics
@Statecraft bsr is not a read-modify-write operation. I'm not sure in regards to what you expect it to be atomic.Carrefour
Is your real question about searching a large array, or about searching within one XMM/YMM/ZMM register that isn't already in memory (i.e. a calculation result, not something you just loaded and could also efficiently load as scalar)? For a large array, most of the work is just checking a whole vector for any non-zero, only caring about where in the register after you find one. Your question still seems like a possible X-Y problem. Also, if you do mean just within one register, there's likely a tradeoff between throughput and latency.Theatrics
Also, please clarify whether you really mean find highest bit set (BSR) / number of high zeros (clz), or actually want first (which for most people means lowest bit set).Theatrics
@PeterCordes I have a huge memory block (bit-vector) within one memory page with size N, consider N is 5000, i.e. 5k bits to store some flags information. At a certain points in time (very often) I need to find the first bit set in this big bit-vector. Now I do it per-64-word, i.e. with help of ` __builtin_clzll()`. But when N grows and search algorithm cannot be improved, there can be some possibility to scale this search through the expansion of memory access width. This is the main problem in a few wordsStatecraft
You still haven't said what you mean by "first", and why clz makes sense for that. clz would help you find the last, unless you have a big-endian machine. (But you don't since you're talking about x86 stuff.)Theatrics
@PeterCordes It doesn't matter ;) I can dispose these bits inside as I like. And search from any suitable sideStatecraft
I had the same problem. As per my tests, searching for non-zero using __int128 (as it is translated to a load into a register plus an OR with another memory location) was the fastest I could achieve.Lunular
@Statecraft What instruction set extensions can you use? The approach given by 0___________ is probably a good compromise between speed and portability.Carrefour
@Carrefour "What instruction set extensions can you use?" Please clarify this question a bit with an example. How can I recognize which extensions I can use?Statecraft
@Statecraft It depends on what sort of processors your code is supposed to run on. The more advanced instructions can be used, the faster your code could be.Carrefour
@Carrefour The code is mostly for x86-64, but it would be also good to have such an optimization for ARMStatecraft
@Statecraft x86 is an architecture that spans many decades. If even the oldest x86 chips need to be supported, I cannot use modern instructions that might speed up the code in my answer. So I need to know what instruction set extensions (or if you don't know that, what generation or model processor) you target.Carrefour
As for ARM, I would need to know if you want 32 bit ARM or 64 bit ARM. And perhaps what generation ARM processor you are programming for (armv6? armv7? armv8? With SVE?)Carrefour
@Carrefour As for x86-64: Intel Xeon E3-12XX, Intel Xeon E5-22XX/26XX/E56XX, Intel Core i3-5XX/4XXX/8XXX, Intel Core i5-7XX, Intel Celeron G18XX/G49XX (optional for Intel Atom N2600, Intel Celeron N2807)Statecraft
@Carrefour As for ARM: Cortex-A53/72 pi-4Statecraft
@Statecraft Okay, for ARM that would be ARMv8-A, no extensions. For x86, the optional parts go up to SSSE3 (so no AVX and no SSE4.1, both of which would help). As for the parts in your first list, the intel Core i3-5XX and i5-7XX parts are the oldest and only go up to SSE4.2. Okay. I can work with that.Carrefour
@Statecraft Are you running the Pi with a 32 bit or a 64 bit operating system (same for x86).Carrefour
@Carrefour It's always 64-bit OS.Statecraft
Since you mention 256 and 512-bit vectors, are you wanting to do runtime CPU detection to select (via function pointer) a version optimal for the current CPU? Or do you only want a version that works on your baseline lowest-common-denominator CPU (the Celerons, and Nehalem, both lacking any AVX or BMI1, although since BSF and TZCNT are compatible for non-zero inputs, that's fine. But BSR and LZCNT aren't, so it's even more important that you end this confusion about highest within an element vs. lowest, if you care about tuning to run optimally on AMD as well; at least state that lowest is ok)Theatrics
Do you have any feeling for how often a bit will be set? Using some AVX it would be fast to test that the entire lot of bits is all zeros, or that one of them isn't zero (just loop over the whole lot, comparing to a zero vector, ORing the result into a results vector, and testing the results vector for zero after the end of the loop). The cost would be that if one wasn't zero, it'd mean going back and doing the search properly. This might not be too bad if a non-zero bit is a rare thing.Lactoscope
If all you need is the first set bit, a separate variable to track that bit's location could be maintained. Adds an extra if on set, but should help with scale.Lamrert
If you can waste core time, you could also do a divide-and-conquer approach (use your algorithm on an equally divided buffer per core) with multiple threads - though it might be too small a chunk to divide up enough to make up for thread overhead.Lamrert
@PeterCordes I'm not interested in AMD.. "are you wanting to do runtime CPU detection to select" - it will be sufficient to detect the particular API at compile time (based on platform #ifdefs in my source code). "confusion about highest within an element vs. lowest" - I need to just find the first 1 in this big bit-vector from any side.. If I don't answer your question, please clarify a bit what you meanStatecraft
For most people, the first set bit is the lowest one. Like you'd find if you were using bt [bitvec], eax incrementing EAX from 0 until you find one. Since that's the easiest / most-efficient one to find with the method I suggested earlier, and one which is consistent regardless of chunk size you use for vector search vs. bit-scan, that's what I'd recommend. Instead of this weird suggestion in the question that you want to find the highest set bit (clz/ffs()) within the lowest non-zero chunk. Or were you considering looping backwards for that?Theatrics
It's not like this "first" vs. "clz" stuff makes it unanswerable, it's just weird and complicates the question for future readers.Theatrics
@Lactoscope You're right. The set bits most likely will be sparse. So we can loop over the whole N-bit vector by W (memory width, currently 64) and compare every 64-bit element with 0; if it's not zero, we find the first set bit in that 64-bit word, count the number of zero words already seen, and combine them to get the index of the first set bit in the whole vector.Statecraft
@Lactoscope It is about algorithm details, but right now it's more important to do this scan faster / more efficiently than per 64-bit blocks. Obviously, if there is an opportunity to do it per 512-bit blocks (at the CPU level) it will be faster, with or without the optimization you mentioned. I.e. memory-width scaling is also the case.Statecraft
@PeterCordes I think any option you suggest will suit me. Whether it's "first" or "clz". I have a big bit-vector with presumably a lot of zero bits and a few set bits. I want to find the number(index) of first non-zero bit in this whole bit-vector. Does it matter which exactly operation to use here?Statecraft
Yes, it matters, because clz will give the wrong answer, not the first if there are multiple bits set in the uint64_t you run it on. You want ctz (BSF) for a forward search. The R and F in the BSR / BSF names match the search direction.Theatrics
@PeterCordes Yep, seems that I want ctz. Edited question.Statecraft
Aha! How sparse - 1 in tens of thousands/millions? The reason I suggested looping, comparing, ORing, and testing the result outside of the loop is that it avoids a branch inside the loop. That way you can get the most out of the available memory bandwidth - the data just streams on through from beginning to end, no interruptions. Branches are the enemy of sustained data flow from memory, so omitting them, or at least having very few very short ones, is worthwhile. This gives the cache infrastructure a chance of keeping the memory bus saturated, beyond which no improvement is possible.Lactoscope
@bazza: Yup, you may want to unroll enough to hide vptest / jcc overhead enough to do 2 vector loads per clock from L1d cache until you reach a set of vectors containing a set bit. Or just unroll some, like enough to at least keep up with one 32-byte load per clock. Untested first draft of an AVX2 answer: godbolt.org/z/1aW68hf4e (untested). (Future answers feel free to borrow it if you're willing to write up a full answer describing why it's good, and/or if you find some improvements. Otherwise I'll probably get around to posting my own answer sometime.)Theatrics
Note that your on-the-fly AND computation is a further refinement you could maybe ask as a separate question. Do you need to save the AND result anywhere? Or only as temporaries while finding the first 1 in their intersection?Theatrics
@PeterCordes The AND result is just a preparatory stage for the bit-scan. So I don't need to deliberately save the AND result anywhere.Statecraft
@PeterCordes Don't you want to answer my question? :) Now the situation is that the answer with the most votes does not answer my question at all..Statecraft
T
5

The best way to find the first set bit within a whole vector (AFAIK) involves finding the first non-zero SIMD element (e.g. a byte or dword), then using a bit-scan on that (__builtin_ctz / bsf / tzcnt / ffs-1). As such, ctz(vector) is not itself a useful building block for searching an array, only for after the loop.

Instead you want to loop over the array searching for a non-zero vector, using a whole-vector check involving SSE4.1 ptest xmm0,xmm0 / jz .loop (3 uops), or with SSE2 pcmpeqd v, zero / pmovmskb / cmp eax, 0xffff / je .loop (3 uops after cmp/jcc macro-fusion). https://uops.info/

Once you do find a non-zero vector, pcmpeqd / movmskps / bsf on that to find a dword index, then load that dword and bsf it. Add the start-bit position (CHAR_BIT*4*dword_idx) to the bsf bit-position within that element. This is a fairly long dependency chain for latency, including an integer L1d load latency. But since you just loaded the vector, at least you can be fairly confident you'll hit in cache when you load it again with integer. (If the vector was generated on the fly, then it's probably still best to store / reload it and let store-forwarding work, instead of trying to generate a shuffle control for vpermilps/movd or SSSE3 pshufb/movd/movzx ecx, al.)

The loop problem is very much like strlen or memchr, except we're rejecting a single value (0) and looking for anything else. Still, we can take inspiration from hand-optimized asm strlen / memchr implementations like glibc's, for example loading multiple vectors and doing one check to see if any of them have what they're looking for. (For strlen, combine with pminub to get a 0 if any element is 0; for memchr, OR together the pcmpeqb compare results.) For our purposes, the reduction operation we want is OR - any non-zero input will make the output non-zero, and bitwise boolean ops can run on any vector ALU port.

(If the expected first-bit-position isn't very high, it's not worth being too aggressive with this: if the first set bit is in the first vector, sorting things out between 2 vectors you've loaded will be slower. 5000 bits is only 625 bytes, or 19.5 AVX2 __m256i vectors. And the first set bit is probably not always right at the end)

AVX2 version:

This checks pairs of 32-byte vectors (i.e. whole cache lines) for non-zero, and if found then sorts that out into one 64-bit bitmap for a single CTZ operation. That extra shift/OR costs latency in the critical path, but the hope is that we get to the first 1 bit sooner.

Combining 2 vectors down to one with OR means it's not super useful to know which element of the OR result was non-zero. We basically redo the work inside the if. That's the price we pay for keeping the amount of uops low for the actual search part.

(The if body ends with a return, so in the asm it's actually like an if()break, or actually an if()goto out of the loop, since it goes to a different place than the not-found return -1 from falling through out of the loop.)

// untested, especially the pointer end condition, but compiles to asm that looks good
// Assumes len is a multiple of 64 bytes

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// aliasing-safe: p can point to any C data type
int bitscan_avx2(const char *p, size_t len /* in bytes */)
{
    //assert(len % 64 == 0);
    //optimal if p is 64-byte aligned, so we're checking single cache-lines
    const char *p_init = p;
    const char *endp = p + len - 64;
    do {
        __m256i v1 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v2 = _mm256_loadu_si256((const __m256i*)(p+32));
        __m256i or = _mm256_or_si256(v1,v2);
        if (!_mm256_testz_si256(or, or)){        // find the first non-zero cache line
            __m256i v1z = _mm256_cmpeq_epi32(v1, _mm256_setzero_si256());
            __m256i v2z = _mm256_cmpeq_epi32(v2, _mm256_setzero_si256());
            uint32_t zero_map = _mm256_movemask_ps(_mm256_castsi256_ps(v1z));
            zero_map |= _mm256_movemask_ps(_mm256_castsi256_ps(v2z)) << 8;

            unsigned idx = __builtin_ctz(~zero_map);  // Use ctzll for GCC, because GCC is dumb and won't optimize away a movsx
            uint32_t nonzero_chunk;
            memcpy(&nonzero_chunk, p+4*idx, sizeof(nonzero_chunk));  // aliasing / alignment-safe load

            return (p-p_init + 4*idx)*8 + __builtin_ctz(nonzero_chunk);
        }
        p += 64;
    }while(p <= endp);     // endp points at the start of the last 64-byte block
    return -1;
}

On Godbolt with clang 12 -O3 -march=haswell:

bitscan_avx2:
        lea     rax, [rdi + rsi]
        add     rax, -64                 # endp
        xor     ecx, ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm1, ymmword ptr [rdi]  # do {
        vmovdqu ymm0, ymmword ptr [rdi + 32]
        vpor    ymm2, ymm0, ymm1
        vptest  ymm2, ymm2
        jne     .LBB0_2                       # if() goto out of the inner loop
        add     ecx, 512                      # bit-counter incremented in the loop, for (p-p_init) * 8
        add     rdi, 64
        cmp     rdi, rax
        jbe     .LBB0_1                  # }while(p<=endp)

        mov     eax, -1               # not-found return path
        vzeroupper
        ret

.LBB0_2:
        vpxor   xmm2, xmm2, xmm2
        vpcmpeqd        ymm1, ymm1, ymm2
        vmovmskps       eax, ymm1
        vpcmpeqd        ymm0, ymm0, ymm2
        vmovmskps       edx, ymm0
        shl     edx, 8
        or      edx, eax             # mov ah,dl  would be interesting, but compilers won't do it.
        not     edx                  # one_positions = ~zero_positions
        xor     eax, eax                # break false dependency
        tzcnt   eax, edx             # dword_idx
        xor     edx, edx
        tzcnt   edx, dword ptr [rdi + 4*rax]   # p[dword_idx]
        shl     eax, 5               # dword_idx * 4 * CHAR_BIT
        add     eax, edx
        add     eax, ecx
        vzeroupper
        ret

This is probably not optimal for all CPUs, e.g. maybe we could use a memory-source vpcmpeqd for at least one of the inputs, and not cost any extra front-end uops, only back-end. As long as compilers keep using pointer-increments, not indexed addressing modes that would un-laminate. That would reduce the amount of work needed after the branch (which probably mispredicts).

To still use vptest, you might have to take advantage of the CF result from the CF = (~dst & src == 0) operation against a vector of all-ones, so we could check that all elements matched (i.e. the input was all zeros). Unfortunately, Can PTEST be used to test if two registers are both zero or some other condition? - no, I don't think we can usefully use vptest without a vpor.

Clang decided not to actually subtract pointers after the loop, instead to do more work in the search loop. :/ The loop is 9 uops (after macro-fusion of cmp/jbe), so unfortunately it can only run a bit less than 1 iteration per 2 cycles. So it's only managing less than half of L1d cache bandwidth.

But apparently a single array isn't your real problem.

Without AVX

16-byte vectors mean we don't have to deal with the "in-lane" behaviour of AVX2 shuffles. So instead of OR, we can combine with packssdw or packsswb. Any set bits in the high half of a pack input will signed-saturate the result to the max/min value (0x8000/0x7fff for packssdw, 0x80/0x7f for packsswb), so a non-zero input never packs to zero. (Signed saturation is the key, not unsigned packuswb which would saturate signed-negative inputs to 0.)

However, shuffles only run on port 5 on Intel CPUs, so beware of throughput limits. ptest on Skylake for example is 2 uops, p5 and p0, so using packsswb + ptest + jz would limit to one iteration per 2 clocks. But pcmpeqd + pmovmskb don't.

Unfortunately, using pcmpeq on each input separately before packing / combining would cost more uops. But would reduce the amount of work left for the cleanup, and if the loop-exit usually involves a branch mispredict, that might reduce overall latency.

2x pcmpeqd => packssdw => pmovmskb => not => bsf would give you a number you have to multiply by 2 to use as a byte offset to get to the non-zero dword. e.g. memcpy(&tmp_u32, p + (2*idx), sizeof(tmp_u32));. i.e. bsf eax, [rdi + rdx*2].
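
For concreteness, a minimal untested sketch of that 2x pcmpeqd => packssdw => pmovmskb route in intrinsics, reusing the mask both as the loop-exit condition and for the index afterwards (SSE2 only; assumes len is a non-zero multiple of 32 bytes, and bitscan_sse2 is just an illustrative name):

// untested SSE2 sketch: returns bit index of the first set bit, or -1 if none
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

int bitscan_sse2(const char *p, size_t len)
{
    const char *p_init = p;
    const char *endp = p + len;
    do {
        __m128i v1 = _mm_loadu_si128((const __m128i*)p);
        __m128i v2 = _mm_loadu_si128((const __m128i*)(p+16));
        __m128i z  = _mm_setzero_si128();
        __m128i v1z = _mm_cmpeq_epi32(v1, z);              // all-ones for zero dwords
        __m128i v2z = _mm_cmpeq_epi32(v2, z);
        __m128i packed = _mm_packs_epi32(v1z, v2z);        // packssdw: 2 mask bytes per source dword
        unsigned zero_map = (unsigned)_mm_movemask_epi8(packed);
        if (zero_map != 0xFFFF) {                          // some dword was non-zero
            unsigned idx = __builtin_ctz(~zero_map);       // = 2 * dword index within this 32-byte block
            uint32_t chunk;
            memcpy(&chunk, p + 2*idx, sizeof(chunk));      // 2*idx = byte offset of that dword
            return (int)((p - p_init + 2*idx) * 8) + __builtin_ctz(chunk);
        }
        p += 32;
    } while (p < endp);
    return -1;
}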

With AVX-512:

You mentioned 512-bit vectors, but none of the CPUs you listed support AVX-512. Even if they did, you might want to avoid 512-bit vectors because of SIMD instructions lowering CPU frequency, unless your program spends a lot of time doing this and your data is hot in L1d cache, so you can truly benefit instead of still bottlenecking on L2 cache bandwidth. But even with 256-bit vectors, AVX-512 has new instructions that are useful for this:

  • integer compares (vpcmpb/w/d/q) have a choice of predicate, so you can do not-equal instead of having to invert later with NOT. Or even test-into-register vptestmd so you don't need a zeroed vector to compare against.

  • compare-into-mask is sort of like pcmpeq + movmsk, except the result is in a k register, still need a kmovq rax, k0 before you can tzcnt.

  • kortest - set FLAGS according to the OR of two mask registers being non-zero. So the search loop could do vpcmpd k0, ymm0, [rdi] / vpcmpd k1, ymm0, [rdi+32] / kortestw k0, k1 (see the intrinsics sketch at the end of this section)

  • vplzcntd (or q) - Combined with SIMD isolate_lowest = v & -v, this can find the position of the lowest set bit (in each SIMD element). bit_index = 31-lzcnt = 31^lzcnt for non-zero inputs.

  • vpcompressq/d - 2 uops on Intel and Zen 4 for the reg-reg version (https://uops.info). Followed by vmovd eax, xmm0, this can extract the lowest non-zero element (given a compare mask) with probably lower latency than scalar tzcnt on the mask to index another load.

    But you do still need that scalar tzcnt to find out what to add to the bit-within-dword index, so this costs extra uops only to shorten critical-path latency. e.g.

// untested and worse for throughput, probably better for latency.
// Just writing it out to see what it looks like

// after finding a v with a non-zero bit somewhere:
  __mmask8 nzmask = _mm256_test_epi32_mask(v,v);  // true for non-zero elements
  __m256i bit_in_dword_lzcnt = _mm256_lzcnt_epi32(v & -v);  // lzcnt of the lowest set bit
  __m256i tmp = _mm256_maskz_compress_epi32(nzmask, bit_in_dword_lzcnt);  // low element has the lzcnt we want

  unsigned bit_idx = _tzcnt_u32(nzmask)*32;
  bit_idx += 31^_mm_cvtsi128_si32(_mm256_castsi256_si128(tmp)); // vmovd + xor to do 31-lzcnt more cheaply.

According to uops.info, vpcompressd latency on Intel is 6 cycles from mask to output, but only 3 cycles from vector input to vector output. So the first uop is just pre-processing the mask into a vpermd shuffle-control I guess.

On Zen 4, it's 4 cycles from vector input to output, 8 cycles from mask to output, for 256-bit vector width. For 512-bit, 8:9.

The vector input comes from vplzcntd(v & -v) which will take longer than just vptestmd(v) to get the mask, so that works out well.
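
Putting the search-loop half together, a minimal untested sketch of the vptestmd + kortest idea from the list above (AVX-512F+VL, 256-bit vectors; assumes len is a multiple of 64 bytes, and find_nonzero_block_avx512 is just an illustrative name):

// untested sketch: return a pointer to the first 64-byte block containing a set bit, or NULL
#include <immintrin.h>
#include <stddef.h>

const char *find_nonzero_block_avx512(const char *p, size_t len)
{
    const char *endp = p + len;
    for (; p < endp; p += 64) {
        __m256i v0 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v1 = _mm256_loadu_si256((const __m256i*)(p + 32));
        __mmask8 k0 = _mm256_test_epi32_mask(v0, v0);   // vptestmd: 1 for non-zero dwords
        __mmask8 k1 = _mm256_test_epi32_mask(v1, v1);
        if (k0 | k1)                                    // compilers may use kortest for this
            return p;                                   // cleanup (e.g. the vpcompressd code above) goes here
    }
    return NULL;
}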


ANDing multiple input arrays

You mention your real problem is that you have up-to-20 arrays of bits, and you want to intersect them with AND and find the first set bit in the intersection.

You may want to do this in blocks of a few vectors, optimistically hoping that there will be a set bit somewhere early.

AND groups of 4 or 8 inputs, accumulating across results with OR so you can tell if there were any 1s in this block of maybe 4 vectors from each input. (If there weren't any 1 bits, do another block of 4 vectors, 64 or 128 bytes, while you still have the pointers loaded: the AND of this group is already all-zero here, so the full intersection of this block would definitely be empty no matter what the other inputs contain, and bringing them in now would be wasted work.) Tuning these chunk sizes depends on how sparse your 1s are, e.g. maybe always work in chunks of 6 or 8 vectors. Power-of-2 numbers are nice, though, because you can pad your allocations out to a multiple of 64 or 128 bytes so you don't have to worry about stopping early.

(For odd numbers of inputs, maybe pass the same pointer twice to a function expecting 4 inputs, instead of dispatching to special versions of the loop for every possible number.)

L1d cache is 8-way associative (12-way from Ice Lake on), and a limited number of integer/pointer registers can make it a bad idea to try to read too many streams at once. You probably don't want a level of indirection that makes the compiler loop over an actual array of pointers in memory, either.
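
As a concrete illustration of that block-wise strategy, an untested sketch for one group of exactly 4 inputs (AVX2, GNU C; intersect_first_set4 and its shape are illustrative, not a drop-in API). It ANDs the 4 inputs in 64-byte blocks, ORs the two halves to test for any surviving 1, and only then drops to a simple scalar scan; the orchestration across more than 4 inputs (and the "do another block before bringing in the other inputs" refinement) is left out:

// untested: bit index of the first 1 in (a & b & c & d), or -1; len must be a multiple of 64 bytes
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int intersect_first_set4(const char *a, const char *b, const char *c, const char *d, size_t len)
{
    for (size_t off = 0; off < len; off += 64) {
        __m256i and_lo = _mm256_and_si256(
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a+off)),
                             _mm256_loadu_si256((const __m256i*)(b+off))),
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(c+off)),
                             _mm256_loadu_si256((const __m256i*)(d+off))));
        __m256i and_hi = _mm256_and_si256(
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a+off+32)),
                             _mm256_loadu_si256((const __m256i*)(b+off+32))),
            _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(c+off+32)),
                             _mm256_loadu_si256((const __m256i*)(d+off+32))));
        __m256i any = _mm256_or_si256(and_lo, and_hi);
        if (!_mm256_testz_si256(any, any)) {
            // some bit survived the AND in this 64-byte block: store the AND results and
            // finish with a simple per-64-bit scan (we're off the hot loop by now)
            uint64_t buf[8];
            _mm256_storeu_si256((__m256i*)buf,       and_lo);
            _mm256_storeu_si256((__m256i*)(buf + 4), and_hi);
            for (int i = 0; i < 8; i++)
                if (buf[i])
                    return (int)(off*8 + (size_t)i*64) + __builtin_ctzll(buf[i]);
        }
    }
    return -1;   // the intersection of these 4 inputs is empty
}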

Theatrics answered 31/5, 2021 at 20:12 Comment(19)
with AVX512 you might try loop that does 4x loads with 1x vpcmp into mask + 1x maskz vpternlogd. Or maybe just a 3x loop with vpternlogd. Also why vptest > vpmovmskb + test? If you vpmovmskb in the loop you can reuse the result for second vec in return + saves a uop in the loop.Feleciafeledy
Regarding the order of branches in the loop: I think it makes more sense to 1) remove the counter from the loop, and 2) order it so that the fall-through is when a non-zero vector is found. Rationale is that the branch on the loop bounds will speculate ahead easily, so if you break on no bit found you will have already loaded / begun speculatively executing that code and there will be no real cost. The non-zero check though will likely require a resteer, so I think it makes sense to make the resteer go to the next PC, which IIRC saves 2 cycles.Feleciafeledy
@Noah: If you were only doing a single vector, then yes you could reuse the vpmovmskb result. But if you're ORing together multiple vectors (or even ANDing v==0 compare results), the bitmap of which element was non-zero isn't as useful. You'd need to tzcnt both possibilities (idx and idx+32) and test/cmov based on one of them.Theatrics
I think you can reuse vpmovmskb. If first vec is non-zero then result of tzcnt(first_vec_mask | (second_vec_mask << 8)) will ignore the second vec. If the first vec is zero then we essentially have tzcnt(second_vec_mask << 8) which is what we want. Talking about something similar to this in strlenFeleciafeledy
@Noah: are you suggesting doing two vpmovmskb in the search loop, a separate one for each vector, and using or (or an add since the masks are narrow enough) as the loop branch or break condition? That would cost more uops in the loop. If you don't do that, then you only have movemask( (v1 | v2) == 0), not two separate vec_mask results.Theatrics
Oh I am mistaken! For some reason thought there was a vpcmpeq after the vpor in the loop! What your doing makes sense!Feleciafeledy
@Noah: Oh, I hadn't understood the code you linked first, the comment made sense: rcx has combined result from all 4 VEC. It will only be used if ... - that made sense where "first" and "second" made me think you were talking about non-combined results. Yes, vpcmpeq / vmovmskps / cmp $0xff, %eax+je is 3 uops, same as vptest / jz. I only used VPTEST to save code-size, not uops. (Although without AVX, I think ptest also avoids a movdqa or pxor-zeroing). So yes we could reuse a vmovmskps result from the loop. Hmm, too bad xor/jcc can't macro-fuse; could avoid a NOT?Theatrics
How about subl $0xff, %eax to save a not? Previous implementation of avx2 memcpy did thatFeleciafeledy
@Noah: That doesn't preserve the position of the bits. Remember we need to bit-scan that, not just have it be 0 / non-zero. We're already using cmp all-ones, reg instead of test reg,reg to avoid needing a NOT before that compare/branch.Theatrics
Maybe I'm missing something. But if you do vpcmpeq; vpmovmskps, all 1s means continue looping. Not all 1s means some word wasn't zero? But if you swap vpmovmskps for vpmovmskb you can use incl eax; jz L(continue_loop), then just change the scale in lea from 4 to 1 (at the end, because with vpmovmskb you have the exact byte of difference).Feleciafeledy
@Noah: Yes, that's the idea. Interesting, yes with a wide enough bitmap to fill the operand-size, +1 clears all the bits if they were all-ones. (Similarly, inc %al would work with an 8-bit bitmap, if you can get a compiler to do that.) It's a bit of a pain in C to get the compiler to use a byte offset for a dword load (arr[idx/4] will AND to zero the low bits), but I guess I'm already using char* address math (with memcpy for aliasing-safety) so I could just do that. But can you usefully bit-scan the inc result? I think so: the lowest 0 becomes the lowest 1!Theatrics
@Noah: If you want to post an answer with your asm ideas, I'd give it an upvote. If you can get a compiler to do them, so much the better, though. (Also, I haven't included a full SSE4 version; I might edit to do that to try out my packssdw instead of por idea. Or you could try that.)Theatrics
Re: "But can you usefully bit-scan the inc result? I think so: the lowest 0 becomes the lowest 1!" I think so as well (and if you can't we are all going to run into a major issue because thats used in some of the library functions in glibc). I think incl %al would cause a partial register stall when you later try and bitscan eax and AFAIK there is no byte sized bitscan operation.Feleciafeledy
Maybe once I get off work. I generally think I'm a bit too mistake-prone to post "answers" though (i.e. see early comments!).Feleciafeledy
@Noah: Intel CPUs with AVX2 (Haswell and later) don't rename AL separately from RAX. Still, inc %eax is probably better, I was just exploring the general idea. As long as you use pcmpeqd, you can still index a whole dword instead of needing a movzx byte load, even if you have a vpmovmskb result. (Also, movmskps after integer pcmpeq may cost a cycle of bypass latency; I should check on that with a round trip loop.)Theatrics
Does incb al; tzcntl eax, eax still have a merging uop or is it totally "free"?Feleciafeledy
@Noah: It's totally free: it's an RMW on the full RAX that just happens to stop propagating carry after the low byte. (Even on Sandybridge, being an RMW instead of mov $1, %al means it won't rename AL separately from RAX). Of course, it's bad on P6-family including Nehalem.Theatrics
Oh wow, we can drop an extra xorl from memcmp. (We would have to reorganize the register usage a bit though.) By the way, on my Tiger Lake I am still seeing LSD uops with writes to ah. Bummer, otherwise that would have been a very lightweight way to disable it for benchmarking.Feleciafeledy
@Noah: Yes, for AVX2 versions if the high bytes of RDX are guaranteed to be zero at that point, without a false dependency on anything that this dep chain didn't already include.Theatrics
E
9

This answer is in a different vein, but if you know in advance that you're going to be maintaining a collection of B bits and need to be able to efficiently set and clear bits while also figuring out which bit is the first bit set, you may want to use a data structure like a van Emde Boas tree or a y-fast trie. These data structures are designed to store integers in a small range, so instead of setting or clearing individual bits, you could add or remove the index of the bit you want to set/clear. They're quite fast - you can add or remove items in time O(log log B), and they let you find the smallest item in time O(1). Figure that if B ≈ 50000, then log log B is about 4.
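
For bit-vectors of up to a few thousand bits, a flattened two-level version of the same idea already gets you most of the benefit: a 64-bit summary word whose bit k says whether leaf word k is non-zero, giving O(1) set/clear and a find-first that is just two ctz operations. A simplified sketch in that spirit (not a full van Emde Boas tree or y-fast trie; names and layout are illustrative):

// two-level bitmap for up to 64*64 = 4096 bits (use a wider summary or a third level for more);
// invariant: summary bit k == 1  <=>  leaf[k] != 0
#include <stdint.h>

typedef struct {
    uint64_t summary;
    uint64_t leaf[64];
} bitset2_t;

static inline void bs2_set(bitset2_t *s, unsigned i) {
    s->leaf[i / 64] |= 1ULL << (i % 64);
    s->summary     |= 1ULL << (i / 64);
}

static inline void bs2_clear(bitset2_t *s, unsigned i) {
    s->leaf[i / 64] &= ~(1ULL << (i % 64));
    if (s->leaf[i / 64] == 0)
        s->summary &= ~(1ULL << (i / 64));
}

// index of the lowest set bit, or -1 if empty: two ctz's, no loop over the array
static inline int bs2_first(const bitset2_t *s) {
    if (!s->summary)
        return -1;
    unsigned w = (unsigned)__builtin_ctzll(s->summary);
    return (int)(w * 64 + (unsigned)__builtin_ctzll(s->leaf[w]));
}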

I'm aware this doesn't directly address how to find the first set bit in a huge bitvector. If your setup is such that you have to work with bitvectors, the other answers might be more helpful. But if you have the option to reframe the problem in a way that doesn't involve bitvector searching, these other data structures might be a better fit.

Effusive answered 25/5, 2021 at 20:48 Comment(1)
Thanks for the information! The bottleneck of the algorithm is precisely the final search (bit-scan), because it is linear. Setting and clearing bits is one level lower in other data structures, and it is not super-frequent..Statecraft
E
1

You may try this function; your compiler should optimize it for your CPU. It's not perfect, but it should be relatively quick and mostly portable.

P.S. length should be divisible by 8 for max speed.

#include <stdio.h>
#include <stdint.h>

/* Returns the index of the first set bit, counting from index 0 at the MSB of bf[0]          */
/* (i.e. bits are numbered MSB-first within each 64-bit word, words in increasing order).     */
/* Return value is between 0 and 64 times length; exactly 64 times length means bf is all 0.  */
uint32_t offset_fsb(const uint64_t *bf, const register uint16_t length){
    register uint16_t i = 0;
    uint16_t remainder = length % 8;

    switch(remainder){
        case 0 : /* 512bit compare */
            while(i < length){
                if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3] | bf[i+4] | bf[i+5] | bf[i+6] | bf[i+7]) break;
                i += 8;
            }
            /* fall through */

        case 4 : /* 256bit compare */
            while(i < length){
                if(bf[i] | bf[i+1] | bf[i+2] | bf[i+3]) break;
                i += 4;
            }
            /* fall through */

        case 6 : /* 128bit compare */    
            /* fall through */
        case 2 : /* 128bit compare */
            while(i < length){
                if(bf[i] | bf[i+1]) break;
                i += 2;
            }
            /* fall through */

        default : /* 64bit compare */
            while(i < length){
                if(bf[i]) break;
                i++;
            }
    }

    register uint32_t offset_fsb = i * 64;

    /* If a non-zero word was found, locate its most significant set bit        */
    /* (which is the first set bit in the MSB-first numbering described above). */
    if(i < length && bf[i]){
        register uint64_t s = bf[i];
        offset_fsb += 63;
        while(s >>= 1) offset_fsb--;
    }

    return offset_fsb;
}

int main(int argc, char *argv[]){
    uint64_t test[16];
    test[0] = 0;
    test[1] = 0;
    test[2] = 0;
    test[3] = 0;
    test[4] = 0;
    test[5] = 0;
    test[6] = 0;
    test[7] = 0;
    test[8] = 0;
    test[9] = 0;
    test[10] = 0;
    test[11] = 0;
    test[12] = 0;
    test[13] = 0;
    test[14] = 0;
    test[15] = 1;

    printf("offset_fsb = %u\n", offset_fsb(test, 16));

    return 0;
}
Egest answered 25/5, 2021 at 20:35 Comment(14)
register what do you think this keyword does?Lunular
@Lunular suggesting the compiler to put that variable in a CPU register.Egest
Compilers already do that when you enable optimization. All register does is stop you from taking the address of the variable. (It does make a different in debug builds, but debug-build performance is mostly irrelevant. this Q&A has an example of the effect of register in un-optimized builds). There's also no reason to use uint16_t here when you just want a normal local variable; use int or unsigned.Theatrics
godbolt.org/z/z9anMdPhW shows GCC and clang -O3 -march=skylake for x86-64. GCC doesn't manage to use any SIMD, just scalar or. Clang does OR a pair of 256-bit YMM vectors together for the test, and does manage to turn the pure C bit-loop into lzcnt. But your unfortunate choice of uint16_t i is making it redo zero-extension inside the vector loop, to implement the wrapping semantics of your C. e.g. movzx eax, cx. Use size_t like a normal person, or even int would be better.Theatrics
@PeterCordes The reason for this is mostly that SIMD is slow for this small data size of just 16x 64 bit, so the compiler prefers "old" techniques, which are faster. I thought using uint16_t would allow a single-cycle register compare. I had the pipelining nature of the CPU in mind, which would do the load balancing anyway. Am I wrong?Egest
The CPUs the OP cares about are x86-64, so uint16_t is 1/4 of a full register, and more importantly half of the natural 32-bit operand-size. Use unsigned or int for cases like this: it will be a size that the machine handles efficiently. (Except on 8-bit machines of course; int / unsigned int are at least 16-bit). Anyway, on x86-64, even cmp rax, rcx / jb (64-bit operand-size) can macro-fuse into a single compare-and-branch uop. See agner.org/optimize for x86 microarchitecture details. (and stackoverflow.com/tags/x86/info)Theatrics
Also, SIMD is not slow for this on x86, where the SIMD execution units are always tightly coupled to the integer core. There's some latency to get a result from the SIMD domain to integer, but usually 3 cycles or less. And if you're branching on it, branch prediction + speculative execution can hide that latency. It's not like some ARM microarchitectures that have to stall on instructions that transfer data from vector to scalar registers.Theatrics
x86 even has a pmovmskb eax, xmm0 instruction that turns a vector compare result into an integer bitmap. Hand-written asm implementation of memchr use it. (glibc wmemchr (code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/…) is the closest standard function to the problem you're solving here of finding a uint64_t element before looking at its bits. wchar_t is 32-bit, but not significantly different from searching for a uint64_t.).Theatrics
@PeterCordes Okay, thank you for this info. I once read that SIMD may slow down a multi-core CPU by throttling down the frequency because of thermal problems.Egest
Yes, that's true to some degree, especially AVX-512. SIMD instructions lowering CPU frequency. But 128-bit instructions are used heavily by compilers any time they want to copy 16 bytes for example, and glibc functions like strlen and memchr use 256-bit vectors on machines where it's available. But not 512-bit instructions, because the turbo penalty there is actually significant, and it's less likely that the rest of the program is also using 512-bit vectors. So anyway, there's zero downside to using 128-bit SSE2 instructions here.Theatrics
I'd probably just require that the array size is a multiple of 32 or 64 bytes, so unrolling and/or wider vectors are safe.Theatrics
BTW, with some tweaks like size_t instead of uint16_t, and #ifdef __GNUC__ / __builtin_ctzll instead of always using that loop, this would be a decent portable fallback implementation. Although the dispatch to unrolled search is somewhat naive, never using wide checks for an array of 99 uint64_t, for example. So it's only good if you can control the size to make sure it is a multiple of 8.Theatrics
Also, when the 8-at-a-time search finds one, it probably makes sense to go straight to the 1-at-a-time loop for up to 8 iterations, instead of redoing the work in 256 and 128-bit. Especially if it's common for the first bit to be in the first couple uint64_t. Probably best to benchmark it with random bit-placements to see which is less bad for branch prediction, if you can't get the compiler to use a pmovmskb type of operation.Theatrics
BTW, an AVX2 version might look like this, with len assumed to be a multiple of 64 bytes (8x 64-bit): godbolt.org/z/1aW68hf4e (untested). (Future answers feel free to borrow it if you're willing to write up a full answer describing why it's good, and/or if you find some improvements. Otherwise I'll probably get around to posting my own answer sometime.)Theatrics
