Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2

Asked 9/3, 2019 at 20:13 Answered 6/8, 2019 at 0:38

(Related: How to quickly count bits into separate bins in a series of ints on Sandy Bridge? is an earlier duplicate of this, with some different answers. Editor's note: the answers here are probably better.

Also, an AVX2 version of a similar problem, with many bins for a whole row of bits much wider than one uint64_t: Improve column population count algorithm)

I am working on a project in C where I need to go through tens of millions of masks (of type ulong (64-bit)) and update an array (called target) of 64 short integers (uint16) based on a simple rule:

// for any given mask, do the following loop
for (i = 0; i < 64; i++) {
    if (mask & (1ull << i)) {
        target[i]++
    }
}

The problem is that I need do the above loops on tens of millions of masks and I need to finish in less than a second. Wonder if there are any way to speed it up, like using some sort special assembly instruction that represents the above loop.

Currently I use gcc 4.8.4 on ubuntu 14.04 (i7-2670QM, supporting AVX, not AVX2) to compile and run the following code and took about 2 seconds. Would love to make it run under 200ms.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}
unsigned int target[64];

int main(int argc, char *argv[]) {
    int i, j;
    unsigned long x = 123;
    unsigned long m = 1;
    char *p = malloc(8 * 10000000);
    if (!p) {
        printf("failed to allocate\n");
        exit(0);
    }
    memset(p, 0xff, 80000000);
    printf("p=%p\n", p);
    unsigned long *pLong = (unsigned long*)p;
    double start = getTS();
    for (j = 0; j < 10000000; j++) {
        m = 1;
        for (i = 0; i < 64; i++) {
            if ((pLong[j] & m) == m) {
                target[i]++;
            }
            m = (m << 1);
        }
    }
    printf("took %f secs\n", getTS() - start);
    return 0;
}

Thanks in advance!

Bamby answered 9/3, 2019 at 20:13 Comment(15)

using uint16 as counters would likely result in overflow – Zeist 9/3, 2019 at 20:22

"like using some sort special assembly instruction that represents the above loop." --> a good compiler will see this. Refer to your compiles optimization settings to activate such. – Intuitionism 9/3, 2019 at 20:22

Suggestion: take down question, add a test harness around it to report time and post that code asking for improvements. As is now, question is too broad. – Intuitionism 9/3, 2019 at 20:24

This is a classic case where a GPU can help greatly speed up the process. It process in the batches of 64 etc. and has many CUDA cores to take advantage of the possible parallel computation in your problem – Selectivity 9/3, 2019 at 20:26

@VTT you are right, it's better to have uint32. I was subconsciously using the pass experience that the count won't grow to be more than 10,000. Can you suggest a compile for this? Don't think gcc can do this. – Bamby 9/3, 2019 at 20:27

@chux thanks for the suggestion, will try to do some benchmark on it and update. – Bamby 9/3, 2019 at 20:28

Done. Thx @rustyx. – Bamby 9/3, 2019 at 21:23

Added a simple benchmark code. – Bamby 9/3, 2019 at 21:35

The popcnt instruction (including an SSE version) counts bits set to 1 in a word, which is what you're doing. I don't thing that's what you want. See felixcloutier.com/x86/popcnt and #6427973 . Note you're benchmark skips bit 0 (thought the initial code doesn't). – Farmland 9/3, 2019 at 22:10

@Gene, good catch. I moved m = (m <<1) to the end. The time is now 1.64 seconds. Will check out your link. – Bamby 9/3, 2019 at 22:16

You should fill the buffer with realistic data too. The number of set bits and their pattern affects the performance of this code, accidentally having all zeroes would unfairly benefit branchy code which would perform worse on maximally random bits (50% chance, uncorrelated) – Abrahan 9/3, 2019 at 22:31

@harold that's a good point. Changed to set all bytes to be 0xff. As you predicted, the time does go up ( at 2 seconds). – Bamby 9/3, 2019 at 23:8

all-ones is still very predictable, so branching can mispredict on actual data with different mixes of zeros/ones. Doing target[i] += (pLong[j] & (1ULL << i) != 0) might compile to a bts / setc / add if you're lucky. – Melson 10/3, 2019 at 0:29

@Gene: no, popcnt is a horizontal popcnt that adds up all the bits in one integer. The OP wants separate counts for every bit-position over multiple integers. You want to unpack bit to something wider (e.g. nibbles or bytes), then vertical add (like paddb) as many times as you can without overflow, then expand to wider counters. – Melson 10/3, 2019 at 0:32

@chux: good luck getting gcc or clang to auto-vectorize this. Especially where the C abstract machine doesn't touch target[] if the input is all zeros, it can't introduce a non-atomic += condition; With AVX, it could still read it and do a masked store like vmaskmovps that wouldn't rewrite values that weren't touched, but in practice it doesn't. godbolt.org/z/b5Yfug Even ICC doesn't do anything better than jcc over an inc [mem]. – Melson 10/3, 2019 at 1:13

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

Here is an improved version using this method. it runs in 45ms on my system.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}

The technique for inflating a byte to a 64-bit long are investigated and explained in the answer: https://mcmap.net/q/827156/-is-there-a-more-efficient-way-of-expanding-a-char-to-an-uint64_t . I made the target array a local variable, as well as the inflate array, and I print the results to ensure the compiler will not optimize the computations away. In a production version you would compute the inflate array separately.

Using SIMD directly might provide further improvements at the expense of portability and readability. This kind of optimisation is often better left to the compiler as it can generate specific code for the target architecture. Unless performance is critical and benchmarking proves this to be a bottleneck, I would always favor a generic solution.

A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Iodism answered 9/3, 2019 at 23:46 Comment(13)

It appears that my CPU doesn't perform as well as yours :-) I tried if ((pLong[j] & m) != 0) and it made no difference in time. Tried target[i] += (pLong[j] >> i) & 1;, it is actually worse, since time goes to 2.74 seconds. – Bamby 9/3, 2019 at 23:53

@Bamby and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code. – Embarrassment 10/3, 2019 at 0:23

@HadiBrais My CPU is i7-2670QM, mentioned in the post. – Bamby 10/3, 2019 at 0:26

@Bamby Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code. – Embarrassment 10/3, 2019 at 0:27

@HadiBrais I just used gcc <c_file>. Tried -O 3 but the same number. – Bamby 10/3, 2019 at 0:29

@pktCoder: It's not plausible that un-optimized gcc foo.c was the same speed as gcc -O3 -march=native foo.c for this. Your inner-most loop should run much slower if you disable optimization so the loop-carried dependency involves a store/reload, and all that extra front-end bandwidth. – Melson 10/3, 2019 at 0:40

@pktCoder: Also note that chqrlie is using clang, which normally unrolls inner loops, while gcc doesn't enable loop unrolling unless you use -fprofile-generate / run the program / -fprofile-use. Also your gcc4.8 is quite old, barely newer than your CPU. A newer gcc version would optimize better. – Melson 10/3, 2019 at 0:44

on my PC (i5-7200U) gcc 7.3 output performs at ~505ms and ~330ms for the 2 versions which Clang output runs consistently at 420ms for both versions. But probably the test is not good because target is not used anywhere and can be removed completely by some compilers – Appendectomy 10/3, 2019 at 0:59

@Iodism I tried your code and and the performance is amazing, the time goes down from 2.0seconds to about 0.28 second. Trying to understand the code here. Do you have a reference (document) that explains this technique? Thanks. – Bamby 10/3, 2019 at 1:33

This answer is the most clever one and it's very fast, so I will accept it. But I think the other answers citing SIMD is the ultimately the way to go. – Bamby 10/3, 2019 at 4:38

@pktCoder: I amended the answer with further explanations. – Iodism 10/3, 2019 at 5:42

there are many other ways to inflate 8 bits in a byte to 8 bytes in a long How to create a byte out of 8 bool values (and vice versa)?, How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD, How to convert a sequence of 32 char (0/1) to 32 bits (uint32_t)? – Appendectomy 10/3, 2019 at 7:28

@phuclv: Thank you. I added your method in this comparative benchmark: https://mcmap.net/q/827156/-is-there-a-more-efficient-way-of-expanding-a-char-to-an-uint64_t in the case of this question, it is (only) 30% slower than the lookup table. SIMD approaches would probably squeeze more cpu cycles out. – Iodism 10/3, 2019 at 8:22

an earlier duplicate has some alternate ideas: How to quickly count bits into separate bins in a series of ints on Sandy Bridge?.
Harold's answer on AVX2 column population count algorithm over each bit-column separately.
Matrix transpose and population count has a couple useful answers with AVX2, including benchmarks. It uses 32-bit chunks instead of 64-bit.

Also: https://github.com/mklarqvist/positional-popcount has SSE blend, various AVX2, various AVX512 including Harley-Seal which is great for large arrays, and various other algorithms for positional popcount. Possibly only for uint16_t, but most could be adapted for other word widths. I think the algorithm I propose below is what they call adder_forest.

Your best bet is SIMD, using AVX1 on your Sandybridge CPU. Compilers aren't smart enough to auto-vectorize your loop-over-bits for you, even if you write it branchlessly to give them a better chance.

And unfortunately not smart enough to auto-vectorize the fast version that gradually widens and adds.

See is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. Ext3h's suggestion in another answer is good: Unpack bits to something narrower than the final count array gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 vertical paddb without overflow, before unpacking to accumulate into the 32-bit counter array.

It only takes 4x 16-byte __m128i vectors to hold all 64 uint8_t elements, so those accumulators can stay in registers, only adding to memory when widening out to 32-bit counters in an outer loop.

The unpack doesn't have to be in-order: you can always shuffle target[] once at the very end, after accumulating all the results.

The inner loop could be unrolled to start with a 64 or 128-bit vector load, and unpack 4 or 8 different ways using pshufb (_mm_shuffle_epi8).

An even better strategy is to widen gradually

Starting with 2-bit accumulators, then mask/shift to widen those to 4-bit. So in the inner-most loop most of the operations are working with "dense" data, not "diluting" it too much right away. Higher information / entropy density means that each instruction does more useful work.

Using SWAR techniques for 32x 2-bit add inside scalar or SIMD registers is easy / cheap because we need to avoid the possibility of carry out the top of an element anyway. With proper SIMD, we'd lose those counts, with SWAR we'd corrupt the next element.

uint64_t x = *(input++);        // load a new bitmask
const uint64_t even_1bits = 0x5555555555555555;  // 0b...01010101;

uint64_t lo = x & even_1bits;
uint64_t hi = (x>>1) & even_1bits;            // or use ANDN before shifting to avoid a MOV copy

accum2_lo += lo;   // can do up to 3 iterations of this without overflow
accum2_hi += hi;   // because a 2-bit integer overflows at 4

Then you repeat up to 4 vectors of 4-bit elements, then 8 vectors of 8-bit elements, then you should widen all the way to 32 and accumulate into the array in memory because you'll run out of registers anyway, and this outer outer loop work is infrequent enough that we don't need to bother with going to 16-bit. (Especially if we manually vectorize).

Biggest downside: this doesn't auto-vectorize, unlike @njuffa's version. But with gcc -O3 -march=sandybridge for AVX1 (then running the code on Skylake), this running scalar 64-bit is actually still slightly faster than 128-bit AVX auto-vectorized asm from @njuffa's code.

But that's timing on Skylake, which has 4 scalar ALU ports (and mov-elimination), while Sandybridge lacks mov-elimination and only has 3 ALU ports, so the scalar code will probably hit back-end execution-port bottlenecks. (But SIMD code may be nearly as fast, because there's plenty of AND / ADD mixed with the shifts, and SnB does have SIMD execution units on all 3 of its ports that have any ALUs on them. Haswell just added port 6, for scalar-only including shifts and branches.)

With good manual vectorization, this should be a factor of almost 2 or 4 faster.

But if you have to choose between this scalar or @njuffa's with AVX2 autovectorization, @njuffa's is faster on Skylake with -march=native

If building on a 32-bit target is possible/required, this suffers a lot (without vectorization because of using uint64_t in 32-bit registers), while vectorized code barely suffers at all (because all the work happens in vector regs of the same width).

// TODO: put the target[] re-ordering somewhere
// TODO: cleanup for N not a multiple of 3*4*21 = 252
// TODO: manual vectorize with __m128i, __m256i, and/or __m512i

void sum_gradual_widen (const uint64_t *restrict input, unsigned int *restrict target, size_t length)
{
    const uint64_t *endp = input + length - 3*4*21;     // 252 masks per outer iteration
    while(input <= endp) {
        uint64_t accum8[8] = {0};     // 8-bit accumulators
        for (int k=0 ; k<21 ; k++) {
            uint64_t accum4[4] = {0};  // 4-bit accumulators can hold counts up to 15.  We use 4*3=12
            for(int j=0 ; j<4 ; j++){
                uint64_t accum2_lo=0, accum2_hi=0;
                for(int i=0 ; i<3 ; i++) {  // the compiler should fully unroll this
                    uint64_t x = *input++;    // load a new bitmask
                    const uint64_t even_1bits = 0x5555555555555555;
                    uint64_t lo = x & even_1bits; // 0b...01010101;
                    uint64_t hi = (x>>1) & even_1bits;  // or use ANDN before shifting to avoid a MOV copy
                    accum2_lo += lo;
                    accum2_hi += hi;   // can do up to 3 iterations of this without overflow
                }

                const uint64_t even_2bits = 0x3333333333333333;
                accum4[0] +=  accum2_lo       & even_2bits;  // 0b...001100110011;   // same constant 4 times, because we shift *first*
                accum4[1] += (accum2_lo >> 2) & even_2bits;
                accum4[2] +=  accum2_hi       & even_2bits;
                accum4[3] += (accum2_hi >> 2) & even_2bits;
            }
            for (int i = 0 ; i<4 ; i++) {
                accum8[i*2 + 0] +=   accum4[i] & 0x0f0f0f0f0f0f0f0f;
                accum8[i*2 + 1] +=  (accum4[i] >> 4) & 0x0f0f0f0f0f0f0f0f;
            }
        }

        // char* can safely alias anything.
        unsigned char *narrow = (uint8_t*) accum8;
        for (int i=0 ; i<64 ; i++){
            target[i] += narrow[i];
        }
    }
    /* target[0] = bit 0
     * target[1] = bit 8
     * ...
     * target[8] = bit 1
     * target[9] = bit 9
     * ...
     */
    // TODO: 8x8 transpose
}

We don't care about order, so accum4[0] has 4-bit accumulators for every 4th bit, for example. The final fixup needed (but not yet implemented) at the very end is an 8x8 transpose of the uint32_t target[64] array, which can be done efficiently using unpck and vshufps with only AVX1. (Transpose an 8x8 float using AVX/AVX2). And also a cleanup loop for the last up to 251 masks.

We can use any SIMD element width to implement these shifts; we have to mask anyway for widths lower than 16-bit (SSE/AVX doesn't have byte-granularity shifts, only 16-bit minimum.)

Benchmark results on Arch Linux i7-6700k from @njuffa's test harness, with this added. (Godbolt) N = (10000000 / (3*4*21) * 3*4*21) = 9999864 (i.e. 10000000 rounded down to a multiple of the 252 iteration "unroll" factor, so my simplistic implementation is doing the same amount of work, not counting re-ordering target[] which it doesn't do, so it does print mismatch results. But the printed counts match another position of the reference array.)

I ran the program 4x in a row (to make sure the CPU was warmed up to max turbo) and took one of the runs that looked good (none of the 3 times abnormally high).

ref: the best bit-loop (next section)
fast: @njuffa's code. (auto-vectorized with 128-bit AVX integer instructions).
gradual: my version (not auto-vectorized by gcc or clang, at least not in the inner loop.) gcc and clang fully unroll the inner 12 iterations.

gcc8.2 -O3 -march=sandybridge -fpie -no-pie
ref: 0.331373 secs, fast: 0.011387 secs, gradual: 0.009966 secs
gcc8.2 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.397175 secs, fast: 0.011255 secs, gradual: 0.010018 secs
clang7.0 -O3 -march=sandybridge -fpie -no-pie
ref: 0.352381 secs, fast: 0.011926 secs, gradual: 0.009269 secs (very low counts for port 7 uops, clang used indexed addressing for stores)
clang7.0 -O3 -march=sandybridge -fno-pie -no-pie
ref: 0.293014 secs, fast: 0.011777 secs, gradual: 0.009235 secs

-march=skylake (allowing AVX2 for 256-bit integer vectors) helps both, but @njuffa's most because more of it vectorizes (including its inner-most loop):

gcc8.2 -O3 -march=skylake -fpie -no-pie
ref: 0.328725 secs, fast: 0.007621 secs, gradual: 0.010054 secs (gcc shows no gain for "gradual", only "fast")
gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.333922 secs, fast: 0.007620 secs, gradual: 0.009866 secs
clang7.0 -O3 -march=skylake -fpie -no-pie
ref: 0.260616 secs, fast: 0.007521 secs, gradual: 0.008535 secs (IDK why gradual is faster than -march=sandybridge; it's not using BMI1 andn. I guess because it's using 256-bit AVX2 for the k=0..20 outer loop with vpaddq)
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.259159 secs, fast: 0.007496 secs, gradual: 0.008671 secs

Without AVX, just SSE4.2: (-march=nehalem), bizarrely clang's gradual is faster than with AVX / tune=sandybridge. "fast" is only barely slower than with AVX.

gcc8.2 -O3 -march=skylake -fno-pie -no-pie
ref: 0.337178 secs, fast: 0.011983 secs, gradual: 0.010587 secs
clang7.0 -O3 -march=skylake -fno-pie -no-pie
ref: 0.293555 secs, fast: 0.012549 secs, gradual: 0.008697 secs

-fprofile-generate / -fprofile-use help some for GCC, especially for the "ref" version where it doesn't unroll at all by default.

I highlighted the best, but often they're within measurement noise margin of each other. It's unsurprising the -fno-pie -no-pie was sometimes faster: indexing static arrays with [disp32 + reg] is not an indexed addressing mode, just base + disp32, so it doesn't ever unlaminate on Sandybridge-family CPUs.

But with gcc sometimes -fpie was faster; I didn't check but I assume gcc just shot itself in the foot somehow when 32-bit absolute addressing was possible. Or just innocent-looking differences in code-gen happened to cause alignment or uop-cache problems; I didn't check in detail.

For SIMD, we can simply do 2 or 4x uint64_t in parallel, only accumulating horizontally in the final step where we widen bytes to 32-bit elements. (Perhaps by shuffling in-lane and then using pmaddubsw with a multiplier of _mm256_set1_epi8(1) to add horizontal byte pairs into 16-bit elements.)

TODO: manually-vectorized __m128i and __m256i (and __m512i) versions of this. Should be close to 2x, 4x, or even 8x faster than the "gradual" times above. Probably HW prefetch can still keep up with it, except maybe an AVX512 version with data coming from DRAM, especially if there's contention from other threads. We do a significant amount of work per qword we read.

Obsolete code: improvements to the bit-loop

Your portable scalar version can be improved, too, speeding it up from ~1.92 seconds (with a 34% branch mispredict rate overall, with the fast loops commented out!) to ~0.35sec (clang7.0 -O3 -march=sandybridge) with a properly random input on 3.9GHz Skylake. Or 1.83 sec for the branchy version with != 0 instead of == m, because compilers fail to prove that m always has exactly 1 bit set and/or optimize accordingly.

(vs. 0.01 sec for @njuffa's or my fast version above, so this is pretty useless in an absolute sense, but it's worth mentioning as a general optimization example of when to use branchless code.)

If you expect a random mix of zeros and ones, you want something branchless that won't mispredict. Doing += 0 for elements that were zero avoids that, and also means that the C abstract machine definitely touches that memory regardless of the data.

Compilers aren't allowed to invent writes, so if they wanted to auto-vectorize your if() target[i]++ version, they'd have to use a masked store like x86 vmaskmovps to avoid a non-atomic read / rewrite of unmodified elements of target. So some hypothetical future compiler that can auto-vectorize the plain scalar code would have an easier time with this.

Anyway, one way to write this is target[i] += (pLong[j] & m != 0);, using bool->int conversion to get a 0 / 1 integer.

But we get better asm for x86 (and probably for most other architectures) if we just shift the data and isolate the low bit with &1. Compilers are kinda dumb and don't seem to spot this optimization. They do nicely optimize away the extra loop counter, and turn m <<= 1 into add same,same to efficiently left shift, but they still use xor-zero / test / setne to create a 0 / 1 integer.

An inner loop like this compiles slightly more efficiently (but still much much worse than we can do with SSE2 or AVX, or even scalar using @chrqlie's lookup table which will stay hot in L1d when used repeatedly like this, allowing SWAR in uint64_t):

    for (int j = 0; j < 10000000; j++) {
#if 1  // extract low bit directly
        unsigned long long tmp = pLong[j];
        for (int i=0 ; i<64 ; i++) {   // while(tmp) could mispredict, but good for sparse data
            target[i] += tmp&1;
            tmp >>= 1;
        }
#else // bool -> int shifting a mask
        unsigned long m = 1;
        for (i = 0; i < 64; i++) {
            target[i]+= (pLong[j] & m) != 0;
            m = (m << 1);
        }
#endif

Note that unsigned long is not guaranteed to be a 64-bit type, and isn't in x86-64 System V x32 (ILP32 in 64-bit mode), and Windows x64. Or in 32-bit ABIs like i386 System V.

Compiled on the Godbolt compiler explorer by gcc, clang, and ICC, it's 1 fewer uops in the loop with gcc. But all of them are just plain scalar, with clang and ICC unrolling by 2.

# clang7.0 -O3 -march=sandybridge
.LBB1_2:                            # =>This Loop Header: Depth=1
   # outer loop loads a uint64 from the src
    mov     rdx, qword ptr [r14 + 8*rbx]
    mov     rsi, -256
.LBB1_3:                            #   Parent Loop BB1_2 Depth=1
                                    # do {
    mov     edi, edx
    and     edi, 1                              # isolate the low bit
    add     dword ptr [rsi + target+256], edi   # and += into target

    mov     edi, edx
    shr     edi
    and     edi, 1                              # isolate the 2nd bit
    add     dword ptr [rsi + target+260], edi

    shr     rdx, 2                              # tmp >>= 2;

    add     rsi, 8
    jne     .LBB1_3                       # } while(offset += 8 != 0);

This is slightly better than we get from test / setnz. Without unrolling, bt / setc might have been equal, but compilers are bad at using bt to implement bool (x & (1ULL << n)), or bts to implement x |= 1ULL << n.

If many words have their highest set bit far below bit 63, looping on while(tmp) could be a win. Branch mispredicts make it not worth it if it only saves ~0 to 4 iterations most of the time, but if it often saves 32 iterations, that could really be worth it. Maybe unroll in the source so the loop only tests tmp every 2 iterations (because compilers won't do that transformation for you), but then the loop branch can be shr rdx, 2 / jnz.

On Sandybridge-family, this is 11 fused-domain uops for the front end per 2 bits of input. (add [mem], reg with a non-indexed addressing mode micro-fuses the load+ALU, and the store-address+store-data, everything else is single-uop. add/jcc macro-fuses. See Agner Fog's guide, and https://stackoverflow.com/tags/x86/info). So it should run at something like 3 cycles per 2 bits = one uint64_t per 96 cycles. (Sandybridge doesn't "unroll" internally in its loop buffer, so non-multiple-of-4 uop counts basically round up, unlike on Haswell and later).

vs. gcc's not-unrolled version being 7 uops per 1 bit = 2 cycles per bit. If you compiled with gcc -O3 -march=native -fprofile-generate / test-run / gcc -O3 -march=native -fprofile-use, profile-guided optimization would enable loop unrolling.

This is probably slower than a branchy version on perfectly predictable data like you get from memset with any repeating byte pattern. I'd suggest filling your array with randomly-generated data from a fast PRNG like an SSE2 xorshift+, or if you're just timing the count loop then use anything you want, like rand().

Melson answered 10/3, 2019 at 2:6 Comment(6)

Thanks @peter-cordes for the writeup. I will have to study it carefully since I believe it's the way to go. – Bamby 10/3, 2019 at 4:40

The gradual strategy can be rewritten such that it is auto-vectorizable by clang/gcc: godbolt.org/z/BSSZeL (It's only a matter of using indexed access instead of pointer increments, and fields instead of individual variables) ... except it doesn't help. In contrary, the auto vectorized code is only half as fast as the scalar one. – Rossierossing 10/3, 2019 at 20:34

@Ext3h: Interesting. Yeah, it auto-vectorizes, but with a strategy that's total garbage. You convinced the compiler to do an actual multiply in the inner-most loop for index calc for a vpgatherqq! Is it maybe vectorizing over an outer loop? (If you're testing on Ryzen, that will be extra slow. Skylake has fast gathers (4c SKL vs. 12c Ryzen according to Agner Fog), but that's still going to suck vs. contiguous loads). There's also a bunch of vpinsrq xmm2, xmm1, [small local array], 0 / vmovdqa [big local array], ymm2 insert/spill in the outer loop which looks super weird. – Melson 11/3, 2019 at 0:2

Use of gather op is just a result of targeting Skylake. Computing index per iteration is a side effect of using a 21*4*3 split, instead of a simpler 16*4*2. With later one, it just barely overtakes scalar on all platforms. clang still appears to have a hard time with the increasing register pressure when gradual unpacking though. – Rossierossing 11/3, 2019 at 6:33

@Ext3h: My point was that all this stems from auto-vectorizing the wrong way. I think it might be doing the horizontal stuff inside the inner-most loop instead of turning accum4 and accum8 into SIMD vectors with strides of 4 masks. With -march=skylake it uses a gather, with -march=znver1 it's doing movq loads and shuffling them together with vpunpcklqdq and vinsert128. Simplifying to 16*4*2 yes it uses vector loads before shuffling. But clang still spends more work shuffling in the inner loop than it does on vpand / shift. (It does fully unroll over 2*4 = 8 vectors.) – Melson 11/3, 2019 at 6:58

... godbolt.org/z/ZTXgSU has znver1 with 16*4*2. I bet it doesn't realize that none of the intermediates ever carry-out / wrap, and thus it preserves the exact same intermediate values to ensure that any wrapping would happen the same way. Using int64_t (to promise no signed overflow because that's UB) wouldn't help, because all it knows is that the C abstract machine doesn't have any overflows with that order of operations. (And it wouldn't even be true; carry into the signbit is possible. Signed overflow != unsigned overflow). – Melson 11/3, 2019 at 7:3

On my system, a 4 year old MacBook (2.7 GHz intel core i5) with clang-900.0.39.2 -O3, your code runs in 500ms.

Just changing the inner test to if ((pLong[j] & m) != 0) saves 30%, running in 350ms.

Further simplifying the inner part to target[i] += (pLong[j] >> i) & 1; without a test brings it down to 280ms.

Further improvements seem to require more advanced techniques such as unpacking the bits into blocks of 8 ulongs and adding those in parallel, handling 255 ulongs at a time.

Here is an improved version using this method. it runs in 45ms on my system.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/stat.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char *argv[]) {
    unsigned int target[64] = { 0 };
    unsigned long *pLong = malloc(sizeof(*pLong) * 10000000);
    int i, j;

    if (!pLong) {
        printf("failed to allocate\n");
        exit(1);
    }
    memset(pLong, 0xff, sizeof(*pLong) * 10000000);
    printf("p=%p\n", (void*)pLong);
    double start = getTS();
    uint64_t inflate[256];
    for (i = 0; i < 256; i++) {
        uint64_t x = i;
        x = (x | (x << 28));
        x = (x | (x << 14));
        inflate[i] = (x | (x <<  7)) & 0x0101010101010101ULL;
    }
    for (j = 0; j < 10000000 / 255 * 255; j += 255) {
        uint64_t b[8] = { 0 };
        for (int k = 0; k < 255; k++) {
            uint64_t u = pLong[j + k];
            for (int kk = 0; kk < 8; kk++, u >>= 8)
                b[kk] += inflate[u & 255];
        }
        for (i = 0; i < 64; i++)
            target[i] += (b[i / 8] >> ((i % 8) * 8)) & 255;
    }
    for (; j < 10000000; j++) {
        uint64_t m = 1;
        for (i = 0; i < 64; i++) {
            target[i] += (pLong[j] >> i) & 1;
            m <<= 1;
        }
    }
    printf("target = {");
    for (i = 0; i < 64; i++)
        printf(" %d", target[i]);
    printf(" }\n");
    printf("took %f secs\n", getTS() - start);
    return 0;
}

A different solution by njuffa provides similar performance without the need for a precomputed array. Depending on your compiler and hardware specifics, it might be faster.

Iodism answered 9/3, 2019 at 23:46 Comment(13)

@Bamby and chqrlie: Your numbers are more useful when you specify the CPU model used to run the experiments and the compiler options used to compile the code. – Embarrassment 10/3, 2019 at 0:23

@HadiBrais My CPU is i7-2670QM, mentioned in the post. – Bamby 10/3, 2019 at 0:26

@Bamby Right, you mentioned the CPU model and the compiler, but you forgot the compiler options used to compile the code. – Embarrassment 10/3, 2019 at 0:27

@HadiBrais I just used gcc <c_file>. Tried -O 3 but the same number. – Bamby 10/3, 2019 at 0:29

This answer is the most clever one and it's very fast, so I will accept it. But I think the other answers citing SIMD is the ultimately the way to go. – Bamby 10/3, 2019 at 4:38

@pktCoder: I amended the answer with further explanations. – Iodism 10/3, 2019 at 5:42

One way of speeding this up significantly, even without AVX, is to split the data into blocks of up to 255 elements, and accumulate the bit counts byte-wise in ordinary uint64_t variables. Since the source data has 64 bits, we need an array of 8 byte-wise accumulators. The first accumulator counts bits in positions 0, 8, 16, ... 56, second accumulator counts bits in positions 1, 9, 17, ... 57; and so on. After we are finished processing a block of data, we transfers the counts form the byte-wise accumulator into the target counts. A function to update the target counts for a block of up to 255 numbers can be coded in a straightforward fashion according to the description above, where BITS is the number of bits in the source data:

/* update the counts of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

The entire ISO-C99 program, which should be able to run on at least Windows and Linux platforms is shown below. It initializes the source data with a PRNG, performs a correctness check against the asker's reference implementation, and benchmarks both the reference code and the accelerated version. On my machine (Intel Xeon E3-1270 v2 @ 3.50 GHz), when compiled with MSVS 2010 at full optimization (/Ox), the output of the program is:

p=0000000000550040
ref took 2.020282 secs, fast took 0.027099 secs

where ref refers to the asker's original solution. The speed-up here is about a factor 74x. Different speed-ups will be observed with other (and especially newer) compilers.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

/*
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

#define N          (10000000)
#define BITS       (64)
#define BLOCK_SIZE (255)

/* cupdate the count of 1-bits in each bit position for up to 255 numbers */
void sum_block (const uint64_t *pLong, unsigned int *target, int lo, int hi)
{
    int jj, k, kk;
    uint64_t byte_wise_sum [BITS/8] = {0};
    for (jj = lo; jj < hi; jj++) {
        uint64_t t = pLong[jj];
        for (k = 0; k < BITS/8; k++) {
            byte_wise_sum[k] += t & 0x0101010101010101;
            t >>= 1;
        }
    }
    /* accumulate byte sums into target */
    for (k = 0; k < BITS/8; k++) {
        for (kk = 0; kk < BITS; kk += 8) {
            target[kk + k] += (byte_wise_sum[k] >> kk) & 0xff;
        }
    }
}

int main (void) 
{
    double start_ref, stop_ref, start, stop;
    uint64_t *pLong;
    unsigned int target_ref [BITS] = {0};
    unsigned int target [BITS] = {0};
    int i, j;

    pLong = malloc (sizeof(pLong[0]) * N);
    if (!pLong) {
        printf("failed to allocate\n");
        return EXIT_FAILURE;
    }
    printf("p=%p\n", pLong);

    /* init data */
    for (j = 0; j < N; j++) {
        pLong[j] = KISS64;
    }

    /* count bits slowly */
    start_ref = second();
    for (j = 0; j < N; j++) {
        uint64_t m = 1;
        for (i = 0; i < BITS; i++) {
            if ((pLong[j] & m) == m) {
                target_ref[i]++;
            }
            m = (m << 1);
        }
    }
    stop_ref = second();

    /* count bits fast */
    start = second();
    for (j = 0; j < N / BLOCK_SIZE; j++) {
        sum_block (pLong, target, j * BLOCK_SIZE, (j+1) * BLOCK_SIZE);
    }
    sum_block (pLong, target, j * BLOCK_SIZE, N);
    stop = second();

    /* check whether result is correct */
    for (i = 0; i < BITS; i++) {
        if (target[i] != target_ref[i]) {
            printf ("error @ %d: res=%u ref=%u\n", i, target[i], target_ref[i]);
        }
    }

    /* print benchmark results */
    printf("ref took %f secs, fast took %f secs\n", stop_ref - start_ref, stop - start);
    return EXIT_SUCCESS;
}

Humus answered 10/3, 2019 at 5:25 Comment(17)

E3-1270 v2 is a quad-core IvyBridge, pretty comparable to the OP's Sandybridge in terms of microarchitecture. (Faster clock vs. 3.1GHz, and you have 8MB L3 vs. 6MB). Both lack AVX2. I think even with AVX, starting with 2-bit accumulators and widening gradually might be a win here vs. unpacking all the way to bytes to start with. Your SWAR byte add doesn't need to kill the carry, because you need to avoid inputs that could possible carry anyway. – Melson 10/3, 2019 at 5:29

@PeterCordes It occured to me that I could eliminate the carry killing logic in the byte-wise add, but there is also something to be said for re-using existing tested building blocks "as is", instead of fiddling with them at the last minute. – Humus 10/3, 2019 at 5:42

Ok, but literally all you need is +! It doesn't need a function at all. – Melson 10/3, 2019 at 5:53

@PeterCordes I agree 100%. I have been thinking too hard today, my old brain is tired, and so I completely overlooked that. Thanks for pointing it out, I will change it right away. – Humus 10/3, 2019 at 5:59

You also want sizeof(*pLong) for malloc, not sizeof(pLong). I accidentally compiled with -m32 from an old command I recalled and edited, and it segfaults in the init loop because you only alloc half as much memory as you need if sizeof(pointer) is less than uint64_t. I would edit but you're probably about to make another edit. Also, style: a block of declarations at the top of a function hasn't been required for a long time: C99 allows for(int i = ...) and so on, and it's generally considered better style to limit declarations to their scope. – Melson 10/3, 2019 at 6:3

@PeterCordes Thanks I fixed the malloc. I am not going to worry about style, this isn't production code, and I reused the asker's framekwork as much as possible. BTW, it turned out I had not actually used the byte_wise_add() function anywhere in the original code I posted, I had just forgotten to remove the code and description from the write-up prior to posting. Fixed now. Before I confuse myself any further, I'll retire for the day :-) – Humus 10/3, 2019 at 6:8

Yeah, I just noticed that when looking at the compiler output. BTW, gcc does auto-vectorize it quite nicely with AVX1 or AVX2, fully unrolling the inner loop of shift counts 0..7. godbolt.org/z/jCe6ZT – Melson 10/3, 2019 at 6:10

@PeterCordes Good to know, I didn't bother to look at the generated code as it seemed plenty fast. BTW: MSVS 2010 does not support C99 [ I need to upgrade my hardware and software within the next year, it is really a bit long in the tooth at this time ] – Humus 10/3, 2019 at 6:18

I have a mostly-finished version that gradually widens to 2, 4, then 8-bit accumulators, keeping high information/entropy density at all times. Unfortunately it doesn't auto-vectorize, but scalar of that is faster than 128-bit auto-vectorized of yours (on Skylake)! Yours with 256-bit auto-vectorization beats mine scalar. See my answer. (I haven't yet written a 128-bit or 256-bit manually-vectorized version; that should be nearly 2x or 4x faster.) I used your very helpful test harness. Anyway, I'd be interested to see how it runs on your IvyBridge CPU and/or the OP's SnB without mov-elim – Melson 10/3, 2019 at 12:49

@PeterCordes I measured 0.012724 seconds for sum_gradual_widen() on my machine (twice as fast as my code), but note that the bit counts aren't correct (yet?). I checked the generated asm, it's all scalar integer code using GPRs. – Humus 10/3, 2019 at 15:58

@Humus Love your solution! Wish I could vote it up 1000 times. This solution is nice and fast. on my old CPU, I got ref took 2.868154 secs, fast took 0.023752 secs. – Bamby 10/3, 2019 at 16:15

That's pretty much the best solution out of all the proposed options. It's the only one which is successfully auto-vectorized by the majority of compilers, and also scales well with register size. – Rossierossing 10/3, 2019 at 20:42

@pktCoder: Fun fact: my minor modification to ref to make it branchless goes from about 2 sec on my machine to 0.33 sec or if you use gcc -fprofile-generate / -fprofile-use (with -O3 -march=sandybridge), down to 0.22 sec – Melson 11/3, 2019 at 0:14

@njuffa: I guess your compiler isn't auto-vectorizing your version, or it should be close to the same speed or even somewhat faster :( Indeed, even with -Ox -arch:AVX, MSVC 19.16 fails to auto-vectorize, unlike the 3 "good" compilers. godbolt.org/z/fhAt7Y has a C++ port of the C from my answer (including mine and yours). – Melson 11/3, 2019 at 0:20

@njuffa: My sum_gradual_widen produces correct counts, but in the wrong order. (If N is a multiple of 3*4*21, I didn't write cleanup code because it's only really useful if it auto-vectorizes). The comment at the bottom of the outer loop documents the 0, 8, 16, ... 1, 9, 17 order of traversal for target[] to get bit-positions in order. (I searched the "error" output for a few numbers, and found a matching number at the expected position, so I think all the right counts are in there somewhere.) I think it basically needs an 8x8 transpose which can be SIMDed efficiently. – Melson 11/3, 2019 at 0:22

@PeterCordes There were two issues I needed to fix: (1) sum_gradual_widen didn't have cleanup code to handle a partial block at the end (2) The re-ordering of target results was not what I expected it to be (simple row-major to column-major transpose). Working now, timing for fast is 0.012774 secs on my machine. – Humus 11/3, 2019 at 4:26

Cool, so probably still close to the same speed as an AVX-128 auto-vectorized version of yours on actual SnB. (If you have the a local install of gcc or even just as, you might be able to copy/paste asm from Godbolt. You'd have a calling-convention mismatch unless you used __attribute__((ms_abi)) all over the place, including on printf's prototype. If you wanted to try this you'd probably want to just copy the asm for a stand-alone version of a function you can call from MSVC.) I've done that to test ICC's code-gen on my Linux desktop without having ICC installed. – Melson 11/3, 2019 at 5:16

For starters, the problem of unpacking the bits, because seriously, you do not want to test each bit individually.

So just follow the following strategy for unpacking the bits into bytes of a vector: https://mcmap.net/q/14410/-fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector

Now that you have padded each bit to 8 bits, you can just do this for blocks of up to 255 bitmasks at a time, and accumulate them all into a single vector register. After that, you would have to expect potential overflows, so you need to transfer.

After each block of 255, unpack again to 32bit, and add into the array. (You don't have to do exactly 255, just some convenient number less than 256 to avoid overflow of byte accumulators).

At 8 instructions per bitmask (4 per each lower and higher 32-bit with AVX2) - or half that if you have AVX512 available - you should be able to achieve a throughput of about half a billion bitmasks per second and core on an recent CPU.

typedef uint64_t T;
const size_t bytes = 8;
const size_t bits = bytes * 8;
const size_t block_size = 128;

static inline __m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);    // we only use the low 32bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    const __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf = _mm256_shuffle_epi8(xbcast, shufmask);

    const __m256i andmask = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: byte == 0 ? 0 : -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
}

void bitcount_vectorized(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count - (count % block_size); outer += block_size)
    {
        __m256i temp_accumulator[bits / 32] = { _mm256_setzero_si256() };
        for (size_t inner = 0; inner < block_size; ++inner) {
            for (size_t j = 0; j < bits / 32; j++)
            {
                const auto unpacked = expand_bits_to_bytes(static_cast<uint32_t>(data[outer + inner] >> (j * 32)));
                temp_accumulator[j] = _mm256_sub_epi8(temp_accumulator[j], unpacked);
            }
        }
        for (size_t j = 0; j < bits; j++)
        {
            accumulator[j] += ((uint8_t*)(&temp_accumulator))[j];
        }
    }
    for (size_t outer = count - (count % block_size); outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

void bitcount_naive(const T *data, uint32_t accumulator[bits], const size_t count)
{
    for (size_t outer = 0; outer < count; outer++)
    {
        for (size_t j = 0; j < bits; j++)
        {
            if (data[outer] & (T(1) << j))
            {
                accumulator[j]++;
            }
        }
    }
}

Depending on the chose compiler, the vectorized form achieved roughly a factor 25 speedup over the naive one.

On a Ryzen 5 1600X, the vectorized form roughly achieved the predicted throughput of ~600,000,000 elements per second.

Surprisingly, this is actually still 50% slower than the solution proposed by @njuffa.

Rossierossing answered 9/3, 2019 at 20:25 Comment(6)

See also is there an inverse instruction to the movemask instruction in intel avx2? for a summary of bitmap -> vector unpack methods for different sizes. But yes, unpack bits to something narrower than the final count gives you more elements per instruction. Bytes is efficient with SIMD, and then you can do up to 255 (not 256) vertical paddb without overflow, before unpacking to 32-bit counters. The inner loop could be unrolled to start with a vector load and unpack different ways. Note that the OP does not have AVX2 or BMI2 on Sandybridge. – Melson 10/3, 2019 at 0:36

You should be able to delay the byte-order fixup _mm256_shuffle_epi8(xbcast, shufmask) until the outer loop. Also note that uint64_t is guaranteed to be exactly 64 bits with no padding, but not guaranteed to have sizeof(uint64_t) == 8. CHAR_BIT (from limits.h) could be any value, for example 16 or 32 on some word-addressable DSPs. So your constexpr constants are parameterized backwards. Also note this is a C question, and C doesn't have constexpr. – Melson 10/3, 2019 at 13:40

@PeterCordes Right, delaying the shuffle up does almost catch up, but there is still another 10% missing due to the load being to narrow. njuffa's solution wins by performing a wider load, despite requiring significantly more instructions. – Rossierossing 10/3, 2019 at 14:7

I'm pretty sure your edit introduced a bug. You can't just do the same shuffle after AND/PCMPEQ instead of before! Your AND mask needs to change (to pick out different bits in all 8 broadcasted copies of the same uint32_t), and your fixup shuffle can't be duplicating any bytes; all the bytes at that point hold valuable data. But that shouldn't affect performance. – Melson 10/3, 2019 at 14:28

Anyway, while you're at it maybe you can adjust this to do a 64-bit broadcast like njuffa's version, and mask it 2 different ways to get the high 32 bits and the low 32 bits. But then you'd need a lane-crossing fixup shuffle. That's not great on Zen, but it should be ok since you only do it once per 128? vectors. – Melson 10/3, 2019 at 14:31

But anyway, I think my answer's suggestion (to widen gradually) is still the best bet for Zen as well, producing interleaved results for 4x uint64_t at a time if using 64-bit vectors. Instead of broadcasting within 1 vector, shift and mask so you end up with more vectors. It'd be awesome if you wanted to put some time into manually vectorizing that and/or adding a cleanup loop; I need to sleep and probably won't get back to it tomorrow. Feel free to copy as much code as you want from my Godbolt link and/or answer. Or not if you'd rather just play with yours :P – Melson 10/3, 2019 at 14:34

See

Efficient Computation of Positional Population Counts Using SIMD Instructions by Marcus D. R. Klarqvist, Wojciech Muła, Daniel Lemire (7 Nov 2019)

Faster Population Counts using AVX2 Instructions by Wojciech Muła, Nathan Kurz, Daniel Lemire (23 Nov 2016).

Basically, each full adder compresses 3 inputs to 2 outputs. So one can eliminate an entire 256-bit word for the price of 5 logic instructions. The full adder operation could be repeated until registers become exhausted. Then results in the registers are accumulated (as seen in most of the other answers).

Positional popcnt for 16-bit subwords is implemented here: https://github.com/mklarqvist/positional-popcount

// Carry-Save Full Adder (3:2 compressor)
b ^= a;
a ^= c;
c ^= b; // xor sum
b |= a;
b ^= c; // carry

Note: the accumulate step for positional-popcnt is more expensive than for normal simd popcnt. Which I believe makes it feasible to add a couple of half-adders to the end of the CSU, it might pay to go all the way up to 256 words before accumulating.

Pam answered 6/8, 2019 at 0:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

An even better strategy is to widen gradually

Obsolete code: improvements to the bit-loop

Recommended topics

Hot tags