Matrix transpose and population count

I have a square boolean matrix M of size N, stored by rows, and I want to count the number of bits set to 1 in each column.

For instance, for N=4:

1101
0101
0001
1001

M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };

result = { 2, 2, 0, 4}; 

I can obviously

  1. transpose the matrix M into a matrix M'
  2. popcount each row of M'.

Good algorithms exist for matrix transposition and popcounting through bit manipulation.

My question is: would it be possible to "merge" such algorithms into a single one?

Note that N could be quite large (say 1024 or more) with respect to a 64-bit architecture.
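For reference, a completely naive scalar version of what I want to compute could look like this (rows packed into 64-bit words is just an assumption for this sketch):

#include <cstdint>
#include <vector>

// Naive baseline (illustration only): rows packed into 64-bit words, N bits per row.
std::vector<uint64_t> column_counts (const std::vector<uint64_t>& rowsPacked, size_t N)
{
    const size_t wordsPerRow = (N + 63) / 64;
    std::vector<uint64_t> counts (N, 0);
    for (size_t row=0; row<N; row++)
    {
        for (size_t col=0; col<N; col++)
        {
            counts[col] += (rowsPacked[row*wordsPerRow + col/64] >> (col%64)) & 1;
        }
    }
    return counts;
}

I'm looking for something significantly faster than this.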

Laaspere answered 23/7, 2018 at 9:37 Comment(10)
Do you want to include AVX2 (or similar) answers as well or just scalar stuff?Amphioxus
Yes, SIMD answers would be also interesting; I just added the tag to the question.Laaspere
related: Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?, but I didn't get into details of a transpose implementation in my answer. Would it be useful to popcount multiple columns in parallel, so you use more of the data from the cache lines you touch? If so, you could mask / shift two different ways and add; carry won't go from one element into the next so you have separate 2-bit totals for every column in up to 3 rows. Shift/mask again and get 4-bit totals. At some loop with 16b totalsWeekly
If the data bits are stored in individual bytes, use an unsigned 64-bit type to sum 4 populations in parallel. Since this can only overflow after 256 rows, split your uint64_t into 4 integers every 255 additions (small cost overall), and update the grand totals. Also, don't waste time transposing, update the popcount row by row. At the very least, utilize a full cache line at a time from the row.Steady
@DillonDavis. Popcounting row by row seems interesting, the drawback in my case is that rows are packed as bits so I would need to unpack them as bytes first (could be done quite efficiently I guess). AVX2 could be of interest here by updating 32 local totals in a single addition. However, I'm not sure about the most efficient way to update the 32 grand totals (coded as u_int32_t) from the 32 local totals (coded as bytes in one AVX2 register).Laaspere
@Laaspere you could probably delay updating uint32_t grand totals until the all rows are accounted for by adding another set of registers to hold their counts until then. So something like _mm256_srli_si256 to shift the bytes 0, 64, 128, and 192 bytes, _mm256_castsi256_si128 to get the last 128 bits as a __mm128, _mm256_cvtepu8_epi32 to convert the last 8 bytes of the __m128 to 0-extended 32-bit integers. Then proceed to add these to the grand total avx2 registers. Once you've handled all rows, then go ahead and update the uint32_t array values.Steady
@Laaspere alternatively, you could just use 16-bit values rather than 8-bit, and provided you have less than 64k rows in your database, you won't run into overflow, so you skip the local->global transfer every 255 rows. Of course this halves your efficiency to 16 columns in parallel.Steady
@DillonDavis I implemented your comment that proposes to use other registers to hold grand totals count (see the code in my answer). There is a small improvement: from 1.28 cycles/row to 1.25. It's not that big but it's an improvement anyway.Laaspere
@Laaspere I had another idea for a potential optimization, but was unable to come up with a way of implementing efficiently, but maybe you might come up with something. You might be able to postpone shifting the registers until right before updating the grand totals. Lets say you are going to update your __mm256 register. Normally you'd shift, mask, and then add. Instead, mask and add. The overflow would be preserved by the preceding bits for all the but the first of the 32 columns, and just shift everything right before updating the grand totals. The problem is handling that first column.Steady
counting bits in columns: #55082025Chancroid

Related: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 and https://github.com/mklarqvist/positional-popcount


I had another idea which I haven't finished writing up nicely.

Godbolt link to messy work-in-progress which doesn't have correct loop bounds / cleanup, but for large buffers runs ~3x faster than @edrezen's version on my Skylake i7-6700k, with g++7.3 -O3 -march=native. See the test_SWAR_avx2 function. (I know it doesn't compile on Godbolt; Agner Fog's asmlib.h isn't present.)

I might have some columns in the wrong order, too, but from stepping through the asm I think it's doing the right amount of work. i.e. any necessary bugfixes won't slow it down.

I used 16-bit accumulators, so another outer loop might be necessary if you care about inputs large enough to overflow 16-bit per-column counters.

Interesting observation: an earlier buggy version of my loop used sum0123 twice in store_globalsums_from_vec16, leaving sum4567 unused, so it optimized away in the main loop. With less work, gcc fully unrolled the large for(int i=0 ; i<5 ; i++) loop, and the code ran slower, at about 1 cycle per byte instead of 0.5. The loop was probably too big for the uop cache or something (I didn't profile yet, but a front-end decode bottleneck would explain it). For some reason @edrezen's version is only running at about 1.5 c/B for me, not the ~1.25 reported in the answer. My CPU actually runs at 3.9GHz while Agner Fog's library detects it as 4.0, but that's not enough to explain the difference.

Also, gcc spills sum4567_16bit to the stack, so we're already pushing the boundary of register pressure without AVX512. It's updated infrequently and isn't a problem, but needing more accumulators in the inner loop could be.


It isn't clear from your question what the data layout is when the number of columns isn't 32.

It seems that for each uint32_t chunk of 32 columns, you have all the rows stored contiguously in memory, i.e. looping over the rows for a column is efficient. If you had more than 32 columns, the rows for columns 32..63 would be contiguous and come after all the rows for columns 0..31.
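In other words (my own helper, just to pin down the indexing I'm assuming, not code from the question):

#include <cstdint>

// Bit (row r, column c) lives in the dword for column-chunk c/32, at bit position c%32.
static inline int get_bit (const uint32_t* bitmap, uint64_t nbRows, uint64_t r, uint64_t c)
{
    return (bitmap[(c/32)*nbRows + r] >> (c%32)) & 1;
}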

(If instead you have all the columns for a single row contiguous, you could still use this idea, but might need to spill/reload some accumulators to memory, or let the compiler do that for you if it makes good choices.)

So loading a 32-byte (8 dword) vector gets 8 rows of data for one column chunk. That's extremely convenient: it allows widening from 1-bit (in memory) to 2-bit accumulators, then grabbing more data before widening to 4-bit, and so on, summing along the way so we get significant work done while the data is still dense. (Rather than only adding 1 bit (0 or 1) per byte to vector accumulators.)

The more we unroll, the more data we can grab from memory to make better use of the coding space in our vectors. i.e. our variables have higher entropy. Throwing around more data (in terms of bits of memory that contributed to it) per vpaddb/w/d/q or unpack/shuffle instruction is a Good Thing.

Using accumulators narrower than 1 byte within a SIMD vector is basically a SWAR (https://en.wikipedia.org/wiki/SWAR) technique, where you have to AND away bits that you shift past an element boundary, because we don't have SIMD element boundaries to do it for us. (And we avoid overflow anyway, so an ADD carrying into the next element isn't a problem.)
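The classic scalar popcount bithack is the same idea applied to a single 32-bit integer, and shows why the masking matters (this is the textbook version, not code from my Godbolt link):

#include <cstdint>

// SWAR popcount on one uint32_t: 1-bit fields -> 2-bit sums -> 4-bit sums -> byte sums.
static inline unsigned swar_popcount32 (uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                   // 2-bit partial sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);   // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                   // 8-bit partial sums
    return (x * 0x01010101u) >> 24;                     // sum the 4 bytes
}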

Each inner loop iteration:

  • take a vector of data from the same columns in each of 2 or 3 (groups of) rows. So you either have 3 * 8 rows from one chunk of 32 columns, or 3 rows of 256 columns.

  • mask them with set1(0b01010101) to get the even (low) bits, and with (vec>>1) & mask (_mm256_srli_epi32(v,1)) to get the odd (high) bits. Use _mm256_add_epi8 to accumulate within those 2-bit accumulators (see the sketch after this list). They can't overflow with only 3 ones, so carry-propagation boundaries don't actually matter.

    Each byte of your vector has 4 separate vertical sums, and you have two vectors (odd/even).

  • Repeat the above again, to get another pair of vectors from 3 vectors of data from memory.

  • Combine again to get 4 vectors of 4-bit accumulators (with possible values 0..6). Still without mixing bits from different columns within a single 32-bit element, of course, because we must never do that. Shifts only move bits for odd / high columns to the bottom of the 2-bit or 4-bit unit that contains them, so they can be added with bits that were moved the same way in other vectors.

  • _mm256_unpacklo/hi_epi8 and mask or shift+mask to get 8-bit accumulators

  • Put the above in a loop that runs up to 5 times, so the 0..12 accumulator values go up to 0..60 (i.e. leaving 2 bits of headroom for unpacking the 8-bit accumulators, using all their coding space.)
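As a concrete illustration of the first widening step (a sketch with my own names, not the code from the Godbolt link):

#include <immintrin.h>

// 3 vectors of raw 1-bit data become two vectors of 2-bit per-column sums.
// Even columns stay in place; odd columns are shifted down by 1 so both sets
// of sums sit at the bottom of their 2-bit field.
static inline void rows3_to_2bit (__m256i r0, __m256i r1, __m256i r2,
                                  __m256i& even2, __m256i& odd2)
{
    const __m256i mask01 = _mm256_set1_epi8 (0b01010101);

    even2 = _mm256_add_epi8 (_mm256_and_si256 (r0, mask01),
                             _mm256_and_si256 (r1, mask01));
    even2 = _mm256_add_epi8 (even2, _mm256_and_si256 (r2, mask01));

    odd2  = _mm256_add_epi8 (_mm256_and_si256 (_mm256_srli_epi32 (r0, 1), mask01),
                             _mm256_and_si256 (_mm256_srli_epi32 (r1, 1), mask01));
    odd2  = _mm256_add_epi8 (odd2, _mm256_and_si256 (_mm256_srli_epi32 (r2, 1), mask01));

    // Each 2-bit field now holds 0..3, so no carry can cross into a neighbouring column.
}

The 2-bit -> 4-bit and 4-bit -> 8-bit steps repeat the same mask / shift+mask / add pattern with 0b00110011 and 0b00001111 masks.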


If you have the data layout from your answer, then we can add data from dword elements within the same vector. We can do that so we don't run out of registers when widening our accumulators up to 16-bit (because x86-64 only has 16 YMM registers, and we need some for constants.)

  • _mm256_unpacklo/hi_epi16 and add, to interleave pairs of 8-bit counters so a group of counters for the same column has expanded from a dword to a qword.

Repeat this general idea to reduce the number of registers (or __m256i variables) your accumulators are spread over.
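For example, one way to do the 8-bit -> 16-bit step (a sketch with my own names; the unpack-against-zero trick is what I suggested in the comments) is:

#include <immintrin.h>

// Widen 32 byte counters into two vectors of 16-bit counters and accumulate.
// vpunpcklbw / vpunpckhbw work within 128-bit lanes, so the column order comes
// out permuted; that only has to be undone once, when the grand totals are stored.
static inline void accumulate_16bit (__m256i counts8, __m256i& sums16_lo, __m256i& sums16_hi)
{
    const __m256i zero = _mm256_setzero_si256();
    sums16_lo = _mm256_add_epi16 (sums16_lo, _mm256_unpacklo_epi8 (counts8, zero));
    sums16_hi = _mm256_add_epi16 (sums16_hi, _mm256_unpackhi_epi8 (counts8, zero));
}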

Efficiently handling the lack of a lane-crossing 2-input byte or word shuffle is inconvenient, but it's a pretty small part of the total work. vextracti128 / vpaddb xmm -> vpmovzxbw worked well enough.

Weekly answered 4/8, 2018 at 11:19 Comment(11)
Just to understand my initial intention : one row represents a configuration of M sets (M=32 for instance), with row[i]==1 meaning that the ith set is present in the configuration (0 otherwise). So the columns popcount I want to do corresponds to compute the size of each set for all found configurations. Since I can compute these configurations (i.e. rows) only sequentially, my data layout consists in an array of rows, each row being a uint32_t holding one 32 bit packed configuration.Laaspere
@edrezen: ok perfect, so if you had 64 columns and 16 rows, you'd have 16 uint32_t for the first 32 columns contiguous, then another 16 uint32_t for the later columns.Weekly
@edrezen: I have a work-in-progress SWAR-style implementation that runs about 3x faster than yours. See the Godbolt link. I'll hopefully get back to this again and improve the answer.Weekly
@edrezen: and BTW, I'm finding that my version at least is somewhat sensitive to the data being hot in cache. I guess OoO exec has a hard time hiding memory latency as well as all the instruction latency if there are any delays. Testing with a much smaller working set and many more runs looks better, if the N=1024 you mention in the question is more like your real problem size. ./a.out 18 100000 easily fits in L3 cache, and runs ~0.5 c / B, vs. your default settings which only runs 0.8. (Fitting in L2 / L1 only speeds it up to 0.48, a trivial amount. But L3 really helps.)Weekly
Well, it seems to be a very promising solution ! Actually, it was the kind of solution I unconsciously dreamt of when I posted the question, i.e. some kind of mix between a transpose and a popcount with SWAR technique. The key point I was completely missing was to use more than one row information at the same time, so your high entropy proposal definitely points in the good direction.Laaspere
I tested your solution on another CPU (Intel Xeon E5-2603) with gcc version 8.1.1. For ./a.out 18 100000, I get 0.52 c/B for your solution and 1.68 c/B for mine (I don't understand either why I got 1.25 c/B on my Intel Core i7-6700HQ)Laaspere
@edrezen: It looks like your "frequency" calculation might be based on RDTSC reference cycles, not performance-counter core clock cycles, because your code was always saying 4.0GHz when my SKL was running at 3.9GHz (energy_performance_preference=balance_powersave). So Turbo above the rated frequency on your laptop will make your CPU get much more than 1 cycle of work done per reference cycle. Turbo. But your Xeon presumably has a max turbo not much above its rated speed, or it wasn't turboing because other cores were loaded. (E5-2603 is SnB which doesn't have AVX2. Is it actually a v3 HSW?)Weekly
Actually, I rely on the ReadTSC function from the Agner Fog's asmlib and use it as a black box, so I could not say what is used for getting cycles information. Concerning the E5-2603, I just used lscpu to get the CPU model and the flags tell that avx2 is present.Laaspere
@edrezen: ReadTSC pretty obviously uses the rdtsc instruction to read reference cycles, not core clock cycles. RDTSC is used as a time source by gettimeofday and so on. So you should probably remove your "frequency" column, or disable turbo for your testing and make sure your CPU runs at reference frequency. See Getting cpu cycles using RDTSC - why does the value of RDTSC always increase? for the history of why RDTSC is a timesource, not a real clock cycle counter.Weekly
@edrezen: Then your CPU model string is not just "E5-2603". It's probably "E5-2603 v3" or "E5-2603 v4". (Haswell or Broadwell). Like I said, "E5-2603" is a Sandybridge without AVX2.Weekly
You're absolutely right. I have been a little bit lazy on this point because the lscpu command returned precisely "Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz"Laaspere

I ran some benchmarks to compare the two approaches:

  1. transpose + popcount
  2. update row by row

I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.

In my test, I assume that the input is an nbRows x 32 matrix in a bit-packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array of uint32_t.

The code is the following:

#include <cinttypes>
#include <cstdio>
#include <cstring>
#include <cmath>
#include <cassert>
#include <chrono>
#include <immintrin.h>
#include <asmlib.h>

using namespace std;
using namespace std::chrono;

// see https://mcmap.net/q/14410/-fastest-way-to-unpack-32-bits-to-a-32-byte-simd-vector
static __m256i expand_bits_to_bytes (uint32_t x);

// see https://mischasan.wordpress.com/2011/10/03/the-full-sse2-bit-matrix-transpose-routine/
static void sse_trans(char const *inp, char *out);

static double deviation (double n, double sum2, double sum);

////////////////////////////////////////////////////////////////////////////////
// Naive approach (matrix transposition)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    assert (nbRows%32==0);

    uint8_t transpo[32][32];  memset (transpo, 0, sizeof(transpo));

    for (uint64_t k=0; k<nbRows; k+=32)
    {
        // We unpack and transpose the input into a 32x32 bytes matrix
        for (size_t row=0; row<32; row++)
        {
            for (size_t col=0; col<32; col++)  {  transpo[col][row] = (bitmap[k+row] >> col) & 1 ;  }
        }

        for (size_t row=0; row<32; row++)
        {
            // We popcount the current row
            u_int8_t sum=0;
            for (size_t col=0; col<32; col++)  {  sum += transpo[row][col];  }

            // We update the corresponding global sum
            globalSums[row] += sum;
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// Naive approach (row by row)
////////////////////////////////////////////////////////////////////////////////
void test_update_row_by_row_naive (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    for (uint64_t row=0; row<nbRows; row++)
    {
        for (size_t col=0; col<32; col++)
        {
            globalSums[col] += (bitmap[row] >> col) & 1;
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// AVX2 (matrix transposition + popcount)
////////////////////////////////////////////////////////////////////////////////
void test_transpose_popcnt_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    assert (nbRows%32==0);

    uint32_t transpo[32];

    const uint32_t* loop = bitmap;
    for (uint64_t k=0; k<nbRows; loop+=32, k+=32)
    {
        // We transpose the input as a 32x32 bytes matrix
        sse_trans ((const char*)loop, (char*)transpo);

        // We update the global sums
        for (size_t i=0; i<32; i++)
        {
            globalSums[i] += __builtin_popcount (transpo[i]);
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
// AVX2 approach (update totals row by row)
////////////////////////////////////////////////////////////////////////////////

// Note: we use template specialization to unroll some portions of a loop
template<int N>
void UpdateLocalSums (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
    // We update the local sums with the current row
    localSums = _mm256_sub_epi8 (localSums, expand_bits_to_bytes (bitmap[k++]));

    // Go recursively
    UpdateLocalSums<N-1>(localSums, bitmap, k);
}

template<>
void UpdateLocalSums<0> (__m256i& localSums, const uint32_t* bitmap, uint64_t& k)
{
}

// Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
#define USE_AVX2_FOR_GRAND_TOTALS 1

void test_update_row_by_row_avx2 (uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums)
{
    union U256i {  __m256i v;   uint8_t a[32];  uint32_t b[8];  };

    // We use 1 register for updating local totals
    __m256i   localSums = _mm256_setzero_si256();

#ifdef USE_AVX2_FOR_GRAND_TOTALS
    // Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums with AVX2
    __m256i   globalSumsReg[4];  for (size_t r=0; r<4; r++)  {   globalSumsReg[r] = _mm256_setzero_si256(); }
#endif

    uint64_t steps = nbRows / 255;
    uint64_t k=0;

    const int divisorOf255 = 5;

    // We iterate over all rows
    for (uint64_t i=0; i<steps; i++)
    {
        // we update the local totals (255*32=8160 additions)
        for (int j=0; j<255/divisorOf255; j++)
        {
            // unroll some portion of the 255 loop through template specialization
            UpdateLocalSums<divisorOf255>(localSums, bitmap, k);
        }

#ifdef USE_AVX2_FOR_GRAND_TOTALS
        // Dillon Davis proposal: use 4 registers holding uint32_t values and update them from local sums

        // We take the 128 high bits of the local sums
        __m256i   localSums2 = _mm256_broadcastsi128_si256(_mm256_extracti128_si256(localSums,1));

        globalSumsReg[0] = _mm256_add_epi32 (globalSumsReg[0],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 0)))
        );
        globalSumsReg[1] = _mm256_add_epi32 (globalSumsReg[1],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums, 8)))
        );
        globalSumsReg[2] = _mm256_add_epi32 (globalSumsReg[2],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 0)))
        );
        globalSumsReg[3] = _mm256_add_epi32 (globalSumsReg[3],
            _mm256_cvtepu8_epi32 (_mm256_castsi256_si128 (_mm256_srli_si256(localSums2, 8)))
        );
#else
        // we update the global totals
        U256i tmp = { localSums };
        for (size_t k=0; k<32; k++)  {  globalSums[k] += tmp.a[k];  }
#endif
        // we reset the local totals
        localSums = _mm256_setzero_si256();
    }

#ifdef USE_AVX2_FOR_GRAND_TOTALS
    // We update the global totals into the final uint32_t array
    for (size_t r=0; r<4; r++)
    {
        U256i tmp = { globalSumsReg[r] };
        for (size_t k=0; k<8; k++)  {  globalSums[r*8+k] += tmp.b[k];  }
    }
#endif

    // we update the remaining local totals
    for (uint64_t i=steps*255; i<nbRows; i++)
    {
        UpdateLocalSums<1>(localSums, bitmap, k);
    }

    // we update the global totals
    U256i tmp = { localSums };
    for (size_t k=0; k<32; k++)  {  globalSums[k] += tmp.a[k];  }
}

////////////////////////////////////////////////////////////////////////////////
void execute (
    const char* name,
    void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint64_t*  globalSums),
    size_t nbRuns,
    uint64_t nbRows,
    u_int32_t* bitmap
)
{
    uint64_t  sums[32];

    double timeTotal=0;
    double cycleTotal=0;
    double timeTotal2=0;
    double cycleTotal2=0;
    uint64_t check=0;

    for (size_t n=0; n<nbRuns; n++)
    {
        memset(sums,0,sizeof(sums));

        // We want both time and cpu cycles information
        milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
        uint64_t c0 = ReadTSC();

        // We run the test
        (*fct) (nbRows, bitmap, sums);

        uint64_t c1 = ReadTSC();
        milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());

        timeTotal  += (t1-t0).count();
        cycleTotal += (double)(c1-c0) / nbRows;

        timeTotal2  += (t1-t0).count() * (t1-t0).count();
        cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);

        // We compute some dummy checksum
        for (size_t k=0; k<32; k++)  {  check += sums[k];  }
    }

    printf ("%-21s |  %5.0lf (%5.1lf)            |  %5.2lf (%4.2lf)          |  %.3lf           |  0x%lx\n",
        name,
        timeTotal / nbRuns,
        deviation (nbRuns, timeTotal2, timeTotal),
        cycleTotal/nbRuns,
        deviation (nbRuns, cycleTotal2, cycleTotal),
        check,
        nbRows * cycleTotal / timeTotal / 1000000.0
    );
}

////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    // We set rows number as 2^n where n is the provided argument
    // For simplification, we assume that the rows number is a multiple of 32
    uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
    size_t   nbRuns = argc>2 ? atoi(argv[2]) : 10;

    // We build a bitmap of size nbRows*32
    uint32_t* bitmap = new uint32_t[nbRows];
    if (bitmap==nullptr)
    {
        fprintf(stderr, "unable to allocate the bitmap\n");
        exit(1);
    }

    // We fill the bitmap with random values
    srand(time(nullptr));
    for (uint64_t i=0; i<nbRows; i++)    {  bitmap[i] = rand() & 0xFFFFFFFF;  }

    printf ("\n");
    printf ("nbRows=%ld  nbRuns=%ld\n", nbRows, nbRuns);
    printf ("------------------------------------------------------------------------------------------------------------\n");
    printf ("name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum\n");
    printf ("------------------------------------------------------------------------------------------------------------\n");

    // We launch the benchmark
    execute ("naive (transpo)   ", test_transpose_popcnt_naive,  nbRuns, nbRows, bitmap);
    execute ("naive (row by row)", test_update_row_by_row_naive, nbRuns, nbRows, bitmap);
    execute ("AVX2  (transpo)   ", test_transpose_popcnt_avx2,   nbRuns, nbRows, bitmap);
    execute ("AVX2  (row by row)", test_update_row_by_row_avx2,  nbRuns, nbRows, bitmap);

    printf ("\n");

    // Some clean up
    delete[] bitmap;

    return EXIT_SUCCESS;
}

////////////////////////////////////////////////////////////////////////////////
__m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);

    // Each byte gets the source byte containing the corresponding bit
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf     = _mm256_shuffle_epi8(xbcast, shufmask);
    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_and_si256(shuf, andmask);

    // Avoid an _mm256_add_epi8 thanks to Peter Cordes's comment
    return _mm256_cmpeq_epi8(isolated_inverted, andmask);
}

////////////////////////////////////////////////////////////////////////////////
void sse_trans(char const *inp, char *out)
{
#define INP(x,y) inp[(x)*4 + (y)/8]
#define OUT(x,y) out[(y)*4 + (x)/8]

    int rr, cc, i, h;
    union { __m256i x; uint8_t b[32]; } tmp;

    for (cc = 0; cc < 32; cc += 8)
    {
        for (i = 0; i < 32; ++i)
            tmp.b[i] = INP(i, cc);

        for (i = 8; i--; tmp.x = _mm256_slli_epi64(tmp.x, 1))
            *(uint32_t*)&OUT(0, cc + i) = _mm256_movemask_epi8(tmp.x);
    }
}

////////////////////////////////////////////////////////////////////////////////
double deviation (double n, double sum2, double sum)  {  return sqrt (sum2/n - (sum/n)*(sum/n)); }

Some remarks:

  • I used Agner Fog's asmlib to get a function that returns CPU cycles
  • The compilation command is g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64
  • The gcc version is 7.3.1
  • The CPU is Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  • I compute some dummy checksum to compare the results of the different tests

Now the results:

------------------------------------------------------------------------------------------------------------
name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum
------------------------------------------------------------------------------------------------------------
naive (transpo)       |   4548 ( 36.5)            |  43.91 (0.35)          |  2.592           |  0x9affeb5a6
naive (row by row)    |   3033 ( 11.0)            |  29.29 (0.11)          |  2.592           |  0x9affeb5a6
AVX2  (transpo)       |    767 ( 12.8)            |   7.40 (0.12)          |  2.592           |  0x9affeb5a6
AVX2  (row by row)    |    130 (  4.0)            |   1.25 (0.04)          |  2.591           |  0x9affeb5a6

So it seems that the "row by row" in AVX2 is the best so far.

Note that when I saw this result (less than 2 cycles per row), I stopped trying to optimize the AVX2 "transpose+popcount" method; doing so should be feasible by computing several popcounts in parallel (I may test it later).

Laaspere answered 2/8, 2018 at 11:23 Comment(10)
_mm256_cmpeq_epi8(isolated_inverted, andmask) should work to avoid the extra vpaddb. (But then you have 0/-1 instead of 0/1. subtract it instead of adding it. 0 - 1 - 1 = 2, just like 0 + 1 + 1). Or use andn instead of and like I suggested in comments on Z Boson's answer: Fastest way to unpack 32 bits to a 32 byte SIMD vector. But actually, comparing against a mask you already need in a register is probably best.Weekly
I meant 0 - (-1) - (-1) = 2 of course. You can make the outer loop more efficient by using SIMD for that: unpack to 16-bit or 32-bit elements and add to two or four vectors. Wrap another loop around that if you really want 64-bit counters to avoid overflow, and account for the in-lane ordering of _mm256_unpacklo/hi_epi8(localsums, _mm256_setzero_si256()) in that outer-most loop. Looping over every byte of a __m256i every 255 iterations is probably a bit expensive.Weekly
I wonder if you can do anything interesting like unpacking to only nibbles instead of whole bytes, maybe with variable-shift and pshufb to select bytes based on the low 2 bits. Those 4-bit partial sums would overflow into each other after 16 iterations, but you can unpack to 16-bit every 15 iterations (with unpacklo/hi into _mm256_and_si256 or _mm_srli_epi16, producing four vectors of 16-bit accumulators), with maybe better throughput than what you're doing now with 8-bit.Weekly
@PeterCordes Good point for skipping the extra vpaddb; I edited the answer to take it into account.Laaspere
I also make two changes: 1) do some stats by running several times each test to get means + standard deviation 2) unroll the inner loop (done with template specialization). With the skip of vpaddb, it goes from 1.89 cycles/row to 1.28 for the "AVX2 row by row" method.Laaspere
Cool. Your table headers should probably say mean (sd) to match the formatting of the entries.Weekly
localSums2 = _mm256_broadcastsi128_si256 is unnecessary; some compilers might not optimize it away. Just make it a __m128i! Or better, do a lot less total shuffling inside the inner-most loop and use _mm256_unpacklo/hi_epi8(localSums, _mm256_setzero_si256());, then expand each of those 2 results up to 32-bit with _mm256_unpacklo/hi_epi16. Then you have 6 total shuffles instead of 7 for your vextracti128 and vpsrldq to feed vpmovsxbd (assuming >> 0 optimizes). None of the _mm256_unpacklo_epi8 or hi are lane-crossing, so OoO exec has less latency to hide.Weekly
Thanks for the suggestion; I tried it and now one gets 1.14 cycle/row (not yet edited the answer)Laaspere
cool! Like I suggested in an earlier comment, you could nest another loop to delay unpacking to 32 for another up to 256 iterations, so you only unpack 2 ways in the 2nd-from-inner loop. (I wonder if there's anything to gain from using a lower count, under 23 or so, so SKL's branch predictors can perfectly predict the pattern. Maybe not; OoO exec can likely run the loop overhead ahead of the important stuff and hide the miss recovery without losing throughput on the critical path. So go for 256 iters of another loop.)Weekly
The last 2 args to printf in execute() are reversed. g++ -Wall warns about it. It happens to work anyway because the x86-64 System V calling convention puts integer args in integer regs and FP args in FP regs. So it doesn't actually change how the args are passed to printf in x86-64 SysV, but would in many other calling conventions.Weekly

I eventually wrote another implementation, following the high entropy SWAR approach proposed by Peter Cordes. This implementation is recursive and relies on C++ template specialization.

The general idea is to fill N-bit accumulators to their maximum without carry overflow (this is where recursion is used). When these accumulators are full, we update the grand totals and start again with fresh N-bit accumulators, until all rows have been processed.

Here is the code (see function test_SWAR_recursive):

#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cinttypes>
#include <cmath>
#include <cstdio>
#include <cstring>

using namespace std;
using namespace std::chrono;

// avoid the #include <asmlib.h>
extern "C" u_int64_t ReadTSC();

static double deviation (double n, double sum2, double sum)  {  return sqrt (sum2/n - (sum/n)*(sum/n)); }

////////////////////////////////////////////////////////////////////////////////
// Recursive SWAR approach (with template specialization)
////////////////////////////////////////////////////////////////////////////////

template<int DEPTH>
struct RecursiveSWAR
{
    // Number of accumulators for current depth
    static const int N = 1<<DEPTH;

    // Array of N-bit accumulators
    typedef __m256i Array[N];

    // Magic numbers (0x55555555, 0x33333333, ...) computed recursively
    static const u_int32_t MAGIC_NUMBER =
        RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER
            * (1 + (1<<(1<<(DEPTH-1))))
            / (1 + (1<<(1<<(DEPTH+0))));

    static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array accumulators)
    {
        // We reset the N-bit accumulators
        for (int i=0; i<N; i++)  {  accumulators[i] = _mm256_setzero_si256();  }

        // We check (only for depth big enough) that we have still rows to process
        if (DEPTH>=3)  if (begin>=end)  { return; }

        typename RecursiveSWAR<DEPTH-1>::Array accumulatorsMinusOne;

        // We load a register with the mask
        __m256i mask = _mm256_set1_epi32 (RecursiveSWAR<DEPTH-1>::MAGIC_NUMBER);

        // We fill the N-bit accumulators to their maximum capacity without carry overflow
        for (int i=0; i<N+1; i++)
        {
            // We fill (N-1)-bit accumulators recursively
            RecursiveSWAR<DEPTH-1>::fillAccumulators (begin, end, accumulatorsMinusOne);

            // We update the N-bit accumulators from the (N-1)-bit accumulators
            for (int j=0; j<RecursiveSWAR<DEPTH-1>::N; j++)
            {
                // LOW part
                accumulators[2*j+0] = _mm256_add_epi32 (
                    accumulators[2*j+0],
                    _mm256_and_si256 (
                        accumulatorsMinusOne[j],
                        mask
                    )
                );

                // HIGH part
                accumulators[2*j+1] = _mm256_add_epi32 (
                    accumulators[2*j+1],
                    _mm256_and_si256 (
                        _mm256_srli_epi32 (
                            accumulatorsMinusOne[j],
                            RecursiveSWAR<DEPTH-1>::N
                        ),
                        mask
                    )
                );
            }
        }
    }
};

// Template specialization for DEPTH=0
template<>
struct RecursiveSWAR<0>
{
    static const int N = 1;

    typedef __m256i Array[N];

    static const u_int32_t MAGIC_NUMBER = 0x55555555;

    static void fillAccumulators (u_int32_t*& begin, const u_int32_t* end, Array result)
    {
        // We just load 8 rows in the AVX2 register
        result[0] = _mm256_loadu_si256 ((__m256i*)begin);

        // We update the iterator
        begin += 1*sizeof(__m256i)/sizeof(u_int32_t);
    }
};

template<int DEPTH> struct TypeInfo  { };
template<> struct TypeInfo<3>  {  typedef u_int8_t  Type; };
template<> struct TypeInfo<4>  {  typedef u_int16_t Type; };
template<> struct TypeInfo<5>  {  typedef u_int32_t Type; };

unsigned char reversebits (unsigned char b)
{
    return ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
}

void test_SWAR_recursive (uint64_t nbRows, const uint32_t* bitmap, uint32_t*  globalSums)
{
    static const int DEPTH = 4;

    RecursiveSWAR<DEPTH>::Array accumulators;

          uint32_t* begin = (uint32_t*) bitmap;
    const uint32_t* end   = bitmap + nbRows;

    // We reset the grand totals
    for (int i=0; i<32; i++)  { globalSums[i] = 0; }

    while (begin < end)
    {
        // We fill the N-bit accumulators to the maximum without overflow
        RecursiveSWAR<DEPTH>::fillAccumulators (begin, end, accumulators);

        // We update grand totals from the filled N-bit accumulators
        for (int i=0; i<RecursiveSWAR<DEPTH>::N; i++)
        {
            int r = reversebits(i) >> (8-DEPTH);
            u_int32_t* sums   = globalSums+r;
            TypeInfo<DEPTH>::Type*  values = (TypeInfo<DEPTH>::Type*) (accumulators+i);

            for (int j=0; j<8*(1<<(5-DEPTH)); j++)
            {
                sums[(j*RecursiveSWAR<DEPTH>::N) % 32] += values[j];
            }
        }
    }
}

////////////////////////////////////////////////////////////////////////////////
void execute (
    const char* name,
    void (*fct)(uint64_t nbRows, const uint32_t* bitmap, uint32_t*  globalSums),
    size_t nbRuns,
    uint64_t nbRows,
    u_int32_t* bitmap
)
{
    uint32_t  sums[32];

    double timeTotal=0;
    double cycleTotal=0;
    double timeTotal2=0;
    double cycleTotal2=0;
    uint64_t check=0;

    for (size_t n=0; n<nbRuns; n++)
    {
        memset(sums,0,sizeof(sums));

        // We want both time and cpu cycles information
        milliseconds t0 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());
        uint64_t c0 = ReadTSC();

        // We run the test
        (*fct) (nbRows, bitmap, sums);

        uint64_t c1 = ReadTSC();
        milliseconds t1 = duration_cast< milliseconds >(system_clock::now().time_since_epoch());

        timeTotal  += (t1-t0).count();
        cycleTotal += (double)(c1-c0) / nbRows;

        timeTotal2  += (t1-t0).count() * (t1-t0).count();
        cycleTotal2 += ((double)(c1-c0) / nbRows) * ((double)(c1-c0) / nbRows);

        // We compute some dummy checksum
        for (size_t k=0; k<32; k++)  {  check += (k+1)*sums[k];  }
    }

    printf ("%-21s |  %5.0lf (%5.1lf)            |  %5.2lf (%5.3lf)         |  %.3lf           |  0x%lx\n",
        name,
        timeTotal / nbRuns,
        deviation (nbRuns, timeTotal2, timeTotal),
        cycleTotal/nbRuns,
        deviation (nbRuns, cycleTotal2, cycleTotal),
        nbRows * cycleTotal / timeTotal / 1000000.0,
        check/nbRuns
    );
}


////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    // We set rows number as 2^n where n is the provided argument
    // For simplification, we assume that the rows number is a multiple of 32
    uint64_t nbRows = 1ULL << (argc>1 ? atoi(argv[1]) : 28);
    size_t   nbRuns = argc>2 ? atoi(argv[2]) : 10;

    // We build a bitmap of size nbRows*32
    uint64_t actualNbRows = nbRows + 100000;
    uint32_t* bitmap = (uint32_t*)_mm_malloc(sizeof(uint32_t)*actualNbRows, 256);
    if (bitmap==nullptr)
    {
        fprintf(stderr, "unable to allocate the bitmap\n");
        exit(1);
    }
    memset (bitmap, 0, sizeof(u_int32_t)*actualNbRows);

    // We fill the bitmap with random values
    //    srand(time(nullptr));
    for (uint64_t i=0; i<nbRows; i++)    {  bitmap[i] = rand() & 0xFFFFFFFF;  }


    printf ("\n");
    printf ("nbRows=%ld  nbRuns=%ld\n", nbRows, nbRuns);
    printf ("------------------------------------------------------------------------------------------------------------\n");
    printf ("name                  | time in msec : mean (sd)  | cycles/row : mean (sd) | frequency in GHz | checksum\n");
    printf ("------------------------------------------------------------------------------------------------------------\n");

    // We launch the benchmark
    execute ("AVX2  (SWAR rec)  ", test_SWAR_recursive,          nbRuns, nbRows, bitmap);

    printf ("\n");

    // Some clean up
    _mm_free (bitmap);

    return EXIT_SUCCESS;
}

The size of the accumulators is 2^DEPTH bits in this code. Note that this implementation is valid up to DEPTH=5. For DEPTH=4, here are the performance results compared to the implementation of Peter Cordes (named high entropy SWAR):

(graph: cycles per row as a function of the number of rows, for both implementations)

The graph gives the number of cycles required to process a row (of 32 items) as a function of the number of rows of the matrix. As expected, the results are pretty similar since the main idea is the same. It is interesting to note the three parts of the graph:

  • constant value for log2(n)<=20
  • increasing value for log2(n) between 20 and 22
  • constant value for log2(n)>=22

I guess that CPU cache properties can explain this behaviour.

Laaspere answered 26/8, 2018 at 13:45 Comment(0)

For n = 4 and a matrix m packed into a uint16_t, it's of course trivial: { popcnt(m & 0x8888), popcnt(m & 0x4444), popcnt(m & 0x2222), popcnt(m & 0x1111) }. This is a reasonable approach for small n, up to 8. For larger values of n, you could extract 8x8 submatrices, use this approach on them, and accumulate. To extract the submatrices, there are SIMD instructions to extract or rearrange bytes which might be useful.
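For instance, for n = 8 the whole matrix fits in a uint64_t and the column masks simply repeat every 8 bits (a sketch of my own, assuming row-major packing with bit 0 = row 0 / column 0):

#include <cstdint>

// Column popcounts of an 8x8 bit matrix packed row-major into one uint64_t.
void column_popcounts_8x8 (uint64_t m, int counts[8])
{
    for (int c = 0; c < 8; c++)
        counts[c] = __builtin_popcountll (m & (0x0101010101010101ULL << c));
}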

Houppelande answered 7/3 at 9:14 Comment(0)
