efficient bitwise sum calculation
Is there an efficient way to calculate a bitwise sum of uint8_t buffers (assume the number of buffers is <= 255, so that each sum fits in a uint8)? Basically I want to know how many of the buffers have a bit set at the i'th bit position.

Ex: For 2 buffers

uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1... 

are there any BLAS or boost routines for such a requirement?

This is a highly vectorizable operation IMO.

UPDATE: Following is a naive impl of the requirement

for (auto&& buf: buffers){
  for (int i = 0; i < buf_len; i++){
    for (int b = 0; b < 8; ++b) {
      sum[i*8 + b] += (buf[i] >> b) & 1;
    }
  }
}
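For reference, the naive loop can be wrapped into a self-contained function and checked against the two-buffer example above (note that with this indexing, sum[i*8 + 0] counts the least-significant bit of byte i, so the printed order is the reverse of the MSB-first bit strings shown; `bitwise_sum` is just a name for this sketch):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive bit-column sum: sum[i*8 + b] counts how many buffers have
 * bit b (LSB-first) set in byte i. Same loop as in the question. */
void bitwise_sum(const uint8_t *bufs[], size_t n_bufs, size_t buf_len,
                 uint8_t sum[/* buf_len * 8 */]) {
    for (size_t n = 0; n < n_bufs; n++) {
        for (size_t i = 0; i < buf_len; i++) {
            for (int b = 0; b < 8; b++) {
                sum[i * 8 + b] += (bufs[n][i] >> b) & 1;
            }
        }
    }
}
```

With buf1[0] = 0x31 (0011 0001) and buf2[0] = 0x58 (0101 1000), reading sum from bit 7 down to bit 0 reproduces the 0 1 1 2 1 0 0 1 sequence from the example.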
Offence answered 7/10, 2021 at 15:53 Comment(1)
Comments are not for extended discussion; this conversation has been moved to chat.Mansion
An alternative to OP's naive code:

Perform 8 additions at once. Use a lookup table to expand the 8 bits of each input byte into 8 bytes, one byte per bit - see ones[].

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
  static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001, 
      /* 249 more pre-computed constants */ 0x0101010101010101};

  uint64_t sum[k];
  memset(sum, 0, sizeof sum);

  for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
    for (size_t i = 0; i < k; i++) {
      sum[i] += ones[buf[buf_index][i]];
    }
  }

  for (size_t i = 0; i < k; i++) {
    for (size_t bit = 0; bit < 8; bit++) {
      printf("%llu ", (unsigned long long) (0xFF & (sum[i] >> (8 * bit))));
    }
  }
}

See also @Eric Postpischil.
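Rather than typing out all 256 constants, the ones[] table could also be generated once at startup. A minimal sketch (build_ones is a made-up name, not part of the answer above):

```c
#include <stdint.h>

/* Expand each 8-bit value v into a 64-bit word in which byte b holds
 * bit b of v - the same layout as the hand-written ones[] table. */
void build_ones(uint64_t ones[256]) {
    for (int v = 0; v < 256; v++) {
        uint64_t x = 0;
        for (int b = 0; b < 8; b++) {
            if ((v >> b) & 1)
                x |= (uint64_t)1 << (8 * b);
        }
        ones[v] = x;
    }
}
```

Generating the table trades a few microseconds at startup for not having to maintain 256 literals by hand.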

Vivyan answered 7/10, 2021 at 17:22 Comment(2)
@chux-ReinstateMonica just a followup, this LUT would be 2kB. Do you think it would be significant and affect cache performance?Offence
@Offence Perhaps. There are many potential optimizations. The best ones usually emerge from testing against a test harness that actually rates the code. As OP did not provide one, we are left with many unanswered questions. IMO, the best solutions usually come from looking at the higher-level task rather than trading linear improvements as discussed here. Regarding cache, hundreds of millions of processors shipped per year are simple embedded ones, and in OP's case we do not know the platform.Vivyan
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.

#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};

    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf;  buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}

Try it on godbolt.

I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf in column-major order.

Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.

// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
        ld1r    {v0.8b}, [x1]         // replicate buf[buf_index][i] to all elements of v0
        add     x0, x0, 1             
        add     x1, x1, x20
        ushl    v0.8b, v0.8b, v2.8b   // shift
        and     v0.8b, v0.8b, v3.8b   // mask
        add     v1.8b, v1.8b, v0.8b   // accumulate
        cmp     x0, x19
        bne     .L4

I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)

It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
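As a starting point for that comparison, a portable correctness cross-check between the two approaches might look like the following (plain C, no vector extensions, so it compiles anywhere; the function names are mine, and a timing loop around each call would turn it into a benchmark):

```c
#include <stdint.h>
#include <string.h>

/* LUT variant: one 64-bit add accumulates 8 byte-wide counters
 * per input byte. Table built at runtime instead of hand-written. */
static void sum_lut(size_t n_bufs, size_t k,
                    const uint8_t *bufs[], uint64_t sum[/* k */]) {
    uint64_t ones[256];
    for (int v = 0; v < 256; v++) {
        uint64_t x = 0;
        for (int b = 0; b < 8; b++)
            if ((v >> b) & 1) x |= (uint64_t)1 << (8 * b);
        ones[v] = x;
    }
    memset(sum, 0, k * sizeof sum[0]);
    for (size_t n = 0; n < n_bufs; n++)
        for (size_t i = 0; i < k; i++)
            sum[i] += ones[bufs[n][i]];
}

/* Reference: shift-and-mask one bit at a time, same output layout. */
static void sum_ref(size_t n_bufs, size_t k,
                    const uint8_t *bufs[], uint64_t sum[/* k */]) {
    memset(sum, 0, k * sizeof sum[0]);
    for (size_t n = 0; n < n_bufs; n++)
        for (size_t i = 0; i < k; i++)
            for (int b = 0; b < 8; b++)
                sum[i] += (uint64_t)((bufs[n][i] >> b) & 1) << (8 * b);
}
```

Once both agree on random inputs, wrapping each in a timed loop on the target machine settles the question empirically.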

Peen answered 7/10, 2021 at 19:5 Comment(2)
I think traversing one buffer at a time would be more cache-efficient.Offence
@n1r44: You can certainly try it both ways and see. The tradeoff of your way is that we must perform an extra load and store of sums[i] on every iteration of the inner loop, and so you have to worry about sums staying hot in cache as well. Which one wins might depend on how large k is.Peen
