How to set bits of a bit vector efficiently in CUDA?

The task is like How to set bits of a bit vector efficiently in parallel?, but for CUDA.

Consider a bit vector of N bits (N is large, e.g. 4G) and an array of M numbers (M is also large, e.g. 1G), each in the range 0..N-1, indicating which bit of the vector must be set to 1. The bit vector is just an array of integers, specifically uint32_t.
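
For clarity, the required semantics are simply the following (a minimal sequential reference; the function name is just for illustration):

//Sequential reference: for every entry of indices, set the corresponding bit of bitset.
void BitwiseSetReference(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  for (uint32_t i = 0; i < n_indices; i++) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    bitset[index >> 5] |= 1u << (index & 31); //word = index / 32, bit = index % 32
  }
}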

I've tried a naive implementation with atomicOr() on the global memory:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  for(uint32_t i=i_thread; i<n_indices; i +=n_threads) {
    const uint32_t index = indices[i];
    assert(index < n_bits);
    const uint32_t i_word = index >> 5;
    const uint32_t i_bit = index & 31;
    atomicOr(bitset+i_word, 1u<<(i_bit));
  }
}

And it produces interesting results for 4G bits and 1G indices:

  • RTX3090: 0.0383266 sec. for sorted indices vs. 0.332674 sec. for unsorted (sorted is 8.68x faster)
  • RTX2080: 0.0564464 sec. for sorted indices vs. 1.23666 sec. for unsorted (sorted is 21.91x faster)

So it seems the devices coalesce/merge multiple atomicOr() operations within a warp, and perhaps the L1 cache kicks in, so that when indices conflict (the sorted case, where neighboring indices usually fall into the same 32-bit word), the atomic updates are in reality much faster than for non-conflicting, scattered indices (the unsorted case).
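
For reference, explicit warp-level aggregation could look roughly like this (a sketch I have not benchmarked; __match_any_sync() needs sm_70+, which both cards have, and it assumes a 1D block):

//Sketch: lanes of a warp that target the same 32-bit word combine their bits and
//elect one lane to issue a single atomicOr() for the whole group.
__device__ void WarpAggregatedOr(const uint32_t index, uint32_t *bitset)
{
  const uint32_t i_word = index >> 5;
  const uint32_t my_bit = 1u << (index & 31);
  const uint32_t lane = threadIdx.x & 31;                          //assumes a 1D block
  const uint32_t peers = __match_any_sync(__activemask(), i_word); //lanes hitting the same word
  const uint32_t leader = __ffs(peers) - 1;                        //lowest peer lane is the leader
  uint32_t combined = 0;
  for (uint32_t p = peers; p != 0; p &= p - 1) {                   //gather every peer's bit
    combined |= __shfl_sync(peers, my_bit, __ffs(p) - 1);
  }
  if (lane == leader) {
    atomicOr(bitset + i_word, combined);                           //one atomic per distinct word per warp
  }
}

Given the numbers above, the hardware/L2 may already be doing something equivalent, so this is mostly a comparison point.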

Can we further improve the sorted or unsorted case?

UPDATE: answering the comments, any solution is of interest, whether for the sorted or the unsorted case, with or without repetitions. Unsorted with repetitions is the most generic case, so it would be of the most interest.

UPDATE2: following the suggestion to vectorize memory accesses, I implemented this:

__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices, const uint32_t n_bits, uint32_t *bitset) {
  const uint32_t n_threads = blockDim.x * gridDim.x;
  const uint32_t i_thread = threadIdx.x + blockDim.x * blockIdx.x;
  const uint32_t n_vectors = n_indices / 4;
  for(uint32_t i=i_thread; i<n_vectors; i +=n_threads) {
    const uint4 v_index = reinterpret_cast<const uint4*>(indices)[i];
    assert(v_index.x < n_bits);
    assert(v_index.y < n_bits);
    assert(v_index.z < n_bits);
    assert(v_index.w < n_bits);
    uint4 vi_word, vi_bit;
    vi_word.x = v_index.x >> 5;
    vi_word.y = v_index.y >> 5;
    vi_word.z = v_index.z >> 5;
    vi_word.w = v_index.w >> 5;
    vi_bit.x = v_index.x & 31;
    vi_bit.y = v_index.y & 31;
    vi_bit.z = v_index.z & 31;
    vi_bit.w = v_index.w & 31;
    atomicOr(bitset+vi_word.x, 1u<<vi_bit.x);
    atomicOr(bitset+vi_word.y, 1u<<vi_bit.y);
    atomicOr(bitset+vi_word.z, 1u<<vi_bit.z);
    atomicOr(bitset+vi_word.w, 1u<<vi_bit.w);
  }
  if(i_thread < 4) {
    const uint32_t tail_start = n_vectors*4;
    const uint32_t tail_len = n_indices - tail_start;
    if(i_thread < tail_len) {
      const uint32_t index = indices[tail_start+i_thread];
      assert(index < n_bits);
      const uint32_t i_word = index >> 5;
      const uint32_t i_bit = index & 31;
      atomicOr(bitset+i_word, 1u<<i_bit);
    }
  }
}

But at least on RTX2080 it's slower (I don't have the eGPU with RTX3090 with me right now to test):

  • RTX2080: 0.0815998 sec. for sorted vs. 1.39829 sec. for unsorted (sorted is 17.14x faster)
Picofarad answered 27/7, 2022 at 7:12 Comment(11)
If the sorted case is common, you could try a warp-level vote to see if coalescing is worth it, then use warp-level reduction to make a single (or a few) atomic operations only.Therron
@Therron Isn't that done automatically nowadays? See update to this blog post.Benthos
@Benthos yes, the performance measurements suggest it's done automatically in the hardware.Picofarad
Jup, seems like it is. See also here: on-demand.gputechconf.com/gtc/2013/presentations/…Therron
This problem is better described as "sparse to dense set representation". Except... you need to tell us whether the M values can have repetitions or not; and whether they're sorted or not.Average
@einpoklum, I've edited the question to address your comment. Basically, any solution would be of interest, but unsorted with repetitions case is more generic.Picofarad
Your peak throughput is 4.5GiB / .038s = 118 GiB/s which says you are not bandwidth bound (~12% of peak on 3090). Possibly latency bound. Have you run it in nsight-compute to see what the reported bottleneck is? My hunch of the best next step would be to use vectorized loads for the indices, and do 4 atomicOr per thread-iteration rather than one. See developer.nvidia.com/blog/…Trappist
@harrism, thanks, I've added this experiment to the question. The program became slower...Picofarad
Suggest you profile it.Trappist
@SergeRogatch, maybe it makes sense to specify launch bounds for your kernel, especially the one with vectorized access. And try to play around with the max registers per thread (perhaps just having 16 regs per thread is more than enough here). Also, configure the cuda runtime to use 48k L1 cache and 16k registers or so per multiprocessor.Tidwell
If the data is not sorted, then you're doing scattered writes. These will never run at the full speed of the hardware, because when fully scattered each batch of writes only uses 1/32th of a cacheline, hence your code (at worst) will only run at ~3% of max. In your case you run at 12%, so there is some locality in your dataset. If your data were sorted, then you could cache the output in a shared buffer and only write that out when the buffer is full.Nessi

This is not a full answer, but I have too much code for a comment.

Your code is mainly limited by the scattered atomic writes.
There is a limit to the number of cache lines a block (actually an SM) within the GPU can write to per cycle. If you try to write to more cache lines, those writes are stalled until the earlier ones have settled. This is the performance difference you are seeing. atomicOr() does not get bundled per warp; only atomicAdd(x, 1) and atomicInc() get this special treatment, and only when they write to the same destination.

So you can hardly expect to max out the memory bus. You can only do so if you fully utilize a cache line on every write (i.e. coalesced writes only).

However, you may speed this up a bit by prefetching the data using memcpy_async.

You will need to prefetch enough data to overcome the latency.

//prefetch count cannot be greater than 8!
//needs #include <cuda/pipeline> (libcu++) and -std=c++20 for the templated lambda;
//the cp.async hardware path needs sm_80+, on older GPUs memcpy_async falls back to plain copies
template <int my_blockdim, int prefetchcount>
__global__ void BitwiseSet(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  constexpr uint32_t buffersize = my_blockdim * prefetchcount;
  __shared__ uint32_t s_indices[buffersize];
  auto pipeline = cuda::make_pipeline(); //pipeline with thread_scope_thread

  //every block handles its own contiguous section of the data
  const uint32_t per_block = (n_indices + gridDim.x - 1) / gridDim.x;
  const uint32_t start = min(n_indices, per_block * blockIdx.x);
  const uint32_t end = min(n_indices, start + per_block);
  const uint32_t count = end - start;

  //prefetch the element at block-relative offset i into its shared-memory slot
  const auto prefetch = [&](const uint32_t i) {
    pipeline.producer_acquire(); //no-op for thread_scope_thread
    cuda::memcpy_async(&s_indices[i % buffersize], &indices[start + i],
        cuda::aligned_size_t<sizeof(uint32_t)>(sizeof(uint32_t)), pipeline);
    pipeline.producer_commit();
  };

  //prime the pump: each thread queues up to prefetchcount async copies
  for (uint32_t k = 0; k < prefetchcount; k++) {
    const uint32_t i = threadIdx.x + my_blockdim * k;
    if (i < count) prefetch(i);
  }

  const auto dowork = [&]<bool in_tail>(const uint32_t lo, const uint32_t hi) {
    for (uint32_t i = lo + threadIdx.x; i < hi; i += my_blockdim) {
      pipeline.consumer_wait(); //wait for this thread's oldest committed batch
      const uint32_t index = s_indices[i % buffersize]; //fast because mod by a constant
      pipeline.consumer_release();
      if constexpr (!in_tail) {
        prefetch(i + buffersize); //refill the slot we just consumed
      }
      assert(index < n_bits);
      const uint32_t i_word = index >> 5;
      const uint32_t i_bit = index & 31;
      atomicOr(bitset + i_word, 1u << i_bit);
    }
  };

  if (count > buffersize) {
    dowork.template operator()<false>(0, count - buffersize);    //main part: keep prefetching
    dowork.template operator()<true>(count - buffersize, count); //tail: only consume
  } else {
    dowork.template operator()<true>(0, count); //everything was already primed
  }
}

You can unroll this as needed by doing multiple memcpy_async() per pipeline.producer_commit() and adjusting the rest as needed.
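
For example, the prefetch lambda from the kernel above could become something like this (sketch only; `unroll` would be an extra template parameter, and the bounds checks near the end of the block's range are omitted):

  const auto prefetch = [&](const uint32_t i) {
    pipeline.producer_acquire();
    #pragma unroll
    for (int k = 0; k < unroll; k++) { //`unroll` consecutive per-thread elements per stage
      cuda::memcpy_async(&s_indices[(i + k * my_blockdim) % buffersize],
          &indices[start + i + k * my_blockdim],
          cuda::aligned_size_t<sizeof(uint32_t)>(sizeof(uint32_t)), pipeline);
    }
    pipeline.producer_commit(); //one commit covers all unroll copies
  };

The consuming loop in dowork would then step i by my_blockdim * unroll and process unroll elements per consumer_wait().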

Other suggestions
If your data is sorted, or if your output (mostly) falls into a narrow band, then you can stage the atomicOr() results in a shared-memory buffer. I already anticipated this a bit by 'banding' the data in the above code (each block handles its own contiguous section). For sorted data, write into a shared buffer; as soon as the destinations fall outside the buffer, write the buffer out to global memory (only the non-zero values), meanwhile perhaps do some more prefetching, and then reuse the buffer.
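
A rough sketch of that idea for sorted input (untested; each block owning a contiguous chunk, the WINDOW_WORDS size and the direct-global fallback for far-away indices are choices I made for illustration):

//Sketch only: assumes sorted indices, a 1D grid/block, and WINDOW_WORDS >= blockDim.x.
template <int WINDOW_WORDS>  //e.g. 2048 words = 64K bits = 8 KiB of shared memory
__global__ void BitwiseSetSorted(const uint32_t n_indices, const uint32_t *indices,
      const uint32_t n_bits, uint32_t *bitset)
{
  __shared__ uint32_t s_window[WINDOW_WORDS];
  const uint32_t per_block = (n_indices + gridDim.x - 1) / gridDim.x;
  const uint32_t begin = min(n_indices, per_block * blockIdx.x);
  const uint32_t end = min(n_indices, begin + per_block);
  if (begin >= end) return;

  uint32_t base = indices[begin] >> 5; //the window covers words [base, base + WINDOW_WORDS)

  //write out the non-zero words of the window and re-center it; called uniformly by the block
  const auto flush = [&](const uint32_t next_base) {
    __syncthreads();
    for (uint32_t w = threadIdx.x; w < WINDOW_WORDS; w += blockDim.x) {
      if (s_window[w] != 0) {
        atomicOr(bitset + base + w, s_window[w]); //only non-zero words, so we never write past the end
      }
      s_window[w] = 0;
    }
    base = next_base;
    __syncthreads();
  };

  for (uint32_t w = threadIdx.x; w < WINDOW_WORDS; w += blockDim.x) s_window[w] = 0;
  __syncthreads();

  for (uint32_t tile = begin; tile < end; tile += blockDim.x) {
    //the flush decision is uniform across the block because it only depends on indices[tile]
    if ((indices[tile] >> 5) >= base + WINDOW_WORDS) flush(indices[tile] >> 5);
    const uint32_t i = tile + threadIdx.x;
    if (i < end) {
      const uint32_t index = indices[i];
      const uint32_t word = index >> 5;
      const uint32_t bit = 1u << (index & 31);
      if (word < base + WINDOW_WORDS) atomicOr(&s_window[word - base], bit); //cheap shared atomic
      else atomicOr(bitset + word, bit); //very sparse stretch: bypass the window
    }
  }
  flush(base); //drain the last window
}

The shared-memory atomics are cheap, and for dense sorted data almost all updates land inside the window, so the global traffic drops to roughly one write per touched word per block.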

Nessi answered 6/5, 2024 at 11:17 Comment(0)
