CUDA, how to find the first item in an array that makes a function maximal
In CUDA C++, I have a big array Arr of integers and a function F: int -> int. I want to find the first (lowest) index i in Arr such that F(Arr[i]) is maximal.

How can I write a kernel that keeps the running maximal value (of F) to compare against, using atomic operations to avoid race conditions?

BTW, I wonder if I can use the functions in the Thrust library for this purpose instead.

Rumery answered 6/12, 2022 at 23:1 Comment(3)
Write a transformation operation where each thread tests its assigned array element, producing a functional result and its thread (array) index. Then use a reduction operation to find, among the functional results across threads, the maximum value with the minimum thread index. Yes, this would be a sensible thing to do with thrust.Lipoprotein
@RobertCrovella Thanks Robert. So apparently I need another big array to hold the result of the transformation, yes? Or can I do the transformation and reduction together without touching extra memory?Rumery
In my answer I show how to do the transformation and reduction together (at least for the thrust case, and arguably for the atomic case). Neither example overwrites the original array; neither example requires an extra explicit allocation.Lipoprotein

How can I write a kernel that keeps the running maximal value (of F) to compare against, using atomic operations to avoid race conditions?

Based on your description, including the use of int and the desire to use atomics, I would suggest a custom atomic. This should work for arrays of up to 4 billion elements:

$ cat t2154.cu
#include <iostream>

// the packed 64-bit quantity holds the int functional value in the low 32 bits...
__device__ __host__ int get_int(unsigned long long val){return reinterpret_cast<int *>(&val)[0];}
// ...and the unsigned array index in the high 32 bits
__device__ __host__ unsigned get_uns(unsigned long long val){return reinterpret_cast<unsigned *>(&val)[1];}
// true if the stored pair (fval, idx) should be kept over the candidate (fval1, idx1)
__device__ bool done(int fval, int fval1, unsigned idx, unsigned idx1){
  if (fval > fval1) return true;
  if ((fval == fval1) && (idx <= idx1)) return true;
  return false;
}
__device__ unsigned long long my_custom_atomic(unsigned long long *addr, int fval, unsigned idx){

  unsigned long long old = *addr;
  // fval is cast to unsigned before packing, so a negative value is not
  // sign-extended into the index field
  while (!done(get_int(old),fval, get_uns(old), idx))
    old = atomicCAS(addr, old, ((((unsigned long long)idx)<<32)|(unsigned)fval));
  return old;
}
const int minidx = 256;
// test function: peaks at t == minidx with value minidx
__device__ int f(int t){ return minidx - (t-minidx)*(t-minidx);}

// one thread per element: compute the functional result, then contend
// atomically for the global best (value, index) pair
__global__ void k(int *arr, unsigned long long *min, unsigned N){

  unsigned my_idx = blockIdx.x*blockDim.x+threadIdx.x;
  if (my_idx < N){
    int my_val = arr[my_idx];
    my_val = f(my_val);
    my_custom_atomic(min, my_val, my_idx);
  }
}
const unsigned my_N = 32768;

int main(){

  unsigned long long *min;
  cudaMallocManaged(&min, sizeof(min[0]));
  int *arr;
  cudaMallocManaged(&arr, sizeof(arr[0])*my_N);
  for (unsigned i = 0; i < my_N; i++) arr[i] = i;
  *min = 0xFFFFFFFF80000000ULL; //maximum unsigned index combined with minimum int value
  k<<<my_N/256, 256>>>(arr, min, my_N);
  cudaDeviceSynchronize();
  std::cout <<  " maximum val: " << get_int(*min) << " at index: " << get_uns(*min) << std::endl;
}
$ nvcc -o t2154 t2154.cu
$ compute-sanitizer ./t2154
========= COMPUTE-SANITIZER
 maximum val: 256 at index: 256
========= ERROR SUMMARY: 0 errors
$

We are assembling and disassembling the 64-bit quantity as needed, using the general method outlined in the programming guide for arbitrary atomic operations.
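For reference, the pattern in question is a compare-and-swap retry loop. The sketch below is modeled on the programming guide's well-known illustration of that method (a double-precision atomicAdd built from atomicCAS); the function name atomicAdd_double is mine, and the sketch is purely illustrative rather than part of the solution:

__device__ double atomicAdd_double(double* address, double val){
    unsigned long long* address_as_ull = (unsigned long long*)address;
    unsigned long long old = *address_as_ull, assumed;
    do {
        assumed = old;
        // try to swap in the updated value; retry if another thread
        // changed the location in the meantime
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);  // matches the atomicAdd convention: return the old value
}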

I wonder if I can use the functions in the Thrust library for this purpose instead.

Yes, you can do this with a transform and a reduce operation in thrust. In fact thrust can combine these into a single algorithm call. Here is an example:

$ cat t2155.cu
#include <iostream>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <limits>

const int minidx = 256;

const size_t my_N = 32768;

// transformation: replace the element value with its functional result,
// carrying the element's index along unchanged
struct f
{
  template <typename T>
  __host__ __device__ T operator()(T t) {
    T result = t;
    thrust::get<0>(result) = minidx - (thrust::get<0>(t) - minidx)*(thrust::get<0>(t) - minidx);
    return result;
  }
};

// reduction: the larger functional value wins; ties resolve toward the
// smaller index, giving the first maximal element
struct r
{
  template <typename T1, typename T2>
  __host__ __device__ T1 operator()(T1 &t1, T2 &t2){
    if (thrust::get<0>(t1) > thrust::get<0>(t2)) return t1;
    if (thrust::get<0>(t1) < thrust::get<0>(t2)) return t2;
    if (thrust::get<1>(t1) < thrust::get<1>(t2)) return t1;
    return t2;
  }
};

int main(){

  thrust::device_vector<int> arr(my_N);
  thrust::sequence(arr.begin(), arr.end());
  auto my_zip = thrust::make_zip_iterator(thrust::make_tuple(arr.begin(), thrust::counting_iterator<size_t>(0)));
  auto init = thrust::make_tuple(std::numeric_limits<int>::min(), std::numeric_limits<size_t>::max());
  auto result = thrust::transform_reduce(my_zip, my_zip+my_N, f(), init, r());
  std::cout <<  " maximum val: " << thrust::get<0>(result) << " at index: " << thrust::get<1>(result) << std::endl;
}
$ nvcc -o t2155 t2155.cu
$ compute-sanitizer ./t2155
========= COMPUTE-SANITIZER
 maximum val: 256 at index: 256
========= ERROR SUMMARY: 0 errors
$
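As an aside: if you don't need the fused transform_reduce formulation, thrust can express this even more compactly with a transform_iterator feeding thrust::max_element, which follows the std::max_element convention of returning an iterator to the first maximal element. This is a minimal untested sketch under that assumption, reusing the same functional form as above (the functor name f_op is mine):

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>

struct f_op {
  __host__ __device__ int operator()(int t) const {
    return 256 - (t - 256) * (t - 256);   // same test function as the examples above
  }
};

int main(){
  thrust::device_vector<int> arr(32768);
  thrust::sequence(arr.begin(), arr.end());
  // view the array through the function, without materializing the results
  auto first = thrust::make_transform_iterator(arr.begin(), f_op());
  auto pos = thrust::max_element(first, first + arr.size());
  std::cout << " maximum val: " << *pos << " at index: " << (pos - first) << std::endl;
}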

Notes:

  • Comparing those two implementations/examples, I know which one I would choose. The atomic method is brittle, type-limited, and size-limited. The thrust example could be adapted to handle types more flexibly (e.g. a function that returns a 64-bit type) and could be extended to handle beyond 4 billion elements; a sketch of such an adaptation follows these notes.

  • The code here is just intended to be a possible roadmap. It's not thoroughly tested; bugs are always possible.

  • There is a strong correspondence between the two methods. The main() routines have almost a 1:1 correspondence, which hopefully you can identify. Furthermore, the r() functor corresponds to the done() function, and the f() functor corresponds to the f() function.

  • Don't assume that you can trivially scale my atomic example up to 4 billion elements. The f() function I wrote would overflow/underflow an int variable. But with an appropriate data array and f() function, it should be possible to use up to 4 billion elements.
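As a rough sketch of the adaptation mentioned in the first note above (untested; it mirrors t2155.cu, with the functional result widened to long long so large element values cannot overflow int, and size_t indices retained):

#include <iostream>
#include <thrust/transform_reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <limits>

const long long minidx = 256;

// widened transformation: compute the functional result in long long
struct f64
{
  template <typename T>
  __host__ __device__ thrust::tuple<long long, size_t> operator()(T t) {
    long long v = thrust::get<0>(t);   // widen the element value before arithmetic
    return thrust::make_tuple(minidx - (v - minidx)*(v - minidx), thrust::get<1>(t));
  }
};

// reduction unchanged from t2155.cu: larger value wins, ties resolve
// toward the smaller index
struct r
{
  template <typename T1, typename T2>
  __host__ __device__ T1 operator()(T1 &t1, T2 &t2){
    if (thrust::get<0>(t1) > thrust::get<0>(t2)) return t1;
    if (thrust::get<0>(t1) < thrust::get<0>(t2)) return t2;
    if (thrust::get<1>(t1) < thrust::get<1>(t2)) return t1;
    return t2;
  }
};

int main(){
  const size_t my_N = 32768;
  thrust::device_vector<int> arr(my_N);
  thrust::sequence(arr.begin(), arr.end());
  auto my_zip = thrust::make_zip_iterator(thrust::make_tuple(arr.begin(), thrust::counting_iterator<size_t>(0)));
  auto init = thrust::make_tuple(std::numeric_limits<long long>::min(), std::numeric_limits<size_t>::max());
  auto result = thrust::transform_reduce(my_zip, my_zip+my_N, f64(), init, r());
  std::cout << " maximum val: " << thrust::get<0>(result) << " at index: " << thrust::get<1>(result) << std::endl;
}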

EDIT: As suggested in the comments below, we may be able to do a better job in the atomic case by doing a threadblock-level shared-memory sweep reduction followed by a single atomic per threadblock. Here is an example of that:

#include <iostream>
const int nTPB=512;
const unsigned long long initval = 0xFFFFFFFF80000000ULL; // maximum unsigned index combined with minimum int value

__device__ __host__ int get_int(unsigned long long val){return reinterpret_cast<int *>(&val)[0];}

__device__ __host__ unsigned get_uns(unsigned long long val){return reinterpret_cast<unsigned *>(&val)[1];}

__device__ bool done(int fval, int fval1, unsigned idx, unsigned idx1){
  if (fval > fval1) return true;
  if ((fval == fval1) && (idx <= idx1)) return true;
  return false;
}

__device__ unsigned long long my_custom_atomic(unsigned long long *addr, int fval, unsigned idx){

  unsigned long long old = *addr;
  // as above, fval is cast to unsigned before packing so a negative value
  // is not sign-extended into the index field
  while (!done(get_int(old),fval, get_uns(old), idx))
    old = atomicCAS(addr, old, ((((unsigned long long)idx)<<32)|(unsigned)fval));
  return old;
}
const int minidx = 256;

// test function: peaks at t == minidx with value minidx
__device__ int f(int t){ return minidx - (t-minidx)*(t-minidx);}

// pick whichever packed (value, index) pair wins under the done() criterion
__device__ unsigned long long my_reduce(unsigned long long t1, unsigned long long t2){
  if (done(get_int(t1), get_int(t2), get_uns(t1), get_uns(t2))) return t1;
  return t2;
}
// grid-stride loop fills shared memory, then a block-level sweep reduction,
// then one atomic per threadblock
__global__ void k(int *arr, unsigned long long *min, unsigned N){

  __shared__ unsigned long long smem[nTPB];
  smem[threadIdx.x] = initval;
  // each thread folds its elements into its shared slot; the functional
  // result is cast to unsigned so a negative value is not sign-extended
  // into the index field
  for (unsigned int idx = blockIdx.x*blockDim.x+threadIdx.x; idx < N; idx+=gridDim.x*blockDim.x)
      smem[threadIdx.x] = my_reduce(smem[threadIdx.x], (((unsigned long long)idx)<<32)|(unsigned)f(arr[idx]));
  // standard shared-memory sweep reduction down to smem[0]
  for (int t = nTPB>>1; t > 0; t>>=1){
      __syncthreads();
      if (threadIdx.x < t) smem[threadIdx.x] = my_reduce(smem[threadIdx.x], smem[threadIdx.x+t]);}
  // one atomic per threadblock publishes the block's best pair
  if (!threadIdx.x) my_custom_atomic(min, get_int(smem[0]), get_uns(smem[0]));
}

const unsigned my_N = 32768;

int main(){

  unsigned long long *min;
  cudaMallocManaged(&min, sizeof(min[0]));
  int *arr;
  cudaMallocManaged(&arr, sizeof(arr[0])*my_N);
  for (unsigned i = 0; i < my_N; i++) arr[i] = i;
  arr[1024] = minidx;  // a second maximal value at a higher index, to exercise the tie-break
  *min = initval;
  k<<<(my_N+nTPB-1)/nTPB, nTPB>>>(arr, min, my_N);
  cudaDeviceSynchronize();
  std::cout <<  " maximum val: " << get_int(*min) << " at index: " << get_uns(*min) << std::endl;
}
Lipoprotein answered 7/12, 2022 at 2:20 Comment(6)
Is there any particular reason for using 256 for the number of blocks? Can I use 1024 to maximize the speed?Rumery
Yes. You should size the grid to match the problem size; general CUDA grid sizing strategies apply here. I didn't mean to communicate anything by that choice, and the code should not be especially sensitive to the number of threads per block. (I think you are asking about threads per block, not the number of blocks, but in any event both should be chosen according to typical CUDA grid sizing strategies, which are not the focus of this question.) My grid sizing also doesn't properly handle arbitrary sizes; again, that isn't the focus of what I'm communicating. I assume some basic CUDA knowledge here.Lipoprotein
If you'd like to learn about general CUDA grid sizing strategies, there are many questions here on SO about it, or else I recommend this online training series.Lipoprotein
I implemented a mixture of your ideas and my own and it worked properly. Thanks. My final question: is your first code (without the Thrust library) fast enough, or should I use shared memory at least for the reduction?Rumery
For an ordinary reduction I would say that it would be best to do a threadblock-level sweep followed by a single atomic per threadblock. That methodology is outlined in unit 5 of the training series I suggested. In this case, layering the minimum thread index on top of the max-finding reduction, I'm not sure which would be best. It would probably require a test case, and I'm sure it is going to be data-dependent how much benefit you get out of a shared sweep first. Only you can decide what is "fast enough". And I would strongly suggest and prefer the thrust approach.Lipoprotein
I added an example using shared memory for a threadblock-level sweep reduction. The atomic still makes sense to use as the final threadblock-level operation.Lipoprotein
