For a current OpenCL GPGPU project, I need to sort the elements of an array according to a key with 64 possible values. The final array must have all elements with the same key contiguous. It's sufficient to produce an associative array new_index[old_index] as the output of this task.
I split the task into two parts. First, I count, for each possible key (bucket), how many elements have that key (i.e. go into that bucket). Then I scan this array (generate a prefix sum), which gives the starting index of each bucket's range in the new array.
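(The counting pass itself isn't the problem; in its simplest form it could be one global atomic increment per element, roughly as below. The names keys and bucket_count are just placeholders of mine, bucket_count is assumed to be zeroed beforehand, and the 64-entry prefix sum over it can then be done in a separate tiny kernel.)

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// First pass: histogram over the 64 possible keys, one work item per element.
__kernel void countBuckets(__global const uint *keys,      // 6-bit key of each element
                           __global uint *bucket_count)    // NUM_BUCKETS counters, zero-initialized
{
    uint gid = get_global_id(0);
    atomic_inc(&bucket_count[keys[gid]]);
}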
The second step then has to assign each element its new index. If I were to implement this on a CPU, the algorithm would be something like this:
for (uint e = 0; e < num_elements; ++e)          // for all elements e
    new_index[e] = bucket_start[bucket(e)]++;    // bucket(e): the key of element e
Of course, this doesn't translate directly to the GPU: every work item would need read-write access to the bucket_start array, which essentially means synchronizing all work items, the worst thing we can do.
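(The closest direct GPU translation I can think of is one global atomic per element, sketched below with a placeholder keys array; it produces the right new_index, but all work items sharing a key serialize on the same counter.)

// Direct port of the CPU loop: atomic_inc returns the previous value, which is
// exactly bucket_start[bucket(e)]++ from above. Note that bucket_start is
// consumed (incremented past its start values), just like in the CPU version.
__kernel void naivePutInBuckets(__global const uint *keys,
                                __global uint *bucket_start,
                                __global uint *new_index)
{
    uint gid = get_global_id(0);
    new_index[gid] = atomic_inc(&bucket_start[keys[gid]]);
}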
An idea is to put some computation in work groups. But I'm not sure how this should be done exactly, since I'm not experienced in GPGPU computing.
In global memory, we keep the bucket_start array initialized with the prefix sum from above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so I may be mixing up some terms here.)
Every work group is implicitly assigned a part of the input element array. It uses local bucket arrays holding new indices relative to the (global) bucket starts, which we do not yet know. Once one of these local buffers is full, the work group has to write the local buffers to the global array. For this, it locks access to the global bucket_start array, increments those values by the current local bucket sizes, unlocks, and can then write the results to the global new_index array (adding the corresponding offsets). This process repeats until all assigned elements are processed.
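For the local accumulation part, I picture a small helper along these lines (a sketch with my own names append_local, buckets and elem_idx; NUM_BUCKETS and LOCAL_MAX_BUCKET_SIZE are the same constants as in the kernel further down, and the local atomic needs the corresponding atomics support):

#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

// Append one element index to the work-group-local bucket of its key.
// `buckets` is a flat NUM_BUCKETS * LOCAL_MAX_BUCKET_SIZE array in local
// memory; `bucket_size` holds the current fill level of each local bucket.
inline void append_local(uint key, uint elem_idx,
                         __local uint *bucket_size,
                         __local uint *buckets)
{
    uint slot = atomic_inc(&bucket_size[key]);               // reserve a slot in this bucket
    buckets[key * LOCAL_MAX_BUCKET_SIZE + slot] = elem_idx;  // flushed to new_index later
}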
Two questions arise:
1. Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I'm trying to synchronize access to (at least a small part of) global memory. But maybe there is a much better approach, perhaps using kernel decomposition. Note that I try to avoid reading data back from the GPU to the CPU in between kernels (to avoid flushing the OpenCL command queue, which is also bad, as I was taught).
2. In the algorithm design above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items in "truly parallel" SIMD groups, like Nvidia warps: in my current code, all items of a work group would try to acquire the lock in SIMD fashion. Should I restrict this to the first work item only, and use barriers to keep the rest of the group locally in sync?
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void putInBuckets(__global uint *mutex,
                           __global uint *bucket_start,
                           __global uint *new_index)
{
    __local uint bucket_size[NUM_BUCKETS];
    __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
    __local uint l_bucket_start[NUM_BUCKETS];                // snapshot of the global offsets for one flush

    while (...) {
        // process a couple of elements locally until a local bucket is full
        ...

        // "lock"
        while (atomic_xchg(mutex, 1)) { }

        // "critical section"
        for (int b = 0; b < NUM_BUCKETS; ++b) {
            l_bucket_start[b] = bucket_start[b]; // where should we write?
            bucket_start[b] += bucket_size[b];   // update global offset
        }

        // "unlock"
        atomic_xchg(mutex, 0);

        // write to global memory by adding the offset
        for (...)
            new_index[...] = ... + l_bucket_start[b];
    }
}
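By "restrict this to the first work item" I mean replacing the lock / critical section / unlock part above with something like the following (same variables as in the kernel above; this assumes all work items of a group iterate the outer while loop the same number of times so the barriers are reached uniformly, and whether the spin lock itself is safe on SIMD hardware is exactly what I'm unsure about):

        barrier(CLK_LOCAL_MEM_FENCE);                        // local buckets are complete
        if (get_local_id(0) == 0) {
            while (atomic_xchg(mutex, 1)) { }                // "lock": only one work item spins
            for (int b = 0; b < NUM_BUCKETS; ++b) {
                l_bucket_start[b] = bucket_start[b];         // where this group's buckets go
                bucket_start[b] += bucket_size[b];           // reserve that range globally
            }
            atomic_xchg(mutex, 0);                           // "unlock"
        }
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); // everyone sees l_bucket_start
        // all work items then write their share of new_index using l_bucket_start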