Edit 2: Please take a look at this crosspost for TLDR.
Edit: Given that the particles are segmented into grid cells (say 16^3
grid), is it a better idea to let run one work-group for each grid cell and as many work-items in one work-group as there can be maximal number of particles per grid cell?
In that case I could load all particles from neighboring cells into local memory and iterate through them computing some properties. Then I could write specific value into each particle in the current grid cell.
Would this approach be beneficial over running the kernel for all particles and for each iterating over (most of the time the same) neighbors?
Also, what is the ideal ratio of number of particles/number of grid cells
?
I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:
- Buffer
P
holding all particles' 3D positions (float3
) Buffer
Sp
storingint2
pairs of particle ids and their spatial hashes.Sp
is sorted according to the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)Buffer
L
storingint2
pairs of starting and ending positions of particular spatial hashes in bufferSp
. Example:L[12] = (int2)(0, 50)
.L[12].x
is the index (inSp
) of the first particle with spatial hash12
.L[12].y
is the index (inSp
) of the last particle with spatial hash12
.
Now that I have all these buffers, I want to iterate through all the particles in P
and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
for(int x=-1; x<=1; x++)
for(int y=-1; y<=1; y++)
for(int z=-1; z<=1; z++) {
float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;
// ugly boundary checking
if ( dot(neigh_position<0, (float3)(1)) +
dot(neigh_position>BOUNDARY, (float3)(1)) != 0)
continue;
int neigh_hash = spatial_hash( neigh_position );
int2 particles_range = L[ neigh_hash ];
for(int p=particles_range.x; p<particles_range.y; p++)
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}
The problem with that code is that it's slow. I suspect the nonlinear GPU memory access (particulary P[Sp[p].y]
in the inner-most for
loop) to be causing the slowness.
What I want to do is to use Z-order curve as the spatial hash. That way I could have only 1 for
loop iterating through a continuous range of memory when querying neighbors. The only problem is that I don't know what should be the start and stop Z-index values.
The holy grail I want to achieve:
__kernel process_particles(float3* P, int2* Sp, int2* L, int* Out) {
size_t gid = get_global_id(0);
float3 curr_particle = P[gid];
int processed_value = 0;
// How to accomplish this??
// `get_neighbors_range()` returns start and end Z-index values
// representing the start and end near neighbors cells range
int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
int last_particle_id = L[ nearest_neighboring_cells_range.y ].y;
for(int p=first_particle_id; p<=last_particle_id; p++) {
processed_value += heavy_computation( P[ Sp[p].y ] );
}
Out[gid] = processed_value;
}