Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for a TL;DR.

Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum number of particles a cell can hold?

In that case I could load all particles from the neighboring cells into local memory and iterate through them, computing some properties. Then I could write a specific value to each particle in the current grid cell.

Would this approach be better than running the kernel over all particles, with each one iterating over its (mostly identical) neighbors?
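
To make this concrete, here is a rough, untested sketch of what I have in mind (GRID_SIZE, MAX_PER_CELL and process_cells are made-up names for this sketch; heavy_computation is the same placeholder as in my kernel below, and I assume L stores (first, one-past-last) ranges):

// Rough sketch (untested): one work-group per grid cell, launched with
// local size MAX_PER_CELL and global size num_cells * MAX_PER_CELL.
// GRID_SIZE and MAX_PER_CELL are placeholder compile-time constants.

#define GRID_SIZE    16   // 16^3 grid cells
#define MAX_PER_CELL 32   // assumed upper bound on particles per cell

__kernel void process_cells(__global const float3* P,
                            __global const int2*   Sp,  // (hash, particle id), sorted by hash
                            __global const int2*   L,   // per-hash (start, end) range in Sp
                            __global int*          Out)
{
    // Positions of all particles in the 3x3x3 neighborhood, staged in local
    // memory (27 * 32 * sizeof(float3) is roughly 13.5 KB here).
    __local float3 neigh_pos[27 * MAX_PER_CELL];

    int cell = (int)get_group_id(0);
    int lid  = (int)get_local_id(0);
    int lsz  = (int)get_local_size(0);

    // Decompose the linear cell index into 3D grid coordinates.
    int cx =  cell                            % GRID_SIZE;
    int cy = (cell / GRID_SIZE)               % GRID_SIZE;
    int cz =  cell / (GRID_SIZE * GRID_SIZE);

    // Cooperatively copy the particles of all existing neighbor cells.
    // `count` stays identical across work-items, so no atomics are needed.
    int count = 0;
    for (int dz = -1; dz <= 1; dz++)
      for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = cx + dx, ny = cy + dy, nz = cz + dz;
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= GRID_SIZE || ny >= GRID_SIZE || nz >= GRID_SIZE)
                continue;

            int2 range = L[(nz * GRID_SIZE + ny) * GRID_SIZE + nx];
            int  n     = range.y - range.x;        // assumes half-open ranges

            for (int j = lid; j < n; j += lsz)     // strided cooperative copy
                neigh_pos[count + j] = P[Sp[range.x + j].y];

            count += n;                            // assumes n <= MAX_PER_CELL
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item handles one particle of the current cell (if there is one).
    int2 own = L[cell];
    int  idx = own.x + lid;
    if (idx < own.y) {
        int processed_value = 0;
        for (int j = 0; j < count; j++)
            processed_value += heavy_computation(neigh_pos[j]);
        Out[Sp[idx].y] = processed_value;
    }
}
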

Also, what is the ideal ratio of number of particles/number of grid cells?


I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:

  • Buffer P holding all particles' 3D positions (float3)
  • Buffer Sp storing int2 pairs of (spatial hash, particle id) – .x is the hash, .y is the particle's index in P. Sp is sorted by the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)

  • Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50). (A sketch of one way to fill L from Sp appears after this list.)

    • L[12].x is the index (in Sp) of the first particle with spatial hash 12.
    • L[12].y is the index (in Sp) of the last particle with spatial hash 12.

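For completeness, one straightforward way to fill L from the sorted Sp buffer looks roughly like this (just a sketch, not my exact code; build_cell_ranges and num_particles are names made up for this illustration, and L is assumed to be cleared to empty ranges beforehand):

// Sketch: one work-item per entry of the hash-sorted Sp buffer.
// A cell's range starts where the hash differs from the previous entry
// and ends where it differs from the next one.
__kernel void build_cell_ranges(__global const int2* Sp,  // (hash, particle id), sorted by hash
                                __global int2*       L,   // output: (start, end) per hash
                                const int            num_particles)
{
    int i = (int)get_global_id(0);
    if (i >= num_particles) return;

    int hash = Sp[i].x;

    // First entry with this hash -> start of the range.
    if (i == 0 || Sp[i - 1].x != hash)
        L[hash].x = i;

    // Last entry with this hash -> end of the range. This stores one past the
    // last particle (matching the p < L[hash].y loop in the kernel below);
    // drop the +1 if you want the index of the last particle instead.
    if (i == num_particles - 1 || Sp[i + 1].x != hash)
        L[hash].y = i + 1;

    // Cells that contain no particles keep whatever L was initialised to.
}
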
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):

__kernel void process_particles(__global const float3* P, __global const int2* Sp, __global const int2* L, __global int* Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  for(int x=-1; x<=1; x++)
    for(int y=-1; y<=1; y++)
      for(int z=-1; z<=1; z++) {

        float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;

        // ugly boundary checking: skip neighbor cells outside the domain
        if ( any(neigh_position < (float3)(0.0f)) ||
             any(neigh_position > (float3)(BOUNDARY)) )
             continue;

        int  neigh_hash       = spatial_hash( neigh_position );
        int2 particles_range  = L[ neigh_hash ];

        // Sp[p].y is the particle's index into P
        for(int p=particles_range.x; p<particles_range.y; p++)
          processed_value += heavy_computation( P[ Sp[p].y ] );

      }

  Out[gid] = processed_value;
}

The problem with that code is that it's slow. I suspect the non-linear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

What I want to do is use a Z-order curve as the spatial hash. That way I could have a single for loop iterating through a contiguous range of memory when querying neighbors. The only problem is that I don't know what the start and stop Z-index values should be.

The holy grail I want to achieve:

__kernel void process_particles(__global const float3* P, __global const int2* Sp, __global const int2* L, __global int* Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  // How to accomplish this??
  // `get_neighbors_range()` returns start and end Z-index values
  // representing the start and end of the near-neighbor cell range
  int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
  int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
  int last_particle_id  = L[ nearest_neighboring_cells_range.y ].y;

  for(int p=first_particle_id; p<=last_particle_id; p++) {
      processed_value += heavy_computation( P[ Sp[p].y ] );
  }

  Out[gid] = processed_value;
}
Erastatus answered 28/7, 2016 at 20:27
As your comments suggest, I would definitely recommend reading the data into shared memory first. I'm not sure about the OpenCL terminology, but in CUDA I would have each block (which I think corresponds to a work-group for you) read in all particles in a certain area (and remember to add some padding, as you're doing a gather operation), process them in local memory, and finally write back the data if necessary. – Johns
@Johns thank you for your comment! It's great to have someone to interact with :). I'm not sure I understand the part "remember to add some padding as you're doing a gather operation" – could you please describe it in more detail? I understand it as: "create an array in local memory with an overestimated size". Having said that, it might be beneficial for others to look at my reformulated question on Computer Science to get a more densely formulated version of this question. – Erastatus
I would suggest two things: (1) as you write as well, first group the particles into cells. (2) Instead of only looping over particles, do a nested loop over cells and then over the particles within each cell. However, particles close to the edges of a cell will interact with particles from other cells. This is why padding is needed (i.e. if you're trying to determine the neighbours of particles in cell A, you will need to search cell A and all cells within the search radius). To do this efficiently you need to make sure you read in all the data first, and then process it. – Johns
@Johns thanks, now I understand. – Erastatus
I have no idea about nearest neighbours, but just in case it turns out to be useful, there are some simple tricks to access the neighbouring cells of a given Z-ordered point (tesseral arithmetic). – Schrock
@harold thanks for pointing that out. I'm definitely going to take a look at tesseral arithmetic – I haven't heard about it before. – Erastatus

You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains this very well.

Ericson – Real-Time Collision Detection

Here is another nice explanation including some tests:

Morton encoding/decoding through bit interleaving: Implementations

The Z-order curve only defines the path along which the coordinates are traversed, i.e. how you hash 2D or 3D coordinates down to a single integer. Although the curve can be subdivided further at every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel; where the sentinel stops tells you at which level the particle is placed. The maximum level you define therefore gives you the number of cells per dimension: with a maximum level of 6 you have 2^6 = 64, so 64x64x64 cells in your system (3D). That also means you have to use integer-based coordinates. If you use floats you have to convert them, e.g. coord.x = 64 * float_x, and so on.
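
Just to make the bit interleaving and the float-to-integer conversion concrete, here is a small sketch (my own illustration, not code from Ericson or the linked article; morton3d and cell_index are made-up names):

// Interleave the lowest `level` bits of x, y and z into one Morton / Z-order
// index: bit i of x goes to bit 3i, bit i of y to bit 3i+1, bit i of z to 3i+2.
unsigned int morton3d(unsigned int x, unsigned int y, unsigned int z, int level)
{
    unsigned int code = 0;
    for (int i = 0; i < level; i++) {
        code |= ((x >> i) & 1u) << (3 * i);
        code |= ((y >> i) & 1u) << (3 * i + 1);
        code |= ((z >> i) & 1u) << (3 * i + 2);
    }
    return code;
}

// Map a float position in [0, boundary)^3 to integer cell coordinates on a
// (2^level)^3 grid (64x64x64 for level 6) and return the cell's Morton index.
unsigned int cell_index(float px, float py, float pz, float boundary, int level)
{
    int cells = 1 << level;                                  // 64 at level 6
    unsigned int x = (unsigned int)(px / boundary * cells);  // coord.x = 64 * float_x
    unsigned int y = (unsigned int)(py / boundary * cells);
    unsigned int z = (unsigned int)(pz / boundary * cells);
    return morton3d(x, y, z, level);
}

A table-based or bit-trick implementation (as discussed in the linked article) will be faster, but the loop makes the mapping explicit.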

If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?

Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.

If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is afterwards sorted from cells to particles. Still, you should iterate over the particles and access their neighbours.

About your code:

The problem with that code is that it's slow. I suspect the non-linear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

Remember Donald Knuth: premature optimization is the root of all evil. You should measure where the bottleneck actually is. You can use NVIDIA's profiler and look for the bottleneck on the CUDA side; I'm not sure what OpenCL has as a profiler.
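
At the very least you can time individual kernels with OpenCL's event profiling. A minimal host-side sketch (time_kernel and the argument names are made up; the queue must be created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: enqueue a kernel with an event and read its start/end timestamps
 * (in nanoseconds). Error checking is omitted for brevity. */
void time_kernel(cl_command_queue queue, cl_kernel kernel,
                 size_t global_size, size_t local_size)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    printf("kernel took %.3f ms\n", (end - start) * 1e-6);
}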

    // ugly boundary checking: skip neighbor cells outside the domain
    if ( any(neigh_position < (float3)(0.0f)) ||
         any(neigh_position > (float3)(BOUNDARY)) )
         continue;

I think you should not branch like that – how about returning zero from heavy_computation instead? I'm not sure, but maybe you get some branch divergence here. Try to remove that somehow.
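
One way to do that (just a sketch of the idea, reusing the names from your kernel; `valid` and `inside` are new names for this sketch) is to turn an out-of-bounds neighbour cell into an empty particle range instead of branching out of the iteration:

    // Sketch: give out-of-bounds neighbour cells an empty particle range
    // instead of skipping them with `continue`.
    int valid = !(any(neigh_position < (float3)(0.0f)) ||
                  any(neigh_position > (float3)(BOUNDARY)));

    // Clamp so spatial_hash() never sees a position outside the domain
    // (depending on how spatial_hash rounds, you may need a small epsilon).
    float3 inside = clamp(neigh_position, (float3)(0.0f), (float3)(BOUNDARY));

    int2 particles_range = L[ spatial_hash(inside) ];
    particles_range.y    = valid ? particles_range.y : particles_range.x;  // empty if invalid

    for (int p = particles_range.x; p < particles_range.y; p++)
        processed_value += heavy_computation( P[ Sp[p].y ] );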

Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you only need read accesses to the cells and neighbours, you build your sum in parallel, and you are not forced into any race-condition handling.

Also, what is the ideal ratio of number of particles/number of grid cells?

It really depends on your algorithms and on the particle packing within your domain, but in your case I would set the cell size equal to the particle diameter and just use the number of cells you get.

So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.

Also try to use a larger number of particles. Something like the 65,000 particles the CUDA example uses is worth considering, because at that scale the parallelisation is mostly efficient: the available processing units are kept busy (fewer idle threads).

Parental answered 7/8, 2016 at 12:25
Why would you consider a cell size equal to the particle diameter to be good? My thought would be to use cells as an acceleration structure that groups nearby particles. You will never have more than 1 particle in each cell. What's the benefit? – Johns
@pingul: Every particle can occupy at most eight cells. Considering no overlaps between the neighbours, every cell might have to handle up to eight particles. I think it is a good tradeoff. Still, it depends on the algorithm you use and how you parallelize for your outcome. With larger cells you have to tackle more neighbours; with smaller cells you have to consider more than eight cells. – Parental
I see no reason why you would have to think of the particles as actual particles with a size. For a nearest-neighbour search it seems perfectly sufficient to consider only their positions and look up points within a radius. With your scheme one particle would live in multiple cells – what's the point of that? – Johns
The point is that it is enough to only consider the cells for interactions. You do your math within these cells, store the values and apply them to the particles. If you happen to consider multiphysics and time-dependent interaction values like tangential forces, you might have to do it that way. – Parental
I think we will have to agree to disagree. From your description it sounds like you're talking about a PIC algorithm, which makes no sense for the problem at hand. – Johns
You consider particles in cells, so why would it not make sense? You might be able to apply multiple heavy_computation physics steps using one interaction list per frame. And you still divide your space into cells and apply your schemes and the code posted in the question. Since we do not know what heavy_computation is supposed to do, I would say PIC could be possible or a bad idea. You can create a list of neighbours by looping over the cells and work on that list later, or you can calculate your potential through the neighbours directly. It depends on your equations. – Parental
