Nearest Neighbors in CUDA Particles

Edit 2: Please take a look at this crosspost for a TL;DR.

Edit: Given that the particles are segmented into grid cells (say a 16^3 grid), is it a better idea to run one work-group per grid cell, with as many work-items per work-group as the maximum number of particles a cell can hold?

In that case I could load all particles from the neighboring cells into local memory and iterate through them, computing some properties. Then I could write a specific value to each particle in the current grid cell.

Would this approach be better than running the kernel over all particles, with each one iterating over its (mostly identical) neighbors?
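
To make this concrete, here is a rough, untested sketch of what I have in mind (GRID_SIZE, MAX_PER_CELL and process_cells are made-up names for this sketch; heavy_computation is the same placeholder as in my kernel below, and I assume L stores (first, one-past-last) ranges):

// Rough sketch (untested): one work-group per grid cell, launched with
// local size MAX_PER_CELL and global size num_cells * MAX_PER_CELL.
// GRID_SIZE and MAX_PER_CELL are placeholder compile-time constants.

#define GRID_SIZE    16   // 16^3 grid cells
#define MAX_PER_CELL 32   // assumed upper bound on particles per cell

__kernel void process_cells(__global const float3* P,
                            __global const int2*   Sp,  // (hash, particle id), sorted by hash
                            __global const int2*   L,   // per-hash (start, end) range in Sp
                            __global int*          Out)
{
    // Positions of all particles in the 3x3x3 neighborhood, staged in local
    // memory (27 * 32 * sizeof(float3) is roughly 13.5 KB here).
    __local float3 neigh_pos[27 * MAX_PER_CELL];

    int cell = (int)get_group_id(0);
    int lid  = (int)get_local_id(0);
    int lsz  = (int)get_local_size(0);

    // Decompose the linear cell index into 3D grid coordinates.
    int cx =  cell                            % GRID_SIZE;
    int cy = (cell / GRID_SIZE)               % GRID_SIZE;
    int cz =  cell / (GRID_SIZE * GRID_SIZE);

    // Cooperatively copy the particles of all existing neighbor cells.
    // `count` stays identical across work-items, so no atomics are needed.
    int count = 0;
    for (int dz = -1; dz <= 1; dz++)
      for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = cx + dx, ny = cy + dy, nz = cz + dz;
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= GRID_SIZE || ny >= GRID_SIZE || nz >= GRID_SIZE)
                continue;

            int2 range = L[(nz * GRID_SIZE + ny) * GRID_SIZE + nx];
            int  n     = range.y - range.x;        // assumes half-open ranges

            for (int j = lid; j < n; j += lsz)     // strided cooperative copy
                neigh_pos[count + j] = P[Sp[range.x + j].y];

            count += n;                            // assumes n <= MAX_PER_CELL
        }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item handles one particle of the current cell (if there is one).
    int2 own = L[cell];
    int  idx = own.x + lid;
    if (idx < own.y) {
        int processed_value = 0;
        for (int j = 0; j < count; j++)
            processed_value += heavy_computation(neigh_pos[j]);
        Out[Sp[idx].y] = processed_value;
    }
}
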

Also, what is the ideal ratio of number of particles/number of grid cells?


I'm trying to reimplement (and modify) CUDA Particles for OpenCL and use it to query nearest neighbors for every particle. I've created the following structures:

  • Buffer P holding all particles' 3D positions (float3)
  • Buffer Sp storing int2 pairs of (spatial hash, particle id) – .x is the hash, .y is the particle's index in P. Sp is sorted by the hash. (The hash is just a simple linear mapping from 3D to 1D – no Z-indexing yet.)

  • Buffer L storing int2 pairs of starting and ending positions of particular spatial hashes in buffer Sp. Example: L[12] = (int2)(0, 50). (A sketch of one way to fill L from Sp appears after this list.)

    • L[12].x is the index (in Sp) of the first particle with spatial hash 12.
    • L[12].y is the index (in Sp) of the last particle with spatial hash 12.

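For completeness, one straightforward way to fill L from the sorted Sp buffer looks roughly like this (just a sketch, not my exact code; build_cell_ranges and num_particles are names made up for this illustration, and L is assumed to be cleared to empty ranges beforehand):

// Sketch: one work-item per entry of the hash-sorted Sp buffer.
// A cell's range starts where the hash differs from the previous entry
// and ends where it differs from the next one.
__kernel void build_cell_ranges(__global const int2* Sp,  // (hash, particle id), sorted by hash
                                __global int2*       L,   // output: (start, end) per hash
                                const int            num_particles)
{
    int i = (int)get_global_id(0);
    if (i >= num_particles) return;

    int hash = Sp[i].x;

    // First entry with this hash -> start of the range.
    if (i == 0 || Sp[i - 1].x != hash)
        L[hash].x = i;

    // Last entry with this hash -> end of the range. This stores one past the
    // last particle (matching the p < L[hash].y loop in the kernel below);
    // drop the +1 if you want the index of the last particle instead.
    if (i == num_particles - 1 || Sp[i + 1].x != hash)
        L[hash].y = i + 1;

    // Cells that contain no particles keep whatever L was initialised to.
}
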
Now that I have all these buffers, I want to iterate through all the particles in P and for each particle iterate through its nearest neighbors. Currently I have a kernel that looks like this (pseudocode):

__kernel void process_particles(__global const float3* P, __global const int2* Sp, __global const int2* L, __global int* Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  for(int x=-1; x<=1; x++)
    for(int y=-1; y<=1; y++)
      for(int z=-1; z<=1; z++) {

        float3 neigh_position = curr_particle + (float3)(x,y,z)*GRID_CELL_SIDE;

        // ugly boundary checking: skip neighbor cells outside the domain
        if ( any(neigh_position < (float3)(0.0f)) ||
             any(neigh_position > (float3)(BOUNDARY)) )
             continue;

        int  neigh_hash       = spatial_hash( neigh_position );
        int2 particles_range  = L[ neigh_hash ];

        // Sp[p].y is the particle's index into P
        for(int p=particles_range.x; p<particles_range.y; p++)
          processed_value += heavy_computation( P[ Sp[p].y ] );

      }

  Out[gid] = processed_value;
}

The problem with that code is that it's slow. I suspect the non-linear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

What I want to do is use a Z-order curve as the spatial hash. That way I could have a single for loop iterating through a contiguous range of memory when querying neighbors. The only problem is that I don't know what the start and stop Z-index values should be.

The holy grail I want to achieve:

__kernel void process_particles(__global const float3* P, __global const int2* Sp, __global const int2* L, __global int* Out) {
  size_t gid             = get_global_id(0);
  float3 curr_particle   = P[gid];
  int    processed_value = 0;

  // How to accomplish this??
  // `get_neighbors_range()` returns start and end Z-index values
  // representing the start and end of the near-neighbor cell range
  int2 nearest_neighboring_cells_range = get_neighbors_range(curr_particle);
  int first_particle_id = L[ nearest_neighboring_cells_range.x ].x;
  int last_particle_id  = L[ nearest_neighboring_cells_range.y ].y;

  for(int p=first_particle_id; p<=last_particle_id; p++) {
      processed_value += heavy_computation( P[ Sp[p].y ] );
  }

  Out[gid] = processed_value;
}
Erastatus answered 28/7, 2016 at 20:27
As your comments suggest, I would definitely recommend reading the data into shared memory first. I'm not sure about the OpenCL terminology, but in CUDA I would have each block (which I think corresponds to a work-group for you) read in all particles in a certain area (and remember to add some padding, as you're doing a gather operation), process them in local memory, and finally write back the data if necessary. – Johns
@Johns thank you for your comment! It's great to have someone to interact with :). I'm not sure I understand the part "remember to add some padding as you're doing a gather operation" – could you please describe it in more detail? I understand it as: "create an array in local memory with an overestimated size". Having said that, it might be beneficial for others to look at my reformulated question on Computer Science to get a more densely formulated version of this question. – Erastatus
I would suggest two things: (1) as you write as well, first group the particles into cells. (2) Instead of only looping over particles, do a nested loop over cells and then over the particles within each cell. However, particles close to the edges of a cell will interact with particles from other cells. This is why padding is needed (i.e. if you're trying to determine the neighbours of particles in cell A, you will need to search cell A and all cells within the search radius). To do this efficiently you need to make sure you read in all the data first, and then process it. – Johns
@Johns thanks, now I understand. – Erastatus
I have no idea about nearest neighbours, but just in case it turns out to be useful, there are some simple tricks to access the neighbouring cells of a given Z-ordered point (tesseral arithmetic). – Schrock
@harold thanks for pointing that out. I'm definitely going to take a look at tesseral arithmetic – I haven't heard about it before. – Erastatus

You should study the Morton code algorithms closely. Ericson's Real-Time Collision Detection explains this very well.

Ericson – Real-Time Collision Detection

Here is another nice explanation including some tests:

Morton encoding/decoding through bit interleaving: Implementations

The Z-order curve only defines the path along which the coordinates are traversed, i.e. how you hash 2D or 3D coordinates down to a single integer. Although the curve can be subdivided further at every iteration, you have to set the limits yourself. Usually the stop index is denoted by a sentinel; where the sentinel stops tells you at which level the particle is placed. The maximum level you define therefore gives you the number of cells per dimension: with a maximum level of 6 you have 2^6 = 64, so 64x64x64 cells in your system (3D). That also means you have to use integer-based coordinates. If you use floats you have to convert them, e.g. coord.x = 64 * float_x, and so on.
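
Just to make the bit interleaving and the float-to-integer conversion concrete, here is a small sketch (my own illustration, not code from Ericson or the linked article; morton3d and cell_index are made-up names):

// Interleave the lowest `level` bits of x, y and z into one Morton / Z-order
// index: bit i of x goes to bit 3i, bit i of y to bit 3i+1, bit i of z to 3i+2.
unsigned int morton3d(unsigned int x, unsigned int y, unsigned int z, int level)
{
    unsigned int code = 0;
    for (int i = 0; i < level; i++) {
        code |= ((x >> i) & 1u) << (3 * i);
        code |= ((y >> i) & 1u) << (3 * i + 1);
        code |= ((z >> i) & 1u) << (3 * i + 2);
    }
    return code;
}

// Map a float position in [0, boundary)^3 to integer cell coordinates on a
// (2^level)^3 grid (64x64x64 for level 6) and return the cell's Morton index.
unsigned int cell_index(float px, float py, float pz, float boundary, int level)
{
    int cells = 1 << level;                                  // 64 at level 6
    unsigned int x = (unsigned int)(px / boundary * cells);  // coord.x = 64 * float_x
    unsigned int y = (unsigned int)(py / boundary * cells);
    unsigned int z = (unsigned int)(pz / boundary * cells);
    return morton3d(x, y, z, level);
}

A table-based or bit-trick implementation (as discussed in the linked article) will be faster, but the loop makes the mapping explicit.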

If you know how many cells you have in your system you can define your limits. Are you trying to use a binary octree?

Since particles are in motion (in that CUDA example) you should try to parallelize over the number of particles instead of cells.

If you want to build lists of nearest neighbours you have to map the particles to cells. This is done through a table that is afterwards sorted from cells to particles. Still, you should iterate over the particles and access their neighbours.

About your code:

The problem with that code is that it's slow. I suspect the non-linear GPU memory access (particularly P[Sp[p].y] in the innermost for loop) is causing the slowness.

Remember Donald Knuth: premature optimization is the root of all evil. You should measure where the bottleneck actually is. You can use NVIDIA's profiler and look for the bottleneck on the CUDA side; I'm not sure what OpenCL has as a profiler.
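
At the very least you can time individual kernels with OpenCL's event profiling. A minimal host-side sketch (time_kernel and the argument names are made up; the queue must be created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: enqueue a kernel with an event and read its start/end timestamps
 * (in nanoseconds). Error checking is omitted for brevity. */
void time_kernel(cl_command_queue queue, cl_kernel kernel,
                 size_t global_size, size_t local_size)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    printf("kernel took %.3f ms\n", (end - start) * 1e-6);
}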

    // ugly boundary checking: skip neighbor cells outside the domain
    if ( any(neigh_position < (float3)(0.0f)) ||
         any(neigh_position > (float3)(BOUNDARY)) )
         continue;

I think you should not branch like that – how about returning zero from heavy_computation instead? I'm not sure, but maybe you get some branch divergence here. Try to remove that somehow.
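
One way to do that (just a sketch of the idea, reusing the names from your kernel; `valid` and `inside` are new names for this sketch) is to turn an out-of-bounds neighbour cell into an empty particle range instead of branching out of the iteration:

    // Sketch: give out-of-bounds neighbour cells an empty particle range
    // instead of skipping them with `continue`.
    int valid = !(any(neigh_position < (float3)(0.0f)) ||
                  any(neigh_position > (float3)(BOUNDARY)));

    // Clamp so spatial_hash() never sees a position outside the domain
    // (depending on how spatial_hash rounds, you may need a small epsilon).
    float3 inside = clamp(neigh_position, (float3)(0.0f), (float3)(BOUNDARY));

    int2 particles_range = L[ spatial_hash(inside) ];
    particles_range.y    = valid ? particles_range.y : particles_range.x;  // empty if invalid

    for (int p = particles_range.x; p < particles_range.y; p++)
        processed_value += heavy_computation( P[ Sp[p].y ] );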

Running in parallel over the cells is a good idea only if you have no write accesses to the particle data; otherwise you will have to use atomics. If you go over the particle range instead, you only need read accesses to the cells and neighbours, you build your sum in parallel, and you are not forced into any race-condition handling.

Also, what is the ideal ratio of number of particles/number of grid cells?

It really depends on your algorithms and on the particle packing within your domain, but in your case I would set the cell size equal to the particle diameter and just use the number of cells you get.

So if you want to use Z-order and achieve your holy grail, try to use integer coordinates and hash them.

Also try to use a larger number of particles. Something like the 65,000 particles the CUDA example uses is worth considering, because at that scale the parallelisation is mostly efficient: the available processing units are kept busy (fewer idle threads).

Parental answered 7/8, 2016 at 12:25
Why would you consider a cell size equal to the particle diameter to be good? My thought would be to use cells as an acceleration structure that groups nearby particles. You will never have more than 1 particle in each cell. What's the benefit? – Johns
@pingul: Every particle can occupy at most eight cells. Considering no overlaps between the neighbours, every cell might have to handle up to eight particles. I think it is a good tradeoff. Still, it depends on the algorithm you use and how you parallelize for your outcome. With larger cells you have to tackle more neighbours; with smaller cells you have to consider more than eight cells. – Parental
I see no reason why you would have to think of the particles as actual particles with a size. For a nearest-neighbour search it seems perfectly sufficient to consider only their positions and look up points within a radius. With your scheme one particle would live in multiple cells – what's the point of that? – Johns
The point is that it is enough to only consider the cells for interactions. You do your math within these cells, store the values and apply them to the particles. If you happen to consider multiphysics and time-dependent interaction values like tangential forces, you might have to do it that way. – Parental
I think we will have to agree to disagree. From your description it sounds like you're talking about a PIC algorithm, which makes no sense for the problem at hand. – Johns
You consider particles in cells, so why would it not make sense? You might be able to apply multiple heavy_computation physics steps using one interaction list per frame. And you still divide your space into cells and apply your schemes and the code posted in the question. Since we do not know what heavy_computation is supposed to do, I would say PIC could be possible or a bad idea. You can create a list of neighbours by looping over the cells and work on that list later, or you can calculate your potential through the neighbours directly. It depends on your equations. – Parental
