[OpenCL] Nearest neighbour using Euclidean distance

I'm using OpenCL to find the nearest neighbour between two set of 3D points.

Nearest neighbour: for each point (x, y, z) in the dataset I have to find the nearest one in the model. Squared distance = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2.

Here's what I've done so far:

struct point {
    int x;
    int y;
    int z;
};

__kernel void
nearest_neighbour(__global struct point *model,
                  __global struct point *dataset,
                  __global int *nearest,
                  const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);

    int dmin = -1;
    int d, dx, dy, dz;

    for (int i=0; i<model_size; ++i) {
        dx = model[i].x - dataset[g_dataset_id].x;
        dx = dx * dx;

        dy = model[i].y - dataset[g_dataset_id].y;
        dy = dy * dy;

        dz = model[i].z - dataset[g_dataset_id].z;
        dz = dz * dz;

        d = dx + dy + dz;

        if(dmin == -1 || d < dmin)
        {
            nearest[g_dataset_id] = i;
            dmin = d;
        }
    }
}

The code seems to work, but I'm sure it can be optimized. I would like to know how I can take advantage of local memory to make it better.

Thanks

P.S. I know that there are other (better) methods to find nearest neighbour, like kd-tree, but for now I would like to do the easy one.

Kennithkennon answered 21/3, 2011 at 17:35 Comment(0)

The compiler is probably hoisting these loop invariants for you, but to be sure it gets done, try this code, which assigns them to temporaries named datum_x and so on. Also, initializing dmin to INT_MAX lets you avoid the superfluous comparison with -1. Another approach is to unroll the first loop iteration (i = 0) in order to initialize dmin.

int dmin = INT_MAX;
int d, dx, dy, dz;
int datum_x, datum_y, datum_z;

datum_x = dataset[g_dataset_id].x;
datum_y = dataset[g_dataset_id].y;
datum_z = dataset[g_dataset_id].z;

for (int i=0; i<model_size; ++i) {
    dx = model[i].x - datum_x;
    dx = dx * dx;

    dy = model[i].y - datum_y;
    dy = dy * dy;

    dz = model[i].z - datum_z;
    dz = dz * dz;

    d = dx + dy + dz;

    if(d < dmin)
    {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}
Bibliopegy answered 21/3, 2011 at 17:51 Comment(1)
Thanks, actually with your suggestion the execution time decreased from 5.3 s to 3.6 s (model: 200,000 points, dataset: 200,000 points, on an ATI 5850)Kennithkennon

Maybe a quick pre-filter can speed things up. Instead of calculating the squared distance immediately, you can first check whether the distance in each coordinate is already at least dmin. So you can replace your inner loop body with

{
    dx = model[i].x - datum_x;
    if (abs(dx) >= dmin) continue;

    dy = model[i].y - datum_y;
    if (abs(dy) >= dmin) continue;

    dz = model[i].z - datum_z;
    if (abs(dz) >= dmin) continue;

    dx = dx * dx;    
    dy = dy * dy;
    dz = dz * dz;

    d = dx + dy + dz;

    if(d < dmin)
    {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}

I am not sure whether the extra calls to abs() and the ifs per point will filter out enough points, but it is a simple enough change to try out, I think.

Shot answered 22/3, 2011 at 2:6 Comment(2)
thanks, your suggestion was really good, execution time decreased from 3.6s to 2.7s. Based on this I've made another version that don't use abs() (2.3s now): dx = model[i].x - local_x; dx = dx * dx; if (dx >= dmin) continue; dy = model[i].y - local_y; dy = dy * dy; if (dy >= dmin) continue; dz = model[i].z - local_z; dz = dz * dz; if (dz >= dmin) continue; d = dx + dy + dz;Kennithkennon
That looks better. The days of extremely slow multiplications are long past, and I suspect abs() and a multiplication take a comparable amount of time.Shot

The first thing that occurred to me is the suggestion that Heath made: every work item accesses the same memory location model[i] simultaneously. Depending on how good the compiler is at optimizing, it might be better to have each work item access a different element of the array. One way of staggering the accesses is:

int datum_x, datum_y, datum_z;

datum_x = dataset[g_dataset_id].x;
datum_y = dataset[g_dataset_id].y;
datum_z = dataset[g_dataset_id].z;

for (int i=0; i<model_size; ++i) {
    int j = (i + g_dataset_id) % model_size;  // i --> j by cyclic permutation

    dx = model[j].x - datum_x;
    dx = dx * dx;

    dy = model[j].y - datum_y;
    dy = dy * dy;

    /* and so on... */
}

However, it may well be that the access to model[i] in your code is handled as a "broadcast," in which case my code will run slower.

Sean answered 21/3, 2011 at 21:17 Comment(0)

Heath's suggestion can be applied to the output index too: maintain a variable nearest_id, and write it only once after the loop.

Instead of a 3-component struct, I would experiment with int4 vectors and use vector operations.

Avila answered 22/3, 2011 at 10:49 Comment(1)
Thanks for your suggestion, have you got any link about int4/float4 vectors and vector operations? I would like to take a lookKennithkennon

I am pretty sure that a lot of time is spent writing the current minimum into nearest[g_dataset_id]. Access to global memory is often very slow, so you're better off keeping the current minimum in a register, as you already do with dmin = d.

Just like this:

...
int dmin = INT_MAX;
int imin = 0;
...
for (...)
{
  ...
  if(d < dmin)
  {
    imin = i;
    dmin = d;
  }
}

nearest[g_dataset_id] = imin; //write to global memory only once
Mcglynn answered 22/3, 2011 at 11:14 Comment(1)
Hi, actually it wasn't a lot of time, but it gave an improvement of the performance, so thanks for your suggestionKennithkennon

After Eric Bainville's suggestion I've tried to get rid of the point struct. As suggested, I've used float4; here is what I've done:

__kernel void
nearest_neighbour(__global float4 *model,
                  __global float4 *dataset,
                  __global unsigned int *nearest,
                  const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);

    float dmin = MAXFLOAT;
    float d;

    /* Optimized for local memory */
    float4 local_xyz = dataset[g_dataset_id];
    float4 d_xyz;
    int imin = 0;

    for (int i=0; i<model_size; ++i) {
        d_xyz = model[i] - local_xyz;

        d_xyz *= d_xyz;

        d = d_xyz.x + d_xyz.y + d_xyz.z;

        if(d < dmin)
        {
            imin = i;
            dmin = d;
        }
    }

    nearest[g_dataset_id] = imin; // Write only once in global memory
}

The problem is that this version runs a bit slower than the one based on the point struct, probably because in the struct version I used the pre-filter:

dx = model[i].x - local_x;
dx = dx * dx;
if (dx >= dmin) continue;

dy = model[i].y - local_y;
dy = dy * dy;
if (dy >= dmin) continue;

dz = model[i].z - local_z;
dz = dz * dz;
if (dz >= dmin) continue;

d = dx + dy + dz;

I can't use that pre-filter with the float4 version. In your opinion, are there other optimizations I can do on the float4 version?

Thank you all for your valuable suggestions

Kennithkennon answered 22/3, 2011 at 20:54 Comment(1)
If you ensure the 4th component is 0 in the datasets, you can use d = length(model[i] - local_xyz), or d = dot(d_xyz,d_xyz). Getting rid of the if would probably accelerate the loop: imin=(d<dmin)?i:imin, and dmin=min(d,dmin). Your next step would be to use local memory, as you suggested in your question.Avila

To your specific question of "I would like to know how can I take advantage of the local memory to make it better."

Using the GPU's local memory can be tricky. You need to spend some quality time with the SDK's code samples and programming guide before tackling it.

Basically, you use local memory to cache a block of the global data (in your case the model[] array) so that you can read it from there faster than from global memory. If you want to try it, it would go something like this pseudocode:

For each block of the model array {
    1) Read data from __global and write it to __local
    2) Barrier
    3) For each model datum in the __local cache,
       Read it and process it.
    4) Barrier
}

Step 3 is basically the loop you have now, except that it would only be processing a chunk of the model data instead of the whole thing.

Steps 2 and 4 are absolutely essential whenever you use local memory: you have to synchronize all of the threads in your work group. The barrier forces all of the work items to finish the code before the barrier before any of them is allowed to execute code after it. This prevents work items from reading data out of local memory before the other threads have written it there. The call is barrier() with the appropriate memory-fence flag; see the OpenCL docs.

In step 1, each work item reads a different datum from global memory and writes it to the local cache.

Something like this (caution this is oversimplified and untested!):

__local float4 modelcache[CACHESIZE];
int me = get_local_id(0);

for (int j = 0; j < model_size; j += CACHESIZE) {
    modelcache[me] = model[j+me];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i=0; i < CACHESIZE; ++i) {
        d_xyz = modelcache[i] - local_xyz;
        ... etc.
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

The design question then is: How big should the local cache be? What's the work group size?

The local data store is shared between work items in a work group. If your ND-range of work items executes a number of work groups in parallel, each work group has its own copy of modelcache.

If you make the local data arrays too small, you get very little or no benefit from using them. If you make them too big, then the GPU can't execute as many work groups in parallel, and you might actually run considerably slower.

Finally, I have to say that this particular algorithm isn't likely to benefit much from a local memory cache. In your program, all of the work items are reading the same model[i] locations at the same time, and most GPUs have hardware that is specifically optimized to do that fast.

Cellule answered 3/4, 2011 at 22:39 Comment(1)
Thanks for your example, it's really useful as a start to understand local memory better, there is only one thing that is not clear, in step 3 I save the best nearest as: imin = i; But now I can't use i anymore, how can I get the "global id" of the current best? I tried imin=j+i but it works only for the first N points.Kennithkennon
