To your specific question of "I would like to know how can I take advantage of the local memory to make it better."
Using the GPU's local memory can be tricky. You need to spend some quality time with the SDK's code samples and programming guide before tackling it.
Basically, you use local memory to cache a block of the global data -- in your case the model[] array -- so that you can read it from there faster than reading it from global memory. If you want to try it, it would go something like this pseudocode:
For each block of the model array {
    1) Read data from __global and write it to __local
    2) Barrier
    3) For each model datum in the __local cache,
       read it and process it
    4) Barrier
}
Step 3 is basically the loop you have now, except that it would only be processing a chunk of the model data instead of the whole thing.
Steps 2 and 4 are absolutely essential whenever you use local memory: you have to synchronize all of the threads in your work group. The barrier forces every work item to finish the code before it before any of them is allowed to execute the code after it. This prevents work items from reading data out of local memory before the other threads have written it there. The syntax is barrier(CLK_LOCAL_MEM_FENCE), as in the code below; the details are in the OpenCL docs.
In step 1, each work item reads a different datum from global memory and writes it to the local cache.
Something like this (caution: this is oversimplified and untested!):
__local float4 modelcache[CACHESIZE];   // one copy per work group
int me = get_local_id(0);               // assumes the work-group size equals CACHESIZE

for (int j = 0; j < model_size; j += CACHESIZE) {   // assumes model_size is a multiple of CACHESIZE
    modelcache[me] = dataset[j + me];   // each work item loads one element of this block
    barrier(CLK_LOCAL_MEM_FENCE);       // wait until the whole block is in local memory
    for (int i = 0; i < CACHESIZE; ++i) {
        d_xyz = modelcache[i] - local_xyz;
        ... etc.
    }
    barrier(CLK_LOCAL_MEM_FENCE);       // don't overwrite the block while others are still reading it
}
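On the host side you would then launch the kernel with an explicit local work size that matches CACHESIZE, so that get_local_id(0) covers exactly one cache slot per work item. A minimal sketch, assuming queue and kernel are your existing cl_command_queue and cl_kernel, and num_points is just a placeholder for however many work items you launch:

size_t local_size  = CACHESIZE;    // must match the __local array length in the kernel
size_t global_size = num_points;   // must be a multiple of local_size

// Pass the local size explicitly; if you pass NULL, the runtime picks one
// for you, and it won't necessarily equal CACHESIZE.
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    &global_size, &local_size,
                                    0, NULL, NULL);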
The design question then is: How big should the local cache be? What's the work group size?
The local data store is shared between the work items in a work group. If your NDRange of work items executes as a number of work groups in parallel, each work group has its own copy of modelcache.
If you make the local data arrays too small, you get very little or no benefit from using them. If you make them too big, then the GPU can't execute as many work groups in parallel, and you might actually run considerably slower.
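You can ask the runtime where those limits are before committing to numbers. A rough sketch, assuming device and kernel are your existing cl_device_id and cl_kernel:

cl_ulong local_mem_bytes;
size_t   max_wg_size;

// Total __local memory available to one work group on this device.
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(local_mem_bytes), &local_mem_bytes, NULL);

// Largest work-group size this particular kernel can be launched with.
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg_size), &max_wg_size, NULL);

// CACHESIZE * sizeof(float4) has to fit well within local_mem_bytes,
// and the work-group size you launch with can't exceed max_wg_size.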
Finally, I have to say that this particular algorithm isn't likely to benefit much from a local memory cache. In your program, all of the work items are reading the same model[i] locations at the same time, and most GPUs have hardware that is specifically optimized to do that fast.