I have started learning how to use OpenMP as part of a university course. As a lab exercise, we have been given a serial program which we need to parallelize.
One of the first things we were made aware of was the danger of false sharing, especially when it comes to updating arrays in parallel for loops.
However, I am finding it hard to transform the following snippet of code into a parallelizable task without causing false sharing:
    int ii, kk;
    double *uk = malloc(sizeof(double) * NX);
    double *ukp1 = malloc(sizeof(double) * NX);
    double *temp;
    double dx = 1.0/(double)NX;
    double dt = 0.5*dx*dx;

    // Initialise both arrays with values
    init(uk, ukp1);

    for (kk = 0; kk < NSTEPS; kk++) {
        for (ii = 1; ii < NX-1; ii++) {
            ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
        }
        temp = ukp1;
        ukp1 = uk;
        uk = temp;
        printValues(uk, kk);
    }
My first reaction was to try out sharing ukp1:
    for (kk = 0; kk < NSTEPS; kk++) {
        #pragma omp parallel for shared(ukp1)
        for (ii = 1; ii < NX-1; ii++) {
            ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
        }
        temp = ukp1;
        ukp1 = uk;
        uk = temp;
        printValues(uk, kk);
    }
But this shows a significant slowdown compared to the serial version. The obvious reason is that false sharing is occurring during some write operations to ukp1.
I was under the impression that I could use the reduction clause, but I soon found out that it cannot be used on arrays.
Is there anything I can use to parallelize this code and improve the runtime? Is there a clause I have not heard of? Or is this the kind of task where I need to restructure the code to allow proper parallelization?
All forms of input would be greatly appreciated!
EDIT: It was pointed out to me that there was a mistake in my code. The code I have locally is correct; I just edited it wrongly (which changed the structure of the code). Sorry for the confusion!
EDIT2:
Some information pointed out to me by @Sergey which I feel is useful:
Setting uk or ukp1 to private will essentially have the same effect as setting them to shared, because only the pointers are privatized; every thread's copy still refers to the same memory (strictly, firstprivate would be needed so that the private copies are initialized).
Using static scheduling should help in theory, but I am still experiencing the same slowdown. Also, I feel that static scheduling is not the most portable way of fixing this problem.
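For reference, static scheduling applied to the inner loop would look roughly like the sketch below. It is a minimal illustration rather than my exact code: the function names step_static and run_static and the parameter r (standing for dt/(dx*dx)) are made up for this example, and it has to be compiled with OpenMP enabled (e.g. -fopenmp with GCC) for the pragma to have any effect.

```c
#include <assert.h>

/* One time step of the 1D stencil from the question.
   r stands for dt/(dx*dx); nx is the length of both arrays.
   Hypothetical helper, named for illustration only. */
static void step_static(const double *uk, double *ukp1, int nx, double r)
{
    int ii;
    assert(nx >= 3);
    /* schedule(static) without a chunk size hands each thread one
       contiguous block of iterations, so two threads can only write
       to the same cache line near the block boundaries. */
    #pragma omp parallel for schedule(static)
    for (ii = 1; ii < nx - 1; ii++) {
        ukp1[ii] = uk[ii] + r * (uk[ii + 1] - 2.0 * uk[ii] + uk[ii - 1]);
    }
}

/* Runs nsteps time steps, swapping the two buffers after each one,
   and returns the buffer that holds the most recent values.
   As in the question, both buffers must be initialised beforehand,
   because the two boundary elements are never written. */
static double *run_static(double *uk, double *ukp1, int nx, int nsteps, double r)
{
    int kk;
    for (kk = 0; kk < nsteps; kk++) {
        double *temp;
        step_static(uk, ukp1, nx, r);
        temp = ukp1;
        ukp1 = uk;
        uk = temp;
    }
    return uk;
}
```

With the variables from the question, this would be called as uk = run_static(uk, ukp1, NX, NSTEPS, dt/(dx*dx)), leaving out the printValues call after each step. The parallel region is still opened and closed once per time step, so this only changes how iterations are divided among threads.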
kk is not used anywhere inside the loop? – Pericarditis
ukp1 inside the loop, but never reading from it, so each iteration for every kk will do exactly the same. – Pericarditis
uk after all iterations are done. Look at Massimiliano's answer below and edit your code accordingly if his code is what you actually meant. – Osprey