Avoiding False Sharing in OpenMP with arrays

I have started learning how to use OpenMP as part of a University course. As a lab exercise, we have been given a serial program which we need to parallelize.

One of the first things we were made aware of was the danger of false sharing, especially when it comes to updating arrays in parallel for loops.

However, I am finding it hard to transform the following snippet of code into a parallelizable task without causing false sharing:

int ii,kk;

double *uk = malloc(sizeof(double) * NX);
double *ukp1 = malloc(sizeof(double) * NX);
double *temp;

double dx = 1.0/(double)NX;
double dt = 0.5*dx*dx;

// Initialise both arrays with values
init(uk, ukp1);

for(kk=0; kk<NSTEPS; kk++) {
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }

   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

My first reaction was to try out sharing ukp1:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for shared(ukp1)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
    }

   temp = ukp1;
   ukp1 = uk;
   uk = temp;
   printValues(uk,kk);
}

But this clearly shows a significant slowdown compared to the serial version. The obvious reason is that false sharing is occurring during some of the write operations to ukp1.

I was under the impression that maybe I could use the reduction clause; however, I soon found out that it cannot be used on arrays.

Is there anything I can use to parallelize this code and improve the runtime? Is there a clause I can use which I have not heard of? Or is this the kind of task where I need to restructure the code to allow proper parallelization?

All forms of input would be greatly appreciated!

EDIT: It was pointed out to me that there was a mistake in my code. The code I have locally is correct; I just edited it wrongly here (which changed the structure of the code). Sorry for the confusion!

EDIT2:

Some information pointed out to me by @Sergey which I feel is useful:

  • Setting uk or ukp1 to private will essentially have the same effect as setting them to shared, because they are both pointers to the same memory location

  • Using static scheduling should help in theory, but I am still experiencing the same slowdown. Also, I feel that static scheduling is not the most portable way of fixing this problem.

Guevara answered 9/10, 2013 at 17:9 Comment(9)
Is it intended that your outer loop index kk is not used anywhere inside the loop?Pericarditis
Yes, I believe the purpose of kk is only to act as an index for the outer loop. The outer loop is simply used to repeat the operation multiple times (1000 times in my case but this can be changed). Because the outer loop depends on previous values, I don't believe we can parallelize it.Guevara
You are only modifying ukp1 inside the loop, but never reading from it, so each iteration for every kk will do exactly the same.Pericarditis
Does this mean this cannot be a case of false sharing? Does writing not cause a cache line to be updated?Guevara
What Sergey is saying is that the outer loop is completely useless. Why are you repeating the iteration if the result is the same after 1 iteration and after 1000 iterations?Osprey
ukp1 is only used as temporary storage. It takes the values from the previous iteration (stored in uk) and calculates the differential using its neighbours (ii + 1 and ii - 1). Once all the calculations are complete and ukp1 has been filled, it is swapped with uk to be used in the next iteration again. So while it is not read within the loop, it definitely has a purpose. This code was not written by myself but rather by our tutor. Our task is only to parallelize it.Guevara
@Michael But what you're saying is not true. The code in its current form only swaps with uk after all iterations are done. Look at Massimiliano's answer below and edit your code accordingly if his code is what you actually meant.Osprey
Oh I see where the mix-up is being made, I will fix my original post. If you look at the top code, it should make more sense. The second snippet was a mistake I made while editing.Guevara
Does this answer your question? Parallelizing matrix times a vector by columns and by rows with OpenMPAnimalism

Since we are talking about optimisations, first things first:

Define constants as macros; this allows for better optimisation by the compiler.

#define dx (1.0/(double)NX)
#define dt (0.5*dx*dx)

When working with OpenMP the default sharing rule for variables is shared, though it is good practice to set it to none and declare the sharing of every variable you need inside the parallel section manually. This way you can be certain that you avoid conflicts.

#pragma omp parallel for default(none) shared(ukp1, uk)

Also, setting ukp1 or uk to any sharing status will only pass the pointer into your parallel section, since you declared them as pointers. So the memory they point to is still shared.
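
As a small illustration of that point (a standalone sketch, not part of the lab code): even firstprivate only copies the pointer value, so every thread still writes into the same underlying array:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
   double *buf = calloc(100, sizeof(double));

   /* Each thread gets its own copy of the pointer, but all copies
      point at the same block of memory, which stays shared. */
   #pragma omp parallel for firstprivate(buf)
   for (int i = 0; i < 100; i++)
      buf[i] = i;

   printf("%f\n", buf[42]);   /* prints 42.000000 - the writes went to shared memory */
   free(buf);
   return 0;
}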

Lastly, to avoid false sharing you want to make sure that as few cache lines as possible are shared among threads. Read-only cache lines (uk in your case) are irrelevant, since they can exist in every thread's cache, but cache lines that are written to (ukp1) should belong to a single thread each. Today cache lines are normally 64 bytes long, so one cache line will fit 8 doubles. Therefore you want to assign chunks of at least 8 iterations per thread:

#pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)

This will hand out the work in chunks of 8 iterations; the directive goes on the inner loop:

for(kk=0; kk<NSTEPS; kk++) {
   #pragma omp parallel for default(none) shared(ukp1, uk) schedule(static,8)
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }
   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk   = temp;
}

In practice, depending on the size of your data, you may want to assign even larger chunk sizes. I tend to use 0x1000, which on most systems will even fit a whole page (assuming you are page-aligned).
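
Applied to the loop above, a larger chunk is just a change to the directive (a sketch with a chunk of 0x1000 iterations):

/* Each thread now handles 0x1000 consecutive elements at a time, so two
   threads can only ever share the cache lines at the chunk boundaries. */
#pragma omp parallel for default(none) shared(ukp1, uk) schedule(static, 0x1000)
for(ii=1; ii<NX-1; ii++) {
   ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
}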

Edit:

In order for this to actually have an effect you need to have your memory correctly aligned. You are starting at index 1, so:

 double *uk   = memalign(0x40, sizeof(double) * (NX + 8));   /* memalign is declared in <malloc.h> */
 double *ukp1 = memalign(0x40, sizeof(double) * (NX + 8));
 uk   += 7;   /* shift by 7 doubles so that &uk[1] starts on a 64-byte boundary */
 ukp1 += 7;   /* likewise, &ukp1[1] now starts on a 64-byte boundary */

Now ukp1[1] is cache-line aligned. Increasing your chunk size may help, but unless you plan on having NX > 100000 there isn't much point in parallelising in the first place.
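
If memalign is not available, the standard posix_memalign does the same job (a sketch; uk_base and ukp1_base are names introduced here so that the unshifted pointers can still be passed to free later):

double *uk_base, *ukp1_base;
/* 64-byte alignment, with 8 extra doubles of room for the +7 offset */
if (posix_memalign((void **)&uk_base,   64, sizeof(double) * (NX + 8)) != 0 ||
    posix_memalign((void **)&ukp1_base, 64, sizeof(double) * (NX + 8)) != 0) {
   exit(EXIT_FAILURE);   /* allocation failed */
}
double *uk   = uk_base   + 7;   /* &uk[1]   now starts on a cache line */
double *ukp1 = ukp1_base + 7;   /* &ukp1[1] now starts on a cache line */
/* ... and at the end: free(uk_base); free(ukp1_base); not uk/ukp1 themselves */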

You need to keep in mind that you are getting quite a lot of overhead restarting the parallel section in each iteration. To get that under control you would need to rework your scheduling beyond simple OpenMP.

Pericarditis answered 9/10, 2013 at 17:56 Comment(10)
Why would a macro-defined constant allow for better optimisations than a const, for example?Osprey
First of all, the original code in the question did not define them as const. Second, some OpenMP compilers are not as good at optimising const variables as they are with macros. I know icc does it well, I don't know about gcc, and I have seen some weird compilers which are bad at optimisations involving consts. And every macro-defined constant is one less entry in variable sharing.Pericarditis
Thanks for the information. I was just doing some further googling and found the lastprivate clause. Would it make any sense at all?Guevara
@Sergey I wasn't criticizing your answer :) . I was simply wondering, because in the absence of OpenMP I haven't come across any recent compilers that have problems with optimising consts.Osprey
@MichaelAquilina No, all threads access the same memory through a pointer, so that memory is shared either way. Lastprivate is a very rarely used clause in OpenMP and has very limited use. schedule is the one you want!Pericarditis
This all makes sense, however I am still seeing the same slowdown as before :/Guevara
@MichaelAquilina See my edit about memory alignment.Pericarditis
It's still causing the same slowdown. Even increasing NX to very large values shows the serial version as faster. I've uploaded the code for both the original serial version and my (WIP) parallel version on my bitbucket account if you want to have a lookGuevara
@MichaelAquilina I have downloaded your code. Commented out the print line and I got a 350% speedup in parallel on 4 cores on OS X Macbook Pro compared to serial version. I had to compile it in two steps though: gcc -g parallel_heat.c -c -fopenmp && gcc -o parallel_heat parallel_heat.o -lgompPericarditis
I can't see how this is a solution. You don't know that the cache line is 64 bytes, that's an assumption. Seems very error-prone.Brassie

I believe that @SergeyL. is right, and your code should look more like this:

// Parabolic 1D heat diffusion solved with an explicit method
for(kk=0; kk<NSTEPS; kk++) {
   for(ii=1; ii<NX-1; ii++) {
      ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
   }
   // Swap pointers for the next time step
   temp = ukp1;
   ukp1 = uk;
   uk   = temp;
}

That said, to avoid false sharing you must ensure that different threads do not operate on the same cache line. This indeed requires you to choose an appropriate schedule and chunk size. The simplest solution that comes to mind is:

// Parabolic 1D heat diffusion solved with an explicit method
#pragma omp parallel private(kk)
{
   for(kk=0; kk<NSTEPS; kk++) {
      #pragma omp for schedule(static)
      for(ii=1; ii<NX-1; ii++) {
         ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);
      }
      #pragma omp single
      {
         // Swap pointers for the next time step
         temp = ukp1;
         ukp1 = uk;
         uk   = temp;
      }
   } // outer time-stepping loop
} // omp parallel region
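
For reference, here is a self-contained version of this approach that compiles on its own (a minimal sketch; the values of NX and NSTEPS and the initial condition are placeholder choices, not taken from the question's code):

// heat_omp.c - compile with: gcc -fopenmp heat_omp.c -o heat_omp
#include <stdio.h>
#include <stdlib.h>

#define NX     10000   /* placeholder grid size       */
#define NSTEPS 1000    /* placeholder number of steps */
#define dx     (1.0/(double)NX)
#define dt     (0.5*dx*dx)

int main(void) {
   double *uk   = malloc(sizeof(double) * NX);
   double *ukp1 = malloc(sizeof(double) * NX);
   double *temp;
   int ii, kk;

   /* Placeholder initial condition: a hot block in the middle */
   for (ii = 0; ii < NX; ii++)
      uk[ii] = ukp1[ii] = (ii > NX/4 && ii < 3*NX/4) ? 1.0 : 0.0;

   #pragma omp parallel default(none) shared(uk, ukp1, temp) private(ii, kk)
   {
      for (kk = 0; kk < NSTEPS; kk++) {
         /* Static schedule: each thread gets one contiguous block, and the
            implicit barrier at the end keeps the time steps in order. */
         #pragma omp for schedule(static)
         for (ii = 1; ii < NX-1; ii++)
            ukp1[ii] = uk[ii] + (dt/(dx*dx))*(uk[ii+1]-2*uk[ii]+uk[ii-1]);

         /* One thread swaps the pointers; the implicit barrier at the end
            of the single construct publishes the swap to all threads. */
         #pragma omp single
         {
            temp = ukp1;
            ukp1 = uk;
            uk   = temp;
         }
      }
   }

   printf("uk[NX/2] after %d steps: %g\n", NSTEPS, uk[NX/2]);
   free(uk);
   free(ukp1);
   return 0;
}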
Declination answered 9/10, 2013 at 17:54 Comment(0)
