False sharing and pthreads

My task is to demonstrate false sharing, so I wrote a simple program:

#include <sys/times.h>
#include <time.h>
#include <stdio.h> 
#include <pthread.h> 

long long int tmsBegin1,tmsEnd1,tmsBegin2,tmsEnd2,tmsBegin3,tmsEnd3;

int array[100];

void *heavy_loop(void *param) {
  int   index = *((int*)param);
  int   i;
  for (i = 0; i < 100000000; i++)
    array[index] += 3;
  return NULL;
}

int main(int argc, char *argv[]) { 
  int       first_elem  = 0;
  int       bad_elem    = 1;
  int       good_elem   = 32;
  long long time1;
  long long time2;
  long long time3;
  pthread_t     thread_1;
  pthread_t     thread_2;

  /* Baseline: run both loops sequentially on the main thread */
  tmsBegin3 = clock();
  heavy_loop((void*)&first_elem);
  heavy_loop((void*)&bad_elem);
  tmsEnd3 = clock();

  /* Two threads updating array[0] and array[1]: likely the same cache line (false sharing) */
  tmsBegin1 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&bad_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd1 = clock(); 

  /* Two threads updating array[0] and array[32]: far enough apart to land on different cache lines */
  tmsBegin2 = clock();
  pthread_create(&thread_1, NULL, heavy_loop, (void*)&first_elem);
  pthread_create(&thread_2, NULL, heavy_loop, (void*)&good_elem);
  pthread_join(thread_1, NULL);
  pthread_join(thread_2, NULL);
  tmsEnd2 = clock();

  printf("%d %d %d\n", array[first_elem],array[bad_elem],array[good_elem]);
  time1 = (tmsEnd1-tmsBegin1)*1000/CLOCKS_PER_SEC;
  time2 = (tmsEnd2-tmsBegin2)*1000/CLOCKS_PER_SEC;
  time3 = (tmsEnd3-tmsBegin3)*1000/CLOCKS_PER_SEC;
  printf("%lld ms\n", time1);
  printf("%lld ms\n", time2);
  printf("%lld ms\n", time3);

  return 0; 
} 

I was very surprised when I saw the results (I ran it on my i5-430M processor).

  • With false sharing, it was 1020 ms.
  • Without false sharing, it was 710 ms, only about 30% faster instead of the 300-400% speedup reported on some sites.
  • Without using pthreads, it was 580 ms.

Please point out my mistake or explain why this happens.

Slowmoving answered 30/11, 2011 at 18:51

A brief note on the clock() function in C: it returns the CPU time consumed by your process, summed across all of its threads, not the elapsed wall-clock time. So when you run two threads in parallel, the measured value is roughly the CPU time spent on core 1 plus the CPU time spent on core 2, which is why the threaded runs don't look much faster.

I think what you want is a wall-clock (real-time) timer. For this, use

clock_gettime()

and you should get the expected output.
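Here is a minimal sketch of how the timing could look with clock_gettime() and CLOCK_MONOTONIC; the elapsed_ms helper is just for illustration and is not part of the original program:

#include <stdio.h>
#include <time.h>

/* Wall-clock milliseconds between two timespecs. */
static double elapsed_ms(struct timespec start, struct timespec end) {
  return (end.tv_sec - start.tv_sec) * 1000.0 +
         (end.tv_nsec - start.tv_nsec) / 1000000.0;
}

int main(void) {
  struct timespec begin, end;

  clock_gettime(CLOCK_MONOTONIC, &begin);
  /* ... pthread_create / pthread_join the two threads here ... */
  clock_gettime(CLOCK_MONOTONIC, &end);

  printf("%f ms\n", elapsed_ms(begin, end));
  return 0;
}

(On older glibc you may need to link with -lrt for clock_gettime.)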

I ran your code with clock_gettime() and I got this:

  • With false sharing 874.587381 ms
  • Without false sharing 331.844278 ms
  • Sequential computation 604.160276 ms
Colour answered 18/3, 2017 at 18:6

False sharing is a result of multiple cores with separate caches accessing the same region of physical memory (although not the same address -- that would be true sharing).

To understand false sharing, you need to understand caches. In most processors, each core will have its own L1 cache, which holds recently accessed data. Caches are organized in "lines", which are aligned chunks of data, usually 32 or 64 bytes in length (depending on your processor). When you read from an address that's not in the cache, the whole line is read from main memory (or an L2 cache) into L1. When you write to an address in the cache, the line containing that address is marked "dirty".
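If you want to check the line size on your own machine, glibc on Linux exposes it through sysconf(); note that _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension rather than portable POSIX, so this is a Linux-specific sketch:

#include <stdio.h>
#include <unistd.h>

int main(void) {
  /* Glibc/Linux extension; may return 0 or -1 if the size is unknown. */
  long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  printf("L1 data cache line size: %ld bytes\n", line);
  return 0;
}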

Here's where the sharing aspect comes in. If multiple cores are reading from the same line, they can each have a copy of the line in L1. However, when one copy is marked dirty, the line is invalidated in the other caches. If this didn't happen, writes made on one core might not be visible to other cores until much later. So the next time another core reads from that line, it misses in its cache and has to fetch the line again.

False sharing occurs when the cores are reading and writing to different addresses on the same line. Even though the threads are not sharing any data, the caches behave as if they were, because the addresses are so close together.
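The usual fix is to give each thread's data its own cache line, for example by padding and aligning it. Below is a minimal sketch assuming 64-byte lines and GCC-style attributes; the padded_counter struct and its names are illustrative, not taken from the question's code:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed line size; check your CPU */

/* Each counter occupies a full, aligned cache line, so one thread's
   writes never dirty the line holding the other thread's counter. */
struct padded_counter {
  long value;
  char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

static struct padded_counter counters[2];

static void *heavy_loop(void *param) {
  struct padded_counter *c = param;
  for (long i = 0; i < 100000000; i++)
    c->value += 3;
  return NULL;
}

int main(void) {
  pthread_t t1, t2;
  pthread_create(&t1, NULL, heavy_loop, &counters[0]);
  pthread_create(&t2, NULL, heavy_loop, &counters[1]);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("%ld %ld\n", counters[0].value, counters[1].value);
  return 0;
}

With this layout the "bad" and "good" cases should behave the same, because the two counters can no longer land on one cache line.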

This effect is highly dependent on the architecture of your processor. If you had a single-core processor, you would not see the effect at all, since there would be no sharing. If your cache lines were longer, you would see the effect in both the "bad" and "good" cases, since the elements would still be close together. If your cores did not share an L2 cache (which I'm guessing they do), you might see the 300-400% difference you mentioned, since a cache miss would have to go all the way to main memory.

You might also like to know that it matters that each thread both reads and writes (+= instead of =). Some caches do not allocate a line on a write miss (this is common with write-through caches), so a write to an address that isn't cached goes out to memory without fetching the line. Write-back caches typically do allocate on a write miss, fetching the line just as a read miss would.

Andalusite answered 30/11, 2011 at 19:10 Comment(7)
I think array[0] and array[1] should be in one cache line. They are very close, aren't they? - Slowmoving
@AlexeyMatveev: The 31st and 32nd are very close too, yet you assume they belong to different cache lines. The truth is, they may or may not be on the same cache line. What if elements 1 to 5 (and everything before 1 that fits) go into one cache line and 6 through 37 go into another? - Machiavelli
@Vlad-Lazarenko, I understand; I tested this with other numbers, too. - Slowmoving
@AlexeyMatveev: OK, here is what I'd do to improve your test. Get rid of clock() (it is a rough approximation) and replace it with a high-precision hardware-based timer. If you happen to run on Linux, you can use clock_gettime with the CLOCK_MONOTONIC_RAW flag (see bitbucket.org/Yocto/yocto/src/9cec50caf923/include/yocto/… and bitbucket.org/Yocto/yocto/src/9cec50caf923/src/stopwatch.cpp for example)... - Machiavelli
Disable CPU throttling and come up with a number of operations that in single-threaded mode will run for at least 1.5 seconds (otherwise power saving will skew your benchmark in the first few seconds). Then make sure the compiler doesn't optimize in undesired places. For example, I'd say "i" must be volatile to avoid unrolling, and array too, otherwise the compiler may decide to throw away your loop completely. And finally, you have to measure time inside "heavy_loop" to exclude thread-management overhead. - Machiavelli
@Vlad, thank you. Now I will try to follow your advice. But I still can't understand why the example without threads is so fast. - Slowmoving
@AlexeyMatveev: I can't say for sure without testing. The fact is that starting threads takes a hell of a lot of CPU cycles, plus join() is a blocking call, and blocking plus waking up is expensive. In fact, your current test may be measuring only that, without any false sharing. But there could be other surprises, such as the kernel deciding to move you from one CPU to another because of the new threads (see process/thread affinity), which doesn't happen with a single thread. - Machiavelli