OpenMP: poor performance of heap arrays (stack arrays work fine)
I am a fairly experienced OpenMP user, but I have just run into a puzzling problem, and I am hopeful that someone here could help. The problem is that a simple hashing algorithm performs well for stack-allocated arrays, but poorly for arrays on the heap.

The example below uses i%M (i modulo M) to count every M-th integer in the respective array element. For simplicity, imagine N=1000000, M=10. If N%M==0, then the result should be that every element of bins[] is equal to N/M:

#pragma omp for
  for (int i=0; i<N; i++) 
    bins[ i%M ]++;

Array bins[] is private to each thread (I sum results of all threads in a critical section afterwards).

When bins[] is allocated on the stack, the program works great, with performance scaling proportionally to the number of cores.

However, if bins[] is on the heap (pointer to bins[] is on the stack), performance drops drastically. And that is a major problem!

I want to parallelize binning (hashing) of certain data into heap arrays with OpenMP, and this is a major performance hit.

It is definitely not something silly like all threads trying to write into the same area of memory. That is because each thread has its own bins[] array, results are correct with both heap- and stack-allocated bins, and there is no difference in performance for single-thread runs. I reproduced the problem on different hardware (Intel Xeon and AMD Opteron), with GCC and Intel C++ compilers. All tests were on Linux (Ubuntu and RedHat).

There seems to be no reason why good OpenMP performance should be limited to stack arrays.

Any guesses? Maybe access of threads to the heap goes through some kind of shared gateway on Linux? How do I fix that?

Complete program to play around with is below:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(const int argc, const char* argv[])
{
  const int N=1024*1024*1024;
  const int M=4;
  double t1, t2;
  int checksum=0;

  printf("OpenMP threads: %d\n", omp_get_max_threads());

  //////////////////////////////////////////////////////////////////
  // Case 1: stack-allocated array
  t1=omp_get_wtime();
  checksum=0;
#pragma omp parallel
  { // Each openmp thread should have a private copy of 
    // bins_thread_stack on the stack:
    int bins_thread_stack[M];
    for (int j=0; j<M; j++) bins_thread_stack[j]=0;
#pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_stack[j]++;
      }
#pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_stack[j];
  }
  t2=omp_get_wtime();
  printf("Time with stack array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  //////////////////////////////////////////////////////////////////
  // Case 2: heap-allocated array
  t1=omp_get_wtime();
  checksum=0;
  #pragma omp parallel 
  { // Each openmp thread should have a private copy of 
    // bins_thread_heap on the heap:
    int* bins_thread_heap=(int*)malloc(sizeof(int)*M); 
    for (int j=0; j<M; j++) bins_thread_heap[j]=0;
  #pragma omp for
    for (int i=0; i<N; i++) 
      { // Accumulating every M-th number in respective array element
        const int j=i%M;
        bins_thread_heap[j]++;
      }
  #pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
    free(bins_thread_heap);
  }
  t2=omp_get_wtime();
  printf("Time with heap  array: %12.3f sec, checksum=%d (must be %d).\n", t2-t1, checksum, N);
  //////////////////////////////////////////////////////////////////

  return 0;
}

Sample outputs of the program are below:

for OMP_NUM_THREADS=1

OpenMP threads: 1
Time with stack array: 2.973 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 3.091 sec, checksum=1073741824 (must be 1073741824).

and for OMP_NUM_THREADS=10

OpenMP threads: 10
Time with stack array: 0.329 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array: 2.150 sec, checksum=1073741824 (must be 1073741824).

I would very much appreciate any help!

Inutile answered 7/7, 2011 at 4:5

This is a cute problem: with the code as above (gcc4.4, Intel i7) with 4 threads I get

OpenMP threads: 4
Time with stack array:        1.696 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        5.413 sec, checksum=1073741824 (must be 1073741824).

but if I change the malloc line to

    int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);

(Update: or even

    int* bins_thread_heap=(int*)malloc(sizeof(int)*16);

)

then I get

OpenMP threads: 4
Time with stack array:        1.578 sec, checksum=1073741824 (must be 1073741824).
Time with heap  array:        1.574 sec, checksum=1073741824 (must be 1073741824).

The problem here is false sharing. The default malloc is very (space-)efficient and puts the small requested allocations all in one block of memory, next to each other; since the allocations are so small that several of them fit in the same cache line, every time one thread updates its values it dirties the cache line holding the neighbouring threads' values. By making the requested memory large enough, this is no longer an issue.
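
For instance, here is a minimal sketch of a cache-line-padded version of the Case 2 parallel region, assuming a 64-byte cache line (the real value is system dependent -- see the comments below on querying it). It rounds each thread's allocation up to a whole cache line and aligns it with posix_memalign (which needs _POSIX_C_SOURCE >= 200112L or equivalent), so neighbouring threads' bins can never share a line:

#define CACHE_LINE 64  // assumed cache line size in bytes

#pragma omp parallel
  { // Pad each thread's bins to a whole, aligned cache line so that
    // no two threads' bins occupy the same line (no false sharing).
    const size_t bytes =
      ((sizeof(int)*M + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
    int* bins_thread_heap = NULL;
    if (posix_memalign((void**)&bins_thread_heap, CACHE_LINE, bytes) != 0)
      abort();
    for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
    for (int i=0; i<N; i++)
      bins_thread_heap[i%M]++;
#pragma omp critical
    for (int j=0; j<M; j++) checksum+=bins_thread_heap[j];
    free(bins_thread_heap);
  }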

Incidentally, it should be clear why the stack-allocated case does not see this problem: different threads have different stacks, and those are far enough apart in memory that false sharing isn't an issue.

As a side point -- it doesn't really matter for M of the size you're using here, but if your M (or number of threads) were larger, the omp critical would be a big serial bottleneck; you can use OpenMP reductions to sum the checksum more efficiently:

#pragma omp parallel reduction(+:checksum)
    { // Each openmp thread should have a private copy of 
        // bins_thread_heap on the heap:
        int* bins_thread_heap=(int*)malloc(sizeof(int)*M*1024);
        for (int j=0; j<M; j++) bins_thread_heap[j]=0;
#pragma omp for
        for (int i=0; i<N; i++)
        { // Accumulating every M-th number in respective array element
            const int j=i%M;
            bins_thread_heap[j]++;
        }
        for (int j=0; j<M; j++)
            checksum+=bins_thread_heap[j];
        free(bins_thread_heap);
    }
Batavia answered 7/7, 2011 at 13:47
This is great, Jonathan, thank you! So does it mean that the only way to use the heap efficiently is by wasting it? Maybe some implementations of OpenMP have a special malloc function; I will have to research. By the way, what you say about the critical block being a bottleneck is incorrect. The critical block is at the end of my parallel section, and not inside the for loop. In fact, the `reduction` clause achieves the reduction by doing exactly that, placing a critical block at the end of the parallel section. But thanks for the heads up!Inutile
Ah, but (a) a critical is a very heavyweight operation, and (b) it is coarser-grained than necessary - you could do your local sum first, then just do a critical (or better, an atomic) to update the global sum. But even then, with a large number of threads a reduction will still be faster, because the final reduction can be done hierarchically (in log(number of threads) time, rather than in (number of threads) time).Batavia
As to the efficient use of the heap -- avoiding false sharing is a problem that is generic to all shared-memory operations, and the only way to avoid it is to make sure you have disjoint chunks of memory that are at least a cache line apart. The size of that spacing will be system dependent; making it multiple KB was overkill, usually 512 bytes or so will do it.Batavia
Of course, you are correct about the small tweaks I can do for this little code. My use of the critical section is actually an artifact of the actual problem I am solving. There, I have arrays of Fortran 90 derived types instead of integer arrays, and I couldn't for the life of me figure out a more elegant way of summing the results of individual threads for those.Inutile
For other viewers' sake, here is a link to a discussion of querying the cache line size #795132Inutile
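
For example, on Linux with glibc the L1 line size can usually be read at run time with sysconf; a minimal sketch (_SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may report 0 on some systems, so fall back to a sensible default):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  // glibc extension; returns 0 or -1 if the value is not available
  long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (line <= 0) line = 64;  // fall back to a common value
  printf("L1 data cache line size: %ld bytes\n", line);
  return 0;
}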

The initial question implied that heap arrays are slower than stack arrays. Unfortunately, the reason for this slowness relates to a particular case of cache-line clashes in multi-threaded applications; it does not justify the implication that heap arrays are in general slower than stack arrays. In most cases there is no significant difference in performance, especially where the arrays are much larger than the cache line size. The opposite can often be the case: using allocatable heap arrays, sized to what is actually required, can give a performance advantage over larger fixed-size arrays, which demand more memory transfers.

Urey answered 10/5, 2018 at 14:8
