Fill histograms (array reduction) in parallel with OpenMP without using a critical section

I would like to fill histograms in parallel using OpenMP. I have come up with two different methods of doing this in C/C++.

The first method, proccess_data_v1, makes a private histogram hist_private for each thread, fills it in parallel, and then sums the private histograms into the shared histogram hist in a critical section.

The second method, proccess_data_v2, makes a shared array holding one histogram per thread, fills it in parallel, and then sums the per-thread histograms into the shared histogram hist in parallel.

The second method seems superior to me since it avoids a critical section and sums the histograms in parallel. However, it requires knowing the number of threads and calling omp_get_thread_num(), which I generally try to avoid. Is there a better way to do the second method without referencing the thread numbers and without using a shared array sized by the number of threads?

void proccess_data_v1(float *data, int *hist, const int n, const int nbins, float max) {
    #pragma omp parallel 
    {
        int *hist_private = new int[nbins];
        for(int i=0; i<nbins; i++) hist_private[i] = 0;
        #pragma omp for nowait
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(hist_private, nbins, max, x);
        }
        #pragma omp critical 
        {
            for(int i=0; i<nbins; i++) {
                hist[i] += hist_private[i];
            }
        }
        delete[] hist_private;
    }
}

void proccess_data_v2(float *data, int *hist, const int n, const int nbins, float max) {
    const int nthreads = 8;
    omp_set_num_threads(nthreads);
    int *hista = new int[nbins*nthreads];
    
    #pragma omp parallel 
    {
        const int ithread = omp_get_thread_num();
        for(int i=0; i<nbins; i++) hista[nbins*ithread+i] = 0;
        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[nbins*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[nbins*t + i];
            }
        }
        
    }
    delete[] hista;
}

Based on a suggestion by @HristoIliev, I have created an improved method called proccess_data_v3:

#define ROUND_DOWN(x, s) ((x) & ~((s)-1))  // round x down to a multiple of s (s must be a power of two)
void proccess_data_v3(float *data, int *hist, const int n, const int nbins, float max) {
    int* hista;
    #pragma omp parallel 
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();
        
        int lda = ROUND_DOWN(nbins+1023, 1024);  //1024 ints = 4096 bytes -> round to a multiple of page size
        #pragma omp single
        hista = (int*)_mm_malloc(lda*sizeof(int)*nthreads, 4096);  //align memory to page size

        for(int i=0; i<nbins; i++) hista[lda*ithread+i] = 0;
        #pragma omp for
        for(int i=0; i<n; i++) {
            float x = reconstruct_data(data[i]);
            fill_hist(&hista[lda*ithread], nbins, max, x);
        }

        #pragma omp for
        for(int i=0; i<nbins; i++) {
            for(int t=0; t<nthreads; t++) {
                hist[i] += hista[lda*t + i];
            }
        }

    }
    _mm_free(hista);
}
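
For reference, OpenMP 4.5 (which appeared after this question was asked) added array-section reductions in C/C++, which let the runtime create and sum the per-thread copies itself, so neither omp_get_thread_num() nor a manually sized shared array is needed. A minimal sketch under that assumption, with a compiler that supports OpenMP 4.5, reusing the reconstruct_data and fill_hist helpers from above (the name proccess_data_v4 is only for illustration and is not part of the original code):

void proccess_data_v4(float *data, int *hist, const int n, const int nbins, float max) {
    // reduction(+:hist[:nbins]) gives each thread a zero-initialized private copy of
    // hist[0..nbins-1]; the copies are summed into hist when the region ends.
    #pragma omp parallel for reduction(+:hist[:nbins])
    for(int i=0; i<n; i++) {
        float x = reconstruct_data(data[i]);
        fill_hist(hist, nbins, max, x);
    }
}
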
Sylvan asked 28/5, 2013 at 10:02. Comments (2):
Could you please explain why you are using nested parallel regions? (I am referring to your proccess_data_v1 approach.) Maybe I am not understanding something, but according to your code, it seems to me that you are asking for Nthreads**2 threads. That is to say, you are asking for more resources than are available. Is that correct? In other words, could you explain the behaviour of parallel regions inside the outer one? Thanks. – Lonnylonslesaunier
Hi @Sylvan, isn't proccess_data_v1 the fastest one, since it doesn't need shared memory? I tried versions 2 and 3 and they are slower than v1. Any suggestions? – Telescopy

You could allocate the big array inside the parallel region, where you can query the actual number of threads being used:

int *hista;
#pragma omp parallel 
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();

    #pragma omp single
    hista = new int[nbins*nthreads];

    ...
}
delete[] hista;

For better performance I would advise that you round the size of each thread's chunk in hista to a multiple of the system's memory page size, even if this could potentially leave holes between the different partial histograms. This way you will prevent both false sharing and remote memory access on NUMA systems (but not in the final reduction phase).
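
A minimal sketch of that rounding, assuming posix_memalign (as suggested in the comments below) and sysconf are available, with lda playing the same role as the chunk stride in the question's third version and nbins/nthreads already in scope:

// Round each thread's chunk up to a whole number of pages and allocate the
// block page-aligned; requires <stdlib.h> and <unistd.h>.
long page = sysconf(_SC_PAGESIZE);                      // page size in bytes, typically 4096
int ints_per_page = (int)(page / sizeof(int));
int lda = ((nbins + ints_per_page - 1) / ints_per_page) * ints_per_page;
int *hista = NULL;
if (posix_memalign((void**)&hista, (size_t)page, (size_t)lda * sizeof(int) * nthreads) != 0) {
    // handle the allocation failure
}
// ... fill and reduce as in the code above ...
free(hista);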

Cornflower answered 28/5, 2013 at 12:05. Comments (8):
Thank you. I implemented your suggestion and it's definitely a better solution. I need to read up on the page size. I thought making sure the chunks in hista were a multiple of the cache line size (64 bytes) would be sufficient to prevent false sharing. For example, if nbins was a multiple of 64 (and the address of hista was a multiple of 64 as well), wouldn't this prevent false sharing? – Sylvan
@HristoIliev, I added some code with your suggestions. I called the chunk size lda and made it a multiple of 64. Should I use a different value, e.g. 4 KB = page size? – Sylvan
If you run on a NUMA system, e.g. a multisocket AMD64 or modern Xeon machine, then you should round to 4 KiB. Also, once the correctly rounded sizes are determined, use posix_memalign to allocate memory aligned on a page boundary. – Cornflower
Also align on a cache line boundary if not on a page boundary, otherwise even having chunks of the correct size could lead to false sharing. – Cornflower
Thanks, I updated the code with _mm_malloc(lda*sizeof(int)*nthreads, 64). – Sylvan
@HristoIliev, I read up a bit on NUMA systems (in particular a comment by you: https://mcmap.net/q/240146/-what-39-s-the-difference-between-quot-static-quot-and-quot-dynamic-quot-schedule-in-openmp). I updated the code to use chunk sizes of 4096 bytes. I need to learn more about NUMA systems. When you allocate memory privately (e.g. in proccess_data_v1), is it guaranteed to be allocated in different pages (if the threads are on different CPUs)? Is this something we have to do manually if we allocate the memory as shared, but don't have to do if it's private? Maybe I should make an SO question on this. – Sylvan
It really depends on the memory manager in use. For example, on some distributions glibc is configured to use per-thread arenas and each thread gets its own heap space. Larger allocations are typically implemented as anonymous mmaps and therefore always get fresh pages. But it doesn't matter which thread allocated the memory. It matters which thread first touches each particular page: the current NUMA policy on Linux is "first touch", i.e. the physical memory page comes from the NUMA node where the code that first touched that page runs (see the first-touch sketch after these comments). – Cornflower
Okay, thanks for the information. I need to learn more about NUMA systems. Unfortunately, I don't have regular access to a multi-socket system. – Sylvan
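
A short sketch of the first-touch placement described in the last two comments, under the assumption of Linux's default first-touch NUMA policy; hista, lda and ithread refer to the same variables as in the code above:

#pragma omp parallel
{
    const int ithread = omp_get_thread_num();
    int *my_hist = &hista[(size_t)lda * ithread];
    // Zeroing the chunk here is its first touch, so its pages are placed on the
    // NUMA node of the core this thread runs on.
    for(int i=0; i<lda; i++) my_hist[i] = 0;
    // ... fill my_hist and reduce into hist as before ...
}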
