Parallelize prefix-sum with Open MP
Asked Answered
N

1

5

I have two vectors, a[n] and b[n], where n is a large number.

a[0] = b[0];

for (i = 1; i < size; i++) {
    a[i] = a[i-1] + b[i];
}

With this code we try to achieve that a[i] contains the sum of all the numbers in b[] until b[i]. I need to parallelize this loop using Open MP.

The main problem is that a[i] depends of a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] number to be ready, which takes a lot of time and makes no sense. Is there any approach in Open MP for solving this problem?

Noelnoelani answered 6/3, 2016 at 1:2 Comment(6)
Is this homework by any chance?Holsworth
Why do you need to make it parallel? Seems like a sequential problem to me. It could be made parallel if you're doing that for various vectors, not just a.Overplay
The main problem is that a[i] depends of a[i-1] Well, no, actually not. a[i] is sum(b[0]..b[i]). You've sketched a serial way to compute that, but serial-ity is not an essential feature of the calculation. Point your favourite search engine at parallel prefix sum.Rollin
@Holsworth No it isn't, but I don't see how that would change the problematic here.Noelnoelani
@Overplay I need to make it parallel to try to reduce the computing time since n is a really large number.Noelnoelani
@HighPerformanceMark Thanks for your answer, I found useful your guidance. It seems like i have to rebuild the algorithm so that i can work with it in a parallel way.Noelnoelani
D
16

You're Carl Friedrich Gauss in the 18 century and your grade school teacher has decided to punish the class with a homework problem that requires a lot or mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this

a[0] = b[0];   
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];

then you realize that

a0  = b[0]
a1  = (b[0]) + b[1];
a2  = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... b[n]

using your notation again you rewrite the problem as

int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;

How to optimize this? First you define

int sum(n0, n) { 
    int sum = 0;
    for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
    return sum;
}

Then you realize that

a_n+1   = sum(0, n) + sum(n, n+1)
a_n+2   = sum(0, n) + sum(n, n+2)
a_n+m   = sum(0, n) + sum(n, n+m)
a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)

So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size is 100 and four classmates t0, t1, t2, t3 then each one does

 t0               t1                t2              t3
 s0 = sum(0,25)   s1 = sum(25,50)   s2 = sum(50,75) s3 = sum(75,100)

at the same time. Then define

fix(int n0, int n, int offset) {
    for(int i=n0; i<n; i++) a[i] += offset
}

and then each classmates goes back over their subset at the same time again like this

t0             t1               t2                  t3 
fix(0, 25, 0)  fix(25, 50, s0)  fix(50, 75, s0+s1)  fix(75, 100, s0+s1+s2)

You realize that with t classmate taking about the same K seconds to do arithmetic that you can finish the job in 2*K*size/t seconds whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.

Now your write up your algorithm in your own notation

int *suma;  // array of partial results from each classmate
#pragma omp parallel
{
    int ithread = omp_get_thread_num();    //label of classmate
    int nthreads = omp_get_num_threads();  //number of classmates
    #pragma omp single
    suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;

    //now have each classmate calculate their partial result s = sum(n0, n)
    int s = 0;
    #pragma omp for schedule(static) nowait
    for (int i=0; i<size; i++) s += b[i], a[i] = sum;
    suma[ithread+1] = s;

    //now wait for each classmate to finish
    #pragma omp barrier

    // now each classmate sums each of the previous classmates results
    int offset = 0;
    for(int i=0; i<(ithread+1); i++) offset += suma[i];

    //now each classmates corrects their result 
    #pragma omp for schedule(static)
    for (int i=0; i<size; i++) a[i] += offset;
}
free(suma)

You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.

Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?

The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.

Duwe answered 7/3, 2016 at 9:52 Comment(5)
Wouldn't reading and writing different subsets of the numbers at the same time still be read/write bandwidth bound? Is there a technical name for this so that i can further research on how to do this? I have found out that the secuential version of the program is almost twice as fast than the parallel one. Is this because of being read/write bandwidth bound or because of iterating the whole size-sized vector twice? Thanks for your answer, it was very helpful.Noelnoelani
@paraleljoe, when I came up with this solution a few years ago I did not appreciate what memory bandwidth bound meant. It's the most important concept I think you should understand about parallel computing. If size larger than the last level cache then my solution will probably be worse so that's probably why it's slower. You can improve it and do the parallel part in chunks and pass on the total sum of each chunk to the next chuck. But the best that's going to do is get the hybrid parallel version as fast as the sequential one.Duwe
@paraleljoe, my solution would probably work for large size on a multi-socket system/NUMA sysem. So per socket but not per core. You could also use it in distributed computing. By for a single socket system it's only going to be useful for size not to large.Duwe
@paraleljoe, I added a link to my answer about being memory bandwidth bound. I suggest your read it. BTW, I have many answers about the prefix sum you can read about. I don't claim to be any expert about it. like I said I cam up with this solution before I understand what memory bandwidth bound really meant.Duwe
@paraleljoe, you should add the size of size to your question and also why you need to the the prefix sum for such a large size. I don't know why you would need to do the prefix sum for a very large size. Likely you can do the prefix sum in chunks along with other calculations to try and overcome the memory bandwidth. If you want more help you will need to add this information.Duwe

© 2022 - 2024 — McMap. All rights reserved.