copenmpprefix-sum

Parallelize prefix-sum with Open MP


I have two vectors, a[n] and b[n], where n is a large number.

a[0] = b[0];

for (i = 1; i < size; i++) {
    a[i] = a[i-1] + b[i];
}

With this code we try to achieve that a[i] contains the sum of all the numbers in b[] until b[i]. I need to parallelize this loop using Open MP.

The main problem is that a[i] depends of a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] number to be ready, which takes a lot of time and makes no sense. Is there any approach in Open MP for solving this problem?


Solution

  • You're Carl Friedrich Gauss in the 18 century and your grade school teacher has decided to punish the class with a homework problem that requires a lot or mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this

    a[0] = b[0];   
    for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
    

    then you realize that

    a0  = b[0]
    a1  = (b[0]) + b[1];
    a2  = ((b[0]) + b[1]) + b[2]
    a_n = b[0] + b[1] + b[2] + ... b[n]
    

    using your notation again you rewrite the problem as

    int sum = 0;
    for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
    

    How to optimize this? First you define

    int sum(n0, n) { 
        int sum = 0;
        for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
        return sum;
    }
    

    Then you realize that

    a_n+1   = sum(0, n) + sum(n, n+1)
    a_n+2   = sum(0, n) + sum(n, n+2)
    a_n+m   = sum(0, n) + sum(n, n+m)
    a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
    

    So now you know what to do. Find t classmates. Have each one work on a subset of the numbers. To keep it simple you choose size is 100 and four classmates t0, t1, t2, t3 then each one does

     t0               t1                t2              t3
     s0 = sum(0,25)   s1 = sum(25,50)   s2 = sum(50,75) s3 = sum(75,100)
    

    at the same time. Then define

    fix(int n0, int n, int offset) {
        for(int i=n0; i<n; i++) a[i] += offset
    }
    

    and then each classmates goes back over their subset at the same time again like this

    t0             t1               t2                  t3 
    fix(0, 25, 0)  fix(25, 50, s0)  fix(50, 75, s0+s1)  fix(75, 100, s0+s1+s2)
    

    You realize that with t classmate taking about the same K seconds to do arithmetic that you can finish the job in 2*K*size/t seconds whereas one person would take K*size seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.

    Now your write up your algorithm in your own notation

    int *suma;  // array of partial results from each classmate
    #pragma omp parallel
    {
        int ithread = omp_get_thread_num();    //label of classmate
        int nthreads = omp_get_num_threads();  //number of classmates
        #pragma omp single
        suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
    
        //now have each classmate calculate their partial result s = sum(n0, n)
        int s = 0;
        #pragma omp for schedule(static) nowait
        for (int i=0; i<size; i++) s += b[i], a[i] = sum;
        suma[ithread+1] = s;
    
        //now wait for each classmate to finish
        #pragma omp barrier
    
        // now each classmate sums each of the previous classmates results
        int offset = 0;
        for(int i=0; i<(ithread+1); i++) offset += suma[i];
    
        //now each classmates corrects their result 
        #pragma omp for schedule(static)
        for (int i=0; i<size; i++) a[i] += offset;
    }
    free(suma)
    

    You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t you don't think it's worth the effort.

    Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b array slowly to the class and when you report the result a it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?

    The only thing you can think of is to get multiple classmates to read and record different subsets of the numbers to the class at the same time.