I have two vectors, a[n] and b[n], where n is a large number.
a[0] = b[0];
for (i = 1; i < size; i++) {
a[i] = a[i-1] + b[i];
}
With this code we try to achieve that a[i] contains the sum of all the numbers in b[] until b[i]. I need to parallelize this loop using Open MP.
The main problem is that a[i] depends of a[i-1], so the only direct way that comes to my mind would be waiting for each a[i-1] number to be ready, which takes a lot of time and makes no sense. Is there any approach in Open MP for solving this problem?
You're Carl Friedrich Gauss in the 18 century and your grade school teacher has decided to punish the class with a homework problem that requires a lot or mundane repeated arithmetic. In the previous week your teacher told you to add up the first 100 counting numbers and because you're clever you came up with a quick solution. Your teacher did not like this so he came up with a new problem which he thinks can't be optimized. In your own notation you rewrite the problem like this
a[0] = b[0];
for (int i = 1; i < size; i++) a[i] = a[i-1] + b[i];
then you realize that
a0 = b[0]
a1 = (b[0]) + b[1];
a2 = ((b[0]) + b[1]) + b[2]
a_n = b[0] + b[1] + b[2] + ... b[n]
using your notation again you rewrite the problem as
int sum = 0;
for (int i = 0; i < size; i++) sum += b[i], a[i] = sum;
How to optimize this? First you define
int sum(n0, n) {
int sum = 0;
for (int i = n0; i < n; i++) sum += b[i], a[i] = sum;
return sum;
}
Then you realize that
a_n+1 = sum(0, n) + sum(n, n+1)
a_n+2 = sum(0, n) + sum(n, n+2)
a_n+m = sum(0, n) + sum(n, n+m)
a_n+m+k = sum(0, n) + sum(n, n+m) + sum(n+m, n+m+k)
So now you know what to do. Find t
classmates. Have each one work on a subset of the numbers. To keep it simple you choose size
is 100 and four classmates t0, t1, t2, t3
then each one does
t0 t1 t2 t3
s0 = sum(0,25) s1 = sum(25,50) s2 = sum(50,75) s3 = sum(75,100)
at the same time. Then define
fix(int n0, int n, int offset) {
for(int i=n0; i<n; i++) a[i] += offset
}
and then each classmates goes back over their subset at the same time again like this
t0 t1 t2 t3
fix(0, 25, 0) fix(25, 50, s0) fix(50, 75, s0+s1) fix(75, 100, s0+s1+s2)
You realize that with t
classmate taking about the same K
seconds to do arithmetic that you can finish the job in 2*K*size/t
seconds whereas one person would take K*size
seconds. It's clear you're going to need at least two classmates just to break even. So with four classmates they should finish in half the time as one classmate.
Now your write up your algorithm in your own notation
int *suma; // array of partial results from each classmate
#pragma omp parallel
{
int ithread = omp_get_thread_num(); //label of classmate
int nthreads = omp_get_num_threads(); //number of classmates
#pragma omp single
suma = malloc(sizeof *suma * (nthreads+1)), suma[0] = 0;
//now have each classmate calculate their partial result s = sum(n0, n)
int s = 0;
#pragma omp for schedule(static) nowait
for (int i=0; i<size; i++) s += b[i], a[i] = sum;
suma[ithread+1] = s;
//now wait for each classmate to finish
#pragma omp barrier
// now each classmate sums each of the previous classmates results
int offset = 0;
for(int i=0; i<(ithread+1); i++) offset += suma[i];
//now each classmates corrects their result
#pragma omp for schedule(static)
for (int i=0; i<size; i++) a[i] += offset;
}
free(suma)
You realize that you could optimize the part where each classmate has to add the result of the previous classmate but since size >> t
you don't think it's worth the effort.
Your solution is not nearly as fast as your solution adding the counting numbers but nevertheless your teacher is not happy that several of his students finished much earlier than the other students. So now he decides that one student has to read the b
array slowly to the class and when you report the result a
it has to be done slowly as well. You call this being read/write bandwidth bound. This severely limits the effectiveness of your algorithm. What are you going to do now?