fortran openmp thrust prefix-sum stream-compaction

Stream compaction (or array packing) with prefix scan using OpenMP


I am using OpenMP to parallelize my code. I have an original array:

A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]

and a marks array:

M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]

Using the marks array M, I can compact my original array into this packed array:

A=[3,2,-4,-3,1,-1,2]

I'd like to solve this problem using a multithreaded approach. The Thrust library for C++ solves this problem, but I am not able to find a similar tool for Fortran. Is there a library, like Thrust for C++, that I can use to execute a stream compaction? Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?


Solution

  • Is there a library, like Thrust for C++, that I can use to execute a stream compaction?

    It shouldn't be that difficult to call a Thrust routine from Fortran (if you're willing to write a little bit of C++ code). Furthermore, Thrust can target an OpenMP backend instead of a GPU backend.

    Alternatively, is there an algorithm that I can write myself using Fortran and OpenMP to solve this?

    The basic parallel stream compaction algorithm is as follows. We will assume that there is one thread assigned per element in your data array, initially.

    1. Perform a parallel prefix sum (inclusive scan) on the M array:

       M=[1,0,1,0,0,0,1,0,0,1,1,0,0,0,1,1]
      sM=[1,1,2,2,2,2,3,3,3,4,5,5,5,5,6,7]
      
    2. Each thread will then inspect its element in the M array, and if that element is non-zero, it will copy its corresponding element of the A array into the output array (let's call it O). The scan value tells each marked element exactly where it lands: element i goes to output slot sM(i)-1 (0-based), so no two marked elements collide:

       M=[1,0,1,0,0,0, 1,0,0, 1,1,0,0,0, 1,1]
      sM=[1,1,2,2,2,2, 3,3,3, 4,5,5,5,5, 6,7]
       A=[3,5,2,5,7,9,-4,6,7,-3,1,7,6,8,-1,2]
       O=[3,  2,      -4,    -3,1,      -1,2]
      

    If you were doing this in OpenMP, you will need a barrier between steps 1 and 2. The work in step 2 is relatively simple and completely independent, so you could use an OpenMP parallel do loop and divide the work up in any fashion you wish. Step 1 is the complicated part; I suggest following the outline provided in the chapter we both linked. The OpenMP code there will require various barriers along the way, but it is parallelizable.

    As already mentioned in the comments, if this is the only piece of work you want to parallelize, I wouldn't recommend a GPU: the cost of transferring the data to and from the GPU would probably outweigh any parallel execution-time benefit you might accrue. But, as mentioned, Thrust can target an OpenMP realization rather than a GPU realization. It might be worth a try.

    Regarding calling Thrust from Fortran, most of what you need is here. That is admittedly CUDA Fortran, but the only differences should be dropping the device attribute and using thrust::host_vector instead of thrust::device_vector (at least, to get started).