[SOLVED] Stream compaction with Thrust; best practices and fastest way?

I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease.

What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example code below. The part that I am unsure of how to tackle is dealing with all the extra fill space that is in d_res and thus h_res, after the compaction happens.

The example just uses a 0-99 sequence with all the even entries set to zero. This is just an example, and the real problem will be a general sparse array.

The answer to "How to quickly compact a sparse array with CUDA C?" helped me greatly, although when it reads the data back out, the result size is simply assumed to be a known constant.

I suspect that I can work around this by counting the nonzero elements in d_src and allocating d_res to be only that size, or by doing the count after the compaction and copying only that many elements. Is that really the right way to do it?

I get the sense that there will be some easy fix for this, via clever use of iterators or some other feature of thrust.

#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

//Predicate functor
struct is_not_zero
{
    __host__ __device__
        bool operator()(const int x) const
    {
        return (x != 0);
    }
};

using namespace std;

int main(void)
{
    size_t N = 100;

    //Host Vector
    thrust::host_vector<int> h_src(N);

    //Fill with some zero and some nonzero data, as an example
    for (int i = 0; i < N; i++){
        if (i % 2 == 0){
            h_src[i] = 0;
        }
        else{
            h_src[i] = i;
        }
    }

    //Print out source data
    cout << "Source:" << endl;

    for (int i = 0; i < N; i++){
        cout << h_src[i] << " ";
    }
    cout << endl;

    //copies to device
    thrust::device_vector<int> d_src = h_src;

    //Result vector
    thrust::device_vector<int> d_res(d_src.size());

    //Copy non-zero elements from d_src to d_res
    thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());

    //Copy back to host
    thrust::host_vector<int> h_res(d_res.begin(), d_res.end());
    //thrust::host_vector<int> h_res = d_res; //Or just this?

    //Show results
    cout << "h_res size is " << h_res.size() << endl;
    cout << "Result after remove:" << endl;

    for (int i = 0; i < h_res.size(); i++){
        cout << h_res[i] << " ";
    }
    cout << endl;

    return 0;
}

Also, I am a novice with thrust, so if the above code has any obvious flaws that go against recommended practices for using thrust, please let me know.

Similarly, speed is always of interest. Reading some of the various thrust tutorials, it seems like little changes here and there can be big speed savers or wasters. So, please let me know if there is a smart way to speed this up.

Solution

What you appear to have overlooked is that copy_if returns an iterator pointing to the end of the data written by the stream compaction operation. So all that is required is this:

//copies to device
thrust::device_vector<int> d_src = h_src;

//Result vector
thrust::device_vector<int> d_res(d_src.size());

//Copy non-zero elements from d_src to d_res
auto result_end = thrust::copy_if(d_src.begin(), d_src.end(), d_res.begin(), is_not_zero());

//Copy back to host
thrust::host_vector<int> h_res(d_res.begin(), result_end);

Doing this sizes h_res to hold only the nonzero elements and copies only those elements from the output of the stream compaction. No extra computation is required.
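If you want to try the returned-iterator idiom without a GPU, the standard library's std::copy_if behaves the same way in this respect: it returns an iterator one past the last element written. The following is only a host-side sketch of that idea, not Thrust code. The helper names compact_nonzero and make_example are made up for illustration, and the predicate is the one from the question with the CUDA qualifiers dropped.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same predicate as in the question, minus __host__ __device__ (host-only sketch).
struct is_not_zero
{
    bool operator()(const int x) const
    {
        return (x != 0);
    }
};

// Hypothetical helper: host-side analogue of the copy_if snippet above.
std::vector<int> compact_nonzero(const std::vector<int>& src)
{
    // Worst case every element survives, so make room for all of them.
    std::vector<int> res(src.size());

    // copy_if returns an iterator one past the last element it wrote.
    auto result_end = std::copy_if(src.begin(), src.end(), res.begin(), is_not_zero());

    // Build the output from the filled range only; the unused tail is dropped.
    return std::vector<int>(res.begin(), result_end);
}

// Hypothetical helper: the question's test data, 0..n-1 with even entries zeroed.
std::vector<int> make_example(std::size_t n)
{
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = (i % 2 == 0) ? 0 : static_cast<int>(i);
    }
    return v;
}
```

Calling compact_nonzero(make_example(100)) should give back just the odd values, with no trailing fill. The number of kept elements is also available as result_end - res.begin(), should you need the count itself.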