I have an array of floats that I would like to perform a stream compaction operation on, like what is presented here: Parallel Prefix Sum (Scan) with CUDA, and then apply a transform based on the value and the address of the original element.
For example, I have an array with the values {10,-1, -10, 2}, and I want to return all the elements with an absolute value greater than 5, and apply a transform taking the value and its address in the array. The result here would be {transform(10,0),transform(-10,2)}.
I'm trying to use thrust for this, but this code will run often on large arrays, so ideally it wouldn't use intermediate buffers or multiple traversals of the array.
Is it possible to do what I'd like to do without allocating a secondary array and doing multiple traversals? If yes, does such code exist in the wild? Or at least does anybody have any pointers to which functions of thrust or any other library I could compose to reach my goal?
Yes, it's possible in thrust with a single thrust algorithm call (I assume that is what you mean by "without ... doing multiple traversals") and without "allocating a secondary array".
One approach would be to pass the data array plus an index/"address" array (via thrust::counting_iterator, which avoids the allocation) to a thrust::transform_iterator, which creates your "transform" operation (combined with an appropriate functor). You would then pass the above transform iterator to an appropriate thrust stream compaction algorithm to select the values you want.
Here's a possible approach:
$ cat t1044.cu
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <math.h>
#include <cstdlib>   // abs(int) on the host path
#include <iostream>
#include <iterator>  // std::ostream_iterator

__host__ __device__ int my_transform(int data, int idx){
  return (data - idx); // put whatever transform you want here
}

// applies my_transform to a (value, index) tuple
struct my_transform_func
{
  __host__ __device__
  int operator()(const thrust::tuple<int, int> &t) const {
    return my_transform(thrust::get<0>(t), thrust::get<1>(t));
  }
};

// selection test: keep elements whose absolute value exceeds 5
struct my_test_func
{
  __host__ __device__
  bool operator()(int data) const {
    return (abs(data) > 5);
  }
};

int main(){

  int data[] = {10,-1,-10,2};
  int dsize = sizeof(data)/sizeof(int);
  thrust::device_vector<int> d_data(data, data+dsize);
  thrust::device_vector<int> d_result(dsize);
  // zip each element with its index, wrap that in a transform iterator,
  // and compact using the original data as the stencil
  int rsize = thrust::copy_if(
                thrust::make_transform_iterator(
                  thrust::make_zip_iterator(thrust::make_tuple(
                    d_data.begin(), thrust::counting_iterator<int>(0))),
                  my_transform_func()),
                thrust::make_transform_iterator(
                  thrust::make_zip_iterator(thrust::make_tuple(
                    d_data.end(), thrust::counting_iterator<int>(dsize))),
                  my_transform_func()),
                d_data.begin(),    // stencil
                d_result.begin(),
                my_test_func()) - d_result.begin();
  thrust::copy_n(d_result.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t1044 t1044.cu
$ ./t1044
10,-12,
$
Some possible criticisms of this approach:
It would appear to be loading the d_data elements twice (once for the transform operation, once for the stencil). However, it's possible that the CUDA optimizing compiler would recognize the redundant load in the thread code that ultimately gets generated, and optimize it out.
It would appear that we are performing the transform operation on every data element, whether we intend to save it in the result or not. Once again, it's possible that the thrust copy_if implementation may actually defer the data loading operation until after the stencil decision is made. If that were the case, it's possible that the transform only gets done on an as-needed basis. Even if it is always done, this may be an insignificant issue, since many thrust operations tend to be load/store or memory-bandwidth bound, not compute bound. However an interesting alternate approach could be to use the adaptation created by @m.s. here, which applies the transform at the output iterator step. That would presumably limit the transform operation to only the data elements that are actually being saved in the result, although I have not inspected that closely either; a sketch of that idea follows below.
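Newer versions of thrust provide essentially that adaptation as thrust::transform_output_iterator. Here's a minimal sketch of the idea, assuming the same data, transform, and selection test as above (the tuple-based functor names are my illustrative renamings, not from the code above). Both the test and the transform read through a single zip iterator, so each d_data element is loaded once, and the transform runs in the output iterator, i.e. only for the elements that pass the test:
#include <thrust/device_vector.h>
#include <thrust/tuple.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_output_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <iostream>
#include <iterator>

// same transform as before, but now taking the (value, index) tuple
// and applied while writing the output
struct my_tuple_transform_func
{
  __host__ __device__
  int operator()(const thrust::tuple<int, int> &t) const {
    return thrust::get<0>(t) - thrust::get<1>(t);
  }
};

// selection test, also operating on the (value, index) tuple
struct my_tuple_test_func
{
  __host__ __device__
  bool operator()(const thrust::tuple<int, int> &t) const {
    return (abs(thrust::get<0>(t)) > 5);
  }
};

int main(){

  int data[] = {10,-1,-10,2};
  int dsize = sizeof(data)/sizeof(int);
  thrust::device_vector<int> d_data(data, data+dsize);
  thrust::device_vector<int> d_result(dsize);
  // compact the (value, index) tuples directly; the transform happens
  // in the output iterator, only for surviving elements
  auto out = thrust::make_transform_output_iterator(d_result.begin(), my_tuple_transform_func());
  int rsize = thrust::copy_if(
                thrust::make_zip_iterator(thrust::make_tuple(
                  d_data.begin(), thrust::counting_iterator<int>(0))),
                thrust::make_zip_iterator(thrust::make_tuple(
                  d_data.end(), thrust::counting_iterator<int>(dsize))),
                out,
                my_tuple_test_func()) - out;
  thrust::copy_n(d_result.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
Since copy_if is stable, this should print the same 10,-12, output as the original version on the example input.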
As mentioned in the comment below, this approach does allocate temporary storage (thrust does so under the hood, as part of the copy_if operation) and of course I'm explicitly allocating O(n) storage for the result. I suspect the thrust allocation (a single cudaMalloc) is probably also O(n) storage. While it may be possible to do all the things asked for (parallel prefix sum, stream compaction, data transformation) with absolutely no extra storage of any kind (so perhaps the request is for an in-place operation), I think that crafting an algorithm that way is likely to have significant negative performance implications, if it is doable at all (it's not clear to me that a parallel prefix sum can be realized with absolutely no additional storage of any kind, let alone coupling that with a stream compaction, i.e. data movement in parallel). Since thrust frees all such temporary storage it uses, there can't be much of a storage concern associated with frequent use of this method. The only remaining concern (I guess) is performance. If the performance is a concern, then the time overhead associated with the temporary allocations should be mostly eliminated by coupling the above algorithm with a thrust custom allocator (also see here), which would allocate the maximum needed storage buffer once, then re-use that buffer each time the above algorithm is used; a sketch of that follows.