c++cudareducethrust

How can I do segmented reduction using CUDA thrust?


I want to store partial reduction results in an array.

Say I have data[8] = {10,20,30,40,50,60,70,80}.
And if I divide the data with the chunk_size of 2, the chunks will be {10,20}, {30,40}, ... , {70,80}.

If I target the summation, the reduction in total will be 360 but I want to get an array of partial_sums = {30,70,110,150} which is storing the partial sum of each block.

So far, what I have in mind is to construct an iterator strided_iterator, that will access 0, 2, ... th index of data[8] = {10,20,30,40,50,60,70,80} and something like

thrust::reduce(stride_iterator, stride_iterator + 2,
               partial_sums.begin(),
               thrust::plus<int>());

giving the desired result, but have no idea how could this be done efficiently.

For strided access, thrust/examples/strided_range.cu has a solution but this seems to be not applicable to store segmented reductions.

Of course I can brutally do it with a loop like this,

for (int i = 0; i<4; i++) {
  partial_sums[i] = thrust::reduce(data+2*i, data+2*i+2, 0, thrust::plus<int>());
}

But this kind of practice is what CUDA thrust is trying to avoid as much as possible, right? Somehow I should be able to put it all in a single Thrust call.


Solution

  • Based on the useful answer in Reduce multiple blocks of equal length that are arranged in a big vector Using CUDA, what I come up with so far is like follows.

    In fact I wanted to get min or max values in each chunk.

    using namespace thrust::placeholders;
    int main(int argc, char **argv) {
    
      int N = atoi(argv[1]);
      int K = atoi(argv[2]);
    
      std::cout << "N " << N << " K " << K << std::endl;
    
      typedef int mytype;
    
      thrust::device_vector<mytype> data(N*K);
      thrust::device_vector<mytype> sums(N);
    
      thrust::sequence(data.begin(),data.end());
    
      // method 1
      thrust::reduce_by_key(thrust::device,
                            
      thrust::make_transform_iterator(thrust::counting_iterator<int>(0),  _1/K),
                            
      thrust::make_transform_iterator(thrust::counting_iterator<int>(N*K),_1/K),
                            data.begin(),
                            thrust::discard_iterator<int>(),
                            sums.begin(),
                            thrust::equal_to<int>(),
                            thrust::minimum<mytype>());
    
      // method 2 (bad)
      for (int i=0; i<N; i++) {
        int res = thrust::reduce(data.begin()+K*i, data.begin()+K*i+K,std::numeric_limits<mytype>::max(),thrust::minimum<mytype>());
      }
    
      // just print the first 10 results
      thrust::copy_n(sums.begin(),10,std::ostream_iterator<mytype>(std::cout, ","));
      std::cout << std::endl;
    
      return 0;
    }
    

    The results are shown below with the estimated runtime measured via sdkTimer. As can be seen, method 1 with reduce_by_key is much~ faster than the second one with for loop.

    [sangjun@newmaster01 05_thrust]$ ./exe 1000 100
    N 1000 K 100
     - Elapsed time: 0.00008 sec 
     - Elapsed time: 0.02266 sec 
    0,100,200,300,400,500,600,700,800,900,
    [sangjun@newmaster01 05_thrust]$ ./exe 1000 256
    N 1000 K 256
     - Elapsed time: 0.00008 sec 
     - Elapsed time: 0.02084 sec 
    0,256,512,768,1024,1280,1536,1792,2048,2304,
    [sangjun@newmaster01 05_thrust]$ ./exe 100000 100
    N 100000 K 100
     - Elapsed time: 0.00016 sec 
     - Elapsed time: 1.98978 sec 
    0,100,200,300,400,500,600,700,800,900,
    [sangjun@newmaster01 05_thrust]$ ./exe 100000 256
    N 100000 K 256
     - Elapsed time: 0.00027 sec 
     - Elapsed time: 1.92896 sec 
    0,256,512,768,1024,1280,1536,1792,2048,2304,