cudathrust

CUDA force instruction execution order


I'm trying to transfer some data manipulations from CPU to GPU (CUDA), but there's one small part that requires instructions to be run in a specific order. In principle I could do the first few parallel parts, then transfer the results to the host for the serial part, then transfer it back again for the rest of the parallel parts, but I'm trying to avoid the memory transfer overhead.

The serial part of the calculation is of the form:

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + result[i-1];
}

Other than launching a kernel on a single thread for this calculation, is there a way to force threads or calculations to run in a specific order?

EDIT:

The problem is slightly more complicated than I first showed, and as far as I can tell it doesn't work as a prefix sum problem.

The loop actually takes the form:

for (int i = 2; i<size; i++)
{
    result[i] = oldArray[i] + k * result[i-1];
}

I've been looking through the documentation for the Thrust library, but that doesn't seem to have a solution. However, I may have just not understood what I was looking at. Is there a parallel solution to this sort of problem?


Solution

  • One possible description we could give to problems like this is to put them in the category of recurrence relations.

    The original question:

    for (int i = 2; i<size; i++)
    {
        result[i] = oldArray[i] + result[i-1];
    }
    

    could be solved trivially with a prefix sum on oldArray, if need be following the description given in this question/answer.

    With the modification from the edit:

    for (int i = 2; i<size; i++)
    {
        result[i] = oldArray[i] + k * result[i-1];
    }
    

    We have to do extra work. Referring to the previously linked answer, at the bottom of that answer, reference is given to this paper by Blelloch. If we study section 1.4 of that paper, we can observe that this new problem formulation fits the pattern of the "First Order Recurrences" described in section 1.4.1, specifically formula 1.5. A proof is provided there for how a solution to that formulation may be realized using a scan operation, if we carefully specify the input/output data as well as the scan operator.

    Thrust has the ability to support these kinds of generalizations to the basic scans provided. The sets of pairs in that paper referred to by s and c can be realized as thrust::tuple, and the specific operator can be passed to thrust scan operations, to generalize the operation behavior.

    I'm not going to try to cover all the ground in that paper; we mostly only need to concern ourselves with the material presented on pages 48 and 49.

    What follows then is an example using thrust demonstrating that we can use a thrust scan operation exactly as described in the paper, to solve this problem formulation. The code below is annotated with comments that reference specific formulas in the Blelloch paper:

    $ cat t1929.cu
    #include <iostream>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/scan.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <cstdlib>
    
    template <typename T>
    void cpufunction(T *result, T *oldArray, size_t size, T k){
      for (int i = 1; i<size; i++)
      {
        result[i] = oldArray[i] + k * result[i-1];
      }
    }
    
    struct scan_op // as per blelloch (1.7)
    {
      template <typename T1, typename T2>
      __host__ __device__
      T1 operator()(const T1 &t1, const T2 &t2){
        T1 ret;
        thrust::get<0>(ret) = thrust::get<0>(t1)*thrust::get<0>(t2);
        thrust::get<1>(ret) = thrust::get<1>(t1)*thrust::get<0>(t2)+thrust::get<1>(t2);
        return ret;
        }
    };
    
    typedef float mt;
    const size_t ds = 1048576;
    const mt k = 1.01;
    int main(){
    
      mt *b  = new mt[ds]; // b as in blelloch (1.5)
      mt *a  = new mt[ds]; // a as in blelloch (1.5)
      mt *cr = new mt[ds]; // cpu result
      for (int i = 0; i < ds; i++) { a[i] = k; b[i] = rand()/(float)RAND_MAX;}
      cr[0] = b[0];
      cpufunction(cr, b, ds, k);
      for (int i = 0; i < 10; i++) std::cout << cr[i] << ",";
      std::cout << std::endl;
      thrust::device_vector<mt> db(b, b+ds);
      thrust::device_vector<mt> da(a, a+ds);
      thrust::device_vector<mt> dy(ds);
      thrust::device_vector<mt> dx(ds);
      thrust::inclusive_scan(thrust::make_zip_iterator(thrust::make_tuple(da.begin(), db.begin())), thrust::make_zip_iterator(thrust::make_tuple(da.end(), db.end())), thrust::make_zip_iterator(thrust::make_tuple(dy.begin(), dx.begin())), scan_op());
      thrust::host_vector<mt> hx = dx;
      thrust::copy_n(hx.begin(), 10, std::ostream_iterator<mt>(std::cout, ","));
      std::cout << std::endl;
    }
    $ nvcc -std=c++14 t1929.cu -o t1929
    $ ./t1929
    0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,
    0.840188,1.24297,2.0385,2.85733,3.79755,4.03307,4.40863,5.22094,5.55093,6.16041,
    

    The First Order Recurrence described by Blelloch allows for the possibility of a more-or-less arbitrary a array. In this question, the a array is simply given by k, k, k, ... We could probably further simplify this by eliminating the a array and replacing it with a thrust::constant_iterator. That exercise is fairly mechanical and is left to the reader.