
Thrust is very slow for array reduction

I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:



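#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <chrono>
#include <iostream>

int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0f);   // 1M floats on the device, all 5.0
    float sum;                                  // same type as the reduction result
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end-start);

    std::cout<<duration.count()<<" ";           // elapsed time in microseconds
    return 0;
}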

The issue is that thrust::reduce alone takes about 4 ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?

EDIT 1: Changed high_resolution_clock to steady_clock. thrust::reduce now takes about 2 ms to run. The updated code is as follows:



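#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <chrono>
#include <iostream>

int main()
{
    int N,M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N*M,5.0f);   // 1M floats on the device, all 5.0
    float sum;                                  // same type as the reduction result
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(),D.end(),(float)0,thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double,std::ratio<1,1000>>(end-start);

    std::cout<<duration.count()<<" ";           // elapsed time in milliseconds
    return 0;
}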

Additional information:
I am running CUDA C++ on Ubuntu in WSL2
CUDA version - 11.4
I am using the nvcc compiler to compile:

nvcc -o reduction reduction.cu

To run:



    I would not say you are doing anything wrong here, although that might be a matter of opinion. Let's unpack it a bit using a profiler. My setup is not exactly the same as yours (I'm using a different GPU, a Tesla V100, on Linux with CUDA 11.4). In my case the measurement reported by the code is ~0.5 ms, not 2 ms.

    So is there anything that can be done?

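    The thrust designers were aware of this situation: thrust::reduce allocates and frees temporary device storage on every call, so a host-side timer around the call measures that overhead as well as the kernel time. To get a (closer to) apples-to-apples comparison between measuring just the kernel duration(s) of a CUDA C++ implementation and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.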

    One way to do this would be to switch from thrust to cub.
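As a minimal sketch of the cub route, assuming the same 1M-element input as in the question: cub::DeviceReduce::Sum takes its temporary storage from the caller, so the allocation can be done once, outside the timed region. The helper name reduce_with_cub and its elapsed_ms parameter are made up for illustration. Note that the timed region here covers only the reduction, not the copy of the result back to the host.

```cuda
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <chrono>
#include <cstdio>

// Hypothetical helper (not from the question): sums d_in with
// cub::DeviceReduce::Sum and reports the host-side time of the reduction only.
float reduce_with_cub(const thrust::device_vector<float>& d_in, double* elapsed_ms)
{
    const int n = static_cast<int>(d_in.size());
    const float* in = thrust::raw_pointer_cast(d_in.data());

    thrust::device_vector<float> d_out(1);            // device-side result
    float* out = thrust::raw_pointer_cast(d_out.data());

    // With a null temp pointer, CUB only reports how many bytes it needs.
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, in, out, n);
    cudaMalloc(&d_temp, temp_bytes);                  // allocated once, outside the timing

    cudaDeviceSynchronize();
    auto start = std::chrono::steady_clock::now();
    cub::DeviceReduce::Sum(d_temp, temp_bytes, in, out, n);
    cudaDeviceSynchronize();                          // the launch is asynchronous; wait for it
    auto end = std::chrono::steady_clock::now();
    *elapsed_ms = std::chrono::duration<double, std::milli>(end - start).count();

    float sum = d_out[0];                             // copies one float back to the host
    cudaFree(d_temp);
    return sum;
}

int main()
{
    thrust::device_vector<float> D(1000 * 1000, 5.0f);
    double ms = 0.0;
    float sum = reduce_with_cub(D, &ms);
    std::printf("sum = %.1f, reduction took %.3f ms\n", sum, ms);
    return 0;
}
```

In a real application the temporary buffer would normally be kept alive and reused across many reductions rather than freed after one.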

    The thrust way to do it is to use a thrust custom allocator.
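A rough sketch of the allocator route, loosely modeled on the caching idea in thrust's custom temporary allocation example: the allocator hands previously used blocks back to thrust instead of calling cudaMalloc again. The struct name caching_allocator and its cuda_malloc_calls counter are illustrative, not part of thrust, and error checking is omitted.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cstddef>
#include <cstdio>
#include <map>

// Hypothetical caching allocator: blocks released by thrust are kept and
// reused for later requests of the same size.
struct caching_allocator
{
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char*> free_blocks;  // size -> reusable block
    std::map<char*, std::ptrdiff_t>      live_blocks;  // block in use -> size
    int cuda_malloc_calls = 0;                          // illustration only

    char* allocate(std::ptrdiff_t num_bytes)
    {
        char* ptr = nullptr;
        auto it = free_blocks.find(num_bytes);
        if (it != free_blocks.end()) {                  // reuse a cached block
            ptr = it->second;
            free_blocks.erase(it);
        } else {                                        // first request of this size
            void* raw = nullptr;
            cudaMalloc(&raw, num_bytes);
            ptr = static_cast<char*>(raw);
            ++cuda_malloc_calls;
        }
        live_blocks[ptr] = num_bytes;
        return ptr;
    }

    void deallocate(char* ptr, std::size_t)
    {
        auto it = live_blocks.find(ptr);
        if (it == live_blocks.end()) return;
        free_blocks.emplace(it->second, ptr);           // cache instead of cudaFree
        live_blocks.erase(it);
    }

    ~caching_allocator()
    {
        for (auto& kv : free_blocks) cudaFree(kv.second);
        for (auto& kv : live_blocks) cudaFree(kv.first);
    }
};

int main()
{
    thrust::device_vector<float> D(1000 * 1000, 5.0f);
    caching_allocator alloc;

    // Passing the allocator through the execution policy routes thrust's
    // temporary allocations to it.
    float first = thrust::reduce(thrust::cuda::par(alloc), D.begin(), D.end(),
                                 0.0f, thrust::plus<float>());
    int mallocs_after_first = alloc.cuda_malloc_calls;
    float second = thrust::reduce(thrust::cuda::par(alloc), D.begin(), D.end(),
                                  0.0f, thrust::plus<float>());

    std::printf("sums: %.1f %.1f, cudaMalloc calls: %d then %d\n",
                first, second, mallocs_after_first, alloc.cuda_malloc_calls);
    return 0;
}
```

With this in place, only the first reduction pays for the allocation, so timing the second and later calls gives a number closer to the kernel time alone.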

    There may be a few other differences in methodology that affect your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach, which you haven't shown. But according to my profiler measurement, that copy only accounts for a few microseconds.
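One way to reduce such methodology differences when timing from the host is to run an untimed warm-up call and then average over several calls. The sketch below assumes that approach; the helper name average_reduce_ms and the iteration count are arbitrary choices for illustration. Each timed call still includes thrust's temporary allocation and the copy of the result back to the host, so this removes only one-time costs and run-to-run noise, not the per-call overhead.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <chrono>
#include <cstdio>

// Hypothetical helper: average host-side time per thrust::reduce call,
// measured after one untimed warm-up call.
double average_reduce_ms(const thrust::device_vector<float>& d, int iterations,
                         float* last_sum)
{
    // Warm-up: any one-time cost of the first call is kept out of the timing.
    *last_sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        *last_sum = thrust::reduce(d.begin(), d.end(), 0.0f, thrust::plus<float>());
    cudaDeviceSynchronize();   // thrust::reduce already returns the result; this is a safeguard
    auto end = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(end - start).count() / iterations;
}

int main()
{
    thrust::device_vector<float> D(1000 * 1000, 5.0f);
    float sum = 0.0f;
    double ms = average_reduce_ms(D, 100, &sum);
    std::printf("sum = %.1f, average per call = %.3f ms\n", sum, ms);
    return 0;
}
```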