I am trying to use thrust to reduce an array of 1M elements to a single value. My code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N * M, 5.0);
    int sum;
    auto start = std::chrono::high_resolution_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
The issue is, thrust::reduce alone takes about 4ms to run on my RTX 3070 laptop GPU. This is considerably slower than code I can write based on reduction #4 in this CUDA reference by Mark Harris, which takes about 150 microseconds. Am I doing something wrong here?
EDIT 1: Changed high_resolution_clock to steady_clock. thrust::reduce now takes 2ms to run. Updated code is as follows:
#include <chrono>
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main()
{
    int N, M;
    N = 1000;
    M = 1000;
    thrust::device_vector<float> D(N * M, 5.0);
    int sum;
    auto start = std::chrono::steady_clock::now();
    sum = thrust::reduce(D.begin(), D.end(), (float)0, thrust::plus<float>());
    auto end = std::chrono::steady_clock::now();
    auto duration = std::chrono::duration<double, std::ratio<1, 1000>>(end - start);
    std::cout << duration.count() << " ";
    std::cout << sum;
}
Additional information:
I am running CUDA C++ on Ubuntu in WSL2
CUDA version - 11.4
I am using the nvcc compiler to compile:
nvcc -o reduction reduction.cu
To run:
./reduction
Am I doing something wrong here?
I would not say you are doing anything wrong here. However, that might be a matter of opinion. Let's unpack it a bit, using a profiler. I'm not using the exact same setup as you (I'm using a different GPU, a Tesla V100, on Linux, CUDA 11.4). In my case the measurement spit out by the code is ~0.5ms, not 2ms.
thrust::reduce is accomplished under the hood via a call to cub::DeviceReduceKernel followed by cub::DeviceReduceSingleTileKernel. This two-kernel approach should make sense to you if you have studied Mark Harris' reduction material. The profiler tells me that, together, these two calls account for ~40us of the ~500us overall time. This is the time that would be most comparable to the measurement you made of your implementation of Mark Harris' reduction code, assuming you are timing the kernels only. If we multiply by 4 to account for the overall perf ratio, it is pretty close to your 150us measurement of that code.

Most of the remaining time in my measurement is taken up by a call to cudaMalloc (~200us) and a call to cudaFree (~200us). This isn't surprising, because if you study the cub::DeviceReduce methodology that is evidently being used by thrust, it requires an initial call to do a temporary allocation. Since thrust provides a self-contained call for thrust::reduce, it has to perform that call, as well as do a cudaMalloc and cudaFree operation for the indicated temporary storage.

So is there anything that can be done?
The thrust designers were aware of this situation. To get a (closer to) apples-apples comparison between just measuring the kernel duration(s) of a CUDA C++ implementation, and using thrust to do the same thing, you could use a profiler to compare measurements, or else take control of the temporary allocations yourself.
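If you go the profiler route, something like the following should give the breakdown I described above (this assumes Nsight Systems is available in your setup; profiling under WSL2 generally needs a reasonably recent version):

nsys profile --stats=true ./reduction

The CUDA API summary in that report lists the cudaMalloc/cudaFree times separately from the kernel times, so you can compare kernel-only durations directly. The other option is to take control of the temporary allocation yourself, so that the cudaMalloc/cudaFree disappear from the timed region.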
One way to do this would be to switch from thrust to cub.
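Roughly, that looks like the sketch below. It is not meant as a drop-in replacement for your benchmark (the variable names and the timing arrangement are just my illustration), but it shows the standard cub::DeviceReduce pattern: call once with a null temporary-storage pointer to query the required size, allocate that scratch space once up front, and keep only the reduction itself inside the timed region.

// Minimal sketch of the cub::DeviceReduce::Sum pattern. The temporary
// allocation is done once, outside the timed region, so the timed region
// contains only the reduction kernels.
#include <chrono>
#include <iostream>
#include <cub/cub.cuh>
#include <thrust/device_vector.h>

int main()
{
    int N = 1000, M = 1000;
    thrust::device_vector<float> D(N * M, 5.0f);
    thrust::device_vector<float> result(1);

    float *d_in  = thrust::raw_pointer_cast(D.data());
    float *d_out = thrust::raw_pointer_cast(result.data());

    // First call with a null temp pointer only reports the required size.
    void   *d_temp     = nullptr;
    size_t  temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, N * M);
    cudaMalloc(&d_temp, temp_bytes);   // one-time allocation, outside the timing

    auto start = std::chrono::steady_clock::now();
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, N * M);
    cudaDeviceSynchronize();           // make sure the kernels have finished
    auto end = std::chrono::steady_clock::now();

    float sum = result[0];             // explicit device-to-host copy of the result
    std::cout << std::chrono::duration<double, std::ratio<1, 1000>>(end - start).count()
              << " " << sum << std::endl;

    cudaFree(d_temp);
}

Note that the result stays in device memory (d_out) until you read result[0], so you can decide whether the device-to-host copy belongs inside or outside your timing.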
The thrust way to do it is to use a thrust custom allocator.
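A sketch of that approach, patterned after Thrust's custom_temporary_allocation example, is below. The allocator here just hands back one pre-sized scratch buffer, which assumes (for illustration only) that a single temporary allocation is live at a time and that the buffer is big enough for it; a more realistic allocator would cache and reuse blocks of whatever sizes get requested.

// Minimal sketch of a thrust custom temporary allocator, assuming one
// pre-allocated scratch buffer is large enough for the reduce's temporary
// storage and that only one temporary allocation is live at a time.
#include <iostream>
#include <new>       // std::bad_alloc
#include <cstddef>   // std::ptrdiff_t
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>

struct preallocated_allocator
{
    typedef char value_type;

    char  *buffer;
    size_t capacity;

    preallocated_allocator(char *buf, size_t cap) : buffer(buf), capacity(cap) {}

    // thrust requests temporary storage through allocate()/deallocate()
    char *allocate(std::ptrdiff_t num_bytes)
    {
        if (static_cast<size_t>(num_bytes) > capacity) throw std::bad_alloc();
        return buffer;                  // hand back the pre-sized buffer
    }
    void deallocate(char *, size_t) {}  // nothing to free; the buffer is reused
};

int main()
{
    thrust::device_vector<float> D(1000 * 1000, 5.0f);

    // One-time allocation, outside any timed region
    // (assumption: 1 MB is ample scratch space for this reduction).
    thrust::device_vector<char> scratch(1 << 20);
    preallocated_allocator alloc(thrust::raw_pointer_cast(scratch.data()),
                                 scratch.size());

    // Passing the allocator via the execution policy prevents thrust from
    // calling cudaMalloc/cudaFree inside the reduce.
    float sum = thrust::reduce(thrust::cuda::par(alloc),
                               D.begin(), D.end(),
                               0.0f, thrust::plus<float>());

    std::cout << sum << std::endl;
}

The important part is the thrust::cuda::par(alloc) execution policy: the algorithm's temporary-storage requests then go through allocate()/deallocate() on your object rather than through cudaMalloc/cudaFree.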
There may be a few other detail differences in methodology that are impacting your measurement. For example, the thrust call intrinsically copies the reduction result back to host memory. You may or may not be timing that step in your other approach, which you haven't shown. But according to my profiler measurement, that only accounts for a few microseconds.