According to this article, sum reduction with the CUB library should be one of the fastest ways to perform a parallel reduction. As you can see in the code fragment below, the execution time is measured excluding the first call to cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
I assume this is connected with memory preparation, and when we reduce the same data several times it isn't necessary to call it every time. But when I have many different arrays with the same number of elements and the same data type, do I have to call it every time? If the answer is yes, using the CUB library becomes pointless.
void* temp_storage = NULL;
size_t temp_storage_bytes = 0;
// First call with temp_storage == NULL only queries the required
// temporary storage size; no reduction is performed.
cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
cudaMalloc(&temp_storage, temp_storage_bytes);
cudaDeviceSynchronize();
cudaCheckError();

cudaEventRecord(start);
for (int i = 0; i < REPEAT; i++) {
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
}
cudaEventRecord(stop);
cudaDeviceSynchronize();
I assume this is connected with memory preparation, and when we reduce the same data several times it isn't necessary to call it every time
That's correct.
but when I have many different arrays with the same number of elements and the same data type, do I have to call it every time?
No, you don't need to do it every time. The sole purpose of the "first" call to cub::DeviceReduce::Reduce (i.e. the one made with temp_storage = NULL) is to report the number of bytes of temporary storage CUB requires. If the type and size of your data do not change, there is no need to re-run either this step or the subsequent cudaMalloc. You can simply call cub::DeviceReduce::Reduce again (with temp_storage pointing to the previous allocation returned by cudaMalloc) on your "new" data, as long as the size and type of the data are the same.
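As a minimal sketch of that reuse pattern (the device pointers d_in1, d_in2, d_out and the length N are placeholders, and the Reduce call form mirrors the question's CUB version; newer CUB releases also take an explicit initial value, or you can use cub::DeviceReduce::Sum):

```cuda
#include <cub/cub.cuh>

// Reduce several same-sized, same-typed arrays with ONE size query
// and ONE temporary-storage allocation.
void reduce_many(int* d_in1, int* d_in2, int* d_out, int N)
{
    void*  temp_storage       = NULL;
    size_t temp_storage_bytes = 0;

    // Size query: with temp_storage == NULL, only temp_storage_bytes is set.
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes,
                              d_in1, d_out, N, cub::Sum());
    cudaMalloc(&temp_storage, temp_storage_bytes);

    // The same allocation serves every array of this size and type.
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes,
                              d_in1, d_out, N, cub::Sum());
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes,
                              d_in2, d_out + 1, N, cub::Sum());

    cudaFree(temp_storage);
}
```

Only when the element type or the array length changes does the required temporary storage size potentially change, and only then do you need to repeat the query/allocate step.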