I'm working on an image processing project with CUDA 7.5 and a GeForce GTX 650 Ti. I decided to use two streams: one where I apply the algorithms responsible for enhancing the image, and another where I apply an algorithm that is independent of the rest of the processing.
I wrote an example to show my problem. In this example I create a stream and then call nppSetStream.
I then invoke the function nppiThreshold_LTValGTVal_32f_C1R, but the profiler shows two streams in use while the function executes.
Here is a code example:
#include <npp.h>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

int main(void) {
    int srcWidth = 1344;
    int srcHeight = 1344;
    int paddStride = 0;
    float* srcArrayDevice;
    float* srcArrayDevice2;
    unsigned char* dstArrayDevice;

    int status = cudaMalloc((void**)&srcArrayDevice, srcWidth * srcHeight * 4);
    status = cudaMalloc((void**)&srcArrayDevice2, srcWidth * srcHeight * 4);
    status = cudaMalloc((void**)&dstArrayDevice, srcWidth * srcHeight);

    cudaStream_t testStream;
    cudaStreamCreateWithFlags(&testStream, cudaStreamNonBlocking);
    nppSetStream(testStream);

    NppiSize roiSize = { srcWidth, srcHeight };
    //status = cudaMemcpyAsync(srcArrayDevice, &srcArrayHost, srcWidth*srcHeight*4, cudaMemcpyHostToDevice, testStream);

    int yRect = 100;
    int xRect = 60;
    float thrL = 50;
    float thrH = 1500;
    NppiSize sz = { 200, 400 };

    for (int i = 0; i < 10; i++) {
        int status3 = nppiThreshold_LTValGTVal_32f_C1R(
            srcArrayDevice + (srcWidth * yRect + xRect),
            srcWidth * 4,
            srcArrayDevice2 + (srcWidth * yRect + xRect),
            srcWidth * 4,
            sz,
            thrL,
            thrL,
            thrH,
            thrH);
    }

    int length = (srcWidth + paddStride) * srcHeight;
    int status6 = nppiScale_32f8u_C1R(srcArrayDevice, srcWidth * 4, dstArrayDevice + paddStride, srcWidth + paddStride, roiSize, 0, 65535);
    //int status7 = cudaMemcpyAsync(dstPinPtr, dstTest, length, cudaMemcpyDeviceToHost, testStream);

    cudaFree(srcArrayDevice);
    cudaFree(srcArrayDevice2);
    cudaFree(dstArrayDevice);
    cudaStreamDestroy(testStream);
    cudaProfilerStop();
    return 0;
}
This is what I got from the NVIDIA Visual Profiler: image_width1344
Why are there two streams if I set only one stream? This causes errors in my original project, so I'm thinking of switching to a single stream.
I noticed that this behaviour depends on the size of the image. If srcWidth and srcHeight are set to 1500, the result is this: image_width1500.
Why does changing the size of the image produce another stream?
Why are there two streams if I setted [sic] only one stream?
It appears that nppiThreshold_LTValGTVal_32f_C1R creates its own internal stream for executing one of the kernels it uses. The other is launched either into the default stream, or the stream you specified with nppSetStream.
I think this is really a documentation oversight/user expectation problem. nppSetStream is doing what it says, but nowhere is it stated that the library is limited to using one stream. It probably should be more explicit in the documentation about how many streams the library uses internally, and how nppSetStream interacts with the library. If this is a problem for your application, I suggest you raise a bug report with NVIDIA.
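In the meantime, if the errors come from reading results before the kernel on the internal stream has finished, one conservative workaround may be to wait on the whole device after the NPP call instead of only on your own stream. The following is a minimal sketch under that assumption, not a confirmed fix. The helper name thresholdThenDeviceSync and the zero-filled buffers are hypothetical and only serve the illustration.

```cuda
#include <cuda_runtime.h>
#include <npp.h>

// Hypothetical helper (not from the original post): runs the threshold on a
// user-created stream, then waits for the whole device so that any kernel NPP
// launched on a stream of its own has finished as well.
// Returns 0 on success, 1 on any CUDA or NPP failure.
int thresholdThenDeviceSync(int width, int height) {
    float* src = nullptr;
    float* dst = nullptr;
    size_t bytes = static_cast<size_t>(width) * height * sizeof(float);

    if (cudaMalloc((void**)&src, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&dst, bytes) != cudaSuccess) {
        cudaFree(src);
        return 1;
    }
    cudaMemset(src, 0, bytes);  // placeholder input data

    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        cudaFree(src);
        cudaFree(dst);
        return 1;
    }
    nppSetStream(stream);

    NppiSize roi = { width, height };
    int step = static_cast<int>(width * sizeof(float));  // row pitch in bytes
    NppStatus nppStatus = nppiThreshold_LTValGTVal_32f_C1R(
        src, step, dst, step, roi, 50.0f, 50.0f, 1500.0f, 1500.0f);

    // cudaStreamSynchronize(stream) would only wait for work queued on
    // `stream`. cudaDeviceSynchronize() waits for all streams on the device,
    // including any stream the library may have created internally.
    cudaError_t syncStatus = cudaDeviceSynchronize();

    nppSetStream(0);  // restore the default stream before destroying ours
    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);

    return (nppStatus == NPP_NO_ERROR && syncStatus == cudaSuccess) ? 0 : 1;
}
```

The trade-off is that a device-wide wait also stalls your second, independent stream, so this gives up some of the overlap the two-stream design was meant to provide.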
Why changing the size of the image produces another stream?
My guess would be that there are some performance heuristics at work, and whether the second stream is used depends on image size. The library is closed source, however, so I can't say for sure.