The CUDA NPP library supports image filtering through the nppiFilter_8u_C1R function, but I keep getting errors when I call it. I have no problem getting the boxFilterNPP sample code up and running.
eStatusNPP = nppiFilterBox_8u_C1R(, oDeviceSrc.pitch(),, oDeviceDst.pitch(),
oSizeROI, oMaskSize, oAnchor);
But if I change it to use nppiFilter_8u_C1R instead, eStatusNPP returns error -24 (NPP_TEXTURE_BIND_ERROR). The code below shows the changes I made to the original boxFilterNPP sample.
NppiSize oMaskSize = {5,5};
npp::ImageCPU_32s_C1 hostKernel(5,5);
for(int x = 0 ; x < 5; x++){
    for(int y = 0 ; y < 5; y++){
        hostKernel.pixels(x,y)[0].x = 1;
    }
}
npp::ImageNPP_32s_C1 pKernel(hostKernel);
Npp32s nDivisor = 1;
eStatusNPP = nppiFilter_8u_C1R(, oDeviceSrc.pitch(),, oDeviceDst.pitch(),
oMaskSize, oAnchor,
I have tried this on CUDA 4.2 and 5.0, with the same result.
The code runs with the expected result when oMaskSize = {1,1}.
I had the same problem when I stored my kernel as an ImageCPU.
A good solution is to store the kernel as a traditional 1D array on the device. I tried this, and it gave me good results (and none of those unpredictable or garbage images).
Thanks to Frank Jargstorff in this StackOverflow post for the 1D idea.
NppiSize oMaskSize = {5,5};
Npp32s hostKernel[5*5];
for(int x = 0 ; x < 5; x++){
    for(int y = 0 ; y < 5; y++){
        hostKernel[x*5+y] = 1;
    }
}
Npp32s* pKernel; //just a regular 1D array on the GPU
cudaMalloc((void**)&pKernel, 5 * 5 * sizeof(Npp32s));
cudaMemcpy(pKernel, hostKernel, 5 * 5 * sizeof(Npp32s), cudaMemcpyHostToDevice);
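One possible explanation for why the plain array helps, offered as an assumption rather than something confirmed in this thread: image containers typically pad each row out to a pitch, while the filter may read the kernel as a tightly packed row-major array. That would also fit the observation that a 1x1 mask works, since a single element has no row padding to get wrong. The host-only sketch below shows the difference between the two layouts. `packKernel` is a hypothetical helper, not part of NPP, and the pitch value used with it is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not an NPP function): copy a kw x kh kernel out of a
// row-padded buffer into a tightly packed row-major array.
// pitchBytes is the distance in bytes between the starts of consecutive rows.
std::vector<std::int32_t> packKernel(const unsigned char* padded,
                                     std::size_t pitchBytes, int kw, int kh) {
    assert(padded != nullptr);
    assert(kw > 0 && kh > 0);
    const std::size_t rowBytes = static_cast<std::size_t>(kw) * sizeof(std::int32_t);
    assert(pitchBytes >= rowBytes);

    std::vector<std::int32_t> packed(static_cast<std::size_t>(kw) * kh);
    for (int y = 0; y < kh; ++y) {
        // Copy only the kw meaningful values of each row and skip the padding.
        std::memcpy(&packed[static_cast<std::size_t>(y) * kw],
                    padded + static_cast<std::size_t>(y) * pitchBytes,
                    rowBytes);
    }
    return packed;
}
```

If the padded rows were handed over as if they were packed, the second kernel row would be read from the padding bytes of the first, which is one way to end up with unpredictable output.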
Using this original image, here's the blurred result that I get from your code with the 1D kernel array:
Other parameters that I used:
Npp32s nDivisor = 25;
NppiPoint oAnchor = {4, 4};
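The divisor matters because a 5x5 kernel of ones sums 25 pixels. Dividing by 25 turns that sum back into an average, whereas a divisor of 1 would, assuming 8-bit results are saturated, push most pixels to 255. The CPU-only sketch below illustrates that arithmetic for a single output pixel. It is not NPP's implementation: `filterPixel` is a hypothetical name, the window is simply placed with its top-left corner at (px, py), and anchor handling, kernel orientation and image borders are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// CPU reference for one output pixel (illustration only, not NPP code):
// weighted sum over the kernel window, integer-divided by `divisor`,
// then clamped to the 8-bit range [0, 255].
std::uint8_t filterPixel(const std::vector<std::uint8_t>& img, int width, int height,
                         int px, int py,
                         const std::vector<std::int32_t>& kernel, int kw, int kh,
                         std::int32_t divisor) {
    assert(divisor != 0);
    assert(static_cast<int>(img.size()) == width * height);
    assert(static_cast<int>(kernel.size()) == kw * kh);
    // The whole window must lie inside the image; borders are not handled here.
    assert(px >= 0 && py >= 0 && px + kw <= width && py + kh <= height);

    std::int64_t sum = 0;
    for (int ky = 0; ky < kh; ++ky) {
        for (int kx = 0; kx < kw; ++kx) {
            // The kernel is a packed row-major 1D array, like hostKernel above.
            sum += static_cast<std::int64_t>(kernel[ky * kw + kx]) *
                   img[(py + ky) * width + (px + kx)];
        }
    }
    const std::int64_t scaled = sum / divisor;
    return static_cast<std::uint8_t>(
        std::min<std::int64_t>(255, std::max<std::int64_t>(0, scaled)));
}
```

On a flat 5x5 patch of value 100, the kernel of ones gives a sum of 2500: a divisor of 25 returns 100, while a divisor of 1 clamps to 255.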