c++image-processing cuda convolution npp

CUDA, NPP Filters

The CUDA NPP library supports filtering of image using the nppiFilter_8u_C1R command but keep getting errors. I have no problem getting the boxFilterNPP sample code up and running.

eStatusNPP = nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(), 
                                  oDeviceDst.data(), oDeviceDst.pitch(), 
                                  oSizeROI, oMaskSize, oAnchor);

But if I change it to use nppiFilter_8u_C1R instead, eStatusNPP return the error -24 (NPP_TEXTURE_BIND_ERROR). The code below is the alterations I made to the original boxFilterNPP sample.

NppiSize oMaskSize = {5,5};
npp::ImageCPU_32s_C1 hostKernel(5,5);

for(int x = 0 ; x < 5; x++){
    for(int y = 0 ; y < 5; y++){
        hostKernel.pixels(x,y)[0].x = 1;
    }
}

npp::ImageNPP_32s_C1 pKernel(hostKernel);

Npp32s nDivisor = 1;

eStatusNPP = nppiFilter_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(), 
                               oDeviceDst.data(), oDeviceDst.pitch(), 
                               oSizeROI, 
                               pKernel.data(),
                               oMaskSize, oAnchor,
                               nDivisor);

This have been tried on CUDA 4.2 and 5.0, with same result.

The code runs with the expected result when oMaskSize = {1,1}

Solution

I had the same problem when I stored my kernel as an ImageCPU/ImageNPP.

A good solution is to store the kernel as a traditional 1D array on the device. I tried this, and it gave me good results (and none of those unpredictable or garbage images).

Thanks to Frank Jargstorff in this StackOverflow post for the 1D idea.

NppiSize oMaskSize = {5,5};
Npp32s hostKernel[5*5];

for(int x = 0 ; x < 5; x++){
    for(int y = 0 ; y < 5; y++){
        hostKernel[x*5+y] = 1;
    }
}

Npp32s* pKernel; //just a regular 1D array on the GPU
cudaMalloc((void**)&pKernel, 5 * 5 * sizeof(Npp32s));
cudaMemcpy(pKernel, hostKernel, 5 * 5 * sizeof(Npp32s), cudaMemcpyHostToDevice);

Using this original image, here's the blurred result that I get from your code with the 1D kernel array: enter image description here

Other parameters that I used:

Npp32s nDivisor = 25;
NppiPoint oAnchor = {4, 4};