Using "cuFFT Device Callbacks"

This is my first question, so I'll try to be as detailed as possible. I'm working on implementing noise reduction algorithm in CUDA 6.5. My code is based on this Matlab implementation: http://pastebin.com/HLVq48C1.
I'd love to use new cuFFT Device Callbacks feature, but I'm stuck on cufftXtSetCallback. Every time my cufftResult is CUFFT_NOT_IMPLEMENTED (14). Even example provided by nVidia fails the same way... My device callback testing code:

__device__ void noiseStampCallback(void *dataOut,
                                size_t offset,
                                cufftComplex element,
                                void *callerInfo,
                                void *sharedPointer) {
    element.x = offset;
    element.y = 2;
    ((cufftComplex*)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC noiseStampCallbackPtr = noiseStampCallback;

CUDA part of my code:

cufftHandle forwardFFTPlan;//RtC
//find how many windows there are
int batch = targetFile->getNbrOfNoiseWindows();
size_t worksize;

cufftCreate(&forwardFFTPlan);
cufftMakePlan1d(forwardFFTPlan, WINDOW, CUFFT_R2C, batch, &worksize); //WINDOW = 2048 

//host memory, allocate
float *h_wave;
cufftComplex *h_complex_waveSpec;
unsigned int m_num_real_elems = batch*WINDOW*2;
h_wave = (float*)malloc(m_num_real_elems * sizeof(float));
h_complex_waveSpec = (cufftComplex*)malloc((m_num_real_elems/2+1)*sizeof(cufftComplex));

//init
memset(h_wave, 0, sizeof(float) * m_num_real_elems); //last window won't probably be full of file data, so fill memory with 0
memset(h_complex_waveSpec, 0, sizeof(cufftComplex) * (m_num_real_elems/2+1));
targetFile->getNoiseFile(h_wave); //fill h_wave with samples from sound file

//device memory, allocate, copy from host
float *d_wave;
cufftComplex *d_complex_waveSpec;

cudaMalloc((void**)&d_wave, m_num_real_elems * sizeof(float));
cudaMalloc((void**)&d_complex_waveSpec, (m_num_real_elems/2+1) * sizeof(cufftComplex));

cudaMemcpy(d_wave, h_wave, m_num_real_elems * sizeof(float), cudaMemcpyHostToDevice);

//prepare callback
cufftCallbackStoreC hostNoiseStampCallbackPtr;

cudaMemcpyFromSymbol(&hostNoiseStampCallbackPtr,
                          noiseStampCallbackPtr,
                          sizeof(hostNoiseStampCallbackPtr));

cufftResult status = cufftXtSetCallback(forwardFFTPlan,
                                        (void **)&hostNoiseStampCallbackPtr,
                                        CUFFT_CB_ST_COMPLEX,
                                        NULL);
//always return status 14 - CUFFT_NOT_IMPLEMENTED

//run forward plan
cufftResult result = cufftExecR2C(forwardFFTPlan, d_wave, d_complex_waveSpec);
//result seems to be okay without cufftXtSetCallback

I'm aware that I'm just a beginner in CUDA. My question is:
How can I call cufftXtSetCallback properly or what is a cause of this error?

Solution

Referring to the documentation:

The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems. Use of this API requires a current license. Free evaluation licenses are available for registered developers until 6/30/2015. To learn more please visit the cuFFT developer page.

I think you are getting the not implemented error because either you are not on a Linux 64 bit platform, or you are not explicitly linking against the CUFFT static library. The Makefile in the cufft callback sample will give the correct method to link.

Even if you fix that issue, you will likely run into a CUFFT_LICENSE_ERROR unless you have gotten one of the evaluation licenses.

Note that there are various device limitations as well for linking to the cufft static library. It should be possible to build a statically linked CUFFT application that will run on cc 2.0 and greater devices.