Parallelizing FFT (using CUDA)

In my application I need to transform each line of an image, apply a filter, and transform it back.

I want to be able to perform multiple FFTs at the same time on the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:

CUDA's FFT library, CUFFT, can only be called from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
In the topic "running FFTW on GPU vs using CUFFT", Robert Crovella says

"cufft routines can be called by multiple host threads".

I believed that doing all these FFTs in parallel would increase performance, but Robert comments

"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"

So, is this it? Is there no gain in performing more than one FFT at a time?

Is there any library that supports calls from the device?

Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or in the previous topic by Robert)?

Or is the best option to use multiple host threads?

(This two-link limit is killing me...)

My objective is to get some discussion on the best solution to this problem, since many have faced similar situations. This question might become obsolete once NVIDIA implements device calls in CUFFT. According to the discussion on the NVIDIA forum (first link), they are working on it, but there is no expected release date.

Solution

So, is this it? Is there no gain in performing more than one FFT at a time?

If the individual FFTs are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods, such as overlapping copy and compute, to get the most performance out of the machine.
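As a rough illustration of overlapping copy and compute, here is a minimal sketch that gives each large FFT its own CUDA stream and its own plan, so the transfer of one signal can overlap with the transform of another. The sizes (four signals of 2^20 points) and the test data are assumptions made for illustration. The filter step is only marked with a comment, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int fftSize = 1 << 20;   // assumed: each FFT is large
    const int numFFTs = 4;         // assumed: number of independent signals
    const size_t bytes = (size_t)fftSize * sizeof(cufftComplex);

    // Pinned host memory is required for cudaMemcpyAsync to overlap with compute.
    cufftComplex* h_data;
    cudaMallocHost(&h_data, bytes * numFFTs);
    for (size_t i = 0; i < (size_t)fftSize * numFFTs; ++i)
        h_data[i] = make_cuFloatComplex((float)(i % 5), 0.0f);

    cufftComplex* d_data;
    cudaMalloc(&d_data, bytes * numFFTs);

    // One stream and one plan per signal; each plan has its own work area,
    // so the transforms can safely run concurrently.
    cudaStream_t streams[numFFTs];
    cufftHandle plans[numFFTs];
    for (int k = 0; k < numFFTs; ++k) {
        cudaStreamCreate(&streams[k]);
        cufftPlan1d(&plans[k], fftSize, CUFFT_C2C, 1);
        cufftSetStream(plans[k], streams[k]);
    }

    for (int k = 0; k < numFFTs; ++k) {
        cufftComplex* h = h_data + (size_t)k * fftSize;
        cufftComplex* d = d_data + (size_t)k * fftSize;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[k]);
        cufftExecC2C(plans[k], d, d, CUFFT_FORWARD);
        // A filter kernel would be launched here, in streams[k].
        cufftExecC2C(plans[k], d, d, CUFFT_INVERSE);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[k]);
    }
    cudaDeviceSynchronize();

    // cuFFT transforms are unnormalized: forward + inverse scales by fftSize.
    printf("sample after round trip: %f\n", h_data[1].x / fftSize);

    for (int k = 0; k < numFFTs; ++k) {
        cufftDestroy(plans[k]);
        cudaStreamDestroy(streams[k]);
    }
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Compile with something like nvcc example.cu -lcufft. Whether the overlap pays off depends on how transfer time compares with transform time on your hardware, so it is worth checking with a profiler.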

If the FFTs are small, then a batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5 or newer, as there have been some API improvements.
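For the row-by-row case in the question, a batched plan could look roughly like the sketch below: one 1D complex FFT per image row, all rows transformed in a single call. The image size, the applyFilter kernel, and the all-pass gain array are hypothetical placeholders, not part of cuFFT, and error checking is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical filter: multiplies each frequency bin of each row by gain[bin].
// It also applies the 1/width normalization, because cuFFT is unnormalized.
__global__ void applyFilter(cufftComplex* data, const float* gain,
                            int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y;
    if (col < width && row < height) {
        float g = gain[col] / width;
        data[row * width + col].x *= g;
        data[row * width + col].y *= g;
    }
}

int main()
{
    const int width = 1024, height = 512;   // assumed image size
    std::vector<cufftComplex> h_img((size_t)width * height);
    for (size_t i = 0; i < h_img.size(); ++i)
        h_img[i] = make_cuFloatComplex((float)(i % 7), 0.0f);
    std::vector<float> h_gain(width, 1.0f); // all-pass filter, for illustration

    cufftComplex* d_img;
    float* d_gain;
    cudaMalloc(&d_img, h_img.size() * sizeof(cufftComplex));
    cudaMalloc(&d_gain, width * sizeof(float));
    cudaMemcpy(d_img, h_img.data(), h_img.size() * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_gain, h_gain.data(), width * sizeof(float),
               cudaMemcpyHostToDevice);

    // One 1D FFT of length `width` per row, `height` rows in a single batch.
    // With inembed/onembed set to NULL, cuFFT assumes contiguous rows.
    cufftHandle plan;
    int n[1] = { width };
    cufftPlanMany(&plan, 1, n, NULL, 1, width, NULL, 1, width,
                  CUFFT_C2C, height);

    cufftExecC2C(plan, d_img, d_img, CUFFT_FORWARD);
    dim3 block(256), grid((width + 255) / 256, height);
    applyFilter<<<grid, block>>>(d_img, d_gain, width, height);
    cufftExecC2C(plan, d_img, d_img, CUFFT_INVERSE);

    std::vector<cufftComplex> h_out(h_img.size());
    cudaMemcpy(h_out.data(), d_img, h_out.size() * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    // With an all-pass filter the output should match the input closely.
    float maxErr = 0.0f;
    for (size_t i = 0; i < h_out.size(); ++i)
        maxErr = fmaxf(maxErr, fabsf(h_out[i].x - h_img[i].x));
    printf("max round-trip error: %g\n", maxErr);

    cufftDestroy(plan);
    cudaFree(d_img);
    cudaFree(d_gain);
    return 0;
}
```

The same plan serves both directions, so the whole transform, filter, inverse-transform pipeline needs only three launches regardless of the number of rows.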

Is there any library that supports calls from the device?

The cuFFTDx library can be used to perform FFTs from device code.

Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or in the previous topic by Robert)? Or is the best option to use multiple host threads?

A batched plan is preferred over multiple host threads. The API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) into what is possible.
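As an example of that visibility, the sketch below uses cufftEstimateMany to ask how much work-area memory a batched plan would need before creating it, and compares that with the free device memory. The image dimensions are assumed values, and the estimate may differ from what the final plan actually allocates.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    const int width = 1024, height = 512;   // assumed image size
    int n[1] = { width };

    // Estimate the work area for `height` contiguous 1D C2C FFTs of length `width`.
    size_t workSize = 0;
    cufftResult r = cufftEstimateMany(1, n, NULL, 1, width, NULL, 1, width,
                                      CUFFT_C2C, height, &workSize);
    if (r != CUFFT_SUCCESS) {
        printf("cufftEstimateMany failed with code %d\n", (int)r);
        return 1;
    }

    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    size_t dataBytes = (size_t)width * height * sizeof(cufftComplex);
    printf("estimated work area: %zu bytes\n", workSize);
    printf("image data:          %zu bytes\n", dataBytes);
    printf("free device memory:  %zu bytes\n", freeBytes);
    printf("batch fits: %s\n",
           (workSize + dataBytes <= freeBytes) ? "yes" : "no");
    return 0;
}
```

If the batch does not fit, one option is to lower the batch count and process the image in several chunks of rows.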