pythonnumpyfftpyfftw

PyFFTW perfomance on multidimensional arrays


I have a nD array, say of dimensions: (144, 522720) and I need to compute its FFT.

PyFFTW seems slower than numpy and scipy, that it is NOT expected.

Am I doing something obviously wrong?

Below is my code

import numpy
import scipy      
import pyfftw
import time

n1 = 144
n2 = 522720
loops = 2

pyfftw.config.NUM_THREADS = 4
pyfftw.config.PLANNER_EFFORT = 'FFTW_ESTIMATE'
# pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
Q_2 = pyfftw.empty_aligned([n1, n2], dtype='complex_')
Q_ref = pyfftw.empty_aligned([n1, n2], dtype='complex_')

# repeat a few times to see if pyfft planner helps
for i in range(0,loops):
    Q_1 = numpy.random.rand(n1,n2)

    s1 = time.time()
    Q_ref = numpy.fft.fft(Q_1, axis=0)
    print('NUMPY - elapsed time: ', time.time() - s1, 's.')

    s1 = time.time()
    Q_2 = scipy.fft.fft(Q_1, axis=0)
    print('SCIPY - elapsed time: ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0)
    print('PYFFTW NUMPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    Q_2 = pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0)
    print('PYFFTW SCIPY - elapsed time = ', time.time() - s1, 's.')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))

    s1 = time.time()
    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    Q_2 = fft_object()
    print('FFTW PURE Elapsed time = ', time.time() - s1, 's')
    print('Equal = ', numpy.allclose(Q_2, Q_ref))


Solution

  • Firstly, if you turn on the cache before you main loop, the interfaces work largely as expected:

    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(30)
    

    It's interesting that despite wisdom that should be stored, the construction of the pyfftw objects is still rather slow when the cache is off. No matter, this is exactly the purpose of the cache. In your case you need to make the cache keep-alive time quite long because your loop is very long.

    Secondly, it's not a fair comparison to include the construction time of the fft_object in the final test. If you move it outside the timer, then calling fft_object is a better measure.

    Thirdly, it's also interesting to see that even with cache turned on, the call to numpy_fft is slower than the call to scipy_fft. Since there is no obvious difference in the code path, I suggest that is caching issue. This is the sort of issue that timeit seeks to mitigate. Here's my proposed timing code which is more meaningful:

    import numpy
    import scipy
    import pyfftw
    import timeit
    
    n1 = 144
    n2 = 522720
    
    pyfftw.config.NUM_THREADS = 4
    pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
    
    Q_1 = pyfftw.empty_aligned([n1, n2], dtype='float64')
    
    pyfftw.interfaces.cache.enable()
    pyfftw.interfaces.cache.set_keepalive_time(30)
    
    times = timeit.repeat(lambda: numpy.fft.fft(Q_1, axis=0), repeat=5, number=1)
    print('NUMPY fastest time = ', min(times))
    
    times = timeit.repeat(lambda: scipy.fft.fft(Q_1, axis=0), repeat=5, number=1)
    print('SCIPY fastest time = ', min(times))
    
    times = timeit.repeat(
        lambda: pyfftw.interfaces.numpy_fft.fft(Q_1, axis=0), repeat=5, number=1)
    print('PYFFTW NUMPY fastest time = ', min(times))
    
    times = timeit.repeat(
        lambda: pyfftw.interfaces.scipy_fftpack.fft(Q_1, axis=0), repeat=5, number=1)
    print('PYFFTW SCIPY fastest time = ', min(times))
    
    fft_object = pyfftw.builders.fft(Q_1, axis=0)
    times = timeit.repeat(lambda: fft_object(Q_1), repeat=5, number=1)
    print('FFTW PURE fastest time = ', min(times))
    

    On my machine this gives an output like:

    NUMPY fastest time =  0.6622681759763509
    SCIPY fastest time =  0.6572431400418282
    PYFFTW NUMPY fastest time =  0.4003451430471614
    PYFFTW SCIPY fastest time =  0.40362057799939066
    FFTW PURE fastest time =  0.324020683998242
    

    You can do a bit better if you don't force it to copy the input array into a complex data type by changing Q_1 to be complex128:

    NUMPY fastest time =  0.6483533839927986
    SCIPY fastest time =  0.847397351055406
    PYFFTW NUMPY fastest time =  0.3237176960101351
    PYFFTW SCIPY fastest time =  0.3199474769644439
    FFTW PURE fastest time =  0.2546963169006631
    

    That interesting scipy slow-down is repeatable.

    That said, if your input is real, you should be doing a real transform (for >50% speed-up with pyfftw) and manipulating the resultant complex output.

    What's interesting about this example is (I think) how important the cache is in the results (which I suggest is why switching to a real transform is so effective in speeding things up). You see something dramatic also when you use change the array size to 524288 (the next power of two, which you think might perhaps speed things up, but not slow it down dramatically). In this case everything slows down quite a bit, scipy particularly. It feels to me that scipy is more cache sensitive, which would explain the slow down with changing the input to complex128 (522720 is quite a nice number for FFTing though, so perhaps we should expect a slowdown).

    Finally, if speed is secondary to accuracy, you can always use 32-bit floats as the data type. If you combine that with doing a real transform, you get a better than factor of 10 speed-up over the initial numpy best given above:

    PYFFTW NUMPY fastest time =  0.09026529802940786
    PYFFTW SCIPY fastest time =  0.1701313250232488
    FFTW PURE fastest time =  0.06202622700948268
    

    (numpy and scipy don't change much as I think they use 64-bit floats internally).

    Edit: I forgot that the Scipy's fftpack real FFTs have a weird output structure, which pyfftw replicates with some slowdown. This is changed to be more sensible in the new FFT module.

    The new FFT interface is implemented in pyFFTW and should be preferred. There was unfortunately a problem with the docs being rebuilt so the docs were a long time out of date and didn't show the new interface - hopefully that is fixed now.