I am using Python to compute the cross-correlation of two 2D matrices, and I have implemented three different methods. Below is my experimental code along with their execution times:
import numpy as np
from scipy.signal import fftconvolve
# Randomly generate 2D matrices
a = np.random.randn(64, 4200)
b = np.random.randn(64, 4200)
# Using np.correlate
def test_1(a, b):
return np.array([np.correlate(s_row, t_row, mode="full") for s_row, t_row in zip(a, b)])[:, len(a[0]) - 1:].sum(axis=0)
# Using scipy.signal.fftconvolve
def test_2(a, b):
return np.array([fftconvolve(s_row, t_row[::-1], mode="full") for s_row, t_row in zip(a, b)])[:, len(a[0]) - 1:].sum(axis=0)
# Using scipy.signal.fftconvolve but for 2D
def test_3(a, b):
return fftconvolve(a, np.fliplr(b), mode="full")[:, len(a[0]) - 1:].sum(axis=0)
# Performance testing
for i in range(100):
x = test_1(a, b) # 19.8 seconds
for i in range(100):
x = test_2(a, b) # 1.1 seconds
for i in range(100):
x = test_3(a, b) # 3.8 seconds
Even if I reverse b in advanced, the cost of time remains unaltered.
My computer configuration is:
Operating System: Windows 11
CPU: i7-12700F
Memory: 16GB
I have already used from concurrent.futures import ThreadPoolExecutor, as_completed to speed up the multiple loops. I am looking for further ways to accelerate this computation.
Are there any other algorithm optimization suggestions that could help reduce the computation time? Thank you.
I do not think you can find a much faster implementation than that using a sequential code on CPU. This answer explains why. It also provide a slightly-faster experimental FFTW-based implementation which is up to 40% faster, and alternative possibly-faster solutions.
The FFT library used by Scipy, which is pypocketfft
is surprisingly pretty fast (faster than the FFTW library in this case). test_2
spend most of its time in this native library so using a native code or even vectorization would not make the code much faster. Indeed, 60~65% of the time is spent in the function pocketfft::detail::rfftp::exec
, 15~20% in Python overheads, ~10% in pocketfft::detail::general_r2c
/pocketfft::detail::general_c2r
, and ~10% in other operations like copies and the main multiplication. As a result, writing your own specialized native module directly using pypocketfft
would make this computation only 15~20% faster at best. I do not think it is worth the effort.
Low-level profiling results show that the computing function test_2
is clearly compute bound. The same thing is true for the provided FFTW functions in this answer. That being said, the library pypocketfft
only use scalar instructions. Alternative libraries like the FFTW can theoretically benefit from SIMD units.
However, in practice, this is far from being easy to write an implementation both using an efficient algorithm and SIMD instructions. The Intel MKL (alternative library) certainly does that. The FFTW only uses SIMD instructions in the complex-to-complex FFTs. This is why the function test_fftw_2
(provided in at the end of this answer) is slower than the other FFTW-based functions (also provided).
I tried the FFTW (fully vectorized code), but it is significantly slower in this case (despite the name of the library meaning "Fastest Fourier Transform in the West"). This is because 4200*2-1
has 2 big prime factors and the FFTW underlying algorithm (AFAIK Cooley-Tukey) is inefficient for length having such big prime factors (it is much better for power of 2 length). In fact, the computation is about 8 times faster for a length of 8192 instead of 8399 (your use-case). Note I tried two versions: one in-place with only complex-to-complex FFTs and one out-of-place with real-to-complex/complex-to-real FFTs. The later was supposed to be faster but it is slower in practice... Maybe I missed something critical in my implementations.
We can cheat by using bigger array for the FFT, that is a kind of padding. Indeed, using an array size which is divisible by a big power of two can mitigate the aforementioned performance issue (related to prime factors). For example, here we can use an array size of 1024*9=9216
(the smallest number divisible by 1024 bigger than n*2-1
). While, we could use a smaller power-of-two base to be as close as possible to n*2-1
(increasing the efficiency by computing useful numbers), this also cause the algorithm the be less efficient (due to a smaller power-of-two factor). In this case, the FFTW implementation is about as fast as the Scipy code with a FFTW_MEASURE
plan (this results in a 7% faster code which is a tiny performance boost). With the FFTW_PATIENT
, it is actually faster by about 25%! The thing is FFTW_PATIENT
plans can be pretty slow to make. The initialization time can only be amortized with at least several thousands of iterations and not just 100. Note that pre-allocating all arrays and plan once before the loop results in a few percent faster execution.
I also found out that the FFTW is sub-optimal and computing chunks of line is ~10% faster than computing the whole array in one row. As a result, the final implementation test_fftw_4
is about 40% faster than test_2
(still without considering the first execution slowed down due to the planning time). This implementation is still a bit faster than test_2
without expensive planning (e.g. FFTW_MEASURE
). This last implementation efficiently use a CPU core: it massively use the SIMD units, it is pretty cache-friendly and spend nearly all its time in the native FFTW library. It can also be easily parallelized on few cores (regarding the chunk size).
Here is the different FFTW functions I tested:
import pyfftw
import numpy as np
# Initial FFTW implementation (slower than Scipy/pypocketfft)
def test_fftw_1(a, b):
m = a.shape[0]
n = a.shape[1]
assert b.shape[1] == n
# Input/output FFT temporary arrays
a2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
b2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
r2 = pyfftw.empty_aligned((m, 2*n-1), dtype='complex128')
a2[:,:n] = a
a2[:,n:] = 0
b2[:,:n] = b[:,::-1]
b2[:,n:] = 0
# See FFTW_ESTIMATE, FFTW_MEASURE or FFTW_PATIENT for varying
# performance at the expense of a longuer plannification time
axes = (1,)
flags = ['FFTW_MEASURE']
a2_fft_plan = pyfftw.FFTW(a2, a2, axes, 'FFTW_FORWARD', flags)
b2_fft_plan = pyfftw.FFTW(b2, b2, axes, 'FFTW_FORWARD', flags)
r2_ifft_plan = pyfftw.FFTW(r2, r2, axes, 'FFTW_BACKWARD', flags)
# Actual execution of the plans
a2_fft_plan()
b2_fft_plan()
np.multiply(a2, b2, out=r2)
r2_ifft_plan()
return r2.real[:, n-1:].sum(axis=0)
# Supposed to be faster, but slower in practice
# (because it seems not to use SIMD instructions in practice)
def test_fftw_2(a, b):
m = a.shape[0]
n = a.shape[1]
assert b.shape[1] == n
# Input FFT temporary arrays
a2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')
b2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')
a2[:,:n] = a
a2[:,n:] = 0
b2[:,:n] = b[:,::-1]
b2[:,n:] = 0
# Temporary and output arrays
a2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
b2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
r2_fft = pyfftw.empty_aligned((m, n), dtype='complex128')
r2 = pyfftw.empty_aligned((m, 2*n-1), dtype='float64')
# Same thing for planning
axes = (1,)
flags = ['FFTW_MEASURE']
a2_fft_plan = pyfftw.FFTW(a2, a2_fft, axes, 'FFTW_FORWARD', flags)
b2_fft_plan = pyfftw.FFTW(b2, b2_fft, axes, 'FFTW_FORWARD', flags)
r2_ifft_plan = pyfftw.FFTW(r2_fft, r2, axes, 'FFTW_BACKWARD', flags)
a2_fft_plan()
b2_fft_plan()
np.multiply(a2_fft, b2_fft, out=r2_fft)
r2_ifft_plan()
return r2.real[:, n-1:].sum(axis=0)
# Pretty-fast FFTW-based implementation (faster than Scipy/pypocketfft without considering the planning time)
# However, the first call is pretty slow (e.g. dozens of seconds)
def test_fftw_3(a, b):
m = a.shape[0]
n = a.shape[1]
assert b.shape[1] == n
# Input/output FFT temporary arrays
block_size = 1024
size = (2*n-1 + block_size-1) // block_size * block_size
a2 = pyfftw.empty_aligned((m, size), dtype='complex128')
b2 = pyfftw.empty_aligned((m, size), dtype='complex128')
r2 = pyfftw.empty_aligned((m, size), dtype='complex128')
a2[:,:n] = a
a2[:,n:] = 0
b2[:,:n] = b[:,::-1]
b2[:,n:] = 0
axes = (1,)
flags = ['FFTW_PATIENT']
a2_fft_plan = pyfftw.FFTW(a2, a2, axes, 'FFTW_FORWARD', flags)
b2_fft_plan = pyfftw.FFTW(b2, b2, axes, 'FFTW_FORWARD', flags)
r2_ifft_plan = pyfftw.FFTW(r2, r2, axes, 'FFTW_BACKWARD', flags)
# Actual execution of the plans
a2_fft_plan()
b2_fft_plan()
np.multiply(a2, b2, out=r2)
r2_ifft_plan()
diff = size - (2*n-1)
return r2.real[:, -diff-n:-diff].sum(axis=0)
# Fastest FFTW-based implementation
def test_fftw_4(a, b):
return sum([test_fftw_3(a[i*8:(i+1)*8], b[i*8:(i+1)*8]) for i in range(8)])
The Intel MKL is certainly faster than pypocketfft
, but this certainly requires a native C/C++ code wrapped to be callable from Python (not so simple to do). I think this is the best option in sequential on CPU.
An alternative solution is to use GPUs. For example, CuPy supports FFW computations though this requires Nvidia GPUs (due to CUDA). There is also the quite-new portable pyvkfft module based on VkFFT (supported by AMD). Be aware that double-precision computations are only fast on (expensive) server-side GPUs, not client-side ones (i.e. personal computers).
Note using single-precision should results in faster execution timings on both CPU and GPU (especially on client-side GPUs).
You can find a list of interesting C++ library and a benchmark here.