I am trying to optimize some code that allocates arrays inside a function which is called repeatedly in a loop. I ran some performance tests in Jupyter and the results were counterintuitive to me. As a minimal example, see the following.
Given arrays A and B, I will perform matrix multiplication of these two in a loop. C and D are where the result of the multiplication is stored.

import numpy as np
A = np.random.rand(10, 10)
B = np.random.rand(10, 10000)
D = np.random.rand(10, 10000)
# Approach 1: no pre-allocation, C is rebound to a new array each iteration
for i in range(20000):
    C = A @ B

# Approach 2: pre-allocated D, assigned through a slice
for i in range(20000):
    D[:] = A @ B
I expected the second approach to be faster since it reuses the memory in D instead of allocating a new array each time. However, timing the loops shows that the first approach is actually 2x faster.
Why is the in-place assignment (D[:] = A @ B) slower than creating a new array (C = A @ B)? Is this related to NumPy's memory management?
You're not reusing D's memory. Both of your approaches allocate a new array every time. Your second approach then copies the contents of this new array into D, taking extra time to do so.
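To make the extra copy visible, here is a minimal sketch that spells out the two steps hidden inside D[:] = A @ B. The name tmp is only illustrative, and the arrays are smaller than in the question to keep it quick.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 10))
B = rng.random((10, 1000))
D = np.empty((10, 1000))

# D[:] = A @ B is evaluated in two steps:
tmp = A @ B   # 1. the product is written into a freshly allocated array
D[:] = tmp    # 2. that array is then copied element by element into D

# The temporary lives in its own buffer, separate from D.
print(np.shares_memory(tmp, D))
# After the copy, D holds the same values as the temporary.
print(np.array_equal(D, tmp))
```

So the slice assignment does everything the first approach does, plus one full pass over the result to copy it.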
If you want to directly write the results into D's memory, that'd be:
np.matmul(A, B, out=D)
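As a rough way to check this yourself, the sketch below confirms that np.matmul returns the array passed as out and times the three variants with timeit. The helper names new_array, copy_into and write_out are made up for illustration, and the numbers will vary with the machine, the BLAS build and the array sizes, so treat it as a starting point rather than a benchmark.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 10))
B = rng.random((10, 10000))
D = np.empty((10, 10000))  # must match the result's shape and dtype

# With out=, the product is written into D and D itself is returned.
result = np.matmul(A, B, out=D)
print(result is D)

def new_array():
    return A @ B             # allocates a new result array

def copy_into():
    D[:] = A @ B             # allocates a new array, then copies it into D

def write_out():
    np.matmul(A, B, out=D)   # writes the product directly into D

n = 200
for fn in (new_array, copy_into, write_out):
    seconds = timeit.timeit(fn, number=n)
    print(f"{fn.__name__}: {seconds / n * 1e6:.1f} us per call")
```

Note that out has to be an array of the right shape and a compatible dtype, otherwise NumPy raises an error instead of silently allocating.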