Tags: python, numpy, numpy-einsum

Comparison of Looping and Einsum Operations on a List of Arrays for Speed Optimization


I want to use einsum to speed up some code, as follows.

As a simple example, take a list of two (3,3) arrays called dcx, which I created like this:

In [113]: dcx = [np.arange(9).reshape(3,3), np.arange(10,19).reshape(3,3)]; dcx
Out[113]: 
[array([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]]),
 array([[10, 11, 12],
        [13, 14, 15],
        [16, 17, 18]])]

I wanted to perform the following operation:

In [114]: Ax = np.zeros((2,2))
     ...: for i in range(2):
     ...:         for j in range(2):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

In [115]: Ax
Out[115]: 
array([[ 450., 1530.],
       [1530., 5310.]])

The time taken is ~0.001009225845336914 sec.

Then I tried using einsum as follows:

x = np.array(dcx)
In [117]: np.einsum('ikl,jml->ij', x,x)
Out[117]: 
array([[ 450, 1530],
       [1530, 5310]])

And the time taken is almost zero seconds.
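
For reference, the two approaches can be checked against each other with np.allclose (a minimal sketch using the arrays above):

import numpy as np

# The two (3,3) arrays from the example above
dcx = [np.arange(9).reshape(3, 3), np.arange(10, 19).reshape(3, 3)]

# Loop version
Ax = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

# einsum version on the stacked (2, 3, 3) array
x = np.array(dcx)
Ax_einsum = np.einsum('ikl,jml->ij', x, x)

print(np.allclose(Ax, Ax_einsum))   # True: both give [[450, 1530], [1530, 5310]]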

Since einsum seemed to be faster, I extended my example to a more complex case as follows:

dcx is now a list of 40 (40,40) arrays.

I noticed that the time taken by:

In [114]: Ax = np.zeros((40,40))
     ...: for i in range(40):
     ...:         for j in range(40):
     ...:             Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

is 0.02168726921081543 sec,

while the time taken by the einsum:

import time
import numpy as np

array_list = []
for _ in range(40):
    array = np.random.rand(40, 40)
    array_list.append(array)

t0 = time.time()
x = np.array(array_list)
np.einsum('ikl,jml->ij', x, x)
t1 = time.time()
print(f"Execution time : {t1-t0} sec")

0.15755462646484375 sec

I am confused about why einsum now takes longer. I wanted to use it to speed up more complex code with dcx of dimensions (3000, 3000, 3000), where my loops take a very long time.


Solution

  • Both methods are inefficient. einsum surprisingly fails to find an efficient contraction path in this case (even with the optimize='optimal' argument set). The dot approach is faster here (possibly because it explicitly benefits from the highly optimized BLAS matrix-multiplication implementation), but it is still clearly inefficient algorithmically, since many of the operations are not actually needed.
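
    To see which contraction strategy einsum actually picks, np.einsum_path can print it. A small inspection sketch (the random x below merely stands in for np.array(dcx)):

    import numpy as np

    x = np.random.rand(40, 40, 40)   # stand-in for np.array(dcx)
    path, info = np.einsum_path('ikl,jml->ij', x, x, optimize='optimal')
    print(info)                      # chosen contraction order and estimated FLOP counts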

    One can drastically reduce the number of operations by factorizing them. This kind of optimisation requires a bit of maths. Let us describe mathematically what np.sum(np.dot(x, y.T)) (with x = dcx[i] and y = dcx[j]) does and factorize the resulting expression:

    np.sum(np.dot(x, y.T)) = Σ_k Σ_m Σ_l x[k, l] * y[m, l]
                           = Σ_l (Σ_k x[k, l]) * (Σ_m y[m, l])

    For each (i,j), we can pre-compute the inner terms of the expression and reuse them:

    tmp1 = np.sum(dcx[i], axis=0)      # Σ_k x[k, l]: column sums of dcx[i]
    tmp2 = np.sum(dcx[j], axis=0)      # Σ_m y[m, l]: column sums of dcx[j]
    Ax[i, j] = np.dot(tmp1, tmp2)      # Final sum over l
    

    This is a huge improvement over the initial code, since this approach runs in O(n**4) while the initial one runs in O(n**5) (due to the expensive matrix multiplications), where n is the side length of each dcx matrix (40 in your example).
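
    Put together as a complete, runnable loop (a sketch with illustrative names), including a check against the brute-force version:

    import numpy as np

    m, n = 40, 40
    dcx = [np.random.rand(n, n) for _ in range(m)]

    # Brute-force reference: O(n**5)
    Ax_ref = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            Ax_ref[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))

    # Factorized version: O(n**4)
    Ax = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            tmp1 = np.sum(dcx[i], axis=0)   # column sums of dcx[i]
            tmp2 = np.sum(dcx[j], axis=0)   # column sums of dcx[j]
            Ax[i, j] = np.dot(tmp1, tmp2)

    print(np.allclose(Ax, Ax_ref))          # True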

    We can do better: we can pre-compute the column sums of all dcx slices in one pass so they are not recomputed for every (i, j) pair. Here is the final code:

    tmp = np.sum(dcx, axis=1)          # Column sums of every dcx plane at once
    Ax = tmp @ tmp.T                   # Final sum over l for all (i, j) pairs
    

    This approach runs in O(n**3). It is optimal because the size of the input is O(n**3) too. It should be >1_000_000 times faster than the initial approach for n=3000, not to mention the code is also shorter!
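
    As a quick sanity check, this one-liner reproduces the (2, 3, 3) example from the question:

    import numpy as np

    dcx = [np.arange(9).reshape(3, 3), np.arange(10, 19).reshape(3, 3)]

    tmp = np.sum(dcx, axis=1)   # column sums of each plane: [[9, 12, 15], [39, 42, 45]]
    Ax = tmp @ tmp.T
    print(Ax)                   # [[ 450 1530]
                                #  [1530 5310]]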

    Here are the performance results for n=40:

    einsum:     131_885 µs
    loop:        17_142 µs
    proposed:        47 µs   <----
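
    These numbers are machine-dependent; here is a sketch of how such a comparison can be reproduced with timeit (this harness is only illustrative, not the exact one used above):

    import timeit
    import numpy as np

    m = n = 40
    dcx = [np.random.rand(n, n) for _ in range(m)]
    x = np.array(dcx)

    def loop():
        Ax = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                Ax[i, j] = np.sum(np.dot(dcx[i], dcx[j].T))
        return Ax

    def einsum():
        return np.einsum('ikl,jml->ij', x, x)

    def proposed():
        tmp = np.sum(x, axis=1)
        return tmp @ tmp.T

    for f in (loop, einsum, proposed):
        best = min(timeit.repeat(f, number=10, repeat=3)) / 10
        print(f"{f.__name__:10s} {best * 1e6:12.0f} µs")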
    

    Related post: How is numpy.einsum implemented?