Tags: python, numpy, optimization, linear-algebra, numpy-einsum

Optimizing np.einsum calls in Python


I have two numpy arrays: X with shape (N,N,N,N) and Y with shape (N,N). My goal is to evaluate the following einsum call as fast as possible:

Z = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

To do this, I benchmarked three implementations of the above, which yield identical answers but are optimized differently:

import numpy as np

def einsum_one(X, Y):
    # Single call, default optimize=False: no contraction-path search.
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

def einsum_two(X, Y):
    # Single call, with einsum searching for the optimal contraction path.
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

def fast_einsum(X, Y):
    # The pairwise contractions from the optimal path, written out by hand.
    Z = np.einsum('ij,iiii->ij', Y, X)
    W = np.einsum('il,ik->ikl', Y, Y)
    Z = np.einsum('ij,im->ijm', Z, Y)
    Z = np.einsum('ijm,ikl->jklm', Z, W)
    return Z
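
They do yield the same result; a quick sanity check with small random inputs:

N = 6
X = np.random.rand(N, N, N, N)
Y = np.random.rand(N, N)
assert np.allclose(einsum_one(X, Y), einsum_two(X, Y))
assert np.allclose(einsum_one(X, Y), fast_einsum(X, Y))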

I arrived at fast_einsum by inspecting the output of np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')[1], which was:

  Complete contraction:  iiii,ij,ik,il,im->jklm
         Naive scaling:  5
     Optimized scaling:  5
      Naive FLOP count:  1.638e+05
  Optimized FLOP count:  6.662e+04
   Theoretical speedup:  2.459
  Largest intermediate:  4.096e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   2                 ij,iiii->ij                        ik,il,im,ij->jklm
   3                  il,ik->ikl                          im,ij,ikl->jklm
   3                  ij,im->ijm                            ikl,ijm->jklm
   5               ijm,ikl->jklm                               jklm->jklm

My expectation was therefore: for large N, fast_einsum should be the fastest and einsum_one the slowest, while einsum_two should carry some path-search overhead that is noticeable for small N but negligible for large N. However, when I benchmarked the above, I obtained the following times (in microseconds):

 N | einsum_one | einsum_two | fast_einsum
---|------------|------------|------------
 4 |       15.1 |        949 |        13.8
 8 |        256 |        966 |        87.9
12 |       1920 |       1100 |         683
16 |       7460 |       1180 |        2490
20 |      21800 |       1390 |        7080
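
For reference, a harness along these lines reproduces the comparison (exact numbers will of course vary by machine):

from timeit import timeit

for N in (4, 8, 12, 16, 20):
    X = np.random.rand(N, N, N, N)
    Y = np.random.rand(N, N)
    for fn in (einsum_one, einsum_two, fast_einsum):
        t = timeit(lambda: fn(X, Y), number=100) / 100
        print(f'{N:2d}  {fn.__name__:12s}  {t * 1e6:8.1f} us')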

So my implementation of fast_einsum is not optimal. My question is: how do I interpret the output of np.einsum_path so that I can call np.einsum as fast as possible, without paying the overhead of the optimize='optimal' search on every call?


Solution

  • With a small N=10 (my machine isn't large enough to use N=100, where X alone would hold 10^8 elements, roughly 800 MB), here are some quick timings:
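
    The session below assumes arrays along these lines (random contents; only the shapes matter for the timings):

    import numpy as np

    N = 10
    X = np.random.rand(N, N, N, N)   # 10**4 = 10,000 elements, ~80 kB
    Y = np.random.rand(N, N)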

    Precompute the path:

    In [183]: path=np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)[0]
    

    default:

    In [184]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=False)
    2.12 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    True:

    In [185]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)
    509 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    
    In [186]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize='optimal')
    3.31 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    The path appears to be the same for True and 'optimal', but the times are quite different; I don't know why.

    Using the precomputed path:

    In [187]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=path)
    291 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    In [188]: path
    Out[188]: ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]
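
    To read that path: each tuple indexes into the current list of operands; that pair is contracted, removed from the list, and the result is appended to the end. Replaying ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)] by hand reproduces the four pairwise contractions shown in the question's einsum_path printout (a sketch, with the intermediate subscripts as comments):

    ops = [X, Y, Y, Y, Y]                           # iiii, ij, ik, il, im
    r1 = np.einsum('iiii,ij->ij', ops[0], ops[1])   # step (0, 1)
    ops = [Y, Y, Y, r1]                             # ik, il, im, ij
    r2 = np.einsum('ik,il->ikl', ops[0], ops[1])    # step (0, 1)
    ops = [Y, r1, r2]                               # im, ij, ikl
    r3 = np.einsum('im,ij->ijm', ops[0], ops[1])    # step (0, 1)
    ops = [r2, r3]                                  # ikl, ijm
    Z = np.einsum('ikl,ijm->jklm', ops[0], ops[1])  # final step (0, 1)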
    

    Since all the dimensions are the same, I'm not sure the path makes a big difference here: all the Y operands are identical, so they can be contracted in any order. Anyway, the path calculation is heuristic to some degree and not guaranteed to be optimal for all sizes. As you found, there's an important scaling factor, partly due to the cost of computing the path itself and partly (I suspect) due to the size of the intermediate arrays and the resulting memory-management issues.
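
    That suggests the direct answer to the question: pay the planning cost once with einsum_path, then pass the resulting list back in through optimize= on every later call with the same subscript string. A sketch (the hot loop is hypothetical):

    path = np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,
                          optimize='optimal')[0]
    for _ in range(1000):            # hypothetical hot loop
        Z = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,
                      optimize=path)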


    The einsum_path docstring explains the alternatives for optimize:

    * if a list is given that starts with ``einsum_path``, uses this as the
      contraction path
    * if False no optimization is taken
    * if True defaults to the 'greedy' algorithm
    * 'optimal' An algorithm that combinatorially explores all possible
      ways of contracting the listed tensors and chooses the least costly
      path. Scales exponentially with the number of terms in the
      contraction.
    * 'greedy' An algorithm that chooses the best pair contraction
      at each step. Effectively, this algorithm searches the largest inner,
      Hadamard, and then outer products at each step. Scales cubically with
      the number of terms in the contraction. Equivalent to the 'optimal'
      path for most contractions.
    

    Ok, this explains why 'optimal' is slower: it's searching more alternatives. Given these dimensions, True, 'greedy' and 'optimal' all end up with the same path.
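
    A quick check for these shapes (a sketch):

    subs = 'iiii,ij,ik,il,im->jklm'
    p_true    = np.einsum_path(subs, X, Y, Y, Y, Y, optimize=True)[0]
    p_greedy  = np.einsum_path(subs, X, Y, Y, Y, Y, optimize='greedy')[0]
    p_optimal = np.einsum_path(subs, X, Y, Y, Y, Y, optimize='optimal')[0]
    print(p_true == p_greedy == p_optimal)   # expect True: same order for all three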