I have two numpy arrays: X of dimension (N,N,N,N) and Y of dimension (N,N). My goal is to evaluate the following einsum call as fast as possible:

Z = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

(In index notation: Z[j,k,l,m] = sum_i X[i,i,i,i] * Y[i,j] * Y[i,k] * Y[i,l] * Y[i,m].)
To do this, I benchmarked three implementations of the above, which yield identical answers but are optimized differently:
def einsum_one(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

def einsum_two(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

def fast_einsum(X, Y):
    Z = np.einsum('ij,iiii->ij', Y, X)
    W = np.einsum('il,ik->ikl', Y, Y)
    Z = np.einsum('ij,im->ijm', Z, Y)
    Z = np.einsum('ijm,ikl->jklm', Z, W)
    return Z
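A quick way to sanity-check that the three agree (the small N and random inputs here are arbitrary, just for illustration):

import numpy as np

N = 8
rng = np.random.default_rng(0)
X = rng.standard_normal((N, N, N, N))
Y = rng.standard_normal((N, N))

# All three should produce the same (N, N, N, N) result.
assert np.allclose(einsum_one(X, Y), einsum_two(X, Y))
assert np.allclose(einsum_one(X, Y), fast_einsum(X, Y))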
The way I got fast_einsum was by looking at the output of np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')[1], which was:
Complete contraction: iiii,ij,ik,il,im->jklm
Naive scaling: 5
Optimized scaling: 5
Naive FLOP count: 1.638e+05
Optimized FLOP count: 6.662e+04
Theoretical speedup: 2.459
Largest intermediate: 4.096e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
2 ij,iiii->ij ik,il,im,ij->jklm
3 il,ik->ikl im,ij,ikl->jklm
3 ij,im->ijm ikl,ijm->jklm
5 ijm,ikl->jklm jklm->jklm
So my expectation is as follows: for large N, fast_einsum should be the fastest and einsum_one the slowest. For einsum_two there should be some overhead that is noticeable for small N but negligible for large N. However, when I benchmarked the above, I obtained the following times (in microseconds):
N  | einsum_one | einsum_two | fast_einsum
4  |       15.1 |        949 |        13.8
8  |        256 |        966 |        87.9
12 |       1920 |       1100 |         683
16 |       7460 |       1180 |        2490
20 |      21800 |       1390 |        7080
So my implementation of fast_einsum is not optimal. My question, then: how do I interpret the output of np.einsum_path so that I can implement the np.einsum call as fast as possible, without paying the overhead of the optimize='optimal' parameter on every call?
With a small N=10 (my machine isn't large enough to use 100 (8Mb arrays)), here are some quick timings.
Precompute the path:
In [183]: path=np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)[0]
default:
In [184]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=False)
2.12 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
True:
In [185]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)
509 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [186]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize='optimal')
3.31 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The path appears to be the same for True and 'optimal', but the times are quite different. I don't know why.
Using the precomputed path:
In [187]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=path)
291 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [188]: path
Out[188]: ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]
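Since the precomputed-path call is the fastest of the lot, the practical pattern is: compute the path once for the operand shapes you care about, then pass it into every subsequent call. A minimal sketch of that pattern (the factory function and its name are mine, not part of numpy):

import numpy as np

def make_contractor(X, Y):
    # Compute the contraction path once, for these operand shapes.
    path = np.einsum_path('iiii,ij,ik,il,im->jklm',
                          X, Y, Y, Y, Y, optimize='optimal')[0]
    def contract(X, Y):
        # Reuse the precomputed path: no per-call optimization cost.
        return np.einsum('iiii,ij,ik,il,im->jklm',
                         X, Y, Y, Y, Y, optimize=path)
    return contract

# Usage; the path stays valid as long as the operand shapes don't change.
contract = make_contractor(X, Y)
Z = contract(X, Y)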
Since all dimensions are the same, I'm not sure the path makes a big difference here; all the Y operands are identical, so they can be contracted in any order. Anyway, the path calculation is, to some degree or other, heuristic, and not guaranteed to be optimal for all sizes. As you found, there's an important scaling factor, partly due to the cost of calculating the path, and partly (I suspect) due to the size of the intermediate arrays and the resulting memory-management issues.
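If you want to see where the crossover lands on your machine, a rough timing loop over N (the sizes and repeat count here are arbitrary) might look like:

import timeit
import numpy as np

expr = 'iiii,ij,ik,il,im->jklm'
for N in (4, 8, 12, 16, 20):
    X = np.random.rand(N, N, N, N)
    Y = np.random.rand(N, N)
    # Path cost is paid once here, outside the timed region.
    path = np.einsum_path(expr, X, Y, Y, Y, Y, optimize='optimal')[0]
    t = timeit.timeit(lambda: np.einsum(expr, X, Y, Y, Y, Y, optimize=path),
                      number=100)
    print(f'N={N}: {1e6 * t / 100:.1f} µs per call')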
The einsum_path docs explain the alternatives:
* if a list is given that starts with ``einsum_path``, uses this as the
contraction path
* if False no optimization is taken
* if True defaults to the 'greedy' algorithm
* 'optimal' An algorithm that combinatorially explores all possible
ways of contracting the listed tensors and chooses the least costly
path. Scales exponentially with the number of terms in the
contraction.
* 'greedy' An algorithm that chooses the best pair contraction
at each step. Effectively, this algorithm searches the largest inner,
Hadamard, and then outer products at each step. Scales cubically with
the number of terms in the contraction. Equivalent to the 'optimal'
path for most contractions.
OK, this explains why 'optimal' is slower: it's exploring more alternatives. Given these dimensions, True, 'greedy' and 'optimal' all end up with the same path.
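That's easy to confirm by comparing the computed paths directly (assuming X and Y as above):

import numpy as np

expr = 'iiii,ij,ik,il,im->jklm'
greedy = np.einsum_path(expr, X, Y, Y, Y, Y, optimize='greedy')[0]
optimal = np.einsum_path(expr, X, Y, Y, Y, Y, optimize='optimal')[0]
print(greedy == optimal)   # True for these shapes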