I have an odd case where I can see numpy.einsum speeding up a computation but can't see the same in einsum_path. I'd like to quantify/explain this possible speed-up but am missing something somewhere...
In short, I have a matrix multiplication where only the diagonal of the final product is needed.
a = np.arange(9).reshape(3,3)
print('input array')
print(a)
print('normal method')
print(np.diag(a.dot(a)))
print('einsum method')
print(np.einsum('ij,ji->i', a, a))
which produces the output:
input array
[[0 1 2]
 [3 4 5]
 [6 7 8]]
normal method
[ 15 54 111]
einsum method
[ 15 54 111]
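For instance, the first diagonal element is row 0 of a dotted with column 0 of a: 0*0 + 1*3 + 2*6 = 15.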
When running on a large matrix, numpy.einsum is substantially faster.
A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)
print('normal method')
%timeit np.diag(A.dot(B))
print('einsum method')
%timeit np.einsum('ij,ji->i', A, B)
which produces:
normal method
17.2 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
einsum method
1.02 ms ± 7.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My intuition is that this speed-up is possible because numpy.einsum is able to drop computations that would eventually be discarded by taking the diagonal - but, if I'm reading it correctly, the output of numpy.einsum_path is showing no speed-up at all.
print(np.einsum_path('ij,ji->i',A,B,optimize=True)[1])
Complete contraction: ij,ji->i
Naive scaling: 2
Optimized scaling: 2
Naive FLOP count: 1.200e+06
Optimized FLOP count: 1.200e+06
Theoretical speedup: 1.000
Largest intermediate: 2.000e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   2                     ji,ij->i                                     i->i
Questions: am I reading this output correctly? If so, why doesn't numpy.einsum_path show the speed-up I can measure for the ij,ji->i path in numpy.einsum?

That path analysis just looks at alternative contraction orders when working with more than 2 arguments. With just 2 arguments there is nothing to reorder, so it reports no speedup.
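For contrast, here is a minimal sketch (shapes chosen for illustration, not taken from the question) where the path analysis does find something: with three operands the contraction order matters, and einsum_path reports a real theoretical speedup.

import numpy as np

A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)
C = np.random.randn(2000, 300)

# Contracting B and C first gives a small (300,300) intermediate
# instead of the (2000,2000) one produced by starting with A and B.
path, info = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')
print(info)  # optimized FLOP count is well below the naive count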
Your diag(dot):
In [113]: np.diag(a.dot(a))
Out[113]: array([ 15, 54, 111])
The equivalent using einsum is:
In [115]: np.einsum('ii->i',np.einsum('ij,jk->ik',a,a))
Out[115]: array([ 15, 54, 111])
But we can skip the intermediate step:
In [116]: np.einsum('ij,ji->i',a,a)
Out[116]: array([ 15, 54, 111])
The indexed notation is flexible enough that it doesn't need to go through the full dot calculation.
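To make that concrete, a minimal sketch of what 'ij,ji->i' computes, written as explicit Python loops (illustrative only; einsum runs this in compiled code):

import numpy as np

a = np.arange(9).reshape(3, 3)
out = np.zeros(3, dtype=a.dtype)
for i in range(3):
    for j in range(3):
        # only the terms that diag(a.dot(a)) would keep are ever computed
        out[i] += a[i, j] * a[j, i]
print(out)  # [ 15  54 111]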
Another way to get the same result is:
In [117]: (a*a.T).sum(axis=1)
Out[117]: array([ 15, 54, 111])
With matmul we can do the calculation without the diag, treating the first dimension as a 'batch'. But it requires some reshaping first:
In [121]: a[:,None,:]@a.T[:,:,None]
Out[121]:
array([[[ 15]],
[[ 54]],
[[111]]])
In [122]: np.squeeze(a[:,None,:]@a.T[:,:,None])
Out[122]: array([ 15, 54, 111])
My times:
normal method
135 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
einsum method
3.09 ms ± 46.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [130]: timeit (A*B.T).sum(axis=1)
8.06 ms ± 78.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [131]: timeit np.squeeze(A[:,None,:]@B.T[:,:,None])
3.52 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dot creates a (2000,2000) array (32 MB of float64 values), and then extracts just 2000 elements from it.
In terms of element-wise multiplication, the dot is:
In [136]: (a[:,:,None]*a[None,:,:]).sum(axis=1)
Out[136]:
array([[ 15, 18, 21],
[ 42, 54, 66],
[ 69, 90, 111]])
With A and B, the intermediate product would be (2000,300,2000), which is summed down to (2000,2000). The einsum does (effectively) 2000 calculations of size (300,) reduced to (1,).
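In rough FLOP terms (my arithmetic, offered as an estimate): the full dot performs about 2000*300*2000 ≈ 1.2e9 multiplications to fill the (2000,2000) array, while the einsum needs only 2000*300 = 6e5 - the 1.200e+06 count reported by einsum_path, which tallies a multiply and an add per term.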
The einsum is closer to this calculation than to the diag/dot, treating the size-2000 dimension as a 'batch' of 1d dot calculations:
In [140]: timeit np.array([A[i,:].dot(B[:,i]) for i in range(2000)])
9.46 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
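As a sanity check (my addition, not part of the timings above), all of these approaches agree:

import numpy as np

A = np.random.randn(2000, 300)
B = np.random.randn(300, 2000)

ref = np.diag(A.dot(B))
print(np.allclose(ref, np.einsum('ij,ji->i', A, B)))
print(np.allclose(ref, (A * B.T).sum(axis=1)))
print(np.allclose(ref, np.squeeze(A[:, None, :] @ B.T[:, :, None])))
print(np.allclose(ref, [A[i, :].dot(B[:, i]) for i in range(2000)]))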