I have the following benchmarking script:
import numpy as np
import timeit
import matplotlib.pyplot as plt

n1, n2, n3, n4, n5, m = (101, 33, 1, 32, 2, 32)


def make_arrays(aOrder, bOrder, cOrder):
    a = np.random.randn(n1, m) + 1j * np.random.randn(n1, m)
    b = np.random.randn(n2, m) + 1j * np.random.randn(n2, m)
    c = np.random.randn(n1, n2, n3, n4, n5) + 1j * np.random.randn(n1, n2, n3, n4, n5)
    return (
        np.array(a, order=aOrder),
        np.array(b, order=bOrder),
        np.array(c, order=cOrder),
    )


# used in B()
blockSize = 84

resA = []
resB = []
resC = []

sizes = np.unique(np.exp(np.linspace(2, 6, 8)).astype(np.int64))
numTrials = 10

# overrides m from above
for m in sizes:
    a, b, c = make_arrays("F", "F", "F")

    # computed for inspection, but not used below
    path = np.einsum_path(
        a,
        [0, 5],
        b,
        [1, 5],
        c,
        [0, 1, Ellipsis],
        [Ellipsis, 5],
        optimize="greedy",
    )

    # A: one einsum over the full contraction af,bf,abcde->fcde
    def A():
        np.einsum(
            a,
            [0, 5],
            b,
            [1, 5],
            c,
            [0, 1, 2, 3, 4],
            [5, 2, 3, 4],
            optimize="greedy",
            order="F",
        )
        # print("einsum\n", res.flags)

    # B: the same contraction, split into blocks of blockSize columns of a and b
    def B():
        numBlocks = int(a.shape[1] // blockSize)
        np.concatenate(
            tuple(
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, kk * blockSize : (kk + 1) * blockSize],
                    [1, 0],
                    b[:, kk * blockSize : (kk + 1) * blockSize],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                )
                for kk in range(numBlocks)
            )
            + (
                # remainder block
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, numBlocks * blockSize :],
                    [1, 0],
                    b[:, numBlocks * blockSize :],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                ),
            ),
            axis=0,
        )

    # C: two chained einsums, contracting a first and then b
    def C():
        tmp = np.einsum(a, [0, 5], c, [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], order="F")
        np.einsum(b, [1, 5], tmp, [1, 2, 3, 4, 5], [2, 3, 4, 5], order="F")

    # warm-up
    A()
    B()
    C()

    measured = np.mean(timeit.repeat(A, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resA.append(measured)
    measured = np.mean(timeit.repeat(B, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resB.append(measured)
    measured = np.median(timeit.repeat(C, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resC.append(measured)

plt.figure()
plt.semilogy(sizes, resA, label="A")
plt.semilogy(sizes, resB, label="B")
plt.semilogy(sizes, resC, label="C")
plt.legend()
plt.show()
I think one only needs numpy and matplotlib to run it.
Approach A is my naive way of using einsum, since I expected this to work well. After this code had been sitting in my codebase for some time, I noticed that beyond a certain size of m, the computation gets very slow. Hence, I went ahead and implemented B, which produces satisfactory results across the board, but especially for small m it loses against A. Also, no matter how I toss and turn this in terms of memory layout of the input and output arrays, I see no noticeable difference in qualitative behavior.
Just to retain my sanity, I also tried an even more naive way with C, which is super slow, as expected.
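For reference, this is what the contraction computes, written as a plain loop over the m slices (a sketch only, way too slow for real use, but handy to check the fast variants against):

def reference(a, b, c):
    # af,bf,abcde->fcde: for every column f, weight c by the rank-1
    # matrix outer(a[:, f], b[:, f]) and sum out the first two axes
    m = a.shape[1]
    out = np.empty((m,) + c.shape[2:], dtype=np.result_type(a, b, c))
    for f in range(m):
        out[f] = np.tensordot(c, np.outer(a[:, f], b[:, f]), axes=([0, 1], [0, 1]))
    return out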
On a very powerful AMD EPYC 7343 16-Core Processor with numpy from Intel Python, i.e. using MKL, where I have forced MKL to use only one core for the computation, I get the following result:

[benchmark plot]
Essentially, I divide the runtime by the problem size m to get an estimate of the cost of a single "slice" of the problem.
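For reference, a sketch of how MKL can be pinned to a single core (this is the general technique, not necessarily exactly what I did; the environment variable has to be set before numpy is first imported, and threadpoolctl is an extra dependency):

import os
os.environ["MKL_NUM_THREADS"] = "1"  # must happen before the first `import numpy`
import numpy as np

# or, at runtime, via threadpoolctl:
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=1):
    A()  # run the benchmark function single-threaded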
To rule out any relation to the CPU or MKL, I used my MacBook Air with the M2 chip, first with a vanilla numpy:

[benchmark plot]

and then also with a numpy linked against the Accelerate BLAS library, to make use of the GPU or whatever, I don't really care. Then I get:

[benchmark plot]
So my question is: what the effing heck is wrong with A?
As a kinda off-topic sidenote: I am also running the same code on cupy from time to time, and there this behavior is not visible. There, the corresponding version of A scales nicely, at least for the exact same problem sizes as used above.
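For reference, the cupy counterpart of A is essentially the following (a sketch using string subscripts instead of the sublist form, with the same array shapes as above):

import cupy as cp

a_gpu, b_gpu, c_gpu = map(cp.asarray, (a, b, c))   # host -> device copies
res = cp.einsum("af,bf,abcde->fcde", a_gpu, b_gpu, c_gpu)
cp.cuda.Stream.null.synchronize()                  # finish the kernel before timing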
Another off-topic sidenote: if I use opt_einsum for the contraction, basically with the code of A, I get something similar to B from above, but slightly slower.
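For completeness, the opt_einsum variant is a one-liner (a sketch; contract computes a contraction path up front and dispatches the pairwise contractions to BLAS where it can):

import opt_einsum as oe

res = oe.contract("af,bf,abcde->fcde", a, b, c)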
Just to help me picture the action, the first try with m=100 is:
In [15]: np.einsum(
    ...:     a,
    ...:     [0, 5],
    ...:     b,
    ...:     [1, 5],
    ...:     c,
    ...:     [0, 1, 2, 3, 4],
    ...:     [5, 2, 3, 4],
    ...:     # optimize="greedy",
    ...:     ).shape
Out[15]: (100, 1, 32, 2)
The default greedy path shows the same thing:
In [17]: print(np.einsum_path(
...: a,
...: [0, 5],
...: b,
...: [1, 5],
...: c,
...: [0, 1, 2, 3, 4],
...: [5, 2, 3, 4],
...: optimize="greedy",
...: )[1])
Complete contraction: af,bf,abcde->fcde
Naive scaling: 6
Optimized scaling: 6
Naive FLOP count: 6.399e+07
Optimized FLOP count: 4.308e+07
Theoretical speedup: 1.485
Largest intermediate: 2.112e+05 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
6 abcde,af->cedbf bf,cedbf->fcde
5 cedbf,bf->fcde fcde->fcde
With optimize=False (none):
Complete contraction: af,bf,abcde->fcde
Naive scaling: 6
Optimized scaling: 6
Naive FLOP count: 6.399e+07
Optimized FLOP count: 6.399e+07
Theoretical speedup: 1.000
Largest intermediate: 6.400e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
6 abcde,bf,af->fcde fcde->fcde
I picture this as a double matmul with f as the 'batch' dimension, a and b the two sum-of-product dimensions, and cde reducible to one dimension ('going along for the ride').
With optimize=False, the timeit is:
242 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
With greedy:
21.5 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The equivalent double matmul times the same:
In [41]: timeit res=b.T[:,None,:]@(a.T@c.reshape(c.shape[0],-1)).reshape(m,c.shape[1],-1)
22.3 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here res.shape is (100, 1, 64).
I wonder if this is a more straightforward matmul dimension layout?
'cde1ab,fb1->cdefa1; cdefa1,fa1->cdef11'
In [96]: c1 = c.transpose((2,3,4,0,1))[...,None,:,:]
In [97]: a1 = a.T[...,None]; b1 = b.T[...,None]
In [98]: res3 = ((c1@b1)[...,None,:,0]@a1)[...,0,0]
In [99]: res3.shape
Out[99]: (1, 32, 2, 100)
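As a quick sanity check (not part of the timings), res3 should carry the same data as the einsum result, up to an axis permutation:

res_es = np.einsum("af,bf,abcde->fcde", a, b, c)
assert np.allclose(res3.transpose(3, 0, 1, 2), res_es)  # (c,d,e,f) -> (f,c,d,e)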
If I change a and b to m=300, I get a different einsum_path:
In [106]: print(np.einsum_path(
     ...:     abig,
     ...:     [0, 5],
     ...:     bbig,
     ...:     [1, 5],
     ...:     c,
     ...:     [0, 1, 2, 3, 4],
     ...:     [5, 2, 3, 4],
     ...:     optimize="greedy",
     ...:     )[1])
Complete contraction: af,bf,abcde->fcde
Naive scaling: 6
Optimized scaling: 6
Naive FLOP count: 1.920e+08
Optimized FLOP count: 1.920e+08
Theoretical speedup: 1.000
Largest intermediate: 1.920e+04 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
6 abcde,bf,af->fcde
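As an aside, einsum also accepts an explicit, precomputed path, so the pairwise contraction could be forced even here, where greedy declines it (I haven't timed this variant):

res = np.einsum(
    abig, [0, 5], bbig, [1, 5], c, [0, 1, 2, 3, 4], [5, 2, 3, 4],
    optimize=["einsum_path", (0, 2), (0, 1)],  # contract abig with c first, then bbig
)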
In [111]: %%timeit
...: np.einsum(
...: abig,
...: [0, 5],
...: bbig,
...: [1, 5],
...: c,
...: [0, 1, 2, 3, 4],
...: [5, 2, 3, 4],
...: optimize=True,
...: )
...:
464 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The matmul equivalent only jumps 3x, linear in m:
In [112]: a11 = abig.T[...,None]; b11 = bbig.T[...,None]
In [113]: timeit ((c1@b11)[...,None,:,0]@a11)[...,0,0].shape
74.9 ms ± 637 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [114]: a11.shape,b11.shape
Out[114]: ((300, 101, 1), (300, 33, 1))
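Pulling the reshape trick into a function gives a BLAS-only equivalent of A (a sketch along the lines above, output in fcde order; not benchmarked beyond the timings shown):

def contract_matmul(a, b, c):
    # af,bf,abcde->fcde via two matmuls:
    # 1) fold the b,c,d,e axes of c together and contract the a-axis with a
    # 2) contract the b-axis with b as a batched (1,n2)@(n2,cde) matmul
    m = a.shape[1]
    tail = c.shape[2:]                         # the (c, d, e) axes
    tmp = a.T @ c.reshape(c.shape[0], -1)      # sum over a: (m, n2 * cde)
    tmp = tmp.reshape(m, c.shape[1], -1)       # (m, n2, cde)
    out = b.T[:, None, :] @ tmp                # sum over b: (m, 1, cde)
    return out[:, 0, :].reshape((m,) + tail)   # (m, c, d, e) == fcde

# check: np.allclose(contract_matmul(a, b, c),
#                    np.einsum("af,bf,abcde->fcde", a, b, c))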