Using python/numpy, I have the following `np.einsum` call:

```python
np.einsum('abde,abc->bcde', X, Y)
```

`Y` is sparse: for each `[a, b]`, exactly one entry along `c` is 1 and all the others are 0. To give a sense of the relative sizes of the axes, `X.shape` is on the order of `(1000, 5, 30, 30)`, and `Y.shape` is correspondingly `(1000, 5, 300)`.
This operation is extremely costly, and I want to make it more performant. For one thing, `einsum` is not parallelized. For another, because `Y` is sparse, I'm effectively computing 300x the number of multiplications I should be doing. In fact, when I wrote the equivalent of this einsum using a loop over `n`, I got a speed-up of around 3x. But that's clearly still not very good.
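Roughly, the idea is to look up the single nonzero `c` for each `[a, b]` and add only that `(d, e)` block. A sketch of that kind of loop (not necessarily the fastest way to write it; it assumes `X` and `Y` are as above):

```python
import numpy as np

# Exploit that Y[i, j] is one-hot: find the single nonzero c instead of
# multiplying through all 300 values of c.
K = np.argmax(Y, axis=2)  # shape (a, b): the one c where Y[i, j, c] == 1
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        result[j, K[i, j]] += X[i, j]  # add one (d, e) block in place
```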
How should I approach making this more performant? I've tried `np.tensordot`, but I could not figure out how to get what I want from it (and I still run into the sparse/dense problem).
If `Y` only contains 1s and 0s, then the einsum basically does this:

```python
# Scatter each (d, e) block of X into the (j, k) slot selected by Y.
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
result[J, K] += X[I, J]
```
But this doesn't give the correct result, because of duplicate `(j, k)` index pairs: with buffered fancy indexing, each repeated index is only added once. I couldn't get `numpy.add.at` to work, but a loop over just these indices is still pretty fast, at least for the given shapes and sparsity.
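For reference, the unbuffered variant would be spelled something like this; `np.add.at` sums duplicate indices rather than overwriting them, though it tends to be slow for this access pattern:

```python
# Unbuffered accumulation: duplicate (j, k) pairs are summed correctly,
# unlike the buffered fancy-index += above.
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
I, J, K = np.nonzero(Y)
np.add.at(result, (J, K), X[I, J])
```

The explicit loop version: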
```python
result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]  # one (d, e) block per nonzero entry of Y
```
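Because each `[a, b]` pair has exactly one nonzero `c`, `np.nonzero(Y)` yields only `a * b = 5000` triples, and each iteration adds a whole `(30, 30)` block, so the Python-level loop overhead stays small relative to the arithmetic.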
This is the test code that I used:

```python
import numpy as np

a, b, c, d, e = 1000, 5, 300, 30, 30
X = np.random.randint(10, size=(a, b, d, e))

# Build a one-hot Y: for each (a, b), pick a random c and set Y[a, b, c] = 1.
R = np.random.rand(a, b, c)
K = np.argmax(R, axis=2)
I, J = np.indices((a, b), sparse=True)
Y = np.zeros((a, b, c), int)
Y[I, J, K] = 1
```
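And a quick sanity check that the loop reproduces the original einsum:

```python
# The loop-based accumulation should match the full dense contraction.
expected = np.einsum('abde,abc->bcde', X, Y)

result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
for i, j, k in zip(*np.nonzero(Y)):
    result[j, k] += X[i, j]

assert np.array_equal(result, expected)
```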