Tags: python, numpy, sparse-matrix, numpy-einsum, tensordot

Translating np.einsum to something more performant


Using python/numpy, I have the following np.einsum:

np.einsum('abde,abc->bcde', X, Y)

Y is sparse: for each [a, b], exactly one c has Y[a, b, c] == 1, and all other entries are 0. To give a sense of the relative sizes of the axes, X.shape is on the order of (1000, 5, 30, 30), and Y.shape is correspondingly (1000, 5, 300).
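
Concretely, the einsum computes out[b, c, d, e] as the sum over a of X[a, b, d, e] * Y[a, b, c]. For reference, a plain-broadcasting equivalent (illustrative only; it materializes the full (a, b, c, d, e) intermediate, so it is even slower than the einsum):

import numpy as np

# Reference semantics only: out[b, c, d, e] = sum over a of X[a, b, d, e] * Y[a, b, c].
# Broadcasting X as (a, b, 1, d, e) against Y as (a, b, c, 1, 1) builds the full
# (a, b, c, d, e) product before reducing over a -- slow, but unambiguous.
ref = (X[:, :, None, :, :] * Y[:, :, :, None, None]).sum(axis=0)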

This operation is extremely costly, and I want to make it more performant. For one thing, einsum is not parallelized. For another, because Y is sparse, I'm effectively computing 300x as many multiplications as I should be. In fact, when I wrote the equivalent of this einsum using an explicit loop, I got a speed-up of around 3x (a sketch of what I mean is below). But that's clearly still not very good.
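
Something like this sketch (a reconstruction of the idea, not my exact code; it loops over the sparse c axis one slice at a time):

# Hypothetical reconstruction of the loop version, one c slice at a time.
# Each Y[:, :, c] slice is an (a, b) mask containing only a handful of ones.
out = np.zeros((X.shape[1], Y.shape[2]) + X.shape[2:], X.dtype)  # (b, c, d, e)
for c in range(Y.shape[2]):
    out[:, c] = np.einsum('abde,ab->bde', X, Y[:, :, c])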

How should I approach making this more performant? I've tried np.tensordot, but I could not figure out how to get what I want from it, and I still run into the sparse/dense problem; a sketch of where it breaks down follows.
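
As far as I can tell, the closest tensordot gets is this (a sketch): tensordot has no way to keep b as a shared, non-contracted axis, so contracting over a produces every (b, b') pair, and a diagonal still has to be taken afterwards, wasting memory and work.

# Sketch: why tensordot alone does not fit this contraction.
t = np.tensordot(X, Y, axes=([0], [0]))  # shape (b, d, e, b, c): every (b, b') pair
out = np.einsum('bdebc->bcde', t)        # keep only the b == b' diagonal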


Solution

  • If Y only contains 1s and 0s, then the einsum basically does this:

    result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
    I, J, K = np.nonzero(Y)
    # Caution: fancy-indexed += applies each duplicate (J, K) pair only once -- see below.
    result[J, K] += X[I, J]
    

    But this doesn't give the correct result, because duplicate (j, k) index pairs are applied only once by the fancy-indexed +=. I couldn't get numpy.add.at to work, but a loop over just these indices is still pretty fast, at least for the given shapes and sparsity. (A sketch of the add.at form follows the loop below.)

    result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
    # A Python loop over the nonzero triples accumulates duplicates correctly:
    for i, j, k in zip(*np.nonzero(Y)):
        result[j, k] += X[i, j]
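
    For reference, the numpy.add.at form would presumably look like this (np.add.at performs unbuffered in-place addition, so duplicate (j, k) pairs do accumulate correctly, though the call is often surprisingly slow):

    # np.add.at is unbuffered, so repeated (j, k) pairs accumulate correctly,
    # unlike the fancy-indexed += above.
    result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
    I, J, K = np.nonzero(Y)
    np.add.at(result, (J, K), X[I, J])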
    

    This is the test code that I used; it builds a Y with exactly one 1 per (a, b):

    a, b, c, d, e = 1000, 5, 300, 30, 30
    X = np.random.randint(10, size=(a, b, d, e))
    # Pick one random c index per (a, b) pair, then set exactly that entry to 1:
    R = np.random.rand(a, b, c)
    K = np.argmax(R, axis=2)
    I, J = np.indices((a, b), sparse=True)
    Y = np.zeros((a, b, c), int)
    Y[I, J, K] = 1
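
    A quick sanity check of the loop version against the original einsum, using the arrays above:

    expected = np.einsum('abde,abc->bcde', X, Y)
    result = np.zeros(Y.shape[1:] + X.shape[2:], X.dtype)
    for i, j, k in zip(*np.nonzero(Y)):
        result[j, k] += X[i, j]
    assert np.array_equal(result, expected)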