numpy vectorization array-broadcasting

Faster numpy calculations than reshaping with einsum


Consider the following in Python: A has shape (T,), U has shape (L,T), G has shape (K,T), and Y has shape (L,L,T). My code produces numer1 and numer2 with shape (T, L*K, 1), and numer3 with shape (L*K, 1).

# MWE
import numpy as np

np.random.seed(0)
T = 250
K = 15
L = 20
A = np.random.normal(size=(T))
Y = np.random.normal(size=(L,L,T))
G = np.random.normal(size=(K,T))
U = np.random.normal(size=(L,T))

# Calculations
L, T = U.shape
K = G.shape[0]
Y_transposed = Y.transpose(2, 0, 1)                 # (T, L, L)
sG = np.einsum('it,jt->tij', G, G)                  # (T, K, K)
A_transposed = A[None, :, None].transpose(1, 0, 2)  # (T, 1, 1)
numer1 = np.einsum('ai,ak->aik', U.T, G.T).reshape(T, L * K, 1)
numer2 = numer1 * A_transposed
numer3 = numer2.sum(axis=0)                         # (L*K, 1)

It turns out that np.reshape is very slow in this piece of code. When I say slow, I mean compared to other solutions that are loop-based. Is there a way to avoid the reshape by using einsum differently, given that I take sums in the last step anyway?
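
For reference, a loop-based version along these lines produces the same numer3 (a sketch; the exact loop code being compared against is not shown here):

numer3_loop = np.zeros((L * K, 1))
for t in range(T):
    # outer product of U[:, t] and G[:, t], flattened and weighted by A[t]
    numer3_loop += A[t] * np.outer(U[:, t], G[:, t]).reshape(L * K, 1)
# np.allclose(numer3_loop, numer3) should hold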

A similar approach would also benefit this other piece of code, where the problem is even more pronounced:

denom1 = np.einsum('aij,akl->aikjl', Y_transposed, sG).reshape(T, L * K, L * K)
denom2 = denom1 * A_transposed
denom3 = denom2.sum(axis=0)


Solution

  • You are using einsum to do broadcasted outer products.

    With a small example:

    In [63]: A = np.arange(3); U = np.arange(4*3).reshape(4,3); G=np.arange(5*3).reshape(5,3); Y=np.arange(4*4*3).reshape(4,4,3)
    

    Your numer? calculations (after rerunning the shape setup so L, T, K match these small arrays) time as:

    In [64]: %%timeit 
        ...: A_transposed = A[None, :, None].transpose(1, 0, 2)
        ...: numer1 = np.einsum('ai,ak->aik', U.T, G.T).reshape(T, L * K, 1)
        ...: numer2 = numer1 * A_transposed
        ...: numer3 = numer2.sum(axis=0)
        ...: 
        ...: 
    37.8 µs ± 74.2 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    An equivalent with just broadcasting is modestly faster:

    In [65]: timeit x=((U[:,None,:]*G[None,:,:]).reshape(-1,3)*A).sum(axis=1)
    24.8 µs ± 144 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
    

    They are the same, except for that added trailing dimension:

    In [66]: numer3.shape, x.shape
    Out[66]: ((20, 1), (20,))
    
    In [67]: np.allclose(x, numer3[:,0])
    Out[67]: True
    

    or moving the reshape to the end

    In [73]: timeit y=(U[:,None,:]*G[None,:,:]*A).sum(axis=-1).reshape(-1,1)
    25.5 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
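
    Since the last step is just a sum over t, that sum can also be folded into the einsum itself, so the (T, L*K, 1) intermediate is never formed. A sketch with the same small arrays (not timed here):

    z = np.einsum('it,kt,t->ik', U, G, A).reshape(-1, 1)
    # contracts over t directly; np.allclose(z, numer3) should hold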
    

    edit

    In the previous calculation I did two things: replaced the einsum with a broadcasted multiply, and moved the T dimension to the end, which removes the need for the transposing.

    In [76]: denom1 = np.einsum('aij,akl->aikjl', Y_transposed, sG).reshape(T, L * K, L *K)
        ...: denom2 = denom1 * A_transposed
        ...: denom3 = denom2.sum(axis=0)
    
    In [77]: Y_transposed.shape, sG.shape
    Out[77]: ((3, 4, 4), (3, 5, 5))
    
    In [78]: denom1.shape
    Out[78]: (3, 20, 20)
    
    In [79]: Y.shape, G.shape
    Out[79]: ((4, 4, 3), (5, 3))
    

    Focusing on moving T/t/a to the end:

    In [80]: sg1 = np.einsum('it,jt->ijt', G, G)
    

    With the t dimension last, as in [80], the broadcasting against A works just as before.

    In [82]: denom4 = (np.einsum('ija,kla->ikjla', Y, sg1)*A).sum(axis=-1)
    
    In [83]: denom4.shape
    Out[83]: (4, 5, 4, 5)
    

    There are some more dimensions in [82] but the same idea applies.
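
    For completeness, the einsum in [82] can itself be replaced by broadcasting along the same lines. A sketch, using the same small arrays and the sg1 from [80]:

    denom5 = (Y[:, None, :, None, :] * sg1[None, :, None, :, :] * A).sum(axis=-1)
    # same (4, 5, 4, 5) shape as denom4 before its reshape;
    # np.allclose(denom5.reshape(L*K, L*K), denom3) should hold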

    Now do the reshape and check for equality:

    In [84]: denom3.shape
    Out[84]: (20, 20)

    In [85]: denom4 = denom4.reshape(L*K, L*K)

    In [86]: np.allclose(denom3, denom4)
    Out[86]: True
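
    Putting it together on the question's full-size arrays (T=250, L=20, K=15), the same reshape-at-the-end idea would look like the sketch below; numer3 and denom3 are the originals from the question, and the _alt names are only illustrative:

    numer3_alt = (U[:, None, :] * G[None, :, :] * A).sum(axis=-1).reshape(L * K, 1)
    # for the denominator, fold the sum over t into a single einsum,
    # so no (T, L*K, L*K) array is ever built
    denom3_alt = np.einsum('ijt,kt,lt,t->ikjl', Y, G, G, A).reshape(L * K, L * K)
    # np.allclose(numer3_alt, numer3) and np.allclose(denom3_alt, denom3) should both hold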