[SOLVED] More efficient nested sum in numpy

I am trying to calculate a vectorised nested sum

$\sum_{i=1}^{N-1}V_{ki}\sum_{j=i+1}^{N}W_{kj}$

(so effectively doing a separate calculation for each row k)

The fastest way I have come up with is to define a lower triangular matrix of ones to take into account the range on the inner sum

O = np.tril(np.ones((N,N),uint8),-1)

and then evaluate the inner sum with range j=1..N, allowing the use of a dot product

np.einsum('ij,ij->i',V,np.dot(W,O))

This works well, but even with uint8 data type, O requires a lot of memory for N>>10000. Is there a better way using standard numpy/scipy functions? My plan is to run this on a GPU using cupy.

As suggested by @hpaulj, you can use cumsum to completely get rid of O. YOu can write sum W as total sum - cumsum.

O = np.sum(W,axis=1)[:,None]-np.cumsum(W,axis=1)
np.einsum('ij,ij->i',V,O)