I am trying to calculate a vectorised nested sum of the form
S_k = sum_{i=1}^{N} V[k,i] * sum_{j=i+1}^{N} W[k,j]
(so effectively doing a separate calculation for each row k).
The fastest way I have come up with is to define a strictly lower-triangular matrix of ones to take the range of the inner sum into account,
O = np.tril(np.ones((N, N), dtype=np.uint8), -1)
and then evaluate the inner sum over the full range j = 1..N, which allows the use of a dot product:
np.einsum('ij,ij->i',V,np.dot(W,O))
This works well, but even with a uint8 data type, O takes N² bytes of memory, which becomes prohibitive for N >> 10000. Is there a better way using standard numpy/scipy functions? My plan is to run this on a GPU using cupy.
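For concreteness, a small self-contained sketch of this approach, checked against the nested sum written as explicit loops (the random V and W are only placeholders for the real data):

    import numpy as np

    N = 6
    rng = np.random.default_rng(0)
    V = rng.random((N, N))
    W = rng.random((N, N))

    # Strictly lower-triangular mask: O[j, i] = 1 exactly when j > i
    O = np.tril(np.ones((N, N), dtype=np.uint8), -1)
    # (W @ O)[k, i] = sum_{j > i} W[k, j]
    result = np.einsum('ij,ij->i', V, W @ O)

    # Reference: the nested sum as explicit loops
    expected = np.array([sum(V[k, i] * W[k, i + 1:].sum() for i in range(N))
                         for k in range(N)])
    print(np.allclose(result, expected))   # True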
As suggested by @hpaulj, you can use cumsum to get rid of O completely: the inner sum over W can be written as the total row sum minus the cumulative sum.
inner = np.sum(W, axis=1)[:, None] - np.cumsum(W, axis=1)   # inner[k, i] = sum_{j > i} W[k, j]
np.einsum('ij,ij->i', V, inner)
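For completeness, a self-contained sketch (the variable names mask and inner are mine) checking that the cumsum version matches the mask-based version:

    import numpy as np

    N = 1000
    rng = np.random.default_rng(1)
    V = rng.random((N, N))
    W = rng.random((N, N))

    # Mask-based version: needs an extra N x N uint8 matrix
    mask = np.tril(np.ones((N, N), dtype=np.uint8), -1)
    ref = np.einsum('ij,ij->i', V, W @ mask)

    # cumsum-based version: total row sum minus running sum gives sum_{j > i} W[k, j]
    inner = np.sum(W, axis=1)[:, None] - np.cumsum(W, axis=1)
    out = np.einsum('ij,ij->i', V, inner)

    print(np.allclose(ref, out))   # True up to floating-point rounding

Since cupy provides ones, cumsum, sum and einsum, the cumsum version should port to the GPU by swapping np for cupy, although I have not benchmarked that here.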