I have a 1000 x 500 matrix. For each row, I want to compute the dot product of the row vector with itself (which is a scalar), and sum the results over all rows. Right now I am using
import numpy as np
A = np.random.random((1000, 500))
res = sum(A[i].T @ A[i] for i in range(A.shape[0]))
Since this is the performance bottleneck of my algorithm, I wonder if there are faster ways to do this, preferably a pure-NumPy solution.
Testing hpaulj's suggestions from the question comments, and adding a simple alternative, np.sum(A**2), plus two numba functions: single-CPU and parallel.

TL;DR: the fastest overall is the sum_of_squares_numba_parallel function. The fastest NumPy-only solution is np.einsum('ij,ij', A, A).
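For reference, every candidate computes the same scalar, the squared Frobenius norm of A, so they can be cross-checked against the question's loop. A minimal sketch (the norm and trace forms are my additions for verification only, and are not among the timed candidates):

import numpy as np

A = np.random.random((1000, 500))

# All of these compute sum_i A[i] . A[i], i.e. the squared Frobenius norm.
baseline = sum(A[i] @ A[i] for i in range(A.shape[0]))
for name, value in [
        ("np.sum(A**2)", np.sum(A**2)),
        ("np.einsum('ij,ij', A, A)", np.einsum('ij,ij', A, A)),
        ("np.linalg.norm(A)**2", np.linalg.norm(A)**2),
        ("np.trace(A.T @ A)", np.trace(A.T @ A)),
]:
    assert np.isclose(value, baseline), name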
@numba.jit("float64(float64[:,:])")
def sum_of_squares_numba(a):
sum = 0
n_y, n_x = a.shape
for i in range(n_y):
for j in range(n_x):
sum += a[i, j] * a[i, j]
return sum
@numba.jit("float64(float64[:, :])", parallel=True)
def sum_of_squares_numba_parallel(a):
sum = 0
n_rows, n_cols = a.shape
for i in numba.prange(n_rows):
for j in range(n_cols):
sum += a[i, j] * a[i, j]
return sum
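One caveat worth knowing (my note, not from the timings): because explicit signatures are passed to numba.jit, both functions are compiled eagerly when the decorator runs, so calling them in a benchmark measures pure execution rather than compilation. The parallel reduction may also reorder additions, so compare results with a tolerance:

import numpy as np

A = np.random.random((1000, 500))

# The parallel version sums in a different order, so allow rounding slack.
expected = np.einsum('ij,ij', A, A)
assert np.isclose(sum_of_squares_numba(A), expected)
assert np.isclose(sum_of_squares_numba_parallel(A), expected)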
Rough timings using line_profiler show that np.einsum('ij,ij', A, A) is the clear winner among the NumPy-only options, probably because it does not have to materialise the intermediate A*A array in memory and then read it back to compute the sum.
The single-CPU numba version is not far behind, and could probably be tuned further.
sum_of_squares_numba_parallel is the fastest version so far: on my machine it is 3-4 times faster than einsum.
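To reproduce the comparison, a small timeit harness along these lines should work (array size and repeat counts are arbitrary choices of mine):

import timeit

import numpy as np

A = np.random.random((1000, 500))

candidates = {
    "python sum of row dots": lambda: sum(A[i] @ A[i] for i in range(A.shape[0])),
    "np.sum(A**2)": lambda: np.sum(A**2),
    "np.einsum('ij,ij', A, A)": lambda: np.einsum('ij,ij', A, A),
    "numba single CPU": lambda: sum_of_squares_numba(A),
    "numba parallel": lambda: sum_of_squares_numba_parallel(A),
}
for name, fn in candidates.items():
    # Best of 5 runs of 100 calls each, reported per call.
    best = min(timeit.repeat(fn, number=100, repeat=5)) / 100
    print(f"{name:>30}: {best * 1e6:8.1f} us per call")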