Tags: python, python-3.x, numpy, performance

Fastest way to calculate the sum of row-wise "v.T @ v" of a matrix


I have a 1000 x 500 matrix. For each row, I want to calculate the dot product between the transpose of the row vector and the row vector itself (which should be a scalar), and sum over all rows. Right now I am using

import numpy as np

A = np.random.random((1000, 500))
res = sum(A[i].T @ A[i] for i in range(A.shape[0]))

Since this is the performance bottleneck of my algorithm, I wonder if there are faster ways to do this, preferably a pure-NumPy solution.


Solution

  • Testing the suggestions hpaulj made in the question comments, and adding a simple alternative, np.sum(A**2), plus two numba functions: single-CPU and parallel.

    TL;DR: The fastest overall is the sum_of_squares_numba_parallel function. The fastest NumPy-only solution is np.einsum('ij,ij', A, A).

    import numba

    @numba.jit("float64(float64[:, :])")
    def sum_of_squares_numba(a):
        # Plain double loop over all elements, compiled by numba.
        total = 0.0
        n_rows, n_cols = a.shape
        for i in range(n_rows):
            for j in range(n_cols):
                total += a[i, j] * a[i, j]
        return total

    @numba.jit("float64(float64[:, :])", parallel=True)
    def sum_of_squares_numba_parallel(a):
        # Same loop, but prange parallelizes the outer loop;
        # numba recognizes the += on total as a reduction.
        total = 0.0
        n_rows, n_cols = a.shape
        for i in numba.prange(n_rows):
            for j in range(n_cols):
                total += a[i, j] * a[i, j]
        return total
    

    Rough timings using line_profiler show that np.einsum('ij,ij', A, A) is the clear winner among the NumPy-only options, probably because it does not have to materialize the intermediate A*A array in memory and then read it back to compute the sum.
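    The einsum call works because summing the row-wise dot products is the same as summing the squares of every element of A. A minimal check of that equivalence, using the same 1000 x 500 random matrix as the question:

```python
import numpy as np

A = np.random.random((1000, 500))

# The original row-by-row loop from the question ...
loop_result = sum(A[i] @ A[i] for i in range(A.shape[0]))

# ... is the same reduction einsum performs in one pass,
# without materializing an intermediate A*A array.
einsum_result = np.einsum('ij,ij', A, A)

assert np.isclose(loop_result, einsum_result)
```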

    The single-CPU numba version is not far off, and could probably be tuned further.

    sum_of_squares_numba_parallel is the fastest version so far; on my machine it is 3-4 times faster than einsum.
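    A rough way to reproduce the NumPy-only comparison yourself with timeit (absolute numbers are machine-dependent, so only the relative order matters):

```python
import timeit
import numpy as np

A = np.random.random((1000, 500))

# The two NumPy-only candidates discussed above.
candidates = {
    "np.sum(A**2)": lambda: np.sum(A**2),
    "np.einsum('ij,ij', A, A)": lambda: np.einsum('ij,ij', A, A),
}

for name, fn in candidates.items():
    t = timeit.timeit(fn, number=100)
    print(f"{name}: {t:.4f} s for 100 runs")
```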

    (screenshot of benchmark timings omitted)