Tags: python, python-3.x, numpy, performance

Fastest way to calculate the sum of row-wise "v.T @ v" of a matrix


I have a 1000 x 500 matrix. For each row, I want to calculate the dot product between the transpose of the row vector and the row vector itself (which should be a scalar), and sum over all rows. Right now I am using

import numpy as np

A = np.random.random((1000, 500))
res = sum(A[i].T @ A[i] for i in range(A.shape[0]))

Since this is the performance bottleneck of my algorithm, I wonder if there are faster ways to do this, preferably a pure-NumPy solution.


Solution

  • Testing the suggestions hpaulj made in the question comments, and adding a simple alternative, np.sum(A**2), plus two numba functions: single-CPU and parallel.

    TL;DR: The fastest overall is the sum_of_squares_numba_parallel function. The fastest NumPy-only solution is np.einsum('ij,ij', A, A).

    import numba

    @numba.jit("float64(float64[:, :])")
    def sum_of_squares_numba(a):
        # Plain double loop over all elements, compiled by numba.
        total = 0.0
        n_rows, n_cols = a.shape
        for i in range(n_rows):
            for j in range(n_cols):
                total += a[i, j] * a[i, j]
        return total

    @numba.jit("float64(float64[:, :])", parallel=True)
    def sum_of_squares_numba_parallel(a):
        # Same loop, but prange parallelizes the outer loop;
        # numba recognizes the += on total as a reduction.
        total = 0.0
        n_rows, n_cols = a.shape
        for i in numba.prange(n_rows):
            for j in range(n_cols):
                total += a[i, j] * a[i, j]
        return total
    

    Rough timings using line_profiler show that np.einsum('ij,ij', A, A) is the clear winner among the NumPy-only options, probably because it does not have to materialize the intermediate A*A array in memory and then read it back to compute the sum.
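    The einsum call works because summing the row-wise dot products is the same as summing the squares of every element of A. A minimal check of that equivalence, using the same 1000 x 500 random matrix as the question:

```python
import numpy as np

A = np.random.random((1000, 500))

# The original row-by-row loop from the question ...
loop_result = sum(A[i] @ A[i] for i in range(A.shape[0]))

# ... is the same reduction einsum performs in one pass,
# without materializing an intermediate A*A array.
einsum_result = np.einsum('ij,ij', A, A)

assert np.isclose(loop_result, einsum_result)
```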

    The single-CPU numba version is not far off, and could probably be tuned further.

    sum_of_squares_numba_parallel is the fastest version so far; on my machine it is 3-4 times faster than einsum.
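    A rough way to reproduce the NumPy-only comparison yourself with timeit (absolute numbers are machine-dependent, so only the relative order matters):

```python
import timeit
import numpy as np

A = np.random.random((1000, 500))

# The two NumPy-only candidates discussed above.
candidates = {
    "np.sum(A**2)": lambda: np.sum(A**2),
    "np.einsum('ij,ij', A, A)": lambda: np.einsum('ij,ij', A, A),
}

for name, fn in candidates.items():
    t = timeit.timeit(fn, number=100)
    print(f"{name}: {t:.4f} s for 100 runs")
```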

    (screenshot of benchmark timings omitted)