python, numpy, cython, linear-algebra, numba

Computing the spectral norms of ~1m Hermitian matrices: `numpy.linalg.norm` is too slow


I would like to calculate the spectral norms of N 8x8 Hermitian matrices, with N being close to 1E6. As an example, take these 1 million random complex 8x8 matrices:

import numpy as np

# dimension arguments must be ints; 1e6 is a float and raises a TypeError
array = np.random.rand(8, 8, 1000000) + 1j*np.random.rand(8, 8, 1000000)

It currently takes me almost 10 seconds using numpy.linalg.norm:

np.linalg.norm(array, ord=2, axis=(0,1))

I tried using the Cython code below, but this gave me only a negligible performance improvement:

import numpy as np
cimport numpy as np
cimport cython

np.import_array()

DTYPE = np.complex64

@cython.boundscheck(False)
@cython.wraparound(False)
def function(np.ndarray[np.complex64_t, ndim=3] Array):
    assert Array.dtype == DTYPE
    cdef int shape0 = Array.shape[2]
    # typed result buffer, one norm per matrix
    cdef np.ndarray[np.float32_t, ndim=1] normarray = np.zeros(shape0, dtype=np.float32)
    # all the real work still happens inside NumPy/LAPACK, so this
    # wrapper cannot be faster than calling np.linalg.norm directly
    normarray = np.linalg.norm(Array, ord=2, axis=(0, 1))
    return normarray

I also tried numba and some other scipy functions (such as scipy.linalg.svdvals) to calculate the singular values of these matrices. Everything is still too slow.
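
For reference, a numba version of this computation might look roughly like the following sketch (not my exact code; `spectral_norms` is an illustrative name, and it relies on numba's nopython-mode support for np.linalg.svd):

import numba
import numpy as np

@numba.njit(parallel=True)
def spectral_norms(arr):
    # arr has shape (N, 8, 8); process the matrices in parallel
    out = np.empty(arr.shape[0])
    for i in numba.prange(arr.shape[0]):
        # numba's np.linalg.svd returns (u, s, vh), with the
        # singular values in s sorted in descending order
        u, s, vh = np.linalg.svd(arr[i])
        out[i] = s[0]
    return out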

Is it not possible to make this any faster? Is numpy already optimized to the extent that no speed gains are possible by using Cython or numba? Or is my code highly inefficient and I am doing something fundamentally wrong?

I noticed that only two of my CPU cores are 100% utilized while doing the calculation. With that in mind, I looked at several previous StackOverflow questions, but unfortunately I still don't have a solution.

I considered splitting my array into smaller chunks and processing these in parallel (perhaps on the GPU using CUDA). Is there a way within numpy/Python to do this? I don't yet know where the bottleneck in my code is, i.e. whether it is CPU-bound or memory-bound, or something else entirely. A rough sketch of the chunked approach I have in mind is below.
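
This sketch (`chunked_norms` is just an illustrative name) assumes that NumPy's LAPACK-backed routines release the GIL, so that a thread pool can overlap the per-chunk work; if that assumption fails, a process pool would be needed instead:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_norms(arr, n_workers=4):
    # arr has shape (8, 8, N), as in the example above;
    # split it into n_workers chunks along the last axis
    chunks = np.array_split(arr, n_workers, axis=2)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(
            lambda c: np.linalg.norm(c, ord=2, axis=(0, 1)), chunks)
    return np.concatenate(list(results))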


Solution

  • Digging into the code for np.linalg.norm, I've deduced that for these parameters it finds, for each of the N matrices, the maximum of that matrix's singular values.

    First generate a small sample array. Make N the first dimension to eliminate a rollaxis operation:

    In [268]: N=10; A1 = np.random.rand(N,8,8)+1j*np.random.rand(N,8,8)
    
    In [269]: np.linalg.norm(A1,ord=2,axis=(1,2))
    Out[269]: 
    array([ 5.87718306,  5.54662999,  6.15018125,  5.869058  ,  5.80882818,
            5.86060462,  6.04997992,  5.85681085,  5.71243196,  5.58533323])
    

    The equivalent operation:

    In [270]: np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
    Out[270]: 
    array([ 5.87718306,  5.54662999,  6.15018125,  5.869058  ,  5.80882818,
            5.86060462,  6.04997992,  5.85681085,  5.71243196,  5.58533323])
    

    Same values, and essentially the same time:

    In [271]: timeit np.linalg.norm(A1,ord=2,axis=(1,2))
    1000 loops, best of 3: 398 µs per loop
    In [272]: timeit np.amax(np.linalg.svd(A1,compute_uv=0),axis=-1)
    1000 loops, best of 3: 389 µs per loop
    

    And most of the time is spent in svd, which produces an (N,8) array of singular values:

    In [273]: timeit np.linalg.svd(A1,compute_uv=0)
    1000 loops, best of 3: 366 µs per loop
    

    So if you want to speed up the norm, you have to look further into speeding up this svd. svd uses np.linalg._umath_linalg functions, which live in a compiled extension (a .so file).

    The C code is at https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/linalg/umath_linalg.c.src

    It sure looks like this is the fastest you'll get. There's no Python-level loop; any looping happens in that C code, or in the LAPACK function it calls.
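
    One further avenue worth a try, offered as a suggestion rather than something benchmarked here: the question says the matrices are Hermitian, and for a Hermitian matrix the singular values are the absolute values of its (real) eigenvalues, so the spectral norm is the largest absolute eigenvalue. np.linalg.eigvalsh accepts stacked (N,8,8) input and may well beat a full svd. A minimal sketch, assuming the real data is genuinely Hermitian (the random test arrays above are not):

        import numpy as np

        def hermitian_spectral_norms(arr):
            # arr has shape (N, 8, 8); each 8x8 slice is assumed Hermitian.
            # For a Hermitian matrix the singular values equal the absolute
            # eigenvalues, so the spectral norm is max(|eigenvalue|).
            eigs = np.linalg.eigvalsh(arr)   # shape (N, 8), real, ascending
            return np.abs(eigs).max(axis=-1)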