python, numpy, scipy, sparse-matrix

Improve speed of computations using Scipy sparse arrays


I am using SciPy sparse matrices to compute choice probabilities used in an estimation. Below is an example. How could it be made faster?

Variables and parameters:

import itertools
import numpy as np
import scipy as sp
import timeit

T = 70000
N = 200
p = 212

X = sp.sparse.random(T, (N)*p, density=0.018, format='csr')
B = np.random.rand(p,)
I = sp.sparse.eye(N,format="csr")

Function that computes the choice probabilities:

def csr_example(X,B,I,tr):
    tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
    exp_xb = X@tem0
    exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
    exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
    exp_xb0 = np.reshape(exp_xb0, (tr, 1))
    pro = exp_xb/exp_xb0
    return pro
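
In words (my reading of the code, not stated explicitly above): writing x_{t,n} for the p attributes of alternative n in observation t (columns n*p to (n+1)*p of row t of X), the function computes logit-style choice probabilities

    pro[t, n] = exp(x_{t,n} · B) / Σ_m exp(x_{t,m} · B)

except that exp is applied only to the stored entries of the sparse product, so structurally absent scores stay at 0 rather than becoming exp(0) = 1.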

Execution time:

print(timeit.timeit(lambda: csr_example(X,B,I,T), setup="pass",number=10))

Around 15 seconds


Solution

  • Yes, you can make this about 3x faster.

    I started my analysis of your program by running it under line_profiler.
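
    A minimal sketch of producing such a per-line profile with the line_profiler package (assuming it is installed; the variables are the ones defined in the question):

    from line_profiler import LineProfiler

    lp = LineProfiler()
    profiled = lp(csr_example)   # wrap the function so each line is timed
    profiled(X, B, I, T)         # run it once under the profiler
    lp.print_stats()             # prints a per-line table like the one below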

    Timer unit: 1e-09 s
    
    Total time: 0.688771 s
    File: /tmp/ipykernel_4391/4040554224.py
    Function: csr_example at line 1
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
         1                                           def csr_example(X,B,I,tr):
         2         1    2630726.0    3e+06      0.4      tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
         3         1  574464285.0    6e+08     83.4      exp_xb = X@tem0
         4         1   38584926.0    4e+07      5.6      exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
         5         1    3826251.0    4e+06      0.6      exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
         6         1      13541.0  13541.0      0.0      exp_xb0 = np.reshape(exp_xb0, (tr, 1))
         7         1   69250761.0    7e+07     10.1      pro = exp_xb/exp_xb0
         8         1        287.0    287.0      0.0      return pro
    

    This shows that 83% of the program's time is spent doing a matrix multiply between X and tem0. SciPy's sparse matrix multiply is already heavily optimized for arbitrary matrices, so you can't go much faster unless you can exploit some property of the input matrices.

    Helpfully, the right-hand side of this multiply is the transposed Kronecker product of the identity matrix and B, a 1D array.

    If B were set to [b1, b2, b3], this transposed Kronecker product would look like this:

    b1  0  0
    b2  0  0
    b3  0  0
    0  b1  0
    0  b2  0
    0  b3  0
    0  0  b1
    0  0  b2
    0  0  b3
    

    This has some nice properties:

      • Each column holds only p nonzero entries, namely a copy of B, and those entries sit in one contiguous block of rows.
      • Consequently, column j of X @ (I ⊗ B)ᵀ is just X[:, j*p:(j+1)*p] @ B, so the Kronecker product never has to be formed explicitly.
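
    A quick dense-NumPy check of this block structure (a small standalone sketch with arbitrary dimensions, separate from the benchmark code below):

    import numpy as np

    N, p, T = 3, 4, 5
    X_small = np.random.rand(T, N * p)
    B_small = np.random.rand(p)
    I_small = np.eye(N)

    full = X_small @ np.kron(I_small, B_small).T   # explicit Kronecker product, shape (T, N)
    blocks = X_small.reshape(T, N, p) @ B_small    # same result, one block of p columns at a time
    print(np.allclose(full, blocks))               # True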

    With this in mind, I wrote a matrix multiplication function that computes X @ (I ⊗ B)ᵀ without explicitly forming I ⊗ B.

    import numpy as np
    import scipy as sp
    import timeit
    import numba as nb
    
    
    def csr_example(X,B,I,tr):
        tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
        exp_xb = X@tem0
        exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
        exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
        exp_xb0 = np.reshape(exp_xb0, (tr, 1))
        pro = exp_xb/exp_xb0
        return pro
    
        
    def naive_mm(X, B, I):
        # Reference implementation: explicitly form kron(I, B).T and multiply
        tem0 = sp.sparse.kron(I, B, format="csr").transpose()
        exp_xb = X @ tem0
        return exp_xb
    
    
    @nb.njit()
    def numba_mm_inner(Xindptr, Xindices, Xdata, B, N, out):
        p = B.shape[0]
        # Loop over rows
        for i in range(len(Xindptr) - 1):
            # Get all column numbers and data for this row
            ind_begin = Xindptr[i]
            ind_end = Xindptr[i + 1]
            cols = Xindices[ind_begin:ind_end]
            data = Xdata[ind_begin:ind_end]
            # Where is the beginning of B located with respect to
            # the beginning of X?
            col_offset = 0
            # The column of the output array to write to
            out_col = 0
            # The running total for this element
            sum_ = 0
            for j in range(len(cols)):
                in_col = cols[j]
                while in_col >= col_offset + p:
                    # None of the remaining X values will be relevant
                    # to this output column, so flush the running sum
                    out[i, out_col] = sum_
                    sum_ = 0
                    col_offset += p
                    out_col += 1
                
                assert in_col >= col_offset
                sum_ += data[j] * (B[in_col - col_offset])
            # Write any remaining sum
            out[i, out_col] = sum_
    
        
    def numba_mm(X, B, N):
        # Computes X @ kron(I_N, B).T without forming the Kronecker product.
        # Assumes the column indices within each CSR row are sorted
        # (call X.sort_indices() first if that is not guaranteed).
        X = X.tocsr()
        out = np.zeros((X.shape[0], N))
        assert X.shape[0] == out.shape[0]
        assert X.shape[1] == N * B.shape[0]
        numba_mm_inner(X.indptr, X.indices, X.data, B, N, out)
        return out
    
    
    def csr_opt(X,B,N,tr):
        exp_xb = numba_mm(X, B, N)
        # Apply exp only to nonzero entries, matching the sparse version,
        # which exponentiates only the stored values (zeros stay zero)
        np.exp(exp_xb, out=exp_xb, where=exp_xb != 0)
        exp_xb0 = np.sum(exp_xb,axis=1)
        exp_xb0 = np.reshape(exp_xb0, (tr, 1))
        pro = exp_xb/exp_xb0
        return pro
    
    
    def main():
        T = 30000
        N = 200
        p = 212
    
        X = sp.sparse.random(T, (N)*p, density=0.018, format='csr')
        B = np.random.rand(p,)  
        I = sp.sparse.eye(N,format="csr")
    
        old = csr_example(X,B,I,T).toarray()
        new = csr_opt(X,B,N,T)
        print("Methods equal:", np.allclose(old, new))
        time_old = timeit.timeit(lambda: csr_example(X,B,I,T),number=20)
        time_new = timeit.timeit(lambda: csr_opt(X,B,N,T),number=20)
        print(f"old {time_old:.4f}")
        print(f"new {time_new:.4f}")
        print(f"ratio {time_old / time_new:.4f}")
    
    
    main()
    

    Benchmark:

    Methods equal: True
    old 13.1517
    new 4.3569
    ratio 3.0186
    

    Notes:

      • csr_opt returns a dense NumPy array rather than a sparse matrix; wrap it in sp.sparse.csr_matrix(...) if downstream code needs a sparse result.
      • np.exp is applied only where the product is nonzero so that structural zeros stay at 0, matching the sparse version, which exponentiates only the stored values.
      • numba_mm_inner assumes the column indices within each CSR row are sorted; call X.sort_indices() beforehand if that is not guaranteed.
      • The benchmark script uses T = 30000 rather than the question's 70000.