I am using scipy sparse matrices to compute choice probabilities used in an estimation. Below is an example. How could it be faster?
Variables and parameters:
import itertools
import numpy as np
import scipy as sp
import timeit
T = 70000
N = 200
p = 212
X = sp.sparse.random(T, (N)*p, density=0.018, format='csr')
B = np.random.rand(p,)
I = sp.sparse.eye(N,format="csr")
Function that computes the choice probabilities:
def csr_example(X,B,I,tr):
    tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
    exp_xb = X@tem0
    exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
    exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
    exp_xb0 = np.reshape(exp_xb0, (tr, 1))
    pro = exp_xb/exp_xb0
    return pro
Execution time:
print(timeit.timeit(lambda: csr_example(X,B,I,T), setup="pass",number=10))
Around 15 seconds
Yes, you can make this about 3x faster.
I started my analysis of your program by running it under line profiler.
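For reference, this kind of per-line profile comes from the line_profiler package. One way to reproduce it, assuming line_profiler is installed and you are in an IPython/Jupyter session (which the /tmp/ipykernel path below suggests), is:

%load_ext line_profiler
%lprun -f csr_example csr_example(X, B, I, T)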
Timer unit: 1e-09 s
Total time: 0.688771 s
File: /tmp/ipykernel_4391/4040554224.py
Function: csr_example at line 1
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def csr_example(X,B,I,tr):
     2         1    2630726.0    3e+06      0.4      tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
     3         1  574464285.0    6e+08     83.4      exp_xb = X@tem0
     4         1   38584926.0    4e+07      5.6      exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
     5         1    3826251.0    4e+06      0.6      exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
     6         1      13541.0  13541.0      0.0      exp_xb0 = np.reshape(exp_xb0, (tr, 1))
     7         1   69250761.0    7e+07     10.1      pro = exp_xb/exp_xb0
     8         1        287.0    287.0      0.0      return pro
This shows that 83% of the program's time is spent in the sparse matrix multiply between X and tem0. SciPy's general-purpose sparse matrix multiply is already well optimized, so you can't go much faster unless you exploit some special structure in the input matrices.
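You can confirm this outside the profiler by timing the multiply on its own. A quick check, using the variables defined in the question (not part of the original code):

tem0 = sp.sparse.kron(I, B, format="csr").transpose()
print(timeit.timeit(lambda: X @ tem0, number=10))  # dominates the total runtime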
Helpfully, the right-hand side of this multiply is the transposed Kronecker product of the identity matrix and B, a 1D array.
If B were set to [b0, b1, b2] and I were a 3x3 identity, this transposed Kronecker product would look like this:
b0  0  0
b1  0  0
b2  0  0
 0 b0  0
 0 b1  0
 0 b2  0
 0  0 b0
 0  0 b1
 0  0 b2
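You can see this block structure directly with a tiny example (a quick illustration with made-up values, not part of the original code):

import numpy as np
import scipy as sp

b = np.array([1.0, 2.0, 3.0])        # small stand-in for the p-vector B
I3 = sp.sparse.eye(3, format="csr")  # small stand-in for the N x N identity
print(sp.sparse.kron(I3, b, format="csr").transpose().toarray())
# Column j holds a copy of b in rows 3*j .. 3*j + 2 and zeros everywhere else.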
This structure has a nice property: each nonzero element of X at position (row, col) contributes B[col % p] * X[row, col] to the output element at (row, col // p), and it contributes nothing anywhere else. With this in mind, I wrote a matrix multiplication function which computes X @ (I ⊗ B).T without ever forming (I ⊗ B).T explicitly.
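To make that identity concrete, here is a minimal (and slow) column-block sketch of the same product; blockwise_mm is purely illustrative and is not part of the benchmarked code, which uses the numba version below:

def blockwise_mm(X, B, N):
    # Illustration only: computes X @ kron(I_N, B).T one output column at a time.
    p = B.shape[0]
    out = np.empty((X.shape[0], N))
    for j in range(N):
        # Output column j only depends on columns j*p .. (j+1)*p - 1 of X.
        out[:, j] = X[:, j*p:(j+1)*p] @ B
    return out

The full optimized code: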
import numpy as np
import scipy as sp
import timeit
import numba as nb
def csr_example(X,B,I,tr):
    tem0 = sp.sparse.kron(I, B, format = "csr").transpose()
    exp_xb = X@tem0
    exp_xb.data = np.exp( exp_xb.data, out=exp_xb.data)
    exp_xb0 = sp.sparse.csr_matrix.sum(exp_xb,axis=1)
    exp_xb0 = np.reshape(exp_xb0, (tr, 1))
    pro = exp_xb/exp_xb0
    return pro
def naive_mm(X, B, N):
    # Reference implementation: build the Kronecker product explicitly.
    I = sp.sparse.eye(N, format="csr")
    tem0 = sp.sparse.kron(I, B, format="csr").transpose()
    exp_xb = X @ tem0
    return exp_xb
@nb.njit()
def numba_mm_inner(Xindptr, Xindices, Xdata, B, N, out):
    p = B.shape[0]
    # Loop over rows
    for i in range(len(Xindptr) - 1):
        # Get all column numbers and data for this row
        ind_begin = Xindptr[i]
        ind_end = Xindptr[i + 1]
        cols = Xindices[ind_begin:ind_end]
        data = Xdata[ind_begin:ind_end]
        # Where is the beginning of B located with respect to
        # the beginning of X?
        col_offset = 0
        # The column of the output array to write to
        out_col = 0
        # The running total for this element
        sum_ = 0
        for j in range(len(cols)):
            in_col = cols[j]
            while not in_col < col_offset + p:
                # None of the remaining X values will be relevant
                # to this column
                out[i, out_col] = sum_
                sum_ = 0
                col_offset += p
                out_col += 1
            assert in_col >= col_offset
            sum_ += data[j] * (B[in_col - col_offset])
        # Write any remaining sum
        out[i, out_col] = sum_
def numba_mm(X, B, N):
    X = X.tocsr()
    out = np.zeros((X.shape[0], N))
    assert X.shape[0] == out.shape[0]
    assert X.shape[1] == N * B.shape[0]
    numba_mm_inner(X.indptr, X.indices, X.data, B, N, out)
    return out
def csr_opt(X,B,N,tr):
    exp_xb = numba_mm(X, B, N)
    np.exp(exp_xb, out=exp_xb, where=exp_xb != 0)
    exp_xb0 = np.sum(exp_xb,axis=1)
    exp_xb0 = np.reshape(exp_xb0, (tr, 1))
    pro = exp_xb/exp_xb0
    return pro
def main():
    T = 30000
    N = 200
    p = 212
    X = sp.sparse.random(T, (N)*p, density=0.018, format='csr')
    B = np.random.rand(p,)
    I = sp.sparse.eye(N,format="csr")
    old = csr_example(X,B,I,T).toarray()
    new = csr_opt(X,B,N,T)
    print("Methods equal:", np.allclose(old, new))
    time_old = timeit.timeit(lambda: csr_example(X,B,I,T),number=20)
    time_new = timeit.timeit(lambda: csr_opt(X,B,N,T),number=20)
    print(f"old {time_old:.4f}")
    print(f"new {time_new:.4f}")
    print(f"ratio {time_old / time_new:.4f}")

main()
Benchmark:
Methods equal: True
old 13.1517
new 4.3569
ratio 3.0186
Notes:
- naive_mm and numba_mm are two implementations of the same matrix multiply: both compute X @ (I ⊗ B).T and are intended to produce the same result. naive_mm is not used by the actual code; it's just there for comparison.
- numba_mm returns a dense array. I noticed that the output of this step is almost entirely dense (only about 2% of its entries are zero), so a sparse matrix isn't really helping you after this point.
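If you want to check that last point yourself, something like this (a quick check using the variables from the question) reports the fraction of stored entries in the intermediate result:

exp_xb = X @ sp.sparse.kron(I, B, format="csr").transpose()
print(exp_xb.nnz / (exp_xb.shape[0] * exp_xb.shape[1]))  # close to 1.0, i.e. nearly dense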