python, arrays, numpy, array-broadcasting

Fastest way to construct sparse block matrix in Python


I want to construct a matrix of shape (N,2N) in Python.

I can construct the matrix as follows:

import numpy as np 

N = 10 # 10,100,1000, whatever

some_vector = np.random.uniform(size=N) 

some_matrix = np.zeros((N, 2*N))

for i in range(N):
    some_matrix[i, 2*i] = 1
    some_matrix[i, 2*i + 1] = some_vector[i]

So the result is a matrix of mostly zeros, where in row i only columns 2*i and 2*i + 1 are populated.

Is there a faster way to construct the matrix without the loop? Feels like there should be some broadcasting operation...

Edit: There have been some very fast, great answers! I am going to extend the question a touch to my actual use case.

Now suppose some_vector has shape (N, T). I want to construct an array of shape (N, 2*N, T) analogous to the previous case. The naive approach is:

N = 10  # 10,100,1000, whatever
T = 500 # or whatever

some_vector = np.random.uniform(size=(N, T))

some_matrix = np.zeros((N, 2*N, T))

for i in range(N):
    for t in range(T):
        some_matrix[i, 2*i, t] = 1
        some_matrix[i, 2*i + 1, t] = some_vector[i, t]

Can we extend the previous answers to this new case?


Solution

  • Variant 1

    You can replace the loop with vectorized slice assignments that interleave the columns of an identity matrix with the columns of a diagonal matrix:

    a = np.eye(N)
    b = np.diag(some_vector)
    
    c = np.empty((N, 2*N))
    c[:, 0::2] = a
    c[:, 1::2] = b
    

    This is concise, but not optimal: it allocates unnecessary intermediate matrices (a and b), which becomes increasingly expensive as N grows, and it assigns to every position in c rather than just the non-zero entries.

    This may be faster or slower than the implementation in the question for different values of N.
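
    As a quick sanity check (my addition, assuming some_matrix here is still the loop-built matrix from the question, built from the same some_vector), the two constructions agree exactly:

    assert np.array_equal(c, some_matrix)  # sliced-assignment result matches the loop result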

  • Variant 2

    Similar idea to Variant 1, but it performs only a single allocation and avoids unnecessary assignments to zero entries.

    We create a single vector of size 2*N**2, which essentially represents the rows of some_matrix concatenated with one another. The non-zero positions are populated using strided slice assignments. We then create an (N, 2*N) view into this vector using np.ndarray.reshape.

    some_matrix = np.zeros(2 * N**2)

    step = 2 * (N + 1)

    some_matrix[::step] = 1
    some_matrix[1::step] = some_vector
    some_matrix = some_matrix.reshape(N, 2 * N)
    
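    Because the flat array is contiguous, the final reshape returns a view rather than a copy; a quick way to confirm this (my addition) is to inspect the array's base attribute:

    assert some_matrix.base is not None  # reshape produced a view into the flat buffer, not a copy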

  • Performance Comparison

    [Comparison charts for relatively small N (N=100) and for large N (N=10_000) omitted; raw timings below.]

    Setup:

    import numpy as np
    
    
    def original(n):
        some_vector = np.random.uniform(size=n)
        some_matrix = np.zeros((n, 2 * n))
    
        for i in range(n):
            some_matrix[i, 2 * i] = 1
            some_matrix[i, 2 * i + 1] = some_vector[i]
    
    
    def variant_1(n):
        some_vector = np.random.uniform(size=n)
    
        a = np.eye(n)
        b = np.diag(some_vector)
        
        c = np.empty((n, 2*n))
        c[:, 0::2] = a
        c[:, 1::2] = b
    
    
    def variant_2(n):
        some_vector = np.random.uniform(size=n)
        
        some_matrix = np.zeros(2 * n**2)
        step = 2 * (n + 1)
        some_matrix[::step] = 1
        some_matrix[1::step] = some_vector
        some_matrix = some_matrix.reshape(n, 2*n)
    

    Timing at N=100:

    %timeit -n1000  original(100)
    21.1 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit -n1000  variant_1(100)
    12.2 µs ± 431  ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit -n1000  variant_2(100)
    4.37 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    Timing at N=1_000:

    %timeit -n100  original(1_000)
    631 µs ± 98.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_1(1_000)
    5.24 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_2(1_000)
    408 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Timing at N=10_000:

    %timeit -n100  original(10_000)
    12.6 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_2(10_000)
    10.1 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Tested using Python 3.10.12 and NumPy v1.26.0.
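
    Regarding the edit in the question (some_vector of shape (N, T), output of shape (N, 2*N, T)): both ideas appear to carry over. Below is a minimal, non-benchmarked sketch, assuming the default C-contiguous layout so that reshaping the flat buffer lines up with the strided assignments; the names idx, flat, c and d are mine, and the first option uses advanced (fancy) indexing rather than Variant 1's slice assignments:

    import numpy as np

    N, T = 10, 500
    some_vector = np.random.uniform(size=(N, T))

    # Option A (my own): advanced indexing on the first two axes; the scalar 1
    # and the (N, T) rows of some_vector broadcast along the T axis.
    idx = np.arange(N)
    c = np.zeros((N, 2 * N, T))
    c[idx, 2 * idx] = 1                # c[i, 2*i, :] = 1 for every i
    c[idx, 2 * idx + 1] = some_vector  # c[i, 2*i + 1, :] = some_vector[i, :]

    # Option B: Variant 2's flat-buffer idea extends directly. Treat the output
    # as a stack of 2*N**2 rows of length T, assign with a stride, then reshape.
    flat = np.zeros((2 * N * N, T))
    step = 2 * (N + 1)
    flat[::step] = 1
    flat[1::step] = some_vector
    d = flat.reshape(N, 2 * N, T)

    assert np.array_equal(c, d)  # both match the double loop from the question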