python, arrays, numpy, array-broadcasting

Fastest way to construct sparse block matrix in Python


I want to construct a matrix of shape (N,2N) in Python.

I can construct the matrix as follows:

import numpy as np 

N = 10 # 10,100,1000, whatever

some_vector = np.random.uniform(size=N) 

some_matrix = np.zeros((N, 2*N))

for i in range(N):
    some_matrix[i, 2*i] = 1
    some_matrix[i, 2*i + 1] = some_vector[i]

So the result is a matrix of mostly zeros, where in row i only columns 2*i and 2*i + 1 are populated.

Is there a faster way to construct the matrix without the loop? Feels like there should be some broadcasting operation...

Edit: There have been some very fast, great answers! I am going to extend the question a touch to my actual use case.

Now suppose some_vector has shape (N, T). I want to construct an array of shape (N, 2*N, T) analogous to the previous case. The naive approach is:

N = 10  # 10,100,1000, whatever
T = 500 # or whatever

some_vector = np.random.uniform(size=(N, T))

some_matrix = np.zeros((N, 2*N, T))

for i in range(N):
    for t in range(T):
        some_matrix[i, 2*i, t] = 1
        some_matrix[i, 2*i + 1, t] = some_vector[i, t]

Can we extend the previous answers to this new case?


Solution

  • Variant 1

    You can replace the loop with vectorized slice assignments that interleave the columns of an identity matrix with the columns of a diagonal matrix:

    a = np.eye(N)
    b = np.diag(some_vector)
    
    c = np.empty((N, 2*N))
    c[:, 0::2] = a
    c[:, 1::2] = b
    

    This is concise, but not optimal: it allocates unnecessary intermediate matrices (a and b), which becomes increasingly expensive as N grows, and it assigns to every position in c rather than just the non-zero entries.

    This may be faster or slower than the implementation in the question for different values of N.
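
    As a quick sanity check (my addition, assuming some_matrix here is still the loop-built matrix from the question, built from the same some_vector), the two constructions agree exactly:

    assert np.array_equal(c, some_matrix)  # sliced-assignment result matches the loop result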

  • Variant 2

    Similar idea to Variant 1, but it performs only a single allocation and avoids unnecessary assignments to zero entries.

    We create a single vector of size 2*N**2, which essentially represents the rows of some_matrix concatenated with one another. The non-zero positions are populated using strided slice assignments. We then create an (N, 2*N) view into this vector using np.ndarray.reshape.

    some_matrix = np.zeros(2 * N**2)

    step = 2 * (N + 1)

    some_matrix[::step] = 1
    some_matrix[1::step] = some_vector
    some_matrix = some_matrix.reshape(N, 2 * N)
    
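    Because the flat array is contiguous, the final reshape returns a view rather than a copy; a quick way to confirm this (my addition) is to inspect the array's base attribute:

    assert some_matrix.base is not None  # reshape produced a view into the flat buffer, not a copy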

  • Performance Comparison

    [Comparison charts for relatively small N (N=100) and for large N (N=10_000) omitted; raw timings below.]

    Setup:

    import numpy as np
    
    
    def original(n):
        some_vector = np.random.uniform(size=n)
        some_matrix = np.zeros((n, 2 * n))
    
        for i in range(n):
            some_matrix[i, 2 * i] = 1
            some_matrix[i, 2 * i + 1] = some_vector[i]
    
    
    def variant_1(n):
        some_vector = np.random.uniform(size=n)
    
        a = np.eye(n)
        b = np.diag(some_vector)
        
        c = np.empty((n, 2*n))
        c[:, 0::2] = a
        c[:, 1::2] = b
    
    
    def variant_2(n):
        some_vector = np.random.uniform(size=n)
        
        some_matrix = np.zeros(2 * n**2)
        step = 2 * (n + 1)
        some_matrix[::step] = 1
        some_matrix[1::step] = some_vector
        some_matrix = some_matrix.reshape(n, 2*n)
    

    Timing at N=100:

    %timeit -n1000  original(100)
    21.1 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit -n1000  variant_1(100)
    12.2 µs ± 431  ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit -n1000  variant_2(100)
    4.37 µs ± 21.4 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    Timing at N=1_000:

    %timeit -n100  original(1_000)
    631 µs ± 98.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_1(1_000)
    5.24 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_2(1_000)
    408 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Timing at N=10_000:

    %timeit -n100  original(10_000)
    12.6 ms ± 225 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit -n100  variant_2(10_000)
    10.1 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Tested using Python 3.10.12 and NumPy v1.26.0.
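
    Regarding the edit in the question (some_vector of shape (N, T), output of shape (N, 2*N, T)): both ideas appear to carry over. Below is a minimal, non-benchmarked sketch, assuming the default C-contiguous layout so that reshaping the flat buffer lines up with the strided assignments; the names idx, flat, c and d are mine, and the first option uses advanced (fancy) indexing rather than Variant 1's slice assignments:

    import numpy as np

    N, T = 10, 500
    some_vector = np.random.uniform(size=(N, T))

    # Option A (my own): advanced indexing on the first two axes; the scalar 1
    # and the (N, T) rows of some_vector broadcast along the T axis.
    idx = np.arange(N)
    c = np.zeros((N, 2 * N, T))
    c[idx, 2 * idx] = 1                # c[i, 2*i, :] = 1 for every i
    c[idx, 2 * idx + 1] = some_vector  # c[i, 2*i + 1, :] = some_vector[i, :]

    # Option B: Variant 2's flat-buffer idea extends directly. Treat the output
    # as a stack of 2*N**2 rows of length T, assign with a stride, then reshape.
    flat = np.zeros((2 * N * N, T))
    step = 2 * (N + 1)
    flat[::step] = 1
    flat[1::step] = some_vector
    d = flat.reshape(N, 2 * N, T)

    assert np.array_equal(c, d)  # both match the double loop from the question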