numpy · matrix · sparse-matrix

Optimizing the construction of a large matrix in NumPy


I am working on a part of my code where I need to construct a matrix S1 of size num x num, where num = d^2 is typically very large (i.e., > 6000). Below is the relevant part of the code.

S1 = np.zeros((num, num), dtype=np.complex128)
for i in range(num):
    for j in range(num):
        S1[i][j] = np.trace(np.dot(b1[i], np.dot(b1[j], state)))

'b1' is a list of d x d matrices of length d^2, and 'state' is another d x d matrix, both of which were constructed earlier in my code.

Specifically, 'state' is a density matrix, and 'b1' is a set (list) of d x d Hermitian matrices that are orthonormal with respect to the Hilbert-Schmidt inner product, thus forming an orthonormal basis which I use to represent 'state' later in my code (for some other purpose that is not relevant here). Based on the idea provided in this answer, I wrote the following function basis(n), whose output I use as b1:

import numpy as np

def basis(n):
    # Builds n^2 matrices that are orthonormal with respect to the
    # Hilbert-Schmidt inner product <A, B> = tr(A^dagger B).
    list2 = []  # symmetric off-diagonal matrices (purely imaginary entries)
    list3 = []  # antisymmetric off-diagonal matrices (real entries)
    list4 = []  # difference vectors fed to the QR orthonormalization
    list5 = []  # orthonormal diagonal vectors
    list6 = []  # diagonal matrices built from list5

    for i in range(0, n):
        for j in range(i + 1, n):
            m1 = np.zeros((n, n), dtype=np.complex128)
            m1[i, j] = 1j
            m1[j, i] = 1j
            list2.append(m1 / np.sqrt(2))

            m2 = np.zeros((n, n), dtype=np.complex128)
            m2[i, j] = -1
            m2[j, i] = 1
            list3.append(m2 / np.sqrt(2))

    for i in range(0, n - 1):
        # m3 is a vector; it becomes the diagonal of a matrix below,
        # after orthonormalization against the previous vectors.
        m3 = np.zeros(n, dtype=np.complex128)
        m3[i] = 1j
        m3[i + 1] = -1j
        list4.append(m3)

        org = np.linalg.qr(np.array(list4).T)[0]
        l1 = org[:, i]
        list5.append(l1)

        l2 = np.diag(list5[i])
        list6.append(l2)

    # Normalized identity plus the n^2 - 1 traceless matrices,
    # each scaled by -1j.
    return [np.identity(n) / np.sqrt(n)] + \
           [mat * (-1j) for mat in (list2 + list3 + list6)]
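
For a small n, the Hilbert-Schmidt orthonormality of the output can be verified with a quick sanity check (the name b1_test is just for illustration):

# Gram matrix of Hilbert-Schmidt inner products tr(A^dagger B);
# for an orthonormal basis it should be the identity.
b1_test = basis(4)
gram = np.array([[np.trace(a.conj().T @ b) for b in b1_test] for a in b1_test])
assert np.allclose(gram, np.eye(len(b1_test)))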

For example, when d=81, S1 has num=6561 rows and columns, so the nested loops perform 6561^2 ≈ 43 million iterations. As a result, the construction of S1 becomes extremely slow.

I have also checked that 'b1' and 'state' are quite sparse. Hence, I am trying to find some way to exploit their sparsity to reduce the number of iterations or the work per iteration.

Are there ways to optimize this code so that S1 is constructed faster and more efficiently?

I am still learning NumPy, and any suggestions to improve the performance of this code would be helpful. Thanks.


Solution

  • A first trivial optimization is to hoist np.dot(b1[j], state) (let's call it tmp) out of the inner loop so it is not recomputed num = 6561 times more often than necessary. Since tmp depends on j and not on i, this means iterating over j in the outer loop. This alone should make the computation nearly twice as fast.

    Another optimization is to avoid computing the matrix multiplication b1[i] @ tmp entirely (where @ is the operator for matrix multiplication). Indeed, np.trace needs only the diagonal of the result (it sums the items on that diagonal), so most of the product is wasted work. Thanks to the identity trace(A @ B) = sum(A * B.T), the overall np.trace(...) expression can be computed with np.sum(b1[i] * tmp.T); a quick numerical check of this identity follows the code below. Fortunately, tmp.T can also be precomputed once per j. Since this replaces an O(d^3) matrix product with an O(d^2) element-wise one, it should make the whole operation about d times faster (i.e. 81 times faster).

    Here is the final code:

    S1 = np.zeros((num, num), dtype=np.complex128)
    for j in range(num):
        tmpT = (b1[j] @ state).T.copy()  # precomputed once per j
        for i in range(num):
            S1[i, j] = np.sum(b1[i] * tmpT)
    
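    Here is a quick self-contained check of the trace identity used above:

    import numpy as np

    # trace(A @ B) = sum_{k,l} A[k,l] * B[l,k] = sum(A * B.T)
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
    B = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
    assert np.isclose(np.trace(A @ B), np.sum(A * B.T))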

    If b1 were a single 3-D array rather than a list of matrices, we could optimize this further. Indeed, we could use np.einsum or even plain matrix multiplications so as to compute np.sum(b1[i] * tmpT) for many i simultaneously; a sketch is given below. On a CPU with T cores, this would make the operation up to T times faster (i.e. 5~15 times faster in practice on a mainstream PC) because BLAS operations are parallel. Note that you can also parallelize this manually with tools like Numba, Cython and certainly NumExpr too (not as fast as BLAS matrix multiplications, but not that far off either). In the end, the whole construction would certainly be 3 orders of magnitude faster than the original code.
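
    Here is a minimal sketch of that fully vectorized approach, assuming b1 is first stacked into a 3-D array (the names B and Tmp are illustrative):

    import numpy as np

    B = np.stack(b1)    # shape (num, d, d)
    Tmp = B @ state     # all the b1[j] @ state products at once, shape (num, d, d)

    # S1[i, j] = trace(b1[i] @ b1[j] @ state) = sum_{k,l} B[i,k,l] * Tmp[j,l,k]
    S1 = np.einsum('ikl,jlk->ij', B, Tmp, optimize=True)

    # Equivalent formulation as a single BLAS matrix multiplication
    # on flattened matrices (fast and parallel):
    d = B.shape[1]
    S1 = B.reshape(num, d * d) @ Tmp.transpose(0, 2, 1).reshape(num, d * d).T

    Note that Tmp holds num matrices of size d x d (roughly 0.7 GB of complex128 values for d = 81), so this trades memory for speed.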