numpy pytorch vectorization matrix-multiplication gaussian-process

Vectorizing ARD (Automatic Relevance Determination) kernel implementation in Gaussian processes


I am trying to implement an ARD kernel with NumPy, as given in the GPML book (the $M_3$ form from Equation 5.2):

$$k(x_p, x_q) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}\,(x_p - x_q)^\top M_3\,(x_p - x_q)\right) + \sigma_n^2\,\delta_{pq}, \qquad M_3 = \Lambda\Lambda^\top + \operatorname{diag}(\ell)^{-2}$$

I am struggling to vectorize this equation for the N×M kernel computation. I have tried the following non-vectorized version. Can someone help vectorize it in NumPy/PyTorch?

import numpy as np
N = 30 # Number of data points in X1
M = 40 # Number of data points in X2
D = 6  # Number of features (ARD dimensions)

X1 = np.random.rand(N, D)
X2 = np.random.rand(M, D)
Lambda = np.random.rand(D, 1)       # (D, 1) column vector
L_inv = np.diag(np.random.rand(D))  # (D, D) diagonal matrix
sigma_f = np.random.rand()

K = np.empty((N, M))
for n in range(N):
  for m in range(M):
    M3 = Lambda@Lambda.T + L_inv**2                 # (D, D); recomputed every iteration
    d = (X1[n,:] - X2[m,:]).reshape(-1,1)           # (D, 1) difference vector
    K[n, m] = sigma_f**2 * np.exp(-0.5 * d.T@M3@d)

Solution

  • We can use the rules of broadcasting and the neat NumPy function einsum to vectorize array operations. In a few words, broadcasting lets us operate on arrays in one-liners by implicitly adding new dimensions, while einsum lets us combine multiple arrays by working explicitly in index notation (instead of with whole matrices).

    Luckily, no loops are needed to compute your kernel. The vectorized solution, the ARD_kernel function below, is about 30x faster on my machine than the original loopy version (a quick check that the two functions agree is included after the code). einsum is usually about as fast as it gets, though there may be faster approaches I have not checked (e.g. the usual @ operator instead of einsum).

    Also, there is a term missing from your code: the Kronecker delta, i.e. the sigma_n^2 * delta_pq noise term from Equation 5.2. I don't know if it was omitted on purpose; a sketch of how to add it follows the code below (let me know if you have problems implementing it and I'll edit the answer).

    import numpy as np
    N = 300 # Number of data points in X1
    M = 400 # Number of data points in X2
    D = 6  # Number of features (ARD dimensions)
    
    np.random.seed(1)  # Fix random seed for reproducibility
    X1 = np.random.rand(N, D)
    X2 = np.random.rand(M, D)
    Lambda = np.random.rand(D, 1)
    L_inv = np.diag(np.random.rand(D))
    sigma_f = np.random.rand()
    
    # Loopy function (kept for reference and benchmarking)
    def ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f):
        N, M = X1.shape[0], X2.shape[0]  # derive sizes from the inputs rather than globals
        K = np.empty((N, M))
        M3 = Lambda@Lambda.T + L_inv**2  # (D, D); constant, so hoisted out of the loops
        for n in range(N):
            for m in range(M):
                d = (X1[n,:] - X2[m,:]).reshape(-1,1)
                K[n, m] = np.exp(-0.5 * d.T@M3@d)
        return K * sigma_f**2
        
    # Vectorized function
    def ARD_kernel(X1, X2, Lambda, L_inv, sigma_f):
        M3 = Lambda.squeeze()*Lambda + L_inv**2  # (D, D); broadcasting replaces Lambda@Lambda.T
        d = X1[:,None] - X2[None,...]            # (N, M, D); broadcasting replaces the loops

        # order='F' requests a memory layout suited to the (N, M, D) arrangement of the arrays
        return sigma_f**2 * np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", d, M3, d, order='F'))
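
    A quick check with the arrays defined above confirms that the loopy and vectorized functions agree; the timing lines are IPython magics, shown commented out:

    K_loop = ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f)
    K_vec = ARD_kernel(X1, X2, Lambda, L_inv, sigma_f)
    print(np.allclose(K_loop, K_vec))  # True: both produce the same (N, M) kernel matrix

    # In IPython/Jupyter, compare the run times with:
    # %timeit ARD_kernel_loops(X1, X2, Lambda, L_inv, sigma_f)
    # %timeit ARD_kernel(X1, X2, Lambda, L_inv, sigma_f)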
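
    As for the missing Kronecker delta: here is a minimal sketch of how the sigma_n^2 * delta_pq noise term could be added. Note that sigma_n is a hypothetical noise parameter that is not in the original code, and the delta term only contributes when X1 and X2 are the same set of points:

    sigma_n = 0.1  # hypothetical noise standard deviation, not in the original code

    # Training covariance: the Kronecker delta is 1 only on the diagonal (p == q)
    K_train = ARD_kernel(X1, X1, Lambda, L_inv, sigma_f)
    K_train += sigma_n**2 * np.eye(X1.shape[0])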
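
    Since the question also mentions PyTorch: the same broadcasting/einsum recipe ports directly, as in the (untested) sketch below; torch.einsum accepts the same subscripts, though it has no order argument:

    import torch

    def ARD_kernel_torch(X1, X2, Lambda, L_inv, sigma_f):
        M3 = Lambda.squeeze()*Lambda + L_inv**2  # (D, D), as in the NumPy version
        d = X1[:, None] - X2[None, :]            # (N, M, D) pairwise differences
        return sigma_f**2 * torch.exp(-0.5 * torch.einsum("ijk,kl,ijl->ij", d, M3, d))

    # Usage, converting the NumPy arrays to tensors:
    K_t = ARD_kernel_torch(torch.from_numpy(X1), torch.from_numpy(X2),
                           torch.from_numpy(Lambda), torch.from_numpy(L_inv),
                           torch.tensor(sigma_f))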