I have two large 2D numpy arrays A and B, each of shape (18000, 18000). I want to calculate the Pearson correlation between corresponding columns of the two arrays, i.e. naively:
corr_vec = [np.corrcoef(A[:, i], B[:, i])[0, 1] for i in range(A.shape[1])]
The naive for loop is of course extremely slow. I tried the built-in pairwise np.corrcoef() and got the result I want by extracting the block that describes corresponding columns of the two arrays:
filter_index = A.shape[1]
corr_vec = np.corrcoef(A, B, rowvar=False)[:filter_index, filter_index:].diagonal()
However, np.corrcoef still calculates every pairwise correlation: columns within A, columns within B, and every cross combination between A and B, while I only need the position-matched columns. For (18000, 18000) inputs that means a (36000, 36000) correlation matrix, about 10 GB at float64. Is there an efficient way to do this?
A faster method using einsum:
a = A - A.mean(0)
a /= np.linalg.norm(a, axis=0)
b = B - B.mean(0)
b /= np.linalg.norm(b, axis=0)
np.einsum('ij,ij->j', a, b, optimize=True)
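This works because centering each column and scaling it to unit L2 norm reduces the Pearson correlation to a plain dot product, which the einsum then evaluates column by column without ever forming a cross-correlation matrix:

r_j = sum_i a[i, j] * b[i, j]

where a[:, j] is the j-th column of A after subtracting its mean and dividing by its L2 norm (and likewise for b).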
Timings:
import numpy as np
rng = np.random.default_rng()
shape = (18000, 18000)
A = rng.random(shape)
B = rng.random(shape)
corr_vec = [np.corrcoef(A[:, i], B[:, i])[0, 1] for i in range(A.shape[1])]
def onyambu1(A, B):
    a = (A - A.mean(0)) / A.std(0)
    b = (B - B.mean(0)) / B.std(0)
    return (a * b).mean(0)
def onyambu2(A, B):
    a1 = A - A.mean(0)
    b1 = B - B.mean(0)
    return (a1 * b1).sum(0) / np.sqrt((a1 * a1).sum(0) * (b1 * b1).sum(0))
def nin17(A, B):
    a = A - A.mean(0)
    a /= np.linalg.norm(a, axis=0)
    b = B - B.mean(0)
    b /= np.linalg.norm(b, axis=0)
    return np.einsum('ij,ij->j', a, b, optimize=True)
assert np.allclose(corr_vec, onyambu1(A, B))
assert np.allclose(corr_vec, onyambu2(A, B))
assert np.allclose(corr_vec, nin17(A, B))
%timeit onyambu1(A, B)
%timeit onyambu2(A, B)
%timeit nin17(A, B)
Output:
4.12 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.56 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.21 s ± 79.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
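If memory is the bottleneck rather than runtime, the same normalization can be applied in column blocks, so the temporaries never grow beyond (n_rows, block). A minimal sketch (the helper name columnwise_corr_chunked and the default block size of 1024 are my own choices, not from the answers above):

import numpy as np

def columnwise_corr_chunked(A, B, block=1024):
    # Hypothetical helper: Pearson correlation of corresponding columns,
    # processed `block` columns at a time to cap temporary allocations.
    n_cols = A.shape[1]
    out = np.empty(n_cols)
    for start in range(0, n_cols, block):
        cols = slice(start, min(start + block, n_cols))
        a = A[:, cols] - A[:, cols].mean(0)  # center each column
        b = B[:, cols] - B[:, cols].mean(0)
        a /= np.linalg.norm(a, axis=0)       # scale columns to unit L2 norm
        b /= np.linalg.norm(b, axis=0)
        out[cols] = np.einsum('ij,ij->j', a, b)  # dot product = correlation
    return out

The result should match nin17(A, B) to floating-point tolerance, with peak temporary storage of roughly 2 * 18000 * block * 8 bytes instead of several full-size copies.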