pythonnumpymemory-efficient

How to efficiently calculate Pearson correlation between corresponding columns of two 2D arrays?


I have two large 2D numpy arrays A and B (each array has dimensions (18000,18000)). I want to calculate the Pearson correlation between corresponding columns of the two arrays (i.e. naively calculate

corr_vec = [np.corrcoef(A[:, i], B[:, i])[0,1] for i in range(A.shape[0])]

The naive for loop is of course extremely slow. I tried using the built-in pairwise np.corrcoef() and get the results I want by extracting the part describing corresponding columns of two arrays.

filter_index = A.shape[0]
corr_vec = np.corrcoef(A, B, rowvar = False)[:filter_index, filter_index:].diagonal()

However, the np.corrcoef still calculate all possible pairwise correlation of columns within A, columns within B, and all other column combinations between A and B, while I just need the position-corresponding columns of A and B. Is there an efficient way for this task?


Solution

  • A faster method using einsum:

    a = A - A.mean(0)
    a /= np.linalg.norm(a, axis=0)
    b = B - B.mean(0)
    b /= np.linalg.norm(b, axis=0)
    np.einsum('ij,ij->j', a, b, optimize=True)
    

    Timings:

    import numpy as np
    
    rng = np.random.default_rng()
    
    shape = (18000, 18000)
    
    A = rng.random(shape)
    B = rng.random(shape)
    
    corr_vec = [np.corrcoef(A[:, i], B[:, i])[0,1] for i in range(A.shape[0])]
    
    def onyambu1(A, B):
        a = (A - A.mean(0))/A.std(0)
        b = (B - B.mean(0))/B.std(0)
    
        return (a * b).mean(0)
    
    def onyambu2(A, B):
        a1 = (A - A.mean(0))
        b1 = (B - B.mean(0))
        return (a1 * b1).sum(0)/np.sqrt((a1*a1).sum(0)*(b1*b1).sum(0))
    
    def nin17(A, B):
        a = A - A.mean(0)
        a /= np.linalg.norm(a, axis=0)
        b = B - B.mean(0)
        b /= np.linalg.norm(b, axis=0)
        return np.einsum('ij,ij->j', a, b, optimize=True)
    
    assert np.allclose(corr_vec, onyambu1(A, B))
    assert np.allclose(corr_vec, onyambu2(A, B))
    assert np.allclose(corr_vec, nin17(A, B))
    
    %timeit onyambu1(A, B)
    %timeit onyambu2(A, B)
    %timeit nin17(A, B)
    

    Output:

    4.12 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    2.56 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    2.21 s ± 79.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)