python, scipy, vectorization, adjacency-matrix

How to vectorize computation of the Spearman correlation between a 2D array and a reference 1D array with scipy


I have a matrix M of shape (37, N) and an additional reference column vector of shape (37, 1).

I am looking for a way to compute the Spearman correlation between each column of M and my reference, to obtain an array of N correlations as a result. Using a for loop I can do:


from scipy.stats import spearmanr
import numpy as np 

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

expected_result = []
for element in M.T:
    expected_result.append(spearmanr(element, reference).statistic)

print(np.array(expected_result).shape)

It gives what I want (shape (100,)), but I would like it to be vectorized to speed up computation. I tried using the following code:


from scipy.stats import spearmanr
import numpy as np 

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))
res = spearmanr(M, reference)

but res.statistic ends up with shape (101, 101), which is even more confusing to me.

Is there a way to vectorize this?

Thank you


Solution

  • The thing is, spearmanr does not behave the same way in 1D and in 2D.

    In 1D (spearmanr(M[:,0], reference), spearmanr(M[:,1], reference), ..., as you are doing in your loop), it does what you expect from it.

    But in 2D, each column is considered a different variable to compare. So if you pass a 37×100 matrix to spearmanr (spearmanr(M), for example), what it does is compute the correlation for every pair of columns of M, resulting in a 100×100 array containing spearmanr(M[:,i], M[:,j]) for all 100×100 possible i, j.
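
    To see the two behaviours side by side (a quick check reusing M and reference from your snippet):

    # one column vs the reference -> a single scalar statistic
    print(np.ndim(spearmanr(M[:, 0], reference).statistic))  # 0
    # one 2D input -> all pairwise column correlations
    print(spearmanr(M).statistic.shape)                      # (100, 100)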

    Note that I've used only one argument here. There is a reason for that: usually, people either pass two 1D arrays to compare, or one 2D array to compare each of its columns against the others. It is not usual (but not impossible) to pass two 2D arrays. But when you do, it is just as if you had concatenated the columns of the two arrays before calling spearmanr on the resulting 2D array.

    So, in your case, what you are doing is calling spearmanr with 100 columns (M) and 1 column (reference), each with 37 rows. That is exactly as if you had called it with 101 columns, and the result is the 101×101 possible combinations of correlations.
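
    You can verify that equivalence directly (a small sanity check with your arrays):

    stacked = np.hstack([M, reference])        # shape (37, 101)
    a = spearmanr(M, reference).statistic      # shape (101, 101)
    b = spearmanr(stacked).statistic           # shape (101, 101)
    print(a.shape, np.allclose(a, b))          # (101, 101) True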

    Hence, since you are only interested in the correlations between the 100 columns of M (the first 100 of the resulting 101 columns) and the single column of reference (the 101st of the resulting 101 columns), the result you are looking for is the last column of the result of spearmanr(M, reference):

    spearmanr(M, reference).statistic[:-1, -1]
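
    A quick check that this slice matches your loop:

    loop_res = np.array([spearmanr(col, reference).statistic for col in M.T])
    vec_res = spearmanr(M, reference).statistic[:-1, -1]
    print(np.allclose(loop_res, vec_res))      # True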
    

    Which means that you compute way too many things.

    Still, in your case, it is faster (because of vectorization) to compute all those 101×101 correlations and then drop 1+100×101 of the results, keeping only the 100 you need:

    from scipy.stats import spearmanr
    import numpy as np
    import timeit
    
    rng = np.random.default_rng()
    M = rng.standard_normal((37, 100))
    reference = rng.standard_normal((37, 1))
    
    def sp0(C, reference):
        # Spearman correlation of one column against the reference
        return spearmanr(C, reference).statistic
    
    def sp1(M, reference):
        # loop version: one spearmanr call per column of M
        expected_result = []
        for element in M.T:
            expected_result.append(sp0(element, reference))
        return np.array(expected_result)
    
    def sp2(M, reference):
        # vectorized version: compute all 101×101 correlations, then keep
        # only the reference's column, restricted to the first n rows
        n = M.shape[1]
        return spearmanr(M, reference).statistic[:n, n]
    
    print(timeit.timeit(lambda: sp1(M, reference), number=100))
    print(timeit.timeit(lambda: sp2(M, reference), number=100))
    

    On my (slow) computer, that is 60 ms per run for your for-loop method, vs 17 ms for the vectorized version (computing the 101×101 result and keeping only 100 values).

    Of course, in the long run (if that 100 becomes much bigger), the vectorized version would lose.

    You may be tempted to vectorize with np.vectorize. But that is roughly the same as a for loop:

    # np.vectorize loops in Python under the hood; the signature maps one
    # (37,) column plus the (37, 1) reference to a single scalar
    sp3 = np.vectorize(sp0, signature='(m),(m,1)->()')
    print(timeit.timeit(lambda: sp3(M.T, reference), number=100))
    

    gives 54 ms per run (so basically the same as the 60 ms of the explicit loop).

    Lastly, you can go back to what a Spearman correlation is: just a Pearson correlation between the ranks of the values.

    Assuming there are no ties (and that is a very reasonable assumption for continuous random variables like yours):

    def sp4(M, reference):
        n = len(reference)            # 37
        mean = (n - 1) / 2            # mean of ranks 0..n-1
        var = (n - 1) * (n + 1) / 12  # variance of ranks 0..n-1
        # double argsort gives each value's rank (0..n-1), column-wise
        rgm = M.argsort(axis=0).argsort(axis=0)
        rgr = reference.argsort(axis=0).argsort(axis=0)
        # Pearson correlation of the ranks: covariance E[XY]-E[X]E[Y],
        # divided by var (both rank vectors share the same variance)
        return ((rgm * rgr).mean(axis=0) - mean**2) / var
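
    As a quick sanity check (reusing sp2 and the data from the timing snippet above), the rank-based version agrees with scipy on this tie-free data:

    print(np.allclose(sp4(M, reference), sp2(M, reference)))  # True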
    

    This time, you have it both ways: it is vectorized, and you don't compute useless results. Timing is, as expected, way lower: 0.21 ms per run (so a factor of ~300 from your code).

    Note that, since I assume no ties, the ranks are exactly 0 to 36 (or 1 to 37, it doesn't matter), so I can compute the mean and variance directly: the mean is 18 whatever the data, and the variance, derived from the sum of the first k squares, k×(k+1)×(2k+1)/6, is (n−1)(n+1)/12 as shown.
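
    For instance, with n = 37:

    r = np.arange(37)                           # ranks when there is no tie
    print(r.mean(), (37 - 1) / 2)               # 18.0 18.0
    print(r.var(), (37 - 1) * (37 + 1) / 12)    # 114.0 114.0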

    Edit

    I see that I was too slow, and in the meantime basically the same answer was given (I should have checked while I was playing with this). Since I played a bit more with the timings, I'm leaving it anyway. One small implementation difference (besides the fact that I compute the mean and variance of the ranks directly, assuming no ties) is the use of the .argsort().argsort() trick, which is faster for small enough data. But rankdata becomes faster in the long run (around n=20000 on my computer, so way over 37). Plus, indeed, rankdata doesn't assume there are no ties as I do. If ties are possible, then not only should the .argsort().argsort() trick be replaced by rankdata, but also using just (n−1)/2 and (n−1)(n+1)/12 as the mean and variance is no longer possible, and we need to compute all 101 means (not really a problem though, since this is not where the CPU time is really spent anyway).
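
    For completeness, here is a sketch of that tie-safe variant (sp5 is just an illustrative name): rankdata averages tied ranks, so the closed-form mean and variance no longer apply and everything is computed from the actual ranks.

    from scipy.stats import rankdata
    
    def sp5(M, reference):
        # tie-safe ranks (tied values get their average rank)
        rgm = rankdata(M, axis=0)              # shape (37, 100)
        rgr = rankdata(reference, axis=0)      # shape (37, 1)
        # Pearson correlation of the ranks, with per-column means
        rgm_c = rgm - rgm.mean(axis=0)
        rgr_c = rgr - rgr.mean(axis=0)
        num = (rgm_c * rgr_c).sum(axis=0)
        den = np.sqrt((rgm_c**2).sum(axis=0) * (rgr_c**2).sum(axis=0))
        return num / den
    
    print(np.allclose(sp5(M, reference), sp4(M, reference)))  # True (no ties here)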