I have a matrix M of size (37, N) and an additional 1D reference vector of size (37, 1).
I am looking for a way to compute the Spearman correlation between each sample (column) of M and my reference, to obtain an (N, 1) matrix as a result. Using a for loop I can do:
from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

expected_result = []
for element in M.T:
    expected_result.append(spearmanr(element, reference).statistic)

print(np.array(expected_result).shape)
It gives what I want (shape (100,)), but I would like it to be vectorized to speed up computations. I tried using the following code:
from scipy.stats import spearmanr
import numpy as np
rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))
res = spearmanr(M, reference)
but res.statistic ends up with shape (101, 101), which is even more confusing to me.
Is there a way to vectorize this ?
Thank you
The thing is, spearmanr is not the same in 1D and 2D.
In 1D (spearmanr(M[:,0], reference), spearmanr(M[:,1], reference), ..., as you are doing in your loop), it does what you expect from it.
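For instance, a quick check of that 1D behaviour (a minimal illustration, reusing the same kind of setup as in your question):

from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

print(spearmanr(M[:, 0], reference).statistic)  # a single scalar correlation coefficient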
But in 2D, each column is considered a different variable to compare. So if you pass a 37×100 matrix to spearmanr (spearmanr(M) for example), what it does is compute all 100×100 combinations of columns of M, resulting in an array of 100×100 results: spearmanr(M[:,i], M[:,j]) for all 100×100 possible (i, j).
Note that I've used only one argument here. There is a reason for that: usually, people either pass two 1D arrays to compare, or pass one 2D array to compare each of its columns against each other. It is not usual (but not impossible) to pass two 2D arrays. But when you do, it is just as if you had concatenated the columns of the two arrays before calling spearmanr on the resulting 2D array.
So, in your case, what you are doing is calling spearmanr with 100 columns (M) and 1 column (reference), each of 37 rows. So, exactly as if you had called it with 101 columns. And the result is the 101×101 possible combinations of correlations.
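You can see those shapes directly (a small illustration, with M and reference as defined in your code):

print(spearmanr(M).statistic.shape)             # (100, 100): every pair of columns of M
print(spearmanr(M, reference).statistic.shape)  # (101, 101): as if reference were a 101st column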
Hence, since you are only interested in the correlations between the 100 columns of M (aka the first 100 of the resulting 101 columns) and the column of reference (aka the 101st of the resulting 101 columns), the result you are looking for is the last column of the result of spearmanr(M, reference), minus its last element (which is the correlation of reference with itself):

spearmanr(M, reference).statistic[:-1, -1]
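If you want to convince yourself that this slicing really matches your loop, something like this should do (a quick check, rebuilding the same setup):

from scipy.stats import spearmanr
import numpy as np

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

loop_result = np.array([spearmanr(col, reference).statistic for col in M.T])
sliced = spearmanr(M, reference).statistic[:-1, -1]
print(np.allclose(loop_result, sliced))  # True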
Which means that you compute way too many things.
Still, in your case, it is faster (because of vectorization) to compute all those 101×101 correlations and then drop 1 + 100×101 of the results, keeping only 100:
from scipy.stats import spearmanr
import numpy as np
import timeit

rng = np.random.default_rng()
M = rng.standard_normal((37, 100))
reference = rng.standard_normal((37, 1))

def sp0(C, reference):
    return spearmanr(C, reference).statistic

def sp1(M, reference):
    # your for-loop version
    expected_result = []
    for element in M.T:
        expected_result.append(sp0(element, reference))
    er1 = np.array(expected_result)
    return er1

def sp2(M, reference):
    # vectorized version: compute the 101x101 matrix, keep only the last column
    n = M.shape[1]
    er2 = spearmanr(M, reference).statistic[:n, n]
    return er2

print(timeit.timeit(lambda: sp1(M, reference), number=100))
print(timeit.timeit(lambda: sp2(M, reference), number=100))
On my (slow) computer, that is 60 ms per run (your for loop method), vs 17 ms (computing the 101×101 result with the vectorized version and keeping only 100).
Of course, in the long run (if that 100 becomes much bigger), the vectorized version would lose.
You may be tempted to vectorize with np.vectorize. But that is roughly the same as a for loop:
sp3 = np.vectorize(sp0, signature='(m),(m,1)->()')
print(timeit.timeit(lambda: sp3(M.T, reference), number=100))
gives 54 ms (so basically the same as the 60 ms of the loop, or almost so).
Lastly, you can go back to what a Spearman correlation is: just a (Pearson) correlation between the ranks of the values.
Assuming there are no ties (and that is a very reasonable assumption if the data are continuous random variables, as yours are):
def sp4(M, reference):
    n = len(reference)            # 37
    mean = (n - 1) / 2            # mean of the ranks 0..n-1
    var = (n - 1) * (n + 1) / 12  # variance of the ranks 0..n-1
    rgm = M.argsort(axis=0).argsort(axis=0)          # ranks of each column of M
    rgr = reference.argsort(axis=0).argsort(axis=0)  # ranks of reference
    return ((rgm * rgr).mean(axis=0) - mean**2) / var  # Pearson correlation of the ranks
This time, you have it both ways: it is vectorized, and you don't compute useless results. Timing is, as expected, way lower: 0.021 ms (so a factor of roughly 300 compared to your code).
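To double-check that sp4 agrees with scipy (assuming sp4, M and reference as defined above, and no ties in the data):

print(np.allclose(sp4(M, reference), spearmanr(M, reference).statistic[:-1, -1]))  # True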
Note that, since I assume no ties, the ranks are 0 to 36 (or 1 to 37, it doesn't matter), so I can compute the mean and variance of the ranks directly (the mean is 18, whatever the data; and the variance, based on the sum of the first k squares, k(k+1)(2k+1)/6, is (n-1)(n+1)/12 as shown).
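As a quick numeric sanity check of those closed-form values for n = 37 (ranks 0 to 36):

ranks = np.arange(37)
print(ranks.mean())  # 18.0, i.e. (n-1)/2
print(ranks.var())   # 114.0, i.e. (n-1)*(n+1)/12 = 36*38/12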
I see that I was too slow, and in the meantime basically the same answer was given (I should have checked while I was playing with this). Since I played a bit more with timings, I'll leave this one anyway.
One small implementation difference (besides the fact that I compute the mean and variance of the ranks directly, assuming no ties) is the use of the .argsort().argsort() trick, which is faster for small enough data. But rankdata becomes faster in the long run (around n=20000 on my computer, so way over 37). Plus, indeed, rankdata doesn't assume there are no ties as I do. If ties are possible, then not only should the .argsort().argsort() trick be replaced by rankdata, but also, using just (n-1)/2 and (n-1)(n+1)/12 as mean and variance is not possible any more, and we would need to compute all 101 means (not really a problem though, since this is not where CPU time is really spent anyway).
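For completeness, a tie-safe variant along those lines could look like this (just a sketch; sp5 is a name I made up, and it computes per-column means and standard deviations of the ranks instead of using the closed-form values):

from scipy.stats import rankdata

def sp5(M, reference):
    rgm = rankdata(M, axis=0)          # ranks per column; ties get the average rank
    rgr = rankdata(reference, axis=0)  # ranks of the reference column
    rgm = rgm - rgm.mean(axis=0)       # center the ranks, column by column
    rgr = rgr - rgr.mean(axis=0)
    # Pearson correlation of the centered ranks, column by column
    return (rgm * rgr).mean(axis=0) / (rgm.std(axis=0) * rgr.std(axis=0))

With no ties it should give the same result as sp4, and with ties it should agree with spearmanr (which also ranks the data and then takes a Pearson correlation of the ranks).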