I have been trying to implement in Python (with numpy and scipy) this variant of Matlab's corr function, but it seems I cannot solve it by myself. What I need is to implement the alternative Matlab corr implementation:
[rho,pval] = corr(X,Y)
I would appreciate any help on this!!!
What I tried:
I tried to modify the solutions posted here and in this other thread, without much success. For instance, I was able to hstack the two matrixes X and Y, and by keeping a part of the correlation matrix, I got the right result for the correlations. However, the same trick is NOT working for the p-values. Actually, both the solutions in the other threads gave me the (more or less) correct values for the correlation coefficients, which is great, but I cannot reproduce the behavior of the Matlab implementation for the p-values.
Also, the solution here aims to reproduce corrcoef's behaviour, which, according to Matlab's documentation, converts the input matrices into column vectors before computing the correlations.
On the other hand, I also tried to hstack
the matrices in Matlab, and again got the same answer for the correlations, but the values are quite different for the p-values between what I get in Python and what I got in Matlab. This made me think that the problem perhaps is in the statistic being calculated. However, according to the Matlab documentation, it uses:
corr computes the p-values for Pearson's correlation using a Student's t distribution for a transformation of the correlation
and, from the documentation in SciPy I think they are using the same test, but I am not 100% sure, given that the bibligraphic reference there is for Student's paper, which is the one that Matlab's docs say it uses (Student's r), but as I said, I am NOT sure at all.
I think the easiest way is to do it in pandas, using scipy.stats pearsonr, which returns the pairwise rho and pval. I tested with some sample below, and I believe the results match the matlab results.
import numpy as np
from scipy.stats import pearsonr
import pandas as pd
X = np.array([
[0.5377, 0.3188, 3.5784, 0.7254],
[1.8339, -1.3077, 2.7694, -0.0631],
[-2.2588, -0.4336, -1.3499, 0.7147],
[0.8622, 0.3426, 3.0349, -0.2050]
])
Y1 = np.array([
[-0.1241, 0.6715, 0.4889, 0.2939],
[1.4897, -1.2075, 1.0347, -0.7873],
[1.4090, 0.7172, 0.7269, 0.8884],
[1.4172, 1.6302, -0.3034, -1.1471]
])
Y2 = Y1
Y2[:, 3] = Y2[:, 3] + X[:, 1]
df1 = pd.DataFrame(X)
df2 = pd.DataFrame(Y2)
coeffmat = np.zeros((df1.shape[1], df2.shape[1]))
pvalmat = np.zeros((df1.shape[1], df2.shape[1]))
for i in range(df1.shape[1]):
for j in range(df2.shape[1]):
corrtest = pearsonr(df1[df1.columns[i]], df2[df2.columns[j]])
coeffmat[i,j] = corrtest[0]
pvalmat[i,j] = corrtest[1]
dfcoeff = pd.DataFrame(coeffmat, columns=df2.columns, index=df1.columns)
print(dfcoeff)
dfpvals = pd.DataFrame(pvalmat, columns=df2.columns, index=df1.columns)
print(dfpvals)
I also tried playing around with pandas corrwith function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html#pandas.DataFrame.corrwith) but didn't spend enough time. I think there is a more elegant solution which uses that.
Hope this helps