python matlab numpy

Using Python's corrcoef to reproduce MATLAB's corr function


I have tried every solution I could find online without success, so I am asking here.

I want to reproduce the behavior of MATLAB's corr function:
I have 2 matrices A and B.
A's shape: (200, 30000)
B's shape: (200, 1)

In MATLAB, corr(A, B) returns a matrix of size (30000, 1). When I use numpy.corrcoef (or dask for better performance), I get a (30001, 30001) matrix, which is extremely large and not the answer I want. I tried passing rowvar=False, as some answers suggested, but that did not work either.

I even tried scipy.spatial.distance.cdist(np.transpose(traces), np.transpose(my_trace), metric='correlation'), which did return a matrix of shape (30000, 1) as expected, but the values were different from the MATLAB result.

I am desperate for a solution to this problem. Please help.


Solution

  • By default, MATLAB's corr computes the correlation between the columns of A and the columns of B, whereas numpy.corrcoef treats each row of an array as a variable and returns the full correlation matrix of all rows (if you pass it two arrays, it stacks them vertically and does the same). If performance is not a concern and you want an easy solution, stack the two arrays horizontally, compute the correlation with rowvar=False, and keep only the entries you need:

    correlation = np.corrcoef(np.hstack((B,A)),rowvar=False)[0,1:]
    

    If you care more about performance than about simple code, you will need to implement the corr function yourself. (Please comment and I will add it if that is what you are looking for.)

    UPDATE: To implement corr yourself and avoid the extra computation and memory of the full correlation matrix, compute the correlation directly from its formula: standardize the columns of both arrays, then multiply them:

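    # Standardize each column: zero mean, unit (population) standard deviation
    A = (A - A.mean(axis=0))/A.std(axis=0)
    B = (B - B.mean(axis=0))/B.std(axis=0)
    # Average the products of the standardized values over the rows
    correlation = (np.dot(B.T, A)/B.shape[0])[0]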
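
The standardize-and-multiply approach can also be wrapped in a function that leaves its inputs untouched and is checked against the np.corrcoef one-liner. This is only a minimal sketch: the name corr_columns is hypothetical, the data are random, and 50 columns stand in for the 30000 in the question so the check stays fast.

```python
import numpy as np

def corr_columns(A, B):
    """Pearson correlation of each column of A with the single column of B.

    A has shape (n_samples, n_features), B has shape (n_samples, 1).
    Returns an array of shape (n_features,).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Standardize each column (population standard deviation, ddof=0).
    A_z = (A - A.mean(axis=0)) / A.std(axis=0)
    B_z = (B - B.mean(axis=0)) / B.std(axis=0)
    # The mean of the products of z-scores is the Pearson coefficient.
    return (B_z.T @ A_z / A.shape[0])[0]

# Hypothetical test data, smaller than the arrays in the question.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
B = rng.normal(size=(200, 1))

fast = corr_columns(A, B)
reference = np.corrcoef(np.hstack((B, A)), rowvar=False)[0, 1:]
print(fast.shape)                    # (50,)
print(np.allclose(fast, reference))  # True
```

A column with zero variance makes the division by std produce NaN or inf, so constant columns would need separate handling.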
    

    Output on a small sample (note that here A is the single column and B is the multi-column array):

    A = np.array([1,2,2,2]).reshape(4,1)
    B = np.arange(20).reshape(4,5)
    
    Python: np.corrcoef(np.hstack((A,B)),rowvar=False)[0,1:]
    
    [0.77459667 0.77459667 0.77459667 0.77459667 0.77459667]
    
    MATLAB: corr(A,B)
    
    0.7746    0.7746    0.7746    0.7746    0.7746
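
As for the cdist attempt in the question: SciPy's 'correlation' metric is a correlation distance, that is, 1 minus the Pearson coefficient, which would explain values that differ from MATLAB's corr. The sketch below uses hypothetical random data in place of the asker's traces and my_trace, and shows that subtracting the distance from 1 recovers the same coefficients as the np.corrcoef approach.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical data: 50 columns instead of 30000 to keep the example small.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
B = rng.normal(size=(200, 1))

# cdist compares rows, so the columns (variables) are transposed into rows.
dist = cdist(A.T, B.T, metric='correlation')  # shape (50, 1), values are 1 - r
r_from_cdist = 1.0 - dist

reference = np.corrcoef(np.hstack((B, A)), rowvar=False)[0, 1:]
print(r_from_cdist.shape)                          # (50, 1)
print(np.allclose(r_from_cdist[:, 0], reference))  # True
```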