I want to calculate Pearson correlations between the columns of a pandas DataFrame. I don't want to just use DataFrame.corr(), because I also need the p-value of each correlation; therefore, I am using scipy.stats.pearsonr(x, y). My problem is that my dataframe is huge (shape: (1166, 49262)), so I'm looking at (49262^2 - 49262)/2 ≈ 1.2 billion correlations.
Please advise on how I can optimize this to reduce the computation time.
My code for the correlation:
# the variable `data` contains the dataframe of shape (1166, 49262)

# setting up output dataframes
dfcols = pd.DataFrame(columns=data.columns)
correlation = dfcols.T.join(dfcols, how='outer')
pvalues = correlation.copy()

# pairwise calculation
for r in range(len(data.columns)):
    for c in range(r+1, len(data.columns)):
        # iterate over all combinations of columns to calculate correlation
        tmp = data.iloc[:, [r, c]].dropna()
        if len(tmp) < 2:
            # too few data points to calculate correlation coefficient
            result = (0, 1)
        else:
            result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
        correlation.iloc[r, c] = result[0]
        pvalues.iloc[r, c] = result[1]
Some notes:
My data contains missing values, so for each pair of columns I call .dropna() and catch cases where fewer than two data points remain:

def foo():
    data = load_df()  # the pd.DataFrame of shape (1166, 49262)
    cols = data.columns
    for i in range(len(cols)):
        logging.info(f"{i+1}/{len(cols)}")
        for j in range(i+1, len(cols)):
            tmp = data.iloc[:, [i, j]].dropna()
            if len(tmp) < 2:
                # You may ignore this for this post; I was looking for column pairs with too few data points to correlate
                logging.warning(f"correlating columns '{cols[i]}' and '{cols[j]}' results in less than 2 usable data points")

foo()
You can obtain about a 200x speedup by using DataFrame.corr() plus converting the R values into a probability with a beta distribution.
I would suggest implementing this by looking at how SciPy did it and seeing if there are any improvements which are applicable to your case. The source code can be found here. This tells you exactly how they implemented the p-value. Specifically, they take a beta distribution with a = b = n / 2 - 1, running from -1 to 1, and find either the cumulative distribution function or the survival function of that distribution at the specified R value.
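As an illustration of that equivalence, here is a small self-contained check for a single column pair. This is only a sketch: the data vectors here are made up, and only the beta-distribution parameterization follows SciPy's approach described above.

import numpy as np
from scipy.stats import beta, pearsonr

# two arbitrary illustrative vectors of length n (no missing values)
rng = np.random.default_rng(0)
x = rng.random(50)
y = rng.random(50)

r, p = pearsonr(x, y)

# reproduce the two-sided p-value: a beta distribution with a = b = n/2 - 1,
# rescaled to run from -1 to 1, evaluated at |r| via the survival function
n = len(x)
ab = n / 2 - 1
p_beta = 2 * beta.sf(abs(r), a=ab, b=ab, loc=-1, scale=2)

assert np.isclose(p, p_beta)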
So while pearsonr() does not support being vectorized across all pairs of columns, the underlying beta distribution does. Using it, you can turn the correlation matrix that DataFrame.corr() gives you into correlations plus p-values.
I've checked this against your existing algorithm, and it agrees with it to within machine epsilon. I also tested it with NA values.
In terms of speed, it is roughly 200x faster than your original solution, making it faster than a multicore solution while only using one core.
Here is the code. Note that only calculate_corr_fast and get_pvalue_vectorized are important to the solution; the rest just sets up test data or is for comparison.
import pandas as pd
import numpy as np
import scipy
import scipy.spatial.distance  # explicit import so scipy.spatial.distance.squareform is available
import scipy.stats
from scipy.stats import pearsonr

# build a test dataframe with ~10% missing values
M = 1000
N = 200
P = 0.1
A = np.random.rand(M, N)
A[np.random.rand(M, N) < P] = np.nan
df = pd.DataFrame(A, columns=[f"a{i}" for i in range(1, N + 1)])
def calculate_corr(data):
    # setting up output dataframes
    dfcols = pd.DataFrame(columns=data.columns)
    correlation = dfcols.T.join(dfcols, how='outer')
    pvalues = correlation.copy()
    # pairwise calculation
    for r in range(len(data.columns)):
        for c in range(r, len(data.columns)):
            # iterate over all combinations of columns to calculate correlation
            tmp = data.iloc[:, [r, c]].dropna()
            if len(tmp) < 2:
                # too few data points to calculate correlation coefficient
                result = (0, 1)
            else:
                result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
            correlation.iloc[r, c] = result[0]
            correlation.iloc[c, r] = result[0]
            pvalues.iloc[r, c] = result[1]
            pvalues.iloc[c, r] = result[1]
    return correlation, pvalues
def get_pvalue_vectorized(r, ab, alternative='two-sided'):
    """Get p-value from beta dist given the statistic, and alternative."""
    assert len(r.shape) == 2
    assert len(ab.shape) == 2
    diag = np.arange(r.shape[0])
    # This is just to keep squareform happy. These don't actually
    # get sent to the beta distribution function.
    r[diag, diag] = 0
    ab[diag, diag] = 0
    # Avoid doing repeated computations of r,c and c,r
    rsq = scipy.spatial.distance.squareform(r)
    r[diag, diag] = 1
    absq = scipy.spatial.distance.squareform(ab)
    kwargs = dict(a=absq, b=absq, loc=-1, scale=2)
    if alternative == 'less':
        pvalue = scipy.stats.beta.cdf(rsq, **kwargs)
    elif alternative == 'greater':
        pvalue = scipy.stats.beta.sf(rsq, **kwargs)
    elif alternative == 'two-sided':
        pvalue = 2 * (scipy.stats.beta.sf(np.abs(rsq), **kwargs))
    else:
        message = "`alternative` must be 'less', 'greater', or 'two-sided'."
        raise ValueError(message)
    # Put back into 2d matrix
    pvalue = scipy.spatial.distance.squareform(pvalue)
    return pvalue
def calculate_corr_fast(data):
    correlation = data.corr()
    # For each pair of data values, count how many cases where both data values are
    # defined at the same position, using matrix multiply as a pairwise boolean and.
    data_notna = data.notna().values.astype('float32')
    value_count = data_notna.T @ data_notna
    # This is the a and b parameter for the beta distribution. It is different
    # for every p-value, because each one can potentially have a different number
    # of missing values
    ab = value_count / 2 - 1
    pvalues = get_pvalue_vectorized(correlation.values, ab)
    invalid = value_count < 2
    pvalues[invalid] = np.nan
    # Convert back to dataframe
    pvalues = pd.DataFrame(pvalues, columns=correlation.columns, index=correlation.index)
    return correlation, pvalues
correlation, pvalues = calculate_corr(df)
correlation_fast, pvalues_fast = calculate_corr_fast(df)
assert np.allclose(pvalues_fast.values.astype('float64'), pvalues.values.astype('float64'))
assert np.allclose(correlation_fast.values.astype('float64'), correlation.values.astype('float64'))
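For reference, a minimal usage sketch on your real dataframe would look like the following. This assumes your (1166, 49262) frame is already loaded into a variable called data, as in your question; the column lookup at the end is just for illustration.

# `data` is assumed to hold the (1166, 49262) dataframe from the question
correlation, pvalues = calculate_corr_fast(data)

# look up the correlation and p-value for any pair of columns by name
col_a, col_b = data.columns[0], data.columns[1]
print(correlation.loc[col_a, col_b], pvalues.loc[col_a, col_b])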
Benchmark results for a 1000x200 dataframe:

Original code: 40.5 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Andrej Kesely's answer: ~20 seconds
My answer: 190 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Benchmark notes: