I have a dataset of different measurements for several individuals in a pandas DataFrame, with a structure similar to this random data:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(25, 3), columns=['var1', 'var2', 'var3'])
df.head()
       var1      var2      var3
0 -0.484272 -1.232702 -0.104978
1 -0.104346  0.439150 -0.324739
2 -0.764503  0.679031  1.786502
3 -1.551942  0.136850  0.557289
4  0.081988 -0.482199 -0.560156
I want to figure out if any of these individuals are outliers, and I know that measuring the Mahalanobis distance (MD) is a common approach to this problem. I noticed that SciPy also has a mahalanobis function, but it takes as input two 1-D arrays and the inverse of their covariance matrix, rather than an entire dataframe. Is there a way to calculate the MD for each row in a dataframe using the SciPy function?
I found this implementation on Machine Learning Plus, which calculates the MD for each row in a dataframe and then derives a p-value from a chi-squared distribution to determine whether the row is an outlier (mahalanobis here is the helper function defined in that article, not the SciPy one):
df['mahalanobis'] = mahalanobis(df, df[['var1', 'var2', 'var3']])
df['p_value'] = 1 - chi2.cdf(df['mahalanobis'], 2)
df.head()
       var1      var2      var3  mahalanobis   p_value
0 -0.484272 -1.232702 -0.104978     2.972031  0.226272
1 -0.104346  0.439150 -0.324739     0.823351  0.662539
2 -0.764503  0.679031  1.786502     4.490658  0.105893
3 -1.551942  0.136850  0.557289     2.738988  0.254236
4  0.081988 -0.482199 -0.560156     0.386796  0.824154
But I wanted to see if there was a way to just use/modify the Scipy function to accomplish the same thing.
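You can do this with the SciPy function by computing the mean of each column and then measuring the Mahalanobis distance between that mean vector and each row of your dataset. Note that the SciPy function expects the inverse of the covariance matrix as its third argument.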
For example:
import numpy as np
import pandas as pd
from scipy.spatial import distance

def mahalanobis_dist_from_center(df):
    # SciPy expects the inverse of the covariance matrix, not the covariance itself
    inv_cov = np.linalg.inv(df.cov().values)
    mean = df.mean().values
    values = df.values
    scores = []
    for i in range(values.shape[0]):
        # Distance between the vector of column means and row i
        score = distance.mahalanobis(mean, values[i], inv_cov)
        scores.append(score)
    return pd.Series(data=scores, index=df.index)
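If you also want the p-value column from the question, keep in mind that SciPy returns the distance itself, while the chi-squared comparison applies to the squared distance. The values in the question's mahalanobis column appear to be squared distances already. Below is a rough sketch of a vectorized variant that replaces the Python loop with scipy.spatial.distance.cdist and adds a p-value. The names mahalanobis_outliers, dof, alpha and demo are made up for illustration. Defaulting the degrees of freedom to the number of variables is an assumption on my part (the snippet in the question uses 2), so adjust it to whatever convention you follow.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from scipy.stats import chi2


def mahalanobis_outliers(df, dof=None, alpha=0.01):
    """Mahalanobis distance of each row from the column means, plus a p-value.

    Hypothetical helper: `dof` and `alpha` are illustrative parameters.
    """
    values = df.to_numpy(dtype=float)
    mean = values.mean(axis=0, keepdims=True)  # shape (1, n_vars)

    # Sample covariance (ddof=1), which matches the default of df.cov()
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))

    # cdist returns an (n_rows, 1) array: every row against the single mean row
    dist = cdist(values, mean, metric="mahalanobis", VI=inv_cov).ravel()

    # Assumption: degrees of freedom default to the number of variables
    if dof is None:
        dof = values.shape[1]

    # The chi-squared comparison uses the squared distance; sf is 1 - cdf
    p_value = chi2.sf(dist ** 2, dof)

    return pd.DataFrame(
        {"mahalanobis": dist, "p_value": p_value, "outlier": p_value < alpha},
        index=df.index,
    )


# Usage on random data shaped like the example in the question
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((25, 3)), columns=["var1", "var2", "var3"])
result = mahalanobis_outliers(demo)
print(result.head())
```

The mahalanobis column here should match what the loop version returns for the same dataframe. Square it if you want to compare against the numbers from the Machine Learning Plus snippet.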