[SOLVED] How can I use the Scipy Mahalanobis distance implementation for outlier detection?

I have a dataset of different measurements for several individuals in a Pandas dataframe, with a structure similar to this random data:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(25, 3), columns=['var1', 'var2', 'var3'])
df.head()

       var1      var2      var3
0 -0.484272 -1.232702 -0.104978
1 -0.104346  0.439150 -0.324739
2 -0.764503  0.679031  1.786502
3 -1.551942  0.136850  0.557289
4  0.081988 -0.482199 -0.560156

I want to figure out whether any of these individuals are outliers, and I know that measuring the Mahalanobis distance (MD) is a common approach to this problem. I noticed that SciPy also has a Mahalanobis function, but it takes two 1-D arrays and the inverse of their covariance matrix as input, rather than an entire dataframe. Is there a way to calculate the MD for each row in a dataframe using the SciPy function?

I found this implementation on Machine Learning Plus, which calculates the MD for each row in a dataframe and then derives a p-value from the chi-squared distribution to determine whether the row is an outlier:

df['mahalanobis'] = mahalanobis(df, df[['var1', 'var2', 'var3']])
df['p_value'] = 1 - chi2.cdf(df['mahalanobis'], 2)
df.head()

       var1      var2      var3  mahalanobis   p_value
0 -0.484272 -1.232702 -0.104978     2.972031  0.226272
1 -0.104346  0.439150 -0.324739     0.823351  0.662539
2 -0.764503  0.679031  1.786502     4.490658  0.105893
3 -1.551942  0.136850  0.557289     2.738988  0.254236
4  0.081988 -0.482199 -0.560156     0.386796  0.824154

But I wanted to see if there was a way to just use/modify the Scipy function to accomplish the same thing.

Solution

You can do this by computing the column means and the inverse covariance matrix once, then using the SciPy function to measure the distance from each row in your dataset to that mean.

For example:

import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis

def mahalanobis_dist_from_center(df):
    # SciPy expects the inverse of the covariance matrix, not the covariance itself
    cov = df.cov().values
    inv_cov = np.linalg.inv(cov)
    mean = df.mean().values
    scores = []
    values = df.values
    # Distance from each row to the vector of column means
    for i in range(values.shape[0]):
        score = mahalanobis(mean, values[i], inv_cov)
        scores.append(score)
    return pd.Series(data=scores, index=df.index)
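
If the row-by-row loop becomes slow on larger data, the same distances can probably be obtained in one call with scipy.spatial.distance.cdist, which accepts metric="mahalanobis" and the inverse covariance via VI. The sketch below is only one possible way to do it: the helper name mahalanobis_outliers and the alpha cutoff are made up for illustration, and the p-value step assumes the data is roughly multivariate normal, so that the squared distance is approximately chi-squared with one degree of freedom per variable.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist, mahalanobis
from scipy.stats import chi2


def mahalanobis_outliers(df, alpha=0.01):
    """Hypothetical helper: distance to the mean, p-value and outlier flag per row."""
    values = df.to_numpy(dtype=float)
    mean = values.mean(axis=0)
    # Sample covariance (ddof=1, same as df.cov()), inverted once for all rows
    inv_cov = np.linalg.inv(np.cov(values, rowvar=False))
    # cdist measures every row against the single mean row in one call
    dist = cdist(values, mean[np.newaxis, :], metric="mahalanobis", VI=inv_cov).ravel()
    # Assumption: squared distances are approximately chi-squared distributed,
    # with as many degrees of freedom as there are variables
    p_value = chi2.sf(dist ** 2, df=values.shape[1])
    return pd.DataFrame(
        {"mahalanobis": dist, "p_value": p_value, "outlier": p_value < alpha},
        index=df.index,
    )


# Random data with the same shape as in the question
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((25, 3)), columns=["var1", "var2", "var3"])
result = mahalanobis_outliers(df)
print(result.head())
```

Note that SciPy returns the distance itself, not its square, which is why the sketch squares it before the chi-squared step. The alpha threshold is a judgment call and should be chosen to suit your data.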