python pandas scipy statistics mahalanobis

How can I use the Scipy Mahalanobis distance implementation for outlier detection?


I have a dataset of different measurements for several individuals in a Pandas dataframe, with a structure similar to this random data:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(25, 3), columns=['var1', 'var2', 'var3'])
df.head()
       var1      var2      var3
0 -0.484272 -1.232702 -0.104978
1 -0.104346  0.439150 -0.324739
2 -0.764503  0.679031  1.786502
3 -1.551942  0.136850  0.557289
4  0.081988 -0.482199 -0.560156

I want to figure out whether any of these individuals are outliers, and I know that measuring the Mahalanobis distance (MD) is a common approach to this problem. I noticed that Scipy also has a Mahalanobis function, but it takes as input two 1-D arrays and the inverse of their covariance matrix, rather than an entire dataframe. Is there a way to calculate the MD for each row in a dataframe using the Scipy function?

I found this implementation on Machine Learning Plus, which calculates the MD for each row in a dataframe and then derives a p-value from the chi-squared distribution to decide whether the row is an outlier:

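from scipy.stats import chi2

# mahalanobis() here is the function defined in the Machine Learning Plus article, not the Scipy one
df['mahalanobis'] = mahalanobis(df, df[['var1', 'var2', 'var3']])
df['p_value'] = 1 - chi2.cdf(df['mahalanobis'], 2)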
df.head()
       var1      var2      var3  mahalanobis   p_value
0 -0.484272 -1.232702 -0.104978     2.972031  0.226272
1 -0.104346  0.439150 -0.324739     0.823351  0.662539
2 -0.764503  0.679031  1.786502     4.490658  0.105893
3 -1.551942  0.136850  0.557289     2.738988  0.254236
4  0.081988 -0.482199 -0.560156     0.386796  0.824154

But I wanted to see if there was a way to just use/modify the Scipy function to accomplish the same thing.


Solution

  • You can do this with the Scipy function by computing the mean and the inverse covariance matrix of the data once, and then measuring the distance from each row in your dataset to that mean.

    For example:

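    import numpy as np
    import pandas as pd
    from scipy.spatial import distance
    
    def mahalanobis_dist_from_center(df):
        # Inverse covariance matrix of the columns (Scipy expects the inverse)
        cov = df.cov().values
        inv_cov = np.linalg.inv(cov)
        # Center of the data: the per-column mean
        mean = df.mean().values
        scores = []
        values = df.values
        for i in range(values.shape[0]):
            # Distance from row i to the center
            score = distance.mahalanobis(mean, values[i], inv_cov)
            scores.append(score)
        return pd.Series(data=scores, index=df.index)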
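
To get from these distances to the outlier flags asked about in the question, something like the following sketch might work. It rests on a few assumptions that are mine rather than the original post's: the data is roughly multivariate normal, so the squared distance approximately follows a chi-squared distribution; the degrees of freedom equal the number of variables (3 here, not the 2 used in the question's snippet); and the 0.001 cutoff is only an illustrative threshold. Note that the Scipy function returns the distance itself, so it is squared before the chi-squared step.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.stats import chi2

def mahalanobis_dist_from_center(df):
    # Same idea as above: distance from each row to the column means
    inv_cov = np.linalg.inv(df.cov().values)
    mean = df.mean().values
    scores = [distance.mahalanobis(row, mean, inv_cov) for row in df.values]
    return pd.Series(scores, index=df.index)

# Random data with the same shape as in the question (seeded for repeatability)
rng = np.random.default_rng(0)
cols = ['var1', 'var2', 'var3']
df = pd.DataFrame(rng.standard_normal((25, 3)), columns=cols)

dist = mahalanobis_dist_from_center(df[cols])

# Scipy returns the (unsquared) distance; the chi-squared
# approximation applies to the squared distance.
d2 = dist ** 2

# Assumption: degrees of freedom = number of variables.
# chi2.sf(x, k) is the survival function, i.e. 1 - chi2.cdf(x, k).
p_value = chi2.sf(d2, len(cols))

result = df.assign(mahalanobis=dist, p_value=p_value)

# Assumption: 0.001 is an arbitrary illustrative cutoff
outliers = result[result['p_value'] < 0.001]
print(result.head())
print(outliers)
```

For larger dataframes, `scipy.spatial.distance.cdist` with `metric='mahalanobis'` and the `VI` argument should give the same distances without the Python-level loop.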