python python-3.x dataframe outliers iqr

How to calculate modified Z score and IQR on each column of the dataframe

I have a data frame and i am trying to clean the data before my analysis.

I am putting a sample data for trial as my data is bit complex.

A   B	 C	D
30	24	13	41
30	25	14	45
30	27	15	44
30	28	16	43
31	21	12	4
31	2 	17	99
3	89	99	45
78	24	0   43
35	252	12	45
36	23	13	44

I am trying to deal with the outliers and trying to calculate the Modified Z score (median one) and IQR for filtering out the outliers from the data so that i can get the quality data for further analysis.

I want to calculate IQR and then Z score for each column and filter out the outliers for each column in the data frame.

I have tried few things till now like:

IQR:


for col in df2.columns:
    col = np.array([col])
    q1_a = np.percentile(col, 25)
    q3_a = np.percentile(col, 75)
    iqr1 = q3_a - q1_a
    print(iqr1)

Modified Z score:


for col in df2.columns:
    threshold = 3.5
    col_zscore = col +'_zscore'
    median_y = df[col].median()
    print(median_y)
    median_absolute_deviation_y = (np.abs(df2[col] - median_y)).median()
    print(median_absolute_deviation_y)
    modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
    print(modified_z_scores)
    df2[col_zscore] = np.abs(modified_z_scores)

df2 = df2[(np.abs(df2[col_zscore]) < 3.5).all(axis=1)]
print(df2)

But not getting the right answer. The function does not apply on each column and create the dataframe of my intention at the end. Please help. Thanks.

Solution

Use `scipy`:

Data:

  A    B   C   D
 30   24  13  41
 30   25  14  45
 30   27  15  44
 30   28  16  43
 31   21  12   4
 31    2  17  99
  3   89  99  45
 78   24   0  43
 35  252  12  45
 36   23  13  44

Code:

pandas.DataFrame.apply can be used to make column (axis=0), or row (axis=1) wise calculations.

import pandas as pd
import scipy

df.apply(scipy.stats.iqr)

# output
A    4.0
B    4.5
C    3.5
D    2.0
dtype: float64

df.apply(scipy.stats.zscore)

# output
        A         B         C         D
-0.196943 -0.392191 -0.307452 -0.200009
-0.196943 -0.377930 -0.269495 -0.013954
-0.196943 -0.349407 -0.231538 -0.060468
-0.196943 -0.335145 -0.193581 -0.106981
-0.139019 -0.434976 -0.345409 -1.921013
-0.139019 -0.705944 -0.155624  2.497782
-1.760907  0.534806  2.956852 -0.013954
 2.583435 -0.392191 -0.800893 -0.106981
 0.092679  2.859432 -0.345409 -0.013954
 0.150604 -0.406453 -0.307452 -0.060468

Verify iqr:

df.describe()
With a visual inspection of the df.describe() output, it's easy to confirm the results of scipy.stats.iqr

df.describe().loc[['25%', '75%']]

# Output
        A      B      C     D
25%  30.0  23.25  12.25  43.0
75%  34.0  27.75  15.75  45.0

`df` transform (modified z-score):

A z-score is the number of standard deviations from the mean, a data point is. More technically, it's a measure of how many standard deviations below or above the population mean, a raw score is. A z-score is also known as a standard score, and it can be placed on a normal distribution curve.

import numpy as np

def mod_z(col: pd.Series, thresh: float=3.5) -> pd.Series:
    med_col = col.median()
    med_abs_dev = (np.abs(col - med_col)).median()
    mod_z = 0.7413 * ((col - med_col) / med_abs_dev)
    mod_z = mod_z[np.abs(mod_z) < thresh]
    return np.abs(mod_z)

df_mod_z = df.apply(mod_z)

# Output
      A        B       C       D
 0.7413  0.12355  0.2471  2.2239
 0.7413  0.12355  0.2471  0.7413
 0.7413  0.61775  0.7413  0.0000
 0.7413  0.86485  1.2355  0.7413
 0.7413  0.86485  0.7413     NaN
 0.7413      NaN  1.7297     NaN
    NaN      NaN     NaN  0.7413
    NaN  0.12355     NaN  0.7413
    NaN      NaN  0.7413  0.7413
    NaN  0.37065  0.2471  0.0000

Filtered Original dataframe:

df_filtered = df[df_mod_z >= 0]
print(df_filtered)

Output

    A     B     C     D
 30.0  24.0  13.0  41.0
 30.0  25.0  14.0  45.0
 30.0  27.0  15.0  44.0
 30.0  28.0  16.0  43.0
 31.0  21.0  12.0   NaN
 31.0   NaN  17.0   NaN
  NaN   NaN   NaN  45.0
  NaN  24.0   NaN  43.0
  NaN   NaN  12.0  45.0
  NaN  23.0  13.0  44.0