pythonpython-3.xdataframeoutliersiqr

How to calculate modified Z score and IQR on each column of the dataframe


I have a data frame and i am trying to clean the data before my analysis.

I am putting a sample data for trial as my data is bit complex.

A   B	 C	D
30	24	13	41
30	25	14	45
30	27	15	44
30	28	16	43
31	21	12	4
31	2 	17	99
3	89	99	45
78	24	0   43
35	252	12	45
36	23	13	44

I am trying to deal with the outliers and trying to calculate the Modified Z score (median one) and IQR for filtering out the outliers from the data so that i can get the quality data for further analysis.

I want to calculate IQR and then Z score for each column and filter out the outliers for each column in the data frame.

I have tried few things till now like:

IQR:


for col in df2.columns:
    col = np.array([col])
    q1_a = np.percentile(col, 25)
    q3_a = np.percentile(col, 75)
    iqr1 = q3_a - q1_a
    print(iqr1)

Modified Z score:


for col in df2.columns:
    threshold = 3.5
    col_zscore = col +'_zscore'
    median_y = df[col].median()
    print(median_y)
    median_absolute_deviation_y = (np.abs(df2[col] - median_y)).median()
    print(median_absolute_deviation_y)
    modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
    print(modified_z_scores)
    df2[col_zscore] = np.abs(modified_z_scores)

df2 = df2[(np.abs(df2[col_zscore]) < 3.5).all(axis=1)]
print(df2)

But not getting the right answer. The function does not apply on each column and create the dataframe of my intention at the end. Please help. Thanks.


Solution

  • Use scipy:

    Data:

      A    B   C   D
     30   24  13  41
     30   25  14  45
     30   27  15  44
     30   28  16  43
     31   21  12   4
     31    2  17  99
      3   89  99  45
     78   24   0  43
     35  252  12  45
     36   23  13  44
    

    Code:

    import pandas as pd
    import scipy
    
    df.apply(scipy.stats.iqr)
    
    # output
    A    4.0
    B    4.5
    C    3.5
    D    2.0
    dtype: float64
    
    df.apply(scipy.stats.zscore)
    
    # output
            A         B         C         D
    -0.196943 -0.392191 -0.307452 -0.200009
    -0.196943 -0.377930 -0.269495 -0.013954
    -0.196943 -0.349407 -0.231538 -0.060468
    -0.196943 -0.335145 -0.193581 -0.106981
    -0.139019 -0.434976 -0.345409 -1.921013
    -0.139019 -0.705944 -0.155624  2.497782
    -1.760907  0.534806  2.956852 -0.013954
     2.583435 -0.392191 -0.800893 -0.106981
     0.092679  2.859432 -0.345409 -0.013954
     0.150604 -0.406453 -0.307452 -0.060468
    

    Verify iqr:

    df.describe().loc[['25%', '75%']]
    
    # Output
            A      B      C     D
    25%  30.0  23.25  12.25  43.0
    75%  34.0  27.75  15.75  45.0
    

    df transform (modified z-score):

    import numpy as np
    
    def mod_z(col: pd.Series, thresh: float=3.5) -> pd.Series:
        med_col = col.median()
        med_abs_dev = (np.abs(col - med_col)).median()
        mod_z = 0.7413 * ((col - med_col) / med_abs_dev)
        mod_z = mod_z[np.abs(mod_z) < thresh]
        return np.abs(mod_z)
    
    df_mod_z = df.apply(mod_z)
    
    # Output
          A        B       C       D
     0.7413  0.12355  0.2471  2.2239
     0.7413  0.12355  0.2471  0.7413
     0.7413  0.61775  0.7413  0.0000
     0.7413  0.86485  1.2355  0.7413
     0.7413  0.86485  0.7413     NaN
     0.7413      NaN  1.7297     NaN
        NaN      NaN     NaN  0.7413
        NaN  0.12355     NaN  0.7413
        NaN      NaN  0.7413  0.7413
        NaN  0.37065  0.2471  0.0000
    

    Filtered Original dataframe:

    df_filtered = df[df_mod_z >= 0]
    print(df_filtered)
    

    Output

        A     B     C     D
     30.0  24.0  13.0  41.0
     30.0  25.0  14.0  45.0
     30.0  27.0  15.0  44.0
     30.0  28.0  16.0  43.0
     31.0  21.0  12.0   NaN
     31.0   NaN  17.0   NaN
      NaN   NaN   NaN  45.0
      NaN  24.0   NaN  43.0
      NaN   NaN  12.0  45.0
      NaN  23.0  13.0  44.0