I have a data frame and i am trying to clean the data before my analysis.
I am putting a sample data for trial as my data is bit complex.
A B C D
30 24 13 41
30 25 14 45
30 27 15 44
30 28 16 43
31 21 12 4
31 2 17 99
3 89 99 45
78 24 0 43
35 252 12 45
36 23 13 44
I am trying to deal with the outliers and trying to calculate the Modified Z score (median one) and IQR for filtering out the outliers from the data so that i can get the quality data for further analysis.
I want to calculate IQR and then Z score for each column and filter out the outliers for each column in the data frame.
I have tried few things till now like:
IQR:
for col in df2.columns:
col = np.array([col])
q1_a = np.percentile(col, 25)
q3_a = np.percentile(col, 75)
iqr1 = q3_a - q1_a
print(iqr1)
Modified Z score:
for col in df2.columns:
threshold = 3.5
col_zscore = col +'_zscore'
median_y = df[col].median()
print(median_y)
median_absolute_deviation_y = (np.abs(df2[col] - median_y)).median()
print(median_absolute_deviation_y)
modified_z_scores = 0.7413 *((df2[col] - median_y)/median_absolute_deviation_y)
print(modified_z_scores)
df2[col_zscore] = np.abs(modified_z_scores)
df2 = df2[(np.abs(df2[col_zscore]) < 3.5).all(axis=1)]
print(df2)
But not getting the right answer. The function does not apply on each column and create the dataframe of my intention at the end. Please help. Thanks.
scipy
: A B C D
30 24 13 41
30 25 14 45
30 27 15 44
30 28 16 43
31 21 12 4
31 2 17 99
3 89 99 45
78 24 0 43
35 252 12 45
36 23 13 44
pandas.DataFrame.apply
can be used to make column (axis=0
), or row (axis=1
) wise calculations.import pandas as pd
import scipy
df.apply(scipy.stats.iqr)
# output
A 4.0
B 4.5
C 3.5
D 2.0
dtype: float64
df.apply(scipy.stats.zscore)
# output
A B C D
-0.196943 -0.392191 -0.307452 -0.200009
-0.196943 -0.377930 -0.269495 -0.013954
-0.196943 -0.349407 -0.231538 -0.060468
-0.196943 -0.335145 -0.193581 -0.106981
-0.139019 -0.434976 -0.345409 -1.921013
-0.139019 -0.705944 -0.155624 2.497782
-1.760907 0.534806 2.956852 -0.013954
2.583435 -0.392191 -0.800893 -0.106981
0.092679 2.859432 -0.345409 -0.013954
0.150604 -0.406453 -0.307452 -0.060468
df.describe()
df.describe()
output, it's easy to confirm the results of scipy.stats.iqr
df.describe().loc[['25%', '75%']]
# Output
A B C D
25% 30.0 23.25 12.25 43.0
75% 34.0 27.75 15.75 45.0
df
transform (modified z-score):import numpy as np
def mod_z(col: pd.Series, thresh: float=3.5) -> pd.Series:
med_col = col.median()
med_abs_dev = (np.abs(col - med_col)).median()
mod_z = 0.7413 * ((col - med_col) / med_abs_dev)
mod_z = mod_z[np.abs(mod_z) < thresh]
return np.abs(mod_z)
df_mod_z = df.apply(mod_z)
# Output
A B C D
0.7413 0.12355 0.2471 2.2239
0.7413 0.12355 0.2471 0.7413
0.7413 0.61775 0.7413 0.0000
0.7413 0.86485 1.2355 0.7413
0.7413 0.86485 0.7413 NaN
0.7413 NaN 1.7297 NaN
NaN NaN NaN 0.7413
NaN 0.12355 NaN 0.7413
NaN NaN 0.7413 0.7413
NaN 0.37065 0.2471 0.0000
df_filtered = df[df_mod_z >= 0]
print(df_filtered)
A B C D
30.0 24.0 13.0 41.0
30.0 25.0 14.0 45.0
30.0 27.0 15.0 44.0
30.0 28.0 16.0 43.0
31.0 21.0 12.0 NaN
31.0 NaN 17.0 NaN
NaN NaN NaN 45.0
NaN 24.0 NaN 43.0
NaN NaN 12.0 45.0
NaN 23.0 13.0 44.0