I am cleaning a dataset by removing outliers with the z-score, using a threshold of 3. Below is the code I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over the column, computes the z-score of every value, and, if it is greater than or equal to 3, treats the value as an outlier and appends it to the list "outlier". Finally, the rows whose values are in the outlier list are dropped from the dataset.
"""SD MonthlyIncome"""
MonthlyIncome_std = df['MonthlyIncome'].std()
MonthlyIncome_std
"""MEAN MonthlyIncome"""
MonthlyIncome_mean = df['MonthlyIncome'].mean()
MonthlyIncome_mean
threshold = 3
outlier = []
for i in df['MonthlyIncome']:
    z = (i - MonthlyIncome_mean) / MonthlyIncome_std
    if z >= threshold:
        outlier.append(i)
df = df[~df.MonthlyIncome.isin(outlier)]
The above code works fine, but I have to rewrite it for every numerical column. I was trying to create a function that does the same thing and can be reused for every numerical column. Here is the function:
def z_score_elimination(df):
    for col in df.columns:
        if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
            threshold = 3
            outlier = []
            col_mean = col.mean()
            col_std = col.std()
            z = (i-col_mean)/col_std
            if z >= threshold:
                outlier.append(i)
            df = df[~df.col.isin(outlier)]
Calling the function raises the following error:
AttributeError                            Traceback (most recent call last)
<ipython-input-62-4f8b1224061e> in <module>
----> 1 z_score_elimination(df)
<ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
4 threshold = 3
5 outlier = []
----> 6 col_mean = col.mean()
7 col_std = col.std()
8 z = (i-col_mean)/col_std
AttributeError: 'str' object has no attribute 'mean'
How can I fix the code?
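You are iterating over the column names, which are strings, not over the actual columns, so col.mean() fails. Index the DataFrame with the name instead:
df[col].mean()
The same applies to col.std(). Note also that df.col looks up a column literally named "col", so use df[col] there as well.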
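For completeness, here is a minimal sketch of how the whole function could look once those fixes are applied. It is one possible version, not the only one, and it makes a few assumptions that are not in the question: the `threshold` parameter is added for convenience, the per-value loop is replaced by a vectorized comparison, and every column's mean and standard deviation are computed on the original data before any row is dropped. Like your single-column code, it only removes values on the high side (z >= threshold).

```python
import numpy as np
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Return df without rows whose z-score is >= threshold in any numeric column."""
    # Select numeric columns by dtype instead of comparing dtype strings.
    numeric_cols = df.select_dtypes(include=np.number).columns

    # Boolean mask of rows to keep; starts with every row kept.
    keep = pd.Series(True, index=df.index)

    for col in numeric_cols:
        col_mean = df[col].mean()   # df[col] is the column, col is only its name
        col_std = df[col].std()

        # Assumption: skip constant or empty columns to avoid dividing by zero/NaN.
        if pd.isna(col_std) or col_std == 0:
            continue

        z = (df[col] - col_mean) / col_std   # z-score of every value at once
        keep &= ~(z >= threshold)            # drop rows flagged as outliers

    return df[keep]
```

You would then call it as `df = z_score_elimination(df)`. If you also want to remove unusually low values, compare `z.abs()` against the threshold instead.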