I am new to python programming. I am trying to determine outliers in my data set. I have converted the data set to a pandas data frame and then applying the IQR principle. After that I want to replace my OUTLIERS by zero and then calculate the mean and standard deviation as the outliers as skewing the mean and SD.
The code for Data Set is as below:
import pandas as pd
data = [[123,100,1200,800,800,1200,900,1400],[246,15,16,45,15,45,11,55],[234,90,105,180,90,180,100,220],[236,100,90,9000,90,9000,70,140]]
df = pd.DataFrame(data,columns=['ID','Store1','Store2','Store3','Min','Max','Lower_Limit','Upper_limit'])
print (df)
Data Set Snippet:
ID Store1 Store2 Store3 Min Max Lower_Limit Upper_limit
123 100 1200 800 800 1200 900 1400
246 15 16 45 15 45 11 55
234 90 105 180 90 180 100 220
236 100 90 9000 90 9000 70 140
I want to update values of Store1,Store2,Store3 to ZERO(0) if they are less than Lower_limit(['Store1'] < ['Lower_limit']) or greater than Upper_limit(['Store1'] > ['Upper_limit']).
Below is my Function:
def calculate_Outliers(row):
if row['Store1'] < row['Lower_limit'] or row['Store1'] > row['Upper_limit']:
return 0
else:
return row['Store1']
if row['Store2'] < row['Lower_limit'] or row['Store2'] > row['Upper_limit']:
return 0
else:
return row['Store2']
if row['Store3'] < row['Lower_limit'] or row['Store3'] > row['Upper_limit']:
return 0
else:
return row['Store3']
I am applying it like this:
df['Store1','Store3','Store3'] = df.apply(calculate_Outliers, axis=1)
Below is the result which is wrong...
ID Store1 Store2 Store3(Store1 Store2 Store3)
ID
123 NaN NaN NaN NaN 1000
246 NaN NaN NaN NaN 15
234 NaN NaN NaN NaN 0
236 NaN NaN NaN NaN 0
Desired Result:
ID Store1 Store2 Store3 Min Max Lower_Limit Upper_limit
123 100 1200 800 800 1200 900 1400
246 15 16 45 15 45 11 55
234 0 105 180 90 180 100 220
236 100 90 0 90 9000 70 140
Is there a way in which i can modify my original code to achieve this?
Try this:
m=df.filter(like='Store').lt(df.Lower_Limit,axis=0)|df.filter(like='Store').\
gt(df.Upper_limit,axis=0)
df.update(df.where(~m,0).filter(like='Store'))
print(df)
ID Store1 Store2 Store3 Min Max Lower_Limit Upper_limit
0 123 0 1200 0 800 1200 900 1400
1 246 15 16 45 15 45 11 55
2 234 0 105 180 90 180 100 220
3 236 100 90 0 90 9000 70 140
EDIT
you can use iloc[]
if column names doesnot have a common string:
m=df.iloc[:,1:4].lt(df.Lower_Limit,axis=0)|df.iloc[:,1:4].gt(df.Upper_limit,axis=0)
df.update(df.where(~m,0).iloc[:,1:4])
print(df)
ID Store1 Store2 Store3 Min Max Lower_Limit Upper_limit
0 123 0 1200 0 800 1200 900 1400
1 246 15 16 45 15 45 11 55
2 234 0 105 180 90 180 100 220
3 236 100 90 0 90 9000 70 140
Wrapping in a function :
def calculate_Outliers(df):
m1= df['Store1'].lt(df['Lower_limit'])|df['Store1'].gt(df['Upper_limit'])
m2 = df['Store2'].lt(df['Lower_limit'])|df['Store2'].gt(df['Upper_limit'])
m3= df['Store3'].lt(df['Lower_limit'])|df['Store3'].gt(df['Upper_limit'])
df.loc[m1,'Store1']=0
df.loc[m1,'Store2']=0
df.loc[m1,'Store3']=0
print(df)
calculate_Outliers(df)
ID Store1 Store2 Store3 Min Max Lower_limit Upper_limit
0 123 0 0 0 800 1200 900 1400
1 246 15 16 45 15 45 11 55
2 234 0 0 0 90 180 100 220
3 236 100 90 9000 90 9000 70 140