I am using this code to remove outliers.
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
df[np.abs(stats.zscore(df[0])) < 1.5]
This works: the number of rows in the data frame is reduced. However, I need to remove outliers in the percentage-change values of a similar data frame.
df = df.pct_change()
df.plot.line(subplots=True)
df[np.abs(stats.zscore(df[0])) < 1.5]
This results in an empty data frame. What am I doing wrong? Should the value 1.5 be adjusted? I tried several values, but nothing works.
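It's because the first value of your data frame is NaN after pct_change (the first row has no previous row to compare against). stats.zscore propagates that NaN into the mean and standard deviation, so every z-score is NaN and the comparison < 1.5 is False for every row, whatever threshold you pick. So use fillna to replace the NaN value.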
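To check this on your own data, here is a minimal sketch. It assumes the same random setup as the question, with a seed added only for reproducibility. The names z and mask are illustrative, and np.asarray is used so the check does not depend on whether zscore returns a Series or an array.

```python
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)  # assumption: seed added only so the run is repeatable
df = pd.DataFrame(np.random.randn(100, 3))

# The first element is NaN because row 0 has no previous row.
pct = df[0].pct_change()

# With the default nan_policy='propagate', one NaN poisons the mean and std.
z = np.asarray(stats.zscore(pct))

# Any comparison against NaN evaluates to False, so no row is selected.
mask = np.abs(z) < 1.5

print(pct.isna().sum())    # number of NaN values in the percentage changes
print(np.isnan(z).all())   # whether every z-score is NaN
print(mask.sum())          # number of rows that pass the filter
```

Filling the leading NaN first, as in the code that follows, avoids this.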
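import pandas as pd
import numpy as np
from scipy import stats

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3))
# pct_change leaves NaN in the first row; fill it so zscore does not return all NaN
pct = df[0].pct_change().fillna(0)
# np.abs works whether zscore returns a Series or a NumPy array
out = df[np.abs(stats.zscore(pct)) < 1.5]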
Output:
>>> out
0 1 2
0 0.496714 -0.138264 0.647689
1 1.523030 -0.234153 -0.234137
2 1.579213 0.767435 -0.469474
3 0.542560 -0.463418 -0.465730
4 0.241962 -1.913280 -1.724918
.. ... ... ...
95 -1.952088 -0.151785 0.588317
96 0.280992 -0.622700 -0.208122
97 -0.493001 -0.589365 0.849602
98 0.357015 -0.692910 0.899600
99 0.307300 0.812862 0.629629
[92 rows x 3 columns] # <- 8 rows have been removed
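A possible alternative, if you would rather not treat the first row as a 0% change: drop the NaN instead of filling it, and compute the z-score with pandas alone. This is only a sketch under the same assumed random data. The names keep and out_alt are illustrative, and row 0 is always excluded here because it has no percentage change.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3))

# Drop the undefined first value instead of filling it with 0.
pct = df[0].pct_change().dropna()

# Population z-score; ddof=0 matches the default of scipy.stats.zscore.
z = (pct - pct.mean()) / pct.std(ddof=0)

# Keep only the rows whose percentage change lies within 1.5 standard deviations.
keep = z.abs() < 1.5
out_alt = df.loc[keep[keep].index]

print(len(out_alt))
```

Whether to fill or drop depends on whether the first row should count as an observation. Filling with 0 keeps the row but also feeds an artificial 0 into the mean and standard deviation.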