python pandas outliers z-score

Removing rows that have outliers in a pandas DataFrame using the Z-score method


I am using this code to remove outliers.

import pandas as pd
import numpy as np
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))
df[np.abs(stats.zscore(df[0])) < 1.5]

This works: the resulting data frame has fewer rows. However, I need to remove outliers in the percentage-change values of a similar data frame.

df = df.pct_change()
df.plot.line(subplots=True)

df[np.abs(stats.zscore(df[0])) < 1.5]

This results in an empty data frame. What am I doing wrong? Should the threshold of 1.5 be adjusted? I tried several values, and none of them work.


Solution

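  • The first value of the data frame is NaN after pct_change, because the first row has no previous row to compare against. stats.zscore propagates NaN by default, so every z-score becomes NaN, every comparison with 1.5 is False, and the result is empty. Use fillna to replace the NaN before computing the z-score.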

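    import pandas as pd
    import numpy as np
    from scipy import stats
    
    np.random.seed(42)
    df = pd.DataFrame(np.random.randn(100, 3))
    
    # The first percentage change is NaN; treat it as "no change" (0)
    pct = df[0].pct_change().fillna(0)
    # np.abs works whether zscore returns a Series or a NumPy array
    out = df[np.abs(stats.zscore(pct)) < 1.5]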
    

    Output:

    >>> out
               0         1         2
    0   0.496714 -0.138264  0.647689
    1   1.523030 -0.234153 -0.234137
    2   1.579213  0.767435 -0.469474
    3   0.542560 -0.463418 -0.465730
    4   0.241962 -1.913280 -1.724918
    ..       ...       ...       ...
    95 -1.952088 -0.151785  0.588317
    96  0.280992 -0.622700 -0.208122
    97 -0.493001 -0.589365  0.849602
    98  0.357015 -0.692910  0.899600
    99  0.307300  0.812862  0.629629
    
    [92 rows x 3 columns]  # <- 8 rows have been removed
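
As an alternative to fillna(0), here is a minimal sketch that computes the z-score with pandas alone. It relies on pandas skipping NaN by default in mean() and std(); ddof=0 is chosen to match the default of stats.zscore. The variable names (values, z_numpy, z) are only illustrative. Note that this variant drops the first row instead of keeping it, because a NaN z-score never passes the comparison, so the row count may differ from the output above.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 3))

# The first row has no previous row, so its percentage change is NaN.
pct = df[0].pct_change()

# NumPy's mean/std propagate NaN: a single NaN makes every z-score NaN,
# which is why the original filter returned an empty data frame.
values = pct.to_numpy()
z_numpy = (values - values.mean()) / values.std()
print(np.isnan(z_numpy).all())  # True

# pandas skips NaN by default, so only the first z-score stays NaN.
z = (pct - pct.mean()) / pct.std(ddof=0)

# NaN < 1.5 evaluates to False, so the first row is filtered out.
out = df[z.abs() < 1.5]
```

Whether the first row should be kept (fillna(0)) or dropped (as here) depends on how you want to treat a row that has no percentage change at all.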