python pandas scipy outliers z-score

Replace detected outliers with an equivalent value or multiple values


I'm analyzing a data set consisting of R-R intervals, each a fraction of a second, which together sum to 240 seconds (4 minutes).

There are frequent outliers in this data, which I can detect and replace using the following code:

RRs:

RRs = [1.076,
 0.957,
 0.939,
 0.956,
 0.942,
 0.888,
 0.821,
 0.903,
 0.806,
 0.971,
 0.973,
 0.999,
 0.909,
 0.811,
 0.921,
 0.915,
 0.917,
 0.801,
 0.815,
 0.94,
 0.946,
 0.963,
 0.911,
 0.804,
 1.015,
 0.981,
 0.971,
 0.901,
 0.809,
 1.005,
 0.962,
 0.986,
 0.911,
 0.816,
 0.99,
 0.948,
 0.969,
 0.832,
 0.932,
 0.928,
 0.972,
 0.932,
 0.809,
 0.932,
 0.928,
 0.974,
 0.943,
 0.81,
 0.936,
 0.918,
 0.94,
 0.941,
 0.826,
 0.909,
 0.913,
 0.947,
 0.965,
 0.913,
 0.895,
 0.941,
 0.963,
 0.954,
 0.905,
 0.803,
 0.954,
 0.917,
 0.949,
 0.896,
 0.777,
 0.913,
 0.905,
 0.957,
 0.918,
 0.799,
 0.966,
 0.975,
 1.021,
 0.986,
 0.839,
 0.938,
 0.95,
 0.973,
 0.959,
 0.823,
 0.842,
 0.959,
 0.946,
 0.983,
 0.924,
 0.814,
 1.002,
 0.972,
 0.98,
 0.967,
 0.836,
 0.95,
 0.957,
 1.012,
 0.939,
 0.812,
 0.964,
 0.937,
 0.963,
 0.931,
 0.92,
 0.963,
 0.995,
 0.985,
 0.914,
 0.914,
 0.943,
 0.977,
 0.957,
 0.822,
 0.926,
 0.932,
 0.992,
 1.061,
 0.999,
 0.841,
 0.983,
 0.955,
 0.972,
 0.823,
 0.809,
 0.769,
 0.765,
 0.728,
 0.697,
 0.699,
 0.694,
 0.694,
 0.695,
 0.689,
 0.692,
 0.697,
 0.76,
 0.669,
 0.676,
 0.673,
 0.67,
 0.668,
 0.665,
 0.666,
 0.753,
 0.778,
 8.154,  # outlier
 0.784,
 0.762,
 0.741,
 0.743,
 0.752,
 0.836,
 0.738,
 0.838,
 0.813,
 0.807,
 0.798,
 0.793,
 0.784,
 0.71,
 0.729,
 0.73,
 0.801,
 0.771,
 0.709,
 0.798,
 0.778,
 0.782,
 0.712,
 0.804,
 0.781,
 0.784,
 0.774,
 0.779,
 0.785,
 0.786,
 0.773,
 0.77,
 0.769,
 0.771,
 0.766,
 0.773,
 0.784,
 0.79,
 0.789,
 0.779,
 0.784,
 0.792,
 0.795,
 0.786,
 0.784,
 0.784,
 0.791,
 0.784,
 0.783,
 0.783,
 0.785,
 0.787,
 0.776,
 0.792,
 0.807,
 0.81,
 0.814,
 0.824,
 0.833,
 0.839,
 0.807,
 0.795,
 0.795,
 0.789,
 0.777,
 0.759,
 0.745,
 0.748,
 0.756,
 0.759,
 0.753,
 0.767,
 0.783,
 0.793,
 0.787,
 0.793,
 0.797,
 0.813,
 0.826,
 0.805,
 0.779,
 0.771,
 0.762,
 0.746,
 0.737,
 0.739,
 0.745,
 0.746,
 0.691,
 0.771,
 0.765,
 0.805,
 0.807,
 0.789,
 0.806,
 0.811,
 0.8,
 0.732,
 0.798,
 0.771,
 0.761,
 0.705,
 0.775,
 0.76,
 0.771,
 0.775,
 0.777,
 0.797,
 0.806,
 0.799,
 0.786,
 0.789,
 0.789,
 0.782,
 0.769,
 0.776,
 0.78,
 0.786,
 0.77,
 0.774,
 0.782,
 0.785,
 0.787,
 0.79,
 0.786,
 0.773,
 0.779,
 0.787]
from scipy import stats
import pandas as pd
import numpy as np

df_rrs = pd.DataFrame(RRs, columns=['RRs'])
# flag values more than 1 standard deviation from the mean
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
# mask the outliers and fill the gaps by linear interpolation
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()
df_rrs = df_rrs['RRs'].to_list()

The code does what it's supposed to: it detects the outlier and replaces it with an interpolated value. The problem is that this leaves a gap in the data, since the cumulative sum should be 4 minutes and in the example list of RRs the outlier is 8.154 seconds.

So my code above replaces it with just one value, and the dataset is effectively shortened, missing about 8 seconds.

So instead of replacing the outlier with a single mean/interpolated value of its neighbours, I need to replace it with multiple interpolated values that together sum to 8.154 seconds. Ideally each of these values should be interpolated as well.

What would be the best approach to do this?
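To make the requirement concrete, here's a rough hand-rolled sketch of the kind of transformation I'm after on a toy series (single outlier only, and the detection step is just `argmax` for brevity, not my real z-score check):

```python
import numpy as np

rr = np.array([0.78, 0.77, 8.154, 0.76, 0.74])  # toy series with one outlier

i = int(np.argmax(rr))                # outlier position (toy detection)
typical = np.delete(rr, i).mean()     # mean of the normal beats
n = int(round(rr[i] / typical))       # how many replacement beats fit

# interpolate n values between the neighbours, then rescale so the
# replacements sum to exactly the outlier's duration
fill = np.linspace(rr[i - 1], rr[i + 1], n + 2)[1:-1]
fill *= rr[i] / fill.sum()

rr_fixed = np.concatenate([rr[:i], fill, rr[i + 1:]])
print(np.isclose(rr_fixed.sum(), rr.sum()))  # True: total duration preserved
```

This preserves the total duration but handles only one isolated outlier; I'd like an approach that also copes with consecutive outliers.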


Solution

  • Here's one approach:

    Minimal reproducible example

    from scipy import stats
    import pandas as pd
    import numpy as np
    
    np.random.seed(0) # for reproducibility
    
    RRs = np.random.randint(40, 60, size=(8)).astype("float") / 100
    
    desired_sum = 10 # total seconds
    
    # add 3 outliers at indices 2, 3 and 6
    RRs[[2, 3, 6]] += (desired_sum - RRs.sum())/3
    
    df_rrs = pd.DataFrame(RRs, columns=['RRs'])
    
    print(df_rrs['RRs'].sum())
    # 10.0
    
        RRs
    0  0.52
    1  0.55
    2  2.44 # outlier (consecutive: to be grouped)
    3  2.47 # outlier (consecutive: to be grouped)
    4  0.43
    5  0.47
    6  2.53 # outlier
    7  0.59
    

    Code

    # mask for outliers
    mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
    
    # store outliers
    outliers = df_rrs.loc[mask, 'RRs']
    
    # group consecutive outliers
    group_outliers = (outliers.index.to_series().diff() != 1).cumsum()
    
    # grouper values as first index value per outlier group
    grouper = group_outliers.index.to_series().mask(group_outliers.duplicated()).ffill()
    
    # get sum per group
    outliers_grouped = outliers.groupby(grouper).sum()
    
    # determine rows needed per group
    rows = np.round(outliers_grouped / df_rrs.loc[~mask, 'RRs'].mean())
    
    # isolate index values we no longer want
    outliers_excluded = outliers.index.difference(outliers_grouped.index)
    
    # get index `df_rrs` without index values we no longer want
    reindex = df_rrs.index.difference(outliers_excluded)
    
    # reindex with `np.repeat` to get appropriate repeats for index values in `rows`
    df_rrs = df_rrs.reindex(
        np.repeat(reindex, rows.reindex(reindex, fill_value=1))
        )
    
    # interpolate
    df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()
    
    # scale interpolated values
    df_rrs.loc[mask, 'RRs'] = (
        df_rrs.loc[mask, 'RRs'] * (outliers_grouped 
                                   / df_rrs.loc[mask, 'RRs'].groupby(level=0).sum())
        )
    

    Output:

            RRs
    0  0.520000
    1  0.550000
    2  0.540191 # index value 2, outlier interpolated (grouped)
    2  0.529260
    2  0.518328
    2  0.507397
    2  0.496466
    2  0.485534
    2  0.474603
    2  0.463672
    2  0.452740
    2  0.441809
    4  0.430000
    5  0.470000
    6  0.467811 # index value 6, outlier interpolated
    6  0.486906
    6  0.506000
    6  0.525094
    6  0.544189
    7  0.590000
    

    Sum check with np.isclose:

    np.isclose(df_rrs['RRs'].sum(), desired_sum)
    # True
    

    Note that inherent floating-point precision may cause a slight difference between the sum and desired_sum, so that df_rrs['RRs'].sum() == desired_sum may not always be True (it is here, as it happens). That seems unavoidable.
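A quick illustration of that floating-point caveat, using plain Python floats (unrelated to the RR data):

```python
import numpy as np

total = sum([0.1] * 10)        # ten tenths, accumulated one by one
print(total == 1.0)            # False: rounding error builds up
print(np.isclose(total, 1.0))  # True: equal within tolerance
```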


    Explanation / intermediates

    outliers
    
    2    2.44
    3    2.47
    6    2.53
    Name: RRs, dtype: float64
    
    grouper
    
    2    2.0 # indices 2 and 3 share one group
    3    2.0 # (consecutive outliers)
    6    6.0 # index 6 is its own group
    dtype: float64
    
    outliers.groupby(grouper).sum()
    
    2.0    4.91
    6.0    2.53
    Name: RRs, dtype: float64
    
    rows
    
    2.0    10.0
    6.0     5.0
    Name: RRs, dtype: float64
    
    outliers.index.difference(outliers_grouped.index)
    
    Index([3], dtype='int64')
    
    df_rrs.index # reindexed
    
    Index([0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6, 6, 6, 6, 6, 7], dtype='int64')
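
As a side note, the consecutive-grouping idiom used above, `(diff() != 1).cumsum()`, works on any series of positions: a new group starts wherever the gap between successive positions exceeds 1. A minimal standalone demo:

```python
import pandas as pd

pos = pd.Series([2, 3, 6, 7, 8, 12])  # e.g. outlier positions
groups = (pos.diff() != 1).cumsum()   # group label increments at every gap > 1
print(groups.tolist())                # [1, 1, 2, 2, 2, 3]
```

The first `diff()` value is NaN, which compares unequal to 1, so the first position always opens group 1.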