I'm analyzing a dataset of R-R intervals (fractions of a second) whose cumulative sum is 240 seconds, i.e. 4 minutes.
There are frequent outliers in this data, which I can detect and replace using the following code:
RRs = [1.076,
0.957,
0.939,
0.956,
0.942,
0.888,
0.821,
0.903,
0.806,
0.971,
0.973,
0.999,
0.909,
0.811,
0.921,
0.915,
0.917,
0.801,
0.815,
0.94,
0.946,
0.963,
0.911,
0.804,
1.015,
0.981,
0.971,
0.901,
0.809,
1.005,
0.962,
0.986,
0.911,
0.816,
0.99,
0.948,
0.969,
0.832,
0.932,
0.928,
0.972,
0.932,
0.809,
0.932,
0.928,
0.974,
0.943,
0.81,
0.936,
0.918,
0.94,
0.941,
0.826,
0.909,
0.913,
0.947,
0.965,
0.913,
0.895,
0.941,
0.963,
0.954,
0.905,
0.803,
0.954,
0.917,
0.949,
0.896,
0.777,
0.913,
0.905,
0.957,
0.918,
0.799,
0.966,
0.975,
1.021,
0.986,
0.839,
0.938,
0.95,
0.973,
0.959,
0.823,
0.842,
0.959,
0.946,
0.983,
0.924,
0.814,
1.002,
0.972,
0.98,
0.967,
0.836,
0.95,
0.957,
1.012,
0.939,
0.812,
0.964,
0.937,
0.963,
0.931,
0.92,
0.963,
0.995,
0.985,
0.914,
0.914,
0.943,
0.977,
0.957,
0.822,
0.926,
0.932,
0.992,
1.061,
0.999,
0.841,
0.983,
0.955,
0.972,
0.823,
0.809,
0.769,
0.765,
0.728,
0.697,
0.699,
0.694,
0.694,
0.695,
0.689,
0.692,
0.697,
0.76,
0.669,
0.676,
0.673,
0.67,
0.668,
0.665,
0.666,
0.753,
0.778,
8.154,  # <-- outlier
0.784,
0.762,
0.741,
0.743,
0.752,
0.836,
0.738,
0.838,
0.813,
0.807,
0.798,
0.793,
0.784,
0.71,
0.729,
0.73,
0.801,
0.771,
0.709,
0.798,
0.778,
0.782,
0.712,
0.804,
0.781,
0.784,
0.774,
0.779,
0.785,
0.786,
0.773,
0.77,
0.769,
0.771,
0.766,
0.773,
0.784,
0.79,
0.789,
0.779,
0.784,
0.792,
0.795,
0.786,
0.784,
0.784,
0.791,
0.784,
0.783,
0.783,
0.785,
0.787,
0.776,
0.792,
0.807,
0.81,
0.814,
0.824,
0.833,
0.839,
0.807,
0.795,
0.795,
0.789,
0.777,
0.759,
0.745,
0.748,
0.756,
0.759,
0.753,
0.767,
0.783,
0.793,
0.787,
0.793,
0.797,
0.813,
0.826,
0.805,
0.779,
0.771,
0.762,
0.746,
0.737,
0.739,
0.745,
0.746,
0.691,
0.771,
0.765,
0.805,
0.807,
0.789,
0.806,
0.811,
0.8,
0.732,
0.798,
0.771,
0.761,
0.705,
0.775,
0.76,
0.771,
0.775,
0.777,
0.797,
0.806,
0.799,
0.786,
0.789,
0.789,
0.782,
0.769,
0.776,
0.78,
0.786,
0.77,
0.774,
0.782,
0.785,
0.787,
0.79,
0.786,
0.773,
0.779,
0.787]
from scipy import stats
import pandas as pd
import numpy as np

df_rrs = pd.DataFrame(RRs, columns=['RRs'])
# flag values more than one standard deviation from the mean
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
# replace flagged values with a linear interpolation of their neighbours
df_rrs.RRs = df_rrs.RRs.mask(mask).interpolate()
df_rrs = df_rrs['RRs'].to_list()
The code does what it is supposed to: it detects the outlier and replaces it with an interpolated value. The problem is that this leaves a gap in the data, since the cumulative sum should be 4 minutes and the outlier in the example list is 8.154 seconds.
So my code above replaces the outlier with a single value, and the dataset is effectively shortened and missing roughly 8 seconds.
Instead of replacing it with one mean/interpolated value, I need to replace it with multiple interpolated values that sum to 8.154 seconds in total. Ideally each of these values should be interpolated as well.
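To make the shortfall concrete (a minimal sketch; 0.78 s is an assumed stand-in for the local mean interval, not a computed value):

```python
# one 8.154 s outlier replaced by a single ~typical interval
outlier = 8.154
typical = 0.78  # assumed local mean of the neighbouring R-R intervals

# seconds missing from the recording after a one-for-one replacement
print(round(outlier - typical, 3))  # 7.374
```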
What would be the best approach to do this?
Here's one approach:
from scipy import stats
import pandas as pd
import numpy as np
np.random.seed(0) # for reproducibility
RRs = np.random.randint(40, 60, size=(8)).astype("float") / 100
desired_sum = 10 # total seconds
# add 3 outliers at indices 2, 3, 6 so the total equals desired_sum
RRs[[2, 3, 6]] += (desired_sum - RRs.sum())/3
df_rrs = pd.DataFrame(RRs, columns=['RRs'])
print(df_rrs['RRs'].sum())
# 10.0
RRs
0 0.52
1 0.55
2 2.44 # outlier (consecutive: to be grouped)
3 2.47 # outlier (consecutive: to be grouped)
4 0.43
5 0.47
6 2.53 # outlier
7 0.59
Code
# mask for outliers
mask = (np.abs(stats.zscore(df_rrs['RRs'])) > 1)
# store outliers
outliers = df_rrs.loc[mask, 'RRs']
# group consecutive outliers
group_outliers = (outliers.index.to_series().diff() != 1).cumsum()
# grouper values as first index value per outlier group
grouper = group_outliers.index.to_series().mask(group_outliers.duplicated()).ffill()
# get sum per group
outliers_grouped = outliers.groupby(grouper).sum()
# determine rows needed per group (integer repeats for np.repeat)
rows = np.round(outliers_grouped / df_rrs.loc[~mask, 'RRs'].mean()).astype(int)
# isolate index values we no longer want
outliers_excluded = outliers.index.difference(outliers_grouped.index)
# get index `df_rrs` without index values we no longer want
reindex = df_rrs.index.difference(outliers_excluded)
# reindex with `np.repeat` to get appropriate repeats for index values in `rows`
df_rrs = df_rrs.reindex(
np.repeat(reindex, rows.reindex(reindex, fill_value=1))
)
# interpolate
df_rrs['RRs'] = df_rrs['RRs'].mask(mask).interpolate()
# scale interpolated values
df_rrs.loc[mask, 'RRs'] = (
df_rrs.loc[mask, 'RRs'] * (outliers_grouped
/ df_rrs.loc[mask, 'RRs'].groupby(level=0).sum())
)
Output:
RRs
0 0.520000
1 0.550000
2 0.540191 # index value 2, outlier interpolated (grouped)
2 0.529260
2 0.518328
2 0.507397
2 0.496466
2 0.485534
2 0.474603
2 0.463672
2 0.452740
2 0.441809
4 0.430000
5 0.470000
6 0.467811 # index value 6, outlier interpolated
6 0.486906
6 0.506000
6 0.525094
6 0.544189
7 0.590000
Sum check with np.isclose:

np.isclose(df_rrs['RRs'].sum(), desired_sum)
# True

Note that inherent float precision issues may cause a slight difference between the sum and desired_sum, so df_rrs['RRs'].sum() == desired_sum may not always be True (it is here, incidentally). That seems unavoidable.
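A classic illustration of why an exact == comparison is fragile here:

```python
import numpy as np

# ten additions of 0.1 do not land exactly on 1.0 in binary floating point
total = sum([0.1] * 10)
print(total == 1.0)            # False
print(np.isclose(total, 1.0))  # True
```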
Explanation / intermediates

First, store the outliers flagged by the z-score mask:

outliers
2    2.44
3    2.47
6    2.53
Name: RRs, dtype: float64

Consecutive outliers (here: 2, 3) need to be treated as one group. Applying index.to_series + Series.diff, checking where the difference is unequal to 1, and taking Series.cumsum gives the group labels, stored as group_outliers.

To build the grouper: again index.to_series + Series.mask, passing Series.duplicated, and Series.ffill on the result. Then Series.groupby to get the sum per group, stored here as outliers_grouped.

grouper
2    2.0  # one group
3    2.0  # one group
6    6.0
dtype: float64

outliers.groupby(grouper).sum()
2.0    4.91
6.0    2.53
Name: RRs, dtype: float64

Determine the rows needed per group: np.round of outliers_grouped divided by the mean of the correct values.

rows
2.0    10.0
6.0     5.0
Name: RRs, dtype: float64

Now reindex with df.reindex. First exclude the index values that we cut out in the grouping process: index.difference with outliers.index and outliers_grouped.index.

outliers.index.difference(outliers_grouped.index)
Index([3], dtype='int64')

Next, take index.difference between df_rrs.index and outliers_excluded, excluding in this case 3, and pass the result to np.repeat to get 10 and 5 rows for the index values in rows and 1 for the others (fill_value=1).

df_rrs.index  # reindexed
Index([0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 6, 6, 6, 6, 6, 7], dtype='int64')

Interpolate the masked values with Series.interpolate.

Finally, scale the interpolated values so that each group sums to its original outlier total, and hence the overall sum matches desired_sum. For this we need another groupby on the interpolated values to get the sum per group as input for: interpolated values * (outliers_grouped / sum grouped interpolated values).
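The consecutive-grouping step is the least obvious part, so here it is in isolation (a minimal sketch with made-up outlier positions):

```python
import pandas as pd

# hypothetical outlier positions: 2 and 3 are consecutive, 6 stands alone
idx = pd.Index([2, 3, 6]).to_series()

# a new group starts wherever the gap to the previous position is not 1
# (the first diff is NaN, which compares unequal to 1, opening group 1)
group_ids = (idx.diff() != 1).cumsum()
print(group_ids.tolist())  # [1, 1, 2]
```

Positions 2 and 3 share group label 1, while 6 gets its own label 2, which is exactly what groupby needs to sum each run of consecutive outliers.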