I have time series data for physical activities. The data was recorded at 50 Hz, but I want to downsample it to 20 Hz because I want to train my model and predict at 20 Hz.
Is there an efficient way to do that in Python? I've heard of pandas' resample function but don't know exactly how to use it efficiently for my problem. Any piece of code would be really helpful.
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
1613977400899 2021-02-22T12:03:20.899 0 -0.336 0.886 0.649
1613977400920 2021-02-22T12:03:20.920 0.021 -0.233 0.799 0.648
1613977400940 2021-02-22T12:03:20.940 0.041 -0.173 0.771 0.629
1613977400961 2021-02-22T12:03:20.961 0.062 -0.132 0.757 0.596
1613977400981 2021-02-22T12:03:20.981 0.082 -0.113 0.724 0.57
1613977401002 2021-02-22T12:03:21.002 0.103 -0.127 0.713 0.538
1613977401021 2021-02-22T12:03:21.021 0.122 -0.175 0.743 0.488
1613977401041 2021-02-22T12:03:21.041 0.142 -0.266 0.775 0.417
1613977401062 2021-02-22T12:03:21.062 0.163 -0.281 0.774 0.402
1613977401082 2021-02-22T12:03:21.082 0.183 -0.212 0.713 0.427
1613977401103 2021-02-22T12:03:21.103 0.204 -0.17 0.649 0.46
1613977401123 2021-02-22T12:03:21.123 0.224 -0.204 0.649 0.524
1613977401144 2021-02-22T12:03:21.144 0.245 -0.313 0.684 0.658
1613977401164 2021-02-22T12:03:21.164 0.265 -0.415 0.727 0.785
1613977401183 2021-02-22T12:03:21.183 0.284 -0.419 0.726 0.82
A main issue here is that your original sampling interval is only “roughly” 20ms (50Hz), not exactly. So we’ll need to resample in 2 steps: first upsample to a regular 1ms grid with interpolation, then downsample to 50ms (20Hz).
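First let’s build a time index. Here you have the information twice, so either of these will work (the epoch is parsed as UTC, so the two indexes differ by a constant clock offset, which does not affect the resampling):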
>>> df = df.set_index(df['epoch (ms)'].astype('datetime64[ms]'))
>>> df = df.set_index(pd.to_datetime(df['time (10:00)']))
>>> df
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.899 1613977400899 2021-02-22T12:03:20.899 0.000 -0.336 0.886 0.649
2021-02-22 12:03:20.920 1613977400920 2021-02-22T12:03:20.920 0.021 -0.233 0.799 0.648
2021-02-22 12:03:20.940 1613977400940 2021-02-22T12:03:20.940 0.041 -0.173 0.771 0.629
2021-02-22 12:03:20.961 1613977400961 2021-02-22T12:03:20.961 0.062 -0.132 0.757 0.596
2021-02-22 12:03:20.981 1613977400981 2021-02-22T12:03:20.981 0.082 -0.113 0.724 0.570
2021-02-22 12:03:21.002 1613977401002 2021-02-22T12:03:21.002 0.103 -0.127 0.713 0.538
2021-02-22 12:03:21.021 1613977401021 2021-02-22T12:03:21.021 0.122 -0.175 0.743 0.488
2021-02-22 12:03:21.041 1613977401041 2021-02-22T12:03:21.041 0.142 -0.266 0.775 0.417
2021-02-22 12:03:21.062 1613977401062 2021-02-22T12:03:21.062 0.163 -0.281 0.774 0.402
2021-02-22 12:03:21.082 1613977401082 2021-02-22T12:03:21.082 0.183 -0.212 0.713 0.427
2021-02-22 12:03:21.103 1613977401103 2021-02-22T12:03:21.103 0.204 -0.170 0.649 0.460
2021-02-22 12:03:21.123 1613977401123 2021-02-22T12:03:21.123 0.224 -0.204 0.649 0.524
2021-02-22 12:03:21.144 1613977401144 2021-02-22T12:03:21.144 0.245 -0.313 0.684 0.658
2021-02-22 12:03:21.164 1613977401164 2021-02-22T12:03:21.164 0.265 -0.415 0.727 0.785
2021-02-22 12:03:21.183 1613977401183 2021-02-22T12:03:21.183 0.284 -0.419 0.726 0.820
(Now we don’t really need the epoch and time columns any more, as the info is in the index.)
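As a minimal sketch of this step, assuming the column names from the question and using only the first three rows of the sample data, the index can be built from the time column and the two redundant columns dropped. The names `index` and `signals` are just illustrative.

```python
import pandas as pd

# First three rows of the sample data from the question.
df = pd.DataFrame({
    "epoch (ms)": [1613977400899, 1613977400920, 1613977400940],
    "time (10:00)": ["2021-02-22T12:03:20.899",
                     "2021-02-22T12:03:20.920",
                     "2021-02-22T12:03:20.940"],
    "elapsed (s)": [0.0, 0.021, 0.041],
    "x-axis (g)": [-0.336, -0.233, -0.173],
    "y-axis (g)": [0.886, 0.799, 0.771],
    "z-axis (g)": [0.649, 0.648, 0.629],
})

# Parse the ISO timestamps into a datetime index.
index = pd.to_datetime(df["time (10:00)"]).rename("time")

# Drop the columns that only duplicate the index, keeping the numeric signals.
signals = df.drop(columns=["epoch (ms)", "time (10:00)"]).set_index(index)
```

Keeping only numeric columns also sidesteps a practical problem: some pandas versions warn or raise when asked to interpolate a string column.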
Now we can do the resampling:
>>> df.resample('1ms').interpolate().resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.155429 0.765000 0.614857
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.125000 0.714571 0.542571
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.271714 0.774619 0.411286
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.178000 0.661190 0.453714
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.338500 0.694750 0.689750
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000
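The same two steps can be written out separately, which makes it easier to see what each one does. This is a sketch on the x-axis column only, with the index rebuilt from the elapsed milliseconds in the question; `dense` and `resampled` are illustrative names.

```python
import pandas as pd

# Sample data from the question: elapsed milliseconds and x-axis readings.
elapsed_ms = [0, 21, 41, 62, 82, 103, 122, 142, 163, 183, 204, 224, 245, 265, 284]
x = [-0.336, -0.233, -0.173, -0.132, -0.113, -0.127, -0.175, -0.266,
     -0.281, -0.212, -0.170, -0.204, -0.313, -0.415, -0.419]

start = pd.Timestamp("2021-02-22 12:03:20.899")
index = start + pd.to_timedelta(elapsed_ms, unit="ms")
signals = pd.DataFrame({"x-axis (g)": x}, index=index)

# Step 1: upsample to a regular 1 ms grid and fill the gaps by interpolation.
dense = signals.resample("1ms").interpolate(method="linear")

# Step 2: downsample to 50 ms (20 Hz), keeping the last 1 ms sample of each bin.
resampled = dense.resample("50ms").last()
print(resampled)
```

One thing to keep in mind when reading the result: each row is labelled with the left edge of its 50ms bin, but .last() takes the value from the end of the bin. The row labelled 12:03:20.900 therefore holds the value interpolated at 12:03:20.949, which is why its elapsed time shows 0.050 in the output above.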
Note that you can do different types of interpolation by specifying the method argument you pass to .interpolate(). See the doc on this:
method : str, default ‘linear’
Interpolation technique to use. One of:
- ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
- ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
- ‘index’, ‘values’: use the actual numerical values of the index.
- ‘pad’: Fill in NaNs using existing values.
- ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5).
- ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.
- ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
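You can see slight differences in the coordinates depending on the method, so it is up to you to pick the right one for your data. Here ‘time’ gives the same result as the default ‘linear’, because the 1ms grid is already evenly spaced, while ‘cubic’ differs slightly: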
>>> df.resample('1ms').interpolate('time').resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.155429 0.765000 0.614857
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.125000 0.714571 0.542571
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.271714 0.774619 0.411286
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.178000 0.661190 0.453714
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.338500 0.694750 0.689750
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000
>>> df.resample('1ms').interpolate('cubic').resample('50ms').last()
epoch (ms) time (10:00) elapsed (s) x-axis (g) y-axis (g) z-axis (g)
time (10:00)
2021-02-22 12:03:20.850 1.613977e+12 2021-02-22T12:03:20.899 0.000 -0.336000 0.886000 0.649000
2021-02-22 12:03:20.900 1.613977e+12 2021-02-22T12:03:20.940 0.050 -0.153162 0.766266 0.615219
2021-02-22 12:03:20.950 1.613977e+12 2021-02-22T12:03:20.981 0.100 -0.122950 0.711454 0.543581
2021-02-22 12:03:21.000 1.613977e+12 2021-02-22T12:03:21.041 0.150 -0.285487 0.781273 0.403123
2021-02-22 12:03:21.050 1.613977e+12 2021-02-22T12:03:21.082 0.200 -0.172478 0.656944 0.452494
2021-02-22 12:03:21.100 1.613977e+12 2021-02-22T12:03:21.144 0.250 -0.342439 0.695493 0.693425
2021-02-22 12:03:21.150 1.613977e+12 2021-02-22T12:03:21.183 0.284 -0.419000 0.726000 0.820000