I have a time series of intraday tick-by-tick stock prices that change gradually over time. Whenever there is a small change (e.g. the price increases by $0.01), a new row of data is created. This leads to a very large series which is slow to plot. I want to downsample so that small oscillations (e.g. the price goes up/down/up/down/up/down and is unchanged after 50 rows of data) are ignored, which improves plotting speed without sacrificing the qualitative accuracy of the graph. I only want to sample when the price moves consistently in one direction (up/up/up/up), so that only obvious changes are displayed.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 1)) / 100 + 1000,
                  columns=['prices'])
I wish to sample whenever the difference from the previous sample exceeds some threshold. So, I will sample row 0 by default. If rows 1, 2, 3 and 4 are too close to row 0, I want to throw them away. Then, if row 5 is sufficiently far from row 0, I will sample it. Row 5 then becomes my new anchor point, and I repeat the same process from there.
Is there a way to do this, ideally without a loop?
You could apply a down-sampling masking function that checks whether the accumulated distance has been exceeded, then use that mask to select the applicable rows.
Here is the down-sampling masking function:
def down_mask(x, max_dist=3):
    global cum_diff
    # the first diff is NaN (x != x), so row 0 is always kept
    if x != x:
        return True
    # accumulate the price change since the last kept row
    cum_diff += x
    # once the cumulative move exceeds the threshold, keep this row
    # and reset the accumulator (this row becomes the new anchor)
    if abs(cum_diff) > max_dist:
        cum_diff = 0
        return True
    return False
Then apply it and use it as a mask to get the entries that you want:
cum_diff = 0  # reset the global accumulator before (re)applying the mask
df[df['prices'].diff().apply(down_mask, max_dist=5)]
prices
0 1002.07
1 1007.37
2 1000.09
6 1008.08
10 1001.57
14 1006.74
18 1000.42
19 1006.98
21 1001.30
26 1008.89
28 1003.77
38 1009.04
40 1000.52
44 1007.06
47 1001.21
48 1009.38
49 1001.81
51 1008.64
52 1002.72
55 1008.84
56 1000.86
57 1007.17
67 1001.31
68 1006.33
79 1001.14
98 1009.74
99 1000.53