I have created a pandas dataframe as follows:
import pandas as pd
import numpy as np
ds = { 'trend' : [1,1,1,1,2,2,3,3,3,3,3,3,4,4,4,4,4], 'price' : [23,43,56,21,43,55,54,32,9,12,11,12,23,3,2,1,1]}
df = pd.DataFrame(data=ds)
The dataframe looks as follows:
display(df)
trend price
0 1 23
1 1 43
2 1 56
3 1 21
4 2 43
5 2 55
6 3 54
7 3 32
8 3 9
9 3 12
10 3 11
11 3 12
12 4 23
13 4 3
14 4 2
15 4 1
16 4 1
I have saved the dataframe to a .csv file called df.csv:
df.to_csv("df.csv", index = False)
I need to create a new field called ema2
which:
iterates through each and every record of the dataframe
calculates the Exponential Moving Average (EMA) by considering the price observed at each iteration and the prices (EMA length is 2 in this example) observed in the previous trends. For example:
I iterate at record 0 and the EMA is NaN (missing).
I iterate at record 1 and the EMA is still NaN (missing)
I Iterate at record 12 and the EMA is 24.20 (it considers price at record 3, price at record 5 and price at record 12
I Iterate at record 13 and the EMA is 13.53 (it considers price at record 3, price at record 5 and price at record 13
I Iterate at record 15 and the EMA is 12.46 (it considers price at record 3, price at record 5 and price at record 15 and so on .....
I have written the following code:
time_window = 2
ema= []
for i in range(len(df)):
ds = pd.read_csv("df.csv", nrows=i+1)
d = ds.groupby(['trend'], as_index=False).agg(
{'price':'last'})
d['ema2'] = d['price'].ewm(com=time_window - 1, min_periods=time_window).mean()
ema.append(d['ema2'].iloc[-1])
df['ema2'] = ema
Which produces the correct dataframe:
print(df)
trend price ema2
0 1 23 NaN
1 1 43 NaN
2 1 56 NaN
3 1 21 NaN
4 2 43 35.666667
5 2 55 43.666667
6 3 54 49.571429
7 3 32 37.000000
8 3 9 23.857143
9 3 12 25.571429
10 3 11 25.000000
11 3 12 25.571429
12 4 23 24.200000
13 4 3 13.533333
14 4 2 13.000000
15 4 1 12.466667
16 4 1 12.466667
The problem is that when the dataframe has millions of records: it takes a very long time to run.
Does anyone know how to get the same results in a quick, efficient way, please?
I slightly changed the example that you asked about RSI. I added -1
in the first prev, the cycle of filling by slices, price and in setting the values by slice of the data frame.
You can also try numba, cython, but most likely the code will need to be rewritten(not all functions in them are available from numpy, I don't know about pandas).
trends = df["trend"].unique()
arr = df['price'].values
range_group = np.stack(
[df[df["trend"] == trend].index.values.take([0, -1]) for trend in trends]
)
price = np.full((len(df), trends.size), np.nan)
prev = arr[range_group[:time_window-1, 1]]
for i in range(time_window-1, len(trends)):
stop = range_group[i, 1] + 1
price[range_group[i, 0]:stop, -1] = arr[range_group[i, 0]:stop]
price[range_group[i, 0]:stop, -(prev.size+1):-1] = prev
prev = price[range_group[i, 1], -(prev.size+1):]
price = price[range_group[time_window-1, 0]:]
val = (pd.DataFrame(price).T.ewm(com=time_window - 1,
min_periods=time_window).mean().iloc[-1].values)
df.loc[range_group[time_window-1, 0]:, 'ema'] = val