pythonpandasdataframeiteratorcalculated-columns

How to calculate the Exponential Moving Average (EMA) through record iterations in pandas dataframe


I have created a pandas dataframe as follows:

import pandas as pd
import numpy as np
    
ds = { 'trend' : [1,1,1,1,2,2,3,3,3,3,3,3,4,4,4,4,4], 'price' : [23,43,56,21,43,55,54,32,9,12,11,12,23,3,2,1,1]}

df = pd.DataFrame(data=ds)

The dataframe looks as follows:

display(df)

   trend    price
0   1        23
1   1        43
2   1        56
3   1        21
4   2        43
5   2        55
6   3        54
7   3        32 
8   3         9
9   3        12
10  3        11
11  3        12
12  4        23
13  4         3
14  4         2
15  4         1
16  4         1

I have saved the dataframe to a .csv file called df.csv:

df.to_csv("df.csv", index = False)

I need to create a new field called ema2 which:

  1. iterates through each and every record of the dataframe

  2. calculates the Exponential Moving Average (EMA) by considering the price observed at each iteration and the prices (EMA length is 2 in this example) observed in the previous trends. For example:

  3. I iterate at record 0 and the EMA is NaN (missing).

  4. I iterate at record 1 and the EMA is still NaN (missing)

  5. I Iterate at record 12 and the EMA is 24.20 (it considers price at record 3, price at record 5 and price at record 12

  6. I Iterate at record 13 and the EMA is 13.53 (it considers price at record 3, price at record 5 and price at record 13

  7. I Iterate at record 15 and the EMA is 12.46 (it considers price at record 3, price at record 5 and price at record 15 and so on .....

I have written the following code:

time_window = 2

ema= []

for i in range(len(df)):

    ds = pd.read_csv("df.csv", nrows=i+1)
    d = ds.groupby(['trend'], as_index=False).agg(
                                                    {'price':'last'})

    d['ema2'] = d['price'].ewm(com=time_window - 1, min_periods=time_window).mean()    
    ema.append(d['ema2'].iloc[-1])

df['ema2'] = ema

Which produces the correct dataframe:

print(df)

    trend  price       ema2
0       1     23        NaN
1       1     43        NaN
2       1     56        NaN
3       1     21        NaN
4       2     43  35.666667
5       2     55  43.666667
6       3     54  49.571429
7       3     32  37.000000
8       3      9  23.857143
9       3     12  25.571429
10      3     11  25.000000
11      3     12  25.571429
12      4     23  24.200000
13      4      3  13.533333
14      4      2  13.000000
15      4      1  12.466667
16      4      1  12.466667

The problem is that when the dataframe has millions of records: it takes a very long time to run.

Does anyone know how to get the same results in a quick, efficient way, please?


Solution

  • I slightly changed the example that you asked about RSI. I added -1 in the first prev, the cycle of filling by slices, price and in setting the values ​​by slice of the data frame.

    You can also try numba, cython, but most likely the code will need to be rewritten(not all functions in them are available from numpy, I don't know about pandas).

    trends = df["trend"].unique()
    arr = df['price'].values
    range_group = np.stack(
        [df[df["trend"] == trend].index.values.take([0, -1]) for trend in trends]
    )
    price =  np.full((len(df), trends.size), np.nan)
    prev = arr[range_group[:time_window-1, 1]]
    
    for i in range(time_window-1, len(trends)):
        stop = range_group[i, 1] + 1
        price[range_group[i, 0]:stop, -1] = arr[range_group[i, 0]:stop]
        price[range_group[i, 0]:stop, -(prev.size+1):-1] = prev
        prev = price[range_group[i, 1], -(prev.size+1):]
        
    price = price[range_group[time_window-1, 0]:]
    val = (pd.DataFrame(price).T.ewm(com=time_window - 1,
                                    min_periods=time_window).mean().iloc[-1].values)
    
    df.loc[range_group[time_window-1, 0]:, 'ema'] = val