python pandas dataframe technical-indicator

Calculate the Relative Volume Ratio indicator in a pandas DataFrame and add the value as a new column


I know there have been a few posts on this, but my case is a little bit different and I wanted to get some help on this.

I have a pandas dataframe symbol_df with 1 min bars in the below format for each stock symbol:

id  Symbol_id                Date     Open     High      Low    Close  Volume
1          1 2023-12-13 09:15:00  4730.95  4744.00  4713.95  4696.40    2300
2          1 2023-12-13 09:16:00  4713.20  4723.70  4717.85  4702.55    1522
3          1 2023-12-13 09:17:00  4716.40  4718.55  4701.00  4701.00     909
4          1 2023-12-13 09:18:00  4700.15  4702.80  4696.70  4696.00     715
5          1 2023-12-13 09:19:00  4696.70  4709.90  4702.00  4696.10     895
...        ...                 ...      ...      ...      ...      ...     ...
108001     1 2024-03-27 13:44:00  6289.95  6291.95  6289.00  6287.55     989
108002     1 2024-03-27 13:45:00  6288.95  6290.85  6289.00  6287.75     286
108003     1 2024-03-27 13:46:00  6291.25  6293.60  6292.05  6289.10    1433
108004     1 2024-03-27 13:47:00  6295.00  6299.00  6293.20  6293.15    2702
108005     1 2024-03-27 13:48:00  6292.05  6296.55  6291.95  6291.95     983

I would like to calculate the "Relative Volume Ratio" indicator and add this calculated value to the symbol_df as a new column on a rolling basis.

The "Relative Volume Ratio" indicator is calculated as follows:

Today's volume so far is compared with the mean volume of the same period over the last 10 days. To get the ratio value, we simply divide the "today so far" volume by the mean volume of the same period over the last 10 days.

For example, suppose the current bar time is 13:48.

cumulativeVolumeOfToday = volume of the 1-minute bars between 00:00 and 13:48 today, added up

averageVolumeOfPrevious10DaysOfSamePeriod = average of the cumulative volume over the same period (00:00 - 13:48) across the last 10 days

relativeVolumeRatio = cumulativeVolumeOfToday / averageVolumeOfPrevious10DaysOfSamePeriod

Add this value as a new column to the dataframe.
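To make the arithmetic concrete, here is a tiny worked example with made-up numbers (the volumes below are hypothetical, not taken from the data above):

```python
# Hypothetical volumes for illustration only
cumulative_volume_today = 12_000      # volume of all 1-minute bars from 00:00 to 13:48 today
average_volume_same_period = 10_000   # mean cumulative volume over 00:00-13:48 of the last 10 days

relative_volume_ratio = cumulative_volume_today / average_volume_same_period
print(relative_volume_ratio)  # 1.2
```

A ratio above 1 means today is trading heavier than the recent average for this time of day; below 1 means lighter.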

Sample data download for the test case:

import pandas as pd
import yfinance as yf  # pip install yfinance

symbol_df = yf.download(tickers="AAPL", period="7d", interval="1m")["Volume"]
symbol_df = symbol_df.reset_index()
symbol_df = symbol_df.rename(columns={'Datetime': 'Date'})
# yfinance only provides 7 days of 1-minute data, so use a 5-day mean in the calculations

How can I do this in Pandas?


Solution

  • TL;DR

    from yfinance import download
    
    # Prepare data similar to the original
    symbol_df = (
        download(tickers="AAPL", period="7d", interval="1m")
        .rename_axis(index='Date')
        .reset_index()
    )
    
    # Calculate Relative Volume Ratio
    volume = symbol_df.set_index('Date')['Volume']
    dts = volume.index
    cum_volume = volume.groupby(dts.date, sort=False).cumsum()
    prev_mean = lambda days: (
        cum_volume
        .groupby(dts.time, sort=False)
        .rolling(days, closed='left')
        .mean()
        .reset_index(0, drop=True)    # drop the level with dts.time
    )
    rvr = cum_volume / prev_mean(5)
    
    # Assign the output to the initial data
    symbol_df = symbol_df.join(rvr.rename('Relative volume ratio'), on='Date')
    

    Explanation

    Based on the provided description, you need to perform several transformations on the aggregated data: first, cumulatively sum the volume within each day; then run an N-day window (10 in the question, 5 for the sample data) over the data grouped by time of day to calculate the average; and finally, divide the former by the latter.

    Let's say you have the following test data, where "Date" is a column of type datetime:

    from yfinance import download
    
    symbol_df = (
        download(tickers="AAPL", period="7d", interval="1m")
        .rename_axis(index='Date')
        .reset_index()
    )
    

    To calculate the Relative Volume Ratio values, we will use "Volume" as a separate series with the "Date" timestamps as its index:

    volume = symbol_df.set_index('Date')['Volume']
    dts = volume.index    # date-time stamps for convenient grouping
    

    Let's create a sequence of cumulative volumes for each day. For this, we group volume by its date (year, month and day, with no time component) and apply cumsum to each group (sort=False is passed in hopes of speeding up the calculation):

    cum_volume = volume.groupby(dts.date, sort=False).cumsum()
    
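On a tiny hand-made series (timestamps and volumes below are made up), this per-day cumulative sum looks like this:

```python
import pandas as pd

# Two days, two 1-minute bars each (hypothetical values)
idx = pd.to_datetime([
    "2024-03-26 09:15", "2024-03-26 09:16",
    "2024-03-27 09:15", "2024-03-27 09:16",
])
volume = pd.Series([100, 50, 200, 25], index=idx, name="Volume")

# cumsum restarts at the beginning of each calendar day
cum_volume = volume.groupby(volume.index.date, sort=False).cumsum()
print(cum_volume.tolist())  # [100, 150, 200, 225]
```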

    To calculate the mean of the cumulative volumes at the same time of day over the given number of previous days, we group cum_volume by its time (hours and minutes, with no year, month or day values) and apply a rolling calculation to each group to obtain averages over windows. Note that here the source data must be sorted by timestamp, since only business days are taken into account and we can't use a non-fixed frequency such as "10B" as the window value. To calculate means for exactly the previous days, excluding the current one, we pass closed='left' (see the DataFrameGroupBy.rolling docs for details):

    prev_mean = lambda days: (
        cum_volume
        .groupby(dts.time, sort=False)
        .rolling(days, closed='left')
        .mean()
        .reset_index(0, drop=True)
    )
    
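To see what closed='left' does, here is the same rolling mean applied to a single hypothetical time-of-day group (the code above applies it to every group at once):

```python
import pandas as pd

# Hypothetical cumulative volumes at 09:16 on three consecutive days
cum_at_0916 = pd.Series(
    [150.0, 225.0, 300.0],
    index=pd.to_datetime(
        ["2024-03-25 09:16", "2024-03-26 09:16", "2024-03-27 09:16"]
    ),
)

# closed='left' excludes the current row, so each window covers only the
# *previous* two days (closed with fixed windows requires pandas >= 1.2)
prev2 = cum_at_0916.rolling(2, closed='left').mean()
print(prev2.tolist())  # [nan, nan, 187.5]
```

The first two rows are NaN because fewer than two previous days exist; the last row is the mean of 150.0 and 225.0.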

    Now the final touch, with a window of 5 days:

    rvr = cum_volume / prev_mean(5)
    
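Putting the pieces together on a tiny synthetic dataset (three days, two bars per day, all values made up, and a window of 2 previous days instead of 10) shows that the index alignment works out:

```python
import pandas as pd

# Synthetic data: three days, two 1-minute bars per day
idx = pd.to_datetime([
    "2024-03-25 09:15", "2024-03-25 09:16",
    "2024-03-26 09:15", "2024-03-26 09:16",
    "2024-03-27 09:15", "2024-03-27 09:16",
])
volume = pd.Series([100, 50, 200, 100, 150, 150], index=idx, name="Volume")

dts = volume.index
cum_volume = volume.groupby(dts.date, sort=False).cumsum()
prev_mean = (
    cum_volume
    .groupby(dts.time, sort=False)
    .rolling(2, closed='left')     # 2 previous days instead of 10
    .mean()
    .reset_index(0, drop=True)
)
rvr = cum_volume / prev_mean

# Last day: 150 / mean(100, 200) = 1.0 and 300 / mean(150, 300) = 1.333...
print(rvr.loc["2024-03-27"].tolist())
```

The first two days come out as NaN, since there are not yet enough previous days to fill the window.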

    Comparison

    Compared to Andrei Kesely's solution, this one wins on speed (on an Intel Core i3-2100, for example, processing the data offered there takes over 1 minute versus 300-400 ms with the code above). The results are identical for timestamps after the first 10 days. At the beginning, however, when there are fewer than 10 previous days, the rolling mean is calculated as if there were always 10 items (missing values become NaN), whereas Kesely's solution averages only the cumulative volumes that are actually available.