I know there have been a few posts on this, but my case is a little bit different and I wanted to get some help on this.
I have a pandas dataframe symbol_df with 1 min bars in the below format for each stock symbol:
id Symbol_id Date Open High Low Close Volume
1 1 2023-12-13 09:15:00 4730.95 4744.00 4713.95 4696.40 2300
2 1 2023-12-13 09:16:00 4713.20 4723.70 4717.85 4702.55 1522
3 1 2023-12-13 09:17:00 4716.40 4718.55 4701.00 4701.00 909
4 1 2023-12-13 09:18:00 4700.15 4702.80 4696.70 4696.00 715
5 1 2023-12-13 09:19:00 4696.70 4709.90 4702.00 4696.10 895
... ... ... ... ... ... ... ...
108001 1 2024-03-27 13:44:00 6289.95 6291.95 6289.00 6287.55 989
108002 1 2024-03-27 13:45:00 6288.95 6290.85 6289.00 6287.75 286
108003 1 2024-03-27 13:46:00 6291.25 6293.60 6292.05 6289.10 1433
108004 1 2024-03-27 13:47:00 6295.00 6299.00 6293.20 6293.15 2702
108005 1 2024-03-27 13:48:00 6292.05 6296.55 6291.95 6291.95 983
I would like to calculate the "Relative Volume Ratio" indicator and add the calculated value to symbol_df as a new column on a rolling basis.
The "Relative Volume Ratio" indicator is calculated as follows:
Today's volume so far is compared with the mean volume over the same period of the last 10 days. To get the ratio, we simply divide "today's volume so far" by "the mean volume of the same period over the last 10 days".
For example, suppose the current bar time is 13:48.
cumulativeVolumeOfToday = sum of the Volume of the 1-minute bars between 00:00 and 13:48 today
averageVolumeOfPrevious10DaysOfSamePeriod = average cumulative volume over the same period (00:00 - 13:48) for each of the last 10 days
relativeVolumeRatio = cumulativeVolumeOfToday / averageVolumeOfPrevious10DaysOfSamePeriod
Add this value as a new column to the dataframe.
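To make the arithmetic concrete, here is a minimal sketch with made-up numbers (using 3 prior days instead of 10 for brevity; the values are purely illustrative):

```python
# Synthetic example: cumulative volumes for the 00:00-13:48 period.
prev_days_cum_volume = [120_000, 95_000, 110_000]  # one total per prior day
today_cum_volume = 130_000                         # today's total so far

avg_prev = sum(prev_days_cum_volume) / len(prev_days_cum_volume)
relative_volume_ratio = today_cum_volume / avg_prev
print(round(relative_volume_ratio, 3))  # 1.2 -> today is 20% above average
```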
Sample data download for the test case:
import yfinance as yf  # pip install yfinance
import pandas as pd
symbol_df = yf.download(tickers="AAPL", period="7d", interval="1m")["Volume"]
symbol_df = symbol_df.reset_index()
symbol_df = symbol_df.rename(columns={'Datetime': 'Date'})
# yfinance only provides 7 days of 1-minute data, so use a 5-day mean in the calculations
How can I do this in Pandas?
from yfinance import download
# Prepare data similar to the original
symbol_df = (
download(tickers="AAPL", period="7d", interval="1m")
.rename_axis(index='Date')
.reset_index()
)
# Calculate Relative Volume Ratio
volume = symbol_df.set_index('Date')['Volume']
dts = volume.index
cum_volume = volume.groupby(dts.date, sort=False).cumsum()
prev_mean = lambda days: (
cum_volume
.groupby(dts.time, sort=False)
.rolling(days, closed='left')
.mean()
.reset_index(0, drop=True) # drop the level with dts.time
)
rvr = cum_volume / prev_mean(5)
# Assign the output to the initial data
symbol_df = symbol_df.join(rvr.rename('Relative volume ratio'), on='Date')
Based on the provided description, you need to perform several transformations on the aggregated data: first, cumulatively sum the data within each day; then run a ten-day window over the data grouped by time of day to calculate the average; and finally, divide the former by the latter.
Let's say you have the following test data, where "Date" is a column of type datetime:
from yfinance import download
symbol_df = (
download(tickers="AAPL", period="7d", interval="1m")
.rename_axis(index='Date')
.reset_index()
)
To calculate the Relative Volume Ratio values, we will use "Volume" as a separate series with the date-time stamps "Date" as its index:
volume = symbol_df.set_index('Date')['Volume']
dts = volume.index # date-time stamps for convenient grouping
Let's create a sequence of cumulative volumes for each day. For this, we group volume by its date (the year, month and day, with no time component) and apply cumsum to each group (sort=False may speed up the grouping):
cum_volume = volume.groupby(dts.date, sort=False).cumsum()
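To illustrate what this per-day cumsum produces, here is a toy example with synthetic volumes over two days:

```python
import pandas as pd

# Two days, two 1-minute bars each (made-up volumes)
idx = pd.to_datetime([
    "2024-01-01 09:15", "2024-01-01 09:16",
    "2024-01-02 09:15", "2024-01-02 09:16",
])
vol = pd.Series([100, 50, 200, 25], index=idx, name="Volume")

# The running total restarts at the beginning of each day
cum = vol.groupby(vol.index.date, sort=False).cumsum()
print(cum.tolist())  # [100, 150, 200, 225]
```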
To calculate the mean of cumulative volumes at the same time of day over the given number of previous days, we group cum_volume by its time (hours and minutes, ignoring the year, month and day), and apply rolling calculations to each group to obtain averages over windows. Note that here we need the source data to be sorted by date-time stamps, since only business days are present and we can't pass a non-fixed frequency such as "10B" as the window value. To calculate means over exactly the previous days, excluding the current one, we pass closed='left' (see the DataFrameGroupBy.rolling docs for details):
prev_mean = lambda days: (
cum_volume
.groupby(dts.time, sort=False)
.rolling(days, closed='left')
.mean()
.reset_index(0, drop=True)
)
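To see the effect of closed='left' in isolation, here is a small synthetic example: cumulative volumes at one fixed minute of the day over four days, averaged with a 2-day window (numbers are made up):

```python
import pandas as pd

# Cumulative volume at 09:16 on four consecutive days
idx = pd.to_datetime([
    "2024-01-01 09:16", "2024-01-02 09:16",
    "2024-01-03 09:16", "2024-01-04 09:16",
])
cum_at_0916 = pd.Series([100.0, 120.0, 80.0, 110.0], index=idx)

# closed='left' excludes the current value, so each day is compared
# only against its *previous* 2 days; the first two rows stay NaN
means = cum_at_0916.rolling(2, closed='left').mean()
print(means.tolist())  # [nan, nan, 110.0, 100.0]
```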
Now the final touch with the window of 5 days:
rvr = cum_volume / prev_mean(5)
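As an end-to-end sanity check, the same pipeline can be run on three synthetic days of two bars each, with a 2-day averaging window instead of 10 (volumes chosen so the expected ratio is easy to verify by hand):

```python
import pandas as pd

idx = pd.to_datetime([
    "2024-01-01 09:15", "2024-01-01 09:16",
    "2024-01-02 09:15", "2024-01-02 09:16",
    "2024-01-03 09:15", "2024-01-03 09:16",
])
volume = pd.Series([100, 100, 200, 200, 300, 300], index=idx, name="Volume")
dts = volume.index

cum_volume = volume.groupby(dts.date, sort=False).cumsum()
prev_mean = (
    cum_volume
    .groupby(dts.time, sort=False)
    .rolling(2, closed='left')
    .mean()
    .reset_index(0, drop=True)
)
rvr = cum_volume / prev_mean
print(rvr.tolist())
```

On day 3 the cumulative volumes are 300 and 600, while the 2-day baselines at the same minutes are mean(100, 200) = 150 and mean(200, 400) = 300, so both ratios come out to 2.0; the first two days have no complete window and stay NaN.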
Compared to Andrei Kesely's solution, this one wins in speed (on an Intel Core i3-2100, for example, processing the data offered there takes over 1 minute versus 300-400 ms with the code above). The calculation results are identical for timestamps after the first 10 days. But in the beginning, when there are fewer than 10 previous days, the mean in each rolling window is calculated as if there were always 10 items (missing values are set to NaN), whereas Kesely's solution averages only the available cumulative volumes.