reinforcement-learning openai-gym stable-baselines

Do I need to retrain a reinforcement learning model from scratch each time I want to use it in practice?


This seems like it should be obvious, but I can't find resources on it anywhere. I am building a reinforcement learning model with OpenAI Gym's AnyTrading environment (gym-anytrading) and Stable-Baselines3. There are a ton of online tutorials and documentation for training and evaluating the model, but almost nothing on actually using it in practice.

For example, I want the model to constantly look at today's data and predict what action I should take to lock in tomorrow's profits.

Reinforcement learning algorithms all seem to have a model.predict() method, but you have to pass the environment, which is just more historical data. What if I want it to use today's data to predict tomorrow's values? Do I just include data up to today in the test set and retrain the model from scratch each time I want it to make a prediction?

For example, the original training data ranges from 2014-01-01 to today (i.e., 2023-02-12), and I run through the whole train and test process. Then tomorrow I start from scratch and train/test on 2014-01-01 to today (now 2023-02-13), then the next day on 2014-01-01 to 2023-02-14, and so on? How do I actually make real-time predictions with a reinforcement learning model, as opposed to continually evaluating how it would have performed on past data?

Thanks.


Solution

  • This is a very good and practical question. I assume that in practice you use all the historical data to train your RL agent in Stable-Baselines3 and then apply the trained agent to predict tomorrow's action. The short answer is no, you don't need to train your agent from scratch every day.

    First, you need to understand the procedures for learning and for prediction:

    In the learning or training process (a code sketch follows this list):

    1. Initialize your RL agent's policy or value network.
    2. Input the observation for day 2014-01-01 to your RL agent.
    3. Your agent makes a decision based on the observation.
    4. Calculate the observation and reward/profit for day 2014-01-02 and send them back to your agent.
    5. Depending on the RL algorithm you use, your agent might update its policy or value network based on this observation-reward pair, or it could save the pair into a buffer and only update its policy or value network after a certain number of days (e.g., 30 days, 180 days).
    6. Repeat steps 2-5 until you reach the last day of your database (e.g., 2023-02-12).
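
    As a rough sketch of what that training loop looks like with gym-anytrading and Stable-Baselines3 (the file name, window size, timestep budget and the choice of PPO are my own assumptions, and newer library versions may use gymnasium instead of gym):

    ```python
    import gym
    import gym_anytrading  # registers the "stocks-v0" trading environment
    import pandas as pd
    from stable_baselines3 import PPO

    # Historical data from 2014-01-01 up to 2023-02-12 (hypothetical file;
    # gym-anytrading's stocks env expects at least a 'Close' column).
    df = pd.read_csv("prices_2014-01-01_to_2023-02-12.csv")

    # Observations are rolling windows of the last `window_size` rows.
    env = gym.make("stocks-v0", df=df, window_size=10, frame_bound=(10, len(df)))

    model = PPO("MlpPolicy", env, verbose=1)  # step 1: initialize the policy network
    model.learn(total_timesteps=100_000)      # steps 2-5, repeated over the history
    model.save("ppo_trading")                 # keep the trained agent for later use
    ```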

    In the prediction process (which consists of only steps 2 and 3 from the training process; a sketch follows this list):

    1. Input the observation for the day in question (e.g., today's data) to your RL agent.
    2. Your agent makes a decision based on the observation. That's it.
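
    A sketch of prediction (hypothetical file and model names). The key point is that model.predict() takes an observation, not an environment, so you can load the agent you already trained and feed it the most recent window of data:

    ```python
    import gym
    import gym_anytrading
    import pandas as pd
    from stable_baselines3 import PPO

    model = PPO.load("ppo_trading")  # the agent trained on the historical data

    # Build an env whose frame ends on today's bar, so reset() yields the most
    # recent observation window (hypothetical file; same window_size as training).
    df = pd.read_csv("prices_up_to_today.csv")
    env = gym.make("stocks-v0", df=df, window_size=10,
                   frame_bound=(len(df) - 1, len(df)))
    obs = env.reset()  # gymnasium-based versions return (obs, info) instead

    action, _states = model.predict(obs, deterministic=True)
    print(action)  # 0 = Sell, 1 = Buy in gym-anytrading's default action space
    ```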

    You can train your model repeatedly on the historical data until you are satisfied with its performance during training. In this retraining process, after each pass through the entire historical dataset, save the model and load the saved model as the initialization for the next pass.
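
    A sketch of that save-and-reload pattern (hypothetical path; env is the training environment from the earlier sketch):

    ```python
    from stable_baselines3 import PPO

    # Load the model saved after the previous pass and continue training from
    # there, instead of re-initializing the network from scratch.
    model = PPO.load("ppo_trading", env=env)
    model.learn(total_timesteps=100_000, reset_num_timesteps=False)
    model.save("ppo_trading")
    ```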

    Once you have that good model, you don't need to train it any more on the new data that arrives after 2023-02-12. It is still valid.

    You may think that new data is generated every day and that the most recent data is the most valuable. In that case, you can periodically update your existing model with the new data using the following procedure (a sketch follows this list):

    1. Load your existing RL agent model (the trained model).
    2. Input the observation for day one of your most recent new data to your RL agent.
    3. Your agent makes a decision based on the observation.
    4. Calculate the observation and reward/profit for day two of your new data and send them back to your agent.
    5. Depending on the RL algorithm you use, your agent might update its policy or value network based on this observation-reward pair, or it could save the pair into a buffer and only update its policy or value network after a certain number of days (e.g., 30 days).
    6. Repeat steps 2-5 until you reach the last day of your new data.
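
    A sketch of that periodic update using Stable-Baselines3's load/set_env/learn calls (hypothetical file names; assumes the same gym-anytrading setup as in training):

    ```python
    import gym
    import gym_anytrading
    import pandas as pd
    from stable_baselines3 import PPO

    # Only the recent slice of data, e.g. everything after 2023-02-12 (hypothetical file).
    new_df = pd.read_csv("prices_since_2023-02-13.csv")
    new_env = gym.make("stocks-v0", df=new_df, window_size=10,
                       frame_bound=(10, len(new_df)))

    model = PPO.load("ppo_trading")  # step 1: load the existing trained agent
    model.set_env(new_env)           # steps 2-5 now run over the new data
    model.learn(total_timesteps=10_000, reset_num_timesteps=False)
    model.save("ppo_trading")        # keep the updated model for tomorrow
    ```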