reinforcement-learning q-learning dqn

In a DQN for Q-learning, how should I apply high gamma values during experience replay?


I'm using PyTorch to implement a Q-learning approach to a card game, where the reward comes only at the end of the hand when the score is calculated. I am using experience replay with high gammas (0.5-0.95) to train the network.

My question is about how to apply the discounted rewards to the replay memory. It seems that computing the correct discounted reward requires knowing, at some point, the temporal sequence of state transitions and rewards, and applying the discount recursively backward from the terminal state.

Yet most algorithms seem to apply the gamma to a randomly selected batch of transitions from the replay memory, which would seem to decorrelate them temporally and make calculation of discounted rewards problematic. In these algorithms the discount appears to be applied to a forward pass on the "next_state", although it can be hard to interpret.

My approach has been to calculate the discounted rewards once the terminal state has been reached, and to write them directly into the replay memory's reward values at that point. I do not apply gamma again at replay time, since it has already been factored in.
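Roughly, this is what I mean (a simplified sketch; the function and variable names are just for illustration, and the real code stores these values back into the replay memory alongside each transition):

    def discounted_returns(rewards, gamma):
        """Walk backward from the terminal step, accumulating the discounted return."""
        returns = []
        G = 0.0
        for r in reversed(rewards):      # start at the end of the hand
            G = r + gamma * G            # discounted return from this step onward
            returns.append(G)
        returns.reverse()                # restore temporal order
        return returns

    # e.g. the hand ends with a score of 10 and all earlier rewards are 0:
    # discounted_returns([0, 0, 0, 10], 0.9)  ->  approximately [7.29, 8.1, 9.0, 10.0]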

This makes sense to me, but it is not what I see, for example, in the PyTorch "Reinforcement Learning (DQN) Tutorial". Can someone explain how the temporal decorrelation in random batches is handled for high-gamma Q-learning?


Solution

  • Imagine you are playing a simple game where you move around a grid and collect coins. You're facing a common challenge in reinforcement learning: the reward comes late, so it's hard to know which actions were good or bad. In Q-learning, you want to know how good it is to take a certain move (action) at a certain spot (state) on the grid. We call this the Q-value, and it is calculated with this formula:

       Q(state, action) = reward + gamma * max_next Q(next_state, next_action)
    

    The Q-value is the immediate reward plus the discounted best Q-value available from the next state. You save each move (state, action, reward, next_state) in a replay memory, and during training you randomly pick a batch of these moves to update the Q-values. This keeps the network from focusing too much on the most recent, correlated moves. Even though the moves are picked randomly, the sequence of rewards is still accounted for: each stored transition carries its own next_state, and a forward pass on that next_state estimates the future rewards. That is the gamma * max_next Q(next_state, next_action) part of the formula.

    Your approach of waiting until the end of the game to calculate the returns is a bit different. It is closer to the Monte Carlo method, where you only update the value estimates at the end of each episode: you play the whole game first, then decide how good the moves were. That can work, but it tends to be less effective when episodes are long. Standard Q-learning, on the other hand, updates the Q-values as you play the game.

    Keep in mind that in standard Q-learning you don't need to calculate and store discounted returns manually. The discounting of future rewards happens inside the Q-value update itself, so even when you train on random batches of transitions, each transition's future reward is still taken into account through its own next_state. This is how Q-learning manages temporal decorrelation even with a high gamma; see the sketch below.
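    To make that concrete, here is a minimal sketch of the batched update in PyTorch. It is not the tutorial's exact code; the tiny network, tensor shapes, and names (policy_net, target_net, dones) are placeholder assumptions. The point is that gamma multiplies a forward pass on each transition's own stored next_state, so the minibatch never needs to be in temporal order:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        gamma = 0.9
        n_states, n_actions, batch_size = 8, 4, 32

        policy_net = nn.Linear(n_states, n_actions)   # stand-in for your Q-network
        target_net = nn.Linear(n_states, n_actions)
        target_net.load_state_dict(policy_net.state_dict())

        # A randomly sampled, temporally decorrelated minibatch of transitions
        # (state, action, reward, next_state, done) -- random data here for illustration.
        states      = torch.randn(batch_size, n_states)
        actions     = torch.randint(n_actions, (batch_size,))
        rewards     = torch.randn(batch_size)
        next_states = torch.randn(batch_size, n_states)
        dones       = torch.zeros(batch_size, dtype=torch.bool)

        # Q(s, a) for the actions actually taken
        q_sa = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Bootstrapped target: reward + gamma * max_a' Q(next_state, a'),
        # with no bootstrap past a terminal state. Each row uses its own next_state.
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
            next_q[dones] = 0.0
            targets = rewards + gamma * next_q

        loss = F.smooth_l1_loss(q_sa, targets)
        loss.backward()                                # then step your optimizer as usual

    Because the bootstrap term is zeroed for terminal transitions, the end-of-hand score is not written into memory up front; it propagates backward through the Q-estimates over many such updates, one step at a time.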