My question arises from examining the code in the PyTorch DQN tutorial, but it applies to reinforcement learning in general: what are the best practices for balancing exploration and exploitation in reinforcement learning?
In the DQN tutorial, steps_done is a global variable and EPS_DECAY = 200. This means the epsilon threshold is about 0.500 after 128 steps, 0.0600 after 889 steps, and 0.05047 after 1,500 steps.
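For reference, those thresholds come from the tutorial's exponential decay. A minimal sketch that reproduces them, assuming the tutorial's EPS_START = 0.9 and EPS_END = 0.05:

import math

EPS_START = 0.9    # starting epsilon (assumed tutorial value)
EPS_END = 0.05     # final epsilon (assumed tutorial value)
EPS_DECAY = 200    # decay constant, as above

def eps_threshold(steps_done):
    # exponential decay of epsilon toward EPS_END, as used by the tutorial
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)

for s in (128, 889, 1500):
    print(s, round(eps_threshold(s), 5))   # roughly 0.5, 0.06, and 0.05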
This might work for the CartPole problem featured in the tutorial, where the early episodes can be very short and the task is fairly simple, but what about more complex problems in which far more exploration is required? For example, if we had a problem with 40,000 episodes, each with 10,000 time steps, how would we set up the epsilon-greedy exploration policy? Is there some rule of thumb used in RL work?
For that, it is generally better to use a linearly annealed epsilon-greedy policy, which decreases epsilon by a fixed amount at every step:
EXPLORE = 3000000            # number of time steps over which to anneal epsilon
FINAL_EPSILON = 0.001        # final value of epsilon
INITIAL_EPSILON = 1.0        # starting value of epsilon

epsilon = INITIAL_EPSILON    # decay this once per environment step:
if epsilon > FINAL_EPSILON:
    epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE
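To adapt this to the scenario in the question (40,000 episodes of 10,000 steps each), the main knob is how many steps EXPLORE should cover. That is a hyperparameter rather than a fixed rule, and the 10% fraction below is just an illustrative assumption. A sketch that exposes epsilon as a function of the global step count:

TOTAL_STEPS = 40_000 * 10_000        # 40,000 episodes x 10,000 time steps each
EXPLORE = int(0.10 * TOTAL_STEPS)    # assumption: anneal over the first 10% of training
INITIAL_EPSILON = 1.0
FINAL_EPSILON = 0.001

def linear_epsilon(step):
    # Linearly interpolate from INITIAL_EPSILON to FINAL_EPSILON over EXPLORE steps,
    # then hold FINAL_EPSILON for the remainder of training.
    frac = min(step / EXPLORE, 1.0)
    return INITIAL_EPSILON + frac * (FINAL_EPSILON - INITIAL_EPSILON)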