python, reinforcement-learning, openai-gym, dqn

Understanding and Evaluating different methods in Reinforcement Learning


I have been implementing reinforcement learning algorithms in Python using several variants: Q-learning, Deep Q-Network (DQN), Double DQN, and Dueling Double DQN. Taking the cart-pole problem as an example, the only ways I can think of to evaluate the performance of each variant are plotting the sum of rewards against the number of episodes (picture of the plot attached) and watching the rendered output to see how well the pole stays balanced while the cart moves.

However, neither of these evaluations really explains quantitatively which variant is better. I am new to reinforcement learning and am trying to understand whether there are other ways to compare different variants of RL models on the same problem.

I am referring to the Colab notebook https://colab.research.google.com/github/ageron/handson-ml2/blob/master/18_reinforcement_learning.ipynb#scrollTo=MR0z7tfo3k9C for the code of all the cart-pole variants.


Solution

  • You can find the answer in the research papers on those algorithms: when a new algorithm is proposed, the authors usually run experiments to show evidence that it has an advantage over existing algorithms.

    The most commonly used evaluation in papers on RL algorithms is the average return (note: return, not reward; the return is the accumulated reward, like the score in a game) over timesteps. There are many ways to average the return, e.g. averaging over different hyperparameters, or, as in the comparative evaluation of the Soft Actor-Critic paper, averaging over different random seeds (used to initialize the model):

    Figure 1 shows the total average return of evaluation rollouts during training for DDPG, PPO, and TD3. We train five different instances of each algorithm with different random seeds, with each performing one evaluation rollout every 1000 environment steps. The solid curves corresponds to the mean and the shaded region to the minimum and maximum returns over the five trials.

    [Figure: average return over environment steps, mean over five seeds with min/max range shaded]
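
    To make this concrete for your cart-pole variants, here is a minimal sketch, assuming the classic Gym reset()/step() API (as in the notebook you linked) and that policy(obs) is whatever greedy action-selection function your agent exposes. The idea is to train each variant with several random seeds, record an evaluation return at fixed step intervals, and then plot the mean with the min/max range shaded, exactly as described in the quote above:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    def evaluation_return(env, policy, n_episodes=5):
        """Average undiscounted return over a few evaluation episodes."""
        returns = []
        for _ in range(n_episodes):
            obs = env.reset()
            done, total = False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy(obs))
                total += reward
            returns.append(total)
        return np.mean(returns)

    def plot_learning_curves(eval_returns, eval_every=1000):
        """eval_returns: dict mapping variant name -> array of shape (n_seeds, n_eval_points),
        where entry [s, k] is the evaluation return of seed s after k * eval_every env steps."""
        for name, runs in eval_returns.items():
            runs = np.asarray(runs)
            steps = np.arange(runs.shape[1]) * eval_every
            plt.plot(steps, runs.mean(axis=0), label=name)  # solid curve: mean over seeds
            plt.fill_between(steps, runs.min(axis=0), runs.max(axis=0), alpha=0.2)  # shaded: min/max
        plt.xlabel("environment steps")
        plt.ylabel("average return")
        plt.legend()
        plt.show()
    ```

    Comparing the mean curves (and how tight the min/max band is) across Q-learning, DQN, Double DQN, and Dueling Double DQN gives you a quantitative picture of both final performance and sample efficiency.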

    And we usually want to compare the performance of many algorithms not just on one task but on a diverse set of tasks (i.e. a benchmark), because an algorithm may have an inductive bias that makes it better on some kinds of tasks but worse on others, e.g. the comparison to PPO in the Phasic Policy Gradient paper's experiments:

    We report results on the environments in Procgen Benchmark (Cobbe et al., 2019). This benchmark was designed to be highly diverse, and we expect improvements on this benchmark to transfer well to many other RL environments.

    [Figure: per-environment results on the Procgen Benchmark comparing PPG and PPO]
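
    If you want to go beyond cart-pole, the same idea extends to a small benchmark of your own: run every variant on every environment with several seeds and report the mean and spread of the final evaluation return per task. The sketch below assumes the classic Gym API again and hypothetical train(...) and evaluation_return(...) routines (the latter as defined in the earlier snippet); the environment ids and seed list are just illustrative placeholders:

    ```python
    import numpy as np
    import gym

    TASKS = ["CartPole-v1", "MountainCar-v0", "Acrobot-v1"]  # illustrative benchmark
    SEEDS = [0, 1, 2, 3, 4]

    def run_benchmark(algorithms, train, evaluation_return):
        """algorithms: dict mapping variant name -> agent constructor taking an env.
        train and evaluation_return are your own training loop and evaluation routine."""
        scores = {}  # (algorithm, task) -> list of final returns, one per seed
        for name, make_agent in algorithms.items():
            for task in TASKS:
                per_seed = []
                for seed in SEEDS:
                    env = gym.make(task)
                    env.seed(seed)          # old Gym seeding API, as in the linked notebook
                    np.random.seed(seed)
                    # train is assumed to return the trained greedy policy function
                    policy = train(make_agent(env), env)
                    per_seed.append(evaluation_return(env, policy))
                scores[(name, task)] = per_seed
        # Mean +/- std per (algorithm, task) gives a quantitative, comparable summary.
        for (name, task), returns in scores.items():
            print(f"{name:>20} on {task:<16}: {np.mean(returns):8.1f} +/- {np.std(returns):6.1f}")
        return scores
    ```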