Tags: python, machine-learning, reinforcement-learning, q-learning, mdp

MDP & Reinforcement Learning - Convergence Comparison of VI, PI and QLearning Algorithms


I have implemented the VI (Value Iteration), PI (Policy Iteration), and Q-Learning algorithms in Python. After comparing the results, I noticed something: VI and PI converge to the same utilities and policies. With the same parameters, Q-Learning converges to different utilities, but to the same policies as VI and PI. Is this normal? I have read a lot of papers and books about MDPs and RL, but couldn't find anything that says whether the utilities of VI and PI should converge to the same values as those of Q-Learning.

The following information describes my grid world and results.

MY GRID WORLD

grid_world.png

RESULTS

vi_pi_results.png

qLearning_1million_10million_iterations_results.png

In addition, I noticed that when Q-Learning runs for 1 million iterations, states that are equally far from the +10 terminal end up with the same utilities. The agent does not seem to care whether its path to the reward passes near the -10 terminal, whereas it does care about this under VI and PI. Is this because, in Q-Learning, we don't know the transition probabilities of the environment?


Solution

  • If the state and action spaces are finite, as in your problem, the Q-learning algorithm should converge asymptotically to the optimal utility (a.k.a. the Q-function) as the number of transitions approaches infinity, provided the following conditions hold:

    $$\sum_{n=0}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=0}^{\infty} \alpha_n^2 < \infty,$$

    where $n$ is the number of transitions and $\alpha_n$ is the learning rate used for the $n$-th update. These conditions require the learning rate to be updated as learning progresses. A typical choice is $\alpha_n = 1/n$. However, in practice, the learning-rate schedule may require some tuning depending on the problem.

    On the other hand, another convergence condition is that every state-action pair is updated infinitely often (in an asymptotic sense). This can be achieved simply by maintaining an exploration rate greater than zero.

    So, in your case, you need to decrease the learning rate; a minimal sketch of both conditions is shown below.
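
For concreteness, here is a minimal sketch of how the two conditions might look in a tabular Q-learning loop: a per-(state, action) learning rate of $1/n$ and epsilon-greedy exploration that keeps the exploration rate above zero. The `env` interface (`reset` / `step` / `actions`), the hyperparameters, and the episode count are assumptions for illustration, not taken from your code.

```python
import random
from collections import defaultdict

# Sketch of tabular Q-learning with the two convergence conditions above:
#   1. a per-(state, action) learning rate alpha_n = 1/n (Robbins-Monro schedule),
#   2. an exploration rate kept above zero so every pair keeps being updated.
# The `env` object (reset/step/actions) is a hypothetical grid-world interface.

def q_learning(env, episodes=100_000, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)     # Q[(state, action)] -> estimated utility
    visits = defaultdict(int)  # n(state, action), used for the 1/n schedule

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: exploration rate stays > 0 for all pairs.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(state, action)

            # Learning rate decays as 1/n for this particular (state, action).
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]

            # Standard Q-learning target: r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(
                Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = next_state
    return Q
```

In practice a strict $1/n$ schedule often decays too quickly early in training, so slower schedules such as $n^{-0.7}$ (which still satisfy both summability conditions) or a small, slowly annealed constant are common choices.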