reinforcement-learning, q-learning, convergence, markov-decision-process

Why do we need exploitation in RL (Q-Learning) for convergence?


I am implementing the Q-learning algorithm, and I observed that my Q-values are not converging to the optimal Q-values, even though the policy seems to be converging. I defined the action-selection strategy as epsilon-greedy, with epsilon starting at 1 and decreasing by 1/N each iteration (N being the total number of iterations). That way, in the earlier iterations the algorithm explores random states, and the exploration rate then gradually decreases, leading to exploitation. In addition, I defined the learning rate as 1/N_t(s,a), where N_t(s,a) is the total number of times (s,a) has been visited.
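
To make the setup concrete, here is a stripped-down sketch of what I described (not my actual code): a tabular Q-table, the epsilon schedule above, and the 1/N_t(s,a) learning rate. The environment interface (`env.reset()` / `env.step()` returning `(next_state, reward, done)`), `n_states`, `n_actions`, `total_iterations`, and the discount factor are just placeholders.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon take a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning(env, n_states, n_actions, total_iterations, gamma=0.99, seed=0):
    # Assumes states and actions are integer indices; `env` is a placeholder
    # whose step() is assumed to return (next_state, reward, done).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # N_t(s, a)

    state = env.reset()
    for t in range(1, total_iterations + 1):
        # Epsilon starts at 1 and decreases by 1/N each iteration.
        epsilon = 1.0 - (t - 1) / total_iterations
        action = epsilon_greedy(Q, state, epsilon, rng)

        next_state, reward, done = env.step(action)

        # Learning rate 1 / N_t(s, a): step size shrinks per visited pair.
        visits[state, action] += 1
        alpha = 1.0 / visits[state, action]

        # Standard Q-learning update towards the bootstrapped target.
        target = reward + gamma * np.max(Q[next_state]) * (not done)
        Q[state, action] += alpha * (target - Q[state, action])

        state = env.reset() if done else next_state
    return Q
```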

Everything seems to be correct, but since I can't get to the optimal Q-values, I started looking into different strategies and in the meantime got quite confused. I know that convergence is guaranteed when all (s,a) pairs are visited infinitely often. Isn't this equivalent to saying that all (s,a) pairs are explored many times? In other words, why do we need exploitation for convergence? What if we don't exploit at all and just focus on exploring? If we do that, we search the whole solution space, so shouldn't that be enough to find an optimal policy?

Also, when it is said that the Q-values converge to the optimal ones, does only max_a[Q(s,a)] converge to its optimal value, or do all Q(s,a) values converge to their optimal values?

There is probably a simple answer to all of this; however, even though I checked a lot of resources and similar threads, I still couldn't figure out the logic behind exploitation. Thanks a lot in advance for your time!


Solution

  • Exploitation indeed isn't necessary for convergence in theory. In practice, it often turns out to be important, or even necessary, for one or both of the following two reasons:

    1. Sometimes we are not learning just for the sake of learning; we also care about our performance during the learning/training process itself. This means we need a balance between exploitation (performing well now) and exploration (continuing to learn).

    2. More importantly, if we purely explore and do not exploit at all, this may also limit our ability to learn in practice, because there are many states that we may simply fail to reach if we always act randomly.

    To clarify the second point, consider, for example, that we start in one corner of a large 2D grid, and our goal position is in the opposite corner. Suppose that we already get small positive rewards whenever we move closer to the goal, and small negative rewards whenever we move further away. If we have a balance between exploration and exploitation, it is likely that we'll quickly learn to walk along the path from start to goal, but also bounce around that path a bit randomly due to exploration. In other words, we'll start learning what to do in all states around that path.

    Now, suppose we try learning in the same situation only by acting randomly (i.e. no exploitation at all). If we only act randomly in a sufficiently large 2D grid, and we always start in one corner, it's highly unlikely that we'll ever manage to reach the opposite corner. We'll just keep moving randomly in an area around the starting position, and never learn what to do in states far away from it, because purely random behaviour is very unlikely to ever reach them in practice. Obviously we would reach every state given an infinite amount of time, but we rarely have an infinite amount of time in practice.
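
    As a rough, self-contained illustration of that last point, the sketch below estimates how often a purely random walker starting in one corner of a grid reaches the opposite corner within a fixed step budget (the grid size, step budget, and episode count are arbitrary choices, nothing specific to your setup):

    ```python
    import numpy as np

    def random_walk_success_rate(size=100, max_steps=2000, n_episodes=200, seed=0):
        """Fraction of purely random episodes that reach the far corner in time."""
        rng = np.random.default_rng(seed)
        moves = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
        successes = 0
        for _ in range(n_episodes):
            pos = np.array([0, 0])  # start in one corner
            for _ in range(max_steps):
                # Uniformly random move; clip keeps us on the grid.
                pos = np.clip(pos + moves[rng.integers(4)], 0, size - 1)
                if pos[0] == size - 1 and pos[1] == size - 1:
                    successes += 1
                    break
        return successes / n_episodes

    print(random_walk_success_rate())
    ```

    With settings like these, the estimated success rate should come out at or very near zero, whereas an agent that exploits even a roughly learned value function walks to the goal in about 2*(size-1) steps and so keeps generating useful experience in the states along the way.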