reinforcement-learning, q-learning, reward-system

Q-Learning Intermediate Rewards


If a Q-Learning agent actually performs noticeably better against opponents in a specific card game when intermediate rewards are included, would this show a flaw in the algorithm or a flaw in its implementation?


Solution

  • It's difficult to answer this question without more specific information about the Q-Learning agent. The tendency to chase immediate rewards corresponds to the exploitation rate, which generally varies inversely with the exploration rate; both, along with the learning rate, should be configurable in your implementation. The other important factor is the choice of exploration strategy, and there are good resources to help you make that choice. For example:

    http://www.ai.rug.nl/~mwiering/GROUP/ARTICLES/Exploration_QLearning.pdf

    https://www.cs.mcgill.ca/~vkules/bandits.pdf

    To answer the question directly: the behaviour could stem from the implementation, the configuration, the agent architecture, or the learning strategy, any of which can cause premature exploitation and convergence on a locally optimal (rather than globally optimal) policy.
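
    To make the moving parts concrete, here is a minimal tabular Q-Learning sketch. The state names, action names, and parameter values are illustrative assumptions, not taken from the question; the point is to show where an intermediate (shaped) reward enters the update, and where the exploration rate (epsilon) and learning rate (alpha) are configured:

    ```python
    import random
    from collections import defaultdict

    def make_q_table():
        # Q maps state -> action -> estimated value; missing entries default to 0.0.
        return defaultdict(lambda: defaultdict(float))

    def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
        """One Q-Learning backup. `reward` is where an intermediate
        (shaped) reward enters, instead of a terminal-only game result."""
        best_next = max(Q[next_state].values(), default=0.0)
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

    def epsilon_greedy(Q, state, actions, epsilon):
        """Explore with probability epsilon; otherwise exploit the best-known action."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[state][a])

    Q = make_q_table()
    # Hypothetical intermediate reward of 1.0 for a mid-game event
    # (e.g. winning a trick), rather than waiting for the game's outcome.
    q_update(Q, "s0", "play_high", 1.0, "s1", alpha=0.5, gamma=0.9)
    print(Q["s0"]["play_high"])  # 0.5
    # With epsilon=0.0 the agent is purely exploitative and picks the shaped action.
    print(epsilon_greedy(Q, "s0", ["play_high", "play_low"], epsilon=0.0))  # play_high
    ```

    If intermediate rewards help your agent, that usually means the shaped reward signal is giving it more to learn from per episode; tuning epsilon (and decaying it over training) controls whether that signal leads to useful learning or to premature exploitation.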