Recently, I have tried to apply the naive policy gradient method to my problem. However, I found that the outputs of the last layer of the neural network differ hugely in magnitude, which means that after applying the softmax layer, one action gets a probability of almost 1 while all the other actions get a probability of almost 0. For instance, the output of the last layer is shown below:
[ 242.9629, -115.6593, 63.3984, 226.1815, 131.5903, -316.6087,
-205.9341, 98.7216, 136.7644, 266.8708, 19.2289, 47.7531]
After applying the softmax function, it is clear that only one action will ever be chosen:
[4.1395e-11, 0.0000e+00, 0.0000e+00, 2.1323e-18, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00]
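For reference, the collapse is easy to verify; here is a minimal sketch in PyTorch (assumed here only for illustration) that recomputes the softmax and the policy entropy from the logits above:

```python
import torch

# Raw outputs (logits) of the last layer, copied from above.
logits = torch.tensor([242.9629, -115.6593, 63.3984, 226.1815, 131.5903, -316.6087,
                       -205.9341, 98.7216, 136.7644, 266.8708, 19.2289, 47.7531])

dist = torch.distributions.Categorical(logits=logits)
print(dist.probs)      # essentially one-hot at index 9
print(dist.entropy())  # ~1e-9, i.e. the policy is effectively deterministic
```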
This problem severely hurts the final performance, as the neural network collapses to a single constant action after only a few steps. Is there any way to solve this problem?
(By the way, even when I give the network negative rewards for that action, the action it chooses still does not change.)
In fact, there is no definitive way to solve this problem, as it is an instance of the age-old exploration-exploitation dilemma from the optimization domain. Specifically, in reinforcement learning, there are two simple ways to alleviate it (both are sketched in the code below):

1. Add an entropy bonus to the policy-gradient loss, so that near-deterministic (low-entropy) policies are penalized; the strength of the bonus is controlled by an entropy coefficient.
2. Divide the logits by a temperature before the softmax (Boltzmann exploration), which flattens the action distribution; the temperature controls how flat it becomes.
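A minimal sketch of both techniques, assuming a PyTorch setup (the names `policy_net`, `entropy_coef`, and `temperature` are illustrative placeholders, not names from my code):

```python
import torch

def policy_gradient_loss(policy_net, states, actions, returns,
                         entropy_coef=0.01, temperature=1.0):
    """Naive policy-gradient (REINFORCE-style) loss with the two fixes:
    a temperature-scaled softmax and an entropy bonus. All names and
    default values are illustrative placeholders."""
    logits = policy_net(states)                      # shape: (batch, n_actions)

    # 1) Temperature scaling: a temperature > 1 flattens the softmax,
    #    so the policy stays stochastic even with large logits.
    dist = torch.distributions.Categorical(logits=logits / temperature)

    # Policy-gradient term: -E[log pi(a_t | s_t) * return_t]
    pg_loss = -(dist.log_prob(actions) * returns).mean()

    # 2) Entropy bonus: subtracting the mean entropy from the loss
    #    penalizes near-deterministic (low-entropy) policies.
    entropy_bonus = dist.entropy().mean()

    return pg_loss - entropy_coef * entropy_bonus
```

Both `temperature` and `entropy_coef` trade exploration against exploitation, which is exactly the hand-tuning burden described next.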
I have validated both methods on my task, and both of them effectively alleviate the premature-convergence problem. However, each of them introduces a parameter that needs to be tuned by hand, which increases the complexity of my algorithm.
By the way, similarly to Q-learning, we can also use an epsilon-greedy mechanism to encourage the agent to explore more actions. However, this is not an elegant way to solve the problem either, because it is hard to determine a good epsilon value.
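For completeness, a minimal sketch of epsilon-greedy action selection on top of the policy output (again assuming PyTorch; `policy_net` and `epsilon` are placeholders):

```python
import torch

def select_action_epsilon_greedy(policy_net, state, epsilon=0.1):
    """With probability epsilon take a uniformly random action,
    otherwise take the greedy action under the current policy."""
    logits = policy_net(state)                        # shape: (n_actions,)
    n_actions = logits.shape[-1]
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()  # explore
    return torch.argmax(logits).item()                # exploit
```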