monte-carlo, reinforcement-learning, monte-carlo-tree-search

Reinforcement Learning: fine-tuning MCTS node selection and expansion stages with inaccurate values


I am implementing a Go-playing program roughly according to the architecture of earlier versions of AlphaGo (AlphaGo Fan or AlphaGo Lee), i.e. using a policy network, a value network, and Monte Carlo tree search (MCTS). Currently I have trained a decent policy network and an insensitive value network, and I don't have a fast rollout policy. By "insensitive" I mean the value network is not able to judge complicated situations, only outputting a win rate around 50% unless the situation is settled. The value network does judge settled boards (no big fight going on) correctly.

Using this policy network and value network, I have also implemented the MCTS algorithm (the evaluation of a tree node is done only by the value network). Since the value network is not accurate, I am afraid MCTS is prone to getting trapped in bad moves before its time budget runs out. In order to better fine-tune the hyperparameters of MCTS and remedy the negative influence of the inaccurate value network, I have two questions:

  1. Node selection is done by arg max (p_value + lambda * p_policy/visit_cnt). Does fine-tuning the parameter lambda help?
  2. Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does it help to set the expansion condition to expand a leaf once it has been visited a very small number of times, like 3? What expansion method should I use?

EDIT: The second question is about the 'expand' stage of the typical 'selection, expansion, evaluation, backup' MCTS algorithm. I reckon that by expanding as quickly as possible, MCTS can explore deeper and give more accurate value approximations. I set a parameter n to be the number of times a leaf node must be visited before it is expanded. I want to know, intuitively, how a large n and a small n would influence the performance of MCTS.
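For concreteness, here is a rough sketch of the expansion rule I have in mind (the `Node` class, `policy_network`, and `board.play` are simplified placeholders for my actual implementation):

```python
class Node:
    """Simplified tree node; placeholder for my actual node implementation."""
    def __init__(self, board, prior=1.0):
        self.board = board
        self.prior = prior          # prior probability from the policy network
        self.visit_count = 0
        self.children = {}          # move -> Node

def maybe_expand(leaf, n, policy_network):
    """Expand `leaf` only once it has been visited at least n times."""
    if leaf.visit_count >= n and not leaf.children:
        for move, prior in policy_network(leaf.board).items():
            # board.play(move) is a placeholder for computing the child position
            leaf.children[move] = Node(leaf.board.play(move), prior)
```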


Solution

    1. Node selection is done by arg max (p_value + lambda * p_policy/visit_cnt). Does fine-tuning the parameter lambda help?

    Let's first try to develop a good understanding of what all the terms in that formula do:

    - p_value is the (backed-up) value-network estimate for a child; it tells the search how promising that child currently looks according to your (inaccurate) value network.
    - p_policy / visit_cnt is large for moves that the policy network likes and shrinks as a node collects visits, so it acts as the exploration term driven by the policy network.
    - lambda trades off the two: the higher lambda is, the longer the policy network's opinion dominates over the accumulated value estimates.

    In the text of your question, you have stated that you believe the policy network is decent, and the value network isn't really informative. So, if that's the case, I'd try using a high value for lambda: if you believe the policy network to be more informative than the value network, you'll want to rely more on the policy network than the value network, so you'll want a high lambda.
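    As a concrete sketch (an illustration of the formula, not your actual code), selection with a given lambda could look like the snippet below, where `value` stands for the backed-up value-network estimate, `prior` for the policy network's probability, and the +1 in the denominator only avoids division by zero for unvisited children:

```python
from dataclasses import dataclass

@dataclass
class Child:
    value: float        # backed-up value-network estimate (p_value)
    prior: float        # policy network probability for the move (p_policy)
    visit_count: int = 0

def select_child(children, lam):
    """Pick the child maximizing p_value + lambda * p_policy / visit_cnt."""
    def score(c):
        return c.value + lam * c.prior / (1 + c.visit_count)
    return max(children, key=score)

# With a high lambda, the policy prior dominates the (noisy) value estimates:
children = [Child(value=0.50, prior=0.6), Child(value=0.52, prior=0.1)]
print(select_child(children, lam=5.0).prior)   # -> 0.6, the policy network's favourite
```

    With a small lambda, the second child (slightly higher value estimate) would be picked instead, so lambda directly controls how quickly the noisy value estimates override the policy prior.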

    2. Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does it help to set the expansion condition to expand a leaf once it has been visited a very small number of times, like 3? What expansion method should I use?

    The only reason why the expansion phase is often rather limited in classical MCTS implementations (e.g. often only expanding a single node per iteration) is memory concerns: if you expand too often, your tree grows too quickly and you run out of memory.

    In these AlphaGo-style setups (mixing deep learning and MCTS), you spend a lot more computation time per iteration in the networks, and therefore get far fewer MCTS iterations than a raw, pure MCTS algorithm without any deep learning (but often higher-quality, more informative iterations, which makes up for the lower iteration count). This lower iteration count significantly reduces the risk of running out of memory due to over-enthusiastic expansion, so I suspect you can afford to expand more aggressively. The only likely negative effect of expanding too much is that you run out of RAM, and you'll easily notice when that happens because your program will crash.
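    If you would rather have a safety net than a crash, a simple guard is enough. A rough sketch (the node budget is a made-up number, and `Node`, `policy_network`, and `board.play` are the same placeholders as in your question's sketch):

```python
MAX_NODES = 2_000_000      # made-up budget; tune it to the RAM you actually have

def expand_eagerly(leaf, policy_network, tree_stats):
    """Expand on the very first visit (n = 1), but stop growing the tree near the budget."""
    if tree_stats["node_count"] >= MAX_NODES:
        return                                   # keep evaluating leaves, just stop expanding
    for move, prior in policy_network(leaf.board).items():
        leaf.children[move] = Node(leaf.board.play(move), prior)
        tree_stats["node_count"] += 1
```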