monte-carlo, reinforcement-learning, monte-carlo-tree-search

Reinforcement Learning: fine-tuning MCTS node selection and expansion stages with inaccurate values


I am implementing a Go-playing program roughly according to the architecture of earlier versions of AlphaGo (AlphaGo Fan or AlphaGo Lee), i.e. using a policy network, a value network, and Monte Carlo tree search (MCTS). Currently I have trained a decent policy network and an insensitive value network, and I don't have a fast rollout policy. By "insensitive" I mean the value network is not able to judge complicated situations, only outputting a win rate around 50% unless the situation is settled. The value network does judge settled boards (no big fight going on) correctly.

Using this policy network and value network, I have also implemented the MCTS algorithm (the evaluation of a tree node is done only by the value network). Since the value network is not accurate, I am afraid MCTS is prone to getting trapped in bad moves before its time budget runs out. In order to better fine-tune the hyperparameters of MCTS and remedy the negative influence of the inaccurate value network, I have two questions:

  1. Node selection is done by arg max (p_value + lambda * p_policy/visit_cnt). Does fine-tuning the parameter lambda help?
  2. Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does it help to set the expansion condition to expand a leaf once it has been visited a very small number of times, like 3? What expansion method should I use?

EDIT: The second question is about the 'expand' stage of the typical 'selection, expansion, evaluation, backup' MCTS algorithm. I reckon that by expanding as quickly as possible, MCTS can explore deeper and give more accurate value approximations. I set a parameter n to be the number of times a leaf node must be visited before it is expanded. I want to know, intuitively, how a large n and a small n would influence the performance of MCTS.
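For concreteness, here is a rough sketch of the expansion rule I have in mind (the `Node` class, `policy_network`, and `board.play` are simplified placeholders for my actual implementation):

```python
class Node:
    """Simplified tree node; placeholder for my actual node implementation."""
    def __init__(self, board, prior=1.0):
        self.board = board
        self.prior = prior          # prior probability from the policy network
        self.visit_count = 0
        self.children = {}          # move -> Node

def maybe_expand(leaf, n, policy_network):
    """Expand `leaf` only once it has been visited at least n times."""
    if leaf.visit_count >= n and not leaf.children:
        for move, prior in policy_network(leaf.board).items():
            # board.play(move) is a placeholder for computing the child position
            leaf.children[move] = Node(leaf.board.play(move), prior)
```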


Solution

    1. Node selection is done by arg max (p_value + lambda * p_policy/visit_cnt). Does fine-tuning the parameter lambda help?

    Let's first try to develop a good understanding of what all the terms in that formula do:

    - p_value is the (backed-up) value-network estimate for a child; it tells the search how promising that child currently looks according to your (inaccurate) value network.
    - p_policy / visit_cnt is large for moves that the policy network likes and shrinks as a node collects visits, so it acts as the exploration term driven by the policy network.
    - lambda trades off the two: the higher lambda is, the longer the policy network's opinion dominates over the accumulated value estimates.

    In the text of your question, you have stated that you believe the policy network is decent, and the value network isn't really informative. So, if that's the case, I'd try using a high value for lambda: if you believe the policy network to be more informative than the value network, you'll want to rely more on the policy network than the value network, so you'll want a high lambda.
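    As a concrete sketch (an illustration of the formula, not your actual code), selection with a given lambda could look like the snippet below, where `value` stands for the backed-up value-network estimate, `prior` for the policy network's probability, and the +1 in the denominator only avoids division by zero for unvisited children:

```python
from dataclasses import dataclass

@dataclass
class Child:
    value: float        # backed-up value-network estimate (p_value)
    prior: float        # policy network probability for the move (p_policy)
    visit_count: int = 0

def select_child(children, lam):
    """Pick the child maximizing p_value + lambda * p_policy / visit_cnt."""
    def score(c):
        return c.value + lam * c.prior / (1 + c.visit_count)
    return max(children, key=score)

# With a high lambda, the policy prior dominates the (noisy) value estimates:
children = [Child(value=0.50, prior=0.6), Child(value=0.52, prior=0.1)]
print(select_child(children, lam=5.0).prior)   # -> 0.6, the policy network's favourite
```

    With a small lambda, the second child (slightly higher value estimate) would be picked instead, so lambda directly controls how quickly the noisy value estimates override the policy prior.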

    2. Intuitively I want MCTS to explore as deep as possible. In the node expansion stage, does it help to set the expansion condition to expand a leaf once it has been visited a very small number of times, like 3? What expansion method should I use?

    The only reason why the expansion phase is often rather limited in classical MCTS implementations (e.g. often only expanding a single node per iteration) is memory concerns: if you expand too often, your tree grows too quickly and you run out of memory.

    In these AlphaGo-style setups (mixing deep learning and MCTS), you spend a lot more computation time per iteration in the networks, and therefore get far fewer MCTS iterations than a raw, pure MCTS algorithm without any deep learning (but often higher-quality, more informative iterations, which makes up for the lower iteration count). This lower iteration count significantly reduces the risk of running out of memory due to over-enthusiastic expansion, so I suspect you can afford to expand more aggressively. The only likely negative effect of expanding too much is that you run out of RAM, and you'll easily notice when that happens because your program will crash.
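    If you would rather have a safety net than a crash, a simple guard is enough. A rough sketch (the node budget is a made-up number, and `Node`, `policy_network`, and `board.play` are the same placeholders as in your question's sketch):

```python
MAX_NODES = 2_000_000      # made-up budget; tune it to the RAM you actually have

def expand_eagerly(leaf, policy_network, tree_stats):
    """Expand on the very first visit (n = 1), but stop growing the tree near the budget."""
    if tree_stats["node_count"] >= MAX_NODES:
        return                                   # keep evaluating leaves, just stop expanding
    for move, prior in policy_network(leaf.board).items():
        leaf.children[move] = Node(leaf.board.play(move), prior)
        tree_stats["node_count"] += 1
```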