algorithm tree hashmap graph-algorithm monte-carlo-tree-search

Transposition table in Monte Carlo Tree Search algorithm unintended effect on UCT score

So I implemented a transposition table in a Monte Carlo Tree Search algorithm using UCT. This allows for keeping a cumulative reward value for a game state, no matter where, and how many times, it is encountered throughout the tree. This increases the quality of the information gathered on the particular game state.

Alas, I have noticed that this creates certain problem with the exploitation vs. exploration selection phase of UCT. In short, the UCT score assigned to a state, takes into account the ratio between the number of times the parent state was visited and the number of times the child (whom we are calculating the UCT score for) was visited. As you can see from this diagram, when pulling in a state from the transposition table into a newly created branch of the tree, that ratio is all out of whack, with the child state having been visited a great number of times (from elsewhere in the tree) and the parent state having been visited a much smaller amount of times, which should technically be impossible.

So, using a transposition table and persisting the cumulative reward value of a state helps the exploitation part of the algorithm make better choices, but at the same time it skews the exploitation part of the algorithm in a potentially harmful way. Are you aware of any ways to counteract this unintended problem?

Solution

Intuitively, I expect you'll want to try the following.

For the exploitation part of UCT, you'll want to store and use the average scores W / V of child nodes. This average in its entirety can be shared through transpositions. So, in your example, you don't necessarily want to separately share the W = 300 and V = 600, but instead just share the average score W / V = 300 / 600 = 0.5. This means that, thanks to transpositions, you can share more accurate estimates of scores (estimates based on larger sample sizes).

For the exploration part of UCT, I think you'll want to use statistics "from the perspective" of the parent node (where you don't have a transposition), rather than the child node (which is a transposition of a node elsewhere in the tree). Instead of using visit counts of child nodes when selecting which child node to go to, this means you'll use visit counts collected per state + action pair in the parent node.

For example, consider we are in the node where you wrote V: 2, W: 300 (refer to this node as P), and we have to select a child node. The more standard implementation would loop over the children, and use the child node's visit counts. In your situation, I think it'd be better to instead loop over the actions in the node P, and keep track of visit counts for those actions instead of for the child nodes. In P, you'll not yet ever have taken the action leading to the transposition child node, so this visit count for the (P + action) pair will still be 0.

I do not personally have experience with the combination of MCTS + transpositions though, so you may wish to do additional research to see what others have come up with in the past. There are for example the following papers:

Transpositions and Move Groups in Monte Carlo Tree Search
BTT-Go: An Agent for Go That Uses a Transposition Table to Reduce the Simulations and the Supervision in the Monte-Carlo Tree Search
And there may be more if you look around, not sure.