artificial-intelligence, game-physics, monte-carlo-tree-search

What are the concepts of "Last Good Reply" and "Rapid Action Value Estimation" in Monte Carlo simulation?


I have developed a simple player for the game of Hex based on Monte Carlo Tree Search. Now I want to extend it using RAVE (Rapid Action Value Estimation) and LGR (Last Good Reply). The articles are here and here.
I was wondering if anyone here has used any of these methods to improve the tree search performance and could help me understand them.
I also want to know why these algorithms are called AMAF (All Moves As First) heuristics.


Solution

  • In Monte Carlo tree search for games, which borrows ideas from reinforcement learning, there are two ways to back-propagate the result of a simulation: UCT and AMAF.

    The UCT method back-propagates the result only along the path traversed during the selection phase: each node visited during selection is updated for exactly the state it represents. With AMAF, the moves played during the roll-out phase are also recorded, and during back-propagation they are credited at the nodes along the selection path as well, without regard to the state in which they were actually played. This is why such heuristics are called AMAF (All Moves As First): every move that occurs later in the simulation is treated as if it had been played first, from the current node. A minimal sketch of both update rules is given at the end of this answer.

    UCT gives a precise, local value for each (state, action) pair, but it converges slowly. The AMAF heuristic, on the other hand, converges very quickly, but its (state, action) values are too general to be reliable on their own.

    We can combine the benefits of both strategies by blending the two values with a coefficient that shifts from AMAF towards UCT as a node accumulates visits:

    a * UCT + (1 - a) * AMAF

    where a grows from 0 towards 1 with the node's visit count, so the weight (1 - a) on the AMAF value decreases over time. This is the RAVE (Rapid Action Value Estimation) heuristic; a sketch of one possible weighting schedule is given below.

    Last Good Reply is AMAF-based, but it can also be combined with RAVE. Its general idea is that during the playout phase we play moves in reply to the opponent's moves; if a reply turns out to be successful (the side that played it wins the playout), we store it and try it again against the same opponent move in subsequent playouts (see the last sketch below).
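
To make the difference between the two kinds of back-propagation concrete, here is a minimal Python sketch. The Node class and its fields are illustrative assumptions rather than code from the question or the linked articles, and the handling of the player-to-move perspective is deliberately simplified.

    class Node:
        def __init__(self, move=None, parent=None):
            self.move = move            # move that led to this node
            self.parent = parent
            self.children = {}          # move -> child Node
            self.visits = 0             # exact (UCT) statistics
            self.wins = 0.0
            self.amaf_visits = 0        # AMAF statistics
            self.amaf_wins = 0.0

    def backprop_uct(leaf, reward):
        """Credit only the nodes actually traversed during selection."""
        node = leaf
        while node is not None:
            node.visits += 1
            node.wins += reward
            reward = 1.0 - reward       # switch point of view each ply
            node = node.parent

    def backprop_amaf(leaf, playout_moves, reward):
        """Additionally credit every child whose move occurred anywhere in
        the playout, regardless of the state it was actually played in."""
        node = leaf
        while node is not None:
            for move, child in node.children.items():
                if move in playout_moves:
                    child.amaf_visits += 1
                    child.amaf_wins += reward
            reward = 1.0 - reward
            node = node.parent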
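
Building on that node layout, the RAVE blend could be used during selection as sketched here; rave_k, which controls how quickly the AMAF weight fades out, is an assumed tuning constant rather than a value prescribed by the method.

    import math

    def rave_value(child, rave_k=300):
        """Blend the exact UCT mean with the AMAF mean; the AMAF share
        shrinks as the child accumulates real visits."""
        uct_mean = child.wins / child.visits if child.visits else 0.0
        amaf_mean = child.amaf_wins / child.amaf_visits if child.amaf_visits else 0.0
        a = child.visits / (child.visits + rave_k)   # grows from 0 towards 1
        return a * uct_mean + (1.0 - a) * amaf_mean

    def select_child(node, c=1.4):
        """UCB-style selection using the blended (RAVE) value."""
        return max(
            node.children.values(),
            key=lambda ch: rave_value(ch)
            + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1)),
        )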
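
Finally, a minimal Last-Good-Reply sketch for the playout policy. It assumes moves are hashable and that the playout is recorded as a list of (player, move) pairs; the table and function names are illustrative, not taken from the linked articles.

    import random

    reply_table = {}   # (player, opponent_move) -> that player's last good reply

    def choose_playout_move(player, opponent_last_move, legal_moves):
        """Prefer the reply that last worked against this opponent move,
        falling back to a uniformly random legal move."""
        reply = reply_table.get((player, opponent_last_move))
        if reply in legal_moves:
            return reply
        return random.choice(list(legal_moves))

    def record_replies(playout, winner):
        """After the playout, store each of the winner's moves as the good
        reply to the opponent move that immediately preceded it."""
        for (_, prev_move), (player, move) in zip(playout, playout[1:]):
            if player == winner:
                reply_table[(player, prev_move)] = move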