In the context of a project following the UC Berkeley Pacman AI project (its second part), I want to implement the minimax algorithm, without alpha-beta pruning, for an adversarial agent, in a layout small enough that recursion is not a problem.
Having defined the problem as a 2-player (we assume only 1 ghost), turn-taking, zero-sum game with perfect information, applying the recursive algorithm would be pretty trivial. However, since many different move sequences can end up in the same game state (defined as a tuple of pacman's position, the ghost's position, the food positions, and the player currently playing), I wanted to find a way to avoid recomputing those states.
I searched and read some things about transposition tables. I am not really sure how to use such a method, however, and what I thought I should implement was the following: each time a state that has not yet been visited is expanded, add it to a 'visited' set. If the state has already been expanded, then if it's the max player's turn (pacman), return a +inf value (which would normally never be chosen by the min player); if it's min's turn, return -inf accordingly.
The problem with this idea, I think (and the reason why it works for some layouts but not others), is that when I hit a node all of whose children have already been expanded, the only values I have to choose from are +/- infinities. This causes an infinite value to propagate upwards and be selected, while in fact it is possible that this game state leads to a loss. I think I have understood the problem, but I can't seem to find a way to get around it.
Is there any other method I could use to avoid computing repeated game states? Is there a standard approach to this that I am not aware of?
Here is some pseudocode:
from math import inf

def maxPlayer(currentState, visitedSet):
    if not currentState.isTerminalState():
        scores = []
        for nextState, action in currentState.generateMaxSuccessors():
            if nextState not in visitedSet:
                visitedSet.add(nextState)  # mark nextState as visited
                scores.append(minPlayer(nextState, visitedSet))
        if scores:
            return max(scores)
        else:
            return +inf  # The problem is HERE!
    else:
        return evalFnc(currentState)

def minPlayer(currentState, visitedSet):
    if not currentState.isTerminalState():
        scores = []
        for nextState, action in currentState.generateMinSuccessors():
            if nextState not in visitedSet:
                visitedSet.add(nextState)  # mark nextState as visited
                scores.append(maxPlayer(nextState, visitedSet))
        if scores:
            return min(scores)
        else:
            return -inf  # The problem is also HERE!
    else:
        return evalFnc(currentState)
Note that the first player to play is max, and I choose the action that has the highest score. Nothing changes whether I take the infinite values into account or not: there are still instances of the game where the agent loses or loops infinitely.
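For completeness, the root-level selection looks roughly like this (a minimal sketch reusing the illustrative names from the pseudocode above):

from math import inf

def chooseAction(rootState):
    # Pacman (max) moves first: pick the action whose subtree
    # yields the highest minimax score, mirroring the shared
    # visited-set scheme from the pseudocode above.
    visited = {rootState}
    bestScore, bestAction = -inf, None
    for nextState, action in rootState.generateMaxSuccessors():
        if nextState not in visited:
            visited.add(nextState)
            score = minPlayer(nextState, visited)
            if score > bestScore:
                bestScore, bestAction = score, action
    return bestAction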
The problem I was having was finally related to the definition of a 'game state' and how 'repeated states' had to be handled.
In fact, consider the game state tree and a particular game state x, which is identified by the following:

- pacman's position,
- the ghost's position,
- the positions of the remaining food,
- the player whose turn it is.
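Concretely, such a state can be represented as a hashable tuple, so that it can be stored directly in a set; a small sketch with illustrative field names (not the project's actual API):

from collections import namedtuple

# A game state as defined above; being hashable, it can go
# straight into a 'visited' set. Field names are illustrative.
GameState = namedtuple('GameState',
                       ['pacmanPos', 'ghostPos', 'food', 'playerToMove'])

# food is a frozenset so the whole tuple stays hashable
x = GameState(pacmanPos=(1, 1), ghostPos=(3, 2),
              food=frozenset({(2, 3), (4, 4)}), playerToMove='max')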
Now suppose you start going down a certain branch of the tree and at some point you visit the node x. Assuming it has not already been visited before and it is not a terminal state for the game, this node should be added to the set of visited nodes.
Now suppose that once you're done with this particular branch of the tree, you start exploring a different one. After a certain, undetermined number of steps you get once again to a node identified as x. This is where the problem with the code in the question lies. In fact, while the game state as defined is exactly the same, the path followed to get to this state is not (since we are currently on a new, different branch than the original one). Obviously, treating the state as already visited, or reusing the utility computed in the earlier branch, is wrong and produces unexpected results.
The solution to this problem is, simply, to have a separate set of visited nodes for each branch of the tree. This way the situation described above is avoided. From there on, there are two strategies that can be considered:

- Simply skip any state that has already been visited on the current path, never expanding it a second time.
- Treat a state that is repeated on the current path as an infinite loop that should never be entered, and assign it -inf as a utility.

In more practical terms, the first approach is really simple to implement: it suffices to copy the set every time you pass it as an argument to the next player (so that values in different branches won't affect each other). This will make the algorithm significantly slower, and alpha-beta pruning should be applied even for very small, simple mazes (1 food pellet and maybe 7x7 mazes). In any other case Python will either complain about recursion depth or simply take too long to solve (more than a few minutes). It is, however, correct.
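A minimal sketch of the first strategy, reusing the question's illustrative names (minPlayer is the mirror image with min(scores)); the essential change is visitedSet | {nextState}, which hands each child a fresh copy of the set instead of mutating a shared one:

def maxPlayer(currentState, visitedSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    scores = []
    for nextState, action in currentState.generateMaxSuccessors():
        if nextState not in visitedSet:
            # Copy-on-pass: the child's set extends this path only,
            # so sibling branches never see each other's states.
            scores.append(minPlayer(nextState, visitedSet | {nextState}))
    if scores:
        return max(scores)
    # Every successor loops back onto the current path; falling
    # back to the evaluation function here is one sensible option.
    return evalFnc(currentState)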
The second approach is more complicated. I have no formal proof of correctness, although intuitively it seems to work. It is significantly faster and also compatible with alpha-beta pruning.
The corresponding pseudocode is easy to derive from the explanation; a possible sketch follows.
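For what it's worth, here is how I would sketch that second strategy, under the same illustrative state API as above: one shared set always holds exactly the states on the current path (added on the way down, removed when backtracking), and any state found there again is scored as a loop with -inf:

from math import inf

def maxPlayer(currentState, pathSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    best = -inf
    for nextState, action in currentState.generateMaxSuccessors():
        if nextState in pathSet:
            score = -inf               # a repeat on this path is a loop: score it -inf
        else:
            pathSet.add(nextState)     # push onto the current path...
            score = minPlayer(nextState, pathSet)
            pathSet.remove(nextState)  # ...and pop when backtracking
        best = max(best, score)
    return best

def minPlayer(currentState, pathSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    best = inf
    for nextState, action in currentState.generateMinSuccessors():
        if nextState in pathSet:
            score = -inf               # loops stay -inf from max's point of view
        else:
            pathSet.add(nextState)
            score = maxPlayer(nextState, pathSet)
            pathSet.remove(nextState)
        best = min(best, score)
    return best

Because the set is restored after each child and the children are scored one at a time, this version avoids all the copying and drops straight into alpha-beta pruning.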