In the context of a project following the UC Berkeley Pacman AI project (its second part), I want to implement the minimax algorithm, without alpha-beta pruning, for an adversarial agent, in a layout small enough that recursion is not a problem.
Having defined the problem as a 2-player (we assume only 1 ghost), turn-taking, zero-sum game with perfect information, applying the recursive algorithm would be pretty trivial. However, since many different move sequences can end up in the same game state (defined as a tuple of pacman's position, the ghost's position, the food positions, and the player currently playing), I wanted to find a way to avoid recomputing those states.
I searched and read some things about transposition tables. I am not really sure how to use such a method, however, and what I thought I should implement was the following: each time a state that has not yet been visited is expanded, add it to a 'visited' set. If the state has already been expanded, then if it's the max player's turn (pacman), return a +inf value (which would normally never be chosen by the min player); if it's min's turn, return -inf accordingly.
The problem with this idea, I think (and the reason why it works for some layouts but not others), is that when I hit a node all of whose children have already been expanded, the only values I have to choose from are +/- infinities. This causes an infinite value to propagate upwards and be selected, while in fact it is possible that this game state leads to a loss. I think I have understood the problem, but I can't seem to find a way to get around it.
Is there any other method I could use to avoid computing repeated game states? Is there a standard approach to this that I am not aware of?
Here is some pseudocode:
from math import inf

def maxPlayer(currentState, visitedSet):
    if not currentState.isTerminalState():
        scores = []
        for nextState, action in currentState.generateMaxSuccessors():
            if nextState not in visitedSet:
                visitedSet.add(nextState)  # mark nextState as visited
                scores.append(minPlayer(nextState, visitedSet))
        if scores:
            return max(scores)
        else:
            return +inf  # The problem is HERE!
    else:
        return evalFnc(currentState)

def minPlayer(currentState, visitedSet):
    if not currentState.isTerminalState():
        scores = []
        for nextState, action in currentState.generateMinSuccessors():
            if nextState not in visitedSet:
                visitedSet.add(nextState)  # mark nextState as visited
                scores.append(maxPlayer(nextState, visitedSet))
        if scores:
            return min(scores)
        else:
            return -inf  # The problem is also HERE!
    else:
        return evalFnc(currentState)
Note that the first player to play is max, and I choose the action that has the highest score. Nothing changes whether I take the infinite values into account or not: there are still instances of the game where the agent loses or loops infinitely.
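For completeness, the root-level selection looks roughly like this (a minimal sketch reusing the illustrative names from the pseudocode above):

from math import inf

def chooseAction(rootState):
    # Pacman (max) moves first: pick the action whose subtree
    # yields the highest minimax score, mirroring the shared
    # visited-set scheme from the pseudocode above.
    visited = {rootState}
    bestScore, bestAction = -inf, None
    for nextState, action in rootState.generateMaxSuccessors():
        if nextState not in visited:
            visited.add(nextState)
            score = minPlayer(nextState, visited)
            if score > bestScore:
                bestScore, bestAction = score, action
    return bestAction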
The problem I was having was finally related to the definition of a 'game state' and how 'repeated states' had to be handled.
In fact, consider the game state tree and a particular game state x, which is identified by the following:

- pacman's position,
- the ghost's position,
- the positions of the remaining food,
- the player whose turn it is.
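Concretely, such a state can be represented as a hashable tuple, so that it can be stored directly in a set; a small sketch with illustrative field names (not the project's actual API):

from collections import namedtuple

# A game state as defined above; being hashable, it can go
# straight into a 'visited' set. Field names are illustrative.
GameState = namedtuple('GameState',
                       ['pacmanPos', 'ghostPos', 'food', 'playerToMove'])

# food is a frozenset so the whole tuple stays hashable
x = GameState(pacmanPos=(1, 1), ghostPos=(3, 2),
              food=frozenset({(2, 3), (4, 4)}), playerToMove='max')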
Now suppose you start going down a certain branch of the tree and at some point you visit the node x. Assuming it has not already been visited before and it is not a terminal state for the game, this node should be added to the set of visited nodes.
Now suppose that once you're done with this particular branch of the tree, you start exploring a different one. After a certain, undetermined number of steps you get once again to a node identified as x. This is where the problem with the code in the question lies. In fact, while the game state as defined is exactly the same, the path followed to get to this state is not (since we are currently on a new, different branch than the original one). Obviously, treating the state as already visited, or reusing the utility computed in the earlier branch, is wrong and produces unexpected results.
The solution to this problem is, simply, to have a separate set of visited nodes for each branch of the tree. This way the situation described above is avoided. From there on, there are two strategies that can be considered:

- Simply skip any state that has already been visited on the current path, never expanding it a second time.
- Treat a state that is repeated on the current path as an infinite loop that should never be entered, and assign it -inf as a utility.

In more practical terms, the first approach is really simple to implement: it suffices to copy the set every time you pass it as an argument to the next player (so that values in different branches won't affect each other). This will make the algorithm significantly slower, and alpha-beta pruning should be applied even for very small, simple mazes (1 food pellet and maybe 7x7 mazes). In any other case Python will either complain about recursion depth or simply take too long to solve (more than a few minutes). It is, however, correct.
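A minimal sketch of the first strategy, reusing the question's illustrative names (minPlayer is the mirror image with min(scores)); the essential change is visitedSet | {nextState}, which hands each child a fresh copy of the set instead of mutating a shared one:

def maxPlayer(currentState, visitedSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    scores = []
    for nextState, action in currentState.generateMaxSuccessors():
        if nextState not in visitedSet:
            # Copy-on-pass: the child's set extends this path only,
            # so sibling branches never see each other's states.
            scores.append(minPlayer(nextState, visitedSet | {nextState}))
    if scores:
        return max(scores)
    # Every successor loops back onto the current path; falling
    # back to the evaluation function here is one sensible option.
    return evalFnc(currentState)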
The second approach is more complicated. I have no formal proof of correctness, although intuitively it seems to work. It is significantly faster and also compatible with alpha-beta pruning.
The corresponding pseudocode is easy to derive from the explanation; a possible sketch follows.
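For what it's worth, here is how I would sketch that second strategy, under the same illustrative state API as above: one shared set always holds exactly the states on the current path (added on the way down, removed when backtracking), and any state found there again is scored as a loop with -inf:

from math import inf

def maxPlayer(currentState, pathSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    best = -inf
    for nextState, action in currentState.generateMaxSuccessors():
        if nextState in pathSet:
            score = -inf               # a repeat on this path is a loop: score it -inf
        else:
            pathSet.add(nextState)     # push onto the current path...
            score = minPlayer(nextState, pathSet)
            pathSet.remove(nextState)  # ...and pop when backtracking
        best = max(best, score)
    return best

def minPlayer(currentState, pathSet):
    if currentState.isTerminalState():
        return evalFnc(currentState)
    best = inf
    for nextState, action in currentState.generateMinSuccessors():
        if nextState in pathSet:
            score = -inf               # loops stay -inf from max's point of view
        else:
            pathSet.add(nextState)
            score = maxPlayer(nextState, pathSet)
            pathSet.remove(nextState)
        best = min(best, score)
    return best

Because the set is restored after each child and the children are scored one at a time, this version avoids all the copying and drops straight into alpha-beta pruning.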