Exploration and Exploitation: Machine Learning for Games

In any domain with clear feedback and an opportunity for random playouts, Monte Carlo Tree Search can help identify the current best action.

I recently saw the AlphaGo documentary film and it blew my mind. Essentially, a computer was able to beat a world champion in a best-of-five match of Go, a game with an overwhelmingly large search space on its 19 x 19 board. One of the principal techniques behind its success was Monte Carlo Tree Search (MCTS). While it has received adequate exposition elsewhere,1 I thought that a video could be a nice addition to what’s already out there.

MCTS in Action

In the video, Player 1 is “O” and Player 2 is “X”. To avoid boredom, I’ve started the MCTS process in the middle of a game when there are only 5 possible moves for Player 1 (rather than 9 at the start of a game).2 We can see this is the case because the first thing the MCTS process does is expand the tree from state0 to include all Player 1’s legal moves.

Notice as well that above each game state there are two numbers, p/n. The first, p, represents the sum of wins (+1), losses (-1), and ties (0) of any games starting from this leaf or any of its descendants. The second, n, represents the total number of games starting from this leaf or any of its descendants. Below each game state is a third number, UCB1 = Inf. Here Inf stands for infinity and is to be replaced using the formula above once a game has started from this leaf.
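As a sketch, the standard UCB1 formula (the node’s mean value plus an exploration bonus) can be written in Python. The function name and the common choice of exploration constant c = √2 are my own assumptions, not something from the video:

```python
import math

def ucb1(p, n, N, c=math.sqrt(2)):
    """UCB1 for a node with score sum p over n visits, whose parent
    has N total visits. Unvisited nodes (n == 0) score infinity, so
    they are always tried at least once."""
    if n == 0:
        return math.inf
    return p / n + c * math.sqrt(math.log(N) / n)
```

The first term (p / n) rewards moves that have done well so far (exploitation); the second term grows for rarely visited nodes (exploration).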

In order to play a game, the MCTS process examines all the leaves stemming from a game state and asks, “which of these has the greatest UCB1?” In our case, the first leaf chosen places an “O” in the bottom-middle position of the board—this choice is random among all legal moves stemming from the current game state that share the maximum UCB1.
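The selection rule just described, including the random tie-break, can be sketched like this; representing children as `(move, ucb1)` pairs is an illustrative assumption:

```python
import random

def select_child(children):
    """children: list of (move, ucb1) pairs. Returns a move chosen
    uniformly at random among those sharing the maximum UCB1."""
    best = max(u for _, u in children)
    return random.choice([m for m, u in children if u == best])
```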

If this leaf has never been visited by the MCTS process, you play out a game. While you can use more sophisticated rules, a simple way to play out a game is to have “X” make a random move, followed by a random move by “O”, and so on until there is a win, loss, or draw.
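A minimal random playout might look like this; the game callbacks (`legal_moves`, `apply_move`, `result`) are hypothetical placeholders for whatever game representation you use, with `result` returning +1/-1/0 from Player 1’s perspective once the game ends and `None` otherwise:

```python
import random

def rollout(state, legal_moves, apply_move, result):
    """Play uniformly random moves until the game ends, then return
    the terminal result from Player 1's perspective."""
    while result(state) is None:
        state = apply_move(state, random.choice(legal_moves(state)))
    return result(state)
```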

This win, loss, or draw (for Player 1) now needs to be backpropagated up the tree. The game state where “O” went to the bottom-middle of the board changes its p/n to 1/1 and adds these two values to all its ancestors’ p/n accumulations. Now that all p/n values are up to date, we recalculate UCB1 for every game state in the tree.
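The backpropagation step can be sketched as follows, assuming hypothetical node objects with `p`, `n`, and `parent` attributes (the root’s parent being `None`):

```python
def backpropagate(node, result):
    """Add the playout result (+1 win, -1 loss, 0 draw) and one visit
    to the node that was played out and to every ancestor."""
    while node is not None:
        node.p += result
        node.n += 1
        node = node.parent
```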

Then you start all over. After the next round, the tree looks like this—two wins.

This doesn’t look like much at first, but if we advance further into the video, we can see that many leaves have been visited and thus have proper UCB1 values. Game #33 chooses to place the “O” at the bottom-left position. Note that this is neither the position with the greatest win rate (move to the center of the board), nor the least explored (move to the bottom-right), but is a balance between exploration and exploitation.

As this leaf has been visited before (and a game has been played out), this time we attempt to go one descendant leaf further. That is, the MCTS process now takes the perspective of Player 2 (“X”) and asks what is their best move according to UCB1. Here, after expanding the tree with a new set of descendant leaves, “X” takes the bottom-middle of the board. Note that we’ve flipped around the results to match the new player’s perspective—what was a 6/8 game state for Player 1 becomes a -6/8 game state for Player 2.
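That sign flip can be folded directly into backpropagation, negamax-style: negate the result at each level so every node stores scores from the viewpoint of the player to move there. A sketch, again assuming hypothetical node objects with `p`, `n`, and `parent` attributes:

```python
def backpropagate_negamax(node, result):
    """Backpropagate a playout result, flipping its sign at each
    level so each node's p is from its own player's perspective."""
    while node is not None:
        node.p += result
        node.n += 1
        result = -result  # a 6/8 state for one player is -6/8 for the other
        node = node.parent
```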

When it is time for the MCTS process to select Player 1’s next move to actually play on the board, it shifts from looking at the game states’ UCB1 to looking for the move with the greatest n value. After 48 games, Player 1 had moved to the bottom-left of the board 14 times for a win/loss/draw score of 0.786. This will be Player 1’s move.
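The final move choice, maximizing visit count n rather than UCB1, can be sketched as below; representing the root’s children as `(move, p, n)` triples is an illustrative assumption:

```python
def best_move(root_children):
    """root_children: list of (move, p, n) triples. Returns the move
    of the most-visited child -- the most robust choice, since n only
    grows large when repeated selection keeps favoring that branch."""
    return max(root_children, key=lambda child: child[2])[0]
```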


Joseph de la Torre Dwyer
