Exploration and Exploitation: Machine Learning for Games
In any domain with clear feedback and the opportunity to simulate random “failures” cheaply, Monte Carlo Tree Search can help identify the current best action.
I recently saw the AlphaGo documentary film and it blew my mind. Essentially, a computer was able to beat a world champion in a five-game match of Go, a game with an overwhelmingly large search space played on a 19 × 19 board. One of the principal techniques behind its success was Monte Carlo Tree Search (MCTS). While it has received adequate exposition elsewhere,1 I thought that a video could be a nice addition to what’s already out there.
Monte Carlo Tree Search
How does MCTS work?
First, the computer begins with (1) the current state of the game, $state_0$, (2) rules for determining when the game is over, and (3) rules for determining legal moves (or game actions). The tree starts at $state_0$; a leaf represents a potential legal move, and any sequence of legal moves is composed of $state_0$ and a series of descendant leaves connected by branches.
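To make this concrete, here is a minimal sketch (my own illustration, not code from the video) of how such a tree node might be represented; the field and helper names (`state`, `parent`, `children`, `p`, `n`, `legal_moves`, `apply_move`) are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node in the MCTS tree: a game state reached by some legal move."""
    state: object                      # the board position this node represents
    parent: Optional["Node"] = None    # None only for state_0, the root
    children: List["Node"] = field(default_factory=list)
    p: float = 0.0                     # running sum of results: win +1, loss -1, tie 0
    n: int = 0                         # number of games played through this node

    def expand(self, legal_moves, apply_move):
        """Add one child for every legal move available from this node's state."""
        for move in legal_moves(self.state):
            self.children.append(Node(state=apply_move(self.state, move), parent=self))
```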
For each new round of a game to be played within the MCTS process, we will travel as far down the tree as we can following this rule:
- At each leaf, $L$, we will either expand our tree with a new set of descendant leaves connected to $L$ (representing legal moves) or choose the connected descendant leaf with the greatest upper confidence bound (UCB1), where the confidence bound (CB1) is defined by $\bar{V}_i$ (the average win rate of this leaf and all its descendants), $N$ (the total number of MCTS games played), and $n_i$ (the number of games started at this leaf or one of its descendants):

$$\mathrm{CB1} = \bar{V}_i \pm 2\sqrt{\frac{\ln N}{n_i}}$$

The point of this formula is to balance the exploitation of promising moves, $\bar{V}_i$, with the exploration of seemingly less promising moves, $2\sqrt{\frac{\ln N}{n_i}}$.
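As a rough sketch of how this score might be computed (assuming the hypothetical `Node` fields above, and giving never-visited leaves an infinite score so each is tried at least once, matching the Inf value described below):

```python
import math

def ucb1(node, total_games, c=2.0):
    """UCB1 score for a node: average result plus an exploration bonus.

    node.p / node.n plays the role of V-bar_i, total_games is N, and
    node.n is n_i in the formula above.
    """
    if node.n == 0:
        return math.inf                       # never-visited leaves are always candidates
    exploitation = node.p / node.n            # how well games through this node have gone
    exploration = c * math.sqrt(math.log(total_games) / node.n)
    return exploitation + exploration
```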
MCTS in Action
In the video, Player 1 is “O” and Player 2 is “X”.
To avoid boredom, I’ve started the MCTS process in the middle of a game when there are only 5 possible moves for Player 1 (rather than 9 at the start of a game).2
We can see this to be the case as the first thing the MCTS process does is expand the tree from $state_0$ to include all of Player 1’s legal moves.
Notice as well that above each game state there are two numbers, $p/n$. The first, $p$, represents the sum of wins (+1), losses (-1), and ties (0) of any games starting from this leaf or any of its descendants. The second, $n$, represents the total number of games starting from this leaf or any of its descendants.
Below each game state is a third number, $\mathrm{UCB1} = \mathrm{Inf}$. Here Inf stands for infinity and is to be replaced using the formula above once a game has started from this leaf.
In order to play a game, the MCTS process examines all the leaves stemming from a game state and asks, “which of these has the greatest UCB1?”
In our case, the first leaf chosen places an “O” in the bottom-middle position of the board; this choice is random among all legal moves stemming from the current game state that share the maximum UCB1.
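A selection step in this spirit, building on the hypothetical `ucb1` sketch above and breaking ties at the maximum UCB1 at random, might look like:

```python
import random

def select_child(node, total_games):
    """Pick the child with the greatest UCB1, choosing randomly among ties."""
    scores = [ucb1(child, total_games) for child in node.children]
    best = max(scores)
    candidates = [child for child, score in zip(node.children, scores) if score == best]
    return random.choice(candidates)
```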
If this leaf has never been visited by the MCTS process, you play out a game. While you can use more sophisticated rules, a simple way to play out a game is to have “X” make a random move, followed by a random move by “O”, and so on until there is a win, loss, or draw.
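A bare-bones random playout along these lines could look like the following sketch; `legal_moves`, `apply_move`, and `result` are hypothetical stand-ins for the game’s own rules, with `result` returning +1, -1, 0, or None while the game is still in progress:

```python
import random

def rollout(state, legal_moves, apply_move, result):
    """Play uniformly random moves until the game ends; return the outcome for Player 1."""
    while result(state) is None:              # None means the game is not over yet
        move = random.choice(legal_moves(state))
        state = apply_move(state, move)
    return result(state)                      # +1 win, -1 loss, 0 draw (for Player 1)
```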
This win, loss, or draw (for Player 1) now needs to be backpropagated up the tree. The game state where “O” went to the bottom-middle of the board changes its $p/n$ to $1/1$ and adds these two values to all its ancestors’ $p/n$ accumulations. Now that all $p/n$ values are up to date, we recalculate UCB1 for every game state in the tree.
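A minimal backpropagation sketch, again assuming the hypothetical `Node` fields from earlier, could be:

```python
def backpropagate(node, result):
    """Add one playout's result to this leaf and to every ancestor up to the root."""
    while node is not None:
        node.p += result      # accumulate wins (+1), losses (-1), and ties (0)
        node.n += 1           # one more game has passed through this node
        node = node.parent
```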
Then you start all over. After the next round, the tree looks like this—two wins.
This doesn’t look like much at first, but if we advance further into the video, we can see that many leaves have been visited and thus have proper UCB1 values.
Game #33 chooses to place the “O” at the bottom-left position.
Note that this is neither the position with the greatest win rate (move to the center of the board), nor the least explored (move to the bottom-right), but is a balance between exploration and exploitation.
As this leaf has been visited before (and a game has been played out), this time we attempt to go one descendant leaf further.
That is, the MCTS process now takes the perspective of Player 2 (“X”) and asks what their best move is according to UCB1.
Here, after expanding the tree with a new set of descendant leaves, “X” takes the bottom-middle of the board.
Note that we’ve flipped around the results to match the new player’s perspective—what was a 6/8 game state for Player 1 becomes a -6/8 game state for Player 2.
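One common way to handle this flip in code is to negate the result at each level during backpropagation, so that every node scores games from the perspective of the player choosing at that level; a hedged variant of the earlier sketch:

```python
def backpropagate_alternating(node, result):
    """Backpropagate a result, flipping its sign at each level of the tree."""
    while node is not None:
        node.p += result
        node.n += 1
        result = -result      # e.g. a 6/8 state for Player 1 reads as -6/8 for Player 2
        node = node.parent
```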
When it is time for the MCTS process to select Player 1’s next move to actually play on the board, it shifts from looking at the game states’ UCB1 to looking for the move with the greatest $n$ value.
After 48 games, Player 1 had moved to the bottom-left of the board 14 times for a win/loss/draw score of 0.786.
This will be Player 1’s move.
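Selecting the actual move by visit count rather than by UCB1 might be sketched as:

```python
def best_move(root):
    """After the search budget is exhausted, play the child visited most often."""
    return max(root.children, key=lambda child: child.n)
```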
Check https://towardsdatascience.com/monte-carlo-tree-search-in-reinforcement-learning-b97d3e743d0f, https://www.analyticsvidhya.com/blog/2019/01/monte-carlo-tree-search-introduction-algorithm-deepmind-alphago/, https://en.wikipedia.org/wiki/Monte_Carlo_tree_search, https://towardsdatascience.com/from-scratch-implementation-of-alphazero-for-connect4-f73d4554002a?gi=1a905daff25e, https://medium.com/@jonathan_hui/alphago-zero-a-game-changer-14ef6e45eba5, and https://medium.com/@jonathan_hui/monte-carlo-tree-search-mcts-in-alphago-zero-8a403588276a↩
Technically, due to symmetry, there are only 3 possible first moves in tic-tac-toe: center, side, and corner.↩