
How does AlphaGo work?

As part of my Udacity Artificial Intelligence Nanodegree, one assignment was to summarize the original paper on AlphaGo by Silver et al. from 2016. The Go match between AlphaGo and Lee Sedol received considerable media coverage from both expert and mainstream publications. I recently stumbled upon my work from the online class and thought that, since it is now almost two years old, I could publish my little summary of how AlphaGo works here:

 

This is a summary of the paper "Mastering the game of Go with deep neural networks and tree search" by Silver et al., 2016.

The paper describes the methods used to construct the Go-playing artificial intelligence AlphaGo, which in March 2016 defeated the professional Go player Lee Sedol in a 4-1 series. A result like this had been estimated to be about a decade away (Bostrom, Superintelligence, 2014). The reason it was deemed infeasible is the huge search space of Go when tree search, the common approach for solving board games in software, is used. The search space is b^d with b ≈ 250 and d ≈ 150, making an exhaustive search to all endgames, and thus the derivation of an optimal path, impossible. To find a move in a finite amount of time, an approximation must be made at some point in the search. This is done by an evaluation function that estimates the value of a certain move without venturing (much) deeper into the tree, an approach that has so far achieved only limited results in Go.
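
To get a feeling for why exhaustive search is hopeless, here is a quick back-of-the-envelope calculation of b^d in plain Python, using the approximate numbers quoted above:

```python
import math

# Rough size of the Go game tree: branching factor b ~ 250, game length d ~ 150.
b, d = 250, 150
log10_positions = d * math.log10(b)                 # log10(b ** d)
print(f"b^d is roughly 10^{log10_positions:.0f}")   # -> roughly 10^360
```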

The paper's approach combines Monte Carlo Tree Search (MCTS) with a stack of neural networks to reduce the search space. MCTS uses a policy function, a probability distribution over the moves available in a position s, to select moves, effectively reducing the breadth of the search. It uses a technique called rollouts to play out a large number of games per move. A value function estimates the outcome from state s and replaces the subtree below it, reducing the depth of the search.
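
As a rough illustration (not the exact algorithm from the paper), a single MCTS simulation guided by a policy and a value function could look like the sketch below; `play`, `legal_moves`, `policy_fn` and `value_fn` are hypothetical stand-ins for a Go engine and the trained functions:

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # probability suggested by the policy function
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}        # move -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.0):
    # Prefer children with high value and high prior that were rarely visited yet.
    def score(child):
        exploration = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.value() + exploration
    return max(node.children.items(), key=lambda item: score(item[1]))

def simulate(state, root, policy_fn, value_fn, play, legal_moves):
    """One MCTS simulation: selection, expansion, evaluation, backup."""
    node, path = root, [root]
    # Selection: walk down the tree guided by priors and value statistics.
    while node.children:
        move, node = select_child(node)
        state = play(state, move)
        path.append(node)
    # Expansion: the policy function sets the priors of the new children.
    for move, prior in policy_fn(state, legal_moves(state)):
        node.children[move] = Node(prior)
    # Evaluation: the value function replaces a deep search below the leaf.
    value = value_fn(state)
    # Backup: propagate the evaluation along the visited path
    # (sign handling for alternating players is omitted for brevity).
    for visited in path:
        visited.visits += 1
        visited.value_sum += value
```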

In prior approaches using MCTS, both functions had to be hand-crafted. The value function would consist of manually selected and weighted features, an approach that performs poorly against strong human players.

In the paper, neural networks are used as the policy and value functions, replacing the manually selected features with self-learned ones. This basically means that AlphaGo learned what makes a good position instead of having someone program it into a fixed function. Instead of using only one neural network as in prior research, a pipeline of stacked networks was chosen to gradually improve the networks' accuracies by combining their experiences.

The first policy network consists of 13 layers and is trained by supervised learning. It outputs a probability distribution over all possible moves from a given board state. Its input is a simple representation of that state (positions of the stones, number of liberties after a move is played, etc.). Randomly sampled state-action pairs from human expert games were used as training data. After training, this policy network predicts the move a human expert would take with an accuracy of 57%, already a vast improvement over prior work, which reached a maximum accuracy of 44.4%.
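
As a sketch of what such a network might look like (far smaller than the paper's 13-layer architecture; the input-plane count of 48 and the layer widths here are only illustrative), in PyTorch:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny stand-in for the 13-layer policy network described in the paper."""
    def __init__(self, in_planes=48, board_size=19):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),       # one logit per board point
        )

    def forward(self, x):                          # x: (batch, in_planes, 19, 19)
        return self.body(x).flatten(1)             # logits over the 361 points

net = PolicyNet()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def supervised_step(states, expert_moves):
    """One SGD step: maximize the likelihood of the human expert's move.

    expert_moves holds the index (0..360) of the move the expert actually played;
    torch.softmax over the logits gives the move probability distribution.
    """
    loss = loss_fn(net(states), expert_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```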

The next step in the training pipeline was setting up a network with the same structure as the first policy network and initializing it with the same weights, but training it with reinforcement learning instead of supervised learning: it played games against randomly sampled previous iterations of the policy network. The resulting network achieves a winning rate of 80% against the first network and 85% against the strongest available open-source Go program, Pachi. This is an impressive improvement, considering that conventional neural network approaches in Go had only achieved a success rate of 11% against Pachi.
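
A heavily simplified sketch of that reinforcement-learning step, in the same PyTorch setting; `play_game` is a hypothetical helper that plays one game against a frozen earlier snapshot and returns the log-probabilities of the moves the current network chose plus the final outcome:

```python
import random
import torch

def reinforce_step(net, optimizer, opponent_pool, play_game):
    """One REINFORCE-style update from a single self-play game."""
    opponent = random.choice(opponent_pool)        # earlier policy iteration
    log_probs, outcome = play_game(net, opponent)  # outcome: +1 win, -1 loss
    # Make winning move sequences more likely, losing ones less likely.
    loss = -outcome * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return outcome
```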

Finally, a value network was constructed that outputs a single evaluation value for a given state. It was trained on a dataset of 30 million distinct positions, each sampled from a separate game, to avoid overfitting.
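
A corresponding sketch of the value network and its regression objective (again with purely illustrative sizes; `positions` and `outcomes` would come from the dataset of positions described above):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Tiny stand-in for the value network: one scalar evaluation per position."""
    def __init__(self, in_planes=48, board_size=19):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board_size * board_size, 1),
            nn.Tanh(),                               # evaluation in [-1, 1]
        )

    def forward(self, x):                            # x: (batch, in_planes, 19, 19)
        return self.body(x).squeeze(1)

def value_step(net, optimizer, positions, outcomes):
    """Regress the evaluation toward the final game outcome (+1 win / -1 loss)."""
    loss = nn.functional.mse_loss(net(positions), outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```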

The combination of the MCTS approach, the policy network to guide the rollouts, and the evaluation of leaf nodes by the value network led to a 100% success rate against every other available Go program (in a distributed version) and ultimately to the triumph over human Go champions.
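
In the full system, the evaluation of a leaf position actually mixes the value network's estimate with the result of a fast rollout; a minimal sketch of that combination (the paper reports a mixing weight of 0.5 working best):

```python
def leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    # lam = 0 uses only the value network, lam = 1 only the rollout result.
    return (1 - lam) * value_net_estimate + lam * rollout_outcome
```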
