Why DeepMind AlphaGo Zero is a game changer for AI research

DeepMind, a London based artificial intelligence (AI) company currently owned by Alphabet, recently made great strides in AI with its AlphaGo program.

It all began in October 2015 when the program beat the European Go champion Fan Hui 5-0, in a game of Go. This was the very first time an AI defeated a professional Go player. Earlier, computers were only known to have played Go at the "amateur" level.

Then, the company made headlines again in 2016 after its AlphaGo program beat Lee Sedol, a professional Go player (a world champion) with a score of 4-1 in a five-game match. Furthermore, in late 2017, an improved version of the program called AlphaGo Zero defeated AlphaGo 100 games to 0.

The best part? AlphaGo Zero's strategies were self-taught i.e it was trained without any data from human games. AlphaGo Zero was able to defeat its predecessor in only three days time with lesser processing power than AlphaGo. However, the original AlphaGo, on the other hand required months to learn how to play.

All these facts beg the questions: what makes AlphaGo Zero so exceptional? Why is it such a big deal? How does it even work?

So, without further ado, let’s dive into the what, why, and how of DeepMind’s AlphaGo Zero.

What is DeepMind AlphaGo Zero?

Simply put, AlphaGo Zero is the strongest Go program in the world (with the exception of AlphaZero). As mentioned before, it monumentally outperforms all previous versions of AlphaGo. Just check out the graph below which compares the Elo rating of the different versions of AlphaGo.

deepmind-alphago-zero-game-changer-for-ai-research-img-0

Source: DeepMind

The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess and Go. It is named after its creator Arpad Elo, a Hungarian-American physics professor.

Now, all previous versions of AlphaGo were trained using human data. The previous versions learned and improved upon the moves played by human experts/professional Go players.

But AlphaGo Zero didn’t use any human data whatsoever. Instead, it had to learn completely from playing against itself.

According to DeepMind's Professor David Silver, the reason that playing against itself enables it to do so much better than using strong human data is that AlphaGo always has an opponent of just the right level. So it starts off extremely naive, with perfectly random play. And yet at every step of the learning process, it has an opponent (a “sparring partner”) that’s exactly calibrated to its current level of performance. That is, to begin with, these players are terribly weak but over time they become progressively stronger and stronger.

Why is reinforcement learning such a big deal?

People tend to assume that machine learning is all about big data and massive amounts of computation. But actually, with AlphaGo Zero, AI scientists at DeepMind realized that algorithms matter much more than the computing processing power or data availability. AlphaGo Zero required less computation than previous versions and yet it was able to perform at a much higher level due to using much more principled algorithms than before.

It is a system which is trained completely from scratch, starting from random behavior, and progressing from first principles to really discover tabula rasa, in playing the game of Go. It is, therefore, no longer constrained by the limits of human knowledge.

Note that AlphaGo Zero did not use zero-shot learning which essentially is the ability of the machine to solve a task despite not having received any training for that task.

How does it work?

AlphaGo Zero is able to achieve all this by employing a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. As explained previously, the system starts off with a single neural network that knows absolutely nothing about the game of Go. By combining this neural network with a powerful search algorithm, it then plays games against itself. As it plays more and more games, the neural network is updated and tuned to predict moves, and even the eventual winner of the games.

This revised neural network is then recombined with the search algorithm to generate a new, stronger version of AlphaGo Zero, and the process repeats. With each iteration, the performance of the system enhances with each iteration, and the quality of the self-play games’ advances, leading to increasingly accurate neural networks and ever-more powerful versions of AlphaGo Zero.

Now, let’s dive into some of the technical details that make this version of AlphaGo so much better than all its forerunners. AlphaGo Zero's neural network was trained using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. Only four Tensor Processing Units (TPUs) were used for inference. And of course, the neural network initially knew nothing about Go beyond the rules.

Both AlphaGo and AlphaGo Zero took a general approach to play Go. Both evaluated the Go board and chose moves using a combination of two methods:

Conducting a “lookahead” search: This means looking ahead several moves by simulating games, and hence seeing which current move is most likely to lead to a “good” position in the future.

Assessing positions based on an “intuition” of whether a position is “good” or “bad” and is likely to result in a win or a loss.

Go is a truly intricate game which means computers can’t merely search all possible moves using a brute force approach to discover the best one.

Method 1: Lookahead

Before AlphaGo, all the finest Go programs tackled this issue by using “Monte Carlo Tree Search” or MCTS. This process involves initially exploring numerous possible moves on the board and then focusing this search over time as certain moves are found to be more likely to result in wins than others.

deepmind-alphago-zero-game-changer-for-ai-research-img-1

Source: LOC

Both AlphaGo and AlphaGo Zero apply a fairly elementary version of MCTS for their “lookahead” to correctly maintain the tradeoff between exploring new sequences of moves or more deeply explore already-explored sequences.

Although MCTS has been at the heart of all effective Go programs preceding AlphaGo, it was DeepMind’s smart coalescence of this method with a neural network-based “intuition” that enabled it to attain superhuman performance.

Method 2: Intuition

DeepMind’s pivotal innovation with AlphaGo was to utilize deep neural networks to identify the state of the game and then use this knowledge to effectively guide the search of the MCTS. In particular, they trained networks that could record:

The current board position

Which player was playing

The sequence of recent moves (in order to rule out certain moves as “illegal”)

With this data, the neural networks could propose:

Which move should be played

If the current player is likely to win or not

So how did DeepMind train neural networks to do this? Well, AlphaGo and AlphaGo Zero used rather different approaches in this case.

AlphaGo had two separately trained neural networks: Policy Network and Value Network.

deepmind-alphago-zero-game-changer-for-ai-research-img-2

Source: AlphaGo’s Nature Paper

DeepMind then fused these two neural networks with MCTS — that is, the program’s “intuition” with its brute force “lookahead” search — in an ingenious way.

It used the network that had been trained to predict:

Moves to guide which branches of the game tree to search

Whether a position was “winning” to assess the positions it encountered during its search

This let AlphaGo to intelligently search imminent moves and eventually beat the world champion Lee Sedol.

AlphaGo Zero, however, took this principle to the next level. Its neural network’s “intuition” was trained entirely differently from that of AlphaGo. More specifically:

The neural network was trained to play moves that exhibited the improved evaluations from performing the “lookahead” search

The neural network was tweaked so that it was more likely to play moves like those that led to wins and less likely to play moves similar to those that led to losses during the self-play games

Much was made of the fact that no games between humans were used to train AlphaGo Zero. Thus, for a given state of a Go agent, it can constantly be made smarter by performing MCTS-based lookahead and using the results of that lookahead to upgrade the agent. This is how AlphaGo Zero was able to perpetually improve, from when it was an “amateur” all the way up to when it better than the best human players.

Moreover, AlphaGo Zero’s neural network architecture can be referred to as a “two-headed” architecture.

deepmind-alphago-zero-game-changer-for-ai-research-img-3

Source: Hacker Noon

Its first 20 layers were “blocks” of a typically seen in modern neural net architectures. These layers were followed by two “heads”:

One head that took the output of the first 20 layers and presented probabilities of the Go agent making certain moves

Another head that took the output of the first 20 layers and generated a probability of the current player winning.

What’s more, AlphaGo Zero used a more “state of the art” neural network architecture as opposed to AlphaGo. Particularly, it used a “residual” neural network architecture rather than a plainly “convolutional” architecture. Deep residual learning was pioneered by Microsoft Research in late 2015, right around the time work on the first version of AlphaGo would have been concluded. So, it is quite reasonable that DeepMind did not use them in the initial AlphaGo program.

Notably, each of these two neural network-related acts — switching from separate-convolutional to the more advanced dual-residual architecture and using the “two-headed” neural network architecture instead of separate neural networks — would have resulted in nearly half of the increase in playing strength as was realized when both were coupled.

deepmind-alphago-zero-game-changer-for-ai-research-img-4

Source: AlphaGo’s Nature Paper

Wrapping it up

According to DeepMind:

“After just three days of self-play training, AlphaGo Zero emphatically defeated the previously published version of AlphaGo - which had itself defeated 18-time world champion Lee Sedol - by 100 games to 0. After 40 days of self-training, AlphaGo Zero became even stronger, outperforming the version of AlphaGo known as “Master”, which has defeated the world's best players and world number one Ke Jie.

Over the course of millions of AlphaGo vs AlphaGo games, the system progressively learned the game of Go from scratch, accumulating thousands of years of human knowledge during a period of just a few days. AlphaGo Zero also discovered new knowledge, developing unconventional strategies and creative new moves that echoed and surpassed the novel techniques it played in the games against Lee Sedol and Ke Jie.”

Further, the founder and CEO of DeepMind, Dr. Demis Hassabis believes AlphaGo's algorithms are likely to most benefit to areas that need an intelligent search through an immense space of possibilities.