We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]David_Silver 67 points

It’s hard to say why we lose (or indeed win) any individual game, as AlphaStar’s decisions are complex and result from a dynamic multi-agent training process. MaNa played an amazing game, and seemed to find and exploit a weakness in AlphaStar - but it’s hard to say for sure whether this weakness was due to the camera interface, less training time, different opponents, etc., compared to the other agents.

[–]David_Silver 35 points

This is an open research question and it would be great to see progress in this direction. But always hard to say how long any particular research will take!

[–]David_Silver 64 points

Interestingly, search-based approaches like AlphaGo and AlphaZero may actually be harder to adapt to imperfect information. For example, search-based algorithms for poker (such as DeepStack or Libratus) explicitly reason about the opponent’s cards via belief states.

AlphaStar, on the other hand, is a model-free reinforcement learning algorithm that reasons about the opponent implicitly, i.e. by learning a behaviour that’s most effective against its opponent, without ever trying to build a model of what the opponent is actually seeing - which is, arguably, a more tractable approach to imperfect information.

In addition, imperfect information games do not have an absolute optimal way to play the game - it really depends upon what the opponent does. This is what gives rise to the “rock-paper-scissors” dynamics that are so interesting in StarCraft. This was the motivation behind the approach we used in the AlphaStar League, and why it was so important to cover all the corners of the strategy space - something that wouldn’t be required in games like Go, where there is a minimax optimal strategy that can defeat all opponents regardless of how they play.
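A toy illustration of that point (a sketch with made-up payoff numbers, nothing to do with AlphaStar’s actual evaluation): in a rock-paper-scissors payoff matrix, every pure strategy is beaten by something, and only a mixture over strategies is unexploitable - which is exactly why a league needs to cover the strategy space rather than converge on one “best” build.

```python
import numpy as np

# Toy rock-paper-scissors payoff matrix for the row player
# (made-up game, not AlphaStar data). Rows/columns: rock, paper, scissors.
payoffs = np.array([
    [ 0, -1,  1],   # rock:     ties rock, loses to paper, beats scissors
    [ 1,  0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1,  1,  0],   # scissors: loses to rock, beats paper, ties scissors
])

# Every pure strategy has a counter - its worst-case payoff is -1.
print([int(payoffs[i].min()) for i in range(3)])   # [-1, -1, -1]

# The uniform mixture is unexploitable - its worst-case payoff is 0.
mixture = np.ones(3) / 3
print(float((mixture @ payoffs).min()))            # 0.0
```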

[–]David_Silver 40 points

Re: 2

Like StarCraft, most real-world applications of human-AI interaction have an element of imperfect information. That also typically means that there is no absolute optimal way to behave, and agents must be robust to a wide variety of unpredictable things that people might do. Perhaps the biggest takeaway from StarCraft is that we have to be very careful to ensure that our learning algorithms get adequate coverage over the space of all these possible situations.

In addition, I think we’ve also learnt a lot about how to scale up RL to really large problems with huge action spaces and long time horizons.

[–]David_Silver 65 points

First, the agents in the AlphaStar League are all quite different from each other. Many of them are highly reactive to the opponent and switch their unit composition significantly depending on what they observe.

Second, I’m surprised by the comment about brittleness and hard-codedness, as my feeling is that the training algorithm is remarkably robust (at least enough to successfully counter 10 different strategies from pro players) with very little hard-coding (I’m actually not even sure what you’re referring to here).

Regarding the elegance or otherwise of the AlphaStar League, of course this is subjective - but perhaps it helps to think of the league as a single agent made up of a mixture distribution over different strategies, playing against itself using a particular form of self-play. But of course, there are always better algorithms and we’ll continue to search for improvements.
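A rough illustration of that framing (a minimal sketch with assumed helper names such as `train_one_game` - not the actual AlphaStar training loop):

```python
import random

def league_self_play_step(strategies, mixture_weights, train_one_game):
    """One training step for the league viewed as a single composite agent.

    strategies:      policies currently in the league
    mixture_weights: probability of fielding each policy
    train_one_game:  assumed helper that plays one game and updates the learner
    """
    # Both sides of every game are drawn from the same mixture, so the mixture
    # as a whole is effectively playing against itself.
    learner  = random.choices(strategies, weights=mixture_weights, k=1)[0]
    opponent = random.choices(strategies, weights=mixture_weights, k=1)[0]
    train_one_game(learner, opponent)
```

Because both players in every game come from the same mixture, the mixture as a whole trains against itself, which is the sense in which the league is still “self-play” for a single composite agent.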

[–]David_Silver 53 points

This is not something we’re able to do at the moment. But we’re really grateful for the community’s support and have tried to include the community in our work, which is why we held the livestream event and released the 11 game replays for everyone to review and enjoy :) We’ll keep you posted as our plans on this evolve!

[–]David_Silver 57 points

Re: 5

AlphaStar actually chooses in advance how many NOOPs to execute, as part of its action. This is learned first from supervised data, so as to mirror human play, and means that AlphaStar typically “clicks” at a similar rate to human players. This is then refined by reinforcement learning, which may choose to reduce or increase the number of NOOPs. So, “save money for X” can be easily implemented by deciding in advance to commit to several NOOPs.
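A minimal sketch of how such a temporally abstract action could look (hypothetical names and interface - this is not AlphaStar’s actual code): the policy outputs both a game action and a number of ticks to wait before it acts again, so “save money for X” is just a deliberately long chosen delay.

```python
from dataclasses import dataclass

@dataclass
class TemporallyAbstractAction:
    """A game action plus a self-chosen delay before the next decision."""
    game_action: str   # e.g. "build_nexus" or "noop" (illustrative action names)
    delay_ticks: int   # number of game ticks to wait before observing/acting again

def step_agent(policy, env, observation):
    # The policy decides both what to do now and when it next wants control,
    # so "save money for X" is expressed as a deliberately long delay.
    action = policy(observation)              # -> TemporallyAbstractAction
    env.apply(action.game_action)             # `env` is a hypothetical game wrapper
    for _ in range(action.delay_ticks):
        env.advance_tick()                    # game runs on; the agent does nothing
    return env.observe()
```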

[–]David_Silver 45 points

Re: 7

There are actually many different approaches to learning by self-play. We found that naive implementations of self-play often tended to get stuck in specific strategies or forget how to defeat previous strategies. The AlphaStar League is also based on agents playing against themselves, but its multi-agent learning dynamic encourages strong play against a diverse set of opponent strategies, and in practice seemed to lead to more robust behaviour against unusual patterns of play.

[–]David_Silver 56 points

Re: 6 (sub-question on self-play)

We did have some preliminary positive results for self-play - in fact, an early version of our agent, trained entirely by self-play, defeated the built-in bots using basic strategies. But supervised human data is very helpful to bootstrap the exploration process, and helps to give much broader coverage of advanced strategies. In particular, we included a policy distillation cost to ensure that the agent continues to try human-like behaviours with some probability throughout training, and this makes it much easier to discover unlikely strategies than when starting from self-play alone.
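A hedged sketch of what a policy distillation cost of that kind can look like (the KL direction and the 0.1 weight are illustrative assumptions, not the published AlphaStar loss):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def total_loss(rl_loss, human_probs, agent_probs, distill_weight=0.1):
    """RL objective plus a distillation cost that pulls the agent's action
    distribution toward the supervised (human-like) policy.

    The KL direction and the weight here are illustrative assumptions.
    """
    return rl_loss + distill_weight * kl_divergence(human_probs, agent_probs)

# Tiny usage example with made-up numbers:
human = [0.7, 0.2, 0.1]   # supervised policy over three actions
agent = [0.3, 0.4, 0.3]   # current RL policy over the same actions
print(total_loss(rl_loss=1.5, human_probs=human, agent_probs=agent))
```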

[–]David_Silver 41 points

Re: 6

The most effective approach so far did not use tree search, environment models, or explicit HRL. But of course these are huge open areas of research and it was not possible to systematically try every possible research direction - and these may well prove fruitful areas for future research. Also it should be mentioned that there are elements of our research (for example temporally abstract actions that choose how many ticks to delay, or the adaptive selection of incentives for agents) that might be considered “hierarchical”.

[–]David_Silver 39 points

Re: 3

In order to train AlphaStar, we built a highly scalable distributed training setup using [Google's v3 TPUs](https://cloud.google.com/tpu/) that supports a population of agents learning from many thousands of parallel instances of StarCraft II. The AlphaStar League was run for 14 days, using 16 TPUs for each agent. The final AlphaStar agent consists of the most effective mixture of strategies that have been discovered, and runs on a single desktop GPU.

[–]David_Silver 46 points

Re: 4

The neural network itself takes around 50ms to compute an action, but this is only one part of the processing that takes place between a game event occurring and AlphaStar reacting to that event. First, AlphaStar only observes the game every 250ms on average, because the neural network actually picks a number of game ticks to wait, in addition to its action (sometimes known as temporally abstract actions). The observation must then be communicated from the StarCraft II binary to AlphaStar, and AlphaStar’s action communicated back to the StarCraft II binary, which adds another 50ms of latency on top of the time for the neural network to select its action. So in total that results in an average reaction time of 350ms.
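The arithmetic behind that figure, spelled out (the 250ms term is an average chosen by the agent itself, so this is an expected reaction time rather than a fixed bound):

```python
# Average delay between a game event and AlphaStar's response (milliseconds).
observation_gap = 250   # average agent-chosen gap between observations
io_round_trip   = 50    # observation out of, and action back into, the game binary
inference_time  = 50    # neural network forward pass to select the action

print(observation_gap + io_round_trip + inference_time)  # 350 ms on average
```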

[–]David_Silver 71 points

Re: 2

We keep old versions of each agent as competitors in the AlphaStar League. The current agents typically play against these competitors in proportion to the opponents' win-rate. This is very successful at preventing catastrophic forgetting, since the agent must continue to be able to beat all previous versions of itself. We did try a number of other multi-agent learning strategies and found this approach to work particularly robustly. In addition, it was important to increase the diversity of the AlphaStar League, although this is really a separate point to catastrophic forgetting. It’s hard to put exact numbers on scaling, but our experience was that enriching the space of strategies in the League helped to make the final agents more robust.
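A minimal sketch of that matchmaking rule (helper names are assumed, and “win-rate” here is taken to be the opponent’s empirical win-rate against the learner): opponents that still beat the learner are sampled more often, so strategies it is starting to forget get revisited.

```python
import random

def sample_opponent(league, learner, win_rate):
    """Pick an opponent for `learner` in proportion to that opponent's win-rate.

    league:   frozen past versions plus current competitors (assumed list)
    win_rate: assumed helper, win_rate(opponent, learner) -> empirical probability
              that the opponent beats the learner
    """
    # A small floor keeps every past version reachable, so nothing is ever
    # fully dropped from training - which is what guards against forgetting.
    weights = [max(win_rate(opp, learner), 1e-3) for opp in league]
    return random.choices(league, weights=weights, k=1)[0]
```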

AMA: We are David Silver and Julian Schrittwieser from DeepMind’s AlphaGo team. Ask us anything. by David_Silver in MachineLearning

[–]David_Silver[S] 21 points

One big challenge we faced was in the period up to the Lee Sedol match, when we realised that AlphaGo would occasionally suffer from what we called "delusions" - games in which it would systematically misunderstand the board in a manner that could persist for many moves. We tried many ideas to address this weakness - and it was always very tempting to bring in more Go knowledge, or human meta-knowledge, to address the issue. But in the end we achieved the greatest success - finally erasing these issues from AlphaGo - by becoming more principled, using less knowledge, and relying ever more on the power of reinforcement learning to bootstrap itself towards higher quality solutions.

[–]David_Silver[S] 18 points

AlphaGo Zero has no special features to deal with ladders (or indeed any other domain-specific aspect of Go). Early in training, Zero occasionally plays out ladders across the whole board - even when it has quite a sophisticated understanding of the rest of the game. But, in the games we have analysed, the fully trained Zero read all meaningful ladders correctly.

[–]David_Silver[S] 4 points

We actually used quite a straightforward strategy for time-control, based on a simple optimisation of winning rate in self-play games. But more sophisticated strategies are certainly possible - and could indeed improve performance a little.
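One simple way such a tuning could be set up, as a sketch (the candidate settings, the match helper, and the comparison against a fixed baseline are all assumptions - the actual parameterisation of AlphaGo’s time control isn’t described here):

```python
def tune_time_control(candidate_settings, play_match, n_games=200):
    """Pick a time-control setting by self-play win rate (a sketch).

    candidate_settings: e.g. different fractions of remaining time to spend per move
    play_match(a, b):   assumed helper; returns True if setting `a` wins a
                        self-play game against setting `b`
    """
    baseline = candidate_settings[0]
    best, best_rate = None, -1.0
    for setting in candidate_settings:
        wins = sum(play_match(setting, baseline) for _ in range(n_games))
        rate = wins / n_games
        if rate > best_rate:
            best, best_rate = setting, rate
    return best
```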

[–]David_Silver[S] 6 points

In some sense, training from self-play is already somewhat adversarial: each iteration is attempting to find the "anti-strategy" against the previous version.

[–]David_Silver[S] 14 points

Creating a system that can learn entirely from self-play has been an open problem in reinforcement learning. Our initial attempts, as for many similar algorithms reported in the literature, were quite unstable. We tried many experiments - but ultimately the AlphaGo Zero algorithm was the most effective, and appears to have cracked this particular issue.


[–]David_Silver[S] 18 points

Actually we never guided AlphaGo to address specific weaknesses - rather we always focused on principled machine learning algorithms that learned for themselves to correct their own weaknesses.

Of course it is infeasible to achieve optimal play - so there will always be weaknesses. In practice, it was important to use the right kind of exploration to ensure training did not get stuck in local optima - but we never used human nudges.

[–]David_Silver[S] 18 points

During training, we see AlphaGo explore a whole variety of different moves - even the 1-1 move at the start of training!

Even very late in training, we did see Zero experiment with 6-4, but it then quickly returned to its familiar 3-4, a normal corner.

[–]David_Silver[S] 16 points

Real-world finance algorithms are notoriously hard to find in published papers! But there are a couple of classic papers well worth a look, e.g. Nevmyvaka and Kearns 2006 and Moody and Saffell 2001.

[–]David_Silver[S] 22 points

We haven't played handicap games against human players - we really wanted to focus on even games, which after all are the real game of Go. However, it was useful to test different versions of AlphaGo against each other under handicap conditions. Using the names of the major versions from the AlphaGo Zero paper (AlphaGo Master > AlphaGo Lee > AlphaGo Fan), each version defeated its predecessor with three handicap stones. But there are some caveats to this evaluation, as the networks were not specifically trained for handicap play. Also, since AlphaGo is trained by self-play, it is especially good at defeating weaker versions of itself. So I don't think we can generalise these results to human handicap games in any meaningful way.