Earlier this week, Uber’s AI research team introduced Go-Explore, a new family of algorithms capable of achieving scores over 2,000,000 on the Atari game Montezuma’s Revenge and an average score of over 21,000 on the Atari 2600 game Pitfall. This is the first time any learning algorithm has scored above 0 on Pitfall. Go-Explore outperforms the previous state-of-the-art algorithms by two orders of magnitude on Montezuma’s Revenge and by 21,000 points on Pitfall.
Go-Explore can use human domain knowledge but isn’t entirely dependent on it, so it can score well with very little prior knowledge. For instance, Go-Explore scored over 35,000 points on Montezuma’s Revenge with zero domain knowledge.
As per the Uber team, “Go-Explore differs radically from other deep RL algorithms and could enable rapid progress in a variety of important, challenging problems, especially robotics. We, therefore, expect it to help teams at Uber and elsewhere increasingly harness the benefits of artificial intelligence”.
A common challenge
One of the most common and challenging problems in Montezuma’s Revenge and Pitfall is the “sparse reward” an agent faces while exploring the game. A sparse reward means the game offers few reliable reward signals, or feedback, that could help the player complete the stage and advance quickly.
To make things even more complicated, the rewards that are offered are often “deceptive”: they mislead AI agents into maximizing reward in the short term instead of working toward something that would let them advance to the next level (e.g., hitting an enemy nonstop instead of working toward the exit).
To tackle such problems, researchers usually add “intrinsic motivation” (IM) to agents, meaning that the agents are rewarded for reaching new states within the game. Adding IM has helped researchers tackle sparse reward problems in many games, but it has not been enough for Montezuma’s Revenge and Pitfall.
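The idea of rewarding an agent for reaching new states can be illustrated with a count-based novelty bonus. This is only a minimal sketch of one common form of intrinsic motivation, not the method used by any specific algorithm mentioned here; the class name `CountBasedBonus`, the `scale` parameter, and the 1/sqrt(visits) formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class CountBasedBonus:
    """Hypothetical intrinsic-motivation helper: rarely seen states earn a larger bonus."""

    def __init__(self, scale=1.0):
        self.scale = scale        # assumed weight of the novelty bonus
        self.visits = Counter()   # how often each state has been seen so far

    def reward(self, state, extrinsic_reward):
        """Return the game's own reward plus a bonus that shrinks with repeat visits."""
        self.visits[state] += 1
        bonus = self.scale / math.sqrt(self.visits[state])
        return extrinsic_reward + bonus


bonus = CountBasedBonus()
first_visit = bonus.reward("room_0", 0.0)   # new state: full bonus
second_visit = bonus.reward("room_0", 0.0)  # seen before: smaller bonus
```

Because the bonus decays as a state is revisited, the agent is nudged toward unseen states even when the game itself returns no reward.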
Uber’s Solution: Exploration and Robustification
According to the Uber team, “a major weakness of current IM algorithms is detachment, wherein the algorithms forget about promising areas they have visited, meaning they do not return to them to see if they lead to new states. This problem would be remedied if the agent returned to previously discovered promising areas for exploration”. Uber researchers have come up with a method that separates an agent’s learning into two steps: exploration and robustification.
Go-Explore builds up an archive of different game states, called “cells”, along with the paths leading to them. It selects a cell from the archive, returns to that cell, and then explores from it. For every cell visited along the way (including new cells), if the new trajectory is better (e.g. it has a higher score), it becomes the stored way to reach that cell. This helps Go-Explore remember and return to promising areas for exploration (unlike intrinsic motivation), avoid over-exploring, and become less susceptible to “deceptive” rewards, since it tries to cover all reachable states.
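The select, return, and explore loop described above can be sketched on a toy problem. The following is a rough illustration under several assumptions, not Uber’s implementation: the `ToyCorridor` environment, the `cell_of` mapping, the visit-count selection weights, and all parameter names are hypothetical, and exploration here simply takes random actions in a deterministic environment whose state can be restored directly.

```python
import random


class ToyCorridor:
    """Hypothetical deterministic environment: a corridor with a reward only at the far end."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def get_state(self):
        return self.pos

    def restore(self, state):
        self.pos = state

    def step(self, action):
        """Move by -1 or +1; reward 1.0 only on arriving at the far end."""
        old = self.pos
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 1.0 if self.pos == self.length and old != self.length else 0.0
        return self.pos, reward


def cell_of(observation):
    # Toy assumption: every corridor position is its own cell.
    return observation


def go_explore(env, iterations=500, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    archive = {cell_of(start): {"state": env.get_state(), "trajectory": [],
                                "score": 0.0, "visits": 0}}
    for _ in range(iterations):
        # Select: favor cells that have been chosen less often (assumed heuristic).
        cells = list(archive)
        weights = [1.0 / (1 + archive[c]["visits"]) for c in cells]
        chosen = rng.choices(cells, weights=weights, k=1)[0]
        entry = archive[chosen]
        entry["visits"] += 1

        # Go: return to the chosen cell by restoring its saved state.
        env.restore(entry["state"])
        trajectory = list(entry["trajectory"])
        score = entry["score"]

        # Explore: take random actions and record every cell reached.
        for _ in range(explore_steps):
            action = rng.choice([-1, 1])
            obs, reward = env.step(action)
            trajectory.append(action)
            score += reward
            cell = cell_of(obs)
            known = archive.get(cell)
            better = (known is None or score > known["score"] or
                      (score == known["score"] and
                       len(trajectory) < len(known["trajectory"])))
            if better:
                # Keep the better way of reaching this cell.
                archive[cell] = {"state": env.get_state(),
                                 "trajectory": list(trajectory),
                                 "score": score,
                                 "visits": 0 if known is None else known["visits"]}
    return archive


archive = go_explore(ToyCorridor())
print(f"cells discovered: {len(archive)}")
```

Each archive entry stores the full action sequence from the start, so any cell can be revisited later by replaying its trajectory, which is what makes the stored paths usable as demonstrations.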
Results of the Exploration phase
Montezuma’s Revenge: During the exploration phase, Go-Explore reaches an average of 37 rooms and solves level 1 (comprising 24 rooms, not all of which need to be visited) 65 percent of the time. The previous state-of-the-art algorithms explored only 22 rooms on average.
Pitfall: Pitfall requires significant exploration and is much harder than Montezuma’s Revenge, since it offers only 32 positive rewards scattered over 255 rooms. The game is so hard that no previous RL algorithm had been able to collect even a single positive reward. During the exploration phase, Go-Explore visits all 255 rooms and collects over 60,000 points. With zero domain knowledge, Go-Explore still finds an impressive 22 rooms but does not find any reward.
If the solutions found via exploration are not robust to noise, they can be robustified, meaning they are distilled into a deep neural network using an imitation learning algorithm, a type of algorithm that can learn a robust model-free policy from demonstrations. Uber researchers chose Salimans & Chen’s “backward” algorithm to start with, although any imitation learning algorithm would do.
“We found it somewhat unreliable in learning from a single demonstration. However, because Go-Explore can produce plenty of demonstrations, we modified the backward algorithm to simultaneously learn from multiple demonstrations,” writes the Uber team.
Results of robustification
Montezuma’s Revenge: After robustifying the trajectories discovered with the domain knowledge version of Go-Explore, the resulting agents solve the first 3 levels of Montezuma’s Revenge. Since all levels beyond level 3 in this game are nearly identical, Go-Explore has solved the entire game.
“In fact, our agents generalize beyond their initial trajectories, on average solving 29 levels and achieving a score of 469,209! This shatters the state of the art on Montezuma’s Revenge both for traditional RL algorithms and imitation learning algorithms that were given the solution in the form of a human demonstration,” mentions the Uber team.
Pitfall: Once the trajectories had been collected in the exploration phase, the researchers were able to reliably robustify those that collect more than 21,000 points. As a result, Go-Explore outperforms both the state-of-the-art algorithms and average human performance, setting an AI record on Pitfall with a score of more than 21,000 points.
“Some might object that, while the methods already work in the high-dimensional domain of Atari-from-pixels, it cannot scale to truly high-dimensional domains like simulations of the real world. We believe the methods could work there, but it will have to marry a more intelligent cell representation of interestingly different states (e.g. learned, compressed representations of the world) with intelligent (instead of random) exploration”, writes the Uber team.
For more information, check out the official blog post.