The Google Brain team, together with DeepMind and ETH Zurich, has introduced an episodic memory-based curiosity model that allows Reinforcement Learning (RL) agents to explore environments in an intelligent way. This model was the result of a study called Episodic Curiosity through Reachability, the findings of which Google AI shared yesterday.
Why was this episodic curiosity model introduced?
In real-world scenarios, the rewards required in reinforcement learning are sparse, and most current reinforcement learning algorithms struggle with such sparsity. Wouldn’t it be better if the agent were capable of creating its own rewards? That’s what this model does: the self-generated rewards make the overall reward signal denser and more suitable for learning.
Researchers have worked on curiosity-driven learning approaches before. One of them is the Intrinsic Curiosity Module (ICM), explored in the recent paper Curiosity-driven Exploration by Self-supervised Prediction, published by Ph.D. students at the University of California, Berkeley.
ICM builds a predictive model of the dynamics of the world. The agent is rewarded when the model fails to make good predictions. Exploring unvisited locations is not directly a part of the ICM curiosity formulation. In the ICM method, visiting them is only a way to obtain more “surprise” and thus maximize overall rewards. As a result, in some environments there could be other ways to cause self-surprise, leading to unforeseen results.
The authors of the ICM method, along with researchers at OpenAI, show a hidden danger of surprise maximization in their research Large-Scale Study of Curiosity-Driven Learning. Instead of doing something useful for the task at hand, agents can learn to indulge in procrastination-like behavior. The episodic memory-based curiosity model overcomes this “procrastination” issue.
What is the episodic memory-based curiosity model?
This model uses a deep neural network trained to measure how similar two experiences are. To train the model, the researchers made it guess whether two observations were experienced close together in time or far apart in time. Temporal proximity is a good proxy for whether two experiences should be judged to be part of the same experience. This training gives a general concept of novelty via reachability.
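The pair-labelling idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the function name `make_reachability_pairs`, the thresholds `k` and `gap`, and the use of plain integers as stand-in observations are all assumptions made for the example.

```python
import random


def make_reachability_pairs(trajectory, k=5, gap=25, n_pairs=100, seed=0):
    """Sample (obs_a, obs_b, label) triples from a single trajectory.

    label 1: the observations are at most `k` steps apart (close in time).
    label 0: the observations are more than `gap` steps apart (far in time).
    `k` and `gap` are illustrative values, not taken from the study.
    """
    rng = random.Random(seed)
    n = len(trajectory)
    if n <= gap + 1:
        raise ValueError("trajectory too short for the chosen gap")

    pairs = []
    for _ in range(n_pairs):
        positive = rng.random() < 0.5
        if positive:
            # Second index lies 1..k steps after the first.
            i = rng.randrange(n - 1)
            j = rng.randint(i + 1, min(i + k, n - 1))
        else:
            # Second index lies more than `gap` steps after the first.
            i = rng.randrange(n - gap - 1)
            j = rng.randint(i + gap + 1, n - 1)
        pairs.append((trajectory[i], trajectory[j], 1 if positive else 0))
    return pairs


# Stand-in trajectory: each "observation" is just its time step.
pairs = make_reachability_pairs(list(range(200)))
```

A classifier trained on such triples would then act as the similarity measure; the gap between `k` and `gap` leaves out ambiguous pairs that are neither clearly close nor clearly far.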
How this model works
Inspired by curious behavior in animals, this model rewards the agent with a bonus when it observes something novel. This bonus is added to the real task reward, making it possible for the RL algorithm to learn from the combined reward. To calculate the bonus, the agent's current observation is compared with the observations in memory. This comparison is based on how many environment steps it takes to reach the current observation from those in memory.
This method follows these steps:
- The agent’s observations of the environment are stored in an episodic memory.
- The agent is rewarded for reaching observations that are not yet represented in memory. In this method, being “not in memory” is the definition of novelty.
- This behavior of seeking the unfamiliar leads the artificial agent to new locations, keeping it from wandering in circles and ultimately helping it stumble on the goal.
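The steps above can be sketched as a small memory class. This is a hedged illustration under stated assumptions: `EpisodicMemory`, `combined_reward`, `toy_similarity`, the threshold of 0.5, and the choice to store only novel observations are hypothetical details chosen for the example, and the toy similarity function stands in for the trained neural network.

```python
class EpisodicMemory:
    """Stores observations and pays a bonus for novel ones (illustrative)."""

    def __init__(self, similarity, novelty_threshold=0.5, bonus=1.0):
        self.similarity = similarity          # callable(obs_a, obs_b) -> float in [0, 1]
        self.novelty_threshold = novelty_threshold
        self.bonus = bonus
        self.items = []

    def reset(self):
        """Clear the memory at the start of a new episode."""
        self.items = []

    def step(self, observation):
        """Return the curiosity bonus for `observation` and update memory."""
        for stored in self.items:
            if self.similarity(stored, observation) >= self.novelty_threshold:
                return 0.0                    # already represented in memory
        self.items.append(observation)        # assumption: only novel items are stored
        return self.bonus


def combined_reward(task_reward, curiosity_bonus, scale=1.0):
    """Add the (scaled) curiosity bonus to the real task reward."""
    return task_reward + scale * curiosity_bonus


def toy_similarity(a, b):
    """Stand-in for the trained network: positions within one step are 'similar'."""
    return 1.0 if abs(a - b) <= 1 else 0.0


memory = EpisodicMemory(toy_similarity)
bonuses = [memory.step(obs) for obs in [0, 1, 5, 0]]
# Observations 0 and 5 are novel; 1 is reachable from 0, and the second 0 is a repeat.
```

Passing the similarity function in as a parameter keeps the memory logic independent of how reachability is actually estimated.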
Experiment and results
Different approaches to curiosity were tested in two visually rich 3D environments: ViZDoom and DMLab. The agent was given various tasks such as searching for a goal in a maze or collecting good objects and avoiding bad objects.
The standard setting in previous formulations, such as ICM, on DMLab was to provide the agent with a laser-like science fiction gadget. If the agent did not need the gadget for a particular task, it was free not to use it. In this test, the surprise-based ICM method used the gadget a lot, even when it was useless for the task at hand.
The newly introduced method instead learns reasonable exploration behavior under the same conditions. This is because it does not try to predict the result of its actions, but rather seeks observations which are “harder” to achieve from those already in the episodic memory. In short, the agent implicitly pursues goals which require more effort to reach from memory than just a single tagging action.
This approach penalizes an agent running in circles: after completing the first circle, the agent encounters no observations other than those already in memory, and thus receives no further reward.
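The effect on a circling agent can be illustrated with a toy ring-shaped world. This sketch is an assumption-laden simplification: it uses the true step distance on the ring in place of a learned comparator, and `ring_distance`, `lap_bonuses`, and their parameters are hypothetical names introduced here.

```python
def ring_distance(a, b, size):
    """Fewest steps between two cells on a ring of `size` cells."""
    d = abs(a - b) % size
    return min(d, size - d)


def lap_bonuses(size=12, laps=3, k=1):
    """Total curiosity bonus collected on each lap around the ring.

    A cell counts as novel only if it is more than `k` steps away from
    every cell already held in memory.
    """
    memory = []
    totals = []
    position = 0
    for _ in range(laps):
        lap_total = 0
        for _ in range(size):
            novel = all(ring_distance(position, m, size) > k for m in memory)
            if novel:
                memory.append(position)
                lap_total += 1
            position = (position + 1) % size
        totals.append(lap_total)
    return totals


print(lap_bonuses())  # [6, 0, 0]
```

With the default values, the agent collects a bonus on every second cell during the first lap and nothing afterwards, which matches the behavior described above.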
In these experimental environments, the model achieved the following results:
- In ViZDoom, the agent learned to successfully navigate to a distant goal at least two times faster than the state-of-the-art curiosity method ICM.
- In DMLab, the agent generalized well to new procedurally generated levels of the game. It was able to reach the goal at least two times more frequently than ICM on test mazes with very sparse reward.