
Pieter Abbeel is a professor at UC Berkeley and a former Research Scientist at OpenAI. His current research focuses on robotics and machine learning with particular focus on deep reinforcement learning, deep imitation learning, deep unsupervised learning, meta-learning, learning-to-learn, and AI safety.

This article attempts to bring our readers to Pieter’s fantastic Keynote speech at NIPS 2017. It talks about the implementation of Deep Reinforcement Learning in Robotics, what challenges exist and how these challenges can be overcome. Once you’ve been through this article, we’re certain you’d be extremely interested in watching the entire video on the NIPS Facebook page. All images in this article come from his presentation slides and do not belong to us.

The use of ML in robotics has been growing by leaps and bounds, with several companies investing huge amounts to tie the two technologies together in the best way possible. However, several aspects of AI robotics are still far from solved. Here are a few of them:

  • Maximize Signal Extracted from Real World Experience
  • Faster/Data efficient Reinforcement Learning
  • Long Horizon Reasoning
  • Taskability (Imitation Learning)
  • Lifelong Learning (Continuous Adaptation)
  • Leverage Simulation

Maximise signal extracted from real world experience

We need more real-world data, and we need to extract as much signal as possible from the data we have. The diagram below shows the different layers of machine learning that engineers work with.


Some engineers look at the entire cake and train the agent to learn from both the reward and auxiliary signals, because reinforcement learning on its own doesn’t give you a lot of signal. Is it possible, then, to get a richer reward signal within RL itself? One answer is Hindsight Experience Replay. The idea is to get a reward signal from any experience, and not just from successes as in usual RL, by assuming that the goal equals whatever actually happened.

For this, we treat whatever the agent ended up doing as a success. We use Q-learning, but instead of a standard Q-function we use one that is conditioned on a goal, and we train it on multiple goals, including outcomes that were not really the goal when the agent was acting. A replay buffer collects experience, Q-learning is applied to it, and hindsight replay relabels that experience to produce a new reward for everything the agent has done.

For various robotic tasks like pushing, sliding and picking and placing objects, this does very well.
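The relabelling step described above can be sketched in a few lines. This is a minimal illustration rather than the implementation from the talk: the `Transition` record, the sparse reward function and the choice of the final state as the hindsight goal are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple

State = Tuple[int, int]


class Transition(NamedTuple):
    """Hypothetical replay-buffer record for a goal-conditioned agent."""
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def sparse_reward(achieved: State, goal: State) -> float:
    """Assumed reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if achieved == goal else -1.0


def hindsight_relabel(episode: List[Transition]) -> List[Transition]:
    """Pretend the state the agent actually ended in was the goal all along."""
    achieved_goal = episode[-1].next_state
    return [
        t._replace(goal=achieved_goal,
                   reward=sparse_reward(t.next_state, achieved_goal))
        for t in episode
    ]


# A failed episode: the agent wanted to reach (5, 5) but stopped at (2, 0).
original_goal = (5, 5)
path = [(0, 0), (1, 0), (2, 0)]
episode = [
    Transition(s, 0, s_next, original_goal, sparse_reward(s_next, original_goal))
    for s, s_next in zip(path[:-1], path[1:])
]

# Store both the original and the relabelled experience for Q-learning.
replay_buffer = episode + hindsight_relabel(episode)
```

In the original episode every reward is -1, so there is nothing to learn from. After relabelling, the last transition carries the success reward, which gives the goal-conditioned Q-function something to learn from.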

Faster Reinforcement Learning

When we talk about faster RL, we mean RL that is much more data efficient. Here is a diagram that demonstrates standard RL:

Reinforcement learning

An agent lets a robot perform an action in a particular environment or situation in order to achieve a reward. Here, the goal is to maximise the reward.
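The interaction loop described above can be sketched as follows. The `CorridorEnv` class, its `reset`/`step` methods and the two policies are hypothetical names invented for this example; they only mirror the common agent-environment pattern and do not come from the talk.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from position 0 to `length`."""

    def __init__(self, length: int = 5):
        self.length = length
        self.position = 0

    def reset(self) -> int:
        self.position = 0
        return self.position

    def step(self, action: int):
        """`action` is -1 (left) or +1 (right); returns (state, reward, done)."""
        self.position = max(0, min(self.length, self.position + action))
        done = self.position == self.length
        reward = 1.0 if done else 0.0  # reward only on reaching the far end
        return self.position, reward, done


def run_episode(env, policy, max_steps: int = 100) -> float:
    """Let the policy act until the episode ends; return the total reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state: int) -> int:
    return random.choice([-1, 1])


def always_right(state: int) -> int:
    return 1


total = run_episode(CorridorEnv(), always_right)
```

The agent is never told which individual action was right or wrong. It only sees the reward that comes back from the environment, and learning means changing the policy so that the total reward grows.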


Unlike supervised learning, there is no supervision telling the agent whether the actions it takes are right or wrong. This brings a few additional challenges in RL:

  1. Credit assignment: Working out which earlier actions deserve credit for a reward is a major problem, and it is where the learning signal in RL comes from
  2. Stability: Because of the feedback loop, the system could destabilize and destroy itself
  3. Exploration: Doing things you’ve never done before when the only way to learn is based on what you’ve done before

Despite this, there have been great improvements in Reinforcement Learning in the past few years, enabling AI systems to play games like Go, Dota, etc. It has also been implemented in building robots by NASA for planetary exploration.

But the question still exists: “How good is learning?”

In the game of Pong, a human takes roughly 2 hours to learn what a Deep Q-Network (DQN) learns in 40 hours! A more careful study reveals that after 15 minutes, humans tend to outperform a Double DQN (DDQN) that has trained for 115 hours. That is 15 minutes against 115 hours, a factor of 460, which is a tremendous gap in learning efficiency.

So, how do we overcome the challenge? Several fully general algorithms, such as Trust Region Policy Optimization (TRPO), DQN, Asynchronous Advantage Actor-Critic (A3C) and Rainbow, are available, meaning that they can be applied to any kind of environment. However, only a very small subset of all possible environments is actually encountered in the real world. Can we develop fast RL algorithms that take advantage of this situation?

RL Agents can be reused to train various policies. The RL algorithm is developed to train the policy to adapt to a particular environment A. This can then be replicated to environment B and so on. Humans develop the RL algorithm and then rely on it to train the policy. Despite this, none of the algorithms are as good as human learners. Do we have an alternative then? Indeed, yes! Why not let the system learn not just the policy but the algorithm as well or in other words, the entire agent?

Enter Meta-Reinforcement Learning

In Meta-RL, the learning algorithm itself is learned. You could relate this to meta-programming, where one program writes another. This helps a system understand the world better, so that it can learn a new situation more quickly. So how does this work? The system is trained on many environments, so that it learns the learning algorithm and outputs a faster RL agent. When that agent is faced with a new environment, it adapts to it quickly.


To evaluate actual performance, consider the multi-armed bandit problem. Here’s the setting: each bandit has its own distribution over payouts, and in each episode you can choose one bandit. A good RL agent should explore a sufficient number of bandits and then exploit the best ones. We need an algorithm that favours the bandits with a high probability of payoff over those with a low probability. Several asymptotically optimal algorithms, such as the Gittins index, UCB1 and Thompson Sampling, have already been created to solve this problem. Here’s a comparison of some of them with the Meta-RL algorithm.

The result is quite impressive. The Meta-RL algorithm is competitive with Gittins. In another case, where the task is to run in a target direction at maximum speed, the agent masters the task almost instantly when dropped into a new environment.

However, meta-learning succeeds only about two-thirds of the time. It fails the rest of the time for two main reasons.

  1. Overfitting: The learner tends to overfit to the current situation rather than fitting situations in general
  2. Underfitting: The learner doesn’t get enough signal to obtain any rewards

The solution is to put a different structure underneath the system. Instead of using an RNN, we use a WaveNet-like architecture such as the Simple Neural Attentive Meta-Learner (SNAIL).


SNAIL performs a bit better than RL2, the RNN-based Meta-RL agent, on the same bandit problem.
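As a point of reference for the bandit setting used in this section, here is a minimal sketch of UCB1, one of the classical baselines named earlier. It is not the Meta-RL agent from the talk. The Bernoulli arms, their payout probabilities and the function name `ucb1` are assumptions made for illustration.

```python
import math
import random


def ucb1(arm_probs, num_pulls, seed=0):
    """Run UCB1 on Bernoulli arms; return (pull counts per arm, total reward)."""
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [0] * n_arms   # how often each arm has been pulled
    sums = [0.0] * n_arms   # total reward collected from each arm
    total = 0.0
    for t in range(1, num_pulls + 1):
        if t <= n_arms:
            arm = t - 1  # start by pulling every arm once
        else:
            plays_so_far = t - 1
            # Pick the arm with the best mean reward plus exploration bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(plays_so_far) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


# Three hypothetical bandits with payout probabilities 0.1, 0.5 and 0.9.
counts, total = ucb1([0.1, 0.5, 0.9], num_pulls=2000)
```

The exploration bonus shrinks as an arm is pulled more often, so the algorithm keeps sampling uncertain arms early on and then concentrates on the best one. A Meta-RL agent has to discover a comparable explore-then-exploit strategy by itself.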

Longer Horizon Reasoning

We need to learn to reason over longer horizons than what canonical algorithms do. For this, we need hierarchy.

For example, suppose a robot has to perform 10 tasks in a day. At that level, it has 10 time steps per day. Each of these 10 tasks has subtasks under it; let’s assume that makes a total of 1,000 time steps. To perform the subtasks, the robot needs footstep planning, which amounts to 100,000 time steps. Footsteps in turn require commands to be sent to motors, which makes it 100,000,000 time steps. This is a very long horizon.

We can formulate this as a meta-learning problem. The agent has to solve a distribution of related long-horizon tasks with the goal of learning new tasks in the distribution quickly. If that is our objective, hierarchy would fall out.
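The time-step counts in the robot example multiply up level by level. The short sketch below makes that arithmetic explicit. The per-level factors (100 subtasks per task, 100 footsteps per subtask, 1,000 motor commands per footstep) are not stated in the talk summary; they are inferred here from the totals quoted above.

```python
# Branching factors inferred from the totals 10, 1,000, 100,000 and 100,000,000.
levels = [
    ("tasks per day", 10),
    ("subtasks per task", 100),
    ("footsteps per subtask", 100),
    ("motor commands per footstep", 1000),
]

horizon = 1
for name, factor in levels:
    horizon *= factor
    print(f"{name:<28} x{factor:<5} -> {horizon:,} time steps")
```

A flat algorithm has to assign credit across the full product, whereas a hierarchical one only has to reason over one factor at a time.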

Taskability (Imitation Learning)

There are several things we want from robots. We need to be able to tell them what to do, and we can do this by giving them examples. This is called Imitation Learning, and it has been applied successfully to a variety of use cases.


The idea is to collect many demonstrations, train a policy from those demonstrations, and then deploy the learned policy. The problem is that every time there is a new task, you start from scratch. The solution is experience gained through many demonstrations, as in the case of humans. However, instead of running the agent through several demos of the new task, it is given one complete demo, then shown a frame of a second demo, and it uses the first demo to predict what the outcome would be. This is known as one-shot imitation learning. It is a form of supervised learning in which many demonstrations are used to train the system to handle any new environment it is put into.


Lifelong learning (Continuous Adaptation)

The canonical way of doing ML can be divided into two broad steps:

  1. Run machine learning
  2. Deploy the result

In this case, all the learning happens ahead of time, before deployment. However, in real-world cases, what you learn from past data might not work in the future. There is a need to keep learning during deployment, in the spirit of lifelong learning. This brings us to Continuous Adaptation. Can we train an agent to be good at non-stationary environments?


We need to find out whether, at meta-training time, the agent is able to adapt to a new or changing task. We can try changing the dynamics, since it’s hard to do ML training in the real world. We can also use competitive environments, in which your agent shares the environment with other agents that are trying to beat it. The only way to succeed is to adapt continuously and more quickly than the others.

Leverage Simulation

Simulation is very helpful and it’s not that expensive. It’s fast and scalable and makes labelling easier. The challenge, however, is how to get useful things out of the simulator. One approach is to build realistic simulators, which is quite expensive. Another is to use a close-enough simulator together with real-world data, through domain confusion or domain adaptation. This makes it possible to learn from a small amount of real-world data and is quite successful.

Another approach is Domain Randomisation, which is also working well in the real world. If the model sees enough simulated variations, the real world might appear to it like just the next simulator. This has worked when using simulator data to train a quadcopter to avoid collisions. Moreover, a model pre-trained on ImageNet and a model trained only in simulation performed similarly after around 8000 examples.
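The core of domain randomisation is simply re-sampling the simulator’s settings for every training episode. The sketch below shows that idea only. The `SimParams` fields, their ranges and the function names are hypothetical and are not taken from the quadcopter work mentioned above.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    """Hypothetical simulator settings that are re-sampled every episode."""
    light_intensity: float
    wall_texture: str
    camera_tilt_deg: float


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one random variation of the simulated world."""
    return SimParams(
        light_intensity=rng.uniform(0.2, 1.5),
        wall_texture=rng.choice(["brick", "wood", "plain", "checker"]),
        camera_tilt_deg=rng.uniform(-10.0, 10.0),
    )


def training_configs(num_episodes: int, seed: int = 0):
    """One freshly randomised simulator configuration per training episode."""
    rng = random.Random(seed)
    return [sample_sim_params(rng) for _ in range(num_episodes)]


configs = training_configs(1000)
```

Each configuration would be passed to the simulator before an episode starts, so the policy never sees exactly the same world twice and cannot rely on any single rendering detail.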

To conclude, the beauty of meta-learning is that it enables the discovery of data-driven algorithms, as opposed to algorithms created from pure human ingenuity. This requires more compute power, but several companies like Nvidia and Intel are working hard to overcome this challenge. That should help meta-learning reach the point where it can be put to use in robotics.

While we work through the technical challenges of incorporating AI in robotics mentioned above, other significant challenges that we must focus on in parallel include safe learning and value alignment.



I'm a technology enthusiast who designs and creates learning content for IT professionals, in my role as a Category Manager at Packt. I also blog about what's trending in technology and IT. I'm a foodie, an adventure freak, a beard grower and a doggie lover.

