5 key reinforcement learning principles explained by AI expert, Hadelin de Ponteves

When people refer to artificial intelligence, some think of machine learning, while others think of deep learning or reinforcement learning. Artificial intelligence is a broad term that includes machine learning, and reinforcement learning is a type of machine learning, which makes it a branch of AI. In this article, we will look at five key reinforcement learning principles with some simple examples.

Reinforcement learning allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. It is used to find the best possible behavior or path to take in a specific situation.

This article is an excerpt from the book AI Crash Course, written by Hadelin de Ponteves. In this book, Hadelin helps you understand what you really need to build AI systems with reinforcement learning. The book includes descriptive and practical projects that put ideas into action and show how to build intelligent software step by step.

Although reinforcement learning is a form of machine learning, most machine learning does not involve taking actions and interacting with an environment the way we humans do. Indeed, as intelligent human beings, what we constantly keep doing is the following:


  1. We observe some input, whether it’s what we see with our eyes, what we hear with our ears, or what we remember in our memory.
  2. These inputs are then processed in our brain.
  3. Eventually, we make decisions and take actions.

This process of interacting with an environment is what we are trying to reproduce in terms of artificial intelligence. And to that extent, the branch of AI that works on this is reinforcement learning. This is the closest match to the way we think; the most advanced form of artificial intelligence, if we see AI as the science that tries to mimic (or surpass) human intelligence.

Reinforcement learning also has the most impressive results in business applications of AI. For example, Alibaba leveraged reinforcement learning to increase its ROI in online advertising by 240% without increasing its advertising budget.

Five reinforcement learning principles

Let’s begin building the first pillars of your intuition into how reinforcement learning works. These are the fundamental reinforcement learning principles, which will get you started with the right, solid basics in AI.

Here are the five principles:

  • Principle #1: The input and output system
  • Principle #2: The reward
  • Principle #3: The AI environment
  • Principle #4: The Markov decision process
  • Principle #5: Training and inference

Principle #1 – The input and output system

The first step is to understand that today, all AI models are based on the common principle of input and output. Every single form of artificial intelligence, including machine learning models, chatbots, recommender systems, robots, and of course reinforcement learning models, takes something as input and returns something else as output.

Figure 1: The input and output system

In reinforcement learning, this input and output have a specific name: the input is called the state, or input state. The output is the action performed by the AI. And in the middle, we have nothing other than a function that takes a state as input and returns an action as output. That function is called a policy. Remember the name, “policy,” because you will often see it in AI literature.

As an example, consider a self-driving car. Try to imagine what the input and output would be in that case.

The input would be what the embedded computer vision system sees, and the output would be the next move of the car: accelerate, slow down, turn left, turn right, or brake. Note that the output at any time (t) could very well be several actions performed at the same time. For instance, the self-driving car can accelerate while at the same time turning left. In the same way, the input at each time (t) can be composed of several elements: mainly the image observed by the computer vision system, but also some parameters of the car such as the current speed, the amount of gas remaining in the tank, and so on.

That’s the very first important principle in artificial intelligence: it is an intelligent system (a policy) that takes some elements as input, does its magic in the middle, and returns some actions to perform as output. Remember that the inputs are also called the states.
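To make the idea of a policy concrete, here is a minimal sketch in Python. It is not taken from the book: the `CarState` fields, the thresholds, and the action names are all hypothetical, chosen only to mirror the self-driving car example. A real policy would be learned rather than hand-written, but the shape is the same: a function from state to action(s).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CarState:
    """Hypothetical input state: a few numbers standing in for camera and sensor data."""
    obstacle_distance_m: float  # distance to the nearest obstacle ahead
    speed_kmh: float            # current speed of the car
    lane_offset_m: float        # positive means the car is right of the lane center


def policy(state: CarState) -> Tuple[str, ...]:
    """Map a state to the actions to perform (several can be returned at once)."""
    actions = []
    # Speed control: brake if something is close, otherwise speed up when slow.
    if state.obstacle_distance_m < 10.0:
        actions.append("brake")
    elif state.speed_kmh < 50.0:
        actions.append("accelerate")
    # Steering: drift back toward the lane center.
    if state.lane_offset_m > 0.5:
        actions.append("turn_left")
    elif state.lane_offset_m < -0.5:
        actions.append("turn_right")
    return tuple(actions)


# The car is slow, the road is clear, and it has drifted to the right.
print(policy(CarState(obstacle_distance_m=80.0, speed_kmh=30.0, lane_offset_m=0.8)))
# ('accelerate', 'turn_left')
```

Note that the output for this state is two simultaneous actions, accelerating while turning left, just as described above.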

Principle #2 – The reward

Every AI has its performance measured by a reward system. There’s nothing confusing about this; the reward is simply a metric that will tell the AI how well it does over time.

The simplest example is a binary reward: 0 or 1. Imagine an AI that has to guess an outcome. If the guess is right, the reward will be 1, and if the guess is wrong, the reward will be 0. This could very well be the reward system defined for an AI; it really can be as simple as that!

A reward doesn’t have to be binary, however. It can be continuous. Consider the famous game of Breakout:

Figure 2: The Breakout game

Imagine an AI playing this game. Try to work out what the reward would be in that case. It could simply be based on the score: more precisely, the score would be the accumulated reward over time in one game, and each reward would be the change in that score from one time step to the next (the discrete counterpart of the score's derivative).
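The two kinds of reward described so far can be sketched in a few lines of Python. This is only an illustration: the function names and the list of scores are made up, and the "derivative" of the score is taken here as the difference between consecutive scores, since a game advances in discrete steps.

```python
from typing import List


def binary_reward(guess: str, outcome: str) -> int:
    """Binary reward: 1 for a correct guess, 0 otherwise."""
    return 1 if guess == outcome else 0


def rewards_from_scores(scores: List[int]) -> List[int]:
    """Per-step rewards, defined as the change in score between consecutive steps."""
    return [current - previous for previous, current in zip(scores, scores[1:])]


# Hypothetical Breakout scores observed at six consecutive time steps.
scores = [0, 0, 1, 1, 5, 12]
rewards = rewards_from_scores(scores)

print(binary_reward("heads", "heads"))  # 1
print(rewards)                          # [0, 1, 0, 4, 7]
print(sum(rewards))                     # 12
```

Summing the per-step rewards gives back the final score, which is why the score can be read as the accumulated reward.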

This is one of the many ways we could define a reward system for that game. Different AIs will have different reward structures; we will build five reward systems for five different real-world applications in this book.

With that in mind, remember this as well: the ultimate goal of the AI will always be to maximize the accumulated reward over time.

Those are the first two basic, but fundamental, principles of artificial intelligence as it exists today: the input and output system, and the reward.

Principle #3 – The AI environment

The third reinforcement learning principle involves an “AI environment.” It is a very simple framework in which you define three things at each time (t):

  • The input (the state)
  • The output (the action)
  • The reward (the performance metric)

For each and every single AI based on reinforcement learning that is built today, we always define an environment composed of the preceding elements. It is, however, important to understand that there are more than these three elements in a given AI environment.

For example, if you are building an AI to beat a car racing game, the environment will also contain the map and the gameplay of that game. Or, in the example of a self-driving car, the environment will also contain all the roads along which the AI is driving and the objects that surround those roads. But what you will always find in common when building any AI are the three elements of state, action, and reward.
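As a rough sketch of what such an environment can look like in code, consider the toy example below. Everything in it is an assumption made for illustration: the `CorridorEnv` class, its `reset` and `step` methods, and the corridor itself are hypothetical and do not come from the book or from any particular library. What matters is that the three common elements are explicit: `step` takes an action and returns the next state and the reward.

```python
from typing import Tuple


class CorridorEnv:
    """Hypothetical environment: a corridor of cells 0..length-1 with the goal at the far end."""

    ACTIONS = ("left", "right")

    def __init__(self, length: int = 5) -> None:
        self.length = length   # extra detail of the environment (its "map")
        self.position = 0      # the state: which cell the agent is in

    def reset(self) -> int:
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action: str) -> Tuple[int, float, bool]:
        """Apply an action and return (next state, reward, episode finished?)."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        move = 1 if action == "right" else -1
        # Walls at both ends: the position is clamped to the corridor.
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0  # reward only when the goal cell is reached
        return self.position, reward, done


env = CorridorEnv()
state = env.reset()
print(state)              # 0
print(env.step("right"))  # (1, 0.0, False)
```

The corridor length plays the role of the "map" mentioned above: it is part of the environment, but the AI only ever deals with states, actions, and rewards.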

Principle #4 – The Markov decision process

The Markov decision process, or MDP, is simply a process that models how the AI interacts with the environment over time. The process starts at t = 0, and then, at each next iteration, meaning at t = 1, t = 2, … t = n units of time (where the unit can be anything, for example, 1 second), the AI follows the same format of transition:

  1. The AI observes the current state, s_t.
  2. The AI performs the action, a_t.
  3. The AI receives the reward, r_t = R(s_t, a_t).
  4. The AI enters the following state, s_{t+1}.

The goal of the AI is always the same in reinforcement learning: to maximize the accumulated reward over time, that is, the sum of all the rewards r_t = R(s_t, a_t) received at each transition.
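The transition format above can be written as a short loop. The sketch below is illustrative only: the corridor world, the `next_state` and `R` functions, and the fixed "always move right" policy are hypothetical stand-ins, chosen so that each line of the loop maps onto one step of the process.

```python
from typing import List, Tuple

GOAL = 4  # hypothetical world: cells 0..4, with the goal at cell 4


def next_state(state: int, action: int) -> int:
    """Transition: move one cell left (-1) or right (+1), staying inside the corridor."""
    return min(max(state + action, 0), GOAL)


def R(state: int, action: int) -> float:
    """Reward r_t = R(s_t, a_t): 1 when the action reaches the goal, 0 otherwise."""
    return 1.0 if next_state(state, action) == GOAL else 0.0


def policy(state: int) -> int:
    """A fixed, hand-written policy for illustration: always move right."""
    return 1


def run_episode(max_steps: int = 10) -> Tuple[float, List[Tuple[int, int, int, float, int]]]:
    state = 0            # the process starts at t = 0
    total_reward = 0.0   # accumulated reward, the quantity the AI tries to maximize
    trajectory = []
    for t in range(max_steps):
        action = policy(state)                 # steps 1-2: observe s_t, perform a_t
        reward = R(state, action)              # step 3: receive r_t = R(s_t, a_t)
        new_state = next_state(state, action)  # step 4: enter s_{t+1}
        trajectory.append((t, state, action, reward, new_state))
        total_reward += reward
        state = new_state
        if state == GOAL:
            break
    return total_reward, trajectory


total, trajectory = run_episode()
print(total)            # 1.0
print(len(trajectory))  # 4
```

Each tuple in `trajectory` records one transition (t, s_t, a_t, r_t, s_{t+1}), and `total` is the sum of the rewards over the episode.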

The following graphic will help you visualize and remember the MDP, which is the basis of reinforcement learning models:

Figure 3: The Markov decision process

Now four essential pillars are already shaping your intuition of AI. Adding a last important one completes the foundation of your understanding of AI. The last principle is training and inference; in training, the AI learns, and in inference, it predicts.

Principle #5 – Training and inference

The final principle you must understand is the difference between training and inference. When building an AI, there is a time for the training mode, and a separate time for the inference mode. I’ll explain what that means starting with the training mode.

Training mode

Now you understand, from the first three principles, that the very first step of building an AI is to build an environment in which the input states, the output actions, and a system of rewards are clearly defined. From the fourth principle, you also understand that inside this environment an AI will be built that interacts with it, trying to maximize the total reward accumulated over time.

To put it simply, there will be a preliminary (and long) period during which the AI will be trained to do that. That period is called training; we can also say that the AI is in training mode. During that time, the AI tries to accomplish a certain goal repeatedly until it succeeds. After each attempt, the parameters of the AI model are modified in order to do better at the next attempt.
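Here is a minimal sketch of what "modifying the parameters after each attempt" can mean. It is not the book's training method: it uses a deliberately simple technique (epsilon-greedy action selection with a running-average update) on a made-up problem with two actions whose rewards, stored in `REWARDS`, are hypothetical. The "parameters" are just the AI's current estimate of how good each action is.

```python
import random
from typing import List

# Hypothetical problem: two possible actions with fixed rewards unknown to the AI.
REWARDS = (0.2, 1.0)


def train(attempts: int = 1000, epsilon: float = 0.1, seed: int = 0) -> List[float]:
    """Training mode: after every attempt, update the estimate for the action tried."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    values = [0.0, 0.0]        # the parameters: estimated reward of each action
    counts = [0, 0]            # how many times each action has been tried
    for _ in range(attempts):
        if rng.random() < epsilon:
            # Occasionally explore: try a random action.
            action = rng.randrange(len(values))
        else:
            # Otherwise exploit: pick the action currently believed to be best.
            action = max(range(len(values)), key=values.__getitem__)
        reward = REWARDS[action]
        counts[action] += 1
        # Parameter update: move the estimate toward the observed reward (running average).
        values[action] += (reward - values[action]) / counts[action]
    return values


trained_values = train()
print(trained_values)
```

Early on, the AI mostly repeats the first action it tried; once exploration reveals that the second action pays more, its estimates change and so does its behavior. That change in parameters between attempts is what distinguishes training mode.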

Inference mode

Inference mode comes after your AI is fully trained and ready to perform well. It simply consists of interacting with the environment by performing the actions needed to accomplish the goal the AI was trained to achieve in training mode. In inference mode, no parameters are modified at the end of each episode.
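The contrast with training can be sketched as follows. The table `Q` below is a hypothetical set of already-trained parameters (an estimated value for each action in each state of a five-cell corridor with the goal at cell 4); the names and numbers are assumptions for illustration. In inference mode the AI only reads these parameters to choose actions and never writes to them.

```python
import copy
from typing import Dict

# Hypothetical parameters produced by an earlier training phase:
# for each state (cell 0..3), the estimated value of each action.
Q: Dict[int, Dict[str, float]] = {s: {"left": 0.0, "right": 1.0} for s in range(4)}


def greedy_action(q: Dict[int, Dict[str, float]], state: int) -> str:
    """Pick the action with the highest learned value in this state."""
    return max(q[state], key=q[state].get)


def run_inference(q: Dict[int, Dict[str, float]], goal: int = 4, max_steps: int = 20) -> float:
    """Inference mode: act with the trained parameters, without updating them."""
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        action = greedy_action(q, state)
        move = 1 if action == "right" else -1
        state = min(max(state + move, 0), goal)
        if state == goal:
            total_reward += 1.0  # the reward is observed but not used for learning
            break
    return total_reward


snapshot = copy.deepcopy(Q)
print(run_inference(Q))  # 1.0
print(Q == snapshot)     # True
```

The final check confirms the defining property of inference mode: the parameters are the same after the episode as before it.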

For example, imagine you have an AI company that builds customized AI solutions for businesses, and one of your clients asked you to build an AI to optimize the flows in a smart grid. First, you’d enter an R&D phase during which you would train your AI to optimize these flows (training mode), and as soon as you reached a good level of performance, you’d deliver your AI to your client and go into production. Your AI would regulate the flows in the smart grid only by observing the current states of the grid and performing the actions it has been trained to do. That’s inference mode.

Sometimes, the environment is subject to change, in which case you must alternate quickly between training and inference modes so that your AI can adapt to the changes in the environment. An even better solution is to train your AI model every day and go into inference mode with the most recently trained model. That was the last fundamental principle common to every AI.

To summarize, we explored the five key reinforcement learning principles: the input and output system, the reward, the AI environment, the Markov decision process, and training and inference.

Get the guide AI Crash Course by Hadelin de Ponteves today to learn how to program AI software in Python without any math or data science background. It will also help you master the key skills of deep learning, reinforcement learning, and deep reinforcement learning.
