
Reinforcement learning is a branch of artificial intelligence in which an agent perceives its environment as a state, chooses an action from its action space, and acts on the environment. The action moves the environment to a new state, and the agent receives a reward as feedback for that action. This reward is associated with the new state. Just as we minimize a cost function to train a neural network, a reinforcement learning agent has to maximize the overall reward to find the optimal policy for a particular task.

This article is an extract from the book Reinforcement Learning with TensorFlow

How is reinforcement learning different from supervised and unsupervised learning?

In supervised learning, the training dataset has input features, X, and their corresponding output labels, Y. A model is trained on this dataset. Test cases with input features X’ are then given to the trained model, which predicts Y’.

In unsupervised learning, only the input features, X, of the training set are given for training. There are no associated Y values. The goal is to create a model that learns the underlying pattern in the data and uses it to separate the data into clusters. The model is then used on new input features, X’, to determine which cluster they most resemble.

Reinforcement learning differs from both supervised and unsupervised learning. It can guide an agent on how to act in the real world. The interface is broader than the training vectors used in supervised or unsupervised learning: here it is the entire environment, which can be a real or a simulated world. Agents are also trained differently. The objective is to reach a goal state, unlike supervised learning, where the objective is to maximize the likelihood or minimize a cost.

Reinforcement learning agents receive feedback automatically, in the form of rewards from the environment, unlike supervised learning, where labeling requires time-consuming human effort. One of the bigger advantages of reinforcement learning is that phrasing a task’s objective as a goal helps in solving a wide variety of problems. For example, the goal of a video game agent would be to win the game by achieving the highest score. This also helps in discovering new approaches to achieving the goal. For example, when AlphaGo defeated top professional Go players, it found new, unique ways of winning.

A reinforcement learning agent is like a human. Humans evolved very slowly; an agent also improves through reinforcement, but it can do so very fast. As far as sensing the environment is concerned, neither humans nor artificial intelligence agents can sense the entire world at once. The perceived environment forms a state in which the agent performs an action and lands in a new state, that is, a newly perceived environment different from the earlier one. This creates a state space that can be finite or infinite.

Defense is one of the largest sectors interested in this technology. Can reinforcement learning agents replace soldiers, not only walking but also fighting and making important decisions?

Basic terminologies and conventions

The following are the basic terminologies associated with reinforcement learning:

  • Agent: The program we create so that it can sense the environment, perform actions, receive feedback, and try to maximize rewards.
  • Environment: The world where the agent resides. It can be real or simulated.
  • State: The perception or configuration of the environment that the agent senses. State spaces can be finite or infinite.
  • Rewards: Feedback the agent receives after each action it takes. The goal of the agent is to maximize the overall reward, that is, the immediate and the future reward. Rewards are defined in advance, so they must be designed carefully for the agent to achieve the goal efficiently.
  • Actions: Anything that the agent is capable of doing in the given environment. The action space can be finite or infinite.
  • SAR triple: (state, action, reward) is referred to as the SAR triple, represented as (s, a, r).
  • Episode: Represents one complete run of the whole task.
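
The terms above can be made concrete with a minimal Python sketch. The `SAR` type, the `total_reward` helper, and the sample states and actions are illustrative assumptions, not definitions from the book. The sketch only shows that an episode can be stored as a sequence of (s, a, r) triples and that the overall reward is the sum over that sequence.

```python
from typing import List, NamedTuple


class SAR(NamedTuple):
    """One (state, action, reward) triple; the field names are illustrative."""
    state: str
    action: str
    reward: float


def total_reward(episode: List[SAR]) -> float:
    """Overall reward of an episode: the sum of the rewards of its triples."""
    return sum(step.reward for step in episode)


# A made-up three-step episode in a hypothetical corridor world.
episode = [
    SAR(state="start", action="right", reward=0.0),
    SAR(state="middle", action="right", reward=0.0),
    SAR(state="near_goal", action="right", reward=1.0),
]

print(total_reward(episode))
```

Here the agent is rewarded only on the final step, so maximizing the overall reward means accounting for future rewards as well as immediate ones.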

Let’s deduce the convention shown in the following diagram:

[Diagram: reinforcement learning]

Every task is a sequence of SAR triples. We start from state S(t), perform action A(t), receive a reward R(t+1), and land in a new state S(t+1). The current state and action pair determines the reward for the next step. Since S(t) and A(t) result in S(t+1), we also have a new triple of (current state, action, new state), that is, [S(t), A(t), S(t+1)] or (s, a, s’).
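
The S(t), A(t), R(t+1), S(t+1) convention can be sketched as a simple interaction loop. Everything in this example is an assumption made for illustration: the five-position corridor, the `step` and `run_episode` functions, and the random policy. No learning takes place; the loop only records the transitions that a learning algorithm would consume.

```python
import random
from typing import List, Tuple

# Hypothetical toy environment: a corridor of positions 0..4 with the goal at 4.
GOAL = 4
ACTIONS = (-1, +1)  # step left or step right


def step(state: int, action: int) -> Tuple[int, float]:
    """Apply A(t) in S(t); return S(t+1) and the reward R(t+1)."""
    next_state = min(max(state + action, 0), GOAL)  # stay inside the corridor
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def run_episode(seed: int = 0, max_steps: int = 1000) -> List[Tuple[int, int, float, int]]:
    """Run one episode with a random policy and log (s, a, r, s') per step."""
    rng = random.Random(seed)
    state = 0
    transitions = []
    for _ in range(max_steps):
        action = rng.choice(ACTIONS)              # the agent picks A(t)
        next_state, reward = step(state, action)  # the environment responds
        transitions.append((state, action, reward, next_state))
        state = next_state
        if state == GOAL:                         # the episode ends at the goal state
            break
    return transitions


transitions = run_episode(seed=0)
print(len(transitions), transitions[-1])
```

Each logged tuple holds both the SAR triple (s, a, r) and the (s, a, s’) triple described above, and the new state of one step is the current state of the next.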

Pioneers and breakthroughs in reinforcement learning

Here are the pioneers, industrial leaders, and research breakthroughs in the field of deep reinforcement learning.

David Silver

Dr. David Silver, with an h-index of 30, leads the reinforcement learning research team at Google DeepMind and is the lead researcher on AlphaGo. David co-founded Elixir Studios and then completed his PhD in reinforcement learning at the University of Alberta, where he co-introduced the algorithms used in the first master-level 9×9 Go programs. After this, he became a lecturer at University College London. He consulted for DeepMind before joining full-time in 2013. David led the AlphaGo project, which became the first program to defeat a top professional player in the game of Go.

Pieter Abbeel

Pieter Abbeel is a professor at UC Berkeley and was a Research Scientist at OpenAI. Pieter completed his PhD in Computer Science under Andrew Ng. His current research focuses on robotics and machine learning, with a particular focus on deep reinforcement learning, deep imitation learning, deep unsupervised learning, meta-learning, learning-to-learn, and AI safety. Pieter also won the NIPS 2016 Best Paper Award.

Google DeepMind

Google DeepMind is a British artificial intelligence company founded in September 2010 and acquired by Google in 2014. It is an industrial leader in the domains of deep reinforcement learning and neural Turing machines. It made news in 2016 when the AlphaGo program defeated Lee Sedol, a 9-dan Go player. Google DeepMind has focused on two big sectors: energy and healthcare.

Here are some of its projects:

The AlphaGo program

As mentioned in the Google DeepMind section, AlphaGo is a computer program that first defeated Lee Sedol and then Ke Jie, who at the time was the world No. 1 in Go. In 2017, an improved version, AlphaGo Zero, was launched that defeated AlphaGo 100 games to 0.


Libratus

Libratus is an artificial intelligence computer program designed to play poker by a team led by Professor Tuomas Sandholm at Carnegie Mellon University. Its name is Latin for "balanced", and it is the successor to an earlier program, Claudico.

In January 2017, it made history by defeating four of the world’s best professional poker players in a marathon 20-day poker competition.

Though Libratus focuses on playing poker, its designers mentioned its ability to learn any game that has incomplete information and where opponents are engaging in deception. As a result, they have proposed that the system can be applied to problems in cybersecurity, business negotiations, or medical planning domains.

This excerpt introduced reinforcement learning and some of the breakthrough research in the field. To put reinforcement learning techniques into practice, see the book Reinforcement Learning with TensorFlow.
