10 min read

This article is an excerpt taken from the book, Hands-On Intelligent Agents with OpenAI Gym, written by Praveen Palanisamy. In this article, the author introduces us to the Markov Decision Process followed by the understanding of Deep reinforcement learning.

A Markov Decision Process (MDP) provides a formal framework for reinforcement learning. It is used to describe a fully observable environment where the outcomes are partly random and partly dependent on the actions taken by the agent or the decision maker. The following diagram is the progression of a Markov Process into a Markov Decision Process through the Markov Reward Process:

These stages can be described as follows:

  • A Markov Process (or a markov chain) is a sequence of random states s1, s2,…  that obeys the Markov property. In simple terms, it is a random process without any memory about its history.
  • A Markov Reward Process (MRP) is a Markov Process (also called a Markov chain) with values.
  • A Markov Decision Process is a Markov Reward Process with decisions.

Dynamic programming with Markov Decision Process

Dynamic programming is a very general method to efficiently solve problems that can be decomposed into overlapping sub-problems. If you have used any type of recursive function in your code, you might have already got some preliminary flavor of dynamic programming. Dynamic programming, in simple terms, tries to cache or store the results of sub-problems so that they can be used later if required, instead of computing the results again.

Okay, so how is that relevant here, you may ask. Well, they are pretty useful for solving a fully defined MDP, which means that an agent can find the most optimal way to act in an environment to achieve the highest reward using dynamic programming if it has full knowledge of the MDP! In the following table, you will find a concise summary of what the inputs and outputs are when we are interested in sequential prediction or control:

Task/objective Input Output
Prediction MDP or MRP and policy  Value function 
Control MDP Optimal value function 
and optimal policy 

Monte Carlo learning and temporal difference learning

At this point, we understand that it is very useful for an agent to learn the state value function , which informs the agent about the long-term value of being in state so that the agent can decide if it is a good state to be in or not. The Monte Carlo (MC) and Temporal Difference (TD) learning methods enable an agent to learn that!

The goal of MC and TD learning is to learn the value functions from the agent’s experience as the agent follows its policy .

The following table summarizes the value estimate’s update equation for the MC and TD learning methods:

Learning method State-value function
Monte Carlo
Temporal Difference

MC learning updates the value towards the actual return ,which is the total discounted reward from time step t. This means that until the end. It is important to note that we can calculate this value only after the end of the sequence, whereas TD learning (TD(0) to be precise), updates the value towards the estimated return given by , which can be calculated after every step.

SARSA and Q-learning

It is also very useful for an agent to learn the action value function , which informs the agent about the long-term value of taking action  in state  so that the agent can take those actions that will maximize its expected, discounted future reward. The SARSA and Q-learning algorithms enable an agent to learn that! The following table summarizes the update equation for the SARSA algorithm and the Q-learning algorithm:

Learning method Action-value function

SARSA is so named because of the sequence State->Action->Reward->State’->Action’ that the algorithm’s update step depends on. The description of the sequence goes like this: the agent, in state S, takes an action A and gets a reward R, and ends up in the next state S’, after which the agent decides to take an action A’ in the new state. Based on this experience, the agent can update its estimate of Q(S,A).

Q-learning is a popular off-policy learning algorithm, and it is similar to SARSA, except for one thing. Instead of using the Q value estimate for the new state and the action that the agent took in that new state, it uses the Q value estimate that corresponds to the action that leads to the maximum obtainable Q value from that new state, S’.

Deep reinforcement learning

With a basic understanding of reinforcement learning, you are now in a better state (hopefully you are not in a strictly Markov state where you have forgotten the history/things you have learned so far) to understand the basics of the cool new suite of algorithms that have been rocking the field of AI in recent times.

Deep reinforcement learning emerged naturally when people made advancements in the deep learning field and applied them to reinforcement learning. We learned about the state-value function, action-value function, and policy. Let’s briefly look at how they can be represented mathematically or realized through computer code. The state-value function  is a real-value function that takes the current state  as the input and outputs a real-value number (such as 4.57). This number is the agent’s prediction of how good it is to be in state and the agent keeps updating the value function based on the new experiences it gains. Likewise, the action-value function is also a real-value function, which takes action as an input in addition to state , and outputs a real number. One way to represent these functions is using neural networks because neural networks are universal function approximators, which are capable of representing complex, non-linear functions. For an agent trying to play a game of Atari by just looking at the images on the screen (like we do), state could be the pixel values of the image on the screen. In such cases, we could use a deep neural network with convolutional layers to extract the visual features from the state/image, and then a few fully connected layers to finally output  or , depending on which function we want to approximate.

Recall from the earlier sections of this chapter that  is the state-value function and provides an estimate of the value of being in state , and  is the action-value function, which provides an estimate of the value of each action given the  state.

If we do this, then we are doing deep reinforcement learning! Easy enough to understand? I hope so. Let’s look at some other ways in which we can use deep learning in reinforcement learning.

Recall that a policy is represented as  in the case of deterministic policies, and as  in the case of stochastic policies, where action could be discrete (such as “move left,” “move right,” or “move straight ahead”) or continuous values (such as “0.05” for acceleration, “0.67” for steering, and so on), and they can be single or multi-dimensional. Therefore, a policy can be a complicated function at times! It might have to take in a multi-dimensional state (such as an image) as input and output a multi-dimensional vector of probabilities as output (in the case of stochastic policies). So, this does look like it will be a monster function, doesn’t it? Yes it does. That’s where deep neural networks come to the rescue! We could approximate an agent’s policy using a deep neural network and directly learn to update the policy (by updating the parameters of the deep neural network). This is called policy optimization-based deep reinforcement learning and it has been shown to be quite efficient in solving several challenging control problems, especially in robotics.

So in summary, deep reinforcement learning is the application of deep learning to reinforcement learning and so far, researchers have applied deep learning to reinforcement learning successfully in two ways. One way is using deep neural networks to approximate the value functions, and the other way is to use a deep neural network to represent the policy.

These ideas have been known from the early days, when researchers were trying to use neural networks as value function approximators, even back in 2005. But it rose to stardom only recently because although neural networks or other non-linear value function approximators can better represent the complex values of environment states and actions, they were prone to instability and often led to sub-optimal functions. Only recently have researchers such as Volodymyr Mnih and his colleagues at DeepMind (now part of Google) figured out the trick of stabilizing the learning and trained agents with deep, non-linear function approximators that converged to near-optimal value functions. In the later chapters of this book, we will, in fact, reproduce some of their then-groundbreaking results, which surpassed human Atari game playing capabilities!

Practical applications of reinforcement and deep reinforcement learning algorithms

Until recently, practical applications of reinforcement learning and deep reinforcement learning were limited, due to sample complexity and instability. But, these algorithms proved to be quite powerful in solving some really hard practical problems. Some of them are listed here to give you an idea:

  • Learning to play video games better than humans: This news has probably reached you by now. Researchers at DeepMind and others developed a series of algorithms, starting with DeepMind’s Deep-Q-Network, or DQN for short, which reached human-level performance in playing Atari games. We will actually be implementing this algorithm in a later chapter of this book! In essence, it is a deep variant of the Q-learning algorithm we briefly saw in this chapter, with a few changes that increased the speed of learning and the stability. It was able to reach human-level performance in terms of game scores after several games. What is more impressive is that the same algorithm achieved this level of play without any game-specific fine-tuning or changes!
  • Mastering the game of Go: Go is a Chinese game that has challenged AI for several decades. It is played on a full-size 19 x 19 board and is orders of magnitude more complex than chess because of the large number () of possible board positions. Until recently, no AI algorithm or software was able to play anywhere close to the level of humans at this game. AlphaGo—the AI agent from DeepMind that uses deep reinforcement learning and Monte Carlo tree search—changed this all and beat the human world champions Lee Sedol (4-1) and Fan Hui (5-0). DeepMind released more advanced versions of their AI agent, named AlphaGO Zero (which uses zero human knowledge and learned to play all by itself!) and AlphaZero (which could play the games of Go, chess, and Shogi!), all of which used deep reinforcement learning as the core algorithm.
  • Helping AI win Jeopardy!: IBM’s Watson—an AI system developed by IBM, which came to fame by beating humans at Jeopardy!—used an extension of TD learning to create its daily-double wagering strategies that helped it to win against human champions.
  • Robot locomotion and manipulation: Both reinforcement learning and deep reinforcement learning have enabled the control of complex robots, both for locomotion and navigation. Several recent works from the researchers at UC Berkeley have shown how, using deep reinforcement, they train policies that offer vision and control for robotic manipulation tasks and generate join actuations for making a complex bipedal humanoid walk and run.


To summarize, in this article, we learned about the Markov Decision process, Deep reinforcement learning, and its applications. If you’ve enjoyed this post, head over to the book, Hands-On Intelligent Agents with OpenAI Gym for implementing learning algorithms for machine software agents in order to solve discrete or continuous sequential decision making and control tasks, and much more.

Read next

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.