11 min read

The Markov decision process, better known as MDP, is an approach in reinforcement learning to take decisions in a gridworld environment. A gridworld environment consists of states in the form of grids.

The MDP tries to capture a world in the form of a grid by dividing it into states, actions, models/transition models, and rewards. The solution to an MDP is called a policy and the objective is to find the optimal policy for that MDP task.

Thus, any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property would be considered an MDP.

In this tutorial, we will dig deep into MDPs, states, actions, rewards, policies, and how to solve them using Bellman equations.

This article is a reinforcement learning tutorial taken from the book, Reinforcement learning with TensorFlow.

Markov decision processes

MDP is defined as the collection of the following:

  • States: S
  • Actions: A(s), A
  • Transition model: T(s,a,s’) ~ P(s’|s,a)
  • Rewards: R(s), R(s,a), R(s,a,s’)
  • Policy is the optimal policy

In the case of an MDP, the environment is fully observable, that is, whatever observation the agent makes at any point in time is enough to make an optimal decision. In case of a partially observable environment, the agent needs a memory to store the past observations to make the best possible decisions.

Let’s try to break this into different lego blocks to understand what this overall process means.

The Markov property

In short, as per the Markov property, in order to know the information of near future (say, at time t+1) the present information at time t matters.

Given a sequence, , the first order of Markov says,
, that is,
 depends only on . Therefore,
 will depend only on . The second order of Markov says, , that is,  depends only on  and 

In our context, we will follow the first order of the Markov property from now on. Therefore, we can convert any process to a Markov property if the probability of the new state, say , depends only on the current state, , such that the current state captures and remembers the property and knowledge from the past. Thus, as per the Markov property, the world (that is, the environment) is considered to be stationary, that is, the rules in the world are fixed.

The S state set

The S state set is a set of different states, represented as s, which constitute the environment. States are the feature representation of the data obtained from the environment. Thus, any input from the agent’s sensors can play an important role in state formation. State spaces can be either discrete or continuous. The starts from start state and has to reach the goal state in the most optimized path without ending up in bad states (like the red colored state shown in the diagram below).

Consider the following gridworld as having 12 discrete states, where the green-colored grid is the goal state, red is the state to avoid, and black is a wall that you’ll bounce back from if you hit it head on:

The states can be represented as 1, 2,….., 12 or by coordinates, (1,1),(1,2),…..(3,4).


The actions are the things an agent can perform or execute in a particular state. In other words, actions are sets of things an agent is allowed to do in the given environment. Like states, actions can also be either discrete or continuous.

Consider the following gridworld example having 12 discrete states and 4 discrete actions (UPDOWN, RIGHT, and LEFT):

The preceding example shows the action space to be a discrete set space, that is,  A where, A = {UP, DOWN, RIGHT, and LEFT}. It can also be treated as a function of state, that is, a = A(s), where depending on the state function, it decides which action is possible.

Transition model

The transition model T(s, a, s’) is a function of three variables, which are the current state (s), action (a), and the new state (s’), and defines the rules to play the game in the environment. It gives probability P(s’|s, a), that is, the probability of landing up in the new s’ state given that the agent takes an action, a, in given state, s.

The transition model plays the crucial role in a stochastic world, unlike the case of a deterministic world where the probability for any landing state other than the determined one will have zero probability.

Let’s consider the following environment (world) and consider different cases, determined and stochastic:

Since the actions  A where, A = {UP, DOWN, RIGHT, and LEFT}.

The behavior of these two cases depends on certain factors:

  • Determined environment: In a determined environment, if you take a certain action, say UP, you will certainly perform that action with probability 1.
  • Stochastic environment: In a stochastic environment, if you take the same action, say UP, there will certain probability say 0.8 to actually perform the given action and there is 0.1 probability it can perform an action (either LEFT or RIGHT) perpendicular to the given action, UP. Here, for the s state and the UP action transition model, T(s’,UP, s) = P(s’| s,UP) = 0.8.

Since T(s,a,s’) ~ P(s’|s,a), where the probability of new state depends on the current state and action only, and none of the past states. Thus, the transition model follows the first order Markov property.

We can also say that our universe is also a stochastic environment, since the universe is composed of atoms that are in different states defined by position and velocity. Actions performed by each atom change their states and cause changes in the universe.


The reward of the state quantifies the usefulness of entering into a state. There are three different forms to represent the reward namely, R(s)R(s, a) and R(s, a, s’), but they are all equivalent.

For a particular environment, the domain knowledge plays an important role in the assignment of rewards for different states as minor changes in the reward do matter for finding the optimal solution to an MDP problem.

There are two approaches we reward our agent for when taking a certain action. They are:

  • Credit assignment problem: We look at the past and check which actions led to the present reward, that is, which action gets the credit
  • Delayed rewards: In contrast, in the present state, we check which action to take that will lead us to potential rewards

Delayed rewards form the idea of foresight planning. Therefore, this concept is being used to calculate the expected reward for different states. We will discuss this in the later sections.


Until now, we have covered the blocks that create an MDP problem, that is, states, actions, transition models, and rewards, now comes the solution. The policy is the solution to an MDP problem.

The policy is a function that takes the state as an input and outputs the action to be taken. Therefore, the policy is a command that the agent has to obey.

 is called the optimal policy, which maximizes the expected reward. Among all the policies taken, the optimal policy is the one that optimizes to maximize the amount of reward received or expected to receive over a lifetime. For an MDP, there’s no end of the lifetime and you have to decide the end time.

Thus, the policy is nothing but a guide telling which action to take for a given state. It is not a plan but uncovers the underlying plan of the environment by returning the actions to take for each state.

The Bellman equations

Since the optimal  policy is the policy that maximizes the expected rewards, therefore,


where  means the expected value of the rewards obtained from the sequence of states agent observes if it follows the  policy. Thus,  outputs the  policy that has the highest expected reward.

Similarly, we can also calculate the utility of the policy of a state, that is, if we are at the s state, given a  policy, then, the utility of the  policy for the s state, that is,  would be the expected rewards from that state onward:

The immediate reward of the state, that is,  is different than the utility of the  state (that is, the utility of the optimal policy of the  state) because of the concept of delayed rewards. From now onward, the utility of the  state will refer to the utility of the optimal policy of the state, that is, the  state.

Moreover, the optimal policy can also be regarded as the policy that maximizes the expected utility. Therefore,

where, T(s,a,s’) is the transition probability, that is, P(s’|s,a) and U(s’) is the utility of the new landing state after the a action is taken on the s state.

 refers to the summation of all possible new state outcomes for a particular action taken, then whichever action gives the maximum value of  that is considered to be the part of the optimal policy and thereby, the utility of the ‘s’ state is given by the following Bellman equation,

where,  is the immediate reward and  is the reward from future, that is, the discounted utilities of the ‘s’ state where the agent can reach from the given s state if the action, a, is taken.

Solving the Bellman equation to find policies

Say we have some n states in the given environment and if we see the Bellman equation,

we find out that n states are given; therefore, we will have n equations and n unknown but the  function makes it non-linear. Thus, we cannot solve them as linear equations.

Therefore, in order to solve:

  • Start with an arbitrary utility
  • Update the utilities based on the neighborhood until convergence, that is, update the utility of the state using the Bellman equation based on the utilities of the landing states from the given state

Iterate this multiple times to lead to the true value of the states. This process of iterating to convergence towards the true value of the state is called value iteration.

For the terminal states where the game ends, the utility of those terminal state equals the immediate reward the agent receives while entering the terminal state.

Let’s try to understand this by implementing an example.

An example of value iteration using the Bellman equation

Consider the following environment and the given information:

bellman equation

Given information:

  • AC, and X are the names of some states.
  • The green-colored state is the goal state, G, with a reward of +1.
  • The red-colored state is the bad state, B, with a reward of -1, try to prevent your agent from entering this state
  • Thus, the green and red states are the terminal states, enter either and the game is over. If the agent encounters the green state, that is, the goal state, the agent wins, while if they enter the red state, then the agent loses the game.
  •  (that is, reward for all states except the G and B states is -0.04),  (that is, the utility at the first time step is 0, except the G and B states).
  • Transition probability T(s,a,s’) equals 0.8 if going in the desired direction; otherwise, 0.1 each if going perpendicular to the desired direction. For example, if the action is UP then with 0.8 probability, the agent goes UP but with 0.1 probability it goes RIGHT and 0.1 to the LEFT.


  1.  Find , the utility of the X state at time step 1, that is, the agent will go through one iteration
  2. Similarly, find 


R(X) = -0.04

Action as’RIGHT


0.8+10.8 x 1 = 0.8RIGHTC0.100.1 x 0 = 0RIGHTX0.100.1 x 0 = 0

Thus, for action a = RIGHT,

Action as’DOWN


0.800.8 x 0 = 0DOWNG0.1+10.1 x 1 = 0.1DOWNA0.100.1 x 0 = 0

Thus, for action a = DOWN,

Action as’UP


0.800.8 x 0 = 0UPG0.1+10.1 x 1 = 0.1UPA0.100.1 x 0 = 0

Thus, for action a = UP,

Action as’LEFT


0.800.8 x 0 = 0LEFTX0.100.1 x 0 = 0LEFTC0.100.1 x 0 = 0

Thus, for action a = LEFT,

Therefore, among all actions,

Therefore, , where  and 

Similarly, calculate  and  and we get  and 

Since, , and,

R(X) = -0.04

Action as’RIGHT


0.8+10.8 x 1 = 0.8RIGHTC0.1-0.040.1 x -0.04 = -0.004RIGHTX0.10.360.1 x 0.36 = 0.036

Thus, for action a = RIGHT,

Action as’DOWN


0.8-0.040.8 x -0.04 = -0.032DOWNG0.1+10.1 x 1 = 0.1DOWNA0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = DOWN,

Action as’UP


0.80.360.8 x 0.36 = 0.288UPG0.1+10.1 x 1 = 0.1UPA0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = UP,

Action as’LEFT


0.8-0.040.8 x -0.04 = -0.032LEFTX0.10.360.1 x 0.36 = 0.036LEFTC0.1-0.040.1 x -0.04 = -0.004

Thus, for action a = LEFT,

Therefore, among all actions,

Therefore, , where  and 

Therefore, the answers to the preceding questions are:


Policy iteration

The process of obtaining optimal utility by iterating over the policy and updating the policy itself instead of value until the policy converges to the optimum is called policy iteration. The process of policy iteration is as follows:

  • Start with a random policy, 
  • For the given  policy at iteration step t, calculate  by using the following formula:
  • Improve the  policy by

This ends an interesting reinforcement learning tutorial. Want to implement state-of-the-art Reinforcement Learning algorithms from scratch? Get this best-selling title, Reinforcement Learning with TensorFlow.

Read Next:

How Reinforcement Learning works

Convolutional Neural Networks with Reinforcement Learning

Getting started with Q-learning using TensorFlow

Subscribe to the weekly Packt Hub newsletter. We'll send you this year's Skill Up Developer Skills Report.

* indicates required


Please enter your comment!
Please enter your name here