Reinforcement learning, one of the foundations of machine learning, is learning through trial and error by interacting with an environment. Reinforcement learning is often modeled as a **Markov Decision Process** (**MDP**). An MDP rests on a memoryless state-transition model and an action-reward equation with a learning parameter. This equation, the Bellman equation (whose action-value form is often called the Q function), powered the systems that beat world-class Atari gamers.

In this article, we are going to work through the Markov Decision Process and apply it to reinforcement learning with the Bellman equation.

This tutorial is taken from the book *Artificial Intelligence By Example* by Denis Rothman. In this book, you will develop machine intelligence from scratch using real artificial intelligence use cases.

## Step 1 – Markov Decision Process in natural language

Step 1 of any artificial intelligence problem is to transpose it into something you know in your everyday life (work or personal).

Let’s say you are an e-commerce business driver delivering a package in an area you do not know. You are the operator of a self-driving vehicle. You have a GPS system with a beautiful color map on it. The areas around you are represented by the letters **A** to **F**, as shown in the simplified map in the following diagram. You are presently at **F**. Your goal is to reach area **C**. You are happy, listening to the radio. Everything is going smoothly, and it looks like you are going to be there on time. The following graph represents the locations and routes that you can possibly cover.

The guiding system’s **state** indicates the complete path to reach **C**. It is telling you that you are going to go from **F** to **B** to **D** and then to **C**. It looks good!

To break things down further, let’s say:

- The present **state** is the letter *s*.
- Your next **action** is the letter *a*. This action *a* is not location **A**.
- The next action *a* (not location **A**) is to go to location **B**. You look at your guiding system; it tells you there is no traffic, and that to go from your present state **F** to your next state **B** will take you only a few minutes. Let’s say that the next state, **B**, is the letter *s'*.

At this point, you are still quite happy, and we can sum up your situation with the following sequence of events:

The letter *s* is your present state, your present situation. The letter *a* is the action you’re deciding, which is to go to the next area; there you will be in another state, *s’*. We can say that thanks to the action *a*, you will go from *s* to *s’*.

Now, imagine that the driver is not you anymore. You are tired for some reason. That is when a self-driving vehicle comes in handy. You set your car to autopilot. Now you are not driving anymore; the system is. Let’s call that system the **agent.** At point **F**, you set your car to autopilot and let the self-driving **agent** take over.

The agent now sees what you have asked it to do and checks its mapping **environment**, which represents all the areas in the previous diagram from **A** to **F**.

In the meantime, you are rightly worried. Is the **agent** going to make it or not? You are wondering if its strategy meets yours. You have your **policy** *P*—your way of thinking—which is to take the shortest paths possible. Will the agent agree? What’s going on in its mind? You observe and begin to realize things you never noticed before. Since this is the first time you are using this car and guiding system, the agent is **memoryless**, which is an MDP feature. This means the agent just doesn’t know anything about what went on before. It seems to be happy with just calculating from this **state** *s* at area **F**. It will use machine power to run as many calculations as necessary to reach its goal.

Another thing you are watching is the total distance from **F** to **C** to check whether things are OK. That means that the agent is calculating all the states from **F** to **C**.

In this case, state **F** is state 1, which we can simplify by writing *s*_{1}. **B** is state 2, which we can simplify by writing *s*_{2}. **D** is *s*_{3} and **C** is *s*_{4}. The agent is calculating all of these possible states to make a decision.

The agent knows that when it reaches **D**, **C** will be better because the reward will be higher to go to **C** than anywhere else. Since it cannot eat a piece of cake to reward itself, the agent uses numbers. Our agent is a real number cruncher. When it is wrong, it gets a poor reward or nothing in this model. When it’s right, it gets a reward represented by the letter **R**. This action-value (reward) transition, often named the Q function, is the core of many reinforcement learning algorithms.

When our agent goes from one state to another, it performs a *transition* and gets a reward. For example, the *transition* can be from **F** to **B**, state 1 to state 2, or *s*_{1} to *s*_{2}.
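To make the transition-and-reward idea concrete, here is a minimal sketch. The `graph` dictionary and `transition` function are illustrative names, not part of the book's program; the edges mirror the reward matrix used later in the article, with a reward of 100 for reaching **C**, 1 for any other legal move, and 0 otherwise:

```python
# Illustrative sketch: the delivery map as a graph, plus a transition
# function returning the next state s' and its reward R
graph = {
    "A": ["E"],
    "B": ["D", "F"],
    "C": ["C", "D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D"],
    "F": ["B"],
}

def transition(s, a):
    # An action here is simply "drive to area a"
    s_prime = a
    if s_prime not in graph[s]:
        return s_prime, 0      # illegal move: no reward
    if s_prime == "C":
        return s_prime, 100    # goal reached: big reward
    return s_prime, 1          # legal move: small reward

print(transition("F", "B"))   # ('B', 1)
print(transition("D", "C"))   # ('C', 100)
```

Each call is one transition: the agent leaves *s*, applies *a*, lands in *s'*, and collects a reward.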

You are feeling great and are going to be on time. You are beginning to understand how the machine learning agent in your self-driving car is thinking. Suddenly your guiding system breaks down. All you can see on the screen is that static image of the areas of the last calculation. You look up and see that a traffic jam is building up. Area **D** is still far away, and now you do not know whether it would be good to go from **D** to **C** or **D** to **E** to get a taxi that can take special lanes. You are going to need your agent!

The agent takes the traffic jam into account, is stubborn, and increases its reward to get to **C** by the shortest way. Its **policy** is to stick to the initial plan. You do not agree. You have another **policy**.

You stop the car. You both have to agree before continuing. You have your opinion and policy; the agent does not agree. Before continuing, your views need to **converge**. **Convergence** is the key to making sure that your calculations are correct. This is the kind of problem that people delivering parcels, and soon self-driving vehicles (not to mention drone traffic jams), encounter all day long to get the workload done. The number of parcels to deliver per hour is an example of the workload that needs to be taken into account when making a decision.

To represent the problem at this point, the best way is to express this whole process mathematically.

## Step 2 – the mathematical representation of the Bellman equation and MDP

Mathematics involves a whole change in your perspective on a problem. You are going from words to functions, the pillars of source coding. The goal here is to pick up enough mathematics to implement a solution in real-life companies.

It is necessary to think of a problem through, by finding something familiar around us, such as the delivery itinerary example covered before. It is a good thing to write it down with some abstract letters and symbols as described before, with **a** meaning an action and **s** meaning a state. Once you have understood the problem and expressed the parameters in a way you are used to, you can proceed further.

### From MDP to the Bellman equation

In the previous step 1, the agent went from **F**, or state 1 or *s*, to **B**, which was state 2 or *s'*.

To do that, there was a strategy, a policy represented by **P**. All of this can be shown in one mathematical expression, the MDP state transition function:

*P_{a}(s,s')*

**P** is the policy, the strategy made by the agent to go from **F** to **B** through action *a*. When going from **F** to **B**, this state transition is called the **state transition function**, where:

- *a* is the action
- *s* is state 1 (**F**)
- *s'* is state 2 (**B**)

This is the basis of MDP. The reward (right or wrong) is represented in the same way:

*R_{a}(s,s')*

That means **R** is the reward for the action of going from state *s* to state *s'*. Going from one state to another will be a random process. This means that potentially, all states can go to another state.

The example we will be working on inputs a reward matrix so that the program can choose its best course of action. Then, the agent will go from state to state, learning the best trajectories for every possible starting location point. The goal of the MDP is to go to **C** (line 3, column 3 in the reward matrix), which has a starting value of 100 in the following Python code.

```python
# Markov Decision Process (MDP) - The Bellman equations adapted to
# Reinforcement Learning
# R is The Reward Matrix for each state
R = ql.matrix([[0,0,0,0,1,0],
               [0,0,0,1,0,1],
               [0,0,100,1,0,0],
               [0,1,1,0,1,0],
               [1,0,0,1,0,0],
               [0,1,0,0,0,0]])
```

Each line in the matrix in the example represents a letter from **A** to **F**, and each column represents a letter from **A** to **F**. All possible states are represented. The `1` values represent the possible moves between locations, that is, the edges connecting the nodes (vertices) of the graph. For example, line 1 represents the possible moves for letter **A**, line 2 for letter **B**, and line 6 for letter **F**. On the first line, **A** cannot go to **C** directly, so a `0` value is entered. But it can go to **E**, so a `1` value is added.
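You can verify this reading of the matrix with a few lines of NumPy. This is a sketch using the same `ql` alias as the book's code; the `letters` and `possible` names are illustrative:

```python
import numpy as ql

R = ql.matrix([[0,0,0,0,1,0],
               [0,0,0,1,0,1],
               [0,0,100,1,0,0],
               [0,1,1,0,1,0],
               [1,0,0,1,0,0],
               [0,1,0,0,0,0]])

letters = ["A", "B", "C", "D", "E", "F"]

# For each line (state), list the columns (next states) holding a value > 0
possible = {}
for i, row in enumerate(ql.asarray(R)):
    possible[letters[i]] = [letters[j] for j, v in enumerate(row) if v > 0]
    print(letters[i], "->", possible[letters[i]])
```

Line 1 prints `A -> ['E']` and line 6 prints `F -> ['B']`, matching the description above.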

Some models start with `-1` for impossible choices, such as **B** going directly to **C**, and `0` values to define the locations. This model starts with 0 and 1 values. It sometimes takes weeks to design functions that will create a reward matrix.

To sum it up, we have three tools:

- *P_{a}(s,s')*: A **policy**, **P**, or strategy to move from one state to another
- *T_{a}(s,s')*: A **T**, or stochastic (random) **transition**, function to carry out that action
- *R_{a}(s,s')*: An **R**, or **reward**, for that action, which can be negative, null, or positive

T is the transition function, which makes the **agent** decide to go from one point to another with a policy. In this case, it will be random. That’s what machine power is for, and that’s how reinforcement learning is often implemented.

Randomness is a property of MDP.

The following code describes the choice the *agent* is going to make:

```python
next_action = int(ql.random.choice(PossibleAction, 1))
return next_action
```

Once the code has been run, a new random action (state) has been chosen.

Bellman’s equation completes the MDP. To calculate the value of a state, let’s use `Q`, for the Q action-reward (or value) function. The pre-source code of Bellman’s equation can be expressed as follows for one individual state:

*Q(s) = R(s) + γ max(s')*

The source code then translates the equation into a machine representation as in the following code:

```python
# The Bellman equation
Q[current_state, action] = R[current_state, action] + gamma * MaxValue
```

The source code variables of the Bellman equation are as follows:

- `Q(s)`: This is the value calculated for this state, the total reward. In step 1, when the agent went from **F** to **B**, the driver had to be happy. Maybe he or she munched on a candy bar to feel good, which is the human counterpart of the reward matrix. The automatic driver maybe ate (reward matrix) some electricity, renewable energy of course! The reward is a number such as 50 or 100 to show the agent that it's on the right track. It's like when a student gets a good grade in an exam.
- `R(s)`: This is the sum of the values up to that point. It's the total reward at that point.
- `γ` = **gamma**: This is here to remind us that trial and error has a price. We're wasting time, money, and energy. Furthermore, we don't even know whether the next step is right or wrong, since we're in trial-and-error mode. **Gamma** is often set to `0.8`. What does that mean? Suppose you're taking an exam. You study and study, but you don't really know the outcome. You might have 80 out of 100 (0.8) chances of clearing it. That's painful, but that's life. This is what makes Bellman's equation and MDP realistic and efficient.
- `max(s')`: *s'* is one of the possible states that can be reached with *P_{a}(s,s')*; `max` is the highest value on the line of that state (the location's line in the reward matrix).
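To make these terms concrete, here is a single hand-worked update with illustrative numbers, not values from the actual run: if the immediate reward for an action is 100 and the highest `Q` value on the next state's line is 50, then with gamma at 0.8:

```python
gamma = 0.8
R_sa = 100       # immediate reward R for the action (illustrative)
max_next = 50    # highest Q value on the next state's line (illustrative)

# One application of the Bellman update
Q_sa = R_sa + gamma * max_next
print(Q_sa)      # 140.0
```

The immediate reward counts in full; the future reward is discounted by gamma before being added in.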

## Step 3 – implementing the solution in Python

In step 1, a problem was described in natural language to be able to talk to experts and understand what was expected. In step 2, an essential mathematical bridge was built between natural language and source coding. Step 3 is the software implementation phase.

The code is a reinforcement learning program using the Q function with the following reward matrix:

```python
import numpy as ql

R = ql.matrix([[0,0,0,0,1,0],
               [0,0,0,1,0,1],
               [0,0,100,1,0,0],
               [0,1,1,0,1,0],
               [1,0,0,1,0,0],
               [0,1,0,0,0,0]])

Q = ql.matrix(ql.zeros([6,6]))

gamma = 0.8
```

`R` is the reward matrix described in the mathematical analysis.

`Q` inherits the same structure as `R`, but all values are set to `0` since this is a learning matrix. It will progressively contain the results of the decision process. The gamma variable is a double reminder that the system is learning and that its decisions have only an 80% chance of being correct each time. As the following code shows, the system explores the possible actions during the process.
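As a side note on gamma: a reward that lies several transitions away is discounted once per transition, so its influence shrinks geometrically. A quick illustrative sketch:

```python
gamma = 0.8

# Each additional transition multiplies by gamma, so a distant reward
# weighs less than an immediate one
for steps in range(1, 5):
    print(steps, round(gamma ** steps, 4))
```

After three transitions, only about half (0.512) of a future reward's value survives, which is why the agent favors shorter routes to **C**.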

```python
agent_s_state = 1

# The possible "a" actions when the agent is in a given state
def possible_actions(state):
    current_state_row = R[state,]
    possible_act = ql.where(current_state_row > 0)[1]
    return possible_act

# Get available actions in the current state
PossibleAction = possible_actions(agent_s_state)
```

The agent starts in state 1, for example. You can start wherever you want because it’s a random process. Note that only values > 0 are taken into account. They represent the possible moves (decisions).

The current state goes through an analysis process to find possible actions (next possible states). You will note that there is no algorithm in the traditional sense with many rules. It’s a pure random calculation, as the following `random.choice` function shows.

```python
def ActionChoice(available_actions_range):
    next_action = int(ql.random.choice(PossibleAction, 1))
    return next_action

# Sample next action to be performed
action = ActionChoice(PossibleAction)
```

Now comes the core of the system containing Bellman’s equation, translated into the following source code:

```python
def reward(current_state, action, gamma):
    Max_State = ql.where(Q[action,] == ql.max(Q[action,]))[1]

    if Max_State.shape[0] > 1:
        Max_State = int(ql.random.choice(Max_State, size=1))
    else:
        Max_State = int(Max_State)

    MaxValue = Q[action, Max_State]

    # Q function
    Q[current_state, action] = R[current_state, action] + gamma * MaxValue

# Rewarding Q matrix
reward(agent_s_state, action, gamma)
```

You can see that the agent looks for the maximum value of the next possible state chosen at random.

The best way to understand this is to run the program in your Python environment and `print()` the intermediate values. I suggest that you open a spreadsheet and note the values. It will give you a clear view of the process.

The last part is simply about running the learning process 50,000 times, just to be sure that the system learns everything there is to find. During each iteration, the agent will detect its present state, choose a course of action, and update the `Q` function matrix:

```python
for i in range(50000):
    current_state = ql.random.randint(0, int(Q.shape[0]))
    PossibleAction = possible_actions(current_state)
    action = ActionChoice(PossibleAction)
    reward(current_state, action, gamma)

# Displaying Q before the norm of Q phase
print("Q :")
print(Q)

# Norm of Q
print("Normed Q :")
print(Q / ql.max(Q) * 100)
```

After the process has been repeated until the learning process is over, the program will print the result in `Q` and the normed result. The normed result is obtained by dividing all values by the maximum value found, so each value comes out as a percentage of that maximum.
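The normalization step can be checked in isolation. This is a sketch with illustrative `Q` values, not the values the full run produces:

```python
import numpy as ql

# Illustrative Q values after some learning iterations
Q = ql.matrix([[0., 0., 258.4],
               [0., 320., 0.],
               [400., 0., 0.]])

# Divide every value by the maximum and scale to a percentage
normed = Q / ql.max(Q) * 100
print(normed)
```

The largest value becomes 100, and every other value is expressed relative to it (here, 320 becomes 80).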

View the Python program at https://github.com/PacktPublishing/Artificial-Intelligence-By-Example/blob/master/Chapter01/MDP.py.

In this article, we talked about MDP, a stochastic action-reward (value) system enhanced by Bellman’s equation, as an effective way to solve many AI problems in corporate environments.

Next, to discover how to create the reward matrix in the first place, through explanations and source code, read our book *Artificial Intelligence By Example*.

### Read Next

How Reinforcement Learning works.

Convolutional Neural Networks with Reinforcement Learning.

Implement Reinforcement learning using Markov Decision Process [Tutorial].