*This article is an excerpt from a book by Rodolfo Bonnin titled*

*Machine Learning for Developers.***Reinforcement learning** is a field that has resurfaced recently, and it has become more popular in the fields of control, finding the solutions to games and situational problems, where a number of steps have to be implemented to solve a problem.

A formal definition of reinforcement learning is as follows:

*“Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.**” **(Kaelbling et al. 1996).*

In order to have a reference frame for the type of problem we want to solve, we will start by going back to a mathematical concept developed in the 1950s, called the **Markov decision process**.

## Markov decision process

Before explaining reinforcement learning techniques, we will explain the type of problem we will attack with them.

When talking about reinforcement learning, we want to optimize the problem of a Markov decision process. It consists of a mathematical model that aids decision making in situations where the outcomes are in part random, and in part under the control of an agent.

The main elements of this model are an **Agent**, an **Environment**, and a **State**, as shown in the following diagram:

*Simpliﬁed scheme of a reinforcement learning process*

The agent can perform certain actions (such as moving the paddle left or right). These actions can sometimes result in a reward *r**t*, which can be positive or negative (such as an increase or decrease in the score). Actions change the environment and can lead to a new state *s**t+1*, where the agent can perform another action *a**t+1*. The set of states, actions, and rewards, together with the rules for transitioning from one state to another, make up a Markov decision process.

## Decision elements

To understand the problem, let’s situate ourselves in the problem solving environment and look at the main elements:

- The set of states
- The action to take is to go from one place to another
- The reward function is the value represented by the edge The policy is the way to complete the task
- A discount factor, which determines the importance of future rewards

The main difference with traditional forms of supervised and unsupervised learning is the time taken to calculate the reward, which in reinforcement learning is not instantaneous; it comes after a set of steps.

Thus, the next state depends on the current state and the decision maker’s action, and the state is not dependent on all the previous states (it doesn’t have memory), thus it complies with the Markov property.

Since this is a Markov decision process, the probability of state *s**t+1 *depends only on the current state *s**t *and action *a**t*:

*Unrolled reinforcement mechanism*

The goal of the whole process is to generate a policy *P, *that maximizes rewards. The training samples are tuples, *<s, a, r>*.

## Optimizing the Markov process

Reinforcement learning is an iterative interaction between an agent and the environment. The following occurs at each timestep:

- The process is in a state and the decision-maker may choose any action that is available in that state
- The process responds at the next timestep by randomly moving into a new state and giving the decision-maker a corresponding reward
- The probability that the process moves into its new state is influenced by the chosen action in the form of a state transition function

## Basic RL techniques: Q-learning

One of the most well-known reinforcement learning techniques, and the one we will be implementing in our example, is **Q-learning**.

Q-learning can be used to find an optimal action for any given state in a finite Markov decision process. Q-learning tries to maximize the value of the Q-function that represents the maximum discounted future reward when we perform action *a *in state *s*.

Once we know the Q-function, the optimal action *a *in state *s *is the one with the highest Q- value. We can then define a policy π*(s), *that gives us the optimal action in any state, expressed as follows:

We can define the Q-function for a transition point (*s**t**, a**t**, r**t**, s**t+1*) in terms of the Q-function at the next point (*s**t+1*, *a**t+1*, *r**t+1*, *s**t+2*), similar to what we did with the total discounted future reward. This equation is known as the **Bellman equation for Q-learning:**

In practice, we can think of the Q-function as a lookup table (called a **Q-table**) where the states (denoted by *s*) are rows and the actions (denoted by *a*) are columns, and the elements (denoted by *Q(s, a)*) are the rewards that you get if you are in the state given by the row and take the action given by the column. The best action to take at any state is the one with the highest reward:

```
initialize Q-table Q observe initial state s while (! game_finished):
select and perform action a get reward r
advance to state s'
Q(s, a) = Q(s, a) + α(r + γ max_a' Q(s', a') - Q(s, a)) s = s'
```

You will realize that the algorithm is basically doing stochastic gradient descent on the Bellman equation, backpropagating the reward through the state space (or episode) and averaging over many trials (or epochs). Here, α is the learning rate that determines how much of the difference between the previous Q-value and the discounted new maximum Q- value should be incorporated.

We can represent this process with the following flowchart:

We have successfully reviewed Q-Learning, one of the most important and innovative architecture of reinforcement learning that have appeared in recent. Every day, such reinforcement models are applied in innovative ways, whether to generate feasible new elements from a selection of previously known classes or even to win against professional players in strategy games.

*If you enjoyed this excerpt from the book Machine learning for developers, check out the book below.*