This article is an excerpt taken from the book, Hands-On Intelligent Agents with OpenAI Gym, written by Praveen Palanisamy. In this article, the author introduces us to the Markov Decision Process followed by the understanding of Deep reinforcement learning.
A Markov Decision Process (MDP) provides a formal framework for reinforcement learning. It is used to describe a fully observable environment where the outcomes are partly random and partly dependent on the actions taken by the agent or the decision maker. The following diagram is the progression of a Markov Process into a Markov Decision Process through the Markov Reward Process:
These stages can be described as follows:
Dynamic programming is a very general method to efficiently solve problems that can be decomposed into overlapping sub-problems. If you have used any type of recursive function in your code, you might have already got some preliminary flavor of dynamic programming. Dynamic programming, in simple terms, tries to cache or store the results of sub-problems so that they can be used later if required, instead of computing the results again.
Okay, so how is that relevant here, you may ask. Well, they are pretty useful for solving a fully defined MDP, which means that an agent can find the most optimal way to act in an environment to achieve the highest reward using dynamic programming if it has full knowledge of the MDP! In the following table, you will find a concise summary of what the inputs and outputs are when we are interested in sequential prediction or control:
Task/objective | Input | Output |
Prediction | MDP or MRP and policy | Value function |
Control | MDP | Optimal value function and optimal policy |
At this point, we understand that it is very useful for an agent to learn the state value function
The goal of MC and TD learning is to learn the value functions from the agent’s experience as the agent follows its policy
The following table summarizes the value estimate’s update equation for the MC and TD learning methods:
Learning method | State-value function |
Monte Carlo | |
Temporal Difference |
MC learning updates the value towards the actual return
It is also very useful for an agent to learn the action value function
Learning method | Action-value function |
SARSA | |
Q-learning |
SARSA is so named because of the sequence State->Action->Reward->State’->Action’ that the algorithm’s update step depends on. The description of the sequence goes like this: the agent, in state S, takes an action A and gets a reward R, and ends up in the next state S’, after which the agent decides to take an action A’ in the new state. Based on this experience, the agent can update its estimate of Q(S,A).
Q-learning is a popular off-policy learning algorithm, and it is similar to SARSA, except for one thing. Instead of using the Q value estimate for the new state and the action that the agent took in that new state, it uses the Q value estimate that corresponds to the action that leads to the maximum obtainable Q value from that new state, S’.
With a basic understanding of reinforcement learning, you are now in a better state (hopefully you are not in a strictly Markov state where you have forgotten the history/things you have learned so far) to understand the basics of the cool new suite of algorithms that have been rocking the field of AI in recent times.
Deep reinforcement learning emerged naturally when people made advancements in the deep learning field and applied them to reinforcement learning. We learned about the state-value function, action-value function, and policy. Let’s briefly look at how they can be represented mathematically or realized through computer code. The state-value function
If we do this, then we are doing deep reinforcement learning! Easy enough to understand? I hope so. Let’s look at some other ways in which we can use deep learning in reinforcement learning.
Recall that a policy is represented as
So in summary, deep reinforcement learning is the application of deep learning to reinforcement learning and so far, researchers have applied deep learning to reinforcement learning successfully in two ways. One way is using deep neural networks to approximate the value functions, and the other way is to use a deep neural network to represent the policy.
These ideas have been known from the early days, when researchers were trying to use neural networks as value function approximators, even back in 2005. But it rose to stardom only recently because although neural networks or other non-linear value function approximators can better represent the complex values of environment states and actions, they were prone to instability and often led to sub-optimal functions. Only recently have researchers such as Volodymyr Mnih and his colleagues at DeepMind (now part of Google) figured out the trick of stabilizing the learning and trained agents with deep, non-linear function approximators that converged to near-optimal value functions. In the later chapters of this book, we will, in fact, reproduce some of their then-groundbreaking results, which surpassed human Atari game playing capabilities!
Until recently, practical applications of reinforcement learning and deep reinforcement learning were limited, due to sample complexity and instability. But, these algorithms proved to be quite powerful in solving some really hard practical problems. Some of them are listed here to give you an idea:
To summarize, in this article, we learned about the Markov Decision process, Deep reinforcement learning, and its applications. If you’ve enjoyed this post, head over to the book, Hands-On Intelligent Agents with OpenAI Gym for implementing learning algorithms for machine software agents in order to solve discrete or continuous sequential decision making and control tasks, and much more.
I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…
Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…
Once we learn how to deploy an Ubuntu server, how to manage users, and how…
Key-takeaways: Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…
While developing a web application, or setting dynamic pages and meta tags we need to deal with…
Software architecture is one of the most discussed topics in the software industry today, and…