How to implement Reinforcement Learning with TensorFlow

This article is an excerpt from the book Deep Learning Essentials, co-authored by Wei Di, Anurag Bhardwaj, and Jianing Wei. This book will help you get to grips with the essentials of deep learning by leveraging the power of Python.

In today’s tutorial, we will implement reinforcement learning with a TensorFlow-based Q-learning algorithm.

We will look at a popular game, FrozenLake, which is available as a built-in environment in the OpenAI Gym package. The idea behind FrozenLake is quite simple. The game consists of a 4 x 4 grid of blocks, where each block is in one of the following four states:

S: Starting point/Safe state
F: Frozen surface/Safe state
H: Hole/Unsafe state
G: Goal/Safe or Terminal state

In each of the 16 cells, you can use one of the four actions, namely up/down/left/right, to move to a neighboring state. The goal of the game is to start from state S and end at state G. We will show how we can use a neural network-based Q-learning system to learn a safe path from state S to state G. First, we import the necessary packages and define the game environment:

import gym
import numpy as np
import random
import tensorflow as tf

env = gym.make('FrozenLake-v0')
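To make the grid and the four actions concrete, here is a minimal sketch of a deterministic FrozenLake in plain Python. It is not the Gym implementation: the map layout and the action numbering (0 = left, 1 = down, 2 = right, 3 = up) are assumed to match Gym's default 4 x 4 environment, and the names GRID, MOVES, to_state, and step are illustrative only. The real environment is also slippery by default, so a move may not go in the chosen direction, which this sketch ignores.

```python
# Assumed default 4 x 4 layout: S = start, F = frozen, H = hole, G = goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N = 4

# Assumed action numbering: 0 = left, 1 = down, 2 = right, 3 = up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def to_state(row, col):
    """Number the cells 0..15, row by row."""
    return row * N + col


def step(state, action):
    """Return (next_state, reward, done) for a deterministic move."""
    row, col = divmod(state, N)
    d_row, d_col = MOVES[action]
    # A move that would leave the grid keeps the agent on the edge cell.
    row = min(max(row + d_row, 0), N - 1)
    col = min(max(col + d_col, 0), N - 1)
    cell = GRID[row][col]
    reward = 1.0 if cell == "G" else 0.0
    done = cell in "HG"  # holes and the goal both end the episode
    return to_state(row, col), reward, done
```

For example, step(14, 2) moves right from state 14 into the goal and returns (15, 1.0, True), while step(1, 1) moves down into a hole and ends the episode with no reward.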

Once the environment is defined, we can define the network structure that learns the Q-values. We will use a single-layer neural network with 16 input neurons (one per state, one-hot encoded) and 4 output neurons (one Q-value per action), as follows:

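# TensorFlow 1.x graph API: placeholders are fed at run time
# One-hot encoding of the current state (1 x 16)
input_matrix = tf.placeholder(shape=[1,16],dtype=tf.float32)
# Weights mapping each state to four Q-values, initialized to small random values
weight_matrix = tf.Variable(tf.random_uniform([16,4],0,0.01))
Q_matrix = tf.matmul(input_matrix,weight_matrix)
# Index of the action with the highest Q-value
prediction_matrix = tf.argmax(Q_matrix,1)

# Target Q-values and sum-of-squares loss
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Q_matrix))
train = tf.train.GradientDescentOptimizer(learning_rate=0.05)
model = train.minimize(loss)

init_op = tf.global_variables_initializer()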
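Because the input is one-hot, multiplying it by the weight matrix simply selects one row of the weights, so this network behaves like a Q-table with one row per state. The NumPy sketch below illustrates that with the same shapes as the TensorFlow graph. The state index 5 and the seeded random generator are arbitrary choices for the example, not part of the original program.

```python
import numpy as np

num_states, num_actions = 16, 4

# Small random weights, mirroring tf.random_uniform([16, 4], 0, 0.01).
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
W = rng.uniform(0, 0.01, size=(num_states, num_actions))

# One-hot encode an arbitrary example state.
state = 5
one_hot = np.zeros((1, num_states))
one_hot[0, state] = 1.0

# The matrix product picks out row `state` of W: one Q-value per action.
q_values = one_hot @ W  # shape (1, 4)

# The greedy action is the column with the largest Q-value.
greedy_action = int(np.argmax(q_values, axis=1)[0])
```

Here q_values[0] is identical to W[state], which is why training this network amounts to adjusting one table entry per move.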

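Now we can choose the action with an epsilon-greedy policy and update the network. The following steps run for each move inside the training loop. They assume that sess is an open tf.Session in which init_op has been run, current_state is the agent's current state, num_states is 16, sample_epsilon is the exploration probability, and y is the discount factor: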

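# One-hot encode the current state
ip_q = np.zeros(num_states)
ip_q[current_state] = 1

# Greedy action and Q-values for the current state
a,allQ = sess.run([prediction_matrix,Q_matrix],feed_dict={input_matrix:[ip_q]})
# With probability sample_epsilon, explore with a random action instead
if np.random.rand(1) < sample_epsilon:
    a[0] = env.action_space.sample()

next_state, reward, done, info = env.step(a[0])

# Q-values for the next state
ip_q1 = np.zeros(num_states)
ip_q1[next_state] = 1
Q1 = sess.run(Q_matrix,feed_dict={input_matrix:[ip_q1]})
maxQ1 = np.max(Q1)

# Target: the current Q-values with the chosen action's entry replaced
targetQ = allQ
targetQ[0,a[0]] = reward + y*maxQ1

# Run one gradient-descent step toward the target
_,W1 = sess.run([model,weight_matrix],feed_dict={input_matrix:[ip_q],nextQ:targetQ})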
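The arithmetic behind these steps can be sketched in NumPy without a TensorFlow session. Since the target equals the predicted Q-values everywhere except at the chosen action, the squared-error loss reduces to a single term, and one gradient-descent step changes only the weight for that state and action. The function names choose_action and q_update, and the discount factor y=0.99, are assumptions made for this illustration; the excerpt does not give a value for y.

```python
import numpy as np


def choose_action(W, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(W.shape[1]))
    return int(np.argmax(W[state]))


def q_update(W, state, action, reward, next_state, y=0.99, learning_rate=0.05):
    """Apply one gradient-descent step on the squared error for a single move."""
    # Q-learning target: immediate reward plus discounted best next Q-value.
    target = reward + y * np.max(W[next_state])
    td_error = target - W[state, action]
    # The gradient of (target - W[state, action])**2 with respect to
    # W[state, action] is -2 * td_error, so descent adds 2 * lr * td_error.
    W[state, action] += learning_rate * 2.0 * td_error
    return target


# Worked example: all Q-values start at zero, and the agent moves from
# state 14 to the goal state 15 with action 2, receiving a reward of 1.
W = np.zeros((16, 4))
q_update(W, state=14, action=2, reward=1.0, next_state=15)
```

After this single update, W[14, 2] is 0.1: the target is 1.0, the error is 1.0, and the step size is 2 x 0.05. Repeating the same move shrinks the remaining error by the same factor each time.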

The figure "RL with Q-learning example" shows sample output of the program when executed. You can see the values of the Q matrix change as the agent moves from one state to another. You can also see a reward of 1 when the agent reaches state 15, the goal.

To summarize, we saw how reinforcement learning can be practically implemented using TensorFlow.

If you found this post useful, do check out the book Deep Learning Essentials which will help you fine-tune and optimize your deep learning models for better performance.

Gebin George