This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow 1.x, such as building distributed TensorFlow clusters, deploying production models with TensorFlow Serving, and more.

Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game.

OpenAI Gym 101

OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 open-source, community-contributed environments at the time of writing. With OpenAI Gym, you can also create your own environment. The biggest advantage is that OpenAI Gym provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms.

Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540

You can install OpenAI Gym using the following command:

pip3 install gym

Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/gym#installation

  1. Let us print the number of available environments in OpenAI Gym:

import gym

all_env = list(gym.envs.registry.all())
print('Total Environments in Gym version {} : {}'
      .format(gym.__version__, len(all_env)))

Total Environments in Gym version 0.9.4 : 777
  2. Let us print the list of all environments:

for e in list(all_env):
    print(e)

The partial list from the output is as follows:

EnvSpec(Carnival-ramNoFrameskip-v0)
EnvSpec(EnduroDeterministic-v0)
EnvSpec(FrostbiteNoFrameskip-v4)
EnvSpec(Taxi-v2)
EnvSpec(Pooyan-ram-v0)
EnvSpec(Solaris-ram-v4)
EnvSpec(Breakout-ramDeterministic-v0)
EnvSpec(Kangaroo-ram-v4)
EnvSpec(StarGunner-ram-v4)
EnvSpec(Enduro-ramNoFrameskip-v4)
EnvSpec(DemonAttack-ramDeterministic-v0)
EnvSpec(TimePilot-ramNoFrameskip-v0)
EnvSpec(Amidar-v4)

Each environment, represented by an env object, has a standardized interface. For example, an env object can be created with the gym.make() function by passing the environment's id string.

Each env object contains the following main functions:

The step() function takes an action object as an argument and returns four objects:

observation: An object implemented by the environment, representing the observation of the environment.

reward: A signed float value indicating the gain (or loss) from the previous action.

done: A Boolean value representing whether the scenario is finished.

info: A Python dictionary object representing the diagnostic information.

The render() function creates a visual representation of the environment.

The reset() function resets the environment to the original state.

Each env object comes with well-defined actions and observations, represented by action_space and observation_space.
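Because these few functions are the whole contract, the standard agent loop can be sketched even without Gym installed. The following toy FixedEnv class is a made-up stand-in (it is not part of Gym) that mimics the reset()/step() interface just to show the shape of the loop:

```python
import random

class FixedEnv:
    """Toy stand-in mimicking the Gym env interface (not part of Gym)."""
    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0                      # initial observation

    def step(self, action):
        self.t += 1
        observation = float(self.t)     # dummy observation
        reward = 1.0                    # +1 per surviving step, like CartPole
        done = self.t >= self.max_steps
        info = {}
        return observation, reward, done, info

    def sample_action(self):
        return random.choice([0, 1])    # stand-in for action_space.sample()

env = FixedEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = env.sample_action()
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)   # 5.0: one reward unit per step for max_steps=5
```

The same reset-step-accumulate loop drives every example that follows; only the environment and the policy change.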

One of the most popular games in Gym for learning reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words:

The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics.

The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of observation is not necessary to learn to maximize the rewards of the game.
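The termination thresholds quoted above can be expressed directly in code. The following check is only an illustration based on those figures (Gym's internal constants may differ slightly); it unpacks a CartPole observation and reports whether the episode is over:

```python
import math

def cartpole_done(observation):
    # position, velocity, angle (radians), rotation rate
    x, x_dot, theta, theta_dot = observation
    # Episode ends if the cart leaves the track or the pole tilts too far.
    return abs(x) > 2.4 or abs(theta) > math.radians(15)

print(cartpole_done([0.0, 0.1, 0.05, 0.0]))   # False: near center, small tilt
print(cartpole_done([2.5, 0.0, 0.0, 0.0]))    # True: cart moved past 2.4 units
print(cartpole_done([0.0, 0.0, 0.3, 0.0]))    # True: 0.3 rad exceeds 15 degrees
```

In practice you never compute this yourself; the environment returns it as the done flag from step().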

Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:

  1. Create the env object with the standard make function:

env = gym.make('CartPole-v0')
  2. The number of episodes is the number of game plays. We shall set it to one, for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you would run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep:

n_episodes = 1
env_vis = []
  3. Run two nested loops: an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value.

At the beginning of every episode, reset the environment using env.reset().

At the beginning of every timestep, capture the visualization using env.render().

for i_episode in range(n_episodes):
    observation = env.reset()
    for t in range(100):
        env_vis.append(env.render(mode='rgb_array'))
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished at t{}".format(t+1))
            break

  4. Render the environment using the helper function:

env_render(env_vis)

  5. The code for the helper function is as follows:
import matplotlib.pyplot as plt
import matplotlib.animation as anm
from IPython.display import display
from JSAnimation.IPython_display import display_animation

def env_render(env_vis):
    plot = plt.imshow(env_vis[0])

    def animate(i):
        plot.set_data(env_vis[i])

    anim = anm.FuncAnimation(plt.gcf(), animate, frames=len(env_vis),
                             interval=20, repeat=True, repeat_delay=20)
    display(display_animation(anim, default_mode='loop'))

We get the following output when we run this example:

[-0.00666995 -0.03699492 -0.00972623  0.00287713]
[-0.00740985  0.15826516 -0.00966868 -0.29285861]
[-0.00424454 -0.03671761 -0.01552586 -0.00324067]
[-0.0049789  -0.2316135  -0.01559067  0.28450351]
[-0.00961117 -0.42650966 -0.0099006   0.57222875]
[-0.01814136 -0.23125029  0.00154398  0.27644332]
[-0.02276636 -0.0361504   0.00707284 -0.01575223]
[-0.02348937  0.1588694   0.0067578  -0.30619523]
[-0.02031198 -0.03634819  0.00063389 -0.01138875]
[-0.02103895  0.15876466  0.00040612 -0.3038716 ]
[-0.01786366  0.35388083 -0.00567131 -0.59642642]
[-0.01078604  0.54908168 -0.01759984 -0.89089036]
[ 1.95594914e-04  7.44437934e-01 -3.54176495e-02 -1.18905344e+00]
[ 0.01508435  0.54979251 -0.05919872 -0.90767902]
[ 0.0260802   0.35551978 -0.0773523  -0.63417465]
[ 0.0331906   0.55163065 -0.09003579 -0.95018025]
[ 0.04422321  0.74784161 -0.1090394  -1.26973934]
[ 0.05918004  0.55426764 -0.13443418 -1.01309691]
[ 0.0702654   0.36117014 -0.15469612 -0.76546874]
[ 0.0774888   0.16847818 -0.1700055  -0.52518186]
[ 0.08085836  0.3655333  -0.18050913 -0.86624457]
[ 0.08816903  0.56259197 -0.19783403 -1.20981195]

Episode  finished  at  t22

It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action stochastically by using env.action_space.sample().

Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for keeping the pole balanced for a larger number of time-steps, such as Hill Climbing, Random Search, and Policy Gradient.

Note: Implementations of some of the algorithms for solving the CartPole game are available online.




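To give a flavor of the first of those, here is a minimal hill-climbing sketch: perturb the best parameters found so far and keep the perturbation only if it improves the reward. Because this is just an illustration, a made-up quadratic mock_episode_reward stands in for actually running a CartPole episode; with a real environment you would return the episode's total reward instead:

```python
import numpy as np

# Hypothetical stand-in for running one CartPole episode with parameters theta:
# the reward is simply higher the closer theta is to an arbitrary optimum.
def mock_episode_reward(theta, optimum=np.array([0.1, 0.9, 0.7, 0.8])):
    return -np.sum((theta - optimum) ** 2)

def hill_climb(n_iterations=200, noise_scale=0.1, seed=0):
    rng = np.random.RandomState(seed)
    theta_best = rng.rand(4) * 2 - 1            # random start in [-1, 1)
    reward_best = mock_episode_reward(theta_best)
    for _ in range(n_iterations):
        candidate = theta_best + noise_scale * rng.randn(4)  # perturb best theta
        reward = mock_episode_reward(candidate)
        if reward > reward_best:                # keep only improvements
            theta_best, reward_best = candidate, reward
    return theta_best, reward_best

theta, reward = hill_climb()
print(reward)   # climbs toward 0, the maximum of the mock reward
```

Random Search, which we use below, is even simpler: instead of perturbing the best parameters, it draws a fresh random theta every episode and remembers the best one seen.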
Applying simple policies to a cartpole game

So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, that means the pole is tilting right, so we move the cart to the right (action 1). Otherwise, we move the cart to the left (action 0). Let us look at an example:

  1. We define two policy functions as follows:

def policy_logic(env, obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env, obs):
    return env.action_space.sample()
  2. Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop as we do not wish to run the experiment forever:

import numpy as np

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards),
                  np.max(rewards), np.mean(rewards)))
  3. We run the experiment 100 times, or until the rewards are less than or equal to rewards_max, which is set to 10,000:

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that the logically selected actions do better than the randomly selected ones, but not that much better:

Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81

Now let us modify the action-selection process further, to be based on parameters. The parameters will be multiplied by the observations, and the action (0 or 1) will be chosen based on whether the result of that multiplication is less than zero. Let us modify the random search method so that we initialize the parameters randomly. The code looks as follows:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards),
                  np.max(rewards), np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)

We can see that random search does improve the results:

Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03

With the random search, we have improved our results to get the maximum reward of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first:

def policy_logic(theta, obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta, obs):
    return 0 if np.matmul(theta, obs) < 0 else 1

def episode(env, policy, rewards_max, theta=None):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0
    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None
        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        #print("Episode finished at t{}".format(reward))
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__, np.min(rewards),
                  np.max(rewards), np.mean(rewards)))

We train for 100 episodes and then use the best parameters to run the experiment for the random search policy:

n_episodes = 100
rewards_max = 10000
reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)

We find that training the parameters gives us the best result of 200:

trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94

Policy:policy_logic,  Min  reward:24.0,  Max  reward:63.0,  Average  reward:41.94

We may optimize the training code to continue training until we reach a maximum reward.
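One simple way to do that, sketched here with a hypothetical run_episode stand-in instead of a real Gym environment, is to wrap the random search in a loop that stops as soon as the target reward is reached, with an iteration cap as a safety net:

```python
import numpy as np

# Hypothetical stand-in for a CartPole episode: it rewards any theta whose
# third component is positive, capped at 200 like CartPole-v0's time limit.
def run_episode(theta):
    return 200.0 if theta[2] > 0 else 20.0

def train_until(target_reward, max_iterations=10000, seed=0):
    rng = np.random.RandomState(seed)
    reward_best, theta_best = 0.0, None
    for i in range(max_iterations):
        theta = rng.rand(4) * 2 - 1       # fresh random parameters each episode
        reward = run_episode(theta)
        if reward > reward_best:
            reward_best, theta_best = reward, theta
        if reward_best >= target_reward:  # stop as soon as the target is hit
            break
    return reward_best, theta_best

reward, theta = train_until(target_reward=200.0)
print(reward)   # 200.0
```

With a real environment, run_episode would be replaced by the episode function defined above, and max_iterations keeps the loop from running forever if the target is never reached.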

To summarize, we learned the basics of OpenAI Gym and applied them to a cartpole game.

If you found this post useful, do check out the book Mastering TensorFlow 1.x to learn how to build, scale, and deploy deep neural network models using star libraries in Python.

