3 min read

Earlier this month, researchers from the Robotics Institute at Carnegie Mellon University published a paper proposing a learning approach called competitive reinforcement learning using visual transfer. Built on the asynchronous advantage actor-critic (A3C) architecture, the method generalizes an agent trained on a source Atari game to a target Atari game.

What is the A3C architecture?

The A3C architecture is an asynchronous variant of the actor-critic model: the actor takes in the current environment state and chooses an action, while the critic estimates how valuable that state is. The network consists of four convolutional layers, an LSTM layer, and two fully connected layers that predict the action probabilities and the value functions of the states.
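
To make the shape of the model concrete, here is a minimal PyTorch sketch of such a network; the filter counts, kernel sizes, and the 84x84 input assumption are illustrative choices, not values taken from the paper:

```python
import torch.nn as nn

class A3CNet(nn.Module):
    """Sketch of an A3C-style network: conv stack + LSTM + policy/value heads.
    Layer sizes are illustrative assumptions, not the paper's exact values."""
    def __init__(self, num_actions, in_channels=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTMCell(32 * 6 * 6, 256)         # assumes 84x84 input frames
        self.policy_head = nn.Linear(256, num_actions)   # fully connected: action logits
        self.value_head = nn.Linear(256, 1)              # fully connected: state value

    def forward(self, frame, hidden=None):
        x = self.conv(frame)                             # (batch, 32, 6, 6) for 84x84 input
        x = x.view(x.size(0), -1)
        h, c = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), (h, c)
```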

In this architecture, multiple worker agents are trained in parallel, each with its own copy of the model and environment. Advantage refers to the metric used to judge how much better an agent's actions turned out than the critic expected.
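
The advantage itself is just the gap between the return a worker actually observed and the value the critic predicted. A minimal sketch of the calculation a worker might perform at the end of a rollout (the discount factor and bootstrapping follow the standard A3C recipe; the exact hyperparameters are assumptions):

```python
def compute_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage = discounted return minus the critic's value estimate.
    `values` are the critic's predictions for the visited states."""
    advantages, R = [], bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R          # discounted return from this step onward
        advantages.append(R - v)   # how much better things went than expected
    return list(reversed(advantages))

# Example: a short rollout where the critic underestimated the final reward
print(compute_advantages([0.0, 0.0, 1.0], [0.2, 0.3, 0.5], bootstrap_value=0.0))
```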

What is the aim of competitive reinforcement learning?

The learning approach introduced in this paper aims to use a reinforcement learning agent to generalize between two related but visually different Atari games, Pong-v0 and Breakout-v0. This is done by learning visual mappers: given a frame from the source game, the mapper generates the analogous frame in the target game.

In both these games, a paddle is controlled to hit a ball towards a certain objective. Using this method, the six actions of Pong-v0 {No Operation, Fire, Right, Left, Right Fire, Left Fire} are mapped onto the four actions of Breakout-v0 as {Fire, Fire, Right, Left, Right, Left} respectively. The rewards are mapped directly from the source game to the target game without any scaling. Both the source and target environments come from OpenAI Gym.
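
In code, that action mapping is essentially a lookup table applied before each environment step. The sketch below assumes the classic Gym step API and the default action orderings of Pong-v0 and Breakout-v0; check env.unwrapped.get_action_meanings() if reproducing this:

```python
import gym

# Pong-v0 action index -> Breakout-v0 action index (names as listed above).
# The index order is an assumption based on the default Gym action meanings.
PONG_TO_BREAKOUT = {
    0: 1,  # NOOP      -> FIRE
    1: 1,  # FIRE      -> FIRE
    2: 2,  # RIGHT     -> RIGHT
    3: 3,  # LEFT      -> LEFT
    4: 2,  # RIGHTFIRE -> RIGHT
    5: 3,  # LEFTFIRE  -> LEFT
}

def step_target(env, pong_action):
    """Play the Pong-policy action in Breakout; rewards pass through unscaled."""
    obs, reward, done, info = env.step(PONG_TO_BREAKOUT[pong_action])
    return obs, reward, done, info

env = gym.make("Breakout-v0")  # target environment from OpenAI Gym
```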

Underlying similarities between the source and the target game are captured as common knowledge using an Unsupervised Image-to-Image Translation (UNIT) generative adversarial network (GAN). During transfer, the raw target game frames compete with the visual representation obtained by using the UNIT GAN as a visual mapper between the source and target game.
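
Once the UNIT GAN is trained, using it as a visual mapper amounts to running each source frame through the source-to-target generator before handing the result to the agent. In the sketch below, unit_generator is a hypothetical handle to such a trained generator, not an interface from the paper:

```python
import numpy as np

def map_frame(unit_generator, pong_frame):
    """Translate a Pong frame into a Breakout-like frame with the UNIT generator.
    `unit_generator` is a placeholder for a trained source->target generator."""
    # Normalize to [-1, 1], the range GAN generators are commonly trained on (assumption).
    x = pong_frame.astype(np.float32) / 127.5 - 1.0
    fake_target = unit_generator(x[None, ...])       # add a batch dimension
    # Back to uint8 so it goes through the same preprocessing as real frames.
    return np.clip((fake_target[0] + 1.0) * 127.5, 0, 255).astype(np.uint8)
```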

How does competitive reinforcement learning work?

The following diagram depicts how knowledge is transferred from source game to target game by competitively and simultaneously fine-tuning the model using two different visual representations of the target game:

[Diagram from the paper showing the two-stage transfer process. Source: arXiv]

  • First stage: The baseline A3C network is trained on the source game (Pong-v0) in the first stage of the training process. The knowledge learned by this model is then transferred to learning to play the target game (Breakout-v0). The efficiency of the transfer learning method is measured in terms of training time and data efficiency across the parallel actor-learners.
  • Second stage: In this stage of the training process, two representations of the target game are shared amongst the workers running in parallel. The first representation uses target game frames taken directly from the environment. The second representation uses frames produced by the visual mapper (visual analogies between the games). The ratio of workers that train directly on frames queried from the target game to workers that train on frames mapped from the source game is a hyperparameter determined through experimentation (see the sketch after this list).
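
A sketch of how that worker split might be wired up is shown below; the 0.75 ratio is only an illustrative value, since the paper treats this ratio as a hyperparameter to be tuned:

```python
def assign_worker_inputs(num_workers, real_frame_ratio=0.75):
    """Decide, per worker, whether it trains on raw Breakout frames or on
    frames produced by the UNIT visual mapper. The ratio is the hyperparameter
    tuned by experiment; 0.75 here is just an illustrative value."""
    num_real = round(num_workers * real_frame_ratio)
    return ["real_target_frames" if i < num_real else "mapped_source_frames"
            for i in range(num_workers)]

# With 8 workers and a 0.75 split: 6 train on real frames, 2 on GAN-mapped frames.
print(assign_worker_inputs(8))
```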

Results

  • They concluded that it is possible to generate a visual mapper for semantically similar games with the use of UNIT GANs.
  • Learning from two different representations of the same game and using them simultaneously for transfer learning stabilizes the learning curve.
  • Although the workers using representations of the target game obtained from the visual mappers did not perform well in a standalone setting, they showed improvements when used for competitive learning.

To read more about this learning approach and its efficiency, check out this research paper published by Akshita Mittel, Purna Sowmya Munukutla, and Himanshi Yadav: Visual Transfer between Atari Games using Competitive Reinforcement Learning.

Read Next

“Deep meta reinforcement learning will be the future of AI where we will be so close to achieving artificial general intelligence (AGI)”, Sudharsan Ravichandiran

This self-driving car can drive in its imagination using deep reinforcement learning

Dopamine: A Tensorflow-based framework for flexible and reproducible Reinforcement Learning research by Google