
In this final post of our three-part series we cover optimizers, batches, and more complex networks, and we also discuss running our example code on the GPU. Let's continue from where we left off in Part 2, training the model, and go over optimizers.

Optimizers

The module chainer.optimizer is responsible for orchestrating the parameter updates that minimize the loss. The optimizer is instantiated from a class that inherits the base class chainer.optimizer.Optimizer. Once instantiated, it needs to be set up by calling its setup() method, passing it a Link to optimize, which in turn contains all the Chainer variables that are to be trained. Remember that we give it the whole model, an instance of the chainer.link.Chain class, which is a subclass of chainer.link.Link. We can then call the update method every time we want to optimize the parameters in the training loop. The update method is invoked by passing a loss function and its arguments, in this case the input value and the target value. One motivation for defining the __call__ method as the loss function is precisely so that the model instance can be passed to the optimizer here. Note that the loss is both stored in the class instance and returned. It is returned so that the model instance can be passed directly to the optimizer, which expects a loss function, and it is stored in the model itself so that the training loop can read it to compute the average loss over the course of an epoch.
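
As a minimal sketch, assuming the Autoencoder model and the mini-batch variables x (input) and t (target) from Part 2, the setup looks roughly like this:

from chainer import optimizers

# The Autoencoder model and the mini-batches x (input) and t (target)
# are assumed to come from Part 2.
model = Autoencoder()

learning_rate = 0.01  # illustrative value
optimizer = optimizers.SGD(lr=learning_rate)
optimizer.setup(model)  # hand the optimizer the Chain whose parameters it should tune

# One parameter update: the model itself acts as the loss function,
# so it is passed to update() together with its arguments.
optimizer.update(model, x, t)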

The chainer.optimizers.SGD used in this example inherits from chainer.optimizer.GradientMethod, which in turn inherits from the base optimizer class. SGD, or Stochastic Gradient Descent, performs a parameter update on each call to update, much like the gradient descent we did by hand in the previous article. You have probably noticed that it takes the learning rate as an argument in its constructor.

What the update method actually does is first reset the gradients of the variables in the model, compute the loss, run the backward() method on the resulting loss and then update the parameters using the algorithm defined by the optimizer instance, plain SGD in our case. Other optimization algorithms in the framework include MomentumSGD, AdaGrad, AdaDelta, Adam and RMSprop. To train the model in this example using AdaGrad instead of SGD, you only need to change the instantiation of the optimizer from optimizer = optimizers.SGD(lr=learning_rate) to optimizer = optimizers.AdaGrad(lr=learning_rate, eps=1e-08). The arguments differ from optimizer to optimizer. AdaGrad, for instance, also takes the epsilon smoothing term, usually a small number that defaults to 1e-08 in Chainer.
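
Roughly, a single call to optimizer.update(model, x, t) is equivalent to the following manual steps, again assuming the model and data from Part 2:

# Roughly what optimizer.update(model, x, t) does under the hood
model.zerograds()   # reset the accumulated gradients of all parameters
loss = model(x, t)  # forward pass, computes and stores the loss
loss.backward()     # backpropagate to fill in the gradients
optimizer.update()  # apply the update rule (plain SGD in our case)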

Always in Batches

Each time you feed the model with training data (by invoking the __call__ method directly on the model instance or using optimizer.update()), the data needs to be in mini-batch format. This means that you cannot feed the model above with a single one-dimensional sample such as x = Variable(np.array([1, 2, 3], dtype=np.float32)), even if you want to do online training. Doing so results in the following error.

...
Invalid operation is performed in: LinearFunction (Forward)

Expect: in_types[0].ndim >= 2
Actual: 1 < 2

However, wrapping the array in another list, x = Variable(np.array([[1, 2, 3]], dtype=np.float32)), works, since you have simply made a mini-batch of size one, which yields the same result as online training. You can of course also train the model with regular batch training by passing the whole dataset to the model, performing one update per epoch.
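
To make the shape requirement concrete, here is a small sketch, assuming a model with three-dimensional inputs as in the snippet above:

import numpy as np
from chainer import Variable

# A single 1-D sample triggers the ndim error above ...
x_bad = Variable(np.array([1, 2, 3], dtype=np.float32))        # shape (3,)

# ... while an extra pair of brackets makes it a mini-batch of size one.
x_online = Variable(np.array([[1, 2, 3]], dtype=np.float32))   # shape (1, 3)

# Regular batch training: the whole dataset as one (n_samples, n_features) batch.
x_full = Variable(np.random.rand(100, 3).astype(np.float32))   # shape (100, 3)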

You might have noticed that the loss being returned is multiplied by the size of the batch. This is because Chainer loss functions such as chainer.functions.mean_squared_error return the average loss over the given mini-batch. In order to compute the average loss over the complete dataset, we keep track of an accumulated loss over the individual samples. If you don't want to do that, you could simply pass the whole dataset as a single batch through the model at the end of each epoch to compute the loss.
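
As a sketch of the first approach, assuming the model stores its latest mini-batch loss in model.loss as described earlier and that batches is some hypothetical iterable of mini-batch pairs, the bookkeeping could look like this:

total_loss = 0.0
n_samples = 0

for x, t in batches:                 # hypothetical iterable of (input, target) mini-batches
    optimizer.update(model, x, t)
    batch_size = x.data.shape[0]
    # model.loss is the averaged mini-batch loss, so scale it back up
    total_loss += float(model.loss.data) * batch_size
    n_samples += batch_size

print('epoch loss: {}'.format(total_loss / n_samples))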

Extending to More Complex Networks

Adding activation functions or noise such as dropout can be done with a single function call in the model definition. These functions are all part of the chainer.functions module. If you'd like to add 50% dropout to the hidden layer in our autoencoder, you'd change the first line of the forward pass from h = self.l1(x) to h = F.dropout(self.l1(x), ratio=0.5). Since this is such a small network, you will see that the loss increases quite significantly. Adding a ReLU activation function would look like this: F.relu(self.l1(x)). These functions can be applied to other Links as well, not just the linear connections that we've used in the autoencoder.
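
Putting both together, a sketch of the autoencoder with a ReLU activation and dropout on the hidden layer could look as follows; the layer sizes are assumptions for illustration:

import chainer
import chainer.functions as F
import chainer.links as L


class Autoencoder(chainer.Chain):
    # Autoencoder from Part 2 with a ReLU activation and 50% dropout added
    # to the hidden layer; the layer sizes are assumptions for illustration.
    def __init__(self):
        super(Autoencoder, self).__init__(
            l1=L.Linear(10, 2),  # encoder
            l2=L.Linear(2, 10),  # decoder
        )

    def __call__(self, x, t):
        h = F.dropout(F.relu(self.l1(x)), ratio=0.5)  # activation plus noise
        y = self.l2(h)
        self.loss = F.mean_squared_error(y, t)
        return self.loss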

Creating other types of networks, such as convolutional or recurrent ones, is done by changing the layer definitions in the Chain constructor. What you mainly need to be careful of when training is that the dimensions of the Chainer variables passed between the layers, and especially into the input layer, match. If the first layer of a network is a convolutional layer, that is chainer.links.Convolution2D, the input dimensions are slightly different from this autoencoder example, since the data has additional channel, height and width dimensions. Remember that the data is still passed in batches.
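
For example, a sketch of feeding a mini-batch into a convolutional layer, with all sizes chosen purely for illustration:

import numpy as np
import chainer.links as L
from chainer import Variable

# Convolution2D expects 4-D mini-batches: (batch, channels, height, width).
conv = L.Convolution2D(in_channels=1, out_channels=16, ksize=3)

x = Variable(np.random.rand(32, 1, 28, 28).astype(np.float32))  # 32 grayscale 28x28 images
h = conv(x)  # h.data.shape == (32, 16, 26, 26)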

Running the Same Code on the GPU

Assuming that CUDA is installed, you only need to add a few lines of code in order to train the model on the GPU. All you need to do is copy the Chainer Links and the training data to the GPU, and you can run the same training code. Since the CuPy interface in Chainer implements a subset of the NumPy interface, we can write this nicely in the following way.

import numpy as np
from chainer import cuda
from chainer.cuda import cupy as cp

model = Autoencoder()

# A non-negative device id selects a GPU, a negative id falls back to the CPU
if device_id >= 0:
    cuda.check_cuda_available()

    # A CUDA installation was found, set the default device
    cuda.get_device(device_id).use()

    # Copy the Links to the default device
    model.to_gpu()

    xp = cp
else:
    xp = np

# Replace all old occurrences of np with xp

The device_id in this case identifies a GPU. If you have installed CUDA correctly, you should be able to list all available devices with the nvidia-smi command in the CLI. The id can of course be hardcoded, but it could, for example, also be passed as an argument to the Python script. Depending on the specified id (a negative id falls back to the CPU) and the availability, the variable xp is set to NumPy or CuPy accordingly. What you need to change in the rest of the code is simply to replace all previous occurrences of np with xp.
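
For the training data, a sketch of building the mini-batches on the selected device could look like this, where data_array and target_array stand in for whatever NumPy arrays hold your samples:

from chainer import Variable

# With xp in place, the mini-batches are created on the same device as the
# model; data_array and target_array are hypothetical NumPy arrays of samples.
x = Variable(xp.asarray(data_array, dtype=xp.float32))
t = Variable(xp.asarray(target_array, dtype=xp.float32))

optimizer.update(model, x, t)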

Saving and Loading Trained Parameters

The trained parameters can be written to files for persistence. This is natively supported by the framework through the serializers in the chainer.serializers module. It is also possible to load parameters back into existing models and their layers in the same manner. This is useful when you need to pause training or want to take snapshots during the training process.
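
A minimal sketch using the NPZ serializer, with hypothetical file names:

from chainer import serializers

# Save the trained parameters of the model, and optionally the optimizer
# state, to disk; the file names are arbitrary.
serializers.save_npz('autoencoder.model', model)
serializers.save_npz('autoencoder.state', optimizer)

# Load them back into freshly constructed instances later.
serializers.load_npz('autoencoder.model', model)
serializers.load_npz('autoencoder.state', optimizer)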

Summary

Defining and training neural networks with Chainer is intuitive and requires little code. Its design also makes it easy to maintain models and experiment with various hyperparameters. To demonstrate this, we implemented a neural network over the second and third parts of this series, trained it with randomly generated data, and introduced common patterns along the way, such as how to design the model and the loss function.

The network covered in this article, along with its data, is more or less a toy problem. Hopefully, though, you will try Chainer out on your own and experiment with what it's capable of.

About the Author

Hiroyuki Vincent Yamazaki is a graduate student at KTH, Royal Institute of Technology in Sweden, currently conducting research in convolutional neural networks at Keio University in Tokyo, partially using Chainer as a part of a double-degree programme.

