
With the increasing popularity of neural networks, or deep learning, companies ranging from small start-ups to major players such as Google have been releasing frameworks for deep learning related tasks. Caffe from the Berkeley Vision and Learning Center (BVLC), Torch and Theano have been around for quite a while. TensorFlow was open sourced by Google in 2015 and has since been expanding its community. Neon from Nervana Systems is a more recent addition with a good reputation for its performance.

This three-part series will introduce you to yet another framework for neural networks called Chainer, which, like most of the previously mentioned frameworks, is based on Python. It has an intuitive interface and a gentle learning curve, but it hasn’t been widely adopted yet outside of Japan, where it is being developed. This first part explains the characteristics of Chainer and its basic data structures. The second and third parts will help you get started with the actual training.

The basic theory of neural networks will not be covered, but if you are familiar with the forward pass, back propagation and gradient descent, and on top of that have some coding experience, you should be able to follow this article.

What is Chainer?

Chainer is an open-source, Python-based framework maintained by Preferred Infrastructure/Preferred Networks in Japan. The company behind the framework puts a heavy emphasis on closing the gap between the machine learning research carried out in academia and the more practical applications of machine learning. They focus on deep learning, IoT and edge-heavy computing, with applications in the automobile manufacturing and healthcare markets, for instance developing autonomous cars and factory robots with grasping capabilities.

Why Chainer?

Defining and training simple neural networks with Chainer can be done with just a few lines of code, and it scales to larger models with more complex architectures with little effort. It is a framework for basically anyone working, studying or doing research in neural networks. There are, however, other alternatives, as mentioned in the introduction. This section explains the characteristics of the framework and why you might want to try it out.

One major issue with deep learning related tasks is configuring the hyperparameters. Chainer makes this less of a pain. It comes with many layers, activation functions, loss functions and optimization algorithms in a plug-and-play fashion. With a single line of code or a single function call, those components can be added, swapped or removed without affecting the rest of the program, as the sketch below illustrates. The abstraction and class structure of the framework make it intuitive to learn and start experimenting with. We will dig deeper into that in the second part of this series.
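As a minimal sketch of this plug-and-play style (the full training setup, including attaching an optimizer to a model, is covered in the second part), swapping the optimization algorithm is a matter of changing a single line:

from chainer import optimizers

# Choose an optimization algorithm; each of these is a drop-in replacement
optimizer = optimizers.SGD(lr=0.01)
# optimizer = optimizers.MomentumSGD(lr=0.01)
# optimizer = optimizers.Adam()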

On top of that, it is well documented. It is actually so well documented that you may stop reading this article right here and jump straight to the official documentation. Much of the content in this post is extracted from the official documentation, but I will try to complement it with additional details and awareness of common pitfalls.

GPU Support

NumPy is a Python package commonly used in academia thanks to its rich interface for manipulating multidimensional arrays, similar to MATLAB. If you are working with neural networks, chances are you are well familiar with NumPy and its methods and operations. Chainer comes with CuPy (chainer.cuda.cupy), a GPU alternative to NumPy. CuPy is a CUDA-based GPU backend for array manipulation that implements a subset of the NumPy interface. Hence, it is possible to write almost generic code for both the CPU and the GPU: you can simply swap the NumPy package for CuPy in your source code, and vice versa, to switch from one to the other. CuPy unfortunately lacks some features, such as advanced indexing and numpy.where. Multi-GPU training, in terms of both model parallelism and data parallelism, is also supported, although it is not in the scope of this series. As we will discover, the most fundamental data structure in Chainer is a NumPy (or CuPy) array wrapper with added functionality.
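As a rough sketch of what this looks like in practice, you can pick the array module once and write the rest of the code against it. The GPU branch below assumes a CUDA-enabled installation where CuPy is available:

import numpy as np
from chainer import cuda

use_gpu = False  # set to True if a CUDA-enabled GPU is available
xp = cuda.cupy if use_gpu else np

# The same code runs on the CPU (NumPy) or the GPU (CuPy)
x = xp.ones((3, 3), dtype=xp.float32)
y = x * 2 + 1
print(y.sum())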

The Basics of Chainer

Let’s write some code to get familiar with the Chainer interface. First, make sure that you have a working Python environment. We will use pip to install Chainer, even though it can also be installed directly from source. The code included in this post has been verified with Python 3.5.1 and Chainer 1.8.2.

Installation

Install NumPy and Chainer using pip, which comes bundled with recent Python environments.

pip install numpy
pip install chainer

Chainer Variables and Variable Differentiation

You are now ready to start writing code. First, let’s take a look at the snippet below.

import numpy as np
from chainer import Variable

x_data = np.array([4], dtype=np.float32)

x = Variable(x_data)
assert x.data == x_data # True

y = x ** 2 + 5 * x + 3
assert isinstance(y, Variable) # True

# Compute the gradient for x and store it in x.grad
y.backward()

# y'(x) = 2 * x + 5; y'(4) = 13
assert x.grad == 13 # True

The most fundamental data structure is the Chainer variable class, chainer.Variable. It can be initialized by passing a NumPy array (which must have the datatype numpy.float32) or by applying functions to already instantiated Chainer variables. Think of them as wrappers around NumPy’s N-dimensional arrays. To access the underlying NumPy array, use the data property. So what makes them different? Each Chainer variable holds a reference to its creator, unless it is a leaf node such as x in the example above. This means that y can reference x through the functions that created it. That way, the framework maintains a computational graph, which is used for computing the gradients during back propagation. This is exactly what happens when calling the Variable.backward() method on y: the variable is differentiated and the gradient with respect to x is stored in the x variable itself. A pitfall here is that if x were an array with more than one element, y.grad would need to be initialized with an initial error. If, as in the example above, x contains only one element, the error is automatically set to 1.
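To make the pitfall concrete, here is a small sketch (not part of the original example) of the same computation with a three-element x; the initial error has to be set explicitly before calling backward():

import numpy as np
from chainer import Variable

x = Variable(np.array([1, 2, 3], dtype=np.float32))
y = x ** 2 + 5 * x + 3

# y is no longer a single-element variable, so set the initial error manually
y.grad = np.ones_like(y.data)
y.backward()

# y'(x) = 2 * x + 5
print(x.grad)  # [  7.   9.  11.]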

Principles of Gradient Descent

Using the differentiation mechanism above, you can implement gradient descent optimization the following way. Initialize a weight w, a one-dimensional array with a single element, to any value, say 4 as in the previous example, and iteratively minimize the loss function w ** 2. The loss function depends on the parameter w, just as y depended on x. The loss function is obviously at its global minimum when w == 0, and this is what we want to reach with gradient descent. We can get very close by repeating the optimization step using Variable.backward().

import numpy as np
from chainer import Variable

w = Variable(np.array([4], dtype=np.float32))

learning_rate = 0.1
max_iters = 100

for i in range(max_iters):
    loss = w ** 2
    loss.backward() # Compute w.grad

    # Optimize / Update the parameter using gradient descent
    w.data -= learning_rate * w.grad

    # Reset gradient for the next iteration, w.grad == 0
    w.zerograd()

    print('Iteration: {} Loss: {}'.format(i, loss.data))

In each iteration, the parameter is updated in the direction of the negative gradient to lower the loss; we are performing gradient descent. Note that the gradient is scaled by a learning rate before the parameter is updated, in order to stabilize the optimization. In fact, if the learning rate were removed from this example, the loss would be stuck at 16, since the derivative of the loss function is 2 * w, which would cause w to simply jump back and forth between 4 and -4. With the learning rate, we see that the loss decreases with each iteration. loss.data, as seen in the output below, is an array with one element since it has the same dimensions as w.data.

Iteration: 0 Loss: [ 16.]
Iteration: 1 Loss: [ 10.24000072]
Iteration: 2 Loss: [ 6.55359983]
Iteration: 3 Loss: [ 4.19430351]
Iteration: 4 Loss: [ 2.68435407]
Iteration: 5 Loss: [ 1.71798646]
Iteration: 6 Loss: [ 1.09951138]
Iteration: 7 Loss: [ 0.70368725]
Iteration: 8 Loss: [ 0.45035988]
Iteration: 9 Loss: [ 0.2882303]
...
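As a quick sanity check of the numbers above, the first update can be reproduced by hand with plain Python:

w0 = 4.0
grad = 2 * w0            # derivative of w ** 2 evaluated at w = 4
w1 = w0 - 0.1 * grad     # 4 - 0.8 = 3.2
print(w0 ** 2, w1 ** 2)  # 16.0 and roughly 10.24, matching the first two losses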

Summary

This was a very brief introduction to the framework and its fundamental chainer.Variable class, and to how variable differentiation using computational graphs forms the core concept of the framework. In the second and third parts of this series, we will implement a complete training algorithm using Chainer.

About the Author

Hiroyuki Vincent Yamazaki is a graduate student at KTH, Royal Institute of Technology in Sweden, currently conducting research in convolutional neural networks at Keio University in Tokyo, partially using Chainer as a part of a double-degree programme.

