Convolutional Neural networks (CNNs), are a group from the neural network family that has manifested in areas such as Image recognition, classification, etc. They are one of the popular neural network models present in nearly all of the image recognition tasks that provide state-of-the-art-results. However, these CNNs have drawbacks, which are to be discussed later in the article. In order to address the issue with CNNs, Geoffrey Hinton, popularly known as the Godfather of Deep Learning, recently proposed a research paper along with two other researchers, Sara Sabour and Nicholas Frosst. In this paper, they introduced CapsNet or Capsule Network–a neural network, based on multi-layer capsule system.
Let’s explore the issue with CNNs and how CapsNet came as an advancement to it.
What is the issue with CNNs?
Convolutional Neural Network or CNNs are known to seamlessly handle image classification tasks. They are experts in learning at a granular level; where the lower layers detect edges and shape of an object, and the higher layers detect the image as a whole. However, CNNs perform poorly when an image possesses a slightly different orientation (rotation or a tilt), as it compares every image with the ones it learns during training. For instance, if an image of a face is to be detected, it checks for facial features such as nose, two eyes, mouth, eyebrows, etc; irrespective of the placement. This means CNNs may identify an incorrect face in cases where the placement of an eye and the nose is not as conventionally expected, for example in case of the profile view. So, the orientation and the spatial relationships between the objects within an image is not considered by a CNN.
To make CNNs understand orientation and spatial relationships, they were trained profusely with images taken from all possible angles. Unfortunately, it resulted in excess amount of time required to train the model. Also, the performance of the CNNs did not improve largely.
Pooling methods were also introduced at each layer within the CNN model for two reasons; first to reduce the time invested in training, and second to bring out positional invariance within CNNs. It resulted in triggering false positives in an image, i.e., it detected the object within an image but did not check its orientation. Also it incorrectly declared it as a right image. Thus, positional invariance made the CNNs susceptible to minute changes in viewpoint.
Instead of invariance, what CNNs require is equivariance– a feature that makes CNNs adapt to change in rotation or proportion within an image. This equivariance feature is now possible via Capsule Network!
The Solution: Capsule Network
CapsNet or Capsule network is an encapsulation of nested neural network layers. Traditional neural network contains multiple layers whereas a capsule network contains multiple layers within a single capsule. CNNs go deeper in terms of height, whereas the capsule network deepens in terms of nesting or internal structure. Such a model is highly robust to geometric distortions and transformations, which are a result of non-ideal camera angles. Thus, it is able to exceptionally handle orientations, rotations and so on.
Layer based Squashing
In a typical Convolutional Neural Network, the squashing function is added to each layer of the CNN model. A squashing function compresses the input to one of the ends of a small interval, introducing nonlinearity to the neural network and enables the network to be effective. Whereas, in a Capsule network, the squashing function is applied to the vector output of each capsule. Given below is a squashing function proposed by Hinton in his research paper.
Instead of applying non-linearity to each neuron, the squashing function applies squashing to a group of neurons i.e the capsule. To be more precise, it applies nonlinearity to the vector output of each capsule. The squashing function also tries to squash the vector output to zero if it is a small vector. If the vector is too long, the function tries to limit the output vector to 1.
Dynamic routing algorithm in CapsNet replaces the scalar-output feature detectors of the CNN with the vector-output capsules. Also, the max pooling feature in CNNs, which led to positional invariance, is replaced with ‘routing by agreement’. The algorithm ensures that when they forward propagate the data, it goes to the next most relevant capsule in the layer above. Although dynamic routing adds an extra computational cost to the capsule network, it has been proved to be advantageous to the network by making it more scalable and adaptable.
Training the Capsule Network
The capsule network is trained using the MNIST. MNIST is a dataset which includes more than 60,000 handwritten digit images. It is used to test machine learning algorithms. The capsule model is trained for 50 epochs with a batch size of 128 parts, where each epoch is responsible for a complete run through the training dataset.
A TensorFlow implementation of the CapsNet based on Hinton’s research paper is available in GitHub repository. Similarly, CapsNet can also be implemented using other deep learning frameworks such as Keras, PyTorch, MXNet, etc.
CapsNet is a recent breakthrough in the field of Deep learning and have a promise to benefit organizations with accurate image recognition tasks. Also, implementations with CapsNet is slowly catching up and is expected to reach at par like CNNs. They have been trained on a very simplistic dataset i.e the MNIST. They will still require to prove themselves on various other datasets. However, as time advances and we see CapsNet being trained within different domains, it will be exciting to discern how it moulds itself as a faster and more efficient training technique for deep learning models.