In this article by Giancarlo Zaccone, the author of the book Deep Learning with TensorFlow, we learn about convolutional neural networks (CNNs). In a multi-layer network, the outputs of all neurons of the input layer are connected to each neuron of the hidden layer (fully connected layer).

In CNNs, instead, the connection scheme that defines the convolutional layer, which we are going to describe, is significantly different.

As you can guess, this is the main type of layer: the use of one or more of these layers in a convolutional neural network is indispensable.

In a convolutional layer, each neuron is connected to a certain region of the input area called the receptive field.

For example, using a 3×3 kernel filter, each neuron will have a bias and 9 = 3×3 weights connected to a single receptive field. Of course, to effectively recognize an image we need different kernel filters applied to the same receptive field, because each filter should recognize a different feature of the image.

The set of neurons that identify the same feature defines a single feature map.
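To make the idea of a receptive field concrete, here is a minimal sketch in plain Python. The function names and the two example filters are illustrative assumptions, not part of the book's code: a neuron computes a weighted sum over its 3×3 patch plus a bias, and two different filters applied to the same patch respond to different features.

```python
# Illustrative sketch only: plain Python lists and hypothetical names.

def neuron_output(patch, kernel, bias):
    """Weighted sum of a 3x3 receptive field plus a bias."""
    total = bias
    for patch_row, kernel_row in zip(patch, kernel):
        for pixel, weight in zip(patch_row, kernel_row):
            total += pixel * weight
    return total


def filter_parameters(kernel_size, in_channels):
    """Number of weights plus one bias for a single kernel filter."""
    return kernel_size * kernel_size * in_channels + 1


# A vertical stroke inside the receptive field.
patch = [[0, 1, 0],
         [0, 1, 0],
         [0, 1, 0]]

# Two hand-made filters (assumed values, chosen for illustration).
vertical = [[-1, 2, -1],
            [-1, 2, -1],
            [-1, 2, -1]]
horizontal = [[-1, -1, -1],
              [2, 2, 2],
              [-1, -1, -1]]

vertical_response = neuron_output(patch, vertical, 0.0)
horizontal_response = neuron_output(patch, horizontal, 0.0)
params_per_filter = filter_parameters(3, 1)
print(vertical_response, horizontal_response, params_per_filter)
```

The vertical filter responds strongly to the stroke while the horizontal one does not, which is why several filters are needed for the same receptive field.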

The following figure shows a CNN architecture in action: the input image of 28×28 size is analyzed by a convolutional layer composed of 32 feature maps of 28×28 size. The figure also shows a receptive field and the kernel filter of 3×3 size.

cnn-architecture-img-0

Figure: CNN in action

A CNN may consist of several convolutional layers connected in cascade. The output of each convolutional layer is a set of feature maps (each generated by a single kernel filter); together, these matrices define a new input that will be used by the next layer.

CNNs also use pooling layers positioned immediately after the convolutional layers. A pooling layer divides a convolutional region into subregions and selects a single representative value for each (max-pooling or average pooling) to reduce the computational time of subsequent layers and increase the robustness of the feature with respect to its spatial position.
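As a small illustration of pooling, the following sketch applies both max-pooling and average pooling to a 4×4 feature map using non-overlapping 2×2 windows. It is written in plain Python with hypothetical helper names, and it assumes the feature map has even height and width.

```python
def pool2x2(image, reduce_fn):
    """Apply reduce_fn to each non-overlapping 2x2 window of a 2D list."""
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / float(len(values))


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]]

max_pooled = pool2x2(feature_map, max)
avg_pooled = pool2x2(feature_map, average)
print(max_pooled)
print(avg_pooled)
```

In both cases the 4×4 map shrinks to 2×2, so the next layer has four times fewer values to process.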

The last hidden layer of a convolutional network is generally a fully connected layer, followed by an output layer with a softmax activation function.

A model for CNNs - LeNet

Convolutional and max-pooling layers are at the heart of the LeNet family of models, a family of multi-layered feedforward networks specialized in visual pattern recognition.

While the exact details of the model vary greatly, the following figure shows the graphical schema of a LeNet network:

cnn-architecture-img-1

Figure: LeNet network

In a LeNet model, the lower layers are composed of alternating convolution and max-pooling layers, while the last layers are fully connected and correspond to a traditional feedforward network (fully connected + softmax layer).

The input to the first fully-connected layer is the set of all feature maps at the layer below.

From a TensorFlow implementation point of view, this means that the lower layers operate on 4D tensors. These are then flattened to a 2D matrix to be compatible with a feedforward implementation.
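The flattening step can be sketched without TensorFlow. The example below uses nested Python lists to stand in for a 4D tensor laid out as [batch, height, width, channels]; the function name and the toy data are assumptions made for illustration only.

```python
def flatten_batch(batch):
    """Turn [batch, height, width, channels] nested lists into
    a 2D list of shape [batch, height * width * channels]."""
    return [[value
             for row in image
             for pixel in row
             for value in pixel]
            for image in batch]


# Two tiny "images" of 2x2 pixels with 3 channels each (toy values).
batch = [[[[i * 100 + r * 10 + c * 3 + ch for ch in range(3)]
           for c in range(2)]
          for r in range(2)]
         for i in range(2)]

flat = flatten_batch(batch)
print(len(flat), len(flat[0]))
```

The batch dimension is preserved, while height, width, and channels are merged into a single axis, which is what a fully connected layer expects.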

Build your first CNN

In this section, we will learn how to build a CNN to classify images of the MNIST dataset. A simple softmax model provides about 92% classification accuracy for recognizing handwritten digits in MNIST.

Here we'll implement a CNN which has a classification accuracy of about 99%.

The next figure shows how the data flows through the first two convolutional layers. The input image is processed in the first convolutional layer using the filter-weights. This results in 32 new images, one for each filter in the convolutional layer. The images are also downsampled with the pooling operation, so the image resolution is decreased from 28×28 to 14×14.

These 32 smaller images are then processed in the second convolutional layer. We need filter-weights again for each of these 32 features, and we need filter-weights for each output channel of this layer. The images are again downsampled with a pooling operation so that the image resolution is decreased from 14×14 to 7×7. The total number of features for this convolutional layer is 64.

cnn-architecture-img-2

Figure: Data flow of the first two convolutional layers

The 64 resulting images are filtered again by a third (3×3) convolutional layer. The output of this third convolutional layer is 128 images of 7×7 pixels each. In the implementation that follows, these are max-pooled once more, from 7×7 to 4×4, and then flattened to a single vector of length 4×4×128, which is used as the input to a fully-connected layer with 625 neurons (or elements).
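The resolutions quoted above can be checked with a few lines of arithmetic. The sketch below assumes, as in the code later in this article, that three pooling steps are applied, each with a 2×2 window, stride 2, and 'SAME' padding, so that each output size is the input size divided by 2 and rounded up. The helper name same_pool_size is only illustrative.

```python
import math


def same_pool_size(size, stride=2):
    """Output size of a pooling step with 'SAME' padding: ceil(size / stride)."""
    return int(math.ceil(size / float(stride)))


size = 28
sizes = [size]
for _ in range(3):  # three 2x2 poolings with stride 2
    size = same_pool_size(size)
    sizes.append(size)

# 128 feature maps come out of the third convolutional layer.
flattened = sizes[-1] * sizes[-1] * 128
print(sizes)
print(flattened)
```

This is where the 4×4×128 length of the flattened vector comes from: the 7×7 images are rounded up to 4×4 by the last pooling step.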

This feeds into another fully-connected layer with 10 neurons, one for each of the classes, which is used to determine the class of the image, that is, which digit is depicted in the image. The following figure shows this last part of the network:

cnn-architecture-img-3

Figure: Data flow of the last three layers

The convolutional filters are initially chosen at random. The error between the predicted and actual class of the input image is measured by the so-called cost function. The optimizer then automatically propagates this error back through the convolutional network and updates the filter-weights to reduce the classification error.

This is done iteratively thousands of times until the classification error is sufficiently low.

Now let's see in detail how to code our first CNN.

Let's start by importing the TensorFlow libraries for our implementation:

import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

Set the following parameters, which indicate the number of samples to consider in the training phase (128) and in the test phase (256), respectively:

batch_size = 128
test_size = 256

We define the following parameter; its value is 28 because an MNIST image is 28 pixels in height and width:

img_size = 28

And the number of classes; the value 10 means that we'll have one class for each of 10 digits:

num_classes = 10

A placeholder variable, X, is defined for the input images. The data type for this tensor is set to float32 and the shape is set to [None, img_size, img_size, 1], where None means that the tensor may hold an arbitrary number of images:

X = tf.placeholder("float", [None, img_size, img_size, 1])

Then we set another placeholder variable, Y, for the true labels associated with the images that are fed into the placeholder variable X.

The shape of this placeholder variable is [None, num_classes], which means it may hold an arbitrary number of labels and each label is a vector of length num_classes, which is 10 in this case:

Y = tf.placeholder("float", [None, num_classes])
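To make the label format concrete, the following sketch shows how a digit can be written as a vector of length num_classes with a single 1.0 at the position of the digit (a one-hot vector). The helper name one_hot is an assumption used only for this illustration.

```python
def one_hot(label, num_classes=10):
    """Return a vector of zeros with a single 1.0 at index `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


label_for_three = one_hot(3)
print(label_for_three)
```

A batch of such vectors has the shape [None, num_classes] described above.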

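We load the MNIST data, which will be downloaded into the data folder:

# one_hot=True returns each label as a vector of length num_classes,
# which is the format expected by the placeholder Y
mnist = input_data.read_data_sets("data/", one_hot=True)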

We build the datasets for training (trX, trY) and for testing the network (teX, teY):

trX, trY, teX, teY = (mnist.train.images,
                      mnist.train.labels,
                      mnist.test.images,
                      mnist.test.labels)

The trX and teX image sets must be reshaped according to the input shape:

trX = trX.reshape(-1, img_size, img_size, 1) 
teX = teX.reshape(-1, img_size, img_size, 1)

We shall now proceed to define the network's weights.

The init_weights function builds new variables of the given shape and initializes the network's weights with random values:

def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

Each neuron of the first convolutional layer is connected to a small subset of the input tensor, of dimension 3×3×1, while the value 32 is the number of feature maps we are considering for this first layer. The weight w is then defined:

w = init_weights([3, 3, 1, 32])

The number of input channels then increases to 32, which means that each neuron of the second convolutional layer is connected to 3×3×32 neurons of the first convolutional layer. The w2 weight is:

w2 = init_weights([3, 3, 32, 64])

The value 64 represents the number of output features obtained.

The third convolutional layer is connected to 3×3×64 neurons of the previous layer, while 128 is the number of resulting features:

w3 = init_weights([3, 3, 64, 128])

The fourth layer is fully connected; it receives 128×4×4 inputs, while the number of outputs is 625:

w4 = init_weights([128 * 4 * 4, 625])

The output layer receives 625 inputs, while the number of outputs is the number of classes:

w_o = init_weights([625, num_classes])
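As a quick sanity check on the five weight shapes defined above, the following sketch counts how many trainable values each tensor holds. It only multiplies the shape entries together; the dictionary and helper names are illustrative, and no TensorFlow is needed.

```python
from functools import reduce
from operator import mul

# Shapes copied from the init_weights calls above.
weight_shapes = {
    "w": [3, 3, 1, 32],
    "w2": [3, 3, 32, 64],
    "w3": [3, 3, 64, 128],
    "w4": [128 * 4 * 4, 625],
    "w_o": [625, 10],
}


def num_elements(shape):
    """Product of all dimensions of a shape."""
    return reduce(mul, shape, 1)


counts = {name: num_elements(shape) for name, shape in weight_shapes.items()}
total = sum(counts.values())
for name in ["w", "w2", "w3", "w4", "w_o"]:
    print(name, counts[name])
print("total", total)
```

Most of the parameters sit in w4, the first fully connected layer, rather than in the convolutional filters.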

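Note that these initializations are not actually done at this point; they are merely being defined in the TensorFlow graph.

We also define two placeholders for the dropout keep probabilities of the convolutional and fully connected layers: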

p_keep_conv = tf.placeholder("float")
p_keep_hidden = tf.placeholder("float")

It's time to define the network model; as we did for the network's weights, we define it as a function.

It receives as input the X tensor, the weight tensors, and the dropout parameters for the convolutional and fully connected layers:

def model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden):

The tf.nn.conv2d() function executes the TensorFlow operation for convolution. Note that the strides are set to 1 in all dimensions.

Indeed, the first and last stride must always be 1, because the first is for the image number and the last is for the input channel. The padding parameter is set to 'SAME', which means the input image is padded with zeroes so that the output has the same size as the input:

    conv1 = tf.nn.conv2d(X, w, strides=[1, 1, 1, 1],
                         padding='SAME')
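To show what stride 1 with 'SAME' padding means in practice, here is a minimal single-channel sketch in plain Python. It is not how TensorFlow implements the operation; it only assumes a square kernel of odd size and treats every position outside the image as zero, so the output keeps the size of the input.

```python
def conv2d_same(image, kernel):
    """Stride-1 sliding-window filter with zero padding ('SAME')."""
    height, width = len(image), len(image[0])
    k = len(kernel)          # assumes a square kernel with odd size
    pad = k // 2
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    # Positions outside the image count as zero padding.
                    if 0 <= rr < height and 0 <= cc < width:
                        total += image[rr][cc] * kernel[i][j]
            output[r][c] = total
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]

result = conv2d_same(image, box)
print(result)
```

With a kernel of all ones, each output value is the sum of the pixel and its in-bounds neighbours, and the 3×3 input yields a 3×3 output.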

Then we pass the conv1 layer to a ReLU layer. It calculates the max(x, 0) function for each input pixel x, adding some non-linearity to the formula and allowing us to learn more complicated functions:

    conv1 = tf.nn.relu(conv1)

The resulting layer is then pooled by the tf.nn.max_pool operator:

    conv1 = tf.nn.max_pool(conv1, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1],
                           padding='SAME')

It is a 2×2 max-pooling, which means that we consider 2×2 windows and select the largest value in each window. Then we move 2 pixels to the next window.

We try to reduce overfitting via the tf.nn.dropout() function, passing the conv1 layer and the p_keep_conv probability value:

    conv1 = tf.nn.dropout(conv1, p_keep_conv)
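The effect of dropout can be sketched in a few lines of plain Python. The version below follows the behaviour of tf.nn.dropout in that each value is kept with probability keep_prob and the kept values are scaled by 1/keep_prob, so the expected sum stays the same; the function name, the fixed seed, and the toy input are assumptions for illustration.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, scaled by 1/keep_prob;
    set the others to zero."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]


rng = random.Random(0)  # fixed seed, only to make the sketch repeatable
activations = [1.0] * 10000

dropped = dropout(activations, 0.8, rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / float(len(dropped))
mean_after = sum(dropped) / len(dropped)
print(kept_fraction, mean_after)
```

With a keep probability of 1.0 nothing is dropped, which is why the test phase later feeds 1.0 for both dropout placeholders.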

As you can see, the next two convolutional layers, conv2 and conv3, are defined in the same way as conv1:

    conv2 = tf.nn.conv2d(conv1, w2,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    conv2 = tf.nn.relu(conv2)
    conv2 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1],
                           strides=[1, 2, 2, 1],
                           padding='SAME')
    conv2 = tf.nn.dropout(conv2, p_keep_conv)

    conv3 = tf.nn.conv2d(conv2, w3,
                         strides=[1, 1, 1, 1],
                         padding='SAME')
    # Reassign conv3 so that the pooling step below sees the ReLU output
    conv3 = tf.nn.relu(conv3)

Two fully-connected layers are added to the network. The input of the first, FC_layer, is the output of the last convolutional layer, which is max-pooled and then reshaped into a 2D matrix:

    FC_layer = tf.nn.max_pool(conv3, ksize=[1, 2, 2, 1],
                              strides=[1, 2, 2, 1],
                              padding='SAME')
    FC_layer = tf.reshape(FC_layer, [-1, w4.get_shape().as_list()[0]])

A dropout function is again used to reduce overfitting:

    FC_layer = tf.nn.dropout(FC_layer, p_keep_conv)

The next fully connected layer, output_layer, receives FC_layer as input together with the w4 weight tensor. A ReLU and a dropout operator are then applied in turn:

    output_layer = tf.nn.relu(tf.matmul(FC_layer, w4))
    output_layer = tf.nn.dropout(output_layer, p_keep_hidden)

The result is a vector of length 10 that determines which of the 10 classes the input image belongs to:

    result = tf.matmul(output_layer, w_o)
    return result

The cross-entropy is the performance measure we use in this classifier. The cross-entropy is a continuous function that is never negative and is equal to zero if the predicted output exactly matches the desired output. The goal of the optimization is therefore to minimize the cross-entropy, so that it gets as close to zero as possible, by changing the variables of the network layers.

TensorFlow has a built-in function for calculating the cross-entropy. Note that the function calculates the softmax internally, so we must pass it the output of the model, py_x, directly:

py_x = model(X, w, w2, w3, w4, w_o, p_keep_conv, p_keep_hidden)
# Pass logits and labels by name; TensorFlow 1.x requires named arguments here
Y_ = tf.nn.softmax_cross_entropy_with_logits(logits=py_x, labels=Y)
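For readers who want to see what is computed for a single image, here is a minimal plain-Python sketch of softmax followed by cross-entropy. It is a textbook formulation with illustrative function names, not TensorFlow's internal implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max improves numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    """Negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_label, probs) if t > 0)


label = [0.0, 0.0, 1.0]                          # the true class is index 2
good = cross_entropy([0.0, 0.0, 5.0], label)     # confident and correct
bad = cross_entropy([5.0, 0.0, 0.0], label)      # confident and wrong
uniform = cross_entropy([0.0, 0.0, 0.0], label)  # no preference
print(good, bad, uniform)
```

The loss is small when the network puts most of the probability on the true class and grows as the probability of the true class shrinks.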

Now that we have defined the cross-entropy for each classified image, we have a measure of how well the model performs on each image individually. But to use the cross-entropy to guide the optimization of the network's variables we need a single scalar value, so we simply take the average of the cross-entropy for all the classified images:

cost = tf.reduce_mean(Y_)

To minimize the evaluated cost, we must define an optimizer. In this case, we adopt RMSPropOptimizer, which is an advanced form of gradient descent.

RMSPropOptimizer implements the RMSProp algorithm, an unpublished, adaptive learning rate method proposed by Geoffrey Hinton in Lecture 6e of his Coursera class.

You can find Geoffrey Hinton's course at https://www.coursera.org/learn/neural-networks

RMSPropOptimizer divides the learning rate by an exponentially decaying average of squared gradients. Hinton suggests setting the decay parameter to 0.9, while a good default value for the learning rate is 0.001.

optimizer = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost)
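The update rule can be sketched for a single weight in plain Python. This is a generic textbook form of RMSProp under stated assumptions (no momentum, a small epsilon added outside the square root), so it is not a line-by-line copy of what tf.train.RMSPropOptimizer does internally; the function name and the toy objective f(w) = w² are illustrative.

```python
import math


def rmsprop_step(weight, grad, avg_sq,
                 learning_rate=0.001, decay=0.9, epsilon=1e-10):
    """One RMSProp update for a single weight.

    avg_sq is the exponentially decaying average of squared gradients.
    """
    avg_sq = decay * avg_sq + (1.0 - decay) * grad * grad
    weight = weight - learning_rate * grad / (math.sqrt(avg_sq) + epsilon)
    return weight, avg_sq


# Minimize the toy function f(w) = w^2, whose gradient is 2w.
w_toy, avg = 1.0, 0.0
first_w, first_avg = rmsprop_step(w_toy, 2.0 * w_toy, avg)
for _ in range(2000):
    w_toy, avg = rmsprop_step(w_toy, 2.0 * w_toy, avg)
print(first_w, w_toy)
```

Because the gradient is divided by the root of its own running average, the size of a step depends mostly on the learning rate rather than on the raw scale of the gradient.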

Basically, the common Stochastic Gradient Descent (SGD) algorithm has a problem in that learning rates must scale with 1/T to get convergence, where T is the iteration number. RMSProp tries to get around this by automatically adjusting the step size so that the step is on the same scale as the gradients: as the average gradient gets smaller, the coefficient in the SGD update gets bigger to compensate.

An interesting reference about this algorithm can be found here:
http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf

Finally, we define predict_op, which is the index of the largest value along the class dimension of the model's output:

predict_op = tf.argmax(py_x, 1)

Note that optimization is not performed at this point. Nothing is calculated at all; we have just added the optimizer object to the TensorFlow graph for later execution.

We now come to define the network's running session. There are 55,000 images in the training set, so it takes a long time to calculate the gradient of the model using all these images. Therefore, we'll use a small batch of images in each iteration of the optimizer. If your computer crashes or becomes very slow because it runs out of RAM, you may try lowering the batch size, but you may then need to perform more optimization iterations.

Now we can proceed to implement a TensorFlow session:

with tf.Session() as sess:
    tf.initialize_all_variables().run()
    for i in range(100):

We compute the boundaries of the training batches; training_batch holds the (start, end) index pairs that slice the training images and the corresponding labels into batches of batch_size:

        training_batch = zip(range(0, len(trX), batch_size),
                             range(batch_size, len(trX) + 1, batch_size))
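To see what the zip of the two ranges produces, the following sketch builds the same index pairs in plain Python for the 55,000 training images and a batch size of 128. The helper name batch_bounds is an assumption used only here.

```python
def batch_bounds(num_examples, batch_size):
    """Pairs of (start, end) indices for consecutive full batches."""
    return list(zip(range(0, num_examples, batch_size),
                    range(batch_size, num_examples + 1, batch_size)))


bounds = batch_bounds(55000, 128)
leftover = 55000 - bounds[-1][1]
print(len(bounds), bounds[0], bounds[-1], leftover)
```

Only full batches are produced: the second range stops before the first one does, so the images left over at the end of the training set are not used in that pass.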

For each batch, we put the data into feed_dict with the proper names for the placeholder variables in the graph. We then run the optimizer using this batch of training data: TensorFlow assigns the values in feed_dict to the placeholder variables and then runs the optimizer:

        for start, end in training_batch:
            sess.run(optimizer, feed_dict={X: trX[start:end],
                                          Y: trY[start:end],
                                          p_keep_conv: 0.8,
                                          p_keep_hidden: 0.5})

After each pass over the training data, we draw a shuffled batch of test samples:

        test_indices = np.arange(len(teX))
        np.random.shuffle(test_indices)
        test_indices = test_indices[0:test_size]

For each iteration, we display the accuracy evaluated on this test batch:

        print(i, np.mean(np.argmax(teY[test_indices], axis=1) ==
                         sess.run(predict_op,
                                  feed_dict={X: teX[test_indices],
                                             Y: teY[test_indices],
                                             p_keep_conv: 1.0,
                                             p_keep_hidden: 1.0})))
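The accuracy computation above compares the predicted class index with the index of the 1 in each one-hot label and averages the matches. A minimal plain-Python sketch of the same idea, with illustrative function names and toy scores, looks like this:

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(scores_batch, one_hot_labels):
    """Fraction of rows whose predicted class matches the true class."""
    correct = sum(1 for scores, label in zip(scores_batch, one_hot_labels)
                  if argmax(scores) == argmax(label))
    return correct / float(len(one_hot_labels))


scores = [[0.1, 2.0, 0.3],   # predicts class 1
          [1.5, 0.2, 0.1],   # predicts class 0
          [0.0, 0.1, 0.9],   # predicts class 2
          [0.7, 0.2, 0.1]]   # predicts class 0
labels = [[0, 1, 0],
          [1, 0, 0],
          [0, 0, 1],
          [0, 1, 0]]         # the last prediction is wrong

toy_accuracy = accuracy(scores, labels)
one_error_in_256 = 255 / 256.0
print(toy_accuracy, one_error_in_256)
```

Since test_size is 256, every printed accuracy is a multiple of 1/256; a value such as 0.99609375 corresponds to a single misclassified image in the test batch.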

Training a network can take several hours, depending on the computational resources used. The results on my machine are as follows:

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Successfully extracted to train-images-idx3-ubyte.mnist 9912422 bytes.
Loading ata/train-images-idx3-ubyte.mnist
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Successfully extracted to train-labels-idx1-ubyte.mnist 28881 bytes.
Loading ata/train-labels-idx1-ubyte.mnist
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Successfully extracted to t10k-images-idx3-ubyte.mnist 1648877 bytes.
Loading ata/t10k-images-idx3-ubyte.mnist
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Successfully extracted to t10k-labels-idx1-ubyte.mnist 4542 bytes.
Loading ata/t10k-labels-idx1-ubyte.mnist
(0, 0.95703125)
(1, 0.98046875)
(2, 0.9921875)
(3, 0.99609375)
(4, 0.99609375)
(5, 0.98828125)
(6, 0.99609375)
(7, 0.99609375)
(8, 0.98828125)
(9, 0.98046875)
(10, 0.99609375)

...

(90, 1.0)
(91, 0.9921875)
(92, 0.9921875)
(93, 0.99609375)
(94, 1.0)
(95, 0.98828125)
(96, 0.98828125)
(97, 0.99609375)
(98, 1.0)
(99, 0.99609375)

After 100 passes over the training data, the model has an accuracy of about 99%... not bad!

Summary

In this article, we introduced Convolutional Neural Networks (CNNs).

We have seen how the architecture of these networks makes CNNs particularly suitable for image classification problems, making the training phase faster and the test phase more accurate.

We have therefore implemented an image classifier and tested it on the MNIST data set, where we achieved 99% accuracy.

Finally, we built a CNN to classify emotions starting from a dataset of images; we tested the network on a single image and we evaluated the limits and the goodness of our model.