This article is an excerpt from a book written by Ankit Dixit titled Ensemble Machine Learning. This book provides a practical approach to building efficient machine learning models using ensemble techniques with real-world use cases.
Today we will look at how we can create, train, and test a neural network to perform digit classification using Keras and TensorFlow.
This article uses the MNIST dataset of handwritten digit images. It contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST’s training dataset, while the other half of the training set and the other half of the test set were taken from NIST’s testing dataset. A number of scientific papers have attempted to achieve the lowest error rate on it. One paper, using a hierarchical system of convolutional neural networks (CNNs), reports an error rate of 0.23 percent on the MNIST database. The original creators of the database keep a list of some of the methods tested on it. In their original paper, they used a support vector machine to get an error rate of 0.8 percent.
Images in the dataset look like this:
So let’s not waste any time and start implementing our very first neural network in Python. Let’s begin by importing the supporting libraries.
# Imports for array-handling and plotting
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
Keras already ships MNIST as a sample dataset, so we can import it as it is. The first time you load it, Keras downloads the data over the internet and caches it on your local disk. So, if your system does not already have the dataset, an internet connection will be required to download it:
# Keras imports for the dataset and building our neural network
from keras.datasets import mnist
Now, we will import the Sequential class and the load_model function from the keras.models module. We are working with a sequential network because all layers are arranged in a single forward sequence, with no branches or splits between them. The Sequential class creates such a model by stacking the layers one after another. The load_model function will help us load the trained model for testing and evaluation purposes:
#Import Sequential and load_model for creating and loading the model
from keras.models import Sequential, load_model
In the next lines, we will import three types of layers from the keras library. A Dense layer is a fully connected layer; that is, each neuron of the current layer is connected to every neuron of the previous layer as well as the next layer.
The Dropout layer reduces overfitting in our model. It randomly selects some neurons and does not use them for training in that iteration. This makes it less likely that two different neurons of the same layer learn the same features from the input. By doing this, it reduces redundancy and correlation between neurons in the network, which eventually helps prevent overfitting.
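To make the mechanism concrete, here is a minimal NumPy sketch of dropout applied to a batch of activations. It is an illustration, not Keras's actual implementation: the function name `dropout_forward` and its arguments are hypothetical, and it assumes the common "inverted dropout" convention, in which the surviving activations are rescaled during training so that nothing has to change at test time.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of the activations (hypothetical helper)."""
    if not training or rate == 0.0:
        # At test time the layer passes its input through unchanged.
        return activations
    keep_prob = 1.0 - rate
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
acts = np.ones((4, 1000))
dropped = dropout_forward(acts, rate=0.2, rng=rng)

print("fraction of units dropped:", np.mean(dropped == 0.0))
print("mean activation after dropout:", dropped.mean())
```

With a rate of 0.2, roughly one unit in five is zeroed in each pass, while the mean activation stays close to its original value.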
The activation layer applies the activation function to the output of the neuron. We will use rectified linear units (ReLU) and the softmax function as the activation layer. We will discuss their operation when we use them in network creation:
#We will use Dense, Drop out and Activation layers
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
We will start by loading our dataset with mnist.load_data(). It gives us the training and testing inputs along with their output labels.
Then, we will visualize some instances so that we know what kind of data we are dealing with. We will use matplotlib to plot them.
As the images have gray values, we can easily plot a histogram of the images, which can give us the pixel intensity distribution:
#Let's Start by loading our dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
#Plot the digits to verify
plt.figure()
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.tight_layout()
    plt.imshow(X_train[i], cmap='gray', interpolation='none')
    plt.title("Digit: {}".format(y_train[i]))
    plt.xticks([])
    plt.yticks([])
plt.show()
Executing the preceding code block produces the following output:
#Lets analyze histogram of the image
plt.figure()
plt.subplot(2,1,1)
plt.imshow(X_train[0], cmap='gray', interpolation='none')
plt.title("Digit: {}".format(y_train[0]))
plt.xticks([])
plt.yticks([])
plt.subplot(2,1,2)
plt.hist(X_train[0].reshape(784))
plt.title("Pixel Value Distribution")
plt.show()
The histogram of an image will look like this:
# Print the shape before we reshape and normalize
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)
Currently, this is the shape of the dataset we have:
X_train shape (60000, 28, 28)
y_train shape (60000,)
X_test shape (10000, 28, 28)
y_test shape (10000,)
As we are working with 2D images, we cannot feed them directly to our fully connected neural network, which expects a 1D vector per sample. There are other types of neural networks designed for training on 2D images; we will discuss those in the future.
To remove this data compatibility issue, we will reshape the input images into 1D vectors of 784 values (as each image is 28 × 28 pixels). We have 60000 such images in the training data and 10000 in the testing data:
# As we have data in image form convert it to row vectors
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
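As a small aside, the same flattening can be written without hard-coding the number of samples. The sketch below uses three synthetic 28 × 28 arrays as stand-ins for MNIST images (an assumption made so that the example runs without downloading anything) and lets NumPy infer the row length with -1.

```python
import numpy as np

# Three synthetic 28x28 "images" standing in for MNIST samples (assumption).
images = np.zeros((3, 28, 28), dtype=np.uint8)
images[1, 2, 5] = 200  # one bright pixel in the second image

# -1 lets NumPy infer the row length (28 * 28 = 784) from the data.
flat = images.reshape(images.shape[0], -1).astype('float32')

print(flat.shape)
# Row-major flattening puts pixel (row 2, col 5) at index 2 * 28 + 5 = 61.
print(flat[1, 61])
```

Each image becomes one row, and the pixels of a row are laid out image row by image row, which is why the pixel at (2, 5) ends up at position 61.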
Next, we normalize the input data into the range of 0 to 1, which leads to faster convergence of the network. The purpose of normalizing data is to transform our dataset into a bounded range while preserving the relative differences between the pixel values. There are various normalizing techniques available, such as mean normalization, min-max normalization, and so on:
# Normalizing the data to between 0 and 1 to help with the training
X_train /= 255
X_test /= 255
# Print the final input shape ready for training
print("Train matrix shape", X_train.shape)
print("Test matrix shape", X_test.shape)
The printed shapes confirm the new layout of the data:
Train matrix shape (60000, 784)
Test matrix shape (10000, 784)
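The two normalization techniques named above can be sketched in a few lines of NumPy. The helper names `min_max_normalize` and `mean_normalize` and the pixel values are made up for illustration, and both helpers assume the input contains at least two distinct values so that the denominator is not zero.

```python
import numpy as np

# Made-up pixel intensities in the usual 0-255 range.
pixels = np.array([0, 51, 102, 255], dtype='float32')

def min_max_normalize(x):
    # Maps the smallest value to 0 and the largest to 1.
    return (x - x.min()) / (x.max() - x.min())

def mean_normalize(x):
    # Centers the values around 0 and scales by the value range.
    return (x - x.mean()) / (x.max() - x.min())

scaled = min_max_normalize(pixels)
centered = mean_normalize(pixels)

print("min-max:", scaled)
print("mean-normalized:", centered)
```

When the data already spans the full 0 to 255 range, min-max normalization gives the same result as the division by 255 used in this article.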
Now, our training set contains output variables as discrete class values; say, for an image of the number eight, the output class value is eight. But our output neurons will be able to give an output only in the range of zero to one. So, we need to convert the discrete output values to categorical values, so that eight can be represented as a vector of zeros and a single one, with length equal to the number of classes. For example, for the number eight, the output class vector should be:
8 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
# One-hot encoding using keras' numpy-related utilities
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)
After one-hot encoding of our output, the variable’s shape will be modified as:
Shape before one-hot encoding: (60000,)
Shape after one-hot encoding: (60000, 10)
So, you can see that now we have an output variable of 10 dimensions instead of 1.
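To see what this conversion does under the hood, here is a minimal NumPy sketch of one-hot encoding. The helper name `one_hot` is hypothetical; it is meant to mirror the idea behind the Keras utility used above, not to reproduce its implementation.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows (hypothetical helper)."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(n_classes, dtype='float32')[labels]

encoded = one_hot([8, 0, 3], n_classes=10)
print(encoded[0])  # the vector for the digit 8

# argmax recovers the original labels from the one-hot rows.
decoded = encoded.argmax(axis=1)
print(decoded)
```

The position of the single 1 in each row is the class index, so the digit 8 has its 1 at index 8 of a 10-element vector.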
Now, we are ready to define our network parameters and layer architecture. We will start creating our network by creating a Sequential class object, model. We can add different layers to this model as we have done in the following code block.
We will create a network of an input layer, two hidden layers, and one output layer. As the input layer is always our data layer, it doesn’t have any learning parameters. For hidden layers, we will use 512 neurons in each. At the end, for a 10-dimensional output, we will use 10 neurons in the final layer:
# Here, we will create model of our ANN
# Create a linear stack of layers with the sequential model
model = Sequential()
#First hidden layer with 512 neurons; input_shape declares the 784-value input
model.add(Dense(512, input_shape=(784,)))
#We will use relu as Activation
model.add(Activation('relu'))
#Put Drop out to prevent over-fitting
model.add(Dropout(0.2))
#Add Hidden layer with 512 neurons with relu activation
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.2))
#This is our Output layer with 10 neurons
model.add(Dense(10))
model.add(Activation('softmax'))
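As a quick sanity check on the architecture, the number of trainable parameters follows directly from the layer sizes: each Dense layer has one weight per input-neuron pair plus one bias per neuron. The helper `dense_params` below is a hypothetical name used only for this calculation; Activation and Dropout layers add no parameters.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, neuron) pair plus one bias per neuron.
    return n_inputs * n_units + n_units

# Input size followed by the size of each Dense layer in the model above.
layer_sizes = [784, 512, 512, 10]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print("parameters per Dense layer:", per_layer)
print("total trainable parameters:", total)
```

In Keras, calling model.summary() on the model reports the same kind of per-layer breakdown.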
After defining the preceding structure, our neural network will look something like this:
The Shape field in each layer shows the shape of the matrix associated with that layer. Because the first hidden layer multiplies an input of 784 values by 512 neurons, the shape at Hidden-1 is 784 × 512. The shapes of the other two layers are calculated in the same way. We have used two different activation functions here: ReLU in the hidden layers, and softmax, which turns the 10 outputs into class probabilities, in the output layer. ReLU prevents the output of a neuron from becoming negative. The expression for the ReLU function is:
f(x) = max(0, x)
So if any neuron produces an output less than 0, ReLU converts it to 0. We can write it in conditional form as:
f(x) = 0 if x < 0, and f(x) = x otherwise
For now, you just need to know that ReLU is generally a better hidden-layer activation than sigmoid in networks with many layers. If we plot a sigmoid function, it will look like:
If you look closer, the sigmoid function starts getting saturated well before reaching its minimum (0) or maximum (1) values. So at the time of gradient calculation, values in the saturated region result in a very small gradient. That causes a very small change in the weight values, which is not sufficient to optimize the cost function. As we go further back through the layers during backpropagation, these small gradients are multiplied together, so the change becomes even smaller and almost reaches zero. This is known as the vanishing gradient problem. So, in practical cases, we avoid sigmoid activation when our network has many stacked layers. The ReLU activation, in contrast, is a straight line for positive inputs:
So, the gradient of the preceding function is 1 for every positive input and 0 otherwise. Because the gradient does not shrink as the input grows, ReLU largely avoids the problem of vanishing gradients.
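The difference between the two gradients is easy to check numerically. The following NumPy sketch evaluates both derivatives at a few made-up input values and then multiplies the best-case gradient across 10 layers; the depth of 10 is an arbitrary choice for illustration, and the product ignores the weights, which also enter the real backpropagation chain.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25 (at x = 0).
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return (x > 0).astype('float64')

x = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid gradients:", sigmoid_grad(x))
print("ReLU gradients:   ", relu_grad(x))

# Best-case activation gradient after passing back through 10 layers.
depth = 10
sigmoid_chain = 0.25 ** depth
relu_chain = 1.0 ** depth
print("sigmoid chain:", sigmoid_chain, "ReLU chain:", relu_chain)
```

Even in the best case the sigmoid factor shrinks geometrically with depth, whereas the ReLU factor stays at 1 along any path of active neurons.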
We have already discussed the significance of the dropout layer, so we will not repeat it here. We drop 20% of the neurons during training. Dropout is not applied at test time; Keras disables it automatically during evaluation and prediction.
Now, we are all set to train our very first ANN, but before starting training, we have to define the values of the network hyperparameters.
We will use Adam (adaptive moment estimation), a variant of stochastic gradient descent (SGD). There are many algorithms that improve on plain SGD. For now, you just need to know that Adam is usually a better choice than simple gradient descent because it adapts the step size of each weight using running averages of the past gradients. This makes it less likely that training gets trapped in a poor local minimum or overshoots a good one. We are using Adam with its default parameters.
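For readers who want to see the update rule itself, here is a minimal pure-Python sketch of Adam minimizing a one-dimensional quadratic. The function name `adam_minimize` and the toy objective are assumptions made for illustration; the default hyperparameter values are the ones suggested in the original Adam paper, and the learning rate is raised to 0.1 in the example call only so that the toy problem converges quickly. This is not the code Keras runs internally.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a 1D function with Adam, given its gradient (hypothetical helper)."""
    x = float(x0)
    m = 0.0  # running average of gradients (first moment)
    v = 0.0  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled by the recent gradient magnitude.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, lr=0.1, steps=1000)
print("estimated minimum:", x_final)
```

The division by the square root of the second moment is what makes the step size adaptive: weights with consistently large gradients take relatively smaller steps.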
Here, we use batch_size of 128 samples. That means we will update the weights after calculating the error on these 128 samples. It is a sufficient batch size for our total data population.
We are going to train our network for 20 epochs for the time being. Here, one epoch means one complete training cycle of all mini-batches.
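The batch size and the number of epochs together determine how many weight updates the network receives. The arithmetic below uses only the numbers given in this article (60000 training images, batches of 128, 20 epochs) and assumes that the final, smaller batch of each epoch is also used for an update.

```python
import math

n_samples = 60000
batch_size = 128
epochs = 20

# The last batch of an epoch is smaller when the sizes do not divide evenly.
updates_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (updates_per_epoch - 1) * batch_size
total_updates = updates_per_epoch * epochs

print("weight updates per epoch:", updates_per_epoch)
print("size of the last batch:", last_batch_size)
print("weight updates over all epochs:", total_updates)
```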
Now, let’s start training our network:
#Here we will be compiling the sequential model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
# Start training the model and saving metrics in history
history = model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=2, validation_data=(X_test, Y_test))
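The model above pairs a softmax output with the categorical cross-entropy loss. The NumPy sketch below shows what these two pieces compute on a tiny made-up batch with three classes. The function names and the clipping constant `eps` are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row maximum keeps exp() numerically stable.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0), then average the per-sample losses.
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.sum(y_true * np.log(y_prob), axis=1).mean()

# Made-up raw outputs for two samples and their one-hot labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print("probabilities:", probs)
print("mean loss:", loss)
```

Each row of probabilities sums to 1, and the loss only looks at the probability assigned to the true class, so a confident correct prediction contributes a value close to 0.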
We will save our trained model on disk so that we can use it for further fine-tuning whenever required. We will store the model in the HDF5 file format:
# Saving the model on disk
path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5'
model.save(path2save)
print('Saved trained model at %s ' % path2save)
# Plotting the metrics
fig = plt.figure()
plt.subplot(2,1,1)
# Note: newer Keras releases name these keys 'accuracy' and 'val_accuracy'
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')
plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')
plt.tight_layout()
plt.show()
Let’s analyze the loss at each epoch during the training of our neural network; we also plot the accuracies for the training and validation sets. You should always monitor both the validation loss and the training loss, as comparing them tells you whether your model is underfitting or overfitting:
Test Loss 0.0824991761778
Test Accuracy 0.9813
As you can see, we are getting almost similar performance for our training and validation sets in terms of loss and accuracy. You can see how accuracy is increasing as the number of epochs increases. This shows that our network is learning.
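One simple way to read such curves programmatically is to find the epoch with the lowest validation loss and compare the final validation and training losses. The sketch below does this on made-up numbers, which are not results from the model in this article; the helper name `best_epoch` is hypothetical. With Keras, the two lists would come from history.history['loss'] and history.history['val_loss'].

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss (hypothetical helper)."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up curves that show a typical overfitting pattern.
train_loss = [0.50, 0.30, 0.20, 0.12, 0.07]
val_loss = [0.45, 0.32, 0.28, 0.30, 0.35]

epoch = best_epoch(val_loss)
gap = val_loss[-1] - train_loss[-1]

print("best epoch by validation loss:", epoch)
print("final validation minus training loss:", round(gap, 2))
```

A validation loss that starts rising while the training loss keeps falling, as in these made-up curves, is the usual sign of overfitting; two curves that stay close together suggest the model generalizes well.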
Now, we have trained and stored our model. It’s time to reload it and test it with the 10000 test instances:
#Let's load the model for testing data
path2save = 'E:/PyDevWorkSpaceTest/Ensembles/Chapter_10/keras_mnist.h5'
mnist_model = load_model(path2save)
#We will use Evaluate function
loss_and_metrics = mnist_model.evaluate(X_test, Y_test, verbose=2)
print("Test Loss", loss_and_metrics[0])
print("Test Accuracy", loss_and_metrics[1])
#Load the model and create predictions on the test set
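mnist_model = load_model(path2save)
# predict() returns class probabilities; argmax picks the most likely digit
# (predict_classes was removed from recent Keras releases)
predicted_classes = np.argmax(mnist_model.predict(X_test), axis=1)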
#See which we predicted correctly and which not
correct_indices = np.nonzero(predicted_classes == y_test)[0]
incorrect_indices = np.nonzero(predicted_classes != y_test)[0]
print(len(correct_indices)," classified correctly")
print(len(incorrect_indices)," classified incorrectly")
So, here is the performance of our model on the test set:
9813 classified correctly
187 classified incorrectly
As you can see, we have misclassified 187 instances out of 10000, which I think is very good accuracy for such a simple model. In the next code block, we will look at some of the cases where the predicted label is wrong:
#Adapt figure size to accommodate 18 subplots
plt.rcParams['figure.figsize'] = (7,14)
plt.figure()
# plot 9 correct predictions
for i, correct in enumerate(correct_indices[:9]):
    plt.subplot(6,3,i+1)
    plt.imshow(X_test[correct].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted: {}, Truth: {}".format(predicted_classes[correct], y_test[correct]))
    plt.xticks([])
    plt.yticks([])
# plot 9 incorrect predictions
for i, incorrect in enumerate(incorrect_indices[:9]):
    plt.subplot(6,3,i+10)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted: {}, Truth: {}".format(predicted_classes[incorrect], y_test[incorrect]))
    plt.xticks([])
    plt.yticks([])
plt.show()
If you look closely, our network mostly fails on cases that would be very difficult for a human to identify, too. So, we can say that we are getting quite good accuracy from a very simple model.
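Beyond looking at individual images, a confusion matrix summarizes which digits get mistaken for which. The sketch below builds one with NumPy on a small made-up set of labels with three classes; the helper name `confusion_matrix` and the data are assumptions for illustration. With the variables from this article, y_test and predicted_classes would take the place of the toy arrays, and n_classes would be 10.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(matrix, (y_true, y_pred), 1)
    return matrix

# Made-up labels and predictions for a 3-class problem.
y_true = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred, n_classes=3)
# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

print(cm)
print("accuracy:", accuracy)
```

Large off-diagonal entries point to pairs of classes that the model tends to confuse.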
We saw how to create, train, and test a neural network to perform digit classification using Keras and TensorFlow.
If you found our post useful, do check out this book Ensemble Machine Learning to build ensemble models using TensorFlow and Python libraries such as scikit-learn and NumPy.