Pretrained models are used in the following two popular ways when building new models or reusing them:
- Using a pretrained model as a feature extractor
- Fine-tuning the pretrained model
This article is an excerpt from the book Hands-On Transfer Learning with Python. The book covers setting up a DL environment and discusses various DL architectures, including CNNs, LSTMs, capsule networks, and more. In this article, we will leverage a pre-trained model that is essentially an expert in the computer vision domain and renowned for image classification and categorization.
The pretrained model we will be using in this article is the popular VGG-16 model, created by the Visual Geometry Group at the University of Oxford, which specializes in building very deep convolutional networks for large-scale visual recognition. You can find out more about it on the group's official website, hosted at robots.ox.ac.uk. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale, and the group's models have often secured first place in this competition.
A pretrained model like VGG-16 has already been trained on a huge dataset (ImageNet) with many diverse image categories. Considering this fact, the model should have learned a robust hierarchy of features that are, to a useful degree, spatially, rotationally, and translationally invariant, as we have discussed before with regard to features learned by CNN models. Hence, the model, having learned a good representation of features from over a million images belonging to 1,000 different categories, can act as a good feature extractor for new images in other computer vision problems. These new images might not appear in the ImageNet dataset at all, or might belong to totally different categories, but the model should still be able to extract relevant features from them, following the principles of transfer learning.
This lets us use pretrained models as effective feature extractors for new images, to solve diverse and complex computer vision tasks, such as building our cat versus dog classifier with fewer images, or even building a dog breed classifier, a facial expression classifier, and much more! Let’s briefly discuss the VGG-16 model architecture before unleashing the power of transfer learning on our problem.
Understanding the VGG-16 model
The VGG-16 model is a 16-layer (convolution and fully connected) network trained on the ImageNet database, which was built for the purpose of image recognition and classification. The model was developed by Karen Simonyan and Andrew Zisserman and is described in their paper titled Very Deep Convolutional Networks for Large-Scale Image Recognition.
I recommend that all interested readers go through this excellent paper. The architecture of the VGG-16 model is depicted in the following diagram:
You can clearly see that we have a total of 13 convolution layers using 3 x 3 convolution filters along with max-pooling layers for downsampling and a total of two fully connected hidden layers of 4,096 units in each layer followed by a dense layer of 1,000 units, where each unit represents one of the image categories in the ImageNet database.
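The layer counts above can be checked with a little arithmetic. The following is a minimal sketch, not code from the book: it assumes the channel widths of the original VGG-16 configuration (64, 128, 256, 512, 512), a 224 x 224 x 3 input as in the original paper, convolutions that preserve spatial size, and 2 x 2 max pooling at the end of each block. The names VGG16_BLOCKS, DENSE_UNITS, and count_vgg16 are introduced here purely for illustration.

```python
# Channel widths of the 3x3 convolution layers, grouped by block.
# Assumption: these follow the original VGG-16 configuration.
VGG16_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256],
    [512, 512, 512],
    [512, 512, 512],
]
# Two hidden dense layers plus the 1,000-unit ImageNet output layer.
DENSE_UNITS = [4096, 4096, 1000]


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one kernel x kernel convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of one fully connected layer."""
    return in_units * out_units + out_units


def count_vgg16(input_size=224, input_channels=3):
    """Return (weight_layers, conv_parameter_total, dense_parameter_total)."""
    channels, size = input_channels, input_size
    n_layers = 0
    conv_total = 0
    for block in VGG16_BLOCKS:
        for width in block:
            conv_total += conv_params(channels, width)
            channels = width
            n_layers += 1
        size //= 2  # each block ends with 2x2 max pooling
    units = size * size * channels  # length of the flattened feature map
    dense_total = 0
    for width in DENSE_UNITS:
        dense_total += dense_params(units, width)
        units = width
        n_layers += 1
    return n_layers, conv_total, dense_total


layers, conv_total, dense_total = count_vgg16()
print("weight layers:", layers)
print("convolution parameters:", conv_total)
print("dense parameters:", dense_total)
print("total parameters:", conv_total + dense_total)
```

Under these assumptions, the 13 convolution layers and 3 dense layers add up to the 16 weight layers that give the model its name, and most of the parameters sit in the dense layers rather than in the convolution blocks.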
We do not need the last three layers since we will be using our own fully connected dense layers to predict whether an image shows a dog or a cat. We are more concerned with the first five blocks, so that we can leverage the VGG model as an effective feature extractor. For one of the models, we will use it as a simple feature extractor by freezing all five convolution blocks so that their weights are not updated during training. For the last model, we will apply fine-tuning to the VGG model, where we will unfreeze the last two blocks (Block 4 and Block 5) so that their weights get updated in each epoch (per batch of data) as we train our own model.
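The two variants differ only in which layers carry a trainable flag. The sketch below simulates that idea without Keras, so it can be run anywhere. The Layer class and both helper functions are hypothetical stand-ins, and the layer names are assumed to follow the block<N>_conv<M> and block<N>_pool pattern.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical stand-in for a Keras layer: a name, a block, and a flag."""
    name: str
    block: int
    trainable: bool = True


def build_vgg16_conv_base():
    """List the layers of the five convolution blocks (names are assumed)."""
    convs_per_block = [2, 2, 3, 3, 3]
    layers = []
    for block, n_convs in enumerate(convs_per_block, start=1):
        for conv in range(1, n_convs + 1):
            layers.append(Layer(f"block{block}_conv{conv}", block))
        layers.append(Layer(f"block{block}_pool", block))
    return layers


def set_trainable_from(layers, first_trainable_block=None):
    """Freeze every layer before the given block; None freezes all layers."""
    for layer in layers:
        layer.trainable = (
            first_trainable_block is not None
            and layer.block >= first_trainable_block
        )
    return layers


# Variant 1: plain feature extractor, all five blocks frozen.
feature_extractor = set_trainable_from(build_vgg16_conv_base())

# Variant 2: fine-tuning, blocks 4 and 5 unfrozen.
fine_tuned = set_trainable_from(build_vgg16_conv_base(), first_trainable_block=4)

print("trainable (feature extractor):",
      [layer.name for layer in feature_extractor if layer.trainable])
print("trainable (fine-tuning):",
      [layer.name for layer in fine_tuned if layer.trainable])
```

In a real Keras model, pooling layers have no weights, so only the convolution layers in the unfrozen blocks would actually contribute trainable parameters.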
We represent the preceding architecture, along with the two variants (basic feature extractor and fine-tuning) that we will be using, in the following block diagram, so you can get a better visual perspective:
Thus, we are mostly concerned with leveraging the convolution blocks of the VGG-16 model and then flattening the final output (from the feature maps) so that we can feed it into our own dense layers for our classifier.
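To get a feel for how large the flattened vector is, we can trace the spatial size through the five pooling steps. This is a rough sketch under stated assumptions: convolutions keep the spatial size, each block halves it with 2 x 2 max pooling (rounding down), and the last block outputs 512 channels. The input sizes used below are illustrative choices, not values stated in this article, and both function names are hypothetical.

```python
def conv_base_output_shape(height, width, n_blocks=5, final_channels=512):
    """Shape of the final feature map after n_blocks of 2x2 max pooling."""
    for _ in range(n_blocks):
        height //= 2  # pooling rounds down on odd sizes
        width //= 2
    return height, width, final_channels


def flattened_length(height, width):
    """Number of features fed to the dense layers after flattening."""
    h, w, c = conv_base_output_shape(height, width)
    return h * w * c


# Illustrative input sizes (assumptions, not taken from the article).
for size in (224, 150):
    print(size, conv_base_output_shape(size, size), flattened_length(size, size))
```

The length of this flattened vector is what the first dense layer of our own classifier receives as input.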
Building our dataset
To start with, we load up the following dependencies, including a utility module called utils, which is available in the utils.py file present in the code files. This is mainly used to get a visual progress bar when we copy images into new folders:
    import glob
    import numpy as np
    import os
    import shutil
    from utils import log_progress

    np.random.seed(42)
Let’s now load up all the images in our original training data folder as follows:
    files = glob.glob('train/*')

    cat_files = [fn for fn in files if 'cat' in fn]
    dog_files = [fn for fn in files if 'dog' in fn]
    len(cat_files), len(dog_files)

    Out : (12500, 12500)
We can verify with the preceding output that we have 12,500 images for each category. Let’s now build our smaller dataset so that we have 3,000 images for training, 1,000 images for validation, and 1,000 images for our test dataset (with equal representation for the two animal categories):
    cat_train = np.random.choice(cat_files, size=1500, replace=False)
    dog_train = np.random.choice(dog_files, size=1500, replace=False)
    cat_files = list(set(cat_files) - set(cat_train))
    dog_files = list(set(dog_files) - set(dog_train))

    cat_val = np.random.choice(cat_files, size=500, replace=False)
    dog_val = np.random.choice(dog_files, size=500, replace=False)
    cat_files = list(set(cat_files) - set(cat_val))
    dog_files = list(set(dog_files) - set(dog_val))

    cat_test = np.random.choice(cat_files, size=500, replace=False)
    dog_test = np.random.choice(dog_files, size=500, replace=False)

    print('Cat datasets:', cat_train.shape, cat_val.shape, cat_test.shape)
    print('Dog datasets:', dog_train.shape, dog_val.shape, dog_test.shape)

    Cat datasets: (1500,) (500,) (500,)
    Dog datasets: (1500,) (500,) (500,)
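One caveat with the set-difference approach is that the iteration order of a set of strings can change between Python runs, so a fixed NumPy seed may not reproduce exactly the same split. As an alternative sketch (not the book's method), we can sort the file list, shuffle it once with a seeded generator, and slice it into three disjoint parts. The function split_files and the generated file names below are hypothetical and only stand in for the real train/* listing.

```python
import random


def split_files(files, n_train, n_val, n_test, seed=42):
    """Shuffle a sorted copy of `files` once and cut it into disjoint slices."""
    if n_train + n_val + n_test > len(files):
        raise ValueError("not enough files for the requested split sizes")
    shuffled = sorted(files)  # sorting makes the result independent of input order
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Hypothetical file names standing in for the real directory listing.
cat_files = [f"train/cat.{i}.jpg" for i in range(12500)]
cat_train, cat_val, cat_test = split_files(cat_files, 1500, 500, 500)
print(len(cat_train), len(cat_val), len(cat_test))
```

Because the three slices come from one shuffled list, they cannot overlap, and the same seed gives the same split on every run.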
Now that our datasets have been created, let’s write them out to our disk in separate folders, so that we can come back to them anytime in the future without worrying if they are present in our main memory:
    train_dir = 'training_data'
    val_dir = 'validation_data'
    test_dir = 'test_data'

    train_files = np.concatenate([cat_train, dog_train])
    validate_files = np.concatenate([cat_val, dog_val])
    test_files = np.concatenate([cat_test, dog_test])

    os.mkdir(train_dir) if not os.path.isdir(train_dir) else None
    os.mkdir(val_dir) if not os.path.isdir(val_dir) else None
    os.mkdir(test_dir) if not os.path.isdir(test_dir) else None

    for fn in log_progress(train_files, name='Training Images'):
        shutil.copy(fn, train_dir)

    for fn in log_progress(validate_files, name='Validation Images'):
        shutil.copy(fn, val_dir)

    for fn in log_progress(test_files, name='Test Images'):
        shutil.copy(fn, test_dir)
The progress bars depicted in the following screenshot become green once all the images have been copied to their respective directory:
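If the utils module is not at hand, the copy step can also be written with the standard library alone. The following is a minimal sketch under that assumption: copy_files is a hypothetical helper, it uses os.makedirs with exist_ok=True in place of the conditional os.mkdir calls, and it shows no progress bar. The demo runs on placeholder files in a temporary directory rather than on the real images.

```python
import os
import shutil
import tempfile


def copy_files(files, dest_dir):
    """Copy each file into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    for path in files:
        shutil.copy(path, dest_dir)
    return len(os.listdir(dest_dir))


# Self-contained demo on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    src_dir = os.path.join(root, "train")
    os.makedirs(src_dir)
    sources = []
    for i in range(3):
        path = os.path.join(src_dir, f"cat.{i}.jpg")
        with open(path, "wb") as fh:
            fh.write(b"placeholder")
        sources.append(path)
    copied = copy_files(sources, os.path.join(root, "training_data"))
    print("files copied:", copied)
```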
Pretrained CNN model as a feature extractor with image augmentation
We will leverage the same data generators for our train and validation datasets that we used before. The code for building them is depicted as follows for ease of understanding:
    # Imports assumed to carry over from earlier in the chapter (standalone Keras 2.x)
    from keras.preprocessing.image import ImageDataGenerator
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras import optimizers

    train_datagen = ImageDataGenerator(rescale=1./255, zoom_range=0.3, rotation_range=50,
                                       width_shift_range=0.2, height_shift_range=0.2,
                                       shear_range=0.2, horizontal_flip=True,
                                       fill_mode='nearest')
    val_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow(train_imgs, train_labels_enc, batch_size=30)
    val_generator = val_datagen.flow(validation_imgs, validation_labels_enc, batch_size=20)

Let's now build our deep learning model architecture. We won't extract the bottleneck features like last time since we will be training on data generators; hence, we will be passing the vgg_model object as an input to our own model:

    model = Sequential()
    model.add(vgg_model)
    model.add(Dense(512, activation='relu', input_dim=input_shape))
    model.add(Dropout(0.3))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer=optimizers.RMSprop(lr=2e-5),
                  metrics=['accuracy'])

The dense layers on top are the same as in our previous model. We bring the learning rate down slightly since we will be training for 100 epochs and don’t want to make any abrupt weight adjustments to our model layers. Do remember that the VGG-16 model’s layers are still frozen here and we are still using it as a basic feature extractor only:
    history = model.fit_generator(train_generator, steps_per_epoch=100, epochs=100,
                                  validation_data=val_generator, validation_steps=50,
                                  verbose=1)

    Epoch 1/100
    100/100 - 45s 449ms/step - loss: 0.6511 - acc: 0.6153 - val_loss: 0.5147 - val_acc: 0.7840
    Epoch 2/100
    100/100 - 41s 414ms/step - loss: 0.5651 - acc: 0.7110 - val_loss: 0.4249 - val_acc: 0.8180
    ...
    ...
    Epoch 99/100
    100/100 - 42s 417ms/step - loss: 0.2656 - acc: 0.8907 - val_loss: 0.2757 - val_acc: 0.9050
    Epoch 100/100
    100/100 - 42s 418ms/step - loss: 0.2876 - acc: 0.8833 - val_loss: 0.2665 - val_acc: 0.9000
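The steps_per_epoch and validation_steps values in the training call follow from the dataset and batch sizes: one epoch should pass over each dataset roughly once. The short sketch below reproduces that arithmetic with the numbers used in this article (3,000 training images in batches of 30, and 1,000 validation images in batches of 20). The helper name steps_for is hypothetical.

```python
import math


def steps_for(n_samples, batch_size):
    """Number of batches needed to cover n_samples once."""
    return math.ceil(n_samples / batch_size)


train_steps = steps_for(3000, 30)  # 3,000 training images, batch size 30
val_steps = steps_for(1000, 20)    # 1,000 validation images, batch size 20
print(train_steps, val_steps)
```

Rounding up matters only when the dataset size is not a multiple of the batch size, in which case the last batch of an epoch is smaller than the rest.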
We can see that our model has an overall validation accuracy of 90%, which is a slight improvement from our previous model, and also the train and validation accuracy are quite close to each other, indicating that the model is not overfitting. This can be reinforced by looking at the following plots for model accuracy and loss:
We can clearly see that the values of train and validation accuracy are quite close to each other and the model doesn’t overfit. Also, we reach 90% accuracy, which is neat! Let’s save this model to disk now (for example, with the Keras model.save method) for future evaluation on the test data.
We will now fine-tune the VGG-16 model to build our last classifier, where we will unfreeze blocks 4 and 5, as we depicted at the beginning of this article.
Pretrained CNN model with fine-tuning and image augmentation
We will now leverage our VGG-16 model object stored in the vgg_model variable and unfreeze convolution blocks 4 and 5 while keeping the first three blocks frozen. The following code helps us achieve this:
    vgg_model.trainable = True

    set_trainable = False
    for layer in vgg_model.layers:
        # Everything from the first convolution layer of block 4 onwards is unfrozen
        if layer.name in ['block5_conv1', 'block4_conv1']:
            set_trainable = True
        if set_trainable:
            layer.trainable = True
        else:
            layer.trainable = False

    print("Trainable layers:", vgg_model.trainable_weights)

    Trainable layers: [... 12 weight variables ...]
You can see from the preceding output that the convolution layers pertaining to blocks 4 and 5 are now trainable (pooling layers have no weights of their own, so they do not show up in the list of trainable weights), and you can also verify which layers are frozen and unfrozen using the following code:

    import pandas as pd

    layers = [(layer, layer.name, layer.trainable) for layer in vgg_model.layers]
    pd.DataFrame(layers, columns=['Layer Type', 'Layer Name', 'Layer Trainable'])
The preceding code generates the following output:
We can clearly see that the last two blocks are now trainable, which means the weights for these layers will also get updated with backpropagation in each epoch as we pass each batch of data. We will use the same data generators and model architecture as our previous model and train our model. We reduce the learning rate slightly since we don’t want to get stuck at any local minimum, and we also do not want to suddenly update the weights of the trainable VGG-16 model layers by a big factor that might adversely affect the model:

    # data generators
    train_datagen = ImageDataGenerator(rescale=1./255, zoom_range=0.3, rotation_range=50,
                                       width_shift_range=0.2, height_shift_range=0.2,
                                       shear_range=0.2, horizontal_flip=True,
                                       fill_mode='nearest')
    val_datagen = ImageDataGenerator(rescale=1./255)

    train_generator = train_datagen.flow(train_imgs, train_labels_enc, batch_size=30)
    val_generator = val_datagen.flow(validation_imgs, validation_labels_enc, batch_size=20)

    # build model architecture
    model = Sequential()
    model.add(vgg_model)
    model.add(Dense(512, activation='relu', input_dim=input_shape))
    model.add(Dropout(0.3))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer=optimizers.RMSprop(lr=1e-5),
                  metrics=['accuracy'])

    # model training
    history = model.fit_generator(train_generator, steps_per_epoch=100, epochs=100,
                                  validation_data=val_generator, validation_steps=50,
                                  verbose=1)

    Epoch 1/100
    100/100 - 64s 642ms/step - loss: 0.6070 - acc: 0.6547 - val_loss: 0.4029 - val_acc: 0.8250
    Epoch 2/100
    100/100 - 63s 630ms/step - loss: 0.3976 - acc: 0.8103 - val_loss: 0.2273 - val_acc: 0.9030
    ...
    ...
    Epoch 99/100
    100/100 - 63s 629ms/step - loss: 0.0243 - acc: 0.9913 - val_loss: 0.2861 - val_acc: 0.9620
    Epoch 100/100
    100/100 - 63s 629ms/step - loss: 0.0226 - acc: 0.9930 - val_loss: 0.3002 - val_acc: 0.9610
We can see from the preceding output that our model has obtained a validation accuracy of around 96%, which is an improvement of 6 percentage points over our previous model. Overall, this model has gained 24 percentage points in validation accuracy over our first basic CNN model. This really shows how useful transfer learning can be.
Let’s observe the model accuracy and loss plots:
We can see that the accuracy values are excellent here, and although the model looks like it might be slightly overfitting on the training data (the training loss ends far below the validation loss), we still get great validation accuracy. Let’s save this model to disk now (for example, with the Keras model.save method) so that we can evaluate it later.
Let’s now put all our models to the test by actually evaluating their performance on our test dataset.
In this article, we learned how to leverage pre-trained models for transfer learning and covered the two main ways to use them: as feature extractors and through fine-tuning. We looked at the architecture of the VGG-16 model in detail and saw how to leverage the model as an efficient image feature extractor. To learn more about pretrained CNN models, check out the book Hands-On Transfer Learning with Python.