In this tutorial, we will combine techniques in both computer vision and natural language processing to form a complete image description approach. This will be responsible for constructing computer-generated natural descriptions of any provided images. The idea is to replace the encoder (RNN layer) in an encoder-decoder architecture with a deep convolutional neural network (CNN) trained to classify objects in images.
Normally, the CNN’s last layer is the softmax layer, which assigns the probability that each object might be in the image. But if we remove that softmax layer from CNN, we can feed the CNN’s rich encoding of the image into the decoder (language generation RNN) designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that the descriptions it produces best match the training descriptions for each image.
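Stated as a formula, the training goal described above can be sketched as follows. This is the usual formulation for this family of captioning models; the symbols are introduced here for illustration and do not appear in the code: I is an image, S = (S_1, ..., S_N) is its caption, and θ stands for the trainable parameters.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \dots, S_{t-1}; \theta)
```

The second expression is the chain rule applied word by word, which is why the decoder can be trained to predict one word at a time.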
This tutorial is an excerpt from the book Python Deep Learning Projects by Matthew Lamons, Rahul Kumar, and Abhishek Nagaraja.
The book simplifies deep learning concepts and demonstrates the vital role neural networks play in predictive analytics across different domains. Readers will explore projects in computational linguistics, computer vision, machine translation, pattern recognition, and more.
All of the Python files and Jupyter Notebook files for this tutorial can be found at GitHub.
In this implementation, we will be using a pretrained Inception-v3 model as a feature extractor in an encoder trained on the ImageNet dataset. Let’s import all of the dependencies that we will need to build an auto-captioning model.
Initialization
For this implementation, we need TensorFlow 1.9 or later (the code uses the TensorFlow 1.x API). We will also enable eager execution mode, which makes the code easier to debug. Here is the code for this:
# Import TensorFlow and enable eager execution
import tensorflow as tf
tf.enable_eager_execution()

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle
Download and prepare the MS-COCO dataset
We are going to use the MS-COCO dataset to train our model. This dataset contains more than 82,000 images, each of which has been annotated with at least five different captions. The following code will download and extract the dataset automatically:
annotation_zip = tf.keras.utils.get_file('captions.zip',
                                         cache_subdir=os.path.abspath('.'),
                                         origin='http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                         extract=True)
annotation_file = os.path.dirname(annotation_zip) + '/annotations/captions_train2014.json'

name_of_zip = 'train2014.zip'
if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):
    image_zip = tf.keras.utils.get_file(name_of_zip,
                                        cache_subdir=os.path.abspath('.'),
                                        origin='http://images.cocodataset.org/zips/train2014.zip',
                                        extract=True)
    PATH = os.path.dirname(image_zip) + '/train2014/'
else:
    PATH = os.path.abspath('.') + '/train2014/'
The following will be the output:
Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2014.zip
252878848/252872794 [==============================] - 6s 0us/step
Downloading data from http://images.cocodataset.org/zips/train2014.zip
13510574080/13510573713 [==============================] - 322s 0us/step
For this example, we’ll select a subset of 40,000 captions and use these and the corresponding images to train our model. As always, captioning quality will improve if you choose to use more data:
# read the json annotation file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# storing the captions and the image name in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)

# shuffling the captions and image_names together
# setting a random state
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# selecting the first 40000 captions from the shuffled set
num_examples = 40000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]
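To make the path and caption construction concrete, here is a minimal sketch that runs without downloading anything. It assumes the annotation file has the same shape as the loop above reads (a top-level annotations list whose entries carry image_id and caption); the sample records and the ./train2014/ location are made up for illustration.

```python
import json

# A tiny stand-in for captions_train2014.json; the IDs and captions are made up
sample_json = '''
{"annotations": [
    {"image_id": 42, "id": 1, "caption": "A dog runs on the beach."},
    {"image_id": 42, "id": 2, "caption": "A brown dog playing near the sea."},
    {"image_id": 318219, "id": 3, "caption": "A man rides a bicycle."}
]}
'''
annotations = json.loads(sample_json)

PATH = './train2014/'  # assumed location of the extracted images

pairs = []
for annot in annotations['annotations']:
    # Wrap every caption in start/end markers so the decoder knows where to begin and stop
    caption = '<start> ' + annot['caption'] + ' <end>'
    # %012d left-pads the numeric ID with zeros to 12 digits
    image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % annot['image_id']
    pairs.append((image_path, caption))

for image_path, caption in pairs:
    print(image_path, '|', caption)
```

Note that the same image path appears once per caption, which is why the feature extraction step later removes duplicate paths before encoding.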
Once the data preparation is completed, all of the image paths are stored in the img_name_vector list variable, and the associated captions are stored in train_captions, as shown in the following screenshot:
Data preparation for a deep CNN encoder
Next, we will use Inception-v3 (pretrained on ImageNet) to classify each image. We will extract features from the last convolutional layer. We will create a helper function that will transform the input image to the format that is expected by Inception-v3:
# Resizing the image to (299, 299)
# Using the preprocess_input method to place the pixels in the range of -1 to 1.
def load_image(image_path):
    img = tf.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize_images(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
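The scaling to the range of -1 to 1 mentioned in the comment above amounts to dividing by 127.5 and subtracting 1. A minimal NumPy sketch of that arithmetic, assuming 8-bit input pixels; scale_to_unit_range is a hypothetical helper name, not part of Keras:

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map 8-bit pixel values from [0, 255] to [-1, 1]."""
    return pixels.astype(np.float32) / 127.5 - 1.0

# A made-up 2x2 RGB image covering the extremes and some values in between
image = np.array([[[0, 0, 0], [255, 255, 255]],
                  [[127, 128, 127], [64, 191, 32]]], dtype=np.uint8)

scaled = scale_to_unit_range(image)
print(scaled.min(), scaled.max())
```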
Now let’s initialize the Inception-v3 model and load the pretrained ImageNet weights. To do so, we’ll create a tf.keras model where the output layer is the last convolutional layer in the Inception-v3 architecture.
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
The output is as follows:
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5 87916544/87910968 [==============================] - 40s 0us/step
So, image_features_extract_model is our deep CNN encoder, which is responsible for extracting features from the given image.
Performing feature extraction
Now we will pre-process each image with the deep CNN encoder and dump the output to the disk:
- We will load the images in batches using the load_image() helper function that we created before
- We will feed the images into the encoder to extract the features
- Dump the features as a numpy array:
encode_train = sorted(set(img_name_vector))

# Load images
image_dataset = tf.data.Dataset.from_tensor_slices(
    encode_train).map(load_image).batch(16)

# Extract features
for img, path in image_dataset:
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))

    # Dump into disk
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())
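Two details of the loop above are easy to miss: the reshape merges the 8 x 8 spatial grid into 64 locations, and np.save appends a .npy extension to the file name it is given. The following NumPy-only sketch illustrates both with random data standing in for the Inception-v3 output; the file name example.jpg is hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for one batch of Inception-v3 feature maps: (batch, 8, 8, 2048)
batch_features = np.random.rand(2, 8, 8, 2048).astype(np.float32)

# Merge the 8x8 spatial grid into 64 locations, keeping the channel axis
flat = batch_features.reshape(batch_features.shape[0], -1, batch_features.shape[3])
print(flat.shape)

# np.save appends ".npy" when the target name lacks it, so features saved
# under "<image>.jpg" end up in "<image>.jpg.npy"
with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = os.path.join(tmp_dir, 'example.jpg')  # hypothetical file name
    np.save(image_path, flat[0])
    saved_files = os.listdir(tmp_dir)
    restored = np.load(image_path + '.npy')

print(saved_files)
```

This is why the loading function used later in the data pipeline adds '.npy' to the image name before calling np.load.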
Data prep for a language generation (RNN) decoder
The first step is to pre-process the captions.
We will perform a few basic pre-processing steps on the captions, such as the following:
- We’ll tokenize the captions (for example, by splitting on spaces). This will help us to build a vocabulary of all the unique words in the data (for example, “playing”, “football”, and so on).
- We’ll limit the vocabulary size to the top 5,000 words to save memory. We’ll replace all other words with the <unk> token (for unknown). You can adjust the vocabulary size to suit your use case.
- We will then create a word –> index mapping and vice versa.
- We will finally pad all sequences to be the same length as the longest one.
Here is the code for that:
# Helper func to find the maximum length of any caption in our dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Performing tokenization on the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')

# Converting text into sequence of numbers
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)

tokenizer.word_index = {key: value for key, value in tokenizer.word_index.items() if value <= top_k}

# putting <unk> token in the word2idx dictionary
tokenizer.word_index[tokenizer.oov_token] = top_k + 1
tokenizer.word_index['<pad>'] = 0

# creating the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_captions)

# creating a reverse mapping (index -> word)
index_word = {value: key for key, value in tokenizer.word_index.items()}

# padding each vector to the max_length of the captions
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

# calculating the max_length
# used to store the attention weights
max_length = calc_max_length(train_seqs)
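The four pre-processing steps listed earlier can also be sketched in plain Python, which may help when inspecting what the Keras tokenizer produces. This is a simplified illustration on made-up captions, not a reimplementation of the Keras Tokenizer: the index assignment (0 for padding, 1 for unknown words) and the tiny top_k value are assumptions chosen for readability.

```python
from collections import Counter

captions = ['<start> a man playing football <end>',
            '<start> a dog playing <end>',
            '<start> a man <end>']

top_k = 5  # tiny vocabulary limit, chosen only for this illustration

# Step 1: tokenize by splitting on spaces and count word frequencies
counts = Counter(word for caption in captions for word in caption.split())

# Step 2: keep only the top_k most frequent words
vocab = [word for word, _ in counts.most_common(top_k)]

# Step 3: build word -> index and index -> word mappings
# (index 0 is reserved for padding, 1 for unknown words)
word_index = {'<pad>': 0, '<unk>': 1}
for word in vocab:
    word_index[word] = len(word_index)
index_word = {index: word for word, index in word_index.items()}

# Convert captions to integer sequences, mapping rare words to <unk>
seqs = [[word_index.get(word, word_index['<unk>']) for word in caption.split()]
        for caption in captions]

# Step 4: post-pad every sequence with 0 up to the longest length
max_length = max(len(seq) for seq in seqs)
padded = [seq + [0] * (max_length - len(seq)) for seq in seqs]

for row in padded:
    print(row)
```

In this toy data, "football" and "dog" each occur once, so they fall outside the vocabulary and are encoded as the unknown token.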
The end result will be an array of integer sequences:
We will split the data into training and validation samples using an 80:20 split ratio:
img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector,test_size=0.2,random_state=0)
# Checking the sample counts
print ("No of Training Images:", len(img_name_train))
print ("No of Training Caption:", len(cap_train))
print ("No of Validation Images:", len(img_name_val))
print ("No of Validation Caption:", len(cap_val))

No of Training Images: 24000
No of Training Caption: 24000
No of Validation Images: 6000
No of Validation Caption: 6000
Setting up the data pipeline
Our images and captions are ready! Next, let’s create a tf.data dataset to use for training our model. Now we will prepare the pipeline for an image and the text model by performing transformations and batching on them:
# Defining parameters
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index)

# shape of the vector extracted from Inception-V3 is (64, 2048)
# these two variables represent that
features_shape = 2048
attention_features_shape = 64

# loading the numpy files
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

# We use the from_tensor_slices to load the raw data and transform them into the tensors
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Using the map() to load the numpy files in parallel
# NOTE: Make sure to set num_parallel_calls to the number of CPU cores you have
# https://www.tensorflow.org/api_docs/python/tf/py_func
dataset = dataset.map(lambda item1, item2: tf.py_func(
    map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=8)

# shuffling and batching
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)
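To build intuition for what shuffle(BUFFER_SIZE) followed by batch(BATCH_SIZE) does to a stream of examples, here is a rough pure-Python approximation. It is only a sketch of the idea of a fixed-size shuffle buffer, not TensorFlow's implementation; the function names and the small sizes are made up for illustration.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in a randomised order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Emit a random element once the buffer is over capacity
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain what is left at the end of the stream
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current  # final, possibly smaller batch

examples = list(range(10))
batches = list(batch(shuffle_with_buffer(examples, buffer_size=4), batch_size=3))
print(batches)
```

Because an element can only be emitted after it has entered the buffer, a buffer much smaller than the dataset mixes examples only locally. A larger buffer gives a more thorough shuffle at the cost of memory.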
Defining the captioning model
The model architecture we use for auto-captioning is inspired by the Show, Attend and Tell paper. The features that we extracted from the last convolutional layer of Inception-v3 have a shape of (8, 8, 2048). We then flatten the spatial grid to get a shape of (64, 2048).
This vector is then passed through the CNN encoder, which consists of a single fully connected layer. The RNN (GRU in our case) attends over the image to predict the next word:
def gru(units):
    if tf.test.is_gpu_available():
        return tf.keras.layers.CuDNNGRU(units,
                                        return_sequences=True,
                                        return_state=True,
                                        recurrent_initializer='glorot_uniform')
    else:
        return tf.keras.layers.GRU(units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_activation='sigmoid',
                                   recurrent_initializer='glorot_uniform')
Attention
Now we will define the attention mechanism popularly known as Bahdanau attention. We will need the features from the CNN encoder of a shape of (batch_size, 64, embedding_dim). This attention mechanism will return the context vector and the attention weights over the time axis:
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features shape == (batch_size, 64, embedding_dim)

        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (batch_size, 64, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # attention_weights shape == (batch_size, 64, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # context_vector shape after sum == (batch_size, embedding_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
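The tensor shapes in the attention computation are easier to follow with concrete numbers. The NumPy sketch below repeats the same arithmetic with random arrays standing in for the encoder output, the decoder hidden state, and the kernels of the three Dense layers (biases are omitted). The sizes match the parameters defined earlier, but nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_locations, embedding_dim, hidden_size, units = 2, 64, 256, 512, 512

# Random stand-ins for the encoder output and the decoder hidden state
features = rng.standard_normal((batch_size, num_locations, embedding_dim))
hidden = rng.standard_normal((batch_size, hidden_size))

# Random stand-ins for the kernels of the three Dense layers (biases omitted)
W1 = rng.standard_normal((embedding_dim, units)) * 0.01
W2 = rng.standard_normal((hidden_size, units)) * 0.01
V = rng.standard_normal((units, 1)) * 0.01

hidden_with_time_axis = hidden[:, np.newaxis, :]              # (2, 1, 512)
score = np.tanh(features @ W1 + hidden_with_time_axis @ W2)   # (2, 64, 512)

logits = score @ V                                            # (2, 64, 1)
# Softmax over the 64 image locations (axis=1)
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
attention_weights = exp / exp.sum(axis=1, keepdims=True)      # (2, 64, 1)

# Weighted sum of the location features
context_vector = (attention_weights * features).sum(axis=1)   # (2, 256)

print(attention_weights.shape, context_vector.shape)
```

For each image, the 64 attention weights sum to 1, so the context vector is a weighted average of the 64 location features and has the same last dimension as the features.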
You can refer to the book to understand the CNN encoder, RNN decoder and Loss function used.
Training the captioning model
Let’s train the model. The first thing we need to do is to extract the features stored in the respective .npy files and then pass those features through the CNN encoder.
The encoder output, hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.
The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss. While training, we use the teacher forcing technique to decide the next input to the decoder.
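The difference between teacher forcing and feeding the model its own predictions can be seen in a toy example. In this sketch, the caption is made up and model_guess is a hypothetical stand-in for an untrained decoder that always predicts the wrong word:

```python
# A made-up caption, already split into tokens
target = ['<start>', 'a', 'dog', 'runs', '<end>']

def model_guess(previous_word):
    """Stand-in for the decoder: an untrained model often guesses wrongly."""
    return 'cat'

teacher_forced_inputs = []
free_running_inputs = []

dec_input_tf = target[0]
dec_input_free = target[0]
for i in range(1, len(target)):
    teacher_forced_inputs.append(dec_input_tf)
    free_running_inputs.append(dec_input_free)
    # Teacher forcing: the next input is the ground-truth word at step i
    dec_input_tf = target[i]
    # Free running: the next input is whatever the model predicted
    dec_input_free = model_guess(dec_input_free)

print(teacher_forced_inputs)
print(free_running_inputs)
```

With teacher forcing, the decoder always sees the correct previous word, so an early mistake does not derail the rest of the sequence during training.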
The final step is to calculate the gradients by backpropagation and apply them to the model variables with the optimizer:
EPOCHS = 20
loss_plot = []

for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        loss = 0

        # initializing the hidden state for each batch
        # because the captions are not related from image to image
        hidden = decoder.reset_state(batch_size=target.shape[0])

        dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        with tf.GradientTape() as tape:
            features = encoder(img_tensor)

            for i in range(1, target.shape[1]):
                # passing the features through the decoder
                predictions, hidden, _ = decoder(dec_input, features, hidden)

                loss += loss_function(target[:, i], predictions)

                # using teacher forcing
                dec_input = tf.expand_dims(target[:, i], 1)

        total_loss += (loss / int(target.shape[1]))

        variables = encoder.variables + decoder.variables

        gradients = tape.gradient(loss, variables)

        optimizer.apply_gradients(zip(gradients, variables),
                                  tf.train.get_or_create_global_step())

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                          batch,
                                                          loss.numpy() / int(target.shape[1])))

    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / len(cap_vector))

    print ('Epoch {} Loss {:.6f}'.format(epoch + 1, total_loss / len(cap_vector)))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
The following is the output:
After training for a few epochs, let’s plot the loss against the epochs:
plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()
The output is as follows:
Evaluating the captioning model
The evaluation function is similar to the training loop, except we don’t use teacher forcing here. The input to the decoder at each time step is its previous predictions, along with the hidden state and the encoder output.
A few key points to remember while making predictions:
- Stop predicting when the model predicts the end token
- Store the attention weights for every time step
Let’s define the evaluate() function:
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val,
                                (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(index_word[predicted_id])

        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
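Stripped of the TensorFlow details, the prediction loop in evaluate() is a greedy decoder: pick the highest-scoring word, stop on the end token, otherwise feed the word back in. A minimal sketch of that control flow, where toy_step and its score table are hypothetical stand-ins for the trained decoder:

```python
def greedy_decode(step_fn, start_token, end_token, max_length):
    """Feed each prediction back as the next input until end_token or max_length."""
    result = []
    dec_input = start_token
    state = None
    for _ in range(max_length):
        scores, state = step_fn(dec_input, state)
        # Greedy choice: take the highest-scoring word
        predicted = max(scores, key=scores.get)
        result.append(predicted)
        if predicted == end_token:
            break
        dec_input = predicted
    return result

# Hypothetical stand-in for the decoder: a fixed table of next-word scores
NEXT_WORD_SCORES = {
    '<start>': {'a': 0.7, 'dog': 0.2, '<end>': 0.1},
    'a': {'dog': 0.6, 'a': 0.1, '<end>': 0.3},
    'dog': {'<end>': 0.8, 'a': 0.1, 'dog': 0.1},
}

def toy_step(dec_input, state):
    return NEXT_WORD_SCORES[dec_input], state

caption = greedy_decode(toy_step, '<start>', '<end>', max_length=10)
print(caption)
```

The max_length cap matters because an undertrained model may never predict the end token.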
Also, let’s create a helper function to visualize the attention points that predict the words:
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))
    len_result = len(result)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(len_result//2, len_result//2, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

# captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print ('Real Caption:', real_caption)
print ('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)

# opening the image
Image.open(img_name_val[rid])
The output is as follows:
Deploying the captioning model
We will deploy the complete module as a RESTful service. To do so, we will write an inference code that loads the latest checkpoint and makes the prediction on the given image.
Look into the inference.py file in the repository. All the code is similar to the training loop except we don’t use teacher forcing here. The input to the decoder at each time step is its previous predictions, along with the hidden state and the encoder output.
An important step is loading the model into memory. For this we use tf.train.Checkpoint(), which restores all of the learned weights for the optimizer, encoder, and decoder. Here is the code for that:
checkpoint_dir = './my_model'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
checkpoint = tf.train.Checkpoint(
    optimizer=optimizer,
    encoder=encoder,
    decoder=decoder,
)

checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))
So, we will create an evaluate() function, which defines the prediction loop. To make sure that the prediction ends after certain words, we will stop predicting when the model predicts the end token, <end>:
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)

    # Extract features from the test image
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val,
                                (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    # Feature is fed into the encoder
    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    # Prediction loop
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(index_word[predicted_id])

        # Hard stop when end token is predicted
        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
Now let’s use this evaluate() function in our web application code:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
"""
@author: rahulkumar
"""

from flask import Flask, request, jsonify
import time
from inference import evaluate
import tensorflow as tf

app = Flask(__name__)

@app.route("/wowme")
def AutoImageCaption():
    # URL of the image to caption, passed as the "image" query parameter
    image_url = request.args.get('image')
    print(image_url)
    # Reuse the last four characters of the URL (for example, ".jpg") as the file extension
    image_extension = image_url[-4:]
    # Download the image under a timestamp-based file name
    image_path = tf.keras.utils.get_file(str(int(time.time())) + image_extension,
                                         origin=image_url)
    result, attention_plot = evaluate(image_path)
    data = {'Prediction Caption:': ' '.join(result)}
    return jsonify(data)

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=8081)
Execute the following command in the Terminal to run the web app:
python caption_deploy_api.py
You should get the following output:
* Running on http://0.0.0.0:8081/ (Press CTRL+C to quit)
Now we request the API, as follows:
curl "0.0.0.0:8081/wowme?image=https://www.beautifulpeopleibiza.com/images/BPI/img_bpi_destacada.jpg"
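The same request can be issued from Python. The sketch below only builds the request URL, percent-encoding the image address so that characters such as ? or & inside it cannot be mistaken for part of the API query. build_caption_url is a hypothetical helper, and the example.com address is a placeholder; the commented-out lines show how the call could be made once the service is running.

```python
from urllib.parse import urlencode

def build_caption_url(host, port, image_url):
    """Build the /wowme request URL with the image address percent-encoded."""
    query = urlencode({'image': image_url})
    return 'http://{}:{}/wowme?{}'.format(host, port, query)

# Placeholder image address, used only for illustration
request_url = build_caption_url('0.0.0.0', 8081, 'https://example.com/photos/beach.jpg')
print(request_url)

# With the service running, the caption could be fetched like this:
# from urllib.request import urlopen
# import json
# with urlopen(request_url) as response:
#     print(json.load(response))
```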
We should get our caption predicted, as shown in the following screenshot:
Make sure to train the model on a large image dataset to get better predictions.
Summary
In this implementation, we used a pretrained Inception-v3 model, trained on the ImageNet dataset, as the feature extractor in the encoder of a deep learning solution. This solution combines techniques from computer vision and natural language processing to form a complete image description approach that can construct computer-generated natural descriptions of any provided images. We’ve broken the barrier between images and language with this trained model, and we’ve provided a technology that could be used as part of an application, helping the visually impaired enjoy the benefits of the megatrend of photo sharing!
To understand insightful projects to master deep learning and neural network architectures using Python and Keras, check out our book Python Deep Learning Projects.