
In this tutorial, we will combine techniques in both computer vision and natural language processing to form a complete image description approach. This will be responsible for constructing computer-generated natural descriptions of any provided images.  The idea is to replace the encoder (RNN layer) in an encoder-decoder architecture with a deep convolutional neural network (CNN) trained to classify objects in images.

Normally, the CNN’s last layer is the softmax layer, which assigns the probability that each object might be in the image. But if we remove that softmax layer from CNN, we can feed the CNN’s rich encoding of the image into the decoder (language generation RNN) designed to produce phrases. We can then train the whole system directly on images and their captions, so it maximizes the likelihood that the descriptions it produces best match the training descriptions for each image.
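The training objective described above can be written out explicitly. As a sketch, using notation that does not appear in the original text (an image $I$, a caption $S = (S_1, \dots, S_N)$, and model parameters $\theta$ are all assumed symbols), maximizing the likelihood of the training captions means:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \dots, S_{t-1}; \theta)
```

The second equation is the chain rule applied word by word, which is what allows the decoder RNN to be trained one time step at a time.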

This tutorial is an excerpt from the book Python Deep Learning Projects, written by Matthew Lamons, Rahul Kumar, and Abhishek Nagaraja.

This book simplifies how deep learning works and demonstrates how neural networks play a vital role in predictive analytics across different domains. Readers will explore projects in computational linguistics, computer vision, machine translation, pattern recognition, and many more!

All of the Python files and Jupyter Notebook files for this tutorial can be found at GitHub.

In this implementation, we will be using a pretrained Inception-v3 model as a feature extractor in an encoder trained on the ImageNet dataset. Let’s import all of the dependencies that we will need to build an auto-captioning model.



This implementation requires TensorFlow 1.9 or later. We will also enable eager execution mode, which makes the code easier to debug. Here is the code for this:

# Import TensorFlow and enable eager execution
import tensorflow as tf
tf.enable_eager_execution()
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle

Download and prepare the MS-COCO dataset

We are going to use the MS-COCO dataset to train our model. This dataset contains more than 82,000 images, each of which has been annotated with at least five different captions. The following code will download and extract the dataset automatically:

# cache_subdir keeps the downloads in the working directory,
# so the existence check below can find them on later runs
annotation_zip = tf.keras.utils.get_file('captions.zip',
                                          cache_subdir=os.path.abspath('.'),
                                          origin='http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                          extract=True)
annotation_file = os.path.dirname(annotation_zip) + '/annotations/captions_train2014.json'

# Download the images only if the archive is not already present
name_of_zip = 'train2014.zip'
if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):
    image_zip = tf.keras.utils.get_file(name_of_zip,
                                        cache_subdir=os.path.abspath('.'),
                                        origin='http://images.cocodataset.org/zips/train2014.zip',
                                        extract=True)
    PATH = os.path.dirname(image_zip) + '/train2014/'
else:
    PATH = os.path.abspath('.') + '/train2014/'

The following will be the output:

Downloading data from http://images.cocodataset.org/annotations/annotations_trainval2014.zip 
252878848/252872794 [==============================] - 6s 0us/step 
Downloading data from http://images.cocodataset.org/zips/train2014.zip 
13510574080/13510573713 [==============================] - 322s 0us/step

For this example, we’ll select a subset of 40,000 captions and use these and the corresponding images to train our model. As always, captioning quality will improve if you choose to use more data:

# read the json annotation file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# storing the captions and the image name in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)

# shuffling the captions and image_names together
# setting a random state (any fixed seed keeps the shuffle reproducible)
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# selecting the first 40000 captions from the shuffled set
num_examples = 40000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]

Once data preparation is complete, all of the image paths are stored in the img_name_vector list, and the associated captions are stored in train_captions.
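To make the result of this step concrete, here is a minimal sketch of how one annotation entry turns into an image path and a wrapped caption. The annotation values and the PATH directory are made up for illustration; only the file-name pattern and the start and end markers come from the code above.

```python
# Toy stand-in for the structure of the COCO annotation file (values are made up).
annotations = {
    "annotations": [
        {"image_id": 42, "caption": "a dog playing football"},
        {"image_id": 1234, "caption": "a cat on a sofa"},
    ]
}
PATH = "/data/train2014/"  # hypothetical image directory

all_captions = []
all_img_name_vector = []
for annot in annotations["annotations"]:
    # Wrap each caption in start/end markers so the decoder knows where to begin and stop.
    all_captions.append("<start> " + annot["caption"] + " <end>")
    # COCO file names zero-pad the image id to 12 digits.
    all_img_name_vector.append(
        PATH + "COCO_train2014_" + "%012d.jpg" % annot["image_id"]
    )

print(all_img_name_vector[0])
print(all_captions[0])
```

Because both lists are filled in the same loop, entry i of one always describes entry i of the other, which is why they must be shuffled together.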

Data preparation for a deep CNN encoder

Next, we will use Inception-v3 (pretrained on ImageNet) to classify each image. We will extract features from the last convolutional layer. We will create a helper function that will transform the input image to the format that is expected by Inception-v3:

# Resize the image to (299, 299)
# Use the preprocess_input method to place the pixels in the range of -1 to 1
def load_image(image_path):
    img = tf.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize_images(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
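The pixel scaling mentioned in the comments can be illustrated without TensorFlow. The sketch below assumes the usual Inception-style rule of dividing by 127.5 and subtracting 1; the function name scale_to_unit_range is hypothetical and is not part of Keras.

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 127.5 - 1.0

# Black, mid-gray, and white pixel intensities.
scaled = scale_to_unit_range([0, 127.5, 255])
print(scaled)
```

Centering the inputs around zero matches the range the pretrained weights were trained on, so skipping this step would degrade the extracted features.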

Now let’s initialize the Inception-v3 model and load the pretrained ImageNet weights. To do so, we’ll create a tf.keras model where the output layer is the last convolutional layer in the Inception-v3 architecture.

image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

The output is as follows:

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5
87916544/87910968 [==============================] - 40s 0us/step

So, image_features_extract_model is our deep CNN encoder, which is responsible for extracting features from a given image.

Performing feature extraction

Now we will pre-process each image with the deep CNN encoder and dump the output to disk:

  1. Load the images in batches using the load_image() helper function that we created earlier.
  2. Feed the images into the encoder to extract the features.
  3. Dump the features to disk as NumPy arrays.

Here is the code for that:

# Getting the unique images
encode_train = sorted(set(img_name_vector))

# Load images in batches (the batch size of 16 is an assumed value; adjust it to fit your memory)
image_dataset = tf.data.Dataset.from_tensor_slices(
                                encode_train).map(load_image).batch(16)

# Extract features
for img, path in image_dataset:
  batch_features = image_features_extract_model(img)
  batch_features = tf.reshape(batch_features,
                              (batch_features.shape[0], -1, batch_features.shape[3]))

  # Dump into disk; np.save appends the .npy extension to each image path
  for bf, p in zip(batch_features, path):
    path_of_feature = p.numpy().decode("utf-8")
    np.save(path_of_feature, bf.numpy())
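The reshape and the save step can be checked in isolation with NumPy. This is only a sketch: random numbers stand in for real Inception-v3 features, and the file path is a hypothetical temporary location.

```python
import os
import tempfile
import numpy as np

# Stand-in for one image's feature map (random values, not real features).
feature_map = np.random.rand(8, 8, 2048).astype(np.float32)

# Flatten the 8 x 8 spatial grid into 64 locations, keeping the 2048 channels.
flat = feature_map.reshape(-1, feature_map.shape[2])

# np.save appends ".npy" when the target name lacks that extension.
image_path = os.path.join(tempfile.mkdtemp(), "example.jpg")  # hypothetical path
np.save(image_path, flat)

# The features are therefore read back from "<image path>.npy".
restored = np.load(image_path + ".npy")
print(flat.shape, restored.shape)
```

This is why the data pipeline later appends '.npy' to each image name before loading: the cached features sit next to the original image under that extended name.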

Data prep for a language generation (RNN) decoder

The first step is to pre-process the captions.

We will perform a few basic pre-processing steps on the captions, such as the following:

  • We’ll tokenize the captions (for example, by splitting on spaces). This will help us to build a vocabulary of all the unique words in the data (for example, “playing”, “football”, and so on).
  • We’ll limit the vocabulary size to the top 5,000 words to save memory and replace all other words with the token <unk> (for unknown). You can tune this limit to suit your use case.
  • We will then create a word –> index mapping and vice versa.
  • We will finally pad all sequences to be the same length as the longest one.

Here is the code for that:

# Helper func to find the maximum length of any caption in our dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)

# Performing tokenization on the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token='<unk>',
                                                  filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
# Building the vocabulary from the training captions
tokenizer.fit_on_texts(train_captions)

# Converting text into sequence of numbers
train_seqs = tokenizer.texts_to_sequences(train_captions)

tokenizer.word_index = {key:value for key, value in tokenizer.word_index.items() if value <= top_k}

# putting <unk> token in the word2idx dictionary
tokenizer.word_index[tokenizer.oov_token] = top_k + 1
tokenizer.word_index['<pad>'] = 0

# creating the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_captions)

# creating a reverse mapping (index -> word)
index_word = {value:key for key, value in tokenizer.word_index.items()}

# padding each vector to the max_length of the captions
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

# calculating the max_length 
# used to store the attention weights
max_length = calc_max_length(train_seqs)

The end result is an array of integer sequences, one per caption.
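To see what these pre-processing steps produce, here is a minimal pure-Python sketch on three toy captions. It does not use the Keras Tokenizer (there is no lowercasing or punctuation filtering), and the captions and the tiny top_k value are assumptions chosen for illustration.

```python
from collections import Counter

captions = [
    "<start> a dog playing football <end>",
    "<start> a dog playing <end>",
    "<start> a cat <end>",
]
top_k = 5  # tiny vocabulary limit for illustration

# Count words and keep the top_k most frequent; index 0 is reserved for padding.
counts = Counter(word for c in captions for word in c.split())
vocab = [w for w, _ in counts.most_common(top_k)]
word_index = {w: i + 1 for i, w in enumerate(vocab)}
word_index["<unk>"] = top_k + 1
word_index["<pad>"] = 0

# Words outside the vocabulary ("football", "cat") map to the <unk> index.
seqs = [[word_index.get(w, word_index["<unk>"]) for w in c.split()] for c in captions]

# Pad every sequence on the right to the length of the longest one.
max_length = max(len(s) for s in seqs)
padded = [s + [0] * (max_length - len(s)) for s in seqs]

for row in padded:
    print(row)
```

Every row has the same length, rare words share one index, and the trailing zeros mark padding that carries no caption content.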

We will split the data into training and validation samples using an 80:20 split ratio:

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector, cap_vector, test_size=0.2, random_state=0)
# Checking the sample counts
print("No of Training Images:", len(img_name_train))
print("No of Training Captions:", len(cap_train))
print("No of Validation Images:", len(img_name_val))
print("No of Validation Captions:", len(cap_val))

The output is as follows:

No of Training Images: 24000
No of Training Captions: 24000
No of Validation Images: 6000
No of Validation Captions: 6000

Setting up the data pipeline

Our images and captions are ready! Next, let’s create a tf.data dataset to use for training our model. Now we will prepare the pipeline for an image and the text model by performing transformations and batching on them:

# Defining parameters
# BATCH_SIZE and BUFFER_SIZE are assumed values; tune them for your hardware
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index)
# shape of the vector extracted from Inception-V3 is (64, 2048)
# these two variables represent that
features_shape = 2048
attention_features_shape = 64

# loading the numpy files
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

# We use from_tensor_slices to load the raw data and transform it into tensors
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Using map() to load the numpy files in parallel
# NOTE: Make sure to set num_parallel_calls to the number of CPU cores you have
# https://www.tensorflow.org/api_docs/python/tf/py_func
dataset = dataset.map(lambda item1, item2: tf.py_func(
    map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=8)

# shuffling and batching
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(1)

Defining the captioning model

The model architecture we are using to build the auto captioning is inspired by the Show, Attend and Tell paper. The features that we extracted from the lower convolutional layer of Inception-v3 gave us a vector of a shape of (8, 8, 2048). Then, we squash that to a shape of (64, 2048).

This vector is then passed through the CNN encoder, which consists of a single fully connected layer. The RNN (GRU in our case) attends over the image to predict the next word:

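def gru(units):
  # Use the CuDNN-accelerated GRU when a GPU is available
  if tf.test.is_gpu_available():
    return tf.keras.layers.CuDNNGRU(units, return_sequences=True, return_state=True,
                                    recurrent_initializer='glorot_uniform')
  return tf.keras.layers.GRU(units, return_sequences=True, return_state=True,
                             recurrent_activation='sigmoid',
                             recurrent_initializer='glorot_uniform')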
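The single fully connected layer of the CNN encoder described above can be sketched in NumPy to show how the shapes change. The random weights stand in for trained parameters, and the ReLU activation is an assumption; the actual encoder is defined in the book.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, locations, features_shape, embedding_dim = 2, 64, 2048, 256

# Stand-in for a batch of cached Inception-v3 features: (batch, 64, 2048).
image_features = rng.standard_normal((batch_size, locations, features_shape))

# Random weights stand in for the trained dense layer; ReLU is an assumed activation.
W = rng.standard_normal((features_shape, embedding_dim)) * 0.01
b = np.zeros(embedding_dim)

# The same projection is applied independently at each of the 64 locations.
encoded = np.maximum(image_features @ W + b, 0.0)
print(encoded.shape)
```

The output keeps the 64 spatial locations and only shrinks the channel dimension, which is what lets the attention mechanism weight individual image regions.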


Now we will define the attention mechanism popularly known as Bahdanau attention. We will need the features from the CNN encoder of a shape of (batch_size, 64, embedding_dim). This attention mechanism will return the context vector and the attention weights over the time axis:

class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, features, hidden):
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # score shape == (batch_size, 64, hidden_size)
    score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

    # attention_weights shape == (batch_size, 64, 1)
    # we get 1 at the last axis because we are applying score to self.V
    attention_weights = tf.nn.softmax(self.V(score), axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
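The same computation can be traced in plain NumPy to verify the shapes at each step. In this sketch, random matrices stand in for the trainable W1, W2, and V layers, biases are omitted, and the dimension values are assumptions taken from the parameters defined earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, locations, embedding_dim, hidden_size, units = 2, 64, 256, 512, 512

features = rng.standard_normal((batch_size, locations, embedding_dim))
hidden = rng.standard_normal((batch_size, hidden_size))

# Random matrices stand in for the trainable Dense layers (biases omitted).
W1 = rng.standard_normal((embedding_dim, units)) * 0.01
W2 = rng.standard_normal((hidden_size, units)) * 0.01
V = rng.standard_normal((units, 1)) * 0.01

# (batch, 64, units): the hidden state is broadcast across all 64 locations.
score = np.tanh(features @ W1 + (hidden @ W2)[:, np.newaxis, :])

# Softmax over the 64 locations (axis=1) gives one weight per image region.
logits = score @ V
logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
attention_weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Weighted sum of the features over the locations: (batch, embedding_dim).
context_vector = (attention_weights * features).sum(axis=1)
print(attention_weights.shape, context_vector.shape)
```

For each image, the 64 weights sum to 1, so the context vector is a weighted average of the region features that the decoder consults when predicting the next word.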

You can refer to the book to understand the CNN encoder, RNN decoder and Loss function used.

Training the captioning model

Let’s train the model. The first step is to load the features stored in the respective .npy files and pass them through the CNN encoder.

The encoder output, hidden state (initialized to 0) and the decoder input (which is the start token) are passed to the decoder. The decoder returns the predictions and the decoder hidden state.

The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss. While training, we use the teacher forcing technique to decide the next input to the decoder.

The final step is to compute the gradients and apply them through the optimizer to update the weights:

loss_plot = []
for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        loss = 0

        # initializing the hidden state for each batch
        # because the captions are not related from image to image
        hidden = decoder.reset_state(batch_size=target.shape[0])

        dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        with tf.GradientTape() as tape:
            features = encoder(img_tensor)

            for i in range(1, target.shape[1]):
                # passing the features through the decoder
                predictions, hidden, _ = decoder(dec_input, features, hidden)

                loss += loss_function(target[:, i], predictions)

                # using teacher forcing
                dec_input = tf.expand_dims(target[:, i], 1)

        total_loss += (loss / int(target.shape[1]))

        variables = encoder.variables + decoder.variables

        gradients = tape.gradient(loss, variables)

        optimizer.apply_gradients(zip(gradients, variables), tf.train.get_or_create_global_step())

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         loss.numpy() / int(target.shape[1])))
    # storing the epoch end loss value to plot later
    loss_plot.append(total_loss / len(cap_vector))

    print('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                        total_loss / len(cap_vector)))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
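Teacher forcing is easiest to see on a single caption. The sketch below uses made-up token ids and no model at all; it only records which token the decoder would receive and which token it would be scored against at each step.

```python
# One padded caption as token ids; the ids are made up and 0 is padding.
target = [1, 7, 9, 4, 2, 0]  # for example: <start> a dog runs <end> <pad>

dec_input = target[0]  # decoding always starts from the <start> token
steps = []
for i in range(1, len(target)):
    # The decoder sees dec_input and is scored against the true token target[i].
    steps.append((dec_input, target[i]))
    # Teacher forcing: feed the ground-truth token, not the model's own prediction.
    dec_input = target[i]

for seen, expected in steps:
    print("input:", seen, "label:", expected)
```

Because the next input is always the ground-truth token, an early mistake by the model does not derail the rest of the sequence during training.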

After training for a few epochs, let’s plot the loss against the epoch:

plt.plot(loss_plot)
plt.title('Loss Plot')

The output is as follows:

The loss vs Epoch plot during training process

Evaluating the captioning model

The evaluation function is similar to the training loop, except we don’t use teacher forcing here. The input to the decoder at each time step is its previous predictions, along with the hidden state and the encoder output.

A few key points to remember while making predictions:

  • Stop predicting when the model predicts the end token
  • Store the attention weights for every time step

Let’s define the evaluate() function:

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))
    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        # store the predicted word so that a caption is actually returned
        result.append(index_word[predicted_id])

        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
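The control flow of this prediction loop can be isolated from the model. In the sketch below, a lookup table stands in for the trained decoder, and the names greedy_decode and toy_model are hypothetical; the end token itself is not added to the result here.

```python
def greedy_decode(next_token, start="<start>", end="<end>", max_length=10):
    """Feed each prediction back in as the next input until <end> or max_length."""
    result = []
    current = start
    for _ in range(max_length):
        current = next_token(current)
        if current == end:
            break
        result.append(current)
    return result

# A lookup table stands in for the trained decoder (hypothetical).
toy_model = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<end>"}
caption = greedy_decode(toy_model.get)
print(" ".join(caption))
```

The max_length cap matters because nothing guarantees that a model will ever emit the end token; without it, a looping prediction would never terminate.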

Also, let’s create a helper function to visualize the attention points that predict the words:

def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))
    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    # Square grid large enough to hold one subplot per predicted word
    grid = int(np.ceil(np.sqrt(len_result)))
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid, grid, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()

# captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)

print('Real Caption:', real_caption)
print('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)
# opening the image
Image.open(img_name_val[rid])

The output is as follows:

Deploying the captioning model

We will deploy the complete module as a RESTful service. To do so, we will write inference code that loads the latest checkpoint and makes a prediction on a given image.

Look into the inference.py file in the repository. All the code is similar to the training loop except we don’t use teacher forcing here. The input to the decoder at each time step is its previous predictions, along with the hidden state and the encoder output.

One important part is loading the model into memory. For this we use tf.train.Checkpoint(), which restores the learned weights for the optimizer, encoder, and decoder. Here is the code for that:

checkpoint_dir = './my_model'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
# keyword names must match those used when the checkpoint was saved
checkpoint = tf.train.Checkpoint(optimizer=optimizer, encoder=encoder, decoder=decoder)
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

So, we will create an evaluate() function, which defines the prediction loop. To make sure that the prediction ends after certain words, we will stop predicting when the model predicts the end token, <end>:

def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))
    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    # Extract features from the test image
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    # Feature is fed into the encoder
    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    # Prediction loop
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        # Store the predicted word so that a caption is actually returned
        result.append(index_word[predicted_id])
        # Hard stop when end token is predicted
        if index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

Now let’s use this evaluate() function in our web application code:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# @author: rahulkumar
from flask import Flask, request, jsonify

import time
from inference import evaluate
import tensorflow as tf

app = Flask(__name__)

# The route path and the 'image' query parameter are assumed names for illustration
@app.route('/caption')
def AutoImageCaption():
    image_url = request.args.get('image')
    image_extension = image_url[-4:]
    image_path = tf.keras.utils.get_file(str(int(time.time())) + image_extension, origin=image_url)
    result, attention_plot = evaluate(image_path)
    data = {'Prediction Caption:': ' '.join(result)}

    return jsonify(data)

if __name__ == "__main__":
    app.run(host = '',port=8081)

Execute the following command in the Terminal to run the web app:

python caption_deploy_api.py 

You should get the following output:

* Running on (Press CTRL+C to quit)

Now we request the API, as follows:


We should get our caption predicted, as shown in the following screenshot:

Make sure to train the model on a large set of images to get better predictions.


In this implementation, we used a pretrained Inception-v3 model, trained on the ImageNet dataset, as the feature extractor in the encoder of a deep learning solution. The solution combines techniques from computer vision and natural language processing to form a complete image description approach that can construct computer-generated natural descriptions of provided images. With this trained model we’ve bridged images and language, providing a technology that could be used as part of an application that helps the visually impaired enjoy the benefits of the megatrend of photo sharing!

To explore more hands-on projects for mastering deep learning and neural network architectures using Python and Keras, check out our book Python Deep Learning Projects.
