Datasets and deep learning methodologies to extend image-based applications to videos

In today’s tutorial, we will extend image-based applications to videos, covering pose estimation, landmark tracking, segmentation, captioning, and video generation.

Extending image-based applications to videos

Images can be used for pose estimation, style transfer, image generation, segmentation, captioning, and so on. Similarly, these applications find a place in videos too. Using the temporal information may improve the predictions from images and vice versa. In this section, we will see how to extend these applications to videos.

Regressing the human pose

Human pose estimation is an important application of video data and can improve other tasks such as action recognition. First, let’s see a description of the datasets available for pose estimation:

Pfister et al. proposed a method to predict the human pose in videos. The following is the pipeline for regressing the human pose:

pipeline for regressing the human pose

The frames from the video are passed through a convolutional network. The layers are fused to produce pose heatmaps, which are combined with optical flow to get warped heatmaps. The warped heatmaps across a time window are then pooled into a single heatmap, from which the final pose is obtained.
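The warp-and-pool step can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation of Pfister et al.: the heatmaps are hand-made, the flow is an integer per-pixel displacement `(dx, dy)` towards the reference frame, and pooling is a simple average. All function names here are hypothetical.

```python
def make_heatmap(h, w, peaks):
    """Build an h x w heatmap from {(x, y): confidence} (toy data)."""
    hm = [[0.0] * w for _ in range(h)]
    for (x, y), value in peaks.items():
        hm[y][x] = value
    return hm


def uniform_flow(h, w, dx, dy):
    """Assumed flow format: one integer (dx, dy) displacement per pixel."""
    return [[(dx, dy)] * w for _ in range(h)]


def warp_heatmap(heatmap, flow):
    """Backward warp: output[y][x] is read from heatmap[y - dy][x - dx]."""
    h, w = len(heatmap), len(heatmap[0])
    warped = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = heatmap[sy][sx]
    return warped


def pool_heatmaps(heatmaps):
    """Average the aligned heatmaps element-wise."""
    n = len(heatmaps)
    h, w = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(hm[y][x] for hm in heatmaps) / n for x in range(w)]
            for y in range(h)]


def argmax_2d(heatmap):
    """Return the (x, y) location of the strongest response."""
    _, y, x = max((v, y, x) for y, row in enumerate(heatmap)
                  for x, v in enumerate(row))
    return x, y


# One joint moving right by one pixel per frame; the middle frame is the
# reference and contains a spurious response at (4, 0).
frames = [
    make_heatmap(5, 5, {(1, 2): 1.0}),
    make_heatmap(5, 5, {(2, 2): 0.4, (4, 0): 0.5}),
    make_heatmap(5, 5, {(3, 2): 1.0}),
]
flows = [uniform_flow(5, 5, 1, 0), uniform_flow(5, 5, 0, 0),
         uniform_flow(5, 5, -1, 0)]

warped = [warp_heatmap(f, fl) for f, fl in zip(frames, flows)]
pooled = pool_heatmaps(warped)
joint = argmax_2d(pooled)
```

In this toy case, the reference frame on its own would pick the spurious response, while the pooled heatmap recovers the joint at the location supported by the neighbouring frames.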

Tracking facial landmarks

Face analysis in videos requires face detection, landmark detection, pose estimation, verification, and so on. Computing landmarks is especially crucial for capturing facial animation, human-computer interaction, and human activity recognition. Instead of being computed independently on each frame, the landmarks can be computed over the whole video. Gu et al. proposed a method for the joint estimation of detection and tracking of facial landmarks in videos using an RNN. The results outperform frame-wise predictions and other previous models. The landmarks are computed by a CNN, and the temporal aspect is encoded in an RNN. Synthetic data was used for training.
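To see why carrying state across frames helps, consider the following sketch. It is not the model of Gu et al.: the per-frame landmark estimates are made-up numbers standing in for CNN outputs, and a hand-set blending gate stands in for the learned RNN update. The function name `track_landmarks` and the `gate` parameter are hypothetical.

```python
def track_landmarks(per_frame_points, gate=0.5):
    """Blend each frame's estimate with the state carried from earlier frames.

    per_frame_points: one list of (x, y) landmarks per frame.
    gate: how much weight the current frame gets (assumed fixed, not learned).
    """
    state = None
    tracked = []
    for points in per_frame_points:
        if state is None:
            state = list(points)  # the first frame initialises the state
        else:
            state = [((1 - gate) * sx + gate * px,
                      (1 - gate) * sy + gate * py)
                     for (sx, sy), (px, py) in zip(state, points)]
        tracked.append(state)
    return tracked


# A single landmark whose true position is (10, 20); the frame-wise
# estimates jitter along x.
detections = [[(10.0, 20.0)], [(12.0, 20.0)], [(8.0, 20.0)],
              [(11.0, 20.0)], [(9.0, 20.0)]]
tracked = track_landmarks(detections)

raw_error = max(abs(frame[0][0] - 10.0) for frame in detections)
tracked_error = max(abs(frame[0][0] - 10.0) for frame in tracked)
```

The recurrent state damps the frame-to-frame jitter, so `tracked_error` ends up smaller than `raw_error` for this sequence. A trained RNN learns how much to trust each frame instead of using a fixed gate.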

Segmenting videos

Videos can be segmented in a better way when temporal information is used. Gadde et al. proposed a method to combine temporal information by warping. The following image demonstrates the solution, which segments two frames and combines the warping:

Image segmentation CNN

The warping net is shown in the following image:

Reproduced from Gadde et al

The optical flow is computed between the two frames and passed to the warping module. The module transforms the flow, uses it to warp the representations of the previous frame, and combines the warped representations with those of the current frame.
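A toy version of this warp-and-combine idea is sketched below. It makes several assumptions that are not taken from Gadde et al.: the representations are per-pixel class scores, the flow is an integer displacement per pixel, the flow transformation step is skipped, and the combination weights are fixed at 0.5 instead of being learned. All names are hypothetical.

```python
def warp_features(features, flow):
    """Move previous-frame features to where the flow says they are now."""
    h, w = len(features), len(features[0])
    channels = len(features[0][0])
    warped = [[[0.0] * channels for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x - dx, y - dy
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = list(features[sy][sx])
    return warped


def combine(current, warped_prev, w_cur=0.5, w_prev=0.5):
    """Weighted sum of current and warped previous features (fixed weights)."""
    return [[[w_cur * c + w_prev * p for c, p in zip(cur_px, prev_px)]
             for cur_px, prev_px in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, warped_prev)]


def labels(scores):
    """Per-pixel argmax over the class scores."""
    return [[max(range(len(px)), key=lambda k: px[k]) for px in row]
            for row in scores]


# A 1 x 4 image with two classes: 0 = background, 1 = object.
bg, obj = [0.9, 0.1], [0.1, 0.9]
previous = [[bg, obj, bg, bg]]            # object at x = 1
current = [[bg, bg, [0.55, 0.45], bg]]    # object moved to x = 2, weak score
flow = [[(1, 0)] * 4]                     # everything moved one pixel right

fused = combine(current, warp_features(previous, flow))
frame_only = labels(current)
with_warping = labels(fused)
```

Here the current frame alone labels every pixel as background, while the fused scores recover the object at `x = 2` because the confident prediction from the previous frame is carried along the flow.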

Captioning videos

Captions can be generated for videos, describing the context. Let’s see a list of the datasets available for captioning videos:

Yao et al. proposed a method for captioning videos. A 3D convolutional network trained for action recognition is used to extract the local temporal features. An attention mechanism is then used on the features to generate text using an RNN. The process is shown here:
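The attention step can be illustrated with a small sketch. It assumes dot-product scoring between a decoder state and each temporal feature, which is a simplification chosen for brevity; Yao et al. learn the scoring function. The feature vectors, the `query`, and the function names are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Weight each temporal feature by its relevance to the decoder state.

    features: one feature vector per video segment.
    query: the current decoder (RNN) state, same dimension as the features.
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, context


# Three segments with 2-dimensional features (toy values).
segment_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(segment_features, decoder_state)
```

At each decoding step, the RNN would consume `context` together with the previously generated word, so that different words can focus on different parts of the video.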

Reproduced from Yao et al

Donahue et al. proposed another method for video captioning or description, which uses an LSTM with convolutional features.

This is similar to the preceding approach, except that 2D convolutional features are used here, as shown in the following image:

Reproduced from Donahue et al

We have several ways to combine text with images, such as activity recognition, image description, and video description techniques. The following image illustrates these techniques:

Reproduced from Donahue et al

Venugopalan et al. proposed a method for video captioning using an encoder-decoder approach. The following is a visualization of the proposed technique:

Reproduced from Venugopalan et al

For this method, the CNN features can be computed on either the frames or the optical flow between them.


Generating videos

Videos can be generated using generative models, in an unsupervised manner. The future frames can be predicted using the current frame. Ranzato et al. proposed a method for generating videos, inspired by language models. An RNN model is utilized to take a patch of the image and predict the next patch.
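The language-model analogy can be made concrete with a small sketch. It is not the model of Ranzato et al.: it assumes each patch has already been mapped to an integer id (for example, by clustering), and it swaps the RNN for a simple bigram count model so that the example stays dependency-free. All names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_bigram(sequences):
    """Count how often each patch id is followed by each other patch id."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequent successor of a patch id, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


def generate(counts, start, steps):
    """Roll the model forward, feeding each prediction back as input."""
    seq = [start]
    for _ in range(steps):
        nxt = predict_next(counts, seq[-1])
        if nxt is None:
            break
        seq.append(nxt)
    return seq


# Each training sequence is the same patch location over consecutive
# frames, written as patch ids (toy data).
training = [[0, 1, 2, 3], [0, 1, 2, 3], [1, 2, 0]]
model = fit_bigram(training)

next_patch = predict_next(model, 2)
rollout = generate(model, 0, 5)
```

An RNN plays the same role as the count table here, but it conditions on the whole history of patches instead of only the most recent one.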

To summarize, we learned about video-based solutions in various scenarios such as action recognition, gesture recognition, security applications, and intrusion detection.

You read an excerpt from a book written by Rajalingappaa Shanmugamani titled, Deep Learning for Computer Vision. This book will help you learn to model and train advanced neural networks for implementation of Computer Vision tasks.


