In today’s tutorial, we will extend image based application to videos, which will include pose estimation, captioning, and generating videos.
Extending image-based application to videos
Images can be used for pose estimation, style transfer, image generation, segmentation, captioning, and so on. Similarly, these applications find a place in videos too. Using the temporal information may improve the predictions from images and vice versa. In this section, we will see how to extend these applications to videos.
Regressing the human pose
Human pose estimation is an important application of video data and can improve other tasks such as action recognition. First, let’s see a description of the datasets available for pose estimation:
- Poses in the wild dataset: Contains 30 videos annotated with the human pose. The dataset is annotated with human upper body joints.
- Frames Labeled In Cinema (FLIC): A human pose dataset obtained from 30
Pfister et al. proposed a method to predict the human pose in videos. The following is the pipeline for regressing the human pose:
The frames from the video are taken and passed through a convolutional network. The layers are fused, and the pose heatmaps are obtained. The pose heatmaps are combined
with optical flow to get the warped heatmaps. The warped heatmaps across a timeframe are pooled to produce the pooled heatmap, getting the final pose.
Tracking facial landmarks
Face analysis in videos requires face detection, landmark detection, pose estimation, verification, and so on. Computing landmarks are especially crucial for capturing facial animation, human-computer interaction, and human activity recognition. Instead of computing over frames, it can be computed over video. Gu et al. proposed a method to use a joint estimation of detection and tracking of facial landmarks in videos using RNN. The results outperform frame wise predictions and other previous models. The landmarks are computed by CNN, and the temporal aspect is encoded in an RNN. Synthetic data was used for training.
Videos can be segmented in a better way when temporal information is used. Gadde et al. proposed a method to combine temporal information by warping. The following image demonstrates the solution, which segments two frames and combines the warping:
The warping net is shown in the following image:
Reproduced from Gadde et al
The optical flow is computed between two frames, which are combined with warping. The warping module takes the optical flow, transforms it, and combines it with the warped representations.
Captions can be generated for videos, describing the context. Let’s see a list of the datasets available for captioning videos:
- Microsoft Research – Video To Text (MSR-VTT) has 200,000 video clip and sentence pairs.
- MPII Movie Description Corpus (MPII-MD) has 68,000 sentences with 94 movies.
- Montreal Video Annotation Dataset (M-VAD) has 49,000 clips.
- YouTube2Text has 1,970 videos with 80,000 descriptions.
Yao et al. proposed a method for captioning videos. A 3D convolutional network trained for action recognition is used to extract the local temporal features. An attention mechanism is then used on the features to generate text using an RNN. The process is shown here:
Reproduced from Yao et al
Donahue et al. proposed another method for video captioning or description, which uses LSTM with convolution features.
This is similar to the preceding approach, except that we use 2D convolution features over here, as shown in the following image:
Reproduced from Donahue et al
We have several ways to combine text with images, such as activity recognition, image description, and video description techniques. The following image illustrates these techniques:
Reproduced from Donahue et al
Venugopalan et al. proposed a method for video captioning using an encoder-decoder approach. The following is a visualization of the technique proposed by him:
Reproduced from Venugopalan et al
|The CNN can be computed on the frames or the optical flow of the images for this method.|
Videos can be generated using generative models, in an unsupervised manner. The future frames can be predicted using the current frame. Ranzato et al. proposed a method for generating videos, inspired by language models. An RNN model is utilized to take a patch of the image and predict the next patch.
To summarize, we learned about video-based solutions in various scenarios such as action recognition, gesture recognition, security applications, and intrusion detection.
You read an excerpt from a book written by Rajalingappaa Shanmugamani titled, Deep Learning for Computer Vision. This book will help you learn to model and train advanced neural networks for implementation of Computer Vision tasks.