
A new Video-to-Video Synthesis model uses Artificial Intelligence to create photorealistic videos


A paper titled “Video-to-Video Synthesis” introduces a new model built on the generative adversarial learning framework. The model performs video-to-video synthesis, producing high-resolution, photorealistic, and temporally coherent videos from a diverse set of inputs, including segmentation masks, sketches, and poses.


What problem is the paper trying to solve?

The paper focuses on learning a mapping function that can effectively convert an input video to an output video. Although image-to-image translation methods are quite popular, a general-purpose solution for video-to-video synthesis had not yet been explored. The paper frames video-to-video synthesis as a distribution matching problem: the model is trained so that the conditional distribution of the synthesized videos, given the input videos, resembles that of real videos.
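In GAN terms, this matching is typically posed as a minimax game. A standard conditional-GAN form (paraphrased, not the paper's exact notation, with s the input video, x the corresponding real output video, G the generator, and D the discriminator) is:

\min_G \max_D \; \mathbb{E}_{(x,s)}[\log D(x,s)] + \mathbb{E}_{s}[\log(1 - D(G(s),s))]

Training G to fool D on (synthesized, input) pairs pushes the conditional distribution of G's outputs toward that of real videos.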

Given a set of aligned input and output videos, the model maps input videos to the output domain at test time. The approach can generate photorealistic 2K-resolution videos up to 30 seconds long.

How does the model work?

The network is trained in a spatio-temporally progressive manner. “We start with generating low-resolution and few frames, and all the way up to generating full resolution and 30 (or more) frames. Our coarse-to-fine generator consists of three scales, which operates on 512 × 256, 1024 × 512, and 2048 × 1024 resolutions, respectively” reads the paper.
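As a rough illustration of that schedule (not the authors' code: the frame schedule, the make_clips loader, and the train_step stub are hypothetical stand-ins), the training loop grows first in resolution and then in clip length:

SCALES = [(512, 256), (1024, 512), (2048, 1024)]  # (width, height) per stage
FRAME_SCHEDULE = [4, 8, 16, 32]                   # frames per clip, growing

def make_clips(size, length):
    # Placeholder loader; a real pipeline would yield video clips resized
    # to `size` with `length` consecutive frames.
    yield {"size": size, "length": length}

def train_step(clip):
    # Placeholder for one generator/discriminator update on a clip.
    pass

for size in SCALES:                  # progress spatially: coarse to fine
    for length in FRAME_SCHEDULE:    # progress temporally: short to long
        for clip in make_clips(size, length):
            train_step(clip)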

The model is trained for 40 epochs using the Adam optimizer with lr = 0.0002 and (β1, β2) = (0.5, 0.999) on an NVIDIA DGX-1 machine. All eight of the DGX-1's V100 GPUs (16GB of memory each) are used for training: generator computation is distributed across four GPUs, and discriminator computation across the other four. Training the model takes around 10 days at 2K resolution.
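In PyTorch, the reported optimizer settings look like the following sketch; the one-layer generator and discriminator are stand-ins, and only the learning rate and betas come from the paper:

import torch
import torch.nn as nn

generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)      # stand-in for G
discriminator = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for D

opt_G = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(discriminator.parameters(), lr=0.0002, betas=(0.5, 0.999))

# The paper splits work model-parallel across the DGX-1: G's computation
# on four GPUs and D's on the other four.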

Several datasets are used for training the model, including Cityscapes, ApolloScape, a face video dataset, the FaceForensics dataset, and a dance video dataset.

Apart from this, the researchers compared the approach to two baselines trained on the same data, namely pix2pixHD (the state-of-the-art image-to-image translation approach) and COVST.

Both subjective and objective metrics are used to evaluate the model's performance. The first is a human preference score, a subjective test in which human raters judge the visual quality of the synthesized videos. The second is the Fréchet Inception Distance (FID), a widely used metric for implicit generative models.
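FID compares the statistics of real and synthesized samples in the feature space of a pretrained network; lower is better. A minimal sketch of the distance itself (assuming the feature means mu and covariances sigma of the two sets have already been estimated; the paper applies the idea to video features):

import numpy as np
from scipy import linalg

def fid(mu1, sigma1, mu2, sigma2):
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1.dot(sigma2))  # matrix square root
    if np.iscomplexobj(covmean):                # discard tiny imaginary parts
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean)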

Limitations of the model

The model fails when synthesizing turning cars, because the label maps carry insufficient information; this could be addressed by adding 3D cues such as depth maps. The model also doesn't guarantee that an object keeps a consistent appearance across the whole video, so there can be instances where a car gradually changes its color.

Lastly, performing semantic manipulations, such as turning trees into buildings, can produce visible artifacts, since buildings and trees have different label shapes. This could be resolved by training the model on coarser semantic labels, which would make it less sensitive to label shapes.

“Extensive experiments demonstrate that our results are significantly better than the results by state-of-the-art methods. Its extension to the future video prediction task also compares favorably against the competing approaches” reads the paper.

The paper has received public criticism from some over the concern that it could be used to create deepfakes or tampered videos that deceive people for illegal and exploitative purposes, while others view it as a great step toward an AI-driven future.

For more information, be sure to check out the official research paper.
