A paper titled “Video-to-Video Synthesis” introduces a new model built on the generative adversarial learning framework. The model performs video-to-video synthesis, producing high-resolution, photorealistic, and temporally coherent videos from a diverse set of inputs, including segmentation masks, sketches, and poses.
What problem is the paper trying to solve?
The paper focuses on learning a mapping function that can effectively convert an input video into an output video. Although image-to-image translation methods are quite popular, a general-purpose solution for video-to-video synthesis had not yet been explored. The paper frames video-to-video synthesis as a distribution matching problem: the model is trained so that the conditional distribution of the synthesized videos, given the input videos, resembles that of real videos.
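In standard conditional GAN form, this distribution-matching objective can be sketched as follows (generic notation for illustration, not necessarily the paper's exact formulation), with source video $s$, real output video $x$, generator $F$, and discriminator $D$:

```latex
\min_{F}\max_{D}\;
\mathbb{E}_{(s,x)}\big[\log D(x, s)\big]
+ \mathbb{E}_{s}\big[\log\big(1 - D(F(s), s)\big)\big]
```

At the optimum of this minimax game, the distribution of synthesized videos $F(s)$ matches the conditional distribution of real videos given $s$.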
Given a set of aligned input and output videos, the model learns to map input videos to the output domain at test time. The approach can generate photorealistic 2K-resolution videos up to 30 seconds long.
How does the model work?
The network is trained in a spatio-temporally progressive manner: “We start with generating low-resolution and few frames, and all the way up to generating full resolution and 30 (or more) frames. Our coarse-to-fine generator consists of three scales, which operates on 512 × 256, 1024 × 512, and 2048 × 1024 resolutions, respectively,” reads the paper.
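The progressive schedule can be sketched in a few lines of Python. Only the three scales and the 30-frame target come from the paper; the exact interleaving of spatial and temporal growth below (grow resolution first, then double the clip length) is an assumption for illustration:

```python
def training_schedule(scales, start_frames=4, max_frames=30):
    """Yield (width, height, n_frames) training stages: grow spatially
    through the given scales, then lengthen clips at full resolution."""
    schedule = []
    for w, h in scales:                      # coarse-to-fine resolutions
        schedule.append((w, h, start_frames))
    n = start_frames
    while n < max_frames:                    # then extend temporally
        n = min(n * 2, max_frames)
        schedule.append((scales[-1][0], scales[-1][1], n))
    return schedule

# Scales reported in the paper
stages = training_schedule([(512, 256), (1024, 512), (2048, 1024)])
```

The final stage reaches the full 2048 × 1024 resolution with 30-frame clips.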
The model is trained for 40 epochs with the ADAM optimizer (lr = 0.0002, (β1, β2) = (0.5, 0.999)) on an NVIDIA DGX-1 machine. All eight V100 GPUs (16 GB of memory each) in the DGX-1 are used for training: generator computation is distributed across four GPUs and discriminator computation across the other four. Training the model takes around 10 days at 2K resolution.
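In PyTorch, the reported optimizer settings would look like the following sketch. The two single-layer networks are placeholders for illustration, not the paper's actual coarse-to-fine generator and multi-scale discriminators:

```python
import torch

# Placeholder networks standing in for the paper's generator (G) and
# discriminator (D); only the optimizer settings come from the article.
G = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
D = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Reported hyperparameters: lr = 0.0002, (beta1, beta2) = (0.5, 0.999)
opt_G = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=0.0002, betas=(0.5, 0.999))
```

Splitting generator and discriminator onto separate optimizers mirrors the alternating updates of adversarial training (the paper additionally splits their computation across separate groups of GPUs).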
Several datasets are used to train the model, such as Cityscapes, ApolloScape, a face video dataset, the FaceForensics dataset, and a dance video dataset.
Apart from this, the researchers compared their approach against two baselines trained on the same data: pix2pixHD (the state-of-the-art image-to-image translation approach) and COVST.
Both subjective and objective metrics are used to evaluate the model’s performance. The first is a human preference score, obtained from a human subjective test that assesses the visual quality of the synthesized videos. The second is the Fréchet Inception Distance (FID), a widely used metric for implicit generative models.
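At its core, FID is the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated samples. A minimal NumPy/SciPy sketch of that distance (the Inception feature-extraction step is omitted here):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2*sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions give a distance of zero; a lower FID means the generated distribution is closer to the real one.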
Limitations of the model
The model fails when synthesizing turning cars, because the label maps carry insufficient information; this could be addressed by adding 3D cues such as depth maps. The model also does not guarantee that an object keeps a consistent appearance across the whole video, so a car may, for instance, gradually change its color.
Lastly, semantic manipulations such as turning trees into buildings can produce visible artifacts, since buildings and trees have differently shaped labels. This could be mitigated by training the model with coarser semantic labels, which would make it less sensitive to label shapes.
“Extensive experiments demonstrate that our results are significantly better than the results by state-of-the-art methods. Its extension to the future video prediction task also compares favorably against the competing approaches,” reads the paper.
The paper has drawn public criticism from some who are concerned that the technology could be used to create deepfakes or tampered videos that deceive people for illegal or exploitative purposes, while others view it as a great step into an AI-driven future.
For more information, be sure to check out the official research paper.