Nvidia and the MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) have open-sourced their video-to-video synthesis model. It uses a generative adversarial learning framework to generate high-resolution, photorealistic, and temporally coherent results from various input formats, including segmentation masks, sketches, and poses.
There has been less research into video-to-video synthesis than into image-to-image translation. Video-to-video synthesis aims to address the low visual quality and temporal incoherence of video results produced by existing image synthesis approaches. The research group proposed a novel video-to-video synthesis approach capable of synthesizing 2K-resolution videos of street scenes up to 30 seconds long.
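To illustrate how such an approach achieves temporal coherence, here is a minimal sketch of autoregressive frame generation: each output frame is conditioned on both the current input (e.g. a segmentation mask) and the previously generated frame. The `toy_generator` function is a hypothetical stand-in for a trained generator network, not the actual model; it simply blends its inputs to make the sequential dependency visible.

```python
import numpy as np

def toy_generator(mask, prev_frame, weight=0.8):
    # Hypothetical stand-in for a learned generator G.
    # Blends the current semantic mask with the previously generated
    # frame, illustrating how conditioning on prev_frame smooths
    # changes between consecutive outputs.
    return weight * prev_frame + (1.0 - weight) * mask

def synthesize_video(masks):
    # Autoregressive generation: each frame depends on the current
    # input mask AND the previously generated frame, which is what
    # gives the output sequence temporal coherence.
    frames = []
    prev = np.zeros_like(masks[0])
    for mask in masks:
        frame = toy_generator(mask, prev)
        frames.append(frame)
        prev = frame
    return frames

# Toy input: three 2x2 "masks" whose values jump from 0 to 2.
masks = [np.full((2, 2), float(t)) for t in range(3)]
video = synthesize_video(masks)
```

Because each frame carries 80% of the previous frame forward, abrupt changes in the input masks are smoothed out in the output, which is the intuition behind conditioning the generator on past frames.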
The authors performed an extensive experimental validation on various datasets, and the model outperformed existing approaches both quantitatively and qualitatively. When extended to multimodal video synthesis, the method produced videos with new visual appearances from identical input data while maintaining high resolution and coherence.
The researchers suggested the model could be improved in the future by adding 3D cues such as depth maps to better synthesize turning cars, by using object tracking to ensure an object maintains its colour and appearance throughout the video, and by training with coarser semantic labels to address issues in semantic manipulation.