
Several recent works have shown how highly realistic human head images can be obtained by training convolutional neural networks to generate them.

To create such a personalized talking head model, however, these works require training on a large dataset of images of a single person. Researchers from Samsung AI Center have therefore presented a system with few-shot capability in their paper, Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. The system performs lengthy meta-learning on a large dataset of videos and is then able to frame few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems with high-capacity generators and discriminators.

Crucially, the system initializes the parameters of both the generator and the discriminator in a person-specific way, so that training can be based on just a few images and completed quickly. The researchers show in the paper that such an approach is capable of learning highly realistic and personalized talking head models of new people, and even of portrait paintings.
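As a rough illustration of what "person-specific initialization" means here, the sketch below (in PyTorch, with illustrative dimensions rather than the paper's exact ones) derives the generator's adaptive instance normalization (AdaIN) parameters by projecting a person embedding through a matrix learned during meta-learning:

    import torch

    EMB_DIM, ADAIN_DIM = 512, 2048   # illustrative sizes, not the paper's exact ones

    # e: person embedding, averaged over the few available frames by the embedder
    e = torch.randn(EMB_DIM)

    # P: projection matrix learned during meta-learning; it maps an embedding to
    # the full vector of AdaIN scale/bias parameters inside the generator
    P = torch.randn(ADAIN_DIM, EMB_DIM) * 0.01

    # Person-specific AdaIN parameters: this is the generator's initialization
    psi = P @ e

    # The discriminator likewise gets a person-specific vector initialized from e,
    # so both networks start fine-tuning already adapted to the new person
    w_new = e.clone()

    print(psi.shape, w_new.shape)   # torch.Size([2048]) torch.Size([512])

Because both networks start from this adapted state rather than from scratch, the subsequent fine-tuning only has to refine details, which is what makes training on a few images feasible.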

The researchers consider the task of creating personalized photorealistic talking head models, i.e., systems that can synthesize video sequences of speech expressions and mimics of a particular individual.

More specifically, they consider the problem of synthesizing photorealistic personalized head images given a set of face landmarks, which drive the animation of the model. Such a system has practical applications for telepresence, including videoconferencing, multi-player games, and the special effects industry.
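Since the landmarks are the model's only driving signal, they are typically rasterized into an image the networks can consume. Below is a minimal sketch of this step, assuming the standard 68-point face annotation; the group boundaries and colors here are illustrative, not the paper's exact scheme:

    import numpy as np
    import cv2

    # Index groups of the standard 68-point face annotation (iBUG convention)
    GROUPS = {
        "jaw": range(0, 17), "right_brow": range(17, 22), "left_brow": range(22, 27),
        "nose": range(27, 36), "right_eye": range(36, 42),
        "left_eye": range(42, 48), "mouth": range(48, 68),
    }
    # Illustrative colors; distinct colors let the networks tell the groups apart
    COLORS = {
        "jaw": (255, 255, 255), "right_brow": (255, 0, 0), "left_brow": (0, 255, 0),
        "nose": (0, 0, 255), "right_eye": (255, 255, 0),
        "left_eye": (0, 255, 255), "mouth": (255, 0, 255),
    }

    def rasterize_landmarks(pts, size=256):
        """Draw 68 (x, y) landmark points as colored polylines on a black canvas."""
        img = np.zeros((size, size, 3), np.uint8)
        for name, idx in GROUPS.items():
            poly = pts[list(idx)].astype(np.int32).reshape(-1, 1, 2)
            closed = name in ("right_eye", "left_eye", "mouth")
            cv2.polylines(img, [poly], closed, COLORS[name], thickness=2)
        return img

    # Example: random points stand in for a real landmark detector's output
    landmark_image = rasterize_landmarks(np.random.rand(68, 2) * 256)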

Why is synthesizing realistic talking head sequences difficult?

Synthesizing realistic talking head sequences is difficult for two major reasons. First, human heads have high photometric, geometric and kinematic complexity, which makes faces hard to model. Second, the human visual system is acutely sensitive to the appearance of human heads, so even minor mistakes in the modelled appearance are immediately noticeable.

What have the researchers done to overcome the problem?

The researchers present a system for creating talking head models from a handful of photographs, an approach known as few-shot learning. The system can even generate a result from a single photograph (one-shot learning), though adding a few more photographs increases the fidelity of personalization.
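The photographs enter the system through an embedder network that maps each frame (together with its landmark image) to a vector, and the per-frame vectors are averaged into a single person embedding. The sketch below uses a toy network to keep things runnable; the paper's embedder is a much deeper convnet:

    import torch
    import torch.nn as nn

    class Embedder(nn.Module):
        """Toy stand-in for the embedder network (the real one is a deep convnet)."""
        def __init__(self, emb_dim=512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),  # frame + landmark image = 6 channels
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim),
            )

        def forward(self, frames, landmark_imgs):
            return self.net(torch.cat([frames, landmark_imgs], dim=1))

    def person_embedding(embedder, frames, landmark_imgs):
        """Average per-frame embeddings over the K available photographs."""
        return embedder(frames, landmark_imgs).mean(dim=0)   # (K, emb_dim) -> (emb_dim,)

    # One-shot (K=1) and few-shot (K=8) both reduce to the same averaging
    emb = person_embedding(Embedder(), torch.randn(8, 3, 256, 256), torch.randn(8, 3, 256, 256))

Averaging over more frames yields a more stable estimate of the person's appearance, which is one intuition for why a few extra photographs improve personalization.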

The talking heads created by the researchers' system can handle a large variety of poses that go beyond the abilities of warping-based systems. The few-shot learning ability is obtained through extensive pre-training (meta-learning) on a large corpus of talking head videos corresponding to different speakers with diverse appearances.

In the course of meta-learning, the system simulates few-shot learning tasks and learns to transform landmark positions into realistic-looking personalized photographs.
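"Simulating few-shot learning tasks" amounts to repeatedly sampling a small training episode from a single video: a few source frames to embed, plus one held-out target frame that the generator must re-render. A minimal sketch, where the function name and tensor layout are assumptions for illustration:

    import random
    import torch

    def sample_episode(video, k):
        """Sample one simulated few-shot task from a single training video.

        video: frames of one speaker, shape (T, 3, H, W). Returns K source
        frames to embed plus one held-out target frame that the generator
        must re-render from its landmarks alone."""
        idx = random.sample(range(video.shape[0]), k + 1)
        return video[idx[:k]], video[idx[k]]

    sources, target = sample_episode(torch.randn(100, 3, 256, 256), k=8)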

A handful of photographs of a new person then sets up a new adversarial learning problem, with a high-capacity generator and discriminator pre-trained via meta-learning. After just a few training steps, the new problem converges to a state that generates realistic and personalized images.
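This fine-tuning stage is ordinary adversarial training, just started from meta-learned weights. The sketch below uses toy one-layer networks and an L1 content loss in place of the paper's large pre-trained networks and perceptual losses, so it only illustrates the structure of the loop:

    import torch
    import torch.nn as nn

    # Toy stand-ins: in the real system G and D are large convnets whose weights
    # (and person-specific parameters) come from meta-learning
    G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())   # landmark image -> face image
    D = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1))              # (image, landmarks) -> realism score

    photos = torch.randn(4, 3, 64, 64)     # the handful of photographs of the new person
    lmk_imgs = torch.randn(4, 3, 64, 64)   # their rasterized landmark images

    opt_g = torch.optim.Adam(G.parameters(), lr=5e-5)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

    for step in range(40):                 # only a few steps are needed after meta-learning
        fake = G(lmk_imgs)

        # Discriminator update (hinge loss): real photos vs. generated images
        d_real = D(torch.cat([photos, lmk_imgs], dim=1)).mean()
        d_fake = D(torch.cat([fake.detach(), lmk_imgs], dim=1)).mean()
        d_loss = torch.relu(1.0 - d_real) + torch.relu(1.0 + d_fake)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator update: fool D while staying close to the real photographs
        g_adv = -D(torch.cat([fake, lmk_imgs], dim=1)).mean()
        g_content = (fake - photos).abs().mean()   # L1 here; the paper uses perceptual losses
        g_loss = g_adv + 10.0 * g_content
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()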

In their experiments, the researchers compare the talking heads created by their system with alternative neural talking head models through quantitative measurements and a user study. They also demonstrate several use cases of their talking head models, including video synthesis using landmark tracks extracted from video sequences, and puppeteering (video synthesis of a certain person based on the face landmark tracks of a different person).
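Puppeteering then reduces to running the fine-tuned, person-specific generator over landmark images extracted from someone else's video. A minimal sketch, assuming landmark extraction and rasterization are handled by an off-the-shelf detector (not shown):

    import torch
    import torch.nn as nn

    def puppeteer(generator, driver_lmk_imgs):
        """Drive person A's fine-tuned generator with person B's landmark tracks.

        driver_lmk_imgs: rasterized landmark images, shape (T, 3, H, W),
        extracted from a video of a different person by an external detector."""
        with torch.no_grad():
            return torch.stack([generator(l.unsqueeze(0))[0] for l in driver_lmk_imgs])

    # Example with a dummy generator standing in for the fine-tuned model
    dummy_G = nn.Conv2d(3, 3, 3, padding=1)
    frames = puppeteer(dummy_G, torch.randn(30, 3, 64, 64))   # 30 synthesized frames

Note that the driver's landmark geometry carries their identity, which is the source of the personality mismatch discussed below.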

The researchers have used two datasets of talking head videos for quantitative and qualitative evaluation: VoxCeleb1 [26] (256p videos at 1 fps) and VoxCeleb2 [8] (224p videos at 25 fps), the second containing approximately 10 times more videos than the first. VoxCeleb1 is used for comparisons with baselines and for ablation studies, while the researchers demonstrate the potential of their approach on the larger VoxCeleb2.

To conclude, the researchers have presented a framework for meta-learning of adversarial generative models that can train highly realistic virtual talking heads in the form of deep generator networks. A handful of photographs (as few as one) is needed to create a new model, and a model trained on 32 images achieves a perfect realism and personalization score in their user study (for 224p static images).

The key limitations of the method are the mimics representation and the lack of landmark adaptation. Using landmarks from a different person can lead to a noticeable personality mismatch, so anyone wanting to create "fake" puppeteering videos without such a mismatch would need some landmark adaptation.

The paper further reads, “We note, however, that many applications do not require puppeteering a different person and instead only need the ability to drive one’s own talking head. For such scenario, our approach already provides a high-realism solution.”

To learn more, check out the paper, Few-Shot Adversarial Learning of Realistic Neural Talking Head Models.
