Data

Baidu announces ClariNet, a neural network for text-to-speech synthesis

2 min read

Text-to-speech synthesis has been a booming research area, with Google, Facebook, Deepmind, and other tech giants showcasing their interesting research and trying to build better TTS models. Now Baidu has stolen the show with ClariNet, the first fully end-to-end TTS model, that directly converts text to a speech waveform in a single neural network.

Classical TTS models such as Deepmind’s Wavenet usually have a separately text-to-spectrogram and waveform synthesis models. Having two models may result in suboptimal performance. ClariNet combines the two models into one fully convolutional single neural network. Not only that, their text-to-wave model significantly outperforms the previous separate TTS models, they claim.

Baidu’s ClariNet consists of four components:

  • Encoder, which encodes textual features into an internal hidden representation.
  • Decoder, which decodes the encoder representation into the log-mel spectrogram in an autoregressive manner.
  • Bridge-net: An intermediate processing block, which processes the hidden representation from the decoder and predicts log-linear spectrogram. It also upsamples the hidden representation from frame-level to sample-level.
  • Vocoder: A Gaussian autoregressive WaveNet to synthesize the waveform. It is conditioned on the upsampled hidden representation from the bridge-net.

ClariNet’s Architecture

Baidu has also proposed a new parallel wave generation method based on the Gaussian inverse autoregressive flow (IAF).  This mechanism generates all samples of an audio waveform in parallel, speeding up waveform synthesis dramatically as compared to traditional autoregressive methods.

To teach a parallel waveform synthesizer, they use a Gaussian autoregressive WaveNet as the teacher-net and the Gaussian IAF as the student-net. Their Gaussian autoregressive WaveNet is trained with maximum likelihood estimation (MLE).

The Gaussian IAF is distilled from the autoregressive WaveNet by minimizing KL divergence between their peaked output distributions, stabilizing the training process.

For more details on ClariNet, you can check out Baidu’s paper and audio samples.

Read Next

How Deep Neural Networks can improve Speech Recognition and generation
AI learns to talk naturally with Google’s Tacotron 2

Sugandha Lahoti

Content Marketing Editor at Packt Hub. I blog about new and upcoming tech trends ranging from Data science, Web development, Programming, Cloud & Networking, IoT, Security and Game development.

Share
Published by
Sugandha Lahoti
Tags: AI News

Recent Posts

Top life hacks for prepping for your IT certification exam

I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…

3 years ago

Learn Transformers for Natural Language Processing with Denis Rothman

Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…

3 years ago

Learning Essential Linux Commands for Navigating the Shell Effectively

Once we learn how to deploy an Ubuntu server, how to manage users, and how…

3 years ago

Clean Coding in Python with Mariano Anaya

Key-takeaways:   Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…

3 years ago

Exploring Forms in Angular – types, benefits and differences   

While developing a web application, or setting dynamic pages and meta tags we need to deal with…

3 years ago

Gain Practical Expertise with the Latest Edition of Software Architecture with C# 9 and .NET 5

Software architecture is one of the most discussed topics in the software industry today, and…

3 years ago