Mozilla introduces LPCNet: A DSP and deep learning-powered speech synthesizer for lower-power devices

Yesterday, Mozilla’s Emerging Technologies group introduced a new project called LPCNet, which is a WaveRNN variant. LPCNet aims to improve the efficiency of speech synthesis by combining deep learning and digital signal processing (DSP) techniques. It can be used for Text-to-Speech (TTS), speech compression, time stretching, noise suppression, codec post-filtering, and packet loss concealment.

Why is LPCNet introduced?

Many recent neural speech synthesis algorithms have made it possible to synthesize high-quality speech and code high-quality speech at very low bitrate. These algorithms, which are often based on algorithms like WaveNet, give promising results in real-time with a high-end GPU. But LPCNet aims to perform speech synthesis on end-user devices like mobile phones, which generally do not have powerful GPUs and have a very limited battery capacity.

We do have some low complexity parametric synthesis models such as low bitrate vocoders, but their quality is a concern. Generally, they are efficient at modeling the spectral envelope of the speech using linear prediction, but no such simple model exists for the excitation. LPCNet aims to show that the efficiency of speaker-independent speech synthesis can be improved by combining newer neural synthesis techniques with linear prediction.

What mechanisms does LPCNet use?

In addition to linear prediction, it includes the following tricks:

Pre-emphasis/de-emphasis filters: These filters allow shaping the noise caused by the μ-law quantization. LPCNet is capable of shaping the μ-law quantization noise to be mostly inaudible.

Sparse matrices: LPCNet uses sparse matrices in the main RNN similar to WaveRNN. These block-sparse matrices consist of blocks with size 16x1 to make it easier to vectorize the products. Instead of forcing many non-zero blocks along the diagonal, as a minor improvement, all the weights on the diagonal of the matrices are kept.

Input embedding: Instead of feeding the inputs directly to the network, the developers have used an embedding matrix. Embedding is generally used in natural language processing, but using it for μ-law values makes it possible to learn non-linear functions of the input.