Why was LPCNet introduced?
Many recent neural speech synthesis models have made it possible to synthesize high-quality speech, and to code speech at very low bitrates. These models, often based on WaveNet, give promising results in real time, but only on a high-end GPU. LPCNet, in contrast, aims to perform speech synthesis on end-user devices like mobile phones, which generally do not have powerful GPUs and have very limited battery capacity.
Low-complexity parametric synthesis models, such as low-bitrate vocoders, do exist, but their quality is a concern. They are efficient at modeling the spectral envelope of speech using linear prediction, but no equally simple model exists for the excitation. LPCNet aims to show that the efficiency of speaker-independent speech synthesis can be improved by combining newer neural synthesis techniques with linear prediction.
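To illustrate what linear prediction contributes, here is a minimal sketch (not LPCNet's implementation) of predicting each sample as a weighted sum of its predecessors; what is left over is the excitation residual that the neural network must model. The signal and coefficient values are illustrative:

```python
import numpy as np

def lpc_predict(signal, coeffs):
    """Predict each sample as a weighted sum of the previous M samples."""
    m = len(coeffs)
    pred = np.zeros_like(signal)
    for t in range(m, len(signal)):
        # Dot product with the m previous samples, most recent first.
        pred[t] = np.dot(coeffs, signal[t - m:t][::-1])
    return pred

# Toy signal: an exponentially decaying sinusoid, i.e. exactly the output
# of a two-pole all-pole filter, which order-2 LPC models perfectly.
n = np.arange(200)
r, w = 0.9 ** (1 / 50), 0.2
s = r ** n * np.sin(w * n)

# Order-2 coefficients matching the poles; a real coder estimates them
# per frame (e.g. via Levinson-Durbin) from the signal's autocorrelation.
a = np.array([2 * r * np.cos(w), -r * r])

pred = lpc_predict(s, a)
excitation = s - pred  # the residual a neural model like LPCNet predicts
```

For this exactly auto-regressive toy signal the residual is near zero; for real speech the envelope is captured by the prediction and the harder-to-model structure remains in the excitation.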
What mechanisms does LPCNet use?
In addition to linear prediction, it includes the following tricks:
- Pre-emphasis/de-emphasis filters: A pre-emphasis filter is applied to the input and the inverse de-emphasis filter to the output. Together they shape the noise caused by μ-law quantization so that it is mostly inaudible.
- Sparse matrices: Like WaveRNN, LPCNet uses sparse matrices in the main RNN. These block-sparse matrices consist of 16×1 blocks, which makes the products easy to vectorize. As a minor improvement, all the weights on the diagonal of the matrices are kept, rather than relying on the sparsification to retain many non-zero blocks along the diagonal.
- Input embedding: Instead of feeding the inputs directly to the network, the authors use an embedding matrix. Embeddings are generally used in natural language processing, but applying one to μ-law values makes it possible to learn non-linear functions of the input.
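The pre-emphasis/de-emphasis pair can be sketched as a first-order filter and its exact inverse; the coefficient value below is an assumption for illustration, not taken from the post:

```python
import numpy as np

ALPHA = 0.85  # assumed coefficient; the exact value is a design choice

def pre_emphasis(x, alpha=ALPHA):
    # FIR filter E(z) = 1 - alpha * z^-1: boosts high frequencies before
    # mu-law quantization.
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def de_emphasis(y, alpha=ALPHA):
    # IIR inverse D(z) = 1 / (1 - alpha * z^-1), applied at the output:
    # the flat quantization noise is attenuated at high frequencies,
    # where hearing is most sensitive to it.
    x = np.copy(y)
    for t in range(1, len(x)):
        x[t] += alpha * x[t - 1]
    return x

x = np.random.default_rng(0).standard_normal(1000)
roundtrip = de_emphasis(pre_emphasis(x))
print(np.allclose(roundtrip, x))  # → True: the filters are exact inverses
```

In a real codec the μ-law quantizer sits between the two filters, so only the quantization noise, not the signal, ends up spectrally shaped.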
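To see why 16×1 blocks vectorize well, here is a minimal sketch of a matrix-vector product over such blocks; the storage layout and names are illustrative, not LPCNet's actual format:

```python
import numpy as np

BLOCK = 16  # 16x1 blocks, as described above

def block_sparse_matvec(blocks, x, n_row_blocks):
    """Compute y = W @ x where W is stored only as its non-zero blocks.

    blocks: dict mapping (row_block, col) -> a length-16 weight column.
    """
    y = np.zeros(n_row_blocks * BLOCK)
    for (rb, c), w in blocks.items():
        # Each 16x1 block touches 16 consecutive outputs and one input,
        # so the update is a contiguous scale-and-add: easy to SIMD.
        y[rb * BLOCK : (rb + 1) * BLOCK] += w * x[c]
    return y

rng = np.random.default_rng(0)
# A 64x64 matrix with only three non-zero blocks kept after sparsification.
blocks = {(rb, c): rng.standard_normal(BLOCK)
          for rb, c in [(0, 5), (1, 40), (3, 12)]}
x = rng.standard_normal(64)
y = block_sparse_matvec(blocks, x, n_row_blocks=4)
```

Scattered single weights would instead produce irregular memory access, which is why block structure matters for fast inference on CPUs.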
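The embedding trick amounts to a lookup table indexed by the 256 μ-law levels; the embedding dimension below is an assumed value, and in the real model the table is learned during training rather than random:

```python
import numpy as np

N_LEVELS = 256   # mu-law quantization yields 256 discrete levels
EMBED_DIM = 128  # assumed size; the real dimension is a model choice

rng = np.random.default_rng(0)
# Stand-in for a learned table: each mu-law level maps to a vector, so
# the network can learn an arbitrary non-linear function of the level
# instead of receiving it as a single scalar.
embedding = 0.1 * rng.standard_normal((N_LEVELS, EMBED_DIM))

mu_law_samples = np.array([0, 127, 128, 255])  # quantized sample indices
inputs = embedding[mu_law_samples]             # rows fed to the RNN
print(inputs.shape)  # → (4, 128)
```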
You can read about LPCNet in more detail on Mozilla’s official website.