Last week, the Facebook AI Research (FAIR) speech team introduced the first fully convolutional speech recognition approach. They also open-sourced flashlight, a C++ library for machine learning, and wav2letter++, a fast and simple system for developing end-to-end speech recognizers.
Fully convolutional speech recognition approach
Current state-of-the-art speech recognition systems are built on RNNs for acoustic or language modeling. Facebook's newly introduced system provides an alternative approach based solely on convolutional neural networks. It eliminates the feature extraction step altogether, as it is trained end-to-end to predict characters from the raw waveform, and it uses an external convolutional language model to decode words.
The architecture of this CNN-based speech recognition system consists of the following components:
- Learnable frontend: The frontend starts with a convolution of width 2 that emulates the pre-emphasis step, followed by a complex convolution of width 25 ms. After the squared absolute value is computed, a low-pass filter and stride perform the decimation. The frontend finally applies log-compression and per-channel mean-variance normalization (a sketch of this pipeline appears after this list).
- Acoustic model: A CNN with gated linear units (GLU) that is fed the output of the learnable frontend. The acoustic model is trained to predict letters directly with the Auto Segmentation Criterion (ASG).
- Language model: The convolutional language model (LM) contains 14 convolutional residual blocks and uses GLUs as the activation function. It is used, alongside the acoustic model, to score candidate transcriptions in the beam-search decoder.
- Beam-search decoder: The beam-search decoder generates word sequences given the output from the acoustic model (a simplified decoding sketch is shown below).
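
To make the frontend and acoustic-model descriptions above more concrete, here is a minimal PyTorch sketch. It is not FAIR's implementation: the filter widths, channel counts, strides, and the 29-token output layer are illustrative assumptions, and the low-pass filter is modeled here as a learnable depthwise convolution.

```python
# Minimal sketch of the learnable frontend and a GLU acoustic block.
# All sizes below are illustrative assumptions, not FAIR's exact values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFrontend(nn.Module):
    def __init__(self, n_filters=40, sample_rate=16000):
        super().__init__()
        win = int(0.025 * sample_rate)                       # 25 ms analysis window
        self.preemphasis = nn.Conv1d(1, 1, kernel_size=2)    # width-2 conv ~ pre-emphasis
        # "Complex" convolution approximated with 2*n_filters real filters
        # (real and imaginary parts stacked along the channel axis).
        self.complex_conv = nn.Conv1d(1, 2 * n_filters, kernel_size=win)
        # Low-pass filter + stride performs the decimation (10 ms hop assumed).
        self.lowpass = nn.Conv1d(n_filters, n_filters, kernel_size=win,
                                 stride=sample_rate // 100, groups=n_filters)
        self.norm = nn.InstanceNorm1d(n_filters)             # per-channel mean-variance normalization

    def forward(self, wav):                                  # wav: (batch, 1, samples)
        x = self.preemphasis(wav)
        x = self.complex_conv(x)
        real, imag = x.chunk(2, dim=1)
        x = real ** 2 + imag ** 2                            # squared absolute value
        x = self.lowpass(x)
        x = torch.log1p(x)                                   # log-compression
        return self.norm(x)

class GLUBlock(nn.Module):
    """One gated-convolution block of the kind used in the acoustic and language models."""
    def __init__(self, channels, kernel_size=13):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        return F.glu(self.conv(x), dim=1)                    # gated linear unit

frontend = LearnableFrontend()
acoustic = nn.Sequential(GLUBlock(40), GLUBlock(40), nn.Conv1d(40, 29, 1))  # 29 output tokens assumed
logits = acoustic(frontend(torch.randn(1, 1, 16000)))        # (batch, tokens, frames)
```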
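The decoder itself can be understood as a search that combines acoustic scores with LM scores. The toy beam search below illustrates the idea only: `lm_score` is a hypothetical stand-in for the convolutional LM, and the real wav2letter++ decoder additionally handles lexicon constraints and ASG transition scores.

```python
import math

def beam_search(log_probs, tokens, lm_score, beam_size=8, lm_weight=0.5):
    """log_probs: list of per-frame dicts {token: log p(token | frame)} from the acoustic model.
    lm_score: callable mapping a partial transcription (string) to a log-probability.
    Returns the best-scoring transcription under the combined score."""
    beams = [("", 0.0)]                                      # (prefix, acoustic log-score)
    for frame in log_probs:
        candidates = []
        for prefix, score in beams:
            for tok in tokens:
                candidates.append((prefix + tok, score + frame.get(tok, -math.inf)))
        # Rerank by acoustic score plus weighted LM score, keep the best beams.
        candidates.sort(key=lambda c: c[1] + lm_weight * lm_score(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return max(beams, key=lambda c: c[1] + lm_weight * lm_score(c[0]))[0]

# Toy usage with a length-penalizing stand-in LM and three frames of fake posteriors.
uniform_lm = lambda text: -len(text)
frames = [{"c": -0.1, "k": -2.3}, {"a": -0.2, "o": -1.9}, {"t": -0.1, "d": -2.0}]
print(beam_search(frames, ["c", "k", "a", "o", "t", "d"], uniform_lm))   # -> "cat"
```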
To complement this CNN-based approach and enable reproducibility, Facebook released the wav2letter++ and flashlight frameworks.
flashlight is a standalone C++ library for machine learning. It uses the ArrayFire tensor library, features just-in-time compilation with modern C++, and targets both CPU and GPU backends for maximum efficiency and scale.
The wav2letter++ toolkit is built on top of flashlight and written entirely in C++. It also uses ArrayFire as its primary library for tensor operations. ArrayFire is a highly optimized tensor library that can execute on multiple backends, including CUDA GPUs and CPUs. wav2letter++ supports multiple audio file formats, such as WAV and FLAC, as well as several feature types, including raw audio, a linearly scaled power spectrum, log-mel filterbanks (MFSC), and MFCCs.
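For illustration only, these feature types can be approximated in Python with torchaudio (wav2letter++ computes them natively in C++); the file name and parameters below are placeholder assumptions.

```python
# Sketch of the feature types mentioned above, using torchaudio for demonstration.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")     # raw audio (placeholder file name)

# Linearly scaled power spectrum (power=2.0 gives squared magnitudes).
power_spec = torchaudio.transforms.Spectrogram(n_fft=400, hop_length=160, power=2.0)(waveform)

# Log-mel filterbanks (MFSC): mel spectrogram followed by log-compression.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_fft=400,
                                           hop_length=160, n_mels=40)(waveform)
log_mels = torch.log1p(mel)

# MFCCs.
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=13)(waveform)
```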
To read more in detail, check out Facebook’s official announcement.