Last month, 2Hz introduced an app called krisp, which was featured on the Nvidia website. It uses deep learning for noise suppression and is powered by the krispNet deep neural network (DNN). krispNet is trained to recognize and reduce background noise in real-time audio, yielding clear human speech.
2Hz is a company which builds AI-powered voice processing technologies to improve voice quality in communications.
What are the limitations of current noise suppression methods?
Many edge devices, from phones and laptops to conferencing systems, come with noise suppression technologies. The latest mobile phones come equipped with multiple microphones, which help suppress environmental noise when we talk.
Generally, the first mic is placed on the front bottom of the phone to directly capture the user’s voice. The second mic is placed as far as possible from the first mic. Once both mics have captured the surrounding sounds, the software effectively subtracts one signal from the other, yielding an almost clean voice.
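The subtraction idea can be illustrated with a toy sketch. It rests on a strong assumption that real phones cannot make: the far mic hears only the noise, at exactly the same level and timing as the near mic. The function name `two_mic_subtract` and the `leak` parameter are hypothetical, and production systems typically rely on adaptive filtering rather than a plain sample-wise difference.

```python
import math

def two_mic_subtract(primary, reference, leak=1.0):
    """Subtract the (scaled) reference mic signal from the primary mic signal."""
    return [p - leak * r for p, r in zip(primary, reference)]

# Toy signals sampled at 8 kHz: a 200 Hz "voice" and a 1 kHz "noise" tone.
rate = 8000
length = rate // 10  # 100 ms of audio
voice = [math.sin(2 * math.pi * 200 * n / rate) for n in range(length)]
noise = [0.5 * math.sin(2 * math.pi * 1000 * n / rate) for n in range(length)]

primary = [v + z for v, z in zip(voice, noise)]  # mic near the mouth: voice + noise
reference = noise                                # far mic: assumed to hear noise only

cleaned = two_mic_subtract(primary, reference)

# Largest sample-wise deviation from the true voice signal.
residual = max(abs(c - v) for c, v in zip(cleaned, voice))
```

In this idealized setup the residual is essentially zero. As soon as the far mic also picks up some voice, or the noise reaches the two mics with different delays, plain subtraction degrades, which is part of why the form factor matters.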
The limitations of the multiple-mic design:
- Because a multiple-mic design requires a certain form factor, its application is limited to certain use cases, such as phones or headsets with sticky mics.
- These designs make the audio path complicated, requiring more hardware and code.
- Audio processing can only be done on the edge or device side, so the underlying algorithm cannot be very sophisticated given the limited power and compute available.
Traditional Digital Signal Processing (DSP) algorithms also work well only in certain use cases. Their main drawback is that they do not scale to the variety and variability of noises in our everyday environment.
This is why 2Hz has come up with a deep learning solution that uses a single-microphone design, with all the post-processing handled by software. This allows hardware designs to be simpler and more efficient.
How can deep learning be used in noise suppression?
There are three steps involved in applying deep learning to noise suppression:
- Data collection: The first step is to build a dataset to train the network by combining distinct noises and clean voices to produce synthetic noisy speech.
- Training: Next, feed the synthetic noisy speech to the DNN as input, with the corresponding clean speech as the target output.
- Inference: Finally, produce a mask that filters out the noise, leaving a clear human voice.
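The three steps above can be sketched on a single audio frame. This is a minimal illustration, not krispNet's actual pipeline: all function names are hypothetical, a naive DFT stands in for a proper short-time Fourier transform, and the mask is an "oracle" ratio mask computed from the known clean and noise signals. In a real system, a mask like this would serve as a training target, and the trained network would have to predict it from the noisy audio alone.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of one frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, snr_db):
    """Data collection: scale the noise to a target SNR and add it to clean speech."""
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * z for z in noise]
    return [c + z for c, z in zip(clean, scaled)], scaled

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-12):
    """Training target: per-bin fraction of the energy that belongs to speech."""
    return [abs(s) ** 2 / (abs(s) ** 2 + abs(z) ** 2 + eps)
            for s, z in zip(clean_spec, noise_spec)]

def apply_mask(noisy_spec, mask):
    """Inference: attenuate each frequency bin of the noisy frame by its mask value."""
    return [m * y for m, y in zip(mask, noisy_spec)]

def rms_error(a, b):
    """Root-mean-square difference between two signals."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# One 64-sample frame: "speech" at DFT bin 4, "noise" at DFT bin 20.
frame = 64
clean = [math.sin(2 * math.pi * 4 * n / frame) for n in range(frame)]
noise = [math.sin(2 * math.pi * 20 * n / frame) for n in range(frame)]

noisy, scaled_noise = mix_at_snr(clean, noise, snr_db=5.0)
mask = ideal_ratio_mask(dft(clean), dft(scaled_noise))  # a DNN would predict this
denoised = idft(apply_mask(dft(noisy), mask))

measured_snr = 10 * math.log10(power(clean) / power(scaled_noise))
error_before = rms_error(noisy, clean)
error_after = rms_error(denoised, clean)
```

Because the toy speech and noise occupy separate frequency bins, the oracle mask removes the noise almost perfectly. Real speech and noise overlap in frequency, so a predicted mask can only approximate this result.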
What are the advantages of the krispNet DNN?
krispNet is trained on a very large number of distinct background noises and clean human voices.
- It learns to recognize background noise and separate it from human speech, leaving only the speech. During inference, krispNet acts on real-time audio and removes background noise.
- The krispNet DNN can also perform Packet Loss Concealment for audio, filling in missing voice chunks in voice calls and eliminating “chopping”.
- The krispNet DNN can predict the higher frequencies of a human voice and produce much richer voice audio than the original lower-bitrate audio.
Read more about how deep learning can be used in noise suppression on the Nvidia blog.