Common Voice: Mozilla’s largest voice dataset with approx. 1,400 hours of voice clips in 18 different languages

Mozilla, the organization behind the free and open-source Firefox web browser, yesterday released Common Voice, the largest public dataset of human voices available for use. The dataset covers 18 languages (including English, French, German, Mandarin Chinese, Welsh, and Kabyle) and includes about 1,400 hours of recorded voice clips from more than 42,000 contributors.

“With this release, the continuously growing Common Voice dataset is now the largest ever of its kind, with tens of thousands of people contributing their voices and originally written sentences to the public domain (CC0)”, states the Mozilla team.

The Common Voice dataset is unique and rich in diversity, as it represents a global community of voice contributors. These contributors can also opt in to provide other information such as age, sex, and accent, so that their voice clips are attached to data that is useful in training speech engines.

Mozilla enabled multi-language support in June 2018, making Common Voice more global and inclusive. Communities contributing to the project have since helped launch data collection efforts in 22 languages, with 70 more in progress on the Common Voice site.

With the help of these communities, Mozilla has made the latest additions to the Common Voice dataset, including languages such as Dutch, Hakha-Chin, Esperanto, Farsi, Basque, and Spanish. It also plans to continue working with these communities to retain the diversity of the voices represented. According to the Mozilla team, these contributors can not only track per-language progress in recording and validation but have also improved the prompts that vary from clip to clip.

Mozilla has also added a new option to create a saved profile, which helps contributors keep track of their progress and metrics across different languages. The profile can also hold optional demographic information, which further improves the audio data used to train speech recognition engines for accuracy.

Beyond the dataset, Mozilla also aims to contribute to a more diverse and innovative voice technology ecosystem. It plans to release voice-enabled products while also supporting researchers and smaller players.

“For Common Voice, our focus in 2018 was to build out the concept, make it a tool for any language community to use, optimize the website, and build a robust backend. Our overall aim remains: Providing more and better data to everyone in the world who seeks to build and use voice technology”, states the Mozilla team.

For more information on this announcement, check out the official Mozilla blog post.