Mozilla Updates Common Voice with Expanded Voice Data
Mozilla has updated the Common Voice datasets, which now include pronunciation samples from more than 200 thousand people. The data is released into the public domain under the Creative Commons Zero (CC0) license. The datasets can be used in machine learning systems to build speech recognition and speech synthesis models.
The updated collection contains 28.1 thousand hours of speech, up from 27.1 thousand hours in the previous release. Of this total, 18.6 thousand hours have been verified. The number of supported languages has grown from 108 to 112, with the addition of Pashto, Albanian, Amharic, and Standard Moroccan Berber.
For English, 88.1 thousand participants dictated 3279 hours of speech (up from 88 thousand participants and 3161 hours in the previous release). The Belarusian dataset now includes 8162 participants and 1511 hours (up from 7903 participants and 1419 hours). The Russian dataset includes 3001 participants and 263 hours (up from 2815 participants and 229 hours), the Uzbek dataset 2134 participants and 262 hours (up from 2092 participants and 261 hours), and the Ukrainian dataset 789 participants and 92 hours (up from 780 participants and 87 hours).
The Common Voice project aims to organize collaborative work on building a database of voice samples that covers the full diversity of voices and speech styles. Users are invited to read aloud phrases displayed on the screen or to evaluate the quality of recordings added by other contributors. This