Mozilla has published an update to its Common Voice datasets, which contain pronunciation samples from more than 200,000 people. The data is released as a public resource under the CC0 license. The datasets can be used in machine learning systems to build speech recognition and speech synthesis models.
Compared with the previous update, the collection has grown from 28.7 thousand to 30.3 thousand hours of speech, of which 19.7 thousand hours have been validated. The number of supported languages has increased from 114 to 120, with the addition of Yiddish, Latgalian, Ligurian, Ossetian, Telugu, and Western Sierra Puebla Nahuatl.
A total of 90,670 people contributed to the English dataset, recording 3,438 hours of speech. The Belarusian dataset has 8,249 participants and 1,641 hours of speech. The Russian dataset has 3,133 participants and 265 hours, the Uzbek dataset has 2,151 participants and 264 hours, and the Ukrainian dataset has 1,058 participants and 108 hours.
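To put these figures in perspective, the following minimal Python sketch computes the average recording time per participant for each language listed above. The participant and hour counts are taken from this article; the names `DATASETS` and `average_minutes` are illustrative only and are not part of any Common Voice tooling.

```python
# Participants and recorded hours per language, as quoted in this article.
DATASETS = {
    "English": (90_670, 3_438),
    "Belarusian": (8_249, 1_641),
    "Russian": (3_133, 265),
    "Uzbek": (2_151, 264),
    "Ukrainian": (1_058, 108),
}


def average_minutes(datasets):
    """Return the average recorded minutes per participant for each language."""
    return {
        language: hours * 60 / participants
        for language, (participants, hours) in datasets.items()
    }


if __name__ == "__main__":
    # Print languages from the highest to the lowest average contribution.
    averages = average_minutes(DATASETS)
    for language, minutes in sorted(averages.items(), key=lambda kv: -kv[1]):
        print(f"{language}: {minutes:.1f} min per participant")
```

By this rough measure, Belarusian contributors recorded several times more speech per person on average than English contributors, even though the English dataset is larger overall.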
The Common Voice project aims to collaboratively build a database of voice samples that reflects the full diversity of voices and speaking styles. Contributors are asked to read aloud phrases shown on screen or to rate the quality of recordings submitted by others. The resulting database, which contains many different pronunciations of common phrases, can be used in machine learning systems and research projects.