Mozilla Updates Common Voice Data to Version 20

16 Dec 2024 10:27 am GMT+0000

Mozilla has released version 20 of its Common Voice datasets, which now include voice samples from over 200 thousand people. The data is released into the public domain under the CC0 license and can be used in machine learning systems for speech recognition and synthesis. The update brings the volume of speech material in the collection to 33.1 thousand hours, of which 22.1 thousand hours have passed verification. The number of supported languages has grown from 129 to 133, with the addition of Aragonese, isiNdebele, Southern Sotho, and Tupuri.

For English, 94.9 thousand participants contributed 3631 hours of speech, while the Belarusian set involved 8521 participants and 1860 hours. The Russian set has 3365 participants and 281 hours, Uzbek has 2211 participants and 265 hours, and Ukrainian has 1120 participants and 114 hours.
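To put these figures in proportion, the following minimal sketch computes the average contribution per participant for each language listed above, along with the share of the collection that has passed verification. It uses only the numbers quoted in this article; the English participant count is the rounded 94.9 thousand, and the names `STATS`, `minutes_per_participant`, and `validated_share` are illustrative and not part of any Common Voice tooling.

```python
# Figures quoted in the article: language -> (participants, hours of speech).
# The English participant count is the rounded value (94.9 thousand).
STATS = {
    "English": (94_900, 3631),
    "Belarusian": (8521, 1860),
    "Russian": (3365, 281),
    "Uzbek": (2211, 265),
    "Ukrainian": (1120, 114),
}


def minutes_per_participant(participants: int, hours: float) -> float:
    """Average minutes of recorded speech per participant."""
    return hours * 60 / participants


def validated_share(validated_hours: float, total_hours: float) -> float:
    """Fraction of the collection that has passed verification."""
    return validated_hours / total_hours


if __name__ == "__main__":
    for language, (participants, hours) in STATS.items():
        average = minutes_per_participant(participants, hours)
        print(f"{language}: {average:.1f} min per participant")

    # 22.1 thousand verified hours out of 33.1 thousand hours in total.
    print(f"Verified share: {validated_share(22.1, 33.1):.1%}")
```

These are rough averages only: the article does not say how recordings are distributed among participants, so a small number of very active contributors could account for much of a language's total.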

The Common Voice project aims to build a diverse voice dataset by having volunteers read displayed phrases aloud or review the quality of recordings made by others. The resulting database, which captures a wide range of pronunciations of common phrases, can be used in machine learning systems and research projects without restrictions.
