Mozilla has presented the Common Voice speech datasets, which include pronunciation samples from about 200 thousand people. The data is released into the public domain (CC0). The datasets can be used in machine learning systems to build speech recognition and speech synthesis models.
Compared with the previous update, the volume of speech material in the collection grew by about 10%, from 18.2 to 20.2 thousand hours. The number of supported languages rose from 87 to 93. More than 100 hours of speech data have been collected for 27 languages, and more than 500 hours for 9 languages. For 9 languages, the share of female speech has also reached at least 45%.
More than 81 thousand people took part in preparing the English-language materials, dictating 2953 hours of speech (previously 79 thousand participants and 2886 hours). The Belarusian set covers 6326 participants and 1054 hours of speech (previously 6160 participants and 987 hours); the Russian set, 2585 participants and 201 hours (previously 2452 participants and 193 hours); the Uzbek set, 1503 participants and 231 hours (previously 1355 participants and 227 hours); the Ukrainian set, 696 participants and 79 hours (previously 684 participants and 76 hours).
Recall that the Common Voice project aims to organize collaborative work on building a database of voice samples that reflects the full diversity of voices and manners of speech. Users are invited to read aloud phrases displayed on the screen or to rate the quality of recordings added by other users. The resulting database of recordings, with varied pronunciations of typical phrases of human speech, can be used without restrictions in machine learning systems and research projects.
According to the author of the Vosk continuous speech recognition library, the shortcomings of the Common Voice set are the one-sidedness of the voice material (a predominance of men aged 20-30 and a shortage of recordings from women, children and the elderly), the lack of vocabulary variability (the same phrases are repeated) and the distribution of recordings in the lossy MP3 format, which distorts the audio.
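The imbalance and repetition described above can be estimated from a per-clip metadata table. Below is a minimal Python sketch that computes gender and age shares and the fraction of clips with a repeated phrase. The column names (`client_id`, `sentence`, `age`, `gender`) and the sample rows are assumptions made for illustration, not data from the actual release. Note that shares are counted per clip, not per speaker or per hour of audio.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the style of a per-clip metadata table.
# The column names and all rows are assumptions for illustration only.
SAMPLE_TSV = (
    "client_id\tsentence\tage\tgender\n"
    "a1\tHello world\ttwenties\tmale\n"
    "a1\tGood morning\ttwenties\tmale\n"
    "b2\tHello world\tthirties\tmale\n"
    "c3\tHello world\tsixties\tfemale\n"
    "d4\tSee you soon\t\t\n"
)


def share_by(rows, column):
    """Return each value's share among the clips where the column is filled in."""
    counts = Counter(row[column] for row in rows if row[column])
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def repeated_sentence_ratio(rows):
    """Return the fraction of clips whose sentence also occurs in another clip."""
    counts = Counter(row["sentence"] for row in rows)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(rows) if rows else 0.0


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
gender_share = share_by(rows, "gender")
age_share = share_by(rows, "age")
repeat_ratio = repeated_sentence_ratio(rows)

print("gender share:", gender_share)
print("age share:", age_share)
print("clips with a repeated sentence:", repeat_ratio)
```

For a real metadata file, the inline sample would be replaced by an opened TSV file, with the column names adjusted to whatever the file actually contains.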