Mozilla has released an update of the Common Voice speech datasets, which include pronunciation samples from about 200 thousand people. The data is published as public domain (CC0). The datasets can be used in machine learning systems to build speech recognition and speech synthesis models. Compared to the previous update, the volume of speech material in the collection grew by about 30%, from 13.9 to 18.2 thousand hours. The number of supported languages rose from 67 to 87.
The Russian set covers 2452 participants and 193 hours of speech (previously 2136 participants and 173 hours), the Belarusian set covers 6160 participants and 987 hours (previously 3831 participants and 356 hours), and the Ukrainian set covers 684 participants and 76 hours (previously 615 participants and 66 hours). More than 79 thousand people took part in preparing the English material, dictating 2886 hours of validated speech (previously 75 thousand participants and 2637 hours).
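To put the figures above side by side, the relative growth in hours can be computed directly. The following is a minimal sketch that uses only the numbers quoted in this article; the helper names `growth_pct` and `growth_table` are made up for illustration and are not part of any Common Voice tooling.

```python
def growth_pct(previous: float, current: float) -> float:
    """Relative growth from `previous` to `current`, in percent."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return (current - previous) / previous * 100.0


# (previous, current) hours of speech, as quoted in the article.
HOURS = {
    "all languages": (13_900, 18_200),
    "Russian": (173, 193),
    "Belarusian": (356, 987),
    "Ukrainian": (66, 76),
    "English": (2637, 2886),
}


def growth_table(figures):
    """Map each dataset name to its growth in percent, rounded to 0.1."""
    return {
        name: round(growth_pct(prev, cur), 1)
        for name, (prev, cur) in figures.items()
    }


if __name__ == "__main__":
    for name, pct in growth_table(HOURS).items():
        print(f"{name}: {pct:+.1f}%")
```

Computed this way, the overall collection grew by just under 31%, which matches the rounded "about 30%" figure, while the Belarusian set nearly tripled in hours.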
As a reminder, the Common Voice project aims to organize collaborative work on building a database of voice samples that reflects the full diversity of voices and speaking styles. Users are invited to read aloud phrases displayed on the screen or to rate the quality of recordings added by other users. The accumulated database of recordings of typical phrases in various pronunciations can be used without restriction in machine learning systems and research projects.
According to the author of the Vosk speech recognition library, the shortcomings of the Common Voice set are the one-sidedness of the voice material (a predominance of men aged 20-30 and a lack of material with the voices of women, children and the elderly), the lack of vocabulary variety (repetition of the same phrases) and the distribution of the recordings in the lossy MP3 format.
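The demographic skew described above can be checked on a downloaded set by tallying the speaker metadata. The sketch below assumes a tab-separated metadata file with `age` and `gender` columns, as found in Common Voice release archives; the exact column names and value labels may differ between releases, and the sample rows here are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows, not real Common Voice data.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tage\tgender\n"
    "a1\tclip_0001.mp3\tHello there\ttwenties\tmale\n"
    "a2\tclip_0002.mp3\tGood morning\ttwenties\tmale\n"
    "a3\tclip_0003.mp3\tHow are you\tthirties\tfemale\n"
    "a4\tclip_0004.mp3\tSee you soon\t\t\n"
)


def tally(tsv_text: str, column: str) -> Counter:
    """Count clips per value of `column`; empty cells become 'unspecified'."""
    # QUOTE_NONE: sentences may contain stray quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip() or "unspecified"
        counts[value] += 1
    return counts


def shares(counts: Counter) -> dict:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


if __name__ == "__main__":
    for column in ("gender", "age"):
        print(column, shares(tally(SAMPLE_TSV, column)))
```

Note that this counts clips rather than speakers or hours; weighting by clip duration or grouping by `client_id` would give a different picture of the same imbalance.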
In addition, we can note the release of the NVIDIA NeMo 1.6 toolkit, which provides machine learning methods for building speech recognition, speech synthesis and natural language processing systems. NeMo includes ready-to-use trained models for machine learning systems based on the PyTorch framework, prepared by NVIDIA using Common Voice speech data and covering various languages, accents and forms of speech. The models may be useful to researchers building voice dialog systems, transcription platforms and automated call centers. For example, NVIDIA NeMo is used in the automated voice services of MTS and Sberbank. The NeMo code is written in Python using PyTorch and is distributed under the Apache 2.0 license.