NVIDIA and Mozilla have presented an updated set of voice datasets collected as part of the Common Voice initiative, containing speech samples from 182 thousand people, 25% more than six months ago. The data is published in the public domain (CC0). The datasets can be used in machine learning systems to build speech recognition and speech synthesis models.
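For readers who want to experiment with the released data, below is a minimal sketch of loading one of the per-language subsets. It assumes the Common Voice mirror published on the Hugging Face Hub (the dataset name mozilla-foundation/common_voice_7_0 and the gated-access login are assumptions, not details from the announcement); the official tarballs from commonvoice.mozilla.org can be used just as well.

```python
# A minimal sketch, assuming the mozilla-foundation/common_voice_7_0
# mirror on the Hugging Face Hub and an authenticated hub login.
from datasets import load_dataset

# Load the validated Russian split; other language codes
# ("uk", "be", "kk", ...) follow the same pattern.
cv_ru = load_dataset(
    "mozilla-foundation/common_voice_7_0",
    "ru",
    split="validation",
    use_auth_token=True,
)

sample = cv_ru[0]
print(sample["sentence"])                 # reference transcript
print(sample["audio"]["sampling_rate"])   # decoded audio metadata
```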
Compared with the previous update, the amount of speech material in the collection has grown from 9 to 13.9 thousand hours. The number of supported languages has increased from 60 to 76, with Belarusian, Kazakh, Uzbek, Bulgarian, Armenian, Azerbaijani, and Bashkir added for the first time. The Russian dataset now covers 2136 participants and 173 hours of speech (up from 1412 participants and 111 hours), and the Ukrainian one 615 participants and 66 hours (up from 459 participants and 30 hours).
The English materials were prepared by more than 75 thousand people, who dictated 2637 hours of validated speech (up from 66 thousand participants and 1686 hours). Interestingly, second place in accumulated data goes to Kinyarwanda, with 2260 hours collected, followed by German (1040), Catalan (920), and Esperanto (840). Among the fastest-growing languages are Thai (a 20-fold increase, from 12 to 250 hours), Luganda (from 8 to 80 hours), Esperanto (from 100 to 840 hours), and Tamil (from 24 to 220 hours).
As part of its participation in the Common Voice project, NVIDIA used the collected data to prepare ready-made trained models for machine learning systems (PyTorch is supported). The models are distributed as part of the free and open NVIDIA NeMo toolkit, which is already used, for example, in the automated voice services of MTS and Sberbank. The models target speech recognition, speech synthesis, and natural language processing, and may be useful to researchers building voice dialogue systems, transcription platforms, and automated call centers. Unlike previously available projects, the published models are not limited to recognizing English and cover a variety of languages, accents, and forms of speech.
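To illustrate how the published models are consumed, here is a minimal sketch using the NeMo 1.x ASR API. The specific checkpoint name "stt_en_quartznet15x5" and the transcribe() call are assumptions based on the public NeMo toolkit, not details given in the announcement.

```python
# A minimal sketch, assuming NeMo 1.x and the public
# "stt_en_quartznet15x5" checkpoint from NVIDIA's model catalog.
import nemo.collections.asr as nemo_asr

# Download a pretrained CTC speech-recognition model
# (a PyTorch checkpoint under the hood).
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="stt_en_quartznet15x5",
)

# Transcribe a local 16 kHz mono WAV file.
transcripts = asr_model.transcribe(paths2audio_files=["sample.wav"])
print(transcripts[0])
```

Models for other languages, as well as speech-synthesis and NLP models, are loaded the same way through their respective NeMo collections.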