NVIDIA invests 1.5 million dollars in Mozilla's Common Voice project

NVIDIA is investing 1.5 million dollars in Mozilla's Common Voice project. The interest in speech recognition systems stems from the forecast that over the next ten years voice technologies will become one of the main ways people interact with devices, from computers and phones to digital assistants and retail kiosks.

The quality of voice systems depends heavily on the volume and diversity of the voice data available for training machine learning models. Today's voice technologies focus mainly on recognizing English and leave a huge number of languages, accents and speech patterns uncovered. The investment will help accelerate the growth of publicly available voice data, bring more communities and volunteers into the work, and increase the number of full-time project staff.

Recall that the Common Voice project aims to organize collaborative work on building a database of voice samples that reflects the full diversity of voices and speaking styles. Users are invited to read aloud phrases displayed on the screen or to evaluate the quality of recordings added by other users. The accumulated database, which holds recordings of various pronunciations of typical phrases of human speech, can be used without restrictions in machine learning systems and research projects.

The Common Voice dataset currently includes pronunciation samples from more than 164 thousand people. About 9 thousand hours of voice data in 60 different languages have been accumulated. The set for the Russian language covers 1412 participants and 111 hours of speech material, and the set for the Ukrainian language covers 459 participants and 30 hours. For comparison, more than 66 thousand people took part in preparing the English materials, recording 1686 hours of validated speech. The proposed sets can be used in machine learning systems to build speech recognition and speech synthesis models. The data is published as public domain (CC0).

According to the author of the Vosk speech recognition library, the shortcomings of the Common Voice dataset are the one-sidedness of the voice material (a predominance of men aged 20-30 and a lack of recordings of women, children and the elderly), the lack of vocabulary variety (the same phrases are repeated), and the distribution of recordings in the lossy MP3 format, which distorts the audio.
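
The demographic imbalance mentioned above is something that can be measured when a dataset ships per-clip speaker metadata. The following is a minimal sketch, assuming a tab-separated metadata table with `gender` and `age` columns; the column names, the sample rows and the helper `demographic_shares` are hypothetical and used only for illustration, not taken from the project itself.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows (assumption); a real table would be read from a file.
SAMPLE_TSV = (
    "client_id\tpath\tgender\tage\n"
    "a1\tclip1.mp3\tmale\ttwenties\n"
    "a2\tclip2.mp3\tmale\ttwenties\n"
    "a3\tclip3.mp3\tfemale\tthirties\n"
    "a4\tclip4.mp3\t\t\n"
)


def demographic_shares(tsv_text, column):
    """Return the share of clips per value of `column`.

    Empty or missing values are counted under the key 'unknown'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter((row.get(column) or "unknown") for row in reader)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


if __name__ == "__main__":
    for column in ("gender", "age"):
        print(column, demographic_shares(SAMPLE_TSV, column))
```

A heavily skewed result, such as one gender or age group accounting for most clips, would correspond to the kind of one-sidedness described above.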
