New Release of the Silero Speech Synthesis System

A new public release of the Silero Text-to-Speech neural network speech synthesis system is available. The project primarily aims to create a modern, high-quality speech synthesis system that is on par with commercial solutions from large corporations and is accessible to everyone without expensive server hardware.

The models are distributed under the GNU AGPL license, but the company developing the project does not disclose the model training mechanism. The models can be run with PyTorch and with frameworks that support the ONNX format. Speech synthesis in Silero is based on heavily modified modern neural network algorithms and digital signal processing methods.

The main problem with modern neural network solutions for speech synthesis is said to be that they are often available only as part of paid cloud services, while public products have high hardware requirements, offer lower quality, or are not complete, ready-to-use products. For example, running VITS, one of the new popular end-to-end synthesis architectures, without problems in synthesis mode (that is, not for model training) requires a video card with more than 16 gigabytes of VRAM.

Contrary to the current trend, the Silero solution runs successfully even on 1 x86 processor with AVX2 instructions. On 4 processor threads, it can synthesize from 30 to 60 seconds of audio per second in 8 kHz mode, 15-20 seconds in 24 kHz mode, and about 10 seconds in 48 kHz mode.
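To put these figures in perspective, the throughput can be converted into a real-time factor and an estimated processing time. The sketch below is only illustrative arithmetic on the 4-thread numbers quoted above; the names `THROUGHPUT_RANGES`, `real_time_factor` and `estimated_wall_time` are hypothetical and are not part of Silero.

```python
# Illustrative arithmetic only. The values are the 4-thread throughput figures
# quoted in the text: seconds of audio synthesized per second of wall time.
THROUGHPUT_RANGES = {
    8_000: (30.0, 60.0),    # 8 kHz mode: 30 to 60 seconds per second
    24_000: (15.0, 20.0),   # 24 kHz mode: 15-20 seconds per second
    48_000: (10.0, 10.0),   # 48 kHz mode: "about 10 seconds" per second
}


def real_time_factor(audio_seconds_per_second: float) -> float:
    """Wall-clock seconds needed per second of audio (lower is faster)."""
    if audio_seconds_per_second <= 0:
        raise ValueError("throughput must be positive")
    return 1.0 / audio_seconds_per_second


def estimated_wall_time(audio_seconds: float, sample_rate: int) -> tuple[float, float]:
    """Return (best_case, worst_case) wall time in seconds for the given audio length."""
    slowest, fastest = THROUGHPUT_RANGES[sample_rate]
    return audio_seconds / fastest, audio_seconds / slowest


if __name__ == "__main__":
    for rate in THROUGHPUT_RANGES:
        best, worst = estimated_wall_time(60.0, rate)
        print(f"{rate} Hz: one minute of audio in roughly {best:.1f} to {worst:.1f} s")
```

Under these assumptions, a minute of 24 kHz audio would take roughly 3 to 4 seconds to synthesize on 4 threads; actual timings depend on the hardware and the text.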

The main features of the new release of Silero:

  • model size has been halved, to 50 megabytes;
  • models can now insert pauses;
  • 4 high-quality voices in Russian (and an unlimited number of random ones); pronunciation examples are available;
  • models have become 10 times faster; for example, in 24 kHz mode up to 20 seconds of audio can be synthesized per second on 4 processor threads;
  • all voices for one language are packaged in a single model;
  • models can take entire paragraphs of text as input, and SSML tags are supported;
  • synthesis works at three sampling rates to choose from: 8, 24 and 48 kilohertz;
  • "childhood problems" have been solved: instability and skipped words;
  • flags have been added to control automatic stress placement and the placement of the letter "ё".
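As a rough illustration of what paragraph input with SSML tags and a selectable sampling rate might look like on the caller's side, here is a minimal sketch. It does not call Silero at all: `build_ssml` and `check_sample_rate` are hypothetical helpers, and the use of the standard SSML elements `<speak>`, `<p>` and `<break>` is an assumption, since the text does not say which tags the models accept.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sampling rates listed in the text, in Hz.
SUPPORTED_SAMPLE_RATES = (8_000, 24_000, 48_000)


def build_ssml(paragraphs, pause_ms=500):
    """Wrap paragraphs in <speak>/<p> and separate them with <break> pauses.

    Hypothetical helper: the tag set actually accepted by the models is an
    assumption; only standard SSML elements are used here.
    """
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    parts = []
    for index, text in enumerate(paragraphs):
        if index > 0:
            parts.append(f'<break time="{pause_ms}ms"/>')
        # Escape &, < and > so user text cannot break the markup.
        parts.append(f"<p>{escape(text)}</p>")
    ssml = "<speak>" + "".join(parts) + "</speak>"
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    return ssml


def check_sample_rate(sample_rate):
    """Return the rate unchanged if it is one of the three listed modes."""
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return sample_rate


if __name__ == "__main__":
    print(build_ssml(["First paragraph.", "Second paragraph."]))
    print(check_sample_rate(24_000))
```

Validating the markup and the sampling rate before handing them to a model keeps malformed input from reaching the synthesis step; how the resulting string is passed to the model is deliberately left out here.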

Four voices are currently publicly available for the new version of the synthesizer; the next version, to be published in the near future, will bring the following:

  • synthesis speed will increase by another 2-4 times;
  • synthesis models for CIS languages will be updated: Kalmyk, Tatar, Uzbek and Ukrainian;
  • models for European languages will be added;
  • models for Indian languages will be added;
  • models for English will be added.

Some of the systemic shortcomings inherent in Silero synthesis:
