A new project called ChatTTS has been released, providing a model and tooling for machine-learning-based synthesis of emotional speech. The project is aimed at dialogue systems, such as interactive assistants, and at replicating natural emotional communication. It supports interaction with multiple speakers, enables the construction of interactive dialogues, and accurately reproduces prosodic elements such as laughter, pauses, and interjections.
The model was trained on approximately 40,000 hours of speech data (100,000 hours for the non-public version). The developers claim that the model surpasses previous open speech-synthesis models in the quality of intonation. Emotions can be controlled during synthesis by inserting tokens into the input text, such as [laugh] for laughter. Generating a 30-second recording requires a GPU with 4 GB of memory, at a rate of roughly 7 semantic tokens per second on an NVIDIA GeForce RTX 4090D. Synthesis supports both female and male voices in English and Chinese; for Russian-language synthesis, the TTS framework and the XTTS-v2 model are recommended.
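As an illustration, here is a minimal synthesis sketch in Python. It assumes the ChatTTS package's Chat class with the load_models() and infer() methods shown in the project repository; the [laugh] control token, the output waveform shape, and the 24 kHz sample rate are taken from the project's examples and should be checked against the current README.

    import torch
    import torchaudio
    import ChatTTS

    # Initialize the model; load_models() fetches the pretrained weights
    # on first use (following the project's published examples).
    chat = ChatTTS.Chat()
    chat.load_models()

    # Control tokens such as [laugh] are embedded directly in the text and
    # are rendered as prosodic events instead of being read aloud.
    texts = ["This turned out better than expected [laugh]."]

    # infer() returns one waveform per input string as a NumPy array,
    # assumed here to have shape (1, num_samples).
    wavs = chat.infer(texts)

    # 24000 Hz is the model's assumed output sample rate.
    torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)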
The ChatTTS model is released under the CC BY-NC-ND 4.0 license, which allows free distribution with attribution but prohibits derivative works and commercial use. To hinder fraudulent and criminal use of the model, the developers added a small amount of high-frequency noise to the training audio and applied maximum MP3 compression, degrading output quality enough to complicate voice spoofing.