Zyphra has released the first beta of Zonos, a family of AI models for speech synthesis, under the Apache 2.0 license. The models ship with tooling that supports voice cloning: given a speech reference of about 30 seconds from a speaker, users can synthesize that voice in English, Japanese, Chinese, French, and German.
With 1.6 billion parameters and training on 200,000 hours of audio recordings, the model can synthesize both neutral speech (e.g., audiobook narration) and emotional speech, and can also continue generation from a given audio prefix. Output is produced at a 44 kHz sampling rate. Synthesized segments can be substituted into existing audio, which makes it possible to simulate dialogues, and tags can be added to control speech rate, tonality, and emotional expression.
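The figures above translate into concrete data sizes. A minimal back-of-the-envelope sketch, using only the article's numbers (a 30-second voice reference and 44 kHz output; the model's exact sample rate may differ):

```python
REFERENCE_SECONDS = 30       # length of the voice-cloning reference clip
SAMPLE_RATE_HZ = 44_000      # output sampling rate cited in the article

def num_samples(seconds: float, rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of PCM samples in a mono clip of the given duration."""
    return int(seconds * rate_hz)

# A 30-second mono reference clip at 44 kHz:
ref_samples = num_samples(REFERENCE_SECONDS)   # 1,320,000 samples
ref_bytes = ref_samples * 2                    # ~2.6 MB as 16-bit PCM

print(ref_samples, ref_bytes)
```

So the reference material the cloning function needs is only a few megabytes of raw audio.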
The developers claim that the model matches or exceeds the quality of publicly available open and commercial speech-synthesis systems. Known drawbacks include a higher rate of audio artifacts, such as coughing or breathing sounds, at the beginning or end of the generated audio.
To try the model, a Docker-ready image is provided that includes a synthesis web interface based on Gradio. Users can start it by cloning the repository and launching the container with “git clone https://github.com/Zyphra/Zonos.git; cd Zonos; docker compose up”, then opening the page “http://localhost:7860”. An NVIDIA GPU is recommended, preferably from the 3000 series or newer with at least 6 GB of video memory. On an RTX 4090, the system delivers roughly twice the throughput required for real-time synthesis.
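The throughput claim can be made concrete with a real-time factor (RTF). A short sketch, assuming the article's figure means an RTF of about 2 (two seconds of audio generated per second of wall-clock time) on an RTX 4090:

```python
def generation_time(audio_seconds: float, real_time_factor: float) -> float:
    """Wall-clock seconds needed to synthesize a clip of the given length.

    real_time_factor = seconds of audio produced per second of compute;
    1.0 is the break-even point for real-time synthesis.
    """
    return audio_seconds / real_time_factor

# At the article's claimed ~2x real time on an RTX 4090,
# a one-minute clip takes about half a minute to generate:
print(generation_time(60, 2.0))  # -> 30.0
print(generation_time(60, 1.0))  # -> 60.0 (exactly real time)
```

Any RTF above 1.0 leaves headroom for streaming use cases; slower GPUs with an RTF below 1.0 would fall behind live playback.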