Standard Intelligence has announced the release of hertz-dev, the first open AI model for full-duplex speech synthesis. The model can serve as a foundation for real-time voice communication systems or for generating conversational speech. It can generate speech closely resembling the voice data it was trained on, and it supports interaction that resembles a live human conversation, without the noticeable delays typical of a telephone call over an intermittent connection. The project's materials are available under the Apache 2.0 license.
On a system with an NVIDIA GeForce RTX 4090 GPU, the average delay before speech generation begins is 120 ms (theoretically as low as 65 ms). This latency is roughly half that of existing open models in the field. The published version is built on the Transformer architecture, has 8.5 billion parameters, and was trained on 500 billion tokens. The model can take up to 2048 tokens of context into account, which corresponds to roughly 4 minutes of speech.
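The context figures above imply an approximate token rate, which a few lines of Python can make explicit. This is only a back-of-the-envelope sketch: the function names are hypothetical, the inputs are the numbers quoted in the article, and because "roughly 4 minutes" is itself an approximation, the derived values are estimates rather than published specifications.

```python
# Figures quoted in the article. "Roughly 4 minutes" is approximate,
# so everything derived below is an estimate, not a published spec.
CONTEXT_TOKENS = 2048
CONTEXT_SECONDS = 4 * 60
AVERAGE_LATENCY_MS = 120
THEORETICAL_LATENCY_MS = 65


def implied_token_rate(context_tokens: int, context_seconds: float) -> float:
    """Return the tokens per second of audio implied by a context window."""
    if context_seconds <= 0:
        raise ValueError("context_seconds must be positive")
    return context_tokens / context_seconds


def tokens_elapsed(latency_ms: float, tokens_per_second: float) -> float:
    """Return how many tokens' worth of audio pass during a given delay."""
    return latency_ms / 1000 * tokens_per_second


if __name__ == "__main__":
    rate = implied_token_rate(CONTEXT_TOKENS, CONTEXT_SECONDS)
    print(f"Implied rate: {rate:.2f} tokens per second of speech")
    for label, latency in (
        ("average", AVERAGE_LATENCY_MS),
        ("theoretical", THEORETICAL_LATENCY_MS),
    ):
        spent = tokens_elapsed(latency, rate)
        print(f"{label} latency {latency} ms is about {spent:.2f} tokens")
```

With the article's figures, the implied rate is about 8.5 tokens per second of speech, so the 120 ms average delay corresponds to roughly one token of audio.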