DeepMind, a subsidiary of Google, continues to push the boundaries of audio generation technology for digital assistants and AI tools used worldwide. The company’s goal is to generate realistic, natural-sounding audio that helps people communicate, share information, and express emotion.
Recently, DeepMind unveiled two new dialogue-generation features: NotebookLM Audio Overviews and Illuminate. The former transforms uploaded documents into a dialogue between two AI hosts, who summarize the content and draw associative connections. The latter converts scientific articles into understandable discussions, making the information more accessible.
Building on advancements in audio generation research, Google DeepMind has developed models capable of reproducing conversations between multiple speakers, using technologies such as SoundStream and AudioLM. SoundStream compresses audio without compromising quality, transforming it into tokens that preserve important characteristics such as timbre and intonation. AudioLM treats speech generation as a language-modeling task, enabling flexible handling of a wide range of sounds.
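The two-stage idea described above, a codec that turns a waveform into discrete tokens and a language model that generates over those tokens, can be sketched with a toy example. Everything here is a simplified stand-in: the uniform quantizer is not SoundStream (which learns its codebooks), and the "language model" merely repeats the last token where a real model like AudioLM would predict each next token from the whole prefix.

```python
import numpy as np

CODEBOOK_SIZE = 256  # assumed toy vocabulary; real codecs learn theirs

def encode_to_tokens(waveform):
    """Toy stand-in for a neural codec: uniformly quantize
    samples in [-1, 1] to discrete token IDs."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return np.round((clipped + 1.0) / 2.0 * (CODEBOOK_SIZE - 1)).astype(int)

def decode_from_tokens(tokens):
    """Inverse of the toy quantizer: map token IDs back to samples."""
    return tokens / (CODEBOOK_SIZE - 1) * 2.0 - 1.0

def continue_tokens(prefix, n_new):
    """Placeholder 'language model' over audio tokens: repeats the
    final token; a real model predicts each token from the prefix."""
    return np.concatenate([prefix, np.full(n_new, prefix[-1], dtype=int)])

wave = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))  # stand-in "recording"
tokens = encode_to_tokens(wave)                    # audio -> token sequence
extended = continue_tokens(tokens, 50)             # generate a continuation
audio_out = decode_from_tokens(extended)           # tokens -> audio again
```

The key property the sketch preserves is that, once audio lives in a discrete token space, generation becomes next-token prediction, the same framing used for text.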
In scaling the models to multi-speaker dialogue generation, DeepMind improved audio-encoding efficiency, compressing sound down to 600 bits per second. Moreover, the model can generate a two-minute dialogue in under three seconds, more than 40 times faster than real time.
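These two figures are easy to sanity-check. At 600 bits per second, a two-minute dialogue occupies only a few kilobytes of token data, and producing 120 seconds of audio in about 3 seconds of wall-clock time gives the quoted speedup; the constant names below are ours, chosen for illustration.

```python
# Figures reported in the article; variable names are illustrative.
BITS_PER_SECOND = 600        # codec bitrate
DIALOGUE_SECONDS = 2 * 60    # a two-minute dialogue
GENERATION_SECONDS = 3       # approximate wall-clock generation time

total_bits = BITS_PER_SECOND * DIALOGUE_SECONDS
total_kib = total_bits / 8 / 1024
speedup = DIALOGUE_SECONDS / GENERATION_SECONDS

print(f"{total_bits} bits (~{total_kib:.1f} KiB) of token data")
print(f"{speedup:.0f}x faster than real time")
```

So the entire dialogue fits in roughly 9 KB of compressed tokens, which is what makes generating it in seconds feasible.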
The model was trained on hundreds of thousands of hours of audio, including conversations with voice actors, to accurately capture natural pauses and intonation. This training approach enables the model to produce realistic dialogue, transitioning seamlessly between speakers and delivering studio-quality sound.
In line with the principles of responsible AI development, DeepMind has implemented SynthID technology to embed watermarks in audio files generated by its model, helping deter unauthorized use of the technology.
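SynthID's actual scheme is not public, but the general idea of an imperceptible, key-detectable audio watermark can be illustrated with a classic spread-spectrum toy: add a key-seeded pseudorandom pattern at low amplitude, then detect it later by correlating against the same pattern. All names and parameters here are our own assumptions, not DeepMind's method.

```python
import numpy as np

STRENGTH = 0.05  # watermark amplitude, small relative to the signal

def embed_watermark(samples, key):
    """Add a key-seeded pseudorandom +/-1 pattern at low amplitude.
    A simple spread-spectrum toy, not SynthID's (unpublished) scheme."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(samples))
    return samples + STRENGTH * pattern

def detect_watermark(samples, key, threshold=0.5):
    """Correlate the audio with the same key-seeded pattern; a clearly
    positive normalized score indicates the watermark is present."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(samples))
    score = float(np.mean(samples * pattern)) / STRENGTH
    return score > threshold

audio = np.sin(np.linspace(0.0, 440.0 * 2.0 * np.pi, 48_000))  # 1 s toy tone
marked = embed_watermark(audio, key=42)
```

Without the key, the pattern looks like faint noise, which is why such marks survive casual listening while remaining detectable by the party that embedded them.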
The future holds promise for further advances in sound quality and precision, extending to video applications. Integrating these innovations with the Gemini family of models opens exciting opportunities for creating accessible and inclusive content, particularly for educational projects and multimodal solutions.