Google has introduced the Lyra V2 audio codec, which uses machine learning methods to achieve the highest possible speech quality over very slow communication channels. The new version features a new neural network architecture, support for additional platforms, extended bitrate control, improved performance, and higher audio quality. The reference implementation is written in C++ and distributed under the Apache 2.0 license.
At low bitrates, Lyra delivers significantly better voice quality than traditional codecs that rely on digital signal processing methods. To keep voice quality high when only a limited amount of information can be transmitted, Lyra supplements conventional audio compression and signal conversion techniques with a machine-learning speech model, which recreates the missing information from typical speech characteristics.
The codec consists of an encoder and a decoder. The encoder extracts voice data parameters every 20 milliseconds, compresses them, and sends them to the recipient over the network at a bitrate of 3.2 kbps to 9.2 kbps. On the receiving side, the decoder uses a generative model to recreate the original speech signal from the transmitted audio parameters. These parameters include log mel spectrograms, which capture speech energy in different frequency ranges and are built to reflect a model of human auditory perception.
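To make the figures above concrete, the following sketch works out how many bits the encoder can spend on each 20-millisecond frame at the two ends of the stated bitrate range. It is plain arithmetic, not part of the Lyra code; the function and constant names are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only; these names are not part of the Lyra API.
FRAME_MS = 20  # the encoder emits one packet of parameters every 20 ms


def frames_per_second(frame_ms: int = FRAME_MS) -> int:
    """Number of frames produced per second of speech."""
    return 1000 // frame_ms


def bits_per_frame(bitrate_bps: int, frame_ms: int = FRAME_MS) -> int:
    """Bit budget available for a single frame at the given bitrate."""
    return bitrate_bps * frame_ms // 1000


if __name__ == "__main__":
    for bitrate_bps in (3200, 9200):  # 3.2 kbps and 9.2 kbps
        budget = bits_per_frame(bitrate_bps)
        print(f"{bitrate_bps / 1000:.1f} kbps -> {budget} bits "
              f"({budget // 8} bytes) per {FRAME_MS} ms frame")
```

At 50 frames per second, the lowest bitrate leaves 64 bits (8 bytes) per frame and the highest 184 bits (23 bytes), which shows how little data the decoder's generative model has to work with.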
Lyra V2 uses a new generative model based on the SoundStream convolutional neural network, which has low computing requirements and can therefore decode in real time even on low-power systems. The model used to generate audio was trained on several thousand hours of voice recordings in more than 90 languages. TensorFlow Lite is used to run the model. The proposed implementation is fast enough to encode and decode speech on low-end smartphones.