Mozilla Unveils Whisperfile Speech Recognition Tool

Mozilla has introduced Whisperfile, a speech recognition tool that provides a standalone, high-performance implementation of the Whisper machine learning model developed and open-sourced by OpenAI. The tool is based on whisper.cpp, a C/C++ implementation of the Whisper model created by Georgi Gerganov (the author of llama.cpp). The code is written in C++ and distributed under the MIT license.

Whisperfile is developed by the Mozilla Ocho team and complements the llamafile project, which is designed to create universal executable files for running large language models (LLMs). By analogy with llamafile, the Whisperfile project makes it possible to take a file with model parameters in GGUF format and produce a single executable that can be run on various operating systems on hardware with AMD64 and ARM64 processors. The resulting code is linked against the Cosmopolitan standard C library, which allows building application binaries that run on Linux, FreeBSD, macOS, OpenBSD, NetBSD, and Windows.
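As an illustration of this single-binary approach, the sketch below fetches a whisperfile build and runs it directly on the current host; the same file is meant to work across the supported operating systems. The download URL, file name, and the `--help` flag are placeholders and assumptions, not official details from the project.

```python
# Minimal sketch of running a single whisperfile binary on the current host.
# The URL and file name below are placeholders, not official download locations.
import os
import stat
import subprocess
import urllib.request

WHISPERFILE_URL = "https://example.com/whisper-tiny.en.llamafile"  # placeholder
LOCAL_PATH = "whisper-tiny.en.llamafile"

# Fetch the executable; thanks to Cosmopolitan libc the same file is expected
# to run on Linux, *BSD, macOS and Windows on AMD64/ARM64 hardware.
urllib.request.urlretrieve(WHISPERFILE_URL, LOCAL_PATH)

# Mark it executable on POSIX systems.
os.chmod(LOCAL_PATH, os.stat(LOCAL_PATH).st_mode | stat.S_IEXEC)

# Print the built-in usage text; "--help" is assumed to follow
# whisper.cpp-style command line conventions.
subprocess.run([f"./{LOCAL_PATH}", "--help"], check=True)
```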

When the executable file is launched, a file with recorded speech in WAV, MP3, OGG, or FLAC format is passed as an input parameter, and the recognized text is produced as output. In practice, the project can be used for tasks such as generating subtitles for video, keeping logs of voice and video calls, converting recorded voice material into text, and organizing voice input. With Whisperfile, such tasks can be solved on a local system without contacting external services.
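A minimal sketch of local transcription from a script is shown below. The flag names (`-m` for the model weights, `-f` for the input audio) are assumed to follow whisper.cpp conventions and may differ in whisperfile, so the built-in help should be consulted. Because everything runs locally, the audio never leaves the machine.

```python
# Minimal sketch: transcribe a local audio file with a whisperfile binary.
# Flag names (-m, -f) are assumed to follow whisper.cpp conventions and may differ.
import subprocess

AUDIO_FILE = "meeting.wav"          # WAV, MP3, OGG or FLAC input
MODEL_FILE = "ggml-tiny.en.gguf"    # placeholder model file name
BINARY = "./whisperfile"            # placeholder binary name

result = subprocess.run(
    [BINARY, "-m", MODEL_FILE, "-f", AUDIO_FILE],
    capture_output=True,
    text=True,
    check=True,
)

# The recognized text is expected on standard output; timestamps and other
# details depend on the options passed to the binary.
print(result.stdout)
```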

Operation in the role of an HTTP server that processes speech recognition requests via a Web API is additionally supported. GPU acceleration and AVX instructions can be used to speed up work with the model. The tool can also output confidence scores that make it possible to color recognized words according to how reliably they were recognized.
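The sketch below shows how a client might send an audio file to a locally running speech recognition server. The port, the `/inference` path, and the multipart field names are assumptions modeled on whisper.cpp's example server; the actual Web API exposed by whisperfile may differ.

```python
# Minimal sketch of a client for a locally running speech recognition server.
# The endpoint path, port and field names are assumptions, not a documented API.
import requests

SERVER_URL = "http://127.0.0.1:8080/inference"  # hypothetical endpoint

with open("call-recording.mp3", "rb") as audio:
    response = requests.post(
        SERVER_URL,
        files={"file": ("call-recording.mp3", audio, "audio/mpeg")},
        data={"response_format": "json"},  # hypothetical parameter
    )

response.raise_for_status()
# The exact shape of the JSON reply depends on the server implementation.
print(response.json())
```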


The Whisper model used is trained on 680 thousand hours of speech data covering various subject areas and languages (about two thirds of the data is in English). The model copes well with accented speech, recognizes technical jargon, supports automatic language detection, and can work in the presence of background noise. For English speech, the system demonstrates a level of reliability and accuracy close to human recognition. In addition to transcribing speech into text, the model can also be used to translate speech into another language.
