Google has announced the open-sourcing of Magika, a project that determines the type of a file's content by analyzing the data in the file. Magika can accurately identify programming languages, compression methods, installation packages, executable code, markup types, and audio, video, document and image formats. The project's tooling and the ready-made machine learning model are published under the Apache 2.0 license.
Magika differs from similar projects that determine the MIME type from file contents in its use of machine learning methods, its high performance and its high detection accuracy. The model was trained with the Keras framework on 25 million sample files and recognizes 116 data types with an accuracy of at least 99%. The model is distributed in the ONNX format and is only 1 MB in size. The use of deep learning increased detection accuracy by 50% compared with the system based on manually written rules that Google used previously.
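For contrast, the conventional signature-based approach to content detection can be sketched as follows. This is only an illustration of hand-written rules in general, not Magika's code and not the rule system Google used internally; the `SIGNATURES` table and the `guess_mime` function are hypothetical names introduced for this example, and the table is deliberately tiny.

```python
# Minimal sketch of signature ("magic bytes") detection.
# SIGNATURES and guess_mime are hypothetical names used only for illustration.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x1f\x8b", "application/gzip"),
]


def guess_mime(data: bytes) -> str:
    """Return the MIME type of the first signature that matches the start of data."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    # No rule matched: fall back to the generic binary type.
    return "application/octet-stream"


if __name__ == "__main__":
    print(guess_mime(b"%PDF-1.7\n..."))
    # Source code has no fixed leading bytes, so a rule table like this cannot name it.
    print(guess_mime(b"def main():\n    pass\n"))
```

Every new format needs another manually written rule, and textual content such as source code has no fixed signature at all, which is the kind of case where a trained model has room to improve on rules.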
At Google, the system is used to classify files in Gmail, Drive, Code Insight and Safe Browsing during security checks and checks for compliance with service rules. Work is underway to integrate Magika into the VirusTotal platform as a first-stage filter for files before they are passed to specialized analyzers. Google's Magika deployment scans several million files per second and several hundred billion files per week. Once the model is loaded, producing a result takes 5-6 ms when tested on a single CPU core. Detection time is almost independent of file size.
For use in other projects, the following have been prepared: a command-line utility, a Python package and