Facebook (banned in the Russian Federation) has published the work of the NLLB (No Language Left Behind) project, which aims to create a universal machine learning model for translating text directly from one language to another, bypassing intermediate translation into English. The proposed model covers more than 200 languages, including rare languages of African and Australian peoples. The ultimate goal of the project is to provide a means of communication for any people, regardless of the language they speak.
The model is available under the Creative Commons BY-NC 4.0 license, which permits copying, redistribution, use in your own projects and the creation of derivative works, provided that attribution is given, the license is retained and the use is strictly non-commercial. The tooling for working with the models is distributed under the MIT license. To stimulate development based on the NLLB model, it was decided to allocate 200 thousand dollars for grants to researchers.
To simplify the creation of projects that use the proposed model, the code of the applications used to test and evaluate model quality (FLORES-200, NLLB-MD, Toxicity-200) has also been released, along with the code for training models and the encoders based on the LASER3 library (Language-Agnostic Sentence Representation). The final model is offered in two versions: a full one and a reduced one. The reduced version requires fewer resources and is suitable for testing and for use in research projects.
Unlike other translation systems based on machine learning, Facebook's solution is notable in that a single common model is offered for all 200 languages, so separate models for each language are not required. Translation is performed directly from the source language into the target language, without intermediate translation into English. For building universal translation systems, a LID (language identification) model is also offered, which determines the language in use. That is, the system can automatically recognize the language in which information is provided and translate it into the user's language.
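To get a sense of what "one model for all languages" replaces, it helps to count translation directions. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the round figure of 200 languages is taken from the text.

```python
def direct_directions(n_languages: int) -> int:
    """Ordered (source, target) pairs with source != target."""
    return n_languages * (n_languages - 1)


def pivot_directions(n_languages: int) -> int:
    """Directions needed when every translation is routed through a single
    pivot language (for example English): into the pivot and out of it."""
    return 2 * (n_languages - 1)


n = 200  # round number of supported languages, as stated in the article
print(direct_directions(n))  # 39800
print(pivot_directions(n))   # 398
```

Covering every direction with separate pairwise models would mean tens of thousands of models. A pivot language cuts that number sharply, but most translations then take two steps, and errors from the first step carry over into the second. A single multilingual model serves every direction in one step.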
Translation is supported in any direction between any of the 200 supported languages. To confirm translation quality between arbitrary languages, the FLORES-200 reference test set was prepared. It showed that the NLLB-200 model exceeds previously proposed research systems based on machine learning by 44% in translation quality, as measured by BLEU metrics, which compare machine translation with a reference human translation. For rare African languages and Indian dialects, the advantage reaches 70%. Translation quality can be assessed firsthand on a specially prepared demonstration site.
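For readers unfamiliar with BLEU, the following is a minimal sketch of the idea: clipped n-gram precision against a reference translation, combined with a brevity penalty. It assumes whitespace tokenization, a single reference and no smoothing. The project's own evaluation tooling may tokenize and aggregate differently, so the function is not expected to reproduce the published scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU in the range 0..1 (single reference, no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))
```

A candidate identical to the reference scores 1.0, and a candidate that shares no words with it scores 0.0. Scores are usually reported multiplied by 100 and averaged over a whole test set rather than a single sentence.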