Internet Filled with Low-Quality Machine Translation

A recent study by the Amazon Web Services AI Lab (AWS AI Lab) found that a significant share of the content on the Internet, especially in languages spoken in Africa and the countries of the Global South, consists of machine-translated text.

More than half of the sentences on the Internet have been translated into two or more languages, often with errors caused by poor-quality machine translation. This raises concerns about the training of large language models (LLMs).

AWS noted that interest in the topic arose after colleagues of the Amazon researchers, who work in machine translation and are native speakers of low-resource languages, pointed out that a large amount of the content in their native languages had been created using machine translation.

The study analyzed 6.38 billion sentences collected from the Internet. It found that 57.1% of the sentences existed in three or more languages. This is especially true of languages spoken in Africa and in other regions where little content is available, which leads to poor translation quality.

A sentence is more likely to have been translated into French than into a low-resource language, since far more data is available in French. High-resource languages, such as English or French, had an average parallelism of 4 (a sentence has translation equivalents in three other languages), while low-resource languages, such as the African languages Wolof and Xhosa, had an average parallelism of 8.6. In addition, less common languages tended to have much worse translations.

Translation equivalents are words, phrases, or sentences in one language that have a counterpart in another language conveying the same meaning. For example, the English expression “Good morning” corresponds to the Russian phrase “Доброе утро”. The two phrases differ in form, but they convey the same greeting in the appropriate cultural and linguistic context.
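The parallelism figures above can be illustrated with a minimal Python sketch. It is not the researchers' pipeline, which is not described here. It assumes that each group of translation equivalents is represented simply as the set of language codes in which the sentence appears. The toy data and the function names `average_parallelism` and `multiway_share` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: each set is one group of translation equivalents,
# recorded as the language codes in which the same sentence was found.
toy_tuples = [
    {"en", "fr"},
    {"en", "fr", "ru", "wo"},
    {"en", "fr", "wo", "xh"},
    {"fr", "wo", "xh"},
]


def average_parallelism(translation_tuples):
    """For each language, average the size of the groups its sentences belong to."""
    sizes = defaultdict(list)
    for langs in translation_tuples:
        for lang in langs:
            sizes[lang].append(len(langs))
    return {lang: sum(values) / len(values) for lang, values in sizes.items()}


def multiway_share(translation_tuples, min_langs=3):
    """Share of all sentences that sit in groups covering at least min_langs languages."""
    total = sum(len(langs) for langs in translation_tuples)
    multi = sum(len(langs) for langs in translation_tuples if len(langs) >= min_langs)
    return multi / total if total else 0.0


avg = average_parallelism(toy_tuples)
share = multiway_share(toy_tuples)

for lang in sorted(avg):
    print(f"{lang}: average parallelism {avg[lang]:.2f}")
print(f"share of sentences in 3+ languages: {share:.1%}")
```

In this toy data a French sentence belongs on average to a group of 3.25 languages, and 11 of the 13 sentences sit in groups of three or more languages. The study reports the same kind of statistics at the scale of billions of sentences.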

The study also found that in languages with a high degree of multi-way parallelism, the translated sentences tend to be short and more predictable, typically 5 to 10 words long. Most of them come from articles that the researchers characterized as low quality and as requiring no special knowledge or effort to create.

The researchers emphasized that this selection of short sentences from low-quality articles is explained by a desire to generate advertising revenue through mass machine translation into low-resource languages. Such practices raise questions about the development of large language models in these languages.

The study notes that modern AI requires huge volumes of training data, and that such problems with the quality and accuracy of machine translation can lead to less capable models that make more errors.