Content created with artificial intelligence is beginning to fill the Internet, and that could be bad news for future AI models. Language models such as ChatGPT are trained on content found on the Internet. As AI generates more and more "synthetic" content, an engineering problem known as "model collapse" can arise.
Filtering synthetic data out of training sets is becoming an important field of research, and it will likely grow as AI-generated content spreads across the Internet. The ouroboros is an ancient symbol of a snake devouring its own tail. In the AI era, that symbolism takes on a new, acute meaning: as content produced by AI language models begins to fill the Internet, it brings many errors along with it.
The Internet is the very source these language models are trained on. In other words, AI "consumes" itself. A model can end up training on error-ridden data until whatever it tries to produce becomes complete nonsense. This is what AI researchers call "model collapse." In one recent study, a language model was used to generate text about English architecture. After the model was repeatedly retrained on this synthetic text, the output of the tenth generation was completely meaningless.
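A toy numerical sketch (not the study's actual setup) can illustrate the mechanism: fit a simple generative model to some data, sample from it, refit on the samples, and repeat. With each generation, estimation error compounds and the learned distribution drifts and narrows, a statistical analogue of the textual degradation described above. The choice of a Gaussian model and ten generations here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: samples from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(1, 11):
    # "Train": fit a Gaussian to whatever data we currently have.
    mu, sigma = data.mean(), data.std()
    # "Generate": replace the training set with the model's own samples.
    data = rng.normal(loc=mu, scale=sigma, size=200)
    print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")

# The fitted standard deviation tends to drift away from the true value
# of 1.0 (typically shrinking), because each generation re-learns from
# its predecessor's sampling noise rather than from the original data.
```

Run for enough generations, the model ends up describing its own artifacts rather than the world, which is the essence of the collapse.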
To train new AI models effectively, companies need data that has not been contaminated by synthetically generated information. Alex Dimakis, co-director of the National AI Institute for Foundations of Machine Learning, says that a small collection of high-quality data can outperform a large synthetic one. For now, engineers must sift through their data to make sure a model is not training on synthetic data it generated itself.
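One way such sifting could work, in principle, is a classifier-based filter: train a model on examples known to be human-written versus known to be synthetic, then keep only candidate documents scored as likely human. The sketch below uses scikit-learn with a tiny hypothetical corpus; real synthetic-text detection is far harder and notoriously unreliable, so treat this as an illustration of the pipeline shape, not a working detector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled corpus: 0 = known human-written, 1 = known synthetic.
# A real pipeline would need orders of magnitude more labeled data.
train_texts = [
    "The cathedral's nave, begun in 1180, mixes Norman and Gothic work.",
    "Parish records describe repairs to the tower after the 1703 storm.",
    "The structure is a structure with architectural architecture of note.",
    "It is widely considered to be considered widely as very historic.",
]
train_labels = [0, 0, 1, 1]

# Featurize with character n-grams and fit a simple classifier.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X, train_labels)

# Keep only candidate documents the classifier scores as likely human,
# so they can be admitted into the next model's training set.
candidates = [
    "Parish accounts from 1712 list payments to the mason Thomas Hill.",
    "The building is noted for being notable among notable buildings.",
]
synthetic_prob = clf.predict_proba(vectorizer.transform(candidates))[:, 1]
kept = [doc for doc, p in zip(candidates, synthetic_prob) if p < 0.5]
print(kept)
```

The design tradeoff is the one Dimakis points to: an aggressive filter throws away some genuine human text, but a permissive one lets synthetic data leak back into training, and a smaller, cleaner corpus is generally the safer side of that bargain.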