Researchers from the UK and Canada have issued a warning about the dangers of using data generated by artificial intelligence (AI) models to train other AI models. A paper published on arXiv, "The Curse of Recursion: Training on Generated Data Makes Models Forget", shows that this approach leads to a gradual deterioration in the quality and realism of the generated data, ultimately resulting in "model collapse".
Model collapse is a degenerative process in which, over time, models forget the true distribution of the data and begin to misinterpret what they take to be real, reinforcing their own distorted beliefs. The phenomenon is related to catastrophic forgetting and data poisoning, both of which can also undermine AI training. In catastrophic forgetting, a model "forgets" previously learned data as it learns new information; data poisoning is the malicious introduction of false information into data sources.
The researchers ran experiments with both text and image AI models and found that training these models on data generated by other AI models leads to a rapid decline in the quality of the data they produce. Ilia Shumailov of the University of Oxford, one of the paper's authors, said: "We were surprised to observe how quickly model collapse happens: models can rapidly forget most of the original data from which they first learned."
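To get an intuition for how this recursive degradation plays out, here is a minimal, purely illustrative sketch in Python. It is not the paper's actual experiments (those involve large language and image models); it simply fits a toy Gaussian "model" to data and trains each next generation only on samples drawn from the previous fit. The sample size, number of generations, and random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for generation in range(30):
    # "Train" a model on the current data: here the model is just a
    # Gaussian fitted by maximum likelihood (sample mean and std).
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")

    # The next generation sees only data sampled from this fitted model;
    # no fresh human-generated data is ever added back in.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```

Because every generation is estimated from a finite sample of the one before it, the estimation errors compound: the mean drifts and the spread tends to shrink over the generations, so later "models" bear less and less resemblance to the original data.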
The cause of this phenomenon is that AI models tend to overrepresent common data while misunderstanding or misrepresenting less common data, so rare events and minority experiences gradually vanish from the datasets, with harmful consequences. Ross Anderson of the University of Cambridge and the University of Edinburgh, another author of the paper, said: "Just as we have clogged the oceans with plastic garbage and filled the atmosphere with carbon dioxide, so we are going to fill the Internet with nonsense."
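The loss of rare events can be seen in an even simpler sketch, again purely illustrative and assuming a toy categorical distribution rather than anything from the paper: each generation re-estimates category frequencies from a finite sample of the previous generation's output, and the rare category tends to disappear entirely.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "vocabulary" in which one event is common and one is rare (the tail).
categories = np.array(["common", "uncommon", "rare"])
probs = np.array([0.90, 0.09, 0.01])

for generation in range(10):
    # Sample a finite training set from the current model's distribution.
    sample = rng.choice(categories, size=100, p=probs)
    # "Train" the next model: re-estimate the frequencies from that sample.
    probs = np.array([(sample == c).mean() for c in categories])
    print(f"generation {generation}:", dict(zip(categories, probs.round(3))))
```

With only a finite sample per generation, the rare category sooner or later draws zero counts, and once its estimated probability is zero it never comes back: exactly the kind of disappearance of rare events the authors describe.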
To combat model collapse, the researchers suggest several measures. One is to preserve the original human-generated data so that it remains available for training future models. Another is to ensure that minority groups and rare events stay represented in datasets. Finally, the quality and provenance of the data used to train AI must be monitored. Otherwise, the Internet could fill up with digital noise and become useless.
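As a rough illustration of the first suggestion, the earlier Gaussian sketch can be modified so that a portion of the preserved human-generated data is mixed back into every generation's training set. The 50/50 mix below is an arbitrary assumption for illustration, not a ratio evaluated or recommended in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# The original human-generated data is preserved and never discarded.
human_data = rng.normal(loc=0.0, scale=1.0, size=1000)

data = human_data
for generation in range(30):
    mu, sigma = data.mean(), data.std()
    # Half of each new training set is synthetic, half is drawn from the
    # preserved human data, which keeps the distribution anchored.
    synthetic = rng.normal(loc=mu, scale=sigma, size=500)
    real = rng.choice(human_data, size=500, replace=False)
    data = np.concatenate([real, synthetic])

print(f"after 30 generations: mean={data.mean():+.3f}, std={data.std():.3f}")
```

In this toy setting the anchoring keeps the mean near 0 and the spread near 1, whereas the purely recursive version drifts; it is only a sketch of the intuition, not the mitigation studied in the paper.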
Sources:
arXiv.org: https://arxiv.org/abs/2305.17493