Beware of AI 'model collapse': How training on synthetic data pollutes the next generation
To feed the endless appetite of generative artificial intelligence (gen AI) for data, researchers have in recent years increasingly tried to create “synthetic” data, which is similar to the human-created works that have been used to train AI models but is itself created by AI.
The synthetic-data movement is a vibrant one, partly because of copyright-infringement issues surrounding human-created training data, and partly because training better and better models may eventually require more data than humans have produced.
For example, in Meta’s flagship open-source model, Llama 3.1 405B, which the company introduced last week, the researchers made extensive use of synthetic data to “fine-tune” the model and to supplement the human feedback they gathered.
There’s a catch, though. Oxford University scholars warn in the most recent issue of the prestigious science journal Nature that using such synthetic data to train gen AI can drastically degrade the accuracy of the models, to the point of making them useless.
In the paper, lead author Ilia Shumailov and his team describe what they call “model collapse,” and how it worsens each time one generation of models feeds its synthetic output to the next.
“Model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation,” Shumailov’s team wrote. “Being trained on polluted data, they then mis-perceive reality.”
Specifically, the models lose track of less-common facts over generations, becoming more and more generic. As they do so, the answers they produce become increasingly irrelevant to the questions they are asked, turning into what is effectively gibberish. “Models start forgetting improbable events over time, as the model becomes poisoned with its own projection of reality,” they write.
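To see the dynamic in miniature, here is a toy simulation, not an experiment from the paper: treat a “model” as nothing more than a distribution over tokens that is re-estimated, each generation, from samples drawn out of the previous generation’s estimate. Long-tail tokens that happen to draw zero samples disappear for good, and the distribution grows steadily more generic.

```python
# Toy illustration of model collapse (not the paper's experiment): each
# "generation" re-estimates a token distribution purely from samples drawn
# out of the previous generation's estimate. Long-tail tokens that draw
# zero samples vanish permanently, so the distribution grows more generic.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1_000
samples_per_generation = 2_000

# Generation 0: "human" data follows a long-tailed, Zipf-like distribution.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(1, 11):
    # Train on the previous generation's output: draw samples from its
    # distribution, then re-estimate token probabilities from the counts.
    samples = rng.choice(vocab_size, size=samples_per_generation, p=probs)
    counts = np.bincount(samples, minlength=vocab_size)
    probs = counts / counts.sum()
    surviving = int((probs > 0).sum())
    print(f"generation {generation:2d}: {surviving} of {vocab_size} tokens survive")
```

Each pass prints a smaller count of surviving tokens: the rare “facts” are the first to go, which is the same pattern the Oxford team describes in full-scale language models.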
The authors wrote that the findings “must be taken seriously,” as gen AI risks a compounding process of deterioration the more that the internet is flooded with the output of AI models that then gets re-used. “The use of LLMs at scale to publish content on the internet will pollute the collection of data to train their successors: data about human interactions with LLMs will be increasingly valuable,” they wrote.
To arrive at that conclusion, the authors conducted an experiment using Meta’s open-source AI model, OPT, for “open pre-trained transformer,” introduced in 2022. It is similar in structure to OpenAI’s GPT-3, but much smaller, with only 125 million neural parameters, or “weights.”
Shumailov’s team used the Wikitext2 dataset of Wikipedia articles to “fine-tune” OPT, meaning to re-train it with additional data, a very common practice in gen AI. The authors then used the fine-tuned OPT to generate synthetic copies of the Wikitext data, and they fed that fake data into the next round of fine-tuning, a kind of cannibalistic use of one model’s output as the next model’s input.
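In code, that setup looks roughly like the following sketch, which uses the Hugging Face transformers and datasets libraries. It is a condensed approximation of the pipeline the paper describes, not the authors’ published code, and the hyperparameters, prompt seeding, and corpus sizes are illustrative assumptions.

```python
# Condensed sketch of the recursive fine-tuning loop (not the authors' code).
# Generation 0 is fine-tuned on real Wikitext-2; every later generation is
# fine-tuned only on text produced by its predecessor. Hyperparameters,
# prompt seeding, and corpus sizes here are illustrative assumptions.
from datasets import Dataset, load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

def fine_tune(texts, output_dir):
    """Fine-tune a fresh OPT-125m on the given texts and return the model."""
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    dataset = Dataset.from_dict({"text": texts}).map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"])
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
    trainer.train()
    return model

def generate_corpus(model, prompts, max_new_tokens=128):
    """Produce a synthetic replacement corpus from a fine-tuned model."""
    synthetic = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, do_sample=True,
                                max_new_tokens=max_new_tokens)
        synthetic.append(tokenizer.decode(output[0], skip_special_tokens=True))
    return synthetic

# Real Wikipedia text for generation 0; short snippets serve as prompts.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in wikitext["text"] if t.strip()]
prompts = [t[:64] for t in texts[:1000]]

for generation in range(5):
    model = fine_tune(texts, output_dir=f"opt125m-gen{generation}")
    texts = generate_corpus(model, prompts)  # becomes the next training set
```

Even at 125 million parameters, five full fine-tuning rounds take real GPU time; the point of the sketch is the data flow, with each generation’s training set consisting of nothing but the previous generation’s output.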
The authors provided examples of what happens after five rounds of using each fine-tuned model as the source for teaching the next: by generation five, the output is complete gibberish. At the same time, they wrote, specific errors of fact became more common with each generation: “We find that, over the generations, models […] start introducing their own improbable sequences, that is, errors.”
Reflecting on what can be done to avoid model collapse, the authors ended their paper on an ominous note. It’s essential to preserve the original, human-created training data and to maintain continued access to new human-created data, but doing so becomes harder as synthetic output from gen AI fills up more and more of the internet, turning the pre-AI web into a kind of lost archive.
They warned, “It may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the internet before the mass adoption of the technology or direct access to data generated by humans at scale.”
The journal’s editors summed up the problem perhaps most succinctly with the old computing adage they placed on the cover: “garbage in, garbage out.”