Is model collapse inevitable? If you are happily using AI these days — and by that I mean Large Language Models (LLMs) like ChatGPT — you likely don’t know that the very foundation of these models is at risk.

What do I mean? To train an LLM, you essentially download the entire internet, spend millions of dollars on compute time and GPU chips, and out pops an LLM.

For a time, roughly 2018 to 2023, the more money and data you had, the better the results. This pattern became known as the scaling laws, though they are not laws of physics so much as empirical observations.
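To make that concrete, here is a toy sketch of what a scaling law says, with made-up constants rather than the fit from any published paper: loss falls off as a power law in both model size and training data, which is why the recipe for years was simply to scale both up.

```python
# Toy illustration of a scaling law: loss falls as a power law in both
# parameter count (N) and training tokens (D). The constants here are
# illustrative only, not the fit from any particular paper.

def predicted_loss(n_params: float, n_tokens: float,
                   e: float = 1.7, a: float = 400.0, b: float = 410.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    return e + a / n_params ** alpha + b / n_tokens ** beta

# Bigger model plus more data -> lower predicted loss, with diminishing returns.
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"params={n:.0e}, tokens={d:.0e}, loss={predicted_loss(n, d):.2f}")
```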

The limits of pure scaling have become apparent over the last two years, but even more worrying is something else: model collapse. What is that? Until the introduction of ChatGPT in late 2022, almost all internet content was human-generated. It may not have been high quality, but it was largely written by humans. That is no longer the case. Estimates vary, but anyone can see that an increasing amount of new written content now comes from LLMs rather than people.

The underlying training data has changed. We now have LLMs training on data generated by LLMs, a form of synthetic data. So what’s the problem?

A July 2024 academic paper out of Oxford University showed that if you train LLMs on data generated by other LLMs, the quality and diversity of those models degrade over successive generations. Basically, they get worse because the underlying data is worse. The paper even included the maths to show why this happens under current training approaches.
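The core mechanism is easiest to see in a toy setting. Here is a minimal simulation in that spirit (my own sketch, not the authors' code): each generation fits a plain Gaussian to a finite sample drawn from the previous generation's model. Estimation error compounds, the spread of the distribution shrinks, and the rare, diverse tails are the first thing to vanish.

```python
# Minimal model-collapse simulation: each generation trains on data produced
# by the previous generation. With finite samples, the fitted distribution
# steadily loses variance, i.e. diversity collapses.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # the original "human" data distribution
n = 20                 # small samples per generation exaggerate the effect

for gen in range(1, 201):
    data = rng.normal(mu, sigma, n)        # data generated by the current model
    mu, sigma = data.mean(), data.std()    # the next model is fitted on it
    if gen % 50 == 0:
        print(f"generation {gen}: mean={mu:+.3f}, std={sigma:.3f}")
```

Real LLM training is vastly more complicated than fitting a Gaussian, but the compounding of estimation error generation after generation is the same basic worry.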

That’s a big deal. It might take a few years for this to fully reveal itself, but if true, it means LLMs themselves are in a race to improve even as their foundations in human content fade away.

What is the pushback? Well, not all synthetic data is created equal. You can have LLMs create coding examples that are then tested and executed, so at least you know the code works. But it’s still AI-generated.
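Here is a rough sketch of what that verification loop might look like. The candidate strings below stand in for model output (a hypothetical placeholder, not a real generation call); the point is the filter that only keeps code that actually passes its tests.

```python
# Sketch of "verified" synthetic code data: execute each LLM-generated
# candidate and keep it only if it passes known test cases.

def passes_tests(source: str, tests: list[tuple[tuple, int]]) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)          # define the candidate function
        fn = namespace["add"]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

# Pretend these came back from a model asked to implement add(a, b).
candidates = [
    "def add(a, b):\n    return a - b",   # wrong: filtered out
    "def add(a, b):\n    return a + b",   # correct: kept as training data
]
tests = [((1, 2), 3), ((5, 5), 10)]

verified = [src for src in candidates if passes_tests(src, tests)]
print(f"kept {len(verified)} of {len(candidates)} candidates")
```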

Or maybe this academic result doesn’t apply to the real world, where training sets are far larger? Perhaps. But we have the maths showing why model collapse happens, so if it fails to appear at scale, there must be some counteracting factor we don’t yet fully understand.

What does this all mean for you? All the same, LLMs still appear to improve with each new frontier model. But how long can that improvement continue?

The earliest training data for LLMs, think GPT-2 back in 2019, came from web pages linked in Reddit posts with at least three upvotes. Then the scaling laws suggested you should just download the entire internet and worry about curation later, with human feedback layered on top of the trained model. That happens today and is called Reinforcement Learning from Human Feedback (RLHF). But it doesn’t seem feasible for humans to review all the content needed to train large models.
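For a sense of how crude that early curation was, here is a sketch of the upvote threshold as a quality proxy, with invented posts standing in for a real Reddit dump: anything below the bar never makes it onto the scrape list.

```python
# Sketch of a WebText-style quality proxy: only scrape pages linked from
# Reddit posts with a score of at least 3. The posts below are invented.

MIN_SCORE = 3

posts = [
    {"url": "https://example.com/thoughtful-essay", "score": 57},
    {"url": "https://example.com/spam-page", "score": 1},
    {"url": "https://example.com/solid-tutorial", "score": 4},
]

worth_scraping = [p["url"] for p in posts if p["score"] >= MIN_SCORE]
print(worth_scraping)   # the spam page is dropped, the other two are kept
```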

Maybe you keep human-generated data as the backbone, focusing on books and journalism, though again, do you stop at 2023? Already, Amazon is polluted with AI-generated books, and many news outlets are publishing AI-generated articles. AI slop is coming for video, too. By one estimate, at least 20% of YouTube videos are purely AI, and that share will likely grow. So an LLM company can scrape YouTube transcripts thinking they’re human data, but they’re not.

Maybe you can rank content for quality somehow? Or focus on making synthetic data better?

These are all unknowns at the moment, but it’s something to think about the next time you use AI.