What we can learn about AI from the ‘dead internet theory’

Some have theorised that the internet, once a bustling digital highway replete with lively content made by real people, is fast becoming the preserve of bots. That’s a big problem for AI developers, who have come to rely on scraping internet data to train new models. Can a ‘knowledge-as-a-service’ (KaaS) approach change things?

The ‘dead internet theory,’ or the idea that much of the web is now dominated by bots and AI-generated content, is largely speculative. However, the concern behind it is worth taking seriously. The internet is changing, and the content that once made it a valuable source of knowledge is increasingly diluted by duplication, misinformation, and synthetic material.

For the development of artificial intelligence, especially large language models (LLMs), this shift presents an existential problem. The functionality of these systems depends on large volumes of training data. When that data is grounded in human expertise and real experience, the results are meaningful and reliable. When the data is noisy, synthetic, or stale, the value of the output declines. This raises a pressing question for those building or deploying AI: where will the next generation of high-quality training data come from?

A shrinking pool of good data

Modern LLMs consume vast amounts of data during training. Historically, this has come from scraping everything from publicly available websites, online forums, and documentation to articles and user-generated Q&As. Until recently, the internet sufficed as a near-endless source of content – but that may be changing.

As more sites resist scraping and legal frameworks around content usage are developed, it’s becoming more difficult to collect new, high-quality data at scale. At the same time, the overall quality of information available online is deteriorating, in part due to the increasing presence of AI-generated content that rehashes what already exists.
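
To make the scraping point concrete, here is a minimal Python sketch (an illustration, not anything described in this article) of how a crawler is expected to consult a publisher’s robots.txt before collecting pages. Many publishers now block AI crawlers such as OpenAI’s ‘GPTBot’ user agent this way; ‘example.com’ below is only a placeholder, not a real policy.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt policy.
# "example.com" is a placeholder domain, not any real publisher's policy.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A well-behaved crawler checks its own user agent before scraping a page.
for agent in ("GPTBot", "Googlebot"):
    allowed = rp.can_fetch(agent, "https://example.com/articles/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```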

The AI field is now facing a supply issue. It’s not just a matter of volume; there is growing concern that the diversity, originality, and relevance of the available training data are being eroded. Without fresh, human-generated knowledge feeding the system, models risk stagnating. They may still produce fluent responses, but their value, especially in high-stakes or fast-moving contexts, begins to diminish.

Synthetic data isn’t a long-term solution

To address the shortage, some organisations have turned to synthetic data. These are artificially generated examples designed to simulate human-created inputs. In theory, this provides a scalable way to generate training material. In practice, it introduces new risks.

Synthetic data is based on patterns learned from existing datasets. It often lacks nuance, fails to reflect edge cases, and may reinforce biases or errors already present in the original material. When synthetic data is used to train or fine-tune other models, it can create a feedback loop that magnifies inaccuracies. There are use cases where synthetic data can be helpful – such as testing, augmentation, or anonymisation – but it is not a replacement for authentic, well-sourced, human knowledge.
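
The feedback loop is easy to demonstrate with a toy simulation. The Python sketch below is illustrative only (nothing like it appears in this article): each ‘generation’ fits a deliberately simple model to its corpus, and the next generation then trains exclusively on that model’s output, with rare tail samples slightly under-represented, as generative samplers tend to do. The diversity of the data collapses within a handful of generations – a phenomenon researchers have called model collapse.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "human" data with a healthy spread (std = 1.0).
data = rng.normal(loc=0.0, scale=1.0, size=50_000)

for gen in range(1, 9):
    # Fit a trivially simple "model": just the corpus mean and spread.
    mu, sigma = data.mean(), data.std()
    # Train the next generation only on the model's own output, dropping
    # samples beyond two standard deviations to mimic how generative
    # models under-sample rare edge cases.
    samples = rng.normal(mu, sigma, size=200_000)
    data = samples[np.abs(samples - mu) < 2 * sigma][:50_000]
    print(f"generation {gen}: corpus std = {data.std():.3f}")
```

Each generation’s spread shrinks by roughly 12%: every individual sample still looks plausible, but the edge cases that made the original data valuable are gone.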

The most valuable training data still comes from people solving real problems, sharing what they know, and refining each other’s understanding. This kind of input brings context, relevance, and expertise that AI cannot generate on its own.

One emerging model for collecting and maintaining this kind of data is Knowledge as a Service (KaaS). Rather than scraping static sources, KaaS creates a living, structured ecosystem of contributions from real users (often experts in their fields) who continuously validate and update content. This approach takes inspiration from open-source communities but remains focused on knowledge creation and maintenance rather than code.

KaaS supports AI development with a sustainable, high-quality stream of data that reflects current thinking. It’s designed to scale with human input, rather than in spite of it.
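
The article does not prescribe a schema, but a hypothetical sketch helps make the idea tangible. In the Python below, every field and rule – attribution, peer review, versioning – is an illustrative assumption about what a KaaS record could carry, not a description of any real platform’s API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class KnowledgeEntry:
    """One hypothetical record in a KaaS-style knowledge base."""
    question: str
    answer: str
    author: str                                  # attribution to a real contributor
    reviewers: list[str] = field(default_factory=list)
    created: datetime = field(default_factory=_now)
    updated: datetime = field(default_factory=_now)
    version: int = 1

    def revise(self, new_answer: str, editor: str) -> None:
        """Contributors continuously refine content; every edit is versioned."""
        self.answer = new_answer
        self.reviewers.append(editor)
        self.version += 1
        self.updated = _now()

    @property
    def validated(self) -> bool:
        """Treat an entry as training-ready once peers have reviewed it."""
        return len(self.reviewers) >= 2
```

The attribution and review fields are the point: they are what make the transparency and accountability described below possible, giving AI builders a way to trace training data back to the people who created it.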

Why KaaS works

KaaS helps AI stay relevant by providing fresh, domain-specific input from real users. Unlike static datasets, KaaS adapts as conditions change. It also brings greater transparency, showing directly how contributors’ inputs are used. This level of attribution is a step toward more ethical and accountable AI.

Most importantly, KaaS is sustainable. As communities continue to share and refine knowledge, the system improves over time. This living, human-driven approach offers a stronger foundation for AI than synthetic or scraped data can provide, supporting smarter and more reliable outcomes.

Building better AI with KaaS

For AI to remain useful and trustworthy, it must stay grounded in reality. That depends on better quality data, not just more of it. Organisations working with LLMs should invest in the creation of high-quality, human-led knowledge systems rather than relying on content that is synthetic or scraped from a shrinking pool of online sources.

This involves building platforms that reward participation, respect data ownership, and create feedback loops between AI systems and the people informing them. Industries such as healthcare, finance, and education, where accuracy and trust are critical, are well-placed to lead the charge.

Knowledge as a Service provides a practical and sustainable alternative. It delivers verified, evolving, and diverse insight rooted in real expertise, not noise or repetition. The goal isn’t to slow progress, but to guide it. By embedding human context and collaboration into AI development, we can ensure future systems improve not just in size, but in substance. 

Jody Bailey is the Chief Product and Technology Officer at Stack Overflow

Read more: Why AI-powered, ethical ‘hackbots’ are the ultimate bulwark against AI-enabled cybercriminals
