Can humans and bots share the Internet? Wikipedia thinks so.


Author: Anabelle Nicoud, Staff Writer, IBM


Content scraping, or automated data extraction for AI, has become one of the most discussed topics this year, with many content creators and businesses questioning the impact of AI search on their own traffic. But few organizations have felt the impact of rising non-human traffic as acutely as Wikipedia.

Earlier this year, the online encyclopedia revealed that more than 65% of its most expensive traffic comes from bots, nearly double bots’ share of overall page views. In short, bots strain its servers far out of proportion to the page views they generate.

For Wikipedia, which places human knowledge at the core of its mission, this is more than a technical issue; it’s an existential challenge. The encyclopedia depends on its more than 260,000 human editors to create and edit its content, and on human readers to fund it through donations.

“The AI companies are here, and they are particularly voracious,” said Lane Becker, president of Wikimedia LLC, a commercial subsidiary of the Wikimedia Foundation, in an interview with IBM Think. “We’re a general store of knowledge with extremely timely information, and a lot of these organizations are slamming our servers.”

The scraping surge

Adapting to new technologies is nothing new for Wikipedia. But the AI age has brought new demands. Voice assistants and AI-powered search engines—such as SearchGPT and Perplexity—require constant, real-time access to structured information. “Wikipedia content is so valuable; it’s used in every LLM, and the ones that don’t, don’t function nearly as well without Wikipedia data,” Becker said.

Content creators are also feeling the pressure. Independent research by tech company Miso.ai, previously shared with IBM Think, identified more than 1,700 bots on publishers’ sites through its Sentinel project, an increase of 35% since February.

For Wikipedia, this trend is particularly noticeable during major news events. The death of Jimmy Carter, for example, triggered an unprecedented surge in demand for multimedia content on its site, resulting in slow load times for some users. “This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models,” Wikimedia wrote in a blog post. “Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs.”

Beyond the server strain

The challenge here isn’t just about relieving the server strain—it’s also about improving the quality of the data that AI models ingest. IBM, for instance, has been working with Wikimedia to enhance the annotations and structure of its content, using open-source tools that aim to help manage its data landscape. “Wikipedia is an important source of knowledge for AI,” said Nirmit Desai, a Director of Data and Tools for AI Models at IBM, in an interview with IBM Think. “We are collaborating with them to help them improve the quality of, and annotations for, their data and help accelerate progress in AI.”

Rosario Uceda-Sosa, a Senior Technical Staff Member at IBM Research, has spent over a decade working with knowledge graphs—including extensive use of Wikidata and other Wikimedia projects in AI research.

“It really does pay to have clean data,” she said in an interview with IBM Think. “The principle of garbage in, garbage out still holds in AI, despite how powerful the models are.”

Wikimedia Enterprise: A new model for the AI era

To adapt to private-sector demand for Wikipedia content, the Wikimedia Foundation launched Wikimedia Enterprise in 2021. The commercial service provides API access to Wikipedia content, along with curated datasets that are more difficult to extract from the public APIs.

“It comes in a more machine-readable format,” Becker said. “We made it easier for large corporations to use. We have SLAs, guarantees of uptime, a support team—all the things you expect when you pay for a product.”
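For a sense of what machine-readable access looks like in practice, here is a minimal sketch that pulls structured article data from Wikipedia’s free public REST API. It is an illustration only, not the Enterprise product itself, which layers SLAs, bulk snapshots and support on top of this kind of access; the client name in the User-Agent header is a placeholder.

import requests

# Minimal sketch: fetch structured article data from Wikipedia's public REST API.
# Wikimedia Enterprise offers similar machine-readable access at scale; this
# example uses only the free, publicly documented page-summary endpoint.
API = "https://en.wikipedia.org/api/rest_v1/page/summary/{title}"

def fetch_summary(title: str) -> dict:
    """Return the structured summary (title, extract, timestamp, ...) for one article."""
    resp = requests.get(
        API.format(title=title),
        headers={"User-Agent": "example-reader/0.1 (contact@example.org)"},  # identify your client
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    data = fetch_summary("Jimmy_Carter")
    print(data["title"])
    print(data["extract"][:200], "...")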

Getting AI companies to pay for informative, trusted and constantly updated content is no easy task. The rise of AI platforms has led to a wave of partnerships between AI giants and news organizations, including a recent deal between local news giant Gannett and Perplexity, and programs like Perplexity’s Comet Plus initiative, which promises to share 80% of its revenue with publisher partners. Meanwhile, several media companies have filed copyright lawsuits against major AI players like OpenAI and Cohere.

But Wikimedia’s Enterprise model is just one of several strategies aimed at creating a fairer, more sustainable ecosystem for content in the AI era.

Earlier this summer, internet infrastructure giant Cloudflare announced that it would block AI scrapers by default. “That content is the fuel that powers AI engines, and so it’s only fair that content creators are compensated directly for it,” said Cloudflare CEO Matthew Prince in a company blog post. Startups like ScalePost, TollBit and ProRata.ai say they aim to help content creators and publishers earn revenue when AI systems use their content.
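Blocking by default typically comes down to checking a crawler’s declared User-Agent at the edge, alongside other signals such as verified bot lists. The sketch below is a simplified, hypothetical filter, not Cloudflare’s actual implementation; the bot names listed are examples of publicly documented AI crawler user agents, and the function names are illustrative.

# Hypothetical sketch of default AI-crawler blocking: deny requests whose
# User-Agent matches a known AI scraper unless the site owner has opted in.
# This is an illustration, not Cloudflare's implementation.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")  # publicly documented bot names

def is_ai_crawler(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(bot.lower() in ua for bot in AI_CRAWLERS)

def should_block(user_agent: str, site_allows_ai_crawlers: bool = False) -> bool:
    """Block AI crawlers by default; let the site owner opt in explicitly."""
    return is_ai_crawler(user_agent) and not site_allows_ai_crawlers

if __name__ == "__main__":
    print(should_block("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))  # True
    print(should_block("Mozilla/5.0 (Windows NT 10.0) Firefox/128.0"))                       # False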

“This is a quickly developing market, and none of us know which strategy will succeed,” Becker said. “But it’s critical that organizations try different approaches to find something meaningful and sustainable.”


In the content space, Wikipedia’s approach remains unique. It continues to offer free, unbiased content designed to be accessed by as many people as possible. Attribution is a key motivator for its editors, and the Foundation sees proper sourcing as essential to its mission. Making sure LLMs credit Wikipedia for the information they draw from it is a key step.

“Attribution and sourcing are key,” said Luis Bitencourt-Emilio, a Wikimedia Foundation board member, in a recent interview with IBM Think. “One of our asks for those using our content, which benefits both users and Wiki, is to attribute it properly.”

The rise of AI systems isn’t straining only Wikimedia’s servers. If human traffic shrinks, Wikipedia has fewer chances to recruit the people who contribute their time or donations. “There will always be a craving for factual encyclopedic data,” he said. “Wiki remains the platform of choice for users seeking neutral, reliable information.”

“I’d encourage AI companies to think longer term and recognize that understanding where something comes from is critical to knowledge,” Becker said. “When we say we want to support the knowledge ecosystem, we mean not just the content, but [also] the systems and structures that sustain it.”

The AI opportunity

AI platforms like ChatGPT, which boasts over 700 million weekly active users, also present an opportunity for Wikipedia to reach new audiences.

“I see that as an opportunity to reach more and more people,” Becker said. “Our goal is to disseminate free knowledge for the world. The question is: How can [tech companies] help us further this knowledge mission without impacting fair use and the usage of our infrastructure?”

Still, as knowledge repositories, AI-powered tools come with limitations. Hallucinations—false or misleading outputs—can be problematic, especially when users rely on LLMs for factual information. And when models are trained on AI-generated content, they risk model collapse, a phenomenon where performance degrades over time.

Ultimately, Wikipedia wants to put humans first. Humans write and review the content. “We have over a quarter million editors that have helped us create what you see as Wiki today,” said Bitencourt-Emilio. “In a world where you have a bunch of bots generating this, you certainly have much worse content. I don’t know if there’s ever a world where you get better content.”
