In early 2019, a subtle yet significant shift occurred in the world of artificial intelligence. OpenAI, a key player in the field, began moving away from its earlier emphasis on open research. Over time, access to its datasets narrowed, details about its training methods became increasingly difficult to find and its internal work became more closed off. What seemed like a routine change in direction at the time would go on to mark a turning point for AI, reshaping how research is shared, developed and debated worldwide.
“There is no visibility into their datasets anymore,” says Karen Hao, a longtime observer of the field and the former Senior Editor for Artificial Intelligence at MIT Technology Review, in an interview with IBM Think.
Hao’s new book, Empire of AI, chronicles the development of generative AI from the inside, tracing not only the economic and political motives behind the rise of companies like OpenAI but also the quiet technical decisions that redefined the science itself. “Even OpenAI does not always know what is in their training sets,” she says. “The data is just too large to audit manually.”
That admission might sound trivial to a casual observer. But for researchers, the inability to reliably characterize or replicate the data used to train a model undermines the very foundations of the discipline. For decades, machine learning has depended on a simple scientific principle: reproducibility. A model should behave in the same way if trained under the same conditions. But with today’s massive, uncurated datasets, those conditions are often unknowable.
In most empirical sciences, reproducibility is a litmus test for rigor. A chemistry experiment that cannot be reproduced is suspect. A medical trial with untraceable inputs is unlikely to pass peer review. In artificial intelligence, reproducibility has traditionally relied on researchers publishing not only their model architectures and training parameters but also the exact datasets used to train those models. These datasets, whether collections of images, audio recordings or text documents, form the basis of what the models know and how they generalize to new inputs.
In the early 2010s, this model of openness was the norm. Academic labs and corporate researchers alike shared their training corpora, described their preprocessing steps, and ran benchmarks against common standards. But by 2020, the landscape had changed. As companies like OpenAI began to compete more aggressively for commercial advantage, the practice of sharing datasets fell out of favor.
This shift was not just about intellectual property. As Hao points out, the sheer size of modern training datasets, often comprising hundreds of billions of tokens scraped from the internet, made it practically impossible to document them thoroughly. Companies began to rely on automated scraping and filtering tools to assemble their datasets, but those tools could miss subtle problems in the material they collected, adding a new layer of uncertainty to the training process.
A revealing case came from researchers at Stanford University, who audited the widely used LAION-5B image dataset. Despite being public, the dataset contained thousands of verified or suspected instances of child sexual abuse material. The discovery came years after the dataset had been circulating freely and had already been used to train commercial image generators. The episode served as a wake-up call: if this much harm could be embedded in an open dataset, what might be lurking in the private ones?
“We cannot even guarantee a test-train split anymore,” Hao explains, referring to a basic methodological practice in machine learning.
In a typical AI setup, the dataset is divided into two parts: one used for training the model and the other for testing its performance. This helps measure the model’s accuracy on data it has not seen before. But when a dataset is so large and opaque that its contents are effectively unknown, duplicate content can end up in both sets, contaminating the evaluation and inflating performance metrics.
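To make the risk concrete, here is a minimal sketch, in Python, of the kind of exact-duplicate check a team might run between training and test data. The function names and normalization are illustrative only; real deduplication pipelines rely on fuzzier matching, and at the scale Hao describes, even this simple check becomes impractical.

```python
import hashlib

def normalize(text: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def fingerprint(text: str) -> str:
    # Hash the normalized text so large corpora can be compared compactly.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def contamination_rate(train_docs, test_docs) -> float:
    """Fraction of test documents that also appear verbatim in the training data."""
    train_hashes = {fingerprint(doc) for doc in train_docs}
    overlap = sum(1 for doc in test_docs if fingerprint(doc) in train_hashes)
    return overlap / max(len(test_docs), 1)

if __name__ == "__main__":
    train = ["The cat sat on the mat.", "Paris is the capital of France."]
    test = ["paris is the capital of   france.", "Water boils at 100 degrees Celsius."]
    print(f"Contamination rate: {contamination_rate(train, test):.0%}")  # 50%
```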
The result is a field increasingly reliant on faith rather than verification. “It has become more alchemical than scientific,” Hao says. “We throw more compute and more data at the model and hope something emerges.”
Not everyone stampeded toward scale. As Hao describes it, a quieter movement emerged among researchers who followed a different path. Instead of reaching for ever-larger datasets, they worked with small sets of handpicked data. What mattered was not how much data they had, but how well that data captured the nuance of language, the range of human experience and the imperatives of fairness.
While the industry pushed for more, these researchers asked what was being overlooked along the way. Mozilla’s DeepSpeech, for example, was a speech recognition project built on audio clips donated by users with full consent. Each clip was manually reviewed and tagged, with extensive effort devoted to refining the dataset for clarity and for diversity of voices, accents and linguistic patterns.
Similarly, the BLOOM language model, developed by a global research consortium under the guidance of Hugging Face, was trained on public datasets collected with attention to linguistic, geographic and topical diversity. Every source was documented. Community audits were invited. Unlike opaque foundation models, BLOOM made its training methodology legible.
But such efforts have been increasingly overshadowed. The prevailing industry logic now favors scale, Hao says. Larger models trained on larger datasets tend to show emergent properties, such as complex reasoning or code generation, even without task-specific tuning. This encourages teams to abandon the careful design of data in favor of scraping everything they can.
The scale-first mindset at OpenAI was not merely a technical conclusion. It was the result of a coherent, if unorthodox, belief system shared by its leadership, Hao points out. She describes Ilya Sutskever, then OpenAI’s Chief Scientist, as a deep learning absolutist: he believed that a sufficiently large neural network, fed enough data, would eventually develop humanlike intelligence. Sam Altman, OpenAI’s CEO, approached artificial intelligence as an entrepreneur, seeing exponential scaling as the fastest path to dominance. Greg Brockman, the company’s President, was the engineering mind focused on making that scaling happen.
The architecture that enabled this doctrine was the transformer, a type of neural network first introduced in 2017. Transformers excel at modeling sequences of data, such as text, because they can track relationships between words across long distances in a sentence. Crucially, they can be scaled up efficiently: adding more layers and parameters, paired with more data and compute, reliably improves performance.
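For readers curious about the mechanism itself, the sketch below shows the scaled dot-product attention at the heart of a transformer in a few lines of NumPy. It is a toy, single-head version; production models add learned projections, many attention heads and many stacked layers, which is where the scaling happens.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every position attends to every other position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of value vectors

# Toy example: a "sentence" of 4 tokens, each embedded in 8 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
output = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape)  # (4, 8): one updated vector per token
```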
OpenAI’s research team realized that if they trained transformers on a massive enough dataset with sufficient computational power, they could bypass the need for handcrafted features, symbolic reasoning or modular design. Intelligence, in their view, would emerge from the data.
To train models like GPT-4, OpenAI needed not just ideas but infrastructure. Language models of this size require clusters of tens of thousands of graphics processing units. Designed initially for rendering three-dimensional images, GPUs proved exceptionally useful for the matrix multiplications at the heart of neural networks. But stringing them together to act as a unified system required custom software and hardware orchestration.
OpenAI’s engineers developed techniques to partition models into shards, which could be distributed across multiple chips and trained in parallel. They created checkpointing protocols to preserve partial training runs, reducing the risk of catastrophic failure. They built custom communication protocols to synchronize updates across machines. These were not glamorous advances, but they were essential.
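None of OpenAI’s internal tooling is public, but the basic idea behind checkpointing is simple to sketch. The example below, written against the open source PyTorch library, periodically saves model and optimizer state so a long run can resume after a failure; the model, interval and file path are placeholders, not anyone’s production setup.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume: weights, optimizer state and progress.
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        path,
    )

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # resume training from this step

# Sketch of a training loop that checkpoints every 1,000 steps.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(3_000):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # stand-in for a real training loss
    loss.backward()
    optimizer.step()
    if step % 1_000 == 0:
        save_checkpoint(model, optimizer, step)
```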
“No one had trained across 10,000 chips before,” Hao says. “They had to figure it out in real time.”
These advances let OpenAI scale its models faster and more efficiently than its competitors. But they also contributed to a new kind of secrecy. OpenAI stopped publishing many of the details behind its breakthroughs, arguing that disclosing too much would give away its competitive advantage.
By 2024, most major tech firms had caught up. IBM, Google, Meta, Amazon, Anthropic and newer entrants such as Mistral had all produced large language models using similar transformer architectures and training techniques. Many used reinforcement learning from human feedback, a method in which humans rate the quality of a model’s outputs so the model can be fine-tuned to better align with human preferences.
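The reward-modeling step at the core of that method fits in a few lines. The sketch below shows the standard pairwise preference loss described in the public RLHF literature, not any company’s production code: a reward model is nudged to score the human-preferred response higher than the rejected one.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: push the human-preferred response to score higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to (preferred, rejected) response pairs.
reward_chosen = torch.tensor([1.2, 0.4, 2.0])
reward_rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(reward_chosen, reward_rejected))  # lower is better
```

The trained reward model is then used to fine-tune the language model itself, typically with a reinforcement learning algorithm such as proximal policy optimization.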
To outsiders, the differences between these systems became harder to discern. Application developers began designing interfaces that could work with any model behind the scenes, allowing them to switch providers as needed. Pricing, latency and uptime became more important than marginal differences in intelligence.
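In code, that model-agnostic approach often looks like a thin adapter layer. The sketch below is hypothetical, with invented class and provider names, but it shows how an application can depend only on a shared interface and swap the model behind it.

```python
from typing import Protocol

class ChatModel(Protocol):
    """Any provider that can turn a prompt into a completion."""
    def complete(self, prompt: str) -> str: ...

class ProviderA:
    def complete(self, prompt: str) -> str:
        # In practice this would call provider A's API; stubbed for illustration.
        return f"[provider A] response to: {prompt}"

class ProviderB:
    def complete(self, prompt: str) -> str:
        return f"[provider B] response to: {prompt}"

def answer(question: str, model: ChatModel) -> str:
    # Application code depends only on the interface, so providers are swappable.
    return model.complete(question)

print(answer("Summarize this contract.", ProviderA()))
print(answer("Summarize this contract.", ProviderB()))
```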
“Everyone is trying to be model agnostic now,” Hao says. “OpenAI does not have a monopoly on good models anymore.”
With scale no longer a differentiator, companies began investing in a different paradigm: agency. In artificial intelligence, agency refers to a system’s ability to take initiative, persist over time and act toward its goals. Rather than reacting to a prompt, an agent plans actions, monitors results and adjusts behavior.
This required new capabilities. Models had to maintain memory across sessions, integrate with third-party tools and make decisions without explicit prompts. The goal was to move from a passive chatbot to an active collaborator.
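Stripped of the machinery of real products, an agent is essentially a loop. The hypothetical sketch below shows the plan-act-observe cycle with a persistent memory; in a real system, the planning and acting steps would call a language model and external tools rather than returning canned strings.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Schematic agent: keeps memory across turns and decides its own next action."""
    goal: str
    memory: list = field(default_factory=list)

    def plan(self) -> str:
        # A real agent would ask a language model to choose the next step.
        return f"step {len(self.memory) + 1} toward: {self.goal}"

    def act(self, action: str) -> str:
        # A real agent would call a tool (search, code execution, a calendar, ...).
        return f"result of {action}"

    def run(self, max_steps: int = 3) -> list:
        for _ in range(max_steps):
            action = self.plan()                        # decide what to do next
            observation = self.act(action)              # do it and observe the outcome
            self.memory.append((action, observation))   # persist what happened
        return self.memory

agent = Agent(goal="book a table for Friday")
for action, observation in agent.run():
    print(action, "->", observation)
```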
OpenAI had long been inspired by the film “Her,” in which a user falls in love with an AI assistant who adapts seamlessly to his needs. Creating such a system meant developing not just intelligence, but presence. Hao notes that OpenAI’s internal teams have pursued this dream across product and research domains.
“You cannot build that kind of assistant without giving the model memory, persistence and autonomy,” she says.
But to make agents truly effective, OpenAI needed more than algorithms. It needed new kinds of data and new ways to collect it. The internet, once an abundant source of training data, has become saturated with synthetic content. Many of the documents now available online were themselves generated by previous models.
This creates a feedback loop in which web-scraped data becomes steadily less valuable for training. To break the loop, companies are turning toward more intimate data collection. Hao reports that OpenAI is exploring custom devices that could capture real-time user behavior, from mobile interactions to voice conversations and environmental context.
“There is too much AI-generated content online,” Hao says. “If you want high-quality data, you have to get it directly from people.”
That turn toward direct data collection, Hao says, raises difficult questions about consent, surveillance and control. Can people truly choose not to have their data collected? And what say will they have over models trained on their words, images or behavior?
For Hao, the answer lies not in techno-optimism or doomsaying, but in transparency. She does not subscribe to the dominant ideologies in AI—what she calls the “boomers,” who believe artificial intelligence will save humanity, or the “doomers” who fear it will destroy us.
“I am in the accountability camp,” she says. “These systems reflect institutional power. We need to know how they are made and who benefits.”
Companies need to explain how their models are tested, what data they use and how they interpret the results, Hao says. They should track their failures and share the findings so others can scrutinize them.
Without this kind of openness, Hao warns, AI risks becoming a proprietary black box—powerful, but unaccountable.