The data gap that’s holding back robotics

Antonia Davison

Staff Writer


The common thread behind AI’s biggest wins? Massive datasets. Computer vision took off with ImageNet. Language models got a boost from Wikipedia. Video generation? You can thank YouTube’s vast library for that. But robotics is still waiting for its moment, because there’s no centralized treasure trove of robot data to fuel the next breakthrough.

The problem is, building a robotics equivalent of these datasets is uniquely hard work. “Unlike text or images, which are readily available online, robotics requires real-world physical interactions,” said Ben Levin, General Manager of Physical AI at Scale AI, in an interview with IBM Think. “You can’t just search the web for millions of examples. They actually have to be collected, curated and annotated through purpose-built facilities and infrastructure.”

This scarcity, Levin explained, is why embodied AI has been stuck behind language and vision models. But as the robotics sector grows, a wave of efforts—from well-funded startups to grad students driving robots around in rented vans—is trying to fill that data gap.

A data giant gets physical

Scale AI, the data infrastructure giant that powers language and vision models, recently made a major move into robotics. This fall, the company debuted its Physical AI Data Engine, a rapidly growing library of custom-collected, multimodal datasets designed to train foundation models for robots, autonomous vehicles and other tech, such as drones.

The company has already collected more than 100,000 hours of real-world robotics data this year. It’s a robust number, but Levin explained that Scale’s differentiator is quality, not quantity. “We enrich every dataset with semantic layers that capture not just what motion was executed, but why, or the task goal; how, or the sequence of steps; and where it failed,” he said.
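To make that idea concrete, here is a minimal sketch of what such a semantic annotation layer could look like in Python. The schema, field names and example task are illustrative assumptions for this article, not Scale’s actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class StepAnnotation:
    """One step in the 'how': a segment of the executed motion."""
    name: str          # e.g. "grasp mug handle" (hypothetical step label)
    start_s: float     # segment start time within the episode, in seconds
    end_s: float       # segment end time within the episode, in seconds
    succeeded: bool    # whether this step completed as intended

@dataclass
class EpisodeAnnotation:
    """Semantic layer over one recorded motion: the why, how and where it failed."""
    episode_id: str
    task_goal: str                                              # the "why"
    steps: list[StepAnnotation] = field(default_factory=list)   # the "how"
    failure_step: str | None = None                             # the "where it failed", if anywhere

# Hypothetical annotated episode
episode = EpisodeAnnotation(
    episode_id="ep_000123",
    task_goal="place the mug on the top rack of the dishwasher",
    steps=[
        StepAnnotation("locate mug", 0.0, 1.2, True),
        StepAnnotation("grasp mug handle", 1.2, 3.5, True),
        StepAnnotation("place mug on rack", 3.5, 6.0, False),
    ],
    failure_step="place mug on rack",
)
```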

The company deploys both human teleoperators and autonomous data collection robots in homes, businesses and worksites. “Since no single robotics company can generate enough data to train embodied foundation models, companies are realizing that partnering with large-scale providers can accelerate progress and reduce costs,” Levin said.

Going deep on contact

While Scale went for breadth, Professor Hao-Shu Fang of the University of Maryland went for depth. His RH20T (Robot-Human Demonstration in 20TB) dataset, which he started in 2021 while at Shanghai Jiao Tong University in China, took a focused approach: contact-rich manipulation tasks where robots continuously interact with their environment. Think wiping blackboards, plugging in USB cables, cutting vegetables—the kind of nuanced, force-dependent tasks that trip up most robots.

RH20T includes 110,000-plus robot sequences across roughly 150 tasks, all recorded with multimodal sensing that includes force/torque sensors, audio, tactile feedback and more. Each robot episode comes with a paired human demonstration video: a volunteer teleoperates the robot while being recorded, providing a task example the model can try to imitate. According to Fang, this multimodality is the point, because vision alone can’t reveal the physical dynamics robots must master. “RH20T’s [sequences] let models learn when contact is made, how it evolves and how to modulate actions when contact happens,” he said in an interview with IBM Think.
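As a rough illustration of what a contact-rich, multimodal episode might look like to a training pipeline, the sketch below assembles one episode from per-modality arrays and flags the timesteps where contact force spikes. The file layout, array shapes and the load_episode helper are assumptions made for this example, not RH20T’s actual format or tooling.

```python
import numpy as np

def load_episode(robot_dir: str, human_demo_path: str) -> dict:
    # Assemble one episode from per-modality files; names and shapes are illustrative.
    return {
        # Robot streams, assumed aligned on a common timeline of T steps
        "rgb":          np.load(f"{robot_dir}/rgb.npy"),           # (T, H, W, 3) camera frames
        "force_torque": np.load(f"{robot_dir}/force_torque.npy"),  # (T, 6) wrench at the wrist
        "tactile":      np.load(f"{robot_dir}/tactile.npy"),       # (T, num_taxels) fingertip readings
        "audio":        np.load(f"{robot_dir}/audio.npy"),         # (num_samples,) contact sounds
        "actions":      np.load(f"{robot_dir}/actions.npy"),       # (T, 7) end-effector pose + gripper command
        # Paired human demonstration of the same task
        "human_demo_video": human_demo_path,
    }

# Hypothetical usage: find when contact is made by watching the force signal
ep = load_episode("episodes/usb_insertion_0001", "episodes/usb_insertion_0001/human.mp4")
force_norm = np.linalg.norm(ep["force_torque"][:, :3], axis=1)   # magnitude of the contact force
contact_steps = np.flatnonzero(force_norm > 2.0)                 # timesteps above a 2 N threshold
```

The force, tactile and audio channels carry exactly the information vision alone misses: when contact happens and how it evolves over the course of the task.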

His work paid off. Google Robotics invited Fang’s team to join the Open X-Embodiment project (see more about that below). Robotics legend Sergey Levine called it “an amazing dataset” on a podcast. And data companies have reached out about commercial collaborations, according to Fang.

Fang open-sourced everything from day one. His reasoning is that robotics is too far behind language models for any single research group or company to catch up alone. “Compared to the NLP [natural language processing] community, we are at the very initial stage of robotic foundation models,” he said.

When asked about commercial versus academic datasets, Fang is blunt: “Complementary,” he said. “Open sets like RH20T establish public benchmarks, methods and baselines. Proprietary corpora push scale and product-specific edge cases.” The knowledge flows both ways, according to Fang, thanks to standardized formats and tooling.

The mega collaboration

If RH20T showed what one focused lab could do, Open X-Embodiment demonstrated the power of pooling resources at unprecedented scale. The project brought together 21 institutions to aggregate data across 22 different robot platforms (integrated hardware and software systems used for building and testing robots)—a radical departure from the traditional model of individual labs keeping their own datasets proprietary. Karl Pertsch, a postdoctoral researcher at UC Berkeley and Stanford University and one of the project’s organizers, said the speed was remarkable.

“At the time, around 2023, the largest robot datasets that were used in research were on the order of a few tens of hours of robot data,” Pertsch said in an interview with IBM Think. “Open X contains about 2,000 hours and was put together within the span of just a few months.”

The innovation was cross-embodiment learning: train on multiple robot types, and models learn general task representations instead of robot-specific quirks. The proof came quickly, according to Pertsch. “A robot that had never seen a Coke can in its training data was able to pick up a Coke can after it was co-trained on data from other robots that featured Coke cans.”
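A rough sketch of that co-training recipe, with made-up dataset names, weights and records: episodes from several robot embodiments are pooled, and each training batch is drawn from a weighted mixture so the model sees every platform rather than overfitting to one.

```python
import random

# Toy stand-ins for per-embodiment datasets; in practice each record would hold
# camera frames, proprioception and actions for one timestep or episode.
datasets = {
    "robot_a": [{"obs": f"a_{i}", "action": [0.0] * 7} for i in range(1000)],
    "robot_b": [{"obs": f"b_{i}", "action": [0.0] * 7} for i in range(1000)],
    "robot_c": [{"obs": f"c_{i}", "action": [0.0] * 7} for i in range(1000)],
}

# Mixture weights control how often each embodiment appears in a batch.
weights = {"robot_a": 0.5, "robot_b": 0.3, "robot_c": 0.2}

def sample_cotraining_batch(batch_size: int) -> list[dict]:
    # Draw a batch whose examples span multiple robot embodiments.
    names = list(datasets)
    chosen = random.choices(names, weights=[weights[n] for n in names], k=batch_size)
    return [random.choice(datasets[name]) for name in chosen]

batch = sample_cotraining_batch(256)   # a single batch mixes all three embodiments
```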

Getting 21 institutions to collaborate wasn’t trivial. It involved countless meetings, alignment on data formats and work by many graduate students, Pertsch noted. But the robotics community was ready. “All the people we reached out to were actually very excited to contribute,” he said. “I think there was a shared belief in large parts of the robotics community at the time that we needed larger and more diverse datasets.”

The impact has been dramatic. “The community has broadly adopted Open X as the default data source for larger-scale data-driven robot learning research in the open-source community,” Pertsch said. The most impressive validation? “Most VLAs [vision-language-action models] today, open and closed, use at least a portion of the Open X dataset as part of their data mix.”

But Open X-Embodiment had a limitation, according to Pertsch: most contributing datasets came from controlled lab environments. The solution, he said, was DROID (Distributed Robot Interaction Dataset), one of Open X-Embodiment’s major components that pushed the collaboration into real-world settings. DROID sent 50 data collectors across three continents to gather manipulation data in realistic environments, including graduate students’ homes.

The logistics were intense. Pertsch recalled that one of his colleagues spent a lot of time driving robots around in rented vans. But there was a payoff. “Today, the most generalizable open-source robot models are trained on DROID, and DROID has become a great platform for testing robot models out of the box, just like you would download an LLM and prompt it to do something,” Pertsch said.


A robot for all seasons

Not all robot challenges are about picking and placing objects. Fabian Schmidt at Esslingen University in Germany tackled a completely different problem: outdoor navigation across changing seasons.

“While indoor environments are relatively stable and well-lit, outdoor settings are dynamic and unpredictable,” Schmidt said in an interview with IBM Think. “Lighting, vegetation and weather constantly change, which challenges visual perception and can easily break the assumptions many SLAM systems rely on.”

SLAM is short for Simultaneous Localization and Mapping, which is the process of building a map of an unknown environment while simultaneously estimating the robot’s position within it. Schmidt’s answer is ROVER, a dataset he said is the first to systematically capture the same five outdoor locations across all four seasons and under varying weather and lighting conditions. The target applications are practical: robotic lawn mowers, delivery bots, inspection systems—anything that needs to work reliably year-round, according to Schmidt.
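For readers new to the term, here is a heavily simplified sketch of the SLAM loop: integrate noisy odometry to predict the robot’s pose, then use landmark observations to build a map and correct accumulated drift. It is a toy illustration under those assumptions, not Schmidt’s system or a production SLAM algorithm.

```python
import numpy as np

pose = np.zeros(3)      # robot state: x, y, heading (theta) in the world frame
landmark_map = {}       # landmark_id -> estimated (x, y) position in the world frame

def predict(distance: float, dtheta: float) -> None:
    # Dead reckoning: advance the pose estimate using wheel odometry.
    pose[0] += distance * np.cos(pose[2])
    pose[1] += distance * np.sin(pose[2])
    pose[2] += dtheta

def update(landmark_id: int, range_m: float, bearing: float) -> None:
    # Mapping: place the observed landmark into the world frame.
    observed = pose[:2] + range_m * np.array(
        [np.cos(pose[2] + bearing), np.sin(pose[2] + bearing)]
    )
    if landmark_id in landmark_map:
        # Re-observing a known landmark: nudge the pose toward agreement,
        # a crude stand-in for the correction a real filter or factor graph performs.
        pose[:2] += 0.5 * (landmark_map[landmark_id] - observed)
    else:
        landmark_map[landmark_id] = observed

predict(1.0, 0.1)                                   # move 1 m forward, turn 0.1 rad
update(landmark_id=7, range_m=2.0, bearing=-0.2)    # observe a landmark 2 m away
```

Seasonal change makes every step of that loop harder: when vegetation, lighting or snow alters what the cameras see, re-observed landmarks no longer match the stored map, which is exactly the failure mode ROVER is built to measure.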

The findings varied by season. “We observed that localization accuracy was highest during winter and spring, while summer and autumn proved more challenging due to denser vegetation and visual clutter,” Schmidt said. Even small details like shadows or surface texture variations had outsized impacts on performance.

“Most manipulation datasets are based on a static robot operating within a defined configuration space, while navigation datasets rely on mobile robots acting in a dynamic and changing world,” Schmidt said. “Both represent complementary aspects of embodied intelligence—understanding what to do and where to do it. Connecting these two worlds should, and will, be the next major step toward achieving full autonomy.”

What's next

Looking ahead, researchers see the data collection paradigm itself shifting, although not necessarily in the same direction.

Fang is moving beyond the teleoperation approach used in RH20T toward what he calls “perioperation.” Instead of having operators control robots, perioperation records humans performing everyday manipulations. “Humans are doing manipulation every day, but the data is not recorded,” he said. “If we can build devices to sensorize and record human manipulations, including vision, tactile, proprioception data, while maximizing the transferability to robots, then we can scale up robotic data in a much faster way.”

Pertsch is more measured. While human videos are valuable, “human data alone is not a silver bullet,” he said. “We will need a large amount of robot data to teach models about the precise details of fine-grained robot control.” Instead, Pertsch sees deployment as the inflection point: “Once robots can be out in the world and collect data while creating economic value, the cost of robot data will go down rapidly, since it’s subsidized by the value the robot generates in the process.”

Both visions point to the same reality. Robotics simply doesn’t have its own canonical dataset—and it may be some time before it does. While researchers are making headway, the constraints are real. Embodied AI can’t scrape its way to generality. It has to collect the world one physical interaction at a time.