Open sourcing data prep tools for large language models
25 November 2024
By Aili McConnon, Tech Reporter, IBM

Open source large language models (LLMs) get a lot of love because anyone can modify and use them. But the benefits of open sourcing are lost if preparing the data needed to train and tweak the models is expensive and time-consuming.

“Every conversation in AI starts with models and, in reality, ends with data,” says Petros Zerfos, Principal Research Scientist of Data Engineering for Generative AI at IBM Research. For enterprises, that often means AI teams actually spend more time preparing data for models than on the models themselves, Zerfos says.

The solution? Some large tech companies are open sourcing data preparation tools. For example, IBM’s Data Prep Kit and NVIDIA’s NeMo Curator make it easier for enterprises of all sizes to train and fine-tune LLMs, allowing them to get value from AI applications more quickly and cost-effectively.

The data challenge

As companies race to develop and deploy LLMs and AI applications, one of the biggest bottlenecks is data preparation. In fact, 79% of enterprise AI teams surveyed in Gartner’s 2023 report, “Explore Data-Centric AI Solutions to Streamline AI Development,” said the most common strategic task they perform is data preparation and generation.

Data preparation generally happens during two key stages in the development of LLMs. In the pretraining stage, models are trained on hundreds of terabytes of data so they can comprehend plain English and acquire sufficient knowledge and nuance across a range of domains. According to Zerfos, pretraining models from scratch requires hundreds of people and millions of dollars, so only very large companies—or a few well-capitalized startups—have the resources to do so.

In the second stage of data preparation, AI teams use smaller volumes of targeted data to fine-tune LLMs so they can generate more accurate and relevant text. Some very large companies with ample resources handle both phases, but most companies focus on preparing data to fine-tune models that others have already built.
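For a sense of scale, the targeted data used in this second stage is often a few thousand curated, domain-specific examples rather than terabytes of raw text. Below is a minimal sketch of what that preparation can look like, assuming a simple prompt-and-completion layout; the field names and file name are illustrative assumptions, not any specific tool’s schema.

```python
import json

# Hypothetical example: convert a handful of curated Q&A pairs into JSONL,
# a common input format for instruction fine-tuning. The field names and
# file name are illustrative assumptions, not a specific tool's schema.
qa_pairs = [
    {"question": "What does our refund policy cover?",
     "answer": "Refunds are available within 30 days of purchase."},
    # ...typically a few thousand more curated, domain-specific pairs
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        record = {"prompt": pair["question"], "completion": pair["answer"]}
        f.write(json.dumps(record) + "\n")
```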

Open source data preparation tools

Several companies, including IBM and NVIDIA, have recently open sourced tools to help developers tackle the arduous task of unstructured data preparation. IBM’s Data Prep Kit is a library of modules that a developer can plug into their pipeline to curate data in either the pretraining or fine-tuning stage. The modules work with source documents containing unstructured data such as text (for example, PDFs) and code (for example, HTML) and can be used to annotate, transform and filter the data.
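To make the modular idea concrete, here is a minimal sketch of how annotate, transform and filter steps can be chained in a pipeline. The Document class and transform functions below are illustrative assumptions for this article, not the actual Data Prep Kit API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a modular data prep pipeline; the class and
# function names are hypothetical, not Data Prep Kit modules.
@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)

def strip_markup(doc: Document) -> Document:
    # Transform step: a real module would convert PDF or HTML into plain text.
    doc.text = doc.text.replace("<p>", " ").replace("</p>", " ").strip()
    return doc

def annotate_language(doc: Document) -> Document:
    # Annotate step: tag each document with a (placeholder) language label.
    doc.metadata["language"] = "en"
    return doc

def long_enough(doc: Document, min_chars: int = 200) -> bool:
    # Filter step: drop documents too short to be useful for training.
    return len(doc.text) >= min_chars

def run_pipeline(docs):
    # Each module plugs into the chain and feeds the next one.
    for doc in docs:
        doc = annotate_language(strip_markup(doc))
        if long_enough(doc):
            yield doc
```

A real pipeline would swap in purpose-built modules for each of these steps, but the plug-in pattern is the same.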

The IBM team open sourced these tools to make them accessible to enterprises of all sizes, says Zerfos. “The developer does not need to do anything special whether they’re running it on a laptop, a server or a cluster,” he says. “It can also run on any cloud infrastructure.”

Since it launched in May 2024, developers have been experimenting with the Data Prep Kit framework and its modules, which are accessible via GitHub. Several members of the AI Alliance, a community that includes tech companies large and small, have also started testing how certain modules can streamline and accelerate training and fine-tuning, Zerfos says.

AI hardware and software giant NVIDIA has also recently open sourced a series of data preparation modules to improve the accuracy of generative AI models. The NVIDIA NeMo Curator processes text, images and video data at scale. It also provides pre-built pipelines to generate synthetic data to customize and evaluate generative AI systems.

One of the tasks that NVIDIA’s NeMo Curator promises to speed up is deduplication. When downloading data from massive web-crawl sources like Common Crawl, it’s typical to encounter both documents that are exact duplicates of one another and documents that are near-duplicates.

Using an upcoming version of the NeMo Curator, the tool’s developers say organizations will be able to complete this deduplication task 20 times faster and five times cheaper than they currently do. 
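For intuition, exact duplicates can be removed by hashing each document and keeping only the first occurrence, while near-duplicates typically require fuzzier techniques such as MinHash. The snippet below is a generic, single-machine illustration of the exact-match step only, not NeMo Curator’s distributed implementation.

```python
import hashlib

def dedupe_exact(documents):
    # Generic illustration of exact deduplication: hash each document and
    # keep the first occurrence. Large-scale tools distribute this step and
    # add MinHash-style detection for near-duplicates.
    seen = set()
    unique = []
    for text in documents:
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",   # exact duplicate, dropped
    "A completely different page.",
]
print(len(dedupe_exact(docs)))  # 2
```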

To be sure, open sourcing these tools makes them more broadly accessible. Enterprise AI teams, however, still need a certain level of skill and training to generate value from them, caution experts such as Mark A. Beyer, a Distinguished VP Analyst at Gartner.

“Simply giving someone a tool without guidance, methodologies and functions to support it starts to turn into experimentation,” he says. “It can take four to five times longer than simply leveraging existing tools.”

Going forward, though, Ben Lorica, host of The Data Exchange podcast, sees great potential for data preparation tools as companies increase their use of multimodal data—even if it’s still early days.

“As your applications rely on an increasing amount of video and audio in addition to text, you will need some sort of tool that will allow you to scale and use larger data sets and take advantage of whatever hardware you have,” he says. “Especially in the agent world, data will be a differentiator. You want access to the right data at the right time.”