A reliable set of data from quality sources is necessary for producing effective artificial intelligence (AI) systems and large language model (LLM) applications. Data preparation includes cleaning, transformation and sophisticated processes for organizing raw, unstructured data into a usable dataset. This process is difficult because quality issues can be hidden and become apparent late in the software development lifecycle, leading to challenging debugging efforts and delays to the overall project timeline. Quality data is essential for building responsible LLM and AI applications, as poorly prepared data can lead to biased or unsafe outputs. There is also the risk of a privacy violation if personally identifiable information (PII) is included in the model data without being recognized and handled properly. Poor data preparation can also waste time and resources. The foundation for developing responsible generative AI (gen AI) is to prepare your data properly so that you can build your product efficiently.
To provide developers and data science specialists with the tools to address the challenges of data preparation, IBM built the open-source data prep kit (DPK). DPK gives practitioners a toolkit for preparing data files for AI-based applications and LLM workflows. To achieve this objective, DPK provides several modular "transforms" in a Python environment. The user can employ the transforms as isolated classes or combine them into a data pipeline to clean, transform and organize data from a data source. DPK is built for scale: it uses columnar data formats such as Parquet and scalable runtimes such as Ray and Spark, so data preparation can run on a single machine or on a large cluster for fast processing. DPK has been successfully used to prepare data for IBM® Granite® models, showcasing its effectiveness in real-world use cases.
DPK is built around three major architectural components: data access, transformation and runtime.
Data access: This component enables a uniform way to read and write data from various storage locations (for example, local file systems, S3-compatible storage). It supports checkpointing, so that if a job needs to restart, it will process only files that haven't been completed, saving time and resources.
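The checkpointing idea can be pictured with a short sketch. This is a conceptual illustration only, not DPK's data access API: it assumes simple local input and output folders of Parquet files and skips any file whose output already exists.

```python
from pathlib import Path

def files_to_process(input_dir: str, output_dir: str) -> list[Path]:
    """Return only the input Parquet files with no corresponding output yet,
    mimicking checkpointing: a restarted job skips completed files."""
    done = {p.name for p in Path(output_dir).glob("*.parquet")}
    return [p for p in Path(input_dir).glob("*.parquet") if p.name not in done]

# Example: process whatever is left after a crash or restart.
for path in files_to_process("data/input", "data/output"):
    print(f"still to do: {path}")
```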
Transformation: This module provides the specific functions that are applied to the data, including data conversion, deduplication and personally identifiable information (PII) detection. DPK ships a growing library of prebuilt transforms. The framework also allows contributors who are not deeply familiar with a distributed computing framework such as Ray or Spark to create and build their own custom transforms. Each transform acts as a self-contained, configurable unit of work, is composable into pipelines and supports LLM-scale data preparation.
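The sketch below shows what a minimal custom transform might look like. It assumes DPK's abstract table-transform interface; the import path, constructor and method signature can differ between DPK releases, so treat it as an illustration rather than a recipe.

```python
import pyarrow as pa
import pyarrow.compute as pc

# Assumed import path for DPK's table-transform base class; verify against the
# version of the toolkit you have installed.
from data_processing.transform import AbstractTableTransform


class MinLengthFilterTransform(AbstractTableTransform):
    """Drop rows whose 'contents' column is shorter than a configured minimum."""

    def __init__(self, config: dict):
        super().__init__(config)
        self.min_chars = config.get("min_chars", 200)

    def transform(self, table: pa.Table, file_name: str = None):
        lengths = pc.utf8_length(table.column("contents"))
        filtered = table.filter(pc.greater_equal(lengths, self.min_chars))
        # Return the transformed table(s) plus metadata describing what was done.
        metadata = {"rows_in": table.num_rows, "rows_out": filtered.num_rows}
        return [filtered], metadata
```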
Runtime: The runtime is the execution environment that assigns work to transforms and monitors their progress. DPK can run in various runtimes, including pure Python (with multiprocessing), Ray and Spark, giving users the flexibility to scale data preparation from a single laptop to a large cluster with thousands of nodes. The runtime manages the assignment of work to workers and also lets transforms use runtime-specific features such as shared classes or object stores. The overall runtime architecture is made up of a transform launcher (the entry point), a transform orchestrator (the component that runs the workflow and creates workers) and file processors (the components that process each individual file). The orchestrator uses the runtime to establish shared components and the data access object that allows it to find files to work on. Each file processor reads a file, processes it and writes back the results and metadata by using application programming interface (API) calls.
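The launcher-to-orchestrator flow can be sketched with the pure Python runtime. The imports and class names below follow the pattern used in DPK's example scripts, but module layouts differ across releases, so they are assumptions to adapt rather than exact instructions.

```python
import sys

# Assumed module paths, based on DPK's example scripts; adjust for your release.
from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from dpk_noop.transform_python import NOOPPythonTransformConfiguration

# Illustrative local folders for the data access component to read and write.
local_conf = {"input_folder": "data/input", "output_folder": "data/output"}
params = {"data_local_config": ParamsUtils.convert_to_ast(local_conf)}

if __name__ == "__main__":
    # The launcher is the entry point; it builds the orchestrator, which uses
    # data access to find files and hands each one to a file processor.
    sys.argv = ParamsUtils.dict_to_req(d=params)
    launcher = PythonTransformLauncher(runtime_config=NOOPPythonTransformConfiguration())
    launcher.launch()
```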
DPK's automation and scalability set it apart: the same transforms can run across several runtimes, including local Python, Ray and Spark.
With these functions, DPK stands apart from established data preparation projects such as BigCode, DataTrove and Dolma. A major differentiator is the number of use cases it supports (pretraining, fine-tuning and retrieval-augmented generation (RAG)) and the range of data it can handle. Unlike toolkits that support a single data modality, DPK supports both natural language data and code data. DPK is also extensible, making it easy to add new scalable modules.
For automation, DPK has a command line interface (CLI) and a no-code option by using Kubeflow Pipelines (KFP), a platform for deploying scalable machine learning (ML) workflows on Kubernetes. This function broadens the audience, making the toolkit applicable to both data scientists and AI application developers (a minimal KFP sketch follows the feature list below).
The features of KFP integration are:
Scalability: KFP scales to large datasets and complex workflows because it runs on Kubernetes, a scalable, on-demand computing platform.
Modularity: Each workflow is divided into reusable components, which simplifies implementing complex workflows and debugging them.
Reproducibility: Each execution's history is tracked, which allows experiments to be reproduced.
Visualization: An interface is available to monitor pipeline runs, visualize results and troubleshoot issues.
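The sketch below shows what wrapping a DPK transform as a KFP step might look like. The KFP v2 decorators are real, but the container image name, entry point and arguments are illustrative placeholders, not the components that DPK actually publishes.

```python
from kfp import dsl, compiler

# Placeholder image and command: substitute the real DPK transform container
# and its documented arguments for your environment.
@dsl.container_component
def ededup_step(input_path: str, output_path: str):
    return dsl.ContainerSpec(
        image="example.org/dpk-ededup:latest",          # placeholder image
        command=["python", "-m", "dpk_ededup"],          # placeholder entry point
        args=["--input_folder", input_path, "--output_folder", output_path],
    )

@dsl.pipeline(name="dpk-governance-pipeline")
def governance_pipeline(input_path: str, output_path: str):
    # A real pipeline would chain several steps (dedup, PII redaction, and so on).
    ededup_step(input_path=input_path, output_path=output_path)

# Compile to a YAML package that can be uploaded to a KFP instance.
compiler.Compiler().compile(governance_pipeline, "governance_pipeline.yaml")
```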
DPK's scalability is supported by experimental studies. For example, execution time for transforms drops roughly in proportion as the number of processing nodes increases. It scales uniformly across workloads, from annotating simple data to running more computationally expensive workloads such as language identification.
As the capabilities of AI systems improve and they become more integral to decision-making processes, the need to establish governance that ensures ethical, safe and compliant AI becomes urgent. However, it is important to highlight that governance does not begin with models; it begins with data.
What makes DPK especially valuable is how it contributes directly to the objectives of AI governance:
Safety: DPK 'transforms" identify and remove toxic or harmful content with its HAP (hate, abuse and profanity) detection.
Fairness: DPK "transforms" remove bias through deduplication and filtering of poor-quality or irrelevant data.
Compliance: DPK "transforms" deduce and redact PII in compliance with privacy-related regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act).
Transparency: DPK "transforms" add document hashes and quality scores to boost traceability and auditability.
Whether you work in healthcare, finance, government or any domain where trust is significant, DPK helps to create AI systems on a foundation of responsible data.
Scenario: A government agency is building a domain-specific LLM to help summarize legal documents. The dataset contains public records, legal filings and policy documents scraped from various sources.
Governance challenges:
Some sources are unverified or politically biased.
The dataset includes multiple languages.
Possible toxic or inflammatory language.
PII such as names and addresses must be removed.
Duplicate filings and boilerplate text are common.
Document quality varies tremendously.
Solution: By deploying a governance recipe leveraging DPK, the agency can ensure the dataset is:
Ethically sourced
Privacy compliant
High quality and diverse
Ready for responsible LLM training
If you need to address the challenges in the preceding use case, or mitigate data quality, safety and compliance risks in any similar scenario, you need a pipeline that enforces governance policies by design. DPK makes this pipeline possible by providing a series of modular transforms that you can chain together to clean, filter and enrich the dataset.
The following simple six-step governance pipeline meets the main requirements for AI governance. Every step addresses an associated risk or quality issue so that the final dataset is ethically sourced, privacy compliant and ready for responsible AI development.
1. Annotate and filter document quality to remove low-quality or unverified sources.
2. Identify languages so that multilingual content can be routed or filtered appropriately.
3. Detect and filter hate, abuse and profanity (HAP) to remove toxic or inflammatory language.
4. Detect and redact PII such as names and addresses.
5. Apply exact and fuzzy deduplication to remove duplicate filings and boilerplate text.
6. Add document hashes and quality scores for traceability and auditability.
To build your own governance pipeline:
Clone the repository: https://github.com/data-prep-kit/data-prep-kit
Install transforms: pip install 'data-prep-toolkit-transforms[all]'
Configure your pipeline by using KFP or Python with dpk_transform_chain (a minimal folder-chaining sketch follows these steps).
Run the pipeline by using the CLI or integrate it into your data platform by using Ray or Spark.
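One simple way to picture step 3 is a chain in which each stage reads the previous stage's output folder. The sketch below is conceptual: it does not use dpk_transform_chain's actual API, and run_stage is a hypothetical placeholder where each DPK transform would be invoked.

```python
from pathlib import Path

def run_stage(name: str, input_folder: str, output_folder: str) -> None:
    """Hypothetical stage runner; in practice this would invoke the matching
    DPK transform (dedup, language id, HAP, PII redaction, quality filtering)."""
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    print(f"running {name}: {input_folder} -> {output_folder}")

# Illustrative stage names only; they mirror the governance steps above.
STAGES = ["ededup", "lang_id", "hap", "pii_redactor", "doc_quality"]

def run_pipeline(root: str = "data") -> None:
    current = f"{root}/input"
    for stage in STAGES:
        target = f"{root}/{stage}"
        run_stage(stage, current, target)
        current = target  # the next stage consumes this stage's output

run_pipeline()
```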
DPK contains an array of prebuilt transforms, categorized for different data preparation requirements.
Data ingest transforms: These transforms convert data from one or more formats into a standardized format. For instance, compressed code files to Parquet, HTML/PDF to Parquet.
Universal (code and language) transforms: These transforms can be used for both code data and natural language data. For instance, exact deduplication and fuzzy deduplication to remove duplicate records, and resize to bring files to an appropriate size.
Language only transforms: These transforms are for natural language data. They include language identification, which identifies the language of the text, and a PII annotator or redactor, which identifies and removes personally identifiable information.
Code only transforms: These are for code data only. They include programming language annotation, code quality annotation and a header cleanser that targets license and copyright information. The framework also contains transforms for RAG applications, such as document chunking and a text encoder that creates embedding vectors for text. These transforms are a key part of preparing data for use in a vector database.
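As one concrete illustration of the universal transforms, exact deduplication can be pictured as hashing each document and keeping only the first occurrence. This is a conceptual sketch in plain PyArrow and Python, not DPK's own implementation, which is built to scale across runtimes.

```python
import hashlib

import pyarrow as pa
import pyarrow.parquet as pq

def exact_dedup(table: pa.Table, text_column: str = "contents") -> pa.Table:
    """Keep only the first row for each distinct document hash."""
    seen: set[str] = set()
    keep: list[bool] = []
    for text in table.column(text_column).to_pylist():
        digest = hashlib.sha256((text or "").encode("utf-8")).hexdigest()
        keep.append(digest not in seen)
        seen.add(digest)
    return table.filter(pa.array(keep))

# Example with an in-memory table; real pipelines read and write Parquet files.
docs = pa.table({"contents": ["hello world", "hello world", "another doc"]})
pq.write_table(exact_dedup(docs), "deduped.parquet")
```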
DPK is crucial for anyone creating AI, especially LLM applications. The toolkit is a clean, scalable and standardized framework that simplifies the complex and sometimes laborious process of data preparation. With a complete set of modular "transforms", developers and data scientists can clean and structure raw data and build significant facets of AI governance into their data workflows. It lets them be confident in the quality and trustworthiness of the foundational data used for training, and maintain accountability around ethics and privacy. DPK ultimately improves AI systems by ensuring responsibility and trustworthiness for developers, enabling model innovation that is firmly anchored on a strong data foundation.