Date: 2025-08-21

DPK is built on three major architectural components: data access, transformation and runtime.

Data access: This component provides a uniform way to read data from and write data to various storage locations (for example, local file systems or S3-compatible storage). It supports checkpointing, so that a restarted job processes only the files that were not completed, saving time and resources.
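The checkpointing idea can be sketched in a few lines of Python. This is a minimal illustration and not DPK's actual data access API: the function names `pending_files` and `run_job`, the `.txt` suffix and the rule "a file is complete when an output file with the same name exists" are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable


def pending_files(input_dir: Path, output_dir: Path, suffix: str = ".txt") -> list[Path]:
    """Return input files that do not yet have an output file of the same name."""
    # Hypothetical completion rule: an output with the same name means "done".
    done = {p.name for p in output_dir.glob(f"*{suffix}")} if output_dir.exists() else set()
    return sorted(p for p in input_dir.glob(f"*{suffix}") if p.name not in done)


def run_job(input_dir: Path, output_dir: Path, process: Callable[[str], str]) -> list[str]:
    """Process only the files that are still pending and return their names."""
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for src in pending_files(input_dir, output_dir):
        result = process(src.read_text())
        # Write to a temporary name first, then rename, so that a crash
        # mid-write never leaves a file that looks complete.
        tmp = output_dir / (src.name + ".tmp")
        tmp.write_text(result)
        tmp.replace(output_dir / src.name)
        processed.append(src.name)
    return processed
```

Calling `run_job` a second time on the same folders returns an empty list, because every input already has a matching output. A restarted job therefore redoes only the unfinished files.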

Transformation: This module provides the specific functions that are applied to the data, including data conversion, deduplication and personally identifiable information (PII) detection. DPK ships a growing library of prebuilt transforms. The framework also allows contributors who are not deeply familiar with a distributed computing framework (such as Ray or Spark) to create and build their own custom transforms. Each transform acts as a self-contained and configurable unit of work, is composable into pipelines and supports LLM-scale data preparation.
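To make "self-contained, configurable and composable" concrete, here is a small Python sketch. It does not use DPK's real transform classes: `Transform`, `NormalizeWhitespace`, `ExactDedup` and `Pipeline` are hypothetical names, and records are assumed to be plain dictionaries with a `text` column.

```python
from dataclasses import dataclass
from typing import Protocol

Record = dict[str, str]


class Transform(Protocol):
    """Hypothetical interface: a transform maps a batch of records to a new batch."""

    def apply(self, records: list[Record]) -> list[Record]: ...


@dataclass
class NormalizeWhitespace:
    column: str = "text"  # configuration lives on the transform itself

    def apply(self, records: list[Record]) -> list[Record]:
        # Collapse runs of whitespace into single spaces in the chosen column.
        return [{**r, self.column: " ".join(r[self.column].split())} for r in records]


@dataclass
class ExactDedup:
    column: str = "text"

    def apply(self, records: list[Record]) -> list[Record]:
        # Keep the first record for each distinct value of the chosen column.
        seen: set[str] = set()
        kept = []
        for r in records:
            if r[self.column] not in seen:
                seen.add(r[self.column])
                kept.append(r)
        return kept


@dataclass
class Pipeline:
    steps: list[Transform]

    def apply(self, records: list[Record]) -> list[Record]:
        # Feed the output of each step into the next one.
        for step in self.steps:
            records = step.apply(records)
        return records
```

Because `Pipeline` exposes the same `apply` method as its steps, a pipeline can itself be used as a step in a larger pipeline. Note that none of the transforms refers to Ray, Spark or any other execution engine.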

Runtime: The runtime is the execution environment for a transform; it assigns work to workers and monitors execution. DPK can run in several runtimes, including pure Python (with multiprocessing), Ray and Spark. This flexibility lets users scale their data preparation from a single laptop to a large cluster with thousands of nodes. Runtimes also let transforms use runtime-specific facilities such as shared classes or object stores. The overall runtime architecture consists of a transform launcher (the entry point), a transform orchestrator (the component that runs the workflow and creates workers) and file processors (the components that process each individual file). The orchestrator uses the runtime to set up shared components and the data access that lets it find files to work on. Each file processor reads a file, processes it and then writes back the results and metadata by using application programming interface (API) calls.
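The launcher, orchestrator and file processor roles can be illustrated with a toy Python sketch. It is not DPK's implementation: the function names `launch`, `orchestrate` and `file_processor`, the configuration keys and the metadata fields are hypothetical, and a thread pool from the standard library stands in for the worker pool that a multiprocessing, Ray or Spark runtime would provide.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def file_processor(src: Path, out_dir: Path, transform: Callable[[str], str]) -> dict:
    """Read one file, transform it, write the result and return per-file metadata."""
    text = src.read_text()
    result = transform(text)
    (out_dir / src.name).write_text(result)
    return {"file": src.name, "chars_in": len(text), "chars_out": len(result)}


def orchestrate(in_dir: Path, out_dir: Path,
                transform: Callable[[str], str], workers: int = 4) -> list[dict]:
    """Find the files to work on, create workers and collect their metadata."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(in_dir.glob("*.txt"))
    # Stand-in worker pool; pool.map returns results in the order of `files`.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: file_processor(f, out_dir, transform), files))


def launch(config: dict) -> list[dict]:
    """Entry point: turn a configuration into one orchestrator run."""
    return orchestrate(
        Path(config["input"]),
        Path(config["output"]),
        config["transform"],
        config.get("workers", 4),
    )
```

For example, `launch({"input": "in", "output": "out", "transform": str.upper})` would write upper-cased copies of every `.txt` file in `in` to `out` and return one metadata dictionary per file. In this sketch, changing the runtime means changing only the pool inside `orchestrate`; the file processor and the transform are unaffected.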