11 July 2025
On 11 June 2025, IBM announced the availability of its new approach to data integration: watsonx.data integration. This solution offers a single control plane to author batch, real-time streaming and data replication pipelines, underpinned by built-in observability.
Within the same solution, teams can build reusable unstructured data pipelines alongside structured ones, unlocking a goldmine of previously inaccessible data to power new use cases and meet the evolving demands of modern data environments. With watsonx.data integration’s unstructured data integration (UDI) capability, users can intuitively build pipelines that ingest, transform, and process high volumes of unstructured data—including documents, PDFs, PPTs, and more—in just minutes.
This product combines breakthrough open source and proprietary innovations straight from IBM Research. Some best-in-class product features include:
Designed to handle the enterprise data that has been traditionally underutilized, watsonx.data integration marks a major step forward in unlocking unstructured data for AI and analytics.
Most public data is already well represented in today’s foundation models, so the real competitive advantage comes from leveraging your enterprise data. Yet 90% of enterprise data is unstructured. From documents and PDFs to emails, images and logs, most of it remains outside the reach of traditional analytics and AI workflows. And due to access and management complexity, only 1% is currently used in generative AI.
IBM watsonx.data integration and its broader ecosystem of tools are designed to address these challenges head-on. Below are key features of the UDI capability that help organizations navigate today’s rapidly evolving data landscape.
This solution includes prebuilt connectors that enable users to ingest a wide range of commonly used data sources and formats—along with their associated metadata and access controls—at scale and as they evolve. While some unstructured connectors exist in the market, few can adapt dynamically as documents or permissions change over time.
Developed in collaboration with IBM Research, watsonx.data integration combines proprietary innovation with leading open-source technologies to bring unstructured data processing into the modern data pipeline. Its visual canvas includes purpose-built operators for text and other modalities, covering personally identifiable information (PII) masking, hate, abuse and profanity (HAP) detection, quality filtering, language detection, and confidence scoring. Developers can design a single pipeline to process diverse file types at scale without writing or maintaining custom code. Just as drag-and-drop ELT does for structured data, watsonx.data integration brings an intuitive, low/no-code experience to unstructured data, and it also offers a full-featured Python SDK for those who prefer to work programmatically.
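To make the idea of chained text operators concrete, here is a minimal sketch in plain Python. It does not use the watsonx.data integration SDK; the operator names (`mask_emails`, `mask_phones`, `normalize_whitespace`) and the `run_pipeline` helper are hypothetical, and the regular expressions are deliberately simplified stand-ins for real PII detection.

```python
import re
from typing import Callable, List

# Hypothetical operator type for this sketch: text in, transformed text out.
Operator = Callable[[str], str]

# Simplified patterns; production PII detection covers far more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def mask_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def mask_phones(text: str) -> str:
    """Replace simple 3-3-4 digit phone numbers with a placeholder."""
    return PHONE_RE.sub("[PHONE]", text)


def run_pipeline(text: str, operators: List[Operator]) -> str:
    """Apply each operator in order, feeding the output of one into the next."""
    for op in operators:
        text = op(text)
    return text


# Operators are composed as an ordered list, much like nodes on a canvas.
PIPELINE: List[Operator] = [normalize_whitespace, mask_emails, mask_phones]
```

Calling `run_pipeline("Contact  jane.doe@example.com or 555-123-4567.", PIPELINE)` returns `"Contact [EMAIL] or [PHONE]."`. Because each operator shares one signature, new steps such as language detection could be added to the list without changing the runner.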
In addition, prebuilt operators for embedding, chunking, and vectorization allow users to transform raw documents into structured representations optimized for downstream AI. These operators automatically convert unstructured content into semantically meaningful vectors, enabling use cases such as RAG, document classification, and intelligent search—all without requiring deep machine learning (ML) expertise.
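As a rough illustration of what chunking and vectorization involve, the sketch below splits text into overlapping word windows and maps each chunk to a fixed-length vector. Everything here is an assumption for illustration: `chunk_words` and `toy_embed` are hypothetical names, and the hashed bag-of-words vector is only a stand-in for a real embedding model, which would capture meaning rather than word counts.

```python
import hashlib
import math
from typing import List


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(chunk: str, dim: int = 16) -> List[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word in chunk.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def vectorize(text: str, size: int = 50, overlap: int = 10) -> List[List[float]]:
    """Chunk a document and embed every chunk."""
    return [toy_embed(c) for c in chunk_words(text, size, overlap)]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side, which is one common reason RAG pipelines chunk with overlapping windows.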
This support for unstructured data integration is architected to process petabytes of complex, unstructured content efficiently. Documents of 10MB or more—across thousands of files—are compressed into a unified, high-performance format, enabling rapid processing and reprocessing. This architecture is purpose-built to meet the demands of enterprise-scale unstructured data.
The pipeline supports self-updating data structures. When a source document—say, “Document A”—is updated to a new version, only the delta is captured and seamlessly propagated downstream, including to the vector database. This ensures that thousands of pipelines at scale stay current without the need for full reprocessing.
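One common way to capture only the delta between document versions is to fingerprint each chunk and compare fingerprints. The sketch below assumes that approach; it is not a description of how watsonx.data integration implements it. The names `fingerprint`, `diff_chunks` and `apply_delta` are hypothetical, and a plain dictionary stands in for the vector database.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(chunk: str) -> str:
    """Content hash used as a stable identifier for a chunk."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def diff_chunks(old: List[str], new: List[str]) -> Tuple[List[str], List[str]]:
    """Return (chunks to upsert, fingerprints to delete) between two versions."""
    old_fp = {fingerprint(c) for c in old}
    new_by_fp = {fingerprint(c): c for c in new}
    to_upsert = [c for fp, c in new_by_fp.items() if fp not in old_fp]
    to_delete = sorted(old_fp - set(new_by_fp))
    return to_upsert, to_delete


def apply_delta(index: Dict[str, str], old: List[str], new: List[str]) -> int:
    """Update an in-memory index (a stand-in for a vector store) with only the changes.

    Returns the number of chunks that had to be reprocessed.
    """
    to_upsert, to_delete = diff_chunks(old, new)
    for fp in to_delete:
        index.pop(fp, None)
    for chunk in to_upsert:
        index[fingerprint(chunk)] = chunk
    return len(to_upsert)
```

If "Document A" goes from `["intro", "pricing v1", "contact"]` to `["intro", "pricing v2", "contact"]`, `apply_delta` reprocesses one chunk and leaves the other two entries untouched.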
The pipeline natively supports access control lists (ACLs), so document-level permissions are preserved throughout the data pipeline. This means users only access data they are authorized to see, which is critical for maintaining security, compliance, and trust as unstructured data flows across teams and applications.
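A simple way to picture permission preservation is to copy the source document's ACL onto every chunk derived from it and filter on that ACL at query time. The sketch below assumes a group-based permission model; the `Chunk` type and the `split_with_acl` and `visible_to` helpers are hypothetical and are not part of any IBM API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    allowed_groups: FrozenSet[str]  # ACL inherited from the source document


def split_with_acl(source: str, text: str, allowed_groups: Iterable[str]) -> List[Chunk]:
    """Split a document into paragraphs, stamping each one with the document's ACL."""
    acl = frozenset(allowed_groups)
    return [
        Chunk(text=part.strip(), source=source, allowed_groups=acl)
        for part in text.split("\n\n")
        if part.strip()
    ]


def visible_to(chunks: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only the chunks whose ACL shares at least one group with the user."""
    groups = set(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]
```

With `hr.pdf` restricted to the `hr` group and `faq.pdf` open to both `hr` and `eng`, a user in `eng` would see only the chunks from `faq.pdf`, while a user in neither group would see nothing.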
Ultimately, no single organization can solve these problems in a vacuum. watsonx.data integration’s support for UDI is built on a flexible infrastructure grounded in modern open-source tools. Below are the core technical components that form this foundation.
watsonx.data integration’s support for UDI was developed in response to IBM’s own experience building the Granite family of foundation models. Processing and preparing the 12 trillion tokens used to train Granite exposed critical gaps in existing unstructured data tooling. In response, IBM Research created the Data Prep Kit (DPK) and the Data and Model Factory (DMF), modular frameworks that offer robust cleaning operators across modalities like text, code, languages, and images. These battle-tested components, now packaged into watsonx.data integration, were designed for high-throughput, production-grade use cases. Today, DPK has been open-sourced through the Linux Foundation, continuing IBM’s mission to democratize access to advanced unstructured data tooling.
watsonx.data integration’s support for UDI also incorporates Watson Document Understanding and Docling, an open-source IBM initiative with over 30K GitHub stars, to deliver state-of-the-art document parsing and entity extraction. These technologies excel at complex extraction tasks—including table extraction—with industry-leading speed and accuracy.
Whether you prefer open-source options like Milvus or managed vector databases, watsonx.data integration’s UDI capability offers support for either. Vectorization pipelines are natively embedded in the platform, enabling fast deployment to your preferred storage solution for RAG and semantic search workloads.
IBM watsonx.data integration is actively piloting integrations with LangChain and other popular open-source orchestration frameworks, bringing a groundswell of community-driven innovation into the platform. These integrations enable full-stack orchestration of functions built or leveraged via LangChain directly within a native watsonx.data integration pipeline, while preserving the enterprise-grade governance, security, and scalability required for production use.
With IBM watsonx.data integration, clients can unlock the full potential of unstructured data through a powerful combination of open-source innovation and proprietary enterprise technology. From personalized content generation to invoice aggregation and agentic decision-making, UDI transforms raw content into AI-ready insights—now available as part of IBM watsonx.data integration.
What sets this offering apart is its ability to unify structured and unstructured data in one platform, simplifying pipeline building and reducing tool sprawl, thus accelerating outcomes. No matter the use case, watsonx.data integration is the foundation for unlocking business value from all your data.