Building data pipelines that ingest, preprocess and transform unstructured data to enable RAG use cases

31 October 2024

Author

Scott Brokaw

Director, Product Management, Data Integration, IBM

Unstructured data is the information, in a wide variety of formats, that a company collects in the course of doing business. While its use may not be immediately clear, that data has immense value, especially for organizations looking to unlock the potential of generative AI (gen AI). It just needs to be processed and organized first.

To address this need, IBM is releasing a new capability: IBM Data Integration for Unstructured Data. With this technology, data teams will soon be able to ingest, cleanse, transform and enrich unstructured data at scale for downstream AI, specifically for retrieval-augmented generation (RAG) use cases.

As a key component of the IBM data integration portfolio, this capability will empower businesses to integrate unstructured data at scale.

The challenge of unstructured data

Data teams are vital for improving data quality and supporting AI and analytics. Yet data science teams spend the majority of their time processing data for downstream use. This challenge is intensified by a rapid growth in the diversity of data formats—especially when it comes to unstructured data, which can include text, images, videos and IoT sensor data. Furthermore, unstructured data now accounts for 90% of all enterprise-generated data, and is growing three times faster than structured data, according to IDC.

Though unstructured data can produce valuable insights on consumer behavior and market trends, few tools can manage it effectively, which highlights the need for scalable solutions. Data teams also face numerous challenges when managing unstructured data for AI, including:

  • Diverse file types and the need to preprocess unstructured data for downstream use
  • Multiple versions of the same unstructured document, or changes made to source documents over time
  • Unstructured documents that contain sensitive information, such as personally identifiable information (PII)
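To make these three challenges concrete, here is a minimal Python sketch. It is not IBM's implementation or API. Every name in it (`extract_text`, `fingerprint`, `needs_reprocessing`, `redact_pii`) is hypothetical, and the two regular expressions are illustrative assumptions that cover only simple email and US Social Security number formats. It shows dispatching on file type, detecting changed documents with a content hash, and masking PII before data moves downstream.

```python
import hashlib
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_text(filename: str, payload: bytes) -> str:
    """Dispatch on file extension (hypothetical; only plain-text types handled)."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in {"txt", "md", "csv"}:
        return payload.decode("utf-8", errors="replace")
    raise ValueError(f"no extractor registered for .{ext} files")


def fingerprint(text: str) -> str:
    """Content hash used to tell whether a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reprocessing(doc_id: str, text: str, seen: dict) -> bool:
    """Return True when the document is new or its content differs from last time."""
    digest = fingerprint(text)
    if seen.get(doc_id) == digest:
        return False
    seen[doc_id] = digest
    return True


def redact_pii(text: str) -> str:
    """Mask email addresses and SSN-shaped numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

In a sketch like this, the `seen` dictionary stands in for whatever state store a real pipeline would use to track document versions between runs.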

IBM’s approach

To address these challenges, IBM’s solution provides data teams with a low-code platform to automatically ingest raw data, organize it with drag-and-drop functionality and then load the results into targets, such as vector databases. With these features, teams can build reusable, repeatable pipelines that process and transform an organization’s unstructured data, reducing the tedious manual work often involved in preparing raw unstructured data for enterprise-grade AI.
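The product itself is low-code, so no programming is required to use it. Still, the idea of a reusable ingest, transform and load pipeline can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration rather than IBM's design: the `Pipeline` class, its step functions and the fixed-size character chunking are all hypothetical, and a plain Python list stands in for the target system.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Pipeline:
    """Ordered text-to-text steps followed by chunking; all names are hypothetical."""
    steps: List[Callable[[str], str]] = field(default_factory=list)
    chunk_size: int = 200  # characters per chunk; an arbitrary illustrative value

    def run(self, documents: Iterable[str], target: List[str]) -> int:
        """Transform each document, split it into chunks and append them to the target."""
        loaded = 0
        for doc in documents:
            for step in self.steps:
                doc = step(doc)
            for start in range(0, len(doc), self.chunk_size):
                target.append(doc[start:start + self.chunk_size])
                loaded += 1
        return loaded


def strip_control_chars(text: str) -> str:
    """Drop non-printable characters while keeping ordinary whitespace."""
    return "".join(ch for ch in text if ch.isprintable() or ch.isspace())


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())
```

Because the steps are ordinary functions held in a list, the same `Pipeline` object can be rerun on new batches of documents, which is the sense of "reusable, repeatable" that this sketch tries to capture.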

Initial use cases for building these new pipelines include:

  • RAG use cases: Clients can build a powerful chatbot with IBM watsonx.ai™, leveraging their own unstructured data to create AI applications at scale. These pipelines help clients create embeddings and load them into vector databases such as Milvus on watsonx.data™.
  • Populating lakehouses for analytics: Clients can perform entity extraction on unstructured documents and merge the extracted insights with traditional structured or semi-structured data to power analytics.
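For readers new to RAG, the retrieval step behind the first use case can be illustrated with a toy example. The Python sketch below makes two loud assumptions: a word-count vector stands in for a real embedding model, and the hypothetical `InMemoryVectorStore` class stands in for a vector database. It does not use the watsonx.ai, watsonx.data or Milvus APIs.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real pipeline would call a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        """Store the text alongside its vector."""
        self.rows.append((embed(text), text))

    def search(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

In a RAG application, the texts returned by `search` would be passed to a language model as context for answering the user's question.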

With IBM Data Integration for Unstructured Data, there is no need to stitch multiple disparate tools together. Clients will be able to manage both structured and unstructured data all in one place.

Watch the demo here (no audio):

Be sure to sign up for early access today.