Integrating unstructured data documents

Use Unstructured Data Integration to ingest, cleanse, transform, and enrich unstructured data for RAG processing. Use the intuitive, drag-and-drop user interface with pre-built modules for tasks such as text data extraction, filtering, and PII redaction to process your data. You can build repeatable visual data flows that continuously process new and updated documents so that your application always uses the latest available data.

Overview

Unstructured Data Integration enables data teams of all skill levels to build repeatable data preparation pipelines for RAG applications. The service provides pre-built processing modules that can be combined into flows to ingest unstructured data from various sources such as local documents, AWS S3, Box, SharePoint, or FileNet, and store the output in vector databases like Milvus.

A flow is a pipeline of steps for processing your data. Flows consist of operator nodes that run in sequence, where each operator performs a specific task such as data extraction, filtering, or PII redaction. Operators process documents, consuming the representation of a document and producing additional data about it. You can configure which operators to use and in what order they should run, and you can branch flows to apply different processing based on conditions.

The service supports continuous processing through scheduled jobs that automatically detect and process only changed documents, so your RAG application always uses the latest data without manual intervention. Because the service determines whether data is relevant and accurate for the use case at hand, less manual effort is needed to prepare raw unstructured data for enterprise-grade AI, which enables unstructured data integration at scale.

How flows process data

When unstructured data documents are loaded into a flow, an in-memory table is created with one row per document, and document metadata is collected. Each operator in the flow sequence then modifies this table, for example, by adding a new feature.

Operators can perform the following operations on the data table:

  • Filter rows: Exclude documents that are not relevant
  • Modify content: Transform data in columns, for example, for PII redaction
  • Add columns: Include new information, for example, for annotations

Each operator receives input features from all previous operators and produces output features that are passed to the next operator in the sequence. At the end of the complete flow, the preprocessed, vectorized data is loaded into a vector database, and extracted entities can be curated and stored in an entity store, so that the data can later be used in your RAG use cases.
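The table model described above can be illustrated with a minimal Python sketch. This is not the service's API: the row layout (a dict with `name` and `text`), the operator functions, and `run_flow` are all hypothetical names chosen for illustration. The sketch only shows how a sequence of operators can filter rows, modify content, and add columns on a table with one row per document.

```python
import re
from typing import Callable

# Hypothetical model: the table is a list of rows, one dict per document.
Row = dict
Table = list[Row]
Operator = Callable[[Table], Table]

# Simplified email pattern, for illustration only (not a complete PII detector).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def filter_empty(table: Table) -> Table:
    """Filter rows: drop documents that have no extracted text."""
    return [row for row in table if row["text"].strip()]


def redact_emails(table: Table) -> Table:
    """Modify content: mask email addresses in the text column."""
    return [{**row, "text": EMAIL.sub("[EMAIL]", row["text"])} for row in table]


def add_word_count(table: Table) -> Table:
    """Add columns: attach a new feature derived from an existing column."""
    return [{**row, "word_count": len(row["text"].split())} for row in table]


def run_flow(table: Table, operators: list[Operator]) -> Table:
    """Run operators in sequence; each one receives the previous output."""
    for operator in operators:
        table = operator(table)
    return table


documents = [
    {"name": "a.txt", "text": "Contact jane@example.com for details."},
    {"name": "b.txt", "text": "   "},
]
result = run_flow(documents, [filter_empty, redact_emails, add_word_count])
print(result)
```

In this sketch each operator returns a new table instead of changing rows in place, which keeps the input documents untouched and makes the order of operators easy to rearrange.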

Flow execution and automation

To run a flow, you configure a job that defines execution parameters. Each execution of a flow is called a job run. A flow runs on a runtime, and you can select the execution engine (Python or Spark) for each run depending on your needs. Python is the default. With Spark, you can schedule jobs that process hundreds of thousands of pages in parallel, which automates the transformation of your documents.

After a flow is created, Unstructured Data Integration maintains the live flow. With scheduled jobs, as source documents are updated, the embeddings are reprocessed only for those documents that changed. This addresses the post-vectorization problem of data becoming out of date after it is consumed, keeps the data fresh, and reduces the manual effort required to prepare unstructured data for enterprise AI applications.
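How the service detects changed documents is not described here. One common technique for this kind of incremental processing is to compare a content fingerprint with the one recorded during the previous run. The following minimal sketch assumes that approach; `fingerprint`, `select_changed`, and the `seen` dictionary are hypothetical names, not part of the product.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable digest of the document content."""
    return hashlib.sha256(content).hexdigest()


def select_changed(sources: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return names of new or modified documents and record their digests.

    `seen` maps a document name to the digest stored by the previous run.
    """
    changed = []
    for name, content in sources.items():
        digest = fingerprint(content)
        if seen.get(name) != digest:
            changed.append(name)
            seen[name] = digest
    return changed


seen: dict[str, str] = {}

# First scheduled run: every document is new.
first_run = select_changed({"a.pdf": b"version 1", "b.pdf": b"version 1"}, seen)

# Second scheduled run: only a.pdf was updated at the source.
second_run = select_changed({"a.pdf": b"version 2", "b.pdf": b"version 1"}, seen)

print(first_run, second_run)
```

Only the documents returned by `select_changed` would need to be re-embedded in a run, which is the effect that scheduled jobs are described as providing.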

Data sources

Unstructured Data Integration supports a number of input formats as described in Supported source data formats.

You can use connected assets, upload any supported data assets from your local computer into a project, or use the existing document sets.

Language support

You can process unstructured data documents in the following languages:

  • English
  • Japanese
  • Korean
  • French
  • Italian
  • Polish
  • German
  • Spanish

Documents that are processed together must use the same language. To handle documents in several languages, use the branching node to branch the flow based on language. Document classes must also be translated into the respective language.
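Conceptually, branching by language splits the document table into groups that each contain a single language. The sketch below illustrates that idea only. It assumes that each row carries a `language` metadata field with an ISO 639-1 code, which is a hypothetical representation and not the branching node's actual configuration.

```python
from collections import defaultdict

# Assumed ISO 639-1 codes for the supported languages listed above.
SUPPORTED_LANGUAGES = {"en", "ja", "ko", "fr", "it", "pl", "de", "es"}


def branch_by_language(rows: list[dict]) -> dict[str, list[dict]]:
    """Group rows so that every branch holds documents in one language."""
    branches: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        language = row["language"]
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"Unsupported language: {language!r}")
        branches[language].append(row)
    return dict(branches)


rows = [
    {"name": "report.pdf", "language": "en"},
    {"name": "rapport.pdf", "language": "fr"},
    {"name": "summary.pdf", "language": "en"},
]
branches = branch_by_language(rows)
print({language: len(docs) for language, docs in branches.items()})
```

Each group can then be passed to its own sequence of operators, with document classes defined in the language of that branch.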

Target databases

Unstructured Data Integration supports a number of target databases for storing the output of the flows, such as vector stores or entity stores. For a list of target databases, see Supported output targets.

Runtimes

Unstructured Data Integration provides the following runtimes responsible for executing the flows:

  • Python (default)
  • Spark

Project settings

You can preset a number of configuration values for your Unstructured Data Integration flows in the project, such as ACL settings, storage for document sets, default embedding models, and the runtime environment. For details, see Unstructured Data Integration settings.

Retrieving Access Control List for ingested documents

With the Access Control List (ACL), you can retrieve and preserve file-level permission details during data ingestion. When ACL retrieval is enabled in the project settings, the system fetches ownership and access rights from the source and stores them in the Common Policy Gateway (CPG) using a Presto connection. The project settings allow you to control how to proceed if the selected connection does not support ACL retrieval.

Requirements

To retrieve ACLs, you must meet the following requirements:

Required connections

You must have a Presto connection with CPG enabled to store the retrieved ACL information. This can be configured in project settings for all flows in the project, or by using the Access control node for individual flows.

Enabling ACL in a project

Complete the following steps to enable ACL retrieval:

  1. Navigate to Project Settings and under Access control list policy, select Enable Access Control List retrieval.
  2. Enable ACLs on watsonx.data as described in Governance through Access Controlled Lists.

With these settings enabled, the system fetches ACLs for all documents from the source during ingestion. By default, if the source does not support ACL retrieval during import, the corresponding documents are not imported. However, you can choose to import documents regardless of ACL fetch status. To do so, go to the project settings and select the option "Ingest the documents even if the fetching access control list from the source is not supported by the connection".
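The default and the optional behavior can be summarized in a short Python sketch. It models the decision only and does not call any product API. `SourceDocument`, `select_for_ingestion`, and the `ingest_without_acl` flag are hypothetical names, and an `acl` value of `None` stands for a source that cannot supply ACLs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDocument:
    name: str
    # None means the source connection could not supply an ACL (assumption).
    acl: Optional[list[str]]


def select_for_ingestion(
    docs: list[SourceDocument], ingest_without_acl: bool = False
) -> list[SourceDocument]:
    """Apply the ACL policy: skip documents without ACLs unless overridden."""
    return [doc for doc in docs if doc.acl is not None or ingest_without_acl]


docs = [
    SourceDocument("contract.pdf", acl=["alice", "legal-team"]),
    SourceDocument("notes.txt", acl=None),
]

# Default policy: documents whose ACL cannot be fetched are not imported.
default_names = [doc.name for doc in select_for_ingestion(docs)]

# Optional policy: import documents regardless of ACL fetch status.
override_names = [doc.name for doc in select_for_ingestion(docs, ingest_without_acl=True)]

print(default_names, override_names)
```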

Enabling ACL for individual flows

Regardless of the project settings, you can add the Access control node in your flow as described in Access control.
