Preparing documents for RAG with Unstructured Data Integration

Use Unstructured Data Integration to ingest, cleanse, transform, and enrich unstructured data for gen AI processing. Use the intuitive, drag-and-drop user interface with pre-built modules for tasks such as text extraction, filtering, and PII redaction to process your data. You can build repeatable visual data flows that continuously process new and updated documents so that your application always uses the latest available data.

Regardless of skill level or expertise, data teams can use an intuitive UI with pre-built, drag-and-drop operator nodes to ingest unstructured data from a variety of input data sources, such as local documents imported into the project or connected data assets from AWS S3, Box, SharePoint, and FileNet, and store the output in a vector database such as Milvus.

You use the graphical canvas to build your flow. A flow is a pipeline of steps for processing your data. The flow consists of multiple connected nodes, where each node represents an operator. Each operator performs a different task, for example, data ingestion, data annotation, or generating vector embeddings. An operator consumes the current representation of a document and produces additional data about that document.

Operators in the flow run in sequence; you specify which operators to use and the order in which they run.
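
For illustration only, the following Python sketch models a flow as an ordered list of operator functions applied to a simplified document representation. The operator names and the document structure are hypothetical, not the product's API.

```python
from typing import Callable

# Simplified: a document flows through the pipeline as a plain dict of features.
Document = dict

def ingest(doc: Document) -> Document:
    return {**doc, "text": f"contents of {doc['path']}"}

def annotate(doc: Document) -> Document:
    return {**doc, "language": "en"}

def embed(doc: Document) -> Document:
    return {**doc, "vector": [0.0, 0.1, 0.2]}  # toy embedding

# The flow is an ordered list of operators; changing the list changes
# which operators run and in what order.
flow: list[Callable[[Document], Document]] = [ingest, annotate, embed]

doc: Document = {"path": "reports/q1.pdf"}
for operator in flow:
    doc = operator(doc)

print(doc.keys())  # dict_keys(['path', 'text', 'language', 'vector'])
```

Each operator consumes the document it receives and returns it with additional data attached, which mirrors how operators accumulate features in a flow.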

You use the node palette on the left side of the canvas to drag operators onto the canvas and connect them into a pipeline. Some nodes are mandatory, and some must be used in a specific order. You can branch a flow to apply different operators based on conditions, and then merge the branches again. See Data preparation nodes for more information.

To run a flow, you configure a job that defines the parameters for running it. Each execution of the flow is called a job run, and a runtime is used to run the flow.

After a flow is created, Unstructured Data Integration keeps the flow live. With scheduled jobs, when source documents are updated, embeddings are reprocessed only for the documents that changed. This prevents the vectorized data from going stale after it is first consumed. It also reduces the manual burden of preparing raw unstructured data for enterprise-grade AI, that is, sifting through the data and determining whether it is relevant and accurate for the use case at hand, so that unstructured data integration can happen at scale.
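
As a conceptual illustration of this kind of change detection (not the product's internal mechanism), the following sketch re-selects only the documents whose content hash changed since the last run; all names here are hypothetical.

```python
import hashlib
from pathlib import Path

# Hypothetical record of the hash each document had when it was last embedded.
last_ingested: dict[str, str] = {}

def content_hash(path: Path) -> str:
    """Return a SHA-256 hash of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_documents(source_dir: Path) -> list[Path]:
    """Return only the documents whose contents changed since the last run."""
    changed = []
    for path in sorted(source_dir.glob("*.pdf")):
        digest = content_hash(path)
        if last_ingested.get(str(path)) != digest:
            changed.append(path)
            last_ingested[str(path)] = digest
    return changed

# A scheduled job run would re-embed only the returned documents,
# leaving embeddings for unchanged documents untouched.
```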

How unstructured data is processed

When unstructured data documents are loaded into the flow, an in-memory table is created with one row per document, and document metadata is collected. Then, each operator in the flow edits the table in one of the following ways (illustrated in the sketch after this list):

  • Filter out rows, for example, to exclude documents that are not relevant.

  • Modify the content in a column, for example, for PII redaction.

  • Add new columns, for example, for annotations.
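
The following pandas sketch illustrates these three kinds of table edits on a toy document table. It is a conceptual illustration only, not code that the flow generates.

```python
import pandas as pd

# One row per document; metadata columns are collected at load time.
docs = pd.DataFrame(
    {
        "doc_id": ["a.pdf", "b.docx", "c.txt"],
        "text": ["Quarterly report, SSN 123-45-6789", "Draft travel policy", "lunch menu"],
        "source": ["box", "sharepoint", "local"],
    }
)

# Filter out rows: drop documents that are too short to be relevant.
docs = docs[docs["text"].str.len() > 10]

# Modify the content in a column: a toy stand-in for PII redaction.
docs["text"] = docs["text"].str.replace(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED]", regex=True)

# Add new columns: a toy annotation feature.
docs["language"] = "en"

print(docs)
```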

When you double-click an operator node on the canvas, the right-hand panel shows the input features, which are the accumulation of all features added by the previous operators, sorted in descending order, and an editable list of output features where you decide which features to pass to the next node.

As a result of the complete flow, preprocessed, vectorized data is loaded into a vector database, and extracted entities can be curated and stored in an entity store, so that the data can later be used in your RAG use cases.
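
For example, a downstream RAG application could query the vectorized output with the Milvus Python client. The following is a minimal sketch with placeholder embeddings and a hypothetical collection name, not code produced by the flow.

```python
from pymilvus import MilvusClient

# Connect to a Milvus instance (the URI is a placeholder).
client = MilvusClient(uri="http://localhost:19530")

# Quick-setup collection with a primary key "id" and a "vector" field.
client.create_collection(collection_name="rag_chunks", dimension=4)

# Toy embeddings; in the flow these come from the embedding operator.
client.insert(
    collection_name="rag_chunks",
    data=[
        {"id": 1, "vector": [0.1, 0.2, 0.3, 0.4], "text": "Quarterly revenue grew 8%."},
        {"id": 2, "vector": [0.4, 0.3, 0.2, 0.1], "text": "The travel policy was updated."},
    ],
)

# A RAG application later retrieves the closest chunks for a query embedding.
hits = client.search(
    collection_name="rag_chunks",
    data=[[0.1, 0.2, 0.3, 0.4]],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```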

Data sources

Unstructured Data Integration supports the following input formats:

  • .pdf
  • .txt
  • .md
  • .ppt
  • .pptx
  • .doc
  • .docx

The data can be pulled from a supported connection, uploaded as supported data assets from your local computer into a project, or taken from existing document sets.

Vector databases

The following vector databases can be configured in the flow to store your output:

  • Milvus
  • Elasticsearch

Runtimes

Unstructured Data Integration provides the following runtimes responsible for executing the flows:

  • Python

Retrieving Access Control Lists for ingested documents

With Access Control List (ACL) retrieval, you can fetch and preserve file-level permission details during data ingestion. When ACL retrieval is enabled in the project settings, the system fetches ownership and access rights from the source and stores them in the Common Policy Gateway (CPG) over a Presto connection. The project settings also control how to proceed if the selected connection does not support ACL retrieval: you can choose to ingest the documents anyway.
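
As a simplified illustration of the kind of file-level permission metadata that ACL retrieval preserves, the following sketch uses hypothetical field names and settings; it is not the CPG schema or the product's code.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentAcl:
    """Hypothetical, simplified view of ACL details fetched from a source."""
    path: str
    owner: str
    allowed_users: list[str] = field(default_factory=list)
    allowed_groups: list[str] = field(default_factory=list)

def ingest(path: str, acl: DocumentAcl | None, ingest_without_acl: bool) -> bool:
    """Decide whether to ingest a document based on ACL availability.

    `ingest_without_acl` mirrors the project setting that allows ingestion
    when the connection does not support ACL retrieval.
    """
    if acl is None and not ingest_without_acl:
        return False  # skip the document, matching the default behavior
    # ... store the document, and store the ACL (when present) for policy checks ...
    return True

print(ingest("reports/q1.pdf", None, ingest_without_acl=True))   # True
print(ingest("reports/q2.pdf", None, ingest_without_acl=False))  # False
```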

Requirements

The following requirements must be met to retrieve ACLs:

  • Required connections: A Presto connection with CPG enabled is required to store the retrieved ACL information.

  • Required settings:

    1. To enable ACLs, navigate to Project Settings and under Access control list policy select Enable Access Control List retrieval.
    2. Enable ACLs on watsonx.data as described in Governance through Access Controlled Lists.

With these settings, the ACLs of documents from the source are fetched during ingestion. By default, if the source does not support ACL retrieval during import, the corresponding documents are not imported. However, you can choose to import the documents regardless of the ACL fetch status. To do this, select Ingest the documents even if fetching the access control list from the source is not supported by the connection in the project settings.

Learn more