Unstructured data transformation nodes
Unstructured Data Integration provides a graphical user interface where you can drag and drop pre-built nodes to build a flow of preprocessing tasks that prepare your unstructured data for RAG use cases. Each node type represents a stage in a flow where a specific task is completed.
The following sections describe all the nodes available on the palette in Unstructured Data Integration. Drag a node from the list onto the flow canvas, or double-click it, to add it to your flow. You can then double-click any node icon in your flow to set its properties. Hover over a property to see information about it, or click the information icon to open Help.
The nodes must be added in the following order:
Ingest data node
One node of this type is required as the first node. With this node type you can specify where to pull the documents from. The following options are available:
- Data assets - Pull the data from project assets.
- Connections - Ingest the documents from an Amazon S3 bucket, Box, FileNet, or SharePoint (sketched below).
- From document set - Ingest data from existing document sets.
For details, see Ingest data nodes.
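Although the flow is assembled in the GUI, the ingest stage conceptually amounts to listing and downloading the source documents. The following is a minimal sketch for the Amazon S3 connection case, using the standard boto3 client; the bucket, prefix, and directory names are hypothetical placeholders, not values the product requires.

```python
import os
import boto3

def ingest_from_s3(bucket: str, prefix: str, dest_dir: str) -> list[str]:
    """Download every object under `prefix` and return the local paths."""
    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):
                continue  # skip folder placeholder objects
            local = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], local)
            paths.append(local)
    return paths

documents = ingest_from_s3("my-docs-bucket", "contracts/", "./ingested")
```

The other connection types follow the same pattern: enumerate the documents at the source and stage them locally for extraction.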
Extract data node
One node of this type is required immediately after the ingest node. Use this node to extract data from the source documents into Markdown format for further processing.
Optionally, to filter the documents before extracting the data, you can add an Annotation filter quality node between the Ingest and Extract nodes.
For details, see Extract data node.
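Conceptually, the extract stage converts each source document into Markdown text. Real extraction handles many formats and layouts; the sketch below covers only the PDF case, using the pypdf library, and the per-page heading scheme is an assumption made for illustration.

```python
from pathlib import Path
from pypdf import PdfReader

def pdf_to_markdown(pdf_path: str, md_path: str) -> None:
    """Extract the text of each PDF page and save it as a Markdown file."""
    reader = PdfReader(pdf_path)
    parts = []
    for i, page in enumerate(reader.pages, start=1):
        parts.append(f"## Page {i}")  # hypothetical per-page heading scheme
        parts.append(page.extract_text() or "")
    Path(md_path).parent.mkdir(parents=True, exist_ok=True)
    Path(md_path).write_text("\n\n".join(parts), encoding="utf-8")

pdf_to_markdown("./ingested/contract.pdf", "./extracted/contract.md")
```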
Quality nodes
You can add multiple quality nodes in a sequence after the extract node. These nodes are optional. Use them to ensure that the data meets your quality requirements; two of the steps are sketched after the list.
- Language annotator - Ensure accurate processing by adding annotations on the language that is used in the documents.
- De-duplicator - Remove identical duplicate documents.
- Document quality - Calculate and annotate several document-level metrics that indicate the quality of each document.
- PII and HAP annotator - Identify and annotate personally identifiable information (PII) and HAP to maintain data privacy during model ingestion.
- Redaction - Hide sensitive information by replacing text with a mask character.
- Annotation filter - Filter documents based on added annotations to streamline processing and ensure that relevant content is ingested into the language model.
- Data class assignment - Assign data classes to individual documents.
- Terms and classifications - Assign business terms and classifications to individual documents.
- Classify documents - Categorize documents by using document classes.
For details, see Quality nodes.
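As a rough illustration of what two of these steps do, the sketch below de-duplicates documents by content hash and redacts e-mail addresses with a mask character. The product's annotators are far more capable; the regular expression and mask character here are simplifying assumptions.

```python
import hashlib
import re

# Simplified e-mail pattern used only for this illustration.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def deduplicate(docs: dict[str, str]) -> dict[str, str]:
    """Keep one document per unique content hash (exact duplicates only)."""
    seen, unique = set(), {}
    for name, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[name] = text
    return unique

def redact(text: str, mask: str = "X") -> str:
    """Replace each character of a match with the mask character."""
    return EMAIL.sub(lambda m: mask * len(m.group()), text)

docs = deduplicate({"a.md": "Contact: jo@example.com",
                    "b.md": "Contact: jo@example.com"})
docs = {name: redact(text) for name, text in docs.items()}
```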
Transform data nodes
Use these nodes to process the input data for vector databases. Add the Chunking node first, then the Embeddings node; both steps are sketched after the list.
- Chunking - Divide text into sections, improving context understanding and processing accuracy. This node can be skipped if the processed documents do not exceed the maximum size limit for a record while inserting embeddings into a vector database.
- Embeddings - Generate embeddings to transform text into numerical vectors.
- Branching and Merging - Branch the flow so that documents undergo different processing based on conditions you provide. You can then merge the flow.
- Entity curation - Curate extracted entities into a structured format compatible with the target table schema.
For details, see Transform data nodes.
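The two required transform steps can be sketched in a few lines: split the extracted Markdown into overlapping fixed-size chunks, then embed each chunk as a numerical vector. The chunk size, overlap, and embedding model below are arbitrary example choices, not the product's defaults.

```python
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks that overlap to preserve context."""
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

# Example public embedding model; produces one 384-dimensional vector per chunk.
model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = chunk(open("./extracted/contract.md", encoding="utf-8").read())
vectors = model.encode(chunks)
```

The overlap between adjacent chunks keeps context that would otherwise be lost at a hard chunk boundary, which is why chunking improves processing accuracy.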
Generate output node
Select what output to generate:
- Milvus and Elasticsearch - Select a vector database to store the processing output for the LLM (sketched below).
- Document set - Select this option to store the output of the flow for downstream tools to use, or to reuse it as input to multiple use cases. A document set is created.
- Entity store - Use to store extracted entities in structured entities tables.
This node must always be the last one in the flow. It is required for the flow to run.
For details, see Generate output nodes.
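For the vector database option, the flow output is ultimately a set of rows in a collection or index. Below is a minimal sketch for Milvus, using the pymilvus client; the URI, collection name, and dimension are assumptions, and `chunks` and `vectors` carry over from the transform sketch in the previous section.

```python
from pymilvus import MilvusClient

# Hypothetical local Milvus instance and collection name.
client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="rag_chunks", dimension=384)

# Store each chunk's vector alongside its text for retrieval at query time.
rows = [{"id": i, "vector": vec.tolist(), "text": chunks[i]}
        for i, vec in enumerate(vectors)]
client.insert(collection_name="rag_chunks", data=rows)
```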