Supported connectors for unstructured data curation

You can connect to various data sources and import metadata for unstructured data so that you can analyze and process the documents that those sources contain. Unstructured data curation creates entity relationship tables that can be used in retrieval-augmented generation (RAG) pipelines. You can then display selected assets on a lineage graph to visualize the flow of data and see how the data was transformed.

Requirements and restrictions

Understand the requirements and restrictions for connections to be used in the curation of unstructured data.

Required permissions

Users must be authorized to access the connections to the data sources.

General prerequisites

A connection asset must exist in the project for each connection that is used to run unstructured data import or unstructured data enrichment.

For lineage metadata import, a data source definition is created automatically for the IBM watsonx.data Presto connection when you run unstructured data curation or Unstructured Data Integration flows. If you want to use your own name for the data source definition, create the data source definition manually.

For more information about data source definitions, see Creating a data source definition.

Supported source data formats

In general, unstructured data curation and Unstructured Data Integration flows support the following data formats:

Supported file types
File type Unstructured Data Integration Unstructured data curation
BMP
DOC, DOCX
GIF
HTML
JFIF
JPG
JSON
MD
PDF
PNG
PPT, PPTX
TIFF
TXT
XLSX
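
As a minimal sketch of how the file-type list above might be applied before an import, the following Python snippet splits file names into those with a listed extension and those without. The names `SUPPORTED_EXTENSIONS`, `is_supported_file`, and `split_by_support` are hypothetical and not part of any product API. The check is based on the file extension only, and it assumes that only the extensions spelled exactly as listed count (for example, `.jpeg` and `.tif` are not treated as listed). Support can also differ between Unstructured Data Integration and unstructured data curation, which this sketch does not model.

```python
from pathlib import PurePath

# Extensions copied from the file-type list above (assumption: exact spellings only).
SUPPORTED_EXTENSIONS = frozenset({
    "bmp", "doc", "docx", "gif", "html", "jfif", "jpg", "json",
    "md", "pdf", "png", "ppt", "pptx", "tiff", "txt", "xlsx",
})


def is_supported_file(name: str) -> bool:
    """Return True if the file name ends in one of the listed extensions."""
    # suffix is ".PDF" for "report.PDF" and "" when there is no extension.
    suffix = PurePath(name).suffix
    return suffix[1:].lower() in SUPPORTED_EXTENSIONS


def split_by_support(names):
    """Split file names into (listed, not listed), preserving input order."""
    supported = [n for n in names if is_supported_file(n)]
    skipped = [n for n in names if not is_supported_file(n)]
    return supported, skipped
```

A pre-check like this only filters by name; it does not confirm that the file content matches its extension.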

Supported data sources

Data can be ingested from the following data sources.

Data source Unstructured Data Integration Unstructured data curation
Amazon S3
Box
Google Drive
IBM Cloud Object Storage (no retrieval of access control lists)
IBM FileNet P8
IBM watsonx.data SharePoint
Slack

Data from these data sources can be visualized on the data lineage graph:

Supported output targets

The generated embeddings are written to a vector database that is connected through one of these connectors:

Supported vector databases
Vector database Unstructured Data Integration Unstructured data curation
IBM watsonx.data Milvus

For unstructured lineage metadata, the generated embeddings are written to a vector database that is connected through one of these connectors:

Entity tables and document sets are written to Iceberg tables. The database can be connected through one of these connectors:

Output targets
Databases for entity tables and document sets Unstructured Data Integration Unstructured data curation
Db2 (note 3)
IBM watsonx.data Presto (note 1)
Iceberg metastore (note 2)
Oracle (note 3)
PostgreSQL
Presto (note 3)

Notes:

1 The connection must be configured with the following engine connection properties: hostname or IP address, engine ID, and engine port.

2 Only for flows that run in a Python runtime environment. For the output, you must specify a catalog and a schema that already exist in the Iceberg metastore.

3 Only for flows that run in a Python runtime environment.
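
The restrictions in the notes above can be expressed as a small lookup table and a pre-flight check. The following Python sketch is illustrative only: the class `OutputTarget`, the function `check_output_target`, the property keys (`hostname`, `engine_id`, `engine_port`, `catalog`, `schema`), and the runtime label `"python"` are hypothetical names chosen for this example, not names from the product. The check confirms only that a catalog and a schema are specified; it cannot confirm that they already exist in the Iceberg metastore.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputTarget:
    python_runtime_only: bool = False   # notes 2 and 3
    required_properties: tuple = ()     # notes 1 and 2


# Connectors and restrictions as listed in the output targets table and its notes.
OUTPUT_TARGETS = {
    "Db2": OutputTarget(python_runtime_only=True),
    "IBM watsonx.data Presto": OutputTarget(
        required_properties=("hostname", "engine_id", "engine_port")),
    "Iceberg metastore": OutputTarget(
        python_runtime_only=True, required_properties=("catalog", "schema")),
    "Oracle": OutputTarget(python_runtime_only=True),
    "PostgreSQL": OutputTarget(),
    "Presto": OutputTarget(python_runtime_only=True),
}


def check_output_target(connector, properties, runtime):
    """Return a list of problems; an empty list means the basic checks passed."""
    target = OUTPUT_TARGETS.get(connector)
    if target is None:
        return [f"{connector!r} is not in the list of output targets"]
    problems = []
    if target.python_runtime_only and runtime != "python":
        problems.append(f"{connector} requires a Python runtime environment")
    for key in target.required_properties:
        # Missing keys and empty values are both reported as missing.
        if not properties.get(key):
            problems.append(f"missing connection property: {key}")
    return problems
```

For example, calling `check_output_target("IBM watsonx.data Presto", {"hostname": "presto.example.com"}, "python")` would report the missing engine ID and engine port, where `presto.example.com` is a placeholder host.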