Supported connectors for unstructured data curation

You can connect to various data sources and import metadata about the unstructured data that they hold, so that you can analyze and process the documents that they contain. Unstructured data curation creates entity relationship tables that can be used in retrieval-augmented generation (RAG) pipelines. You can then display selected assets on a lineage graph to visualize the flow of data and see how data was transformed.

Requirements and restrictions

Review the requirements and restrictions for connections that are used in unstructured data curation.

Required permissions

Users must be authorized to access the connections to the data sources.

General prerequisites

A connection asset must exist in the project for each connection that is used to run unstructured data import or unstructured data enrichment.

For lineage metadata import, a data source definition is created automatically for the IBM watsonx.data Presto connection when you run unstructured data curation or Unstructured Data Integration flows. If you want to use your own name for the data source definition, create it manually.

For more information about data source definitions, see Creating a data source definition.

Supported source data formats

In general, unstructured data curation and Unstructured Data Integration flows support the following data formats:

Supported file types
File type	Unstructured Data Integration	Unstructured data curation
BMP	✓	—
DOC, DOCX	✓	✓
GIF	✓	—
HTML	✓	✓
JFIF	✓	—
JPG	✓	—
JSON	✓	—
MD	✓	✓
PDF	✓	✓
PNG	✓	—
PPT, PPTX	✓	✓
TIFF	✓	—
TXT	✓	✓
XLSX	✓	✓
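
The table above can be expressed as a simple lookup. The following Python sketch is illustrative only: the function and variable names are hypothetical, and it assumes that the file type can be read from the file name extension, which might not be how the product detects formats.

```python
from pathlib import Path

# Extensions copied from the "Supported file types" table.
# The set names and the function below are illustrative, not a product API.
CURATION_FORMATS = {"doc", "docx", "html", "md", "pdf", "ppt", "pptx", "txt", "xlsx"}
INTEGRATION_ONLY_FORMATS = {"bmp", "gif", "jfif", "jpg", "json", "png", "tiff"}
INTEGRATION_FORMATS = CURATION_FORMATS | INTEGRATION_ONLY_FORMATS


def supported_by(filename: str) -> dict[str, bool]:
    """Report which feature lists the file's extension as supported."""
    # Assumption: the extension identifies the format (case-insensitive).
    ext = Path(filename).suffix.lower().lstrip(".")
    return {
        "unstructured_data_integration": ext in INTEGRATION_FORMATS,
        "unstructured_data_curation": ext in CURATION_FORMATS,
    }
```

A check like this could be run over a file listing before an import to flag documents, such as images or JSON files, that only Unstructured Data Integration flows accept.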

Supported data sources

Data can be ingested from the following data sources:

Supported data sources
Data source	Unstructured Data Integration	Unstructured data curation
Amazon S3	✓	✓
Box	✓	✓
Google Drive	✓	—
IBM Cloud Object Storage (no retrieval of access control lists)	✓	✓
IBM FileNet P8	✓	✓
IBM watsonx.data SharePoint	✓	✓
Slack	✓	✓

Data from these data sources can be visualized on the data lineage graph:

Supported output targets

The generated embeddings are written to a vector database that is connected through one of these connectors:

Supported vector databases
Vector database	Unstructured Data Integration	Unstructured data curation
IBM watsonx.data Milvus	✓	✓

For unstructured lineage metadata, the generated embeddings are written to a vector database that is connected through one of these connectors:

Entity tables and document sets are written to Iceberg tables. The database can be connected through one of these connectors:

Output targets
Databases for entity tables and document sets	Unstructured Data Integration	Unstructured data curation
Db2 ³	✓	✓
IBM watsonx.data Presto ¹	✓	✓
Iceberg metastore ²	✓	✓
Oracle ³	✓	—
PostgreSQL	✓	✓
Presto ³	✓	✓

Notes:

¹ The connection must be configured with the following engine connection properties: hostname or IP address, engine ID, and engine port.

² Only for flows that run in a Python runtime environment. For the output, you must specify a catalog and a schema that already exist in the Iceberg metastore.

³ Only for flows that run in a Python runtime environment.
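
The output target table and its notes can be read as a set of rules. The following Python sketch encodes those rules for illustration only. The function name, the feature labels, and the property keys (`engine_host`, `engine_id`, `engine_port`, `catalog`, `schema`) are hypothetical and are not the product's actual property names. The sketch also cannot verify that a catalog or schema really exists in the Iceberg metastore.

```python
# Rules copied from the output target table and notes 1 to 3.
# Tuple layout: (supported for curation, needs Python runtime, required property keys).
# All names and property keys are illustrative assumptions.
TARGETS = {
    "Db2": (True, True, ()),
    "IBM watsonx.data Presto": (True, False, ("engine_host", "engine_id", "engine_port")),
    "Iceberg metastore": (True, True, ("catalog", "schema")),
    "Oracle": (False, True, ()),
    "PostgreSQL": (True, False, ()),
    "Presto": (True, True, ()),
}


def check_output_target(target, feature, python_runtime, properties=None):
    """Return a list of problems; an empty list means no conflict with the table."""
    if target not in TARGETS:
        return [f"{target} is not a listed output target"]
    curation_ok, needs_python, required = TARGETS[target]
    problems = []
    # feature is either "integration" or "curation" (hypothetical labels).
    if feature == "curation" and not curation_ok:
        problems.append(f"{target} is not supported for unstructured data curation")
    if needs_python and not python_runtime:
        problems.append(f"{target} requires a flow that runs in a Python runtime environment")
    # Treat absent or empty values as missing.
    missing = [key for key in required if not (properties or {}).get(key)]
    if missing:
        problems.append(f"missing connection properties: {', '.join(missing)}")
    return problems
```

For example, `check_output_target("Oracle", "curation", True)` reports one problem because the table marks Oracle as unsupported for unstructured data curation.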