Supported connectors for unstructured data curation
You can connect to various data sources and import metadata for the unstructured data that they hold, so that the documents they contain can be analyzed and processed. Unstructured data curation creates entity relationship tables that can be used in sophisticated RAG pipelines. You can then display selected assets on a lineage graph to visualize the flow of data and see how the data was transformed.
Requirements and restrictions
Understand the requirements and restrictions for connections to be used in the curation of unstructured data.
Required permissions
Users must be authorized to access the connections to the data sources.
General prerequisites
Every connection that is used for running unstructured data import or unstructured data enrichment must exist as a connection asset in the project.
For lineage metadata import, a data source definition is created automatically for the IBM watsonx.data Presto connection when you run unstructured data curation or Unstructured Data Integration flows. If you want to use your own name for the data source definition, create it manually.
For more information about data source definitions, see Creating a data source definition.
Supported source data formats
In general, unstructured data curation and Unstructured Data Integration flows support the following data formats:
| File type | Unstructured Data Integration | Unstructured data curation |
|---|---|---|
| BMP | ✓ | — |
| DOC, DOCX | ✓ | ✓ |
| GIF | ✓ | — |
| HTML | ✓ | ✓ |
| JFIF | ✓ | — |
| JPG | ✓ | — |
| JSON | ✓ | — |
| MD | ✓ | ✓ |
| PDF | ✓ | ✓ |
| PNG | ✓ | — |
| PPT, PPTX | ✓ | ✓ |
| TIFF | ✓ | — |
| TXT | ✓ | ✓ |
| XLSX | ✓ | ✓ |
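The format table can be applied as a pre-check before files are submitted to a flow. The following is a minimal sketch, not part of the product: the names `SUPPORTED`, `is_supported`, and `partition` and the flow keys `"integration"` and `"curation"` are hypothetical, and the extension sets are transcribed by hand from the table. Treating PDF as supported by both flows is an assumption; verify it against the table for your release.

```python
from pathlib import PurePath

# Extensions supported by both flows (hand-transcribed from the table).
# "pdf" is an assumption; confirm it for your release.
CURATION = {"doc", "docx", "html", "md", "pdf", "ppt", "pptx", "txt", "xlsx"}

# Extensions listed only for Unstructured Data Integration flows.
INTEGRATION_ONLY = {"bmp", "gif", "jfif", "jpg", "json", "png", "tiff"}

# Hypothetical flow keys used only in this sketch.
SUPPORTED = {
    "integration": CURATION | INTEGRATION_ONLY,
    "curation": CURATION,
}


def is_supported(filename: str, flow: str) -> bool:
    """Return True if the file extension is listed for the given flow."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    return ext in SUPPORTED[flow]


def partition(filenames, flow):
    """Split filenames into (accepted, skipped) for the given flow."""
    accepted = [f for f in filenames if is_supported(f, flow)]
    skipped = [f for f in filenames if not is_supported(f, flow)]
    return accepted, skipped
```

For example, `partition(["report.docx", "scan.png"], "curation")` puts `report.docx` in the accepted list and `scan.png` in the skipped list, because PNG is listed only for Unstructured Data Integration.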
Supported data sources
Data from these data sources can be ingested.
| Data source | Unstructured Data Integration | Unstructured data curation |
|---|---|---|
| Amazon S3 | ✓ | ✓ |
| Box | ✓ | ✓ |
| Google Drive | ✓ | — |
| IBM Cloud Object Storage (no retrieval of access control lists) | ✓ | ✓ |
| IBM FileNet P8 | ✓ | ✓ |
| IBM watsonx.data SharePoint | ✓ | ✓ |
| Slack | ✓ | ✓ |
Data from these data sources can be visualized on the data lineage graph:
Supported output targets
The generated embeddings are written to a vector database that is connected through one of these connectors:
| Vector database | Unstructured Data Integration | Unstructured data curation |
|---|---|---|
| IBM watsonx.data Milvus | ✓ | ✓ |
For unstructured lineage metadata, the generated embeddings are written to a vector database that is connected through one of these connectors:
Entity tables and document sets are written to Iceberg tables. The database can be connected through one of these connectors:
| Databases for entity tables and document sets | Unstructured Data Integration | Unstructured data curation |
|---|---|---|
| Db2 ³ | ✓ | ✓ |
| IBM watsonx.data Presto ¹ | ✓ | ✓ |
| Iceberg metastore ² | ✓ | ✓ |
| Oracle ³ | ✓ | — |
| PostgreSQL | ✓ | ✓ |
| Presto ³ | ✓ | ✓ |
Notes:
¹ The connection must be configured with the following engine connection properties: hostname or IP address, engine ID, and engine port.
² Only for flows that run in a Python runtime environment. You must specify a catalog and a schema for the output that already exist in the Iceberg metastore.
³ Only for flows that run in a Python runtime environment.
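The restrictions in the table and notes for entity table and document set targets can be checked before a flow is configured. The sketch below is illustrative only and makes several assumptions: the `OutputTarget` class, the `validate` function, the flow keys, and the property keys `host`, `engine_id`, `engine_port`, `catalog`, and `schema` are hypothetical names, not product API. It checks only that a catalog and schema are specified, not that they already exist in the Iceberg metastore.

```python
from dataclasses import dataclass, field

# Connector names as listed in the table above.
KNOWN = {"Db2", "IBM watsonx.data Presto", "Iceberg metastore",
         "Oracle", "PostgreSQL", "Presto"}

# Notes 2 and 3: only for flows that run in a Python runtime environment.
PYTHON_ONLY = {"Db2", "Iceberg metastore", "Oracle", "Presto"}

# Not marked as supported for unstructured data curation.
NOT_IN_CURATION = {"Oracle"}

# Notes 1 and 2: values that must be specified (hypothetical key names).
REQUIRED_PROPERTIES = {
    "IBM watsonx.data Presto": ("host", "engine_id", "engine_port"),
    "Iceberg metastore": ("catalog", "schema"),
}


@dataclass
class OutputTarget:
    connector: str
    flow: str  # "integration" or "curation" (hypothetical keys)
    python_runtime: bool = False
    properties: dict = field(default_factory=dict)


def validate(target: OutputTarget) -> list:
    """Return a list of problems; an empty list means no issue was found."""
    if target.connector not in KNOWN:
        return [f"{target.connector} is not a listed output target"]
    problems = []
    if target.flow == "curation" and target.connector in NOT_IN_CURATION:
        problems.append(
            f"{target.connector} is not supported for unstructured data curation")
    if target.connector in PYTHON_ONLY and not target.python_runtime:
        problems.append(
            f"{target.connector} requires a Python runtime environment")
    for key in REQUIRED_PROPERTIES.get(target.connector, ()):
        if not target.properties.get(key):
            problems.append(f"missing property: {key}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every violated restriction at once.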