Importing unstructured data

Capture and import metadata about unstructured data in your organization.

Import unstructured data to start the curation process of your unstructured assets. Base document sets are created in the project as a result of the import.

Requirements

Before you create and run an unstructured data import, understand the requirements and restrictions.

Required permissions

To import unstructured data, you must have the Admin or the Editor role in the project where you import data.

To import unstructured lineage metadata, you must have the Lineage administrator role.

Required connections

When you connect to an unstructured data source, the following connections must be created in the project where you import data:

  • A connection to one of the supported data sources from which you want to import unstructured data
  • A connection to IBM watsonx.data Presto, which provides access to the Iceberg table where the imported metadata is stored

You can use the same IBM watsonx.data Presto connection for many unstructured data imports, but this connection must be added to each project where the imports are run.

Required project settings

Define the storage for base document sets. In the project settings, go to Unstructured Data Integration > Document set storage, then select a Presto connection and the schema where you want to store the imported content.
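The permission, connection, and storage requirements above can be summarized as a simple pre-import checklist. The following Python sketch is illustrative only: `ProjectState`, `missing_import_requirements`, and the `"example-source"` connection type are hypothetical names made up for this example, not part of any IBM API. It models a project's state by hand and reports which of the documented requirements are not yet met.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Connection type name taken from the requirements above.
PRESTO = "IBM watsonx.data Presto"


@dataclass
class ProjectState:
    """Hypothetical, hand-filled snapshot of a project (not an IBM API)."""
    user_roles: Set[str] = field(default_factory=set)
    connection_types: Set[str] = field(default_factory=set)
    # (Presto connection name, schema) chosen under Document set storage
    document_set_storage: Optional[Tuple[str, str]] = None


def missing_import_requirements(project: ProjectState, source_type: str) -> List[str]:
    """Return the documented requirements that the project does not meet yet."""
    problems = []
    # The user needs the Admin or the Editor role in the project.
    if not project.user_roles & {"Admin", "Editor"}:
        problems.append("Admin or Editor role in the project")
    # A connection to the unstructured data source must exist in the project.
    if source_type not in project.connection_types:
        problems.append(f"connection to the data source ({source_type})")
    # A Presto connection must be added to each project where imports run.
    if PRESTO not in project.connection_types:
        problems.append(f"connection to {PRESTO}")
    # Document set storage must name a Presto connection and a schema.
    if project.document_set_storage is None:
        problems.append("document set storage (Presto connection and schema)")
    return problems
```

For example, calling `missing_import_requirements(ProjectState({"Editor"}, {"example-source", PRESTO}, ("my_presto", "my_schema")), "example-source")` returns an empty list, which means that all modeled requirements are met. The sketch does not cover the separate Lineage administrator role that is needed for lineage metadata import.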

Required configuration for lineage import

For lineage metadata import, data source definitions are required. If you have the required permissions and roles, data source definitions are created automatically when you run an unstructured data import or an unstructured data enrichment.

The following roles and permissions are required:

If you want to use your own names for data source definitions, create them manually for the following connections:

For more information about data source definitions, see Creating a data source definition.

Supported data sources

For information about supported data sources for import of unstructured data, see Supported connectors for curation of unstructured data.

Overview

The process of curating unstructured data starts with importing metadata from the unstructured data sources. The metadata is saved as base document sets in an Iceberg table. The documents that are part of a base document set are then transformed and enriched.

When you create an unstructured data import, you can schedule the job to run regularly. You might want to coordinate the scheduled unstructured data import jobs and the corresponding unstructured data enrichment jobs for the same assets.

You can run an unstructured data import independently, or it can run automatically as part of an unstructured data enrichment.
