Importing unstructured data
Capture and import metadata about unstructured data in your organization.
Import unstructured data to start the curation process of your unstructured assets. Base document sets are created in the project as a result of the import.
Requirements
Before you create and run an unstructured data import, review the following requirements and restrictions.
Required permissions
To import unstructured data, you must have the Admin or the Editor role in the project where you import data.
To import unstructured lineage metadata, you must have the Lineage administrator role.
Required connections
To import from an unstructured data source, the following connections must exist in the project where you import data:
- A connection to one of the supported data sources from which you want to import unstructured data.
- A connection to IBM watsonx.data Presto, which is used to access the Iceberg table where the imported metadata is stored.
You can use the same IBM watsonx.data Presto connection for many unstructured data imports, but you must add this connection to each project where the imports are run.
Required project settings
Define the storage for base document sets. In the project settings, go to Unstructured Data Integration > Document set storage, and then select a Presto connection and the schema where you want to store the imported content.
Required configuration for lineage import
For lineage metadata import, data source definitions are required. If you have the required permissions and roles, data source definitions are created automatically when you run unstructured data import or unstructured data enrichment.
The following roles and permissions are required:
- Create data source definitions or Manage data source definitions permission. These permissions are not assigned to any predefined role, so you must create a custom role that includes them. For more information, see Roles and asset privacy settings for data source definitions.
If you want to use your own names for data source definitions, create them manually for the following connections:
- The data source from which you import unstructured data
- The IBM watsonx.data Presto connection
For more information about data source definitions, see Creating a data source definition.
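The requirements in the preceding sections can be summarized as a preflight checklist. The following Python sketch is illustrative only: ProjectState, its fields, and missing_requirements are hypothetical names that model the state of a project, not part of any product API. It assumes that you gather the values yourself, for example by checking the project settings in the UI.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectState:
    """Hypothetical snapshot of a project; not a product API."""
    user_project_role: str                  # for example "Admin", "Editor", "Viewer"
    has_source_connection: bool             # connection to a supported data source
    has_presto_connection: bool             # IBM watsonx.data Presto connection
    document_set_storage_configured: bool   # Presto connection and schema selected
    import_lineage: bool = False            # set to True for lineage metadata import
    user_has_lineage_admin_role: bool = False
    data_source_definitions_exist: bool = False
    can_create_data_source_definitions: bool = False


def missing_requirements(state: ProjectState) -> List[str]:
    """Return the requirements from this topic that the project does not meet."""
    problems = []
    if state.user_project_role not in {"Admin", "Editor"}:
        problems.append("Admin or Editor role in the project")
    if not state.has_source_connection:
        problems.append("connection to a supported unstructured data source")
    if not state.has_presto_connection:
        problems.append("IBM watsonx.data Presto connection")
    if not state.document_set_storage_configured:
        problems.append("document set storage (Presto connection and schema)")
    if state.import_lineage:
        if not state.user_has_lineage_admin_role:
            problems.append("Lineage administrator role")
        # Definitions must already exist, or the user must be able to
        # have them created automatically when the import runs.
        if not (state.data_source_definitions_exist
                or state.can_create_data_source_definitions):
            problems.append("data source definitions or permission to create them")
    return problems


if __name__ == "__main__":
    state = ProjectState(
        user_project_role="Editor",
        has_source_connection=True,
        has_presto_connection=True,
        document_set_storage_configured=False,
    )
    for item in missing_requirements(state):
        print("Missing:", item)
```

An empty list means that every requirement modeled in the sketch is met. The lineage checks apply only when import_lineage is set, which mirrors the fact that the Lineage administrator role and data source definitions are needed only for lineage metadata import.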
Supported data sources
For information about supported data sources for import of unstructured data, see Supported connectors for curation of unstructured data.
Overview
The process of curating unstructured data starts with importing metadata from the unstructured data sources. The metadata is saved as base document sets in an Iceberg table. The documents that are part of the base document set are transformed and enriched.
When you create an unstructured data import, you can schedule the job to run regularly. Consider coordinating the schedule of the import job with the schedules of the corresponding unstructured data enrichment jobs for the same assets.
You can run an unstructured data import independently, or it can run automatically as part of unstructured data enrichment.