Working with unstructured data

With unstructured data tools, you can import metadata for unstructured documents, transform those documents, generate entities and embeddings, and create document sets and document libraries that you can then use in your gen AI projects.

Unstructured data curation and Unstructured Data Integration flows are available in watsonx.data and the Data Fabric experience.

Prerequisites

The following services must be provisioned to work with unstructured data:

  • watsonx.ai Runtime

  • watsonx.ai Studio

  • watsonx.data

  • watsonx.data intelligence

  • watsonx.data integration

You must have a project set up in the watsonx.data or Data Fabric experience to work with unstructured data assets:

  • Associate an instance of the watsonx.ai Runtime service with the project: in the project settings, go to Services & integrations and associate the service.
  • Configure Unstructured Data Integration project settings.
  • Add connections to the data sources from which you want to import data and to the databases where you want to store the generated embeddings, entities, and document sets (an illustrative sketch of such connection details follows this list). For more information, see Supported connectors for unstructured data curation.
  • Set up task credentials.
  • Make sure you and other users who plan to work with unstructured data tools and assets have the required access:
    • To create, edit, or delete any type of project asset, and to run unstructured data curation or Unstructured Data Integration flows, users need the Admin or Editor role in the project.
    • To add, edit, or delete document classes from within an unstructured data curation asset, users also need the Manage document classes permission.
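
For illustration only, the following Python snippet sketches the kind of details a source connection and a target connection typically capture. Every field name here is hypothetical and does not correspond to an actual watsonx API; define real connections through the connection forms in the product.

    # Purely illustrative connection definitions; all field names are
    # hypothetical and do not correspond to an actual watsonx API.
    source_connection = {
        "name": "document-source",
        "type": "object-storage",      # where the unstructured documents live
        "bucket": "contracts-raw",
        "credentials": {"api_key": "<redacted>"},
    }

    target_connection = {
        "name": "embedding-store",
        "type": "vector-database",     # where generated embeddings and entities are stored
        "host": "db.example.com",
        "collection": "contract-embeddings",
        "credentials": {"api_key": "<redacted>"},
    }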

Unstructured data tools

You can use the following tools to work with unstructured data:

  • Unstructured data curation: Use this tool to build and run analysis and processing flows for unstructured data, even if you have little experience in designing ETL flows for RAG and analytical queries. Import document metadata, analyze the documents to identify key facets for grouping, and optionally process the grouped documents further, either to prepare document sets for RAG or to extract structured information for complex queries (for a conceptual illustration of facet analysis, see the sketch after this list).

    Run unstructured data curation to find out what kind of documents are in a data source and identify the documents that are appropriate for your use case. After the initial analysis, select the sets of documents that you want to process further.

  • Unstructured Data Integration: Use this tool to flexibly build unstructured data transformation flows that suit your needs. Use data from various sources, and decide which steps to include and how to configure them: importing metadata, improving data quality, extracting entities, enriching data, or generating vector embeddings.

  • Document library: Create collections of document sets that you can then reuse in your AI projects.
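
To make the idea of facet analysis concrete, the following self-contained Python sketch groups local files by two simple metadata facets, file extension and size bucket. It is a conceptual analogy only: it does not use the watsonx tools or their APIs, and the facets a real curation flow identifies are richer.

    from collections import Counter
    from pathlib import Path

    def facet_counts(root: str) -> dict[str, Counter]:
        """Group the files under `root` by simple metadata facets."""
        facets = {"extension": Counter(), "size_bucket": Counter()}
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            facets["extension"][path.suffix.lower() or "(none)"] += 1
            size_kb = path.stat().st_size / 1024
            if size_kb < 100:
                bucket = "small (<100 KB)"
            elif size_kb < 1024:
                bucket = "medium (<1 MB)"
            else:
                bucket = "large (>=1 MB)"
            facets["size_bucket"][bucket] += 1
        return facets

    # Example: inspect a local folder before deciding which documents to process further.
    for facet, counts in facet_counts("./documents").items():
        print(facet, dict(counts))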

Unstructured asset types

  • Document set

    A document set contains structured information about a set of documents, including their purpose, contents, and usage. This asset type is created during Unstructured data curation or as the output of an Unstructured Data Integration flow. A document set includes information on the lifecycle of the documents: their source, how they were transformed, and what derivatives were produced (such as extracted entities or vector embeddings).

    Document sets can be published to catalogs or grouped into document libraries that you can then reuse in your AI projects.

  • Unstructured data curation asset

    An unstructured data curation asset represents the configuration of Unstructured Data Integration flows for analyzing and processing unstructured data. It also provides access to the available document classes. For more information, see Creating unstructured data curation flows.

  • Unstructured Data Integration flow

    An Unstructured Data Integration flow represents a pipeline of configurable steps that define which data is processed, which operators transform the data, and what output is generated as a result. When a flow is ready, you can configure a job to schedule the runs. For more information, see Creating data preparation flows. A conceptual sketch of such a pipeline follows.
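
Conceptually, a flow is a sequence of steps that each transform the documents, with the final output recorded as a document set together with its lineage. The following Python sketch mimics that shape with plain functions; the step names, the stand-in embedding logic, and the resulting record are hypothetical and only illustrate the pipeline idea, not the product's implementation.

    from datetime import datetime, timezone

    def import_metadata(docs):
        # Step 1: attach basic metadata to each raw document (hypothetical shape).
        return [{"id": i, "text": text, "source": "demo"} for i, text in enumerate(docs)]

    def chunk(docs, size=200):
        # Step 2: split each document's text into fixed-size character chunks.
        return [{**d, "chunks": [d["text"][i:i + size]
                                 for i in range(0, len(d["text"]), size)]}
                for d in docs]

    def embed(docs):
        # Step 3: stand-in "embedding"; a real flow would call an embedding model here.
        for d in docs:
            d["embeddings"] = [[float(len(c))] for c in d["chunks"]]
        return docs

    def run_flow(raw_docs, steps):
        # Apply the configured steps in order, like a flow's pipeline of operators,
        # and record the result as a document-set-like record with lineage information.
        for step in steps:
            raw_docs = step(raw_docs)
        return {
            "documents": raw_docs,
            "created": datetime.now(timezone.utc).isoformat(),
            "lineage": [step.__name__ for step in steps],
        }

    document_set = run_flow(["First sample document.", "Second sample document."],
                            [import_metadata, chunk, embed])
    print(document_set["lineage"])   # ['import_metadata', 'chunk', 'embed']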