Creating unstructured data curation flows

Import unstructured documents, analyze them to identify key facets for grouping, and then process the grouped documents further, for example, to prepare them for retrieval-augmented generation (RAG) or to extract structured information for complex queries.

Create an unstructured data curation asset to analyze your documents: detect languages, document sizes, and formats, and determine whether your documents match the data format of selected document classes. You can then convert your documents into a simpler textual format that can be used in a RAG solution, and extract text and other structured content from your documents based on the classification result.

Parts of an unstructured data curation asset

An unstructured data curation asset consists of several sections that you can access from the navigation menu by clicking one of the icons:

  • Click the Documents icon for the Documents page.
  • Click the Document classes icon for the Document classes page.
  • Click the Settings icon for the Settings page.

Documents

This page is your main entry point to an unstructured data curation asset. Here you can see the documents that you added, and you can start any analysis and processing flows.

If you just added documents when you set up the asset, the page shows a plain list of documents. Because these documents are not yet analyzed, they are not grouped in any way, but you can sort them by type or by path.

If the documents were already analyzed, the page initially shows them grouped by document class. The order of the groups is determined by the number of documents in each group. You can also group the documents by language or by type.
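The grouping behavior on the Documents page can be pictured with a small Python sketch. This is an illustration only and not the product's implementation: the record fields (`path`, `type`, `language`, `document_class`), the `group_documents` helper, and the sample data are all hypothetical, and ordering the groups largest first is an assumption.

```python
from collections import defaultdict


def group_documents(documents, facet="document_class"):
    """Group document paths by a facet and order the groups by size."""
    groups = defaultdict(list)
    for doc in documents:
        # Documents without a value for the facet land in a catch-all group.
        groups[doc.get(facet) or "unclassified"].append(doc["path"])
    # Largest group first (an assumption); ties are broken alphabetically.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical analysis results, for illustration only.
documents = [
    {"path": "docs/inv-001.pdf", "type": "pdf", "language": "en", "document_class": "invoice"},
    {"path": "docs/inv-002.pdf", "type": "pdf", "language": "de", "document_class": "invoice"},
    {"path": "docs/rcpt-001.png", "type": "png", "language": "en", "document_class": "receipt"},
]

by_class = group_documents(documents)
by_language = group_documents(documents, facet="language")
```

Passing a different `facet` value, such as `"language"` or `"type"`, corresponds to switching the grouping option on the page.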

Document classes

View the available document classes so that you can select the ones that meet your requirements for this unstructured data curation asset.

You can download each document class in JSON format to modify the document class or to create a new document class based on that JSON structure. To import an updated or new document class, you must have the Manage document classes permission. With this permission, you can also rename or delete document classes.

For more information, see Document classes.

Settings

On this page, you can work with these settings:

General
Change general settings such as the name or the description, or delete the unstructured data curation asset.
Document scope
Update the list of documents that you want to send for analysis. Depending on your analysis configuration, the number or size of the documents that are sent for analysis is restricted.
Analysis configuration
Configure default settings for the analysis flows that you run in this unstructured data curation asset, such as limiting the input to the analysis flow or selecting the document classes that you want to work with.
Processing configurations
Configure default settings for the processing flows that you run in this unstructured data curation asset, such as selecting whether you want to create embeddings and extract structured information. You can override these default settings each time that you run a processing flow.

For more information about the individual settings, see Designing unstructured data curation flows.
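The relationship between default processing settings and per-run overrides can be sketched as a simple merge. The setting names (`create_embeddings`, `extract_structured_information`) and the `effective_config` helper are hypothetical and chosen only to mirror the options that are described above. They are not actual product configuration keys.

```python
def effective_config(defaults, overrides=None):
    """Return the settings for one run: defaults with per-run overrides applied."""
    merged = dict(defaults)          # copy so that the stored defaults stay unchanged
    merged.update(overrides or {})   # values for this run win over the defaults
    return merged


# Hypothetical default processing configuration of the asset.
DEFAULT_PROCESSING = {
    "create_embeddings": True,
    "extract_structured_information": False,
}

# Override one setting for a single processing run.
run_config = effective_config(
    DEFAULT_PROCESSING, {"extract_structured_information": True}
)
```

In this sketch, the override applies to one run only. The defaults change only if you save a new default configuration.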

How to set up and run unstructured data curation

Create and configure an unstructured data curation asset to automatically generate Unstructured Data Integration flows for analysis and processing.

  1. In a project, select New asset > Analyze and enrich unstructured data.

  2. Specify a name and optionally a description and save that information. When you create your first unstructured data curation asset, an overview is shown that you can click through or skip.

  3. Add the documents that you want to analyze to your asset. Select documents from a connection to a data source or from project assets, or upload files from your local file system. You can directly generate and run an analysis flow with the default settings, or you can change the analysis configuration first.

    The generated Unstructured Data Integration flow consists of the following nodes:

    • An Ingest data node to load the document metadata
    • A Classification node to determine the format, size, and document class of each document
    • An Extract data node to extract content from the source documents and transform it to markdown for further processing
    • A Language annotator node to identify the language of each document
    • A Design flow output node to hold the analysis results

    Based on a sample of the selected documents, analysis detects document types such as invoices or receipts, languages, and formats, and provides you with an overview of the data that you work with. The analysis results help you decide which documents you want to process further.

  4. Review the results of an analysis flow after the associated Unstructured Data Integration job completes.

    Initially, the documents are grouped by document class, but you can also filter your documents by type or language. If you apply a fine-grained filter, you can also save different views of the analysis result for later use.

    For any type of grouping, you can switch between the table view and the charts view.

    You can rerun analysis at any time, for example, after you change the document scope or the analysis configuration. For a rerun, the existing Unstructured Data Integration flow is deleted and a new one is created.

  5. Select which groups of documents you want to send for further processing. The type of grouping determines which processing steps are available. For example, extraction of structured text is available only if you group by document class.

    You can send all groups for processing or selected ones. Document sets are created only for the groups that you send for processing.

  6. After you initiate the processing step, you are asked to review the processing configuration. If you did not set up a default configuration before, you must complete the configuration information before you can proceed. You can also change the processing configuration for this run only or save your changes as the new default configuration.

  7. To immediately generate and run the Unstructured Data Integration flow for processing, click Process documents. Alternatively, to generate the flow without running it, click View flow. The flow opens in a separate window for further review and adjustments. You can then manually run that flow from the Unstructured Data Integration UI.

    Tip: If your web browser blocks pop-up windows, the flow window might not open. Consider disabling the pop-up blocker for this site.

    If you send more than one group of documents for processing, the flow contains a branch for each group that is processed in parallel. Each branch produces one document set. The document set storage is defined in the Unstructured Data Integration project settings.

    The generated Unstructured Data Integration flow can consist of different combinations of nodes depending on how many and what type of document groups you send for processing. All flows contain these nodes:

    • An Ingest data node
    • One or more Extract data nodes to extract text and semantic key-value pairs as applicable
    • One or more Generate output: Document set nodes

    Depending on the grouping and the selected processing options, the flows can additionally contain combinations of these nodes (alphabetically ordered):

    • An Annotation filter node that contains the document class ID for a single set of documents with one document class
    • A Branch node to create branches if more than one group of documents is processed
    • Chunking nodes to split the documents into smaller parts
    • A Classification node to determine the format, size, and document class of each document
    • Data class assignment nodes to assign data classes to each document
    • Embeddings nodes to create numerical representations of units of information
    • Entity curation nodes to convert extracted entities into the format that is defined in the document class as target table
    • Entity store nodes to store the converted entities in an entity table
    • Generate output nodes to store the generated embeddings in a vector database
    • Language annotator nodes to identify the language of each document
    • Terms and classifications nodes to assign business terms and classifications to each document

    Important: Whenever you click one of the buttons for viewing the flow or for processing documents, the existing Unstructured Data Integration flow is deleted and a new one is generated. If you modified the Unstructured Data Integration flow and want to keep that flow, do not rerun document processing from the unstructured data curation asset.

    After the processing flow completes, you can review and test the generated document set.
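The way that a processing flow is assembled in step 7 (one branch per document group, with optional nodes added based on the processing options) can be sketched as follows. This is a simplified illustration and not the actual flow generator: the `plan_processing_flow` function and its `create_embeddings` parameter are hypothetical, only a few of the optional nodes are modeled, and the order of the nodes within a branch is an assumption.

```python
def plan_processing_flow(groups, create_embeddings=False):
    """Return the node names of a processing flow for the selected groups."""
    shared = ["Ingest data"]
    if len(groups) > 1:
        # A Branch node is needed only when several groups are processed.
        shared.append("Branch")

    branches = {}
    for group in groups:
        nodes = ["Extract data"]
        if create_embeddings:
            # Optional nodes; their position in the branch is assumed here.
            nodes += ["Chunking", "Embeddings"]
        # Each branch produces exactly one document set.
        nodes.append("Generate output: Document set")
        branches[group] = nodes

    return {"shared": shared, "branches": branches}


# Hypothetical example: two document groups, embeddings enabled.
plan = plan_processing_flow(["invoice", "receipt"], create_embeddings=True)
```

With a single group, the sketch produces no Branch node and exactly one document set, which matches the description of the mandatory nodes above.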

You can access job metrics for your analysis and processing flows from the side panel for detailed information about the processing in each node of the flow.

If you delete an unstructured data curation asset, the associated analysis job is also deleted. Any Unstructured Data Integration processing flows and resulting jobs remain in the project.
