Data extract nodes

Use the Data extract nodes to extract content from the source documents and to retrieve Access Control Lists for the ingested documents.

Extract

This node extracts content from the source documents and converts it to Markdown for further processing.

This node usually follows the Ingest data node. The output from the Ingest data node is the input for this node. You can also use the Annotation Filter right before this node if you need to filter documents before extraction.

You can extract both unstructured and structured data from the documents. Select how you want to extract entities. You can extract generic key-value pairs, or refine the scope by providing predefined or custom schemas to extract only those entities that meet your requirements.

Extract entity

Select this option if you want to extract entities. It is disabled by default.

Entity extraction is the process of extracting structured content from your unstructured documents for processing downstream. With predefined document classes or with custom document classes that you upload, you can specify what values are written to a structured target table.

Requirements:

OCR mode must be either enabled or forced.
You must have a document class with defined criteria, namely the document itself, the fields that you want to extract, and the structure of the target table. See Document classes.
Entity extraction requires 2 GPUs.
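
The requirements above can be read as a small pre-flight check. The following is a minimal sketch and not part of the product: the ExtractSettings class, its field names, and the mode strings "disabled", "enabled", and "forced" are hypothetical stand-ins for the values that you choose in the node configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExtractSettings:
    """Hypothetical mirror of the Extract node options (all names are assumptions)."""
    extract_entity: bool = False      # disabled by default
    ocr_mode: str = "disabled"        # assumed values: "disabled", "enabled", "forced"
    has_document_class: bool = False  # a document class with fields and a target table
    available_gpus: int = 0


def check_entity_requirements(settings: ExtractSettings) -> List[str]:
    """Return the unmet requirements; an empty list means the settings are consistent."""
    problems: List[str] = []
    if not settings.extract_entity:
        return problems  # nothing to check when entity extraction is off
    if settings.ocr_mode not in ("enabled", "forced"):
        problems.append("OCR mode must be enabled or forced")
    if not settings.has_document_class:
        problems.append("a document class is required")
    if settings.available_gpus < 2:
        problems.append("2 GPUs are required")
    return problems
```

Running such a check before you start a flow surfaces a missing requirement early instead of during extraction.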

OCR mode

Optical character recognition (OCR) extracts text from images, scanned documents, and tables. If the source documents contain images, choose the OCR mode. OCR mode is disabled by default if Extract entity is also disabled.

For text extraction to watsonx.data Milvus, both of these settings can remain disabled. If you are processing entities, both settings must be enabled.

Generic key-value pair

Key-value pair (KVP) extraction is the process of identifying and extracting structured information from unstructured or semi-structured documents such as invoices, forms, contracts, or receipts. A key corresponds to a piece of identifying information (for example, Invoice Number) and the value is the associated data (for example, 123456).

Generic KVP extraction goes beyond templates or predefined layouts. It aims to extract pairs from a wide variety of document types and formats, regardless of structure or layout. It is ideal for a broad sweep of all labeled information, especially when you don’t know in advance which fields might appear.

When this option is enabled, all the key-value pairs found in the document are retrieved by default. If you want to restrict which key-value pairs to extract, use the following settings.
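
To picture the difference between a generic sweep and a restricted scope before you configure those settings, the following sketch treats an extraction result as a plain dictionary. It is an illustration only: the restrict_pairs helper and the sample data (apart from the Invoice Number example) are hypothetical, and the actual output format of the node is not shown here.

```python
from typing import Dict, Iterable


def restrict_pairs(pairs: Dict[str, str], wanted_keys: Iterable[str]) -> Dict[str, str]:
    """Keep only the pairs whose key is in wanted_keys (case-insensitive match)."""
    wanted = {key.strip().lower() for key in wanted_keys}
    return {k: v for k, v in pairs.items() if k.strip().lower() in wanted}


# Hypothetical result of a generic sweep: every labeled field found in a document.
all_pairs = {
    "Invoice Number": "123456",
    "Invoice Date": "2024-01-31",
    "Total": "99.50",
}

# Restricting the scope keeps only the fields of interest.
print(restrict_pairs(all_pairs, ["invoice number", "total"]))
```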

Schema definition

Choose document classes to serve as custom schemas for entity extraction. You can extract based on all available predefined document classes, or select only some of them.

Use custom schema for key-value pairs

Select this option if you want to provide a custom JSON schema to extract specific fields from structured documents.

To build a custom schema for a document, you must define metadata and write effective descriptions for each field you want to extract before validating and scaling the schema for accurate key-value pair extraction.

Prepare the custom schema as described in the procedure steps in Creating custom schemas for key-value pair extraction.

When your schema is ready, paste it into the Extract node configuration box under Custom schema.

Note: There is no validation for the custom schema that you provide in the box. Test the flow on small loads to ensure the schema is valid.
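
Because the box does not validate its content, it can help to confirm that the text is at least well-formed JSON before you paste it. The following sketch uses only the Python standard library. The check_schema_text helper and the sample strings are hypothetical, the requirement for a top-level JSON object is an assumption, and the check covers JSON syntax only, not whether the fields match what the Extract node expects.

```python
import json
from typing import Optional


def check_schema_text(text: str) -> Optional[str]:
    """Return None if text is a well-formed JSON object, else a short error message."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"
    # Assumption: a schema is expected to be a JSON object, not a list or a scalar.
    if not isinstance(parsed, dict):
        return "expected a JSON object at the top level"
    return None


# Hypothetical placeholder content; substitute your own schema text.
print(check_schema_text('{"fields": []}'))  # well-formed input
print(check_schema_text('{"fields": [}'))   # malformed input
```

A syntax check of this kind does not replace testing the flow on small loads, which remains the way to confirm that the schema extracts the fields you expect.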

Access control

This node extracts access control list (ACL) information from the source and stores it in the CPG connection of your choice.

Use this node instead of the project-level settings for access control list retrieval. When you use this node, you don't need to enable ACL at the project level.

For more information on ACL, see Retrieving Access Control List for ingested documents.

Next node in the flow

The Extract data node can be followed by any of the Quality nodes. If you choose not to use Quality nodes, you can connect this node directly to any of the Transform data nodes.

Learn more

Creating a data preparation flow