Creating a data extraction model

After you create your classification model, each of your document types has a shared set of data points, or field values, that you want to extract from the documents. You can train an extraction model to extract the correct data from each of your document types.

About this task

Each document type has fields with corresponding values that represent the data that you want to extract. Field values can be formatted in multiple ways.
Important: After you upgrade from IBM Cloud Pak® for Business Automation version 21.0.3, retrain the extraction model for your existing projects, and then redeploy the projects to the runtime environment.
Bar code
When you include a bar code field in your extraction model, the model captures the bar code from the document. At run time, the project interprets the bar code and stores the text or digits that the bar code represents.
The following types of bar codes are supported:
  • Bookland EAN, also known as ISBN (International Standard Book Number)
  • Codabar
  • Code 128, also known as GS1-128 and EAN 128, based on ISO/IEC 15417
  • Code 25, also known as Interleaved 2 of 5, based on ITF (ISO/IEC 16390)
  • Code 39, based on ISO/IEC 16388
  • Code 93
  • Data Matrix, based on ISO/IEC 16022
  • EAN-8 / EAN-13, based on ISO/IEC 15420
  • EAN-14
  • PDF417, based on ISO/IEC 15438
  • QR Code (Quick Response), based on ISO/IEC 18004
  • SSCC18 (Serial Shipping Container Code) / EAN-18
  • UCC / EAN-128, based on Code 128
  • Universal Product Code (UPC-A and UPC-E), based on ISO/IEC 15420
Note: The runtime application might have issues reading a bar code if the image is of poor quality. For example, if the image is poorly scanned, low resolution, copied several times, or has other visible issues, the bar code might not be read.
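Several of the listed symbologies, such as EAN-13, end with a check digit, which is one reason that a degraded image tends to produce an unread bar code instead of a wrong value. The following Python snippet is a minimal sketch of the standard EAN-13 check digit calculation. It is an illustration only: this topic does not describe how the runtime application validates a decoded bar code, and the function names are hypothetical.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit for the first 12 digits of a code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """Return True if code has 13 digits and the last digit matches the checksum."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


# A correctly read code passes; the same code with one misread digit does not.
print(is_valid_ean13("4006381333931"))
print(is_valid_ean13("4006381333932"))
```

The first call passes the check and the second call fails it, because a single misread digit changes the weighted sum.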
Check box
This field has a Boolean value. At run time, the processing detects whether a box is checked or not checked.
Signature presence
The signature field also has a Boolean value. The processing detects whether a signature exists in the field, that is, whether a document was signed. The processing supports stamped signatures, short format (initials), single or multiple signatures, and signatures on multiple pages. The processing does not interpret the signature or store information beyond the presence or absence of the signature.
Table
If the data that you want from a document is formatted in a table, you can use a Table field to process the data. For more information, see Creating a table field.
Text
Use Text fields to capture text from your document. Examples of text fields include company name, invoice number, and street name.

To create your data extraction model, you designate which field values are important for your document types. When you create the model, you specify the location of the field and the location of the value. You also specify characteristics for the field values, such as whether a value is Sensitive or Required. A Sensitive value contains personal or private information. If you designate a value as Required, the model flags any document that does not contain a value for that field at processing time.
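
To make the Required and Sensitive characteristics concrete, the following Python snippet is a conceptual sketch of field definitions and of the check that flags a document with a missing required value. It does not show the product's API or storage format; `FieldDefinition`, `missing_required`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical stand-in for a field that is defined on a document type."""
    name: str
    field_type: str          # for example "text", "check_box", "signature"
    required: bool = False   # flag the document if no value is extracted
    sensitive: bool = False  # value contains personal or private information


def missing_required(fields, extracted):
    """Return the names of required fields that have no extracted value."""
    return [
        f.name
        for f in fields
        if f.required and extracted.get(f.name) in (None, "")
    ]


fields = [
    FieldDefinition("invoice_number", "text", required=True),
    FieldDefinition("phone_number", "text", sensitive=True),
    FieldDefinition("signed", "signature", required=True),
]

# False is a real value for a Boolean field, so "signed" is not reported as missing.
extracted = {"invoice_number": "", "phone_number": "555-0100", "signed": False}

print(missing_required(fields, extracted))  # ['invoice_number']
```

In this sketch, only an absent or empty value counts as missing, so a Boolean field that is extracted as false still satisfies the Required designation.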

You might find that some fields like phone number or address fields are similar across different document types. You can reuse existing field types when you designate fields for extraction.

When you upload a sample document, the system processes the document in a multi-stage pipeline. The pipeline extracts information based on the document type and field definitions that are available at that time. If you later update the fields, the extracted information for your sample can become out of sync with the updated fields in the document type. For example, newly added fields are not extracted, or updated fields still have their previous values. In such cases, click Reanalyze to reprocess the sample against the updated field information. You can use Reanalyze on a single sample document, or you can use it on a batch to sync up multiple samples.
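
The out-of-sync condition can be pictured as a comparison between the field definitions that a sample was analyzed with and the current definitions. The following Python snippet is a conceptual sketch under that assumption. How the product tracks this state internally is not described in this topic, and `out_of_sync_fields`, `samples_to_reanalyze`, and the sample names are hypothetical.

```python
def out_of_sync_fields(snapshot: dict, current: dict) -> dict:
    """Compare the field definitions used at analysis time with the current ones."""
    shared = snapshot.keys() & current.keys()
    return {
        "added": sorted(current.keys() - snapshot.keys()),
        "removed": sorted(snapshot.keys() - current.keys()),
        "changed": sorted(k for k in shared if snapshot[k] != current[k]),
    }


def samples_to_reanalyze(samples: dict, current: dict) -> list:
    """Return the samples whose analysis no longer matches the current fields."""
    return [
        name
        for name, snapshot in samples.items()
        if any(out_of_sync_fields(snapshot, current).values())
    ]


# Current field definitions for the document type: field name -> field type.
current = {"invoice_number": "text", "total": "text", "signed": "signature"}

# Field definitions that were in effect when each sample was last analyzed.
samples = {
    "sample1.pdf": {"invoice_number": "text"},  # analyzed before two fields were added
    "sample2.pdf": dict(current),               # analyzed with the current fields
}

print(samples_to_reanalyze(samples, current))  # ['sample1.pdf']
```

Only the first sample is reported, which corresponds to the case where newly added fields are not yet extracted for a sample.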

Tip: This procedure leads you through the main steps in creating the data extraction model. However, you might need to repeat certain steps, such as defining fields, multiple times per document type.
Important: Training takes time. When you train the data extraction model, the operation might take many hours to complete. You can let the training process run while you work on other tasks.