Creating a data extraction model

Your document types have a shared set of data points or field values that you want to extract from the documents. You can train an extraction model to extract the right data from each of your document types.

About this task

Each document type has fields with corresponding values that represent the data you want to extract. Field values can be formatted in multiple ways, for example:

Bar code
When you include a bar code field in your extraction model, the model captures the bar code from the document. At runtime, the project interprets the bar code and stores the text or digits that the bar code represents.
The following types of bar codes are supported:
  • Code 128, also known as GS1-128 and EAN 128, based on ISO/IEC 15417
  • Code 25, also known as Interleaved 2 of 5, based on ITF (ISO/IEC 16390)
  • Code 39, based on ISO/IEC 16388
  • Codabar
Note: The runtime application might have issues reading a bar code if the image is of poor quality. For example, if the image is poorly scanned, low resolution, copied several times, or has other visible issues, the bar code might not be read.
Check box
This field has a Boolean value. At runtime, the processing detects whether a box is checked or not checked.
Signature presence
The signature field also has a Boolean value. The processing detects whether a signature exists in the field, that is, whether a document was signed. The processing supports stamped signatures, short format (initials), single or multiple signatures, and signatures on multiple pages. The processing does not interpret the signature or store information beyond the presence or absence of the signature.
Table
If the data that you want from a document is formatted in a table, you can use a Table field to process the data. For more information, see Creating a table field.
Text
Use Text fields to capture text from your document. Fields like company name, invoice number, street name, and so on are examples of text fields.

To create your data extraction model, you designate which field values are important for your document types. When you create the model, you specify the location of the field and the location of the value. You also specify characteristics for the field values. If you designate a value as Sensitive, you indicate that the value contains personal or private information. If you designate a value as Required, the model flags any document that does not contain a value for that field at processing time.
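
The effect of the Required and Sensitive designations can be pictured with a minimal sketch. This is an illustration only, not the product's implementation: the `FieldDefinition` class, the `missing_required` and `redact_sensitive` helpers, the field names, and the mask text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    """Hypothetical stand-in for a field designated for extraction."""
    name: str
    value_format: str        # for example "Text", "Checkbox", "Barcode"
    required: bool = False   # assumption: mirrors "Is this value required?"
    sensitive: bool = False  # assumption: mirrors "Is this value sensitive?"


def missing_required(fields, extracted):
    """Return the names of required fields that have no extracted value."""
    return [f.name for f in fields
            if f.required and extracted.get(f.name) in (None, "")]


def redact_sensitive(fields, extracted, mask="[REDACTED]"):
    """Return a copy of the extracted values with sensitive ones masked."""
    sensitive = {f.name for f in fields if f.sensitive}
    return {key: (mask if key in sensitive and value not in (None, "") else value)
            for key, value in extracted.items()}


fields = [
    FieldDefinition("invoice_number", "Text", required=True),
    FieldDefinition("phone_number", "Text", sensitive=True),
    FieldDefinition("signed", "Signature presence"),
]
extracted = {"phone_number": "555-0100", "signed": True}

print(missing_required(fields, extracted))  # ['invoice_number']
print(redact_sensitive(fields, extracted))
```

In this sketch, a document with no invoice number is flagged, and the phone number is masked before the values are shown.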

You might find that some fields like phone number or address fields are similar across different document types. You can reuse existing field types when you designate fields for extraction.

When you upload a sample document, the system processes the document in a multi-stage pipeline that extracts the document's information by using the document type and field information that is available at the time. If you later update the fields, the information for your sample can become out of sync with the updated fields in the document type. For example, newly added fields are not extracted, or updated fields still have their previous values. In such cases, use the Reanalyze button to reprocess the sample against the updated field information. You can use Reanalyze on a single sample document, or in a batch scenario to bring multiple samples back in sync.
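
One way to picture why a sample goes out of date is the following sketch. It assumes, purely for illustration, that each sample remembers a fingerprint of the field definitions it was analyzed with. How the product actually tracks this is not described here, and the `fingerprint` and `samples_to_reanalyze` helpers and the file names are hypothetical.

```python
import hashlib
import json


def fingerprint(field_formats):
    """Return a stable hash of a document type's field definitions."""
    canonical = json.dumps(field_formats, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def samples_to_reanalyze(samples, current_fields):
    """Return the samples that were analyzed against older field definitions."""
    current = fingerprint(current_fields)
    return [name for name, analyzed_with in samples.items()
            if analyzed_with != current]


old_fields = {"invoice_number": "Text"}
new_fields = {"invoice_number": "Text", "signed": "Signature presence"}

# Each sample stores the fingerprint of the fields that were in place
# when the sample was last analyzed (an assumption of this sketch).
samples = {
    "invoice_001.pdf": fingerprint(old_fields),
    "invoice_002.pdf": fingerprint(new_fields),
}

print(samples_to_reanalyze(samples, new_fields))  # ['invoice_001.pdf']
```

Here only the sample that was analyzed before the signature field was added needs to be reanalyzed.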

Tip: This procedure leads you through the main steps in creating the data extraction model. However, you might need to repeat certain steps, such as defining fields, multiple times per document type.
Important: Training takes time. When you train the data extraction model, the operation might take many hours to complete. You can let the training process run while you work on other tasks.

Procedure

To create a data extraction model:

  1. From the main page in the Designer, on Extraction model, click Start.
    Your document types are displayed, with your sample documents divided into a set for training and a set for testing.
  2. Review your samples to make sure that documents are categorized correctly into your document types.
    If a single document type includes different file formats, confirm that each format is included in both the training sample set and the testing sample set.
  3. When you have finished reviewing your samples, click Next.
  4. Highlight the document type that you want to start with, and from the sample documents list, select a document to train on.
    The sample displays the undefined data that the model discovered.
  5. Highlight each item of undefined data, and click Define field.
    The Data in document information shows the following details:
    • Captured field and location - The field descriptor in the document and the location of the field. Use the Draw control to capture the name and location.
      Note: Not every field value has a field key. For example, a document might include a bar code or invoice number without a field or label attached to the value. If no key exists, leave this setting blank. For some value formats, such as Signature presence, the field key is required.
    • Value format - For example, Text, Checkbox, Table, Signature presence, or Barcode.
    • Captured value and location - The actual value for the field. Use the Draw control to capture the value and location.
  6. Choose how to incorporate the field:
    • Add as a new field - Provide a name for the new field. Choose from the options in Select a field type, then click Next.
      Note: Available field type choices are dependent on the value format that you selected for the field. For example, a Signature presence or Checkbox value format can only have a Boolean field type.
    • Match to an existing field - Select from a list of fields that have already been added and click Next.
      Note: The values displayed for Other possible names help you to confirm that the existing field you specify matches the intent of the field in this sample.
  7. For Field details, provide further information about the field.
    • If you answer Yes to the question Is this value required?, an error occurs when the value is missing for that field at processing time.
    • If you answer Yes to the question Is this value sensitive?, the value is redacted for that field.
  8. When you have completed the details for the field, click Finish.
  9. To create additional fields, click Define new.
  10. To create a composite field, select multiple fields and click Group into composite.
    An address is an example of a composite field: multiple values (street number, street name, city, and so on) combine into one value, an address.
    1. Choose how to add the composite field.
      You can Create a new composite field and enter a name for the new field, or Map to an existing composite field and select the name of the composite that you want to map to.
    2. Click Next, and provide the relevant Composite options for the field:
      • If you answer Yes to the question Is this value required?, an error occurs when the value is missing for that field at processing time.
      • If you answer Yes to the question Is this value sensitive?, the value is redacted for that field.
    3. For each subfield, specify definitions for the field, and click OK.
    4. Define any additional subfield that you want to extract.
      You can use the Draw tool to designate the area and value for the field.
    5. Click Save to save and add the subfield definition.
    6. When you have defined all the subfields for the composite field, click Finish.
  11. To create a table field, see additional instructions in Creating a table field.
  12. When you have defined all the fields for your document type sample, click Done.
    Remember: If needed, you can use the Reanalyze capability to bring your existing document samples back in sync with any field or document type changes. This optional step can be applied to individual documents or to a batch of documents.
  13. If your Sample documents list contains an issue indicator status, click the document link to discover the issue.
    The Data in document shows which fields are missing from the definition.
  14. On the Undefined data entry, click Define, then define the fields.
  15. When you are finished defining the fields for data extraction, click Done.
  16. Review the document types that are ready to progress, and click Train model.

    You can see the status of the model training from the home page. If you need to stop the training process, for example because you need the resources for another job or want to make changes to the model, click Cancel training.

    Remember: This training process can take some time, up to multiple hours. You receive a notification when the training is complete.
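
The composite field described in step 10 can be pictured as a named group of subfields whose values combine into one value. The following is a minimal sketch under that assumption. The `CompositeField` class, its `combine` method, the subfield names, and the space separator are hypothetical and do not represent the product's data model.

```python
from dataclasses import dataclass


@dataclass
class CompositeField:
    """Hypothetical composite field: an ordered group of subfields."""
    name: str
    subfields: list          # ordered subfield names
    required: bool = False
    sensitive: bool = False

    def combine(self, values, separator=" "):
        """Join the subfield values that are present into one value."""
        parts = [str(values[s]) for s in self.subfields if values.get(s)]
        return separator.join(parts)


address = CompositeField("address", ["street_number", "street_name", "city"])
subfield_values = {
    "street_number": "12",
    "street_name": "Main Street",
    "city": "Springfield",
}

print(address.combine(subfield_values))  # 12 Main Street Springfield
```

Subfields with no value are skipped, so a partially extracted address still yields a usable combined value in this sketch.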

What to do next

After the model training completes, you can continue to refine the model with the following tasks:
  1. Review the extraction model to see how accurate the model is at finding and extracting the specified values. See Reviewing the trained model.
  2. Test the data extraction model. For more information, see Testing the extraction model.
  3. Optionally enrich your extracted values so that you can more easily use the data in other applications. For more information, see Defining field types and enrichments.