Creating a data extraction model
Your document types have a shared set of data points or field values that you want to extract from the documents. You can train an extraction model to extract the right data from each of your document types.
About this task
Each document type has fields with corresponding values that represent the data you want to extract. Field values can be formatted in multiple ways, for example:
- Bar code
- When you include a bar code field in your extraction model, the model captures the bar code from the document. At runtime, the project interprets the bar code and stores the text or digits that the bar code represents. The following types of bar codes are supported:
- Code 128, also known as GS1-128 and EAN 128, based on ISO/IEC 15417
- Code 25, also known as Interleaved 2 of 5 (ITF), based on ISO/IEC 16390
- Code 39, based on ISO/IEC 16388
- Codabar
Note: The runtime application might have issues reading a bar code if the image is of poor quality. For example, if the image is poorly scanned, low resolution, copied several times, or has other visible issues, the bar code might not be read.
- Check box
- This field has a Boolean value. At runtime, the processing detects whether a box is checked or not checked.
- Signature presence
- The signature field also has a Boolean value. The processing detects whether a signature exists in the field, that is, whether a document was signed. The processing supports stamped signatures, short format (initials), single or multiple signatures, and signatures on multiple pages. The processing does not interpret the signature or store information beyond the presence or absence of the signature.
- Table
- If the data that you want from a document is formatted in a table, you can use a Table field to process the data. For more information, see Creating a table field.
- Text
- Use Text fields to capture text from your document. Fields like company name, invoice number, street name, and so on are examples of text fields.
To create your data extraction model, you designate which field values are important for your document types. When you create the model, you specify the location of the field and the location of the value. You also specify characteristics for the field values, such as whether the value is Sensitive. This designation means that the value contains personal or private information. If you designate the value as Required, the model flags any document that does not contain a value for that field at processing time.
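The effect of the Required and Sensitive designations can be illustrated with a minimal sketch. The names used here (`FieldKind`, `FieldDefinition`, `missing_required`) are hypothetical and are not part of the product; the sketch only assumes that a field carries a type plus two flags, and that a required field with no extracted value causes the document to be flagged.

```python
from dataclasses import dataclass
from enum import Enum


class FieldKind(Enum):
    """Hypothetical names for the field formats described above."""
    BAR_CODE = "bar_code"
    CHECK_BOX = "check_box"
    SIGNATURE = "signature"
    TABLE = "table"
    TEXT = "text"


@dataclass(frozen=True)
class FieldDefinition:
    """Illustrative field definition; not the product's data model."""
    name: str
    kind: FieldKind
    required: bool = False   # flag the document if no value is found
    sensitive: bool = False  # value holds personal or private information


def missing_required(fields, extracted):
    """Return the names of required fields that have no extracted value."""
    missing = []
    for field_def in fields:
        value = extracted.get(field_def.name)
        # Check box and signature fields legitimately hold False,
        # so only None or empty text counts as "no value".
        if field_def.required and (value is None or value == ""):
            missing.append(field_def.name)
    return missing


fields = [
    FieldDefinition("invoice_number", FieldKind.TEXT, required=True),
    FieldDefinition("phone_number", FieldKind.TEXT, sensitive=True),
    FieldDefinition("signed", FieldKind.SIGNATURE, required=True),
]
extracted = {"invoice_number": "", "phone_number": "555-0100", "signed": False}
flagged = missing_required(fields, extracted)
```

In this sketch, `flagged` contains only `invoice_number`: the signature field is required but has a value (`False`, meaning no signature was detected), so it is not treated as missing.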
You might find that some fields like phone number or address fields are similar across different document types. You can reuse existing field types when you designate fields for extraction.
When you upload a sample document, the system processes the document in a multi-stage pipeline that extracts the document's information by using the document type and field information that is available at the time. If you later update the fields, the information for your sample can become out of sync with the updated fields in the document type. For example, newly added fields are not extracted, or updated fields still have their previous values. In such cases, you can use the Reanalyze button to reprocess the sample against the updated field information. You can use Reanalyze on a single sample document, or you can use it in a batch scenario to bring multiple samples back in sync.
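Why a sample becomes stale can be sketched as follows. This is an illustration under stated assumptions, not a description of the product's internals: it assumes that each sample remembers which revision of the field definitions it was analyzed against, and all names (`DocumentType`, `Sample`, `analyze`, `stale_samples`, `fake_extract`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentType:
    """Hypothetical document type whose revision grows with each field change."""
    name: str
    field_names: list
    revision: int = 0

    def add_field(self, field_name):
        self.field_names.append(field_name)
        self.revision += 1


@dataclass
class Sample:
    """Hypothetical sample; -1 means it has not been analyzed yet."""
    filename: str
    analyzed_revision: int = -1
    values: dict = field(default_factory=dict)


def analyze(sample, doc_type, extract):
    """(Re)extract every currently defined field and record the revision used."""
    sample.values = {
        name: extract(sample.filename, name) for name in doc_type.field_names
    }
    sample.analyzed_revision = doc_type.revision


def stale_samples(samples, doc_type):
    """Return samples analyzed against an older revision of the fields."""
    return [s for s in samples if s.analyzed_revision < doc_type.revision]


def fake_extract(filename, field_name):
    # Stand-in for the real extraction pipeline.
    return f"{field_name} from {filename}"


invoice = DocumentType("Invoice", ["invoice_number"])
samples = [Sample("a.pdf"), Sample("b.pdf")]
for s in samples:
    analyze(s, invoice, fake_extract)

invoice.add_field("company_name")  # both samples are now out of sync
stale_before = [s.filename for s in stale_samples(samples, invoice)]

analyze(samples[0], invoice, fake_extract)  # reanalyze a single sample
stale_after = [s.filename for s in stale_samples(samples, invoice)]
```

After the new field is added, both samples are reported as stale. Reanalyzing `a.pdf` alone picks up `company_name` for that sample and leaves `b.pdf` stale, which mirrors the choice between reanalyzing one sample and reanalyzing a batch.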
Procedure
To create a data extraction model:
What to do next
- Review the extraction model to see how accurate the model is at finding and extracting the specified values. See Reviewing the trained model.
- Test the data extraction model. For more information, see Testing the extraction model.
- Optionally enrich your extracted values so that you can more easily use the data in other applications. For more information, see Defining field types and enrichments.