Creating a document classification model

The classification model identifies the type of each document that you submit for processing. To create your document classification model, you use sample documents to establish document types and the fields and values that correspond with those types.

Before you begin

To get started on your classification model, assemble your collection of sample documents:

Type

Consider which types of documents you want to include. A project can process multiple kinds of documents. You might create one project to handle all the variations of a single document, for example, multiple invoices. Or, you might create a project that handles all of the document types that are related to a specific line of business or area of your organization, for example, invoices, receipts, and shipping forms.

Include samples for each type of document that you want the model to process. You might have variations of a document type, for example, different kinds of invoices. If the format and field layout vary significantly between the different invoices, treat them as different document types.

You must define at least one document type. If you have only one type, all your documents will be classified as that type, but you still need to create and train the classification model.

Number

To give the model enough information, provide at least five different samples for each document type that you want to include, plus at least one document for testing. Make sure that you also have enough samples of each type to use when you train the model.
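The sample-count guideline above can be checked before you start. The following Python sketch is only an illustration and is not part of the product: the function name `find_undersupplied_types` and the sample labels are hypothetical, and it assumes that you keep a list with one document type label per sample document.

```python
from collections import Counter

# Minimum number of samples per document type, as stated in this topic.
MIN_SAMPLES_PER_TYPE = 5


def find_undersupplied_types(labels, minimum=MIN_SAMPLES_PER_TYPE):
    """Return {document type: sample count} for types below the minimum."""
    counts = Counter(labels)
    return {doc_type: n for doc_type, n in counts.items() if n < minimum}


# Hypothetical sample set: one label per sample document.
sample_labels = ["Invoice"] * 6 + ["Receipt"] * 3 + ["Shipping form"] * 5

# Document types that still need more samples.
print(find_undersupplied_types(sample_labels))  # prints {'Receipt': 3}
```

The sketch checks only the per-type minimum. It does not check that you set aside a document for testing or additional samples for training.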

Group

After you create the model, you train it with more samples. It helps to know in advance how you expect the model to classify each training document, so group your samples by the document type that you expect.

Format

In Document Processing Extension Designer, supported file types for documents include PDF, Microsoft Word (DOCX and DOC), and image formats (JPG, JPEG, PNG, TIFF).
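To screen a sample collection against the file types listed above, you might use a short script like the following Python sketch. It is an illustration only, not a product feature: the function name `split_by_support` and the file names are hypothetical, and the extension set is taken directly from the list in this topic, so any extension not named there is treated as unsupported.

```python
from pathlib import Path

# Extensions taken from the file types listed in this topic.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".jpg", ".jpeg", ".png", ".tiff"}


def split_by_support(filenames):
    """Split file names into (supported, unsupported) by extension."""
    supported, unsupported = [], []
    for name in filenames:
        # Compare in lowercase so that "scan.PDF" matches ".pdf".
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


# Hypothetical file names for illustration.
ok, rejected = split_by_support(
    ["invoice_01.PDF", "receipt.png", "notes.txt", "scan.tiff"]
)
print(ok)        # prints ['invoice_01.PDF', 'receipt.png', 'scan.tiff']
print(rejected)  # prints ['notes.txt']
```

The check looks only at the file extension. It does not confirm that the file content is valid for that format.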

About this task

To create your document classification model, you identify your document types, train the model to recognize your document types, and test the model that you select with more sample documents.

To create the model, you need to know which documents are important to your organization. You can divide your sample documents into groups based on shared attributes. Ultimately, the document types that you choose also help to establish the extraction model, which determines what data is extracted from the documents at processing time. Different document types might share similar data definitions for extraction. For example, you might have several document types that include a similar account number, but that does not necessarily mean that they are similar document types that should be grouped together.

The model comes pre-trained with several common document types. If any of your document types align with the characteristics of the pre-trained document types, you can assign your documents to those types. If you have document types that are different from the pre-trained types, add new document types. You are not required to use the pre-trained types. The following pre-trained document types are available to use as needed and contribute to a more robust model:

Bill of Lading
Invoice
Utility Bill

What to do next

After the system knows what each type of document looks like for you, create a data extraction model to specify which pieces of information (data) you want to extract from each document type.