Identification of page types from documents

Datacap can identify individual page types by using several techniques, but the most common technique is fingerprint matching.

In a typical Datacap application, documents start as a batch of unidentified image files with one image per page. A single batch might contain a mix of document types, and each document might contain a number of different page types. Nothing within a page image identifies the page type or any of the data on the page. In other words, the page images do not contain any structured content.

Before Datacap can begin to extract data, it must identify the individual page types. Datacap then maps pages to documents, and fields to pages, by using the information in the document hierarchy. After Datacap identifies the fields and their locations within each page, it extracts and stores the data in a structured format. The structured format is known as the runtime batch hierarchy.
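
The mapping of pages to documents, and fields to pages, can be pictured with a small sketch. This is an illustration only, not Datacap code: the hierarchy contents, the class and function names, the image file names, and the rule for starting a new document are all assumptions made for this example. The page types are supplied as input, standing in for the result of an identification step such as fingerprint matching.

```python
from dataclasses import dataclass, field

# Hypothetical document hierarchy: document type -> page type -> field names.
DOCUMENT_HIERARCHY = {
    "Invoice": {"InvoiceMain": ["number", "total"], "InvoiceTrailing": ["notes"]},
    "Receipt": {"ReceiptPage": ["amount"]},
}


@dataclass
class Page:
    image_file: str
    page_type: str
    fields: dict = field(default_factory=dict)


@dataclass
class Document:
    doc_type: str
    pages: list = field(default_factory=list)


def document_type_of(page_type):
    """Look up which document type a page type belongs to."""
    for doc_type, page_types in DOCUMENT_HIERARCHY.items():
        if page_type in page_types:
            return doc_type
    raise KeyError(f"unknown page type: {page_type}")


def build_batch(identified_pages):
    """Group (image_file, page_type) pairs into documents with empty fields.

    Assumed rule for this sketch: a new document starts when the document
    type changes, or when the page is the first page type listed for its
    document type in the hierarchy.
    """
    documents = []
    for image_file, page_type in identified_pages:
        doc_type = document_type_of(page_type)
        first_page_type = next(iter(DOCUMENT_HIERARCHY[doc_type]))
        if (not documents
                or documents[-1].doc_type != doc_type
                or page_type == first_page_type):
            documents.append(Document(doc_type))
        # Fields come from the hierarchy; values stay empty until extraction.
        field_names = DOCUMENT_HIERARCHY[doc_type][page_type]
        page = Page(image_file, page_type, {name: None for name in field_names})
        documents[-1].pages.append(page)
    return documents


# Page types are given here; in practice they come from page identification.
batch = build_batch([
    ("0001.tif", "InvoiceMain"),
    ("0002.tif", "InvoiceTrailing"),
    ("0003.tif", "ReceiptPage"),
    ("0004.tif", "InvoiceMain"),
])
for doc in batch:
    print(doc.doc_type, [(p.image_file, p.page_type) for p in doc.pages])
```

The resulting list of documents, pages, and empty fields is a rough analogue of the structured form that the extracted data is later stored in. The real runtime batch hierarchy is produced by Datacap itself and is not represented by these classes.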