AutomaticDocumentFingerprinting actions

Identifies pages and assigns types and fingerprints through fingerprint matching.

When ingesting and processing documents in a Datacap application, documents and pages are typically classified to identify the data within the document or to control how the page should be processed. Fingerprints are a classification mechanism that matches images based on the geometry of the image. A Datacap application typically uses fingerprints to classify pages within the workflow. Other classification techniques can be used in addition to fingerprints. The rules within an application are flexible enough to allow exclusive use of fingerprints for classification or a mix of fingerprinting with other types of classification, including techniques that are custom to an application.

Classification can be configured uniquely for any Datacap application. That said, some typical types of classification are briefly described here. The intent is to provide a basic understanding of fingerprints and how they can be used as a starting point in an application. These approaches are then changed or expanded as required by a specific application's use cases.

One approach to classification is to use separator sheets. These are typically sheets with a barcode and are used to separate the pages that are scanned. The presence of a new separator sheet denotes the start of a new document. Fingerprints can classify documents in addition to or instead of separator sheets. To fingerprint a page, the page must first be analyzed with the AnalyzeImage action to create a CCO file, which is a geometric representation of the page. This representation is then matched against known formats by the FindFingerprint action. When a match is found, the page type is determined. Additionally, a fingerprint template is assigned to the page. The fingerprint template defines the zones (fields) on the page that contain the data to be extracted or verified by an operator. While classification can be performed without using fields, fields are typically used as well.
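The matching step can be pictured with a small Python sketch. This is not how FindFingerprint is implemented: the box-set representation, the Jaccard similarity score, the 0.6 threshold, and every name in the code are assumptions chosen only to illustrate matching a page's geometry against known fingerprints and deriving a page type from the best match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    fingerprint_id: str  # hypothetical identifier of a stored fingerprint
    page_type: str       # page type assigned when this fingerprint matches
    boxes: frozenset     # simplified stand-in for the geometry held in a CCO


def similarity(page_boxes, fp_boxes):
    """Jaccard overlap of two box sets: 1.0 is identical, 0.0 is disjoint."""
    union = page_boxes | fp_boxes
    if not union:
        return 0.0
    return len(page_boxes & fp_boxes) / len(union)


def find_fingerprint(page_boxes, known, threshold=0.6):
    """Return the best-matching fingerprint, or None if none reaches the threshold."""
    best = max(known, key=lambda fp: similarity(page_boxes, fp.boxes), default=None)
    if best is None or similarity(page_boxes, best.boxes) < threshold:
        return None
    return best


# Boxes are (left, top, right, bottom) tuples; all values are made up.
known = [
    Fingerprint("fp1", "MainPage",
                frozenset({(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)})),
    Fingerprint("fp2", "Separator", frozenset({(5, 5, 6, 6)})),
]
page = frozenset({(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50), (60, 60, 70, 70)})
match = find_fingerprint(page, known)
```

In this toy data the page shares three of four boxes with fp1, so it is classified as that fingerprint's page type. A page with no sufficiently similar fingerprint yields None, which corresponds to the case where no match is found.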

Using the APT application as an example, a page may be identified as a "MainPage", which is the first page of an invoice. This causes different processing for that type of page, as opposed to other types such as separator pages, which may be ignored or deleted. An application may process invoices from thousands of different vendors. One invoice may be from "ABC Plumbing" and another may be from "XYZ Car Rental". Because they come from different vendors, each invoice has its own format and layout. However, there are some common properties of an invoice that the application needs to process, such as date, vendor name, address, tax, total and the line items covered by the invoice. While each page is identified as a "MainPage", the matched fingerprint template for each page will be unique. For example, ABC Plumbing may have the date in the top left corner of the page while XYZ Car Rental puts the date at the bottom of the page. The matched template contains the location of the desired data so the application can correctly obtain the data from the invoice. APT can also ingest invoices from vendors that have not been processed before, dynamically determine the locations of these values, and then save a fingerprint so the locations are known the next time an invoice is received from the same vendor.
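The relationship between a shared page type and vendor-specific templates can be sketched as a small data structure. The Python below is illustrative only: the template IDs, field names, and coordinates are invented, and it does not reflect how Datacap stores fingerprint templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    left: int
    top: int
    right: int
    bottom: int


# Both templates classify the page as "MainPage", but each one records
# where that vendor prints its fields. All values here are made up.
TEMPLATES = {
    "abc_plumbing": {
        "page_type": "MainPage",
        "zones": {
            "Date": Zone(50, 40, 400, 90),         # top left corner
            "Total": Zone(1800, 2900, 2300, 2960),
        },
    },
    "xyz_car_rental": {
        "page_type": "MainPage",
        "zones": {
            "Date": Zone(900, 3100, 1300, 3150),   # bottom of the page
            "Total": Zone(1700, 2500, 2200, 2560),
        },
    },
}


def zone_for(template_id, field_name):
    """Look up where a field is located for the matched template."""
    return TEMPLATES[template_id]["zones"][field_name]
```

Once a page has been matched to a template, a lookup such as zone_for("xyz_car_rental", "Date") gives the region to read for that vendor, even though both vendors' pages share the same page type.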

That is one example of how fingerprints can assign page types and field zone templates for pages. Datacap has the ability to run unique rules for each object type within an application. For example, pages identified as invoices may need one set of operations while documents identified as bank statements need a different set of processing steps. The page types determined by fingerprinting then control how the pages are handled as they continue through the workflow.

Prior to fingerprinting, images are typically pre-processed to ensure the greatest accuracy. For example, images are typically straightened and de-speckled, and have blobs and lines removed prior to fingerprinting. This helps detection by giving pages a uniform look prior to identification. It may not be desirable to show these processed images to the user. More likely, the goal is to save the original documents as they came in, not the versions that have been adjusted for fingerprinting or recognition. The Datacap actions always save the original version of the image prior to performing image adjustments. This original version of the image can always be used or stored as required in later steps of the workflow.

Local Fingerprints

By default, fingerprints are stored and used "locally" on the machine that is running rules and are held within the client or process. For example, fingerprints are typically used in a "Page Identification" task. When this task runs, the AutomaticDocumentFingerprint actions load the fingerprints defined for the application, then perform fingerprint matching and creation. Everything happens locally on the current machine, and the process is repeated for each batch that is run. The benefit of using fingerprints locally is that no additional setup is required beyond adding the required fingerprint actions to a Datacap application. A drawback of local fingerprints is that the fingerprints are loaded for each batch. If there are a large number of fingerprints, the load could take several minutes. The fingerprints also consume memory within the process that is running the batch. For better performance, the fingerprint service is the recommended mechanism for using fingerprints.

Fingerprint Service

The Datacap Fingerprint Service, in combination with actions from this action library, can cache and use the fingerprints from one or multiple Datacap applications simultaneously. The Fingerprint Service overcomes the time-consuming process of loading all the fingerprints of an application each time a batch is run. The solution involves reading all the fingerprints into the system memory cache of the Fingerprint Server when the first fingerprint match is requested. As a further optimization of this process, only the identification portion of the fingerprint is loaded into memory. The fingerprints remain in system memory for the life of the Fingerprint Service process. The result is the elimination of the fingerprint load time for all but the first batch that is processed by an application. For large-scale implementations, where large numbers of fingerprints (typically over 1000) are required for document identification, use of the Fingerprint Service is recommended.

Other than identifying the location of the fingerprint service with the SetFingerprintWebServiceURL action, no changes are needed: the actions within this action library transparently use the fingerprint service instead of local fingerprinting. The important step is to call the SetFingerprintWebServiceURL action first to specify the location of the service. Subsequent actions in the same profile then automatically use the service instead of local fingerprinting operations.

The fingerprint service requires IIS and can be configured on a machine that is shared by multiple applications. Refer to the fingerprint service configuration documentation, "Installing and configuring the Datacap Fingerprint Service", for more information.

The first time that a match is requested by an application, the Fingerprint Service loads the fingerprints (all *.cco files) for that application. When more than one application is set up to use a single Fingerprint Service, the fingerprints for each application are loaded into memory the first time that application requests fingerprint matching. The greater the number of fingerprints to read into memory, the longer the service takes to begin fingerprint processing, which can range from a few seconds to many minutes for large fingerprint deployments. Subsequent requests for matching work with the fingerprints in the in-memory cache. When a matching fingerprint is not found and the application creates a new fingerprint, that new fingerprint is loaded into the cache.
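The caching behavior described here (load on the first request per application, reuse afterwards, add newly created fingerprints) can be modeled with a short Python sketch. It is a conceptual illustration, not the service's implementation: the class, the loader callable, and the use of a single lock are all assumptions.

```python
import threading


class FingerprintCache:
    """In-memory cache keyed by application, filled on the first match request."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: application name -> fingerprints
        self._by_app = {}
        self._lock = threading.Lock()
        self.load_count = 0     # number of loads performed, for illustration only

    def _ensure_loaded(self, app):
        # Caller must hold the lock. Loads the application's fingerprints once.
        if app not in self._by_app:
            self._by_app[app] = list(self._loader(app))
            self.load_count += 1
        return self._by_app[app]

    def fingerprints_for(self, app):
        """Return a snapshot of the cached fingerprints, loading them if needed."""
        with self._lock:
            return tuple(self._ensure_loaded(app))

    def add(self, app, fingerprint):
        """Add a newly created fingerprint so later requests can match it."""
        with self._lock:
            self._ensure_loaded(app).append(fingerprint)
```

Because the lock is held while loading, a second caller for the same application blocks until the first load finishes and then reads from the cache, which mirrors the single shared load described for the service.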

The Datacap Fingerprint Service is a web service that is based on Microsoft Internet Information Services (IIS). The service can handle multiple client requests simultaneously. The Datacap Fingerprint Service cache is shared by all Rulerunners or clients running the application. Therefore, a single load of fingerprints into the cache for an application is all that is required, even when multiple threads are running.

If the fingerprint service is enabled and an action that uses it, such as FindFingerprint, is called while the fingerprints are not yet loaded, or while a load started by another batch is still in progress, the action waits up to 20 minutes by default for the fingerprint load to complete. For example, if one batch is loading fingerprints and another batch calls FindFingerprint, the second call waits up to the 20 minute default for the load to finish. If the load does not finish within that time, the FindFingerprint action does not perform a fingerprint match and returns false. The default is long enough for most installations. If more time is required because of a very large number of fingerprints, set the DCO variable FingerprintServiceWait_Seconds to the maximum number of seconds that the action should wait for the fingerprints to load on the service. Set this variable on the same DCO node where FindFingerprint is called. Increasing the timeout avoids this problem during long fingerprint load times at the service.
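The wait-then-give-up behavior can be sketched as a polling loop with a deadline. How the action actually checks load status is not documented here, so the polling approach, the function name, and the parameters below are assumptions. The wait_seconds argument plays the role of FingerprintServiceWait_Seconds, and the injectable clock and sleep functions exist only to make the sketch easy to exercise.

```python
import time

DEFAULT_WAIT_SECONDS = 20 * 60  # the 20 minute default described above


def wait_for_load(is_loaded, wait_seconds=DEFAULT_WAIT_SECONDS, poll_seconds=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll is_loaded() until it returns True or wait_seconds have elapsed.

    Returns True if the load completed in time, False otherwise. A False
    result corresponds to the action skipping the match and returning false.
    """
    deadline = clock() + wait_seconds
    while True:
        if is_loaded():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

With this model, a load that takes 25 minutes fails under the default but succeeds if the wait is raised to, say, 1800 seconds.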

Once the fingerprints are loaded by the service, they remain loaded for subsequent calls to FindFingerprint and do not need to be reloaded. If the service is recycled, the fingerprints are reloaded on the next call to the service.

Additional Information

It is recommended to review the best practices guide for recognition for tips on how to improve image quality prior to fingerprint matching and recognition. Fingerprints rely on the creation of a CCO for the image that is associated with the DCO page. The CCO is typically created by the AnalyzeImage action, a recognition action, or both. More information about the CCO can be found in the top-level help of the SharedRecognitionTools action library. The top-level help topic of the ApplicationObjects action library discusses the Datacap Object Hierarchy (DCO), including the different object levels and object variables. The DCO should be well understood before writing a Datacap application.

Smart Parameters

Parameters that use an "@" notation, such as "@X.xxxx", "@P.xxxx", "@STRING()", "@PILOT()", etc., are known as "smart parameters". The data passed as a parameter to an action is evaluated at runtime. For example, "@X" is notation for accessing the current object. "@P" accesses the page object, which could be the current object or a parent object. Because smart parameters are evaluated at runtime, recognized data or other batch-specific data can be passed to actions or stored as metadata. A smart parameter such as @DATE could be used to name export files based on the current date. Similar smart parameters retrieve the current batch ID, current batch directory, current operator ID, and more. Smart parameters are a very powerful mechanism for getting data from other areas of the application and copying it into a new location or passing it as a parameter to an action. It is recommended to review the smart parameter documentation to understand the full capabilities available. The top-level help of the ValidationAndTextAdjustments action library also has information about smart parameters. Some characters, such as +, \ and ., also have special smart parameter meaning and may be interpreted as a smart parameter. This can be controlled using smart parameter syntax.
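The difference between "@X" (the current object) and "@P" (the page object, which may be the current object or a parent) can be illustrated with a toy resolver in Python. This is not Datacap's smart parameter engine: the Node class, the variable names, and the choice to return an empty string for a missing variable are all assumptions, and the real syntax supports far more than these two prefixes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    level: str                                     # e.g. "page" or "field"
    variables: dict = field(default_factory=dict)  # hypothetical object variables
    parent: Optional["Node"] = None


def resolve(param, current):
    """Resolve a toy '@X.name' or '@P.name' parameter at runtime.

    Any other text is returned unchanged, as a literal value.
    """
    if param.startswith("@X."):
        # "@X" reads from the object the rule is currently running on.
        return current.variables.get(param[3:], "")
    if param.startswith("@P."):
        # "@P" reads from the page: the current object or its nearest page ancestor.
        node = current
        while node is not None and node.level != "page":
            node = node.parent
        return node.variables.get(param[3:], "") if node else ""
    return param


page = Node("page", {"TYPE": "MainPage"})
date_field = Node("field", {"TYPE": "Date"}, parent=page)
```

Here resolve("@X.TYPE", date_field) reads the field's own variable, while resolve("@P.TYPE", date_field) walks up to the page and reads the page's variable. When the rule runs on the page itself, both forms read from the same object.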