Integration with IBM Datacap

IBM® Content Classification provides text-based classification services for IBM Datacap.

IBM Datacap uses a number of classification methods to process captured documents, such as barcode, pattern recognition, and fingerprint recognition. Integration with IBM Content Classification provides an additional level of classification based on text analytics.

Classification workflow

IBM Datacap creates a batch of document images and extracts text from each page by using optical character recognition (OCR). Using natural language processing and semantic analysis, Content Classification analyzes the text and identifies the page type. It assigns a confidence score (0 to 100) to each category in a predefined knowledge base. The knowledge base contains categories that correspond to page types in the IBM Datacap document hierarchy. Content Classification returns the classification results to IBM Datacap, and the complete batch is classified.
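The scoring step can be pictured with a short Python sketch. It is not the Content Classification API: the function names and the dictionary shapes are hypothetical, and the scores are assumed to arrive already computed. It shows only how one score per category (0 to 100) can be ranked so that the highest-scoring category becomes the page type for each page in a batch.

```python
from typing import Dict, List, Tuple


def rank_categories(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (category, score) pairs ordered from highest to lowest score."""
    if not scores:
        raise ValueError("at least one category score is required")
    for category, score in scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"score for {category!r} is outside 0-100: {score}")
    # Sort by descending score; ties are broken alphabetically so the
    # ordering is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def classify_batch(
    batch_scores: Dict[str, Dict[str, float]]
) -> Dict[str, Tuple[str, float]]:
    """Map each page ID (hypothetical key) to its top-ranked category."""
    return {
        page_id: rank_categories(scores)[0]
        for page_id, scores in batch_scores.items()
    }
```

In this sketch a batch is classified once every page ID has a top-ranked category; how IBM Datacap stores the result on the page is outside its scope.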

Configuration overview

In Content Classification, you create a knowledge base of categories that correspond to page types. To build the knowledge base, you can configure IBM Datacap to export extracted text files to a folder structure, and then import these files into Content Classification. You train the knowledge base by providing sample pages of each type, either one by one in IBM Datacap or in bulk with Content Classification applications.
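As an illustration of the export step, the following Python sketch writes extracted page text into a folder structure. The layout is an assumption, not something this topic specifies: one subfolder per page type, with one UTF-8 text file per page. The function name and parameters are hypothetical, and IBM Datacap performs the real export through its own configuration.

```python
from pathlib import Path


def export_page_text(root: Path, page_type: str, page_id: str, text: str) -> Path:
    """Write one page's extracted text to <root>/<page_type>/<page_id>.txt.

    Assumed layout: each subfolder of root stands for one page type
    (one knowledge base category).
    """
    folder = root / page_type
    # Create the category folder on first use.
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{page_id}.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Grouping the files by page type keeps the category label in the path, so a bulk import does not need a separate mapping file.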

In IBM Datacap, you configure connection settings and specify the Content Classification knowledge base to use. Two primary actions in IBM Datacap interact with Content Classification:

FindFingerprintICM: This action identifies a page type by using Content Classification. It analyzes the full text of each page and attempts to match it to a category in the knowledge base. If a match is found, the page type and confidence are set to the most relevant category and its confidence level. If no match is found, the page type is set to Other. Exceptions and low-confidence results are reviewed and classified manually, and the Content Classification knowledge base is updated with the results of the manual classification.
UpdateKnowledgeBaseICM: This action provides feedback to the knowledge base by associating the text of the current document with its type, which improves the classification of similar documents in the future.
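The decision logic that these two actions describe can be sketched in Python. This is a model of the behavior only, not the implementation of FindFingerprintICM or UpdateKnowledgeBaseICM. The names assign_page_type and record_feedback, the PageResult type, and both threshold values are hypothetical; this topic does not state what score counts as a match or as low confidence.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageResult:
    page_type: str
    confidence: float
    needs_review: bool


def assign_page_type(
    scores: Dict[str, float],
    match_floor: float = 1.0,    # assumed: below this, nothing matched
    review_below: float = 70.0,  # assumed: below this, confidence is low
) -> PageResult:
    """Pick the most relevant category, or fall back to Other."""
    if scores:
        best_type, best_score = max(scores.items(), key=lambda item: item[1])
    else:
        best_type, best_score = "Other", 0.0
    if best_score < match_floor:
        # No match: the page type is Other and an operator reviews the page.
        return PageResult("Other", best_score, True)
    return PageResult(best_type, best_score, best_score < review_below)


def record_feedback(
    training_set: Dict[str, List[str]], page_text: str, confirmed_type: str
) -> None:
    """Associate page text with its confirmed type (stand-in for feedback)."""
    training_set.setdefault(confirmed_type, []).append(page_text)
```

A page whose result has needs_review set would go to manual classification, and the operator's decision would then be passed to record_feedback so that similar pages score higher later.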

Classification tips

For best results, combine Content Classification textual analysis with other IBM Datacap classification methods such as barcodes and fingerprinting.
Content Classification learns to classify by example: the more examples you provide, the more accurate the classification results. Provide at least 30 sample pages per page type (category) when you build your Content Classification knowledge base.
In Content Classification, you can select an option to store learning data with your knowledge base. This can be helpful if you want to modify and rebuild your knowledge base in the future. For more information, see Saving learning data.
If you are building your knowledge base by reviewing pages one by one and submitting feedback in IBM Datacap, set the following node activation properties for your knowledge base in Classification Workbench > Knowledge Base Editor > Knowledge Base Properties > Advanced Learning Properties:
- Positive feedbacks: 4
- Negative feedbacks: 36
- Total feedbacks: 40
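
The guideline of at least 30 sample pages per page type can be checked before a bulk import. The Python sketch below assumes that the sample text files sit in one subfolder per page type with a .txt extension; that layout and the function names are hypothetical and are not part of either product.

```python
from pathlib import Path
from typing import Dict

MIN_SAMPLES = 30  # guideline from the tips above


def count_samples(root: Path) -> Dict[str, int]:
    """Count .txt sample files in each category subfolder of root."""
    return {
        folder.name: sum(1 for f in folder.glob("*.txt") if f.is_file())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def undersampled(counts: Dict[str, int], minimum: int = MIN_SAMPLES) -> Dict[str, int]:
    """Return each category below the minimum with its shortfall."""
    return {name: minimum - n for name, n in counts.items() if n < minimum}
```

For example, a category folder that holds 4 sample files is reported with a shortfall of 26, which indicates how many more pages to collect before training.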