Integration with IBM Datacap

IBM® Content Classification provides text-based classification services for IBM Datacap.

IBM Datacap uses a number of classification methods to process captured documents, such as barcode, pattern recognition, and fingerprint recognition. Integration with IBM Content Classification provides an additional level of classification based on text analytics.

Classification workflow

IBM Datacap creates a batch of document images and extracts text from each page by using OCR (Optical Character Recognition). Using natural language processing and semantic analysis, Content Classification analyzes the text and identifies the page type. It assigns a confidence score from 0 to 100 to each category in a predefined knowledge base. The knowledge base contains categories that correspond to page types in the IBM Datacap document hierarchy. Content Classification returns the classification results to IBM Datacap, and the process repeats until every page in the batch is classified.
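
The data flow in this workflow can be sketched in Python as follows. This is a toy illustration only: the keyword-overlap scoring is a stand-in for the natural language processing and semantic analysis that Content Classification performs, and the function names and knowledge base layout are assumptions made for this example, not part of either product.

```python
def score_page(text, knowledge_base):
    """Score one page's text against every category on a 0-100 scale.

    Toy stand-in for text analytics: the score is the percentage of a
    category's keywords that appear in the page text.
    """
    words = set(text.lower().split())
    scores = {}
    for category, keywords in knowledge_base.items():
        hits = sum(1 for keyword in keywords if keyword in words)
        scores[category] = round(100 * hits / len(keywords)) if keywords else 0
    return scores


def classify_batch(pages, knowledge_base):
    """Map each page ID in the batch to its per-category confidence scores."""
    return {
        page_id: score_page(text, knowledge_base)
        for page_id, text in pages.items()
    }
```

Each page ends up with one score per category, which mirrors the description above: the knowledge base categories correspond to page types, and every category receives its own confidence score.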

Configuration overview

In Content Classification, you create a knowledge base of categories that correspond to page types. To build the knowledge base, you can configure IBM Datacap to export extracted text files to a folder structure, and then import these files into Content Classification. You train the knowledge base by providing sample pages of each type, either one by one by using IBM Datacap, or in bulk by using Content Classification applications.
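
As an illustration of the export step, the following Python sketch writes extracted page text into one folder per page type and then counts the samples in each folder. The one-folder-per-page-type layout, the function names, and the sample tuple format are assumptions made for this example; the folder structure that IBM Datacap produces might differ.

```python
from pathlib import Path


def export_training_text(samples, root):
    """Write each (page_type, page_id, text) sample to root/<page_type>/<page_id>.txt.

    Assumed layout: one subfolder per page type, one text file per page.
    """
    root = Path(root)
    written = []
    for page_type, page_id, text in samples:
        folder = root / page_type
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"{page_id}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


def count_samples_per_type(root):
    """Count the exported text files under each page-type folder."""
    return {
        folder.name: sum(1 for _ in folder.glob("*.txt"))
        for folder in sorted(Path(root).iterdir())
        if folder.is_dir()
    }
```

Checking the per-type counts before an import is one way to spot page types that have too few sample pages for training.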

In IBM Datacap, you configure connection settings and specify the Content Classification knowledge base to use. Two primary actions in IBM Datacap interact with Content Classification:
FindFingerprintICM
This action identifies a page type by using Content Classification. Content Classification analyzes the full text of each page and attempts to match it to a category in the knowledge base. If a match is found, the page type and confidence are populated with the most relevant category and its confidence level. If no match is found, the page type is set to Other. Exceptions and low-confidence results are reviewed and classified manually. The Content Classification knowledge base is updated with the results of the manual classification.
UpdateKnowledgeBaseICM
This action provides feedback to the knowledge base by associating the text of the current document with its type to enhance the classification results of similar documents in the future.
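
The decision logic that these two actions describe can be sketched in Python as follows. This is not the implementation of FindFingerprintICM or UpdateKnowledgeBaseICM: the names, the review threshold of 70, and the treatment of an empty result as "no match" are all assumptions made for this example.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 70  # hypothetical cutoff for manual review, not a product default


@dataclass
class PageResult:
    page_type: str
    confidence: int
    needs_review: bool


def assign_page_type(scores, threshold=REVIEW_THRESHOLD):
    """Pick the most relevant category, or fall back to Other.

    Assumption: an empty score mapping means that no match was found.
    """
    if not scores:
        return PageResult("Other", 0, True)
    category, confidence = max(scores.items(), key=lambda item: item[1])
    # Low-confidence results are flagged for manual classification.
    return PageResult(category, confidence, confidence < threshold)


def record_feedback(training_samples, text, page_type):
    """Store a manually confirmed (text, page type) pair for later retraining."""
    training_samples.setdefault(page_type, []).append(text)
    return training_samples
```

In this sketch, a page that is flagged for review would be classified by an operator, and the confirmed page type would then be passed to record_feedback so that similar pages can be classified with higher confidence in the future.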

Classification tips