IBM® Content Classification provides text-based classification services for IBM Datacap.
IBM Datacap uses several classification methods to process captured documents, such as barcode recognition, pattern recognition, and fingerprint recognition. Integration with IBM Content Classification provides an additional level of classification that is based on text analytics.
Classification workflow
IBM Datacap creates a batch of document images and extracts text from each page by using OCR (Optical Character Recognition). Using natural language processing and semantic analysis, Content Classification analyzes the text and identifies the page type. It assigns a confidence score (0 to 100) to each category in a predefined knowledge base. The knowledge base contains categories that correspond to page types in the IBM Datacap document hierarchy. Content Classification returns the classification results to IBM Datacap, and the complete batch is classified.
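The workflow above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `Page`, `classify_batch`, and `toy_scorer` are hypothetical names, and the keyword counter is a stand-in for the analysis that Content Classification performs, not its actual algorithm or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: maps a category name to a confidence score from 0 to 100.
Scores = Dict[str, float]


@dataclass
class Page:
    page_id: str
    text: str             # text extracted from the page image by OCR
    page_type: str = ""   # filled in after classification
    confidence: float = 0.0


def classify_batch(pages: List[Page], score_page: Callable[[str], Scores]) -> List[Page]:
    """Assign each page the category with the highest confidence score."""
    for page in pages:
        scores = score_page(page.text)
        if scores:
            best = max(scores, key=scores.get)
            page.page_type = best
            page.confidence = scores[best]
    return pages


def toy_scorer(text: str) -> Scores:
    """Stand-in for the classification service: a naive keyword counter."""
    keywords = {"Invoice": ["invoice", "amount due"], "Claim": ["claim", "policy"]}
    lowered = text.lower()
    hits = {cat: sum(lowered.count(k) for k in words) for cat, words in keywords.items()}
    total = sum(hits.values())
    # Scale the hit counts so that the scores fall in the 0-100 range.
    return {cat: 100.0 * n / total for cat, n in hits.items()} if total else {}


batch = [Page("p1", "Invoice 1234, amount due 50.00"), Page("p2", "Claim form for policy 77")]
for page in classify_batch(batch, toy_scorer):
    print(page.page_id, page.page_type, page.confidence)
```

Passing the scorer in as a callable keeps the batch loop independent of how the scores are produced, so the toy scorer could be swapped for a real service call.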
Configuration overview
In Content Classification, you create a knowledge base of categories that correspond to page types. To build the knowledge base, you can configure IBM Datacap to export extracted text files to a folder structure, and then import these files into Content Classification. You train the knowledge base to classify by providing sample pages of each type, either one by one by using IBM Datacap, or in bulk by using Content Classification applications.
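As an illustration of the export step, the following sketch writes extracted text into one subfolder per page type and then groups the files by category. The folder layout and the function names `export_text` and `list_samples` are assumptions made for this example; the layout that IBM Datacap actually produces might differ.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def export_text(root: Path, page_type: str, page_id: str, text: str) -> Path:
    """Write one page's extracted text under a folder named after its page type."""
    folder = root / page_type
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{page_id}.txt"
    target.write_text(text, encoding="utf-8")
    return target


def list_samples(root: Path) -> Dict[str, List[Path]]:
    """Group exported text files by category (assumed: one subfolder per page type)."""
    return {
        folder.name: sorted(folder.glob("*.txt"))
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Example run in a temporary directory so that nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    export_text(root, "Invoice", "p1", "Invoice 1234, amount due 50.00")
    export_text(root, "Claim", "p2", "Claim form for policy 77")
    for category, files in list_samples(root).items():
        print(category, len(files))
```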
In IBM Datacap, you configure connection settings and specify the Content Classification knowledge base to use. Two primary actions in IBM Datacap interact with Content Classification:
- FindFingerprintICM
  This action identifies a page type by using Content Classification. It analyzes the full text of each page and attempts to match the page to a category in the knowledge base. If a match is found, the page type and confidence are populated with the most relevant category and its confidence level. If no match is found, the page type is set to Other. Exceptions and low-confidence results are reviewed and classified manually. The Content Classification knowledge base is updated with the results of the manual classification.
- UpdateKnowledgeBaseICM
  This action provides feedback to the knowledge base by associating the text of the current document with its type, which improves the classification results of similar documents in the future.
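The decision logic described for these two actions can be sketched as follows. This is a simplified model, not the actions' implementation: the function names are hypothetical, the review threshold of 70 is an assumed value that the source does not state, and an empty score map is assumed to mean that no match was found.

```python
from typing import Dict, List, Tuple

REVIEW_THRESHOLD = 70.0  # assumed cutoff for manual review; not taken from the source


def decide_page_type(scores: Dict[str, float]) -> Tuple[str, float, bool]:
    """Return (page_type, confidence, needs_review) for one page."""
    if not scores:
        # No match: the page type is set to Other (flagging it for review is an assumption).
        return "Other", 0.0, True
    best = max(scores, key=scores.get)
    confidence = scores[best]
    return best, confidence, confidence < REVIEW_THRESHOLD


def record_feedback(feedback: List[Tuple[str, str]], text: str, confirmed_type: str) -> None:
    """Queue a (text, confirmed type) pair, mimicking feedback to the knowledge base."""
    feedback.append((text, confirmed_type))


pages = {
    "p1": {"Invoice": 92.0, "Claim": 8.0},
    "p2": {"Invoice": 55.0, "Claim": 45.0},
    "p3": {},
}
for page_id, scores in pages.items():
    print(page_id, decide_page_type(scores))
```

In this sketch, a reviewer who corrects a flagged page would call `record_feedback` with the page text and the confirmed type, which stands in for the feedback step.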
Classification tips
- For best results, combine Content Classification textual analysis with other IBM Datacap classification methods such as barcodes and fingerprinting.
- Content Classification learns to classify by example: the more examples you provide, the more accurate the classification results are. Provide at least 30 sample pages per page type (category) when you build your Content Classification knowledge base.
- In Content Classification, you can select an option to store learning data with your knowledge base. This can be helpful if you want to modify and rebuild your knowledge base in the future. For more information, see Saving learning data.
- If you are building your knowledge base by reviewing pages one by one and submitting feedback in IBM Datacap, set the following node activation properties for your knowledge base in :
  - Positive feedbacks: 4
  - Negative feedbacks: 36
  - Total feedbacks: 40
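The recommendation of at least 30 sample pages per page type can be checked before training with a short script. This is a minimal sketch: the function name and the sample counts are made up for illustration.

```python
from typing import Dict, List

MIN_SAMPLES = 30  # minimum sample pages per page type, as recommended in the tips


def undertrained_categories(sample_counts: Dict[str, int], minimum: int = MIN_SAMPLES) -> List[str]:
    """Return the categories that have fewer sample pages than the minimum."""
    return sorted(cat for cat, count in sample_counts.items() if count < minimum)


# Illustrative counts only; real counts would come from your own sample set.
counts = {"Invoice": 42, "Claim": 12, "Letter": 30}
print(undertrained_categories(counts))
```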