NormalizeCCO
This action sorts the words and lines in a Fingerprint file (.cco) created by a recognition engine.
Returns
True if the CCO is created. Otherwise, False.
Level
Page level.
Details
The CCO file contains the recognized text. It is used for click-n-key during validation, pattern match actions, locate actions, snapping full-page recognized text to fields, and by other Datacap components that operate on recognized text. NormalizeCCO takes the text as recognized by a recognition engine and organizes it into lines. If this action is not called, text that appears on the left and right sides of a page, and that would visually be read as one line, might not be treated as the same line. As a result, actions such as the Locate actions can behave in unexpected ways.
The action is automatically called by some full-page recognition actions, such as OCR/A. This action should always be called before using Locate actions or the pat_RecogMatch_ID action to find recognized text on a page. In this context, the fingerprint is calculated for a particular image in a batch, as opposed to the Fingerprint database, which contains fingerprints for various page types and layout variations that have been defined for a particular application.
There are two types of Fingerprint files. One type is based on the image geometry. The second type is based on recognized text. The AnalyzeImage action creates a geometric fingerprint containing lines and "words" based only on the black pixels in the image. Full-page recognition actions such as RecognizePageOCR_S, RecognizePageICR_C, and RecognizePageOCR_A create a fingerprint based on the results of recognition, that is, both the geometry and the text of the recognized characters, words, and lines.
In recognition-based fingerprints, the order of lines and words may appear to be arbitrary, especially if the page contains images, tables, stamps, or blocks of text with varying font sizes. This can cause unpredictable results from Locate actions that navigate geometrically. The word-matching and phrase-matching action pat_RecogMatch_ID also requires well-ordered text to work reliably.
The NormalizeCCO action re-orders the text in a recognition-based fingerprint into lines and words in "standard" reading order, from top to bottom and left to right. The LineHeightOverlapPercent parameter is the percentage of a line's height by which two words must be offset to be considered as being on separate lines. By default, LineHeightOverlapPercent is set to 50 percent.
Set a lower value if words on different lines are offset from each other by less than the default of 50 percent. It may be necessary to test multiple values on a wide variety of pages to find a value that works well for all of your documents.
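The effect of LineHeightOverlapPercent can be illustrated with a small Python sketch. This is not Datacap code and not necessarily the algorithm the action uses internally: the Word class, the group_into_lines function, and the rule of comparing each word's top edge against the first word of the current line are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical recognized word with its bounding box in pixels."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def group_into_lines(words, line_height_overlap_percent=50):
    """Group words into lines, ordered top to bottom and left to right.

    Assumed rule: a word starts a new line when its top edge is offset
    from the first word of the current line by at least the given
    percentage of that first word's height.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        if lines:
            anchor = lines[-1][0]
            height = anchor.bottom - anchor.top
            offset = word.top - anchor.top
            if offset * 100 < line_height_overlap_percent * height:
                lines[-1].append(word)
                continue
        lines.append([word])
    # Within each line, order the words left to right.
    return [sorted(line, key=lambda w: w.left) for line in lines]


# Words in the arbitrary order an engine might return them.
words = [
    Word("42.00", 500, 106, 560, 126),
    Word("Invoice", 50, 40, 150, 60),
    Word("Total:", 50, 100, 110, 120),
]

for percent in (50, 25):
    result = group_into_lines(words, percent)
    print(percent, [[w.text for w in line] for line in result])
```

In this sample, "Total:" and "42.00" are offset by 6 pixels on a 20-pixel-high line (30 percent). With the default of 50 they fall on the same line; with a value of 25 they are split into separate lines, which is why the value usually needs testing against representative pages.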
Some recognition actions, such as RecognizePageOCR_S, do not automatically call NormalizeCCO. For these actions, NormalizeCCO must be called explicitly in the rules after recognition to normalize the CCO for subsequent actions that use the CCO results.
Conversely, for specific applications it may be better to leave the recognition results as they are returned from the engine. In that case, for actions that automatically perform NormalizeCCO, such as RecognizePageOCR_A, the application must first call the action CCONormalization_OFF() to prevent the recognition action from automatically performing the normalization.
Because the action MergeCCOs_ByType removes page numbers, NormalizeCCO should not be called after the action MergeCCOs_ByType is used.
Example:
rrSet (35,@P.LineHeightOverlapPercent)
RecognizePageOCR_A
If using the OCR/A or OCR/SR action "Recognize", then first use the action CreateCcoFromLayout to load the CCO after recognition has been performed.
If only field-level recognition is performed, the CCO and CCO normalization do not apply.
Example:
SetFingerprintRecogPriority(True)
RecognizePageOCR_S()
NormalizeCCO()
pat_RecogMatch_ID()