NormalizeCCO

This action sorts the words and lines in a Fingerprint file (.cco) created by a recognition engine.

Returns

True if the CCO is created. Otherwise, False.

Level

Page level.

Details

The CCO file contains the recognized text. It is used for click-n-key during validation, pattern match actions, Locate actions, snapping full-page recognized text to fields, and by other components of Datacap that operate on recognized text. NormalizeCCO takes the text as returned by a recognition engine and organizes it into lines. If this action is not called, text on the left and right sides of a page that visually belongs to the same line may not be treated as one line, which can cause actions such as the Locate actions to behave in unexpected ways.

The action is called automatically by some full-page recognition actions, such as those for OCR/A. This action should always be called before using Locate actions or the pat_RecogMatch_ID action to find recognized text on a page. In this context, the fingerprint is calculated for a particular image in a batch, as opposed to the Fingerprint database, which contains fingerprints for the page types and layout variations that have been defined for a particular application.

There are two types of Fingerprint files. One type is based on the image geometry. The second type is based on recognized text. The AnalyzeImage action creates a geometric fingerprint containing lines and "words" based only on the black pixels in the image. Full-page recognition actions such as RecognizePageOCR_S, RecognizePageICR_C, and RecognizePageOCR_A create a fingerprint based on the results of recognition, that is, both the geometry and the text of the recognized characters, words, and lines.

In recognition-based fingerprints, the order of lines and words may appear to be arbitrary, especially if the page contains images, tables, stamps, or blocks of text with varying font sizes. This can cause unpredictable results from Locate actions that navigate geometrically. The word-matching and phrase-matching action pat_RecogMatch_ID also requires well-ordered text to work reliably.

The NormalizeCCO action re-orders the text in a recognition-based fingerprint into lines and words in "standard" reading order, from top to bottom and left to right. The LineHeightOverlapPercent parameter is the percentage of the line height by which two words must be vertically offset to be treated as belonging to separate lines. By default, LineHeightOverlapPercent is set to 50 percent.

Adjust this variable if the offset between two words on different lines appears to be less than the default value of 50 percent. It may be necessary to test multiple values on a wide variety of pages to find a value that works well for all of your documents.

Some recognition actions, such as RecognizePageOCR_S, do not automatically call NormalizeCCO. For these actions, NormalizeCCO must be called explicitly in the rules after recognition so that the CCO is normalized for subsequent actions that use the CCO results.

Conversely, for some applications it may be better to leave the recognition results as they are returned from the engine. In that case, for actions that automatically perform NormalizeCCO, such as RecognizePageOCR_A, the application must first call the action CCONormalization_OFF() to prevent the recognition action from performing the normalization.

Because the action MergeCCOs_ByType removes page numbers, NormalizeCCO should not be called after MergeCCOs_ByType is used.

Example:

rrSet (35,@P.LineHeightOverlapPercent)

In the example, if two words are vertically offset by more than 35% of the height of the words, then they are considered to be on separate lines. Lower values are more likely to split words into separate lines. Higher values are more likely to merge words into a single line.
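
The effect of the threshold can be illustrated with a minimal Python sketch. This is not Datacap's implementation: the Word type, the group_into_lines function, and the choice of comparing each word against the first word of the current line are all assumptions made for illustration. It only shows how a percentage-of-height threshold decides whether a word joins the current line or starts a new one.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical word record; a real CCO stores more than this.
    text: str
    left: float
    top: float
    height: float


def group_into_lines(words, overlap_percent=50):
    """Group words into lines (top to bottom), then order each line left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.top):
        if lines:
            ref = lines[-1][0]  # first word placed on the current line
            offset = abs(word.top - ref.top)
            # Assumption: the threshold is a percentage of the smaller word height.
            limit = min(word.height, ref.height) * overlap_percent / 100
            if offset <= limit:
                lines[-1].append(word)
                continue
        lines.append([word])  # offset too large: start a new line
    return [sorted(line, key=lambda w: w.left) for line in lines]


sample = [
    Word("Total", 400, 102, 10),
    Word("Invoice", 20, 100, 10),
    Word("Date", 200, 104, 10),
]
one_line = group_into_lines(sample, 50)   # all three words share a line
two_lines = group_into_lines(sample, 35)  # "Date" is split onto its own line
```

With the sample words, a threshold of 50 keeps all three words on one line, while 35 moves "Date" to a second line, which matches the rule that lower values split more readily.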

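The LineHeightOverlapPercent variable should be set before NormalizeCCO() runs. For recognition actions that call NormalizeCCO automatically, such as RecognizePageOCR_A, set the variable before the recognition action.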

Example:

rrSet (35,@P.LineHeightOverlapPercent)
RecognizePageOCR_A()

Important: By default, NormalizeCCO discards any "words" or blocks containing characters taller than 1/4 inch, or taller than the height set by SetMaxCharacterHeightTMM(). If large words are missing from the page, this may be the reason. To confirm, review the action log to see whether large text is being removed. Large text is discarded because items such as company logos are typically not needed and can cause text alignment problems when the CCO is normalized.
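
As a rough illustration of this kind of height filter, the following Python sketch drops words taller than a limit expressed in inches. It is not Datacap's code: the drop_tall_words function, the dictionary layout, and the assumption that heights are in pixels at a known DPI are all hypothetical.

```python
def drop_tall_words(words, dpi, max_height_inches=0.25):
    """Split words into (kept, dropped) using a maximum character height.

    Assumes each word is a dict with a "height" value in pixels and that
    the image resolution (dpi) is known.
    """
    limit_px = max_height_inches * dpi  # for example 0.25 in * 300 dpi = 75 px
    kept = [w for w in words if w["height"] <= limit_px]
    dropped = [w for w in words if w["height"] > limit_px]
    return kept, dropped


page_words = [
    {"text": "ACME", "height": 120},     # logo-sized text
    {"text": "Invoice", "height": 40},   # normal body text
]
kept, dropped = drop_tall_words(page_words, dpi=300)
```

Returning the dropped words alongside the kept ones mirrors the advice above: it makes it easy to check whether missing text was removed by the height limit.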

Note: If the AnalyzeImage action is called before full-page recognition, the recognized text is placed into the geometry created by AnalyzeImage. This hybrid Fingerprint file is not always suitable for cco2cco. To force creation of a pure recognition-based fingerprint, call SetFingerprintRecogPriority(True) before full-page recognition. This guarantees that any existing geometric fingerprint is ignored. This setting applies to OCR_S and ICR_C only.

Note: The full-page recognition actions (RecognizePageXXX) from the ICR_C and RecognizeOCRA libraries call NormalizeCCO() automatically unless the action CCONormalization_OFF (from the Recog_Shared library) is called prior to recognition. Full-page recognition from the OCR_SR library, however, requires NormalizeCCO() to be called manually after recognition.

If using the OCR/A or OCR/SR action "Recognize", then first use the action CreateCcoFromLayout to load the CCO after recognition has been performed.

If only field-level recognition is performed, then the CCO and CCO normalization do not apply.

Example

SetFingerprintRecogPriority(True)
RecognizePageOCR_S()
NormalizeCCO()
pat_RecogMatch_ID()