RecognitionOCRA actions

The OCR_A action library performs full page and field level recognition on images.

The OCR/A action library provides actions that perform full page, field level recognition, rotation of images, creation of PDF files and other operations using the OCR/A engine.

Page and field level recognition can be performed which can then be processed by actions in other action libraries and ultimately displayed to the user for verification or exported to external repositories or file systems.

Machine print and hand print can be recognized. Cursive text is not supported.

It is strongly recommended that the document "Best Practices for optimal text recognition in IBM Datacap" is reviewed to understand how to get the best results from recognition.

Supported Languages

The expected language must be set for correct recognition. If a page could contain multiple languages, language detection will allow multiple languages to be recognized. Specifying languages not on the page could reduce the quality of recognition. When the OCR process is complete, a report on the number of languages detected (and total number of words detected for each language) is generated. This report is stored in the runtime DCO as variables, and can also be found in the layout XML file.

To enable automatic language detection:

Use rrSet or similar action to set the variable "y_lg" to a list of comma separated of at least three languages from the "Supported Languages by Auto Detection" section below.
After specifying the list of languages, call a recognition action, such as RecognizePage, Recognize, RecognizePageFields, etc.

Note: It is recommended that the list of languages be minimized to the languages expected to be processed by the application. This is because the more languages are specified, the slower the processing and can reduce recognition quality by specifying languages that are not on the page.

Languages Supported by Auto Detection

Important: When setting the comma separated list of languages, make sure that the language are spelled exactly as written below and without spaces. An invalid language name will cause the action to abort. The text following the colon : is informational and should not be included in the language specification. Languages that also support hand print are noted by ICR.

Afrikaans : ICR
Agul
Albanian : ICR
Arabic : Arabic (Saudi Arabia)
ArmenianWestern : Armenian (Western)
AzeriLatin : Azerbaijani (Latin), ICR
Bashkir : ICR
Bulgarian : ICR
Catalan : ICR
ChinesePRC : Chinese Simplified
ChineseTaiwan : Chinese Traditional
Croatian : ICR
Czech : ICR
Danish : ICR
Dutch : Nethernlands, ICR
DutchBelgian : Belgium, ICR
English : ICR
Esperanto
Estonian : ICR
Finnish : ICR
French : ICR
German : ICR
GermanLuxembourg : German (Luxembourg), ICR
GermanNewSpelling : German (new spelling), ICR
Greek : ICR
Hebrew
Hungarian : ICR
Indonesian : ICR
Irish : ICR
Italian : ICR
Japanese
JapaneseModern : Expanded character set that includes English characters
Korean
Korean+English : Korean and English
KoreanHangul : Korean (Hangul)
Latin : ICR
Latvian : ICR
Lithuanian : ICR
Mathematical : English and common math symbols such as ±¼½¾×÷ΣΩαβ≈≠≡≤≥≪≫etc.
Norwegian : NorwegianNynorsk and NorwegianBokmal, ICR
NorwegianBokmal : Norwegian (Bokmal), ICR
NorwegianNynorsk : Norwegian (Nynorsk), ICR
OldEnglish : ICR
OldFrench : ICR
OldGerman : ICR
OldItalian : ICR
OldSpanish : ICR
Polish : ICR
PortugueseBrazilian : Portuguese (Brazil), ICR
PortugueseStandard : Portuguese (Portugal), ICR
Romanian : ICR
RussianOldSpelling
Russian : ICR
RussianWithAccent : Russian (with accents marking stress position)
Slovak : ICR
Slovenian : ICR
Spanish : ICR
Swedish : ICR
Tatar
Thai
Turkish : ICR
Ukrainian : ICR
Vietnamese

The language can be bound to the DCO object by selecting it in the OCR_A tab in the zones tab of Datacap Studio. When selected in the OCR_A tab the variable y_lg is set to the language. The language also can be set within rules using the rrSet action to set the y_lg variable to the desired language. For example: rrSet("Italian", "@X.y_lg") will set the language to Italian.

If the s_lg variable is not set for the current DCO object, the recognized language is determined by the current locale set with the hr_locale variable. For example, if the locale is set for Germany, rrSet("de-DE", "@X.hr_locale"), then the text will be recognized as German.

If both the hr_locale and y_lg variables are set, the value in y_lg takes precedence over the locale setting. If y_lg is set but the engine should use the value set for hr_locale instead,setting the variable dco_uselocale to "1" will give precedence to hr_locale.

Examples of Setting More Than One Language:

rrSet("English,French,German", @P.y_lg") - Auto detection of English, French or German
rrSet("English,German,GermanNewSpelling,Norwegian", @P.y_lg") - Auto detection of English, French, German or Norwegian
rrSet("ChinesePRC+@CHR(43)+English", "@P.y_lg") - An exception for specification of Simplified Chinese and English to prevent smart parameter interpretation of +.
rrSet("@STRING(ChinesePRC+English)", "@P.y_lg") - An exception for specification of Simplified Chinese and English to prevent smart parameter interpretation of +.

Hand Print Recognition

The OCR/A actions can recognize hand printed text. Hand print is much harder to recognize accurately than machine generated text, and success depends very heavily on character quality. The use of structured forms to limit the possible range of characters, together with zone-level filters and individual character validation can significantly improve accuracy. Your application will need to handle reduced accuracy of hand print compared to the high quality of machine printed text. In addition to setting restrictions on the recognized text, applications will typically correct the text using a combination of rules and verification. Recognition of hand printed characters will require more time than machine printed text.

Hand print text has these requirements to get the best recognition:

Text should be from pre-printed forms; Drop-out colors work best. This can be drop out combs or boxes.
Letters must not touch.
Block characters work best.
The list of languages and symbols supported by hand print is smaller compared to machine print.
Multi-line fields are supported although you may get better results with single line fields.
Write characters of even size and spacing.
If using a dropout color, do not use a pen of the same dropout color.
Image enhancement can improve recognition by removing lines, combs and darken hand printed characters.
Pencil and felt-tip pens give poorer results.
Cursive text is not supported.
The text must be in a field.
Setting a regular expression in the property dialog associated with the field that specifies the expected characters can improve recognition.
Automatic rotation is not supported when hand print is enabled.
Picture detection is not supported with hand print is enabled.
Table detection is not supported with hand print is enabled.

Hand print recognition can be enabled in the OCR/A tab of the Zones tab in Datacap Studio. Enable editing in the DCO tree and select the target page or field that will be recognized as hand print. In the "Text Type" property, select "HandPrint" to recognize hand printed text.

The OCR/A property tab also contains the property "Writing Style". This property is used to configure the writing style of hand printed text. This setting defaults to "Automatic Detection" where the engine attempts to detect the style. Setting the correct writing style may improve recognition. The character set supported for hand printed recognition is limited. Symbols such as * ^ ™ © ® № § ¡ ¿ ‰ cannot be recognized. Automatic layout analysis is not available for hand printed text.

MICR Recognition

MICR lines recognition for personal checks can be enabled in the OCR/A tab of the Zones tab in Datacap Studio. In the "Text Type" property, select "MICR_E13B", to recognize MICR lines. Alternatively, you can also use the following setting:

rrset( 128 ,@X.y_tt )
RecognizePageOCR_A()

The CCO File

The CCO is a representation of a recognized text on a page and includes character confidence information. Each recognized character has a unique confidence score that is an indication from the engine regarding the perceived accuracy of the recognition. The CCO is required for many Datacap operations such as using the click-n-key feature of a verify panel. Actions that operate on full page text also require the CCO file, such as Locate actions. The OCR/A action RecognizePage will perform recognition and automatically create and normalize the CCO file. The OCR/A action Recognize does not directly create a CCO. It creates a layout file that must be converted to a CCO using the action CreateCcoFromLayout in the SharedRecognitionTools action library. Once the CCO is created, then subsequent features that require the CCO can be used.

Right-To-Left Text

When recognizing right-to-left text, such as Hebrew or Arabic, set the DCO variable hr_bidi to the value "RTL". This will indicate that the text is expected to be treated as right-to-left text. This variable will adjust how the text is stored in the CCO. It will also instruct Datacap Desktop to treat the page or field as right-to-left text when displayed. This variable must be set for any recognition action, RecognizePageOCR_A, RecognizePageFieldOCR_A, Recognize, etc. If hr_bidi is not set, or if it is set to "LTR", the text is considered to be left-to-right oriented text. If the directionality of text being recognized does not match the setting of hr_bidi, character order can be incorrect.

Hierarchical Variables

Variables that start with hr_ are hierarchical variables. Both hr_locale and hr_bidi are hierarchical. A hierarchical variable is special because its value will propagate down to objects below it in the DCO tree. Meaning that if the variable is set at the document level, the variable will also apply to all pages under the document node and all fields under each page node, unless there is another hr_ variable set at a lower level that overrides the higher level value.

For example, if you had a page that contains 10 fields and 9 of the fields should be treated as RTL and one field should be treated as LTR, then you only need to set hr_bidi in two places. First place it at the page level and set it to RTL. This will affect all fields. Then, at the one field that should be treated as LTR, at that field object set hr_bidi to LTR to override the higher level page setting.

Adjusting The Image Using OCR/A Actions

The actions RecongizePageOCR_A, Recognize, RecognizePageFieldsOCR_A and RecognizeToPDF can have the engine adjust the image during the recognition action. The action RotateImageOCR_A also respects the adjustment settings. The actions will first adjust the image based on the enabled image adjustment features and then perform recognition. This is similar to using the Image Enhancement ruleset prior to calling a recognition action. In some cases the recognition engine can be more reliable than the Image Enhancement ruleset because it can understand the character shapes where the Image Enhancement ruleset works based on image geometry only. Depending on the application, it may still be necessary to perform image enhancements prior to recognition. For example, images should be rotated and skewed prior to fingerprinting. Additional image cleanup may also be needed prior to using fingerprints.

The following actions can automatically rotate, deskew and perform image adjustments during processing:

RecognizePageOCR_A
Recognize
RecognizePageFieldsOCR_A
RotateImageOCR_A

If text is not recognized correctly then image enhancement is the first tool to use to improve recognition. For example, if characters are recognized incorrectly or if text is missing completely, the typical first approach is to review the quality of the images that are being recognized. Images should be in the correct orientation, not skewed, text should have sharp solid lines with minimal background noise. If the page has drop out colors or drop out shading that too should be removed prior to recognition. Even if text is clean and clear, it may be helpful to make the text bolder prior to recognition to fix recognition problems. The original image is always saved so the original image can still be used or restored after recognition for archiving.

In cases where image enhancements are needed prior to recognition, these logical pre-recognition steps, or something similar, may be required (these are likely separate rulesets):

RotateImageOCR_A (to rotate, deskew and fix negative issues images).
ImageEnhancement Ruleset (to perform other enhancements such as line removal).
Other types of operations, such as fingerprint matching.
RecognizePage, Recognize, RecognizePageFields or RecognizeFields.

For applications that do not need to separate rotation and deskew from recognition, then the recognition actions have the ability to perform some of these adjustments in one step.

Page Rotation/Orientation

A page must be in the proper orientation prior to recognition. The actions RecognizePageOCR_A RecognizePageFieldOCR_A and Recognize can rotate an image during the recognition step. Alternatively, they could be rotated prior to the recognition step using the action RotateImageOCR_A or the Image Enhancement ruleset, when it is necessary to separate the rotation and recognition steps.

Rotation during the recognition step can be configured using the action SetAutomaticRotationOCR_A. By default automatic rotation is enabled. If images have been previously rotated or are guaranteed to be in the correct orientation, then automatic rotation can be disabled by calling SetAutomaticRotationOCR_A(False) prior to an OCRA recognition action. For example, it may be necessary to call RotateImageOCR_A prior to fingerprinting allowing auto rotation to be disabled when at the recognition step. If an image contains text with various orientations, for example vertical and horizontal, the image might be rotated undesirably.

If the Text Type is set to Hand Print, then the orientation cannot be automatically adjusted.

Page Deskew

Deskewing an image will perform a slight rotation of the image to correct text that is slanted. An image that is deskewed will provide better recognition and perform better in the verification task. If an image is not deskewed, a single line of text could be recognized poorly or be split across multiple lines instead of the text appearing as a single line. When recognition is performed after deskewing an image it improves recognition quality and ensures that the recognized coordinates will match the text position on the adjusted image.

A page can be deskewed prior to recognition using the Image Enhancement actions. Alternatively, the Recognize and RecognizePage actions can also automatically deskew the image in the recognition step. recognition by the recognition engine by setting the variable y_correctSkewMode. If you have the recognition engine deskew the image, then it is not necessary to first deskew using the image enhancement actions but it may be useful to perform other image cleanup prior to recognition depending on the quality of your input documents.

When the recognition engine deskews an image, a backup of the original image will be made and the adjusted image will be saved as the new page image that can be used in future steps, such as Verification. The maximum amount the image will be deskewed is 15 degrees. The engine may or may not deskew an image greater than that amount.

To have the engine deskew the image during recognition, set y_correctSkewMode to one of the following:

0 : Turns off deskew during recognition.
1 : Deskew using horizontal black registration squares.
2 : Deskew using vertical black registration squares.
4 : Deskew using horizontal lines (do not use if lines do not appear on the page).
8 : Deskew using vertical lines (do not use if lines do not appear on the page).
16: Deskew using horizontal text (Default value, if y_correctSkewMode is not set).

When a page is dekewed, it can cause an image's physical size to become slightly larger because the rotation of the image adds additional pixels along the image to preserve the image. For example, a skewed page which is 8.5" x 11" may become 9.265" x 11.59". The amount of increase is dependent on the skew of the image. For full page recognition, this change usually does not matter. If using field level recognition, it can cause zones to line up incorrectly. In this situation, it may be necessary to use registration anchors to align the image to the zones or crop the image to the original size.

Image Geometry Correction

The geometry filter attempts to fix geometrical distortions (perspective on photos, curved lines from scanned books, etc.) on images. This feature is set to automatic by default. To explicitly enable this feature, set the DCO variable y_pdfGeomCorrect to "1". To disable this feature, set the value to "0".

Enhance Contrast

The DCO variable y_EnhanceLocalContrast can be enabled to have the engine increase contrast prior to recognition. When enabled, this setting will increase the contrast of an image prior to recognition. This increased contrast can improve recognition quality, particularly on images that have text against color backgrounds, such as a drivers license. This setting can only be used for color or gray scale images. Setting this DCO variable to "1" will enable the contrast adjustment; "0" will disable the adjustment. If this variable is not set, then the setting will default to "disabled". For Example: rrSet("1","@X.y_EnhanceLocalContrast")

Noise Filter

A noise filter can be applied to the image by setting the DCO variable y_ApplySigmaFilter. By default, the filter is set to automatic. The filter can be specifically enabled by setting the value to "1" or disabled by setting the value to "0". This adjustment intended to adjust an image to improve recognition. It can have an affect of adjusting colors on an image or photo.

Negative Image Correction

The engine can detect if the page is a negative image, where most of the page is white text on a black background. The engine will convert the image to black text on a white background. This feature is disabled by default. It can be enabled by setting the DCO variable y_CorrectNegativeImage to "1". This adjustment intended to adjust an image to improve recognition. It works best on a black and white image. It can be run on a color image but it may adjust colors on an image or photo, even if the image is not a negative image.

Crop Image

This enhancement looks for the edges of the image within a photo and crops the image. By default, the setting is automatic and the engine makes the decision to crop the image. The feature can be disabled by setting the DCO variable y_CropImage to "0" or enabled by setting y_CropImage to "1". If forcing the crop to always be activated, it is recommended that you first test with a set of real images to confirm it is cropping the image appropriately for your documents. The success of the operation can vary. This setting only supports color images.

Resolution Adjustment

When performing recognition, the engine will attempt to fix images of a low resolution or a high resolution by scaling them to a resolution that can be used for recognition. The DCO variable y_OverwriteResolution determines if the engine will adjust the resolution of the image. The valid values are "0" (no adjustment), "1" (image is adjusted), and "2" (The engine automatically determines if the image should be adjusted). If y_OverwriteResolution is not set, the default of "2" (automatic), will be used. If the variable is set to off, "0", and resolution of the image is too low (less than 50 dpi), too high (more than 3200 dpi), or undefined, recognition will fail. The variable y_ResolutionToOverwrite determines the new DPI of the image. If not set, it defaults to 300 dpi.

Adjust Image To A Suggested DPI

The DPI of an image can affect the quality of the text recognition. Images with a low DPI of 150 or less can recognize poorly. This feature instructs the engine to adjust the DPI of an image to what it thinks is best for the image prior to recognition. This setting can be used to increase the DPI of images. If the application will be ingesting low resolution images, this image could improve the recognition quality. Test with real samples to see if this feature consistently improves the recognition of low resolution images. Setting the DCO variable y_UseSuggestedDPI to "1" will enable this feature.

Datacap has actions in other action libraries, such as ImageUtilities, that can also increase the resolution of low resolution images. In some cases, using these actions to increase the resolution to a predefined resolution of 250 or 300 DPI can provide better results than this setting. Rules can be easily configured to check the current DPI of an image, and optionally increase the DPI if the current image has a low DPI. Testing is recommended to determine which provides the best overall results.

Adjust Low DPI Images To A Fixed DPI

Similar to the previous setting, it is possible to automatically adjust images that are less than a specific DPI to a fixed value. For example, if the DPI of an image is less than 150 DPI, it can be adjusted to 300 DPI prior to recognition. The variable y_SetNewDPI controls the new DPI for the image. If this variable is blank, then no adjustment will be performed. Variable y_SetNewDPIMinDPI controls the minimum DPI of an image. If the image is above this DPI, then no adjustment will be performed.

rrSet("300", "@X.y_SetNewDPI")
rrSet("140", "@X.y_SetNewDPIMinDPI")

This example will adjust all images that have a DPI of less than 140, converting them to a DPI of 300.

Shadows and Highlights Correction

The image will be adjusted to correct of excessive shadows and highlighting to improve recognition quality. This filter is intended for use on photos, but in some instances may also improve scanned documents. By default, this filter is set to automatic and the engine will decide if the filter should be applied. The filter can be always enabled by setting the DCO variable y_CorrectShadowsAndHighlights to "1" or disabled by setting it to "0". This adjustment intended to adjust an image to improve recognition. It can have an affect of changing the sharpness of an image or photo.

Backup Image File

When an image requires rotation, deskew or if another image enhancement is enabled, the original image will be saved in a backup file within the batch using a name similar to: "filename.ocra.ext". For example, if the original file name was "TM000001.tif", the new name of the original file will be "TM000001.ocra.tif" and the the file "TM000001.tif" will be updated with the corrected rotation, skew, image adjustment, etc. The suffix added to the original file name can be controlled by setting the DCO variable y_ImageSuffix. If not set, the value will default to ".ocra". The back up image is created in case the original image is needed later. For example, it is common to manipulate an image for recognition and/or verification, and then archive the original image when the batch is complete. The default suffix can be changed like this before calling an OCR_A action that adjusts the image: rrSet(".mybackup",@X.y_ImageSuffix)

If only rotation or deskew enhancements are enabled, the engine will only make a backup of the original image if it has rotated or deskewed the image. If other image enhancements are enabled, it is not possible for the engine to always tell if the image has been updated or the change may not be visibly noticeable. In these cases a backup of the original file will still be created.

PDF Text Extraction versus Text Recognition

When a PDF is recognized, by default, the text included in the recognition results is obtained from a combination of automatic recognition that is run on the PDF and from searchable text that is embedded within the PDF. Any images that are embedded on the page have the text recognized by the engine. If areas of the page contain both an image and searchable text associated with the image, the engine decides whether the engine must use the searchable text or recognize the text from the matching image. Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF. The variable y_contentReuseMode can be used to force the engine only to use the recognized text on the page or only to use the embedded text on the page. One reason why you might decide only to use the embedded text is to prevent recognition and produce high confidence results.

A drawback of only to using the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture that missing data that results in a layout XML that is incomplete compared to what the user sees when the user views the PDF. Do not use this setting if the source PDF file is of the image-on-text type, because in this case, the text layer is not extracted.

If a text line contains characters that are not included in the alphabet of the selected recognition languages, this text is not written to the result and mode 0 or 1 must be used. These settings of y_contentReuseMode can be set on the DCO node that is being converted:

rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.
rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.
rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.

When using the action PDFFREDocumentToImage to simply create images from the source PDF where recognition will be performed in a separate step, set the variable convPdfIgnoreContent priot to calling DocumentFactory to prevent creation of unneded recognition resulsts, and providing faster conversion from PDF to TIF. Refer to the action for more details.

Image only and vector PDFs are supported for recognition and conversion to TIF. Editable PDFs or PDFs with an XFA-form are not supported.

Custom Parameters

These custom settings can be configured by setting the variable on the DCO node that is performing the action. The setting could be hard coded in the setup DCO for the node or can be set at runtime using the rrSet() action. If the setting is added to the setup DCO, then it will automatically apply to all objects of that type and will not require any additional steps or processing time that would occur when using rrSet at runtime. Regardless of the variable being set in the Setup DCO or at runtime, the action will use the variable.

The following variables can be used to set custom parameters for recognition:

Aggressive Text Recognition: An optional setting instructs the engine to perform aggressive text recognition. Setting the variable y_EnableAggressiveTextExtraction to "0" will disable this feature and "1" will enable the aggressive mode. By default this setting is enabled.

This setting will instruct the Engine to try to extract as much text on the image as possible. This mode can help when image contains some low-quality text. This processing mode may lead to mistaken interpretation of pictures as text or vertical rearranging of the horizontal text. It is recommended that you test with this setting on and off to see which mode works better for your documents.

Recognition Mode: By default, recognition, when performed on pages and when converting documents to PDF, uses a standard accuracy setting. Setting the variable y_RecognitionMode to "1" on the node performing the recognition action enables greater recognition accuracy. This could improve recognition quality on some documents.

Detect Text on Pictures: This setting instructs the engine to detect all text on a page image, including text embedded into figures and pictures. This setting is enabled by default. To disable this setting, set the DCO variable y_DetectTextOnPictures to "0".

Detect Porous Text: This setting instructs the engine to detect porous text. This setting is enabled by default. To disable this setting, set the DCO variable y_DetectPorousText to "0".

Detect Matrix Printer Text: This setting instructs the engine to detect dot matrix text. This setting is enabled by default. To disable this setting, set the DCO variable y_DetectMatrixPrinter to "0".

Exhaustive Analysis: This setting can improve recognition at additional cost of speed. It is recommended to test with this variable on and off to determine if it improves recognition for your documents and enable only if needed. This setting is off by default.

To enable this feature, set the DCO variable y_EnableExhaustiveAnalysisMode to "1".

Low Resolution Mode: This is a setting that indicates that the image is a low resolution image. This can improve recognition quality when recognizing faxes, small text on images, images with low resolution or bad print quality. To enable this mode, set the DCO variable y_LowResolutionMode to "1". Disable this mode by setting the variable to "0". If not present, then this mode is disabled by default. If necessary, application rules could be setup to enable this setting if an image is of a small size or low resolution. See the action "SaveImageInformation" which can save image properties to the DCO and then these properties can be tested using the action rrCompareNumeric to control rules based on image properties. For Example: rrSet("1","@X.y_LowResolutionMode")

Photo Mode: The setting indicates whether the processing image should be treated as a photo or as a non-photo image. If the image is a photo, enabling this setting will adjust how the engine recognizes the text on the image and can improve the quality of images captured with a camera. The setting is configured by setting the DCO variable y_PhotoProcessingMode. The setting has three modes: "0" = The image is not a photo, "1" = The image is a photo, "2" = Automatic - the engine will attempt to automatically determine if the image is a photo. If this variable is not set, it will default to "Automatic". For Example: rrSet("1","@X.y_PhotoProcessingMode")

Text Recognition Only: Setting y_EnableTextExtractionMode to "1" will enable "text only" recognition. Tables will not be detected. If text on the page is not found, enabling this setting may cause it to be recognized. If table detection is not needed, enabling this setting may improve recognition on some documents. This setting is off by default. The RecognizePage and RecognizeField actions do not support tables and enabling this setting can improve recognition results. This can also be enabled for the Recognize action, if table detection is not required. For Example: rrSet("1","@X.y_EnableTextExtractionMode"). y_predefined Profile: Set this variable to the name of a predefined profile to use during recognition. Valid values are:

DocumentConversion_Accuracy
DocumentConversion_Speed
DocumentArchiving_Accuracy
DocumentArchiving_Speed
TextExtraction_Accuracy
TextExtraction_Speed

Reducing Memory Usage: If an action is processing a very large document or a large number of pages, it may be possible that more memory is required than is available resulting in an error. Most OCR/A actions have the ability to run in a reduced memory mode. There are two settings that can be configured.

"y_maxPagesForInMemoryProcessing" allows selection of when disk can be used to reduce memory requirements. For example, if building a large PDF document with RecognizeToPDF, or splitting a large PDF document with PDFFREDocumentToImage, "y_maxPagesForInMemoryProcessing" will instruct the engine when to use disk. For example, if set to "50", then if the document contains less than 50 pages, only memory will be used. If more than 50 pages are in the document, then disk will be used, saving memory.

"y_LowMemoryMode" instructs the engine to use as little memory as possible. When this variable is set to "1", then low memory mode is enabled.

The variables will need to be set prior to calling the problem action. For example:

rrSet("10", "@X.y_maxPagesForInMemoryProcessing")
rrSet("1", "@X.y_LowMemoryMode")
Recognize()

Limiting Simultaneous Recognition CPU Utilization: Some OCR/A actions, such as Recognize and RecognizeToPDFOCR_A, can perform simultaneous recognition of pages within a single batch. The simultaneous recognition allows pages to be recognized in parallel instead of sequentially. By default, the engine will decide how many CPUs that will be used to perform simultaneous recognition. If the machine has other processes with significant CPU use, such as other Rulerunner processes, then it may be possible to over saturate the machine. While running a machine with high CPU usage is generally desirable to maximize throughput, a machine that is over saturated can cause starvation for some processes and reduce throughput. There are numerous factors, often unique factors, for each environment.

The number of CPUs used by the OCR/A action can be restricted by setting the DCO variable _RecognitionProcessesCount to the maximum number of CPUs that should be used. By default, this variable is set to "0" which lets the engine decide how many CPUs to use. With this default setting a single batch under a single Rulerunner process could use 70% - 100% of the CPUs. If two or more Rulerunner processes are configured and are allowed to use all of the CPUs, it may reduce performance. Setting this variable can control the number of CPUs use by the recognition action.

If the machine has 8 CPUs, the default value may or may not use all available CPUs. If there are 8 CPUs and the variable is set to 8, then a single process will attempt to use up to 8 CPUs, as long as there is sufficient work.

If there are 8 CPUs and y_RecognitionProcessesCount is set to 4, then a single recognition action will use roughly 50% of the CPUs. The utilization may move up and down depending on the current step within action, the input documents, and other processes running in the system. If you have an 8 CPU machine and two Rulerunners using simultaneous recognition, then setting the maximum CPUs to 4 will limit the action to 4 CPUs, so if two Rulerunners are running the same recognition action at the same time, that will maximize up to 8 CPUs. Of course, each application has its own unique set of rules and actions and the best value will need to be determined through testing. It may be easier to find the an optimal setting by performing recognition only in its own task and then deciding how many Rulerunners to run the recognition task.

This setting will only affect specific OCR/A actions and will not change the operation of actions from other action libraries.

Assuming the recognize action is being run on a document level node, this example will limit the simultaneous processing to 4 CPUs:

rrSet("4","@X.y_RecognitionProcessesCount)
Recognize()

As with any DCO variable, the variable could be set at runtime using the rrSet action or could be predefined in the Setup DCO using Datacap Studio.

Additional Recognition Properties: The properties in the OCR/A tab of the Zones tab in Datacap Studio has additional properties that can be set to adjust recognition. The OCR/A tab has a number of additional properties that can be set to control recognition. Properly setting these properties other than leaving them as defaults, can improve recognition accuracy. To set or change a property, select the DCO node, such as a page or field, lock the DCO and the properties can then be changed. After making any selections, save the DCO and unlock it.

"Text Type" is a setting available in the OCR/A properties tab. This setting identifies type of font used on the document to be recognized. By default, if nothing is selected, the engine will recognize using the "Normal" and "Matrix" settings. If any combination of Matrix, Typewriter, OCR_A, or OCR_B settings are used, italic fonts and superscript/subscript will not be recognized. By default "Normal" and "Matrix" are enabled so by default italic, superscript and subscript text will not be recognized. If you need to recognize this text, set the type to "Normal" only.

Tip: Some configuration properties are not available in the OCR/A tab and are configured by setting a DCO variable to enable or disable certain features. Remember that instead of setting variables in the rules at runtime using the rrSet action, a variable can instead be configured in the desired DCO node using the DCO tree for the Setup DCO in Datacap Studio. If a variable is set in the Setup DCO, then it does not need to be set via rules at runtime. For example, if you have 1000 pages of type "Invoice", and if a variable is set at runtime on the "Invoice" page object, then rrSet will run 1000 times to set the variable individually for each runtime DCO node. Conversely, if the variable is set in the Setup DCO, then the variable is already set and does not need to be set in the rules at runtime. DCO variables are all case sensitive regardless of them being set in the Setup DCO or in the Runtime DCO.

Regular Expression Dictionary: When performing field level recognition, it is possible to specify a regular expression which acts as a dictionary to help guide the engine to the correct recognition of characters. The regular expression can be configured in the OCR/A properties tab within the Zones tab of Datacap Studio.

When a regular expression is specified, the resulting recognized words are not strictly limited to words specified by the regular expression. If the engine cannot detect a suitable variation of the word that adheres to the specified syntax, the engine can still create a "best guess" recognized word that does not follow the provided syntax. The regular expression syntax is a subset of standard regular expression syntax. The following table describes the supported regular expression syntax.

Syntax Description	Syntax Example	Result
Any character	c.t	Denotes words like "cat", "cot"
Character from a character range: []	[b-d]ell [ty]ell	Denotes words like "bell", "cell", "dell" Denotes words "tell" and "yell"
Character out of a character range: [^]	[^y]ell [^n-s]ell or: \| c(a\|u)t	Denotes words like "dell", "cell", "tell", but forbids "yell" Denotes words like "bell", "cell", but forbids "nell", "oell", "pell", "qell", "rell", and "sell" Denotes words like"cat" and "cut"
0 or more occurrences in a row: *	10*	Denotes numbers 1, 10, 100, 1000, etc.
1 or more occurrences in a row: +	10+	Allows numbers 10, 100, 1000, etc., but forbids 1
Letter or digit: [0-9a-zA-Z]	[0-9a-zA-Z] [0-9a-zA-Z]+	Allows a single character Allows any word
Capital Latin letter: [A-Z]	[A-Z][A-Z][A-Z][A-Z]	Allows any 4 characters, all upper case words
Small Latin letter: [a-z]	[a-z][a-z][a-z][a-z]	Allows any 4 characters, all lower case words
Capital Cyrillic letter: [А-Я]	[А-Я][А-Я][А-Я][А-Я]	Allows any 4 characters, all upper case words
Small Cyrillic letter: [а-я]	[а-я][а-я][а-я][а-я]	Allows any 4 characters, all lower case words
Digit: [0-9]	[0-9]	Allows any number from 0 to 9
Space: [\s]	[\s]	Allows specification of a space.
Character range: -	-	A dash allows specification of a character range. For example [0-9] would allow any single character between 0 and 9.
Escape character: \	\	A backslash allows a special character to be used as a literal character. For example [0-9\.]+ would allow "123.45" as a valid word.
System character: @	@

Some characters used in regular expressions are “auxiliary,” i.e. they are used for system purposes. As you can see from the list above, such characters include square brackets, periods, etc. If you wish to enter an auxiliary character as a normal one, put a backslash (\) before it. Example: [t-v]x+ denotes words like "tx", "txx", "txxx", etc., "ux", "uxx", etc., but \[t-v\]x+ denotes words like "[t-v]x", "[t-v]xx", "[t-v]xxx" etc.

If you need to group certain regular expression elements, use brackets. For example, (a|b)+|c denotes "c" and any combinations like "abbbaaabbb", "ababab", etc. (a word of any non-zero length in which there may be any number of a's and b's in any order), whilst a|b+|c denotes "a", "c", and "b", "bb", "bbb", etc.

Sample regular expressions:

Regular expression for dates: The number denoting day may consist of one digit (e.g. 1, 2 etc.) or two digits (e.g. 02, 12), but it cannot be zero (00 or 0). The regular expression for the day will then look like this:

((|0)[1-9])|([12][0-9])|(30)|(31)

The regular expression for the month will look like this:

((|0)[1-9])|(10)|(11)|(12)

The regular expression for the year will look like this:

(((19)|(20))[0-9][0-9])|([0-9][0-9])

What is left is to combine all this together and separate the numbers by a period (e.g. 1.03.1999). The period is an auxiliary sign, so we must put a backslash (\) before it. The regular expression for the full date will then look like this:

(((|0)[1-9])|([12][0-9])|(30)|(31))\.(((|0)[1-9])|(10)|(11)|(12))\.((((19)|(20))[0-9][0-9])|([0-9][0-9]))

Regular expression for e-mail addresses: You can make a language for denoting e-mail addresses. The regular expression for an e-mail address may look like this:

[a-zA-Z0-9_\-\.]+\@[a-zA-Z0-9\.\-]+\.[a-zA-Z]+

Optical Mark Recognition (OMR)

OMR detection is a technique where it is determined if a selection mark exists on a page. A common form of this type has one or more empty circles or squares on a page and a selection is indicated by the circle or box being filled in using a dark pencil or pen. The selections can be independent, meaning that they are unrelated selections that allow none, some, or all of them being selected. OMR recognition determines if these areas have been filled in or have been left blank.

For example, one selection could be "Married", indicating if the user is married. The next selection could be "Do you wear a hat?". The two items have no relation to each other. One could be selected, both could be selected or neither could be selected.

Other types of OMR uses would be a selection where there are a group of related marks where only one is expected to be filled in. For example, a payment option could be a credit card selection where the types of credit cards are listed, and only one should be selected. In this scenario, there should also be a verification to confirm that exactly the required number of selections have been performed, that is 'one'.

An optical recognition mark must be configured in a field with a zone created for the location of the optical mark. A two level field is required for OMR. If there is a single OMR entry, then the field must have a sub-field. The sub-field zone must surround the OMR location. The parent field zone location must surround the location of the child field.

If there are a group of OMR selections that are related, then the setup DCO would be set up so a field defines the entire group of marks. In turn, the field would have multiple sub-fields. There must be one sub-field for each selection within the OMR group. Each sub-field is assigned a zone around one of the OMR locations. If there were 3 OMR selections, then the first sub-field would surround the first OMR location, the second sub-field would surround the second OMR location and the zone of the third sub-field would surround the third OMR location. The parent field would then be zoned so that it contains all three sub-field zones. It is not necessary to surround the text, only the zones for the OMR mark.

The sample Survey application included as part of the Datacap Development Kit, DDK, provides examples of OMR detection on a form.

There can be several styles of OMR marks, they could be circular, square, oval. OMR boxes work best when the box is filled in completely. If simply an "X" is placed within the OMR location, it can lower the accuracy of detection.

It is recommended that a form help instruct the user to fill in an OMR completely so it recognizes correctly. While detection can work if the user places an "X", it is not as reliable as correctly filling in the area.

OMR detection also works best on dropout colors. Drop out colors are typically a light colored box that is removed during the scanning process or by subsequent image enhancements. For example, a common dropout color is red and when scanned, scanners can be configured to remove red colors. This will leave an obvious empty area where the OMR should be or leave the OMR mark, if it has been filled in by the user. This will increase the reliability of the detection.

When creating a zone around an OMR box, it is best to zone the area by surrounding the entire visible box with room for alignment movement. If you attempt to create a zone only within the box, or very "tightly" around the box, then alignment errors cause part of the box or circle to appear, or not appear, within the OMR area, creating a false positive. If the zone area is big enough so it always includes the entire area, including the lines, then if there is a slight movement, the OMR box lines can be factored out when determining if there is an intentional mark within the box.

A field is identified as an OMR field by setting the DCO variable RecogType to "4". This variable must be set properly in the DCO for the OMR detection to be successful.

Datacap provides several different mechanisms to detect OMR selection. The SharedRecognitionTools action library action DetermineOMRThreshold is one method of determining if an OMR field is filled in or blank. It allows for some amount of "black" to appear in the OMR selection to account for lines that could not be removed and considers there to be an OMR mark if the amount of black pixels exceeds a specified tolerance.

The RecognitionOCRA actions also support OMR detection. The recognition can be configured to recognize an OMR location and determine if it is selected or not selected.

When using the OCR/A engine for OMR detection, the field must be configured in the OCR/A Properties tab located within the Zones tab of Datacap Studio. There are three OMR properties: Checkmark Type, Length and Multipunch. Checkmark Type instructs the engine on the shape of the OMR mark. Length indicates how many OMR marks are within the field. Multipunch indicates if multiple OMR selections or just a single selection is expected. When a field is configured as an OMR field and the engine settings are set appropriately, the engine detects the field as an OMR field.

An OMR Dictionary is a mechanism that uses DCO settings to provide "User Friendly" meanings to OMR selections. Dictionaries are created using the "Dictionaries" button within the "Document Hierarchy" tab of Datacap Studio. An OMR dictionary contains words that describe the selections within the OMR selection and are also displayed in the verification panel for OMR fields.

A dictionary must be given a unique name. Each dictionary contains a set of words and values. Quite often the Word and Value attributes are set to identical values. The "Word" must match the name of the field that corresponds to it.

For example, in the Survey application, there is an OMR field called Frequency. This field has four sub-fields named: Once, Monthly, Quarterly and Annual. These fields correspond to four OMR marks on the form. When one of the selections is recognized, the dictionary value is used to provide text that is displayed in the verification panel.

A dictionary is "assigned" to an OMR field by creating the variable "DICT" in the DCO and then setting the name of the dictionary as the value.

See Best Practices for optimal text recognition for more information.