OCR_SR actions

The OCR/SR action library provides actions that perform recognition and rotation of images using the OCR/S engine. Recognition can be performed at the page level or the field level. The results can then be processed by actions in other libraries and ultimately displayed to the user for verification. Both machine print and hand print text can be recognized.

Recognition Success

After a recognition operation is complete, the variable RecogStatus is set to indicate the success or failure of recognition. If page-level recognition is being performed, RecogStatus values of 0, 1 or 2 are considered successful. The possible values are:

  • 0 - Success.
  • 1 - Success but no results. The page was empty.
  • 2 - Success. Additional processing such as RotateImage was performed.
  • Any other value - Failure.
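
As a minimal sketch of how a downstream script might interpret these values for page-level recognition, the following Python maps a RecogStatus value to an outcome. The function name and the return shape are hypothetical and are not part of the action library. Only the numeric values and their meanings come from the list above.

```python
from typing import Tuple


def describe_recog_status(status: int) -> Tuple[bool, str]:
    """Return (succeeded, description) for a page-level RecogStatus value."""
    # Meanings taken from the list of RecogStatus values above.
    successful = {
        0: "Success.",
        1: "Success but no results. The page was empty.",
        2: "Success. Additional processing such as RotateImage was performed.",
    }
    if status in successful:
        return True, successful[status]
    # Any other value is treated as a failure.
    return False, "Failure."
```

Note that a status of 1 counts as success even though no text was returned, so a caller that needs text should check for the empty-page case separately.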

Automatic Retry

The OCR/SR actions support automatic retries when recognition takes longer than expected. In rare situations, such as when processing a damaged or atypical image, recognition can fail to complete. When this occurs, the recognition engine is stopped and restarted once the configured timeout is reached. Recognition is then attempted again until the retry count is reached.

By default, when a failure occurs because the timeout was reached, the action is retried once. The default timeout is 180 seconds. These default values can be changed using the SetupAutomaticRetry action in the RecogShared library. No other actions are required to set up the automatic retry feature. Some actions, such as RecognizeToPDFOCR_S, may require more than the default time to complete. The amount of time required increases with the number of pages being processed and also depends on machine and network conditions. For more information about extending the time, see RecognizeToPDFOCR_S.
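
The retry behavior described above can be pictured with the following Python sketch. It is an illustration of the control flow only, not the engine's implementation: the exception class, the function name and the two callables are hypothetical, and only the default values (180 seconds, 1 retry) come from the text.

```python
from typing import Callable, Optional

DEFAULT_TIMEOUT_SECONDS = 180  # default timeout stated in the text
DEFAULT_RETRIES = 1            # default retry count stated in the text


class RecognitionTimeout(Exception):
    """Hypothetical signal that one recognition attempt exceeded the timeout."""


def recognize_with_retry(
    attempt: Callable[[int], str],
    restart_engine: Callable[[], None],
    timeout_seconds: int = DEFAULT_TIMEOUT_SECONDS,
    retries: int = DEFAULT_RETRIES,
) -> Optional[str]:
    """Run `attempt`; on a timeout, restart the engine and try again.

    Returns the recognition result, or None if every attempt timed out.
    """
    total_attempts = 1 + retries  # the first try plus the configured retries
    for _ in range(total_attempts):
        try:
            return attempt(timeout_seconds)
        except RecognitionTimeout:
            # The engine is stopped and restarted after a timeout.
            restart_engine()
    return None
```

With the defaults, a page that times out on the first attempt gets exactly one more attempt before the action reports a failure.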

There are other timeouts in the system, such as the timeouts configured for rulerunner. Older, legacy timeout variables also exist but have been deprecated in favor of the single timeout specified by the automatic retry action. Legacy mode has separate timeouts for the engine and for out-of-process recognition. The legacy timeouts are configurable with the actions SetOutOfProcessTimeoutOCR_S and SetEngineTimeoutOCR_S. The legacy timeout settings can be more difficult to set properly.

Supported Languages

The following languages are supported by OCR_SR:
  • Afrikaans (39)
  • Albanian (23)
  • Catalan (11)
  • Chinese Simplified (120)
  • Chinese Traditional (121)
  • Croatian (21)
  • Czech (17)
  • Danish (7)
  • Dutch (3)
  • English (0)
  • Esperanto (28)
  • Estonian (25)
  • Faroese (52)
  • Finnish (6)
  • French (2)
  • Gaelic Irish (56)
  • Gaelic Scottish (57)
  • German (1)
  • Greek (15)
  • Hungarian (19)
  • Icelandic (8)
  • Italian (13)
  • Japanese (119)
  • Latvian (26)
  • Lithuanian (27)
  • Maltese (14)
  • Norwegian (4)
  • Polish (16)
  • Portuguese (9)
  • Portuguese Brazil (45)
  • Rhaetic (90)
  • Romanian (22)
  • Russian (36)
  • Sami (98)
  • Sami Northern (100)
  • Sami Southern (101)
  • Serbian Cyrillic (30)
  • Serbian Latin (29)
  • Slovakian (18)
  • Slovenian (20)
  • Spanish (10)
  • Swahili (105)
  • Swedish (5)
  • Turkish (24)

The language can be bound to the DCO object by selecting it in the OCR_S tab in the Zones tab of Datacap Studio. When a language is selected in the OCR_S tab, the variable s_lg is set to the numeric value of the language. The language can also be set within rules by using the rrSet action to set the s_lg variable to the numeric value of the desired language. For example, rrSet("2", "@X.s_lg") sets the language to French. If the s_lg variable is not set for the current DCO object, the recognition language is determined by the current locale, which is set with the hr_locale variable. For example, if the locale is set for Germany, then the text is recognized as German. The value in s_lg takes precedence over the locale setting. If s_lg is set but the engine should use the value of hr_locale instead, setting the variable dco_uselocale to "1" gives precedence to hr_locale.
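
The precedence rules above can be summarized in a short Python sketch. The function and the dictionary of variables are hypothetical stand-ins for the DCO object, and the locale string used in the usage note is a placeholder because the format of hr_locale is not described here. Only the variable names and their precedence come from the text.

```python
from typing import Mapping, Tuple


def resolve_language_setting(dco_vars: Mapping[str, str]) -> Tuple[str, str]:
    """Return (variable_name, value) for the setting that decides the language.

    dco_vars is a hypothetical stand-in for the variables on the current
    DCO object.
    """
    s_lg = dco_vars.get("s_lg", "")
    use_locale = dco_vars.get("dco_uselocale", "") == "1"
    # s_lg takes precedence unless dco_uselocale is "1".
    if s_lg and not use_locale:
        return "s_lg", s_lg
    # Otherwise the locale decides the recognition language.
    return "hr_locale", dco_vars.get("hr_locale", "")
```

For example, with s_lg set to "2" (French) and a German locale in hr_locale, the function returns the s_lg entry. Adding dco_uselocale set to "1" switches the result to the hr_locale entry.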

Hand Print Recognition

The OCR/SR actions can recognize hand printed text. Hand print is much harder to recognize accurately than machine generated text, and success depends very heavily on character quality. The use of structured forms to limit the possible range of characters, together with zone-level filters and individual character validation, can significantly improve accuracy. Hand print recognition has a 159-member character set. Your application needs to handle the reduced accuracy of hand print compared to the high quality of machine printed text. In addition to setting restrictions on the recognized text, applications typically correct the text using a combination of rules and verification. The hand print recognition feature is provided as a "Preview" feature.

Hand print text has these requirements to get the best recognition:
  • Text should come from pre-printed forms. Dropout colors work best, and the form can use dropout combo boxes.
  • Letters must not touch.
  • Field level recognition must be used.
  • Block characters work best.
  • The list of languages and symbols supported by hand print is smaller than the list for machine print.
  • Multi-line fields are supported although you may get better results with single line fields.
  • Write characters of even size and spacing.
  • If using a dropout color, do not use a pen of the same dropout color.
  • Pencil and felt-tip pens give poorer results.
  • Maximum line length is 200 characters.
  • Cursive text is not supported.

Hand print recognition can be enabled in the OCR/S tab of the Zones tab in Datacap Studio. Enable editing in the DCO tree and select the target field that is recognized as text. In the "Filling Type" property, select "HandPrint". For the "Module" property, select "RER reRecognition Handprint". Hand printed text is recognized best when the input data can be restricted. Use the "Filter" and "Filter Plus" properties of the OCR/S tab to restrict the input characters. For example, if the input is expected to be numeric only, select the "Numeric" filter. The "Filter Plus" setting can be used to restrict recognition to specific characters, which improves quality.

The following symbols cannot be recognized when hand print is enabled:
  • # Number Sign
  • % Percent Sign
  • @ Commercial At
  • & Ampersand
  • | Vertical Bar
  • $ Dollar Sign
  • * Asterisk
  • + Plus Sign
  • = Equals Sign
  • _ Spacing Underscore
  • / Slash
  • \ Backslash
  • < Less-Than Sign
  • > Greater-Than Sign
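
When designing a form, it can help to check whether the values expected in a hand print field contain any of these symbols. The following Python sketch does that for a sample value. The constant and function names are hypothetical, and the symbol set is copied from the list above.

```python
from typing import List

# The 14 symbols listed above as not recognizable when hand print is enabled.
UNSUPPORTED_HANDPRINT_SYMBOLS = frozenset("#%@&|$*+=_/\\<>")


def unsupported_symbols(expected_text: str) -> List[str]:
    """Return the distinct unsupported symbols in expected_text, in first-seen order."""
    found: List[str] = []
    for ch in expected_text:
        if ch in UNSUPPORTED_HANDPRINT_SYMBOLS and ch not in found:
            found.append(ch)
    return found
```

A field that is expected to hold an email address, for instance, would be flagged because of the "@" character, which suggests that the field is a poor candidate for hand print recognition.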

Supported Hand Print Languages

Code - Language/Territory
  • AL - Albanian
  • AT - Austrian, German
  • BE - Belgian, Dutch, French, German
  • CH - Swiss, French, German, Italian
  • CS - Czech, Slovakian
  • CZ - Czech
  • DE - German
  • DK - Danish
  • EE - Estonian
  • ES - Spanish
  • EU - West-European
  • FI - Finnish
  • FR - French
  • HU - Hungarian
  • IE - Irish, English, Gaelic Irish
  • IT - Italian
  • LT - Lithuanian
  • LV - Latvian
  • NL - Dutch
  • NO - Norwegian
  • PL - Polish
  • PT - Portuguese
  • RO - Romanian
  • SE - Swedish
  • SF - Scandinavia
  • SL - Slovenian
  • SK - Slovakian
  • TR - Turkish
  • UK - UK
  • US - USA

Cyrillic languages and Greek are not supported. In Hungarian, the lower case characters "Small I Acute", "Small O Acute" and the "Small U Acute" are not supported, thereby limiting recognition to upper case characters.

Automatic Rotation, Deskew and Border Removal

The following actions can automatically rotate, deskew and remove the border of the image during processing:
  • RecognizePageOCR_S
  • Recognize
  • RecognizePageFieldsOCR_S
  • RotateImageOCR_S

When automatic rotation, automatic deskew or border removal is enabled, the listed actions adjust the image prior to recognition. Images that are oriented correctly and deskewed are recognized better and produce recognized text that is more consistent with the source page. When automatic rotation is enabled for the RecognizePageOCR_S, Recognize or RecognizePageFieldsOCR_S actions, it is not necessary to pre-rotate the image using RotateImageOCR_S, RotateImageExOCR_S or other image rotation actions.

Depending on the application, it may be desirable to rotate and deskew prior to calling a recognition action. For example, when performing fingerprinting, images should first be rotated and deskewed along with other Image Enhancement actions such as despeckle and line removal. An application can have other functions that require similar Image Enhancements prior to recognition. If images are enhanced prior to performing recognition, then these features can be disabled for the recognition step since they do not need to be run a second time. If the application does not need to preprocess an image, then rotation can be performed directly in the recognition actions in one step.

Automatic rotation, deskew and other image enhancement settings only work on image files, and do not adjust PDF files that are directly recognized. To rotate, deskew and enhance PDF pages prior to recognition, use the PDFFREDocumentToImage action in the Convert action library to first convert the PDF to separate TIF images without recognizing them, perform image enhancement and correction as necessary, then use RecognizePageOCR_S, Recognize, RecognizePageFieldsOCR_S or RecognizeFieldOCR_S to perform recognition on the corrected images.

Automatic Rotation

Automatic rotation can be enabled by setting the DCO variable "s_autorotate" to "1" or disabled by setting it to "0". Automatic rotation is off by default. It is important to properly set the language for the DCO object when enabling the automatic rotation feature, or when using RotateImageOCR_S or RotateImageExOCR_S.

Automatic Deskew

Automatic deskew can be enabled by setting the DCO variable "s_deskew" to "1" or disabled by setting it to "0". Automatic deskew is off by default. If an image contains text with various orientations, for example vertical and horizontal, the image might be rotated undesirably. The automatic image rotation algorithm relies on, and works best with, images that contain good quality machine printed text. The page should include at least one line of machine printed text that is at least 30 characters long.

The automatic rotation algorithm does not fully work with images containing 9-pin dot-matrix text or non-machine printed text. Automatic deskew is effective only on images with a skew of less than 15 degrees. If the image is skewed by more than that amount, the image may or may not be adjusted.

Border Removal

Border removal cleans marginal shadows that may result from scanning or from deskewing an image. It does not crop the image. Instead, it changes the black border around the image to white. Border removal is off by default. To enable it, set the DCO variable "s_removeborders" to a positive integer that gives the width of the border to remove, in pixels. If you have documents that need to be adjusted, a typical starting value is 50 pixels. From that value, test with real data to see whether the number needs to be adjusted up or down.
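
To make the difference between filling and cropping concrete, the following Python sketch whitens a border of a given pixel width on a small grayscale image held as a list of rows. This is a conceptual illustration only and is not how the engine is implemented. The function name and the 8-bit grayscale representation are assumptions.

```python
from typing import List

WHITE = 255  # assumed 8-bit grayscale, where 255 is white


def whiten_borders(image: List[List[int]], border_width: int) -> List[List[int]]:
    """Return a copy of image with a border of the given width set to white.

    The image dimensions are unchanged, so nothing is cropped. A width of
    0 or less returns an unchanged copy, mirroring the disabled state.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    if border_width <= 0:
        return result
    for y in range(height):
        for x in range(width):
            near_edge = (
                y < border_width
                or y >= height - border_width
                or x < border_width
                or x >= width - border_width
            )
            if near_edge:
                result[y][x] = WHITE
    return result
```

Because pixels inside the border width are overwritten regardless of their content, a width that is too large can erase text near the page edge, which is why testing with real data is recommended.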

If the s_removeborders variable is blank or set to 0, then border removal is disabled.

Features like rotation and deskew can be more reliable when performed by a recognition engine than when performed by the image enhancement actions, which work based on image geometry. However, other types of image enhancement, such as despeckle, noise removal, line removal and other image cleanup steps, are best handled by the image enhancement actions. The separate Image Enhancement ruleset allows interactive use of settings, so values can be entered and the results seen immediately in the configuration window. This interactive mode helps guide you to the set of enhancements that gives the clearest letters with a clean background. When setting up an application, some experimentation with the available approaches to image enhancement helps find the right combination for the expected input documents. Additionally, the Image Enhancement ruleset saves unique image adjustment settings for each page type, so different enhancements can be performed for each page type. These different approaches may need to be used together to prepare a document for recognition.

Original Image Backup

When the OCR/SR actions rotate an image or perform other enhancements on it, the original image is saved in a backup file within the batch using a name similar to "filename.ocrs.ext". For example, if the original file name was "TM000001.tif", the original file is renamed to "TM000001.ocrs.tif" and the file "TM000001.tif" is updated with the corrected rotation and skew. The suffix added to the original file name can be controlled by setting the DCO variable s_ImageSuffix. If the variable is not set, the default value is ".ocrs".
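
The naming rule can be expressed as a short Python sketch. The function name is hypothetical. The default suffix and the example file name come from the text, and the sketch assumes that a custom s_ImageSuffix value includes its leading period, as the default does.

```python
import os

DEFAULT_IMAGE_SUFFIX = ".ocrs"  # default stated in the text


def backup_file_name(file_name: str, image_suffix: str = "") -> str:
    """Return the backup name for file_name.

    image_suffix stands in for the value of s_ImageSuffix. An empty value
    falls back to the default suffix.
    """
    suffix = image_suffix or DEFAULT_IMAGE_SUFFIX
    # Insert the suffix between the base name and the file extension.
    stem, ext = os.path.splitext(file_name)
    return f"{stem}{suffix}{ext}"
```

For example, backup_file_name("TM000001.tif") gives "TM000001.ocrs.tif", matching the example above.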

OCR/S Engine Settings

The OCR/S tab in the Zones tab of Datacap Studio contains settings that can be used to configure the engine. Each setting has a description that identifies what it configures. These settings help tailor the engine to improve the recognition of your specific types of input documents. For example, this is where the recognition language can be set for each document type within the application.

To configure a setting for a page or field level object, first lock the DCO, select the target DCO node, then make the appropriate settings in the OCR/S tab. When complete, save the settings and unlock the DCO.

The Filter and Filter Plus settings are field level only settings that specify the expected character sets. These settings improve field level recognition by giving the engine information about the expected characters. For example, if a field is numeric only, specifying the numeric characters gives the engine hints to recognize a "1" as a numeric "1" and not a lower case L "l". These filters can be tailored to the expected set of characters, reducing incorrect characters and substitution characters. Restrict the input character sets as much as possible to prevent incorrect characters.