Recognize (deprecated)

This action is deprecated. Refer to action Recognize. Recognize refers to the settings in the OCR/A tab of the Recognition Options Setup dialog to recognize all characters on a page, and populates the page's Fingerprint file (.cco) with the recognition results.

The Recognize action performs document layout analysis and OCR, generating a layout XML file such as TM000001_layout.xml. The layout file groups text into blocks the way a person reading the document would. Each block has either the default type of Block or a specific type such as Title or Table. Locate actions in the DocumentAnalytics action library, such as GoSiblingBlockNext, navigate the block structure. This is in contrast to the CCO file produced by other actions, which groups text into lines that span the width of the page.

The Recognize action can still be used with Datacap actions that use the CCO. To use the Locate actions and perform click-n-key during verification, call the CreateCcoFromLayout action in the SharedRecognitionTools action library after producing the layout XML file. This action creates a CCO file for the page. Once the CCO has been created, subsequent actions that use the CCO can be used.

Member of namespace

ocr_a

Syntax

bool Recognize ()

Parameters

None.

Returns

False if the ruleset with this action is not bound to a Page object of the Document Hierarchy. Otherwise, True.

Level

Page level.

Details

Refer to the Best Practices for optimal text recognition document to understand the best image settings, such as DPI, for optimal recognition. Additionally, images should use a lossless compression such as FAX or LZW, and should not use a lossy compression such as JPEG, which is intended only for photos. If JPEG is used at any step in the process, it degrades the crispness of character lines and reduces recognition quality.

The layout file contains the results of recognition. Heuristic algorithms identify the text elements on the page, so the elements detected can differ between pages that look the same or use the same form. There is no guarantee that a particular element always exists or that blocks or elements always appear in the same order within the XML.

The layout XML file also retains font and color attributes for the text, saved in CSS format. These attributes are used for extracting data and for reconstructing the document in a new format.

The following are the types of elements that might be present in the layout XML file:

Table 1. Block types versus XML node
Block type	Node in the layout XML file
Block	Block
Header	Header
Footer	Footer
Title	Title
Heading1	H1
Heading2	H2
Heading3	H3
Picture	Picture
Barcode	Barcode
Space	S
Tab	Tab
Table	Table
Row	Row
Cell	Cell
Paragraph	Para
Line	L
Sentence	Sent
Word	W
Character	C

The layout file is intended for use by other Datacap actions, such as CreateCcoFromLayout and FindTableValueRegEx, and the list of possible components above is intended to help you use those actions. Direct reading and use of the layout.xml file by custom actions or processes is not supported. Additionally, the format and content of the file can change in future releases without notice.

Supported file formats

The Recognize action can process black and white images, color images, and PDF files. The typical use of this action is to recognize text on image files such as TIFF images. It is highly recommended that images are saved using lossless compression.

When processing PDF documents, the action decides whether to use the embedded text or the image on the page. A textual image that does not have embedded text is also recognized. Although this action can recognize a PDF directly, it places the recognition results in a layout file and does not create an extracted TIFF image for the recognized page, so the results cannot be used by verify panels or other actions that operate on images. If this action is given a multi-page PDF, it creates only a single layout file and does not create a DCO page object for each page within the PDF.

When recognizing a PDF document, it is recommended to first use the action PDFFREDocumentToImage in the Convert action library, because it also creates an image file and a DCO node for each page in the document. Using PDFFREDocumentToImage also allows the various image enhancements provided by Datacap actions to be used to improve image quality prior to recognition.

Tip: It is also possible to process Microsoft Excel, Microsoft Word, HTML, RTF, and TXT documents by first converting those documents to searchable PDFs via the Convert library. For example, a Microsoft Excel document can be converted to PDF by calling the action ExcelWorkbookToPdf. After the PDF document is created, it can be processed through the PDFFREDocumentToImage and Recognize actions.

Loading the CCO

The CCO is how verification panels and many subsequent actions work with the recognized text for the page. The CCO contains all of the text and the position of each recognized character. It allows text-related actions, such as the Locate actions, to operate, and it must be loaded for verification panels to use the click-n-key feature. The RecognizePageOCR_A action creates the CCO automatically in one step. The Recognize action does not create the CCO directly; instead, it creates an intermediate layout file. To create a CCO from the layout file, call the action CreateCcoFromLayout in the SharedRecognitionTools action library after performing Recognize. Once CreateCcoFromLayout has run, the CCO exists and subsequent actions and verify panels can use it. Depending on the data, you may need to call NormalizeCCO after calling CreateCcoFromLayout.
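
The sequence described above can be sketched as a page level rule. This is a minimal sketch: the action names come from this topic, but the empty parameter lists shown for CreateCcoFromLayout and NormalizeCCO are an assumption and should be checked against the help topics for those actions.

Recognize()
CreateCcoFromLayout()
NormalizeCCO()

NormalizeCCO is included only for illustration. Omit it if the CCO produced by CreateCcoFromLayout already works for your data.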

Simultaneous page recognition with the Recognize action

Note that simultaneous page recognition at the batch or document levels is provided as a "Preview" feature. Typically, the Recognize action is attached to a page level DCO node. As the rules engine iterates through the page objects, it runs the rules, including the Recognize action, for each page in turn. Within a single batch this causes the pages to be recognized serially, one page after the other. The Recognize action can also be attached to a document or batch node to perform simultaneous recognition within the same batch.

When called at the document or batch level, the Recognize action performs simultaneous recognition of the pages under the calling object. If called on a document object, all of the pages under that document are recognized simultaneously. If called on a batch object, all of the pages in the entire batch are recognized simultaneously.

This is similar to running multiple Rulerunner batches performing recognition, but the process is not identical, so there is no guarantee that the recognition speed will be the same. For example, concurrent recognition of 8 pages contained in a single batch may not take the same length of time as 8 batches of 1 page each in separate Rulerunner processes. Nevertheless, using the concurrency feature can shorten the time it takes to perform recognition in a single batch. Recognition time also depends on the types of documents recognized and on the load on the machine from other processes and from Rulerunners that are processing separate batches.

The engine determines how many pages to recognize simultaneously based on the number of pages and the machine's attributes. It is not possible to control how many pages are recognized at the same time. The engine typically attempts to recognize as many pages simultaneously as there are CPUs in the machine, provided that many pages are waiting under the calling DCO node. For example, if there are 20 pages under the current document node and an 8 CPU machine is used, the engine can recognize a maximum of 8 pages simultaneously. As each page in the simultaneous set completes, the engine picks up another waiting page from the current document node.

Testing must be performed to determine the optimal load for the target machine. Factors that affect the recognition time include the DPI of the page, the clarity of the text, the density of the text, the amount of text covering the page, the machine's CPU speed, the CPU count, and the CPU load from competing processes. If an application is configured to perform simultaneous recognition of pages within a single batch, the number of Rulerunner processes running recognition likely needs to be limited to prevent oversaturating the machine. Oversaturation can degrade performance because the machine spreads its time across too many processes that are waiting for a free CPU.

Performing simultaneous recognition on multiple pages requires more memory than recognizing a single page. Generally, pages that contain more words require more memory. An out of memory condition can occur depending on the number of pages being processed and the amount of text on the pages. If this happens, set the y_maxPagesForInMemoryProcessing variable to the number of pages at which reduced memory mode is enabled. For example, if the variable is set to 10, reduced memory mode is enabled when processing 10 or more pages. The variable can be configured in the Setup DCO using Datacap Studio, or it can be set at runtime using rrSet. Additionally, setting the variable "y_LowMemoryMode" to "1" tells the engine to use as little memory as possible. If out of memory conditions continue to occur when these settings are enabled, change the application to reduce the maximum number of pages configured for simultaneous recognition, or call recognition only on page level nodes.

Example to limit memory use:

rrSet("10", "@X.y_maxPagesForInMemoryProcessing")
rrSet("1", "@X.y_LowMemoryMode")
Recognize()

Tip: Remember that instead of setting variables in the rules at runtime, they can instead be configured in the desired DCO node in the Setup DCO in Datacap Studio. If a variable is set in the Setup DCO, then it does not need to be set via rules at runtime. If you have 1000 page objects and if a variable is set at runtime on a page object, then that variable is set 1000 times. Conversely, if the variable is set in the Setup DCO, then the variable is already set and does not need to be set in the rules.

When called on a batch or document node, all of the pages under that node are recognized. It may not be desired, or may be wasteful, to recognize all pages under a node. For example, some pages may be separator pages, other pages may be unimportant or have no need to be recognized. It is possible to control which pages are recognized by setting a DCO variable prior to calling the Recognize action. Three variables are available: typesToExclude, typesToInclude and statusToExclude. Only one of these variables should be configured to specify the types of pages to include or exclude.

To exclude specific page types, set the variable "typesToExclude" to a comma delimited list of page types to exclude from recognition.

To include specific page types, set the variable "typesToInclude" to a comma delimited list of page types to recognize. All other types are ignored.

To exclude pages by status, set the variable "statusToExclude" to a comma delimited list of page statuses to exclude from recognition.

When more than one filter is specified, the following order of precedence applies:

- "statusToExclude" overrides "typesToInclude"

- "typesToInclude" overrides "typesToExclude"

For example, if performing recognition at a document level and the document has page types of Main_Page, Trailing_Page and Separator pages, the variable typesToExclude could be set to exclude Separator pages from recognition by calling rrSet prior to the recognize action like this: rrSet("Separator", "@X.typesToExclude")
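
Including page types works the same way. As a sketch that reuses the page types from the example above, a document level rule that recognizes only Main_Page and Trailing_Page pages might look like this:

rrSet("Main_Page,Trailing_Page", "@X.typesToInclude")
Recognize()

Page types that are not in the list, such as Separator, are ignored.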

Pages that have the variable "y_sr" set to "1" are also skipped, regardless of other settings. This variable can be configured at setup time by selecting "Skip Recognition" from the OCR/A property tab in the Datacap Studio Zones tab. If configured in the Setup DCO, all pages of that type have the variable set at run time. Alternatively, the variable can be set at runtime on pages that should be skipped. For example, if application rules determine at runtime that a page should be skipped, the rules can set this variable. When a follow-on rule calls Recognize, the pages with this variable set are not recognized.
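
As a sketch of the runtime approach, a page level rule that runs only for the pages to be skipped could set the variable, and a follow-on rule at the document level could then perform recognition. How the pages to skip are selected is application specific and is not shown here.

Page level rule, run on each page to skip:
rrSet("1", "@X.y_sr")

Follow-on document level rule:
Recognize()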

Calling Recognize at the Batch or Document level has these page restrictions:

- The page must be a single page image. It cannot be a PDF or multi-page image.

- The engine setting to enable a Single Line Per Cell cannot be enabled.

- Using a field to specify the location of a table is not supported.

When recognizing multiple pages simultaneously at the Batch or Document level, the page recognition settings must be the same and must be set at the level at which the Recognize action is called. All of the pages recognized must use the same engine configuration settings. For example, settings for deskew, automatic rotation, aggressive text recognition, and so on must be made on the DCO object that calls the Recognize action, and the same settings are used for all pages.

If settings have been made on the page level object, they are ignored when Recognize is called at the Batch or Document level. The same language recognition parameters are applied to all pages. If pages of different languages are being recognized in the same call to Recognize, all of the possible languages must be specified, as described in the language configuration discussion in the RecognizeOCRA action library top level help topic. For example, if some pages are in English, some in French and others in Spanish, then all languages need to be specified at the Batch or Document node where the Recognize action is called.

Note: Recognition is most accurate when the list of languages only contains the languages on the page. Additionally, the call to Recognize must have all languages of the same reading direction. You cannot mix some pages that are left-to-right while other pages are right-to-left. If you have pages with different reading directions, they must be recognized in separate calls to Recognize. Review the top level help topic to properly configure recognition of right-to-left text.

Custom parameters

The following variables can be used to set custom parameters for recognition:

y_DetectPictures: By default, the engine uses heuristics to determine areas of the page that appear to be pictures. As with recognition itself, the success of picture detection cannot be guaranteed. Possible issues include a picture not being identified, a non-picture area being mistaken for a picture, and two adjacent pictures being treated as a single picture. Detection is on by default. To disable it, set the DCO variable y_DetectPictures to "0".
y_DetectFontFormatting: y_DetectFontFormatting specifies whether font formatting detection should be performed by the recognition engine at the document level. This can have an effect on the styles listed in the layout XML for the recognized page. In some cases, font sizes may be recognized more accurately if this setting is disabled by setting the value to "0" in the DCO and the setting y_DetectFontFormattingAtPageLevel is enabled. If not defined, this setting is enabled by default.
y_DetectFontFormattingAtPageLevel: If y_DetectFontFormattingAtPageLevel is enabled by setting the value to "1" in the DCO, the engine attempts to detect font parameters, enabling detailed processing of subscripts, superscripts, italic-face type, small capital letters for the page. In some cases, enabling this setting and disabling the y_DetectFontFormatting setting can improve detection of font sizes as specified in the layout XML. If not defined, this setting is disabled by default.
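
For example, to try the combination described above, document level font detection could be disabled and page level detection enabled before recognition. This sketch assumes the variables are set at runtime with rrSet on the object that calls Recognize, in the same way as the other variables in this topic. They can also be configured in the Setup DCO.

rrSet("0", "@X.y_DetectFontFormatting")
rrSet("1", "@X.y_DetectFontFormattingAtPageLevel")
Recognize()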

Table identification

Some recognition engines can identify tables on a page during full page recognition. This functionality is available through the layout files produced by the Recognize action; the other OCR/A recognize actions do not support it. When text is recognized as a table, additional metadata is stored internally about the recognized words. This metadata records the cell information, that is, the row and column position, for the text. Subsequent actions that support block functionality from a layout file can use this table metadata. This is an alternative to the "line items" feature, which processes data recognized by RecognizePageOCR_A as tables without the recognition engine identifying a table.

The Recognize action attempts to detect tables on a page, identifying the rows and columns. If a table is detected, the table structure can be used by subsequent actions that manipulate the recognized tables. As with character recognition, 100% accuracy of a table's rows and columns is not guaranteed. Tables are detected using heuristics, so results can differ considerably between pages that contain similar tables.

If this feature is needed, it is best to test on a variety of tables that you expect to be processing in your application and determine if the accuracy is good enough to provide the functionality that you need. If table recognition accuracy is not good enough for your desired approach, it is recommended that you change your approach.

It is possible that parts of a table might not be recognized as expected even if they visually look good to the human eye. For example, parts of your table might be left outside of the table structure or rows and columns might be combined in ways that might not be expected from visually looking at the table. If the tables in your documents are not being identified well by the engine, then you might consider a different approach, such as using line items, to process tabular data within your documents.

The following tips can help to improve the recognition of your table:

Tables must be well-defined with grid lines that identify the rows and columns of the table.
Do not perform line removal if you are attempting to detect table structures.
Cells cannot intersect each other.
All cells must have a rectangular shape.

Future product updates may change how blocks and tables are detected. If the application uses blocks or table detection to process data, always test to confirm the application still works as desired. If not, application changes may be necessary to accommodate the structure and detection changes.

Table detection

Table detection is enabled by default and can be disabled by setting the page level variable y_DetectTables to "0". For example: rrSet("0","@X.y_DetectTables").

Forcing single lines per table cell

If each cell of your table contains only a single line of text but the engine is recognizing multiple lines per table cell, the engine can be instructed to treat each cell as a single line by setting the DCO variable y_SingleLinePerCell to "1". For example: rrSet("1","@X.y_SingleLinePerCell").

In some cases, enabling this setting may worsen table detection or may still not produce a single line per cell. It is recommended that you test various tables to determine if this setting is appropriate for your documents.

Split table by separators

When identifying a table layout, the engine uses grid lines and its own heuristics to determine the rows and columns for the table. Sometimes, this can cause the cells created by the engine to be different from what is defined by grid lines on the page. For example, your table may have grid lines that show multiple lines in a cell but instead those lines are interpreted as single line cells. The DCO variable y_SplitOnlyBySeparators can be set to "1" which tells the engine to only use grid lines when identifying the layout of a table. For example: rrSet("1","@X.y_SplitOnlyBySeparators"). When enabled, the setting tells the engine to use the grid lines to guide the identification, providing a table layout that should more closely match the visible grid lines. This setting is off by default.

If y_SplitOnlyBySeparators is enabled, the engine does not attempt to recognize a table without grid lines.

Aggressive table detection

An optional setting instructs the engine to perform aggressive table detection. Setting the variable y_AggressiveTableDetection to "0" disables this feature and "1" enables the aggressive mode. By default this setting is enabled. It is recommended that you test with this setting on and off to see which mode works better for your documents.
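
For example, to test with aggressive detection turned off, assuming the variable is set on the page with rrSet like the other table settings in this topic:

rrSet("0", "@X.y_AggressiveTableDetection")
Recognize()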

Exhaustive analysis

The exhaustive analysis mode can be enabled to further increase the quality of page layout analysis in complex cases. The feature can be enabled by setting the page level DCO variable y_EnableExhaustiveAnalysisMode to "1" prior to calling the Recognize action. Use of this setting can improve table detection but can also slow recognition speed. This setting is off by default.
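
As a sketch, a page level rule that enables the mode before recognition could look like this:

rrSet("1", "@X.y_EnableExhaustiveAnalysisMode")
Recognize()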

Zoning the table location

In some cases table identification can be improved by identifying the table location / zone. This can improve table identification when the engine is not detecting the table boundaries well or when the table does not have grid lines to identify the table boundaries.

If the page can be fingerprinted so that the location of the table is always known, the table area can be zoned to identify the table boundaries. To identify the table zone to the recognition engine, specify the name of the field that contains the table zone using the variable y_TableZone. For example, if the current page has a field called "MyTable" and the zone for that field identifies the table location on the page, then the action rrSet("MyTable", "@X.y_TableZone") allows the engine to use the zone to identify the table.

If a table is recognized using the user provided zone, sub-fields are created off of the y_TableZone specified field with the table contents. If there are multiple tables on the page, only one table can be identified using this method. If you need to identify multiple tables on a page then you must use auto detection or process the page multiple times with different zones for each table or take a different approach. If the location of the table cannot be predicted or identified prior to recognition, then you must use automatic table detection or take a different approach.

Telling the engine the table zone can provide better results in situations where automatic table detection is making mistakes, such as when there are no grid lines. However, this technique still cannot guarantee 100% accuracy, and you may need to take additional custom steps to massage the recognized data for your needs.

Skipping page recognition

Recognition can be skipped for a page by using the OCR/A property settings in Datacap Studio and setting the "Skip Recognition" property to "Yes". If the Recognize action is then called on the page object, recognition is skipped and the action returns True. Alternatively, recognition can be skipped by setting the DCO variable "y_sr" to "1".

Example

Recognize()

This example recognizes the current page and writes the recognition results to the page's layout XML file.