Recognize

The Recognize action uses the settings in the OCR/A tab of the Recognition Options Setup dialog to recognize all characters on a page and stores the recognition results in a layout XML file. A fingerprint file (.cco) can then be generated from those results with the CreateCcoFromLayout action.

The Recognize action performs document layout analysis and OCR, generating a layout XML file such as TM000001_layout.xml. The layout file groups text into blocks as a person would read the document. Each block might have the default Block type or a specific type such as Title or Table. Locate actions in the DocumentAnalytics action library, such as GoSiblingBlockNext, navigate this block structure. This is in contrast to the CCO file produced by other recognition actions, which groups text into lines that span the width of the page.

The layout XML file also retains font and color attributes, saved in CSS format, for the text that is used to extract data and reconstruct the document in a new format. To use the Locate actions and perform Click N Key during verification, use the CreateCcoFromLayout action in the SharedRecognitionTools action library. This action creates a CCO file for the page after the layout XML file is produced.

Member of namespace

ocr_a

Syntax

bool Recognize ()

Parameters

None.

Returns

False if the ruleset with this action is not bound to a Page object of the Document Hierarchy. Otherwise, True.

Level

Page level.

Details

The following are the types of elements that might be present in the layout XML file:
Table 1. Block types versus XML nodes

  Block type    Node in the layout XML file
  ----------    ---------------------------
  Block         Block
  Header        Header
  Footer        Footer
  Title         Title
  Heading1      H1
  Heading2      H2
  Heading3      H3
  Picture       Picture
  Barcode       Barcode
  Space         S
  Tab           Tab
  Table         Table
  Row           Row
  Cell          Cell
  Paragraph     Para
  Line          L
  Sentence      Sent
  Word          W
  Character     C

Supported File Formats

The Recognize action can process color images and PDF files. When processing PDF documents, the action extracts any text embedded in the PDF and performs recognition only on areas that contain data but no embedded text. This improves processing speed and overall performance for PDF documents.
Tip: Microsoft Excel, Microsoft Word, HTML, RTF, and TXT documents can also be processed by first converting them to searchable PDFs with the Convert action library. For example, a Microsoft Excel document can be converted to PDF by calling the ExcelWorkbookToPdf action. After the PDF document is created, it can be processed by the Recognize action.
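For example, a rule that processes an Excel workbook could chain the conversion and recognition actions. This is a sketch that assumes both actions are attached at page level:

ExcelWorkbookToPdf()
Recognize()

The ExcelWorkbookToPdf action produces a searchable PDF from the workbook, which the Recognize action can then process like any other PDF.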

Language detection

Language detection helps improve recognition results. Instead of using the default English setting, the engine detects the language of the text, which yields more accurate OCR results. When the OCR process is complete, a report on the number of languages detected (and the total number of words detected for each language) is generated. This report is stored in the runtime DCO as variables and can also be found in the layout XML file. To enable automatic language detection:
  1. Use rrSet or a similar action to set the y_lg variable to a comma-separated list of at least three supported auto-detection languages.
  2. After specifying the list of languages, call the Recognize action.
Important: Limit the list to the languages that the application is expected to process; more languages slow down processing. However, if an application processes only two languages, you must still provide at least three languages to enable automatic language detection.

Languages Supported by Auto Detection

Important: When you set the comma-separated list of languages, make sure that each language is entered exactly as written in this list. An invalid language name causes the action to abort.
Note: Text that follows a colon (:) is informational only and must not be included.
  • Arabic: Arabic (Saudi Arabia)
  • ArmenianWestern: Armenian (Western)
  • AzeriLatin: Azerbaijani (Latin)
  • Bashkir
  • Bulgarian
  • Catalan
  • ChinesePRC: Chinese Simplified
  • ChinesePRC+English: Chinese Simplified and English
  • ChineseTaiwan: Chinese Traditional
  • ChineseTaiwan+English: Chinese Traditional and English
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • German
  • GermanNewSpelling: German (new spelling)
  • GermanLuxembourg: German (Luxembourg)
  • Greek
  • Hebrew
  • Hungarian
  • Indonesian
  • Italian
  • Japanese
  • Korean
  • Korean+English: Korean and English
  • KoreanHangul: Korean (Hangul)
  • Latin
  • Latvian
  • Lithuanian
  • Mixed: Russian and English
  • Norwegian: NorwegianNynorsk and NorwegianBokmal
  • NorwegianBokmal: Norwegian (Bokmal)
  • NorwegianNynorsk: Norwegian (Nynorsk)
  • OldEnglish
  • OldFrench
  • OldGerman
  • OldItalian
  • OldSpanish
  • Polish
  • PortugueseBrazilian
  • PortugueseStandard
  • Romanian
  • RussianOldSpelling
  • Russian
  • RussianWithAccent
  • Slovak
  • Slovenian
  • Spanish
  • Swedish
  • Tatar
  • Thai
  • Turkish
  • Ukrainian
  • Vietnamese

The language can be bound to a DCO object by selecting it on the OCR_A tab within the Zones tab of Datacap Studio. When selected on the OCR_A tab, the y_lg variable is set to the language. The language can also be set within rules by using the rrSet action to set the y_lg variable to the wanted language.

For example, rrSet("Italian", "@X.y_lg") sets the language to Italian.

If the y_lg variable is not set for the current DCO object, the recognized language is determined by the current locale that is set with the hr_locale variable. For example, if the locale is set for Germany, rrSet("de-DE", "@X.hr_locale"), then the text is recognized as German.

If both the hr_locale and y_lg variables are set, the value in y_lg takes precedence over the locale setting. If y_lg is set but the engine should use the hr_locale value instead, set the dco_uselocale variable to 1 to give precedence to hr_locale.
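For example, to force the engine to honor the locale even when y_lg is set elsewhere, a sketch using the variables described above:

rrSet("de-DE", "@X.hr_locale")
rrSet("1", "@X.dco_uselocale")

With dco_uselocale set to 1, the text is recognized according to hr_locale rather than y_lg.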

If the page that you are recognizing is formatted for right-to-left text, such as Arabic or Hebrew, set the hr_bidi variable to "rtl" to indicate that the page is right to left.
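For example, a page of Hebrew text could be configured as follows. This is a sketch that follows the @X binding style of the earlier examples:

rrSet("Hebrew", "@X.y_lg")
rrSet("rtl", "@X.hr_bidi")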

Example:
rrSet("English,French,German", "@P.y_lg") - Auto detection of English, French, or German
rrSet("English,German,GermanNewSpelling,Norwegian", "@P.y_lg") - Auto detection of English, German (including new spelling), or Norwegian
rrSet("ChinesePRC+@CHR(43)+English", "@P.y_lg") - An exception for specifying Simplified Chinese and English

Custom Parameters

The following variables can be used to set custom parameters for recognition:
y_userProfile
Reserved for internal use only.
y_predefinedProfile
Set this variable to the name of a predefined profile to use during recognition. Valid values are:
  • DocumentConversion_Accuracy
  • DocumentConversion_Speed
  • DocumentArchiving_Accuracy
  • DocumentArchiving_Speed
  • BookArchiving_Accuracy
  • BookArchiving_Speed
  • TextExtraction_Accuracy
  • TextExtraction_Speed
  • EngineeringDrawingsProcessing
  • BusinessCardsProcessing
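For example, to select a profile that is tuned for accurate text extraction, a sketch with the binding shown at page level (@P), matching the y_lg examples; any of the profile names above can be substituted:

rrSet("TextExtraction_Accuracy", "@P.y_predefinedProfile")
Recognize()
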

Text Extraction versus Text Recognition

When a PDF is recognized, by default, the text included in the recognition results is obtained from a combination of automatic recognition that is run on the PDF and from searchable text that is embedded within the PDF.

Text in images that are embedded on the page is recognized by the engine. If an area of the page contains both an image and searchable text that is associated with the image, the engine decides whether to use the searchable text or to recognize the text from the matching image.

Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF.

The y_contentReuseMode variable can be used to force the engine to use only the recognized text on the page or only the embedded text on the page. One reason to use only the embedded text is to skip recognition and produce high-confidence results.

A drawback of using only the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture the missing data, which results in a layout XML file that is incomplete compared to what the user sees when viewing the PDF. Do not use this setting if the source PDF file is of the image-on-text type, because in this case the text layer is not extracted.

If a text line contains characters that are not included in the alphabet of the selected recognition languages, that text is not written to the result; in this case, mode 0 or 1 must be used.

These settings of y_contentReuseMode can be set on the DCO node that is being converted:

rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.

rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.

rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.

Example:

This sequence creates a layout XML file, and subsequently, a CCO file for the current page. Auto detection is enabled for English, French, and Japanese documents. The CCO file that is produced is ready for use by navigation and pattern match actions.

rrSet("English,French,Japanese", "@P.y_lg")
Recognize() 
CreateCcoFromLayout() 

Table Identification

Some recognition engines can identify a table on a document during full page recognition when layout files are produced by the Recognize action. When text is recognized as a table, additional metadata about the recognized words is stored internally. This metadata records the cell information (row and column position) for the text and can be used by subsequent actions that support table functions. This is an alternative to the "line items" feature, which processes recognized data as tables without a table being identified by the recognition engine.

The Recognize action attempts to detect tables on a page, identifying the rows and columns. If a table is detected, the table structure can be used by subsequent actions that manipulate recognized tables. As with character recognition, 100% accuracy of a table's rows and columns is not guaranteed. If you need this feature, test it on the variety of tables that you expect to process in your application and determine whether the accuracy is sufficient for the functions that you need. If it is not, consider changing your approach.

Parts of a table might not be recognized as expected even if they look correct to the human eye. For example, parts of your table might be left outside of the table structure, or rows and columns might be combined in unexpected ways. If the tables in your documents are not being identified well by the engine, consider a different approach, such as using line items, to process tabular data within your documents.

The following tips can help to improve the recognition of your table:
  • Tables must be well-defined with grid lines that identify the rows and columns of the table.
  • Do not perform line removal if you are attempting to detect table structures.
  • Cells cannot intersect each other.
  • All cells must have a rectangular shape.

Forcing single lines per table cell

If your table contains only a single row but the engine recognizes multiple rows per table cell, the engine can be instructed to recognize the multiple rows as a single row by setting the DCO variable y_SingleLinePerCell to 1.

Example: rrSet("1","@X.y_SingleLinePerCell")

Splitting table by separators

While identifying a table layout, the engine uses grid lines and its own heuristics to determine the rows and columns of the table. Sometimes the cells that the engine creates differ from the cells that are defined by the grid lines on the page. For example, a table might have grid lines that show multiple lines in a cell, but those lines are interpreted as single-line cells. Setting the DCO variable y_SplitOnlyBySeparators to "1" tells the engine to use only the grid lines while identifying the layout of a table. For example: rrSet("1","@X.y_SplitOnlyBySeparators"). When this setting is enabled, the engine uses the grid lines to guide the identification, producing a table layout that closely matches the visible grid lines. This setting is off by default.

If y_SplitOnlyBySeparators is enabled, the engine does not attempt to recognize a table without grid lines.

Disabling table detection

Table detection is enabled by default. You can disable it by setting the page-level variable y_DetectTables to 0.

Example: rrSet("0","@X.y_DetectTables").

Typically, it is not necessary to disable table detection. However, the setting is provided in case it gives better results for your document types.

Zoning the table location

In some cases, table identification can be improved by identifying the table location (zone). This might improve table identification when the engine is not detecting the table boundaries well or when the table does not have grid lines that identify its boundaries.

If the page can be fingerprinted so that the location of the table is always known, the table area can be zoned to identify the table boundaries. To identify the table zone to the recognition engine, specify the name of the field that contains the table zone by using the y_TableZone variable. For example, if the current page has a field that is called "MyTable" and the zone for that field identifies the table location on the page, then the action rrSet("MyTable", "@X.y_TableZone") allows the engine to use the zone to identify the table.

If a table is recognized by using the user-provided zone, subfields that contain the table contents are created under the field that is specified by y_TableZone. If there are multiple tables on the page, only one table can be identified by using this method.

If you need to identify multiple tables on a page, you must use auto detection, process the page multiple times with a different zone for each table, or take a different approach. Likewise, if the location of the table cannot be predicted or identified before recognition, you need to use auto detection or take a different approach.
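For example, using the "MyTable" field from the description above, a sketch of a complete zoned-table sequence is:

rrSet("MyTable", "@X.y_TableZone")
Recognize()

After recognition, the table contents appear as subfields of the MyTable field.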

Providing the engine with the table zone gives better results in situations where table auto detection makes mistakes, such as when there are no grid lines. Be aware that this technique still cannot guarantee 100% accuracy, and you might need to take additional custom steps to adjust the recognized data for your needs, similar to the need to correct standard character recognition.
