Recognize
Recognize refers to settings in the OCR/A tab of the Recognition Options Setup dialog to recognize all characters on a page, and populates the page's Fingerprint file (.cco) with the recognition results.
The Recognize action performs document layout analysis and OCR, generating a layout XML file such as TM000001_layout.xml. The layout file groups text into blocks, as a person looking at the document would see them. Each block has either the default type of Block or a specific type such as Title or Table. Locate actions in the DocumentAnalytics action library, such as GoSiblingBlockNext, navigate the block structure. This is in contrast to the CCO file produced by other actions, which groups text into lines that span the width of the page. The layout XML file also retains font and color attributes, saved in CSS format, for the text that is used for extracting data and reconstructing the document in a new format. To use the Locate actions and perform Click N Key during verification, use the CreateCcoFromLayout action in the SharedRecognitionTools action library. This action creates a CCO file for the page after the layout XML file is produced.
Member of namespace
ocr_a
Syntax
bool Recognize ()
Parameters
None.
Returns
False if the ruleset with this action is not bound to a Page object of the Document Hierarchy. Otherwise, True.
Level
Page level.
Details
Block type | Node in the layout XML file |
---|---|
Block | Block |
Header | Header |
Footer | Footer |
Title | Title |
Heading1 | H1 |
Heading2 | H2 |
Heading3 | H3 |
Picture | Picture |
Barcode | Barcode |
Space | S |
Tab | Tab |
Table | Table |
Row | Row |
Cell | Cell |
Paragraph | Para |
Line | L |
Sentence | Sent |
Word | W |
Character | C |
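The mapping above can be used when post-processing a layout XML file outside of Datacap. The following Python sketch counts how many nodes of each block type a layout contains. Only the node names come from the table; the sample fragment, its nesting, the Page root element, and the absence of XML namespaces are assumptions made for illustration, so a real layout file might need adjustments.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Node name -> block type, copied from the table above.
NODE_TO_BLOCK_TYPE = {
    "Block": "Block", "Header": "Header", "Footer": "Footer", "Title": "Title",
    "H1": "Heading1", "H2": "Heading2", "H3": "Heading3",
    "Picture": "Picture", "Barcode": "Barcode", "S": "Space", "Tab": "Tab",
    "Table": "Table", "Row": "Row", "Cell": "Cell", "Para": "Paragraph",
    "L": "Line", "Sent": "Sentence", "W": "Word", "C": "Character",
}

# Hypothetical fragment: the real nesting and attributes may differ.
SAMPLE_LAYOUT = """
<Page>
  <Title><Para><L><W>Invoice</W></L></Para></Title>
  <Table>
    <Row>
      <Cell><Para><L><W>Qty</W></L></Para></Cell>
      <Cell><Para><L><W>Price</W></L></Para></Cell>
    </Row>
  </Table>
</Page>
"""


def count_block_types(xml_text):
    """Count nodes per block type, ignoring tags that are not in the table."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for node in root.iter():
        block_type = NODE_TO_BLOCK_TYPE.get(node.tag)
        if block_type is not None:
            counts[block_type] += 1
    return counts


counts = count_block_types(SAMPLE_LAYOUT)
for block_type, total in sorted(counts.items()):
    print(block_type, total)
```

To process a file on disk, ET.parse(path).getroot() could replace ET.fromstring in this sketch.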
Supported File Formats
Language detection
- Use rrSet or a similar action to set the y_lg variable to a comma-separated list of at least three supported auto-detection languages.
- After specifying the list of languages, call the Recognize action.
Languages Supported by Auto Detection
In the following list, any text after a colon (:) is informational only and must not be included in the value.
- Arabic: Arabic (Saudi Arabia)
- ArmenianWestern: Armenian (Western)
- AzeriLatin: Azerbaijani (Latin)
- Bashkir
- Bulgarian
- Catalan
- ChinesePRC: Chinese Simplified
- ChinesePRC+English: Chinese Simplified and English
- ChineseTaiwan: Chinese Traditional
- ChineseTaiwan+English: Chinese Traditional and English
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- German
- GermanNewSpelling: German (new spelling)
- GermanLuxembourg: German (Luxembourg)
- Greek
- Hebrew
- Hungarian
- Indonesian
- Italian
- Japanese
- Korean
- Korean+English: Korean and English
- KoreanHangul: Korean (Hangul)
- Latin
- Latvian
- Lithuanian
- Mixed: Russian and English
- Norwegian: NorwegianNynorsk and NorwegianBokmal
- NorwegianBokmal: Norwegian (Bokmal)
- NorwegianNynorsk: Norwegian (Nynorsk)
- OldEnglish
- OldFrench
- OldGerman
- OldItalian
- OldSpanish
- Polish
- PortugueseBrazilian
- PortugueseStandard
- Romanian
- RussianOldSpelling
- Russian
- RussianWithAccent
- Slovak
- Slovenian
- Spanish
- Swedish
- Tatar
- Thai
- Turkish
- Ukrainian
- Vietnamese
The language can be bound to the DCO object by selecting it in the OCR_A tab in the Zones tab of Datacap Studio. When a language is selected in the OCR_A tab, the y_lg variable is set to that language. The language can also be set within rules by using the rrSet action to set the y_lg variable to the desired language.
For example, rrSet("Italian", "@X.y_lg") sets the language to Italian.
If the y_lg variable is not set for the current DCO object, the recognition language is determined by the current locale, which is set with the hr_locale variable. For example, if the locale is set for Germany with rrSet("de-DE", "@X.hr_locale"), the text is recognized as German.
If both the hr_locale and y_lg variables are set, the value in y_lg takes precedence over the locale setting. If y_lg is set but the engine should use the value set for hr_locale instead, set the dco_uselocale variable to 1 to give precedence to hr_locale.
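The precedence rules can be summarized in a short Python sketch. It only models the decision described above and is not Datacap code; the function name is hypothetical. The case where dco_uselocale is 1 but hr_locale is empty is not covered by this page, so falling back to y_lg in that case is an assumption.

```python
def effective_language_source(y_lg=None, hr_locale=None, dco_uselocale=None):
    """Return the name of the variable that decides the recognition language."""
    if y_lg and hr_locale:
        # Both set: y_lg wins unless dco_uselocale is 1.
        return "hr_locale" if str(dco_uselocale) == "1" else "y_lg"
    if y_lg:
        # Assumption: y_lg is used when no locale is available at all.
        return "y_lg"
    if hr_locale:
        return "hr_locale"
    return None  # Neither variable is set; behavior is not described here.


print(effective_language_source(y_lg="Italian", hr_locale="de-DE"))
print(effective_language_source(y_lg="Italian", hr_locale="de-DE", dco_uselocale="1"))
print(effective_language_source(hr_locale="de-DE"))
```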
If the page you are recognizing is formatted for right-to-left text, such as Arabic or Hebrew, the hr_bidi variable must be set to "rtl" to indicate that the page is right to left.
- Example:
- rrSet("English,French,German", "@P.y_lg") - Auto detection of English, French, or German
- rrSet("English,German,GermanNewSpelling,Norwegian", "@P.y_lg") - Auto detection of English, German, or Norwegian
- rrSet("ChinesePRC+@CHR(43)+English", "@P.y_lg") - An exception for specifying Simplified Chinese and English
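When the language list is assembled by a script, for example while generating rules, the value can be built mechanically. The Python sketch below strips the informational text after a colon and writes a literal plus sign as +@CHR(43)+, following the Simplified Chinese and English example above. The helper names are hypothetical, and generalizing the plus-sign handling to the other combined entries is an assumption based on that single example.

```python
def build_y_lg_value(languages):
    """Build the comma-separated value for the y_lg variable."""
    cleaned = []
    for entry in languages:
        # Keep only the language name; text after a colon is informational.
        name = entry.split(":", 1)[0].strip()
        if not name:
            raise ValueError("empty language name in %r" % (entry,))
        # Assumption: every literal '+' is written as +@CHR(43)+.
        cleaned.append(name.replace("+", "+@CHR(43)+"))
    return ",".join(cleaned)


def y_lg_rrset_call(languages, target="@P.y_lg"):
    """Format the rrSet call as it would appear in a rule."""
    return 'rrSet("{}", "{}")'.format(build_y_lg_value(languages), target)


print(y_lg_rrset_call(["English", "French", "German"]))
print(y_lg_rrset_call(["ChinesePRC+English: Chinese Simplified and English"]))
```

The sketch does not check names against the supported language list or enforce the minimum list length; both checks could be added if the full list is available to the script.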
Custom Parameters
- y_userProfile
- Reserved for internal use only.
- y_predefinedProfile
- Set this variable to the name of a predefined profile to use during recognition. Valid values are:
- DocumentConversion_Accuracy
- DocumentConversion_Speed
- DocumentArchiving_Accuracy
- DocumentArchiving_Speed
- BookArchiving_Accuracy
- BookArchiving_Speed
- TextExtraction_Accuracy
- TextExtraction_Speed
- EngineeringDrawingsProcessing
- BusinessCardsProcessing
- Text Extraction versus Text Recognition
When a PDF is recognized, by default, the text included in the recognition results is obtained from a combination of automatic recognition that is run on the PDF and from searchable text that is embedded within the PDF.
Text in any images that are embedded on the page is recognized by the engine. If an area of the page contains both an image and searchable text that is associated with the image, the engine decides whether to use the searchable text or to recognize the text from the image.
Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF.
The y_contentReuseMode variable can be used to force the engine to use only the recognized text on the page or only the embedded text on the page. One reason to use only the embedded text is to skip recognition and produce high-confidence results.
A drawback of using only the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture the missing data, which results in a layout XML file that is incomplete compared to what the user sees when viewing the PDF. Do not use this setting if the source PDF file is of the image-on-text type, because in this case the text layer is not extracted.
Also, if a text line contains characters that are not included in the alphabet of the selected recognition languages, this text is not written to the result; in that case, use mode 0 or 1 instead.
These settings of y_contentReuseMode can be set on the DCO node that is being converted:
rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.
rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.
rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.
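The three modes and the cautions above can be captured in a small Python sketch, for instance in a script that decides which rule to generate per document type. The enum, the function names, and the input flags are hypothetical; only the values 0, 1, and 2 and their meanings come from this page. Falling back to the default mode when embedded-only text is unsuitable is one possible choice, since mode 1 would also satisfy the guidance.

```python
from enum import Enum


class ContentReuseMode(Enum):
    AUTO = "0"                # recognition combined with embedded text (default)
    RECOGNITION_ONLY = "1"    # only recognition is used
    EMBEDDED_TEXT_ONLY = "2"  # only embedded text is used


def choose_mode(want_embedded_only, image_on_text_pdf=False,
                out_of_alphabet_text=False):
    """Pick a mode, avoiding mode 2 in the cases the text warns about."""
    if want_embedded_only and not (image_on_text_pdf or out_of_alphabet_text):
        return ContentReuseMode.EMBEDDED_TEXT_ONLY
    # Assumption: fall back to the default rather than to recognition only.
    return ContentReuseMode.AUTO


def content_reuse_rrset_call(mode, target="@X.y_contentReuseMode"):
    """Format the rrSet call as it would appear in a rule."""
    return 'rrSet("{}", "{}")'.format(mode.value, target)


print(content_reuse_rrset_call(choose_mode(want_embedded_only=True)))
print(content_reuse_rrset_call(choose_mode(True, image_on_text_pdf=True)))
```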
- Example:
This sequence creates a layout XML file, and subsequently, a CCO file for the current page. Auto detection is enabled for English, French, and Japanese documents. The CCO file that is produced is ready for use by navigation and pattern match actions.
rrSet("English,French,Japanese", "@P.y_lg")
Recognize()
CreateCcoFromLayout()
Table Identification
Some recognition engines can identify a table on a document when performing full page recognition and using layout files produced by the Recognize action. When text is recognized as a table, additional metadata is stored internally for the recognized words. This metadata records the cell information (row and column position) for the text and can be used by subsequent actions that support table functions. This is an alternative to the "line items" feature, which processes recognized data as tables without the recognition engine identifying a table.
The Recognize action attempts to detect tables on a page, identifying the rows and columns. If a table is detected, the table structure can be used by subsequent actions that manipulate the recognized tables. As with character recognition, 100% accuracy of a table's rows and columns is not guaranteed. If you need this feature, test it on a variety of the tables that you expect to process in your application and determine whether the accuracy is good enough to provide the functions that you need. If table recognition accuracy is not good enough, change your approach.
Parts of a table might not be recognized as expected even if they look clear to the human eye. For example, parts of your table might be left outside of the table structure, or rows and columns might be combined in ways that you would not expect from looking at the table. If the engine does not identify the tables in your documents well, consider a different approach, such as using line items, to process tabular data within your documents.
- Tables must be well-defined with grid lines that identify the rows and columns of the table.
- Do not perform line removal if you are attempting to detect table structures.
- Cells cannot intersect each other.
- All cells must have a rectangular shape.
- Forcing single lines per table cell
If each cell of your table contains only a single line but the engine recognizes multiple lines per table cell, the engine can be instructed to recognize the multiple lines as a single line by setting the DCO variable y_SingleLinePerCell to 1.
Example: rrSet("1","@X.y_SingleLinePerCell")
- Splitting table by separators
- While identifying a table layout, the engine uses grid lines and its own heuristics to determine the rows and columns for the table. Sometimes this can cause the cells created by the engine to be different from what is defined by grid lines on the page. For example, your table might have grid lines that show multiple lines in a cell, but those lines are instead interpreted as single-line cells. The DCO variable y_SplitOnlyBySeparators can be set to "1", which tells the engine to use only grid lines while identifying the layout of a table. For example: rrSet("1","@X.y_SplitOnlyBySeparators"). When enabled, the setting provides a table layout that closely matches the visible grid lines. This setting is off by default.
If y_SplitOnlyBySeparators is enabled, the engine does not attempt to recognize a table without grid lines.
- Disabling table detection
- Table detection is enabled by default. You can disable it by setting the page level variable y_DetectTables to 0.
Example: rrSet("0","@X.y_DetectTables")
Typically, it is not necessary to disable table detection. The setting is provided in case disabling detection gives better results for your document types.
- Zoning the table location
- In some cases, table identification can be improved by identifying the table location (zone). This might help when the engine is not detecting the table boundaries well or when the table does not have grid lines to identify the table boundaries.
If the page can be fingerprinted so that the location of the table is always known, the table area can be zoned to identify the table boundaries. To identify the table zone to the recognition engine, specify the name of the field that contains the table zone by using the variable y_TableZone. For example, if the current page has a field that is called "MyTable" and the zone for that field identifies the table location on the page, the action rrSet("MyTable", "@X.y_TableZone") allows the engine to use the zone to identify the table.
If a table is recognized by using the user-provided zone, sub-fields that hold the table contents are created under the field that is specified in y_TableZone. If there are multiple tables on the page, only one table can be identified by using this method.
If you need to identify multiple tables on a page, you must use auto detection, process the page multiple times with a different zone for each table, or take a different approach. If the location of the table cannot be predicted or identified before recognition, you must use auto detection or take a different approach.
Providing the engine with the table zone gives better results in situations where table auto detection makes mistakes, such as when there are no grid lines. This technique still cannot guarantee 100% accuracy, and you might need to take additional custom steps to adjust the recognized data for your needs, just as standard character recognition results sometimes need correction.