Recognize

Performs document layout analysis and OCR and also generates a layout XML file such as TM000001_layout.xml.

Member of namespace

OCR_SR

Syntax

bool Recognize()

Returns

False if the ruleset with this action is not bound to a page object of the document hierarchy. Otherwise, this action returns True.

Level

Page level.

Details

Performs document layout analysis and OCR and also generates an XML layout file such as TM000001_layout.xml.

The recognition performed is similar to that of the RecognizePageOCR_S action, but this action does not directly create a CCO. The CCO must be created in a second step.

The layout file groups text into blocks, similar to how a person would see and identify the structure of the document. For example, a page can have items such as tables, paragraphs, and lines, which are all types of block. For information about the block types, see DocumentAnalytics actions. A block might be of the default type or of a specific type such as title or table. The type depends on how the recognition engine interprets the block. Locate actions, such as GoSiblingBlockNext, are available in the Locate action library to navigate the block structure.

This is in contrast to the CCO file produced by other actions, which groups text into lines that span the width of the page.

The layout XML file also retains the font and color attributes of the extracted text in CSS format. This text is used for extracting data and for reconstructing the document in a new format.
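As a minimal sketch of how CSS-formatted font and color attributes could be read once they are pulled out of the layout file, the following Python function splits a CSS declaration string into a dictionary. The sample string and the property names in it are assumptions for illustration; this page does not specify which attribute of the layout XML carries the CSS text or which properties it contains.

```python
def parse_css_declarations(style):
    """Split a CSS declaration string such as 'color:#000; font-size:10pt'
    into a dictionary of property name -> value."""
    props = {}
    for declaration in style.split(";"):
        # Skip empty fragments, for example after a trailing semicolon.
        if ":" not in declaration:
            continue
        name, value = declaration.split(":", 1)
        props[name.strip().lower()] = value.strip()
    return props


# Hypothetical sample value; the real attribute content may differ.
sample_style = "font-family: Arial; font-size:10pt; color:#FF0000;"
font_info = parse_css_declarations(sample_style)
```

A dictionary keeps the lookup tolerant of missing properties: `font_info.get("color")` returns None when the recognition engine did not record a color.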

To use the Locate actions and perform click 'n' key during verification, call the CreateCcoFromLayout action in the SharedRecognitionTools action library after performing Recognize. The CreateCcoFromLayout action creates a CCO file from the recognized text within the layout XML file.

The layout file contains the results of recognition. Heuristic algorithms are used to identify the text elements on the page. The elements detected can be different on pages that look the same or use the same form. It cannot be guaranteed that a particular element will always exist or will be recognized.

Following are the types of elements that may be present in the layout XML file:

Block Type/XML Node
  • Block/Block
  • Header/Header
  • Footer/Footer
  • Title/Title
  • Heading1/H1
  • Heading2/H2
  • Heading3/H3
  • Picture/Picture
  • Barcode/Barcode
  • Space/S
  • Tab/Tab
  • Table/Table
  • Row/Row
  • Cell/Cell
  • Paragraph/Para
  • Line/L
  • Sentence/Sent
  • Word/W
  • Character/C
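Because the presence of any particular element cannot be guaranteed, code that reads the layout file outside Datacap should treat every node as optional. The following Python sketch counts the node types in a layout document and looks up a node defensively. Only the node names (Block, Para, L, W, S, and so on) come from the list above; the `Page` root element, the nesting, and the placement of text inside `W` nodes are assumptions made for the sample, not the documented schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the real layout XML structure may differ.
SAMPLE_LAYOUT = """<Page>
  <Block>
    <Para>
      <L><W>Invoice</W><S/><W>Total</W></L>
    </Para>
  </Block>
</Page>"""


def count_nodes(xml_text):
    """Return a Counter of element names found anywhere in the document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


def first_node(xml_text, tag):
    """Return the first descendant with the given name, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.find(".//" + tag)


counts = count_nodes(SAMPLE_LAYOUT)
title = first_node(SAMPLE_LAYOUT, "Title")  # None: the sample has no Title
if title is None:
    print("No Title block was recognized on this page")
```

A Counter returns 0 for node types that are absent, so a check such as `counts["Table"]` is safe even when no table was detected.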

Supported File Formats

This action can process color images and PDF files. It processes PDF documents in the following manner:
  • Extracts embedded text within the PDF document
  • Performs recognition only on those areas that contain data but do not contain embedded text

This behavior improves the processing speed and overall performance of PDF document processing.

Tip: You can also process the following types of documents by first converting the documents to searchable PDFs with Convert library actions:
  • Microsoft Excel
  • Microsoft Word
  • HTML
  • RTF
  • Txt

For example, you can convert a Microsoft Excel document to PDF by calling the action ExcelWorkbookToPdf. Once the PDF document is created, it can be processed with the Recognize action.

Important:

When this action processes a PDF file, it creates a single layout file only and does not extract the images from the document. DCO nodes for each page are not created. If you intend to extract pages from a PDF for recognition, to perform verification, or to use other Datacap actions on the resulting pages, use the PDFFREDocumentToImage action in the Convert action library to create unique pages that can be individually processed by subsequent tasks.

Additionally, images to be recognized should use a lossless compression like FAX (for black and white images) or LZW (for grayscale and color), and should not use a lossy compression such as JPEG. JPEG should only be used on photos that do not have text to be recognized.
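As a rough pre-check outside Datacap, the compression of a TIFF image can be read from its Compression tag (tag 259) before the image is submitted for recognition. The Python sketch below parses only the first image directory of a classic TIFF file. Treating the "FAX" compression named above as the CCITT Group 3 and Group 4 codes is an assumption, and the function names are illustrative only.

```python
import struct

# Compression codes from the TIFF specification (tag 259).
COMPRESSION_NAMES = {
    1: "uncompressed",
    3: "CCITT Group 3 fax",
    4: "CCITT Group 4 fax",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
}
LOSSY_CODES = {6, 7}


def tiff_compression(data):
    """Return the Compression tag value from the first IFD of a TIFF."""
    if data[:2] == b"II":
        order = "<"   # little-endian
    elif data[:2] == b"MM":
        order = ">"   # big-endian
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for index in range(entry_count):
        start = ifd_offset + 2 + 12 * index
        tag, _field_type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == 259:
            # A single SHORT value is stored in the first two value bytes.
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return 1  # The TIFF default when the tag is absent is no compression.


def is_lossy(code):
    """True if the compression code is one of the JPEG variants."""
    return code in LOSSY_CODES
```

For example, `is_lossy(tiff_compression(open(path, "rb").read()))` would flag a JPEG-compressed TIFF, where `path` is a file of your choosing, so that it can be re-saved with a lossless compression before recognition.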

Configuring Language and Other Properties

The language to be recognized can be configured by setting the "Language" property in the OCR/S tab of the Zones tab in Datacap Studio. The OCR/S tab has a number of additional properties that can be set to control recognition. Choosing appropriate values for these properties, rather than leaving them at their defaults, can improve recognition accuracy. To change the properties, select the DCO node, such as a page or field, and lock the DCO. After making your selections, save the DCO and unlock it.

Language Detection

The Recognize action does not support language detection. However, this feature is available in the Recognize action of the OCR_A action library.

Loading the CCO

Datacap works with recognized text through the CCO. The CCO contains all of the text and the position of each recognized character. It allows text-related actions, such as the Locate actions, to operate. The "RecognizePage" actions create the CCO automatically. The "Recognize" actions do not directly create the CCO; they create an intermediate layout file. To create a CCO from the layout file, call the CreateCcoFromLayout action in the SharedRecognitionTools action library after performing Recognize.

Table layouts can be detected only if the page contains grid lines that define the table. The CCO does not use or store the table information. If table detection is not needed, it is usually best to remove the lines from the image before recognition to get the best possible recognition results.

Automatic Rotation, Deskew and Border Removal

Rotation, deskew and border removal can be performed automatically at the same time as recognition, if the feature is enabled.

Example:
In the following example, the Recognize action creates an XML layout file for the current page, and the CreateCcoFromLayout action then creates a CCO file from that layout file. The CCO file produced is ready for use by navigation and pattern match actions.

Recognize() 
CreateCcoFromLayout()