Recognizing text and creating images from a PDF

Product Documentation

Abstract

IBM Datacap has several different ways to recognize text within a PDF file. This guide is for those who want to recognize text on all, or some, of the pages within a PDF document. The recognized text can then be utilized in subsequent tasks such as validation, verification, or export as needed by an application.
When first reviewing how to use the PDF-to-image conversion action, it may appear that the best approach is to perform recognition on the PDF directly as it is converted to images. While this is supported, there are several reasons why it is better to first convert to images, and then perform recognition on the images. The choice depends on the data within the document and how that data needs to be processed.

Content

Possible Approaches

There are two basic approaches to recognizing text within a PDF:
1. Convert the PDF pages to images, then perform recognition.
2. Perform recognition in the same step as image conversion.
The first approach, in which the PDF is converted to images and recognition is then performed on those images, is usually the best. The primary benefit of converting to an image first is that it allows image processing prior to recognition. During image processing, the image can be adjusted to improve the quality of the recognized text.

The following are the steps that can only be performed on a separate image:
- Orientation / Rotation correction
- Image deskew
- Image cleanup and adjustment through border removal, despeckling, line removal, and so on. These can all help achieve better recognition quality.
- When Image registration is used, it must be performed on an image prior to recognition.
- Field recognition can only be performed on an image that has an associated template or fingerprint with loaded zones. Field recognition is not possible on a PDF page.
- Recognition can be limited to a subset of pages.
These are strong reasons to first create an image and then perform recognition, and they are why this is typically the best path. While something like deskew might seem unnecessary because the text is clean, deskew not only improves recognition accuracy, it also helps ensure that all of the text is considered to be on the same line.
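
The effect of skew on line grouping can be illustrated with a small sketch. This is not Datacap code; it is a hypothetical Python model in which each character is reduced to an (x, y) position and a character joins a line when its y value is within a tolerance of the first character on that line. The function names, the tolerance, and the page geometry are all assumptions chosen for illustration.

```python
import math

def group_into_lines(points, tolerance=5.0):
    """Group (x, y) character positions into text lines.

    A character joins an existing line when its y value is within
    `tolerance` of the first character on that line. This is a
    simplistic, hypothetical rule used only for illustration.
    """
    lines = []
    for x, y in sorted(points, key=lambda p: p[0]):
        for line in lines:
            if abs(line[0][1] - y) <= tolerance:
                line.append((x, y))
                break
        else:
            lines.append([(x, y)])
    return lines

def skew(points, degrees):
    """Rotate points about the origin to simulate a skewed scan."""
    a = math.radians(degrees)
    return [(x * math.cos(a) - y * math.sin(a),
             x * math.sin(a) + y * math.cos(a)) for x, y in points]

# One line of 40 characters, 20 pixels apart, on the same baseline.
straight = [(i * 20.0, 100.0) for i in range(40)]

print(len(group_into_lines(straight)))           # a single line
print(len(group_into_lines(skew(straight, 2))))  # more than one line
```

Even a two degree skew is enough for this simple grouping rule to split one line of text into several, which is the kind of problem that deskewing before recognition avoids.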

Some scanners ingest pages and output a PDF instead of separate images. When a PDF contains scanned pages, as in this situation, first splitting out the pages and then recognizing each one is usually the best approach.

If all of the PDF documents are guaranteed to be from an electronic source, such as a Word document converted to a PDF, where the pages are never skewed and are always clean, then direct recognition on the PDF can be an option. As this is not typical for most applications, it is usually more reliable and flexible to convert to an image first.
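
The guidance above can be condensed into a small decision helper. This is a hypothetical Python sketch, not part of Datacap; the function name, parameters, and return strings are assumptions chosen only to mirror the reasoning in this section.

```python
def choose_approach(always_electronic_source,
                    needs_image_enhancement=False,
                    needs_field_recognition=False,
                    recognize_subset_of_pages=False):
    """Return which of the two approaches fits a stream of PDFs.

    Mirrors the reasoning in this guide: direct recognition on the PDF
    is only an option when every PDF comes from an electronic source
    and none of the image-only steps are required.
    """
    image_only_steps = (needs_image_enhancement
                        or needs_field_recognition
                        or recognize_subset_of_pages)
    if always_electronic_source and not image_only_steps:
        return "recognize during conversion (optional)"
    return "convert to images, then recognize"

print(choose_approach(always_electronic_source=False))
print(choose_approach(always_electronic_source=True))
print(choose_approach(always_electronic_source=True,
                      needs_field_recognition=True))
```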

If the original PDF is required for archiving or some other use, the PDF is still available in the batch and can be uploaded to a repository, or placed where required, at the end of the process.

Recommended Steps

The following are general steps to first convert a PDF to images, then perform full page or field level recognition:
1. Convert the source PDF to images.
2. Use image enhancement to fix rotation, deskew, and enhance the images.
3. Recognize the pages.
The following shows the basic steps that can be integrated into an application as needed.

Note: If you decide to retain color for pages that are color, use a lossless compression such as LZW. Do not use a lossy compression such as JPEG, as this will reduce the quality of recognition even if a high-quality compression setting is chosen. Always use a lossless compression for text that will be recognized.
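
To see why lossy compression is a concern, the following sketch compares a lossless round trip with a lossy one on a tiny row of grayscale pixel values. It uses Python's standard zlib module as a stand-in for a lossless codec such as LZW, and a crude quantization step as a stand-in for lossy compression. Neither is what Datacap or JPEG actually does internally, so treat it purely as an illustration.

```python
import zlib

# A row of grayscale pixels crossing the edge of a dark character stroke.
pixels = bytes([255, 250, 180, 40, 0, 0, 35, 170, 248, 255])

# Lossless: decompression restores every pixel exactly.
restored = zlib.decompress(zlib.compress(pixels))

def quantize(value, step=64):
    """Lossy stand-in: round a pixel to the nearest multiple of `step`."""
    return min(255, round(value / step) * step)

# Fine detail at the stroke edges is discarded and cannot be recovered.
lossy = bytes(quantize(p) for p in pixels)

print(restored == pixels)  # True
print(lossy == pixels)     # False
```

The soft edge values around the stroke are exactly the detail a recognition engine relies on, which is why a lossless format is recommended for any image that will be recognized.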
1. Convert the PDF to Images
  This example ruleset, shown as a snapshot from Datacap Studio, converts a PDF to separate images without performing any recognition in this step. This allows the images to be adjusted so that they recognize well.
  
  Notes
  - The actions enable the convPdfIgnoreContent variable. This prevents any extraction or recognition on the PDF during the conversion process as it is not needed at this step.
  - The compression for the converted images is set to 18, which is a lossless fax compression. Always use a lossless compression for images that will have recognition performed. Fax is a compression that produces very small black and white images without losing quality.
  - This conversion is set up to convert grayscale and color pages within the PDF to black and white. This is not required but can improve recognition quality by dropping out light shaded backgrounds. Some image enhancement functionality requires a black and white image. If color pages need to stay in color, that can be achieved by changing the parameters of PDFFREDocumentToImage as described in the action help. Color pages of 24 bits or less can be recognized. An application can also be set up to retain the color images, enhance and recognize the pages in black and white, and still keep the color images for later storage in a repository.
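
The idea that converting to black and white drops out light shaded backgrounds can be sketched with a simple fixed threshold. The threshold value and function name below are assumptions for illustration; the actual conversion performed by PDFFREDocumentToImage is controlled by its parameters and may use a different method.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Map grayscale values (0 = black, 255 = white) to 1 or 0.

    Pixels darker than `threshold` become 1 (ink); everything else,
    including light shading, becomes 0 (background).
    """
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_rows]

# Dark text strokes (value 20) on a light gray shaded background (value 210).
shaded = [
    [210, 210, 20, 210],
    [210, 20, 20, 210],
    [210, 210, 20, 210],
]

for row in to_black_and_white(shaded):
    print(row)
```

After thresholding, the shaded background is indistinguishable from a plain white one, leaving only the text strokes for the recognition engine.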
2. Perform Image Enhancement
  Configure Image Enhancement to run on each image. At a minimum, run automatic rotation and deskew on images that will be recognized. Border cropping or removal may also help. If you find there are other enhancements that will also clean up the images, such as despeckle or line removal, they can be run as well. This document does not focus on all of the available image enhancements.
  
  Here the action RotateImageOCR_A is used because it rotates the image based on the target language, which is usually the most reliable method. The Image Enhancement actions also have an automatic rotate, but it rotates based on image geometry, while the OCR/A rotation rotates based on the target language.
  
  After that, the Image Enhancement ruleset is used to deskew the images. Configure the settings in the Image Enhancement ruleset to adjust images as needed. Be sure to select the target page type in the DCO tree. An image can be loaded in the left panel and the immediate enhancement results can be seen in the right panel. Adjust the values until the resulting image has crisp text with a clean background. Some enhancements work best in certain orders so you can reorder the enhancements as needed. It is also possible to perform enhancements multiple times by adding a second instance of an enhancement to the list in the ruleset.
  
  Use the enhancements that work best for the expected set of input documents. This example shows automatic rotation followed by the Image Enhancement ruleset. The Image Enhancement ruleset allows a large number of image adjustments. Note that some features require specific image color depths.
  
  These image adjustment rules must run on the page objects newly created by the PDFFREDocumentToImage action, and they must be in a separate ruleset from the one that contains PDFFREDocumentToImage. For example, if the action processed a single PDF of 10 pages, it will have created 10 new page objects, one for each page, each with an associated image. The initial page type is “Other”. It is up to the application writer to decide whether the pages should be set to a new page type before performing the image enhancements.
  
  It is possible to perform a different set of image enhancements for different pages by using the page type. First use a page identification method, such as fingerprinting, to set a page type. A different set of enhancements can then be assigned to each page type. Assigning page types also allows you to skip enhancement, recognition, or both for specific page types that do not need them.
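
The relationship between page types and enhancements can be modeled as a simple lookup. The page type names, enhancement names, and functions below are hypothetical and exist only to illustrate the idea; in Datacap this is configured through rulesets and page types rather than written as code.

```python
# Hypothetical page types and enhancement names, for illustration only.
ENHANCEMENTS_BY_PAGE_TYPE = {
    "Invoice": ["auto_rotate", "deskew", "line_removal"],
    "Letter": ["auto_rotate", "deskew"],
    "Photo": [],  # no enhancement for this page type
}
DEFAULT_ENHANCEMENTS = ["auto_rotate", "deskew"]

def enhancements_for(page_type):
    """Return the enhancements to run for a page type."""
    return ENHANCEMENTS_BY_PAGE_TYPE.get(page_type, DEFAULT_ENHANCEMENTS)

def should_recognize(page_type, skip_types=("Photo",)):
    """Return False for page types that do not need recognition."""
    return page_type not in skip_types

for page_type in ["Invoice", "Photo", "Other"]:
    print(page_type, enhancements_for(page_type), should_recognize(page_type))
```

A page that has not been identified, such as one still typed as “Other”, falls back to the default set in this sketch.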
  
  Some of the recognition actions have built-in image enhancements such as automatic rotation and deskew. A recognition engine can potentially perform these corrections better than the generic image enhancement actions. This ability to process during the recognition step is not available in all versions of IBM Datacap. Check the recognition actions in your version of IBM Datacap to see if the feature is available. If so, test it to see whether it works better on your images.
  
  Other manipulations can also be performed at this stage. For example, if image registration, such as with anchors, is used to align the image to a template to ensure fields line up, that process can be performed at this time as well.
3. Recognize Each Page
  A ruleset like this would be run on each page level object that has been extracted from the PDF. At this point, an additional rule could have changed the page types of the new images as required by the application.
  
  The IBM Datacap rule engine will call each page object and execute the associated rules. There are two types of full page recognition: RecognizePage and Recognize. RecognizePage will directly create and load the CCO. The Recognize action will first create a separate file, called a layout file, which must then be loaded into the CCO by a second action.
  
  Field level recognition will recognize the configured field zones on the page and fill each field with the recognition results for each zone.
  
  The CCO is how Datacap works with full page recognized text. The CCO contains all of the text and the positions for each recognized character. It allows text related actions, like the Locate actions, to operate. The "RecognizePage" actions directly create the CCO automatically.
  
  The "Recognize" actions do not directly create the CCO. They create an intermediate layout file. To create a CCO from the layout file, call the CreateCcoFromLayout action in SharedRecognitionTools after performing Recognize. Once the CCO is loaded, actions such as the ones in the Locate action library can be used to process the page results.
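
The difference between the two paths can be summarized as an ordered list of actions. The sketch below is a hypothetical Python model, not Datacap rule syntax; it uses the action names as written in this guide, and the exact names in a given engine's action library may differ.

```python
def recognition_actions(use_recognize_page):
    """Return the order of actions needed to end up with a loaded CCO.

    RecognizePage creates and loads the CCO in one step. Recognize only
    writes a layout file, so CreateCcoFromLayout must follow it.
    """
    if use_recognize_page:
        return ["RecognizePage"]
    return ["Recognize", "CreateCcoFromLayout"]

print(recognition_actions(use_recognize_page=True))
print(recognition_actions(use_recognize_page=False))
```

In either case, actions that depend on the CCO, such as the Locate actions, belong after the last action in the returned list.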
  
  This example uses the OCR/A recognition engine to perform recognition and calls the RecognizePage action, which will create and load the page CCO in one step. The aggressive text extraction is set here because it increases the engine’s recognition quality. Refer to the action help for other types of settings to control the engine’s recognition.
  
  There are other recognition engines in IBM Datacap, such as OCR/S, that could be used as an alternative to OCR/A. They have similar actions that can be interchanged. Each engine has its own strengths. Through testing, it is possible to determine which engine works best for the application.

Concluding Remarks

These are the basic steps that are recommended to convert a PDF into images and then perform recognition. These segments can be integrated into an application. This approach is flexible, as page types can be assigned to further control how the new pages are manipulated.

The corrected images can be used in verification panels. Because the images have been corrected prior to recognition, and possibly run through a registration step to align the image with a fingerprint, this will produce text that nicely lines up with the image in a verification panel. If desired, these image corrected pages can be converted back into a PDF for export to another location. Alternatively, the original PDF can also be stored as needed.

When configuring recognition, review the action help for other settings that may be available to improve recognition. Some action libraries have a “top-level” help topic as well that can be accessed from within Datacap Studio by clicking on the parent node, providing general help for the entire library.

For additional information on getting the best quality recognition, see Best Practices for optimal text recognition in IBM Datacap.

