PDFFREDocumentToImage

Converts a PDF file to TIFF format.

Syntax

bool PDFFREDocumentToImage (string resolution , string compressionBW , string compressionColor , string compressionGray , string extensionBW , string extensionColor , string extensionGray , string convertMode , string useFastBinarization , string jpegQuality)

Parameters

Smart parameters are supported.

resolution: The resolution of extracted images. The valid values are 50 – 3200 pixels per inch.
compressionBW: The compression of extracted black and white pages in the source PDF. For more information about the valid types, see PDF image compression types.
compressionColor: The compression of extracted color pages in the source PDF. For more information about the valid types, see PDF image compression types.
compressionGray: The compression of extracted grayscale pages in the source PDF. For more information about the valid types, see PDF image compression types.
extensionBW: The file extension of extracted black and white pages in the source PDF.
extensionColor: The file extension of extracted color pages in the source PDF.
extensionGray: The file extension to use for grayscale pages that are extracted from the source PDF.
convertMode: Used to set the conversion mode. Valid values are 0 to preserve color and 1 to convert all images to black and white.
useFastBinarization: This setting is relevant only when convertMode is set to 1. It causes the conversion function to use a faster binarization algorithm to convert pages to black and white during extraction. Valid values are True or False.
jpegQuality: The quality of the color images that are extracted with JPEG compression. The valid levels are 0 – 100. The higher the value, the higher the quality.

If the document is a text document that is used for recognition, select a lossless compression such as FAX for black and white images or LZW for color or grayscale images. Do not use a lossy compression such as JPEG for text documents. JPEG is intended for photographic images. The use of a lossy compression in any step of the process for a textual document causes sharp character edges to become fuzzy, reducing recognition accuracy.
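The value ranges listed above can be checked before the action is called. The following Python sketch is illustrative only and is not part of Datacap: the function name check_parameters and the message texts are assumptions, and it handles literal values only (a smart parameter reference cannot be checked this way and is reported as out of range).

```python
def _in_range(value, low, high):
    """Return True if value parses as an integer between low and high inclusive."""
    try:
        return low <= int(value) <= high
    except (TypeError, ValueError):
        return False


def check_parameters(resolution, convert_mode, use_fast_binarization, jpeg_quality):
    """Return a list of problems found in the range-checked arguments."""
    problems = []
    if not _in_range(resolution, 50, 3200):
        problems.append("resolution must be 50 to 3200 pixels per inch")
    if str(convert_mode) not in ("0", "1"):
        problems.append("convertMode must be 0 or 1")
    if str(use_fast_binarization).lower() not in ("true", "false"):
        problems.append("useFastBinarization must be True or False")
    if not _in_range(jpeg_quality, 0, 100):
        problems.append("jpegQuality must be 0 to 100")
    return problems


# The values used in the examples on this page pass all four checks.
print(check_parameters("300", "0", "False", "100"))
```

An empty list means that the four checked arguments are within the documented ranges. The compression and extension arguments are not checked because their valid values are listed elsewhere.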

Returns

True if the file is successfully converted to a TIFF document.

False if the current page is not a supported image type or if the conversion fails. If the number of input files or pages exceeds the maximum that is allowed, or if the conversion fails, the batch is set to abort.

Level

Page or Document level.

If called on a page level object, each new single page TIF file that is created from the source PDF becomes associated with a new DCO page object that is at the same level as the parent DCO page where the page was created.

If called on a document level object, each new single page TIF file becomes associated with a new DCO page object that is a child page of the parent document object.

For example, if the PDF is associated with a page object, then the DCO structure is created like this example:

˂page "Other"˃ TM000001.pdf (source page object) 
˂page "Other"˃ TM000002.tif (new page object, page 1 of TM000001.pdf) 
˂page "Other"˃ TM000003.tif (new page object, page 2 of TM000001.pdf) 
˂page "Other"˃ TM000004.tif (new page object, page 3 of TM000001.pdf) 
˂page "Other"˃ TM000005.tif (new page object, page 4 of TM000001.pdf)

If the PDF is associated with a document object, then the DCO structure would be created like this example:

˂Document "Invoice"˃ TM000001.pdf (source document object)
- ˂page "Other"˃ TM000002.tif (new page object, page 1 of TM000001.pdf)
- ˂page "Other"˃ TM000003.tif (new page object, page 2 of TM000001.pdf)
- ˂page "Other"˃ TM000004.tif (new page object, page 3 of TM000001.pdf)
- ˂page "Other"˃ TM000005.tif (new page object, page 4 of TM000001.pdf)

Details

If the current page is a PDF, the file is converted to multiple TIFF files, one TIFF file for each page within the document.

To recognize the document during conversion, it is recommended that the action is configured to create a layout file. If a layout file is not enabled and if the PDF document contains searchable text, a CCO file containing the positions and text is also created for each page in the document. Typically text quality is better when performing recognition.

Note: If not creating a layout file but simply extracting the PDF embedded text to a CCO, the NormalizeCCO action from the SharedRecognitionTools action library must be called in a subsequent ruleset to ensure the integrity of the CCO file. This is also required if the application will be using the navigation and pattern match actions to find recognized text on a page or perform pattern matching.

Each new TIFF will also have a new DCO page node that is created within the application environment, which can be processed by subsequent rules. The original file name from which the page was extracted will be stored in the "ParentImage" DCO variable, for possible future reference within your application.

To prevent creating a CCO and to ignore searchable text within a PDF, enable convPdfIgnoreContent by setting the variable to "1" in the page DCO before you call PDFFREDocumentToImage. When y_createLayout is set to "1", then convPdfIgnoreContent is automatically enabled.

It is recommended to turn OFF the CCO file creation feature in the case where it is unlikely that the application might process searchable PDF documents, or in the case where full page, or field level OCR is performed later directly on the newly created images, or when the text is not needed.

Recognition and Creating a Layout File:

This action runs recognition in addition to text extraction. Before you call the action, enable this capability by setting the DCO variable y_createLayout to "1". By default, this feature is turned off.

When this option is turned on, a layout xml file (for example tm000001_layout.xml) is created per image that is extracted.

Note: When y_createLayout is enabled, a CCO file is not created by default. Use the action CreateCcoFromLayout in the SharedRecognitionTools library on each newly created page to convert each layout XML to a CCO to allow other CCO actions to operate on the text, such as Locate actions or click-n-key in a verification panel.

The layout file groups text into blocks as a person would look at the document. Each block might have the default type of block or a specific type such as title or table. Locate actions, such as GoSiblingBlockNext, are available in the DocumentAnalytics action library to navigate in the block structure. By contrast, the CCO file that is produced by other actions groups text into lines that span the width of the page.

The layout XML file also retains font and color attributes that are saved in CSS format for the text, which is used for extracting data and reconstructing the document in a new format.

The CCO is an internal structure Datacap uses to work with text. The CCO contains all of the text and the positions for each recognized character. It allows text-related actions, like the Locate actions and verification click-n-key, to operate.

Using rules that are attached to the created pages, call the action CreateCcoFromLayout to create a CCO file for each page from the text that was previously recognized and saved in the layout file. The CCO is then created and loaded. CreateCcoFromLayout must be called in a separate rule from PDFFREDocumentToImage, and that rule must be run on each of the newly created page objects.

The layout file contains the results of recognition. Heuristic algorithms are used to identify the text elements on the page. The elements that are detected can be different on pages that look the same or use the same form. It cannot be guaranteed that a particular element will always exist or will be recognized.

Following are the types of elements (in the Block Type/XML Node format) that might be present in the layout XML file:

Block/Block
Header/Header
Footer/Footer
Title/Title
Heading1/H1
Heading2/H2
Heading3/H3
Picture/Picture
Barcode/Barcode
Space/S
Tab/Tab
Table/Table
Row/Row
Cell/Cell
Paragraph/Para
Line/L
Sentence/Sent
Word/W
Character/C
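The list above can be kept as a lookup table when a layout XML file is inspected outside of Datacap. The following Python sketch is illustrative only: the sample XML string is hypothetical and much simpler than a real layout file, which might use namespaces, attributes, or a different root element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Block type -> XML node name, copied from the list above.
BLOCK_TYPE_TO_NODE = {
    "Block": "Block", "Header": "Header", "Footer": "Footer", "Title": "Title",
    "Heading1": "H1", "Heading2": "H2", "Heading3": "H3",
    "Picture": "Picture", "Barcode": "Barcode", "Space": "S", "Tab": "Tab",
    "Table": "Table", "Row": "Row", "Cell": "Cell",
    "Paragraph": "Para", "Line": "L", "Sentence": "Sent",
    "Word": "W", "Character": "C",
}


def count_layout_nodes(xml_text):
    """Count the elements whose tag is one of the known layout node names."""
    known = set(BLOCK_TYPE_TO_NODE.values())
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter() if el.tag in known)


# Hypothetical sample, not taken from a real layout file.
SAMPLE = "<Block><Para><L><W>Invoice</W><S/><W>42</W></L></Para></Block>"
counts = count_layout_nodes(SAMPLE)
print(counts["W"], counts["L"])
```

Because the detected elements can differ between pages that look the same, code that reads a layout file should treat every node type as optional.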

Text Extraction versus Text Recognition

By default, the text included in the layout XML is obtained from a combination of automatic recognition that is run on each page of the PDF and from searchable text that is embedded within the PDF. Any images that are embedded on the page and contain letters, such as scanned text pages that do not have any embedded text, have their text recognized by the engine and added to the final recognized text. This is the most complete, and recommended, recognition mode.

If areas of the page contain both an image and searchable text that is associated with the image, the engine decides whether the engine must use the searchable text or recognize the text from the matching image. Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF.

The variable y_contentReuseMode can be used to force the engine to use only the recognized text on the page or only to use the embedded text on the page. One reason why you might decide only to use the embedded text is to prevent recognition and produce high confidence results.

A drawback of using only the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture the missing data. The resultant layout XML is incomplete compared to what the user sees when viewing the PDF. Do not use this setting if the source PDF file is of the image-on-text type because in this case, the text layer is not extracted. If a text line contains characters that are not included in the alphabet of the selected recognition languages, this text is not written to the result. In these cases, use mode 0 or 1.

These settings of y_contentReuseMode can be set on the DCO node that is being converted:

rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.
rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.
rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.
Note: Set the variable y_correctSkewMode to 0 when you set y_contentReuseMode=2. This prevents changing of the coordinates for digitally born PDFs.

For more information about configuring the recognition language, see the OCR_A action Recognize (deprecated).

Recognizing the PDF versus the New Image: Choosing The Best Approach

By enabling the creation of a layout file, the engine directly recognizes the PDF simultaneously as it creates a TIF image for each page in the PDF. For PDFs that all contain machine printable text generated directly from office type applications such as a word processor, this may be a good solution. PDFs directly generated from electronic sources usually contain clean machine printed text that is straight and readable without requiring additional enhancements.

For PDFs that have a varied source and contain scans of forms or other documents that were brought together and converted to a PDF, it is usually a better option to simply convert the PDF to images and perform recognition directly on the newly generated page images. This approach has the benefit that the documents can be manipulated before recognition. Image enhancements can be performed to straighten text, remove lines, clean the background, etc. Once the image quality has been improved, then recognition can be performed, providing better recognition results compared to directly recognizing the PDF as it came in. It is often the case that simply deskewing a document will provide significant improvement to the recognized text. Testing different approaches can help you find the best approach for your documents.

Converting PDF to TIF without Recognition

Converting the PDF to TIF without performing recognition or extraction is the fastest way to convert the PDF. This is also the recommended approach if recognition will be performed in a following ruleset with subsequent actions. For example, if the PDF contains scanned images, then it is best to first convert the PDF, then use the Image Enhancement ruleset to adjust the images. To convert without any recognition or extraction, set the variables "convPdfIgnoreContent" to "1" and "y_createLayout" to "0" on the same DCO node that is attached to the PDFFREDocumentToImage action. These variables can be placed in the setup DCO or set by using the action rrSet at run time. If these variables are always intended to be set, then placing them in the Setup DCO in Datacap Studio is more efficient than using rrSet at run time.

The following shows how to set these variables at runtime:

rrSet("1","@X.convPdfIgnoreContent")
rrSet("0","@X.y_createLayout")

Considerations

If field recognition is required, then recognition must be performed on the created TIF images. Field recognition cannot be performed directly on a PDF. If there is a requirement that the original images or original PDF must be saved and archived at the end of the batch, the originals are still available at the end of batch processing. It is strongly recommended that the document "Best Practices for optimal text recognition in IBM Datacap" is reviewed to understand how to get the best results from recognition.

Including PDF Annotations

By default, text annotations included in the source PDF file are not included in the output image. "Free Text" annotations in source PDF can be included in the output image by setting the page DCO variable y_IncludeAnnotation to "1". Other types of PDF annotations are not supported, such as popup and ink annotations. This feature does not cause the text of a "Sticky note" to be displayed on the image and a sticky note icon might display on the final image regardless of this setting.

Page Deskew

Deskewing an image performs a slight rotation of the image to correct text that is slanted. An image that is deskewed provides better recognition and performs better in the verification task. If an image is not deskewed, a single line of text might be recognized poorly or be split across multiple lines instead of appearing as a single line. Performing recognition after deskewing an image improves recognition quality and ensures that the recognized coordinates match the text position on the adjusted image.

When PDFFREDocumentToImage extracts a page from a PDF and converts it to an image, it deskews the image and saves the adjusted image as a TIFF. A page can also be deskewed before recognition by using the Image Enhancement actions. Alternatively, the OCR/A action RotateImageOCR_A can perform deskew and other image enhancements. The OCR/A actions Recognize and RecognizePage can also automatically deskew the image in the recognition step.

Page deskew can be controlled by setting the DCO variable y_correctSkewMode. If you have the recognition engine deskew the image, then it is not necessary to first deskew by using the image enhancement actions. It may be useful to perform deskew and other image cleanup before recognition depending on the quality of your input documents.

The maximum amount that the image is deskewed is 15 degrees. The engine might or might not deskew an image greater than that amount.

To have the engine deskew the image during recognition, set y_correctSkewMode to one of the following:

0: Turns off deskew during recognition.

1: Deskew by using horizontal black registration squares.

2: Deskew by using vertical black registration squares.

4: Deskew by using horizontal lines (do not use if lines do not appear on the page).

8: Deskew by using vertical lines (do not use if lines do not appear on the page).

16: Deskew by using horizontal text (Default value, if y_correctSkewMode is not set).

When a page is deskewed, the image's physical size can become slightly larger because the rotation adds pixels along the edges to preserve the whole image. For example, a skewed page that is 8.5" x 11" may become 9.265" x 11.59".

The amount of increase depends on the skew of the image. For full page recognition, this change usually does not matter. If using field level recognition, it can cause zones to line up incorrectly. In this situation, it may be necessary to use registration anchors to align the image to the zones or crop the image to the original size.
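The size increase follows from ordinary rotation geometry: the canvas must grow to hold the corners of the rotated page. The following Python sketch is illustrative only. It uses the standard bounding-box formula for a rotated rectangle, not the engine's documented method, and the 4.1 degree angle is an assumption chosen because it gives a result close to the figures quoted above.

```python
import math


def rotated_canvas_size(width, height, skew_degrees):
    """Return the (width, height) of the smallest upright canvas that holds
    a width x height rectangle after it is rotated by skew_degrees."""
    angle = math.radians(abs(skew_degrees))
    new_width = width * math.cos(angle) + height * math.sin(angle)
    new_height = width * math.sin(angle) + height * math.cos(angle)
    return new_width, new_height


# An 8.5" x 11" page rotated by an assumed 4.1 degrees.
w, h = rotated_canvas_size(8.5, 11.0, 4.1)
print(round(w, 3), round(h, 3))
```

A page with no skew keeps its original size, and the growth increases with the skew angle, which is why field zones can drift on strongly skewed pages.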

The variable y_correctSkewMode might be set at run time by using the action rrSet. For example:

rrSet("0", "@X.y_correctSkewMode")
PDFFREDocumentToImage("...")

Alternatively, the variable y_correctSkewMode might be set in the object in the Setup DCO from within Datacap Studio. If the variable is always intended to be set, then placing it in the Setup DCO in Datacap Studio is more efficient than using rrSet at run time.

Binarization Threshold

This property is used for fine-tuning of the brightness threshold of the image during image preprocessing.

The pixels with brightness higher than the threshold value are replaced with white pixels and the rest with black.

The value of this property may be in the range from 0 to 255. The default value of this property is -1, which means that the binarization threshold value is determined by the recognition engine automatically.

Binarization threshold can be controlled by setting the DCO variable y_binarizationThreshold.

The variable y_binarizationThreshold might be set by using the action rrSet. For example:

rrSet("50", "@X.y_binarizationThreshold")
PDFFREDocumentToImage("...")
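The threshold rule described above can be pictured with a few brightness values. The following Python sketch is illustrative only: it applies a fixed threshold to a small grid of values from 0 to 255 and does not model the automatic mode that the engine uses when the value is -1.

```python
def binarize(pixels, threshold):
    """Replace each brightness value above threshold with white (255)
    and every other value with black (0)."""
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be in the range 0 to 255")
    return [[255 if value > threshold else 0 for value in row] for row in pixels]


# One row of sample brightness values, using the threshold of 50 from the example above.
print(binarize([[0, 50, 51, 255]], 50))
```

A value that is equal to the threshold is not higher than it, so it becomes black. A lower threshold therefore turns more of the page white, and a higher threshold turns more of it black.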

Example

rrSet("1", "@X.y_createLayout")
PDFFREDocumentToImage(300, 18, 32, 33,".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

This example creates a DCO page node with a corresponding image for each page within the PDF and performs recognition simultaneously because y_createLayout is set to "1". If any color pages are encountered during the conversion, the color page is retained, creating a TIF with LZW compression, which is a lossless compression that preserves the quality of the source image. For any pages that are black and white, the created images have FAX G4 compression, which is an efficient and lossless way of compressing a black and white image. The recognized text is stored in a layout file. This text can be used by calling the action CreateCcoFromLayout in the SharedRecognitionTools library for each newly created page object. Each image that is created has a DPI of 300.

Example

SetNamePattern("2")
PDFFREDocumentToImage(300, 18, 18, 18, ".tif", ".tif", ".tif", 1, false, 100)

This example creates a DCO page node for each image and converts all pages to black and white using FAX G4 compression, which is loss-less and generally the best compression to use for black and white images. Each image that is created will have a DPI of 300. Depending on your needs, it may be fine to convert to black and white at this step or it may be useful to convert to black and white after performing specific image enhancements. Because SetNamePattern("2") is called before the action, the newly created TIF files are named by using the TMxxxxxx naming convention.

For this example, SetNamePattern("2") is called directly before PDFFREDocumentToImage on the same DCO node. This action can be called at any time before PDFFREDocumentToImage as long as it is called within the same task profile. Calling the action once in a batch open event causes all subsequent pages to use this setting and is more efficient than calling the action repeatedly for each page.

SetNamePattern("2")
SetChildPageType("Mortgage")
rrSet("1","@X.convPdfIgnoreContent")
rrSet("0","@X.y_createLayout")
PDFFREDocumentToImage("300","18","33","32",".bw.tif","col.tif","gray.tif","0","False","100")

This example converts the PDF to TIFs without performing any recognition or extracting any data. This is the fastest mode to convert a PDF to TIF images and is recommended when pages will be recognized in a following rule. If the PDF contains scanned images, it is typically best to use this approach because it allows the images to be adjusted by using the Image Enhancement ruleset to deskew and improve the quality of the scanned images. Recognition actions can then recognize the pages after the images have been cleaned.

Alternatively, the variables "convPdfIgnoreContent" and "y_createLayout" can be created in the setup DCO in Datacap Studio on the DCO node that is calling PDFFREDocumentToImage. This is more efficient because the variables do not need to be created repeatedly at run time for each page.

Handling PDFs with Large Page Sizes

When PDFFREDocumentToImage converts a PDF to a TIF, it preserves the physical page size of the page. This is typical of a well-formed image. For example, if you were to scan a page at 8.5" x 11" at 300 DPI, the image that is created would be 2550 x 3300 pixels with a DPI set to 300. When this image is printed at 100% size from a typical application, it prints at the same size as it was scanned, 8.5" x 11". When performing OCR on a document, a DPI of 200 to 300 helps provide a good image for engines to recognize text on the image.

It is possible for a PDF to incorrectly report the physical page size when an image is embedded within the PDF. For example, the physical page size in the PDF might be reported as 35.4" x 45.82". In reality, the picture is of a standard 8.5" x 11" page. Because the physical page size is so large, an image is created that is 10621 x 13746 pixels. These large images use available memory and can cause errors due to a low amount of memory available to the application. When these images with incorrectly reported physical sizes are printed, they print incorrectly because at 100%, the system would try to print the image at 35.4" x 45.82".

There are two primary ways that the incorrect page size is caused within a PDF. One typical path is when a photo is taken with a phone or other camera. The device has no idea of the actual size of the thing in the picture. Maybe it is a picture of an 8.5" x 11" page or maybe it is a picture of a giant battleship. The camera does not know the correct physical size of the object and cannot set the correct size. If pre-processing is not performed on the image to correct the information before it is converted into a PDF, the PDF reports a huge, and inaccurate, page size. Similarly, there are image types that do not support a DPI, such as PNG and BMP. When these are converted to a PDF page, the process usually assumes a DPI of 72. So, even if the actual source was a page scanned at 300 DPI, the native image type does not retain the DPI, and many PDF generation packages assume a small DPI. This too creates a huge, and inaccurate, physical page size. When these huge pages are converted back to a TIF, an image with an overly large X and Y pixel size is created.
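The arithmetic behind these oversized pages is pixels = inches x DPI. The following Python sketch is illustrative only. The function names are assumptions, and the scenario (a 300 DPI scan whose DPI is lost and then assumed to be 72) is one way to arrive at figures close to the ones quoted above.

```python
def pixels_for(inches, dpi):
    """Pixel length of a side that is `inches` long at `dpi` pixels per inch."""
    return round(inches * dpi)


def reported_inches(pixels, assumed_dpi):
    """Physical length implied when `pixels` are interpreted at `assumed_dpi`."""
    return pixels / assumed_dpi


# A well-formed scan: 8.5" x 11" at 300 DPI.
scan = (pixels_for(8.5, 300), pixels_for(11, 300))

# The same pixels stored without a DPI and assumed to be 72 DPI in the PDF.
page = (reported_inches(scan[0], 72), reported_inches(scan[1], 72))

# Converting that reported page size back to an image at 300 DPI.
huge = (pixels_for(page[0], 300), pixels_for(page[1], 300))

print(scan, (round(page[0], 2), round(page[1], 2)), huge)
```

The reported page is roughly four times larger in each direction than the real page, so the converted image holds about 17 times as many pixels as the original scan.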

If you are processing PDF documents with these large page sizes, the following settings can be used to automatically reduce the size of the images created: y_PDFMaxShortSidePixel and y_PDFNewShortSidePixel.

The DCO variable y_PDFMaxShortSidePixel identifies the maximum allowed pixel size of the short side of a page. If the page is a portrait layout, then this controls the maximum width. If the page is landscape, this controls the maximum height. If it is expected that pages should normally be 8.5" x 11", this value might be set to the largest size to allow without readjustment. For example, this value might be set to "3000". For a page that is exactly 8.5" x 11", the width would be 2550 pixels. By setting the adjustment value slightly bigger, the software still allows pages that are a little bigger without readjusting. Once the pixel size is over 3000, the automatic resize is initiated during the PDFFREDocumentToImage action.

Note: When recognition is performed, it is possible for the recognition step to alter the pixel size. For example, if a page is slightly skewed, the engine deskews the page to correct it. Typically, the deskew process causes a slight increase in the size of the image. This pixel increase is normal and expected.

rrSet("3000", "@X.y_PDFMaxShortSidePixel") 
rrSet("2550", "@X.y_PDFNewShortSidePixel") 
rrSet("1", "@X.y_createLayout") 
rrSet("English", "@X.y_lg") 
PDFFREDocumentToImage("300", "18", "33", "32", "bw.tif", "color.tif", "gray.tif", "0", "False", "100") 
PDFFREReleaseEngine()

This example processes a PDF, creates a TIF image for each page, and performs recognition, creating a layout file with the recognition results. The layout file can later be converted to a CCO by using CreateCcoFromLayout. The settings of y_PDFMaxShortSidePixel and y_PDFNewShortSidePixel cause the automatic resize of any image that would be bigger than 3000 pixels on the "short" side of the page. For portrait images, this is the width. For landscape images, this is the height. If an image is larger than 3000 pixels, it is resized so that the short side is 2550 pixels and the long side of the image is reduced proportionally.
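The resize rule can be worked through with the oversized page from earlier in this section. The following Python sketch is illustrative only: the function name is an assumption, and rounding to the nearest pixel is an assumption because the exact rounding that the action uses is not documented here.

```python
def resize_for_short_side(width, height, max_short, new_short):
    """Return the (width, height) after the short-side rule is applied.

    The image is left alone unless its short side is bigger than max_short.
    Otherwise it is scaled so the short side becomes new_short and the long
    side shrinks by the same factor."""
    short = min(width, height)
    if short <= max_short:
        return width, height
    scale = new_short / short
    return round(width * scale), round(height * scale)


# Portrait page of 10621 x 13746 pixels with the settings from the example above.
print(resize_for_short_side(10621, 13746, 3000, 2550))
```

With these settings, the oversized page comes out at the pixel size of a normal 8.5" x 11" scan at 300 DPI, and a page that is already 2550 x 3300 is returned unchanged.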

Reducing Memory Usage

If PDFFREDocumentToImage is processing a document with large images or many pages, more memory might be required than is available, resulting in an error. In that case, low-memory mode needs to be enabled to allow the action to complete successfully.

There are two settings that can be configured. "y_maxPagesForInMemoryProcessing" controls when disk is used to reduce memory requirements. For example, when building a large PDF document with RecognizeToPDF, or splitting a large PDF document with PDFFREDocumentToImage, "y_maxPagesForInMemoryProcessing" instructs the engine when to use disk. If it is set to "50" and the document contains fewer than 50 pages, only memory is used. If the document contains more than 50 pages, then disk is used, saving memory.

"y_LowMemoryMode" instructs the engine to use as little memory as possible. When this variable is set to "1", then low memory mode is enabled.

The variables need to be set before calling PDFFREDocumentToImage.

rrSet("15", "@X.y_maxPagesForInMemoryProcessing") 
rrSet("1", "@X.y_LowMemoryMode") 
rrSet("1", "@X.y_createLayout") 
rrSet("English", "@X.y_lg") 
PDFFREDocumentToImage("300", "18", "33", "32", "bw.tif", "color.tif", "gray.tif", "0", "False", "100") 
PDFFREReleaseEngine()

As an alternative to calling rrSet, the variables might also be configured in the Setup DCO for the application.
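The page-count rule for y_maxPagesForInMemoryProcessing can be written out as a small decision function. The following Python sketch is illustrative only: the function name and return strings are assumptions, and the behavior when the page count is exactly equal to the setting is not stated in this topic, so the sketch reports that case separately instead of guessing.

```python
def storage_for(page_count, max_pages_in_memory):
    """Return where the engine works on a document, per the rule described above."""
    if page_count < max_pages_in_memory:
        return "memory"
    if page_count > max_pages_in_memory:
        return "disk"
    # The boundary case is not described, so it is not decided here.
    return "unspecified"


# With the value of 15 from the example above.
for pages in (4, 15, 120):
    print(pages, storage_for(pages, 15))
```

If the boundary matters for an application, testing with a document of exactly that page count is the only reliable way to confirm the behavior.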

Separate Work Across Profiles

PDF processing can require a large amount of memory. Use the feature that automatically reduces large pages to correct large page sizes. This feature not only reduces memory at this step, but also uses less disk space and reduces the memory use of follow-on actions. Enabling low-memory mode may also be required to reduce the memory that is used for PDF conversion and recognition.

After enabling these settings, it is still possible that a large amount of memory is used to process a PDF because it has many pages. It is recommended to perform PDFFREDocumentToImage as the last action in a profile. Separating actions by using multiple profiles allows memory to be reclaimed, ensuring that the new profile starts fresh, and it allows for greater flexibility in processing profiles.

Profile1 
- - Ruleset for VScan actions 

Profile2 
- - Ruleset to prepare documents. i.e. PDFFREDocumentToImage() 

Profile3 
- - Rulesets for Page ID actions or CreateCCOFromLayout()

The above is a very generic example of how profiles should be separated so that PDFFREDocumentToImage runs in one profile and subsequent actions run in a separate profile. The logical steps of an application can vary. For example, some applications may want to do "page ID" steps before PDFFREDocumentToImage, while other applications may do page ID after. There is no right or wrong answer because it depends on what is required for the application. The main point is to start a new profile after PDFFREDocumentToImage has completed. Rulerunner should also be configured to restart between profiles to ensure that any allocated memory has been freed before the next step.

Processing Secured PDF Files

For PDF files that have the properties Content Copying: Not Allowed or Content Copying for Accessibility: Not Allowed enabled, the action removes these properties automatically from the PDF so that image files can be created for each page of the PDF. If the PDF is changed, a backup of the original PDF is saved in the batch directory. The original PDF is saved with the name "filename.original.pdf". For example, TM000001.original.pdf.

The default suffix ".original" can be changed by setting the DCO variable "y_PdfBackupSuffix" before calling the action PDFFREDocumentToImage.

rrSet(".secure", "@X.y_PdfBackupSuffix")
PDFFREDocumentToImage(300, 18, 32, 33,".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

This example backs up the original file as PDFFileName.secure.pdf, removes the security properties from the file that is associated with the DCO object, and then creates an image for each page in the PDF. If y_PdfBackupSuffix is not set, then by default the suffix ".original" is added to the backup file name.
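The backup naming pattern can be reproduced outside of Datacap, for example to locate backups in a batch directory. The following Python sketch is illustrative only: the function name is an assumption, and the rule that the suffix is placed between the file name and the .pdf extension is inferred from the TM000001.original.pdf example above.

```python
import os


def backup_name(pdf_name, suffix=".original"):
    """Return the expected backup file name for pdf_name, with the suffix
    placed between the file name and its extension."""
    stem, extension = os.path.splitext(pdf_name)
    return stem + suffix + extension


print(backup_name("TM000001.pdf"))
print(backup_name("TM000001.pdf", ".secure"))
```

The second call corresponds to setting y_PdfBackupSuffix to ".secure" as in the rrSet example above.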