PDFFREDocumentToImage

Converts a PDF file to TIFF format and extracts any searchable text.

Member of namespace

Convert

Syntax

bool PDFFREDocumentToImage (string resolution, string compressionBW, string compressionColor, string compressionGray, string extensionBW, string extensionColor, string extensionGray, string convertMode, string useFastBinarization, string jpegQuality)

Parameters

Smart parameters are supported.

resolution

The resolution of extracted images. The valid values are 50 – 3200 dots per inch.

compressionBW

The compression to use for black and white pages that are extracted from the source PDF. For more information about the valid types, see PDF image compression types.

compressionColor

The compression to use for color pages that are extracted from the source PDF. For more information about the valid types, see PDF image compression types.

compressionGray

The compression to use for grayscale pages that are extracted from the source PDF. For more information about the valid types, see PDF image compression types.

extensionBW

The file extension to use for black and white pages that are extracted from the source PDF.

extensionColor

The file extension to use for color pages that are extracted from the source PDF.

extensionGray

The file extension to use for grayscale pages that are extracted from the source PDF.

convertMode

Used to set the conversion mode. Following are the possible values:

0	Preserve color.
1	Convert all images to black and white.

useFastBinarization

This setting is relevant when you set the convertMode to 1. It causes the conversion function to use a faster binarization algorithm to convert pages to black and white during extraction. Following are the possible values:

True	Use the faster binarization algorithm, which might result in lower quality images.
False	Do not use the faster binarization algorithm.

jpegQuality

The quality of the color images that are extracted with JPEG compression. The valid levels are 0 – 100. The higher the value, the higher the quality.

Returns

True if the file is successfully converted to a TIFF document. False if the current page is not a supported image type or if the conversion fails. If the number of input files or pages exceeds the allowed maximum, or if the conversion fails, the batch is set to abort.

Level

Page level.

Details

If the current page is a PDF, the file is converted to multiple TIFF files, one TIFF file for each page within the document.

If the PDF document contains searchable text, a CCO file that contains the positions and text is also created for each page in the document.

Note: Call the NormalizeCCO action, from the CCO2CCO action library, in a subsequent ruleset to ensure the integrity of the CCO file. This call is also required if the application uses the navigation and pattern match actions to find recognized text on a page or to perform pattern matching.
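
As a minimal sketch of that ordering, the two calls might be split across rulesets as follows. The conversion arguments are copied from the example on this page. The parameterless form of NormalizeCCO is an assumption, so confirm it against the CCO2CCO action library reference.

Conversion ruleset, run on the PDF page:

PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

Subsequent ruleset, run on each newly created TIFF page:

NormalizeCCO() - Assumed here to take no parameters.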

A new page is also created within the application environment for each new TIFF, so that subsequent rules can process it. The name of the original file from which the page is extracted is stored in the ParentImage variable, for possible future reference within your application.

To prevent creating a CCO and to ignore searchable text within a PDF, enable convPdfIgnoreContent by setting the variable to "1" in the page DCO before you call PDFFREDocumentToImage. When y_createLayout is set to "1", convPdfIgnoreContent is automatically enabled.

Turn off CCO file creation if the application is unlikely to process searchable PDF documents, or if full page OCR is needed later in the workflow.
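
For example, a rule that skips CCO creation might set the variable immediately before the conversion. This is a sketch only: it reuses the rrSet pattern and the "@X" reference that this page shows for y_contentReuseMode, and the conversion arguments are copied from the example on this page.

rrSet("1", "@X.convPdfIgnoreContent") - Ignore searchable text so that no CCO file is created.
PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)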

Creating a Layout File

This action can run recognition in addition to text extraction. To enable this capability, set the DCO variable y_createLayout to "1" before you call the action. By default, this feature is turned off.

When this option is turned on, a layout XML file (for example, tm000001_layout.xml) is created for each image that is extracted.

Note: When y_createLayout is enabled, a CCO file is not created by default. Use the action CreateCcoFromLayout in the SharedRecognitionTools library on each newly created page to convert each layout XML to a CCO to allow other CCO actions to operate on the text.

The layout file groups text into blocks, as a person would see them when looking at the document. Each block has either the default type of block or a specific type, such as title or table. The DocumentAnalytics action library provides Locate actions, such as GoSiblingBlockNext, for navigating the block structure. In contrast, the CCO file that is produced by other actions groups text into lines that span the width of the page.

The layout XML file also retains font and color attributes for the text, saved in CSS format, which can be used for extracting data and for reconstructing the document in a new format.

To use the Locate actions and perform click ‘n’ key during verification, use the CreateCcoFromLayout action in the SharedRecognitionTools action library. This action creates a CCO file for the page after the layout XML file is produced.
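
The sequence that is described in this section can be sketched as follows. The rrSet pattern and the "@X" reference follow the y_contentReuseMode settings on this page, and the conversion arguments are copied from the example on this page. The parameterless form of CreateCcoFromLayout is an assumption, so confirm it against the SharedRecognitionTools action library reference.

Conversion ruleset, run on the PDF page:

rrSet("1", "@X.y_createLayout") - Enable layout XML creation.
PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

Subsequent ruleset, run on each newly created page:

CreateCcoFromLayout() - Assumed here to take no parameters.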

Following are the types of elements (in the "Block Type/XML Node" format) that might be present in the layout XML file:

Block/Block
Header/Header
Footer/Footer
Title/Title
Heading1/H1
Heading2/H2
Heading3/H3
Picture/Picture
Barcode/Barcode
Space/S
Tab/Tab
Table/Table
Row/Row
Cell/Cell
Paragraph/Para
Line/L
Sentence/Sent
Word/W
Character/C

Text Extraction versus Text Recognition

By default, the text included in the layout XML is obtained from a combination of automatic recognition that is run on each page of the PDF and searchable text that is embedded within the PDF. The engine recognizes the text in any images that are embedded on the page.

If areas of the page contain both an image and searchable text that is associated with the image, the engine decides whether to use the searchable text or to recognize the text from the matching image. Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF.

The variable y_contentReuseMode can be used to force the engine to use only the recognized text on the page or only the embedded text on the page. One reason to use only the embedded text is to prevent recognition and produce high confidence results.

A drawback of using only the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture the missing data. The resulting layout XML is then incomplete compared to what the user sees when viewing the PDF. Do not use this setting if the source PDF file is of the image-on-text type because in this case, the text layer is not extracted. In addition, if a text line contains characters that are not included in the alphabet of the selected recognition languages, this text is not written to the result. In such cases, use mode 0 or 1.

These settings of y_contentReuseMode can be set on the DCO node that is being converted:

rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.
rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.
rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.
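
For example, to build the layout XML from embedded text only, the mode might be set together with y_createLayout before the conversion. This is a sketch under the assumption that both variables are set on the same page node. The conversion arguments are copied from the example on this page.

rrSet("1", "@X.y_createLayout") - Enable layout XML creation.
rrSet("2", "@X.y_contentReuseMode") - Use only the embedded text.
PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)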

For more information about configuring the recognition language, refer to the OCR_A action, Recognize.

Including PDF Annotations

By default, text annotations in the source PDF file are not included in the output image. "Free Text" annotations in the source PDF can be included in the output image by setting the page DCO variable y_IncludeAnnotation to "1". Other types of PDF annotations, such as popup and ink annotations, are not supported. This feature does not cause the text of a "Sticky note" to be displayed on the image, and a sticky note icon might display on the final image regardless of this setting.
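
A minimal sketch of enabling this setting before the conversion follows. It reuses the rrSet pattern and the "@X" reference that this page shows for y_contentReuseMode, and the conversion arguments are copied from the example on this page.

rrSet("1", "@X.y_IncludeAnnotation") - Include "Free Text" annotations in the output image.
PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)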

Example

PDFFREDocumentToImage(300, 32, 32, 32, ".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)