PDFFREDocumentToImage

Converts a PDF file to TIFF format.

Member of namespace

Convert

Syntax


bool PDFFREDocumentToImage (string resolution, string compressionBW, string compressionColor, string compressionGray, string extensionBW, string extensionColor, string extensionGray, string convertMode, string useFastBinarization, string jpegQuality)

Parameters

Smart parameters are supported.

resolution
The resolution of extracted images. The valid values are 50 – 3200 dots per inch.
compressionBW
The compression of extracted black and white pages in the source PDF. For more information about the valid types, see PDF image compression types.
compressionColor
The compression of extracted color pages in the source PDF. For more information about the valid types, see PDF image compression types.
compressionGray
The compression of extracted grayscale pages in the source PDF. For more information about the valid types, see PDF image compression types.
extensionBW
The file extension of extracted black and white pages in the source PDF.
extensionColor
The file extension of extracted color pages in the source PDF.
extensionGray
The file extension to use for grayscale pages that are extracted from the source PDF.
convertMode
Used to set the conversion mode. Following are the possible values:
0 Preserve color.
1 Convert all images to black and white.
useFastBinarization
This setting is relevant when you set the convertMode to 1. It causes the conversion function to use a faster binarization algorithm to convert pages to black and white during extraction. Following are the possible values:
True Use the faster binarization algorithm, which might result in lower quality images.
False Do not use the faster binarization algorithm.
jpegQuality
The quality of the color images that are extracted with JPEG compression. The valid levels are 0 – 100. The higher the value, the higher the quality.

If the document is a text document that will be used for recognition, select a lossless compression such as FAX for black and white images or LZW for color or grayscale images. Do not use a lossy compression such as JPEG for text documents. JPEG is intended for photographic images. The use of a lossy compression in any step of the process for a textual document will cause sharp character edges to become fuzzy, reducing recognition accuracy.
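The documented value ranges can be checked before the action is configured. The following Python sketch is illustrative only: the function name check_conversion_args is hypothetical and is not part of Datacap. It encodes only the limits stated above (resolution 50 to 3200, convertMode 0 or 1, jpegQuality 0 to 100).

```python
def check_conversion_args(resolution, convert_mode, jpeg_quality):
    """Return a list of problems found in the given argument values.

    Hypothetical helper, not a Datacap API. The action takes its
    parameters as strings, so the values are parsed before checking.
    """
    problems = []

    # resolution: valid values are 50 to 3200 dots per inch
    try:
        if not 50 <= int(resolution) <= 3200:
            problems.append("resolution must be 50 to 3200 dpi")
    except ValueError:
        problems.append("resolution must be an integer")

    # convertMode: 0 preserves color, 1 converts to black and white
    if str(convert_mode) not in ("0", "1"):
        problems.append("convertMode must be 0 or 1")

    # jpegQuality: valid levels are 0 to 100
    try:
        if not 0 <= int(jpeg_quality) <= 100:
            problems.append("jpegQuality must be 0 to 100")
    except ValueError:
        problems.append("jpegQuality must be an integer")

    return problems


if __name__ == "__main__":
    print(check_conversion_args("300", "0", "100"))
    print(check_conversion_args("40", "2", "101"))
```

An empty list means that the three values are within the documented ranges. The sketch does not check the compression or extension parameters.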

Returns

True if the file is successfully converted to a TIFF document. False if the current page is not a supported image type or if the conversion fails. If the number of input files or pages exceeds the allowed maximum, or if the conversion fails, the batch is set to abort.

Level

Page or Document level.

If called on a page level object, then each new single page TIF file created from the source PDF will be associated with a new DCO page object that is at the same level as the parent DCO page from which the page was created.

If called on a document level object, then each new single page TIF file will be associated with a new DCO page object that is a child page of the parent document object.

For example, if the PDF is associated with a page object, the DCO structure is created as follows:
˂page "Other"˃ TM000001.pdf (source page object)
˂page "Other"˃ TM000002.tif (new page object, page 1 of TM000001.pdf)
˂page "Other"˃ TM000003.tif (new page object, page 2 of TM000001.pdf)
˂page "Other"˃ TM000004.tif (new page object, page 3 of TM000001.pdf)
˂page "Other"˃ TM000005.tif (new page object, page 4 of TM000001.pdf)

If the PDF is associated with a document object, the DCO structure is created as follows:
˂Document "Invoice"˃ TM000001.pdf (source document object)
- ˂page "Other"˃ TM000002.tif (new page object, page 1 of TM000001.pdf)
- ˂page "Other"˃ TM000003.tif (new page object, page 2 of TM000001.pdf)
- ˂page "Other"˃ TM000004.tif (new page object, page 3 of TM000001.pdf)
- ˂page "Other"˃ TM000005.tif (new page object, page 4 of TM000001.pdf)
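The difference between the two levels can be modeled in a few lines. The following Python sketch is only an illustration of the placement rule described above: the function new_page_lines is hypothetical, and the assumption that file numbers continue sequentially from the source file is taken from the examples, not from a documented guarantee.

```python
def new_page_lines(level, page_count, first_index=2, page_type="Other"):
    """List the page objects created for a PDF that has page_count pages.

    level is "page" or "document". Hypothetical illustration only;
    it does not call or reproduce any Datacap API.
    """
    if level not in ("page", "document"):
        raise ValueError("level must be 'page' or 'document'")

    # Pages created from a document object are children of that document
    # (shown with a leading "- "). Pages created from a page object are
    # siblings of the source page, so they get no prefix.
    prefix = "- " if level == "document" else ""

    return [
        f'{prefix}<page "{page_type}"> TM{first_index + n:06d}.tif (page {n + 1})'
        for n in range(page_count)
    ]


if __name__ == "__main__":
    for line in new_page_lines("document", 4):
        print(line)
```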

Details

If the current page is a PDF, the file is converted to multiple TIFF files, one TIFF file for each page within the document.

To recognize the document during conversion, it is recommended that the action is configured to create a layout file. If a layout file is not enabled and the PDF document contains searchable text, a CCO file containing the positions and text is also created for each page in the document. Typically, text quality is better when recognition is performed.

Note: If not creating a layout file but simply extracting the PDF embedded text to a CCO, the NormalizeCCO action from the SharedRecognitionTools action library must be called in a subsequent ruleset to ensure the integrity of the CCO file. This is also required if the application will be using the navigation and pattern match actions to find recognized text on a page or perform pattern matching.

Each new TIFF will also have a new DCO page node created within the application environment which can be processed by subsequent rules. The original file name from which the page was extracted will be stored in the "ParentImage" DCO variable, for possible future reference within your application.

To prevent creating a CCO and to ignore searchable text within a PDF, enable convPdfIgnoreContent by setting the variable to "1" in the page DCO before you call PDFFREDocumentToImage. When y_createLayout is set to "1", convPdfIgnoreContent is automatically enabled.

It is recommended to turn off CCO file creation when the application is unlikely to process searchable PDF documents, when full-page or field-level OCR is performed later directly on the newly created images, or when the text is not needed.

Recognition and Creating a Layout File:

This action can run recognition in addition to text extraction. To enable this capability, set the DCO variable y_createLayout to "1" before you call the action. By default, this feature is turned off.

When this option is turned on, a layout xml file (for example tm000001_layout.xml) is created per image that is extracted.

Note: When y_createLayout is enabled, a CCO file is not created by default. Use the action CreateCcoFromLayout in the SharedRecognitionTools library on each newly created page to convert each layout XML to a CCO to allow other CCO actions to operate on the text, such as Locate actions or click-n-key in a verification panel.

The layout file groups text into blocks as a person would look at the document. Each block might have the default type of block or a specific type such as title or table. Locate actions, such as GoSiblingBlockNext, are available in the DocumentAnalytics action library to navigate in the block structure. In contrast, the CCO file that is produced by other actions groups text into lines that span the width of the page.

The layout XML file also retains font and color attributes for the text. These attributes are saved in CSS format and are used for extracting data and reconstructing the document in a new format.

The CCO is an internal structure Datacap uses to work with text. The CCO contains all of the text and the positions for each recognized character. It allows text related actions, like the Locate actions and verification click-n-key, to operate.

Using rules that are attached to the created pages, call the action CreateCcoFromLayout to create a CCO file for each page from the text that was previously recognized and saved in the layout file. The CCO is then created and loaded. CreateCcoFromLayout must be called in a separate rule from PDFFREDocumentToImage, and that rule must be run on each of the newly created page objects.

The layout file contains the results of recognition. Heuristic algorithms are used to identify the text elements on the page. The elements detected can be different on pages that look the same or use the same form. It cannot be guaranteed that a particular element will always exist or will be recognized.

Following are the types of elements (in the Block Type/XML Node format) that might be present in the layout XML file:
  • Block/Block
  • Header/Header
  • Footer/Footer
  • Title/Title
  • Heading1/H1
  • Heading2/H2
  • Heading3/H3
  • Picture/Picture
  • Barcode/Barcode
  • Space/S
  • Tab/Tab
  • Table/Table
  • Row/Row
  • Cell/Cell
  • Paragraph/Para
  • Line/L
  • Sentence/Sent
  • Word/W
  • Character/C

Text Extraction versus Text Recognition

By default, the text included in the layout XML is obtained from a combination of automatic recognition that is run on each page of the PDF and from searchable text that is embedded within the PDF. Any images that are embedded on the page and contain letters, such as scanned text pages that do not have any embedded text, have their text recognized by the engine and added to the final recognized text. This is the most complete, and recommended, recognition mode.

If areas of the page contain both an image and searchable text that is associated with the image, the engine decides whether to use the searchable text or to recognize the text from the matching image. Because the engine performs recognition, the confidence of the text might vary even if the same searchable text is embedded in the PDF.

The variable y_contentReuseMode can be used to force the engine to use only the recognized text on the page or only the embedded text on the page. One reason to use only the embedded text is to prevent recognition and produce high confidence results.

A drawback of using only the embedded text is that if the embedded text is wrong or incomplete, recognition is not performed to capture the missing data. The resultant layout XML is then incomplete compared to what the user sees when viewing the PDF. Do not use this setting if the source PDF file is of the image-on-text type, because in this case the text layer is not extracted. In addition, if a text line contains characters that are not included in the alphabet of the selected recognition languages, this text is not written to the result. In these cases, use mode 0 or 1.

These settings of y_contentReuseMode can be set on the DCO node that is being converted:
  • rrSet("0", "@X.y_contentReuseMode") - The default auto mode that uses a combination of recognition and embedded text.
  • rrSet("1", "@X.y_contentReuseMode") - Only recognition is used to create the layout XML.
  • rrSet("2", "@X.y_contentReuseMode") - Only embedded text is used to create the layout XML.
    Note: Set the variable y_correctSkewMode to 0 when you set y_contentReuseMode=2. This prevents changing of the coordinates for digitally born PDFs.
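The pairing of mode 2 with y_correctSkewMode can be made explicit when rule actions are generated or reviewed. The following Python sketch is an illustration under that assumption: the function content_reuse_actions is hypothetical and only builds the rrSet action text shown in the list above. It does not run anything in Datacap.

```python
def content_reuse_actions(mode):
    """Return the rrSet action lines for a y_contentReuseMode value.

    mode: 0 = auto (recognition plus embedded text),
          1 = recognition only,
          2 = embedded text only.
    Hypothetical helper that only formats action text.
    """
    if mode not in (0, 1, 2):
        raise ValueError("y_contentReuseMode must be 0, 1, or 2")

    actions = [f'rrSet("{mode}", "@X.y_contentReuseMode")']

    # For embedded-text-only mode, the note above says to also set
    # y_correctSkewMode to 0 so that coordinates are not changed.
    if mode == 2:
        actions.append('rrSet("0", "@X.y_correctSkewMode")')

    return actions


if __name__ == "__main__":
    for action in content_reuse_actions(2):
        print(action)
```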

For more information about configuring the recognition language, refer to the OCR_A action Recognize (deprecated).

Recognizing the PDF vs. Recognizing the New Image: Choosing the Best Approach

When creation of a layout file is enabled, the engine recognizes the PDF directly at the same time as it creates a TIF image for each page in the PDF. For PDFs that contain only machine-printed text generated directly from office-type applications such as a word processor, this may be a good solution. PDFs generated directly from electronic sources usually contain clean machine-printed text that is straight and readable without additional enhancements.

For PDFs that have a varied source and contain scans of forms or other documents that were brought together and converted to a PDF, it is usually a better option to simply convert the PDF to images and perform recognition directly on the newly generated page images. This approach has the benefit that the documents can be manipulated prior to recognition. Image enhancements can be performed to straighten text, remove lines, clean the background, etc. Once the image quality has been improved, then recognition can be performed, providing better recognition results compared to directly recognizing the PDF as it came in. It is often the case that simply deskewing a document will provide significant improvement to the recognized text. Testing different approaches can help you find the best approach for your documents.

Converting PDF to TIF without Recognition

Converting the PDF to TIF without performing recognition or extraction is the fastest way to convert the PDF. This is also the recommended approach if recognition will be performed in a following ruleset with subsequent actions. For example, if the PDF contains scanned images, it is best to first convert the PDF and then use the Image Enhancement ruleset to adjust the images. To convert without any recognition or extraction, set the variables "convPdfIgnoreContent" to "1" and "y_createLayout" to "0" on the same DCO node that is attached to the PDFFREDocumentToImage action. These variables can be placed in the setup DCO or set at runtime by using the action rrSet. If these variables are always intended to be set, placing them in the setup DCO in Datacap Studio is more efficient than using rrSet at runtime.

The following shows how to set these variables at runtime:
rrSet("1","@X.convPdfIgnoreContent")
rrSet("0","@X.y_createLayout")
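The variable combinations described in this topic can be summarized in one place. The following Python sketch is illustrative only: the function conversion_variables and the strategy names are hypothetical labels for the three approaches that the text describes, and the function only formats rrSet action text.

```python
def conversion_variables(strategy):
    """Return rrSet action lines to place before PDFFREDocumentToImage.

    Hypothetical strategy names:
      "convert_only"  - no recognition and no extraction (fastest)
      "layout"        - recognition that creates a layout XML file
      "embedded_text" - default behavior: a CCO is created from
                        searchable text when no layout is enabled
    """
    settings = {
        "convert_only": {"convPdfIgnoreContent": "1", "y_createLayout": "0"},
        # convPdfIgnoreContent is enabled automatically in this case.
        "layout": {"y_createLayout": "1"},
        # No variables are needed because y_createLayout is off by default.
        "embedded_text": {},
    }
    if strategy not in settings:
        raise ValueError(f"unknown strategy: {strategy}")

    return [
        f'rrSet("{value}","@X.{name}")'
        for name, value in settings[strategy].items()
    ]


if __name__ == "__main__":
    for action in conversion_variables("convert_only"):
        print(action)
```

For "convert_only", the result matches the two rrSet lines shown above.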

Considerations

If field recognition is required, then recognition must be performed on the created TIF images. Field recognition cannot be performed directly on a PDF. If there is a requirement that the original images or original PDF must be saved and archived at the end of the batch, the originals are still available at the end of batch processing. It is strongly recommended that the document "Best Practices for optimal text recognition in IBM Datacap" is reviewed to understand how to get the best results from recognition.

Including PDF Annotations

By default, text annotations in the source PDF file are not included in the output image. "Free Text" annotations in the source PDF can be included in the output image by setting the page DCO variable y_IncludeAnnotation to "1". Other types of PDF annotations, such as popup and ink annotations, are not supported. This feature does not cause the text of a "Sticky note" to be displayed on the image, and a sticky note icon might display on the final image regardless of this setting.

Example
rrSet("1", "@X.y_createLayout")
PDFFREDocumentToImage(300, 18, 32, 33,".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

This example creates a DCO page node with a corresponding image for each page within the PDF and performs recognition at the same time because y_createLayout is set to "1". If any color pages are encountered during the conversion, the color is retained and a TIF is created with LZW compression, which is a lossless compression that preserves the quality of the source image. For any pages that are black and white, the created images have FAX G4 compression, which is a very efficient and lossless way of compressing a black and white image. The recognized text is stored in a layout file. This text can be used by calling the action CreateCcoFromLayout in the SharedRecognitionTools library for each newly created page object. Each image created has a DPI of 300.

Example
SetNamePattern("2")
PDFFREDocumentToImage(300, 18, 18, 18, ".tif", ".tif", ".tif", 1, false, 100)

This example creates a DCO page node for each image and converts all pages to black and white using FAX G4 compression, which is lossless and generally the best compression to use for black and white images. Each image created has a DPI of 300. Depending on your needs, it may be fine to convert to black and white at this step, or it may be useful to convert to black and white after performing specific image enhancements. Because SetNamePattern("2") is called prior to the action, the newly created TIF files are named using the TMxxxxxx naming convention.

For this example, SetNamePattern("2") is called directly before PDFFREDocumentToImage on the same DCO node. This action can be called at any time prior to PDFFREDocumentToImage as long as it is called within the same task profile. Calling the action just once in a batch open event causes all subsequent pages to use this setting and is more efficient than calling the action repeatedly for each page.

Example
SetNamePattern("2")
SetChildPageType("Mortgage")
rrSet("1","@X.convPdfIgnoreContent")
rrSet("0","@X.y_createLayout")
PDFFREDocumentToImage("300","18","33","32",".bw.tif","col.tif","gray.tif","0","False","100")

This example converts the PDF to TIFs without performing any recognition or extracting any data. This is the fastest mode to convert a PDF to TIF images, and it is recommended when pages will be recognized in a following rule. If the PDF contains scanned images, it is typically best to use this approach because it allows the images to be adjusted using the Image Enhancement ruleset to deskew and improve the quality of the scanned images. Recognition actions can then be used to recognize the pages after the images have been cleaned.

Alternatively, the variables "convPdfIgnoreContent" and "y_createLayout" can be created in the setup DCO in Datacap Studio on the DCO node that is calling PDFFREDocumentToImage. This is more efficient because the variables do not need to be created repeatedly at runtime for each page.

Processing Secured PDF Files

For PDF files that have the property Content Copying: Not Allowed or Content Copying for Accessibility: Not Allowed, the action automatically removes these properties from the PDF so that image files can be created for each page of the PDF. If the PDF is changed, a backup of the original PDF is saved in the batch directory. The original PDF is saved with the name "filename.original.pdf". For example, TM000001.original.pdf.

The default suffix ".original" can be changed by setting the DCO variable "y_PdfBackupSuffix" prior to calling the action PDFFREDocumentToImage.

rrSet(".secure", "@X.y_PdfBackupSuffix")
PDFFREDocumentToImage(300, 18, 32, 33,".bw.tif", ".color.tif", ".gray.tif", 0, false, 100)

This example backs up the original file as PDFFileName.secure.pdf, removes the security properties from the file associated with the DCO object, and then creates an image for each page in the PDF. If y_PdfBackupSuffix is not set, .original.pdf is appended to the backup file name by default.