Get PDF Text by OCR

Gets text from a PDF file.

Command availability: IBM RPA SaaS and IBM RPA on premises

Description

Gets text from a PDF file using OCR. The parsing algorithm searches for an anchor text and returns the text from a region positioned relative to where the anchor text was matched.
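To illustrate how an anchor-relative region could be resolved, the following Python sketch shifts a target rectangle by the position of the matched anchor. It assumes that rectangles are written as x, y, width, height and that the offset is measured from the anchor's top-left corner. Both are assumptions made for illustration, not confirmed behavior of the command. The names Rect and resolve_target_region are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical rectangle: top-left corner plus size (assumed layout)."""
    x: int
    y: int
    width: int
    height: int

    @classmethod
    def parse(cls, text: str) -> "Rect":
        # Parse a string such as "-23,-37,716,404" into four integers.
        x, y, width, height = (int(part) for part in text.split(","))
        return cls(x, y, width, height)


def resolve_target_region(anchor_bounds: Rect, target_region: Rect) -> Rect:
    """Shift the relative target region by the anchor's top-left corner."""
    return Rect(
        anchor_bounds.x + target_region.x,
        anchor_bounds.y + target_region.y,
        target_region.width,
        target_region.height,
    )


# Assumed anchor position; in practice it would come from the OCR match.
anchor = Rect(x=200, y=300, width=40, height=15)
target = Rect.parse("-23,-37,716,404")
absolute = resolve_target_region(anchor, target)
print(absolute)
```

Under these assumptions, negative offsets place the extraction region above and to the left of the anchor, which is why a target region can start with negative values.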

Script syntax

extractPdfText --page(Numeric) --language(String) [--searchregion(Rectangle)] --anchor(String) --anchorprovider(Nullable<OpticalCharacterRecognitionProvider>) [--googlevisionclientsecret(String)] --comparison(OcrStringComparison) --fuzzyalgorithm(Nullable<FuzzyStringComparisonAlgorithms>) --tolerance(Nullable<FuzzyStringComparisonTolerance>) --manualTolerance(Numeric) --segmentation(StringSegmentation) [--anchorhighcontrast(Boolean)] --targetregion(Rectangle) --targetprovider(Nullable<OpticalCharacterRecognitionProvider>) [--targethighcontrast(Boolean)] --file(Pdf) (Image)=anchor (Boolean)=success (Image)=image (String)=text (Rectangle)=bounds

Dependencies

IBM RPA Studio has a helper tool called Extract Pdf Text to set the parameters for this command. You can find it on the Tools tab.

Input parameters

The following table displays the list of input parameters available in this command. In the table, you can see the parameter name when working in IBM RPA Studio's Script mode and its Designer mode equivalent label.

| Designer mode label | Script mode name | Required | Accepted variable types | Description |
|---|---|---|---|---|
| Page | page | Required | Number | The page number to parse. |
| Language | language | Required | Text, Culture | The language to consider during parsing. For the available languages, see Supported languages. |
| Search region | searchregion | Optional | Rectangle | The region where the parsing algorithm searches for the anchor text. |
| Anchor | anchor | Required | Text | The text to use as an anchor. |
| Anchor OCR provider | anchorprovider | Required | OpticalCharacterRecognitionProvider | The OCR provider to use when searching for the anchor text. See the anchorprovider parameter options. |
| API Parameters | googlevisionclientsecret | Optional | Text | The absolute path to the JSON file containing the API parameters. Refer to the Google Cloud Vision™ documentation for details about the JSON format. |
| Comparison | comparison | Required | OcrStringComparison | The type of comparison to run against the anchor text. See the comparison parameter options. |
| Fuzzy algorithm | fuzzyalgorithm | Only when comparison is ApproximatelyEquals | FuzzyStringComparisonAlgorithms | The fuzzy algorithm to use when comparing strings. See the fuzzyalgorithm parameter options for details. |
| Tolerance | tolerance | Only when comparison is ApproximatelyEquals | FuzzyStringComparisonTolerance | The tolerance level to use in the fuzzy algorithm. See the tolerance parameter options. |
| Tolerance value | manualTolerance | Only when tolerance is Manual | Number | The tolerance percentage between 0 and 100, where 100 means a full match. |
| Segmentation | segmentation | Required | StringSegmentation | The type of text to search for. See the segmentation parameter options. |
| Enhance contrast for anchor OCR | anchorhighcontrast | Optional | Boolean | Enable to enhance the contrast for the OCR that searches for the anchor text. |
| Text extract region | targetregion | Required | Rectangle | The region to parse for text using OCR. This region is relative to where the anchor text was located. |
| Extraction OCR provider | targetprovider | Required | OpticalCharacterRecognitionProvider | The OCR provider to use when getting the text in the targetregion. See the targetprovider parameter options for details. |
| Enhance contrast for the text extraction OCR | targethighcontrast | Optional | Boolean | Enable to enhance the contrast for the OCR that parses the text to be obtained. |
| PDF file | file | Required | PDF | The PDF file to get the text from. |

anchorprovider parameter options

The following table displays the options available for the anchorprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Google | Google | Google Tesseract™ OCR provider. |
| Google Cloud Vision | GoogleVision | Google Cloud Vision™ OCR provider. |
| Abbyy | Abbyy | Abbyy™ OCR provider. |

comparison parameter options

The following table displays the options available for the comparison input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Approximately Equals | ApproximatelyEquals | Checks if two strings are similar considering a tolerance level. |
| Begin with | Begins_With | Checks if a string begins with a substring. |
| Contains | Contains | Checks if a string contains a substring. |
| Ends with | Ends_With | Checks if a string ends with a substring. |
| Equal to | Equal_To | Checks if a string is equal to another string. |
| Matches | Matches | Checks if a string matches a regular expression. |
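As a rough illustration of what the non-fuzzy options mean, the following Python sketch maps each Script mode name to an equivalent string check. The function name compare is hypothetical. Case sensitivity and the use of a partial regular expression search for Matches are assumptions, because the table does not specify them. ApproximatelyEquals is left out because it depends on the selected fuzzy algorithm and tolerance.

```python
import re


def compare(candidate: str, anchor: str, comparison: str) -> bool:
    """Apply one of the non-fuzzy comparison options to recognized text."""
    if comparison == "Begins_With":
        return candidate.startswith(anchor)
    if comparison == "Contains":
        return anchor in candidate
    if comparison == "Ends_With":
        return candidate.endswith(anchor)
    if comparison == "Equal_To":
        return candidate == anchor
    if comparison == "Matches":
        # Assumption: the anchor is a regular expression that may match
        # anywhere in the candidate text.
        return re.search(anchor, candidate) is not None
    raise ValueError(f"Unsupported comparison: {comparison}")


print(compare("Bill to", "Bill", "Begins_With"))
print(compare("Bill to", "Bill", "Equal_To"))
```

With Segmentation set to Word, the candidate would be a single recognized word. With Phrase, it would be a longer run of text.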

fuzzyalgorithm parameter options

The following table displays the options available for the fuzzyalgorithm input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name |
|---|---|
| Dice Coefficient | DiceCoefficient |
| Hamming Distance | HammingDistance |
| Jaccard Distance | JaccardDistance |
| Jaro Distance | JaroDistance |
| Jaro Winkler Distance | JaroWinklerDistance |
| Levenshtein Distance | LevenshteinDistance |
| Longest Common Subsequence | LongestCommonSubsequence |
| Longest Common Substring | LongestCommonSubstring |
| Overlap Coefficient | OverlapCoefficient |
| Ratcliff Obershelp Similarity | RatcliffObershelpSimilarity |
| Sorensen Dice Distance | SorensenDiceDistance |
| Tanimoto Coefficient | TanimotoCoefficient |

tolerance parameter options

The following table displays the options available for the tolerance input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Manual | Manual | Evaluates both strings as approximately equal if they are at least X% equal. X is user-defined in manualTolerance between 0 and 100, where 100 means the strings are completely equal. |
| Normal | Normal | Evaluates both strings as approximately equal if they are at least 50% equal. |
| Strong | Strong | Evaluates both strings as approximately equal if they are at least 75% equal. |
| Weak | Weak | Evaluates both strings as approximately equal if they are at least 25% equal. |
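The thresholds in the table above can be sketched in Python as follows. The similarity score here comes from the standard library's difflib.SequenceMatcher, which is only a stand-in chosen for illustration. It is not the command's implementation, and each fuzzy algorithm would produce its own score. The function names threshold_for and approximately_equals are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Minimum similarity (percent) for each preset, as listed in the table.
PRESET_THRESHOLDS = {"Weak": 25.0, "Normal": 50.0, "Strong": 75.0}


def threshold_for(tolerance: str, manual_tolerance: Optional[float] = None) -> float:
    """Return the minimum similarity percentage for a tolerance option."""
    if tolerance == "Manual":
        if manual_tolerance is None or not 0 <= manual_tolerance <= 100:
            raise ValueError("manualTolerance must be between 0 and 100")
        return float(manual_tolerance)
    return PRESET_THRESHOLDS[tolerance]


def approximately_equals(
    a: str, b: str, tolerance: str, manual_tolerance: Optional[float] = None
) -> bool:
    """Compare two strings against the threshold for the given tolerance."""
    # Stand-in similarity score in the range 0 to 100.
    similarity = SequenceMatcher(None, a, b).ratio() * 100
    return similarity >= threshold_for(tolerance, manual_tolerance)


# A one-character OCR misread still passes the Strong preset in this sketch.
print(approximately_equals("Invoice", "Invoicc", "Strong"))
```

A higher threshold reduces false matches on noisy scans but makes the anchor easier to miss, so the Manual option is useful when the presets are too coarse.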

segmentation parameter options

The following table displays the options available for the segmentation input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Phrase | Phrase | Search for a phrase. |
| Word | Word | Search for a word. |

targetprovider parameter options

The following table displays the options available for the targetprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.

| Designer mode label | Script mode name | Description |
|---|---|---|
| Google | Google | Google Tesseract™ OCR provider. |
| Google Cloud Vision | GoogleVision | Google Cloud Vision™ OCR provider. |
| Abbyy | Abbyy | Abbyy™ OCR provider. |

Output parameters

| Designer mode label | Script mode name | Accepted variable types | Description |
|---|---|---|---|
| Anchor | anchor | Image | The image found in the source file's page containing the anchor text. |
| Success | success | Boolean | True if the text is found, or False if not. |
| Image | image | Image | The image containing the target text. |
| Text | text | Text | The target text returned from the text extract region. |
| Bounds | bounds | Rectangle | The region where the text was found in the PDF. |

Example

The following example shows the extractPdfText command in action.

defVar --name pdfFile --type Pdf
defVar --name getText --type String
// Open the PDF file. Replace "file.pdf" with the path to your PDF file.
pdfOpen --file "file.pdf" pdfFile=value
// Get the text from the PDF file. You can use the Extract Pdf Text helper tool to set the parameters for this command.
extractPdfText --page 1 --language en-US --searchregion "171,273,134,89" --anchor Bill --anchorprovider "Google" --comparison "Equal_To" --segmentation "Word" --targetregion "-23,-37,716,404" --targetprovider "Google" --file ${pdfFile} getText=text
// Display the extracted text in the console.
logMessage --message "${getText}" --type "Info"

Limitations

ABBYY® works differently compared to Google Cloud Vision™ and Google Tesseract™. It divides the text into components in a different way and returns them in a different order.