Get PDF Text by OCR
Gets text from a PDF file.
Command availability: IBM RPA SaaS and IBM RPA on premises
Description
Gets text from a PDF file using OCR. The parsing algorithm finds the anchor text on the page, then returns the text from a region positioned relative to where the anchor was matched.
Script syntax
extractPdfText --page(Numeric) --language(String) [--searchregion(Rectangle)] --anchor(String) --anchorprovider(Nullable<OpticalCharacterRecognitionProvider>) [--googlevisionclientsecret(String)] --comparison(OcrStringComparison) --fuzzyalgorithm(Nullable<FuzzyStringComparisonAlgorithms>) --tolerance(Nullable<FuzzyStringComparisonTolerance>) --manualTolerance(Numeric) --segmentation(StringSegmentation) [--anchorhighcontrast(Boolean)] --targetregion(Rectangle) --targetprovider(Nullable<OpticalCharacterRecognitionProvider>) [--targethighcontrast(Boolean)] --file(Pdf) (Image)=anchor (Boolean)=success (Image)=image (String)=text (Rectangle)=bounds
Dependencies
IBM RPA Studio has a helper tool called Extract Pdf Text for setting the parameters of this command. You can find it on the Tools tab.
Input parameters
The following table displays the list of input parameters available in this command. In the table, you can see the parameter name when working in IBM RPA Studio's Script mode and its Designer mode equivalent label.
Designer mode label | Script mode name | Required | Accepted variable types | Description |
---|---|---|---|---|
Page | page | Required | Number | The page number to parse. |
Language | language | Required | Text, Culture | The language to consider during parsing. For the available languages, see Supported languages. |
Search region | searchregion | Optional | Rectangle | The region where the parsing algorithm searches for the anchor text. |
Anchor | anchor | Required | Text | The text to use as an anchor. |
Anchor OCR provider | anchorprovider | Required | OpticalCharacterRecognitionProvider | The OCR provider to use when searching for the anchor text. See the anchorprovider parameter options. |
API Parameters | googlevisionclientsecret | Optional | Text | The absolute path to the JSON file containing the API parameters. Refer to the Google Cloud Vision™ |
Comparison | comparison | Required | OcrStringComparison | The type of comparison to run against the anchor text. See the comparison parameter options. |
Fuzzy algorithm | fuzzyalgorithm | Only when comparison is ApproximatelyEquals | FuzzyStringComparisonAlgorithms | The fuzzy algorithm to use when comparing strings. See the fuzzyalgorithm parameter options for details. |
Tolerance | tolerance | Only when comparison is ApproximatelyEquals | FuzzyStringComparisonTolerance | The tolerance level to use in the fuzzy algorithm. See the tolerance parameter options. |
Tolerance value | manualTolerance | Only when tolerance is Manual | Number | The tolerance percentage between 0 and 100, where 100 means a full match. |
Segmentation | segmentation | Required | StringSegmentation | The type of text to search for. See the segmentation parameter options. |
Enhance contrast for anchor OCR | anchorhighcontrast | Optional | Boolean | Enable to enhance the PDF contrast for the OCR that searches for the anchor text. |
Text extract region | targetregion | Required | Rectangle | The region to parse for text using OCR. This region is relative to where the anchor text was located. |
Extraction OCR provider | targetprovider | Required | OpticalCharacterRecognitionProvider | The OCR provider to use when getting the text in the targetregion. See the targetprovider parameter options for details. |
Enhance contrast for the text extraction OCR | targethighcontrast | Optional | Boolean | Enable to enhance the contrast for the OCR that extracts the target text. |
File | file | Required | PDF | The PDF file to get the text from. |
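To make the relationship between the anchor and the text extract region concrete, the following Python sketch resolves a relative region against an anchor position. It is an illustration only, not IBM RPA code. It assumes, judging solely from the example values on this page, that a Rectangle is written as `x,y,width,height` and that the x and y of the text extract region are offsets from the top-left corner of the matched anchor. `Rect`, `parse_rect`, and `resolve_target` are hypothetical names, and the anchor position is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangle in page coordinates (assumed layout: x, y, width, height)."""
    x: int
    y: int
    width: int
    height: int


def parse_rect(text: str) -> Rect:
    """Parse an 'x,y,width,height' string. The field order is an assumption."""
    x, y, width, height = (int(part) for part in text.split(","))
    return Rect(x, y, width, height)


def resolve_target(anchor: Rect, target: Rect) -> Rect:
    """Shift the relative target region by the anchor's top-left corner."""
    return Rect(anchor.x + target.x, anchor.y + target.y, target.width, target.height)


# Hypothetical position where the anchor was matched on the page.
anchor_bounds = Rect(200, 300, 40, 15)
# Relative offsets taken from the example script on this page.
target = parse_rect("-23,-37,716,404")

print(resolve_target(anchor_bounds, target))
```

Negative offsets, as in the example values, would place the extracted region slightly above and to the left of the anchor under these assumptions.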
anchorprovider parameter options
The following table displays the options available for the anchorprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name | Description |
---|---|---|
Google | Google | Google Tesseract™ OCR provider. |
Google Cloud Vision | GoogleVision | Google Cloud Vision™ OCR provider. |
Abbyy | Abbyy | Abbyy™ OCR provider. |
comparison parameter options
The following table displays the options available for the comparison input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name | Description |
---|---|---|
Approximately Equals | ApproximatelyEquals | Checks if two strings are similar considering a tolerance level. |
Begin with | Begins_With | Checks if a string begins with a substring. |
Contains | Contains | Checks if a string contains a substring. |
Ends with | Ends_With | Checks if a string ends with a substring. |
Equal to | Equal_To | Checks if a string is equal to another string. |
Matches | Matches | Checks if a string matches a regular expression. |
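The comparison options above can be sketched in Python as a single dispatch function. This is a rough illustration of what each option means, not the command's implementation: case sensitivity, whether Matches must cover the whole string, and the similarity measure used for ApproximatelyEquals are all assumptions here. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, and `compare` is a hypothetical name.

```python
import re
from difflib import SequenceMatcher


def compare(candidate: str, anchor: str, mode: str, tolerance: float = 50.0) -> bool:
    """Return True if the OCR candidate text satisfies the comparison mode."""
    if mode == "Equal_To":
        return candidate == anchor
    if mode == "Begins_With":
        return candidate.startswith(anchor)
    if mode == "Ends_With":
        return candidate.endswith(anchor)
    if mode == "Contains":
        return anchor in candidate
    if mode == "Matches":
        # Assumption: the anchor is a regular expression searched anywhere in the text.
        return re.search(anchor, candidate) is not None
    if mode == "ApproximatelyEquals":
        # Stand-in similarity score in percent; the real algorithm is configurable.
        similarity = SequenceMatcher(None, candidate, anchor).ratio() * 100
        return similarity >= tolerance
    raise ValueError(f"Unknown comparison mode: {mode}")


print(compare("Bill To", "Bill", "Begins_With"))
print(compare("Bil1", "Bill", "ApproximatelyEquals", tolerance=50.0))
```

The ApproximatelyEquals branch is the one that helps when OCR misreads a character, for example reading `Bil1` instead of `Bill`.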
fuzzyalgorithm parameter options
The following table displays the options available for the fuzzyalgorithm input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name |
---|---|
Dice Coefficient | DiceCoefficient |
Hamming Distance | HammingDistance |
Jaccard Distance | JaccardDistance |
Jaro Distance | JaroDistance |
Jaro Winkler Distance | JaroWinklerDistance |
Levenshtein Distance | LevenshteinDistance |
Longest Common Subsequence | LongestCommonSubsequence |
Longest Common Substring | LongestCommonSubstring |
Overlap Coefficient | OverlapCoefficient |
Ratcliff Obershelp Similarity | RatcliffObershelpSimilarity |
Sorensen Dice Distance | SorensenDiceDistance |
Tanimoto Coefficient | TanimotoCoefficient |
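As an illustration of how one of the fuzzy algorithms listed above can produce a percentage, here is a minimal Python sketch of Levenshtein distance converted into a similarity score. How IBM RPA normalizes each algorithm's result is not described on this page, so the formula `1 - distance / longest length` is an assumption, and `levenshtein` and `similarity_percent` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Normalize the distance to a 0-100 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100


print(levenshtein("kitten", "sitting"))
print(similarity_percent("Bill", "Bi11"))
```

Under this normalization, two misread characters in a four-character anchor already halve the score, which is why short anchors tend to need a more lenient tolerance than long ones.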
tolerance parameter options
The following table displays the options available for the tolerance input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name | Description |
---|---|---|
Manual | Manual | Considers both strings approximately equal if they are at least X% equal. X is the value set in manualTolerance, between 0 and 100, where 100 means the strings are completely equal. |
Normal | Normal | Considers both strings approximately equal if they are at least 50% equal. |
Strong | Strong | Considers both strings approximately equal if they are at least 75% equal. |
Weak | Weak | Considers both strings approximately equal if they are at least 25% equal. |
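The tolerance levels above amount to a threshold lookup. The following Python sketch shows that mapping using the percentages from the table; the function names are hypothetical, and the similarity value is assumed to be a 0 to 100 score produced by whichever fuzzy algorithm is selected.

```python
from typing import Optional

# Minimum similarity, in percent, for each preset level in the table.
THRESHOLDS = {"Weak": 25.0, "Normal": 50.0, "Strong": 75.0}


def required_similarity(tolerance: str, manual_tolerance: Optional[float] = None) -> float:
    """Return the minimum similarity percentage for a tolerance setting."""
    if tolerance == "Manual":
        if manual_tolerance is None or not 0 <= manual_tolerance <= 100:
            raise ValueError("Manual tolerance needs a value between 0 and 100")
        return float(manual_tolerance)
    return THRESHOLDS[tolerance]


def is_approximately_equal(similarity: float, tolerance: str,
                           manual_tolerance: Optional[float] = None) -> bool:
    """Check a 0-100 similarity score against the selected tolerance."""
    return similarity >= required_similarity(tolerance, manual_tolerance)


print(is_approximately_equal(60.0, "Normal"))
print(is_approximately_equal(60.0, "Strong"))
```

A score of 60 passes Normal but not Strong, so raising the tolerance level is the way to cut down on false anchor matches.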
segmentation parameter options
The following table displays the options available for the segmentation input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name | Description |
---|---|---|
Phrase | Phrase | Search for a phrase. |
Word | Word | Search for a word. |
targetprovider parameter options
The following table displays the options available for the targetprovider input parameter. The table shows the options available when working in Script mode and the equivalent label in the Designer mode.
Designer mode label | Script mode name | Description |
---|---|---|
Google | Google | Google Tesseract™ |
Google Cloud Vision | GoogleVision | Google Cloud Vision™ |
Abbyy | Abbyy | Abbyy™ |
Output parameters
Designer mode label | Script mode name | Accepted variable types | Description |
---|---|---|---|
Anchor | anchor | Image | The image found in the source file's page containing the anchor text. |
Success | success | Boolean | True if the text is found, or False if not. |
Image | image | Image | The image containing the target text. |
Text | text | Text | The target text returned from the text extract region. |
Bounds | bounds | Rectangle | Region where the text was found in the PDF. |
Example
The following example shows the extractPdfText command in action.
defVar --name pdfFile --type Pdf
defVar --name getText --type String
// Open the PDF file to read.
pdfOpen --file "file.pdf" pdfFile=value
// Use the Extract Pdf Text helper tool to set the regions for this command.
extractPdfText --page 1 --language en-US --searchregion "171,273,134,89" --anchor Bill --anchorprovider "Google" --comparison "Equal_To" --segmentation "Word" --targetregion "-23,-37,716,404" --targetprovider "Google" --file ${pdfFile} getText=text
// Display the text extracted from the PDF file in the console.
logMessage --message "${getText}" --type "Info"
Limitations
ABBYY® works differently from Google Cloud Vision™ and Google Tesseract™. It divides the text components in a different way and gets the components in a different order.