Using the Extract PDF Text tool to get text with OCR
Learn how to use the Extract PDF Text tool to extract text from PDF files, such as scanned documents, and convert it into machine-readable data with Optical Character Recognition (OCR).
This section describes the controls of the tool and shows how to extract text from a file based on anchoring.
The Extract PDF Text tool

| No. | Topic | Description |
|---|---|---|
| 1 | File | Click Open to select a PDF file. This adds the Open PDF file (pdfOpen) command to the current script. |
| 2 | Paging | Controls to move through the pages of the PDF file. |
| 3 | Anchor | Options to configure the anchor. |
| 4 | Target | Options to configure the target text. |
| 5 | Actions | Actions to run on the file based on the previous configurations. |
| 6 | PDF page viewer | This pane displays the contents of the PDF file. |
Paging
Use the controls in the Paging group to move through the PDF pages. The following list describes the controls available:
- Icons
  Click the ◀ icon to return to the previous page, and click the ▶ icon to advance to the next page.
- Page selection
  Select a page from the PDF file in the page field.
Anchor
Use the controls in the Anchor group to configure the anchor. The anchor text serves as a reference location for finding the target text. The following list describes the controls available:
- OCR provider
  Select the OCR provider used to recognize the anchor text.
  Note: ABBYY® behaves differently from Google Cloud Vision™ and Google Tesseract™, which might lead to different results in the tool.
- Language
  Select the language used by the OCR provider to recognize the anchor text.
- Operator
  Select a comparison option to search for the anchor.
- Anchor type
  Select whether the OCR provider looks for a word or a phrase.
- Text box
  Enter the anchor text.
- High contrast icon
  Click this icon to enable or disable high contrast in the PDF file.
- View text icon
  Click this icon to preview the recognized anchor text.
- Fetch anchor icon
  Click this icon to fetch and highlight the recognized anchor text.
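As a rough illustration of how the Operator and Anchor type settings could interact, the following Python sketch searches a list of OCR-recognized words for an anchor. It is not the tool's implementation: the Word type, the find_anchor function, the operator names (equals, contains, starts_with), and the case-insensitive comparison are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR-recognized word and its top-left position (hypothetical type)."""
    text: str
    x: float
    y: float


def matches(candidate: str, anchor: str, operator: str) -> bool:
    """Compare recognized text with the anchor text.

    The operator names are assumptions; the tool's actual options may differ.
    """
    candidate, anchor = candidate.casefold(), anchor.casefold()
    if operator == "equals":
        return candidate == anchor
    if operator == "contains":
        return anchor in candidate
    if operator == "starts_with":
        return candidate.startswith(anchor)
    raise ValueError(f"unknown operator: {operator}")


def find_anchor(words, anchor, anchor_type="word", operator="equals"):
    """Return the words that form the first match, or an empty list."""
    if anchor_type == "word":
        # Word anchors are compared against each recognized word on its own.
        for word in words:
            if matches(word.text, anchor, operator):
                return [word]
        return []
    if anchor_type == "phrase":
        # Phrase anchors are compared against a sliding window of
        # consecutive words, as wide as the anchor text.
        size = len(anchor.split())
        for start in range(len(words) - size + 1):
            window = words[start:start + size]
            if matches(" ".join(w.text for w in window), anchor, operator):
                return window
        return []
    raise ValueError(f"unknown anchor type: {anchor_type}")


words = [Word("Invoice", 10, 10), Word("Number:", 60, 10), Word("12345", 120, 10)]
word_hit = find_anchor(words, "number", "word", "contains")
phrase_hit = find_anchor(words, "Invoice Number:", "phrase", "equals")
```

In this sketch, a word anchor with a contains-style operator tolerates trailing punctuation that OCR attaches to a word, while a phrase anchor with an equals-style operator requires the full sequence of words to match.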
Target
Use the controls in the Target group to configure the target text. The following list describes the controls available:
- OCR provider
  Select the OCR provider used to recognize the target text.
- High contrast icon
  Click this icon to enable or disable high contrast in the PDF file.
- View text icon
  Click this icon to preview the recognized target text.
Actions
Use the controls in the Actions group to act on the file based on the previous configurations. The following list describes the controls available:
- Clear
  Clear the settings fields.
- Create command
  This tool adds the Get PDF text by OCR (extractPdfText) command to the script based on the anchor, target text, and position that you entered in this tool.
Procedure
Follow these steps in IBM RPA Studio to extract text from a scanned PDF file based on anchoring with the Extract PDF Text tool:
Anchoring
Anchoring is the process of defining a reference position relative to which the target text is extracted. The anchor acts as a base for finding the target text in a PDF file.
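The idea of anchoring can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not the tool's implementation: the Box and Word types, the extract_target function, and its offset and size parameters are hypothetical, and the OCR output is represented as a plain list of words with bounding boxes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies completely inside this rectangle.
        return (self.x <= other.x
                and self.y <= other.y
                and other.x + other.width <= self.x + self.width
                and other.y + other.height <= self.y + self.height)


@dataclass(frozen=True)
class Word:
    """One OCR-recognized word with its bounding box (hypothetical type)."""
    text: str
    box: Box


def extract_target(words, anchor_text, offset_x, offset_y, width, height):
    """Return the text inside a region positioned relative to the anchor.

    Returns None when the anchor is not found on the page.
    """
    anchor = next((w for w in words if w.text == anchor_text), None)
    if anchor is None:
        return None
    # The target region is expressed relative to the anchor, not the page.
    target = Box(anchor.box.x + offset_x, anchor.box.y + offset_y, width, height)
    inside = [w for w in words if target.contains(w.box)]
    inside.sort(key=lambda w: (w.box.y, w.box.x))  # top to bottom, left to right
    return " ".join(w.text for w in inside)


page = [
    Word("Total:", Box(100, 200, 40, 10)),
    Word("42.50", Box(150, 200, 35, 10)),
    Word("EUR", Box(190, 200, 25, 10)),
    Word("Notes", Box(100, 230, 40, 10)),
]
# The same content, shifted as it might be in a slightly offset scan.
shifted_page = [
    Word(w.text, Box(w.box.x + 20, w.box.y + 30, w.box.width, w.box.height))
    for w in page
]

result = extract_target(page, "Total:", 45, -2, 80, 14)
shifted_result = extract_target(shifted_page, "Total:", 45, -2, 80, 14)
```

Because the target region is computed from the anchor position, both calls in this sketch select the same words even though the second page is shifted. That is the motivation for anchoring over fixed page coordinates.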
1. Open the Extract PDF Text tool by clicking Tools > Extract PDF Text.
2. Optional: In the Anchor section, define the following settings:
   - From the OCR provider list, select the provider to use to find the anchor.
   - From the Language list, select the text language.
   - From the Operator list, select the comparison option to search for the anchor.
   - From the Anchor type list, select the anchor type: word or phrase.
3. Drag the cursor around the area in which you want to search for the anchor.
4. Click the View text icon to preview the anchor.
5. In the text box, enter the anchor text. The text must be inside the area that you set in the third step.
6. Click the Fetch anchor icon to search for the anchor. The tool highlights the anchor.
The following video shows how to define and fetch the anchor text with the Extract PDF Text tool:
- Set a new area to search for the anchor text.
- Enable or disable high contrast.
- Switch the OCR provider.
- Switch the language.
Extracting the target text
1. Drag the cursor around the area from which you want to extract the target text. The tool highlights the target in yellow.
2. In the Target section, click the View text icon to preview the target text.
3. Click Create command to add the Get PDF Text by OCR (extractPdfText) command to your script based on the anchor, target text, and position that you entered in this tool.
The following video shows how to define and fetch the target text with the Extract PDF Text tool: