Extracting text from scanned PDF files with OCR
Learn how to extract text from scanned PDF files. You can extract text relative to an anchor with the Extract PDF Text tool, or from a static region with the Region Selector tool, both within IBM RPA Studio.
When you scan a paper document and save it as a PDF file, each page is captured as a flat image, which means that a computer can't directly read and process the text in the file. In this case, you need to apply optical character recognition (OCR) to extract the text.
Before you begin
Before you extract text from scanned PDF files with OCR, you must open a PDF instance. If you use the Region Selector or Extract PDF Text tools, they automatically add the command to open the PDF file. For more information, see Opening, closing, and saving PDF files.
Extract text from a file based on anchoring
Follow these steps to extract text from a scanned PDF file based on anchoring with the Extract PDF Text tool within IBM RPA Studio.
The Extract PDF Text tool uses OCR to extract text from sources such as scanned documents and to convert that text into machine-readable data. For the tool's overview and details on the user interface controls, see The Extract PDF Text tool.
Anchoring
Anchoring is the process of defining a reference position relative to which the target text is extracted. The anchor acts as a base to find the target text in a PDF file.
- Open the Extract PDF Text tool by clicking Tools > Extract PDF Text.
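The idea behind anchoring can be sketched in a few lines of Python. This is only a conceptual illustration, not how the Extract PDF Text tool is implemented: the `Box` and `Word` types, the `find_anchor` and `target_region` helpers, and the sample coordinates are all hypothetical. It assumes an OCR step has already produced words with bounding boxes, and shows that a target defined relative to the anchor is still found when the scan is shifted.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (origin at top-left)."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Word:
    """One hypothetical OCR result: recognized text plus its position."""
    text: str
    box: Box


def find_anchor(words: Iterable[Word], anchor_text: str) -> Optional[Box]:
    """Return the box of the first word equal to anchor_text, ignoring case."""
    wanted = anchor_text.strip().lower()
    for word in words:
        if word.text.strip().lower() == wanted:
            return word.box
    return None


def target_region(anchor: Box, dx: float, dy: float,
                  width: float, height: float) -> Box:
    """Place the target box at a fixed offset from the anchor's top-left corner."""
    return Box(anchor.x + dx, anchor.y + dy, width, height)


# Made-up OCR output for two scans of the same form; the second scan is shifted.
scan_a = [Word("Invoice", Box(40, 50, 60, 12)), Word("Total:", Box(300, 400, 45, 12))]
scan_b = [Word("Invoice", Box(48, 71, 60, 12)), Word("Total:", Box(308, 421, 45, 12))]

for scan in (scan_a, scan_b):
    anchor = find_anchor(scan, "total:")
    if anchor is not None:
        # The value to extract sits 50 units to the right of the anchor in both scans.
        print(target_region(anchor, dx=50, dy=0, width=80, height=12))
```

A static region, by contrast, would use the same absolute coordinates for every file, which works only when the layout does not move between scans.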
- Optional: In the Anchor section, define the following settings:
- From the OCR provider list, select the provider to use to find the anchor.
- From the Language list, select the text language.
- From the Operator list, select the comparison option to search for the anchor.
- From the Anchor type list, select the anchor type: word or phrase.
- Drag the cursor around the area in which you want to search for the anchor.
- Click the icon to preview the anchor.
- In the text box, enter an anchor text. The text must be inside the area that you set in the third step.
- Click the icon to search for the anchor. The tool highlights the anchor.
The following video shows how to define and fetch the anchor text with the Extract PDF Text tool:
- Set a new area to search for the anchor text.
- Enable or disable high contrast.
- Switch the OCR provider.
- Switch the language.
Extracting the target text
- Drag the cursor around the area from which you want to extract the target text. The tool highlights the target in yellow.
- In the Target section, click the icon to preview the target text.
- Click Create command to add the Get PDF Text by OCR (extractPdfText) command to your script, based on the anchor, target text, and position that you entered in this tool.
The following video shows how to define and fetch the target text with the Extract PDF Text tool:
Extract text from a file based on a static region
Follow these steps to capture a region of a scanned PDF file and extract text from it with the Region Selector tool within IBM RPA Studio:
- Open this tool by clicking Tools > Region Selector.
- From the Selection tab, select Text.
- Drag the cursor around the region that you want to extract text from.
The tool automatically prompts you with the Get PDF Region Text (pdfRegionText) command. In the Text output parameter, enter the variable to receive the extracted text.

The Get PDF Region Text (pdfRegionText) command can extract text from a specific region with OCR. Read this command's specification for examples on how to use it.
Enhancing the quality of extracted text
Extracting text from scanned PDF files might require that you enhance the quality of the file so that the extracted text has little or no deviation from the original.
Commands like Get PDF Region Text (pdfRegionText) and Get PDF Image (pdfImage) get a region of the PDF file as a static image, which enables you to increase the DPI (dots per inch) value, which is 110 by default. Read these commands' specifications for details on how to use them.
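To see why a higher DPI value can help OCR, the following Python sketch estimates how many pixels a region yields at different DPI settings. It is a back-of-the-envelope illustration only: the `region_pixels` helper is hypothetical, and it assumes that the region is measured in PDF points (1/72 inch), which might not match the units that the IBM RPA commands use. The only value taken from this topic is the default of 110 DPI.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user space unit: 1 point = 1/72 inch


def region_pixels(width_pt: float, height_pt: float, dpi: int = 110) -> Tuple[int, int]:
    """Approximate pixel size of a region given in points when rasterized at dpi."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# A region 2 inches wide and 0.5 inch tall (144 x 36 points).
for dpi in (110, 200, 300):
    print(dpi, region_pixels(144, 36, dpi))
```

More pixels per character generally give the OCR engine more detail to work with, at the cost of larger images and longer processing time, so it is worth raising the value gradually rather than jumping to the maximum.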