Using the Extract PDF Text tool to get text with OCR

Learn how to use the Extract PDF Text tool to extract text from PDF files, such as scanned documents, and to convert that text into machine-readable data with Optical Character Recognition (OCR).

This section describes the tool's interface and shows how to extract text from a file based on anchoring.

The Extract PDF Text tool

Figure 1. The image shows the user interface for the Extract PDF Text tool.

Table 1. The table describes and refers to each section of the Extract PDF Text tool's screen capture.
No.  Topic            Description
1    File             Click Open to select a PDF file. It adds the Open PDF file (pdfOpen) command to the current script.
2    Paging           Pagination of a PDF file.
3    Anchor           Options to configure the anchor.
4    Target           Options to configure the target text.
5    Actions          Actions to run on the file based on the previous configurations.
6    PDF page viewer  This pane displays the PDF file contents.

Paging

Use the controls in the Paging group to move through the PDF pages. The following list describes the controls available:

  • Icons
    Click the ◀ icon to return to the previous page, and click the ▶ icon to advance to the next page.

  • Page selection
    Select a page from the PDF file in the page field.

Anchor

Use the controls in the Anchor group to configure the anchor. An anchor text is used as a reference location for the target text. The following list describes the controls available:

  • OCR provider
    Select the OCR provider used to recognize the anchor text.

    Note: ABBYY® behaves differently from Google Cloud Vision™ and Google Tesseract™, which might lead to different results in the tool.

  • Language
    Select the language used by the OCR provider to recognize the anchor text.

  • Operator
    Select a comparison option to search for the anchor.

  • Anchor type
    Select whether the OCR provider looks for a word or a phrase.

  • Text box
    Enter the anchor text.

  • High contrast icon
    Click this icon to enable or disable high contrast in the PDF file.

  • View text icon
    Click this icon to preview the recognized anchor text.

  • Fetch anchor icon
    Click this icon to fetch and highlight the recognized anchor text.
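
To make the Operator and Anchor type settings more concrete, the following Python sketch shows one way that a comparison option and a word or phrase anchor could be applied to a list of recognized words. It is an illustration only, not the tool's implementation: the `OcrWord` type, the operator names (`equal`, `contains`, `starts_with`), and the `find_anchor` function are hypothetical and are not part of IBM RPA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrWord:
    """Hypothetical OCR result: one recognized word and its bounding box."""
    text: str
    x: int
    y: int
    width: int
    height: int


def matches(candidate: str, anchor: str, operator: str) -> bool:
    """Compare recognized text with the anchor text, ignoring case."""
    c, a = candidate.casefold(), anchor.casefold()
    if operator == "equal":
        return c == a
    if operator == "contains":
        return a in c
    if operator == "starts_with":
        return c.startswith(a)
    raise ValueError(f"unknown operator: {operator}")


def find_anchor(words: List[OcrWord], anchor: str,
                operator: str = "equal",
                anchor_type: str = "word") -> List[OcrWord]:
    """Return the words that form the first match, or an empty list."""
    # A word anchor is compared with single words; a phrase anchor is
    # compared with runs of as many consecutive words as the phrase has.
    size = 1 if anchor_type == "word" else len(anchor.split())
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        candidate = " ".join(w.text for w in window)
        if matches(candidate, anchor, operator):
            return window
    return []


page_words = [
    OcrWord("Invoice", 100, 200, 60, 12),
    OcrWord("number:", 165, 200, 62, 12),
    OcrWord("12345", 260, 200, 45, 12),
]
phrase_hit = find_anchor(page_words, "Invoice number:", "equal", "phrase")
word_hit = find_anchor(page_words, "number", "contains", "word")
```

In this sketch, a phrase anchor succeeds only when consecutive recognized words join into the anchor text, which is one reason why a change of OCR provider or language can change whether an anchor is found.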

Target

Use the controls in the Target group to configure the target text. The following list describes the controls available:

  • OCR provider
    Select the OCR provider used to recognize the target text.

  • High contrast icon
    Click this icon to enable or disable high contrast in the PDF file.

  • View text icon
    Click this icon to preview the recognized target text.

Actions

Use the controls in the Actions group to act on the file based on the previous configurations. The following list describes the controls available:

  • Clear
    Clear the settings fields.

  • Create command
    Add the Get PDF text by OCR (extractPdfText) command to the script, based on the anchor, target text, and position that you entered in this tool.

Procedure

Follow these steps in IBM RPA Studio to extract text from a scanned PDF file based on anchoring with the Extract PDF Text tool:

Anchoring

Anchoring is the process of defining a reference position relative to which the target text is extracted. The anchor acts as a base for finding the target text in a PDF file.

  1. Open the Extract PDF Text tool by clicking Tools > Extract PDF Text.
  2. Optional: In the Anchor section, define the following settings:
    • From the OCR provider list, select the provider to use to find the anchor.
    • From the Language list, select the text language.
    • From the Operator list, select the comparison option to search for the anchor.
    • From the Anchor type list, select the anchor type: word or phrase.
  3. Drag the cursor around the area in which you want to search for the anchor.
  4. Click the View text icon (a magnifying glass over a document) to preview the text that is recognized in that area.
  5. In the text box, enter the anchor text. The text must be inside the area that you selected in step 3.
  6. Click the Fetch anchor icon (a magnifying glass) to search for the anchor. The tool highlights the anchor.
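
The idea behind anchoring, a target located relative to an anchor, can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not how the tool or the extractPdfText command works internally: the `Rect` type and both functions are hypothetical, and the coordinates are made-up pixel values.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Hypothetical rectangle on a page, in pixels from the top-left corner."""
    x: int
    y: int
    width: int
    height: int


def relative_offset(anchor: Rect, target: Rect) -> Tuple[int, int]:
    """Distance from the anchor's top-left corner to the target's."""
    return (target.x - anchor.x, target.y - anchor.y)


def resolve_target(anchor: Rect, offset: Tuple[int, int],
                   size: Tuple[int, int]) -> Rect:
    """Place a target of the given size at a fixed offset from an anchor."""
    return Rect(anchor.x + offset[0], anchor.y + offset[1], size[0], size[1])


# Positions selected once on a sample page.
sample_anchor = Rect(100, 200, 120, 14)
sample_target = Rect(260, 195, 80, 24)
offset = relative_offset(sample_anchor, sample_target)

# On another page the anchor text is found lower down; the target follows it.
found_anchor = Rect(120, 340, 120, 14)
resolved = resolve_target(found_anchor, offset,
                          (sample_target.width, sample_target.height))
```

Because only the offset is stored, the target area moves with the anchor when the same label appears at a different place on another page.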

The following video shows how to define and fetch the anchor text with the Extract PDF Text tool:

Note: The video has no narration.

Tip: To improve the recognition of the anchor text, try the following adjustments:
  • Set a new area to search for the anchor text.
  • Enable or disable high contrast.
  • Switch the OCR provider.
  • Switch the language.

Extracting the target text

  1. Drag the cursor around the area from which you want to extract the target text. The tool highlights the target in yellow.
  2. In the Target section, click the View text icon (a magnifying glass) to preview the target text.
  3. Click Create command to add the Get PDF text by OCR (extractPdfText) command to your script, based on the anchor, target text, and position that you entered in this tool.
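
As a rough picture of what selecting a target area means, the following Python sketch collects the recognized words whose centers fall inside a rectangle and joins them in reading order. It is a simplified illustration, not the tool's algorithm: the `OcrWord` type and the `target_text` function are hypothetical, and the sort assumes that words on the same line share the same `y` value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrWord:
    """Hypothetical OCR result: one recognized word and its bounding box."""
    text: str
    x: int
    y: int
    width: int
    height: int


def inside(word: OcrWord, left: int, top: int, right: int, bottom: int) -> bool:
    """True when the center of the word's box lies inside the rectangle."""
    center_x = word.x + word.width / 2
    center_y = word.y + word.height / 2
    return left <= center_x <= right and top <= center_y <= bottom


def target_text(words: List[OcrWord], left: int, top: int,
                right: int, bottom: int) -> str:
    """Join the words inside the rectangle, top to bottom, left to right."""
    selected = [w for w in words if inside(w, left, top, right, bottom)]
    # Simplification: words on one line are assumed to have equal y values.
    selected.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in selected)


page_words = [
    OcrWord("USD", 115, 100, 30, 12),
    OcrWord("Total:", 10, 100, 50, 12),
    OcrWord("42.00", 70, 100, 40, 12),
    OcrWord("Footer", 10, 300, 60, 12),
]
extracted = target_text(page_words, 65, 95, 150, 115)
```

A word that only partly overlaps the rectangle is included or excluded depending on where its center falls, which suggests why drawing the target area with some margin around the text can help.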

The following video shows how to define and fetch the target text with the Extract PDF Text tool:

Note: The video has no narration.