The Extract PDF Text tool

The Extract PDF Text tool uses optical character recognition (OCR) to extract text from documents, such as scanned PDF files, and to convert that text into machine-readable data.

Figure 1. The image shows the user interface of the Extract PDF Text tool.

Table 1. The table describes each numbered section of the Extract PDF Text tool shown in Figure 1.
No.  Topic            Description
1    File             Click Open to select a PDF file. This action adds the Open PDF file (pdfOpen) command to the current script.
2    Paging           Controls to move through the pages of the PDF file.
3    Anchor           Options to configure the anchor.
4    Target           Options to configure the target text.
5    Actions          Actions to run on the file based on the previous configurations.
6    PDF page viewer  This pane displays the contents of the PDF file.

Paging

Use the controls in the Paging group to move through the PDF pages. The following list describes the controls available:

  • Icons
    Click the ◀ icon to return to the previous page, and click the ▶ icon to advance to the next page.

  • Page selection
    Select a page from the PDF file in the page field.

Anchor

Use the controls in the Anchor group to configure the anchor. The anchor text serves as a reference location for finding the target text. The following list describes the controls available:

  • OCR provider
    Select the OCR provider used to recognize the anchor text.

    Note: ABBYY® behaves differently from Google Cloud Vision™ and Google Tesseract™, which might lead to different results in the tool.

  • Language
    Select the language used by the OCR provider to recognize the anchor text.

  • Operator
    Select a comparison option to search for the anchor.

  • Anchor type
    Specify whether the OCR provider looks for a single word or a phrase.

  • Text box
    Enter the anchor text.

  • High contrast icon
    Click this icon to enable or disable high contrast in the PDF file.

  • View text icon
    Click this icon to preview the recognized anchor text.

  • Fetch anchor icon
    Click this icon to fetch and highlight the recognized anchor text.
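
The way the Operator and Anchor type settings combine can be pictured with a short Python sketch. This is only a conceptual illustration and not the tool's implementation: the `Word` class, the `find_anchor` function, and the operator names (`equals`, `contains`, `starts_with`) are all assumptions made for the example, and the options that the tool actually offers might differ.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR-recognized word with the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


# Hypothetical comparison operators; the tool's real option names may differ.
OPERATORS = {
    "equals": lambda candidate, anchor: candidate == anchor,
    "contains": lambda candidate, anchor: anchor in candidate,
    "starts_with": lambda candidate, anchor: candidate.startswith(anchor),
}


def find_anchor(words, anchor_text, operator="equals", anchor_type="word"):
    """Return the index of the first word where the anchor matches, or None."""
    compare = OPERATORS[operator]
    anchor = anchor_text.lower()
    # A word anchor is compared with one token at a time; a phrase anchor
    # is compared with a window of consecutive tokens of the same length.
    span = 1 if anchor_type == "word" else len(anchor.split())
    for i in range(len(words) - span + 1):
        candidate = " ".join(w.text.lower() for w in words[i:i + span])
        if compare(candidate, anchor):
            return i
    return None


# Example: words as an OCR provider might return them for one line of a page.
words = [Word("Invoice", 0, 0), Word("Number:", 10, 0), Word("12345", 20, 0)]
phrase_index = find_anchor(words, "Invoice Number:", "equals", "phrase")
word_index = find_anchor(words, "number", "contains")
```

In this sketch, a stricter operator such as `equals` reduces false matches, while `contains` tolerates trailing punctuation that OCR often attaches to a word.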

Target

Use the controls in the Target group to configure the target text. The following list describes the controls available:

  • OCR provider
    Select the OCR provider used to recognize the target text.

  • High contrast icon
    Click this icon to enable or disable high contrast in the PDF file.

  • View text icon
    Click this icon to preview the recognized target text.
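
As a rough mental model of how an anchor helps locate the target text, the following Python sketch picks the words that sit on the same line as the anchor and to its right. It is a simplified assumption for illustration only: the `Word` class, the `target_right_of` function, and the `line_tolerance` parameter are hypothetical, and the tool might determine the target position differently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR-recognized word with the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


def target_right_of(words, anchor, max_words=1, line_tolerance=5.0):
    """Return up to max_words words on the anchor's line, to the right of it."""
    # Words count as being on the same line when their vertical positions
    # differ by no more than line_tolerance (an assumed threshold).
    same_line = [
        w for w in words
        if abs(w.y - anchor.y) <= line_tolerance and w.x > anchor.x
    ]
    same_line.sort(key=lambda w: w.x)  # read left to right
    return " ".join(w.text for w in same_line[:max_words])


# Example page fragment: "Total: 42.00 EUR" on one line, "Date:" on another.
words = [
    Word("Total:", 50, 100),
    Word("42.00", 120, 101),
    Word("EUR", 170, 99),
    Word("Date:", 50, 140),
]
amount = target_right_of(words, words[0])
amount_with_currency = target_right_of(words, words[0], max_words=2)
```

The tolerance absorbs the small vertical jitter that OCR bounding boxes usually have, so words on the same printed line are still grouped together.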

Actions

Use the controls in the Actions group to act on the file based on the previous configurations. The following list describes the controls available:

  • Clear
    Clear the settings fields.

  • Create command
    Adds the Get PDF text by OCR (extractPdfText) command to the script, based on the anchor, target text, and position that you configured in this tool.