
Text extraction


Overview

Use the text extraction method of the REST API to convert highly structured files, which use diagrams, images, and tables to convey information, into an AI-model-friendly JSON format. Text extraction applies natural language understanding technology developed by IBM to identify document structures.

Supported file types for this API request are detailed in this list.

Extract text

The following command submits a request to extract text from a file called document.pdf.

Example

curl --request POST 'https://{cluster_url}/ml/v1/text/extractions?version=2023-10-25' \
  -H 'Authorization: Bearer eyJhbGciOiJSUzUxM...' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -d '{
  "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
  "document_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "files/document.pdf"
    }
  },
  "results_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "results"
    }
  },
  "steps": {
    "tables_processing": {
      "enabled": true
    }
  }
}'
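The same request can be assembled from a script. The following is a minimal Python sketch that uses only the standard library and mirrors the curl example above. The function name build_extraction_request and its parameter names are illustrative assumptions, not part of the API, and the sketch assumes that one connection asset is used for both the source document and the results, as in the request shown here.

```python
import json
import urllib.request


def build_extraction_request(cluster_url, token, project_id, connection_id,
                             document_path, results_path,
                             version="2023-10-25"):
    """Build (but do not send) the POST request from the curl example.

    All parameter names are hypothetical; the body fields mirror the
    JSON payload shown in the curl example.
    """
    body = {
        "project_id": project_id,
        "document_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": document_path},
        },
        "results_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": results_path},
        },
        "steps": {"tables_processing": {"enabled": True}},
    }
    return urllib.request.Request(
        url=f"https://{cluster_url}/ml/v1/text/extractions?version={version}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

To send the request, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Building the request separately from sending it makes the payload easy to inspect before any network call is made.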

Response

The response contains the newly created text extraction resource and its details.

{
  "metadata": {
    "id": "6213cf1-252f-424b-b52d-5cdd9814956c",
    "created_at": "2023-05-02T16:27:51Z",
    "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
    "name": "extract"
  },
  "entity": {
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
      },
      "location": {
        "file_name": "files/document.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "2a7c11bc-2913-48d0-9581-a8d9f40fa159"
      },
      "location": {
        "file_name": "results"
      }
    },
    "steps": {
      "tables_processing": {
        "enabled": true
      }
    },
    "results": {
      "status": "submitted",
      "number_pages_processed": 0
    }
  }
}
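A script that submits the request usually needs only a few fields from this response. The following Python sketch pulls out the extraction ID, the status, and the page counter from a response shaped like the example above. The function name summarize_extraction is a hypothetical helper, and the sketch assumes the response has already been parsed into a dictionary.

```python
def summarize_extraction(response):
    """Return the key fields of a parsed text extraction response.

    Assumes the layout of the example response: an ID under "metadata"
    and the progress fields under "entity" -> "results".
    """
    results = response["entity"]["results"]
    return {
        "id": response["metadata"]["id"],
        "status": results["status"],
        "pages_processed": results["number_pages_processed"],
    }
```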

For more information about some of the structures recognized by the API, see Extracting text from documents.

Use the extracted text

Take the extracted text from the generated JSON file and store it as plain text, for example in a file named parsed_output_text.txt.

Example

cat output_report.json | jq -r '[.all_structures.tokens[].text] | join(" ")' > parsed_output_text.txt

Or return the number of pages in the original PDF file.

Example

cat output_report.json | jq '.metadata.num_pages'
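Where jq is not available, both operations can be done with a short script. The following Python sketch assumes the same JSON layout that the jq filters rely on, namely a list of tokens under all_structures.tokens, each with a text field, and a page count under metadata.num_pages. The function names and the output_report.json file name in the usage note are illustrative.

```python
import json


def tokens_to_text(report):
    """Join the text of every token with single spaces, like the jq filter."""
    tokens = report["all_structures"]["tokens"]
    return " ".join(token["text"] for token in tokens)


def page_count(report):
    """Return the number of pages recorded in the report metadata."""
    return report["metadata"]["num_pages"]


def write_plain_text(report_path, text_path):
    """Read the JSON report and write its token text to a plain text file."""
    with open(report_path, encoding="utf-8") as source:
        report = json.load(source)
    with open(text_path, "w", encoding="utf-8") as target:
        target.write(tokens_to_text(report))
    return page_count(report)
```

For example, write_plain_text("output_report.json", "parsed_output_text.txt") writes the joined text to the target file and returns the page count.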

Next steps

For additional information about the extraction API, see the following links: