Overview

Use the text extraction method of the REST API to convert highly structured files, which convey information through diagrams, images, and tables, into an AI-model-friendly JSON file format. Text extraction applies natural language understanding technology developed by IBM to identify document structures.

Supported file types for this API request are detailed in this list.

Extract text

The following command submits a request to extract text from a file called document.pdf.

Example

curl --request POST 'https://{cluster_url}/ml/v1/text/extractions?version=2023-10-25' \
  -H 'Authorization: Bearer eyJhbGciOiJSUzUxM...' \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json' \
  -d '{
  "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
  "document_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "files/document.pdf"
    }
  },
  "results_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "results"
    }
  },
  "steps": {
    "tables_processing": {
      "enabled": true
    }
  }
}'
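The same request can be assembled programmatically. The following is a minimal Python sketch that uses only the standard library and mirrors the curl example; the function name build_extraction_request and its parameters are illustrative and not part of the API, and the host and token in the usage section are placeholders. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_extraction_request(cluster_url, token, project_id, connection_id,
                             document_path, results_path):
    """Build (but do not send) the POST request from the curl example.

    The function name and parameters are illustrative, not part of the API.
    """
    payload = {
        "project_id": project_id,
        "document_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": document_path},
        },
        "results_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": results_path},
        },
        # Same option as the curl example: enable table processing.
        "steps": {"tables_processing": {"enabled": True}},
    }
    url = f"https://{cluster_url}/ml/v1/text/extractions?version=2023-10-25"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and token; substitute real values before sending.
    request = build_extraction_request(
        cluster_url="cluster.example.com",
        token="<your-bearer-token>",
        project_id="12ac4cf1-252f-424b-b52d-5cdd9814987f",
        connection_id="6f5688fd-f3bf-42c2-a18b-49c0d8a1920d",
        document_path="files/document.pdf",
        results_path="results",
    )
    print(request.get_method(), request.full_url)
    print(json.dumps(json.loads(request.data), indent=2))
    # To submit the request, pass it to urllib.request.urlopen(request).
```

Building the payload as a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as a missing comma between fields.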

Response

The response describes the newly created text extraction resource and its details.

{
  "metadata": {
    "id": "6213cf1-252f-424b-b52d-5cdd9814956c",
    "created_at": "2023-05-02T16:27:51Z",
    "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
    "name": "extract"
  },
  "entity": {
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
      },
      "location": {
        "file_name": "files/document.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "2a7c11bc-2913-48d0-9581-a8d9f40fa159"
      },
      "location": {
        "file_name": "results"
      }
    },
    "steps": {
      "tables_processing": {
        "enabled": true
      }
    },
    "results": {
      "status": "submitted",
      "number_pages_processed": 0
    }
  }
}
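A client typically reads a few fields from this response: the resource ID, the status, and the page count. The sketch below shows one way to do that in Python; the helper name summarize_extraction is hypothetical, and the sample string is a trimmed copy of the response above. Reading a status of "submitted" with zero pages processed as "accepted but not yet finished" is an interpretation of the example, not a documented guarantee.

```python
import json


def summarize_extraction(response_text):
    """Return the resource ID, status, and page count from a response body.

    The helper name is hypothetical; the field names come from the
    example response.
    """
    body = json.loads(response_text)
    results = body["entity"]["results"]
    return {
        "id": body["metadata"]["id"],
        "status": results["status"],
        # Default to 0 if the field is absent (a defensive assumption).
        "pages_processed": results.get("number_pages_processed", 0),
    }


if __name__ == "__main__":
    # Trimmed copy of the example response.
    sample = json.dumps({
        "metadata": {"id": "6213cf1-252f-424b-b52d-5cdd9814956c"},
        "entity": {
            "results": {"status": "submitted", "number_pages_processed": 0}
        },
    })
    print(summarize_extraction(sample))
```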

For more information about some of the structures recognized by the API, see Extracting text from documents.

Use the extracted text

Take the extracted text from the generated JSON file and store it as plain text. For example, the following command joins the text of every token in the generated JSON file and writes the result to a plain text file named parsed_output_text.txt.

Example

jq -r '[.all_structures.tokens[].text] | join(" ")' output_report.json > parsed_output_text.txt

Alternatively, return the number of pages in the original PDF file.

Example

jq '.metadata.num_pages' output_report.json
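The two jq commands can also be expressed in Python. This is a small sketch that assumes the generated JSON file has the all_structures.tokens[].text and metadata.num_pages fields used by the jq examples; the function names and the sample report are made up for illustration and do not reflect the full output schema.

```python
import json


def load_report(path):
    """Load the generated JSON file, for example output_report.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extracted_text(report):
    """Join the text of every token with single spaces, like the jq filter."""
    return " ".join(token["text"] for token in report["all_structures"]["tokens"])


def page_count(report):
    """Return the page count stored in the report metadata."""
    return report["metadata"]["num_pages"]


if __name__ == "__main__":
    # Made-up miniature report containing only the fields used above.
    sample_report = {
        "metadata": {"num_pages": 1},
        "all_structures": {
            "tokens": [{"text": "Quarterly"}, {"text": "report"}]
        },
    }
    print(extracted_text(sample_report))
    print(page_count(sample_report))
```

With a real file, call load_report("output_report.json") and write the result of extracted_text to parsed_output_text.txt.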

Next steps

For additional information about the extraction API, see the following links:
