Text extraction
Overview
Use the text extraction method of the REST API to convert highly structured files that use diagrams, images, and tables to convey information into an AI-model-friendly JSON file format. Text extraction applies natural language understanding technology developed by IBM to identify document structures.
Supported file types for this API request are detailed in this list.
Extract text
The following command submits a request to extract text from a file called document.pdf.
Example
curl --request POST 'https://{cluster_url}/ml/v1/text/extractions?version=2023-10-25' \
-H 'Authorization: Bearer eyJhbGciOiJSUzUxM...' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{
  "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
  "document_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "files/document.pdf"
    }
  },
  "results_reference": {
    "type": "connection_asset",
    "connection": {
      "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
    },
    "location": {
      "file_name": "results"
    }
  },
  "steps": {
    "tables_processing": {
      "enabled": true
    }
  }
}'
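The same request can also be assembled programmatically. The following is a minimal Python sketch, not an official client: it uses only the standard library, and the helper names build_extraction_request and submit_extraction, together with the cluster_url and token parameters, are illustrative assumptions. The payload fields and headers mirror the curl example.

```python
import json
import urllib.request


def build_extraction_request(project_id, connection_id, source_file, results_file):
    """Build the JSON body for a text extraction request (hypothetical helper)."""

    def reference(file_name):
        # In this sketch both references use the same connection asset.
        return {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": file_name},
        }

    return {
        "project_id": project_id,
        "document_reference": reference(source_file),
        "results_reference": reference(results_file),
        "steps": {"tables_processing": {"enabled": True}},
    }


def submit_extraction(cluster_url, token, payload, version="2023-10-25"):
    """POST the payload to the extraction endpoint and return the parsed JSON.

    cluster_url and token are placeholders that you must supply (assumption).
    """
    request = urllib.request.Request(
        url=f"https://{cluster_url}/ml/v1/text/extractions?version={version}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build the body with the sample identifiers from the curl example.
payload = build_extraction_request(
    project_id="12ac4cf1-252f-424b-b52d-5cdd9814987f",
    connection_id="6f5688fd-f3bf-42c2-a18b-49c0d8a1920d",
    source_file="files/document.pdf",
    results_file="results",
)
print(json.dumps(payload, indent=2))
```

Building the body as a dictionary and serializing it with json.dumps avoids hand-written JSON mistakes such as a missing comma between fields. Calling submit_extraction requires a reachable cluster and a valid bearer token.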
Response
The response describes the created resource and includes the details of the text extraction request.
{
  "metadata": {
    "id": "6213cf1-252f-424b-b52d-5cdd9814956c",
    "created_at": "2023-05-02T16:27:51Z",
    "project_id": "12ac4cf1-252f-424b-b52d-5cdd9814987f",
    "name": "extract"
  },
  "entity": {
    "document_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "6f5688fd-f3bf-42c2-a18b-49c0d8a1920d"
      },
      "location": {
        "file_name": "files/document.pdf"
      }
    },
    "results_reference": {
      "type": "connection_asset",
      "connection": {
        "id": "2a7c11bc-2913-48d0-9581-a8d9f40fa159"
      },
      "location": {
        "file_name": "results"
      }
    },
    "steps": {
      "tables_processing": {
        "enabled": true
      }
    },
    "results": {
      "status": "submitted",
      "number_pages_processed": 0
    }
  }
}
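When you handle the response in code, the fields you typically need are the request ID and the current results. The following Python sketch shows one way to pull them out; the helper name summarize_extraction is hypothetical, and the sample is a trimmed copy of the response shown in this section. In the sample, the status is submitted and no pages have been processed yet.

```python
import json


def summarize_extraction(response):
    """Return the fields most useful for tracking an extraction request."""
    results = response["entity"]["results"]
    return {
        "id": response["metadata"]["id"],
        "status": results["status"],
        # Default to 0 if the counter is absent (assumption for robustness).
        "pages_processed": results.get("number_pages_processed", 0),
    }


# Trimmed copy of the sample response; only the fields used here are kept.
sample = json.loads("""
{
  "metadata": {"id": "6213cf1-252f-424b-b52d-5cdd9814956c", "name": "extract"},
  "entity": {"results": {"status": "submitted", "number_pages_processed": 0}}
}
""")

summary = summarize_extraction(sample)
print(summary)
```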
For more information about some of the structures recognized by the API, see Extracting text from documents.
Use the extracted text
Take the extracted text from the generated JSON file and store it as plain text, for example in a file named parsed_output_text.txt.
Example
cat output_report.json | jq -r '[.all_structures.tokens[].text] | join(" ")' > parsed_output_text.txt
Or return the number of pages in the original PDF file.
Example
cat output_report.json | jq '.metadata.num_pages'
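If jq is not available, the same two lookups can be done in Python. This is a minimal sketch that assumes the report has the all_structures.tokens[].text and metadata.num_pages fields used by the jq filters; the function names and the tiny inline report are illustrative only.

```python
import json


def load_report(path):
    """Load a generated JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extract_plain_text(report):
    """Join the text of every token with single spaces, like the jq filter."""
    tokens = report["all_structures"]["tokens"]
    return " ".join(token["text"] for token in tokens)


def page_count(report):
    """Return the page count recorded in the report metadata."""
    return report["metadata"]["num_pages"]


# Tiny hand-written report for illustration; real reports contain more fields.
report = {
    "metadata": {"num_pages": 1},
    "all_structures": {"tokens": [{"text": "Hello"}, {"text": "world"}]},
}

print(extract_plain_text(report))  # Hello world
print(page_count(report))          # 1
```

To process a real file, pass the result of load_report("output_report.json") to either function and write the returned text to parsed_output_text.txt.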
Next steps
For additional information about the extraction API, see the following links: