Digitize Documents API

Overview

The Digitize Documents Service provides a REST API for document digitization and ingestion. This service converts PDF documents into searchable content and supports indexing into vector databases for semantic search capabilities.

API Endpoints:

GET /health — Health check
POST /v1/jobs — Create async job to upload and process documents
GET /v1/jobs — List all jobs
POST /v1/import - Import metadata into PostgreSQL
GET /v1/export - Export metadata from PostgreSQL
GET /v1/jobs/{job_id} — Get job by ID
DELETE /v1/jobs/{job_id} — Delete job
GET /v1/documents — List all documents
GET /v1/documents/{doc_id} — Get document metadata
GET /v1/documents/{doc_id}/content — Get document content
DELETE /v1/documents/{doc_id} — Delete document
DELETE /v1/documents — Bulk delete all documents (deprecated)

Note: To access API endpoints:

In the Endpoints Catalog UI: Click on Digital Assistant and refer to Integration Endpoints.
Using CLI: Run the following command to retrieve the API endpoints and base URL.
```
ai-services application info <appName> --runtime <podman|openshift>
```

GET /health

Check if the service is running and healthy. Used for liveness probes.

Tags: health

Request:

No parameters required.

Request Example:

curl -X GET http://10.20.188.184:4000/health

Success Response (200 OK):

Content-Type: application/json

Returns a JSON object whose status field indicates that the service is healthy.

Response Example:

{ "status": "ok" }
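
As a minimal sketch of a liveness check, the Python snippet below calls GET /health and interprets the result. It assumes the healthy body carries "status": "ok", as in the response example. The base URL is the one used in the request example, and the helper names is_healthy and check_health are illustrative, not part of the service.

```python
import json
import urllib.request

BASE_URL = "http://10.20.188.184:4000"  # base URL from the example; replace with your own


def is_healthy(status_code: int, body: str) -> bool:
    """Return True when the response matches the documented healthy shape."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Assumption: a healthy service reports "status": "ok".
    return isinstance(payload, dict) and payload.get("status") == "ok"


def check_health(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Call GET /health; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except OSError:
        return False
```

Keeping is_healthy separate from the network call makes the interpretation logic easy to test without a running service.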

POST /v1/jobs

Upload PDF documents for processing. Supports two operation types:

ingestion: Process and index documents into a vector database for semantic search
digitization: Convert PDF to text/markdown/JSON format (single file only)

The operation runs asynchronously in the background. Use the returned job_id to track progress.

Tags: jobs

Content-Type: multipart/form-data

Table 1. Query Parameters
Parameter	Type	Required	Default	Description
`operation`	string	No	ingestion	Operation type: 'ingestion' (index into vector DB) or 'digitization' (convert to text/md/json)
`output_format`	string	No	json	Output format for digitization: 'json', 'md', or 'txt' (only applies to digitization operation)
`job_name`	string	No	null	Optional human-readable name for the job

Valid Values:

operation: ingestion, digitization
output_format: txt, md, json

Table 2. Request Body Parameters
Field	Type	Required	Description
`files`	array[binary]	Yes	PDF files to process (multiple for ingestion, single for digitization)

Request Example:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document.pdf"

Success Response (202 Accepted):

Content-Type: application/json

Table 3. Response Fields
Field	Type	Description
`job_id`	string	Unique identifier for the created job

Response Example:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}
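
The request can also be assembled outside curl. The sketch below builds the query string from the documented parameters and encodes the PDFs as multipart/form-data under the repeated files field, using only the Python standard library. The function names build_job_url and encode_files are hypothetical, and the per-part Content-Type of application/pdf is an assumption rather than a documented requirement.

```python
import uuid
from urllib.parse import urlencode

VALID_OPERATIONS = ("ingestion", "digitization")
VALID_OUTPUT_FORMATS = ("json", "md", "txt")


def build_job_url(base_url, operation="ingestion", output_format="json", job_name=None):
    """Build the POST /v1/jobs URL, rejecting values outside the documented sets."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {VALID_OPERATIONS}")
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {VALID_OUTPUT_FORMATS}")
    params = {"operation": operation}
    if operation == "digitization":
        # output_format only applies to digitization jobs.
        params["output_format"] = output_format
    if job_name is not None:
        params["job_name"] = job_name
    return f"{base_url}/v1/jobs?{urlencode(params)}"


def encode_files(files, operation="ingestion"):
    """Encode {filename: bytes} as a multipart body; returns (body, content_type)."""
    if not files:
        raise ValueError("at least one PDF is required")
    if operation == "digitization" and len(files) != 1:
        raise ValueError("digitization accepts a single file")
    boundary = uuid.uuid4().hex
    body = b""
    for filename, content in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"  # assumption, see lead-in
        )
        body += header.encode("utf-8") + content + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"
```

The returned body and content type can be passed to urllib.request.Request(url, data=body, headers={"Content-Type": content_type}, method="POST"). Validating the operation and file count on the client avoids a round trip for requests the service would reject.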

GET /v1/jobs

Retrieve information about all submitted jobs with pagination and filtering options.

Tags: jobs

Table 4. Query Parameters
Parameter	Type	Required	Default	Description
`latest`	boolean	No	false	Return only the latest job
`limit`	integer	No	20	Number of records per page (min: 1, max: 100)
`offset`	integer	No	0	Number of records to skip (min: 0)
`status`	string	No	null	Filter by job status: accepted, in_progress, completed, failed
`operation`	string	No	null	Filter by operation type: ingestion, digitization

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/jobs?limit=10&status=completed"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "pagination": {
    "total": 45,
    "limit": 10,
    "offset": 0
  },
  "data": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "job_name": "Monthly Reports",
      "operation": "ingestion",
      "status": "completed"
    }
  ]
}
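
To read every job rather than a single page, a client can advance offset by limit until pagination.total is reached. The following is a sketch under that assumption. The fetch_page callable is hypothetical: it stands for whatever code performs the GET request and returns the parsed JSON response.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def jobs_page_url(base_url: str, limit: int, offset: int,
                  status: Optional[str] = None, operation: Optional[str] = None) -> str:
    """Build a GET /v1/jobs URL from the documented query parameters."""
    params = {"limit": limit, "offset": offset}
    if status is not None:
        params["status"] = status
    if operation is not None:
        params["operation"] = operation
    return f"{base_url}/v1/jobs?{urlencode(params)}"


def iter_records(fetch_page: Callable[[int, int], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every record by walking offset until pagination.total is reached."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        records = page["data"]
        yield from records
        offset += limit
        # Stop on an empty page as well, in case total changes between calls.
        if not records or offset >= page["pagination"]["total"]:
            break
```

With total 45 and limit 10, as in the response example, the loop requests offsets 0, 10, 20, 30, and 40 and then stops. The same loop applies to GET /v1/documents, whose response uses the same pagination and data fields.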

POST /v1/import

Import job and document metadata into PostgreSQL using the export-compatible JSON payload.

Table 5. Request Body Parameters
Parameter	Type	Description
data	object	Container for jobs and documents
data.jobs	array	List of job objects
jobs[].job_id	string	Unique identifier for the job
jobs[].operation	string	Operation performed by the job
jobs[].status	string	Current status of the job
jobs[].job_name	string	Name assigned to the job
jobs[].submitted_at	string	Timestamp when the job was submitted
jobs[].completed_at	string	Timestamp when the job was completed
jobs[].stats	object	Statistical data related to the job
jobs[].error	string	Error message if the job failed
data.documents	array	List of document objects
documents[].id	string	Document identifier
documents[].job_id	string	Associated job identifier
documents[].name	string	Name of the document
documents[].type	string	Document type
documents[].status	string	Processing status of the document
documents[].output_format	string	Format of the output document
documents[].submitted_at	string	Submission timestamp
documents[].completed_at	string	Completion timestamp
documents[].error	string	Error message if processing failed
documents[].metadata	object	Additional metadata for the document
validate_only	boolean	If true, validates request without processing

Request Example:

curl -X 'POST' \
  'http://10.20.188.184:4000/v1/import' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "data": {
    "jobs": [
      {
        "job_id": "string",
        "operation": "string",
        "status": "string",
        "job_name": "string",
        "submitted_at": "string",
        "completed_at": "string",
        "stats": {
          "additionalProp1": 0,
          "additionalProp2": 0,
          "additionalProp3": 0
        },
        "error": "string"
      }
    ],
    "documents": [
      {
        "id": "string",
        "job_id": "string",
        "name": "string",
        "type": "string",
        "status": "string",
        "output_format": "string",
        "submitted_at": "string",
        "completed_at": "string",
        "error": "string",
        "metadata": {
          "additionalProp1": {}
        }
      }
    ]
  },
  "validate_only": false
}'

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "status": "completed",
  "summary": {
    "jobs": {
      "total_received": 1,
      "imported": 0,
      "skipped": 0,
      "failed": 1
    },
    "documents": {
      "total_received": 1,
      "imported": 0,
      "skipped": 0,
      "failed": 1
    }
  },
  "duration_seconds": 0.0061,
  "errors": [
    {
      "record_type": "job",
      "record_id": "string",
      "type": "validation_error",
      "message": "Invalid timestamp: Invalid isoformat string: 'string'"
    }
  ],
  "warnings": [
    {
      "record_type": "document",
      "record_id": "string",
      "type": "orphaned_document",
      "message": "Document references non-existent job_id: string"
    }
  ]
}
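
Note that the example above returns 200 OK with status "completed" even though both records failed, so a client should inspect the summary and errors fields rather than rely on the HTTP status alone. The helper below is a minimal sketch of such a check. The name import_issues is illustrative, and the field names are taken from the response example.

```python
def import_issues(result: dict) -> list:
    """Collect human-readable problems from a POST /v1/import response."""
    issues = []
    # Per-type counters: report any record type with failed > 0.
    for kind, counts in result.get("summary", {}).items():
        failed = counts.get("failed", 0)
        if failed:
            issues.append(f'{failed} of {counts.get("total_received", "?")} {kind} failed')
    # Individual error entries carry the record type, ID, and message.
    for entry in result.get("errors", []):
        issues.append(f'{entry["record_type"]} {entry["record_id"]}: {entry["message"]}')
    return issues
```

An empty list means no failures were reported. Warnings, such as the orphaned_document entry in the example, are deliberately left out here and could be reported separately.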

Error Responses

Table 6. Error Responses
Status	Condition	Body
400	Bad Request - Invalid input or validation error	`{ "error": { "code": "INVALID_REQUEST", "message": "Request validation failed", "status": 400 } }`
409	Conflict - Resource is locked or in use	`{ "error": { "code": "RESOURCE_LOCKED", "message": "Resource is locked by an active operation", "status": 409 } }`
413	Payload Too Large - Input exceeds size limits	`{ "error": { "code": "CONTEXT_LIMIT_EXCEEDED", "message": "Input size exceeds maximum token limit", "status": 413 } }`
422	Validation Error	`{ "detail": [ { "loc": [ "string", 0 ], "msg": "string", "type": "string", "input": "string", "ctx": {} } ] }`
500	Internal Server Error - Unexpected error occurred	`{ "error": { "code": "INTERNAL_SERVER_ERROR", "message": "An unexpected error occurred", "status": 500 } }`
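
The table shows two different error body shapes: an error object for 400, 409, 413, and 500, and a detail array for 422. A client may want to normalize both into a single message. The following sketch does that. The function name describe_error is hypothetical.

```python
import json


def describe_error(body: str) -> str:
    """Turn either documented error body shape into a one-line message."""
    payload = json.loads(body)
    if isinstance(payload, dict) and "error" in payload:
        # Shape used for 400, 409, 413, and 500.
        err = payload["error"]
        return f'{err.get("status")} {err.get("code")}: {err.get("message")}'
    if isinstance(payload, dict) and "detail" in payload:
        # Shape used for 422: one entry per invalid field.
        parts = []
        for item in payload["detail"]:
            location = ".".join(str(p) for p in item.get("loc", []))
            parts.append(f'{location}: {item.get("msg")} ({item.get("type")})')
        return "; ".join(parts)
    # Unknown shape: fall back to the raw body.
    return body
```

For the 409 body in the table this returns "409 RESOURCE_LOCKED: Resource is locked by an active operation".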

GET /v1/export

Export job and document metadata from PostgreSQL as JSON for backup and restore workflows.

Table 7. Query Parameters
Parameter	Type	Description
limit	integer	Maximum number of records to return in the response
offset	integer	Number of records to skip before starting to return results (used for pagination)

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "status": "completed",
  "data": {
    "jobs": [],
    "documents": []
  },
  "summary": {
    "jobs": {
      "total_exported": 0,
      "completed": 0,
      "failed": 0
    },
    "documents": {
      "total_exported": 0,
      "completed": 0,
      "failed": 0
    }
  },
  "export_timestamp": "2026-06-16T07:17:22.142225Z",
  "duration_seconds": 0.0028,
  "pagination": {
    "limit": 10000,
    "offset": 0,
    "has_more": false,
    "total_records": 0,
    "returned_records": 0
  }
}
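
For a backup that spans several pages, one possible approach is to keep requesting pages while pagination.has_more is true and merge the results into a payload shaped like the POST /v1/import request body. The sketch below assumes that offset advances by limit between pages, which the source does not state explicitly. The fetch_page callable is hypothetical and stands for code that calls GET /v1/export with the given limit and offset and returns the parsed JSON.

```python
from typing import Callable


def collect_export(fetch_page: Callable[[int, int], dict], limit: int = 1000,
                   validate_only: bool = True) -> dict:
    """Merge all export pages into one import-shaped payload."""
    jobs, documents = [], []
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        jobs.extend(page["data"]["jobs"])
        documents.extend(page["data"]["documents"])
        if not page["pagination"]["has_more"]:
            break
        # Assumption: the next page starts limit records further on.
        offset += limit
    return {
        "data": {"jobs": jobs, "documents": documents},
        # Defaults to a dry run so a restore can be validated first.
        "validate_only": validate_only,
    }
```

Defaulting validate_only to true is a design choice for this sketch: the merged payload can be sent to the import endpoint once as a validation pass before the actual restore.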

Error Responses

Table 8. Error Responses
Status	Condition	Body
400	Bad Request - Invalid input or validation error	`{ "error": { "code": "INVALID_REQUEST", "message": "Request validation failed", "status": 400 } }`
413	Payload Too Large - Input exceeds size limits	`{ "error": { "code": "CONTEXT_LIMIT_EXCEEDED", "message": "Input size exceeds maximum token limit", "status": 413 } }`
422	Validation Error	`{ "detail": [ { "loc": [ "string", 0 ], "msg": "string", "type": "string", "input": "string", "ctx": {} } ] }`
500	Internal Server Error - Unexpected error occurred	`{ "error": { "code": "INTERNAL_SERVER_ERROR", "message": "An unexpected error occurred", "status": 500 } }`

GET /v1/jobs/{job_id}

Retrieve detailed status and progress information for a specific job.

Tags: jobs

Table 9. Path Parameters
Parameter	Type	Required	Description
`job_id`	string	Yes	Unique identifier for the job

Request Example:

curl -X GET http://10.20.188.184:4000/v1/jobs/c355556c-f945-420a-9142-002bcff8fac8

Success Response (200 OK):

Content-Type: application/json

Returns detailed job information including document statuses and statistics.

Response Example:

{
  "job_id": "c355556c-f945-420a-9142-002bcff8fac8",
  "job_name": "AI - Services",
  "operation": "ingestion",
  "status": "in_progress",
  "submitted_at": "2026-03-27T16:37:47.736184Z",
  "completed_at": null,
  "documents": [
    {
      "id": "87878b3b-9b78-492d-b5aa-e83eb8310e25",
      "name": "AI-services-v020_03272026_112209.pdf",
      "status": "in_progress"
    }
  ],
  "stats": {
    "total_documents": 1,
    "completed": 0,
    "failed": 0,
    "in_progress": 1
  },
  "error": null
}
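
Because jobs run asynchronously, a client typically polls this endpoint until the job finishes. The sketch below treats completed and failed as terminal statuses, an assumption based on the documented status values. The get_job callable is hypothetical: it stands for code that requests GET /v1/jobs/{job_id} and returns the parsed JSON. The interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # assumption: no further transitions


def wait_for_job(get_job: Callable[[], dict], interval: float = 2.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job['status']} after {timeout} seconds")
        sleep(interval)
```

The returned job carries the final status, stats, and error fields, so the caller can decide how to handle a failed job. Passing sleep as a parameter lets tests run without real delays.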

DELETE /v1/jobs/{job_id}

Delete a job status record. Only completed or failed jobs can be deleted.

Note: This only deletes the job record, not the associated document data.

Tags: jobs

Table 10. Path Parameters
Parameter	Type	Required	Description
`job_id`	string	Yes	Unique identifier for the job

Request Example:

curl -X DELETE http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000

Success Response (204 No Content):

Job successfully deleted. No response body.

GET /v1/documents

Get high-level information about all documents with pagination and filtering. Documents are sorted by submission time (newest first).

Tags: documents

Table 11. Query Parameters
Parameter	Type	Required	Default	Description
`limit`	integer	No	20	Number of records to return per page (min: 1, max: 100)
`offset`	integer	No	0	Number of records to skip (min: 0)
`status`	string	No	null	Filter by status: accepted, in_progress, completed, failed
`name`	string	No	null	Filter by document name (partial match, case-insensitive)

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/documents?limit=50&status=completed"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "pagination": {
    "total": 150,
    "limit": 50,
    "offset": 0
  },
  "data": [
    {
      "id": "doc-123",
      "name": "report.pdf",
      "type": "pdf",
      "status": "completed"
    }
  ]
}

GET /v1/documents/{doc_id}

Retrieve detailed metadata for a specific document by its ID.

Tags: documents

Table 12. Path Parameters
Parameter	Type	Required	Description
`doc_id`	string	Yes	Unique identifier for the document

Table 13. Query Parameters
Parameter	Type	Required	Default	Description
`details`	boolean	No	false	Include detailed metadata (pages, tables, timing)

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/documents/doc-123?details=true"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "id": "doc-123",
  "name": "report.pdf",
  "type": "pdf",
  "status": "completed",
  "output_format": "json",
  "metadata": {
    "pages": 15,
    "tables": 3
  }
}

GET /v1/documents/{doc_id}/content

Retrieve the digitized/processed content of a document.

Tags: documents

Table 14. Path Parameters
Parameter	Type	Required	Description
`doc_id`	string	Yes	Unique identifier for the document

Request Example:

curl -X GET http://10.20.188.184:4000/v1/documents/doc-123/content

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "result": {
    "title": "Quarterly Report",
    "content": "This report summarizes..."
  },
  "output_format": "json"
}

DELETE /v1/documents/{doc_id}

Delete a single document by ID.

Tags: documents

Table 15. Path Parameters
Parameter	Type	Required	Description
`doc_id`	string	Yes	Unique identifier for the document

Request Example:

curl -X DELETE http://10.20.188.184:4000/v1/documents/doc-123

Success Response (204 No Content):

Document successfully deleted. No response body.

DELETE /v1/documents (deprecated)

⚠️ DANGER: Delete ALL documents from the system.

Tags: documents

Table 16. Query Parameters
Parameter	Type	Required	Description
`confirm`	boolean	Yes	Must be true to proceed with bulk deletion

Request Example:

curl -X DELETE "http://10.20.188.184:4000/v1/documents?confirm=true"

Success Response (204 No Content):

All documents successfully deleted. No response body.

Error Handling

HTTP Status Codes:

Table 17. Status Codes
Status Code	Description
200	Request succeeded
202	Request accepted for asynchronous processing
204	Request succeeded with no content to return
422	Validation error in request parameters or body

Error Response Format (422 validation errors):

{
  "detail": [
    {
      "loc": ["body", "field_name"],
      "msg": "Error message",
      "type": "error_type"
    }
  ]
}

Usage Examples

Example 1: Digitize a Single PDF to JSON

Request:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
  -F "files=@document.pdf"

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}

Example 2: Ingest Multiple PDFs

Request:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=ingestion" \
  -F "files=@report1.pdf" \
  -F "files=@report2.pdf"

Response:

{
  "job_id": "660e8400-e29b-41d4-a716-446655440001"
}

Example 3: Check Job Status

Request:

curl -X GET http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed"
}

Best Practices

Job Tracking: Poll the GET /v1/jobs/{job_id} endpoint to track progress after creating a job.
Error Handling: Check the status field and error message for failed jobs.
Operation Selection: Use digitization for single PDF conversion, ingestion for batch processing.
Output Format: Choose json for structured data, md for readable text, txt for plain content.
Pagination: Use limit and offset parameters to manage large result sets efficiently.
Cleanup: Delete completed jobs using DELETE /v1/jobs/{job_id} to maintain clean records.
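
The cleanup practice can be sketched as follows: select the jobs whose status allows deletion (completed or failed, per the DELETE /v1/jobs/{job_id} section) and build one DELETE request per job. The helper names are illustrative, and the input is assumed to be the data array from a GET /v1/jobs response.

```python
import urllib.request

DELETABLE_STATUSES = {"completed", "failed"}


def deletable_job_ids(jobs: list) -> list:
    """Return the IDs of jobs that are in a deletable status."""
    return [job["job_id"] for job in jobs if job.get("status") in DELETABLE_STATUSES]


def delete_job_request(base_url: str, job_id: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE /v1/jobs/{job_id} request."""
    return urllib.request.Request(f"{base_url}/v1/jobs/{job_id}", method="DELETE")
```

Each request can be sent with urllib.request.urlopen, and a successful deletion returns 204 with no body. Deleting a job removes only the job record, so the associated documents remain available.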