Digitize Documents API

Complete API reference for the Digitize Documents Service that processes PDF documents for digitization and ingestion into vector databases for semantic search.

Overview

The Digitize Documents Service provides a REST API for document digitization and ingestion. This service converts PDF documents into searchable content and supports indexing into vector databases for semantic search capabilities.

API Endpoints:

  • GET /health — Health check
  • POST /v1/jobs — Create async job to upload and process documents
  • GET /v1/jobs — List all jobs
  • POST /v1/import - Import metadata into PostgreSQL
  • GET /v1/export - Export metadata from PostgreSQL
  • GET /v1/jobs/{job_id} — Get job by ID
  • DELETE /v1/jobs/{job_id} — Delete job
  • GET /v1/documents — List all documents
  • GET /v1/documents/{doc_id} — Get document metadata
  • GET /v1/documents/{doc_id}/content — Get document content
  • DELETE /v1/documents/{doc_id} — Delete document
  • DELETE /v1/documents — Bulk delete all documents (deprecated)

Note: To access API endpoints:
  • In the Endpoints Catalog UI: Click Digital Assistant and refer to Integration Endpoints.
  • Using CLI: Run the following command to retrieve the API endpoints and base URL:
    ai-services application info <appName> --runtime <podman|openshift>

GET /health

Check if the service is running and healthy. Used for liveness probes.

Tags: health

Request:

No parameters required.

Request Example:

curl -X GET http://10.20.188.184:4000/health

Success Response (200 OK):

Content-Type: application/json

Returns a JSON object with a status field indicating that the service is healthy.

Response Example:

{ "status": "ok" }
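
A liveness check against this endpoint could be scripted along the following lines. This is a minimal sketch: the helper name is_healthy is hypothetical, and it assumes that a healthy service always answers with status 200 and the body shown in the response example.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Return True when a /health response looks healthy.

    Assumption: healthy means HTTP 200 plus a JSON object whose
    "status" field equals "ok", as in the response example.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

The function takes the status code and raw body rather than performing the HTTP call itself, so it can be combined with any HTTP client.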

POST /v1/jobs

Upload PDF documents for processing. Supports two operation types:

  • ingestion: Process and index documents into a vector database for semantic search
  • digitization: Convert PDF to text/markdown/JSON format (single file only)

The operation runs asynchronously in the background. Use the returned job_id to track progress.

Tags: jobs

Content-Type: multipart/form-data

Table 1. Query Parameters
Parameter      Type    Required  Default    Description
operation      string  No        ingestion  Operation type: 'ingestion' (index into vector DB) or 'digitization' (convert to text/md/json)
output_format  string  No        json       Output format for digitization: 'json', 'md', or 'txt' (only applies to digitization operation)
job_name       string  No        null       Optional human-readable name for the job

Valid Values:

  • operation: ingestion, digitization
  • output_format: txt, md, json

Table 2. Request Body Parameters
Field  Type           Required  Description
files  array[binary]  Yes       PDF files to process (multiple for ingestion, single for digitization)

Request Example:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@document.pdf"

Success Response (202 Accepted):

Content-Type: application/json

Table 3. Response Fields
Field   Type    Description
job_id  string  Unique identifier for the created job

Response Example:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}
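
The parameter rules above (valid values, defaults, and the single-file restriction for digitization) can be checked on the client before uploading. The sketch below is illustrative only: build_job_url and its file_count argument are hypothetical names, and whether the service rejects the same inputs in the same way is an assumption.

```python
from urllib.parse import urlencode

VALID_OPERATIONS = {"ingestion", "digitization"}
VALID_OUTPUT_FORMATS = {"json", "md", "txt"}


def build_job_url(base_url, operation="ingestion", output_format="json",
                  job_name=None, file_count=1):
    """Validate the documented parameters and return the POST /v1/jobs URL."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    if file_count < 1:
        raise ValueError("at least one PDF file is required")
    if operation == "digitization" and file_count != 1:
        raise ValueError("digitization accepts a single file only")

    params = {"operation": operation}
    # output_format only applies to digitization, so it is omitted otherwise.
    if operation == "digitization":
        params["output_format"] = output_format
    if job_name is not None:
        params["job_name"] = job_name
    return f"{base_url.rstrip('/')}/v1/jobs?{urlencode(params)}"
```

The returned URL can then be used with any multipart-capable HTTP client, sending each PDF under the files field.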

GET /v1/jobs

Retrieve information about all submitted jobs with pagination and filtering options.

Tags: jobs

Table 4. Query Parameters
Parameter  Type     Required  Default  Description
latest     boolean  No        false    Return only the latest job
limit      integer  No        20       Number of records per page (min: 1, max: 100)
offset     integer  No        0        Number of records to skip (min: 0)
status     string   No        null     Filter by job status: accepted, in_progress, completed, failed
operation  string   No        null     Filter by operation type: ingestion, digitization

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/jobs?limit=10&status=completed"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "pagination": {
    "total": 45,
    "limit": 10,
    "offset": 0
  },
  "data": [
    {
      "job_id": "550e8400-e29b-41d4-a716-446655440000",
      "job_name": "Monthly Reports",
      "operation": "ingestion",
      "status": "completed"
    }
  ]
}
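
To walk through every job rather than a single page, a client can advance offset until the total reported in pagination is reached. The following is a sketch under assumptions: iter_pages and the fetch_page callback are hypothetical names, and fetch_page is expected to perform the actual GET request and return the parsed JSON body in the shape shown above.

```python
from typing import Callable, Dict, Iterator


def iter_pages(fetch_page: Callable[[int, int], Dict], limit: int = 20) -> Iterator[Dict]:
    """Yield every record by advancing offset until pagination.total is reached.

    fetch_page(limit, offset) must return a dict with "pagination" and "data".
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        records = page["data"]
        yield from records
        offset += len(records)
        # Stop on an empty page or once all reported records were seen.
        if not records or offset >= page["pagination"]["total"]:
            break
```

Because the HTTP call is injected, the same loop works for any list endpoint that returns this pagination shape.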

POST /v1/import

Import job and document metadata into PostgreSQL using the export-compatible JSON payload.

Table 5. Request Body Parameters
Field                      Type     Description
data                       object   Container for jobs and documents
data.jobs                  array    List of job objects
jobs[].job_id              string   Unique identifier for the job
jobs[].operation           string   Operation performed by the job
jobs[].status              string   Current status of the job
jobs[].job_name            string   Name assigned to the job
jobs[].submitted_at        string   Timestamp when the job was submitted
jobs[].completed_at        string   Timestamp when the job was completed
jobs[].stats               object   Statistical data related to the job
jobs[].error               string   Error message if the job failed
data.documents             array    List of document objects
documents[].id             string   Document identifier
documents[].job_id         string   Associated job identifier
documents[].name           string   Name of the document
documents[].type           string   Document type
documents[].status         string   Processing status of the document
documents[].output_format  string   Format of the output document
documents[].submitted_at   string   Submission timestamp
documents[].completed_at   string   Completion timestamp
documents[].error          string   Error message if processing failed
documents[].metadata       object   Additional metadata for the document
validate_only              boolean  If true, validates request without processing

Request Example:

curl -X 'POST' \
  'http://10.20.188.184:4000/v1/import' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "data": {
    "jobs": [
      {
        "job_id": "string",
        "operation": "string",
        "status": "string",
        "job_name": "string",
        "submitted_at": "string",
        "completed_at": "string",
        "stats": {
          "additionalProp1": 0,
          "additionalProp2": 0,
          "additionalProp3": 0
        },
        "error": "string"
      }
    ],
    "documents": [
      {
        "id": "string",
        "job_id": "string",
        "name": "string",
        "type": "string",
        "status": "string",
        "output_format": "string",
        "submitted_at": "string",
        "completed_at": "string",
        "error": "string",
        "metadata": {
          "additionalProp1": {}
        }
      }
    ]
  },
  "validate_only": false
}'

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "status": "completed",
  "summary": {
    "jobs": {
      "total_received": 1,
      "imported": 0,
      "skipped": 0,
      "failed": 1
    },
    "documents": {
      "total_received": 1,
      "imported": 0,
      "skipped": 0,
      "failed": 1
    }
  },
  "duration_seconds": 0.0061,
  "errors": [
    {
      "record_type": "job",
      "record_id": "string",
      "type": "validation_error",
      "message": "Invalid timestamp: Invalid isoformat string: 'string'"
    }
  ],
  "warnings": [
    {
      "record_type": "document",
      "record_id": "string",
      "type": "orphaned_document",
      "message": "Document references non-existent job_id: string"
    }
  ]
}
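
Note that in the example above the top-level status is "completed" even though every record failed, so a client probably should not rely on status alone. The sketch below shows one way to surface failures from the summary and errors fields; import_failures is a hypothetical helper name, and it assumes the response always follows the shape of the example.

```python
def import_failures(response: dict) -> list:
    """Return human-readable lines for every failure in an import response."""
    lines = []
    # Per-record-type counters, e.g. summary.jobs.failed.
    for record_type, counts in response.get("summary", {}).items():
        if counts.get("failed", 0) > 0:
            lines.append(
                f"{record_type}: {counts['failed']} of {counts['total_received']} failed"
            )
    # Individual error entries carry the record ID and the reason.
    for err in response.get("errors", []):
        lines.append(f"{err['record_type']} {err['record_id']}: {err['message']}")
    return lines
```

An empty list means no failures were reported; warnings such as orphaned_document are deliberately left out and can be inspected separately.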

Error Responses

Table 6. Error Responses
Status Condition Body
400 Bad Request - Invalid input or validation error
{
  "error": {
    "code": "INVALID_REQUEST",
    "message": "Request validation failed",
    "status": 400
  }
}
409 Conflict - Resource is locked or in use
{
  "error": {
    "code": "RESOURCE_LOCKED",
    "message": "Resource is locked by an active operation",
    "status": 409
  }
}
413 Payload Too Large - Input exceeds size limits
{
  "error": {
    "code": "CONTEXT_LIMIT_EXCEEDED",
    "message": "Input size exceeds maximum token limit",
    "status": 413
  }
}
422 Validation Error
{
  "detail": [
    {
      "loc": [
        "string",
        0
      ],
      "msg": "string",
      "type": "string",
      "input": "string",
      "ctx": {}
    }
  ]
}
500 Internal Server Error - Unexpected error occurred
{
  "error": {
    "code": "INTERNAL_SERVER_ERROR",
    "message": "An unexpected error occurred",
    "status": 500
  }
}

GET /v1/export

Export job and document metadata from PostgreSQL as JSON for backup and restore workflows.

Table 7. Query Parameters
Parameter  Type     Description
limit      integer  Maximum number of records to return in the response
offset     integer  Number of records to skip before starting to return results (used for pagination)

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "status": "completed",
  "data": {
    "jobs": [],
    "documents": []
  },
  "summary": {
    "jobs": {
      "total_exported": 0,
      "completed": 0,
      "failed": 0
    },
    "documents": {
      "total_exported": 0,
      "completed": 0,
      "failed": 0
    }
  },
  "export_timestamp": "2026-06-16T07:17:22.142225Z",
  "duration_seconds": 0.0028,
  "pagination": {
    "limit": 10000,
    "offset": 0,
    "has_more": false,
    "total_records": 0,
    "returned_records": 0
  }
}
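
For a backup and restore workflow, the export pages can be merged and wrapped in the payload that POST /v1/import expects. The following sketch rests on several assumptions: export_all, build_import_payload, and the fetch_export callback are hypothetical names, and advancing offset by limit after each page is a guess at how the service counts records, which should be confirmed against a real deployment.

```python
def export_all(fetch_export, limit=10000):
    """Collect all exported jobs and documents by following pagination.has_more.

    fetch_export(limit, offset) must return the parsed GET /v1/export body.
    Assumption: the next page starts at offset + limit.
    """
    data = {"jobs": [], "documents": []}
    offset = 0
    while True:
        page = fetch_export(limit, offset)
        data["jobs"].extend(page["data"]["jobs"])
        data["documents"].extend(page["data"]["documents"])
        if not page["pagination"]["has_more"]:
            return data
        offset += limit


def build_import_payload(data, validate_only=True):
    """Wrap exported data in the request body used by POST /v1/import."""
    return {"data": data, "validate_only": validate_only}
```

Defaulting validate_only to true lets the payload be checked first; a second call with validate_only set to false would then perform the actual import.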

Error Responses

Table 8. Error Responses
Status Condition Body
400 Bad Request - Invalid input or validation error
{
  "error": {
    "code": "INVALID_REQUEST",
    "message": "Request validation failed",
    "status": 400
  }
}
413 Payload Too Large - Input exceeds size limits
{
  "error": {
    "code": "CONTEXT_LIMIT_EXCEEDED",
    "message": "Input size exceeds maximum token limit",
    "status": 413
  }
}
422 Validation Error
{
  "detail": [
    {
      "loc": [
        "string",
        0
      ],
      "msg": "string",
      "type": "string",
      "input": "string",
      "ctx": {}
    }
  ]
}
500 Internal Server Error - Unexpected error occurred
{
  "error": {
    "code": "INTERNAL_SERVER_ERROR",
    "message": "An unexpected error occurred",
    "status": 500
  }
}

GET /v1/jobs/{job_id}

Retrieve detailed status and progress information for a specific job.

Tags: jobs

Table 9. Path Parameters
Parameter  Type    Required  Description
job_id     string  Yes       Unique identifier for the job

Request Example:

curl -X GET http://10.20.188.184:4000/v1/jobs/c355556c-f945-420a-9142-002bcff8fac8

Success Response (200 OK):

Content-Type: application/json

Returns detailed job information including document statuses and statistics.

Response Example:

{
  "job_id": "c355556c-f945-420a-9142-002bcff8fac8",
  "job_name": "AI - Services",
  "operation": "ingestion",
  "status": "in_progress",
  "submitted_at": "2026-03-27T16:37:47.736184Z",
  "completed_at": null,
  "documents": [
    {
      "id": "87878b3b-9b78-492d-b5aa-e83eb8310e25",
      "name": "AI-services-v020_03272026_112209.pdf",
      "status": "in_progress"
    }
  ],
  "stats": {
    "total_documents": 1,
    "completed": 0,
    "failed": 0,
    "in_progress": 1
  },
  "error": null
}
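
Since jobs run asynchronously, a client typically polls this endpoint until the job stops changing. The sketch below illustrates one polling loop; wait_for_job and the fetch_job callback are hypothetical names, the interval and timeout defaults are arbitrary, and treating completed and failed as the only terminal statuses is an assumption based on the status values listed in this document.

```python
import time

# Assumed terminal statuses; accepted and in_progress are treated as pending.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job(job_id) until the job reaches a terminal status.

    fetch_job must return the parsed GET /v1/jobs/{job_id} body.
    Raises TimeoutError if the job is still pending after `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(interval)
```

The caller should still inspect the returned status and error fields, because a failed job is returned rather than raised.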

DELETE /v1/jobs/{job_id}

Delete a job status record. Only completed or failed jobs can be deleted.

Note: This only deletes the job record, not the associated document data.

Tags: jobs

Table 10. Path Parameters
Parameter  Type    Required  Description
job_id     string  Yes       Unique identifier for the job

Request Example:

curl -X DELETE http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000

Success Response (204 No Content):

Job successfully deleted. No response body.
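
Because only completed or failed jobs can be deleted, a cleanup script may want to filter a job list before issuing DELETE requests. A minimal sketch, assuming job records shaped like the entries returned by GET /v1/jobs; deletable_job_urls is a hypothetical helper name.

```python
DELETABLE_STATUSES = {"completed", "failed"}


def deletable_job_urls(base_url, jobs):
    """Return DELETE URLs for the jobs whose status allows deletion."""
    root = base_url.rstrip("/")
    return [
        f"{root}/v1/jobs/{job['job_id']}"
        for job in jobs
        if job.get("status") in DELETABLE_STATUSES
    ]
```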

GET /v1/documents

Get high-level information about all documents, with pagination and filtering. Documents are sorted by submission time (newest first).

Tags: documents

Table 11. Query Parameters
Parameter  Type     Required  Default  Description
limit      integer  No        20       Number of records to return per page (min: 1, max: 100)
offset     integer  No        0        Number of records to skip (min: 0)
status     string   No        null     Filter by status: accepted, in_progress, completed, failed
name       string   No        null     Filter by document name (partial match, case-insensitive)

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/documents?limit=50&status=completed"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "pagination": {
    "total": 150,
    "limit": 50,
    "offset": 0
  },
  "data": [
    {
      "id": "doc-123",
      "name": "report.pdf",
      "type": "pdf",
      "status": "completed"
    }
  ]
}
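
The documented bounds for these query parameters can be enforced before the request is sent. This is a sketch only: documents_query is a hypothetical helper, and the client-side checks mirror the limits in the parameter table rather than any confirmed server behavior.

```python
from urllib.parse import urlencode

DOCUMENT_STATUSES = {"accepted", "in_progress", "completed", "failed"}


def documents_query(limit=20, offset=0, status=None, name=None):
    """Return the query string for GET /v1/documents after checking bounds."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if offset < 0:
        raise ValueError("offset must be 0 or greater")
    if status is not None and status not in DOCUMENT_STATUSES:
        raise ValueError(f"unsupported status: {status!r}")
    params = {"limit": limit, "offset": offset}
    # Optional filters are only sent when set.
    if status is not None:
        params["status"] = status
    if name:
        params["name"] = name
    return urlencode(params)
```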

GET /v1/documents/{doc_id}

Retrieve detailed metadata for a specific document by its ID.

Tags: documents

Table 12. Path Parameters
Parameter  Type    Required  Description
doc_id     string  Yes       Unique identifier for the document

Table 13. Query Parameters
Parameter  Type     Required  Default  Description
details    boolean  No        false    Include detailed metadata (pages, tables, timing)

Request Example:

curl -X GET "http://10.20.188.184:4000/v1/documents/doc-123?details=true"

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "id": "doc-123",
  "name": "report.pdf",
  "type": "pdf",
  "status": "completed",
  "output_format": "json",
  "metadata": {
    "pages": 15,
    "tables": 3
  }
}

GET /v1/documents/{doc_id}/content

Retrieve the digitized/processed content of a document.

Tags: documents

Table 14. Path Parameters
Parameter  Type    Required  Description
doc_id     string  Yes       Unique identifier for the document

Request Example:

curl -X GET http://10.20.188.184:4000/v1/documents/doc-123/content

Success Response (200 OK):

Content-Type: application/json

Response Example:

{
  "result": {
    "title": "Quarterly Report",
    "content": "This report summarizes..."
  },
  "output_format": "json"
}

DELETE /v1/documents/{doc_id}

Delete a single document by ID.

Tags: documents

Table 15. Path Parameters
Parameter  Type    Required  Description
doc_id     string  Yes       Unique identifier for the document

Request Example:

curl -X DELETE http://10.20.188.184:4000/v1/documents/doc-123

Success Response (204 No Content):

Document successfully deleted. No response body.

DELETE /v1/documents (deprecated)

⚠️ DANGER: Delete ALL documents from the system.

Tags: documents

Table 16. Query Parameters
Parameter  Type     Required  Description
confirm    boolean  Yes       Must be true to proceed with bulk deletion

Request Example:

curl -X DELETE "http://10.20.188.184:4000/v1/documents?confirm=true"

Success Response (204 No Content):

All documents successfully deleted. No response body.

Error Handling

HTTP Status Codes:

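Table 17. Status Codes
Status Code  Description
200          Request succeeded
202          Request accepted for asynchronous processing
204          Request succeeded with no content to return
400          Invalid input or validation error
409          Resource is locked or in use
413          Input exceeds size limits
422          Validation error in request parameters or body
500          Unexpected error occurred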

Error Response Format (422 validation errors):

{
  "detail": [
    {
      "loc": ["body", "field_name"],
      "msg": "Error message",
      "type": "error_type"
    }
  ]
}
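
This document shows two error body shapes: the detail list above for 422 responses, and the error object with code, message, and status used in the import and export error tables. A client can normalize both into readable lines, as in the sketch below; describe_error is a hypothetical helper name, and the assumption is that no other error shapes occur.

```python
def describe_error(body: dict) -> list:
    """Flatten either documented error shape into readable lines."""
    if "detail" in body:
        # 422 validation errors: one entry per offending field.
        return [
            f"{'.'.join(str(part) for part in item['loc'])}: {item['msg']}"
            for item in body["detail"]
        ]
    if "error" in body:
        # Error object with code, message, and status.
        err = body["error"]
        return [f"{err['code']} ({err['status']}): {err['message']}"]
    return ["unrecognized error body"]
```

Joining the loc entries with dots gives a compact path to the field, for example body.field_name for the body shown above.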

Usage Examples

Example 1: Digitize a Single PDF to JSON

Request:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
  -F "files=@document.pdf"

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000"
}

Example 2: Ingest Multiple PDFs

Request:

curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=ingestion" \
  -F "files=@report1.pdf" \
  -F "files=@report2.pdf"

Response:

{
  "job_id": "660e8400-e29b-41d4-a716-446655440001"
}

Example 3: Check Job Status

Request:

curl -X GET http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed"
}

Best Practices

  1. Job Tracking: Poll the GET /v1/jobs/{job_id} endpoint to track progress after creating a job.
  2. Error Handling: Check the status field and error message for failed jobs.
  3. Operation Selection: Use digitization for single PDF conversion, ingestion for batch processing.
  4. Output Format: Choose json for structured data, md for readable text, txt for plain content.
  5. Pagination: Use limit and offset parameters to manage large result sets efficiently.
  6. Cleanup: Delete completed jobs using DELETE /v1/jobs/{job_id} to maintain clean records.