Digitize Documents API
Complete API reference for the Digitize Documents Service that processes PDF documents for digitization and ingestion into vector databases for semantic search.
Overview
The Digitize Documents Service provides a REST API for document digitization and ingestion. This service converts PDF documents into searchable content and supports indexing into vector databases for semantic search capabilities.
API Endpoints:
- GET /health — Health check
- POST /v1/jobs — Create async job to upload and process documents
- GET /v1/jobs — List all jobs
- POST /v1/import - Import metadata into PostgreSQL
- GET /v1/export - Export metadata from PostgreSQL
- GET /v1/jobs/{job_id} — Get job by ID
- DELETE /v1/jobs/{job_id} — Delete job
- GET /v1/documents — List all documents
- GET /v1/documents/{doc_id} — Get document metadata
- GET /v1/documents/{doc_id}/content — Get document content
- DELETE /v1/documents/{doc_id} — Delete document
- DELETE /v1/documents — Bulk delete all documents (deprecated)
To find the API endpoints and base URL for your deployment:
- In the Endpoints Catalog UI: Click Digital Assistant and refer to Integration Endpoints.
- Using the CLI: Run the following command.
ai-services application info <appName> --runtime <podman|openshift>
GET /health
Check if the service is running and healthy. Used for liveness probes.
Tags: health
Request:
No parameters required.
Request Example:
curl -X GET http://10.20.188.184:4000/health
Success Response (200 OK):
Content-Type: application/json
Returns a JSON object indicating the service is healthy.
Response Example:
{ "status": "ok" }
POST /v1/jobs
Upload PDF documents for processing. Supports two operation types:
- ingestion: Process and index documents into vector database for semantic search
- digitization: Convert PDF to text/markdown/JSON format (single file only)
The operation runs asynchronously in the background. Use the returned job_id to track progress.
Tags: jobs
Content-Type: multipart/form-data
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| operation | string | No | ingestion | Operation type: 'ingestion' (index into vector DB) or 'digitization' (convert to text/md/json) |
| output_format | string | No | json | Output format for digitization: 'json', 'md', or 'txt' (only applies to digitization operation) |
| job_name | string | No | null | Optional human-readable name for the job |
Valid Values:
- operation: ingestion, digitization
- output_format: txt, md, json
| Field | Type | Required | Description |
|---|---|---|---|
| files | array[binary] | Yes | PDF files to process (multiple for ingestion, single for digitization) |
Request Example:
curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
-H "Content-Type: multipart/form-data" \
-F "files=@document.pdf"
Success Response (202 Accepted):
Content-Type: application/json
| Field | Type | Description |
|---|---|---|
| job_id | string | Unique identifier for the created job |
Response Example:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000"
}
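The parameter rules above can be checked on the client before uploading anything. The following is a minimal sketch, assuming the parameters are sent in the query string as in the curl example; the helper names `build_job_url` and `check_file_count` are hypothetical and not part of the service.

```python
from urllib.parse import urlencode

VALID_OPERATIONS = {"ingestion", "digitization"}
VALID_OUTPUT_FORMATS = {"json", "md", "txt"}


def build_job_url(base_url, operation="ingestion", output_format="json", job_name=None):
    """Return the POST /v1/jobs URL with validated query parameters."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    if output_format not in VALID_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    params = {"operation": operation}
    # output_format only applies to digitization, so it is omitted otherwise.
    if operation == "digitization":
        params["output_format"] = output_format
    if job_name is not None:
        params["job_name"] = job_name
    return f"{base_url.rstrip('/')}/v1/jobs?{urlencode(params)}"


def check_file_count(operation, file_paths):
    """Digitization accepts exactly one PDF; ingestion accepts one or more."""
    if not file_paths:
        raise ValueError("at least one PDF file is required")
    if operation == "digitization" and len(file_paths) != 1:
        raise ValueError("digitization accepts a single file only")
```

Calling `check_file_count` before the upload catches a multi-file digitization request locally instead of waiting for the service to reject it.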
GET /v1/jobs
Retrieve information about all submitted jobs with pagination and filtering options.
Tags: jobs
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| latest | boolean | No | false | Return only the latest job |
| limit | integer | No | 20 | Number of records per page (min: 1, max: 100) |
| offset | integer | No | 0 | Number of records to skip (min: 0) |
| status | string | No | null | Filter by job status: accepted, in_progress, completed, failed |
| operation | string | No | null | Filter by operation type: ingestion, digitization |
Request Example:
curl -X GET "http://10.20.188.184:4000/v1/jobs?limit=10&status=completed"
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"pagination": {
"total": 45,
"limit": 10,
"offset": 0
},
"data": [
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"job_name": "Monthly Reports",
"operation": "ingestion",
"status": "completed"
}
]
}
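Because each response carries only one page, listing every job means walking `offset` forward until `pagination.total` is reached. A minimal sketch follows; `fetch_page` is a hypothetical caller-supplied function that performs GET /v1/jobs with the given query parameters and returns the parsed JSON body, so the HTTP client is left open.

```python
def iter_jobs(fetch_page, limit=20, **filters):
    """Yield every job by walking limit/offset pages.

    fetch_page(params) is supplied by the caller (hypothetical helper): it
    sends GET /v1/jobs with `params` and returns the decoded JSON response.
    """
    offset = 0
    while True:
        page = fetch_page({"limit": limit, "offset": offset, **filters})
        items = page["data"]
        yield from items
        offset += len(items)
        # Stop when the page is empty or the reported total has been reached.
        if not items or offset >= page["pagination"]["total"]:
            break
```

Filters such as `status="completed"` are passed through unchanged, for example `iter_jobs(fetch_page, limit=100, status="completed")`.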
POST /v1/import
Import job and document metadata into PostgreSQL using the export-compatible JSON payload.
| Parameter | Type | Description |
|---|---|---|
| data | object | Container for jobs and documents |
| data.jobs | array | List of job objects |
| jobs[].job_id | string | Unique identifier for the job |
| jobs[].operation | string | Operation performed by the job |
| jobs[].status | string | Current status of the job |
| jobs[].job_name | string | Name assigned to the job |
| jobs[].submitted_at | string | Timestamp when the job was submitted |
| jobs[].completed_at | string | Timestamp when the job was completed |
| jobs[].stats | object | Statistical data related to the job |
| jobs[].error | string | Error message if the job failed |
| data.documents | array | List of document objects |
| documents[].id | string | Document identifier |
| documents[].job_id | string | Associated job identifier |
| documents[].name | string | Name of the document |
| documents[].type | string | Document type |
| documents[].status | string | Processing status of the document |
| documents[].output_format | string | Format of the output document |
| documents[].submitted_at | string | Submission timestamp |
| documents[].completed_at | string | Completion timestamp |
| documents[].error | string | Error message if processing failed |
| documents[].metadata | object | Additional metadata for the document |
| validate_only | boolean | If true, validates request without processing |
Request Example:
curl -X 'POST' \
'http://10.20.188.184:4000/v1/import' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"data": {
"jobs": [
{
"job_id": "string",
"operation": "string",
"status": "string",
"job_name": "string",
"submitted_at": "string",
"completed_at": "string",
"stats": {
"additionalProp1": 0,
"additionalProp2": 0,
"additionalProp3": 0
},
"error": "string"
}
],
"documents": [
{
"id": "string",
"job_id": "string",
"name": "string",
"type": "string",
"status": "string",
"output_format": "string",
"submitted_at": "string",
"completed_at": "string",
"error": "string",
"metadata": {
"additionalProp1": {}
}
}
]
},
"validate_only": false
}'
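The placeholder values in the request above are not valid timestamps, and the sample response below reports them as "Invalid isoformat string". A payload can be pre-checked locally with a sketch like the following. It assumes that `submitted_at` and `completed_at` must be ISO 8601 strings and that a null value is acceptable; both are assumptions, and the helper names are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FIELDS = ("submitted_at", "completed_at")


def is_iso_timestamp(value):
    """Return True if value parses as an ISO 8601 timestamp."""
    if not isinstance(value, str):
        return False
    try:
        # Older Python versions reject a trailing "Z", so map it to a UTC offset.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def find_bad_timestamps(payload):
    """Return (record_type, record_id, field) for each unparseable, non-null timestamp."""
    problems = []
    data = payload.get("data", {})
    for record_type, key, id_field in (("job", "jobs", "job_id"), ("document", "documents", "id")):
        for record in data.get(key, []):
            for field in TIMESTAMP_FIELDS:
                value = record.get(field)
                if value is not None and not is_iso_timestamp(value):
                    problems.append((record_type, record.get(id_field), field))
    return problems
```

An empty result does not guarantee that the import succeeds; sending the payload with `validate_only` set to true remains the authoritative check.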
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"status": "completed",
"summary": {
"jobs": {
"total_received": 1,
"imported": 0,
"skipped": 0,
"failed": 1
},
"documents": {
"total_received": 1,
"imported": 0,
"skipped": 0,
"failed": 1
}
},
"duration_seconds": 0.0061,
"errors": [
{
"record_type": "job",
"record_id": "string",
"type": "validation_error",
"message": "Invalid timestamp: Invalid isoformat string: 'string'"
}
],
"warnings": [
{
"record_type": "document",
"record_id": "string",
"type": "orphaned_document",
"message": "Document references non-existent job_id: string"
}
]
}
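Note that the sample response above is returned with 200 OK even though both records failed, so the status code alone does not show whether anything was imported. A minimal sketch of reading the summary, using only the fields shown in the example (the function name `summarize_import` is hypothetical):

```python
def summarize_import(response):
    """Collect failure counts and readable messages from a /v1/import response."""
    summary = response.get("summary", {})
    failed = {kind: summary.get(kind, {}).get("failed", 0) for kind in ("jobs", "documents")}
    messages = [
        f"{level} [{entry.get('record_type')} {entry.get('record_id')}] "
        f"{entry.get('type')}: {entry.get('message')}"
        for level, key in (("ERROR", "errors"), ("WARNING", "warnings"))
        for entry in response.get(key, [])
    ]
    # "ok" is True only when no job and no document failed to import.
    return {"ok": not any(failed.values()), "failed": failed, "messages": messages}
```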
Error Responses
| Status | Condition |
|---|---|
| 400 | Bad Request - Invalid input or validation error |
| 409 | Conflict - Resource is locked or in use |
| 413 | Payload Too Large - Input exceeds size limits |
| 422 | Validation Error |
| 500 | Internal Server Error - Unexpected error occurred |
GET /v1/export
Export job and document metadata from PostgreSQL as JSON for backup and restore workflows.
| Parameter | Type | Description |
|---|---|---|
| limit | integer | Maximum number of records to return in the response |
| offset | integer | Number of records to skip before starting to return results (used for pagination) |
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"status": "completed",
"data": {
"jobs": [],
"documents": []
},
"summary": {
"jobs": {
"total_exported": 0,
"completed": 0,
"failed": 0
},
"documents": {
"total_exported": 0,
"completed": 0,
"failed": 0
}
},
"export_timestamp": "2026-06-16T07:17:22.142225Z",
"duration_seconds": 0.0028,
"pagination": {
"limit": 10000,
"offset": 0,
"has_more": false,
"total_records": 0,
"returned_records": 0
}
}
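For a backup that spans more than one page, the `has_more` flag in the response can drive a loop. The sketch below is written under assumptions: `fetch_export` is a hypothetical caller-supplied function that performs GET /v1/export with the given query parameters and returns the parsed JSON, and advancing `offset` by `limit` is assumed to be the correct way to request the next page.

```python
def export_all(fetch_export, limit=1000):
    """Collect all exported jobs and documents across pages.

    fetch_export(params) is supplied by the caller (hypothetical helper): it
    sends GET /v1/export with `params` and returns the decoded JSON response.
    """
    jobs, documents = [], []
    offset = 0
    while True:
        page = fetch_export({"limit": limit, "offset": offset})
        jobs.extend(page["data"]["jobs"])
        documents.extend(page["data"]["documents"])
        pagination = page["pagination"]
        # Stop on the last page, or if a page unexpectedly comes back empty.
        if not pagination["has_more"] or pagination["returned_records"] == 0:
            break
        offset += limit  # assumption: the next page starts one full page later
    return {"jobs": jobs, "documents": documents}
```

The returned dictionary has the same shape as the `data` object accepted by POST /v1/import, which fits the backup and restore workflow described above.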
Error Responses
| Status | Condition |
|---|---|
| 400 | Bad Request - Invalid input or validation error |
| 413 | Payload Too Large - Input exceeds size limits |
| 422 | Validation Error |
| 500 | Internal Server Error - Unexpected error occurred |
GET /v1/jobs/{job_id}
Retrieve detailed status and progress information for a specific job.
Tags: jobs
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Unique identifier for the job |
Request Example:
curl -X GET http://10.20.188.184:4000/v1/jobs/c355556c-f945-420a-9142-002bcff8fac8
Success Response (200 OK):
Content-Type: application/json
Returns detailed job information including document statuses and statistics.
Response Example:
{
"job_id": "c355556c-f945-420a-9142-002bcff8fac8",
"job_name": "AI - Services",
"operation": "ingestion",
"status": "in_progress",
"submitted_at": "2026-03-27T16:37:47.736184Z",
"completed_at": null,
"documents": [
{
"id": "87878b3b-9b78-492d-b5aa-e83eb8310e25",
"name": "AI-services-v020_03272026_112209.pdf",
"status": "in_progress"
}
],
"stats": {
"total_documents": 1,
"completed": 0,
"failed": 0,
"in_progress": 1
},
"error": null
}
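Since jobs run asynchronously, a client typically polls this endpoint until the status becomes `completed` or `failed`. A minimal polling sketch follows; `fetch_job` is a hypothetical caller-supplied function that performs GET /v1/jobs/{job_id} and returns the parsed JSON, and the interval and timeout defaults are arbitrary illustrative values.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it is completed or failed, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r} after {timeout}s")
        # Pause between requests so the service is not polled in a tight loop.
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays. After the function returns, inspect `status` and `error` to tell success from failure.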
DELETE /v1/jobs/{job_id}
Delete a job status record. Only completed or failed jobs can be deleted.
Note: This only deletes the job record, not the associated document data.
Tags: jobs
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Yes | Unique identifier for the job |
Request Example:
curl -X DELETE http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000
Success Response (204 No Content):
Job successfully deleted. No response body.
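Because only completed or failed jobs can be deleted, a cleanup script can filter a job listing before issuing any DELETE requests. A small sketch, assuming job objects shaped like the entries returned by GET /v1/jobs (the function name is hypothetical):

```python
DELETABLE_STATUSES = {"completed", "failed"}


def deletable_job_ids(jobs):
    """Return the IDs of jobs whose status allows deletion."""
    return [job["job_id"] for job in jobs if job.get("status") in DELETABLE_STATUSES]
```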
GET /v1/documents
Retrieve summary information for all documents, with pagination and filtering. Documents are sorted by submission time (newest first).
Tags: documents
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| limit | integer | No | 20 | Number of records to return per page (min: 1, max: 100) |
| offset | integer | No | 0 | Number of records to skip (min: 0) |
| status | string | No | null | Filter by status: accepted, in_progress, completed, failed |
| name | string | No | null | Filter by document name (partial match, case-insensitive) |
Request Example:
curl -X GET "http://10.20.188.184:4000/v1/documents?limit=50&status=completed"
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"pagination": {
"total": 150,
"limit": 50,
"offset": 0
},
"data": [
{
"id": "doc-123",
"name": "report.pdf",
"type": "pdf",
"status": "completed"
}
]
}
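The query string for this endpoint can be assembled programmatically so that unset filters are left out and the documented bounds are enforced before the request is sent. A minimal sketch; the function name `documents_url` is hypothetical.

```python
from urllib.parse import urlencode


def documents_url(base_url, limit=20, offset=0, status=None, name=None):
    """Build the GET /v1/documents URL, omitting filters that are not set."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if offset < 0:
        raise ValueError("offset must be 0 or greater")
    params = {"limit": limit, "offset": offset, "status": status, "name": name}
    # urlencode percent-encodes values, so names with spaces are safe to pass.
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{base_url.rstrip('/')}/v1/documents?{query}"
```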
GET /v1/documents/{doc_id}
Retrieve detailed metadata for a specific document by its ID.
Tags: documents
Path parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| doc_id | string | Yes | Unique identifier for the document |
Query parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| details | boolean | No | false | Include detailed metadata (pages, tables, timing) |
Request Example:
curl -X GET "http://10.20.188.184:4000/v1/documents/doc-123?details=true"
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"id": "doc-123",
"name": "report.pdf",
"type": "pdf",
"status": "completed",
"output_format": "json",
"metadata": {
"pages": 15,
"tables": 3
}
}
GET /v1/documents/{doc_id}/content
Retrieve the digitized/processed content of a document.
Tags: documents
| Parameter | Type | Required | Description |
|---|---|---|---|
| doc_id | string | Yes | Unique identifier for the document |
Request Example:
curl -X GET http://10.20.188.184:4000/v1/documents/doc-123/content
Success Response (200 OK):
Content-Type: application/json
Response Example:
{
"result": {
"title": "Quarterly Report",
"content": "This report summarizes..."
},
"output_format": "json"
}
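The example above shows `result` as an object for the json output format. The shape of `result` for md and txt is not shown here; the sketch below assumes it is a plain string in those cases and handles both, which should be verified against a real response. The function name is hypothetical.

```python
import json


def content_as_text(response):
    """Return the document content as a string, whatever the output format."""
    result = response["result"]
    if isinstance(result, str):
        # Assumption: md and txt content arrives as a plain string.
        return result
    # Structured (json) content is serialized for saving or display.
    return json.dumps(result, indent=2, ensure_ascii=False)
```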
DELETE /v1/documents/{doc_id}
Delete a single document by ID.
Tags: documents
| Parameter | Type | Required | Description |
|---|---|---|---|
| doc_id | string | Yes | Unique identifier for the document |
Request Example:
curl -X DELETE http://10.20.188.184:4000/v1/documents/doc-123
Success Response (204 No Content):
Document successfully deleted. No response body.
DELETE /v1/documents (deprecated)
⚠️ DANGER: Delete ALL documents from the system.
Tags: documents
| Parameter | Type | Required | Description |
|---|---|---|---|
| confirm | boolean | Yes | Must be true to proceed with bulk deletion |
Request Example:
curl -X DELETE "http://10.20.188.184:4000/v1/documents?confirm=true"
Success Response (204 No Content):
All documents successfully deleted. No response body.
Error Handling
HTTP Status Codes:
| Status Code | Description |
|---|---|
| 200 | Request succeeded |
| 202 | Request accepted for asynchronous processing |
| 204 | Request succeeded with no content to return |
| 422 | Validation error in request parameters or body |
Error Response Format:
{
"detail": [
{
"loc": ["body", "field_name"],
"msg": "Error message",
"type": "error_type"
}
]
}
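A client can flatten this structure into one readable line per problem. The following minimal sketch is based only on the format shown above; the handling of a plain-string `detail` is a defensive assumption, and the function name is hypothetical.

```python
def format_validation_errors(body):
    """Turn an error response body into a list of readable messages."""
    detail = body.get("detail", [])
    if isinstance(detail, str):
        # Assumption: some errors may carry a plain message instead of a list.
        return [detail]
    lines = []
    for item in detail:
        # "loc" is a path such as ["body", "field_name"]; join it with dots.
        location = ".".join(str(part) for part in item.get("loc", []))
        lines.append(f"{location}: {item.get('msg')} ({item.get('type')})")
    return lines
```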
Usage Examples
Example 1: Digitize a Single PDF to JSON
Request:
curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=digitization&output_format=json" \
-F "files=@document.pdf"
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000"
}
Example 2: Ingest Multiple PDFs
Request:
curl -X POST "http://10.20.188.184:4000/v1/jobs?operation=ingestion" \
-F "files=@report1.pdf" \
-F "files=@report2.pdf"
Response:
{
"job_id": "660e8400-e29b-41d4-a716-446655440001"
}
Example 3: Check Job Status
Request:
curl -X GET http://10.20.188.184:4000/v1/jobs/550e8400-e29b-41d4-a716-446655440000
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed"
}
Best Practices
- Job Tracking: Poll the GET /v1/jobs/{job_id} endpoint to track progress after creating a job.
- Error Handling: Check the status field and error message for failed jobs.
- Operation Selection: Use digitization for single PDF conversion, ingestion for batch processing.
- Output Format: Choose json for structured data, md for readable text, txt for plain content.
- Pagination: Use limit and offset parameters to manage large result sets efficiently.
- Cleanup: Delete completed jobs using DELETE /v1/jobs/{job_id} to maintain clean records.