Question and Answer API
Complete API reference for the AI‑Services Question and Answer API that uses ingested documents to deliver grounded, context‑aware answers for Question and Answer chatbot experiences.
Overview
The AI‑Services Question and Answer API enables conversational query processing over ingested documents and returns grounded answers with citation metadata. It supports both streaming and non‑streaming requests and provides detailed response metadata, including retrieval context, ranking scores, and processing time.
API Endpoints:
- GET /health — Server health
- GET /v1/models — List available models
- POST /v1/chat/completions — Create a chat completion (supports streaming)
- GET /v1/perf_metrics — Get performance metrics
- GET /db-status — Vector database ingestion readiness
To find the API endpoints and base URL for your deployment:
- In the Endpoints Catalog UI: Click Digital Assistant and refer to Integration Endpoints.
- Using the CLI: Run the following command to retrieve the API endpoints and base URL.
ai-services application info <appName> --runtime <podman|openshift>
GET /health
Check if the service is running and healthy.
curl -v http://10.20.188.184:5000/health
- Status Code: 200
{"status":"ok"}
GET /v1/models
Returns the list of models served by vLLM.
curl http://10.20.188.184:5000/v1/models
{
  "data": [
    {
      "created": 1764589600,
      "id": "ibm-granite/granite-3.3-8b-instruct",
      "max_model_len": 32768,
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1764589600,
          "group": null,
          "id": "modelperm-678e29f39bc04e2f994faf93dddc7c64",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "/models/ibm-granite/granite-3.3-8b-instruct"
    }
  ],
  "object": "list"
}
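A client usually needs only the model identifiers from this response, since the `id` value is what the `model` field of a chat completion request expects. The following is a minimal sketch, assuming the response shape shown above; the function name `model_ids` is hypothetical and not part of the service.

```python
import json


def model_ids(models_response: dict) -> list:
    """Return the `id` of every entry in the `data` array of a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Trimmed copy of the sample response above (only the fields used here).
sample_models = json.loads(
    '{"object": "list", "data": ['
    '{"id": "ibm-granite/granite-3.3-8b-instruct", "max_model_len": 32768}]}'
)

print(model_ids(sample_models))
```

A response without a `data` array yields an empty list rather than an error, which keeps the caller's handling simple.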
GET /db-status
Indicates whether vector data ingestion is complete.
curl http://10.20.188.184:5000/db-status
- Ingestion complete - Status Code: 200
{"ready": true}
- Ingestion not done - Status Code: 200
{"ready": false, "message": "No data ingested"}
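Because both cases return status code 200, a client has to inspect the `ready` field in the body rather than rely on the HTTP status. The sketch below polls the endpoint until ingestion is reported complete. The helper names, the retry count, and the delay are illustrative assumptions, and the base URL is the example host used on this page.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://10.20.188.184:5000"  # example host from this page; replace with your own


def fetch_db_status(base_url: str = BASE_URL) -> dict:
    """GET /db-status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/db-status", timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(fetch: Callable[[], dict] = fetch_db_status,
                     attempts: int = 30,
                     delay_s: float = 2.0) -> bool:
    """Poll until the body reports ready=true; return False if attempts run out."""
    for attempt in range(attempts):
        if fetch().get("ready") is True:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # wait before the next poll
    return False
```

Passing the fetch function as a parameter makes the polling logic easy to exercise without a running service, for example `wait_until_ready(lambda: {"ready": True})`.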
POST /v1/chat/completions
Generates a completion for a chat-style prompt. Supports streaming via Server-Sent Events (SSE).
Content-Type: application/json
Accept: application/json
Optional:
X-Request-ID: string
If not passed, a request ID is allocated randomly. The ID can be used to fetch performance metrics using GET /v1/perf_metrics.
{
  "messages": [{"role": "user", "content": "Add prompt here"}],
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "max_tokens": 512,
  "temperature": 0.0,
  "repetition_penalty": 1.1,
  "stop": ["Context:", "Question:", "\nContext:", "\nAnswer:", "\nQuestion:", "Answer:"],
  "stream": true
}
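Supplying your own X-Request-ID makes it possible to look up the metrics for that exact request later. The following is a rough sketch of building such a request with the Python standard library. The function name `build_chat_request` is hypothetical, using a UUID as the ID is an assumption (the header is documented only as a string), and the base URL is the example host used on this page.

```python
import json
import urllib.request
import uuid
from typing import Optional

BASE_URL = "http://10.20.188.184:5000"  # example host from this page; replace with your own


def build_chat_request(prompt: str,
                       stream: bool = False,
                       request_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/chat/completions request."""
    # Assumption: any unique string is acceptable; a UUID is one convenient choice.
    request_id = request_id or str(uuid.uuid4())
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "model": "ibm-granite/granite-3.3-8b-instruct",
        "max_tokens": 512,
        "temperature": 0.0,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Request-ID": request_id,
        },
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it. Keeping the request ID on the caller's side means it can be reused as the `request_id` query parameter of GET /v1/perf_metrics.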
- Streaming:
curl http://10.20.188.184:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "list power11 features"}],
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "max_tokens": 512,
    "temperature": 0.0,
    "stream": true
  }'
- Non-streaming:
curl http://10.20.188.184:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "list power11 features"}],
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "max_tokens": 512,
    "temperature": 0.0,
    "stream": false
  }'
- Streaming:
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":"Based"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" on"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" the"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
. . .
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" secure"},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}]}
data: [DONE]
- Non-streaming:
{
  "id": "chatcmpl-29a69927f74d4820936f702f98666d4b",
  "object": "chat.completion",
  "created": 1764589675,
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on the provided context, here are some .... Power11 as a robust, efficient, and secure",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1099,
    "total_tokens": 1611,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
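In the streaming response earlier in this section, each `data:` line carries one chunk, the generated text arrives piece by piece in `choices[0].delta.content`, and the literal `[DONE]` marks the end of the stream. The sketch below reassembles the answer from such lines. The function name `iter_stream_content` is hypothetical, and the sample lines are trimmed copies of the chunks shown above.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments found in the delta.content field of each SSE chunk."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # the first chunk carries the role and an empty content string
                yield text


# Trimmed copies of the sample chunks above.
sample_stream = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Based"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" on"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_stream_content(sample_stream)))  # prints "Based on"
```

The function accepts any iterable of text lines, so it can be fed from a decoded HTTP response body as well as from a saved transcript.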
GET /v1/perf_metrics
Retrieve performance metrics for recent chat completion and retrieval operations. Optionally filter by request ID to get metrics for a specific request.
request_id (optional) - Unique request identifier used to filter metrics
- Get all metrics:
curl http://10.20.188.184:5000/v1/perf_metrics
- Get metrics for a specific request:
curl "http://10.20.188.184:5000/v1/perf_metrics?request_id=550e8400-e29b-41d4-a716-446655440000"
{
  "metrics": [
    {
      "timestamp": 1678901234.567,
      "readable_timestamp": "2023-03-15 14:30:34",
      "request_id": "550e8400-e29b-41d4-a716-446655440000",
      "retrieve_time": 0.15,
      "rerank_time": 0.12,
      "inference_time": 1.25,
      "prompt_tokens": 500,
      "completion_tokens": 150,
      "token_latencies": [0.01, 0.012, 0.011, 0.013]
    }
  ]
}
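A metrics entry can be reduced to a few headline figures for logging or dashboards. The sketch below is one possible reduction and rests on two assumptions that this page does not confirm: that the three stage timings do not overlap, so their sum approximates end-to-end processing time, and that all timings share one unit (likely seconds). The function name `summarize_metric` and the output keys are hypothetical.

```python
from statistics import mean


def summarize_metric(entry: dict) -> dict:
    """Reduce one entry of the `metrics` array to a few derived figures."""
    latencies = entry.get("token_latencies") or []
    inference = entry["inference_time"]
    return {
        "request_id": entry["request_id"],
        # Assumption: retrieval, reranking and inference run one after another.
        "staged_time": entry["retrieve_time"] + entry["rerank_time"] + inference,
        "mean_token_latency": mean(latencies) if latencies else None,
        # Completion tokens generated per unit of inference time.
        "tokens_per_time_unit": entry["completion_tokens"] / inference if inference else None,
    }


# The sample entry from the response above.
sample_entry = {
    "timestamp": 1678901234.567,
    "readable_timestamp": "2023-03-15 14:30:34",
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "retrieve_time": 0.15,
    "rerank_time": 0.12,
    "inference_time": 1.25,
    "prompt_tokens": 500,
    "completion_tokens": 150,
    "token_latencies": [0.01, 0.012, 0.011, 0.013],
}

print(summarize_metric(sample_entry))
```

For the sample entry the staged time comes to about 1.52 and the throughput to 120 completion tokens per time unit, subject to the assumptions stated above.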