Question and Answer API

Complete API reference for the AI‑Services Question and Answer API, which uses ingested documents to deliver grounded, context‑aware answers for chatbot experiences.

Overview

The AI‑Services Question and Answer API enables conversational query processing over ingested documents and returns grounded answers with citation metadata. It supports both streaming and non‑streaming requests and provides detailed response metadata, including retrieval context, ranking scores, and processing time.

API Endpoints:

  • GET /health — Server health
  • GET /v1/models — List available models
  • POST /v1/chat/completions — Create a chat completion (supports streaming)
  • GET /v1/perf_metrics — Get performance metrics
  • GET /db-status — Vector database ingestion readiness
Note: To find the API endpoints and base URL for your deployment:
  • In the Endpoints Catalog UI: Click Digital Assistant and refer to Integration Endpoints.
  • Using the CLI: Run the following command to retrieve the API endpoints and base URL:
    ai-services application info <appName> --runtime <podman|openshift>

GET /health

Check if the service is running and healthy.

Example (cURL):
curl -v http://10.20.188.184:5000/health
Response:
  • Status Code: 200
    {"status":"ok"}

GET /v1/models

Returns the list of models served by vLLM.

Example (cURL):
curl http://10.20.188.184:5000/v1/models
Sample Response:

{
  "data": [
    {
      "created": 1764589600,
      "id": "ibm-granite/granite-3.3-8b-instruct",
      "max_model_len": 32768,
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1764589600,
          "group": null,
          "id": "modelperm-678e29f39bc04e2f994faf93dddc7c64",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "/models/ibm-granite/granite-3.3-8b-instruct"
    }
  ],
  "object": "list"
}
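A client usually needs only the model identifier and its context limit from this response. The following is a minimal Python sketch, assuming the response has the shape shown in the sample above; the function name model_limits is hypothetical and not part of the service.

```python
import json


def model_limits(models_response: str) -> dict[str, int]:
    """Map each model "id" to its "max_model_len" from a /v1/models body."""
    payload = json.loads(models_response)
    return {
        entry["id"]: entry["max_model_len"]
        for entry in payload.get("data", [])
    }


# Trimmed copy of the sample response shown above.
sample = """
{
  "data": [
    {
      "id": "ibm-granite/granite-3.3-8b-instruct",
      "max_model_len": 32768,
      "object": "model",
      "owned_by": "vllm"
    }
  ],
  "object": "list"
}
"""

print(model_limits(sample))
# {'ibm-granite/granite-3.3-8b-instruct': 32768}
```

The returned id is the value to pass as "model" in POST /v1/chat/completions.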
 

GET /db-status

Indicates whether vector data ingestion is complete.

Example (cURL):
curl http://10.20.188.184:5000/db-status
Sample Response:
  • Ingestion complete - Status Code: 200
    {"ready": true}
  • Ingestion not complete - Status Code: 200
    {"ready": false, "message": "No data ingested"}
Both cases return status code 200, so check the ready field rather than the status code.
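Because the sample responses carry the readiness state in the body, a client may want to poll /db-status before sending questions. The sketch below is one possible approach, not the service's own client. The names ingestion_ready and wait_until_ready, the fetch callable, and the retry defaults are all assumptions made for illustration.

```python
import json
import time
from typing import Callable


def ingestion_ready(status_code: int, body: str) -> tuple[bool, str]:
    """Interpret a /db-status response as (ready, message)."""
    if status_code != 200:
        return False, f"unexpected status {status_code}"
    data = json.loads(body)
    return bool(data.get("ready", False)), data.get("message", "")


def wait_until_ready(
    fetch: Callable[[], tuple[int, str]],
    attempts: int = 10,
    delay: float = 5.0,
) -> bool:
    """Poll until ingestion is ready or the attempts run out.

    `fetch` is a caller-supplied function (hypothetical) that performs
    GET /db-status and returns (status_code, body_text).
    """
    for _ in range(attempts):
        ready, _message = ingestion_ready(*fetch())
        if ready:
            return True
        time.sleep(delay)
    return False


print(ingestion_ready(200, '{"ready": false, "message": "No data ingested"}'))
# (False, 'No data ingested')
```

Passing the HTTP call in as fetch keeps the readiness logic independent of any particular HTTP library.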

POST /v1/chat/completions

Generates a completion for a chat-style prompt. Supports streaming via Server-Sent Events (SSE).

Request Headers:
  • Content-Type: application/json
  • Accept: application/json
  • X-Request-ID: string (optional). If omitted, a random ID is assigned. Use this ID to fetch the request's performance metrics with GET /v1/perf_metrics.
Request Payload:
{
  "messages": [{"role": "user", "content": "Add prompt here"}],
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "max_tokens": 512,
  "temperature": 0.0,
  "repetition_penalty": 1.1,
  "stop": ["Context:", "Question:", "\nContext:", "\nAnswer:", "\nQuestion:", "Answer:"],
  "stream": true
}
Example (cURL):
  • Streaming:
    curl http://10.20.188.184:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Accept: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "list power11 features"}],
        "model": "ibm-granite/granite-3.3-8b-instruct",
        "max_tokens": 512,
        "temperature": 0.0,
        "stream": true
      }'
  • Non-streaming:
    curl http://10.20.188.184:5000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "list power11 features"}],
        "model": "ibm-granite/granite-3.3-8b-instruct",
        "max_tokens": 512,
        "temperature": 0.0,
        "stream": false
      }'
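The cURL examples above do not set X-Request-ID. The following Python sketch builds an equivalent request with an explicit ID so that the same value can later be passed to GET /v1/perf_metrics. It uses only the standard library. The helper name build_chat_request is hypothetical, the base URL is the one used in the examples in this document, and the use of a UUID as the ID is an assumption (the header is documented only as a string).

```python
import json
import urllib.request
import uuid

# Base URL taken from the examples in this document; replace with your own.
BASE_URL = "http://10.20.188.184:5000"


def build_chat_request(prompt, *, stream=False, request_id=None, base_url=BASE_URL):
    """Build (but do not send) a POST /v1/chat/completions request.

    Returns the prepared request and the request ID it carries.
    """
    # Assumption: any unique string is acceptable as X-Request-ID.
    request_id = request_id or str(uuid.uuid4())
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "model": "ibm-granite/granite-3.3-8b-instruct",
        "max_tokens": 512,
        "temperature": 0.0,
        "stream": stream,
    }
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "X-Request-ID": request_id,
        },
        method="POST",
    )
    return request, request_id


request, request_id = build_chat_request("list power11 features")
print(request.get_method(), request.full_url)
# POST http://10.20.188.184:5000/v1/chat/completions
```

To send the request, pass it to urllib.request.urlopen(request) and read the response body. Keep request_id if you intend to look up the metrics for this call.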
     
Response:
  • Streaming:
    data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
    
    data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":"Based"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
    
    data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" on"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
    
    data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" the"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
    .
    .
    .
    data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" secure"},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}]}
    
    data: [DONE]
  • Non-streaming:
    {
      "id": "chatcmpl-29a69927f74d4820936f702f98666d4b",
      "object": "chat.completion",
      "created": 1764589675,
      "model": "ibm-granite/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Based on the provided context, here are some .... Power11 as a robust, efficient, and secure",
            "refusal": null,
            "annotations": null,
            "audio": null,
            "function_call": null,
            "tool_calls": [],
            "reasoning_content": null
          },
          "logprobs": null,
          "finish_reason": "length",
          "stop_reason": null,
          "token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 1099,
        "total_tokens": 1611,
        "completion_tokens": 512,
        "prompt_tokens_details": null
      },
      "prompt_logprobs": null,
      "prompt_token_ids": null,
      "kv_transfer_params": null
    }
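In streaming mode, the answer text is spread across the delta.content fields of successive data: lines, and the stream ends with data: [DONE]. The sketch below shows one way to reassemble the text, assuming the chunk format shown in the streaming sample above. The function name iter_stream_text is hypothetical, and the sample lines are shortened copies of the chunks above.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by SSE "data:" lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # the first chunk carries an empty string
                yield text


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Based"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" on"},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_lines)))
# Based on
```

With a live connection, the same function can be fed the decoded lines of the HTTP response as they arrive.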

GET /v1/perf_metrics

Retrieve performance metrics for recent chat completion and retrieval operations. Optionally filter by request ID to get metrics for a specific request.

Query Parameters:
  • request_id (optional) - Unique request identifier to filter metrics
Example (cURL):
  • Get all metrics:
    curl http://10.20.188.184:5000/v1/perf_metrics
  • Get metrics for a specific request:
    curl "http://10.20.188.184:5000/v1/perf_metrics?request_id=550e8400-e29b-41d4-a716-446655440000"
Sample Response:
{
  "metrics": [
    {
      "timestamp": 1678901234.567,
      "readable_timestamp": "2023-03-15 14:30:34",
      "request_id": "550e8400-e29b-41d4-a716-446655440000",
      "retrieve_time": 0.15,
      "rerank_time": 0.12,
      "inference_time": 1.25,
      "prompt_tokens": 500,
      "completion_tokens": 150,
      "token_latencies": [0.01, 0.012, 0.011, 0.013]
    }
  ]
}
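The per-stage timings in each entry can be combined into a few summary figures. The sketch below is illustrative only: it assumes that the time fields are in seconds (the response does not state the unit), and the function name summarize_metric and the derived field names are hypothetical.

```python
from statistics import mean


def summarize_metric(entry: dict) -> dict:
    """Derive summary figures from one item of the "metrics" list."""
    # Assumption: retrieve, rerank, and inference times are in seconds.
    total = entry["retrieve_time"] + entry["rerank_time"] + entry["inference_time"]
    latencies = entry.get("token_latencies", [])
    inference = entry["inference_time"]
    return {
        "request_id": entry["request_id"],
        "total_time": round(total, 6),
        "mean_token_latency": mean(latencies) if latencies else None,
        "completion_tokens_per_second": (
            entry["completion_tokens"] / inference if inference else None
        ),
    }


# Values copied from the sample response above.
sample_entry = {
    "request_id": "550e8400-e29b-41d4-a716-446655440000",
    "retrieve_time": 0.15,
    "rerank_time": 0.12,
    "inference_time": 1.25,
    "prompt_tokens": 500,
    "completion_tokens": 150,
    "token_latencies": [0.01, 0.012, 0.011, 0.013],
}

summary = summarize_metric(sample_entry)
print(summary["total_time"], summary["completion_tokens_per_second"])
# 1.52 120.0
```

Comparing total_time across requests indicates whether retrieval, reranking, or inference dominates the response time.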