Question and Answer API

Complete API reference for the AI‑Services Question and Answer API, which uses ingested documents to deliver grounded, context‑aware answers for chatbot experiences.

Overview

The AI‑Services Question and Answer API enables conversational query processing over ingested documents and returns grounded answers with citation metadata. It supports both streaming and non‑streaming requests and provides detailed response metadata, including retrieval context, ranking scores, and processing time.

API Endpoints:

GET /health — Server health
GET /v1/models — List available models
POST /v1/chat/completions — Create a chat completion (supports streaming)
GET /v1/perf_metrics — Get performance metrics
GET /db-status — Vector database ingestion readiness

Note: To find the API endpoints and base URL for your deployment:

In the Endpoints Catalog UI: Click Digital Assistant and refer to Integration Endpoints.
Using the CLI: Run the following command, which returns the API endpoints and base URL.
```
ai-services application info <appName> --runtime <podman|openshift>
```

GET /health

Check if the service is running and healthy.

Example (cURL):

curl -v http://10.20.188.184:5000/health

Response:

Status Code: 200
```
{"status":"ok"}
```
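
For scripted checks, the same probe can be sketched in Python. This is a minimal sketch using only the standard library, not part of the service itself: the function names `parse_health` and `check_health` and the 5-second timeout are illustrative assumptions, and the base URL is whatever your deployment reports.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True when the body matches the documented {"status":"ok"} shape."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """GET /health and report whether the service answered 200 with status ok."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and parse_health(resp.read().decode("utf-8"))
    except OSError:
        # Connection failures, timeouts, and HTTP errors all derive from OSError.
        return False
```

Calling `check_health("http://10.20.188.184:5000")` would issue the same request as the cURL example above.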

GET /v1/models

Returns the list of models served by vLLM.

Example (cURL):

curl http://10.20.188.184:5000/v1/models

Sample Response:


{
  "data": [
    {
      "created": 1764589600,
      "id": "ibm-granite/granite-3.3-8b-instruct",
      "max_model_len": 32768,
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1764589600,
          "group": null,
          "id": "modelperm-678e29f39bc04e2f994faf93dddc7c64",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "/models/ibm-granite/granite-3.3-8b-instruct"
    }
  ],
  "object": "list"
}
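
A client usually needs only the model identifiers from this response, for example to fill the `model` field of a chat completion request. The following is a small sketch under that assumption; the helper name `model_ids` is hypothetical, and the `sample` dictionary is a trimmed copy of the response shown above.

```python
from typing import Any, Dict, List


def model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Collect the "id" of every entry in the "data" array."""
    return [m["id"] for m in models_response.get("data", []) if "id" in m]


# Trimmed copy of the documented sample response.
sample = {
    "object": "list",
    "data": [{"id": "ibm-granite/granite-3.3-8b-instruct", "max_model_len": 32768}],
}
print(model_ids(sample))
```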

GET /db-status

Indicates whether vector data ingestion is complete.

Example (cURL):

curl http://10.20.188.184:5000/db-status

Sample Response:

Ingestion complete - Status Code: 200
```
{"ready": true}
```

Ingestion not complete - Status Code: 200

{"ready": false, "message": "No data ingested"}
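
Because both states return status code 200, a client has to inspect the `ready` field in the body rather than the HTTP status. The sketch below polls until ingestion is reported complete. The function names, the attempt count, and the delay are illustrative assumptions; the fetch step is passed in as a callable so the polling logic can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_db_status(base_url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET /db-status and decode the JSON body."""
    url = base_url.rstrip("/") + "/db-status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(fetch: Callable[[], Dict[str, Any]],
                     attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll until the body reports "ready": true or the attempts run out."""
    for attempt in range(attempts):
        if fetch().get("ready") is True:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before the next poll
    return False
```

A typical call would be `wait_until_ready(lambda: fetch_db_status("http://10.20.188.184:5000"))`.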

POST /v1/chat/completions

Generates a completion for a chat-style prompt. Supports streaming via Server-Sent Events (SSE).

Request Headers:

Content-Type: application/json
Accept: application/json

Optional:

X-Request-ID: string

If omitted, the server assigns a random request ID. Use this ID to fetch the performance metrics for the request with GET /v1/perf_metrics.

Request Payload:

{
  "messages": [{"role": "user", "content": "Add prompt here"}],
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "max_tokens": 512,
  "temperature": 0.0,
  "repetition_penalty": 1.1,
  "stop": ["Context:", "Question:", "\nContext:", "\nAnswer:", "\nQuestion:", "Answer:"],
  "stream": true
}
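
The headers and payload above can be assembled programmatically. The following is a minimal sketch, not the service's own client: `build_chat_request` is a hypothetical helper, and generating the `X-Request-ID` as a UUID is an assumption, since the header is only documented as a string. Setting the ID on the client side makes it available for a later metrics lookup.

```python
import json
import uuid
from typing import Dict, Optional, Tuple


def build_chat_request(prompt: str, stream: bool = False,
                       request_id: Optional[str] = None
                       ) -> Tuple[Dict[str, str], bytes, str]:
    """Return (headers, JSON body, request_id) for POST /v1/chat/completions."""
    # Assumption: any unique string works; a UUID is used here for convenience.
    request_id = request_id or str(uuid.uuid4())
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "X-Request-ID": request_id,
    }
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "model": "ibm-granite/granite-3.3-8b-instruct",
        "max_tokens": 512,
        "temperature": 0.0,
        "stream": stream,
    }
    return headers, json.dumps(payload).encode("utf-8"), request_id
```

The returned values can be passed to `urllib.request.Request(url, data=body, headers=headers, method="POST")`.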

Example (cURL):

Streaming:

curl http://10.20.188.184:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "list power11 features"}],
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "max_tokens": 512,
    "temperature": 0.0,
    "stream": true
  }'

Non-streaming:

curl http://10.20.188.184:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "list power11 features"}],
    "model": "ibm-granite/granite-3.3-8b-instruct",
    "max_tokens": 512,
    "temperature": 0.0,
    "stream": false
  }'

Response:

Streaming:

data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":"Based"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" on"},"logprobs":null,"finish_reason":null,"token_ids":null}]}

data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" the"},"logprobs":null,"finish_reason":null,"token_ids":null}]}
.
.
.
data: {"id":"chatcmpl-0f7520824f06437f8c75bb7007038b2e","object":"chat.completion.chunk","created":1764589727,"model":"ibm-granite/granite-3.3-8b-instruct","choices":[{"index":0,"delta":{"content":" secure"},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}]}

data: [DONE]
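
Each streamed event is a `data:` line carrying one JSON chunk, and the stream ends with the `data: [DONE]` sentinel. The sketch below reassembles the answer text from such lines. The function names are hypothetical, and it assumes a single choice per chunk, as in the sample above.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_deltas(lines: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (content, finish_reason) for each chunk until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        choice = json.loads(data)["choices"][0]
        yield choice["delta"].get("content") or "", choice.get("finish_reason")


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join the content deltas and keep the last non-null finish_reason."""
    parts, finish = [], None
    for content, reason in iter_deltas(lines):
        parts.append(content)
        finish = reason or finish
    return "".join(parts), finish
```

With a live response object from `urllib.request.urlopen`, the lines can be decoded on the fly, for example `collect_stream(raw.decode("utf-8") for raw in resp)`.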

Non-streaming:

{
  "id": "chatcmpl-29a69927f74d4820936f702f98666d4b",
  "object": "chat.completion",
  "created": 1764589675,
  "model": "ibm-granite/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Based on the provided context, here are some .... Power11 as a robust, efficient, and secure",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1099,
    "total_tokens": 1611,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
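
In the sample above, `finish_reason` is `length` and `completion_tokens` equals the requested `max_tokens` of 512, so the answer was cut off at the token limit. A client may want to detect this case. The following sketch uses a hypothetical helper `extract_answer` and assumes a single choice in the response.

```python
from typing import Any, Dict, Tuple


def extract_answer(response: Dict[str, Any]) -> Tuple[str, bool]:
    """Return (answer text, truncated) from a chat.completion response."""
    choice = response["choices"][0]
    text = choice["message"]["content"] or ""
    # finish_reason "length" means generation stopped at the max_tokens limit.
    truncated = choice.get("finish_reason") == "length"
    return text, truncated
```

When `truncated` is true, resending the request with a larger `max_tokens` is one way to obtain a complete answer.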

GET /v1/perf_metrics

Retrieve performance metrics for recent chat completion and retrieval operations. Optionally filter by request ID to get metrics for a specific request.

Query Parameters:

request_id (optional) - Unique request identifier to filter metrics. This is the X-Request-ID value sent with, or assigned to, the chat completion request.

Example (cURL):

Get all metrics:

curl http://10.20.188.184:5000/v1/perf_metrics

Get metrics for a specific request:

curl "http://10.20.188.184:5000/v1/perf_metrics?request_id=550e8400-e29b-41d4-a716-446655440000"

Sample Response:

{
  "metrics": [
    {
      "timestamp": 1678901234.567,
      "readable_timestamp": "2023-03-15 14:30:34",
      "request_id": "550e8400-e29b-41d4-a716-446655440000",
      "retrieve_time": 0.15,
      "rerank_time": 0.12,
      "inference_time": 1.25,
      "prompt_tokens": 500,
      "completion_tokens": 150,
      "token_latencies": [0.01, 0.012, 0.011, 0.013]
    }
  ]
}
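
The per-request fields can be combined into a few derived figures. The sketch below is illustrative only and rests on two assumptions that this reference does not state: that all time fields are in seconds, and that retrieval, reranking, and inference run one after another so their durations can be summed. The helper name `summarize_metric` is hypothetical.

```python
from statistics import mean
from typing import Any, Dict


def summarize_metric(entry: Dict[str, Any]) -> Dict[str, float]:
    """Derive totals from one item of the "metrics" array."""
    inference = entry["inference_time"]
    latencies = entry.get("token_latencies") or []
    return {
        # Assumption: the three phases are sequential and measured in seconds.
        "total_time": entry["retrieve_time"] + entry["rerank_time"] + inference,
        "tokens_per_second": entry["completion_tokens"] / inference if inference else 0.0,
        "mean_token_latency": mean(latencies) if latencies else 0.0,
    }
```

Applied to the sample entry above, the summed phase time is about 1.52 and the generation rate is 120 completion tokens per unit of inference time.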