Inferencing models through the model gateway

Send requests to models by using OpenAI-compatible endpoints through the model gateway. You can use either the REST API or the OpenAI Python SDK to generate text, create chat-based responses, produce embeddings, and develop scalable solutions across multiple models tailored to your specific use cases.

Required permissions: To inference gateway models, you must have one of the following permissions:

Administrator platform
Manage configurations

Required credentials: You must generate credentials to authenticate with watsonx.ai APIs. For details, see Generating a bearer token.

Ways to work

The model gateway endpoints provide a unified, OpenAI-compatible API that routes model requests to any configured provider.

You can inference gateway foundation models by using these programming methods:

watsonx.ai REST API
OpenAI Python SDK

Attention: Some model providers might not support every endpoint in their backend. If you call an endpoint that a configured model provider does not support, you might see errors in the response.

REST API

For details about the model gateway APIs, see the watsonx.ai API reference documentation.

The model gateway supports the following endpoints:

Listing providers and models
Chat completions (supports streaming)
Text completions/generations (supports streaming)
Embeddings generation

Listing providers and models

You can list both the providers and models that you configured.

To list all configured model providers, use the following command:

curl -sS "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/providers" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}"

To list all models enabled for a specific provider, use the following command:

curl -sS "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/providers/${PROVIDER_UUID}/models" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}"

To list all enabled models across all configured providers, use the following command:

curl -sS "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/models" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}"
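The same listing calls can be made from Python. The following is a minimal sketch that uses only the standard library and mirrors the curl commands above. The helper names (build_list_request, provider_models_path, fetch_json) are hypothetical, and the shape of the JSON response is not assumed.

```python
import json
import urllib.request


def build_list_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Build a GET request for a gateway listing path such as 'providers' or 'models'."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url=url,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


def provider_models_path(provider_uuid: str) -> str:
    """Return the path that lists the models enabled for one provider."""
    return f"providers/{provider_uuid}/models"


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the JSON body, whatever its shape."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (requires network access and a valid bearer token):
#   base = "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1"
#   providers = fetch_json(build_list_request(base, "providers", token))
#   models = fetch_json(build_list_request(base, "models", token))
```

Building the request separately from sending it makes it possible to inspect the URL and headers before any network call is made.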

Chat completions

To use the /v1/chat/completions endpoint, see the following example:

curl "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "model": "azure/gpt-4o",
    "messages": [
      {
        "role": "system",
        "content": "Please explain everything in a way a 5th grader could understand—simple language, clear steps, and easy examples."
      },
      {
        "role": "user",
        "content": "Can you explain what TLS is and how I can use it?"
      }
    ]
  }'

For more details and examples, see Chat completions in the OpenAI API documentation.
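The chat completions endpoint is listed above as supporting streaming. The following sketch assumes that the gateway follows the OpenAI streaming convention, which this page does not spell out: the request body includes "stream": true, and the response arrives as server-sent event lines of the form data: {json}, ending with data: [DONE]. The function name iter_stream_text is hypothetical, and the sample events are hand-written for illustration rather than captured gateway output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events in the assumed format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "TLS "}}]}',
    'data: {"choices": [{"delta": {"content": "keeps data private."}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```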

Text completions/generation

To use the /v1/completions endpoint, see the following example:

curl "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "model": "ibm/llama-3-3-70b-instruct",
    "prompt": "Say this is a test",
    "max_tokens": 7,
    "temperature": 0
  }'

For more details and examples, see Text generation in the OpenAI API documentation.

Embeddings generation

To use the /v1/embeddings endpoint, see the following example:

curl "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1/embeddings" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${TOKEN}" \
  -d '{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-3-large",
    "encoding_format": "float"
  }'

For more details and examples, see How to get embeddings in the OpenAI API documentation.
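After an embeddings call returns, a common next step is to compare vectors. The following sketch assumes that the response uses the OpenAI embeddings shape, a data list whose items carry index and embedding fields, which this page does not confirm for every provider. The helper names and the two-dimensional sample vectors are illustrative only.

```python
import math
from typing import Sequence


def extract_embeddings(response: dict) -> list:
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hand-written response in the assumed shape; real vectors are much longer.
sample_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 1, "embedding": [1.0, 1.0]},
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
    ],
}

first, second = extract_embeddings(sample_response)
print(round(cosine_similarity(first, second), 4))
```

Sorting by index keeps each vector aligned with the input it was computed from when several inputs are sent in one request.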

Python SDK

To work with the foundation models in the model gateway, you can use the Gateway class in the watsonx.ai Python library.

To get started, see the following sample notebooks:

To build LLM apps that route requests to providers by using the LangGraph framework and model gateway, see LangGraph Agent Template.

The model gateway maintains compatibility with the OpenAI API. Therefore, the OpenAI SDKs can be used to inference gateway models by passing the bearer token instead of the OpenAI API key.

To use the OpenAI Python SDK to make a chat completions request through the model gateway, see the following example:

import os
from openai import OpenAI


gateway_url = "https://cpd-<namespace-name>.apps.<OCP-domain>/ml/gateway/v1"
# The bearer token is passed in place of an OpenAI API key.
bearer_token = os.getenv("TOKEN")

client = OpenAI(
    base_url=gateway_url,
    api_key=bearer_token,
)

completion = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
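
Because chat completions support streaming, the SDK call can also be made with stream=True. The following sketch assumes that the gateway streams chunks in the OpenAI format that the SDK expects. The function name stream_reply is hypothetical, and the client argument is an OpenAI client constructed as in the example above.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Request a streamed chat completion, print it as it arrives, and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks can carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            print(text, end="", flush=True)
    print()
    return "".join(parts)


# Example usage with the client from the previous example:
#   reply = stream_reply(client, "openai/gpt-4o", "Hello!")
```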