Inferencing deployed custom foundation models
Test a custom foundation model after you deploy it with dedicated GPUs or MIG enablement on a cluster.
Inferencing response types
You can use the deployed custom foundation model to generate a text or streamed response:
- Text response: The LLM processes the entire input prompt and generates the complete response before sending it back. The full response is returned all at once when inferencing is complete. You can generate a text response when you need a concise answer and your use case doesn't require real-time interaction, such as summarizing a document.
- Streamed response: The LLM starts sending output token by token as soon as the tokens are generated, instead of waiting for the full completion. The client (such as a chatbot or web application) receives and displays partial output in real time as it arrives. You can generate a streamed response when you want a chat-like experience where responses appear progressively and you need a real-time interactive system, such as a conversational chatbot or an AI-assisted writing tool.
Inferencing deployed custom foundation models from the user interface
You can inference your deployed custom foundation model directly from the user interface to generate text or streamed responses.
Follow these steps to inference your deployed custom foundation model:
- From the Deployments tab of your space, click the deployment name.
- Click the Test tab to input prompt text and get a response from the deployed asset.
- Enter test data in one of the following formats, depending on the type of asset that you deployed:
  - Text: Enter text input data to generate a block of text as output.
  - Stream: Enter text input data to generate a stream of text as output.
  - JSON: Enter JSON input data to generate output in JSON format (see the example after these steps).
- Click Generate to get results based on your prompt.
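For example, when you test a prompt template that defines prompt variables, the JSON input might look like the following sketch. The variable names user_input, output1, and output2 are illustrative placeholders; use the prompt variables that your own prompt template defines.
{
"prompt_variables": {
"user_input": "price too high and location off premise too far",
"output1": "true",
"output2": "false"
}
}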

Inferencing deployed custom foundation models with REST API
Depending on the type of prompt template that you created to interact with your deployed custom foundation model, you can use REST API endpoints to generate a text or streamed response for inferencing.
To inference deployed time-series models, use this API endpoint:
<cluster url>/ml/v1/deployments/<your deployment ID>/time_series/forecast
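The REST API examples in the following sections pass a bearer token in the Authorization header. As a minimal sketch, on IBM Cloud Pak for Data clusters you can typically generate that token with the platform authorization endpoint; the /icp4d-api/v1/authorize path and the api_key field shown here are assumptions about your platform release, so verify the token-generation procedure for your cluster.
# Request a platform access token (sketch; placeholders in angle brackets)
curl -k -X POST "<cluster url>/icp4d-api/v1/authorize" \
-H "Content-Type: application/json" \
--data '{
"username": "<your username>",
"api_key": "<your API key>"
}'
The token field in the JSON response is the value that the following examples pass as <your token>.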
Generating a text response
When you create a structured, free-form, or chat prompt template based on your deployed custom foundation model, you can generate a text response with REST API.
Two API endpoints are available to generate a text response:
- Deployment text-generation API endpoint: The deployment text-generation API endpoint /ml/v1/deployments/<deployment ID>/text/generation can be used to inference prompt templates that use custom foundation models in structured or freeform mode to generate text responses.
Text generation is not possible for models that were deployed by using custom runtime images.
The following code snippet shows how to generate a text response for online inferencing, after deploying a custom foundation model:
curl -X POST "<cluster url>/ml/v1/deployments/<your deployment ID>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer <your token>" \
-H "content-type: application/json" \
--data '{
"parameters": {
"prompt_variables": {
"user_input": "price too high and location off premise too far",
"output1": "true",
"output2": "false"
}
}
}'
- Deployment text-chat API endpoint: The deployment text-chat API endpoint /ml/v1/deployments/<deployment ID>/text/chat can be used to inference prompt templates that use custom foundation models in chat mode to generate text responses.
To inference a custom foundation model with text-chat API, the custom foundation model must meet the following requirements:
- The model content file (tokenizer_config.json in the model content) must include a chat template. For example, the ibm-granite/granite-3.0-8b-instruct model on Hugging Face includes a chat template. For more information, see the model configuration file for granite-3.0-8b-instruct on Hugging Face.
- The custom foundation model must be deployed with a software specification that enables text-chat functionality. All the models that use the watsonx-cfm-caikit-1.1 software specification are in this category. Additionally, models that are deployed by using custom software specifications might enable this functionality, but you must verify this with the MLOps engineer who created the software specification.
From release 2.2.1, you can also define:
- The tool that the foundation model must use to answer user queries.
- The list of tools that the foundation model can use to answer user queries.
For details about inferencing models with the use of tools, see Building agent-driven workflows with the chat API. A sketch of a tool-enabled chat request follows the example below.
When you create a chat-based prompt template to interact with your deployed custom foundation model, you can generate a response with REST API.
curl --request POST \
--url 'https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat?version=2020-10-10' \
--header 'Authorization: Bearer $token' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is the capital of India"
}
]
},
{
"role": "assistant",
"content": "New Delhi."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Which continent"
}
]
}
]
}'
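From release 2.2.1, the chat request can also carry the tool definitions that you configured. The following request is an illustrative sketch only: the tools and tool_choice_option fields follow the chat API schema that is described in Building agent-driven workflows with the chat API, and the get_current_weather function is a made-up example, so confirm the exact field names against that documentation before you use them.
curl --request POST \
--url 'https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat?version=2020-10-10' \
--header 'Authorization: Bearer $token' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is the weather like in Bengaluru today?"
}
]
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather for a city",
"parameters": {
"type": "object",
"properties": {
"city": { "type": "string" }
},
"required": ["city"]
}
}
}
],
"tool_choice_option": "auto"
}'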
Generating a streamed response
When you create a structured, free-form, or chat prompt template based on your deployed custom foundation model, you can generate a streamed response with REST API. Two API endpoints are available to generate a streamed response:
To inference a custom foundation model with the text-chat API, the custom foundation model must meet the following requirements:
- The model content file (tokenizer_config.json in the model files) must include a chat template.
- The custom foundation model must be deployed with a software specification that enables streamed output. All the models that use the watsonx-cfm-caikit-1.1 software specification are in this category.

- Deployment text-generation stream API endpoint: The deployment text-generation stream API endpoint /ml/v1/deployments/<deployment ID>/text/generation_stream can be used to inference prompt templates that use custom foundation models in structured or freeform mode to generate streamed responses.
The following code snippet shows how to generate a streamed response for online inferencing after deploying a custom foundation model (a note on consuming the stream from the command line follows these examples):
curl -X POST "<cluster url>/ml/v1/deployments/<your deployment ID>/text/generation_stream?version=2024-01-29" \
-H "Authorization: Bearer <your token>" \
-H "content-type: application/json" \
--data '{
"parameters": {
"prompt_variables": {
"user_input": "price too high and location off premise too far",
"output1": "true",
"output2": "false"
}
}
}'
For more information, see Deployment text-generation stream API endpoint in the watsonx.ai REST API documentation.
- Deployment text-chat stream API endpoint: The deployment text-chat stream API endpoint /ml/v1/deployments/<deployment-id>/text/chat_stream can be used to inference prompt templates that use custom foundation models in chat mode to generate streamed responses.
When you create a chat-based prompt template to interact with your deployed custom foundation model, you can generate a streamed response with the REST API:
curl --request POST \
--url 'https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat_stream?version=2020-10-10' \
--header 'Authorization: Bearer $token' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is the capital of USA"
}
]
}
]
}'
For more information, see Deployment text-chat stream API endpoint in the watsonx.ai REST API documentation.
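Both stream endpoints return output incrementally rather than as a single JSON document. As a minimal sketch, assuming the response is delivered as a server-sent event stream, add curl's -N (--no-buffer) option so that partial output is printed as soon as it arrives instead of being buffered:
# Same request as the generation_stream example above, with output buffering disabled (-N)
curl -N -X POST "<cluster url>/ml/v1/deployments/<your deployment ID>/text/generation_stream?version=2024-01-29" \
-H "Authorization: Bearer <your token>" \
-H "content-type: application/json" \
--data '{
"parameters": {
"prompt_variables": {
"user_input": "price too high and location off premise too far",
"output1": "true",
"output2": "false"
}
}
}'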
Inferencing a deployed time-series model
The following example code snippet shows how to inference a deployed time-series model:
curl --request POST \
--url '<cluster url>/ml/v1/deployments/<your deployment ID>/time_series/forecast?version=2025-05-01' \
--header 'Authorization: Bearer <your token>' \
--header 'content-type: application/json' \
--data '{
"schema": {
"timestamp_column": "date",
"freq": "1h"
},
"data": {
"date": [
"2017-10-02T16:00:00",
"2017-10-02T17:00:00",
"2017-10-02T18:00:00",
"2017-10-02T19:00:00",
"2017-10-02T20:00:00",
"2017-10-02T21:00:00",
"2017-10-02T22:00:00",
"2017-10-02T23:00:00",
"2017-10-03T00:00:00",
"2017-10-03T01:00:00",
"2017-10-03T02:00:00",
"2017-10-03T03:00:00",
"2017-10-03T04:00:00",
"2017-10-03T05:00:00",
"2017-10-03T06:00:00",
"2017-10-03T07:00:00",
"2017-10-03T08:00:00",
"2017-10-03T09:00:00",
"2017-10-03T10:00:00",
"2017-10-03T11:00:00"
],
"HULL": [
7.969533546848622,
3.517834648350302,
0.27686073271722456,
7.270215589283335,
8.301608634471089,
3.5715043729826013,
4.837339724448691,
6.259279065718229,
4.992513028164783,
4.054610086143303,
8.734517184165343,
5.6456574912796444,
4.9638106191135964,
7.582701093090489,
4.286026911335609,
7.821287465558724,
4.987596574195042,
9.301099509722704,
5.983153494948004,
4.4075103791467605
],
"HUFL": [
4.721054631564481,
7.9107469233587935,
9.379306714057732,
9.466520103213014,
3.0316660084605607,
1.2258327967405924,
7.827951462135792,
5.619227620657014,
2.1684074695772115,
2.2928977662265337,
7.495460408629855,
1.7033375810280893,
8.27439736702749,
6.3714075795857275,
5.545698731301024,
4.609266503722505,
4.8792239645717235,
5.319728292461291,
3.0308754657454253,
5.5629867790456675
]
}
}'
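In this payload, the schema section describes the input data: timestamp_column names the time column and freq gives its spacing (hourly in this example), while the remaining keys under data (HULL and HUFL here) carry the numeric history for the series. You can also pass request parameters with the payload. The following payload skeleton adds a parameters object with prediction_length to request a specific forecast horizon; treat that field name as an assumption based on the watsonx.ai time-series forecast API and confirm the supported parameters in the watsonx.ai REST API documentation. The angle-bracket placeholders stand for the same arrays that are shown in the previous example.
{
"parameters": {
"prediction_length": 6
},
"schema": {
"timestamp_column": "date",
"freq": "1h"
},
"data": {
"date": ["<timestamps, as in the previous example>"],
"HULL": ["<numeric history>"],
"HUFL": ["<numeric history>"]
}
}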
Inferencing automatic speech recognition (ASR) models
From release 2.2.1, you can inference automatic speech recognition models by using the /ml/v1/audio/transcriptions endpoint, as shown in the following example:
curl -kLsS <cluster url>/ml/v1/audio/transcriptions \
-H "Authorization: Bearer <bearer token>;space_id=<space ID>" \
-H "Content-Type: multipart/form-data" \
--form file="@<path to file>" \
-F model="<model family/model name>"