Inferencing deployed custom foundation models

Test a custom foundation model after you deploy it with dedicated GPUs or MIG enablement on a cluster.

Inferencing response types

You can use the deployed custom foundation model to generate a text or streamed response:

  • Text response: The LLM processes the entire input prompt and generates the complete response before sending it back. The full response is returned all at once when inferencing is complete.

    You can generate a text response when you need a concise response and your use case doesn't require real-time interaction, such as summarizing a document.

  • Streamed response: The LLM starts sending output token by token as soon as tokens are generated, instead of waiting for the full response. The client (such as a chatbot or web application) receives and displays the partial output in real time as it arrives, as illustrated in the sketch that follows this list.

    You can generate a streamed response when you want a chat-like experience where responses appear progressively and you need a real-time interactive system such as a conversational chatbot or AI-assisted writing tool.
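To make the difference concrete, the following minimal sketch shows the two client-side consumption patterns against the deployment endpoints that are described later on this page. The hostname, deployment ID, token, and prompt variables are placeholders, and the exact request and response bodies depend on your prompt template, so treat this as an illustration of blocking versus streamed consumption rather than a complete client.

import requests

base = "https://<your CPD hostname>/ml/v1/deployments/<your deployment ID>"
headers = {"Authorization": "Bearer <your token>", "Content-Type": "application/json"}
payload = {"parameters": {"prompt_variables": {"user_input": "Summarize the attached document."}}}

# Text response: one blocking call; the complete answer arrives in a single JSON body.
full = requests.post(f"{base}/text/generation?version=2024-01-29", headers=headers, json=payload)
print(full.json())

# Streamed response: partial output arrives while it is being generated;
# the client can display each chunk as soon as it is received.
with requests.post(
    f"{base}/text/generation_stream?version=2024-01-29",
    headers=headers, json=payload, stream=True,
) as stream:
    for line in stream.iter_lines(decode_unicode=True):
        if line:  # skip keep-alive blank lines
            print(line)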

Inferencing deployed custom foundation models from the user interface

You can inference your deployed custom foundation model directly from the user interface to generate text or streamed responses.

Follow these steps to inference your deployed custom foundation model:

  1. From the Deployments tab of your space, click the deployment name.

  2. Click the Test tab to input prompt text and get a response from the deployed asset.

  3. Enter test data in one of the following formats, depending on the type of asset that you deployed:

    • Text: Enter text input data to generate a block of text as output.
    • Stream: Enter text input data to generate a stream of text as output.
    • JSON: Enter JSON input data to generate output in JSON format.

    Enter test data for custom foundation model

  4. Click Generate to get results based on your prompt.

    Generate output for custom foundation model

Inferencing deployed custom foundation models with REST API

Depending on the type of prompt template that you created to interact with your deployed custom foundation model, you can use REST API endpoints to generate a text or streamed response for inferencing.

Generating text response

When you create a structured, free-form, or chat prompt template based on your deployed custom foundation model, you can generate a text response with the REST API. There are two API endpoints available to generate a text response:

  • Deployment text-generation API endpoint: The deployment text-generation API endpoint /ml/v1/deployments/<deployment ID>/text/generation can be used to inference prompt templates that use custom foundation models in structured or free-form mode to generate text responses.

The following code snippet shows how to generate a text response for online inferencing after deploying a custom foundation model:

curl -X POST "https://<replace with your CPD hostname>/ml/v1/deployments/<replace with your deployment ID>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer <replace with your token>" \
-H "content-type: application/json" \
--data '{
 "parameters": {
    "prompt_variables": {
       "user_input": "price too high and location off premise too far",
       "output1": "true",
       "output2": "false"
    }
 }
}'
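For programmatic access, you can send the same request from a short Python script. The following sketch assumes that the response body contains a results list with a generated_text field, which is the typical shape of watsonx.ai text-generation responses; verify the exact schema in the watsonx.ai REST API documentation for your release.

import requests

url = (
    "https://<replace with your CPD hostname>/ml/v1/deployments/"
    "<replace with your deployment ID>/text/generation?version=2024-01-29"
)
headers = {
    "Authorization": "Bearer <replace with your token>",
    "Content-Type": "application/json",
}
payload = {
    "parameters": {
        "prompt_variables": {
            "user_input": "price too high and location off premise too far",
            "output1": "true",
            "output2": "false",
        }
    }
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()

# Assumed response shape: the generated text is returned in results[0].generated_text.
print(response.json()["results"][0]["generated_text"])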
  • Deployment text-chat API endpoint: The deployment text-chat API endpoint /ml/v1/deployments/<deployment ID>/text/chat can be used to inference prompt templates that use custom foundation models in chat mode to generate text responses.

To inference a custom foundation model with the text-chat API, the custom foundation model must meet the following requirements:

  • The model content file (tokenizer_config.json in model content) must include a chat template. For example, the ibm-granite/granite-3.0-8b-instruct model on Hugging Face includes a chat template. For more information, see Model configuration file for granite-3.0-8b-instruct on Hugging Face. A quick way to check this requirement is shown in the sketch after this list.
  • The custom foundation model must be deployed with the software specification for vLLM runtime watsonx-cfm-caikit-1.1.
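As a quick check before you deploy, you can verify that the model's tokenizer_config.json declares a chat template. For Hugging Face models, the template is typically stored as a chat_template entry (a Jinja string) in that file. The following minimal sketch assumes that you have the model content on a local disk; the path is a placeholder.

import json
from pathlib import Path

# Placeholder path to the downloaded model content.
config_path = Path("<path to model content>") / "tokenizer_config.json"

with config_path.open(encoding="utf-8") as f:
    tokenizer_config = json.load(f)

# Hugging Face tokenizers usually store the chat template under the "chat_template" key.
if "chat_template" in tokenizer_config:
    print("Chat template found: the model can be inferenced with the text-chat endpoints.")
else:
    print("No chat template found: add one before using the text-chat endpoints.")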

When you create a chat prompt template to interact with your deployed custom foundation model, you can generate a text response with the REST API:

curl --request POST \
  --url 'https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat?version=2020-10-10' \
  --header "Authorization: Bearer $token" \
  --header 'Content-Type: application/json' \
  --data '{
	"messages": [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "What is the capital of India"
				}
			]
		},
		{
			"role": "assistant",
			"content": "New Delhi."
		},
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Which continent"
				}
			]
		}
	]
}'
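The messages array carries the conversation history, so include earlier user and assistant turns in each request when you want the model to answer follow-up questions in context (in this example, "Which continent" refers back to New Delhi). To read the reply programmatically, you can use a short Python sketch such as the following; it assumes that the chat response exposes the assistant message under a choices entry, similar to common chat-completion APIs, so confirm the exact schema in the watsonx.ai REST API documentation.

import requests

url = "https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat?version=2020-10-10"
headers = {"Authorization": "Bearer <your token>", "Content-Type": "application/json"}
payload = {
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "What is the capital of India"}]},
        {"role": "assistant", "content": "New Delhi."},
        {"role": "user", "content": [{"type": "text", "text": "Which continent"}]},
    ]
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()

# Assumed response shape: the assistant reply is returned in choices[0].message.content.
print(response.json()["choices"][0]["message"]["content"])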

Generating streamed response

When you create a structured, free-form, or chat prompt template based on your deployed custom foundation model, you can generate a streamed response with the REST API. There are two API endpoints available to generate a streamed response:

To inference a custom foundation model with the text-chat stream API, the custom foundation model must meet the following requirements:

  • The model content file (tokenizer_config.json in model content) must include a chat template.

  • The custom foundation model must be deployed with the software specification for vLLM runtime watsonx-cfm-caikit-1.1.

  • Deployment text-generation stream API endpoint: The deployment text-generation stream API endpoint /ml/v1/deployments/<deployment ID>/text/generation_stream can be used to inference prompt templates that use custom foundation models in structured or free-form mode to generate streamed responses.

The following code snippet shows how to generate a streamed response for online inferencing after deploying a custom foundation model:

curl -X POST "https://<replace with your CPD hostname>/ml/v1/deployments/<replace with your deployment ID>/text/generation_stream?version=2024-01-29" \
-H "Authorization: Bearer <replace with your token>" \
-H "content-type: application/json" \
--data '{
 "parameters": {
    "prompt_variables": {
       "user_input": "price too high and location off premise too far",
       "output1": "true",
       "output2": "false"
    }
 }
}'

For more information, see Deployment text-generation stream API endpoint in the watsonx.ai REST API documentation.
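The streamed endpoints return output incrementally, typically as server-sent events. The following minimal Python sketch shows one way to consume the stream; it assumes that each event line is prefixed with data: and carries a JSON payload with the partial text in results[0].generated_text, so confirm the exact event format in the watsonx.ai REST API documentation.

import json
import requests

url = (
    "https://<replace with your CPD hostname>/ml/v1/deployments/"
    "<replace with your deployment ID>/text/generation_stream?version=2024-01-29"
)
headers = {
    "Authorization": "Bearer <replace with your token>",
    "Content-Type": "application/json",
}
payload = {
    "parameters": {
        "prompt_variables": {
            "user_input": "price too high and location off premise too far",
            "output1": "true",
            "output2": "false",
        }
    }
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        # Skip keep-alive blank lines and non-data fields of the event stream.
        if not line or not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):].strip())
        # Assumed event shape: the partial text is in results[0].generated_text.
        print(event["results"][0]["generated_text"], end="", flush=True)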

  • Deployment text-chat stream API endpoint: The deployment text-chat stream API endpoint /ml/v1/deployments/<deployment-id>/text/chat_stream can be used to inference prompt templates that use custom foundation models in chat mode to generate streamed responses.

When you create a chat prompt template to interact with your deployed custom foundation model, you can generate a streamed response with the REST API:

curl --request POST \
  --url 'https://<host-url>/ml/v1/deployments/<deployment-id>/text/chat_stream?version=2020-10-10' \
  --header "Authorization: Bearer $token" \
  --header 'Content-Type: application/json' \
  --data '{
	"messages": [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "What is the capital of USA"
				}
			]
		}
	]
}'

For more information, see Deployment text-chat stream API endpoint in the watsonx.ai REST API documentation.
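The streamed chat response is also delivered incrementally, typically as server-sent events, so you can consume it with the same client pattern that is shown in the sketch for the text-generation stream endpoint. Only the field that carries the incremental assistant message content differs, so check the event schema in the watsonx.ai REST API documentation.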

Parent topic: Deploying custom foundation models