Inferencing deployed PEFT models

After deploying your PEFT model, you can inference it by sending input text to the deployment to generate predictions in real time, either as a single text response or as a stream.

Before you begin

Make sure you have deployed your LoRA adapter model before proceeding with inferencing.

Inferencing deployed PEFT models with the REST API

You can use the watsonx.ai REST API to inference your deployed PEFT model and generate predictions in real time.

Generating text response

To generate a text response from your deployed PEFT model, use the following code sample:

curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
  "input": "What is the boiling point of water?",
  "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
  }
}'

Replace the placeholders with your actual values and adjust the parameters for your use case.
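
You can issue the same request from Python. The following is a minimal sketch using the requests library, assuming an IBM Cloud API key that is exchanged for a bearer token through IBM Cloud IAM, and assuming the response body contains a results list with a generated_text field; adjust the authentication flow and field names for your environment.

import requests

HOST = "<HOST>"                    # for example, us-south.ml.cloud.ibm.com
DEPLOYMENT_ID = "<deployment_id>"
API_KEY = "<API_KEY>"

# Exchange the API key for a bearer token (IBM Cloud IAM).
iam_response = requests.post(
    "https://iam.cloud.ibm.com/identity/token",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    data={
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": API_KEY,
    },
)
iam_response.raise_for_status()
token = iam_response.json()["access_token"]

# Call the text generation endpoint of the deployed PEFT model.
response = requests.post(
    f"https://{HOST}/ml/v1/deployments/{DEPLOYMENT_ID}/text/generation",
    params={"version": "2024-01-29"},
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
    json={
        "input": "What is the boiling point of water?",
        "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
    },
)
response.raise_for_status()
# Field names assumed from the generation response schema.
print(response.json()["results"][0]["generated_text"])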

Generating stream response

To generate a stream response from your deployed PEFT model, use the following code sample:

curl -X POST "https://<HOST>/ml/v1/deployments/<deployment_id>/text/generation_stream?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
--data '{
  "input": "What is the boiling point of water?",
  "parameters": {
    "max_new_tokens": 200,
    "min_new_tokens": 20
  }
}'

As before, replace the placeholders with your actual values and adjust the parameters for your use case.
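
The streaming endpoint returns the response incrementally while generation is still in progress. As a minimal sketch, assuming the endpoint emits server-sent events (data: lines that carry a JSON payload with a results list), you can consume the stream from Python with the requests library:

import json
import requests

HOST = "<HOST>"
DEPLOYMENT_ID = "<deployment_id>"
TOKEN = "<token>"  # bearer token, obtained as in the previous sketch

with requests.post(
    f"https://{HOST}/ml/v1/deployments/{DEPLOYMENT_ID}/text/generation_stream",
    params={"version": "2024-01-29"},
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json={
        "input": "What is the boiling point of water?",
        "parameters": {"max_new_tokens": 200, "min_new_tokens": 20},
    },
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Each event line carries a JSON payload after the "data: " prefix
        # (assumed server-sent event framing).
        if line and line.startswith(b"data: "):
            event = json.loads(line[len(b"data: "):])
            for result in event.get("results", []):
                print(result.get("generated_text", ""), end="", flush=True)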

Inferencing deployed PEFT models with the Python client library

You can use the watsonx.ai Python client library to inference your deployed PEFT model and generate predictions in real time.

Generating text response

To generate a text response from your deployed PEFT model, use the client.deployments.generate_text() function from the watsonx.ai Python client library. For more information, see Generating text response with generate_text() in the Python client library documentation.
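
The following is a minimal sketch, assuming the ibm-watsonx-ai package, an IBM Cloud API key, and placeholder values for the endpoint URL, space ID, and deployment ID:

from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",  # your regional endpoint
    api_key="<API_KEY>",
)
client = APIClient(credentials)
client.set.default_space("<SPACE_ID>")  # the space that holds the deployment

# Generate a single text response from the deployed PEFT model.
generated_text = client.deployments.generate_text(
    deployment_id="<DEPLOYMENT_ID>",
    prompt="What is the boiling point of water?",
    params={"max_new_tokens": 200, "min_new_tokens": 20},
)
print(generated_text)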

Generating stream response

To generate a stream response from your deployed PEFT model, use the client.deployments.generate_text_stream() function from the watsonx.ai Python client library. For more information, see Generating stream response with generate_text_stream() in the Python client library documentation.
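
Continuing the previous sketch, the streaming variant iterates over chunks of generated text as they arrive:

# Reuses the client from the previous sketch.
for chunk in client.deployments.generate_text_stream(
    deployment_id="<DEPLOYMENT_ID>",
    prompt="What is the boiling point of water?",
    params={"max_new_tokens": 200, "min_new_tokens": 20},
):
    print(chunk, end="", flush=True)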

Parent topic: Deploying Parameter-Efficient Fine-Tuned (PEFT) models