Deploying custom foundation models tuned with PEFT techniques by using the REST API

To deploy a custom foundation model that is tuned with a PEFT technique, first create a repository asset for the base custom foundation model and deploy that asset. Then, create a repository asset for the adapter model that is trained with a PEFT technique and deploy it on top of the base deployment.

Before you begin

  1. The administrator must store the LLM in PVC storage and register the model with watsonx.ai. For more information, see Deploying custom foundation models in IBM watsonx.ai in the IBM Software Hub documentation.
  2. The custom foundation model must be trained with a PEFT technique. For more information, see Tuning a foundation model programmatically.
  3. You must authenticate by generating and entering your API key.
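For the third step, the bearer token used in the samples that follow can typically be generated from your API key. The following is a minimal sketch that assumes the IBM Software Hub authorize endpoint is available at /icp4d-api/v1/authorize; verify the endpoint path and payload against your platform documentation:

```shell
# Sketch: exchange a username and API key for a bearer token.
# The endpoint path is an assumption based on IBM Software Hub conventions;
# verify it against your platform documentation. Assumes jq is installed.
get_token() {
  curl -s -k "https://${HOST}/icp4d-api/v1/authorize" \
    -H 'Content-Type: application/json' \
    -d "{\"username\": \"${USERNAME}\", \"api_key\": \"${API_KEY}\"}" \
    | jq -r '.token'
}

# Usage: TOKEN=$(get_token), then pass it as -H "Authorization: Bearer ${TOKEN}"
```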

Creating a repository asset for the custom foundation model

Create a Watson Machine Learning repository asset for the custom foundation model by providing the model details.

The following code sample shows how to create the custom foundation model asset by using REST API:

curl -X POST "https://<HOST>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "content-type: application/json" \
--data '{
  "name":"custom foundation model asset",
  "space_id":"<space_id>",
  "foundation_model":{
    "model_id":"ibm/granite-3-1-8b-base" // use the same model_id you used to register the CFM in watsonxaiifm-cr
  },
  "type":"custom_foundation_model_1.0",
  "software_spec":{
    "name":"watsonx-cfm-caikit-1.1"
  }
}'

Deploying the custom foundation model asset

When you create an online deployment for your custom foundation model, you must set the enable_lora parameter to true in the JSON payload so that LoRA or QLoRA adapters can later be deployed on top of the custom foundation model.

The LoRA or QLoRA parameter values required to create the custom foundation model deployment can be set by the administrator or MLOps engineer.

For example, the administrator can set the values of parameters such as max_gpu_loras, max_cpu_loras, and max_lora_rank, as shown in the following code sample:

custom_foundation_models:
- location:
    pvc_name: ibm-granite-3-1-8b-base-pvc
  model_id: ibm-granite/granite-3.1-8b-base
  parameters:
  - default: true
    name: enable_lora
  - default: 10
    name: max_gpu_loras
  - default: 8
    name: max_cpu_loras
  - default: 4
    name: max_lora_rank

If the administrator sets the values of the custom foundation model parameters after registering the model in watsonxaiifm-cr, you can override the default values by specifying the updated values in the deployment payload.

The following code sample shows how to create an online deployment for the custom foundation model asset with REST API:

curl -X POST "https://<HOST>/ml/v4/deployments?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "content-type: application/json" \
--data '{
  "asset":{
    "id": "<asset_id>"  // WML base foundation model asset
  },
  "online":{
    "parameters":{
       "serving_name": "cfmservingname",
       "foundation_model": {
              "max_batch_weight": 10000,
              "max_sequence_length": 8192,
              "enable_lora": true,
              "max_gpu_loras": 8,                                
              "max_cpu_loras": 16,            
              "max_lora_rank": 32
       }
    }
  },
  "hardware_spec": {                        // Only one of "id" or "name" must be set.
    "id": "<hardware_spec_id>",
    "num_nodes": 1
  },
  "description": "Testing deployment using base foundation model",
  "name":"custom_fm_deployment",
  "space_id": "<space_id>"  // Either "project_id" or "space_id"; only one is allowed
}'

Running this code returns the deployment details, which include the deployment ID in the metadata.id field. The following excerpt shows the response:

{
  "entity": {
    "asset": {
      "id": "d92c00ab-5242-4861-8410-813529cfcdf5"
    },
    "custom": {

    },
    "deployed_asset_type": "custom_foundation_model",
    "description": "Granite Base Model Deployment",
    "hardware_spec": {
      "id": "e5ebf6cd-a6e0-4a90-8326-c743b59a752c",
      "name": "custom_hw_spec",
      "num_nodes": 1
    }
    ....
  }
}
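To reuse the deployment ID in the calls that follow, you can extract it from the creation response. The following is a minimal sketch, assuming jq is installed:

```shell
# Sketch: pull the deployment ID out of the creation response (assumes jq).
# RESPONSE here is a trimmed stand-in for the JSON returned by the POST call.
RESPONSE='{"metadata": {"id": "d92c00ab-5242-4861-8410-813529cfcdf5"}}'
DEPLOYMENT_ID=$(printf '%s' "$RESPONSE" | jq -r '.metadata.id')
echo "$DEPLOYMENT_ID"
```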

Polling for deployment status

Poll for the deployment status by using the deployment ID and wait until the state changes from initializing to ready.

curl -X GET "https://<HOST>/ml/v4/deployments/<deployment_id>?version=2024-01-29&space_id=<space_id>" \
-H "Authorization: Bearer <token>" 

After successful creation of the deployment, the polling status returns the deployed_asset_type as custom_foundation_model.

"deployed_asset_type": "custom_foundation_model"
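The GET call can be wrapped in a simple polling loop. The following is a sketch, assuming jq is installed and that the deployment state is reported in the entity.status.state field of the response:

```shell
# Sketch: poll the deployment until it leaves the initializing state.
# HOST, DEPLOYMENT_ID, SPACE_ID, and TOKEN are placeholders you must set.
fetch_state() {
  curl -s "https://${HOST}/ml/v4/deployments/${DEPLOYMENT_ID}?version=2024-01-29&space_id=${SPACE_ID}" \
    -H "Authorization: Bearer ${TOKEN}" | jq -r '.entity.status.state'
}

wait_until_ready() {
  while :; do
    state=$(fetch_state)
    echo "deployment state: ${state}"
    case "${state}" in
      ready)  return 0 ;;   # deployment is ready to serve
      failed) return 1 ;;   # deployment failed; inspect the full response
    esac
    sleep 10
  done
}

# Usage: call wait_until_ready before sending requests to the deployment.
```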

Creating the LoRA or QLoRA adapter model asset

If you did not enable the auto_update_model option during training, you must create a repository asset for the LoRA or QLoRA adapters.

If the auto_update_model option was enabled during training, the LoRA adapter model asset is already created in the Watson Machine Learning repository. In that case, you can proceed with creating a deployment for the LoRA adapter model asset.

After the training process for fine-tuning the LoRA or QLoRA adapter completes, the adapter content is stored in a PVC, in a directory that is named after the model. You must provide the name of this PVC as the foundation_model.model_id value in the model creation input payload.

The following code sample shows how to create the LoRA adapter model asset:

curl -X POST "https://<HOST>/ml/v4/models?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "content-type: application/json" \
--data '{
  "name":"lora adapter model asset",
  "space_id":"<space_id>",
  "foundation_model":{
    "model_id":"finetunedadapter-385316ea-1d25-4274-9550-e387a1355241" // this is the name of the pvc created post training
  },
  "type":"lora_adapter_1.0",
  "software_spec":{
    "name":"watsonx-cfm-caikit-1.1"
  },
  "training":{
    "base_model":{
      "model_id":"ibm/granite-3-1-8b-base" // your cfm model_id as registered in the watsonxaiifm-cr
    },
    "task_id":"summarization",
    "fine_tuning": {
         "peft_parameters": {
             "type": "lora",
             "rank": 8
         },
         "verbalizer": "<Replace with the verbalizer used for fine tuning>" // For example: Input: {input} Output:
    }
  }
}'

Deploying the LoRA or QLoRA adapter model asset

Deploy the LoRA or QLoRA adapters as an additional layer on top of the deployed custom foundation model.

The following code sample shows how to create a deployment for the LoRA adapter model:

curl -X POST "https://<HOST>/ml/v4/deployments?version=2024-01-29" \
-H "Authorization: Bearer <token>" \
-H "content-type: application/json" \
--data '{
  "asset":{
    "id": "<asset_id>"  // WML lora adapter model asset
  },
  "online":{
    "parameters":{
      "serving_name":"lora_adapter_dep"
    }
  },
  "base_deployment_id": "<replace with your WML custom foundation model deployment ID>",
  "description": "Testing deployment using lora adapter model",
  "name":"lora_adapter_deployment",
  "space_id": "<space_id>"  // Either "project_id" or "space_id"; only one is allowed
}'

Running this code returns the deployment details, which include the deployment ID in the metadata.id field. The following response excerpt shows the deployed_asset_type as lora_adapter:

"deployed_asset_type": "lora_adapter",
"description": "Lora Trained CFM Granite Deployment",
"name": "lora_adapter_deployment_medical",
"online": {
  "parameters": {
  }
}

Polling for deployment status

Poll for the deployment status by using the deployment ID and wait until the state changes from initializing to ready.

curl -X GET "https://<HOST>/ml/v4/deployments/<deployment_id>?version=2024-01-29&space_id=<space_id>" \
-H "Authorization: Bearer <token>" 

After successful creation of the deployment, the polling status returns the deployed_asset_type as lora_adapter.

"deployed_asset_type": "lora_adapter"

Parent topic: Deploying fine-tuned custom foundation models