Requirements for deployment
Review supported model architectures, software requirements, and hardware requirements for deploying custom foundation models that are fine-tuned with Parameter-Efficient Fine-Tuning (PEFT) techniques.
Supported model architectures
You can deploy custom foundation models that are fine-tuned with PEFT techniques for the following model architectures:
| Model architecture | Model name and size | LoRA | QLoRA |
|---|---|---|---|
| GraniteForCausalLM | Granite PowerLM 3B, Granite 3.1 1B, Granite 3.1 2B, Granite 3.1 3B, Granite 3.1 8B, Granite 3.0 2B, Granite 3.0 8B | Yes | Yes |
| GraniteMoeForCausalLM | GraniteMoE 1B, GraniteMoE 3B | Yes | N/A |
| LlamaForCausalLM | Granite 3B, Granite 8B | Yes | Yes |
| GPTBigCodeForCausalLM | Granite 13B, Granite 20B | Yes | Yes |
| GPTBigCodeForCausalLM | Granite 34B | Yes | Yes |
| Llama 3.1 | Llama3.1-8B | Yes | Yes |
| Llama 3.1 | Llama3.1-70B | Yes | Yes |
| Llama 3.1 | Llama3.1-405B | No | Yes |
| Llama 3 | Llama3-8B | Yes | Yes |
| Llama 3 | Llama3-70B | Yes | Yes |
| LlamaForCausalLM | aLLaM-13b | Yes | Yes |
| Mixtral | Mixtral 8x7B | Yes | Yes |
| Mistral | Mistral-7b | Yes | Yes |
Software requirements
You can use the watsonx-cfm-caikit-1.1 software specification, which is based on the vLLM runtime engine, to deploy your custom foundation model that is fine-tuned with a PEFT technique.
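For example, if you work with the ibm_watsonx_ai Python SDK, you can look up the ID of this software specification by name before you create the deployment. The following sketch is illustrative only; the URL, API key, and deployment space ID are placeholders that you must replace with your own values.

```python
# Minimal sketch, assuming the ibm_watsonx_ai Python SDK and an existing
# deployment space. The credentials and IDs below are placeholders.
from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    url="https://<your-watsonx-instance-url>",  # placeholder
    api_key="<your-api-key>",                   # placeholder
)
client = APIClient(credentials, space_id="<your-deployment-space-id>")

# Look up the ID of the watsonx-cfm-caikit-1.1 software specification,
# which is based on the vLLM runtime engine.
sw_spec_id = client.software_specifications.get_id_by_name("watsonx-cfm-caikit-1.1")
print(sw_spec_id)
```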
Hardware requirements
Deploying a custom foundation model that is fine-tuned with a PEFT technique is a resource-intensive process that requires GPU resources.
When you choose the hardware specification for deploying your custom foundation model, consider the number of LoRA or QLoRA adapters that you want to deploy with the model and the size of each adapter.
The standard supported hardware configurations to deploy custom foundation models are:
- NVIDIA A100 with 80 GB of GPU memory
- NVIDIA H100 with 80 GB of GPU memory
- NVIDIA L40S with 48 GB of GPU memory
Supported hardware specifications
Based on the size and the number of parameters used in your model, choose a hardware specification to deploy your fine-tuned custom foundation model.
You can use the following predefined hardware specifications for deployment:
| Parameter range | Hardware specification | Resources available |
|---|---|---|
| 1B to 20B | WX-S | 1 GPU, 2 CPU and 60 GB |
| 21B to 40B | WX-M | 2 GPU, 3 CPU and 120 GB |
| 41B to 80B | WX-L | 4 GPU, 5 CPU and 240 GB |
| 81B to 200B | WX-XL | 8 GPU, 9 CPU and 600 GB |
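If you work with the ibm_watsonx_ai Python SDK, you can reference a predefined hardware specification from the preceding table by name. The following sketch assumes the APIClient from the earlier software specification example and uses the WX-S specification as an example for a model in the 1B to 20B parameter range.

```python
# Minimal sketch, assuming an existing APIClient (see the earlier example).
# Look up the ID of the predefined WX-S hardware specification by name.
hw_spec_id = client.hardware_specifications.get_id_by_name("WX-S")
print(hw_spec_id)
```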
Creating customized hardware specifications
You might need to create a customized hardware specification if your model requires a configuration that the predefined specifications do not provide, if its size falls outside the predefined parameter ranges, or if you want to optimize for performance or cost-effectiveness. For more information, see Creating custom hardware specifications.
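If you work with the ibm_watsonx_ai Python SDK, a customized hardware specification can be stored as shown in the following sketch. The metadata field names and node layout are based on the SDK's hardware specification metadata and might differ in your SDK version; the specification name and resource values are illustrative placeholders.

```python
# Minimal sketch, assuming an existing APIClient (see the earlier examples).
# The field names (NAME, DESCRIPTION, NODES) and the node layout are assumptions;
# check the SDK reference for the exact schema that your version supports.
custom_hw_meta = {
    client.hardware_specifications.ConfigurationMetaNames.NAME: "custom-peft-gpu-spec",
    client.hardware_specifications.ConfigurationMetaNames.DESCRIPTION:
        "Custom specification for a fine-tuned model between the WX-S and WX-M ranges",
    client.hardware_specifications.ConfigurationMetaNames.NODES: {
        "cpu": {"units": "3"},
        "mem": {"size": "90Gi"},
        "gpu": {"num_gpu": 2},
    },
}

custom_hw_details = client.hardware_specifications.store(custom_hw_meta)
custom_hw_id = client.hardware_specifications.get_id(custom_hw_details)
```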
Supported deployment types
You can create an online deployment to deploy custom foundation models that are tuned with PEFT techniques. Online deployment allows for real-time inferencing and is suitable for applications that require low-latency predictions.
Batch deployments are not currently supported for deploying custom foundation models tuned with PEFT.
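If you work with the ibm_watsonx_ai Python SDK, an online deployment can be created as shown in the following sketch. The model asset ID and deployment name are placeholders, and the hardware specification ID comes from the earlier examples.

```python
# Minimal sketch, assuming an existing APIClient, a stored custom foundation
# model asset, and the hardware specification ID from the earlier examples.
deployment_meta = {
    client.deployments.ConfigurationMetaNames.NAME: "peft-cfm-online-deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"id": hw_spec_id},
}

deployment_details = client.deployments.create(
    "<your-model-asset-id>",  # placeholder for the stored model asset ID
    meta_props=deployment_meta,
)
deployment_id = client.deployments.get_id(deployment_details)
```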
Parent topic: Deploying fine-tuned custom foundation models