Workload controller scale actions
Actions associated with a workload controller scale replicas horizontally to maintain Service Level Objectives (SLOs) for your applications. Representing these actions on the workload controller is natural because the number of replicas is modified on the parent controller (for example, a Deployment), which then rolls out the change to the running environment.
For example, when the current response time for an application violates its SLO, Turbonomic recommends increasing the number of replicas to improve response time. If applications can meet their SLOs with fewer resources, Turbonomic recommends reducing the replica count to improve infrastructure efficiency.
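For reference, the following minimal sketch shows the field that such an action ultimately adjusts on a workload controller. The Deployment name, replica count, and image are placeholders for illustration only.

```yaml
# Illustrative Deployment fragment. A horizontal scale action changes the replica
# count on the parent controller; the controller then rolls out the new pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service            # hypothetical workload name
spec:
  replicas: 3                       # the value that a scale action raises or lowers
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0   # hypothetical image
```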
Action generation requirements
For GenAI LLM inference workloads, see the next section.
Turbonomic generates workload controller scale actions under the following conditions:
- Services are discovered by the Kubeturbo agent that you deployed to your cluster.
- Application performance metrics for services are collected through the Instana or Dynatrace target, or through the Prometurbo metrics server.

  Prometurbo collects application performance metrics from Prometheus and then exposes the applications and metrics in JSON format through its REST API. The Data Ingestion Framework (DIF) reads the REST API output and converts the JSON to a DTO that Turbonomic consumes.

  To collect metrics through Prometurbo, deploy Prometurbo and enable metrics collection (see the configuration sketch after this list).
- You have created service policies and configured SLOs for Response Time and Transactions in those policies.
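As one example of the Prometurbo setup, the following sketch shows a PrometheusServerConfig custom resource that points Prometurbo at a Prometheus server. The resource name, namespace, and server address are placeholders, and the exact schema can vary by release, so treat the sample custom resources in the IBM/turbonomic-container-platform repository as the authoritative reference.

```yaml
# Hedged sketch of a PrometheusServerConfig CR; names and addresses are illustrative
# and should be checked against the published samples for your release.
apiVersion: metrics.turbonomic.io/v1alpha1
kind: PrometheusServerConfig
metadata:
  name: prometheus-server-config                       # hypothetical name
  namespace: turbo                                     # hypothetical namespace
spec:
  address: http://prometheus-server.monitoring:9090    # your Prometheus endpoint
```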
Scale actions for GenAI LLM inference workloads
For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic generates workload controller scale actions to maintain SLOs for the following:
- Concurrent Queries

  For LLM inference workloads, the Concurrent Queries SLO uses TGI, vLLM, or custom framework metrics and measures the number of concurrent queries to a workload. When there are no requests, concurrent queries is zero.

- Queuing Time

  For LLM inference workloads, the Queuing Time SLO uses TGI, vLLM, or custom framework metrics and measures the amount of time that a request spends in a queue before it is processed. When there are no requests, queuing time is zero.

- Service Time

  For LLM inference workloads, the Service Time SLO uses TGI, vLLM, or custom framework metrics and measures the processing time needed to generate the next token. This metric is relatively stable for a given model and GPU resource. When there are no requests, service time is unavailable.

- Response Time

  For LLM inference workloads, the Response Time SLO uses TGI, vLLM, or custom framework metrics and measures the turnaround time for each request, including both queuing time and service time. When there are no requests, response time is unavailable.

- Transactions

  For LLM inference workloads, the Transactions SLO uses TGI, vLLM, or custom framework metrics and measures the total number of tokens per second, including both input tokens and generated tokens. When there are no requests, transactions is zero.

- LLM Cache

  For LLM inference workloads, the LLM Cache SLO uses a vLLM metric, the percentage of KV cache usage, which tracks the usage of the key-value cache in GPU memory. Efficient use of this cache can significantly reduce memory requirements and enhance performance.

An illustrative pairing of these SLOs with framework metrics follows this list.
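As an illustration only, the following sketch pairs each SLO above with a PromQL expression over metrics that vLLM commonly exposes. The keys simply repeat the SLO names from this list and are not Turbonomic configuration keys, and the metric names and label sets vary by vLLM or TGI version, so verify them against the /metrics endpoint of your serving framework.

```yaml
# Illustrative SLO-to-metric pairing (not a Turbonomic configuration file).
# Metric names follow common vLLM exporters and may differ in your environment.
concurrentQueries: 'vllm:num_requests_running'
queuingTime: 'rate(vllm:request_queue_time_seconds_sum[5m]) / rate(vllm:request_queue_time_seconds_count[5m])'
serviceTime: 'rate(vllm:time_per_output_token_seconds_sum[5m]) / rate(vllm:time_per_output_token_seconds_count[5m])'
responseTime: 'rate(vllm:e2e_request_latency_seconds_sum[5m]) / rate(vllm:e2e_request_latency_seconds_count[5m])'
transactions: 'rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m])'
llmCache: 'vllm:gpu_cache_usage_perc'
```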
It is assumed that you have set up an LLM inference service on a cluster that has an NVIDIA GPU attached.
If you set up KServe in your environment, see the Support for KServe section in this topic for information about the level of support for KServe.
Requirements for generating scale actions for GenAI LLM inference workloads
The following diagram illustrates how the components listed in the next table work together to support scale actions for LLM inference workloads. To configure these components properly, review the requirements listed in the table.
| Component | Requirements |
|---|---|
| NVIDIA DCGM (Data Center GPU Manager) | NVIDIA DCGM is deployed as DaemonSet pods and collects GPU metrics. DCGM exposes these metrics as APIs. |
| DCGM exporter for Prometheus | DCGM exporter for Prometheus is deployed as DaemonSet pods and collects GPU metrics from DCGM. DCGM exporter exposes the data for Prometheus to scrape, connects to the Kubelet pod resources API to identify GPU devices associated with a container pod, and then appends the GPU devices to the metrics. |
| TGI (Text Generation Inference) or vLLM metrics | TGI or vLLM metrics are exposed directly by the LLM-serving services on predefined ports. |
| Prometheus server | The Prometheus server is configured to scrape both GPU and TGI metrics from the DCGM exporter and the TGI service endpoint, and makes these metrics available through PromQL queries. An illustrative scrape configuration follows this table. |
| Kubeturbo agent | Kubeturbo is deployed to your cluster. Kubeturbo monitors container platform entities and collects standard metrics for these entities. |
| Prometurbo agent | Prometurbo is deployed to your cluster and Prometurbo metrics collection is enabled. Prometurbo connects to the Prometheus server and sends PromQL queries to collect GPU and TGI metrics. Prometurbo requires the PrometheusServerConfig and PrometheusQueryMapping custom resources (CRs). |
| Turbonomic supply chain and charts | Turbonomic stitches the entities discovered by Prometurbo and Kubeturbo into the supply chain. When you set the scope to container platform entities, charts show GPU and TGI metrics. Turbonomic calculates 10-minute and 1-hour moving averages and then uses the maximum of the two. This mechanism allows for faster generation of scale-up actions and slower, more conservative generation of scale-down actions. |
| Turbonomic service policies | Service policies are created for the services associated with the LLM inference workloads, and SLOs are configured in those policies for the metrics described earlier in this topic. Turbonomic generates workload controller scale actions to maintain the SLOs that you defined in the policies. See the next section for information about the generated actions. |
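The table above assumes that Prometheus already scrapes the DCGM exporter and the LLM-serving endpoints. The following scrape configuration is a minimal sketch under that assumption; the job names, service discovery rules, and target address are placeholders that you must adapt to your cluster.

```yaml
# Minimal, illustrative Prometheus scrape configuration.
scrape_configs:
  - job_name: dcgm-exporter                 # GPU metrics from the DCGM exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter                # hypothetical service name
        action: keep
  - job_name: llm-inference                 # TGI or vLLM metrics from the serving endpoint
    static_configs:
      - targets: ['llm-service.llm-ns:8080']   # hypothetical service address and port
```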
MIG-aware horizontal scale actions for Kubernetes GenAI LLM workloads
If your NVIDIA GPUs are partitioned with multi-instance GPU (MIG), Turbonomic recognizes the GPU partitions and recommends MIG-aware horizontal scale actions for Kubernetes GenAI LLM workloads. To generate these scale actions, Turbonomic analyzes GPU metrics that Prometurbo collects from a customer-managed Prometheus instance.
To support this feature, Turbonomic accepts multiple labels from the scraped GPU metrics and combines them to generate a single identifier, as specified in this sample resource:
name: id
labels:
- UUID
- GPU_I_ID
Because all of the GPU partitions share the same GPU UUID, they would otherwise be accounted for as a single GPU entity. By combining multiple labels (the GPU UUID and the GPU instance ID in this case), Turbonomic generates a unique ID for every MIG partition and recognizes that multiple GPUs are available. The number of detected entities matches the number of GPUs reported in the node's allocatable resources. Horizontal scale actions then scale AI workloads up or down based on the available MIG partitions.
If multiple MIG configurations are used in the same cluster, you must define additional constraints so that workloads are not placed on a mismatched partition.
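For example, one way to constrain a workload to a specific MIG profile is to request the corresponding extended resource and, optionally, select nodes by their MIG configuration label. The profile name, node label value, and image in the following sketch are placeholders that depend on your GPU model and on the MIG strategy configured by the NVIDIA device plugin or GPU operator.

```yaml
# Illustrative pod template fragment that pins a workload to one MIG profile.
spec:
  nodeSelector:
    nvidia.com/mig.config: all-1g.10gb        # hypothetical MIG configuration label value
  containers:
    - name: llm-server
      image: example.com/llm-server:latest    # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1           # request one MIG partition of this profile
```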
To take advantage of this feature, enable metrics collection for Prometurbo. For more information, see Enabling metrics collection.
If you enabled metrics collection before version 8.15.4, perform these steps after updating to version 8.15.4 or later.
- Apply the Prometurbo CRDs.

  kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusquerymappings.yaml
  kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusserverconfigs.yaml

- Apply the PrometheusQueryMapping (PQM) custom resource.

  kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/samples/metrics_v1alpha1_nvidia-dcgm-exporter.yaml
Support for KServe
Turbonomic can automate scale actions for workloads served by KServe in Red Hat OpenShift AI. KServe is a scalable and standards-based Model Inference Platform on Kubernetes for Trusted AI. Inference services deployed by KServe are managed by InferenceService custom resources. The number of workload replicas in an InferenceService is dictated by the minimum and maximum replicas in the predictor, transformer, or explainer configuration in the InferenceService spec.
Turbonomic scale actions for InferenceService workloads are generated and executed as follows:
- Turbonomic automatically detects the InferenceService and updates the minimum and maximum replicas in the InferenceService spec, instead of changing the spec.replicas of the Deployment directly (see the example after this list). Because replicas are updated automatically, you do not need any additional setup or configuration in KServe.
- The KServe controller scales the workloads accordingly.
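The following sketch shows where those replica bounds sit in an InferenceService; the service name, model format, and storage URI are placeholders for illustration.

```yaml
# Illustrative InferenceService. Turbonomic scale actions adjust minReplicas and
# maxReplicas on the predictor (or transformer/explainer), not spec.replicas on the Deployment.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-inference-sample         # hypothetical name
spec:
  predictor:
    minReplicas: 1                   # bounds that Turbonomic detects and updates
    maxReplicas: 4
    model:
      modelFormat:
        name: huggingface            # hypothetical model format
      storageUri: pvc://models/llm   # hypothetical model location
```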
KServe can be deployed in RawDeployment or Serverless mode. Turbonomic supports both modes and detects and updates the inference services automatically, so you do not need any additional configuration in Turbonomic to define the mode for your inference services.
Support for Knative Serving
Turbonomic can automate scale actions for workloads served by Knative Serving, an open-source enterprise solution for building serverless and event-driven applications. Knative Serving defines a set of objects as Kubernetes Custom Resource Definitions (CRDs), which are then used to define and control how serverless workloads operate in the cluster.
In the custom resource for the Knative Service, the minimum and maximum workload replicas are defined as the following annotations under spec.template.metadata.annotations (see the example after this list):
- autoscaling.knative.dev/min-scale
- autoscaling.knative.dev/max-scale
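The following sketch shows where these annotations appear in a Knative Service; the service name and container image are placeholders for illustration.

```yaml
# Illustrative Knative Service. Turbonomic scale actions adjust the min-scale and
# max-scale annotations, not spec.replicas on the underlying Deployment.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-knative-sample                       # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "1"   # bounds that Turbonomic detects and updates
        autoscaling.knative.dev/max-scale: "4"
    spec:
      containers:
        - image: example.com/llm-server:latest   # hypothetical image
```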
Scale actions for workloads are generated and executed as follows:
- Turbonomic automatically detects the Knative Service custom resource (CR) and updates the minimum and maximum replicas in the CR, instead of changing the spec.replicas of the Deployment directly. Because replicas are updated automatically, you do not need any additional setup or configuration in Knative Serving.
- The Knative Serving controller scales the workloads accordingly.
Action visibility
Turbonomic shows and executes SLO-driven scale actions through workload controllers. A single scale action represents the total number of replicas that you need to scale in or out to meet your SLOs.
When you examine an action, SLO is indicated as the reason for the action.