Workload controller scale actions

Actions associated with a workload controller scale replicas horizontally to maintain Service Level Objectives (SLOs) for your applications. The workload controller is a natural place to represent these actions because the number of replicas is modified on the parent controller (for example, the Deployment), and the workload controller then rolls out the changes in the running environment.

For example, when the current response time for an application violates its SLO, Turbonomic recommends increasing the number of replicas to improve response time. If applications can meet their SLOs with fewer resources, Turbonomic recommends reducing the replica count to improve infrastructure efficiency.
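For reference, a horizontal scale action ultimately changes the replicas value on the parent controller, and the controller rolls the change out to the pods. The following minimal Deployment sketch (the names and values are placeholders, not taken from any specific environment) shows the field that such an action adjusts:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app                     # placeholder name
    spec:
      replicas: 3                           # value that a horizontal scale action adjusts
      selector:
        matchLabels:
          app: example-app
      template:
        metadata:
          labels:
            app: example-app
        spec:
          containers:
            - name: example-app
              image: example-registry/example-app:latest   # placeholder image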

Action generation requirements

Turbonomic generates workload controller scale actions under the following conditions:

Note:

For GenAI LLM inference workloads, see the next section.

  • Services are discovered by the Kubeturbo agent that you deployed to your cluster.

  • Application performance metrics for services are collected through the Instana or Dynatrace target or the Prometurbo metrics server.

    Prometurbo collects application performance metrics from Prometheus, and then exposes the applications and metrics in JSON format through its REST API. The Data Ingestion Framework (DIF) accesses the REST API and converts the JSON output to a data transfer object (DTO) that Turbonomic consumes.

    To collect metrics through Prometurbo, deploy Prometurbo and enable metrics collection.

  • You have created service policies and configured SLOs for Response Time and Transactions in those policies.

Scale actions for GenAI LLM inference workloads

For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic generates workload controller scale actions to maintain SLOs for the following:

  • Concurrent Queries

    For LLM inference workloads, the Concurrent Queries SLO uses TGI, vLLM, or custom framework metrics and measures the number of concurrent queries to a workload. When there are no requests, concurrent queries is zero.

  • Queuing Time

    For LLM inference workloads, the Queuing Time SLO uses TGI, vLLM, or custom framework metrics and measures the amount of time that a request spends in a queue before it is processed. When there are no requests, queuing time is zero.

  • Service Time

    For LLM inference workloads, the Service Time SLO uses TGI, vLLM, or custom framework metrics and measures the processing time needed to generate the next token. This metric is relatively stable for a given model and GPU resource. When there are no requests, service time is unavailable.

  • Response Time

    For LLM inference workloads, the Response Time SLO uses TGI, vLLM, or custom framework metrics and measures the turnaround time for each request, including both queuing time and service time. When there are no requests, response time is unavailable.

  • Transactions

    For LLM inference workloads, the Transactions SLO uses TGI, vLLM, or custom framework metrics and measures the total number of tokens per second, which includes both input tokens and generated tokens. When there are no requests, transactions is zero.

  • LLM Cache

    For LLM inference workloads, the LLM Cache SLO uses a vLLM metric, the percentage of KV cache usage, which tracks the usage of the key-value cache in GPU memory. Efficient use of this cache can significantly reduce memory requirements and enhance performance.
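These SLO commodities are derived from metrics that the inference framework exposes, typically on a Prometheus /metrics endpoint. As a hedged illustration only (metric names vary by framework and version, so treat the names below as assumptions and verify them against your own endpoint), scraped vLLM samples might look like the following, where running and waiting request counts feed Concurrent Queries and the KV cache gauge feeds LLM Cache:

    # Illustrative vLLM samples; names are assumptions, check your framework's /metrics output
    vllm:num_requests_running 3        # requests currently being processed
    vllm:num_requests_waiting 1        # requests waiting in the queue
    vllm:gpu_cache_usage_perc 0.42     # fraction of the KV cache in GPU memory that is in use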

Note:

It is assumed that you have set up an LLM inference service on a cluster that has an NVIDIA GPU attached.

If you set up KServe in your environment, see the Support for KServe section in this topic for information about the level of support for KServe.

Requirements for generating scale actions for GenAI LLM inference workloads

The following diagram illustrates how the components described below work together to support scale actions for LLM inference workloads. To configure these components properly, review the requirements listed for each component.

Components that support scaling of LLM inference workloads:

NVIDIA DCGM (Data Center GPU Manager): NVIDIA DCGM is deployed as DaemonSet pods and collects GPU metrics. DCGM exposes these metrics as APIs.
DCGM exporter for Prometheus: The DCGM exporter for Prometheus is deployed as DaemonSet pods and collects GPU metrics from DCGM.

DCGM exporter exposes the data for Prometheus to scrape, connects to the Kubelet pod resources API to identify GPU devices associated with a container pod, and then appends the GPU devices to the metrics.

TGI (Text Generation Inference) or vLLM metrics: TGI or vLLM metrics are exposed directly by the LLM-serving services on predefined ports.

Prometheus server: The Prometheus server is configured to scrape both GPU and TGI metrics from the DCGM exporter and the TGI service endpoint.

The Prometheus server makes these metrics available through PromQL queries.

Kubeturbo agent: Kubeturbo is deployed to your cluster.

Kubeturbo monitors container platform entities and collects standard metrics for these entities.

Prometurbo agent: Prometurbo is deployed to your cluster and Prometurbo metrics collection is enabled.

Prometurbo connects to the Prometheus server and sends PromQL queries to collect GPU and TGI metrics.

Prometurbo requires these custom resources (CRs); a minimal configuration sketch follows the components list:

  • PrometheusQueryMapping – specifies the GPU and TGI metrics to collect

  • PrometheusServerConfig – specifies settings for your Prometheus server

Turbonomic supply chain and charts: Turbonomic stitches the entities discovered by Prometurbo and Kubeturbo into the supply chain. When you set the scope to container platform entities, charts show GPU and TGI metrics.
  • GPU metrics include GPU (utilization of Tensor cores) and GPU memory (utilization of framebuffer memory).

  • TGI metrics include Concurrent Queries, Queuing Time, Service Time, Response Time, Transactions, and LLM Cache.

Turbonomic calculates 10-minute and 1-hour moving averages, and then uses the maximum of the two. This mechanism allows for faster generation of scale-up actions and slower, more conservative generation of scale-down actions.

Turbonomic service policies: Service policies are created for the services associated with the LLM inference workloads. In these policies:
  • Scope to the relevant services.

    Tip:

    Create a group of services from Settings > Groups and then specify the group as your scope.

  • Turn on Horizontal Scale Down and Horizontal Scale Up actions.

  • Enable SLOs for Concurrent Queries, Queuing Time, Service Time, Response Time, Transactions, and LLM Cache, and then specify your preferred SLO values.

Turbonomic generates workload controller scale actions to maintain the SLOs that you defined in the policies. See the next section for information about the generated actions.
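For reference, the PrometheusQueryMapping and PrometheusServerConfig CRs listed for the Prometurbo agent component are published as samples in the IBM/turbonomic-container-platform repository (see the steps in the next section). As a minimal sketch only, assuming a Prometheus server reachable at an in-cluster address, a PrometheusServerConfig might look like the following; the spec field names, namespace, and address shown here are assumptions, so start from the published samples for your deployment:

    apiVersion: metrics.turbonomic.io/v1alpha1
    kind: PrometheusServerConfig
    metadata:
      name: prometheus-server-config                           # placeholder name
      namespace: turbo                                         # assumption: namespace where Prometurbo runs
    spec:
      address: http://prometheus-server.monitoring.svc:9090    # assumption: your Prometheus endpoint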

MIG-aware horizontal scale actions for Kubernetes GenAI LLM workloads

If your NVIDIA GPUs are partitioned using multi-instance GPU (MIG), Turbonomic recognizes the GPU partitions and recommends MIG-aware horizontal scale actions for Kubernetes GenAI LLM workloads accordingly. To generate scale actions, Turbonomic analyzes GPU metrics that Prometurbo collected from a customer-managed Prometheus instance.

To support this feature, Turbonomic accepts multiple labels from the scraped GPU metrics and combines them to generate a single identifier, as specified in this sample resource definition.

name: id
labels:
    - UUID
    - GPU_I_ID

Because all of the GPU partitions share the exact same GPU UUID, they would otherwise be accounted for as a single GPU entity. By combining multiple labels (the GPU UUID and the GPU instance ID in this case), Turbonomic generates a unique ID for every MIG partition and recognizes that multiple GPUs are available. The number of detected entities matches the number of GPUs reported by the allocatable resources of the node. Horizontal scale actions then scale AI workloads up or down based on the available MIG partitions.
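As an illustration of the label combination (the metric name and label values below are examples, not output from a specific system), two MIG partitions on the same physical GPU share the UUID label but differ in GPU_I_ID, so combining the two labels yields a distinct identifier per partition:

    # Two illustrative samples scraped from the DCGM exporter
    DCGM_FI_DEV_FB_USED{UUID="GPU-1a2b3c4d", GPU_I_ID="1"} 16384
    DCGM_FI_DEV_FB_USED{UUID="GPU-1a2b3c4d", GPU_I_ID="2"} 16384
    # Combining UUID and GPU_I_ID gives each partition its own identifier,
    # so Turbonomic counts two GPU entities instead of one.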

Note:

You must define additional constraints if multiple MIG configurations are used in the same cluster so that workloads are not placed on a mismatched partition.
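For example (a hedged sketch, assuming that the NVIDIA GPU Operator exposes MIG partitions as extended resources such as nvidia.com/mig-1g.5gb; profile names depend on your GPU model and MIG strategy), you can constrain a workload to a specific MIG profile through its container resource requests:

    # Container spec fragment: request a specific MIG profile so the pod
    # is scheduled only on a matching partition (the profile name is an example)
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1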

To take advantage of this feature, enable metrics collection for Prometurbo. For more information, see Enabling metrics collection.

If you enabled metrics collection before version 8.15.4, perform these steps after updating to version 8.15.4 or later.

  1. Apply the Prometurbo CRDs.

    kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusquerymappings.yaml
    kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusserverconfigs.yaml
  2. Apply the PrometheusQueryMapping (PQM) custom resource.

    kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/samples/metrics_v1alpha1_nvidia-dcgm-exporter.yaml
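Optionally, confirm that the CRDs and the custom resource were created. The resource names below follow from the CRD files that you applied in step 1:

    kubectl get crd prometheusquerymappings.metrics.turbonomic.io prometheusserverconfigs.metrics.turbonomic.io
    kubectl get prometheusquerymappings --all-namespaces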

Support for KServe

Turbonomic can automate scale actions for workloads served by KServe in Red Hat OpenShift AI. KServe is a scalable and standards-based Model Inference Platform on Kubernetes for Trusted AI. Inference services deployed by KServe are managed by InferenceService Custom Resources. The number of workload replicas in an InferenceService is dictated by the minimum and maximum replicas in the predictor, transformer, or explainer configuration in the InferenceService spec.

Turbonomic scale actions for InferenceService workloads are generated and executed as follows:

  • Turbonomic automatically detects the minimum and maximum replica values configured in the InferenceService spec and updates those values, instead of changing the spec.replicas of the Deployment directly. Because the replicas are updated automatically, you do not need any additional setup or configuration in KServe.

  • The KServe controller scales the workloads accordingly.

KServe can be deployed in RawDeployment or Serverless mode. Turbonomic supports both modes and detects the mode automatically, so you do not need any additional configuration in Turbonomic to define the mode for your inference services.
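For reference, a minimal InferenceService sketch is shown below (the name, model format, and storage location are placeholders), with the replica bounds on the predictor that Turbonomic adjusts:

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: example-llm                        # placeholder name
    spec:
      predictor:
        minReplicas: 1                         # lower bound that Turbonomic updates
        maxReplicas: 4                         # upper bound that Turbonomic updates
        model:
          modelFormat:
            name: huggingface                  # placeholder model format
          storageUri: pvc://models/example-llm # placeholder model location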

Support for Knative Serving

Turbonomic can automate scale actions for workloads served by Knative Serving, an open-source enterprise solution for building serverless and event-driven applications. Knative Serving defines a set of objects as Kubernetes Custom Resource Definitions (CRDs), which are then used to define and control how serverless workloads operate in the cluster.

In the Custom Resource for the Knative Service, the minimum and maximum workload replicas are defined by the following annotations under spec.template.metadata.annotations:

  • autoscaling.knative.dev/min-scale

  • autoscaling.knative.dev/max-scale

Scale actions for workloads are generated and executed as follows:

  • Turbonomic automatically detects the minimum and maximum replica values configured in the Knative Service Custom Resource (CR) and updates those values, instead of changing the spec.replicas of the Deployment directly. Because the replicas are updated automatically, you do not need any additional setup or configuration in Knative Serving.

  • The Knative Serving controller scales the workloads accordingly.
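For reference, a minimal Knative Service sketch is shown below (the name and image are placeholders), with the annotations that Turbonomic updates:

    apiVersion: serving.knative.dev/v1
    kind: Service
    metadata:
      name: example-llm                                  # placeholder name
    spec:
      template:
        metadata:
          annotations:
            autoscaling.knative.dev/min-scale: "1"       # lower bound that Turbonomic updates
            autoscaling.knative.dev/max-scale: "5"       # upper bound that Turbonomic updates
        spec:
          containers:
            - image: example-registry/example-llm:latest # placeholder image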

Action visibility

Turbonomic shows and executes SLO-driven scale actions through workload controllers. A single scale action represents the total number of replicas that you need to scale in or out to meet your SLOs.

Action center page with Scale highlighted

When you examine an action, SLO is indicated as the reason for the action.

Action Details page with graphs and reason for action highlighted