Workload controller scale actions
Actions associated with a workload controller scale replicas horizontally to maintain Service Level Objectives (SLOs) for your applications. Representing these actions on the workload controller is natural because it is the parent controller's replica count that is modified. The workload controller then rolls out the change in the running environment.
For example, when the current response time for an application violates its SLO, Turbonomic recommends increasing the number of replicas to improve response time. If applications can meet their SLOs with fewer resources, Turbonomic recommends reducing the replica count to improve infrastructure efficiency.
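To illustrate what such an action changes, the following is a minimal sketch, assuming a hypothetical Deployment named payment-service. The scale action adjusts only the controller's replica count, and Kubernetes then reconciles the pods.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service            # hypothetical workload controller
spec:
  replicas: 3                      # the value that a workload controller scale action adjusts
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: app
          image: example.com/payment-service:1.0   # placeholder image
```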
Action generation requirements
Turbonomic generates workload controller scale actions under the following conditions. For GenAI LLM inference workloads, see the next section.

- Services are discovered by the Kubeturbo agent that you deployed to your cluster.
- Application performance metrics for services are collected through the Instana or Dynatrace target, or through the Prometurbo metrics server. Prometurbo collects application performance metrics from Prometheus, and then exposes the applications and metrics in JSON format through the REST API. The Data Ingestion Framework (DIF) accesses the REST API and converts the JSON output to a DTO that Turbonomic consumes. To collect metrics through Prometurbo, deploy Prometurbo and enable metrics collection.
- You have created service policies and configured SLOs for Response Time and Transactions in those policies.
Scale actions for GenAI LLM inference workloads
For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic generates workload controller scale actions to maintain SLOs for the following GPU metrics:
- Concurrent Queries
- Queueing Time
- Service Time
- Response Time
- Transactions
It is assumed that you have set up an LLM inference service on a cluster that has an NVIDIA GPU attached.
If you set up KServe in your environment, see the Support for KServe section in this topic for information about the level of support for KServe.
Requirements for generating scale actions for GenAI LLM inference workloads
The following diagram illustrates how the components listed in the next table work together to support scale actions for LLM inference workloads. To configure these components properly, review the requirements listed in the table.
Component | Requirements |
---|---|
NVIDIA DCGM (Data Center GPU Manager) | NVIDIA DCGM is deployed as DaemonSet pods and collects GPU metrics. DCGM exposes these metrics as APIs. |
DCGM exporter for Prometheus | DCGM exporter for Prometheus is deployed as DaemonSet pods and collects GPU metrics from DCGM. DCGM exporter exposes the data for Prometheus to scrape, connects to the Kubelet pod resources API to identify the GPU devices associated with a container pod, and then appends the GPU devices to the metrics. |
TGI (Text Generation Inference) or vLLM metrics | TGI or vLLM metrics are exposed directly by the LLM-serving services on predefined ports. |
Prometheus server | The Prometheus server is configured to scrape both GPU and TGI metrics from the DCGM exporter and the TGI service endpoint. The Prometheus server makes these metrics available through PromQL queries. |
Kubeturbo agent | Kubeturbo is deployed to your cluster. Kubeturbo monitors container platform entities and collects standard metrics for these entities. |
Prometurbo agent | Prometurbo is deployed to your cluster and Prometurbo metrics collection is enabled. Prometurbo connects to the Prometheus server and sends PromQL queries to collect GPU and TGI metrics. Prometurbo requires these CRs: |
Turbonomic supply chain and charts | Turbonomic stitches the entities discovered from Prometurbo and Kubeturbo into the supply chain. When you set the scope to container platform entities, charts show GPU and TGI metrics. Turbonomic calculates 10-minute and 1-hour moving averages, and then uses the maximum of the two. This mechanism allows for faster generation of scale-up actions, and slower, more conservative generation of scale-down actions. |
Turbonomic service policies | Service policies are created for the services associated with the LLM inference workloads, and SLOs are configured in these policies. Turbonomic generates workload controller scale actions to maintain the SLOs that you defined in the policies. See the next section for information about the generated actions. |
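As one way to satisfy the Prometheus server requirement, the server can be configured to scrape the DCGM exporter and the LLM-serving endpoint. The following is a minimal sketch of a prometheus.yml scrape configuration, assuming a Service named dcgm-exporter and a hypothetical TGI Service named llm-tgi on port 80 in the default namespace; adjust job names, namespaces, and ports to match your environment.

```yaml
scrape_configs:
  # GPU metrics from the DCGM exporter DaemonSet pods
  - job_name: dcgm-exporter
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only the endpoints that back the dcgm-exporter Service (name is an assumption)
      - source_labels: [__meta_kubernetes_service_name]
        regex: dcgm-exporter
        action: keep

  # TGI (or vLLM) metrics exposed by the LLM-serving Service
  - job_name: llm-inference
    metrics_path: /metrics
    static_configs:
      - targets: ['llm-tgi.default.svc:80']   # hypothetical Service and port
```

Prometurbo then retrieves these metrics from the Prometheus server through PromQL queries, as described in the table.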
Support for KServe
Turbonomic can automate scale actions for workloads served by KServe in Red Hat OpenShift AI. KServe is a scalable and standards-based model inference platform on Kubernetes for trusted AI. Inference services deployed by KServe are managed by InferenceService custom resources (CRs). The number of workload replicas in an InferenceService is dictated by the minimum and maximum replicas in the predictor, transformer, or explainer configuration in the InferenceService spec.
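For reference, the following is a minimal sketch of an InferenceService that serves an LLM through a custom predictor container; the name, image, and replica values are hypothetical.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-inference                  # hypothetical name
spec:
  predictor:
    minReplicas: 1                     # replica bounds that Turbonomic scale actions update,
    maxReplicas: 4                     # rather than the Deployment's spec.replicas
    containers:
      - name: kserve-container
        image: ghcr.io/huggingface/text-generation-inference:2.0   # placeholder image and tag
        resources:
          limits:
            nvidia.com/gpu: "1"        # assumes one NVIDIA GPU per replica
```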
Turbonomic scale actions for InferenceService workloads are generated and executed as follows:

- Turbonomic automatically detects and updates the minimum and maximum replica values configured in the InferenceService spec, instead of changing the spec.replicas of the Deployment directly. Because the replicas are updated automatically, you do not need any additional setup or configuration in KServe.
- The KServe controller scales the workloads accordingly.
Action visibility
Turbonomic shows and executes SLO-driven scale actions through workload controllers. A single scale action represents the total number of replicas that you need to scale in or out to meet your SLOs.
When you examine an action, SLO is indicated as the reason for the action.