Workload controller scale actions

Actions associated with a workload controller scale replicas horizontally to maintain Service Level Objectives (SLOs) for your applications. This is a natural representation of these actions because it is the parent controller's replica count that is modified. The workload controller then rolls out the changes in the running environment.

For example, when the current response time for an application violates its SLO, Turbonomic recommends increasing the number of replicas to improve response time. If an application can meet its SLOs with fewer resources, Turbonomic recommends reducing the replica count to improve infrastructure efficiency.

Action generation requirements

Turbonomic generates workload controller scale actions under the following conditions:

Note:

For GenAI LLM inference workloads, see the next section.

  • Services are discovered by the Kubeturbo agent that you deployed to your cluster.

  • Application performance metrics for services are collected through the Instana or Dynatrace target or the Prometurbo metrics server.

    Prometurbo collects application performance metrics from Prometheus, and then exposes the applications and metrics in JSON format through a REST API. The Data Ingestion Framework (DIF) accesses the REST API and converts the JSON output into a DTO that Turbonomic consumes.

    To collect metrics through Prometurbo, deploy Prometurbo and enable metrics collection.

  • You have created service policies and configured SLOs for Response Time and Transactions in those policies.

Scale actions for GenAI LLM inference workloads

For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic generates workload controller scale actions to maintain SLOs for the following metrics:

  • Concurrent Queries

  • Queueing Time

  • Service Time

  • Response Time

  • Transactions

Note:

It is assumed that you have set up an LLM inference service on a cluster that has an NVIDIA GPU attached.

If you set up KServe in your environment, see the Support for KServe section in this topic for information about the level of support for KServe.

Requirements for generating scale actions for GenAI LLM inference workloads

The following diagram illustrates how the components listed in the next table work together to support scale actions for LLM inference workloads. To configure these components properly, review the requirements listed in the table.

Components that support scaling of LLM inference workloads
NVIDIA DCGM (Data Center GPU Manager)
NVIDIA DCGM is deployed as DaemonSet pods and collects GPU metrics. DCGM exposes these metrics through its APIs.
DCGM exporter for Prometheus
The DCGM exporter for Prometheus is deployed as DaemonSet pods and collects GPU metrics from DCGM.

DCGM exporter exposes the data for Prometheus to scrape, connects to the Kubelet pod resources API to identify GPU devices associated with a container pod, and then appends the GPU devices to the metrics.

TGI (Text Generation Inference) or vLLM metrics
TGI or vLLM metrics are exposed directly by the LLM-serving services on predefined ports.
Prometheus server
The Prometheus server is configured to scrape both the GPU metrics from the DCGM exporter and the TGI or vLLM metrics from the model-serving endpoints. For an illustrative scrape configuration, see the sketch after this component list.

The Prometheus server makes these metrics available through PromQL queries.

Kubeturbo agent
Kubeturbo is deployed to your cluster.

Kubeturbo monitors container platform entities and collects standard metrics for these entities.

Prometurbo agent
Prometurbo is deployed to your cluster and Prometurbo metrics collection is enabled.

Prometurbo connects to the Prometheus server and sends PromQL queries to collect GPU and TGI metrics.

Prometurbo requires these CRs (see the illustrative sketch after this component list):

  • PrometheusQueryMapping – specifies the GPU and TGI metrics to collect

  • PrometheusServerConfig – specifies settings for your Prometheus server

Turbonomic supply chain and charts
Turbonomic stitches the entities discovered from Prometurbo and Kubeturbo into the supply chain. When you set the scope to container platform entities, charts show GPU and TGI metrics.
  • GPU metrics include GPU (utilization of Tensor cores) and GPU memory (utilization of framebuffer memory).

  • TGI metrics include Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions.

Turbonomic calculates 10-minute and 1-hour moving averages of these metrics, and then uses the maximum of the two. For example, if the 10-minute moving average of response time is 900 ms and the 1-hour moving average is 600 ms, Turbonomic compares 900 ms against the SLO. This mechanism allows for faster generation of scale-up actions and slower, more conservative generation of scale-down actions.

Turbonomic service policies
Service policies are created for the services associated with the LLM inference workloads. In these policies:
  • Scope to the relevant services.

    Tip:

    Create a group of services from Settings > Groups and then specify the group as your scope.

  • Turn on Horizontal Scale Down and Horizontal Scale Up actions.

  • Enable SLOs for Concurrent Queries, Queueing Time, Service Time, Response Time, and Transactions, and then specify your preferred SLO values.

Turbonomic generates workload controller scale actions to maintain the SLOs that you defined in the policies. See the next section for information about the generated actions.
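
A minimal sketch of Prometheus scrape jobs for the metrics described above follows. The job names, service name, namespace, and port are hypothetical placeholders; the DCGM exporter serves metrics on port 9400 by default, and TGI or vLLM servers expose a /metrics endpoint on their serving port, so adjust these values to match your deployment.

    # Illustrative scrape jobs only; service names, namespaces, and ports are hypothetical.
    scrape_configs:
      - job_name: dcgm-exporter              # GPU metrics from the DCGM exporter DaemonSet
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            regex: dcgm-exporter             # assumed exporter service name
            action: keep
      - job_name: llm-inference              # TGI or vLLM metrics from the model-serving service
        metrics_path: /metrics
        static_configs:
          - targets: ['llm-service.llm-ns.svc:8080']   # hypothetical service and port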
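
A minimal sketch of the two Prometurbo custom resources follows. The apiVersion, field names, and PromQL query are assumptions based on typical Prometurbo samples rather than a definitive schema; verify them against the PrometheusQueryMapping and PrometheusServerConfig definitions shipped with your Prometurbo release.

    # Hypothetical values throughout; verify field names against your Prometurbo release.
    apiVersion: metrics.turbonomic.io/v1alpha1
    kind: PrometheusServerConfig
    metadata:
      name: prometheus-server-config
    spec:
      address: http://prometheus-server.monitoring:9090   # your Prometheus server URL
    ---
    apiVersion: metrics.turbonomic.io/v1alpha1
    kind: PrometheusQueryMapping
    metadata:
      name: llm-inference-metrics
    spec:
      entityConfigs:
        - type: application
          attributeConfigs:
            - name: ip                       # entity attribute to populate (assumed field)
              label: instance                # Prometheus label that supplies the value
              isIdentifier: true
          metricConfigs:
            - type: transaction
              queries:
                used: "sum(rate(tgi_request_count[10m]))"   # illustrative PromQL; substitute your own queries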

Support for KServe

Turbonomic can automate scale actions for workloads served by KServe in Red Hat OpenShift AI. KServe is a scalable and standards-based Model Inference Platform on Kubernetes for Trusted AI. Inference services deployed by KServe are managed by InferenceService Custom Resources. The number of workload replicas in an InferenceService is dictated by the minimum and maximum replicas in the predictor, transformer, or explainer configuration in the InferenceService spec.

Turbonomic scale actions for InferenceService workloads are generated and executed as follows:

  • Turbonomic automatically detects the InferenceService and updates the minimum and maximum replica values that are configured in the InferenceService spec (see the sketch after this list), instead of changing the spec.replicas of the underlying Deployment directly. Because the replicas are updated automatically, you do not need any additional setup or configuration in KServe.

  • The KServe controller scales the workloads accordingly.
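
A minimal sketch of an InferenceService spec follows to show where the replica bounds live. The name, model format, and storage URI are illustrative placeholders; only the minReplicas and maxReplicas fields on the predictor (or transformer or explainer) are relevant to scale actions.

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: llm-chat                # hypothetical name
    spec:
      predictor:
        minReplicas: 1              # Turbonomic updates these bounds when it executes a scale action
        maxReplicas: 4
        model:                      # illustrative model definition; use your own serving runtime
          modelFormat:
            name: huggingface
          storageUri: pvc://llm-models/llama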

Action visibility

Turbonomic shows and executes SLO-driven scale actions through workload controllers. A single scale action represents the total number of replicas that you need to scale in or out to meet your SLOs.

Action Center page with Scale highlighted

When you examine an action, SLO is indicated as the reason for the action.

Action Details page with graphs and reason for action highlighted