GPU optimization

As demand for advanced graphics processing units (GPUs) grows to support machine learning, AI, video streaming, and 3D visualization, safeguarding performance while maximizing efficiency is critical. Turbonomic for Government Standard optimizes the performance of your GPU-enabled workloads to help you achieve the following goals:

Performance optimization

Optimizing GPU utilization helps applications take full advantage of the GPU's computational power, which leads to faster responses and smoother experiences.

Resource efficiency

Because GPU-enabled workloads are resource intensive, you might allocate more GPU resources than they actually need. Optimization based on historical demand prevents overprovisioning and reduces costs for workloads in the public cloud.

Sustainability

Optimization cuts resource waste and reduces power consumption, which improves energy efficiency and lowers your carbon footprint.

GPU optimization in AWS and Azure environments

Running GPU-enabled workloads in the public cloud can be costly, especially if the workloads are charged at on-demand rates. Turbonomic for Government Standard can scale AWS and Azure virtual machines (VMs) that run supported GPU instance types to optimize performance at the lowest possible cost. It collects NVIDIA GPU metrics for VMs running these instance types and then uses these metrics to generate accurate VM scale actions.
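To illustrate the general idea of scaling a VM based on collected GPU metrics, the following Python sketch picks the cheapest instance type that still covers a high percentile of observed GPU demand. It is not the product's actual analysis. The `InstanceType` catalog, the instance names, the prices, the 95th percentile, and the 20% headroom are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str              # hypothetical label, not a real AWS or Azure type
    gpu_count: int
    gpu_memory_gib: float  # total GPU memory across all GPUs
    hourly_cost: float     # made-up price, for illustration only


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_instance(
    catalog: Sequence[InstanceType],
    gpus_used: Sequence[float],        # assumed metric: GPU-equivalents in use
    gpu_mem_used_gib: Sequence[float], # assumed metric: GPU memory in use
    pct: float = 95.0,                 # assumed percentile of historical demand
    headroom: float = 1.2,             # assumed 20% safety margin
) -> Optional[InstanceType]:
    """Return the cheapest instance type that covers demand, or None."""
    need_gpus = percentile(gpus_used, pct) * headroom
    need_mem = percentile(gpu_mem_used_gib, pct) * headroom
    candidates = [
        t for t in catalog
        if t.gpu_count >= need_gpus and t.gpu_memory_gib >= need_mem
    ]
    return min(candidates, key=lambda t: t.hourly_cost, default=None)


# Hypothetical catalog used only to exercise the sketch.
CATALOG = [
    InstanceType("gpu-small", 1, 16.0, 1.0),
    InstanceType("gpu-medium", 4, 64.0, 4.0),
    InstanceType("gpu-large", 8, 128.0, 8.0),
]
```

In this sketch, a VM whose historical demand fits within one GPU and 16 GiB of GPU memory would be matched to `gpu-small`, while demand that exceeds every catalog entry returns `None` so that no scale action is proposed.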

For more information, see the following topics:

Note: For general information about AWS and Azure optimization, see AWS optimization and Azure optimization.

GPU optimization in Kubernetes and Red Hat OpenShift environments

Kubernetes or Red Hat OpenShift clusters that manage generative AI (GenAI) workloads require substantial GPU processing power to perform efficiently. Turbonomic for Government Standard optimizes GPU resources to ensure that workloads meet performance standards while maximizing efficiency.

Turbonomic for Government Standard optimizes your GPU-enabled workloads in the following ways.

For GenAI large language model (LLM) inference workloads that use GPU resources and are deployed in a Kubernetes cluster, Turbonomic for Government Standard generates workload controller scale actions to maintain SLOs for key GPU metrics, such as Concurrent Queries and LLM Cache. For more information, see Scale action for GenAI LLM workloads.
If your NVIDIA GPUs are partitioned using multi-instance GPU (MIG), Turbonomic for Government Standard recognizes the GPU partitions and recommends MIG-aware horizontal scale actions for Kubernetes GenAI LLM workloads accordingly. For more information, see MIG-aware horizontal scale actions.
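As a rough illustration of SLO-driven horizontal scaling, the following Python sketch computes a replica count that keeps a per-replica metric, such as concurrent queries, at or below its SLO, and then limits scale-out to the GPU partitions that are still free. It uses the proportional formula that the Kubernetes Horizontal Pod Autoscaler documents, not the product's own analysis. The function name, its parameters, and the one-partition-per-replica default are assumptions made for this example.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_per_replica: float,  # e.g. concurrent queries per replica
    slo_per_replica: float,       # SLO target for the same metric
    free_gpu_slices: int,         # assumed count of unallocated MIG partitions
    slices_per_replica: int = 1,  # assumed partitions each new replica needs
    min_replicas: int = 1,
) -> int:
    """Proportional replica count, capped by free GPU partitions."""
    if current_replicas < 1 or slo_per_replica <= 0 or slices_per_replica < 1:
        raise ValueError("invalid scaling inputs")

    # Proportional rule: replicas scale with observed load / target load.
    wanted = math.ceil(current_replicas * observed_per_replica / slo_per_replica)
    wanted = max(min_replicas, wanted)

    # Scaling in (or holding steady) needs no additional GPU partitions.
    if wanted <= current_replicas:
        return wanted

    # Scaling out is limited by how many new replicas the free partitions fit.
    extra_possible = free_gpu_slices // slices_per_replica
    return min(wanted, current_replicas + extra_possible)
```

For example, two replicas that each serve 30 concurrent queries against an SLO of 20 would scale to three replicas when enough partitions are free. With only one free partition and a much higher load, the result is capped at the current count plus one.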

For additional information, see this case study.