Optimizing GPUs for gen AI

Saving critical resources with IBM Turbonomic
Maximizing high-demand GPUs for efficiency and performance

Artificial intelligence has reached a crucial milestone: training large language models (LLMs) is now among the most computationally demanding tasks. High-performance computing is essential for generative AI (gen AI) and LLM workload optimization, yet graphics processing units (GPUs) are expensive and scarce. GPUs are specialized chips designed for complex mathematical calculations and parallel processing, making them ideal for the computations required in deep learning training and inference. As a result, GPUs are in high demand, and optimizing their utilization is critical for AI success.

The IBM® Big AI Models (BAM) team, which supports the primary research and development environment where engineering teams test and refine their gen AI projects, saw an opportunity for improvement. As more projects moved through the testing stage, the team recognized the importance of using each instance optimally to avoid wasting resources.

5.3x increase in idle GPU resources

2x throughput achieved without degrading latency performance
“Enabling Turbonomic to scale up and down our LLM inference servers has allowed me to spend less time monitoring performance.”

Tom Morris
Infrastructure and Operations Lead for IBM AI Platform Enablement Research, IBM
Transforming GPU management: from chaos to control

To optimize their GPU resources and manage their LLM Kubernetes instances, the IBM BAM team deployed IBM Turbonomic®, an advanced application resource management software tool. Using real-time data, Turbonomic generated AI-driven recommendations for automated actions to improve resource utilization and efficiency. By identifying optimal resource allocation strategies, the solution produced tailored suggestions that the team could configure to execute automatically, enabling AI-driven resource optimization.
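
To make the mechanism concrete, the sketch below shows the general shape of a utilization-driven scaling decision for LLM inference replicas. This is a minimal illustration only, not Turbonomic's actual algorithm or API; the function name, thresholds and replica bounds are all hypothetical assumptions.

```python
# Hypothetical sketch of a utilization-driven scaling decision for LLM
# inference replicas. NOT Turbonomic's algorithm -- an illustration of the
# kind of automated action such a tool can recommend and execute.

def recommend_replicas(current_replicas: int,
                       gpu_utilization: float,
                       target_utilization: float = 0.6,
                       min_replicas: int = 1,
                       max_replicas: int = 8) -> int:
    """Recommend a replica count that moves per-replica GPU utilization
    toward the target, clamped to configured bounds."""
    if current_replicas < 1:
        raise ValueError("current_replicas must be >= 1")
    # Estimate how many replicas would bring utilization to the target.
    desired = round(current_replicas * gpu_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# An overloaded service (90% utilization) scales up; a mostly idle one
# scales down, returning GPUs to the shared pool for other workloads.
print(recommend_replicas(4, 0.90))  # -> 6 (scale up)
print(recommend_replicas(4, 0.10))  # -> 1 (scale down)
```

Automating decisions of this form, rather than making them manually, is what let the team reclaim idle GPUs and spend less time watching dashboards.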

As internal IBM software tailored for hybrid cloud management, including containerized applications, virtual machines and public clouds, IBM Turbonomic integrated seamlessly with the existing infrastructure.

Tom Morris, AI Platform Researcher, summarizes: “Enabling Turbonomic to scale up and down our LLM inference servers has allowed me to spend less time monitoring performance.”

Better performance, reduced costs: the outcomes of efficient GPU resource allocation

With Turbonomic, the IBM BAM team created a scalable, agile infrastructure that could adapt to the evolving demands of the business, supporting their LLM services across more than 100 NVIDIA A100 GPUs.

By scaling down overprovisioned instances, the team increased idle GPU resources from 3 to 16 (a 5.3x gain), freeing those resources to handle additional workloads.

The results included:

  1. Resource allocation
    With the automated solution, dynamic scaling became routine, keeping utilization of available GPUs optimal as needs varied.

  2. Cost efficiency
    Scaling LLM services on demand allowed time-sharing of GPUs, reducing the total number of GPUs required. With scaling and sharing in place, the IBM BAM team showed that 13 fewer GPUs would be needed in a fully automated environment.

  3. Labor efficiency
    Automatic scaling of LLM inference servers allowed the IBM BAM team to spend less time monitoring performance.

  4. Scalability and performance
    After the scaling of LLM services was fully automated, the originally overprovisioned GPU resources were freed up to be shared by other workloads on demand. Throughput doubled without degrading latency performance.

By applying Turbonomic's automation capabilities, the IBM BAM team successfully scaled and optimized its LLM services. This improvement freed the team to devote more time to strategic projects.

About IBM Big AI Models

The IBM Big AI Models (BAM) team is a group of researchers and engineers within IBM Research® that focuses on developing and applying large-scale AI models. These models are designed to process and analyze vast amounts of data, enabling applications such as natural language processing, computer vision and predictive analytics.

Solution component: IBM® Turbonomic®
Legal

© Copyright IBM Corporation 2024. IBM, the IBM logo, Turbonomic, and IBM Research are trademarks or registered trademarks of IBM Corp., in the U.S. and/or other countries. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

Client examples are presented as illustrations of how those clients have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.