Scale and optimize gen AI inferencing using the IBM AI Optimizer for Z 2.1

Delivers key-value caching and monitoring capabilities that optimize gen AI inferencing across infrastructures.


AI Optimizer for Z 2.1 is designed to serve AI models and optimize inference on the IBM Spyre accelerator. It improves gen AI inferencing across infrastructures through key-value (KV) caching and monitoring capabilities configured for IBM Z clients.

Why the AI Optimizer for Z 2.1

For enterprises running workloads on IBM Z, the path to operationalizing AI is not about whether they can run it; it’s about how efficiently and securely it can be integrated into existing environments.

AI workloads are becoming larger and more resource-intensive, particularly with generative AI and LLM-based applications. On Z, clients must balance:

  • Latency-sensitive workloads that can’t leave the platform.
  • Compliance and data residency requirements that restrict where inference runs.
  • Rising compute and energy costs driven by inefficient deployment of models.

AI Optimizer for Z 2.1 is built to align with these realities, enabling enterprises to reduce manual decision-making and automate inference placement and optimization more intelligently.

Core capabilities of AI Optimizer for Z 2.1

This release introduces several technical enhancements that improve both performance and efficiency:

1. Real-time monitoring and visualization for complete operational transparency

Using Grafana and Prometheus dashboards, AI Optimizer for Z 2.1 provides deep observability and near real-time insight into inference performance metrics, hardware and Spyre utilization, and model usage patterns, and it identifies bottlenecks and anomalies in model serving. With these metrics, users can interpret complex data intuitively through the dashboards, avoid over-provisioning, and ground future infrastructure and budget decisions in observed usage.
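Because the metrics flow through Prometheus, they can also be pulled programmatically through the standard Prometheus HTTP API. The following is a minimal sketch; the endpoint URL and metric names (such as ttft_seconds) are illustrative assumptions, not the product's documented metric set:

```python
import requests

# Hypothetical Prometheus endpoint; replace with your deployment's URL.
PROMETHEUS_URL = "http://prometheus.example.com:9090"

def query_metric(promql: str) -> list:
    """Run an instant PromQL query against the Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: average time-to-first-token over the last 5 minutes, assuming
# a histogram metric named "ttft_seconds" is exported (an assumption).
for series in query_metric(
    "rate(ttft_seconds_sum[5m]) / rate(ttft_seconds_count[5m])"
):
    print(series["metric"].get("model", "unknown"), series["value"][1])
```

The same queries that drive the dashboards can feed capacity-planning scripts or alerting rules.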

2. Multi-level caching for faster responses, higher throughput

With a staged delivery plan, two levels of caching can be enabled that reuse previously computed key-value pairs for common token sequences across different inferencing requests. At Level 1, KV caching operates within a single LLM deployed across several hardware units: inferencing requests that contain already-cached text are accelerated and hardware utilization is optimized. At Level 2, the cache is shared across multiple LLM deployments, further accelerating inferencing, reducing time-to-first-token and increasing throughput.
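As a conceptual sketch only, the snippet below illustrates why shared prefixes make KV caching effective; the product's actual cache keys, granularity and storage are internal and not described here:

```python
from hashlib import sha256

class PrefixKVCache:
    """Toy prefix cache: maps a hash of a token prefix to its KV state."""

    def __init__(self):
        self._store = {}  # prefix hash -> precomputed KV tensors (opaque here)

    def _key(self, tokens: tuple) -> str:
        return sha256(repr(tokens).encode()).hexdigest()

    def longest_cached_prefix(self, tokens: tuple) -> int:
        """Return the length of the longest token prefix already cached."""
        for n in range(len(tokens), 0, -1):
            if self._key(tokens[:n]) in self._store:
                return n
        return 0

    def put(self, tokens: tuple, kv_state) -> None:
        self._store[self._key(tokens)] = kv_state

cache = PrefixKVCache()
cache.put(("You", "are", "a", "helpful", "assistant"), kv_state="<KV tensors>")

# A second request sharing the system-prompt prefix reuses five tokens'
# worth of attention state instead of recomputing it.
hit = cache.longest_cached_prefix(
    ("You", "are", "a", "helpful", "assistant", "Summarize", "this")
)
print(f"Tokens served from cache: {hit}")
```

Skipping recomputation of the shared prefix is what drives the time-to-first-token reduction at both cache levels; Level 2 extends the reuse across deployments.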

3. Inferencing optimization for models running on Spyre, accelerated by design

LLMs that run on Spyre are automatically detected by AI Optimizer for Z and registered for inferencing optimization. Users can build custom routing plans while the built-in intelligent router weighs availability, usage and performance. LLMs serving similar applications or purposes can be grouped by adding tags to them, and users can configure their own tags following OpenAI API standards.
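Because the router follows OpenAI API standards, a tag can stand in where a model name would normally go in a chat completions request, and the router resolves it to a deployment in that group. In this sketch, the gateway URL, tag name and API key are illustrative assumptions:

```python
import requests

# Hypothetical OpenAI-compatible gateway exposed by the router.
GATEWAY_URL = "https://ai-optimizer.example.com/v1/chat/completions"

payload = {
    "model": "fraud-detection",  # a tag grouping several equivalent LLMs
    "messages": [
        {"role": "user", "content": "Classify this transaction as fraud or not: ..."}
    ],
}

resp = requests.post(
    GATEWAY_URL,
    json=payload,
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Existing OpenAI-compatible client code can point at the router without changes beyond the base URL and model/tag name.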

4. External model registration, unify your hybrid AI operations

External LLMs deployed on infrastructures outside IBM Z and IBM LinuxONE can be registered with AI Optimizer for Z. These can be tagged and grouped with local LLMs running on Spyre for use-case-based grouping and optimization. Depending on the LLM deployment, monitoring of external LLMs can be integrated into the cross-platform monitoring dashboard to give a complete gen AI overview.

Depending on the business need for a gen AI use case, multiple models may be required to achieve a given goal. AI Optimizer for Z therefore allows registration of external models running outside IBM Z and IBM LinuxONE to unify the inferencing endpoints. External and local LLMs can be grouped together through custom tags that are then used in inferencing requests to serve business needs.
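Since this post does not document the registration API, the sketch below is purely hypothetical: every endpoint path and field name is an assumption, shown only to make the idea of tagging an external endpoint alongside local Spyre deployments concrete:

```python
import requests

# Hypothetical management endpoint for AI Optimizer for Z.
OPTIMIZER_URL = "https://ai-optimizer.example.com"

external_model = {
    "name": "granite-external",                           # hypothetical name
    "endpoint": "https://cloud-provider.example.com/v1",  # OpenAI-compatible
    "tags": ["fraud-detection"],  # same tag as local Spyre deployments
}

resp = requests.post(
    f"{OPTIMIZER_URL}/models/register",  # hypothetical path
    json=external_model,
    headers={"Authorization": "Bearer <ADMIN_KEY>"},
    timeout=15,
)
resp.raise_for_status()
print("Registered:", resp.json())
```

Once registered and tagged, requests addressed to the shared tag can be served by either the local or the external deployment.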

AI Optimizer for Z and watsonx Assistant for Z on Spyre

When AI Optimizer for Z meets watsonx Assistant for Z on IBM’s Spyre accelerator, enterprises get the best of both worlds: intelligence and performance in perfect harmony.

AI Optimizer ensures every query, inference and model call is routed, cached and scaled for maximum efficiency, while watsonx Assistant for Z delivers natural, conversational engagement with customers and employees.

Running on Spyre’s high-performance, energy-efficient architecture, the two together enable faster responses, lower latency and end-to-end visibility, transforming customer interactions into seamless, AI-powered experiences that are smarter, faster and built for enterprise scale.

Learn more about the IBM AI Optimizer for Z

Join our upcoming webinar to learn more

Minaz Merali

VP IBM Z Data and AI

IBM

Mohamed Elmougi

Senior Product Manager - IBM Z Data and AI

IBM