How observability is adjusting to generative AI

15 April 2025

Observability is the ability to understand a system's internal state by analyzing its external outputs, primarily through telemetry data such as metrics, events, logs and traces, collectively referred to as “MELT data.”

Observability goes beyond traditional monitoring solutions to provide critical insight into software systems and cloud computing environments, helping IT teams ensure availability, optimize performance and detect anomalies.

Most IT systems behave deterministically, which makes root cause analysis fairly straightforward. When an app fails, observability tools can use MELT data to correlate signals and pinpoint failures, determining whether it's a memory leak, database connection failure or API timeout.

But large language models (LLMs) and other generative artificial intelligence (AI) applications complicate observability. Unlike traditional software, LLMs produce probabilistic outputs, meaning identical inputs can yield different responses. This lack of interpretability—or the difficulty in tracing how inputs shape outputs—can cause problems for conventional observability tools. As a result, troubleshooting, debugging and performance monitoring are significantly more complex in generative AI systems.

“Observability can detect if an AI response contains personally identifiable information (PII), for example, but can't stop it from happening,” explains IBM's Drew Flowers, Americas Sales Leader for Instana. “The model's decision-making process is still a black box.”

This "black box" phenomenon highlights a critical challenge for LLM observability. While observability tools can detect problems that have occurred, they cannot prevent those issues because they struggle with AI explainability—the ability to provide a human-understandable reason why a model made a specific decision or generated a particular output.

Until the explainability problem is solved, AI observability solutions must prioritize the things that they can effectively measure and analyze. This includes a combination of traditional MELT data and AI-specific observability metrics.

Critical metrics for gen AI observability

While traditional metrics don't provide complete visibility into model behavior, they remain essential components of AI observability. CPU, memory and network performance directly impact AI system functionality and user experience. They can help organizations assess how efficiently AI workloads are running and whether infrastructure constraints are affecting model performance and response times.

However, comprehensive AI observability requires additional metrics that monitor qualities specific to AI model behavior and outputs, including:

  • Token usage
  • Model drift
  • Response quality
  • Responsible AI monitoring

Token usage

A token is an individual unit of language—usually a word or a part of a word—that an AI model can understand. The number of tokens a model processes to understand an input or produce an output directly impacts the cost and performance of an LLM-based application. Higher token consumption can increase operational expenses and response latency.

Key metrics for tracking token usage include:

  • Token consumption rates and costs, which can help quantify operational expenses.

  • Token efficiency, a measure of how effectively each token is used in an interaction. Efficient interactions produce high-quality outputs while minimizing the number of tokens consumed.

  • Token usage patterns across different prompt types, which can help identify resource-intensive uses of models.

These metrics can help organizations identify optimization opportunities for reducing token consumption, such as by refining prompts to convey more information in fewer tokens. By optimizing token utilization, organizations can maintain high response quality while potentially reducing inference costs for machine learning workloads.
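These calculations are simple enough to sketch in code. In the stdlib-only Python below, the per-1,000-token prices, the prompt types and the interaction records are illustrative assumptions, not real provider rates or telemetry:

```python
from collections import defaultdict

# Hypothetical per-1,000-token prices; substitute your provider's real rates.
INPUT_COST_PER_1K = 0.01   # USD per 1,000 input tokens (assumed)
OUTPUT_COST_PER_1K = 0.03  # USD per 1,000 output tokens (assumed)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a single prompt/response pair."""
    return (input_tokens / 1000) * INPUT_COST_PER_1K \
         + (output_tokens / 1000) * OUTPUT_COST_PER_1K

def token_efficiency(quality_score: float, total_tokens: int) -> float:
    """Quality delivered per 1,000 tokens consumed (higher is better)."""
    return quality_score / (total_tokens / 1000)

# Aggregate usage by prompt type to surface resource-intensive uses.
interactions = [  # illustrative records, not real telemetry
    {"prompt_type": "summarization", "input_tokens": 1800, "output_tokens": 150},
    {"prompt_type": "chat",          "input_tokens": 300,  "output_tokens": 250},
    {"prompt_type": "summarization", "input_tokens": 2200, "output_tokens": 180},
]

totals = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
for rec in interactions:
    agg = totals[rec["prompt_type"]]
    agg["tokens"] += rec["input_tokens"] + rec["output_tokens"]
    agg["cost"] += interaction_cost(rec["input_tokens"], rec["output_tokens"])

for ptype, agg in totals.items():
    print(f"{ptype}: {agg['tokens']} tokens, ${agg['cost']:.4f}")
```

Grouping by prompt type in this way makes the third metric above—usage patterns across prompt types—fall out of the same pass over the data.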

Model drift 

Unlike traditional software, AI models can gradually change their behavior as real-world data evolves. This phenomenon, known as model drift, can significantly impact AI system reliability and performance.

Key metrics for tracking model drift include:

  • Changes in response patterns over time to identify emerging inconsistencies.

  • Variations in output quality or relevance that might indicate declining model performance.

  • Shifts in latency or resource utilization that could signal computational inefficiencies.

Drift detection mechanisms can provide early warnings when a model's accuracy decreases for specific use cases, enabling teams to intervene before the model disrupts business operations.
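Drift in a scalar per-response metric (response length, a relevance score) can be detected with a standard statistic such as the Population Stability Index. The stdlib-only sketch below illustrates the idea; the commonly cited 0.2 alert threshold is a convention, not a rule:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and a current
    sample of some per-response metric. Values above ~0.2 are commonly
    read as significant drift."""
    lo = min(expected + actual)
    hi = max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def dist(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        # Smooth empty bins to avoid log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = dist(expected), dist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Computing the PSI on a schedule against a frozen baseline window is one way to turn "changes in response patterns over time" into an alertable number.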

Response quality

Monitoring AI output quality is essential for maintaining trust, reliability and compliance. Key metrics for tracking response quality include:

  • Hallucination frequency across different prompt types to identify possible triggers for inaccurate outputs.

  • Factual accuracy of generated responses, though this often requires external validation and human oversight.

  • Consistency of outputs for similar inputs to verify model stability over time.

  • Relevance of responses to user prompts to assess how the model aligns with user intent.

  • Latency across different prompt types, which is critical for user-facing applications where speed and accuracy often require trade-offs, and which can help pinpoint performance bottlenecks and computational inefficiencies.

While tracking these metrics can help flag anomalous responses, observability tools cannot fully explain why hallucinations occur, nor can they automatically determine the correctness of AI-generated content. These are central challenges to AI trust and governance that remain unsolved across the industry.
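The consistency metric above can be approximated cheaply. This sketch uses raw text overlap via Python's difflib; a production system would more likely compare embeddings, but the shape of the check is the same:

```python
from difflib import SequenceMatcher
from itertools import combinations

def consistency_score(responses: list[str]) -> float:
    """Mean pairwise similarity (0-1) among responses to the same prompt.
    Low scores flag unstable model behavior for that prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially self-consistent
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

Replaying a fixed prompt set on a schedule and tracking this score over time gives a concrete signal for "consistency of outputs for similar inputs."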

Responsible AI monitoring

Ensuring ethical AI deployment and regulatory compliance requires comprehensive monitoring of AI-generated content.

Key metrics for tracking responsible AI include:

  • Occurrences of bias in responses to help ensure fairness across user interactions.

  • Instances of PII in generated content to help protect sensitive information.

  • Compliance with ethical AI guidelines to align with industry standards and regulations.

  • Content appropriateness to uphold brand reputation and user trust.

Real-time visualization dashboards with automated anomaly detection can alert teams when AI outputs deviate from expected norms. This proactive approach helps organizations address issues quickly, monitor AI performance over time and ensure responsible AI deployment at scale. 
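Detecting PII instances in generated content can be sketched with pattern matching. The two patterns below are deliberately minimal illustrations; real deployments rely on dedicated PII-detection libraries or NER models, not a pair of regexes:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> dict[str, list[str]]:
    """Return PII matches found in a generated response, keyed by type."""
    found = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found

flagged = detect_pii("Reach me at jane.doe@example.com; SSN 123-45-6789.")
```

Counting non-empty results per time window yields the "instances of PII in generated content" metric directly, and the same scan can feed a real-time alert.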

OpenTelemetry and AI observability

OpenTelemetry (OTel) has emerged as the industry standard framework for collecting and transmitting telemetry data, and it can assist with generative AI observability, too. This open-source project provides a vendor-neutral approach to observability that's particularly valuable in complex AI ecosystems.

For AI providers, OpenTelemetry offers a way to standardize how they share performance data without exposing proprietary model details or source code. For enterprises, it ensures that observability data flows consistently across complex AI pipelines that may include multiple models, various dependencies and retrieval augmented generation (RAG) systems.

Key benefits of OpenTelemetry for gen AI observability include:

  • Vendor independence: Organizations avoid lock-in to specific observability platforms, maintaining flexibility as AI technologies evolve.

  • End-to-end visibility: Telemetry data flows consistently from all components of AI application infrastructure.

  • Future-proofing: As AI technologies evolve, the OpenTelemetry standard adapts, ensuring observability strategies remain relevant.

  • Ecosystem integration: Open standards enable observability across multivendor AI solutions and hybrid deployment models.

  • Metadata standardization: Capturing essential metadata—including training timestamps, dataset origins and model inputs—provides critical context for understanding AI system behavior.
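As a concrete illustration of metadata standardization, the sketch below builds a plain dictionary shaped like an OpenTelemetry span for a single LLM call, using attribute names from OTel's generative AI semantic conventions (still incubating at the time of writing). A real service would create the span with the OpenTelemetry SDK rather than a dict, and the provider name here is an assumption:

```python
import time

def llm_call_span(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Dictionary shaped like an OTel span for one LLM call. Attribute
    names follow OpenTelemetry's generative-AI semantic conventions;
    'openai' is an assumed provider value for illustration."""
    return {
        "name": f"chat {model}",
        "start_time_unix_nano": time.time_ns(),
        "attributes": {
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    }

span = llm_call_span("gpt-4", 412, 128)
```

Because the attribute names are standardized, any OTel-compatible backend can aggregate token usage and model metadata across vendors without custom mapping.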

Speed is everything

AI applications require significant investment, from model licensing costs to infrastructure expenditures and developer resources. Organizations that delay generative AI observability risk wasting resources if they can’t uncover performance issues, ethical problems or inefficient implementations.

“For AI observability, time to value (TTV) is everything,” Flowers says. “If I can't start getting insights fast, I'm burning money while waiting to optimize my system.”

Some common challenges that slow AI observability adoption include:

  • Complex custom dashboards that require extensive setup and configuration.

  • Overwhelming data volume that creates processing bottlenecks.

  • Lack of automation in configuring alerts and generating reports.

  • Integration difficulties between AI platforms and observability tools.

  • Skill gaps in interpreting AI-specific telemetry data.

To overcome these challenges, organizations should consider observability solutions that support:

  • Rapid deployment

  • Automated insights

  • Integrated AI workflows

Rapid deployment

Organizations should prioritize observability solutions they can deploy quickly to gain immediate insights. Preconfigured platforms can significantly reduce setup time and accelerate TTV, enabling teams to start monitoring AI systems in days rather than weeks.

Key observability solution capabilities for rapid AI observability deployment include:

  • AI-specific dashboard templates that work right out of the box with minimal customization.

  • Automated instrumentation that can immediately start collecting data from common AI frameworks and platforms.

  • Prebuilt connectors for popular LLM providers and AI infrastructure that eliminate the need for custom integration work.

  • Quick-start implementation guides to help teams get up and running with proven approaches for common AI use cases.

Automated insights

Manually analyzing vast amounts of AI-generated data can take significant time and expertise, often leading to delays, mistakes or missed issues. Observability solutions can automate this process, allowing teams to focus on more pressing issues than sifting through raw telemetry data.

Key automations in AI observability solutions include:

  • Using anomaly detection to identify irregularities in AI behavior and performance without requiring manual threshold configuration.

  • Generating actionable recommendations for system optimization rather than just identifying problems.

  • Translating technical issues into business-relevant explanations.

  • Prioritizing alerts based on impact to avoid alert fatigue and reduce downtime.
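Anomaly detection "without requiring manual threshold configuration" can be as simple as a self-calibrating rolling z-score. The sketch below flags latency samples that deviate sharply from their own recent history; the window size and z-threshold are illustrative defaults, not tuned values:

```python
import statistics

def detect_anomalies(latencies_ms: list[float], window: int = 20,
                     z_threshold: float = 3.0) -> list[int]:
    """Flag indices whose latency deviates more than z_threshold standard
    deviations from the preceding window of samples. The threshold adapts
    to the data, so no static limit has to be hand-configured."""
    anomalies = []
    for i in range(window, len(latencies_ms)):
        ref = latencies_ms[i - window:i]
        mean = statistics.fmean(ref)
        stdev = statistics.stdev(ref) or 1e-9  # guard against zero variance
        if abs(latencies_ms[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies
```

The same pattern applies to any scalar telemetry stream—token counts, error rates, drift scores—which is why it is a common first layer in automated-insight pipelines.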

Integrated AI workflows

Observability shouldn't be an afterthought. Embedding it throughout the AI development lifecycle will empower teams across the organization with shared visibility into AI system performance, enabling faster issue resolution and more informed decision-making.

For AI observability, TTV isn't just about how quickly observability tools can be implemented. It is also about how rapidly these tools deliver actionable insights that optimize AI investments and prevent downtime.

Key ways to integrate AI observability into AI development workflows include:

  • Building observability into CI/CD pipelines for AI applications.

  • Testing observability instrumentation during pre-production.

  • Capturing development-stage metrics to inform production monitoring.

From monitoring to prediction

As AI observability matures, organizations are moving from reactive monitoring to predictive approaches that anticipate problems before they impact users or business outcomes. To support this, the most advanced observability solutions now incorporate their own specialized AI tools to analyze patterns across telemetry data and identify issues before they become critical.

"The most valuable AI in observability is predictive and causal AI, not generative AI," explains Flowers.

Observability tools with predictive and causal AI capabilities can:

  • Predict when model drift will reach problematic levels.

  • Forecast resource requirements based on AI usage patterns.

  • Identify prompt patterns likely to produce hallucinations.

  • Detect subtle bias trends before they become significant.
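The first capability—predicting when a drift metric will reach a problematic level—can be illustrated with a linear extrapolation. The sketch below fits a least-squares trend line to a metric's history; real predictive systems use proper time-series models, but the shape of the forecast is the same:

```python
def forecast_threshold_crossing(history: list[float], threshold: float):
    """Fit a least-squares line to a drift metric's history and estimate
    how many future observations remain before it crosses the threshold.
    Returns None when the trend is flat or falling (no crossing ahead)."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(range(n), history)) / denom
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    # Solve intercept + slope * t = threshold for the crossing point t,
    # then express it as steps beyond the last observed index (n - 1).
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - (n - 1))
```

An alert raised when the estimated steps-to-crossing drops below a budget (say, one deployment cycle) turns a reactive drift metric into an early warning.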

This shift from reactive to predictive observability represents the next frontier for AI operations, enabling more proactive management of AI applications and infrastructure while ensuring consistent, high-quality outputs.

Finding the right gen AI observability solution

Drawing from the challenges and solutions discussed, here are five essential principles to keep in mind when looking for the right observability solution for generative AI applications:

Acknowledge inherent limitations 

While AI observability provides critical insights into performance patterns and anomalies, it cannot fully explain the internal decision-making processes of large language models. Focus on measurable metrics that indicate system health and performance.

Look beyond traditional metrics

Comprehensive generative AI observability requires monitoring token usage patterns, model drift indicators and prompt-response relationships alongside traditional infrastructure performance metrics such as CPU utilization and memory consumption. 

Focus on time to value

Select observability platforms that offer rapid deployment capabilities with preconfigured dashboards and automated alerting to realize quicker returns on AI investments and prevent costly operational issues.

Integrate observability into software development

Integrate observability instrumentation early in the software development lifecycle to identify issues before deployment, establish performance baselines and create feedback loops that improve AI system quality.

Embrace OpenTelemetry

Standardizing on open observability frameworks helps future-proof observability strategies while providing comprehensive end-to-end visibility across complex AI systems and avoiding vendor lock-in.

Additionally, remember that embracing OpenTelemetry doesn't mean you have to choose an open-source observability solution. Many commercial platforms, which your organization may already use, fully support OTel while offering additional enterprise-grade capabilities.

Commercial observability solutions can provide fully managed observability with AI-driven insights and continuous support, minimizing manual setup and maintenance and improving TTV.

“If I’m sitting there building out dashboards, creating alerts, building context and data, I am literally just focused on building out tooling. I’m not optimizing the system. I’m not supporting customer initiatives,” Flowers says. “What I am doing fundamentally does not help me make money.”

With commercial observability solutions, much of that setup can be automated or preconfigured. Teams can instead focus on optimizing the performance and reliability of their generative AI models, maximizing both their observability investments and the real-world impacts of AI applications. 
