25 February, 2025
Large Language Models (LLMs) are being used in a wide range of generative AI applications, from translation services to chatbots and Retrieval Augmented Generation (RAG) systems. Alongside these applications, a new tech stack has emerged, where vector databases, orchestration applications and schedulers simplify the process of developing and deploying LLM-based applications. An important facet of this stack is the tools that provide observability into the LLMs themselves. LLMs are non-deterministic, resource-hungry and often challenging to test and control. Observability enables users to understand the behavior of a system or model without altering it. LLM observability is the practice of gathering data, often called telemetry, from a running LLM-powered system to analyze its behavior and performance.
Observability is a broader practice in machine learning that provides a high-level view of a model. It aims to give model owners a comprehensive understanding of application performance, system health and model behavior. Observability solutions encompass not only the model but also the associated data, infrastructure and code.
LLM observability is essentially monitoring of an LLM, where systems collect and visualize metrics and trigger alerts based on them. Those metrics measure latency, token count, cost or custom indicators like evaluations of end-user inputs, LLM-generated responses, or content guardrails. Unlike less complex models, where interpretability techniques like SHAP scores can be used, LLMs can generally only be observed from the outside, with their internal workings inferred from outputs and metadata.
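As a minimal sketch of alerting on such metrics, the snippet below compares one request's measurements against fixed limits. The metric names, the threshold values and the `check_alerts` helper are assumptions chosen for illustration, not part of any particular observability tool.

```python
# Hypothetical per-request limits; real values depend on the application.
THRESHOLDS = {"latency_s": 2.0, "total_tokens": 4000, "cost_usd": 0.05}


def check_alerts(metrics, thresholds=THRESHOLDS):
    """Return one alert message for each metric that exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name} = {value} exceeds {limit}")
    return alerts


# Example measurements for a single request (made-up numbers).
sample = {"latency_s": 3.1, "total_tokens": 1200, "cost_usd": 0.01}
for message in check_alerts(sample):
    print("ALERT:", message)
```

In practice the alert list would be forwarded to whatever paging or notification channel the team already uses.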
Developers can create evaluations using custom performance metrics, human-evaluated scores, or LLMs as judges, depending on the use case in which the LLM is deployed. These evaluations can validate a new model on a curated dataset, referred to as “offline evaluations”, or run on live production data, called “online evaluations”. An LLM or RAG application can be used to judge the output of a second LLM for hallucinations, toxicity and context relevancy. LLM judges can handle both pairwise comparisons, where the judge compares two outputs, and direct scoring, where the judge evaluates correctness or relevance.
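The two judging modes can be sketched as follows. This is an illustrative outline only: `JudgeFn`, the prompt wording and the 1 to 5 scale are assumptions, and `stub_judge` stands in for a real judge LLM so the example runs offline.

```python
import re
from typing import Callable

# Hypothetical type: any function that sends a prompt to a judge LLM
# and returns the judge's text reply.
JudgeFn = Callable[[str], str]


def direct_score(judge_fn: JudgeFn, question: str, answer: str) -> int:
    """Direct scoring: ask the judge for a 1-5 relevance score and parse it."""
    prompt = (
        "Rate how relevant the answer is to the question on a scale of 1 to 5.\n"
        f"Question: {question}\nAnswer: {answer}\nReply with a single digit."
    )
    reply = judge_fn(prompt)
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply contained no score: {reply!r}")
    return int(match.group())


def pairwise_compare(judge_fn: JudgeFn, question: str, answer_a: str, answer_b: str) -> str:
    """Pairwise comparison: ask the judge which of two answers is better."""
    prompt = (
        "Which answer is better for the question? Reply with 'A' or 'B' only.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    choice = judge_fn(prompt).strip().upper()[:1]
    if choice not in ("A", "B"):
        raise ValueError("Judge reply was neither 'A' nor 'B'")
    return choice


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge LLM so the sketch runs without network access."""
    return "4" if prompt.startswith("Rate") else "B"


print(direct_score(stub_judge, "What is RAG?", "Retrieval Augmented Generation."))
print(pairwise_compare(stub_judge, "What is RAG?", "A database.", "Retrieval Augmented Generation."))
```

Parsing the reply strictly and raising on anything unexpected keeps malformed judge output from silently entering the evaluation data.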
In an LLM application there are typically four steps:
User input - The user types a prompt or question that is sent to the model.
Processing - The backend application sends this request to the language model.
Response - The model processes the input and sends back a response.
Display - The response is displayed to the user in the chat interface.
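The four steps above can be sketched as a single traced request. This is a simplified stand-in for what tracing libraries do, not any particular tool's API: `Trace`, `span`, `handle_request` and `fake_model` are hypothetical names, and the model call is stubbed.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects one record per step so a request can be inspected end to end."""
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        record = {"name": name, "attributes": dict(attributes)}
        try:
            yield record
        finally:
            # Record the duration even if the step raised an exception.
            record["duration_s"] = time.perf_counter() - start
            self.spans.append(record)


def fake_model(prompt):
    """Stub standing in for a real LLM call."""
    return f"Echo: {prompt}"


def handle_request(user_input, trace):
    with trace.span("user_input", text=user_input):
        prompt = user_input.strip()
    with trace.span("processing") as rec:
        rec["attributes"]["prompt_chars"] = len(prompt)
    with trace.span("response") as rec:
        output = fake_model(prompt)
        rec["attributes"]["output"] = output
    with trace.span("display"):
        rendered = output  # a real application would render this in the chat UI
    return rendered


trace = Trace()
result = handle_request("  What is observability?  ", trace)
for s in trace.spans:
    print(s["name"], f"{s['duration_s']:.6f}s", s["attributes"])
```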
A complete LLM observability solution will capture the input and the generated output and measure metrics across these steps. Tracking the sequence of operations is critical, especially when using orchestration frameworks like LangChain or LlamaIndex. Tracing helps developers understand the workflow, making troubleshooting, debugging and root cause analysis more straightforward and effective. The architecture around LLM usage requires both a conventional observability setup and observability tools for the LLMs themselves. LLMs are typically accessed in a prompt-and-response model, which means that each prompt needs to be logged and each response checked for hygiene and relevance. Some of the key metrics that need to be tracked are listed below.
Inference latency - Measures the time taken for the model to generate a response. This is crucial for real-time applications where latency affects the user experience. Latency can indicate where a system might need optimization.
Token usage - Tracks the number of tokens processed, which directly impacts costs and resource allocation.
Error rates - This measures and tracks the frequency of model errors or failures during inference and can provide insight into LLM performance.
Output quality - Assesses the relevance, coherence and accuracy of model outputs, often using evaluation metrics to measure end users’ satisfaction with the generated response.
Model drift - Model drift occurs when a model’s performance degrades or behaves inconsistently over time due to changes in data patterns, shifts in user behavior, or evolving language use. Detecting changes in model performance over time can indicate the need for retraining or fine-tuning.
Resource utilization - Observability platforms will sometimes monitor CPU, GPU and memory usage to ensure efficient operation.
Throughput - This measures the number of requests processed per unit of time, typically per second.
User feedback metrics - Tracks user ratings or feedback on model outputs or LLM responses.
Combining several of these metrics, or even all of them if needed, gives developers and operations personnel an accurate snapshot of how the LLM is performing and how stable the system around the LLM is.
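As a rough sketch of how several of the metrics listed above can be derived from the same request log, the snippet below computes latency percentiles, error rate, throughput and token totals. The `RequestRecord` fields and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestRecord:
    timestamp_s: float   # when the request arrived
    latency_s: float     # time taken to generate the response
    input_tokens: int
    output_tokens: int
    ok: bool             # False if inference failed


def summarize(records):
    """Aggregate a list of request records (at least two) into summary metrics."""
    latencies = sorted(r.latency_s for r in records)
    cuts = quantiles(latencies, n=100, method="inclusive")  # 99 percentile cut points
    timestamps = [r.timestamp_s for r in records]
    window_s = max(timestamps) - min(timestamps)
    return {
        "p50_latency_s": cuts[49],
        "p95_latency_s": cuts[94],
        "error_rate": sum(not r.ok for r in records) / len(records),
        "throughput_rps": len(records) / window_s if window_s > 0 else float("nan"),
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in records),
    }


records = [
    RequestRecord(0.0, 0.2, 100, 50, True),
    RequestRecord(1.0, 0.4, 100, 50, True),
    RequestRecord(2.0, 0.6, 100, 50, False),
    RequestRecord(3.0, 0.8, 100, 50, True),
    RequestRecord(4.0, 1.0, 100, 50, True),
]
summary = summarize(records)
for name, value in summary.items():
    print(name, value)
```

Throughput here is simply the request count divided by the span between the first and last timestamps; a production system would more likely use fixed time windows.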
Observing the behavior of the LLM can help with cost optimization by tracking the number of input and output tokens for every request sent to an LLM. This lets developers examine the cost of each request individually and in aggregate over time.
An LLM observability solution can also provide custom tagging to attribute costs to different entities, for example by adding tags for specific use cases and accounts.
A/B experimentation on prompts is another benefit. Developers can try out multiple prompt templates side by side, explore which one gives the best results and iterate on them. What would otherwise require extensive manual interpretation in a UI layer like ChatGPT can be done programmatically and tracked in an LLM observability tool.
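A programmatic comparison of prompt templates can be sketched as below. Everything here is illustrative: the templates, `stub_model` and the `brevity_score` metric are assumptions, and a real experiment would call an actual model and use an evaluation suited to the task.

```python
from statistics import mean

# Two hypothetical prompt templates to compare side by side.
TEMPLATES = {
    "A": "Summarize the following text in one sentence: {text}",
    "B": "You are a concise editor. Give a one-sentence summary of: {text}",
}


def stub_model(prompt):
    """Stand-in for an LLM: returns a shorter 'summary' for the editor-style prompt."""
    words = prompt.split(":", 1)[1].split()
    limit = 5 if prompt.startswith("You are a concise editor") else 10
    return " ".join(words[:limit])


def brevity_score(output):
    """Toy evaluation: 1.0 if the output has at most six words, else 0.0."""
    return 1.0 if len(output.split()) <= 6 else 0.0


def run_experiment(inputs, templates, model_fn, score_fn):
    """Run every template on every input and return the mean score per template."""
    scores = {name: [] for name in templates}
    for text in inputs:
        for name, template in templates.items():
            output = model_fn(template.format(text=text))
            scores[name].append(score_fn(output))
    return {name: mean(values) for name, values in scores.items()}


inputs = [
    "Observability collects telemetry from a running system to explain its behavior and performance",
    "Token usage drives the cost of every request sent to a hosted model",
]
means = run_experiment(inputs, TEMPLATES, stub_model, brevity_score)
best = max(means, key=means.get)
print(means, "best:", best)
```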
Evaluation of the right LLM for a use-case is another reason to observe and track model behavior. A shadow deployment can mirror a production deployment to see how different models might perform given the same prompt.
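A shadow deployment can be sketched as a wrapper that serves the production model's answer while also recording what a candidate model would have said. The function names and log format below are hypothetical, and both models are stubs. For simplicity the shadow call runs synchronously here; a real system would usually run it off the request path.

```python
def handle_with_shadow(prompt, primary_fn, shadow_fn, comparison_log):
    """Return the primary model's output; record the shadow model's output for comparison."""
    primary_output = primary_fn(prompt)
    shadow_output = None
    shadow_error = None
    try:
        shadow_output = shadow_fn(prompt)
    except Exception as exc:
        # A shadow failure is logged but must never affect the user-facing response.
        shadow_error = repr(exc)
    comparison_log.append(
        {
            "prompt": prompt,
            "primary_output": primary_output,
            "shadow_output": shadow_output,
            "shadow_error": shadow_error,
        }
    )
    return primary_output


def production_model(prompt):
    """Stub for the model currently serving users."""
    return f"prod answer to: {prompt}"


def candidate_model(prompt):
    """Stub for the model under evaluation; fails on some inputs."""
    if "fail" in prompt:
        raise RuntimeError("candidate unavailable")
    return f"candidate answer to: {prompt}"


log = []
reply_1 = handle_with_shadow("What is RAG?", production_model, candidate_model, log)
reply_2 = handle_with_shadow("please fail", production_model, candidate_model, log)
for entry in log:
    print(entry)
```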
RAG systems are another application area where monitoring and logging model behavior can provide insight into possible optimizations. Observability can help in finding the most efficient strategies to iterate over retrieval-based queries, providing insight into prompts, evaluating chunking strategies and deciding which type of vector store to use.
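To illustrate how logged queries can be used to compare chunking strategies, the toy sketch below measures how often the retrieved chunk contains the expected answer for two chunk sizes. The document, the evaluation pairs and the word-overlap retriever are all assumptions; a real RAG system would use embeddings and a vector store instead.

```python
import re

DOC = "Paris is the capital of France. Berlin is the capital of Germany."

# Hypothetical evaluation set: (query, text the retrieved chunk should contain).
EVAL_SET = [
    ("capital of France", "Paris"),
    ("capital of Germany", "Berlin"),
]


def chunk_words(text, size):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def tokens(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(chunks, query):
    """Toy retriever: return the chunk sharing the most words with the query."""
    query_tokens = tokens(query)
    return max(chunks, key=lambda chunk: len(query_tokens & tokens(chunk)))


def hit_rate(text, size, eval_set):
    """Fraction of queries whose retrieved chunk contains the expected text."""
    chunks = chunk_words(text, size)
    hits = sum(expected.lower() in retrieve(chunks, query).lower() for query, expected in eval_set)
    return hits / len(eval_set)


for size in (3, 6):
    print(f"chunk size {size}: hit rate {hit_rate(DOC, size, EVAL_SET)}")
```

With three-word chunks each sentence is cut in half, so the best-matching chunk no longer contains the answer; with six-word chunks each sentence stays intact.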
Finally, evaluating fine-tuned model performance allows developers to analyze how fine-tuning affects outputs and detect differences in output quality between different fine-tuned versions of a model.
To instrument LLM observability, organizations will sometimes write their own tools in a language like Python or Java, but often they will use software platforms instead of a bespoke application. LLM observability platforms enable logging prompts and generated responses, measuring the number of tokens used and providing diagnostic frameworks that capture metrics for overall model behavior. There are a variety of commonly used LLM observability tools:
Model monitoring platforms are now available and under active development. Commercial platforms like IBM's Instana, Arize or Helicone provide suites of tools for all aspects of LLM observability. There are also open-source options like Langfuse, or the OpenLLMetry suite, which is based on OpenTelemetry. Platforms built specifically for monitoring LLMs can help with advanced safeguards like detecting prompt injection or jailbreak attempts.
Logging and monitoring systems like Datadog or Prometheus, although not specifically built for LLM observability, can capture and track LLM prompts and responses as well as system resource usage.
Testing and experimentation frameworks like MLflow or Weights & Biases provide dashboards to compare and evaluate models and the variants created through fine-tuning.
Creating an LLM observability system is highly context dependent and might involve using multiple systems to log and measure the models involved. An observability system helps ensure that LLM apps perform optimally, are resilient to failure and are aligned with ethical and business requirements.
