Date: 2025-02-25

In an LLM application there are typically four steps:

User input - The user types a prompt or question that is sent to the model.

Processing - The backend application sends this request to the language model.

Response - The model processes the input and sends back a response.

Display - The response is displayed to the user in the chat interface.
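The four steps above can be sketched as a single traced request. This is a minimal illustration rather than a production tracer: `Span`, `Trace`, `handle_request` and the stand-in `fake_llm` are hypothetical names introduced here, and a real application would call an actual model and export the spans to an observability backend.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed step in a request trace."""
    name: str
    duration_s: float


@dataclass
class Trace:
    """All spans for one user request, plus the logged prompt and response."""
    trace_id: str
    prompt: str
    response: str = ""
    spans: list = field(default_factory=list)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    return f"echo: {prompt}"


def handle_request(prompt: str, model=fake_llm) -> Trace:
    """Run the four steps and record how long each one takes."""
    trace = Trace(trace_id=uuid.uuid4().hex, prompt=prompt)

    def timed(name, fn, *args):
        # Time a single step and append it to the trace as a span.
        start = time.perf_counter()
        result = fn(*args)
        trace.spans.append(Span(name, time.perf_counter() - start))
        return result

    cleaned = timed("user_input", str.strip, prompt)                   # 1. user input
    request = timed("processing", lambda p: {"prompt": p}, cleaned)    # 2. backend builds the request
    output = timed("response", lambda r: model(r["prompt"]), request)  # 3. model responds
    trace.response = timed("display", lambda text: text, output)       # 4. display to the user
    return trace


trace = handle_request("  What is observability?  ")
for span in trace.spans:
    print(f"{span.name}: {span.duration_s * 1000:.3f} ms")
```

Keeping the prompt, the response and the per-step timings under one trace ID is what lets a slow or failed request be followed from the chat interface back to the model call.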

A complete LLM observability solution captures the input and the generated output and measures metrics across these steps. Tracking the sequence of operations is critical, especially when using orchestration frameworks like LangChain or LlamaIndex. Tracing helps in understanding the workflow, which makes troubleshooting, debugging and root cause analysis more straightforward. The architecture around LLM usage requires both a conventional observability setup and observability tools for the LLMs themselves. LLMs are typically accessed through a prompt-and-response model, which means that each prompt needs to be logged and each response checked for hygiene and relevance. Some of the key metrics that need to be tracked are listed below.

Inference latency - Measures the time taken for the model to generate a response. This is crucial for real-time applications where latency affects the user experience. Latency can indicate where a system might need optimization.

Token usage - Tracks the number of tokens processed, which directly impacts costs and resource allocation.

Error rates - Tracks the frequency of model errors or failures during inference, which can provide insight into LLM performance.

Output quality - Assesses the relevance, coherence and accuracy of model outputs, often using evaluation metrics to gauge end users’ satisfaction with the generated response.

Model drift - Occurs when a model’s performance degrades or its behavior becomes inconsistent over time due to changes in data patterns, shifts in user behavior, or evolving language use. Detecting such changes over time can indicate the need for retraining or fine-tuning.

Resource utilization - Observability platforms will sometimes monitor CPU, GPU and memory usage to ensure efficient operation.

Throughput - Measures the number of requests processed per unit of time, typically per second.

User feedback metrics - Tracks user ratings or other feedback on the model’s responses.

Combining several of these metrics, or even all of them if needed, gives developers and operations personnel an accurate snapshot of how the LLM is performing and how stable the system around it is.
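As a rough illustration of how the metrics listed above can be combined into one snapshot, the sketch below summarizes a small batch of logged requests. The `RequestRecord` fields, the `snapshot` helper, the nearest-rank percentile and the sample values are all assumptions made for this example, not a standard schema or any particular platform's API.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """One logged LLM call. Field names are illustrative only."""
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    ok: bool
    rating: Optional[int] = None  # e.g. a 1-5 user rating, if one was given


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def snapshot(records, window_s):
    """Summarize records collected over a window of `window_s` seconds."""
    latencies = [r.latency_s for r in records]
    ratings = [r.rating for r in records if r.rating is not None]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "total_tokens": sum(r.prompt_tokens + r.completion_tokens for r in records),
        "error_rate": sum(not r.ok for r in records) / len(records),
        "throughput_rps": len(records) / window_s,
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }


# Made-up sample data covering a 2-second window.
records = [
    RequestRecord(0.8, 120, 200, True, rating=5),
    RequestRecord(1.2, 90, 310, True),
    RequestRecord(3.5, 400, 0, False, rating=2),
    RequestRecord(0.9, 150, 180, True, rating=4),
]
stats = snapshot(records, window_s=2.0)
print(stats)
```

Reporting percentiles rather than a plain average keeps a single slow request, such as the failed 3.5-second call in the sample data, visible instead of letting it be smoothed away. Output quality, model drift and resource utilization need additional inputs (evaluation scores, historical baselines, host metrics) and are left out of this sketch.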