Monitoring vLLM

vLLM is an open-source framework for optimizing and accelerating large language model (LLM) inference. Monitor your vLLM applications with Instana by exporting traces and metrics so that you can analyze calls and gain insight into LLM performance.

Prerequisites

  • Python 3.10 or later

  • vLLM 0.6.3.post1 or later

  • Access to Instana agent or backend

Configuring environment variables

Configure your environment to export traces to Instana using either agent mode or agentless mode.

Agent mode

Export traces through an Instana agent:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://<instana-agent-host>:4317"
export OTEL_SERVICE_NAME="<your-service-name>"

Agentless mode

Export traces directly to the Instana backend:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://<otlp-acceptor-domain>:4317"
export OTEL_EXPORTER_OTLP_HEADERS="x-instana-key=<agent-key>"
export OTEL_SERVICE_NAME="<your-service-name>"

For endpoints without TLS, set:

export OTEL_EXPORTER_OTLP_INSECURE=true

Tip: For Instana backend endpoints, see Endpoints of the Instana backend otlp-acceptor.

Installing dependencies

  1. Verify Python version:

    python3 -V
  2. (Optional) Create a virtual environment:

    pip3 install virtualenv
    virtualenv vllm-env
    source vllm-env/bin/activate
  3. Install vLLM:

    pip3 install vllm==0.6.3.post1

    For alternative installation methods, see the vLLM documentation.

  4. Install OpenTelemetry packages:
     
    pip3 install \
      'opentelemetry-sdk>=1.26.0,<1.27.0' \
      'opentelemetry-api>=1.26.0,<1.27.0' \
      'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
      'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
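
To confirm that the packages are installed in the active environment, you can list them with pip (an optional check):

pip3 show vllm opentelemetry-sdk opentelemetry-exporter-otlp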

Instrumenting vLLM applications

vLLM supports three deployment patterns for trace instrumentation.

Pattern 1: vLLM as a library

Use vLLM directly in your application code:

import os
from random import randint
from vllm import LLM, SamplingParams

# Sample prompts
prompts = [
  "Hello, my name is",
  "The future of AI is",
]

# Create sampling params
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Ensure unique trace IDs
random_seed = randint(0, 2**32 - 1)

# Create LLM with OpenTelemetry endpoint
llm = LLM(
  seed=random_seed,
  model="ibm-granite/granite-3.0-2b-instruct",
  otlp_traces_endpoint=os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"],
)

# Generate and print outputs
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
  prompt = output.prompt
  generated_text = output.outputs[0].text
  print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Figure 1. Service overview

Figure 2. Trace overview

Pattern 2: Using LLMEngine directly

Access the model through LLMEngine for more control:

import argparse
from typing import List, Tuple

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import set_tracer_provider

from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.utils import FlexibleArgumentParser

# Configure OpenTelemetry
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("vllm-client")

def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
  """Create test prompts with sampling parameters."""
  return [
    ("A robot may not injure a human being",
      SamplingParams(temperature=0.0, logprobs=1, prompt_logprobs=1)),
    ("To be or not to be,",
      SamplingParams(temperature=0.8, top_k=5, presence_penalty=0.2)),
    ("What is the meaning of life?",
      SamplingParams(n=1, best_of=1, temperature=0.8, top_p=0.95, frequency_penalty=0.1)),
  ]

def process_requests(engine: LLMEngine, test_prompts: List[Tuple[str, SamplingParams]]):
  """Process prompts and handle outputs."""
  request_id = 0

  while test_prompts or engine.has_unfinished_requests():
    if test_prompts:
      prompt, sampling_params = test_prompts.pop(0)
      engine.add_request(str(request_id), prompt, sampling_params)
      request_id += 1

    request_outputs: List[RequestOutput] = engine.step()

    for request_output in request_outputs:
      if request_output.finished:
        print(request_output)

def initialize_engine(args: argparse.Namespace) -> LLMEngine:
  """Initialize LLMEngine from CLI arguments."""
  engine_args = EngineArgs.from_cli_args(args)
  return LLMEngine.from_engine_args(engine_args)

def main(args: argparse.Namespace):
  """Main function."""
  engine = initialize_engine(args)
  test_prompts = create_test_prompts()
  process_requests(engine, test_prompts)

if __name__ == '__main__':
  parser = FlexibleArgumentParser(description='Demo using LLMEngine directly')
  parser = EngineArgs.add_cli_args(parser)
  args = parser.parse_args()
  main(args)

Run with CLI arguments:

python3 llm-engine.py --model facebook/opt-125m --otlp-traces-endpoint "$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

Pattern 3: vLLM as a server

Deploy vLLM as a separate server with API endpoints. Traces from client and server are correlated using propagated trace IDs.

Start the vLLM server:

vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

Client application:

import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Configure OpenTelemetry
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("vllm-client")

vllm_url = "http://localhost:8000/v1/completions"
with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
  prompt = "San Francisco is a"
  span.set_attribute("prompt", prompt)
  headers = {}
  TraceContextTextMapPropagator().inject(headers)
  payload = {
    "model": "ibm-granite/granite-3.0-2b-instruct",
    "prompt": prompt,
    "max_tokens": 10,
    "n": 1,
    "best_of": 1,
    "use_beam_search": "false",
    "temperature": 0.0,
  }
  response = requests.post(vllm_url, headers=headers, json=payload)
  print(response)
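
In this example, TraceContextTextMapPropagator().inject(headers) adds a W3C traceparent header to the request, which lets the vLLM server attach its server-side spans to the client span. After injection, headers contains an entry similar to the following (values are illustrative):

{"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}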

Figure 3. Trace overview with server

Viewing traces on Instana

After running your application, data will appear in the Instana Gen AI observability dashboard. You can analyze calls by application, service, and endpoint, and view logs alongside trace information.

For more details on viewing and analyzing traces, see Viewing telemetry data.

Installing OTel Data Collector for vLLM (ODCV)

ODCV collects OpenTelemetry metrics from vLLM based on predefined semantic conventions.

Prerequisites

  • Java SDK 11 or later

Verify the Java installation:

java -version

Procedure

  1. Download the latest ODCV package:

    wget https://github.com/instana/otel-dc/releases/download/v1.0.7/otel-dc-vllm-1.0.0.tar
  2. Extract the package:

    tar xf otel-dc-vllm-1.0.0.tar
    cd otel-dc-vllm-1.0.0
  3. Configure config/config.yaml:

    vi config/config.yaml

    Update these fields:

    • otel.agentless.mode: Connection mode (true for agentless, false for agent)

    • otel.backend.url: gRPC endpoint for Instana backend or agent

    • callback.interval: Data posting interval in seconds

    • otel.service.name: Name for the Data Collector

    • otel.vllm.metrics.url: vLLM server metrics endpoint (use https if TLS enabled)

    For agentless mode (otel.agentless.mode: true):

    • otel.backend.url: Instana backend otlp-acceptor endpoint (for example, https://<otlp-acceptor-domain>:4317)

    For agent mode (otel.agentless.mode: false):

    • otel.backend.url: Instana agent endpoint (for example, http://<instana-agent-host>:4317)

    A minimal sample configuration is sketched after this procedure.

  4. (Optional) Configure logging:

    vi config/logging.properties
  5. Run the Data Collector:
    nohup ./bin/otel-dc-vllm >/dev/null 2>&1 &

    Alternatively, use tmux or screen for background execution.
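
The following is a minimal sketch of config/config.yaml for agent mode (the configuration that step 3 describes). It assumes that the field names listed in step 3 map directly to flat YAML keys; all values are placeholders for your environment, and the metrics URL assumes the default vLLM OpenAI-compatible server port and the /metrics path:

otel.agentless.mode: false
otel.backend.url: http://<instana-agent-host>:4317
callback.interval: 30
otel.service.name: <your-data-collector-name>
otel.vllm.metrics.url: http://<vllm-server-host>:8000/metrics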

Viewing vLLM metrics

After installing ODCV and instrumenting your application:

  1. In the Instana UI, go to Infrastructure > Analyze Infrastructure

  2. Select OTel vLLMonitor from the list of entity types

  3. Click an OTel vLLMonitor instance to view its dashboard

Figure 4. vLLM metrics

Troubleshooting

This section describes how to troubleshoot SSL handshake errors.

SSL handshake errors

If you encounter SSL errors such as:

Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.

Set the following to export data without TLS:

export OTEL_EXPORTER_OTLP_INSECURE=true

Next steps

  • Configure alerts for vLLM performance metrics

  • Explore trace correlation between client and server

  • Review OpenTelemetry for advanced configurations