Monitoring vLLM

vLLM is an open source framework for optimizing and accelerating large language model (LLM) inference. Monitor your vLLM infrastructure with Instana by collecting performance metrics to gain visibility into resource utilization, throughput, latency, and overall system health of your LLM deployments.

Prerequisites

  • Python 3.10 or later

  • vLLM 0.6.3.post1 or later

  • Access to an Instana agent or the Instana backend

Configuring environment variables

Configure your environment to export traces to Instana using either agent mode or agentless mode.

Agent mode

Export traces through an Instana agent:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://<instana-agent-host>:4317"
export OTEL_SERVICE_NAME="<your-service-name>"

Agentless mode

Export traces directly to the Instana backend:

export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://<otlp-acceptor-domain>:4317"
export OTEL_EXPORTER_OTLP_HEADERS="x-instana-key=<agent-key>"
export OTEL_SERVICE_NAME="<your-service-name>"

For endpoints without TLS, set:

export OTEL_EXPORTER_OTLP_INSECURE=true

Tip:

For Instana backend endpoints, see Endpoints of the Instana backend otlp-acceptor.
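
As a quick check before starting vLLM, the following minimal Python sketch reads the variables described above and reports which mode they imply. The helper name describe_otlp_env is hypothetical, and the rule that an x-instana-key header indicates agentless mode is an assumption made for illustration, not documented Instana or OpenTelemetry behavior.

```python
import os


def describe_otlp_env(env=None):
    """Summarize the OTLP trace export settings found in the environment.

    Hypothetical helper for illustration; it only inspects variables and
    does not contact the agent or the backend.
    """
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "")
    headers = env.get("OTEL_EXPORTER_OTLP_HEADERS", "")

    problems = []
    if not endpoint:
        problems.append("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is not set")
    if not env.get("OTEL_SERVICE_NAME"):
        problems.append("OTEL_SERVICE_NAME is not set")

    # Assumption: the presence of an x-instana-key header means agentless mode.
    mode = "agentless" if "x-instana-key=" in headers else "agent"
    insecure = env.get("OTEL_EXPORTER_OTLP_INSECURE", "").lower() == "true"
    return {
        "mode": mode,
        "endpoint": endpoint,
        "insecure": insecure,
        "problems": problems,
    }


if __name__ == "__main__":
    print(describe_otlp_env())
```

Run the script in the same shell where you exported the variables; an empty problems list means that the endpoint and service name are both set.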

Installing dependencies

  1. Verify Python version:

    python3 -V
  2. (Optional) Create a virtual environment:

    pip3 install virtualenv
    virtualenv vllm-env
    source vllm-env/bin/activate
  3. Install vLLM:

    pip3 install vllm==0.6.3.post1

    For alternative installation methods, see the vLLM documentation.

  4. Install OpenTelemetry packages:
     
    pip3 install \
      'opentelemetry-sdk>=1.26.0,<1.27.0' \
      'opentelemetry-api>=1.26.0,<1.27.0' \
      'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
      'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
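
To confirm that the packages from the steps above landed in the active environment, a short Python sketch such as the following can list what is installed. It uses only the standard library; the helper name installed_versions is hypothetical, and the script reports versions without enforcing the version ranges.

```python
import sys
from importlib import metadata

# Distribution names taken from the installation steps above.
REQUIRED = [
    "vllm",
    "opentelemetry-sdk",
    "opentelemetry-api",
    "opentelemetry-exporter-otlp",
    "opentelemetry-semantic-conventions-ai",
]


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "(3.10 or later is required)")
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```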

Instrumenting vLLM applications

Starting the vLLM server

Deploy vLLM as a separate server with API endpoints. To observe the client application as well, use the sample client in the next section.

Start the vLLM server:

vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"

Creating a client application

Create a client application to send requests to the vLLM server:

import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Configure OpenTelemetry
trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("vllm-client")

vllm_url = "http://localhost:8000/v1/completions"
with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
  prompt = "San Francisco is a"
  span.set_attribute("prompt", prompt)
  headers = {}
  TraceContextTextMapPropagator().inject(headers)
  payload = {
    "model": "ibm-granite/granite-3.0-2b-instruct",
    "prompt": prompt,
    "max_tokens": 10,
    "n": 1,
    "best_of": 1,
    "use_beam_search": False,  # a boolean, not the string "false"
    "temperature": 0.0,
  }
  response = requests.post(vllm_url, headers=headers, json=payload)
  print(response)

Figure 1. Trace overview with server
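
In the client above, TraceContextTextMapPropagator().inject(headers) adds a W3C traceparent header to the request, which is what allows client and server spans to be linked into one trace. The following sketch shows the layout of that header so you can recognize it when debugging. The helper name parse_traceparent is hypothetical, and the sample value is the example from the W3C Trace Context specification, not output from a real vLLM request.

```python
import re

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(value):
    """Split a traceparent header into its four fields, or return None."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # The lowest bit of the flags byte is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }


if __name__ == "__main__":
    example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(example))
```

In the client, you could print headers after the inject call and pass headers["traceparent"] to this function to see the trace ID that the request carries.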

Viewing traces in Instana

After running your application, data will appear in the Instana Gen AI observability dashboard. You can analyze calls by application, service, and endpoint, and view logs alongside trace information.

For more details on viewing and analyzing traces, see Viewing telemetry data.

Collecting metrics from vLLM

You can collect metrics from vLLM using one of the following options:

Option 1: Using OTel data collector for vLLM (ODCV)

ODCV collects OpenTelemetry metrics from vLLM based on predefined semantic conventions.

Prerequisites

  • Java SDK 11 or later

Verify installation:

java -version

Procedure

  1. Download the latest ODCV package:

    wget https://github.com/instana/otel-dc/releases/download/v1.0.7/otel-dc-vllm-1.0.0.tar
  2. Extract the package:

    tar xf otel-dc-vllm-1.0.0.tar
    cd otel-dc-vllm-1.0.0
  3. Configure config/config.yaml:

    vi config/config.yaml

    Update these fields:

    • otel.agentless.mode: Connection mode (true for agentless, false for agent)

    • otel.backend.url: gRPC endpoint for Instana backend or agent

    • callback.interval: Data posting interval in seconds

    • otel.service.name: Name for the Data Collector

    • otel.vllm.metrics.url: vLLM server metrics endpoint (use https if TLS enabled)

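    For agentless mode (otel.agentless.mode: true):

    • otel.backend.url: Instana backend otlp-acceptor endpoint (for example, https://<otlp-acceptor-domain>:4317)

    For agent mode (otel.agentless.mode: false):

    • otel.backend.url: Instana agent endpoint (for example, http://<instana-agent-host>:4317)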

  4. (Optional) Configure logging:

    vi config/logging.properties
  5. Run the Data Collector:

    nohup ./bin/otel-dc-vllm >/dev/null 2>&1 &

    Alternatively, use tmux or screen for background execution.

Option 2: Using open source OpenTelemetry collector

You can forward OpenTelemetry data from vLLM to an Instana agent or Instana backend by using the OpenTelemetry collector.

You can use one of the following OpenTelemetry collectors:

  • Instana distribution of OpenTelemetry collector (IDOT) - A preconfigured, optimized distribution that simplifies vLLM monitoring setup. For more information, see IDOT and OpenTelemetry project.

  • An open source OpenTelemetry collector - The core component of the OpenTelemetry ecosystem, offering vendor-independent telemetry data collection, processing, and export. Use it for custom deployments or environments where IDOT is not available. For more information, see OpenTelemetry collector.

Instana supports the following models to collect data:

  • Agent model - Data is sent to the Instana agent first; the agent aggregates the data and sends it to the Instana backend.

  • Agentless model - Data is sent to the Instana backend directly without going through the agent.

Forwarding telemetry data to an Instana agent (agent model)

The following snippet shows a typical configuration for the OpenTelemetry Collector to forward telemetry data to a local Instana host agent by using the OTLP/HTTP protocol. Create a YAML file, such as config.yaml, as follows:

receivers:
  otlp/receiver:
    protocols:
      grpc:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'prometheus-scrape'
          scrape_interval: 10s
          static_configs:
            - targets:
              - "$(PROMETHEUS_METRICS_ENDPOINT)"
          metrics_path: '/metrics'
processors:
  batch: {}
  resource:
    attributes:
      - key: service.name
        value: vllm
        action: insert
      - key: service.namespace
        value: genai
        action: insert
      - key: service.instance.id
        value: <vllm-host>
        action: insert
      - key: server.address
        value: <vllm-host-address>
        action: insert
      - key: server.port
        value: "8000"
        action: insert
      - key: vllm.entity.type
        value: vllm
        action: insert
      - key: INSTANA_PLUGIN
        value: vllm
        action: insert

exporters:
  otlphttp:
    endpoint: "http://$(INSTANA_AGENT_HOST):4318"
    tls:
      insecure: true

service:
  pipelines:
    metrics/prometheus:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [otlphttp]

Note:

PROMETHEUS_METRICS_ENDPOINT is the endpoint from which vLLM metrics are scraped, for example 9.60.233.77:8000.
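
Before pointing the collector at the metrics endpoint, it can help to confirm that the endpoint returns Prometheus text that parses cleanly. The following is a minimal sketch of such a parser. The function name parse_prometheus_text is hypothetical, the sample lines are illustrative rather than captured from a real server, and the parser deliberately ignores edge cases such as escaped quotes or commas inside label values.

```python
# Illustrative sample only; real output from a vLLM server contains more metrics.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="ibm-granite/granite-3.0-2b-instruct"} 1.0
vllm:num_requests_waiting{model_name="ibm-granite/granite-3.0-2b-instruct"} 0.0
"""


def parse_prometheus_text(text):
    """Parse Prometheus text exposition lines into (name, labels, value) tuples.

    Skips comment lines and ignores optional trailing timestamps.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        labels = {}
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            for pair in label_part.split(","):
                if pair:
                    key, _, raw = pair.partition("=")
                    labels[key.strip()] = raw.strip().strip('"')
        else:
            name, _, value_part = line.partition(" ")
        samples.append((name.strip(), labels, float(value_part.split()[0])))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_prometheus_text(SAMPLE):
        print(name, labels, value)
```

To check a live server, you could pass in the response body from http://<vllm-host>:8000/metrics instead of SAMPLE, for example by fetching it with the requests library used in the client application.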

The following example shows a typical exporter configuration for forwarding telemetry data to a local Instana host agent by using the OTLP/gRPC protocol. If you use this exporter, replace otlphttp with otlp in the exporters list of the pipeline.

exporters:
  otlp:
    endpoint: "$(INSTANA_AGENT_HOST):4317"
    tls:
      insecure: true

  • Set the PROMETHEUS_METRICS_ENDPOINT field to the vLLM metrics exporter endpoint.

  • Set the INSTANA_AGENT_HOST field to the IP address or hostname of the Instana agent to connect to.

  • Instana uses OTLP standard port numbers, such as 4317 for OTLP/gRPC and 4318 for OTLP/HTTP.
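
If the collector cannot reach the agent, a plain TCP check against the two OTLP ports listed above can narrow down whether the problem is network reachability or configuration. The following sketch uses only the standard library. The helper name port_is_open is hypothetical, and localhost is a placeholder for your INSTANA_AGENT_HOST value. A successful connection shows only that something is listening on the port, not that it accepts OTLP data.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder host; replace with the Instana agent host from config.yaml.
    agent_host = "localhost"
    for port in (4317, 4318):
        state = "open" if port_is_open(agent_host, port) else "closed or unreachable"
        print(f"{agent_host}:{port} is {state}")
```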

After you complete all configuration changes in the config.yaml file, run the following command to use the OpenTelemetry Collector:

docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest

Forwarding telemetry data to the Instana backend (agentless model)

To forward OpenTelemetry data to the Instana backend by using the OpenTelemetry Collector, complete the following steps:

  1. Create a YAML file, such as config.yaml, with the same content as in the preceding section, and change the exporter endpoint from the Instana agent endpoint to the Instana backend endpoint. OpenTelemetry data is sent to the dedicated endpoints of the backend otlp-acceptor component. The Instana backend requires an Instana agent key for validation.

    exporters:
      otlp:
        endpoint: INSTANA_OTLP_GRPC_BACKEND:4317
        headers:
          x-instana-key: xxxxxxx
          x-instana-host: xxxx

    Note:

    • Set the INSTANA_OTLP_GRPC_BACKEND field to the domain name of the otlp-acceptor component of the Instana backend. For more information about the endpoint of the Instana backend otlp-acceptor, see Endpoints of Self-Hosted Instana backend otlp-acceptor or Endpoints of SaaS Instana backend otlp-acceptor.

    • Set the x-instana-key field to the agent key of the Instana agent for targeting the Instana backend. To find your agent key, click More > Agents in the navigation bar of the Instana UI, and then click Install Agents > Windows.

    • Set the x-instana-host field to the host ID if no host.id, faas.id, or device.id resource attribute is defined in your application or system.

    • Instana uses OTLP standard port numbers, such as 4317 for OTLP/gRPC and 4318 for OTLP/HTTP. Port 443 is also supported for OTLP/HTTP.

  2. After you complete all configuration changes in the config.yaml file, run the following command to use the OpenTelemetry Collector:

    docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest

Viewing vLLM metrics in Instana

After configuring metrics collection using either option:

  1. In the Instana UI, go to Infrastructure > Analyze Infrastructure.

  2. Select OTel vLLMonitor from the entity types.

  3. Click an OTel vLLMonitor instance to view its dashboard.

Figure 2. vLLM metrics

Troubleshooting

Information on troubleshooting SSL handshake errors.

SSL handshake errors

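If you encounter an SSL error such as the following, the exporter is likely attempting a TLS handshake with an endpoint that does not use TLS:

Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.

In that case, set the following variable to export data without TLS: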

export OTEL_EXPORTER_OTLP_INSECURE=true
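
A simple way to catch this class of error early is to compare the endpoint scheme with the insecure setting. The following sketch is a heuristic only: the helper name tls_mismatch_hint is hypothetical, and it assumes that an http:// endpoint is plaintext and an https:// endpoint uses TLS, which may not hold behind a proxy or load balancer.

```python
from urllib.parse import urlparse


def tls_mismatch_hint(endpoint, insecure):
    """Return a hint when the endpoint scheme and insecure flag disagree.

    Heuristic for illustration; returns None when nothing looks wrong or
    when the endpoint has no http/https scheme to compare against.
    """
    scheme = urlparse(endpoint).scheme
    if scheme == "http" and not insecure:
        return "plaintext endpoint: consider OTEL_EXPORTER_OTLP_INSECURE=true"
    if scheme == "https" and insecure:
        return "TLS endpoint: consider unsetting OTEL_EXPORTER_OTLP_INSECURE"
    return None


if __name__ == "__main__":
    # Placeholder endpoint; substitute your own agent or backend endpoint.
    print(tls_mismatch_hint("http://localhost:4317", insecure=False))
```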

Next steps

  • Configure alerts for vLLM performance metrics

  • Explore trace correlation between client and server

  • Review OpenTelemetry for advanced configurations