Monitoring large language models (LLMs) with vLLM (public preview)
vLLM is an open source framework that optimizes and accelerates inference and serving for large language models (LLMs). You can now monitor your vLLM integrations seamlessly with Instana. Export traces from vLLM applications to Instana to analyze calls and gain insights into your LLMs.
Configuring the environment
Configure your environment to export traces to Instana either through an agent or directly to the Instana backend (agentless mode). To find the domain names of the Instana backend otlp-acceptor for different Instana SaaS environments, see Endpoints of the Instana backend otlp-acceptor.
To export traces to Instana using an Instana agent:
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://<instana-agent-host>:4317"
export OTEL_SERVICE_NAME="<your-service-name>"
To export traces directly to the Instana backend (agentless mode):
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://<otlp-acceptor-domain>:4317"
export OTEL_EXPORTER_OTLP_HEADERS="x-instana-key=<agent-key>"
export OTEL_SERVICE_NAME="<your-service-name>"
Additionally, if the endpoint of the Instana backend otlp-acceptor or the agent is not TLS-enabled, set OTEL_EXPORTER_OTLP_INSECURE to true:
export OTEL_EXPORTER_OTLP_INSECURE=true
Exporting traces to Instana
To instrument the LLM application, complete the following steps:
- Verify that Python 3.10 or later is installed. To check the Python version, run the following command:
python3 -V
- Optional: Create a virtual environment for your applications to keep your dependencies consistent and prevent conflicts with other applications. To create and activate a virtual environment, run the following commands:
pip3 install virtualenv
virtualenv vllm-env
source vllm-env/bin/activate
- Install vLLM and the OpenTelemetry packages.
a. To install vLLM, run the following command:
pip3 install vllm==0.6.3.post1
For more ways to install vLLM, see the official vLLM documentation.
b. To install the OpenTelemetry packages, run the following command:
pip3 install \
  'opentelemetry-sdk>=1.26.0,<1.27.0' \
  'opentelemetry-api>=1.26.0,<1.27.0' \
  'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
  'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
- Verify the installation and configuration. You can use the following sample applications:
- The following code block is a sample application where vLLM is used as a library. The core functionality of the library is integrated directly into an application or system. The model is loaded into memory and used to generate predictions.
import os
from random import randint

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Ensure unique trace IDs for each execution.
random_seed = randint(0, 2**32 - 1)

# Create an LLM.
llm = LLM(
    seed=random_seed,
    model="ibm-granite/granite-3.0-2b-instruct",
    # Set the OpenTelemetry endpoint from the environment variable.
    otlp_traces_endpoint=os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"],
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
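To run this sample, save the code to a file and start it with Python. The file name llm-library.py is only an example; any name works:
python3 llm-library.py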
- The following code block is a sample application where LLMEngine is used directly to access the model:
import argparse
from typing import List, Tuple

import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (TraceContextTextMapPropagator)
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.utils import FlexibleArgumentParser

trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")


def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
    """Create a list of test prompts with their sampling parameters."""
    return [
        ("A robot may not injure a human being",
         SamplingParams(temperature=0.0, logprobs=1, prompt_logprobs=1)),
        ("To be or not to be,",
         SamplingParams(temperature=0.8, top_k=5, presence_penalty=0.2)),
        ("What is the meaning of life?",
         SamplingParams(n=1, best_of=1, temperature=0.8, top_p=0.95,
                        frequency_penalty=0.1)),
    ]


def process_requests(engine: LLMEngine,
                     test_prompts: List[Tuple[str, SamplingParams]]):
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0

    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)


def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)


def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    test_prompts = create_test_prompts()
    process_requests(engine, test_prompts)


if __name__ == '__main__':
    parser = FlexibleArgumentParser(
        description='Demo on using the LLMEngine class directly')
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)
Run the application with the necessary CLI arguments:
python3 llm-engine.py --model facebook/opt-125m --otlp-traces-endpoint "$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
- The following code block is a sample application where vLLM operates as a separate server that exposes an API or endpoint to handle inference requests. The language model can be deployed on a dedicated server, and users or applications can interact with it remotely. The traces from the client application and the vLLM server are correlated by using the same trace ID, which the client propagates to the server.
Start the vLLM server by running the following command:
vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
Then run the following client application, which sends a completion request to the server and propagates the trace context in the request headers:
import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (TraceContextTextMapPropagator)

trace_provider = TracerProvider()
set_tracer_provider(trace_provider)
trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")

vllm_url = "http://localhost:8000/v1/completions"

with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
    prompt = "San Francisco is a"
    span.set_attribute("prompt", prompt)

    # Inject the current trace context into the request headers so that the
    # client span and the vLLM server spans are correlated.
    headers = {}
    TraceContextTextMapPropagator().inject(headers)

    payload = {
        "model": "ibm-granite/granite-3.0-2b-instruct",
        "prompt": prompt,
        "max_tokens": 10,
        "n": 1,
        "best_of": 1,
        "use_beam_search": "false",
        "temperature": 0.0,
    }

    response = requests.post(vllm_url, headers=headers, json=payload)
    print(response)
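To run the client, save the code to a file and start it while the vLLM server is running. The file name vllm-client.py is only an example:
python3 vllm-client.py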
Viewing traces
To create an application perspective to view trace information that is gathered from the LLM application runtime, complete the following steps:
- Open the New Application Perspective wizard in one of the following ways:
- On the Instana dashboard, go to the Applications section and click Add application.
- From the navigation menu, click Applications and then open the Applications dashboard. Then, click Add and select New Application Perspective.
- Select Services or Endpoints and click Next.
- Click Add filter and select a service name. You can select multiple services and endpoints by using OR conditions. The service name is the value that you set in the OTEL_SERVICE_NAME environment variable.
- In the Application Perspective Name field, enter a name for the LLM application perspective. Then, click Create.
The steps to create a new application perspective are complete.
To view trace information, go to the navigation menu and click Analytics to open the Analytics dashboard. On this dashboard, you can analyze calls by breaking down the data that Instana presents by application, service, endpoint, and call name. You can filter and group traces or calls by using arbitrary tags, for example by filtering on Trace->Service Name equals the service name that you set in OTEL_SERVICE_NAME. You can view log information along with the trace information. For more information, see Analyzing traces and calls.
Troubleshooting
Information about how to troubleshoot a problem with LLM applications.
SSL issues
When you encounter SSL handshake issues (or similar ones) with your LLM applications, you might see the following error:
Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
For your vLLM applications to export data to the gRPC endpoint without using TLS, set OTEL_EXPORTER_OTLP_INSECURE=true.
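For example, set the variable before you start the application:
export OTEL_EXPORTER_OTLP_INSECURE=true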
Install OTel Data Collector for vLLM (ODCV)
To collect OpenTelemetry metrics for vLLM, you need to install ODCV. All implementations are based on predefined OpenTelemetry Semantic Conventions.
- Verify that the following prerequisites are met:
- Java SDK 11 or later is installed. Run the following command to check the installed version:
java -version
- Install the collector:
a. Download the latest available version of the otel-dc-vllm package:
wget https://github.com/instana/otel-dc/releases/download/v1.0.7/otel-dc-vllm-1.0.0.tar
b. Extract the package to the preferred deployment location:
tar xf otel-dc-vllm-1.0.0.tar
- Modify the configuration file:
a. Open the config.yaml file:
cd otel-dc-vllm-1.0.0
vi config/config.yaml
b. Update the following fields in the config.yaml file:
- otel.agentless.mode: Set the connection mode for the OTel Data Collector. The default mode is agentless.
- otel.backend.url: Specify the gRPC endpoint of the Instana backend or the Instana agent, depending on whether agentless mode is used.
- callback.interval: Define the time interval (in seconds) for posting data to the backend or agent.
- otel.service.name: Assign a name to the Data Collector, which can be any string that you choose.
- otel.vllm.metrics.url: Specify the endpoint of the vLLM server for metrics collection. Use https if TLS is enabled.
If otel.agentless.mode is set to true, metrics data is sent directly to the Instana backend. In this case, otel.backend.url is the gRPC endpoint of the backend otlp-acceptor component with an http or https scheme, for example http://<instana-otlp-endpoint>:4317. Use the https:// scheme if the endpoint of the Instana backend otlp-acceptor in SaaS environments is TLS-enabled. For more information about Instana SaaS environments, see Endpoints of the Instana backend otlp-acceptor.
If otel.agentless.mode is set to false, metrics data is sent to the Instana agent. In this case, otel.backend.url is the gRPC endpoint of the Instana agent, for example http://<instana-agent-host>:4317.
A sample configuration sketch is shown after this step.
c. Open the logging.properties file by using the following command:
vi config/logging.properties
Configure the Java logging settings in the logging.properties file according to your needs.
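The following snippet is a minimal sketch of how the config.yaml settings might look for agentless mode. The field names come from the previous step; the concrete values (interval, service name, and vLLM metrics URL) are placeholder assumptions, and the exact file structure might differ in your otel-dc-vllm version. Replace the placeholders with values for your environment:
otel.agentless.mode: true
otel.backend.url: "https://<otlp-acceptor-domain>:4317"
callback.interval: 30                                  # example: post data every 30 seconds
otel.service.name: "vllm-otel-dc"                      # example name for this Data Collector
otel.vllm.metrics.url: "http://<vllm-server-host>:8000/metrics"   # example vLLM metrics endpoint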
- Run the Data Collector with the following command according to your current system:
nohup ./bin/otel-dc-vllm >/dev/null 2>&1 &
You can also use tools like tmux or screen to run this program in the background.
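Optionally, before you check the Instana UI, you can confirm that the vLLM metrics endpoint that you set in otel.vllm.metrics.url is reachable from the host that runs the Data Collector. This check is a sketch that assumes the vLLM server listens on port 8000 of the local host; adjust the URL to your setup:
curl -s http://localhost:8000/metrics | head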
Viewing vLLM metrics
After you install the OpenTelemetry (OTel) Data Collector for vLLM and instrument the LLM application, complete the following steps to view the metrics:
- From the navigation menu in the Instana UI, select Infrastructure.
- Click Analyze Infrastructure.
- From the list of entity types, select OTel vLLMonitor.
- Click an entity instance of the OTel vLLMonitor entity type. The associated dashboard is displayed.