Monitoring large language models (LLMs) with vLLM (public preview)
vLLM is an open source framework that optimizes and accelerates inference for large language models (LLMs). You can now monitor your vLLM integrations seamlessly with Instana. Export traces from vLLM applications to Instana to analyze calls and gain insights into your LLMs.
Configuring the environment
Configure your environment to export traces to Instana either through an agent or directly to the Instana backend (agentless mode). To find the domain names of the Instana backend otlp-acceptor for different Instana SaaS environments, see Endpoints of the Instana backend otlp-acceptor.
To export traces to Instana using an Instana agent:
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://<instana-agent-host>:4317"
export OTEL_SERVICE_NAME="<your-service-name>"
To export traces directly to the Instana backend (agentless mode):
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="https://<otlp-acceptor-domain>:4317"
export OTEL_EXPORTER_OTLP_HEADERS="x-instana-key=<agent-key>"
export OTEL_SERVICE_NAME="<your-service-name>"
Additionally, if the endpoint of the Instana backend otlp-acceptor or agent is not TLS-enabled, set OTEL_EXPORTER_OTLP_INSECURE to true.
export OTEL_EXPORTER_OTLP_INSECURE=true
Exporting traces to Instana
To instrument the LLM application, complete the following steps:
- Verify that Python 3.10 or later is installed. To check the Python version, run the following command:
python3 -V
- Optional: Create a virtual environment for your applications to keep your dependencies consistent and prevent conflicts with other applications. To create a virtual environment, run the following commands:
pip3 install virtualenv
virtualenv vllm-env
source vllm-env/bin/activate
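Alternatively, you can use Python's built-in venv module instead of virtualenv; a minimal sketch, assuming that python3 points to Python 3.10 or later:
python3 -m venv vllm-env
source vllm-env/bin/activate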
- Install vLLM and the OpenTelemetry packages.
a. To install vLLM, run the following command:
pip3 install vllm==0.6.3.post1
For more ways to install vLLM, see the official vLLM documentation.
b. To install the OpenTelemetry packages, run the following command:
pip3 install \
  'opentelemetry-sdk>=1.26.0,<1.27.0' \
  'opentelemetry-api>=1.26.0,<1.27.0' \
  'opentelemetry-exporter-otlp>=1.26.0,<1.27.0' \
  'opentelemetry-semantic-conventions-ai>=0.4.1,<0.5.0'
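As an optional sanity check (a sketch that is not part of the official steps, assuming a Unix-like shell), you can confirm that vLLM imports correctly and that the OpenTelemetry packages are installed:
python3 -c "import vllm; print(vllm.__version__)"
pip3 list | grep opentelemetry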
- Verify the installation and configuration by using the following sample applications:
- The following code block is a sample application where vLLM is used as a library. The core functionality of the library is integrated directly into an application or system. The model is loaded into memory and used to generate predictions.
import os
from random import randint

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Ensure unique trace IDs for each execution.
random_seed = randint(0, 2**32 - 1)

# Create an LLM.
llm = LLM(
    seed=random_seed,
    model="ibm-granite/granite-3.0-2b-instruct",
    # Set the OpenTelemetry endpoint from the environment variable.
    otlp_traces_endpoint=os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"],
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
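For example, if you save this sample as vllm_library_example.py (a hypothetical file name), run it after you set OTEL_EXPORTER_OTLP_TRACES_ENDPOINT as described in Configuring the environment:
python3 vllm_library_example.py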
- The following code block is a sample application where LLMEngine is used directly to access the model:
import argparse
from typing import List, Tuple

import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (TraceContextTextMapPropagator)
from vllm import EngineArgs, LLMEngine, RequestOutput, SamplingParams
from vllm.utils import FlexibleArgumentParser

trace_provider = TracerProvider()
set_tracer_provider(trace_provider)

trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")


def create_test_prompts() -> List[Tuple[str, SamplingParams]]:
    """Create a list of test prompts with their sampling parameters."""
    return [
        ("A robot may not injure a human being",
         SamplingParams(temperature=0.0, logprobs=1, prompt_logprobs=1)),
        ("To be or not to be,",
         SamplingParams(temperature=0.8, top_k=5, presence_penalty=0.2)),
        ("What is the meaning of life?",
         SamplingParams(n=1, best_of=1, temperature=0.8, top_p=0.95, frequency_penalty=0.1)),
    ]


def process_requests(engine: LLMEngine,
                     test_prompts: List[Tuple[str, SamplingParams]]):
    """Continuously process a list of prompts and handle the outputs."""
    request_id = 0

    while test_prompts or engine.has_unfinished_requests():
        if test_prompts:
            prompt, sampling_params = test_prompts.pop(0)
            engine.add_request(str(request_id), prompt, sampling_params)
            request_id += 1

        request_outputs: List[RequestOutput] = engine.step()

        for request_output in request_outputs:
            if request_output.finished:
                print(request_output)


def initialize_engine(args: argparse.Namespace) -> LLMEngine:
    """Initialize the LLMEngine from the command line arguments."""
    engine_args = EngineArgs.from_cli_args(args)
    return LLMEngine.from_engine_args(engine_args)


def main(args: argparse.Namespace):
    """Main function that sets up and runs the prompt processing."""
    engine = initialize_engine(args)
    test_prompts = create_test_prompts()
    process_requests(engine, test_prompts)


if __name__ == '__main__':
    parser = FlexibleArgumentParser(
        description='Demo on using the LLMEngine class directly')
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)
Run the application with the necessary CLI arguments:
python3 llm-engine.py --model facebook/opt-125m --otlp-traces-endpoint "$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
- The following code block is a sample application where vLLM operates as a separate server that exposes an API or endpoint to handle inference requests. The language model can be deployed on a dedicated server, and users or applications can interact with it remotely. The traces from the client application and the vLLM server are correlated by using the same trace ID that the client propagates to the server. Start the vLLM server with the OpenTelemetry endpoint:
vllm serve ibm-granite/granite-3.0-2b-instruct --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
The following client application sends a completion request to the server and propagates the trace context in the request headers:
import requests
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor, ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (TraceContextTextMapPropagator)

trace_provider = TracerProvider()
set_tracer_provider(trace_provider)

trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

tracer = trace_provider.get_tracer("dummy-client")

vllm_url = "http://localhost:8000/v1/completions"

with tracer.start_as_current_span("client-span", kind=SpanKind.CLIENT) as span:
    prompt = "San Francisco is a"
    span.set_attribute("prompt", prompt)

    # Inject the trace context into the request headers so that the client and
    # server spans are correlated under the same trace ID.
    headers = {}
    TraceContextTextMapPropagator().inject(headers)

    payload = {
        "model": "ibm-granite/granite-3.0-2b-instruct",
        "prompt": prompt,
        "max_tokens": 10,
        "n": 1,
        "best_of": 1,
        "use_beam_search": "false",
        "temperature": 0.0,
    }

    response = requests.post(vllm_url, headers=headers, json=payload)
    print(response)
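For example, if you save the client as vllm_client_example.py (a hypothetical file name), run it in a separate terminal while the vLLM server is running:
python3 vllm_client_example.py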
Viewing traces
To create an application perspective to view trace information that is gathered from the LLM application runtime, complete the following steps:
- Open the New Application Perspective wizard in one of the following ways:
  - On the Instana dashboard, go to the Applications section and click Add application.
  - From the navigation menu, click Applications and then open the Applications dashboard. Then, click Add and select New Application Perspective.
- Select Services or Endpoints and click Next.
- Click Add filter and select a service name. You can select multiple services and endpoints by using OR conditions. The service name is the name that you set in the OTEL_SERVICE_NAME environment variable, for example watsonx_chat_service.
- In the Application Perspective Name field, enter a name for the LLM application perspective. Then, click Create.
The steps to create a new application perspective are complete.
To view trace information, go to the navigation menu and click Analytics to open the Analytics dashboard. Here, you can analyze calls by application, service, and endpoint by breaking down the data that Instana presents by service, endpoint, and call names. You can filter and group traces or calls by using arbitrary tags, for example by filtering on Trace->Service Name equals watsonx_chat_service. You can also view log information along with the trace information. For more information, see Analyzing traces and calls.
Troubleshooting
Information about how to troubleshoot problems with LLM applications.
SSL issues
When you encounter SSL handshake issues (or similar ones) with your LLM applications, you might see the following error:
Handshake failed with fatal error SSL_ERROR_SSL: error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER.
For your vLLM applications to export data to the gRPC endpoint without using TLS, set OTEL_EXPORTER_OTLP_INSECURE=true.
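For example:
export OTEL_EXPORTER_OTLP_INSECURE=true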