Monitoring GPU (private preview)
After you install the NVIDIA GPU Operator and create an OpenTelemetry-based data collector, you can view GPU metrics in the Instana UI.
NVIDIA GPU Operator
You can install the NVIDIA GPU Operator in your GPU environment to help manage GPUs and collect GPU metrics. Enable components such as the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, and the NVIDIA Container Toolkit if you need them. For more information, see NVIDIA GPU Operator. The following example installs the operator with the driver, the toolkit, and the device plugin disabled:
helm install gpu-operator \
--repo https://helm.ngc.nvidia.com/nvidia \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set devicePlugin.enabled=false \
--set mig.strategy=single \
gpu-operator
The NVIDIA GPU Operator installs the NVIDIA Data Center GPU Manager (DCGM) by default. DCGM Exporter is a tool built on NVIDIA DCGM that lets you gather GPU metrics to understand workload behavior and monitor GPUs in clusters.
DCGM Exporter exposes GPU metrics at an HTTP endpoint (/metrics) for monitoring solutions. For more information, see DCGM Exporter.
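To make the output of the /metrics endpoint more concrete, the following Python sketch parses a few lines in the Prometheus text exposition format, which is the format that DCGM Exporter serves. The sample text is illustrative and was not captured from a real exporter: DCGM_FI_DEV_GPU_UTIL is a DCGM field name, but the label values and readings are made up. The helper name parse_metrics is hypothetical, and the parser is deliberately simplified (it does not handle escaped quotes in label values).

```python
import re

# Illustrative sample in Prometheus text exposition format.
# Label values and readings are invented for this sketch.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 82
"""

# metric_name{label="value",...} value
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("gpu"), value)
```

To inspect a live exporter instead of the sample, you could pass the text that is returned by the exporter's /metrics URL to the same function.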
OpenTelemetry-based data collectors
You can forward GPU telemetry data to an Instana agent or to the Instana backend by using the OpenTelemetry Collector. The OpenTelemetry Collector is the core component of the OpenTelemetry ecosystem, offering vendor-independent functions for telemetry data collection, processing, and export. For more information, see OpenTelemetry Collector.
Instana supports two patterns for collecting GPU data:
- Agent model: The GPU data is sent to the Instana agent first. The agent aggregates the data and sends it to the Instana backend.
- Agentless model: The GPU data is sent directly to the Instana backend without going through the agent.
Forwarding telemetry data to an Instana agent (agent pattern)
- To enable the OTLP ports of an Instana agent, add the following snippet to the configuration.yaml file of your Instana host agent. Save the changes and restart the Instana host agent to apply them.
com.instana.plugin.opentelemetry:
  grpc:
    enabled: true
  http:
    enabled: true
- The following snippet shows a typical configuration for the OpenTelemetry Collector to forward telemetry data to a local Instana host agent by using the OTLP/gRPC protocol.
Create a YAML file, such as config.yaml, as follows:
receivers:
  otlp:
    protocols:
      grpc:
  prometheus/nvidia-dcgm:
    config:
      scrape_configs:
        - job_name: 'nvidia-dcgm'
          scrape_interval: 10s
          static_configs:
            # targets must be a list of host:port strings
            - targets: ["$(DCGM_EXPORTOR_ENDPOINT)"]
processors:
  batch:
  resource:
    attributes:
      - key: server.address
        from_attribute: net.host.name
        action: insert
      - key: server.port
        from_attribute: net.host.port
        action: insert
      - key: service.name
        value: nvidia-dcgm
        action: update
      - key: INSTANA_PLUGIN
        value: dcgm
        action: insert
exporters:
  otlp:
    endpoint: "$(INSTANA_AGENT_HOST):4317"
    tls:
      insecure: true
service:
  pipelines:
    metrics/nvidia-dcgm:
      receivers: [prometheus/nvidia-dcgm]
      processors: [batch, resource]
      exporters: [otlp]
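The following example shows a typical exporter configuration of the OpenTelemetry Collector for forwarding telemetry data to a local Instana host agent by using the OTLP/HTTP protocol. If you use this exporter, also list otlphttp instead of otlp under exporters in the metrics/nvidia-dcgm pipeline.
exporters:
  otlphttp:
    # The otlphttp exporter expects a URL that includes the scheme.
    endpoint: "http://$(INSTANA_AGENT_HOST):4318"
    tls:
      insecure: true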
Notes:
- Set the DCGM_EXPORTOR_ENDPOINT field with the DCGM Exporter endpoint.
- Set the INSTANA_AGENT_HOST field with the IP or the host of the Instana agent to connect to.
- Instana uses OTLP standard port numbers, such as 4317 for OTLP/gRPC and 4318 for OTLP/HTTP.
- After you complete all configuration changes in the config.yaml file, run the following command to use the OpenTelemetry Collector:
docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest
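After the agent is restarted, it can help to confirm that the OTLP ports accept connections before you look for data in the UI. The following Python sketch attempts a plain TCP connection to ports 4317 and 4318. The function name otlp_port_open and the host value localhost are assumptions for illustration; use the value that you set for INSTANA_AGENT_HOST. A successful connection shows only that something is listening on the port, not that OTLP data is accepted.

```python
import socket


def otlp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # "localhost" is an assumption; replace it with your Instana agent host.
    for port in (4317, 4318):
        state = "open" if otlp_port_open("localhost", port) else "closed"
        print(f"port {port}: {state}")
```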
Forwarding telemetry data to the Instana backend (agentless pattern)
To forward OpenTelemetry data to the Instana backend by using the OpenTelemetry Collector, complete the following steps:
- Create a YAML file, such as config.yaml in the preceding section. Change the endpoint from Instana agent endpoint to Instana backend endpoint. The special endpoints of the backend otlp-acceptor component are used when the OpenTelemetry data is sent. The Instana backend requires an Instana agent key for validation. The Instana backend also requires the host.id, faas.id, or device.id resource attribute.
exporters:
  otlp:
    endpoint: INSTANA_OTLP_GRPC_BACKEND:4317
    headers:
      x-instana-key: xxxxxxx
      x-instana-host: xxxx
Notes:
- Set the INSTANA_OTLP_GRPC_BACKEND field with the correct domain name of the otlp-acceptor component of the Instana backend. For more information about the endpoint of the Instana backend otlp-acceptor, see Endpoints of Self-Hosted Instana backend otlp-acceptor or Endpoints of SaaS Instana backend otlp-acceptor.
- Set the x-instana-key field with the agent key of the Instana agent for targeting the Instana backend. To find your agent key, you can click More > Agents in the navigation bar of the Instana UI and then click Install Agents > Windows.
- Set the x-instana-host field with the host ID if no host.id, faas.id, or device.id resource attribute is defined in your application or system.
- Instana uses OTLP standard port numbers, such as 4317 for OTLP/gRPC and 4318 for OTLP/HTTP. Port 443 is also supported for OTLP/HTTP.
- After you complete all configuration changes in the config.yaml file, run the following command to use the OpenTelemetry Collector:
docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest
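The header rules in the preceding notes can be sketched as a small Python helper: the agent key is always sent, and x-instana-host is added only when the telemetry carries none of the host.id, faas.id, or device.id resource attributes. The function name build_backend_headers and all values that are passed to it are hypothetical placeholders, not part of any Instana API.

```python
# Resource attributes that identify the source of the telemetry data.
IDENTITY_ATTRIBUTES = ("host.id", "faas.id", "device.id")


def build_backend_headers(agent_key, resource_attributes, host_id=None):
    """Return the exporter headers for the agentless pattern.

    agent_key: value for the x-instana-key header.
    resource_attributes: dict of resource attributes already set on the data.
    host_id: fallback value for x-instana-host when no identity attribute exists.
    """
    headers = {"x-instana-key": agent_key}
    has_identity = any(resource_attributes.get(key) for key in IDENTITY_ATTRIBUTES)
    if not has_identity:
        if host_id is None:
            raise ValueError(
                "no host.id, faas.id, or device.id attribute found; "
                "a host ID is needed for x-instana-host"
            )
        headers["x-instana-host"] = host_id
    return headers


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_backend_headers("xxxxxxx", {}, host_id="xxxx"))
```

The returned dictionary corresponds to the headers section of the otlp exporter in config.yaml.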
Viewing metrics
After you set up the OpenTelemetry (OTel) data collector, you can view the GPU metrics in the Instana UI.
- Open the Instana UI, and click Infrastructure. Then, click Analyze Infrastructure.
- Select OTEL Dcgm from the list of entity types.
- Click an entity instance of the OTEL Dcgm entity type to open the associated dashboard.