Monitoring GPU (private preview)

After you install the NVIDIA GPU Operator and create an OpenTelemetry-based data collector, you can view GPU metrics in the Instana UI.

NVIDIA GPU Operator

Install the NVIDIA GPU Operator in your GPU environment to manage GPUs and collect GPU metrics. Enable optional components such as the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, and the NVIDIA Container Toolkit only if you need them. For more information, see NVIDIA GPU Operator. The following command installs the operator with the driver, toolkit, and device plugin disabled:

helm install gpu-operator \
   --repo https://helm.ngc.nvidia.com/nvidia \
   --namespace gpu-operator \
   --create-namespace \
   --set driver.enabled=false \
   --set toolkit.enabled=false \
   --set devicePlugin.enabled=false \
   --set mig.strategy=single \
   gpu-operator

The NVIDIA GPU Operator installs the NVIDIA Data Center GPU Manager (DCGM) by default. DCGM Exporter is a tool built on NVIDIA DCGM that gathers GPU metrics so that you can understand workload behavior and monitor GPUs in clusters.

DCGM Exporter exposes GPU metrics at an HTTP endpoint (/metrics) that monitoring solutions can scrape. For more information, see DCGM Exporter.

OpenTelemetry-based data collectors

You can forward GPU telemetry data to an Instana agent or to the Instana backend by using the OpenTelemetry Collector. The OpenTelemetry Collector is the core component of the OpenTelemetry ecosystem and offers vendor-independent functions for collecting, processing, and exporting telemetry data. For more information, see OpenTelemetry Collector.

Instana supports two patterns for collecting GPU data:

Agent model: The GPU data is sent to the Instana agent first. The agent aggregates the data and sends it to the Instana backend.
Agentless model: The GPU data is sent directly to the Instana backend without going through the agent.

Forwarding telemetry data to an Instana agent (agent pattern)

To enable OTLP ports for an Instana agent, add the following snippet to the configuration.yaml file of your Instana host agent. Make sure to save the changes and restart the Instana host agent to apply the modifications.

com.instana.plugin.opentelemetry:
  grpc:
    enabled: true
  http:
    enabled: true

The following snippet shows a typical configuration for the OpenTelemetry Collector to forward telemetry data to a local Instana host agent by using the OTLP/gRPC protocol.

Create a YAML file, such as config.yaml, as follows:

receivers:
  otlp:
    protocols:
      grpc:
  prometheus/nvidia-dcgm:
    config:
      scrape_configs:
        - job_name: 'nvidia-dcgm'
          scrape_interval: 10s
          static_configs:
            - targets: ["$(DCGM_EXPORTOR_ENDPOINT)"]
processors:
  batch:
  resource:
    attributes:
      - key: server.address
        from_attribute: net.host.name
        action: insert
      - key: server.port
        from_attribute: net.host.port
        action: insert
      - key: service.name
        value: nvidia-dcgm
        action: update
      - key: INSTANA_PLUGIN
        value: dcgm
        action: insert
exporters:
  otlp:
    endpoint: "$(INSTANA_AGENT_HOST):4317"
    tls:
      insecure: true
service:
  pipelines:
    metrics/nvidia-dcgm:
      receivers: [prometheus/nvidia-dcgm]
      processors: [batch, resource]
      exporters: [otlp]

To forward telemetry data to a local Instana host agent by using the OTLP/HTTP protocol instead, replace the exporters section of the configuration as follows:

exporters:
  otlphttp:
    endpoint: "http://$(INSTANA_AGENT_HOST):4318"
    tls:
      insecure: true
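
A pipeline can only use exporters that are defined under exporters, so switching to otlphttp also means referencing it by name in the service section. A minimal sketch, assuming the receivers and processors from the preceding config.yaml stay unchanged:

```yaml
service:
  pipelines:
    metrics/nvidia-dcgm:
      receivers: [prometheus/nvidia-dcgm]
      processors: [batch, resource]
      # Reference the otlphttp exporter instead of otlp
      exporters: [otlphttp]
```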

Notes:

Replace $(DCGM_EXPORTOR_ENDPOINT) with the DCGM Exporter endpoint in host:port form.
Replace $(INSTANA_AGENT_HOST) with the IP address or host name of the Instana agent to connect to.
Instana uses the standard OTLP port numbers: 4317 for OTLP/gRPC and 4318 for OTLP/HTTP.
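
As an illustration of the notes above, the two placeholders might be filled in as shown in the following fragment. The host names gpu-node-1.example.internal and instana-agent.example.internal are hypothetical, and port 9400 is assumed to be the port on which your DCGM Exporter listens; verify both in your environment.

```yaml
receivers:
  prometheus/nvidia-dcgm:
    config:
      scrape_configs:
        - job_name: 'nvidia-dcgm'
          scrape_interval: 10s
          static_configs:
            # Hypothetical DCGM Exporter host; port 9400 is an assumption
            - targets: ["gpu-node-1.example.internal:9400"]
exporters:
  otlp:
    # Hypothetical Instana host agent address; 4317 is the OTLP/gRPC port
    endpoint: "instana-agent.example.internal:4317"
    tls:
      insecure: true
```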

After you complete all configuration changes in the config.yaml file, run the following command to start the OpenTelemetry Collector:

docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest

Forwarding telemetry data to the Instana backend (agentless pattern)

To forward OpenTelemetry data to the Instana backend by using the OpenTelemetry Collector, complete the following steps:

Create a YAML file, such as config.yaml, as in the preceding section, and change the exporter endpoint from the Instana agent endpoint to the Instana backend endpoint. OpenTelemetry data is sent to the dedicated endpoints of the backend otlp-acceptor component. The Instana backend requires an Instana agent key for validation. It also requires the host.id, faas.id, or device.id resource attribute.
```
exporters:
  otlp:
    endpoint: INSTANA_OTLP_GRPC_BACKEND:4317
    headers:
      x-instana-key: xxxxxxx
      x-instana-host: xxxx
```

Notes:

Replace INSTANA_OTLP_GRPC_BACKEND with the domain name of the otlp-acceptor component of the Instana backend. For more information about the otlp-acceptor endpoints, see Endpoints of Self-Hosted Instana backend otlp-acceptor or Endpoints of SaaS Instana backend otlp-acceptor.
Set the x-instana-key header to the agent key that targets the Instana backend. To find your agent key, click More > Agents in the navigation bar of the Instana UI, and then click Install Agents > Windows.
Set the x-instana-host header to the host ID if no host.id, faas.id, or device.id resource attribute is defined in your application or system.
Instana uses the standard OTLP port numbers: 4317 for OTLP/gRPC and 4318 for OTLP/HTTP. Port 443 is also supported for OTLP/HTTP.
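
Instead of relying on the x-instana-host header, the host.id resource attribute can also be set in the collector itself. The following fragment is a minimal sketch that adds one entry to the resource processor from the preceding section. The value gpu-node-1 is a hypothetical placeholder, and the entry is meant to be added to the existing attributes list rather than to replace it.

```yaml
processors:
  resource:
    attributes:
      # ...keep the existing entries from the preceding section...
      # Hypothetical value; use an identifier that is unique and stable for the GPU host
      - key: host.id
        value: gpu-node-1
        action: upsert
```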

After you complete all configuration changes in the config.yaml file, run the following command to start the OpenTelemetry Collector:

docker run -d -p 4317:4317 -v $(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:latest

Viewing metrics

After you set up the OpenTelemetry (OTel) data collector, you can view the GPU metrics in the Instana UI.

Open the Instana UI, and click Infrastructure. Then, click Analyze Infrastructure.
Select OTEL Dcgm from the list of entity types.
Click an entity instance of the OTEL Dcgm entity type to open the associated dashboard.