Azure GPU metrics collection

Use NVIDIA Data Center GPU Manager (DCGM) to collect Azure GPU metrics. Turbonomic uses these metrics to generate scale actions that optimize performance and costs. For a list of metrics and associated scale actions, see this topic.

Note:

This topic provides an overview of the DCGM configuration that is required to collect Azure GPU metrics. For detailed configuration guidelines and instructions, contact your Turbonomic representative.

Requirements

At a high level, you must ensure that the following requirements are met:

  • The VM image contains NVIDIA GPU drivers.

  • The DCGM package is installed. This package provides the dcgmi CLI tool that you can use to collect GPU metrics.
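As a rough illustration of collecting GPU metrics with dcgmi, the following Python sketch takes a single sample with `dcgmi dmon` and parses the text output. It is not part of the Turbonomic configuration. The field IDs (203 for GPU utilization, 252 for used framebuffer memory), the assumed row layout of the output, and the helper names `parse_dmon_output` and `collect_once` are assumptions for illustration. Verify them against your DCGM version.

```python
import subprocess

# Assumed DCGM field IDs: 203 = GPU utilization (%), 252 = framebuffer used (MiB).
FIELD_IDS = ["203", "252"]


def parse_dmon_output(text, field_names):
    """Parse `dcgmi dmon` text output into {gpu_id: {field_name: value}}.

    Assumes data rows look like 'GPU 0   35   1024'. Header rows and
    any other lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows start with the entity type followed by a numeric ID.
        if len(parts) < 2 + len(field_names):
            continue
        if parts[0] != "GPU" or not parts[1].isdigit():
            continue
        values = {}
        for name, raw in zip(field_names, parts[2:]):
            try:
                values[name] = float(raw)
            except ValueError:
                values[name] = None  # for example 'N/A' for an unsupported field
        samples[int(parts[1])] = values
    return samples


def collect_once():
    """Take one sample. Requires DCGM to be installed and its host engine running."""
    result = subprocess.run(
        ["dcgmi", "dmon", "-e", ",".join(FIELD_IDS), "-c", "1"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_dmon_output(result.stdout, ["gpu_util", "fb_used"])
```

On a VM that meets the requirements above, calling `collect_once()` returns one dictionary entry per GPU. The parser is kept separate from the subprocess call so that it can be checked against captured output without a GPU.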

Metrics collection options

Use any of the following options to collect metrics:

  • DCGM exporter

    DCGM exporter exposes GPU metrics in Prometheus format. The Telegraf agent, through its Prometheus plugin, collects these metrics from DCGM exporter and sends them to Azure Monitor, where they are logged directly against the VM. Turbonomic then uses the Azure Monitor REST API to collect the GPU metrics. This is the same API that enables the collection of standard metrics, such as vCPU.

    In the Azure portal, when you view a VM's details in the Metrics section, the GPU metrics are available in the telegraf/prometheus namespace.

  • Custom Python script

    The custom Python script reads GPU data with the dcgmi CLI tool and then sends the collected metrics directly to an Azure Log Analytics workspace. To send metrics, the script needs the workspace ID and client authentication keys of a correctly configured Log Analytics workspace.
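To make the role of the workspace ID and authentication key more concrete, the following Python sketch builds a signed request for a Log Analytics workspace. It assumes that the custom script uses the Azure Monitor HTTP Data Collector API with shared-key authorization, which this topic does not confirm. The function name `build_request`, the record fields, and the `GpuMetrics` log type are hypothetical. The sketch only constructs the request and does not send it.

```python
import base64
import hashlib
import hmac
import json
from datetime import datetime, timezone
from email.utils import format_datetime


def build_request(workspace_id, shared_key, records, log_type, now=None):
    """Return (url, headers, body) for one HTTP Data Collector API POST.

    Assumption: shared-key authorization, where the signature is an
    HMAC-SHA256 over the method, body length, content type, date, and path.
    """
    body = json.dumps(records).encode("utf-8")
    now = now or datetime.now(timezone.utc)
    # RFC 1123 date string required by the x-ms-date header.
    rfc1123_date = format_datetime(now, usegmt=True)
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\n"
        f"x-ms-date:{rfc1123_date}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key),  # the workspace key is Base64-encoded
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    url = (
        f"https://{workspace_id}.ods.opinsights.azure.com"
        "/api/logs?api-version=2016-04-01"
    )
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {workspace_id}:{signature}",
        "Log-Type": log_type,  # hypothetical custom log name
        "x-ms-date": rfc1123_date,
    }
    return url, headers, body
```

A script could pass the returned values to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers, method="POST")`. Keeping the signing step in a pure function makes it possible to check the request format without real credentials.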