Azure GPU metrics collection
Use NVIDIA Data Center GPU Manager (DCGM) to collect Azure GPU metrics. Turbonomic uses these metrics to generate scale actions that optimize performance and costs. For a list of metrics and associated scale actions, see this topic.
This topic provides an overview of the DCGM configuration that is required to collect Azure GPU metrics. For detailed configuration guidelines and instructions, contact your Turbonomic representative.
Requirements
At a high level, you must ensure that the following requirements are met:
- The VM image contains NVIDIA GPU drivers.
- The DCGM package is installed. This package provides the dcgmi CLI tool that you can use to collect GPU metrics.
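As a quick pre-check of the requirements above, the following minimal sketch looks for the relevant command-line tools on the VM. It assumes that nvidia-smi is installed along with the NVIDIA drivers and that dcgmi is installed along with the DCGM package; the function name missing_tools is hypothetical and not part of any Turbonomic or NVIDIA tooling.

```python
import shutil

# Assumption: nvidia-smi ships with the NVIDIA driver and dcgmi ships with
# the DCGM package, so finding both on PATH suggests the requirements are met.
REQUIRED_TOOLS = ("nvidia-smi", "dcgmi")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools() on the VM returns an empty list when both tools are found. A tool on PATH does not prove that the driver or the DCGM service is working, so treat the result as a first check only.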
Metrics collection options
Use any of the following options to collect metrics:
- DCGM exporter
  DCGM exporter exposes GPU metrics in Prometheus format. The Telegraf agent, through its Prometheus plugin, collects the metrics from DCGM exporter and uses Azure Monitor to log them directly to the VM. Turbonomic then uses the Azure Monitor REST API to collect GPU metrics. This is the same API that enables the collection of standard metrics, such as vCPU.
  In the Azure portal, when you view a VM's details in the Metrics section, the GPU metrics are available in the telegraf/prometheus namespace.
- Custom Python script
  The custom Python script reads data using the dcgmi CLI tool and then sends the collected metrics directly to the Azure Log Analytics Workspace. To send metrics, the Log Analytics Workspace must be set up correctly using a workspace ID and client authentication keys.
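To illustrate what the Prometheus format looks like for the DCGM exporter option, the following minimal sketch parses exposition text into samples. It is not how Telegraf works internally. The sample text is hypothetical (including the UUID value), and the metric names DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_FB_USED are assumed to be enabled in the exporter's configuration.

```python
import re

# Hypothetical sample of the text that DCGM exporter might serve.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-hypothetical"} 37
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-hypothetical"} 1024
"""

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_metric(samples, name, gpu="0"):
    """Return the value of metric `name` for the given GPU index, or None."""
    for metric, labels, value in samples:
        if metric == name and labels.get("gpu") == gpu:
            return value
    return None
```

For example, gpu_metric(parse_prometheus_text(SAMPLE_METRICS), "DCGM_FI_DEV_GPU_UTIL") reads the utilization sample for GPU 0 from the hypothetical text. On a real VM, the text would come from the exporter's HTTP endpoint, which commonly listens on port 9400; confirm the port for your deployment.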
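The custom Python script option can be sketched as follows. This is an illustration under stated assumptions, not the script that Turbonomic provides. It assumes that metrics are read with dcgmi dmon (field 203 for GPU utilization and field 252 for framebuffer memory used), that the output contains rows that start with "GPU" followed by the GPU index, and that metrics are sent through the Azure Monitor HTTP Data Collector API, which authenticates with a workspace ID and a shared key. The log type GpuMetrics and all function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import subprocess
from email.utils import formatdate

# Hypothetical dcgmi dmon output; the real layout can differ between versions.
SAMPLE_DMON = """\
#Entity   GPUTL  FBUSD
ID
GPU 0     37     1024
"""


def read_dmon():
    """Run dcgmi once for fields 203 and 252 and return the raw output."""
    cmd = ["dcgmi", "dmon", "-e", "203,252", "-c", "1"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_dmon(text, fields):
    """Turn 'GPU <index> <value>...' rows into dictionaries keyed by `fields`."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(fields) + 2 or parts[0] != "GPU":
            continue  # header line or a layout this sketch does not expect
        try:
            row = {"gpu": int(parts[1])}
            row.update(zip(fields, (float(p) for p in parts[2:])))
        except ValueError:
            continue  # non-numeric value, for example N/A
        rows.append(row)
    return rows


def build_request(workspace_id, shared_key, rows, date=None, log_type="GpuMetrics"):
    """Build the URL, headers, and body for a Data Collector API POST."""
    body = json.dumps(rows).encode("utf-8")
    date = date or formatdate(usegmt=True)  # RFC 1123 date in GMT
    string_to_sign = (
        f"POST\n{len(body)}\napplication/json\nx-ms-date:{date}\n/api/logs"
    )
    digest = hmac.new(
        base64.b64decode(shared_key), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}",
        "Log-Type": log_type,
        "x-ms-date": date,
    }
    url = f"https://{workspace_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    return url, headers, body
```

On a VM, the rows would come from parse_dmon(read_dmon(), ["gpu_util", "fb_used"]), and the request could then be sent with urllib.request or a similar HTTP client. Keeping the request construction separate from the network call makes the signing logic easy to check without a workspace.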