Monitoring Azure Machine Learning
Azure Machine Learning is a cloud service for building, deploying, and managing machine learning models. Instana monitors the Azure Machine Learning service through the Azure Machine Learning sensor. The sensor collects configuration data and performance metrics for your Machine Learning workspaces.
Configuring the Azure Machine Learning sensor
To configure the Azure Machine Learning sensor, complete the following steps:
- Enable your Azure subscription in Instana. Update the <agentinstall_dir>/etc/instana/configuration.yaml agent configuration file as shown in the following example:

    com.instana.plugin.azure:
      enabled: true
      subscription: "[Your-Subscription-Id]"
      tenant: "[Your-Tenant-Id]"
      principals:
        - id: "[Your-Service-Principal-Account-Id]"
          secret: "[Your-Service-Principal-Secret]"

  For more information about installing the Azure agent, see Installation.
- Check that the Azure Machine Learning sensor is enabled in the agent configuration file. You can also configure tags and resource groups as described in the Filtering services by defining tags and resource groups section.

    com.instana.plugin.azure.machinelearning:
      enabled: true # Valid values: true, false. Enabled (true) by default
      include_tags: # Comma-separated list of tags in key:value format (e.g. env:prod,env:staging)
      exclude_tags: # Comma-separated list of tags in key:value format (e.g. env:dev,env:test)
      include_resource_groups: # Comma-separated list of resource groups (e.g. rg_prod,rg_staging)
      exclude_resource_groups: # Comma-separated list of resource groups (e.g. rg_dev,rg_test)
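You might want to check the configuration before relying on the sensor. The following Python sketch assumes that configuration.yaml has already been parsed into a dictionary, for example with PyYAML's yaml.safe_load. The function names missing_azure_fields and ml_sensor_active are hypothetical and are not part of Instana. The sketch also makes two assumptions:

- The Azure plugin must be enabled explicitly.
- The Machine Learning sensor counts as enabled when its enabled key is absent, as the configuration comment states.

```python
from typing import Any, List, Mapping

# Top-level keys used in configuration.yaml (taken from the examples above).
AZURE_KEY = "com.instana.plugin.azure"
ML_KEY = "com.instana.plugin.azure.machinelearning"


def missing_azure_fields(config: Mapping[str, Any]) -> List[str]:
    """List Azure plugin fields that are absent or empty (hypothetical check)."""
    azure = config.get(AZURE_KEY) or {}
    missing = [name for name in ("subscription", "tenant", "principals") if not azure.get(name)]
    # Each service principal needs both an id and a secret.
    for index, principal in enumerate(azure.get("principals") or []):
        for key in ("id", "secret"):
            if not (principal or {}).get(key):
                missing.append(f"principals[{index}].{key}")
    return missing


def ml_sensor_active(config: Mapping[str, Any]) -> bool:
    """Return True if the Azure plugin and the Machine Learning sensor are both enabled."""
    azure = config.get(AZURE_KEY) or {}
    # Assumption: the Azure plugin is off unless 'enabled: true' is set.
    if not azure.get("enabled", False):
        return False
    ml = config.get(ML_KEY) or {}
    # The sensor is enabled by default, so a missing 'enabled' key counts as true.
    return bool(ml.get("enabled", True))
```

For a configuration like the one in the first step, missing_azure_fields returns an empty list. Setting enabled: false under the machinelearning key makes ml_sensor_active return False.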
Disabling the Azure Machine Learning sensor
To disable the Azure Machine Learning sensor, update the <agentinstall_dir>/etc/instana/configuration.yaml agent configuration file as shown in the following example:

    com.instana.plugin.azure.machinelearning:
      enabled: false
Filtering services by defining tags and resource groups
You can define multiple tags and resource groups in the configuration.yaml file. Separate multiple tags or resource groups with commas, and define each tag as a key-value pair separated by a colon (:). If a tag or resource group appears in both the include and exclude lists, the exclude list takes priority. To monitor all services without filtering, do not define any tag or resource group filters.
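A small parser can show how the comma-separated key:value format breaks down. This Python sketch is hypothetical and only mirrors the format described above; it is not how the Instana agent parses these values. It assumes two things:

- Whitespace around items is ignored.
- Only the first colon separates a tag's key from its value, so the value itself can contain colons.

```python
from typing import List, Optional, Tuple


def parse_tags(value: Optional[str]) -> List[Tuple[str, str]]:
    """Split 'env:prod,env:staging' into [('env', 'prod'), ('env', 'staging')]."""
    if not value:
        return []  # an empty or missing setting means no tag filter
    tags = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas such as 'env:prod,'
        # partition() splits at the first colon only.
        key, sep, tag_value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"tag {item!r} is not in key:value format")
        tags.append((key.strip(), tag_value.strip()))
    return tags


def parse_resource_groups(value: Optional[str]) -> List[str]:
    """Split 'rg_prod,rg_staging' into ['rg_prod', 'rg_staging']."""
    if not value:
        return []
    return [group.strip() for group in value.split(",") if group.strip()]
```

For example, parse_tags("env:prod,env:staging") yields two (key, value) pairs. A bare "env" with no colon raises ValueError.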
- To set tags for the include list, update the configuration.yaml file as shown in the following example:

    com.instana.plugin.azure.machinelearning:
      include_tags: # Comma-separated list of tags in key:value format (e.g. env:prod,env:staging)

- To set tags for the exclude list, update the configuration.yaml file as shown in the following example:

    com.instana.plugin.azure.machinelearning:
      exclude_tags: # Comma-separated list of tags in key:value format (e.g. env:dev,env:test)

- To set resource groups for the include list, update the configuration.yaml file as shown in the following example:

    com.instana.plugin.azure.machinelearning:
      include_resource_groups: # Comma-separated list of resource groups (e.g. rg_prod,rg_staging)

- To set resource groups for the exclude list, update the configuration.yaml file as shown in the following example:

    com.instana.plugin.azure.machinelearning:
      exclude_resource_groups: # Comma-separated list of resource groups (e.g. rg_dev,rg_test)
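The precedence rules in this section can be written as a single decision function. The following Python sketch is illustrative only. The Filters class and the is_monitored function are hypothetical, and tags are represented as (key, value) pairs. The sketch assumes that a workspace matches the include lists when any one of its tags, or its resource group, appears in them. Instana's actual matching behavior might differ.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Tag = Tuple[str, str]  # a (key, value) pair such as ("env", "prod")


@dataclass
class Filters:
    """Hypothetical in-memory form of the four filter settings."""
    include_tags: Set[Tag] = field(default_factory=set)
    exclude_tags: Set[Tag] = field(default_factory=set)
    include_resource_groups: Set[str] = field(default_factory=set)
    exclude_resource_groups: Set[str] = field(default_factory=set)


def is_monitored(tags: Set[Tag], resource_group: str, filters: Filters) -> bool:
    """Decide whether a workspace would be monitored under these filters."""
    # 1. The exclude lists take priority over everything else.
    if tags & filters.exclude_tags or resource_group in filters.exclude_resource_groups:
        return False
    # 2. With no include lists defined, every workspace that is not excluded is monitored.
    if not filters.include_tags and not filters.include_resource_groups:
        return True
    # 3. Otherwise, assume one matching tag or a matching resource group is enough.
    return bool(tags & filters.include_tags) or resource_group in filters.include_resource_groups
```

A resource group listed in both include_resource_groups and exclude_resource_groups is rejected at the first check. This matches the rule that the exclude list has higher priority.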
Viewing metrics
To view the metrics, complete the following steps:
- From the navigation menu in the Instana UI, click Infrastructure.
- Click a monitored host.
You can see a host dashboard with all the collected metrics and monitored processes.
Metrics are pulled every minute, which is the resolution that Azure provides for monitoring these services.
Configuration data for the Machine Learning workspace
| Configuration data | Description |
|---|---|
| Name | Name of the Azure Machine Learning workspace |
| Resource Group | Name of the resource group in which the workspace is located |
| Location | Location (Azure region) of the workspace |
| Subscription Id | Azure subscription identifier |
| createdAt | Time when the workspace was created (UTC) |
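The Name, Resource Group, and Subscription Id fields correspond to segments of the workspace's Azure resource ID. That ID has the form /subscriptions/{subscription}/resourceGroups/{resource-group}/providers/Microsoft.MachineLearningServices/workspaces/{name}. The following Python sketch extracts those fields from such an ID. The function parse_workspace_id is hypothetical, and it is not necessarily how Instana obtains these values.

```python
from typing import Dict

_PROVIDER = "microsoft.machinelearningservices"


def parse_workspace_id(resource_id: str) -> Dict[str, str]:
    """Map an Azure Machine Learning workspace resource ID to the table fields."""
    parts = resource_id.strip("/").split("/")
    if len(parts) != 8:
        raise ValueError(f"not a workspace resource ID: {resource_id!r}")
    # Segments alternate name/value; Azure treats segment names case-insensitively.
    segments = {name.lower(): value for name, value in zip(parts[0::2], parts[1::2])}
    expected = {"subscriptions", "resourcegroups", "providers", "workspaces"}
    if set(segments) != expected or segments["providers"].lower() != _PROVIDER:
        raise ValueError(f"not a workspace resource ID: {resource_id!r}")
    return {
        "Name": segments["workspaces"],
        "Resource Group": segments["resourcegroups"],
        "Subscription Id": segments["subscriptions"],
    }
```

IDs that belong to other resource types, such as a storage account under Microsoft.Storage, are rejected with ValueError.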
Performance metrics for the Machine Learning workspace
| Metric | Name | Unit | Aggregation | Description |
|---|---|---|---|---|
| Processor Utilization | ||||
| Count | CpuUtilizationPercentage | Count | Average | The utilization percentage of a CPU node averaged in a minute |
| Count | GpuUtilizationPercentage | Count | Average | The utilization percentage of a GPU node averaged in a minute |
| Nodes | ||||
| Count | Active Nodes | Count | Average | Number of nodes actively running a job, averaged over a one-minute interval |
| Count | Total Nodes | Count | Average | Sum of Active Nodes, Idle Nodes, Unusable Nodes, Preempted Nodes, and Leaving Nodes, averaged over a one-minute interval |
| Cores | ||||
| Count | Active Cores | Count | Average | Number of active cores averaged in a minute |
| Count | Total Cores | Count | Average | Number of total cores averaged in a minute |
| Runs | ||||
| Count | Started Runs | Count | Total | The total number of runs running for this workspace in a minute |
| Count | Completed Runs | Count | Total | The total number of runs completed successfully for this workspace in a minute |
| Count | Cancelled Runs | Count | Total | The total number of runs cancelled for this workspace in a minute |
| Count | Errors | Count | Total | The total number of run errors in this workspace within a minute |
| Count | Failed Runs | Count | Total | The total number of runs failed for this workspace in a minute |
| Disk | ||||
| Count | DiskUsedMegabytes | Count | Average | The average disk space used, in megabytes, in a minute |
| Count | DiskReadMegabytes | Count | Average | The average amount of data read from disk, in megabytes, in a minute |
| Count | DiskWriteMegabytes | Count | Average | The average amount of data written to disk, in megabytes, in a minute |
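The Aggregation column describes how the raw values in each one-minute interval combine into a single data point. Average takes the mean of the values, and Total takes their sum. The following Python sketch illustrates the difference with made-up samples. The function name and the (timestamp, value) input format are assumptions, and the code only illustrates the arithmetic; it does not reproduce how Azure or Instana computes these values.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def aggregate_per_minute(samples: Iterable[Tuple[float, float]], aggregation: str) -> Dict[int, float]:
    """Combine (unix_seconds, value) samples into one value per one-minute interval."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Key each sample by the start of its minute, in Unix seconds.
        buckets[int(timestamp // 60) * 60].append(value)
    if aggregation == "Average":
        combine = mean  # e.g. CpuUtilizationPercentage, Active Nodes
    elif aggregation == "Total":
        combine = sum  # e.g. Completed Runs, Failed Runs
    else:
        raise ValueError(f"unsupported aggregation: {aggregation!r}")
    return {minute: combine(values) for minute, values in sorted(buckets.items())}
```

Take the samples [(0, 40), (30, 60), (65, 10)]. Average reports 50 for the first minute, and Total reports 100. Both report 10 for the second minute.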