Resource usage using Grafana
To monitor the resource usage in your IBM Watson® Machine Learning Accelerator cluster using the visual charts available in Grafana, you must set up monitoring.
In Watson Machine Learning Accelerator 3.0 and later, Grafana is removed and OpenShift metrics are used for monitoring. Some of the Grafana pods from version 2.6 and earlier might still be running and the console might still be available, but it is no longer supported. Use this topic to set up and monitor metrics.
A Watson Machine Learning Accelerator cluster administrator can use the Grafana dashboard to view scheduling metrics. Resource usage is available in chart format, showing the requested and used CPU and GPU resources. Charts can be adjusted to show data by resource plan or for a specific time period. Only the Grafana administrator role is supported.
Set up monitoring for Watson Machine Learning Accelerator
- Enable monitoring for the namespace where the ibm-cpd-scheduler-service service runs.
- Set enableUserWorkload to true.
$ cat cluster-monitoring-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
- Apply the changes.
$ oc apply -f cluster-monitoring-config.yaml
- Create a service monitor for the ibm-cpd-scheduler-service service. The endpoints and selector in the service monitor must match those of the ibm-cpd-scheduler-service service.
- Verify the endpoints and selector in ibm-cpd-scheduler-service.
$ oc describe svc ibm-cpd-scheduler-service -n ibm-common-services
Name:              ibm-cpd-scheduler-service
Namespace:         ibm-common-services
Labels:            app=metrics
                   app.kubernetes.io/instance=cpd-scheduler
                   app.kubernetes.io/managed-by=ansible
                   app.kubernetes.io/name=ibm-cpd-scheduler
                   release=cpd-scheduler
                   role=metrics
                   velero.io/exclude-from-backup=true
Annotations:       prometheus.io/port: 10501
                   prometheus.io/scrape: true
Selector:          app.kubernetes.io/instance=cpd-scheduler,app.kubernetes.io/managed-by=ansible,app.kubernetes.io/name=ibm-cpd-scheduler,app=metrics,release=cpd-scheduler,role=metrics
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                172.30.131.75
IPs:               172.30.131.75
Port:              metrics  10501/TCP
TargetPort:        10501/TCP
Endpoints:         10.131.1.127:10501
Session Affinity:  None
Events:            <none>
- Configure the service monitor.
$ cat scheduler-service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: scheduler-metrics-monitor
  name: scheduler-metrics-monitor
  namespace: ibm-common-services
spec:
  endpoints:
  - interval: 30s
    port: metrics
    scheme: http
  selector:
    matchLabels:
      app: metrics
- Create the service monitor.
$ oc create -f scheduler-service-monitor.yaml
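The matching requirement between the service monitor and the service can be illustrated with a short sketch. The helper and the trimmed-down objects below are hypothetical, not part of any product API; they only mirror the label-subset rule that the Prometheus Operator applies, using the labels and port from the oc output above:

```python
# Sketch: does a ServiceMonitor select a given Service? The matchLabels must be
# a subset of the Service's labels, and each named endpoint port must exist on
# the Service. Hypothetical helper for illustration only.
def monitor_selects_service(service, service_monitor):
    match_labels = service_monitor["spec"]["selector"].get("matchLabels", {})
    labels = service.get("metadata", {}).get("labels", {})
    if any(labels.get(k) != v for k, v in match_labels.items()):
        return False
    port_names = {p.get("name") for p in service["spec"].get("ports", [])}
    return all(ep.get("port") in port_names
               for ep in service_monitor["spec"]["endpoints"])

# Minimal stand-ins, with values copied from the examples in this topic
svc = {
    "metadata": {"labels": {"app": "metrics", "release": "cpd-scheduler",
                            "role": "metrics"}},
    "spec": {"ports": [{"name": "metrics", "port": 10501}]},
}
sm = {
    "spec": {
        "selector": {"matchLabels": {"app": "metrics"}},
        "endpoints": [{"interval": "30s", "port": "metrics", "scheme": "http"}],
    },
}
print(monitor_selects_service(svc, sm))  # True: the monitor matches the service
```

If the selector labels or the endpoint port name drift apart, Prometheus silently scrapes nothing, which is why verifying them side by side is worth the extra step.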
- In Cloud Pak for Data 4.6.4, apply a network policy to grant Grafana access to the scheduler's metrics.
- Create a Network Policy.
$ cat patch-ibm-cpd-scheduler-metrics-np.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: patch-ibm-cpd-scheduler-metrics-np
  namespace: ibm-common-services
spec:
  egress:
  - ports:
    - port: 8080
      protocol: TCP
  ingress:
  - ports:
    - port: 10501
      protocol: TCP
  podSelector:
    matchLabels:
      release: cpd-scheduler
      role: metrics
  policyTypes:
  - Ingress
  - Egress
- Apply the network policy.
oc apply -f patch-ibm-cpd-scheduler-metrics-np.yaml
View metrics from the OpenShift console
- As an administrator, log in to the OpenShift console.
- To view metrics, navigate to .
Create a visual dashboard in OpenShift
- From the navigation menu, click .
- Select Insert metric at cursor from the dropdown menu.
- Input a metric query and click Add query.
- Click Run queries to view visual metrics.
Example
- Used and requested GPUs
- To obtain the number of GPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
To obtain the number of GPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
- Used and requested CPUs
- To obtain the number of CPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
To obtain the number of CPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
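The four queries above differ only in the metric name and the type label, so they can be assembled programmatically. The helper below is a hypothetical sketch (not a product API); the metric names, consumer path, and resource types are taken from the examples in this topic:

```python
# Hypothetical helper: build the used/requested PromQL query strings for a
# given consumer path and resource type (e.g. "nvidia.com/gpu" or "cpu").
def scheduler_usage_queries(consumer, resource_type):
    selector = f'{{consumer="{consumer}", type="{resource_type}"}}'
    return {
        "used": f"spectrum_scheduler_consumer_used_resource_counter{selector}",
        "requested": f"spectrum_scheduler_consumer_requested_resource_counter{selector}",
    }

queries = scheduler_usage_queries(
    "/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01",
    "nvidia.com/gpu",
)
print(queries["used"])
print(queries["requested"])
```

Passing "cpu" as the resource type reproduces the CPU queries shown above; substitute your own instance name in the consumer path.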
Troubleshooting
- Max allocation number for User Resource Allocation is incorrect
- Metrics are collected at 30-second intervals, which can result in an incorrect allocation number. Allocations may be counted redundantly at the moment when resources are being released and reallocated.
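Because the inflated count is a transient effect of the 30-second scrape interval, one possible workaround is to smooth the series over a window using a standard PromQL range function such as avg_over_time. This is a sketch, not a documented fix; the metric and consumer names are taken from the examples above, and the 5-minute window is an arbitrary choice:

```promql
avg_over_time(
  spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}[5m]
)
```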