Monitoring resource usage with Grafana

To monitor the resource usage in your IBM Watson® Machine Learning Accelerator cluster using the visual charts available in Grafana, you must set up monitoring.

Note: In Watson Machine Learning Accelerator 2.6 and earlier, Grafana was installed with Watson Machine Learning Accelerator and used for monitoring.

In Watson Machine Learning Accelerator 3.0 and later, Grafana was removed and OpenShift metrics are used for monitoring. Some Grafana pods from version 2.6 and earlier might still be running, and the Grafana console might still be available, but it is no longer supported. Use this topic to set up monitoring and view metrics.

A Watson Machine Learning Accelerator cluster administrator can use the Grafana dashboard to view scheduling metrics. Resource usage is available in chart format, showing the requested and used CPU and GPU resources. Charts can be adjusted to show data for a specific resource plan or period of time. Only the Grafana administrator role is supported.

Set up monitoring for Watson Machine Learning Accelerator

  1. Enable monitoring for the namespace that the ibm-cpd-scheduler-service service is in.
    1. Set enableUserWorkload to true.
      $ cat cluster-monitoring-config.yaml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          enableUserWorkload: true
    2. Apply the changes.
      $ oc apply -f cluster-monitoring-config.yaml
  2. Create a service monitor for the ibm-cpd-scheduler-service service. Note that the endpoints and selector of the service monitor must match those of the ibm-cpd-scheduler-service service.
    1. Verify the endpoints and selector in ibm-cpd-scheduler-service.
      $ oc describe svc ibm-cpd-scheduler-service -n ibm-common-services
      Name:              ibm-cpd-scheduler-service
      Namespace:         ibm-common-services
      Labels:            app=metrics
                         app.kubernetes.io/instance=cpd-scheduler
                         app.kubernetes.io/managed-by=ansible
                         app.kubernetes.io/name=ibm-cpd-scheduler
                         release=cpd-scheduler
                         role=metrics
                         velero.io/exclude-from-backup=true
      Annotations:       prometheus.io/port: 10501
                         prometheus.io/scrape: true
      Selector:          app.kubernetes.io/instance=cpd-scheduler,app.kubernetes.io/managed-by=ansible,app.kubernetes.io/name=ibm-cpd-scheduler,app=metrics,release=cpd-scheduler,role=metrics
      Type:              ClusterIP
      IP Family Policy:  SingleStack
      IP Families:       IPv4
      IP:                172.30.131.75
      IPs:               172.30.131.75
      Port:              metrics  10501/TCP
      TargetPort:        10501/TCP
      Endpoints:         10.131.1.127:10501
      Session Affinity:  None
      Events:            <none>
    2. Configure the service monitor.
      $ cat scheduler-service-monitor.yaml
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        labels:
          k8s-app: scheduler-metrics-monitor
        name: scheduler-metrics-monitor
        namespace: ibm-common-services
      spec:
        endpoints:
        - interval: 30s
          port: metrics
          scheme: http
        selector:
          matchLabels:
            app: metrics
    3. Create the service monitor.
      $ oc create -f scheduler-service-monitor.yaml
  3. In Cloud Pak for Data 4.6.4, apply a network policy to grant Grafana access to the scheduler metrics.
    1. Create a Network Policy.
      $ cat patch-ibm-cpd-scheduler-metrics-np.yaml
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: patch-ibm-cpd-scheduler-metrics-np
        namespace: ibm-common-services
      spec:
        egress:
        - ports:
          - port: 8080
            protocol: TCP
        ingress:
        - ports:
          - port: 10501
            protocol: TCP
        podSelector:
          matchLabels:
            release: cpd-scheduler
            role: metrics
        policyTypes:
        - Ingress
        - Egress
    2. Apply the network policy.
      $ oc apply -f patch-ibm-cpd-scheduler-metrics-np.yaml
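
After you complete these steps, you can optionally verify that the scheduler metrics are being picked up. The following commands are a minimal sketch, assuming the default OpenShift user workload monitoring deployment and the resource names used in the steps above:
  # Confirm that the user workload monitoring pods are running
  $ oc get pods -n openshift-user-workload-monitoring

  # Confirm that the service monitor exists
  $ oc get servicemonitor scheduler-metrics-monitor -n ibm-common-services

  # Confirm that the network policy was applied
  $ oc get networkpolicy patch-ibm-cpd-scheduler-metrics-np -n ibm-common-services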

View metrics from the OpenShift console

To view metrics from the OpenShift console:
  1. As an administrator, log in to the OpenShift console.
  2. To view metrics, navigate to Observe > Metrics.
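
If you prefer the command line to the console, you can also run the same queries against the Thanos Querier API. The following is a minimal sketch, assuming the default thanos-querier route in the openshift-monitoring project and that your account is authorized to view cluster metrics:
  # Look up the Thanos Querier route host
  $ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

  # Query a scheduler metric through the Prometheus-compatible API
  $ curl -k -H "Authorization: Bearer $(oc whoami -t)" \
      "https://$HOST/api/v1/query" \
      --data-urlencode 'query=spectrum_scheduler_consumer_used_resource_counter'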

Create a visual dashboard in OpenShift

Create a dashboard in OpenShift that uses your resource usage information:
  1. From the navigation menu, click Observe > Metrics.
  2. Select Insert metric at cursor from the dropdown menu.
  3. Input a metric query and click Add query.
  4. Click Run queries to view visual metrics.

Example

Used and requested GPUs
To obtain the number of GPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
To obtain the number of GPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
Used and requested CPUs
To obtain the number of CPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
To obtain the number of CPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
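
If the used and requested series carry identical label sets, as in the example above, you can combine them in a single query. The following is a minimal sketch that charts the fraction of requested GPUs currently in use, assuming the same example consumer path:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
  /
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}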

Troubleshooting

Max allocation number for User Resource Allocation is incorrect
Metrics are collected at 30-second intervals, which can result in an incorrect allocation number. If a sample is taken at the moment when resources are being released and reallocated, the same allocation can be counted more than once.
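
If these momentary spikes are a problem for your charts, one possible workaround is to smooth the series over a longer window with a PromQL range function. This is a sketch of a mitigation, not a documented setting; adjust the window to suit your environment:
# Average the used-GPU counter over a 5-minute window to smooth out
# transient double counting during release and reallocation
avg_over_time(spectrum_scheduler_consumer_used_resource_counter{type="nvidia.com/gpu"}[5m])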