Monitoring resource usage with Grafana

To monitor the resource usage in your IBM Watson® Machine Learning Accelerator cluster using the visual charts available in Grafana, you must set up monitoring.

Note: In Watson Machine Learning Accelerator 2.6 and earlier, Grafana was installed with Watson Machine Learning Accelerator and used for monitoring.

In Watson Machine Learning Accelerator 3.0 and later, Grafana was removed and OpenShift metrics are used for monitoring. Some Grafana pods from version 2.6 and earlier might still be running, and the Grafana console might still be available, but it is no longer supported. Use this topic to set up monitoring and view metrics.

A Watson Machine Learning Accelerator cluster administrator can use the Grafana dashboard to view scheduling metrics. Resource usage is available in chart format, showing the requested and used CPU and GPU resources. Charts can be adjusted to show data for a specific resource plan or period of time. Only the Grafana administrator role is supported.

Set up monitoring for Watson Machine Learning Accelerator

  1. Enable monitoring for the namespace that the ibm-cpd-scheduler-service service is in.
    1. Set enableUserWorkload to true.
      $ cat cluster-monitoring-config.yaml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-monitoring-config
        namespace: openshift-monitoring
      data:
        config.yaml: |
          enableUserWorkload: true
    2. Apply the changes.
      $ oc apply -f cluster-monitoring-config.yaml
  2. Create a service monitor for the ibm-cpd-scheduler-service service. Note that the endpoints and selector of the service monitor must match those of the ibm-cpd-scheduler-service service.
    1. Verify the endpoints and selector in ibm-cpd-scheduler-service.
      $ oc describe svc ibm-cpd-scheduler-service -n ibm-common-services
      Name:              ibm-cpd-scheduler-service
      Namespace:         ibm-common-services
      Labels:            app=metrics
                         app.kubernetes.io/instance=cpd-scheduler
                         app.kubernetes.io/managed-by=ansible
                         app.kubernetes.io/name=ibm-cpd-scheduler
                         release=cpd-scheduler
                         role=metrics
                         velero.io/exclude-from-backup=true
      Annotations:       prometheus.io/port: 10501
                         prometheus.io/scrape: true
      Selector:          app.kubernetes.io/instance=cpd-scheduler,app.kubernetes.io/managed-by=ansible,app.kubernetes.io/name=ibm-cpd-scheduler,app=metrics,release=cpd-scheduler,role=metrics
      Type:              ClusterIP
      IP Family Policy:  SingleStack
      IP Families:       IPv4
      IP:                172.30.131.75
      IPs:               172.30.131.75
      Port:              metrics  10501/TCP
      TargetPort:        10501/TCP
      Endpoints:         10.131.1.127:10501
      Session Affinity:  None
      Events:            <none>
    2. Configure the service monitor.
      $ cat scheduler-service-monitor.yaml
      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        labels:
          k8s-app: scheduler-metrics-monitor
        name: scheduler-metrics-monitor
        namespace: ibm-common-services
      spec:
        endpoints:
        - interval: 30s
          port: metrics
          scheme: http
        selector:
          matchLabels:
            app: metrics
    3. Create the service monitor.
      $ oc create -f scheduler-service-monitor.yaml
  3. In Cloud Pak for Data 4.6.4, apply a network policy to grant Grafana access to the scheduler metrics.
    1. Create a Network Policy.
      $ cat patch-ibm-cpd-scheduler-metrics-np.yaml
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: patch-ibm-cpd-scheduler-metrics-np
        namespace: ibm-common-services
      spec:
        egress:
        - ports:
          - port: 8080
            protocol: TCP
        ingress:
        - ports:
          - port: 10501
            protocol: TCP
        podSelector:
          matchLabels:
            release: cpd-scheduler
            role: metrics
        policyTypes:
        - Ingress
        - Egress
    2. Apply the network policy.
      $ oc apply -f patch-ibm-cpd-scheduler-metrics-np.yaml
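
After you complete these steps, you can optionally verify that the scheduler metrics are being picked up. The following commands are a minimal sketch, assuming the default OpenShift user workload monitoring deployment and the resource names used in the steps above:
  # Confirm that the user workload monitoring pods are running
  $ oc get pods -n openshift-user-workload-monitoring

  # Confirm that the service monitor exists
  $ oc get servicemonitor scheduler-metrics-monitor -n ibm-common-services

  # Confirm that the network policy was applied
  $ oc get networkpolicy patch-ibm-cpd-scheduler-metrics-np -n ibm-common-services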

View metrics from the OpenShift console

To view metrics from the OpenShift console:
  1. As an administrator, log in to the OpenShift console.
  2. To view metrics, navigate to Observe > Metrics.
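
If you prefer the command line to the console, you can also run the same queries against the Thanos Querier API. The following is a minimal sketch, assuming the default thanos-querier route in the openshift-monitoring project and that your account is authorized to view cluster metrics:
  # Look up the Thanos Querier route host
  $ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

  # Query a scheduler metric through the Prometheus-compatible API
  $ curl -k -H "Authorization: Bearer $(oc whoami -t)" \
      "https://$HOST/api/v1/query" \
      --data-urlencode 'query=spectrum_scheduler_consumer_used_resource_counter'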

Create a visual dashboard in OpenShift

Create a dashboard in OpenShift that uses your resource usage information:
  1. From the navigation menu, click Observe > Metrics.
  2. Select Insert metric at cursor from the dropdown menu.
  3. Input a metric query and click Add query.
  4. Click Run queries to view visual metrics.

Example

Used and requested GPUs
To obtain the number of GPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
To obtain the number of GPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
Used and requested CPUs
To obtain the number of CPUs used by the service, use the following query:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
To obtain the number of CPUs requested by the service, use the following query:
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="cpu"}
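
If the used and requested series carry identical label sets, as in the example above, you can combine them in a single query. The following is a minimal sketch that charts the fraction of requested GPUs currently in use, assuming the same example consumer path:
spectrum_scheduler_consumer_used_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}
  /
spectrum_scheduler_consumer_requested_resource_counter{consumer="/cpd-inst-01/platform/wml-accelerator/wml-accelerator-cpd-inst-01", type="nvidia.com/gpu"}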

Troubleshooting

Max allocation number for User Resource Allocation is incorrect
Metrics are collected at 30-second intervals, which can result in an incorrect allocation number. If a sample is taken at the moment when resources are being released and reallocated, the same allocation can be counted more than once.
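
If these momentary spikes are a problem for your charts, one possible workaround is to smooth the series over a longer window with a PromQL range function. This is a sketch of a mitigation, not a documented setting; adjust the window to suit your environment:
# Average the used-GPU counter over a 5-minute window to smooth out
# transient double counting during release and reallocation
avg_over_time(spectrum_scheduler_consumer_used_resource_counter{type="nvidia.com/gpu"}[5m])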