Perform routine IBM Cloud Pak for Data monitoring
Establish a schedule for monitoring your IBM Cloud Pak for Data deployments.
If you have multiple Cloud Pak for Data deployments, complete the same process for each deployment.
- Who should perform this task?

  A Cloud Pak for Data administrator with one of the following permissions must perform this task:
  - Administer platform
  - Manage platform health
  - View platform health (read-only access)
- How frequently should you perform this task?

  Perform this task at least once per day or once per shift. However, if the number of concurrent users or jobs varies widely, perform this task more frequently during peak activity.
Your routine should include the following tasks:
- Check the status of the platform from the Monitoring page.

  Important: Ensure that a cluster administrator completed one of the following tasks:
  - Enable the default monitoring stack on the Red Hat® OpenShift® Container Platform cluster.
  - Install the Kubernetes Metrics Server on the Red Hat OpenShift Container Platform cluster.

  If neither of these tasks is completed, monitoring data is not available on the Cloud Pak for Data Monitoring page.
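Whether a resource-metrics source is registered can also be confirmed from the command line. The following is a minimal sketch: on a live cluster you would pipe in the output of `oc get apiservices -o name`; the sample line below stands in for that output.

```shell
# Sketch: confirm that a resource-metrics source is registered before relying
# on the Monitoring page. The v1beta1.metrics.k8s.io APIService is served by
# the Kubernetes Metrics Server or by the monitoring stack's metrics adapter.
check_metrics_source() {
  if grep -q 'v1beta1\.metrics\.k8s\.io'; then
    echo "metrics available"
  else
    echo "metrics NOT available"
  fi
}

# Sample input standing in for `oc get apiservices -o name` output:
printf 'apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io\n' | check_metrics_source
# prints "metrics available"
```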
  Check the following metrics:

  - vCPU and memory use
    - Check the amount of vCPU and memory in use compared to the requests and limits.

      Optional: From the Platform resource overview card, click to see a breakdown of use by:
      - Services

        Review the vCPU and memory use and requests to identify which services consume the most resources.

        If you click a service, you can see historical resource use for the service. This information can help you determine whether a particular service is causing a spike in resource consumption.
      - Service instances

        Review the vCPU and memory use and requests to identify which service instances consume the most resources.

        Look for any service instances that are over-sized or unused, for example, instances whose resource use is consistently below the resource requests.

        If you click a service instance, you can see historical resource use for the service instance. This information can help you determine whether a particular service instance is causing a spike in resource consumption.
      - Projects

        Review the vCPU and memory requests to identify which projects consume the most resources.

        Look for any orphaned or unused projects that can be deleted to free up resources.

        If you click a project, you can see historical resource use for the project. This information can help you determine whether a particular project is causing a spike in resource consumption.
      - Tool runtimes

        Look for any runtimes that are over-sized or unused, for example, runtimes whose resource use is consistently below the resource requests.
      - Pods

        Review the vCPU and memory requests to identify which pods consume the most resources.

        Compare the current vCPU and memory use against the limits. The cluster terminates any pod that exceeds its memory limit; CPU use above the CPU limit is throttled.
    - If you set quotas on the platform, services, or projects, check the vCPU and memory use against the thresholds that you set.

      In addition, compare the vCPU and memory requests against the quotas to determine whether you need to allocate additional resources or whether you can reduce the quotas.
    - If you are running out of vCPU or memory, determine whether there are any processes that you can stop or whether you need to add more vCPU or memory to your cluster.

      For example, there might be old jobs or tool runtimes that are consuming resources but that no longer provide value to your organization.
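One way to spot over-sized workloads outside the UI is to compare current use against requests. The following is a minimal sketch; the workload names, the sample values (in millicores of vCPU), and the 20% threshold are illustrative, not platform defaults. On a live cluster, the use column could come from `oc adm top pods` and the request column from the pod specs.

```shell
# Sketch: flag workloads whose vCPU use is far below their vCPU request,
# a sign of over-sized or unused workloads. Columns: name, use (millicores),
# request (millicores). Sample rows stand in for live cluster data.
usage='runtime-a 120 2000
job-b 950 1000
svc-c 40 1500'

# Print workloads using less than 20% of their vCPU request
# (the threshold is an illustrative choice).
printf '%s\n' "$usage" | awk '$2 < 0.2 * $3 {printf "%s: %sm used of %sm requested\n", $1, $2, $3}'
```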
  - Services status summary
    - Check for services in a critical state. A critical state indicates that the service has one of the following issues:
      - A service instance in a failed state
      - A pod in a failed or unknown state

      For troubleshooting tips, see What should you do if you find pods in a failed state?
    - Check for services in a warning state. A warning state indicates that the service has one of the following issues:
      - A service instance in a pending state
      - A pod in a pending state

      Typically, a service in this state isn't a cause for concern unless the service has been in this state for a long time. Service instances and pods are in a pending state when they are waiting to be scheduled. If they remain in a pending state for a long time, it might mean that:
      - The cluster has insufficient resources to schedule pods.
      - The quota settings are preventing pods from being scheduled. For details, see Setting and enforcing quotas.
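When pods stay pending, the cluster's event stream usually names one of these two causes. The following is a minimal sketch that scans event messages for them; the sample lines (and the quota name in them) are illustrative stand-ins shaped like `oc get events -n <namespace>` output.

```shell
# Sketch: scan event messages for the two causes of long-pending pods
# described above: quota settings blocking pod creation, and insufficient
# cluster resources. Sample lines stand in for live `oc get events` output.
events='FailedCreate pods "svc-b-0" is forbidden: exceeded quota: cpd-quota, requested: limits.cpu=2, used: limits.cpu=39, limited: limits.cpu=40
FailedScheduling 0/6 nodes are available: 6 Insufficient memory.'

# Count event lines matching either cause (case-insensitive).
printf '%s\n' "$events" | grep -Eic 'exceeded quota|insufficient (cpu|memory)'
# prints 2: both causes appear in the sample events
```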
  - Service instances status summary
    - Check for service instances in a critical state. A critical state indicates that the service instance has one of the following issues:
      - The instance is in a failed state
      - A pod is in a failed or unknown state

      For troubleshooting tips, see What should you do if you find pods in a failed state?
    - Check for service instances in a warning state. A warning state indicates that the service instance has one of the following issues:
      - The instance is in an unknown state
      - A pod is in a pending state

      Typically, a service instance in this state isn't a cause for concern unless the instance has been in this state for a long time. Service instances and pods are in a pending state when they are waiting to be scheduled. If they remain in a pending state for a long time, it might mean that:
      - The cluster has insufficient resources to schedule pods.
      - The quota settings are preventing pods from being scheduled. For details, see Setting and enforcing quotas.
  - Tool runtimes status summary

    Check for tool runtimes that have at least one pod in a failed state. For troubleshooting tips, see What should you do if you find pods in a failed state?
  - Pods status summary
    - Check for pods that are in a failed or unknown state.

      For troubleshooting tips, see What should you do if you find pods in a failed state?
    - Check for pods in a warning state. A warning state indicates that the pod is in a pending state.

      Typically, a pod in this state isn't a cause for concern unless the pod has been in this state for a long time. Pods are pending when they are waiting to be scheduled. If they remain in a pending state for a long time, it might mean that:
      - The cluster has insufficient resources to schedule pods.
      - The quota settings are preventing pods from being scheduled. For details, see Setting and enforcing quotas.
    - Click the pod status summary to see more details:
      - Check for pods with more than one restart.

        For troubleshooting tips, see What should you do if you find pods in a failed state?
      - Check for pods that are running but that have no ready containers or too few ready containers, for example, 0/1 ready containers.

        For troubleshooting tips, see What should you do if you find pods in a failed state?
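The restart and ready-container checks above can also be run against `oc get pods` output. The following is a minimal sketch; the pod names and sample rows are illustrative stand-ins for live output.

```shell
# Sketch: from `oc get pods`-style output, list pods with more than one
# restart or with 0 ready containers (for example, 0/1 READY).
pods='NAME      READY   STATUS    RESTARTS   AGE
svc-a-0   1/1     Running   0          2d
svc-b-0   0/1     Pending   0          3h
svc-c-0   1/1     Running   4          2d'

# Skip the header row; match RESTARTS > 1 or a READY column starting with 0/.
printf '%s\n' "$pods" | awk 'NR > 1 && ($4 + 0 > 1 || $2 ~ /^0\//) {print $1 " (ready=" $2 ", restarts=" $4 ")"}'
# prints svc-b-0 (no ready containers) and svc-c-0 (4 restarts)
```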
  - Projects status summary

    Check the number of projects that are running on the platform. You can click this card to view more detailed information about each project.
    - Review the vCPU and memory requests to identify which projects consume the most resources.
    - Look for any orphaned or unused projects that can be deleted to free up resources.
- What should you do if you find pods in a failed state?

  Pods in a failed state indicate an underlying problem with the service. For example, the pod might not be able to pull one or more required images, or the pod might be overloaded.

  Review the pod details to look for status or event information that can help you determine why the pod failed, and take the appropriate action to remediate the issue. For example:
  - Ensure that the repository that you pull images from is running.
  - Ensure that the appropriate image pull secret exists.
  - Increase the resources allocated to the service, instance, or runtime.
  - Increase the amount of memory or vCPU on the cluster.

  If you cannot determine the root cause of the problem, run a diagnostic job to collect the relevant information to open a case with IBM® Support.
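The remediation steps above can be summarized as a mapping from common failure reasons (as shown in the pod's status or events) to actions. The reason strings below are standard Kubernetes status values; the mapping itself is an illustrative sketch, not an IBM-provided tool.

```shell
# Sketch: map common pod failure reasons to the remediation steps above.
diagnose() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "check that the image registry is running and the image pull secret exists" ;;
    OOMKilled)
      echo "increase the resources allocated to the service, instance, or runtime" ;;
    Pending)
      echo "check cluster capacity and quota settings" ;;
    *)
      echo "run a diagnostic job and open a case with IBM Support" ;;
  esac
}

diagnose ImagePullBackOff   # prints the registry / pull-secret check
diagnose OOMKilled          # prints the resource remediation
```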
- Review any alerts that were issued by Cloud Pak for Data.

  By default, the platform runs several predefined monitors (scripts that check the state of an entity) every 10 minutes.
  - If a monitor reports a critical event 3 times in a row, the platform issues a critical alert.
  - If a monitor reports a warning event 5 times in a row, the platform issues a warning alert.
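The thresholds above can be expressed as a simple rule. The following sketch models the described behavior; the function name and its interface are illustrative, not part of the platform.

```shell
# Sketch of the alerting rule described above: a monitor that reports the
# same event 3 times in a row (critical) or 5 times in a row (warning)
# causes the platform to issue an alert.
alert_for() {
  level="$1"          # "critical" or "warning"
  consecutive="$2"    # number of consecutive occurrences reported
  if [ "$level" = critical ] && [ "$consecutive" -ge 3 ]; then
    echo "critical alert"
  elif [ "$level" = warning ] && [ "$consecutive" -ge 5 ]; then
    echo "warning alert"
  else
    echo "no alert yet"
  fi
}

alert_for critical 3   # prints "critical alert"
alert_for warning 4    # prints "no alert yet"
```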
  - Default alerting rules
    - Warning alerts

      After 5 consecutive occurrences, the platform issues warning alerts for the following events:
      - A quota setting is preventing a service from creating new pods.
      - A service has one or more service instances or pods in a pending state.
      - A service instance has one or more pods in a pending state.
    - Critical alerts

      After 3 consecutive occurrences, the platform issues critical alerts for the following events:
      - A service does not have enough replicas.

        If this occurs, determine how you can allocate sufficient resources to the service to enable it to create the required number of replicas.
      - A persistent volume claim (PVC) is not associated with a storage volume, which means that the service cannot store data.

        If this occurs, ensure that you have sufficient storage to create the requested PVC.
      - A service has insufficient resources to fulfill requests.

        The service cannot create new pods if the new pods will push the service over the memory quota or the vCPU quota. These pods remain in a pending state until sufficient resources are available.

        If there are insufficient resources, you can:
        - Wait for existing processes to complete so that additional resources become available.
        - Determine whether there are any processes that you can stop to free up resources.
        - Adjust the appropriate quota settings to allocate more resources to the platform or to the service.
        - Add vCPU or memory to your cluster.
      - A service instance is in a failed state or a pod is in a failed or unknown state.

        For troubleshooting tips, see What should you do if you find pods in a failed state?
      - One or more pods that are associated with a service instance are in a failed or unknown state.

        For troubleshooting tips, see What should you do if you find pods in a failed state?

      The platform also issues alerts if one or more of the preceding monitors do not complete successfully.