Monitoring the platform

From the IBM® Cloud Pak for Data web client, you can monitor the services that are running on the platform, understand how you are using cluster resources, and be aware of issues as they arise. You can also set quotas on the platform and on individual services to help mitigate unexpected spikes in resource use.

Accessing the Monitoring page

Required permissions:
To access the Monitoring page, you must have one of the following permissions:
  • Administer platform
  • Manage platform health
  • View platform health (read-only access)
To access the Monitoring page:
  1. Log in to the Cloud Pak for Data web client.
  2. From the navigation menu, select Administration > Monitoring.
From the Monitoring page, you can:
  • See the current resource use (vCPU and memory) for the platform

    If you click View status and use data on the Platform resource overview card, you can see a breakdown by services, service instances, environments, and pods.

  • Review the platform resource use for the last 12 hours

    If you click View historical data on the Platform resource use card, you can see a breakdown by services, service instances, environments, and pods. You can also view historical data beyond 12 hours. By default, the platform stores up to 30 days of data. However, you can adjust the length of time that data is retained. For details, see Changing the retention period for monitoring data.

  • Access at-a-glance platform monitoring
  • View events and alerts
  • Configure and enforce quotas

At-a-glance platform monitoring

From the Monitoring page, you can see the status of the following items on the platform. Each item has a card that shows status information and lets you get more detailed information.
Services

Services are software that is installed on the platform. Services consume resources as part of their regular operations.

From the Monitoring page, you can see:
  • How many services are installed on the platform
  • The number of services that have either:
    • A service instance in a failed state
    • A pod in a failed or unknown state
  • The number of services that have either:
    • A service instance in a pending state
    • A pod in a pending state
  • The number of services that are running normally
Click the Services card to see:
  • The historical vCPU and memory use for all services

    You can optionally filter the graph to show a single service.

  • The status (or health) of each service
  • The number of service instances, environments, and jobs that are associated with the service (if applicable)
  • The vCPU quota status and the memory quota status (if set)
You can optionally configure the table to show:
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits
You can select a service to see:
  • The service quotas
  • The pods that are associated with the service
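
The roll-up rules for the Services card can be sketched in a few lines of Python. This is only an illustration of the counting logic described above, not the web client's implementation: the state names, the `Service` class, and the choice to let a failed state take precedence over a pending state are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical state names used only for this sketch.
FAILED, UNKNOWN, PENDING, RUNNING = "failed", "unknown", "pending", "running"


@dataclass
class Service:
    """Hypothetical record of one installed service."""
    name: str
    instance_states: list = field(default_factory=list)
    pod_states: list = field(default_factory=list)


def service_status(service):
    """Roll instance and pod states up into a single service status."""
    # Failed: a service instance in a failed state,
    # or a pod in a failed or unknown state.
    if FAILED in service.instance_states or any(
        state in (FAILED, UNKNOWN) for state in service.pod_states
    ):
        return FAILED
    # Pending: a service instance or a pod in a pending state.
    # Assumption: this is checked only if the service is not failed.
    if PENDING in service.instance_states or PENDING in service.pod_states:
        return PENDING
    return RUNNING


def card_counts(services):
    """Return the four numbers that the Services card summarizes."""
    counts = {"installed": len(services), FAILED: 0, PENDING: 0, RUNNING: 0}
    for service in services:
        counts[service_status(service)] += 1
    return counts
```

With these rules, a service whose instances are all running is still counted as failed if one of its pods is in an unknown state.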
Service instances

Some services can be deployed multiple times after they are installed. Each deployment is called a service instance.

Service instances consume resources as part of their normal operations.

From the Monitoring page, you can see:
  • How many service instances are deployed on the platform
  • The number of service instances where either:
    • The instance is in a failed state
    • A pod is in a failed or unknown state
  • The number of service instances where either:
    • The instance is in an unknown state
    • A pod is in a pending state
  • The number of service instances that are running normally
Click the Service instances card to see:
  • The historical vCPU and memory use for all service instances

    You can optionally filter the graph to show a single instance.

  • The status (or health) of each service instance
  • The service that the service instance is associated with
  • Who provisioned the instance and when
  • The number of users who have access to the service instance
  • The number of pods associated with the service instance
You can optionally configure the table to show:
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits

You can select a service instance to see the pods that are associated with the service instance.

Additionally, you can click the Options icon for a service instance to:
  • Manage access to the instance
  • Delete the instance

However, to complete either of these tasks, you must be an administrator of the service instance or you must have the Administer platform permission.

Environments

Environments specify the hardware and software configurations for runtimes for analytical assets and jobs. Environments consume resources as part of their regular operations.

By default, this card is not displayed on the platform. It is displayed only if you install a service that uses environments.

From the Monitoring page, you can see:
  • How many environments are currently running on the platform
  • The number of environments with at least one pod in a failed state
  • The number of environments that are running normally
Click the Environments card to see:
  • The status (or health) of each environment
  • Who started the environment and when
  • The project or deployment space where the environment is running
  • The number of GPU requests
  • The current resource use for the environment

You can select an environment to see the pods that are associated with the environment.

Additionally, you can click the Stop runtime instance icon to stop the environment.

Pods

Services are composed of Kubernetes pods.

If a pod is in a failed or unknown state, it can affect the health of the service. If a pod is pending, the service might not be able to process specific requests until the pod is running.

From the Monitoring page, you can see:
  • How many pods are associated with the platform
  • The number of pods in a failed or unknown state
  • The number of pods that are pending
    Kubernetes is attempting to create and schedule these pods. The pods might remain in pending state if:
    • Kubernetes is waiting for a process to complete or doesn't have sufficient resources to fulfill the pod requests
    • The platform or service quota settings are preventing new pods from being created
  • The number of pods that are running normally
Click the Pods card to see:
  • The status (or health) of each pod
  • The number of containers in the ready state compared to the number of containers defined for the pod
  • The service the pod is associated with
  • Whether the pod is associated with a fixed resource, service instance, or environment
  • The function or application of the pod
  • The service instance that the pod is associated with
  • When the pod was started
  • How many times the pod has restarted
You can optionally configure the table to show:
  • The Red Hat® OpenShift® project (namespace) where the pod is running
  • The environment, job, project, or deployment space that the pod is associated with
  • The current vCPU use, requests, and limits
  • The current memory use, requests, and limits
Additionally, you can click the Options icon for a pod to:
  • See the details of the pod
  • View the pod logs
  • Restart the pod
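
As a rough illustration of how the Pods card groups pods and how a "ready containers" figure is formed, the following sketch works on simplified pod records. The dictionary layout and function names are hypothetical. The phase names (Pending, Running, Failed, Unknown) are the standard Kubernetes pod phases, and treating every pod in the Running phase as "running normally" is a simplifying assumption.

```python
from collections import Counter

# Kubernetes pod phases mapped to the buckets named on the Pods card.
# Any other phase (for example, Succeeded) is counted as "other" here,
# which is an assumption of this sketch.
BUCKETS = {
    "Failed": "failed_or_unknown",
    "Unknown": "failed_or_unknown",
    "Pending": "pending",
    "Running": "running",
}


def pods_card_summary(pods):
    """Count pods per bucket from a list of simplified pod records."""
    counts = Counter(BUCKETS.get(pod["phase"], "other") for pod in pods)
    return {
        "total": len(pods),
        "failed_or_unknown": counts["failed_or_unknown"],
        "pending": counts["pending"],
        "running": counts["running"],
    }


def ready_ratio(pod):
    """Format ready containers against defined containers, such as '1/2'."""
    ready = sum(1 for container in pod["containers"] if container["ready"])
    return f"{ready}/{len(pod['containers'])}"
```

Each record is assumed to look like `{"phase": "Running", "containers": [{"ready": True}, {"ready": False}]}`, for which `ready_ratio` returns `"1/2"`.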

Events and alerts

An alert is triggered by an event or a series of events. The severity of an event indicates that an issue occurred or that there is a potential issue.

From the Monitoring page, you can see:
  • The number of critical alerts
  • The number of critical events
  • The number of warning alerts
  • The number of warning events

If you click any of these entries, you are taken to a list of alerts or events that is filtered based on the entry you selected.

If you click View all events and alerts on the Events card, you can see a complete list of events and alerts.

You can optionally customize the events that trigger alerts. For details, see Monitoring and alerting in Cloud Pak for Data.

Setting and enforcing quotas

A quota specifies the maximum amount of memory and vCPU that you want the platform or a specific service to use. It is a target against which you can measure your actual memory and vCPU use, so that you know when your use is approaching or surpassing that target.

Note: Setting a quota is not the same thing as scaling.

Scaling impacts the overall capacity of a service by adjusting the number of pods in the service. (You can also scale the Cloud Pak for Data control plane.) When you scale a service up, the service becomes more resilient. Additionally, the service might have increased parallel processing capacity.

Setting a quota on a service does not change the scale. Scale and quota are independent settings.

In addition to setting a quota, you can optionally enable quota enforcement. When you enforce quotas, new pods cannot be created if the pods would push your use above your quota.

Important: To use quota enforcement, you must install the scheduling service.

The behavior of the quota enforcement feature depends on whether you set your quotas on pod requests or limits. (For an in-depth explanation of requests and limits, see Managing Resources for Containers in the Kubernetes documentation.)

Enforcing quotas on pod requests
A request is the amount of vCPU or memory that the pod expects to use as part of its normal operations.
When you set quotas on pod requests, you have more flexibility in how your resources are allocated:
  • If you enforce the platform quotas, the control plane and any services that are running on this instance of Cloud Pak for Data are prevented from creating new pods if the requests in the new pod would push the platform over either the platform memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. However, the existing pods can use more memory or vCPU than the platform quota.
  • If you enforce a service quota, the service is prevented from creating new pods if the requests in the new pod would push the service over either the memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. However, the existing pods can use more memory or vCPU than the service quota.
Enforcing quotas on pod limits
A limit is the absolute maximum amount of vCPU or memory that the pod can use. If a container in the pod tries to use more memory than its limit, Kubernetes can terminate it. vCPU use beyond the limit is throttled instead. In most cases, the requested resources (the requests) are less than the limits.
When you set quotas on pod limits, you have more control over your resources:
  • If you enforce platform quotas, the control plane and any services that are running on this instance of Cloud Pak for Data are prevented from creating new pods if the limits in the new pods would push the platform over either the platform memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. When you enforce platform quotas on pod limits, the quota is a cap on the total resources that existing pods can use.
  • If you enforce service quotas, the service is prevented from creating new pods if the limits in the new pod would push the service over either the memory quota or the vCPU quota. These pods remain in the pending state until there are sufficient resources available. When you enforce service quotas on pod limits, the quota is a cap on the total resources that the existing pods can use.
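
The difference between the two enforcement modes can be sketched as a simple check. This is a minimal illustration of the rule described above, not the scheduling service's actual logic: the `Resources` and `Pod` classes, the `can_create_pod` function, and the use of GiB for memory are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcpu: float
    memory_gib: float


@dataclass(frozen=True)
class Pod:
    requests: Resources
    limits: Resources


def can_create_pod(existing_pods, new_pod, quota, basis="requests"):
    """Return True if the new pod would not push use above the quota.

    basis selects whether the quota is measured against pod
    "requests" or pod "limits".
    """
    if basis not in ("requests", "limits"):
        raise ValueError("basis must be 'requests' or 'limits'")

    def measured(pod):
        return pod.requests if basis == "requests" else pod.limits

    total_vcpu = sum(measured(p).vcpu for p in existing_pods) + measured(new_pod).vcpu
    total_memory = (
        sum(measured(p).memory_gib for p in existing_pods)
        + measured(new_pod).memory_gib
    )
    # The pod is blocked if either the vCPU quota or the memory quota
    # would be exceeded. A blocked pod stays in the pending state.
    return total_vcpu <= quota.vcpu and total_memory <= quota.memory_gib
```

For example, with a quota of 4 vCPU and 8 GiB and two existing pods that each request 1 vCPU and 2 GiB with limits of 2 vCPU and 4 GiB, a third identical pod passes the check on requests (3 vCPU, 6 GiB) but fails it on limits (6 vCPU, 12 GiB). The same quota is therefore stricter when it is measured against limits.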

If you don't enforce quotas, the quota has no impact on the behavior of the platform or services. If you are approaching or surpassing your quota settings, it's up to you whether you want to allow processes to consume resources or whether you want to stop processes to release resources.

To set quotas:
  1. To set the platform quota:
    1. On the Monitoring page, click Set platform quotas or Edit platform quotas.
    2. Select Monitor platform resource use against your target use.
    3. Specify whether you want to set quotas on pod Requests or Limits.
    4. Specify your vCPU quota. This is the target maximum amount of vCPU you want the platform to use.
    5. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform will alert you based on your alert settings.
    6. Specify your Memory quota. This is the target maximum amount of memory you want the platform to use.
    7. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform will alert you.
    8. If you want to automatically enforce the platform quota settings, select Enforce quotas.
    9. Click Save.
  2. To set service quotas:
    1. On the Monitoring page, click Edit service quotas.
    2. Locate the service for which you want to edit the quota, and click the Edit icon.
    3. Select Monitor platform resource use against your target use.
    4. Specify whether you want to set quotas on pod Requests or Limits.
    5. Specify your vCPU quota. This is the target maximum amount of vCPU you want the service to use.
    6. Specify your vCPU alert threshold. When you reach the specified percent of vCPU in use, the platform will alert you based on your alert settings.
    7. Specify your Memory quota. This is the target maximum amount of memory you want the service to use.
    8. Specify your Memory alert threshold. When you reach the specified percent of memory in use, the platform will alert you.
    9. If you want to automatically enforce the service quota settings, select Enforce quotas.
    10. Click Save.
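
To make the relationship between a quota and its alert threshold concrete, here is a small sketch of how current use could be compared against the two settings. The function name and the returned status labels are hypothetical, and how the platform evaluates thresholds internally is not described here. The sketch assumes only that an alert applies once use reaches the given percent of the quota.

```python
def quota_status(in_use, quota, alert_threshold_pct):
    """Classify current use against a quota and an alert threshold.

    in_use and quota use the same unit (vCPU, or memory in GiB).
    alert_threshold_pct is a percent of the quota, such as 80.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if not 0 <= alert_threshold_pct <= 100:
        raise ValueError("alert_threshold_pct must be between 0 and 100")

    if in_use > quota:
        return "over_quota"  # use has surpassed the target
    # Compare without dividing to avoid rounding surprises.
    if in_use * 100 >= quota * alert_threshold_pct:
        return "alert"  # use has reached the alert threshold
    return "ok"
```

With a vCPU quota of 10 and an alert threshold of 80 percent, 7 vCPU in use gives `"ok"`, 8 vCPU gives `"alert"`, and 11 vCPU gives `"over_quota"`.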