SNMP response codes

If you send alerts to your SNMP server, you can use the following information to interpret the response codes that are sent by IBM® Software Hub.

You can monitor the state of IBM Software Hub and your services with platform monitors, service monitors, and privileged monitors. Platform monitors are installed automatically when you install IBM Software Hub. You must manually install service monitors and privileged monitors.

Deployment status check codes

Monitor type: Platform

Each service is configured to maintain a specific number of Deployment replicas. The check-deployment-status event monitors the status of Deployment replicas that are associated with IBM Software Hub and reports any issues.

Response code Severity Description
102 Critical The service does not have enough replicas.
100 Information The monitor that checks the status of the replicas ran. There are no issues to report.

StatefulSet status check codes

Monitor type: Platform

Each service is configured to maintain a specific number of StatefulSet replicas. The check-statefulset-status event monitors the status of StatefulSet replicas that are associated with IBM Software Hub and reports any issues.

Response code Severity Description
202 Critical The service does not have enough replicas.
200 Information The monitor that checks the status of the replicas ran. There are no issues to report.

PVC status check codes

Monitor type: Platform

A persistent volume claim (PVC) is a request for storage that meets specific criteria, such as a minimum size or a specific access mode. The check-pvc-status event monitors the status of the PVCs that are associated with IBM Software Hub and reports any issues.

Response code Severity Description
302 Critical The PVC is not associated with a storage volume, which means that the service cannot store data.
300 Information The monitor that checks the status of the PVCs ran. There are no issues to report.

Quota status check codes

Monitor type: Platform

An administrator sets the vCPU quota and memory quota for services or for the platform. The check-quota-status event monitors the quotas and requests that are associated with IBM Software Hub to determine whether services have sufficient resources to fulfill requests.

Response code Severity Description
402 Critical The service has insufficient resources to fulfill requests. The service cannot create new pods if the new pods will push the service over the memory quota or the vCPU quota. These pods remain in pending state until sufficient resources are available.
401 Warning Check the quota settings and the available resources on the cluster.
400 Information The monitor that checks the status of the quotas ran. There are no issues to report.
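
The quota rule that is described for code 402 can be illustrated with a short Python sketch. The sketch is not part of IBM Software Hub. The Resources class, the pod_would_exceed_quota function, and the GiB unit are hypothetical names and choices that are used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical container for a vCPU amount and a memory amount."""
    vcpu: float
    memory_gib: float


def pod_would_exceed_quota(in_use: Resources, pod_request: Resources, quota: Resources) -> bool:
    """Return True if admitting the pod would push the service over either quota.

    Assumption: a pod is held in pending state when the vCPU total or the
    memory total would exceed its quota, as described for code 402.
    """
    return (
        in_use.vcpu + pod_request.vcpu > quota.vcpu
        or in_use.memory_gib + pod_request.memory_gib > quota.memory_gib
    )


quota = Resources(vcpu=8, memory_gib=32)
in_use = Resources(vcpu=6, memory_gib=16)

# A 1 vCPU pod fits; a 4 vCPU pod would exceed the vCPU quota and stay pending.
fits = not pod_would_exceed_quota(in_use, Resources(vcpu=1, memory_gib=4), quota)
blocked = pod_would_exceed_quota(in_use, Resources(vcpu=4, memory_gib=4), quota)
```

In this sketch, a request that lands exactly on the quota is treated as allowed. Whether the platform treats that boundary the same way is an assumption.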

Monitor status check codes

Monitor type: Platform

A monitor is a script that checks the state of an entity periodically and generates events based on the state of the entity. The check-monitor-status event monitors the status of monitoring jobs to determine whether the jobs completed successfully.

Response code Severity Description
502 Critical One or more jobs did not complete successfully.
500 Information The monitor that checks the status of monitoring jobs ran. There are no issues to report.

Service check status codes

Monitor type: Platform

A service consists of pods and one or more service instances. The check-service-status event monitors the status of services to determine whether the pods and instances that are associated with the service are running as expected.

Response code Severity Description
602 Critical A service instance is in a failed state or a pod is in a failed or unknown state.
601 Warning Check the status of the service. A pod that is associated with the service might be pending.
600 Information The monitor that checks the status of each service ran. There are no issues to report.

Service instance check status codes

Monitor type: Platform

A service instance consists of one or more pods. The check-instance-status event monitors the status of service instances to determine whether the pods that are associated with the instance are running as expected.

Response code Severity Description
702 Critical One or more pods that are associated with the instance are in a failed or unknown state.
701 Warning Check the status of the instance. A pod that is associated with the instance might be pending.
700 Information The monitor that checks the status of each instance ran. There are no issues to report.

Service health check codes

Monitor type: Service

The service-health-check event monitors the functional health of a service to determine whether the service is healthy.

Response code Severity Description
802 Critical The service is not functioning properly or is not functioning at all.
801 Warning The service is partially operational, but some functionality is unavailable.
800 Information The service is healthy.

Node status check codes

Monitor type: Privileged

Each node hosts the pods that run the platform and services, and the overall cluster health depends on the health of its nodes. The check-node-status event monitors the health and status of all cluster nodes by monitoring the node conditions and usage statistics. A critical state indicates that one or more nodes are not in a Ready state or are consuming excessive resources.

Response code Severity Description
902 Critical One or more nodes are not ready or are utilizing excessive resources.
901 Warning A node health warning condition was detected.
900 Information All nodes are healthy.

Volume status check codes

Monitor type: Privileged

A persistent volume claim (PVC) is a request for storage. The check-volume-status event monitors whether the PVCs associated with the deployment are running out of space. A warning or critical state indicates that volume usage exceeds the configured thresholds.

Response code Severity Description
1002 Critical Volume usage exceeds the critical threshold. (The default threshold is 90% of the total capacity.)
1001 Warning Volume usage exceeds the warning threshold. (The default threshold is 80% of the total capacity.)
1000 Information Volume usage is within normal range.
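
The default thresholds in the preceding table can be expressed as a small Python sketch. The sketch is illustrative only and is not how IBM Software Hub implements the monitor. The volume_status_code function and its parameter names are hypothetical. The 80% and 90% defaults come from the table.

```python
def volume_status_code(used_bytes: int, capacity_bytes: int,
                       warning: float = 0.80, critical: float = 0.90) -> int:
    """Map volume usage to a response code from the table.

    Assumption: "exceeds" means strictly greater than the threshold.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    usage = used_bytes / capacity_bytes
    if usage > critical:
        return 1002  # Critical
    if usage > warning:
        return 1001  # Warning
    return 1000      # Information


# A 100 GiB volume with 85 GiB used exceeds the warning threshold only.
GIB = 1024 ** 3
code = volume_status_code(85 * GIB, 100 * GIB)
```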

Operator namespace status check codes

Monitor type: Privileged

The check-operator-namespace-status event checks whether the resources in the operators project for the deployment are healthy.

Response code Severity Description
1102 Critical One or more operator resources are not running as expected.
1101 Warning A warning condition was detected in operator namespace resources.
1100 Information All operator resources are healthy.

EDB cluster status check codes

Monitor type: Privileged

The check-edb-cluster-status event checks whether any instances of EDB Postgres that are associated with the deployment are healthy.

Response code Severity Description
1203 Critical Cluster is unhealthy or the replicas are significantly out of sync.
Restriction: The replica out of sync check applies only to the zen-metastore-edb storage cluster.
1201 Warning One or more replicas are unavailable.
1200 Information EDB cluster is healthy.

Cluster operator status check codes

Monitor type: Privileged

The check-cluster-operator-status event checks the status of the cluster operators that comprise the Red Hat® OpenShift® Container Platform infrastructure to determine whether:

  • All of the operators are AVAILABLE
  • Any of the operators are DEGRADED

Response code Severity Description
1302 Critical A cluster operator is unavailable (Available=False) or degraded (Degraded=True).
1301 Warning A warning condition exists for the cluster operator.
1300 Information The cluster operators are healthy.
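
The critical condition in the preceding table can be sketched in Python as a check on the Available and Degraded conditions of a single cluster operator. The sketch is illustrative only. The cluster_operator_status_code function and the shape of its input (a mapping from condition type to status string) are hypothetical. The warning code 1301 is not modeled because this document does not define the condition that triggers it.

```python
def cluster_operator_status_code(conditions: dict) -> int:
    """Return 1302 if the operator is unavailable or degraded, else 1300.

    `conditions` is assumed to map a condition type to its status string,
    for example {"Available": "True", "Degraded": "False"}.
    """
    unavailable = conditions.get("Available") == "False"
    degraded = conditions.get("Degraded") == "True"
    if unavailable or degraded:
        return 1302  # Critical
    return 1300      # Information


healthy = cluster_operator_status_code({"Available": "True", "Degraded": "False"})
degraded = cluster_operator_status_code({"Available": "True", "Degraded": "True"})
```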

Node imbalance status check codes

Monitor type: Privileged

The check-node-imbalance-status event checks whether vCPU requests are balanced across nodes or whether one node is supporting a disproportionately high load.

A warning state indicates that CPU requests on one node exceed the maximum threshold and that other nodes fall below the minimum threshold. A critical state indicates that CPU imbalance exceeds defined thresholds.

Response code Severity Description
1402 Critical Node CPU imbalance exceeds defined thresholds.
1401 Warning CPU usage imbalance detected.
1400 Information CPU requests across nodes are balanced.

Network status check codes

Monitor type: Privileged

The check-network-status event checks the status of the PodNetworkConnectivityCheck objects for cluster resources to determine whether the objects are Reachable.

Response code Severity Description
1502 Critical Network is not reachable.
1501 Warning Network connectivity warning detected.
1500 Information Network connectivity is healthy.

Certificate status check codes

Monitor type: Platform

The check-certificate-status event monitors certificates to:
  • Ensure that the certificates are valid
  • Identify when the certificates will expire
  • Identify when the certificates will be renewed
  • Determine whether certificates were renewed successfully

For certificates that do not have a renewal date, warning and critical events are generated as the certificate approaches its expiration date.

For certificates that have a renewal date, warning and critical events are generated if the certificate is not automatically renewed by the specified date.

Response code Severity Description
1602 Critical Certificate is close to expiry (default: 7 days) or renewal is significantly overdue (default: 24 hours).
1601 Warning Certificate is expiring soon (default: 21 days) or renewal is slightly overdue (default: 1 hour).
1600 Information Certificate is valid and not expiring soon.
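
The default thresholds in the preceding table can be combined into one Python sketch. The sketch is illustrative only and is not the monitor's implementation. The certificate_status_code function and its parameters are hypothetical. The sketch assumes that a certificate with a renewal date is judged only on how overdue the renewal is, that a certificate without a renewal date is judged only on the time that remains before expiry, and that each threshold is inclusive.

```python
from datetime import datetime, timedelta
from typing import Optional


def certificate_status_code(now: datetime, expires_at: datetime,
                            renew_at: Optional[datetime] = None) -> int:
    """Map a certificate's dates to a response code by using the default thresholds."""
    if renew_at is not None:
        # Certificate has a renewal date: check how long the renewal is overdue.
        overdue = now - renew_at
        if overdue >= timedelta(hours=24):
            return 1602  # Critical: renewal significantly overdue
        if overdue >= timedelta(hours=1):
            return 1601  # Warning: renewal slightly overdue
        return 1600
    # No renewal date: check the time that remains before expiry.
    remaining = expires_at - now
    if remaining <= timedelta(days=7):
        return 1602      # Critical: close to expiry
    if remaining <= timedelta(days=21):
        return 1601      # Warning: expiring soon
    return 1600


now = datetime(2025, 1, 1)
expiring_soon = certificate_status_code(now, now + timedelta(days=10))
renewal_late = certificate_status_code(now, now + timedelta(days=60),
                                       renew_at=now - timedelta(days=2))
```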

Certificate renewal check codes

Monitor type: Platform

The check-certificate-renewal event monitors upcoming certificate renewals so that you can identify renewals that might cause service disruptions.

Response code Severity Description
1701 Warning Certificate renewal approaching (default: 3 days before renewal time).
1700 Information No certificate renewal events pending.

Workload quota status check codes

Monitor type: Platform

The check-workload-quota-status event monitors the quotas and requests that are associated with the following objects:
  • Projects
  • Remote physical locations
  • Data planes

The event determines whether the workloads associated with the objects have sufficient resources to fulfill requests.

Response code Severity Description
1802 Critical Workload has insufficient CPU, memory, or GPU resources.
1801 Warning Workload quota warning detected.
1800 Information Workload quotas are sufficient.
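
Across the preceding tables, the leading digits of a response code identify the monitor and the last two digits indicate the severity (00 for Information, 01 for Warning, 02 for Critical). Code 1203 is the one listed exception. The following Python sketch shows one way that a receiver might decode a code on that basis. The sketch is not part of IBM Software Hub. The FAMILIES table, the SEVERITY_BY_SUFFIX table, the decode_response_code function, and the short monitor labels are hypothetical names that are used only for illustration.

```python
# Leading digits -> (short label for the section, codes listed in that section).
FAMILIES = {
    1: ("Deployment status", {100, 102}),
    2: ("StatefulSet status", {200, 202}),
    3: ("PVC status", {300, 302}),
    4: ("Quota status", {400, 401, 402}),
    5: ("Monitor status", {500, 502}),
    6: ("Service status", {600, 601, 602}),
    7: ("Service instance status", {700, 701, 702}),
    8: ("Service health", {800, 801, 802}),
    9: ("Node status", {900, 901, 902}),
    10: ("Volume status", {1000, 1001, 1002}),
    11: ("Operator namespace status", {1100, 1101, 1102}),
    12: ("EDB cluster status", {1200, 1201, 1203}),
    13: ("Cluster operator status", {1300, 1301, 1302}),
    14: ("Node imbalance status", {1400, 1401, 1402}),
    15: ("Network status", {1500, 1501, 1502}),
    16: ("Certificate status", {1600, 1601, 1602}),
    17: ("Certificate renewal", {1700, 1701}),
    18: ("Workload quota status", {1800, 1801, 1802}),
}

# Last two digits -> severity. Suffix 3 appears only in code 1203.
SEVERITY_BY_SUFFIX = {0: "Information", 1: "Warning", 2: "Critical", 3: "Critical"}


def decode_response_code(code: int) -> tuple:
    """Return (monitor label, severity) for a code that is listed in this document."""
    family, suffix = divmod(code, 100)
    entry = FAMILIES.get(family)
    if entry is None or code not in entry[1]:
        raise ValueError(f"Response code {code} is not listed in this document")
    label, _ = entry
    return label, SEVERITY_BY_SUFFIX[suffix]


monitor, severity = decode_response_code(1001)
```

Validating against the listed codes, rather than relying only on the digit pattern, keeps the sketch from assigning a meaning to a code that the tables do not define.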