Health status is not 'OK'

Problem

After certain actions, the health of a resource can change to 'Warning', 'Attention', or 'Critical', but a potential resolution is not apparent. Either the complete problem is not known or no recourse is provided.

Explanation

Actions such as changing a resource password or regenerating a certificate can adversely affect the relationship between the IBM® Cloud Infrastructure Center and a resource. If problems arise, they are reflected in the health status of the resource in the IBM® Cloud Infrastructure Center. Whenever the health status is 'Warning', 'Attention', or 'Critical', you might need to take steps to repair the relationship and change the status back to 'OK'.

Investigation: Health Manager Service overview

The IBM® Cloud Infrastructure Center Health Manager Service provides resource health state processing that is based on defined health policies. Knowing how those policies work can help you solve problems that are related to the health of a resource. The following information details the properties that the IBM® Cloud Infrastructure Center uses to derive health for each type of monitored resource. It also references the policies that the IBM® Cloud Infrastructure Center supplies and uses to determine the health of each resource.

Management node

The IBM® Cloud Infrastructure Center Health Manager Service periodically checks the following properties on the management node:

  • Available disk space
  • Disk usage
  • Available memory
  • Memory usage
  • CPU usage
  • Connectivity with the IBM® Cloud Infrastructure Center
  • Database status
  • RabbitMQ status
  • Fencing status

A notification is sent if any of these values no longer meets its minimum requirement or if the status of a service on the management node changes.
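The kind of threshold check described above can be sketched for the two disk properties as follows. This is only an illustration: the threshold values and the function name `check_disk` are assumptions, not the minimum requirements or the implementation that the Health Manager Service uses.

```python
import shutil

# Hypothetical thresholds for illustration only; the product's actual
# minimum requirements are not stated in this topic.
MIN_FREE_DISK_BYTES = 5 * 1024**3   # assumed: 5 GiB
MAX_DISK_USAGE_PERCENT = 90.0       # assumed: 90 %


def check_disk(total_bytes, used_bytes, free_bytes,
               min_free=MIN_FREE_DISK_BYTES,
               max_usage=MAX_DISK_USAGE_PERCENT):
    """Return a list of findings; an empty list means both checks passed."""
    findings = []
    usage_percent = 100.0 * used_bytes / total_bytes
    if free_bytes < min_free:
        findings.append(
            f"available disk space {free_bytes} B is below {min_free} B")
    if usage_percent > max_usage:
        findings.append(
            f"disk usage {usage_percent:.1f}% is above {max_usage:.1f}%")
    return findings


if __name__ == "__main__":
    # shutil.disk_usage returns a named tuple (total, used, free) in bytes.
    usage = shutil.disk_usage("/")
    for finding in check_disk(usage.total, usage.used, usage.free):
        print(finding)
```

Running the script on the management node prints one line per threshold that is not met and prints nothing when both checks pass.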

Hosts

Host health is derived from the following properties:
- Local hypervisor state
- Related host service state
- Related host configuration state (for example, for the z/VM hypervisor, the SMAPI status is checked)

For example, in a z/VM hypervisor environment, if the hypervisor state is 'Error', the host health status is 'Critical' and the following explanation is given: “The Hypervisor State of the Host host_ID is "Error"”. For more information, see the JSON-encoded health policy hypervisor-health-policy-zvm.json in the /etc/nova/icic-health-policy directory.
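To inspect a policy file such as the one named above, a short script can parse it and report its top-level shape. This is a minimal sketch that assumes nothing about the policy schema; the helper names `load_policy` and `describe` are hypothetical, and only the directory and file name come from this topic.

```python
import json
from pathlib import Path

# Directory and file name as given in this topic.
POLICY_DIR = Path("/etc/nova/icic-health-policy")
POLICY_FILE = "hypervisor-health-policy-zvm.json"


def load_policy(path):
    """Parse a JSON-encoded health policy file and return the decoded object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def describe(policy):
    """Summarize the top-level shape without assuming a particular schema."""
    if isinstance(policy, dict):
        return f"object with keys: {sorted(policy)}"
    if isinstance(policy, list):
        return f"array with {len(policy)} item(s)"
    return f"scalar of type {type(policy).__name__}"


if __name__ == "__main__":
    policy_path = POLICY_DIR / POLICY_FILE
    if policy_path.is_file():
        print(describe(load_policy(policy_path)))
    else:
        print(f"{policy_path} was not found on this machine")
```

The same helpers work for the other policy files that this topic references if `POLICY_DIR` and `POLICY_FILE` are changed accordingly.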

Virtual servers

Virtual server health is derived from the following properties:
- Local power state
- Local virtual machine state
- Related hypervisor state
- Related host service state
- Related host configuration state
- Related volume status

For example, in a z/VM hypervisor environment, if the related Nova compute service is stopped, the virtual server's health status is 'Attention' and the following explanation is given: “Nova compute state of Host host_name is Stopped”. For more information, see the JSON-encoded health policy server-health-policy-zvm.json in the /etc/nova/icic-health-policy directory.

Storage providers

Storage provider health is derived from the following properties:
- Local storage provider access state
- Related Cinder host service state

For example, if the access state of a storage provider's Cinder service is 'authentication_error', the storage provider's health status is 'Attention' and the following explanation is given: “The Access State of Host storage_provider_host_ID is "Authentication Error"”. For more information, see the JSON-encoded health policy storage-provider-health-policy.json in the /etc/cinder/icic-health-policy directory.

Storage volumes

Storage volume health is derived from the following properties:
- Local volume status state
- Related storage provider access status
- Related Cinder host service state

For example, if a volume status is 'Error', the volume health status is 'Critical' and the following explanation is given: “The Status of Volume volume_ID is "Error"”. For more information, see the JSON-encoded health policy volume-health-policy.json in the /etc/cinder/icic-health-policy directory.
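The four examples in this overview follow one pattern: a property of the resource, or of a related resource, takes a particular value, and that value maps to a health status and an explanation. The sketch below restates only those four examples as a lookup table. The property names (`hypervisor_state`, `nova_compute_state`, and so on), the `evaluate` function, and the 'OK' fallback when no rule matches are assumptions for illustration; the shipped policies contain more rules and might be structured differently.

```python
# Each rule: (resource type, property, triggering value, health, explanation).
# The rows restate the examples in this topic; property names are hypothetical.
EXAMPLE_RULES = [
    ("host", "hypervisor_state", "Error", "Critical",
     'The Hypervisor State of the Host {name} is "Error"'),
    ("virtual_server", "nova_compute_state", "Stopped", "Attention",
     "Nova compute state of Host {name} is Stopped"),
    ("storage_provider", "access_state", "authentication_error", "Attention",
     'The Access State of Host {name} is "Authentication Error"'),
    ("volume", "status", "Error", "Critical",
     'The Status of Volume {name} is "Error"'),
]


def evaluate(resource_type, name, properties):
    """Return (health, explanation) pairs for every example rule that matches.

    `name` is the identifier that appears in the explanation; for the
    virtual server example that is the name of the related host.
    Falling back to 'OK' when nothing matches is an assumption.
    """
    matches = []
    for rtype, prop, value, health, template in EXAMPLE_RULES:
        if rtype == resource_type and properties.get(prop) == value:
            matches.append((health, template.format(name=name)))
    return matches or [("OK", "")]


if __name__ == "__main__":
    for health, explanation in evaluate("volume", "volume_ID",
                                        {"status": "Error"}):
        print(health, explanation)
```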

Resolution

Complete the following steps to resolve the problem and change the health status of a resource back to 'OK'.

  1. On the properties page for a resource, the Health and Fault fields provide some information about the reason for a status other than 'OK'. Review that information and take any corrective actions that seem appropriate.

  2. Look for messages that are related to the resource and take any recommended corrective actions.

  3. If you are unable to determine and resolve the problem with the information that is provided in the health status field or the messages, implement basic troubleshooting procedures. For example, ensure that the resource is accessible and check its state.

  4. If the previous steps did not result in a health status of 'OK', further investigation is needed. For more information and potential solutions, search the IBM® Cloud Infrastructure Center knowledge center or the resource documentation.

Note: Every five minutes, the IBM® Cloud Infrastructure Center Health Manager Service clears its existing caches and starts a fresh health status evaluation of all resources. As a result, the actual health status of a resource might take up to five minutes to be reflected. Wait at least five minutes before you report any problem that is related to health status.
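The waiting advice in the note can be sketched as a small helper that reads the health status, waits one evaluation cycle if the status is not 'OK', and then reads it again. The `get_health` callable is hypothetical and must be supplied by the caller (for example, a function that returns the status that you would otherwise read from the properties page); this topic does not describe an API for retrieving it.

```python
import time

# Length of one evaluation cycle, as stated in the note.
REFRESH_INTERVAL_SECONDS = 5 * 60


def recheck_after_refresh(get_health,
                          wait_seconds=REFRESH_INTERVAL_SECONDS,
                          sleep=time.sleep):
    """Read health; if it is not 'OK', wait one cycle and read it again.

    `get_health` is a caller-supplied callable that returns the current
    health status as a string. `sleep` is injectable so that the helper
    can be exercised without actually waiting.
    """
    first = get_health()
    if first == "OK":
        return first
    sleep(wait_seconds)
    return get_health()


if __name__ == "__main__":
    # Simulated readings: stale status first, refreshed status second.
    readings = iter(["Attention", "OK"])
    print(recheck_after_refresh(lambda: next(readings), sleep=lambda s: None))
```

If the second reading is still not 'OK', continue with the resolution steps above.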