Health status is not 'OK'
Problem
After certain actions, the health of a resource can change to 'Warning', 'Attention', or 'Critical', but a potential resolution is not apparent. Either the complete problem is not known or no recourse is provided.
Explanation
Actions such as changing a resource password or regenerating a certificate can adversely affect the relationship between the IBM® Cloud Infrastructure Center and a resource. If problems arise, they are reflected in the health status of the resource in the IBM® Cloud Infrastructure Center. Whenever the health status is 'Warning', 'Attention', or 'Critical', you might need to take steps to repair the relationship and change the status back to 'OK'.
Investigation: Health Manager Service overview
The IBM® Cloud Infrastructure Center Health Manager Service provides resource health state processing that is based on defined health policies. Knowing how those policies work can help you solve problems that are related to the health of a resource. The following information details the properties that the IBM® Cloud Infrastructure Center uses to derive health for each type of monitored resource. It also references the policies that the IBM® Cloud Infrastructure Center supplies and uses to determine the health of each resource.
Management node
The IBM® Cloud Infrastructure Center Health Manager Service periodically checks the following properties on the management node:
- Available disk space
- Disk usage
- Available memory
- Memory usage
- CPU usage
- Connectivity with the IBM® Cloud Infrastructure Center
- Database status
- RabbitMQ status
- Fencing status
A notification is sent if any of these properties does not meet its minimum requirement, or if the status of a service on the management node changes.
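To make the idea of checking a property against a minimum requirement concrete, the following is a minimal sketch for the two disk properties only. It is not the Health Manager Service implementation: the threshold values and the function names evaluate_disk and check_path are hypothetical and chosen purely for illustration.

```python
import shutil

# Hypothetical thresholds; the real minimum requirements are defined by the product.
MIN_FREE_DISK_BYTES = 5 * 1024**3   # assumed: 5 GiB
MAX_DISK_USAGE_PERCENT = 90.0       # assumed: 90 %


def evaluate_disk(total, used, free,
                  min_free=MIN_FREE_DISK_BYTES,
                  max_usage=MAX_DISK_USAGE_PERCENT):
    """Return a list of findings for one file system; an empty list means no problem."""
    findings = []
    usage_percent = 100.0 * used / total if total else 0.0
    if free < min_free:
        findings.append(f"available disk space {free} B is below {min_free} B")
    if usage_percent > max_usage:
        findings.append(f"disk usage {usage_percent:.1f}% exceeds {max_usage:.1f}%")
    return findings


def check_path(path="/"):
    """Evaluate the file system that contains `path`."""
    usage = shutil.disk_usage(path)
    return evaluate_disk(usage.total, usage.used, usage.free)


if __name__ == "__main__":
    for finding in check_path("/") or ["disk checks passed"]:
        print(finding)
```

Running a check like this on the management node can help confirm whether a disk-related notification reflects the current state of the file system.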
Hosts
Host health is derived from the following properties:
- Local hypervisor state
- Related host service state
- Related host configuration state (For example, for z/VM hypervisor, the SMAPI status is checked.)
For example, in a z/VM hypervisor environment, if a hypervisor state is 'Error', the host health status is 'Critical' and the following explanation is given: “The Hypervisor State of the Host host_ID is "Error"”. For more information, see the JSON-encoded health policy hypervisor-health-policy-zvm.json in the /etc/nova/icic-health-policy directory.
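When you want to review the policy files that this topic references, a small read-only script can list them and show their top-level shape. The following sketch deliberately assumes nothing about the policy schema, which is not described here; the function names load_policies and summarize are hypothetical.

```python
import json
from pathlib import Path


def load_policies(directory):
    """Parse every *.json file in `directory` and return {file name: parsed content}."""
    policies = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            policies[path.name] = json.load(handle)
    return policies


def summarize(policy):
    """Describe the top-level shape of a parsed policy without assuming its schema."""
    if isinstance(policy, dict):
        return "object with keys: " + ", ".join(sorted(policy))
    if isinstance(policy, list):
        return f"array with {len(policy)} entries"
    return f"scalar of type {type(policy).__name__}"
```

For example, calling load_policies("/etc/nova/icic-health-policy") and passing each value to summarize gives a quick overview of every policy in that directory. Reading the files might require elevated privileges on the node.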
Virtual servers
Virtual server health is derived from the following properties:
- Local power state
- Local virtual machine state
- Related hypervisor state
- Related host service state
- Related host configuration state
- Related volume status
For example, in a z/VM hypervisor environment, if the related Nova compute service is stopped, the virtual machine's health status is 'Attention' and the following explanation is given: “Nova compute state of Host host_name is Stopped”. For more information, see the JSON-encoded health policy server-health-policy-zvm.json in the /etc/nova/icic-health-policy directory.
Storage providers
Storage provider health is derived from the following properties:
- Local storage provider access state
- Related Cinder host service state
For example, if a storage provider Cinder service access state is 'authentication_error', the storage provider's health status is 'Attention' and the following explanation is given: “The Access State of Host storage_provider_host_ID is "Authentication Error"”. For more information, see the JSON-encoded health policy storage-provider-health-policy.json in the /etc/cinder/icic-health-policy directory.
Storage volumes
Storage volume health is derived from the following properties:
- Local volume status state
- Related storage provider access status
- Related Cinder host service state
For example, if a volume status is 'Error', the volume health status is 'Critical' and the following explanation is given: “The Status of Volume volume_ID is "Error"”. For more information, see the JSON-encoded health policy volume-health-policy.json in the /etc/cinder/icic-health-policy directory.
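The examples in this overview all follow the same pattern: a property value maps to a health status and an explanation. The following sketch illustrates that pattern with the four examples given above. It rests on two assumptions that this topic does not confirm: that the statuses rank from 'OK' through 'Warning' and 'Attention' to 'Critical', and that the most severe matching rule determines the result. The property keys and the function name derive_health are hypothetical, and the real policies are the JSON files named above.

```python
# Assumed ordering from least to most severe; the product might rank these differently.
SEVERITY_ORDER = ["OK", "Warning", "Attention", "Critical"]

# Rules transcribed from the examples in this topic: (property, value) -> health status.
# The property keys are hypothetical names chosen for this sketch.
EXAMPLE_RULES = {
    ("hypervisor_state", "Error"): "Critical",
    ("nova_compute_state", "Stopped"): "Attention",
    ("access_state", "authentication_error"): "Attention",
    ("volume_status", "Error"): "Critical",
}


def derive_health(properties, rules=EXAMPLE_RULES):
    """Return (status, reasons), where status is the most severe matching rule."""
    status = "OK"
    reasons = []
    for name, value in properties.items():
        matched = rules.get((name, value))
        if matched is None:
            continue  # No rule for this property value; it does not affect health.
        reasons.append(f"{name} is {value!r}")
        if SEVERITY_ORDER.index(matched) > SEVERITY_ORDER.index(status):
            status = matched
    return status, reasons
```

Under these assumptions, a virtual server whose related Nova compute service is stopped and whose related volume is in 'Error' status would be reported as 'Critical', with both reasons listed.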
Resolution
Complete the following steps to resolve the problem and change the health status of a resource back to 'OK'.
- On the properties page for a resource, the Health and Fault fields provide some information about the reason for a status other than 'OK'. Review that information and take any corrective action that seems appropriate.
- Look for messages that are related to the resource and take any recommended corrective action.
- If you are unable to determine and resolve the problem with the information that is provided in the health status field or the messages, implement basic troubleshooting procedures. For example, ensure that the resource is accessible and check its state.
- If the previous steps did not result in a health status of 'OK', further investigation is needed. For more information and potential solutions, search the IBM® Cloud Infrastructure Center knowledge center or the resource documentation.
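As one way to carry out the basic accessibility check mentioned in the steps above, the following sketch tests whether a TCP connection to a resource can be opened. The function name is_reachable is hypothetical, and the host and port to test depend on the resource and are not specified in this topic.

```python
import socket


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A result of False indicates only that the connection could not be opened from the machine where the script runs; it does not identify the cause, so also check the state of the resource itself.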