Health status is not 'OK'

After certain actions, the health of a resource might change to 'Warning', 'Attention', or 'Critical'. Further action is required to change the health status back to 'OK'.

Problem

The health status of a resource is a value other than 'OK', but a potential resolution is not apparent. Either the full cause of the problem is not known or no corrective action is suggested.

Explanation

Actions such as changing a resource password or regenerating a certificate can adversely affect the relationship between PowerVC and a resource. If problems arise, they are reflected in the health status of the resource in PowerVC. Whenever the health status is 'Warning', 'Attention', or 'Critical', you might need to take steps to repair the relationship and change the status back to 'OK'.

Investigation: Health Manager Service overview

The PowerVC Health Manager Service provides resource health state processing that is based on defined health policies. Knowing how those policies work might help you solve problems that are related to the health of a resource. The following information details the properties that PowerVC uses to derive health for each type of monitored resource. It also references the policies that PowerVC supplies and uses to determine the health of each resource.
Hardware Management Console
Hardware Management Console health is derived from the HMC management service state. For example, if an HMC management service state value is 'connection_failed', the HMC health status value is 'Attention' and the following explanation is given: The Access State of HMC HMC_ID is Connection failed. For more information, see the hmc-health-policy.json JSON encoded health policy in the /etc/nova/powervc-health-policy directory.
Hosts
Host health is derived from the following properties:
  • Local hypervisor state
  • Related Nova host service state
For example, if a hypervisor state is 'Error', the host health status is 'Critical' and the following explanation is given: The Hypervisor State of Host host_ID is "Error". For more information, see the hypervisor-health-policy.json JSON encoded health policy in the /etc/nova/powervc-health-policy directory.
Virtual servers
Virtual server health is derived from the following properties:
  • Local power state
  • Local virtual machine state
  • Local remote restart state
  • Local Resource Monitoring and Control (RMC) state
  • Related hypervisor state
  • Related Nova host service state
  • Related volume status
For example, if an RMC state is 'Inactive', the virtual machine health status is 'Warning' and the following explanation is given: The RMC state of virtual machine VM_name is Inactive. For more information, see the server-health-policy.json JSON encoded health policy in the /etc/nova/powervc-health-policy directory.
Storage providers
Storage provider health is derived from the following properties:
  • Local storage provider access state
  • Related Cinder host service state
For example, if a storage provider Cinder service access state is 'authentication_error', the storage provider health status is 'Attention' and the following explanation is given: The Access State of Host storage_provider_host_ID is "Authentication Error". For more information, see the storage-provider-health-policy.json JSON encoded health policy in the /etc/cinder/powervc-health-policy directory.
Storage volumes
Storage volume health is derived from the following properties:
  • Local volume status state
  • Related storage provider access status
  • Related Cinder host service state
For example, if a volume status is 'Error', the volume health status is 'Critical' and the following explanation is given: The Status of Volume volume_ID is "Error". For more information, see the volume-health-policy.json JSON encoded health policy in the /etc/cinder/powervc-health-policy directory.
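
To read the supplied policies alongside the descriptions above, a short script can collect the policy files from the two directories named in this topic. The following is a minimal sketch in Python, not a PowerVC tool. The function names find_policy_files and load_policy are hypothetical, the script assumes read access to the directories on the PowerVC management server, and it assumes nothing about the internal layout of a policy file beyond it being JSON.

```python
import json
from pathlib import Path

# Directories named in this topic; adjust if your installation differs.
POLICY_DIRS = [
    "/etc/nova/powervc-health-policy",
    "/etc/cinder/powervc-health-policy",
]


def find_policy_files(directories=POLICY_DIRS):
    """Return *.json paths from the given directories, sorted within each one.

    Directories that do not exist are skipped.
    """
    found = []
    for directory in directories:
        path = Path(directory)
        if path.is_dir():
            found.extend(sorted(path.glob("*.json")))
    return found


def load_policy(path):
    """Parse one policy file and return the decoded JSON value."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    for policy_path in find_policy_files():
        policy = load_policy(policy_path)
        # The policy schema is not described here, so only the
        # top-level JSON type is reported.
        print(f"{policy_path}: parsed as {type(policy).__name__}")
```

Because the sketch only reads the files, it does not alter how PowerVC evaluates health.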

Resolution

Complete the following steps to resolve the problem and change the health status of a resource back to 'OK'.
  1. On the properties page for a resource, the Health and Fault fields provide some information about the reason for a status other than 'OK'. Review that information and take any corrective actions that seem appropriate.
  2. Look for messages that are related to the resource and take any recommended corrective actions.
  3. If you cannot determine and resolve the problem with the information that is provided in the health status field or the messages, follow basic troubleshooting procedures. For example, ensure that the resource is accessible and check its state on the management console.
  4. For virtual servers, if the RMC health state is not 'OK' even after you install cloud-init and RSCT Utilities, complete one of the following steps on the virtual server.
    • Disable the firewall.
    • Open port 657 in the firewall.
  5. If the previous steps did not result in a health status of 'OK', further investigation is needed. For more information and potential solutions, search the PowerVC knowledge center or the resource documentation.
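
As a quick check for step 4, the following Python sketch tests whether port 657 on a virtual server accepts a TCP connection. It is an illustration only: the function name tcp_port_open is hypothetical, and testing over TCP is an assumption, because this topic names only the port number. A successful connection shows that a firewall is not blocking that port from the system where the script runs. It does not prove that the RMC state is healthy.

```python
import socket
import sys

RMC_PORT = 657  # port named in step 4


def tcp_port_open(host, port=RMC_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical usage: python check_rmc_port.py <virtual-server-address>
    target = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    state = "reachable" if tcp_port_open(target) else "not reachable"
    print(f"TCP port {RMC_PORT} on {target} is {state}")
```

Run the check from a system that must reach the virtual server over the network, because a firewall rule can allow some sources and block others.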