CES monitoring and troubleshooting

You can monitor system health, query events, and perform maintenance and troubleshooting tasks in Cluster Export Services (CES).

System health monitoring

Each CES node runs a separate GPFS™ process that monitors the network address configuration of the node. If a conflict between the network interface configuration of the node and the current assignments of the CES address pool is found, corrective action is taken. If the node is unable to detect an address that is assigned to it, the address is reassigned to another node.

Additional monitors check the state of the services that implement the enabled protocols on the node. These monitors cover the NFS, SMB, Object, and Authentication services and check, for example, daemon liveness and port responsiveness. If any enabled service is found not to be functioning correctly, the node is marked as failed and its CES addresses are reassigned. When the node resumes normal operation, it returns to the normal (healthy) state and is again available to host addresses in the CES address pool.
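To make the idea of a port responsiveness check concrete, the following Python sketch tests whether a TCP port accepts a connection. It is only an illustration of the concept and is not how the GPFS monitors are implemented. The function names are hypothetical, and the port numbers are the standard well-known ports for NFS and SMB, not values taken from a CES configuration.

```python
import socket


def port_responds(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Assumption: standard well-known TCP ports for two of the protocols above.
WELL_KNOWN_PORTS = {"NFS": 2049, "SMB": 445}


def check_services(host, ports=WELL_KNOWN_PORTS):
    """Map each service name to whether its port accepted a connection."""
    return {name: port_responds(host, port) for name, port in ports.items()}
```

A successful connection shows only that something is listening on the port. It does not prove that the service behind it is working correctly, which is why the monitors described above combine several checks.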

An additional monitor runs on each protocol node if Microsoft Active Directory (AD), Lightweight Directory Access Protocol (LDAP), or Network Information Service (NIS) user authentication is configured. If a configured authentication server does not respond to test requests, GPFS marks the affected node as failed.

Querying state and events

Apart from driving the automatic failover and recovery of CES addresses, the monitoring provides two additional outputs that can be queried: events and state.

State can be queried by entering the mmces state show command, which shows you the state of each of the CES components. The possible states for a component follow:

HEALTHY: The component is working as expected.
DISABLED: The component has not been enabled.
SUSPENDED: When a CES node is in the suspended state, most components also report suspended.
STARTING: The component (or monitor) recently started. This state is a transient state that is updated after the startup is complete.
UNKNOWN: Something is preventing the monitoring from determining the state of the component.
STOPPED: The component was intentionally stopped. This situation might happen briefly if a service is being restarted due to a configuration change. It might also happen because a user ran the mmces service stop protocol command for a node.
DEGRADED: There is a problem with the component but not a complete failure. This state does not cause the CES addresses to be reassigned.
FAILED: The monitoring detected a significant problem with the component that means it is unable to function correctly. This state causes the CES addresses of the node to be reassigned.
DEPENDENCY_FAILED: This state implies that a component has a dependency that is in a failed state. An example would be NFS or SMB reporting DEPENDENCY_FAILED because the authentication service is in a failed state.

Looking at the states themselves can be useful to find out which component is causing a node to fail and have its CES addresses reassigned. To find out why the component is being reported as failed, you can look at events.
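The triage that the component states allow can be sketched in Python as follows. The sketch assumes that the per-component states are already available as a dictionary. The output format of mmces state show is not reproduced here, so no parsing is attempted, and the component names in the usage note are hypothetical. Treating HEALTHY and DISABLED as needing no action is also an assumption of this sketch. Only FAILED is documented above as causing address reassignment.

```python
# Assumption: states that need no operator attention.
NO_ACTION = {"HEALTHY", "DISABLED"}


def triage(component_states):
    """Split components into two sorted lists.

    The first list holds components in the FAILED state, which is the state
    documented as causing the CES addresses of the node to be reassigned.
    The second list holds components in any other state that is not in
    NO_ACTION (for example DEGRADED or DEPENDENCY_FAILED).
    """
    failed = sorted(
        name for name, state in component_states.items() if state == "FAILED"
    )
    attention = sorted(
        name
        for name, state in component_states.items()
        if state != "FAILED" and state not in NO_ACTION
    )
    return failed, attention
```

For example, calling triage({"AUTH": "FAILED", "NFS": "DEPENDENCY_FAILED", "SMB": "HEALTHY"}) puts AUTH in the first list and NFS in the second, which points to the authentication service as the place to start.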

The mmces events command can be used to show you either events that are currently causing a component to be unhealthy or a list of historical events for the node. If you want to know why a component on a node is in a failed state, use the mmces events active invocation. This command gives you a list of any currently active events that are affecting the state of a component, along with a message that describes the problem. This information should provide a place to start when you are trying to find and fix the problem that is causing the failure.

If you want a more complete picture of what is happening with a node over a longer time period, use the mmces events list invocation. By default, this command prints a list of all events that occurred on this node, each with a time stamp. This information can be narrowed down by component, time period, and severity. In addition to being viewable with the command, all events are pushed to the syslog.
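If the two invocations are to be collected from a script, a minimal Python wrapper might look like the following sketch. It assumes that mmces is on the PATH of the user who runs the script. The helper names are hypothetical, and the options for narrowing the list by component, time period, or severity are deliberately left out because their syntax is not given here.

```python
import subprocess


def run_mmces(args, runner=subprocess.run):
    """Run an mmces subcommand and return its standard output as text.

    The runner parameter exists so that the call can be replaced in tests.
    """
    completed = runner(
        ["mmces", *args], capture_output=True, text=True, check=True
    )
    return completed.stdout


def active_events(runner=subprocess.run):
    """Return the output of mmces events active."""
    return run_mmces(["events", "active"], runner)


def event_history(runner=subprocess.run):
    """Return the output of mmces events list."""
    return run_mmces(["events", "list"], runner)
```

Because check=True is passed, a nonzero exit status raises subprocess.CalledProcessError instead of returning partial output.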

Maintenance and troubleshooting

A CES node can be marked as unavailable by the monitoring process. The mmces node list command shows the nodes and the current state flags that are associated with each of them. When a node is unavailable (that is, one of the following node flags is set), it does not accept CES address assignments. The following node states can be displayed:

Suspended: Indicates that the node is suspended with the mmces node suspend command. When suspended, health monitoring on the node is discontinued. The node remains in the suspended state until it is resumed with the mmces node resume command.
Network-down: Indicates that monitoring found a problem that prevents the node from bringing up the CES addresses in the address pool. The state reverts to normal when the problem is corrected. Possible causes for this state are missing or non-functioning network interfaces and network interfaces that are reconfigured so that the node can no longer host the addresses in the CES address pool.
No-shared-root: Indicates that the CES shared root directory cannot be accessed by the node. The state reverts to normal when the shared root directory becomes available. A possible cause for this state is that the file system that contains the CES shared root directory is not mounted.
Failed: Indicates that monitoring found a problem with one of the enabled protocol servers. The state reverts to normal when the server returns to normal operation or when the service is disabled.
Starting up: Indicates that the node is starting the processes that are required to implement the CES services that are enabled in the cluster. The state reverts to normal when the protocol servers are functioning.
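The rule that any of these flags makes a node unavailable, together with the recovery hints from the list above, can be expressed as a small Python lookup. This is a sketch only. The flag spellings follow this list and might not match the exact text that mmces node list prints, and the function names are hypothetical.

```python
# Flag names as listed above, lowercased, mapped to the documented way out.
UNAVAILABLE_FLAGS = {
    "suspended": "resume the node with mmces node resume",
    "network-down": "check the network interfaces and their configuration",
    "no-shared-root": "check that the file system that contains the "
                      "CES shared root directory is mounted",
    "failed": "check the enabled protocol servers",
    "starting up": "wait until the protocol servers are functioning",
}


def _known(flags):
    """Return the normalized flags that appear in UNAVAILABLE_FLAGS."""
    normalized = (flag.strip().lower() for flag in flags)
    return [flag for flag in normalized if flag in UNAVAILABLE_FLAGS]


def accepts_addresses(flags):
    """Return True if none of the unavailability flags is set."""
    return not _known(flags)


def next_steps(flags):
    """Return one hint per recognized flag, in the order given."""
    return [UNAVAILABLE_FLAGS[flag] for flag in _known(flags)]
```

For a node that reports the flags Suspended and No-shared-root, accepts_addresses returns False and next_steps returns the two matching hints.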

Additionally, events that affect the availability and configuration of CES nodes are logged in the GPFS log file /var/adm/ras/mmfs.log.latest. The verbosity of the CES logging can be changed with the mmces log level n command, where n is a number from 0 (less logging) to 4 (more logging). The current log level can be viewed with the mmlscluster --ces command.
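When the log level is set from a script, it can help to validate the value before the command is run. The following Python sketch builds the argument list for mmces log level n and rejects values outside the documented range of 0 to 4. The function name is hypothetical, and the sketch only constructs the command. It does not run it.

```python
def log_level_command(level):
    """Return the argument list for setting the CES log level.

    Raises ValueError unless level is an integer from 0 to 4.
    """
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(level, bool) or not isinstance(level, int):
        raise ValueError("log level must be an integer from 0 to 4")
    if not 0 <= level <= 4:
        raise ValueError("log level must be an integer from 0 to 4")
    return ["mmces", "log", "level", str(level)]
```

The returned list can be passed to subprocess.run on a node where mmces is available.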

For more information about CES troubleshooting, see the IBM Spectrum Scale™ Wiki.