Introduction to system health and troubleshooting

IBM Storage Scale comes with several functions to monitor and maintain the health of a system.

System Health

In IBM Storage Scale, system health monitoring is performed by the Sysmonitor daemon. The Sysmonitor daemon monitors all critical aspects of every cluster node on which IBM Storage Scale runs, so that potential issues are detected as early as possible. The daemon performs hundreds of checks on the relevant cluster nodes and raises RAS events. Based on the results of these checks, it informs the user when something is not working as expected and provides guidance for solving existing problems.
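For example, you can display the definition of any RAS event, including its cause and recommended user action, with the mmhealth event show command. The event name gpfs_down used here is one of the predefined events:

    # Display the description, cause, and user action for a predefined event
    mmhealth event show gpfs_down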

You can configure IBM Storage Scale to raise events when certain thresholds are reached. As soon as one of the metric values exceeds or drops below a threshold limit, the Sysmonitor daemon receives an event notification from the monitoring process. The Sysmonitor daemon then generates a log event and updates the health status of the corresponding component. For more information about event types and health statuses, see Event type and monitoring status for system health.
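The following sketch shows how threshold rules can be listed and how a custom rule might be added. The metric name mem_memfree, the limits, and the rule name are illustrative; verify the exact syntax and the available metrics with the mmhealth man page:

    # List the threshold rules that are currently active
    mmhealth thresholds list

    # Add a rule (illustrative values) that raises a warning event when free
    # memory drops below 1500000 KB and an error event below 1000000 KB
    mmhealth thresholds add mem_memfree --warnlevel 1500000 --errorlevel 1000000 --direction low --name MemFree_Custom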

Every component has certain events that are defined for it. The mmhealth node eventlog command gives a time-sorted overview of the events that occurred across all components on the local node. For more information, see System health monitoring use cases. You can also create and raise custom health events in IBM Storage Scale. For more information, see Creating, raising, and finding custom defined events.
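For example, to browse the event history on the local node:

    # Show the time-sorted event history for the local node
    mmhealth node eventlog

    # Restrict the listing to the last day (option as documented in the
    # command reference; verify with the mmhealth man page)
    mmhealth node eventlog --day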

The mmhealth node show command displays the results of the health monitoring of the node and its services, which run in the background. The role of a node determines which components are monitored on it. Many IBM Storage Scale components have their own categories in the output of the mmhealth node show command; typical components on a node are GPFS, network, file system, disk, and Perfmon. For a complete list of supported components, see Monitoring the health of a node.
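For example, to check the health of the local node, or of all nodes, and their monitored components:

    # Show the aggregated health state of the local node and each component
    mmhealth node show

    # Show the health state of all nodes in the cluster
    mmhealth node show -N all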

For these services, the mmhealth node show <service> command displays the results of the health monitoring, the aggregated health state of the service, and the recent active events for it. This view can also be used with the --verbose option to get more details about the health states of the service's subcomponents. For more information about node roles and functions, see Monitoring the health of a node.
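For example, to inspect a single service in detail (the GPFS component is used here):

    # Show the health state and active events for the GPFS component
    mmhealth node show GPFS

    # Also show the health states of the component's subcomponents
    mmhealth node show GPFS --verbose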

You can also use the mmhealth cluster show command to see an overview of the health monitoring for the complete cluster.
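For example:

    # Show the aggregated health state of each component across the cluster
    mmhealth cluster show

    # Narrow the view to a single component, for example NODE
    mmhealth cluster show NODE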

You can use the mmhealth command to do the following tasks, as shown in the sketch after this list:

  • View the health of a node or cluster.
  • View current events, and get tips for a better system configuration.
  • View details of any raised event.
  • Browse event history.
  • Manage performance thresholds.
  • Configure monitoring intervals.
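The following sketch maps each task to a typical invocation; the event name and the interval value are illustrative:

    mmhealth node show                # view the health of a node
    mmhealth cluster show             # view the health of the cluster
    mmhealth node show --unhealthy    # view current events and tips
    mmhealth event show gpfs_down     # view details of a raised event
    mmhealth node eventlog            # browse the event history
    mmhealth thresholds list          # manage performance thresholds
    mmhealth config interval medium   # configure monitoring intervals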

For more information, see mmhealth command.

Protocol monitoring

You can monitor system health, query events, and perform maintenance and troubleshooting tasks that are related to CES by using the mmces command. If a CES node is unable to export data through the configured protocols, its CES IP address is reassigned to another node. This reassignment ensures that availability is not affected when a single node goes down. For more information, see CES configuration issues.

You can use the mmces command to manage protocol addresses, services, node state, and logging levels, and to balance the address load. For more information about the mmces command, see mmces command.
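For example, the following commands show typical CES status queries:

    # Show the CES state of all protocol nodes
    mmces state show -a

    # List the protocol services that are enabled on the protocol nodes
    mmces service list -a

    # List the CES IP addresses and their current node assignments
    mmces address list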

You can use the mmprotocoltrace command to collect trace information for debugging system problems or performance issues. For more information, see mmprotocoltrace command.
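A minimal tracing session might look like the following sketch; the protocol name and the client IP address are illustrative, and the exact options should be verified with the mmprotocoltrace man page:

    # Start an SMB trace that is limited to one client (illustrative address)
    mmprotocoltrace start smb -c 192.0.2.10

    # Check the state of the running trace
    mmprotocoltrace status

    # Stop the trace and gather the trace files
    mmprotocoltrace stop smb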

Troubleshooting

Despite all the functions that are meant to maintain your system's health, you might still face some issues with your storage clusters. To start the troubleshooting process, collect details of the issues reported in the system.

IBM Storage Scale provides several options for collecting these details; the most common ones, the gpfs.snap command and the tracing feature, are described in the following paragraphs. For more information, see Troubleshooting.

To diagnose the cause of an issue, it might be necessary to gather extra information from the cluster, which can then be used to determine the root cause. Debugging information, such as configuration files and logs, can be gathered by using the gpfs.snap command. This command collects data about GPFS, operating system information, and information for each of the protocols. For more information, see gpfs.snap command.
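For example, the following sketch collects a snapshot from all nodes; the --protocol option and its values should be verified with the gpfs.snap man page:

    # Collect debug data (logs, configuration, OS information) from all nodes
    gpfs.snap -N all

    # Additionally collect protocol-specific data, for example for SMB and NFS
    gpfs.snap --protocol smb,nfs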

For a deeper analysis of an issue, you can also use the tracing feature, which logs system activity at a high level of detail. For more information, see Collecting details of issues by using logs, dumps, and traces.
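A minimal trace capture with the mmtracectl command might look like the following sketch; trace levels and options vary by release, so consult the mmtracectl man page:

    # Start tracing on all nodes with the default trace levels
    mmtracectl --start -N all

    # Reproduce the problem, then stop tracing; the trace files are
    # collected when tracing stops
    mmtracectl --stop -N all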

For information about different sections that are related to configuring and maintaining system health and troubleshooting, see System Health.