Introduction to system health and troubleshooting
IBM Storage Scale comes with several functions to monitor and maintain the health of a system.
System Health
In IBM Storage Scale, system health monitoring is performed by the Sysmonitor daemon. The Sysmonitor daemon monitors all critical aspects of the entire cluster where IBM Storage Scale is used, so that potential issues are detected as early as possible. It performs hundreds of checks on the relevant cluster nodes and raises RAS events. Based on these checks, it informs the user when something is not working as expected and also provides guidance for solving existing problems.
You can configure IBM Storage Scale to raise events when certain thresholds are reached. As soon as a metric value exceeds or drops below a configured threshold limit, the Sysmonitor daemon receives an event notification from the monitoring process. The Sysmonitor daemon then generates a log event and updates the health status of the corresponding component. For more information about event types and health status, see Event type and monitoring status for system health.
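For example, you can list the threshold rules that are currently active, or define a rule of your own. The metric name and limit values in the second command are illustrative only; check the mmhealth command reference for the exact add syntax in your release:

    mmhealth thresholds list
    mmhealth thresholds add mem_memfree --warnlevel 1500000 --errorlevel 1000000 --name myMemFreeRule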
Every component has certain events that are defined for it. The mmhealth node eventlog command gives a time-sorted overview of the events that occurred across all components on the local node. For more information, see System health monitoring use cases. You can also create and raise custom health events in IBM Storage Scale. For more information, see Creating, raising, and finding custom defined events.
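For example, the following commands browse the local event history and then display the details of a single event. The event name gpfs_down is used here only as an example:

    mmhealth node eventlog
    mmhealth event show gpfs_down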
The mmhealth node show command displays the results of the health monitoring of the node and its services, which run in the background. The role of a node in monitoring determines the components that need to be monitored. For many IBM Storage Scale components, separate categories exist in the output of the mmhealth node show command. For example, the typical components that are presented on a node are GPFS, network, file system, disk, and Perfmon. For a complete list of supported components, see Monitoring the health of a node.
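For example, to display the aggregated health state of the local node and then of a single component, where GPFS is one of the typical components listed above:

    mmhealth node show
    mmhealth node show GPFS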
For each of these services, the mmhealth node show <service> command displays the health monitoring results, the aggregated health state of the service, and the recent active events for the service. Use the --verbose option with this view to get more details about the health states of the individual subcomponents of a service. For more information about node roles and functions, see Monitoring the health of a node.
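For example, to display the subcomponent health states of one service on the local node, with NETWORK used as an example component name:

    mmhealth node show NETWORK --verbose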
You can also use the mmhealth cluster show command to see an overview of the health monitoring for the complete cluster.
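For example:

    mmhealth cluster show

You can also append a component name, such as NODE, to restrict the cluster view to that component.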
You can use the mmhealth command to do the following tasks, several of which are illustrated in the examples earlier in this section and in the sketch after this list:
- View the health of a node or cluster.
- View current events and get tips for a better system configuration.
- View details of any raised event.
- Browse event history.
- Manage performance thresholds.
- Configure monitoring intervals.
For more information, see mmhealth command.
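The remaining tasks map to subcommands such as the following. The interval value is illustrative, and the availability of event hiding can depend on your release; see the mmhealth command reference for the supported values:

    mmhealth config interval default    # adjust the monitoring intervals
    mmhealth event hide <eventName>     # hide a TIP event that you do not want to act on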
Protocol monitoring
You can monitor system health, query events, and perform maintenance and troubleshooting tasks that are related to CES by using the mmces command. If a CES node can no longer export data through the configured protocols, its CES IP addresses are reassigned to another node. This reassignment ensures that availability is not impacted when a single node goes down. For more information, see CES configuration issues.
You can use the mmces command to manage protocol addresses, services, node states, logging levels, and load balancing. For more information about the mmces command, see mmces command.
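A few typical invocations, assuming a configured CES cluster; the -a option queries all CES nodes:

    mmces state show -a              # health state of the CES services on all nodes
    mmces service list -a            # protocol services that are enabled on all nodes
    mmces address list               # current CES IP address assignments
    mmces address move --rebalance   # redistribute CES IP addresses across the nodes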
You can use the mmprotocoltrace command to collect trace information for debugging system problems or performance issues. For more information, see mmprotocoltrace command.
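For example, a trace cycle for the SMB protocol might look as follows. The client IP address is illustrative:

    mmprotocoltrace start smb -c 192.0.2.10   # start tracing SMB activity for one client
    mmprotocoltrace status                    # check that the trace is running
    mmprotocoltrace stop smb                  # stop the trace and collect the results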
Troubleshooting
Despite all the functions that are meant to maintain your system's health, you might still face some issues with your storage clusters. To start the troubleshooting process, collect details of the issues reported in the system.
IBM Storage Scale provides the following options for collecting details:
- Logs
- Dumps
- Traces
- Diagnostic data collection through the CLI. For more information, see CLI commands for collecting issue details.
- Diagnostic data collection through the GUI
For more information, see Troubleshooting.
To diagnose the cause of an issue, it might be necessary to gather extra information from the cluster. This information can then be used to determine the root cause of the issue. You can collect debugging information, such as configuration files and logs, by using the gpfs.snap command. This command gathers data about GPFS, operating system information, and information for each of the protocols. For more information, see gpfs.snap command.
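For example, the following invocations create a snapshot of diagnostic data. The node name and output directory in the second command are illustrative:

    gpfs.snap
    gpfs.snap -N myNode1 -d /tmp/snapdir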
For deeper analysis of an issue, you can also use the tracing facility, which logs diagnostic information at a much finer level of detail. For more information, see Collecting details of issues by using logs, dumps, and traces.
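One way to control the trace facility is the mmtracectl command. A minimal trace cycle looks like the following; reproduce the problem between the start and the stop:

    mmtracectl --start
    # ... reproduce the problem ...
    mmtracectl --stop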
For information about different sections that are related to configuring and maintaining system health and troubleshooting, see System Health.