Monitoring the health of a node
The following list provides the details of the monitoring services available in the IBM Storage Scale system:
General
- CALLHOME
- Node role: The call home nodes or group servers get this node role.
- Task: Monitors the call home feature and sends call home heartbeats.
- DISK
- Node role: Nodes with the node class nsdNodes monitor the DISK service. IBM Storage Scale nodes.
- Task: Checks whether the IBM Storage Scale disks are available and running.
- ENCRYPTION
- Node role: A node that is configured with file system encryption with a Remote Key Manager, RKM.conf.
- Task: Displays the events that are related to the configuration of the encrypted file systems.
- FILE SYSTEM
- Node role: This node role is active on all IBM Storage Scale nodes.
- Task: Monitors different aspects of IBM Storage Scale file systems.
- FILEAUDITLOG
- Node role: All nodes get the FILEAUDITLOG producer node role if fileauditlog is made active for a file system.
- Task: Monitors the event producer state for nodes that have this role.
- FILESYSMGR
- Node role: A file system manager node, which can be detected by using the mmlsmgr command.
- Task: Shows the file systems where the current node acts as a manager. If quality of service monitoring (QoS) is enabled for a file system, it might show more hints.
- GDS
- Node role: A node that is configured for GDS (verbsGPUDirectStorage=enable) and has the gdscheck program installed. The full path to this file must be declared in the /var/mmfs/mmsysmon/mmsysmonitor.conf file. For more information, see the gpudirect section of the /var/mmfs/mmsysmon/mmsysmonitor.conf file to modify the gdscheckfile variable. Declare the path of the gdscheck program: gdscheckfile = /usr/local/cuda/gds/tools/gdscheck
- Task: Monitors the health state of the GDS configuration based on the gdscheck program output.
- GPFS
- Node role: This node role is always active on all IBM Storage Scale nodes.
- Task: Monitors all GPFS daemon-related functions. For example, the mmfsd process and gpfs port accessibility.
- GUI
- Node role: Nodes with the node class GUI_MGMT_SERVERS monitor the GUI service.
- Task: Verifies whether the GUI services are functioning properly.
- HEALTHCHECK
- Node role: The call home nodes or group servers get this node role.
- Task: Monitors health check service alert events and raises dynamic health events for the HEALTHCHECK component on the first call home server node.
- LOCAL CACHE
- Node role: This node role is active when local read-only cache disks are configured.
- Task: Monitors the health state of the local read-only cache devices.
- NETWORK
- Node role: This node role is active on every IBM Storage Scale node.
- Task: Monitors all IBM Storage Scale relevant IP-based (Ethernet + IPoIB) and InfiniBand RDMA networks.
- NVMeoF
- Node role: The node, which is contained in the NVMeoFFT_SERVERS node class, monitors the NVMeoF service.
- Task: Monitors the NVMeoF feature for the nodes that have this node role.
- PERFMON
- Node role: Nodes where PerfmonSensors or PerfmonCollector services are running get the PERFMON node role. PerfmonSensors nodes are determined through the perfmon designation in the mmlscluster output. PerfmonCollector nodes are determined through the colCandidates line in the configuration file.
- Task: Monitors whether PerfmonSensors and PerfmonCollector are running as expected.
- SERVERRAID
- Node role: An Elastic Storage Server (ESS) node, which is configured for IBM® Power® RAID (IPR) and has the /sbin/iprconfig program installed.
- Task: Monitors the health state of the IBM Power RAID based on the /sbin/iprconfig program output.
- THRESHOLD
- Node role: Nodes where the performance data collection is configured and enabled. If a node role is not configured to PERFMON, it cannot have a THRESHOLD role either.
- Task: Monitors whether the node-related threshold rules evaluation is running as expected, and whether the health status changed as a result of the threshold limits being crossed.
Note: In a mixed environment, where some nodes in the cluster run IBM Storage Scale versions that differ from the versions that are running on other nodes, the threshold service is not available.
- WATCHFOLDER
- Node role: All nodes get the WATCHFOLDER producer node role if watchfolder is made active for a file system.
- Task: Monitors the event producer state for nodes that have this role.
GNR
- NVMe
- Node role: Node must be either an Elastic Storage Server (ESS) node or an ECE node that is connected to an NVMe device.
- Task: Monitors the health state of the NVMe devices.
Interface
- AFM
- Node role: The AFM monitoring service is active if the node is a gateway node.
Note: To know whether the node is a gateway node, run the mmlscluster command.
- Task: Monitors the cache states and different user exit events for all the AFM filesets.
- CES
- Node role: This node role is active on the CES nodes that are listed by mmlscluster --ces. After a node obtains this role, all corresponding CES sub services are activated on that node. The CES service does not have its own monitoring service or events. The status of the CES is an aggregation of the status of its sub services. The following sub services are monitored:
- AUTH
- Task: Monitors LDAP, AD, and NIS-based authentication services.
- AUTH_OBJ
- Task: Monitors the OpenStack identity service functions.
- BLOCK
- Task: Checks whether the iSCSI daemon is functioning properly.
- CESNETWORK
- Task: Monitors CES network-related adapters and IP addresses.
- HDFS_NAMENODE
- Node role: CES nodes that belong to an HDFS CES group.
- Task: Checks whether the HDFS_NAMENODE process is running correctly and is healthy. It also monitors whether the ACTIVE or STANDBY state of the name node is correct, because only one HDFS node in an HDFS group can be the active node.
- NFS
- Task: Monitors NFS-related functions.
- OBJECT
- Task: Monitors the IBM Storage Scale for object functions. In particular, the status of relevant system services and the accessibility of ports are checked.
- SMB
- Task: Monitors SMB-related functions such as the smbd process, the ports, and the ctdb processes.
- CESIP
- Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This node runs a special code module of the monitor, which checks the cluster-wide CES IP distribution.
- Task: Compares the effectively hosted CES IPs with the list of declared CES IPs in the address pool and reports the result. There are three cases:
- All declared CES IPs are hosted. In this case, the state of the IPs is HEALTHY.
- None of the declared CES IPs are hosted. In this case, the state of the IPs is FAILED.
- Only a subset of the declared CES IPs is hosted. In this case, the state of the IPs is DEGRADED.
Note: A FAILED state does not trigger any failover.
- CLOUDGATEWAY
- Node role: Identified as a Transparent cloud tiering node. All nodes that are listed in the mmcloudgateway node list get this node role.
- Task: Checks whether the cloud gateway service functions as expected.
- HADOOPCONNECTOR
- Node role: Nodes where the Hadoop service is configured get the Hadoop connector node role.
- Task: Monitors the Hadoop data node and name node services.
- HDFS_DATANODE
- Node role: CES HDFS service is enabled on the cluster and the node is configured as an HDFS data node.
Note: The /usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode is-enabled command is used to find out whether the node is configured as the data node.
- Task: Monitors the HDFS data node service process.
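The three CESIP cases that are listed earlier in this section can be sketched as a small classification function. This is an illustrative sketch only, not the monitor's actual code: the function name cesip_state and its two parameters are hypothetical, and the handling of an empty address pool is an assumption that the documentation does not cover.

```python
def cesip_state(declared_ips, hosted_ips):
    """Classify the CES IP distribution into HEALTHY, FAILED, or DEGRADED.

    declared_ips: CES IPs declared in the address pool (hypothetical input).
    hosted_ips:   CES IPs that are effectively hosted (hypothetical input).
    """
    declared = set(declared_ips)
    if not declared:
        # Assumption: an empty address pool is treated as an input error here.
        raise ValueError("no CES IPs are declared in the address pool")

    # Only declared IPs matter for the comparison.
    hosted = declared & set(hosted_ips)

    if hosted == declared:
        return "HEALTHY"   # all declared CES IPs are hosted
    if not hosted:
        return "FAILED"    # none of the declared CES IPs are hosted
    return "DEGRADED"      # only a subset is hosted


if __name__ == "__main__":
    pool = ["192.0.2.10", "192.0.2.11"]
    print(cesip_state(pool, ["192.0.2.10"]))
```

With a pool of two declared addresses and only one of them hosted, the sketch reports the DEGRADED case.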
Stretch cluster monitoring
Stretch cluster monitoring is set up by defining the node classes, which specify one or more IBM Storage Scale nodes that belong to a site. Each site node class name must start with SCALE_SITE_. The characters after SCALE_SITE_ are used as the site name. For example, SCALE_SITE_A defines a stretch cluster site named A. The site name is visible in the mmhealth command output when showing site health events.
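The naming rule can be illustrated with a short sketch that derives the site name from a node class name. The helper name site_name is hypothetical, and returning None for a node class that does not follow the convention is an assumption made for the example.

```python
SITE_PREFIX = "SCALE_SITE_"


def site_name(node_class):
    """Return the site name encoded in a node class name, or None.

    A node class defines a stretch cluster site only if its name starts
    with SCALE_SITE_ and at least one character follows the prefix.
    """
    if node_class.startswith(SITE_PREFIX) and len(node_class) > len(SITE_PREFIX):
        return node_class[len(SITE_PREFIX):]
    return None  # assumption: other node classes are not site definitions


if __name__ == "__main__":
    for name in ("SCALE_SITE_A", "nsdNodes"):
        print(name, "->", site_name(name))
```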
- STRETCH CLUSTER
- Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This node runs a special code module to monitor the health of a stretch cluster configuration.
- Task: Checks the health of the sites that are defined in a stretch cluster configuration and reports any discovered health issues, which include:
- All sites have no issues that would affect the health of the stretch cluster. In this case, the state of the sites is HEALTHY.
- Sites are experiencing file system replication issues. In this case, the state of the sites is DEGRADED.
- Sites are experiencing process, network, or hardware issues. In this case, the state of the sites is FAILED.
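The mapping from issue type to site state can be sketched as follows. This is a simplified illustration, not the product's logic: the function name, the issue category strings, and the rule that FAILED takes precedence over DEGRADED when both kinds of issues are present are all assumptions.

```python
# Hypothetical issue categories, named after the cases in the list above.
FAILED_CAUSES = {"process", "network", "hardware"}
DEGRADED_CAUSES = {"replication"}


def stretch_sites_state(issues):
    """Map a collection of issue categories to HEALTHY, DEGRADED, or FAILED."""
    issues = set(issues)
    unknown = issues - FAILED_CAUSES - DEGRADED_CAUSES
    if unknown:
        raise ValueError(f"unknown issue categories: {sorted(unknown)}")
    if issues & FAILED_CAUSES:
        return "FAILED"    # assumption: the more severe state wins
    if issues & DEGRADED_CAUSES:
        return "DEGRADED"
    return "HEALTHY"       # no issues that affect the stretch cluster


if __name__ == "__main__":
    print(stretch_sites_state(["replication"]))
```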
For a list of all the available events, see Events.