Monitoring the health of a node

The following list provides the details of the monitoring services available in the IBM Storage Scale system:
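
The health state that these services report can be queried on any node with the mmhealth command. A minimal sketch; the component names follow the service names in the list below:

  # show the health state of every monitored component on the local node
  mmhealth node show

  # show only one component, for example the file system monitoring service
  mmhealth node show FILESYSTEM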

General

  1. CALLHOME

    • Node role: The call home nodes or group servers get this node role.
    • Task: Monitors the call home feature and sends call home heartbeats.
  2. DISK

    • Node role: Nodes with the node class nsdNodes monitor the DISK service.
    • Task: Checks whether the IBM Storage Scale disks are available and running.
  3. ENCRYPTION

    • Node role: A node that is configured for file system encryption with a Remote Key Manager (RKM.conf).
    • Task: Displays the events that are related to the configuration of the encrypted file systems.
  4. FILE SYSTEM

    • Node role: This node role is active on all IBM Storage Scale nodes.
    • Task: Monitors different aspects of IBM Storage Scale file systems.
  5. FILEAUDITLOG

    • Node role: All nodes get the FILEAUDITLOG producer node role if fileauditlog is made active for a file system.
    • Task: Monitors the event producer state for nodes that have this role.
  6. FILESYSMGR
    • Node role: A file system manager node, which can be detected by using the mmlsmgr command.
    • Task: Shows the file systems for which the current node acts as the manager. If quality of service (QoS) monitoring is enabled for a file system, it might show additional hints.
  7. GDS
    • Node role: A node that is configured for GDS (verbsGPUDirectStorage=enable) and has the gdscheck program installed. The full path to this program must be declared in the /var/mmfs/mmsysmon/mmsysmonitor.conf file.

      For more information, see the gpudirect section of the /var/mmfs/mmsysmon/mmsysmonitor.conf file to modify the gdscheckfile variable.

      Declare the path of the gdscheck program (a configuration sketch also follows this list):

      gdscheckfile = /usr/local/cuda/gds/tools/gdscheck

    • Task: Monitors the health state of the GDS configuration based on the gdscheck program output.
    For more information on GPUDirect Storage (GDS) support for IBM Storage Scale, see GPUDirect Storage support for IBM Storage Scale.
  8. GPFS
    • Node role: This node role is always active on all IBM Storage Scale nodes.
    • Task: Monitors all GPFS daemon-related functions, for example, the mmfsd process and GPFS port accessibility.
  9. GUI
    • Node role: Nodes with the node class GUI_MGMT_SERVERS monitor the GUI service.
    • Task: Verifies whether the GUI services are functioning properly.
  10. HEALTHCHECK
    • Node role: The call home nodes or group servers get this node role.
    • Task: Monitors health check service alert events and raises dynamic health events for the HEALTHCHECK component on the first call home server node.
    For more information, see Proactive system health alerts.
  11. LOCAL CACHE
    • Node role: This node role is active when local read-only cache disks are configured.
    • Task: Monitors the health state of the local read-only cache devices.
  12. NETWORK
    • Node role: This node role is active on every IBM Storage Scale node.
    • Task: Monitors all IBM Storage Scale relevant IP-based (Ethernet + IPoIB) and InfiniBand RDMA networks.
  13. NVMeoF
    • Node role: The node, which is contained in the NVMeoFFT_SERVERS node class, monitors the NVMeoF service.
    • Task: Monitors the NVMeoF feature for the nodes that have this node role.
  14. PERFMON
    • Node role: Nodes where the PerfmonSensors or PerfmonCollector services are running get the PERFMON node role. PerfmonSensors are determined through the perfmon designation in mmlscluster (see the example after this list). PerfmonCollectors are determined through the colCandidates line in the configuration file.
    • Task: Monitors whether PerfmonSensors and PerfmonCollector are running as expected.
  15. SERVERRAID
    • Node role: An Elastic Storage Server (ESS) node, which is configured for IBM® Power® RAID (IPR) and has the /sbin/iprconfig program installed.
    • Task: Monitors the health state of the IBM Power RAID based on the /sbin/iprconfig program output.
  16. THRESHOLD
    • Node role: Nodes where performance data collection is configured and enabled. If a node does not have the PERFMON role, it cannot have the THRESHOLD role either.
    • Task: Monitors whether the node-related thresholds rules evaluation is running as expected, and if the health status changed as a result of the threshold limits being crossed.
      Note: In a mixed environment, where the nodes of a cluster run different IBM Storage Scale versions, the threshold service is not available.
  17. WATCHFOLDER
    • Node role: All nodes get the WATCHFOLDER producer node role if watchfolder is made active for a file system.
    • Task: Monitors the event producer state for nodes that have this role.
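
As referenced in the GDS entry above, the gdscheck path is declared in the /var/mmfs/mmsysmon/mmsysmonitor.conf file. A minimal configuration sketch, assuming the gpudirect section is written as a [gpudirect] stanza; verify the exact layout in your installation's file:

  [gpudirect]
  # full path to the gdscheck utility that ships with the CUDA toolkit
  gdscheckfile = /usr/local/cuda/gds/tools/gdscheck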
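
As referenced in the PERFMON entry above, the sensor part of the role follows the perfmon designation of a node. A sketch of how such a designation is typically set and verified; node1 is a placeholder node name:

  # give a node the perfmon designation so that performance sensors run on it
  mmchnode --perfmon -N node1

  # verify the designation; nodes with perfmon are flagged in the cluster listing
  mmlscluster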

GNR

  1. NVMe
    • Node role: Node must be either an Elastic Storage Server (ESS) node or an ECE node that is connected to an NVMe device.
    • Task: Monitors the health state of the NVMe devices.

Interface

  1. AFM
    • Node role: The AFM monitoring service is active if the node is a gateway node.
      Note: To know whether the node is a gateway node, run the mmlscluster command.
    • Task: Monitors the cache states and different user exit events for all the AFM filesets.
  2. CES
    • Node role: This node role is active on the CES nodes that are listed by mmlscluster --ces. After a node obtains this role, all corresponding CES sub services are activated on that node. The CES service does not have its own monitoring service or events. The status of the CES is an aggregation of the status of its sub services. The following sub services are monitored (see also the example after this list):
      1. AUTH
        • Task: Monitors LDAP-, AD-, or NIS-based authentication services.
      2. AUTH_OBJ
        • Task: Monitors the OpenStack identity service functions.
      3. BLOCK
        • Task: Checks whether the iSCSI daemon is functioning properly.
      4. CESNETWORK
        • Task: Monitors CES network-related adapters and IP addresses.
      5. HDFS_NAMENODE
        • Node role: CES nodes that belong to an HDFS CES group.
        • Task: Checks whether the HDFS_NAMENODE process is running correctly and is healthy. It also monitors if the ACTIVE or STANDBY state of the name node is correct as only one HDFS node in an HDFS group can be the active node.
      6. NFS
        • Task: Monitors NFS-related functions.
      7. OBJECT
        • Task: Monitors the IBM Storage Scale for object functions. In particular, the status of the relevant system services and the accessibility to ports are checked.
      8. SMB
        • Task: Monitors SMB-related functions such as the smbd process, the ports, and the ctdb processes.
  3. CESIP
    • Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This node runs a special code module of the monitor, which checks the cluster-wide CES IP distribution.
    • Task: Compares the effectively hosted CES IPs with the list of declared CES IPs in the address pool and reports the result. There are three cases:
      • All declared CES IPs are hosted. In this case, the state of the IPs is HEALTHY.
      • None of the declared CES IPs are hosted. In this case, the state of the IPs is FAILED.
      • Only a subset of the declared CES IPs is hosted. In this case, the state of the IPs is DEGRADED.
      Note: A FAILED state does not trigger any failover.
  4. CLOUDGATEWAY
    • Node role: Identified as a Transparent cloud tiering node. All nodes that are listed in the mmcloudgateway node list get this node role.
    • Task: Checks whether the cloud gateway service functions as expected.
  5. HADOOPCONNECTOR
    • Node role: Nodes where the Hadoop service is configured get the Hadoop connector node role.
    • Task: Monitors the Hadoop data node and name node services.
  6. HDFS_DATANODE
    • Node role: CES HDFS service is enabled on the cluster and the node is configured as an HDFS data node.
      Note: The /usr/lpp/mmfs/hadoop/sbin/mmhdfs datanode is-enabled command is used to find out whether the node is configured as the data node.
    • Task: Monitors the HDFS data node service process.
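
As referenced in the CES entry above, the aggregated CES state and its sub services can be inspected on a CES node with the mmces and mmhealth commands. A minimal sketch:

  # list the CES services (protocols) that are enabled on this node
  mmces service list

  # show the aggregated CES state together with its monitored sub services
  mmhealth node show CES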

Stretch cluster monitoring

Stretch cluster monitoring is set up by defining node classes that specify one or more IBM Storage Scale nodes that belong to a site. Each site node class name must start with SCALE_SITE_. The characters after SCALE_SITE_ are used as the site name. For example, SCALE_SITE_A defines a stretch cluster site named A. The site name is visible in the mmhealth command output when site health events are shown.
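
For example, a two-site stretch cluster could define its site node classes with the mmcrnodeclass command; the node names below are placeholders:

  # the characters after SCALE_SITE_ become the site names A and B
  mmcrnodeclass SCALE_SITE_A -N nodeA1,nodeA2,nodeA3
  mmcrnodeclass SCALE_SITE_B -N nodeB1,nodeB2,nodeB3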

  1. STRETCH CLUSTER
    • Node role: A cluster manager node, which can be detected by using the mmlsmgr -c command. This node runs a special code module to monitor the health of a stretch cluster configuration.
    • Task: Checks the health of the sites that are defined in a stretch cluster configuration and reports any discovered health issues. There are three cases:
      1. All sites have no issues that would affect the health of the stretch cluster. In this case, the state of the sites is HEALTHY.
      2. Sites are experiencing file system replication issues. In this case, the state of the sites is DEGRADED.
      3. Sites are experiencing process, network, or hardware issues. In this case, the state of the sites is FAILED.

Note: Users can now create and raise custom events. For more information, see Creating, raising, and finding custom defined events.

For a list of all the available events, see Events.
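
Events that were raised on a node and the details of a specific event can be reviewed with the mmhealth command. A minimal sketch; gpfs_down is used here only as an example event name:

  # show the recent health events that were recorded for the local node
  mmhealth node eventlog

  # display the description, cause, and user action of a single event
  mmhealth event show gpfs_down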