Monitoring the health of a node

The following list provides the details of the monitoring services available in the IBM Spectrum Scale system:

  1. GPFS
    • Node role: This node role is always active on all IBM Spectrum Scale nodes.
    • Tasks: Monitors all GPFS daemon-related functionality, for example, the mmfsd process and GPFS port accessibility.
  2. NETWORK
    • Node role: This node role is active on every IBM Spectrum Scale node.
    • Tasks: Monitors all IBM Spectrum Scale relevant IP-based (Ethernet + IPoIB) and IB RDMA networks.
  3. CES
    • Node role: This node role is active on the CES nodes that are listed by mmlscluster --ces. After a node obtains this role, all corresponding CES sub-services are activated on that node. The CES service does not have its own monitoring service or events. The status of CES is an aggregation of the status of its sub-services. The following sub-services are monitored:
      1. AUTH
        • Tasks: Monitors LDAP, AD, or NIS-based authentication services.
      2. AUTH_OBJ
        • Tasks: Monitors the OpenStack identity service functionality.
      3. BLOCK
        • Tasks: Checks whether the iSCSI daemon is functioning properly.
      4. CESNETWORK
        • Tasks: Monitors CES network-related adapters and IP addresses.
      5. NFS
        • Tasks: Monitors NFS-related functionality.
      6. OBJECT
        • Tasks: Monitors the IBM Spectrum Scale for object functionality. In particular, it checks the status of the relevant system services and the accessibility of ports.
      7. SMB
        • Tasks: Monitors SMB-related functionality such as the smbd process, the ports, and the ctdb processes.
  4. AFM
    • Node role: The AFM monitoring service is active if the node is a gateway node.
      Note: To determine whether the node is a gateway node, run the mmlscluster command.
    • Tasks: Monitors the cache states and different user exit events for all AFM filesets.
  5. CLOUDGATEWAY
    • Node role: A node gets this node role if it is identified as a Transparent cloud tiering node. All nodes listed by mmcloudgateway node list get this node role.
    • Tasks: Checks whether the cloud gateway service functions as expected.
  6. DISK
    • Node role: Nodes with node class nsdNodes monitor the DISK service.
    • Tasks: Checks whether IBM Spectrum Scale disks are available and running.
  7. FILESYSTEM
    • Node role: This node role is active on all IBM Spectrum Scale nodes.
    • Tasks: Monitors different aspects of IBM Spectrum Scale file systems.
  8. GUI
    • Node role: Nodes with node class GUI_MGMT_SERVERS will monitor the GUI service.
    • Tasks: Verifies whether the GUI services are functioning properly.
  9. HADOOPCONNECTOR
    • Node role: Nodes where the Hadoop service is configured get the Hadoop connector node role.
    • Tasks: Monitors the Hadoop data node and name node services.
  10. PERFMON
    • Node role: Nodes where PerfmonSensors or PerfmonCollector services are running get the PERFMON node role. PerfmonSensors nodes are determined through the perfmon designation in mmlscluster. PerfmonCollector nodes are determined through the colCandidates line in the configuration file.
    • Tasks: Monitors whether PerfmonSensors and PerfmonCollector are running as expected.
  11. THRESHOLD
    • Node role: This node role is active on nodes where performance data collection is configured and enabled. A node that does not have the PERFMON role cannot have the THRESHOLD role either.
    • Tasks: Monitors whether the evaluation of node-related threshold rules is running as expected, and whether the health status has changed as a result of threshold limits being crossed.
      Note: The THRESHOLD service is available only when the cluster belongs to IBM Spectrum Scale version 4.2.3 or later. In a mixed environment with a cluster containing some nodes belonging to IBM Spectrum Scale version 4.2.2 and some nodes belonging to IBM Spectrum Scale version 4.2.3, the overall cluster version is 4.2.2. The threshold service is unavailable in such an environment.
  12. MSGQUEUE
    • Node role: A node gets the MSGQUEUE node role if it monitors nodes included in the kafkaBrokerServers node class or the kafkaZookeeperServers node class.
    • Tasks: Monitors the Zookeeper and Kafka services on the Kafka broker servers and the Kafka zookeeper servers.
  13. FILEAUDITLOG
    This service is split into two sections:
    1. FILEAUDITLOG - Consumer
      • Node role: A node gets the FILEAUDITLOG (Consumer) node role if the node is part of the kafkaBrokerServers node class.
      • Tasks: Monitors the file audit logging consumer process of each filesystem that has file audit logging enabled, and detects consumer-related errors.
    2. FILEAUDITLOG - Producer
      • Node role: All nodes get the FILEAUDITLOG (Producer) node role if fileauditlog is made active for a filesystem.
      • Tasks: Monitors the event producer state for nodes that have this role.
  14. CESIP
    • Node role: A cluster manager node, which can be detected using the mmlsmgr -c command. This node runs a special code module of the monitor, which checks the cluster-wide CES IP distribution.
    • Tasks: Compares the effectively hosted CES IPs with the list of declared CES IPs in the address pool, and reports the result. There are three cases:
      • All declared CES IPs are hosted. In this case, the state of the IPs is HEALTHY.
      • None of the declared CES IPs are hosted. In this case, the state of the IPs is FAILED.
      • Only a subset of the declared CES IPs are hosted. In this case, the state of the IPs is DEGRADED.
      Note: A FAILED state does not trigger any failover.
  15. WATCHFOLDER
    This service is split into two sections:
    1. WATCHFOLDER - Consumer
      • Node role: A node gets the WATCHFOLDER (Consumer) node role if the node is part of the kafkaBrokerServers node class.
      • Tasks: Monitors the watchfolder consumer process of each filesystem that has watchfolder enabled, and detects consumer-related errors.
    2. WATCHFOLDER - Producer
      • Node role: All nodes get the WATCHFOLDER (Producer) node role if watchfolder is made active for a filesystem.
      • Tasks: Monitors the event producer state for nodes that have this role.
  16. CALLHOME
    • Node role: The call home nodes (group masters) get this node role.
    • Tasks: Monitors the call home feature and sends call home heartbeats.
  17. NVMe
    • Node role: The node must be either an ESS node or an ECE node connected to an NVMe device.
    • Tasks: Monitors the health state of the NVMe devices.
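The version rule in the THRESHOLD note (item 11) can be illustrated with a minimal sketch. It assumes that the overall cluster version is the lowest version found on any node, which matches the 4.2.2 and 4.2.3 example in the note but is a generalization. The function names are hypothetical and are not part of any IBM Spectrum Scale interface.

```python
MINIMUM_THRESHOLD_VERSION = "4.2.3"  # taken from the THRESHOLD note


def parse_version(text):
    """Turn a dotted version string such as '4.2.3' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def cluster_version(node_versions):
    """Return the overall cluster version.

    Assumption: the cluster version is the lowest node version, as in the
    mixed 4.2.2 / 4.2.3 example where the cluster version is 4.2.2.
    """
    return min(node_versions, key=parse_version)


def threshold_available(node_versions):
    """True if the cluster version is at least the minimum THRESHOLD version."""
    overall = parse_version(cluster_version(node_versions))
    return overall >= parse_version(MINIMUM_THRESHOLD_VERSION)


def node_can_have_threshold(has_perfmon_role, node_versions):
    """A node without the PERFMON role cannot have the THRESHOLD role."""
    return has_perfmon_role and threshold_available(node_versions)


if __name__ == "__main__":
    mixed = ["4.2.2", "4.2.3", "4.2.3"]
    print(cluster_version(mixed))
    print(threshold_available(mixed))
```

In this sketch, a single node at version 4.2.2 is enough to make the THRESHOLD service unavailable for the whole cluster.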
Note: Users can now create and raise custom events. For more information, see Creating, raising, and finding custom defined events.

For more details on different events, their causes, and possible user actions to resolve them, see Events.
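The three CESIP cases described in item 14 can be sketched as a simple set comparison. This is an illustration of the stated rule only, not the monitor's actual code. The function name and the handling of an empty address pool are assumptions, because the source does not describe that case.

```python
def cesip_state(declared_ips, hosted_ips):
    """Classify the cluster-wide CES IP distribution.

    declared_ips: CES IPs declared in the address pool
    hosted_ips:   CES IPs that are effectively hosted on CES nodes
    """
    declared = set(declared_ips)
    if not declared:
        # Assumption: an empty address pool is outside the three documented cases.
        raise ValueError("no CES IPs are declared in the address pool")

    # Only declared addresses count; ignore anything else that is hosted.
    hosted = declared & set(hosted_ips)

    if hosted == declared:
        return "HEALTHY"   # all declared CES IPs are hosted
    if not hosted:
        return "FAILED"    # none of the declared CES IPs are hosted
    return "DEGRADED"      # only a subset of the declared CES IPs are hosted


if __name__ == "__main__":
    pool = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    print(cesip_state(pool, ["192.0.2.10", "192.0.2.12"]))
```

The example addresses come from the 192.0.2.0/24 documentation range and are placeholders. As the note in item 14 states, a FAILED result is reported only and does not trigger any failover.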