Configuring the performance monitoring tool

The performance monitoring tool, comprising the collector, sensors, and proxies, is part of the IBM Spectrum Scale™ distribution. The tool is installed with the GPFS™ core packages on all nodes. The tool packages are small: approximately 300 KB for the sensors and 500 KB for the collector.

Note: The tool is supported on Linux nodes only.

For information on the ports used by the performance monitoring tool, see Firewall recommendations for the Performance Monitoring tool.

Configuring sensors

The install toolkit sets up a default set of sensors for monitoring on each node. You can modify the sensors on each individual node.

The sensor configuration file, ZIMonSensors.cfg, is located on each node in the /opt/IBM/zimon folder. The file lists all groups of sensors. The configuration file includes the parameter settings of the sensors, such as the reporting frequency, and controls which sensors are active within the cluster.

For example:
sensors =
{
        name = "CPU"
        period = 1
},
{       name = "Load"
        period = 1
},
{
        name = "Memory"
        period = 1
},
{
        name = "Network"
        period = 1
        filter = "eth*"
        # filters are currently ignored.
},
{
        name = "Netstat"
        period = 1
},
Starting with version 4.2 of the performance monitoring tool, sensors can be configured on nodes that are part of an IBM Spectrum Scale cluster through an IBM Spectrum Scale-based configuration mechanism. This requires IBM Spectrum Scale 4.2 to be installed on all the nodes where a sensor is running and where the sensors are to be configured. It also requires the entire cluster to be running at least IBM Spectrum Scale 4.1.1 and the mmchconfig release=LATEST command to have been executed.
The IBM Spectrum Scale-based configuration method allows the sensor configuration to be stored as part of the IBM Spectrum Scale configuration. It is available only for the sensor configuration file (/opt/IBM/zimon/ZIMonSensors.cfg), not for the collector configuration file (/opt/IBM/zimon/ZIMonCollector.cfg). In this setup, the /opt/IBM/zimon/ZIMonSensors.cfg configuration file on each IBM Spectrum Scale node is maintained by IBM Spectrum Scale. As a result, the file must not be edited manually: whenever IBM Spectrum Scale needs to update a configuration parameter, the file is regenerated and any manual modifications are overwritten.

Before using the IBM Spectrum Scale-based configuration, an initial configuration needs to be stored within IBM Spectrum Scale. Storing this initial configuration is accomplished with the mmperfmon config generate command:
prompt# mmperfmon config generate \
--collectors collector1.domain.com,collector2.domain.com,… 
Once the configuration file has been stored within IBM Spectrum Scale, it can be activated as follows.
prompt# mmchnode --perfmon -N nodeclass1,nodeclass2,… 
Note: Any previously existing configuration file is overwritten.
To deactivate the performance monitoring tool, use the same command with the --noperfmon switch instead. Configuration parameters can be changed with the following command, where each param is of the form sensorname.sensorattribute:
prompt# mmperfmon config update param1=value1 param2=value2 …
For instance, to restrict a sensor, say NFSIO, to a given node class and change the reporting period to once every 10 hours, one would specify NFSIO.period=36000 NFSIO.restrict=nodeclass1 as attribute value pairs in the update command.
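Putting this together, the full update command would look like the following (nodeclass1 is a placeholder for a node class defined in your cluster):
prompt# mmperfmon config update NFSIO.period=36000 NFSIO.restrict=nodeclass1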
Note: Some sensors, such as the cluster export services sensors, must run only on a specific set of nodes. Other sensors, such as the GPFSDiskCap sensor, must run only on a single node in the cluster because the data reported is the same regardless of the node on which the sensor runs. The restrict function is intended especially for these types of sensors.
Configuration information for SMB and NFS in the ZIMonSensors.cfg file references the sensor definition files in the /opt/IBM/zimon folder. For example:
  • The CTDBDBStats.cfg file is referenced in:
    {       name = "CTDBDBStats"
            period = 1
            type = "Generic"
    },
  • The CTDBStats.cfg file is referenced in:
    {       name = "CTDBStats"
            period = 1
            type = "Generic"
    },
  • The NFSIO.cfg file is referenced in:
    {
            # NFS Ganesha statistics
            name = "NFSIO"
            period = 1
            type = "Generic"
    },
  • The SMBGlobalStats.cfg file is referenced in:
    {       name = "SMBGlobalStats"
            period = 1
            type = "Generic"
    },
  • The SMBStats.cfg file is referenced in:
    {       name = "SMBStats"
            period = 1
            type = "Generic"
    },

The period in the example specifies the interval, in seconds, at which a sensor group gathers data. A value of 0 means the sensor group is disabled, and 1 runs the sensor group every second. You can specify a higher value to decrease the frequency at which the data is collected.
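For example, to have the Netstat sensor group gather data every 10 seconds instead of every second, its entry would look like this:
{
        name = "Netstat"
        period = 10
},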

The file also contains the host name of the collector node to which the sensors report.
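As a sketch, the corresponding section of ZIMonSensors.cfg might look like the following; the host name is a placeholder, and the exact attribute names and port value depend on the installed version of the tool:
collectors =
{
        host = "collector1.domain.com"
        port = "4739"
}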

Configuration changes result in a new version of the configuration file which is then propagated through the IBM Spectrum Scale cluster at the file level.
Note: Some sensors, such as VFS, are not enabled by default even though predefined queries for them are available with the mmperfmon query command. This is because the collector can run into performance issues of its own if it has to collect more than 1000000 metrics per second.

To enable VFS sensors, use the mmfsadm vfsstats enable command on the node.

To enable a sensor, set its period value to an integer greater than 0. You then need to restart the sensors on that node by using the systemctl restart pmsensors command.
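For VFS metrics specifically, a minimal sketch of the commands to run on a node, combining the steps named above, would be:
prompt# mmfsadm vfsstats enable
prompt# systemctl restart pmsensors
This assumes that the period of the corresponding sensor has already been set to a non-zero value.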

The sensors can be stopped (deactivated) using the systemctl stop pmsensors command.

The sensors can be started (activated) using the systemctl start pmsensors command.

Configuring the collector

The configuration file of the collector, ZIMonCollector.cfg, is located in the /opt/IBM/zimon/ folder.

The most important configuration options are the domains and the peers configuration options. All other configuration options are best left at their defaults and are explained within the default configuration file shipped with ZIMon.

Metric Domain Configuration

The domains configuration indicates the number of metrics to be collected, how long they must be retained, and in what granularity. Multiple domains can be specified. If data no longer fits into the current domain, it is spilled over into the next domain and resampled.

A simple configuration is:
domains = {
# this is the raw domain, aggregation factor for the raw domain is always 0
aggregation = 0
ram = "500m" # amount of RAM to be used
duration = "12h"
filesize = "1g" # maximum file size
files = 16 # number of files.
}
,
{
# this is the second domain that aggregates to 60 seconds
aggregation = 60
ram = "500m" # amount of RAM to be used
duration = "4w"
filesize = "500m" # maximum file size
files = 4 # number of files.
}
,
{
# this is the third domain that aggregates to 30*60 seconds == 30 minutes
aggregation = 30
ram = "500m" # amount of RAM to be used
duration = "1y"
filesize = "500m" # maximum file size
files = 4 # number of files.
}
The configuration file lists several data domains. At least one domain must be present and the first domain represents the raw data collection as the data is collected by sensors. The aggregation parameter for this first domain must be set to 0.
Each domain specifies the following parameters:
  • The duration parameter indicates how long until the collected metrics will be pushed into the next (coarser-grained) domain. If this option is left out, no limit on the duration will be imposed.
  • The ram parameter indicates the amount of RAM to be allocated for the domain. Once that amount of RAM has been filled up, collected metrics will be pushed into the next (coarser-grained) domain. If this option is left out, no limit on the amount of RAM available will be imposed.
  • The filesize and files parameters indicate how much space is allocated on disk for a given domain. While metrics are stored in memory, a persistence mechanism also stores them on disk in files of size filesize. Once the number of files is reached and a new file is to be allocated, the oldest file is removed from the disk. The persistent storage must be at least as large as the amount of main memory allocated for a domain, because when the collector is restarted, the in-memory database is recreated from these files. Queries can also be served from these files if they are executed in archive mode (-a).
The aggregation value, used for the second and subsequent domains, indicates the resampling to be performed. Once data is spilled into this domain, the data is resampled to a resolution no better than indicated by the aggregation factor. The value for the second domain is in seconds; the resolution of domain n (n>2) is the resolution of domain n-1 multiplied by the aggregation value of domain n. For example, in the sample configuration above, the third domain has an aggregation value of 30, so its resolution is 60 seconds × 30 = 1800 seconds, that is, 30 minutes.

The collector collects the metrics from the sensors. For instance, in a 5-node cluster where only the load values (load1, load5, load15) are reported, the collector has to maintain 15 metrics (3 metrics times 5 nodes). Depending on the number of metrics collected, the collector requires a different amount of main memory to store the collected metrics. Assuming 500000 metrics are collected, here are two configurations and the amount of memory required to store the database. Depending on the data collected, 500000 metrics corresponds to about 1000 nodes.

Configuration 1 (4 GB of RAM): domain 1 configured at 1-second granularity for a period of 6 hours, domain 2 at 30-second granularity for the next 2 days, domain 3 at 15-minute granularity for the next 2 weeks, and domain 4 at 6-hour granularity for the next 2 months.

Configuration 2 (16 GB of RAM): domain 1 configured at 1-second granularity for a period of 1 day, domain 2 at 30-second granularity for the next week, domain 3 at 15-minute granularity for the next 2 months, and domain 4 at 6-hour granularity for the next year.

Note: The above computation only gives the memory required for the in-memory database; it does not include the indices necessary for persistent storage or the memory required by the collector program itself.
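As an illustration, Configuration 1 could be expressed as a domains section like the following sketch. The ram, filesize, and files values are placeholders, not sizing recommendations; only the aggregation and duration values follow from the description above, using the same duration unit style as the default example.
domains = {
# raw domain, 1-second samples kept for 6 hours
aggregation = 0
ram = "500m" # placeholder
duration = "6h"
filesize = "1g" # placeholder
files = 16 # placeholder
},
{
# second domain, resampled to 30 seconds, kept for 2 days
aggregation = 30
ram = "500m" # placeholder
duration = "48h"
filesize = "500m" # placeholder
files = 4 # placeholder
},
{
# third domain, resampled to 30*30 seconds == 15 minutes, kept for 2 weeks
aggregation = 30
ram = "500m" # placeholder
duration = "2w"
filesize = "500m" # placeholder
files = 4 # placeholder
},
{
# fourth domain, resampled to 24*900 seconds == 6 hours, kept for about 2 months
aggregation = 24
ram = "500m" # placeholder
duration = "9w"
filesize = "500m" # placeholder
files = 4 # placeholder
}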

The collectors can be stopped (deactivated) using the systemctl stop pmcollector command.

The collectors can be started (activated) using the systemctl start pmcollector command.

Manual configuration method

When upgrading the performance monitoring tool, it is important to note how the previous version was configured and whether the configuration mechanism is to be changed. Before IBM Spectrum Scale 4.2, the tool was configured using a file-based configuration mechanism in which you had to manually edit the configuration files and propagate them to the requisite nodes. If the configuration mechanism is to be changed, verify that the installed versions of both IBM Spectrum Scale and the performance monitoring tool support the new configuration mechanism. If this is not the case, or if you want to use the manual, file-based configuration method, a couple of steps are required when you install IBM Spectrum Scale 4.2. None of the nodes in the cluster should be designated perfmon nodes; if any nodes are designated as perfmon nodes, you must run the mmchnode --noperfmon -N all command.

Then you need to delete the centrally stored configuration information by issuing the mmperfmon config delete --all command.
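Taken together, switching to (or staying with) the manual, file-based method therefore involves these two commands:
prompt# mmchnode --noperfmon -N all
prompt# mmperfmon config delete --all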

The /opt/IBM/zimon/ZIMonSensors.cfg file is then maintained manually, and the tools that use it can overwrite it with a new version at any time. This mode is useful if sensors are to be installed on non-IBM Spectrum Scale nodes, or if you want to run a cluster with multiple IBM Spectrum Scale levels.
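For example, one way to propagate a manually edited configuration to another node and restart its sensors is shown in the following sketch; node2 is a placeholder host name, and any file distribution mechanism can be used instead of scp:
prompt# scp /opt/IBM/zimon/ZIMonSensors.cfg node2:/opt/IBM/zimon/ZIMonSensors.cfg
prompt# ssh node2 systemctl restart pmsensors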

Enabling Object metrics

At the time of installation, the Object metrics proxy is configured to start by default on an Object protocol node.

The Object metrics proxy server is controlled by the corresponding service script called pmswiftd, located in the /etc/rc.d/init.d/ directory. You can start and stop it using the systemctl start pmswiftd and systemctl stop pmswiftd commands respectively.

In case of a system restart, the Object metrics proxy server restarts automatically, whereas the Object metrics proxy client is triggered by the performance monitoring tool. In case of a failover, the server might start automatically if it failed gracefully; otherwise, it needs to be started manually.
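For instance, to bring the Object metrics proxy server back up manually after a failover and then check its state (systemctl status is standard systemd usage, not specific to this tool):
prompt# systemctl start pmswiftd
prompt# systemctl status pmswiftd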