Network performance monitoring
Network performance can be monitored by using either Remote Procedure Call (RPC) statistics or the IBM Storage Scale graphical user interface (GUI).
Monitoring networks by using RPC statistics
You can monitor the network performance of IBM Storage Scale nodes and of the communication protocols that are used to exchange information between them. One such protocol is RPC, which is used to send request and response messages between IBM Storage Scale nodes over an Ethernet or an InfiniBand interface.
Each IBM Storage Scale node has a set of seven RPC statistics that are cached per node, and one RPC statistic that is cached per size of the RPC message to monitor the network performance. The counters are measured in seconds and milliseconds.
The following statistics are cached per IBM Storage Scale node:
- Channel Wait Time: The amount of time the RPC must wait to access a communication channel to the destination IBM Storage Scale node.
- Send Time TCP: The amount of time that is needed to transfer an RPC message over an Ethernet interface.
- Send Time Verbs: The amount of time that is needed to transfer an RPC message to an InfiniBand interface.
- Receive Time TCP: The amount of time that is needed to transfer an RPC message from an Ethernet interface to the GPFS daemon.
- Latency TCP: The latency of an RPC message when it is sent and received over an Ethernet interface.
- Latency Verbs: The latency of an RPC message when it is sent and received over an InfiniBand interface.
- Latency Mixed: The latency of an RPC message when it is sent over one type of interface (Ethernet or InfiniBand) and received over the other (InfiniBand or Ethernet).
If an InfiniBand network is not configured, no statistics are cached for Send Time Verbs, Latency Verbs, and Latency Mixed.
The GPFS daemon treats the RPC latency as a relative measure of GPFS network performance. The RPC latency is defined as the difference between the round-trip time and the execution time. The round-trip time is measured from the start of writing an RPC request message over an interface until the RPC response message is received. The execution time is measured from the time an RPC request message is received on the GPFS destination node until the RPC response message is sent. Therefore, the RPC latency is the amount of time that the RPC spends being transmitted and received over the network.
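The definition above can be sketched as a small calculation. This is a minimal illustration and not the daemon's implementation: the function and parameter names are hypothetical, and all timestamps are assumed to be in seconds. Each difference is taken between two timestamps from the same node, so the sketch does not need the clocks of the two nodes to agree.

```python
def rpc_latency(request_write_start, response_received,
                request_received, response_sent):
    """Return RPC latency as round-trip time minus execution time.

    The first two timestamps are taken on the sending node and the last
    two on the destination node. All names here are hypothetical.
    """
    round_trip_time = response_received - request_write_start
    execution_time = response_sent - request_received
    return round_trip_time - execution_time


# Sender clock: request written at 0.000 s, response received at 0.010 s.
# Destination clock: request received at 100.002 s, response sent at 100.007 s.
latency = rpc_latency(0.000, 0.010, 100.002, 100.007)
print(f"latency: {latency * 1000:.3f} ms")
```

In this example the round-trip time is 10 ms and the execution time is 5 ms, which leaves about 5 ms spent on the network.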
There is an RPC statistic that is associated with each of a set of size ranges, each with an upper bound that is a power of 2. The first range is 0-64, then 65-128, then 129-256, and then continuing until the last range has an upper bound of twice the maxBlockSize. For example, if the maxBlockSize is 1 MB, the upper bound of the last range is 2,097,152 (2 MB). For each of these ranges, the associated statistic is the latency of the RPC whose size falls within that range. The size of an RPC is the amount of data that is sent plus the amount of data received. However, if one amount is more than 16 times greater than the other, only the larger amount is used as the size of the RPC.
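The size rule and the range boundaries can be illustrated with a short sketch. The function names are hypothetical, sizes are assumed to be in bytes, and maxBlockSize is assumed to be a power of 2. "More than 16 times greater" is read here as `larger > 16 * smaller`. The text does not say how a size beyond the last upper bound is handled, so the sketch returns `None` in that case.

```python
def rpc_size(sent, received):
    """Size of an RPC: sent plus received, unless one side dominates."""
    larger, smaller = max(sent, received), min(sent, received)
    if larger > 16 * smaller:
        return larger          # only the larger amount counts
    return sent + received


def range_upper_bounds(max_block_size):
    """Upper bounds 64, 128, 256, ... up to twice max_block_size."""
    bounds = []
    upper = 64
    while upper <= 2 * max_block_size:
        bounds.append(upper)
        upper *= 2
    return bounds


def size_range(size, max_block_size):
    """Return the (lower, upper) range that contains size, or None."""
    lower = 0
    for upper in range_upper_bounds(max_block_size):
        if size <= upper:
            return (lower, upper)
        lower = upper + 1
    return None  # assumption: behavior past the last range is not documented


ONE_MB = 1024 * 1024
print(range_upper_bounds(ONE_MB)[-1])           # last upper bound for 1 MB
print(size_range(rpc_size(60, 40), ONE_MB))     # 100 bytes falls in 65-128
```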
The final statistic associated with each type of RPC message, on the node where the RPC is received, is the execution time of the RPC.
The RPC statistics that are used for network performance monitoring are described as an aggregation of values. By default, an aggregation consists of 60 one-second intervals, 60 one-minute intervals, 24 one-hour intervals, and 30 one-day intervals.
Each time interval consists of the following values:
- Sum of values that are accumulated during the interval.
- Count of values that are added to the aggregation total.
- Minimum value that is added to the aggregation total.
- Maximum value that is added to the aggregation total.
After the GPFS daemon has been running for 60 seconds, the oldest 1-second interval is discarded each time a new 1-second interval with the latest RPC data is added.
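This retention behavior resembles a fixed-length window. The following sketch models it with Python's `collections.deque` and a `maxlen` of 60. The data layout is an assumption made for illustration; the source does not describe how the daemon stores intervals.

```python
from collections import deque

# A window that holds at most 60 one-second intervals (the default count).
second_intervals = deque(maxlen=60)

# Simulate 61 seconds of daemon uptime; each entry stands in for one
# interval's (sum, count, minimum, maximum) values.
for second in range(61):
    second_intervals.append({"second": second})

# Appending the 61st interval dropped the oldest one (second 0).
print(len(second_intervals), second_intervals[0]["second"])
```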
After receiving each RPC response message, the following information is saved in a raw statistics buffer:
- Channel wait time
- Send time
- Receive time
- Latency
- Length of data sent
- Length of data received
- Flags indicating whether the RPC was sent or received over InfiniBand
- Target node identifier
As each RPC completes execution, the execution time and the message type of the RPC are saved in a raw execution buffer. The raw buffers are processed once per second, and the values are added to the appropriate aggregated statistic. Each value is added to the statistic's sum, the count is incremented, and the value is compared to the minimum and maximum, which are adjusted as needed. When this processing completes, the sum, count, minimum, and maximum values for each statistic are entered into the next 1-second interval.
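The per-value update can be sketched as follows. The `Interval` class and `process_raw_buffer` function are hypothetical names, and starting the minimum and maximum at positive and negative infinity is a choice made for this sketch rather than documented behavior.

```python
import math
from dataclasses import dataclass


@dataclass
class Interval:
    """Sum, count, minimum, and maximum for one statistic in one interval."""
    total: float = 0.0
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value):
        self.total += value                      # add to the sum
        self.count += 1                          # increment the count
        self.minimum = min(self.minimum, value)  # adjust minimum if needed
        self.maximum = max(self.maximum, value)  # adjust maximum if needed


def process_raw_buffer(raw_values):
    """Fold one second's worth of raw values into a new 1-second interval."""
    interval = Interval()
    for value in raw_values:
        interval.add(value)
    return interval


# Three raw latency values (in seconds) collected during one second.
one_second = process_raw_buffer([0.002, 0.005, 0.001])
print(one_second)
```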
Every 60 seconds, the sums and counts in the 60 1-second intervals are added into a 1-minute sum and count. The smallest of the 60 minimum values and the largest of the 60 maximum values are determined. This 1-minute sum, count, minimum, and maximum are then entered into the next 1-minute interval.
An analogous pattern holds for the minute, hour, and day periods. For any one particular interval, the sum is the sum of all raw values that are processed during that interval, the count is the count of all values during that interval, the minimum is the minimum of all values during that interval, and the maximum is the maximum of all values during that interval.
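The roll-up from one period to the next can be sketched as a single function that works for seconds to minutes, minutes to hours, and hours to days. The names are hypothetical. Skipping intervals with a count of zero when taking the minimum and maximum is an assumption; the source does not say how empty intervals are treated.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    total: float
    count: int
    minimum: float
    maximum: float


def roll_up(intervals):
    """Combine finer-grained intervals into one coarser interval."""
    # Assumption: intervals that saw no values do not affect min or max.
    non_empty = [i for i in intervals if i.count > 0]
    if not non_empty:
        return Interval(0.0, 0, 0.0, 0.0)
    return Interval(
        total=sum(i.total for i in non_empty),
        count=sum(i.count for i in non_empty),
        minimum=min(i.minimum for i in non_empty),
        maximum=max(i.maximum for i in non_empty),
    )


# Three of the 60 one-second intervals, shortened for illustration.
seconds = [
    Interval(10.0, 2, 3.0, 7.0),
    Interval(0.0, 0, 0.0, 0.0),    # no RPCs during this second
    Interval(30.0, 3, 1.0, 20.0),
]
one_minute = roll_up(seconds)
print(one_minute)
```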
When statistics are displayed for any particular interval, an average is calculated from the sum and count, then the average, minimum, maximum, and count are displayed. The average, minimum, and maximum are displayed in units of milliseconds, to three decimal places (1-microsecond granularity).
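The display step can be illustrated with a short formatting sketch. It assumes that the stored sum, minimum, and maximum are in seconds, and it returns `None` for an interval with no values. Both choices are assumptions for illustration, and the function name is hypothetical.

```python
def display_values(total, count, minimum, maximum):
    """Return (average, minimum, maximum, count) with times in milliseconds.

    Times are formatted to three decimal places, which corresponds to
    1-microsecond granularity.
    """
    if count == 0:
        return None  # assumption: nothing to average for an empty interval
    average = total / count
    return (
        f"{average * 1000:.3f}",
        f"{minimum * 1000:.3f}",
        f"{maximum * 1000:.3f}",
        count,
    )


# Four RPCs with a combined latency of 0.0123456 seconds.
print(display_values(0.0123456, 4, 0.001, 0.0052))
```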
The RPC buffers and intervals can be controlled by using the following mmchconfig command attributes:
- rpcPerfRawStatBufferSize
- rpcPerfRawExecBufferSize
- rpcPerfNumberSecondIntervals
- rpcPerfNumberMinuteIntervals
- rpcPerfNumberHourIntervals
- rpcPerfNumberDayIntervals
The mmdiag command with the --rpc parameter can be used to query RPC statistics.
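As a rough illustration, the following sketch only builds the command lines and does not run them. The helper name is hypothetical. The attribute names come from the list above, and the example values are the default interval counts stated earlier. Valid value ranges, and whether a change takes effect immediately, are not covered here; see the mmchconfig command documentation.

```python
# Attribute names taken from the list above.
RPC_PERF_ATTRIBUTES = {
    "rpcPerfRawStatBufferSize",
    "rpcPerfRawExecBufferSize",
    "rpcPerfNumberSecondIntervals",
    "rpcPerfNumberMinuteIntervals",
    "rpcPerfNumberHourIntervals",
    "rpcPerfNumberDayIntervals",
}

# Command line for querying RPC statistics.
MMDIAG_RPC_ARGS = ["mmdiag", "--rpc"]


def build_mmchconfig_args(settings):
    """Build an mmchconfig argument list of the form Attribute=value,..."""
    if not settings:
        raise ValueError("at least one attribute is required")
    unknown = sorted(set(settings) - RPC_PERF_ATTRIBUTES)
    if unknown:
        raise ValueError(f"not an RPC statistics attribute: {unknown}")
    pairs = ",".join(f"{name}={value}" for name, value in settings.items())
    return ["mmchconfig", pairs]


args = build_mmchconfig_args({
    "rpcPerfNumberSecondIntervals": 60,
    "rpcPerfNumberMinuteIntervals": 60,
})
print(" ".join(args))
print(" ".join(MMDIAG_RPC_ARGS))
```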
For more information, see mmchconfig command, mmnetverify command, and mmdiag command.