High Speed Interconnect Monitoring
Monitoring the performance of your Db2 high-speed cluster interconnect layer can give valuable insight into the configuration between the members and CFs. The most appropriate way to measure this performance is through RDMA ping tests.
Testing for all adapter combinations
To run the RDMA ping test for all adapter combinations, run db2cluster -verify -req -rdma_ping. This produces two files in sqllib/db2dump/db2cluster:
- A log file of the form ibm.db2cluster_int-<timestamp>.log
- A trace file of the form ibm.db2cluster_int-<timestamp>.trace
Each entry in the log file summarizes the test for one adapter pair. In this summary, the primary metric is the average RDMA round-trip ping time.
Example entry:
12:07:34 ----------------------------------------------------------------------
12:07:34 Testing rdma ping connection.
12:07:34 Client Hostname: coralpib259 Client Netname: coralpib259-ib0
12:07:34 Server Hostname: coralpib257 Server Netname: coralpib257-ib0
12:07:39
Starting connection test number 1
Connected to Server (10.1.2.5) in 3002499 microseconds after 1 attempt
100 bytes from 10.1.2.5: seq=0 time=6
100 bytes from 10.1.2.5: seq=1 time=11
100 bytes from 10.1.2.5: seq=2 time=9
100 bytes from 10.1.2.5: seq=3 time=147
100 bytes from 10.1.2.5: seq=4 time=23
100 bytes from 10.1.2.5: seq=5 time=8
100 bytes from 10.1.2.5: seq=6 time=12
100 bytes from 10.1.2.5: seq=7 time=7
100 bytes from 10.1.2.5: seq=8 time=9
100 bytes from 10.1.2.5: seq=9 time=34
round-trip average: 27 microseconds
Disconnected from Server in 283 microseconds
Performing udapl ping test from coralpib259-ib0 to coralpib257-ib0 status: PASS
12:07:39 ----------------------------------------------------------------------
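The reported average in this entry is consistent with the arithmetic mean of the ten individual ping times (266 / 10 = 26.6, shown as 27), which is why a single slow ping such as seq=3 raises the average noticeably. The following sketch recomputes that mean from the per-ping lines. It assumes the tool reports a rounded arithmetic mean, which the source does not state explicitly, and mean_ping_time is a hypothetical helper name.

```shell
# Recompute the mean of the per-ping "time=" values in one log entry.
# Assumption: the logged "round-trip average" is a rounded arithmetic mean.
mean_ping_time() {
    # Per-ping lines look like: "100 bytes from 10.1.2.5: seq=3 time=147"
    awk -F'time=' '
        /seq=[0-9]+ time=/ { sum += $2; n++ }
        END { if (n) printf "%.0f\n", sum / n }
    ' "$1"
}
```

Calling mean_ping_time on a file that contains the entry above should give the same figure as the round-trip average line.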
To quickly find this metric for all entries, run grep "round-trip average:" <logfile>. For a single-site pureScale cluster, most averages should fall in the single-digit or low double-digit microsecond range. Do not confuse the average with an individual ping within a test, such as seq=3 in the example above, which may be well above this range. If one or more round-trip averages are significantly higher than this range, identify the specific adapter pairs that produced these outliers and retest them. These steps are detailed below.
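The grep search can be combined with a numeric filter so that only suspicious averages are printed. This is a minimal sketch: flag_slow_averages is a hypothetical helper, and the default limit of 50 microseconds is an illustrative assumption rather than a documented threshold, so adjust it to what is normal for your cluster.

```shell
# Print every round-trip average in a db2cluster log that exceeds a limit.
# Assumption: 50 microseconds is only an example cutoff, not an IBM value.
flag_slow_averages() {
    logfile=$1
    limit_us=${2:-50}
    # Summary lines look like: "round-trip average: 27 microseconds"
    grep "round-trip average:" "$logfile" |
        awk -v limit="$limit_us" '{
            # The number is the word that follows "average:".
            for (i = 1; i < NF; i++)
                if ($i == "average:" && $(i + 1) + 0 > limit + 0) print
        }'
}
```

For example, flag_slow_averages <logfile> 50 prints only the summary lines whose average is above 50 microseconds.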
Testing for a specific pair of adapters
To identify which adapter pair produced an outlier, search the log file for the entry that contains one of the high round-trip averages found by the grep search. From that entry block, note the:
- Client hostname
- Client netname
- Server hostname
- Server netname
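Reading these four values out of each entry by hand becomes tedious when there are many adapter pairs. The sketch below walks the log once and prints the client and server details next to each average that exceeds a limit. It assumes every entry follows the layout of the example entry (Client Hostname and Server Hostname lines before the round-trip average line); list_slow_pairs and the 50-microsecond default are hypothetical choices for illustration.

```shell
# List the adapter pairs whose round-trip average exceeds a limit.
# Output columns: client host, client netname, server host, server netname, average.
# Assumption: entries are laid out like the example entry in this topic.
list_slow_pairs() {
    logfile=$1
    limit_us=${2:-50}
    awk -v limit="$limit_us" '
        # Return the word that follows the given label on the current line.
        function after(label,    i) {
            for (i = 1; i < NF; i++)
                if ($i == label) return $(i + 1)
            return ""
        }
        /Client Hostname:/ { chost = after("Hostname:"); cnet = after("Netname:") }
        /Server Hostname:/ { shost = after("Hostname:"); snet = after("Netname:") }
        /round-trip average:/ {
            avg = after("average:")
            if (avg + 0 > limit + 0)
                printf "%s %s %s %s %s\n", chost, cnet, shost, snet, avg
        }
    ' "$logfile"
}
```

Each output line then carries the four values needed to retest that pair.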
Once an outlier has been identified, the RDMA ping test should be run on the identified adapter pair. This is done by rerunning db2cluster with more options: db2cluster -verify -req -rdma_ping -host <client_hostname> -netname <client_netname> -host <server_hostname> -netname <server_netname>. This will produce another log file and trace file with a new timestamp for this specific test. Perform this test a few times for each outlier.
Example db2cluster command using the above entry:
db2cluster -verify -req -rdma_ping -host coralpib259 -netname coralpib259-ib0 -host coralpib257 -netname coralpib257-ib0
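Because each outlier should be retested a few times, the command can be wrapped in a small loop. This is a sketch only: retest_pair is a hypothetical helper, the default of three runs is an arbitrary reading of "a few times", and the DB2CLUSTER variable is an assumption added so the loop can be dry-run (for example with DB2CLUSTER=echo) before running the real command.

```shell
# Rerun the RDMA ping test several times for one adapter pair.
# Arguments: client host, client netname, server host, server netname, [runs]
# Assumption: DB2CLUSTER is a hypothetical override, defaulting to db2cluster.
retest_pair() {
    chost=$1
    cnet=$2
    shost=$3
    snet=$4
    runs=${5:-3}    # "a few times"; 3 is an arbitrary default
    i=1
    while [ "$i" -le "$runs" ]; do
        "${DB2CLUSTER:-db2cluster}" -verify -req -rdma_ping \
            -host "$chost" -netname "$cnet" \
            -host "$shost" -netname "$snet"
        i=$((i + 1))
    done
}
```

For the example entry, this would be called as retest_pair coralpib259 coralpib259-ib0 coralpib257 coralpib257-ib0. Each run writes its own timestamped log and trace file.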
Review the entries in the newly generated log files. Consistently high round-trip times for a pair across successive tests may be a sign of poor cluster performance. Contact your network administrator for further troubleshooting.