IBM Support

HADR slow ! Log disk ? or communication ?

Technical Blog Post


Abstract

HADR slow ! Log disk ? or communication ?

Body

Once you find DB2 HADR ( High Availability Disaster Recovery ) operation is somehow slow, what's the next ?
You may want to narrow down if this delay is from writing to transaction log disk subsystem or something on network communication.

Here is a sample scenario.
- Intermittently, the number of 'commit active' status applications spikes.
- Application team reports this has very serious impact even though that status stays very short 1 second period.
  ( Application processes increasing with High CPU consumption due to 'fork' operation....etc. )

To investigate this on HADR side, I suggest to collect data using the following two functions in overall database point of view.

MON_GET_TRANSACTION_LOG
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0059253.html

MON_GET_WORKLOAD
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0053940.html


And here is the sample result output summary to study.
( Collected the output using the two functions in every 5 seconds.
 Then calculated the delta value for each 5 seconds. )

MON_GET_WORKLOAD shows TOTAL_COMMIT_TIME spiked and this is from LOG_DISK_WAIT_TIME.

TIMESTAMP / LOCK_WAIT_TIME /LOG_DISK_WAIT_TIME/TOTAL_COMMIT_TIME
17:39:16,  0,   2,   3,
17:39:21,  0,   2,   2,
17:39:26,  0,   2,   2,
17:39:31,  0,   4,   4,
17:39:37,  0,   2,   2,
17:39:42,  0,   9,   9,
17:39:47,  0,   3,   3,
17:39:52, 79,  99,  99,  <====
17:39:58,  0,   3,   3,
17:40:03,  0,   2,   2,
17:40:08,  1,   4,   4,
17:40:13,  0,   4,   4,
17:40:19,  0,   4,   4,


LOG_DISK_WAIT_TIME sometimes overlaps with LOG write time and sometimes with HADR communication time.
So it would be helpful to check the LOG_WRITE_TIME and LOG_HADR_WAIT_TIME from MON_GET_TRANSACTION_LOG function.
In the same way, I get the delta values and did some calculation.

-Log write Time per IOs = LOG_WRITE_TIME / LOG_WRITES
-AVG HADR Wait time   = LOG_HADR_WAIT_TIME / LOG_HADR_WAITS_TOTAL


TIMESTAMP /Log write Time per IOs /AVG HADR Wait time
17:39:16, 0.310, 0.041
17:39:21, 0.298, 0.042
17:39:26, 0.308, 0.040
17:39:32, 0.302, 0.072
17:39:37, 0.294, 0.042
17:39:42, 0.342, 0.805<===
17:39:47, 0.326, 0.058
17:39:52, 0.334, 0.485<===
17:39:57, 0.344, 0.064
17:40:02, 0.299, 0.046

 

As you can see above, HADR wait time spiked around the time whereas log write time did not increase much.
So the next is checking network side.
( NOTE : I assume there is not transaction blocking in HADR primary side due to any HADR receive buffer full on standby.)

 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11139986