HADR slow ! Log disk ? or communication ?

Technical Blog Post

Abstract

Body

Once you find DB2 HADR ( High Availability Disaster Recovery ) operation is somehow slow, what's the next ?
You may want to narrow down if this delay is from writing to transaction log disk subsystem or something on network communication.

Here is a sample scenario.
- Intermittently, the number of 'commit active' status applications spikes.
- Application team reports this has very serious impact even though that status stays very short 1 second period.
( Application processes increasing with High CPU consumption due to 'fork' operation....etc. )

To investigate this on HADR side, I suggest to collect data using the following two functions in overall database point of view.

MON_GET_TRANSACTION_LOG
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0059253.html

MON_GET_WORKLOAD
https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0053940.html

And here is the sample result output summary to study.
( Collected the output using the two functions in every 5 seconds.
Then calculated the delta value for each 5 seconds. )

MON_GET_WORKLOAD shows TOTAL_COMMIT_TIME spiked and this is from LOG_DISK_WAIT_TIME.

TIMESTAMP / LOCK_WAIT_TIME /LOG_DISK_WAIT_TIME/TOTAL_COMMIT_TIME
17:39:16, 0,   2,   3,
17:39:21, 0,   2,   2,
17:39:26, 0,   2,   2,
17:39:31, 0,   4,   4,
17:39:37, 0,   2,   2,
17:39:42, 0,   9,   9,
17:39:47, 0,   3,   3,
17:39:52, 79, 99, 99, <====
17:39:58, 0,   3,   3,
17:40:03, 0,   2,   2,
17:40:08, 1,   4,   4,
17:40:13, 0,   4,   4,
17:40:19, 0,   4,   4,

LOG_DISK_WAIT_TIME sometimes overlaps with LOG write time and sometimes with HADR communication time.
So it would be helpful to check the LOG_WRITE_TIME and LOG_HADR_WAIT_TIME from MON_GET_TRANSACTION_LOG function.
In the same way, I get the delta values and did some calculation.

-Log write Time per IOs = LOG_WRITE_TIME / LOG_WRITES
-AVG HADR Wait time = LOG_HADR_WAIT_TIME / LOG_HADR_WAITS_TOTAL

TIMESTAMP /Log write Time per IOs /AVG HADR Wait time
17:39:16, 0.310, 0.041
17:39:21, 0.298, 0.042
17:39:26, 0.308, 0.040
17:39:32, 0.302, 0.072
17:39:37, 0.294, 0.042
17:39:42, 0.342, 0.805<===
17:39:47, 0.326, 0.058
17:39:52, 0.334, 0.485<===
17:39:57, 0.344, 0.064
17:40:02, 0.299, 0.046

As you can see above, HADR wait time spiked around the time whereas log write time did not increase much.
So the next is checking network side.
( NOTE : I assume there is not transaction blocking in HADR primary side due to any HADR receive buffer full on standby.)

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11139986

Tips

HADR slow ! Log disk ? or communication ?

Technical Blog Post

Abstract

Body

UID

Share your feedback

Need support?