Technical Blog Post
Abstract
HADR: Standby cannot catch up primay after upgrading from V9.7 to V10.5
Body
It is a HADR system set up with near_sync mode. This server has been running well for several years with version V9.7 FixPack7 even though the user already know the storage on standby side could be a bottleneck.
At the beginning of this year 2018, the user upgraded Db2 to V10.5, the problem happened since then. The log gap kept on increasing, and standby can never catch up with primary.
Collected stacks from standby, found hadrs EDU was busy on writing log to disk:
0x0900000000037B18 pwrite64 + 0x38
0x090000002817CF4C sqloseekwrite64 + 0x230
0x090000002817CAF4 sqloWriteBlocks + 0x128
0x090000002817C7F4 sqlpgWriteNLogPagesInternal__FR12SQLO_FHANDLECPvPvClCUlCPUlCUsP17sqlexCipherTicketCbPb + 0xD8
0x090000002817C5EC sqlpgWriteNLogPages__FR12SQLO_FHANDLECPvPvCUlClT4CPUlCUsP17sqlexCipherTicketCbPb + 0x18C
0x090000002816B474 sqlpgWriteToDisk__FP9SQLP_DBCBP9SQLP_LFPBUlbT4T3 + 0xCB0
0x090000002817BE48 sqlpgWriteToDisk__FP9SQLP_DBCBP9SQLP_LFPBUlbT4T3 + 0x244
0x090000002817BB6C sqlpgPingPong__FP9SQLP_DBCBP9SQLP_LFPBUlbT3 + 0x3FC
0x09000000260AF388 hdrWriteLog__14HDR_BLOCK_LISTFb + 0x344
0x09000000260B6E94 hdrEduS__6hdrEduFPUl + 0x3BC8
0x09000000260882E4 hdrEduS__6hdrEduFPUl + 0x1A68
0x0900000029C39B04 RunEDU__6hdrEduFv + 0x2254
0x09000000287D4080 EDUDriver__9sqzEDUObjFv + 0x3F0
0x09000000287D4080 EDUDriver__9sqzEDUObjFv + 0x3F0
0x0900000027BAAD68 sqloEDUEntry + 0x390
db2trc also showed hadrs busy on writing logs:
Pid Lvl FuncName TTime(ms) HTime(ms) LTime(ms) AvgTime(ms) NCalls ERecnumHTime
---------- --- --------------------------------- --------------- --------------- --------------- --------------- --------- -------------
4458310(Tid = 200979, Node = 0)
0 hdrAddDataBlock 29.409 21.646 7.762 14.704 2 84357
1 hdrWriteLog 29.037 21.537 7.500 14.518 2 84365
2 sqlpgPingPong 29.016 12.710 0.643 2.418 12 99795
4 sqlpgWriteNLogPages 28.982 9.643 0.259 1.159 25 99923
3 sqlpgWriteToDisk 28.979 12.703 0.640 2.415 12 99809
5 sqlpgWriteNLogPagesInternal 28.931 9.640 0.258 1.157 25 99934
6 sqloWriteBlocks 28.861 9.637 0.255 1.154 25 99939
7 sqloseekwrite64 28.764 9.634 0.253 1.151 25 99952
0 sqlpgPingPong 9.089 7.296 1.793 4.544 2 12736
1 sqlpgWriteToDisk 9.082 7.292 1.791 4.541 2 12748
2 sqlpgWriteNLogPages 9.032 3.903 0.438 2.258 4 12848
3 sqlpgWriteNLogPagesInternal 9.024 3.899 0.437 2.256 4 12853
4 sqloWriteBlocks 9.014 3.897 0.435 2.254 4 12860
5 sqloseekwrite64 9.003 3.894 0.432 2.251 4 12870
0 hdrEduAcceptEvent 1.016 0.376 0.003 0.169 6 51930
2 hdrRecvMsgS 0.941 0.358 0.001 0.157 6 51990
Compared the behaviors between v10.5 and v9.7 on log writing, found that one difference is that the default of DB2_LOGGER_NON_BUFFERED_IO is on for 10.5 but off on 9.7. So, suggested the user to try with DB2_LOGGER_NON_BUFFERED_IO=OFF, and confirmed this worked!!
As a conclusion, we expected Direct I/O to be faster than CIO when writing logs in most of case, however, in realistic, it is not always, like in this case. It should also be noted that this is not a common case, saying that setting DB2_LOGGER_NON_BUFFERED_IO=OFF resolved the issue in this case does NOT mean it can also resolve the similar issue in other cases. Additionally, when the server has to failover and the DB2_LOGGER_NON_BUFFERED_IO setting that might be good for standby but might not be good for primary.
UID
ibm11140256