HADR: Standby cannot catch up primay after upgrading from V9.7 to V10.5

Technical Blog Post

Abstract

Body

It is a HADR system set up with near_sync mode. This server has been running well for several years with version V9.7 FixPack7 even though the user already know the storage on standby side could be a bottleneck.

At the beginning of this year 2018, the user upgraded Db2 to V10.5, the problem happened since then. The log gap kept on increasing, and standby can never catch up with primary.

Collected stacks from standby, found hadrs EDU was busy on writing log to disk:

0x0900000000037B18 pwrite64 + 0x38
0x090000002817CF4C sqloseekwrite64 + 0x230
0x090000002817CAF4 sqloWriteBlocks + 0x128
0x090000002817C7F4 sqlpgWriteNLogPagesInternal__FR12SQLO_FHANDLECPvPvClCUlCPUlCUsP17sqlexCipherTicketCbPb + 0xD8
0x090000002817C5EC sqlpgWriteNLogPages__FR12SQLO_FHANDLECPvPvCUlClT4CPUlCUsP17sqlexCipherTicketCbPb + 0x18C
0x090000002816B474 sqlpgWriteToDisk__FP9SQLP_DBCBP9SQLP_LFPBUlbT4T3 + 0xCB0
0x090000002817BE48 sqlpgWriteToDisk__FP9SQLP_DBCBP9SQLP_LFPBUlbT4T3 + 0x244
0x090000002817BB6C sqlpgPingPong__FP9SQLP_DBCBP9SQLP_LFPBUlbT3 + 0x3FC
0x09000000260AF388 hdrWriteLog__14HDR_BLOCK_LISTFb + 0x344
0x09000000260B6E94 hdrEduS__6hdrEduFPUl + 0x3BC8
0x09000000260882E4 hdrEduS__6hdrEduFPUl + 0x1A68
0x0900000029C39B04 RunEDU__6hdrEduFv + 0x2254
0x09000000287D4080 EDUDriver__9sqzEDUObjFv + 0x3F0
0x09000000287D4080 EDUDriver__9sqzEDUObjFv + 0x3F0
0x0900000027BAAD68 sqloEDUEntry + 0x390

db2trc also showed hadrs busy on writing logs:

Pid        Lvl FuncName                          TTime(ms)       HTime(ms)       LTime(ms)       AvgTime(ms)      NCalls    ERecnumHTime
---------- --- --------------------------------- --------------- --------------- --------------- --------------- --------- -------------
4458310(Tid = 200979, Node = 0)
             0 hdrAddDataBlock                           29.409          21.646           7.762          14.704         2        84357
             1 hdrWriteLog                               29.037          21.537           7.500          14.518         2        84365
             2 sqlpgPingPong                             29.016          12.710           0.643           2.418        12        99795
             4 sqlpgWriteNLogPages                       28.982           9.643           0.259           1.159        25        99923
             3 sqlpgWriteToDisk                          28.979          12.703           0.640           2.415        12        99809
             5 sqlpgWriteNLogPagesInternal               28.931           9.640           0.258           1.157        25        99934
             6 sqloWriteBlocks                           28.861           9.637           0.255           1.154        25        99939
             7 sqloseekwrite64                           28.764           9.634           0.253           1.151        25        99952
             0 sqlpgPingPong                              9.089           7.296           1.793           4.544         2        12736
             1 sqlpgWriteToDisk                           9.082           7.292           1.791           4.541         2        12748
             2 sqlpgWriteNLogPages                        9.032           3.903           0.438           2.258         4        12848
             3 sqlpgWriteNLogPagesInternal                9.024           3.899           0.437           2.256         4        12853
             4 sqloWriteBlocks                            9.014           3.897           0.435           2.254         4        12860
             5 sqloseekwrite64                            9.003           3.894           0.432           2.251         4        12870
             0 hdrEduAcceptEvent                          1.016           0.376           0.003           0.169         6        51930
             2 hdrRecvMsgS                                0.941           0.358           0.001           0.157         6        51990

Compared the behaviors between v10.5 and v9.7 on log writing, found that one difference is that the default of DB2_LOGGER_NON_BUFFERED_IO is on for 10.5 but off on 9.7. So, suggested the user to try with DB2_LOGGER_NON_BUFFERED_IO=OFF, and confirmed this worked!!

As a conclusion, we expected Direct I/O to be faster than CIO when writing logs in most of case, however, in realistic, it is not always, like in this case. It should also be noted that this is not a common case, saying that setting DB2_LOGGER_NON_BUFFERED_IO=OFF resolved the issue in this case does NOT mean it can also resolve the similar issue in other cases. Additionally, when the server has to failover and the DB2_LOGGER_NON_BUFFERED_IO setting that might be good for standby but might not be good for primary.

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11140256

Tips

HADR: Standby cannot catch up primay after upgrading from V9.7 to V10.5

Technical Blog Post

Abstract

Body

UID

Share your feedback

Need support?