Troubleshooting
Problem
If the table function pd_get_diag_hist runs slowly, the application that calls it might not be disconnected when it is forced. Log replay then cannot proceed, and HADR standby replay hangs.
Symptom
The customer ran db2pd -db <db_name> -hadr and found that STANDBY_REPLAY_LOG_FILE,PAGE,POS did not change while STANDBY_RECV_REPLAY_GAP kept growing, which indicates a standby replay hang.
HADR_ROLE = STANDBY
..
PRIMARY_LOG_FILE,PAGE,POS = S0002731.LOG, 151, 182424963042
STANDBY_LOG_FILE,PAGE,POS = S0002731.LOG, 151, 182424963042
HADR_LOG_GAP(bytes) = 0
STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0002707.LOG, 223, 180822508874
STANDBY_RECV_REPLAY_GAP(bytes) = 1602454168
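The symptom check above can be sketched as a comparison of two db2pd -hadr samples taken some time apart. This is only a minimal illustration, assuming the output uses the "NAME = value" layout shown above; the function names are hypothetical and not part of any Db2 tooling.

```python
def parse_hadr_fields(text):
    """Parse 'NAME = value' lines from db2pd -hadr output into a dict."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip headers, blank lines, and elided lines such as '..'
        name, _, value = line.partition("=")
        fields[name.strip()] = value.strip()
    return fields


def replay_position(fields):
    """Return the POS component of STANDBY_REPLAY_LOG_FILE,PAGE,POS as an int."""
    raw = fields["STANDBY_REPLAY_LOG_FILE,PAGE,POS"]
    return int(raw.split(",")[2])


def recv_replay_gap(fields):
    """Return STANDBY_RECV_REPLAY_GAP(bytes) as an int."""
    return int(fields["STANDBY_RECV_REPLAY_GAP(bytes)"])


def looks_like_replay_hang(earlier_sample, later_sample):
    """True if the replay position did not move while the gap grew."""
    earlier = parse_hadr_fields(earlier_sample)
    later = parse_hadr_fields(later_sample)
    return (replay_position(earlier) == replay_position(later)
            and recv_replay_gap(later) > recv_replay_gap(earlier))
```

A single pair of samples is only a hint: a standby that is idle or briefly busy can also show an unchanged replay position, so the check is more meaningful when repeated over several intervals.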
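Under normal conditions, the diagnostic log shows that the HdrForceAppsInReplayOnlyWindow -> HdrEndReplayOnlyWindow sequence, which is performed by db2redom, completes quickly. HdrForceAppsInReplayOnlyWindow disconnects all connections before replay begins. In the customer's diagnostic log, however, HdrForceAppsInReplayOnlyWindow never completed, which means that an application could not be forced off.
"db2pd -stack all" was collected on the standby, and one agent showed the following stack: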
..
0x00000000004223B7 __intel_new_memset + 0x0a77
0x00007F3402810391 pdDiagGetNextLogRecord + 0x03e1
0x00007F340280FDD8 pdDiagGetNextRecordFromBuffer + 0x0038
0x00007F340280FA69 pdDiagGetNextRecord + 0x0239
0x00007F340276AFFB _ZN17PADiagLogCollEngn11getNextRowsEjPP13PA_DATA_VALUEPjS3_ + 0x053b
0x00007F3403E1144C _Z25sqlrwGetPDDiagHist_v10fp3P8sqlrr_cbP22sqlrwGetPDDiagHistArgsPPvPl + 0x0b2c
0x00007F3401CEF2C6 _Z30sqlrwGetWLMTableFunctionResultP8sqlrr_cbP20sqlrw_rpc_tf_requestPPvPlb + 0x0446
0x00007F3401CEA7B1 _Z36sqlrwGetWLMTableFunctionMergedResultjPPv + 0x01f1
0x00007F3400CF80D1 _Z29sqlerTrustedRtnCallbackRouterjPPv + 0x00b1
0x00007F33D1491A9B pd_get_diag_hist_v10fp3 + 0x1c8b
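To check whether a collected stack matches this pattern, the text of a stack file can be scanned for the frames seen above. The following is a rough sketch rather than an official diagnostic procedure: the marker strings are taken from the stack in this document, the helper name is hypothetical, and how the stack text is obtained from the diagnostic path is left to the reader.

```python
# Substrings taken from the frames shown in the stack above.
DIAG_HIST_MARKERS = ("pd_get_diag_hist", "pdDiagGetNextRecord")


def frames_in_diag_hist(stack_text, markers=DIAG_HIST_MARKERS):
    """Return the stack lines that contain any of the marker substrings."""
    return [
        line.strip()
        for line in stack_text.splitlines()
        if any(marker in line for marker in markers)
    ]
```

A non-empty result suggests that the agent was still reading diagnostic log records for pd_get_diag_hist when the stack was taken.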
Document Information
Modified date:
30 April 2025
UID
swg21981283