DB2 Case sharing: I/O issue caused deactivation of HADR Standby slow

Technical Blog Post


Abstract

DB2 Case sharing: I/O issue caused deactivation of HADR Standby slow

Body

One of my customers complained that it usually takes more than 6 minutes to deactivate a standby database, and he asked me to help him find out the reason.

As a start, I asked him to collect "db2pd -stack all" and "db2pd -eve" output at least 3 times while deactivating the database. The data was then collected.
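For example, a simple collection loop along the following lines could be run from a separate session while the deactivate is in progress (a sketch; the sleep interval and output file names are my own assumptions):

for i in 1 2 3; do
    db2pd -stack all              # dumps stack files for all EDUs into the diagnostic path
    db2pd -eve > db2pd.eve.$i     # db2pd output, one file per iteration
    sleep 30                      # assumed interval between collections
done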

From the stacks, the agent handling the deactivate was busy flushing dirty pages to disk:

0x0900000000037C18 pwrite64 + 0x38
  0x09000000042861BC sqloseekwrite64 + 0x144
  0x0900000004285FDC sqloWriteBlocks + 0xE0
  0x0900000004285E64 sqlbWriteBlocks__FP16SqlbOpenFileInfoPvlUlUiPUlP12SQLB_GLOBALS + 0x48
  0x09000000042859A8 sqlbWritePageToDisk__FP11SQLO_IO_REQP16SqlbOpenFileInfoP12SQLB_GLOBALS + 0x22C
  0x09000000042E05DC sqlbWritePage + 0xF8
  0x09000000040F90D4 sqlbFlushForDLSubset__FP9SQLP_LSN8UiUsiT4P12SQLB_GLOBALS + 0x7F8
  0x0900000004446478 sqlbFlush__FP9SQLP_LSN8UiUsP12SQLB_GLOBALS + 0x24
  0x0900000003C166A0 sqlbFlush__FP9SQLP_LSN8UiUsP12SQLB_GLOBALS@glue52A + 0x7C
  0x0900000004A300E4 TermDbConnect__16sqeLocalDatabaseFP8sqeAgentP5sqlcai + 0xE4
  0x0900000004131634 AppStopUsing__14sqeApplicationFP8sqeAgentUcP5sqlca + 0xDA4
  0x090000000445F02C AppLocalStart__14sqeApplicationFP14db2UCinterface + 0x65C
  0x0900000004256488 sqlelostWrp__FP14db2UCinterface + 0x40
  0x090000000425653C sqleUCengnInit__FP14db2UCinterfaceUs + 0x64
  0x090000000446CC40 sqleUCagentConnect + 0x2B4
  0x090000000446B5F4 sqljsConnectAttach__FP13sqljsDrdaAsCbP14db2UCinterface + 0x104
  0x090000000446B344 sqljs_ddm_accsec__FP14db2UCinterfaceP13sqljDDMObject + 0x340
  0x090000000446AF6C sqljsParseConnect__FP13sqljsDrdaAsCbP13sqljDDMObjectP14db2UCinterface + 0x5C
  0x090000000446AEA0 .sqljsParse.fdpr.clone.378__FP13sqljsDrdaAsCbP14db2UCinterfaceP8sqeAgentb + 0x28
  0x0900000004368A04 @63@sqljsSqlam__FP14db2UCinterfaceP8sqeAgentb + 0x2D0
  0x090000000448B818 @63@sqljsDriveRequests__FP8sqeAgentP14db2UCconHandle + 0xB4
  0x090000000448B4CC @63@sqljsDrdaAsInnerDriver__FP18SQLCC_INITSTRUCT_Tb + 0x2D4
  0x090000000448AF54 sqljsDrdaAsDriver__FP18SQLCC_INITSTRUCT_T + 0x100
  0x090000000417E110 RunEDU__8sqeAgentFv + 0x2F0
  0x0900000004185E3C EDUDriver__9sqzEDUObjFv + 0xF8
  0x090000000418B580 sqloEDUEntry + 0x264


 Summary:
 Found in 5 stacks of a total of 280 stacks ( 1.79% ) in 56 files
 Found in:
        ./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.39.52.661077(Signal #30)
        ./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.40.36.216268(Signal #30)
        ./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.41.17.765316(Signal #30)
        ./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.42.05.778419(Signal #30)
        ./collect_ys/17170518.13271.000.stack.txt -- db2agent(PAYEE) -- 2016-07-13-21.43.09.038292(Signal #30)
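As a side note, a summary like the one above can be approximated by searching the collected stack files for the flush routine; a rough sketch, assuming the files sit under ./collect_ys as shown:

grep -l "sqlbFlushForDLSubset" ./collect_ys/*.stack.txt    # which stack files contain the flush frame
grep -c "sqlbFlushForDLSubset" ./collect_ys/*.stack.txt    # how many matching lines per file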

 


Normally, we could decrease CHNGPGS_THRESH so that the page cleaners start writing earlier and fewer dirty pages accumulate before the deactivate; it is currently set to 80:

CHNGPGS_THRESH                 80                   80  

However, I saw that the DB2_USE_ALTERNATE_PAGE_CLEANING registry variable is set. When the alternate method of page cleaning is used, the chngpgs_thresh parameter has no effect, and the database manager automatically determines how many dirty pages to maintain in the buffer pool:

DB2_USE_ALTERNATE_PAGE_CLEANING=ON
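Both settings can be confirmed with the following commands (a sketch; the database name PAYEE is taken from the deactivate command used later in this case):

db2 get db cfg for payee | grep -i chngpgs            # shows CHNGPGS_THRESH
db2set -all | grep -i alternate_page_cleaning         # shows DB2_USE_ALTERNATE_PAGE_CLEANING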

 

The number of dirty pages is high:

Address            Id   Name               PageSz     PA-NumPgs  BA-NumPgs  BlkSize    NumTbsp              PgsLeft              CurrentSz            PostAlter            SuspndTSCt Automatic
0x0A0004000010DE20 1    IBMDEFAULTBP       32768      111594     0          0          4

Bufferpool: 1         
   Dirty pages %       : 101283 / 111594 (90.76% dirty)
   Bufferpool minbuflsn: 000000D0A629906A

101283 dirty pages * 32768 bytes per page ~= 3 GB  ==> Flushing about 3 GB of data to disk took roughly 4 minutes, which points to an I/O bottleneck.
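The implied write throughput can be estimated with a quick calculation (a sketch; 240 seconds is an approximation of the roughly 4 minutes observed):

echo "101283 * 32768" | bc                              # 3318841344 bytes, roughly 3 GB
echo "scale=1; 3318841344 / 240 / 1024 / 1024" | bc     # about 13 MB/s, which is slow for a bufferpool flush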

 

So I asked the customer to collect additional data the next time the problem happened:

db2pd -eve > db2pd.eve
nohup iostat -Dl 10 > iostat.out &
db2trc on -Madd sqlbFlushForDLSubset -Madd sqloseekwrite64 -t -f db2trc.dmp
db2 deactivate db payee
db2trc off
db2trc flw -t -wc db2trc.dmp db2trc.flw

db2trc perfrep -g -sort timeelapsed db2trc.dmp  db2trc.rpt
#stop the iostat command  
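To stop the background iostat started above, something like the following can be used (this assumes it is the only background job in the same shell session; otherwise kill the PID reported by ps):

kill %1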

 

The data was collected.  From the new data, I did see the I/O bottleneck.  sqloseekwrite64 directly invokes the pwrite64 system call. The db2agent EDU flushing data pages to disk shows AvgElapsed = 0.182239108 seconds per write, while the db2hadrs and db2loggr EDUs writing to the transaction logs show AvgElapsed values of only 0.001648201 and 0.001366460 seconds. Comparing 0.182239108 with 0.001366460 strongly suggests that I/O against the data containers is slow. Please check and/or tune the I/O.


Node : 0 , PID : 20513020 , TID : 11310  ==> db2hadrs

  nCalls   TotalElapsed     AvgElapsed     TotalSpent       AvgSpent FunctionName
   15603   25.716884552    0.001648201   25.716884552    0.001648201 sqloseekwrite64

Node : 0 , PID : 20513020 , TID : 4113  ==> db2agent

  nCalls   TotalElapsed     AvgElapsed     TotalSpent       AvgSpent FunctionName
       1  233.521891755  233.521891755    0.073594333    0.073594333 sqlbFlushForDLSubset
    1281  233.448297421    0.182239108  233.448297421    0.182239108 sqloseekwrite64

Node : 0 , PID : 20513020 , TID : 4885  ==> db2loggr

  nCalls   TotalElapsed     AvgElapsed     TotalSpent       AvgSpent FunctionName
       4    0.005465841    0.001366460    0.005465841    0.001366460 sqloseekwrite64
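For reference, the EDU names noted next to each TID above (db2hadrs, db2agent, db2loggr) can be cross-checked against the EDU list; a sketch using one of the TIDs from the report:

db2pd -edus | grep 11310        # shows the EDU name (here db2hadrs) for that thread ID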

 

The high number of dirty pages was most probably caused by the same I/O issue: with writes to the data containers this slow, the page cleaners could not keep up, so dirty pages accumulated in the buffer pool and all had to be flushed at deactivation time.

 

So, I asked the customer to check the I/O performance from the OS/storage side.
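As a starting point, the iostat output collected earlier can be reviewed for the disks backing the data containers; a rough sketch:

grep hdisk iostat.out | more    # look for high read/write avgserv values and queue fulls on the data-container disks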

 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm13286737