IBM Support

75 ways to demystify DB2: #33: Techtip: In pureScale environment, Online Backup could encounter contention on 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch'

Technical Blog Post


Body

An ONLINE BACKUP can run slowly in a pureScale environment.
The cause is contention on 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch' (the Online Backup range latch).
The problem occurs in pureScale only. Non-pureScale environments are not affected.

Symptom:

When the problem happens, multiple DB2 prefetchers (EDU name 'db2pfchr') can be observed competing for 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch'.

More specifically:

1. The 'db2bm' EDU is stuck in the following stack trace, waiting for data from the prefetchers:

<StackTrace>

msgrcv
sqloCSemP
sqlbpfParallelDirectIO
sqlbDirectReadBlock
sqlubReadDMS
sqlubBMProcessData
sqlubbuf
sqleSubCoordProcessRequest
RunEDU
EDUDriver
sqlzRunEDU
sqloEDUEntry

</StackTrace>


2. DB2 prefetchers are waiting on 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch':

<StackTrace>

thread_wait
getConflictComplex
getConflict
get
sqlbDMSDirectRead
sqlbDirectRead
sqlbServiceDirectReadRequest
sqlbPFPrefetcherEntryPoint
RunEDU
EDUDriver
sqlzRunEDU
sqloEDUEntry

</StackTrace>

<LatchInformation>
Waiting on latch type: (SQLO_LT_SQLB_POOL_CB__olbRangeLotch)
</LatchInformation>


3. If DB2 trace data is collected, the DB2 prefetchers can be observed spending a long time acquiring 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch':

132480         0.042155912   | | sqlbDMSDirectRead entry [eduid 130343 eduname db2pfchr]
132491         0.042158312   | | | SQLO_SLATCH_CAS64::getConflictComplex entry [eduid 130343 eduname db2pfchr]
132507         0.042161835   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
555761         0.227610406   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
555808         0.227620593   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
558199         0.228200568   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
558235         0.228208183   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
560238         0.228780675   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
560276         0.228788125   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
562570         0.229366589   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
562605         0.229373968   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
564912         0.229954460   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
564945         0.229961919   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
567138         0.230546101   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
567171         0.230552929   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
569433         0.231153625   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
569468         0.231161470   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
571678         0.231720898   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
571708         0.231727562   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
574127         0.232324687   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
574163         0.232331406   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
576692         0.232900673   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]
576699         0.232902406   | | | SQLO_SLATCH_CAS64::getConflictComplex exit

132491  entry DB2 UDB SQO Latch Tracing SQLO_SLATCH_CAS64::getConflictComplex fnc (1.3.130.15.0)
        pid 18088574 tid 130343 cpid 7209740 node 0 sec 0 nsec 42158312
        eduid 130343 eduname db2pfchr
        bytes 60

        Data1   (PD_TYPE_PTR,8) Pointer:
        0x070000168af6f668
        Data2   (PD_TYPE_LATCH_ID,4) Latch Identity: SQLB_POOL_CB::olbRangeLotch (778)
        Data3   (PD_TYPE_INTERNAL_15,8) Hex:
        0002 E000 0001 0000                        ........

        Data4   (PD_TYPE_LATCH_MODE,8) LatchMode: 0x10000 (SQLO_LATCH_MODE_EXCLUSIVE)
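
The wait markers in the trace flow above can be turned into numbers. The following is a minimal Python sketch, not an IBM tool. It assumes that the formatted flow lines of a single EDU have been saved as text in the layout shown above (sequence number, elapsed seconds, then the marker). The function name latch_wait_intervals is hypothetical. It pairs each PD_LATCH_TRACE_WAIT_STARTING marker with the next PD_LATCH_TRACE_WAIT_FINISHED marker and returns the time between them.

```python
import re
from decimal import Decimal

# Matches flow lines such as:
#   132507   0.042161835   | | | ... mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]
# Layout assumed from the excerpt above; other db2trc formats may differ.
MARKER_RE = re.compile(
    r"^\s*(\d+)\s+(\d+\.\d+)\s+.*"
    r"\[Marker:PD_LATCH_TRACE_WAIT_(STARTING|FINISHED)\s*\]"
)


def latch_wait_intervals(lines):
    """Return one elapsed time (in seconds) per STARTING/FINISHED pair.

    Assumes all lines belong to the same EDU; interleaved EDUs would
    have to be separated before calling this function.
    """
    waits = []
    started = None
    for line in lines:
        match = MARKER_RE.match(line)
        if match is None:
            continue  # entry/exit lines and anything else are ignored
        timestamp = Decimal(match.group(2))
        if match.group(3) == "STARTING":
            started = timestamp
        elif started is not None:
            waits.append(timestamp - started)
            started = None
    return waits


# First two wait pairs copied from the trace excerpt above.
sample = [
    "132507         0.042161835   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]",
    "555761         0.227610406   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]",
    "555808         0.227620593   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_STARTING ]",
    "558199         0.228200568   | | | SQLO_SLATCH_CAS64::getConflictComplex mbt [Marker:PD_LATCH_TRACE_WAIT_FINISHED ]",
]
waits = latch_wait_intervals(sample)
total = sum(waits)
print(len(waits), "waits,", total, "seconds in total")
```

For the two pairs used here, almost all of the time falls in the first wait (about 0.185 seconds), which matches the jump in elapsed time visible in the excerpt.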
 

Root Cause:

Each table space has its own 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch'.
Multiple prefetchers compete for the same 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch', so all of them are slowed down.


Data Collection:

To identify the problem, collect the following data while the ONLINE BACKUP is running:

1. db2pd -edu -utilities
    db2pd -db <dbname> -appl -apinfo
   
2. db2pd -stack all

3. DB2 trace (db2trc) data. 
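
Once the stack data is collected, the number of EDUs waiting on each latch can be counted. The following is a rough Python sketch under stated assumptions: it assumes that each stack dump is available as text and contains the "Waiting on latch type: (...)" line shown in the Symptom section. The function name count_latch_waiters is hypothetical, and how the dump files are named and located is left out on purpose.

```python
import re
from collections import Counter

# Pattern taken from the <LatchInformation> excerpt in the Symptom section.
WAIT_RE = re.compile(r"Waiting on latch type:\s*\((\w+)\)")


def count_latch_waiters(stack_texts):
    """Count, per latch name, how many wait reports appear in the dumps."""
    counts = Counter()
    for text in stack_texts:
        for latch_name in WAIT_RE.findall(text):
            counts[latch_name] += 1
    return counts


# Illustrative input: two dumps reporting the wait, one without any wait.
sample_dump = """<LatchInformation>
Waiting on latch type: (SQLO_LT_SQLB_POOL_CB__olbRangeLotch)
</LatchInformation>"""
waiters = count_latch_waiters([sample_dump, sample_dump, "no latch wait here"])
for latch_name, count in waiters.most_common():
    print(latch_name, count)
```

Several 'db2pfchr' EDUs reported against 'SQLO_LT_SQLB_POOL_CB__olbRangeLotch' in the same collection would be consistent with the symptom described above.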


Resolution:

There are two ways to resolve the problem:

1. Make the table spaces use serial IO (i.e. non-parallel IO), so that the ONLINE BACKUP does not trigger the DB2 prefetchers.

   To achieve this, temporarily set the prefetch size equal to the extent size while the ONLINE BACKUP is running.
   For example, if the prefetch size is 12 and the extent size is 2, proceed as follows:
  
   ALTER TABLESPACE <tbsp> prefetchsize 2
   <Perform ONLINE BACKUP>
   ALTER TABLESPACE <tbsp> prefetchsize 12
  

2. Make the prefetchers use "Vectored IO", which avoids contention on the latch.

   To use "Vectored IO", all of the following conditions must be met:
   1) "BACKUP buffer size" > ((page size) * (extent size) / 4)
      For example, if the page size is 16K and the extent size is 2, the buffer size of the ONLINE BACKUP
      (specified in 4K pages by the "BUFFER" option of the BACKUP DATABASE command) must be greater than (16*2)/4=8.
   2) The registry variable DB2_MAX_READV_IOSIZE_FOR_BACKUP is NOT set.
   3) The registry variable DB2_USE_PAGE_CONTAINER_TAG is NOT set.
   4) The number of containers in the table space is greater than 1.
   5) NOREADVBCKUP is NOT enabled through the registry variable BPVARS.

   APAR IC96465 was created to change condition 4) from "number of containers > 1" to "table space parallelism > 1".
   After the APAR fix is applied, table spaces that have only one container can also take advantage of "Vectored IO", as long as all other conditions are met.
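
The conditions for "Vectored IO" can be checked mechanically before a backup is started. The following is a minimal Python sketch of that checklist, not a DB2 API. All class, field and function names are hypothetical, the values have to be gathered by hand, and the sketch assumes that the buffer size is given in 4K pages and the page size in KB, as in the example for condition 1). How the table space parallelism is determined is not covered here; it is simply an input.

```python
from dataclasses import dataclass


@dataclass
class BackupSetup:
    """Hand-collected inputs; every field name here is hypothetical."""
    buffer_size_4k_pages: int        # BUFFER option of BACKUP DATABASE
    page_size_kb: int                # table space page size, in KB
    extent_size_pages: int           # table space extent size, in pages
    num_containers: int
    tablespace_parallelism: int      # only consulted with the APAR fix
    max_readv_iosize_for_backup_set: bool = False
    use_page_container_tag_set: bool = False
    noreadvbckup_enabled: bool = False
    apar_ic96465_applied: bool = False


def unmet_conditions(s):
    """Return the conditions from the list above that are not satisfied."""
    unmet = []
    # Condition 1): buffer size > (page size * extent size) / 4
    if s.buffer_size_4k_pages <= s.page_size_kb * s.extent_size_pages / 4:
        unmet.append("1) BACKUP buffer size is too small")
    if s.max_readv_iosize_for_backup_set:
        unmet.append("2) DB2_MAX_READV_IOSIZE_FOR_BACKUP is set")
    if s.use_page_container_tag_set:
        unmet.append("3) DB2_USE_PAGE_CONTAINER_TAG is set")
    # Condition 4) depends on whether the fix for APAR IC96465 is applied.
    if s.apar_ic96465_applied:
        if s.tablespace_parallelism <= 1:
            unmet.append("4) table space parallelism is not greater than 1")
    elif s.num_containers <= 1:
        unmet.append("4) number of containers is not greater than 1")
    if s.noreadvbckup_enabled:
        unmet.append("5) NOREADVBCKUP is enabled")
    return unmet


# Illustrative values only: 16K pages, extent size 2, a single container.
setup = BackupSetup(buffer_size_4k_pages=1024, page_size_kb=16,
                    extent_size_pages=2, num_containers=1,
                    tablespace_parallelism=4)
print(unmet_conditions(setup))   # condition 4) fails without the APAR fix
setup.apar_ic96465_applied = True
print(unmet_conditions(setup))   # nothing left once the fix is applied
```

An empty result means that all listed conditions hold for the values entered. It does not prove that DB2 actually chose "Vectored IO" for the backup.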


UID

ibm13286905