IBM Support

A SQL running very slowly on pureScale database with isolation level of RS

Technical Blog Post


Abstract

A SQL running very slowly on pureScale database with isolation level of RS

Body

We had a customer, he had a WebSphere application connecting to a database running on pureScale server. The application would execute a SQL statement as below:
SELECT DISTINCT(MYCOLUMN1) FROM MYTABLE FOR FETCH ONLY


Normally, the SQL take only < 10 ms to complete. However, when the SQL is executed from the WebSphere application, it ALWAYS(NOT randomly) take around 2 seconds.


To narrow down the issue, he already tried to:
1. Connect from the DB2 CLP on the same WebSphere machine to the DB, SQL only took < 10 ms to complete
2. Run a standalone java program which connect to the same pureScale DB on the same WebSphere machine, SQL also took < 10 ms to complete
3. Using the same WebSphere application connect to another standalone ESE database and ran the same SQL, it took < 10 ms to complete.

Asked him to collect explain and db2trc for both bad case and good case.  

Note, to collect explain(a.k.a access plan) from section, the following steps can be followed:

db2 "call SYSPROC.SYSINSTALLOBJECTS('EXPLAIN','D',NULL,NULL)"

db2 "call SYSPROC.SYSINSTALLOBJECTS('EXPLAIN','C',NULL,NULL)"

# run the statement from WebSphere

# back to db2 server

db2 "SELECT executable_id, Varchar(stmt_text, 200) as stmt_text FROM TABLE(MON_GET_PKG_CACHE_STMT (NULL, NULL,NULL, -1)) AS T"

# from above output find out the executable_id for the statement 'SELECT DISTINCT(MYCOLUMN1) FROM MYTABLE FOR FETCH ONLY', assume it is x'00000001000000000000000000000A1000000000000220130221162052296615'

# then:

db2 "CALL EXPLAIN_FROM_SECTION(x'00000001000000000000000000000A1000000000000220130221162052296615','M', NULL, 0, '<schema of instance user>', ?, ?, ?, ?, ?)"

db2exfmt -d <db name> -g TIC -w -1 -n % -s % -# 0 -o  explain.out

For usage of EXPLAIN_FROM_SECTION, refer to:

https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0056251.html

 

See the db2trc analysis report,  pay attention to the lines in red color,  we found CF communication due to lock requests took much of the time:

 

GOOD CASE:

 

Pid        Lvl FuncName                        TTime(ms)       HTime(ms)       LTime(ms)       AvgTime(ms)      NCalls    ERecnumHTime
---------- --- ------------------------------- --------------- --------------- --------------- ---------------  --------- -------------
22085672(Tid = 276671, Node = 0)
...
             0 sqljsParse                               90.027          89.696           0.330          45.013         2          272
             1 sqljsParseRdbAccessed                    89.996          89.269           0.057          22.499         4          875
             2 sqljs_ddm_opnqry                         89.268          89.268          89.268          89.268         1          876
             3 sqlrr_open                               89.258          89.258          89.258          89.258         1          885
             4 sqlrr_process_fetch_request              88.949          88.949          88.949          88.949         1         1033
             5 sqlriSectInvoke                          88.938          88.938          88.938          88.938         1         1034
             6 sqlritaSimplePerm                        88.932          88.932          88.932          88.932         1         1037
             7 sqldRowFetch                             88.436          88.436          88.436          88.436         1         1307
             8 sqldIndexFetch                           88.433          88.433          88.433          88.433         1         1308
             9 sqlirdk                                  88.431          88.431          88.431          88.431         1         1309
            11 sqlifnxt                                 88.347          88.347          88.347          88.347         1         1372
            12 sqliScanLeaf2                            88.113          11.406           3.471           8.010        11       165897
            13 sqlriDistinct2                           70.476           0.417           0.001           0.002     34564        89377

 

BAD CASE:

 

Pid        Lvl FuncName                        TTime(ms)       HTime(ms)       LTime(ms)       AvgTime(ms)      NCalls    ERecnumHTime
---------- --- ------------------------------- --------------- --------------- --------------- ---------------  --------- -------------
22085672(Tid = 267899, Node = 0)
             0 sqljsParse                             5178.360        5177.289           1.071        2589.180         2           32
             1 sqljsParseRdbAccessed                  5178.327        5177.216           0.055        2589.136         3          116
             2 sqljs_ddm_opnqry                       5177.214        5177.214        5177.214        5177.214         1          117
             3 sqlrr_open                             5177.203        5177.203        5177.203        5177.203         1          126
             4 sqlrr_process_fetch_request            5176.753        5176.753        5176.753        5176.753         1          324
             5 sqlriSectInvoke                        5176.704        5176.704        5176.704        5176.704         1          325
             6 sqlritaSimplePerm                      5176.695        5176.695        5176.695        5176.695         1          328
             7 sqldRowFetch                           5176.639        5176.639        5176.639        5176.639         1          374
             8 sqldIndexFetch                         5176.636        5176.636        5176.636        5176.636         1          375
             9 sqlirdk                                5176.635        5176.635        5176.635        5176.635         1          376
            11 sqlifnxt                               5176.458        5176.458        5176.458        5176.458         1          530
            12 sqliScanLeaf2                          5176.042         650.660         301.629         650.463        10      2799047
            13 sqlplrq                                4132.845           7.371           0.046           0.135     34368      3460771
            14 sqlplMakeNewRequestSD                  3984.122           7.367           0.044           0.131     34368      3460772
            15 sqlpLLMInformGLMIfStateChanged         3704.859           7.356           0.040           0.123     34368      3460779
            16 sqlpLLMSetLockState                    3214.568           7.342           0.033           0.108     34368      3460784
            17 SAL_GLM_HANDLE::SAL_SetLockSta         3150.604           7.340           0.032           0.106     34368      3460785
            18 SAL_GLM_HANDLE::SAL_SetLockSta         3106.833           7.338           0.031           0.104     34368      3460786

 

In bad case, we can see within the index scan, sqlplrq->sqlplMakeNewRequestSD->sqlpLLMInformGLMIfStateChanged->sqlpLLMSetLockState were called to set the lock state on the CF:

 

324            0.256215503   | | | | sqlrr_process_fetch_request entry [eduid 267899 eduname db2agent]
325            0.256216466   | | | | | sqlriSectInvoke entry [eduid 267899 eduname db2agent]
328            0.256219351   | | | | | | sqlritaSimplePerm entry [eduid 267899 eduname db2agent]
374            0.256273871   | | | | | | | sqldRowFetch entry [eduid 267899 eduname db2agent]
375            0.256275224   | | | | | | | | sqldIndexFetch entry [eduid 267899 eduname db2agent]
376            0.256276154   | | | | | | | | | sqlirdk entry [eduid 267899 eduname db2agent]
364            0.256264945   | | | | | | | | | | sqliSetupReadaheadPrefetch entry [eduid 267899 eduname db2agent
530            0.256432664   | | | | | | | | | | | sqlifnxt entry [eduid 267899 eduname db2agent]
536            0.256436914   | | | | | | | | | | | | sqliScanLeaf2 entry [eduid 267899 eduname db2agent]
537            0.256439289   | | | | | | | | | | | | | sqlplrq entry [eduid 267899 eduname db2agent]
538            0.256440707   | | | | | | | | | | | | | | sqlplMakeNewRequestSD entry [eduid 267899 eduname db2agent]

539            0.256441044   | | | | | | | | | | | | | | | sqlplComputeLocalConflict entry [eduid 267899 eduname db2agent]
540            0.256441521   | | | | | | | | | | | | | | | sqlplComputeLocalConflict exit
541            0.256441976   | | | | | | | | | | | | | | | sqlpUpdateLrbForNewRequest entry [eduid 267899 eduname db2agent]
542            0.256442583   | | | | | | | | | | | | | | | sqlpUpdateLrbForNewRequest exit
543            0.256443039   | | | | | | | | | | | | | | | sqlpLLMComputeCurrentNonXF entry [eduid 267899 eduname db2agent]
544            0.256443726   | | | | | | | | | | | | | | | sqlpLLMComputeCurrentNonXF exit
545            0.256443990   | | | | | | | | | | | | | | | sqlpLLMInformGLMIfStateChanged entry [eduid 267899 eduname db2agent]
546            0.256444357   | | | | | | | | | | | | | | | | sqlpLLMPrepareSetState entry [eduid 267899 eduname db2agent]
547            0.256444568   | | | | | | | | | | | | | | | | | sqlpLLMDetermineSendType entry [eduid 267899 eduname db2agent]
548            0.256444972   | | | | | | | | | | | | | | | | | sqlpLLMDetermineSendType exit
549            0.256445480   | | | | | | | | | | | | | | | | sqlpLLMPrepareSetState exit
550            0.256446126   | | | | | | | | | | | | | | | | sqlpLLMSetLockState entry [eduid 267899 eduname db2agent]

 

However, no such behavior in good case:

 

1033          13.491516158   | | | | sqlrr_process_fetch_request entry [eduid 276671 eduname db2agent]
1034          13.491517132   | | | | | sqlriSectInvoke entry [eduid 276671 eduname db2agent]
1037          13.491519658   | | | | | | sqlritaSimplePerm entry [eduid 276671 eduname db2agent]
1307          13.492014476   | | | | | | | sqldRowFetch entry [eduid 276671 eduname db2agent]
1308          13.492016142   | | | | | | | | sqldIndexFetch entry [eduid 276671 eduname db2agent]
1309          13.492016675   | | | | | | | | | sqlirdk entry [eduid 276671 eduname db2agent]
1297          13.492006033   | | | | | | | | | | sqliSetupReadaheadPrefetch entry [eduid 276671 eduname db2agent]
1372          13.492086845   | | | | | | | | | | | sqlifnxt entry [eduid 276671 eduname db2agent]
1378          13.492092304   | | | | | | | | | | | | sqliScanLeaf2 entry [eduid 276671 eduname db2agent]
1379          13.492095648   | | | | | | | | | | | | | sqlriDistinct2 entry [eduid 276671 eduname db2agent]
1380          13.492096175   | | | | | | | | | | | | | | sqlriProcessDistinct entry [eduid 276671 eduname db2agent]
1381          13.492097595   | | | | | | | | | | | | | | | sqlrxcp1 entry [eduid 276671 eduname db2agent]
1382          13.492098220   | | | | | | | | | | | | | | | sqlrxcp1 exit [rc = 1]
1383          13.492098771   | | | | | | | | | | | | | | sqlriProcessDistinct exit
1384          13.492099087   | | | | | | | | | | | | | sqlriDistinct2 exit

 

Why did Db2 have such differences between good case and bad case?   Any difference between the access plans?

 

At the first glance, they had exactly the 'same' explain:

 

      Rows
     RETURN
     (   1)
      Cost
       I/O
       |
       47
     UNIQUE
     (   2)
     87.025
       10
       |
      34470
     IXSCAN
     (   3)
     85.9056
       10
       |
      34470
 INDEX: DB2SDIN1
     MYTABLE
       Q1

 

Only compare the plan graph is not enough, then I I saw in the access plan related to the index was that the good case had an isolation level of CS while the bad case had an isolation level of RS in addition to not having arguments to the IXSCAN, CUR_COMMIT and LCKAVOID set to TRUE.  These two would likely result in an increased number of locks as RS isolation level needs more locks to enable repeateable reads and CUR_COMMIT and LCKAVOID both help reduce the number of locks needed during an index scan.

More info on isolation levels

https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.perf.doc/doc/c0004121.html

More info on Currently Commited:

https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.admin.config.doc/doc/r0053556.html

 

GOOD CASE:

 

    3) IXSCAN: (Index Scan)

        Cumulative Total Cost:         85.9056
        Cumulative CPU Cost:         5.94928e+07
        Cumulative I/O Cost:         10
        Cumulative Re-Total Cost:     15.4126
        Cumulative Re-CPU Cost:     5.93274e+07
        Cumulative Re-I/O Cost:     0
        Cumulative First Row Cost:     7.05876
        Estimated Bufferpool Buffers:     11
        Arguments:
        ---------
        CUR_COMM: (Currently Committed)
            TRUE

        LCKAVOID: (Lock Avoidance)
            TRUE

        MAXPAGES: (Maximum pages for prefetch)
            10
        PREFETCH: (Type of Prefetch)
            SEQUENTIAL,READAHEAD
        ROWLOCK : (Row Lock intent)
            SHARE (CS/RS)
        SCANDIR : (Scan Direction)
            FORWARD
        SKIP_INS: (Skip Inserted Rows)
            TRUE
        TABLOCK : (Table Lock intent)
            INTENT SHARE
        TBISOLVL: (Table access Isolation Level)
            CURSOR STABILITY

 

BAD CASE:

 

    3) IXSCAN: (Index Scan)
        Cumulative Total Cost:         85.9056
        Cumulative First Row Cost:     7.05876
        Estimated Bufferpool Buffers:     11
        Arguments:
        ---------
        MAXPAGES: (Maximum pages for prefetch)
            10
        PREFETCH: (Type of Prefetch)
            READAHEAD           
        ROWLOCK : (Row Lock intent)
            SHARE (CS/RS)
        SCANDIR : (Scan Direction)
            FORWARD
        SKIP_INS: (Skip Inserted Rows)
            TRUE
        TABLOCK : (Table Lock intent)
            INTENT SHARE
        TBISOLVL: (Table access Isolation Level)
            READ STABILITY

 

Why did we have isolation level of RS(i.e. READ STABILITY) in bad case?  Did the Websphere application explicitly set the isolation level  to RS as the default isolation level should be CS(Ii.e. CURSOR STABILITY)?

Then I checked the JCC trace collected in bad case (refer to this link to collect JCC trace with different ways -> http://www-01.ibm.com/support/docview.wss?uid=swg21196160), I found the applicaton called setTransactionIsolation() setting Isolation to 4 - java.sql.Connection.TRANSACTION_REPEATABLE_READ (i.e. Read stability in Db2 world):

[jcc][Time:2017-10-26-11:10:01.484][Thread:WebContainer : 0][Connection@a9d5e2b4]setTransactionIsolation (4) called

For Transaction Isolation levels in JCC world, see: https://www.ibm.com/support/knowledgecenter/SSEPGG_10.1.0/com.ibm.db2.luw.apdv.java.doc/src/tpc/imjcc_r0052429.html

So, the thing got clear, the customer removed the call of setTransactionIsolation (4) from his application, the performance became good, the SQL completed in 10 ms!

 

 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm13286197