
PH16635: EXECUTION OF A COMPLEX QUERY IN BIG SQL FAILS WITH AN OUT OF MEMORY ERROR


APAR status

  • Closed as fixed if next.

Error description

  • Running a complex query - for example, a long query with lots of
    UNION statements - may fail with an "Out Of Memory" (OOM)
    error.  The issue is with the size of the section (the compiled
    statement) and the number of subagents created for the query
    execution.  The parallelism in effect - which determines how
    many subagents are spawned - is a factor of the DFT_DEGREE
    setting (SMP parallelism), and the number of subsections (which
    depends on the query execution plan chosen by the Optimizer).
    Every subagent holds a copy of the section, which can result in
    a large memory requirement.
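The relationship described above can be sketched numerically. This is an illustrative model only, not engine code: the section size is taken from the example later in this description, and the subsection count is a hypothetical value.

```python
# Rough model of the memory math described above (illustrative values).
section_size_bytes = 6_441_752   # compiled section size (example value)
num_subsections = 106            # parts of the plan that can run in parallel (hypothetical)
dft_degree = 8                   # SMP parallelism in effect (DFT_DEGREE)

# Each subagent holds its own copy of the section, so the total
# requirement grows with subsections x degree.
num_subagents = num_subsections * dft_degree
estimated_bytes = num_subagents * section_size_bytes

print(num_subagents)                    # 848 subagents
print(round(estimated_bytes / 1e9, 2))  # ~5.46 GB just for section copies
```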
    
    You will see error messages in the db2diag.log similar to the
    following:
    
    2019-08-28-08.43.09.061918-240 I23169E1082           LEVEL: Warning
    PID     : 448991               TID : 140674770921216 PROC : db2sysc 12
    INSTANCE: bigsql               NODE : 012            DB   : BIGSQL
    APPHDL  : 0-14638              APPID: appid
    AUTHID  : bigsql               HOSTNAME: bigsqlhost.com
    EDUID   : 27282                EDUNAME: db2agnts (BIGSQL) 12
    FUNCTION: DB2 UDB, SQO Memory Management, SqloMemController::requestMemory, probe:50
    MESSAGE : ZRC=0x8B0F0000=-1961951232=SQLO_NOMEM "No Memory Available"
              DIA8300C A memory heap error has occurred.
    DATA #1 : String, 36 bytes
    OOM - Instance memory request failed
    DATA #2 : String, 35 bytes
    Logging disabled until next success
    DATA #3 : unsigned integer, 8 bytes
    2162688
    DATA #4 : unsigned integer, 8 bytes
    0
    DATA #5 : String, 13 bytes
    FMP_RESOURCES
    ...
    
    with a stack similar to the following:
    
    0x00007FFFF01D1070 _ZN17SqloMemController9dumpOnOOMEjb + 0x0450
    0x00007FFFF02ACB53 sqloMemLogPoolConditions + 0x0243
    0x00007FFFF02ABEFD sqloGetMemoryBlockExtended + 0x183d
    0x00007FFFEE836CFD _ZN23sqerCommBufferAllocator18allocateCommBufferEP12sqzDataChainI14sqerCommBuffer16sqzChainNodeBaseIS1_EEl + 0x005d
    0x00007FFFEE836923 _ZN21sqerFmpCommBufferPool19allocateCommBuffersEm + 0x0083
    0x00007FFFEE836852 _ZN21sqerFmpCommBufferPool24adjustCommBufferPoolSizeElRl + 0x0272
    0x00007FFFEE8362FF _ZN21sqerFmpCommBufferPool18reserveCommBuffersEllRl + 0x00af
    0x00007FFFEE86ECC0 _Z17sqlerConnectToFmpP8sqeAgentP23SQLER_FMP_CONNECT_PARMSPP13sqerFmpClient + 0x0190
    
    This error triggers an FODC_Memory data collection.  Within that
    collection, under the DB2PD directory, db2pd output can be found
    in a file named <pid>.<eduid>.<nodenumber>.db2pd.inst.alldbs.
    You can use this output to confirm the issue.  In that file,
    look for output similar to the following (an active or waiting
    application with a high number of agents):
    
    Applications:
    Address            AppHandl [nod-index] NumAgents  CoorEDUID  Status       C-AnchID  C-StmtUID  L-AnchID  L-StmtUID  Appid                            WorkloadID  WorkloadOccID  CollectActData  CollectActPartition  CollectSectionActuals
    0x0000000206600080 45596    [000-45596] 1014       0          UOW-Waiting  0         0          0         0          xx.xx.xx.xxx.xxxxx.xxxxxxxxxxxx  7           467            n/a             n/a                  n/a
    
    Note the high number of agents (NumAgents) - 1014 in this
    example.
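As a rough aid when scanning this db2pd output by hand is impractical, a few lines of Python can flag applications with a high agent count. This is a sketch only: the column layout is assumed from the sample above (NumAgents is the fourth whitespace-separated field), and the sample row and the threshold of 500 are illustrative, not values from the product.

```python
# Sketch: flag applications with an unusually high agent count in the
# db2pd "Applications" output. The column layout (Address, AppHandl,
# [nod-index], NumAgents, ...) is assumed from the sample above.
sample = "0x0000000206600080 45596    [000-45596] 1014       0          UOW-Waiting"

def num_agents(row: str) -> int:
    # NumAgents is the 4th whitespace-separated field in this layout.
    return int(row.split()[3])

# Hypothetical threshold; tune to your environment.
suspects = [row for row in [sample] if num_agents(row) > 500]
print(len(suspects))  # 1
```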
    
    Find the active statement for that application handle:
    
    Active Statement List:
    Address            AppHandl [nod-index] UOW-ID  StmtID  AnchID  StmtUID  EffISO  EffLockTOut  EffDegree  EntryTime            StartTime            LastRefTime
    0x00007FFB9CE4D680 45596    [000-45596] 1       1       462     1        1       120          8          Tue Jul 23 08:58:59  Tue Jul 23 08:58:59  Tue Jul 23 08:59:01
    
    and note the AnchID and StmtUID.  Use those values to find the
    entry in the Dynamic SQL Variations output:
    
    Dynamic SQL Variations:
    Address            AnchID  StmtUID  EnvID  VarID  NumRef  Typ  Lockname                    Val  Insert Time                 Sect Size  Num Copies
    0x00007FFB3115ED80 462     1        1      1      184     6    01000000010000000100C039D6  Y    2019-07-23-08.58.59.927870  6441752    849
    
    Here you can see the large section size (6,441,752 bytes) and the
    number of copies of this section that had been made at the time
    of the memory failure (849) - roughly 5.5 GB of memory in total.
    This sudden spike in memory usage triggered the OOM error.
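The arithmetic behind that memory figure is simply the section size multiplied by the number of copies, both taken from the Dynamic SQL Variations output above:

```python
# Arithmetic behind the "roughly 5.5 GB" figure: section size times the
# number of copies held at the time of the OOM (values from the sample
# db2pd output above).
sect_size = 6_441_752   # bytes, "Sect Size" column
num_copies = 849        # "Num Copies" column

total_bytes = sect_size * num_copies
print(round(total_bytes / 1e9, 2))  # 5.47
```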
    
    You can also use the AnchID and StmtUID to find the statement
    itself.
    
    To prevent these out-of-memory errors, disable SMP parallelism
    for complex queries.  The number of subagents created is
    determined by the number of subsections (the parts of the
    section that can run in parallel) and the degree of SMP
    parallelism in effect.  Disabling SMP parallelism reduces the
    number of subagents that are spawned, and therefore the overall
    memory requirement.
    
    Be aware that reducing SMP parallelism can impact the
    performance of the query.
    .
    This fix adds a new DB2_EXTENDED_OPTIMIZATION registry setting
    option "SMP_MAX_SS". This setting will disable SMP parallelism
    for plans in which the number of subsections exceeds a specified
    limit.
    .
    For example, if one sets:
    db2set DB2_EXTENDED_OPTIMIZATION="BI_INFER_CC ON,SMP_MAX_SS 100"
    .
    then any optimizer plan with more than 100 subsections will not
    be SMP-parallelized.
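The effect of the setting can be pictured as a simple threshold check. This is a toy model of the behavior described above, not the actual optimizer logic:

```python
# Toy model of the SMP_MAX_SS throttle described above (not the actual
# engine implementation): plans whose subsection count exceeds the
# limit are not SMP-parallelized.
def effective_degree(num_subsections: int, dft_degree: int, smp_max_ss: int) -> int:
    return 1 if num_subsections > smp_max_ss else dft_degree

print(effective_degree(150, 8, 100))  # 1  (throttled)
print(effective_degree(50, 8, 100))   # 8  (unchanged)
```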
    
    The number of subsections to target for throttling can be
    estimated from the data collected during an Out Of Memory event.
    .
    Using the above example, the number of copies of the section at
    the time of the OOM was 849, as shown in the Dynamic SQL
    Variations output in the
    <pid>.<eduid>.<nodenumber>.db2pd.inst.alldbs file, found under
    the FODC_Memory data collection in the DB2PD directory.
    .
    Dynamic SQL Variations:
    Address            AnchID  StmtUID  EnvID  VarID  NumRef  Typ  Lockname                    Val  Insert Time                 Sect Size  Num Copies
    0x00007FFB3115ED80 462     1        1      1      184     6    01000000010000000100C039D6  Y    2019-07-23-08.58.59.927870  6441752    849
    .
    A good setting for SMP_MAX_SS is a value lower than the number
    of section copies at the time of the OOM event divided by the
    default degree of parallelism that is set in the database
    configuration.  The default is 8, but the actual value can be
    determined by running
    db2 get db cfg for bigsql | grep DFT_DEGREE
    .
    In the above example, 849 divided by the default parallelism of
    8 is approximately 106, so setting SMP_MAX_SS to 100 would have
    avoided this OOM.
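That sizing rule can be written out directly, using the values from the example:

```python
# Estimating an SMP_MAX_SS value from the OOM data, as described above.
num_copies = 849   # section copies at the time of the OOM (db2pd output)
dft_degree = 8     # default degree of parallelism (DFT_DEGREE)

threshold = num_copies // dft_degree   # copies per unit of parallelism
smp_max_ss = 100                       # pick a value safely below the threshold

print(threshold)   # 106
```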
    

Local fix

  • Adding the following guideline to the end of a query will
    manually throttle parallelism for that query:
    /*<OPTGUIDELINES><DEGREE VALUE='1'/></OPTGUIDELINES>*/
    

Problem summary

  • Please see the problem description.
    

Problem conclusion

Temporary fix

Comments

APAR Information

  • APAR number

    PH16635

  • Reported component name

    IBM BIG SQL

  • Reported component ID

    5737E7400

  • Reported release

    504

  • Status

    CLOSED FIN

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-09-10

  • Closed date

    2020-09-09

  • Last modified date

    2020-12-03

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels


Document Information

Modified date:
04 December 2020