IBM Support

On pureScale, db2start may hang or time out if idle and CA resources cannot come online

Technical Blog Post


Abstract

On pureScale, db2start may hang or time out if idle and CA resources cannot come online

Body

In a pureScale environment, when you start a CF or a member, DB2 first attempts to start the idle resources and the CA resources.

If this step is slow for any reason, db2start may appear to hang, or it may eventually time out.

For instance, the following stack shows db2start waiting for the TSA resources to come online:

root@lpar215ps4:/>ps -elf|grep db2start
  200001 A db2inst1 11010278  9240664   0  60 20 8a92b5590  8804 f1000a00e0bc4ab0 13:57:20  pts/0  0:00 db2start cf 128
root@lpar215ps4:/>pstack 11010278
ksh: pstack:  not found.
root@lpar215ps4:/>procstack 11010278
11010278: db2start cf 128
0x090000000055a910  _p_nsleep(??, ??) + 0x10
0x09000000000397e4  nsleep(??, ??) + 0xe4
0x090000000015da90  nanosleep(??, ??) + 0x190
0x090000000118e468  ossSleep(??) + 0xa8
0x0900000003b37f00  sqlhaWaitForResourceState(SQLHA_CLUSTER_OBJECT_INFO*,_sqlhaObjStates,SQLHA_CONTROL_BLOCK*)(0x80000000000080, 0x100000001, 0x200) + 0x1640
0x0900000003b364d0  sqlhaOnlineClusterObject(SQLHA_CLUSTER_OBJECT_INFO*,SQLHA_CONTROL_BLOCK*)(??, ??) + 0x1e30
0x0900000003b6b9b4  sqlhaOperationOnClusterObjectsByType(char*,_sqlhaClusterObjType,SQLHA_CLUSTER_OPERATION,unsigned long,SQLHA_CLUSTER_OBJECT_INFO**,SQLHA_CLUSTER_OPERATION_RESULT_LIST**,SQLHA_CONTROL_BLOCK*)(0xffffffffffffffff, 0x1200000012, 0x0, 0x1, 0x0, 0x90000000b2dc79c, 0xcd) + 0x1054
0x0900000003b6dc28  sqlhaStartSDInfrastructure(char*,unsigned long,SQLHA_CLUSTER_OPERATION_RESULT_LIST**,SQLHA_CONTROL_BLOCK*,short)(0x8000000080, 0x5, 0x0, 0x1, 0x3) + 0x15c8
0x0900000009c95814  sqleIssueStartStop(int,void*,char*,char*,sqlf_kcfd*,SQLE_INTERNAL_ARGS*,unsigned int,unsigned int,sqlca*)(0x100, 0x2a9, 0x0, 0x2f64623266733031, 0x2f646232696e7374, 0x312f73716c6c6962, 0x5f73686172656400, 0x0) + 0x9df4
0x0900000009c88cf8  sqleProcessStartStop(int,void*,SQLE_INTERNAL_ARGS*,sqlf_kcfd*,char*,unsigned int,unsigned int,sqlca*)(0x100000001, 0x0, 0x3f2000003f2, 0x0, 0x0, 0x0, 0x0, 0x0) + 0x1138
0x0000000100002a1c  main(??, ??) + 0x219c
0x00000001000002f8  __start() + 0x70
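
When the stack is long, it can help to test mechanically whether db2start is sitting in the cluster-resource wait instead of reading every frame. The following is a minimal shell sketch, not an IBM-supplied tool: the function name waiting_on_cluster_resource is hypothetical, and the only assumption is that the stack text contains the frame name sqlhaWaitForResourceState, as in the output above.

```shell
# Hypothetical helper: read a stack dump on stdin and report whether it
# contains the sqlhaWaitForResourceState frame seen in the example above.
waiting_on_cluster_resource() {
    if grep -q 'sqlhaWaitForResourceState'; then
        echo "waiting for a cluster resource to change state"
    else
        echo "not in sqlhaWaitForResourceState"
        return 1
    fi
}

# Usage on AIX, with the PID taken from the ps output:
#   procstack 11010278 | waiting_on_cluster_resource
```

If the function reports the wait, the next step is to find out which resource is not coming online.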

 

In such a case, first check whether any idle resource is offline:

root@lpar215ps4:/>lssam|grep idle
Online IBM.ResourceGroup:idle_db2inst1_997_lpar214ps3-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_997_lpar214ps3-rs
                '- Online IBM.Application:idle_db2inst1_997_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_997_lpar215ps4-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_997_lpar215ps4-rs
                '- Online IBM.Application:idle_db2inst1_997_lpar215ps4-rs:lpar215ps4
Online IBM.ResourceGroup:idle_db2inst1_998_lpar214ps3-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_998_lpar214ps3-rs
                '- Online IBM.Application:idle_db2inst1_998_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_998_lpar215ps4-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_998_lpar215ps4-rs
                '- Online IBM.Application:idle_db2inst1_998_lpar215ps4-rs:lpar215ps4
Online IBM.ResourceGroup:idle_db2inst1_999_lpar214ps3-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_999_lpar214ps3-rs
                '- Online IBM.Application:idle_db2inst1_999_lpar214ps3-rs:lpar214ps3
Online IBM.ResourceGroup:idle_db2inst1_999_lpar215ps4-rg Nominal=Online
        '- Online IBM.Application:idle_db2inst1_999_lpar215ps4-rs
                '- Online IBM.Application:idle_db2inst1_999_lpar215ps4-rs:lpar215ps4
Online IBM.Equivalency:idle_db2inst1_997_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_997_lpar215ps4-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_998_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_998_lpar215ps4-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_999_lpar214ps3-rg_group-equ
Online IBM.Equivalency:idle_db2inst1_999_lpar215ps4-rg_group-equ

 

If any idle resource is offline, check whether the cause is a "Depends-On" resource that is not available.

Next, also check whether any CA or primary resource is offline:

root@lpar215ps4:/>lssam|egrep "ca_|primary"
Online IBM.ResourceGroup:ca_db2inst1_0-rg Nominal=Online
        '- Online IBM.Application:ca_db2inst1_0-rs
                |- Online IBM.Application:ca_db2inst1_0-rs:lpar214ps3
                '- Online IBM.Application:ca_db2inst1_0-rs:lpar215ps4
Online IBM.ResourceGroup:primary_db2inst1_900-rg Nominal=Online
        '- Online IBM.Application:primary_db2inst1_900-rs
                |- Offline IBM.Application:primary_db2inst1_900-rs:lpar214ps3
                '- Online IBM.Application:primary_db2inst1_900-rs:lpar215ps4
Online IBM.Equivalency:ca_db2inst1_0-rg_group-equ
Online IBM.Equivalency:primary_db2inst1_900-rg_group-equ
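
Scanning a long lssam listing by eye is error-prone, so a small filter that keeps only the entries not reported as Online can help. This is a sketch under assumptions: lssam_not_online is a hypothetical name, and the filter assumes plain-text output laid out as in the listings above (optional tree-drawing characters, then the state, then the resource name).

```shell
# Hypothetical helper: read lssam output on stdin and print only the
# entries whose state is something other than "Online".
lssam_not_online() {
    # Strip leading whitespace and the tree-drawing characters | ' -
    # then keep non-empty lines whose first word is not "Online".
    sed "s/^[[:space:]|'-]*//" | awk 'NF > 0 && $1 != "Online"'
}

# Usage:
#   lssam | egrep "idle|ca_|primary" | lssam_not_online
```

Whether an entry that is not Online is actually a problem still has to be judged against the state that resource is expected to have on that host.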

 

If db2start eventually times out, check db2diag.log to see which resource is taking a long time to come online.
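
One way to narrow down db2diag.log is to list only the records logged by the sqlha functions that appear in the db2start stack. The sketch below is an illustration under assumptions, not a documented procedure: sqlha_diag_entries is a hypothetical name, the log path must be supplied by you because it depends on the instance configuration, and the script assumes the usual db2diag.log layout in which each record starts with a timestamp line and contains a line beginning with FUNCTION:.

```shell
# Hypothetical helper: print the timestamp header and FUNCTION line of
# every db2diag.log record that was logged by an sqlha* function.
sqlha_diag_entries() {
    # $1: path to db2diag.log (location depends on the instance diagpath)
    awk '
        # Remember the most recent record header (timestamp line).
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-/ { header = $0 }
        # Print header and FUNCTION line for records from sqlha functions.
        /^FUNCTION:/ && /sqlha/ { print header; print $0 }
    ' "$1"
}

# Usage (the path is a placeholder):
#   sqlha_diag_entries /path/to/db2diag.log
```

The timestamps of the matching records show when each cluster operation was logged, which helps identify the resource that was slow to come online.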


UID

ibm11140316