IBM Support

DB2 pureScale: db2start fails with SQL1517N and cannot be resolved by repairing resources

Technical Blog Post


Abstract

DB2 pureScale: db2start fails with SQL1517N and cannot be resolved by repairing resources

Body

Recently, a user was upgrading a pureScale cluster to V10.5 FP8 with the online (rolling) update method and hit the following error:

$ ./installFixPack -p /opt/IBM/db2/V10.5.0.8 -I db2hk -online -l /tmp/install.log -t /tmp/install.trc  -f RSCT -f GPFS

Execution of a rolling update task failed with an error
Error Message :
SQL1517N  db2start failed because the cluster manager resource states are inconsistent.
Refer to db2diag.log for more details

The install trace shows that the update failed at the last step, "db2start instance on db203":

23176  |||||||||||| 1 InstallProcess::executeLocal 90 -DATA-  , STRING = /source/10_5_fp8/universal/db2/aix/install/db2iexec -n -o "/tmp/db2ioMi7ace:/tmp/db2ieMi7acf" db2hk "db2start instance on db203"
23177  |||||||||||| 1 InstallProcess::executeLocal 100 -DATA-  , INT = 1024
23178  |||||||||||| 1 InstallProcess::executeLocal 110 -DATA-  , STRING = WEXITSTATUS
23179  |||||||||||| 1 InstallProcess::executeLocal 150 -DATA-  , INT = 4
23180  |||||||||||\ 1 InstallProcess::executeLocal EXIT Wed Feb 15 20:33:10 2017 --  , INT = 0
23181  ||||||||||\ 1 InstallProcess::execute EXIT Wed Feb 15 20:33:10 2017 --  , INT = 0
23182  ||||||||||/ 1 InstallProcess::exitCode ENTRY Wed Feb 15 20:33:10 2017 --  ,
23183  ||||||||||\ 1 InstallProcess::exitCode EXIT Wed Feb 15 20:33:10 2017 --  , INT = 0
23184  ||||||||||/ 1 InstallProcess::getStdOutputLength ENTRY Wed Feb 15 20:33:10 2017 --  ,
23185  |||||||||||/ 1 iPutFileInBuffer ENTRY Wed Feb 15 20:33:10 2017 --  ,
23186  |||||||||||| 1 iPutFileInBuffer 10 -DATA-  , STRING = /tmp/db2ioMi7ace
23187  ||||||||||||/ 1 iFopen ENTRY Wed Feb 15 20:33:10 2017 --  ,
23188  ||||||||||||| 1 iFopen 10 -DATA-  , STRING = /tmp/db2ioMi7ace
23189  ||||||||||||| 1 iFopen 20 -DATA-  , STRING = rt
23190  ||||||||||||\ 1 iFopen EXIT Wed Feb 15 20:33:10 2017 --  , INT = 0
23191  |||||||||||| 1 iPutFileInBuffer 20 -DATA-  , STRING = SQL1517N  db2start failed because the cluster manager resource states are inconsistent.
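
The trace excerpt above records two integers for the failed db2iexec call: 1024, and then 4 under the WEXITSTATUS label. These values are consistent with the conventional POSIX wait-status encoding, in which the exit code sits in bits 8 to 15 of the raw status. The following is a minimal Python sketch of that arithmetic. The function name is hypothetical, and reading the two trace values this way is an assumption based on the POSIX convention, not on installer documentation.

```python
def exit_code_from_wait_status(status: int) -> int:
    """Return the exit code stored in bits 8-15 of a raw wait status.

    Assumes the conventional POSIX layout for a process that exited
    normally (low 7 bits are zero, so it was not killed by a signal).
    """
    if status & 0x7F:
        raise ValueError("process was terminated by a signal, not a normal exit")
    return (status >> 8) & 0xFF


# Raw value logged at probe 100 in the trace excerpt.
raw_status = 1024
print(exit_code_from_wait_status(raw_status))  # 4, the value logged at probe 150
```

Under that assumption, a non-zero result here means the command passed to db2iexec itself failed, which matches the SQL1517N text captured from its output file.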

 

The user then saw the same error when issuing the following commands:

db2start instance on db203

db2start cf 129

 

As a starting point, I suggested that the user try to repair the inconsistent resources with db2cluster:

/support/pages/node/478351

Unfortunately, that did not work. There was no quick fix for this problem, so we first had to find the cause of the failure.

The db2diag.log shows the following error messages:

 

2017-02-15-20.33.10.188475+480 E21788608A1525       LEVEL: Error
PID     : 8847618              TID : 1              PROC : db2start
INSTANCE: db2sdin1                NODE : 000
HOSTNAME: db203
EDUID   : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAGetTransportMethod, probe:684
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
          "Unexpected SAL Error."
DATA #1 : String, 34 bytes
Unable to determine transport type
DATA #2 : Codepath, 8 bytes
18:22
DATA #3 : unsigned integer, 8 bytes
1
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 4 bytes
0
DATA #6 : unsigned integer, 4 bytes
0
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x09000000315876CC sqleCAGetTransportMethod + 0xC08
  [1] 0x0900000031585F84 sqleCAGetTransportMethod + 0x1244
  [2] 0x0900000031582080 sqleCAIsRoCE + 0x6E4
  [3] 0x0900000031C05618 sqlhaVerifyNetworkResources__FPPPcPiPbiP19SQLHA_CONTROL_BLOCK + 0x920
  [4] 0x0900000031C0A99C sqlhaVerifyClusterResources__FPcP18sqlo_db2nodes_descPbP19SQLHA_CONTROL_BLOCK + 0x1CA4
  [5] 0x090000002FAECD8C sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0xE22C
  [6] 0x090000002FADA4E8 sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0x96CC
  [7] 0x09000000312D8B68 sqleProcessStartStop__FiPvP18SQLE_INTERNAL_ARGSP9sqlf_kcfdPcUiT6P5sqlca + 0xAF8
  [8] 0x0000000100002950 main + 0x20D0
  [9] 0x00000001000002F8 __start + 0x70

2017-02-15-20.33.10.190401+480 I21790134A639        LEVEL: Error
PID     : 8847618              TID : 1              PROC : db2start
INSTANCE: db2sdin1                NODE : 000
HOSTNAME: db203
EDUID   : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAIsRoCE, probe:2105
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
          "Unexpected SAL Error."
DATA #1 : Codepath, 8 bytes
2
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x0900000033236D50 Size: 0 Reason: Zero-length data
DATA #3 : String, 0 bytes
Object not dumped: Address: 0x0FFFFFFFFFFAEDA0 Size: 0 Reason: Zero-length data

2017-02-15-20.33.10.218674+480 E21790774A520        LEVEL: Error
PID     : 8847618              TID : 1              PROC : db2start
INSTANCE: db2sdin1                NODE : 000
HOSTNAME: db203
EDUID   : 1
FUNCTION: DB2 UDB, high avail services, sqlhaVerifyClusterResources, probe:16760
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
          "Unexpected SAL Error."
DATA #1 : String, 37 bytes
public network equivalency is missing
DATA #2 : signed integer, 4 bytes
0
DATA #3 : Boolean, 1 bytes
false

 

2017-02-15-20.33.10.273849+480 I21792079A1374       LEVEL: Event
PID     : 8847618              TID : 1              PROC : db2start
INSTANCE: db2sdin1                NODE : 000
HOSTNAME: db203
EDUID   : 1
FUNCTION: DB2 UDB, base sys utilities, sqleIssueStartStop, probe:6007
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
          "Unexpected SAL Error."
DATA #1 : SQLCA, PD_DB2_TYPE_SQLCA, 136 bytes
 sqlcaid : SQLCA     sqlcabc: 136   sqlcode: 0   sqlerrml: 0
 sqlerrmc:
 sqlerrp : SQL10058
 sqlerrd : (1) 0x00000000      (2) 0x00000000      (3) 0x00000000
           (4) 0x00000000      (5) 0x00000000      (6) 0x00000000
 sqlwarn : (1)      (2)      (3)      (4)        (5)       (6)    
           (7)      (8)      (9)      (10)        (11)     
 sqlstate:      
DATA #2 : SQLCA, PD_DB2_TYPE_SQLCA, 136 bytes
 sqlcaid : SQLCA     sqlcabc: 136   sqlcode: -1517   sqlerrml: 0
 sqlerrmc:
 sqlerrp : SQLESSCM
 sqlerrd : (1) 0x00000000      (2) 0x00000000      (3) 0x00000000
           (4) 0x00000000      (5) 0x00000000      (6) 0x00000000
 sqlwarn : (1)      (2)      (3)      (4)        (5)       (6)    
           (7)      (8)      (9)      (10)        (11)     
 sqlstate:      
DATA #3 : Boolean, 1 bytes
false
DATA #4 : Boolean, 1 bytes
false
DATA #5 : Boolean, 1 bytes
false
DATA #6 : Boolean, 1 bytes
false
DATA #7 : Boolean, 1 bytes
false
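
When db2diag.log is long, it can help to reduce it to one line per Error record before reading the details. The following is a minimal Python sketch under a few assumptions: records are separated by blank lines, the FUNCTION line ends with the function name and probe as its last two comma-separated fields (as in the entries above), and the line after the first "DATA #n : String" header holds the message text. The sample is an abbreviated copy of two of the records above, and `summarize_errors` is a hypothetical helper, not a DB2 tool.

```python
import re

SAMPLE = """\
2017-02-15-20.33.10.188475+480 E21788608A1525       LEVEL: Error
PID     : 8847618              TID : 1              PROC : db2start
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAGetTransportMethod, probe:684
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
DATA #1 : String, 34 bytes
Unable to determine transport type

2017-02-15-20.33.10.218674+480 E21790774A520        LEVEL: Error
PID     : 8847618              TID : 1              PROC : db2start
FUNCTION: DB2 UDB, high avail services, sqlhaVerifyClusterResources, probe:16760
DATA #1 : String, 37 bytes
public network equivalency is missing
"""


def summarize_errors(text):
    """Return (timestamp, function, probe, first string) for each Error record."""
    results = []
    # Assumption: records are separated by blank (or whitespace-only) lines.
    for record in re.split(r"\n\s*\n", text.strip()):
        lines = record.splitlines()
        if not lines or "LEVEL: Error" not in lines[0]:
            continue
        timestamp = lines[0].split()[0]
        function = probe = detail = None
        for i, line in enumerate(lines):
            if line.startswith("FUNCTION:"):
                # Last two comma-separated fields: function name, "probe:N".
                parts = [p.strip() for p in line.split(",")]
                function = parts[-2]
                probe = parts[-1].split(":", 1)[1]
            elif (detail is None and i + 1 < len(lines)
                  and re.match(r"DATA #\d+ : String", line)):
                detail = lines[i + 1].strip()
        results.append((timestamp, function, probe, detail))
    return results


for row in summarize_errors(SAMPLE):
    print(row)
```

Read in timestamp order, the summary puts the transport-type failure in sqleCAGetTransportMethod ahead of the "public network equivalency is missing" message.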

 

 

The first error is "Unable to determine transport type", which is usually caused by an incorrect interconnect configuration. I therefore asked the user to check db2nodes.cfg, /etc/hosts and /etc/dat.conf, but everything looked correct.

Finally, we managed to collect a db2trc while reproducing the error:
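
One part of that configuration check can be sketched in code: confirming that every netname in db2nodes.cfg also appears in /etc/hosts. The following minimal Python sketch assumes the pureScale db2nodes.cfg layout of number, hostname, logical port, netname (possibly a comma-separated list), resource set name and type. The file contents, host names and addresses are invented for illustration, and the helper names are hypothetical. Netnames may also be resolved through DNS, which this sketch does not cover.

```python
# Hypothetical file contents, for illustration only.
DB2NODES = """\
0 node1 0 node1-roce0 - MEMBER
1 node2 0 node2-roce0 - MEMBER
128 node3 0 node3-roce0 - CF
129 node4 0 node4-roce0 - CF
"""

HOSTS = """\
192.0.2.11 node1-roce0
192.0.2.12 node2-roce0
192.0.2.13 node3-roce0   # node4-roce0 is deliberately missing
"""


def netnames_from_db2nodes(text):
    """Map each member or CF number to its list of netnames."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        # Assumed layout: number, hostname, logical port, netname(s), ...
        result[int(fields[0])] = fields[3].split(",")
    return result


def names_in_hosts(text):
    """Collect every host name and alias from /etc/hosts-style text."""
    names = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].split()  # drop trailing comments
        names.update(entry[1:])                # entry[0] is the IP address
    return names


def unresolved_netnames(db2nodes_text, hosts_text):
    """Return {number: [netnames absent from the hosts text]}."""
    known = names_in_hosts(hosts_text)
    missing = {}
    for number, netnames in netnames_from_db2nodes(db2nodes_text).items():
        absent = [n for n in netnames if n not in known]
        if absent:
            missing[number] = absent
    return missing


print(unresolved_netnames(DB2NODES, HOSTS))
```

An empty result corresponds to the situation in this case, where the configuration files were consistent and the cause had to be found elsewhere.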

$ db2trc on -f db2trc.dmp
Trace is turned on
$ db2start cf 129
SQL1517N  db2start failed because the cluster manager resource states are inconsistent.
$ db2trc off
Trace is turned off
$ db2trc flw db2trc.dmp db2trc.flw
$ db2trc fmt db2trc.dmp db2trc.fmt
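
The flow file written by `db2trc flw` can run to tens of thousands of lines, so it is useful to jump straight to error probes and non-zero return codes. The following is a minimal Python sketch of such a scan. It assumes flow lines shaped like the ones in this trace (a record number, indentation bars, a function name, then `error [probe N]` or `exit [rc = 0x...]`). The helper name `find_failures` is hypothetical, and the sample lines are taken from this case.

```python
import re

# Sample lines in the layout produced for this trace.
FLOW = """\
50059       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice entry
50063       | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50072       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice error [probe 691]
50078       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice exit [rc = 0x90000620 = -1879046624]
"""

ERROR_RE = re.compile(r"^(\d+)[\s|]+(\S+) error \[probe (\d+)\]")
EXIT_RE = re.compile(r"^(\d+)[\s|]+(\S+) exit \[rc = (0x[0-9A-Fa-f]+)")


def find_failures(flow_text):
    """Return (record number, function, detail) for error probes and non-zero exits."""
    hits = []
    for line in flow_text.splitlines():
        m = ERROR_RE.match(line)
        if m:
            hits.append((int(m.group(1)), m.group(2), "probe " + m.group(3)))
            continue
        m = EXIT_RE.match(line)
        # Exit lines without an rc do not match and are skipped.
        if m and int(m.group(3), 16) != 0:
            hits.append((int(m.group(1)), m.group(2), "rc " + m.group(3)))
    return hits


for hit in find_failures(FLOW):
    print(hit)
```

The record numbers returned by the scan can then be looked up in the formatted file (db2trc.fmt) to see the data attached to each probe.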

 

The db2trc output shows that the error comes from ossDATCheckIfInterfaceHasValidUDAPLDevice:

2017-02-17-12.58.12.787555+480 E13703A1525          LEVEL: Error
PID     : 12582994             TID : 1              PROC : db2start
INSTANCE: db2sdin1                NODE : 000
HOSTNAME: db203
EDUID   : 1
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, sqleCAGetTransportMethod, probe:684
MESSAGE : ZRC=0x87270023=-2027487197=SQLE_SAL_UNEXPECTED_ERROR
          "Unexpected SAL Error."
DATA #1 : String, 34 bytes
Unable to determine transport type
DATA #2 : Codepath, 8 bytes
18:22
DATA #3 : unsigned integer, 8 bytes
1
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 4 bytes
0
DATA #6 : unsigned integer, 4 bytes
0
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
  [0] 0x090000000CBF56CC sqleCAGetTransportMethod + 0xC08
  [1] 0x090000000CBF3F84 sqleCAGetTransportMethod + 0x1244
  [2] 0x090000000CBF0080 sqleCAIsRoCE + 0x6E4
  [3] 0x090000000D273618 sqlhaVerifyNetworkResources__FPPPcPiPbiP19SQLHA_CONTROL_BLOCK + 0x920
  [4] 0x090000000D27899C sqlhaVerifyClusterResources__FPcP18sqlo_db2nodes_descPbP19SQLHA_CONTROL_BLOCK + 0x1CA4
  [5] 0x090000000B15AD8C sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0xE22C
  [6] 0x090000000B1484E8 sqleIssueStartStop__FiPvPcT3P9sqlf_kcfdP18SQLE_INTERNAL_ARGSUiT7P5sqlca + 0x96CC
  [7] 0x090000000C946B68 sqleProcessStartStop__FiPvP18SQLE_INTERNAL_ARGSP9sqlf_kcfdPcUiT6P5sqlca + 0xAF8
  [8] 0x0000000100002950 main + 0x20D0
  [9] 0x00000001000002F8 __start + 0x70


50021       | | | | | | sqleCAGetTransportMethod data [probe 15]
50022       | | | | | | sqleCAGetTransportMethod data [probe 20]
50023       | | | | | | | sqleCAGetNetworkInterfaceTransportType entry

50021    data DB2 UDB Shared Data Structure Abstraction Layer for CF sqleCAGetTransportMethod fnc (3.3.39.49.0.15)

...

 

50058       | | | | | | | | ossDATGetUDAPLDeviceForInterface data [probe 976]
50059       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice entry
50060       | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50061       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50062       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50063       | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50064       | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50065       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50066       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50067       | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50068       | | | | | | | | | | OSSHLibrary::getFuncAddress entry
50069       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 10]
50070       | | | | | | | | | | OSSHLibrary::getFuncAddress data [probe 100]
50071       | | | | | | | | | | OSSHLibrary::getFuncAddress exit
50072       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice error [probe 691]

 

50078       | | | | | | | | | ossDATCheckIfInterfaceHasValidUDAPLDevice exit [rc = 0x90000620 = -1879046624]
50079       | | | | | | | | ossDATGetUDAPLDeviceForInterface data [probe 1050]

50072    error DB2 Common OSSe ossDATCheckIfInterfaceHasValidUDAPLDevice cei (4.1.3.219.2.691)
    pid 12582994 tid 1 probe 691
    Error ZRC = 0x00000000 = 0 = PSM_OK
    bytes 12

    Data1     (PD_TYPE_DEFAULT,4) Hexdump:
    9000 0620                                  ...


50078    exit DB2 Common OSSe ossDATCheckIfInterfaceHasValidUDAPLDevice cei (2.1.3.219.2)
    pid 12582994 tid 1
    rc = 0x90000620 = -1879046624

50079    data DB2 Common OSSe ossDATGetUDAPLDeviceForInterface cei (3.1.3.220.2.1050)
    pid 12582994 tid 1 probe 1050
    bytes 297

    Data1     (PD_TYPE_DEFAULT,4) Hexdump:
    9000 0620                                  ...

    Data2     (PD_TYPE_DEFAULT,256) Hexdump:
    6863 6131 0000 0000 0000 0000 0000 0000    hca1............
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................
    0000 0000 0000 0000 0000 0000 0000 0000    ................

    Data3     (PD_TYPE_DEFAULT,13) Hexdump:
    3130 2E31 3333 2E36 332E 3130 39           10.133.63.109
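
The hexdumps and return codes above can be decoded by hand. As a minimal Python sketch, the following turns the hex columns of Data2 and Data3 back into text and converts the 32-bit return code into the signed value that db2trc prints. The helper names are hypothetical. Reading Data2 as a device name and Data3 as the interface address is an assumption based on the decoded text, not on documentation of this probe.

```python
def decode_hexdump(hex_words: str) -> str:
    """Turn the space-separated hex column of a hexdump into ASCII text."""
    raw = bytes.fromhex(hex_words.replace(" ", ""))
    # Buffers are padded with NUL bytes; drop the padding.
    return raw.rstrip(b"\x00").decode("ascii")


def as_signed32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a two's-complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value


print(decode_hexdump("6863 6131 0000 0000"))               # hca1
print(decode_hexdump("3130 2E31 3333 2E36 332E 3130 39"))  # 10.133.63.109
print(as_signed32(0x90000620))                             # -1879046624
print(as_signed32(0x87270023))                             # -2027487197
```

The last two values match the signed forms shown next to rc = 0x90000620 in the trace and ZRC=0x87270023 in db2diag.log.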

 

It looked as though the adapter was not working, so I suggested that the user check the adapter status:

$ ibstat -v

ERROR: "/dev/roce0": open failed rc=46, errno=46

The output indicated that the adapter was down on the host. After the adapter was fixed, the problem was resolved.

I also recommended that the user verify the adapters by running the DB2ClusterPing tool:

/support/pages/node/267117

 


UID

ibm11140610