IBM Support

75 ways to demystify DB2: #26: Techtip: In DB2 PureScale environment, all members are showing UNKNOWN state

Technical Blog Post


Abstract

75 ways to demystify DB2: #26: Techtip: In DB2 PureScale environment, all members are showing UNKNOWN state

Body

Problem:

In DB2 PureScale environment, all members are showing UNKNOWN state.

Symptoms:

db2instance -list

ID        TYPE             STATE                HOME_HOST               CURRENT_HOST            ALERT   PARTITION_NUMBER        LOGICAL_PORT    NETNAME
--        ----             -----                ---------               ------------            -----   ----------------        ------------    -------
0       MEMBER           UNKNOWN                purescalem020            purescalem020               NO                  0                   0    purescalem020-ib0
1       MEMBER           UNKNOWN                purescalem021            purescalem021               NO                  0                   0    purescalem021-ib0
2       MEMBER           UNKNOWN                purescalem022            purescalem022               NO                  0                   0    purescalem022-ib0
3       MEMBER           UNKNOWN                purescalem023            purescalem023               NO                  0                   0    purescalem023-ib0
128     CF                  PEER                purescacf001            purescacf001               NO                  -                   0    purescacf001-ib0,purescacf001-ib1
129     CF               PRIMARY                purescacf002            purescacf002               NO                  -                   0    purescacf002-ib0,purescacf002-ib1

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
purescacf002              ACTIVE                              NO           NO
purescacf001              ACTIVE                              NO           NO
purescalem023              ACTIVE                              NO           NO
purescalem022              ACTIVE                         UNKNOWN          YES
purescalem021              ACTIVE                              NO           NO
purescalem020              ACTIVE                              NO           NO

There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
UNKNOWN is a transient state. Re-issue the command in few seconds. However, if UNKNOWN state continues to re-appear, check the db2diag.log for failure messages pertaining to the DB member or CF in this state.

When running: db2cluster -cm -list -alert , we get:

NON-FATAL ASSERTION FAILED!!!
Time       = Fri Apr 24 13:29:40 2015
Expr       = Invalid block eye-catcher (0x1) found at:
PID        = 32768000
TID        = 1
File       = sqlhaInterface.C
Line       = 21755
Stacktrace =
             _ossMemAllocCheck (+0x730)
             _ossMemFree (+0x278)
             _ossMemFree@glue8B (+0x50)
             sqlhaDeleteClusterObjectListStructure__FP25SQLHA_CLUSTER_OBJECT_LIST (+0x200)
             sqlhaSysIBMGetAlerts__FPP17sqlerDB2AlertDataPUiP5sqlcaP19SQLHA_CONTROL_BLOCK (+0x330)
             rocmDb2clusterActionDisplayAlerts__FPcCiUi (+0x1A4)
             rocmDoDb2clusterEngineAction__FP11ROCM_ACTION (+0x1C8)
             main (+0x24A4)
             __start (+0x70)
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
NON-FATAL ASSERTION FAILED!!!
Time       = Fri Apr 24 13:29:40 2015
Expr       = Invalid pad type (0x10570750) found at:
PID        = 32768000
TID        = 1
File       = sqlhaInterface.C
Line       = 21755
Stacktrace =
             _ossMemAllocCheck (+0x6CC)
             _ossMemFree (+0x278)
             _ossMemFree@glue8B (+0x50)
             sqlhaDeleteClusterObjectListStructure__FP25SQLHA_CLUSTER_OBJECT_LIST (+0x200)
             sqlhaSysIBMGetAlerts__FPP17sqlerDB2AlertDataPUiP5sqlcaP19SQLHA_CONTROL_BLOCK (+0x330)
             rocmDb2clusterActionDisplayAlerts__FPcCiUi (+0x1A4)
             rocmDoDb2clusterEngineAction__FP11ROCM_ACTION (+0x1C8)
             main (+0x24A4)
             __start (+0x70)
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
             ?unknown
Cluster alerts could not be retrieved from the cluster manager. See the db2diag.log for more information.

On top, if you are unable to log into the problematic server remotely and OS is reporting memory related issues, then the situation we have is that on this particular host we have run into high memory utilization and that caused the resources to go into UNKNOWN state.

Solution:

If we can confirm the following:

mmgetstate -a

 Node number  Node name        GPFS state
------------------------------------------
       1      purescacf001     active
       2      purescacf002     active
       3      purescalem020     active
       4      purescalem021     active
       5      purescalem022     unknown 
       6      purescalem023     active

This confirms that the GPFS resource state itself is unknown.The solution would be to re-direct the traffic to rest of the members if you can and do a hard reboot on the problematic member.

In the mean time, for further verification, we can check , errpt -a, or GPFS logs, located at /var/adm/ras/ (i.e mmfs.log.latest and mmfs.log.previous)


Once the member restarts successfully, we should have the following:

db2instance -list

ID        TYPE             STATE                HOME_HOST               CURRENT_HOST            ALERT   PARTITION_NUMBER        LOGICAL_PORT    NETNAME
--        ----             -----                ---------               ------------            -----   ----------------        ------------    -------
0       MEMBER           STARTED                purescalem020            purescalem020               NO                  0                   0    purescalem020-ib0
1       MEMBER           STARTED                purescalem021            purescalem021               NO                  0                   0    purescalem021-ib0
2       MEMBER           STARTED                purescalem022            purescalem022               NO                  0                   0    purescalem022-ib0
3       MEMBER           STARTED                purescalem023            purescalem023               NO                  0                   0    purescalem023-ib0
128     CF                  PEER                purescacf001            purescacf001               NO                  -                   0    purescacf001-ib0,purescacf001-ib1
129     CF               PRIMARY                purescacf002            purescacf002               NO                  -                   0    purescacf002-ib0,purescacf002-ib1

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
purescacf002              ACTIVE                              NO           NO
purescacf001              ACTIVE                              NO           NO
purescalem023              ACTIVE                             NO           NO
purescalem022              ACTIVE                               NO           NO
purescalem021              ACTIVE                             NO           NO
purescalem020              ACTIVE                             NO           NO

mmgetstate -a

 Node number  Node name        GPFS state
------------------------------------------
       1      purescacf001     active
       2      purescacf002     active
       3      purescalem020     active
       4      purescalem021     active
       5      purescalem022     active  
       6      purescalem023     active

If the reboot doesn't help, then please collect the following and engage DB2 Support.

1. db2support (with -s -f option ) from the problematic member along with 'db2instance -list' output
2. GPFS logs
3.  getsdata -a
 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11141138