Solving problems in the DB2 pureScale cluster services environment

A systematic approach to diagnostic information gathering and problem determination

This tutorial guides DBAs and system administrators in problem determination for IBM® DB2® pureScale® cluster services. As you deploy IBM DB2 pureScale Feature for DB2 Enterprise Server Edition systems into production, you need to acquire appropriate problem determination skills. This tutorial provides information about gathering diagnostic information when failures occur, and provides additional information to aid in understanding the tightly integrated subcomponents of the DB2 pureScale Feature, such as the Cluster Caching Facility (CF), General Parallel File System (GPFS), Reliable Scalable Cluster Technology (RSCT), and IBM Tivoli Systems Automation for Multiplatforms (Tivoli SA MP).

Oleg Tyschenko (tyschenko@ie.ibm.com), DB2 pureScale Development Team Leader, IBM

Oleg Tyschenko leads the DB2 engine development team in Ireland and is a member of the DB2 High Availability team for DB2 pureScale development. He has more than fifteen years of technical and engineering experience in information management and related technologies. He is currently involved in a number of DB2 pureScale deployment and proof-of-concept projects across Europe. His areas of expertise include the DB2 pureScale Feature, DB2 high availability and cluster management, Tivoli SA for Multiplatforms, and GPFS deployment. Oleg holds a degree in computer science and an MBA from the Warwick Business School (UK).



Massimiliano Gallo (gallomas@ie.ibm.com), DB2 Functional Test Engineer, IBM

Massimiliano Gallo is a senior member of the DB2 Functional Verification Test team in Dublin and has worked on DB2 pureScale testing for the last several years. He has more than 10 years of experience in software development and product testing, working on a broad range of technologies. His areas of expertise include the DB2 pureScale Feature, DB2 high availability and cluster management, and Tivoli SA for Multiplatforms. Massimiliano holds an M.Sc. in Aerospace Engineering from the Sapienza University of Rome.



18 August 2011


Before you start

Introduction

IBM DB2 pureScale Feature for Enterprise Server Edition offers clustering technology that helps deliver high availability and exceptional scalability transparent to applications, and brings best-of-breed architecture to the distributed platform. The DB2 pureScale Feature enables the database to continue processing through unplanned outages and provides nearly unlimited capacity for any transactional workload. Scaling your system is simply a matter of connecting a host and issuing two simple commands. The cluster-based, shared-disk architecture of the DB2 pureScale Feature also helps reduce costs through efficient use of system resources.

The DB2 pureScale Feature combines several tightly integrated software components, which are installed and configured automatically when you deploy the DB2 pureScale Feature. You interact with components such as the DB2 cluster manager and DB2 cluster services through DB2 administration views and commands, such as db2instance, db2icrt, db2iupdt, and the db2cluster tool. The db2cluster tool also provides options for troubleshooting and problem determination. Additionally, the messages that are generated by the subsystems of the DB2 cluster manager are an excellent source of information for problem determination. For example, the resource managers of the resource classes utilized by DB2 cluster services each write status information to their log files. The db2diag log files also provide useful information. Often, messages in the db2diag log files explain the reason for a failure and give advice on how to resolve it.

DB2 cluster services is able to automatically handle the majority of run-time failures. However, certain types of failures require you to take action to resolve them. For example, a power cord may become unplugged from a host, or a network cable could get disconnected. If DB2 cluster services cannot resolve the failure automatically, an alert field is set to notify the DBA that a problem has occurred that requires attention. DBAs can see the alert when they check the status of the DB2 instance, as shown later.

Understanding the DB2 pureScale Feature resource model

The Version 9.8 DB2 pureScale Feature resource model differs from the resource model utilized in an HA DB2 instance in Version 9.7 single partition and multi-partition database environments. For additional information on HA DB2 instances in DB2 versions prior to the 9.8 DB2 pureScale Feature, refer to the background information links in the Resources section at the end of the tutorial.

The new resource model implemented in Version 9.8 DB2 pureScale Feature is necessary to represent cluster caching facilities (CFs) and the shared clustered file system.

In a DB2 pureScale shared data instance, one CF fulfills the primary role and contains the currently active data for the shared data instance. The second CF maintains a copy of the pertinent information so that it can take over the primary role immediately if required.

The new resource model allows IBM Tivoli® System Automation for Multiplatforms (Tivoli SA MP) to appropriately automate the movement of the primary role in case of failure of the primary CF node.

DB2 cluster services includes three major components:

  • Cluster manager: Tivoli SA MP, which includes Reliable Scalable Cluster Technology (RSCT)
  • Shared clustered file system: IBM General Parallel File System (GPFS)
  • DB2 cluster administration: DB2 commands and administration views for managing and monitoring the cluster
Figure 1. DB2 Cluster services
Diagram shows client workstations connected to DB2 data server, which has Primary CF, Secondary CF, members, DB2 cluster services, and shared file system

DB2 cluster services provides the essential infrastructure that makes the shared data instance highly available, supplying automatic failover and restart as soon as the instance has been created.

DB2 cluster elements are representations of entities that are monitored and whose status changes are managed by DB2 cluster services. For the purposes of this tutorial, we will address three types of DB2 cluster elements:

  • Hosts: A host can be a physical machine, LPAR (Logical Partition of a physical machine), or a virtual machine.
  • DB2 members: A DB2 member is the core processing engine and normally resides on its home host. The home host of a DB2 member is the host name that was provided as the member's location when the member was added to the DB2 shared data instance. A DB2 member has a single home host. DB2 members can accept client connections only when they are running on their home host.
  • Cluster caching facilities (CFs): The cluster caching facility (CF) is a software application managed by DB2 cluster services that provides internal operational services for a DB2 shared data instance.

There is not necessarily a one-to-one mapping between DB2 cluster elements and the underlying cluster manager resources and resource groups.

Understanding how the DB2 pureScale Feature automatically handles failure

When a failure occurs in the DB2 pureScale instance, DB2 cluster services automatically attempts to restart the failed resources. When and where the restart occurs depends on different factors, such as the type of resource that failed and the point in the resource life cycle at which the failure occurred.

If a software or hardware failure on a host causes a DB2 member to fail, DB2 cluster services automatically restarts the member. A DB2 member can be restarted either on the same host (local restart) or, if that fails, on a different host (member restart in restart light mode). Restarting a member on another host is called failover.

Member restart includes restarting failed DB2 processes and performing member crash recovery (undoing or reapplying log transactions) in order to roll back any 'in-flight' transactions and to free any locks held by them. Member restart also ensures that updated pages have been written to the CF.

When a member is restarted on a different host in restart light mode, minimal resources are used on the new host (which is the home host of another DB2 member). A member running in restart light mode does not process new transactions, because its sole purpose is to perform member crash recovery. The databases on the failed member are recovered to a point of consistency as quickly as possible. This enables other active members to access and change database objects that were locked by the abnormally terminated member. All in-flight transactions from the failed member are rolled back and all locks that were held at the time of the abnormal termination of the member are released. Although the member does not accept new transactions, it remains available for resolution of in-doubt transactions. When a DB2 member has failed over to a new host, the total processing capability of the whole cluster is temporarily reduced. When the home host is active and available again, the DB2 member automatically fails back and is restarted on its home host. The cluster's processing capability is restored as soon as the DB2 member has failed back and restarted on its home host. Transactions on all other DB2 members are not affected during the failback process.


Managing the cluster environment for the DB2 pureScale Feature

DB2 cluster services provides DB2 administration views and commands for cluster management. You can use DB2 commands to manage the shared file system and cluster manager rather than using separate Tivoli SA MP or GPFS commands. For example, when you create a DB2 instance and add new DB2 members or CFs, the DB2 database manager automatically invokes the appropriate actions for Tivoli SA MP and GPFS to set up or update the peer domain as needed. The peer domain is a cluster of hosts configured for high availability and can consist of all hosts in the cluster or can be a subset of hosts in the overall cluster solution. The peer domain includes the hosts as failover targets in a round-robin policy, which chooses a host from a list of available hosts.

DB2 cluster services monitors and reacts to cluster events when the DB2 database manager invokes any of the following operations:

  • Creation and deletion of cluster resources, as part of the install and upgrade process
  • Expansion or contraction of the cluster manager and shared file system to include additional hosts or reduce the number of hosts, as part of adding or removing a host or a DB2 member
  • Stopping the instance for planned maintenance that cannot be performed online

For example, you can use the db2cluster command to enter maintenance mode, the db2icrt command to create a new DB2 pureScale Feature instance, and the db2iupdt command to update that instance.

Further administration of the DB2 cluster services is provided by the db2cluster command.
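
For example, taking a host out of the cluster for planned maintenance and bringing it back afterwards typically follows a sequence like the one below. This is a minimal sketch run on the host being maintained; it assumes the instance is db2inst1 and that your DB2 level supports the maintenance options shown, so verify the exact syntax against the documentation for your release:

> db2stop instance on hostA
> db2cluster -cm -enter -maintenance
  ... perform the maintenance on hostA ...
> db2cluster -cm -exit -maintenance
> db2start instance on hostA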

Using the db2cluster command to diagnose and repair problems

In certain rare situations, the cluster resource model might become inconsistent, requiring you to intervene.

The db2cluster and the db2instance commands allow you to gather information on the state of the resource model.

In this example, the resource model is healthy:

> db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are consistent.

In this case, the resource model is inconsistent:

> db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are inconsistent. 
Refer to the db2diag.log for more information on inconsistencies.

When a resource model is inconsistent, you need to look at the messages in the db2diag log files to determine details about the problem. Typically, messages in the db2diag log files explain the reason for the failure and give advice on what the problem is and how to resolve it. As an example, the message in Listing 1 indicates that the db2nodes.cfg file may have been changed manually. If that is the case, the recommended action is to revert to an earlier version of the db2nodes.cfg file. If that is not the case, the message recommends an alternate action: fixing the inconsistencies with the db2cluster command.

Listing 1. db2diag.log file with failure details
2011-05-25-15.01.21.776169+060 E13778E572            LEVEL: Info
PID     : 21058                TID  : 46912898085712 KTID : 21058
PROC    : db2cluster
INSTANCE: db2inst1                NODE : 000
HOSTNAME: hostA
FUNCTION: DB2 UDB, high avail services, sqlhaUISDVerifyResources, probe:0
MESSAGE : Resource verification failed.  Recommendation: if db2nodes.cfg was
          modified, return it to its original form.
DATA #1 : String, 124 bytes
Issue a 'db2cluster -cm -repair -resources' command to fix the inconsistencies 
if the db2nodes.cfg was not manually changed.
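
Because the db2diag log files can grow large, it is often quicker to filter them with the db2diag formatting tool rather than reading them directly. The following is a minimal sketch; the filter values are illustrative assumptions, and the option behavior should be verified against your DB2 level. The first command limits output to Error-level records from the last day; the second shows only records written by the high availability services functions (whose names start with sqlha, as in Listing 1):

> db2diag -H 1d -g level=Error
> db2diag -g 'function:=sqlha'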

In many cases, you can use the db2cluster tool to repair resources, using the following command:

> db2cluster -cm -repair -resources
All cluster configurations have been completed successfully. db2cluster exiting...

Note that the above command requires the DB2 instance to be stopped.
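
A typical repair sequence therefore looks like the following minimal sketch (db2stop and db2start are the standard instance stop and start commands; add the FORCE option to db2stop only if applications are still connected and you accept rolling back their work):

> db2stop
> db2cluster -cm -repair -resources
> db2start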

After the resource model is repaired, the cluster manager returns to a healthy operational state with no alert raised on any entity. To obtain further information on the status of the cluster, you can use the following command:

Listing 2. Output from db2instance -list
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib0
129 CF     CATCHUP    hostD     hostD        NO    -                0            hostD-ib0

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
hostD                      ACTIVE               NO                      NO
hostB                      ACTIVE               NO                      NO
hostC                      ACTIVE               NO                      NO
hostA                      ACTIVE               NO                      NO

The above example of the db2instance output shows the case of a healthy cluster, with two DB2 members and two CFs. There are no alerts in the cluster: all DB2 members are started and operational on their home host. The CF represented by ID 128 holds the PRIMARY role, and all hosts are active, with the DB2 instance started on each host.

We can double check that there are no alerts by using the following command:

> db2cluster -cm -list -alert
There are no alerts

You can check the state of hosts in the peer domain by using the following command:

Listing 3. Output from db2cluster -list -cm -host -state
> db2cluster -list -cm -host -state
HOSTNAME                    STATE
------------------------    -----------
hostA                       ONLINE
hostB                       ONLINE
hostC                       ONLINE
hostD                       ONLINE

This means that the peer domain is online and operational on all hosts.

Using subsystem messages and error data to troubleshoot DB2 cluster management events

For troubleshooting and problem determination, the messages that are generated by the subsystems of the DB2 cluster manager are an important source of information.

On Linux platforms, messages are written to the system log (/var/log/messages).

On AIX platforms, the system logger is not configured by default, and messages are written to the error log. It is recommended that you configure the system logger in the /etc/syslog.conf file. After updating /etc/syslog.conf, you must run the command refresh -s syslogd.
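
A minimal illustrative configuration is shown below; the facility, priority, and target file are assumptions, so adapt them to your site standards. Note that on AIX the target file must exist before syslogd will write to it:

# entry appended to /etc/syslog.conf
*.info          /var/log/messages

# create the target file and re-read the configuration
touch /var/log/messages
refresh -s syslogd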

Resource classes utilized by DB2 cluster services components are managed by a variety of resource managers (RMs). Which resource manager is responsible for managing a particular class depends on the class. A resource manager runs as a daemon that is controlled by the system resource controller (SRC). For more information about resource managers, refer to the Tivoli SA MP Administrator's and User's Guide. A link is available in the Resources section.

The Global Resource RM (IBM.GblResRM)

On every node, the Global Resource RM (IBM.GblResRM) maintains an audit log where it records any execution of start and stop commands, and reset operations. The Global Resource RM is operable only in peer domain mode. Logs from the Global Resource RM are located under:

/var/ct/<domain_name>/log/mc/IBM.GblResRM/

The domain name can be found using the db2cluster command:

> db2cluster -cm -list -domain
Domain Name: db2domain

To view the audit log, issue the rpttr command locally on a node:

rpttr -o dtic /var/ct/<domain_name>/log/mc/IBM.GblResRM/trace_summary*

Note that root access to the /var/ct/<domain_name>/log/mc/IBM.GblResRM/ directory needs to be provided in order to view the logs with the rpttr command as shown above. (For alternative approaches allowing non-root users to read traces with fewer privileges, see Appendix A.)

The following example shows the format of trace information in the audit log:

Listing 4. Formatted trace file
[00] 04/15/11 17:07:43.324891 T(4150437552) ____  ******************* Trace Started - Pid 
= 7268 **********************
[00] 04/18/11 09:08:57.498711 T(4150433456) ____  ******************* Trace Started - Pid 
= 6735 **********************
[00] 04/19/11 11:41:31.832310 T(4128861088) _GBD Monitor detect OpState change for resourc
e Name=instancehost_db2inst1_hostD OldOpState=0 NewOpState=2 Handle=0x6028 0xffff 0xb82b28
63 0x4a725d1a 0x1215ee4e 0xc2f1e0b0
[00] 04/19/11 11:41:32.574276 T(4128861088) _GBD Monitor detect OpState change for resourc
e Name=instancehost_db2inst1_hostD OldOpState=2 NewOpState=1 Handle=0x6028 0xffff 0xb82b28
63 0x4a725d1a 0x1215ee4e 0xc2f1e0b0
[00] 04/19/11 11:42:07.255763 T(4123835296) _GBD Monitor detect OpState change for resourc
e Name=cacontrol_db2inst1_128_hostD OldOpState=0 NewOpState=2 Handle=0x6028 0xffff 0xb82b2
863 0x4a725d1a 0x1215ee57 0x8657b518
[00] 04/19/11 11:42:08.973508 T(4123736992) _GBD Monitor detect OpState change for resourc
e Name=ca_db2inst1_0-rs OldOpState=0 NewOpState=2 Handle=0x6028 0xffff 0xb82b2863 0x4a725d
1a 0x1215ee57 0xc96670b0
[00] 04/19/11 11:42:19.740162 T(4123638688) _GBD Monitor detect OpState change for resourc
e Name=primary_db2inst1_900-rs OldOpState=0 NewOpState=2 Handle=0x6028 0xffff 0xb82b2863 0
x4a725d1a 0x1215ee5a 0x63655418
[00] 04/19/11 11:50:14.697605 T(4123835296) _GBD Monitor detect OpState change for resourc
e Name=cacontrol_db2inst1_128_hostD OldOpState=2 NewOpState=1 Handle=0x6028 0xffff 0xb82b2
863 0x4a725d1a 0x1215ee57 0x8657b518

Note: For easy reading of the formatted traces, you can redirect the rpttr command output to a file, since the output contains a significant volume of event log records.
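
For example, assuming the domain name reported earlier (db2domain) and an arbitrary output file name:

rpttr -o dtic /var/ct/db2domain/log/mc/IBM.GblResRM/trace_summary* > /tmp/gblresrm_audit.txt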

The Recovery RM (IBM.RecoveryRM)

The Recovery RM (IBM.RecoveryRM) serves as the decision engine for Tivoli SA MP. A Recovery RM runs on each host in the cluster, with one Recovery RM designated as the master. The master Recovery RM is responsible for evaluating the monitoring information and acts on information provided by IBM.GblResRM. When a situation develops that requires intervention, the Recovery RM controls the decisions that result in start or stop operations on the resources, as needed.

Logs from the Recovery RM are located under:

/var/ct/<domain_name>/log/mc/IBM.RecoveryRM/

The master Recovery RM maintains an audit log of:

  • All requests
  • Error responses to the requests
  • Information about the current policy
  • Information about binding issues

To view the log, you need to issue the rpttr command on the host on which the master Recovery RM is running. First, you can determine the host by using the lssrc command, for example:

Listing 5. Output from lssrc command
lssrc -ls IBM.RecoveryRM | grep Master

Example output:
   Master Node Name     : hostD (node number =  4)
rpttr -o dtic /var/ct/<domain_name>/log/mc/IBM.RecoveryRM/trace_summary*

Note that root access to the /var/ct/<domain_name>/log/mc/IBM.RecoveryRM/ directory needs to be provided in order to view the logs with the rpttr command as shown above. (For alternative approaches allowing non-root users to read traces with fewer privileges, see Appendix A.)
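
For example, run on the master node identified by lssrc (hostD in Listing 5), with the domain name reported by db2cluster; filtering with grep for a single resource and redirecting to an arbitrary file keeps the output manageable:

rpttr -o dtic /var/ct/db2domain/log/mc/IBM.RecoveryRM/trace_summary* | grep db2_db2inst1_1-rs > /tmp/recoveryrm_member1.txt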

An example of the formatted trace file is shown below:

Listing 6. Formatted trace file
[02] 05/25/11 14:59:25.487143 T(4112939936) _RCD ReportMoveState: Resource : db2_db2inst1_
1-rs/Float/IBM.Application  reported move state change: a000 - preparing: 0
[02] 05/25/11 14:59:25.487651 T(4112939936) _RCD ReportMoveState: Resource : db2_db2inst1_
1-rs/Fixed/IBM.Application/hostA  reported move state change: db2_db2inst1_1-rs/Fixed/IBM.
Application/hostA  reported state change: 2
[02] 05/25/11 14:59:25.488380 T(4112939936) _RCD Offline request injected: db2_db2inst1_0-
rg/ResGroup/IBM.ResourceGroup
[02] 05/25/11 14:59:25.488637 T(4112939936) _RCD ReportMoveState: Resource : db2_db2inst1_
0-rs/Float/IBM.Application  reported move state change: a000 - preparing: 0
[02] 05/25/11 14:59:25.488914 T(4112939936) _RCD ReportMoveState: Resource : db2_db2inst1_
0-rs/Fixed/IBM.Application/hostA  reported move state change: a000 - preparing: 0
[02] 05/25/11 14:59:25.488955 T(4112939936) _RCD ReportState: Resource : db2_db2inst1_0-rs
/Fixed/IBM.Application/hostA  reported state change: 2
[02] 05/25/11 14:59:25.489642 T(4112939936) _RCD Offline request injected: idle_db2inst1_9
97_hostA-rg/ResGroup/IBM.ResourceGroup

Note: For easy reading of the formatted traces, you can redirect the rpttr command output to a file, since the output contains a significant volume of event logs.

The Configuration RM (IBM.ConfigRM)

The Configuration RM (IBM.ConfigRM) manages the peer domain configuration used by DB2 pureScale cluster services. In addition, it provides quorum support, which is a means of ensuring data integrity when portions of a cluster lose communication. The Configuration RM updates its logs only when the state changes (a new event occurs). The logs are located in the following path:

/var/ct/IW/log/mc/IBM.ConfigRM/

The cluster topology service registers any change in an RSCT peer domain communication state. If a network adapter is considered down, the state change of the adapter is logged in the Configuration RM traces. The topology services logs are located in the following path, where <domain_name> is the name of your domain:

/var/ct/<domain_name>/log/cthats

The "hatsd" daemon is the primary daemon responsible for most of the work in topology services. It performs higher level organization, including setting up heartbeat rings. A heartbeat ring is a process wherein each node sends a message to one of its neighbors and expects to receive a reply from one of its other neighbors. The "hatsd" daemon can sometimes fail to detect a topology state change for the Network Interface Module (NIM), so you should check NIM logs first. The main logs begin with "nim" and contain the interface name being monitored:

nim.cthats.ib0 nim.cthats.ib0.001 nim.cthats.ib0.002 nim.cthats.ib0.003

The netmon library, used by NIM, monitors the state of each adapter (used by the NIM Topology Services processes) to determine whether the local adapter is alive. The log file name matches the NIM log name plus the prefix "nmDiag":

nmDiag.nim.cthats.ib0.001
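
For example, assuming the domain name db2domain, you can list the most recent topology services logs and the netmon diagnostic logs as follows:

ls -lt /var/ct/db2domain/log/cthats/ | head
ls /var/ct/db2domain/log/cthats/nmDiag.nim.cthats.*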

The log file for GPFS

GPFS writes both operational messages and error data to the MMFS log file. The MMFS log file is located in the /var/adm/ras directory on each node and is named mmfs.log.date.nodeName, where date is the timestamp when the GPFS instance started on the node, and nodeName is the name of the node. You can find the latest MMFS log file by using the symbolic file name /var/adm/ras/mmfs.log.latest.

GPFS records file system or disk failures using the error logging facility provided by the operating system:

  • The syslog facility on Linux platforms
  • The errpt facility on AIX platforms

You can view these failures by issuing the following command on an AIX platform:

errpt -a

Issue this command on a Linux platform:

grep "mmfs:" /var/log/messages

Scenario 1: Member failure and restart on its home host, with commented logs

This scenario covers the case where a software failure occurs on a DB2 member and the DB2 member can be automatically restarted on its home host. For the purpose of demonstrating this scenario, the db2sysc process on the DB2 member will be killed.

After the steps are shown, a detailed explanation of what information is written to the logs follows.

Initially, there are four hosts (called hostA, hostB, hostC, hostD) that can access the shared data of the DB2 instance named db2inst1 through a clustered file system. The instance contains two DB2 members: DB2 member 0 is running on home host hostA, DB2 member 1 is running on home host hostB.

The db2instance command shows the initial state of the healthy cluster:

Listing 7. Output from db2instance command with healthy cluster
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostB     hostB        NO    0                0            hostB-ib0
128 CF     PRIMARY    hostC     hostC        NO    -                0            hostC-ib0
129 CF     PEER       hostD     hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

To demonstrate the behavior in the case of a software failure for DB2 member 1 on hostB, we first issue the psdb2 command to list all the database manager processes running on hostB for the current instance, db2inst1:

Listing 8. Output from psdb2
> psdb2
25014 db2sysc (idle 997)
25018 db2sysc (idle 998)
25027 db2sysc (idle 999)
	
25416 db2sysc 1
	
25440 db2vend (PD Vendor Process - 1) 0 0
	
25517 db2acd 1 ,0,0,0,1,0,0,0000,1,0,8a8e50,14,1e014,2,0,1,11fc0,0x210000000,0x210000000,1
600000,1138029,2,372802f
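
Note that psdb2 is typically a local convenience alias or script rather than a DB2-supplied executable. If it is not available on your system, an equivalent view of the database manager processes can be obtained with the standard ps command (assuming the instance owner is db2inst1):

ps -fu db2inst1 | grep db2sysc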

We then use the kill -9 25416 system command to terminate the db2sysc process associated with DB2 member 1 on hostB:

> kill -9 25416

As soon as the process is killed, the DB2 member restart is initiated, which we can verify by running the db2instance command right after killing the process:

Listing 9. Running db2instance after killing the process
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER RESTARTING hostB     hostB        NO    0                0            hostB-ib0
128 CF     PRIMARY    hostC     hostC        NO    -                0            hostC-ib0
129 CF     PEER       hostD     hostD        NO    -                0            hostD-ib0

HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

By running the psdb2 command, we can also see the db2start process that is restarting the DB2 member:

Listing 10. psdb2 output showing restart of DB2 member
> psdb2
25014 db2sysc (idle 997)
25018 db2sysc (idle 998)
25027 db2sysc (idle 999)
	
27497 db2start NOMSG DATA 1 RESTART HOSTNAME hostB CM NETNAME hostB-ib0

You can also obtain the resource state from the resource model from the lssam command output:

Listing 11. lssam command output
Pending online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
'- Pending online IBM.Application:db2_db2inst1_1-rs
|- Offline IBM.Application:db2_db2inst1_1-rs:hostA
'- Pending online IBM.Application:db2_db2inst1_1-rs:hostB

The db2_db2inst1_1-rs resource associated with DB2 member 1 is in PENDING ONLINE state, which means that the member is being restarted by DB2 cluster services, as reported by the db2instance command.
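
If you want to watch the transition from PENDING ONLINE back to Online, a simple shell loop over the commands already shown is sufficient (an illustrative sketch; interrupt it with Ctrl-C):

while true; do lssam | grep db2_db2inst1_1-rs; sleep 5; done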

Once the DB2 member has been restarted on its home host, the db2instance command reports that the state of the cluster has returned to healthy, and the restarted DB2 member is ready to process a workload again:

Listing 12. Output from db2instance command
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostB     hostB        NO    0                0            hostB-ib0
128 CF     PRIMARY    hostC     hostC        NO    -                0            hostC-ib0
129 CF     PEER       hostD     hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

Corresponding events in the logs

The db2_db2inst1_1-rs resource corresponding to DB2 member 1 is reported as down on hostB (IBM.RecoveryRM):

2011-05-25 18:10:34.421792 R(hostA) T(4118616992) _RCD ReportState: Resource: db2_db2inst
1_1-rs/Fixed/IBM.Application/hostB  reported state change: 2

Subsequently, the Recovery RM issues a cleanup request against DB2 member 1 on its home host hostB:

2011-05-25 18:10:34.583168 R(hostA) T(4118616992) _RCD Cleanup: Resource db2_db2inst1_1-rs
cleaned up on node hostB

The cleanup request is reported as successful (IBM.RecoveryRM):

2011-05-25 18:10:36.506559 R(hostA) T(4118616992) _RCD RIBME-HIST: db2_db2inst1_1-rs/Float
/IBM.Application Cleanup: Cleanup order successfully completed.

You can also see the corresponding cleanup command executed from the GblRes RM log (IBM.GblResRM):

Listing 13. Formatted trace file with cleanup command
2011-05-25 18:10:34.584000 G(hostB) T(4125461408) _GBD Running cleanup command "/home/db2i
nst1/sqllib/adm/db2rocm 1 DB2 db2inst1 1 CLEANUP" for resource 0x6028 0xffff 0x5b83a378 0x
a4e2ad4a 0x1221057d 0xf2eda550. Supporter: 0x0000 0x0000 0x00000000 0x00000000 0x00000000 
0x00000000.
	
2011-05-25 18:10:36.498548 G(hostB) T(4122803104) _GBD Cleanup command for application 
resource "db2_db2inst1_1-rs" (handle 0x6028 0xffff 0x5b83a378 0xa4e2ad4a 0x1221057d 
0xf2eda550) succeeded with exit code 0

You can see corresponding events in the db2diag log files when the DB2 member has successfully executed cleanup on its home host hostB:

Listing 14. db2diag log file when DB2 member has executed cleanup
2011-05-25-18.10.36.485126-240 I128222E517           LEVEL: Event
PID     : 27434                TID  : 46912763496800 PROC : db2rocm 1 [db2inst1]
INSTANCE: db2inst1             NODE : 001
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:949
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-05-25-18.10.34.787912000
DATA #2 : String, 32 bytes
db2rocm 1 DB2 db2inst1 1 CLEANUP
DATA #3 : String, 5 bytes
BEGIN
	
2011-05-25-18.10.36.492805-240 I132248E520           LEVEL: Event
PID     : 27434                TID  : 46912763496800 PROC : db2rocm 1 [db2inst1]
INSTANCE: db2inst1             NODE : 001
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1796
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), 
PD_TYPE_SQLHA_ER_PDINFO, 80 bytes
Original timestamp: 2011-05-25-18.10.36.484918000
DATA #2 : String, 32 bytes
db2rocm 1 DB2 db2inst1 1 CLEANUP
DATA #3 : String, 7 bytes
SUCCESS

When cleanup has completed, you will see an online request issued to restart the DB2 member on its home host hostB (IBM.RecoveryRM):

2011-05-25 18:10:36.563398 R(hostB) T(4118616992) _RCD Resource::doRIBMAction 
Online Request against db2_db2inst1_1-rs on node hostB.

Finally, the DB2 member is restarted successfully and its corresponding resource state is reported as ONLINE once again (IBM.RecoveryRM):

2011-05-25 18:10:54.947427 R(hostB) T(4118616992) _RCD ReportState: Resource : db2_db2inst
1_1-rs/Fixed/IBM.Application/hostB  reported state change: 1

You can also see the corresponding online request execution (IBM.GblResRM):

Listing 15. Corresponding online request execution
2011-05-25 18:10:36.563972 G(hostB) T(4128926624) _GBD Bringing application resource 
online: Name=db2_db2inst1_1-rs Handle=0x6028 0xffff 0x5b83a378 0xa4e2ad4a 
0x1221057d 0xf2eda550

2011-05-25 18:10:54.937889 G(hostB) T(4122803104) _GBD Start command for application resou
rce "db2_db2inst1_1-rs" (handle 0x6028 0xffff 0x5b83a378 0xa4e2ad4a 0x1221057d 0xf2eda550)
succeeded with exit code 0

Looking at the output of the lssam command, you will see that the resource is indeed online again on its home host hostB (while being offline on the possible guest host hostA):

Listing 16. Output of lssam showing resource online on home host
Online IBM.ResourceGroup:db2_db2inst1_1-rg Nominal=Online
'- Online IBM.Application:db2_db2inst1_1-rs
'- Online IBM.Application:db2_db2inst1_1-rs:hostB

In the db2diag log files, the following DB2 member start events are logged:

Listing 17. db2diag log file showing DB2 member start events
2011-05-25-18.10.36.777580-240 I132769E364           LEVEL: Event
PID     : 27496                TID  : 46912763496800 PROC : db2rocm 1 [db2inst1]
INSTANCE: db2inst1             NODE : 001
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:949
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 1 START
DATA #2 : String, 5 bytes
BEGIN
	
2011-05-25-18.10.54.932894-240 I534141E367           LEVEL: Event
PID     : 27496                TID  : 46912763496800 PROC : db2rocm 1 [db2inst1]
INSTANCE: db2inst1             NODE : 001
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1796
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 1 START
DATA #2 : String, 7 bytes
SUCCESS

DB2 member 1 successfully restarts on its home host, hostB, and can now process new transactions. The state for DB2 member 1 is STARTED again and if you query the status of the members using the db2instance -list command, the current host is still shown as hostB.


Scenario 2: Member failover and failback (cable unplugged scenario), with commented logs

In this scenario, a hardware error occurs. Specifically, someone accidentally pulls out the InfiniBand communication cable.

DB2 cluster services automatically restarts the member in restart light mode on another active host in the cluster. When the failed home host comes back online, the member in restart light mode automatically fails back to its home host and becomes a fully active member that can process new transactions again.

After the steps are shown, a detailed explanation of what is written to the logs follows.

Initially, there are four hosts (hostA, hostB, hostC, hostD) that can access the shared data of the DB2 instance, db2inst1, through a clustered file system. DB2 member 0 is running on home host hostA, DB2 member 1 is running on home host hostC. There is an InfiniBand link between the members and the CF, which is critical for both the DB2 members and the CF.

Listing 18. Output from db2instance showing the configuration
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib1
129 CF     CATCHUP    hostD     hostD        NO    -                0            hostD-ib1

HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

When someone trips over and unplugs the InfiniBand communication cable, DB2 cluster services detects that a failure has occurred on hostA. The network interface resource state changes to OFFLINE for hostA. You can obtain the resource state from the lssam command output:

Listing 19. Resource state from lssam
Online IBM.Equivalency:db2_private_network_db2inst1_0
|- Offline IBM.NetworkInterface:ib0:hostA
'- Online IBM.NetworkInterface:ib0:hostC

DB2 cluster services automatically initiates a restart of member 0 on a guest host:

Listing 20. db2instance showing restart of member 0
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER RESTARTING hostA     hostC        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib1
129 CF     CATCHUP    hostD     hostD        NO    -                0            hostD-ib1

HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      YES
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO
There is currently an alert for a member, CF, or host in the data-sharing instance. For mo
re information on the alert, its impact, and how to clear it, run the following command: '
db2cluster -cm -list -alert'.

After DB2 cluster services has completed the restart of member 0, we can see that it has been restarted on guest host hostC:

Listing 21. db2instance showing the alert
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0  MEMBER WAITING_FOR.hostA  hostC        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib1
129 CF     CATCHUP    hostD     hostD        NO    -                0            hostD-ib1

HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      YES
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO
There is currently an alert for a member, CF, or host in the data-sharing instance. For 
more information on the alert, its impact, and how to clear it, run the following 
command: 'db2cluster -cm -list -alert'.


$ db2cluster -cm -list -alert
1.
Alert: Host 'hostA' is not responding on the network adapter 'ib0'. Check the operating
system for error messages related to this network adapter and check the connections to 
the adapter as well.

Action: This alert will clear itself when the network adapter starts to respond. This 
alert cannot be cleared manually with db2cluster.

Impact: DB2 members and CFs on the affected host that require access to network adapter 
'ib0' will not be operational until the problem is resolved. While the network adapter 
is offline, the DB2 members on this host will be in restart light mode on other systems, 
and will be in the WAITING_FOR_FAILBACK state. The affected CFs will not be available 
for CF failover and will remain in the STOPPED state until the network adapter issue is 
resolved.
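
At this point it is usually worth confirming the adapter state at the operating system level before investigating further. The exact commands depend on the platform and the InfiniBand stack installed, so treat the following as illustrative examples only:

ip link show ib0      # Linux: link and operational state of the InfiniBand interface
ifconfig ib0          # AIX or Linux: interface status and flags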

After plugging the InfiniBand cable back in, DB2 cluster services automatically detects that hostA is back online and resets the network interface to ONLINE for hostA:

Listing 22. Network interface reset
Online IBM.Equivalency:db2_private_network_db2inst1_0
|- Online IBM.NetworkInterface:ib0:hostA
'- Online IBM.NetworkInterface:ib0:hostC

Then DB2 cluster services terminates DB2 member 0 on guest host hostC and invokes a normal DB2 restart of DB2 member 0 on home host hostA:

Listing 23. Restart of DB2 member 0 on home host
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib1
129 CF     CATCHUP    hostD     hostD        NO    -                0            hostD-ib1
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO
> db2cluster -cm -list -alert
There are no alerts

Corresponding events in the logs:

The network interface corresponding to adapter ib0 on host hostA is reported as down (IBM.RecoveryRM):

2011-02-28 06:15:26.411814 R(hostA) T(4117486496) _RCD ReportState: Resource : ib0/Fixed/I
BM.NetworkInterface/hostA reported state change: 2

The Recovery RM issues an offline request against member 0 on its home host hostA:

Listing 24. Offline request against member 0
2011-02-28 06:15:26.421832 R(hostA) T(4117486496) _RCD Resource::doRIBMAction Offline Requ
est against db2_db2inst1_0-rs on node hostA.
2011-02-28 06:15:26.425470 G(hostA) T(4127394720) _GBD Taking application resource offline
: Name=db2_db2inst1_0-rs Handle=0x6028 0xffff 0x77babcfd 0xac578906 0x120693f5 0xeea64740
2011-02-28 06:15:26.425613 R(hostA) T(4117486496) _RCD ReportState: Resource : db2_db2inst
1_0-rs/Fixed/IBM.Application/hostA reported state change: 6
2011-02-28 06:15:29.009743 G(hostA) T(4123753376) _GBD Stop command for application resour
ce "db2_db2inst1_0-rs" (handle 0x6028 0xffff 0x77babcfd 0xac578906 0x120693f5 0xeea64740) 
succeeded with exit code 0
2011-02-28 06:15:29.288071 R(hostA) T(4117486496) _RCD ReportState: Resource : db2_db2inst
1_0-rs/Fixed/IBM.Application/hostA reported state change: 2

You can see corresponding events in the db2diag log files when the member has successfully stopped on its home host hostA:

Listing 25. db2diag.log file when member has stopped on home host
2011-02-28-06.15.28.988205-300 I981598E541           LEVEL: Event
PID : 1613                     TID : 46912880928864  KTID : 1613
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostA
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:915
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-02-28-06.15.26.693115000
DATA #2 : String, 40 bytes
db2rocm 1 DB2 db2inst1 0 STOP (SA_RESET)
DATA #3 : String, 5 bytes
BEGIN
	
2011-02-28-06.15.29.001230-300 I984566E544           LEVEL: Event
PID : 1613                     TID : 46912880928864  KTID : 1613
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostA
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1590
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-02-28-06.15.28.987779000
DATA #2 : String, 40 bytes
db2rocm 1 DB2 db2inst1 0 STOP (SA_RESET)
DATA #3 : String, 7 bytes
SUCCESS

Now, member 0 is restarted on guest host hostC (IBM.RecoveryRM):

Listing 26. Member 0 restarted on guest host
2011-02-28 06:15:29.439708 R(hostA) T(4117486496) _RCD Resource::doRIBMAction Online Reque
st against db2_db2inst1_0-rs on node hostC.
2011-02-28 06:15:35.491788 G(hostC) T(4122594208) _GBD Start command for application resou
rce "db2_db2inst1_0-rs" (handle 0x6028 0xffff 0x5b83a378 0xa4e2ad4a 0x120693f5 0xeea67620)
succeeded with exit code 0
2011-02-28 06:15:35.496093 R(hostA) T(4117486496) _RCD ReportState: Resource : db2_db2inst
1_0-rs/Fixed/IBM.Application/hostC reported state change: 1

The db2diag log files show:

Listing 27. db2diag.log showing start on guest host
2011-02-28-06.15.29.710141-300 I996097E381           LEVEL: Event
PID : 17290                    TID : 46912880937056  KTID : 17290
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostC
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:915
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 0 START
DATA #2 : String, 5 bytes
BEGIN
	
2011-02-28-06.15.35.482268-300 I1342111E384          LEVEL: Event
PID : 17290                    TID : 46912880937056  KTID : 17290
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostC
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1590
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 0 START
DATA #2 : String, 7 bytes
SUCCESS

When the InfiniBand cable is plugged back in, the adapter (ib0) on hostA reports as online again (IBM.RecoveryRM):

2011-02-28 06:22:40.503749 R(hostA) T(4117486496) _RCD ReportState: Resource : ib0/Fixed/I
BM.NetworkInterface/hostA reported state change: 1

Therefore, member 0 is stopped on guest host hostC:

Listing 28. Member 0 stopped on guest host
2011-02-28 06:22:40.549388 R(hostA) T(4117486496) _RCD Resource::doRIBMAction Offline Requ
est against db2_db2inst1_0-rs on node hostC.
2011-02-28 06:22:43.402484 G(hostC) T(4122594208) _GBD Stop command for application resour
ce "db2_db2inst1_0-rs" (handle 0x6028 0xffff 0x5b83a378 0xa4e2ad4a 0x120693f5 0xeea67620) 
succeeded with exit code 0

The db2diag log files show the event:

Listing 29. db2diag log showing event
2011-02-28-06.22.43.373821-300 I2291438E542          LEVEL: Event
PID : 19699                    TID : 46912880937056  KTID : 19699
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostC
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:915
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-02-28-06.22.40.810556000
DATA #2 : String, 40 bytes
db2rocm 1 DB2 db2inst1 0 STOP (SA_RESET)
DATA #3 : String, 5 bytes
BEGIN
	
2011-02-28-06.22.43.392086-300 I2295670E545          LEVEL: Event
PID : 19699                    TID : 46912880937056  KTID : 19699
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostC
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1590
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-02-28-06.22.43.373504000
DATA #2 : String, 40 bytes
db2rocm 1 DB2 db2inst1 0 STOP (SA_RESET)
DATA #3 : String, 7 bytes
SUCCESS

Now, member 0 is restarted on its home host, hostA (IBM.RecoveryRM):

Listing 30. Member 0 restarted on home host
2011-02-28 06:22:43.825145 R(hostA) T(4117486496) _RCD Resource::doRIBMAction Online Reque
st against db2_db2inst1_0-rs on node hostA.
2011-02-28 06:22:55.189619 G(hostA) T(4123360160) _GBD Start command for application resou
rce "db2_db2inst1_0-rs" (handle 0x6028 0xffff 0x77babcfd 0xac578906 0x120693f5 0xeea64740)
succeeded with exit code 0
2011-02-28 06:22:55.193415 R(hostA) T(4117486496) _RCD ReportState: Resource : db2_db2inst
1_0-rs/Fixed/IBM.Application/hostA reported state change: 1

The db2diag log files show the event:

Listing 31. db2diag log file showing member 0 restarted
2011-02-28-06.22.44.102847-300 I2369547E380          LEVEL: Event
PID : 4468                     TID : 46912880928864  KTID : 4468
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostA
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:915
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 0 START
DATA #2 : String, 5 bytes
BEGIN
	
2011-02-28-06.22.55.178487-300 I2763321E383          LEVEL: Event
PID : 4468                     TID : 46912880928864  KTID : 4468
PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1             NODE : 000
HOSTNAME: hostA
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1590
DATA #1 : String, 30 bytes
db2rocm 1 DB2 db2inst1 0 START
DATA #2 : String, 7 bytes
SUCCESS

Scenario 3: Cluster Caching Facility (CF) takeover

In this scenario, a DB2 pureScale Feature instance is configured with two cluster caching facilities (CFs). One is designated the primary CF and the other is the secondary CF. DB2 members automatically duplex critical information, such as lock modification status and group buffer pool pages, to both CFs. Since at the beginning of this scenario the secondary CF is in peer state, we know that critical information is duplexed and up to date on the secondary CF. If the primary CF fails, the secondary CF quickly takes over the responsibilities of the primary CF so that the failure is nearly invisible to applications.
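
Before relying on automatic takeover, you can confirm that the secondary CF has actually reached PEER state by using the db2instance command already shown in this tutorial:

> db2instance -list | grep CF
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib0
129 CF     PEER       hostD     hostD        NO    -                0            hostD-ib0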

Initially, there are four hosts (hostA, hostB, hostC, hostD) that can access the shared data of the DB2 instance db2inst1 through a clustered file system. DB2 member 0 is running on home host hostA, and DB2 member 1 is running on home host hostC. There is an InfiniBand link between the members and the CFs, which is critical for both the DB2 members and the CFs, and thus both the members and the CFs have dependencies on the InfiniBand equivalency.

Listing 32. Output from db2instance showing configuration
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY    hostB     hostB        NO    -                0            hostB-ib0
129 CF     PEER       hostD     hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

If the CF fails on hostB, the primary role fails over to the secondary CF, which is in peer state. DB2 cluster services detects that the CF processes are down on the home host and invokes the CF cleanup command on the failing host (IBM.RecoveryRM):

Listing 33. Cluster services detecting that CF processes are down and invoking cleanup
2011-04-21 11:17:49.485951 G(hostB) T(4124957600) _GBD Monitor detect OpState change for r
esource Name=ca_db2inst1_0-rs OldOpState=1 NewOpState=2 Handle=0x6028 0xffff 0x88a0f77a 0x
4444ff43 0x12168f02 0x33bd39f8
2011-04-21 11:17:49.547613 R(hostB) T(4118584224) _RCD Cleanup: 
Resource ca_db2inst1_0-rs/Concurrent/IBM.Application cleanup order for constituent 
a_db2inst1_0-rs/Fixed/IBM.Application/hostB

The db2diag log files contain log entries for the issuing of this command and the result of the cleanup:

Listing 34. db2diag log file showing cleanup
2011-04-21-11.17.54.442397-240 I45376528E522         LEVEL: Event
PID     : 4287                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:927
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-04-21-11.17.50.333529000
DATA #2 : String, 34 bytes
db2rocme 1 CF db2inst1 128 CLEANUP
DATA #3 : String, 5 bytes
BEGIN
	
2011-04-21-11.17.54.450788-240 I45379976E525         LEVEL: Event
PID     : 4287                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1639
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-04-21-11.17.54.442222000
DATA #2 : String, 34 bytes
db2rocme 1 CF db2inst1 128 CLEANUP
DATA #3 : String, 7 bytes
SUCCESS

The primary role is started on the secondary CF (on hostD):

Listing 35. db2instance shows primary role started on secondary CF
> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     RESTARTING hostB     hostB        NO    -                0            hostB-ib0
129 CF     BECOMING_PR.hostD   hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

IBM.RecoveryRM:

Listing 36. IBM.RecoveryRM
2011-04-21 11:17:52.137049 G(hostB) T(4124760992) _GBD Stop command for application resour
ce "primary_db2inst1_900-rs" (handle 0x6028 0xffff 0x88a0f77a 0x4444ff43 0x12168f06 0xe30a
b210) succeeded with exit code 0
2011-04-21 11:17:52.537322 R(hostB) T(4118584224) _RCD Cleanup: Resource primary_db2inst1_
900-rs/Fixed/IBM.Application/hostD is dirty.
2011-04-21 11:17:52.539174 G(hostD) T(4128926624) _GBD Bringing application resource onlin
e: Name=primary_db2inst1_900-rs Handle=0x6028 0xffff 0x1802cd9b 0xb77f5380 0x12168f06 0xe3
85c1f8
2011-04-21 11:17:52.543773 R(hostB) T(4118584224) _RCD Resource::doRIBMAction Online Reque
st against primary_db2inst1_900-rs on node hostD.
2011-04-21 11:18:08.446049 G(hostD) T(4123556768) _GBD Start command for application res
ource "primary_db2inst1_900-rs" (handle 0x6028 0xffff 0x1802cd9b 0xb77f5380 0x12168f06 0xe
385c1f8) succeeded with exit code 0

The db2diag log files contain the following event for the primary role restart:

Listing 37. db2diag log file showing event for primary role restart
2011-04-21-11.17.52.784069-240 I45349553E400         LEVEL: Event
PID     : 2560                 TID  : 46912733618560 PROC : db2rocme 900 [db2inst1]
INSTANCE: db2inst1             NODE : 900
HOSTNAME: hostD
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:927
DATA #1 : String, 63 bytes
/home/db2inst1/sqllib/adm/db2rocme 1 PRIMARY db2inst1 900 START
DATA #2 : String, 5 bytes
BEGIN
	
2011-04-21-11.18.08.440950-240 I46272100E403         LEVEL: Event
PID     : 2560            TID  : 46912733618560 PROC : db2rocme 900 [db2inst1]
INSTANCE: db2inst1        NODE : 900
HOSTNAME: hostD
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1639
DATA #1 : String, 63 bytes
/home/db2inst1/sqllib/adm/db2rocme 1 PRIMARY db2inst1 900 START
DATA #2 : String, 7 bytes
SUCCESS

DB2 cluster services attempts to restart the failing CF processes on hostB. After the CF is successfully restarted, it moves into CATCHUP state, followed by PEER state when the failover is complete.

Figure 2. Cluster Caching Facility state transition
Diagram shows client workstations connected to DB2 data server, which is connected to the cluster caching facility
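
If you want to watch this transition as it happens, you can poll the db2instance output and pick out the CF rows. The following loop is only a sketch; the 10-second interval is an arbitrary choice, and you stop it manually once CF 128 reports PEER.

# Print the CF rows every 10 seconds; stop with Ctrl-C when the state settles.
while true
do
    db2instance -list | grep " CF "
    sleep 10
done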

The Recovery RM log shows:

Listing 38. IBM.RecoveryRM log
2011-04-21 11:17:53.885765 G(hostB) T(4124269472) _GBD Cleanup command for application res
ource "ca_db2inst1_0-rs" (handle 0x6028 0xffff 0x88a0f77a 0x4444ff43 0x12168f02 0x33bd39f8
) succeeded with exit code 0
2011-04-21 11:17:53.964794 R(hostB) T(4118584224) _RCD Resource::doRIBMAction Online Reque
st against ca_db2inst1_0-rs on node hostB.
2011-04-21 11:17:53.966743 G(hostB) T(4128926624) _GBD Bringing application resource onlin
e: Name=ca_db2inst1_0-rs Handle=0x6028 0xffff 0x88a0f77a 0x4444ff43 0x12168f02 0x33bd39f8
2011-04-21 11:18:08.947076 G(hostB) T(4124269472) _GBD Start command for app
lication resource "ca_db2inst1_0-rs" (handle 0x6028 0xffff 0x88a0f77a 0x4444ff43 0x12168f0
2 0x33bd39f8) succeeded with exit code 0

The db2diag log files show:

Listing 39. db2diag.log file
2011-04-21-11.17.54.442397-240 I45376528E522         LEVEL: Event
PID     : 4287                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:927
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-04-21-11.17.50.333529000
DATA #2 : String, 34 bytes
db2rocme 1 CF db2inst1 128 CLEANUP
DATA #3 : String, 5 bytes
BEGIN
	
2011-04-21-11.17.54.450788-240 I45379976E525         LEVEL: Event
PID     : 4287                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1639
DATA #1 : SQLHA Event Recorder header data (struct sqlhaErPdInfo), PD_TYPE_SQLHA_ER_PDINFO
, 80 bytes
Original timestamp: 2011-04-21-11.17.54.442222000
DATA #2 : String, 34 bytes
db2rocme 1 CF db2inst1 128 CLEANUP
DATA #3 : String, 7 bytes
SUCCESS

2011-04-21-11.17.54.787350-240 I45392043E395         LEVEL: Event
PID     : 4358                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:927
DATA #1 : String, 58 bytes
/home/db2inst1/sqllib/adm/db2rocme 1 CF db2inst1 128 START
DATA #2 : String, 5 bytes
BEGIN
	
2011-04-21-11.18.09.512434-240 I46330550E398         LEVEL: Event
PID     : 4358                 TID  : 46912733618560 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1             NODE : 128
HOSTNAME: hostB
FUNCTION: DB2 UDB, high avail services, db2rocm_main, probe:1639
DATA #1 : String, 58 bytes
/home/db2inst1/sqllib/adm/db2rocme 1 CF db2inst1 128 START
DATA #2 : String, 7 bytes
SUCCESS

The db2instance -list command now shows the restarted CF (ID 128) in CATCHUP state:

> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     CATCHUP    hostB     hostB        NO    -                0            hostB-ib0
129 CF     PRIMARY    hostD     hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

When catchup completes, the CF on hostB moves to PEER state:

> db2instance -list
ID  TYPE   STATE      HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----      --------- ------------ ----- ---------------- ------------ -------
0   MEMBER STARTED    hostA     hostA        NO    0                0            hostA-ib0
1   MEMBER STARTED    hostC     hostC        NO    0                0            hostC-ib0
128 CF     PEER       hostB     hostB        NO    -                0            hostB-ib0
129 CF     PRIMARY    hostD     hostD        NO    -                0            hostD-ib0
	
HOSTNAME             STATE                INSTANCE_STOPPED        ALERT
--------             -----                ----------------        -----
hostA                ACTIVE               NO                      NO
hostB                ACTIVE               NO                      NO
hostC                ACTIVE               NO                      NO
hostD                ACTIVE               NO                      NO

Note that in a similar scenario, if the secondary CF is not in PEER state, the situation is more complicated. In this case, there will be an interruption in transactional processing and the failure will affect clients connected to the database. When the second CF is not in PEER state, it does not have the most up-to-date information about the state of the shared data instance. Therefore, if the primary CF fails, the shared data instance must perform a group restart. DB2 cluster services will initiate the group restart automatically.
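
Because a group restart interrupts all connected applications, it is worth confirming that the secondary CF is in PEER state, and that no alerts are outstanding, before any planned action against the primary CF. The commands below are a minimal sketch; the awk filter is simply one way to isolate the CF rows, and db2cluster -cm -list -alert lists the alerts known to the cluster manager.

# Show only the CF rows; the secondary CF should report PEER.
db2instance -list | awk '$2 == "CF"'

# List any outstanding alerts recorded by the cluster manager.
db2cluster -cm -list -alert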


Conclusion

There are many sources of important information about the status of your DB2 pureScale cluster. DB2 cluster services monitors events and writes logs that record relevant information about successful and failed situations. You can use the db2cluster and db2instance commands in conjunction with this DB2 cluster services log information to understand what happens within the cluster and diagnose and remedy problems.

If you are working with IBM technical support, you can also use the db2support command to collect diagnostic data, traces, and log files for DB2 pureScale cluster services, together with DB2 logging information. The output files can then be sent to IBM technical support for further investigation and troubleshooting.
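
For example, an invocation along the following lines gathers the data into a single output location. The output directory and database name are placeholders, and the -cl (cluster diagnostics collection level) option should be verified against the db2support documentation for your DB2 release.

# Collect DB2 diagnostics plus cluster services information into /tmp/collect;
# -cl 2 is intended to include cluster manager and cluster file system data.
db2support /tmp/collect -d SAMPLE -cl 2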


Appendix A: Access privileges required to read resource manager traces

To format and read resource manager traces, for example those in /var/ct/<domain_name>/log/mc/IBM.GblResRM/, use the following command:

rpttr -o dtic /var/ct/<domain_name>/log/mc/IBM.GblResRM/trace_summary*

You need read and write access to the /var/ct/<domain_name>/log/mc/IBM.GblResRM/ directory, and read access to the log files in that directory.

If you do not have write access to the directory, read access to the directory and to the log files is sufficient. In that case, you can copy the log files to a directory you own and format the copies:

Listing 40. Approach for reading the logs if write access is not allowed
# Copy the trace files to a directory that you own
mkdir ~/gblresrm_logs_copy
cp /var/ct/<domain_name>/log/mc/IBM.GblResRM/trace_summary* ~/gblresrm_logs_copy

# Format the copies; the second form saves the formatted output to a file
rpttr -o dtic ~/gblresrm_logs_copy/trace_summary*
rpttr -o dtic ~/gblresrm_logs_copy/trace_summary* > formatted_gblresrm.log

Note that, by default, permissions on these log files allow read and write access for root only. An administrator must therefore grant the required additional privileges on these log files to any other user who needs to read them.
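
How the administrator grants that access depends on local policy. One possible approach on a Linux host with POSIX ACL support is sketched below; the diagusers group name is a hypothetical example, and the commands must be run as root.

# Allow members of the diagusers group to enter the directory and read the traces.
setfacl -m g:diagusers:rx /var/ct/<domain_name>/log/mc/IBM.GblResRM
setfacl -m g:diagusers:r  /var/ct/<domain_name>/log/mc/IBM.GblResRM/trace_summary*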


Background information

The following information is optional and is provided for reference only.

Tivoli SA MP and previous versions of the DB2 database manager

The DB2 database manager has been integrated with Tivoli SA MP for some time. Versions 9.5 and 9.7 of the DB2 database system integrate with Tivoli SA MP to provide high availability for partitioned database, single-partition ESE, and high availability disaster recovery (HADR) configurations, but they do not use Tivoli SA MP or cluster management automation by default. Several manual steps are required for initial cluster creation, and for the addition and removal of DB2 partitions.

See the Resources section for links to books and other resources that provide further information about integrating the DB2 database system with Tivoli SA MP prior to Version 9.8.


Contributors

In addition to the authors, the following individuals contributed to this tutorial:

  • Joyce Simmonds - DB2 Information Developer
  • Nikolaj Richers - DB2 Information Developer

Resources
