Primary role moves between cluster caching facility hosts

The primary role is not on the same host as it was the last time the db2instance -list command was run; that is, the primary role has moved to another cluster caching facility host. This host change indicates that a problem occurred with the cluster caching facility at some point in the past, and that problem might need to be investigated.

This is a sample output from the db2instance -list command showing a three member, two cluster caching facility environment:
ID        TYPE             STATE          HOME_HOST
--        ----             -----          ---------
0         MEMBER           STARTED        hostA    
1         MEMBER           STARTED        hostB    
2         MEMBER           STARTED        hostC    
128       CF               PEER           hostD    
129       CF               PRIMARY        hostE    

CURRENT_HOST  ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
------------  -----  ----------------  ------------  -------
hostA         NO                    0             0  hostA-ib0
hostB         NO                    0             0  hostB-ib0
hostC         NO                    0             0  hostC-ib0
hostD         NO                    -             0  hostD-ib0
hostE         NO                    -             0  hostE-ib0
	
HOSTNAME      STATE      INSTANCE_STOPPED ALERT
--------      -----      ---------------- -----
hostA         ACTIVE     NO               NO
hostB         ACTIVE     NO               NO
hostC         ACTIVE     NO               NO
hostD         ACTIVE     NO               NO
hostE         ACTIVE     NO               NO
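
Because the symptom is a change between two runs of db2instance -list, it can help to record which cluster caching facility holds the primary role each time the command is run. The following is a minimal sketch, assuming the output keeps the ID, TYPE, STATE, HOME_HOST column order shown in the sample; the function name cf_primary_host is hypothetical and is not part of Db2.

```shell
# Hypothetical helper: print the ID and HOME_HOST of the CF in PRIMARY state.
# Reads db2instance -list output on standard input and assumes that the
# first four columns are ID, TYPE, STATE, HOME_HOST, as in the sample output.
cf_primary_host() {
    awk '$2 == "CF" && $3 == "PRIMARY" { print $1, $4 }'
}

# Example use (requires a Db2 pureScale instance):
#   db2instance -list | cf_primary_host
```

Saving the result of each run and comparing it with the previous one shows when the primary role moved.
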
Like members, each cluster caching facility logs information to the cfdiag*.log files and dumps additional diagnostic data when required. The files reside in the directory set by the database manager configuration parameter cf_diagpath or, if that parameter is not set, in the diagpath directory or, by default, in $INSTHOME/sqllib_shared/db2dump/ $m.
  • cluster caching facility Diagnostic Log Files (cfdiag-timestamp.cf_id.log)
    • Each of these files keeps a log of the activities related to a cluster caching facility. Events, errors, warnings, and additional debugging information are logged here. This log has a similar appearance to the db2diag log file. A new log is created each time a cluster caching facility starts.
    • Note that there is a single static cluster caching facility diagnostic log name that always points to the most current diagnostic logging file for each cluster caching facility and has the following format: cfdiag.cf_id.log
  • cluster caching facility Output Dump Diagnostic Files (cfdump.out.cf_pid.hostname.cf_id)
    • These files contain information regarding cluster caching facility startup and stop. There might be some additional output shown here.
  • Management LWD Diagnostic Log File (mgmnt_lwd_log.cf_pid)
    • This log file displays the log entries of a particular cluster caching facility's LightWeight Daemon (LWD) process. Errors presented in this log file indicate the LWD has not started properly. A successful start will not have ERROR messages in the log.
  • cluster caching facility stack files (CAPD.cf_pid.tid.thrstk)
    • These are stack files produced by the cluster caching facility when it encounters a signal. These files are important for diagnosing a problem with the cluster caching facility.
  • cluster caching facility trace files (CAPD.tracelog.cf_pid)
    • A default lightweight trace is enabled for the cluster caching facility. These trace files appear whenever the cluster caching facility terminates or stops. They might indicate a problem with the cluster caching facility, but they are useful for diagnosing errors only in combination with other diagnostic data.
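
The file types in this list that carry the CF ID in their names can be gathered for one cluster caching facility with a short shell function. This is a minimal sketch, assuming the naming patterns listed above; the function name list_cf_diag_files is hypothetical, and the directory argument is whatever path applies to the instance (for example, the cf_diagpath value).

```shell
# Hypothetical helper: list diagnostic files whose names carry the given CF ID.
# $1 = diagnostic directory, $2 = CF ID (for example, 128).
# Only the cfdiag and cfdump.out patterns are matched; the LWD log, stack,
# and trace files are named by process ID rather than by CF ID.
list_cf_diag_files() {
    for f in "$1"/cfdiag*."$2".log "$1"/cfdump.out.*."$2"; do
        # A pattern that matches nothing is left unexpanded, so skip it.
        if [ -e "$f" ] || [ -L "$f" ]; then
            printf '%s\n' "${f##*/}"
        fi
    done
}

# Example use: list_cf_diag_files /path/to/diag/dir 128
```
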
A startup and initialization message is shown in the cluster caching facility dump files. For example, cfdump.out.1548476.host04.128 contains the following message, which shows a successful process start:
CA Server IPC component Initialised: LWD BG buffer count: 16
              Session ID: 1d
CA Server IPC component Acknowledged LWD Startup Message
          Waiting for LWD to Configure Server
Processors: (4:4) PowerPC_POWER5 running at 1498 MHz

Cluster Accelerator initialized

Cluster Accelerator Object Information:
   OS: AIX 64-bit
   Compiler: xlC VRM (900)
   SVN Revision: 7584
   Built on: Oct 12 2009 at 17:00:54
   Executable generated with symbols
   Model Components Loaded: CACHE  LIST  LOCK
   Transport: uDAPL
   Number of HCAs: 1
   Device[0]: hca0
   CF Port[0]: 50638
   Mgmnt Port Type: TCP/IP
   Mgmnt Port: 50642
   IPC Key: 0xe50003d
   Total Workers: 4
   Conn/Worker: 128
   Notify conns: 256
   Processor Speed: 1498.0000 MHz
In this example, cfdiag-20091109015035000037.128.log records a successful process start. If the cluster caching facility did not start properly, this log might be empty or might contain error messages. For example:
2009-11-09-01.50.37.0051837000-300 E123456789A779    LEVEL    : Event
PID       : 688182 TID :          1
HOSTNAME  : host04
FUNCTION  : CA svr_init, mgmnt_cfstart
MESSAGE   : CA server log has been started.
DATA #1   :
Log Level: Error
Debugging : active
Cluster Accelerator Object Information
    AIX 64-bit
    Compiler: xlC VRM (900)
    SVN Revision: 7584
    Built on Oct 12 2009 at 17:00:59
    Executable generated with symbols.
    Executable generated with asserts.
    Model Components Loaded: CACHE, LIST, LOCK
    Transport: uDAPL
    Number of HCAs: 1
    Device[0]: hca0
    CF Port[0]: 50638
    Total Workers: 4
    Conn/Worker: 128
    Notify conns: 256
    Processor Speed: 1498.000000 Mhz.
    Allocatable Structure memory: 170 MB
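
To see quickly whether a cfdiag log contains anything other than routine events, the entries can be tallied by their LEVEL value. The following is a minimal sketch, assuming that every entry header carries a "LEVEL    : value" field laid out as in the sample above; the function name cfdiag_level_counts is hypothetical, and the set of level names that can appear is not assumed.

```shell
# Hypothetical helper: count cfdiag entries by the value of their LEVEL field.
# $1 = path to a cfdiag log. Assumes the header layout "LEVEL    : <value>",
# that is, the word LEVEL, a separate colon, then the value.
cfdiag_level_counts() {
    awk '{
        for (i = 1; i + 2 <= NF; i++) {
            if ($i == "LEVEL" && $(i + 1) == ":") {
                count[$(i + 2)]++
            }
        }
    } END {
        for (lvl in count) print lvl, count[lvl]
    }' "$1" | LC_ALL=C sort
}

# Example use: cfdiag_level_counts cfdiag.128.log
```

An empty result means that the log has no entries in the assumed format, which is consistent with the empty log described for a failed start.
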
Find the relevant cluster caching facility diagnostic log files by looking for the cfdiag logs that have the same CF ID as the failed cluster caching facility. For example, if CF ID 128 failed (as in the previous db2instance -list output), use the following command:
$ ls cfdiag*.128.log

cfdiag.128.log -> cfdiag-20091109015035000215.128.log
cfdiag-20091110023022000037.128.log
cfdiag-20091109015035000215.128.log
Note that cfdiag.128.log always points to the most current cfdiag log for CF 128. Look into cfdiag-20091109015035000037.128.log (the previous cfdiag log) and the db2diag log file at a time corresponding to 2009-11-10-02.30.22.000215 for errors.
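
Matching a cfdiag log to the db2diag log requires converting the timestamp embedded in the file name into the timestamp style used in the log entries. The following is a minimal sketch, assuming the file name embeds YYYYMMDDhhmmss followed by six fractional-second digits; that interpretation and the function name cfdiag_start_time are assumptions, not documented behavior.

```shell
# Hypothetical helper: convert a cfdiag file name to a db2diag-style timestamp.
# Assumes the name has the form cfdiag-<YYYYMMDDhhmmss><6 digits>.<cf_id>.log.
# Prints nothing for names that do not match, such as the cfdiag.<cf_id>.log link.
cfdiag_start_time() {
    printf '%s\n' "${1##*/}" | sed -n \
        's/^cfdiag-\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{6\}\)\..*\.log$/\1-\2-\3-\4.\5.\6.\7/p'
}

# Example use: cfdiag_start_time cfdiag-20091110023022000037.128.log
```

The resulting value can then be used as the starting point when searching the db2diag log file for errors.
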

If the cause of the error is still unknown, consult the system error log for the affected host. Log on to the host of the unstarted cluster caching facility and view the system error log by running the errpt -a command (on Linux®, look in the /var/log/messages file). In the example shown here, log in to hostD because CF 128 experienced the failure.
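
The choice between the two system error logs can be scripted when the same procedure is run on hosts with different operating systems. This is a minimal sketch, assuming the operating system name is taken from uname -s; the function name system_error_log_hint is hypothetical.

```shell
# Hypothetical helper: print where to find the system error log for an
# operating system name as reported by `uname -s`.
system_error_log_hint() {
    case "$1" in
        AIX)   echo 'errpt -a' ;;
        Linux) echo '/var/log/messages' ;;
        *)     echo "no suggestion for $1" ;;
    esac
}

# Example use on the affected host: system_error_log_hint "$(uname -s)"
```
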