CF server failure

Use the information in this topic to determine whether a cluster caching facility (CF) component has failed.

Symptoms

A Db2® instance fails to start when the db2start command is run.

Diagnosing a CF server failure

Refer to the SQLCODEs in the db2start command output.

To determine whether a CF failed to start, run db2instance -list. The output might show CFs in a STOPPED or ERROR state, depending on when the startup failure occurred.

The following example shows sample output from db2instance -list:

ID        TYPE             STATE                HOME_HOST               CURRENT_HOST   ...
--        ----             -----                ---------               ------------
0       MEMBER           STOPPED                host01                  host01      
1       MEMBER           STOPPED                host02                  host02      
2       MEMBER           STOPPED                host03                  host03      
128     CF               STOPPED                host04                  host04      
129     CF               STOPPED                host05                  host05      


ALERT       PARTITION_NUMBER        LOGICAL_PORT    NETNAME   ...
-----       ----------------        ------------    -------   
   NO                      0                   0    host01-ib0
   NO                      0                   0    host02-ib0
   NO                      0                   0    host03-ib0
   NO                      -                   0    host04-ib0
   NO                      -                   0    host05-ib0


HOSTNAME                       STATE                INSTANCE_STOPPED        ALERT
--------                       -----                ----------------        -----
host01                        ACTIVE                              NO           NO
host02                        ACTIVE                              NO           NO
host03                        ACTIVE                              NO           NO
host04                        ACTIVE                              NO           NO
host05                        ACTIVE                              NO           NO
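
When several CFs and members are listed, the check for failed CFs can be automated. The following is a minimal sketch, not part of Db2, that scans text in the layout of the sample above and reports CF rows in the STOPPED or ERROR state. The function name unstarted_cfs and the whitespace-splitting approach are assumptions made for illustration; the embedded sample is the first block of the listing shown above.

```python
# Sample text copied from the first block of the db2instance -list output above.
LISTING = """\
ID        TYPE             STATE                HOME_HOST               CURRENT_HOST
--        ----             -----                ---------               ------------
0       MEMBER           STOPPED                host01                  host01
1       MEMBER           STOPPED                host02                  host02
2       MEMBER           STOPPED                host03                  host03
128     CF               STOPPED                host04                  host04
129     CF               STOPPED                host05                  host05
"""

# States that this topic names as signs of a failed CF startup.
FAILED_STATES = {"STOPPED", "ERROR"}


def unstarted_cfs(listing):
    """Return (id, state, current_host) for each CF row in a failed state.

    Hypothetical helper: it assumes that data rows are whitespace-separated
    and begin with a numeric ID, as in the sample output.
    """
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip headers, separators, and the host table: they do not
        # start with a numeric ID followed by at least four more columns.
        if len(fields) >= 5 and fields[0].isdigit() and fields[1] == "CF":
            if fields[2] in FAILED_STATES:
                found.append((int(fields[0]), fields[2], fields[4]))
    return found


for cf_id, state, host in unstarted_cfs(LISTING):
    print(f"CF {cf_id} on {host} is {state}")
```

For the sample listing, the sketch reports CF 128 on host04 and CF 129 on host05.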

If any alerts are present, run db2cluster -cm -list -alert for more information. The alerts describe what might need to be fixed (for example, a network adapter or host that is offline), or point to the cfdiag*.log files for further details.

Look for related errors in the CF's db2diag log file that pertain to the time when the db2start command was run:

2009-11-09-02.32.46.967563-300 I261372A332          LEVEL: Severe
PID     : 1282088              TID  : 1             KTID : 4751433
PROC    : db2start
INSTANCE: db2inst1             NODE : 000
HOSTNAME: host04
EDUID   : 1
FUNCTION: Db2, base sys utilities, sqleIssueStartStop, probe:3973
MESSAGE : Failed to start any CF.

Search the sections of the db2diag log file that precede the previous trace point for more information about why the CF has not started. For example, if cluster services cannot start a CF, the db2diag log file might show:

2009-11-09-02.12.40.882897-300 I256778A398          LEVEL: Error
PID     : 737522               TID  : 1             KTID : 2371807
PROC    : db2havend
INSTANCE: db2inst1             NODE : 000
EDUID   : 1
FUNCTION: Db2, high avail services, db2haOnlineResourceGroup, probe:5982
DATA #1 : <preformatted>
Timeout waiting for resource group ca_db2inst1_0-rg to be online, last known OpState is 2
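
The search for earlier entries can be sketched in code. The following example is an illustration only, not a Db2 utility: it splits db2diag-style text into records and returns the Error and Severe records that were logged before the "Failed to start any CF." entry. The names split_records and errors_before are hypothetical, and the sketch assumes that every record in the file uses the same time zone offset, so the offset is ignored when timestamps are compared.

```python
import re
from datetime import datetime

# First line of a record, for example:
# "2009-11-09-02.32.46.967563-300 I261372A332          LEVEL: Severe"
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2})\.(\d+)[+-]\d+\s+\S+\s+LEVEL\s*:\s*(\w+)"
)

# Two abbreviated records based on the examples in this topic.
DIAG_TEXT = """\
2009-11-09-02.12.40.882897-300 I256778A398          LEVEL: Error
PROC    : db2havend
FUNCTION: Db2, high avail services, db2haOnlineResourceGroup, probe:5982
DATA #1 : <preformatted>
Timeout waiting for resource group ca_db2inst1_0-rg to be online, last known OpState is 2

2009-11-09-02.32.46.967563-300 I261372A332          LEVEL: Severe
PROC    : db2start
FUNCTION: Db2, base sys utilities, sqleIssueStartStop, probe:3973
MESSAGE : Failed to start any CF.
"""


def split_records(text):
    """Group lines into records; each record starts at a header line."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H.%M.%S")
            # Keep at most microsecond precision from the fractional part.
            micro = int(match.group(2)[:6].ljust(6, "0"))
            records.append({
                "time": stamp.replace(microsecond=micro),
                "level": match.group(3),
                "lines": [line],
            })
        elif records:
            records[-1]["lines"].append(line)
    return records


def errors_before(text, marker="Failed to start any CF."):
    """Return Error and Severe records logged before the marker record."""
    records = split_records(text)
    anchor = next(
        (r for r in records if any(marker in l for l in r["lines"])), None
    )
    if anchor is None:
        return []
    return [
        r for r in records
        if r["level"] in ("Error", "Severe") and r["time"] < anchor["time"]
    ]


for record in errors_before(DIAG_TEXT):
    print("\n".join(record["lines"]))
```

For the two sample records, only the earlier db2havend timeout entry is returned, which is the entry that explains why the CF did not start.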

Each CF writes information to the cfdiag*.log files and dumps more diagnostic data when required. The files reside in the directory that is set by the cf_diagpath database manager configuration parameter or, if cf_diagpath is not set, in the directory that is set by diagpath, which is $INSTHOME/sqllib_shared/db2dump/ $m by default.
- CF diagnostic log files (cfdiag-<timestamp>.<cf_id>*.log)
  - Each of these files keeps a log of the activities that are related to a CF. Events, errors, warnings, and additional debugging information are logged there. This log has a similar structure to the db2diag log file. A new log is created each time that a CF starts. The logging level is controlled by the cf_diaglevel database manager configuration parameter.
  - Note that there is a static CF diagnostic log name that always points to the most current diagnostic logging file for each CF and has the following format: cfdiag.<cf_id>.log
- CF output dump diagnostic files (cfdump.YYYYMMDDhhmmssuuuuuu.<host>.<cf_id>.out)
  - These files contain information regarding CF startup and stop. There might be some additional output in these files.
- Management LightWeight Daemon diagnostic log file (mgmnt_lwd_log.<cf_pid>)
  - This log file displays the log entries that pertain to the LightWeight Daemon (LWD) process for a particular CF. Errors in this log file indicate that the LWD has not started properly.
- CF stack files (CAPD.<cf_pid>.<tid>.thrstk)
  - These are stack files produced by the CF when it encounters a signal. These files are important for diagnosing a problem with the CF.
- CF trace files (CAPD.tracelog.<cf_pid>)
  - A default lightweight trace is enabled for the CF.
  - These trace files appear whenever the CF terminates or stops.
  - The trace files might indicate a problem with the CF, but these files are useful for diagnosing errors only when used in combination with other diagnostic data.
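
When the diagnostic directory holds many files, it can help to sort them by the file types in the preceding list. The following sketch is an illustration only: it matches file names against wildcard patterns derived from the name formats listed above, with each <...> placeholder replaced by a wildcard. The function name classify, the labels, and the sample file names other than the cfdiag and cfdump names used in this topic are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns derived from the name formats in the preceding list.
# The first pattern that matches decides the label.
PATTERNS = [
    ("CF diagnostic log", "cfdiag-*.log"),
    ("current CF diagnostic log", "cfdiag.*.log"),
    ("CF output dump", "cfdump.*.out"),
    ("LWD diagnostic log", "mgmnt_lwd_log.*"),
    ("CF stack file", "CAPD.*.thrstk"),
    ("CF trace file", "CAPD.tracelog.*"),
]


def classify(name):
    """Return a label for a diagnostic file name, or None if unrecognized."""
    for kind, pattern in PATTERNS:
        if fnmatchcase(name, pattern):
            return kind
    return None


# Example names: the first two appear in this topic, the rest are made up.
for name in [
    "cfdiag-20091109015035000037.128.log",
    "cfdump.20091109015035000037.host04.128.out",
    "cfdiag.128.log",
    "mgmnt_lwd_log.688182",
    "CAPD.688182.1.thrstk",
    "CAPD.tracelog.688182",
]:
    print(f"{name}: {classify(name)}")
```

In practice, the names would come from a listing of the cf_diagpath directory, for example from os.listdir.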

If the CF process starts successfully, startup and initialization messages are written to the CF dump files.

For example, the contents of cfdump.20091109015035000037.host04.128.out include a message that shows a successful process start:

CA Server IPC component Initialised: LWD BG buffer count: 16
              Session ID: 1d
CA Server IPC component Acknowledged LWD Startup Message
          Waiting for LWD to Configure Server
Processors: (4:4) PowerPC_POWER5 running at 1498 MHz

Cluster Accelerator initialized

Cluster Accelerator Object Information:
   OS: AIX 64-bit
   Compiler: xlC VRM (900)
   SVN Revision: 7584
   Built on: Oct 12 2009 at 17:00:54
   Executable generated with symbols
   Model Components Loaded: CACHE  LIST  LOCK
   Transport: uDAPL
   Number of HCAs: 1
   Device[0]: hca0
   CA Port[0]: 50638
   Mgmnt Port Type: TCP/IP
   Mgmnt Port: 50642
   IPC Key: 0xe50003d
   Total Workers: 4
   Conn/Worker: 128
   Notify conns: 256
   Processor Speed: 1498.0000 MHz

If the cfdump.*.out file does not contain the "Cluster Accelerator initialized" line, the "Cluster Accelerator Object Information" line, and the other lines shown in the preceding example, the CF process did not start successfully. An error message might be shown instead. Contact IBM Support for more information.
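
The check for the startup lines can be sketched as follows. This is a minimal illustration, not a Db2 tool: the marker strings are copied from the sample cfdump output shown earlier in this topic, and the function name missing_startup_markers is hypothetical. Other Db2 versions or platforms might word these lines differently.

```python
# Marker lines copied from the sample cfdump output in this topic.
STARTUP_MARKERS = (
    "Cluster Accelerator initialized",
    "Cluster Accelerator Object Information",
)


def missing_startup_markers(dump_text):
    """Return the startup marker lines that do not appear in the dump text."""
    return [marker for marker in STARTUP_MARKERS if marker not in dump_text]


# Abbreviated dump content based on the sample output.
good_dump = """\
CA Server IPC component Acknowledged LWD Startup Message
Cluster Accelerator initialized

Cluster Accelerator Object Information:
   OS: AIX 64-bit
"""

missing = missing_startup_markers(good_dump)
if missing:
    print("CF process might not have started; missing:", missing)
else:
    print("Startup markers found")
```

An empty result means that both marker lines are present. Any returned marker is a reason to look for an error message in the same file.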

In the following example, cfdiag-20091109015035000037.128.log records a successful process start. If the CF did not start properly, this log might be empty or contain error messages.

2009-11-09-01.50.37.0051837000-300 E123456789A779 LEVEL : Event
PID       : 688182 TID :          1
HOSTNAME  : host04
FUNCTION  : CA svr_init, mgmnt_castart
MESSAGE   : CA server log has been started.
DATA #1   :
Log Level: Error
Debugging : active
Cluster Accelerator Object Information
    AIX 64-bit
    Compiler: xlC VRM (900)
    SVN Revision: 7584
    Built on Oct 12 2009 at 17:00:59
    Executable generated with symbols.
    Executable generated with asserts.
    Model Components Loaded: CACHE, LIST, LOCK
    Transport: uDAPL
    Number of HCAs: 1
    Device[0]: hca0
    CA Port[0]: 50638
    Total Workers: 4
    Conn/Worker: 128
    Notify conns: 256
    Processor Speed: 1498.000000 Mhz.
    Allocatable Structure memory: 170 MB

Look for core files or stack traceback files in the CF_DIAGPATH directory.

If the cause of the error is still unknown, consult the system error log for the affected host. Log on to the CF host that did not start and view the system error log by running the errpt -a command (on Linux®, look in the /var/log/messages file). Look for related log entries at the time of the failure. In the example shown here, log in to host04 and host05, because CF 128 and CF 129 reside on these hosts.
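
A script that collects diagnostic data from several CF hosts needs to choose the system log source for each platform. The following sketch shows one way to express that choice. The function name system_log_source is hypothetical, and the sketch assumes that the operating system name is the string that Python's platform.system() returns on the host, which is expected to be "AIX" or "Linux".

```python
def system_log_source(system_name):
    """Map an operating system name to the system error log source.

    Returns ("command", argv) when the log is read by running a command,
    or ("file", path) when the log is a file to inspect.
    """
    if system_name == "AIX":
        return ("command", ["errpt", "-a"])
    if system_name == "Linux":
        return ("file", "/var/log/messages")
    raise ValueError(f"no system log source known for {system_name!r}")


# On the CF host itself, the name could come from platform.system().
for name in ("AIX", "Linux"):
    print(name, system_log_source(name))
```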

If db2cluster -cm -list -alert showed an alert, run db2cluster -cm -clear -alert after the problem is resolved, and then reissue the db2start command.