CF server failure

Use the information in this topic to determine whether a cluster caching facility (CF) component has failed.

Symptoms

A Db2 instance fails to start on the execution of the db2start command.

Diagnosing a CF server failure

  • Refer to the SQLCODEs in the db2start command output.
  • To determine whether a CF has not started, run db2instance -list. This information might show CFs in a STOPPED or ERROR state if the startup has failed, depending on when the failure occurs.
    • The following example shows sample output from db2instance -list:
      ID        TYPE             STATE                HOME_HOST               CURRENT_HOST   ...
      --        ----             -----                ---------               ------------
      0       MEMBER           STOPPED                host01                  host01      
      1       MEMBER           STOPPED                host02                  host02      
      2       MEMBER           STOPPED                host03                  host03      
      128     CF               STOPPED                host04                  host04      
      129     CF               STOPPED                host05                  host05      
      
      
      ALERT       PARTITION_NUMBER        LOGICAL_PORT    NETNAME   ...
      -----       ----------------        ------------    -------   
         NO                      0                   0    host01-ib0
         NO                      0                   0    host02-ib0
         NO                      0                   0    host03-ib0
         NO                      -                   0    host04-ib0
         NO                      -                   0    host05-ib0
      
      
      HOSTNAME                       STATE                INSTANCE_STOPPED        ALERT
      --------                       -----                ----------------        -----
      host01                        ACTIVE                              NO           NO
      host02                        ACTIVE                              NO           NO
      host03                        ACTIVE                              NO           NO
      host04                        ACTIVE                              NO           NO
      host05                        ACTIVE                              NO           NO
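
As a supplement to the sample output above, the state check can be automated. The following is a minimal sketch, assuming the text of db2instance -list has already been captured into a string. The function names and the SAMPLE text are illustrative and not part of Db2. The parser relies only on the row layout shown in the example (numeric ID, then TYPE, then STATE).

```python
def parse_instance_states(output):
    """Return (id, type, state) tuples from captured db2instance -list text."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID followed by MEMBER or CF;
        # header, separator, and host rows do not match and are skipped.
        if len(fields) >= 3 and fields[0].isdigit() and fields[1] in ("MEMBER", "CF"):
            rows.append((int(fields[0]), fields[1], fields[2]))
    return rows


def cfs_not_started(rows):
    """Return (id, state) for every CF reported as STOPPED or ERROR."""
    return [(i, state) for i, kind, state in rows
            if kind == "CF" and state in ("STOPPED", "ERROR")]


# Abbreviated copy of the sample output, used only for illustration.
SAMPLE = """\
ID        TYPE             STATE                HOME_HOST               CURRENT_HOST
--        ----             -----                ---------               ------------
0       MEMBER           STOPPED                host01                  host01
128     CF               STOPPED                host04                  host04
129     CF               STOPPED                host05                  host05

HOSTNAME                       STATE                INSTANCE_STOPPED        ALERT
--------                       -----                ----------------        -----
host04                        ACTIVE                              NO           NO
"""

if __name__ == "__main__":
    for cf_id, state in cfs_not_started(parse_instance_states(SAMPLE)):
        print(f"CF {cf_id} is {state}")
```

For the sample, both CF 128 and CF 129 are reported, which matches the hosts that the later steps ask you to inspect.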
  • If any alerts are present, run db2cluster -cm -list -alerts for details. The alerts describe what might need to be fixed (for example, a network adapter or host that is offline), or point to the cfdiag*.log files for further diagnosis.
  • Look for related errors in the CF's db2diag log file from the time when the db2start command was run:
    2009-11-09-02.32.46.967563-300 I261372A332          LEVEL: Severe
    PID     : 1282088              TID  : 1             KTID : 4751433
    PROC    : db2start
    INSTANCE: db2inst1             NODE : 000
    HOSTNAME: host04
    EDUID   : 1
    FUNCTION: Db2, base sys utilities, sqleIssueStartStop, probe:3973
    MESSAGE : Failed to start any CF.
  • Search the sections of the db2diag log file that precede the trace point shown in the previous example for more information about why the CF has not started. For example, if cluster services cannot start a CF, the db2diag log file might show:
    2009-11-09-02.12.40.882897-300 I256778A398          LEVEL: Error
    PID     : 737522               TID  : 1             KTID : 2371807
    PROC    : db2havend
    INSTANCE: db2inst1             NODE : 000
    EDUID   : 1
    FUNCTION: Db2, high avail services, db2haOnlineResourceGroup, probe:5982
    DATA #1 : <preformatted>
    Timeout waiting for resource group ca_db2inst1_0-rg to be online, last known OpState is 2
  • Each CF writes information to the cfdiag*.log files and dumps more diagnostic data when required. The files reside in the directory set by the database manager configuration parameter cf_diagpath or, if cf_diagpath is not set, in the directory set by diagpath. If neither is set, the files reside in $INSTHOME/sqllib_shared/db2dump/ $m by default.
    • CF diagnostic log files (cfdiag-<timestamp>.<cf_id>*.log)
      • Each of these files keeps a log of the activities that are related to a CF. Events, errors, warnings, and additional debugging information are logged there. This log has a similar structure to the db2diag log file. A new log is created each time that a CF starts. The logging level is controlled by the cf_diaglevel database manager configuration parameter.
      • Note that there is a static CF diagnostic log name that always points to the most current diagnostic logging file for each CF and has the following format: cfdiag.<cf_id>.log
    • CF output dump diagnostic files (cfdump.YYYYMMDDhhmmssuuuuuu.<host>.<cf_id>.out)
      • These files contain information regarding CF startup and stop. There might be some additional output in these files.
    • Management LightWeight Daemon diagnostic log file (mgmnt_lwd_log.<cf_pid>)
      • This log file displays the log entries that pertain to the LightWeight Daemon (LWD) process for a particular CF. Errors in this log file indicate that the LWD has not started properly.
    • CF stack files (CAPD.<cf_pid>.<tid>.thrstk)
      • These are stack files produced by the CF when it encounters a signal. These files are important for diagnosing a problem with the CF.
    • CF trace files (CAPD.tracelog.<cf_pid>)
      • A default lightweight trace is enabled for the CF.
      • These trace files appear whenever the CF terminates or stops.
      • The trace files might indicate a problem with the CF, but these files are useful for diagnosing errors only when used in combination with other diagnostic data.
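
To illustrate the file naming described in the list above, the following is a minimal sketch that picks the newest timestamped cfdiag log for one CF. It assumes that the timestamp in the file name is fixed width, so that sorting by name is chronological. The function names and the file names created in demo() are hypothetical and are used only for illustration.

```python
import tempfile
from pathlib import Path


def latest_cfdiag_log(diag_dir, cf_id):
    """Return the newest cfdiag-<timestamp>.<cf_id>*.log in diag_dir, or None."""
    # Same pattern as in the list above. Note that the trailing wildcard
    # means a short cf_id can also match a longer ID with the same prefix.
    pattern = f"cfdiag-*.{cf_id}*.log"
    # Assumption: fixed-width timestamps, so name order is time order.
    candidates = sorted(Path(diag_dir).glob(pattern), key=lambda p: p.name)
    return candidates[-1] if candidates else None


def demo():
    """Create a few empty, hypothetical log files and look up CF 128."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in (
            "cfdiag-20091108230000000001.128.log",
            "cfdiag-20091109015035000037.128.log",
            "cfdiag-20091109015036000012.129.log",
            "cfdiag.128.log",  # static name, not matched by the pattern
        ):
            (Path(tmp) / name).touch()
        newest = latest_cfdiag_log(tmp, 128)
        return newest.name if newest else None


if __name__ == "__main__":
    print(demo())
```

In practice, the static cfdiag.<cf_id>.log name already points to the most current log, so a lookup like this is mainly useful when you need to compare the current log with logs from earlier CF starts.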
  • If the CF process starts successfully, a startup and initialized message is written to the CF dump files.
  • For example, the contents of cfdump.20091109015035000037.host04.128.out include a message that shows a successful process start:
    CA Server IPC component Initialised: LWD BG buffer count: 16
                  Session ID: 1d
    CA Server IPC component Acknowledged LWD Startup Message
              Waiting for LWD to Configure Server
    Processors: (4:4) PowerPC_POWER5 running at 1498 MHz
    
    Cluster Accelerator initialized
    
    Cluster Accelerator Object Information:
       OS: AIX 64-bit
       Compiler: xlC VRM (900)
       SVN Revision: 7584
       Built on: Oct 12 2009 at 17:00:54
       Executable generated with symbols
       Model Components Loaded: CACHE  LIST  LOCK
       Transport: uDAPL
       Number of HCAs: 1
       Device[0]: hca0
       CA Port[0]: 50638
       Mgmnt Port Type: TCP/IP
       Mgmnt Port: 50642
       IPC Key: 0xe50003d
       Total Workers: 4
       Conn/Worker: 128
       Notify conns: 256
       Processor Speed: 1498.0000 MHz
  • If the cfdump file does not contain the "Cluster Accelerator initialized" line, the "Cluster Accelerator Object Information" section, and the other lines shown in the preceding example, the CF process did not start successfully. An error message might be shown instead. Contact IBM Support for more information.
  • In the following example, cfdiag-20091109015035000037.128.log records a successful process start. If the CF did not start properly, this log might be empty or contain error messages.
    2009-11-09-01.50.37.0051837000-300 E123456789A779 LEVEL : Event
    PID       : 688182 TID :          1
    HOSTNAME  : host04
    FUNCTION  : CA svr_init, mgmnt_castart
    MESSAGE   : CA server log has been started.
    DATA #1   :
    Log Level: Error
    Debugging : active
    Cluster Accelerator Object Information
        AIX 64-bit
        Compiler: xlC VRM (900)
        SVN Revision: 7584
        Built on Oct 12 2009 at 17:00:59
        Executable generated with symbols.
        Executable generated with asserts.
        Model Components Loaded: CACHE, LIST, LOCK
        Transport: uDAPL
        Number of HCAs: 1
        Device[0]: hca0
        CA Port[0]: 50638
        Total Workers: 4
        Conn/Worker: 128
        Notify conns: 256
        Processor Speed: 1498.000000 Mhz.
        Allocatable Structure memory: 170 MB
  • Look for core files or stack traceback files in the directory set by cf_diagpath.
  • If the cause of the error is still unknown, consult the system error log for the affected host. Log on to the CF host that has not started and view the system error log by running the errpt -a command (on Linux®, look in the /var/log/messages file). Look for related log entries at the time of the failure. In the example shown here, log in to host04 and host05, because CF 128 and CF 129 reside on these hosts.
  • If an alert was shown from db2cluster -list -alert, run db2cluster -clear -alert after the problem is resolved, and then reissue the db2start command.
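
As a supplement to the cfdump check described earlier in this topic, the following is a minimal sketch that tests the text of a CF dump file for the initialization lines. The marker strings are copied from the sample cfdump output shown earlier. Other CF levels might word these lines differently, so treat the markers as an assumption. The function name is illustrative and not part of Db2.

```python
# Marker lines copied from the sample cfdump output in this topic.
INIT_MARKERS = (
    "Cluster Accelerator initialized",
    "Cluster Accelerator Object Information",
)


def missing_init_markers(cfdump_text):
    """Return the initialization markers that do not appear in the dump text."""
    return [marker for marker in INIT_MARKERS if marker not in cfdump_text]


if __name__ == "__main__":
    # Shortened excerpt of the sample dump, used only for illustration.
    excerpt = (
        "CA Server IPC component Acknowledged LWD Startup Message\n"
        "Cluster Accelerator initialized\n"
        "\n"
        "Cluster Accelerator Object Information:\n"
        "   OS: AIX 64-bit\n"
    )
    missing = missing_init_markers(excerpt)
    if missing:
        print("CF process might not have started; missing:", missing)
    else:
        print("CF initialization lines found")
```

An empty result means only that both lines are present. It does not replace a review of the cfdiag*.log files and the db2diag log file.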