CF server failure
Use the information in this topic to help you diagnose whether a cluster caching facility (CF) component has failed.
Symptoms
A Db2 instance fails to start on the execution of the db2start command.
Diagnosing a CF server failure
- Refer to the SQLCODEs in the db2start command output.
- To determine whether a CF has not started, run
db2instance -list. This information might show CFs in a STOPPED or ERROR
state if the startup has failed, depending on when the failure occurs.
- The following example shows a sample output from db2instance
-list
    ID   TYPE    STATE    HOME_HOST  CURRENT_HOST  ...
    --   ----    -----    ---------  ------------
    0    MEMBER  STOPPED  host01     host01
    1    MEMBER  STOPPED  host02     host02
    2    MEMBER  STOPPED  host03     host03
    128  CF      STOPPED  host04     host04
    129  CF      STOPPED  host05     host05

    ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME     ...
    -----  ----------------  ------------  -------
    NO     0                 0             host01-ib0
    NO     0                 0             host02-ib0
    NO     0                 0             host03-ib0
    NO     -                 0             host04-ib0
    NO     -                 0             host05-ib0

    HOSTNAME  STATE   INSTANCE_STOPPED  ALERT
    --------  -----   ----------------  -----
    host01    ACTIVE  NO                NO
    host02    ACTIVE  NO                NO
    host03    ACTIVE  NO                NO
    host04    ACTIVE  NO                NO
    host05    ACTIVE  NO                NO
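The check described above can be scripted. The following is a minimal sketch, not an IBM-provided tool: it assumes that the command prints whitespace-separated columns with the ID, TYPE, and STATE headers shown in the sample, and the function names parse_instance_table and cfs_in_state are hypothetical. The sample text is an abbreviated copy of the output above.

```python
# Sketch: find CFs that db2instance -list reports as STOPPED or ERROR.
# Assumption: columns are whitespace-separated, as in the sample output.

SAMPLE = """\
ID   TYPE    STATE    HOME_HOST  CURRENT_HOST
--   ----    -----    ---------  ------------
0    MEMBER  STOPPED  host01     host01
128  CF      STOPPED  host04     host04
129  CF      STOPPED  host05     host05

HOSTNAME  STATE   INSTANCE_STOPPED  ALERT
--------  -----   ----------------  -----
host04    ACTIVE  NO                NO
"""


def parse_instance_table(output):
    """Return the rows of the first table (members and CFs) as dicts."""
    rows = []
    header = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            if header:          # a blank line ends the first table
                break
            continue
        if header is None:
            if fields[:3] == ["ID", "TYPE", "STATE"]:
                header = fields
            continue
        if set(line.strip()) <= set("- "):   # dashed rule under the header
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def cfs_in_state(rows, states=("STOPPED", "ERROR")):
    """Return the CF rows whose STATE is one of the given states."""
    return [r for r in rows if r["TYPE"] == "CF" and r["STATE"] in states]


for cf in cfs_in_state(parse_instance_table(SAMPLE)):
    print(cf["ID"], cf["STATE"], cf["CURRENT_HOST"])
```

In practice the text would come from running db2instance -list as the instance owner; how the output is captured is left to the caller.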
- If any alerts are present, run db2cluster -cm -list -alert for more information. The alerts provide more information about what might need to be fixed (for example, a network adapter or host is offline), or point to the cfdiag*.log files for more information.
- Look for related errors in the CF's db2diag log file from the time when the db2start command was run:
    2009-11-09-02.32.46.967563-300 I261372A332        LEVEL: Severe
    PID     : 1282088              TID  : 1           KTID : 4751433
    PROC    : db2start
    INSTANCE: db2inst1             NODE : 000
    HOSTNAME: host04
    EDUID   : 1
    FUNCTION: Db2, base sys utilities, sqleIssueStartStop, probe:3973
    MESSAGE : Failed to start any CF.
- Search the sections of the db2diag log file that precede the previous trace point for more information about why the CF has not started. For example, if cluster services cannot start a CF, the db2diag log file might show:
    2009-11-09-02.12.40.882897-300 I256778A398        LEVEL: Error
    PID     : 737522               TID  : 1           KTID : 2371807
    PROC    : db2havend
    INSTANCE: db2inst1             NODE : 000
    EDUID   : 1
    FUNCTION: Db2, high avail services, db2haOnlineResourceGroup, probe:5982
    DATA #1 : <preformatted>
    Timeout waiting for resource group ca_db2inst1_0-rg to be online, last known OpState is 2
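The two search steps above (locate the entry written when db2start ran, then read the Error and Severe entries that precede it) can be sketched in a few lines. This is an illustrative sketch only: it assumes that every record begins with a timestamp in the format shown in the examples, the function names split_records and errors_before are hypothetical, and the sample text is an abbreviated copy of the two records above.

```python
import re
from datetime import datetime, timedelta

# A db2diag record starts with a timestamp such as
# 2009-11-09-02.32.46.967563-300 (the suffix is the UTC offset in minutes).
RECORD_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})[+-]\d+", re.M)
LEVEL = re.compile(r"LEVEL\s*:\s*(\w+)")

SAMPLE = """\
2009-11-09-02.12.40.882897-300 I256778A398        LEVEL: Error
PROC    : db2havend
FUNCTION: Db2, high avail services, db2haOnlineResourceGroup, probe:5982

2009-11-09-02.32.46.967563-300 I261372A332        LEVEL: Severe
PROC    : db2start
MESSAGE : Failed to start any CF.
"""


def split_records(text):
    """Yield (local_time, level, record_text) for each record in the log."""
    starts = list(RECORD_START.finditer(text))
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        body = text[match.start():end].strip()
        when = datetime.strptime(match.group(1), "%Y-%m-%d-%H.%M.%S.%f")
        level = LEVEL.search(body)
        yield when, (level.group(1) if level else None), body


def errors_before(text, failure_time, window_minutes=30,
                  levels=("Severe", "Error")):
    """Return matching records logged in the window that ends at failure_time."""
    start = failure_time - timedelta(minutes=window_minutes)
    return [rec for rec in split_records(text)
            if rec[1] in levels and start <= rec[0] <= failure_time]


failure = datetime(2009, 11, 9, 2, 32, 46, 967563)
for when, level, body in errors_before(SAMPLE, failure):
    print(when, level, body.splitlines()[-1])
```

The 30-minute window is an arbitrary default chosen for the illustration; widen or narrow it to match how long the startup attempt ran.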
- Each CF writes information to the cfdiag*.log files and dumps more diagnostic data when required. The files reside in the directory set by the database manager configuration parameter cf_diagpath or, if cf_diagpath is not set, in the directory set by diagpath or, if neither is set, in $INSTHOME/sqllib_shared/db2dump/$m by default.
- CF
diagnostic log files (cfdiag-<timestamp>.<cf_id>*.log)
- Each of these files keeps a log of the activities that are related to a CF. Events, errors, warnings, or additional debugging information are logged there. This log has a similar structure to the db2diag log file. A new log is created each time that a CF starts. The logging level is controlled by the cf_diaglevel database manager configuration parameter.
- Note that there is a static CF diagnostic log name that always points to the most current diagnostic logging file for each CF and has the following format: cfdiag.<cf_id>.log
- CF output dump diagnostic files (cfdump.YYYYMMDDhhmmssuuuuuu.<host>.<cf_id>.out)
- These files contain information regarding CF startup and stop. There might be some additional output in these files.
- Management LightWeight Daemon diagnostic log file
(mgmnt_lwd_log.<cf_pid>)
- This log file displays the log entries that pertain to the LightWeight Daemon (LWD) process for a particular CF. Errors in this log file indicate that the LWD has not started properly.
- CF stack
files (CAPD.<cf_pid>.<tid>.thrstk)
- These are stack files produced by the CF when it encounters a signal. These files are important for diagnosing a problem with the CF.
- CF trace
files (CAPD.tracelog.<cf_pid>)
- A default lightweight trace is enabled for the CF.
- These trace files appear whenever the CF terminates or stops.
- The trace files might indicate a problem with the CF, but these files are useful for diagnosing errors only when used in combination with other diagnostic data.
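When the diagnostic directory holds many files, it can help to sort them into the kinds listed above before reading them. The following sketch is an illustration under assumptions: the glob patterns are derived from the file name formats in the list, and the labels and the function names classify and group_by_kind are hypothetical.

```python
from fnmatch import fnmatchcase

# Glob patterns derived from the file name formats listed above.
# The first matching pattern wins.
PATTERNS = [
    ("current CF diagnostic log", "cfdiag.*.log"),   # static name per CF
    ("CF diagnostic log", "cfdiag-*.log"),
    ("CF output dump", "cfdump.*.out"),
    ("LWD diagnostic log", "mgmnt_lwd_log.*"),
    ("CF stack file", "CAPD.*.thrstk"),
    ("CF trace file", "CAPD.tracelog.*"),
]


def classify(filename):
    """Return the kind of CF diagnostic file, or None if it is not one."""
    for label, pattern in PATTERNS:
        if fnmatchcase(filename, pattern):
            return label
    return None


def group_by_kind(filenames):
    """Group file names by kind; unrecognized names go under None."""
    groups = {}
    for name in filenames:
        groups.setdefault(classify(name), []).append(name)
    return groups


names = [
    "cfdiag.128.log",
    "cfdiag-20091109015035000037.128.log",
    "cfdump.20091109015035000037.host04.128.out",
    "CAPD.tracelog.688182",
]
for kind, files in group_by_kind(names).items():
    print(kind, files)
```

To apply it to a real directory, pass the result of os.listdir on the diagnostic path to group_by_kind.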
- If the CF process starts successfully, a startup and initialized message is written to the CF dump files.
- For example, the
contents of cfdump.20091109015035000037.host04.128.out include a message that
shows a successful process
start:
    CA Server IPC component Initialised: LWD BG buffer count: 16
    Session ID: 1d
    CA Server IPC component Acknowledged LWD Startup Message
    Waiting for LWD to Configure Server
    Processors: (4:4) PowerPC_POWER5 running at 1498 MHz
    Cluster Accelerator initialized
    Cluster Accelerator Object Information:
       OS: AIX 64-bit
       Compiler: xlC VRM (900)
       SVN Revision: 7584
       Built on: Oct 12 2009 at 17:00:54
       Executable generated with symbols
       Model Components Loaded: CACHE LIST LOCK
       Transport: uDAPL
       Number of HCAs: 1
       Device[0]: hca0
       CA Port[0]: 50638
       Mgmnt Port Type: TCP/IP
       Mgmnt Port: 50642
       IPC Key: 0xe50003d
       Total Workers: 4
       Conn/Worker: 128
       Notify conns: 256
       Processor Speed: 1498.0000 MHz
- If the cfdump*.out file does not contain the "Cluster Accelerator initialized" line, the "Cluster Accelerator Object Information" line, and the other lines shown in the preceding example, the CF process did not start successfully. An error message might be shown instead. Contact IBM Support for more information.
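The check on the dump file can be expressed as a small sketch. It assumes that the marker text matches the sample output exactly ("Cluster Accelerator initialized" and "Cluster Accelerator Object Information"); other releases might word these lines differently, and the function name missing_startup_markers is hypothetical.

```python
# Marker lines taken from the sample cfdump output; treat the exact wording
# as an assumption that might differ between releases.
STARTUP_MARKERS = (
    "Cluster Accelerator initialized",
    "Cluster Accelerator Object Information",
)


def missing_startup_markers(dump_text):
    """Return the startup markers that do not appear in the dump text."""
    return [marker for marker in STARTUP_MARKERS if marker not in dump_text]


good_dump = """\
Waiting for LWD to Configure Server
Cluster Accelerator initialized
Cluster Accelerator Object Information:
   Transport: uDAPL
"""
bad_dump = "CA Server IPC component Initialised: LWD BG buffer count: 16\n"

print(missing_startup_markers(good_dump))  # nothing missing
print(missing_startup_markers(bad_dump))   # both markers missing
```

An empty result suggests only that the process reached initialization; the cfdiag log should still be reviewed for later errors.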
- In this example, cfdiag-20091109015035000037.128.log contains a successful
process start. If the CF did not start
properly, this log might be empty or contain error
messages.
    2009-11-09-01.50.37.0051837000-300 E123456789A779    LEVEL : Event
    PID      : 688182      TID : 1
    HOSTNAME : host04
    FUNCTION : CA svr_init, mgmnt_castart
    MESSAGE  : CA server log has been started.
    DATA #1  :
    Log Level: Error
    Debugging : active
    Cluster Accelerator Object Information
       AIX 64-bit
       Compiler: xlC VRM (900)
       SVN Revision: 7584
       Built on Oct 12 2009 at 17:00:59
       Executable generated with symbols.
       Executable generated with asserts.
       Model Components Loaded: CACHE, LIST, LOCK
       Transport: uDAPL
       Number of HCAs: 1
       Device[0]: hca0
       CA Port[0]: 50638
       Total Workers: 4
       Conn/Worker: 128
       Notify conns: 256
       Processor Speed: 1498.000000 Mhz.
       Allocatable Structure memory: 170 MB
- Look for core files or stack traceback files in the CF_DIAGPATH directory.
- If the cause of the error is still unknown, consult the system error log for the affected host. Log on to the CF host that has not started and view the system error log by running the errpt -a command on AIX (on Linux®, look in the /var/log/messages file). Look for related log entries at the time of the failure. In the example shown here, log in to host04 and host05, because CF 128 and CF 129 reside on these hosts.
- If an alert was shown by db2cluster -cm -list -alert, run db2cluster -cm -clear -alert after the problem is resolved, and then reissue the db2start command.