Db2 instance fails to restart automatically after a failure

After a failure, the Db2® instance might not restart automatically. This topic shows how to identify and resolve the problem.

Important: In version 11.5 Mod Pack 4 and later, using Pacemaker as a cluster manager in an automated failover to HADR standby is a Technical Preview. This means it should be restricted to development, test, and proof of concept environments only.

Identification of the problem

The following log entries from pacemaker-execd and pacemaker-controld indicate that the db2inst resource agent failed to start the instance before reaching the 900-second timeout.

In /var/log/pacemaker/pacemaker.log:
219 May 19 13:16:08  db2inst(db2_db2tea1_db2inst1_0)[29069]:    INFO: start: 290: db2inst1: db2inst_start() entry.
220 May 19 13:16:08  db2inst(db2_db2tea1_db2inst1_0)[29069]:    WARNING: start: 487: ps did not list any process for db2sysc on node 0 after 5 retries. db2inst_monitor() exit with rc=1.
<...>
234 May 19 13:16:08  db2inst(db2_db2tea1_db2inst1_0)[29069]:    INFO: start: 184: db2inst1: Attempting to start partition(0) via db2gcf...
<...>
846 May 19 13:31:13 db2tea1.fyre.ibm.com pacemaker-based[12628] (cib_process_ping)     info: Reporting our current digest to kedge1: 329366e809df79e8eee7385e29f376d6 for 61.5713.41 (0x556f3945d0c0 0)
847 May 19 13:31:13 db2tea1.fyre.ibm.com pacemaker-execd[12630] (child_timeout_callback)    warning: db2_db2tea1_db2inst1_0_start_0 process (PID 29069) timed out
848 May 19 13:31:13 db2tea1.fyre.ibm.com pacemaker-execd[12630] (operation_finished)   warning: db2_db2tea1_db2inst1_0_start_0:29069 - timed out after 900000ms
849 May 19 13:31:13 db2tea1.fyre.ibm.com pacemaker-execd[12630] (log_finished)         info: finished - rsc:db2_db2tea1_db2inst1_0 action:start call_id:54 pid:29069 exit-code:1 exec-time:900003ms queue-time:0ms
850 May 19 13:31:13 db2tea1.fyre.ibm.com pacemaker-controld[16247] (process_lrm_event)    error: Result of start operation for db2_db2tea1_db2inst1_0 on db2tea1: Timed Out | call=54 key=db2_db2tea1_db2inst1_0_start_0 timeout=900000ms

Lines 219 and 220 are typical entries logged while the instance is stopped. They can repeat until the instance starts and do not indicate a problem. Line 234 shows Pacemaker attempting to start the instance.

Lines 847-850, logged approximately 15 minutes later, indicate that Pacemaker timed out waiting for the instance resource to start.
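
To check a long pacemaker.log for this symptom, the pacemaker-controld entry on line 850 can be matched programmatically. The following is a minimal Python sketch, not part of Db2 or Pacemaker. The function name find_start_timeouts is hypothetical, and the pattern is modeled only on the log entry shown above, so other Pacemaker versions might word the message differently.

```python
import re

# Pattern modeled on the pacemaker-controld entry shown in the log excerpt.
# Assumption: the message wording matches that excerpt.
START_TIMEOUT = re.compile(
    r"Result of start operation for (?P<resource>\S+) on (?P<node>\S+): "
    r"Timed Out\b.*?timeout=(?P<timeout_ms>\d+)ms"
)


def find_start_timeouts(log_lines):
    """Return (resource, node, timeout_seconds) for each timed-out start."""
    hits = []
    for line in log_lines:
        match = START_TIMEOUT.search(line)
        if match:
            hits.append((
                match.group("resource"),
                match.group("node"),
                int(match.group("timeout_ms")) / 1000,  # ms to seconds
            ))
    return hits
```

The function accepts any iterable of lines, so an open file handle on /var/log/pacemaker/pacemaker.log can be passed directly.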

Resolution

The above logs typically indicate that an issue occurred during db2start. Use the following steps to further debug and resolve the problem.
  1. Note the timestamp of the instance start attempt (May 19 13:16:08, on line 234 of the log above).
  2. Locate the db2diag.log of the current host.
  3. Search forward from the timestamp noted in step 1 for errors from the "db2start", "db2star2", or "db2sysc" processes.
  4. One of the entries should report the SQLCODE. Earlier errors might also provide a clue to the cause of the failure.
  5. Resolve the error, then run ipclean -a as the instance owner.
  6. Restart the instance manually.
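
Steps 1 through 4 can be sketched as a small Python filter over db2diag.log. This is an illustration only, resting on several assumptions: each record begins with a header line of the form "YYYY-MM-DD-hh.mm.ss.nnnnnn<tz> <record ID> LEVEL: <level>", the process name appears in a "PROC :" field, and errors are logged at the Error or Severe level. The function name startup_errors and its parameters are hypothetical. The Pacemaker log does not record the year, so the caller must supply the full start time.

```python
import re
from datetime import datetime

# Assumed header layout, for example:
# "2020-05-19-13.16.09.123456-240 I1234E567   LEVEL: Error"
HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2})\.\d+\S*\s+\S+\s+"
    r"LEVEL:\s*(?P<level>\w+)"
)
# Assumed process field, for example: "PROC : db2star2"
PROC = re.compile(r"PROC\s*:\s*(?P<proc>\S+)")
# Process names taken from step 3 of the procedure.
STARTUP_PROCS = ("db2start", "db2star2", "db2sysc")


def startup_errors(diag_lines, since, levels=("Error", "Severe")):
    """Return (time, level, process, text) for startup errors at or after since."""
    # Group lines into records; a header line opens a new record.
    records, current = [], None
    for line in diag_lines:
        if HEADER.match(line):
            current = [line.rstrip("\n")]
            records.append(current)
        elif current is not None:
            current.append(line.rstrip("\n"))

    hits = []
    for lines in records:
        header = HEADER.match(lines[0])
        when = datetime.strptime(header.group("ts"), "%Y-%m-%d-%H.%M.%S")
        text = "\n".join(lines)
        proc = PROC.search(text)
        if (when >= since
                and header.group("level") in levels
                and proc is not None
                and proc.group("proc") in STARTUP_PROCS):
            hits.append((when, header.group("level"), proc.group("proc"), text))
    return hits
```

Pass the lines of db2diag.log together with a datetime built from the timestamp noted in step 1, then inspect the returned record text for the SQLCODE.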