Recovering from transaction failures in a partitioned database environment
If a transaction failure occurs in a partitioned database environment, database recovery is usually necessary on both the failed database partition server and any other database partition server that was participating in the transaction.
- Crash recovery occurs on the failed database partition server after the failure condition is corrected.
- Database partition failure recovery on the other (still active) database partition servers occurs immediately after the failure has been detected.
- READ-ONLY
- No data change occurred at this server
- YES
- Data change occurred at this server
- NO
- Because of an error, the server is not prepared to commit
If one of the servers responds with a NO
, the
transaction is rolled back. Otherwise, the coordinator partition begins
the second phase.
During the second phase, the coordinator partition writes a COMMIT
log record, then distributes a COMMIT request to all the servers that
responded with a YES
. After all the other database
partition servers have committed, they send an acknowledgement of
the COMMIT to the coordinator partition. The transaction is complete
when the coordinator agent has received all COMMIT acknowledgments
from all the participating servers. At this point, the coordinator
agent writes a FORGET log record.
Transaction failure recovery on an active database partition server
If any database partition server detects that another server is down, all work that is associated with the failed database partition server is stopped:
- If the still active database partition server is the coordinator partition for an application, and the application was running on the failed database partition server (and not ready to COMMIT), the coordinator agent is interrupted to do failure recovery. If the coordinator agent is in the second phase of COMMIT processing, SQL0279N is returned to the application, which in turn loses its database connection. Otherwise, the coordinator agent distributes a ROLLBACK request to all other servers participating in the transaction, and SQL1229N is returned to the application.
- If the failed database partition server was the coordinator partition for the application, then agents that are still working for the application on the active servers are interrupted to do failure recovery. The transaction is rolled back locally on each database partition where the transaction is not in prepared state. On those database partitions where the transaction is in a prepared state, the transaction becomes an indoubt transaction. The coordinator database partition is not aware that the transaction is indoubt on some database partitions because the coordinator database partition is not available.
- If the application connected to the failed database partition server (before it failed), but neither the local database partition server nor the failed database partition server is the coordinator partition, agents working for this application are interrupted. The coordinator partition will either send a ROLLBACK or a DISCONNECT message to the other database partition servers. The transaction will only be indoubt on database partition servers that are still active if the coordinator partition returns SQL0279.
Any process (such as an agent or deadlock detector) that attempts to send a request to the failed server is informed that it cannot send the request.
Transaction failure recovery on the failed database partition server
If the transaction failure causes the database manager to end abnormally, you can issue the db2start command with the RESTART option to restart the database manager once the database partition has been restarted. If you cannot restart the database partition, you can issue db2start to restart the database manager on a different database partition.
- Explicitly, through the RESTART DATABASE command
- Implicitly, through a CONNECT request when the autorestart database
configuration parameter has been set to
ON
- On a database partition server that is not the coordinator partition, a transaction is indoubt if it is prepared but not yet committed.
- On the coordinator partition, a transaction is indoubt if it is committed but not yet logged as complete (that is, the FORGET record is not yet written). This situation occurs when the coordinator agent has not received all the COMMIT acknowledgments from all the servers that worked for the application.
- If the server that restarted is not the coordinator partition for the application, it sends a query message to the coordinator agent to discover the outcome of the transaction.
- If the server that restarted is the coordinator partition for the application, it sends a message to all the other agents (subordinate agents) that the coordinator agent is still waiting for COMMIT acknowledgments.
It is possible that crash recovery might not be able to resolve all the indoubt transactions as part of crash recovery. For example, some of the database partition servers might not be available. If the coordinator partition completes crash recovery before other database partitions involved in the transaction, crash recovery will not be able to resolve the indoubt transaction. This is expected because crash recovery is performed by each database partition independently. In Db2® 11.5.3 and later versions, and when DB2_DPF_ASYNC_INDOUBT_RESOLUTION is enabled, the coordinator partition will continue to attempt to resolve the indoubt transactions with the other database partitions until the indoubt is finally resolved. Prior to Db2 11.5.3, or when DB2_DPF_ASYNC_INDOUBT_RESOLUTION is OFF, the coordinator partition will only make one attempt to resolve the indoubt transaction and if the indoubt transaction cannot be resolved the SQL warning message SQL1061W is returned. Because indoubt transactions hold resources, such as locks and active log space, it is possible to get to a point where no changes can be made to the database because the active log space is being held up by indoubt transactions. For this reason, you should determine whether indoubt transactions remain after crash recovery, and recover all database partition servers that are required to resolve the indoubt transactions as quickly as possible.
db2_all "db2 restart database <database_name>"
If one or more servers that are required to resolve an indoubt transaction cannot be recovered in time, and access is required to database partitions on other servers, you can manually resolve the indoubt transaction by making an heuristic decision. You can use the LIST INDOUBT TRANSACTIONS command to query, commit, and roll back the indoubt transaction on the server.
- Db2 Enterprise Server Edition, which indicates that the transaction originated in a partitioned database environment.
- XA, which indicates that the transaction originated in a distributed environment.
Identifying the failed database partition server
When a database partition server fails, the application will typically receive one of the following SQLCODEs. The method for detecting which database manager failed depends on the SQLCODE received:
- SQL0279N
- This SQLCODE is received when a database partition server involved in a transaction is terminated during COMMIT processing.
- SQL1224N
- This SQLCODE is received when the database partition server that failed is the coordinator partition for the transaction.
- SQL1229N
- This SQLCODE is received when the database partition server that failed is not the coordinator partition for the transaction.
- Find the partition server that detected the failure by examining the SQLCA. The SQLCA associated with SQLCODE SQL1229N contains the node number of the server that detected the error in the sixth array position of the sqlerrd field. (The node number that is written for the server corresponds to the node number in the db2nodes.cfg file.)
- Examine the administration notification log on the server found in step one for the node number of the failed server.