In collaboration with Shant Minassians
Have you ever heard a story where a developer made a couple of updates in the production DB, and then two days later the DBA gets a call because the database is not accepting requests? It's not an urban myth. The transaction logs, in which DB2 records all changes, can be exhausted not only by large transactions, but also by transactions that made updates and were then left idle. In this post, I will describe the scenarios that lead to the dreaded "SQL0964C The transaction log for the database is full" error, and what you can do to prevent it.
Scenarios that can lead to log exhaustion
The transaction logs are used for rollbacks, crash recovery, and roll forward. DB2 records all update operations (inserts, updates, and deletes) in them as they happen, and the log space is released when the transaction completes with a commit or rollback. There are two main scenarios that can lead to log exhaustion: applications making too many changes within a single transaction (large transaction), and connections left idle after having made updates.
Large transactions: As DB2 keeps track of all the updates within a transaction, a large enough transaction can exhaust the transaction logs. In Commerce, this scenario doesn't typically happen from the JVMs, but it can happen from utilities such as Data Load and StagingProp. The application creating the large transaction typically falls victim to the SQL0964C error itself, causing the transaction to be rolled back and the processing to fail. The rollback frees up the log space, restoring the database to a normal state.
Idle Transactions: As logs are freed up sequentially, a transaction left idle after making updates can prevent the logs from being cleared, which leads to SQL0964C. This scenario is tricky because although the log utilization is high, no single transaction appears to be using much of the logging space. Idle connections can originate from outside WebSphere Commerce, such as an operator working with auto-commit off, or from a hang situation within the JVM, generally while waiting on an external service without timeouts. It might actually take several days for an idle connection to start causing problems; this depends on the configuration and activity. As the transaction is idle, it doesn't run into the SQL0964C error itself, but all other activity starts failing until a DBA identifies and kills the idle connection, or a configuration is set up to do it automatically.
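For illustration, the idle-connection scenario is easy to reproduce from the DB2 command line processor. This is only a sketch; the database name MALL, the table, and the column are placeholders:

```shell
# Illustrative only: MALL, STORE, and FIELD1 are placeholders.
# The +c option disables auto-commit, so the update keeps holding
# its log space and locks until an explicit COMMIT or ROLLBACK.
db2 connect to MALL
db2 +c "UPDATE STORE SET FIELD1 = 'test' WHERE STORE_ID = 1"
# Leaving this session idle now pins the active log; as other
# transactions move the log head forward, utilization climbs.
```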
Troubleshooting and recovering
Next, I will show you how to diagnose the SQL0964C error, the tools available, and how to recover from the situation. I cover the most relevant DB2 monitoring views and table functions, snapshots, and, briefly, db2pd.
It is critical to always investigate the root cause and not just kill the connections, as that is the only way to ensure you will not keep running into this situation.
Evidence of a log full scenario
When DB2 is unable to allocate new transaction logs, statements start failing with SQLCODE=-964:
SQL0964C The transaction log for the database is full
Without a functioning database, the Commerce site is effectively down. As evidence, you will find the SQLCODE=-964 error in the WebSphere Commerce or utility logs:
SQLCODE=-964, SQLSTATE=57011, SQLERRMC=null, DRIVER=4.12.55
com.ibm.db2.jcc.am.SqlException: DB2 SQL Error: SQLCODE=-964, SQLSTATE=57011, SQLERRMC=null, DRIVER=4.12.55
On the database side, the error is logged in the alert and notification logs as follows:
2017-08-25-18.104.22.1681526 Instance:db2inst1 Node:000
PID:19595318(db2agent (MALL) 0) TID:3857 Appid:22.214.171.124.64694.170825235406
data protection services sqlpgResSpace Probe:2860 Database:MALL
ADM1823E The active log is full and is held by application handle "0-7".
Terminate this application by COMMIT, ROLLBACK or FORCE APPLICATION.
Along with others such as:
2017-08-25-17.25.53.068145-240 E11659A605 LEVEL: Error
PID : 9306252 TID : 22521 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : MALL
APPHDL : 0-36007 APPID: 126.96.36.199.53275.170825210028
AUTHID : WCS HOSTNAME: demo.torolab.ibm.com
EDUID : 22521 EDUNAME: db2agent (MALL) 0
FUNCTION: DB2 UDB, data protection services, sqlpgResSpace, probe:6666
MESSAGE : ZRC=0x85100009=-2062548983=SQLP_NOSPACE
"Log File has reached its saturation point"
DIA8309C Log file was full.
Troubleshooting with monitoring views
The MON_TRANSACTION_LOG_UTILIZATION view is the quickest way to confirm the logs are exhausted:
SELECT LOG_UTILIZATION_PERCENT FROM SYSIBMADM.MON_TRANSACTION_LOG_UTILIZATION
The MON_GET_CONNECTION and MON_GET_UNIT_OF_WORK table functions can help you identify the transactions using log space and their idle time. Start by sorting by UOW_LOG_SPACE_USED to find large transactions. Then sort by CLIENT_IDLE_WAIT_TIME to identify idle connections.
SELECT CON.APPLICATION_HANDLE, CON.APPLICATION_NAME, CON.APPLICATION_ID, CON.CLIENT_WRKSTNNAME, (CON.CLIENT_IDLE_WAIT_TIME/1000/60) CLIENT_IDLE_TIME_MIN, CON.NUM_LOCKS_HELD, (UOW.UOW_LOG_SPACE_USED/1000) UOW_LOG_SPACE_USED_KB
FROM TABLE(MON_GET_CONNECTION(NULL, -1)) AS CON,
TABLE(MON_GET_UNIT_OF_WORK(NULL, -1)) AS UOW
WHERE CON.APPLICATION_HANDLE = UOW.APPLICATION_HANDLE
AND UOW.UOW_LOG_SPACE_USED > 0
ORDER BY UOW.UOW_LOG_SPACE_USED DESC
Note: If UOW_LOG_SPACE_USED doesn't match the numbers in db2pd or the snapshots, you might be running into: IT19230: MON_GET_UNIT_OF_WORK INTERFACE - UOW_LOG_SPACE_USED REPORTING MIGHT BE INCORRECT.
Another table function, MON_GET_TRANSACTION_LOG, also provides detailed information about the logs, including the oldest transaction using log space (APPLID_HOLDING_OLDEST_XACT).
SELECT * FROM TABLE(MON_GET_TRANSACTION_LOG(-1)) AS LOG
When using APPLID_HOLDING_OLDEST_XACT, just keep in mind that there could be multiple transactions in the same state.
Troubleshooting with snapshots
If you are old school and still prefer snapshots in text format, they can also help you identify the transactions using log space. You will mainly use the Database, Locks, and Application snapshots.
Under the Database snapshot you find these entries, which will allow you to calculate the utilization percentage:
Log space available to the database (Bytes)= 41240
Log space used by the database (Bytes) = 359446586
This entry shows you the oldest transaction. Useful for the "idle connection scenario":
Appl id holding the oldest transaction = 308
With an Application Snapshot, you will get this info for each application:
Configuration NNAME of client = 188.8.131.52
UOW log space used (Bytes) = 315
This other entry shows the idle time. Keep in mind that Commerce uses connection pooling, with some connections left idle waiting for new requests. Therefore, a high idle time by itself is not a problem unless the transaction is also using log space.
Application idle time = 1 day 23 hours
You can use grep to find the entries quicker, as follows:
db2 "get snapshot for applications on mall" | grep -i -e "Application handle" -e "Configuration NNAME of client" -e "UOW log space used" -e "Application idle time"
Start by calculating the log utilization percentage. If it is abnormally high, grep for "UOW log space used (Bytes)". If you locate applications hogging all the logs, investigate those. If, on the other hand, no application appears to be using much of the logging space, find the application under "Appl id holding the oldest transaction", review its idle time, and confirm it is using log space (>0). Remember there could be multiple connections in the same state.
When you are investigating a transaction, use "Configuration NNAME of client" to find the machine the connection is coming from. The Lock snapshot will show you the locks that have been acquired by the connection, which gives you insight into the type of workload it is executing.
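To narrow the Lock snapshot down to the interesting fields, a grep similar to the one shown earlier can be used (the field names are from my environment and may vary slightly by DB2 version):

```shell
db2 "get snapshot for locks on mall" | grep -i -e "Application handle" -e "Object type" -e "Table name" -e "Mode"
```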
Troubleshooting with db2pd
Although generally too low level for my taste, db2pd is a very powerful and lightweight tool. The usage for troubleshooting logs is as follows:
db2pd -db mall -logs -transactions
I won't get into details, but what I found useful is that it shows the SpaceReserved, which is the amount of log space reserved for a transaction to ensure it can roll back. The reserved space is not allocated, but it counts toward the usage. You will need to consider this if implementing NUM_LOG_SPAN (more on it later).
Analyzing and recovering
Luckily, when dealing with large transactions, the process that creates them often falls victim to the situation it created, gets the error itself, and fails with a rollback. When the large transaction rolls back, the associated log resources are freed up and the site is restored. In this situation you will often see the process failing with the SQLCODE=-964 error in the logs.
You can typically start by reviewing the logging configuration in DB2 (logprimary, logsecond, and logfilsiz). The defaults might be too small for your production (live or authoring) databases, and you might need to increase the available log capacity.
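As a sketch, increasing the log capacity looks like this; the database name MALL and the values are placeholders, so size them against your own baseline:

```shell
# Placeholder values; size them to your workload. LOGSECOND can be
# updated online, but LOGFILSIZ and LOGPRIMARY only take effect after
# all applications disconnect and the database is reactivated.
db2 update db cfg for MALL using LOGFILSIZ 10240
db2 update db cfg for MALL using LOGPRIMARY 20 LOGSECOND 60
```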
WebSphere Commerce does provide options to do the work in multiple transactions, but this affects when the data becomes available and your recovery options. If you work in a single transaction and there is a failure, all changes are rolled back, but if you use multiple transactions, updates up to the latest commit remain in the database. The stagingprop utility provides the transaction parameter. See this blog post for details: Stagingprop Performance: Getting Started. With Data Load, you can use the commit setting.
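A hypothetical stagingprop invocation committing in batches might look like the following; the database names and credentials are placeholders, and only the transaction parameter is the point here:

```shell
# Placeholders throughout. The -transaction parameter commits after
# the given number of STAGLOG records are processed, instead of
# propagating everything in one large transaction.
./stagingprop.sh -scope _all_ -sourcedb stagedb -destdb proddb \
    -destdb_user wcsadmin -destdb_passwd passw0rd \
    -transaction 1000
```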
Idle connections can be tricky to debug. As the connection is inactive, it doesn't issue any commands to get the SQLCODE=-964 error, and therefore the situation is not resolved until the DBA manually clears out the connections.
It is not uncommon for an inactive connection to take many hours or days before it causes problems. You should monitor for idle connections with log space used and resolve them before they become a problem. If you have alerting and monitoring in place, ensure you are alerted well before the log usage reaches 100%.
Using the monitoring views and snapshots, locate the server where the connections are coming from (MON_GET_CONNECTION.CLIENT_WRKSTNNAME) and work with the AppOps team to determine the kind of server it is (e.g. the server from which stagingprop runs, a WC JVM, or a Search JVM).
If the connections are coming from a Commerce or Search JVM, the question becomes why the server has long running transactions without committing. The WebSphere servers have a transaction timeout, but the timeout only marks the transaction for rollback; it doesn't kill the thread. Because of this, if the thread is hung on an external resource, such as an SMTP server, it can remain active until the hang is resolved.
You can collect multiple Javacores on the JVM or use Health Center. If you compare activity over time, you will likely identify threads that are "not moving" and waiting on a resource (likely network and external). This situation can be prevented in the future by ensuring the call is done with a timeout. Very rarely, if the culprit is not clear from Javacores, you can use the technique described in the post Troubleshooting: Database connection pooling issues to identify the exact Java thread associated with the connection.
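For instance, on the IBM JDK you can trigger javacores with SIGQUIT; the process id variable below is a placeholder for the WebSphere server process:

```shell
# SIGQUIT (kill -3) makes the IBM JDK write a javacore without
# stopping the JVM. Collect a few, spaced out, to spot threads
# that are not moving between snapshots. $JVM_PID is a placeholder.
for i in 1 2 3; do
    kill -3 "$JVM_PID"
    sleep 30
done
```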
To release the logs, terminate the connections you identified earlier with the FORCE APPLICATION command:
$ db2 "force application ( 2620, 2724 )"
DB20000I The FORCE APPLICATION command completed successfully.
DB21024I This command is asynchronous and may not be effective immediately.
Avoiding 2:00 AM calls
These are a few things that you can do to avoid running into SQL0964C in production:
Do your homework
Validate your logging configuration (logprimary, logsecond, logfilsiz, etc.) and compare your high watermarks to the total log space available:
SELECT * FROM SYSIBMADM.DBCFG WHERE NAME IN ('max_log', 'num_log_span', 'logprimary', 'logsecond', 'logfilsiz')
SELECT TOTAL_LOG_AVAILABLE, TOTAL_LOG_USED, TOT_LOG_USED_TOP, SEC_LOG_USED_TOP, SEC_LOGS_ALLOCATED FROM TABLE(MON_GET_TRANSACTION_LOG(-1)) AS LOG
You can also do a cursory review to look for long running transactions, and validate log usage during load and staging operations.
Implement monitoring and alerting
The MON_TRANSACTION_LOG_UTILIZATION view makes it easy to retrieve the log usage percentage. Forward the number to a monitoring tool such as New Relic, from where you can graph it to see trends and setup alerts.
SELECT LOG_UTILIZATION_PERCENT FROM SYSIBMADM.MON_TRANSACTION_LOG_UTILIZATION
Trigger alerts well before the utilization reaches 100% (as soon as it is abnormal), to give the team time to troubleshoot and free up the connections.
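As a minimal sketch, a script like the following could run from cron and feed your alerting tool. The 80% threshold, the database name, and the alerting mechanism are all assumptions; the db2 call is shown in a comment since it depends on your environment:

```shell
#!/bin/sh
# Hypothetical cron-driven check of the transaction log utilization.
# The 80% threshold is an example; pick a value that is abnormal for you.

THRESHOLD=80

alert_if_high() {
    # $1 is the current LOG_UTILIZATION_PERCENT value
    if awk -v u="$1" -v t="$THRESHOLD" 'BEGIN { exit !(u > t) }'; then
        echo "ALERT: transaction log utilization at $1%"
    else
        echo "OK: transaction log utilization at $1%"
    fi
}

# In production, fetch the value from DB2 (placeholder database name):
#   UTIL=$(db2 -x "SELECT LOG_UTILIZATION_PERCENT FROM SYSIBMADM.MON_TRANSACTION_LOG_UTILIZATION")
#   alert_if_high "$UTIL"
alert_if_high 85.2
```

The "ALERT" line is where you would hook in your notification tool instead of echoing.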
Killing off connections automatically
DB2 provides two configurations that can be used to automatically kill the connections holding up the logs:
max_log - Maximum log per transaction configuration parameter
This setting kills a connection when a single transaction uses more than a certain percentage of the log space (calculated based on the primary logs only). It relies on the actual percentage of logs used ("UOW log space used (Bytes)"). It works well for the "large transaction" scenario, but it won't trigger for an idle connection with few locks and little log space used.
num_log_span - Number log span configuration parameter
This setting limits the number of log files a transaction can span. It is useful to protect against the "idle connection" scenario and also works for large transactions. It can be tricky to set because, among other reasons, the total log space used by a transaction includes a reserved space, which is not considered by NUM_LOG_SPAN. For example, a transaction could be using 100 MB of logs, of which 50 MB are allocated and 50 MB are only reserved. Because of this, a transaction could run into SQL0964C before NUM_LOG_SPAN is reached. In my testing of large transactions, with 10 primary logs and no secondary logs, if I set NUM_LOG_SPAN to 5 or higher, I got SQLCODE=-964 before NUM_LOG_SPAN had a chance to trigger.
This technote explains the scenario in detail: Database encounters log-full error even though it has the NUM_LOG_SPAN parameter set. To see the reserved space for a transaction, use db2pd -logs -transactions and look under the SpaceReserved column.
To choose values for these settings, review your configuration and usage. Start with high numbers to be on the safe side. If you are unfamiliar with how much logging is typically used, monitor for a couple of days to get a baseline. I recommend setting up both, with MAX_LOG to protect against large transactions and NUM_LOG_SPAN for idle transactions. NUM_LOG_SPAN works well for idle transactions because the log space used (including reserved) is typically small, and the log span increases as other activity moves the logs forward.
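As a sketch, assuming the database is named MALL and using example values only:

```shell
# Example values: a transaction may use at most 50% of the primary log
# space (MAX_LOG) and span at most 8 active log files (NUM_LOG_SPAN).
# Tune both against your measured baseline before enforcing them.
db2 update db cfg for MALL using MAX_LOG 50 NUM_LOG_SPAN 8
```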
If a connection is killed due to MAX_LOG or NUM_LOG_SPAN, the application prints a connection error similar to these:
com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][4.12.55] A communication error occurred during operations on the connection's underlying socket, socket input stream, or socket output stream. Error location: Reply.fill() - insufficient data (-1). Message: Insufficient data. ERRORCODE=-4499, SQLSTATE=08001
com.ibm.db2.jcc.am.DisconnectNonTransientException: [jcc][t4][4.12.55] The database manager is not able to accept new requests, has terminated all requests in progress, or has terminated this particular request due to unexpected error conditions detected at the target system. ERRORCODE=-4499, SQLSTATE=58009
You will also see these messages in db2diag and notification logs:
ADM1542W Application "db2jcc_application" with application handle "0-394" and application id "184.108.40.206.54854.170825234907" executing under authentication id "WCS" will be forced off of the database for violating database configuration parameter MAX_LOG (current value "70"). The unit of work will be rolled back.
ADM1541W Application "db2jcc_application" with application handle "0-361" and application id "220.127.116.11.63009.170826012407" executing under authentication id "WCS" has been forced off of the database for violating database configuration parameter NUM_LOG_SPAN (current value "8"). The unit of work will be rolled back.
Finally, I know some customers use the DB2 Governor to kill connections that have been idle for more than N hours. This is fine, but keep in mind that Commerce uses connection pooling, and WebSphere keeps idle connections in the pool until they are needed for a new request. These connections don't use log space and are harmless. If you configure the Governor, compare its settings with the Data Source settings, such as Minimum Connections and Aged Timeout, to avoid killing connections that are still in the pool waiting to be used.