Recently, I'd been working on a lot of ISAS / DPF systems where a very common complaint is the SQL1229 error encountered by the applications. In a few cases, it was severe enough that a recycle of the instance was necessary.
I wanted to dwell a bit on this as it can be a mystery as to why it happens, and it can consume a lot of debugging effort by the DBA. I know it's not a real performance issue per se, but I've seen customers affected by it a lot without really knowing what to do. I'm hoping that with the information in this article, they can take a few steps and at least collect the "right data" for the root cause analysis.
To start with, what does this error really mean ( in the context of DPF )? It essentially means that the engine was trying to do something and was not able to get back a reply in time from the remote side, which it deems a "system error", so it asks for the transaction to be rolled back and re-submitted by the application.
There can be many reasons for this error: in some cases DB2 is a victim of underlying issues in the hardware or network, and in other cases it can be a DB2 issue itself.
The most common error seen in the db2diag.log just before the application receives a SQL1229 is:
2013-09-12-21.06.56.206128-420 E5017316E450 LEVEL: Error (OS)
PID : 28382 TID : 46912950495552 PROC : db2sysc 0
INSTANCE: xxxxx1 NODE : 000
EDUID : 16 EDUNAME: db2fcmr 0
FUNCTION: DB2 UDB, oper system services, sqloPdbRecvBuffer, probe:35
MESSAGE : ZRC=0x810F001A=-2129723366=SQLO_CONN_TIMEOUT "Connection timed out"
CALLED : OS, -, recv OSERR: ETIMEDOUT (110)
2013-09-12-21.06.56.238933-420 E5032877E381 LEVEL: Error
PID : 28382 TID : 46912958884160 PROC : db2sysc 0
INSTANCE: dwpinst1 NODE : 000
EDUID : 14 EDUNAME: db2fcms 0
FUNCTION: DB2 UDB, fast comm manager, sqkfTcpLink::closeConn, probe:25
DATA #1 : <preformatted>
Link info: node 17; type 4; state 5; session 0; activated 1
The point to note is that it's the FCM receive thread ( db2fcmr ) which got this error, and an ETIMEDOUT error can happen for two reasons:
1) Connection timed out OR
2) Active transmission timed out
Generally, the reason is almost invariably "Active transmission timed out".
Now, based on this error, DB2 has to act on it and assume that the remote node is not responding. So it does a "node recovery" for the remote node on the local node and sends a message to the remote node to do the same. In the above snippet, Node#0 got the ETIMEDOUT error from Node#17, so it will do a "node recovery" for Node#17 on Node#0 and then send a message across to Node#17 to do a node recovery for Node#0.

Node recovery isn't a "full blown" recovery as we know it .. it's essentially recycling the socket connections to the node and re-establishing the connection using new sockets ( if interested, you can observe this phenomenon in netstat -an outputs; there's a sketch after the list below ). There's one quirk however: if it's the catalog node that gets the error, the remote node will have to de-activate / re-activate the database in order to complete the node recovery. Usually, node recovery happens very fast ( in milliseconds ), but in some cases where it doesn't, one of the following can hold true:
1) It's not able to send the message across to the remote node quickly enough to do the node recovery on the remote node ( network issue ) OR
2) Something is happening on the remote node making it unable to process the message quickly enough ( maybe workload, low memory, CPU etc. ) OR
3) It's the exception case where it has to deactivate/activate the db in order to complete node recovery and is stuck in the deactivate phase.
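Here's a quick way to watch the socket recycling during a node recovery. This is a minimal sketch for Linux; the instance name xxxxx1 and port 60000 are assumptions .. on a DPF system the FCM port range is registered in /etc/services as DB2_<instance> through DB2_<instance>_END, so check there first:

# Find the FCM port range reserved for the instance
grep "DB2_xxxxx1" /etc/services

# Count the FCM sockets once a second; during a node recovery the count
# drops as the old sockets are torn down and climbs back as new ones
# are established ( 60000 is the assumed first FCM port from above )
watch -n 1 'netstat -an | grep -c 60000'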
There can be another cause for SQL1229. A message such as
2008-07-28-09.31.23.006067-240 I22590117A418 LEVEL: Error
PID : 5038184 TID : 1286 PROC : db2sysc 46
INSTANCE: edwinp01 NODE : 046
EDUID : 1286 EDUNAME: db2fcms 46
FUNCTION: DB2 UDB, fast comm manager, sqkfNetworkServices::detectNodeFailure, probe:15
DATA #1 : <preformatted>
Detected failure for node 41 - time elapsed: 501; max timeout: 500; link state: 4
appears in the db2diag.log.
This essentially means that DB2 has been trying to contact the remote node for about 500 seconds but is unable to do so. However, if you notice, there's no OS error this time around, which indicates the network packets are probably going through but the other side is not able to process the messages sent by this node quickly enough.
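Either way, a quick grep across the db2diag.log tells you which flavour of the failure you've hit; the search strings below come straight from the two snippets above ( adjust the path to your diagpath ):

grep -n -e "SQLO_CONN_TIMEOUT" -e "detectNodeFailure" ~/sqllib/db2dump/db2diag.log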
Ok, I have kind of explained what can cause the SQL1229 issues, but what can really be done to address it? Unfortunately, there's no simple solution, as there are a lot of moving parts and the problem can be anywhere. The best you can do is collect the "right data at the right time" to help the support engineers quickly address the problem you are having. If you see that your instance is hung right after the SQL1229 errors, chances are you have encountered the issue where a "deactivate/activate" is necessary to complete the node recovery. In that case, issue a "db2 force applications all" from the co-ordinator node. There's a high chance that the deactivate ( which was stuck ) completes, the node recovery completes, and the engine is back online. If that also doesn't help, unfortunately the quickest way out is a "recycle" of the instance, but please make sure you collect the data for root cause analysis first.
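For reference, the sequence from the co-ordinator node would look something like this ( the tail is just to watch the db2diag.log for the node recovery completing; adjust the path to your diagpath ):

db2 force applications all
tail -f ~/sqllib/db2dump/db2diag.log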
What can you collect:
For the first condition ( where the OS reports ETIMEDOUT or ECONNRESET ), it's next to impossible to anticipate the error, but you can be prepared for a recurrence. Unfortunately, since it's the OS which reports the error, you have to have a TCP trace turned on continuously and stop it when the message appears in the db2diag.log. You will have to use the respective TCP trace command for the OS ( for AIX it's iptrace, for Linux it's tcpdump ), and you can limit the TCP trace to just the FCM ports. To stop it at the right moment, you can use my tool "watchThreshold.pl" ( I had uploaded it for one of my earlier posts ) with the -watchFile and -command options to issue the stop-TCP-trace command when it encounters the message in the db2diag.log file.
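Here's a minimal sketch of that setup on Linux. The FCM port range 60000-60003 is an assumption ( check the DB2_<instance> entries in /etc/services ), and the watchThreshold.pl invocation is illustrative .. configure the string it watches for as per the tool's own usage notes:

# Continuous capture of FCM traffic only, rotating over a 10 x 100MB
# circular buffer so the always-on trace doesn't fill the filesystem
tcpdump -i any -C 100 -W 10 -w /tmp/fcm_trace.pcap 'portrange 60000-60003' &

# Stop the capture the moment the error shows up in the db2diag.log
./watchThreshold.pl -watchFile ~/sqllib/db2dump/db2diag.log -command "pkill tcpdump"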
If the instance is "hung", I would suggest taking stacks on the "hung" node if possible ( if you can identify from the db2diag.log which node is stuck in the "deactivate" phase ); otherwise, I would suggest collecting the following data:
rah "db2pd -alldbp -stack all"
rah "db2pd -alldbp -stack all"
This will collect the info of the "stuck" thread and provide clues as to why it's stuck in deactivate.
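And if you can tell from the db2diag.log which node is stuck ( say node 17, purely as an illustrative number ), you can target just that partition instead of dumping stacks instance-wide, using db2_all's node-targeting prefix:

# "<<+17<" runs the command only at database partition 17
db2_all "<<+17< db2pd -stack all"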
If you get the timeout message as pasted above in the db2diag.log, there is a registry variable you can set for future occurrences which will trigger data collection automatically at 50%, 75% and 87% of the timeout value. This is available from DB2 V9.7 FP5 onwards, and it helps to find out what was happening in DB2 when the timeout happened ( was it too busy? If so, why? etc. ).
The db2diag.log is very helpful in stitching together the scenario when the issue happened, so it's also an essential part of the puzzle.
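The db2diag tool can do the first pass of that stitching for you; for example, something like this pulls only the error-level records from the last day around the occurrence:

db2diag -H 1d -level Error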
I hope this information helps you understand the SQL1229 error and what you can do to help with the RCA.