Customers who are using VSAM log striping for the DB2 active log datasets with DB2 10 for z/OS or DB2 11 for z/OS should pay attention to recently opened APAR PI10353, which is now marked HIPER. There is no exposure for DB2 for z/OS customers running DB2 V9.1 or earlier. Neither is there any cause for alarm, as the exposure for DB2 10 or 11 is very small. There is no data loss involved and no loss of data integrity.
So what is the underlying problem? This concerns a concept called dependent writes, which means that, in the event of a DB2 restart after a crash, or when restarting DB2 from a DASD mirror or restored FlashCopy image, I/Os must be applied in the order in which they were initiated, and not in the order in which they completed. Unfortunately, VSAM striping does not guarantee the correct ordering of dependent writes for in-flight I/Os. This could mean, in some rare scenarios, a missing log CI stripe. Just to be clear: VSAM striping has always worked this way.
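The difference between an ordinary short write and a missing stripe can be illustrated with a toy model. The following is a minimal sketch only, not DB2 or VSAM code: the stripe count, the function names and the idea of representing a log CI as one piece per stripe are all assumptions made for illustration. It shows that when stripe I/Os complete in a different order from the one in which they were initiated, a crash can leave a hole (a missing piece followed by a piece that did reach disk) rather than a clean truncation.

```python
from typing import List, Optional

STRIPES = 4  # hypothetical stripe count, chosen for illustration only


def split_ci(ci: bytes, stripes: int = STRIPES) -> List[bytes]:
    """Split one log CI into equal-sized pieces, one per stripe."""
    size = len(ci) // stripes
    return [ci[i * size:(i + 1) * size] for i in range(stripes)]


def crash_image(pieces: List[bytes],
                completion_order: List[int],
                completed_before_crash: int) -> List[Optional[bytes]]:
    """Return what is on disk if only the first N completed I/Os survive.

    The I/Os are initiated in stripe order 0, 1, 2, ... but may complete
    in any order, given by completion_order.
    """
    image: List[Optional[bytes]] = [None] * len(pieces)
    for idx in completion_order[:completed_before_crash]:
        image[idx] = pieces[idx]
    return image


def missing_stripes(image: List[Optional[bytes]]) -> List[int]:
    """Indexes of the stripe pieces that never reached disk."""
    return [i for i, piece in enumerate(image) if piece is None]


def has_hole(image: List[Optional[bytes]]) -> bool:
    """True if a missing piece is followed by a piece that is present."""
    seen_missing = False
    for piece in image:
        if piece is None:
            seen_missing = True
        elif seen_missing:
            return True
    return False


pieces = split_ci(bytes(range(16)))

# I/Os complete in the order they were initiated: a clean truncation.
in_order = crash_image(pieces, [0, 1, 2, 3], completed_before_crash=3)

# Stripe 1 is still in flight at the crash, but stripes 2 and 3 are on disk.
out_of_order = crash_image(pieces, [0, 2, 3, 1], completed_before_crash=3)
```

In this model `in_order` is simply short by one piece at the end, whereas `out_of_order` has stripe 1 missing in the middle, which is the kind of image a restart from a mirror or FlashCopy copy could be presented with.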
Why are DB2 10 for z/OS and DB2 11 for z/OS exposed, and not earlier releases? Forced log writes (for example, those scheduled as a result of an application commit) can mean that a partially full log CI is written to disk. That log CI has to be rewritten in place when the next log write I/O is scheduled. Prior to DB2 10 for z/OS, DB2 performed the rewrite of the log CI serially, first to log copy 1 and then to log copy 2, ensuring that the write to log copy 1 was complete before initiating the write to log copy 2. If a hole was left in the log, DB2 restart in DB2 9 for z/OS and earlier releases automatically detected it, truncated the log at that point and continued the restart process. However, a performance enhancement delivered with DB2 10 for z/OS caused DB2 to rewrite log CIs (those which had been partially full when previously written) to both log copy 1 and log copy 2 in parallel. Unfortunately, the existing logic in DB2 10 for z/OS and DB2 11 for z/OS cannot deal with the new situation, that is, a missing log CI stripe.
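Why the move from serial to parallel rewrites matters can be seen by enumerating the states a crash could capture. This is a simplified sketch under stated assumptions, not a description of DB2 internals: each copy's rewrite is modelled as a hypothetical number of stripe I/Os, and a copy counts as "partial" when some but not all of those I/Os have completed. All names are illustrative.

```python
from itertools import product
from typing import List, Tuple

STRIPES = 4  # hypothetical number of stripe I/Os per CI rewrite

State = Tuple[int, int]  # (I/Os completed on copy 1, I/Os completed on copy 2)


def serial_states(stripes: int = STRIPES) -> List[State]:
    """States reachable when copy 2 is started only after copy 1 completes."""
    states = [(done, 0) for done in range(stripes + 1)]
    states += [(stripes, done) for done in range(1, stripes + 1)]
    return states


def parallel_states(stripes: int = STRIPES) -> List[State]:
    """States reachable when both copies are rewritten at the same time."""
    return list(product(range(stripes + 1), repeat=2))


def is_partial(done: int, stripes: int = STRIPES) -> bool:
    """A rewrite is partial if it has started but not finished."""
    return 0 < done < stripes


def both_copies_partial(states: List[State],
                        stripes: int = STRIPES) -> List[State]:
    """Crash points at which neither copy holds a complete CI rewrite."""
    return [s for s in states
            if is_partial(s[0], stripes) and is_partial(s[1], stripes)]


serial_risk = both_copies_partial(serial_states())
parallel_risk = both_copies_partial(parallel_states())
```

In this model `serial_risk` is empty, because at most one copy is ever mid-rewrite, while `parallel_risk` contains the crash points at which both copies are incomplete at once. With out-of-order stripe completion, an incomplete rewrite may be a hole rather than a clean truncation, which corresponds to the case where both copies of the active log pair are damaged at the end.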
A possible consequence of this is that one or both copies of the current DB2 active log pair could be damaged at the end, causing a DB2 crash restart to fail. Recovery would then require an advanced, non-standard procedure, which has to be done under the guidance of the IBM DB2 service team. This could theoretically happen in the following circumstances:
- A local site crash, such as a CEC or LPAR failure, or an address space failure
- A DB2 restart off an XRC mirror or PPRC mirror
- A DB2 restart off a restored system level FlashCopy
No customer has ever hit the problem following a local site failure in over 3 years of DB2 10 for z/OS production experience. Only one customer has ever hit the problem, on two occasions, and only when performing disaster recovery restart testing from an XRC mirror. The chance of hitting the problem during DB2 crash restart following a local site failure is tiny. The chance of hitting the problem when restarting from a DASD mirror or from a restored system level FlashCopy is marginally greater, because the timing window for this sort of event (a missing log CI stripe) is larger. APAR PI10353 will provide a fix (circumvention) such that DB2 will automatically detect this type of problem, repair the log and allow DB2 crash restart to proceed.
For further guidance, please see the problem relief as described in the APAR, but be cautious if you are considering the option of disabling VSAM striping, which is difficult to implement, and may well result in serious performance problems.