Managing I/O errors and long wait times
When a database write I/O error occurs in single area data sets (ADS), IMS copies the buffer contents of the error control interval (CI) to a virtual buffer. A subsequent DL/I request causes the error CI to be read back into the buffer pool.
The write error information and buffers are maintained across restarts, allowing recovery to be deferred to a convenient time. I/O error retry is automatically performed at database close time and at system checkpoint. If the retry is successful, the error condition no longer exists and recovery is not needed.
When a database read I/O error occurs, IMS creates a non-permanent error queue element (EQE) that is associated with the area data set (ADS) and records the RBA of the error. If other ADSs are available, IMS retries the read using a different ADS. If there is only a single ADS, or if the read fails on all ADSs, the application program receives an 'AO' status code. The presence of the EQE prevents subsequent access to the same CI in this ADS: any attempt to access the CI receives an immediate I/O error indication.
For multiple area data sets (MADS), the I/O is attempted against a different ADS. Up to three distinct I/O errors can be recorded per ADS. On the fourth error, IMS internally stops the ADS. If this is the only ADS for the area, the area is stopped.
EQEs are temporary and do not persist across IMS restarts or the opening and closing of an area. EQEs are not recorded in DBRC or in the DMAC on DASD. A write error eliminates the read EQEs and resets the counter.
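The read-error rules described above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not IMS code: the `Ads` class, the `read_ci` and `area_stopped` functions, and the `bad_rbas` set that simulates failing CIs are all hypothetical names invented for the example. Only the 'AO' status code, the limit of three recorded errors, and the reset on a write error come from the text.

```python
from dataclasses import dataclass, field

MAX_READ_EQES = 3   # three distinct errors are recorded; the fourth stops the ADS
STATUS_OK = "  "    # a blank DL/I status code indicates success
STATUS_IO_ERROR = "AO"


@dataclass
class Ads:
    """Hypothetical stand-in for one area data set (not an IMS control block)."""
    name: str
    bad_rbas: set = field(default_factory=set)   # RBAs that fail on the simulated device
    read_eqes: set = field(default_factory=set)  # RBAs that currently have a read EQE
    stopped: bool = False

    def record_read_error(self, rba):
        """Record a read EQE, or stop the ADS if this is the fourth distinct error."""
        if rba in self.read_eqes:
            return
        if len(self.read_eqes) >= MAX_READ_EQES:
            self.stopped = True
        else:
            self.read_eqes.add(rba)

    def record_write_error(self):
        """A write error eliminates the read EQEs and resets the counter."""
        self.read_eqes.clear()


def read_ci(area, rba):
    """Try each ADS of the area in turn and return (status, ads_name)."""
    for ads in area:
        if ads.stopped or rba in ads.read_eqes:
            continue  # an existing EQE means an immediate I/O error on this ADS
        if rba in ads.bad_rbas:
            ads.record_read_error(rba)
            continue  # retry the read on the next ADS, if any
        return (STATUS_OK, ads.name)
    return (STATUS_IO_ERROR, None)  # single ADS, or the read failed on all ADSs


def area_stopped(area):
    """The area is stopped once every one of its ADSs is stopped."""
    return all(ads.stopped for ads in area)
```

In this model, a read that fails on the first ADS of a two-ADS area is satisfied from the second one, while four distinct failing RBAs on a single-ADS area leave the area stopped.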
The process that creates the read EQE also reads the DMAC (second CI) from DASD. If the DMAC read fails, which can happen when the failure is at the device level, IMS internally stops the ADS. ADS stop processing involves a physical close of the area, which in turn involves a DMAC (second CI) write. If this write fails, and the ADS being closed is the only ADS for the area, the area is stopped and flagged in DBRC as 'recovery needed'.
Multiple Area Data Sets I/O Timing (MADSIOT) helps you avoid the excessively long wait times (also known as a long busy) that can occur while a RAMAC disk array performs internal recovery processing.
To invoke MADSIOT, you must specify the MADSIOT keyword in the DFSVSMxx PROCLIB member. Use the /STA MADSIOT and /DIS AREA MADSIOT commands to start and monitor the MADSIOT function.
Additionally, MADSIOT requires the use of a Coupling Facility (CFLEVEL=1 or later) list structure in a sysplex environment. MADSIOT uses this Coupling Facility to store information required for database recovery. You must use the CFRM policy to define the list structure name, size, attributes, and location. The following table shows the required list structure storage size for several numbers of altered CIs.
Altered number of CIs (entrynum) | Required storage size (listheadernum=50) |
---|---|
1 000 | 1 792 KB |
5 000 | 3 584 KB |
20 000 | 11 008 KB |
30 000 | 15 616 KB |
You can estimate CFRM list structure storage sizes tailored to your installation using an online tool: the IBM® System z® Coupling Facility Structure Sizer Tool (CFSizer). CFSizer is available for you to use at the following website: www.ibm.com/servers/eserver/zseries/cfsizer/, or search for CFSizer at the IBM website: www.ibm.com.
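For a quick first approximation between the tabulated values, the table above can be interpolated. The sketch below assumes that storage grows roughly linearly between adjacent rows, which is an assumption made for this example and not an IBM sizing formula. The function name `estimate_storage_kb` is hypothetical, the figures apply only to listheadernum=50, and CFSizer remains the tool to use for an actual CFRM policy.

```python
from bisect import bisect_right

# (entrynum, required storage in KB) pairs copied from the table (listheadernum=50).
SIZING_TABLE = [(1_000, 1_792), (5_000, 3_584), (20_000, 11_008), (30_000, 15_616)]


def estimate_storage_kb(entrynum):
    """Linearly interpolate between tabulated rows; refuse to extrapolate."""
    xs = [x for x, _ in SIZING_TABLE]
    if not xs[0] <= entrynum <= xs[-1]:
        raise ValueError("entrynum is outside the tabulated range; use CFSizer instead")
    # Index of the row just above entrynum, clamped so the last row maps to itself.
    i = min(bisect_right(xs, entrynum), len(xs) - 1)
    (x0, y0), (x1, y1) = SIZING_TABLE[i - 1], SIZING_TABLE[i]
    return y0 + (y1 - y0) * (entrynum - x0) / (x1 - x0)
```

For example, `estimate_storage_kb(10_000)` falls between the 5 000 and 20 000 rows. Rounding the result up to a size that the Coupling Facility accepts is left to the installation.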