Problem Determination: Corruption

Data corruption in Db2® can be the source of crashes, panics, or forced database shutdowns. Data can become corrupt either physically or logically.

A crash describes any time Db2 unexpectedly stops functioning. A "panic" is a self-induced crash. Typically, a panic occurs in error paths when there is no reasonable way to handle the error and continue operation. For example, Db2 will typically panic after reading a bad checksum from a page on the disk. Panics typically mark the database bad.

Essential tools

Use the db2ckbkp command to test the integrity of a backup image. If a backup image appears to be corrupt, you can run the db2ckbkp to confirm. You may also run db2ckbkp proactively whenever making a backup.

You can also use db2ckbkp to display meta-data stored within the backup image, such as the storage paths associated with an automatic storage database.

To display all available information, run the db2ckbkp -a command. For example:
db2ckbkp -a MYDB.0.db2inst.NODE0000.CATN0000.20180221223624.001
Note: For TSM backup images, use the db2adutl VERIFY option.

The db2dart (Database Analysis and Reporting Tool) tool checks the architectural correctness of a database. Run the db2dart command to see all supported options. Some options include inspecting, formatting, and repairing data. The db2dart tool verifies data on directly on the disk, and unlike the INSPECT command, it does not access the buffer pools. If users are connected to the database, db2dart may show false errors.

The default operation for db2dart is a full database inspection. You can choose to inspect at the database (/DB), table space (/TS) or table (/T) level. A database inspection checks the validity of metadata structures, data page/row headers, etc.

Note: While a database inspection does check for physical validity, it does not check for logical correctness of the data.

You can also use db2dart to dump formatted table data in delimited ASCII format (/DDEL) or to mark the index object as invalid (/MI), which can be useful for damaged indexes that need rebuilding.

Logical and physical corruption

In cases of logical corruption, the data is physically correct, but there is a mismatch between the Db2 read of page metadata and the actual contents of the page. For examples the index page pointing to an incorrect root, or an incorrect number of data slots on a data page.

Logical corruption failures only occur during runtime.

Physical corruption refers to data that is physically damaged. For example, a database page contains nothing but zeroes (which is usually prohibited – a header and other metadata are typically required for every database page).

Physical corruption can be easily detected by running tools such as db2dart or the INSPECT command.

A frequent symptom of physical corruption is page verification errors (CBIT or checksum problems), reported in db2diag.log while reading the page from the disk.

The root cause of physical corruption varies. Common areas of interest include: database bugs (though almost never for CBIT or checksum problems), file system bugs, operating system issues or hardware problems.

The root cause of logical corruption is often a database problem.

Logical corruption read errors

For logical read errors, consumers of buffer pool pages perform validation on various internal structures that are stored in the page. If the page is found to be invalid, Db2 will attempt to tolerate the error after dumping diagnostic data.

If the page is not modified yet, it will be marked bad. In this case the page in unloaded from the buffer pool and, if required, reloaded again.

This feature can be disabled through the DB2RESILIENCE registry variable. Default: ON.

Upon detecting a bad page in memory, Db2 will:

  1. Create the usual FODC package (call stacks, etc...).
  2. Terminate the application with SQL1656. The current operation fails, but the application can continue to use the same connection.
  3. Log ADM6007E (admin log; different from the physical read error ADM code) with the page details.
  4. In order to prevent diagnostic log flooding, limit logging of the same error.
  5. Keep the database accessible.

Physical corruption read errors

When Db2 loads a page from the disk to the buffer pool, the page is validated. If the page is found to be invalid, Db2 will attempt to tolerate the error after dumping diagnostic data. The data from the invalid page is not consumed and no harm has been caused yet, so no reason to mark the database bad.

This feature can be disabled by the DB2RESILIENCE registry variable. Default: ON.

Upon detecting a physical read toleration error, Db2 will:
  1. Create the usual FODC package (call stacks, etc...).
  2. Terminate the application with SQL1655. The current operation fails, but the application can continue to use the same connection.
  3. Log ADM6006E (admin log) with the page details.
  4. In order to prevent diagnostic log flooding, limit logging of the same error.
  5. Keep the database accessible.