IBM Support

Collecting data: DB2 data corruption

Question & Answer


Question

Collecting data for troubleshooting corruption problems such as CBIT errors, bad page errors. Collect the following information and contact DB2 support for assistance.

Answer

Data corruption is one of the most difficult problems to resolve. It requires simultaneous engagement of every team that is associated with a particular role to collect as much diagnostic information as possible at the time of the event so that it can be correlated later during root cause analysis. Data corruption can occur at any layer in the stack – from DB2 to the operating system and right down to the hardware layer. As a result, the key to data collection is to involve all of the possible affected layers and collect any diagnostic data that might be required as soon as possible, while the data is available. It is crucial not only to collect the DB2 diagnostic files, but also to involve operating system and hardware support in order to gather any OS diagnostics (for example, errpt –a, snap, fileplace outputs on AIX) as well as any and all hardware diagnostics (state saves, error logs, etc).

It is important to ensure that there is adequate disk space for critical file systems like dump space and log directories to ensure that critical events are captured in their entirety. It is also important to review error log (such as errpt –a) output regularly to address any non-critical error situations that can result in numerous messages that would fill the report and cause the log to wrap.

Here is a summary of the recommended data collection strategy, followed by a more detailed explanation of what may be required.

SUMMARY:

  1. Contact DB2 support to assist with data collection. It is important that NOTHING be changed until the diagnostic data have been collected from all layers in the stack. Do not connect to or restart the partition or database until data have been collected; crash recovery may change the on-disk versions of affected pages.
  2. Run db2support to collect all DB2 diagnostic files created at the time of the error. The syntax is “db2support <output_file_path> -d <dbname>” from the affected db2 node. This will create db2support.zip in <output_file_path>.
  3. Run db2dart against the tablespace to find out the extent of corruption. If time allows, db2dart can be run against the entire database. The details of the db2dart commands will be provided by DB2 support with information in the db2diag.log.
    • Inspect the tablespace via db2dart /ts
    • Dump out the page in error via db2dart /dp
    • If possible, repeat the db2dart /dp a few minutes apart
    • A raw dump of the page via od or dd may also be desirable
  4. If the operating system is AIX, run fileplace -p <filename> or fileplace -m <devicename> and snap –ac.
  5. If possible, rename the offending table so that we can preserve it in place for detailed analysis later on. This is very important for AIX.
  6. Contact the operating system support folks to collect any other diagnostic information.
  7. Contact the hardware support folks to collect any error logs and diagnostic information.

Submitting diagnostic information to IBM Technical Support for problem determination

DETAILS:

A number of messages are logged to the db2diag.log when a data corruption is detected. These messages are necessary to identify the object that has been affected. The following are the most common entries in the db2diag.log that identify corruption (note that this is not a complete list):

Example #1:
2003-03-06-15.20.01.176852   Instance:instance1   Node:000
PID:135718(db2agntp (DBNAME) 20)   Appid:applicationId1
buffer_pool_services  sqlbcres   Probe:25   Database:DBNAME

DiagData
0000 a3dd                                     ....

Example #2:
2003-03-06-15.20.01.272209   Instance:instance1   Node:000
PID:135718(db2agntp (DBNAME) 20)   Appid:applicationId1
buffer_pool_services  sqlbrdpg   Probe:1143   Database:DBNAME
DIA3700C A bad page was encountered.

ZRC=0xFFFFE101

Example #3 :
2003-03-04-06.35.29.735233   Instance:instance2   Node:000                  
PID:56094(db2agent (DBNAME))   Appid:*:applicationId2
buffer_pool_services  sqlbcres   Probe:10   Database:DBNAME

Important: CBIT Error0000 0007                             ....

There are other messages that are logged that will identify the affected object. The first thing to determine is whether it is a temporary or permanent object. Look for a diagnostic message that shows the object type:

2003-03-06-15.20.01.402714   Instance:instance1   Node:000
PID:135718(db2agntp (DBNAME) 20)   Appid:applicationId1
buffer_pool_services  sqlbrdpg   Probe:1143   Database:DBNAME

 Obj={pool:3;obj:16;type:128} State=x27 Page=41949 Cont=6 Offset=5245 BlkSize=13
BadPage

Any object type larger than or equal to 128 indicates a temporary object.

For a permanent object, the db2dart diagnostic tool should be run against the database to determine the extent of the corruption. Note that db2dart must be run with the database offline, and can be time consuming, so if time is an issue, it can be run against just the table or the tablespace where the corruption was detected. The syntax is as follows:

For the entire database: db2dart <dbname> /d
For the tablespace: db2dart <dbname> /ts /tsi <tablespaceid>
For the table: db2dart <dbname> /t /oi <objectid> /tsi <tablespaceid>

For either type of object (temp or perm), the OS support team will need the file name (or raw device) and the offset into that file where the corruption can be found. The internal db2page tool is used by DB2 support to determine this.

Usage:
db2page SMS <pageNum> <objectID> <numConts> <extSize>
db2page DMS <path and name of SQLSPCS file> <poolID> <pageNum>

The information required as input to db2page is available in the SQLSPCS files, which are collected by db2support, and in the db2diag.log messages.

For a temporary object, the object file must be saved because it will be deleted when crash recovery occurs. It is important to rename it, not just copy it, to preserve the file in its original location in the file system. Then copy the renamed file back to its original name. Permanent objects should also be renamed to preserve the file in its original location. Note that this also means preserving the hardware configuration as is until initial data collection and diagnosis has been completed.

For example:

Original file: SQL0004.LBA
Rename SQL0004.LBA to SQL0004.LBA.ORIG
Copy SQL0004.LBA.ORIG to SQL0004.LBA

When the operating system is AIX, fileplace and snap output should also be collected, and fsck can be run on the file system to look for errors. It is especially important to run fileplace against the file or lv in question. This is the most critical step. The AIX and storage teams need this output in combination with offset and page length information to identify the location of the page in the storage system. Example syntax:

fileplace -p <filename>
fileplace -m <lv name>
snap -ac

Note that AIX support should be contacted to get the exact syntax as well as any other commands they would want to run. Also note that fsck must be run with the file system unmounted.

For hardware data collection, it is important to contact the hardware support team as soon as possible to assist with the data collection. Some of the data may be time sensitive. Keep in mind though that the hardware configuration should not be changed (that is, no removing of suspect disk drives, etc) until after all AIX data has been collected.

RECOVERY OPTIONS:

First of all, determine if the data corruption is permanent on disk. There is no recovery required for corruption in temporary objects, or for in memory corruptions. The db2dart tool or the dd utility can be used to read the pages from disk to see if the corruption persists.

There are various recovery options, depending on the scope of the problem and whether a reliable backup image exists.

The recommended recovery option is to restore from a backup. If the corruption is widespread, the recommendation would be to restore a full database backup. If the corruption is localized, a tablespace restore is sufficient.

If the corruption is localized to a table, it may be possible to drop and recreate that table, if the data is available. The /ddel option of db2dart can be used to retrieve as much data as possible from the table if the data is not otherwise available.

If restoring from a backup is not a viable option, then DB2 support must be contacted to determine what if any other options are available.

[{"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Database Objects\/Config - Tables","Platform":[{"code":"PF002","label":"AIX"},{"code":"PF010","label":"HP-UX"},{"code":"PF012","label":"IBM i"},{"code":"PF015","label":"IRIX"},{"code":"PF016","label":"Linux"},{"code":"PF027","label":"Solaris"},{"code":"PF033","label":"Windows"}],"Version":"9.7;9.5;9.1;10.1;10.5","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
16 June 2018

UID

swg21322155