Recovering table space errors on an HADR standby database

In an HADR environment, when a standby database has a table space in an invalid or error state, the replay of transactions on this table space stops. However, the replay of transactions on other valid table spaces will continue. The primary database is not affected, and the condition of this table space on the standby can go unnoticed.

If this table space condition exists on the standby database, then sometime later when a TAKEOVER operation is performed on the standby database, applications can be impacted by the unavailability of this table space. The erroneous table spaces can be recovered on the standby database by either reinitializing the affected table spaces, or reinitializing the entire database.

Monitoring and identifying erroneous table spaces on the standby

Techniques for monitoring this condition and identifying the erroneous table spaces are described in the technote, Monitoring and Identifying table spaces in invalid or error state on the HADR Standby database.

Identifying and rectifying the cause of the erroneous table spaces

After the erroneous table spaces are identified, the cause of the error condition must be determined. Common causes include insufficient filesystem space, a filesystem that is not mounted, a filesystem error, or a load copy-yes image cannot be found. A severe log replay error can also be a cause.

The administration notification log (<instance_name>.nfy) and db2diag.log must be examined.

Depending on the type of error, the administration notification log (<instance name>.nfy) can contain warning messages similar to these messages:

2016-03-15-22.51.15.490605 Instance:rsiinp16 Node:000
     PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
     buffer pool services sqlbIncPoolState Probe:3604 Database:MYDB1
     ADM12512W Log replay on the HADR standby has stopped on table space "MYTBSPACE3"
     (ID "7") because it has been put into "ROLLFORWARD PENDING" state.
     
2016-03-15-22.51.15.565961 Instance:rsiinp16 Node:000 
     PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
     data management sqldHandleBadPool Probe:40 Database:MYDB1
     ADM5550C The table space "MYTBSPACE3" (ID "7") is being removed from the
     rollforward set. The SQLCODE is "-980".

Likewise, the db2diag.log can contain error messages with more clues, such as the disk write error in this example:

2016-03-15-22.51.15.189467-240 E763181A1307 LEVEL: Error (OS)
     PID : 10551302 TID : 19277 PROC : db2sysc 0
     INSTANCE: myinst1 NODE : 000 DB : MYDB1APPHDL : 0-8 
     APPID: *LOCAL.DB2.160221104421
     HOSTNAME: somehost.ibm.com
     EDUID : 19277 EDUNAME: db2redow (MYDB1) 0
     FUNCTION: Db2 UDB, oper system services, sqloseekwrite64, probe:40
     MESSAGE : ZRC=0x860F0003=-2045837309=SQLO_DERR "disk error occurred (DOS)"
     DIA8402C A disk error has occurred.
     CALLED : OS, -, pwrite
     OSERR : EIO (5) "I/O error"

If the cause of the error condition is determined to be a localized issue, such as insufficient filesystem space, a filesystem problem, or similar, the issue must be rectified before you proceed to reinitialize the table space on the standby.

If the cause of the error condition is determined to be a severe replay error or an internal db2 error, with no observable localized cause, retain all diagnostic information before you open a PMR.

Reinitializing the erroneous table spaces on the standby database

Depending on your Db2® version, refer to the respective links to reinitialize the erroneous table spaces on the standby database.

For pureScale and non-pureScale environments, on version 10.5 Fix Pack 8 and previous Fix Packs:

A full reinitialization of the standby database is the only supported method.
Refer to Initializing high availability disaster recovery (HADR).

For pureScale environments, on version 10.5 Fix Pack 9 and newer Fix Packs, or Versions 11.1.0.0 or 11.1.1.1:

A full reinitialization of the standby database is the only supported method.
Refer to Initializing high availability disaster recovery (HADR).

For non-pureScale environments, on version 10.5 Fix Pack 9 and newer Fix Packs, or Versions 11.1.0.0 or 11.1.1.1:

Refer to Instructions to reinitialize the erroneous table spaces on the standby database.

For pureScale and non-pureScale environments, on Version 11.1.2.2 and newer Mod Packs:

Refer to Instructions to reinitialize the erroneous table spaces on the standby database.

Instructions to reinitialize the erroneous table spaces on the standby database

On the standby host, deactivate the standby database:
```
[Standby]$db2 "deactivate db MYDB1"
```
On the primary host, perform a FLUSH BUFFERPOOL operation before the backup operation. This is to ensure that the backup image to be taken at step 3 has a later recovery starting point than recovery starting point of the standby database. This operation is required in order to repair the table space on the standby database.
```
[Primary]$db2"flush bufferpools all"
```
Note: Databases with heavy workloads can experience short performance issues due to aggressive I/O during the FLUSH BUFFERPOOL operation
On the primary host, perform a backup of the erroneous table spaces. (Note: An 'online' backup allows the primary database to remain available during the backup operation).
```
[Primary]$db2 "backup db MYDB1 tablespace MYTBSPC3 online to /bkp_image_path_pri/ "
```
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a STOP HADR operation:
```
[Standby]$db2 "stop hadr on db MYDB1"
```
Copy or FTP the backup image from the primary host path (/bkp_image_path_pri/ in this example) to the standby host path (/bkp_image_path_stdby/ in this example). On the standby host, perform an offline restore of the table space from the backup image:
```
[Standby]$db2 "restore db MYDB1 tablespace (MYTBSPC3) from /bkp_image_path_stdby/"
```
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a START HADR AS STANDBY operation:
```
[Standby]$db2 start hadr on db MYDB1 as standby
```
On the standby host, reactivate the database (for Versions 11.1.0.0 or 11.1.1.1, this step is not required because of the start hadr operation above):
```
[Standby]$db2 "activate db MYDB1"
```

Monitor the table space(s) state on the Standby as described at the top of this document. Note that the recovered table space(s) will remain in benign ROLLFORWARD_IN_PROGRESS (x40) state until the next takeover operation.

If a TAKEOVER operation was performed before realizing that table space(s) were invalid on the Standby:

If you already performed a TAKEOVER operation, and then realized that one or more table spaces were invalid on the new-Primary, then an SQL0290N error may be returned when those table spaces are accessed by a query or operation.

SQL0290N Table space access is not allowed SQLSTATE=55039

Additionally, the db2diag.log on the new-Primary may contain a message similar to this:


   yyyy-mm-dd-hh.mm.ss...       I32132788A530      LEVEL: Warning
   PID     : ...                TID  : ...         PROC : db2sysc 0
   INSTANCE: ...                NODE : ...         DB   : ...
   APPHDL  : ...                APPID: ...
   EDUID   : ...                EDUNAME: db2agent (...) 0
   FUNCTION: Db2 UDB, recovery manager, sqlpGetTablespacesForFilter,probe:1570
   DATA #1 : preformatted
   Tablespace 6 is in rollforward pending state. Another rollforward will
   be needed to bring this tablespace online.

To confirm which table spaces are in rollforward-pending state, the db2pd -tablespaces command can be used as described in the following technote (#1993013): Monitoring and Identifying table spaces in invalid or error state on the HADR Standby database.

There are three methods to resolve this situation:

If the old-Primary database is still online: then a TAKEOVER operation can be performed on the old-Primary database to make it the Primary database again, and the invalid table spaces on the Standby database can be re-initialized using the instructions above. This method is preferred since it reduces the duration of time that applications are unable to access the table space data.

If the old-Primary database is not online or is inaccessible: then a rollforward operation must be performed on the new-Primary database to return the table spaces to normal state. First all the required recovery log files must be made available on the new-Primary starting from the time period when the table space(s) became invalid (you may need to search far back through the db2diag.logs to determine this time period). The log files can either be copied into the active log path, or stored in a separate path and the 'overflow log path' option used in the subsequent roll-foward operation below. (Reminder to never delete or overwrite/replace any log files, always retain a copy when performing these types of log file manipulation procedures). Next, the new-Primary database must be changed into a standard (non-HADR) database. (Otherwise attempting to perform the subsequent Rollforward operation on an HADR database will fail with an "SQL1774N Table space restore or rollforward cannot be issued on an HADR primary or HADR standby database".)


   db2 deactivate db sample
   db2 stop hadr on db sample

Perform the rollforward operation:


   db2 rollforward db dbname to end of logs

Upon completion of the rollforward, table spaces should be in normal state, and the HADR Standby database will need to be re-initialized.