Recovering table space errors on an HADR standby database
In an HADR environment, when a standby database has a table space in an invalid or error state, the replay of transactions on this table space stops. However, the replay of transactions on other valid table spaces will continue. The primary database is not affected, and the condition of this table space on the standby can go unnoticed.
If this table space condition exists on the standby database, then sometime later when a TAKEOVER operation is performed on the standby database, applications can be impacted by the unavailability of this table space. The erroneous table spaces can be recovered on the standby database by either reinitializing the affected table spaces, or reinitializing the entire database.
Monitoring and identifying erroneous table spaces on the standby
Techniques for monitoring this condition and identifying the erroneous table spaces are described in the technote, Monitoring and Identifying table spaces in invalid or error state on the HADR Standby database.
Identifying and rectifying the cause of the erroneous table spaces
After the erroneous table spaces are identified, the cause of the error condition must be determined. Common causes include insufficient filesystem space, a filesystem that is not mounted, a filesystem error, or a load copy-yes image cannot be found. A severe log replay error can also be a cause.
The administration notification log (<instance_name>.nfy) and db2diag.log must be examined.
2016-03-15-22.51.15.490605 Instance:rsiinp16 Node:000
PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
buffer pool services sqlbIncPoolState Probe:3604 Database:MYDB1
ADM12512W Log replay on the HADR standby has stopped on table space "MYTBSPACE3"
(ID "7") because it has been put into "ROLLFORWARD PENDING" state.
2016-03-15-22.51.15.565961 Instance:rsiinp16 Node:000
PID:10551302(db2redow (MYDB1) 0) TID:19277 Appid:*LOCAL.DB2.160221104421
data management sqldHandleBadPool Probe:40 Database:MYDB1
ADM5550C The table space "MYTBSPACE3" (ID "7") is being removed from the
rollforward set. The SQLCODE is "-980".
2016-03-15-22.51.15.189467-240 E763181A1307 LEVEL: Error (OS)
PID : 10551302 TID : 19277 PROC : db2sysc 0
INSTANCE: myinst1 NODE : 000 DB : MYDB1APPHDL : 0-8
APPID: *LOCAL.DB2.160221104421
HOSTNAME: somehost.ibm.com
EDUID : 19277 EDUNAME: db2redow (MYDB1) 0
FUNCTION: Db2 UDB, oper system services, sqloseekwrite64, probe:40
MESSAGE : ZRC=0x860F0003=-2045837309=SQLO_DERR "disk error occurred (DOS)"
DIA8402C A disk error has occurred.
CALLED : OS, -, pwrite
OSERR : EIO (5) "I/O error"
If the cause of the error condition is determined to be a localized issue, such as insufficient filesystem space, a filesystem problem, or similar, the issue must be rectified before you proceed to reinitialize the table space on the standby.
If the cause of the error condition is determined to be a severe replay error or an internal db2 error, with no observable localized cause, retain all diagnostic information before you open a PMR.
Reinitializing the erroneous table spaces on the standby database
Depending on your Db2® version, refer to the respective links to reinitialize the erroneous table spaces on the standby database.
- A full reinitialization of the standby database is the only supported method.
- Refer to Initializing high availability disaster recovery (HADR).
- A full reinitialization of the standby database is the only supported method.
- Refer to Initializing high availability disaster recovery (HADR).
- On the standby host, deactivate the standby
database:
[Standby]$db2 "deactivate db MYDB1"
- On the primary host, perform a FLUSH BUFFERPOOL operation before the backup operation. This is
to ensure that the backup image to be taken at step 3 has a later recovery starting point than
recovery starting point of the standby database. This operation is required in order to repair the
table space on the standby
database.
[Primary]$db2"flush bufferpools all"
Note: Databases with heavy workloads can experience short performance issues due to aggressive I/O during the FLUSH BUFFERPOOL operation - On the primary host, perform a backup of the erroneous table spaces. (Note: An 'online' backup
allows the primary database to remain available during the backup
operation).
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a STOP HADR operation:[Primary]$db2 "backup db MYDB1 tablespace MYTBSPC3 online to /bkp_image_path_pri/ "
[Standby]$db2 "stop hadr on db MYDB1"
- Copy or FTP the backup image from the primary host path
(/bkp_image_path_pri/ in this example) to the standby host path
(/bkp_image_path_stdby/ in this example). On the standby host, perform an
offline restore of the table space from the backup
image:
For Versions 11.1.0.0 or 11.1.1.1 only, on the standby host, perform a START HADR AS STANDBY operation:[Standby]$db2 "restore db MYDB1 tablespace (MYTBSPC3) from /bkp_image_path_stdby/"
[Standby]$db2 start hadr on db MYDB1 as standby
- On the standby host, reactivate the database (for Versions 11.1.0.0 or 11.1.1.1, this step is
not required because of the
start hadr
operation above):[Standby]$db2 "activate db MYDB1"
If a TAKEOVER operation was performed before realizing that table space(s) were invalid on the Standby:
SQL0290N Table space access is not allowed SQLSTATE=55039
Additionally, the
db2diag.log on the new-Primary may contain a message similar to this:
yyyy-mm-dd-hh.mm.ss... I32132788A530 LEVEL: Warning
PID : ... TID : ... PROC : db2sysc 0
INSTANCE: ... NODE : ... DB : ...
APPHDL : ... APPID: ...
EDUID : ... EDUNAME: db2agent (...) 0
FUNCTION: Db2 UDB, recovery manager, sqlpGetTablespacesForFilter,probe:1570
DATA #1 : preformatted
Tablespace 6 is in rollforward pending state. Another rollforward will
be needed to bring this tablespace online.
To confirm which table spaces are in rollforward-pending state, the db2pd -tablespaces command can be used as described in the following technote (#1993013): Monitoring and Identifying table spaces in invalid or error state on the HADR Standby database.
There are three methods to resolve this situation:
If the old-Primary database is still online: then a TAKEOVER operation can be performed on the old-Primary database to make it the Primary database again, and the invalid table spaces on the Standby database can be re-initialized using the instructions above. This method is preferred since it reduces the duration of time that applications are unable to access the table space data.
db2 deactivate db sample
db2 stop hadr on db sample
Perform the
rollforward operation:
db2 rollforward db dbname to end of logs
Upon
completion of the rollforward, table spaces should be in normal state, and the HADR Standby database
will need to be re-initialized.