Sitworld: Best Practice TEMS Database Backup and Recovery
John Alvord, IBM Corporation
24 June 2014 - Version 1.1
I was working with a customer with a TEMS Database File problem. In this case some of the situations had been deleted. In other cases over the years the Database files were not accessible because the index file was inconsistent. These cases are very rare but the results can be disruptive. The hub TEMS or some remote TEMS cannot start or are running without all situations and other objects.
This document presents five best practice procedures for creating reliable database backups. It is not a reference for creating a full and complete backup including configuration and application support files. See here for that reference. This document is dated. Based on history there will be more changes in the future.
Background for Distributed Linux/Unix/Windows platform
There are 50+ TEMS database tables and most of them are represented by indexed sequential files [QA1*.DB/IDX]. 16 of those tables contain user data such as situation definitions and distribution configuration. The IDX file links the keys of a table to the location of the related objects in the DB file. If there is an interruption in the update process, the IDX file may become inconsistent and the data unavailable.
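As a rough illustration of the DB/IDX pairing described above, here is a minimal Python sketch that lists QA1* base names where one half of the pair is missing. The function name and the table_dir parameter are my own for illustration. It assumes the file extensions are upper case as written above. It only checks that both files exist; it cannot tell whether an IDX file is internally consistent, and a table that is not an indexed sequential file could show up here for legitimate reasons.

```python
from pathlib import Path


def unpaired_table_files(table_dir):
    """Return QA1* base names that have a .DB file without an .IDX file, or the reverse.

    table_dir is the directory holding the TEMS table files (supplied by the caller).
    """
    directory = Path(table_dir)
    db_names = {p.stem for p in directory.glob("QA1*.DB")}
    idx_names = {p.stem for p in directory.glob("QA1*.IDX")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(db_names ^ idx_names)
```

An empty list means every DB file found has an IDX partner and vice versa. Anything else is worth a closer look before trusting a copy of those files.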
Here are some cases which caused an interruption in the past.
- TEMS crash
- System shutdown without stopping TEMS [AIX system before ITM 623 FP3]
- Mount point or disk full
- Hardware failure where system or SAN lost power
- Networking outage when writing to an NFS mount
There are certainly many more possible causes. These are just the cases I have seen over the years.
The TEMS environment variable KGLCB_FSYNC_ENABLED defaults to 1 and that decreases the chances of problems. Review the environment variables in <installdir>/logs/ms.env [or Windows <installdir>\logs\ms.env] and if it is set to zero, you should change that setting.
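The review can be scripted. The following is a minimal Python sketch that takes the text of ms.env and reports whether KGLCB_FSYNC_ENABLED is set to zero. It assumes the file holds plain KEY=VALUE lines, which you should confirm against your own file. The function names are hypothetical.

```python
def fsync_setting(env_text):
    """Return the value of KGLCB_FSYNC_ENABLED found in ms.env text, or None if absent.

    Assumes plain KEY=VALUE lines; if the key appears more than once the last one wins.
    """
    value = None
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, rest = line.partition("=")
        if separator and key.strip() == "KGLCB_FSYNC_ENABLED":
            value = rest.strip().strip("'\"")
    return value


def fsync_needs_attention(env_text):
    """True only when the variable is explicitly set to zero."""
    return fsync_setting(env_text) == "0"
```

To use it, read the ms.env file from the logs directory named above and pass the text in. A missing variable is treated as fine because the default is 1.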
These are very rare cases. When and if the problem ever hits, a recovery plan will ensure a prompt return to normal processing.
A Poor Backup Plan
While the hub TEMS is running, make a copy of the QA1* files in
That is better than nothing, but it might result in an inconsistent set of tables because the tables are constantly changing. With files captured on the fly, the TEMS might not even start up.
Solution 1 - No Secondary hub TEMS
The simplest and easiest backup plan is to stop the hub TEMS before copying the QA1* files into a compressed tar or zip file. That ensures capturing a consistent state.
If you do that once a week during a maintenance period, you can always restore those files and have a consistent state. There is certainly a cost in doing that but the cost for an outage is much higher.
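Here is a minimal Python sketch of the copy step, assuming the hub TEMS has already been stopped by whatever means you normally use. The function name, the archive naming scheme and both directory parameters are hypothetical. The table file directory depends on the platform and install, so it is passed in rather than assumed.

```python
import tarfile
import time
from pathlib import Path


def backup_qa1_files(table_dir, archive_dir):
    """Copy every QA1* file in table_dir into a timestamped compressed tar file.

    Run this only while the TEMS is stopped so the captured set is consistent.
    Returns the path of the archive that was written.
    """
    table_dir = Path(table_dir)
    files = sorted(p for p in table_dir.glob("QA1*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no QA1* files found in {table_dir}")

    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    archive = archive_dir / f"tems_qa1_{stamp}.tar.gz"

    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            # Store bare file names so the archive can be restored into any directory.
            tar.add(path, arcname=path.name)
    return archive
```

Keeping several dated archives rather than overwriting one gives you a choice of restore points if a problem is only noticed later.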
Solution 1 – Recovery
- Stop the hub TEMS.
- Make a pdcollect to capture the current state
- Restore the QA1* files from a backup
- Start the hub TEMS
At this point all objects will be restored to the time of backup.
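The restore step can be sketched in Python as follows. This assumes the backup is a tar file holding bare QA1* file names and that the hub TEMS is stopped and the pdcollect has already been taken. The function name and parameters are hypothetical. The sketch refuses any archive member that is not a plain QA1* file so that a wrong archive does not scatter files around.

```python
import shutil
import tarfile
from pathlib import Path


def restore_qa1_files(archive, table_dir):
    """Extract QA1* files from a tar archive into table_dir, overwriting existing files.

    Raises ValueError before writing anything if the archive holds an unexpected member.
    Returns the sorted list of restored file names.
    """
    table_dir = Path(table_dir)
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        for member in members:
            name = member.name
            plain_name = "/" not in name and "\\" not in name
            if not (member.isfile() and plain_name and name.startswith("QA1")):
                raise ValueError(f"unexpected archive member: {name}")
        for member in members:
            source = tar.extractfile(member)
            with source, open(table_dir / member.name, "wb") as target:
                shutil.copyfileobj(source, target)
    return sorted(member.name for member in members)
```

Compare the returned list against what you expect for your maintenance level before starting the hub TEMS again.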
Prepare empty table files
The next solutions require a maintenance level specific copy of all the TEMS database files representing empty files. The simplest way is to do a dummy TEMS install and, without starting it, copy the QA1* files into a tar or zip file.
You will absolutely need those files in the following processes. Prepare them ahead of time. If the TEMS is upgraded to a new maintenance level or a different platform then get a new set and save them.
Solution 2 – Hot Backup - Valid from ITM 622 FP5
In this configuration you have two hub TEMS but only one is ever used as the primary hub TEMS. The other hub TEMS is started for backup purposes only. This is not an actual FTO configuration but it uses FTO logic to get the job done.
The hot backup hub TEMS is configured with FTO pointed to the running hub TEMS. That has an additional required control in the KBBENV file which is in
Add this line manually
When the hot backup hub TEMS starts, it works to make sure that its own synchronized database files match those of the other hub TEMS. The first hub TEMS and all the remote TEMS are totally unaware of this usage. When the synchronization is complete [See Note 1], stop the Hot Backup hub TEMS and archive the QA1* files.
Solution 2 – Recovery
When a problem is found with TEMS database files, a recovery action is required. This case does require some hub TEMS down time.
1) Stop the usual hub TEMS if still running.
2) Configure the usual hub TEMS with FTO with the partner being the Hot Backup hub TEMS. Add in the MHM:HOTBACKUP=1 manually to the usual hub TEMS KBBENV file.
3) Replace the problem hub TEMS QA1* files with the saved “empty table” files.
4) Configure the Hot Backup hub TEMS to NOT use FTO.
5) Restore the backup QA1* files to the backup hub TEMS.
6) Start the Hot Backup hub TEMS.
7) Start the problem hub TEMS and wait for it to synchronize with the Hot Backup hub TEMS. [Note 1]
8) Stop the Hot Backup hub TEMS
9) Stop the usual hub TEMS.
10) Configure the usual hub TEMS so it is not using FTO and remove the line MHM:HOTBACKUP=1
11) Start the usual hub TEMS.
12) Configure the Hot Backup hub TEMS to use FTO and make sure the MHM:HOTBACKUP=1 is present.
13) Verify normal operation.
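Steps 2, 10 and 12 above add, remove or check the MHM:HOTBACKUP=1 line by hand. If you prefer to script that edit, here is a minimal Python sketch that works on the text of a KBBENV file. It assumes the control is exactly the line quoted in those steps and that it sits on a line of its own. The function name is hypothetical, and the FTO configuration itself is not touched.

```python
HOTBACKUP_LINE = "MHM:HOTBACKUP=1"  # the line quoted in the recovery steps


def set_hotbackup(kbbenv_text, enabled):
    """Return KBBENV text with the hot backup line present (enabled) or removed.

    Existing copies of the line are dropped first, so repeated calls are harmless.
    """
    lines = [
        line for line in kbbenv_text.splitlines()
        if line.strip() != HOTBACKUP_LINE
    ]
    if enabled:
        lines.append(HOTBACKUP_LINE)
    return "\n".join(lines) + "\n" if lines else ""
```

Read the KBBENV file, pass its text through the function, and write the result back while that TEMS is stopped. Keep a copy of the original file first.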
Solution 3 – Fault Tolerant Option [FTO]
In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Once a week or so stop the current backup hub TEMS and copy the QA1* files into a compressed tar or zip file.
Solution 3a – Recovery When one hub TEMS is OK
After a problem is found connected with TEMS database files, a recovery action is required. Usually this is the primary hub TEMS.
1) Stop the hub TEMS with the problem [if required] and the remote TEMS tasks will switch over to the backup hub TEMS which takes on the primary role.
2) Replace the problem hub TEMS QA1* files with the saved “empty table” files.
3) Start the problem hub TEMS and wait until synchronization is complete. [Note 1]
4) Stop the usual backup hub TEMS.
5) After processing switches back to the usual primary hub TEMS start the usual backup hub TEMS again.
6) Verify normal operation.
Solution 3b – Recovery When Neither hub TEMS is OK
1) Stop both hub TEMS tasks.
2) For the TEMS the backup was taken on, replace the QA1* files with the saved backup files.
3) Replace the other hub TEMS QA1* files with the saved “empty table” files.
4) Start the hub TEMS with the backup QA1* files.
5) After 20 minutes start the hub TEMS with the empty files. Wait until synchronization is complete. [Note 1].
6) If needed, stop the usual backup hub TEMS. After processing switches back to the usual primary hub TEMS start the usual backup hub TEMS again.
7) Verify normal operation.
Solution 4 - FTO and Hot Backup - Valid from ITM 622 FP5
In this configuration you have two hub TEMS and one has the primary role and one has the backup role. Both hub TEMS tasks have equal user objects in the tables. Create a third hub TEMS used only for backup purposes. The two FTO hub TEMS will be totally unaware of the backup process so normal operations are unaffected.
Use the backup process documented in Solution 2. The hub TEMS used only for backup is configured in the "Hot Backup" mode and is configured to the usual primary hub TEMS. Before the backup process, this new TEMS is started and the TEMS database files are synchronized. See Note 1 for determining when the synchronization is complete. This will normally complete in 10-20 minutes but you could also scan the operations log file for the named messages. At that time stop the TEMS. The QA1* files are in a stable synchronized state and are sufficient to be used for a recovery.
If a recovery is needed and one hub TEMS is OK, Solution 3a recovery is sufficient.
If a recovery is needed and both usual primary hub TEMS and usual backup hub TEMS are damaged, use this solution 4 backup with the Solution 3b recovery process.
Many kudos to Richard Bennett, IBM Support L3 TEMS team lead for his extensive knowledge and his wise editing suggestions.
This document is a best practice procedure for creating a reliable backup for TEMS Database files and how to use those files in a recovery action.
1.1 - Added Solution 4
1.0 - Initial publication
Note 1
The TEMS operations log is located in
During a recovery like this there will be a long series of messages about individual objects being updated. Look for one of the following messages in the TEMS which is being recovered:
KQM0009 FTO promoted <temsname> as the acting HUB.
KQM0013 The <temsname> is now the acting HUB.
KQM0014 The <temsname> is now the standby HUB.
These messages occur when synchronization between the primary monitoring server and the secondary monitoring server has been completed.
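Scanning the operations log for those messages is easy to automate. Here is a minimal Python sketch that takes the log lines and returns the ones carrying any of the three message identifiers listed above. The function names are hypothetical, and the log location is left to the caller.

```python
SYNC_MESSAGE_IDS = ("KQM0009", "KQM0013", "KQM0014")


def find_sync_messages(log_lines):
    """Return the log lines that contain one of the synchronization message IDs."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(message_id in line for message_id in SYNC_MESSAGE_IDS)
    ]


def synchronization_complete(log_lines):
    """True once at least one of the synchronization messages has been logged."""
    return bool(find_sync_messages(log_lines))
```

Pass in an open file or a list of lines from the operations log of the TEMS being recovered. An old log may still hold messages from an earlier start, so check the timestamps on the lines returned.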