Sitworld: TEMS Database Repair
John Alvord, IBM Corporation
Draft #5 – 10 December 2018 - Level 1.04000
The TEMS database tables are used to store user data such as situation descriptions and distribution definitions. They also keep running data such as current situation status on agents. There are many more internal and functional tables.
When the files holding the data are damaged, the TEMS usually malfunctions. Over the years there have been many reasons for such damage. Here are some examples:
- TEMS exception and failure.
- File system full.
- Unwise manual changes or restoring from a backup that wasn't taken correctly.
- Power outage without any UPS backup.
- SAN [Storage Area Network] device failure.
- System shutdown without stopping the TEMS.
- Many unexplained instances.
Hub TEMS Recovery Attempt WARNING!!!
A primary hub TEMS is the repository of fundamental user data, and any recovery of that data is a delicate operation which can easily result in a reinstall and significant downtime. Please work with ITM support in planning a hub TEMS data recovery. A remote TEMS can be recovered quite simply, as can an FTO mirror hub TEMS.
In addition you should have a Backup/Recovery plan for hub TEMS data. See this document for five different ways to accomplish this goal. A simple backup of the files while the TEMS is running is inadequate and can lead to significant downtime. These are hot database files and many constantly change and are tightly connected.
Non-Hub TEMS Recovery
The process is very simple although it varies by platform [hardware and operating system] and by TEMS maintenance level. From a high level view you stop the TEMS [if running], replace the database files with emptytable files and then start up the TEMS and let the hub TEMS refill them with correct data naturally. A reference to the files follows. They are not exactly empty. At the very least they contain an “end of objects” record and some are pre-loaded with data. The ones here were accumulated from install media builds from ITM 6.2, ITM 621, ITM 622, ITM 623 and ITM 630. They are the exact files you would lay down during a new TEMS install.
There are three types of files:
- Bigendian – for Unix [AIX/Solaris/HPUX] and Linux on Z
- Littleendian – for Linux/Intel and Windows
- VSAM – z/OS index sequential file
The references here are to a zip file for each maintenance level. Each zip file contains a bigendian.tar file [Unix and zLinux], a littleendian.tar file [for Linux/Intel] and a littleendian.zip file for Windows. The last two contain identical files but are packaged differently for convenience. With z/OS the story is quite different, see later.
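As a cross-check when choosing between the bigendian and littleendian files, the byte order of the TEMS host can be probed directly. The following is a minimal sketch and not part of the original procedure. The function name host_endian is hypothetical, and it assumes a POSIX shell with printf and od available on the system where the TEMS runs.

```shell
#!/bin/sh
# Hypothetical helper: report the byte order of the current host.
# Prints "littleendian" or "bigendian", matching the file naming used above.
host_endian() {
    # Emit the two bytes 0x01 0x00 and let od read them as one unsigned
    # 16-bit integer in native byte order:
    #   1   on a little-endian host
    #   256 on a big-endian host
    value=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
    case "$value" in
        1)   echo littleendian ;;
        256) echo bigendian ;;
        *)   echo "unknown byte order: $value" >&2; return 1 ;;
    esac
}
```

Run host_endian on the TEMS host itself, not on the workstation where the zip file was unpacked, since the two may differ.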
Windows Recovery for non-hub TEMS
- Select the correct maintenance level and download the proper zip file from the links above. Unzip that file; you will use the littleendian.zip file it contains.
- Unzip that file into some convenient directory – we will assume C:\TEMP but it can be anyplace. You will see a lot of QA1*.DB files and QA1*.IDX files.
- Stop the TEMS
- Copy the files, for example [adjust for actual install directory]
You could also use Windows Explorer. You may also wish to make a safety copy of the existing files before overwriting them.
- Start the TEMS
- Monitor for correct operation.
- Recovery complete
Linux/Unix Recovery for non-hub TEMS
- Select the correct maintenance level and download the proper zip file from the links below. Most environments will have an unzip command. If not you can unzip on some convenient Windows workstation.
- Select the proper endian type. Bigendian is for all Unix and Linux on z systems. Littleendian is for all Linux/Intel systems. For this example we use Linux at ITM 630 and the file is ITM630_emptytables.littleendian_linux_intel.tar and it is assumed to be copied to /opt/IBM/ITM/tmp.
- Move that littleendian file to the system where the TEMS runs and un-tar it.
tar -xf ITM630_emptytables.littleendian_linux_intel.tar
This will create many QA1* files
- At this point you have to determine the attributes/owner/group of the current TEMS files. You could do that with this command:
ls -l /opt/IBM/ITM/tables/<temsnodeid>/QA1CSTSH.DB
which in my zLinux test environment looks like this:
nmp180:~ # ls -l /opt/IBM/ITM/tables/HUB_NMP180/QA1CSTSH.DB
-rwxr-xr-x 1 root root 35274789 Nov 14 21:03 … QA1CSTSH.DB
[Above line shortened for display purposes.]
- Next change the attributes of the un-tar’d files to match what is currently being used and what the TEMS expects. Remember the following is just an example that would be used in my environment. You will run the commands appropriate to your actual environment.
chmod 755 QA1*.*
chown root QA1*.*
chgrp root QA1*.*
- Next stop the remote TEMS or FTO mirror hub TEMS
- Next copy the emptytable files into the directory where the stopped TEMS expects them
cp /opt/IBM/ITM/tmp/QA1*.* .
Note the trailing period, which means to copy to the current directory. Run the command from the TEMS database directory [for example /opt/IBM/ITM/tables/<temsnodeid>].
- Next start the remote TEMS or FTO mirror hub TEMS
- Monitor for normal operations
- End of recovery
- Warning for the FTO mirror hub TEMS: When performing this operation *always* start the primary hub TEMS first [if not already running]. The refreshed FTO mirror hub TEMS must be started second. If that rule is violated the primary hub TEMS will have all custom objects deleted. Don't do that.
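The attribute-matching step above can also be done without reading the ls output by hand. The following is a minimal sketch under stated assumptions: the function name match_tems_attrs is hypothetical, and it relies on the --reference option of GNU chmod and chown, which is generally present on Linux but not on AIX, Solaris or HP-UX. On those platforms use explicit chmod/chown/chgrp commands as shown in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: give every extracted QA1* file the same mode, owner
# and group as one existing TEMS database file.
# Assumption: GNU coreutils (chmod/chown --reference).
match_tems_attrs() {
    ref_file=$1   # an existing TEMS file, e.g. .../tables/<temsnodeid>/QA1CSTSH.DB
    src_dir=$2    # directory holding the un-tar'd emptytable files

    [ -f "$ref_file" ] || {
        echo "reference file not found: $ref_file" >&2
        return 1
    }
    for f in "$src_dir"/QA1*; do
        [ -f "$f" ] || continue
        chmod --reference="$ref_file" "$f" || return 1   # copy permission bits
        chown --reference="$ref_file" "$f" || return 1   # copy owner and group
    done
}
```

In the example environment above this would be invoked as match_tems_attrs /opt/IBM/ITM/tables/<temsnodeid>/QA1CSTSH.DB /opt/IBM/ITM/tmp, before the TEMS is stopped and the files are copied.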
z/OS recovery for non-hub TEMS
Please note: this is hardly ever needed. The last PMR I worked on *looked* like it was needed but the symptom was actually a harmless TEMS message [actually a defect] that complained about a table… and there was no actual problem at all! So I expect it is very rare to have to do this procedure.
Always involve IBM Support if you have any uncertainty at all in this process. Also, if you *think* you know more about z/OS than the author – you are very likely correct!!
z/OS recovery example with ICAT configuration
The following uses QA1CSTSH as an example.
1) Stop the TEMS task
2) Delete or rename the QA1CSTSH VSAM dataset. If unsure, examine the Joblog output to determine the complete dataset name.
3) Proceed to ICAT and navigate to the 'Runtime Environments' panel (KCIPRTE)
4) Place a 'B' next to the RTE [Run Time Environment] that contains the TEMS that owns the file you wish to recreate.
5) That will generate the DS#1xxxx job which should then be submitted.
6) The job will detect the file that is missing and recreate ONLY that file.
7) The job should complete with condition code zero
8) The TEMS can then be started.
z/OS recovery with PARMGEN configuration
The general idea is the same as ICAT.
Steps #3 - #7 can be replaced with the similar instructions here. That documents how to reallocate PDS files but the path followed is the same. Following are some notes from the Parmgen expert.
The job would vary - you can use KDSDELJB as a model job that has the deletes, but make it specific to the RKDSSTSH VSAM dataset only
(//QA1CSTSH DD DISP=SHR,DSN=&RVHILEV..&SYS..RKDSSTSH.)
Submit the composite KCIJPALO job the same as in the doc. For the standalone job, refer to the PARMGEN KDSDVSRF job, which needs to be modified of course.
Hub TEMS – if you absolutely have no choice
There are many TEMS hub database tables which you can reset only by losing significant data and undergoing a long manual reinstall and rebuild. This could mean a week or more of outage. It is very important to involve IBM Support if you have any doubts at all.
However there are a few tables which can be reset with no real impact. These 5 sets of tables contain internal processing data, not user data.
- TSITSTSC – QA1CSTSC: The Situation Status Cache which is reused every time the TEMS starts.
- TSITSTSH – QA1CSTSH: The Situation Status History. This is an intermediate file where situation event status collects. It is a wraparound table and defaults to 8192 rows. At hub TEMS startup all the remote TEMSes and agents [if directly connected] send current status. Therefore you only miss situation status history after a reset. Since there are no ITM functions which display or use the history, nothing much is lost by resetting it to emptytable status.
- [several tables] – QA1CDSCA: This is the combined catalog table. If this is reset to emptytable status, at TEMS startup the pre-defined data is updated based on the existing package [like klz.cat] files. Therefore it can be reset to emptytable status and nothing is lost. As a minor point, TEMS has an extremely hard limit of 512 packages. At 513 the TEMS will crash and not come up. It is pretty rare but definitely something to be aware of. Should you encounter this issue, you will have to remove one or more .cat [and the paired .atr] files to get the total down to 512 packages or below. If you encounter this limit see Sitworld: Attribute and Catalog Health Survey which will calculate what packages are no longer being used.
- SITDB/TOBJCOBJ - QA1CRULD/QA1CCOBJ: These tables are created dynamically as situations are started. SITDB contains the SQL representing the situation. TOBJCOBJ records how situations are related to each other. Both need to be reset to emptytable status at the same time.
- TNODESAV - QA1DNSAV: This records the current agent registration - the nodes or managed system names. When agents connect the data is rebuilt and also any missing data in the TNODELST table. Such missing data sometimes shows as advisories in Database Health Checker reports and the agents affected do not actually run situations. One factor to consider is that agents which are temporarily offline will no longer be in the table. When they do connect again they will be present as usual. If that is important you should capture that information before performing the replacement.
In each case you would do the same as a complete replacement but only handle the QA1*.DB and QA1*.IDX files for the table being reset.
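For a Linux/Unix hub, the per-table replacement can be sketched as follows. This is an illustration under stated assumptions and not a supported tool: the function name reset_tables and the separate safety-copy directory are hypothetical, the TEMS must already be stopped, and the emptytable files are assumed to have the correct endian type, mode, owner and group. Several table names can be passed at once, which suits the QA1CRULD/QA1CCOBJ pair that must be reset together.

```shell
#!/bin/sh
# Hypothetical helper: replace the .DB/.IDX pair of one or more tables with
# the emptytable versions, keeping a safety copy of the current files.
# Usage: reset_tables <empty_dir> <tems_dir> <save_dir> <table> [<table>...]
reset_tables() {
    empty_dir=$1   # directory holding the extracted emptytable files
    tems_dir=$2    # directory where the stopped TEMS keeps its QA1* files
    save_dir=$3    # where the current files are saved first
    shift 3

    # Verify every requested emptytable file exists before changing anything.
    for table in "$@"; do
        for ext in DB IDX; do
            [ -f "$empty_dir/$table.$ext" ] || {
                echo "missing emptytable file: $empty_dir/$table.$ext" >&2
                return 1
            }
        done
    done

    mkdir -p "$save_dir" || return 1
    for table in "$@"; do
        for ext in DB IDX; do
            # Save the current file, if present, then overwrite it.
            if [ -f "$tems_dir/$table.$ext" ]; then
                cp -p "$tems_dir/$table.$ext" "$save_dir/" || return 1
            fi
            cp "$empty_dir/$table.$ext" "$tems_dir/$table.$ext" || return 1
        done
    done
}
```

For example, reset_tables /opt/IBM/ITM/tmp /opt/IBM/ITM/tables/<temsnodeid> /opt/IBM/ITM/tmp/saved QA1CRULD QA1CCOBJ would reset both situation tables in one pass. Because the existence check runs first, a mistyped table name stops the function before any file is touched.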
Backup/Recovery best practice
The following document was co-authored with L3 TEMS and represents the best current thinking. It gives five ways to create a valid, useful and reliable backup of the TEMS database files.
This document shows how to repair many cases of damaged TEMS database files.
History and Earlier versions
Correct credit name for photo
Add information about two more tables that can be reset to emptytable status at the hub TEMS.
Add warning about not starting refreshed FTO mirror hub TEMS first.
Rename the emptytable files including the platform type - to reduce mistakes.