Sitworld: ITM Database Health Checker
Draft #42 – 24 April 2016 - Level 1.28000
Follow on twitter
In Spring 2014 I was tasked to repair a customer database that experienced a newly identified issue. If an agent was in the node status table [TNODESAV] and the related object in the NODELIST table [TNODELST NODETYPE=V] was absent, the agent dropped into a never never land where no situations would run and no real time data could be retrieved. After preparing a recovery action plan and the customer site was again working properly, I challenged myself to discover this issue from the TEMS Database files [EIB] alone. In a few days I was able to replicate the list of problem agents that had taken quite some weeks to calculate using usual methods.
I searched for more such cases of EIB inconsistency and other discoverable problems. The goal was to accelerate problem diagnostic time by providing advisories on customer problems. After a year of effort this project can now discover 52 issues. The issues range from the most severe to the totally innocuous. Now it is time to make this project available to a wider audience. In some cases the issues can be handled independently. In other cases the issues require IBM Support to cure. The issues are ranked in impact from 0 to 110. At 100 or higher, monitoring is not happening in some cases, which is the severest case. There are false positives, for example if you accidentally examine a remote TEMS and the missing node status information produces masses of uninteresting advisories.
The publishing delay was caused by new issues which appeared. A second reason for delay was a desire that the recent maintenance - ITM 630 FP4 - contained fixes for almost all observed issues. In some cases there is repair needed for defects experienced earlier. However in all known serious issues, new problems will not arise when running at latest maintenance levels. There may be error conditions new examples will not arise. Some cases arise from hardware or human error but speeding diagnosis minimizes the impact.
I chose to capture the live system data using a TEPS utility KfwSQLClient. It gets up to date information and it is also a process which uses the existing TEPS to TEMS connection. The retrieval process is rapid and reliable. If needed the resulting files can be moved to another system for analysis. This document assumes you install and run on the same system, however that is not required.
The output report is a .csv report file. The binary package includes a Word document with each issue explained and a recovery plan suggested.
Version 1.02000 now checks all tables which have user data and are synchronized with an FTO hub TEMS. The issue count is now 64.
Version 1.07000 adds the first Case Study to the documentation. This documents a case where Database Health Checker diagnosed a new error case and how that case was promptly resolved. With this Case Study any user could do the same and achieve an immediate recovery.
Version 1.16000 adds an important section which identifies possible duplicate agent names. This is usually an agent configuration mistake. Duplicate agent names cause severe TEMS and TEPS stress. In addition it reduces monitoring since only one agent name at a time will be show events or real time data
Version 1.21000 adds reports about a sampling of situation event history objects. These can be almost invisible stress factors.
Version 1.23000 handles Agent CF better, which is a non-TEMA managed system and so cannot run situations and is not expected to have MSLes etc.
The default install directory is used in the document, however you can make any installation choices. The EIB data is captured on the system which runs the TEPS. For example you could capture the data at the TEPS and then copy the files somewhere else to process. The choice of where to store the program objects is arbitrary.
Linux/Unix systems come with Perl installed. Windows may need it installed and I use http://www.activestate.com/activeperl, community edition 5.20.No CPAN modules are needed for this package. It will likely work on many different levels.
The package is ==>here<==. It contains
1) A Perl program datahealth.pl and control file datahealth.ini - standing for Database Health Checker. The ini file is not used in the package.
2) A Windows datasql.cmd file to run the SQL statements
3) A datasql.tar file which contains Linux/Unix version of the datasql.sh. This avoids problems with line endings. To use untar datasql.tar into the installation directory.
4) If the install directory is not the default, set the required environment variable before capturing data
5) A Word file "Database Health Checker - 1.28000.docx" which documents each advisory message and gives advice on how to proceed,
I suggest these all program objects be placed in a single directory. For Windows you can create the tmp directory and sql subdirectory. For Linux/Unix create the sql directory.
You can run this in any directory, of course.
Running the Database Health Checker Program
a) cd /opt/IBM/IBM/tmp/sql
b) If not using default install directory specify like this: export CANDLEHOME=/opt/IBM/ITM
c) sh datasql.sh
d) perl datahealth.pl -lst
c) If not using default install directory specify like this: SET CANDLE_HOME=c:\IBM\ITM
e) perl datahealth.pl -lst
A report file is created: datahealth.csv which is the report. There is a log file which is not much used at present. The rest of the document shows some screen shots.
Screen shots of Database Health Checker Report
The beginning of the report contains the version number, a summary of the hub and remote TEMSes and a record if FTO is configured.
The Hub row has the total agents. The TEMS rows listsTEMS nodeid, ONLINE/OFFLINE status, the agent count and the version number.
Next is a view of the TEMA Deficit report. [See Appendix for Technical Deficit explanation] TEMA [or the Agent Support Library] is a part of the TEMS and it is bundled with the OS Agent. When this TEMS logic is running at a lower version then the rest of the TEMS, APAR fixes are not present. This is called a TEMA deficit and the ITM environment is exposed to problems that would otherwise never be seen. There have been 122 APAR fixes since the first release in 2005. For an exact analysis of what agents are missing which fixes, please see ITM Agent Census Scorecard.
Average APARS TEMA version less the TEMS version: The average number of APAR fixes missing on each agent compared to all OS Agents upgraded to existing TEMS maintenance level.
Average APARS TEMA version less the latest TEMS version: The average number of APAR fixes missing on each agent compared to TEMS and all OS Agents upgraded to highest TEMS maintenance level of ITM 630 FP4.
These averages are calculated from every Agent that records a TEMA version [not all do]. Some could be at level and others way behind. One TEMA can be used by many agents, so updating a single OS Agent can improve many other agents.
The last section contains advisory messages.
The first part is the count of advisory messages and that is followed by the messages. Note that the impact can range from 0 to 110. At 100 and over, some monitoring may not be running. That means no situations, no real time data and no historical data. To increase your interest, here are examples of two cases.
The advisory code DATAHEALTH1037W is fully described in the Word file which is included with the programs. That one looks like
25 DATAHEALTH1037W CLO_DB2:ibmw1dbs02:LO Agent at version [06.30.00] using TEMA at lower release version [06.23.03]
So impact is 25 - not too terrible. The third column lists the object involved, this is often an agent name but not always.
This particular case involves an agent at ITM 630 level which is connecting to a TEMS at agent ITM 623 FP3. This is not a supported level and is not tested and is certainly not recommended. It may work or it may cause problems. It is impossible to predict. You will experience fewer problems if you stay within the product guidelines and avoid being one of the "far away from normal" environments.
Another interesting one - which was quite a surprise when I started seeing reports from real life environments.
80 DATAHEALTH1010W ip.pipe:#184.108.40.206<NM>ibmaswdbc2</NM> TNODESAV duplicate hostaddr in [ibmaswdbc2.nam.nsroot.net:KUX[rtems-gt3-ibm1][Y] ibmaswdbc2:KUX[rtems-gt3-ibm1][Y] ]
That is two agents, both on the same system, both appearing online both claiming to use the same listening port. [That is literally impossible from a TCP Socket usage standpoint.] Likely one agent was stopped and never actually completed shutting down and the other came up and reused the listening port. That is confusing and wrong and needs investigation. It isn't horrible because monitoring is continuing but it is a waste of resources and a waste of human analysis time. In some cases it can cause performance problems at the TEMS and the TEPS.
Always remember there can be false positives and so each case needs to be handled separately.
The first time you run the report you may see many many advisories. Remember that the higher impact ones are the most important. You may have a new case where the analysis routine makes a false estimate of the condition. Some are relatively easy to resolve and some require IBM Support to resolve... you are not alone. Read through the documentation on messages and clear things up. As you correct problems your monitoring environment will run smoother.
These advisory messages are a personal judgment based on working with ITM for many years. There are no absolutes here and if one of the advisories seems incorrect then just ignore it and move on. Or contact the author and ask questions. For ITM environments that have been running for many years, you will likely see a lot of 0 impact advisories. Usually these are things like Managed System Lists for products no longer installed or references to agents that have been decommissioned. It is sort of nice to clean these up, but the effect is usually minor. [I had one case where a customer had 100,000 Nodelist table objects and fully 35% were for MSLs no longer used at all. Removing them did improve TEMS performance.]
There are very rare cases where problems are caused by TEMS database file problems. One big help is the division between data .DB file and index .IDX file. This Health Checker project uses the index file and when problems occur the index file is most commonly broken. The data file is rarely damaged* because all data is written into new areas and deleted records are simply updated with a "delete" flag added. Thus at any time the disk data file is consistent before and after an update.
IBM Support has tools to work directly with the data .DB files and can often recover data and then use that to rebuild the table. This can also be done with backup files. Every customer should have a best practice backup and recovery plan so please review this blog post co-written with L3 TEMS:
Sitworld: Best Practice TEMS Database Backup and Recovery.
I have worked on rare cases where the lack of a reliable backup process caused a week+ outage and much manual rework. That happened maybe 3 times in 6 years. When making plans, please review the technote and chose one of the five ways to create a reliable backup. A simple copy of the files - manual or automatic - is useful in some ways but is not reliable enough for a restoration. If you have a reliable backup you can often use the Recovery process by yourself and minimize outage time.
*Note: One memorable case had the data on a large Network Access Storage device that lost power and lots of files were seriously damaged.
This report shows problems in TEMS database tables.
This is a work in progress and will be updated periodically. If you have any feedback or insights about missing features, please communicate with the author.
History and Earlier versions
If the current version of the Database Health Checker tool does not work, you can try previous published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue try intermediate levels to isolate where the problem was introduced.
Advisory on missing MQ hostname qualifier
Calculate rate of event arrivals
Advisory on duplicate SYSTEM NAMES
Advisory when ::CONFIG agents not connected to hub TEMS.
Handle some KfwSQLClient output better
Add missing single quote in datasql.sh and the copy in datasql.tar
Handle CF/:CONFIG differently since managed system does not use a TEMA
Reduce some advisory impact levels
correct some titles and add some event related times
Report on Situation Flippers and Fireflys
Add TEMA APARs from ITM 630 FP6
No logic change, updated datasql.cmd - a table was missing
parse_lst handle null chunks correctly
Better report on possible duplicate agents, screen TNODELST for just M records
Advisory when FTO hub TEMS at different maintenance levels
Advisories when Sampling Days and Time not in correct form
Display possible duplicate agent names if count more than mode [most common count]
Add report of TEPS version and architecture
Add i5os Agent version level reports
Add top 20 changed/added situations
Add possibly duplicate agent names and an advisory for most likely
Add ITM 630 FP5 TEMA APARs and avoid divide by zero case in TEMA deficit report section
Improve APAR Deficit calculation, calculate avg days/APAR
Improve APAR Deficit calculation, skip A4 TEMA level 06.20.20 warning
Reduce number of DATAHEALTH1043E messages. Add TEMS architecture if available. Add first Case Study to doc
Add advisory on APAR IV50167 and other ITM 630 FP2 issues.
Better parse_lst logic. Easier to configure cmd/sh files for SQL capture.
Fix Workflow Policy checks to warn on autostart *NO at lower impact, correct TSITDESC -lst capture
Correct some mis-spellings and fix some logic SQL mismatch in the -lst option, TNAME and TPCYDESC issues
Add checking for all sync'd tables and better LSTDATE checks. Correct a number of messages with incorrect table names. Also alert on sampling interval longer then 4 days which can be a problem on z/OS hub TEMS.
Correct SQL in datasql.cmd and datasql.sh [in datasql.tar]. Change "future date" check to more then one day ahead of localtime to handle timezone issues.
Appendix 1: Technical Deficit/Debt Explanation
Technical debt is a well known concept in program development and support over time.
My interpretation is: As systems evolve over time the need for changes arise and there is always a choice between
1) Ignoring the issue until you are forced to change and take the hit of losing customers or reputation or product benefit.
2) Making the simplest possible change with minimal effort disregarding design and good practice.
3) Redesigning the system to avoid the problem and adding regression tests to detect similar issues in the future.
Any of the choices are legitimate business responses. Choice (3) means you are investing in the future to reduce future costs and enhance the customer experience. (1) and (2) limit current costs while postponing changes and costs into the future. (2) vastly increases the cost of later changes since it is a patch or quirk which usually makes redesign much more costly.
Technical Debt is something like a credit card purchase: You get the benefit now and postpone costs to the future.
One author related it to a stock market call option. In that transaction you sell the right to a future purchase of a commodity at a fixed cost. You get money now and if the commodity drops in price you can deliver the goods and make money. If the commodity increases in price too much you lose money. (1) and (2) is like selling a call and expecting that the future costs will not be too high later on. Hope springs eternal but is rarely rewarded.
This TEMA Deficit report section judges failure to update OS Agents as an example of (1) - Ignoring the issues until you are forced to make a change. Making updates to handle specific issues is an example of (2). Updating agents to prevent known defects is an example of (3) - adapting to avoid future problems.
Not updating agents is taking on a Technical Debt. That path increases costs and reduces product benefits - and should be avoided. The report section identifies how many APAR fixes you are in deficit. For a report with more detail please see ITM Agent Census Scorecard, which lists the name and abstract of each APAR fix affecting agents and how many agents are affected. Following that are more detailed reports so you can drill down to your hearts content. That can be highly valuable when determining when an agent upgrade makes sense.
Photo Note: Wood Carving of the Mythical Phoenix - Big Sur, California Nepenthe Restaurant