Sitworld: FTO Configuration Audit
John Alvord, IBM Corporation
In several recent cases, the hub TEMS randomly became inoperative. After long study and diagnostic data collection, the conclusion was that the Fault Tolerant Option [FTO or Hot Standby or Mirror] configuration was incorrect. In one case several z/OS remote TEMSes were missing the CMS_FTO=YES control. In another, the distributed remote TEMS glb_site.txt file had one entry that pointed to another remote TEMS instead of the two hub TEMSes as required. These problems took several months to discover and test, so I decided this aspect was ripe for an audit tool. That way any customer can make sure their FTO setup is configured correctly.
FTO works by having two hub TEMSes configured together. At any one time one hub TEMS takes the primary role [the first one to start] and the other hub TEMS takes the backup role. There is a TEMS-to-TEMS conversation and new user data is propagated from the hub TEMS in primary role to the hub TEMS in backup role. The backup hub TEMS does accept remote TEMS and Agent connections, but shortly afterward it tells them to "find another TEMS" and disconnects. At the most recent levels it doesn't run any situations.
The remote TEMS logic is simpler. First, if FTO is not being used [CMS_FTO=NO or not defined], then at startup the glb_site.txt entries show what hub TEMS might be there. Each one is tried in turn until a successful connection is made. From then on that is the only hub TEMS that will be connected to until the next remote TEMS startup.
Second, if FTO is being used [CMS_FTO=YES], the same initial logic is followed to find a working hub TEMS. The difference comes after a loss of hub TEMS connection: at that time the logic starts looking again for a working hub TEMS. In that way it will find the new hub TEMS in primary role after a switch over.
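The two remote TEMS behaviors described above can be sketched as a small simulation. This is only an illustration of the described logic, not the TEMS implementation: the function names, the hub names hub_a and hub_b, and the idea of modeling each moment in time as a set of reachable hubs are all assumptions made for the example.

```python
from typing import List, Optional, Set


def first_reachable(sites: List[str], up: Set[str]) -> Optional[str]:
    """Try each glb_site.txt entry in order and return the first one that answers."""
    for site in sites:
        if site in up:
            return site
    return None


def simulate_remote_tems(sites: List[str], fto_enabled: bool,
                         up_states: List[Set[str]]) -> List[Optional[str]]:
    """Return the hub the remote TEMS is connected to at each step (None = no hub).

    sites       -- hypothetical glb_site.txt entries, in file order
    fto_enabled -- models CMS_FTO=YES (True) versus NO or not defined (False)
    up_states   -- one set per step, holding the hubs that accept the connection
    """
    history: List[Optional[str]] = []
    chosen: Optional[str] = None
    for up in up_states:
        if chosen is None:
            # Startup (or still searching): walk the entries in turn.
            chosen = first_reachable(sites, up)
        elif chosen not in up and fto_enabled:
            # Connection lost and FTO is on: look again for a working hub.
            chosen = first_reachable(sites, up)
        # Without FTO the original choice is kept until the next startup.
        history.append(chosen if chosen in up else None)
    return history


if __name__ == "__main__":
    sites = ["hub_a", "hub_b"]
    switch_over = [{"hub_a"}, {"hub_b"}, {"hub_b"}]  # primary role moves to hub_b
    print(simulate_remote_tems(sites, True, switch_over))
    print(simulate_remote_tems(sites, False, switch_over))
```

With FTO enabled the simulated remote TEMS follows the primary role to the second hub; with FTO disabled it stays attached to its first choice and remains disconnected after the switch over.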
If the FTO configuration is not identical across all hub and remote TEMSes, things won't work. The big surprise is how badly things fail, including hub TEMS breaking.
The rest of this post presents a new tool which will perform all the needed checking and report on discrepancies. The cases where a manual check is needed are also documented. By using this tool you can validate that the configurations are correct and fix any issues before experiencing outages. Or, if you suspect this issue, you can rule it in or out quickly.
Preparing for the install
Perl is usually pre-installed in Linux/Unix systems. For Windows you may need to install it from www.activestate.com or any other source. The program only uses Perl core services and no CPAN modules are needed.
FTO Configuration Audit has been tested with
This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x64-multi-thread
zLinux with Perl v5.8.7
This tool runs on the same system as a TEPS connected to the current hub TEMS.
A zip file is found here. There is one file ftoaudit.pl.
Run Time Options
-h [optional] supply ITM installation directory if not default. You can also set an environment variable [Windows SET CANDLE_HOME=xxxxx or Linux/Unix export CANDLEHOME=xxxxx] before starting ftoaudit.pl.
-v show log messages during process
-debug run in debug mode
-debuglevel default 99. If set to 300 the log file is more detailed.
-work default C:\TEMP or /tmp - where to store report and log and working files
-o default ftoaudit.csv - name of report file
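As a rough illustration of how these options combine, here is a minimal sketch that assembles a command line for the tool. It assumes that each value option is followed by its value as a separate argument and that Perl is reachable as "perl"; the helper name and the sample paths are made up for the example.

```python
from typing import List, Optional


def build_ftoaudit_command(install_dir: Optional[str] = None,
                           work_dir: Optional[str] = None,
                           report_name: Optional[str] = None,
                           verbose: bool = False,
                           debug_level: Optional[int] = None) -> List[str]:
    """Assemble an argument list for ftoaudit.pl from the documented options.

    Assumption: value options are written as "-option value".
    """
    cmd = ["perl", "ftoaudit.pl"]
    if install_dir:
        cmd += ["-h", install_dir]               # ITM installation directory
    if verbose:
        cmd.append("-v")                         # show log messages
    if debug_level is not None:
        cmd += ["-debuglevel", str(debug_level)]
    if work_dir:
        cmd += ["-work", work_dir]               # report, log and working files
    if report_name:
        cmd += ["-o", report_name]               # report file name
    return cmd


if __name__ == "__main__":
    # Example values only; adjust to the local installation.
    print(" ".join(build_ftoaudit_command(install_dir="/opt/IBM/ITM",
                                          verbose=True,
                                          work_dir="/tmp/fto",
                                          report_name="fto_check.csv")))
```

The resulting list could be handed to subprocess.run on the TEPS system, or simply printed and pasted into a shell.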
This logic will recover and cross-check all the environment variable CMS_FTO values.
The glb_site.txt checking works only on Windows/Linux/Unix remote TEMS and only when there is an OS Agent active on the same system.
Any z/OS remote TEMS will need manual checking. The KDCSSITE member is equivalent to the glb_site.txt. KDSENV will contain the CMS_FTO setting, if present.
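For the manual cases, or simply to understand what "identical glb_site.txt" means in practice, the comparison can be sketched as follows. This is not the logic inside ftoaudit.pl; it is an assumed approach that treats the most common set of entries as the reference and flags every remote TEMS that differs. The TEMS names and the entry format shown are illustrative.

```python
from collections import Counter
from typing import Dict, FrozenSet, List


def normalize(lines: List[str]) -> FrozenSet[str]:
    """Lower-case and trim each entry and drop blank lines."""
    return frozenset(line.strip().lower() for line in lines if line.strip())


def find_outliers(site_files: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the remote TEMSes whose entries differ from the most common set.

    site_files maps a remote TEMS name to the lines of its glb_site.txt
    (or KDCSSITE member on z/OS, copied by hand).
    """
    normalized = {tems: normalize(lines) for tems, lines in site_files.items()}
    if not normalized:
        return {}
    reference, _count = Counter(normalized.values()).most_common(1)[0]
    return {tems: sorted(entries)
            for tems, entries in normalized.items()
            if entries != reference}


if __name__ == "__main__":
    # Hypothetical data: one remote TEMS has a mistyped second hub entry.
    sample = {
        "REMOTE_NMP181": ["ip.pipe:nmp180", "ip.pipe:nmp182"],
        "REMOTE_NMP183": ["ip.pipe:nmp180", "ip.pipe:nmp182x"],
        "REMOTE_NMP184": ["IP.PIPE:nmp180 ", "ip.pipe:nmp182", ""],
    }
    print(find_outliers(sample))
```

A flagged TEMS is a candidate for review, not proof of an error: one file may use resolvable names and another ip addresses and still point at the same two hub TEMSes.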
FTO Configuration Audit Report
Here is a sample report with interspersed comments.
FTO Configuration Audit Report - Version 0.80000
Primary Hub TEMS - HUB_NMP180
Backup Hub TEMS - HUB_NMP182
===> lists the detected primary and backup hub TEMSes. If this is wrong, maybe the TEPS is not connected to the FTO primary hub TEMS.
100,CMSFTO1006E,HUB_NMP180,Hub TEMS running FTO some remote TEMS not using same glb_site.txt - see later report
===> See following for list of all advisory messages
Remote TEMS glb_site.txt report
===> note how the NMP183 has an extra added "x" where I forced an error.
Elapsed Time report hub TEMS 2.82865595817566
===> The above report section is interesting and may detect cases of high latency between the hub TEMS and other TEMSes. The elapsed time is larger than you might expect because there is a Java startup cost in the KfwSQLClient utility that gets used.
===> The end of the report contains an explanation of the advisory messages.
Advisory Trace, Meaning and Recovery suggestions follow
Advisory code: CMSFTO1006E
Text: Hub TEMS running FTO some remote TEMS not using same glb_site.txt - see later report
Meaning: In FTO configuration remote TEMSes need to have a
configuration that specifies the two hub TEMS. These two hub
TEMSes are defined during configuration and the result is stored
in the glb_site.txt file.
These files will normally be identical. If they are not identical
then the FTO logic will break.
A following report section will detail the contents of each
glb_site.txt which should be thoroughly reviewed. It is possible
for differences to be present, such as one that uses resolvable
names and others that use ip addresses and all is well. More
commonly one or more is just referencing an incorrect address...
most are OK and some are wrong. In this case FTO logic will
break and this can cause hub TEMS instability and crashes.
Errors in the DNS resolving system or /etc/hosts file could make
the results inconsistent even though it looks OK.
The data is available only if there is an OS Agent running on
the same system as the remote TEMS. If there is none, the remote
TEMS glb_site.txt should be reviewed manually.
Recovery plan: Review the glb_site.txt report and reconcile
any differences. That usually means re-configuring the remote
CMSFTO1001W - Hub TEMS running FTO but no Backup hub TEMS found
CMSFTO1002E - Hub TEMS running FTO but Backup hub TEMS [tems_nodeid] not running FTO
CMSFTO1003E - Hub TEMS running FTO but remote TEMS [tems_nodeid] not running FTO
CMSFTO1004W - Hub TEMS not running FTO but a Backup hub TEMS [tems_nodeid] was found
CMSFTO1005E - Hub TEMS not running FTO but remote TEMS [tems_nodeid] is running FTO
CMSFTO1006E - Hub TEMS running FTO some remote TEMS not using same glb_site.txt - see later report
CMSFTO1007E - TEMS running with KGLCB_FSYNC_ENABLED=0: risk of database file damage and TEMS outage
*note* This is unrelated to FTO but it is concerning on any Linux/Unix system.
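The first five advisories amount to a cross-check of the CMS_FTO setting between the hub TEMS and every other TEMS. A minimal sketch of those rules, written from the advisory texts above rather than from the ftoaudit.pl source, might look like this. The function name, argument shapes and TEMS names are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple


def cross_check_fto(hub_fto: bool,
                    backup: Optional[Tuple[str, bool]],
                    remotes: Dict[str, bool]) -> List[Tuple[str, str]]:
    """Return (advisory code, TEMS name) pairs for CMS_FTO mismatches.

    hub_fto -- True when the primary hub TEMS has CMS_FTO=YES
    backup  -- (name, fto flag) of the backup hub TEMS, or None if none was found
    remotes -- remote TEMS name mapped to its FTO flag
    """
    findings: List[Tuple[str, str]] = []
    if hub_fto:
        if backup is None:
            findings.append(("CMSFTO1001W", ""))
        elif not backup[1]:
            findings.append(("CMSFTO1002E", backup[0]))
        for name, fto in remotes.items():
            if not fto:
                findings.append(("CMSFTO1003E", name))
    else:
        if backup is not None:
            findings.append(("CMSFTO1004W", backup[0]))
        for name, fto in remotes.items():
            if fto:
                findings.append(("CMSFTO1005E", name))
    return findings


if __name__ == "__main__":
    # Hypothetical environment: one remote TEMS is missing CMS_FTO=YES.
    print(cross_check_fto(True, ("HUB_NMP182", True),
                          {"REMOTE_A": True, "REMOTE_B": False}))
```

The same function is handy for z/OS remote TEMSes: read CMS_FTO from KDSENV by hand, add the value to the remotes mapping and see which advisory would apply.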
In the report itself, if an advisory is produced, the end of the report includes the impact and a discussion and a recovery plan. If this is unclear you can always contact IBM Support.
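If the report needs to feed another process, the advisory lines near the top can be picked out mechanically. The sketch below assumes that each advisory line has the shape seen in the sample report (a number, the advisory code, the TEMS name, then free text) and that the leading number is the impact; neither assumption is confirmed beyond that one sample line.

```python
import re
from typing import Iterable, List, NamedTuple


class Advisory(NamedTuple):
    impact: int   # assumed meaning of the leading number
    code: str
    tems: str
    text: str


# Pattern inferred from the single sample line in this post.
ADVISORY_LINE = re.compile(r"^(\d+),(CMSFTO\d{4}[EW]),([^,]*),(.*)$")


def parse_advisories(report_lines: Iterable[str]) -> List[Advisory]:
    """Collect the advisory lines from the report text and ignore everything else."""
    found: List[Advisory] = []
    for line in report_lines:
        match = ADVISORY_LINE.match(line.strip())
        if match:
            found.append(Advisory(int(match.group(1)), match.group(2),
                                  match.group(3), match.group(4)))
    return found


if __name__ == "__main__":
    sample = [
        "FTO Configuration Audit Report - Version 0.80000",
        "Primary Hub TEMS - HUB_NMP180",
        "Backup Hub TEMS - HUB_NMP182",
        "100,CMSFTO1006E,HUB_NMP180,Hub TEMS running FTO some remote TEMS "
        "not using same glb_site.txt - see later report",
    ]
    for advisory in parse_advisories(sample):
        print(advisory.code, advisory.tems, advisory.impact)
```

To use it on a real run, read the report file named by the -o option (ftoaudit.csv by default) and pass its lines to parse_advisories.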
Identify and correct FTO configuration issues. If you find any anomalies which are hard to correct, please contact the author.
Here are recently published versions. In case there is a problem at one level you can always back up.
Add check for non-TEPS system
Note: View from Nepenthe Restaurant, Big Sur California