FFST overview
Advances in hardware manufacturing and technology have enabled the computer industry to vastly improve the reliability of circuitry and reduce hardware cost. Less expensive hardware has stimulated extensive use of circuitry that detects failures or deteriorating circuit performance and 'calls home', pointing out which component should be replaced. The result is a significant reduction in repair time and even more significant reductions in service skill and labor.
As hardware reliability improves, software problems account for a greater portion of system and application interruptions, because software has not enjoyed the same degree of technological advancement as hardware. Although great strides have been made in quality, often measured as errors per 1000 lines of code, the amount of code and the complexity of systems have increased enough to make this improvement barely visible. Currently, the industry offers programs based on several different failure capture techniques, which require a variety of personnel skills and system resources to recognize and resolve failures across a system.
There are at least five major problems that exist in the software service arena today:
- Detecting problems as early as possible before the environment changes
- Capturing the correct data to debug the software problem the first time the error occurs
- Capturing only the data required to debug the error (that is, minimizing the need for full address space dumps)
- Providing immediate notification of the error
- Uniquely identifying the error in order to determine whether it is a condition that was already detected and reported to the support organization
FFST produces the following outputs to address these problems:
- customized dump
  - Promotes the collection of only the data required to debug a software problem.
- symptom string
  - Provides a unique problem 'label' that can be used to quickly determine whether a software problem has already been detected. The symptom string is contained in each output in this list.
- symptom record
  - Error log entry built to IBM's Symptom Record Architecture (SRA) standard and placed in LOGREC.
- messages
  - Indication on the operator console that a problem has occurred and that FFST was called to collect the data and report the problem.
- network notification
  - Indication through a Systems Network Architecture (SNA) Generic Alert that a problem has occurred and that FFST was called to collect the data and report the problem. The Generic Alert includes key information, including the machine on which the problem occurred and, if the detecting product requested a dump, the name of the dump data set.
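The role of the symptom string as a problem 'label' can be illustrated with a small sketch. The following Python example is not FFST code and does not reproduce the FFST or SRA symptom string format. The `Failure` fields, the `KEY/value` layout, and the `ProblemRegistry` class are all hypothetical names chosen for illustration. It only shows the general idea: build a stable label from the fields that identify a failure, then use that label to decide whether the problem has been seen before.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical description of a detected failure (not an FFST structure)."""
    component: str   # product or component that detected the error
    module: str      # module in which the error was detected
    probe_id: str    # identifier of the detection point in the code
    error_code: str  # code describing the error condition


def symptom_string(failure: Failure) -> str:
    """Build a stable label from the fields that identify the failure.

    The KEY/value layout is illustrative only; it is an assumption,
    not the real symptom string format.
    """
    parts = [
        ("COMP", failure.component),
        ("MOD", failure.module),
        ("PROBE", failure.probe_id),
        ("CODE", failure.error_code),
    ]
    return " ".join(f"{key}/{value}" for key, value in parts)


class ProblemRegistry:
    """Hypothetical record of the symptom strings reported so far."""

    def __init__(self) -> None:
        self._counts = Counter()

    def report(self, failure: Failure) -> bool:
        """Record the failure; return True only the first time it is seen."""
        label = symptom_string(failure)
        first_time = self._counts[label] == 0
        self._counts[label] += 1
        return first_time

    def occurrences(self, failure: Failure) -> int:
        """Return how many times this failure has been reported."""
        return self._counts[symptom_string(failure)]


if __name__ == "__main__":
    registry = ProblemRegistry()
    failure = Failure("PAYROLL", "PAYCALC", "P0042", "E17")
    print(symptom_string(failure))
    print(registry.report(failure))  # first report of this label
    print(registry.report(failure))  # same label reported again
```

Because the label depends only on the identifying fields, two reports of the same failure compare equal, while a failure detected at a different point in the code gets a different label.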
It should be noted that there are situations that will continue to require full address space dumps. For certain types of problems it is very difficult for a programmer to determine what data may be required to diagnose a failure. For these problems, a capture of the complete environment will be required.
IBM programmers continue to improve the defensive programming techniques in their software to ensure that the number of failures requiring a full address space dump for diagnosis is kept to a minimum.
