Overview of Predictive Failure Analysis

Predictive Failure Analysis (PFA) is designed to predict potential problems with your systems. PFA extends availability by going beyond failure detection to predict problems before they occur. PFA provides this support by using remote checks from IBM Health Checker for z/OS to collect data about your installation. Using this data, PFA constructs a model of the expected or future behavior of the z/OS images and compares the actual behavior with the expected behavior. If the actual behavior is abnormal, PFA issues a health check exception. PFA uses a z/OS UNIX System Services (z/OS UNIX) file system to manage the historical and problem data that it collects.

The following image displays an LPAR view of the PFA components:

Figure 1. LPAR view of the PFA components

PFA creates report output in the following ways:

In a z/OS UNIX file that stores the list of suspect tasks. The descriptions of the individual checks give the directory and file names.
In an IBM Health Checker for z/OS report that is available through z/OS System Display and Search Facility (SDSF) and in the message buffer.
Your installation can also set up IBM Health Checker for z/OS to send output to a log stream. After you set it up, you can use the HZSPRINT utility to view PFA check output in the message buffer or in the log stream. For complete details, see Using the HZSPRINT utility in IBM Health Checker for z/OS User's Guide.

How PFA works with a typical remote check

PFA_COMMON_STORAGE_USAGE is a remote check that evaluates the common storage use of each system. PFA, running in its own address space, periodically collects common storage area (CSA + SQA) data from the system on which the check is running. The check writes the CSA usage data, at intervals, to a z/OS UNIX file. The check identifies a list of common storage users that are abnormal and that might contribute to exhausting common storage. PFA issues an exception message to alert you if a potential common storage problem exists and provides a list of suspect tasks. You can then examine the list and stop the cause of the potential problem or move critical work off the LPAR.
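The comparison of actual behavior against expected behavior can be illustrated with a minimal Python sketch. It is not PFA's algorithm, which this topic does not describe. The UsageSample record, the suspect_tasks function, the task names, and the tolerance factor are all hypothetical, and the expected values are assumed to come from some previously built model.

```python
from dataclasses import dataclass


@dataclass
class UsageSample:
    """One hypothetical per-task reading of common storage use, in bytes."""
    task: str
    expected: int  # value a model predicted for this interval (assumed input)
    actual: int    # value observed in this interval


def suspect_tasks(samples, tolerance=1.5):
    """Return tasks whose actual use exceeds expected use by the tolerance factor.

    The tolerance factor is an illustrative assumption, not a PFA parameter.
    """
    flagged = [s for s in samples if s.actual > s.expected * tolerance]
    # Largest overshoot first, so the most likely contributors lead the list.
    flagged.sort(key=lambda s: s.actual - s.expected, reverse=True)
    return [s.task for s in flagged]


samples = [
    UsageSample("JOBA", expected=1_000_000, actual=1_100_000),
    UsageSample("JOBB", expected=2_000_000, actual=5_000_000),
    UsageSample("JOBC", expected=500_000, actual=900_000),
]
print(suspect_tasks(samples))
```

In this sketch JOBA stays within the tolerance, while JOBB and JOBC are listed in order of how far they exceed the expected value, which mirrors the idea of a list of suspect tasks to examine.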

How PFA interacts with IBM Health Checker for z/OS

When PFA issues an exception, the PFA check does not issue further exceptions to the console until the check determines that a new exception must be issued or the exception is resolved. For some checks, a new exception is always issued after a new model is created. For other checks, the data must change significantly or the exception message must be different. In all cases, the check continues to run at the defined interval and makes the latest exception report data available through the CK panel in SDSF.
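The re-issue rules described above can be sketched as a small decision function. This is an illustration only: the should_issue function, the dictionary keys, and the 25 percent change threshold are assumptions, and what counts as a significant change for a real check is not stated in this topic.

```python
def should_issue(previous, current, new_model, reissue_on_new_model,
                 change_threshold=0.25):
    """Decide whether to send a new exception to the console.

    previous and current are hypothetical dicts with 'message' and 'value'
    keys, or None when no exception is active.
    """
    if current is None:
        return False      # nothing abnormal now, so nothing to issue
    if previous is None:
        return True       # first exception since the last resolution
    if reissue_on_new_model:
        return new_model  # this kind of check re-issues after each new model
    if current["message"] != previous["message"]:
        return True       # the exception message is different
    # Otherwise require a significant relative change in the data.
    baseline = abs(previous["value"]) or 1
    change = abs(current["value"] - previous["value"]) / baseline
    return change >= change_threshold


first = {"message": "A", "value": 100}
print(should_issue(None, first, False, True))
print(should_issue(first, {"message": "A", "value": 110}, False, False))
```

Whatever the function returns, the check in this model still runs at every interval; only the console message is suppressed, which matches the statement that the latest report data remains available in SDSF.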

How PFA starts the modeling predictions JVMs

For each PFA check, a unique, check-specific JVM-hosting address space is started and stopped as needed for machine learning prediction modeling. The JVMs that host the modeling are started in BPX (OMVS) created processes, which terminate after a time and are re-created as needed. These created processes follow the WLM service definition classification rules for the OMVS subsystem. The OMVS job names are constructed by appending a value of 1 through 9 to the job name of the PFA started task. More than one process can have the same job name at any time. Each PFA check has a separate created process, and these processes can start at the same time or at different times. During PFA focused analysis, or when a model rebuild is required, the OMVS processes can run as soon as 1, 5, or 15 minutes after the most recent comparison. Otherwise, the JVMs run at the default interval of 12 hours.
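As a small illustration of the naming rule, the following Python sketch lists the job names that a created process could carry for a given started task. The function names are hypothetical, and the started task name PFA is used only as an example value. Such a list might help when you review which OMVS work the classification rules need to cover.

```python
def candidate_jobnames(started_task):
    """List the names a created process could carry: task name plus 1 to 9."""
    return [f"{started_task}{n}" for n in range(1, 10)]


def is_modeling_jobname(jobname, started_task):
    """Return True if jobname matches the started task name plus one digit 1-9."""
    return jobname in candidate_jobnames(started_task)


names = candidate_jobnames("PFA")
print(names)
print(is_modeling_jobname("PFA3", "PFA"))
```

Because more than one process can share a name, a match on the job name identifies the group of modeling processes and not one specific check.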