POWER6 information

Output files for health check

Learn about the output files for the Fast Fabric health check.

The Fast Fabric health check output files are documented in the Fast Fabric Toolset Users Guide. The following information provides some of the key aspects of the output files:

While the following information is intended to be comprehensive in describing how to interpret the health check results, for the most recent information about health check, see the Fast Fabric Users Guide.

When any of the health check tools is run, the overall success or failure is indicated in the output of the tool and in its exit status. The tool indicates which areas had problems and which files must be reviewed. The results from the latest run are in the $FF_ANALYSIS_DIR/latest/ directory. This directory contains many files that show the latest configuration of the fabric and the errors or differences found during the health check. If the health check fails, the following paragraphs describe the order in which to review these files.
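
Because the exit status carries the overall result, a wrapper script can branch on it. The following is a minimal sketch, not part of the Fast Fabric Toolset: the function name `health_check_passed` is hypothetical, the command is supplied by the caller, and it assumes the usual convention that an exit status of 0 means success.

```python
import subprocess
import sys


def health_check_passed(command):
    """Run a command and return (passed, captured_output).

    `command` is an argument list supplied by the caller. This sketch
    does not assume any particular tool name or option.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    # Assumption: exit status 0 means success, any other value means failure.
    return result.returncode == 0, result.stdout
```

For example, calling `health_check_passed([sys.executable, "-c", "import sys; sys.exit(1)"])` with this stand-in command reports a failure; in practice the argument list would name the health check tool being run.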

If the -s option (save history) was used when running the health check, a directory whose name is the date and time of the failing run is created under the $FF_ANALYSIS_DIR directory. In that case, that directory can be consulted instead of the latest directory shown in the following examples.
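
The choice between the latest directory and a saved run can be sketched as follows. This is an illustration under stated assumptions: the helper names `results_dir` and `saved_runs` are hypothetical, the baseline directory is assumed to sit beside latest under FF_ANALYSIS_DIR, and the exact date-and-time naming of saved runs is not assumed.

```python
import os
from pathlib import Path


def results_dir(analysis_dir=None, saved_run=None):
    """Return the directory holding the health check results to review.

    analysis_dir: value of FF_ANALYSIS_DIR; read from the environment if omitted.
    saved_run:    name of a saved-history directory (present only if -s was used);
                  if omitted, the `latest` directory is returned.
    """
    if analysis_dir is None:
        analysis_dir = os.environ["FF_ANALYSIS_DIR"]  # KeyError if unset
    return Path(analysis_dir) / (saved_run if saved_run else "latest")


def saved_runs(analysis_dir):
    """List subdirectories other than `latest` and `baseline`, sorted by name.

    Assumption: every other subdirectory is a saved run.
    """
    skip = {"latest", "baseline"}
    return sorted(
        p.name for p in Path(analysis_dir).iterdir()
        if p.is_dir() and p.name not in skip
    )
```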

First, review the results for any esm (if using embedded subnet managers) or hostsm (if using host-based subnet managers) health check failures. If the subnet manager is incorrectly configured or not running, it can cause other health checks to fail. In that case, the subnet manager problems must be corrected first. The health check must then be rerun, and other problems reviewed and corrected as needed.

For a hostsm analysis, review the files in the following order:

latest/hostsm.smstatus
Ensure that this file indicates that the subnet manager is running. If no subnet managers are running on the fabric, that problem must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors.
latest/hostsm.smver.diff
This file indicates that the subnet manager version has changed. If this change is not expected, the subnet manager must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun after all other health check errors have been corrected.
latest/hostsm.smconfig.diff
This file indicates that the subnet manager configuration has changed. This file must be reviewed, and, as necessary, the latest/hostsm.smconfig file must be compared to the baseline/hostsm.smconfig file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
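
The review order above can be sketched as a small helper. This is a sketch under assumptions, not a Fast Fabric feature: the function name `hostsm_review_list` is hypothetical, an empty or absent .diff file is assumed to mean that nothing changed, and the format of the status file is not assumed, so that file is always listed for manual reading.

```python
from pathlib import Path

# Review order taken from the list above.
HOSTSM_ORDER = ["hostsm.smstatus", "hostsm.smver.diff", "hostsm.smconfig.diff"]


def hostsm_review_list(results):
    """Return (file name, reason) pairs, in review order, for files to look at."""
    todo = []
    for name in HOSTSM_ORDER:
        path = Path(results) / name
        if not path.exists():
            continue
        if name.endswith(".diff"):
            # Assumption: an empty .diff file means nothing changed.
            if path.stat().st_size > 0:
                todo.append((name, "change from baseline"))
        else:
            # The status file format is not assumed; always flag it for reading.
            todo.append((name, "confirm the subnet manager is running"))
    return todo
```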

For an esm analysis, the FF_ESM_CMDS configuration setting selects which ESM commands are used for the analysis. When using the default setting for this parameter, the files must be reviewed in the following order:

latest/esm.smstatus
Ensure that this file indicates that the subnet manager is running. If no subnet managers are running on the fabric, that problem must be corrected before proceeding further. After being corrected, the health checks must be rerun to look for more errors.
latest/esm.smShowSMParms.diff
This file indicates that the subnet manager configuration has changed. This file must be reviewed, and, as necessary, the latest/esm.smShowSMParms file needs to be compared to the baseline/esm.smShowSMParms file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
latest/esm.smShowDefBcGroup.diff
This file indicates that the subnet manager broadcast group for IPoIB configuration has changed. This file must be reviewed, and, as necessary, the latest/esm.smShowDefBcGroup file must be compared to the baseline/esm.smShowDefBcGroup file. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
latest/esm.*.diff
If the FF_ESM_CMDS setting has been changed, the changes in results for those additional commands must be reviewed. If necessary, correct the subnet manager configuration. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
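
Where a step calls for comparing a file in latest with its counterpart in baseline, the comparison can be reproduced with a standard text diff. The following is a minimal sketch: the function name `compare_to_baseline` is hypothetical, the baseline and latest directories are assumed to sit under the same analysis directory, and the output format is that of Python's difflib, not necessarily that of the .diff files the health check writes.

```python
import difflib
from pathlib import Path


def compare_to_baseline(analysis_dir, name):
    """Return unified-diff lines between baseline/<name> and latest/<name>."""
    base = Path(analysis_dir)
    old = (base / "baseline" / name).read_text().splitlines()
    new = (base / "latest" / name).read_text().splitlines()
    return list(
        difflib.unified_diff(
            old,
            new,
            fromfile=f"baseline/{name}",
            tofile=f"latest/{name}",
            lineterm="",
        )
    )
```

For example, `compare_to_baseline(analysis_dir, "esm.smShowSMParms")` returns an empty list when the two files are identical.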

Next, review the results of the fabric analysis for each configured fabric. If nodes or links are missing, the fabric analysis detects them. Missing links or nodes can cause other health checks to fail. If such failures are expected (for example, a node or switch is offline), further review of result files can be performed, but be aware that the loss of the node or link can cause other analyses to fail as well.

The following information presents the analysis order for the fabric.0:0 files. If other or additional fabrics are configured for analysis, you must review the files in the order shown in the following list for each fabric. There is no specific order for which fabric to review first.

latest/fabric.0:0.errors.stderr
If this file is not empty, it can indicate problems running iba_report (such as the inability to access a subnet manager), which can result in unexpected problems or inaccuracies in the related errors file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.
latest/fabric.0:0.errors
If any links with excessive error rates or incorrect link speeds are reported, they must be corrected. If there are links with errors, be aware that the same links might also be detected in other reports such as the links and comps files.
latest/fabric.0:0.snapshot.stderr
If this file is not empty, it can indicate problems running iba_report (such as the inability to access a subnet manager), which can result in unexpected problems or inaccuracies in the related links and comps files. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.
latest/fabric.0:0.links.stderr
If this file is not empty, it can indicate problems running iba_report, which can result in unexpected problems or inaccuracies in the related links file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.
latest/fabric.0:0.links.diff
This file indicates that links between components in the fabric have been changed, removed, or added, or that components in the fabric have disappeared. This file must be reviewed and, as necessary, the latest/fabric.0:0.links file must be compared to the baseline/fabric.0:0.links file. If components have disappeared, reviewing the latest/fabric.0:0.comps.diff file might be easier for such components. If necessary, correct missing nodes and links. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
latest/fabric.0:0.comps.stderr
If this file is not empty, it can indicate problems running iba_report, which can result in unexpected problems or inaccuracies in the related comps file. If possible, problems reported in this file must be corrected first. After being corrected, the health checks must be rerun to look for more errors.
latest/fabric.0:0.comps.diff
This file indicates that the components in the fabric or their Subnet Management Agent (SMA) configuration have changed. This file must be reviewed, and, as necessary, the latest/fabric.0:0.comps file must be compared to the baseline/fabric.0:0.comps file. If necessary, correct missing nodes, ports that are down, and incorrect port configurations. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
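
When several fabrics are configured, the same review order applies to each one. The following sketch groups the result files by fabric and lists the non-empty ones in that order. It is an illustration only: the function name `fabric_review_order` is hypothetical, file names are assumed to follow the fabric.X:Y.suffix pattern used in the list above, and a non-empty file is treated only as a candidate for review, since the content of the errors file is not parsed.

```python
import re
from collections import defaultdict
from pathlib import Path

# Suffixes in the review order given in the list above.
FABRIC_SUFFIX_ORDER = [
    "errors.stderr", "errors",
    "snapshot.stderr",
    "links.stderr", "links.diff",
    "comps.stderr", "comps.diff",
]
# Assumption: names look like fabric.0:0.links.diff (fabric id, then suffix).
FABRIC_NAME = re.compile(r"^fabric\.(\d+:\d+)\.(.+)$")


def fabric_review_order(results):
    """Map each fabric id to its non-empty result files, in review order."""
    found = defaultdict(dict)
    for path in Path(results).iterdir():
        match = FABRIC_NAME.match(path.name)
        if match and path.is_file() and path.stat().st_size > 0:
            fabric_id, suffix = match.groups()
            found[fabric_id][suffix] = path.name
    # Files whose suffix is not in the order list (such as plain links or
    # comps files) are left out; they are consulted only for comparison.
    return {
        fabric_id: [files[s] for s in FABRIC_SUFFIX_ORDER if s in files]
        for fabric_id, files in sorted(found.items())
    }
```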

Finally, review the results of the chassis analysis. For chassis_analysis, the FF_CHASSIS_CMDS and FF_CHASSIS_HEALTH configuration settings select which chassis commands are used for the analysis. When using the default settings for these parameters, the files must be reviewed in the following order:

latest/chassis.hwCheck
Ensure that this file indicates that all chassis are operating appropriately with the desired power and cooling redundancy. If there are problems, they must be corrected, but other analysis files can be analyzed first. Once any problems are corrected, the health checks must be rerun to verify the correction.
latest/chassis.fwVersion.diff
This file indicates that the chassis firmware version has changed. If this change is not expected, the chassis firmware must be corrected before proceeding further. After correcting the firmware version, rerun the health checks to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
latest/chassis.*.diff
These files reflect other changes to chassis configuration based on checks selected through the FF_CHASSIS_CMDS setting. The changes in results for these remaining commands must be reviewed. If necessary, correct the chassis. After being corrected, the health checks must be rerun to look for more errors. If the change is expected and is permanent, a baseline must be rerun once all other health check errors have been corrected.
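
Because the set of chassis .diff files depends on the configured commands, the list of files to review cannot be fixed in advance. The following sketch puts the two named files first and then picks up any remaining chassis .diff files by wildcard. The function name `chassis_review_order` is hypothetical, and the remaining files are sorted alphabetically only for predictability, since no order among them is specified above.

```python
from pathlib import Path


def chassis_review_order(results):
    """List chassis result files: hwCheck, then fwVersion.diff, then other diffs."""
    results = Path(results)
    named = ["chassis.hwCheck", "chassis.fwVersion.diff"]
    ordered = [n for n in named if (results / n).exists()]
    # Any other chassis.*.diff files come from the other configured commands.
    others = sorted(
        p.name for p in results.glob("chassis.*.diff") if p.name not in named
    )
    return ordered + others
```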

If any health checks fail, after correcting the related problems, another health check must be run to verify that all the problems are corrected. If the failures are due to expected and permanent changes, once all other errors have been corrected, a baseline must be rerun.



Last updated: Tue, February 08, 2011