Learn about the output files for the Fast Fabric health
check.
The Fast Fabric health check output files are documented in the Fast
Fabric Toolset Users Guide. The following information provides
some of the key aspects of the output files:
- The location of the output files is configurable in the /etc/sysconfig/fastfabric.conf file.
- The default location of output files is /var/opt/iba/analysis/[baseline
| latest | timestamp]. The $FF_ANALYSIS_DIR variable
defines the output directory, with a default of /var/opt/iba/analysis.
- The file name has the form [type of health check].[fast fabric
command].[suffix]
The types of health check are
shown in the following list:
- fabric: Subnet manager queries about
fabric status
- chassis: Switch chassis firmware queries
- hostsm: Queries about host-based subnet manager configuration
- esm: Queries about embedded subnet manager
configuration
- The Fast Fabric commands used by health check are detailed in
the Fast Fabric Toolset Users Guide.
- The suffixes for the output files are shown in the following list:
- All output files must be reviewed before taking a new baseline
health check to ensure that the saved configuration information is
correct.
- The all_analysis command is a wrapper for fabric_analysis,
chassis_analysis, hostsm_analysis, and esm_analysis.
- The analysis routines use the iba_report command
to gather information.
- Key output files to check for problems follow:
The following information is intended to be comprehensive
in describing how to interpret the health check results. For the most
recent information about health check, see the Fast Fabric Toolset Users
Guide.
When any of the health check tools are run, the overall success
or failure is indicated in the output of the tool and its exit status.
The tool indicates which areas had problems and which files must be
reviewed. The results from the latest run can be found in the $FF_ANALYSIS_DIR/latest/ directory.
This directory contains many files that record both the latest
configuration of the fabric and any errors or differences found
during the health check. If the health check fails, the following
paragraphs describe the order in which to review these files.
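Because success or failure is reported through the exit status, a script can decide whether the result files need review at all. The following is a minimal sketch, not part of the toolset: run_health_check is a hypothetical helper, and it assumes the conventional meaning of exit status (0 for success, nonzero for failure), which the Fast Fabric Toolset Users Guide should confirm.

```python
import subprocess


def run_health_check(command):
    """Run a health check command and return (passed, combined_output).

    Assumption: exit status 0 means the health check passed and any
    other value means it failed.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


# Example use (all_analysis is run here without any options):
#   passed, output = run_health_check(["all_analysis"])
#   if not passed:
#       print(output)
```

The tool output is returned together with the pass or fail flag because the output names the areas and files that must be reviewed.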
If the -s option (save history) was used when running the health
check, a directory whose name is the date and time of the failing
run is created under the $FF_ANALYSIS_DIR directory.
In that case, that directory can be consulted instead of the latest
directory shown in the following examples.
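As a starting point for a review, the files that need attention in a run directory (latest or a saved history directory) can be listed with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, FF_ANALYSIS_DIR is assumed to be exported to the environment, only the suffixes that appear in this section (.diff, .stderr, .errors) are considered, and a file is treated as a problem indicator only when it is not empty.

```python
import os
from pathlib import Path

# Default location stated in this section.
DEFAULT_ANALYSIS_DIR = "/var/opt/iba/analysis"

# Suffixes used by the files named in this section (assumed, not exhaustive).
PROBLEM_SUFFIXES = (".diff", ".stderr", ".errors")


def analysis_dir(env=None):
    """Return the analysis directory, honoring FF_ANALYSIS_DIR if it is set."""
    env = os.environ if env is None else env
    return Path(env.get("FF_ANALYSIS_DIR", DEFAULT_ANALYSIS_DIR))


def flagged_files(run_dir):
    """Return the sorted names of non-empty problem files in run_dir."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in run_dir.iterdir()
        if p.is_file()
        and p.name.endswith(PROBLEM_SUFFIXES)
        and p.stat().st_size > 0
    )


# Example use:
#   for name in flagged_files(analysis_dir() / "latest"):
#       print(name)
```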
First, review the results for any esm (if using embedded subnet
managers) or hostsm (if using host-based subnet managers) health check
failures. If the subnet manager is incorrectly configured or not running,
it can cause other health checks to fail. In that case, the subnet
manager problems must be corrected first, the health check
must be rerun, and other problems must then be reviewed and corrected
as needed.
For a hostsm analysis, review the files in the
following order:
- latest/hostsm.smstatus
- Ensure that this file indicates that the subnet manager is running.
If no subnet managers are running on the fabric, that problem must
be corrected before proceeding further. After being corrected, the
health checks must be rerun to look for more errors.
- latest/hostsm.smver.diff
- This file indicates that the subnet manager version has changed.
If this change is not expected, the subnet manager must be corrected
before proceeding further. After being corrected, the health checks
must be rerun to look for more errors. If the change is expected and
is permanent, a baseline must be rerun after all other health check
errors have been corrected.
- latest/hostsm.smconfig.diff
- This file indicates that the subnet manager configuration has
changed. This file must be reviewed, and, as necessary, the latest/hostsm.smconfig file
must be compared to the baseline/hostsm.smconfig file.
If necessary, correct the subnet manager configuration. After being
corrected, the health checks must be rerun to look for more errors.
If the change is expected and is permanent, a baseline must be rerun
once all other health check errors have been corrected.
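When a .diff file is not detailed enough, the latest and baseline copies of a file can be compared directly. The following sketch uses the Python standard library difflib module; compare_to_baseline is a hypothetical helper, and the directory layout (baseline and latest under the analysis directory) is taken from this section.

```python
import difflib
from pathlib import Path


def compare_to_baseline(analysis_dir, name):
    """Return unified diff lines between baseline/<name> and latest/<name>."""
    root = Path(analysis_dir)
    baseline = (root / "baseline" / name).read_text().splitlines()
    latest = (root / "latest" / name).read_text().splitlines()
    return list(
        difflib.unified_diff(
            baseline,
            latest,
            fromfile=f"baseline/{name}",
            tofile=f"latest/{name}",
            lineterm="",
        )
    )


# Example use:
#   for line in compare_to_baseline("/var/opt/iba/analysis", "hostsm.smconfig"):
#       print(line)
```

An empty result means the two files are identical.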
For an esm analysis, the FF_ESM_CMDS configuration
setting selects which ESM commands are used for the analysis. When
using the default setting for this parameter, the files must be reviewed
in the following order:
- latest/esm.smstatus
- Ensure that this file indicates that the subnet manager is running.
If no subnet managers are running on the fabric, that problem must
be corrected before proceeding further. After being corrected, the
health checks must be rerun to look for more errors.
- latest/esm.smShowSMParms.diff
- This file indicates that the subnet manager configuration has
changed. This file must be reviewed, and, as necessary, the latest/esm.smShowSMParms file
needs to be compared to the baseline/esm.smShowSMParms file.
If necessary, correct the subnet manager configuration. After being
corrected, the health checks must be rerun to look for more errors.
If the change is expected and is permanent, a baseline must be rerun
once all other health check errors have been corrected.
- latest/esm.smShowDefBcGroup.diff
- This file indicates that the subnet manager broadcast group for
IPoIB configuration has changed. This file must be reviewed, and,
as necessary, the latest/esm.smShowDefBcGroup file
must be compared to the baseline/esm.smShowDefBcGroup file.
If necessary, correct the subnet manager configuration. After being
corrected, the health checks must be rerun to look for more errors.
If the change is expected and is permanent, a baseline must be rerun
once all other health check errors have been corrected.
- latest/esm.*.diff
- If the FF_ESM_CMDS configuration setting has been changed, the
changes in results for those additional commands must be reviewed.
If necessary, correct the subnet manager configuration. After being
corrected, the health checks must be rerun to look for more errors.
If the change is expected and is permanent, a baseline must be rerun
once all other health check errors have been corrected.
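The esm review order above can be sketched as a small helper that returns the named files first and any remaining esm.*.diff files afterward. The helper name is hypothetical, and the sketch assumes the default FF_ESM_CMDS setting for the three named files.

```python
from pathlib import Path

# Order given in the list above for the default FF_ESM_CMDS setting.
ESM_REVIEW_ORDER = [
    "esm.smstatus",
    "esm.smShowSMParms.diff",
    "esm.smShowDefBcGroup.diff",
]


def esm_review_list(run_dir):
    """Return existing esm files in review order, extra esm.*.diff files last."""
    run_dir = Path(run_dir)
    ordered = [name for name in ESM_REVIEW_ORDER if (run_dir / name).is_file()]
    extras = sorted(
        p.name
        for p in run_dir.glob("esm.*.diff")
        if p.name not in ESM_REVIEW_ORDER
    )
    return ordered + extras


# Example use:
#   for name in esm_review_list("/var/opt/iba/analysis/latest"):
#       print(name)
```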
Next, review the results of the fabric analysis for each configured
fabric. If nodes or links are missing, the fabric analysis detects
them. Missing links or nodes can cause other health checks to fail.
If such failures are expected (for example, a node or switch is offline),
you can continue to review the result files, but be aware
that the loss of the node or link can cause other analyses to also
fail.
The following information presents the analysis order for the fabric.0:0
files. If other or additional fabrics are configured for analysis,
you must review the files in the order shown in the following list
for each fabric. There is no specific order for which fabric to review
first.
- latest/fabric.0:0.errors.stderr
- If this file is not empty, it can indicate problems with the iba_report command
(such as the inability to access a subnet manager), which can result
in unexpected problems or inaccuracies in the related errors file.
If possible, problems reported in this file must be corrected first.
After being corrected, the health checks must be rerun to look for
more errors.
- latest/fabric.0:0.errors
- If any links with excessive error rates or incorrect link speeds
are reported, they must be corrected. If there are links with errors,
be aware that the same links might also be detected in other reports
such as the links and comps files.
- latest/fabric.0:0.snapshot.stderr
- If this file is not empty, it can indicate problems with the iba_report command
(such as the inability to access a subnet manager), which can result
in unexpected problems or inaccuracies in the related links and comps files.
If possible, problems reported in this file must be corrected first.
After being corrected, the health checks must be rerun to look for
more errors.
- latest/fabric.0:0.links.stderr
- If this file is not empty, it can indicate problems with the iba_report command,
which can result in unexpected problems or inaccuracies in the related links file.
If possible, problems reported in this file must be corrected first.
After being corrected, the health checks must be rerun to look for
more errors.
- latest/fabric.0:0.links.diff
- This file indicates that links between components in the fabric
have been changed, removed, or added, or that components in the fabric
have disappeared. This file must be reviewed and, as necessary, the latest/fabric.0:0.links file
must be compared to the baseline/fabric.0:0.links file.
If components have disappeared, review of the latest/fabric.0:0.comps.diff file
might be easier for such components. If necessary, correct missing
nodes and links. After being corrected, the health checks must be
rerun to look for more errors. If the change is expected and is permanent,
a baseline must be rerun once all other health check errors have been
corrected.
- latest/fabric.0:0.comps.stderr
- If this file is not empty, it can indicate problems with the iba_report command,
which can result in unexpected problems or inaccuracies in the related comps file.
If possible, problems reported in this file must be corrected first.
After being corrected, the health checks must be rerun to look for
more errors.
- latest/fabric.0:0.comps.diff
- This file indicates that the components in the fabric or their
Subnet Management Agent (SMA) configuration have changed. This file
must be reviewed, and, as necessary, the latest/fabric.0:0.comps file
must be compared to the baseline/fabric.0:0.comps file.
If necessary, correct missing nodes, ports that are down, and incorrect
port configurations. After being corrected, the health checks must
be rerun to look for more errors. If the change is expected and is
permanent, a baseline must be rerun once all other health check errors
have been corrected.
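When several fabrics are configured, the per-fabric review order above can be applied to each one in turn. The following sketch groups files by fabric selector and lists the non-empty files for each fabric in the order given in the list. It is an illustration only: fabric_review_plan is a hypothetical helper, file names are assumed to have the form fabric.0:0.[name], and a file is assumed to need review only when it is not empty.

```python
from collections import defaultdict
from pathlib import Path

# Review order from the list above, without the "fabric.<selector>." prefix.
FABRIC_REVIEW_ORDER = [
    "errors.stderr",
    "errors",
    "snapshot.stderr",
    "links.stderr",
    "links.diff",
    "comps.stderr",
    "comps.diff",
]


def fabric_review_plan(run_dir):
    """Map each fabric selector (such as '0:0') to its non-empty files,
    listed in review order."""
    found = defaultdict(set)
    for path in Path(run_dir).glob("fabric.*"):
        parts = path.name.split(".", 2)  # 'fabric', selector, remainder
        if len(parts) == 3 and path.is_file() and path.stat().st_size > 0:
            found[parts[1]].add(parts[2])
    return {
        selector: [
            f"fabric.{selector}.{kind}"
            for kind in FABRIC_REVIEW_ORDER
            if kind in kinds
        ]
        for selector, kinds in sorted(found.items())
    }


# Example use:
#   for selector, names in fabric_review_plan("/var/opt/iba/analysis/latest").items():
#       print(selector, names)
```

Files that are not in the review list, such as the links and comps files themselves, are left out of the plan because they are consulted only when a related .diff file calls for it.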
Finally, review the results of the chassis analysis.
If the chassis configuration has changed, the chassis analysis detects it.
For chassis_analysis, the FF_CHASSIS_CMDS and FF_CHASSIS_HEALTH configuration
settings select which chassis commands are used for the analysis.
When using the default settings for these parameters, the files must
be reviewed in the following order:
- latest/chassis.hwCheck
- Ensure that this file indicates that all chassis are operating correctly
with the desired power and cooling redundancy. If there are problems,
they must be corrected, but other analysis files can be analyzed first.
Once any problems are corrected, the health checks must be rerun to
verify the correction.
- latest/chassis.fwVersion.diff
- This file indicates that the chassis firmware version has changed.
If this change is not expected, the chassis firmware must
be corrected before proceeding further. After correcting the firmware
version, rerun the health checks to look for more errors. If the change
is expected and is permanent, a baseline must be rerun once all other
health check errors have been corrected.
- latest/chassis.*.diff
- These files reflect other changes to chassis configuration based
on checks selected through the FF_CHASSIS_CMDS configuration setting.
The changes in results for these remaining commands must be reviewed.
If necessary, correct the chassis. After being corrected, the health
checks must be rerun to look for more errors. If the change is expected
and is permanent, a baseline must be rerun once all other health check
errors have been corrected.
If any health checks fail, after correcting the related
problems, another health check must be run to verify that all the
problems are corrected. If the failures are due to expected and permanent
changes, once all other errors have been corrected, a baseline must
be rerun.
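The overall review order described in this section (subnet manager results first, then fabric results, then chassis results) can be sketched as a sort on the health check type at the start of each file name. This is a hypothetical helper, not part of the toolset. Within one type it sorts by name only, so the per-type order given in the earlier lists still has to be applied by hand.

```python
# Review priority taken from the order of the paragraphs in this section:
# subnet manager checks first, then fabric, then chassis.
TYPE_PRIORITY = {"esm": 0, "hostsm": 0, "fabric": 1, "chassis": 2}


def sort_for_review(file_names):
    """Sort output file names by health check type priority, then by name.

    Names with an unrecognized type are placed last.
    """
    def key(name):
        check_type = name.split(".", 1)[0]
        return (TYPE_PRIORITY.get(check_type, len(TYPE_PRIORITY)), name)

    return sorted(file_names, key=key)


# Example use:
#   sort_for_review(["chassis.fwVersion.diff", "hostsm.smver.diff"])
```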