IBM Cisco Data Collection Best Practices
seb_
The quality of the data collection is a significant factor in quick and successful troubleshooting. Here in remote support, it's essential to get the data complete, well prepared, and quickly. Quickly is clear, but what do I mean by complete and well prepared?
Collecting data for a Cisco SAN switch case is not difficult if you know what to look for, and that depends on the problem. The problem could still be ongoing, or you may want to have something analyzed that happened in the past. To avoid confusion between ongoing problems and historical events in the data, the counters need to be cleared. It seems like common sense, but again and again I see data collections gathered the wrong way, rendering them useless for analysis.
The standard data collection for Cisco is a "showtech", to be exact a "show tech-support details". It's a script with a lot of command outputs, and it has changed a lot across hardware platforms and SAN-OS/NX-OS versions. There were (and maybe will be again) bugs causing incomplete outputs, like CSCus64671, which caused incomplete data under NX-OS 6.2 and was fixed in 6.2(11c). And that was not the only one! In addition, some useful commands were never included in the script. So there's some extra work to do.
Do we look into this data directly? However much I like to dig into the guts of the data, there are things that machines can do better, for example compiling error tables for interfaces or running sanity checks against certain configurations. A colleague of mine and I are responsible for the tool that is used within IBM to analyze Cisco SAN data collections by creating a troubleshooting framework out of the data. Of course the quality of its output depends heavily on the quality of the input. The better the data, the better the tool can do its part and we, the support engineers, can do ours.
To cover all common situations, here is what I believe to be a good data collection plan:
1) Preparation
The following command outputs should be gathered via the CLI. Please log the (printable!) session output into one text file per switch per data collection round. On each switch, start by setting the terminal length to zero to avoid paged output:
Switch# terminal length 0
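Terminal emulators sometimes write escape sequences and other control characters into the session log along with the text. As a minimal sketch, assuming the log has already been read into a string, the following Python helpers check for and remove such characters. The function names are hypothetical and not part of any IBM or Cisco tool.

```python
import re

# ANSI/VT100 escape sequences such as cursor movement or colour codes.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")
# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def _normalise_newlines(text: str) -> str:
    """Convert Windows and old Mac line endings to a plain newline."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def clean_session_log(text: str) -> str:
    """Return the log text without escape sequences and control characters."""
    text = ANSI_ESCAPE.sub("", text)
    text = CONTROL_CHARS.sub("", text)
    return _normalise_newlines(text)


def is_printable_log(text: str) -> bool:
    """True if cleaning the log would change nothing except line endings."""
    return clean_session_log(text) == _normalise_newlines(text)
```

Running `is_printable_log` on a saved log before uploading is a cheap way to spot a session that was captured with terminal control codes in it.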
2) Collecting data
Switch# show tech-support all
Switch# show tech-support details
That should give us most of the expected commands. To include internal counter tables and allow the analysis of historical data, please also run:
Switch# show logging onboard
If the problem could be related to the fiber optics (SFPs), as with all physical problems incl. CRC errors, invalid transmission words, etc., please include:
Switch# show interface transceiver details
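Before uploading, it can help to confirm that every intended command actually made it into the session log. The sketch below assumes that each command appears on a prompt line of the form `Switch# <command>`; the function name and the prompt handling are illustrative assumptions, and the list of expected commands is passed in by the caller.

```python
def missing_commands(log_text: str, expected: list) -> list:
    """Return the expected commands that never appear on a prompt line."""
    issued = set()
    for line in log_text.splitlines():
        # Assumed prompt format: "<hostname># <command>".
        _, sep, command = line.partition("# ")
        if sep:
            # Collapse repeated spaces so "show  logging" still matches.
            issued.add(" ".join(command.split()))
    return [cmd for cmd in expected if cmd not in issued]


if __name__ == "__main__":
    sample_log = (
        "Switch# terminal length 0\n"
        "Switch# show tech-support details\n"
        "...command output...\n"
        "Switch# show logging onboard\n"
    )
    wanted = [
        "show tech-support details",
        "show logging onboard",
        "show interface transceiver details",
    ]
    # Lists the commands that still need to be run for this switch.
    print(missing_commands(sample_log, wanted))
```

The check only looks at prompt lines, so it cannot tell whether the output of a command is itself complete.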
By having all of these outputs in one text file per switch, you ensure that they are processed together properly. I highly recommend using the following naming convention for the text files. It helps the IBM server choose the proper support tool, eliminating manual intervention and wait time by the support engineer.
The really important part is "_showtechSAN_" (including the underscores), but I recommend using the full pattern to allow easy identification of the proper data.
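A quick check for the marker can catch misnamed files before upload. This sketch tests only for the "_showtechSAN_" part and does not validate the full pattern. The case-sensitive comparison is an assumption, and the file names in the usage example are made up.

```python
from pathlib import Path

# The part of the file name that the article calls really important.
MARKER = "_showtechSAN_"


def has_showtech_marker(filename: str) -> bool:
    """True if the file name itself (not a parent directory) contains the marker."""
    return MARKER in Path(filename).name


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for name in ["siteA_showtechSAN_switch1.txt", "switch1_showtech.txt"]:
        print(name, has_showtech_marker(name))
```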
The text files of all switches can then be packed together in the same zip file and uploaded to IBM (see the end of the article).
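Packing can be scripted with Python's standard `zipfile` module. The sketch below assumes that all session logs sit directly in one directory and end in `.txt`; the function name and the directory layout are illustrative assumptions.

```python
import zipfile
from pathlib import Path


def pack_logs(log_dir: str, zip_path: str) -> list:
    """Add every .txt file directly inside log_dir to one zip file.

    Returns the file names that were added, in sorted order.
    """
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for log_file in sorted(Path(log_dir).glob("*.txt")):
            # Store only the file name so the archive has no directory structure.
            archive.write(log_file, arcname=log_file.name)
            added.append(log_file.name)
    return added
```

Returning the list of added names makes it easy to compare the archive contents against the switches that are part of the case.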
3) Clearing the counters
For ongoing problems it makes sense to clear the counters now. There are 3 major commands for that:
Regular interface counters:
Switch# clear counters interface all
Internal ASIC counters:
Switch# debug system internal clear-counters all
FCIP statistics:
Switch# clear ips stats all
The first two should be run in any case; the third, of course, only if you have FCIP.
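If the clearing step is scripted, the FCIP rule can be encoded directly. The sketch below only builds the list of commands from this section and leaves sending them to the switch to whatever session tool is in use. The function name and the `has_fcip` flag are hypothetical.

```python
def clear_counter_commands(has_fcip: bool) -> list:
    """Return the counter-clearing commands from this section, in order."""
    commands = [
        "clear counters interface all",              # regular interface counters
        "debug system internal clear-counters all",  # internal ASIC counters
    ]
    if has_fcip:
        # Only relevant on switches that run FCIP.
        commands.append("clear ips stats all")
    return commands


if __name__ == "__main__":
    for command in clear_counter_commands(has_fcip=False):
        print("Switch#", command)
```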