Two additional topics for my previous post came into my mind and I doubt that they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained with security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think that you should send a fax with the supportshow...), sometimes it's just economizing on the wrong end. And sometimes there is just no overall plan for SAN management. So I think at least the following things should be given to enable a timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should at least boot from an internal disk rather than out of the SAN.
- A good ssh and telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A tFTP- and a FTP-Server on the host mentioned above. It can be used for supportsaves, config backups, firmware updates etc. They should always run and where it's possible the devices should be pre-configured to use them. (e.g. with supportftp in Brocade switches)
- If it's possible with your security policy, it's helpful to have Wireshark installed on it which could be used for "fcanalyzer" traces in Cisco switches or also to trace the ethernet if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters undersized internet connections are a [insert political correct synonym for PITA here :o) ]
- Callhome and remote support connection where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually. The support will call you. And most of the SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations callhomes trigger troubleshooting before your users even notice a problem. In addition some machines (like DS8000) allow the support to dial into it and gather the support data directly - and only the support data. Don't worry - your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. With cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have "auto-negotiating" GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
Monitoring, counter resets and automatic DC
Beside of any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP Traps / Syslog Server / etc) there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow drain devices or toggeling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need is some scripting skills and a host in the management LAN (with an FTP server) to run scripts from as mentioned above. A good practice is, to have a look for a good time slot - better do not do this on workload peak times - and set up a timed script (e.g. cron job) that does:
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type in the whole one as auto-completion won't work. The supportsave is already compressed but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: Use proper names for the Cisco Data collections. They should at least contain the switchname, the date and the time!
Depending on the disk space and the number of the switches, it may be good to delete old data collections after a while. For example you could keep one full week of data collections and for older ones only keep one per week as a reference.
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)