Green IT and switch maintenance
seb_
The term ecological footprint describes the total impact of someone or something on the environment. To achieve sustainability, this footprint should be kept as low as possible. We should not demand more from Mother Nature than she can provide, and of course we should not demand more than we really need. Sounds simple, but the reality is far more complex. In IT, the term Green IT was coined to describe and consolidate all the rules, actions and requirements that decrease the ecological footprint for the sake of sustainability. IBM has a broad agenda on this. But we often forget what each one of us could do to be a little greener.
In technical support we deal with defects. Our clients have the right to a product that works within its specifications. If a part is working outside its specifications, it has to be repaired or replaced. That's it.
And what's "green" about that?
The impact on nature happens when a part is replaced that was not really broken. No manufacturing process can be so "green-optimized" that making a new part beats simply not replacing a part that is in good order. There is the mining (and/or recycling) of the materials, the chemicals and energy used in processing them, the packaging, the stocking and of course the logistics, too. In the end even a small part like a fan can have a huge ecological footprint. This can only be avoided by replacing only the broken part. There's just one problem with that:
What if you can't tell which part is broken?
A classic example of that is a physical error in the SAN. In my article about CRC I pointed out how to use porterrshow to find physical errors and - even more important - how to find the connection where the physical error is really located. But that's all the data can give you: you can only track the error down to the connection. The connection usually consists of the sending SFP, the cable (plus any patch panels and couplers in between), and the receiving SFP. There is no reliable and technically justifiable way to tell which one is the culprit from porterrshow alone. I know that there are some "whitepapers" on the web stating that this combination of "crc err" and "enc in" means this and that combination of "crc err" and "enc out" means that. But from a technical point of view that's nonsense.
So you have a physical problem. What to do?
When it comes to cables, my fellow IBM blogger Anthony Vandewerdt just released a great article about the impact of dust today. Other reasons for a cable to cause physical problems could be a bending radius that is too small, or loose couplers. In times of fully populated 48- or even 64-port cards, the front side of a SAN director often looks like the back of a hedgehog. With every maintenance action on one of the cables, you can expect the CRC error counters of the neighboring ports to increase. So in many situations the cable is not really broken, and replacing it just because of the counter is not eco-friendly.
The same goes for SFPs. You see physical errors increasing in porterrshow for a specific port. That could mean that the SFP in that port is broken, because its "electric eye" doesn't interpret the (good) incoming signal correctly. It could also mean that the SFP on the other end of the cable is broken, because it sends out a bad signal. Both will lead to the very same counter increases in porterrshow. If you replace them both as the first action, you have most probably replaced at least one good one.
Provided that you have redundancy in your SAN environment (which you should ALWAYS have), that you have free ports available, and that the multipath drivers on the hosts using the affected path are working properly, you can track the culprit down by plugging the cable into another SFP in another port and checking whether the error stays with the port or moves with the cable.
Please keep in mind that the port address ("the IP address of the SAN") could change along with the port (if you don't have Cisco switches). On Brocade switches you need to do a "portswap" to swap the port addresses as well.
If you cannot touch the other ports, Brocade built some tests into FabricOS for you, like "porttest", "portloopbacktest" and "spinfab". Please have a look at the Command Line Interface Reference Guide for your FabricOS version to get more information about them. With these tests in combination with a so-called loopback plug it's easy to find out which part is really broken. A loopback plug looks like the end of a cable but simply redirects the SFP's TX signal physically into its own RX connector.
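The elimination logic behind the cable move and the loopback plug can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a tool that talks to a switch: the function names and the three suspect labels are made up for this example, and it assumes that each test gives a clear answer (errors seen or not seen), which in practice needs some observation time.

```python
# Hypothetical sketch: none of these names come from FabricOS or any SAN tool.
# The three parts that make up a connection, as described in the text.
ALL_SUSPECTS = frozenset({"local SFP", "cable path", "remote SFP"})


def cable_move(errors_follow_cable: bool) -> frozenset:
    """Result of plugging the cable into another SFP in another port."""
    if errors_follow_cable:
        # The errors show up on the new port too, so the old SFP is cleared.
        return frozenset({"cable path", "remote SFP"})
    # The errors stayed behind with the original port and its SFP.
    return frozenset({"local SFP"})


def loopback_test(sfp: str, errors_seen: bool) -> frozenset:
    """Result of testing one SFP with a loopback plug instead of the cable."""
    if errors_seen:
        # Cable and far end are out of the picture, so it must be this SFP.
        return frozenset({sfp})
    # This SFP talks to itself without errors, so it is cleared.
    return ALL_SUSPECTS - {sfp}


def narrow(*results: frozenset) -> frozenset:
    """Intersect the results of several tests to get the remaining suspects."""
    suspects = ALL_SUSPECTS
    for result in results:
        suspects &= result
    return suspects


if __name__ == "__main__":
    # The errors follow the cable, and the remote SFP passes its loopback test.
    remaining = narrow(cable_move(True), loopback_test("remote SFP", False))
    print(sorted(remaining))
```

In this example only the cable path (cable, patch panels, couplers) remains as a suspect, so neither SFP would have to be replaced.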
Mother Nature will be thankful
There is just one thing from above I want to pick up: parts working within their specification. Not every single CRC error is a reason to replace hardware. According to the Fibre Channel standard, the protocol requires a BER (Bit Error Rate) of 10^(-12) to work properly. At 8 or even 16 Gbps a link moves 10^12 bits in roughly one to two minutes, so it's allowed and fully compliant with the FC protocol to have bit errors quite often. Here is where common sense must come into play. If you see 2-digit increases of the CRC error counter within an hour, it might be a good idea to determine which part to replace, using the steps mentioned above. If you see a single CRC error from time to time, sometimes with days of no error, sometimes with "some" per day, that's perfectly fine with the FC protocol and well within the specifications. It could lead to single temporary and recoverable errors on a host, but nothing has to be replaced as long as the rate doesn't increase significantly. You wouldn't replace your one-year-old tires just because the tread is only 90% of what it was when you bought them.
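To put a number on "quite often", here is a small back-of-the-envelope calculation in Python. It is a rough sketch under simplifying assumptions of mine: it uses the nominal data rates of 8 and 16 Gbit/s instead of the exact Fibre Channel line rates, and it assumes a link that transmits continuously and runs exactly at the BER limit.

```python
# Assumption: nominal data rates on a continuously transmitting link.
# Real Fibre Channel line rates and idle periods differ from this.
BER_LIMIT = 1e-12  # bit error rate quoted in the text


def seconds_between_bit_errors(bits_per_second: float, ber: float = BER_LIMIT) -> float:
    """Average time between bit errors on a link running exactly at `ber`."""
    return 1.0 / (bits_per_second * ber)


if __name__ == "__main__":
    for gbps in (8, 16):
        seconds = seconds_between_bit_errors(gbps * 1e9)
        print(f"{gbps} Gbps: about one bit error every {seconds:.1f} s at the limit")
```

Under these assumptions that is one bit error every 125 seconds at 8 Gbps and every 62.5 seconds at 16 Gbps. Not every bit error has to land inside a frame, so the CRC counter would normally climb more slowly than that.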
Let's think a little bit greener - even in switch maintenance :o)