The trigger and the error recovery
seb_
The following article is not new. I published it about 1.5 years ago on an internal IBM blog. So why publish it again externally? In my problem cases of recent months I found that the principle described in it is very common, and that will most probably not change in the future. To be able to refer to it here in the blog, I have to publish it once again, of course. :-)
The majority of SAN cases that cannot simply be reduced to replacing a part because of an error message such as "Part xxx is broken" are complex solution cases. You have symptoms on possibly several hosts, against possibly several storage systems, over several paths.
The IBM support structure consists of so-called towers: different teams support different products. At the higher support levels this is quite important, because it allows product engineers to develop a deep understanding of their specific product. When it comes to problem determination, however, it is essential for the different tower teams to work together to find the cause of a problem and the way to solve it. It is not enough to check only your "own box" for failures and ask yourself whether it could be the sole reason for the problem. If this is done, the result is often that the particular device cannot be the "single point of failure", and the responsibility for finding the problem source moves on to the next probable team.
Obviously, such a problem-solving process is not very efficient. There are several attempts to deal with this from an organizational point of view, such as solution support approaches, project offices and "Complex Call Leaders". But it is the technical point of view that shows why working together is vital:
In complex cases you have at least one trigger and at least one device that reacts wrongly to it. The trigger alone is not the source of the visible symptoms. This is often forgotten as soon as the trigger is found and repaired and the symptoms are gone. More importantly, the trigger is the less harmful problem compared to the bad error recovery. In the future, a new minor error (the same one or another) could trigger the same major problem. Here is an example:
A customer has two SAN switches in two different, redundant fabrics. Connected to the switches is a 2-node SVC cluster (with several back-end storage subsystems). Each of the two nodes has two connections into each fabric. The customer has some Windows hosts with SDDDSM (the same level on each host) and System p hosts with SDDPCM (also the same level on each host).
Now one SFP that connects an SVC node port to one of the switches breaks. It intermittently corrupts frames and transmission words, which leads to a toggling link. Although everything is zoned granularly, all the System p hosts lose access to their disks.
The customer opens a case with System p support. The front-end support sees a message in the AIX error report indicating that an hdisk is not accessible. They involve the SAN support team, which finds high error counters on the switch port where the broken SFP is connected and advises the customer to replace it with one of his spare SFPs. The situation calms down, the symptoms disappear, the disks are accessible again, the problem is gone. Fine.
But this is not the moment at which the case can be closed. Of course, the trigger has been found and the customer's systems are productive again, but the main problem could easily be disregarded now: the error was handled in the wrong way by the host and its multipathing driver. The multipath driver should have used another available path. It could have used another path in the same fabric, or even the links in the other fabric, which had no problem at all. So the more important problem source is the broken multipath driver, which has to react to the trigger and perform the error recovery. With the next broken SFP (please keep in mind that an SFP, as an opto-electrical converter, is a wearing part) the same problem will happen again!
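To make the expected behaviour concrete, here is a minimal sketch in Python. It is only a toy model, not how SDDPCM or any real multipath driver works. It assumes that a host sees every SVC node port in both fabrics (the article only says the zoning is granular), and the names Path, build_paths and select_path are made up for this illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Path:
    fabric: str           # "A" or "B" (hypothetical fabric names)
    node: int             # SVC node 1 or 2
    port: int             # port of that node within the fabric
    healthy: bool = True  # set to False once the path is found to be bad


def build_paths():
    """Assumed layout: 2 fabrics x 2 SVC nodes x 2 ports per node and fabric."""
    return [Path(f, n, p) for f, n, p in product("AB", (1, 2), (1, 2))]


def select_path(paths):
    """Return the first healthy path, or None if no path is left at all."""
    for path in paths:
        if path.healthy:
            return path
    return None


paths = build_paths()
# The broken SFP affects exactly one SVC node port in one fabric.
paths[0].healthy = False

survivor = select_path(paths)
remaining = [p for p in paths if p.healthy]
print(f"{len(remaining)} of {len(paths)} paths remain; I/O continues on {survivor}")
```

In this simplified model a single bad SFP removes one path and leaves the rest usable, so losing the disk altogether points to the error recovery and not to the trigger.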
The lesson learnt from this example is that the trigger of an error is not the most important part of the problem and should not be the only goal of problem determination. The way the devices in a redundant environment react to the trigger is the reason for the impact, and it can create "artificial" single points of failure. The different tower support teams have to work together until not only the trigger is found, but also the parts of the environment that react in the wrong way!