The quality of the data collection is a significant factor in quick and successful troubleshooting. In remote support it's essential to get the data complete, well prepared, and quickly. Quickly is clear, but what do I mean by complete and well prepared?
Collecting data for a Cisco SAN switch case is not difficult if you know what to look for, and that depends on the problem. The problem could still be ongoing, or you may want to have something analyzed that happened in the past. To avoid confusion between ongoing problems and historical events in the data, the counters need to be cleared. It seems like common sense, but again and again I see data collections gathered in the wrong way, rendering them useless for analysis.
The standard data collection for Cisco is a "showtech", to be exact a "show tech-support details". It's a script with a lot of command outputs, and it has changed a lot across hardware platforms and SAN-OS/NX-OS versions in the past. There were (and maybe will be again) bugs causing incomplete outputs, like CSCus64671, which caused incomplete data under NX-OS 6.2 and was fixed in 6.2(11c). And that was not the only one! In addition, some useful commands were never included in the script. So there's some extra work to do.
Do we look into this data directly? However much I like to dig into the guts of the data, there are things that machines can do better, for example compiling error tables for interfaces or running sanity checks against certain configurations. A colleague of mine and I are responsible for the tool used within IBM to analyze Cisco SAN data collections by creating a troubleshooting framework out of the data. Of course the quality of its output depends heavily on the quality of the input. The better the data, the better the tool can do its part and we, the support engineers, can do ours.
To cover all common situations, here is what I believe to be a good data collection plan:
1) Preparation
The following command outputs should be gathered via CLI. Please log the (printable!) session output into one text file per switch per data collection round. On each switch, start by setting the terminal length to zero to avoid pagewise output:
Switch# terminal length 0
2) Collecting data
Switch# show tech-support all
Switch# show tech-support details
That should give us most of the expected commands. To include internal counter tables and allow the analysis of historical data, please also run:
Switch# show logging onboard
If the problem could be related to the fiber optics (SFPs), as with all physical problems including CRC errors, invalid transmission words, etc., please include:
Switch# show interface transceiver details
By having them all in one text file per switch you ensure they are processed together properly. I highly recommend using the following naming convention for the text files. It helps the IBM server choose the proper support tool, eliminating manual intervention and wait time for the support engineer.
The really important part is "_showtechSAN_" (including the underscores), but I recommend using the full pattern to allow easy identification of the proper data.
The text files of all switches can then be packed together in the same zip file and uploaded to IBM (see the end of the article).
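If you have to collect from more than a couple of switches, scripting the whole round is worth it. Here is a minimal Python sketch of that idea; the switch names, the admin user and the exact file name (apart from the "_showtechSAN_" part) are placeholders, and it assumes SSH key-based login and an NX-OS version that accepts commands piped into a single SSH session:

import datetime
import subprocess

# Placeholder switch list and user -- adjust to your environment.
SWITCHES = ["mds-core-1", "mds-core-2"]
COMMANDS = [
    "terminal length 0",                    # avoid pagewise output
    "show tech-support details",
    "show logging onboard",                 # internal counters / historical data
    "show interface transceiver details",   # only needed for SFP-related problems
]

today = datetime.date.today().strftime("%Y%m%d")
for switch in SWITCHES:
    # One text file per switch per collection round; "_showtechSAN_" is the
    # part the IBM upload server keys on, the rest of the name is my own choice.
    outfile = f"{switch}_showtechSAN_{today}.txt"
    session = "\n".join(COMMANDS) + "\n"
    with open(outfile, "w") as f:
        subprocess.run(["ssh", f"admin@{switch}"], input=session,
                       text=True, stdout=f, check=False)

Whatever you use - plink, a script like this, or copy and paste into a logged terminal session - the important part is one complete, printable text file per switch per collection round.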
3) Clearing the counters
For ongoing problems it makes sense to clear the counters now. There are 3 major commands for that:
Regular interface counters:
Switch# clear counters interface all
Internal ASIC counters:
Switch# debug system internal clear-counters all
FCIP statistics:
Switch# clear ips stats all
The first two should be done in any case, the third of course only if you have FCIP.
4) Wait time
Please wait until the problem re-occurs. We don't have a fixed wait time here, but if the problem happens very seldom, it's advisable to clear the counters every few days to avoid catching unrelated events - for example high error counters caused by a maintenance action. The goal is to catch the real problem.
5) Collecting the data again
This is exactly like step 2).
6) Uploading the data
Please upload the created zip file(s) using our "Secure Upload" option here:
Just use your PMR (preferred), RCMS, or CROSS case number for your upload to let the system notify the support engineer with an update to the case. It's also possible to upload data against the plain Machine Type / Serial Number, but then there won't be any direct correlation to the case. In the field "Upload is for:" always choose "Hardware" when you upload SAN data collections. The email address is optional, but if you provide it you will get a short notice as soon as the upload has completed successfully, and the support engineer will be able to contact you via mail if needed. After clicking on "Continue" you can drag and drop the archive file containing the data collection to upload it.
Short URL for this article: http://ibm.biz/ciscodc
Brocade FabricOS v7.3x is now officially supported for IBM clients. Among all the new features and improvements there are some I would like to cover in small blog entries, especially the ones directly related to support and troubleshooting.
One command to rule them all
Investigating ongoing problems usually starts with setting a baseline. To tell the current problem from the battles of the past, you need to clear the counters carefully. Over the years, hardware platforms, and FOS versions these commands changed again and again. Portstatsclear was such a command. Years ago it was like Russian roulette - you never knew what it would really clear. This port? The ports in the same portgroup? All physical counters but not the stuff on the right side of porterrshow? Statsclear cleared all ports - at least the external FC ports. You needed another command for internal blade counters. And for the GigE interfaces you needed portstatsclear again.
All you need in FOS v7.3 is supportinfoclear. It will clear all port counters and in addition it clears the portlogdump, too. You only need to execute:
supportinfoclear --clear -force
The -force prevents it from asking you again whether you are really, really sure about doing it. Additionally you can clear the error log, too, by using -RASlog (case-sensitive). But at least for anything support-related I don't recommend doing that unless instructed otherwise.
And another improvement: it will be in the clihistory, even if you execute it via plink or ssh without opening a shell on the switch. So no more worrying about how to execute it. Just use your favorite script or do it directly, and IBM support will see how reliable the data is.
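As a minimal sketch of such a scripted run (the switch names and the admin user are placeholders, and it assumes SSH key-based login):

import subprocess

# Placeholder hosts and user -- adjust to your environment.
for switch in ["fab_a_sw1", "fab_b_sw1"]:
    # Runs the clear non-interactively; thanks to -force there is no prompt to answer.
    subprocess.run(
        ["ssh", f"admin@{switch}", "supportinfoclear --clear -force"],
        check=True,
    )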
Update Nov 3rd:
And another way it rules them all: as Serge writes in the comments below, it will clear the counters for all ports regardless of their VF membership. So no hopping through logical switches or need to use fosexec! Thanks Serge!
And as described in "How to avoid support data amnesia" over there in the Storageneers blog: please think about when to execute this command! While it's safe to clear the counters for really ongoing or 100% reoccurring problems, you need to gather supportsaves first if you want the root cause analyzed for something that happened in the past. Otherwise supportinfoclear might wipe all the indications and evidence needed to find out what happened!
Sometimes we notice that an ISL is actually a bottleneck in a fabric. Not a congestion bottleneck, where the throughput demand is just too high for the ISL's bandwidth - that one could be solved by putting another cable between the two switches. But if you have a latency bottleneck, your ISL won't be running at the maximum of its bandwidth. The contrary is the case: it lacks the buffer credits to ensure proper utilization. If you see a latency bottleneck on an ISL it's often back pressure from a slow drain device attached to the adjacent switch. But every now and then I get a case where it's just the ISL. Sometimes in one direction, sometimes in both. Even with lengths where you don't think about using long distance settings at all.
But in the past we did exactly that!
When we encountered a situation like that, the first step was always to get rid of everything that reduces buffer credits for the real traffic flows, like an active QoS setting without having QoS zones. If the problem was still there, the only way to give it more buffers was to configure a long distance mode. We solved performance bottlenecks on ISLs by setting up a, let's say, 50m ISL in a 10km long distance mode (LE). I described this 2 years ago in the article How to NOT connect an SVC in a core-edge Brocade fabric. While this indeed gives you more buffers, it comes with a drawback.
Long distance and Virtual Channels
On normal ISLs we have Virtual Channels. They work in a way that the buffer credit management of the ISL is logically partitioned into 8 channels. For normal Class 3 open systems traffic they are used this way:
VC 2: Class 3 data
VC 3: Class 3 data
VC 4: Class 3 data
VC 5: Class 3 data
VC 0 is used for inter-switch communication, for example when a new zoning configuration is distributed to all switches. VCs 6 and 7 are not really of interest most of the time. We have to focus on VCs 2, 3, 4, and 5. (Mind the Oxford comma!) If you have a slow drain device that is reached via Virtual Channel 2 in your fabric, then at least the traffic of the other 3 data VCs is unaffected. With a long distance mode like LE you lose that advantage.
Buffer distribution on Virtual Channels on a normal ISL:
Buffer distribution on Virtual Channels on a LE configured ISL:
While you have more buffers in total now, only the first data VC has them assigned. There is no partitioning of data traffic anymore, and the result is the risk of Head of Line Blocking (HoLB). A latency bottleneck (for example due to back pressure from a slow drain device) will always impact ALL the user data going over that ISL! That's a high price for those additional buffers.
With FabricOS v7.2x Brocade introduced a new command: portcfgeportcredits.
It allows you to assign a freely configurable number of credits between 5 and 40 to that ISL. You might ask:
But LE mode gives me 80 on 16Gbps!?!
Yes, but look at the distribution:
Not the whole data part of the link has to share the 40 buffers. Each data VC gets its own 40 buffers and they are still handled independently! No Head of Line Blocking! And remember: this is not meant for long distance connections and it still comes for free! It works on 8G switches, too, as long as they are running at least v7.2x.
To give 40 buffers to each data VC on an ISL at port 1 you would enter:
portcfgeportcredits --enable 1 40
With the --disable parameter you switch back to normal mode and with --show you can see the current configuration of a port.
And please keep an eye on the number of remaining buffers in portbuffershow :o)
So from now on, if you need just some more buffers on your ISLs to keep everything running smoothly: use portcfgeportcredits.
Well-made professional education is worth every cent, but in today's world controlled by CFOs, everything costing money will be challenged sooner or later. And if you search for freebies you often end up with the first 3-4 sentences of an obviously good book about the topic and the prompt to register with your business information and email address. Weeks of business SPAM will follow, even if you unsubscribe again. Here are some good free books that give a good understanding of SAN switching and how it's implemented by the two big players, Cisco and Brocade, without the need to register for anything.
Introduction to Storage Area Networks and System Networking
Working at IBM, I appreciate their Redbooks program. Experts from inside and outside IBM share their knowledge in the form of these comprehensive ebooks. This one is a good introduction to SANs and how IBM does it. You learn how Fibre Channel works, the hardware, the software, the management, the use cases and the design considerations. And of course it covers the IBM products in that area, too.
SNIA Dictionary
Regular readers of my blog (are there any?) may know my opinion about the SNIA Dictionary, but for learning storage networking it's still a good source of definitions and explanations for many of the common terms and concepts. Get it directly from snia.org.
Cisco MDS 9000 Family Switch Architecture
This document is also known as "A Day in the Life of a Fibre Channel Frame" and I like it. It has certainly seen some summers and winters since its release in 2006, but the general architecture is still the same. Of course everything is integrated and consolidated in the latest products, but if you ever understood how a frame is handled by an older generation Cisco switch, it won't be a problem to work with, design for, or even troubleshoot the newest ones.
Brocade Fabric OS Administrator's Guide
While Brocade is certainly not revealing too much about the internals of their switches, the admin guide is still a good source of information about the Brocade features and implementations. Many SAN questions I'm asked in an average week could easily be answered by a glimpse into this guide. There is a new one for each new major codestream, so always look in the one for your installed FabricOS version. This is the link for FabricOS v7.2.
The remaining two ebooks on my list are specifically for performance troubleshooting... ...my hobbyhorse somehow.
Slow Drain Device Detection and Congestion Avoidance
This one is from Cisco and it covers the different types of performance problems pretty well. If you have read the one about Cisco architecture before (see above), you can get much more out of this piece as well. It has some good examples, troubleshooting approaches and explanations for the counters you might see. A definite must-read.
IBM Redpaper: Fabric Resiliency Best Practices
This one is about Brocade switches and the IBM version of their "SAN Fabric Resiliency Best Practices". After explaining the fundamentals about SAN performance it shows you how performance troubleshooting is done on a Brocade fabric, especially by using built-in features like bottleneckmon.
I'm sure there are many other good learning materials out there that don't exist for the sole purpose of catching your contact address through registration. If you know some that should be on this list as well, please let me know. Thanks!
I don't always write technical blog posts. But when I do, I make them long, and the conclusion contains a request to you, my readers, to do this or that. I won't do that today. Today is about a behavior I observed, but I won't propose anything. Feel free to draw your own conclusions. Well, that might be considered a proposal :o)
This one is about the IBM System Storage SAN06B-R, a multi-protocol router or SAN extension switch. It consists of two ASICs - one handling the Fibre Channel part and one handling FCIP. They also have some extra tasks like FC routing and compression, but for our example it's enough to know that there are two of them, and if you want to transfer SAN traffic over FCIP, it has to pass both.
The two ASICs are connected via 5 internal ports, all working at a line rate of 4Gbps. That doesn't sound like much compared to the 16 FC ports running at up to 8Gbps on the front side. But we have to keep in mind that those are only for connectivity. Given the maximum IP connectivity of 6x 1GbE, the internal connections shouldn't be a bottleneck.
Internal connections are somewhat similar to external ISLs between switches when it comes to flow control. They use buffer-to-buffer credits ("buffer credits") and the links are logically partitioned into virtual channels, each of them with its own buffer credit counter. These virtual channels prevent head of line blocking in case of back pressure (for example due to slow drain devices on the other side of FCIP connections).
When it comes to buffer credits, it's important how they are assigned to these virtual channels. On these internal connections each VC gets 1 buffer, but it can borrow up to 3 more out of a pool. The pool is shared among all VCs of that port and contains 11 buffers in total.
You might say "Yeah, but hey it's just a very short connection on the board. Who needs those buffer credits anyway?", but keep in mind they are not just for spanning the tiny distance. There are multiple reasons why frames need to be touched here and therefore buffered. Plus of course possible external back pressure. Often a few buffer credits make the difference between normal traffic flow and piling up of frames and even frame discards due to timeout.
I guess the last thing you want to have is an artificial bottleneck inside of your routers...
So the amount of buffers and buffer credits for each internal connection depends on how many VCs are in use. And that's the crux. The number of VCs per internal connection depends on the number of...
A tunnel consists of 1-6 circuits, so you can bundle several GbE interfaces together. They call it FCIP trunking. Some features, like e.g. Tape Pipelining, require the use of only one tunnel. There's not much we can do about that. For an environment that doesn't need such features, it starts to get interesting now: if you have only 1 tunnel, you have only 1 VC and therefore only 4 buffer credits, plus the risk of head of line blocking! If you actually spread the traffic across the low, medium and high priorities within a circuit, you would get an own VC for each priority.
Using only the standard "medium" priority for the data traffic (F-class "administrative" fabric traffic uses its own VC and falls out of this equation, of course) would give you this amount of buffers on each of the 5 internal connections between the ASICs:
# of tunnels | # of VCs | # of buffers
(1 buffer per VC + 3 to borrow per VC out of a pool of 11)
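Written out as a quick calculation - a minimal sketch assuming one medium-priority VC per tunnel, 1 dedicated buffer per VC, up to 3 borrowed buffers per VC, and the shared pool of 11 from the rule above (the 1-tunnel case matches the 4 buffer credits mentioned earlier):

def internal_buffers(tunnels: int, vcs_per_tunnel: int = 1) -> int:
    """Buffers available on one internal ASIC-to-ASIC connection.

    Assumption: each VC owns 1 buffer and may borrow up to 3 more
    from a pool of 11 that is shared by all VCs of that port.
    """
    vcs = tunnels * vcs_per_tunnel
    dedicated = vcs * 1
    borrowed = min(vcs * 3, 11)   # the shared pool limits the borrowing
    return dedicated + borrowed

for t in range(1, 7):
    print(t, "tunnel(s) ->", internal_buffers(t), "buffers")
# 1 tunnel -> 4 buffers, matching the "only 4 buffer credits" case above.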
Please be aware that the amount of VCs/buffers is only one point that needs to be taken into consideration when planning and configuring the optimal FCIP connection. You can find a good overview about the other ones in Brocade's FCIP Administrator's Guide for your FabricOS version.
I thought I'd never have to write about fillwords. I thought there would be a phase of some months and then this topic would be dead. Strangely enough, it's still alive. I still get questions about them, I still see people blaming them, and I still see avoidable problems caused by changing them.
For every new line rate (now read "Generation" or "Gen"), the switch and HBA vendors are usually the first ones to adopt the new standard and release their products. It was the same for 8Gbps, which came with a new fillword. Fillwords are 4-byte words without a special task. A port sends them whenever it doesn't have to send something else. They're used to maintain the synchronization of the link, and therefore the fillword used up to and including 4Gbps was fittingly called IDLE. Depending on the workload, FC ports and the CPU of a PC have one thing in common: you see a lot of IDLE. Therefore it made sense to think about the optimal fillword, and so it was changed for 8Gbps. In its first published version the standard basically said, "Let's replace all instances of IDLE with a better one: ARBff". The first products were developed accordingly, and among them were Brocade's 8Gbps switches.
Later it turned out that it would be better not to just replace all IDLEs out of hand, because they were not only used as a fillword but in the link initialization, too. The standard was updated and then said: "Use ARBff as a fillword, but keep the IDLE for link initialization".
For products released after that point in time the vendors usually implemented the new version of the standard, which was not compatible with the first one. So clients bought new 8Gbps-capable devices, for example DS5000 boxes or SAN Volume Controllers, and failed to get them online. These devices tried to use the standard-compliant word during the critical link initialization phase and when they noticed that the switches sent the wrong ones, the link initialization failed.
I have to admit that most vendors' information policy was very "unlucky" at that time. Everybody blamed everybody else. After some protocol traces it was clear that the problem was the use of ARBff during link initialization. So as a workaround we recommended configuring the switches to use IDLE again (mode 0). Eventually new firmware versions were written and Brocade came up with two new fillword modes - one of them compliant with the standard (mode 2) and another, more dynamic mode 3. The latter tries ARBff in link initialization first (like mode 1) and if that fails, it behaves like mode 2. So mode 3 became the natural choice.
For some time we had a lot of cases for that problem, and many people in the broad area of storage got in touch with the term fillword. While the number of problem cases about them decreased, the memory of fillwords stayed active in people's minds. In addition there is a counter called "er_bad_os" for each port. It means "Error: bad ordered set" and increases basically in two situations: 1) if such a 4-byte word is corrupted, or 2) if the port receives an ordered set it didn't expect. The first situation is a problem, but you get other indications as well ("enc out", "enc in", ...). The second situation could for example happen if a running port expects the IDLE fillword (because it was configured to mode 0 as a workaround as stated above) but receives ARBff. Although the counter increases in the ASIC, there is no impact on a running connection. In fact the Fibre Channel protocol says that each well-encoded ordered set without any other function should be treated the same way as an IDLE. So as long as there is no bit error in them, it doesn't matter what kind of fillword is received - the switch must use it to maintain the synchronization.
However, the myth was already born: blame it on the fillword! For a lot of totally unrelated problems - performance problems, CRC errors, occasional link resets and even SFP heat issues - SAN admins and even support personnel for the attached devices blamed the fillword. "The fillword is wrong!", "Change the fillword first!", "Look at this rapidly increasing error counter!" - Changing the fillword mode to 3 became the new mantra for every howsoever remote storage problem. And now it's very similar to bloodletting in the medicine of previous centuries: a sophisticated-sounding theory everybody could agree on, and a simple action plan.
But just like bloodletting, it only helps in certain situations and used as a general treatment it does more harm than good.
Changing the fillword mode is disruptive for a link. If you really have a problem with a wrong fillword setting, this is not very concerning, because as stated above, the link initialization would have failed and the device wouldn't be online at that moment anyway. But for all the cases where the port is actually up and running, there will be a new link initialization. All current I/O belonging to this port will be void. There will be command timeouts. Error recovery needs to take place. Depending on the robustness of the attached device this alone could lead to problems. And as if that weren't enough, I saw a lot of SAN admins even changing the fillword mode for normal E-ports, which is complete nonsense. Believe me, you don't want to disturb your fabric stability by bouncing each and every ISL in your SAN environment within a short time without a solid reason.
And changing running ports to a more compliant fillword is certainly NOT a solid reason.
The sad part is that often the perceived problems improved after this action. But a simple portdisable/portenable would most probably have had the same effect. It's like patients recovering - not because of bloodletting, but despite it.
Conclusion and tl;dr
Don't change the fillword mode on a running port! It's disruptive!
From the moment you wanted to read that interesting analyst paper, that compelling best practice guide, or that promising market study, you regret that you typed your email address into that innocent-looking form. Now you get them day after day: newsletters for stuff you don't really care about.
But there are newsletters that really make sense. The Cisco Notification Service (CNS) is one of that kind. It's a very good way to keep yourself up to date with support-relevant news about your Cisco storage networking products (and, well, yes, any other Cisco product, too). The only thing you need is a cisco.com user.
And the best thing: you get exactly what you're looking for. So here is how to configure it:
First go to http://www.cisco.com/cisco/support/notifications.html
After you have logged in with your CCO password, the page should look like this:
In the Profile Manager you can administer multiple notifications. Let's create the first one by clicking on "Add Notification".
Put in a proper name for the notification on this screen. You can also choose the type of the notification. You can have an email with links and summaries (default), an email with only links, or even an RSS feed. For the emails you can choose daily, weekly or monthly summaries, and the feed can be configured for today, or the last 7 or 30 days. The recipient address for the emails can be changed, too. So if you work in a team, you could set it up to send to the team's group email address or a distribution list.
For the Topic Type you have 3 options: product-centric, alert-centric, and based on a particular Bug ID. Whether you choose the product or the alert approach is up to you. That mainly depends on your role (remember: I write this blog for both admins and tech support people) and the number of different products and topics you are interested in. The tracking of bugs can also be configured directly from the bug tool, which makes a bit more sense in my eyes. So for the moment, let's stick with the alert-centric approach.
As this specific notification was about software alerts, I chose "Software Updates" now. You can also see the other options like the EOL (End-of-Life) info, the Field Notices, known bugs and security alerts. Again, your choice totally depends on your needs here. Keep in mind: you can manage multiple notifications, and maybe you want to select different notification types (email / feed) with different frequencies for different topics. The main goal is still to receive the notifications in a manner that you are also willing to read them one month from now - otherwise it's no better than the negative examples from the beginning.
Now you choose the products you want to be notified about. The MDS products belong to "Storage Networking".
Just click along the tree of products. You can be very specific or just use the "All..." option from a subcategory.
As you see in the picture above, you can subscribe to very specific hardware and firmware combinations, like NX-OS 6.2(3) for the MDS 9222i, or more general entries like the ones above. To add a product, just click on "Add another subtopic". If you have everything you need, click on "Finish" to store and activate this notification configuration and return to the notification profile manager.
For each notification item you can see the status and the expiration date. Yes, Cisco won't spam you with emails until your personal EOL just because you forgot how to get rid of them. The notification we just created is only valid for one year. If you still find it useful then, you have to renew it. And yes, you will be notified by mail if you have notification configurations that will expire soon.
And if you notice that the setup wasn't optimal - for example you want to change the frequency, email address, notification type or products - just click on the edit button on the right side of the notification's header (the red-circled one). Here you can also copy it, or even delete it if you are not interested anymore.
You see, there are a lot of possibilities, but the configuration is quite simple and straightforward.
So try it and keep yourself up-to-date, because surprises are something for a birthday party. Not for a storage environment :o)
Almost a year ago I wrote an article about congestion bottlenecks in Brocade switches. I said you should avoid them, because they mean that you probably have no redundancy left - either because of too much workload or because you don't use it properly. You can use the bottleneckmon to detect them. Back then I cared much more about latency bottlenecks, often caused by slow drain devices, and their implications. And so I do today.
Well...stop! Didn't you talk about congestion bottlenecks?
Yes! Today I want to explain how a congestion bottleneck can cause exactly the same symptoms on the devices as a latency bottleneck - and exactly the same performance degradations. This is how it happens. In the middle you see a SAN director with 2 portcards and 2 core cards. While the devices are connected to the portcards, the core cards provide the backend connections between them. They are internally connected via the backplane. So for example host 1's path to storage array A would traverse its portcard, then one of the two core cards, and leave through the other portcard until it reaches storage array A. It could even be that two devices connected to the same portcard have to go over the core cards, because so-called local switching is only done within an ASIC, and a portcard can have more than one depending on the number of ports.
Now please meet host 2. Host 2 is a wonderful, modern server. One of the workhorses of the datacenter. It's fully packed with virtual machines, but its many cores and memory, as well as its state-of-the-art HBA, provide enough horsepower to cope with the workload. This baby is more than capable of doing the work and it's in no way a slow drain device. It's zoned and mapped to the storage arrays A, B, C and D and it uses them heavily, mostly for read operations. The tiny green bars are read requests, and as you see in the next picture it sends them to all of the arrays, all of the time.
Of course the other hosts send requests, too, but let's focus on our diligent host 2. Yes, the pictures are too simplistic, but I'm sure you'll get the point. In the next one you see the first responses flowing back to host 2. With host 2 communicating with several storage arrays, the link towards it is used heavily, but host 2 is processing the incoming frames quickly and gives buffer credits back to the switch in proper time. So far so good.
But the longer the link utilization stays very high, the more likely the following will happen, if you enabled bottleneckmon with alerting:
2013/09/07-12:07:11, [AN-1004], 7002, SLOT 7 | FID 128, WARNING, FAB1DOM5, Slot 2, port 14 is a congestion bottleneck. 99.67 percent of last 300 seconds were affected by this condition.
If you didn't enable bottleneckmon, the congestion bottleneck would still be there... you just wouldn't know it.
The crux is: you will hardly find a congestion bottleneck that just comes with high link utilization and no negative effects. The probability is much higher for the following scenario:
Although there are enough buffer credits for this highly utilized link, frames are piling up towards it, because there is just too much workload and the link is busy sending frames. There is no slow drain device, and to stay with the bathing metaphor: the drain works very well and transports as much water as it is physically able to. But there is so much more water in the tub than can go through the drain at the same time. And in addition, imagine you have not only one water tap (in our case the storage arrays) but four of them. They fill the tub quicker than the drain can empty it. As a result the internal buffers for all the hops through the SAN director fill up (that's basically the tub), and finally the director needs to do something about it: it will slow down the sending of buffer credits to the devices. Not only to devices that want to send frames directly to host 2, but due to back pressure also to the ones that send frames in that rough direction (using the same internal connections, for example). And finally you'll end up with something like this:
The SAN director just behaves like a slow drain device itself!
Frames pile up inside the storage arrays and other end devices impaired by the slow drain behavior. If their RAS package is good, they will yell about credit starvation and probably even drop frames within their FC adapters. In extreme situations these frame drops can happen in the director, too. At least then you would see something that points you to a performance problem. Because otherwise - if you have substantial delay in the traffic but all the frames finally get transferred to the next internal or external hop within the 500ms ASIC hold time - you would only see the congestion bottleneck. And without bottleneckmon you wouldn't see anything at all. The switch would look clean. Nothing in porterrshow or portstatsshow. Both show only external port counters anyway. As a SAN administrator you would not suspect anything in the director to be the cause of this.
And still it would be there: a big performance problem caused by a device communicating with too many other devices. Not a slow drain device, but still causing a slow drain in the SAN.
So how to solve it?
It's basically what I wrote a year ago, plus points 3. and 4. from How to deal with slow drain devices. You just have to ensure - from an architectural design point of view - that all components of the SAN are able to cope with the workload at any given time. It's both that easy and that complex. But the first step towards resolving such a situation is to detect it properly and to keep in mind what could happen.
There are some good videos out there on the STG Europe YouTube channel about infrastructures able to cope with analytics workloads. Distinguished Engineer John Easton discusses the requirements for this kind of workload in the video "IBM Big Data with John Easton" below:
He points out that it is more efficient to use large-memory systems with high computing power, like Power Systems or System z, instead of multiple System x nodes working in parallel. The reason for that is the high I/O demand and the high wait times that result from using disk-based storage systems to share the data between the nodes during processing. Especially for real-time analytics he recommends keeping all the computation within the same box.
The same preference for a scale-up approach of high-powered systems versus scale-out infrastructures is explained by Paul Prieto, Technical Strategist for Business Analytics, in the video "Choosing the right platform for Cognos Analytics":
Can flash make a difference?
With I/O performance being the main reason for avoiding a scale-out strategy, there is of course the question: what if the I/O performance could be drastically enhanced? Before IBM acquired Texas Memory Systems in 2012, their RamSan systems were rarely used to accelerate scale-out infrastructures, as far as I know. The main use case was to boost the few big boxes running highly productive applications but waiting for their I/O due to the inadequate I/O latencies provided by traditional disk storage systems. With I/O latencies in the range of two-digit to lower three-digit microseconds and the capability to sustain several hundred thousand IOPS, they were used as a Tier 0 storage for only the most demanding and business-critical workloads.
With the integration of the now so-called IBM FlashSystem into the IBM storage portfolio, another use case emerged and has since played a growing role in these deployments: IBM FlashSystem behind IBM SAN Volume Controller.
The pair "FlashSystem plus SVC" represents in fact two approaches:
- Using SVC to virtualize the all-flash FlashSystem and enrich its raw I/O performance with the features you expect from today's virtualized storage solutions, like seamless migrations, remote copy, thin provisioning, snapshots (FlashCopy) and many more.
- Using FlashSystem to boost existing SVC-virtualized storage environments by using it for Easy Tier as well as for pure flash-based volumes.
Especially the second way, combined with the wide range of supported host systems, HBAs, and operating systems, now makes something interesting that used to be a no-go: running applications with really high I/O demand, like analytics, on scale-out commodity systems while relying on impressive I/O performance available out in the SAN. But of course - as always - it's not that simple. Yes, there will still be scenarios where such a scale-out approach is just not applicable. Especially then, it might make much sense to speed up the storage even for the scale-up, purpose-built business analytics systems. However, for many companies - SMBs for example - it would make perfect sense to run their analytics on flash-accelerated clusters of x86-based commodity hardware...
...if they do it right.
So how to do it right?
Well, this blog is not intended to explain reference architectures or architectural best practices for analytics. But I want to add the SAN point of view. (I guess you already wondered when this would start - given the usual topics of "seb's sanblog".) And from my perspective as a SAN troubleshooter I can at least tell you what should be taken into consideration so it doesn't fail from the beginning. There are two major points: the general architecture and the hardening of the SAN. The proper architecture (for example keeping the FlashSystem and SVC attached to the core) is the base, but a handful of issues could have an unacceptable impact on the performance. Many of them I have already covered in earlier blog posts and some of them will be the topics of future ones.
The main goal is to prevent the SVC ports from being blocked. Ever. Be it back pressure due to slow drain devices, sub-optimal cabling patterns, "unlucky" long distance settings, enabled but unused QoS, too few buffers set for the F-ports, sheer overload of links, or many others.
With disk-based storage we talked about good average latencies of around 3ms. As the combination of FlashSystem plus SVC now works with a tenth of that and lower, the storage network's performance really starts to make a difference. Usually we talk about single-digit microseconds one way from device to device in a well-designed SAN. But the issues described above could increase this into the range of hundreds of milliseconds. Then, of course, it will hardly be possible to provide real-time business analytics. Therefore it is important to harden the SAN with the possibilities you have today, like - speaking of Brocade fabrics - Fabric Watch, bottleneckmon, Advanced Performance Monitoring, port fencing, traffic isolation zones, and so on. Brocade's "Fabric Resiliency Best Practices" are a good first step in this direction.
I think it's quite possible to create a scale-out infrastructure for business analytics even - and especially - with SAN-based storage, as long as it's optimally prepared and uses IBM FlashSystem solutions to overcome the mechanically caused latencies of disk storage. But it's crucial to ensure that these benefits are not rendered void by avoidable performance problems.
IBM experts are more than willing to support you in this challenge. ;-)
To like working in tech support, you have to be the most optimistic guy around. You have to be even more optimistic about the product you support than the sales guy trying to sell it. Why? Because the product can be as fantastic as possible - jam-packed with jaw-dropping features - as a tech support guy you will only witness the bugs. However, the bugs are not what's annoying me. Well, at least most of them. :o) Every software necessarily has bugs. They are my job, the very reason for its existence. What's really annoying is when I know that there is a problem, but the RAS package is just not good enough to enable me to troubleshoot it.
Therefore, I was pleasantly surprised when I read the release notes of the Fabric OS v7.1 codestream. There are a lot of tweaks and features that make the life of a troubleshooter easier. And it's not only about finding problems, it's about preventing them, too. So here is just a first selection of what I like:
Can I trust the counters?
"FOS v7.1 has been enhanced to display the time when port statistics were last cleared." says the release note. This sounds trivial, but it's essential for the troubleshooting of many problem types like performance problems, physical problems and so on. Times when we had to go through the CLI history - in the hope that the counters were cleared via CLI after a proper login - seem to be over now.
Link Reset Type in the fabriclog
A small enhancement, but a time-saving one. To get a time-based overview of the state changes of the ports, you usually have a look into the fabriclog. But there you often only see that there were link resets. The interesting thing would be to find out who initiated them - the local port or the remote one. The LR_IN and LR_OUT counters in portshow were an insufficient source of information here, as they show only absolute numbers. In Fabric OS v7.1 the type is simply part of the message and you see it at a glance.
For many admins the best practice to replace an SFP is to disable the port, then replace the SFP and afterwards re-enable the port again. I know many people who did this, and I always felt uncomfortable telling them, "Rip it out while it runs, otherwise the switch won't recognize it correctly." But that's the way it is before v7.1: if the port is not running while you replace an SFP, the switch might not notice that, for example, the 4G LW SFP that was in there before is now an 8G SW SFP. Besides any ugly additional bug that was possible based on that later on, the behavior itself was a pain. In v7.1 you don't have to care about that. Sfpshow will show you the correct information. Additionally sfpshow will also tell you when the last automatic polling of the SFP's serial data took place.
Honest long distance
If you have read SAN Myths Uncovered 2: The LD mode (Brocade) on my blog before, you know that the whole long distance story in Brocade switches is a little bit... let's say "optimistic". For long distance ISLs (other than long distance end-device connections) you only configure the length of the connection and the switch calculates the necessary amount of buffers. But as it does that using the maximum frame size, you'll end up with a buffer shortage for basically all real-world use cases. In Fabric OS v7.1 new functions take account of this fact. The command portbuffershow (by the way, a mandatory candidate for every data collection) will show you the average frame size now. So sooner or later I can mothball my article about How to determine the average frame size. And this value can then be used to optimize the buffer settings in the completely overhauled portcfglongdistance command. Now it will calculate the buffers based on your average frame size. Furthermore, it allows you to configure the absolute number of buffers yourself if you want. You don't need to tell your switch anymore that a distance is 200km just to assign enough buffers to span 60km with a real-world average frame size far below the maximum one. It's that kind of clarity that prevents misconceptions and avoidable performance problems.
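To see why the average frame size matters so much, here is a back-of-the-envelope version of the math - my own simplification, not Brocade's exact algorithm - assuming roughly 5 µs of propagation delay per km of fiber and one buffer credit per frame in flight:

import math

def credits_needed(distance_km: float, speed_gbps: float, frame_bytes: float) -> int:
    """Rough number of buffer credits needed to keep a link of the given length busy."""
    bytes_per_s = speed_gbps * 100e6        # FC convention: ~100 MB/s of payload per "Gbps"
    serialization_s = frame_bytes / bytes_per_s
    round_trip_s = 2 * distance_km * 5e-6   # ~5 us/km each way; the credit has to come back
    return math.ceil(round_trip_s / serialization_s)

# Full-size frames (~2KB) over 10 km at 16Gbps: roughly the 80 credits LE mode assigns.
print(credits_needed(10, 16, 2048))   # -> 79
# The same link with a real-world average frame size of 1KB needs about twice as many.
print(credits_needed(10, 16, 1024))   # -> 157

With a 1KB average instead of the 2KB maximum, a distance-only calculation falls short by a factor of two - which is exactly the gap the new average-frame-size-based calculation closes.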
This is not an exhaustive list of all the good new things. There are definitely more good features in the direction of RAS, like enhancements for credit recovery, Diagnostic Ports, FDMI, Edge Hold Time, FCIP and many others. In my eyes they'll make the platform even more robust and, after all, hopefully give me a little more time to write more blog articles in the future. :o)
Oh wait... is this the call to update to v7.1 immediately?
Well, no, it's not. It's just an outlook for the things to come. Better plan your updates carefully. You know, it's just a blog article by the most optimistic guy around... ;o)
Well, this year passed by at high speed. How perception can change... I felt 2011 was a pretty long year. Our first child was born and the life of my wife and me was turned upside down. It was a lot of work, but good work! And I felt my body and brain adjusting to the demands. Time felt like it was going by slower. Maybe I was just rushed with adrenaline for several months. This year was different. Many things happened at my job and time flew by. Most of the things were internal stuff - important for me, maybe even interesting - but unfortunately nothing I could blog about here. Sure, I'm still deeply involved in the topic of SAN troubleshooting, but there was so much else to do in 2012.
So what to expect from 2013?
Well, I will still be here blogging from time to time. Hopefully a little bit more than in 2012. Let's see if that really works, because we expect our second baby to be born in mid-February. So I hope for the adrenaline to kick in again. :o) There are still a lot of ideas in my mind and SAN troubleshooting is an ongoing thing. I'm here to share my experiences from my work as a SAN PFE (Product Field Engineer) in the IBM ESCC. Usually I don't get much feedback about it, but what I got was really good and I'm happy that I was able to help in a good number of situations. In 2012 the blog had about 254k hits, which I think is a good amount given my relatively small target group of SAN admins, designers and troubleshooters. Of course, I don't earn any money with ads or the like :o)
But often enough in 2012 I felt it wouldn't do any harm if I expanded my scope a little bit. So in 2013 I plan to stop restricting myself and to write a little more about other storage-related topics as well. At the moment I'm not really sure if I will do that within this blog or if I will create a new one. I'm tending towards the first choice, but if you, my dear reader, have some good reasons to keep the sanblog "clean", I'll consider them, too.
Until then... Have a good start into 2013 and Happy New Year!
A slow drain device often has a huge impact on the performance of many other devices in a SAN environment. That happens because it blocks resources in a fabric that other devices use as well. The main example of such a resource are ISLs, particularly the Virtual Channel(s) within those ISLs that are used to reach the slow drain device. But as soon as you have an appliance in the SAN, it can turn into such a blocked resource as well.
Disclaimer: There are several definitions and types of appliances. Within this article an appliance is a device "in the middle" between the hosts and the storage systems with a specific task, such as a compression, encryption, virtualization or deduplication appliance. While I had the SAN Volume Controller (SVC) in mind while writing this, it applies to many other products matching this definition. The common thing is that the performance they can provide is to some degree dependent on their destination devices' performance.
Fortunately, many of the fabrics I have seen over the recent years were designed using a core-edge approach. If a device is in the communication path of many of the devices in a SAN, it's best practice to attach it directly to the core. But a slow drain device can still block it. This is how it happens:
In this sketch the appliance sends data towards a slow drain device. The slow drain device is not able to process the incoming frames quickly enough - they pile up in its HBA's ingress buffers (1). As the appliance is still sending frames but the edge switch cannot forward them to the slow drain device, they also pile up in the ingress buffer of the ISL port of the edge switch (2). This alone could already impair the performance of the other hosts connected to the same edge switch as the slow drain device - if the frames towards them use the same VC. Some microseconds later the same might happen to the frames from the appliance entering the core (3). They pile up there as well, and as soon as that happens, this so-called back pressure reaches the appliance itself. As there are no VCs on the F-to-N-port connection used to attach the appliance to the core, the chance is high that the appliance cannot send any frames out to the SAN anymore - no matter to which destination (4).
Well, that means you just turned your appliance into a slow drain device itself! The performance of the whole environment is heavily impaired now:
In step (5) the frames from the other hosts towards the appliance pile up in the core as well, and then the back pressure spreads further to the hosts connected to the edge switches (6).
Worst case, hmm?
After the ASIC hold time is reached (usually 500ms), the switches will begin to drop frames to free up buffers again. But as all switches have the same ASIC hold time, you'll end up in the situation that while the edge switches reach these 500ms first, the core switch will likewise start to drop frames before the buffer credit replenishment information (VC_RDY) from the edge switches arrives. So not only the frames from the communication with the initial slow drain device will be dropped, but most of the others down the path as well. And as the appliance itself turned into a slow drain device, the same might happen to the frames piled up because of that, too.
So what to do against it?
The first thing is: give the F-ports of the appliance as many buffers as possible. Priority 1 should be that it is able to send its frames out into the fabric, so the chances are higher that, while the frames of the open I/Os against the slow drain device are out there, there are still some buffer credits left to send stuff to other devices. For clustered appliances like the SVC it's even more important, because they use these ports for their cluster-internal communication as well. Blocked ports could then result in cluster segmentation (SVC: single nodes rebooting due to "Lease expiry"). To assign more buffers to the switch port (= more buffer credits for the port of the appliance), use
portcfgfportbuffers --enable [slot/]port buffers
Update: Please keep in mind that adding more buffers to an F-port is of course disruptive for the link!
To check how many buffers are available, you can use portbuffershow.
But in many cases this is not enough. Some time ago, Brocade released the Fabric Resiliency Best Practices with some good advice. In my opinion every SAN admin with Brocade gear should have read it. It recommends:
- Use Fabric Watch to get alarms for frame timeouts. (Erwin von Londen wrote a good article about that.)
- Use Port Fencing to isolate slow drain devices. (Read Erwin's post about that, too.)
- Configure and use the Edge Hold Time.
- Configure bottleneckmon to get alarms for latency and congestion bottlenecks.
While Fabric Watch is used more and more, especially in the FICON world - but also for open systems - and I see some of our customers using port fencing, I hardly see anyone utilizing the Edge Hold Time feature. For a situation as described above it could really improve the situation for the appliance and the other hosts dramatically. It can be set to any value between 100ms and 500ms and was introduced in FOS v6.3.1b. So if you expect hosts connected to an edge switch to show slow drain behavior in certain situations, in my opinion the Edge Hold Time of that switch should be set as low as possible. Of course it always depends on your environment and how likely it is to be impaired by a slow drain device, but 100ms is a long time in a SAN. If you also have some legacy devices connected to these edge switches, check whether a decreased hold time could be a problem for them.
It can be enabled and configured using the "configure" command, where it can be found among the Fabric parameters:
Not all options will be available on an enabled switch.
To disable the switch, use the "switchDisable" command.
Fabric parameters (yes, y, no, n): [no] yes
Configure edge hold time (yes, y, no, n): [yes]
Edge hold time: (100..500) 
You don't need to disable the switch to change the Edge Hold Time, and as one of the fabric parameters it will be included in a configupload.
As it seems to be used very seldom in the field, I would like to get some feedback on whether you have actually used it. Please give me a hint if and in which situation it helped you. Thanks!
But don't forget: The most important thing is to get rid of the slow draining behavior!
I check the referrers of this blog from time to time to get to know where my readers are coming from. For many of them I cannot actually see it, because often "bridge" pages are used - for example by the social networking sites. But a fair amount comes from searches on Google and other search engines. Some search queries there seem to repeat very often. Maybe I will write more articles about the others - because hey, this seems to be the stuff you're coming for :o) - but this time it's about congestion bottlenecks.
Congestion bottlenecks - besides latency bottlenecks - are one of the two things the Brocade bottleneckmon can detect. The default setup will alert you - if you enabled bottleneckmon with alerting - for all situations where 80% of all seconds within a 5-minute interval had 95% link utilization. That is a big number! Of course you can also modify the setup to be more aggressive, or to spare you some messages in an environment that is usually "under fire"...
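To make the default thresholds tangible, here is a minimal sketch of the check - my own illustration of the rule described above, not bottleneckmon's actual implementation:

def congestion_alert(per_second_utilization, cthresh=0.95, time_fraction=0.80):
    """True if the default congestion condition described above would be met.

    per_second_utilization: one utilization value (0.0-1.0) per second,
    e.g. 300 samples for the default 5-minute window.
    """
    affected = sum(1 for u in per_second_utilization if u >= cthresh)
    return affected / len(per_second_utilization) >= time_fraction

# 299 of 300 seconds at line rate -> alert
print(congestion_alert([1.0] * 299 + [0.5]))   # True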
But I encourage you to take it seriously!
In my opinion, a healthy SAN should NEVER have congestion bottlenecks. With "healthy" I mean, of course, the time of normal operation. Not when you have an incident at the moment and there is no redundancy in some parts, because for example the second fabric has a problem or one out of two ISLs between two switches had to be disabled... I wrote an article about that last year and I think it fits well within this topic.
Rule of thumb: Link utilization should be up to 50% only.
And of course it should not be only 50% just because you configured too few buffers! The setup of the link should always allow it to transport up to 100% of what is physically possible. Otherwise you will have no real redundancy again!
But how to handle them now that I have them?
So you see these [AN-1004] messages in the error log and you know the port. What now? This is more about your SAN concept than about defects or software features. The congestion bottleneck happens because the utilization of a link approaches its physical capabilities. Here are some ideas:
- Start real performance monitoring. Have a look at what the Advanced Performance Monitoring license can do for you. (Ask your IBM SAN support for a free 60-day Universal Temporary License.)
- For example use Top Talkers to find out where the heavy traffic comes from.
- Simple: use more links - if you have enough resources. For ISLs: if you have free ports on both switches and the possibility to connect them over your patch panels - or, if needed, over the long distance between two sites - just add ISLs to spread the load.
- If you use a Traffic Isolation setup, check if it's done correctly and does not cost you too much bandwidth.
- Check if you can run the ISL at a higher line rate ("speed"). More line rate means you can actually transfer more data at the same time. But please keep in mind that higher line rates require better cable quality. If you have only 100m OM2 cables between two switches, increasing the line rate from 4Gb to 8Gb or even 16Gb will most probably result in problems for the link!
- For congestion bottlenecks on end-device links you should check the multipath driver first. Is the load spread over the available paths? If you have a plain active/passive failover multipather, it might be okay to have high load on the ports in use. But if you use round-robin load balancing, check if you can add additional paths (more HBA ports) with common sense. Keep in mind that more paths mean longer failover times, and many devices have a maximum limit for paths!
- Server virtualization may allow you to move workloads to more suitable regions of your SAN environment to relieve the ones under pressure.
And often forgotten:
In many cases the congestion bottlenecks will be observed only at specific times. Usually the devices in your SAN don't have the same workload all the time. There is a time when people sleep, a time when people come to work and switch on their VDI'ed PCs, a time when the backups run and a time when big batch jobs run. Proper planning and scheduling is mandatory in today's data centers! Don't let the big workloads run at the same time. Spread them across the course of the 24 hours you have. The same is true for the course of the week, the month, the quarter, the year.
Very few environments are totally under-sized for the average mix of workloads - but the demand of the components of this mix over time is at the heart of your storage environment's performance!
If you need help to better manage your workloads, I'm sure your local IBM Sales rep or IBM business partner can bring you in contact with the right performance expert to work these things out for your special situation.
It's the nightmare of every motorist. Your car was just repaired a few days ago and now it stopped running in the middle of nowhere. Or you even crashed, because the brakes just didn't work in the rain. Fake parts are a big problem in the automotive industry. Original-looking parts from dubious sources could even work as expected in normal operations but when the going gets tough, the weak won't get going. So before a fake cambelt wrecks your engine or a fake brake pad costs your life, it might be a good idea to not save on the wrong things.
But a faked SFP?
Like a brake pad, an SFP is somewhat of a consumable. It transforms light into an electrical signal and vice versa, produces heat, and its components wear out over time. Some sooner, some later. If you bought the SFPs from IBM for a switch under IBM warranty or maintenance, broken SFPs will be replaced for free. But if you decide to buy an SFP, you'll notice after a quick web search that there are a lot of suppliers out there offering the same SFP for a much smaller price than IBM. And with "the same SFP" I mean they offer the very same IBM part number - for example 45W1216. That's an 8G 10km LW SFP.
Is it really the same?
Of course not - although they claim it to be the same. Their usual explanation is that all these SFPs come from the same manufacturer anyway. SFPs are built using open standards defined by T11 and therefore they should be compatible per se. I can tell from several occasions: that's not true. There is of course more than one SFP manufacturer and I'm sure each of you knows a handful offhand. In addition: even in times before 8G there were SFPs working much better with certain switches than others.
With the 8G platform Brocade decided to offer Brocade-branded SFPs and restricted their switches to only support them and to refuse others (apart from very few exceptions for CWDM SFPs). So Brocade took control over which SFPs can be used and they were able to fine-tune their ASICs to allow better signal handling and transmission. To enforce this, the switch checks the vendor information from the SFP to determine if it's a Brocade-branded one. Cisco does the same for the SFPs in their switches.
Here is where the fake begins...
There are several vendors of devices to rewrite this SFP-internal information. By spoofing vendor names, OUIs (Organizationally Unique Identifiers) and part numbers they try to circumvent the detection mechanisms on the switch. So independent suppliers buy "generic" bulk SFPs and "rebrand" them to sell them as "IBM compatible" with the same part number. And because IBM officially supports the part number (like announced here) one might assume everything will be fine then.
In fact it's not...
Imagine a migration project. The plan is in place, everything is prepared, the components are bought and onsite, all the necessary people are there in the middle of the night or during a weekend and the maintenance window begins. And then these ports everything depends on just don't come online - only because someone negligently faked these "cheaper but still compatible" SFPs. I had a case where the same SFPs did work in one 8G switch model but not in the other - also 8G - with exactly the same FabricOS.
In the sfpshow output they looked like this:
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 5401001200000000 200,400,800_MB/s SM lw Long_dist
Vendor Name: XXXXXX
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
The supplier did not write "Brocade" into the "Vendor Name" field (I replaced it with Xs) but in the "Vendor OUI" field he inserted the OUI from Brocade. In addition he also faked the "Vendor PN" but even used a wrong one. This one is the PN for a shortwave SFP.
But besides being an ugly showstopper for the migration - driving costs far beyond what could have been saved by buying the cheaper parts - that's not even the worst case. Perfectly faked SFPs might be accepted by the switch, but you never know if they are really running fine. I don't wish anybody to be called at 3am about the crash of half the servers, because an ISL started to toggle. Or to have increasing performance problems, because every now and then a faked SFP "on the edge of the spec" devours a buffer credit by misinterpreting an R_RDY.
Troubleshooting this can be a pain itself. But the money potentially lost on outages will hardly be compensated by the savings from cheaper SFPs!
I got the confirmation from IBM product management, that IBM itself will only deliver Brocade-branded SFPs for its current b-type SAN portfolio.
So if you have non-Brocade-branded SFPs in your 8G or 16G Brocade switches be aware that they are probably not supported and there could be some unplanned night or weekend working hours for you in the future...
Working on a refresh of my IBM internal SAN problem determination course from last year, I stumbled over the first couple of slides again. They are a little bit of a "raison d'être" for the course - the answer to the question "Why should I learn that?". And they got me pondering again. How long will this still be relevant? Here's what I think:
Every once in a while someone restarts the discussion whether tape will die soon. The same discussion comes and goes for Fibre Channel in total. There are lots of people predicting each year that SAN is a dying concept altogether. The cloud is here today. Not some spooky future concept but deployed in many forms and flavors. And there are stacks, too. A whole data center in a rack. Pre-integrated, pre-configured, pre-optimized, pre-fueled with software and "expertise".
So does it still make sense to build up SAN skills?
If all the expertise is already in the product, why spend time becoming an expert? If management UIs become childishly simple, why should a company pay certified specialists? So why learn all that stuff and get certified in a world where storage comes as a commodity? And why are there still people out there saying everything becomes more and more complex if it appears that everything gets so easy and simple now?
A look back...
Unfortunately in many regions of the world water is not a commodity. It has to be brought from remote sites over long distances. Often there are people whose only task is to fetch the water. And if a drought lasts longer than the water in the reservoirs, their special skills in finding alternative sources of water are in demand. It's undoubted that such skills must be maintained, transferred and extended to ensure the survival of the family or community. In contrast we - the people in the industrialized countries - take water for granted. It comes out of the tap. My 1.5-year-old understands that concept. Need water? Open the tap.
But does that mean "No experts needed anymore"?
Certainly not, quite the opposite! The preparation of drinkable water and its distribution as well as the handling of the sewage is a complex process chain today. It involves infrastructure specialists, biologists, chemists, process technicians, civil engineers and many more highly skilled persons working together. And given the challenges of the future it will most probably become even more complex.
The same is true for SAN skills
As long as we don't have a worldwide grid of quantum-entangled RAM-locality-based servers, I predict that there will still be something like a SAN. There will still be a need for architects, for specialists implementing it and for sure for well-skilled people troubleshooting it if problems arise. Storage - as well as computing in general - might become a commodity like water and we definitely won't need everybody to know how it works. But the ones remaining in that area will need to be the real experts. Skilled, trained, experienced and motivated.
You might say: "That's the case with virtually everything!" and you are right. So why not SAN?
I was asked where to look in a switch to find the average frame size for a port. The safest way would be to use an external monitoring tool like a VirtualWisdom or a tracer as described in my LD mode article, but if you don't own something like that you can get a good guess from the switches themselves. You just have to calculate it out of the number of frames and the number of bytes transferred.
For Cisco it's easy. Just look into the "show interface" output for the specific port and you'll find both numbers in the statistics section for each interface:
1887012 frames input, 1300631486 bytes
542470 frames output, 482780325 bytes
So we can just calculate the average frame sizes for both directions:
1300631486 bytes / 1887012 frames = 689 bytes per frame
482780325 bytes / 542470 frames = 890 bytes per frame
For Brocade switches you can get the information out of the portstatsshow command:
stat_wtx 35481072 4-byte words transmitted
stat_wrx 70173758 4-byte words received
stat_ftx 1111087 Frames transmitted
stat_frx 1177665 Frames received
Here we don't have the plain bytes but 4-byte words. Don't worry - fillwords aren't counted in this number, so it's still valid for our calculation. We just have to multiply it by four to use it:
(35481072 * 4) bytes / 1111087 frames = 128 bytes per frame
(70173758 * 4) bytes / 1177665 frames = 238 bytes per frame
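If you have to do this calculation more often, a few lines of script can save you the mental arithmetic. Here is a minimal Python sketch using the example counter values from the outputs above (nothing switch-specific, just the same math):

# average frame size from the counters shown above
cisco_bytes, cisco_frames = 1300631486, 1887012
print(round(cisco_bytes / cisco_frames), "bytes per frame")          # ~689

# Brocade portstatsshow reports 4-byte words, so multiply by 4
brocade_words, brocade_frames = 35481072, 1111087
print(round(brocade_words * 4 / brocade_frames), "bytes per frame")  # ~128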
It's really that easy?
Basically yes. With this average frame size you can find out the multiplier for the buffer credits settings. So if you have an average frame size of 520 and a link of 30 km, just calculate:
2112 (the max frame size) / 520 ≈ 4
So you would set up the link for 120 km instead of 30 km to reserve a sufficient amount of buffers. That's it.
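As a small Python sketch with the example values from above (the rounding to a whole-number multiplier is my simplification):

max_frame = 2112       # maximum FC frame size in bytes
avg_frame = 520        # measured average frame size
physical_km = 30       # real distance of the link

multiplier = round(max_frame / avg_frame)    # -> 4
configured_km = physical_km * multiplier     # -> 120 km to configure
print(multiplier, configured_km)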
A last catch
If you read my article about bottleneckmon you probably already know that we work with 32 bit counters here. While the frame counters take a few hours to wrap, the 4-byte-word counters wrap much more quickly. So to be able to calculate an average frame size over several hours or days, 32 bit counters are not enough. Actually there are 64 bit counters for these values in the switches - although they are not part of a supportsave. The command portstats64show provides them. The first thing to keep in mind: while in the latest FabOS versions a statsclear resets these counters as well, in older versions you had to reset them with portstatsclear.
The 64 bit counters are actually two 32 bit counters and the lower one ("bottom_int") is the 32 bit counter we used all the time in portstatsshow. But each time it wraps, it increases the upper one ("top_int") by 1. So after a while you might see a portstats64show output like this:
stat64_wtx 0 top_int : 4-byte words transmitted
2308091032 bottom_int : 4-byte words transmitted
stat64_wrx 39 top_int : 4-byte words received
1398223743 bottom_int : 4-byte words received
stat64_ftx 0 top_int : Frames transmitted
9567522 bottom_int : Frames transmitted
stat64_frx 0 top_int : Frames received
745125912 bottom_int : Frames received
For the received frames it's then:
(2^32 * 39 + 1398223743) * 4 bytes / 745125912 frames = 907 bytes per frame.
Much manual computing, hmm?
Of course you could write a script for that or prepare a spreadsheet but my recommendation is still to start with a multiplier of 3 for normal open systems traffic and check with the command portbuffershow how many buffers are still available. And if you still have some, use them - but keep them in mind if you connect additional long distance ISLs or devices you want to give additional buffers as well.
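If you do go the script route, combining the two halves is just one line of arithmetic. A minimal Python sketch with the received-side example values from above:

def combine(top_int, bottom_int):
    # the 64 bit value is simply top_int * 2^32 + bottom_int
    return (top_int << 32) + bottom_int

words_rx = combine(39, 1398223743)    # stat64_wrx
frames_rx = combine(0, 745125912)     # stat64_frx
print(round(words_rx * 4 / frames_rx), "bytes per frame")   # ~907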
Update Nov. 2nd 2012:
I was made aware that there is an easier and much more convenient way to use portstats64show: Just use the -long option.
pfe_ODD_B40_25:root> portstats64show 26
stat64_wtx 7 top_int : 4-byte words transmitted
485794041 bottom_int : 4-byte words transmitted
stat64_wrx 13 top_int : 4-byte words received
2521709207 bottom_int : 4-byte words received
pfe_ODD_B40_25:root> portstats64show 26 -long
stat64_wtx 30557972957 4-byte words transmitted
stat64_wrx 58371265974 4-byte words received
Much better, isn't it? Thanks to Martin Lonkwitz!
In one of my previous posts I wrote about "Why inter-node traffic across ISLs should be avoided". There is an additional "bad practice" that could lead to performance problems in the host-to-SVC traffic.
Let's imagine a core-edge fabric. A powerful switch (or director) in its center is the core. The SVC and its backend storage subsystems are directly connected to it. Besides that there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC and the SVC is the only host of all the other storage subsystems. From time to time I see a cabling like the one below. The devices are connected in a common pattern. For example SVC ports are always on port 0, 4, 8, ... or for a director on port 0 and 16 on each card... Something like that. The reason behind that is often to spread the workload over several cards/ASICs to minimize the impact in case of a hardware failure. But there's a risk in doing so.
Index Port Address Media Speed State Proto
0 0 190000 id 8G Online FC F-Port 50:05:07:68:01:40:a2:18
1 1 190100 id 8G Online FC F-Port 20:14:00:a0:b8:11:4f:1e
2 2 190200 id 8G Online FC F-Port 20:16:00:80:e5:17:cc:9e
3 3 190300 id 8G Online FC E-Port 10:00:00:05:1e:0f:75:be "fcsw2_102" (downstream)
4 4 190400 id 8G Online FC F-Port 50:05:07:68:01:40:06:36
5 5 190500 id 8G Online FC F-Port 20:04:00:a0:b8:0f:bf:6f
6 6 190600 id 8G Online FC F-Port 20:16:00:a0:b8:11:37:a2
7 7 190700 id 8G Online FC E-Port 10:00:00:05:1e:34:78:38 "fcsw2_92" (downstream)
8 8 190800 id 8G Online FC F-Port 50:05:07:68:01:40:05:d3
The SAN perspective
In the situation described above, all host traffic is passing the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre and only one signal is passing it physically at a time. Each virtual channel is just a dedicated portion of the buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking at the destination Fibre Channel address.
Technical deep dive
A normal non-QoS ISL has 4 virtual channels for data traffic. For an 8G link each one of them has 5 buffers. They can only work with these 5 buffers and there is no possibility to "borrow" some out of a common pool like for QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):
VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1
Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which should normally be enough for a multimode connection between two switches in the same room with only a few metres of cable. The switch basically uses the last two bits of the second byte of the destination address. That works like this:
Bits 00 -> frame uses VC 2 (which is the first virtual channel for data)
Bits 01 -> frame uses VC 3
Bits 10 -> frame uses VC 4
Bits 11 -> frame uses VC 5
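Here is a small Python sketch of that mapping - my own simplification of the rule described above, not Brocade code. It also shows why SVC ports cabled to ports 0, 4, 8, 12, ... all end up on the same virtual channel:

def data_vc(fc_address):
    # take the second byte (the "area"/port byte) of the 24 bit address
    area = (fc_address >> 8) & 0xFF
    # its two low-order bits select one of the four data VCs (2-5)
    return 2 + (area & 0b11)

for addr in (0x190000, 0x190400, 0x190800, 0x190c00):
    print(hex(addr), "-> VC", data_vc(addr))    # VC 2 every time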
So where's the problem now?
In our imaginary core-edge fabric where for example all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ... all host I/O towards the SVC would use the same virtual channel. As this is the only traffic that passes the ISLs from the edges to the core, only a quarter of the buffers are actually used! 5 buffers are very heavily in use and 15 are idling around, never to be filled. And 5 buffers are pretty few when an edge switch full of hosts wants to speak with the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.
How to solve that?
There are 3 possibilities:
1.) You could re-cable your SAN in a manner that all VCs are used. But besides the risk of physical problems and problems introduced by maintenance actions, the devices have to learn about the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and a risk of outages.
2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and the platform in use. While it avoids the physical actions, it still has the disadvantages for the hosts because of the changed addresses.
3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link compared with the 20 for a normal 8G E-Port) but will also collapse the 4 user traffic VCs into just one. It then looks like this:
VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1
So all buffers and therefore also all buffer credits will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode, but besides that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
Just something strange for the end: some switches are delivered from manufacturing with an alternative addressing pattern. For example port 1 of domain 3 won't have the address 030100 then but something like 030d00. In that case the problem can happen similarly but on other ports. But using LE-mode would solve it in pretty much the same way.
Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.
Additionally, please be aware that this is not an SVC problem by itself but will happen with every central storage connected to a switch using a pattern as described above and being used by hosts connected to another switch over an ISL!
Update from May 9th:
I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea to just set all their ISLs to LE-mode, regardless of whether the condition described above is actually met. Because of that, I would like to state more clearly: using LE-mode as a general approach for your ISLs can cause severe problems!
If the SVC ports are not connected in a way that only one Virtual Channel would be used, it actually makes sense to have ISLs with more than one VC. Virtual Channels are a good feature to prevent a latency bottleneck caused by back pressure from impairing the traffic of all devices using the same ISL. If devices on the edge switches communicate with other devices connected to other ports of the core (or other edges) as well, the impact of using LE-mode would be even more extreme in the case of slow drain devices.
I made some drawings to illustrate this. The first one shows 1 normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):
Here hosts 1 and 2 generate traffic towards the SVC (green), host 3 towards an additional disk subsystem (purple) and host 4 towards a tape drive (orange). Based on the ports these devices are connected to, different VCs are used for that traffic.
If you would use an LE-port instead, it would look like this:
Now all 4 data traffic VCs collapsed to a single one. As long as everything runs smoothly, you won't see an impact.
But if for example one of the devices connected to the core is slow draining, the following will most probably happen:
In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL will be a latency bottleneck, because all data traffic shares the same VC in LE-mode. The back pressure goes further towards the edge switch and all 4 hosts of our example are affected now although only host 3 communicates with the slow drain device!
With a normal E-port it looks like this:
Now only VC4 is affected while VC2, 3 and 5 are running smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while the hosts 1, 2 and 4 are running fine.
You see: using LE-mode for the purpose described in my original article only makes sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!
I haven't blogged for a while now because of an internal project. Like every software development project it's never really over and development will continue in the coming years to bring in new functions, but I hope I have some more time for blogging again now. :o) I also decided to move a bit away from the long blog posts I did in the past towards more conveniently readable short posts where possible.
Long distance modes
Brocade has basically 3 long distance modes:
- LE mode - merges all user-data virtual channels and assigns the amount of buffers necessary to cover a 10 km distance based on the full frame size for the given speed. It requires no license.
- LS mode - like LE mode, but is used for distances > 10 km and requires the "Extended Fabric License". You configure it with a fixed distance.
- LD mode - similar to LS mode, but the distance is measured automatically and the buffers are assigned according to the measured distance. You configure it with a "desired distance".
So what's the problem with LD?
If you have two data centers with a distance of 30 km between them and you configure 60 km, the switch will only assign the buffers for the measured 30 km. Increasing the desired distance doesn't change anything.
Wait! Why should I increase it anyway?
As written above the number of buffers depends on the distance. The switch just calculates the amount of buffers by the number of full sized frames (frames with maximum frame size - usually 2kB) needed to span the distance. But the problem is: in real life the average frame size is actually much smaller than the maximum one.
In the picture above you see a write I/O out of a fibre channel trace. The lines with the rose background are the frames from the host, the ones with the gray background are the responses from the storage. The last column shows the size of the frame. Only the 4 data frames have the full frame size. The other 3 frames are far smaller than 2kB. So the average frame size in this example is just 1.2kB. With this average frame size you would need almost double the amount of buffers to fill the link compared to the number the switch calculated! And it could be much worse. I ran a report over the full trace and the average frame size for the transmit and receive traffic was:
Given those numbers, plus a "little buffer reserve", you would need 3 times the buffers the switch would use!
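To get a feeling for the numbers, here is a rough Python sketch. This is not the formula the switch uses, just a simple model based on my assumptions that light travels about 5 microseconds per kilometre in fibre, that 8b/10b encoding eats 20% of the line rate and that the credits have to cover the full round trip (frame out, R_RDY back):

import math

def credits_needed(distance_km, speed_gbaud, frame_bytes):
    payload_bytes_per_us = speed_gbaud * 1e9 * 0.8 / 8 / 1e6   # usable bytes per microsecond
    serialization_us = frame_bytes / payload_bytes_per_us      # time one frame occupies the link
    round_trip_us = 2 * distance_km * 5.0                      # out and back
    return math.ceil(round_trip_us / serialization_us)

print(credits_needed(10, 8.5, 2112))   # 41 - close to the ~40 buffers of an 8G LE link
print(credits_needed(10, 8.5, 1200))   # 71 - almost double, matching the 1.2kB example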
Okay so let's give it more buffers!
Yes, for LS mode this would be exactly the action plan. But remember: for LD mode, the switch just uses the measured distance. The desired distance is only used as an additional maximum. So if you have 30 km and configure 20 km, it will only assign the buffers for 20 km. If you configure 50 km, it will only assign the buffers for 30 km. So my general recommendation is:
Use LS instead of LD!
LS mode gives you the full control. And use it with enough buffers by configuring a multiple of the physical distance. 3x is a good practice but you can increase it even more if there are buffers left. You can always check the available buffers with the command "portbuffershow".
Don't leave those lazy buffers unassigned but use them to fill your links!
I claim that in 2012 performance problems will keep their place amongst the most frequent and most impacting problems in the SAN. In many of the cases the client's users really notice a performance impact and so the admin calls for support. Other support cases are opened because of performance related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are also cases that don't really look like performance problems at first but turn out to occur for the same reasons. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or even worse failing recoveries - these could all be "performance problems in disguise".
When I analyze the data then and find out that a slow drain device or congestion is the real reason for the problem I write my findings down and try to give the client some hints about possible next steps. For example by mentioning my earlier blog article about How to deal with slow drain devices.
Do you know what the mean thing about it is?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to point the finger at another vendor's product. Of course I usually explain what it is, what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: nothing!
So why is that so?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association, but there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to their built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think that a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames will be stored within the sending device, blocking buffers and eventually have to be dropped if they can't be sent for a certain time (usually 500ms).
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner. (e.g. because of buffer credit starvation or congestion)
Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.
Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)
I updated the definitions with an additional sentence. Feel free to comment.
The term ecological footprint describes the total impact of someone or something on the environment. To achieve sustainability this footprint should be kept as low as possible. We should not demand more from Mother Nature than she can provide and of course we should not demand more than we actually really need. Sounds simple, but the reality is way more complex. In the area of IT the term Green IT was coined to describe and consolidate all the rules, actions and requirements to decrease the ecological footprint for the sake of sustainability. And IBM has a broad agenda about this. But often we forget what each one of us could do to be a little greener.
In the technical support we deal with defects. Our clients have the right to have a product working within the specifications. If a part is working outside its specifications, it has to be repaired or replaced. That's it.
And what's "green" about that?
The impact on nature happens when a part is replaced that was not really broken. No manufacturing process of a part can be so "green-optimized" that it's better than simply not replacing a part that is in good order. There is the mining (and/or recycling) for the materials, the chemicals and energy used during its processing, the packaging, the stocking and of course the logistics, too. In the end a small part like a fan can have a huge ecological footprint. This can only be avoided by replacing only the broken part. There's just one problem with that:
What if you can't tell which part is broken?
A classical example for that is a physical error in the SAN. In my article about CRC I pointed out how to use the porterrshow to find physical errors and - even more important - how to find the connection where the physical error is really located. But that's all that's possible from the data: you can only track it down to the connection. The connection usually consists of the sending SFP, the cable (plus any additional patch panels and couplers in between), and the receiving SFP. There is no reliable and technically justifiable way to tell which one is the culprit just from the porterrshow. I know that there are some "whitepapers" available on the web stating that this combination of "crc err" and "enc in" means this and that combination of "crc err" and "enc out" means that. But from a technical point of view that's nonsense.
So you have a physical problem, what to do?
When it comes to cables, my fellow IBM blogger Anthony Vandewerdt just released a great article about the impact of dust today. Other reasons for a cable to cause physical problems could be a bending radius that is too small or loose couplers. In times of fully populated 48- or even 64-port cards, the front side of a SAN director often looks like the back of a hedgehog. With every maintenance action on one of the cables you can expect the CRC error counters of the surrounding ports to increase. So in many situations the cable is not really broken, and replacing it wholesale just because of the counter is not eco-friendly.
The same thing with SFPs. You see physical errors increasing in the porterrshow for a specific port. That could mean that the SFP in there is broken, because its "electric eye" doesn't interpret the (good) incoming signal correctly. It could also mean that the SFP on the other end of the cable is broken, because it sends out a signal in a bad condition. Both will lead to the very same counter increases in porterrshow. If you replace them both as the first action you have most probably replaced at least one good one.
Given that you have redundancy in your SAN environment (which you should ALWAYS have), you have free ports available, and the multipath drivers for the hosts using the affected path are working properly, you could track the culprit down by plugging the cable to another SFP in another port and look if the error stays with the port or with the cable.
Please keep in mind that the port address ("the IP address of the SAN") could change along with the port (if you don't have Cisco switches). On Brocade switches you need to do a "portswap" to swap the port addresses as well.
If you cannot touch the other ports, Brocade built some tests into FabricOS for you, like "porttest", "portloopbacktest" and "spinfab". Please have a look into the Command Line Interface Reference Guide for your FabricOS version to get more information about them. With these tests in combination with a so called loopback plug it's easy to find out which part is really broken. Loopback plugs look like the end of a cable but just physically redirect the SFP's TX signal into its RX connector.
Mother Nature will be thankful
There is just one thing from above I want to pick up: parts working within their specification. Not every single CRC error is a reason to replace hardware. According to the Fibre Channel standard, the protocol requires a BER (Bit Error Rate) of 10^(-12) to work properly. For 8 or even 16 Gbps that means it's allowed and fully compliant with the FC protocol to have bit errors quite often. Here is where common sense must come into play. If you have 2-digit increases of the CRC error counter within an hour, it might be a good idea to determine which part to replace with the steps mentioned above.
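To put a rough number on that - my own back-of-the-envelope calculation, assuming an 8GFC line rate of 8.5 Gbaud and keeping in mind that the link transmits continuously, even when idle:

line_rate_baud = 8.5e9    # 8GFC line rate
ber_limit = 1e-12         # BER required by the FC standard
errors_per_hour = line_rate_baud * 3600 * ber_limit
print(round(errors_per_hour))   # ~31 bit errors per hour are still within spec

Keep in mind that only the bit errors which happen to hit a frame show up as CRC errors, so the visible CRC rate is usually even lower than that.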
If you see a single CRC from time to time, sometimes with days of no error, sometimes with "some" per day, that's perfectly fine with the FC protocol and well within the specifications. It could lead to single temporary and recoverable errors on a host, but nothing has to be replaced then as long as the rate doesn't increase significantly. You wouldn't replace your one-year-old tires just because the tread is only 90% of what it was when you bought them.
Let's think a little bit greener - even in switch maintenance :o)
In the last couple of years, besides the buzzwords "cloud", "big data" and "VAAI", there is another topic that plays a big role in every discussion about storage products: "easy management". In most cases it means an intuitive and catchy graphical user interface that would allow even children to manage a storage array - if you believe marketing. Along with that goes the integration of storage management tasks into the GUI of the servers themselves and of course automation of these tasks. If the highly skilled server and storage administrators don't have to invest their time into disproportionately laborious routine tasks anymore, they could focus on more advanced projects.
But many companies still fight the impacts of the financial crisis. This leads to: vacant posts get dropped, teams get consolidated and cut down. The CIOs want to see the synergy effects in numbers and decreasing headcounts. Formerly specialized experts have to cope with more and more different systems. Less time, more work, less education, more stress, less productivity, more trouble - a downward spiral. Besides that, classical admin work is offshored or out-tasked to operating and monitoring teams with no more than broad, general skills.
In the technical support I see the effect of that "evolution" in the problem descriptions of current cases: "We see SCSI messages in the host." Or even just "We see messages. Could be the SAN." Administrators with a foundational ITIL certificate but no clue about what a Read(10) is are suddenly confronted with a host running amok, with just some obscure rough messages about its storage in the logs. To ensure a quick resolution of the problem, priority one would be to know what these messages actually mean. Often they are just forwarded from the device driver and there is no good documentation available explaining them properly. Or there is just something like "blabla ...then go to your service provider", not even mentioning which one - out of the broad bouquet one with a heterogeneous infrastructure might have - this would be. If the admin lacks a fundamental understanding of the storage concepts and protocols, he will not be able to get any useful information out of that. And has to "randomly" pick a support organization for one of the involved machines.
The result: Long & critical outages.
So the colorful, dynamic, easy-to-use management interfaces protected us from the ugly technical abyss in the lower layers for the longest time. But now that there is a problem, we only get some strange sense data and don't know who could help us further. And it's the same with managing changes in the infrastructure. A lot of the problems opened at the SAN support are in fact mis-configurations, user mistakes or unrealistic expectations born out of conceptual misunderstandings. "We need this 300km synchronous mirror connection to run with 3ms latency max. We bought your enterprise SAN gear. Why is it not fast enough?". The same with slow drain devices. If a SAN admin (with also the server admin's and storage admin's hat on his head) has no idea about the traffic flow in a SAN and buffer-to-buffer credits, how could he understand the impact of a slow drain device in his environment?
That's why clouds and Storage aaS, IaaS or even SaaS are so important today. Not because of the elastic and dynamic deployment or the transparency of the costs. But because there are fewer and fewer people with deep technical background knowledge about storage and SANs available in the companies. They seemed to be superfluous as long as everything was running fine and an unskilled person was enough to make the few clicks in the GUI. So the only escape and the logical next step is to move to the cloud concept.
Am I a cloud fanboy?
I wouldn't call myself a fanboy. I'm a support guy and I like to troubleshoot as effectively as possible to solve a problem as quickly as possible. And to enable me to do this, I need a skilled local counterpart who is able to collect the data and to execute the action plans, who is also able to address problems to the proper support provider and to proactively monitor the environment. So if there is a classical data center with a team of skilled administrators, I'm quite happy. But if not, this "vacuum" should be filled to minimize the risk of major outages. The provider of a public cloud would have such a team.
And in private clouds?
In a well-defined and highly automated private cloud, the remaining (most probably much smaller) team of skilled admins doesn't have to take care of provisioning LUNs and other standard tasks anymore. They would have more time for digging deeper into the stuff. You might argue now that this just repeats the story of the easy management above. Right! But as soon as you have entered this path, and as long as the external constraints don't change, this is the only way to go. And for some of the companies out there a private cloud might just not be the best choice and other options like outsourcing would come into play.
The most important thing is to face the truth and to make an honest review of the skills available. Your data is your most precious asset and availability is crucial. If that path leads to the cloud, there is no reason to stop now. Don't wait for the next outage!
I've been blogging for a while now. Looking back, I had a personal blog about things I'm interested in for some years during my studies. I did a comedic fake news page, too. My wife and I write a blog about our baby and I also have an IBM internal blog about SAN troubleshooting. Last year I started seb's sanblog on developerworks and it was quite a slow start. At the beginning of 2011 there was much to do for my primary job on the one hand, but on the other hand my daughter was born and my interests shifted a bit. As I write the articles for this blog mainly during my spare time, the simple equation was: no spare time = no blog posts.
Midyear 2011 the situation improved a bit. My baby Johanna was out of the woods somehow (is "to be out of the woods" really the English term for finishing the most stressful phase?) after her hip dysplasia was cured and I was able to really start to blog. And then I thought: what do I want to blog about? There is so much going on in the storage industry, but am I really the best person to blog about it? Can I really add some value with blog articles here? I don't think so. Of course I comment on such topics on other people's blogs, Twitter or social platforms like LinkedIn from time to time. After all there's always some FUD around I cannot resist commenting on. But I try to keep my own blog really about SAN and storage virtualization with a focus on troubleshooting.
I wrote 19 articles in 2011. That's not much compared to, let's say, storagebod. Why is that so? Well, for me it's quite a balancing act what I can blog about. Of course I can't blog about a specific customer having a problem. That's a no-go. There are also things I don't want to blog about because there is already much out there about them. And then there is stuff that I just can't blog about, because it's internal information. Special troubleshooting procedures I created, for example, or information about internal tools and projects I'm involved in.
What remains then?
Oh, there's still enough to blog about. If I notice situations like "Hey, I explained this general thing in four cases now to customers completely unaware of it." or if I see a feature that could really help admins but hardly anyone uses it so far, then I write a blog article. I see it more as an additional explanation and food for thought. My target audience consists of customers on the "doing level" (admins, architects) as well as people troubleshooting SANs. I know that's a significantly smaller group than the audience of the more general storage bloggers, but I'm happy if the right people read it and I get the feedback that my blog helped them with their problems. However, I have been counting the visitors internally since the end of July, and so far around 32000 have visited seb's sanblog. That's not too bad, I think.
Writing such a résumé I want to thank the people who inspired me to start a blog. First of all there are Barry Whyte and Tony Pearson with their developerworks blogs showing me: there are actually IBMers out there writing about my topics of interest. Reading their blogs brought me to many others - also from other companies - that I try to look into daily. Most of them you see in the list in the right bar of this blog. But a special Thank you! goes out to my Australian colleague Anthony Vandewerdt, whose blog has a big focus on the people really working with IBM storage products and therefore SAN products as well. His Aussie Storage Blog on developerworks triggered my decision to start my own external blog. Thank you again!
So what to expect from 2012?
To be honest, I have no idea :o) There is no overall plan. No weeks-long article pipeline. I'm not invited to blogger events or anything like that and my blog is in no way a marketing channel for upcoming IBM products. Everything I write is just born out of my experience with SAN products and troubleshooting. I try not to write too much about hypes and trends, unless they have a direct impact on the SAN - like oversaturated hypervisors turning into slow drain devices or Big Data as an excuse to do some really weird things with your storage architecture :o)
Are you still interested?
Then be my guests in 2012 and if you feel the urge to say something about, against or additional to an article, don't hesitate to leave a comment! Have a nice start into the New Year!
Everyone is talking about cloud security these days. Is it clever to hand my data to someone outside my own data center? To another company? Maybe even outside the country? How safe and secure is that? Not only on the way there but also once it's there? Are they protected enough? Are they able to block intruders both remotely and locally? And what about attackers from within the cloud service provider? The discussion is so full of - indeed reasonable - concerns that I started to wonder.
Why do I often see SANs that are not secured at all?
I don't mean the physical access control to the machines themselves. Usually companies take that one seriously. But all the other aspects of SAN security are often disregarded, in my experience. If there is no statutory duty or enforcement of compliance, it's just a variable in the risk calculation weighing the costs of security against probabilities and incalculable consequences in case of a security breach. And taking budget constraints and the lack of skills and manpower into consideration as well, SAN security is often treated as an orphan.
There is a huge market for IP security with firewalls, intrusion detection systems, DMZs, honeypots and hackers with hats in all colors of the rainbow. If a famous company is hacked or victim of a huge DDOS attack you probably read that in the IT news. But if a company has an internal security breach in their storage infrastructure they'll hardly let the public know about it.
What to do from SAN point of view?
There are multiple aspects and possibilities to secure a SAN. Let's take Brocade switches as an example and let's see what could happen...
1.) Management access control
From time to time I get a request for a password reset and the switch's root account is still on the default password. THAT'S. NOT. COOL! It should really be unlikely, because in all current FabricOS versions the admin is prompted to change the passwords for all four pre-configured user accounts of the switch if they are still at the defaults. But it still happens every now and then.
It's the same as for all other devices with user management in IT: choose passwords which are hard to guess, can't be found in a dictionary, contain non-alphanumeric characters and so on. Change passwords from time to time, for example at a 90-day interval. Most switches support RADIUS and LDAP. The ipfilter command allows you to block telnet, enforcing the use of ssh. In addition, for FabricOS v7.0x it's now officially supported to have plain key-based ssh access for more than one user, too.
And don't stick with old switches from generations ago. Not only the lower line rate and the small feature set should be considered here, but security, too. If the firmware is very old, it's also based on old components like legacy versions of openssh & Co. Very concerning security holes have been fixed over the years. You can check the installed versions of these components here. And yes, it is quite easy to see the password hashes without the root user, but at least they are salted in the current firmwares.
Security is not only about passwords, it's about user roles, too. In the Brocade switches you can define user rights with high granularity, the DCFM has its "resource groups" and the Network Advisor works with "areas of responsibility". Use them to choose wisely who can do what. You don't want to have another Terry Childs case in the media and this time about your company, do you?
The only thing I miss for many SAN switches and other storage equipment is a real, robust and trustworthy accounting or audit log. I want to see what was done on the switch and by whom. Not only what they did via CLI, but via webinterfaces, management applications and shell-less CLI accesses, too. Is there no standard to have these data automatically forwarded to an internal, trusted collection server via a secured connection? Really?
2.) Encryption
You should encrypt your traffic. There are several possibilities to catch the signal without your knowledge, especially if your data leaves your controlled ground on the way to a remote DR location. For FCIP traffic you should always use encryption. Indisputable. And for plain fibre-based FC long-distance connections? You probably say "Hey, it's transparent and it's optical fibre, not electrical. You can't just dig a hole, rip the cladding off the cable and splice a second cable in." - You have no idea. Keep in mind that the data traversing the SAN is the really important and thus precious kind in your company. There are technical possibilities to do it, and if there is opportunity, there could be a criminal mind using it. This perception seems to be gaining more and more acceptance among the switch vendors. For example Brocade's current 16G equipment is able to have encrypted ISLs for that matter. Of course all vendors sell SAN-based encryption appliances or switches, too. This way not only the inter-location traffic is encrypted, but the data on disk or tape as well. So if there should ever be the chance that some unauthorized person gets his hands on the storage, he won't be able to read the data.
3.) Fabric access control
What would be the easiest thing to work around passwords and encryption if an intruder had physical access to a data center? (Just like a student employee, a temp worker, an intern, an external engineer... I think you get the point) He could simply spot a free port on a switch and connect a switch he brought in. Setting up a mirror port or changing the zoning to gain access to disks and doing some other nasty things is quite easy.
How to avoid that?
FICON environments for mainframe traffic always had higher security demands and we can use just the same features for open systems as well. There are security policies allowing us to control which devices are allowed to be connected to the fabric (DCC - device connection control), which switches can be part of the fabric (SCC - switch connection control) and which switches can modify the configuration (FCS - fabric configuration server). In addition the current Brocade FabricOS versions support DH-CHAP and FCAP using certificates for authentication.
If you want to utilize the features and mechanisms described above, the FabricOS Administrator's guide provides some good descriptions and procedures to begin with. Of course IBM offers technical consulting services to help you to secure your SAN properly.
So if you are concerned whether the provisioning model your IT could be based on in the future is secure, you should be even more concerned about the security of your SAN today!
(Disclaimer: SAN switches from other vendors may have the same or similar security features, too. I just chose Brocade switches because of their prevalence within IBM's SAN customer base.)
When Brocade released FabricOS v6.0 in 2007 Quality of Service sounded like a great idea: It allows you to prioritize your traffic flow to the level of certain device pairs. There are 3 levels of priority:
High - Medium - Low
Inter Switch Links (ISLs) are logically partitioned into 8 so called Virtual Channels (VCs). Basically each of them has its own buffer management and the decision which virtual channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.
Quality of Service takes this one step further.
QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL and these buffers are equally distributed over the data VCs. (There are some "reserved" VCs for fabric communication and special purposes.) The number of VCs is what makes the priority work - the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one. Medium lies in the middle, obviously. So more important I/Os benefit from more resources than the not-so-important ones.
Sounds like a great idea!
Theoretically you can configure the traffic flow in terms of buffer credit assignment in your fabric in a very fine-grained way. But that's in fact also the big crux: you have to configure it! That means you actually have to know which host's I/O to which target device should have which priority. Technically you create QoS zones to categorize your connections. Low priority zones start with QOSL, high priority zones start with QOSH. Zones without such a prefix are considered medium priority.
But how to categorize?
That's the tricky part. The company's departments relying on IT (virtually all of them) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project for weeks and months than a simple configuration. And there is much psychology in it. Besides that, you really have to know how QoS works in detail to design a prioritization concept. For example if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low-priority ones could even perform better. In the four years since its release I have seen only a couple of customers really attempting to implement it.
In addition you need to buy the Adaptive Networking license!
So why should I care?
If QoS is such a niche feature, why blog about it? Usually a port is configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE" which means auto-enabled - in other words "on". An 8G ISL will be logically partitioned into the 16 VCs as described above and the buffer credits will be assigned to the high, the low and the medium priority VCs. But that does not mean that you can actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
If you don't plan to design a detailed and well-balanced concept for the priorities in your SAN environments, I recommend switching off QoS on the ports. I don't say QoS is bad! In fact, with the Brocade HBA's possibility to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you have the possibility to better cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!
Better know the features you use well - or they might turn against you...
As this was not clear enough in the text above and I got back a question about it, please be aware: disabling QoS is disruptive for the link! In most FabricOS versions in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will turn effective with the next reset of the link. In that case you have to portdisable / portenable the port yourself.
As this is a recoverable, temporary error your application most probably won't notice anything, but to be on the safe side, you should do it in a controlled manner and - if really necessary in your environment - in times of little traffic or even a maintenance window. The command to disable it is:
portcfgqos --disable PORTNUMBER
Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) they are hardly ever SAN defects and 2) they need to be solved as quickly as possible, because they really have an impact.
If a switch just crashes, an ISL drops dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all the alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems on the other hand can have a high impact on your applications across the whole data center without a concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or in short: bottleneckmon.
If a performance problem is escalated to the technical support the next thing most probably happening is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32 bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
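A quick sketch of why the three-hour guideline is a sensible one - the frame rate here is just an assumed, illustrative value, not a measurement:

wrap_value = 2**32
frames_per_second = 400_000           # assumed average frame rate on a busy 8G port
hours_until_wrap = wrap_value / frames_per_second / 3600
print(round(hours_until_wrap, 1))     # ~3.0 hours until the 32 bit frame counter wraps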
Besides the wait time, the whole handling of the data collections, including gathering and uploading them to the support, takes precious time. And then the support has to process and analyze them. After all these hours of continuously repeated telephone calls from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would only now start to involve the admins and/or support for the particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x and came with some major limitations. But from v6.4x on it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it into the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically back-pressures and affects wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O - this can have a huge negative impact! Another reason could be a mis-configured long-distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it in the minutest detail. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. It's much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck was found. Pretty straightforward. If you are not sure you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
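Just as a hedged sketch of such a query (port 7 is a placeholder; the available filter options differ between FOS versions, so please check the command reference for your code level):

myswitch:admin> bottleneckmon --show 7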
And if there are messages now?
Well, there is still the chance that there are actually situations of buffer credit starvation the default-configured bottleneckmon can't see. But as you are reading an introduction here, I assume you would then just open a case with IBM support.
You'll Never Walk Alone! :o)
*)Depending on country-specific policies and maintenance contracts a performance analysis as described above could be a charged service in your region.
There are some goodies in FOS 7.0 that are not announced big-time. Goodies especially for us troubleshooters. There are regular, but not too frequent, so-called RAS meetings. Here we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of the wishes we had were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500ms because it can't be delivered in time, it will be dropped. So these drops are a good indicator for a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC connected to a specific port had to drop a frame that was intended to be sent to this port. It means: this guy was busy X times and I had to drop a frame for him.
But who looks into portstatsshow anyway? At least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it had only one cumulative counter for all reasons of frame discards - and there are a lot more besides those time-outs. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of them, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, get some ideas how to treat it here.
The firmware history
Just last week I had a fiddly case about firmware update problems again. There are restrictions on which version you can update to based on the current one. If you don't observe the rules, things can get messed up. And they can get messed up in a way you don't see straightaway. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: Wouldn't it be better to actually know the versions the switch was running on in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno  Date & Time               Switch Name  Slot  PID   FOS Version
1    Fri Feb 18 12:58:06 2011  CDCX16       7     1556  Fabos Version v7.0.0d
2    Wed Feb 16 07:27:38 2011  CDCX16       7     1560  Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: trying to merge a new switch into an existing fabric and failing with a segmented ISL and a "zone conflict". The most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little safer. Earlier, each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between any pair of devices that are not zoned together is blocked. The drawback of "no access" is its technical implementation, though. As soon as it was enabled, a hidden zone was created and its mere existence blocked the traffic for all unzoned devices. And so, without any indication, the switch ended up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge into the fabric? Fine. You don't have to care, the firmware takes care of it for you!
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (but for the Intel 8008) during my studies and it was a nice exercise in optimization.
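If you like to see the principle in action, here is a minimal Python sketch. It uses the generic CRC-32 from zlib purely as an illustration - not the exact FC-FS bit ordering or seed handling - and shows that a single flipped bit is enough to make the check fail:

import zlib

data = bytes(range(60))                      # stand-in for the original data
crc_sent = zlib.crc32(data)                  # checksum calculated by the sender

corrupted = bytearray(data)
corrupted[10] ^= 0x04                        # one single bit error somewhere on the way

crc_received = zlib.crc32(bytes(corrupted))  # checksum recalculated by the recipient
print(hex(crc_sent), hex(crc_received))      # the two values differ
print("data ok?", crc_sent == crc_received)  # False -> the data is recognized as corrupted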
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it as the 4 bytes right before the end of frame (EOF). The recipient reads the frame bit by bit and calculates the CRC value itself along the way. Reaching the end of the frame, it knows whether the CRC value stored there matches the content of the frame. If it does not, there was at least one bit error, the frame is assumed to be corrupted and can thus be dropped. Now if the recipient is a switch, the next thing to happen depends on which frame forwarding method is used:
Store-and-forward: The switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value. If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing; at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way. It ensures that the network is not stressed with frames that are corrupted anyway, but it comes with a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
Cut-through: To decrease this latency the switch could just read in the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-port (a really quick look into the so-called CAM table stored within the ASIC), the frame goes directly on its way towards the destination. So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame will travel the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that this frame is corrupted. Brocade switches work this way. As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless of which method is used, the CRC remains just an error detection mechanism and most probably the whole exchange has to be aborted and repeated anyway.
So how to troubleshoot CRC errors on Brocade switches then?
If you only had a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which link is really broken? If you have multiple broken links in a huge SAN, this could turn ugly. But there are 2 additional counters for you:
- enc in - The frame is additionally encoded in a way that allows bit errors to be detected. And because the frame is decoded when it's read from the fibre and encoded again before it's sent out to the next fibre, the enc in (encoding errors inside frames) counter will only increase for the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut through as explained above, there is one thing the switch can do in addition when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it replaces the EOF with another 4 bytes meaning something like "this is the end of the frame, but the frame was recognized as corrupted". The crc g_eof counter basically means "the CRC value was wrong but nobody noticed it before, therefore it still had a good EOF". So if this counter increases for a particular link, that link is most probably the faulty one.
      frames       enc  crc  crc    too   too   bad  enc   disc  link  loss  loss  frjt  fbsy
      tx     rx    in   err  g_eof  shrt  long  eof  out   c3    fail  sync  sig
 1:   1.5g   1.8g  13   12   12     0     0     0    1.1m  0     2     650   2     0     0
 2:   1.3g   1.4g  0    101  0      0     0     0    0     0     0     0     0     0     0
 3:   1.9g   2.9g  82   15   0      0     3     12   847   0     0     0     0     0     0
Port 1 shows a link with classic bit errors. You see CRC errors and also enc in errors. Along with them you see crc g_eof. Everything as expected. Just go ahead and check / clean / replace the cable and/or SFPs. There are some tests you could run to determine which one is broken, like "porttest" and "spinfab".
Port 2 is a typical example of an ISL with forwarded CRC errors. This ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence no crc g_eof increases.
Port 3 is a bit tricky now. If you just relied on crc g_eof, it would look like a victim of forwarded CRC errors, too. But that's not the case. Actually the frames were broken in a manner that the end of frame was not detected properly, so too long and bad eof are increased instead. Best practice: stick with the enc in counter. It still shows that the link indeed generates errors.
Hold on, Help is on the way!
Now with 16G FC as the state of the art, things have changed a bit. It uses a new encoding method and it comes with a forward error correction (FEC) feature. Brocade provides this with its FabricOS v7.0x on the 16G links. It is able to correct up to 11 bit errors in a full FC frame. FEC is not really highlighted or given special attention in the courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors within one frame! Based on the ratio between enc in and crc err we see so far - which basically shows how many bit errors you have in a frame on average - I assume this will simply solve over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
So let's see how this turns out in the future. It might be a bright one! :o)
The Storwize V7000 and the SVC (SAN Volume Controller) share the same code base and therefore the same error codes. Many of them indicate a failure condition in the machine itself, but there are others just pointing to an external problem source. The error 1370 is one of the second kind. There is not really much information about it in the manuals, but in fact it can give you a good understanding of what's going wrong.
As storage virtualization products, the SVC and the V7000 - if you use it to virtualize external storage - are actually the hosts for the external storage. Speaking SCSI, they are the initiators and the external backend storage arrays are the targets. Usually the initiators monitor their connectivity to the targets and do the error recovery if necessary. And so the SVC and the V7000 focus on monitoring the state of their backend storage and can actually help you to troubleshoot it.
So you have 1370 errors, what now?
They come in two flavors: the event ID 010018 (against an mdisk) and the event ID 010030 (against a controller - aka storage array). I'll explain the 010030 as it's easier to understand, and understanding it will give you the insight to understand the 010018, too.
If you double-click the 1370 in your event log, you see the details of the error:
You see the reporting node and the controller the error is reported against. But the most important thing is the KCQ. The Sense Key - Code - Qualifier.
Imagine this situation: the SVC is the initiator. It sends an I/O towards the storage device - the target. But the target faces a "note-worthy" condition at that very moment. So it will make the initiator aware of it by sending a so-called "check condition". Curious as it is, the initiator wants to know the details and requests the sense data. This sense data will now be stored in - you already guessed it - a 1370 in the format Key - Code - Qualifier. Often the last two are referred to as ASC (Additional Sense Code; the green one) and ASCQ (Additional Sense Code Qualifier; the blue one).
Where's the Rosetta Stone?
This sense data can be translated using the official SCSI reference table by Technical Committee T10 (the body behind the SCSI protocol). If you encounter an ASC/ASCQ combination in a 1370 that can't be found in that list, it's most probably a vendor-specific one. In that case the manufacturer of the target device can give you more information about it.
Back to our example. So you see the ASC 29 (the "Code") and the ASCQ 00 (the "Qualifier") here. Looking that up in the list reveals: it's a "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED". This so-called "POR" should make you aware that the target was recently either powered on or did a reset. Usually the initiator gets this with the first I/O it runs against the target after such an event, to be made aware that any open I/O it has against this target is voided and has to be repeated.
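Just to illustrate the decoding step, here is a tiny Python sketch; the two table entries are only the ones mentioned in this post, the real table is the T10 list or the vendor's documentation:

# Minimal Key/Code/Qualifier lookup - only the two ASC/ASCQ pairs from this post
KCQ_TABLE = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x0C, 0x00): "WRITE ERROR (not defined for block devices - check the vendor's table)",
}

def decode_kcq(key, asc, ascq):
    meaning = KCQ_TABLE.get((asc, ascq), "not in this mini table - look it up at T10 or ask the vendor")
    return "Key %02X / ASC %02X / ASCQ %02X: %s" % (key, asc, ascq, meaning)

print(decode_kcq(0x06, 0x29, 0x00))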
Ah, okay. That's it?
No! You see the orange box? This is the time since this sense data was received. The unit is 10ms, so this number actually represents a long time since there really was a POR for this controller.
So why do we have a 1370 today?
The 1370 is more of a container for sense data. The number behind the attributes shows the "slot". So the information visible here is for the first slot, and since such a long time has passed since it occurred, it's meaningless for us now. Let's scroll down a bit:
In the second slot you see what's really going wrong within the external storage device at the moment, because the time value is 0. That means the 1370 was triggered because of it. And it contains a different set of sense data. ASC 0C / ASCQ 00! If you try to look it up in the list, you will find 0C/00, but hey - this cannot be! The combination 0C/00 means "WRITE ERROR", but it's not defined for "Direct Access Block Devices" like storage arrays.
A Dead End?
No, of course not. In this example the storage is a DS4000. Just download the DS4000 Problem Determination Guide and it will provide an ASC/ASCQ table. Here you'll see that 0C 00, together with the Sense Key 06 (the red circle) means "Caching Disabled - Data caching has been disabled due to loss of mirroring capability or low battery capacity."
Running without the cache in the backend storage could lead to severe performance degradation and should definitely be troubleshooted! Without even looking into the backend storage you already know what's going wrong there! No need to involve SVC or V7000 support this time. Just focus on the backend storage and find out why the caching is disabled.
So please don't shoot this messenger, it just tries to help you!
Update - December 2nd 2013
The SCSI Interface Guide for IBM FlashSystem can be found here.
Time for another piece of my little series! This time I'd like to write about a new feature in v7.0x especially for administrators and support personnel: the Frame Log. Maybe it's a bit early to write about it, because it seems to be a feature "in development" at the moment, but I waited for it so long that I'm just not able to resist. I think and hope Brocade will develop it further, like the bottleneckmon - which I was very sceptical about in its first version when it was released in the v6.3 code. After seeing its functionality being extended in v6.4 and even more in v7.0, the bottleneckmon is an absolute must-have.
Hmm... maybe I should write an article about bottleneckmon, too :o)
Back to the Frame Log. So what's that?
Basically it is a list of frame discards. There are several reasons why a switch would have to drop a frame instead of delivering it to the destination device. One of them is a timeout. If a frame sticks in the ASIC (the "brain" behind the port) for half a second, the switch has to assume that something's going wrong and that the frame cannot be delivered in time anymore. Then it drops it. Until FabricOS v7.0 it just increased a counter by one. Since later v6.2x versions the drop was at least logged against the TX port (the direction towards the reason for the drop) - in earlier versions the counter increased only for the origin port, which made no sense at all. But now we even have a log for it! A log to store all the frames the switch had to discard. While that sounds a bit like rummaging through the switch's trash bin, the Frame Log is very useful for troubleshooting. It contains the exact time, the TX and the RX port (keep in mind the TX is the important one) and even information from the frame itself. In the summary view you see the fibre channel addresses of the source device (SID) and of the destination device (DID).
For example to see the two most recent frame discards in summary mode, just type:
B48P16G:admin> framelog --show -mode summary -n 2
Fri Sep 23 16:07:13 CET 2011
Log                TX    RX
timestamp          port  port  SID       DID       SFID  DFID  Type     Count
Sep 29 16:02:08    7     5     0x040500  0x013300  1     1     timeout  1
Sep 29 16:04:51    7     1     0x030900  0x013000  1     1     timeout  1
In the so-called "dump mode" you even see the first 64 bytes of each frame. Usually I have to bring an XGIG tracer onsite to catch such information, and often it's not even possible to catch it then, because an XGIG can only trace what's going through the fibre. So you'll only see this frame if you trace a link it crosses before it is dropped. And even then you can't trigger (= stop) the tracer directly on this event; you have to have it looking for a so-called ABTS (abort sequence). If a frame is dropped, the command will time out in the initiator and it will send this ABTS. Depending on what frame exactly was dropped and in what direction, the ABTS could appear on the link several minutes after the actual drop of the frame. Imagine a READ command being dropped. The error recovery will start after the SCSI timeout, which could be e.g. 2 minutes. But 2 minutes is a long time in an FC trace. Chances are good that the tracer misses it then.
Not so with the Frame Log!
The Frame Log can tell you exactly which frame was dropped. If you try to find out whether a particular I/O timeout in your host was caused by a timeout discard in the fabric, this is your way to go. If you see your storage array complaining about aborts for certain sequences, just look them up in the Frame Log. With this feature Brocade finally catches up with Cisco and their internal tracing capabilities - and Brocade does it in a way more comfortable for the admin. The logging of discarded frames is enabled by default and it works on all 8G and 16G platform switches without any additional license.
The big "BUTs"
As I mentioned at the beginning of this article, there are still things for Brocade to work on to turn the Frame Log into a must-have tool like the bottleneckmon. The first catch is its volatility. In the current version it can only keep 50 frames per second on a per-ASIC basis and for 20 minutes in total. At the moment I personally think that's too short. But I'll wait for the first cases where I can use it before I form a final opinion about this limit.
The other - more concerning - constraint is that it only works for discards due to timeout at the moment. So if a frame is dropped for one of all the other possible reasons, it won't be visible in the Frame Log in its current implementation. But that's exactly what I need! If the switch discards a frame because of a zone mismatch or because the destination switch was not reachable or because the target device was temporarily offline or whatever - I want to see that. If a server is misconfigured (uses wrong addresses) and so cannot reach its targets, you'd see the reason right there in the Frame Log - no tracing needed! There are plenty of other situations that would be covered by such a functionality. So I honestly hope that there is a developer with a concept like this in his drawer or even already working on its implementation. Allow me to assure you that there is at least one support guy waiting for it...
The picture is from Zsuzsanna Kilian. Thank you!
Brocade recently released its 16G platform switches and along with them a new major version of FabricOS: FOS 7.0. Besides the new features customers' admins, architects or end users might be interested in, I see some nice enhancements and new tools for us support people, too. In the next blog posts I would like to present some of them and show how to use them, why they are important and where they apply.
The first one I want to write about is the D-Port or Diagnostics Port. This is a special mode every port on Brocade's 16G platform can be configured to.
Why should I use it?
Imagine a two-fabric setup, both fabrics spread over two locations, connected via some trunked ISLs through a DWDM. Every once in a while I get a case like this where there was a problem with one of these ISLs. Usually the end users report major performance problems, there might even be crashes of hosts. The SAN admin looks into his switches, the server admins look for messages against their HBAs, and quickly they notice that the problem seems to be in one fabric only. Having a redundant second fabric available, the decision is made: "Let's block the ISLs in the affected fabric." The workaround is effective, the situation calms down, the business impact disappears. But of course there is no redundancy anymore and the next step is to find out what happened so it can subsequently be resolved.
So a problem case is opened at the technical support. The first request from the support people will be to gather a supportsave. Often they even request to clear the counters and wait some time before gathering the data.
But it's useless now!
Of course it's most important to stop any business impact by implementing a workaround as quickly as possible, but if I get a data collection like this, it's like being asked to cure a disease on the basis of a photo of an already dead person. Usually no customer will allow re-enabling the ISLs before the cause of the problem is found and solved. Welcome to a recursive nightmare! :o)
That's where D-Ports come into play
Having Diagnostics ports on both sides of the link will allow you to test a connection between two switches without having a working ISL. This means there will be no user traffic and also no fabric management over this link and so there will be no impact at all. From a fabric perspective, the ISL is still blocked. It comes with several automatic tests:
- Electrical loopback - (only with 16G SFP+) tests the ASIC to SFP connection locally
- Optical loopback - (with 16G SFP+ and 10G SFP+) tests the whole connection physically.
- Link traffic test - (with 16G SFP+ and 10G SFP+) does latency and cable length calculation and stress test
So this can even help you to determine the right setup for your long distance connection!
How to do it?
Although it's very easy to set this up in Network Advisor (only supported with 16G SFP+), as a support member I prefer stuff to be done via CLI, because then I can see it in the CLI history. (By the way, a real accounting or audit log covering both CLI and GUI actions would be very useful. I'm looking at you, Brocade!) First you should know which are the corresponding ports on the two switches. (The Network Advisor would do that for you.) Then you disable them on both sides using:
portdisable port
Once disabled you can configure the D-Port:
portcfgdport --enable port
And finally enable it again using:
portenable port
Of course you would do that on both sides. There's a separate command to view the results then:
B6510_1:admin> portdporttest --show 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Test                 Start time  Result  EST(secs)  Comments
Electrical loopback  02:58:05    PASSED  --         ----------
Optical loopback     02:58:11    PASSED  --         ----------
Link traffic test    02:58:18    PASSED  --         ----------
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
If you see the test failing, you have your culprit and based on which one is failing actions can be defined to resolve the problem. Your IBM support will of course help you with that! :o)
So if you face similar problems and you are already using 16G switches with 16G SFP+ installed, feel free to implement a workaround like blocking the ISLs to lower the impact. The D-Port will help to find out the reasons afterwards.
But if you are still on 4G or 8G hardware and you want to disable the most probably guilty ports, then please PLEASE get me a supportsave first!
Better: Clear the counters, wait 10 minutes and then gather a supportsave before you disable the ports. And even better than that: Clear counters periodically as described here.
HDS' Hu Yoshida wrote up an interesting theory on his blog. Basically he says that while modular dual-controller storage arrays might be useful for traditional physical server deployments, virtualized servers would need enterprise storage arrays. (Which, interestingly, he defines as having "multiple processors that share a global cache".)
I wrote a small reply as a comment which still awaits moderation. Up to now Hu has usually published my few comments on his blog - regardless of how critical they were. I don't know why it didn't happen this time, but I think the most reasonable answer is that everybody at HDS is very busy with the BlueArc acquisition. So meanwhile I publish it here :o)
interesting read. IMHO there’s much truth in your quote “Virtual servers can be like a drug” and I think you are also right with your observation about Tier 1 applications being virtualized. From a support perspective this could lead to bad nightmares. But to be honest, I don’t get why the storage system should be the limiting factor here. The number of servers (in terms of OSes running) doesn’t change in your picture and neither does the total workload towards the storage array. They were physical servers before, now they are virtual servers (VMs) on a few physical ones. In my eyes the requirements regarding the storage environment don’t change much, but of course you have to check carefully whether your physical servers with their SAN connectivity could turn into a bottleneck themselves, as I pointed out in my latest blog post (http://ibm.co/mY5PnH).
Additionally, just a minor thing with the dual-controller arrays: Why should the outage of the remaining controllers lead to data loss? Usually the write cache of such arrays will be disabled if one controller is down, because it can’t be mirrored anymore. On one hand this means decreased performance during such maintenance, but on the other hand this means that the host gets the SCSI good status only if the I/O is really written to disk. So, there should be access loss of course, but no data loss.
If you have a different - or a similar - opinion, feel free to leave a comment here :o)
There is an interesting discussion ongoing in the LinkedIn group The Storage Group. The question is "What is the REAL cost of Fibre Channel?". To my surprise the participants in this discussion relatively quickly came to the conclusion that the problem is over-provisioning or rather under-utilization. My personal opinion was:
"I would like to come back to the over-provision / under-utilization part. Being a tech support guy, I think a bit different about that. State of the art is 16G FC now but of course I see the majority of customers being on 8G or even 4G. Eventually they will move to higher speeds. Not because all of them really need the higher speed, but it's just the switches and HBAs in sales and marketing at the moment. The "speed race" is driven mostly by the vendors and the customers who really need that line rate. But is it bad for the others? I don't think so. A 16G switch is not really 2x the price of a 8G switch or 4x the price of a 4G. In fact I see the prices sinking on a per port base with increasing functionality on the other hand. And then you stand there with your host X. It has a demand for let's say 200MB/s in total and you connected it to 2 redundant fabrics running with 8G, 1 port per fabric.
That makes: 200MB demand versus 1600MB available. WOW! YOU ARE TOTALLY UNDER-UTILIZED! Shame on you!
Well not really. Actually it's good to have redundancy. You know that. First of all "real" redundancy means you are at least 50% under-utilized per se. Plus the higher line rate that made no difference in the price compared to the lower line rate. That means it is normal that you end up over-provisioned and under-utilized.
In fact things start to get ugly if you really use all your links near 100%. I start to see that scenario more often recently when customers put VMs on ESX hosts without really knowing their I/O demand. Many of them work till the next outage (SFPs _WILL_ break some day, a software bug could crash a switch, etc) and then you see that you have no real redundancy, because you utilize your links too high.
On the other hand many of these ESX hosts with many VMs doing different unknown workloads tend to turn into slow drain devices as soon as I/O peaks of certain VMs come together at the same time. Then at the latest you notice that under-utilization of a network is not really a bad thing :o)"
Especially the ESX hosts turning into slow drain devices bug me most these days. Nobody really seems to know the demand of their VMs and the internal statistics of the ESX seem to be very limited for that matter. If you look at the port of a slow drain device, it will most probably still look under-utilized from a bandwidth perspective, because the missing buffers plus the error recovery keep the plain MB/s numbers down. But in fact the port is exhaustively saturated then. And in addition the eventually dropped frames in the SAN lead to timeouts within the slow-draining host as well. In the end it looks like: "My ESX is far away from utilizing its link completely but the SAN is bad! We have timeouts!".
So what's the demand?
Some customers have the luxury (should this really be considered a luxury?) of having a VirtualWisdom probe installed to constantly monitor the exact performance values in real time. Archie Hendryx shows some of the things you could see there in practice in his whitepaper "Destroying the Myths surrounding Fibre Channel SAN". But if you don't have such gear and you don't know the demand, it might be worth having an additional ESX host for testing. It doesn't have to be the biggest machine, don't worry. Every day you would take another candidate out of your bulk of VMs with unknown I/O bandwidth (or CPU / memory / etc.) demand and put it on that test server with vMotion. Being relatively unimpaired by the other VMs (at least within the ESX), you can measure all the performance values for 24 hours and - provided no error recovery or external congestion takes place - these are the real demands of that VM. And only based on these demands do you really know which VMs are allowed to come together on the same bare metal. Only then will you have a chance to actually improve the under-utilization in a controlled manner without slamming your SAN into the realms of chaos. The approach seems very simple and straightforward to me, but I see nobody doing this. So what's my error in reasoning, dear reader?
(Thanks to Harout S Hedeshian for the picture.)
Recently I attended a presentation about IBM's cloud computing approaches by IBM Fellow Stefan Pappe. Cloud computing is a big topic in IT nowadays - no doubt about that - but how much impact does it have on SAN troubleshooting? Will the way hardware support is performed change in the cloud? Based on your understanding of the term cloud you might either say yes or no. In a cloud, IT is just a commodity like water or electrical power. You just use it. You most likely don't want to know how it works as long as its availability is guaranteed. If a component of a server breaks, the whole construct relies on redundancy. Either within the server (multiple paths etc.) or within a pool of servers where the VMs residing on this particular piece of metal are concurrently moved to other servers. This frees up the broken one for maintenance later on.
For a SAN it's quite similar - we rely on internal redundancy (multiple power supplies, failover-able control processors and backlink modules) as well as external redundancy (second independent fabric, multiple paths, multiple ISLs), with one important exception: some SAN-related problems have to be troubleshooted "on the open heart". Please don't get me wrong. I don't mean that finding a good workaround isn't important - it surely is, and in most scenarios it's a key element for business continuity. But if the symptoms can't be seen anymore, it might be hard for the support member to do the problem determination.
So what now?
Most of these "workarounded" problems can still be troubleshooted if the SAN is well prepared. Especially part 2 of my How to be prepared blog post can help you with that topic. In addition, please gather a data collection from each and every component in the SAN that is related to the problem before you implement any workaround! For the SAN switches that means: if you have performance problems, for example, please gather a data collection of all SAN switches.
For other problems it might be necessary to actually test the repaired component / modified configuration / improvement in the code in the production environment to know if it really helped. Of course all the possible tests that can be done "offline" should be done first. For example, before bringing a formerly toggling ISL back to life, it's better to use the built-in port test capabilities of the switches with loopback plugs.
And as another exception compared with server redundancy: SAN troubleshooting should not be postponed by collecting "workarounded" problems for a certain time and solving them all at once later.
- In most cases redundancy in the SAN means you have two of a kind. Not five or eight or hundreds. So if the core of fabric A fails, it has to be repaired as soon as possible, because a failure of the core in fabric B would then lead to a full outage.
- Different concurrent SAN problems can overlay and create much bigger problems or at least ambiguous symptoms that are much harder to troubleshoot. "Double errors" or "triple errors" are among the worst things to troubleshoot.
- SAN environments are complex structures with lots of hardware and software. There are many things that could lead to the situation that redundancy cannot be utilized properly such as bugs in multipath drivers, wrong configurations or underestimation of the workload on the redundant paths and components during a problem situation.
So if it can be done now, do it now!
Besides that, there are special requirements of the cloud such as the ability for multi-tenancy on the SAN components. Cisco has had its VSANs for a long time now, but when it comes to IVR (Inter-VSAN Routing) I sometimes see very strange configurations out there, based on a wrong understanding of the concept. Brocade's first attempt in that direction were the "Administrative Domains", which came with some very concerning flaws in my opinion. With the v6.2x code stream this concept was virtually replaced by the "Virtual Fabrics" concept. With "base switches", "XISLs" & co, many new possibilities for mis-configurations appeared. Much new stuff to learn for customers, admins, architects and of course support members.
To sum up, I can say that if SAN troubleshooting was done properly before, there won't be much change here. But the cloud boosts the expectations of the users regarding their SAN even more: it should just work! No downtime of the application, ever! Our primary goal is to deal with upcoming problems in a way that prevents any impact on the applications.
Because in the future zero downtime will be no highend enterprise feature anymore but a commodity.
If you use a SAN Volume Controller, it usually is the linchpin of your SAN. Except for the FICON and tape related stuff, everything is connected to it. It is the single host for all your storage arrays and the single storage for all your host systems. Because of this crucial role the SVC has some special requirements regarding your SAN design. The rules can be found in the manuals or in the SVC infocenter (just search for "SAN fabric"). One of these rules is: "In dual-core designs, zoning must be used to prevent the SAN Volume Controller from using paths that cross between the two core switches."
I made this sketch to illustrate that. As you see it's not a complete fabric, but just the devices I want to write about. Sorry for the poor quality, my sketching-kungfu is a bit outdated :o)
This is just one of two fabrics. Both SVC nodes are connected to both core switches. The edge switch is connected to both core switches, and besides the SVC business let's assume there is a host connected to the edge switch using a tape library connected to the cores. There would be other edge switches, more hosts and of course storage arrays as well. Now the rule says that the SVC node ports are only allowed to see each other locally - therefore on the same switch.
So why is that so?
Of course you could say that this is the support statement and if you want to use a SAN Volume Controller you just have to stick to it. But from time to time I see customers with dual-core fabrics who don't follow that rule. Of course, initially, when the SVC was integrated into the fabric, the rule was followed, because it was most probably done by a business partner or an IBM architect according to the rules and best practices. But later - after months or years, maybe even with a new SAN admin - new hosts were put into the fabric, and with an initiator-based zoning approach each adapter was zoned to all its SVC ports in the fabric. Et voilà! The rule is violated. The SVC node ports see each other over the edge switch again and the inter-node traffic passes 2 ISLs instead of none.
What is inter-node communication?
Besides the mirroring of the write cache within an I/O group, there is a mechanism to keep the cluster state alive. It includes a so-called lease which passes through all nodes of a cluster (up to 8 nodes in 4 I/O groups) within a certain time to ensure that communication is possible. These lease cycles start again and again and they even overlap, so if one lease is dropped somehow and the next cycle finishes in time, everything is still fine. The lease frames will be passed from node to node within the cluster several times. But if there are severe problems in the SAN, the cluster has to trigger the necessary actions to keep the majority of the nodes alive. Such an action would be to warm-start the least responsive node or subset of nodes. You will read "Lease Expiry" in your error log. In a worst-case scenario where the traffic is impacted so heavily that inter-node communication is not possible at all, it might happen that all nodes do a reboot, and if the impact stays in the SAN they might do that again and won't be able to serve the hosts.
The result - BIG TROUBLE!
Just as a small disclaimer to prevent FUD (Fear, Uncertainty and Doubt): this is not a design weakness of the SVC or anything like that. All devices in a SAN are vulnerable to the risk I want to describe. In addition, from all the error handling behavior of the SVC as I know it, the SVC seems to be designed to rather allow an access loss than to allow data corruption. It is still the last resort, but it's better than actually losing data.
Back to the dual-core design. The following sketch just shows that with the wrong zoning, the lease could take the detour over the edge switch instead of going directly from node 1 to node 2 via core 1 or core 2. It would pass 2 ISLs.
Why should I care?
There are several technical reasons why ISLs should be avoided for that kind of traffic, but from a SAN support point of view I consider this one the most important: slow drain devices! Imagine one day the host acts as a slow drain device for whatever reason. The tape would send its frames to the host passing the cores and the edge switch. As the host is not able to cope with the incoming frames now, it would not free up its internal buffers in a timely manner and would not send permissions to send more frames (R_RDYs) to the switch quickly enough. The frames pile up in the edge switch and congest its buffers. The congestion back-pressures to the cores and finally to the tape drive. As the frames wait within the ASICs, some of them will eventually hit the ASIC hold time of 500ms and get dropped. This causes error recovery, and depending on the intensity of the slow drain device behavior it could kill the tape job. Bad enough?
But hey! The SVC needs these ISLs!
And that's where it gets ugly. In the sketch above, the ISL between core 1 and the edge switch will become a bottleneck not only for the tape-related traffic but for the SVC inter-node communication as well. It will not only cause performance problems (due to the disturbed write cache mirroring) but could also lead to the situation that the frames of several SVC lease cycles in a row are delayed massively or even dropped, causing lease expiries resulting in node reboots.
That's why keeping an eye on the proper zoning for the SVC is so important and that's the reason for that rule.
Just a short anecdote related to that: some years ago I had a customer with a large cluster where not the drop of leases but the massive delay of them caused the problem. As every single pass of the lease from one node to the next was just barely within the time-out values, the subset of nodes that was really impaired by the congestion saw no reason to back out and reboot. But as the overall time-out for the lease cycles was reached at a certain point in time, the wrong (because healthy) nodes rebooted and the impaired ones were kept alive. Not so good... As far as I know some changes were made in the SVC code later to improve its error handling in such situations, but the rule is as valid as ever:
Avoid inter-node traffic across ISLs!
Two additional topics for my previous post came to my mind and I doubt that they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained with security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think that you should send a fax with the supportshow..."), sometimes it's just economizing at the wrong end. And sometimes there is just no overall plan for SAN management. So I think at least the following things should be in place to enable timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should at least boot from an internal disk rather than out of the SAN.
- A good ssh and telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A TFTP and an FTP server on the host mentioned above. They can be used for supportsaves, config backups, firmware updates etc. They should always be running and, where possible, the devices should be pre-configured to use them (e.g. with supportftp on Brocade switches).
- If it's possible with your security policy, it's helpful to have Wireshark installed on it which could be used for "fcanalyzer" traces in Cisco switches or also to trace the ethernet if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters, undersized internet connections are a [insert politically correct synonym for PITA here :o)]
- Callhome and remote support connection where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually. The support will call you. And most of the SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations callhomes trigger troubleshooting before your users even notice a problem. In addition some machines (like DS8000) allow the support to dial into it and gather the support data directly - and only the support data. Don't worry - your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. With cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have "auto-negotiating" GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
Monitoring, counter resets and automatic DC
Besides any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP traps / syslog server / etc.) there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow drain devices or toggling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need are some scripting skills and a host in the management LAN (with an FTP server) to run scripts from, as mentioned above. A good practice is to look for a good time slot - better not during workload peak times - and set up a timed script (e.g. a cron job) that does the following (see the sketch after this list):
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches, and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type in the whole command as auto-completion won't work. The supportsave is already compressed, but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: use proper names for the Cisco data collections. They should at least contain the switch name, the date and the time!
Depending on the disk space and the number of the switches, it may be good to delete old data collections after a while. For example you could keep one full week of data collections and for older ones only keep one per week as a reference.
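For illustration, here is a rough Python sketch of such a timed job for the Cisco side, assuming the paramiko SSH library and placeholder hostnames and credentials - the timing, error handling and file retention would have to be adapted to your environment:

import time
from datetime import datetime
import paramiko

SWITCHES = ["mds_core_a.example.com", "mds_core_b.example.com"]   # placeholder hostnames
COMMANDS = [
    "terminal length 0",                           # avoid pagewise output
    "show tech-support details",                   # the data collection itself
    "clear counters interface all",                # reset the regular interface counters
    "debug system internal clear-counters all",    # reset the internal ASIC counters (hidden command)
]

def collect(host, user, password):
    """Run the commands in one interactive SSH session and return the combined output."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password, look_for_keys=False)
    shell = client.invoke_shell()
    output = ""
    for cmd in COMMANDS:
        shell.send(cmd + "\n")
        idle = 0
        while idle < 30:                           # read until the switch is quiet for 30 seconds
            if shell.recv_ready():
                output += shell.recv(65535).decode("utf-8", errors="replace")
                idle = 0
            else:
                time.sleep(1)
                idle += 1
    client.close()
    return output

for switch in SWITCHES:
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M")
    # switch name, date and time in the file name, as recommended above
    with open(f"{switch}_showtech_{stamp}.txt", "w") as f:
        f.write(collect(switch, "admin", "PASSWORD"))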
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)
To be honest, the title of this article could also be "How to ease the life of your technical support". But in fact it will ease the life of everyone involved in a problem case, and priority #1 is to solve upcoming problems as quickly as possible.
In the article The EDANT pattern I explained a structured way to transport a problem properly to your SAN support representative. In addition it might be a good idea to prepare the SAN for any upcoming troubleshooting.
The following suggestions are born out of practical experience. They are intended to help you get rid of all the obstacles and showstoppers that could disturb or delay the troubleshooting process right from the start. Please treat them as well-intentioned recommendations, not as pesky "musts". :o)
Synchronize the time
Having the same time on all components in the datacenter is a huge help during problem determination. Most devices today support the NTP protocol. So the best practice is to have an NTP server (plus one or two additional ones for redundancy) in the management LAN and configure all devices (hosts, switches, storage arrays, etc.) to use them. It's not necessary to have the NTP server connected to an atomic clock. The crucial thing is to have a common time base.
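For example (10.1.1.1 stands for your own NTP server; please check the respective command references for your versions): on a Brocade switch the external time server is set with tsclockserver, on a Cisco MDS with the ntp server configuration command:

switch:admin> tsclockserver "10.1.1.1"
switch(config)# ntp server 10.1.1.1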
Have a troubleshooting-friendly SAN layout
What is a troubleshooting-friendly SAN layout? I don't only mean that it's a good idea to always have an up-to-date SAN layout sketch at hand - which is very helpful in any case. What I mean is to have a SAN design that lacks any artificial obscurities. If you have 2 redundant fabrics (yes, there are still environments out there where this is not the case), it's best practice to connect all the devices symmetrically. So if you connect a host to port 23 of a switch in one fabric, please connect its other HBA to port 23 of the counterpart switch in the redundant fabric.
Use proper names
It may sound laughable but bad naming can harm a lot. I think 4 points are important here:
- The naming convention - It may be funny to have server names like "Elmo", "Obi-Wan" or "Klingon" but for troubleshooting it may be better to have some useful info within the name. Something like BC01_Bl12_ESX for example. (for Bladecenter 1, Blade 12, OS is ESX).
- Naming consistency - It's even more important to actually use the same names for the same item. So it's very helpful if for example the host has the same name in the switch's zoning, in the storage array's LUN mapping and on the host itself.
- Unique domain IDs - The domain ID is like the ZIP code of a switch and according to the Fibre Channel rules it has to be unique within a fabric. But in addition to that it is very helpful to keep it unique across fabrics as well. Domain IDs are used to build the fibre channel address of a device port - the address used in each frame. Within the connected devices' error logs (hosts, storages, etc.) these fibre channel addresses are often the only reference to the SAN components. Being able to know at any time which paths over exactly which switch are affected is priceless.
- Brocade: chassisname - As Virtual Fabrics become more and more a standard in Brocade SANs, it's crucial to set the chassisname, because the switchname is bound to the logical switch, not to the box. These chassisnames are used for the naming of the data collections (supportsaves), and if you don't configure them, the device type will be used instead. So you'll most probably end up with a huge collection of supportsave files which differ only in the date. The chassisname can easily be set with the command "chassisname" - see the example after this list. That's one small step for... :o)
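For example (the name itself is just a placeholder - pick something that identifies the physical box):

switch:admin> chassisname DC1_Core_A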
Use a change management
I can't emphasize this enough: please use a change management. Even for the smallest SAN environment, where you would think "Nah! That's my little SAN, I can keep all the stuff in my head." Even for the biggest SAN environment, where you would think "Nah! Too many people from too many departments are involved here. The SAN is living and evolving every day." Besides any internal policy and external requirement (mandatory change management methods for several industries), a proper change management also helps in the troubleshooting process. If you can come up with a complete time plan of all actions done in the SAN and the assertion that no unplanned maintenance actions are done in the SAN during the problem determination, you will have a very happy SAN support member :o)
Backup your configuration
Bad things can happen every day. Things that wipe parts or all of your switches' configuration, or even worse, turn them into useless doorstoppers. It's not likely to happen, but if and when it happens you'd better be prepared. To be up and running again as soon as possible, you should not only back up your user data but also your configurations on a regular basis. For Brocade switches use "configupload" and for Cisco switches copy the running-config to an external server. The SAN Volume Controller (SVC) and the Storwize V7000 have options to back up the configuration in their GUI as well. Besides that, it helps a lot to also store all your license information for your switches at a well-known place. At least for the SAN switches, IBM cannot generate licenses and there's also no "emergency stock" of licenses. The support would have to open a ticket at the manufacturer and clarify the license issue with them. This might cost precious time in problem situations.
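As a hedged sketch (server address, user and paths are placeholders; the exact syntax depends on your FOS and NX-OS versions, so check the command references):

switch:admin> configupload -ftp 10.1.1.5,ftpuser,/backup/switch1_config.txt,ftppassword
Switch# copy running-config ftp://ftpuser@10.1.1.5/backup/switch1_running.cfg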
Keep your firmware up to date
This advice often smacks of a shot from the hip, something like "Did you reboot your PC?" in PC tech support. But to be fair, it's not just the SAN support member's blanket mantra. No software is absolutely bug free, and because of that there are patches or - for the SAN topic - more likely maintenance releases. Often there are parallel code streams: newer ones with more features but with a higher risk of new bugs, and on the other hand older ones with a long history of fixed defects and a "comfortable" level of stability, but most probably already with an "End of Availability" in sight. And between these two extremes are the mature codes like the v6.3x code stream for Brocade switches. It doesn't have the latest features, but a good amount of "installed hours" all over the world. It is still fully supported, so if you really ran into a new bug, Brocade would write a fix for it. It's essentially the same for Cisco and for our virtualization products.
So it's up to you. If you want the new features, you have to use the latest code. If you don't need them at the moment, the latest version of a mature code stream might be better for you. Of course you have to align these considerations with the recommended or required versions of the connected devices, as some really require a specific version. A best practice is to update the switches and, if possible, also all devices proactively twice a year - besides any additional recommended updates due to problem cases where a particular bug has to be fixed. (The commands below show how to quickly check what is currently running.) If you need support with all the planning and doing, please contact your local IBM sales rep for an offering called Total Microcode Support. These guys will check the SAN environment including the attached devices for their firmware levels and will come up with a consistent list of recommended versions which should be compatible and cross-checked. Another view on the topic comes from Australian IBMer Anthony Vandewerdt in his Aussie Storage Blog.
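To see what is currently installed, the standard version commands are enough (no special assumptions, just the usual CLI access):
Brocade:
switch:admin> firmwareshow
Cisco:
Switch# show version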
Think about your features
Speaking about code updates and features, it's of course a good idea to actually read the release notes. They contain crucial information about the version and should also explain new features. The crux of the matter is that there could be new features that you actually do not need, and some of them will be enabled by default. One of these examples is the Brocade feature "Quality of Service" (short: QoS). In simple terms it will "partition" the ISLs to grant high-priority traffic some kind of "right of way" over medium- or low-priority traffic. Buffer-to-buffer credits are reserved for the different priority levels to enable this. But to really use it, you actually have to decide which traffic falls into which category. You do this with so-called QoS zones (see the example below). If you don't configure the zones but leave QoS enabled, all the traffic is categorized as medium priority and you don't use the reserved resources for the high and the low priority. In times of high workload this might end up in an artificial bottleneck resulting in frame drops, error recovery and performance problems. This is only one example that shows that it's better to be aware which additional features are activated and whether you really need them.
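A minimal sketch of the Brocade side, assuming FOS CLI access; the zone name, the WWNs and the port number are made-up examples, and the exact portcfgqos syntax may differ between FOS releases:
switch:admin> zonecreate "QOSH_Payroll_Prod", "10:00:00:05:1e:aa:bb:cc; 50:05:07:68:01:40:12:34"
(QoS zones are ordinary zones whose names start with QOSH_ or QOSL_ for high or low priority.)
switch:admin> portcfgqos --disable 2/15
(If you decide you don't need QoS on a particular port, it can be switched off there.)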
Know the support pages
IBM, like other vendors, has a comprehensive "Support" section on its website. It offers loads of information, manuals, links to code downloads, technotes and flashes. It's also possible to open and track a support case there via the web. With all the stuff on these pages and all the products IBM offers support for, you might get a bit lost. Our "IBM Electronic Support" team (@ibm_eSupport) is constantly optimizing these pages, but hint number one is: register for an account and set these pages up the way you like them. Then you have your products at hand and you find all related information easily. And if you have some spare time (do you ever?), just have a look around on the support pages. There might be useful hints or important flashes concerning your IBM products.
As always this "list" isn't exhaustive and you probably did additional things to be prepared for problem determination. Feel free to share them in the comments below. Thank you!
One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow draining device). Why is it so ugly? Well, if a full fabric or a full data center goes down - due to a fire for example - it's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. A performance degradation due to a slow drain device, however, is not so obvious - at least not for most hosts, operators or automatic failover mechanisms. Frames will be dropped randomly, paths fail but with the next TUR (Test Unit Ready) they seem to work again, just to fail again minutes later. Error recovery will hit the performance, and the worst thing: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN, you might have found it by using the bottleneckmon or you noticed frame discards due to timeout on the TX side of a device port (a short detection sketch follows below). If you have a Cisco SAN, you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. Nevertheless, let's imagine the culprit of a fabric-wide congestion is already identified. But what now?
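A minimal detection sketch for the Brocade side, assuming a FOS release that ships the bottleneck monitor; the port number is a made-up example and the available options vary between releases:
switch:admin> bottleneckmon --enable -alert
switch:admin> bottleneckmon --status
switch:admin> bottleneckmon --show 2/13
On the Cisco side the per-port credit counters in "show interface fc1/13 counters" (again, the port is just an example) give a first hint whether transmit credits run dry.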
The following checklist should help you think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list to be exhaustive, and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device (a quick host-side sketch follows below):
- Is this the latest supported HBA firmware?
- Are the drivers / filesets up-to-date and matching?
- Is there a newer multipath driver out there?
- Check the release notes of all available firmware / driver versions for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you found a bugfix in a newer and supported version, testing it is worth a try.
- If you found a bugfix in a newer but unsupported version, get in contact with the support of the connected device to get it supported or to find out when it will be supported.
Check the configuration:
- Is it configured according to available best practices? (For IBM products, often a Redbook is available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better have them at the same line rate.
- Queue depth - would decreasing it to have fewer concurrent I/Os help?
- Is the load balanced over the available paths? Check your multipath policies!
- Check the number of buffers. Can it be modified? (The direction depends on the type of the problem.)
Check the workload:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better separate them.
- Too much workload at the same time? Jobs starting concurrently? Better distribute them over time.
Check the concept:
- Multi-type virtualized traffic over the same HBA? Does one VM with tape access share a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
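A quick host-side sketch for the firmware and driver checks, assuming an AIX host with SDDPCM and a Linux host; device names like fcs0 are examples and the exact output depends on the HBA and driver:
AIX:
# lscfg -vl fcs0
(The Z fields of the output contain the adapter microcode level.)
# lslpp -l | grep -i sddpcm
(Shows the installed SDDPCM fileset level.)
Linux:
# cat /sys/class/fc_host/host0/symbolic_name
(For most FC HBAs this string contains the firmware and driver versions.)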
Check the logs of this device for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical errors. If you have bit errors on the link, the switch may miss the R_RDY primitives (responsible for increasing the sender's buffer-to-buffer credit counter again after the recipient has processed a frame and freed up a buffer). A switch-side sketch follows right after this checklist.
Use granular zoning (initiator-based zoning, better 1:1 zones) to minimize the impact of RSCNs. (A device that has to query the name server again and again has less time to process frames.)
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (Fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (Fan-in problem)
- Is it possible to have more HBAs / FC adapters? On other busses maybe?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain device behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storages may have similar features.
- Brocade features like Traffic Isolation Zones, QoS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: Use port fencing or an automated script to kick marauding ports out of the SAN.
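As referenced above, a minimal switch-side sketch for checking physical errors and credit starvation; port numbers are made-up examples and counter names can vary with the code level:
Brocade:
switch:admin> porterrshow
(Per-port table of CRC errors, encoding errors, link failures, class 3 discards, etc.)
switch:admin> portstatsshow 2/13
(Look for er_crc, er_enc_out and tim_txcrd_z - the time the port had zero TX credits.)
Cisco:
Switch# show interface fc1/13 counters
Switch# show interface fc1/13 transceiver details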
The list above is just a collection of things I have already seen in problem cases. Having said this, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinion and in more reasons or ways to deal with them!
First of all: the following blog post is about some SAN extension considerations related to Brocade SAN switches. The described problems may affect other vendors as well but will not be discussed here. The post also does not cover all subtopics and considerations but describes one specific problem.
There are a lot of different SAN extensions out there in the field, and Brocade supports a considerable proportion of them. You can see them in the Brocade Compatibility Matrix in the "Network Solutions" section. As offsite replication is one of the key items of a good DR solution, I see many environments spread over multiple locations. If the data centers are near enough to avoid slower WAN connections, usually multiplexers like CWDM, TDM or DWDM solutions are used to bring several connections onto one long-distance link.
From a SAN perspective these multiplexers are transparent or non-transparent. Transparent in this context means that:
- They don't appear as a device or switch in the fabric.
- Everything that enters the multiplexer on one site will come out of the (de-)multiplexer on the other site in exactly the same way.
While the first point is true for most of the solutions, the second point is the crux. By "everything" I mean all the traffic: not only the frames, but also the ordered sets. So it should be really the same - bit by bit by bit exactly the same. If the multiplexing solution can only guarantee the transfer of the frames, it is non-transparent.
So how could that be a problem?
In most cases the long-distance connection is an ISL (Inter-Switch Link). An ISL does not only transport "user frames" (SCSI over FC frames from actual I/O between an initiator and a target) but also a lot of control primitives (the ordered sets) and administrative communication to maintain the fabric and distribute configuration changes. In addition there are techniques like Virtual Channels or QoS (Quality of Service) to minimize the influence of different I/O types, and techniques to keep the link in a good condition like fillwords for synchronization or Credit Recovery. All these techniques rely on a transparent connection between the switches. If you don't have a transparent multiplexer, you have to ensure that these techniques are disabled, and of course you can't benefit from their advantages. Problems start when you try to use them but your multiplexer doesn't meet the prerequisites.
What can happen?
Credit Recovery - which allows the switches to exchange information about the used buffer-to-buffer credits and offers the possibility to react to credit loss - cannot work if IDLEs are used as fillwords. The switches use several different fillwords (ARB-based ones) to talk about their states. If the multiplexer cuts all the fillwords and just inserts IDLEs on the other site (some TDMs do that), or if the link is configured to use IDLEs, the link will start toggling, with most likely disastrous impact on the I/O in the whole fabric.
Another problem is less obvious. I mentioned Virtual Channels (VC) before. The ISL is logically split - of course not the fibre itself, the frames still pass it one by one - but the buffer management establishes several VCs. Each of them has its own buffer-to-buffer credits. There are VCs solely used for administrative communication, like VC0 for Class_F (Fabric Class) traffic. Then there are several VCs dedicated to "user traffic". Which VC is used by a certain frame is determined by the destination address in its header; a modulo operation calculates the correct VC. The advantage of that is that a slow draining device does not completely block an ISL just because no credits are sent back to enable the switch to send the next frame over to the other side. If you have VCs, the credits are sent back as VC_RDYs. If your multiplexer doesn't support that (along with ARB fillwords) because it's not transparent, you can't have VCs, and R_RDYs will be used to send credits back. The result: as you have only one big channel there, Class_F and "user frames" (Class_3 & Class_2) will share the same credits and the switches will prioritize Class_F. If you have much traffic anyway, or many fabric state changes, or even a slow draining device, things will start to become ugly: both types of traffic will interfere, buffer credits drop to zero, traffic gets stalled, frames will be delayed and then dropped (after the 500 ms ASIC hold time). Error recovery will generate more traffic and will have an impact on the applications, visible as timeouts. Multipath drivers will fail over paths, bringing more traffic onto other ISLs passing most probably through the same multiplexer. => Huge performance degradation, lost paths, access losses, big trouble.
You see, using the wrong (or at least "non-optimal") equipment can lead to severe problems. It's even more provoking when the multiplexer in use is in fact transparent but the wrong settings are used on the switches. So if you see such problems or other similar issues and you use a multiplexer on the affected paths, check whether your multiplexer is transparent (with the matrix linked above) and whether you use the correct configuration (refer to the FabOS Admin Guide; a short sketch of the relevant commands follows below). And if you have a non-transparent multiplexer and no possibility to get a transparent one, don't hesitate to contact your IBM sales rep and ask for consultation on how to deal with situations like this (e.g. with traffic shaping / tuning, etc.).
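A minimal sketch of where to look on the Brocade side, assuming an 8 Gbps ISL port; the port number is a made-up example, and the exact syntax and available modes depend on the FOS release, so please verify against the FabOS Admin Guide:
switch:admin> portcfgshow 2/15
(Shows the per-port settings, among them Fill Word, Long Distance mode, QOS E_Port and Credit Recovery.)
switch:admin> portcfgfillword 2/15, 3
(Mode 3 tries the ARB-based fillwords first and falls back to IDLE - only useful if the path really carries them end to end.)
switch:admin> islshow
(Lists the ISLs with their negotiated speed and bandwidth.)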
From time to time (sometimes every day - the support business is a capricious one) I need to see what's really going on in the fibre. For that reason we have a couple of tracers which can be sent to the EMEA countries. Some IBM organizations in some countries even have their own tracers. For the SAN support we use the XGIGs from JDSU (originally from Finisar). Usually I trace if the problem is somehow protocol related and cannot be solved with the RAS packages of the switches and the devices. Or if the RAS information from one device contradicts the other one. Or if every support team (internal and external) points at each other. Or if something totally strange happens and nobody can deal with it. Maybe we trace a little too often, because meanwhile other vendors sometimes say things like "Oh, you also have IBM gear in your environment? Let them trace it!".
So what's this tracing all about?
To put it simply, you connect it in the line and it just records all the traffic. Of course you can filter it and let it trace only the interesting part of the frames. I do not care for the actual data, but the FCP and SCSI header info is precious information. Of course an 8 Gbps link generates a lot of data, too, and the memory is very limited. So you want to be sure to trace exactly what you need - not more, not less. The tracing is done by IBM customer engineers; we make sure to have a suitable number of trained CEs in every region. I have hosted some of the trainings myself and imho it's definitely worth it. The analysis is then done afterwards. I personally like it, because it offers me the possibility not to be "bound" to the RAS packages alone. I can really see what happens.
Although the whole topic is pretty much straightforward, to those unfamiliar with it tracers seem to be mystical devices. Over time I have faced several "urban legends" that sometimes impede troubleshooting a lot:
- "What info? You should see that in the trace!" - Often I get no additional information for a trace (e.g. consisting of 8 trace files from different channels) which slows down the analysis extremely. I need at least a layout where I can see where exactly the tracer was connected. I need to know how it was configured, if the problem really happened during the trace and I need the data collection of the switch and the devices to compare what I see against the RAS packages. Please help me to help you! :o)
- "We can't put this link down. Is it important where to plug in the tracer?" - Yes, of course it is. Like described above, it just records the traffic that enters the tracer. Nothing more. There are no tiny little photon-based nano robots swarming out through the fibres and collecting data. Really. If you plug it somewhere else, I won't see the problem.
- "Thank you so much for introducing a tracer in our environment. It solved the problem. It has to stay." - No, the tracer did not solve the problem by itself. If the problem somehow vanished with cabeling in the tracer, then a simple portdisable/portenable should have helped as well. The tracers are needed frequently and can't stay in the environment till the end of days.
These were just some of the rumors and statements I heard in the past. To summarize it, please keep in mind:
A tracer is not a magical device. It just records traffic.
If you work in technical support, generally speaking your job is to fix what's broken. But working in SAN support is most of the time about solving complex problems. The SAN connects everything with everything in the storage world, and often that's a lot. Oh yes, there are well-planned and "troubleshooting-friendly" environments out there, managed by top-skilled administrators using state-of-the-art tools, while having enough time between daily routine and important projects to spot problems before they even have an impact on the applications. At least I believe that these things exist, but most of the time I don't even see a part of it. There are excellent multi-tenancy capable products out there, maintained by a single part-time admin or an operator some thousand miles away monitoring the environments of a dozen clients. And when there is a problem, this poor guy is called by all the angry people relying on a working IT, up to the C-levels. Then he opens a case at his SAN vendor.
Let's switch to the support guy. He takes the new case and reads: "Massive problem, SCSI error!". Yes, most of the time there is just a statement like this. That's okay for the beginning, because the so-called "Request Receipt Center" just creates cases administratively (OMG, is that even a word in the English language?). The first level of support, the so-called Frontend, will then call you and ask you about the problem. And they (hopefully) will put the information into a pattern called "EDANT" to have it in a structured way and to be able to hand it over (horizontally for shift changes or vertically for escalation) to others. This first call (sometimes 2..n) is crucial, because the most important thing is to actually understand the problem. That sounds trivial, but it's not. In fact the whole problem determination will fail or at least significantly lag if this set of information is not complete or contains false statements.
I know you will be under pressure. I know you have a thousand other things to do. I know some sales guy probably promised you "Our excellent support will solve all problems - if there ever were one - just by hearing the tone of your voice for 1.4 seconds!". But again, enabling the support guy to actually understand your problem is the most important thing, and you can hugely accelerate that process by preparing the information using the EDANT pattern.
So what's this EDANT pattern exactly? I have to admit, we stole it from the software guys. You will notice that by the wording. EDANT means:
E is for Environment. You (hopefully) know your environment, and maybe you have described it to IBMers several times before, maybe an IBM architect even designed it. But to be honest, IBMers don't share a collective consciousness like the Borg :o), and besides, things change. So what's needed is a good description of the environment related to the current problem. This includes among others:
- A layout with the related switches and devices and the ports used to connect them.
- The machine/model information of related switches, hosts, storages, etc
- The firmware/OS/driver levels of all components.
- Time gaps between the components. (Better use NTP; see the sketch after this list.)
- If you use SAN extenders, describe them. Use CWDM/DWDM/TDM? How long? Type? Vendor? Cards? Versions? Transparency? Use FCIP? Bandwidth? Quality?
- Additional specialities: any interop stuff going on? Is this a test SAN? Is this pre-production? Is this designed without redundancy? Stuff like this...
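A minimal NTP sketch, assuming one reachable time server; the IP address is a placeholder:
Brocade:
switch:admin> tsclockserver "10.0.0.10"
Cisco:
Switch# configure terminal
Switch(config)# ntp server 10.0.0.10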
D is for Description. Please describe your problem as precisely and as comprehensively as possible.
- When did it start?
- What did happen?
- Where can you notice it?
- What do the switches report?
- What do the other devices report?
- What was done when the problem happened?
- What is the impact?
And in regard to the environment please ask yourself: Which components are affected? Which components could be affected but are not? What is the difference between them? Questions like these are the key for narrowing down the problem.
A is for Actions Done. Opening a case is most probably not the first thing you do when the phones begin to ring. When a case reaches me, "someone" has already done "something". Maybe you have a plan for situations like this. Maybe someone requests "Do things!". Maybe you switched off "culprit candidates". All this should be documented as accurately as possible. With timestamps! And of course with results. Everything that changed in the environment since the problem occurred is worth mentioning, including counter resets. Do as much as possible from the CLI (Command Line Interface) and use session logging. Precious!
N is for Next Actions. This section is for everything you already plan (maintenance windows, replacements, recovery actions, internal and external deadlines) and for everything you expect from the support. The second point is not trivial either. Of course you want the support to solve the problem. But what is most important? Do you need a workaround first, to get things working again? Do you need an RCA (Root Cause Analysis) the next day? Does the problem have to be solved overnight, and will a contact person be available to provide data and further info? Communicate your expectations to get the right help.
T is for Test Case. Okay, this one is clearly from the software support. It covers the data collections and any additional data and descriptions of them, like the session logs mentioned above. Screenshots, performance data or scripts belong here too. Usually the support offers a way to upload all the stuff. Please be aware that, for example, IBM doesn't keep data collections from cases till the end of days. So if you uploaded something for another, already closed case 6 months ago, it's most probably gone.
Using this pattern to structure the info should avoid any communication-based delays. It may sound like a lot of work in the beginning, but it's definitely worth it.
The following article is not new. I published it 1.5 years ago in an internal IBM blog. So why publish it again externally? In my problem cases of the recent months I found that the principle described there is very common and it most probably won't change in the future. To allow myself to refer to it here in the blog, of course I have to publish it once again. :-)
The majority of the SAN cases that you cannot simply break down into replacing a part because of an error message such as "Part xxx is broken" are complex solution cases. You have symptoms on maybe several hosts against maybe several storages over several paths.
The IBM support structure consists of so-called towers: different teams supporting different products. At higher support levels this is quite important to allow product engineers to develop a deep understanding of their specific product. When it comes to problem determination, it's essential for the different tower teams to work together to find the cause of a problem and how to solve it. It's not enough to just check the "own box" for failures and ask yourself whether it could be the only reason for the problem. If that is all that is done, the result is often that the particular device cannot be the "single point of failure" and the responsibility to find the problem source moves to the next probable team.
It is obvious that such a process to solve a problem is not very efficient. There are several attempts to deal with that from an organizational point of view, like solution support approaches, a project office and "Complex Call Leaders". But even from a purely technical point of view you can see why it is vital:
In complex cases you have at least one trigger and also at least one device that reacts wrongly to it. The trigger alone doesn't represent the source of the visible symptoms. This is often forgotten as soon as the trigger is found and repaired and the symptoms are gone. And more importantly, the trigger is the less harmful problem compared to a bad error recovery. In the future a new minor error (the same one or another) could trigger the same major problem.
To illustrate that, let's use the following "not so complex" example:
A customer has two SAN switches in two different, redundant fabrics. Connected to the switches is a 2-node SVC cluster (with several backend storage subsystems). From each of the two nodes there are two connections into each fabric. There are some Windows hosts with SDDDSM (same level on each host) and System p hosts with SDDPCM (also same level on each host).
Now one SFP that connects one SVC node port to one of the switches is broken. It corrupts frames and transmission words intermittently, which leads to a toggling link. Although everything is zoned granularly, all the System p hosts lose access to their disks.
The customer opens a case with System p support. The frontend sees a message in the AIX error report indicating that an hdisk is not accessible. They involve the SAN support team, which finds the high error counters on the switch port where the broken SFP is connected and advises the customer to replace it with one of his spare SFPs. The situation calms down, the symptoms disappear, the disks are accessible again, the problem is gone. Fine.
But this is not the moment this case can be closed. Of course, the trigger is found and the customer's systems are productive again, but the main problem could easily be disregarded now: the error was handled in a wrong way by the host and its multipathing driver. The multipath driver should have used another available path - another path in the same fabric or even the links in the other fabric that have no problem at all (a quick way to check the path states is sketched below). So the more important problem source is the broken multipath driver, which has to react to the trigger and do the error recovery. With the next broken SFP (please keep in mind that an SFP, as an opto-electrical converter, is a wearing part) the same problem will happen again!
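A minimal sketch for checking the path states with the multipath drivers mentioned in the example; output columns and device names vary with the driver level:
AIX with SDDPCM:
# pcmpath query adapter
# pcmpath query device
Windows with SDDDSM:
C:\> datapath query device
In a healthy, redundant setup all paths should be usable; failed or closed paths point exactly to the error recovery problem described above.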
The lesson learned from this example is that the trigger of an error is not the most important part of the problem and should not be the only goal of the problem determination; the way the devices in a redundant environment react to the trigger is the reason for the impact and can create "artificial" single points of failure. The different tower support teams have to work together until not only the trigger is found but also the parts of the environment that reacted in a wrong way!