HDS' Hu Yoshida posted an interesting theory on his blog. Basically he says that while modular dual-controller storage arrays might be useful for traditional physical server deployments, virtualized servers would need enterprise storage arrays. (Which, interestingly, he defines as having "multiple processors that share a global cache".)
I wrote a small reply as a comment, which still awaits moderation. Up to now Hu has usually published my few comments on his blog - regardless of how critical they were. I don't know why it didn't happen this time, but I think the most reasonable explanation is that everybody at HDS is very busy with the BlueArc acquisition. So in the meantime I'll publish it here :o)
interesting read. IMHO there's much truth in your quote "Virtual servers can be like a drug" and I think you are also right with your observation about Tier 1 applications being virtualized. From a support perspective this could lead to real nightmares. But to be honest, I don't get why the storage system should be the limiting factor here. The number of servers (in terms of OSes running) doesn't change in your picture, and neither does the total workload towards the storage array. They were physical servers before; now they are virtual servers (VMs) on a few physical ones. In my eyes the requirements regarding the storage environment don't change much, but of course you have to check carefully whether your physical servers with their SAN connectivity could turn into a bottleneck themselves, as I pointed out in my latest blog post (http://ibm.co/mY5PnH).
Additionally, just a minor point about the dual-controller arrays: why should the outage of the remaining controller lead to data loss? Usually the write cache of such arrays is disabled when one controller is down, because it can't be mirrored anymore. On the one hand this means decreased performance during such maintenance, but on the other hand it means that the host only gets the SCSI good status once the I/O is really written to disk. So there could be an access loss, of course, but no data loss.
If you have a different - or a similar - opinion, feel free to leave a comment here :o)
There is an interesting discussion going on in the LinkedIn group The Storage Group. The question is "What is the REAL cost of Fibre Channel?". To my surprise the participants in this discussion relatively quickly came to the conclusion that the problem is over-provisioning, or rather under-utilization. My personal opinion was:
"I would like to come back to the over-provision / under-utilization part. Being a tech support guy, I think a bit different about that. State of the art is 16G FC now but of course I see the majority of customers being on 8G or even 4G. Eventually they will move to higher speeds. Not because all of them really need the higher speed, but it's just the switches and HBAs in sales and marketing at the moment. The "speed race" is driven mostly by the vendors and the customers who really need that line rate. But is it bad for the others? I don't think so. A 16G switch is not really 2x the price of a 8G switch or 4x the price of a 4G. In fact I see the prices sinking on a per port base with increasing functionality on the other hand. And then you stand there with your host X. It has a demand for let's say 200MB/s in total and you connected it to 2 redundant fabrics running with 8G, 1 port per fabric.
That makes: 200MB/s demand versus 1600MB/s available. WOW! YOU ARE TOTALLY UNDER-UTILIZED! Shame on you!
Well, not really. Actually it's good to have redundancy. You know that. First of all, "real" redundancy means you are at least 50% under-utilized per se. Add the higher line rate that made almost no difference in price compared to the lower one, and it's simply normal that you end up over-provisioned and under-utilized.
In fact things start to get ugly if you really use all your links near 100%. I have started to see that scenario more often recently when customers put VMs on ESX hosts without really knowing their I/O demand. Many of them work until the next outage (SFPs _WILL_ break some day, a software bug could crash a switch, etc.) and then you see that you have no real redundancy, because your links are utilized too heavily.
On the other hand, many of these ESX hosts with many VMs doing different, unknown workloads tend to turn into slow drain devices as soon as the I/O peaks of certain VMs come together at the same time. At that point, at the latest, you notice that under-utilization of a network is not really a bad thing :o)"
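To spell out the arithmetic behind the numbers in that comment (rough figures only; an 8G FC port carries roughly 800 MB/s of payload per direction):

    # back-of-the-envelope, for one host with one 8G port per fabric
    # 2 ports x ~800 MB/s = ~1600 MB/s available
    # 200 MB/s demand / 1600 MB/s available = 12.5% "utilization"
    # ...and that is before you mentally reserve half of it for a fabric failure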
Especially the ESX hosts turning into slow drain devices bug me the most these days. Nobody really seems to know the demand of their VMs, and the internal statistics of the ESX seem to be very limited for that matter. If you look at a port of a slow drain device, it will most probably still look under-utilized from a bandwidth perspective, because the missing buffers plus the error recovery keep the plain MB/s numbers down. But in fact the port is completely saturated then. And in addition, the frames that eventually get dropped in the SAN lead to timeouts within the slow-draining host as well. In the end it looks like: "My ESX is far away from utilizing its link completely, but the SAN is bad! We have timeouts!".
So what's the demand?
Some customers have the luxury (should this really be considered a luxury?) of having a VirtualWisdom probe installed to constantly monitor the exact performance values in real time. Archie Hendryx shows some of the things you could see there in practice in his whitepaper "Destroying the Myths surrounding Fibre Channel SAN". But if you don't have such gear and you don't know the demand, it might be worth having an additional ESX host for testing. It doesn't have to be the biggest machine, don't worry. Every day you would take another candidate out of your pool of VMs with unknown I/O bandwidth (or CPU / memory / etc.) demand and put it on that test server with vMotion. Being relatively unimpaired by the other VMs (at least within the ESX host), it can then be measured for 24 hours and - provided no error recovery or external congestion takes place - these are the real demands of that VM. Only based on these demands do you really know which VMs are allowed to come together on the same bare metal. Only then will you have a chance to actually improve the under-utilization in a controlled manner without slamming your SAN into the realms of chaos. The approach seems very simple and straightforward to me, but I see nobody doing this. So what's my error in reasoning, dear reader?
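For the measurement part itself, a minimal sketch could look like this (run directly on the test ESX host, or via resxtop from a remote CLI machine; the interval, sample count and output path are just examples):

    # capture one sample every 300 seconds, 288 times = 24 hours, in batch mode
    esxtop -b -d 300 -n 288 > /vmfs/volumes/datastore1/vm_profile.csv
    # the resulting CSV can be loaded into perfmon or a spreadsheet to read off
    # the real bandwidth, IOPS, CPU and memory demand of the VM under test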
(Thanks to Harout S Hedeshian for the picture.)
Recently I attended a presentation about IBM's cloud computing approaches by IBM Fellow Stefan Pappe. Cloud computing is a big topic in IT nowadays - no doubt about that - but how much impact does it have on SAN troubleshooting? Will the way hardware support is performed change in the cloud? Based on your understanding of the term cloud you might either say yes or no. In a cloud, IT is just a commodity like water or electrical power. You just use it. You most likely don't want to know how it works as long as its availability is guaranteed. If a component of a server breaks, the whole construct relies on redundancy. Either within the server (multiple paths etc.) or within a pool of servers, where the VMs residing on this particular piece of metal are moved to other servers on the fly. This frees up the broken one for maintenance later on.
For a SAN it's quite similar - we rely on internal redundancy (multiple power supplies, failover-capable control processors and backlink modules) as well as external redundancy (a second independent fabric, multiple paths, multiple ISLs), with an important exception: some SAN-related problems have to be troubleshot "on the open heart", so to speak. Please don't get me wrong. I don't mean that finding a good workaround isn't important - it surely is, and in most scenarios it's a key element for business continuity. But if the symptoms can no longer be seen, it might be hard for the support member to do the problem determination.
So what now?
Most of these "workarounded" problems can still be troubleshot if the SAN is well prepared. Especially part 2 of my How to be prepared blog post can help you with that topic. In addition, please gather a data collection from each and every component in the SAN that is related to the problem before you implement any workaround! For the SAN switches that means: if you have performance problems, for example, gather a data collection from all SAN switches.
For other problems it might be necessary to actually test the repaired component / modified configuration / improvement in the code in the production environment to know if it really helped. Of course, all the possible tests that can be done "offline" should be done first. For example, before bringing a formerly toggling ISL back to life, it's better to use the built-in port test capabilities of the switches with loopback plugs.
And here is another exception compared with server redundancy: SAN troubleshooting should not be postponed in order to collect "workarounded" problems for a certain time and solve them all at once later.
- In most cases redundancy in the SAN means you have two of a kind. Not five or eight or hundreds. So if the core of fabric A fails, it has to be repaired as soon as possible, because a failure of the core in fabric B would then lead to a full outage.
- Different concurrent SAN problems can overlap and create much bigger problems, or at least ambiguous symptoms that are much harder to troubleshoot. "Double errors" or "triple errors" are among the worst things to troubleshoot.
- SAN environments are complex structures with lots of hardware and software. There are many things that could prevent redundancy from working properly, such as bugs in multipath drivers, wrong configurations or underestimating the workload on the redundant paths and components during a problem situation.
So if it can be done now, do it now!
Besides that, there are special requirements of the cloud such as the ability for multi-tenancy on the SAN components. Cisco has had its VSANs for a long time now, but when it comes to IVR (Inter-VSAN Routing) I sometimes see very strange configurations out there, based on a wrong understanding of the concept. Brocade's first attempt in that direction were the "Administrative Domains", which came with some very concerning flaws in my opinion. With the v6.2x code stream this concept was virtually replaced by the "Virtual Fabrics" concept. With "base switches", "XISLs" & co, many new possibilities for mis-configurations appeared. Much new stuff to learn for customers, admins, architects and of course support members.
To sum up, I can say that if SAN troubleshooting was done properly before, there won't be much change here. But the cloud pushes the users' expectations of their SAN even higher: it should just work! No downtime of the application, ever! Our primary goal is to deal with upcoming problems in a way that prevents any impact on the applications.
Because in the future zero downtime will no longer be a high-end enterprise feature but a commodity.
If you use a SAN Volume Controller it usually is the linchpin of your SAN. Except for the FICON and tape related stuff everything is connected to it. It is the single host for all your storage arrays and the single storage for all your host systems. Because of this crucial role the SVC has some special requirements regarding your SAN design. The rules can be seen in the manuals or in the SVC infocenter (just search for "SAN fabric"). One of these rules is "In dual-core designs, zoning must be used to prevent the SAN Volume Controller from using paths that cross between the two core switches.".
I made this sketch to illustrate that. As you see it's not a complete fabric, but just the devices I want to write about. Sorry for the poor quality, my sketching-kungfu is a bit outdated :o)
This is just one of two fabrics. Both SVC nodes are connected to both core switches. The edge switch is connected to both core switches, and besides the SVC business let's assume there is a host connected to the edge switch using a tape library connected to the cores. There would be other edge switches, more hosts and of course storage arrays as well. Now the rule says that the SVC node ports are only allowed to see each other locally - that is, on the same switch.
So why is that so?
Of course you could say that this is the support statement and if you want to use a SAN Volume Controller you just have to stick to it. But from time to time I see customers with dual-core fabrics who don't follow that rule. Of course, initially, when the SVC was integrated into the fabric, the rule was followed, because the integration was most probably done by a business partner or an IBM architect according to the rules and best practices. But later, after months or years - maybe the SAN admin even changed - new hosts were put into the fabric, and with an initiator-based zoning approach each adapter was zoned to all its SVC ports in the fabric. Et voilà! The rule is violated. The SVC node ports see each other over the edge switch again and the inter-node traffic passes 2 ISLs instead of none.
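To make the rule a bit more tangible, here is how the zoning could look on the Brocade CLI. This is only a sketch with purely hypothetical alias names; the point is that each inter-node zone contains only SVC ports attached to the same core switch, and that host zones are built selectively instead of "every HBA to every SVC port in the fabric":

    # inter-node zones stay local to one core switch each
    zonecreate "SVC_internode_core1", "svc_n1_p1; svc_n2_p1"
    zonecreate "SVC_internode_core2", "svc_n1_p3; svc_n2_p3"
    # host zones only reference the SVC ports intended for host I/O
    zonecreate "host01_hba1__svc_io1", "host01_hba1; svc_n1_p2; svc_n2_p2"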
What is inter-node communication?
Besides the mirroring of the write cache within an I/O group, there is a mechanism to keep the cluster state alive. It includes a so-called lease, which passes through all nodes of a cluster (up to 8 nodes in 4 I/O groups) within a certain time to ensure that communication is possible. These lease cycles start again and again and they even overlap, so if one lease is somehow dropped and the next cycle finishes in time, everything is still fine. The lease frames are passed from node to node within the cluster several times. But if there are severe problems in the SAN, the cluster has to trigger the necessary actions to keep the majority of the nodes alive. Such an action would be to warm-start the least responsive node or subset of nodes. You will read "Lease Expiry" in your error log. In a worst-case scenario, where the traffic is impacted so heavily that inter-node communication is not possible at all, it might happen that all nodes reboot, and if the impact stays in the SAN they might do that again and won't be able to serve the hosts.
The result - BIG TROUBLE!
Just as a small disclaimer to prevent FUD (Fear, Uncertainty and Doubt): this is not a design weakness of the SVC or anything like that. All devices in a SAN are vulnerable to the risk I want to describe. In addition, from all the error handling behavior of the SVC as I know it, the SVC seems to be designed to rather allow an access loss than data corruption. It is still the last resort, but it's better than actually losing data.
Back to the dual-core design. The following sketch just shows that with the wrong zoning, the lease could take the detour over the edge switch instead of going directly from node 1 to node 2 via core 1 or core 2. It would pass 2 ISLs.
Why should I care?
There are several technical reasons why ISLs should be avoided for that kind of traffic, but from a SAN support point of view I consider this one the most important: slow drain devices! Imagine one day the host acts as a slow drain device for whatever reason. The tape would send its frames to the host passing the cores and the edge switch. As the host is not able to cope with the incoming frames now, it would not free up its internal buffers in a timely manner and would not send permission to send more frames (R_RDYs) to the switch quickly enough. The frames pile up in the edge switch and congest its buffers. The congestion back-pressures to the cores and finally to the tape drive. As the frames wait within the ASICs, some of them will eventually hit the ASIC hold time of 500ms and get dropped. This causes error recovery, and depending on the intensity of the slow drain behavior it could kill the tape job. Bad enough?
But hey! The SVC needs these ISLs!
And that's where it gets ugly. In the sketch above, the ISL between core 1 and the edge switch becomes a bottleneck not only for the tape-related traffic but for the SVC inter-node communication as well. This will not only cause performance problems (due to the disturbed write cache mirroring) but could also lead to a situation where the frames of several SVC lease cycles in a row are massively delayed or even dropped, causing lease expiries and resulting in node reboots.
That's why keeping an eye on the proper zoning for the SVC is so important and that's the reason for that rule.
Just as a short anecdote related to that: some years ago I had a customer with a large cluster where not the drop of the leases but their massive delay caused the problem. As every single pass of the lease from one node to the next was just barely within the time-out values, the subset of nodes that was really impaired by the congestion saw no reason to back out and reboot. But when the overall time-out for the lease cycles was reached at a certain point in time, the wrong nodes - the healthy ones - rebooted, and the impaired ones were kept alive. Not so good... As far as I know, some changes were made in the SVC code later to improve its error handling in such situations, but the rule is as valid as ever:
Avoid inter-node traffic across ISLs!
Two additional topics for my previous post came to mind, and I doubt that they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained with security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think you should send the supportshow by fax..."), sometimes it's just economizing on the wrong end. And sometimes there is simply no overall plan for SAN management. So I think at least the following things should be in place to enable timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should at least boot from an internal disk rather than out of the SAN.
- A good ssh and telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A TFTP and an FTP server on the host mentioned above. They can be used for supportsaves, config backups, firmware updates etc. They should always be running, and where possible the devices should be pre-configured to use them (e.g. with supportftp on Brocade switches).
- If it's possible with your security policy, it's helpful to have Wireshark installed on it, which can be used for "fcanalyzer" traces on Cisco switches or to trace the Ethernet side if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters, undersized internet connections are a [insert politically correct synonym for PITA here :o) ]
- Callhome and remote support connection where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually. The support will call you. And most of the SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations callhomes trigger troubleshooting before your users even notice a problem. In addition some machines (like DS8000) allow the support to dial into it and gather the support data directly - and only the support data. Don't worry - your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. With cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have "auto-negotiating" GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
Monitoring, counter resets and automatic DC
Besides any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP traps / syslog server / etc.) there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow drain devices or toggling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need is some scripting skills and a host in the management LAN (with an FTP server) to run scripts from, as mentioned above. A good practice is to look for a good time slot - better not during workload peak times - and set up a timed script (e.g. a cron job) that does the following:
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches, and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type it in completely, as auto-completion won't work. The supportsave is already compressed, but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: Use proper names for the Cisco Data collections. They should at least contain the switchname, the date and the time!
Depending on the disk space and the number of switches, it may be good to delete old data collections after a while. For example, you could keep one full week of data collections and for older ones only keep one per week as a reference.
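Putting the pieces together, here is a minimal sketch of such a cron job for the Brocade part. It assumes passwordless ssh keys for the admin user, supportftp parameters that are already configured on the switches, and hypothetical switch names; a Cisco variant would additionally need an expect-style login to capture the "show tech-support details" output into a text file:

    #!/bin/bash
    # nightly data collection + counter reset, started via cron in a quiet time slot
    SWITCHES="san_core1 san_core2 san_edge1"      # hypothetical switch names
    for SW in $SWITCHES; do
        ssh admin@$SW "supportsave -n"            # non-interactive, uses the supportftp settings
        ssh admin@$SW "slotstatsclear"
        ssh admin@$SW "statsclear"
    done
    # keep one week of collections on the FTP server, thin out everything older
    find /srv/ftp/supportsaves -type f -mtime +7 -delete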
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)
To be honest the title for this article could also be "How to ease the life of your technical support". But in fact it will ease the life of everyone involved in a problem case and the priority #1 is to solve upcoming problems as quickly as possible.
In the article The EDANT pattern I explained a structured way to transport a problem properly to your SAN support representative. In addition it might be a good idea to prepare the SAN for any upcoming troubleshooting.
The following suggestions are born out of practical experience. They are intended to help you get rid of all the obstacles and showstoppers that could disturb or delay the troubleshooting process right from the start. Please treat them as well-intentioned recommendations, not as pesky "musts". :o)
Synchronize the time
Having the same time on all components in the datacenter is a huge help during problem determination. Most devices today support the NTP protocol. So the best practice is to have an NTP server (plus one or two additional ones for redundancy) in the management LAN and configure all devices (hosts, switches, storage arrays, etc.) to use them. It's not necessary to have the NTP server connected to an atomic clock. The crucial thing is to have a common time base.
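As a sketch of how little effort this usually is on the SAN switches (the IP addresses are of course just placeholders; check the command reference of your firmware level):

    # Brocade FOS: semicolon-separated list of NTP servers
    tsclockserver "10.1.1.10;10.1.1.11"
    tsclockserver                 # without arguments: show the configured servers
    # Cisco MDS (NX-OS), in configuration mode
    ntp server 10.1.1.10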
Have a troubleshooting-friendly SAN layout
What is a troubleshooting-friendly SAN layout? I don't only mean that it's a good idea to always have an up-to-date SAN layout sketch at hand - which is very helpful in any case. What I mean is to have a SAN design that lacks any artificial obscurities. If you have 2 redundant fabrics (yes, there are still environments out there where this is not the case), it's best practice to connect all the devices symmetrically. So if you connect a host to port 23 of a switch in one fabric, please connect its other HBA to port 23 of the counterpart switch in the redundant fabric.
Use proper names
It may sound laughable but bad naming can harm a lot. I think 4 points are important here:
- The naming convention - It may be funny to have server names like "Elmo", "Obi-Wan" or "Klingon" but for troubleshooting it may be better to have some useful info within the name. Something like BC01_Bl12_ESX for example. (for Bladecenter 1, Blade 12, OS is ESX).
- Naming consistency - It's even more important to actually use the same names for the same item. So it's very helpful if for example the host has the same name in the switch's zoning, in the storage array's LUN mapping and on the host itself.
- Unique domain IDs - The domain ID is like the ZIP code of a switch and according to the Fibre Channel rules it has to be unique within a fabric. But in addition to that, it is very helpful to keep it unique across fabrics as well. Domain IDs are used to build the Fibre Channel address of a device port - the address used in each frame. Within the error logs of the connected devices (hosts, storages, etc.) these Fibre Channel addresses are often the only information that references the SAN components (see the short example after this list). To be able to know at any time which paths over exactly which switch are affected is priceless.
- Brocade: chassisname - As Virtual Fabrics become more and more a standard in Brocade SANs it's crucial to set the chassisname, because the switchname is bound to the logical switch, not to the box. These chassisnames are used for the naming of the data collections (supportsaves) and if you don't configure them, the device/type will be used instead. So you'll most probably end up with a huge collection of supportsave files which differ only in the date. The chassisname can easily be set with the command "chassisname". That's one small step for... :o)
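A short example of why the unique domain ID pays off: the 24-bit Fibre Channel address you find in host and storage error logs is built from domain, area and port, one byte each. So if the domain IDs are unique, a single hex number immediately tells you which switch is involved:

    # N_Port ID 0x0A1C00 (as it might appear in a host's error log)
    #   0x0A -> domain ID 10   -> the switch
    #   0x1C -> area 28        -> usually the port index on that switch
    #   0x00 -> port / AL_PA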
Use change management
I can't emphasize this enough: please use change management. Even for the smallest SAN environment, where you would think "Nah! That's my little SAN, I can keep all the stuff in my head." Even for the biggest SAN environment, where you would think "Nah! Too many people from too many departments are involved here. The SAN is living and evolving every day." Besides any internal policy and external requirement (mandatory change management methods for several industries), proper change management also helps in the troubleshooting process. If you can come up with a complete time plan of all actions done in the SAN, and the assertion that no unplanned maintenance actions are done in the SAN during the problem determination, you will have a very happy SAN support member :o)
Backup your configuration
Bad things can happen every day. Things that wipe parts or all of your switches' configuration or, even worse, turn them into useless doorstops. It's not likely to happen, but if and when it does, you'd better be prepared. To be up and running again as soon as possible, you should not only back up your user data but also your configurations on a regular basis. For Brocade switches use "configupload" and for Cisco switches copy the running-config to an external server. The SAN Volume Controller (SVC) and the Storwize V7000 have options to back up the configuration in their GUI as well. Besides that, it helps a lot to also store all your license information for your switches in a well-known place. At least for the SAN switches, IBM cannot generate licenses and there's also no "emergency stock" for licenses. The support would have to open a ticket at the manufacturer and clarify the license issue with them. This might cost precious time in problem situations.
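A rough sketch of the commands involved (treat the hosts, users and paths as placeholders and check the respective command help, as the syntax details differ between firmware levels):

    # Brocade FOS: non-interactive configupload to an FTP server
    configupload -all -p ftp 10.1.1.5,ftpuser,/backups/san_core1.cfg,password
    # Cisco NX-OS
    copy running-config tftp://10.1.1.5/mds_core1-running.cfg
    # SVC / Storwize V7000 (CLI counterpart of the GUI option mentioned above)
    svcconfig backup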
Keep your firmware up-to-date
This advice often smacks of a shot from the hip, something like "Did you reboot your PC?" in PC tech support. But to be fair, it's not just the SAN support member's blanket mantra. No software is absolutely bug-free, and because of that there are patches or - for the SAN topic - more likely maintenance releases. Often there are parallel code streams. Newer ones with more features but with a higher risk of new bugs. On the other hand, older ones with a long history of fixed defects and a "comfortable" level of stability, but most probably already with an "End of Availability" in sight. And between these two extremes are the mature code streams, like the v6.3x code stream for Brocade switches. It doesn't have the latest features but a good amount of "installed hours" all over the world. It is still fully supported, so if you really ran into a new bug, Brocade would write a fix for it. It's essentially the same for Cisco and for our virtualization products.
So it's up to you. If you want the new features, you have to use the latest code. If you don't need them at the moment, the latest version of a mature code stream might be better for you. Of course you have to align these considerations with the recommended or required versions of the connected devices, as some really require a specific version. A best practice is to update the switches and, if possible, also all devices proactively twice a year - besides any additional recommended updates due to problem cases where a particular bug has to be fixed. If you need support with all the planning and doing, please contact your local IBM sales rep for an offering called Total Microcode Support. These guys will check the SAN environment, including the attached devices, for their firmware and will come up with a consistent list of recommended versions which are compatible and cross-checked. Another view on the topic comes from Australian IBMer Anthony Vandewerdt in his Aussie Storage Blog.
Think about your features
Speaking about code updates and features, it's of course a good idea to actually read the release notes. They contain crucial information about the version and should also explain new features. The crux of the matter is that there could be new features that you actually do not need, and some of them will be enabled by default. One of these examples is the Brocade feature "Quality of Service" (short: QoS). In simple terms it will "partition" the ISLs to grant high-priority traffic some kind of "right of way" over medium- or low-priority traffic. Buffer-to-buffer credits will be reserved for the different priority levels to enable this. But to really use it, you actually have to decide which traffic falls into which category. You do this with so-called QoS zones. If you don't configure the zones but leave QoS enabled, all the traffic is categorized as medium priority and you don't use the resources reserved for the high and low priorities. In times of high workload, this might end up in an artificial bottleneck resulting in frame drops, error recovery and performance problems. This is only one example that shows that it's better to be aware of which additional features are activated and whether you really need them.
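For illustration, this is roughly how the categorization works on Brocade switches: the priority is encoded in the zone name prefix. The alias and zone names below are purely hypothetical; and if you decide you don't want QoS at all, it can also be switched off per port:

    # QOSH_ = high priority, QOSL_ = low priority, everything else = medium
    zonecreate "QOSH_proddb__svc", "proddb_hba1; svc_n1_p2"
    zonecreate "QOSL_backupsrv__tape", "backupsrv_hba1; tape_drive1"
    cfgadd "main_cfg", "QOSH_proddb__svc; QOSL_backupsrv__tape"
    cfgenable "main_cfg"
    # ...or disable QoS on an ISL port if you don't use it
    portcfgqos --disable 2/15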
Know the support pages
IBM, like other vendors, has a comprehensive "Support" section on its homepage. It offers loads of information, manuals, links to code downloads, technotes and flashes. It's possible to open and track a support case there via the web. With all the stuff on these pages and all the products IBM offers support for, you might get lost a bit. Our "IBM Electronic Support" team (@ibm_eSupport) is constantly optimizing these pages, but hint number one is: register for an account and set up these pages the way you like them. That way you have your products at hand and you find all related information easily. And if you have some spare time (do you ever?), just have a look around on the support pages. There might be useful hints or important flashes concerning your IBM products.
As always this "list" isn't exhaustive and you probably did additional things to be prepared for problem determination. Feel free to share them in the comments below. Thank you!
One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow-draining device). Why is it so ugly? Well, if a full fabric or a full data center goes down - due to a fire, for example - it's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. Whereas a performance degradation due to a slow drain device is not so obvious - at least not for most hosts, operators or automatic failover mechanisms. Frames are dropped randomly, paths fail but with the next TUR (Test Unit Ready) they seem to work again, only to fail again minutes later. Error recovery will hit the performance, and the worst thing: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN you might have found it by using the bottleneckmon or you noticed frame discards due to timeout on the TX side of a device port. If you have a Cisco SAN you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. Nevertheless, let's imagine the culprit of a fabric-wide congestion is already identified. But what now?
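For reference, the detection part just mentioned usually boils down to a few commands; a hedged sketch, as the exact syntax depends on the firmware level (check your command reference):

    # Brocade FOS (6.4-style): switch-wide bottleneck detection with alerts
    bottleneckmon --enable -alert
    bottleneckmon --show          # list ports currently flagged as bottlenecked
    # Cisco MDS: check the TX B2B credit behaviour of the suspected device port
    show interface fc1/13 counters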
The following checklist should help you to think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list to be exhaustive and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device:
- Is this the latest supported HBA firmware?
- Are the drivers / filesets up-to-date and matching?
- Any newer multipath driver out there?
- Check the release notes of all available firmware / driver version for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you found a bugfix in a newer and supported version, testing it is worth a try.
- If you found a bugfix in a newer but unsupported version, get in contact with the support of the connected devices to get it supported or info about when it will be supported.
Check the configuration:
- Is it configured according to available best practices? (For IBM products, often a Redbook is available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better have them at the same line rate.
- Queue depth - better decrease it to have fewer concurrent I/O?
- Load balanced over the available paths? Check your multipath policies!
- Check the amount of buffers. Can this be modified? (direction depends on the type of the problem).
Check the workload and the concept:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better separate them.
- Too much workload at the same time? Jobs starting concurrently? Better distribute them over time.
- Multi-type virtualized traffic over the same HBA? One VM with tape access sharing a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
Check the logs for this device for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical error. If you have bit errors on the link, the switch may miss the R_RDY primitives (responsible for increasing the sender's buffer credit counter again after the recipient processed a frame and freed up a buffer).
Use granular zoning (Initiator-based zoning, better 1:1 zones) to have the least impact of RSCNs. (A device that has to check the nameserver again and again has less time to process frames.)
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (Fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (Fan-in problem)
- Is it possible to have more HBAs / FC adapters? On other busses maybe?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain device behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storages may have similar features.
- Brocade features like Traffic Isolation Zones, QOS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: Use port fencing or an automated script to kick marauding ports out of the SAN.
The list above is just a collection of things I already saw in problem cases. Having said this, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinion and more reasons or ways to deal with them!
First of all: the following post is about some SAN extension considerations related to Brocade SAN switches. The described problems may affect other vendors as well but will not be discussed here. This post also does not cover all sub-topics and considerations but describes one specific problem.
There are a lot of different SAN extensions out there in the field and Brocade supports a considerable proportion of them. You can see them in the Brocade Compatibility Matrix in the "Network Solutions" section. As offsite replication is one of the key items of a good DR solution, I see many environments spread over multiple locations. If the data centers are close enough to avoid slower WAN connections, multiplexers like CWDM, TDM or DWDM solutions are usually used to bring several connections onto one long-distance link.
From a SAN perspective these multiplexers are transparent or non-transparent. Transparent in this context means that:
- They don't appear as a device or switch in the fabric.
- Everything that enters the multiplexer on one site will come out of the (de-)multiplexer on the other site in exactly the same way.
While the first point is true for most of the solutions, the second point is the crux. With "everything" I mean all the traffic. Not only the frames, but also the ordered sets. So it should be really the same. Bit by bit by bit exactly the same. If the multiplexing solution can only guarantee the transfer of the frames it is non-transparent.
So how could that be a problem?
In most cases the long distance connection is an ISL (Inter-Switch Link). An ISL does not only transport "user frames" (SCSI over FC frames from actual I/O between an initiator and a target) but also a lot of control primitives (the ordered sets) and administrative communication to maintain the fabric and distribute configuration changes. In addition there are techniques like Virtual Channels or QoS (Quality of Service) to minimize the influence of different I/O types, and techniques to maintain the link in a good condition, like fillwords for synchronization or Credit Recovery. All these techniques rely on a transparent connection between the switches. If you don't have a transparent multiplexer, you have to ensure that these techniques are disabled, and of course you can't benefit from their advantages. Problems start when you try to use them but your multiplexer doesn't meet the prerequisites.
What can happen?
Credit Recovery - which allows the switches to exchange information about the used buffer-to-buffer credits and offers the possibility to react to credit loss - cannot work if IDLEs are used as fillwords. The switches would use several different fillwords (ARB-based ones) to communicate their states. If the multiplexer cuts out all the fillwords and just inserts IDLEs at the other site (some TDMs do that), or if the link is configured to use IDLEs, it will start toggling, with a most likely disastrous impact on the I/O in the whole fabric.
Another problem is less obvious. I mentioned Virtual Channels (VC) before. The ISL is logically split. Of course not the fibre itself - the frames still pass it one by one. But the buffer management establishes several VCs. Each of them has its own buffer-to-buffer credits. There are VCs solely used for administrative communication, like VC0 for Class_F (Fabric Class) traffic. Then there are several VCs dedicated to "user traffic". Which VC is used by a certain frame is determined by the destination address in its header; a modulo operation calculates the correct VC. The advantage of that is that a slow-draining device - one for which no credits come back to allow the switch to send the next frame over to the other side - does not completely block an ISL. If you have VCs, the credits are sent back as "VC_RDY"s. If your multiplexer doesn't support that (along with ARB fillwords) because it's not transparent, you can't have VCs and "R_RDY"s will be used to send credits. The result: as you have only one big channel there, Class_F and "user frames" (Class_3 & Class_2) will share the same credits and the switches will prioritize Class_F. If you have much traffic anyway, or many fabric state changes, or even a slow-draining device, things will start to become ugly: the two types of traffic will interfere, buffer credits drop to zero, traffic gets stalled, frames will be delayed and then dropped (after the 500ms ASIC hold time). Error recovery will generate more traffic and will have an impact on the applications, visible as timeouts. Multipath drivers will fail paths over, bringing more traffic onto other ISLs which most probably pass the same multiplexer. => Huge performance degradation, lost paths, access losses, big trouble.
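Just to illustrate the modulo idea - this is a purely conceptual sketch, the real VC assignment is ASIC- and firmware-specific - assuming a simple modulo over the destination domain ID and four data VCs starting at VC2:

    # purely conceptual illustration of "destination address modulo -> data VC"
    DEST_DOMAIN=10     # domain ID part of the frame's destination address
    DATA_VCS=4         # assumed number of data VCs (VC2..VC5)
    echo "frame goes to VC$(( DEST_DOMAIN % DATA_VCS + 2 ))"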
You see, using the wrong (or at least "non-optimal") equipment can lead to severe problems. It's even more annoying when the multiplexer in use is in fact transparent but the wrong settings are used on the switches. So if you see such problems or other similar issues and you use a multiplexer on the affected paths, check whether your multiplexer is transparent (with the matrix linked above) and whether you use the correct configuration (refer to the FabOS Admin Guide). And if you have a non-transparent multiplexer and no possibility to get a transparent one, don't hesitate to contact your IBM sales rep and ask about consulting on how to deal with situations like this (e.g. with traffic shaping / tuning, etc.).
From time to time (sometimes every day - the support business is a capricious one) I need to see what's really going on in the fibre. For that reason we have a couple of tracers which can be sent to the EMEA countries. Some IBM organizations in some countries even have their own tracers. For the SAN support we use the XGIGs from JDSU (originally from Finisar). Usually I trace if the problem is somehow protocol-related and cannot be solved with the RAS packages of the switches and the devices. Or if the RAS information from one device contradicts the other one. Or if every support team (internal and external) points at each other. Or if something totally strange happens and nobody can deal with it. Maybe we trace a little too often, because meanwhile other vendors sometimes say things like "Oh, you also have IBM gear in your environment? Let them trace it!".
So what's this tracing all about?
To put it simply, you connect it in the line and it just records all the traffic. Of course you can filter it and let it trace only the interesting part of the frames. I do not care about the actual data, but the FCP and SCSI header info is precious information. Of course, an 8 Gbps link generates a lot of data too, and the memory is very limited. So you want to be sure to trace exactly what you need - not more, not less. The tracing is done by IBM customer engineers. We ensure that we have a suitable number of trained CEs in every region. I hosted some of the trainings myself and IMHO it's definitely worth it. The analysis is then done afterwards. I personally like it, because it offers me the possibility not to be "bound" to the RAS packages alone. I can really see what happens.
Although the whole topic is pretty straightforward, to those unfamiliar with it tracers seem to be mystical devices. Over time I have faced several "urban legends" that sometimes impede troubleshooting a lot:
- "What info? You should see that in the trace!" - Often I get no additional information for a trace (e.g. consisting of 8 trace files from different channels) which slows down the analysis extremely. I need at least a layout where I can see where exactly the tracer was connected. I need to know how it was configured, if the problem really happened during the trace and I need the data collection of the switch and the devices to compare what I see against the RAS packages. Please help me to help you! :o)
- "We can't put this link down. Is it important where to plug in the tracer?" - Yes, of course it is. Like described above, it just records the traffic that enters the tracer. Nothing more. There are no tiny little photon-based nano robots swarming out through the fibres and collecting data. Really. If you plug it somewhere else, I won't see the problem.
- "Thank you so much for introducing a tracer in our environment. It solved the problem. It has to stay." - No, the tracer did not solve the problem by itself. If the problem somehow vanished with cabeling in the tracer, then a simple portdisable/portenable should have helped as well. The tracers are needed frequently and can't stay in the environment till the end of days.
These were just some of the rumors and statements I heard in the past. To summarize it, please keep in mind:
A tracer is not a magical device. It just records traffic.
If you work in technical support, generally speaking your job is to fix what's broken. But working in SAN support is, most of the time, about solving complex problems. The SAN connects everything with everything in the storage world, and often that's a lot. Oh yes, there are well-planned and "troubleshooting-friendly" environments out there, managed by top-skilled administrators using state-of-the-art tools, while having enough time between daily routine and important projects to spot problems before they even have an impact on the applications. At least I believe that these things exist, but most of the time I don't even get to see a part of it. There are excellent multi-tenancy-capable products out there, maintained by a single part-time admin or an operator some thousand miles away monitoring the environments of a dozen clients. And when there is a problem, this poor guy is called by all the angry people relying on a working IT, right up to the C-levels. Then he opens a case at his SAN vendor.
Let's switch to the support guy. He takes the new case and reads: "Massive problem, SCSI error!". Yes, most of the time there is just a statement like this. That's okay for the beginning, because the so-called "Request Receipt Center" just creates cases administratively (OMG, is that even a word in the English language?). The first level of support, the so-called Frontend, will then call you and ask you about the problem. And they (hopefully) will bring the information into a pattern called "EDANT" to have it in a structured way and to be able to hand it over (horizontally for shift changes or vertically for escalation) to others. This first call (sometimes 2..n) is crucial because the most important thing is to actually understand the problem. That sounds trivial, but it's not. In fact, the whole problem determination will fail, or at least lag significantly, if this set of information is not complete or contains false statements.
I know you will be under pressure. I know you have a thousand other things to do. I know some sales guy probably promised you "Our excellent support will solve all problems - if there ever were one - just by hearing the tone of your voice for 1.4 seconds!". But again, enabling the support guy to actually understand your problem is the most important thing, and you can hugely accelerate that process by preparing the information using the EDANT pattern.
So what's this EDANT pattern exactly? I have to admit, we stole it from the software guys. You will notice that by the wording. EDANT means:
E is for Environment. You (hopefully) know your environment, and maybe you have described it to IBMers several times before; maybe an IBM architect even designed it. But to be honest, IBMers don't share a collective consciousness like the Borg :o), and on the other hand, things change. So what's needed is a good description of the environment related to the current problem. This includes, among other things:
- A layout with the related switches and devices and the ports used to connect them.
- The machine/model information of related switches, hosts, storages, etc
- The firmware/OS/driver levels of all components.
- Time gaps between the components. (Better use NTP!)
- If you use SAN extenders, describe them. Use CWDM/DWDM/TDM? How long? Type? Vendor? Cards? Versions? Transparency? Use FCIP? Bandwidth? Quality?
- Additional specialities: any interop stuff going on? Is this a test SAN? Is it pre-production? Is it designed without redundancy? Stuff like this...
D is for Description. Please describe your problem as precisely and as comprehensively as possible.
- When did it start?
- What happened?
- Where can you notice it?
- What do the switches report?
- What do the other devices report?
- What was done when the problem happened?
- What is the impact?
And in regard to the environment please ask yourself: Which components are affected? Which components could be affected but are not? What is the difference between them? Questions like these are the key for narrowing down the problem.
A is for Actions Done. Opening a case is most probably not the first thing you do when the phones begin to ring. When a case reaches me, "someone" has already done "something". Maybe you have a plan for situations like this. Maybe someone requests "Do things!". Maybe you switched off "culprit candidates". All this should be documented as accurately as possible. With time stamps! And of course with results. Everything that changed in the environment since the problem occurred is worth mentioning, including counter resets. Do as much as possible from the CLI (Command Line Interface) and use session logging. Precious!
N is for Next Actions. This section is for everything you already have planned (maintenance windows, replacements, recovery actions, internal and external deadlines) and for everything you expect from the support. The second point is not trivial either. Of course you want the support to solve the problem. But what is most important? Do you need a workaround first, to get things working again? Do you need an RCA (Root Cause Analysis) the next day? Does the problem have to be solved overnight, with a contact person available to provide data and further info? State your expectations to get the right help.
T is for Test Case. Okay, this one is clearly from the software support. It's the data collections and any additional data and description of it, like the session logs mentioned above. Screenshots, performance data or scripts belong here too. Usually the support offers a way to upload all the stuff. Please be aware that for example IBM doesn't keep data collections from cases till the end of days. So if you uploaded something for another already closed case 6 months ago, it's most probably gone.
Using this pattern to structure the info should avoid any communication-based delays. It may sound like a lot of stuff in the beginning, but it's definitely worth it.