To be honest, the title of this article could also be "How to ease the life of your technical support". But in fact it will ease the life of everyone involved in a problem case, and priority #1 is to solve upcoming problems as quickly as possible.
In the article The EDANT pattern I explained a structured way to communicate a problem properly to your SAN support representative. In addition, it might be a good idea to prepare the SAN itself for any upcoming troubleshooting.
The following suggestions are born out of practical experience. They are intended to help you remove all the obstacles and showstoppers that could disturb or delay the troubleshooting process right from the start. Please treat them as well-intentioned recommendations, not as pesky "musts". :o)
Synchronize the time
Having the same time on all components in the data center is a huge help during problem determination. Most devices today support NTP. So the best practice is to have an NTP server (plus one or two additional ones for redundancy) in the management LAN and to configure all devices (hosts, switches, storage arrays, etc.) to use them. It's not necessary to have the NTP server connected to an atomic clock. The crucial thing is to have a common time base.
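On a Brocade switch this is a one-liner - a sketch with placeholder addresses, so double-check the syntax in the command reference for your FabricOS version:
tsclockserver "10.1.1.10;10.1.1.11"
Called without arguments, tsclockserver shows the current setting ("LOCL" means the switch still runs on its local clock). On a Cisco MDS switch the equivalent would be "ntp server 10.1.1.10" in configuration mode.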
Have a troubleshooting-friendly SAN layout
What is a troubleshooting-friendly SAN layout? I don't just mean that it's a good idea to always have an up-to-date SAN layout sketch at hand - which is very helpful in any case. What I mean is a SAN design that is free of artificial obscurities. If you have 2 redundant fabrics (yes, there are still environments out there where this is not the case), it's best practice to connect all devices symmetrically. So if you connect a host to port 23 of a switch in one fabric, please connect its other HBA to port 23 of the counterpart switch in the redundant fabric.
Use proper names
It may sound laughable, but bad naming can do a lot of harm. I think 4 points are important here:
- The naming convention - It may be funny to have server names like "Elmo", "Obi-Wan" or "Klingon", but for troubleshooting it's better to have some useful info within the name - something like BC01_Bl12_ESX (for Bladecenter 1, Blade 12, OS is ESX).
- Naming consistency - It's even more important to actually use the same name for the same item everywhere. It's very helpful if, for example, the host has the same name in the switch's zoning, in the storage array's LUN mapping and on the host itself.
- Unique domain IDs - The domain ID is like the ZIP code of a switch, and according to the fibre channel rules it has to be unique within a fabric. But beyond that, it is very helpful to keep it unique across fabrics as well. Domain IDs are used to build the fibre channel address of a device port - the address used in each frame. Within the error logs of the connected devices (hosts, storages, etc.) these fibre channel addresses are often the only references to the SAN components. Being able to tell at any time which paths over exactly which switch are affected is priceless (see the worked example after this list).
- Brocade: chassisname - As Virtual Fabrics become more and more of a standard in Brocade SANs, it's crucial to set the chassisname, because the switchname is bound to the logical switch, not to the box. These chassisnames are used for naming the data collections (supportsaves), and if you don't configure them, the device/type will be used instead. So you'll most probably end up with a huge collection of supportsave files which differ only in the date. The chassisname can easily be set with the command "chassisname". That's one small step for... :o)
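Here is the promised example of how a unique domain ID pays off (the address is made up for illustration). A fibre channel address like 0x041500 splits into three bytes:
04 - the domain ID, so the switch with domain ID 4
15 - the area; on most fixed-port switches this is the port index (0x15 = port 21)
00 - the AL_PA/port field (relevant for loop and NPIV devices)
So if a host's error log complains about target 0x041500, you know immediately that the path runs over port 21 of the switch with domain ID 4 - but only if domain ID 4 exists exactly once across your fabrics.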
Use change management
I can't emphasize this enough: please use change management. Even for the smallest SAN environment, where you would think "Nah! That's my little SAN, I can keep all the stuff in my head." Even for the biggest SAN environment, where you would think "Nah! Too many people from too many departments are involved here. The SAN is living and evolving every day." Beyond any internal policies and external requirements (change management is mandatory in several industries), proper change management also helps in the troubleshooting process. If you can come up with a complete timeline of all actions done in the SAN, plus the assertion that no unplanned maintenance actions are done in the SAN during problem determination, you will have a very happy SAN support member :o)
Back up your configuration
Bad things can happen every day. Things that wipe parts or all of your switches' configuration, or even worse, turn them into useless doorstops. It's not likely to happen, but if and when it does, you'd better be prepared. To be up and running again as soon as possible, you should not only back up your user data but also your configurations on a regular basis. For Brocade switches use "configupload", and for Cisco switches copy the running-config to an external server. The SAN Volume Controller (SVC) and the Storwize V7000 have options to back up the configuration in their GUI as well. Besides that, it helps a lot to also store all your license information for your switches in a well-known place. At least for the SAN switches, IBM cannot generate licenses and there's also no "emergency stock" for licenses. Support would have to open a ticket with the manufacturer and clarify the license issue with them. This can cost precious time in problem situations.
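For a scheduled backup job, the non-interactive forms come in handy. A sketch with made-up host, user and paths - please check the exact parameter order in the command reference for your FabricOS or NX-OS version:
configupload -all -p scp "10.1.1.20","backupuser","/backups/sanswitch1.cfg"
And on a Cisco MDS switch:
copy running-config scp://backupuser@10.1.1.20/backups/mds1.cfg
If you prefer to be prompted for each parameter, just run configupload without arguments.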
Keep your firmware up-to-date
This advice often smacks of a shot from the hip, something like "Did you reboot your PC?" from PC tech support. But to be fair, it's not just the SAN support member's blanket mantra. No software is absolutely bug-free, and because of that there are patches or - in the SAN world - more likely maintenance releases. Often there are parallel code streams: newer ones with more features but a higher risk of new bugs, and older ones with a long history of fixed defects and a "comfortable" level of stability, but most probably with an "End of Availability" already in sight. And between these two extremes are the mature codes like the v6.3x code stream for Brocade switches. It doesn't have the latest features, but a good amount of "installed hours" all over the world. It is still fully supported, so if you really ran into a new bug, Brocade would write a fix for it. It's essentially the same for Cisco and for our virtualization products.
So it's up to you. If you want the new features, you have to use the latest code. If you don't need them at the moment, the latest version of a mature code stream might be better for you. Of course you have to align these considerations with the recommended or required versions of the connected devices, as some really require a specific version. A best practice is to update the switches, and if possible also all devices, proactively twice a year - besides any additional recommended updates due to problem cases where a particular bug has to be fixed. If you need support with all the planning and doing, please contact your local IBM sales rep for an offering called Total Microcode Support. These guys will check the SAN environment, including the attached devices, for their firmware and will come up with a consistent list of recommended versions which are compatible and cross-checked. Another view on the topic comes from Australian IBMer Anthony Vandewerdt in his Aussie Storage Blog.
Think about your features
Speaking about code updates and features, it's of course a good idea to actually read the release notes. They contain crucial information about the version and should also explain new features. The crux of the matter is that there could be new features that you actually do not need, and some of them will be enabled by default. One example is the Brocade feature "Quality of Service" (short: QoS). In simple terms, it "partitions" the ISLs to grant high-priority traffic some kind of right of way over medium- or low-priority traffic. Buffer-to-buffer credits are reserved for the different priority levels to enable this. But to really use it, you have to decide which traffic falls into which category. You do this with so-called QoS zones. If you don't configure the zones but leave QoS enabled, all traffic is categorized as medium priority and you don't use the resources reserved for the high and low priorities. In times of high workload, this can end up as an artificial bottleneck, resulting in frame drops, error recovery and performance problems. This is only one example showing that it's better to be aware of which additional features are activated and whether you really need them.
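For illustration: Brocade assigns traffic to the high or low priority simply by the zone name prefix - QOSH_ for high, QOSL_ for low. A sketch with made-up WWNs and config name (verify the zoning syntax for your FOS version):
zonecreate "QOSH_host1_arrayA", "10:00:00:00:c9:aa:bb:cc; 50:05:07:68:01:40:aa:bb"
cfgadd "PROD_CFG", "QOSH_host1_arrayA"
cfgenable "PROD_CFG"
And if you decide you don't want QoS on a link at all, it can be disabled per port (here slot 2, port 15): portcfgqos --disable 2/15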
Know the support pages
IBM, like other vendors, has a comprehensive "Support" section on its homepage. It offers loads of information, manuals, links to code downloads, technotes and flashes. It's also possible to open and track a support case there via the web. With all the stuff on these pages and all the products IBM offers support for, you might get a bit lost. Our "IBM Electronic Support" team (@ibm_eSupport) is constantly optimizing these pages, but hint number one is: register for an account and set up these pages the way you like them. Then you have your products at hand and you find all related information easily. And if you have some spare time (do you ever?), just have a look around on the support pages. There might be useful hints or important flashes concerning your IBM products.
As always this "list" isn't exhaustive and you probably did additional things to be prepared for problem determination. Feel free to share them in the comments below. Thank you!
One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow draining device). Why is it so ugly? Well, if a full fabric or a full data center drops out - due to a fire, for example - it's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. A performance degradation due to a slow drain device, on the other hand, is not so obvious - at least not for most hosts, operators or automatic failover mechanisms. Frames are dropped seemingly at random; paths fail, but with the next TUR (Test Unit Ready) they seem to work again, just to fail again minutes later. Error recovery hits the performance, and the worst thing: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN, you might have found it by using the bottleneckmon, or you noticed frame discards due to timeout on the TX side of a device port. If you have a Cisco SAN, you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. In any case, let's imagine the culprit behind a fabric-wide congestion is already identified. But what now?
The following checklist should help you think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list to be exhaustive, and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device:
- Is this the latest supported HBA firmware?
- Are the drivers / filesets up-to-date and matching?
- Is there a newer multipath driver out there?
- Check the release notes of all available firmware / driver versions for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you found a bugfix in a newer and supported version, testing it is worth a try.
- If you found a bugfix in a newer but unsupported version, get in contact with the support teams of the connected devices to get it supported, or to find out when it will be supported.
Check the configuration:
- Is the device configured according to available best practices? (For IBM products, a Redbook is often available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better to have them at the same line rate.
- Queue depth - better to decrease it to have fewer concurrent I/Os?
- Is the load balanced over the available paths? Check your multipath policies!
- Check the number of buffers. Can it be modified? (The direction depends on the type of the problem.)
Check the workload:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better to separate them.
- Too much workload at the same time? Jobs starting concurrently? Better to distribute them over time.
Check the concept:
- Multiple types of virtualized traffic over the same HBA? One VM with tape access sharing a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
Check the logs of this device for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical errors. If you have bit errors on the link, the switch may miss R_RDY primitives (responsible for increasing the sender's buffer credit counter again after the recipient has processed a frame and freed up a buffer).
Use granular zoning (initiator-based zoning, better 1:1 zones) to minimize the impact of RSCNs. (A device that has to check the name server again and again has less time to process frames.)
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (fan-in problem)
- Is it possible to add more HBAs / FC adapters? Maybe on other buses?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain device behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storage systems may have similar features.
- Brocade features like Traffic Isolation Zones, QoS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: use port fencing or an automated script to kick marauding ports out of the SAN.
The list above is just a collection of things I have already seen in problem cases. Having said that, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinions and in more reasons or ways to deal with them!
Brocade FabricOS v7.3x is now officially supported for IBM clients. Among all the new features and improvements there are some I would like to cover in small blog entries - especially the ones directly related to support and troubleshooting.
One command to rule them all
Investigating ongoing problems usually starts with setting a baseline. To tell the current problem from the battles of the past, you need to clear the counters carefully. Over the years, hardware platforms and FOS versions, these commands changed again and again. Portstatsclear was such a command. Years ago it was like Russian roulette - you never knew what it would really clear. This port? The ports in the same portgroup? All physical counters, but not the stuff on the right side of porterrshow? Statsclear cleared all ports - at least the external FC ports. You needed another command for internal blade counters. And for the GigE interfaces you needed portstatsclear again.
All you need in FOS v7.3 is supportinfoclear. It clears all port counters and, in addition, the portlogdump, too. You only need to execute:
supportinfoclear --clear -force
The -force prevents it from asking you again if you are really, really sure about doing it. Additionally you can clear the error log, too, by using -RASlog (case-sensitive). But at least for anything support-related, I don't recommend doing that unless instructed otherwise.
And another improvement: it will be in the clihistory, even if you execute it via plink or ssh without opening a shell on the switch. So no more worrying about how it was executed. Just use your favorite script or do it directly, and IBM support will see how reliable the data is.
Update Nov 3rd:
And another way it rules them all: as Serge writes in the comments below, it clears the counters for all ports regardless of their VF membership. So no hopping through logical switches or need to use fosexec! Thanks Serge!
And as described in "How to avoid support data amnesia" over in the Storageneers blog: please think about when to execute this command! While it's safe to clear the counters for really ongoing or 100% reproducible problems, you need to gather supportsaves first if you want the root cause of something that happened in the past to be analyzed. Otherwise supportinfoclear might wipe all the indications and evidence needed to find out what happened!
I don't always write technical blog posts. But when I do, I make them long, and the conclusion contains a request to you, my readers, to do this or that. I won't do that today. Today is about a behavior I observed, but I won't propose anything. Feel free to draw your own conclusions. Well, that might be considered a proposal in itself :o)
This one is about the IBM System Storage SAN06B-R, a multi-protocol router or SAN extension switch. It contains two ASICs - one handling the fibre channel part and one for FCIP. They also have some extra tasks like FC routing and compression, but for our example it's enough to know that there are two of them, and if you want to transfer SAN traffic over FCIP, it has to pass both.
The two ASICs are connected via 5 internal ports, all working at a line rate of 4Gbps. That doesn't sound like much compared to the 16 FC ports running at up to 8Gbps on the front side. But we have to keep in mind what they carry: given the maximum IP connectivity of 6x 1GbE, the internal connections shouldn't be a bottleneck.
Internal connections are somewhat similar to external ISLs between switches when it comes to flow control. They use buffer-to-buffer credits ("buffer credits"), and the links are logically partitioned into virtual channels, each with its own buffer credit counter. These virtual channels prevent head-of-line blocking in case of back pressure (for example due to slow drain devices on the other side of FCIP connections).
When it comes to buffer credits, it's important how they are assigned to these virtual channels. Within these internal connections each VC gets 1 buffer, but it can borrow up to 3 more from a pool. The pool is shared among all VCs of that port and contains 11 in total.
You might say "Yeah, but hey, it's just a very short connection on the board. Who needs those buffer credits anyway?", but keep in mind they are not just for spanning the tiny distance. There are multiple reasons why frames need to be touched here and therefore buffered - plus, of course, possible external back pressure. Often a few buffer credits make the difference between normal traffic flow and frames piling up, or even frame discards due to timeout.
I guess the last thing you want is an artificial bottleneck inside your routers...
So the amount of buffers and buffer credits for each internal connection depends on how many VCs are in use. And that's the crux: the number of VCs per internal connection depends on the number of FCIP tunnels.
A tunnel consists of 1-6 circuits, so you can bundle several GbE interfaces together - they call it FCIP trunking. Some features, like Tape Pipelining, require the use of only one tunnel; there's not much we can do about that. For an environment that doesn't use such features, it starts to get interesting now: if you have only 1 tunnel, you have only 1 VC and therefore only 4 buffer credits, plus the risk of head-of-line blocking! In addition, if you actually spread the traffic across the low, medium and high priorities within a circuit, you would get a separate VC for each priority.
Using only the standard "medium" priority for the data traffic (F-class "administrative" fabric traffic uses its own VC, which falls out of this equation, of course) would give you the following amount of buffers on each of the 5 internal connections between the ASICs:
# of tunnels    # of VCs    # of buffers (1 buffer per VC + up to 3 to borrow per VC out of a pool of 11)
1               1           4
2               2           8
3               3           12
4               4           15
5               5           16
6               6           17
Please be aware that the amount of VCs/buffers is only one point that needs to be taken into consideration when planning and configuring the optimal FCIP connection. You can find a good overview of the other ones in Brocade's FCIP Administrator's Guide for your FabricOS version.
Just had some picture puzzles in my head. Here is one :o)
Solution: "N pybhq nepuvgrpg qrcyblf n cevingr pybhq".
It's the nightmare of every motorist. Your car was repaired just a few days ago, and now it has stopped running in the middle of nowhere. Or you even crashed, because the brakes just didn't work in the rain. Fake parts are a big problem in the automotive industry. Original-looking parts from dubious sources may even work as expected in normal operation, but when the going gets tough, the weak won't get going. So before a fake cam belt wrecks your engine or a fake brake pad costs your life, it might be a good idea not to save money on the wrong things.
But a faked SFP?
Like a brake pad, an SFP is somewhat of a consumable. Light is transformed into an electric signal and vice versa; this produces heat, and the components wear out over time - some sooner, some later. If you bought the SFPs from IBM for a switch under IBM warranty or maintenance, broken SFPs will be replaced for free. But if you decide to buy an SFP yourself, you'll notice after a quick web search that there are a lot of suppliers out there offering the same SFP for a much smaller price than IBM. And with "the same SFP" I mean they offer the very same IBM part number - for example 45W1216. That's an 8G 10km LW SFP.
Is it really the same?
Of course not - although they claim it to be. Their usual explanation is that all these SFPs come from the same manufacturers anyway, and that SFPs are built to open standards defined by T11 and should therefore be compatible per se. I can tell from several occasions: that's not true. There is of course more than one SFP manufacturer, and I'm sure each of you can name a handful offhand. In addition: even in times before 8G, there were SFPs that worked much better with certain switches than others.
With the 8G platform, Brocade decided to offer Brocade-branded SFPs and restricted their switches to support only those and refuse others (besides very few exceptions for CWDM SFPs). So Brocade took control over which SFPs can be used, and they were able to fine-tune their ASICs for better signal handling and transmission. To enforce this, the switch checks the vendor information in the SFP to determine whether it's a Brocade-branded one. Cisco does the same for the SFPs in their switches.
Here is where the fake begins...
There are several vendors of devices to rewrite this SFP-internal information. By spoofing vendor names, OUIs (Organizationally Unique Identifiers) and part numbers, they try to circumvent the detection mechanisms on the switch. So independent suppliers buy "generic" bulk SFPs and "rebrand" them to sell them as "IBM compatible" with the same part number. And because IBM officially supports the part number (as announced here), one might assume everything will be fine then.
In fact it's not...
Imagine a migration project. The plan is in place, everything is prepared, the components are bought and onsite, all the necessary people are there in the middle of the night or during a weekend, and the maintenance window begins. And then the ports everything depends on just don't come online - only because someone negligently faked these "cheaper but still compatible" SFPs. I had a case where the same SFPs worked in one 8G switch model but not in another - also 8G - with exactly the same FabricOS.
In the sfpshow output they looked like this:
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 5401001200000000 200,400,800_MB/s SM lw Long_dist
Vendor Name: XXXXXX
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
The supplier did not write "Brocade" into the "Vendor Name" field (I replaced it with Xs), but into the "Vendor OUI" field he inserted Brocade's OUI. In addition he also faked the "Vendor PN", but even used a wrong one - this is the PN for a shortwave SFP.
But besides being an ugly showstopper for the migration - driving costs far beyond what could have been saved by buying the cheaper parts - that's not even the worst case. Perfectly faked SFPs might be accepted by the switch, but you never know if they are really running fine. I don't wish on anybody to be called at 3am about the crash of half the servers because an ISL started to toggle. Or to have increasing performance problems because every now and then a faked SFP "on the edge of the spec" devours a buffer credit by misinterpreting an R_RDY.
Troubleshooting this can be a pain in itself. And the money potentially lost in outages will hardly be compensated by the savings from cheaper SFPs!
I got confirmation from IBM product management that IBM itself will only deliver Brocade-branded SFPs for its current b-type SAN portfolio.
So if you have non-Brocade-branded SFPs in your 8G or 16G Brocade switches, be aware that they are probably not supported and there could be some unplanned night or weekend working hours ahead of you...
I've been blogging for a while now. Looking back, I had a personal blog about things I'm interested in for some years during my studies. I ran a comedic fake news page, too. My wife and I write a blog about our baby, and I also have an IBM-internal blog about SAN troubleshooting. Last year I started seb's sanblog on developerWorks, and it was quite a slow start. At the beginning of 2011 there was much to do in my primary job on the one hand, and on the other hand my daughter was born and my interests shifted a bit. As I write the articles for this blog mainly in my spare time, the simple equation was: no spare time = no blog posts.
In mid-2011 the situation improved a bit. My baby Johanna was somehow out of the woods (is "to be out of the woods" really the English term for finishing the most stressful phase?) after her hip dysplasia was cured, and I was able to really start blogging. And then I thought: what do I want to blog about? There is so much going on in the storage industry, but am I really the best person to blog about it? Can I really add value with blog articles here? I don't think so. Of course I comment on such topics on other people's blogs, Twitter or social platforms like LinkedIn from time to time. After all, there's always some FUD around that I cannot resist commenting on. But I try to keep my own blog really about SAN and storage virtualization, with a focus on troubleshooting.
I wrote 19 articles in 2011. That's not much compared to, let's say, Storagebod. Why is that? Well, for me it's quite a balancing act to decide what I can blog about. Of course I can't blog about a specific customer having a problem - that's a no-go. There are also things I don't want to blog about, because there is already so much out there about them. And then there is stuff I just can't blog about, because it's internal information - special troubleshooting procedures I created, for example, or information about internal tools and projects I'm involved in.
What remains then?
Oh, there's still enough to blog about. If I notice situations like "Hey, I've now explained this general thing in four cases to customers completely unaware of it", or if I see a feature that could really help admins but hardly anyone uses so far, then I write a blog article. I see it more as additional explanation and food for thought. My target audience consists of customers on the "doing level" (admins, architects) as well as people troubleshooting SANs. I know that's a significantly smaller group than the audience of the more general storage bloggers, but I'm happy if the right people read it and I get feedback that my blog helped them with their problems. I started counting the visitors internally at the end of July, and so far around 32,000 have visited seb's sanblog. That's not too bad, I think.
Writing such a résumé, I want to thank the people who inspired me to start a blog. First of all there are Barry Whyte and Tony Pearson with their developerWorks blogs, showing me that there are actually IBMers out there writing about my topics of interest. Reading their blogs brought me to many others - also from other companies - that I try to look into daily. Most of them you see in the list in the right bar of this blog. But a special Thank you! goes out to my Australian colleague Anthony Vandewerdt, whose blog has a big focus on the people really working with IBM storage products, and therefore SAN products as well. His Aussie Storage Blog on developerWorks triggered my decision to start my own external blog. Thank you again!
So what to expect from 2012?
To be honest, I have no idea :o) There is no overall plan. No weeks-long article pipeline. I'm not invited to blogger events or anything like that, and my blog is in no way a marketing channel for upcoming IBM products. Everything I write is just born out of my experience with SAN products and troubleshooting. I try not to write too much about hypes and trends, unless they have a direct impact on the SAN - like oversaturated hypervisors turning into slow drain devices, or Big Data as an excuse to do some really weird things with your storage architecture :o)
Are you still interested?
Then be my guests in 2012, and if you feel the urge to say something about, against or in addition to an article, don't hesitate to leave a comment! Have a nice start into the New Year!
Everyone is talking about cloud security these days. Is it clever to move my data outside my own data center? To another company? Maybe even outside the country? How safe and secure is that - not only on the way there, but also once it has arrived? Is it protected well enough? Are they able to block intruders, both remote and local? And what about attackers from within the cloud service provider? The discussion is so full of - indeed reasonable - concerns that I started to wonder:
Why do I often see SANs that are not secured at all?
I don't mean the physical access control to the machines themselves; usually companies take that one seriously. But all the other aspects of SAN security are often disregarded, in my experience. If there is no statutory duty or enforced compliance, it's just a variable in the risk calculation weighing the costs of security against the probabilities and consequences of security breaches. And taking budget constraints and the lack of skill and manpower into consideration as well, SAN security is often treated as an orphan.
There is a huge market for IP security, with firewalls, intrusion detection systems, DMZs, honeypots and hackers with hats in all colors of the rainbow. If a famous company is hacked or becomes the victim of a huge DDoS attack, you'll probably read about it in the IT news. But if a company has an internal security breach in its storage infrastructure, it will hardly let the public know about it.
What to do from SAN point of view?
There are multiple aspects and possibilities to secure a SAN. Let's take Brocade switches as an example and let's see what could happen...
1.) Management access control
From time to time I get a request for a password reset, and the switch's root account is still on the default password. THAT'S. NOT. COOL! It should be really unlikely, because all current FabricOS versions prompt the admin to change the passwords for all four pre-configured user accounts if they are still at the defaults. But it still happens every now and then.
It's the same as for all other devices with user management in IT: choose passwords which are hard to guess, can't be found in a dictionary, contain non-alphanumeric characters and so on. Change passwords from time to time, for example at a 90-day interval. Most switches support RADIUS and LDAP. The ipfilter command allows you to block telnet, enforcing the use of ssh. In addition, as of FabricOS v7.0x it's officially supported to have plain key-based ssh access for more than one user, too.
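Blocking telnet with ipfilter looks roughly like this - a sketch from memory with a made-up policy name, so please verify the flags in the command reference for your FabricOS version:
ipfilter --clone NoTelnet -from default_ipv4
ipfilter --addrule NoTelnet -rule 1 -sip any -dp 23 -proto tcp -act deny
ipfilter --save NoTelnet
ipfilter --activate NoTelnet
Rules are evaluated in order, so the deny rule for TCP port 23 has to come before the permit rules cloned from the default_ipv4 policy.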
And don't stick with switches from generations ago. Not only should the lower line rate and the small feature set be considered here, but security, too. If the firmware is very old, it's also based on old components like legacy versions of OpenSSH & Co. Very concerning security holes have been fixed over the years. You can check the installed versions of these components here. And yes, it is quite easy to see the password hashes without the root user, but at least they are salted in the current firmware versions.
Security is not only about passwords, it's about user roles, too. On Brocade switches you can define user rights with high granularity, the DCFM has its "resource groups", and the Network Advisor works with "areas of responsibility". Use them to choose wisely who can do what. You don't want another Terry Childs case in the media, this time about your company, do you?
The only thing I miss for many SAN switches and other storage equipment is a real, robust and trustworthy accounting or audit log. I want to see what was done on the switch and by whom - not only via the CLI, but via web interfaces, management applications and shell-less CLI access, too. Is there really no standard to have this data automatically forwarded to an internal, trusted collection server via a secured connection?
2.) Traffic encryption
You should encrypt your traffic. There are several ways to tap the signal without your knowledge, especially if your data leaves your controlled ground on the way to a remote DR location. For FCIP traffic you should always use encryption. Indisputable. And for plain fibre-based FC long-distance connections? You'll probably say "Hey, it's transparent and it's optical fibre, not electrical. You can't just dig a hole, rip the cladding off the cable and splice a second cable in." - You have no idea. Keep in mind that the data traversing the SAN is the really important and thus precious kind in your company. There are technical possibilities to do it, and where there is opportunity, there could be a criminal mind using it. This perception seems to gain more and more acceptance among the switch vendors. For example, Brocade's current 16G equipment is able to encrypt ISLs for that matter. Of course all vendors sell SAN-based encryption appliances or switches, too. This way not only the inter-location traffic is encrypted, but also the data on disk or tape. So if some unauthorized person ever got their hands on the storage, they wouldn't be able to read the data.
3.) Fabric access control
What would be the easiest way to work around passwords and encryption if an intruder had physical access to a data center? (Just think of a student employee, a temp worker, an intern, an external engineer... I think you get the point.) He could simply spot a free port on a switch and connect a switch he brought in. Setting up a mirror port or changing the zoning to gain access to disks - and doing some other nasty things - is quite easy then.
How to avoid that?
FICON environments for mainframe traffic have always had higher security demands, and we can use just the same features for open systems as well. There are security policies allowing us to control which devices are allowed to connect to the fabric (DCC - device connection control), which switches can be part of the fabric (SCC - switch connection control) and which switches can modify the configuration (FCS - fabric configuration server). In addition, the current Brocade FabricOS versions support DH-CHAP and FCAP using certificates for authentication.
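A rough sketch of what that looks like on the CLI - the policy name follows Brocade's DCC_POLICY_xxx convention, the domain, port and WWN are made up, and you should double-check the exact syntax in the FabricOS command reference before trying this:
secpolicycreate "DCC_POLICY_srv1", "1(3);10:00:00:00:c9:aa:bb:cc"
secpolicyactivate
And for switch-to-switch DH-CHAP authentication, something along these lines:
authutil --set -a dhchap
secauthsecret --set
authutil --policy -sw active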
If you want to utilize the features and mechanisms described above, the FabricOS Administrator's Guide provides some good descriptions and procedures to begin with. Of course, IBM offers technical consulting services to help you secure your SAN properly.
So if you are concerned about whether the provisioning model your IT could be based on in the future is secure, you should be even more concerned about the security of your SAN today!
(Disclaimer: SAN switches from other vendors may have the same or similar security features, too. I just chose Brocade switches because of their prevalence within IBM's SAN customer base.)
There are some goodies in FOS 7.0 that were not announced big-time - goodies especially for us troubleshooters. There are regular, but not too frequent, so-called RAS meetings, where we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of our wishes were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500ms because it can't be delivered in time, it will be dropped. So these drops are a good indicator of a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC behind a specific port had to drop a frame that was intended to be sent out of this port. It means: this guy was busy X times and I had to drop a frame for him.
But who looks into portstatsshow anyway - at least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it had only one cumulative counter for all reasons of frame discards - and there are a lot more besides those timeouts. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of these, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, you can get some ideas on how to treat it here.
The firmware history
Just last week I had another fiddly case about firmware update problems. There are restrictions on the version you can update to, based on the current one. If you don't observe the rules, things can get messed up. And they can get messed up in a way you don't see straightaway. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: Wouldn't it be better to actually know the versions the switch was running on in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno Date & Time Switch Name Slot PID FOS Version
1 Fri Feb 18 12:58:06 2011 CDCX16 7 1556 Fabos Version v7.0.0d
2 Wed Feb 16 07:27:38 2011 CDCX16 7 1560 Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: trying to merge a new switch into an existing fabric and failing with a segmented ISL and a "zone conflict". The most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little safer. Earlier, each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between each unzoned pair of devices is blocked if there is no zone including them both. The drawback of "no access" is its technical implementation, though. As soon as it was enabled, a hidden zone was created, and its mere existence blocked the traffic for all unzoned devices. And so, without any indication, the switch ended up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge into the fabric? Fine. You don't have to care, the firmware takes care of it for you!
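If you want to check or change the setting on a switch before merging it, the defzone command does it (run cfgsave afterwards to make it stick):
defzone --show
defzone --noaccess    (or defzone --allaccess)
cfgsave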
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Time for another piece of my little series! This time I'd like to write about a new feature in v7.0x, especially for administrators and support personnel: the Frame Log. Maybe it's a bit early to write about it, because it seems to be a feature "in development" at the moment, but I have waited for it so long that I'm just not able to resist. I think and hope Brocade will develop it further, like the bottleneckmon - which I was very sceptical about when its first version was released in the v6.3 code. After seeing its functionality extended in v6.4 and even more in v7.0, the bottleneckmon is an absolute must-have.
Hmm... maybe I should write an article about bottleneckmon, too :o)
Back to the Frame Log. So what's that?
Basically it is a list of frame discards. There are several reasons why a switch has to drop a frame instead of delivering it to the destination device. One of them is a timeout: if a frame sticks in the ASIC (the "brain" behind the port) for half a second, the switch has to assume that something's going wrong and the frame cannot be delivered in time anymore. Then it drops it. Until FabricOS v7.0 the switch just increased a counter by one. Since later v6.2x versions it was at least logged against the TX port (the direction towards the reason for the drop) - in earlier versions the counter increased only for the origin port, which made no sense at all. But now we even have a log for it! A log storing all the frames the switch had to discard. While that sounds a bit like rummaging through the switch's trash bin, the Frame Log is very useful for troubleshooting. It contains the exact time, the TX and the RX port (keep in mind the TX is the important one) and even information from the frame itself. In the summary view you see the fibre channel addresses of the source device (SID) and of the destination device (DID).
For example to see the two most recent frame discards in summary mode, just type:
B48P16G:admin> framelog --show -mode summary -n 2
Fri Sep 23 16:07:13 CET 2011
Log TX RX
timestamp port port SID DID SFID DFID Type Count
Sep 29 16:02:08 7 5 0x040500 0x013300 1 1 timeout 1
Sep 29 16:04:51 7 1 0x030900 0x013000 1 1 timeout 1
In the so-called "dump mode" you even see the first 64 bytes of each frame. Usually I have to bring an XGIG tracer onsite to catch such information, and often it's not even possible to catch it then, because an XGIG can only trace what's going through the fibre. So you'll only see this frame if you trace a link it crosses before it is dropped. And even then you can't trigger (= stop) the tracer directly on this event; you have to have it look for a so-called ABTS (abort sequence). If a frame is dropped, the command will time out in the initiator, which then sends this ABTS. Depending on what frame exactly was dropped in which direction, the ABTS could appear on the link several minutes after the actual drop of the frame. Imagine a READ command being dropped: the error recovery will start only after the SCSI timeout, which could be e.g. 2 minutes. But 2 minutes is a long time in an FC trace. Chances are good that the tracer misses it.
Not so with the Frame Log!
The Frame Log can tell you exactly which frame was dropped. If you try to find out whether a particular I/O timeout in your host was caused by a timeout discard in the fabric, this is your way to go. If you see your storage array complaining about aborts for certain sequences, just look them up in the Frame Log. With this feature Brocade finally catches up with Cisco and their internal tracing capabilities - and Brocade does it in a way that's much more comfortable for the admin. The logging of discarded frames is enabled by default, and it works on all 8G and 16G platform switches without any additional license.
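Going by the summary example above, the dump mode should just be a different -mode parameter - I haven't reproduced real output here, so check the command reference for your FOS version:
framelog --show -mode dump -n 2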
The big "BUTs"
As I mentioned at the beginning of this article, there are still things for Brocade to work on to turn the Frame Log into a must-have tool like the bottleneckmon. The first catch is its volatility. In the current version it can only keep 50 frames per second per ASIC, for 20 minutes in total. At the moment I personally think that's too short. But I'll wait for the first cases where I can use it before I form a final opinion about this limit.
The other - more concerning - constraint is that it currently only works for discards due to timeout. So if a frame is dropped for any of the other possible reasons, it won't be visible in the Frame Log in its current implementation. But that's exactly what I need! If the switch discards a frame because of a zone mismatch, or because the destination switch was not reachable, or because the target device was temporarily offline, or whatever - I want to see that. If a server is misconfigured (uses wrong addresses) and so cannot reach its targets, you'd see the reason right there in the Frame Log - no tracing needed! There are plenty of other situations that would be covered by such a functionality. So I honestly hope that there is a developer with a concept like this in his drawer, or even already implementing it. Allow me to assure you that there is at least one support guy waiting for it...
The picture is from Zsuzsanna Kilian. Thank you!
HDS' Hu Yoshida posted an interesting theory on his blog. Basically he says that while modular dual-controller storage arrays might be useful for traditional physical server deployments, virtualized servers would need enterprise storage arrays. (Which, interestingly, are defined by "multiple processors that share a global cache", according to him.)
I wrote a small reply as a comment, which still awaits moderation. Until now Hu has usually published my few comments on his blog - regardless of how critical they were. I don't know why it didn't happen this time, but I think the most reasonable explanation is that everybody at HDS is very busy with the BlueArc acquisition. So meanwhile I publish it here :o)
Interesting read. IMHO there's much truth in your quote "Virtual servers can be like a drug", and I think you are also right with your observation about Tier 1 applications being virtualized. From a support perspective this could lead to bad nightmares. But to be honest, I don't get why the storage system should be the limiting factor here. The number of servers (in terms of OSes running) doesn't change in your picture, and neither does the total workload towards the storage array. They were physical servers before; now they are virtual servers (VMs) on a few physical ones. In my eyes the requirements regarding the storage environment didn't change much, but of course you have to check carefully whether your physical servers with their SAN connectivity could turn into a bottleneck themselves, as I pointed out in my latest blog post (http://ibm.co/mY5PnH).
Additionally, just a minor thing with the dual-controller arrays: why should the outage of the remaining controller lead to data loss? Usually the write cache of such arrays is disabled if one controller is down, because it can't be mirrored anymore. On the one hand this means decreased performance during such maintenance, but on the other hand it means that the host gets the SCSI good status only if the I/O is really written to disk. So there could be loss of access, of course, but no data loss.
If you have a different - or a similar - opinion, feel free to leave a comment here :o)
My friend and colleague here in IBM L2 SAN Support, Serge Monney, prepared a "little helper" article for our clients. It's intended to help you with the Brocade SAN switch data collection in case of a problem. Maybe you even got the link to this article within a case. This is it:
Be prepared to send good logs to your support
Brocade SAN switches have the particularity of having logs with timestamps and logs without timestamps. The logs without timestamps are counters that tell you what is going on, or what is going wrong, with the frames entering or leaving your switch. To be able to trust the counters, we need to set a baseline by clearing the statistics.
Most of the time support needs the following to perform troubleshooting:
- Take a supportsave to be sure to have the data of the current status (see also https://ibm.biz/BdXnNa)
- Clear the counter statistics using the CLI. The CLI is necessary because the graphical interface does not offer the possibility to clear internal counters the way "slotstatsclear" does.
- If you have logical switches (VF):
fosexec --fid all -force -cmd "statsclear"
fosexec --fid all -force -cmd "slotstatsclear"
- If you do not have logical switches (VF):
statsclear
slotstatsclear
- If you have FOS v7.3 or higher (this single command replaces the two options above):
supportinfoclear --clear -force
- Let the fabric run for up to 3 hours while the problem is visible or go to the next step immediately as soon as the problem is visible.
- Take a supportsave.
How to take a supportsave
- Use Network Advisor. In the menu under "Monitor", select "Product / Host SupportSave".
(You can download a free version of the Network Advisor after creating an account at mybrocade.com.)
- Use the "supportsave" CLI command. Be aware: this needs an FTP server configured and running on a management workstation reachable by the switch!
This command will collect RASLOG, TRACE, supportShow, core file, FFDC data and other support information and then transfer them to a FTP/SCP server or a USB device. This operation can take several minutes.
OK to proceed? (yes, y, no, n): [no] y
Host IP or Host Name: xxx.xxx.xxx.xxx
User Name: YYYY
Protocol (ftp or scp):
Your FTP server will receive multiple files. Please compress them and give the resulting archive file a meaningful name - for example one containing the case number, the switch name and the date.
(Collecting data using the web interface will not help; you need to collect the supportsave with Network Advisor or the CLI.)
And to IBM...
To upload the package, please make use of our Secure Upload web frontend:
Just use your PMR (preferred), RCMS, or CROSS case number for your upload to let the system notify the support engineer with an update to the case. It's also possible to upload data against the plain Machine Type / Serial Number, but then there won't be any direct correlation to the case. In the field "Upload is for:" always choose "Hardware" when you upload SAN data collections. The email address is optional, but if you provide it you'll get a short notice as soon as the upload has completed successfully, and the support engineer will be able to contact you via mail if needed. After clicking "Continue" you can drag and drop the archive file containing the supportsave to upload it. That's it!
Short link for this article: https://ibm.biz/supportsave
Almost a year ago I wrote an article about congestion bottlenecks in Brocade switches. I said you should avoid them, because they mean that you either have too much workload for your design or you don't use your redundancy properly. You can use the bottleneckmon to detect them. Back then I cared much more about latency bottlenecks, often caused by slow drain devices, and their implications. And so I do today.
Well...stop! Didn't you talk about congestion bottlenecks?
Yes! Today I want to explain how a congestion bottleneck can cause exactly the same symptoms on the devices as a latency bottleneck - and exactly the same performance degradations. This is how it happens: in the middle you see a SAN director with 2 portcards and 2 core cards. While the devices are connected to the portcards, the core cards provide the backend connections between them; they are internally connected via the backplane. So, for example, host 1's way over to storage array A traverses its portcard, then one of the two core cards, and leaves through the other portcard until it reaches storage array A. It can even happen that two devices connected to the same portcard have to go over the core cards, because so-called local switching is only done within an ASIC, and a portcard can have more than one, depending on the number of ports.
Now please meet host 2. Host 2 is a wonderful, modern server, one of the workhorses of the data center. It's fully packed with virtual machines, but its many cores and its memory, as well as its state-of-the-art HBA, provide enough horsepower to cope with the workload. This baby is more than capable of doing the work, and it's in no way a slow drain device. It's zoned and mapped to the storage arrays A, B, C and D, and it uses them heavily, mostly for read operations. The tiny green bars are read requests, and as you can see in the next picture, it sends them to all of the arrays, all of the time.
Of course the other hosts send requests, too, but let's focus on our diligent host 2. Yes, the pictures are simplistic, but I'm sure you'll get the point. In the next one you see the first responses flowing back to host 2. Communicating with several storage arrays, the link towards host 2 is used heavily, but host 2 is processing the incoming frames quickly and returns buffer credits to the switch in good time. So far so good.
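To put rough numbers on it (my own back-of-the-envelope figures, assuming 8G links everywhere): an 8G FC link carries about 800 MB/s of payload. Four arrays answering in parallel can therefore push up to 4 x 800 = 3,200 MB/s towards host 2, whose single link can drain at most 800 MB/s. Every sustained second of this 4:1 oversubscription leaves frames that have to wait somewhere inside the director.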
But the longer the link utilization stays very high, the more likely the following will happen, if you have enabled bottleneckmon with alerting:
2013/09/07-12:07:11, [AN-1004], 7002, SLOT 7 | FID 128, WARNING, FAB1DOM5, Slot 2, port 14 is a congestion bottleneck. 99.67 percent of last 300 seconds were affected by this condition.
If you didn't enable bottleneckmon, the congestion bottleneck would still be there... you just wouldn't know it.
The crux is: you will hardly ever find a congestion bottleneck that just comes with high link utilization and no negative effects. The probability is much higher for the following scenario:
Although there are enough buffer credits on this highly utilized link, frames are piling up towards it, because there is just too much workload and the link is busy sending frames. There is no slow drain device, and to stay with the bathtub metaphor: the drain works very well and transports as much water as it is physically able to. But there is so much more water in the tub than can go through the drain at the same time. And in addition, imagine you have not just one water tap (in our case: storage arrays) but four of them. They fill the tub quicker than the drain can empty it. As a result, the internal buffers for all the hops through the SAN director fill up (that's basically the tub), and finally the director has to do something about it: it will slow down the handing out of buffer credits to the devices. Not only to devices that want to send frames directly to host 2, but due to back pressure also to the ones that send frames in that rough direction (using the same internal connections, for example). And finally you'll end up with something like this:
The SAN director just behaves like a slow drain device itself!
Frames pile up inside the storage arrays and other end devices impaired by the slow drain behavior. If their RAS package is good, they will yell about credit starvation and probably even drop frames within their FC adapters. In extreme situations these frame drops can happen in the director, too. Then, at least, you would see something that points you to a performance problem. Because otherwise - if there were substantial delays in the traffic, but all frames finally got transferred to the next internal or external hop within the 500ms ASIC hold time - you would only see the congestion bottleneck. And without bottleneckmon, you wouldn't see anything at all. The switch would look clean. Nothing in porterrshow or portstatsshow - both show only external port counters anyway. As a SAN administrator you would not suspect anything in the director to be the cause.
And still it would be there: a big performance problem caused by a device communicating with too many other devices. Not a slow drain device, but still causing a slow drain in the SAN.
So how to solve it?
It's basically what I wrote a year ago, plus points 3 and 4 from How to deal with slow drain devices. You just have to ensure - from an architectural design point of view - that all components of the SAN are able to cope with the workload at any given time. It's both that easy and that complex. But the first step towards resolving such a situation is to detect it properly and to keep in mind what can happen.
Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) They are hardly SAN defects and 2) They need to be solved as quickly as possible, because they really have an impact.
If just a switch crashed or an ISL dropped dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems on the other hand can have a high impact on your applications across the whole data center without a concerning message in the logs, if your systems are not well prepared for it. Beside of the preparation steps I pointed out here there is a tool in Brocade's FabricOS especially for performance problems: The bottleneck monitor or short:
If a performance problem is escalated to the technical support the next thing most probably happening is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
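Clearing the counters is the quick part. On Brocade switches that is typically done like this (command names as I know them from FabricOS - please check the command reference of your code level; on directors the slot-based variant additionally clears the blade-level statistics):

myswitch:admin> statsclear
myswitch:admin> slotstatsclear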
Why 3 hours?
A manual performance analysis is based on certain 32-bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters, and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
Besides the wait time, the whole handling of the data collections - including gathering and uploading them to the support - takes precious time. And then the support has to process and analyze them. After all these hours of continuously repeated telephone calls from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would only now start to involve the admins and/or support for the particular device.
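A quick Python calculation shows why roughly three hours is the magic number. The rates are simplified assumptions (8G link, 8b/10b encoding, full-size frames), not FabricOS internals:

# How fast can a 32-bit counter wrap? (simplified, illustrative numbers)
link_bits_per_s = 8e9 * 8 / 10     # 8G FC minus the 8b/10b encoding overhead
bytes_per_s = link_bits_per_s / 8  # ~800 MB/s of payload on the wire

for name, unit_bytes in [("4-byte word counter", 4), ("full-frame counter", 2148)]:
    rate = bytes_per_s / unit_bytes    # counter increments per second
    wrap_s = 2**32 / rate              # seconds until the counter wraps
    print(f"{name}: wraps after ~{wrap_s:.0f} s ({wrap_s/3600:.2f} h) at full speed")

A counter of full-size frames survives a bit over three hours at line rate, so waiting much longer than that risks meaningless numbers.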
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x, with some major limitations. But from v6.4x on it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it into the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you have most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically propagates back through the fabric and can affect wide areas of it.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O, which can have a huge negative impact! Another reason could be a mis-configured long-distance ISL. (See the sketch below for a rough feeling of how many credits a long link eats up.)
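Here is the promised sketch: a rough way to estimate how many buffer credits a long ISL needs to stay saturated. The numbers (speed of light in fibre, frame size) are the usual rules of thumb, not Brocade's exact formula:

# Rough estimate of the buffer credits needed to keep a long ISL busy.
distance_km = 10
speed_gbps = 8.0
frame_bytes = 2148       # full-size FC frame; smaller frames need more credits
fibre_us_per_km = 5.0    # light needs ~5 microseconds per km in fibre

round_trip_s = 2 * distance_km * fibre_us_per_km * 1e-6
bits_in_flight = round_trip_s * speed_gbps * 1e9
credits = bits_in_flight / (frame_bytes * 8)
print(f"~{credits:.0f} credits needed for {distance_km} km at {speed_gbps:g}G")

Halve the frame size and you need twice the credits - that's why a smaller-than-expected average frame size can starve an otherwise perfectly fine ISL.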
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it in the minutest detail. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
At first: I highly recommend updating your switches to the latest supported v6.4x code if possible. The bottleneckmon is much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems, it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck was found. Pretty straightforward. If you are not sure you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
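If you want to look at a particular port right away instead, the show command goes something like this (syntax as far as I recall it from the v6.4 command reference, so please double-check yours - here with a 5-second sampling interval over the last 300 seconds for port 14 in slot 2):

myswitch:admin> bottleneckmon --show -interval 5 -span 300 2/14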
And if there are messages now?
Well, there is still the chance that there are situations of buffer credit starvation the default-configured bottleneckmon can't see. But since you are reading an introduction here, I assume you'll just open a case with the IBM support.
You'll Never Walk Alone! :o)
*)Depending on country-specific policies and maintenance contracts a performance analysis as described above could be a charged service in your region.
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (but for the Intel 8008) during my studies and it was a nice exercise in optimization.
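For the curious, here is roughly what such an algorithm looks like in Python instead of 8008 assembler. It implements the generic (Ethernet-style) CRC-32; Fibre Channel uses the same polynomial, although the exact bit ordering on the wire differs from this textbook form:

import zlib

def crc32_bitwise(data: bytes) -> int:
    # Process the message bit by bit against the (reversed) CRC-32 polynomial.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

payload = b"some frame payload"
flipped = bytes([payload[0] ^ 0x01]) + payload[1:]   # flip a single bit

print(hex(crc32_bitwise(payload)), hex(crc32_bitwise(flipped)))  # very different
assert crc32_bitwise(payload) == zlib.crc32(payload)             # sanity check

Note how flipping a single bit produces a completely different checksum - exactly the property you want for error detection.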
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it in the 4 bytes right before the actual end of frame (EOF). The recipient reads the frame bit by bit and calculates the CRC value itself along the way. Reaching the end of the frame, it knows whether the CRC value stored there matches the content of the frame. If it doesn't, there was at least one bit error, so the frame is considered corrupted and can be dropped. Now if the recipient is a switch, what happens next depends on which frame forwarding method is used:
The first method is store-and-forward: the switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value. If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing, and at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way. It ensures that the network is not stressed with frames that are corrupted anyway, but it comes with a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
The second method is cut-through routing: to decrease this latency, the switch just reads in the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-Port (a really quick look into the so-called CAM table stored within the ASIC), the frame goes directly on its way towards the destination. So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame will travel the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that. Brocade switches work this way. As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless of which method is used, the CRC value remains just an error detection mechanism, and most probably the whole exchange has to be aborted and repeated anyway.
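To make the contrast concrete, here is a toy Python sketch of the two behaviors. It's purely conceptual - real switches do this in ASIC hardware, of course:

# Conceptual sketch of the two forwarding methods (not real switch firmware).

def store_and_forward(frame: bytes, crc_ok) -> str:
    # Buffer the whole frame, validate, and only then forward (Cisco MDS style).
    if not crc_ok(frame):
        return "dropped at this switch"      # corruption stops at the first hop
    return "forwarded clean"

def cut_through(frame: bytes, crc_ok) -> str:
    # Forwarding starts right after the header; the CRC verdict comes too late.
    if not crc_ok(frame):
        return "forwarded with 'bad EOF' marker"  # the destination will drop it
    return "forwarded clean"

bad_frame = b"\x00corrupted payload"
always_bad = lambda frame: False                  # pretend the CRC check fails
print(store_and_forward(bad_frame, always_bad))   # dropped at this switch
print(cut_through(bad_frame, always_bad))         # forwarded with 'bad EOF' marker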
So how to troubleshoot CRC errors on Brocade switches then?
If you only had a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which link is really broken? If you have multiple broken links in a huge SAN, this could turn ugly. But there are 2 additional counters for you:
- enc in - The frame is additionally encoded in a way that bit errors can be detected. And because the frame is decoded when it's read from the fibre and encoded again before it's sent out to the next fibre, the enc in (encoding errors inside frames) counter will only increase for the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut through as explained above, there is one thing the switch can do when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it will replace the EOF with another 4 bytes meaning something like "this is the end of the frame, but the frame was recognized as corrupted". The crc g_eof counter basically means "the CRC value was wrong but nobody noticed it before, therefore it still had a good EOF". So if this counter increases for a particular link, that link is most probably the faulty one.
Time for an example - a porterrshow output trimmed down to three ports of interest:

          frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy
        tx     rx     in     err    g_eof  shrt   long   eof    out    c3    fail   sync   sig
  1:   1.5g   1.8g    13     12     12      0      0      0    1.1m    0      2     650     2      0      0
  2:   1.3g   1.4g     0    101      0      0      0      0      0     0      0      0      0      0      0
  3:   1.9g   2.9g    82     15      0      0      3     12    847     0      0      0      0      0      0
Port 1 shows a link with classical bit errors. You see crc err and also enc in errors, and along with them crc g_eof. Everything as expected. Just go ahead and check / clean / replace the cable and/or the SFPs. There are some tests like "porttest" and "spinfab" you could run to determine which part is broken.
Port 2 is a typical example of an ISL with forwarded CRC errors. The ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence crc g_eof does not increase.
Port 3 is a bit tricky now. If you just relied on crc g_eof, it would seem to be a victim of forwarded CRC errors, too. But that's not the case. Actually the frames were broken in a manner that the end of the frame was not detected properly, so too long and bad eof are increased instead. Best practice: stick with the enc in counter. It still shows that this link indeed generates errors.
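To condense the three examples into a rule of thumb, here is a small Python sketch. The heuristic is my own condensation of the reasoning above, not an official Brocade diagnosis:

# Rule-of-thumb judgement based on the three counters discussed above.
def judge_link(enc_in: int, crc_err: int, crc_g_eof: int) -> str:
    if enc_in > 0:
        # The local link generates bit errors itself (ports 1 and 3 above).
        return "faulty link: check / clean / replace cable and SFPs"
    if crc_err > 0 and crc_g_eof == 0:
        # Frames arrived already tagged as corrupted (port 2 above).
        return "forwarded CRC errors only: the faulty link is elsewhere"
    if crc_g_eof > 0:
        # Corruption first noticed here although no local encoding errors showed up.
        return "suspicious: corruption first seen here, investigate this link"
    return "no CRC-related problem visible"

for port, counters in {1: (13, 12, 12), 2: (0, 101, 0), 3: (82, 15, 0)}.items():
    print(f"port {port}: {judge_link(*counters)}")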
Hold on, Help is on the way!
Now with 16G FC as the state of the art, things have changed a bit. It uses a new encoding method and it comes with a forward error correction (FEC) feature. Brocade provides this with its FabricOS v7.0x on 16G links. It is able to correct up to 11 bit errors in a full FC frame. FEC is not really highlighted or specially standing out in Brocade's courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors within one frame! Based on the ratio between enc in and crc err we have seen so far - which basically shows how many bit errors you have in a frame on average - I expect this to solve well over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
So let's see how this turns out in the future. It might be a bright one! :o)