To like working in tech support, you have to be the most optimistic guy around. You have to be even more optimistic about the product you support than the sales guy trying to sell it. Why? Because however fantastic the product is - jam-packed with jaw-dropping features - as a tech support guy you will only ever witness the bugs. However, the bugs are not what annoys me. Well, at least most of them don't. :o) All software necessarily has bugs. They are my job, the very reason my job exists. What really annoys me is when I know that there is a problem, but the RAS package is just not good enough to let me troubleshoot it.
Therefore, I was pleasantly surprised when I read the release notes of the Fabric OS v7.1 code stream. There are a lot of tweaks and features that make the life of a troubleshooter easier. And it's not only about finding problems, it's about preventing them, too. So here is just a first selection of what I like:
Can I trust the counters?
"FOS v7.1 has been enhanced to display the time when port statistics were last cleared." says the release note. This sounds trivial, but it's essential for the troubleshooting of many problem types like performance problems, physical problems and so on. Times when we had to go through the CLI history - in the hope that the counters were cleared via CLI after a proper login - seem to be over now.
Link Reset Type in the fabriclog
A small enhancement, but a time-saving one. To get a time-based overview of the state changes of the ports, you usually have a look into the fabriclog. But there you often only see that there were link resets. The interesting thing would be to find out who initiated them - the local port or the remote one. The LR_IN and LR_OUT counters in portshow were an insufficient source of information here, as they only show absolute numbers. In Fabric OS v7.1 the type is simply part of the message and you see it at a glance.
Correct SFP information
For many admins the best practice to replace an SFP is to disable the port, replace the SFP and afterwards re-enable the port again. I know many people who did this, and I always felt uncomfortable telling them, "Rip it out while it runs, otherwise the switch won't recognize it correctly." But that's the way it is before v7.1: If the port is not running while you replace an SFP, the switch might not notice that, for example, the 4G LW SFP that was in there before is now an 8G SW SFP. Beyond any ugly follow-on bugs this could cause later, the behavior itself was a pain. In v7.1 you don't have to worry about that anymore: sfpshow will show you the correct information. Additionally, sfpshow will also tell you when the last automatic polling of the SFP's serial data took place.
Honest long distance
If you read SAN Myths Uncovered 2: The LD mode (Brocade) on my blog before, you know that the whole long distance stuff in Brocade switches is a little bit... let's say "optimistic". For long distance ISLs (other than long distance end-device connections) you only configure the length of the connection and the switch calculates the necessary amount of buffers. But as it does that using the maximum frame size, you'll end up with a buffer shortage in basically all real-world use cases. In Fabric OS v7.1 new functions take this fact into account. The command portbuffershow (by the way, a mandatory candidate for every data collection) will now show you the average frame size. So sooner or later I can mothball my article about How to determine the average frame size. And this value can then be used to optimize the buffer settings in the completely overhauled portcfglongdistance command. Now it will calculate the buffers based on your average frame size. Furthermore, it allows you to configure the absolute number of buffers yourself if you want. You no longer need to tell your switch that a distance is 200km just to assign enough buffers to span 60km with a real-world average frame size far below the maximum one. It's that kind of clarity that prevents misconceptions and avoidable performance problems.
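To get a feeling for the numbers, here is a minimal sketch of the underlying arithmetic in Python. It assumes the common rule of thumb that you need roughly one credit per 2 km per Gbps of line rate to keep a link busy with full-size frames; the figures are illustrative, not the exact algorithm FOS uses:

import math

FULL_FRAME = 2112  # bytes; maximum Fibre Channel data field size

def credits_needed(distance_km, speed_gbps, avg_frame_size=FULL_FRAME):
    # Rule of thumb for full-size frames: about distance * speed / 2
    # credits are needed to cover the round-trip time. Smaller frames
    # mean more frames in flight over the same distance, so the credit
    # count scales up accordingly.
    full_frame_credits = distance_km * speed_gbps / 2
    return math.ceil(full_frame_credits * FULL_FRAME / avg_frame_size)

print(credits_needed(60, 8))       # 240 credits with full-size frames
print(credits_needed(60, 8, 700))  # 725 credits with a 700-byte average

With a 700-byte average frame size, a real 60km link at 8G needs about as many credits as a hypothetical 180-200km link with full-size frames - which is exactly the "lie about the distance" workaround described above.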
This is not an exhaustive list of all the good new things. There are definitely more RAS-related features, like enhancements for credit recovery, Diagnostic Ports, FDMI, Edge Hold Time, FCIP and many others. In my eyes they'll make the platform even more robust, and after all it will hopefully give me a little more time to write more blog articles in the future. :o)
Oh wait... is this the call to update to v7.1 immediately?
Well, no, it's not. It's just an outlook for the things to come. Better plan your updates carefully. You know, it's just a blog article by the most optimistic guy around... ;o)
Reach this article at https://ibm.biz/stuckvc.
It's summertime again, and for some of our customers it's time to do their Fabric OS updates. Maybe you want to do that, too? I personally recommend a six-month interval, going to the latest or the latest "mature" code, depending on your policy.
When you update to one of the latest v6.3x, v6.4x or v7x codes you might see your switch error log flooded by a new error message after the update:
2012/06/12-07:01:34, [CDR-1011], 1001, SLOT 6 | CHASSIS, WARNING, M48Fab1, S5,P-1(35): Link Timeout
on internal port ftx=10203920 tov=2000 (>1000) vc_no=16 crd(s)lost=3 complete_loss:1
This was for a 2109-M48 (Brocade 48000) with a Condor ASIC. For a DCX with Condor2 ASICs it would look like this:
2012/06/12-10:45:11, [C2-1012], 9482, SLOT 7 | CHASSIS, WARNING, DCXFab1, S1,P-1(3): Link Timeout on
internal port ftx=39298539 tov=2000 (>1000) vc_no=16 crd(s)lost=1 complete_loss:1
Did the update break something?
No. Brocade just implemented a check for "stuck VCs" and it found one in your director. So it was there before, but now, after the update, the Fabric OS is able to point at it and generates a warning message about it.
What is a stuck VC?
I explained VCs (Virtual Channels) a bit in the updated version of my article about "How to NOT connect an SVC in a core-edge Brocade fabric" and the one about Quality of Service. As I wrote there, each VC has its own buffer management - its own buffer credit counter and special VC-related 4-byte words (VC_RDYs) that re-fill only the buffer credits of that particular VC. A normal link to a device usually has only one buffer credit management, and if buffer credits are lost over time, performance usually decreases; once the last buffer credit is lost, a link reset is issued after 2 seconds to regain the credits. Internal backlinks between cards in a director can lose buffer credits, too. But as they can only lose a buffer credit belonging to one VC, the other VCs may still have buffer credits. So while the other VCs continue to run without any problems, only the VC which lost its credits is affected. This is what's called a "stuck VC".
Wait! How can buffer credits be lost?
There are several reasons, but I think the likeliest and most understandable one is a bit error corrupting a VC_RDY. If a bit is flipped in the VC_RDY, the receiving port cannot recognize it anymore. The credit is lost. But "a few" bit errors are acceptable even in the Fibre Channel protocol, so this can happen even if everything works within the specs. The important thing is to detect it and react properly.
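To make the bookkeeping tangible, here is a toy model in Python of per-VC buffer credits on a backlink. It is purely illustrative (the real accounting happens in the ASIC), but it shows how lost VC_RDYs leave one VC stuck while the others keep flowing:

class Link:
    def __init__(self, credits_per_vc):
        self.credits = dict(credits_per_vc)  # VC number -> remaining credits

    def send(self, vc):
        if self.credits[vc] == 0:
            return False              # no credit left: transmitter must wait
        self.credits[vc] -= 1         # frame occupies a buffer on the far side
        return True

    def vc_rdy(self, vc):
        self.credits[vc] += 1         # receiver freed a buffer for this VC

link = Link({2: 5, 3: 5, 4: 5, 5: 5})
# A VC_RDY corrupted by a bit error is simply never recognized: that credit
# is gone. If it keeps happening on VC 4 and no credits come back:
for _ in range(5):
    link.send(4)
print(link.send(4))   # False - VC 4 is "stuck"
print(link.send(2))   # True - the other VCs still flow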
So I get these new messages and they tell me I have a problem. What now?
With FabOS v6.4.2a (and v6.3.2d, v7.0.0) Brocade extended the bottleneckmon command with an additional agent. This agent reacts to stuck VC conditions by doing a link reset on the specific backlink. That's a big improvement over the older codes, where stuck VCs on internal links between two blades required reseating one of the blades or powering it off for a moment.
But it's disabled and you have to switch it on!
To enable it, run:
bottleneckmon --cfgcredittools -intport -recover onLrOnly
Once enabled, the agent will monitor the internal links, and if there is a 2-second window without any traffic on a backlink with a stuck VC, it will reset that link to resolve the stuck VC. This approach minimizes the impact of the link reset. It could still happen that you see a few aborts in the host logs - these are usually self-recovering. After that the messages should stop and you can use the full internal bandwidth of your switch again.
Please have a look at the help page of the bottleneckmon command ("help bottleneckmon") for more information. And if you still get messages pointing to lost credits, please open a case and we'll have a look.
In one of my previous posts I wrote about "Why inter-node traffic across ISLs should be avoided". There is an additional "bad practice" that could lead to performance problems in the host-to-SVC traffic.
Let's imagine a core-edge fabric. A powerful switch (or director) in its center is the core. The SVC and its backend storage subsystems are directly connected to it. Besides that, there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC, and the SVC in turn is the only "host" the other storage systems see. From time to time I see a cabling like the one below. The devices are connected in a regular pattern: for example, SVC ports always on ports 0, 4, 8, ... or, on a director, on ports 0 and 16 of each card... something like that. The reason behind this is often to spread the workload over several cards/ASICs to minimize the impact of a hardware failure. But there's a risk in doing so.
Index Port Address Media Speed State Proto
0 0 190000 id 8G Online FC F-Port 50:05:07:68:01:40:a2:18
1 1 190100 id 8G Online FC F-Port 20:14:00:a0:b8:11:4f:1e
2 2 190200 id 8G Online FC F-Port 20:16:00:80:e5:17:cc:9e
3 3 190300 id 8G Online FC E-Port 10:00:00:05:1e:0f:75:be "fcsw2_102" (downstream)
4 4 190400 id 8G Online FC F-Port 50:05:07:68:01:40:06:36
5 5 190500 id 8G Online FC F-Port 20:04:00:a0:b8:0f:bf:6f
6 6 190600 id 8G Online FC F-Port 20:16:00:a0:b8:11:37:a2
7 7 190700 id 8G Online FC E-Port 10:00:00:05:1e:34:78:38 "fcsw2_92" (downstream)
8 8 190800 id 8G Online FC F-Port 50:05:07:68:01:40:05:d3
The SAN perspective
In the situation described above, all host traffic passes the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre and only one signal passes it physically at any moment. The virtual channels are just dedicated portions of the link's buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking at the destination Fibre Channel address.
Technical deep dive
A normal non-QoS ISL has 4 virtual channels for data traffic. On an 8G link each of them has 5 buffers. They can only work with these 5 buffers; there is no way to "borrow" some out of a common pool like on QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):
VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1
Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which should normally be enough for a multimode connection between two switches in the same room with only a few metres of cable. To pick the VC, the switch basically uses the last two bits of the second byte of the destination address (a small sketch follows the list below):
Bits 00 -> frame uses VC 2 (which is the first virtual channel for data)
Bits 01 -> frame uses VC 3
Bits 10 -> frame uses VC 4
Bits 11 -> frame uses VC 5
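Here is a minimal sketch of that mapping in Python. It assumes the default addressing scheme, where the middle byte of the 24-bit address is the port number on the destination switch:

def data_vc(pid):
    # pid is the 24-bit Fibre Channel address, e.g. 0x190400
    area = (pid >> 8) & 0xFF       # second byte of the address
    return 2 + (area & 0b11)       # last two bits select VC 2..5

for port in (0, 4, 8, 12):
    pid = 0x190000 | (port << 8)
    print(f"port {port} -> address {pid:06x} -> VC {data_vc(pid)}")

All four example ports land on VC 2 - which is exactly the imbalance discussed next.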
So where's the problem now?
In our imaginary core-edge fabric, where for example all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ... all host I/O towards the SVC would use the same virtual channel. As this is the only traffic that passes the ISLs from the edges to the core, only a quarter of the buffers are actually used! 5 buffers are very heavily used while 15 idle around, never to be filled. And 5 buffers are pretty few when an edge switch full of hosts wants to speak with the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.
How to solve that?
There are 3 possibilities:
1.) You could re-cable your SAN in a manner that uses all VCs. But besides the risk of physical problems and problems introduced by maintenance actions, the devices have to learn the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and a risk of outages.
2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and the platform. While it avoids the physical actions, it still has the disadvantages for the hosts caused by changed addresses.
3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link compared with the 20 for a normal 8G E-Port) but will also collapse the 4 user traffic VCs into just one. It then looks like this:
VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1
So all buffers - and therefore all buffer credits - will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode, but apart from that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
Just something strange for the end: Some switches are delivered from manufacturing with an alternative addressing pattern. For example, port 1 of domain 3 won't have the address 030100 then, but something like 030d00. In that case the problem can happen similarly, but on other ports. Using LE mode would solve it in exactly the same way.
Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.
Additionally, please be aware that this is not an SVC problem per se; it will happen with any central storage connected to a switch using a pattern like the one described above and accessed by hosts connected to another switch over an ISL!
Update from May 9th:
I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea of just setting all their ISLs to LE mode, regardless of whether the condition described above is actually met. Because of that, I would like to state it more clearly: Using LE mode as a general approach for your ISLs can cause severe problems!
If the SVC ports are not connected in a way that makes all traffic use only one Virtual Channel, it actually makes sense to have ISLs with more than one VC. Virtual Channels are a good feature to prevent a latency bottleneck caused by back pressure from impairing the traffic of all devices using the same ISL. If devices on the edge switches also communicate with devices connected to other ports of the core (or to other edges), the impact of using LE mode would be even more extreme in the case of slow drain devices.
I made some drawings to illustrate this. The first one shows 1 normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):
Here host 1 and 2 make traffic against the SVC (green), host 3 against an additional disk subsystem (purple) and host 4 against a tape drive (orange). Based on the ports these devices are connected to, other VCs are used for that traffic.
If you would use an LE-port instead, it would look like this:
Now all 4 data traffic VCs collapsed to a single one. As long as everything runs smoothly, you won't see an impact.
But if, for example, one of the devices connected to the core is slow draining, the following will most probably happen:
In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL will be a latency bottleneck, because all data traffic shares the same VC in LE-mode. The back pressure goes further towards the edge switch and all 4 hosts of our example are affected now although only host 3 communicates with the slow drain device!
With a normal E-port it looks like this:
Now only VC4 is affected while VC2, 3 and 5 are running smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while hosts 1, 2 and 4 are running fine.
You see: Using LE mode for the purpose described in my original article only makes sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!
I claim that in 2012 performance problems will keep their place amongst the most frequent and most impactful problems in the SAN. In many cases the client's users really notice a performance impact and so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are also cases that don't really look like performance problems at first but turn out to have the same root causes. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or - even worse - failing recoveries: these could all be "performance problems in disguise".
When I then analyze the data and find out that a slow drain device or congestion is the real reason for the problem, I write my findings down and try to give the client some hints about possible next steps - for example by mentioning my earlier blog article about How to deal with slow drain devices.
Do you know what's mean about it?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to point the finger at another vendor's product. Of course I usually explain what it is and what it means for the fabric and the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: Nothing!
So why is that so?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association while there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to its built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible, and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o).
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames are stored within the sending device, blocking buffers, and eventually have to be dropped if they can't be sent for a certain time (usually 500ms).
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner. (e.g. because of buffer credit starvation or congestion)
Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.
Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)
I updated the definitions with an additional sentence. Feel free to comment.
When Brocade released FabricOS v6.0 in 2007, Quality of Service sounded like a great idea: It allows you to prioritize your traffic flow down to the level of certain device pairs. There are 3 levels of priority:
High - Medium - Low
Inter Switch Links (ISLs) are logically partitioned into 8 so called Virtual Channels (VCs). Basically each of them has its own buffer management and the decision which virtual channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.
Quality of Service takes this one step further.
QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL and these buffers are equally distributed over the data VCs. (There are some "reserved" VCs for fabric communication and special purposes.) The number of VCs is what makes the priorities work - the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one; medium lies in between. So the more important I/Os benefit from more resources than the not-so-important ones.
Sounds like a great idea!
Theoretically you can configure the traffic flow in your fabric - in terms of buffer credit assignment - in a very fine-grained way. But that's in fact also the big crux: You have to configure it! That means you actually have to know which host's I/O to which target device should get which priority. Technically, you create QoS zones to categorize your connections. Low priority zone names start with QOSL, high priority zone names start with QOSH (for example QOSH_esx01_svc). Zones without such a prefix are considered medium priority.
But how to categorize?
That's the tricky part. The company's departments relying on IT (virtually all of them) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project for weeks and months than a simple configuration. And there is a lot of psychology in it. Besides that, you really have to know in detail how QoS works to design a prioritization concept. For example, if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low ones could even perform better. In the four years since its release I have seen only a couple of customers really attempting to implement it.
In addition you need to buy the Adaptive Networking license!
So why should I care?
If QoS is such a niche feature, why blog about it? Usually a port is configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE", which means auto-enabled - in other words "on". An 8G ISL will be logically partitioned into the 16 VCs as described above and the buffer credits will be assigned to the high, low and medium priority VCs. But that does not mean that you can actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
If you don't plan to design a detailed and well-balanced concept of the priorities in your SAN environments, I recommend switching off QoS on the ports. I don't say QoS is bad! In fact, with the Brocade HBAs' ability to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you get a better way to cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!
Better know the features you use well - or they might turn against you...
As this was not clear enough in the text above and I got a question about it, please be aware: Disabling QoS is disruptive for the link! In most FabricOS versions, in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will become effective with the next reset of the link. In that case you have to portdisable / portenable the port yourself.
As this is a recoverable, temporary error, your application most probably won't notice anything, but to be on the safe side you should do it in a controlled manner and - if really necessary in your environment - in times of little traffic or even in a maintenance window. The command to disable it is:
portcfgqos --disable PORTNUMBER
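If your combination of FabricOS and platform only applies the change at the next link reset, the complete sequence could look like this (port 5 is just an illustrative example):

portcfgqos --disable 5
portdisable 5
portenable 5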
Ha! Who says only EMC can do dull product placement? :o)
To be honest, the title of this article could also be "How to ease the life of your technical support". But in fact it will ease the life of everyone involved in a problem case, and priority #1 is to solve upcoming problems as quickly as possible.
In the article The EDANT pattern I explained a structured way to transport a problem properly to your SAN support representative. In addition it might be a good idea to prepare the SAN for any upcoming troubleshooting.
The following suggestions are born out of practical experience. They are intended to help you get rid of all the obstacles and showstoppers that could disturb or delay the troubleshooting process right from the start. Please treat them as well-intentioned recommendations, not as pesky "musts". :o)
Synchronize the time
Having the same time on all components in the datacenter is a huge help during problem determination. Most devices today support the NTP protocol. So the best practice is to have an NTP server (plus one or two additional ones for redundancy) in the management LAN and to configure all devices (hosts, switches, storage arrays, etc.) to use them. It's not necessary to have the NTP server connected to an atomic clock. The crucial thing is to have a common time base.
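On Brocade switches, for example, the external time server can be set with the tsclockserver command (the IP address is purely illustrative):

tsclockserver "10.10.10.1"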
Have a troubleshooting-friendly SAN layout
What is a troubleshooting-friendly SAN layout? I don't only mean that it's a good idea to always have an up-to-date SAN layout sketch at hand - which is very helpful in any case. What I mean is a SAN design that is free of artificial obscurities. If you have 2 redundant fabrics (yes, there are still environments out there where this is not the case), it's best practice to connect all the devices symmetrically. So if you connect a host to port 23 of a switch in one fabric, please connect its other HBA to port 23 of the counterpart switch in the redundant fabric.
Use proper names
It may sound laughable, but bad naming can do a lot of harm. I think 4 points are important here:
- The naming convention - It may be funny to have server names like "Elmo", "Obi-Wan" or "Klingon", but for troubleshooting it may be better to have some useful info within the name - something like BC01_Bl12_ESX, for example (for BladeCenter 1, Blade 12, OS is ESX).
- Naming consistency - It's even more important to actually use the same names for the same item. It's very helpful if, for example, the host has the same name in the switch's zoning, in the storage array's LUN mapping and on the host itself.
- Unique domain IDs - The domain ID is like the ZIP code of a switch, and according to the Fibre Channel rules it has to be unique within a fabric. But in addition to that, it is very helpful to keep it unique across fabrics as well. Domain IDs are used to build the Fibre Channel address of a device port - the address used in each frame. Within the connected devices' error logs (hosts, storages, etc.) these Fibre Channel addresses are often the only references to the SAN components (see the small sketch after this list). Being able to tell at any time which paths over exactly which switch are affected is priceless.
- Brocade: chassisname - As Virtual Fabrics become more and more of a standard in Brocade SANs, it's crucial to set the chassisname, because the switchname is bound to the logical switch, not to the box. These chassisnames are used for the naming of the data collections (supportsaves), and if you don't configure them, the device type will be used instead. So you'll most probably end up with a huge collection of supportsave files which differ only in the date. The chassisname can easily be set with the command "chassisname". That's one small step for... :o)
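To illustrate the domain ID point: in the default addressing scheme, the 24-bit Fibre Channel address is simply three bytes concatenated. Here is a minimal sketch in Python (the address 190400 is just an example; alternative addressing schemes can break the simple port mapping):

def split_pid(pid):
    # 24-bit Fibre Channel address: domain | area (port) | AL_PA
    domain = (pid >> 16) & 0xFF   # the switch's domain ID
    area   = (pid >> 8) & 0xFF    # the port on that switch (default scheme)
    alpa   = pid & 0xFF           # loop address, typically 0x00 for F-Ports
    return domain, area, alpa

print(split_pid(0x190400))   # -> (25, 4, 0): domain 25, port 4

If domain 25 exists only once across all your fabrics, an address like 190400 in a host's error log immediately tells you which switch and port a path runs over.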
Use a change management
I cannot emphasize this enough: Please use a change management. Even for the smallest SAN environment, where you would think "Nah! That's my little SAN, I can keep all the stuff in my head." Even for the biggest SAN environment, where you would think "Nah! Too many people from too many departments are involved here. The SAN is living and evolving every day." Besides any internal policy and external requirement (change management methods are mandatory for several industries), a proper change management also helps in the troubleshooting process. If you can come up with a complete timeline of all actions done in the SAN, and the assertion that no unplanned maintenance actions are done in the SAN during the problem determination, you will have a very happy SAN support member :o)
Backup your configuration
Bad things can happen every day. Things that wipe parts or all of your switches' configuration, or - even worse - turn them into useless doorstops. It's not likely to happen, but if and when it does, you'd better be prepared. To be up and running again as soon as possible, you should not only back up your user data but also your configurations on a regular basis. For Brocade switches use "configupload", and for Cisco switches copy the running-config to an external server. The SAN Volume Controller (SVC) and the Storwize V7000 have options to back up the configuration in their GUI as well. Besides that, it helps a lot to also store all the license information for your switches in a well-known place. At least for the SAN switches, IBM cannot generate licenses and there's also no "emergency stock" for licenses. The support would have to open a ticket at the manufacturer and clarify the license issue with them. This might cost precious time in problem situations.
Keep your firmware up-to-date
This advice often smacks of a shot from the hip, something like "Did you reboot your PC?" in PC tech support. But to be fair, it's not just the SAN support member's blanket mantra. No software is absolutely bug-free, and because of that there are patches or - in the SAN world - more likely maintenance releases. Often there are parallel code streams: newer ones with more features but a higher risk of new bugs, and on the other hand older ones with a long history of fixed defects and a "comfortable" level of stability, but most probably already with an "End of Availability" in sight. And between these two extremes are the mature codes, like the v6.3x code stream for Brocade switches. It doesn't have the latest features, but a good amount of "installed hours" all over the world. It is still fully supported, so if you really ran into a new bug, Brocade would write a fix for it. It's essentially the same for Cisco and for our virtualization products.
So it's up to you. If you want the new features, you have to use the latest code. If you don't need them at the moment, the latest version of a mature code stream might be better for you. Of course you have to align these considerations with the recommended or requested versions of the connected devices, as some really require a specific version. A best practice is to update the switches - and if possible also all devices - proactively twice a year, besides any additional recommended updates due to problem cases where a particular bug has to be fixed. If you need support with all the planning and doing, please contact your local IBM sales rep for an offering called Total Microcode Support. These guys will check the SAN environment, including the attached devices, for their firmware levels and will come up with a consistent list of recommended versions which should be compatible and cross-checked. Another view on the topic comes from Australian IBMer Anthony Vandewerdt in his Aussie Storage Blog.
Think about your features
Speaking of code updates and features, it's of course a good idea to actually read the release notes. They contain crucial information about the version and should also explain new features. The crux of the matter is that there could be new features that you actually do not need, and some of them will be enabled by default. One of these examples is the Brocade feature "Quality of Service" (short: QoS). In simple terms, it "partitions" the ISLs to grant highly prioritized traffic some kind of "right of way" over medium or low prioritized traffic. Buffer-to-buffer credits will be reserved for the different priority levels to enable this. But to really use it, you actually have to decide which traffic falls into which category. You do this with so-called QoS zones. If you don't configure the zones but leave QoS enabled, all the traffic is categorized as medium prioritized and you don't use the resources reserved for the high and the low priority. In times of high workload this might end up in an artificial bottleneck resulting in frame drops, error recovery and performance problems. This is only one example showing that it's better to be aware which additional features are activated and whether you really need them.
Know the support pages
IBM, like other vendors, has a comprehensive "Support" section on its homepage. It offers loads of information, manuals, links to code downloads, technotes and flashes. It's also possible to open and track a support case there via the web. With all the stuff on these pages and all the products IBM offers support for, you might get lost a bit. Our "IBM Electronic Support" team (@ibm_eSupport) is constantly optimizing these pages, but hint number one is: Register for an account and set up these pages as you like them. Then you have your products at hand and you find all related information easily. And if you have some spare time (do you ever?), just have a look around on the support pages. There might be useful hints or important flashes concerning your IBM products.
As always this "list" isn't exhaustive and you probably did additional things to be prepared for problem determination. Feel free to share them in the comments below. Thank you!
Ever since you wanted to read that interesting analyst paper, that compelling best practice guide or that promising market study, you have regretted typing your email address into that innocent-looking form. Now you get them day after day: newsletters for stuff you don't really care about.
But there are newsletters that really make sense. The Cisco Notification Service (CNS) is one of that kind. It's a very good way to keep yourself up-to-date with support-relevant news about your Cisco storage networking products (and well, yes, any other Cisco product, too). The only thing you need is a cisco.com user account.
And the best thing: You get exactly what you're looking for. So here is how to configure it:
First go to http://www.cisco.com/cisco/support/notifications.html
After you logged in using your CCO password the page should look like that:
In the Profile Manager you can administer multiple notifications. Let's create the first one by clicking on "Add Notification".
Put in a proper name for the notification on this screen. You can also choose the type of the notification. You can have an email with links and summaries (default), an email with links only, or even an RSS feed. For the emails you can choose daily, weekly or monthly summaries, and the feed can be configured for today, or for 7 or 30 days. The recipient address for the emails can be changed, too. So if you work in a team, you could set it up to send to the team's group email address or a distribution list.
For the Topic Type you have 3 options: product-centric, alert-centric or based on a particular Bug ID. Whether you choose the product or the alert approach is up to you. That mainly depends on your role (remember: I write this blog for both admins and tech support people) and the number of different products and topics you are interested in. The tracking of bugs can also be configured directly from the bug tool, which makes a bit more sense in my eyes. So for the moment, let's stick with the alert-centric approach.
As this specific notification was about software alerts, I chose "Software Updates" here. You can also see the other options like the EOL (End-of-Life) info, the Field Notices, known bugs and security alerts. Again, your choice totally depends on your needs. Keep in mind: You can manage multiple notifications, and maybe you want to select different notification types (email / feed) with different frequencies for different topics. The main goal is still to receive the notifications in a manner that you will still be willing to read a month from now - otherwise it's no better than the negative examples from the beginning.
Now you choose the products you want to be notified about. The MDS products belong to "Storage Networking".
Just click along the tree of products. You can be very specific or just use the "All..." option from a subcategory.
As you see in the picture above, you can apply for very specific hardware and firmware combinations, like NX-OS 6.2(3) for the MDS 9222i, or more general ones like the entries above. To add a product, just click on "Add another subtopic". If you have everything you need, click on "Finish" to store and activate this notification configuration and return to the notification profile manager.
For each notification item you can see the status and the expiration date. Yes, Cisco won't spam you with emails until your personal EOL just because you forgot how to get rid of them: the notification we just created is only valid for one year. If you still find it useful then, you have to renew it. And yes, you will be notified by mail if you have notification configurations that are about to expire.
And if you notice that the setup wasn't optimal - for example you want to change the frequency, email address, notification type or products - just click on the edit button on the right side of the notification's header (the red-circled one). Here you can also copy it, or even delete it if you are not interested anymore.
You see, there are a lot of possibilities, but the configuration is quite simple and straightforward.
So try it and keep yourself up-to-date, because surprises are something for a birthday party. Not for a storage environment :o)
Here is a new one. If you think a little out of the box, it might be an easy one. But this time, no rot-13 solution :o)
PS.: Don't worry, seb's sanblog won't turn into a puzzle blog. There will be new SAN related articles :o)
Just had some picture puzzles in my head. Here is one :o)
Solution: "N pybhq nepuvgrpg qrcyblf n cevingr pybhq".
Well, this year passed by at high speed. How perception can change... I felt 2011 was a pretty long year. Our first child was born and the life of my wife and me was turned upside down. It was a lot of work, but good work! And I felt my body and brain adjusting to the demands. Time felt like it was going by slower. Maybe I was just rushed with adrenaline for several months. This year was different. Many things happened at my job and time flew by. Most of them were internal matters - important for me, and maybe even interesting - but unfortunately nothing I could blog about here. Sure, I'm still deeply involved in the topic of SAN troubleshooting, but there was so much else to do in 2012.
So what to expect from 2013?
Well, I will still be here, blogging from time to time. Hopefully a little bit more than in 2012. Let's see if that really works out, because we expect our second baby to be born in mid-February. So I hope for the adrenaline to kick in again. :o) There are still a lot of ideas in my mind and SAN troubleshooting is an ongoing thing. I'm here to share my experiences from my work as a SAN PFE (Product Field Engineer) in the IBM ESCC. Usually I don't get much feedback about it, but what I got was really good, and I'm happy that I was able to help in a good number of situations. In 2012 the blog had about 254k hits, which I think is a good amount given my relatively small target group of SAN admins, designers and troubleshooters. Of course, I don't earn any money with ads or the like :o)
But often enough in 2012 I felt it wouldn't do any harm if I expanded my scope a little bit. So in 2013 I plan to stop restricting myself and to write a little more about other storage-related topics as well. At the moment I'm not really sure whether I will do that within this blog or create a new one. I'm leaning towards the first option, but if you, my dear reader, have some good reasons to keep the sanblog "clean", I'll consider them, too.
Until then... Have a good start into 2013 and Happy New Year!
A slow drain device often has a huge impact on the performance of many other devices in a SAN environment. That happens because it blocks resources in the fabric that other devices use as well. The prime example of such a resource are ISLs, particularly the Virtual Channel(s) within those ISLs that are used to reach the slow drain device. But as soon as you have an appliance in the SAN, it can turn into such a blocked resource as well.
Disclaimer: There are several definitions and types of appliances. Within this article an appliance is a device "in the middle" between the hosts and the storages with a specific task, such as a compression, encryption, virtualization or deduplication appliance. While I had the SAN Volume Controller (SVC) in mind when I wrote this, it applies to many other products matching this definition. The common factor is that the performance they can provide depends to some degree on their destination devices' performance.
Fortunately, many of the fabrics I have seen over the recent years were designed using a core-edge approach. If a device is in the communication path of many of the devices in a SAN, it's best practice to attach it directly to the core. But a slow drain device can still block it. This is how it happens:
In this sketch the appliance sends data towards a slow drain device. The slow drain device is not able to process the incoming frames quickly enough - they pile up in its HBA's ingress buffers (1). As the appliance is still sending frames but the edge switch cannot forward them to the slow drain device, they also pile up in the ingress buffer of the ISL port of the edge switch (2). This alone could already impair the performance of the other hosts connected to the same edge switch as the slow drain device - if the frames towards them use the same VC. Some microseconds later the same might happen to the frames from the appliance entering the core (3). They pile up there as well, and as soon as that happens, this so-called back pressure reaches the appliance itself. As there are no VCs on the F-to-N-port connection used to attach the appliance to the core, the chance is high that the appliance cannot send any frames out to the SAN anymore - no matter to which destination (4).
Well, that means you just turned your appliance into a slow drain device itself! The performance of the whole environment is heavily impaired now:
In step (5) the frames from the other hosts towards the appliance pile up in the core as well and then the back pressure spreads further to the hosts connected to the edge switches as well (6).
Worst case, hmm?
After the ASIC hold time is reached (usually 500ms), the switches will begin to drop frames to free up buffers again. But as all switches have the same ASIC hold time, you'll end up in the situation that while the edge switch reaches these 500ms first, the core switch will start to drop its frames as well - before the buffer credit replenishment information (VC_RDY) from the edge switch arrives. So not only the frames from the communication with the initial slow drain device will be dropped, but most of the others down the path as well. And as the appliance itself has turned into a slow drain device, the same might happen to the frames piled up because of that, too.
So what to do against it?
The first thing is: give the F-Ports of the appliance as many buffers as possible. Priority #1 should be that the appliance is able to send its frames out into the fabric: the chances are then higher that, once the frames of the open I/Os against the slow drain device are out there, some buffer credits are still left to send frames to other devices. For clustered appliances like the SVC it's even more important, because they use these ports for their cluster-internal communication as well. Blocked ports could then result in cluster segmentation (SVC: single nodes rebooting due to "Lease expiry"). To assign more buffers to the switch port (= more buffer credits for the port of the appliance), use
portcfgfportbuffers --enable [slot/]port buffers
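For example, to give port 4 on slot 2 sixty buffers (the numbers are purely illustrative, following the syntax above):

portcfgfportbuffers --enable 2/4 60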
Update: Please keep in mind, that adding more buffers to an F-Port is of course disruptive for the link!
To check how many buffers are available, you can use the portbuffershow command.
But in many cases this is not enough. Some time ago Brocade released the Fabric Resiliency Best Practices paper with some good advice. In my opinion, every SAN admin with Brocade gear should have read it. It recommends:
- Use Fabric Watch to get alarms for frame timeouts. (Erwin von Londen wrote a good article about that.)
- Use Port Fencing to isolate slow drain devices. (Read Erwin's post about that, too.)
- Configure and use the Edge Hold Time.
- Configure bottleneckmon to get alarms for latency and congestion bottlenecks.
While Fabric Watch is used more and more, and while I see some of our customers using port fencing - especially in the FICON world, but also for open systems - I hardly see anyone utilizing the Edge Hold Time feature. For a situation like the one described above, it could really improve things for the appliance and the other hosts dramatically. It can be set to any value between 100ms and 500ms and was introduced in FOS v6.3.1b. So if you expect hosts connected to an edge switch to show slow drain behavior in certain situations, in my opinion the Edge Hold Time of that switch should be set as low as possible. Of course it always depends on your environment and how likely it is to be impaired by a slow drain device, but 100ms is a long time in a SAN. If you also have some legacy devices connected to these edge switches, check whether a decreased hold time could be a problem for them.
It can be enabled and configured using the "configure" command, where it can be found under the "Fabric parameters" section:
Not all options will be available on an enabled switch.
To disable the switch, use the "switchDisable" command.
Fabric parameters (yes, y, no, n): [no] yes
Configure edge hold time (yes, y, no, n): [yes]
Edge hold time: (100..500) 
You don't need to disable the switch to change the Edge Hold Time and as one of the fabric parameters it will be included in a configupload.
As it seems to be used very seldom in the field, I would like to get some feedback on whether you have actually used it. Please give me a hint if - and in which situation - it helped you. Thanks!
But don't forget: The most important thing is to get rid of the slow draining behavior!
I check the referrers of this blog from time to time to get to know where my readers are coming from. For many of them I cannot actually see it, because often "bridge" pages are used - for example by the social networking sites. But a fair amount comes from searches on Google and other search engines. Some search queries there seem to repeat very often. Maybe I will write more articles about the others - because hey, this seems to be the stuff you're coming for :o) - but this time it's about congestion bottlenecks.
Congestion bottlenecks - besides latency bottlenecks - are one of the two things the Brocade bottleneckmon can detect. The default setup will alert you - if you enabled bottleneckmon with alerting - for all situations where 80% of the seconds within a 5-minute interval had at least 95% link utilization. That is a big number! Of course you can also modify the setup to be more aggressive, or to spare you some messages in an environment that is usually "under fire"...
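Here is a minimal sketch of that default trigger arithmetic in Python - not Brocade's actual implementation, just the threshold logic described above, assuming one utilization sample per second:

def congestion_alert(samples, window=300, severity=0.8, threshold=0.95):
    # samples: one link utilization value (0.0 - 1.0) per second
    recent = samples[-window:]
    busy = sum(1 for u in recent if u >= threshold)
    return busy >= severity * len(recent)

print(congestion_alert([0.97] * 250 + [0.50] * 50))   # True: 250 of 300 busy seconds
print(congestion_alert([0.97] * 200 + [0.50] * 100))  # False: only 200 of 300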
But I encourage you to take it seriously!
In my opinion, a healthy SAN should NEVER have congestion bottlenecks. By "healthy" I mean, of course, during normal operation. Not when you have an incident at the moment and there is no redundancy in some parts, because for example the second fabric has a problem or one out of two ISLs between two switches had to be disabled... I wrote an article last year about that and I think it fits well within this topic.
Rule of thumb: Link utilization should be up to 50% only.
And of course it should not be only 50% because you configured too few buffers! The setup of the link should always allow it to transport up to 100% of the workload that's physically possible. Otherwise you will have no real redundancy again!
But how to handle them now that I have them?
So you see these [AN-1004] messages in the error log and you know the port. What now? This is more about your SAN concept than about defects or software features. The congestion bottleneck happens because the utilization of a link approaches its physical capabilities. Here are some ideas:
- Start a real performance monitoring. Have a look at what the Advanced Performance Monitoring license can do for you. (Ask your IBM SAN support for a free 60-day Universal Temporary License.)
- For example use Top Talkers to find out where the heavy traffic comes from.
- Simple: Use more links - if you have enough resources. For ISLs: if you have free ports on both switches and the possibility to connect them over your patch panels - or, if needed, over the long distance between two sites - just add ISLs to spread the load.
- If you use a Traffic Isolation setup, check if it's done correctly and does not cost you too much bandwidth.
- Check if you can run the ISL at a higher line rate ("speed"). A higher line rate means you can actually transfer more data in the same time. But please keep in mind that higher line rates require better cable quality. If you only have 100m OM2 cables between two switches, increasing the line rate from 4Gb to 8Gb or even 16Gb will most probably result in problems for the link!
- For congestion bottlenecks on end-device links you should check the multipath driver first. Is the load spread over the available paths? If you have a plain active/passive failover multipather, it might be okay to have high load on the ports in use. But if you use round-robin load balancing, check whether you can add additional paths (more HBA ports) with common sense. Keep in mind that more paths mean longer failover times, and many devices have a maximum limit for paths!
- Server virtualization may allow you to move workloads to more suitable regions of your SAN environment to relieve the ones under pressure.
And often forgotten:
In many cases the congestion bottlenecks will be observed only at specific times. Usually the devices in your SAN don't have the same workload all the time. There is a time when people sleep, a time when people come to work and switch on their VDI'ed PCs, a time when the backups run and a time when big batch jobs run. Proper planning and scheduling is mandatory in today's data centers! Don't let the big workloads run at the same time. Spread them across the course of the 24 hours you have. The same is true for the course of the week, the month, the quarter, the year.
Very few environments are totally under-sized for the average workload mix - but the demand of this mix's components over time is the heart of your storage environment's performance!
If you need help to better manage your workloads, I'm sure your local IBM Sales rep or IBM business partner can bring you in contact with the right performance expert to work these things out for your special situation.
It's the nightmare of every motorist. Your car was just repaired a few days ago and now it stopped running in the middle of nowhere. Or you even crashed, because the brakes just didn't work in the rain. Fake parts are a big problem in the automotive industry. Original-looking parts from dubious sources could even work as expected in normal operations but when the going gets tough, the weak won't get going. So before a fake cambelt wrecks your engine or a fake brake pad costs your life, it might be a good idea to not save on the wrong things.
But a faked SFP?
Like a brake pad, an SFP is somewhat of a consumable. Light is transformed into an electrical signal and vice versa; this produces heat, and the components wear out over time. Some sooner, some later. If you bought the SFPs from IBM for a switch under IBM warranty or maintenance, broken SFPs will be replaced for free. But if you decide to buy an SFP yourself, you'll notice after a quick web search that there are a lot of suppliers out there offering "the same" SFP for a much smaller price than IBM. And with "the same SFP" I mean they offer the very same IBM part number - for example 45W1216. That's an 8G 10km LW SFP.
Is it really the same?
Of course not - although they claim it to be. Their usual explanation is that all these SFPs come from the same manufacturer anyway, that SFPs are built using open standards defined by T11, and that they should therefore be compatible per se. I can tell from several occasions: That's not true. There is of course more than one SFP manufacturer, and I'm sure each of you knows a handful offhand. In addition: Even in times before 8G there were SFPs working much better with certain switches than others.
With the 8G platform, Brocade decided to offer Brocade-branded SFPs and restricted their switches to support only these and to refuse others (besides very few exceptions for CWDM SFPs). So Brocade took control over which SFPs can be used, and they were able to fine-tune their ASICs to allow better signal handling and transmission. To enforce this, the switch checks the vendor information from the SFP to determine whether it's a Brocade-branded one. Cisco does the same for the SFPs in their switches.
Here is where the fake begins...
There are several vendors of devices to rewrite this SFP-internal information. By spoofing vendor names, OUIs (Organizationally Unique Identifiers) and part numbers, they try to circumvent the detection mechanisms on the switch. So independent suppliers buy "generic" bulk SFPs and "rebrand" them to sell them as "IBM compatible" with the same part number. And because IBM officially supports the part number (like announced here), one might assume everything will be fine then.
In fact it's not...
Imagine a migration project. The plan is in place, everything is prepared, the components are bought and onsite, all the necessary people are there in the middle of the night or during a weekend, and the maintenance window begins. And then the ports everything depends on just don't come online - only because someone negligently faked these "cheaper but still compatible" SFPs. I had a case where the same SFPs did work in one 8G switch model but not in another - also 8G - with exactly the same FabricOS.
In the sfpshow output they looked like this:
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 5401001200000000 200,400,800_MB/s SM lw Long_dist
Vendor Name: XXXXXX
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
The supplier did not write "Brocade" into the "Vendor Name" field (I replaced it with Xs), but in the "Vendor OUI" field he inserted the OUI from Brocade. In addition he also faked the "Vendor PN", but even used a wrong one - this is the PN for a shortwave SFP.
But besides being an ugly showstopper for the migration - driving costs far beyond what could have been saved by buying the cheaper parts - that's not even the worst case. Perfectly faked SFPs might be accepted by the switch, but you never know if they are really running fine. I don't wish anybody to be called at 3am about the crash of half the servers because an ISL started to toggle. Or to face increasing performance problems because every now and then a faked SFP "on the edge of the spec" devours a buffer credit by misinterpreting an R_RDY.
Troubleshooting this can be a pain in itself. And the money potentially lost on outages will hardly be compensated by the savings from cheaper SFPs!
I got the confirmation from IBM product management, that IBM itself will only deliver Brocade-branded SFPs for its current b-type SAN portfolio.
So if you have non-Brocade-branded SFPs in your 8G or 16G Brocade switches, be aware that they are probably not supported, and there could be some unplanned night or weekend working hours in your future...
Working on a refresh of my IBM-internal SAN problem determination course from last year, I stumbled over the first couple of slides again. They are a little bit of a "raison d'être" for the course - the answer to the question "Why should I learn that?". And they got me pondering again: How long will this still be relevant? Here's what I think:
Every once in a while someone restarts the discussion about whether tape will die soon. The same discussion comes and goes for Fibre Channel as a whole. There are lots of people predicting each year that the SAN as a concept is dying. The cloud is here today - not some spooky future concept, but deployed in many forms and flavors. And there are stacks, too: a whole data center in a rack. Pre-integrated, pre-configured, pre-optimized, pre-fueled with software and "expertise".
So does it still make sense to build up SAN skills?
If all the expertise is already in the product, why spend time becoming an expert? If management UIs become childishly simple, why should a company pay certified specialists? Why learn all the stuff and get certified in a world where storage comes as a commodity? And why are there still people out there saying everything becomes more and more complex, when it appears that everything is getting so easy and simple now?
A look back...
Unfortunately, in many regions of the world water is not a commodity. It has to be brought from remote sites over long distances. Often there are people whose only task is to fetch the water. And if a drought lasts longer than the water in the reservoirs, their special skills in finding alternative sources of water are in demand. It's undoubted that such skills must be maintained, transferred and extended to ensure the survival of the family or community. We - the people in the industrialized countries - in contrast take water for granted. It comes out of the tap. My 1.5-year-old understands that concept. Need water? Open the tap.
But does that mean "No experts needed anymore"?
Certainly not - quite the opposite! The preparation of drinkable water and its distribution, as well as the handling of the sewage, is a complex process chain today. It involves infrastructure specialists, biologists, chemists, process technicians, civil engineers and many more highly skilled persons working together. And given the challenges of the future, it will most probably become even more complex.
The same is true for SAN skills
As long as we don't have a worldwide grid of quantum-entangled, RAM-locality-based servers, I predict that there will still be something like a SAN. There will still be a need for architects, for specialists implementing it, and for sure for well-skilled people troubleshooting it when problems arise. Storage - as well as computing in general - might become a commodity like water, and we definitely won't need everybody to know how it works. But the ones remaining in that area will need to be the real experts. Skilled, trained, experienced and motivated.
You might say: "That's the case with virtually everything!" and you are right. So why not SAN?