One of the ugliest things that can happen in a SAN is a big performance problem introduced by a slow drain device (or slow draining device). Why is it so ugly? Well, if a full fabric or a full data center goes down - due to a fire, for example - that's definitely ugly, too. But such situations can be covered by redundancy (failover to another fabric, to another data center, etc.), because the trigger is very clear. A performance degradation due to a slow drain device, on the other hand, is not so obvious - at least not for most hosts, operators or automatic failover mechanisms. Frames will be dropped randomly, paths fail, but with the next TUR (Test Unit Ready) they seem to work again, just to fail again minutes later. Error recovery hits the performance, and the worst thing: if commonly used resources are affected - like ISLs - the performance of totally unrelated applications (running on different hosts, using different storage) is impaired.
So you have a slow drain device. If you have a Brocade SAN you might have found it using the bottleneckmon, or you noticed frame discards due to timeout on the TX side of a device port. If you have a Cisco SAN you probably used the creditmon or found dropped packets in the appropriate ASICs. Or maybe your SAN support told you where it is. Anyway, let's imagine the culprit of a fabric-wide congestion is already identified. But what now?
The following checklist should help you to think about why a certain device behaves like a slow drain device and what you can do about it. I don't claim this list to be exhaustive and some of the checks may sound obvious, but that's the fate of all checklists :o)
Check the firmware of the device:
- Is this the latest supported HBA firmware?
- Are the drivers / filesets up to date and matching?
- Is there a newer multipath driver out there?
- Check the release notes of all available firmware / driver versions for keywords like "performance", "buffer credits", "credit management" and of course "slow drain" and "slow draining".
- If you found a bugfix in a newer and supported version, testing it is worth a try.
- If you found a bugfix in a newer but unsupported version, get in contact with the support of the connected devices to get it supported or to find out when it will be supported.
Check the configuration:
- Is it configured according to available best practices? (For IBM products, a Redbook is often available.)
- Is the speed setting of the host port lower than that of the storage and the switches? Better have them all at the same line rate.
- Queue depth - would decreasing it to have fewer concurrent I/Os help?
- Is the load balanced over the available paths? Check your multipath policies!
- Check the number of buffers. Can it be modified? (The direction depends on the type of the problem.)
Check the workload and the concept:
- Do you have a device with just too much workload? A virtualized host with too many VMs sharing the same resources? Better separate them.
- Too much workload at the same time? Jobs starting concurrently? Better distribute them over time.
- Multiple traffic types over the same HBA? One VM with tape access sharing a port with another one doing disk access? Sequential I/O and very small frame sizes on the same HBA? Maybe not the best choice.
Check the logs for this device for any incoming physical errors. Of course, error recovery slows down frame processing.
Check the switch port for any physical errors. If you have bit errors on the link, the switch may miss the R_RDY primitives (responsible for increasing the sender's buffer credit counter again after the recipient processed a frame and freed up a buffer).
Use granular zoning (initiator-based zoning, better 1:1 zones) to minimize the impact of RSCNs. (A device that has to check the name server again and again has less time to process frames.)
If all else fails, look for "external" tools and workarounds:
- If the slow drain device is an initiator, does it communicate with too many targets? (Fan-out problem)
- If the slow drain device is a target, is it queried by too many initiators? (Fan-in problem)
- Is it possible to have more HBAs / FC adapters? On other busses maybe?
- Is the device connected as an L-Port but capable of being an F-Port? Configure it as an F-Port, because the credit management of L-Ports tends to be more vulnerable to slow drain device behavior.
- Does the slow drain host get its storage from an SVC or Storwize V7000? Use throttling for this host. Other storage systems may have similar features.
- Brocade features like Traffic Isolation Zones, QoS and Trunking can help to cushion the impact of slow drain devices.
- Have a Brocade fabric with an Adaptive Networking license? Give Ingress Rate Limiting a try.
- Last resort: Use port fencing or an automated script to kick marauding ports out of the SAN.
The list above is just a collection of things I already saw in problem cases. Having said this, it might be updated in the future if I encounter more reasons for slow drain device behavior. Of course I'm very interested in your opinion and more reasons or ways to deal with them!
Performance problems are still the most malicious issues on my list. They come in many flavors and most of them have two things in common: 1) They are hardly ever actual SAN defects and 2) They need to be solved as quickly as possible, because they really have an impact.
If just a switch crashed or an ISL dropped dead or even an ugly firmware bug blocks the communication of an entire fabric, it might ring all alarm bells. But that's something you (hopefully) have your redundancy for. Performance problems on the other hand can have a high impact on your applications across the whole data center without a concerning message in the logs, if your systems are not well prepared for it. Besides the preparation steps I pointed out here, there is a tool in Brocade's FabricOS especially for performance problems: the bottleneck monitor, or in short: the bottleneckmon.
If a performance problem is escalated to the technical support the next thing most probably happening is that the support guy asks you to clear the counters, wait up to three hours while the problem is noticeable, and then gather a supportsave of each switch in both fabrics.
Why 3 hours?
A manual performance analysis is based on certain 32 bit counters in a supportsave. In a device that's able to route I/O of several gigabits per second, 32 bits aren't a huge range for counters and they will eventually wrap if you wait too long. But a wrapped counter is worthless, because you can't tell if and how often it wrapped. So all comparisons would be meaningless.
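To get a feeling for the orders of magnitude, here is a quick back-of-the-envelope calculation in Python (my own illustration - the roughly 800 MB/s for a fully utilized 8G port and the 1 kB average frame size are assumptions):

# How long until a 32-bit counter wraps on a busy 8G port?
bytes_per_s = 800_000_000              # assumption: port running near line rate
frames_per_s = bytes_per_s / 1000      # assumption: ~1 kB average frame size
words_per_s = bytes_per_s / 4          # the word counters count 4-byte words

print(2**32 / frames_per_s / 3600)     # frame counters wrap after roughly 1.5 hours
print(2**32 / words_per_s)             # word counters wrap after roughly 21 seconds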
Besides the wait time, the whole handling of the data collections, including gathering and uploading them to the support, takes precious time. And then the support has to process and analyze them. After all these hours of continuously recurring telephone calls from management and internal and/or external customers, the support guy has hopefully found the cause of your performance problem. And keeping point 1) from my first paragraph in mind, it's most probably not even the fault of a switch*). If he makes you aware of a slow drain device, you would now start to involve the admins and/or support for the particular device.
You definitely need a shortcut!
And this shortcut is the bottleneckmon. It's made to permanently check your SAN for performance problems. Configured correctly, it will pinpoint the cause of performance problems - at least the bigger ones. The bottleneckmon was introduced with FabricOS v6.3x, together with some major limitations. But from v6.4x on it eventually became a must-have by offering two useful features:
Congestion bottleneck detection
This just measures the link utilization. With the Fabric Watch license (pre-loaded on many of the IBM-branded switches and directors) you have been able to do that for a long time already. But the bottleneckmon offers a bit more convenience and puts it into the proper context. The more important thing is:
Latency bottleneck detection
This feature shows you most of the medium to major situations of buffer credit starvation. If a port runs out of buffer credits, it's not allowed to send frames over the fibre. To make a long story short: if you see a latency bottleneck reported against an F-Port, you have most probably found a slow drain device in your SAN. If it's reported against an ISL, there are two possible reasons:
- There could be a slow drain device "down the road" - the slow drain device could be connected to the adjacent switch or to another one connected to it. Credit starvation typically propagates backwards (back pressure) and affects wide areas of the fabric.
- The ISL could have too few buffers. Maybe the link is just too long. Or the average frame size is much smaller than expected. Or QoS is configured on the link but you don't have QoS zones prioritizing your I/O. This could have a huge negative impact! Another reason could be a misconfigured long-distance ISL.
Whatever it is, it is either the reason for your performance problem or at least contributing to it and should definitely be solved. Maybe this article can help you with that then.
With FabricOS v7.0 the bottleneckmon was improved again. While the core policy which detects credit starvation situations was pretty much pre-defined before v7.0, you're now able to configure it down to the minutest detail. We are still testing that out in more detail - for the moment I recommend using the defaults.
So how to use it?
First of all: I highly recommend updating your switches to the latest supported v6.4x code if possible. It's much better there than in v6.3! If you look up bottleneckmon in the command reference, it offers plenty of parameters and sub-commands. But in fact, for most environments and performance problems it's enough to just enable it and activate the alerting:
myswitch:admin> bottleneckmon --enable -alert
That's it. It will generate messages in your switch's error log if a congestion or a latency bottleneck was found. Pretty straightforward. If you are not sure you can check the status with:
myswitch:admin> bottleneckmon --status
And of course there is a show command which can be used with various filter options, but the easiest way is to just wait for the messages in the error log. They will tell you the type of bottleneck and of course the affected port.
And if there are messages now?
Well, there is still the chance that there are situations of buffer credit starvation the default-configured bottleneckmon can't see. However, as you are reading an introduction here, I assume you'll just open a case with IBM support.
You'll Never Walk Alone! :o)
*)Depending on country-specific policies and maintenance contracts a performance analysis as described above could be a charged service in your region.
When Brocade released FabricOS v6.0 in 2007 Quality of Service sounded like a great idea: It allows you to prioritize your traffic flow to the level of certain device pairs. There are 3 levels of priority:
High - Medium - Low
Inter Switch Links (ISLs) are logically partitioned into 8 so called Virtual Channels (VCs). Basically each of them has its own buffer management and the decision which virtual channel a frame should use is based on its destination address. If a particular end-to-end path is blocked or really slow, the impact on the communication over the other VCs is minimal. Thus only a subset of devices should be impaired during a bottleneck situation.
Quality of Service takes this one step further.
QoS-enabled ISLs consist of 16 VCs. There are slightly more buffers associated with a QoS ISL and these buffers are equally distributed over the data VCs. (There are some "reserved" VCs for fabric communication and special purposes.) The number of VCs is what makes the priority work - the most VCs (and therefore the most buffers) are dedicated to the high priority, the fewest to the low one. Medium lies in the middle, obviously. So more important I/O benefits from more resources than the not-so-important one.
Sounds like a great idea!
Theoretically you can configure the traffic flow in terms of buffer credit assignment in your fabric in a very fine-grained way. But that's in fact also the big crux: you have to configure it! That means you actually have to know which host's I/O to which target device should get which priority. Technically you create QoS zones to categorize your connections. Low priority zones start with QOSL, high priority zones start with QOSH. Zones without such a prefix are considered medium priority.
But how to categorize?
That's the tricky part. The company's departments relying on IT (virtually all of them) have to bring their needs into the discussion. Maybe there are already different SLAs for different tiers of storage and an internal cost allocation in place. The I/O prioritization could go along with that, and of course it has to be taken into account to effectively meet the pre-defined SLAs. If you have to start from scratch, it's more a project for weeks and months than a simple configuration. And there is a lot of psychology in it. Besides that, you really have to know how QoS works in detail to design a prioritization concept. For example, if you have 20 high priority zones and 50 with medium priority but only 3 low priority zones, the low ones could even perform better. In the four years since its release I have seen only a couple of customers really attempt to implement it.
In addition you need to buy the Adaptive Networking license!
So why should I care?
If QoS is such a niche feature, why blog about it? Usually a port is configured for QoS when it comes from the factory. You can see it in the output of the command "portcfgshow". A new switch will have QoS in the state "AE", which means auto-enabled - in other words "on". An 8G ISL will be logically partitioned into the 16 VCs as described above and the buffer credits will be assigned to the high, the low and the medium priority VCs. But that does not mean that you can actually benefit from the feature, because you most probably have no QoS zones! And so all your I/O shares only the resources allocated for the medium priority. A huge part of the available buffers is reserved for VCs you cannot use! So as a matter of fact you end up with fewer buffers than without QoS, and in many cases this made the difference between a smoothly running environment and immense performance degradation.
If you don't plan to design a detailed and well-balanced concept for the priorities in your SAN environments, I recommend switching off QoS on the ports. I don't say QoS is bad! In fact, with the Brocade HBAs' ability to integrate QoS even into the host connection - enabling different priorities for virtualized servers - you get a way to better cope with slow drain device behavior. But done wrong, QoS can have a very ugly impact on the SAN's performance!
Better know the features you use well - or they might turn against you...
As this was not clear enough in the text above and I got back a question about that, please be aware: Disabling QoS is disruptive for the link! In most FabricOS versions in combination with most switch models, the link will be taken offline and online again as soon as you disable it. In some combinations you'll get the message that it will turn effective with the next reset of the link. In that case you have to portdisable / portenable the port by yourself.
As this is a recoverable, temporary error, your application most probably won't notice anything, but to be on the safe side you should do it in a controlled manner and - if really necessary in your environment - during times of little traffic or even in a maintenance window. The command to disable it is:
portcfgqos --disable PORTNUMBER
The Storwize V7000 and the SVC (SAN Volume Controller) share the same code base and therefore the same error codes. Many of them indicate a failure condition in this very machine, but there are others just pointing to an external problem source. The error 1370 is one of the second kind. There is not really much information about it in the manuals but in fact it could give you a good understanding about what's going wrong.
As storage virtualization products the SVC and the V7000 - if you use it to virtualize external storage - are actually the hosts for the external storage. Speaking SCSI they are the initiators and the external backend storage arrays are the targets. Usually the initiators monitor their connectivity to the targets and do the error recovery if necessary. And so the SVC and the V7000 focus on monitoring the state of their backend storage and can actually help you to troubleshoot them.
So you have 1370 errors, what now?
They come in two flavors: the event id 010018 (against an mdisk) and the event id 010030 (against a controller - aka storage array). I'll explain the 010030 as it's easier to understand, but understanding it will give you the insight to understand the 010018, too.
If you double-click the 1370 in your event log, you see the details of the error:
You see the reporting node and the controller the error is reported against. But the most important thing is the KCQ. The Sense Key - Code - Qualifier.
Imagine this situation: The SVC is the initiator. It sends an I/O towards the storage device - the target. But the target faces a "noteworthy" condition at that very moment. So it will make the initiator aware of it by sending a so-called "check condition". Curious as it is, the initiator wants to know the details and requests the sense data. This sense data will now be stored in - you already guessed it - a 1370, in the format Key - Code - Qualifier. The last two are often referred to as ASC (Additional Sense Code; the green one) and ASCQ (Additional Sense Code Qualifier; the blue one).
Where's the Rosetta Stone?
This sense data can be translated using the official SCSI reference table by Technical Committee T10 (the council making the SCSI protocol). If you encounter an ASC/ASCQ combination in a 1370 that can't be found in that list, it's most probably a vendor-specific one. In that case the manufacturer of the target device can give you more information about it.
Back to our example. So you see the ASC 29 (the "Code") and the ASCQ 00 (the "Qualifier") here. Looking that up in the list reveals: It's a "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED". This so called "POR" should make you aware that the target was recently either powered on or did a reset. Usually the initiator gets this with the first I/O it does against the target after such an event, to be aware that any open I/O it has against this target is voided and has to be repeated.
Ah, okay. That's it?
No! You see the orange box? This is the time since this sense data was received. The unit is 10 ms, so this number actually tells us that a long time has passed since there really was a POR for this controller.
So why do we have a 1370 today?
The 1370 is more of a container for sense data. The number behind the attributes shows the "slot". So the information visible here is for the first slot, and since such a long time has passed since it occurred, it's meaningless for us now. Let's scroll down a bit:
In the second slot you see what's really going wrong within the external storage device at the moment, because the time value is 0. That means the 1370 was triggered because of it. And it contains a different set of sense data. ASC 0C / ASCQ 00! If you try to look it up in the list, you will find 0C/00, but hey - this cannot be! The combination 0C/00 means "WRITE ERROR", but it's not defined for "Direct Access Block Devices" like storage arrays.
A Dead End?
No, of course not. In this example the storage is a DS4000. Just download the DS4000 Problem Determination Guide and it will provide an ASC/ASCQ table. Here you'll see that 0C 00, together with the Sense Key 06 (the red circle) means "Caching Disabled - Data caching has been disabled due to loss of mirroring capability or low battery capacity."
Running without the cache in the backend storage can lead to severe performance degradation and should definitely be investigated! Without even looking into the backend storage you already know what's going wrong there! No need to involve SVC or V7000 support this time. Just focus on the backend storage and find out why the caching is disabled.
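If you have to translate KCQs more often, a tiny lookup table saves some searching. Here is a minimal Python sketch with just the two combinations from this article - the table is of course far from complete, the sense key 06 (Unit Attention) for the POR entry is my assumption based on the T10 list, and vendor-specific codes like the 0C/00 above always need the vendor's own documentation:

# Minimal KCQ (Sense Key / ASC / ASCQ) lookup - only the two examples from above
KCQ_TABLE = {
    (0x06, 0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x06, 0x0C, 0x00): "DS4000 vendor-specific: caching disabled (mirroring lost or low battery)",
}

def translate_kcq(key, asc, ascq):
    return KCQ_TABLE.get((key, asc, ascq), "unknown - check the T10 list or the vendor docs")

print(translate_kcq(0x06, 0x29, 0x00))   # the POR from the first slot
print(translate_kcq(0x06, 0x0C, 0x00))   # the caching problem from the second slot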
So please don't shoot this messenger, it just tries to help you!
Update - December 2nd 2013
The SCSI Interface Guide for IBM FlashSystem can be found here.
In one of my previous posts I wrote about "Why inter-node traffic across ISLs should be avoided". There is an additional "bad practice" that could lead to performance problems in the host-to-SVC traffic.
Let's imagine a core-edge fabric. A powerful switch (or director) in its center is the core. The SVC and its backend storage subsystems are directly connected to it. Besides that, there are also the ISLs to the edge switches where the hosts are connected. As there is an SVC in the fabric, all host traffic usually goes to the SVC and the SVC is the only host of all other storage subsystems. From time to time I see a cabling like the one below. The devices are connected in a common pattern. For example, SVC ports are always on ports 0, 4, 8, ... or, for a director, on ports 0 and 16 on each card... Something like that. The reason behind that is often to spread the workload over several cards/ASICs to minimize the impact in case of a hardware failure. But there's a risk in doing so.
Index Port Address Media Speed State Proto
0 0 190000 id 8G Online FC F-Port 50:05:07:68:01:40:a2:18
1 1 190100 id 8G Online FC F-Port 20:14:00:a0:b8:11:4f:1e
2 2 190200 id 8G Online FC F-Port 20:16:00:80:e5:17:cc:9e
3 3 190300 id 8G Online FC E-Port 10:00:00:05:1e:0f:75:be "fcsw2_102" (downstream)
4 4 190400 id 8G Online FC F-Port 50:05:07:68:01:40:06:36
5 5 190500 id 8G Online FC F-Port 20:04:00:a0:b8:0f:bf:6f
6 6 190600 id 8G Online FC F-Port 20:16:00:a0:b8:11:37:a2
7 7 190700 id 8G Online FC E-Port 10:00:00:05:1e:34:78:38 "fcsw2_92" (downstream)
8 8 190800 id 8G Online FC F-Port 50:05:07:68:01:40:05:d3
The SAN perspective
In the situation described above, all host traffic is passing the ISLs from the edge switches to the core. ISLs are logically "partitioned" into so-called virtual channels. Of course the ISL is still just one fibre and only one signal is passing it physically at any given time. The virtual channels are just dedicated portions of the buffer credits, and the decision which virtual channel a frame takes - and therefore which portion of the buffer credits it uses - is made by looking at the destination Fibre Channel address.
Technical deep dive
A normal non-QOS ISL has 4 virtual channels for data traffic. For an 8G link each one of them has 5 buffers. They can only work with these 5 buffers and there is no possibility to "borrow" some out of a common pool like for QoS links. With the command "portregshow" you can see the buffer credits assigned to the virtual channels (I added the first line):
VC 0 1 2 3 4 5 6 7
0xe6692400: bbc_trc 4 0 5 5 5 5 1 1
Only VCs 2-5 are used for data traffic. This makes 20 usable buffers, which normally should be enough for a normal multimode connection between two switches in the same room with only a few metres of cable length. Basically the switch uses the last two bits of the second byte of the destination address. That looks like this:
Bits 00 -> frame uses VC 2 (which is the first virtual channel for data)
Bits 01 -> frame uses VC 3
Bits 10 -> frame uses VC 4
Bits 11 -> frame uses VC 5
So where's the problem now?
In our imaginary core-edge fabric where, for example, all SVC ports are connected to ports 0 (bin 00), 4 (bin 100), 8 (bin 1000), 12 (bin 1100), ... all host I/O towards the SVC would use the same virtual channel. As this is the only traffic that passes the ISLs from the edges to the core, only a quarter of the buffers are actually used! 5 buffers are in very heavy use and 15 are idling around, never to be filled. And 5 buffers are pretty few for an edge switch full of hosts that want to speak with the core switch where the SVC is connected. The result would be credit starvation and congestion on a virtual channel level.
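A little sketch of that address-to-VC mapping in Python (my own illustration, not actual switch code; it assumes the area byte of the address equals the port index, as in the switchshow output above) makes the effect obvious:

# Which data VC does traffic towards a given destination use on a normal E-port?
def data_vc(area_byte):
    return 2 + (area_byte & 0b11)          # last two bits of the area byte pick VC 2-5

# SVC ports connected to ports 0, 4, 8, 12, ... of the core switch:
for port in (0, 4, 8, 12, 16):
    print(f"port {port} -> VC {data_vc(port)}")   # every single one ends up on VC 2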
How to solve that?
There are 3 possibilities:
1.) You could re-cable your SAN in a manner that all VCs are used. But besides the risk of physical problems and problems introduced by maintenance actions, the devices have to learn about the new addresses of the SVC ports. For many operating systems this still means reboots or reconfigurations. It could involve a lot of work and a risk of outages.
2.) You could just change the addresses with the portaddress command. This command is usually used in virtual fabric environments, and whether you can use it depends on the installed firmware and the platform in use. While it avoids the physical actions, it still has the disadvantages for the hosts because of the changed addresses.
3.) The best and least disruptive possibility might be to set the ISLs to LE mode. This is the long distance mode dedicated to links under 10 km in length. It will not only put more buffers on the link (40 for user traffic on an 8G link compared with the 20 for a normal 8G E-Port) but will also collapse the 4 user traffic VCs into just one. It looks like this then:
VC 0 1 2 3 4 5 6 7
0xe6602400: bbc_trc 4 0 40 0 0 0 1 1
So all buffers and therefore also all buffer credits will be used by the hosts and nothing idles. There will of course be a short interruption while changing the ISL to LE mode but beside of that nothing changes for the hosts, because all the addresses stay the same. This is clearly the way to go in the situation described above.
Just something strange for the end: Some switches are delivered from manufacturing with an alternative addressing pattern. For example, port 1 of domain 3 won't have the address 030100 then, but something like 030d00. In that case the problem can happen similarly, but on other ports. Using LE mode would solve it in pretty much the same way.
Please keep in mind that the whole article relates to a very special (although very common) SAN layout in an SVC-centered environment. This is clearly not a standard action plan for all performance problems but it could help if you have a customer in a situation like this. For any questions, feel free to contact me.
Additionally, please be aware that this is not an SVC problem by itself but will happen with every central storage connected to a switch using a pattern as described above and being used by hosts connected to another switch over an ISL!
Update from May 9th:
I was made aware that readers of this article queried their vendors, maintenance providers or business partners with the idea to just set all their ISLs to LE-mode regardless if the condition as described above is actually met. Because of that, I would like to state more clearly: Using LE-mode as a general approach for your ISLs can cause severe problems!
If the SVC ports are not connected in a way that only one Virtual Channel would be used, it actually makes sense to have ISLs with more than one VC. Virtual Channels are a good feature to prevent that a latency bottleneck due to back pressure impairs the traffic of all devices using the same ISL. If devices on the edge switches communicate with other devices connected to other ports of the core (or other edges) as well, the impact of using LE-mode would be even more extreme in the case of slow drain devices.
I made some drawings to illustrate this. The first one shows 1 normal ISL between the edge and the core. You can see the 4 VCs used for data traffic. (I left out the other VCs for better visibility):
Here hosts 1 and 2 generate traffic against the SVC (green), host 3 against an additional disk subsystem (purple) and host 4 against a tape drive (orange). Based on the ports these devices are connected to, different VCs are used for that traffic.
If you would use an LE-port instead, it would look like this:
Now all 4 data traffic VCs collapsed to a single one. As long as everything runs smoothly, you won't see an impact.
But if, for example, one of the devices connected to the core is slow draining, the following will most probably happen:
In the picture above the purple disk is a slow drain device. Due to back pressure the whole ISL will be a latency bottleneck, because all data traffic shares the same VC in LE-mode. The back pressure goes further towards the edge switch and all 4 hosts of our example are affected now although only host 3 communicates with the slow drain device!
With a normal E-port it looks like this:
Now only VC4 is affected while VC2, 3 and 5 are running smoothly, because they have their own, unaffected buffer management. Therefore only host 3 will face a performance problem while the hosts 1, 2 and 4 are running fine.
You see: Using LE-mode for the purpose described in my original article does only make sense if these special conditions are really met. In all other cases it can impair the SAN performance tremendously!
I thought I'd never have to write about fillwords. I thought: there will be a phase of some months and then this topic is dead. Strangely enough it's still alive. I still get questions about them, I still see people blaming them and I still see avoidable problems because of changing them.
For every new line rate (now read "Generation" or "Gen"), usually the switch and HBA vendors are the first ones to adopt the new standard and release their products. It was the same for 8Gbps, which came with a new fillword. Fillwords are 4-byte words without a special task. A port sends them whenever it doesn't have to send something else. They're used to maintain the synchronization of the link, and therefore the fillword used up to and including 4Gbps was fittingly called IDLE. FC ports and the CPU of a PC have one thing in common: depending on the workload, you see a lot of IDLE. Therefore it made sense to think about the optimal fillword, and so it was changed for 8Gbps. The first published version was quite like "Let's replace all instances of IDLE with a better one: ARBff". First products were developed, and among them Brocade's 8Gbps switches.
Later it turned out that it would be better to not just replace all IDLEs out of hand, because they were not only used as a fillword, but in the link initialization, too. The standard was updated and then said, "Use ARBff as a fillword, but keep the IDLE for link initialization".
For products released after that point in time the vendors usually implemented the new version of the standard, which was not compatible with the first one. So clients bought new 8Gbps-capable devices, for example DS5000 boxes or SAN Volume Controllers, and failed to get them online. These devices tried to use the standard-compliant word during the critical link initialization phase and when they noticed that the switches sent the wrong ones, the link initialization failed.
I have to admit that most vendors' information policy was very "unlucky" at that time. Everybody blamed everybody else. After some protocol traces it was clear that the problem was the use of ARBff during link initialization. So as a workaround we recommended to configure the switches to use IDLE again (mode 0). Eventually new firmware versions were written and Brocade came up with two new fillword modes - one of them compliant with the standard (mode 2) and another, more dynamic mode 3. The latter tries ARBff in link initialization first (like mode 1) and if that fails, it behaves like mode 2. So mode 3 became the natural choice.
For some time we had a lot of cases for that problem and many people in the broad area of storage got in touch with the term fillword. While the number of problem cases about them decreased, the memory of fillwords stayed active in people's minds. In addition there is a counter called "er_bad_os" for each port. It means "Error: bad ordered set" and increases basically in 2 situations: 1) if such a 4-byte word is corrupted or 2) if the port receives an ordered set it didn't expect. The first situation is a problem, but you get other indications as well ("enc out", "enc in", ...). The second situation can happen, for example, if a running port expects the IDLE fillword (because it was configured to mode 0 as a workaround, as stated above) but receives ARBff. Although the counter increases in the ASIC, there is no impact on a running connection. In fact the Fibre Channel protocol says that each well-encoded ordered set without any other function should be treated the same way as an IDLE. So as long as there is no bit error in them, it doesn't matter what kind of fillword is received - the switch must use it to maintain the synchronization.
However, the myth was already born: blame it on the fillword! For a lot of totally unrelated problems, like performance problems, CRC errors, occasional link resets and even SFP heat issues, SAN admins and even support personnel for the attached devices blamed the fillword. "The fillword is wrong!", "Change the fillword first!", "Look at this rapidly increasing error counter!" - Changing the fillword mode to 3 became the new mantra for every storage problem, however remote. And now it's very similar to bloodletting in the medicine of previous centuries: a sophisticated-sounding theory everybody could agree on and a simple action plan.
But just like bloodletting, it only helps in certain situations and used as a general treatment it does more harm than good.
Changing the fillword mode is disruptive for a link. If you really have a problem with a wrong fillword setting, this is not very concerning, because as stated above, the link initialization would have failed and the device wouldn't be online at that moment anyway. But for all the cases where the port is actually up and running, there will be a new link initialization. All current I/O belonging to this port will be voided. There will be command timeouts. Error recovery needs to take place. Depending on the robustness of the attached device this alone could lead to problems. And as if that weren't enough, I even saw a lot of SAN admins changing the fillword mode for normal E-ports, which is complete nonsense. Believe me, you don't want to disturb your fabric stability by bouncing each and every ISL in your SAN environment within a short time without a solid reason.
And changing running ports to a more compliant fillword is certainly NOT a solid reason.
The sad part is that the perceived problems often improved after this action. But then a simple portdisable/portenable would most probably have had the same effect, too. It's like patients recovering - not because of bloodletting, but despite it.
Conclusion and tl;dr
Don't change the fillword mode on a running port! It's disruptive!
I claim that in 2012 performance problems will keep their place amongst the most frequent and most impacting problems in the SAN. In many of the cases the client's users really notice a performance impact and so the admin calls for support. Other support cases are opened because of performance-related messages like the ones from Brocade's bottleneckmon or Cisco's slowdrain policy for the Port Monitor. Besides that, there are also cases that don't really look like performance problems at first but turn out to occur for the same reasons. "I/O abort" messages in the device log, link resets, messages about frame drops, failing remote copy links, failing backup jobs or - even worse - failing recoveries: these could all be "performance problems in disguise".
When I analyze the data then and find out that a slow drain device or congestion is the real reason for the problem I write my findings down and try to give the client some hints about possible next steps. For example by mentioning my earlier blog article about How to deal with slow drain devices.
Do you know what the mean thing about it is?
Often clients have never heard of slow drain devices before. Longtime storage administrators are confronted with a term that sounds like a support guy made it up to fingerpoint at another vendor's product. Of course I usually explain what it is, what it means for the fabric and for the connected devices. But to be honest, I would be sceptical, too. I would go to the next search engine and query "slow drain device". The first hits are from this blog and from the Brocade community pages, and there are some questions about that topic. Considering the substance of posts in public forums, I would check Brocade's own SAN glossary. Guess what? Not a word about slow drain devices - which is no surprise, as it's from 2008. I would check Wikipedia. Nothing. My fellow blogger Archie Hendryx mentioned that it's missing in the SNIA dictionary, too. And he's right: nothing!
So why is that so?
Why are the terms "HTML" and "export" explained in the dictionary of the Storage Networking Industry Association, but there is not a single appearance of the term "slow drain device" on the complete SNIA website (according to their built-in search function)? Well, I don't know, but of course we can change that. The SNIA dictionary makers are asking for contributions, so if you have a term that has a meaning in the storage industry, feel free to send them a definition for the next release. I thought about doing that as well for some of the SAN performance-related terms I didn't find in the dictionary. Below you'll find some definitions that I wrote. But I'm not infallible and therefore I would like to have an open discussion about them. Let me know what you think about them. Let me know if your understanding of a term (used in the area of SAN performance, of course) differs from mine. Let me know if my wording hurts the ears of native English speakers. Let me know if you have a better definition. Let me know if there are important terms missing. And let me know if you think that a term is not really so generally used or important that it should appear in the SNIA dictionary - side by side with sophisticated terms like Tebibyte :o)
slow drain device - a device that cannot cope with the incoming traffic in a timely manner.
Slow drain devices can't free up their internal frame buffers and therefore don't allow the connected port to regain its buffer credits quickly enough.
congestion - a situation where the workload for a link exceeds its actual usable bandwidth.
Congestion happens due to overutilization or oversubscription.
buffer credit starvation - a situation where a transmitting port runs out of buffer credits and therefore isn't allowed to send frames.
The frames will be stored within the sending device, blocking buffers and eventually have to be dropped if they can't be sent for a certain time (usually 500ms).
back pressure - a knock-on effect that spreads buffer credit starvation into a switched fabric starting from a slow drain device.
Because of this effect a slow drain device can affect apparently unrelated devices.
bottleneck - a link or component that is not able to transport all frames directed to or through it in a timely manner. (e.g. because of buffer credit starvation or congestion)
Bottlenecks increase the latency or even cause frame drops and upper-level error recovery.
Feel free to use the comment feature here or tweet your thoughts with hashtag #SANperfdef. If you add @Zyrober in the tweet, I'll even get a mail :o)
I updated the definitions with an additional sentence. Feel free to comment.
It's the nightmare of every motorist. Your car was just repaired a few days ago and now it stopped running in the middle of nowhere. Or you even crashed, because the brakes just didn't work in the rain. Fake parts are a big problem in the automotive industry. Original-looking parts from dubious sources could even work as expected in normal operations but when the going gets tough, the weak won't get going. So before a fake cambelt wrecks your engine or a fake brake pad costs your life, it might be a good idea to not save on the wrong things.
But a faked SFP?
Like a brake pad, an SFP is somewhat of a consumable. Light is transformed into an electrical signal and vice versa, this produces heat, and the components wear out over time. Some sooner, some later. If you bought the SFPs from IBM for a switch under IBM warranty or maintenance, broken SFPs will be replaced for free. But if you decide to buy an SFP, you'll notice after a quick web search that there are a lot of suppliers out there offering the same SFP for a much smaller price than IBM. And with "the same SFP" I mean they offer the very same IBM part number - for example 45W1216. That's an 8G 10km LW SFP.
Is it really the same?
Of course not - although they claim it to be the same. Their usual explanation is that all these SFPs come from the same manufacturer anyway. SFPs are built using open standards defined by T11 and therefore they should be compatible per se. I can tell from several occasions: that's not true. There is of course more than one SFP manufacturer, and I'm sure each of you knows a handful offhand. In addition: even in times before 8G there were SFPs working much better with certain switches than others.
With the 8G platform, Brocade decided to offer Brocade-branded SFPs and restricted their switches to only support them and refuse others (besides very few exceptions for CWDM SFPs). So Brocade took control over which SFPs can be used and they were able to fine-tune their ASICs to allow better signal handling and transmission. To enforce this, the switch checks the vendor information in the SFP to determine if it's a Brocade-branded one. Cisco does the same for the SFPs in their switches.
Here is where the fake begins...
There are several vendors of devices that can rewrite this SFP-internal information. By spoofing vendor names, OUIs (Organizationally Unique Identifiers) and part numbers they try to circumvent the detection mechanisms on the switch. So independent suppliers buy "generic" bulk SFPs and "rebrand" them to sell them as "IBM compatible" with the same part number. And because IBM officially supports the part number (like announced here), one might assume everything will be fine then.
In fact it's not...
Imagine a migration project. The plan is in place, everything is prepared, the components are bought and onsite, all the necessary people are there in the middle of the night or during a weekend, and the maintenance window begins. And then the ports everything depends on just don't come online - only because someone negligently faked these "cheaper but still compatible" SFPs. I had a case where the same SFPs did work in one 8G switch model but not in another - also 8G - with exactly the same FabricOS.
In the sfpshow output they looked like this:
Identifier: 3 SFP
Connector: 7 LC
Transceiver: 5401001200000000 200,400,800_MB/s SM lw Long_dist
Vendor Name: XXXXXX
Vendor OUI: 00:05:1e
Vendor PN: 57-1000012-01
The supplier did not write "Brocade" into the "Vendor Name" field (I replaced it with Xs) but in the "Vendor OUI" field he inserted the OUI from Brocade. In addition he also faked the "Vendor PN" but even used a wrong one. This one is the PN for a shortwave SFP.
But besides being an ugly showstopper for the migration - driving costs far beyond what could have been saved by buying the cheaper parts - that's not even the worst case. Perfectly faked SFPs might be accepted by the switch, but you never know if they are really running fine. I don't wish anybody to be called at 3am about the crash of half the servers, because an ISL started to toggle. Or to have increasing performance problems, because every now and then a faked SFP "on the edge of the spec" devours a buffer credit by misinterpreting an R_RDY.
Troubleshooting this can be a pain itself. But the money potentially lost on outages will hardly be compensated by the savings from cheaper SFPs!
I got the confirmation from IBM product management, that IBM itself will only deliver Brocade-branded SFPs for its current b-type SAN portfolio.
So if you have non-Brocade-branded SFPs in your 8G or 16G Brocade switches be aware that they are probably not supported and there could be some unplanned night or weekend working hours for you in the future...
I didn't blog for a while now because of an internal project. Like each software development project it's never really over and development will be going on in the next years to bring in new functions, but I hope I have some more time for blogging again now. :o) I also decided to go a bit away from the long blog posts I did in the past to more conveniently readable short posts if possible.
Long distance modes
Brocade has basically 3 long distance modes:
- LE mode - merges all user-data virtual channels and assigns the amount of buffers necessary to cover a 10 km distance based on the full frame size for the given speed. It requires no license.
- LS mode - like LE mode, but is used for distances > 10 km and requires the "Extended Fabric License". You configure it with a fixed distance.
- LD mode - similar to LS mode, but the distance is measured automatically and the buffers are assigned according to the measured distance. You configure it with a "desired distance".
So what's the problem with LD?
If you have two data centers with a distance of 30 km between them and you configure 60 km, the switch will only assign the buffers for the measured 30 km. Increasing the desired distance doesn't change anything.
Wait! Why should I increase it anyway?
As written above, the number of buffers depends on the distance. The switch just calculates the number of buffers from the number of full-sized frames (frames with maximum frame size - usually 2kB) needed to span the distance. But the problem is: in real life the average frame size is actually much smaller than the maximum one.
In the picture above you see a write I/O out of a fibre channel trace. The lines with the rose background are the frames from the host, the ones with the gray background are the responses from the storage. The last column shows the size of the frame. Only the 4 data frames have the full frame size. The other 3 frames have a size far smaller than 2kB. So the average frame size in this example is just 1.2kB. With this average frame size you would need almost double the number of buffers to fill the link compared to the number the switch calculated! And it could be much worse. I ran a report over the full trace, and the average frame size for the transmit and receive traffic was:
Given those numbers, and adding a "little buffer reserve", you would need 3 times as many buffers as the switch would use!
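To put some numbers behind that, here is a rough back-of-the-envelope sketch in Python. It is explicitly not Brocade's exact allocation formula - the ~5 µs/km propagation delay, the 8b/10b encoding and the 2148 bytes on the wire for a full-size frame are my assumptions:

# Credits needed to keep a link busy: round-trip time / serialization time per frame
LIGHT_SPEED_KM_PER_US = 0.2                 # roughly 5 microseconds per km in glass fibre

def credits_needed(distance_km, line_rate_gbaud, avg_frame_bytes):
    serialize_us = avg_frame_bytes * 10 / (line_rate_gbaud * 1000)   # 8b/10b: 10 bits per byte
    round_trip_us = 2 * distance_km / LIGHT_SPEED_KM_PER_US
    return round(round_trip_us / serialize_us)

print(credits_needed(10, 8.5, 2148))        # full-size frames, 10 km at 8G: ~40 credits
print(credits_needed(10, 8.5, 700))         # ~700 byte average frames: ~120 credits, about 3x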
Okay so let's give it more buffers!
Yes, for LS mode this would be exactly the action plan. But remember: for LD mode, the switch just uses the measured distance. The desired distance is only used as an additional maximum. So if you have 30 km and configure 20 km, it will only assign the buffers for 20 km. If you configure 50 km, it will only assign the buffers for 30 km. So my general recommendation is:
Use LS instead of LD!
LS mode gives you the full control. And use it with enough buffers by configuring a multiple of the physical distance. 3x is a good practice but you can increase it even more if there are buffers left. You can always check the available buffers with the command "portbuffershow".
Don't leave those lazy buffers unassigned but use them to fill your links!
I was asked where to look in a switch to find the average frame size for a port. The safest way would be to use an external monitoring tool like VirtualWisdom or a tracer as described in my LD mode article, but if you don't own something like that, you can get a good guess from the switches themselves. You just have to calculate it from the number of frames and the number of bytes transferred.
For Cisco it's easy. Just look into the "show interface" output for the specific port and you'll find both numbers in the statistics section for each interface:
1887012 frames input, 1300631486 bytes
542470 frames output, 482780325 bytes
So we can just calculate the average frame sizes for both directions:
1300631486 bytes / 1887012 frames = 689 bytes per frame
482780325 bytes / 542470 frames = 890 bytes per frame
For Brocade switches you can get the information out of the portstatsshow command:
stat_wtx 35481072 4-byte words transmitted
stat_wrx 70173758 4-byte words received
stat_ftx 1111087 Frames transmitted
stat_frx 1177665 Frames received
Here we don't have the plain bytes but 4-byte words. Don't worry - fillwords don't count into this number, so it's still valid for our calculation. We just have to multiply it by four to use it:
(35481072 * 4) bytes / 1111087 frames = 128 bytes per frame
(70173758 * 4) bytes / 1177665 frames = 238 bytes per frame
It's really that easy?
Basically yes. With this average frame size you can find out the multiplier for the buffer credits settings. So if you have an average frame size of 520 and a link of 30 km, just calculate:
2112 (the max frame size) / 520 = 4
So you would set up the link for 120 km instead of 30 km to reserve a sufficient amount of buffers. That's it.
A last catch
If you read my article about bottleneckmon you probably already know that we work with 32 bit counters here. While they cover a few hours for the frames, they wrap much quicker for the 4-byte words. So to be able to calculate an average frame size over several hours or days, 32 bit counters are not enough. Actually there are 64 bit counters for these values in the switches - although they are not part of a supportsave. The command portstats64show provides them. The first thing to keep in mind: while in the latest FabOS versions a statsclear resets these counters as well, in older versions you had to reset them with portstatsclear.
The 64 bit counters are actually two 32 bit counters, and the lower one ("bottom_int") is the 32 bit counter we used all the time in portstatsshow. But each time it wraps, it increases the upper one ("top_int") by 1. So after a while you might see a portstats64show output like this:
stat64_wtx 0 top_int : 4-byte words transmitted
2308091032 bottom_int : 4-byte words transmitted
stat64_wrx 39 top_int : 4-byte words received
1398223743 bottom_int : 4-byte words received
stat64_ftx 0 top_int : Frames transmitted
9567522 bottom_int : Frames transmitted
stat64_frx 0 top_int : Frames received
745125912 bottom_int : Frames received
For the received frames it's then:
(2^32 * 39 + 1398223743) * 4 bytes / 745125912 frames = 907 bytes per frame.
Much manual computing, hmm?
Of course you could write a script for that or prepare a spreadsheet but my recommendation is still to start with a multiplier of 3 for normal open systems traffic and check with the command portbuffershow how many buffers are still available. And if you still have some, use them - but keep them in mind if you connect additional long distance ISLs or devices you want to give additional buffers as well.
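For the record, such a script could look like the following minimal Python sketch, fed by hand with the top_int / bottom_int pairs from portstats64show and using the 2112-byte maximum frame size from the multiplier example above:

def avg_frame_size(words_top, words_bottom, frames_top, frames_bottom):
    words = (words_top << 32) + words_bottom      # reassemble the 64-bit word counter
    frames = (frames_top << 32) + frames_bottom   # reassemble the 64-bit frame counter
    return words * 4 / frames                     # 4 bytes per word

# Receive direction from the portstats64show output above:
size = avg_frame_size(39, 1398223743, 0, 745125912)
print(round(size))                                # ~907 bytes per frame
print(round(2112 / size))                         # distance multiplier for LS mode: ~2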
Update Nov. 2nd 2012:
I was made aware that there is an easier and much more convenient way to use portstats64show: Just use the -long option.
pfe_ODD_B40_25:root> portstats64show 26
stat64_wtx 7 top_int : 4-byte words transmitted
485794041 bottom_int : 4-byte words transmitted
stat64_wrx 13 top_int : 4-byte words received
2521709207 bottom_int : 4-byte words received
pfe_ODD_B40_25:root> portstats64show 26 -long
stat64_wtx 30557972957 4-byte words transmitted
stat64_wrx 58371265974 4-byte words received
Much better, isn't it? Thanks to Martin Lonkwitz!
There are some goodies in FOS 7.0 that are not announced big-time. Goodies especially for us troubleshooters. There are regular but not too frequent so-called RAS meetings, where we have the possibility to wish for new RAS features - wishes born out of real problem cases. Some of the wishes we had were implemented in FOS 7.0 (besides the Frame Log I already described in a previous post).
Time-out discards in porterrshow
You probably noticed that I have a hobbyhorse when it comes to troubleshooting in the SAN: performance problems. Medium to major SAN-performance problems usually go along with frame drops in the fabric. If a frame is kept in a port's buffer for 500ms, because it can't be delivered in time, it will be dropped. So these drops would be a good indicator for a performance problem. There is a counter in portstatsshow for each port (depending on code version and platform) named er_tx_c3_timeout, which shows how often the ASIC connected to a specific port had to drop a frame that was intended to be sent to this port. It means: This guy was busy X times and I had to drop a frame for him.
But who looks into portstatsshow anyway? At least for monitoring? In that area the porterrshow command is way more popular, because it provides a single table for all FC ports showing the most important error counters. Unfortunately it had only one cumulative counter for all reasons of frame discards - and there are a lot more besides those time-outs. But now there are two additional counters in this table: c3-timeout tx and c3-timeout rx. Of the two, the tx counter is the important one, as described above. The rx counter just gives you an idea where the dropped frames came from.
So: just focus on the TX! If it counts up, get some ideas how to treat it here.
The firmware history
Just last week I had a fiddly case about firmware update problems again. There are restrictions on the version you can update to based on the current one. If you don't observe the rules, things could mess up. And they could mess up in a way you don't see straightaway. But then suddenly, after some months and maybe another firmware update, the switch runs into a critical situation. Or it has problems with exactly that new firmware update. Some of these problems can render a CP card useless, which is ugly because from a plain hardware point of view nothing is broken. But the card has to be replaced in the end. Sigh.
To make a long story short: Wouldn't it be better to actually know the versions the switch was running on in the past? And that's the duty of the firmware history:
switch:admin> firmwareshow --history
Firmware version history
Sno Date & Time Switch Name Slot PID FOS Version
1 Fri Feb 18 12:58:06 2011 CDCX16 7 1556 Fabos Version v7.0.0d
2 Wed Feb 16 07:27:38 2011 CDCX16 7 1560 Fabos Version v7.0.0a
(example borrowed from the CLI guide)
No access - No problem
There is a mistake almost everybody in the world of Brocade SAN administration makes (hopefully only) once: Trying to merge a new switch into an existing fabric and fail with a segmented ISL and a "zone conflict". Then the most probable reason is that the new switch's default zoning (defzone) is set to "no access".
This feature was introduced a while ago to make Brocade switches a little safer. Earlier, each port was able to see every other port as long as there was no effective zoning on the switch. With "no access" enabled, all traffic between each unzoned pair of devices is blocked if there is no zone including them both. The drawback of "no access" is its technical implementation, though. As soon as it was enabled, a hidden zone was created and its pure existence blocked the traffic for all unzoned devices. And so, without any indication, the switch did end up with a zone.
But entre nous: no sane person accepts this without raising a few eyebrows. With FOS 7.0 this (mis-)behavior is gone. The new switch has a "no access" setting and wants to merge into the fabric? Fine. You don't have to care, the firmware cares for you!
Thanks for the little helpers Brocade - and I hope you stay open for new ideas :o)
Sometimes we notice that an ISL is actually a bottleneck in a fabric. Not a congestion bottleneck, where the throughput demand is just too high for the ISL's bandwidth - that one could be solved by putting another cable between the two switches. But if you have a latency bottleneck, your ISL won't be running at the maximum of its bandwidth. The contrary is the case: it lacks the buffer credits to ensure a proper utilization. If you see a latency bottleneck on an ISL, it's often back pressure from a slow drain device attached to the adjacent switch. But every now and then I get a case where it's just the ISL. Sometimes in one direction, sometimes in both. Even with lengths where you wouldn't think about using long distance settings at all.
But in the past we did exactly that!
When we encountered a situation like that, the first step was always to get rid of everything that reduces buffer credits for the real traffic flows, like an active QoS setting without having QoS zones. If the problem was still there, the only way to give it more buffers was to configure a long distance mode. We solved performance bottlenecks on ISLs by setting up a, let's say, 50 m ISL in a 10 km long distance mode (LE). I described this 2 years ago in the article How to NOT connect an SVC in a core-edge Brocade fabric. While this indeed gives you more buffers, it comes with a drawback.
Long distance and Virtual Channels
On normal ISLs we have Virtual Channels. They work in a way that the buffer credit management of the ISL is logically partitioned into 8 channels. When we talk about normal Class 3 open systems traffic it's used this way:
VC 2: Class 3 data
VC 3: Class 3 data
VC 4: Class 3 data
VC 5: Class 3 data
VC 0 is used for inter-switch communication, for example when a new zoning configuration is distributed to all switches. The VCs 6 and 7 are not really of interest most of the time. We have to focus on VCs 2, 3, 4, and 5. (Mind the Oxford comma!) If you have a slow drain device that is reached via Virtual Channel 2 in your fabric, then at least the traffic on the other three data VCs is unaffected. With a long distance mode like LE you lose that advantage.
Buffer distribution on Virtual Channels on a normal ISL:
Buffer distribution on Virtual Channels on a LE configured ISL:
While you have more buffers in total now, only the first data VC has them assigned. There is no partition of data traffic anymore and the result is the risk of Head of Line Blocking (HoLB). A latency bottleneck (for example due to back pressure from a slow drain device) will always impact ALL the user data going over that ISL! That's a high price for those additional buffers.
With FabricOS v7.2x Brocade introduced a new command: portcfgeportcredits.
It allows you to assign a freely configurable number of credits between 5 and 40 for that ISL. You might ask:
But LE mode gives me 80 on 16Gbps!?!
Yes, but look at the distribution:
Not the whole data part of the link has to share the 40 buffers. Each data VC gets its own 40 buffers and they are still handled independently! No Head of Line Blocking! And remember: this is not meant for long distance connections and it still comes for free! It works on 8G switches, too, as long as they are running at least v7.2x.
To give 40 buffers to each data VC on an ISL at port 1 you would enter:
portcfgeportcredits --enable 1 40
With the --disable parameter you switch back to normal mode and with --show you can see the current configuration of a port.
And please keep an eye on the number of remaining buffers in portbuffershow :o)
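To verify the change, something along these lines should do (port 1 again; the exact output format differs between FOS releases, so take this just as the idea):
portcfgeportcredits --show 1
portbuffershow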
So from now on, if you just need some more buffers on your ISLs to keep everything running smoothly: portcfgeportcredits is the way to go.
Many of you (at least many of the few really reading this stuff) may already know what CRC is. But I think it doesn't hurt to have a short recap. CRC means Cyclic Redundancy Check and can be used as an error detection technique. Basically it calculates a kind of hash value that tends to be very different if you change one or more bits in the original data. Besides that, it's quite easy to implement. I once wrote a CRC algorithm in assembler (but for the Intel 8008) during my studies and it was a nice exercise in optimization.
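Just to get a feeling for how drastically the check value reacts to a single flipped bit, you can play with cksum on any Unix-like host (it uses a CRC-32 variant; 'D' and 'd' differ in exactly one bit in ASCII):
printf 'SOME FRAME PAYLOAD' | cksum
printf 'SOME FRAME PAYLOAd' | cksum
The two 32-bit values will look completely unrelated, although only one bit of the input changed.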
What has that got to do with SAN?
In Fibre Channel we calculate a CRC value for each frame and store it in the 4 bytes right before the end-of-frame delimiter (EOF). The recipient reads the frame bit by bit and calculates the CRC value itself along the way. Reaching the end of the frame, it knows whether the CRC value stored there matches the content of the frame. If it does not, there was at least one bit error, the frame is considered corrupted and can be dropped. Now if the recipient is a switch, what happens next depends on which frame forwarding method is used:
Store-and-forward: The switch reads the whole frame into one of its ingress ("incoming") buffers and checks the CRC value. If the frame is corrupted, the switch drops it. It's up to the destination device to recognize that a frame is missing, and at least the initiator will track the open exchange and start error recovery as soon as time-out values are reached. Many of the Cisco MDS 9000 switches work this way. It ensures that the network is not stressed with frames that are corrupted anyway, but it comes with a higher latency. From a troubleshooting point of view, the link connected to the port reporting CRC errors is most probably the faulty one.
Cut-through: To decrease this latency, the switch can read in just the destination address, and as soon as that one is confirmed to be zoned with the source connected to the F-port (a really quick look into the so-called CAM table stored within the ASIC), the frame goes directly on its way towards the destination. So if everything works fine - enough buffer credits are available - the frame's header is already on the next link before the switch has even read the CRC value. The frame will travel the whole path to the destination device even though it's corrupted, and all switches it passes will recognize that it's corrupted. Brocade switches work this way. As soon as the corrupted frame reaches the destination, it will be dropped.
Regardless of which method is used, the CRC remains just an error detection mechanism - most probably the whole exchange has to be aborted and repeated anyway.
So how to troubleshoot CRC errors on Brocade switches then?
If you only had a counter for CRC errors, you would be in trouble now. Because if all switches along the path increase their CRC error counter for this frame, how would you know which link is really broken? If you have multiple broken links in a huge SAN, this could turn ugly. But there are two additional counters for you (you'll find all of them in the porterrshow output):
- enc in - The data on the link is additionally encoded in a way that bit errors can be detected. And because a frame is decoded when it's read from the fiber and encoded again before it's sent out to the next fiber, the enc in counter (encoding errors inside frames) will only increase on the port that is connected to the faulty link.
- crc g_eof - Although a corrupted frame will be cut through as explained above, there is one thing the switch can do when it encounters a mismatch between the calculated CRC value and the one stored in the frame: it replaces the EOF with another 4 bytes meaning something like "this is the end of the frame, but the frame was recognized as corrupted". The crc g_eof counter ("CRC error with good EOF") basically means "the CRC value was wrong, but nobody noticed it before - therefore the frame still had a good EOF". So if this counter increases for a particular link, that link is most probably the faulty one.
        frames       enc   crc   crc    too   too   bad   enc   disc  link  loss  loss  frjt  fbsy
     tx      rx      in    err   g_eof  shrt  long  eof   out   c3    fail  sync  sig
1:   1.5g    1.8g    13    12    12     0     0     0     1.1m  0     2     650   2     0     0
2:   1.3g    1.4g    0     101   0      0     0     0     0     0     0     0     0     0     0
3:   1.9g    2.9g    82    15    0      0     3     12    847   0     0     0     0     0     0
Port 1 shows a link with classical bit errors. You see crc err and also enc in errors, and along with them crc g_eof increases. Everything as expected. Just go ahead and check / clean / replace the cable and/or SFPs. There are some tests, like "porttest" and "spinfab", that can help to determine which component is broken.
Port 2 is a typical example of an ISL with forwarded CRC errors. This ISL itself is error-free. It just transported some previously corrupted frames (crc err but no enc in) which were already "tagged" as corrupted, hence crc g_eof does not increase.
Port 3 is a bit tricky now. If you just rely on crc g_eof, it seems to be a victim of forwarded CRC errors, too. But that's not the case. Actually the frames were corrupted in a way that the end of the frame was not detected properly, so too long and bad eof increase instead. Best practice: stick with the enc in counter. It still shows that this link indeed generates errors.
Hold on, Help is on the way!
Now with 16G FC as state of the art, things have changed a bit. It uses a new encoding method and it comes with a forward error correction (FEC) feature. Brocade provides this with FabricOS v7.0x on 16G links. It is able to correct up to 11 bit errors per 2112-bit block. FEC is not really highlighted or specially advertised in their courses and release notes, but in my opinion this thing is a game changer! Eleven bit errors, corrected on the fly! Based on the ratio between enc in and crc err we have seen so far - which basically shows how many bit errors you have per frame on average - I expect this to solve well over 90% of the physical problems we have in SANs today. Without the end-device-driven error recovery, which takes ages in Fibre Channel terms. Fewer aborts, fewer time-outs, fewer slow drain devices caused by physical problems! If this works as intended, SANs will reach a new level of reliability.
So let's see how this turns out in the future. It might be a bright one! :o)
Two additional topics for my previous post came to my mind, and I doubt that they will be the last ones :o)
Have a proper SAN management infrastructure
For most of you it's self-evident to have a proper SAN management infrastructure, but from time to time I see environments where this is not the case. In some it's explained with security policies ("Wait - you are not allowed to have your switches in a LAN? And the USB port of your PC is sealed? You have no internet access? No, I don't think that you should send the supportshow by fax..."), sometimes it's just economizing on the wrong end. And sometimes there is simply no overall plan for SAN management. So I think at least the following things should be in place to enable timely support:
- A management LAN with enough free ports to allow integration of support-related devices. For example a Fibre Channel tracer.
- A host in the management LAN which is accessible from your desk (e.g. via VNC or MS RDP) and has access to the management interfaces of all SAN devices. This host should at least boot from an internal disk rather than out of the SAN.
- A good ssh and telnet tool should be installed which allows you to log the printable output of a session into a text file. I personally like PuTTY.
- A TFTP and an FTP server on the host mentioned above. They can be used for supportsaves, config backups, firmware updates etc. They should always be running and, where possible, the devices should be pre-configured to use them (e.g. with supportftp on Brocade switches).
- If your security policy allows it, it's helpful to have Wireshark installed on it, which can be used for "fcanalyzer" traces on Cisco switches or to trace the Ethernet if you have management connection problems with your SAN products.
- The internet connection needs enough upload bandwidth. Fibre Channel traces can be several gigabytes in size. When time matters, undersized internet connections are a [insert politically correct synonym for PITA here :o) ]
- Callhome and remote support connections where applicable. Callhome can save you a lot of time in problem situations. No need to call support and open a case manually - the support will call you. And most SAN devices will submit enough information about the error to give the support member at least an idea where to start and which steps to take first. So in some situations a callhome triggers troubleshooting before your users even notice a problem. In addition, some machines (like the DS8000) allow the support to dial in and gather the support data directly - and only the support data. Don't worry, your user data is safe!
- Have all passwords at hand. This includes the root passwords as some troubleshooting actions can only be done with a root user.
- Have all cables and at least one loopback plug at hand. With cables I mean at least: one serial cable, one null-modem cable, one ethernet patch cable and one ethernet crossover cable (not all devices have "auto-negotiating" GigE interfaces)... better more. And of course a good stock of FC cables should be onsite as well.
- The NTP servers as mentioned in my previous blog post.
Monitoring, counter resets and automatic DC
Besides any SAN monitoring you hopefully do already (Cisco Fabric Manager / Brocade DCFM / Network Advisor / Fabric Watch / SNMP traps / syslog server / etc.) there is one thing in addition: automatic data collections based on cleared counters. Finding physical problems on links, frame corruption on SAN director backlinks, slow drain devices or toggling ports - for all these problems it helps a lot if you can 1. do problem determination based on counters cleared on a regular basis and 2. look back in time to see exactly when it started and maybe how the problem "evolved" over time.
What you need is some scripting skills and a host in the management LAN (with an FTP server) to run the scripts from, as mentioned above. A good practice is to look for a suitable time slot - better not during workload peaks - and set up a timed script (e.g. a cron job, see the small sketch below) that does the following:
- Gather data collections of all switches - use "supportsave" for Brocade switches and for Cisco switches log the output of a "show tech-support details" into a text file.
- Reset the counters - use both "slotstatsclear" and "statsclear" for Brocade switches, and for Cisco switches run both "clear counters interface all" and "debug system internal clear-counters all". The debug command is a hidden one, so please type it in completely as auto-completion won't work. The supportsave is already compressed, but for the Cisco data collection it might be a good idea to compress it with the tool of your choice afterwards.
Additional hint: Use proper names for the Cisco data collections. They should at least contain the switch name, the date and the time!
Depending on the disk space and the number of the switches, it may be good to delete old data collections after a while. For example you could keep one full week of data collections and for older ones only keep one per week as a reference.
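Pieced together, such a cron job could look roughly like this. A minimal sketch only: the switch names are placeholders, key-based SSH logins for the admin users are assumed, supportftp is assumed to be pre-configured on the Brocade switches, and depending on your NX-OS release you may need an expect script instead of plain ssh for the Cisco part.

#!/bin/sh
# Nightly SAN data collection and counter reset - minimal sketch
STAMP=$(date +%Y%m%d-%H%M)
COLLECTDIR=/var/sancollect        # local directory on the management host

# Brocade switches: trigger a supportsave (sent via the pre-configured
# supportftp settings), then clear the ASIC and port counters
for SW in brcd_core1 brcd_edge1 brcd_edge2; do
    ssh admin@$SW "supportsave -n"
    ssh admin@$SW "slotstatsclear"
    ssh admin@$SW "statsclear"
done

# Cisco MDS switches: log the tech-support output into a text file
# (switch name, date and time in the file name), then clear the counters
for SW in mds_core1 mds_edge1; do
    ssh admin@$SW "show tech-support details" > $COLLECTDIR/${SW}_${STAMP}_techsupport.txt
    gzip $COLLECTDIR/${SW}_${STAMP}_techsupport.txt
    ssh admin@$SW "clear counters interface all"
    ssh admin@$SW "debug system internal clear-counters all"
done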
If you have a good idea in addition how to be best prepared for the next problem case, please let me know. :o)
A slow drain device often has a huge impact on the performance of many other devices in a SAN environment. That happens because it blocks resources in the fabric which other devices use as well. The prime example of such a resource is an ISL, particularly the Virtual Channel(s) within it that are used to reach the slow drain device. But as soon as you have an appliance in the SAN, this can turn into such a blocked resource as well.
Disclaimer: There are several definitions and types of appliances. Within this article an appliance is a device "in the middle" between the hosts and the storages with a specific task, such as a compression, encryption, virtualization or deduplication appliance. While I had the SAN Volume Controller (SVC) in mind when I wrote this, it applies to many other products matching this definition. The common thing is that the performance they can provide depends to some degree on the performance of their destination devices.
Fortunately, many of the fabrics I have seen in recent years were designed using a core-edge approach. If a device is in the communication path of many of the devices in a SAN, it's best practice to attach it directly to the core. But a slow drain device can still block it. This is how it happens:
In this sketch the appliance sends data towards a slow drain device. The slow drain device is not able to process the incoming frames quickly enough - they pile up in its HBA's ingress buffers (1). As the appliance is still sending frames but the edge switch cannot forward them to the slow drain device, they also pile up in the ingress buffer of the ISL port of the edge switch (2). This could already impair the performance of the other hosts connected to the same edge switch as the slow drain device - if the frames towards them use the same VC. Some microseconds later the same might happen to the frames from the appliance entering the core (3). They pile up there as well, and as soon as that happens, this so-called back pressure reaches the appliance itself. As there are no VCs on the F-to-N-port connection used to attach the appliance to the core, the chance is high that the appliance cannot send any frames out into the SAN anymore - no matter to which destination (4).
Well, that means you just turned your appliance into a slow drain device itself! The performance of the whole environment is heavily impaired now:
In step (5) the frames from the other hosts towards the appliance pile up in the core as well, and then the back pressure spreads further to the hosts connected to the edge switches (6).
Worst case, hmm?
After the ASIC hold time is reached (usually 500 ms) the switches will begin to drop frames to free up buffers again. But as all switches have the same ASIC hold time, you'll end up in a situation where the edge switch reaches these 500 ms first and the core switch starts dropping frames likewise, before the buffer credit replenishment information (VC_RDY) from the edge switch arrives. So not only the frames of the communication with the initial slow drain device will be dropped, but most of the others along the path as well. And as the appliance itself has turned into a slow drain device, the same might happen to the frames that piled up because of that, too.
So what to do against it?
The first thing is: give the F-ports of the appliance as many buffers as possible. Priority 1 is that the appliance is able to send its frames out into the fabric, so the chance is higher that once the frames of the open I/Os against the slow drain device are out there, some buffer credits are still left to send frames to other devices. For clustered appliances like the SVC it's even more important, because they use these ports for their cluster-internal communication as well. Blocked ports could then result in cluster segmentation (SVC: single nodes rebooting due to "Lease expiry"). To assign more buffers to the switch port (= more buffer credits for the port of the appliance), use
portcfgfportbuffers --enable [slot/]port buffers
Update: Please keep in mind that adding more buffers to an F-port is of course disruptive for the link!
To check how many buffers are available, you can use portbuffershow.
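A purely hypothetical example - slot 2, port 13 and 40 buffers are made-up values, so check portbuffershow first to see what the port group can actually spare:
portcfgfportbuffers --enable 2/13 40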
But in many cases this is not enough. Some time ago Brocade released the Fabric Resiliency Best Practices paper with some good advice. In my opinion every SAN admin with Brocade gear should have read it. It recommends:
- Use Fabric Watch to get alarms for frame timeouts. (Erwin von Londen wrote a good article about that.)
- Use Port Fencing to isolate slow drain devices. (Read Erwin's post about that, too.)
- Configure and use the Edge Hold Time.
- Configure bottleneckmon to get alarms for latency and congestion bottlenecks.
While Fabric Watch is used more and more, and - especially in the FICON world, but also for open systems - I see some of our customers using port fencing, I hardly see anyone utilizing the Edge Hold Time feature. For a scenario as described above it could improve the situation for the appliance and the other hosts dramatically. It can be set to any value between 100 ms and 500 ms and was introduced in FOS v6.3.1b. So if you expect hosts connected to an edge switch to behave like slow drain devices in certain situations, in my opinion the Edge Hold Time of that switch should be set as low as possible. Of course it always depends on your environment and how likely it is to be impaired by a slow drain device, but 100 ms is a long time in a SAN. If you also have some legacy devices connected to these edge switches, check whether a decreased hold time could be a problem for them.
It can be enabled and configured using the "configure" command, where you'll find it among the fabric parameters. Note that not all options of "configure" are available while the switch is enabled (to change those you would have to disable the switch with "switchDisable" first):
Fabric parameters (yes, y, no, n): [no] yes
Configure edge hold time (yes, y, no, n): [yes]
Edge hold time: (100..500) 
You don't need to disable the switch to change the Edge Hold Time and as one of the fabric parameters it will be included in a configupload.
As it seems to be used very seldom in the field I would like to get some feedback if you actually used it. Please give me a hint if and in which situation it helped you. Thanks!
But don't forget: The most important thing is to get rid of the slow draining behavior!